Course title | Course length | Schedule | Start date | Tuition | Instructor | Delivery mode | Registration status | Register | First session video | |
---|---|---|---|---|---|---|---|---|---|---|
Applied Big Data Fundamentals | 15 sessions, 45 hours | Sundays 17:30 to 20:30; Tuesdays 17:30 to 20:30 | Sunday, 2 Dey 1403 (22 December 2024) | 6,700,000 Toman | Eng. Hassan Ahmadkhani | Online | - |
Distributed ingestion, storage and processing with the modern data stack
Course introduction and objectives:
This course covers the fundamental concepts of big data processing and management, along with modern tools and techniques for collecting, preparing, cleaning, analyzing and managing big data.
The course addresses foundational topics in distributed tools and methods, to meet the requirements of Data Engineer, Data Scientist and Analytics Engineer roles. It also covers the big data fundamentals required for Platform Engineer and Data Administration roles.
By the end of the course, students are expected to understand big data and its challenges, to have a thorough grasp of the core techniques and tools for collecting, processing and storing big data, and to be able to use those tools for data engineering, data analysis and data science work.
A project scaled to the requirements and challenges of production environments will be defined from the topics in the syllabus, and carrying it out is part of the course.
Course content at a glance:
Big Data, distributed storage and distributed processing concepts – 1.5 hours
Data lake, Lakehouse and Streamhouse concepts and challenges – 1.5 hours
Distributed storage with Hadoop HDFS – 5 hours
Distributed storage with MinIO – 6 hours
Distributed storage and processing with ClickHouse – 7 hours
Data ingestion with Apache NiFi – 5 hours
Distributed processing on data lake and lakehouse with Trino – 5 hours
Distributed processing on data lake and lakehouse with Spark and Spark SQL – 8 hours
Kafka and streaming fundamentals – 6 hours
Course content, details:
Big Data, distributed storage and distributed processing concepts – 1.5 hours
Big data definition
Distributed storage definition and challenges
Distributed processing definition and challenges
NoSQL databases focus area
Big data platforms focus area
Big data in the AI era
What is the modern data stack?
How to modernize our data platform and keep it up to date?
Big data processing at scale: lessons learned
Data lake, Lakehouse and Streamhouse concepts and challenges – 1.5 hours
Data lake definition and limitations
Data lakehouse definition and features
Streamhouse definition, features and challenges
Data pipelines and modern data pipeline features
Distributed storage with Hadoop HDFS – 5 hours
Apache Hadoop architecture and components
HDFS cluster deployment
Working with Hadoop HDFS
Small files issue in HDFS
HDFS erasure coding
HDFS balancing, HDFS federation, quota, rack awareness, …
HDFS tips and tricks
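The small-files and erasure-coding topics in this module both come down to simple arithmetic, which a rough sketch can illustrate. The helper names below are hypothetical, and the figures are assumptions based on common HDFS defaults (128 MiB blocks, 3x replication) and the built-in RS-6-3 erasure coding policy; a real cluster may be configured differently.

```python
import math

MIB = 1024 * 1024

def block_count(file_bytes, block_bytes=128 * MIB):
    """Blocks needed for one file; even a tiny file takes one block entry."""
    return max(1, math.ceil(file_bytes / block_bytes))

def replicated_bytes(logical_bytes, replication=3):
    """Raw disk usage under plain replication (assumed factor: 3)."""
    return logical_bytes * replication

def erasure_coded_bytes(logical_bytes, data_units=6, parity_units=3):
    """Raw disk usage under Reed-Solomon style EC (data plus parity units)."""
    return logical_bytes * (data_units + parity_units) / data_units

one_gib = 1024 * MIB

# The same 1 GiB of data stored as one large file or as 1024 files of 1 MiB.
blocks_one_file = block_count(one_gib)
blocks_small_files = sum(block_count(MIB) for _ in range(1024))

print(blocks_one_file, blocks_small_files)
print(replicated_bytes(one_gib) / one_gib, erasure_coded_bytes(one_gib) / one_gib)
```

Under these assumptions the small-file layout needs 1024 block entries instead of 8 for the same data, and RS-6-3 lowers raw storage from 3x to 1.5x of the logical size.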
Distributed storage with MinIO – 6 hours
Object storage vs block storage vs file storage
Object storage platform use cases
AWS S3-compatible object storage
MinIO architecture and components
MinIO vs S3 vs Apache Ozone vs Ceph
MinIO cluster deployment
Working with MinIO
MinIO Command Line
Lifecycle Management
Object Retention
Replication and erasure coding (EC)
Buckets and bucket notifications
Users and quotas
Distributed storage and processing with ClickHouse – 7 hours
Distributed OLAP engines
Pinot vs Druid vs ClickHouse
ClickHouse cluster components
ClickHouse cluster deployment
Data modeling: features and limitations
Working with ClickHouse and ClickHouse SQL review
Data types
Sharding and replication strategy in ClickHouse
Table engines
User management and roles
Partitioning and ordering tips and tricks
ClickHouse best practices
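As a rough illustration of the sharding topic in this module: for a Distributed table, ClickHouse documents that a row's shard is chosen by taking the sharding expression modulo the total shard weight and finding the weight interval that the remainder falls into. The sketch below mimics that rule in plain Python. The `pick_shard` helper and the example weights are hypothetical, and nothing here talks to a real cluster.

```python
from collections import Counter

def pick_shard(sharding_value, weights):
    """Return the index of the shard whose weight interval holds the remainder."""
    remainder = sharding_value % sum(weights)
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise ValueError("weights must be positive integers")

# Two shards: shard 0 owns remainders [0, 9), shard 1 owns [9, 19).
weights = [9, 10]

# Count how 1900 consecutive integer keys spread over the two shards.
placement = Counter(pick_shard(user_id, weights) for user_id in range(1900))
print(placement[0], placement[1])
```

With evenly spread key values, each shard receives rows in proportion to its weight, which is why the choice of sharding key matters as much as the weights themselves.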
Data ingestion and integration with Apache NiFi – 5 hours
NiFi use cases and data ingestion with NiFi
Processors and data flow design: multiple scenarios and data flows
Data ingestion into HDFS, MinIO, ClickHouse, RDBMS and NoSQL DBs
Process groups, remote process groups and ports
Custom processors
Ingestion into lakehouse with NiFi
NiFi Registry
NiFi best practices
Distributed processing on data lake and lakehouse with Trino – 5 hours
Trino architecture
Apache Hive metastore and other metastore services for Trino
Trino functions and SQL
HDFS, MinIO, ClickHouse, RDBMS, Hive and Iceberg catalogs
Trino user-defined functions
Deployment best practices
Administration tips and tricks
Distributed processing on data lake and lakehouse with Spark and Spark SQL – 8 hours
Apache Spark architecture
Spark job deployment on YARN and K8s
Working with Spark data structures
Reading from and writing data into HDFS, ClickHouse, MinIO, RDBMS, files, …
Implementing multiple scenarios and developing batch jobs with Spark
Spark user-defined functions
Spark execution plan
Spark job and task monitoring
Spark external shuffle service
Spark parameter tuning
Kafka and streaming fundamentals – 6 hours
Apache Kafka architecture
Cluster components and deployment
Working with topics, partitions, replication, producers, consumers and consumer groups, …
Message structure and message formats in Kafka
Message ordering
Kafka Connect
Kafka monitoring
Kafka storage internals
Kafka integration with HDFS, ClickHouse, MinIO, NiFi and Spark
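The message-ordering topic in this module can be sketched without a broker. Kafka preserves order only within a partition, so records that share a key must map to the same partition to stay ordered relative to each other. The simulation below uses `zlib.crc32` as a stand-in hash (an assumption: Kafka's default partitioner uses murmur2 instead), and the `partition_for` and `events_for` helpers and the event data are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; the same key always yields the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# (key, payload) pairs in the order a producer would send them.
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add-to-cart"),
    ("user-2", "logout"),
    ("user-1", "checkout"),
]

# Each partition is modelled as an append-only log.
partitions = defaultdict(list)
for key, payload in events:
    partitions[partition_for(key)].append((key, payload))

def events_for(key):
    """Read back one key's payloads from its partition, in log order."""
    log = partitions[partition_for(key)]
    return [payload for k, payload in log if k == key]

print(events_for("user-1"))
print(events_for("user-2"))
```

Each key's events come back in send order because they share one log. No such guarantee holds across different partitions, which is why the choice of message key is part of the ordering design.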