Course title | Duration | Schedule | Start date | Tuition | Instructor |
---|---|---|---|---|---|
Advanced Big Data Analytics (video sale only) | 13 sessions, 52 hours | - | Once capacity is filled | 3,000,000 Toman | Eng. Hassan Ahmadkhani |
Advanced Big Data Analytics: Course Syllabus and Content
Course introduction and objective:
The Advanced Big Data Analytics course covers the key tools and techniques for preparing, cleansing, analyzing, and managing big data.
The objective of this course is to cover advanced topics in the tools of the Hadoop ecosystem in order to meet the requirements of the Data Engineer, Data Scientist, and Integration Engineer roles.
Course duration: 54 hours
Prerequisite: the Applied Big Data Fundamentals course, or work experience with Hadoop and the Hadoop ecosystem
Course Content, At a Glance
Build a Data Lake and Data Lakehouse for Advanced Analytics at Scale: 33 hours
Spark SQL
Apache Hive
Trino (aka PrestoSQL)
HDFS File Format Internals: ORC, Parquet, Iceberg, Delta and Hudi
Apache Kudu
Spark ML over Lakehouse
Spark GraphFrames and GraphX
Apache Airflow for Workflow Management
Real Time Analytics and Streaming: 12 hours
Advanced NiFi and Data Flow Management with NiFi
Kafka as commit log and stream storage
Advanced HBase and How HBase Works
Spark Streaming
Flink Streaming
Data Security: 3 hours
Kerberos and Hadoop Ecosystem Integration with Krb5
Ranger and Ranger Integration with Ecosystem
Data Governance in Practice: 3 hours
Data Governance Knowledge Areas (DMBOK 2.0)
Data Modeling: Star, Snowflake and Data Vault
Metadata Management and Data Lineage with Apache Atlas
Ranger and Atlas Integration and Tag-Based Access Control
Cloud Platforms Review: 3 hours
Databricks
GCP
AWS
Snowflake
Course Content, Detail
Build a Data Lake and Data Lakehouse for Advanced Analytics at Scale: 33 hours
Preparation of development environment
Hadoop Cluster
Zeppelin Notebook
Create interpreters in Zeppelin
Zeppelin security
IntelliJ IDEA
Spark SQL
DataFrame and Dataset
Reading data from different sources: HDFS, RDBMS and NoSQL
Spark SQL functions and case studies for batch processing, ETL and ELT
Apache Hive
Hive introduction and deployment
HS2 and HMS
Data operations with HQL (CRUD)
Partitioning and bucketing
Join methods and internals of joins in the distributed model
Hive UDF: UDF, UDAF and UDTF
SerDe in Hive
Hive on Spark
Hive on Tez
Hive LLAP
Spark-to-Hive connection methods
Hive tuning
Trino (aka PrestoSQL)
PrestoSQL/Trino introduction and deployment
Data operations with Trino (CRUD)
Trino catalogs to Hive and HDFS, RDBMS and MongoDB
Trino parameters and Trino tuning
Presto/Trino on Spark
HDFS File Format Internals: ORC, Parquet, Iceberg, Delta and Hudi
ORC and Parquet file format internals
Avro file format
Schema evolution
Update and delete challenges in HDFS file formats
Data lake vs lakehouse
Iceberg file format internals
Delta file format
Hudi file format internals
Data operations with Spark on Iceberg, Delta and Hudi
Data operations with Trino on Iceberg, Delta and Hudi
Apache Kudu
Kudu introduction and deployment
Kudu architecture and data modeling in Kudu
Data operations on Kudu with Spark and Trino
Build Kudu from source code
Kudu parameters and tuning options
Spark ML over Lakehouse
Machine learning introduction and ML in distributed manner
ML algorithms in Spark ML
Spark ML pipelines
Case studies and examples for clustering, classification and recommendation
Deep learning with Spark
Model deployment
Spark GraphFrames and GraphX
Graph processing, graph query and graph algorithms introduction
Graph query and algorithms with Spark GraphX
Graph query and algorithms with Spark GraphFrames
Spark SQL, ML and Graph in one picture: how to combine them in practice
Apache Airflow for Workflow Management
Airflow introduction and deployment
Data flow design with Airflow
DAG, Task, Operator, Executor and Scheduler in Airflow
Airflow operators and custom operators
Real Time Analytics and Streaming: 12 hours
Kafka as commit log and stream storage
Kafka introduction and review (prerequisite: Kafka topics in Applied Big Data Fundamentals course)
Kafka cluster deployment
Advanced NiFi and Data Flow Management with NiFi
NiFi introduction and deployment
Data flow design with NiFi and streaming to Kafka and data lake
NiFi custom processors
NiFi Registry
NiFi Toolkit
NiFi clustering and tuning
Advanced HBase and How HBase Works
HBase introduction and deployment
Data modeling and schema design in HBase
Data operations with Spark and Trino/PrestoSQL over HBase
HBase and Phoenix as operational database
Spark Streaming
Streaming definition and use cases
Spark streaming, real time and near real time processing models
Streaming concepts and operations on streaming data
Basic operations
Window operations
Join operations on stream data
State store and checkpointing
Streaming sink to HDFS, Kafka, Delta, Iceberg, Kudu and RDBMS
Implementation of incremental ETL and ELT
Implementation of end-to-end real time ETL and ingestion from RDBMS to Lakehouse with CDC, Kafka, Spark Streaming, Delta and Kudu
Implementation of real time ML and graph processing system
Flink Streaming
Flink streaming vs Spark streaming
Flink introduction and deployment
DataStream API and Table API
Stateful functions
Interactive processing with Flink
Window operations with Flink
Data Security: 3 hours
Security introduction
Authentication
Authorization
Transport security: SSL/HTTPS
At-rest security: encryption at rest
Audit
Kerberos and Hadoop Ecosystem Integration with Krb5
Kerberos and how krb5 works
Kafka integration with Kerberos
Hive and HDFS integration with Kerberos
Ranger and Ranger Integration with Ecosystem
Ranger introduction and build Apache Ranger
Kudu security with Ranger
PrestoSQL/Trino security with Ranger
HDFS security with Ranger
Data Governance in Practice: 3 hours
Data Governance Knowledge Areas (DMBOK 2.0)
Data Governance
Data Architecture
Data Modeling and Design
Data Storage and Operations
Data Security
Data Integration and Interoperability
Documents and Content
Reference and Master Data
Data Warehousing and Business Intelligence
Metadata
Data Quality
Data Modeling: Star, Snowflake and Data Vault
Review of Star and Snowflake
Introduction to Data Vault 2.0
Metadata Management and Data Lineage with Apache Atlas
Introduction to Atlas
How to build and use Atlas
Ranger and Atlas Integration and Tag-Based Access Control
Tag-based security control introduction
Atlas and Ranger integration for tag-based security management
Cloud Platforms Review: 3 hours
Databricks
Databricks features and options for big data processing
Introduction to data engineering with Databricks
Introduction to data science and ML with Databricks
GCP
Google Cloud Platform features and options for big data processing
Introduction to data engineering with GCP
Introduction to data science and ML with GCP
AWS
AWS features and options for big data processing
Introduction to data engineering with AWS
Introduction to data science and ML with AWS
Snowflake
Snowflake features and options for big data processing
Introduction to data engineering with Snowflake
Introduction to data science and ML with Snowflake