Advanced Big Data Analytics

Instructor: Eng. Hassan Ahmadkhani


Course title: Advanced Big Data Analytics (online)
Duration: 18 sessions, 54 hours
Schedule: Sundays and Tuesdays, 17:30 to 20:30
Start date: Sunday, 10 Mehr 1401 (2 October 2022)
Tuition: 3,705,000 Toman
Instructor: Eng. Hassan Ahmadkhani

    Advanced Big Data Analytics: Syllabus and Course Content



     Course Introduction and Objectives:

    This advanced course on big data analytics covers the key tools and techniques for preparing, cleaning, analyzing and managing big data.

    The course examines advanced topics in the Hadoop ecosystem tools, with the aim of meeting the requirements of the Data Engineer, Data Scientist and Integration Engineer roles.

     

    Course duration: 54 hours


    Prerequisite: the Applied Big Data Fundamentals course, or work experience with Hadoop and the Hadoop ecosystem

     

     

    Course Content, At a Glance

    Build a Data Lake and Data Lakehouse for Advanced Analytics at Scale: 33 hours

    Spark SQL

    Apache Hive

    Trino (aka PrestoSQL)

    HDFS File Format Internals: ORC, Parquet, Iceberg, Delta and Hudi

    Apache Kudu

    Spark ML over Lakehouse

    Spark GraphFrames and GraphX

    Apache Airflow for Workflow Management

     

    Real Time Analytics and Streaming: 12 hours

    Advanced NiFi and Data Flow Management with NiFi

    Kafka as commit log and stream storage

    Advanced HBase and How HBase Works

    Spark Streaming

    Flink Streaming

     

    Data Security: 3 hours

    Kerberos and Hadoop Ecosystem Integration with Krb5

    Ranger and Ranger Integration with Ecosystem

     

    Data Governance in Practice: 3 hours

    Data Governance Knowledge Areas (DMBOK 2.0)

    Data Modeling: Star, Snowflake and Data Vault

    Metadata Management and Data Lineage with Apache Atlas

    Ranger and Atlas Integration and Tag-Based Access Control

     

    Cloud Platforms Review: 3 hours

    Databricks

    GCP

    AWS

    Snowflake

     

     

    Course Content, in Detail:

    Build a Data Lake and Data Lakehouse for Advanced Analytics at Scale: 33 hours

    Preparation of development environment

    Hadoop Cluster

    Zeppelin Notebook

    Create interpreters in Zeppelin

    Zeppelin security

    IntelliJ IDEA

     

    Spark SQL

    DataFrame and Dataset

    Reading data from different sources: HDFS, RDBMS and NoSQL

    Spark SQL functions and case studies for batch processing, ETL and ELT

     

    Apache Hive

    Hive introduction and deployment

    HS2 and HMS

    Data operations with HQL (CRUD)

    Partitioning and Bucketing

    Join methods and join internals in a distributed model

    Hive UDF: UDF, UDAF and UDTF

    SerDe in Hive

    Hive on Spark

    Hive on Tez

    Hive LLAP

    Spark-to-Hive connection methods

    Hive tuning
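
    As a rough illustration of the "Partitioning and Bucketing" topic above, the following sketch groups rows by a partition value and a hash-based bucket index. It is a plain Python toy under stated assumptions: the column names, the sample rows and the use of CRC32 are hypothetical choices for illustration, and Hive uses its own hash function and file layout.

```python
import zlib

def bucket_for(key, num_buckets):
    """Map a key to a bucket index with a stable hash (CRC32 here, not Hive's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def layout(rows, partition_col, bucket_col, num_buckets):
    """Group rows by (partition value, bucket index), mimicking directory/file placement."""
    placed = {}
    for row in rows:
        slot = (row[partition_col], bucket_for(row[bucket_col], num_buckets))
        placed.setdefault(slot, []).append(row)
    return placed

# Hypothetical sample data: partition by country, bucket by user_id.
rows = [
    {"country": "IR", "user_id": "u1"},
    {"country": "IR", "user_id": "u2"},
    {"country": "DE", "user_id": "u1"},
]
placed = layout(rows, "country", "user_id", num_buckets=4)
```

    The same user_id always lands in the same bucket index regardless of partition, which is the property that bucketed joins rely on.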

     

    Trino (aka PrestoSQL)

    PrestoSQL/Trino introduction and deployment

    Data operations with Trino (CRUD)

    Trino catalogs for Hive and HDFS, RDBMS and MongoDB

    Trino parameters and Trino tuning

    Presto/Trino on Spark

     

    HDFS File Format Internals: ORC, Parquet, Iceberg, Delta and Hudi

    ORC and Parquet file format internals

    Avro file format

    Schema evolution

    Update and delete challenges in HDFS file formats

    Data lake vs lakehouse

    Iceberg file format internals

    Delta file format

    Hudi file format internals

    Data operations with Spark on Iceberg, Delta and Hudi

    Data operations with Trino on Iceberg, Delta and Hudi

     

    Apache Kudu

    Kudu introduction and deployment

    Kudu architecture and data modeling in Kudu

    Data operations on Kudu with Spark and Trino

    Build Kudu from source code

    Kudu parameters and tuning options


    Spark ML over Lakehouse

    Machine learning introduction and distributed ML

    ML algorithms in Spark ML

    Spark ML pipelines

    Case studies and examples for clustering, classification and recommendation

    Deep learning with Spark

    Model deployment

     

    Spark GraphFrames and GraphX

    Graph processing, graph query and graph algorithms introduction

    Graph query and algorithms with Spark GraphX

    Graph query and algorithms with Spark GraphFrames

    Spark SQL, ML and Graph in one picture: how to combine them in practice

     

    Apache Airflow for Workflow Management

    Airflow introduction and deployment

    Data flow design with Airflow

    DAG, Task, Operator, Executor and Scheduler in Airflow

    Airflow operators and custom operators
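
    To make the DAG idea behind the Airflow topics above concrete, here is a minimal sketch of dependency-ordered task execution. It uses only Python's standard library (graphlib, Python 3.9+) and is not Airflow's API; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "load_lakehouse": {"clean"},
    "train_model": {"load_lakehouse"},
    "refresh_report": {"load_lakehouse"},
}

def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

    A scheduler works from the same kind of ordering: a task becomes runnable only once all of its upstream tasks have finished, and independent tasks such as train_model and refresh_report may run in either order or in parallel.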

     

    Real Time Analytics and Streaming: 12 hours

    Kafka as commit log and stream storage

    Kafka introduction and review (prerequisite: Kafka topics in Applied Big Data Fundamentals course)

    Kafka cluster deployment

     

    Advanced NiFi and Data Flow Management with NiFi

    NiFi introduction and deployment

    Data flow design with NiFi and streaming to Kafka and the data lake

    NiFi custom processors

    NiFi Registry

    NiFi Toolkit

    NiFi clustering and tuning

     

     

    Advanced HBase and How HBase Works

    HBase introduction and deployment

    Data modeling and schema design in HBase

    Data operations with Spark and Trino/PrestoSQL over HBase

    HBase and Phoenix as an operational database

     

    Spark Streaming

    Streaming definition and use cases

    Spark Streaming: real-time and near-real-time processing models

    Streaming concepts and operations on streaming data

    Basic operations

    Window operations

    Join operations on stream data

    State store and checkpointing

    Streaming sink to HDFS, Kafka, Delta, Iceberg, Kudu and RDBMS

    Implementation of incremental ETL and ELT

    Implementation of end-to-end real-time ETL and ingestion from RDBMS to Lakehouse with CDC, Kafka, Spark Streaming, Delta and Kudu

    Implementation of a real-time ML and graph processing system
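
    The "Window operations" topic above can be illustrated with a small sketch of tumbling (fixed, non-overlapping) windows. This is plain Python rather than Spark or Flink code, and the event tuples and the 10-second window size are hypothetical.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Align each event to the start of the window that contains it.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical events as (event time in seconds, key).
events = [(0, "a"), (4, "a"), (9, "b"), (10, "a"), (19, "b"), (21, "a")]
result = tumbling_window_counts(events, window_seconds=10)
```

    Streaming engines add what this toy leaves out: handling of late or out-of-order events, and state that survives restarts.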

     

    Flink Streaming

    Flink streaming vs Spark streaming

    Flink introduction and deployment

    DataStream API and Table API

    Stateful functions

    Interactive processing with Flink

    Window operations with Flink

     

    Data Security: 3 hours

    Security introduction

    Authentication

    Authorization

    Transport security: SSL/HTTPS

    At-rest security: encryption at rest

    Audit

     

    Kerberos and Hadoop Ecosystem Integration with Krb5

    Kerberos and how krb5 works

    Kafka integration with Kerberos

    Hive and HDFS integration with Kerberos

     

    Ranger and Ranger Integration with Ecosystem

    Ranger introduction and building Apache Ranger

    Kudu Security with Ranger

    PrestoSQL/Trino security with Ranger

    HDFS security with Ranger

     

    Data Governance in Practice: 3 hours

    Data Governance Knowledge Areas (DMBOK 2.0)

    Data Governance

    Data Architecture

    Data Modeling and Design

    Data Storage and Operations

    Data Security

    Data Integration and Interoperability

    Documents and Content

    Reference and Master Data

    Data Warehousing and Business Intelligence

    Metadata

    Data Quality

     

    Data Modeling: Star, Snowflake and Data Vault

    Review Star and Snowflake

    Introduction to Data Vault 2.0

     

    Metadata Management and Data Lineage with Apache Atlas

    Introduction to Atlas

    How to build and use Atlas

     

    Ranger and Atlas Integration and Tag-Based Access Control

    Tag-based security control introduction

    Atlas and Ranger integration for tag-based security management

     

    Cloud Platforms Review: 3 hours

    Databricks

    Databricks features and options for big data processing

    Introduction to data engineering with Databricks

    Introduction to data science and ML with Databricks

     

    GCP

    Google Cloud Platform features and options for big data processing

    Introduction to data engineering with GCP

    Introduction to data science and ML with GCP

     

    AWS

    AWS features and options for big data processing

    Introduction to data engineering with AWS

    Introduction to data science and ML with AWS

     

    Snowflake

    Snowflake features and options for big data processing

    Introduction to data engineering with Snowflake

    Introduction to data science and ML with Snowflake