Data-Engineer-Nanodegree-Projects-Udacity

Projects completed in the Data Engineer Nanodegree by Udacity.

Course 1: Data Modeling

Introduction to Data Modeling

  • Understand the purpose of data modeling
  • Identify the strengths and weaknesses of different types of databases and data storage techniques
  • Create a table in Postgres and Apache Cassandra (see the sketch after this list)
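
To make the last objective concrete, here is a minimal sketch of creating a table in both stores, assuming a local Postgres instance and a single-node Cassandra cluster; the connection settings, keyspace, and songs table are illustrative:

```python
import psycopg2
from cassandra.cluster import Cluster

# Postgres: a relational table with column types and a primary key constraint
pg_conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
pg_cur = pg_conn.cursor()
pg_cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id VARCHAR PRIMARY KEY,
        title   VARCHAR NOT NULL,
        year    INT
    );
""")
pg_conn.commit()

# Cassandra: the table lives inside a keyspace, and the PRIMARY KEY
# doubles as the partition key that decides where rows are placed
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")
session.set_keyspace('sparkify')
session.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id TEXT PRIMARY KEY,
        title   TEXT,
        year    INT
    );
""")
```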

Relational Data Models

  • Understand when to use a relational database
  • Understand the difference between OLAP and OLTP databases
  • Create normalized data tables
  • Implement denormalized schemas (e.g., star, snowflake); a star-schema sketch follows this list
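
A minimal star-schema sketch, again against a local Postgres instance; the fact_sales and dim_customer tables are illustrative placeholders:

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Dimension: one wide, denormalized table per business entity
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id INT PRIMARY KEY,
        name        VARCHAR,
        city        VARCHAR,
        country     VARCHAR
    );
""")

# Fact: a narrow table of measurements, pointing at the dimensions
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        sale_id     SERIAL PRIMARY KEY,
        customer_id INT REFERENCES dim_customer (customer_id),
        sale_date   DATE,
        amount      NUMERIC(10, 2)
    );
""")
conn.commit()
```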

NoSQL Data Models

  • Understand when to use NoSQL databases and how they differ from relational databases
  • Select the appropriate primary key and clustering columns for a given use case
  • Create a NoSQL database in Apache Cassandra (see the sketch after this list)
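
A minimal sketch of primary-key design in Cassandra, where the table is modeled around the query it must answer; the sparkify keyspace and song_plays table are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sparkify')  # keyspace assumed to exist

# Query to support: "all songs a user played in a session, in play order".
# (user_id, session_id) is the composite partition key; item_in_session is
# a clustering column that sorts rows within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS song_plays (
        user_id         INT,
        session_id      INT,
        item_in_session INT,
        song_title      TEXT,
        PRIMARY KEY ((user_id, session_id), item_in_session)
    );
""")

rows = session.execute(
    "SELECT song_title FROM song_plays WHERE user_id = %s AND session_id = %s",
    (10, 182)
)
for row in rows:
    print(row.song_title)
```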

Project 1: Data Modeling with Postgres and Apache Cassandra

Course 2: Cloud Data Warehouses

Introduction to Data Warehouses

  • Understand Data Warehousing architecture
  • Run an ETL process to denormalize a database (3NF to star schema)
  • Create an OLAP cube from facts and dimensions (see the sketch after this list)
  • Compare columnar vs. row-oriented approaches
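
A sketch of the OLAP-cube idea using Postgres' GROUP BY CUBE over the star-schema tables from the earlier sketch (fact_sales and dim_customer remain illustrative names):

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# CUBE produces every combination of the listed dimensions: totals by
# country, by year, by (country, year), and the grand total
cur.execute("""
    SELECT d.country,
           EXTRACT(YEAR FROM f.sale_date) AS sale_year,
           SUM(f.amount)                  AS revenue
    FROM fact_sales f
    JOIN dim_customer d USING (customer_id)
    GROUP BY CUBE (d.country, EXTRACT(YEAR FROM f.sale_date))
    ORDER BY 1 NULLS LAST, 2 NULLS LAST;
""")
for country, sale_year, revenue in cur.fetchall():
    print(country, sale_year, revenue)
```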

Introduction to the Cloud with AWS

  • Understand cloud computing
  • Create an AWS account and understand its services
  • Set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL (see the sketch after this list)
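
A minimal boto3 sketch of standing up S3 storage; the bucket name, region, and file paths are placeholders, and credentials are assumed to be configured through the usual AWS config/environment mechanisms:

```python
import boto3

s3 = boto3.client('s3', region_name='us-west-2')

# Outside us-east-1, S3 requires an explicit location constraint
s3.create_bucket(
    Bucket='my-dend-bucket',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'},
)

# Upload a local file and list what landed in the bucket
s3.upload_file('data/song_data.json', 'my-dend-bucket', 'song_data/song_data.json')
for obj in s3.list_objects_v2(Bucket='my-dend-bucket').get('Contents', []):
    print(obj['Key'], obj['Size'])
```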

Implementing Data Warehouses on AWS

  • Identify components of the Redshift architecture
  • Run an ETL process to extract data from S3 into Redshift
  • Set up AWS infrastructure using Infrastructure as Code (IaC)
  • Design an optimized table by selecting the appropriate distribution style and sort key (see the sketch after this list)
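
A minimal Redshift sketch combining a DISTKEY/SORTKEY table definition with a COPY load from S3; the cluster endpoint, credentials, IAM role ARN, and bucket path are placeholders:

```python
import psycopg2  # Redshift speaks the Postgres wire protocol

conn = psycopg2.connect(
    "host=my-cluster.abc123.us-west-2.redshift.amazonaws.com "
    "port=5439 dbname=dev user=awsuser password=REPLACE_ME"
)
cur = conn.cursor()

# DISTKEY co-locates rows that join on user_id across slices;
# SORTKEY speeds up range scans on start_time
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1),
        start_time  TIMESTAMP SORTKEY,
        user_id     INT DISTKEY,
        song_id     VARCHAR,
        level       VARCHAR
    );
""")

# COPY loads from S3 in parallel, far faster than row-by-row INSERTs
cur.execute("""
    COPY songplays
    FROM 's3://my-dend-bucket/songplays/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 'auto';
""")
conn.commit()
```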

Project 2: Data Infrastructure on the Cloud

Course 3: Data Lakes with Spark

The Power of Spark

  • Understand the big data ecosystem
  • Understand when to use Spark and when not to use it

Data Wrangling with Spark

  • Manipulate data with SparkSQL and Spark DataFrames
  • Use Spark for ETL purposes (see the sketch after this list)
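
A minimal sketch showing the same aggregation in both the DataFrame API and SparkSQL; the input log file is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wrangling").getOrCreate()
df = spark.read.json("data/sparkify_log_small.json")

# DataFrame API: filter, group, aggregate
(df.filter(F.col("page") == "NextSong")
   .groupBy("userId")
   .agg(F.count("*").alias("songs_played"))
   .show(5))

# The same question in SQL, via a temporary view
df.createOrReplaceTempView("log")
spark.sql("""
    SELECT userId, COUNT(*) AS songs_played
    FROM log
    WHERE page = 'NextSong'
    GROUP BY userId
    LIMIT 5
""").show()
```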

Debugging and Optimization

  • Troubleshoot common errors and optimize code using the Spark Web UI

Introduction to Data Lakes

  • Understand the purpose and evolution of data lakes
  • Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue
  • Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages (see the sketch after this list)
  • Understand the components and issues of data lakes
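
A minimal ELT sketch for a data lake: read raw JSON from S3, reshape it with Spark, and write partitioned Parquet back to S3. The bucket paths are placeholders, and the s3a:// scheme assumes the Hadoop S3 connector is on the classpath (as it is on EMR):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-elt").getOrCreate()

# Extract-Load: raw data lands in the lake first, schema inferred on read
songs = spark.read.json("s3a://my-dend-bucket/raw/song_data/*/*/*/*.json")

# Transform: select, deduplicate, and partition for downstream analytics
(songs.select("song_id", "title", "artist_id", "year", "duration")
      .dropDuplicates(["song_id"])
      .write
      .mode("overwrite")
      .partitionBy("year", "artist_id")
      .parquet("s3a://my-dend-bucket/lake/songs/"))
```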

Project 3: Big Data with Spark

Course 4: Automate Data Pipelines

Data Pipelines

  • Create data pipelines with Apache Airflow
  • Set up task dependencies
  • Create data connections using hooks (see the sketch after this list)
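
A minimal Airflow 2.x sketch of a two-task DAG with an explicit dependency and a PostgresHook; the connection id, tables, and schedule are placeholders, and the Postgres provider package is assumed to be installed:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def load_rows():
    # The hook resolves credentials from the "my_postgres" connection
    # configured in the Airflow UI or environment
    hook = PostgresHook(postgres_conn_id="my_postgres")
    hook.run("INSERT INTO staging_events SELECT * FROM raw_events;")

def report():
    print("load finished")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_rows", python_callable=load_rows)
    done = PythonOperator(task_id="report", python_callable=report)
    load >> done  # task dependency: load runs before report
```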

Data Quality

  • Track data lineage
  • Set up data pipeline schedules
  • Partition data to optimize pipelines
  • Write tests to ensure data quality (see the sketch after this list)
  • Backfill data
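
A minimal data-quality sketch: a callable that fails the pipeline loudly when a freshly loaded table is empty. It would plug into a PythonOperator in the DAG above; the connection id and table name are placeholders:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def check_not_empty(table: str = "songplays"):
    hook = PostgresHook(postgres_conn_id="my_postgres")
    records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or records[0][0] == 0:
        # Raising marks the task failed, so Airflow can retry or alert
        raise ValueError(f"Data quality check failed: {table} is empty")
    print(f"Data quality check passed: {table} has {records[0][0]} rows")
```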

Production Data Pipelines

  • Build reusable and maintainable pipelines
  • Build your own Apache Airflow plugins (see the sketch after this list)
  • Implement subDAGs
  • Set up task boundaries
  • Monitor data pipelines
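
A minimal sketch of a reusable custom operator, the building block of an Airflow plugin; it packages the row-count check above so any DAG can reuse it (names are illustrative):

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

class HasRowsOperator(BaseOperator):
    """Fail the task if the given table has no rows."""

    def __init__(self, table: str, postgres_conn_id: str = "my_postgres", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        count = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")[0][0]
        if count < 1:
            raise ValueError(f"{self.table} returned no rows")
        self.log.info("%s passed with %s rows", self.table, count)
```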

Project 4: Data Pipelines with Airflow

Certification: ./images/certification.jpg