Udacity Data Engineering Project

This project uses PySpark to extract data and organize it into a snowflake schema database composed of fact and dimension tables. It extracts data from Parquet and CSV files.

US Immigration Data

Data Engineering Capstone Project

Project Summary

This project applies techniques learned in the Udacity Data Engineering Nanodegree to create fact and dimension tables from US immigration data, together with analysis of temperature and airport characteristics. Two data formats (SAS and .csv) are used to build the fact and dimension tables: the US immigration dataset is split according to what each piece of information describes (the flight or the passenger, for example), and temperature information is added for the observed dates.

Dataset 1: I94 Immigration Data from the US National Tourism and Trade Office

This is the data used to build the fact table. It contains information on the airline, visa type, and destination of people immigrating to the US.
First, we check for missing data, verify that the data behaves as expected, and transform it where necessary.
This dataset has around 3 million rows and 28 columns. We limit the columns to those of interest and explore how to break the dataset into new tables where relevant.
This data is updated monthly.
For this project I opted to delete columns in which more than 30% of the values are null. Null values are to be expected in data engineering projects, and this is only one of many possible approaches.
This dataset contains a lot of information, which can be split according to what it describes.
We use draw.io to build a visualization of the schema.
I allow NaN values to remain as NaN, without applying replacement techniques.
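
The 30% rule can be sketched as follows. This is a minimal illustration in plain Python rather than PySpark, so that it runs without a Spark session. The function name `drop_sparse_columns`, the sample rows, and the column names are hypothetical and are not taken from the notebook.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_sparse_columns(rows, max_null_fraction=0.30):
    """Return (kept_columns, filtered_rows), dropping every column whose
    share of null values is strictly greater than max_null_fraction."""
    if not rows:
        return [], []
    columns = list(rows[0].keys())
    total = len(rows)
    kept = [
        col for col in columns
        if sum(is_null(row.get(col)) for row in rows) / total <= max_null_fraction
    ]
    filtered = [{col: row.get(col) for col in kept} for row in rows]
    return kept, filtered


# Hypothetical sample rows; the column names are for illustration only.
sample_rows = [
    {"airline": "AA", "visatype": "B2", "occupation": None},
    {"airline": "DL", "visatype": "B1", "occupation": None},
    {"airline": None, "visatype": "F1", "occupation": "STU"},
    {"airline": "UA", "visatype": "B2", "occupation": None},
]
kept_columns, cleaned_rows = drop_sparse_columns(sample_rows)
print(kept_columns)
```

In this sketch, nulls that remain in the kept columns are left untouched rather than imputed.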

Dataset 2: Temperature

Temperature by city and date

Dataset 3: Airport

Code
Location
State
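
As a rough sketch of how these three fields might be derived, the snippet below parses a port label with Python's `re` module. The line layout (`'CODE' = 'LOCATION, ST'`) is an assumption about how the port entries are written, and `parse_port_line` is a hypothetical helper, not a function from the notebook.

```python
import re

# Assumed layout of one entry: 'ANC' = 'ANCHORAGE, AK   '
PORT_PATTERN = re.compile(r"^\s*'(?P<code>[^']+)'\s*=\s*'(?P<label>[^']*)'")


def parse_port_line(line):
    """Split one port entry into code, location, and state.

    Returns None when the line does not match the assumed layout.
    """
    match = PORT_PATTERN.match(line)
    if match is None:
        return None
    code = match.group("code").strip()
    label = match.group("label").strip()
    # Split on the last comma so that locations containing commas stay intact.
    location, sep, state = label.rpartition(",")
    if not sep:
        # No comma: keep the whole label as the location, state unknown.
        return {"code": code, "location": label, "state": None}
    return {"code": code, "location": location.strip(), "state": state.strip()}


print(parse_port_line("   'ANC'\t=\t'ANCHORAGE, AK         '"))
```

Entries without a comma keep the full label as the location and leave the state empty, which avoids guessing a state that is not present.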

Technologies and Tools

Here I opted to use PySpark to take advantage of parallel processing.

Steps:

Create an environment for the project
Install the following packages: pandas, PySpark, NumPy (re is part of the Python standard library and needs no installation).
Run Capstone Project Template.ipynb (for example, in VS Code, checking that the kernel points to the desired environment)
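
Before running the notebook, it can help to confirm that the selected kernel actually sees the required packages. The check below is a minimal sketch using only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced for illustration.

```python
import importlib.util

# Import names of the packages listed in the steps.
REQUIRED_MODULES = ["pandas", "numpy", "pyspark", "re"]


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Running this in the first notebook cell shows whether the kernel refers to the environment where the packages were installed.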

Goal:

Create the data model shown in new_datamodel.PNG from the initial datasets described above.

Example of Analysis with Data Model:

Check the temperature on a passenger's arrival date
Check the temperature at a port at the date level
Correlate passenger data to find out whether there are indications of a temperature preference when choosing a flight date
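
The first analysis can be sketched as a left join between arrival records and temperature readings on city and date. The example uses plain Python rather than PySpark so that it runs without a Spark session. All table contents, column names (`port_city`, `arrival_date`, `avg_temp_c`), and values are made up for illustration and do not come from the project's data model.

```python
from datetime import date

# Hypothetical dimension rows: one temperature reading per (city, date).
temperature_rows = [
    {"city": "MIAMI", "date": date(2016, 4, 1), "avg_temp_c": 25.1},
    {"city": "NEW YORK", "date": date(2016, 4, 1), "avg_temp_c": 9.4},
]

# Hypothetical fact rows: one record per arrival.
arrival_rows = [
    {"passenger_id": 1, "port_city": "MIAMI", "arrival_date": date(2016, 4, 1)},
    {"passenger_id": 2, "port_city": "NEW YORK", "arrival_date": date(2016, 4, 1)},
    {"passenger_id": 3, "port_city": "BOSTON", "arrival_date": date(2016, 4, 2)},
]


def add_arrival_temperature(arrivals, temperatures):
    """Left-join arrivals to temperatures on (city, date).

    Arrivals without a matching reading get avg_temp_c = None.
    """
    lookup = {(t["city"], t["date"]): t["avg_temp_c"] for t in temperatures}
    return [
        {**a, "avg_temp_c": lookup.get((a["port_city"], a["arrival_date"]))}
        for a in arrivals
    ]


joined_rows = add_arrival_temperature(arrival_rows, temperature_rows)
for row in joined_rows:
    print(row["passenger_id"], row["port_city"], row["avg_temp_c"])
```

A left join is used here so that arrivals with no temperature reading are kept with an empty value instead of being dropped.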

anapaulasandes/dataengineer_table_creation
