udacity_capstone

Udacity Nanodegree

udacity_capstone

Project Summary

The purpose of this project is to study the foreign students. The goal is to offer Data teams Analysts a selection of data concerning immigration to the United States.

The demography of the United States was built thanks to the arrival of the first settlers of various European nationalities. In the 20th century, the population increased due to immigration. Here I would like to make the link with the world news and the variations of the entry to the USA via the airports.

The database will be answers questions about the foreign students. Where are the go to studies in USA? Where are they from? It's provide some indicators about their origin countries and the arrival state.

I'll look at the US National Tourism and Trade Office along with various immigration and other information about arriving person. I-94 Website is the Official Site for Travelers Visisting the United States. An I-94 form is needed by all Air and Sea travelers. The GDELT Project, here, monitors the news around our world. These will form the base of my project. I need to add more information.

Data Source

Data	File	Data Source
I94 Immigration	immigration_data_sample.csv	US National Tourism and Trade Office
I94 Description Labels Description	I94_SAS_Labels_Descriptions.SAS	US National Tourism and Trade Office
Global Land Temperature	GlobalLandTemperaturesByCity.csv	Berkeley Earth
Global Airports	airports-extended.csv	OpenFlights.org and user contributions
Airports codes	airport-codes_csv.csv	provide by Udacity
Iso country	wikipedia-iso-country-codes.csv	Kaggle
US Cities Demographic	us-cities-demographics.csv	provide by Udacity
Indicators developpment	WDIData.csv	Kaggle

Tools used

I used Python to explore and PySpark for ETL. I try the twice to improve my code. I used Pyspark and Pandas libraries. I run notebooks, python scripts and Spark jobs in a Docker Yarm.

I use compress the dataset in data.zip

then , to upload large file to github

     git lfs track "*.zip"
     git add .gitattributes
     git commit -m "Updated the attributes"
     git push
     git add my_large_file.zip
     git lfs ls-files
         And here I would ensure that I saw my_large_file.zip being tracked.
     git commit -m "Now I am adding the large file"
     git push

Folder structure

Click here to see the result of the command TREE of the main folder

tree -CAL 2
.  
├── data  
│   ├── 18-83510-I94-Data-2016  
│   ├── airport-codes_csv.csv  
│   ├── airports-extended.csv  
│   ├── airports_us.csv  
│   ├── dataset.tar.gz  
│   ├── GlobalLandTemperaturesByCity.csv  
│   ├── I94_SAS_Labels_Descriptions.SAS  
│   ├── immigration_data_sample.csv  
│   ├── postgres  
│   ├── us-cities-demographics.csv  
│   ├── WDIData.csv  
│   └── wikipedia-iso-country-codes.csv  
├── dl.cfg  
├── docker_nbextensions  
│   ├── Dockerfile  
│   └── docker.txt  
├── documentation
│   ├── 0_Take_a_look_dataset.html
│   ├── 1_Exploration_python.html
│   ├── 2_Data_dictionnary.html
│   ├── 3_Explanation_capstone_project.html
│   ├── 4_Create_etl_work_process.html
│   ├── 5_create_etl_workinCopy.html
│   ├── dend_schema.png
├── input
│   ├── airport-codes_csv
│   ├── airports-extended.csv
│   ├── GlobalLandTemperaturesByCity
│   ├── i94addr.parquet
│   ├── i94_apr16
│   ├── i94cit_i94res.parquet
│   ├── i94mode.parquet
│   ├── i94port.parquet
│   ├── i94visa.parquet
│   ├── us-cities-demographics
│   ├── WDIData
│   └── wikipedia-iso-country-codes.csv
├── output
│   ├── country_table
│   ├── demograph_table
│   ├── fact_immigration
│   ├── indicator_table
│   └── usairport_table
├── README.md  
├── stack.yml  
└── work  
    ├── bootstrap_jupyter.sh  
    ├── log4j.properties  
    ├── Makefile  
    ├── notebook  
    │   ├── 0_Take_a_look_dataset.ipynb  
    │   ├── 1_Exploration_python.ipynb  
    │   ├── 2_Data_dictionnary.ipynb  
    │   ├── 3_Explanation_capstone_project.ipynb  
    │   ├── 4_Create_etl_work_process.ipynb  
    │   ├── etl.py
    │   ├── cleasning.py
    │   ├── convertJupyterToHtml.py
    │   ├── parse_file1.py  
    │   ├── postgresql-42.2.10.jar  
    │   ├── process_tables.py  
    │   ├── read_file1.py  
    ├── postgresql-42.2.10.jar  
    └── requirements.txt

To Run

I used Docker.

In a terminal, run git clone https://github.com/anthelix/udacity_capstone.git <folder>,
cd <folder>
Unzip data.zip : It's make a folder data in <folder>
in <folder>:
- pull docker images first:
```
docker pull jupyter/all-spark-notebook:latest
```
- Deploy Docker Stack: docker stack deploy -c stack.yml jupyter
- Retrieve the token to log into Jupyter: docker logs $(docker ps | grep jupyter_spark | awk '{print $NF}')
  - ctrl + clic on "http://127. ..."
From the Jupyter terminal, run the install script: Make install == > sh bootstrap_jupyter.sh (in jupyter Home , at the right, click button new, then terminal)
then Make etl ==> python3 etl.py
A new folder output will created with the dimensions and fact table in <folder>
At the end, docker stack rm stack.yml jupyter, docker swarm leave --force, docker rmi -f $(docker images -qa), docker system prune --volumes, sudo docker images

Without Docker:

Go to the ./work directory
for local development, install Python packages: python3 -m pip install -r requirements.txt
Run Anaconda, then Jupyter Notebook
From the Jupyter terminal,in the folder ./work, run the install script : Make install == > sh bootstrap_jupyter.sh (in jupyter Home , at the right, click button new, then terminal)
Then Make etl
Make re clean the folder and delete folders input and output and re-create the same folderbut empty. I

ETL process

Process the file 'I94 Description Labels Description` to extract and store details in parquet files data.

Click here to see the result.

```
Running "../../data/I94_SAS_Labels_Descriptions.SAS"

There are 583 rows in i94port.parquet
There are 3 rows in i94visa.parquet
There are 55 rows in i94addr.parquet
There are 289 rows in i94cit_i94res.parquet
There are 4 rows in i94mode.parquet

***** Make i94 labels files is done!
```

Load the files

The files are in '.sas7bdat', '.csv', and '.parquet', create dataframes.

Click here to see the result.

  ```
  Variable             Type         Data/Info
  -------------------------------------------
  df_airport_code      DataFrame    DataFrame[ident: string, <...>ring, local_code: string]
  df_demograph         DataFrame    DataFrame[City: string, S<...>Race: string, Count: int]
  df_global_airports   DataFrame    DataFrame[airport_ID: int<...>ry: string, iata: string]
  df_immigration       DataFrame    DataFrame[cicid: double, <...>string, visatype: string]
  df_indicator_dev     DataFrame    DataFrame[Country Name: s<...>de: string, 2015: double]
  df_iso_country       DataFrame    DataFrame[Country: string<...>ring, ISO_3166-2: string]
  df_temperature       DataFrame    DataFrame[AverageTemperat<...> string, Country: string]
  ```

Cleasning

I kept data useful, rename columns, drop duplicates

Click here to see the result.

  ```
      ***** Make df_clean_global_airports processing 
  root
  |-- airport_id: integer (nullable = true)
  |-- airport_name: string (nullable = true)
  |-- city_name: string (nullable = true)
  |-- country_name: string (nullable = true)
  |-- iata_code: string (nullable = true)
+----------+--------------------+---------+------------+---------+
|airport_id|        airport_name|city_name|country_name|iata_code|
+----------+--------------------+---------+------------+---------+
|       263|Margaret Ekpo Int...|  Calabar|     Nigeria|      CBQ|
|       428|       Ivalo Airport|    Ivalo|     Finland|      IVL|
+----------+--------------------+---------+------------+---------+
only showing top 2 rows
</pre>
</details>

Create Dimensions anf fact tables in a new folder <folder>/ouput

Click here to see the result.

  ```
      +--------------+------------+------------+-----------+---------------+
  |  country_name|country_iso2|country_iso3|country_num|avg_temperature|
  +--------------+------------+------------+-----------+---------------+
  |   Afghanistan|          AF|         AFG|          4|      13.816497|
  |       Albania|          AL|         ALB|          8|      15.525828|
  |       Algeria|          DZ|         DZA|         12|      17.763206|
  |American Samoa|          AS|         ASM|         16|           null|
  |       Andorra|          AD|         AND|         20|           null|
  +--------------+------------+------------+-----------+---------------+
  only showing top 5 rows
  ```

Summary

Step 1: Scope the Project and Gather Data

Step 2: Explore and Assess the Data

Step 3: Define the Data Model

Step 4: Run ETL to model the data

Step 5: Complete Project write up

Step 1: Scope the Project and Gather Data

Data warehouse allow us to collect, transform and manage data from varied sources. Then, Data Team Business connect to it and analyse data.

The main dataset includes data on immigration to the United State. The questions about foreign students and their choice to come to US may be useful to propose services.
How many students arrived in US in April?
Which Airline bring the most student in April?
What are the top city to arrive in the USA?
Where are from?
what are the student profils (age, country born, country indicators)?

In this project, the data is transforming, cleansing, staging and load into a datawarehouse.

I use Python, PySpark, Docker

Describe and Gather Data

Data dictionnary provide informations about dataset and tables used.

Data Source

Data	File	Data Source
I94 Immigration data	18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat	US National Tourism and Trade Office
Airports data	airport-codes_csv.csv	provide by Udacity
Global Airports	airports-extended.csv	OpenFlights.org and user contributions
Global Land Temperature	GlobalLandTemperaturesByCity.csv	Berkeley Earth
I94 Description Labels Description	I94_SAS_Labels_Descriptions.SAS	US National Tourism and Trade Office
US Cities Demographic	us-cities-demographics.csv	provide by Udacity
Indicators developpment	WDIData.csv	Kaggle
Iso country	wikipedia-iso-country-codes.csv	Kaggle

Click for details about I94 Immigration data .

#### I94 Immigration data Description: file : 18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat Each line correspond to a record of I-94 Form from the U.S. immigration officers. It's provide information about Arrival/Departure to foreign visitors. Some explanation about the [Visitor Arrivals Program (I-94 Form)](https://travel.trade.gov/research/programs/i94/description.asp).

Dataset information: There is a file per month for 2016, storage format is sas7bdat. These records are described according to 28 variables.
A small description is provided here
I keep this variables for this project( df_immigration ):

Column Name	Description	Example	Type
cicid	ID uniq per record in the dataset	4.08e+06	float64
i94yr	4 digit year	2016.0	float64
i94mon	Numeric month	4.0	float64
i94cit	3 digit code of source city for immigration (Born country)	209.0	float64
i94res	3 digit code of source country for immigration	209.0	float64
i94port	Port addmitted through	HHW	object
arrdate	Arrival date in the USA	20566.0	float64
i94mode	Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)	1.0	float
i94addr	State of arrival	HI	object
i94bir	Age in years	61.0	float
i94visa	Visa Code - 1 = Business / 2 = Pleasure / 3 = Student	2.0	float
dtadfile	Date Field in I94 files	20160422	int 64
gender	Gender	M	object
visatype	Class of admission legally admitting the non-immigrant to temporarily stay in U.S.	WT	object
airline	Airline used to arrive in U.S.	MU	Object

Click for details about Airports Data.

#### Airports Data Description
file : airport-codes_csv.csv
The airport code refers to the IATA airport code, 3 letters code unique for all airports in the world. It's a code used in passenger reservation, ticket and baggage-handling too.     
Dataset information: The airport-codes_csv.csv provides informations about aiports and have 12 variables. A small description is provided [here](./2_Data_dictionnary.ipynb#airportscode). I keep this variables for this project ( _df_airport_code_ ):



Column Name
Description
Example
Type




ident
Unique identifier Airport code
00AK
object


type
Type of airport
small_airport
object


name
Name of the airport
Lowell Field
object


iso_country
ISO code of airport country
US
object


iso_region
ISO code of the region airport
US-KS
object


municipality
City name where the airport is located
Anchor Point
object


iata_code
IATA code of the airport

object

Click for details about Global Airports Data.

Global Airports Data

file : airports-extended.csv
This is a database of airports, train stations, and ferry terminals around the world. Some of the data come from public sources and some of it comes from OpenFlights.org user contributions.

Dataset information: A small description is provided here. I give name and keep this variables ( df_global_airports ):



Column Name
Description
Example
Type




airport_ID
Id in the table
1
Int


airport_name
Name of airport
Nadzab Airport
Object


airport_city
Main city served by airport
Nadzab
Object


airport_country
Country or territory where airport is located
Papua New Guinea
Object


airport_iata
3-letter IATA code
LAE
Object

Click for details about Global Land Temperature Data.

####  Global Land Temperature Data  Description
file : GlobalLandTemperaturesByCity.csv
The Berkeley Earth Surface Temperature Study provide climate information. Each line correspond to a record of temperature per day from city around the world.     
Dataset information: the GlobalLandTemperaturesByCity.csv has 7 variables. A small description is provided [here](./2_Data_dictionnary.ipynb#temperature). I keep this variables for this project ( _df_temperature_ ):



Column Name
Description
Example
Type




dt
Date format YYYY-MM-DD
1743-11-01
object


AverageTemperature
Average Temperature for the city to th date dt
6.07
float64


City
City name
Århus
object


Country
Country name
Denmark
object

Click for details about I94 Description Labels.

I94 Description Labels Description

file : I94_SAS_Labels_Descriptions.SAS The I94_SAS_Labels_Description.SAS file is provide to add explanations about code used in data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat. I parse this file, save the result in 5 .csv files.

i94visa Data * i94country and i94residence Data * i94port Data * i94mode Data * i94addr

A small description is provided here

Click for details about US cities Demographics data.

US cities Demographics

file: us-cities-demographics.csv This dataset contains information about the demographics of all US cities and come from the US Census Bureau.
Dataset information: A small description is provided here. This dataset contains 12 variables and provides simple informations about us state population. I keep this variables for this project ( df_demograph ):

Column Name	Description	Example	Type
City	Name of the city	Silver Spring	Object
State	US state of the city	Maryland	Object
Median Age	The median of the age of the population	33.8	Float64
Male Population	Number of the male population	40601.0	Float64
Female Population	Number of the female population	41862.0	Float64
Total Population	Number of the total population	82463	Float64
Foreign-born	Number of residents of the city that were not born in the city	30908.0	Float64
State Code	Code of the state of the city	MD	Object
Race	Race class	Hispanic or Latino	Object
Count	Number of individual of each race	25924	Int64

Click for details about World Development Indicators data.

#### World Development Indicators
file : WDIData.csv
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.   
Dataset information: This dataset contains 64 variables with economics context , most of which are variables per year(1960 to 2018).
A small description is provided [here](./2_Data_dictionnary.ipynb#indicators).
I keep this variables for this project ( _df_indicator_dev_ ):



Column Name
Description
Example
Type




Country Name
Name of the country
Arab World
Object


Country Code
3 letters code of country
ARB
Object


Indicator Name
indicators of economic development
2005 PPP conversion factor, GDP (LCU per inter...
Object


Indicator Code
letters indicator code
PA.NUS.PPP.05
Object


1960 ...2018
one column per year since 1960
2018
Float64

Click for details about Iso country data.

#### Iso country
file: wikipedia-iso-country-codes.csv
This is a database about the different code useful to identify country.        
Datasset information: A small description is provided [here](./2_Data_dictionnary.ipynb#isocountry). This table gives us informations about Country codes used to identify each country and contains 4 variables. I keep this variables for this project ( _df_iso_country_ ):



Column Name
Description
Example
Type




English short name lower case
Country Name in English
Wallis and Futuna
Object


Alpha-2 code
code 2 letter code for the country
WF
Object


Alpha-3 code
code 3 letter code for the country
WLF
Object


Numeric code
ISO 3166-2 code
876
Int

Step 2: Explore and Assess the Data

Explore the Data

Data Source

Data dictionnary provides informations about dataset and tables used. This notebook performs a first exploration with Python and explain the datasets, which variables I kept.

Dataset	File	Data Source	Dataframe Name
I94 Immigration	immigration_data_sample.csv	US National Tourism and Trade Office	df_immigration
I94 Description Labels Description	I94_SAS_Labels_Descriptions.SAS	US National Tourism and Trade Office
Global Land Temperature	GlobalLandTemperaturesByCity.csv	Berkeley Earth	df_temperature
Global Airports	airports-extended.csv	OpenFlights.org and user contributions	df_global_airports
Airports codes	airport-codes_csv.csv	provide by Udacity	df_airport_code
Iso country	wikipedia-iso-country-codes.csv	Wikipedia	df_iso_country
US Cities Demographic	us-cities-demographics.csv	provide by Udacity	df_demograph
Indicators developpment	WDIData.csv	World Bank	df_indicator_dev

Click here to deploy the whole resume of the dataset.

##### I94 Immigration Data
* Source: https://travel.trade.gov/research/reports/historical/2016.html
* data 'data/18-83510-I94-Data-2016', provide one file per month
* These records are described according to 28 variables and 3M rows per file
* It's provide information about Arrival/Departure to foreign visitors
* I94_SAS_Labels_Description.SAS for variable descriptions
Global Land Temperature Data

Source: http://berkeleyearth.org/

data 'GlobalLandTemperaturesByCity.csv' provide climate information

Each line correspond to a record of temperature per day from city around the world.
The GlobalLandTemperaturesByCity.csv has 7 variables and 8599213 rows.

Global Airports Data

Source: https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals

data 'airports-extended.csv'. Some of the data come from public sources and some of it comes from OpenFlights.org user contributions.

It's provide informatioms about of airports, train stations, and ferry terminals around the world.
There are 4 variables in 'airports-extended.csv'and 10668 rows

Airports Data Description Data

Source: https://datahub.io/core/airport-codes#data

airport-codes_csv.csv. The airport code refers to the IATA airport code, 3 letters code unique for all airports in the world

The airport-codes_csv.csv provides informations about aiports.
There are 55075 rows and 12 columns in airport-codes_csv.csv.

Iso country Data

Source: https://gist.github.com/radcliff/f09c0f88344a7fcef373

data 'wikipedia-iso-country-codes.csv'. This is a database about the different code useful to identify country.

This table gives us informations about Country codes used to identify each country
There are 4 variables and 247 rows.

US cities Demographics Data

Source: https://data.census.gov/cedsci/.

data 'us-cities-demographics.csv'. This dataset contains information about the demographics of all US cities and come from the US Census Bureau.

Provides simple informations about US State population
Contains 12 variables and 2892 rows

World Development Indicators Data

Source: https://www.kaggle.com/xavier14/wdidata

data 'WDIData.csv'. The primary World Bank collection of development indicators, compiled from officially-recognized international sources.

It presents the most current and accurate global development data available, and includes national, regional and global estimates.
Contains 64 variables, most of which are variables per year(1960 to 2018), with economics context and 422137 rows.

i94addr Data