Commit a8a7b39

first commit

anthelix committed Feb 21, 2020
Showing 9 changed files with 341 additions and 0 deletions.
7 changes: 7 additions & 0 deletions Data_Warehouse_Project_Template/README.md
@@ -0,0 +1,7 @@
##### Udacity Data Engineering Nanodegree


# Project 3: Data Warehouse

<img align="right" width="150" height="150" src="../image/aws_logo.png" title="aws logo" alt="aws logo">

144 changes: 144 additions & 0 deletions README.md
@@ -0,0 +1,144 @@
##### Udacity Data Engineering Nanodegree

<img alt="" align="right" width="150" height="150" src = "./image/aws_logo.png" title = "aws logo" alt = "aws logo">
</br>
</br>
</br>

# Project 3: Data Warehouse

An ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.

### Todo
[Udacity: Project Instructions](https://classroom.udacity.com/nanodegrees/nd027/parts/69a25b76-3ebd-4b72-b7cb-03d82da12844/modules/58ff61b9-a54f-496d-b4c7-fa22750f6c76/lessons/b3ce1791-9545-4187-b1fc-1e29cc81f2b0/concepts/14843ffe-212c-464a-b4b6-3f0db421aa32)
* Set up an Anaconda environment (python3, psycopg2, configparser, sql_queries) to work in Spyder?
* Check whether the files are the same as in project 1
* Follow the workflow in the project instructions.

[Redshift Create Table Docs.](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html)

Note:
The SERIAL command in Postgres is not supported in Redshift. The equivalent in Redshift is IDENTITY(0,1), which you can read more about in the Redshift Create Table Docs linked above (a minimal sketch follows).
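A minimal sketch in the style of `sql_queries.py`; the table and column names here are made up for illustration only:

```python
# Minimal sketch, assuming the Redshift dialect: IDENTITY(seed, step) replaces Postgres SERIAL.
example_table_create = ("""
    CREATE TABLE IF NOT EXISTS example
    (
        example_id  int IDENTITY(0,1),  -- auto-incrementing surrogate key (SERIAL equivalent)
        payload     varchar
    );
""")
```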



### Table of contents

- [About the project](#about-the-project)
- [Purpose](#purpose)
- [Getting started](#getting-started)
- [Resources](#resources)
- [Dataset](#dataset)
- [Tools and Files](#tools-and-files)
- [Workflow](#workflow)
- [Modeling](#modeling)
- [Build ETL pipeline](#build-etl-pipeline)
- [Workspace](#workspace)
- [My environments](#my-environments)
- [Discussion of the database](#discussion-of-the-database)
- [UML diagram](#uml-diagram)
- [Chebotko diagram](#chebotko-diagram)
- [Queries](#queries)
- [Web links](#web-links)
---

## About the project

A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They'd like a data engineer to build an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for their analytics team to continue finding insights in what songs their users are listening to.
* Load data from S3 into staging tables on Redshift
* Execute SQL statements that create the analytics tables from these staging tables

## Purpose

The purpose of this project is to apply data warehousing concepts: build an ETL pipeline for a database hosted on Redshift using Infrastructure as Code (IaC).

## Getting started


## Resources

### Dataset

The two datasets reside in S3:

* Song data: `s3://udacity-dend/song_data`
* Log data: `s3://udacity-dend/log_data`

Log data json path: `s3://udacity-dend/log_json_path.json`

##### Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```
##### Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like.
![log dataset image](./image/log_dataset.png)

### Tools and Files

In addition to the data files, the project template includes four files:
* `create_tables.py` is where you'll create your fact and dimension tables for the star schema in Redshift.
* `etl.py` is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift.
* `sql_queries.py` is where you'll define your SQL statements, which will be imported into the two other files above.
* `README.md` is where you'll provide discussion on your process and decisions for this ETL pipeline.

## Workflow

1. **Create Table Schemas**

* Design schemas for the fact and dimension tables
* Write a SQL `CREATE` statement for each table in `sql_queries.py`
* Complete `create_tables.py` to connect to the database and create these tables
* Write SQL `DROP` statements
* Launch a Redshift cluster (see the IaC sketch after this list)
* Create an IAM role with read access to S3
* Add the Redshift database and IAM role info to `dwh.cfg`
* Test by running `create_tables.py`
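
A hedged sketch of this IaC step using boto3 is below. The `[AWS]` key/secret section, the role and cluster identifiers (`dwhRole`, `dwhCluster`), the region, and the node configuration are illustrative assumptions and are not part of `dwh.cfg` at this commit.

```python
import configparser
import json

import boto3

config = configparser.ConfigParser()
config.read('dwh.cfg')

# Assumed [AWS] section with KEY and SECRET; not present in the committed dwh.cfg
iam = boto3.client('iam', region_name='us-west-2',
                   aws_access_key_id=config['AWS']['KEY'],
                   aws_secret_access_key=config['AWS']['SECRET'])
redshift = boto3.client('redshift', region_name='us-west-2',
                        aws_access_key_id=config['AWS']['KEY'],
                        aws_secret_access_key=config['AWS']['SECRET'])

# IAM role that lets Redshift read from S3 (role name is an example)
iam.create_role(
    RoleName='dwhRole',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{'Effect': 'Allow',
                       'Principal': {'Service': 'redshift.amazonaws.com'},
                       'Action': 'sts:AssumeRole'}]}))
iam.attach_role_policy(RoleName='dwhRole',
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')
role_arn = iam.get_role(RoleName='dwhRole')['Role']['Arn']

# Launch the cluster; afterwards copy its endpoint and the role ARN into dwh.cfg
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName=config['CLUSTER']['DB_NAME'],
    ClusterIdentifier='dwhCluster',
    MasterUsername=config['CLUSTER']['DB_USER'],
    MasterUserPassword=config['CLUSTER']['DB_PASSWORD'],
    IamRoles=[role_arn])
```

Deleting the cluster at the end of the project would use `redshift.delete_cluster(ClusterIdentifier='dwhCluster', SkipFinalClusterSnapshot=True)`.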

2. **Build ETL Pipeline**

* Implement the load of data from S3 into the staging tables on Redshift (a COPY sketch follows this list)
* Implement the load of data from the staging tables into the analytics tables on Redshift
* Test
* Delete the Redshift cluster when finished
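
A rough illustration of what the staging load could look like. The COPY templates in `sql_queries.py` are still empty placeholders at this commit; the region is an assumption, and the quoting relies on the `dwh.cfg` values already carrying single quotes.

```python
import configparser

config = configparser.ConfigParser()
config.read('dwh.cfg')

# dwh.cfg stores the S3 paths and IAM role ARN already wrapped in single quotes,
# so they can be substituted into the COPY statements as SQL string literals.
staging_events_copy = ("""
    COPY staging_events
    FROM {}
    IAM_ROLE {}
    REGION 'us-west-2'
    FORMAT AS JSON {};
""").format(config['S3']['LOG_DATA'],
            config['IAM_ROLE']['ARN'],
            config['S3']['LOG_JSONPATH'])

staging_songs_copy = ("""
    COPY staging_songs
    FROM {}
    IAM_ROLE {}
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
""").format(config['S3']['SONG_DATA'],
            config['IAM_ROLE']['ARN'])
```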

3. **Document Process**

* Discuss the purpose of this database in the context of the startup, Sparkify, and their analytical goals.
* State and justify the database schema design and ETL pipeline.
* Provide example queries and results for song play analysis (an example is sketched after this list).
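
For instance, a hypothetical song-play analysis query over the tables defined in `sql_queries.py` (column names follow the `songplay` and `song` definitions in this commit):

```python
# Top 10 most played songs; illustrative only, the analytics tables are not yet loaded
most_played_songs = ("""
    SELECT s.title, COUNT(*) AS play_count
    FROM songplay sp
    JOIN song s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY play_count DESC
    LIMIT 10;
""")
```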

### Modeling

### Build ETL pipeline

## Workspace

### My environments

### Discussion of the database

### UML diagram

### Chebotko diagram

### Queries

### Web links

32 changes: 32 additions & 0 deletions create_tables.py
@@ -0,0 +1,32 @@
import configparser
import psycopg2
from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    """Drop each table listed in drop_table_queries."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """Create each table listed in create_table_queries."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read the cluster connection settings from dwh.cfg
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
    cur = conn.cursor()

    # Drop any existing tables, then recreate them from scratch
    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
Binary file added data-warehouse-project-template.zip
Binary file not shown.
14 changes: 14 additions & 0 deletions dwh.cfg
@@ -0,0 +1,14 @@
[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=

[IAM_ROLE]
ARN=''

[S3]
LOG_DATA='s3://udacity-dend/log-data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song-data'
32 changes: 32 additions & 0 deletions etl.py
@@ -0,0 +1,32 @@
import configparser
import psycopg2
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    """Copy the raw JSON data from S3 into the staging tables."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Insert data from the staging tables into the analytics tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read the cluster connection settings from dwh.cfg
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
    cur = conn.cursor()

    # Load the staging tables, then populate the analytics tables
    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
Binary file added image/aws_logo.png
Binary file added image/log_dataset.png
112 changes: 112 additions & 0 deletions sql_queries.py
@@ -0,0 +1,112 @@
import configparser


# CONFIG
config = configparser.ConfigParser()
config.read('dwh.cfg')

# DROP TABLES

staging_events_table_drop = "DROP TABLE IF EXISTS staging_events"
staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs"
songplay_table_drop = "DROP TABLE IF EXISTS songplay"
user_table_drop = "DROP TABLE IF EXISTS users"  # "user" is a reserved word in Redshift, so the dimension table is named users
song_table_drop = "DROP TABLE IF EXISTS song"
artist_table_drop = "DROP TABLE IF EXISTS artist"
time_table_drop = "DROP TABLE IF EXISTS time"

# CREATE TABLES

staging_events_table_create = ("""CREATE TABLE IF NOT EXISTS staging_events
""")

staging_songs_table_create = ("""CREATE TABLE IF NOT EXISTS staging_songs
""")

## Fact Table
songplay_table_create = ("""CREATE TABLE IF NOT EXISTS songplay
    (
        songplay_id   int IDENTITY(0,1),
        start_time    bigint NOT NULL,
        user_id       bigint NOT NULL,
        level         varchar NOT NULL,
        song_id       varchar,
        artist_id     varchar,
        session_id    bigint NOT NULL,
        location      varchar,
        user_agent    varchar
    )
""")

## Dimension Tables
user_table_create = ("""CREATE TABLE IF NOT EXISTS users
    (
        user_id       bigint NOT NULL,
        first_name    varchar,
        last_name     varchar,
        gender        varchar(1),
        level         varchar NOT NULL
    )
""")

song_table_create = ("""CREATE TABLE IF NOT EXISTS song
    (
        song_id       varchar NOT NULL,
        title         varchar NOT NULL,
        artist_id     varchar NOT NULL,
        year          int,
        duration      numeric
    )
""")

artist_table_create = ("""CREATE TABLE IF NOT EXISTS artist
    (
        artist_id     varchar NOT NULL,
        name          varchar NOT NULL,
        location      varchar,
        latitude      numeric,
        longitude     numeric
    )
""")

time_table_create = ("""CREATE TABLE IF NOT EXISTS time
    (
        start_time    timestamp NOT NULL,
        hour          int,
        day           int,
        week          int,
        month         int,
        year          int,
        weekday       varchar
    )
""")

# STAGING TABLES

staging_events_copy = ("""
""").format()

staging_songs_copy = ("""
""").format()

# FINAL TABLES

songplay_table_insert = ("""
""")

user_table_insert = ("""
""")

song_table_insert = ("""
""")

artist_table_insert = ("""
""")

time_table_insert = ("""
""")

# QUERY LISTS

create_table_queries = [staging_events_table_create, staging_songs_table_create, songplay_table_create, user_table_create, song_table_create, artist_table_create, time_table_create]
drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop]
copy_table_queries = [staging_events_copy, staging_songs_copy]
insert_table_queries = [songplay_table_insert, user_table_insert, song_table_insert, artist_table_insert, time_table_insert]
