Commit a8a7b39

first commit

anthelix committed Feb 21, 2020
Showing 9 changed files with 341 additions and 0 deletions.
7 changes: 7 additions & 0 deletions Data_Warehouse_Project_Template/README.md
@@ -0,0 +1,7 @@
##### Udacity Data Engineering Nanodegree


# Project 3: Data Warehouse

<img align="right" width="150" height="150" src="../image/aws_logo.png" title="aws logo" alt="aws logo">

144 changes: 144 additions & 0 deletions README.md
@@ -0,0 +1,144 @@
##### Udacity Data Engineering Nanodegree

<img alt="" align="right" width="150" height="150" src = "./image/aws_logo.png" title = "aws logo" alt = "aws logo">
</br>
</br>
</br>

# Project 3: Data Warehouse

An ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.

### Todo
[Udacity: Project Instructions](https://classroom.udacity.com/nanodegrees/nd027/parts/69a25b76-3ebd-4b72-b7cb-03d82da12844/modules/58ff61b9-a54f-496d-b4c7-fa22750f6c76/lessons/b3ce1791-9545-4187-b1fc-1e29cc81f2b0/concepts/14843ffe-212c-464a-b4b6-3f0db421aa32)
* Set up an Anaconda environment (python3, psycopg2, configparser, sql_queries) to work in Spyder?
* Check whether the files are the same as in project 1
* Follow the workflow in the project instructions.

[Redshift Create Table Docs.](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html)

Note:
The SERIAL command in Postgres is not supported in Redshift. The equivalent in Redshift is IDENTITY(0,1), which you can read more about in the Redshift Create Table Docs linked above (a minimal sketch follows).
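A minimal sketch in the style of `sql_queries.py`; the table and column names here are made up for illustration only:

```python
# Minimal sketch, assuming the Redshift dialect: IDENTITY(seed, step) replaces Postgres SERIAL.
example_table_create = ("""
    CREATE TABLE IF NOT EXISTS example
    (
        example_id  int IDENTITY(0,1),  -- auto-incrementing surrogate key (SERIAL equivalent)
        payload     varchar
    );
""")
```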



### Table of contents

- [About the project](#about-the-project)
- [Purpose](#purpose)
- [Getting started](#getting-started)
- [Resources](#resources)
- [Dataset](#dataset)
- [Tools and Files](#tools-and-files)
- [Workflow](#workflow)
- [Modeling](#modeling)
- [Build ETL pipeline](#build-etl-pipeline)
- [Workspace](#workspace)
- [My environments](#my-environments)
- [Discussion of the database](#discussion-of-the-database)
- [UML diagram](#uml-diagram)
- [Chebotko diagram](#chebotko-diagram)
- [Queries](#queries)
- [Web links](#web-links)
---

## About the project

A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They'd like a data engineer to build an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for their analytics team to continue finding insights in what songs their users are listening to.
* Load data from S3 into staging tables on Redshift
* Execute SQL statements that create the analytics tables from these staging tables

## Purpose

The purpose of this project is to apply data warehousing concepts: build an ETL pipeline for a database hosted on Redshift using Infrastructure as Code (IaC).

## Getting started


## Resources

### Dataset

The two datasets reside in S3:

* Song data: `s3://udacity-dend/song_data`
* Log data: `s3://udacity-dend/log_data`

Log data json path: `s3://udacity-dend/log_json_path.json`

##### Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```
##### Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like.
![log dataset image](./image/log_dataset.png)

### Tools and Files

In addition to the data files, the project template includes four files:
* `create_tables.py` is where you'll create your fact and dimension tables for the star schema in Redshift.
* `etl.py` is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift.
* `sql_queries.py` is where you'll define your SQL statements, which will be imported into the two other files above.
* `README.md` is where you'll provide discussion on your process and decisions for this ETL pipeline.

## Workflow

1. **Create Table Schemas**

* Design schemas for the fact and dimension tables
* Write a SQL `CREATE` statement for each table in `sql_queries.py`
* Complete `create_tables.py` to connect to the database and create these tables
* Write SQL `DROP` statements
* Launch a Redshift cluster (see the IaC sketch after this list)
* Create an IAM role with read access to S3
* Add the Redshift database and IAM role info to `dwh.cfg`
* Test by running `create_tables.py`
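
A hedged sketch of this IaC step using boto3 is below. The `[AWS]` key/secret section, the role and cluster identifiers (`dwhRole`, `dwhCluster`), the region, and the node configuration are illustrative assumptions and are not part of `dwh.cfg` at this commit.

```python
import configparser
import json

import boto3

config = configparser.ConfigParser()
config.read('dwh.cfg')

# Assumed [AWS] section with KEY and SECRET; not present in the committed dwh.cfg
iam = boto3.client('iam', region_name='us-west-2',
                   aws_access_key_id=config['AWS']['KEY'],
                   aws_secret_access_key=config['AWS']['SECRET'])
redshift = boto3.client('redshift', region_name='us-west-2',
                        aws_access_key_id=config['AWS']['KEY'],
                        aws_secret_access_key=config['AWS']['SECRET'])

# IAM role that lets Redshift read from S3 (role name is an example)
iam.create_role(
    RoleName='dwhRole',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{'Effect': 'Allow',
                       'Principal': {'Service': 'redshift.amazonaws.com'},
                       'Action': 'sts:AssumeRole'}]}))
iam.attach_role_policy(RoleName='dwhRole',
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')
role_arn = iam.get_role(RoleName='dwhRole')['Role']['Arn']

# Launch the cluster; afterwards copy its endpoint and the role ARN into dwh.cfg
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName=config['CLUSTER']['DB_NAME'],
    ClusterIdentifier='dwhCluster',
    MasterUsername=config['CLUSTER']['DB_USER'],
    MasterUserPassword=config['CLUSTER']['DB_PASSWORD'],
    IamRoles=[role_arn])
```

Deleting the cluster at the end of the project would use `redshift.delete_cluster(ClusterIdentifier='dwhCluster', SkipFinalClusterSnapshot=True)`.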

2. **Build ETL Pipeline**

* Implement the load of data from S3 into the staging tables on Redshift (a COPY sketch follows this list)
* Implement the load of data from the staging tables into the analytics tables on Redshift
* Test
* Delete the Redshift cluster when finished
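
A rough illustration of what the staging load could look like. The COPY templates in `sql_queries.py` are still empty placeholders at this commit; the region is an assumption, and the quoting relies on the `dwh.cfg` values already carrying single quotes.

```python
import configparser

config = configparser.ConfigParser()
config.read('dwh.cfg')

# dwh.cfg stores the S3 paths and IAM role ARN already wrapped in single quotes,
# so they can be substituted into the COPY statements as SQL string literals.
staging_events_copy = ("""
    COPY staging_events
    FROM {}
    IAM_ROLE {}
    REGION 'us-west-2'
    FORMAT AS JSON {};
""").format(config['S3']['LOG_DATA'],
            config['IAM_ROLE']['ARN'],
            config['S3']['LOG_JSONPATH'])

staging_songs_copy = ("""
    COPY staging_songs
    FROM {}
    IAM_ROLE {}
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
""").format(config['S3']['SONG_DATA'],
            config['IAM_ROLE']['ARN'])
```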

3. **Document Process**

* Discuss the purpose of this database in the context of the startup, Sparkify, and their analytical goals.
* State and justify the database schema design and ETL pipeline.
* Provide example queries and results for song play analysis (an example is sketched after this list).
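
For instance, a hypothetical song-play analysis query over the tables defined in `sql_queries.py` (column names follow the `songplay` and `song` definitions in this commit):

```python
# Top 10 most played songs; illustrative only, the analytics tables are not yet loaded
most_played_songs = ("""
    SELECT s.title, COUNT(*) AS play_count
    FROM songplay sp
    JOIN song s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY play_count DESC
    LIMIT 10;
""")
```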

### Modeling

### Build ETL pipeline

## Workspace

### My environments

### Discussion of the database

### UML diagram

### Chebotko diagram

### Queries

### Web links

32 changes: 32 additions & 0 deletions create_tables.py
@@ -0,0 +1,32 @@
import configparser
import psycopg2
from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    """Drop each table listed in drop_table_queries."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """Create each table listed in create_table_queries."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read the cluster connection settings from dwh.cfg
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
    cur = conn.cursor()

    # Drop any existing tables, then recreate them from scratch
    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
Binary file added data-warehouse-project-template.zip
Binary file not shown.
14 changes: 14 additions & 0 deletions dwh.cfg
@@ -0,0 +1,14 @@
[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=

[IAM_ROLE]
ARN=''

[S3]
LOG_DATA='s3://udacity-dend/log-data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song-data'
32 changes: 32 additions & 0 deletions etl.py
@@ -0,0 +1,32 @@
import configparser
import psycopg2
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    """Copy the raw JSON data from S3 into the staging tables."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Insert data from the staging tables into the analytics tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read the cluster connection settings from dwh.cfg
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect("host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values()))
    cur = conn.cursor()

    # Load the staging tables, then populate the analytics tables
    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
Binary file added image/aws_logo.png
Binary file added image/log_dataset.png
112 changes: 112 additions & 0 deletions sql_queries.py
@@ -0,0 +1,112 @@
import configparser


# CONFIG
config = configparser.ConfigParser()
config.read('dwh.cfg')

# DROP TABLES

staging_events_table_drop = "DROP TABLE IF EXISTS staging_events"
staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs"
songplay_table_drop = "DROP TABLE IF EXISTS songplay"
user_table_drop = "DROP TABLE IF EXISTS users"  # "user" is a reserved word in Redshift, so the dimension table is named users
song_table_drop = "DROP TABLE IF EXISTS song"
artist_table_drop = "DROP TABLE IF EXISTS artist"
time_table_drop = "DROP TABLE IF EXISTS time"

# CREATE TABLES

staging_events_table_create = ("""CREATE TABLE IF NOT EXISTS staging_events
""")

staging_songs_table_create = ("""CREATE TABLE IF NOT EXISTS staging_songs
""")

## Fact Table
songplay_table_create = ("""CREATE TABLE IF NOT EXISTS songplay
    (
        songplay_id   int IDENTITY(0,1),
        start_time    bigint NOT NULL,
        user_id       bigint NOT NULL,
        level         varchar NOT NULL,
        song_id       varchar,
        artist_id     varchar,
        session_id    bigint NOT NULL,
        location      varchar,
        user_agent    varchar
    )
""")

## Dimension Tables
user_table_create = ("""CREATE TABLE IF NOT EXISTS users
    (
        user_id       bigint NOT NULL,
        first_name    varchar,
        last_name     varchar,
        gender        varchar(1),
        level         varchar NOT NULL
    )
""")

song_table_create = ("""CREATE TABLE IF NOT EXISTS song
    (
        song_id       varchar NOT NULL,
        title         varchar NOT NULL,
        artist_id     varchar NOT NULL,
        year          int,
        duration      numeric
    )
""")

artist_table_create = ("""CREATE TABLE IF NOT EXISTS artist
    (
        artist_id     varchar NOT NULL,
        name          varchar NOT NULL,
        location      varchar,
        latitude      numeric,
        longitude     numeric
    )
""")

time_table_create = ("""CREATE TABLE IF NOT EXISTS time
    (
        start_time    timestamp NOT NULL,
        hour          int,
        day           int,
        week          int,
        month         int,
        year          int,
        weekday       varchar
    )
""")

# STAGING TABLES

staging_events_copy = ("""
""").format()

staging_songs_copy = ("""
""").format()

# FINAL TABLES

songplay_table_insert = ("""
""")

user_table_insert = ("""
""")

song_table_insert = ("""
""")

artist_table_insert = ("""
""")

time_table_insert = ("""
""")

# QUERY LISTS

create_table_queries = [staging_events_table_create, staging_songs_table_create, songplay_table_create, user_table_create, song_table_create, artist_table_create, time_table_create]
drop_table_queries = [staging_events_table_drop, staging_songs_table_drop, songplay_table_drop, user_table_drop, song_table_drop, artist_table_drop, time_table_drop]
copy_table_queries = [staging_events_copy, staging_songs_copy]
insert_table_queries = [songplay_table_insert, user_table_insert, song_table_insert, artist_table_insert, time_table_insert]
