This repository has been archived by the owner on May 2, 2022. It is now read-only.

Data Systems for Software Engineers: Final Project


KrisTheCanadian/SOEN-363-Project


SOEN-363-Project

SOEN-363: Final Project

Project Information 🚀

Phase 1 Information (PDF)

Phase 2 Information (PDF)

Presentation Slides (Canva) 🖼

Datasets 📙

SQL Dataset 1 (Spotify)

SQL Dataset 2 (Spotify)

NoSQL Dataset (Reddit)

Importing the Data 📁

  • Using DataGrip & Python scripts (cleaning and importing data)

A useful script for uploading data to Elasticsearch can be found here
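
As a rough illustration of what such an upload involves (this is not the linked script), the Elasticsearch `_bulk` endpoint expects newline-delimited JSON: one action line followed by one document line, with a trailing newline. The sketch below only builds that payload. The function name `build_bulk_payload`, the index name `reddit_comments`, and the sample fields are assumptions made for the example.

```python
import json


def build_bulk_payload(docs, index_name):
    """Serialize documents into the NDJSON body used by the _bulk endpoint.

    Each document becomes two lines: an "index" action line and the
    document source itself. The body must end with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical sample comments; the real dataset's field names may differ.
comments = [
    {"author": "alice", "body": "postgres is great", "score": 17},
    {"author": "bob", "body": "sorry, I disagree", "score": 3},
]

payload = build_bulk_payload(comments, "reddit_comments")
print(payload)
```

The resulting string can be sent as the request body of a POST to `/_bulk` with the `Content-Type: application/x-ndjson` header. For a large dataset, sending the documents in batches keeps each request small.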

Team Members 💪 🎉 🔥


Queries 🖥️

Phase 1 ✅

SQL Queries

Queries implementation: here

Phase 1 Queries (PDF)

Phase 2 ✅

SQL Queries

ER Diagram

Queries implementation: here

  1. What are all Led Zeppelin song names in rock_Music_data, and on which days do they end up on the popularDataSet in 2018?
  2. What are all the playlists that those Led Zeppelin songs feature in?
  3. What were the most popular songs (songs listed in the top 3) of the month of January of 2019 in Canada? Order by popularity and limit output to 10.
  4. What is the largest popularity gap in rock_Music (lowest popularity, highest popularity)?
  5. Which songs have the most genres (limit to 10 results)?
  6. Which band shows up the most often in Alternative Music Data? Which of their songs appear in the Popular Dataset? Include artist name, title, date and country.
  7. Which artists whose names start with the letter S appear in both the indie and alternative music data?
  8. What are some good club songs (danceability > 0.8) listed as pop whose artists also make music categorized as blues? Return the pop song and the blues song with their respective artist.
  9. From the most popular alternative playlist, list in increasing order the songs longer than 5 minutes.
  10. How many pop songs released in 2020 that are in the top 20 have a tempo greater than 120?
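
To give a sense of the shape of these queries, here is a minimal sketch of question 4 (the popularity gap). It is not the project's implementation: it uses an in-memory SQLite database so it runs standalone, and the table name `rock_music`, its columns, and the sample rows are all made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the project's database here.
conn = sqlite3.connect(":memory:")

# Hypothetical schema; the real rock music table may have different columns.
conn.execute(
    "CREATE TABLE rock_music (name TEXT, artist TEXT, popularity INTEGER)"
)
conn.executemany(
    "INSERT INTO rock_music VALUES (?, ?, ?)",
    [
        ("Song A", "Artist X", 12),
        ("Song B", "Artist Y", 87),
        ("Song C", "Artist Z", 64),
    ],
)

# Lowest popularity, highest popularity, and the gap between them.
lowest, highest, gap = conn.execute(
    """
    SELECT MIN(popularity),
           MAX(popularity),
           MAX(popularity) - MIN(popularity)
    FROM rock_music
    """
).fetchone()

print(f"lowest={lowest} highest={highest} gap={gap}")
conn.close()
```

The same `SELECT` statement uses only standard aggregate functions, so it should carry over to other SQL engines once the table and column names match the actual schema.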

Elasticsearch Queries

Queries implementation: here

  1. What are the top 10 most upvoted comments of all time? Print the comment and the score in an ordered list.
  2. How many of the comments listed as controversial are also listed as an edited comment?
  3. Show all the controversial comments that were made at night (after 10 pm) and state how many there are.
  4. What is the percentage of comments that contain the word "sorry" and are also replies to another comment?
  5. Who were the top 3 users that commented the most in 2006? How many comments did they make and what was their top commented subreddit?
  6. Find all comments about postgres. Display the number of comments that have a score between 15 and 30. Display the top comment and the lowest comment in that range.
  7. Display the number of comments for every subreddit and the top comment score. Order them by popularity.
  8. Query every comment between September 2007 and December 2007 that either has the word ‘sql’ or ‘nosql’ in the comment. Only include comments which have a score greater than 0. Print the number of comments and print the first 10 results (sorted by score).
  9. Find the top comment in January 2007, print it and also display the number of replies this comment got in total.
  10. Find all comments from 2006 that mention at least 2 of the following words: sql, database, programming, software. State the number of comments.
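
As an illustration, question 8 could be expressed as an Elasticsearch query body along the following lines. This is a sketch under assumptions, not the project's query: the field names `body`, `score`, and `created_utc` are guesses at the Reddit comment mapping, and `created_utc` is assumed to be mapped as a date field. The helper name `build_sql_comment_query` is hypothetical.

```python
import json


def build_sql_comment_query(size=10):
    """Build a search body for comments mentioning 'sql' or 'nosql'
    between September 2007 and December 2007 with a positive score."""
    return {
        "size": size,                    # return only the first 10 hits
        "track_total_hits": True,        # exact total count (Elasticsearch 7+)
        "sort": [{"score": {"order": "desc"}}],
        "query": {
            "bool": {
                # A match query combines its terms with OR by default,
                # so a comment with either word qualifies.
                "must": [{"match": {"body": "sql nosql"}}],
                "filter": [
                    {
                        "range": {
                            "created_utc": {
                                "gte": "2007-09-01",
                                "lt": "2008-01-01",
                            }
                        }
                    },
                    {"range": {"score": {"gt": 0}}},
                ],
            }
        },
    }


query_body = build_sql_comment_query()
print(json.dumps(query_body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the comments index. The range conditions sit in `filter` rather than `must` because they are yes/no constraints that do not need to contribute to relevance scoring.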

License 📝

This repository is available under the MIT LICENSE.
