Commit: Merge branch 'gh-pages' of https://github.com/dsc-courses/dsc80-2024-wi… into gh-pages

Showing 7 changed files with 537 additions and 5 deletions.
---
layout: page
title: League of Legends ⌨️
description: Description of the League of Legends dataset in Project 4.
parent: 'Project 4: The Data Science Lifecycle 📊'
nav_exclude: true
---

# League of Legends ⌨️
{:.no_toc}

## Table of Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

Welcome to Summoner's Rift! This dataset contains information on players and teams from over 10,000 League of Legends competitive matches.

{: .note }
You'll probably want to be at least a little bit familiar with [*League of Legends*](https://en.wikipedia.org/wiki/League_of_Legends) and its terminology to use this dataset. If not, one of the other datasets may be more interesting to you.

---

## Getting the Data

The data can be found on the website [Oracle's Elixir](https://oracleselixir.com/tools/downloads) at the provided Google Drive link.

We've verified that it's possible to satisfy the requirements of the project using match data from 2022. You're welcome to use newer or older datasets if you wish, but keep in mind that League of Legends changes significantly between years; this can make it difficult to combine or compare datasets from different years.

---

## Example Questions and Prediction Problems

Feel free to base your exploration of the dataset in Steps 1-4 around one of these questions, or come up with a question of your own.

- Looking at [tier-one professional leagues](https://en.wikipedia.org/wiki/List_of_League_of_Legends_leagues_and_tournaments), which league has the most "action-packed" games? Is the amount of "action" in this league significantly different than in other leagues? Note that you'll have to come up with a way of quantifying "action".
- Which competitive region has the highest win rate against teams outside its region? Note that you'll have to find and merge in region data, since the dataset doesn't include it.
- Which role "carries" (does the best) in their team more often: ADCs (bot laners) or mid laners?
- Is Talon (former tutor Costin's favorite champion) more likely to win or lose any given match?

Feel free to use one of the prompts below to build your predictive model in Steps 5-8, or come up with a prediction task of your own.

* Predict whether a team will win or lose a game.
* Predict which role (top lane, jungle, support, etc.) a player played given their post-game data.
* Predict how long a game will take before it happens.
* Predict which team will get the first Baron.

Make sure to justify what information you would know at the “time of prediction” and to only train your model using those features.

---

## Special Considerations

### Step 2: Data Cleaning and Exploratory Data Analysis

- Each `'gameid'` corresponds to up to 12 rows – one for each of the 5 players on both teams and 2 containing summary data for the two teams (try to find out what distinguishes those rows). After selecting your line of inquiry, make sure to remove either the player rows or the team rows so that they don't cause issues later in your analysis.
- Many columns should be of type `bool` but are not.
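
Both bullets can be sketched in `pandas` on a toy frame. Note two assumptions to verify against your actual download: that the team summary rows are marked by a `'position'` value of `'team'`, and that `'firstblood'` stands in for the many 0/1-coded columns that should be `bool`.

```python
import pandas as pd

# Toy stand-in for the match data. Here, 'position' is 'team' on the two
# team-summary rows (an assumption; check your actual download), and
# 'firstblood' is a 0/1-coded column that should really be boolean.
df = pd.DataFrame({
    'gameid': ['g1', 'g1', 'g1', 'g1'],
    'position': ['mid', 'bot', 'team', 'team'],
    'firstblood': [1.0, 0.0, 1.0, 0.0],
})

# Keep only the player rows (or only the team rows, depending on your
# line of inquiry).
players = df[df['position'] != 'team'].copy()

# Convert a 0/1-coded column to bool.
players['firstblood'] = players['firstblood'].astype(bool)
```

The same `.astype(bool)` pattern applies to any other 0/1-coded column you end up using.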
---
layout: page
title: Power Outages 🔋
description: Description of the power outages dataset in Project 4.
parent: 'Project 4: The Data Science Lifecycle 📊'
nav_exclude: true
---

# Power Outages 🔋
{:.no_toc}

## Table of Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

This dataset contains data on major power outages in the continental U.S. from January 2000 to July 2016.

---

## Getting the Data

The data is downloadable [here](https://engineering.purdue.edu/LASCI/research-data/outages/outagerisks).

***Note***: If you are having a hard time with the "This dataset" link, hold Shift and click the link to open it in a new tab, then refresh that new tab.

A data dictionary is available in this [article](https://www.sciencedirect.com/science/article/pii/S2352340918307182) under *Table 1. Variable descriptions*.

---

## Example Questions and Prediction Problems

Feel free to base your exploration of the dataset in Steps 1-4 around one of these questions, or come up with a question of your own.

- Where and when do major power outages tend to occur?
- What are the characteristics of major power outages with higher severity? Variables to consider include location, time, climate, land-use characteristics, electricity consumption patterns, economic characteristics, etc. What risk factors might an energy company want to look into when predicting the location and severity of its next major power outage?
- What characteristics are associated with each category of cause?
- How have characteristics of major power outages changed over time? Is there a clear trend?

Feel free to use one of the prompts below to build your predictive model in Steps 5-8, or come up with a prediction task of your own.

* Predict the severity (in terms of number of customers affected, duration, or demand loss) of a major power outage.
* Predict the cause of a major power outage.
* Predict the number and/or severity of major power outages in the year 2022.
* Predict the electricity consumption of an area.

Make sure to justify what information you would know at the “time of prediction” and to only train your model using those features.

---

## Special Considerations

### Step 2: Data Cleaning and Exploratory Data Analysis

- The data is given as an Excel file rather than a CSV. Open the data in Google Sheets or another spreadsheet application and determine which rows and columns of the sheet should be ignored when loading the data in `pandas`.
  - Note that `pandas` can load multiple file types: `pd.read_csv`, `pd.read_excel`, `pd.read_html`, `pd.read_json`, etc.
- The power outage start date and time are given by `'OUTAGE.START.DATE'` and `'OUTAGE.START.TIME'`. It would be preferable if these two columns were combined into one `pd.Timestamp` column. Combine `'OUTAGE.START.DATE'` and `'OUTAGE.START.TIME'` into a new `pd.Timestamp` column called `'OUTAGE.START'`. Similarly, combine `'OUTAGE.RESTORATION.DATE'` and `'OUTAGE.RESTORATION.TIME'` into a new `pd.Timestamp` column called `'OUTAGE.RESTORATION'`.
  - `pd.to_datetime` and `pd.to_timedelta` will be useful here.
- To visualize geospatial data, consider [Folium](https://python-visualization.github.io/folium/) or another geospatial plotting library. You can even embed Folium maps in your website! If `fig` is a `folium.folium.Map` object, then `fig._repr_html_()` evaluates to a string containing your plot as HTML; use `open` and `write` to write this string to an `.html` file.
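
The date/time combination above can be sketched on a toy frame with the same column layout (the real `pd.read_excel` call is left as a comment, since the right `skiprows`/`usecols` values depend on your inspection of the spreadsheet):

```python
import pandas as pd

# Loading the real file might look like the following, with skiprows and
# usecols as placeholders to be determined from the spreadsheet:
#   outages = pd.read_excel('outage.xlsx', skiprows=..., usecols=...)

# Toy stand-in with the same column layout as the outage data.
outages = pd.DataFrame({
    'OUTAGE.START.DATE': ['2011-07-01', '2014-05-11'],
    'OUTAGE.START.TIME': ['17:00:00', '18:38:00'],
})

# Combine the date and time columns into a single pd.Timestamp column:
# parse the date, then add the time of day as a timedelta.
outages['OUTAGE.START'] = (
    pd.to_datetime(outages['OUTAGE.START.DATE'])
    + pd.to_timedelta(outages['OUTAGE.START.TIME'])
)
```

The same pattern works for `'OUTAGE.RESTORATION.DATE'` and `'OUTAGE.RESTORATION.TIME'`.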
---
layout: page
title: Recipes and Ratings 🍽️
description: Description of the recipes and ratings dataset in Project 4.
parent: 'Project 4: The Data Science Lifecycle 📊'
nav_exclude: true
---

# Recipes and Ratings 🍽️
{:.no_toc}

## Table of Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

This dataset contains recipes and ratings from [food.com](https://food.com). It was originally scraped and used by the authors of [this](https://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19c.pdf) recommender systems paper.

---

## Getting the Data

Download the data [here](https://drive.google.com/file/d/1kIbMz6jlhleiZ9_3QthmUnifoSds_2EI/view?usp=sharing). You'll download two CSV files:
- `RAW_recipes.csv` contains recipes.
- `RAW_interactions.csv` contains reviews and ratings submitted for the recipes in `RAW_recipes.csv`.

Since the original data is quite large, we've provided you with a subset of the raw data used in the original report, containing only the recipes and reviews posted since 2008.

A description of each column in both datasets is given below.

#### Recipes

For context, you may want to look at an example recipe [directly on food.com](https://www.food.com/recipe/chickpea-and-fresh-tomato-toss-51631).

| Column | Description |
|:---|:---|
| `'name'` | Recipe name |
| `'id'` | Recipe ID |
| `'minutes'` | Minutes to prepare recipe |
| `'contributor_id'` | User ID of the user who submitted this recipe |
| `'submitted'` | Date recipe was submitted |
| `'tags'` | Food.com tags for recipe |
| `'nutrition'` | Nutrition information in the form `[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]`; PDV stands for "percentage of daily value" |
| `'n_steps'` | Number of steps in recipe |
| `'steps'` | Text for recipe steps, in order |
| `'description'` | User-provided description |

#### Ratings

| Column | Description |
|:---|:---|
| `'user_id'` | User ID |
| `'recipe_id'` | Recipe ID |
| `'date'` | Date of interaction |
| `'rating'` | Rating given |
| `'review'` | Review text |

After downloading the datasets, you **must** follow these steps to merge the two datasets and create a column containing the average rating per recipe:
1. Left merge the recipes and interactions datasets together.
2. In the merged dataset, fill all ratings of 0 with `np.nan`. (Think about _why_ this is a reasonable step, and include your justification in your website.)
3. Find the average rating per recipe, as a Series.
4. Add this Series containing the average rating per recipe back to the recipes dataset however you'd like (e.g., by merging). **Use the resulting dataset for all of your analysis.** (For the purposes of Project 4, the `'review'` column in the interactions dataset doesn't have much use.)
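
The four steps above might look roughly like this in `pandas`, shown on toy stand-ins for the two CSVs (column names follow the data dictionary above):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for RAW_recipes.csv and RAW_interactions.csv.
recipes = pd.DataFrame({'id': [1, 2], 'name': ['toast', 'soup']})
interactions = pd.DataFrame({
    'recipe_id': [1, 1, 2],
    'rating': [5, 0, 4],
})

# 1. Left merge recipes with interactions.
merged = recipes.merge(
    interactions, left_on='id', right_on='recipe_id', how='left'
)

# 2. Treat ratings of 0 as missing.
merged['rating'] = merged['rating'].replace(0, np.nan)

# 3. Average rating per recipe, as a Series (NaNs are skipped by mean).
avg_rating = merged.groupby('id')['rating'].mean()

# 4. Add the average ratings back to the recipes dataset.
recipes = recipes.merge(
    avg_rating.rename('avg_rating'), left_on='id', right_index=True
)
```

Note how the recipe with ratings 5 and 0 averages to 5, not 2.5, because the 0 was treated as missing rather than as a score.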

---

## Example Questions and Prediction Problems

Feel free to base your exploration of the dataset in Steps 1-4 around one of these questions, or come up with a question of your own.

- What types of recipes tend to have the most calories?
- What types of recipes tend to have higher average ratings?
- What types of recipes tend to be healthier (i.e. more protein, fewer carbs)?
- What is the relationship between the cooking time and average rating of recipes?

Feel free to use one of the prompts below to build your predictive model in Steps 5-8, or come up with a prediction task of your own.

- Predict ratings of recipes.
- Predict the number of minutes to prepare recipes.
- Predict the number of steps in recipes.
- Predict calories of recipes.

Make sure to justify what information you would know at the “time of prediction” and to only train your model using those features.

---

## Special Considerations

### Step 2: Data Cleaning and Exploratory Data Analysis

Some columns, like `'nutrition'`, contain values that look like lists, but are actually strings that look like lists. You may want to turn the strings into actual lists, or create columns for every unique value in those lists. For instance, per the data dictionary, each value in the `'nutrition'` column contains information in the form `"[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV)]"`; you could create individual columns in your dataset titled `'calories'`, `'total fat'`, etc.
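
One way to do this, sketched on a toy frame, is `ast.literal_eval`, which safely parses strings containing Python literals; the column labels below follow the data dictionary above.

```python
import ast

import pandas as pd

# Toy stand-in: 'nutrition' values are strings that *look* like lists.
recipes = pd.DataFrame({
    'nutrition': ['[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]'],
})

# Parse each string into an actual Python list.
parsed = recipes['nutrition'].apply(ast.literal_eval)

# Expand into one column per nutrition fact, per the data dictionary.
labels = ['calories', 'total fat', 'sugar', 'sodium',
          'protein', 'saturated fat', 'carbohydrates']
nutrition = pd.DataFrame(parsed.tolist(), columns=labels,
                         index=recipes.index)
recipes = recipes.join(nutrition)
```

Avoid plain `eval` here; `ast.literal_eval` rejects anything that isn't a literal, which is exactly what these stringified lists should be.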

### Step 4: Assessment of Missingness

There are only three columns in the merged dataset that contain missing values; make sure you're using the merged dataset for all of your analysis (and that you followed the steps at the top of this page exactly).