Sensemaking Problem Set: Course Catalog Data Cycle

This repository contains the code and analysis for the Sensemaking Problem Set, which walks through the full data cycle, from collection to consumption, using a public university course catalog. The problem set develops skills in data acquisition, preparation, parsing, cleaning, extraction, analysis, and visualization, and ends with the export of a clean, formatted dataset.

Assignment Objectives

Gain hands-on experience in data collection, storage, processing, and consumption
Develop skills in analytics and visualization by working with public university course catalog data

Repository Structure

The repository is organized into the following structure:

01_pull.py: Script for data acquisition, automating the process of accessing the website, navigating to the relevant sections if necessary, and downloading the HTML content.
02_combine.py: Script for data preparation, combining multiple HTML files into a single comprehensive document.
03_parse.py: Script for data parsing, extracting relevant information from the consolidated HTML document.
04_clean.py: Script for data cleaning, refining the extracted information by removing or correcting any data that will break the parser.
05_extract.py: Script for data extraction, identifying and extracting course titles from the cleaned dataset.
06_frequency.py: Script for word frequency analysis, analyzing the most common words used in course titles.
07_visualization.py: Script for data visualization, creating visual representations of the word frequencies obtained from the course titles.
08_export.py: Script for exporting a clean, well-formatted dataset of the entire university catalog.
09_pipeline.py: Script for running the preceding scripts in sequence.
10_mit_1996.json: Extracted course data from the scanned 1996 MIT course catalog.
10_extract_1996.py: Script for extracting course data from the scanned 1996 MIT course catalog.
11_mit_2024.json: Extracted course data from the current MIT course catalog.
11_extract_2024.py: Script for extracting course data from the current MIT course catalog.
12_course_offerings.py: Script for analyzing the number of courses offered in various departments over time.
13_title_evolution.py: Script for conducting a word frequency analysis on course titles from 1996 and 2024.
14_new_and_old.py: Script for identifying subjects that were offered in 1996 but no longer exist in 2024, as well as new subjects introduced in 2024.
15_curriculum_breadth.py: Script for comparing the breadth of topics in the 1996 and 2024 catalogs.
16_summary_reflection.txt: Written summary reflecting on the most significant changes in the MIT course catalog over time.
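
The title analyses described above (06_frequency.py, 13_title_evolution.py, and 14_new_and_old.py) can be illustrated with a minimal sketch. This is not the code from those scripts: the function names, the stopword list, and the sample titles are all assumptions made for illustration, and the real scripts may tokenize and match titles differently.

```python
import re
from collections import Counter

# Hypothetical stopword list; the repository's actual list is not shown here.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def title_word_frequencies(titles, stopwords=STOPWORDS):
    """Count lowercase alphabetic words across course titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


def new_and_discontinued(old_titles, new_titles):
    """Return (discontinued, added) titles as sorted lists, using exact title matches."""
    old, new = set(old_titles), set(new_titles)
    return sorted(old - new), sorted(new - old)


# Made-up sample titles, not taken from either catalog.
titles_1996 = ["Introduction to Algorithms", "Circuits and Electronics"]
titles_2024 = ["Introduction to Algorithms", "Introduction to Machine Learning"]

frequencies = title_word_frequencies(titles_2024)
discontinued, added = new_and_discontinued(titles_1996, titles_2024)
```

Exact string matching is the simplest way to compare two catalogs, but it treats a renamed subject as one discontinued subject plus one new subject. Matching on subject numbers, where available, would be one way to avoid that.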

Setup and Usage

Clone the repository
Install the necessary dependencies. Refer to the individual script files for specific requirements.
Run the scripts in the specified order to perform the various stages of the data cycle.
Analyze the generated outputs, including the cleaned dataset, visualizations, and summary reflections.
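
Running the scripts in order can be sketched as a small runner. This is an illustrative assumption about how sequential execution might work, not the contents of 09_pipeline.py: the names `STAGES`, `run_script`, and `run_pipeline` are hypothetical, and the stage list simply follows the file names given under Repository Structure.

```python
import subprocess
import sys

# Stage scripts in the order listed under Repository Structure.
STAGES = [
    "01_pull.py",
    "02_combine.py",
    "03_parse.py",
    "04_clean.py",
    "05_extract.py",
    "06_frequency.py",
    "07_visualization.py",
    "08_export.py",
]


def run_script(script):
    """Run one stage with the current Python interpreter and return its exit code."""
    return subprocess.run([sys.executable, script]).returncode


def run_pipeline(stages=STAGES, run=run_script):
    """Run stages in order, stopping at the first non-zero exit code.

    Returns the list of stages that completed successfully.
    """
    completed = []
    for script in stages:
        if run(script) != 0:
            print(f"Stage failed: {script}")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root would execute each stage in turn. Stopping at the first failure keeps later stages from working on missing or partial output.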

Submission

The problem set submission instructions can be found at: https://classroom.github.com/a/MAgpvTrm

Folders and files

raw_html/northeastern
01_pull.py
02_combine.py
03_parse.py
04_clean.py
05_extract.py
06_frequency.py
07_visualization.py
08_export.py
09_pipeline.py
10_extract_1996.py
10_mit_1996.json
11_extract_2024.py
11_mit_2024.json
12_course_offerings.py
13_title_evolution.py
14_new_and_old.py
15_curriculum_breadth.py
16_summary_reflection.txt
LICENSE
README.md
cleaned_courses.txt
combined_northeastern_catalog.html
course_offerings_insights.txt
department_changes.csv
department_changes.png
extracted_titles.txt
new_and_discontinued_subjects.txt
parsed_courses.txt
pipeline.log
top_words_comparison.png
university_catalog.json
word_frequencies.txt
wordcloud_1996.png
wordcloud_2024.png

jtwirly/sensemaking
