A CS105 Project that classifies and characterizes class lectures based on word frequency.
Project developed by Shreya Balaji, Benson Wan, and Richard Duong.
Link to the GitHub repository here
If you want to take a look at our presentation and findings, click here
This repository contains sample data extracted from YouTube. The directory structure is as follows:
YouReader/ # custom Python package for extracting captions
docs/ # documents, graphics, and other resources
notebooks/ # notebooks for graphics
scripts/ # setup scripts
tests/ # unit and integration tests
old/ # old development code
data/ # collected data
    links.csv # input file for links
    example.csv # example input file
    save.json # downloaded and cleaned data
Before you can use and test code from this project, you will need the following installed on your system:
Optional if you want to generate graphics with notebooks
To use this package, you need to create a virtual environment and install the prerequisite Python libraries into it. If you have not created the virtual environment yet, follow these steps.
- Download and extract the code
- Run the following commands:
Move to project directory
=========================
$ cd GuessTheClass
To generate a virtual environment
=================================
[Linux, MacOS]
$ chmod +x scripts/setup.sh
$ scripts/setup.sh
[Git Bash on Windows]
$ scripts/winsetup.sh
[Cmd Prompt on Windows]
> "scripts/setup.bat"
After setting up the virtual environment for the first time, run these commands to activate it each time before you start using our package.
Load the virtual environment
============================
[Linux, MacOS]
$ source env/bin/activate
[Git Bash on Windows]
$ source env/Scripts/activate
[Cmd Prompt on Windows]
> "env/Scripts/activate.bat"
Disable the virtual environment
===============================
$ deactivate
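To confirm that the virtual environment is actually loaded before running the package, a quick check from Python can help. This is a generic sketch and is not part of this project's scripts; the function name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Run it once after activating and once after `deactivate`; the reported value should differ between the two.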
If you want to run our program with the existing dataset, you can use the template notebook in the notebooks/ directory:
GuessTheClass/notebooks/template.ipynb
If you have your own existing dataset that you want to test:
- Put your YouTube links into "data/links.csv"
- You can build your captions dataset using the example below
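Before building the dataset, it can be useful to sanity-check the links file. The sketch below does not use the YouReader API. It uses only the Python standard library and assumes that each row of the CSV holds a YouTube URL in its first column, which may not match the actual layout of `data/links.csv`. The helper names `video_id` and `read_links` are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs

# Hosts treated as YouTube watch pages in this sketch (an assumption, not exhaustive).
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def video_id(url):
    """Return the video ID from a common YouTube URL form, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in WATCH_HOSTS and parsed.path == "/watch":
        # Regular links carry the ID in the "v" query parameter.
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


def read_links(handle):
    """Yield (url, video_id) pairs, skipping rows that are not YouTube links."""
    for row in csv.reader(handle):
        if not row or not row[0].strip():
            continue
        url = row[0].strip()
        vid = video_id(url)
        if vid is not None:
            yield url, vid


# In-memory stand-in for a links file; the IDs are placeholders.
sample = io.StringIO(
    "https://www.youtube.com/watch?v=abc123XYZ_0\n"
    "https://youtu.be/def456UVW-1\n"
    "not a link\n"
)
pairs = list(read_links(sample))
for url, vid in pairs:
    print(vid, url)
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `open("data/links.csv", newline="")`. Rows that do not parse as YouTube links, including a header row if one exists, are skipped silently.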