IMDb Data Analyzer

What this script does:

Starting from an input movie, this script outputs Box Office revenue data based on:

the previous 10 movies of:
- the lead actor
- the second lead actor
- the director
- the producer
the top 100 movies of its main genre
the top 100 movies of its secondary genre

Getting started:

To use this Python script, run it from a virtual environment. Follow these steps:

open a terminal and navigate to the project folder
create a virtual environment of your choice, e.g.:
```
  virtualenv .venv
```
enter the virtual environment:
```
  source .venv/bin/activate
```
install the required packages:
```
  pip install -r requirements.txt
```

Next, create a new folder and name it "input_data".

Please use this website to download the required input files:

https://datasets.imdbws.com/

Please download the following files: title.basics.tsv.gz and title.ratings.tsv.gz. After downloading, extract them and move both to the input_data folder.

Download this data again periodically to keep the results up to date.

Using these 2 files, the script will generate a JSON file called movies_by_genre.json that can be found in the project root directory. If that file already exists, the script will simply read it and use it. This file contains all the movies categorized by genre and sorted descending by the number of votes.
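The build-or-reuse behaviour described above could be sketched roughly as follows. This is not the code from main.py: the function names and the JSON layout (a list of `id`/`title`/`votes` records per genre) are assumptions, as is the choice to keep only rows whose `titleType` is `movie`. The column names (`tconst`, `titleType`, `primaryTitle`, `genres`, `numVotes`) and the `\N` marker for missing values come from the IMDb dataset files.

```python
import csv
import json
import os
from collections import defaultdict


def build_movies_by_genre(basics_lines, ratings_lines):
    """Group movies by genre; each list is sorted by vote count, descending.

    Both arguments are iterables of TSV lines, e.g. open file objects.
    """
    # title.ratings.tsv columns: tconst, averageRating, numVotes
    votes = {}
    for row in csv.DictReader(ratings_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        votes[row["tconst"]] = int(row["numVotes"])

    by_genre = defaultdict(list)
    for row in csv.DictReader(basics_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Assumption: keep only feature films that have at least one vote.
        if row["titleType"] != "movie" or row["tconst"] not in votes:
            continue
        if row["genres"] == r"\N":  # IMDb writes \N for missing values
            continue
        for genre in row["genres"].split(","):
            by_genre[genre].append(
                {
                    "id": row["tconst"],
                    "title": row["primaryTitle"],
                    "votes": votes[row["tconst"]],
                }
            )

    for movies in by_genre.values():
        movies.sort(key=lambda m: m["votes"], reverse=True)
    return dict(by_genre)


def load_or_build(cache_path, basics_path, ratings_path):
    """Reuse the cached JSON file if it exists, otherwise build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    with open(basics_path, encoding="utf-8", newline="") as basics, \
            open(ratings_path, encoding="utf-8", newline="") as ratings:
        data = build_movies_by_genre(basics, ratings)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

With the folder layout described above, a call might look like `load_or_build("movies_by_genre.json", "input_data/title.basics.tsv", "input_data/title.ratings.tsv")`. Deleting movies_by_genre.json after refreshing the input files would force a rebuild under this sketch.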

Usage

This script takes as input an IMDb movie ID, e.g. 0372784.

WARNING: Enter digits only. IMDb prefixes IDs with tt in its URLs and databases (e.g. tt0372784), but IMDbPY uses only the digits.

Make sure that you run the script from inside the virtual environment (see the activation command above).

The script takes only 1 argument. If you provide more, only the first one will be considered.

    python3 main.py <ID>
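As an illustration of the argument rules above, here is a minimal sketch of how the ID could be read and validated. The function name `parse_movie_id` is hypothetical, and stripping a leading `tt` instead of rejecting it is an added convenience, not something the script is documented to do.

```python
import sys


def parse_movie_id(argv):
    """Return the digits-only movie ID taken from the first argument."""
    if len(argv) < 2:
        raise SystemExit("usage: python3 main.py <ID>")
    raw = argv[1]  # any further arguments are ignored
    # Assumption: tolerate a leading "tt" rather than rejecting the input.
    movie_id = raw[2:] if raw.startswith("tt") else raw
    if not (movie_id.isascii() and movie_id.isdigit()):
        raise SystemExit(f"invalid IMDb ID: {raw!r} (digits only, e.g. 0372784)")
    return movie_id


if __name__ == "__main__":
    print(parse_movie_id(sys.argv))
```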

Outputs

Running the script will populate the output_data folder. Every file name has the same format:

ID_DATE_DESCRIPTION.file

ID - the input movie ID without the "tt" prefix
DATE - the date of running the script
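The naming scheme can be pictured with a small helper. This is only a sketch: `build_output_name` is a hypothetical name, and the ISO date format (YYYY-MM-DD) is an assumption, since the date format the script actually uses is not stated here.

```python
from datetime import date


def build_output_name(movie_id, run_date, description=None, extension="json"):
    """Build ID_DATE_DESCRIPTION.file; the description part is optional."""
    # Assumption: the date is written in ISO format (YYYY-MM-DD).
    parts = [movie_id, run_date.isoformat()]
    if description:
        parts.append(description)
    return "_".join(parts) + "." + extension
```

Under these assumptions, `build_output_name("0372784", date.today(), "weekend", "csv")` would give a name of the form `0372784_<today>_weekend.csv`.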

This script generates the following files:

ID_DATE.json : all the data in JSON format, covering both opening weekend and worldwide revenues
ID_DATE_weekend.csv : table that stores the Box Office revenue data from the opening weekend
ID_DATE_worldwide.csv : table that stores the worldwide Box Office revenue data
ID_DATE_averages.json : for each category (e.g., lead actor), the average of both the opening weekend and the worldwide revenues, computed over the non-null values only
ID_DATE_ratios.json : for each movie other than the input one, the ratio of opening weekend revenue to worldwide revenue
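The two derived files rest on simple arithmetic, sketched below. The function names are hypothetical, missing revenues are assumed to be represented as `None`, and returning `None` when a ratio cannot be computed is likewise an assumption about how such cases might be handled.

```python
def average_non_null(values):
    """Average the values that are present, skipping None entries."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def revenue_ratio(opening_weekend, worldwide):
    """Opening weekend revenue divided by worldwide revenue."""
    # Assumption: a missing or zero figure yields None instead of an error.
    if opening_weekend is None or not worldwide:
        return None
    return opening_weekend / worldwide
```

For example, `average_non_null([10, None, 30])` averages only the two known figures, and `revenue_ratio(50, 200)` is a quarter of the worldwide total.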

Conclusion

This script could easily be extended to accept multiple IDs. Note, however, that a full run takes a long time: IMDbPY is quite slow, and the script has to fetch data for 240 movies (10 movies for each of the 4 people, plus 100 movies for each of the 2 genres).

zcabbub/imdb_data_analyzer
