Team members:
Luka Pavićević, 63220489, [email protected]
Andrija Stanišić, 63220491, [email protected]
Stefanela Stevanović, 63220492, [email protected]
Group public acronym/name: PosnaSarma
This value will be used for publishing marks/scores. It will be known only to you and not to your colleagues.
The goal of this project is to create a summary generator for movies without any human interaction in the process. We plan to achieve this goal by applying different natural language processing methods. As a baseline we have implemented Latent Semantic Analysis (LSA), and for the final implementation we have used a transformer-based model, which has shown promising results. The transformer-based model is T5-small, imported from the Hugging Face Hub and fine-tuned on our dataset.
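To illustrate the baseline idea, here is a minimal, self-contained sketch of extractive LSA summarization. Our actual baseline uses the sumy library; the whitespace/regex tokenization and the sentence-selection rule below are simplifying assumptions, not the exact pipeline.

```python
import re
import numpy as np

def lsa_summarize(text, n_sentences=2):
    """Extractive LSA summary: build a term-sentence frequency matrix,
    take its SVD, and keep the sentences that load most heavily on the
    first latent topic (first right singular vector)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vocab = sorted({w for s in sentences for w in re.findall(r"\w+", s.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    # A[i, j] = count of term i in sentence j
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in re.findall(r"\w+", s.lower()):
            A[index[w], j] += 1
    # Rows of Vt give each sentence's weight per latent topic
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    top = sorted(np.argsort(-np.abs(Vt[0]))[:n_sentences])
    return " ".join(sentences[i] for i in top)
```

Selected sentences are re-emitted in their original order, so the summary reads as a coherent excerpt rather than a ranked list.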
We have collected our dataset from different sources and pushed it completely to GitHub, so you do not have to perform any web scraping yourself. If you do wish to download the dataset yourself, that can be done in four steps. To download the movie scripts, run the Jupyter Notebook in the scripts directory. The second step is downloading the subtitles, which can be done by running the provided Jupyter Notebook in the subslikescript directory; this will download both subtitles and summaries. Since not all summaries were available on the first site, we downloaded additional summaries from Rotten Tomatoes and Metacritic; these can be obtained by running the provided Jupyter Notebooks in the rottentomatoes and metacritic directories, respectively.
To perform the natural language processing, look at the baseline and main directories. In baseline you will find our Latent Semantic Analysis (LSA); it can be run via the provided lsa_plus_rouge.py script. In main you will find our fine-tuned T5-small model, imported from the Hugging Face Hub. We fine-tuned the model on Google Colab; the trained model is available at: https://drive.google.com/drive/folders/1f9Mn2DGvRzg5IcoYgRdbNxH_NJf8fz8D?fbclid=IwAR1B2T67K3uIpSlYNQcXBS8q8xl8wZVTLfltuJegE8fUtvstM3mPyX8kcZ4
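Both the baseline and the main model are scored with ROUGE (via the rouge_score package). As a rough illustration of what that metric measures, here is a simplified ROUGE-1 F1 computation; it uses plain whitespace tokenization as an assumption, whereas the real library also applies stemming and supports ROUGE-2/ROUGE-L.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F1: unigram overlap between a candidate
    summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```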
The following Python libraries are required to run our code. The requirements are split based on which part of the code you want to run.
- Script acquisition:
- Requests
- BeautifulSoup 4
- tqdm
- Subtitle acquisition:
- Requests
- BeautifulSoup 4
- tqdm
- Pandas
- MetaCritic/RottenTomatoes summary acquisition:
- Matplotlib
- Numpy
- Selenium
- Baseline model:
- Pandas
- Sumy
- Numpy
- NLTK
- Rouge_score
- Main model:
- Transformers
- Datasets
- Evaluate
- Rouge_score
- Accelerate
- Sentencepiece
- Bert_score
- Pandas
If you have any questions and/or suggestions regarding our implementation, you can contact any of the contributors, and we will be more than happy to help.