Repository for Task 3 of the Information Retrieval team project: Latent Semantic Indexing.
Used Technologies:
- Python (version 3.6)
- Libraries:
  - Natural Language Toolkit (NLTK) for Python
  - NumPy
  - SciPy (v0.19)
  - json
- Web page UI built with Angular
- We recommend using Anaconda as the package manager and runtime (https://www.continuum.io/downloads)
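
The core LSI computation these libraries support can be sketched with a truncated SVD from SciPy. The following is a minimal sketch using a toy term-document matrix; the project's actual matrix construction, dimensionality, and variable names will differ:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy term-document matrix (rows = terms, columns = documents); the real
# code builds this from the preprocessed 20news-bydate corpus.
A = csr_matrix(np.array([
    [1., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 1., 1., 1.],
    [0., 0., 1., 1.],
]))

k = 2  # number of latent dimensions to keep
U, s, Vt = svds(A, k=k)  # truncated SVD: A is approximated by U @ diag(s) @ Vt

# Document representations in the latent space (columns of diag(s) @ Vt)
doc_vectors = np.diag(s) @ Vt

# Fold a query into the same space: q_hat = Sigma^{-1} U^T q
q = np.array([1., 1., 0., 0.])  # query term counts over the same vocabulary
q_hat = np.diag(1.0 / s) @ U.T @ q

# Rank documents by cosine similarity to the folded-in query
sims = (doc_vectors.T @ q_hat) / (
    np.linalg.norm(doc_vectors, axis=0) * np.linalg.norm(q_hat)
)
ranking = np.argsort(-sims)  # document indices, most similar first
```

Queries are folded into the latent space rather than compared against raw term vectors, which is what lets LSI match documents that share no literal query terms.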
Preprocessed files are persisted and already available in the project. If you want to re-run the preprocessing, follow the steps under "How to Run from Scratch".
- Copy the `20news-bydate` folder into `LatentSemanticIndexing/data`
- Execute `python /main.py`
- Navigate to http://localhost:8000 in your web browser.
- If successful, the user interface of the search should appear.
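
The serving step can be sketched as follows, assuming `main.py` hosts the Angular UI with Python's built-in HTTP server (this is a guess for illustration; the actual `main.py` may work differently):

```python
import http.server
import socketserver

def make_ui_server(port=8000):
    """Create an HTTP server that serves static files from the current
    directory (e.g. a built Angular UI) at http://localhost:<port>."""
    socketserver.TCPServer.allow_reuse_address = True  # tolerate quick restarts
    return socketserver.TCPServer(("", port), http.server.SimpleHTTPRequestHandler)

# A real run would call make_ui_server(8000).serve_forever() and then open
# http://localhost:8000. Here we bind an ephemeral port just to demonstrate.
httpd = make_ui_server(port=0)
bound_port = httpd.server_address[1]
httpd.server_close()
```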
It is assumed that the newsgroup folder is available in `LatentSemanticIndexing/data` as described above.
- Go to `src/preprocessing/LemmatizationFilePreprocessing`
- If you have never used the NLTK stopword removal list and the tokenizer, follow the subsequent steps. Otherwise, continue with step 3 (running the preprocessing script).
- Execute the following in a Python interpreter:

  ```python
  import nltk
  nltk.download()
  ```
- The NLTK downloader window opens.
- Click on "Corpora", search for "stopwords" and "wordnet" and download both.
- Furthermore, click on "Models", search for "averaged_perceptron_tagger" and "punkt", and download both.
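
Alternatively, the same resources can be fetched non-interactively (the resource names are the ones listed in the steps above):

```python
import nltk

# Download the corpora ("stopwords", "wordnet") and models
# ("averaged_perceptron_tagger", "punkt") needed by the preprocessing.
for resource in ("stopwords", "wordnet", "averaged_perceptron_tagger", "punkt"):
    nltk.download(resource, quiet=True)
```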
- Run the program `LemmatizationFilePreprocessing.py`