Welcome to the Offline Wikipedia Text API! This project provides a simple way to search and retrieve Wikipedia articles from an offline dataset using the txtai library. The API offers three endpoints: full articles by title, full articles by search prompt, and summary snippets of articles by search prompt.
- Offline Access: All Wikipedia article texts are stored offline, allowing for fast and private access.
- Search Functionality: Uses the powerful txtai library to search for articles by prompt.
- This project requires a minimum of 60GB of hard disk space to store the related datasets.
- This project uses Git to pull down the needed datasets (https://git-scm.com/downloads).
  - This can be skipped by downloading the datasets manually into their respective folders in the project directory (see the sketch after this list):
    - "wiki-dataset" folder: https://huggingface.co/datasets/NeuML/wikipedia-20240101
    - "txtai-wikipedia" folder: https://huggingface.co/NeuML/txtai-wikipedia
  - If both dataset folders already exist, the git calls are skipped.
- This is a Python project and requires Python to run.
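If you'd rather script the manual download than use Git, something like the following should work. This is a minimal sketch assuming the huggingface_hub package is installed (it is not one of this project's requirements); the repo IDs come from the links above.

```python
# Minimal sketch: download both datasets into the project folder without git.
# Assumes `pip install huggingface_hub` has been run; huggingface_hub is NOT
# part of this project's requirements.txt.
from huggingface_hub import snapshot_download

# Full Wikipedia text dataset -> "wiki-dataset" folder
snapshot_download(
    repo_id="NeuML/wikipedia-20240101",
    repo_type="dataset",
    local_dir="wiki-dataset",
)

# Prebuilt txtai embeddings index -> "txtai-wikipedia" folder
snapshot_download(
    repo_id="NeuML/txtai-wikipedia",
    local_dir="txtai-wikipedia",
)
```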
During first run, the app will download about 60GB worth of datasets (see above) and then take about 10-15 minutes to do some indexing. This only happens on the first run; just let it do its thing. If, for any reason, you kill the process halfway through and need to redo it, simply delete the "title_to_index.json" file and it will be recreated on the next run. You can also delete the "wiki-dataset" and "txtai-wikipedia" folders to force a re-download.
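For the curious, the indexing step conceptually boils down to building a map from article titles to row positions in the dataset, so later title lookups are instant. The following is a hypothetical sketch, not the project's actual code; the "title" column name is assumed from the NeuML dataset card.

```python
# Hypothetical sketch of the first-run indexing step (NOT the project's exact
# code). It builds a title -> row-index map over the offline dataset so that
# title lookups don't have to scan millions of rows each time.
import json
from datasets import load_dataset

# The "arrow" builder reads the raw .arrow shards in wiki-dataset/train/
dataset = load_dataset("arrow", data_files="wiki-dataset/train/*.arrow", split="train")

title_to_index = {title: i for i, title in enumerate(dataset["title"])}

with open("title_to_index.json", "w") as f:
    json.dump(title_to_index, f)
```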
If you're dataset savvy and want to build newer, more up-to-date datasets to use with this project, NeuML's Hugging Face repos include instructions on how.
This project relies heavily on txtai, which itself uses various libraries to download and run small models for searching. Please see that project for an understanding of what gets downloaded and where.
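To get a feel for what txtai is doing under the hood, here is a minimal sketch of querying the prebuilt txtai-wikipedia index directly, assuming the dataset folders from the setup steps already exist.

```python
# Minimal sketch: search the prebuilt txtai-wikipedia index directly.
# Assumes the "txtai-wikipedia" folder has already been downloaded.
from txtai.embeddings import Embeddings

embeddings = Embeddings()
embeddings.load("txtai-wikipedia")

# Semantic search: prints the top matches with ids, text, and scores
for result in embeddings.search("Tell me about quantum physics", 3):
    print(result)
```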
- Clone the Repository
git clone https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi
cd OfflineWikipediaTextApi
- Run the API
  - For Windows: run_windows.bat
  - For Linux or MacOS: there are currently scripts in the "Untested" folder, though there is a known issue on MacOS related to git. A workaround is presented in the README for that folder.
- Pull down the code from https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi
- Open command prompt and navigate to the folder containing the code
- Optional: create a python virtual environment.
- Windows:
python -m venv venv
- MacOS/Linux:
python3 -m venv venv
- Optional: activate python virtual environment.
- Windows:
venv\Scripts\activate
- MacOS/Linux:
source venv/bin/activate
- Pip install the requirements from requirements.txt
- Windows:
python -m pip install -r requirements.txt
- Linux/MacOS:
python3 -m pip install -r requirements.txt
- Pull down the two needed datasets into the following folders within the project folder:
  - "wiki-dataset" folder: https://huggingface.co/datasets/NeuML/wikipedia-20240101
  - "txtai-wikipedia" folder: https://huggingface.co/NeuML/txtai-wikipedia
  - See the project structure below to make sure you did it right
- Run start_api.py
- Windows: python start_api.py
- MacOS/Linux: python3 start_api.py
Step 7 will take 10-15 minutes on the first run only. This is when the app builds the title index ("title_to_index.json") for future runs. After that it should be fast.
Your project should look like this:
- OfflineWikipediaTextApi/
- wiki-dataset/
- train/
- data-00000-of-00044.arrow
- data-00001-of-00044.arrow
- ...
- pageviews.sqlite
- README.md
- txtai-wikipedia
- config.json
- documents
- embeddings
- README.md
- start_api.py
- ...
The API configuration is managed through the config.json file:
{
"host": "0.0.0.0",
"port": 5728,
"verbose": false
}
The "verbose" is for changing whether the API library uvicorn outputs all logs vs just warning logs. Set to warning by default.
Endpoint: /articles/{title}
curl -X GET "http://localhost:5728/articles/Applications%20of%20quantum%20mechanics"
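The same call from Python, for anyone scripting against the API; a minimal sketch using the requests package, assuming the server is running locally on the default port.

```python
# Minimal sketch: fetch a full article by title from Python.
# Assumes the API is running locally on the default port 5728.
import requests
from urllib.parse import quote

title = "Applications of quantum mechanics"
response = requests.get(f"http://localhost:5728/articles/{quote(title)}")
response.raise_for_status()
print(response.text)
```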
Endpoint: /summaries
curl -G "http://localhost:5728/summaries" --data-urlencode "prompt=Quantum Physics" --data-urlencode "percentile=0.5" --data-urlencode "num_results=1"
Endpoint: /articles
curl -G "http://localhost:5728/articles" --data-urlencode "prompt=Artificial Intelligence" --data-urlencode "percentile=0.5" --data-urlencode "num_results=1"
This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.
This project imports the dependencies listed in requirements.txt. Please see the ThirdParty-Licenses directory for details on their licenses.
OfflineWikipediaTextApi
Copyright (C) 2024 Christopher Smith