Sangraha Internet Archive Data Download

Scripts and utilities for downloading and curating Indic-language data from archive.org.

Setup

Create a virtual environment and install the required Python dependencies listed in the requirements.txt file.

Single Machine Download from Internet Archive

The pipeline folder contains a Single Machine Download Python script that downloads archive data onto your machine. The script takes a list of language names (e.g. Dogri, Tamil, Hindi), followed by optional arguments such as the pdf_only and id_only download options.
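To make the argument shape described above concrete, here is a minimal sketch of how such a command line could be parsed. It is not the repository's actual script: the function name `build_parser`, the `--pdf_only` / `--id_only` flag spellings, and the meaning given to pdf_only are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI: language names plus two optional switches."""
    parser = argparse.ArgumentParser(
        description="Sketch of a single-machine Internet Archive download CLI."
    )
    # One or more language names, e.g. Dogri Tamil Hindi.
    parser.add_argument("languages", nargs="+", help="language names to fetch")
    # Assumed meaning: restrict downloads to PDF files.
    parser.add_argument("--pdf_only", action="store_true",
                        help="download only PDF files (assumed behaviour)")
    # Assumed meaning: collect item identifiers without downloading files.
    parser.add_argument("--id_only", action="store_true",
                        help="only record item identifiers (assumed behaviour)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.languages, args.pdf_only, args.id_only)
```

With this sketch, an invocation such as `python sketch.py Dogri Tamil --id_only` would parse two languages and enable only the identifier-collection switch; check the real script in the pipeline folder for its actual option names.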

Distributed Machine Download from Internet Archive

We used this setup to download data onto machines with more storage and to parallelize the downloads.

Note: You will have to set up RabbitMQ on your server and client machines and configure the Credentials file accordingly.

The pipeline folder contains two files:

Multiple Machine Server, which queues the identifiers from the identifiers.csv file produced in the previous section using the id_only parameter.
Multiple Machine Client, which pulls identifiers from the server host and downloads the data onto the client machine.
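The server side of this arrangement can be sketched roughly as follows, using the `pika` RabbitMQ client. This is an illustration under stated assumptions, not the repository's server script: the helper names `read_identifiers` and `publish_identifiers`, the queue name `ia_identifiers`, the `localhost` host, and the assumption that identifiers.csv holds one identifier in the first column of each row are all hypothetical.

```python
import csv
from typing import Iterable, List


def read_identifiers(csv_path: str) -> List[str]:
    """Return the first column of every non-empty row (assumed CSV layout)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row[0].strip() for row in csv.reader(handle)
                if row and row[0].strip()]


def publish_identifiers(channel, queue_name: str,
                        identifiers: Iterable[str]) -> int:
    """Declare a durable queue and publish one message per identifier."""
    channel.queue_declare(queue=queue_name, durable=True)
    count = 0
    for identifier in identifiers:
        # The default exchange ("") routes by queue name.
        channel.basic_publish(exchange="", routing_key=queue_name,
                              body=identifier.encode("utf-8"))
        count += 1
    return count


if __name__ == "__main__":
    import pika  # third-party RabbitMQ client, imported only when run

    # Hypothetical host and queue name; real values belong in the credentials file.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="localhost"))
    try:
        sent = publish_identifiers(connection.channel(), "ia_identifiers",
                                   read_identifiers("identifiers.csv"))
        print(f"queued {sent} identifiers")
    finally:
        connection.close()
```

A client would connect to the same queue, take identifiers off it one at a time, and download the corresponding item; because each identifier is a separate message, several clients can share the work without coordinating directly.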

AI4Bharat/sangraha-internet-archive-download
