A simple web crawler implemented in Python. The crawler is multithreaded and crawls the site at a given domain name.
To get started you need to have Poetry installed. You can install Poetry by running the following command in the shell.
pip install poetry
When the installation is finished, run the following command from the root folder of this repository to create a virtual environment for the project and install its dependencies.
poetry install
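After that, enter the Poetry environment by running the following command. On Poetry 2.0 and later, the shell command is no longer built in and requires the poetry-plugin-shell plugin.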
poetry shell
If you are using a Debian-based system, you can install the system-wide dependencies by running the following command.
sudo apt-get install python3-bs4 libnss-resolve nscd
To run the crawler, use the following command, replacing <domain_name>, <number_of_threads>, and <output_file> with your own values.
pushd src && python3 main.py --domain <domain_name> --threads <number_of_threads> --output <output_file> && popd
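For readers who want to see how the three flags fit together, here is a minimal sketch of how they could be parsed with argparse. It is not the actual contents of main.py: the function name build_parser, the help texts, and the default values (4 threads, output.txt) are assumptions made for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the run command.

    Hypothetical helper: the real main.py may define its options differently.
    """
    parser = argparse.ArgumentParser(description="Simple multithreaded web crawler")
    # The domain to crawl; treated as mandatory in this sketch.
    parser.add_argument("--domain", required=True, help="domain name to crawl")
    # Number of worker threads; the default of 4 is an assumption.
    parser.add_argument("--threads", type=int, default=4, help="number of worker threads")
    # Where results are written; the default file name is an assumption.
    parser.add_argument("--output", default="output.txt", help="path of the output file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"domain={args.domain} threads={args.threads} output={args.output}")
```

Declaring --threads with type=int means a non-numeric value is rejected when the arguments are parsed, before any crawling starts.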
This project is licensed under the MIT License - see the LICENSE file for details.