Job Vacancy Web Scraping Project

Overview

Demo project to practice Python and Machine Learning technologies.

The project extracts job postings from job sites and analyzes the data.

Prerequisites

Python 3.7+ (the code uses the @dataclass decorator).

See requirements.txt in the repo's root folder for the package dependencies.
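As a quick illustration of why Python 3.7+ is needed, here is a minimal sketch of a @dataclass. The class name JobPosting and all of its fields are hypothetical, chosen only for illustration; the project's actual models may look different.

```python
from dataclasses import dataclass, asdict


@dataclass
class JobPosting:
    # Hypothetical fields, not taken from the project's code.
    title: str
    company: str
    location: str
    url: str


# The decorator generates __init__, __repr__ and __eq__ from the annotations.
job = JobPosting(
    title="Python Developer",
    company="Example Ltd",
    location="London",
    url="https://example.com/jobs/1",
)
print(job)
print(asdict(job))  # plain dict, convenient for serialising or storing
```

On Python versions older than 3.7 the dataclasses module is not part of the standard library, so the import above fails.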

Getting Started (Linux/Unix instructions)

Clone this repo from GitHub

Open a terminal window in the repo's root folder and follow the steps below.

Create a Python virtual environment in the folder that contains this repo:

    python -m venv venv

Activate the virtual environment:

    source venv/bin/activate

Install the project dependencies:

    pip install --no-cache-dir --upgrade -r requirements.txt

Set the PYTHONPATH environment variable:

    export PYTHONPATH="./"

First time database setup: create SQLite database to store the jobs:

    python app/scripts/create_database.py

Create a directory to store logs:

    mkdir logs

Run the program to scrape jobs (first see "Set job search parameters"):

    python app/scrape_jobs.py

Start API for local development

Open a terminal window in the repo's root folder and start the API for local development:

    fastapi dev app/main.py

Once it starts, open a browser window and go to http://127.0.0.1:8000/docs to try the endpoints in the Swagger UI.

Unit test

Open a terminal window in the repo's root folder and run:

    pytest

Set job search parameters:

Currently, the only site scraped is Jobserve.

If you want to store a particular set of jobs in the database, you can fill in the Jobserve search form and then use the session id (shid) that appears in the browser's query string to target that set from the application. To do this, follow these steps:

Go to https://www.jobserve.com/gb/en/Job-Search/ and fill in the search form with your requirements.
After you click the Search button, you will be redirected to a search results page.
From the URL you can obtain the session id (shid) value: https://www.jobserve.com/gb/en/JobSearch.aspx?shid=<session-id>
In the .env file, add a line that sets JOBSERVE-SHID to the shid value obtained from the URL:

JOBSERVE-SHID=<session-id>

NOTE: If Jobserve is not accessed with this session id for 2 days, the session id expires and you will need to repeat the search as explained in the previous steps.
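The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names read_env_value and build_search_url are hypothetical, the value ABC123 is a placeholder rather than a real session id, and the application itself may load .env through a different mechanism.

```python
from typing import Optional
from urllib.parse import urlencode

# Results page URL as given in the steps above.
SEARCH_URL = "https://www.jobserve.com/gb/en/JobSearch.aspx"


def read_env_value(env_text: str, key: str) -> Optional[str]:
    """Return the value for `key` from .env-style text, or None if absent."""
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def build_search_url(shid: str) -> str:
    """Build the search results URL for a given session id."""
    query = urlencode({"shid": shid})
    return f"{SEARCH_URL}?{query}"


# Placeholder .env content; in practice this text would be read from the .env file.
env_text = "# example .env content\nJOBSERVE-SHID=ABC123\n"
shid = read_env_value(env_text, "JOBSERVE-SHID")
if shid is None:
    raise SystemExit("JOBSERVE-SHID is not set in .env")
print(build_search_url(shid))
```

Failing early when the key is missing makes it easier to tell a configuration problem apart from an expired session id.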

Additional Notes:

Check the logs in the ../logs directory if you encounter any issues running the application.
Scraped jobs are stored in the SQLite database data/job-scrape.db. Install a SQLite browser to inspect the retrieved data. There are some SQL queries in the queries.sql file.
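If you prefer not to install a SQLite browser, the database can also be inspected with Python's built-in sqlite3 module. The sketch below lists each table with its row count without assuming anything about the project's schema. The function name table_row_counts and the jobs table in the demo are made up for illustration; to look at the real data, connect to data/job-scrape.db instead of the in-memory database.

```python
import sqlite3
from typing import Dict


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Map each user table name to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # Names come from sqlite_master itself, so they are quoted as identifiers.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# In-memory database with a made-up table, so the sketch runs anywhere.
# For the real data, use sqlite3.connect("data/job-scrape.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO jobs (title) VALUES ('Python Developer')")
print(table_row_counts(conn))
conn.close()
```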

Limitations and Future Improvements

Add a web interface to extract insights from the database.
Have the application set particular values in the search form, to avoid having to deal with expired session ids.

License

This project is licensed under the MIT License.
