Skip to content

calipho-sib/cellosaurus-wikidata-bot

Repository files navigation

Cellosaurus Wikidata Bot


***************************

Cellosaurus Bot allows to integrate cell lines from Cellosaurus to Wikidata.

It was developed based on the WikidataIntegrator library.

Main functions

The Cellosaurus bot will:

  • Create Wikidata items for new cell lines in Cellosaurus release.
  • Update Wikidata items for changes in cell lines informations in Cellosaurus release.

Running the bot

The bot is based on the dumps of the Cellosaurus database. The first step for making the integration is to download the dump of the current release from the Cellosaurus website. The FTP link is:

ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt

To run it, change the run_pipeline.sh file with the updated parameters.

There are 3 main scripts that process the release and integrate to Wikidata:

  • prepare_files.py
  • check_lines_on_wikidata.py
  • update_wikidata.py

The goal of prepare_files.py is to parse the dump and save a Python object that contains the cell line information in a Wikidata-compatible format.

It takes 3 arguments:

  • 1st: The path to the .txt of the Cellosaurus dump
  • 2nd: The path to the folder where the pickle file and cell lines on wikidata were saved after running "prepare_files.py"
  • 3rd: The folder for errors.

For example: python3 prepare_files.py release_38/cellosaurus.txt pickle_files errors

Some articles might not be on Wikidata at the time of the release. These will be logged under the folder for errors.

For adding articles to Wikidata, run: python3 add_articles.py errors

After that, re-run prepare_files to effectively used the newly added articles in the Cellosaurus integration.

The goal of * check_lines_on_wikidata.py is to check if each cell line in the current release is present on Wikidata. Then, it adds to Wikidata the information about any cell lines that are missing.

The second one takes 4 arguments:

  • 1st: The path to the .txt of the Cellosaurus dump
  • 2nd: The path to the folder where the pickle file and cell lines on Wikidata were saved after running prepare_files.py
  • 3rd: The folder for errors.
  • 4th: The QID for the Cellosaurus release on Wikidata

You will have to check Wikidata manually for the ID of the release. For release 36, the ID is Q100993240. For release 38, it is Q106915727.

Notice that you will need the Wikidata user and password of the CellosaurusBot for that operation. The script looks for it in src/local.py. Notice that the credentials should not be commited to GitHub.

For example: python3 check_lines_on_wikidata.py release_36/cellosaurus.txt pickle_files errors Q100993240

Now that all the cell lines are represented on Wikidata, we can update the information for all of them (including inter-cell line links):

python3 update_wikidata.py release_36/cellosaurus.txt pickle_files errors Q100993240

Acknowledgments

The CellosaurusBot now is a 2020 remake of the CellosaurusBot developed in 2018. The following people contributed directly to this project:

  • Amos Bairoch
  • Lelia Debornes
  • Tiago Lubiana
  • Andra Waagmeester

About

Bot to add cell lines from Cellosaurus to Wikidata

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •