The aim of this assignment is for you to become familiar with proteomics databases like PRIDE. Programmatic access to these databases lets you screen for datasets from the organism(s) you are interested in. Details about sample processing will still need to be extracted manually, but this exercise will also give you additional starting points for the group project.
No input data is needed. You can start directly with programmatic access to the database.
- Use programmatic access to PRIDE (e.g. pridepy or ppx) to find all datasets corresponding to the genus Neisseria.
- For each species in that genus that has at least one dataset in PRIDE, count the number of datasets.
- Report your findings as a CSV table, a graph, or both.
- Make sure to comment your code, so that others can read and understand it easily.
- Create a README file describing how to run your code. Include requirements (e.g. Python packages that need to be installed) in that description, or as a separate requirements.txt file.
- Commit all your input files, scripts, and result files to your GitHub Classroom repository.
- Search for at least two additional species from different genera.
- Count the number of datasets per year of publication for each species, and display the results in a graph.
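
The counting steps in the list above can be sketched as follows. This is a minimal illustration, not a reference solution. It assumes the search results have already been collected into a list of dictionaries. The field names `organisms` and `publicationDate`, the sample records, and the helper names `count_by_species`, `count_by_year` and `to_csv` are all hypothetical placeholders. Check the actual output of pridepy, ppx, or the PRIDE web service and adapt the field names accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical placeholder records, NOT real PRIDE entries. Real records would
# come from your query; the field names used here are assumptions.
projects = [
    {"accession": "EXAMPLE_1", "organisms": ["Neisseria meningitidis"],
     "publicationDate": "2019-03-14"},
    {"accession": "EXAMPLE_2", "organisms": ["Neisseria gonorrhoeae", "Homo sapiens"],
     "publicationDate": "2021-07-02"},
    {"accession": "EXAMPLE_3", "organisms": ["Neisseria meningitidis"],
     "publicationDate": "2021-11-30"},
]


def count_by_species(records, genus):
    """Count datasets per organism name whose first word matches the genus."""
    counts = Counter()
    for record in records:
        # A set ensures a dataset is counted once per species, even if the
        # same name is listed twice in one record.
        names = {name.strip() for name in (record.get("organisms") or [])}
        for name in names:
            if name.split(" ")[0].lower() == genus.lower():
                counts[name] += 1
    return counts


def count_by_year(records, species):
    """Count datasets per publication year for a single species."""
    counts = Counter()
    for record in records:
        names = {n.strip().lower() for n in (record.get("organisms") or [])}
        if species.lower() in names:
            # Assumes an ISO-style date string that starts with the year.
            year = (record.get("publicationDate") or "")[:4]
            if year:
                counts[year] += 1
    return counts


def to_csv(counts, header):
    """Render the counts as CSV text, with rows sorted by key."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for key in sorted(counts):
        writer.writerow([key, counts[key]])
    return buffer.getvalue()


species_counts = count_by_species(projects, "Neisseria")
year_counts = count_by_year(projects, "Neisseria meningitidis")
print(to_csv(species_counts, ["species", "datasets"]))
print(to_csv(year_counts, ["year", "datasets"]))
```

Organism names are matched literally here, so an entry that carries a strain designation would be counted as its own species. Depending on what the database returns, you may want to normalize names to the first two words before counting.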
You must submit the assignment by 8 am on Feb 1 to get full credit.