The goal of this project is to extract identifiable genes, proteins, and metabolites from published pathway figures. In addition to all the code for assembling and running the Pathway Figure OCR pipeline, this repo contains scripts specific to the QC, analysis, and figure generation involved in our publications of the work. Here we document a few of the key files and folders relevant to each paper:
- 25 Years of Pathway Figures (BioRxiv 2020)
  - Interactive search tool for 65k pathway figures and their gene content: shiny app and code
  - NIH Figshare of identified pathway figures and OCR results as RDS datasets: collection
  - UpSet plot of top text and figure genes: script
  - Pie chart data for top disease terms for text and figure genes: script
  - Overlap matrix for Hippo Signaling pathway figure genes: script
  - Machine learning progression plots: script
  - Local database name: pfocr20200131
- Identifying Genes in Published Pathway Figure Images (BioRxiv 2018)
  - Performance assessment figures: folder
  - Local database name: pfocr2018121717
This work is supported by NIGMS grant R01GM100039.
The codebook is a good place to start to see how we assemble and run the PFOCR pipeline. Be forewarned, however: this project is still in development and is not ready for production or even dev releases. So, don't expect things to work :) Contact us via Issues if you're interested in contributing to the development. All our code is open source.