StructHunt scrapes scientific articles from bioRxiv, parses them, converts them into embeddings, and analyzes whether they employ certain methodologies. The results are organized and stored in a CSV file. The program consists of the components described below.
scraper.py scrapes bioRxiv to obtain scientific articles in PDF format. It uses external libraries and APIs to download the articles, then parses them to extract the relevant information.
VectorDatabase.py contains the Lantern class, which is used to interact with a PostgreSQL database. The embeddings generated from the articles are stored in the database and associated with their corresponding publications.
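As an illustration of what associating embeddings with publications involves, here is a minimal in-memory sketch. It is not the Lantern class and does not touch PostgreSQL; the names EmbeddingStore, add, and nearest are hypothetical, and cosine similarity is assumed as the distance measure purely for the example.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class EmbeddingStore:
    """Hypothetical in-memory stand-in for a vector table keyed by publication ID."""

    def __init__(self):
        # publication_id -> list of (chunk_text, vector) pairs
        self._rows = defaultdict(list)

    def add(self, publication_id, chunk_text, vector):
        """Store one text chunk and its embedding under a publication."""
        self._rows[publication_id].append((chunk_text, list(vector)))

    def nearest(self, publication_id, query_vector, k=3):
        """Return the k chunks of one publication most similar to query_vector."""
        scored = [
            (cosine_similarity(vector, query_vector), text)
            for text, vector in self._rows.get(publication_id, [])
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

A real deployment would delegate both storage and nearest-neighbour search to the database; the sketch only shows the shape of the data (publication ID, text chunk, vector) that such a table would hold.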
hackathon_runner.py manages the overall flow of the program. It identifies publications that have not yet been processed, retrieves their IDs, and triggers the subsequent processing steps.
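The selection step described above can be sketched as follows. This is an assumption about the general shape of the logic, not the script's actual code; find_unprocessed, run_pipeline, and the process callback are hypothetical names.

```python
def find_unprocessed(all_ids, processed_ids):
    """Return IDs from all_ids that are not in processed_ids.

    Order is preserved and duplicates in all_ids are dropped.
    """
    seen = set(processed_ids)
    pending = []
    for pub_id in all_ids:
        if pub_id not in seen:
            pending.append(pub_id)
            seen.add(pub_id)  # guards against duplicates in all_ids
    return pending


def run_pipeline(all_ids, processed_ids, process):
    """Call process(pub_id) for each pending publication and collect the results."""
    results = {}
    for pub_id in find_unprocessed(all_ids, processed_ids):
        results[pub_id] = process(pub_id)
    return results
```

Passing the processing step in as a callback keeps the selection logic easy to test without a database or network access.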
The chatgpt component interacts with OpenAI's GPT-based language model, using prompts generated by the updated_prompt.py script together with the embeddings retrieved in the previous step. The goal is to determine whether the publications implement certain methodologies.
prompts.py generates the prompts used to query the GPT model. The prompts are tailored to the specific characteristics of the publications being analyzed.
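A prompt of this kind might be assembled as in the sketch below. The template wording, the build_prompt function, and the max_chars limit are all assumptions made for illustration; the project's real prompts may look quite different.

```python
# Hypothetical template; the wording is illustrative only.
PROMPT_TEMPLATE = (
    "You are reviewing excerpts from a scientific publication.\n"
    "Question: does the publication use {methodology}?\n"
    "Answer 'yes' or 'no', then justify briefly.\n\n"
    "Excerpts:\n{excerpts}"
)


def build_prompt(methodology, chunks, max_chars=4000):
    """Fill the template with numbered excerpts.

    Excerpts are added in order until the next one would push the
    excerpt section past max_chars characters.
    """
    lines = []
    used = 0
    for index, chunk in enumerate(chunks, start=1):
        line = f"[{index}] {chunk.strip()}"
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line) + 1  # +1 for the joining newline
    return PROMPT_TEMPLATE.format(methodology=methodology, excerpts="\n".join(lines))
```

Capping the excerpt section is one simple way to keep a prompt within a model's input limit; counting characters rather than tokens is a simplification.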
The program populates a CSV file with the analysis results. This file contains information on whether the publications employ certain methodologies, providing a structured output for easy interpretation and further analysis.
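For illustration, results of this kind could be written with Python's standard csv module as sketched below. The column names in FIELDNAMES and the helper functions are hypothetical; the actual columns StructHunt writes are not specified here.

```python
import csv
import io

# Hypothetical column layout for the results file.
FIELDNAMES = ["publication_id", "methodology", "uses_methodology", "justification"]


def write_results(rows, handle):
    """Write a header row followed by one CSV row per result dict."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def results_to_csv_text(rows):
    """Render results to a CSV string (convenient for tests and previews)."""
    buffer = io.StringIO(newline="")
    write_results(rows, buffer)
    return buffer.getvalue()
```

To write to disk instead, open the target with open(path, "w", newline="") and pass the file object to write_results; the newline="" argument lets the csv module control line endings itself.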
- Environment Setup:
  - Ensure that you have Python installed.
  - Install the required Postgres database and Python packages using initialize_database.sh:
    sudo ./initialize_database.sh
- Run the Program:
  - Execute runner.py to initiate the structured hunting process:
    python runner.py
Feel free to contribute to the development of StructHunt by submitting issues, feature requests, or pull requests. Your feedback and contributions are highly appreciated.
This project is licensed under the MIT License.