Bioassay data associative promiscuity pattern learning engine V2.
If you want to use or recreate the classic version of Badapple, follow the instructions here.
NOTE: This is a work in progress; the steps below are not final or complete.
The steps below outline how one can generate the Badapple2 DB on their own system.
Make sure to inspect all bash scripts and modify variable definitions (mostly file paths) as needed before running them. When running bash scripts, make sure your conda environment is active (conda activate badapple2).
Code is expected to work on Linux systems. Thus far all code has been tested on the following OS:
Distributor ID: Linuxmint
Description: Linux Mint 21.2
Release: 21.2
Codename: victoria
- Setup conda (see the Miniconda Site for more info)
- (Optional) I'd recommend using the libmamba solver for faster install times, see here
- Install the Badapple2 environment:
conda env create -f environment.yml
- This will create a new conda env named badapple2. If you wish, you can change the name by editing the first line of environment.yml before running the command above.
The steps below are common to installation of the badapple, badapple_classic, and badapple2 databases (DBs).
- Install PostgreSQL with the RDKit cartridge (requires sudo):
sudo apt install postgresql-14-rdkit
- (Option 1) Make your user a superuser prior to DB setup:
- Switch to postgres user:
(base) <username>@<computer>:~$ sudo -i -u postgres
- Make yourself a superuser:
psql -c "CREATE ROLE <username> WITH LOGIN SUPERUSER PASSWORD '<password>'"
- (Option 2) If you don't want to make <username> a superuser, follow the steps below:
- When running DB setup commands, prepend sudo -u postgres to them. For example, instead of createdb <DB_NAME> use sudo -u postgres createdb <DB_NAME>.
- After setting up the DB as postgres, you can grant <username> permission to access the DB like so:
sudo -i -u postgres
psql -d <DB_NAME> -c "CREATE ROLE <username> WITH LOGIN PASSWORD '<password>'"
psql -d <DB_NAME> -c "GRANT SELECT ON ALL TABLES IN SCHEMA public TO <username>"
psql -d <DB_NAME> -c "GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO <username>"
psql -d <DB_NAME> -c "GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO <username>"
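If you prefer to script the Option 2 grants, the sketch below builds the three GRANT commands as argument lists. It is only an illustration: grant_commands is a hypothetical helper that is not part of the Badapple2 repo, and it assumes that sudo -u postgres psql works on your system. It only builds the commands; it does not run them.

```python
import re

# Unquoted PostgreSQL identifiers are lowercase letters, digits, and underscores.
# Restricting the username this way keeps it from smuggling extra SQL into the statement.
_SAFE_ROLE = re.compile(r"^[a-z_][a-z0-9_]*$")


def grant_commands(db_name, username):
    """Build the read-only GRANT commands from Option 2 as argv lists (hypothetical helper)."""
    if not _SAFE_ROLE.match(username):
        raise ValueError(f"unsupported role name: {username!r}")
    statements = [
        f"GRANT SELECT ON ALL TABLES IN SCHEMA public TO {username}",
        f"GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO {username}",
        f"GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO {username}",
    ]
    # Each command runs psql as the postgres OS user, matching the "prepend sudo -u postgres" advice.
    return [
        ["sudo", "-u", "postgres", "psql", "-d", db_name, "-c", statement]
        for statement in statements
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the role itself has been created.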
Additionally, before getting started, make sure you have the following files:
- AID file: Text file listing all PubChem AIDs to be included in the DB.
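Before starting the pipeline, it can help to sanity-check the AID file. The sketch below assumes the file holds one AID per line as a positive integer; the README does not specify the exact layout, so adjust it if your file differs. read_aids is a hypothetical helper, not part of the Badapple2 scripts.

```python
from pathlib import Path


def read_aids(path):
    """Read PubChem assay IDs (AIDs) from a text file, assuming one AID per line."""
    aids = []
    seen = set()
    for line_no, raw in enumerate(Path(path).read_text().splitlines(), start=1):
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        if not (text.isascii() and text.isdigit()) or int(text) <= 0:
            raise ValueError(f"line {line_no}: {text!r} is not a positive integer AID")
        aid = int(text)
        if aid not in seen:  # drop duplicates while keeping first-seen order
            seen.add(aid)
            aids.append(aid)
    return aids
```

Calling read_aids on your AID file returns the de-duplicated list, or raises ValueError naming the first malformed line.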
The steps below outline how to mirror PubChem data to your system (much faster and more reliable than using the PUG-REST API) and how to generate the 5 input TSVs we'll use in part (3). I recommend saving all 5 of these TSVs to the same directory.
- Run
bash sh_scripts/mirror_pubchem.sh
- This will mirror PubChem Bioassay data on your system (~11 GB of space required).
- Files will be saved to {workdir}/bioassay.
- Run
bash sh_scripts/python/run_pubchem_assays_local.sh
This will generate 3 files:
- o_compound: TSV file with compound CIDs and isomeric SMILES.
- o_sid2cid: TSV file mapping compound ID (CID) <=> substance ID (SID).
- o_assaystats: TSV file with assay ID (AID), substance ID (SID), and activity outcome.
- Run
bash sh_scripts/python/run_generate_scaffolds.sh
This will generate 3 output files:
- o_mol: TSV file with compound canonical SMILES and their CIDs.
- o_scaf: TSV file with all scaffolds and their IDs.
- o_mol2scaf: TSV file mapping compound CID to scaffold ID(s).
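Since the generated TSVs feed the DB load, a quick structural check before loading can catch truncated or malformed output. The sketch below is a hypothetical helper (summarize_tsv is not part of the Badapple2 scripts); it assumes tab-delimited files and, by default, a header row, which may not match the actual script output. Pass has_header=False if your files have no header.

```python
import csv


def summarize_tsv(path, has_header=True):
    """Return (n_data_rows, n_columns) for a TSV file; raise ValueError if rows are ragged."""
    n_columns = None
    n_rows = 0
    with open(path, newline="") as handle:
        # QUOTE_NONE keeps any quote characters in fields from being interpreted.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if n_columns is None:
                n_columns = len(row)  # first non-blank row fixes the expected width
            elif len(row) != n_columns:
                raise ValueError(
                    f"{path}: line {line_no} has {len(row)} columns, expected {n_columns}"
                )
            n_rows += 1
    if n_columns is None:
        raise ValueError(f"{path} is empty")
    return (n_rows - 1 if has_header else n_rows), n_columns
```

Running this on each output file gives a row and column count you can compare against what you expect from your AID list.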
(Step 6 currently out of date, will update)
- Install PostgreSQL with the RDKit cartridge (requires sudo):
sudo apt install postgresql-14-rdkit
- Run
bash sh_scripts/db/create_db.sh
- Connect to db with
psql -d badapple2
- Run
CREATE EXTENSION rdkit;
This should return CREATE EXTENSION.
- (Optional) You can test that the RDKit cartridge is working with the is_valid_smiles function:
badapple2=# select is_valid_smiles('O1OCCCC1');
is_valid_smiles
-----------------
t
(1 row)
- Run
bash sh_scripts/db/load_db.sh
- Done!