Commit

Update 05-create-profiles.Rmd (#63)
* Update 05-create-profiles.Rmd

* Add external metadata instructions

Co-authored-by: Niranj <[email protected]>
bethac07 and niranjchandrasekaran authored Aug 23, 2021
1 parent 79a59be commit b328e16
Showing 2 changed files with 205 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@ _publish.R
_book
_bookdown_files
rsconnect

# Pycharm files
.idea/
204 changes: 202 additions & 2 deletions 05-create-profiles.Rmd
@@ -79,7 +79,7 @@ python3 -m pip install -e .
```

The command below first calls `cytominer-database ingest` to create the SQLite backend, and then pycytominer's `aggregate_profiles` to create per-well profiles.
Once complete, all files are uploaded to S3 and the local cache is deleted. This step takes several hours, but metadata creation and GitHub setup can be done in this time.

[collate.py](https://github.com/cytomining/pycytominer/blob/jump/pycytominer/cyto_utils/collate.py) ingests and indexes the database.

@@ -127,4 +127,204 @@ This is the resulting structure of `backend` on S3 (one level below `workspace`)
```


At this point, use the [profiling template](https://github.com/cytomining/profiling-template) together with [pycytominer](https://github.com/cytomining/pycytominer/) to annotate the profiles with metadata, normalize them, and feature-select them.

## Create Metadata Files

First, get the metadata for the plates.
These files should be created beforehand and uploaded to S3.

This is the structure of the metadata folder (one level below `workspace`):

```
└── metadata
    └── platemaps
        └── 2016_04_01_a549_48hr_batch1
            ├── barcode_platemap.csv
            └── platemap
                └── C-7161-01-LM6-006.txt
```

`2016_04_01_a549_48hr_batch1` is the batch name – the plates (and all related data) are arranged under batches, as seen below.

`barcode_platemap.csv` is structured as shown below.
`Assay_Plate_Barcode` and `Plate_Map_Name` are currently the only mandatory columns (they are used to join the metadata of the plate map with each assay plate).
Each unique entry in `Plate_Map_Name` should have a corresponding tab-separated `.txt` file under `platemap` (e.g. `C-7161-01-LM6-006.txt`).

```
Assay_Plate_Barcode,Plate_Map_Name
SQ00015167,C-7161-01-LM6-006
```

The tab-separated files are plate maps and are structured like this (this is the typical format used by the Broad Chemical Biology Platform):

```
plate_map_name well_position broad_sample mg_per_ml mmoles_per_liter solvent
C-7161-01-LM6-006 A07 BRD-K18895904-001-16-1 3.12432000000000016 9.99999999999999999 DMSO
C-7161-01-LM6-006 A08 BRD-K18895904-001-16-1 1.04143999999919895 3.33333333333076923 DMSO
C-7161-01-LM6-006 A09 BRD-K18895904-001-16-1 0.347146666668001866 1.11111111111538462 DMSO
```

```{block2, type='rmdnote'}
- `plate_map_name` should be identical to the name of the file (without extension).
- `plate_map_name` and `well_position` are currently the only mandatory columns.
```

The external metadata file is an optional tab-separated `.tsv` file that maps a perturbation identifier to other metadata. The name of the perturbation identifier column (e.g. `broad_sample`) should be the same as the corresponding column name in the plate map `.txt` file.
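
For example, an `external_metadata.tsv` might look like the sketch below. Only the `broad_sample` column comes from the plate map shown above; the other column names and values are hypothetical placeholders:

```
broad_sample            pert_iname        target
BRD-K18895904-001-16-1  example_compound  EXAMPLE_GENE
```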

The external metadata file should be placed in a folder named `external_metadata` within the `metadata` folder. If this file is provided, the folder structure should be as follows:

```
└── metadata
    ├── external_metadata
    │   └── external_metadata.tsv
    └── platemaps
        └── 2016_04_01_a549_48hr_batch1
            ├── barcode_platemap.csv
            └── platemap
                └── C-7161-01-LM6-006.txt
```
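
Once the `metadata` folder is assembled locally, it can be uploaded to S3 with a command along these lines (a sketch only, assuming the `BUCKET` and `PROJECT_NAME` variables used elsewhere in this handbook, run from the directory that contains `metadata`):

```
aws s3 sync metadata s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata
```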


## Set up GitHub

Once and only once, fork the [profiling recipe](https://github.com/cytomining/profiling-recipe) to your own user name.
(Each time you start a new project, you may want to [keep your fork up to date](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/syncing-a-fork).)

Once per new project (not per new batch), make a copy of the [template repository](https://github.com/cytomining/profiling-template) in your preferred organization, with a project name that is similar or identical to its project tag on S3 and elsewhere.
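
If you prefer the command line, both steps can also be done with the GitHub CLI. This is a sketch only, assuming `gh` is installed and authenticated; `my-org` and `my_project_name` are placeholders:

```
# fork the recipe under your own user name (once, and only once)
gh repo fork cytomining/profiling-recipe --clone=false
# copy the template into your organization (once per project)
gh repo create my-org/my_project_name --template cytomining/profiling-template --private
```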


## Make Profiles


### Optional - set up compute environment

These final steps are small and can be done either in your local environment or on the node used to build the backends. Conda and Git LFS are currently required.
For now, the backend-creation VMs do not include conda or Git LFS; the commands below install both on a Debian/Ubuntu Linux system (for other operating systems, consult the respective installation instructions).
```
# install Miniconda (Linux x86_64)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
conda init bash
source ~/.bashrc
# add the Git LFS package repository, then install Git LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install -y git-lfs
```
If you are not using the same machine and tmux session used for making the backends, where your environment variables are already set, set them up again (see the linked section; a minimal sketch follows the link).

* [Configure Environment for Full Profiling Pipeline]
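
A minimal sketch of the variables the commands below assume (the values here are examples or placeholders only):

```
BUCKET=your-bucket-name
PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
BATCH_ID=2016_04_01_a549_48hr_batch1
```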

### Set new environment variables

Specifically, `ORG` and `DATA` should be the GitHub organization and repository name used when creating the data repository from the template.
`USER` should be your GitHub username.
`CONFIG_FILE` will be the name of the config file used for this run, so a name that makes it distinguishable (e.g. the batch numbers being run at this time) is helpful.

```
ORG=broadinstitute
DATA=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
USER=gh_username
CONFIG_FILE=config_batch1
```

### If this is the first batch in this compute environment, make some directories
```
mkdir -p ~/work/projects/${PROJECT_NAME}/workspace/{backend,software}
```

### Add your backend files
```
aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/backend/${BATCH_ID} ~/work/projects/${PROJECT_NAME}/workspace/backend/${BATCH_ID} --exclude="*" --include="*.csv"
```

### If this is the first batch in this compute environment, clone your repository
```
cd ~/work/projects/${PROJECT_NAME}/workspace/software
git clone [email protected]:${ORG}/${DATA}.git
#depending on your repo/machine set up you may need to provide credentials here
cd ${DATA}
```

### If this is the first batch in this project, weld the recipe into the repository
```
git submodule add https://github.com/${USER}/profiling-recipe.git profiling-recipe
git add profiling-recipe
git add .gitmodules
git commit -m 'finalizing the recipe weld'
git push
#depending on your repo/machine set up you may need to provide credentials here
git submodule update --init --recursive
```


### If this is the first batch in this compute environment, set up the environment
```
cp profiling-recipe/environment.yml .
conda env create --force --file environment.yml
```


### Activate the environment
```
conda activate profiling
```

### If this is the first batch in this project, create the necessary directories
```
profiling-recipe/scripts/create_dirs.sh
```


### Download the load_data_CSVs
```
aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/load_data_csv/${BATCH_ID} load_data_csv/${BATCH_ID}
gzip -r load_data_csv/${BATCH_ID}
```


### Download the metadata files
```
aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID} metadata/platemaps/${BATCH_ID}
```

### Make the config file
```
cp profiling-recipe/config_template.yml config_files/${CONFIG_FILE}.yml
nano config_files/${CONFIG_FILE}.yml
```

```{block2, type='rmdnote'}
For most small use cases following this handbook, the changes you will likely need to make are: for `aggregate`, set `perform` to `false`; under `annotate`, in the `external` subsection, set `perform` to `false`; in `feature_select`, set `gct` to `true`; and finally, at the bottom, set the batch and plate names.
For large batches with many DMSO wells and external metadata (as in the JUMP project), set `perform` under `external` to `true`, set `file` to the name of the external metadata file, and set `merge_column` to the name of the compound identifier column in the plate map `.txt` file and `external_metadata.tsv`.
```
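
As an illustration only, the relevant parts of the config for a small run might end up looking roughly like the excerpt below; the exact key names and nesting are defined by `config_template.yml`, so treat this as a sketch rather than a complete file:

```
aggregate:
  perform: false
annotate:
  external:
    perform: false          # set to true for JUMP-style runs with external metadata
    # file: external_metadata.tsv
    # merge_column: broad_sample
feature_select:
  gct: true
```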

### Set up the profiles
Note that the `find` step can take a few seconds to a few minutes.
```
mkdir -p profiles/${BATCH_ID}
find ../../backend/${BATCH_ID}/ -type f -name "*.csv" -exec profiling-recipe/scripts/csv2gz.py {} \;
rsync -arzv --include="*/" --include="*.gz" --exclude "*" ../../backend/${BATCH_ID}/ profiles/${BATCH_ID}/
```

### Run the profiling workflow
Especially for a large number of plates, this will take some time. Output is logged to the console as the different steps proceed.
```
python profiling-recipe/profiles/profiling_pipeline.py --config config_files/${CONFIG_FILE}.yml
```


### Push resulting files back up to GitHub
```
git add *
git commit -m 'add profiles for batch _'
git push
```


### Push resulting files up to S3
```
parallel aws s3 sync {1} s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/{1} ::: config_files gct profiles quality_control
```
