Commit fdc1410

Merge branch 'master' of github.com:wikipathways/pathway-figure-ocr

2 parents 0e60274 + 39c4cb2

File tree: 2 files changed, +111 -48 lines

codebook.md (+88 -45)
# Codebook for Pathway Figure OCR project

The sections below detail the steps taken to generate files and run scripts for this project.

### Install Dependencies

[Nix](https://nixos.org/nixos/nix-pills/install-on-your-running-system.html#idm140737316672400)

## PubMed Central Image Extraction

_These scripts are capable of populating the database with structured paper and figure information for future OCR runs._

This URL returns >77k figures from PMC articles matching "signaling pathway". Approximately 80% of these are actually pathway figures, which makes them a reasonably efficient source of sample figures for testing methods. _Consider other search terms and other sources when scaling up._

```
http://www.ncbi.nlm.nih.gov/pmc/?term=signaling+pathway&report=imagesdocsum
```
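
As a quick check on the size of the result set (a sketch, not part of the original workflow), the same term can be sent to the NCBI E-utilities esearch endpoint:

```sh
# Count PMC hits for the same search term via E-utilities
curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=signaling+pathway' \
  | grep -o '<Count>[0-9]*</Count>' | head -n 1
```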


### Scrape HTML

For sample sets you can simply save dozens of pages of results and quickly get thousands of pathway figures. _Consider automating this step when scaling up._

```
Set Display Settings: to max (100)
...
php pmc_image_parse.php
```

* depends on simple_html_dom.php
* outputs images as `PMC######__<filename>.<ext>`
* outputs captions as `PMC######__<filename>.<ext>.html`

_Consider loading caption information directly into the database and skipping the export of this HTML file._

These files are exported to a designated folder, e.g., pmc/20150501/images_all
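
A quick way to confirm that every image has a matching caption file (a sketch; path per the example above):

```sh
cd pmc/20150501/images_all
ls *.html | wc -l         # caption files
ls | grep -cv '\.html$'   # image files; the counts should match
```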

### Prune Images

Another manual step here increases the accuracy of downstream counts. Make a copy of the images_all dir, renaming it to images_pruned. View the extracted images in Finder, for example, and delete the pairs of files associated with figures that are not actually pathways. In this first sample run, ~20% of the images were pruned away; the most common non-pathway figures were of gel electrophoresis runs. _Consider automated ways to either exclude gel figures or select only pathway images to scale this step up._
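
One hypothetical way to script the pair deletion, given a hand-curated list of non-pathway image filenames (`pruned.txt` is an assumed name, one basename per line):

```sh
# Remove each rejected image together with its caption file
while read -r f; do
  rm "images_pruned/$f" "images_pruned/$f.html"
done < pruned.txt
```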

### Load into Database

Create database:

Enter nix-shell:

```
nix-shell
```

```
psql
\i database/create_tables.sql
\q
```

Load filenames (or paths) and extracted content into the database:

* papers (id, pmcid, title, url)
* figures (id, paper_id, filepath, figure_number, caption)

Enter nix-shell:

```
nix-shell
```

First time:

```sh
./pfocr.py load_figures
```

After first time:

```sh
sh ./copy_tables.sh
```
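
A quick sanity check that the load worked (a sketch; assumes psql connects to the project database by default):

```sh
psql -c 'select count(*) from papers;'
psql -c 'select count(*) from figures;'
```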

## Optical Character Recognition

_These scripts are capable of reading selected sets of figures from the database and performing individual runs of OCR._

### Read in Files from Database

* figures (filepath)

### Image Preprocessing

#### ImageMagick

Exploration of settings to improve OCR by pre-processing the images:

```
convert test1.jpg -colorspace gray test1_gr.jpg
convert test1_gr.jpg -threshold 50% test1_gr_th.jpg
convert test1_gr_th.jpg -define connected-components:verbose=true -define connected-components:area-threshold=400 -connected-components 4 -auto-level -depth 8 test1_gr_th_cc.jpg
```
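
The same pipeline could be applied to a whole folder; a sketch (paths are illustrative, settings as explored above):

```sh
for img in images_pruned/*.jpg; do
  base="${img%.jpg}"
  convert "$img" -colorspace gray "${base}_gr.jpg"
  convert "${base}_gr.jpg" -threshold 50% "${base}_gr_th.jpg"
done
```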

### Run Google Cloud Vision

* Set parameters
  * 'LanguageCode':'en' - to restrict to English language characters
* Produce JSON files

Enter nix-shell:

```
nix-shell
```

Caution: if you don't specify a `limit` value, it'll run until the last figure. The default `start` value is 0.

```sh
./pfocr.py ocr gcv --preprocessor noop --start 1 --limit 20
```
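
For later batches, advance `start` past the figures already processed; for example (a sketch, assuming `limit` caps the number of figures per run rather than naming an end index):

```sh
./pfocr.py ocr gcv --preprocessor noop --start 21 --limit 20
```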

Note: This command calls `ocr_pmc.py` at the end, passing along args and functions. The `ocr_pmc.py` script then:

* gets an `ocr_processor_id` corresponding to the unique hash of the processing parameters
* retrieves all figure rows and steps through them, starting with `start`
* runs image pre-processing
* performs OCR
* populates `ocr_processors__figures` with `ocr_processor_id`, `figure_id` and `result`

Example psql query to select words from the result:

```
select substring(regexp_replace(ta->>'description', E'[\\n\\r]+',',','g'),1,45) as word from ocr_processors__figures opf, json_array_elements(opf.result::json->'textAnnotations') AS ta ;
```

## Process Results

_These scripts are capable of processing the results from one or more OCR runs previously stored in the database._

### Create/update word tables for all extracted text

* `-n` for normalizations
* `-m` for mutations

Enter nix-shell:

```
nix-shell
```

```
bash run.sh
```

* populates `match_attempts` with all `figure_id` and `word_id` occurrences

### Create/update xref tables for all lexicon "hits"

* xrefs (id, xref)
* figures__xrefs (ocr_processor_id, figure_id, xref, symbol, unique_wp_hs, filepath)

Example psql query to rank-order figures by unique xrefs:

```
select figure_id, count(unique_wp_hs) as unique from figures__xrefs where unique_wp_hs = TRUE group by figure_id order by 2 desc;
```

* Export a table view to file. It can only write to the /tmp dir; then sftp to download (see the sketch below).

```
copy (select * from figures__xrefs) to '/tmp/filename.csv' with csv;
```
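
For example, to fetch the export over sftp (host and user are placeholders):

```sh
sftp user@your-server:/tmp/filename.csv .
```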

#### Exploring results

* Words extracted for a given paper:

```
select pmcid,figure_number,result from ocr_processors__figures join figures on figures.id=figure_id join papers on papers.id=figures.paper_id where pmcid='PMC2780819';
```

* All paper figures for a given word:

```
select pmcid, figure_number, word from match_attempts join words on words.id=word_id join figures on figures.id=figure_id join papers on papers.id=paper_id where word = 'AC' group by pmcid, figure_number,word;
```

### Collect run stats

* batches__ocr_processors (batch_id, ocr_processor_id)
* batches (timestamp, parameters, paper_count, figure_count, total_word_gross, total_word_unique, total_xrefs_gross, total_xrefs_unique)

166196
## Generating Files and Initial Tables
197+
167198
Do not apply upper() or remove non-alphanumerics during lexicon constuction. These normalizations will be applied in parallel to both the lexicon and extracted words during post-processing.
168199

169200
#### hgnc lexicon files
170-
1. Download ```protein-coding-gene``` TXT file from http://www.genenames.org/cgi-bin/statistics
171-
2. Import TXT into Excel, first setting all columns to "skip" then explicitly choosing "text" for symbol, alias_symbol, prev_symbol and entrez_id columns during import wizard (to avoid date conversion of SEPT1, etc)
172-
3. Delete rows without entrez_id mappings
173-
4. In separate tabs, expand 'alias symbol' and 'prev symbol' lists into single-value rows, maintaining entrez_id mappings for each row. Used Data>Text to Columns>Other:|>Column types:Text. Delete empty rows. Collapse multiple columns by pasting entrez_id before each column, sorting and stacking.
174-
5. Filter each list for unique pairs (only affected alias and prev)
175-
6. For **prev** and **alias**, only keep symbols of 3 or more characters, using:
176-
* `IF(LEN(B2)<3,"",B2)`
177-
7. Enter these formulas into columns C and D, next to sorted **alias** in order to "tag" all instances of symbols that match more than one entrez. Delete *all* of these instances.
178-
* `MATCH(B2,B3:B$###,0)` and `MATCH(B2,B$1:B1,0)`, where ### is last row in sheet.
179-
8. Then delete (ignore) all of these instances (i.e., rather than picking one arbitrarily via a unique function)
180-
* `IF(AND(ISNA(C2),ISNA(D2)),A2,"")` and `IF(AND(ISNA(C2),ISNA(D2)),B2,"")`
181-
9. Export as separate CSV files.
201+
202+
1. Download `protein-coding-gene` TXT file from http://www.genenames.org/cgi-bin/statistics
203+
2. Import TXT into Excel, first setting all columns to "skip" then explicitly choosing "text" for symbol, alias_symbol, prev_symbol and entrez_id columns during import wizard (to avoid date conversion of SEPT1, etc)
204+
3. Delete rows without entrez_id mappings
205+
4. In separate tabs, expand 'alias symbol' and 'prev symbol' lists into single-value rows, maintaining entrez_id mappings for each row. Used Data>Text to Columns>Other:|>Column types:Text. Delete empty rows. Collapse multiple columns by pasting entrez_id before each column, sorting and stacking.
206+
5. Filter each list for unique pairs (only affected alias and prev)
207+
6. For **prev** and **alias**, only keep symbols of 3 or more characters, using:
208+
* `IF(LEN(B2)<3,"",B2)`
209+
7. Enter these formulas into columns C and D, next to sorted **alias** in order to "tag" all instances of symbols that match more than one entrez. Delete _all_ of these instances.
210+
* `MATCH(B2,B3:B$###,0)` and `MATCH(B2,B$1:B1,0)`, where ### is last row in sheet.
211+
8. Then delete (ignore) all of these instances (i.e., rather than picking one arbitrarily via a unique function)
212+
* `IF(AND(ISNA(C2),ISNA(D2)),A2,"")` and `IF(AND(ISNA(C2),ISNA(D2)),B2,"")`
213+
9. Export as separate CSV files.
182214
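
Steps 3-6 could also be sketched on the command line, assuming a tab-delimited export with the alias symbols pipe-separated in column 2 and entrez_id in column 3 (the column positions are assumptions, not HGNC defaults):

```sh
# Keep rows with an entrez_id, expand pipe-separated aliases,
# drop symbols shorter than 3 characters, and dedupe pairs
awk -F'\t' '$3 != "" {
  n = split($2, a, "|")
  for (i = 1; i <= n; i++)
    if (length(a[i]) >= 3) print $3 "," a[i]
}' protein-coding_gene.txt | sort -u > alias.csv
```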

#### bioentities lexicon file

1. Start with this file from our fork of bioentities: https://raw.githubusercontent.com/wikipathways/bioentities/master/relations.csv. It captures complexes, generic symbols and gene families, e.g., "WNT" mapping to each of the WNT## entries.
2. Import the CSV into Excel, setting identifier columns to import as "text".
3. Delete the "isa" column. Add column names: type, symbol, type2, bioentities. Turn column filters on.
4. Filter on 'type' and make separate tabs for rows with "BE" and "HGNC" values. Sort the "be" tab by "symbol" (Column B).
5. Add a column to the "hgnc" tab based on =VLOOKUP(D2,be!B$2:D$116,3,FALSE). Copy/paste B and D into a new tab and copy/paste-special B and E to append the list. Sort bioentities and remove rows with #N/A.
6. Copy the f_symbol tab (from the hgnc protein-coding_gene workbook) and sort the symbol column. Then add an entrez_id column to bioentities via a lookup on hgnc symbol using =LOOKUP(A2,n_symbol.csv!$B$2:$B$19177,n_symbol.csv!$A$2:$A$19177).
7. Copy/paste-special the entrez_id and bioentities columns into a new tab. Filter for unique pairs.
8. Export as a CSV file.

#### WikiPathways human lists

1. Download the human GMT from http://data.wikipathways.org/current/gmt/
2. Import the GMT file into Excel
3. Select the complete matrix and name it 'matrix' (upper-left text field)
4. Insert a column and paste this into A1 (see the command-line sketch after this list):
   * =OFFSET(matrix,TRUNC((ROW()-ROW($A$1))/COLUMNS(matrix)),MOD(ROW()-ROW($A$1),COLUMNS(matrix)),1,1)
5. Copy the equation down to the bottom of the sheet, e.g., at least to =ROWS(matrix)*COLUMNS(matrix)
6. Filter out '0', then filter for unique
7. Export as a CSV file.
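
The flattening in steps 3-6 can also be sketched on the command line, assuming the standard GMT layout of name, description, then one gene per remaining column (the filenames are placeholders):

```sh
# Drop the name/description columns, put one gene per line, dedupe
cut -f3- wikipathways-human.gmt | tr '\t' '\n' | grep -v '^$' | sort -u > wp_hs.csv
```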

### organism names from taxdump

Taxonomy names file (names.dmp):

* tax_id -- the id of the node associated with this name
* name_txt -- the name itself
* unique name -- the unique variant of this name if the name is not unique
* name class -- (synonym, common name, ...)

```
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz names.dmp
...
rm names.dmp taxdump.tar.gz
```
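
As an illustration of working with this layout (a sketch, not necessarily the elided processing above; run it before the cleanup line removes names.dmp), the scientific names can be pulled with awk, since names.dmp fields are separated by `\t|\t`:

```sh
# Print tax_id and name_txt for rows whose name class is "scientific name"
awk -F'\t[|]\t' '$4 ~ /scientific name/ { print $1 "\t" $2 }' names.dmp
```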

### gene2pubmed, pmc2pmid & organism2pubmed

```
wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
gunzip gene2pubmed.gz
...
gunzip PMC-ids.csv.gz
```

### gene2pubtator & organism2pubtator

```
wget ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/gene2pubtator.gz
gunzip gene2pubtator.gz
...
tail -n +2 organism2pubtator.tsv | cut -f 1,2 | sort -u >> organism2pubtator_uni
rm species2pubtator organism2pubtator
```

### Running Database Queries

See database/README.md

database/README.md (+23 -3)

```sh
sudo systemctl enable postgresql
sudo systemctl start postgresql
```

Don't use this: `sudo systemctl start postgresql`

```sh
sudo su - postgres
psql
```

...

Exit from psql: `\q`.

```sh
exit
```

## Loading Lexicon

Load each of your source lexicon files in order of preference (use filename numbering, e.g., `1_symbol.csv`) to populate the unique `xrefs` and `symbols` tables, which are then referenced by the `lexicon` table. A temporary `s` table holds _previously seen_ symbols (i.e., from preferred sources) to exclude redundancy across sources. However, many-to-many mappings are expected _within_ a source, e.g., complexes and families.

```sh
# clear tables before inserting new content
delete from lexicon;
delete from xrefs;
...
drop table t;
drop table s;
```
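
After loading, a quick count confirms the tables were populated (a sketch; assumes psql connects to the project database by default):

```sh
psql -c 'select count(*) from lexicon;'
psql -c 'select count(*) from xrefs;'
```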

### Running Database Queries

Anders found the program [pgManage](https://github.com/pgManage/pgManage) easier to use than psql for interactive queries. You can install it by choosing the appropriate package from [releases](https://github.com/pgManage/pgManage/releases), e.g., `pgManage-10.3.0.dmg` for macOS.

#### Create a tunnel

Set up your ssh connection to rely on your SSH key (use `ssh-copy-id`). Then tunnel local port 3333 to remote port 5432:

```
ssh -L 3333:wikipathways-workspace.gladstone.internal:5432 [email protected]
```

Next open `pgManage` and create a connection named `pfocr_plus` with:

* Host: `127.0.0.1`
* Port: `3333`
* DB Name: `pfocr_plus`
* SSL Mode: `prefer`
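
The same tunnel also works for plain psql (the user name is a placeholder):

```sh
psql -h 127.0.0.1 -p 3333 -U postgres -d pfocr_plus
```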
