Commit fdc1410

Merge branch 'master' of github.com:wikipathways/pathway-figure-ocr

2 parents 0e60274 + 39c4cb2

File tree: 2 files changed, +111 -48 lines

codebook.md (+88 -45)
# Codebook for Pathway Figure OCR project

The sections below detail the steps taken to generate files and run scripts for this project.

### Install Dependencies

[Nix](https://nixos.org/nixos/nix-pills/install-on-your-running-system.html#idm140737316672400)

## PubMed Central Image Extraction

_These scripts are capable of populating the database with structured paper and figure information for future OCR runs._

This URL returns >77k figures from PMC articles matching "signaling pathway". Approximately 80% of these are actually pathway figures, which makes them a reasonably efficient source of sample figures for testing methods. _Consider other search terms and other sources when scaling up._

```
http://www.ncbi.nlm.nih.gov/pmc/?term=signaling+pathway&report=imagesdocsum
```
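
As a quick check on the size of the result set (a sketch, not part of the original workflow), the same term can be sent to the NCBI E-utilities esearch endpoint:

```sh
# Count PMC hits for the same search term via E-utilities
curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=signaling+pathway' \
  | grep -o '<Count>[0-9]*</Count>' | head -n 1
```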


### Scrape HTML

For sample sets you can simply save dozens of pages of results and quickly get thousands of pathway figures. _Consider automating this step when scaling up._

```
Set Display Settings: to max (100)
...
php pmc_image_parse.php
```

* depends on simple_html_dom.php
* outputs images as `PMC######__<filename>.<ext>`
* outputs captions as `PMC######__<filename>.<ext>.html`

_Consider loading caption information directly into the database and skipping the export of this HTML file._

These files are exported to a designated folder, e.g., pmc/20150501/images_all
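
A quick way to confirm that every image has a matching caption file (a sketch; path per the example above):

```sh
cd pmc/20150501/images_all
ls *.html | wc -l         # caption files
ls | grep -cv '\.html$'   # image files; the counts should match
```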

### Prune Images

Another manual step here increases the accuracy of downstream counts. Make a copy of the images_all dir, renaming it to images_pruned. View the extracted images in Finder, for example, and delete the pairs of files associated with figures that are not actually pathways. In this first sample run, ~20% of the images were pruned away; the most common non-pathway figures were of gel electrophoresis runs. _Consider automated ways to either exclude gel figures or select only pathway images to scale this step up._
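
One hypothetical way to script the pair deletion, given a hand-curated list of non-pathway image filenames (`pruned.txt` is an assumed name, one basename per line):

```sh
# Remove each rejected image together with its caption file
while read -r f; do
  rm "images_pruned/$f" "images_pruned/$f.html"
done < pruned.txt
```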

### Load into Database

Create database:

Enter nix-shell:

```
nix-shell
```

```
psql
\i database/create_tables.sql
\q
```

Load filenames (or paths) and extracted content into the database:

* papers (id, pmcid, title, url)
* figures (id, paper_id, filepath, figure_number, caption)

Enter nix-shell:

```
nix-shell
```

First time:

```sh
./pfocr.py load_figures
```

After first time:

```sh
sh ./copy_tables.sh
```
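
A quick sanity check that the load worked (a sketch; assumes psql connects to the project database by default):

```sh
psql -c 'select count(*) from papers;'
psql -c 'select count(*) from figures;'
```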

## Optical Character Recognition

_These scripts are capable of reading selected sets of figures from the database and performing individual runs of OCR._

### Read in Files from Database

* figures (filepath)

### Image Preprocessing

#### ImageMagick

Exploration of settings to improve OCR by pre-processing the images:

```
convert test1.jpg -colorspace gray test1_gr.jpg
convert test1_gr.jpg -threshold 50% test1_gr_th.jpg
convert test1_gr_th.jpg -define connected-components:verbose=true -define connected-components:area-threshold=400 -connected-components 4 -auto-level -depth 8 test1_gr_th_cc.jpg
```
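
The same pipeline could be applied to a whole folder; a sketch (paths are illustrative, settings as explored above):

```sh
for img in images_pruned/*.jpg; do
  base="${img%.jpg}"
  convert "$img" -colorspace gray "${base}_gr.jpg"
  convert "${base}_gr.jpg" -threshold 50% "${base}_gr_th.jpg"
done
```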

### Run Google Cloud Vision

* Set parameters
  * 'LanguageCode':'en' - to restrict to English language characters
* Produce JSON files

Enter nix-shell:

```
nix-shell
```

Caution: if you don't specify a `limit` value, it'll run until the last figure. The default `start` value is 0.

```sh
./pfocr.py ocr gcv --preprocessor noop --start 1 --limit 20
```
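
For later batches, advance `start` past the figures already processed; for example (a sketch, assuming `limit` caps the number of figures per run rather than naming an end index):

```sh
./pfocr.py ocr gcv --preprocessor noop --start 21 --limit 20
```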

Note: This command calls `ocr_pmc.py` at the end, passing along args and functions. The `ocr_pmc.py` script then:

* gets an `ocr_processor_id` corresponding to the unique hash of the processing parameters
* retrieves all figure rows and steps through them, starting with `start`
* runs image pre-processing
* performs OCR
* populates `ocr_processors__figures` with `ocr_processor_id`, `figure_id` and `result`

Example psql query to select words from the result:

```
select substring(regexp_replace(ta->>'description', E'[\\n\\r]+',',','g'),1,45) as word from ocr_processors__figures opf, json_array_elements(opf.result::json->'textAnnotations') AS ta ;
```

## Process Results

_These scripts are capable of processing the results from one or more OCR runs previously stored in the database._

### Create/update word tables for all extracted text

* `-n` for normalizations
* `-m` for mutations

Enter nix-shell:

```
nix-shell
```

```
bash run.sh
```

* populates `match_attempts` with all `figure_id` and `word_id` occurrences

### Create/update xref tables for all lexicon "hits"

* xrefs (id, xref)
* figures__xrefs (ocr_processor_id, figure_id, xref, symbol, unique_wp_hs, filepath)

Example psql query to rank-order figures by unique xrefs:

```
select figure_id, count(unique_wp_hs) as unique from figures__xrefs where unique_wp_hs = TRUE group by figure_id order by 2 desc;
```

* Export a table view to file. It can only write to the /tmp dir; then sftp to download (see the sketch below).

```
copy (select * from figures__xrefs) to '/tmp/filename.csv' with csv;
```
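
For example, to fetch the export over sftp (host and user are placeholders):

```sh
sftp user@your-server:/tmp/filename.csv .
```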

#### Exploring results

* Words extracted for a given paper:

```
select pmcid,figure_number,result from ocr_processors__figures join figures on figures.id=figure_id join papers on papers.id=figures.paper_id where pmcid='PMC2780819';
```

* All paper figures for a given word:

```
select pmcid, figure_number, word from match_attempts join words on words.id=word_id join figures on figures.id=figure_id join papers on papers.id=paper_id where word = 'AC' group by pmcid, figure_number,word;
```

### Collect run stats

* batches__ocr_processors (batch_id, ocr_processor_id)
* batches (timestamp, parameters, paper_count, figure_count, total_word_gross, total_word_unique, total_xrefs_gross, total_xrefs_unique)

166196
## Generating Files and Initial Tables
197+
167198
Do not apply upper() or remove non-alphanumerics during lexicon constuction. These normalizations will be applied in parallel to both the lexicon and extracted words during post-processing.
168199

169200
#### hgnc lexicon files
170-
1. Download ```protein-coding-gene``` TXT file from http://www.genenames.org/cgi-bin/statistics
171-
2. Import TXT into Excel, first setting all columns to "skip" then explicitly choosing "text" for symbol, alias_symbol, prev_symbol and entrez_id columns during import wizard (to avoid date conversion of SEPT1, etc)
172-
3. Delete rows without entrez_id mappings
173-
4. In separate tabs, expand 'alias symbol' and 'prev symbol' lists into single-value rows, maintaining entrez_id mappings for each row. Used Data>Text to Columns>Other:|>Column types:Text. Delete empty rows. Collapse multiple columns by pasting entrez_id before each column, sorting and stacking.
174-
5. Filter each list for unique pairs (only affected alias and prev)
175-
6. For **prev** and **alias**, only keep symbols of 3 or more characters, using:
176-
* `IF(LEN(B2)<3,"",B2)`
177-
7. Enter these formulas into columns C and D, next to sorted **alias** in order to "tag" all instances of symbols that match more than one entrez. Delete *all* of these instances.
178-
* `MATCH(B2,B3:B$###,0)` and `MATCH(B2,B$1:B1,0)`, where ### is last row in sheet.
179-
8. Then delete (ignore) all of these instances (i.e., rather than picking one arbitrarily via a unique function)
180-
* `IF(AND(ISNA(C2),ISNA(D2)),A2,"")` and `IF(AND(ISNA(C2),ISNA(D2)),B2,"")`
181-
9. Export as separate CSV files.
201+
202+
1. Download `protein-coding-gene` TXT file from http://www.genenames.org/cgi-bin/statistics
203+
2. Import TXT into Excel, first setting all columns to "skip" then explicitly choosing "text" for symbol, alias_symbol, prev_symbol and entrez_id columns during import wizard (to avoid date conversion of SEPT1, etc)
204+
3. Delete rows without entrez_id mappings
205+
4. In separate tabs, expand 'alias symbol' and 'prev symbol' lists into single-value rows, maintaining entrez_id mappings for each row. Used Data>Text to Columns>Other:|>Column types:Text. Delete empty rows. Collapse multiple columns by pasting entrez_id before each column, sorting and stacking.
206+
5. Filter each list for unique pairs (only affected alias and prev)
207+
6. For **prev** and **alias**, only keep symbols of 3 or more characters, using:
208+
* `IF(LEN(B2)<3,"",B2)`
209+
7. Enter these formulas into columns C and D, next to sorted **alias** in order to "tag" all instances of symbols that match more than one entrez. Delete _all_ of these instances.
210+
* `MATCH(B2,B3:B$###,0)` and `MATCH(B2,B$1:B1,0)`, where ### is last row in sheet.
211+
8. Then delete (ignore) all of these instances (i.e., rather than picking one arbitrarily via a unique function)
212+
* `IF(AND(ISNA(C2),ISNA(D2)),A2,"")` and `IF(AND(ISNA(C2),ISNA(D2)),B2,"")`
213+
9. Export as separate CSV files.
182214
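
Steps 3-6 could also be sketched on the command line, assuming a tab-delimited export with the alias symbols pipe-separated in column 2 and entrez_id in column 3 (the column positions are assumptions, not HGNC defaults):

```sh
# Keep rows with an entrez_id, expand pipe-separated aliases,
# drop symbols shorter than 3 characters, and dedupe pairs
awk -F'\t' '$3 != "" {
  n = split($2, a, "|")
  for (i = 1; i <= n; i++)
    if (length(a[i]) >= 3) print $3 "," a[i]
}' protein-coding_gene.txt | sort -u > alias.csv
```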

#### bioentities lexicon file

1. Start with this file from our fork of bioentities: https://raw.githubusercontent.com/wikipathways/bioentities/master/relations.csv. It captures complexes, generic symbols and gene families, e.g., "WNT" mapping to each of the WNT## entries.
2. Import the CSV into Excel, setting identifier columns to import as "text".
3. Delete the "isa" column. Add column names: type, symbol, type2, bioentities. Turn column filters on.
4. Filter on 'type' and make separate tabs for rows with "BE" and "HGNC" values. Sort the "be" tab by "symbol" (Column B).
5. Add a column to the "hgnc" tab based on =VLOOKUP(D2,be!B$2:D$116,3,FALSE). Copy/paste B and D into a new tab and copy/paste-special B and E to append the list. Sort bioentities and remove rows with #N/A.
6. Copy the f_symbol tab (from the hgnc protein-coding_gene workbook) and sort the symbol column. Then add an entrez_id column to bioentities via a lookup on hgnc symbol using =LOOKUP(A2,n_symbol.csv!$B$2:$B$19177,n_symbol.csv!$A$2:$A$19177).
7. Copy/paste-special the entrez_id and bioentities columns into a new tab. Filter for unique pairs.
8. Export as a CSV file.

#### WikiPathways human lists

1. Download the human GMT from http://data.wikipathways.org/current/gmt/
2. Import the GMT file into Excel
3. Select the complete matrix and name it 'matrix' (upper-left text field)
4. Insert a column and paste this into A1 (see the command-line sketch after this list):
   * =OFFSET(matrix,TRUNC((ROW()-ROW($A$1))/COLUMNS(matrix)),MOD(ROW()-ROW($A$1),COLUMNS(matrix)),1,1)
5. Copy the equation down to the bottom of the sheet, e.g., at least to =ROWS(matrix)*COLUMNS(matrix)
6. Filter out '0', then filter for unique
7. Export as a CSV file.
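
The flattening in steps 3-6 can also be sketched on the command line, assuming the standard GMT layout of name, description, then one gene per remaining column (the filenames are placeholders):

```sh
# Drop the name/description columns, put one gene per line, dedupe
cut -f3- wikipathways-human.gmt | tr '\t' '\n' | grep -v '^$' | sort -u > wp_hs.csv
```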

### organism names from taxdump

Taxonomy names file (names.dmp):

* tax_id -- the id of the node associated with this name
* name_txt -- the name itself
* unique name -- the unique variant of this name if the name is not unique
* name class -- (synonym, common name, ...)

```
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz names.dmp
...
rm names.dmp taxdump.tar.gz
```
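
As an illustration of working with this layout (a sketch, not necessarily the elided processing above; run it before the cleanup line removes names.dmp), the scientific names can be pulled with awk, since names.dmp fields are separated by `\t|\t`:

```sh
# Print tax_id and name_txt for rows whose name class is "scientific name"
awk -F'\t[|]\t' '$4 ~ /scientific name/ { print $1 "\t" $2 }' names.dmp
```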

### gene2pubmed, pmc2pmid & organism2pubmed

```
wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
gunzip gene2pubmed.gz
...
gunzip PMC-ids.csv.gz
```

### gene2pubtator & organism2pubtator

```
wget ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/gene2pubtator.gz
gunzip gene2pubtator.gz
...
tail -n +2 organism2pubtator.tsv | cut -f 1,2 | sort -u >> organism2pubtator_uni
rm species2pubtator organism2pubtator
```

### Running Database Queries

See database/README.md

database/README.md (+23 -3)

```sh
sudo systemctl enable postgresql
sudo systemctl start postgresql
```

Don't use this: `sudo systemctl start postgresql`

```sh
sudo su - postgres
psql
```

...

Exit from psql: `\q`.

```sh
exit
```

## Loading Lexicon

Load each of your source lexicon files in order of preference (use filename numbering, e.g., `1_symbol.csv`) to populate the unique `xrefs` and `symbols` tables, which are then referenced by the `lexicon` table. A temporary `s` table holds _previously seen_ symbols (i.e., from preferred sources) to exclude redundancy across sources. However, many-to-many mappings are expected _within_ a source, e.g., complexes and families.

```sh
# clear tables before inserting new content
delete from lexicon;
delete from xrefs;
...
drop table t;
drop table s;
```
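
After loading, a quick count confirms the tables were populated (a sketch; assumes psql connects to the project database by default):

```sh
psql -c 'select count(*) from lexicon;'
psql -c 'select count(*) from xrefs;'
```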

### Running Database Queries

Anders found the program [pgManage](https://github.com/pgManage/pgManage) easier to use than psql for interactive queries. You can install it by choosing the appropriate package from [releases](https://github.com/pgManage/pgManage/releases), e.g., `pgManage-10.3.0.dmg` for macOS.

#### Create a tunnel

Set up your ssh connection to rely on your SSH key (use `ssh-copy-id`). Then tunnel local port 3333 to remote port 5432:

```
ssh -L 3333:wikipathways-workspace.gladstone.internal:5432 [email protected]
```

Next open `pgManage` and create a connection named `pfocr_plus` with:

* Host: `127.0.0.1`
* Port: `3333`
* DB Name: `pfocr_plus`
* SSL Mode: `prefer`
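
The same tunnel also works for plain psql (the user name is a placeholder):

```sh
psql -h 127.0.0.1 -p 3333 -U postgres -d pfocr_plus
```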
