_These scripts are capable of populating the database with structured paper and figure information for future OCR runs._

This URL returns >77k figures from PMC articles matching "signaling pathways". Approximately 80% of these are actually pathway figures. These make a reasonably efficient source of sample figures to test methods. _Consider other search terms and other sources when scaling up._
For sample sets you can simply save dozens of pages of results and quickly get 1000s of pathway figures. _Consider automating this step when scaling up._

```
Set Display Settings: to max (100)
```

```
php pmc_image_parse.php
```
* depends on simple_html_dom.php
* outputs images as "PMC######\_\_<filename>.<ext>"
* outputs captions as "PMC######\_\_<filename>.<ext>.html"

_Consider loading caption information directly into the database and skipping the export of this HTML file._
These files are exported to a designated folder, e.g., pmc/20150501/images_all
### Prune Images
Another manual step here to increase the accuracy of downstream counts. Make a copy of the images_all dir, renaming it to images_pruned. View the extracted images in Finder, for example, and delete the pairs of files associated with figures that are not actually pathways. In this first sample run, ~20% of images were pruned away. The most common non-pathway figures were of gel electrophoresis runs. _Consider automated ways to either exclude gel figures or select only pathway images to scale this step up._
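One possible automated filter, sketched below, exploits the fact that gel photos are typically near-grayscale while pathway diagrams usually contain saturated color. This is only a rough heuristic and not part of the current pipeline; the function name and cutoff are invented for illustration.

```python
def likely_gel(pixels, cutoff=10):
    """Rough heuristic: flag images whose pixels are nearly grayscale.

    pixels: iterable of (r, g, b) tuples sampled from a figure.
    Returns True when the mean channel spread (max - min) is below the
    cutoff, i.e., the image has almost no color and may be a gel photo.
    """
    spreads = [max(p) - min(p) for p in pixels]
    return sum(spreads) / len(spreads) < cutoff
```

A colorful pathway diagram yields large channel spreads and would be kept; a grayscale gel photo would be flagged for pruning.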
### Load into Database
Create database:

Enter nix-shell:

```
nix-shell
```

```
psql
\i database/create_tables.sql
\q
```
Load filenames (or paths) and extracted content into database
Note: This command calls `ocr_pmc.py` at the end, passing along args and functions. The `ocr_pmc.py` script then:
* gets an `ocr_processor_id` corresponding to the unique hash of processing parameters
* retrieves all figure rows and steps through them, starting with `start`
* runs image pre-processing
* performs OCR
* populates `ocr_processors__figures` with `ocr_processor_id`, `figure_id` and `result`
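As a rough sketch, the loop described above amounts to something like the following (all helper names here are hypothetical placeholders, not the actual internals of `ocr_pmc.py`):

```python
def run_ocr(figures, ocr_processor_id, start=0, preprocess=None, ocr=None):
    """Hypothetical sketch of the figure-processing loop described above."""
    rows = []
    for figure in figures[start:]:          # step through rows from `start`
        image = preprocess(figure)          # image pre-processing
        text = ocr(image)                   # perform OCR
        # each tuple is destined for the ocr_processors__figures table
        rows.append((ocr_processor_id, figure["id"], text))
    return rows
```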
Example psql query to select words from result:

```
select substring(regexp_replace(ta->>'description', E'[\\n\\r]+',',','g'),1,45) as word from ocr_processors__figures opf, json_array_elements(opf.result::json->'textAnnotations') AS ta ;
```
## Process Results
_These scripts are capable of processing the results from one or more ocr runs previously stored in the database._
### Create/update word tables for all extracted text
-n for normalizations
-m for mutations
Enter nix-shell:
```
nix-shell
```
```
bash run.sh
```

* populates `match_attempts` with all `figure_id` and `word_id` occurrences
### Create/update xref tables for all lexicon "hits"
Example psql query to rank order figures by unique xrefs:
```
select figure_id, count(unique_wp_hs) as "unique" from figures__xrefs where unique_wp_hs = TRUE group by figure_id order by 2 desc;
```
* Export a table view to file. Can only write to /tmp dir; then sftp to download.
```
copy (select * from figures__xrefs) to '/tmp/filename.csv' with csv;
```
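The download step might then look like this from your local machine (the login and host are placeholders):

```shell
# fetch the CSV exported to /tmp on the database server
# (replace user@dbserver with your actual login)
sftp user@dbserver:/tmp/filename.csv .
```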
#### Exploring results
* Words extracted for a given paper:
```
select pmcid,figure_number,result from ocr_processors__figures join figures on figures.id=figure_id join papers on papers.id=figures.paper_id where pmcid='PMC2780819';
```
* All paper figures for a given word:
```
select pmcid, figure_number, word from match_attempts join words on words.id=word_id join figures on figures.id=figure_id join papers on papers.id=paper_id where word = 'AC' group by pmcid, figure_number,word;
```
Do not apply upper() or remove non-alphanumerics during lexicon construction. These normalizations will be applied in parallel to both the lexicon and extracted words during post-processing.
#### hgnc lexicon files
1. Download the `protein-coding-gene` TXT file from http://www.genenames.org/cgi-bin/statistics
2. Import the TXT into Excel, first setting all columns to "skip", then explicitly choosing "text" for the symbol, alias_symbol, prev_symbol and entrez_id columns during the import wizard (to avoid date conversion of SEPT1, etc.)
3. Delete rows without entrez_id mappings
4. In separate tabs, expand the 'alias symbol' and 'prev symbol' lists into single-value rows, maintaining entrez_id mappings for each row. Use Data > Text to Columns > Other:| > Column types: Text. Delete empty rows. Collapse multiple columns by pasting entrez_id before each column, sorting and stacking.
5. Filter each list for unique pairs (only affects alias and prev)
6. For **prev** and **alias**, only keep symbols of 3 or more characters, using:
   * `IF(LEN(B2)<3,"",B2)`
7. Enter these formulas into columns C and D, next to the sorted **alias** column, in order to "tag" all instances of symbols that match more than one entrez_id:
   * `MATCH(B2,B3:B$###,0)` and `MATCH(B2,B$1:B1,0)`, where ### is the last row in the sheet.
8. Then delete (ignore) all of these instances (i.e., rather than picking one arbitrarily via a unique function):
   * `IF(AND(ISNA(C2),ISNA(D2)),A2,"")` and `IF(AND(ISNA(C2),ISNA(D2)),B2,"")`
9. Export as separate CSV files.
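The filtering rules in steps 5-8 could equally be scripted. A minimal sketch, assuming simple (entrez_id, symbol) pairs as input; the function name and shape are invented for illustration:

```python
from collections import defaultdict

def filter_symbols(pairs, min_len=3):
    """Keep unique (entrez_id, symbol) pairs whose symbol is at least
    min_len characters and maps to exactly one entrez_id; ambiguous
    symbols are dropped entirely rather than picking one arbitrarily."""
    mapping = defaultdict(set)
    for entrez_id, symbol in set(pairs):     # step 5: unique pairs only
        if len(symbol) >= min_len:           # step 6: drop short symbols
            mapping[symbol].add(entrez_id)
    # steps 7-8: discard symbols matching more than one entrez_id
    return {s: ids.pop() for s, ids in mapping.items() if len(ids) == 1}
```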
#### bioentities lexicon file
1. Start with this file from our fork of bioentities: https://raw.githubusercontent.com/wikipathways/bioentities/master/relations.csv. It captures complexes, generic symbols and gene families, e.g., "WNT" mapping to each of the WNT## entries.
2. Import the CSV into Excel, setting identifier columns to import as "text".
4. Filter on 'type' and make separate tabs for rows with "BE" and "HGNC" values. Sort the "be" tab by "symbol" (Column B).
5. Add a column to the "hgnc" tab based on `=VLOOKUP(D2,be!B$2:D$116,3,FALSE)`. Copy/paste B and D into a new tab and copy/paste-special B and E to append the list. Sort bioentities and remove rows with #N/A.
6. Copy the f_symbol tab (from the hgnc protein-coding_gene workbook) and sort the symbol column. Then add an entrez_id column to bioentities via a lookup on the hgnc symbol using `=LOOKUP(A2,n_symbol.csv!$B$2:$B$19177,n_symbol.csv!$A$2:$A$19177)`.
7. Copy/paste-special the entrez_id and bioentities columns into a new tab. Filter for unique pairs.
8. Export as a CSV file.
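Steps 5-7 above are essentially a join from bioentities symbols to entrez_ids via HGNC symbols. A scripted sketch with invented names (not the workbook formulas):

```python
def map_bioentities(relations, symbol_to_entrez):
    """relations: (bioentities_symbol, hgnc_symbol) pairs.
    symbol_to_entrez: dict of hgnc symbol -> entrez_id.
    Returns unique (entrez_id, bioentities_symbol) pairs, dropping rows
    with no entrez mapping (the #N/A rows in the workbook)."""
    out = set()
    for be_symbol, hgnc_symbol in relations:
        entrez_id = symbol_to_entrez.get(hgnc_symbol)
        if entrez_id is not None:
            out.add((entrez_id, be_symbol))
    return sorted(out)
```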
#### WikiPathways human lists
1. Download the human GMT from http://data.wikipathways.org/current/gmt/
2. Import the GMT file into Excel
3. Select the complete matrix and name it 'matrix' (upper left text field)
## Loading Lexicon

Load each of your source lexicon files in order of preference (use filename numbering, e.g., `1_symbol.csv`) to populate unique `xrefs` and `symbols` tables which are then referenced by the `lexicon` table. A temporary `s` table holds _previously seen_ symbols (i.e., from preferred sources) to exclude redundancy across sources. However, many-to-many mappings are expected _within_ a source, e.g., complexes and families.

```sh
# clear tables before inserting new content
delete from lexicon;
delete from xrefs;
# ... (intermediate statements elided) ...
drop table t;
drop table s;
```
### Running Database Queries
Anders found the program [pgManage](https://github.com/pgManage/pgManage) easier to use than psql for interactive queries. You can install it by choosing the appropriate package from [releases](https://github.com/pgManage/pgManage/releases), e.g., `pgManage-10.3.0.dmg` for macOS.
#### Create a tunnel
Set up your ssh connection to rely on your SSH key (use `ssh-copy-id`). Then tunnel local port 3333 to remote port 5432:
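For example (user@dbserver is a placeholder for your actual login and database host):

```shell
# forward local port 3333 to port 5432 on the remote database server;
# -N opens the tunnel without running a remote command
ssh -N -L 3333:localhost:5432 user@dbserver
```

pgManage can then connect to localhost:3333 as if it were the remote Postgres instance.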