Transgene first pass
Arun wrote a script to identify transgene names that match the Is and In regular expression patterns. SCRIPT: //finds all Is and In transgenes. Its output for the C. elegans corpus is used as the input file for update_textpresso_transgene.pl*
SCRIPT: //finds all Is, In, and Ex expressions plus any possible genomic expression following the transgene regular expression name.
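Arun's actual patterns are not reproduced on this page. As a rough sketch of what matching a canonical transgene name could look like, the example below assumes a name is a lab prefix of one to three lowercase letters, followed by Is, In, or Ex, followed by digits. The pattern and the helper name find_transgenes are assumptions made for illustration, not the Textpresso implementation.

```perl
use strict;
use warnings;

# Assumed canonical pattern (not taken from Arun's script):
# 1-3 lowercase letters, then Is, In, or Ex, then one or more digits.
my $TG_PATTERN = qr/\b([a-z]{1,3}(Is|In|Ex)\d+)\b/;

# Hypothetical helper: return every candidate transgene name in a text
# as [name, type] pairs, in order of appearance.
sub find_transgenes {
    my ($text) = @_;
    my @hits;
    while ($text =~ /$TG_PATTERN/g) {
        push @hits, [$1, $2];
    }
    return @hits;
}

my $sentence = 'Animals carrying syIs13 or kyEx200 were crossed to hIn1 balancers.';
for my $hit (find_transgenes($sentence)) {
    printf "%s\t%s\n", @$hit;
}
```

Note that a pattern like this also picks up hIn1, which is an inversion rather than a transgene. This is the kind of false positive discussed further down the page.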
These output files are available here: http://textpresso-dev.caltech.edu/transgene/
The output file is: http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out
Note: these scripts will miss transgenes that do not have a canonical name. http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out organizes the results by transgene, with the papers listed below each transgene; "…paper.sup.1" means the transgene name was mentioned in the supplementary file.
From Arun's list (see e-mail from Arun, July 22, 2010), updated on the 15th of every month: http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out http://textpresso-dev.caltech.edu/transgene/transgenes_summary_in_regular_papers.out
The following parsing script can be run to get a list of high-priority transgenes (those extrachromosomal arrays that appear in more than one paper). See message from Wen (Jun, 2010).
The script is written and placed on spica: /home/citpub/Karen/TgSummary/getNewTg.pl
It creates two files:
1. /home/citpub/Karen/TgSummary/NewTg.txt This is the list of all new Ex lines not in WS216.
2. /home/citpub/Karen/TgSummary/NewTgHighPriority.txt This is the list of all new Ex lines that have at least 2 paper entries.
Here is how to run the script and the results:
[citpub@spica]$ ./getNewTg.pl //This script screens the Textpresso output of transgenes, gets rid of those that already exist in WormBase.
Input file 1: WSTg.ace. This file lists all transgenes already in WormBase. Update this list before running the script by dumping the list of transgene names from an updated WS release.
Input file 2: transgenes_summary.out. This is the output file from the Textpresso scan.
Output file: NewTg.txt, which also gives the following information
4951 new Ex transgenes found. 607 Ex transgenes has more than 2 paper entries.
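The source of getNewTg.pl is not shown here, so the following is only a sketch of the filtering it is described as doing: drop names already in WormBase, keep Ex lines, and flag those with at least 2 papers. The helper name new_ex_transgenes, the in-memory data structures, and the sample names and paper IDs are all made up for illustration; the real script reads WSTg.ace and transgenes_summary.out, whose exact formats are not documented on this page.

```perl
use strict;
use warnings;

# Hypothetical helper. $known is a hash ref used as a set of names already
# in WormBase; $hits maps each Textpresso name to an array ref of papers.
# Returns two array refs: all new Ex names, and the new Ex names that
# appear in at least $min_papers distinct papers.
sub new_ex_transgenes {
    my ($known, $hits, $min_papers) = @_;
    my (@new, @high);
    for my $tg (sort keys %$hits) {
        next unless $tg =~ /Ex\d+$/;    # only extrachromosomal arrays
        next if $known->{$tg};          # already in WormBase
        push @new, $tg;
        my %distinct = map { $_ => 1 } @{ $hits->{$tg} };
        push @high, $tg if keys(%distinct) >= $min_papers;
    }
    return (\@new, \@high);
}

# Made-up sample data.
my %known = (aaEx1 => 1);
my %hits  = (
    aaEx1 => ['WBPaper00000001', 'WBPaper00000002'],
    aaEx2 => ['WBPaper00000001'],
    aaEx3 => ['WBPaper00000001', 'WBPaper00000003'],
);
my ($new, $high) = new_ex_transgenes(\%known, \%hits, 2);
print scalar(@$new), " new Ex transgenes found.\n";
print scalar(@$high), " Ex transgenes have at least 2 paper entries.\n";
```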
To update the Tg priority lists,
-update WSTg.ace ->query out all Tg from citace after the upload? or get a list from postgres.
-update transgenes_summary.out output file from Textpresso ->still not sure how to do this, if the script is available for me to run or if I need to ask Arun.
All the transgene-paper links are entered into the postgres database automatically by Juancarlos's script update_textpresso_transgene.pl* on tazendra at /home/postgres/work/pgpopulation/textpresso/transgene. This script runs every day at 4am; it populates the transgene postgres table with all new Is transgenes and adds new references to pre-existing transgenes.
At this moment, only the 'Is' lines are entered into postgres as new objects. 'In' lines are not entered into postgres unless the transgene name already matches something that is in the database (which means that they have already been confirmed as valid transgenes).
'Is' lines are entered with Arun as a curator so these new entries are easy to find.
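The rule described above corresponds to a three-way choice in the script: a name that already exists in postgres gets its new papers added, an unknown name containing "In" is skipped, and any other unknown name becomes a new object. A minimal sketch of that decision follows; the function name classify_hit and its return labels are hypothetical, and only the order of the tests follows the script.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the if / elsif / else in the script.
# $valid is a hash ref of names and synonyms that already exist in postgres.
sub classify_hit {
    my ($tg, $valid) = @_;
    return 'add_papers' if $valid->{$tg};    # known name: only add references
    return 'skip'       if $tg =~ /In/;      # unknown In line: do nothing
    return 'create';                         # unknown Is line: new object
}

my %valid = (syIs13 => 1, mnIn5 => 1);       # made-up contents
for my $tg (qw(syIs13 mnIn5 aaIn7 aaIs9)) {
    printf "%s\t%s\n", $tg, classify_hit($tg, \%valid);
}
```

Because the known-name test comes first, an In line that a curator has already confirmed still receives new paper links.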
Scripts: /home/postgres/work/pgpopulation/textpresso/transgene/update_textpresso_transgene.pl* (this was originally spelled update_textpreso_transgene.pl; J fixed the spelling 7/11/11 and fixed it in the wrapper as well as in antibody and rnai).
The scripts repeatedly pick up and enter anything that is new or not curated. Therefore, if a bad object was picked up (e.g. a typo, mistake, or false object), that object had to be added to an exclusion list so it would not be entered again. The steps previously taken to deal with these objects are below. We now avoid those steps by using a false-positive toggle that can be applied to an object in a single line. This allows the object to be declared false in conjunction with certain papers while still allowing it to be declared a real object in papers where it might be real. These false objects are not dumped by the .ace dumper script and remain visible in the OA table.
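The per-paper behaviour of the toggle can be pictured as a two-level lookup: a name is suppressed either for every paper or only for the papers in which it was marked false. The sketch below assumes the exclusions have already been loaded into a hash; the helper name is_false_positive and the paper IDs are made up, and the real script fills its hash from the trp_objpap_falsepos table.

```perl
use strict;
use warnings;

# Made-up exclusion data: name => { paper => 1 }, with the special key
# 'all' meaning the name is never a transgene.
my %false_pos = (
    hIn1     => { all => 1 },
    syIs1319 => { WBPaper00000001 => 1 },
);

# Hypothetical helper: true if this name should be ignored for this paper.
# The exists test avoids creating empty entries for unknown names.
sub is_false_positive {
    my ($fp, $tg, $paper) = @_;
    return 0 unless exists $fp->{$tg};
    return 1 if $fp->{$tg}{all};
    return 1 if $fp->{$tg}{$paper};
    return 0;
}

for my $pair (['hIn1', 'WBPaper00000002'],
              ['syIs1319', 'WBPaper00000001'],
              ['syIs1319', 'WBPaper00000002']) {
    my ($tg, $paper) = @$pair;
    printf "%s in %s: %s\n", $tg, $paper,
        is_false_positive(\%false_pos, $tg, $paper) ? 'skip' : 'keep';
}
```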
For Reference, the old pipeline for removing these objects:
The Textpresso scan extracts some false positive transgene hits. These objects need to be deleted from postgres and added to the transgene object exclusion list so they will not be picked up again during future transgene object scans. The transgene exclusion (false positive) list lives:
on tazendra
/home/acedb/wen/phenote_transgene/ObsoleteTg.txt (open question: when is this file read, and by which scripts?)
This file is important and needs to be edited every time a new false positive is discovered. Some examples of false positives include:
- "In" lines that are chromosomal inversions rather than transgenes.
- Typos in transgene names that get published. These should not go to postgres as their own entities, but should be noted in the remark or listed as a synonym, if appropriate, for the real transgene.
- Textpresso mishandling of transgene names, such as when a paper refers to transgenes as syIs13-19 and Textpresso reports syIs1319 (i.e., the hyphen sometimes disappears during pdf2Text conversion). Curators should enter all the individual transgene objects into postgres and obsolete the syIs1319 object.
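For the hyphenated case, a curator working from the name as printed in the paper could list the individual objects to enter with a small helper like the one below. This is a hypothetical convenience, not part of the pipeline, and it assumes the hyphen denotes an inclusive numeric range. It cannot recover the range from the garbled form (syIs1319), because the hyphen is already lost at that point.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a printed range such as syIs13-19 into the
# individual names syIs13 .. syIs19. Anything else is returned unchanged.
sub expand_range {
    my ($name) = @_;
    if ($name =~ /^([a-z]{1,3}(?:Is|In|Ex))(\d+)-(\d+)$/) {
        my ($stem, $first, $last) = ($1, $2 + 0, $3 + 0);
        return map { $stem . $_ } $first .. $last if $first <= $last;
    }
    return ($name);
}

print join(' ', expand_range('syIs13-19')), "\n";
```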
    use strict;
    use diagnostics;
    use DBI;
    use LWP::Simple;
    use Jex;

    my $date = &getSimpleDate();
    my $timestamp = &getSimpleSecDate();

    my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n";
    my $result;

    my $directory = '/home/postgres/work/pgpopulation/textpresso/transgene';
    chdir($directory) or die "Cannot go to $directory ($!)";

    $/ = undef;    #reads whole file instead of line by line
    my $full_file = 'transgenes_in_regular_papers.out';
    open (IN, "<$full_file") or die "Cannot open $full_file : $!";
    my $last_data = <IN>;
    close (IN) or die "Cannot close $full_file : $!";
    $/ = "\n";
    my $new_data = get "http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out";
    # UNCOMMENT THIS
    exit if ($last_data eq $new_data);
    open (OUT, ">$full_file") or die "Cannot rewrite $full_file : $!";
    print OUT $new_data;
    close (OUT) or die "Cannot close $full_file : $!";

    my (@tlines) = split/\n/, $new_data;
    my @pgcommands;

    # no longer populate textpresso firstpass tables for this, transgene is not a FP curation field in form. 2010 08 224
    #
    # my $logfile = $directory . '/logfile.pg';
    # open (LOG, ">$logfile") or die "Cannot rewrite $logfile : $!";
    #
    # push @pgcommands, "DELETE FROM tfp_transgene;";
    # foreach my $line (@tlines) {
    #   my ($paper, @transgenes) = split/\s+/, $line;
    #   if ($paper =~ m/(WBPaper\d+)/) { $paper = $1; }
    #   my ($joinkey) = $paper =~ m/WBPaper(\d+)/;
    #   push @pgcommands, "INSERT INTO tfp_transgene VALUES ('$joinkey', '$line');";
    # } # foreach my $line (@tlines)
    #
    # foreach my $command (@pgcommands) {
    #   print LOG "$command\n";
    #   $result = $dbh->do( $command );
    # } # foreach my $command (@pgcommands)
    #
    # close (LOG) or die "Cannot close $logfile : $!";

    #we don't appear to be doing anything with these outfiles 7/11/11
    my $outfile = $directory . '/transgene_textpresso';
    my $outfile2 = $directory . '/new_transgene_textpresso';    #supposed to be used for keeping track of In transgenes
    my $outfile3 = $directory . '/transgene_textpresso.pg';
    # open (OUT, ">>$outfile") or die "Cannot create $outfile : $!";
    # open (OU2, ">>$outfile2") or die "Cannot create $outfile2 : $!";
    # open (OU3, ">>$outfile3") or die "Cannot create $outfile3 : $!";

    my %syns;
    my %valid;    #transgenes are assigned as valid as they already exist in postgres in name or synonym
    $result = $dbh->prepare( "SELECT trp_name.trp_name, trp_synonym.trp_synonym FROM trp_name, trp_synonym WHERE trp_name.joinkey = trp_synonym.joinkey;" );
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      my (@syns) = split/ \| /, $row[1];
      $valid{$row[0]}++;
      foreach my $syn (@syns) { $valid{$syn}++; $syns{$syn} = $row[0]; }
    }
    $result = $dbh->prepare( "SELECT trp_name FROM trp_name;" );
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) { $valid{$row[0]}++; }

    my %obs;    # % is used to signify a hash variable, $ is used to signify a value inside a % (hash)
    # no longer use flatfile, use trp_objpap_falsepos table (Fail) 2010 08 24
    # my $infile = '/home/acedb/wen/phenote_transgene/ObsoleteTg.txt';
    # open (IN, "<$infile") or die "Cannot open $infile : $!";
    # while (my $line = <IN>) {
    #   chomp $line;
    #   next unless $line;
    #   next if ($line =~ m/^\/\//);
    # #   my ($tg, $paper, $comment) = split/\t/, $line;    # wen keeps forgetting tabs
    #   my ($tg, $paper);
    #   if ($line =~ m/^(\S+)\s+(\S+)/) { $tg = $1; $paper = $2; }
    #   $tg =~ s/\s+//g;
    #   unless ($paper) { $paper = 'all'; }
    #   $obs{$tg}{$paper}++;
    # #   print "OBS $tg PAP $paper E\n";
    # } # while (my $line = <IN>) {
    # close (IN) or die "Cannot close $infile : $!";
    $result = $dbh->prepare( "SELECT trp_name.joinkey, trp_name.trp_name, trp_paper.trp_paper FROM trp_name, trp_paper WHERE trp_name.joinkey = trp_paper.joinkey AND trp_name.joinkey IN (SELECT joinkey FROM trp_objpap_falsepos WHERE trp_objpap_falsepos = 'Fail');" );
    #matches names from Arun's output to those with fails and does not enter them again into postgres:
    #finding names and papers with ids that belong to the fails and puts them in the %obs (an obsolete hash).
    #these obsoletes will not be entered into postgres
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      my (@papers) = split/","/, $row[2];
      foreach my $pap (@papers) { $pap =~ s/\"//g; $obs{$row[1]}{$pap}++; }
    }
    $obs{hIn1}{all}++;
    $obs{mIn1}{all}++;

    my %tdata;    #a hash of all textpresso data
    foreach my $line (@tlines) {
      chomp $line;
    #   my ($paper, $tg) = split/ /, $line;
      my ($paper, $tg) = $line =~ m/^(\S+)\s+(.*)$/;    # arun changed the format again 2010 02 27; gets papers and transgenes
      ($paper) = $paper =~ m/(WBPaper\d+)/;    #includes all .sup papers as WBPaperID alone
    #   unless ($tg) { print "-- NO TG $line\n"; }
      my (@tg) = split/\s/, $tg;
      foreach my $tg (@tg) {
        if ($syns{$tg}) { $tg = $syns{$tg}; }    #gets transgene names if synonyms
        next if ($obs{$tg}{'all'});
        next if ($obs{$tg}{$paper});    #if not obsolete, paper and name are retrieved
        $tdata{$tg}{$paper}++;    #adds anything that hasn't been skipped into textpresso data hash
      }
    } # foreach my $line (@lines)

    my %pdata;    #postgres data hash
    $result = $dbh->prepare( "SELECT trp_name.trp_name, trp_paper.trp_paper FROM trp_name, trp_paper WHERE trp_name.joinkey = trp_paper.joinkey;" );    #queries name and paper from postgres
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      my $tg = $row[0];
    #   my (@papers) = split/ | /, $row[1];
      my (@papers) = split/\",\"/, $row[1];
      foreach my $paper (@papers) { $paper =~ s/\"//g; delete $tdata{$tg}{$paper}; }
    }
    #if transgene and paper exist in textpresso data, those data are deleted from the textpresso data,
    #will keep transgenes whose names do match but have a different paper value, both valid and invalid data exists

    foreach my $tg (sort keys %tdata) {    #this section is about adding the data to postgres, choosing one of three actions
      my (@papers) = sort keys %{ $tdata{$tg} };
    #   my $papers = join" | ", @papers;
      if ($papers[0]) {
        if ($valid{$tg}) { &addToTg($tg, \@papers, $timestamp ); }    #if valid (name or synonym exists in postgres, updates paper, not timestamp
        elsif ($tg =~ m/In/) { 1; }    # print OU2 "$timestamp new $tg in $papers\n"; { # }, do not do anything with In transgenes
        else { &newTg($tg, \@papers, $timestamp ); }    #not valid (doesn't already exist, create new one
      } # if ($papers)
    } # foreach my $paper (sort keys %tdata)

    # close (OUT) or die "Cannot close $outfile : $!";
    # close (OU2) or die "Cannot close $outfile2 : $!";
    # close (OU3) or die "Cannot close $outfile3 : $!";
    # print "not same\n";

    sub addToTg {    #puts papers in postgres format
      my ($tg, $arref_papers, $date) = @_;
      my @papers = @$arref_papers;
      my %papers;
      my $papers = join"\",\"", @papers;
      $papers = '"' . $papers . '"';    #puts paper in postgres format
      foreach (@papers) { $papers{$_}++; }
      my @pgcommands;
      print "\nADD to tg $date more papers $tg in $papers\n";
    #   print OUT "$date more papers $tg in $papers\n";
      my %joinkeys;    # get all joinkeys that refer to this Tg
      $result = $dbh->prepare( "SELECT * FROM trp_name WHERE trp_name = '$tg';" );    #queries for pgid with 'nonvalid' transgene name
      $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
      while (my @row = $result->fetchrow) { $joinkeys{$row[0]}++; }
      foreach my $joinkey (keys %joinkeys) {    # for all joinkeys of that Tg
        $result = $dbh->prepare( "SELECT trp_paper FROM trp_paper WHERE joinkey = '$joinkey';" );    #trp_paper should have been changed from trp_reference
        $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
        my @row = $result->fetchrow;
        if ($row[0]) {    # if there's a reference, append it, add new paper to pgid
          my (@paps) = split/\",\"/, $row[0];
          foreach (@paps) { $_ =~ s/\"//g; $papers{$_}++; }
    #       $papers = "$row[0] | $papers";
          (@paps) = sort keys %papers;
          $papers = join"\",\"", @paps;
          $papers = '"' . $papers . '"';
          my $command = "UPDATE trp_paper SET trp_paper = '$papers' WHERE joinkey = '$joinkey';";    #apply postgres command
          push @pgcommands, $command;
    #       print OU3 "$command -- $date\n";
    #       my $result2 = $dbh->do( $command );
        }
        else {    # if new reference, add it
          my $command = "INSERT INTO trp_paper VALUES ('$joinkey', '$papers');";    #if there is no paper, insert the paper
          push @pgcommands, $command;
    #       print OU3 "$command -- $date\n";
    #       my $result2 = $dbh->do( $command );
        }
        my $command = "INSERT INTO trp_paper_hst VALUES ('$joinkey', '$papers');";    #add it to the history table, needs to be changed from trp_reference_hst
        push @pgcommands, $command;
    #     print OU3 "$command -- $date\n";
    #     my $result2 = $dbh->do( $command );
      } # foreach my $joinkey (keys %joinkeys)
      foreach my $pgcommand (@pgcommands) {
        print "$pgcommand -- $date\n";    #keeps log
    # UNCOMMENT TO POPULATE PG
        my $result2 = $dbh->do( $pgcommand );
      }
    } # sub addToTg

    sub newTg {    #for new transgenes
      my ($tg, $arref_papers, $date) = @_;
      my @papers = @$arref_papers;
      my $papers = join"\",\"", @papers;
      $papers = '"' . $papers . '"';
      my @pgcommands;
      my $joinkey = 0;
      $result = $dbh->prepare( "SELECT * FROM trp_name;" );    #queries the transgenes name table to find the highest pgid
      $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
      while (my @row = $result->fetchrow) { if ($row[0] > $joinkey) { $joinkey = $row[0]; } }
      $joinkey++;    #adds 1 to the highest pgid to create the next available value
      print "\nNEW tg $date new $tg in $papers\n";
    #   print OUT "$date new $tg in $papers\n";
      my $command = "INSERT INTO trp_name VALUES ('$joinkey', '$tg');";
      push @pgcommands, $command;
    #   print OU3 "$command -- $date\n";
    #   my $result2 = $dbh->do( $command );
      $command = "INSERT INTO trp_name_hst VALUES ('$joinkey', '$tg');";
      push @pgcommands, $command;
    #   print OU3 "$command -- $date\n";
    #   $result2 = $dbh->do( $command );
      $command = "INSERT INTO trp_paper VALUES ('$joinkey', '$papers');";    #changed from trp_reference
      push @pgcommands, $command;
    #   print OU3 "$command -- $date\n";
    #   $result2 = $dbh->do( $command );
      $command = "INSERT INTO trp_paper_hst VALUES ('$joinkey', '$papers');";    #changed from trp_reference_hst
      push @pgcommands, $command;
    #   print OU3 "$command -- $date\n";
    #   $result2 = $dbh->do( $command );
      $command = "INSERT INTO trp_curator VALUES ('$joinkey', 'WBPerson4793');";    #adds Arun to curator for new transgenes
      push @pgcommands, $command;
      $command = "INSERT INTO trp_curator_hst VALUES ('$joinkey', 'WBPerson4793');";    #adds Arun
      push @pgcommands, $command;
      foreach my $pgcommand (@pgcommands) {
        print "$pgcommand -- $date\n";
    # UNCOMMENT TO POPULATE PG
        my $result2 = $dbh->do( $pgcommand );
      }
    } # sub newTg

    __END__
    # look at textpresso transgene data and update postgres based on that.
    # 2008 10 07
    #
    # wasn't checking regular names, oops. 2008 10 20
    #
    #
    # update postgres based on value, cron job
    # 0 2 * * mon /home/postgres/work/pgpopulation/transgene/textpresso_transgene/textpresso_transgene.pl
    # 2008 10 14
    #
    # run every day 2009 02 23
    # 0 2 * * * /home/postgres/work/pgpopulation/transgene/textpresso_transgene/textpresso_transgene.pl

    use LWP::Simple;
    use strict;
    use diagnostics;
    use Pg;
    use Jex;

    my $directory = '/home/postgres/work/pgpopulation/transgene/textpresso_transgene';
    chdir($directory) or die "Cannot go to $directory ($!)";

    my $conn = Pg::connectdb("dbname=testdb");
    die $conn->errorMessage unless PGRES_CONNECTION_OK eq $conn->status;

    my $date = &getSimpleSecDate();

    __END__
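In the addToTg subroutine of update_textpresso_transgene.pl, the trp_paper value is handled as a single string of double-quoted paper IDs separated by commas, and new papers are merged into it as a sorted, de-duplicated list. The standalone sketch below reproduces just that string handling without a database connection; the helper names parse_papers, format_papers, and merge_papers are hypothetical and the paper IDs are made up.

```perl
use strict;
use warnings;

# Split a stored value such as "WBPaper00000001","WBPaper00000002"
# into a plain list of paper IDs.
sub parse_papers {
    my ($field) = @_;
    return () unless defined $field && length $field;
    my @papers = split /","/, $field;
    s/"//g for @papers;    # strip the remaining outer quotes
    return @papers;
}

# Join paper IDs back into the stored format, sorted and without duplicates.
sub format_papers {
    my %seen = map { $_ => 1 } @_;
    return '"' . join('","', sort keys %seen) . '"';
}

# Merge newly found papers into an existing stored value.
sub merge_papers {
    my ($existing, @new) = @_;
    return format_papers(parse_papers($existing), @new);
}

my $stored = '"WBPaper00000002","WBPaper00000001"';
print merge_papers($stored, 'WBPaper00000003', 'WBPaper00000001'), "\n";
```

Keeping the parsing and formatting in separate functions makes the round trip easy to check on its own, before any UPDATE or INSERT statement is built from the result.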