Transgene first pass

Table of Contents

- Transgene first pass
  - Automated identification of transgenes and flagging of papers
  - High priority transgenes
  - Upload transgene search results into postgres
  - Removing obsolete transgene objects
  - update_textpresso_transgene.pl

Transgene first pass:

Automated identification of transgenes and flagging of papers

Arun wrote a script that identifies transgene names matching the Is and In regular expression patterns. SCRIPT: // finds all Is and In transgenes. The C. elegans corpus is used as the input file; the results are consumed by update_textpresso_transgene.pl.

SCRIPT: //finds all Is, In, and Ex expressions plus any possible genomic expression following the transgene regular expression name.
These output files are available here: http://textpresso-dev.caltech.edu/transgene/
The output file is: http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out
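
As an illustration of this kind of pattern matching, here is a minimal Perl sketch. The actual regular expressions in Arun's scripts are not shown on this page, so the pattern below (a lowercase lab prefix of one to three letters, then Is, In, or Ex, then a number) is an assumption, and find_transgenes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed pattern, not copied from the real Textpresso script:
# 1-3 lowercase letters, then Is / In / Ex, then one or more digits.
my $tg_pattern = qr/\b([a-z]{1,3}(?:Is|In|Ex)\d+)\b/;

# Return the unique transgene-like names found in a piece of text, sorted.
sub find_transgenes {
  my ($text) = @_;
  my %seen;
  while ($text =~ /$tg_pattern/g) { $seen{$1}++; }
  my @names = sort keys %seen;
  return @names;
}

my @found = find_transgenes('Strains carrying syIs13 or oxEx229 were crossed to mIn1; syIs13 was kept.');
print join(', ', @found), "\n";
```

Names that do not follow this canonical shape would be missed by such a pattern, and In matches can be chromosomal inversions rather than transgenes.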

Note: these scripts will miss transgenes that do not have a canonical name. http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out groups the results by transgene, with the papers listed below each transgene; "…paper.sup.1" means the transgene name was mentioned in the supplementary file.

High priority transgenes

From Arun's list (see e-mail from Arun, July 22, 2010). Updated on the 15th of every month:
http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out
http://textpresso-dev.caltech.edu/transgene/transgenes_summary_in_regular_papers.out

The following parsing script produces a list of high-priority transgenes (extrachromosomal arrays that appear in more than one paper). See message from Wen (Jun, 2010).

The script is written and placed on spica: /home/citpub/Karen/TgSummary/getNewTg.pl

It creates two files:

1. /home/citpub/Karen/TgSummary/NewTg.txt This is the list of all new Ex lines not in WS216.

2. /home/citpub/Karen/TgSummary/NewTgHighPriority.txt This is the list of all new Ex lines that have at least 2 paper entries.

Here is how to run the script and the results:

[citpub@spica]$ ./getNewTg.pl //This script screens the Textpresso output of transgenes, gets rid of those that already exist in WormBase.

Input file 1: WSTg.ace. This file lists all transgenes already in WormBase. Update it before running the script by dumping the list of transgene names from a current WS release.

Input file 2: transgenes_summary.out. This is the output file from the Textpresso scan.

Output file: NewTg.txt. The script also reports the following information:

4951 new Ex transgenes found. 607 Ex transgenes has more than 2 paper entries.
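
The filtering that getNewTg.pl performs can be sketched roughly as follows. This is not the script itself (its source is not shown here); the in-memory hashes stand in for WSTg.ace and transgenes_summary.out, and the function names and sample data are hypothetical.

```perl
use strict;
use warnings;

# Return the Ex transgenes that are not in the known set, sorted by name.
# $known: hashref of names already in WormBase.
# $papers_by_tg: hashref mapping transgene name to an arrayref of papers.
sub new_ex_transgenes {
  my ($known, $papers_by_tg) = @_;
  my @new = grep { /Ex\d+$/ && !$known->{$_} } keys %$papers_by_tg;
  @new = sort @new;
  return @new;
}

# Keep only the names supported by at least $min_papers distinct papers.
sub high_priority {
  my ($names, $papers_by_tg, $min_papers) = @_;
  my @keep;
  foreach my $name (@$names) {
    my %distinct = map { $_ => 1 } @{ $papers_by_tg->{$name} };
    push @keep, $name if scalar(keys %distinct) >= $min_papers;
  }
  return @keep;
}

# Made-up sample data for illustration only.
my %known = (syEx1 => 1);
my %papers_by_tg = (
  syEx1 => ['WBPaper00000001'],
  oxEx2 => ['WBPaper00000001', 'WBPaper00000002'],
  kyEx3 => ['WBPaper00000003'],
  mgIs4 => ['WBPaper00000004', 'WBPaper00000005'],
);
my @new  = new_ex_transgenes(\%known, \%papers_by_tg);
my @high = high_priority(\@new, \%papers_by_tg, 2);
print scalar(@new), " new Ex transgenes; ", scalar(@high), " with at least 2 papers\n";
```

The threshold is passed as a parameter because the page describes it both as "at least 2" and as "more than 2" paper entries.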

To update the Tg priority lists:

- Update WSTg.ace: query out all Tg from citace after the upload? Or get a list from postgres.

- Update the transgenes_summary.out output file from Textpresso: still not sure how to do this, whether the script is available for me to run or whether I need to ask Arun.

Upload transgene search results into postgres

All transgene-paper links are entered into the postgres database automatically by Juancarlos's script update_textpresso_transgene.pl on tazendra at /home/postgres/work/pgpopulation/textpresso/transgene. The script runs every day at 4am; it adds every new Is transgene to the transgene postgres tables and adds newly found references to transgenes that already exist.
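
Judging from the parsing code in update_textpresso_transgene.pl, each line of transgenes_in_regular_papers.out holds a paper identifier followed by whitespace-separated transgene names. A minimal sketch of reading one such line follows; parse_line, the returned hash keys, and the sample line are hypothetical, and the supplement check assumes the .sup suffix appears in the paper field.

```perl
use strict;
use warnings;

# Parse one line of the form "<paper field> <tg> <tg> ...".
# Returns a hashref, or undef when no WBPaper identifier is found.
sub parse_line {
  my ($line) = @_;
  return unless $line =~ m/^(\S+)\s+(.*)$/;
  my ($paper_field, $rest) = ($1, $2);
  return unless $paper_field =~ m/(WBPaper\d+)/;
  my $paper = $1;                                   # .sup entries collapse to the bare WBPaper ID
  my $in_supplement = ($paper_field =~ m/\.sup\./) ? 1 : 0;
  my @transgenes = split /\s+/, $rest;
  return { paper => $paper, supplement => $in_supplement, transgenes => \@transgenes };
}

# Made-up sample line for illustration only.
my $record = parse_line('WBPaper00012345.sup.1 syIs13 oxEx229');
print "$record->{paper}: @{ $record->{transgenes} }\n";
```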

At the moment, only new 'Is' lines are entered into postgres. 'In' lines are not entered into postgres unless the transgene name already matches something in the database (which means it has already been confirmed as a valid transgene).

'Is' lines are entered with Arun as a curator so these new entries are easy to find.
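
The rule above amounts to a three-way choice for each transgene name. The sketch below mirrors the branch order used in update_textpresso_transgene.pl; decide_action and its return labels are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Decide what to do with a transgene name found by Textpresso.
# $valid: hashref of names and synonyms that already exist in postgres.
sub decide_action {
  my ($tg, $valid) = @_;
  return 'add_papers' if $valid->{$tg};     # already known: only attach new papers
  return 'skip'       if $tg =~ m/In/;      # unknown In name: may be an inversion, do nothing
  return 'create';                          # anything else: create a new transgene object
}

# Made-up sample data for illustration only.
my %valid = (syIs13 => 1, hIn5 => 1);
foreach my $tg (qw(syIs13 hIn5 mIn9 oxIs99)) {
  print "$tg: ", decide_action($tg, \%valid), "\n";
}
```

Checking validity first is what lets an already confirmed In transgene keep receiving new paper links while unconfirmed In names are ignored.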

Scripts: /home/postgres/work/pgpopulation/textpresso/transgene/update_textpresso_transgene.pl (originally spelled update_textpreso_transgene.pl; J fixed the spelling on 7/11/11 and also fixed it in the wrapper as well as in antibody and rnai).

Removing obsolete transgene objects

The scripts repeatedly pick up and enter anything that is new or not curated. If a bad object was picked up (e.g. a typo, a mistake, or a false object), it therefore had to be added to an exclusion list so that it would not be entered again. The steps previously used to deal with these objects are described below. We now avoid those steps by using a false-positive toggle that can be applied to an object in a single line. This allows the object to be declared false in conjunction with certain papers while remaining a real object in papers where it might be real. These false objects are not dumped by the .ace dumper script and remain visible in the OA table.
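
The per-paper exclusion can be pictured as a two-level lookup, similar to the %obs hash in update_textpresso_transgene.pl, where the key 'all' excludes a name everywhere. The sketch below is an illustration only; is_excluded and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Return 1 if the transgene is flagged false for this paper (or for all papers).
# $false_pos: hashref of the form { name => { paper_or_all => 1 } }.
sub is_excluded {
  my ($false_pos, $tg, $paper) = @_;
  my $entry = $false_pos->{$tg} or return 0;   # avoids autovivifying new keys
  return 1 if $entry->{all};
  return 1 if $entry->{$paper};
  return 0;
}

# Made-up sample data for illustration only.
my %false_pos = (
  hIn1   => { all => 1 },                      # excluded in every paper
  syIs99 => { WBPaper00000010 => 1 },          # false only in this paper
);
print is_excluded(\%false_pos, 'syIs99', 'WBPaper00000010') ? "skip\n" : "keep\n";
```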

For reference, the old pipeline for removing these objects:
The Textpresso scan extracts some false-positive transgene hits. These objects need to be deleted from postgres and added to the transgene object exclusion list so that they are not picked up again during future transgene object scans. The transgene exclusion (false positive) list lives on tazendra:
/home/acedb/wen/phenote_transgene/ObsoleteTg.txt (Open question: when is this file called, and by which scripts?)
This file is important and needs to be edited every time a new false positive is discovered. Some examples of false positives include:

- "In" lines that are chromosomal inversions rather than transgenes.
- Typos in transgene names that get published. These should not go into postgres as their own entities, but should be noted in the remark or listed as a synonym, if appropriate, for the real transgene.
- Transgene names that Textpresso mishandles. For example, when a paper refers to transgenes as syIs13-19, Textpresso reports syIs1319 (the hyphen sometimes disappears during pdf2Text conversion). Curators should enter all the individual transgene objects into postgres and obsolete the syIs1319 object.
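
For the hyphenated case, a curator who knows the range as written in the paper could expand it into individual names with something like the sketch below. It assumes the range is inclusive and that the hyphen is still present; the fused form (syIs1319) cannot be expanded reliably because the split point is ambiguous. expand_range is a hypothetical helper, not part of the existing scripts.

```perl
use strict;
use warnings;

# Expand a name such as "syIs13-19" into syIs13 .. syIs19.
# Names that are not ranges are returned unchanged.
sub expand_range {
  my ($name) = @_;
  return ($name) unless $name =~ m/^([a-z]+(?:Is|In|Ex))(\d+)-(\d+)$/;
  my ($stem, $first, $last) = ($1, $2 + 0, $3 + 0);
  return ($name) if $last < $first;              # not a sensible range, leave as is
  my @names = map { $stem . $_ } $first .. $last;
  return @names;
}

print join(' ', expand_range('syIs13-19')), "\n";
```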

update_textpresso_transgene.pl

 use strict;
 use diagnostics;
 use DBI;
 use LWP::Simple;
 use Jex;
 
 my $date = &getSimpleDate();
 my $timestamp = &getSimpleSecDate();
 
 my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n";
 my $result;
 
 my $directory = '/home/postgres/work/pgpopulation/textpresso/transgene';
 chdir($directory) or die "Cannot go to $directory ($!)";
  
 
 $/ = undef; #reads whole file instead of line by line
 my $full_file = 'transgenes_in_regular_papers.out';
 open (IN, "<$full_file") or die "Cannot open $full_file : $!";
 my $last_data = <IN>;
 close (IN) or die "Cannot close $full_file : $!";
 $/ = "\n";
 
 my $new_data = get "http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out";
 # UNCOMMENT THIS
 exit if ($last_data eq $new_data);
 
 
 open (OUT, ">$full_file") or die "Cannot rewrite $full_file : $!";
 print OUT $new_data;
 close (OUT) or die "Cannot close $full_file : $!";
 
 my (@tlines) = split/\n/, $new_data;
 my @pgcommands;
 
 # no longer populate textpresso firstpass tables for this, transgene is not a FP curation field in form.  2010 08 24
 #
 # my $logfile = $directory . '/logfile.pg';
 # open (LOG, ">$logfile") or die "Cannot rewrite $logfile : $!";
 #
 # push @pgcommands, "DELETE FROM tfp_transgene;";
 # foreach my $line (@tlines) {
 #   my ($paper, @transgenes) = split/\s+/, $line;
 #   if ($paper =~ m/(WBPaper\d+)/) { $paper = $1; }
 #   my ($joinkey) = $paper =~ m/WBPaper(\d+)/;
 #   push @pgcommands, "INSERT INTO tfp_transgene VALUES ('$joinkey', '$line');";
 # } # foreach my $line (@tlines)
 # 
 # foreach my $command (@pgcommands) {
 #   print LOG "$command\n";
 #   $result = $dbh->do( $command );
 # } # foreach my $command (@pgcommands)
 # 
 # close (LOG) or die "Cannot close $logfile : $!";
 
 #we don't appear to be doing anything with these outfiles 7/11/11
 my $outfile  = $directory . '/transgene_textpresso';
 my $outfile2 = $directory . '/new_transgene_textpresso'; #supposed to be used for keeping track of In transgenes
 my $outfile3 = $directory . '/transgene_textpresso.pg';
 # open (OUT, ">>$outfile") or die "Cannot create $outfile : $!";
 # open (OU2, ">>$outfile2") or die "Cannot create $outfile2 : $!";
 # open (OU3, ">>$outfile3") or die "Cannot create $outfile3 : $!";
 
 my %syns;
 my %valid; #transgenes are assigned as valid as they already exist in postgres in name or synonym
 $result = $dbh->prepare( "SELECT trp_name.trp_name, trp_synonym.trp_synonym FROM trp_name, trp_synonym WHERE trp_name.joinkey = trp_synonym.joinkey;" );
 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
 while (my @row = $result->fetchrow) { 
   my (@syns) = split/ \| /, $row[1];
   $valid{$row[0]}++;
   foreach my $syn (@syns) {
     $valid{$syn}++;
     $syns{$syn} = $row[0]; } }
 $result = $dbh->prepare( "SELECT trp_name FROM trp_name;" );
 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
 while (my @row = $result->fetchrow) { $valid{$row[0]}++; }
 
 
 my %obs; # % is used to signify a hash variable, $ is used to signify a value inside a % (hash)
 
 # no longer use flatfile, use trp_objpap_falsepos table (Fail) 2010 08 24
 # my $infile = '/home/acedb/wen/phenote_transgene/ObsoleteTg.txt';
 # open (IN, "<$infile") or die "Cannot open $infile : $!";
 # while (my $line = <IN>) {
 #   chomp $line;
 #   next unless $line;
 #   next if ($line =~ m/^\/\//);
 # #   my ($tg, $paper, $comment) = split/\t/, $line;	# wen keeps forgetting tabs
 #   my ($tg, $paper);
 #   if ($line =~ m/^(\S+)\s+(\S+)/) { $tg = $1; $paper = $2; }
 #   $tg =~ s/\s+//g;
 #   unless ($paper) { $paper = 'all'; }
 #   $obs{$tg}{$paper}++;
 # #   print "OBS $tg PAP $paper E\n";
 # } # while (my $line = <IN>) {
 # close (IN) or die "Cannot close $infile : $!";
 
 $result = $dbh->prepare( "SELECT trp_name.joinkey, trp_name.trp_name, trp_paper.trp_paper FROM trp_name, trp_paper WHERE trp_name.joinkey = trp_paper.joinkey AND trp_name.joinkey IN (SELECT joinkey FROM trp_objpap_falsepos WHERE trp_objpap_falsepos = 'Fail');" ); #matches names from Arun's output to those with fails and does not enter them again into postgres: finding names and papers with ids that belong to the fails and puts them in the %obs (an obsolete hash).  these obsoletes will not be entered into postgres
 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
 while (my @row = $result->fetchrow) { 
   my (@papers) = split/","/, $row[2];
   foreach my $pap (@papers) { $pap =~ s/\"//g; $obs{$row[1]}{$pap}++; } }
 $obs{hIn1}{all}++;
 $obs{mIn1}{all}++;
 
 
  my %tdata; #a hash of all textpresso data
  foreach my $line (@tlines) {
   chomp $line;
 #   my ($paper, $tg) = split/  /, $line;
   my ($paper, $tg) = $line =~ m/^(\S+)\s+(.*)$/;	# arun changed the format again  2010 02 27; gets  papers and transgenes
   ($paper) = $paper =~ m/(WBPaper\d+)/; #includes all .sup papers as WBPaperID alone
 #   unless ($tg) { print "-- NO TG $line\n"; }
   my (@tg) = split/\s/, $tg;
   foreach my $tg (@tg) { 
     if ($syns{$tg}) { $tg = $syns{$tg}; } #gets transgene names if synonyms
     next if ($obs{$tg}{'all'});
     next if ($obs{$tg}{$paper}); #if not obsolete, paper and name are retrieved
     $tdata{$tg}{$paper}++;  #adds anything that hasn't been skipped into textpresso data hash
   }
 } # foreach my $line (@tlines)
 
 my %pdata;  #postgres data hash
 $result = $dbh->prepare( "SELECT trp_name.trp_name, trp_paper.trp_paper FROM trp_name, trp_paper WHERE  trp_name.joinkey = trp_paper.joinkey;" );  #queries name and paper from postgres
 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
 while (my @row = $result->fetchrow) { 
   my $tg = $row[0];
 #   my (@papers) = split/ | /, $row[1];
   my (@papers) = split/\",\"/, $row[1];
   foreach my $paper (@papers) { $paper =~ s/\"//g; delete $tdata{$tg}{$paper}; } } #if transgene and paper exist in textpresso data, those data are deleted from the textpresso data, will keep transgenes whose names do match but have a different paper value, both valid and invalid data exists
 
 foreach my $tg (sort keys %tdata) {  #this section is about adding the data to postgres, choosing one of three actions
   my (@papers) = sort keys %{ $tdata{$tg} };
 #   my $papers = join" | ", @papers;
   if ($papers[0]) {
     if ($valid{$tg}) { &addToTg($tg, \@papers, $timestamp ); } #if valid (name or synonym exists in postgres), updates paper, not timestamp
       elsif ($tg =~ m/In/) { 1; } # print OU2 "$timestamp new $tg in $papers\n"; { # }, do not do anything with In transgenes
       else { &newTg($tg, \@papers, $timestamp ); } #not valid (doesn't already exist), create new one
   } # if ($papers[0])
 } # foreach my $tg (sort keys %tdata)
 
 # close (OUT) or die "Cannot close $outfile : $!";
 # close (OU2) or die "Cannot close $outfile2 : $!";
 # close (OU3) or die "Cannot close $outfile3 : $!";
 
 
 # print "not same\n";
 
 
 sub addToTg { #puts papers in postgres format
   my ($tg, $arref_papers, $date) = @_;
   my @papers = @$arref_papers; 
   my %papers;
   my $papers = join"\",\"", @papers; $papers = '"' . $papers . '"'; #puts paper in postgres format
   foreach (@papers) { $papers{$_}++; }
   my @pgcommands;
   print "\nADD to tg $date more papers $tg in $papers\n"; 
 #   print OUT "$date more papers $tg in $papers\n"; 
   my %joinkeys;					# get all joinkeys that refer to this Tg
   $result = $dbh->prepare( "SELECT * FROM trp_name WHERE trp_name = '$tg';" ); #gets the pgid (joinkey) of every trp_name row with this existing transgene name
   $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
   while (my @row = $result->fetchrow) { $joinkeys{$row[0]}++; }
   foreach my $joinkey (keys %joinkeys) {	# for all joinkeys of that Tg
     $result = $dbh->prepare( "SELECT trp_paper FROM trp_paper WHERE joinkey = '$joinkey';" ); #trp_paper was formerly trp_reference
     $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
     my @row = $result->fetchrow;
     if ($row[0]) { 				# if there's a reference, append it, add new paper to pgid
       my (@paps) = split/\",\"/, $row[0]; foreach (@paps) { $_ =~ s/\"//g; $papers{$_}++; }
 #       $papers = "$row[0] | $papers"; 
       (@paps) = sort keys %papers; $papers = join"\",\"", @paps; $papers = '"' . $papers . '"';
       my $command = "UPDATE trp_paper SET trp_paper = '$papers' WHERE joinkey = '$joinkey';"; #apply postgres command
       push @pgcommands, $command;
 #       print OU3 "$command -- $date\n";
 #       my $result2 = $dbh->do( $command );
     } else {					# if new reference, add it
       my $command = "INSERT INTO trp_paper VALUES ('$joinkey', '$papers');"; #if there is no paper, insert the paper
       push @pgcommands, $command;
 #       print OU3 "$command -- $date\n";
 #       my $result2 = $dbh->do( $command );
     }
     my $command = "INSERT INTO trp_paper_hst VALUES ('$joinkey', '$papers');"; #add it to the history table (formerly trp_reference_hst)
     push @pgcommands, $command;
 #     print OU3 "$command -- $date\n";
 #     my $result2 = $dbh->do( $command );
   } # foreach my $joinkey (keys %joinkeys)
   foreach my $pgcommand (@pgcommands) {
     print "$pgcommand  -- $date\n"; #keeps log 
 # UNCOMMENT TO POPULATE PG
     my $result2 = $dbh->do( $pgcommand );
   }
 } # sub addToTg
 
 sub newTg { #for new transgenes 
   my ($tg, $arref_papers, $date) = @_;
   my @papers = @$arref_papers;
   my $papers = join"\",\"", @papers; $papers = '"' . $papers . '"';
   my @pgcommands;
   my $joinkey = 0;
   $result = $dbh->prepare( "SELECT * FROM trp_name;" ); #queries the transgenes name table to find the highest pgid
   $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
   while (my @row = $result->fetchrow) { if ($row[0] > $joinkey) { $joinkey = $row[0]; } }
   $joinkey++; #adds 1 to the highest pgid to create the next available value
   print "\nNEW tg $date new $tg in $papers\n";
 #   print OUT "$date new $tg in $papers\n"; 
   my $command = "INSERT INTO trp_name VALUES ('$joinkey', '$tg');";
   push @pgcommands, $command;
 #   print OU3 "$command -- $date\n";
 #   my $result2 = $dbh->do( $command );
   $command = "INSERT INTO trp_name_hst VALUES ('$joinkey', '$tg');";
   push @pgcommands, $command;
 #   print OU3 "$command -- $date\n";
 #   $result2 = $dbh->do( $command );
   $command = "INSERT INTO trp_paper VALUES ('$joinkey', '$papers');"; #changed from trp_reference
   push @pgcommands, $command;
 #   print OU3 "$command -- $date\n";
 #   $result2 = $dbh->do( $command );
   $command = "INSERT INTO trp_paper_hst VALUES ('$joinkey', '$papers');"; #changed from trp_reference_hst
   push @pgcommands, $command;
 #   print OU3 "$command -- $date\n";
 #   $result2 = $dbh->do( $command );
   $command = "INSERT INTO trp_curator VALUES ('$joinkey', 'WBPerson4793');"; #adds Arun to curator for new  transgenes
   push @pgcommands, $command;
   $command = "INSERT INTO trp_curator_hst VALUES ('$joinkey', 'WBPerson4793');"; #adds Arun
   push @pgcommands, $command;
   foreach my $pgcommand (@pgcommands) {
     print "$pgcommand  -- $date\n";
 # UNCOMMENT TO POPULATE PG
     my $result2 = $dbh->do( $pgcommand );
   }
 } # sub newTg
 
 
 __END__

 # look at textpresso transgene data and update postgres based on that.  
 # 2008 10 07
 #
 # wasn't checking regular names, oops.  2008 10 20
 #
 #
 # update postgres based on value, cron job 
 # 0 2 * * mon /home/postgres/work/pgpopulation/transgene/textpresso_transgene/textpresso_transgene.pl
 # 2008 10 14
 #
 # run every day  2009 02 23
 # 0 2 * * * /home/postgres/work/pgpopulation/transgene/textpresso_transgene/textpresso_transgene.pl
 
 use LWP::Simple;
 use strict;
 use diagnostics;
 use Pg;
 use Jex;
 
 my $directory = '/home/postgres/work/pgpopulation/transgene/textpresso_transgene';
 chdir($directory) or die "Cannot go to $directory ($!)";
 
 my $conn = Pg::connectdb("dbname=testdb");
 die $conn->errorMessage unless PGRES_CONNECTION_OK eq $conn->status;
 
 my $date = &getSimpleSecDate();
 
 
 __END__
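
The paper list in trp_paper is stored as a single string of quoted, comma-separated IDs, which addToTg splits, merges with the new papers, and joins again. The standalone sketch below illustrates that merge outside the database; merge_papers is a hypothetical helper and the sample values are made up.

```perl
use strict;
use warnings;

# Merge new paper IDs into a stored value such as "WBPaper1","WBPaper2"
# and return the combined value in the same quoted, comma-separated format.
sub merge_papers {
  my ($stored, @new_papers) = @_;
  my %papers = map { $_ => 1 } @new_papers;
  if (defined $stored && length $stored) {
    foreach my $chunk (split /","/, $stored) {
      (my $pap = $chunk) =~ s/"//g;        # strip the remaining outer quotes
      $papers{$pap} = 1;
    }
  }
  return '"' . join('","', sort keys %papers) . '"';
}

print merge_papers('"WBPaper00000002","WBPaper00000001"', 'WBPaper00000003'), "\n";
```

Keeping the IDs in a hash removes duplicates, and sorting keeps the stored value stable from one run to the next.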
