Modified code from the TF affinity prediction
See also http://www.molgen.mpg.de/~manke/papers/TFaffinities/
In order to analyze multiple sequences, one has to first extract the promoter sequences, eg.
human.promoter <- read.DNAStringSet('~/data/fimo/human_upstream1000.fa')
where human_upstream1000.fa
is the whole genome promoter sequence file, one
can then pick her genes of interest (eg. all upregulated genes) and find out
the ENTREZ IDs
tf.entrez <- unlist(mget(as.character(df.uptime$gene), org.Hs.egSYMBOL2EG,
ifnotfound=NA))
tf.entrez <- tf.entrez[!is.na(tf.entrez)]
which have to be converted to REFSEQ IDs to match the promoter file
tf.refseq <- unlist(mget(tf.entrez, org.Hs.egREFSEQ))
tf.refseq <- tf.refseq[grep('NM',tf.refseq)]
idx <- grep(paste(tf.refseq,collapse='_|'), names(human.promoter))
Here we define a function to convert REFSEQ IDs back to gene symbols, such that our sequence files will have the consistent gene names
refseq2symbol <- function(refseq.str) {
all.str <- unlist(strsplit(refseq.str, '_'))
refseq <- paste(all.str[1:2], collapse='_')
entrez <- unlist(mget(refseq,org.Hs.egREFSEQ2EG,ifnotfound=NA))
if (is.na(entrez))
return('')
else
return(get(entrez, org.Hs.egSYMBOL))
}
Now we can define our output directory and dump the sequence files into it
wd <- '~/data/hdf/qpcr_seq/'
for (i in 1:length(idx)) {
symbol <- refseq2symbol(names(human.promoter)[idx[i]])
if (symbol != '') {
seq.file <- paste(wd,symbol,'.fa',sep='')
write.XStringSet(human.promoter[idx[i]], seq.file)
}
symbol
}
With all that in place, we can then execute the TRAP program on all sequence
files (eg. using the parallel functionality in foreach
)
all.file <- dir(wd, 'fa')
foreach(i=1:length(all.file)) %dopar% {
seq.file <- paste(wd,all.file[i],sep='')
res.file <- paste(wd,sub('fa','trap',all.file[i]),sep='')
system(paste('./TRAP ~/data/TRANSFAC/TFP_2012.3/dat/matrix.dat',seq.file,
'>',res.file))
}
where the matrix.dat
is the position score matrix provided by TRANSFAC, and
the result files will have the extension of .trap
, while locating in the same
directory. The result files can then be individually processed to calculate the
normalized affinity score.