Commit

first commit
WilliamDiakite committed Apr 23, 2018
0 parents commit 3ef4698
Showing 9 changed files with 70,202 additions and 0 deletions.
68,890 changes: 68,890 additions & 0 deletions .ipynb_checkpoints/Untitled-checkpoint.ipynb

Large diffs are not rendered by default.

422 changes: 422 additions & 0 deletions .ipynb_checkpoints/relevance_ranking.py-checkpoint.ipynb

Large diffs are not rendered by default.

Binary file added __pycache__/utils.cpython-36.pyc
Binary file not shown.
31 changes: 31 additions & 0 deletions co_occurrence.csv
@@ -0,0 +1,31 @@
Source,Target,Weight
ingenieur,sens du collectif,2305
mecanicien,ponctuel,2351
cuisinier,autonome,2224
boulanger,temeraire,2269
politicien,autonome,2317
luthier,rigoureux,2276
ingenieur,curieux,2253
mecanicien,motive,2279
cuisinier,creatif,2264
boulanger,motive,2345
politicien,sens du collectif,2274
luthier,ponctuel,2254
ingenieur,creatif,2201
mecanicien,rigoureux,2223
politicien,creatif,2299
luthier,habile,2268
ingenieur,audacieux,2365
cuisinier,sens du collectif,2318
boulanger,habile,2220
politicien,curieux,2260
mecanicien,temeraire,2347
cuisinier,curieux,2281
luthier,temeraire,2306
cuisinier,audacieux,2342
mecanicien,habile,2229
boulanger,ponctuel,2390
politicien,audacieux,2279
ingenieur,autonome,2305
boulanger,rigoureux,2205
luthier,motive,2325
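
The file above is a weighted edge list (Source, Target, Weight). As a minimal sketch of how such a list could be read, the snippet below parses a few rows copied from it and keeps the heaviest target per source. The function name `strongest_target` and the inline `SAMPLE` string are illustrative assumptions, not part of the commit.

```python
import csv
import io

# A few rows copied from co_occurrence.csv (illustrative subset only).
SAMPLE = """Source,Target,Weight
ingenieur,sens du collectif,2305
ingenieur,curieux,2253
ingenieur,audacieux,2365
boulanger,motive,2345
boulanger,ponctuel,2390
"""


def strongest_target(csv_text):
    """Return {source: (target, weight)} keeping the heaviest edge per source."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row["Source"]
        weight = int(row["Weight"])
        # Keep the row only if it beats the current best edge for this source.
        if source not in best or weight > best[source][1]:
            best[source] = (row["Target"], weight)
    return best


if __name__ == "__main__":
    for source, (target, weight) in strongest_target(SAMPLE).items():
        print(source, target, weight)
```

To run it on the real file, the text passed in could instead come from reading `co_occurrence.csv`, assuming it sits in the working directory.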
9 changes: 9 additions & 0 deletions data/NLM_500/documents/12915586.txt
@@ -0,0 +1,9 @@
Susceptibility of Human Hepatitis Delta Virus RNAs to Small Interfering RNA Action
In animal cells, small interfering RNAs (siRNA), when exogenously provided, have been reported to be capable of inhibiting replication of several different viruses. In preliminary studies, siRNA species were designed and tested for their ability to act on the protein expressed in Huh7 cells transfected with DNA-directed mRNA constructs containing hepatitis delta virus (HDV) target sequences. The aim was to achieve siRNA specific for each of the three RNAs of HDV replication: (i) the 1,679-nucleotide circular RNA genome, (ii) its exact complement, the antigenome, and (iii) the less abundant polyadenylated mRNA for the small delta protein. Many of the 16 siRNA tested gave >80% inhibition in this assay. Next, these three classes of siRNA were tested for their ability to act during HDV genome replication. It was found that only siRNA targeted against HDV mRNA sequences could interfere with HDV genome replication. In contrast, siRNA targeted against genomic and antigenomic RNA sequences had no detectable effect on the accumulation of these RNAs. Reconstruction experiments with nonreplicating HDV RNA sequences support the interpretation that neither the potential for intramolecular rod-like RNA folding nor the presence of the delta protein conferred resistance to siRNA. In terms of replicating HDV RNAs, it is considered more likely that the genomic and antigenomic RNAs are resistant because their location within the nucleus makes them inaccessible to siRNA-mediated degradation.

Human hepatitis delta virus (HDV) has a 1,679-nucleotide (nt) single-stranded circular RNA genome that is replicated by RNA-directed RNA synthesis, most probably involving host RNA polymerase II . During this replication, three RNA species accumulate , as represented in Fig. . The genome and its exact complement, the antigenome, are considered unit length. They exist primarily in a circular conformation but also in a linear conformation; these two conformations can be resolved using appropriate conditions of gel electrophoresis . The third RNA species consists of relatively lower amounts of an 800-nt polyadenylated RNA (of the same polarity as the antigenome), which is translated to produce a 195-amino-acid protein, known as the delta antigen (deltaAg-S), and is essential for HDV genome replication . An increasing number of reports have shown that small interfering RNAs (siRNA) can be exogenously provided to cells undergoing animal virus replication and achieve inhibition . For the following reasons, we were specifically interested in the possible susceptibility to siRNA of HDV RNAs. (i) The HDV genomic and antigenomic RNAs can fold into an unbranched rod-like structure with 74% of the bases paired , and this folding might interfere with siRNA action. (ii) The delta protein has the ability to bind double-stranded RNA and thus might also interfere. (iii) While several reports indicate that HDV genomic and antigenomic RNAs are predominantly located in the cell nucleus , two recent studies using cell fractionation indicated that much of the genomic RNA (but not the antigenomic RNA) might be cytoplasmic and thus possibly accessible to siRNA attack . With these questions in mind, our first objective was to design and test siRNA species specific for sequences on HDV mRNA, genomic RNA, and antigenomic RNA. As represented in Fig. , the initial strategy was to use expression vectors to produce within Huh7 cells DNA-directed mRNA species that contain these three sequences. 
As indicated, a total of 16 siRNA, each containing a 21-nt region based on the nucleotide sequence of Kuo et al. , were designed and constructed using a Silencer siRNA construction kit (Ambion). The locations and sequences of the target sites are listed in Table . The assay in each case was for the translation and accumulation of the delta protein, as detected by immunoblotting. The results and their quantitation are presented in Fig. . Also shown is the immunoblot for the internal control, green fluorescent protein (GFP). An expression vector for GFP was cotransfected, and as expected, the siRNA directed against HDV sequences did not inhibit the accumulation of GFP protein. Consistent with the siRNA experience of others , we found that some of the designed siRNA were able to give 80 to 95% inhibition of the HDV mRNA sequences (Fig. , lanes 4, 6, and 7). Furthermore, as planned, it was possible to obtain siRNA that attacked the genomic and antigenomic RNA sequences (Fig. , respectively). It should be noted that in Fig. , the insert of 657 nt of antigenomic sequence (from position 660 to 4) should thus have been able in large part to fold with the 585 nt spanning the delta antigen open reading frame (at position 1596 to 1011) into an extensive amount of unbranched rod-like structure. Even if this potential folding did occur in vivo, it did not confer resistance to functioning siRNA (lanes 14, 15, and 16). We next examined whether those siRNA with proven activity could interfere with HDV genome replication, as assayed by the accumulation of unit-length HDV RNA species. Some results are shown in Fig. , along with quantitation. We observed that inhibition of RNA accumulation occurred only with siRNA 4 and 6, which targeted the mRNA sequences and caused a significant reduction of delta protein accumulation . (In panel A, the data shown are for antigenomic RNA; however, similar results were obtained for genomic RNA [data not shown]. 
Also, siRNA 14, 15, and 16 failed to reduce HDV RNA accumulation.) In the above-described experiment, the gel electrophoresis conditions used do not separate the circular from the linear conformations of unit-length HDV RNAs. However, if the siRNA treatment led to single endonucleolytic cuts on unit-length circles to produce linears, we expect that this would have inhibited further RNA accumulation . Therefore, to further test this preliminary interpretation that unit-length genomic and antigenomic RNAs were resistant to siRNA, we carried out the following additional experiment. We used transfection to express HDV RNA multimers by DNA-directed RNA transcription. From previous studies, we knew that these would be posttranscriptionally processed to form unit-length RNA circles and yet, because of a 2-nt deletion in the open reading frame for the small delta protein, would be unable to make the essential delta protein and undergo RNA-directed transcription and replication . We then used gel analysis conditions capable of separating both linear and circular conformations of unit-length HDV RNA. Some data are shown in Fig. . As can be seen from the quantitation, cotransfection with siRNA 10, 11, 12, and 13, specific for genomic RNA sequences, did not reduce the accumulation of unit-length genomic RNA. Furthermore, the presence of this siRNA did not cause a reduction in the fraction of RNAs with a circular conformation (lanes 10 to 13 relative to lane C). In similar experiments, we expressed antigenomic nonreplicating RNAs and found that they were not sensitive to siRNA 4, 6, and 9 (data not shown). We interpret these data as evidence against siRNA action, even for inducing a single nick on the nonreplicating HDV RNAs. In addition, the delta protein was not present and thus could not be the basis for the observed resistance. Our studies show that HDV circular RNAs, whether transcribed from an RNA template or from a DNA template , were resistant to siRNA attack. 
In the case of the DNA-directed transcript, the expression vector was such that the primary (nonreplicating) multimeric transcript was via host RNA polymerase II and that RNA (prior to ribozyme processing and ligation to form unit-length circles) should have undergone both 5'-capping and 3' polyadenylation. Others have shown that for a host mRNA precursor, the intronic regions are resistant to siRNA while the exons are sensitive . Our studies show that the unit-length circular RNA processed out of the multimeric transcript was not only stable but also resistant to siRNA attack. In summary, these studies support the interpretation that during genome replication, the only HDV RNA directly susceptible to inhibition mediated by siRNA is the mRNA. At least under the conditions of these experiments, the resistance of the genomic and antigenomic RNAs was not dependent on RNA structure, RNA conformation, or the presence of the small delta protein. Further experiments will be needed to determine if this "resistance" was in fact due to inaccessibility based on nuclear localization of genomic and antigenomic unit-length RNAs. It remains possible that some of the genomic RNA is cytoplasmic but is somehow inaccessible to attack by siRNA. For example, it could be protected by a host RNA-binding protein. For this or maybe other reasons, these HDV RNAs are probably resistant to siRNA because they are simply not accessible to the RISC, a protein-RNA effector nuclease complex that recognizes and destroys target RNAs . In this respect, our findings with HDV are analogous to those reported for siRNA action on the replication of respiratory syncytial virus and influenza virus . That is, siRNA cannot target the replicating viral RNA transcripts directly but only indirectly via action on the viral mRNA species.
Representation of three main species of HDV RNA. Representation of three main species of HDV RNA. Also indicated are the genomic and antigenomic ribozymes (cleavage site is shown as a circle) and the open reading frame (ORF) for the deltaAg . At the right is the number of molecules of each RNA per average liver cell for an infected woodchuck and chimpanzee, as previously reported . Indicated on the 1,679-nt genomic RNA is the origin for the nucleotide numbering, according to the sequence of Kuo et al. .
Transfected siRNAs could target HDV mRNA species. Transfected siRNAs could target HDV mRNA species. Huh7 cells were transfected with one of three plasmids that express an HDV mRNA species. (A) pDL444 expressed an mRNA equivalent to normal HDV mRNA. (B and C) pDL444 was modified to contain 657 nt of extra sequences in the 3' untranslated region, which led to the transcription of either partial genomic HDV RNA sequences (position 4 to 660) (B) or antigenomic RNA sequences (position 660 to 4) (C). The constructs were cotransfected with a plasmid expressing GFP. After 2 days the total protein was extracted and examined by immunoblotting to detect both delta protein and GFP. Detection and quantitation were with a bioimager (Fuji). The 16 HDV-specific siRNA indicated in the figure were designed and delivered as a cotransfection using Lipofectamine 2000 (Invitrogen). As a negative control (lanes C) we used siRNA against glyceraldehyde-3-phosphate dehydrogenase.
Effect of transfected siRNA on the accumulation of replicating HDV RNAs. Effect of transfected siRNA on the accumulation of replicating HDV RNAs. Cells were transfected with pDL553 to initiate HDV genome replication. siRNA species (at 30 nM) were cotransfected as indicated and as previously described in Fig. . At day 2, total RNA was extracted, glyoxalated, and analyzed by electrophoresis in a 1% agarose gel. HDV antigenomic RNA was then detected by Northern assay (A). Similarly, total protein was examined by immunoblotting to detect deltaAg-S (B). For both panels, bioimager data were subjected to quantitation, expressing the amount of signal detected relative to that obtained for the control transfection. As a negative control (lane C), we used siRNA against endogenous glyceraldehyde-3-phosphate dehydrogenase (GAPDH); treatment with this siRNA reduced GAPDH mRNA but had no effect on HDV RNA levels (data not shown). Also shown in panel B is the immunoblot assay for expression of GFP, which was cotransfected as a control.
Transfected siRNA did not target nonreplicating unit-length HDV genomic RNA. Transfected siRNA did not target nonreplicating unit-length HDV genomic RNA. Cells were transfected with pDL542 to achieve the transcription and accumulation of nonreplicating unit-length genomic HDV RNA circles. Cotransfected with this plasmid was either a glyceraldehyde-3-phosphate dehydrogenase control siRNA (lane C) or genomic HDV-specific siRNA (lanes 10 to 13), as indicated at the top of the figure and as previously described for Fig. . At day 2, total RNA was extracted, glyoxalated, and analyzed by electrophoresis in a 3% agarose gel. Linear and circular forms of HDV genomic RNA were then detected by Northern assay. After hybridization and quantitation using the bioimager, we deduced the amounts of linear and circular RNAs, as summarized in the histogram.
Locations and sequences of siRNA targets
219 changes: 219 additions & 0 deletions relevance_ranking.py
@@ -0,0 +1,219 @@
import os

from utils import tokenize, save_obj, load_obj
from itertools import chain
from collections import Counter
from collections import defaultdict


# import some stopwords (one word per line)
with open('stopwords.txt') as f:
    stopwords = [s.rstrip() for s in f]


# import the documents and their annotations
# (sorted so that documents[i] and annotations[i] share the same file stem)
data_dir = './data/NLM_500/documents/'
documents = []
annotations = []

for name in sorted(os.listdir(data_dir)):
    filename = os.path.join(data_dir, name)
    if filename.endswith('.txt'):
        with open(filename, encoding='ISO-8859-1') as fh:
            documents.append(fh.read())
    elif filename.endswith('.key'):
        with open(filename, encoding='ISO-8859-1') as fh:
            annotations.append([a.rstrip().lower() for a in fh])

# print(documents[0])
# print()
# print(annotations[0])

# tokenize documents
print('[ + ] Tokenizing documents')
documents = [tokenize(d, stopwords) for d in documents]
documents = [list(set(d)) for d in documents]


# Output some infos about the data
vocab = []
thesaurus = []

for doc in documents:
    vocab += list(set(doc))

for tw in annotations:
    thesaurus += tw

vocab = sorted(list(set(vocab)))
thesaurus = sorted(list(set(thesaurus)))
intersection = sorted(list(set(thesaurus).intersection(vocab)))

print('Vocab size:', len(vocab))
print('Thesaurus size:', len(thesaurus))
print('Intersection size:', len(intersection))


# Count the tags that annotate at least min_doc documents
# (a single number, not really a distribution)
nb_tags = 0
min_doc = 5
tag_dist = Counter(chain.from_iterable(annotations))
for tag, count in tag_dist.items():
    if count >= min_doc:
        nb_tags += 1
print('[ ! ] Tags that tag at least {} documents: {}'.format(min_doc, nb_tags))

# Init training
# TODO: cross validation

# Get training set
documents_train = documents[:375]
annotations_train = annotations[:375]

# Get test set
documents_test = documents[375:]
annotations_test = annotations[375:]


def word_occurence(documents, vocab):
    # Init a zero count for every word of the vocabulary
    word_count_idx = {w: 0 for w in vocab}

    for doc in documents:
        # Update the counts over the whole collection
        for word in doc:
            word_count_idx[word] += 1

    return word_count_idx


def compute_mle_vector(word_occurence, N):
    # Smoothed estimate: add 0.5 to each count and 1 to the document count N
    mle_vector = dict()
    for w in word_occurence:
        mle_vector[w] = (word_occurence[w] + 0.5) / (N + 1)

    return mle_vector


def compute_relevance_matrices(documents, annotations, thesaurus, vocab):

    for th in thesaurus:
        # record the term currently being processed
        with open('last_th.txt', 'w') as f:
            f.write(th)

        # helps separate relevant docs from non-relevant ones
        corpus = defaultdict(list)

        # mark relevance for all documents
        for doc, tags in zip(documents, annotations):
            if th in tags:
                corpus['relevant'].append(doc)
            else:
                corpus['nonrelevant'].append(doc)

        # Word occurrences in relevant and non-relevant documents
        rel_count_vec = word_occurence(corpus['relevant'], vocab)
        non_count_vec = word_occurence(corpus['nonrelevant'], vocab)

        # Number of relevant and non-relevant documents
        N_rel = len(corpus['relevant'])
        N_non = len(corpus['nonrelevant'])

        # Compute the smoothed maximum-likelihood estimates
        p_prob = compute_mle_vector(rel_count_vec, N_rel)
        q_prob = compute_mle_vector(non_count_vec, N_non)

        # save probabilities on disk
        save_obj(obj=(p_prob, q_prob), name=th)


def score(p_vec, q_vec, new_doc):
    '''
    Odds that new_doc is relevant to one thesaurus term, given that
    term's probability vectors p_vec (relevant documents) and
    q_vec (non-relevant documents)
    '''
    num_prod = 1
    denum_prod = 1

    for t in new_doc:
        num_prod *= p_vec[t] * (1 - q_vec[t])
        denum_prod *= q_vec[t] * (1 - p_vec[t])

    try:
        return num_prod / denum_prod
    except ZeroDivisionError:
        # the denominator underflowed to 0; fall back to a zero score
        return 0


def predict_tags(thesaurus, new_doc, n_best=10):
    scores = dict()

    for th in thesaurus:
        p_vec, q_vec = load_obj(th)
        scores[th] = score(p_vec, q_vec, new_doc)

    # highest scores first, so the slice keeps the n_best most relevant tags
    scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    return scores[:n_best]


def test():
    # define some toy data
    docs = []
    doc1 = ['i', 'love', 'paris']
    doc2 = ['i', 'love', 'cats']
    doc3 = ['am', 'allergic', 'cats']

    docs.append(doc1)
    docs.append(doc2)
    docs.append(doc3)

    voc = doc1 + doc2 + doc3
    voc = list(set(voc))
    print('vocab:', voc)

    # Here thesaurus and annotations are the same
    thes = ['animals', 'city', 'health']
    anno = [['city'], ['animals'], ['health']]

    # check relevance matrix
    compute_relevance_matrices(docs, anno, thes, voc)
    for t in thes:
        print('Thesaurus word:', t)
        t_p, t_q = load_obj(t)
        for w in voc:
            print('word:', w, 't_p:', t_p[w], 't_q:', t_q[w])
        print()

    doc4 = ['cats', 'love', 'am']
    print(doc4)

    doc4_scores = predict_tags(thes, doc4)
    for th_s in doc4_scores:
        print('thesaurus:', th_s[0], 'relevance:', th_s[1])


print('\n========== TEST ============\n')
test()

print('\n========== TRAIN ============\n')
# compute_relevance_matrices(documents_train, annotations_train,
# thesaurus, vocab)
print('[ + ] finished training')


print('\n========== SCORE ============\n')
print('[...] Computing document relevance')
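results = []
for doc, tags in zip(documents_test, annotations_test):
    predicted = predict_tags(thesaurus, doc, n_best=len(tags))
    results.append((predicted, tags))

print('[...] Computing model accuracy')
accuracy = 0
for predicted, tags in results:
    # predicted holds (tag, score) pairs; compare on tag names only
    predicted_tags = {tag for tag, _ in predicted}
    if tags:
        accuracy += len(predicted_tags.intersection(tags)) / len(tags)
# average share of annotated tags recovered per document, as a percentage
accuracy = 100 * accuracy / len(results)

print('[ + ] Model accuracy: {} %'.format(accuracy))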
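
The `score` function in relevance_ranking.py multiplies one probability ratio per word, so on long documents both products can underflow to zero. A common workaround is to sum logarithms instead, which preserves the ranking because the logarithm is monotonic. The sketch below is one possible variant under that assumption; the names `smoothed_estimate` and `log_odds_score` are hypothetical and do not appear in the commit.

```python
import math


def smoothed_estimate(count, n_docs):
    """Same smoothing as compute_mle_vector: (count + 0.5) / (N + 1)."""
    return (count + 0.5) / (n_docs + 1)


def log_odds_score(p_vec, q_vec, doc_terms):
    """Sum of log odds ratios for the terms of a document.

    p_vec / q_vec map a word to its smoothed probability of appearing in
    relevant / non-relevant documents. Words missing from either vector
    are skipped (an assumption; the original code would raise KeyError).
    """
    total = 0.0
    for t in doc_terms:
        if t not in p_vec or t not in q_vec:
            continue
        p, q = p_vec[t], q_vec[t]
        # log( p (1 - q) / (q (1 - p)) ), computed without forming the product
        total += math.log(p * (1 - q)) - math.log(q * (1 - p))
    return total


if __name__ == "__main__":
    # Toy vectors: one relevant document, two non-relevant documents.
    p_vec = {"cats": smoothed_estimate(1, 1), "love": smoothed_estimate(1, 1)}
    q_vec = {"cats": smoothed_estimate(1, 2), "love": smoothed_estimate(1, 2)}
    print(log_odds_score(p_vec, q_vec, ["cats", "love"]))
```

Because each word is counted at most once per document, the smoothed estimates stay strictly between 0 and 1, so every logarithm here is defined. Sorting tags by this value in descending order would give the same order as sorting by the original odds whenever the latter does not underflow.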