first commit

WilliamDiakite · Apr 23, 2018 · 3ef4698
commit 3ef4698

Showing 9 changed files with 70,202 additions and 0 deletions.
diff --git a/.ipynb_checkpoints/Untitled-checkpoint.ipynb b/.ipynb_checkpoints/Untitled-checkpoint.ipynb
diff --git a/.ipynb_checkpoints/relevance_ranking.py-checkpoint.ipynb b/.ipynb_checkpoints/relevance_ranking.py-checkpoint.ipynb
diff --git a/__pycache__/utils.cpython-36.pyc b/__pycache__/utils.cpython-36.pyc
diff --git a/co_occurrence.csv b/co_occurrence.csv
@@ -0,0 +1,31 @@
+Source,Target,Weight
+ingenieur,sens du collectif,2305
+mecanicien,ponctuel,2351
+cuisinier,autonome,2224
+boulanger,temeraire,2269
+politicien,autonome,2317
+luthier,rigoureux,2276
+ingenieur,curieux,2253
+mecanicien,motive,2279
+cuisinier,creatif,2264
+boulanger,motive,2345
+politicien,sens du collectif,2274
+luthier,ponctuel,2254
+ingenieur,creatif,2201
+mecanicien,rigoureux,2223
+politicien,creatif,2299
+luthier,habile,2268
+ingenieur,audacieux,2365
+cuisinier,sens du collectif,2318
+boulanger,habile,2220
+politicien,curieux,2260
+mecanicien,temeraire,2347
+cuisinier,curieux,2281
+luthier,temeraire,2306
+cuisinier,audacieux,2342
+mecanicien,habile,2229
+boulanger,ponctuel,2390
+politicien,audacieux,2279
+ingenieur,autonome,2305
+boulanger,rigoureux,2205
+luthier,motive,2325
diff --git a/data/NLM_500/documents/12915586.txt b/data/NLM_500/documents/12915586.txt
@@ -0,0 +1,9 @@
+Susceptibility of Human Hepatitis Delta Virus RNAs to Small Interfering RNA Action
+In animal cells, small interfering RNAs (siRNA), when exogenously provided, have been reported to be capable of inhibiting replication of several different viruses. In preliminary studies, siRNA species were designed and tested for their ability to act on the protein expressed in Huh7 cells transfected with DNA-directed mRNA constructs containing hepatitis delta virus (HDV) target sequences. The aim was to achieve siRNA specific for each of the three RNAs of HDV replication: (i) the 1,679-nucleotide circular RNA genome, (ii) its exact complement, the antigenome, and (iii) the less abundant polyadenylated mRNA for the small delta protein. Many of the 16 siRNA tested gave >80% inhibition in this assay. Next, these three classes of siRNA were tested for their ability to act during HDV genome replication. It was found that only siRNA targeted against HDV mRNA sequences could interfere with HDV genome replication. In contrast, siRNA targeted against genomic and antigenomic RNA sequences had no detectable effect on the accumulation of these RNAs. Reconstruction experiments with nonreplicating HDV RNA sequences support the interpretation that neither the potential for intramolecular rod-like RNA folding nor the presence of the delta protein conferred resistance to siRNA. In terms of replicating HDV RNAs, it is considered more likely that the genomic and antigenomic RNAs are resistant because their location within the nucleus makes them inaccessible to siRNA-mediated degradation.
+
+Human hepatitis delta virus (HDV) has a 1,679-nucleotide (nt) single-stranded circular RNA genome that is replicated by RNA-directed RNA synthesis, most probably involving host RNA polymerase II . During this replication, three RNA species accumulate , as represented in Fig. . The genome and its exact complement, the antigenome, are considered unit length. They exist primarily in a circular conformation but also in a linear conformation; these two conformations can be resolved using appropriate conditions of gel electrophoresis . The third RNA species consists of relatively lower amounts of an 800-nt polyadenylated RNA (of the same polarity as the antigenome), which is translated to produce a 195-amino-acid protein, known as the delta antigen (deltaAg-S), and is essential for HDV genome replication . An increasing number of reports have shown that small interfering RNAs (siRNA) can be exogenously provided to cells undergoing animal virus replication and achieve inhibition . For the following reasons, we were specifically interested in the possible susceptibility to siRNA of HDV RNAs. (i) The HDV genomic and antigenomic RNAs can fold into an unbranched rod-like structure with 74% of the bases paired , and this folding might interfere with siRNA action. (ii) The delta protein has the ability to bind double-stranded RNA  and thus might also interfere. (iii) While several reports indicate that HDV genomic and antigenomic RNAs are predominantly located in the cell nucleus , two recent studies using cell fractionation indicated that much of the genomic RNA (but not the antigenomic RNA) might be cytoplasmic  and thus possibly accessible to siRNA attack . With these questions in mind, our first objective was to design and test siRNA species specific for sequences on HDV mRNA, genomic RNA, and antigenomic RNA. As represented in Fig. 
, the initial strategy was to use expression vectors to produce within Huh7 cells  DNA-directed mRNA species that contain these three sequences. As indicated, a total of 16 siRNA, each containing a 21-nt region based on the nucleotide sequence of Kuo et al. , were designed and constructed using a Silencer siRNA construction kit (Ambion). The locations and sequences of the target sites are listed in Table . The assay in each case was for the translation and accumulation of the delta protein, as detected by immunoblotting. The results and their quantitation are presented in Fig. . Also shown is the immunoblot for the internal control, green fluorescent protein (GFP). An expression vector for GFP was cotransfected, and as expected, the siRNA directed against HDV sequences did not inhibit the accumulation of GFP protein. Consistent with the siRNA experience of others , we found that some of the designed siRNA were able to give 80 to 95% inhibition of the HDV mRNA sequences (Fig. , lanes 4, 6, and 7). Furthermore, as planned, it was possible to obtain siRNA that attacked the genomic and antigenomic RNA sequences (Fig. , respectively). It should be noted that in Fig. , the insert of 657 nt of antigenomic sequence (from position 660 to 4) should thus have been able in large part to fold with the 585 nt spanning the delta antigen open reading frame (at position 1596 to 1011) into an extensive amount of unbranched rod-like structure. Even if this potential folding did occur in vivo, it did not confer resistance to functioning siRNA (lanes 14, 15, and 16). We next examined whether those siRNA with proven activity could interfere with HDV genome replication, as assayed by the accumulation of unit-length HDV RNA species. Some results are shown in Fig. , along with quantitation. We observed that inhibition of RNA accumulation  occurred only with siRNA 4 and 6, which targeted the mRNA sequences and caused a significant reduction of delta protein accumulation . 
(In panel A, the data shown are for antigenomic RNA; however, similar results were obtained for genomic RNA [data not shown]. Also, siRNA 14, 15, and 16 failed to reduce HDV RNA accumulation.) In the above-described experiment, the gel electrophoresis conditions used do not separate the circular from the linear conformations of unit-length HDV RNAs. However, if the siRNA treatment led to single endonucleolytic cuts on unit-length circles to produce linears, we expect that this would have inhibited further RNA accumulation . Therefore, to further test this preliminary interpretation that unit-length genomic and antigenomic RNAs were resistant to siRNA, we carried out the following additional experiment. We used transfection to express HDV RNA multimers by DNA-directed RNA transcription. From previous studies, we knew that these would be posttranscriptionally processed to form unit-length RNA circles and yet, because of a 2-nt deletion in the open reading frame for the small delta protein, would be unable to make the essential delta protein and undergo RNA-directed transcription and replication . We then used gel analysis conditions capable of separating both linear and circular conformations of unit-length HDV RNA. Some data are shown in Fig. . As can be seen from the quantitation, cotransfection with siRNA 10, 11, 12, and 13, specific for genomic RNA sequences, did not reduce the accumulation of unit-length genomic RNA. Furthermore, the presence of this siRNA did not cause a reduction in the fraction of RNAs with a circular conformation (lanes 10 to 13 relative to lane C). In similar experiments, we expressed antigenomic nonreplicating RNAs and found that they were not sensitive to siRNA 4, 6, and 9 (data not shown). We interpret these data as evidence against siRNA action, even for inducing a single nick on the nonreplicating HDV RNAs. In addition, the delta protein was not present and thus could not be the basis for the observed resistance. 
Our studies show that HDV circular RNAs, whether transcribed from an RNA template  or from a DNA template , were resistant to siRNA attack. In the case of the DNA-directed transcript, the expression vector was such that the primary (nonreplicating) multimeric transcript was via host RNA polymerase II and that RNA (prior to ribozyme processing and ligation to form unit-length circles) should have undergone both 5'-capping and 3' polyadenylation. Others have shown that for a host mRNA precursor, the intronic regions are resistant to siRNA while the exons are sensitive . Our studies show that the unit-length circular RNA processed out of the multimeric transcript was not only stable but also resistant to siRNA attack. In summary, these studies support the interpretation that during genome replication, the only HDV RNA directly susceptible to inhibition mediated by siRNA is the mRNA. At least under the conditions of these experiments, the resistance of the genomic and antigenomic RNAs was not dependent on RNA structure, RNA conformation, or the presence of the small delta protein. Further experiments will be needed to determine if this "resistance" was in fact due to inaccessibility based on nuclear localization of genomic and antigenomic unit-length RNAs. It remains possible that some of the genomic RNA is cytoplasmic but is somehow inaccessible to attack by siRNA. For example, it could be protected by a host RNA-binding protein. For this or maybe other reasons, these HDV RNAs are probably resistant to siRNA because they are simply not accessible to the RISC, a protein-RNA effector nuclease complex that recognizes and destroys target RNAs . In this respect, our findings with HDV are analogous to those reported for siRNA action on the replication of respiratory syncytial virus  and influenza virus . That is, siRNA cannot target the replicating viral RNA transcripts directly but only indirectly via action on the viral mRNA species.
+Representation of three main species of HDV RNA. Representation of three main species of HDV RNA. Also indicated are the genomic and antigenomic ribozymes (cleavage site is shown as a circle) and the open reading frame (ORF) for the deltaAg . At the right is the number of molecules of each RNA per average liver cell for an infected woodchuck and chimpanzee, as previously reported . Indicated on the 1,679-nt genomic RNA is the origin for the nucleotide numbering, according to the sequence of Kuo et al. .
+Transfected siRNAs could target HDV mRNA species. Transfected siRNAs could target HDV mRNA species. Huh7 cells were transfected with one of three plasmids that express an HDV mRNA species. (A) pDL444  expressed an mRNA equivalent to normal HDV mRNA. (B and C) pDL444 was modified to contain 657 nt of extra sequences in the 3' untranslated region, which led to the transcription of either partial genomic HDV RNA sequences (position 4 to 660) (B) or antigenomic RNA sequences (position 660 to 4) (C). The constructs were cotransfected with a plasmid expressing GFP. After 2 days the total protein was extracted and examined by immunoblotting to detect both delta protein and GFP. Detection and quantitation were with a bioimager (Fuji). The 16 HDV-specific siRNA indicated in the figure were designed and delivered as a cotransfection using Lipofectamine 2000 (Invitrogen). As a negative control (lanes C) we used siRNA against glyceraldehyde-3-phosphate dehydrogenase.
+Effect of transfected siRNA on the accumulation of replicating HDV RNAs. Effect of transfected siRNA on the accumulation of replicating HDV RNAs. Cells were transfected with pDL553  to initiate HDV genome replication. siRNA species (at 30 nM) were cotransfected as indicated and as previously described in Fig. . At day 2, total RNA was extracted, glyoxalated, and analyzed by electrophoresis in a 1% agarose gel. HDV antigenomic RNA was then detected by Northern assay (A). Similarly, total protein was examined by immunoblotting to detect deltaAg-S (B). For both panels, bioimager data were subjected to quantitation, expressing the amount of signal detected relative to that obtained for the control transfection. As a negative control (lane C), we used siRNA against endogenous glyceraldehyde-3-phosphate dehydrogenase (GAPDH); treatment with this siRNA reduced GAPDH mRNA but had no effect on HDV RNA levels (data not shown). Also shown in panel B is the immunoblot assay for expression of GFP, which was cotransfected as a control.
+Transfected siRNA did not target nonreplicating unit-length HDV genomic RNA. Transfected siRNA did not target nonreplicating unit-length HDV genomic RNA. Cells were transfected with pDL542  to achieve the transcription and accumulation of nonreplicating unit-length genomic HDV RNA circles. Cotransfected with this plasmid was either a glyceraldehyde-3-phosphate dehydrogenase control siRNA (lane C) or genomic HDV-specific siRNA (lanes 10 to 13), as indicated at the top of the figure and as previously described for Fig. . At day 2, total RNA was extracted, glyoxalated, and analyzed by electrophoresis in a 3% agarose gel. Linear and circular forms of HDV genomic RNA were then detected by Northern assay. After hybridization and quantitation using the bioimager, we deduced the amounts of linear and circular RNAs, as summarized in the histogram.
+Locations and sequences of siRNA targets
diff --git a/relevance_ranking.py b/relevance_ranking.py
@@ -0,0 +1,219 @@
+import os
+
+from utils import tokenize, save_obj, load_obj
+from itertools import chain
+from collections import Counter
+from collections import defaultdict
+
+
+# import some stopwords
+with open('stopwords.txt') as f:
+    stopwords = [s.rstrip() for s in f]
+
+
+# import the documents and their annotations
+documents = []
+annotations = []
+
+for f in sorted(os.listdir('./data/NLM_500/documents/')):
+    filename = './data/NLM_500/documents/' + f
+    if filename.endswith('.txt'):
+        documents.append(open(filename, encoding='ISO-8859-1').read())
+    elif filename.endswith('.key'):
+        an = [a.rstrip().lower() for a in open(filename,
+                                               encoding='ISO-8859-1')]
+        annotations.append(an)
+
+# print(documents[0])
+# print()
+# print(annotations[0])
+
+# tokenize documents
+print('[ + ] Tokenizing documents')
+documents = [tokenize(d, stopwords) for d in documents]
+documents = [list(set(d)) for d in documents]
+
+
+# Output some info about the data
+vocab = []
+thesaurus = []
+
+for doc in documents:
+    vocab += list(set(doc))
+
+for tw in annotations:
+    thesaurus += tw
+
+vocab = sorted(list(set(vocab)))
+thesaurus = sorted(list(set(thesaurus)))
+intersection = sorted(list(set(thesaurus).intersection(vocab)))
+
+print('Vocab size:', len(vocab))
+print('Thesaurus size:', len(thesaurus))
+print('Intersection size:', len(intersection))
+
+
+# Tag quantity (and not really a distribution)
+nb_tags = 0
+min_doc = 5
+tag_dist = Counter(chain.from_iterable(annotations))
+for t in sorted(tag_dist.items(), key=lambda x: x[1], reverse=True):
+    if t[1] >= min_doc:
+        nb_tags += 1
+print('[ ! ] Tags that tag at least {} documents: {}'.format(min_doc, nb_tags))
+
+# Init training
+# TODO: cross validation
+
+# Get training set
+documents_train = documents[:375]
+annotations_train = annotations[:375]
+
+# Get test set
+documents_test = documents[375:]
+annotations_test = annotations[375:]
+
+
+def word_occurence(documents, vocab):
+    # Init a zero vector of size vocab
+    word_count_idx = {w: 0 for w in vocab}
+
+    for doc in documents:
+        # Update vectors for all the collection
+        for word in doc:
+            word_count_idx[word] += 1
+
+    return word_count_idx
+
+
+def compute_mle_vector(word_occurence, N):
+    mle_vector = dict()
+    for w in word_occurence:
+        mle_vector[w] = (word_occurence[w] + 0.5) / (N + 1)
+
+    return mle_vector
+
+
+def compute_relevance_matrices(documents, annotations, thesaurus, vocab):
+
+    for th in thesaurus:
+        with open('last_th.txt', 'w') as f:
+            f.write(th)
+
+        # helps separate relevant docs from non-relevant ones
+        corpus = defaultdict(list)
+
+        # mark relevance for all documents
+        for doc, tags in zip(documents, annotations):
+            if th in tags:
+                corpus['relevant'].append(doc)
+            else:
+                corpus['nonrelevant'].append(doc)
+
+        # Word occurrences in relevant and non relevant documents
+        rel_count_vec = word_occurence(corpus['relevant'], vocab)
+        non_count_vec = word_occurence(corpus['nonrelevant'], vocab)
+
+        # Number of relevant and non-relevant documents
+        N_rel = len(corpus['relevant'])
+        N_non = len(corpus['nonrelevant'])
+
+        # Compute maximum likelihood
+        p_prob = compute_mle_vector(rel_count_vec, N_rel)
+        q_prob = compute_mle_vector(non_count_vec, N_non)
+
+        # save probabilities on disk
+        save_obj(obj=(p_prob, q_prob), name=th)
+
+
+def score(p_vec, q_vec, new_doc):
+    '''
+        Odds ratio that new_doc is relevant to one thesaurus term,
+        given that term's word probabilities p_vec and q_vec
+    '''
+    num_prod = 1
+    denum_prod = 1
+    score = 0
+
+    for t in new_doc:
+        num_prod *= p_vec[t] * (1 - q_vec[t])
+        denum_prod *= q_vec[t] * (1 - p_vec[t])
+
+    try:
+        score = num_prod / denum_prod
+    except ZeroDivisionError:
+        pass  # denominator underflowed to 0; keep the default score of 0
+
+    return score
+
+
+def predict_tags(thesaurus, new_doc, n_best=10):
+    scores = dict()
+
+    for th in thesaurus:
+        p_vec, q_vec = load_obj(th)
+        scores[th] = score(p_vec, q_vec, new_doc)
+
+    scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)
+
+    return scores[:n_best]
+
+
+def test():
+    # define some toy data
+    docs = []
+    doc1 = ['i', 'love', 'paris']
+    doc2 = ['i', 'love', 'cats']
+    doc3 = ['am', 'allergic', 'cats']
+
+    docs.append(doc1)
+    docs.append(doc2)
+    docs.append(doc3)
+
+    voc = doc1 + doc2 + doc3
+    voc = list(set(voc))
+    print('vocab:', voc)
+
+    # Here thesaurus and annotations are the same
+    thes = ['animals', 'city', 'health']
+    anno = [['city'], ['animals'], ['health']]
+
+    # check relevance matrix
+    compute_relevance_matrices(docs, anno, thes, voc)
+    for t in thes:
+        print('Thesaurus word:', t)
+        t_p, t_q = load_obj(t)
+        for w in voc:
+            print('word:', w, 't_p:', t_p[w], 't_q:', t_q[w])
+        print()
+
+    doc4 = ['cats', 'love', 'am']
+    print(doc4)
+
+    doc4_scores = predict_tags(thes, doc4)
+    for th_s in doc4_scores:
+        print('thesaurus:', th_s[0], 'relevance:', th_s[1])
+
+
+print('\n========== TEST ============\n')
+test()
+
+print('\n========== TRAIN ============\n')
+# compute_relevance_matrices(documents_train, annotations_train,
+#                            thesaurus, vocab)
+print('[ + ] finished training')
+
+
+print('\n========== SCORE ============\n')
+print('[...] Computing document relevance')
+results = []
+for doc, tags in zip(documents_test, annotations_test):
+    predicted = predict_tags(thesaurus, doc, n_best=len(tags))
+    results.append((predicted, tags))
+
+print('[...] Computing model accuracy')
+accuracy = 0
+for predicted, tags in results:
+    hits = {t for t, _ in predicted}.intersection(tags)
+    accuracy += 100 * len(hits) / (max(len(tags), 1) * len(results))
+print('[ + ] Model accuracy: {:.2f} %'.format(accuracy))
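
Note on `score()` in relevance_ranking.py: the function multiplies one odds factor per word, and every factor is a ratio of probabilities below 1, so on long documents both products can shrink toward 0 and the division may fail or lose precision. A minimal sketch of a log-space variant follows, assuming the same `p_vec` / `q_vec` dictionaries produced by `compute_mle_vector`. The names `smoothed_estimate` and `log_odds_score` are hypothetical and not part of this commit.

```python
import math


def smoothed_estimate(count, n_docs):
    """Same smoothing as compute_mle_vector: (count + 0.5) / (n_docs + 1)."""
    return (count + 0.5) / (n_docs + 1)


def log_odds_score(p_vec, q_vec, new_doc):
    """Sum of per-word log odds ratios for new_doc.

    Hypothetical alternative to score(): the logarithm is monotonic, so
    tags are ranked in the same order as by the product form, but the
    running total does not underflow on long documents.
    """
    total = 0.0
    for t in new_doc:
        p, q = p_vec[t], q_vec[t]
        total += math.log(p * (1 - q)) - math.log(q * (1 - p))
    return total


# Toy check: one relevant document that contains 'cats', and two
# non-relevant documents of which one contains 'cats'.
p_vec = {'cats': smoothed_estimate(1, 1)}   # 1.5 / 2 = 0.75
q_vec = {'cats': smoothed_estimate(1, 2)}   # 1.5 / 3 = 0.5
ratio = (0.75 * 0.5) / (0.5 * 0.25)         # product form used by score()
assert math.isclose(log_odds_score(p_vec, q_vec, ['cats']), math.log(ratio))
```

Because the smoothing keeps every probability strictly between 0 and 1, each logarithm is defined, and `predict_tags` could sort on these sums exactly as it sorts on the ratios.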