SNLP-Mini-Project

### How to execute the program

Download the latest Wikipedia dump. This is a bz2 archive that contains an XML file with the current state of the Wikipedia articles: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Extract the bz2 archive.
Extract the plain text from the XML file:

python WikiExtractor/WikiExtractor.py -q -o "Wikipedia Corpus" enwiki-latest-pages-articles.xml

This can take up to a few hours.
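
The archive extraction step above can be done with any bz2 tool. As a minimal sketch, assuming Python 3 is available (the helper name `decompress_bz2` is hypothetical and not part of this project), the dump can be decompressed in a streaming fashion so that the very large file never has to fit into memory:

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress the bz2 file at src_path into dst_path.

    The data is copied in chunks of chunk_size bytes, so memory use stays
    small regardless of the size of the archive.
    """
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

For the dump named above, the call would be `decompress_bz2("enwiki-latest-pages-articles.xml.bz2", "enwiki-latest-pages-articles.xml")`. The output file name matches the one passed to WikiExtractor in the command above.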

If the error "AttributeError: module 'fileinput' has no attribute 'hook_compressed_encoded'" appears, follow this link: https://www.bountysource.com/issues/51517999-another-unicodedecodeerror and add the new method hook_compressed_encoded described there to the file $PYTHON_DIR/Lib/fileinput.py
Start the main method of the class "TextAnalyzer" with the following parameters: This class analyzes the corpus. It first extracts all articles from the original corpus. It then checks, for each article, whether the article contains all nouns (or their synonyms) of one of the facts. If all nouns or synonyms of a fact occur in an article, the article is copied to a folder named after the fact id. The result is a folder with 1301 subfolders (one per fact), each holding the Wikipedia articles whose content matched the nouns and synonyms of the corresponding statement. A subfolder can be empty if no article matched.
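
The matching rule described above can be illustrated with a short Python sketch. This is not the project's code: the function names, the in-memory dictionaries (instead of folders on disk), the simple regex tokenizer, and the example fact id are all assumptions made for illustration. Each noun of a fact is represented as a set containing the noun and its synonyms, and an article matches when every such set has at least one member in the article.

```python
import re


def tokenize(text):
    """Lower-case the text and return the set of its alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def article_matches(article_text, noun_groups):
    """Return True if every noun group has at least one word in the article.

    noun_groups is a list of sets; each set holds one noun plus its synonyms
    (lower-case single words, an assumption of this sketch).
    """
    words = tokenize(article_text)
    return all(words & group for group in noun_groups)


def assign_articles(articles, facts):
    """Map each fact id to the titles of all articles that match the fact."""
    return {
        fact_id: [title for title, text in articles.items()
                  if article_matches(text, groups)]
        for fact_id, groups in facts.items()
    }


# Hypothetical example data
articles = {
    "Albert Einstein": "Einstein received the Nobel Prize in Physics.",
    "Berlin": "Berlin is the capital of Germany.",
}
facts = {"fact-1": [{"einstein"}, {"prize", "award"}]}

print(assign_articles(articles, facts))  # {'fact-1': ['Albert Einstein']}
```

In the actual program the matching articles are copied into one folder per fact id rather than collected in a dictionary.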
The last step is to start the main method of the class "FactChecker". This class processes the statements and assigns a truth value to each of them. Every statement (fact) is processed on its own. First, the nouns and verbs are extracted from the statement and their synonyms, if any, are looked up. Then the text files that were previously identified as related to the statement are searched for a line that contains all of the extracted words. If a matching line is found, the statement is assigned '1.0'. If no matching line is found, or if there are no related texts for the statement, the statement is assigned '-1.0'. Finally, the results are saved in a file named "result.ttl".
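
The line-based check described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the FactChecker class itself: the function name `check_fact`, the regex tokenizer, and the representation of each extracted word as a set of the word plus its synonyms are hypothetical. Writing "result.ttl" is left out because its format is not described here.

```python
import re


def check_fact(word_groups, related_texts):
    """Return 1.0 if a single line contains every word group, else -1.0.

    word_groups is a list of sets; each set holds one extracted noun or verb
    plus its synonyms. related_texts is a list of article texts that were
    assigned to the statement beforehand.
    """
    for text in related_texts:
        for line in text.splitlines():
            words = set(re.findall(r"[a-z]+", line.lower()))
            if all(words & group for group in word_groups):
                return 1.0
    # No matching line, or no related texts at all
    return -1.0


# Hypothetical example data
groups = [{"einstein"}, {"won", "received"}, {"prize"}]
text = "Einstein was born in Ulm.\nEinstein received the Nobel Prize in 1921."

print(check_fact(groups, [text]))  # 1.0
print(check_fact(groups, []))      # -1.0
```

Note that all words must occur on the same line: a text that mentions "Einstein" on one line and "prize" on another is not counted as a match.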

