Aligned Narrative Documents 📚

A collection of scripts to create a Document-aligned corpus of German Narrative Texts from four different sources of Simple Language Texts and three different sources of Standard Language Texts. Each sample in the corpus is a pair consisting of a Text in its Standard Language Version (Original) and it's Simple Language Version. The texts can be obtained from PDF, TXT, and HTML-files. Additionally, the original version is truncated to match the extent of the simplified version.

Getting started 🚀

pip install -r -q requirements.txt

Configure the input 🗂️

each of the scripts uses the same json-format to load data, the following table describes the json-attributes:

Parameter	Type	Explaination	Example
simple_path	str	Local File Path to the simplified document. Should be preferred over simple_url as input source.	`<...>/simple.pdf`
simple_start_page	int	PDF-pagenumber of the first PDF-page of the simple version that is processed	`6`
simple_first_page_number_for_removal	int	This parameter helps to remove the in-text page numbers from the actual text. This is page number written on the page of first PDF-page of the simple version.	`6`
simple_start_of_text_marker	str	Text snippet, from which the simple text starts.	`Nathan schreibt`
simple_end_of_text_marker	str	Text snippet, up to which the simple text goes.	`Bis bald.`
simple_url	str	URL to the simplified document.	`https://www.<...>/simple.pdf`
original_url	str	URL to the original document.	`https://www.<...>/original.txt`
original_start_of_text_marker	str	Text snippet, from which the original text starts.	`Nathanael an Lothar`
original_end_of_text_marker	str	Last text snippet of the aligned original document.	`Lebe wohl etc. etc.`
title	str	Title or identifier for the text. Can be left blank if the text's source gave it a title.	`mytext`
simple_text_in_boxes	str	Text snippet, that should be deleted.	`Mehr Informationen`

The sub-datasets 🧩

Our dataset contains one full-text source (MILS) and three fragment-text source (EB, KV, PV). Two scripts and four configuration json-files are needed to create these sub-datasets. They are merged in a final step.

MILS Corpus 🚧

uses the described json-format, stored in the file: ../../data/mils_data.json (texts from https://www.ndr.de/fernsehen/barrierefreie_angebote/leichte_sprache/Maerchen-in-Leichter-Sprache,maerchenleichtesprache100.html)

cd src/preprocessing
python mils_preprocessor.py

if you make any changes to the parser, use the corresponding unit-tests:

cd src/preprocessing
python mils_preprocessor_test.py

EB, PV, KV Corpus 🚧

uses the described json-format, stored in the files:
../../data/eb_data.json (texts from https://einfachebuecher.de),
../../data/pv_data.json(texts from https://www.passanten-verlag.de),
and ../../data/kv_data.json(texts from https://www.kindermannverlag.de)

cd src/preprocessing
python reading_sample_preprocessor.py

if you make any changes to the parser, use the corresponding unit-tests:

cd src/preprocessing
python reading_sample_preprocessor_test.py

Complete Corpus 📚 🗺️

Merge all previous sub-dataset in a complete corpus, and separates them in train, validate and test data. All previous scripts had to be run successfully to create the corpus. This scripts results in six files:
/val-source.txt (Validation dataset, Original Texts) /val-target.txt (Validation dataset, Simple Texts)
/train-source.txt (Train dataset, Original Texts) /train-target.txt (Train dataset, Simple Texts)
/test-source.txt (Test dataset, Original Texts) /test-target.txt (Test dataset, Simple Texts)

cd src/preprocessing
python gnats.py

to better download it: tar -cvf gnats.tar.gz gnats

License & Acknowledments

This work is licensed under a Creative Commons Attribution 4.0 International License. The texts are partially under other licenses. We used texts from Gutenberg-DE and from the NDR Märchen in Leichter Sprache project. We would like to thank NDR very much for giving us the opportunity to make this data publicly available for the first time.

Citation

Please have a look out our CITATION file or use the followong bibtex:

@inproceedings{SchomackerExploringAutomatic2023,
title = {{Exploring Automatic Text Simplification of German Narrative Documents}},
author = {Schomacker, Thorben and Dönicke, Tillmann and Tropmann-Frick, Marina},
booktitle = {Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)},
language = {eng},
month = sep,
year = {2023},
copyright = {Creative Commons Attribution 4.0 International},
address = {Ingolstadt, Germany}
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
mils		mils
src		src
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE.md		LICENSE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Aligned Narrative Documents 📚

Getting started 🚀

Configure the input 🗂️

The sub-datasets 🧩

MILS Corpus 🚧

EB, PV, KV Corpus 🚧

Complete Corpus 📚 🗺️

License & Acknowledments

Citation

About

Releases

Packages

Languages

License

tschomacker/aligned-narrative-documents

Folders and files

Latest commit

History

Repository files navigation

Aligned Narrative Documents 📚

Getting started 🚀

Configure the input 🗂️

The sub-datasets 🧩

MILS Corpus 🚧

EB, PV, KV Corpus 🚧

Complete Corpus 📚 🗺️

License & Acknowledments

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages