From a4aa8f26c11a9453feaaeda7b1fc30201c9a2315 Mon Sep 17 00:00:00 2001
From: David Thomas
Date: Wed, 7 Mar 2018 23:42:40 -0500
Subject: [PATCH] templates & code of conduct added, rm_nonchars fixed for greek texts

---
 CODE_OF_CONDUCT.md          |  73 +++++++++++++++
 README.md                   | 174 ++++++++++++++++++++++++++++--------
 dhelp/text/_bases_mixins.py |  11 ++-
 issue_template.md           |  17 ++++
 pull_request_template.md    |  17 ++++
 5 files changed, 254 insertions(+), 38 deletions(-)
 create mode 100644 CODE_OF_CONDUCT.md
 create mode 100644 issue_template.md
 create mode 100644 pull_request_template.md

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 0000000..b2f60a9
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,73 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, gender identity and expression, level of experience,
+education, socio-economic status, nationality, personal appearance, race,
+religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+  advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at dave.a.base@gmail.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
diff --git a/README.md b/README.md
index 729cfdf..a21d96d 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,9 @@
 
 ---
 
-[![PyPI version](https://badge.fury.io/py/dhelp.svg)](https://badge.fury.io/py/dhelp) [![Build Status](https://travis-ci.org/thePortus/dhelp.svg?branch=master)](https://travis-ci.org/thePortus/dhelp) [![Coverage Status](https://coveralls.io/repos/github/thePortus/dhelp/badge.svg?branch=master)](https://coveralls.io/github/thePortus/dhelp?branch=master) [![Documentation Status](https://readthedocs.org/projects/dhelp/badge/?version=latest)](http://dhelp.readthedocs.io/en/latest/?badge=latest) [![Code Health](https://landscape.io/github/thePortus/dhelp/master/landscape.svg?style=flat)](https://landscape.io/github/thePortus/dhelp/master) [![Waffle.io - Columns and their card count](https://badge.waffle.io/thePortus/dhelp.svg?columns=all)](https://waffle.io/thePortus/dhelp)
+[![PyPI version](https://badge.fury.io/py/dhelp.svg)](https://badge.fury.io/py/dhelp)
+![PyPI - License](https://img.shields.io/pypi/l/Django.svg)
+ [![Build Status](https://travis-ci.org/thePortus/dhelp.svg?branch=master)](https://travis-ci.org/thePortus/dhelp) [![Coverage Status](https://coveralls.io/repos/github/thePortus/dhelp/badge.svg?branch=master)](https://coveralls.io/github/thePortus/dhelp?branch=master) [![Documentation Status](https://readthedocs.org/projects/dhelp/badge/?version=latest)](http://dhelp.readthedocs.io/en/latest/?badge=latest) [![Code Health](https://landscape.io/github/thePortus/dhelp/master/landscape.svg?style=flat)](https://landscape.io/github/thePortus/dhelp/master) [![Total GitHub downloads](https://img.shields.io/github/downloads/thePortus/dhelp/total.svg)](https://img.shields.io/github/downloads/thePortus/dhelp/total.svg) [![Waffle.io - Columns and their card count](https://badge.waffle.io/thePortus/dhelp.svg?columns=all)](https://waffle.io/thePortus/dhelp)
 
 ---
 
@@ -32,6 +34,23 @@ Requires [Python 3.x](https://python.org)
 
 ---
 
+# Table of Contents
+
+* [Installation](#installation)
+* [Quickstart Guide](#quickstart-guide)
+* [Web Module](#web-module)
+  * [WebPage](#webpage)
+* [File Module](#file-module)
+  * [TextFile](#textfile)
+  * [TextFolder](#textfolder)
+  * [CSVFile](#csvfile)
+* [Text Module](#text-module)
+  * [EnglishText](#englishtext)
+  * [LatinText](#latintext)
+  * [AncientGreekText](#ancientgreektext)
+
+---
+
 # Installation
 
 Install with pip (recommended)
@@ -125,11 +144,18 @@ that comes with many convenient cleaning/nlp methods attached.
 
 You can chain any of the string transformation methods to perform many text operations at once.
 
-#### All Languages
+### EnglishText
+
+**Setup: Download the English Corpora**
+
+Before you use this object for any of the methods below you need to download trainer corpora.
 
-**All Languages Have These Methods**
+```python
+from dhelp import EnglishText
+EnglishText('').setup()
+```
 
-Examples...
+**Examples...**
+
 ```python
 
@@ -144,9 +170,9 @@ text.rm_nonchars()
 'The quick brown fox jumped over the lazy dog'
 
 # .rm_edits() - remove text between editorial marks
-text = EnglishText('Th3e qui\\nck b rown fox jumped over the lazy dog')
+text = EnglishText('The [quick] brown fox jumped over the lazy dog')
 text.rm_edits()
-'The quick brown fox jumped over the lazy dog'
+'The brown fox jumped over the lazy dog'
 
 # .rm_spaces() - collapses redundant whitespaces
 text = EnglishText('Th3e qui\\nck b rown fox jumped over the lazy dog')
@@ -170,23 +196,6 @@ text = EnglishText('Th3e qui\\nck b rown fox jumped over the lazy dog')
 text.rm_lines().rm_nonchars().rm_spaces()
 'The quick brown fox jumped over the lazy dog'
 
-```
-
-#### English
-
-**Setup: Download the English Corpora**
-
-Before you use this object for any of the methods below you need to download trainer corpora.
-
-```python
-from dhelp import EnglishText
-EnglishText('').setup()
-```
-
-Examples...
-
-```python
-
 # lemmatize a text to make word counts/analysis
 text = EnglishText('The quick brown fox jumped over the lazy dog.')
 text.lemmatize()
@@ -219,9 +228,7 @@ text.skipgrams()
 
 ```
 
-#### Latin
-
-**Note: Latin Classes inherit all methods from EnglishText**
+#### LatinText
 
 **Setup: Download the Latin Corpora**
 
@@ -234,8 +241,47 @@ LatinText('').setup()
 
 ```
 
+**Examples...**
+
 ```python
 
+# .rm_lines() - remove endline characters
+text = LatinText('Gallia \\nest omnis divisa in partes tres')
+text.rm_lines()
+'Gallia est omnis divisa in partes tres'
+
+# .rm_nonchars() - remove non-letters
+text = LatinText('Ga3llia est omnis divisa in partes tres')
+text.rm_nonchars()
+'Gallia est omnis divisa in partes tres'
+
+# .rm_edits() - remove text between editorial marks
+text = LatinText('Gallia est [omnis] divisa in partes tres')
+text.rm_edits()
+'Gallia est divisa in partes tres'
+
+# .rm_spaces() - collapses redundant whitespaces
+text = LatinText('Gallia est omnis divisa in partes tres')
+text.rm_spaces()
+'Gallia est omnis divisa in partes tres'
+
+# .re_search() - checks for a given pattern
+text = LatinText('Gallia est omnis divisa in partes tres')
+text.re_search('Gallia')
+True
+text.re_search('Graecia')
+False
+
+# .rm_stopwords() - removes a list of words from text
+text = LatinText('Gallia est omnis divisa in partes tres')
+text.rm_stopwords(['est', 'in'])
+'Gallia omnis divisa partes tres'
+
+# chain methods to perform them in one command
+text = LatinText('Ga3llia \\nest omnis divisa in partes tres')
+text.rm_lines().rm_nonchars().rm_spaces()
+'Gallia est omnis divisa in partes tres'
+
 # tokenize words into list of strings
 text = LatinText('Gallia est omnis divisa in partes tres')
 text.tokenize()
@@ -246,6 +292,21 @@ text = LatinText('Gallia est omnis divisa in partes tres')
 text.lemmatize()
 'gallia edo1 omne divido in pars tres'
 
+# generate ngrams...
+text = LatinText('They hated to think of sample sentences.')
+text.ngrams()
+[('They', 'hated', 'to'), ('hated', 'to', 'think'), ('to', 'think', 'of'), ('think', 'of', 'sample'), ('of', 'sample', 'sentences'), ('sample', 'sentences', '.')]
+
+# ... or skipgrams
+text = LatinText('They hated to think of sample sentences.')
+text.skipgrams()
+[('Gallia', 'est', 'omnis'), ('est', 'omnis', 'divisa'), ('omnis', 'divisa', 'in'), ('divisa', 'in', 'partes'), ('in', 'partes', 'tres')]
+
+# count all words
+text = LatinText('Gallia est omnis divisa in partes tres tres tres')
+text.word_count(word='tres')
+3
+
 # scan text for meter
 text = LatinText('Arma virumque cano, Troiae qui primus ab oris')
 text.scansion()
@@ -272,20 +333,13 @@ text.compare_longest_common_substring('Galliae sunt omnis divisae in partes tres
 'in partes tres'
 
 # compare minhash's
-LatinText('Gallia est omnis divisa in partes tres')
+text = LatinText('Gallia est omnis divisa in partes tres')
 text.compare_minhash('Galliae sunt omnis divisae in partes tres')
 0.6444444444444445
 
-# count all words
-text = LatinText('Gallia est omnis divisa in partes tres tres tres')
-text.word_count(word='tres')
-3
-
 ```
 
-#### Greek
-
-**Note: Greek Classes inherit all methods from EnglishText**
+#### AncientGreekText
 
 **Setup: Download the Greek Corpora**
 
@@ -298,8 +352,47 @@ AncientGreekText('').setup()
 
 ```
 
+**Examples...**
+
 ```python
 
+# .rm_lines() - remove endline characters
+text = AncientGreekText('ἔνθα \nποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_lines()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .rm_nonchars() - remove non-letters
+text = AncientGreekText('ἔν3θα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_nonchars()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .rm_edits() - remove text between editorial marks
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_edits()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .rm_spaces() - collapses redundant whitespaces
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_spaces()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .re_search() - checks for a given pattern
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.re_search('Ἀθηναίοις')
+True
+text.re_search('σπαρτίοις')
+False
+
+# .rm_stopwords() - removes a list of words from text
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_stopwords(['ποτὲ', 'ἀργύρου'])
+'ἔνθα Ἀθηναίοις ἦν μέταλλα'
+
+# chain methods to perform them in one command
+text = AncientGreekText('ἔν3θα \nποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_lines().rm_nonchars().rm_spaces()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
 # normalize character encoding differences
 text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
 text.normalize()
@@ -320,6 +413,14 @@ text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύ
 text.lemmatize()
 'ἔνθα ποτὲ ἀθηναῖος εἰμί ἀργυρόω μέταλλον'
 
+text.ngrams()
+[('ἔνθα', 'ποτὲ', 'Ἀθηναίοις'), ('ποτὲ', 'Ἀθηναίοις', 'ἦν'), ('Ἀθηναίοις', 'ἦν', 'ἀργύρου'), ('ἦν', 'ἀργύρου', 'μέταλλα')]
+
+# ... or skipgrams
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.skipgrams()
+[('ἔνθα', 'ποτὲ', 'Ἀθηναίοις'), ('ἔνθα', 'ποτὲ', 'ἦν'), ('ἔνθα', 'Ἀθηναίοις', 'ἦν'), ('ποτὲ', 'Ἀθηναίοις', 'ἦν'), ('ποτὲ', 'Ἀθηναίοις', 'ἀργύρου'), ('ποτὲ', 'ἦν', 'ἀργύρου'), ('Ἀθηναίοις', 'ἦν', 'ἀργύρου'), ('Ἀθηναίοις', 'ἦν', 'μέταλλα'), ('Ἀθηναίοις', 'ἀργύρου', 'μέταλλα'), ('ἦν', 'ἀργύρου', 'μέταλλα')]
+
 # perform part-of-speech tagging
 text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
 text.tag()
@@ -345,6 +446,7 @@ text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύ
 text.word_count(word='Ἀθηναίοις')
 1
 
+
 ```
 
 ---
@@ -370,3 +472,5 @@ text_folder.modify('path/to/latin/files-lemmatized', modify_function)
 # That's it, you will now have a folder full of lemmatized Latin files
 
 ```
+
+More Examples Coming Soon
diff --git a/dhelp/text/_bases_mixins.py b/dhelp/text/_bases_mixins.py
index a11eae2..d75772e 100644
--- a/dhelp/text/_bases_mixins.py
+++ b/dhelp/text/_bases_mixins.py
@@ -85,8 +85,9 @@ def rm_lines(self):
     def rm_nonchars(self):
         """Removes non-language characters.
-        Gives a new version of the text with only latin characters remaining.
-        Is overriden by child objects for languages using non latinate chars.
+        Gives a new version of the text with only latin characters remaining,
+        or Greek characters for Greek texts, and so on. Defaults to assuming
+        Latin-based text.
 
         Returns:
             :obj:`self.__class__` Returns new version of text, with non-letters removed
@@ -97,8 +98,12 @@ def rm_nonchars(self):
             >>> print(modified_text)
             'Lorem ipsum dolor sit amet...'
         """  # noqa
+        if self.options['language'] == 'greek':
+            valid_chars_pattern = '([ʹ-Ϋά-ϡἀ-ᾯᾰ-῾ ])'
+        else:
+            valid_chars_pattern = '([A-Za-z ])'
         return self.__class__(
-            "".join(re.findall("([A-Za-z ])", self.data)),
+            "".join(re.findall(valid_chars_pattern, self.data)),
             self.options
         )
 
diff --git a/issue_template.md b/issue_template.md
new file mode 100644
index 0000000..f8b8ed9
--- /dev/null
+++ b/issue_template.md
@@ -0,0 +1,17 @@
+## Expected Behavior
+
+
+## Actual Behavior
+
+
+## Steps to Reproduce the Problem
+
+  1.
+  1.
+  1.
+
+## Specifications
+
+  - Version:
+  - Platform:
+  - Subsystem:
diff --git a/pull_request_template.md b/pull_request_template.md
new file mode 100644
index 0000000..11ccb18
--- /dev/null
+++ b/pull_request_template.md
@@ -0,0 +1,17 @@
+# Fixes
+
+Changes Proposed
+
+*
+*
+*
+
+Please ensure your pull request adheres to the following guidelines:
+
+- [ ] Use the following format: `* [owner/repo](link)`
+- [ ] Link additions should be added to the bottom of the relevant category.
+- [ ] New categories or improvements to the existing categorization are welcome.
+- [ ] Search previous suggestions before making a new one, as yours may be a duplicate.
+- [ ] Sort by alphabetical order
+
+Thanks for contributing!
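
Reviewer note on the `rm_nonchars` change in `dhelp/text/_bases_mixins.py`: the language switch can be tried outside the class with a minimal sketch like the one below. The helper name `strip_nonchars` and the `VALID_CHARS` table are hypothetical and not part of dhelp. The Greek ranges are written as Unicode escapes and only approximate the literal ranges in the patch, so treat the exact bounds as an assumption.

```python
import re

# Character classes mirroring the two branches of the patched rm_nonchars().
# The Greek bounds are an assumption: they approximate the literal ranges in
# the diff (Greek and Coptic letters plus the Greek Extended block).
VALID_CHARS = {
    "greek": "[\u0374-\u03ab\u03ac-\u03e1\u1f00-\u1faf\u1fb0-\u1ffe ]",
    "latin": "[A-Za-z ]",
}


def strip_nonchars(data, language="latin"):
    """Hypothetical stand-in for rm_nonchars(): keep letters and spaces only."""
    # Any language other than 'greek' falls back to the Latin class,
    # matching the else branch in the patch.
    pattern = VALID_CHARS.get(language, VALID_CHARS["latin"])
    return "".join(re.findall(pattern, data))


# Latin branch: the stray digit is dropped.
print(strip_nonchars("Th3e quick brown fox"))  # The quick brown fox

# Greek branch: the input is 'ἔν3θα' written with escapes; the digit is
# dropped and the precomposed Greek letters are kept.
cleaned_greek = strip_nonchars("\u1f14\u03bd3\u03b8\u03b1", language="greek")
```

One caveat with this approach: the ranges cover precomposed Greek letters, so text stored in decomposed form (base letter plus combining accents) would likely lose its diacritics unless it is normalized first.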