templates & code of conduct added, rm_nonchars fixed for greek texts

thePortus · Mar 8, 2018 · a4aa8f2
1 parent 6eed006
Showing 5 changed files with 254 additions and 38 deletions.
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,73 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, gender identity and expression, level of experience,
+education, socio-economic status, nationality, personal appearance, race,
+religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+  advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at [email protected]. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
diff --git a/README.md b/README.md
@@ -4,7 +4,9 @@
 
 ---
 
-[![PyPI version](https://badge.fury.io/py/dhelp.svg)](https://badge.fury.io/py/dhelp) [![Build Status](https://travis-ci.org/thePortus/dhelp.svg?branch=master)](https://travis-ci.org/thePortus/dhelp) [![Coverage Status](https://coveralls.io/repos/github/thePortus/dhelp/badge.svg?branch=master)](https://coveralls.io/github/thePortus/dhelp?branch=master) [![Documentation Status](https://readthedocs.org/projects/dhelp/badge/?version=latest)](http://dhelp.readthedocs.io/en/latest/?badge=latest) [![Code Health](https://landscape.io/github/thePortus/dhelp/master/landscape.svg?style=flat)](https://landscape.io/github/thePortus/dhelp/master) [![Waffle.io - Columns and their card count](https://badge.waffle.io/thePortus/dhelp.svg?columns=all)](https://waffle.io/thePortus/dhelp)
+[![PyPI version](https://badge.fury.io/py/dhelp.svg)](https://badge.fury.io/py/dhelp)
+![PyPI - License](https://img.shields.io/pypi/l/dhelp.svg)
+ [![Build Status](https://travis-ci.org/thePortus/dhelp.svg?branch=master)](https://travis-ci.org/thePortus/dhelp) [![Coverage Status](https://coveralls.io/repos/github/thePortus/dhelp/badge.svg?branch=master)](https://coveralls.io/github/thePortus/dhelp?branch=master) [![Documentation Status](https://readthedocs.org/projects/dhelp/badge/?version=latest)](http://dhelp.readthedocs.io/en/latest/?badge=latest) [![Code Health](https://landscape.io/github/thePortus/dhelp/master/landscape.svg?style=flat)](https://landscape.io/github/thePortus/dhelp/master) [![Total GitHub downloads](https://img.shields.io/github/downloads/thePortus/dhelp/total.svg)](https://img.shields.io/github/downloads/thePortus/dhelp/total.svg) [![Waffle.io - Columns and their card count](https://badge.waffle.io/thePortus/dhelp.svg?columns=all)](https://waffle.io/thePortus/dhelp)
 
 
 ---
@@ -32,6 +34,23 @@ Requires [Python 3.x](https://python.org)
 
 ---
 
+# Table of Contents
+
+* [Installation](#installation)
+* [Quickstart Guide](#quickstart-guide)
+* [Web Module](#web-module)
+    * [WebPage](#webpage)
+* [File Module](#file-module)
+    * [TextFile](#textfile)
+    * [TextFolder](#textfolder)
+    * [CSVFile](#csvfile)
+* [Text Module](#text-module)
+    * [EnglishText](#englishtext)
+    * [LatinText](#latintext)
+    * [AncientGreekText](#ancientgreektext)
+
+---
+
 # Installation
 
 Install with pip (recommended)
@@ -125,11 +144,18 @@ that comes with many convenient cleaning/nlp methods attached. You can chain
 any of the string transformation methods to perform many text operations at
 once.
 
-#### All Languages
+#### EnglishText
+
+**Setup: Download the English Corpora**
+
+Before you use this object for any of the methods below, you need to download the trainer corpora.
 
-**All Languages Have These Methods**
+```python
+from dhelp import EnglishText
+EnglishText('').setup()
+```
 
-Examples...
+**Examples...**
 
 ```python
 
@@ -144,9 +170,9 @@ text.rm_nonchars()
 'The quick brown fox jumped over the lazy dog'
 
 # .rm_edits() - remove text between editorial marks
-text = EnglishText('Th3e qui\\nck b     rown fox jumped over the lazy dog')
+text = EnglishText('The [quick] brown fox jumped over the lazy dog')
 text.rm_edits()
-'The quick brown fox jumped over the lazy dog'
+'The brown fox jumped over the lazy dog'
 
 # .rm_spaces() - collapses redundant whitespaces
 text = EnglishText('Th3e qui\\nck b     rown fox jumped over the lazy dog')
@@ -170,23 +196,6 @@ text = EnglishText('Th3e qui\\nck b     rown fox jumped over the lazy dog')
 text.rm_lines().rm_nonchars().rm_spaces()
 'The quick brown fox jumped over the lazy dog'
 
-```
-
-#### English
-
-**Setup: Download the English Corpora**
-
-Before you use this object for any of the methods below you need to download trainer corpora.
-
-```python
-from dhelp import EnglishText
-EnglishText('').setup()
-```
-
-Examples...
-
-```python
-
 # lemmatize a text to make word counts/analysis
 text = EnglishText('The quick brown fox jumped over the lazy dog.')
 text.lemmatize()
@@ -219,9 +228,7 @@ text.skipgrams()
 
 ```
 
-#### Latin
-
-**Note: Latin Classes inherit all methods from EnglishText**
+#### LatinText
 
 **Setup: Download the Latin Corpora**
 
@@ -234,8 +241,47 @@ LatinText('').setup()
 
 ```
 
+**Examples...**
+
 ```python
 
+# .rm_lines() - remove endline characters
+text = LatinText('Gallia \\nest omnis divisa in partes tres')
+text.rm_lines()
+'Gallia est omnis divisa in partes tres'
+
+# .rm_nonchars() - remove non-letters
+text = LatinText('Ga3llia est omnis divisa in partes tres')
+text.rm_nonchars()
+'Gallia est omnis divisa in partes tres'
+
+# .rm_edits() - remove text between editorial marks
+text = LatinText('Gallia est [omnis] divisa in partes tres')
+text.rm_edits()
+'Gallia est divisa in partes tres'
+
+# .rm_spaces() - collapses redundant whitespaces
+text = LatinText('Gallia    est omnis divisa       in partes        tres')
+text.rm_spaces()
+'Gallia est omnis divisa in partes tres'
+
+# .re_search() - checks for a given pattern
+text = LatinText('Gallia est omnis divisa in partes tres')
+text.re_search('Gallia')
+True
+text.re_search('Graecia')
+False
+
+# .rm_stopwords() - removes a list of words from text
+text = LatinText('Gallia est omnis divisa in partes tres')
+text.rm_stopwords(['est', 'in'])
+'Gallia omnis divisa partes tres'
+
+# chain methods to perform them in one command
+text = LatinText('Ga3llia    \\nest omnis divisa       in partes        tres')
+text.rm_lines().rm_nonchars().rm_spaces()
+'Gallia est omnis divisa in partes tres'
+
 # tokenize words into list of strings
 text = LatinText('Gallia est omnis divisa in partes tres')
 text.tokenize()
@@ -246,6 +292,21 @@ text = LatinText('Gallia est omnis divisa in partes tres')
 text.lemmatize()
 'gallia edo1 omne divido in pars tres'
 
+# generate ngrams...
+text = LatinText('Gallia est omnis divisa in partes tres')
+text.ngrams()
+[('Gallia', 'est', 'omnis'), ('est', 'omnis', 'divisa'), ('omnis', 'divisa', 'in'), ('divisa', 'in', 'partes'), ('in', 'partes', 'tres')]
+
+# ... or skipgrams
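+text = LatinText('Gallia est omnis divisa in partes tres')
+text.skipgrams()
+[('Gallia', 'est', 'omnis'), ('Gallia', 'est', 'divisa'), ('Gallia', 'omnis', 'divisa'), ('est', 'omnis', 'divisa'), ('est', 'omnis', 'in'), ('est', 'divisa', 'in'), ('omnis', 'divisa', 'in'), ('omnis', 'divisa', 'partes'), ('omnis', 'in', 'partes'), ('divisa', 'in', 'partes'), ('divisa', 'in', 'tres'), ('divisa', 'partes', 'tres'), ('in', 'partes', 'tres')]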
+
+# count occurrences of a word
+text = LatinText('Gallia est omnis divisa in partes tres tres tres')
+text.word_count(word='tres')
+3
+
 # scan text for meter
 text = LatinText('Arma virumque cano, Troiae qui primus ab oris')
 text.scansion()
@@ -272,20 +333,13 @@ text.compare_longest_common_substring('Galliae sunt omnis divisae in partes tres
 'in partes tres'
 
 # compare minhash's
-LatinText('Gallia est omnis divisa in partes tres')
+text = LatinText('Gallia est omnis divisa in partes tres')
 text.compare_minhash('Galliae sunt omnis divisae in partes tres')
 0.6444444444444445
 
-# count all words
-text = LatinText('Gallia est omnis divisa in partes tres tres tres')
-text.word_count(word='tres')
-3
-
 ```
 
-#### Greek
-
-**Note: Greek Classes inherit all methods from EnglishText**
+#### AncientGreekText
 
 **Setup: Download the Greek Corpora**
 
@@ -298,8 +352,47 @@ AncientGreekText('').setup()
 
 ```
 
+**Examples...**
+
 ```python
 
+# .rm_lines() - remove endline characters
+text = AncientGreekText('ἔνθα \nποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_lines()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .rm_nonchars() - remove non-letters
+text = AncientGreekText('ἔν3θα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_nonchars()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .rm_edits() - remove text between editorial marks
+text = AncientGreekText('ἔνθα [ποτὲ] Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_edits()
+'ἔνθα Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .rm_spaces() - collapses redundant whitespaces
+text = AncientGreekText('ἔνθα      ποτὲ     Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_spaces()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
+# .re_search() - checks for a given pattern
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.re_search('Ἀθηναίοις')
+True
+text.re_search('σπαρτίοις')
+False
+
+# .rm_stopwords() - removes a list of words from text
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_stopwords(['ποτὲ', 'ἀργύρου'])
+'ἔνθα Ἀθηναίοις ἦν μέταλλα'
+
+# chain methods to perform them in one command
+text = AncientGreekText('ἔν3θα      \nποτὲ     Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.rm_lines().rm_nonchars().rm_spaces()
+'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'
+
 # normalize character encoding differences
 text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
 text.normalize()
@@ -320,6 +413,14 @@ text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύ
 text.lemmatize()
 'ἔνθα ποτὲ ἀθηναῖος εἰμί ἀργυρόω μέταλλον'
 
+text.ngrams()
+[('ἔνθα', 'ποτὲ', 'Ἀθηναίοις'), ('ποτὲ', 'Ἀθηναίοις', 'ἦν'), ('Ἀθηναίοις', 'ἦν', 'ἀργύρου'), ('ἦν', 'ἀργύρου', 'μέταλλα')]
+
+# ... or skipgrams
+text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
+text.skipgrams()
+[('ἔνθα', 'ποτὲ', 'Ἀθηναίοις'), ('ἔνθα', 'ποτὲ', 'ἦν'), ('ἔνθα', 'Ἀθηναίοις', 'ἦν'), ('ποτὲ', 'Ἀθηναίοις', 'ἦν'), ('ποτὲ', 'Ἀθηναίοις', 'ἀργύρου'), ('ποτὲ', 'ἦν', 'ἀργύρου'), ('Ἀθηναίοις', 'ἦν', 'ἀργύρου'), ('Ἀθηναίοις', 'ἦν', 'μέταλλα'), ('Ἀθηναίοις', 'ἀργύρου', 'μέταλλα'), ('ἦν', 'ἀργύρου', 'μέταλλα')]
+
 # perform part-of-speech tagging
 text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
 text.tag()
@@ -345,6 +446,7 @@ text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύ
 text.word_count(word='Ἀθηναίοις')
 1
 
+
 ```
 
 ---
@@ -370,3 +472,5 @@ text_folder.modify('path/to/latin/files-lemmatized', modify_function)
 # That's it, you will now have a folder full of lemmatized Latin files
 
 ```
+
+More Examples Coming Soon
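
Review note on the README skipgram examples: the `AncientGreekText` skipgrams listing in this diff is consistent with three-token skipgrams that may skip one token. The sketch below reproduces that pattern with the standard library only. The function name `skipgrams` and its parameters `n` and `k` are assumptions made for illustration, not dhelp's actual implementation or signature.

```python
from itertools import combinations


def skipgrams(tokens, n=3, k=1):
    """Return n-token skipgrams, allowing up to k skipped tokens (hypothetical helper)."""
    results = []
    for start in range(len(tokens) - n + 1):
        head = tokens[start]
        # The remaining n-1 tokens are chosen, in order, from the next n-1+k tokens.
        window = tokens[start + 1:start + n + k]
        for tail in combinations(window, n - 1):
            results.append((head,) + tail)
    return results


tokens = ['ἔνθα', 'ποτὲ', 'Ἀθηναίοις', 'ἦν', 'ἀργύρου', 'μέταλλα']
print(len(skipgrams(tokens)))  # → 10, the number of tuples in the README listing
print(skipgrams(tokens, k=0) == [tuple(tokens[i:i + 3]) for i in range(4)])  # → True
```

With `k=0` the function reduces to plain trigrams, which is a quick way to tell an n-gram listing from a skipgram listing.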
diff --git a/dhelp/text/_bases_mixins.py b/dhelp/text/_bases_mixins.py
@@ -85,8 +85,9 @@ def rm_lines(self):
     def rm_nonchars(self):
         """Removes non-language characters.
 
-        Gives a new version of the text with only latin characters remaining.
-        Is overriden by child objects for languages using non latinate chars.
+        Gives a new version of the text with only Latin characters remaining,
+        or Greek characters for Greek texts, and so on. Defaults to assuming
+        Latin-based text.
 
         Returns:
             :obj:`self.__class__` Returns new version of text, with non-letters removed
@@ -97,8 +98,12 @@ def rm_nonchars(self):
             >>> print(modified_text)
             'Lorem ipsum dolor sit amet...'
         """ # noqa
+        if self.options['language'] == 'greek':
+            valid_chars_pattern = '([ʹ-Ϋά-ϡἀ-ᾯᾰ-῾ ])'
+        else:
+            valid_chars_pattern = '([A-Za-z ])'
         return self.__class__(
-            "".join(re.findall("([A-Za-z ])", self.data)),
+            "".join(re.findall(valid_chars_pattern, self.data)),
             self.options
         )
 

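Review note on the `rm_nonchars` change: the hunk above selects a character-class pattern by language and keeps only the characters that match it. A minimal standalone sketch of that idea follows, with the two pattern strings copied from the diff. The names `VALID_CHARS` and `keep_valid_chars` are hypothetical, and the real method reads `self.options['language']` and returns a new text object rather than a plain string.

```python
import re

# Pattern strings copied from the diff; the dict and function names are hypothetical.
VALID_CHARS = {
    'greek': '([ʹ-Ϋά-ϡἀ-ᾯᾰ-῾ ])',
    'latin': '([A-Za-z ])',
}


def keep_valid_chars(data, language='latin'):
    """Keep only characters matched by the language's pattern (spaces included)."""
    # Any language without its own pattern falls back to the Latin one,
    # mirroring the else branch in the diff.
    pattern = VALID_CHARS.get(language, VALID_CHARS['latin'])
    return ''.join(re.findall(pattern, data))


print(keep_valid_chars('Ga3llia est omnis', 'latin'))  # → Gallia est omnis
print(keep_valid_chars('ἔν3θα ποτὲ', 'greek'))  # the digit is dropped
```

Because the Greek ranges are written as literal characters, the result can depend on how the input is Unicode-normalized, so normalizing the text first is probably the safer order of operations.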
diff --git a/issue_template.md b/issue_template.md
@@ -0,0 +1,17 @@
+## Expected Behavior
+
+
+## Actual Behavior
+
+
+## Steps to Reproduce the Problem
+
+  1.
+  1.
+  1.
+
+## Specifications
+
+  - Version:
+  - Platform:
+  - Subsystem: