templates & code of conduct added, rm_nonchars fixed for greek texts
thePortus committed Mar 8, 2018
1 parent 6eed006 commit a4aa8f2
Showing 5 changed files with 254 additions and 38 deletions.
73 changes: 73 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,73 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
education, socio-economic status, nationality, personal appearance, race,
religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at [email protected]. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org
174 changes: 139 additions & 35 deletions README.md
@@ -4,7 +4,9 @@

---

[![PyPI version](https://badge.fury.io/py/dhelp.svg)](https://badge.fury.io/py/dhelp) [![Build Status](https://travis-ci.org/thePortus/dhelp.svg?branch=master)](https://travis-ci.org/thePortus/dhelp) [![Coverage Status](https://coveralls.io/repos/github/thePortus/dhelp/badge.svg?branch=master)](https://coveralls.io/github/thePortus/dhelp?branch=master) [![Documentation Status](https://readthedocs.org/projects/dhelp/badge/?version=latest)](http://dhelp.readthedocs.io/en/latest/?badge=latest) [![Code Health](https://landscape.io/github/thePortus/dhelp/master/landscape.svg?style=flat)](https://landscape.io/github/thePortus/dhelp/master) [![Waffle.io - Columns and their card count](https://badge.waffle.io/thePortus/dhelp.svg?columns=all)](https://waffle.io/thePortus/dhelp)
[![PyPI version](https://badge.fury.io/py/dhelp.svg)](https://badge.fury.io/py/dhelp)
![PyPI - License](https://img.shields.io/pypi/l/dhelp.svg)
[![Build Status](https://travis-ci.org/thePortus/dhelp.svg?branch=master)](https://travis-ci.org/thePortus/dhelp) [![Coverage Status](https://coveralls.io/repos/github/thePortus/dhelp/badge.svg?branch=master)](https://coveralls.io/github/thePortus/dhelp?branch=master) [![Documentation Status](https://readthedocs.org/projects/dhelp/badge/?version=latest)](http://dhelp.readthedocs.io/en/latest/?badge=latest) [![Code Health](https://landscape.io/github/thePortus/dhelp/master/landscape.svg?style=flat)](https://landscape.io/github/thePortus/dhelp/master) [![Total GitHub downloads](https://img.shields.io/github/downloads/thePortus/dhelp/total.svg)](https://img.shields.io/github/downloads/thePortus/dhelp/total.svg) [![Waffle.io - Columns and their card count](https://badge.waffle.io/thePortus/dhelp.svg?columns=all)](https://waffle.io/thePortus/dhelp)


---
@@ -32,6 +34,23 @@ Requires [Python 3.x](https://python.org)

---

# Table of Contents

* [Installation](#installation)
* [Quickstart Guide](#quickstart-guide)
* [Web Module](#web-module)
* [WebPage](#webpage)
* [File Module](#file-module)
* [TextFile](#textfile)
* [TextFolder](#textfolder)
* [CSVFile](#csvfile)
* [Text Module](#text-module)
* [EnglishText](#englishtext)
* [LatinText](#latintext)
* [AncientGreekText](#ancientgreektext)

---

# Installation

Install with pip (recommended)
@@ -125,11 +144,18 @@ that comes with many convenient cleaning/nlp methods attached. You can chain
any of the string transformation methods to perform many text operations at
once.
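
Chaining works when each transformation returns a new instance of the same class rather than a plain string. The sketch below illustrates that pattern only; `ChainableText` and its method bodies are hypothetical stand-ins, not dhelp's actual implementation.

```python
import re


class ChainableText(str):
    """Hypothetical stand-in for a dhelp text class (not the library's code)."""

    def rm_lines(self):
        # Drop endline characters and wrap the result in the same class
        return self.__class__(self.replace('\n', ''))

    def rm_nonchars(self):
        # Keep only ASCII letters and spaces
        return self.__class__("".join(re.findall('([A-Za-z ])', self)))

    def rm_spaces(self):
        # Collapse runs of whitespace into single spaces
        return self.__class__(re.sub(r'\s+', ' ', self).strip())


text = ChainableText('Th3e qui\nck \t brown fox')
cleaned = text.rm_lines().rm_nonchars().rm_spaces()
print(cleaned)
```

Because every method hands back the same type, any further transformation can be appended to the chain in any order.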

#### All Languages
### EnglishText

**Setup: Download the English Corpora**

Before you use this object for any of the methods below you need to download trainer corpora.

**All Languages Have These Methods**
```python
from dhelp import EnglishText
EnglishText('').setup()
```

Examples...
**Examples...**

```python

@@ -144,9 +170,9 @@ text.rm_nonchars()
'The quick brown fox jumped over the lazy dog'

# .rm_edits() - remove text between editorial marks
text = EnglishText('Th3e qui\\nck b rown fox jumped over the lazy dog')
text = EnglishText('The [quick] brown fox jumped over the lazy dog')
text.rm_edits()
'The quick brown fox jumped over the lazy dog'
'The brown fox jumped over the lazy dog'

# .rm_spaces() - collapses redundant whitespaces
text = EnglishText('Th3e qui\\nck b rown fox jumped over the lazy dog')
@@ -170,23 +196,6 @@ text = EnglishText('Th3e qui\\nck b rown fox jumped over the lazy dog')
text.rm_lines().rm_nonchars().rm_spaces()
'The quick brown fox jumped over the lazy dog'

```

#### English

**Setup: Download the English Corpora**

Before you use this object for any of the methods below you need to download trainer corpora.

```python
from dhelp import EnglishText
EnglishText('').setup()
```

Examples...

```python

# lemmatize a text to make word counts/analysis
text = EnglishText('The quick brown fox jumped over the lazy dog.')
text.lemmatize()
@@ -219,9 +228,7 @@ text.skipgrams()

```

#### Latin

**Note: Latin Classes inherit all methods from EnglishText**
#### LatinText

**Setup: Download the Latin Corpora**

@@ -234,8 +241,47 @@ LatinText('').setup()

```

**Examples...**

```python

# .rm_lines() - remove endline characters
text = LatinText('Gallia \\nest omnis divisa in partes tres')
text.rm_lines()
'Gallia est omnis divisa in partes tres'

# .rm_nonchars() - remove non-letters
text = LatinText('Ga3llia est omnis divisa in partes tres')
text.rm_nonchars()
'Gallia est omnis divisa in partes tres'

# .rm_edits() - remove text between editorial marks
text = LatinText('Gallia est [omnis] divisa in partes tres')
text.rm_edits()
'Gallia est divisa in partes tres'

# .rm_spaces() - collapses redundant whitespaces
text = LatinText('Gallia est omnis divisa in partes tres')
text.rm_spaces()
'Gallia est omnis divisa in partes tres'

# .re_search() - checks for a given pattern
text = LatinText('Gallia est omnis divisa in partes tres')
text.re_search('Gallia')
True
text.re_search('Graecia')
False

# .rm_stopwords() - removes a list of words from text
text = LatinText('Gallia est omnis divisa in partes tres')
text.rm_stopwords(['est', 'in'])
'Gallia omnis divisa partes tres'

# chain methods to perform them in one command
text = LatinText('Ga3llia \\nest omnis divisa in partes tres')
text.rm_lines().rm_nonchars().rm_spaces()
'Gallia est omnis divisa in partes tres'

# tokenize words into list of strings
text = LatinText('Gallia est omnis divisa in partes tres')
text.tokenize()
@@ -246,6 +292,21 @@ text = LatinText('Gallia est omnis divisa in partes tres')
text.lemmatize()
'gallia edo1 omne divido in pars tres'

# generate ngrams...
text = LatinText('They hated to think of sample sentences.')
text.ngrams()
[('They', 'hated', 'to'), ('hated', 'to', 'think'), ('to', 'think', 'of'), ('think', 'of', 'sample'), ('of', 'sample', 'sentences'), ('sample', 'sentences', '.')]

# ... or skipgrams
text = LatinText('Gallia est omnis divisa in partes tres')
text.skipgrams()
[('Gallia', 'est', 'omnis'), ('est', 'omnis', 'divisa'), ('omnis', 'divisa', 'in'), ('divisa', 'in', 'partes'), ('in', 'partes', 'tres')]

# count all words
text = LatinText('Gallia est omnis divisa in partes tres tres tres')
text.word_count(word='tres')
3

# scan text for meter
text = LatinText('Arma virumque cano, Troiae qui primus ab oris')
text.scansion()
@@ -272,20 +333,13 @@ text.compare_longest_common_substring('Galliae sunt omnis divisae in partes tres
'in partes tres'

# compare minhash's
LatinText('Gallia est omnis divisa in partes tres')
text = LatinText('Gallia est omnis divisa in partes tres')
text.compare_minhash('Galliae sunt omnis divisae in partes tres')
0.6444444444444445

# count all words
text = LatinText('Gallia est omnis divisa in partes tres tres tres')
text.word_count(word='tres')
3

```

#### Greek

**Note: Greek Classes inherit all methods from EnglishText**
#### AncientGreekText

**Setup: Download the Greek Corpora**

@@ -298,8 +352,47 @@ AncientGreekText('').setup()

```

**Examples...**

```python

# .rm_lines() - remove endline characters
text = AncientGreekText('ἔνθα \nποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.rm_lines()
'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'

# .rm_nonchars() - remove non-letters
text = AncientGreekText('ἔν3θα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.rm_nonchars()
'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'

# .rm_edits() - remove text between editorial marks
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.rm_edits()
'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'

# .rm_spaces() - collapses redundant whitespaces
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.rm_spaces()
'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'

# .re_search() - checks for a given pattern
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.re_search('Ἀθηναίοις')
True
text.re_search('σπαρτίοις')
False

# .rm_stopwords() - removes a list of words from text
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.rm_stopwords(['ποτὲ', 'ἀργύρου'])
'ἔνθα Ἀθηναίοις ἦν μέταλλα'

# chain methods to perform them in one command
text = AncientGreekText('ἔν3θα \nποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.rm_lines().rm_nonchars().rm_spaces()
'ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα'

# normalize character encoding differences
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.normalize()
@@ -320,6 +413,14 @@ text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύ
text.lemmatize()
'ἔνθα ποτὲ ἀθηναῖος εἰμί ἀργυρόω μέταλλον'

# generate ngrams...
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.ngrams()
[('ἔνθα', 'ποτὲ', 'Ἀθηναίοις'), ('ποτὲ', 'Ἀθηναίοις', 'ἦν'), ('Ἀθηναίοις', 'ἦν', 'ἀργύρου'), ('ἦν', 'ἀργύρου', 'μέταλλα')]

# ... or skipgrams
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.skipgrams()
[('ἔνθα', 'ποτὲ', 'Ἀθηναίοις'), ('ἔνθα', 'ποτὲ', 'ἦν'), ('ἔνθα', 'Ἀθηναίοις', 'ἦν'), ('ποτὲ', 'Ἀθηναίοις', 'ἦν'), ('ποτὲ', 'Ἀθηναίοις', 'ἀργύρου'), ('ποτὲ', 'ἦν', 'ἀργύρου'), ('Ἀθηναίοις', 'ἦν', 'ἀργύρου'), ('Ἀθηναίοις', 'ἦν', 'μέταλλα'), ('Ἀθηναίοις', 'ἀργύρου', 'μέταλλα'), ('ἦν', 'ἀργύρου', 'μέταλλα')]

# perform part-of-speech tagging
text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύρου μέταλλα')
text.tag()
@@ -345,6 +446,7 @@ text = AncientGreekText('ἔνθα ποτὲ Ἀθηναίοις ἦν ἀργύ
text.word_count(word='Ἀθηναίοις')
1


```
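
The Greek `.skipgrams()` output listed above appears consistent with three-token tuples that allow one skipped token, while `.ngrams()` yields contiguous three-token runs. The sketch below shows that difference in plain Python; the function names and the `n` and `k` parameters are assumptions for illustration and are not taken from dhelp.

```python
from itertools import combinations


def ngrams(tokens, n=3):
    """Contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n=3, k=1):
    """n-token tuples that keep word order but may skip up to k tokens."""
    grams = []
    for i, head in enumerate(tokens):
        # The remaining n-1 tokens are drawn from the next n-1+k positions
        window = tokens[i + 1:i + n + k]
        for tail in combinations(window, n - 1):
            grams.append((head,) + tail)
    return grams


tokens = 'Gallia est omnis divisa in partes tres'.split()
print(ngrams(tokens)[:2])
print(skipgrams(tokens)[:3])
```

Every ngram is also a skipgram, so the skipgram list is always at least as long as the ngram list for the same input.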

---
@@ -370,3 +472,5 @@ text_folder.modify('path/to/latin/files-lemmatized', modify_function)
# That's it, you will now have a folder full of lemmatized Latin files

```

More Examples Coming Soon
11 changes: 8 additions & 3 deletions dhelp/text/_bases_mixins.py
@@ -85,8 +85,9 @@ def rm_lines(self):
def rm_nonchars(self):
"""Removes non-language characters.
Gives a new version of the text with only latin characters remaining.
Is overriden by child objects for languages using non latinate chars.
Gives a new version of the text with only Latin characters remaining,
or Greek characters for Greek texts, and so on. Defaults to assuming a
Latin-based alphabet.
Returns:
:obj:`self.__class__` Returns new version of text, with non-letters removed
@@ -97,8 +98,12 @@ def rm_nonchars(self):
>>> print(modified_text)
'Lorem ipsum dolor sit amet...'
""" # noqa
if self.options['language'] == 'greek':
valid_chars_pattern = '([ʹ-Ϋά-ϡἀ-ᾯᾰ-῾ ])'
else:
valid_chars_pattern = '([A-Za-z ])'
return self.__class__(
"".join(re.findall("([A-Za-z ])", self.data)),
"".join(re.findall(valid_chars_pattern, self.data)),
self.options
)
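
The language switch in this hunk can be tried on its own, outside the class. The following is a minimal standalone sketch: the two character classes are copied from the diff, while `VALID_CHARS` and `strip_nonchars` are hypothetical names that do not exist in dhelp.

```python
import re

# Character classes copied from the diff; the dict and function names
# below are illustrative assumptions, not part of dhelp.
VALID_CHARS = {
    'greek': '([ʹ-Ϋά-ϡἀ-ᾯᾰ-῾ ])',
    'latin': '([A-Za-z ])',
}


def strip_nonchars(data, language='latin'):
    """Keep only the letters (and spaces) valid for the given language."""
    # Unknown languages fall back to the Latin pattern, as in the else branch
    pattern = VALID_CHARS.get(language, VALID_CHARS['latin'])
    return "".join(re.findall(pattern, data))


print(strip_nonchars('Lor3em ip!sum'))
print(strip_nonchars('λ3ογος', 'greek'))
```

A lookup table keeps the fallback to Latin explicit and leaves room for further alphabets without another `if` branch.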

17 changes: 17 additions & 0 deletions issue_template.md
@@ -0,0 +1,17 @@
## Expected Behavior


## Actual Behavior


## Steps to Reproduce the Problem

1.
1.
1.

## Specifications

- Version:
- Platform:
- Subsystem: