TEA - Text Extraction Automation

The TEA is designed to automatically extract sections of texts from documents for downstream use. Given the potentially large size and quantity of documents it limits itself to rule-based and performant techniques, making this tools valuable when large-scale compute is not available. To this end it uses a combination of label-tailored rules, regular expressions, and keyword matching to approximate sections of interest. As with all rule-based methods, it works best on simpler inputs (e.g., structured files) and performance decreases the more complex the extraction is.

Presently it lacks negative reinforcement and as such can't handle files that don't have sections of interests -it will simply return a random slice.

Installation
Usage
Interface Requirements
Extending Instructions
Appendix

1. Installation

Clone the repository

git clone https://github.com/Sharjeeliv/GNN-Platform

Install Python3 if needed

https://www.python.org/downloads/
Developed on Version 3.12.1

Install Conda if needed

https://docs.anaconda.com/miniconda/miniconda-install/

Go to project folder

Install packages and activate environment

conda env create -f environment.yml
conda activate tea

Update params.py values as needed

2. Usage

Example usage where target files are at "/Users/john/Documents/texts", corresponding labels are at "/Users/john/Documents/labels", each label end with _Extracted, and the targets files end in .txt, .html, and .htm.

Below is the corresponding command for the above situation, ensure that the command is executed from the src folder.

python main.py --texts "/Users/john/Documents/texts" --labels "/Users/john/Documents/labels" --ext .txt .html .htm --exclude Extracted -s _Extracted

For a list of input arguments use:

python3 main.py --help

  -h, --help            show this help message and exit
  -t, --texts           path to directory containing files
  -e, --exclude         list of words to exclude files, e.g., label keyword
  -x, --ext             list of extensions allowed for files
  -l, --labels          path to directory containing labels
  -s, --label_suffix    label file suffix

  --analyze             flag which reruns analysis layer
  --log                 flag which prints log statements
  --test                flag which runs test functions, requires labels

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
tea		tea
.gitignore		.gitignore
app.py		app.py
environment.yml		environment.yml
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TEA - Text Extraction Automation

1. Installation

2. Usage

About

Releases

Packages

Languages

Sharjeeliv/text-extraction-automation

Folders and files

Latest commit

History

Repository files navigation

TEA - Text Extraction Automation

1. Installation

2. Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages