
Commit

Run end-of-file and trailing-whitespace fixer on all files
Lingepumpe committed Jun 28, 2022
1 parent ae1f07f commit 94f0da0
Showing 44 changed files with 410 additions and 442 deletions.
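For reference, a cleanup like this is what the standard `trailing-whitespace` and `end-of-file-fixer` hooks from the pre-commit-hooks project perform. A minimal `.pre-commit-config.yaml` sketch (the `rev` pin below is only illustrative, not necessarily what this repository used):

```yaml
# Sketch of a pre-commit config running the two fixers named in the commit title.
# Hook ids come from the pre-commit-hooks project; the rev pin is an example.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: trailing-whitespace   # strips trailing spaces from every line
      - id: end-of-file-fixer     # ensures each file ends with exactly one newline
```

With such a config, `pre-commit run --all-files` rewrites every tracked file, which matches the repository-wide scope of this commit.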
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
@@ -10,4 +10,4 @@ before_script: cd tests
script:
- pip freeze
- 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then pytest --runintegration; fi'
-- 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then pytest; fi'
+- 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then pytest; fi'
18 changes: 9 additions & 9 deletions CONTRIBUTING.md
@@ -1,24 +1,24 @@
# Contributing to Flair

-We are happy to accept your contributions to make `flair` better and more awesome! To avoid unnecessary work on either
+We are happy to accept your contributions to make `flair` better and more awesome! To avoid unnecessary work on either
side, please stick to the following process:

1. Check if there is already [an issue](https://github.com/zalandoresearch/flair/issues) for your concern.
2. If there is not, open a new one to start a discussion. We hate to close finished PRs!
-3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
+3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.

-In case you just want to help out and don't know where to start,
-[issues with "help wanted" label](https://github.com/zalandoresearch/flair/labels/help%20wanted) are good for
-first-time contributors.
+In case you just want to help out and don't know where to start,
+[issues with "help wanted" label](https://github.com/zalandoresearch/flair/labels/help%20wanted) are good for
+first-time contributors.


## Git Commit Guidelines

-If there is already a ticket, use this number at the start of your commit message.
+If there is already a ticket, use this number at the start of your commit message.
Use meaningful commit messages that describe what you did.

-**Example:** `GH-42: Added new type of embeddings: DocumentEmbedding.`
+**Example:** `GH-42: Added new type of embeddings: DocumentEmbedding.`


## Developing locally
@@ -62,7 +62,7 @@ To run integration tests execute:
pytest --runintegration
```
The integration tests will train small models and therefore take more time.
-In general, it is recommended to ensure that all basic tests pass before running the integration tests.
+In general, it is recommended to ensure that all basic tests pass before running the integration tests.

### code formatting

@@ -75,4 +75,4 @@ You can automatically format the code via `black --config pyproject.toml flair/

If you want to automatically format your code on every commit, you can use [pre-commit](https://pre-commit.com/).
Just install it via `pip install pre-commit` and execute `pre-commit install` in the root folder.
-This will add a hook to the repository, which reformats files on every commit.
+This will add a hook to the repository, which reformats files on every commit.
2 changes: 1 addition & 1 deletion LICENSE
@@ -6,4 +6,4 @@ Permission is hereby granted, free of charge, to any person obtaining a copy of

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

-THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
2 changes: 1 addition & 1 deletion MAINTAINERS
@@ -1,2 +1,2 @@
Alan Akbik <[email protected]>
-Tanja Bergmann <[email protected]>
+Tanja Bergmann <[email protected]>
6 changes: 3 additions & 3 deletions README.md
@@ -12,7 +12,7 @@ A very simple framework for **state-of-the-art NLP**. Developed by [Humboldt Uni
Flair is:

* **A powerful NLP library.** Flair allows you to apply our state-of-the-art natural language processing (NLP)
-models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS),
+models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS),
special support for [biomedical data](/resources/docs/HUNFLAIR.md),
sense disambiguation and classification, with support for a rapidly growing number of languages.

@@ -27,7 +27,7 @@ Now at [version 0.11](https://github.com/flairNLP/flair/releases)!

## State-of-the-Art Models

-Flair ships with state-of-the-art models for a range of NLP tasks. For instance, check out our latest NER models:
+Flair ships with state-of-the-art models for a range of NLP tasks. For instance, check out our latest NER models:

| Language | Dataset | Flair | Best published | Model card & demo
| --- | ----------- | ---------------- | ------------- | ------------- |
@@ -37,7 +37,7 @@ Flair ships with state-of-the-art models for a range of NLP tasks. For instance,
| Dutch | Conll-03 (4-class) | **95.25** | *93.7 [(Yu et al., 2020)](https://www.aclweb.org/anthology/2020.acl-main.577.pdf)* | [Flair Dutch 4-class NER demo](https://huggingface.co/flair/ner-dutch-large) |
| Spanish | Conll-03 (4-class) | **90.54** | *90.3 [(Yu et al., 2020)](https://www.aclweb.org/anthology/2020.acl-main.577.pdf)* | [Flair Spanish 4-class NER demo](https://huggingface.co/flair/ner-spanish-large) |

-**New:** Most Flair sequence tagging models (named entity recognition, part-of-speech tagging etc.) are now hosted
+**New:** Most Flair sequence tagging models (named entity recognition, part-of-speech tagging etc.) are now hosted
on the [__🤗 HuggingFace model hub__](https://huggingface.co/models?library=flair&sort=downloads)! You can browse models, check detailed information on how they were trained, and even try each model out online!


2 changes: 1 addition & 1 deletion SECURITY.md
@@ -1,5 +1,5 @@
We acknowledge that every line of code that we write may potentially contain security issues.
-We are trying to deal with it responsibly and provide patches as quickly as possible.
+We are trying to deal with it responsibly and provide patches as quickly as possible.

We host our bug bounty program on HackerOne. It is currently private, so if you would like to report a vulnerability and be rewarded for it, please ask to join our program by filling out this form:

38 changes: 19 additions & 19 deletions resources/docs/EXPERIMENTS.md
@@ -4,7 +4,7 @@ Here, we collect the best embedding configurations for each NLP task. If
you achieve better numbers, let us know which exact configuration of Flair
you used and we will add your experiment here!

-**Data.** For each experiment, you need to first get the evaluation dataset. Then execute the code as provided in this
+**Data.** For each experiment, you need to first get the evaluation dataset. Then execute the code as provided in this
documentation. Also check out the [tutorials](/resources/docs/TUTORIAL_1_BASICS.md) to get a better overview of
how Flair works.

@@ -17,7 +17,7 @@ how Flair works.

#### Data
The [CoNLL-03 data set for English](https://www.clips.uantwerpen.be/conll2003/ner/) is probably the most
-well-known dataset to evaluate NER on. It contains 4 entity classes. Follow the steps on the task website to
+well-known dataset to evaluate NER on. It contains 4 entity classes. Follow the steps on the task website to
get the dataset and place train, test and dev data in `/resources/tasks/conll_03/` as follows:

```
@@ -26,7 +26,7 @@ resources/tasks/conll_03/eng.testb
resources/tasks/conll_03/eng.train
```

-This allows the `CONLL_03()` corpus object to read the data into our data structures. Initialize the corpus as follows:
+This allows the `CONLL_03()` corpus object to read the data into our data structures. Initialize the corpus as follows:

```python
from flair.datasets import CONLL_03
@@ -37,7 +37,7 @@ This gives you a `Corpus` object that contains the data. Now, select `ner` as th

#### Best Known Configuration

-The full code to get a state-of-the-art model for English NER is as follows:
+The full code to get a state-of-the-art model for English NER is as follows:

```python
from flair.data import Corpus
@@ -83,7 +83,7 @@ from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
-train_with_dev=True,
+train_with_dev=True,
max_epochs=150)
```

@@ -146,7 +146,7 @@ from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
-train_with_dev=True,
+train_with_dev=True,
max_epochs=150)
```

@@ -248,7 +248,7 @@ from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
-train_with_dev=True,
+train_with_dev=True,
max_epochs=150)
```

@@ -262,8 +262,8 @@ trainer.train('resources/taggers/example-ner',
#### Data

The [Ontonotes corpus](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf) is one of the best resources
-for different types of NLP and contains rich NER annotation. Get the corpus and split it into train, test and dev
-splits using the scripts provided by the [CoNLL-12 shared task](http://conll.cemantix.org/2012/data.html).
+for different types of NLP and contains rich NER annotation. Get the corpus and split it into train, test and dev
+splits using the scripts provided by the [CoNLL-12 shared task](http://conll.cemantix.org/2012/data.html).

Place train, test and dev data in CoNLL-03 format in `resources/tasks/onto-ner/` as follows:

@@ -275,8 +275,8 @@ resources/tasks/onto-ner/eng.train

#### Best Known Configuration

-Once you have the data, reproduce our experiments exactly like for CoNLL-03, just with a different dataset and with
-FastText embeddings (they work better on this dataset). You also need to provide a `column_format` for the `ColumnCorpus` object indicating which column in the training file is the 'ner' information. The full code then is as follows:
+Once you have the data, reproduce our experiments exactly like for CoNLL-03, just with a different dataset and with
+FastText embeddings (they work better on this dataset). You also need to provide a `column_format` for the `ColumnCorpus` object indicating which column in the training file is the 'ner' information. The full code then is as follows:

```python
from flair.data import Corpus
@@ -318,7 +318,7 @@ trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
learning_rate=0.1,
-train_with_dev=True,
+train_with_dev=True,
# it's a big dataset so maybe set embeddings_storage_mode to 'none' (embeddings are not kept in memory)
embeddings_storage_mode='none')
```
@@ -335,16 +335,16 @@ trainer.train('resources/taggers/example-ner',

Get the [Penn treebank](https://catalog.ldc.upenn.edu/ldc99t42) and follow the guidelines
in [Collins (2002)](http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf) to produce train, dev and test splits.
-Convert the splits into CoNLL-U format and place train, test and dev data in `/path/to/penn/` as follows:
+Convert the splits into CoNLL-U format and place train, test and dev data in `/path/to/penn/` as follows:

```
/path/to/penn/test.conll
/path/to/penn/train.conll
/path/to/penn/valid.conll
```

-Then, run the experiments with extvec embeddings and contextual string embeddings. Also, select 'pos' as `tag_type`,
-so the algorithm knows that POS tags and not NER are to be predicted from this data.
+Then, run the experiments with extvec embeddings and contextual string embeddings. Also, select 'pos' as `tag_type`,
+so the algorithm knows that POS tags and not NER are to be predicted from this data.

#### Best Known Configuration

@@ -385,7 +385,7 @@ from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-pos',
-train_with_dev=True,
+train_with_dev=True,
max_epochs=150)
```

@@ -400,8 +400,8 @@ Data is included in Flair and will get automatically downloaded when you run the


#### Best Known Configuration
-Run the code with extvec embeddings and our proposed contextual string embeddings. Use 'np' as `tag_type`,
-so the algorithm knows that chunking tags and not NER are to be predicted from this data.
+Run the code with extvec embeddings and our proposed contextual string embeddings. Use 'np' as `tag_type`,
+so the algorithm knows that chunking tags and not NER are to be predicted from this data.

```python
from flair.data import Corpus
@@ -441,6 +441,6 @@ from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-chunk',
-train_with_dev=True,
+train_with_dev=True,
max_epochs=150)
```
42 changes: 21 additions & 21 deletions resources/docs/HUNFLAIR.md
@@ -1,37 +1,37 @@
# HunFlair

-*HunFlair* is a state-of-the-art NER tagger for biomedical texts. It comes with
-models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair*
-builds on pretrained domain-specific language models and outperforms other biomedical
-NER tools on unseen corpora. Furthermore, it contains harmonized versions of [31 biomedical
+*HunFlair* is a state-of-the-art NER tagger for biomedical texts. It comes with
+models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair*
+builds on pretrained domain-specific language models and outperforms other biomedical
+NER tools on unseen corpora. Furthermore, it contains harmonized versions of [31 biomedical
NER data sets](HUNFLAIR_CORPORA.md) and comes with a Flair language model ("pubmed-X") and
FastText embeddings ("pubmed") that were trained on roughly 3 million full texts and about
25 million abstracts from the biomedical domain.

-<b>Content:</b>
-[Quick Start](#quick-start) |
+<b>Content:</b>
+[Quick Start](#quick-start) |
[BioNER-Tool Comparison](#comparison-to-other-biomedical-ner-tools) |
-[Tutorials](#tutorials) |
-[Citing HunFlair](#citing-hunflair)
+[Tutorials](#tutorials) |
+[Citing HunFlair](#citing-hunflair)

## Quick Start

#### Requirements and Installation
-*HunFlair* is based on Flair 0.6+ and Python 3.6+.
+*HunFlair* is based on Flair 0.6+ and Python 3.6+.
If you do not have Python 3.6, install it first. [Here is how for Ubuntu 16.04](https://vsupalov.com/developing-with-python3-6-on-ubuntu-16-04/).
Then, in your favorite virtual environment, simply do:
```
pip install flair
```
-Furthermore, we recommend installing [SciSpaCy](https://allenai.github.io/scispacy/) for improved pre-processing
+Furthermore, we recommend installing [SciSpaCy](https://allenai.github.io/scispacy/) for improved pre-processing
and tokenization of scientific / biomedical texts:
```
pip install scispacy==0.2.5
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
```

#### Example Usage
-Let's run named entity recognition (NER) over an example sentence. All you need to do is
+Let's run named entity recognition (NER) over an example sentence. All you need to do is
make a Sentence, load a pre-trained model and use it to predict tags for the sentence:
```python
from flair.data import Sentence
@@ -63,15 +63,15 @@ Span[6:7]: "Mouse" → Species (0.9979)
~~~

## Comparison to other biomedical NER tools
-Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets.
-However, they are applied "in the wild" to a much larger collection of texts, often varying in
-topic, entity distribution, genre (e.g. patents vs. scientific articles) and text type (e.g. abstract
+Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets.
+However, they are applied "in the wild" to a much larger collection of texts, often varying in
+topic, entity distribution, genre (e.g. patents vs. scientific articles) and text type (e.g. abstract
vs. full text), which can lead to severe drops in performance.

*HunFlair* outperforms other biomedical NER tools on corpora that were used to train neither *HunFlair*
nor any of the competitor tools.

-| Corpus | Entity Type | Misc<sup><sub>[1](#f1)</sub></sup> | SciSpaCy | HUNER | HunFlair |
+| Corpus | Entity Type | Misc<sup><sub>[1](#f1)</sub></sup> | SciSpaCy | HUNER | HunFlair |
| --- | --- | --- | --- | --- | --- |
| [CRAFT v4.0](https://github.com/UCDenver-ccp/CRAFT) | Chemical | 42.88 | 35.73 | 42.99 | *__59.83__* |
| | Gene/Protein | 64.93 | 47.76 | 50.77 | *__73.51__* |
@@ -82,16 +82,16 @@ or any of the competitor tools.
| | Species | *__80.53__* | 57.11 | 67.84 | 76.41 |
| [Plant-Disease](http://gcancer.org/pdr/) | Species | 80.63 | 75.90 | 73.64 | *__83.44__* |

-<sub>All results are F1 scores using partial matching of predicted text offsets with the original char offsets
+<sub>All results are F1 scores using partial matching of predicted text offsets with the original char offsets
of the gold standard data. We allow a shift by max one character.</sub>

-<sub><a name="f1">1</a>: Misc displays the results of multiple taggers:
-[tmChem](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/) for Chemical,
-[GNormPus](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/) for Gene and Species, and
+<sub><a name="f1">1</a>: Misc displays the results of multiple taggers:
+[tmChem](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/) for Chemical,
+[GNormPus](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/) for Gene and Species, and
[DNorm](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/DNorm.html) for Disease
</sub>

-Here's how to [reproduce these numbers](HUNFLAIR_EXPERIMENTS.md) using Flair.
+Here's how to [reproduce these numbers](HUNFLAIR_EXPERIMENTS.md) using Flair.
You can find detailed evaluations and discussions in [our paper](https://arxiv.org/abs/2008.07347).

## Tutorials
