Merge pull request #644 from snipsco/release/0.16.2
Release 0.16.2
adrienball authored Aug 8, 2018
2 parents 6a851c7 + 9c02f76 commit ec72eee
Showing 25 changed files with 269 additions and 93 deletions.
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,20 @@
# Changelog
All notable changes to this project will be documented in this file.

## [0.16.2] - 2018-08-08
### Added
- `automatically_extensible` flag in dataset generation tool
- System requirements
- Reference to chatito tool in documentation

### Changed
- Bump `snips-nlu-ontology` to `0.57.3`
- Versions of dependencies are now defined more loosely

### Fixed
- Issue with synonyms mapping
- Issue with `snips-nlu download-all-languages` CLI command

## [0.16.1] - 2018-07-23
### Added
- Every processing unit can be persisted into (and loaded from) a `bytearray`
@@ -113,6 +127,7 @@ several commands.
- Fix compiling issue with `bindgen` dependency when installing from source
- Fix issue in `CRFSlotFiller` when handling builtin entities

[0.16.2]: https://github.com/snipsco/snips-nlu/compare/0.16.1...0.16.2
[0.16.1]: https://github.com/snipsco/snips-nlu/compare/0.16.0...0.16.1
[0.16.0]: https://github.com/snipsco/snips-nlu/compare/0.15.1...0.16.0
[0.15.1]: https://github.com/snipsco/snips-nlu/compare/0.15.0...0.15.1
1 change: 1 addition & 0 deletions CONTRIBUTORS.rst
@@ -5,3 +5,4 @@ This is a list of everyone who has made significant contributions to Snips NLU,

* `Alice Coucke <https://github.com/choufractal>`_
* `Josh Meyer <https://github.com/JRMeyer>`_
* `Matthieu Brouillard <https://github.com/McFoggy>`_
7 changes: 7 additions & 0 deletions README.rst
@@ -24,6 +24,13 @@ Snips NLU

Check out our `blog post`_ to get more details about why we built Snips NLU and how it works under the hood.

System requirements
-------------------
- 64-bit Linux, macOS >= 10.11, or 64-bit Windows
- Python 2.7 or Python >= 3.4
- RAM: typically between 100MB and 200MB, depending on the language and the size of the dataset


Installation
------------

7 changes: 7 additions & 0 deletions docs/source/installation.rst
@@ -3,6 +3,13 @@
Installation
============

System requirements
-------------------
- 64-bit Linux, macOS >= 10.11, or 64-bit Windows
- Python 2.7 or Python >= 3.4
- RAM: typically between 100MB and 200MB, depending on the language and the size of the dataset


Python Version
--------------

17 changes: 14 additions & 3 deletions docs/source/tutorial.rst
@@ -29,9 +29,10 @@ parse as well as easy to read.
We created a `sample dataset`_ that you can check to better understand the
format.

You have two options to create your dataset. You can build it manually by
respecting the format used in the sample or alternatively you can use the
dataset creation CLI that is contained in the lib.
You have three options to create your dataset: you can build it manually,
respecting the format used in the sample, you can use the dataset creation
CLI included in the lib, or you can use `chatito`_, a DSL tool for dataset
generation.

We will go for the second option here and start by creating three files
corresponding to our three intents and one entity file corresponding to the
@@ -102,6 +103,15 @@ double quotes ``"``. If the value contains double quotes, it must be doubled
to be escaped like this: ``"A value with a "","" in it"`` which corresponds
to the actual value ``A value with a "," in it``.

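The doubled-quote escaping described above is standard CSV quoting, so it can
be sanity-checked with Python's built-in ``csv`` module. A minimal sketch (the
sample string is hypothetical, not part of this PR):

.. code-block:: python

    import csv
    import io

    raw = '"A value with a "","" in it"\n'
    row = next(csv.reader(io.StringIO(raw)))
    print(row)  # ['A value with a "," in it']
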
.. Note::

    By default, entities are generated as :ref:`automatically extensible <auto_extensible>`, i.e. recognition will accept values other than the ones listed in the entity file.
    This behavior can be changed by adding the following line at the beginning of the entity file:

    .. code-block:: bash

        # automatically_extensible=false

We are now ready to generate our dataset:

.. code-block:: bash
@@ -364,3 +374,4 @@ Alternatively, you can persist/load the engine as a ``bytearray``:
.. _sample dataset: https://github.com/snipsco/snips-nlu/blob/master/snips_nlu_samples/sample_dataset.json
.. _default configurations: https://github.com/snipsco/snips-nlu/blob/master/snips_nlu/default_configs
.. _english one: https://github.com/snipsco/snips-nlu/blob/master/snips_nlu/default_configs/config_en.py
.. _chatito: https://github.com/rodrigopivi/Chatito
44 changes: 19 additions & 25 deletions setup.py
@@ -13,44 +13,38 @@
    about = dict()
    exec(f.read(), about)


with io.open(os.path.join(root, "README.rst"), encoding="utf8") as f:
    readme = f.read()

nlu_metrics_version = "0.12.0"

required = [
    "enum34==1.1.6",
    "pathlib==1.0.1",
    "enum34>=1.1,<2.0",
    "numpy==1.14.0",
    "scipy==1.0.0",
    "scikit-learn==0.19.1",
    "sklearn-crfsuite==0.3.6",
    "semantic_version==2.6.0",
    "snips_nlu_utils==0.6.1",
    "snips_nlu_ontology==0.57.2",
    "num2words==0.5.6",
    "plac==0.9.6",
    "requests==2.18.4"
    "scipy>=1.0,<2.0",
    "scikit-learn>=0.19,<0.20",
    "sklearn-crfsuite>=0.3.6,<0.4",
    "semantic_version>=2.6,<3.0",
    "snips_nlu_utils>=0.6.1,<0.7",
    "snips_nlu_ontology==0.57.3",
    "num2words>=0.5.6,<0.6",
    "plac>=0.9.6,<1.0",
    "requests>=2.0,<3.0",
    "pathlib==1.0.1; python_version < '3.4'",
]

extras_require = {
    "doc": [
        "sphinx==1.7.1",
        "sphinxcontrib-napoleon==0.6.1",
        "sphinx-rtd-theme==0.2.4"
        "sphinx>=1.7,<2.0",
        "sphinxcontrib-napoleon>=0.6.1,<0.7",
        "sphinx-rtd-theme>=0.2.4,<0.3"
    ],
    "metrics": [
        "snips_nlu_metrics==%s" % nlu_metrics_version,
        "snips_nlu_metrics>=0.13,<0.14",
    ],
    "test": [
        "mock==2.0.0",
        "snips_nlu_metrics==%s" % nlu_metrics_version,
        "pylint==1.8.2",
        "coverage==4.4.2"
    ],
    "integration_test": [
        "snips_nlu_metrics==%s" % nlu_metrics_version,
        "mock>=2.0,<3.0",
        "snips_nlu_metrics>=0.13,<0.14",
        "pylint>=1.8,<2.0",
        "coverage>=4.4.2,<5.0"
    ]
}

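The new ``pathlib`` requirement uses a PEP 508 environment marker, so the
backport is only installed on interpreters that predate the standard-library
``pathlib`` (Python < 3.4). A minimal sketch of how such a marker evaluates,
assuming the ``packaging`` library is available (it usually ships alongside
setuptools):

    from packaging.markers import Marker

    # Marker string copied from the requirement above
    marker = Marker("python_version < '3.4'")
    # True on Python 2.7, False on Python >= 3.4, so pip skips the backport
    print(marker.evaluate())
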
2 changes: 1 addition & 1 deletion snips_nlu/__about__.py
@@ -11,7 +11,7 @@
__email__ = "[email protected], [email protected]"
__license__ = "Apache License, Version 2.0"

__version__ = "0.16.1"
__version__ = "0.16.2"
__model_version__ = "0.16.0"

__download_url__ = "https://github.com/snipsco/snips-nlu-language-resources/releases/download"
12 changes: 10 additions & 2 deletions snips_nlu/cli/dataset/entities.py
Expand Up @@ -2,6 +2,7 @@
from __future__ import unicode_literals

import csv
import re
from abc import ABCMeta, abstractmethod
from pathlib import Path

@@ -12,6 +13,7 @@
from snips_nlu.constants import (
    VALUE, SYNONYMS, AUTOMATICALLY_EXTENSIBLE, USE_SYNONYMS, DATA)

AUTO_EXT_REGEX = re.compile(r'^#\sautomatically_extensible=(true|false)\s*$')

class Entity(with_metaclass(ABCMeta, object)):
    def __init__(self, name):
@@ -56,17 +58,23 @@ def from_file(cls, filepath):
        if six.PY2:
            it = list(utf_8_encoder(it))
        reader = csv.reader(list(it))
        autoextent = True
        for row in reader:
            if six.PY2:
                row = [cell.decode("utf-8") for cell in row]
            value = row[0]
            if reader.line_num == 1:
                m = AUTO_EXT_REGEX.match(row[0])
                if m:
                    autoextent = not m.group(1).lower() == 'false'
                    continue
            if len(row) > 1:
                synonyms = row[1:]
            else:
                synonyms = []
            utterances.append(EntityUtterance(value, synonyms))
        return cls(entity_name, utterances, automatically_extensible=True,
                   use_synonyms=True)
        return cls(entity_name, utterances,
                   automatically_extensible=autoextent, use_synonyms=True)

    @property
    def json(self):
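A standalone sketch of the header handling added above (the regex is copied
from the diff; the file content mirrors the sample entity file added in this
PR):

    import re

    AUTO_EXT_REGEX = re.compile(
        r'^#\sautomatically_extensible=(true|false)\s*$')

    lines = ["# automatically_extensible=false",
             "new york,big apple",
             "paris,city of lights"]

    autoextent = True
    m = AUTO_EXT_REGEX.match(lines[0])
    if m:
        # The header row only toggles the flag; it is not stored as a value
        autoextent = not m.group(1).lower() == 'false'
        lines = lines[1:]
    print(autoextent)  # False
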
@@ -0,0 +1,4 @@
# automatically_extensible=false
new york,big apple
paris,city of lights
london
4 changes: 2 additions & 2 deletions snips_nlu/cli/download.py
Expand Up @@ -70,7 +70,7 @@ def download(resource_name, direct=False,
def download_all_languages(*pip_args):
"""Download compatible resources for all supported languages"""
for language in get_all_languages():
download(language, *pip_args)
download(language, False, *pip_args)


def _get_compatibility():
@@ -106,7 +106,7 @@ def _get_installed_languages():
    for directory in DATA_PATH.iterdir():
        if not directory.is_dir():
            continue
        with (directory / "metadata.json").open() as f:
        with (directory / "metadata.json").open(encoding="utf8") as f:
            metadata = json.load(f)
        languages.add(metadata["language"])
    return languages
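The one-line change in ``download_all_languages`` fixes a positional-argument
bug: because ``download`` declares ``direct=False`` ahead of ``*pip_args``,
the first pip argument used to be captured as ``direct`` instead of being
forwarded. A standalone sketch of the binding (``--no-cache-dir`` is just an
illustrative pip argument):

    def download(resource_name, direct=False, *pip_args):
        return resource_name, direct, pip_args

    # Before the fix: the flag is swallowed by `direct` and never reaches pip
    print(download("en", "--no-cache-dir"))
    # -> ('en', '--no-cache-dir', ())

    # After the fix: `direct` is passed explicitly, pip args are forwarded
    print(download("en", False, "--no-cache-dir"))
    # -> ('en', False, ('--no-cache-dir',))
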
2 changes: 1 addition & 1 deletion snips_nlu/cli/generate_dataset.py
@@ -1,4 +1,4 @@
from __future__ import unicode_literals
from __future__ import print_function, unicode_literals

import json

2 changes: 1 addition & 1 deletion snips_nlu/cli/inference.py
@@ -1,4 +1,4 @@
from __future__ import unicode_literals
from __future__ import unicode_literals, print_function

import json
from builtins import input
2 changes: 1 addition & 1 deletion snips_nlu/cli/training.py
@@ -1,4 +1,4 @@
from __future__ import unicode_literals
from __future__ import unicode_literals, print_function

import json
from pathlib import Path
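All three CLI modules above gain ``print_function`` in their ``__future__``
imports, which makes ``print`` a function on Python 2 as well. A one-line
illustration (hypothetical usage, not taken from the diff):

    from __future__ import print_function  # no-op on Python 3

    # Keyword arguments such as `end` and `file` now work on Python 2 too
    print("Training complete", end="\n")
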
94 changes: 64 additions & 30 deletions snips_nlu/dataset.py
@@ -1,9 +1,10 @@
from __future__ import division, unicode_literals

import json
from builtins import str
from collections import Counter
from copy import deepcopy

from builtins import str
from future.utils import iteritems, itervalues
from snips_nlu_ontology import get_all_languages

@@ -97,16 +98,23 @@ def has_any_capitalization(entity_utterances, language):
    return False


def add_variation_if_needed(utterances, variation, utterance, language):
    if not variation:
        return utterances
    all_variations = get_string_variations(variation, language)
    for v in all_variations:
        if v not in utterances:
            utterances[v] = utterance
def add_entity_variations(utterances, entity_variations, entity_value):
    utterances[entity_value] = entity_value
    for variation in entity_variations[entity_value]:
        if variation:
            utterances[variation] = entity_value
    return utterances


def _extract_entity_values(entity):
    values = set()
    for ent in entity[DATA]:
        values.add(ent[VALUE])
        if entity[USE_SYNONYMS]:
            values.update(set(ent[SYNONYMS]))
    return values


def validate_and_format_custom_entity(entity, queries_entities, language):
    validate_type(entity, dict)
    mandatory_keys = [USE_SYNONYMS, AUTOMATICALLY_EXTENSIBLE, DATA]
@@ -139,33 +147,59 @@ def validate_and_format_custom_entity(entity, queries_entities, language):
    formatted_entity[CAPITALIZE] = has_any_capitalization(queries_entities,
                                                          language)

    # Normalize
    validated_data = dict()
    for entry in entity[DATA]:
        entry_value = entry[VALUE]
        validated_data = add_variation_if_needed(
            validated_data, entry_value, entry_value, language)

    validated_utterances = dict()
    # Map original values and synonyms
    for data in entity[DATA]:
        ent_value = data[VALUE]
        if not ent_value:
            continue
        validated_utterances[ent_value] = ent_value
        if use_synonyms:
            for s in entry[SYNONYMS]:
                validated_data = add_variation_if_needed(
                    validated_data, s, entry_value, language)

    formatted_entity[UTTERANCES] = validated_data
    # Merge queries_entities
    for value in queries_entities:
        formatted_entity = add_entity_value_if_missing(
            value, formatted_entity, language)
            for s in data[SYNONYMS]:
                if s and s not in validated_utterances:
                    validated_utterances[s] = ent_value

    # Add variations if not colliding
    all_original_values = _extract_entity_values(entity)
    variations = dict()
    for data in entity[DATA]:
        ent_value = data[VALUE]
        values_to_variate = {ent_value}
        if use_synonyms:
            values_to_variate.update(set(data[SYNONYMS]))
        variations[ent_value] = set(
            v for value in values_to_variate
            for v in get_string_variations(value, language))
    variation_counter = Counter(
        [v for vars in itervalues(variations) for v in vars])
    non_colliding_variations = {
        value: [
            v for v in variations if
            v not in all_original_values and variation_counter[v] == 1
        ]
        for value, variations in iteritems(variations)
    }

    for entry in entity[DATA]:
        entry_value = entry[VALUE]
        validated_utterances = add_entity_variations(
            validated_utterances, non_colliding_variations, entry_value)

    # Merge queries entities
    queries_entities_variations = {
        ent: get_string_variations(ent, language) for ent in queries_entities
    }
    for original_ent, variations in iteritems(queries_entities_variations):
        if not original_ent or original_ent in validated_utterances:
            continue
        validated_utterances[original_ent] = original_ent
        for variation in variations:
            if variation and variation not in validated_utterances:
                validated_utterances[variation] = original_ent
    formatted_entity[UTTERANCES] = validated_utterances
    return formatted_entity


def validate_and_format_builtin_entity(entity, queries_entities):
    validate_type(entity, dict)
    return {UTTERANCES: set(queries_entities)}


def add_entity_value_if_missing(value, entity, language):
    entity[UTTERANCES] = add_variation_if_needed(entity[UTTERANCES], value,
                                                 value, language)
    return entity
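The collision filtering above is easier to follow in isolation. A standalone
sketch of the same ``Counter`` idea with made-up variation strings
(``get_string_variations`` is language-dependent, so these values are purely
illustrative):

    from collections import Counter

    variations = {
        "new york": {"New York", "NY"},
        "new york city": {"New York", "NYC"},  # "New York" collides
    }
    all_original_values = {"new york", "new york city"}

    variation_counter = Counter(
        v for vs in variations.values() for v in vs)
    non_colliding_variations = {
        value: [v for v in vs
                if v not in all_original_values and variation_counter[v] == 1]
        for value, vs in variations.items()
    }
    # Only variations unique to a single entity value are kept:
    print(non_colliding_variations)
    # -> {'new york': ['NY'], 'new york city': ['NYC']}
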
2 changes: 1 addition & 1 deletion snips_nlu/intent_classifier/log_reg_classifier.py
@@ -185,7 +185,7 @@ def from_path(cls, path):
            raise OSError("Missing intent classifier model file: %s"
                          % model_path.name)

        with model_path.open() as f:
        with model_path.open(encoding="utf8") as f:
            model_dict = json.load(f)
        return cls.from_dict(model_dict)

2 changes: 1 addition & 1 deletion snips_nlu/intent_parser/deterministic_intent_parser.py
@@ -206,7 +206,7 @@ def from_path(cls, path):
            raise OSError("Missing deterministic intent parser metadata file: "
                          "%s" % metadata_path.name)

        with metadata_path.open() as f:
        with metadata_path.open(encoding="utf8") as f:
            metadata = json.load(f)
        return cls.from_dict(metadata)

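Both ``from_path`` fixes pass an explicit encoding because ``Path.open()``
otherwise falls back to the locale's preferred encoding on Python 3, which can
break on non-ASCII model files under e.g. a C/POSIX locale. A small sketch
(file name and content are hypothetical):

    import json
    from pathlib import Path

    path = Path("intent_classifier.json")
    path.write_text(json.dumps({"intent": "café"}, ensure_ascii=False),
                    encoding="utf8")

    # An explicit encoding makes loading independent of the system locale
    with path.open(encoding="utf8") as f:
        model_dict = json.load(f)
    print(model_dict)  # {'intent': 'café'}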
