feature: interpret metadata entries with `None`/`nan` as non-existent instead of returning an error (#4477)

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

This PR resolves problems when working with `None` and `nan` metadata property values.

- `nan` values are ignored when detected on the server side.
- `None` values reset the metadata property value, so if users want to reset the `prop` metadata property, they can set it to `None`.

## UPDATE

New changes have been introduced to improve API behaviour:

- `nan` values will raise a 422 error, which makes sense since `nan` is not a valid value.
- `None` values are accepted as valid values, but the behaviour is equivalent to not passing a value for that property, since the whole metadata payload is replaced/overwritten on a record update.

That said, the client will parse metadata values to detect `nan` and raise a proper error. `None` values are fully accepted.
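
Below is a minimal sketch of the resulting behaviour from the Python SDK (adapted from the integration tests added in this PR; the dataset, workspace, field, and metadata property names are illustrative):

```python
import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="owner.apikey")  # adjust to your deployment

# Assumes a remote dataset whose settings define a FloatMetadataProperty named "float-metadata"
dataset = rg.FeedbackDataset.from_argilla(name="my-dataset", workspace="my-workspace")

# `None` is accepted and behaves as if no value had been provided for that property
dataset.add_records(
    rg.FeedbackRecord(fields={"text": "Hello world!"}, metadata={"float-metadata": None})
)

# `nan` is rejected by the client-side schema validation before any request is sent
try:
    dataset.add_records(
        rg.FeedbackRecord(fields={"text": "Hello world!"}, metadata={"float-metadata": float("nan")})
    )
except ValueError as e:
    print(e)  # "... NaN values are not allowed."
```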

cc @davidberenstein1957 @sdiazlor @jfcalvo 

Closes #4300 

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [ ] New feature (non-breaking change which adds functionality)
- [X] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

This feature has been tested locally using the Python SDK.
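
For reference, a condensed sketch of the checks exercised (adapted from the integration tests added in `tests/integration/client/feedback/dataset/remote/test_dataset.py`; the `remote_dataset` fixture is a hypothetical stand-in for a pushed dataset with `terms-metadata`, `integer-metadata`, and `float-metadata` properties):

```python
import pytest
from argilla import FeedbackRecord


def test_nan_metadata_is_rejected(remote_dataset):
    # Records carrying NaN metadata are refused with a validation error
    record = FeedbackRecord(fields={"text": "Hello world!"}, metadata={"float-metadata": float("nan")})
    with pytest.raises(ValueError, match="NaN values are not allowed"):
        remote_dataset.add_records(record)


def test_none_metadata_is_accepted(remote_dataset):
    # `None` values pass validation and the records are stored normally
    record = FeedbackRecord(
        fields={"text": "Hello world!"},
        metadata={"terms-metadata": None, "integer-metadata": None, "float-metadata": None},
    )
    remote_dataset.add_records(record)
    assert len(remote_dataset.records) == 1
```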

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [X] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: José Francisco Calvo <[email protected]>
3 people authored Jan 12, 2024
1 parent 4290369 commit c95326a
Showing 11 changed files with 278 additions and 80 deletions.
12 changes: 6 additions & 6 deletions CHANGELOG.md
@@ -19,23 +19,24 @@ These are the section headers that we use:
### Added

- Restore filters from feedback dataset settings ([#4461])(https://github.com/argilla-io/argilla/pull/4461)
- Warning on feedback dataset settings when leaving page with unsaved changes ([#4461])(https://github.com/argilla-io/argilla/pull/4461)
- Added pydantic v2 support using the python SDK ([#4459](https://github.com/argilla-io/argilla/pull/4459))
- Warning on feedback dataset settings when leaving page with unsaved changes. ([#4461](https://github.com/argilla-io/argilla/pull/4461))
- Added pydantic v2 support using the python SDK. ([#4459](https://github.com/argilla-io/argilla/pull/4459))

## Changed
### Changed

- Module `argilla.cli.server` definitions have been moved to `argilla.server.cli` module. ([#4472](https://github.com/argilla-io/argilla/pull/4472))
- The constant definition `ES_INDEX_REGEX_PATTERN` in module `argilla._constants` is now private. ([#4472](https://github.com/argilla-io/argilla/pull/4474))
- `nan` values in metadata properties will raise a 422 error when creating/updating records. ([#4300](https://github.com/argilla-io/argilla/issues/4300))
- `None` values are now allowed in metadata properties. ([#4300](https://github.com/argilla-io/argilla/issues/4300))

### Deprecated

- The `missing` response status for filtering records is deprecated and will be removed in the release v1.24.0. Use `pending` instead. ([#4433](https://github.com/argilla-io/argilla/pull/4433))

## Removed
### Removed

- The deprecated `python -m argilla database` command has been removed. ([#4472](https://github.com/argilla-io/argilla/pull/4472))


## [1.21.0](https://github.com/argilla-io/argilla/compare/v1.20.0...v1.21.0)

### Added
@@ -50,7 +51,6 @@ These are the section headers that we use:
- Added `httpx_extra_kwargs` argument to `rg.init` and `Argilla` to allow passing extra arguments to `httpx.Client` used by `Argilla`. ([#4440](https://github.com/argilla-io/argilla/pull/4441))
- Added `ResponseStatusFilter` enum in `__init__` imports of Argilla ([#4118](https://github.com/argilla-io/argilla/pull/4463)). Contributed by @Piyush-Kumar-Ghosh.


### Changed

- More productive and simpler shortcuts system ([#4215](https://github.com/argilla-io/argilla/pull/4215))
13 changes: 0 additions & 13 deletions src/argilla/client/feedback/constants.py
@@ -21,16 +21,3 @@
DELETE_DATASET_RECORDS_MAX_NUMBER = 100

FIELD_TYPE_TO_PYTHON_TYPE = {FieldTypes.text: str}
# We are using `pydantic`'s strict types to avoid implicit type conversions
METADATA_PROPERTY_TYPE_TO_PYDANTIC_TYPE = {
MetadataPropertyTypes.terms: Union[StrictStr, List[StrictStr]],
MetadataPropertyTypes.integer: StrictInt,
MetadataPropertyTypes.float: StrictFloat,
}

PYDANTIC_STRICT_TO_PYTHON_TYPE = {
StrictInt: int,
StrictFloat: float,
StrictStr: str,
Union[StrictStr, List[StrictStr]]: (str, list),
}
119 changes: 85 additions & 34 deletions src/argilla/client/feedback/schemas/metadata.py
@@ -11,11 +11,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Type, Union
from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union

from argilla.client.feedback.constants import METADATA_PROPERTY_TYPE_TO_PYDANTIC_TYPE, PYDANTIC_STRICT_TO_PYTHON_TYPE
from argilla.client.feedback.schemas.enums import MetadataPropertyTypes
from argilla.client.feedback.schemas.validators import (
validate_numeric_metadata_filter_bounds,
@@ -92,14 +91,18 @@ def _pydantic_field_with_validator(self) -> Tuple[Dict[str, Tuple[Any, ...]], Di
def _validate_filter(self, metadata_filter: "MetadataFilters") -> None:
pass

@abstractmethod
def _check_allowed_value_type(self, value: Any) -> Any:
expected_type = PYDANTIC_STRICT_TO_PYTHON_TYPE[METADATA_PROPERTY_TYPE_TO_PYDANTIC_TYPE[self.type]]
if not isinstance(value, expected_type):
raise ValueError(
f"Provided '{self.name}={value}' of type {type(value)} is not valid, "
f"only values of type {expected_type} are allowed."
)
return value
...

@abstractmethod
def _validator(self, value: Any) -> Any:
...


def _validator_definition(schema: MetadataPropertySchema) -> Dict[str, Any]:
"""Returns a dictionary with the pydantic validator definition for the provided schema."""
return {f"{schema.name}_validator": validator(schema.name, allow_reuse=True, pre=True)(schema._validator)}


class TermsMetadataProperty(MetadataPropertySchema):
@@ -161,21 +164,33 @@ def _all_values_exist(self, introduced_value: Optional[Union[str, List[str]]] =

return introduced_value

def _check_allowed_value_type(self, value: Any) -> Any:
if value is None or isinstance(value, str):
return value

if isinstance(value, list):
return [self._check_allowed_value_type(v) for v in value]

raise TypeError(
f"Provided '{self.name}={value}' of type {type(value)} is not valid, "
"only values of type `str` or `list` are allowed."
)

def _validator(self, value: Any) -> Any:
return self._all_values_exist(self._check_allowed_value_type(value))

@property
def _pydantic_field_with_validator(self) -> Tuple[Dict[str, Tuple[StrictStr, None]], Dict[str, Callable]]:
def _pydantic_field_with_validator(self) -> Tuple[Dict[str, tuple], Dict[str, Callable]]:
# TODO: Simplify the validation logic and do not base on dynamic pydantic models
return (
{self.name: (METADATA_PROPERTY_TYPE_TO_PYDANTIC_TYPE[self.type], None)},
{f"{self.name}_validator": validator(self.name, allow_reuse=True, pre=True)(self._validator)},
)
field_type, default_value = Optional[Union[StrictStr, List[StrictStr]]], None

return {self.name: (field_type, default_value)}, _validator_definition(self)

def _validate_filter(self, metadata_filter: "TermsMetadataFilter") -> None:
if self.values is not None and not all(value in self.values for value in metadata_filter.values):
raise ValidationError(
f"Provided 'values={metadata_filter.values}' is not valid, only values in {self.values} are allowed."
f"Provided 'values={metadata_filter.values}' is not valid, only values in {self.values} are allowed.",
model=type(metadata_filter),
)


@@ -233,37 +248,40 @@ def _value_in_bounds(self, provided_value: Optional[Union[int, float]]) -> Union
)
return provided_value

def _validator(self, value: Any) -> Any:
return self._value_in_bounds(self._check_allowed_value_type(value))
def _check_nan(self, provided_value: Optional[Union[int, float]]) -> Optional[Union[int, float]]:
if provided_value != provided_value:
raise ValueError(f"Provided '{self.name}={provided_value}' is not valid, NaN values are not allowed.")

@property
def _pydantic_field_with_validator(
self,
) -> Tuple[Dict[str, Tuple[Union[StrictInt, StrictFloat], None]], Dict[str, Callable]]:
return (
{self.name: (METADATA_PROPERTY_TYPE_TO_PYDANTIC_TYPE[self.type], None)},
{f"{self.name}_validator": validator(self.name, allow_reuse=True, pre=True)(self._validator)},
)
return provided_value

def _validator(self, value: Any) -> Any:
return self._value_in_bounds(self._check_allowed_value_type(self._check_nan(value)))

def _validate_filter(self, metadata_filter: Union["IntegerMetadataFilter", "FloatMetadataFilter"]) -> None:
metadata_filter = metadata_filter.dict()
metadata_filter_ = metadata_filter.dict()
for allowed_arg in ["ge", "le"]:
if metadata_filter[allowed_arg] is not None:
if metadata_filter_[allowed_arg] is not None:
if (
self.max is not None
and self.min is not None
and not (self.max >= metadata_filter[allowed_arg] >= self.min)
and not (self.max >= metadata_filter_[allowed_arg] >= self.min)
):
raise ValidationError(
f"Provided '{allowed_arg}={metadata_filter[allowed_arg]}' is not valid, only values between {self.min} and {self.max} are allowed."
f"Provided '{allowed_arg}={metadata_filter_[allowed_arg]}' is not valid, "
f"only values between {self.min} and {self.max} are allowed.",
model=type(metadata_filter),
)
if self.max is not None and not (self.max >= metadata_filter[allowed_arg]):
if self.max is not None and not (self.max >= metadata_filter_[allowed_arg]):
raise ValidationError(
f"Provided '{allowed_arg}={metadata_filter[allowed_arg]}' is not valid, only values under {self.max} are allowed."
f"Provided '{allowed_arg}={metadata_filter_[allowed_arg]}' is not valid, "
f"only values under {self.max} are allowed.",
model=type(metadata_filter),
)
if self.min is not None and not (self.min <= metadata_filter[allowed_arg]):
if self.min is not None and not (self.min <= metadata_filter_[allowed_arg]):
raise ValidationError(
f"Provided '{allowed_arg}={metadata_filter[allowed_arg]}' is not valid, only values over {self.min} are allowed."
f"Provided '{allowed_arg}={metadata_filter_[allowed_arg]}' is not valid, "
f"only values over {self.min} are allowed.",
model=type(metadata_filter),
)


@@ -292,6 +310,23 @@ class IntegerMetadataProperty(_NumericMetadataPropertySchema):
min: Optional[int] = None
max: Optional[int] = None

@property
def _pydantic_field_with_validator(
self,
) -> Tuple[Dict[str, tuple], Dict[str, Callable]]:
field_type, default_value = Optional[StrictInt], None
return {self.name: (field_type, default_value)}, _validator_definition(self)

def _check_allowed_value_type(self, value: Any) -> Any:
if value is not None:
if isinstance(value, int):
return value
raise TypeError(
f"Provided '{self.name}={value}' of type {type(value)} is not valid, "
"only values of type `int` are allowed."
)
return value


class FloatMetadataProperty(_NumericMetadataPropertySchema):
"""Schema for the `FeedbackDataset` metadata properties of type `float`. This kind
@@ -318,6 +353,22 @@ class FloatMetadataProperty(_NumericMetadataPropertySchema):
min: Optional[float] = None
max: Optional[float] = None

def _check_allowed_value_type(self, value: Any) -> Any:
if value is None or isinstance(value, (int, float)):
return value

raise TypeError(
f"Provided '{self.name}={value}' of type {type(value)} is not valid, "
"only values of type `int` or `float` are allowed."
)

@property
def _pydantic_field_with_validator(
self,
) -> Tuple[Dict[str, tuple], Dict[str, Any]]:
field_type, default_value = Optional[StrictFloat], None
return {self.name: (field_type, default_value)}, _validator_definition(self)


class MetadataFilterSchema(BaseModel, ABC):
"""Base schema for the `FeedbackDataset` metadata filters.
3 changes: 2 additions & 1 deletion src/argilla/server/contexts/datasets.py
@@ -512,7 +512,8 @@ async def _validate_metadata(
continue

try:
metadata_property.parsed_settings.check_metadata(value)
if value is not None:
metadata_property.parsed_settings.check_metadata(value)
except ValueError as e:
raise ValueError(f"'{name}' metadata property validation failed because {e}") from e

12 changes: 12 additions & 0 deletions src/argilla/server/schemas/v1/datasets.py
@@ -524,6 +524,18 @@ def check_user_id_is_unique(cls, values: Optional[List[UserResponseCreate]]) ->

return values

@validator("metadata", pre=True)
@classmethod
def prevent_nan_values(cls, metadata: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
if metadata is None:
return metadata

for k, v in metadata.items():
if v != v:
raise ValueError(f"NaN is not allowed as metadata value, found NaN for key {k!r}")

return metadata


class RecordsCreate(BaseModel):
items: conlist(item_type=RecordCreate, min_items=RECORDS_CREATE_MIN_ITEMS, max_items=RECORDS_CREATE_MAX_ITEMS)
14 changes: 13 additions & 1 deletion src/argilla/server/schemas/v1/records.py
@@ -17,7 +17,7 @@
from uuid import UUID

from argilla.server.models import ResponseStatus
from argilla.server.pydantic_v1 import BaseModel, Field
from argilla.server.pydantic_v1 import BaseModel, Field, validator
from argilla.server.schemas.base import UpdateSchema
from argilla.server.schemas.v1.suggestions import SuggestionCreate

@@ -51,3 +51,15 @@ class RecordUpdate(UpdateSchema):
metadata_: Optional[Dict[str, Any]] = Field(None, alias="metadata")
suggestions: Optional[List[SuggestionCreate]] = None
vectors: Optional[Dict[str, List[float]]]

@validator("metadata_", pre=True)
@classmethod
def prevent_nan_values(cls, metadata: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
if metadata is None:
return metadata

for k, v in metadata.items():
if v != v:
raise ValueError(f"NaN is not allowed as metadata value, found NaN for key {k!r}")

return {k: v for k, v in metadata.items() if v == v} # By definition, NaN != NaN
48 changes: 48 additions & 0 deletions tests/integration/client/feedback/dataset/remote/test_dataset.py
@@ -18,6 +18,7 @@

import argilla as rg
import argilla.client.singleton
import numpy
import pytest
from argilla import (
FeedbackRecord,
@@ -841,3 +842,50 @@ async def test_warning_local_methods(self, role: UserRole) -> None:
match="A local `FeedbackDataset` returned because `prepare_for_training` is not supported for `RemoteFeedbackDataset`. ",
):
ds.prepare_for_training(framework=None, task=None)

async def test_add_records_with_metadata_including_nan_values(
self, owner: "User", feedback_dataset: FeedbackDataset
):
argilla.client.singleton.init(api_key=owner.api_key)
workspace = Workspace.create(name="test-workspace")

remote_dataset = feedback_dataset.push_to_argilla(name="test_dataset", workspace=workspace)

records = [
FeedbackRecord(
external_id=str(i),
fields={"text": "Hello world!", "text-2": "Hello world!"},
metadata={"float-metadata": float("nan")},
)
for i in range(1, 20)
]

with pytest.raises(ValueError, match="NaN values are not allowed"):
remote_dataset.add_records(records)

async def test_add_records_with_metadata_including_none_values(
self, owner: "User", feedback_dataset: FeedbackDataset
):
argilla.client.singleton.init(api_key=owner.api_key)
workspace = Workspace.create(name="test-workspace")

feedback_dataset.add_records(
[
FeedbackRecord(
fields={"text": "Hello world!"},
metadata={
"terms-metadata": None,
"integer-metadata": None,
"float-metadata": None,
},
)
]
)

remote_dataset = feedback_dataset.push_to_argilla(name="test_dataset", workspace=workspace)
assert len(remote_dataset.records) == 1
assert remote_dataset.records[0].metadata == {
"terms-metadata": None,
"integer-metadata": None,
"float-metadata": None,
}
6 changes: 3 additions & 3 deletions tests/unit/client/feedback/dataset/local/test_dataset.py
@@ -226,16 +226,16 @@ def test_add_record_with_numpy_values(property_class: Type["AllowedMetadataPrope
metadata_property = property_class(name="numeric_property")
dataset.add_metadata_property(metadata_property)

property_to_primitive_type = {IntegerMetadataProperty: int, FloatMetadataProperty: float}
expected_type = property_to_primitive_type[property_class]
property_to_expected_type_msg = {IntegerMetadataProperty: "`int`", FloatMetadataProperty: "`int` or `float`"}
expected_type_msg = property_to_expected_type_msg[property_class]

value = numpy_type(10.0)
record = FeedbackRecord(fields={"required-field": "text"}, metadata={"numeric_property": value})

with pytest.raises(
ValueError,
match=f"Provided 'numeric_property={value}' of type {str(numpy_type)} is not valid, "
f"only values of type {expected_type} are allowed.",
f"only values of type {expected_type_msg} are allowed.",
):
dataset.add_records(record)

12 changes: 12 additions & 0 deletions tests/unit/client/feedback/dataset/test_helpers.py
@@ -94,6 +94,12 @@ def test_generate_pydantic_schema_for_metadata(
ValidationError,
"Provided 'int-metadata=wrong' of type <class 'str'> is not valid",
),
(
[IntegerMetadataProperty(name="int-metadata", min=0, max=10)],
{"int-metadata": float("nan")},
ValidationError,
"Provided 'int-metadata=nan' is not valid, NaN values are not allowed.",
),
(
[FloatMetadataProperty(name="float-metadata", min=0.0, max=10.0)],
{"float-metadata": 100.0},
@@ -106,6 +112,12 @@
ValidationError,
"Provided 'float-metadata=wrong' of type <class 'str'> is not valid",
),
(
[FloatMetadataProperty(name="float-metadata", min=0.0, max=10.0)],
{"float-metadata": float("nan")},
ValidationError,
"Provided 'float-metadata=nan' is not valid, NaN values are not allowed.",
),
],
)
def test_generate_pydantic_schema_for_metadata_errors(