
[OPIK-218] Remove limitations in dataset items #369

Open

thiagohora wants to merge 16 commits into main

Conversation

Contributor

@thiagohora thiagohora commented Oct 11, 2024

Details

We will remove limitations in dataset items by deprecating the input, expected_output and metadata fields in favor of a more generic structure. The new data field is a Dict[String, String] (Map<String, String> in Java), allowing users to send any number of columns they want.

To maintain backward compatibility (until we actually remove the old fields), the values of the input, expected_output and metadata fields are automatically copied into the new data field. Also, a new columns field was added to the Page object so the frontend knows which columns are present in the dataset.
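The backward-compatibility copy described above can be sketched as follows. This is a minimal Python sketch; the function name and the precedence of explicit data keys over the legacy fields are assumptions, not the actual backend implementation:

```python
from typing import Any, Dict, Optional

def merge_legacy_fields(
    data: Optional[Dict[str, Any]],
    input: Optional[Dict[str, Any]] = None,
    expected_output: Optional[Dict[str, Any]] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Copy the deprecated fields into the generic `data` map.

    Keys explicitly present in `data` win over the legacy fields
    (an assumption about the intended precedence).
    """
    merged: Dict[str, Any] = {}
    for key, value in (
        ("input", input),
        ("expected_output", expected_output),
        ("metadata", metadata),
    ):
        if value is not None:
            merged[key] = value
    merged.update(data or {})
    return merged
```

For example, an item created with only the legacy `input` field would end up with `data == {"input": {...}}`, so older SDK versions keep working against the new schema.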

The new data field has the following format:

DatasetItem model

{
    ...
    "data": {
        "{{key}}": {},    # value is either a JSON object ({}) or a string ("")
        ...
    }
}

Sample:

{
    ...
    "data": {
        "user_question": "What is this model name?",
        "results": {
            "context": {
                "message": "Be an expert."
            },
            "user": {
                "message": "What is a shooting star?"
            }
        },
        "model": "Lambda 3"
    }
}

The Page object will look like this:

{
    "content": [ ... ],
    "page": 1,
    "size": 10,
    "total": 10,
    "columns": [
        { "name": "user_question", "type": "String" },
        { "name": "results", "type": "Object" },
        { "name": "model", "type": "String" }
    ]
}
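One way the columns list could be derived from the stored items is sketched below. This is an illustrative Python sketch based on the example payload above; the function name and the String/Object type mapping are assumptions, not the actual backend code:

```python
from typing import Any, Dict, List

def infer_columns(items: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Infer the union of column names across items, with a coarse type.

    Assumed mapping: dict/list values -> "Object", everything else ->
    "String"; a column seen with conflicting types falls back to "Object".
    """
    columns: Dict[str, str] = {}
    for item in items:
        for name, value in item.get("data", {}).items():
            col_type = "Object" if isinstance(value, (dict, list)) else "String"
            if name in columns and columns[name] != col_type:
                columns[name] = "Object"
            else:
                columns[name] = col_type
    return [{"name": name, "type": col_type} for name, col_type in columns.items()]
```

Applied to the sample item above, this yields the three columns shown in the Page object: user_question (String), results (Object) and model (String).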

Issues

#OPIK-218

Testing

  • Automated tests for:
  1. input present
  2. data present
  3. neither input nor data present
  4. verifying that the input, expected_output and metadata values are copied into data

Documentation

OpenAPI docs were updated to reflect the new and deprecated fields.

@thiagohora thiagohora self-assigned this Oct 11, 2024
@thiagohora thiagohora marked this pull request as ready for review October 11, 2024 14:28
@thiagohora thiagohora requested a review from a team as a code owner October 11, 2024 14:28
@thiagohora thiagohora force-pushed the thiagohora/OPIK-218_remove_limitations_in_dataset_items branch from b9890a4 to c8edc9e Compare October 11, 2024 14:47
Collaborator

@andrescrz andrescrz left a comment

There are some minor and other non-blocking comments. Let's focus the discussion on:

  • the rollout strategy.
  • how co-existence of the old and new fields should be handled.
    • considering multiple versions of the SDK released.
    • considering the frontend expectations.
  • the data migration strategy.
  • fern generation for polymorphic types.

andrescrz
andrescrz previously approved these changes Oct 14, 2024
Collaborator

@andrescrz andrescrz left a comment

As discussed and agreed this morning, this LGTM.

@thiagohora thiagohora force-pushed the thiagohora/OPIK-218_remove_limitations_in_dataset_items branch from 8567fdf to 260339f Compare October 15, 2024 11:17
@thiagohora thiagohora force-pushed the thiagohora/OPIK-218_remove_limitations_in_dataset_items branch from 72ced66 to ad13f5d Compare October 15, 2024 11:47
Collaborator

@andrescrz andrescrz left a comment

I believe the latest revision has two major issues that we have to review:

  1. It's not returning the type of the columns, which is a requirement for the FE per the ticket.
  2. It's resolving the type of the fields before storing them (which might be fine for caching purposes etc.), but then it isn't doing anything with them. The type of a field is implicit in any valid JSON payload, so there's no need to resolve and store this extra information unless there's a reason to.

Finally, I left some guidance to easily accommodate upcoming requirements such as filtering.

Once we're on the same page about these suggestions, following this approach is much easier, as it's just a matter of removing some of the current code, as happened with the previous revision.
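The point in the comment above, that the type is already implicit in any valid JSON payload, can be illustrated with a small sketch. This is hypothetical Python, not the backend code; the function name and the type labels beyond String/Object are assumptions:

```python
import json

def resolve_column_type(raw: str) -> str:
    """Map a raw value to a coarse column type by parsing it as JSON.

    The label set below is a guess at the intended taxonomy; the point
    is that no extra stored metadata is needed to recover the type.
    """
    try:
        value = json.loads(raw)
    except (ValueError, TypeError):
        return "String"  # not valid JSON: treat it as a plain string
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, bool):  # must be checked before int: bool is a subclass of int
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if value is None:
        return "Null"
    return "String"
```

Because the type falls out of parsing, it can be resolved lazily at read time rather than resolved and stored on write.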

@thiagohora
Contributor Author

> Finally, I left some guidance to easily accommodate upcoming requirements such as filtering.

From what @ferc said, the column types do not need to be returned yet.

Collaborator

@andrescrz andrescrz left a comment

The last two issues were discussed, agreed upon and addressed. This is ready to go from my side.

@andrescrz
Collaborator

andrescrz commented Oct 15, 2024

> Finally, I left some guidance to easily accommodate upcoming requirements such as filtering.

> From what @ferc said, the column types do not need to be returned yet.

Agreed to move forward without requiring the type in the returned payload. However, getting the type value was trivial, so the latest revision resolves it as well.
