Setting ignore_attribute with edit_dataset only uses last attribute #1289

Open
amueller opened this issue Nov 3, 2023 · 13 comments
Labels
bug serverside These issues are present in the rest API and not fixable by the Python package.

Comments

@amueller
Contributor

amueller commented Nov 3, 2023

So I tried to create a new version of cylinder-bands because of openml/openml-data#59

import openml
ds = openml.datasets.get_dataset("cylinder-bands", version=2)
new_did = openml.datasets.fork_dataset(data_id=ds.id)
openml.datasets.edit_dataset(new_did, ignore_attribute=[
    "timestamp",
    "cylinder_number",
    "job_number",
])

However, that seems to have set ignore_attribute to just "job_number", as you can see here:
https://www.openml.org/api/v1/json/data/45686

Opening this here since I used the Python interface, but the Python code looks pretty easy, so maybe it's an issue in the backend?

@amueller
Contributor Author

amueller commented Nov 3, 2023

The XML sent by the client code is

'<?xml version="1.0" encoding="utf-8"?>\n<oml:data_edit_parameters xmlns:oml="http://openml.org/openml"><oml:ignore_attribute>timestamp</oml:ignore_attribute><oml:ignore_attribute>cylinder_number</oml:ignore_attribute><oml:ignore_attribute>job_number</oml:ignore_attribute></oml:data_edit_parameters>'

Not sure if that is correct? @joaquinvanschoren ?

@PGijsbers
Collaborator

Strange, when I query the production database, for dataset 45686 only timestamp and cylinder_number are saved with ignore_attribute set, job_number is not. I am very confused about what could be happening here. Hoping that someone more familiar with the back-end can chime in.

SELECT * FROM `data_feature` WHERE `did`=45686

[Image: screenshot of the query result]

@joaquinvanschoren
Contributor

Looks like a bug in the API:
https://github.com/openml/OpenML/blob/858b9d471554bfd70b30bd16f53226c8ab916fa9/openml_OS/models/api/v1/Api_data.php#L564

This seems to overwrite the previous values in the same request. It also looks like every call replaces the columns to be ignored rather than adding to them. As a workaround, passing all values at once (as a comma-separated string) should work, but I haven't tested it.

@PGijsbers what do you think? Is it worth fixing this in API v1 or do a workaround now and fix this in API v2?
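The overwrite described above can be illustrated with a small simulation. This is not the server's PHP code; it is a minimal sketch that assumes each repeated `oml:ignore_attribute` element is assigned to the same key in turn, so that only the last value survives. The `update` dict and the loop are hypothetical stand-ins for whatever the API does internally.

```python
import xml.etree.ElementTree as ET

NS = {"oml": "http://openml.org/openml"}

# Same structure as the payload sent by the client (XML declaration omitted).
payload = (
    '<oml:data_edit_parameters xmlns:oml="http://openml.org/openml">'
    "<oml:ignore_attribute>timestamp</oml:ignore_attribute>"
    "<oml:ignore_attribute>cylinder_number</oml:ignore_attribute>"
    "<oml:ignore_attribute>job_number</oml:ignore_attribute>"
    "</oml:data_edit_parameters>"
)

root = ET.fromstring(payload)

# Reading all repeated elements keeps every value.
values = [el.text for el in root.findall("oml:ignore_attribute", NS)]

# Assumed "last one wins" handling: each element overwrites the same key.
update = {}
for el in root:
    key = el.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    update[key] = el.text

print(values)
print(update)
```

Under that assumption, the request itself is well-formed and carries all three names; the loss would happen when the values are collapsed into a single field.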

@joaquinvanschoren
Contributor

@PGijsbers

Eh, could it be that the /data/edit endpoint only changes the dataset table, not the data_feature table?

Looks like it:
https://github.com/openml/OpenML/blob/858b9d471554bfd70b30bd16f53226c8ab916fa9/openml_OS/models/api/v1/Api_data.php#L623C31-L623C38

@PGijsbers
Collaborator

PGijsbers commented Nov 3, 2023

The dataset table does only list "job_number" as ignore_attribute, explaining the response.

@amueller Does Joaquin's suggested workaround work?

@joaquinvanschoren If the workaround works, we could hotfix openml-python. To the best of my knowledge it is the only way this particular endpoint is exposed (it's not even listed in the documentation). My preference would be to fix this server-side though, if it's not too much trouble, since with v2 we would want the expected format to be an explicit list instead of a comma-separated string. If we apply our hotfix to openml-python now, we would need to adjust openml-python again once v2 standards are adopted.
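For illustration, a client-side hotfix along these lines might simply collapse a list into one comma-separated string before the request is built. The helper name `join_ignore_attributes` is hypothetical and is not part of openml-python; this is a sketch of the idea, not the actual patch.

```python
from typing import Iterable, Union


def join_ignore_attributes(ignore_attribute: Union[str, Iterable[str]]) -> str:
    """Hypothetical helper: return the names as one comma-separated string."""
    if isinstance(ignore_attribute, str):
        # Already a single string; pass it through unchanged.
        return ignore_attribute
    # No spaces after the commas, matching the suggested workaround format.
    return ",".join(ignore_attribute)


joined = join_ignore_attributes(["timestamp", "cylinder_number", "job_number"])
print(joined)
```

The resulting string could then be passed as `ignore_attribute` to `openml.datasets.edit_dataset`, in the comma-separated form suggested earlier in the thread.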

@amueller
Contributor Author

amueller commented Nov 6, 2023

Do you mean passing all values at once as a string? I tried that before opening the issue, but the server-side validation didn't seem to like it the way I did it. There might be another way, though?

@amueller
Contributor Author

It would be great to have a work-around for this, I'd really like to use this dataset.

@PGijsbers
Collaborator

PGijsbers commented Nov 16, 2023

For me the workaround seems to work? openml.org/d/45705

import openml
ds = openml.datasets.get_dataset("cylinder-bands", version=2)
new_id = openml.datasets.fork_dataset(data_id=ds.id)
openml.datasets.edit_dataset(new_id, ignore_attribute='timestamp,cylinder_number,job_number')

after removing cache:

>>> import openml
>>> d=openml.datasets.get_dataset(45705)
...WARNINGS...
>>> d.ignore_attribute
['timestamp', 'cylinder_number', 'job_number']
>>> 

Please try again with the provided script, perhaps there were other formatting errors when you tried the workaround. If that still doesn't work, please provide the error message. And also the dataset id of the dataset that you tried to modify (i.e., your "fork" (new_id), not the original).

Running on a dev version of openml-python, but I don't think there have been any changes that would affect this for many releases.

@amueller
Contributor Author

amueller commented Nov 16, 2023

I think I had spaces after the comma, that might have been the issue. Thank you! Version two is my fork IIRC :)
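If spaces after the commas were indeed the problem (this is a guess, not confirmed against the server-side validation), a hand-written string could be cleaned up before it is sent. The function `normalize_ignore_string` below is a hypothetical helper, not an openml-python API.

```python
def normalize_ignore_string(raw: str) -> str:
    """Hypothetical helper: strip whitespace around each comma-separated name."""
    names = [name.strip() for name in raw.split(",")]
    # Drop empty entries left behind by stray or trailing commas.
    return ",".join(name for name in names if name)


cleaned = normalize_ignore_string("timestamp, cylinder_number, job_number")
print(cleaned)
```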

PGijsbers added the "bug" and "serverside" labels on Nov 16, 2023

@amueller
Contributor Author

FYI it seems that if you fork a dataset, it keeps the owner by default. I'm not sure if that's intentional?

@PGijsbers
Collaborator

@amueller
Contributor Author

Hm ok so this is the last person that edited it? Because 45705 was the one I created and it's now "uploaded" by you.

@PGijsbers
Collaborator

I am a little confused. Are you saying that the "uploader" for a specific dataset id changed? E.g., 45705 was first marked as "uploaded by you" and now "uploaded by me"? Because I don't think that's supposed to happen.
