
Trouble uploading datasets to the test server #1159

Open · sebffischer opened this issue Jul 21, 2022 · 5 comments
@sebffischer

I have tried with the API as well as through the website.
When trying to upload a dataset to the test server, I encounter the following error:

A PHP Error was encountered

Severity: Warning

Message: simplexml_load_string(): Entity: line 70: parser error : Extra content at the end of the document

Filename: new/post.php

Line Number: 155

Backtrace:

File: /var/www/openml/OpenML/openml_OS/views/pages/frontend/new/post.php
Line: 155
Function: simplexml_load_string

File: /var/www/openml/OpenML/openml_OS/helpers/cms_helper.php
Line: 19
Function: view

File: /var/www/openml/OpenML/openml_OS/controllers/Frontend.php
Line: 89
Function: loadpage

File: /var/www/openml/OpenML/index.php
Line: 334
Function: require_once
@joaquinvanschoren (Contributor) commented Jul 21, 2022

This is just a warning. The test server is supposed to show these. The production server doesn't. Did you actually have a problem uploading the dataset?
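
One way to check whether the upload actually went through despite the warning is to list the datasets on the test server and look for the name. A minimal sketch, assuming the openml Python client's standard listing call (the exact dataframe columns may vary between client versions):

import openml

openml.config.server = "https://test.openml.org/api/v1"

# Fetch dataset metadata from the test server as a pandas DataFrame and
# check whether the freshly uploaded name shows up.
datasets = openml.datasets.list_datasets(output_format="dataframe")
print(datasets[datasets["name"] == "Diabetes(scikit-learn)"])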

@sebffischer (Author)

Thanks for the clarification!
However, this does not work (for me):

from openml.datasets import create_dataset
import sklearn
import numpy as np
from sklearn import datasets
import openml

openml.config.apikey = "API_TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"

diabetes = sklearn.datasets.load_diabetes()
name = "Diabetes(scikit-learn)"
X = diabetes.data
y = diabetes.target
attribute_names = diabetes.feature_names
description = diabetes.DESCR


data = np.concatenate((X, y.reshape((-1, 1))), axis=1)
attribute_names = list(attribute_names)
attributes = [(attribute_name, "REAL") for attribute_name in attribute_names] + [
    ("class", "INTEGER")
]
citation = (
    "Bradley Efron, Trevor Hastie, Iain Johnstone and "
    "Robert Tibshirani (2004) (Least Angle Regression) "
    "Annals of Statistics (with discussion), 407-499"
)
paper_url = "https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf"


diabetes_dataset = create_dataset(
    # The name of the dataset (needs to be unique).
    # Must not be longer than 128 characters and only contain
    # a-z, A-Z, 0-9 and the following special characters: _\-\.(),
    name=name,
    # Textual description of the dataset.
    description=description,
    # The person who created the dataset.
    creator="Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani",
    # People who contributed to the current version of the dataset.
    contributor=None,
    # The date the data was originally collected, given by the uploader.
    collection_date="09-01-2012",
    # Language in which the data is represented.
    # Starts with 1 upper case letter, rest lower case, e.g. 'English'.
    language="English",
    # License under which the data is/will be distributed.
    licence="BSD (from scikit-learn)",
    # Name of the target. Can also have multiple values (comma-separated).
    default_target_attribute="class",
    # The attribute that represents the row-id column, if present in the
    # dataset.
    row_id_attribute=None,
    # Attribute or list of attributes that should be excluded in modelling, such as
    # identifiers and indexes. E.g. "feat1" or ["feat1","feat2"]
    ignore_attribute=None,
    # How to cite the paper.
    citation=citation,
    # Attributes of the data
    attributes=attributes,
    data=data,
    # A version label which is provided by the user.
    version_label="test",
    original_data_url="https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html",
    paper_url=paper_url,
)

diabetes_dataset.publish()
print(f"URL for dataset: {diabetes_dataset.openml_url}")

gives me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 70, column 0
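
The two errors are consistent: if the PHP warning from above is rendered into the HTTP response body after the XML document, any conforming XML parser will reject the extra content. A minimal sketch of that failure mode (the response shape here is an assumption, not the actual server output):

import xmltodict

# Assumed shape: a complete XML document followed by the HTML-rendered
# PHP warning. Anything after the root element's closing tag counts as
# "junk after document element" to expat.
response_text = "<response>ok</response>\n<div>A PHP Error was encountered</div>"
xmltodict.parse(response_text)  # raises xml.parsers.expat.ExpatError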

@sebffischer (Author) commented Jul 22, 2022

The same happens with this:

import openml
from sklearn import compose, neighbors, preprocessing, tree

openml.config.apikey = "TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"
# NOTE: We are using dataset 68 from the test server: https://test.openml.org/d/68
dataset = openml.datasets.get_dataset(68)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

dataset = openml.datasets.get_dataset(17)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
print(f"Categorical features: {categorical_indicator}")
transformer = compose.ColumnTransformer(
    [("one_hot_encoder", preprocessing.OneHotEncoder(categories="auto"), categorical_indicator)]
)
X = transformer.fit_transform(X)
clf.fit(X, y)


# Get a task
task = openml.tasks.get_task(403)

# Build any classifier or pipeline
clf = tree.DecisionTreeClassifier()

# Run the flow
run = openml.runs.run_model_on_task(clf, task)

print(run)


myrun = run.publish()
# For this tutorial, our configuration publishes to the test server
# as to not pollute the main server.
print(f"Uploaded to {myrun.openml_url}")

gives me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 61, column 6
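
Until the server stops appending the warning, a client-side workaround could truncate the response at the end of the first XML document before parsing. A hypothetical helper (strip_trailing_junk is my sketch, not part of the openml package, and the root tag would have to match the actual response):

import xmltodict

def strip_trailing_junk(text: str, root_tag: str) -> str:
    """Drop anything after the closing tag of the first XML document."""
    closing = f"</{root_tag}>"
    end = text.find(closing)
    return text if end == -1 else text[: end + len(closing)]

cleaned = strip_trailing_junk(
    "<response>ok</response>\n<div>A PHP Error was encountered</div>", "response"
)
print(xmltodict.parse(cleaned))  # {'response': 'ok'}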

@joaquinvanschoren (Contributor)

Hi Seb,

I recently fixed a number of issues with the test server.
Can you please check if this issue is now resolved?

Thanks!

@sebffischer (Author)

It seems like listing datasets from the test server does not work. I have not checked the other things yet (e.g. upload) but will report back when I do.

> list_oml_data(test_server = TRUE)
INFO  [15:21:27.606] Retrieving JSON {url: `https://test.openml.org/api/v1/json/data/list/limit/1000`, authenticated: `TRUE`}
Error in parse_con(txt, bigint_as_char) :
  lexical error: invalid character inside string.
          t":"ARFF",    "md5_checksum":" <div style="border:1px solid
                     (right here) ------^
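
This looks like the same server-side problem in JSON form: the HTML warning is spliced into the payload (here inside the md5_checksum string), and the unescaped quotes in the <div> attribute terminate the JSON string early, which any strict parser rejects. A minimal sketch of the failure mode (the payload is reconstructed from the error message, not the actual response):

import json

# Reconstructed fragment: the md5_checksum value is interrupted by an HTML
# warning whose attribute quote ends the JSON string prematurely.
payload = '{"format": "ARFF", "md5_checksum": " <div style="border:1px solid"}'
json.loads(payload)  # raises json.JSONDecodeError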
