add multi-label support #298

Open
wants to merge 3 commits into base: master

louis-huang
Contributor

Hi, I added support for passing the label as a list, so that we can read data with multiple labels. This should resolve #286.
I verified that the new unit tests pass, and all of test_matrix.py passes in my local setup.
I also verified the change locally by training an XGBoost model on Parquet data, and it works well. So far it should work for the Parquet data format. Thank you!
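
To make the idea of a list-valued label concrete, here is a minimal sketch in plain pandas, not the xgboost_ray implementation. The helper name `split_features_and_labels` is hypothetical; it only illustrates that a single label name yields a 1-D target while a list of names yields a 2-D target with one column per label.

```python
from typing import List, Tuple, Union

import numpy as np
import pandas as pd


def split_features_and_labels(
    df: pd.DataFrame, label: Union[str, List[str]]
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: separate feature columns from one or more label columns."""
    # Normalize a single label name to a one-element list.
    label_cols = [label] if isinstance(label, str) else list(label)
    X = df.drop(columns=label_cols).to_numpy()
    y = df[label_cols].to_numpy()
    if isinstance(label, str):
        # Single label: flatten to the usual 1-D target vector.
        y = y.ravel()
    return X, y


df = pd.DataFrame(
    {
        "f0": [0.1, 0.2, 0.3],
        "f1": [1.0, 2.0, 3.0],
        "label_0": [0, 1, 0],
        "label_1": [1, 1, 0],
    }
)

# Multi-label: the target is a matrix of shape (n_rows, n_labels).
X_multi, y_multi = split_features_and_labels(df, ["label_0", "label_1"])
print(X_multi.shape, y_multi.shape)
```

With a single string such as `"label_0"`, the same helper leaves `label_1` among the features and returns a 1-D target.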

@louis-huang
Contributor Author

I verified that the change works with the code example below:

# Generate a small multi-label dataset and write it to Parquet.
from sklearn.datasets import make_multilabel_classification
import pandas as pd
import numpy as np

n_classes = 5
random_state = 0
X, y = make_multilabel_classification(
    n_samples=32, n_classes=n_classes, n_labels=3, random_state=random_state
)
features = [f"f{i}" for i in range(X.shape[1])]
labels = [f"label_{i}" for i in range(n_classes)]

X_df = pd.DataFrame(X, columns=features)
y_df = pd.DataFrame(y, columns=labels)
data = pd.concat([X_df, y_df], axis=1)

data.to_parquet("~/Desktop/sample_data/data.parquet")

from xgboost_ray import RayDMatrix, RayParams, train, RayFileType
n_classes = 5
features = [f"f{i}" for i in range(20)]

labels = [f"label_{i}" for i in range(n_classes)]

training_data = "~/Desktop/sample_data"
train_set = RayDMatrix(training_data, labels, columns = features + labels, filetype=RayFileType.PARQUET)

evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "random_state": random_state,
    },
    train_set,
    num_boost_round = 1,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(
        num_actors=1,  # Number of remote actors
        cpus_per_actor=1))

#bst.save_model("model.xgb")
#print("Final training error: {:.4f}".format(
#    evals_result["train"]["error"][-1]))

from xgboost_ray import predict
pred_ray = predict(bst, train_set, ray_params=RayParams(num_actors=1))
print(pred_ray)


import xgboost as xgb

clf = xgb.XGBClassifier(tree_method="hist", n_estimators = 1, random_state=0)
clf.fit(X, y)
expected = clf.predict_proba(X)

np.testing.assert_allclose(expected, pred_ray)

@heyitsmui

@Yard1 can you help take a look when you get a chance? thanks!

@@ -118,12 +118,12 @@ def convert_to_series(data: Any) -> pd.Series:
     @classmethod
     def get_column(
         cls, data: pd.DataFrame, column: Any
-    ) -> Tuple[pd.Series, Optional[str]]:
+    ) -> Tuple[pd.Series, Optional[Union[str, List]]]:

should we open up a separate get_columns(...) instead of overloading this method?
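
For illustration, the split suggested here could look roughly like the sketch below. It uses standalone functions on a pandas DataFrame rather than the class methods in xgboost_ray/matrix.py, and both bodies are assumptions, not the code in this PR: `get_column` keeps its single-label return type, and a hypothetical `get_columns` handles the list case.

```python
from typing import Any, List, Optional, Sequence, Tuple

import pandas as pd


def get_column(
    data: pd.DataFrame, column: Any
) -> Tuple[Optional[pd.Series], Optional[str]]:
    """Single-label path: return one Series and, if known, its column name."""
    if isinstance(column, str):
        return data[column], column
    if column is None:
        return None, None
    # Anything else is treated as raw label values (assumption for this sketch).
    return pd.Series(column), None


def get_columns(
    data: pd.DataFrame, columns: Sequence[str]
) -> Tuple[pd.DataFrame, List[str]]:
    """Multi-label path (hypothetical): return a DataFrame and the list of names."""
    names = list(columns)
    missing = [name for name in names if name not in data.columns]
    if missing:
        raise KeyError(f"Label columns not found: {missing}")
    return data[names], names


df = pd.DataFrame({"f0": [1.0, 2.0], "label_0": [0, 1], "label_1": [1, 1]})

series, name = get_column(df, "label_0")
frame, names = get_columns(df, ["label_0", "label_1"])
print(name, names, frame.shape)
```

Keeping the two paths separate means callers of `get_column` can still rely on a `pd.Series` and an optional string name, at the cost of a branch at each call site that accepts either form.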

Member

@Yard1 left a comment

LGTM, thanks! cc @krfricke

xgboost_ray/matrix.py (outdated, resolved)
xgboost_ray/tests/test_matrix.py (outdated, resolved)
xgboost_ray/tests/test_matrix.py (outdated, resolved)
xgboost_ray/matrix.py (outdated, resolved)
Signed-off-by: Antoni Baum <[email protected]>
@louis-huang
Copy link
Contributor Author

Hi @Yard1, may I ask how to fix the lint test? It seems to still be blocking the merge. Thank you!

@Yard1
Member

Yard1 commented Nov 9, 2023

Can you run the ./format.sh script in the root of the repo?

@yc2984

yc2984 commented Mar 21, 2024

@louis-huang can you please run the above test?

4 participants