Adding capability of taking auxiliary data #436

Open · wants to merge 10 commits into base: master (showing changes from 5 commits)

6 changes: 5 additions & 1 deletion amlb/benchmark.py
@@ -19,7 +19,7 @@
from typing import List, Union

from .job import Job, JobError, SimpleJobRunner, MultiThreadingJobRunner
from .datasets import DataLoader, DataSourceType
from .datasets import DataLoader, DataSourceType, DatasetWithAuxiliaryData
from .data import DatasetType
from .datautils import read_csv
from .resources import get as rget, config as rconfig, output_dirs as routput_dirs
@@ -489,6 +489,10 @@ def load_data(self):
        else:
            raise ValueError("Tasks should have one property among [openml_task_id, openml_dataset_id, dataset].")

        if hasattr(self._task_def, 'auxiliary_data'):
            auxiliary_data = Benchmark.data_loader.load_auxiliary_data(DataSourceType.file, auxiliary_data=self._task_def.auxiliary_data, fold=self.fold)
            self._dataset = DatasetWithAuxiliaryData(self._dataset, auxiliary_data)

    def as_job(self):
        job = Job(name=rconfig().token_separator.join([
            'local',
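For illustration only, a sketch of the shape this new branch assumes for `self._task_def` (the task name, file names, and the `SimpleNamespace` stand-in are invented, not from this PR; amlb builds these namespaces from its YAML benchmark definitions):

```python
from types import SimpleNamespace as ns  # stand-in for amlb's Namespace

# hypothetical task definition carrying an extra auxiliary_data entry
task_def = ns(
    name="mytask",
    dataset=ns(train="mytask_train.csv", test="mytask_test.csv"),
    auxiliary_data=ns(train="mytask_train_auxiliary.zip",
                      test="mytask_test_auxiliary.zip"),
)

# the new code path only runs when the attribute is present
if hasattr(task_def, 'auxiliary_data'):
    print("auxiliary data declared:", task_def.auxiliary_data)
```
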
8 changes: 7 additions & 1 deletion amlb/datasets/__init__.py
@@ -1,6 +1,6 @@
from enum import Enum, auto

from .file import FileLoader
from .file import FileLoader, DatasetWithAuxiliaryData
from .openml import OpenmlLoader


@@ -24,5 +24,11 @@ def load(self, source: DataSourceType, *args, **kwargs):
        else:
            raise NotImplementedError(f"data source {source} is not supported yet")

    def load_auxiliary_data(self, source: DataSourceType, *args, **kwargs):
        if source == DataSourceType.file:
            return self.file_loader.load_auxiliary_data(*args, **kwargs)
        else:
            raise NotImplementedError(f"data source {source} is not supported yet")


__all__ = ["DataLoader", "DataSourceType"]
131 changes: 131 additions & 0 deletions amlb/datasets/file.py
@@ -55,6 +55,84 @@ def load(self, dataset, fold=0):
        else:
            raise ValueError(f"Unsupported file type: {ext}")

    @profile(logger=log)
    def load_auxiliary_data(self, auxiliary_data, fold=0):
Collaborator: please add a test for this under tests/unit/amlb/datasets/file

Contributor Author: Added

        auxiliary_data = auxiliary_data if isinstance(auxiliary_data, ns) else ns(path=auxiliary_data)
        log.debug("Loading auxiliary data %s", auxiliary_data)
        paths = self._extract_auxiliary_paths(auxiliary_data.path if 'path' in auxiliary_data else auxiliary_data, fold=fold)
        train_path = paths['train'][fold]
        test_path = paths['test'][fold]
        paths = dict(train=train_path, test=test_path)
        return paths
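
The reviewer asks above for a test under tests/unit/amlb/datasets/file. As a starting point, a minimal, self-contained sketch (pytest assumed; it only exercises the file-naming convention the loader relies on, using invented file names, rather than driving FileLoader itself):

```python
import re

# the same patterns _extract_auxiliary_paths defines further down
TRAIN_PAT = re.compile(r"(?:(.*)[_-])train_auxiliary(?:[_-](\d+))?\.\w+")
TEST_PAT = re.compile(r"(?:(.*)[_-])test_auxiliary(?:[_-](\d+))?\.\w+")


def test_auxiliary_naming_convention():
    # invented sample paths, one with and one without a fold suffix
    m = TRAIN_PAT.search("/data/mytask_train_auxiliary_0.csv")
    assert m is not None
    assert m[1] == "/data/mytask"  # dataset root name
    assert m[2] == "0"             # fold index
    assert TEST_PAT.search("/data/mytask_test_auxiliary.zip")[2] is None  # no fold suffix
```
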
Collaborator: I don't like the fact that it only returns paths. For all other data, we try to provide an object that allows consistent loading of the data while still providing the possibility to access the path directly. I don't see why it can't be done here as well.
Even if access to the data as a pandas DF is not provided immediately, we should at least have some interface.

Contributor Author: The problem here is that the auxiliary data could come in many different forms. For example, it could be a zip file containing a bunch of images. We don't know how we should handle it.

Collaborator: I understand, see my suggestion for an AuxData interface below: we can easily imagine extending it with some format property later.
Right now, I'm just concerned that we're providing access to those aux data in a way that is likely to be useful only for AG; that's why I just want to have a better interface.

Contributor Author: So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many kinds of aux data. My point of view is that each framework should handle its own usage of the aux data because of its free form. I don't think providing only the path is useful only for AG.

Collaborator:

> My point of view is that each framework should handle its own usage of the aux data because of its free form.

you're making a design decision for them here.
Most frameworks currently supported deal only with tabular data, and that is the current focus of amlb. So if you want to autodetect and support images for your own needs, fine, you can use the path property in AuxData, but you can't generalize your approach to all frameworks.

Collaborator:

> So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many kinds of aux data.

hence the idea of adding a type/format attribute to AuxData.
Again, you don't have to implement the data property now, I just want to have this AuxData interface now, so that we can implement more specific loading of those data later without having to rewrite everything.

Contributor Author: This makes sense. Thanks for the discussion!


    def _extract_auxiliary_paths(self, auxiliary_data, fold=None):
Collaborator: looks like a copy/paste of _extract_train_test_paths with minor changes:

  • different regex pattern
  • slightly different messages
  • as it is a recursive function, internal calls to _extract_auxiliary_paths instead of _extract_train_test_paths.

Given the complexity of this logic, we can't afford duplicating it.
Let's make a higher level _extract_paths function that can hold the logic for both, sth like:

def _extract_train_test_paths(self, dataset, fold=None):
    return self._extract_paths(self._extract_train_test_paths, dataset, fold)

def _extract_auxiliary_paths(self, dataset, fold=None):
    return self._extract_paths(self._extract_auxiliary_paths, dataset, fold,
                          train_suffix="train_auxiliary", test_suffix="test_auxiliary")    
    
def _extract_paths(self, extract_paths_fn, data, fold=None, train_suffix='train', test_suffix='test'):
    train_search_pat = re.compile(rf"(?:(.*)[_-]){train_suffix}(?:[_-](\d+))?\.\w+")
    ...
    return extract_paths_fn(ns(train=…)
    ...
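
To make the suggested factoring concrete, a minimal self-contained sketch (it only parametrizes the pattern construction, not the folder/URL handling; the helper name is invented, the suffixes mirror the suggestion above):

```python
import re


def make_split_patterns(train_suffix='train', test_suffix='test'):
    """Build the per-split filename patterns a generic _extract_paths could share."""
    return (re.compile(rf"(?:(.*)[_-]){train_suffix}(?:[_-](\d+))?\.\w+"),
            re.compile(rf"(?:(.*)[_-]){test_suffix}(?:[_-](\d+))?\.\w+"))


# the auxiliary variant would only pass different suffixes:
train_pat, _ = make_split_patterns(train_suffix="train_auxiliary",
                                   test_suffix="test_auxiliary")
print(train_pat.search("mytask_train_auxiliary_0.csv")[2])  # -> 0
```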

        train_search_pat = re.compile(r"(?:(.*)[_-])train_auxiliary(?:[_-](\d+))?\.\w+")
        test_search_pat = re.compile(r"(?:(.*)[_-])test_auxiliary(?:[_-](\d+))?\.\w+")
        if isinstance(auxiliary_data, (tuple, list)):
            assert len(auxiliary_data) % 2 == 0, "auxiliary data list must contain an even number of paths: [train_auxiliary_0, test_auxiliary_0, train_auxiliary_1, test_auxiliary_1, ...]."
            return self._extract_auxiliary_paths(ns(train=[p for i, p in enumerate(auxiliary_data) if i % 2 == 0],
                                                    test=[p for i, p in enumerate(auxiliary_data) if i % 2 == 1]),
                                                 fold=fold)
        elif isinstance(auxiliary_data, ns):
            return dict(
                train=[self._extract_auxiliary_paths(p)['train'][0]
                       if i == fold else None
                       for i, p in enumerate(as_list(auxiliary_data.train))],
                test=[self._extract_auxiliary_paths(p)['train'][0]
                      if i == fold else None
                      for i, p in enumerate(as_list(auxiliary_data.test))] if 'test' in auxiliary_data else []
            )
        else:
            assert isinstance(auxiliary_data, str)
            auxiliary_data = os.path.expanduser(auxiliary_data)
            auxiliary_data = auxiliary_data.format(**rconfig().common_dirs)

        if os.path.exists(auxiliary_data):
            if os.path.isfile(auxiliary_data):
                # we leave the auxiliary data handling to the user
                return dict(train=[auxiliary_data], test=[])
            elif os.path.isdir(auxiliary_data):
                files = list_all_files(auxiliary_data)
                log.debug("Files found in auxiliary data folder %s: %s", auxiliary_data, files)
                assert len(files) > 0, f"Empty folder: {auxiliary_data}"
                if len(files) == 1:
                    return dict(train=files, test=[])

                train_matches = [m for m in [train_search_pat.search(f) for f in files] if m]
                test_matches = [m for m in [test_search_pat.search(f) for f in files] if m]
                # verify they're for the same dataset (just based on name)
                assert train_matches, f"Folder {auxiliary_data} must contain at least one training auxiliary data."
                root_names = {m[1] for m in (train_matches+test_matches)}
                assert len(root_names) == 1, f"All dataset files in {auxiliary_data} should follow the same naming: xxxxx_train_auxiliary_N.ext or xxxxx_test_auxiliary_N.ext with N starting from 0."

                train_no_fold = next((m[0] for m in train_matches if m[2] is None), None)
                test_no_fold = next((m[0] for m in test_matches if m[2] is None), None)
                if train_no_fold and test_no_fold:
                    return dict(train=[train_no_fold], test=[test_no_fold])

                paths = dict(train=[], test=[])
                fold = 0
                while fold >= 0:
                    train = next((m[0] for m in train_matches if m[2] == str(fold)), None)
                    test = next((m[0] for m in test_matches if m[2] == str(fold)), None)
                    if train and test:
                        paths['train'].append(train)
                        paths['test'].append(test)
                        fold += 1
                    else:
                        fold = -1
                assert len(paths) > 0, f"No dataset file found in {auxiliary_data}: they should follow the naming xxxx_train_auxiliary.ext, xxxx_test_auxiliary.ext or xxxx_train_auxiliary_0.ext, xxxx_test_auxiliary_0.ext, xxxx_train_auxiliary_1.ext, ..."
                return paths
        elif is_valid_url(auxiliary_data):
            cached_file = os.path.join(self._cache_dir, os.path.basename(auxiliary_data))
            if not os.path.exists(cached_file):  # don't download if previously done
                handler = get_file_handler(auxiliary_data)
                assert handler.exists(auxiliary_data), f"Invalid path/url: {auxiliary_data}"
                handler.download(auxiliary_data, dest_path=cached_file)
            return self._extract_auxiliary_paths(cached_file)
        else:
            raise ValueError(f"Invalid dataset description: {auxiliary_data}")
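
To make the expected folder layout concrete, a hedged illustration (all file names invented; paths shown as basenames for brevity) of a directory that satisfies the per-fold convention and the dict the fold loop above would then assemble:

```python
# Hypothetical auxiliary-data folder:
#   mytask_train_auxiliary_0.zip   mytask_test_auxiliary_0.zip
#   mytask_train_auxiliary_1.zip   mytask_test_auxiliary_1.zip
#
# With both splits present for folds 0 and 1 (and none for fold 2), the loop stops
# and the method returns, indexed by fold:
expected = dict(
    train=["mytask_train_auxiliary_0.zip", "mytask_train_auxiliary_1.zip"],
    test=["mytask_test_auxiliary_0.zip", "mytask_test_auxiliary_1.zip"],
)
print(expected["train"][1])  # load_auxiliary_data then picks the entries for its fold
```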

    def _extract_train_test_paths(self, dataset, fold=None):
        if isinstance(dataset, (tuple, list)):
            assert len(dataset) % 2 == 0, "dataset list must contain an even number of paths: [train_0, test_0, train_1, test_1, ...]."
@@ -167,6 +245,59 @@ def _get_metadata(self, prop):
        return meta[prop]


class DatasetWithAuxiliaryData:
Collaborator: must extend Dataset if you want to expose this.

Actually, I think the right interface would be sth like:

# in data.py

class Datasplit(ABC):
    ...

    @property
    def has_auxiliary_data(self) -> bool:
        pass
       
    @property
    def auxiliary_data(self) -> List[AuxData]:
        pass
        
class AuxData(ABC):
    # does it need a name property? a type/category/labels?

    @property
    def path(self) -> str:
        pass

    @property
    @abstractmethod
    def data(self) -> DF:
        """
        :return: the auxiliary data as a pandas DataFrame.
        """
        pass

    @profile(logger=log)
    def release(self, properties=None):
        clear_cache(self, properties)

then you can implement the logic for files to fulfill the contract of this interface.

Contributor Author: Having auxiliary data in Datasplit seems unnecessary to me, and it requires methods to add auxiliary data to an existing Dataset because the dataset and aux data are loaded separately. This is identical to DatasetWithAuxiliaryData, which is a wrapper containing both the dataset and the aux data. I'll make it extend Dataset.

Collaborator:

> Having auxiliary data in Datasplit seems unnecessary

why? you were planning to have dataset.train_auxiliary_path and dataset.test_auxiliary_path.
I'm suggesting that for you it would then be accessed as dataset.train.auxiliary_data[0].path and dataset.test.auxiliary_data[0].path: it's much more consistent with the current interfaces.

Everything that is exposed to frameworks must be defined in one of the interfaces defined in data.py.

> This is identical to DatasetWithAuxiliaryData

no, this is not identical: you're creating a class that doesn't extend anything, and you add features that are not visible in the contract defined by the interfaces: those are the only references for all data providers.

to sum up, a dataset currently has 2 splits: train and test.
you can have aux data for each of those, therefore auxdata should be a property of Datasplit: if you add the auxdata directly to the Dataset instance instead of Datasplit, you create an inconsistency.


    def __init__(self, dataset: FileDataset, auxiliary_data_path):
        self._dataset = dataset
        self._train_auxiliary_data = auxiliary_data_path.get('train', None)
        self._test_auxiliary_data = auxiliary_data_path.get('test', None)

    @property
    def train_auxiliary_data(self) -> str:
Collaborator: if it's a path, please name it train_auxiliary_path not _data: by convention in amlb, _data is used for loaded data and some serialization logic relies on this naming convention.

        return self._train_auxiliary_data

    @property
    def test_auxiliary_data(self) -> str:
        return self._test_auxiliary_data

    @property
    def type(self) -> DatasetType:
        assert self._dataset.target is not None
        return (DatasetType[self._dataset._type] if self._dataset._type is not None
                else DatasetType.regression if self._dataset.target.values is None
                else DatasetType.binary if len(self._dataset.target.values) == 2
                else DatasetType.multiclass)

    @property
    def train(self) -> Datasplit:
        return self._dataset._train

    @property
    def test(self) -> Datasplit:
        return self._dataset._test

    @property
    def features(self) -> List[Feature]:
        return self._get_metadata('features')

    @property
    def target(self) -> Feature:
        return self._get_metadata('target')

    @memoize
    def _get_metadata(self, prop):
        meta = self._dataset._train.load_metadata()
        return meta[prop]

    @profile(logger=log)
    def release(self, properties=None):
        """
        Call this to release cached properties and optimize memory once in-memory data are not needed anymore.
        :param properties:
        """
        self._dataset.release(properties)


class FileDatasplit(Datasplit):

    def __init__(self, dataset: FileDataset, format: str, path: str):
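For reference, a minimal sketch of how a file-backed implementation of the reviewer's proposed AuxData interface might look (the class names follow the suggestion above, pd.DataFrame stands in for the DF alias; everything else is an assumption, not code from this PR):

```python
from abc import ABC, abstractmethod
import pandas as pd


class AuxData(ABC):
    """Interface as sketched by the reviewer: path access now, data access later."""

    @property
    @abstractmethod
    def path(self) -> str:
        ...

    @property
    @abstractmethod
    def data(self) -> pd.DataFrame:
        ...


class FileAuxData(AuxData):
    """Hypothetical file-backed implementation exposing only the path for now."""

    def __init__(self, path: str):
        self._path = path

    @property
    def path(self) -> str:
        return self._path

    @property
    def data(self) -> pd.DataFrame:
        # deferred: free-form auxiliary data (e.g. a zip of images) may not map to a DataFrame
        raise NotImplementedError("loading auxiliary data as a DataFrame is not supported yet")


# usage: frameworks that only need the location can stay path-based
aux = FileAuxData("/tmp/mytask_train_auxiliary.zip")  # invented path
print(aux.path)
```
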
4 changes: 4 additions & 0 deletions frameworks/AutoGluon/__init__.py
@@ -19,6 +19,10 @@ def run(dataset: Dataset, config: TaskConfig):
        ),
        problem_type=dataset.type.name  # AutoGluon problem_type is using same names as amlb.data.DatasetType
    )
    if hasattr(dataset, 'train_auxiliary_data'):
        data['train_auxiliary_data'] = dict(path=dataset.train_auxiliary_data)
    if hasattr(dataset, 'test_auxiliary_data'):
        data['test_auxiliary_data'] = dict(path=dataset.test_auxiliary_data)

    return run_in_venv(__file__, "exec.py",
                       input_data=data, dataset=dataset, config=config)
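For context, a hedged sketch (not part of this PR) of how the framework-side exec.py could read these optional entries back out of the serialized input data; the helper name and the SimpleNamespace stand-in are illustrative:

```python
from types import SimpleNamespace as ns  # stand-in for amlb's Namespace


def auxiliary_paths(dataset):
    """Return (train_aux_path, test_aux_path), with None for a missing entry."""
    train = getattr(dataset, 'train_auxiliary_data', None)
    test = getattr(dataset, 'test_auxiliary_data', None)
    return (train.path if train else None,
            test.path if test else None)


# e.g. with only the train-side entry set by run() above:
print(auxiliary_paths(ns(train_auxiliary_data=ns(path="/tmp/aux_train.zip"))))
# -> ('/tmp/aux_train.zip', None)
```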