-
-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding capability of taking auxiliary data #436
base: master
Are you sure you want to change the base?
Conversation
There are duplicate code in terms of extracting auxiliary paths because I find it confusing to put the logic of extracting auxiliary and regular train/test paths in a single function. I'm sure there are better designs and feel free to propose them :). |
Thank you for the contribution. I just wanted to let you know that it might be a little while before I myself can look at this. Though if Seb finds the time then I'm okay with whatever he says :) In general I think it would be a useful addition to allow for auxiliary data to be present in a task. These types of tasks have already been the focus of research and AutoML competitions. Not processing auxiliary data in any meaningful way by the benchmark framework is the only way I see it work (given its free form). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks more like an ad-hoc feat for AG right now, would need to think how it can be implemented to be useful for more frameworks.
paths = dict(train=train_path, test=test_path) | ||
return paths | ||
|
||
def _extract_auxiliary_paths(self, auxiliary_data, fold=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a copy/paste of _extract_train_test_paths
with minor changes:
- different regex pattern
- slightly different messages
- as it is a recursive function, internal call to
_extract_auxiliary_paths
instead of_extract_auxiliary_paths
.
Given the complexity of this logic, we can't afford duplicating it.
Let's make a higher level _extract_paths
function that can hold the logic for both, sth like:
def _extract_train_test_paths(self, dataset, fold=None):
return self._extract_paths(self._extract_train_test_paths, dataset, fold)
def _extract_auxiliary_paths(self, dataset, fold=None):
return self._extract_paths(self._extract_auxiliary_paths, dataset, fold,
train_suffix="train_auxiliary", test_suffix="test_auxiliary")
def _extract_paths(self, extract_paths_fn, data, fold=None, train_suffix='train', test_suffix='test'):
train_search_pat = re.compile(rf"(?:(.*)[_-]){train_suffix}(?:[_-](\d+))?\.\w+")
...
return extract_paths_fn(ns(train=…)
...
amlb/datasets/file.py
Outdated
@@ -55,6 +55,84 @@ def load(self, dataset, fold=0): | |||
else: | |||
raise ValueError(f"Unsupported file type: {ext}") | |||
|
|||
@profile(logger=log) | |||
def load_auxiliary_data(self, auxiliary_data, fold=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test for this under tests/unit/amlb/datasets/file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
amlb/datasets/file.py
Outdated
paths = dict(train=train_path, test=test_path) | ||
return paths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the fact that it only returns paths. For all other data, we try to provide an object that allows a consistent loading of data, but still providing the possibility to access the path directly. I don't see why it can't be done here as well.
Even if the access to data as a pandas DF is not provided immediately, at least we should have some interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that the auxiliary data could be in all different forms. For example, it could be a zip file containing bunch of images. We don't know how should we handle it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, see my suggestion for AuxData
interface below: we can easily imagine extending it with some format
property later.
Right now, I'm just concerned about providing an access to those aux data that is likely to be useful only for AG, that's why I just want to have a better interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the AuxData
interface, there's a data
property that's supposed to return a DF, which is not doable for many aux data. My point of view is that each framework should handle their own usage of the aux data because of its free form. I don't think providing only the path is useful only for AG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point of view is that each framework should handle their own usage of the aux data because of its free form.
you're making a design decision for them here.
Most frameworks currently supported deal only with tabular data and that is the current focus of amlb
. So if you want to autodetect to support images for your own need, fine for you, you can use the path
property in AuxData
, but you can't generalize your approach to all frameworks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many aux data
hence the idea of adding a type/format attribute to AuxData
.
Again, you don't have to implement the data
property now, I just want to have this AuxData
interface now, so that we can implement more specific loading of those data later without having to rewrite everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Thanks for the discussion!
amlb/datasets/file.py
Outdated
self._test_auxiliary_data = auxiliary_data_path.get('test', None) | ||
|
||
@property | ||
def train_auxiliary_data(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's a path, please name it train_auxiliary_path
not _data
: by convention in amlb
, _data
is used for loaded data and some serialization logic relies on this naming convention.
amlb/datasets/file.py
Outdated
@@ -167,6 +245,59 @@ def _get_metadata(self, prop): | |||
return meta[prop] | |||
|
|||
|
|||
class DatasetWithAuxiliaryData: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must extend Dataset
if you want to expose this.
Actually, I think the right interface would be sth like:
# in data.py
class Datasplit(ABC):
...
@property
def has_auxiliary_data(self) -> bool:
pass
@property
def auxiliary_data(self) -> List[AuxData]:
pass
class AuxData(ABC):
# does it need a name property? a type/category/labels?
@property
def path(self) -> str:
pass
@property
@abstractmethod
def data(self) -> DF:
"""
:return: the auxiliary data as a pandas DataFrame.
"""
pass
@profile(logger=log)
def release(self, properties=None):
clear_cache(self, properties)
then you can implement the logic for files to fulfill the contract of this interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having auxiliary data in Datasplit seems unnecessary to me and it requires methods to add auxiliary data to an existing Dataset because dataset and aux data are loaded separately. This is identical to DatasetWithAuxiliaryData, which is a wrapper containing both the dataset and the aux data. I'll extend it with Dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having auxiliary data in Datasplit seems unnecessary
why? you were planning to have dataset.train_auxiliary_path
and dataset.test_auxiliary_path
.
I'm suggesting that for you it would then be accessed as dataset.train.auxiliary_data[0].path
and dataset.test.auxiliary_data[0].path
: it's much more consistent with the current interfaces.
Everything that is exposed to frameworks must be defined in one of the interfaces defined in data.py
.
This is identical to DatasetWithAuxiliaryData
no, this is not identical: you're creating a class that doesn't extend anything and you add features that are not visible in the contract defined by the interfaces: those are the only references for all data providers.
to sum up, a dataset has currently 2 splits: train
and test
.
you can have aux data for each of those, therefore auxdata should be a property of Datasplit
: if you add the auxdata directly to the Dataset instance instead of Datasplit, you create an inconsistency.
@sebhrusen Hey Seb, happy holidays! Can you review the code when you have time? |
@sebhrusen Hi, can you review this PR when you have time? Thanks! |
Hey @yinweisu sorry for making you wait on this: don't have much time for |
We have an intern joining for AutoGluon that we want to have work on multi-modal optimization (i.e. datasets with an image feature). Ideally we would like him to be able to use AutoMLBenchmark as the benchmarking tool, but this is tricky without this functionality being merged in. We can make do by hacking it into a forked repo, but just mentioning this. |
The idea of this PR is to enable benchmark yaml to take in auxiliary data. Auxiliary data could be useful when the task requires more than train and test datasets. For example, images are needed for multimodality tasks. There are more usage possibilities, and therefore we think it's worth to have auxiliary data as a general feature.
Example yaml file:
The main difference between auxiliary data and regular dataset should be that: