-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving preprocessing #1320
base: master
Are you sure you want to change the base?
Improving preprocessing #1320
Conversation
…g BinaryCategoricalPreprocessor, fix bugs, adding reduce memory size, delete clean_extra_spaces
Hello @aPovidlo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-10-24 13:32:36 UTC |
Code in this pull request still contains PEP8 errors, please write the Comment last updated at |
…y method to OptimisedFeature
…_correct by adding check for target
…_fit_transform by adding cat and num idx in get_dataset func
…by switching Xgboost to Catboost, due to "Experimental support for categorical data is not implemented for current tree method yet." for XgBoost and checking feat ids with size
Благодарю! Обязательно разберусь с этим. Пока вносил другие небольшие изменения и тестировал на своих примерах. |
@nicl-nno Еще я думаю было бы полезно пройтись профайлером чтобы померить использование памяти в пиках. Почему-то думаю, что возможно |
@Lopa10ko Можно тебя попросить запустить снова интеграционные тесты? |
|
Благодарю! кттс сделаю в освободившиеся время. |
fedot/preprocessing/preprocessing.py
Outdated
else: | ||
self.log.debug('-- Reduce memory in features') | ||
data.features = reduce_mem_usage(data.features, data.supplementary_data.col_type_ids['features']) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is a persistent error that causes the test_fill_nan_without_categorical
integration test to crash. The reason for this is that data.features
should be of type np.ndarray
, but new functions, such as reduce_mem_usage
, internally work with pd.DataFrame
.
If you need to maintain the data type returned by reduce_mem_usage
, I recommend the following approach:
data.features = data.features.to_numpy() |
This also solves problems with test_text_preprocessing.py::test_clean_text_preprocessing
and test_real_cases.py::test_spam_detection_problem
@copy_doc(BasePreprocessor.reduce_memory_size) | ||
def reduce_memory_size(self, data: InputData) -> InputData: | ||
if isinstance(data, InputData): | ||
if data.task.task_type == TaskTypesEnum.ts_forecasting: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point the value of the data.supplementary_data.col_type_ids
field is set to None
, resulting in integration tests failures for test_classification.py:test_image_classification_quality
, test_classification.py:test_cnn_methods
, and test_data.py:test_data_from_image
if isinstance(features, np.ndarray): | ||
transformed_categorical = self.encoder.transform(features[:, self.categorical_ids]).toarray() | ||
# Stack transformed categorical and non-categorical data, ignore if none | ||
non_categorical_features = features[:, self.non_categorical_ids] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the rationale behind choosing a np.ndarray
as the type for self.non_categorical_ids
?
self.categorical_ids: np.ndarray = np.array([])
self.non_categorical_ids: np.ndarray = np.array([])
self.encoded_ids: np.ndarray = np.array([])
self.new_numerical_idx: np.ndarray = np.array([])
The integration test test_pipeline_preprocessing.py::test_data_preprocessor_performs_optional_data_preprocessing_only_once
failed when the default field value of self.non_categorical_ids
was changed from []
to np.array([])
It's a bit strange that the changes to the type of these ids attributes are limited only to OneHotEncodingImplementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lopa10ko Изначально стоял что на вход придет тип np.ndarray. Заметил, что PyCharm помечает несоответсвие типу, поэтому решил поменять. Видимо мб и в тесте поменять, ну или типы)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that numpy slices can only be created with id sequences with a constant data type (int | bool
). However, the test checks something entirely different. The easiest solution is to either leave the lists as they are or set dtype
This is a 🔨 code refactoring.
Summary
Significant Updates in Data Storage and Preprocessing
Major Updates:
InputData.from_numpy(...)
,InputData.from_dataframe(...)
, andInputData.from_csv(...)
methods.OptimizedFeatures
, which stores data with optimal dtypes for improved efficiency.reduce_memory_size
to optimize memory usage.PredefinedModel
to allow copying parameters fromDataPreprocessor
.Minor Updates:
Context
closes #1337
closes #1329