
Problem with the function extract_features #1058

Open
AlessioBolpagni98 opened this issue Dec 28, 2023 · 7 comments
@AlessioBolpagni98

The problem:
I have a script that runs every day, and in it I use the tsfresh function extract_features(). Sometimes the script gets stuck inside the function, with the progress bar frozen at a certain percentage. The function doesn't raise any exception and the code remains blocked.
Packages (1).txt

  • Python version: 3.10.12
  • Operating System: Linux Ubuntu 22.04.3
  • tsfresh version: 0.20.1
  • Install method (conda, pip, source): pip
@nils-braun
Collaborator

Hi @AlessioBolpagni98 !
Is this deterministic, meaning it always gets stuck with the same data? Do you see a pattern in the data it gets stuck on?
And which feature calculators are you using?

@sidneyzhu

sidneyzhu commented Jul 16, 2024

I encountered a similar issue. My raw dataframe has 1k ids, 27k rows, and 140 features. Full feature extraction completes with MultiprocessingDistributor(n_workers=12) on a 64 GB machine within 30 minutes, but it always hangs with ClusterDaskDistributor on 4 nodes of 64 GB workers. I noticed that it hangs in the result-gathering step; after about 4 hours, the extract_features job gets killed out of memory.
My environment is:

  • Python version: 3.10.12
  • tsfresh version: 0.20.2
  • dask version: 2024.7.0
  • pandas version: 2.2.2
  • Operating System: Ubuntu 22.04.1 LTS (Jammy Jellyfish)

@sidneyzhu

@AlessioBolpagni98 have you fixed this issue?

@AlessioBolpagni98
Author

AlessioBolpagni98 commented Jul 17, 2024

My problem was that I was using the extract_features() function improperly: I was passing the same column to both the 'column_id' and 'column_sort' parameters.

This was my problematic function:

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute


def get_features(df_BTC):
    """Extract features using tsfresh; return a dataframe with the features."""
    df_BTC = df_BTC.reset_index(drop=False)
    params = {
        "timeseries_container": df_BTC,
        "column_sort": "Date",
        "column_id": "Date",  # BUG: same column used for both id and sort
    }

    extracted_features = extract_features(**params)
    impute(extracted_features)  # inplace

    cols_zero = []  # drop features with zero standard deviation
    for col in extracted_features.columns:
        if extracted_features[col].std() == 0:
            cols_zero.append(col)
    extracted_features_pulito = extracted_features.drop(columns=cols_zero)
    extracted_features_pulito["Date"] = df_BTC["Date"]

    return extracted_features_pulito

@AlessioBolpagni98
Author

To solve this, in my case all the rows must share the same ID, so I created a constant ID 'A' for all the rows.

@sidneyzhu

Thanks for your reply. In my case the extraction finishes with multiprocessing (n_jobs=8) in about 30 minutes, but it can't finish on a cluster of 8 workers on different machines.

@sidneyzhu

I fixed it by switching to dask_feature_extraction_on_chunk(); the ClusterDaskDistributor still failed with a lot of communication errors.

3 participants