
Problem with the function extract_features #1058

Open
AlessioBolpagni98 opened this issue Dec 28, 2023 · 7 comments
@AlessioBolpagni98

The problem:
I have a script that runs every day, and in it I use the tsfresh function extract_features(). Sometimes the script gets stuck inside the function, with the progress bar frozen at a certain percentage. The function doesn't raise any exception and the code remains blocked.
Packages (1).txt

  • Python version: 3.10.12
  • Operating System: Linux Ubuntu 22.04.3
  • tsfresh version: 0.20.1
  • Install method (conda, pip, source): pip
@nils-braun
Collaborator

Hi @AlessioBolpagni98 !
Is this deterministic, meaning it always gets stuck with the same data? Do you see a pattern in the data it gets stuck on?
And which feature calculators are you using?

@sidneyzhu

sidneyzhu commented Jul 16, 2024

I encountered a similar issue. My raw dataframe has 1k ids, 27k rows, and 140 features. Full feature extraction completes with MultiprocessingDistributor(n_workers=12) on a 64 GB machine within 30 minutes, but it always hangs with ClusterDaskDistributor on 4 nodes of 64 GB workers. I noticed that it hangs in the result-gathering step; after about 4 hours, the extract_features job gets killed out of memory.
My environment is:

  • Python version: 3.10.12
  • tsfresh version: 0.20.2
  • dask version: 2024.7.0
  • pandas version: 2.2.2
  • Operating System: Ubuntu 22.04.1 LTS (Jammy Jellyfish)

@sidneyzhu

@AlessioBolpagni98 have you fixed this issue?

@AlessioBolpagni98
Author

AlessioBolpagni98 commented Jul 17, 2024

My problem was that I was using the extract_features() function improperly: I was passing the same column to both the 'column_id' and 'column_sort' parameters.

This was my problematic function:

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute


def get_features(df_BTC):
    """Extract features using tsfresh; return a dataframe with the features."""
    df_BTC = df_BTC.reset_index(drop=False)
    params = {
        "timeseries_container": df_BTC,
        "column_sort": "Date",
        "column_id": "Date",  # BUG: same column used for both id and sort
    }

    extracted_features = extract_features(**params)
    impute(extracted_features)  # inplace

    cols_zero = []  # drop features with zero standard deviation
    for col in extracted_features.columns:
        if extracted_features[col].std() == 0:
            cols_zero.append(col)
    extracted_features_pulito = extracted_features.drop(columns=cols_zero)
    extracted_features_pulito["Date"] = df_BTC["Date"]

    return extracted_features_pulito

@AlessioBolpagni98
Author

To solve this, in my case all the rows must share the same ID, so I created a constant ID 'A' for all the rows.

@sidneyzhu

Thanks for your reply. In my case the extraction finishes with multiprocessing (n_jobs=8) in about 30 minutes, but it can't finish on a cluster of 8 workers on different machines.

@sidneyzhu

I fixed it by switching to dask_feature_extraction_on_chunk(); the ClusterDaskDistributor still failed with a lot of communication errors.

3 participants