feat: Baseline Data Cleaning #840

Closed
zogomii opened this issue Jun 14, 2024 · 1 comment
Labels
wontfix This will not be worked on

Comments

@zogomii
Contributor

zogomii commented Jun 14, 2024

Is your feature request related to a problem?

Subtask of #710

Desired solution

Create a method Baseline._clean(table: Table, target_column: str) -> TabularDataset for baseline data cleaning. It should:

  1. Remove columns with high idness or high stability (either above 90%), excluding the target column
  2. Remove columns with a high missing value ratio (above 60%)
  3. Impute missing values in all remaining columns using the column with the highest absolute correlation
  4. One-hot encode all non-numerical columns with fewer than 20 distinct values and remove all other non-numerical columns
  5. Remove outliers
  6. Normalise columns with values greater than 100
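
For illustration only, here is a minimal sketch of steps 1, 2 and 4 written against pandas rather than the library's own Table API. Everything in it is an assumption, not the proposed implementation: the function names (idness, stability, clean_sketch) are hypothetical, idness is taken to be the share of distinct values among all rows, stability the share of the most frequent value among non-missing values, and the target column is assumed to be exempt from every step. Steps 3, 5 and 6 are left out because the issue does not specify the imputation model, the outlier criterion, or the normalisation method.

```python
import pandas as pd


def idness(col: pd.Series) -> float:
    """Share of distinct values among all rows (assumed definition)."""
    return col.nunique(dropna=False) / len(col) if len(col) else 0.0


def stability(col: pd.Series) -> float:
    """Share of the most frequent value among non-missing values (assumed definition)."""
    non_missing = col.dropna()
    if non_missing.empty:
        return 0.0
    return non_missing.value_counts().iloc[0] / len(non_missing)


def clean_sketch(df: pd.DataFrame, target_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for steps 1, 2 and 4 of the proposed cleaning."""
    features = [c for c in df.columns if c != target_column]

    # Step 1: drop feature columns whose idness or stability exceeds 90%.
    drop = [c for c in features if idness(df[c]) > 0.9 or stability(df[c]) > 0.9]
    # Step 2: drop feature columns with more than 60% missing values.
    drop += [c for c in features if c not in drop and df[c].isna().mean() > 0.6]
    df = df.drop(columns=drop)

    # Step 4: one-hot encode non-numerical feature columns with fewer than
    # 20 distinct values and drop all other non-numerical feature columns.
    non_numeric = [
        c for c in df.columns
        if c != target_column and not pd.api.types.is_numeric_dtype(df[c])
    ]
    encode = [c for c in non_numeric if df[c].nunique() < 20]
    df = df.drop(columns=[c for c in non_numeric if c not in encode])
    return pd.get_dummies(df, columns=encode)
```

Calling clean_sketch(df, "y") on a frame with an ID-like column, a constant column, and a mostly empty column would drop all three and expand any remaining low-cardinality text column into indicator columns.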

Possible alternatives (optional)

No response

Screenshots (optional)

No response

Additional Context (optional)

No response

@lars-reimann
Member

AutoML is currently out of scope.

@lars-reimann lars-reimann closed this as not planned Jan 15, 2025
@github-project-automation github-project-automation bot moved this from Backlog to ✔️ Done in Library Jan 15, 2025
@lars-reimann lars-reimann added the wontfix This will not be worked on label Jan 15, 2025
Projects
Status: ✔️ Done
Development

No branches or pull requests

2 participants