Power-law histogram generations #1165

Merged
merged 5 commits into from
Feb 6, 2024

Conversation

brilee (Contributor) commented Feb 5, 2024

Partially fixes #1161 (by generating better distributions). If 95% of the values are NaN, this still produces lopsided bars that don't represent value ranges properly.

Screenshot 2024-02-05 at 1 25 28 PM
Screenshot 2024-02-05 at 1 58 34 PM
Screenshot 2024-02-05 at 1 58 38 PM
Screenshot 2024-02-05 at 1 58 47 PM
Screenshot 2024-02-05 at 1 58 56 PM

brilee (Contributor, Author) commented Feb 5, 2024

Screenshot 2024-02-05 at 2 02 35 PM

@brilee brilee requested a review from dsmilkov February 5, 2024 19:03
# Defined for numeric features.
min_val: Optional[Union[float, date, datetime]] = None
max_val: Optional[Union[float, date, datetime]] = None
value_samples: Optional[list[float]] = None # Used for approximating histogram bins

brilee (Contributor, Author):

This is a list of 100 floats; there's no need to transmit it to the client, but I was too lazy to figure out how to nullify it when serializing the field.
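
The binning code itself is not shown in this conversation, so the following is only a minimal sketch of how a small list of sampled values could be used to approximate histogram bin edges. It assumes quantile-based edges, which may differ from what the PR actually does; `approximate_bin_edges` and the synthetic Pareto data are hypothetical and exist only for illustration.

```python
import random


def approximate_bin_edges(samples, num_bins):
    """Return num_bins + 1 edges so each bin holds roughly the same number of samples.

    Hypothetical helper: quantile-based binning is an assumption, not the PR's method.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    # NaN is the only float that is not equal to itself, so this drops NaN values.
    ordered = sorted(v for v in samples if v == v)
    if not ordered:
        return []
    last = len(ordered) - 1
    edges = []
    for i in range(num_bins + 1):
        # Fractional index of the i-th quantile, linearly interpolated.
        pos = i * last / num_bins
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        edges.append(ordered[lo] + (ordered[hi] - ordered[lo]) * frac)
    return edges


# Synthetic heavy-tailed data standing in for the 100 sampled floats.
rng = random.Random(0)
value_samples = [rng.paretovariate(1.5) for _ in range(100)]
edges = approximate_bin_edges(value_samples, num_bins=10)
```

With heavy-tailed data, edges chosen this way are narrow where values are dense and wide in the tail, which avoids the single dominant bar that equal-width bins tend to produce.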

dsmilkov (Collaborator):

You can nullify it in get_stats in router_dataset.py right before returning the result, but no big deal.

brilee (Contributor, Author):

Oh, I figured out how to exclude it in pydantic.
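
The thread does not say which pydantic mechanism was used. One option that pydantic provides is `Field(exclude=True)`, which keeps a field on the model but omits it when the model is serialized. The sketch below assumes that approach; `StatsResult` is a hypothetical, simplified stand-in for the real stats model (the date and datetime variants of `min_val` and `max_val` are left out).

```python
from typing import Optional

from pydantic import BaseModel, Field


class StatsResult(BaseModel):
    """Hypothetical, simplified stats model for illustration only."""

    # Defined for numeric features.
    min_val: Optional[float] = None
    max_val: Optional[float] = None
    # Used for approximating histogram bins; excluded from serialized output.
    value_samples: Optional[list[float]] = Field(default=None, exclude=True)


stats = StatsResult(min_val=0.0, max_val=9.5, value_samples=[0.0, 1.2, 9.5])

# model_dump() is the pydantic v2 name; dict() is the v1 equivalent.
payload = stats.model_dump() if hasattr(stats, "model_dump") else stats.dict()
```

The samples remain available on the server-side object (`stats.value_samples`), while `payload` carries only `min_val` and `max_val`.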

brilee (Contributor, Author) commented Feb 5, 2024

Screenshot 2024-02-05 at 2 19 35 PM

Tested on some larger datasets for performance.

dsmilkov (Collaborator) left a comment

nice work!

lilac/data/dataset_duckdb.py (outdated, resolved)

@brilee brilee enabled auto-merge (squash) February 5, 2024 19:32
@brilee brilee disabled auto-merge February 5, 2024 19:32
@brilee brilee enabled auto-merge (squash) February 5, 2024 19:32
@brilee brilee merged commit e7f1915 into main Feb 6, 2024
4 checks passed
@brilee brilee deleted the histogram branch February 6, 2024 16:03
Development

Successfully merging this pull request may close these issues.

Histograms are not visually correct.
2 participants