Support weighted quantiles in cut #423

Merged
merged 15 commits into from
May 21, 2025

nalimilan
Member

This requires adding an extension point for StatsBase. Unfortunately, more copies of the data and weights are made than necessary, as StatsBase supports neither an in-place weighted `quantile!` on pre-sorted data nor taking a view of a weights vector (JuliaStats/StatsBase.jl#723).

Supersedes #209. On top of #422.
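
To make the idea of a weighted quantile concrete, here is a minimal Python sketch. It uses one common definition (the smallest value whose cumulative weight reaches the fraction `p` of the total weight), which is an assumption made for illustration and is not necessarily the definition StatsBase implements. The function name `weighted_quantile_lower` is hypothetical. Like the approach described above, the sketch sorts a copy of the data together with its weights.

```python
def weighted_quantile_lower(values, weights, p):
    """Smallest value whose cumulative weight is >= p * total weight.

    Illustrative definition only (an assumption, not StatsBase's API).
    """
    # Sort a copy of the (value, weight) pairs by value.
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    target = p * total
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= target:
            return value
    # Only reached if rounding keeps the running sum just below the target.
    return pairs[-1][0]


# With equal weights this reduces to an unweighted lower quantile;
# a heavy weight on the largest value pulls the median toward it.
equal = weighted_quantile_lower([1, 2, 3, 4], [1, 1, 1, 1], 0.5)
skewed = weighted_quantile_lower([1, 2, 3, 4], [1, 1, 1, 5], 0.5)
```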

`Statistics.quantile` returns values which are not the most appropriate
to generate labels. It is more intuitive to choose values from the actual data,
which are likely to have fewer decimals and make more sense for users.

Unfortunately, since we use intervals closed on the left, we cannot use
any of the seven standard definitions of quantiles. Type 1 is the closest,
but we have to take the next larger value in the data as the cutpoint,
so that the type 1 quantile itself is not included in the next quantile
group. This gives group assignments essentially consistent with
R's `Hmisc::cut2` or
`cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)`,
though with different cutpoints in labels.
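
The cutpoint rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the last interval is assumed to be closed on both sides, and when no larger value exists the maximum is reused as the break.

```python
from bisect import bisect_right


def left_closed_breaks(values, ngroups):
    """Breaks for `ngroups` left-closed quantile groups (hypothetical helper)."""
    xs = sorted(values)
    n = len(xs)
    breaks = [xs[0]]
    for k in range(1, ngroups):
        # Type 1 quantile at p = k/ngroups: the ceil(n * p)-th smallest value.
        # Integer arithmetic avoids floating point surprises in the ceiling.
        j = -(-n * k // ngroups)
        q = xs[j - 1]
        # Use the next larger data value as the break, so that q stays in
        # group k when intervals are closed on the left.
        i = bisect_right(xs, q)
        breaks.append(xs[i] if i < n else xs[-1])
    breaks.append(xs[-1])
    return breaks


def assign_groups(values, breaks):
    """0-based group index for intervals [b_i, b_{i+1}), last one closed."""
    last = len(breaks) - 2
    return [min(bisect_right(breaks, v) - 1, last) for v in values]


data = list(range(1, 11))
breaks = left_closed_breaks(data, 4)
groups = assign_groups(data, breaks)
```

For the integers 1 to 10 and four groups, the type 1 quantiles are 3, 5 and 8, so the breaks become 4, 6 and 9. The resulting group sizes (3, 2, 3, 2) match what right-closed intervals on the type 1 quantiles would give.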

1) In most cases the quantile number isn't needed in the label,
and it is shown anyway when printing an ordered `CategoricalValue`.
Only use it by default when `allowempty=true` to avoid data-dependent
errors if there are duplicate levels.

2) Round breaks by default to a number of significant digits chosen by
`sigdigits`. This number is increased if necessary for breaks to remain unique.
The resulting labels are not completely correct, as rounding may make
the left break greater than a value which is included in the interval,
but this is generally minor and expected. Taking the floor rather than
rounding would be more correct, but it can generate unexpected labels
due to floating point trickiness (e.g. `floor(0.0003, sigdigits=4)`
gives 0.0002999). Rounding is what R does.
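
The rounding rule in point 2 can be sketched in Python as below. The helper names and the `max_sigdigits` cap are assumptions made for illustration; the package's actual code may differ.

```python
from math import floor, log10


def round_sig(x, sigdigits):
    """Round x to `sigdigits` significant digits."""
    if x == 0:
        return 0.0
    exponent = floor(log10(abs(x)))
    return round(x, sigdigits - 1 - exponent)


def round_breaks(breaks, sigdigits=3, max_sigdigits=17):
    """Round breaks, adding digits until distinct breaks stay distinct.

    Returns the rounded breaks and the number of digits actually used.
    `max_sigdigits` is a hypothetical safety cap, not a package parameter.
    """
    distinct = len(set(breaks))
    for digits in range(sigdigits, max_sigdigits + 1):
        rounded = [round_sig(b, digits) for b in breaks]
        if len(set(rounded)) == distinct:
            return rounded, digits
    # Fall back to the unrounded breaks if no digit count keeps them unique.
    return list(breaks), max_sigdigits


rounded, used = round_breaks([1.2341, 1.2349, 2.5], sigdigits=3)
```

Here three significant digits would collapse 1.2341 and 1.2349 into the same break, so four digits are used instead. The second break is then displayed as 1.235, which is greater than 1.2349 even though that value belongs to the interval: this is the minor inaccuracy mentioned above.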

Add a deprecation to avoid breaking custom `labels` functions which did
not accept `sigdigits`.
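
As a rough analogue of such a deprecation path, the Python sketch below calls a user-supplied `labels` function with `sigdigits` when it accepts that keyword, and otherwise warns and calls it the old way. The simplified signature `labels(from_, to, i)` and the name `call_labels` are assumptions for illustration; the package relies on Julia's own mechanisms rather than on this code.

```python
import inspect
import warnings


def call_labels(labels, from_, to, i, *, sigdigits):
    """Call `labels`, passing `sigdigits` only if the function accepts it."""
    params = inspect.signature(labels).parameters
    accepts_sigdigits = "sigdigits" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_sigdigits:
        return labels(from_, to, i, sigdigits=sigdigits)
    # Old-style function: keep it working, but tell the user to update it.
    warnings.warn(
        "`labels` functions should accept a `sigdigits` keyword argument",
        DeprecationWarning,
        stacklevel=2,
    )
    return labels(from_, to, i)
```

Old functions keep producing their labels unchanged, while new ones can use `sigdigits` to format the breaks.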

bkamins (Member) left a comment

Looks good

Base automatically changed from nl/cutlabels to master May 21, 2025 16:59
@nalimilan nalimilan merged commit 11d43c1 into master May 21, 2025
9 checks passed
@nalimilan nalimilan deleted the nl/cutweights branch May 21, 2025 17:35