Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement quantile binning #34

Merged
merged 1 commit into from
Feb 28, 2022
Merged

Implement quantile binning #34

merged 1 commit into from
Feb 28, 2022

Conversation

xukai92
Copy link

@xukai92 xukai92 commented Feb 22, 2022

Following #33 we decide to implement a separate binning based on the quantile directly.

@xukai92 xukai92 changed the title implement quantile binning Implement quantile binning Feb 22, 2022
Copy link
Contributor

@tawheeler tawheeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick PR!

n ≥ nbins || error("too many bins requested")

qs = range(0, 1; length=(nbins+1))
retval = quantile.(Ref(data), qs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retval = quantile(data, qs) might be better. If we broadcast, every call probably has to sort data. If we pass in one list and ask Statistics.pkg to give us the quantiles, it can properly reason about things under the hood to be more efficient.

qs = range(0, 1; length=(nbins+1))
retval = quantile.(Ref(data), qs)

isunique = diff(retval) .> 1e-8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the opposite of is-unique, but rather whether the bin edge is different. Can we do:

# trim bins that are too small
keep_bin = diff(retval) .> alg.max_bin_width

@@ -0,0 +1,15 @@
@test DiscretizeQuantile(2) == DiscretizeQuantile(2, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding tests!

It may not be covered by the other methods, but could you please test [1,1,1] and verify that we throw an error?

mask = vcat([true], isunique)
retval = retval[mask] # note this makes length(retval) < nbins
else
# Throw error as in other methods if there are non-unique bin egdes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: "egdes" -> "edges"


if alg.trim # trim bins that are too small
# Ref: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_discretization.py#L288
mask = vcat([true], isunique)
Copy link
Contributor

@tawheeler tawheeler Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do:

# Always keep the first bin edge
insert!(keep_bin, 1, true)
retval = retgval[keep_bin] # NOTE: this reduces our bin count

and then ensure that we have at least 2 edges remaining?


struct DiscretizeQuantile <: DiscretizationAlgorithm
nbins::Int
trim::Bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a simple comment?

trim::Bool # Whether to automatically remove extremely narrow bins

retval = retval[mask] # note this makes length(retval) < nbins
else
# Throw error as in other methods if there are non-unique bin egdes
any(.!(isunique)) && error("binedges non-unique")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any(.!(isunique)) allocates a new vector. How about !all(keep_bin)?

@@ -0,0 +1,32 @@
using Statistics: quantile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be added at the top-level, in Discretizers.jl?

@@ -20,6 +20,7 @@ export
# discretization algorithms
DiscretizeUniformWidth,
DiscretizeUniformCount,
DiscretizeQuantile,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect

@codecov-commenter
Copy link

Codecov Report

Merging #34 (004c012) into master (11a2c3e) will increase coverage by 0.10%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #34      +/-   ##
==========================================
+ Coverage   93.95%   94.05%   +0.10%     
==========================================
  Files           9       10       +1     
  Lines         877      892      +15     
==========================================
+ Hits          824      839      +15     
  Misses         53       53              
Flag Coverage Δ
unittests 94.05% <100.00%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/Discretizers.jl 100.00% <ø> (ø)
src/disc_quantile.jl 100.00% <100.00%> (ø)
src/linear_discretizer.jl 98.95% <0.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11a2c3e...004c012. Read the comment docs.

@tawheeler tawheeler merged commit 004c012 into sisl:master Feb 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants