New statistic for counting bucketed features #33

Open
katmatson opened this issue May 29, 2024 · 0 comments · May be fixed by #35
I'm planning to create a new summary statistic that will assign vectors to buckets based on a property's value and count the number of vectors within each bucket. The motivating example is counting the number of large, medium, and small lakes, but it seems like this could be used in other contexts as well, and it cannot be done with pandas' built-in aggregation methods.

I believe that the required components for this are:

  1. Add the ability to specify this new statistic in the viz-staging config. Most parts of this can be defined with the config as it exists already--the 'aggregation_method' will need to be the name of the function in (2)--but there will need to be a way to define the bounds between the buckets. For this, I propose a dict mapping from each bucket's name to that bucket's upper bound. So, for example, to create a statistic counting the number of large (> 100 km^2), medium (> 10 km^2, <= 100 km^2), and small (<= 10 km^2) lakes from a dataset with an area property in km^2, the entry in the statistics dict would look like:
{
    "name": "lake_size",
    "weight_by": "count",
    "property": "area",
    "aggregation_method": "bucketed_count",
    "resampling_method": "sum",
    "buckets": {
        "small": 10,
        "medium": 100,
        "large": None
    },
    # val_range and palette should be specified as appropriate for display
}

This would create three columns for the summary--lake_size_small, lake_size_medium, and lake_size_large (combining the statistic name with each bucket's name)--each counting the number of lakes of the respective size within each tile.

  2. Create a method to perform this aggregation. This will be passed to pandas.DataFrame.agg. According to the documentation, this function must take a pandas.Series as input; it would also take the bucket's upper and lower bounds as inputs. A new entry would be added to the appropriate aggregation method dict for each bucket, wrapping this aggregation function in a lambda that sets the appropriate bounds. This means that, unlike the other statistics, there would be multiple columns created (one for each bucket) rather than only a single column. (A minimal sketch is given after this list.)

  3. Because this creates multiple columns (one for each bucket), viz-staging's get_resampling_methods will need a change to repeat the resampling method once per bucket for this statistic, rather than only once as it is for every other statistic.
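
To make (2) concrete, here is a minimal sketch of what the per-bucket aggregation could look like. The names bucketed_count and make_bucket_aggregations are hypothetical, and the exact wiring into viz-staging's aggregation-method dict is left open; the lambda-with-defaults trick is just one way to bind each bucket's bounds.

```python
import pandas as pd


def bucketed_count(series, lower=None, upper=None):
    """Count values in `series` that fall within (lower, upper].

    `lower=None` means no lower bound and `upper=None` means no upper bound,
    matching the open-ended "large" bucket in the config example above.
    """
    mask = pd.Series(True, index=series.index)
    if lower is not None:
        mask &= series > lower
    if upper is not None:
        mask &= series <= upper
    return int(mask.sum())


def make_bucket_aggregations(stat_name, buckets):
    """Expand one bucketed_count statistic into one aggregation per bucket.

    `buckets` is the {bucket_name: upper_bound} dict from the config; each
    bucket's lower bound is the previous bucket's upper bound. Returns a dict
    mapping output column names to aggregation callables.
    """
    aggs = {}
    lower = None
    for bucket_name, upper in buckets.items():
        column = f"{stat_name}_{bucket_name}"
        # bind the current bounds as defaults so each lambda keeps its own limits
        aggs[column] = lambda s, lo=lower, hi=upper: bucketed_count(s, lo, hi)
        lower = upper
    return aggs


# Example usage with the lake_size statistic from the config above
areas = pd.DataFrame({"area": [3.2, 55.0, 7.1, 250.0, 12.5]})
aggs = make_bucket_aggregations(
    "lake_size", {"small": 10, "medium": 100, "large": None}
)
summary = {col: fn(areas["area"]) for col, fn in aggs.items()}
# summary -> {'lake_size_small': 2, 'lake_size_medium': 2, 'lake_size_large': 1}
```

For (3), repeating the resampling method per bucket would then just mean emitting the configured method (e.g. "sum") once for each of the generated lake_size_* columns instead of once for the statistic.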

It will probably also make sense to add validation of the config so that buckets are defined if and only if the "bucketed_count" aggregation_method is used. Also, it shouldn't be much extra work to add support for an area-weighted version of this statistic (it would mostly just require a second method like the one in (2) that sums the area of each vector rather than counting the number of vectors), so while that isn't outlined here, it might be worth doing as part of these changes. A rough sketch of both ideas follows.
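
A rough sketch of what the validation and the area-weighted variant might look like; validate_bucket_config and bucketed_area_sum are illustrative names under the same assumptions as above, not existing viz-staging functions.

```python
import pandas as pd


def bucketed_area_sum(series, areas, lower=None, upper=None):
    """Sum vector areas (rather than counting features) for one bucket.

    `series` holds the property used for bucketing and `areas` holds each
    vector's area; for the lake example these would be the same column.
    """
    mask = pd.Series(True, index=series.index)
    if lower is not None:
        mask &= series > lower
    if upper is not None:
        mask &= series <= upper
    return float(areas[mask].sum())


def validate_bucket_config(stat):
    """Require 'buckets' exactly when the bucketed_count method is used."""
    is_bucketed = stat.get("aggregation_method") == "bucketed_count"
    has_buckets = "buckets" in stat
    if is_bucketed and not has_buckets:
        raise ValueError(
            f"statistic {stat['name']!r} uses bucketed_count but defines no buckets"
        )
    if has_buckets and not is_bucketed:
        raise ValueError(
            f"statistic {stat['name']!r} defines buckets but does not use bucketed_count"
        )
```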
