New statistic for counting bucketed features #33

Open
katmatson opened this issue May 29, 2024 · 0 comments · May be fixed by #35
I'm planning to create a new summary statistic that will assign vectors to buckets based on a property's value and count the number of vectors within each bucket. The motivating example is counting the number of large, medium, and small lakes, but it seems like this could be used in other contexts as well, and it cannot be done with pandas' built-in aggregation methods.

I believe that the required components for this are:

  1. Add the ability to specify this new statistic in the viz-staging config. Most parts of this can be defined with the config as it exists already--the 'aggregation_method' will need to be the name of the function in (2)--but there will need to be a way to define the bounds between the buckets. For this, I propose a dict mapping from each bucket's name to that bucket's upper bound. So, for example, to create a statistic counting the number of large (> 100 km^2), medium (> 10 km^2, <= 100 km^2), and small (<= 10 km^2) lakes from a dataset with an area property in km^2, the entry in the statistics dict would look like:
{
    "name": "lake_size",
    "weight_by": "count",
    "property": "area",
    "aggregation_method": "bucketed_count",
    "resampling_method": "sum",
    "buckets": {
        "small": 10,
        "medium": 100,
        "large": None
    },
    # val_range and palette should be specified as appropriate for display
}

This would create three columns for the summary--lake_size_small, lake_size_medium, and lake_size_large (combining the statistic name with each bucket's name)--each counting the number of lakes of the respective size within each tile.

  2. Create a method to perform this aggregation. This will be passed to pandas.DataFrame.agg. According to the documentation, this function must take a pandas.Series as input; it would also take the bucket's upper and lower bounds as inputs. A new entry would be added to the appropriate aggregation method dict for each bucket, wrapping this aggregation function in a lambda that sets the appropriate bounds. This means that, unlike the other statistics, there would be multiple columns created (one for each bucket) rather than only a single column. (A minimal sketch is given after this list.)

  3. Because this creates multiple columns (one for each bucket), viz-staging's get_resampling_methods will need a change to repeat the resampling method once per bucket for this statistic, rather than only once as it is for every other statistic.
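
To make (2) concrete, here is a minimal sketch of what the per-bucket aggregation could look like. The names bucketed_count and make_bucket_aggregations are hypothetical, and the exact wiring into viz-staging's aggregation-method dict is left open; the lambda-with-defaults trick is just one way to bind each bucket's bounds.

```python
import pandas as pd


def bucketed_count(series, lower=None, upper=None):
    """Count values in `series` that fall within (lower, upper].

    `lower=None` means no lower bound and `upper=None` means no upper bound,
    matching the open-ended "large" bucket in the config example above.
    """
    mask = pd.Series(True, index=series.index)
    if lower is not None:
        mask &= series > lower
    if upper is not None:
        mask &= series <= upper
    return int(mask.sum())


def make_bucket_aggregations(stat_name, buckets):
    """Expand one bucketed_count statistic into one aggregation per bucket.

    `buckets` is the {bucket_name: upper_bound} dict from the config; each
    bucket's lower bound is the previous bucket's upper bound. Returns a dict
    mapping output column names to aggregation callables.
    """
    aggs = {}
    lower = None
    for bucket_name, upper in buckets.items():
        column = f"{stat_name}_{bucket_name}"
        # bind the current bounds as defaults so each lambda keeps its own limits
        aggs[column] = lambda s, lo=lower, hi=upper: bucketed_count(s, lo, hi)
        lower = upper
    return aggs


# Example usage with the lake_size statistic from the config above
areas = pd.DataFrame({"area": [3.2, 55.0, 7.1, 250.0, 12.5]})
aggs = make_bucket_aggregations(
    "lake_size", {"small": 10, "medium": 100, "large": None}
)
summary = {col: fn(areas["area"]) for col, fn in aggs.items()}
# summary -> {'lake_size_small': 2, 'lake_size_medium': 2, 'lake_size_large': 1}
```

For (3), repeating the resampling method per bucket would then just mean emitting the configured method (e.g. "sum") once for each of the generated lake_size_* columns instead of once for the statistic.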

It will probably also make sense to add validation of the config so that buckets are defined if and only if the "bucketed_count" aggregation_method is used. Also, it shouldn't be much extra work to add support for an area-weighted version of this statistic (it would mostly just require a second method like the one in (2) that sums the area of each vector rather than counting the number of vectors), so while that isn't outlined here, it might be worth doing as part of these changes. A rough sketch of both ideas follows.
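
A rough sketch of what the validation and the area-weighted variant might look like; validate_bucket_config and bucketed_area_sum are illustrative names under the same assumptions as above, not existing viz-staging functions.

```python
import pandas as pd


def bucketed_area_sum(series, areas, lower=None, upper=None):
    """Sum vector areas (rather than counting features) for one bucket.

    `series` holds the property used for bucketing and `areas` holds each
    vector's area; for the lake example these would be the same column.
    """
    mask = pd.Series(True, index=series.index)
    if lower is not None:
        mask &= series > lower
    if upper is not None:
        mask &= series <= upper
    return float(areas[mask].sum())


def validate_bucket_config(stat):
    """Require 'buckets' exactly when the bucketed_count method is used."""
    is_bucketed = stat.get("aggregation_method") == "bucketed_count"
    has_buckets = "buckets" in stat
    if is_bucketed and not has_buckets:
        raise ValueError(
            f"statistic {stat['name']!r} uses bucketed_count but defines no buckets"
        )
    if has_buckets and not is_bucketed:
        raise ValueError(
            f"statistic {stat['name']!r} defines buckets but does not use bucketed_count"
        )
```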
