Row Group Based Subtotals with .group_by() #18823

Chretien · 2024-09-18T22:11:16Z

Description

Hey Polars team! Congratulations on the progress you've all made with Polars. It is my tool of choice for everything data-related, and I am a huge fan of the package. It just gets better and better each release.

Right now, I work as a finance analyst for a mid-size company, and I use Polars as often as possible. It does everything I need swiftly and cleanly - for the most part.

This issue is not unique to Polars, but I find myself writing a ton of code for something that I feel should be a simple feature (although I know that just because it should be simple doesn't mean it is). The issue is creating subtotals for row groups in a group_by() operation. Is this something that could be feasibly be implemented within the Polars API?

Here is some code that I used to successfully perform a subtotal by row group, and it is very long-winded. Also, I am sure it's not the neatest, best way to do it in the first place but it's how I got it figured out.

masterReport = pl.DataFrame()
 
for area in groupedReport.select('Area').unique().to_series().to_list():
    y = groupedReport.filter(pl.col('Area') == area)
    subtotalReport = pl.DataFrame([area,'Total',y.select('Revenue').sum().item(),
                          y.select('Payroll Amount').sum().item(), y.select('Hours Worked').sum().item(),
                          y.select('Workers').mean().item(),y.select('Labor % of Sales').mean().item(),
                          y.select('Sales per Worker').mean().item(), y.select('Sales per Day').mean().item(),
                          y.select('Sales per Day per Worker').mean().item(),
                          y.select('Bonus').sum().item()], orient='columns', strict=False).transpose()
    subtotalReport.columns = y.columns
    subtotalReport = subtotalReport.with_columns(pl.col(['Revenue','Payroll Amount','Hours Worked','Labor % of Sales','Sales per Worker','Sales per Day','Sales per Day per Worker','Bonus']).str.to_decimal().cast(pl.Float64),
                                                 pl.col('Workers').str.to_decimal().cast(pl.Float64).round(0).cast(pl.UInt32))
 
    display(y.vstack(subtotalReport))
 
    masterReport = pl.concat([y.vstack(subtotalReport),masterReport], how='diagonal_relaxed').sort('Area')

I think adding support for Excel-style pivot subtotals by group would set Polars apart even more. Getting this done in Pandas is also quite a process, and I am surprised it's not a standard feature in these packages, but that could just be my ignorance of what it would take to design this functionality. However, I would be over the moon if Polars had this capability.

By the way, I LOVE the great_tables and Altair integrations, spot on.

cmdlineluser · 2024-09-19T08:50:20Z

Just another user here - trying to understand the problem. (a minimal runnable example with input/ouput would help)

It looks like the filtering is essentially emulating a regular group_by / agg?

(
    groupedReport
     .group_by('Area')
     .agg(
         pl.lit('Total').alias('Total'),
         pl.col('Revenue', 'Payroll Amount', 'Hours Worked').sum(),
         pl.col('Workers', 'Labor % of Sales', 'Sales per Worker', 'Sales per Day', 'Sales per Day per Worker').mean(),
         pl.col('Bonus').sum()
     )
)

And you want these rows combined back into groupedReport?

There was a previous request for pivot total stuff which may be relevant:

Add contingency table/ frequency table functionality in .pivot method #16372

Chretien · 2024-09-19T14:31:06Z

Just another user here - trying to understand the problem. (a minimal runnable example with input/ouput would help)

It looks like the filtering is essentially emulating a regular group_by / agg?
(
    groupedReport
     .group_by('Area')
     .agg(
         pl.lit('Total').alias('Total'),
         pl.col('Revenue', 'Payroll Amount', 'Hours Worked').sum(),
         pl.col('Workers', 'Labor % of Sales', 'Sales per Worker', 'Sales per Day', 'Sales per Day per Worker').mean(),
         pl.col('Bonus').sum()
     )
)
And you want these rows combined back into groupedReport?

There was a previous request for pivot total stuff which may be relevant:

Add contingency table/ frequency table functionality in .pivot method #16372

So, in this instance, groupedReport is already an aggregated table grouped by Area. The reason I loop through it is to isolate each unique Area and then calculate sums for some columns, means for others, append the sums and means back into that isolated table, and then finally append this to a new data frame. This is for presentation purposes, as I will style this with the great_tables package and save it as a PNG to send off to management.

In something like Excel, you would just create a pivot table and enable subtotals for the desired effect. I did take a look at the request you linked, and that is exactly what I am referencing. However, I was disappointed to see that the Polars team doesn't believe a feature like this fits into the scope of the package, and I would have to vehemently disagree with that. Python is ubiquitous in the data sphere and is becoming necessary even for lowly analysts like me, and data presentation is a top priority.

I do have faith, however, that the team may choose to revisit the level of importance a feature like this could rank. I specifically remember reading an old Github request for allowing different encoding methods when reading CSV's where the Polars team said they would never implement that, and yet here we are with the ability to choose "unicode_escape" encoding (which I use DAILY as I work off of QuickBooks files).

Chretien added the enhancement New feature or an improvement of an existing feature label Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Row Group Based Subtotals with .group_by() #18823

Row Group Based Subtotals with .group_by() #18823

Chretien commented Sep 18, 2024

cmdlineluser commented Sep 19, 2024

Chretien commented Sep 19, 2024

Row Group Based Subtotals with .group_by() #18823

Row Group Based Subtotals with .group_by() #18823

Comments

Chretien commented Sep 18, 2024

Description

cmdlineluser commented Sep 19, 2024

Chretien commented Sep 19, 2024