Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Row Group Based Subtotals with .group_by() #18823

Open
Chretien opened this issue Sep 18, 2024 · 2 comments
Open

Row Group Based Subtotals with .group_by() #18823

Chretien opened this issue Sep 18, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@Chretien
Copy link

Description

Hey Polars team! Congratulations on the progress you've all made with Polars. It is my tool of choice for everything data-related, and I am a huge fan of the package. It just gets better and better each release.

Right now, I work as a finance analyst for a mid-size company, and I use Polars as often as possible. It does everything I need swiftly and cleanly - for the most part.

This issue is not unique to Polars, but I find myself writing a ton of code for something that I feel should be a simple feature (although I know that just because it should be simple doesn't mean it is). The issue is creating subtotals for row groups in a group_by() operation. Is this something that could be feasibly be implemented within the Polars API?

Here is some code that I used to successfully perform a subtotal by row group, and it is very long-winded. Also, I am sure it's not the neatest, best way to do it in the first place but it's how I got it figured out.

masterReport = pl.DataFrame()
 
for area in groupedReport.select('Area').unique().to_series().to_list():
    y = groupedReport.filter(pl.col('Area') == area)
    subtotalReport = pl.DataFrame([area,'Total',y.select('Revenue').sum().item(),
                          y.select('Payroll Amount').sum().item(), y.select('Hours Worked').sum().item(),
                          y.select('Workers').mean().item(),y.select('Labor % of Sales').mean().item(),
                          y.select('Sales per Worker').mean().item(), y.select('Sales per Day').mean().item(),
                          y.select('Sales per Day per Worker').mean().item(),
                          y.select('Bonus').sum().item()], orient='columns', strict=False).transpose()
    subtotalReport.columns = y.columns
    subtotalReport = subtotalReport.with_columns(pl.col(['Revenue','Payroll Amount','Hours Worked','Labor % of Sales','Sales per Worker','Sales per Day','Sales per Day per Worker','Bonus']).str.to_decimal().cast(pl.Float64),
                                                 pl.col('Workers').str.to_decimal().cast(pl.Float64).round(0).cast(pl.UInt32))
 
    display(y.vstack(subtotalReport))
 
    masterReport = pl.concat([y.vstack(subtotalReport),masterReport], how='diagonal_relaxed').sort('Area')

I think adding support for Excel-style pivot subtotals by group would set Polars apart even more. Getting this done in Pandas is also quite a process, and I am surprised it's not a standard feature in these packages, but that could just be my ignorance of what it would take to design this functionality. However, I would be over the moon if Polars had this capability.

By the way, I LOVE the great_tables and Altair integrations, spot on.

@Chretien Chretien added the enhancement New feature or an improvement of an existing feature label Sep 18, 2024
@cmdlineluser
Copy link
Contributor

Just another user here - trying to understand the problem. (a minimal runnable example with input/ouput would help)

It looks like the filtering is essentially emulating a regular group_by / agg?

(
    groupedReport
     .group_by('Area')
     .agg(
         pl.lit('Total').alias('Total'),
         pl.col('Revenue', 'Payroll Amount', 'Hours Worked').sum(),
         pl.col('Workers', 'Labor % of Sales', 'Sales per Worker', 'Sales per Day', 'Sales per Day per Worker').mean(),
         pl.col('Bonus').sum()
     )
)

And you want these rows combined back into groupedReport?

There was a previous request for pivot total stuff which may be relevant:

  1. Add contingency table/ frequency table functionality in .pivot method #16372

@Chretien
Copy link
Author

Just another user here - trying to understand the problem. (a minimal runnable example with input/ouput would help)

It looks like the filtering is essentially emulating a regular group_by / agg?

(
    groupedReport
     .group_by('Area')
     .agg(
         pl.lit('Total').alias('Total'),
         pl.col('Revenue', 'Payroll Amount', 'Hours Worked').sum(),
         pl.col('Workers', 'Labor % of Sales', 'Sales per Worker', 'Sales per Day', 'Sales per Day per Worker').mean(),
         pl.col('Bonus').sum()
     )
)

And you want these rows combined back into groupedReport?

There was a previous request for pivot total stuff which may be relevant:

  1. Add contingency table/ frequency table functionality in .pivot method #16372

So, in this instance, groupedReport is already an aggregated table grouped by Area. The reason I loop through it is to isolate each unique Area and then calculate sums for some columns, means for others, append the sums and means back into that isolated table, and then finally append this to a new data frame. This is for presentation purposes, as I will style this with the great_tables package and save it as a PNG to send off to management.

In something like Excel, you would just create a pivot table and enable subtotals for the desired effect. I did take a look at the request you linked, and that is exactly what I am referencing. However, I was disappointed to see that the Polars team doesn't believe a feature like this fits into the scope of the package, and I would have to vehemently disagree with that. Python is ubiquitous in the data sphere and is becoming necessary even for lowly analysts like me, and data presentation is a top priority.

I do have faith, however, that the team may choose to revisit the level of importance a feature like this could rank. I specifically remember reading an old Github request for allowing different encoding methods when reading CSV's where the Polars team said they would never implement that, and yet here we are with the ability to choose "unicode_escape" encoding (which I use DAILY as I work off of QuickBooks files).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants