Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: handle multiple disputes #88

Merged

Conversation

bramrodenburg
Copy link
Contributor

@bramrodenburg bramrodenburg commented Sep 9, 2024

Please provide your name and company

Bram - Gigs

Link the issue/feature request which this PR is meant to address

#87

Detail what changes this PR introduces and how this addresses the issue/feature request linked above.

Although it happens rarely (see here), a payment can be disputed more than once. The existing balance_transactions model cannot handle this scenario. In this PR, that is fixed by aggregating the dispute_[id/reason/amount] before joining the dispute data back.

How did you validate the changes introduced within this PR?

Followed the steps here and built the dbt_stripe package against our data.

Which warehouse did you use to develop these changes?

BigQuery

Did you update the CHANGELOG?

  • Yes

Did you update the dbt_project.yml files with the version upgrade (please leverage standard semantic versioning)? (In both your main project and integration_tests)

  • Yes

Typically there are additional maintenance changes required before this will be ready for an upcoming release. Are you comfortable with the Fivetran team making a few commits directly to your branch?

  • Yes
  • No

If you had to summarize this PR in an emoji, which would it be?

🥸

Feedback

We are so excited you decided to contribute to the Fivetran community dbt package! We continue to work to improve the packages and would greatly appreciate your feedback on our existing dbt packages or what you'd like to see next.

PR Template

* MagicBot/documentation-updates

* Apply suggestions from code review

* Apply suggestions from code review

---------

Co-authored-by: Jamie Rodriguez <[email protected]>
Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bramrodenburg thanks for opening this PR to address the issue you raised! This definitely is a change we would like to fold into the broader package; however, I do have a few questions:

  • Please see my question below regarding the use of array_agg and if we should consider string_agg?
  • Would you be able to create a CHANGELOG.md entry for these changes? Since this will be changing field names, we will need to make this a breaking change (v0.15.0).

Comment on lines 84 to 85
array_agg(dispute_id) as dispute_ids,
array_agg(dispute_reason) as dispute_reasons,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these aggs is there a reason you used the array_agg function instead of string_agg? Reason I ask is because in BigQuery and Databricks especially I know the array_agg function can change the nature of the table. Do you find using array_agg to be easier to leverage in queries down the road? Would there be any concerns with using string_agg instead, or would that impact the usability of these fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for using array_agg is that I find it easier to use the data downstream (e.g. if I need to join back to dispute using the dispute_id). If I would use string_agg, I would first have to split the string again. However, this is a "fictional" use case - I am not doing this now :)

I am perfectly fine with using string_agg if that's the preferred way of doing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced array_agg with string_agg

@bramrodenburg
Copy link
Contributor Author

Would you be able to create a CHANGELOG.md entry for these changes? Since this will be changing field names, we will need to make this a breaking change (v0.15.0).

Makes sense. Will do.

source_relation,
string_agg(dispute_id) as dispute_ids,
string_agg(dispute_reason) as dispute_reasons,
sum(dispute_amount) as dispute_amount

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the same amount can get disputed more than once, I think it makes more sense to also string_agg the dispute_amount. Here's an example where I have a balance transaction where the full amount has been disputed twice. Summing the dispute_amount in this case will erroneously double the amount of the transaction.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsnorthrup thanks so much for chiming in and sharing this level of insight. We definitely would not want to have a scenario where we could possibly be double counting the amount.

It looks like the two disputes you shared have a different "status" where it seems only one was "won". Since there is a "status" field in the source table (via the ERD), do you think it would be appropriate to retain the sum, but only sum disputes that are of status='won'?
image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would work for when the status is "won", but what about if it hasn't been won yet? Will the dispute_amount be 0? In that case I think the column needs to be renamed to dispute_amount_won to indicate it's not showing dispute amounts in other statuses.

If you go that route, it might be good to add another column dispute_amounts that still has a string_agg of each individual dispute amount to match dispute_ids and dispute_reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For us, we are not only interested in disputes won, but also lost, under_review etc. If we would go down the route of adding a dispute_amount_won, then I feel we should also add similar columns for the other statuses. There are currently 7 different statuses in Stripe for disputes (see here). So you would get something like this:

  • dispute_amount_won
  • dispute_amount_lost
  • dispute_amount_under_review
  • dispute_amount_needs_response
  • dispute_amount_warning_closed
  • dispute_amount_warning_under_review
  • dispute_amount_warning_needs_response

If we wouldn't do this, then the user of the table needs to join back to the dispute table to get this information, which kind of renders having dispute amounts in this table obsolete. (I actually naively took the sum in PR, because we are actually already doing this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wouldn't do this, then the user of the table needs to join back to the dispute table to get this information, which kind of renders having dispute amounts in this table obsolete.

@bramrodenburg I agree with this idea. I also worry if we go the string_agg route for the dispute information then it won't provide much analytical value. Whereas, if we take the approach in creating distinct aggregate sums for each status, these values can be used more easily for any downstream analysis. Since we can confirm from Stripe's documentation that there are only 7 statuses, I'd request we take the approach of having distinct dispute aggregate records for each status like you suggest.

My only remaining question here - Is it possible for Stripe to provide multiple entries for a dispute status that could still result in us double counting? For example, could there be some reason the same dispute amount has two won records in the underlying table? If not, then I'm comfortable with the above approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jsnorthrup - In that scenario, would you want to only look at the most recent dispute amount for won and lost disputes (or any other statuses)? So in the above screenshot, if you won Dispute 2, the model would only include the second $175 in the dispute_amount_won sum.

If so, I think this would be possible by doing something like row_number() over (partition by charge_id, dispute_status, source_relation order by dispute_created_at desc) = 1 or dispute_status not in ('won', 'lost') as is_dispute_included and filtering based on that

Otherwise, yes, perhaps joining the balance_transactions model and with dispute downstream is the most sensible option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bramrodenburg @jsnorthrup I suppose my bigger question is how do you two think about Disputes? What is the bottom line you'd like to analyze or see surfaced?

Also, are there cases were a singe transaction's disputes can have different dollar amounts? So like in the above example, what if the second dispute only pertained to some items in the charge and therefore had a dispute value $100 instead of $175? Assuming you won Dispute 2, would you want the dispute_won_amount to reflect the max amount or most recent amount (or both, which would lead me to think that just joining in dispute downstream may make more sense)?

Copy link

@jsnorthrup jsnorthrup Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple scenarios are possible. If different parts of the total are disputed, the amounts could be additive. If the total amount is disputed multiple times, you'd only want the max.

I think doing the latest might be a good compromise. Remember, this is a very rare occurrence. I have exactly one instance of multiple disputes on the same transaction. If we have the dispute columns only refer to the latest dispute (in which case I recommend prefixing with latest_), it will represent the total disputed amount in the vast majority of cases, and in those rare cases where there are multiple disputes, we could have a dispute_count field to signal that you should reference the dispute table for more information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that! @bramrodenburg if this compromise sounds good to you, I'll go ahead and merge your PR into my own working branch and add this extra code to stripe__balance_transactions, as it's a bit involved.

Specifically, I'll:

  • Add code to select the latest dispute in each status to account for the rare (but possible) case of multiple disputes per transaction: row_number() over (partition by charge_id, dispute_status, source_relation order by dispute_created_at desc) = 1 as is_latest_dispute.
  • This filter will be leveraged to create the following dispute_amount fields split by status:
    • latest_dispute_amount_won
    • latest_dispute_amount_lost
    • latest_dispute_amount_under_review
    • latest_dispute_amount_needs_response
    • latest_dispute_amount_warning_closed
    • latest_dispute_amount_warning_under_review
    • latest_dispute_amount_warning_needs_response
  1. Add a count(distinct dispute_id) as dispute_count field that indicates whether there are multiple disputes for a transaction. If so, users should then join in the dispute table for more details

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. @jsnorthrup is completely right that there can be multiple scenarios.

The suggested approach sounds great. Thanks for merging this into your working branch @fivetran-jamie

@fivetran-jamie fivetran-jamie changed the base branch from main to bugfix/handle-multiple-disputes September 27, 2024 23:24
@fivetran-jamie fivetran-jamie dismissed fivetran-joemarkiewicz’s stale review September 27, 2024 23:28

Joe's going on PTO and I'm making additional changes to this in a working branch before merging to main

@fivetran-jamie fivetran-jamie changed the base branch from bugfix/handle-multiple-disputes to releases/v0.15.latest September 27, 2024 23:32
@fivetran-jamie fivetran-jamie changed the base branch from releases/v0.15.latest to release/v0.14.0 September 28, 2024 00:01
@fivetran-jamie fivetran-jamie merged commit 78ee45e into fivetran:release/v0.14.0 Sep 28, 2024
fivetran-jamie added a commit that referenced this pull request Oct 3, 2024
…nsactions (#92)

* fix!: handle multiple disputes (#88)

* Documentation Standard Updates (#85)

* MagicBot/documentation-updates

* Apply suggestions from code review

* Apply suggestions from code review

---------

Co-authored-by: Jamie Rodriguez <[email protected]>

* fix!: handle multiple disputes

* replace array_agg with string_agg

* update changelog

* bump version

---------

Co-authored-by: Joe Markiewicz <[email protected]>
Co-authored-by: Jamie Rodriguez <[email protected]>

* changes

* aws and changelog

* validations

* Update models/stripe__balance_transactions.sql

Co-authored-by: Renee Li <[email protected]>

* renee feedback

* docs

* update package ref

* changelog

---------

Co-authored-by: bramrodenburg <[email protected]>
Co-authored-by: Joe Markiewicz <[email protected]>
Co-authored-by: Renee Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants