feat(pacer.email): improve bankruptcy short description parsing #1276

Open
wants to merge 4 commits into
base: main

grossir
Contributor

@grossir grossir commented Dec 17, 2024

Solves #912, Solves #914

  • simplify parsing by getting rid of cases by court groups
  • support multi docket NEF parsing: add examples for deb, ctb, mdb, ndb, nhb, paeb, txnb
  • support flsb
  • correct wrong parsing for vaeb and mdb after double checking on PACER
  • updated paeb_1 example file where creation of example file had broken parsing

Member

@mlissner mlissner left a comment

Wow, you really simplified the code while adding a bunch more test cases. Love it.

How much have you checked vs. the pacer history reports to make sure that your output is correct?

Contributor Author

@grossir grossir left a comment

I left a screenshot of the "History/Documents" report on most JSON files to explain the changes. I noticed that I need to improve the parsing of a couple of them.

@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "Hearing Held"
+"short_description": "Hearing Held (BK)"
Contributor Author

Document number 56 has short description "Hearing Held (BK)"
image


# Deletes:
# - extra docket number 'components', such as `federal_dn_judge_initials_assigned`
# - Chapter component
Contributor Author

On deleting "Chapter", see the nhb1 example.
The email's subject is:
Subject:21-10245-BAH Chapter 13 Mykle Lepene Affidavit of Compliance with Discharge Requirements
If we didn't delete the "Chapter" component, the short_description would be
"Chapter 13 Affidavit of Compliance with Discharge Requirements",
but it is different on PACER:
image

@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "CHAP - Hearing Continued"
+"short_description": "CHAP - Hearing Continued (Bk Other)"
Contributor Author

On keeping "CHAP" and "(Bk Other)"
image

@@ -16,7 +16,7 @@
 "pacer_doc_id": "188040985133",
 "pacer_magic_num": "49963627",
 "pacer_seq_no": "101",
-"short_description": "Notice of Dismissal"
+"short_description": "Notice of Dismissal - CerDocTyp"
Contributor Author

On keeping "CerDocTyp"
image

@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "Docket Order - Continue Hearing (Auto) Ch 13"
+"short_description": "Docket Order - Continue Hearing (Auto)"
Contributor Author

On deleting "Ch 13"

image

"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
Contributor Author

Subject in the email is
"Multiple Cases "Close Adversary Case" - AP -"

It should be reduced to "Close Adversary Case". It doesn't follow the usual cleanup process, which is why I am using the early return.
image

"pacer_doc_id": "050057572723",
"pacer_magic_num": "58443666",
"pacer_seq_no": "384",
"short_description": "UST Form 11"
Contributor Author

This one should be "UST Form 11-MOR"
image

@@ -16,7 +16,7 @@
 "pacer_doc_id": "092051714783",
 "pacer_magic_num": "11582823",
 "pacer_seq_no": "409",
-"short_description": "Order on Motion To Quash"
+"short_description": "Order on Motion To Quash - CH"
Contributor Author

This one shouldn't have the trailing " - CH"
image

"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
Contributor Author

This one is OK
image

@@ -4,7 +4,7 @@
 "court_id": "paeb",
 "dockets": [
 {
-"case_name": "Br Cun",
+"case_name": "Brittany Cunningham",
Contributor Author

Editing the example source had messed up the parsing of this case

@mlissner mlissner assigned albertisfu and unassigned mlissner Dec 18, 2024
@mlissner mlissner requested a review from albertisfu December 18, 2024 17:59
@mlissner
Member

Great. Thanks Gianfranco. Assigning to Alberto for review.

@grossir
Contributor Author

grossir commented Dec 18, 2024

I am missing one more commit; I have been checking the old example files too and doing some minor updates

grossir and others added 2 commits December 18, 2024 15:08
"pacer_doc_id": "154018316665",
"pacer_magic_num": "88545762",
"pacer_seq_no": "31",
"short_description": "BNC Certificate of Notice (341 Meeting Notice (Chapter 13))"
Contributor Author

We keep the "Chapter ..." string

Email subject:
Subject:Ch-13 1:24-bk-00941-HWV -Charlene R. House BNC Certificate of Notice (341 Meeting Notice (Chapter 13))

Document history report:
image

"pacer_doc_id": "152032523670",
"pacer_magic_num": "29282863",
"pacer_seq_no": "78",
"short_description": "Chapter 13 Plan"
Contributor Author

We keep the "Chapter..." string

Email subject:
Subject:Ch-13 23-13878-pmm Chapter 13 Plan - Kervince Markenzy

History report
image

@grossir
Contributor Author

grossir commented Dec 18, 2024

@albertisfu this is now ready for review

Contributor

@albertisfu albertisfu left a comment

Thanks @grossir. I reviewed the refactor, and it looks good to me.
I left one comment and a couple of concerns: I found some examples where the conditions for deciding when to keep or remove certain suffixes and the "Chapter" component could be problematic.

return ""
# Some courts have subjects like
# `Multiple Cases "{docket} {case name} Close Adversary Case"`
if "close adversary case" in subject.lower():
Contributor

I agree with this assumption. I looked for short descriptions that contain Close Adversary Case, and it seems safe to return just Close Adversary Case in these cases. However, I think we should also consider its quoted version, "Close Adversary Case", as seen in txnb_multi_1.json. The short description should simply be Close adversary case, as I confirmed in its docket history report.
Screenshot 2024-12-23 at 6 51 17 p m
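
For reference, the early return under discussion can be sketched roughly as below. This is a minimal illustration and not the PR's code: the function name `early_short_description` is hypothetical, and only the substring check is taken from the diff above. The returned casing follows the example JSON files quoted earlier in this conversation.

```python
from typing import Optional


def early_short_description(subject: str) -> Optional[str]:
    """Return a fixed description for "Close Adversary Case" subjects.

    Hypothetical helper: only the substring check mirrors the diff.
    Returns None when the subject should go through the regular cleanup.
    """
    if "close adversary case" in subject.lower():
        return "Close Adversary Case"
    return None


# Subject quoted earlier in this conversation
print(early_short_description('Multiple Cases "Close Adversary Case" - AP -'))
```

Because the check is a case-insensitive substring match, it fires on the quoted form as well; whatever surrounds the phrase (quotes, docket numbers, the trailing "- AP -") is discarded by returning the fixed string.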

# these are usually 3 letters long. However, we want to keep some special acronyms
# such as MOR (Merchant of Record?)
# - "NEF: " placeholder
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
Contributor

I found a couple of examples where removing suffixes outside the exceptions listed here can be problematic:

For instance, this case where the subject contains:
Order -Non-motion related-

But it’s being parsed as:
Order -Non related-

pawd_1.txt

Screenshot 2024-12-23 at 7 26 51 p m

Or in the same case:
Motion to Withdraw/Dismiss Document -bk-

I couldn’t find an email example for this one, but I modified a subject and tested its parsing. The result was:
Motion to Withdraw/Dismiss Document

Screenshot 2024-12-23 at 7 33 25 p m

So it also seems difficult to determine when these suffixes should be removed and which ones should remain as exceptions. Perhaps we should just add exceptions as we encounter them?
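
To make the two cases above easy to reproduce, here is a small sketch that lists what `component_regex` matches in a subject. The pattern is copied from the diff; the helper name `matched_components` is hypothetical, and the sketch only shows the matches, not the rest of the parser's cleanup.

```python
import re
from typing import List

# Pattern copied from the diff above
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"


def matched_components(subject: str) -> List[str]:
    """List every substring of `subject` that the pattern matches."""
    return [m.group(0) for m in re.finditer(component_regex, subject)]


for subject in (
    "Order -Non-motion related-",
    "Motion to Withdraw/Dismiss Document -bk-",
    "UST Form 11-MOR",
):
    print(repr(subject), "->", matched_components(subject))
```

On these inputs the pattern picks up `-motion` and `-bk`, while the `(?!-MOR)` lookahead leaves `-MOR` alone. `-Non` survives only because it mixes upper and lower case, which neither alternative accepts.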

This one seems easier: maybe keeping suffixes that appear inside parentheses would work?

cacb.txt
Screenshot 2024-12-23 at 7 48 06 p m

ORDER to continue/reschedule hearing (BNC-PDF)

Currently the description is being parsed as:
ORDER to continue/reschedule hearing (BNC )
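
One possible way to try the parentheses idea is sketched below. It is an assumption-laden illustration, not something from the PR: `strip_outside_parens` is a hypothetical helper, it deletes matches outright rather than reproducing the parser's actual replacement and whitespace handling, and it only protects non-nested parenthesized groups.

```python
import re

# Pattern copied from the diff above
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"

# Try a parenthesized group first; if it matches, the component
# alternatives never get a chance to match inside it.
paren_or_component = re.compile(r"(?P<paren>\([^()]*\))|" + component_regex)


def strip_outside_parens(subject: str) -> str:
    """Remove component matches, leaving parenthesized text untouched."""

    def keep_parens(match: "re.Match[str]") -> str:
        paren = match.group("paren")
        # Parenthesized text is emitted unchanged; anything else is dropped
        return paren if paren is not None else ""

    return paren_or_component.sub(keep_parens, subject)


print(strip_outside_parens("ORDER to continue/reschedule hearing (BNC-PDF)"))
# Hypothetical subject, to show that matches outside parentheses still go
print(strip_outside_parens("NEF: Order (BNC-PDF)"))
```

The design choice here is to let one combined pattern consume parenthesized spans before the suffix alternatives are tried, so no second pass is needed to restore them.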

component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
if self.court_id in ["paeb", "pamb"]:
    # keeps the "Chapter ..." description on the short description
    chapter_regex = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"
Contributor

Without looking very hard, I found a couple of examples where removing "Chapter..." does not match what PACER shows.

For instance this one:
cacb_2.txt
Screenshot 2024-12-23 at 7 10 00 p m

You can see that in the Docket history report, many entries retain "Chapter" in their short descriptions.

Another example comes from pawb:

Screenshot 2024-12-23 at 7 11 59 p m

It seems difficult to determine when "Chapter..." should be removed. I'm also not sure whether the behavior is consistent across courts. If it is sometimes removed and sometimes kept within the same court, that's a problem unless we can find a different pattern to decide. However, if it is consistent per court, we should look for more courts where it should be retained and add them to the condition.
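
To see what the paeb/pamb pattern actually touches, the sketch below lists its matches on strings quoted in this conversation. The pattern is copied from the diff; the helper name `matched_chapter_parts` is hypothetical, and the sketch assumes that whatever the pattern matches is what gets stripped.

```python
import re
from typing import List

# Pattern copied from the paeb/pamb branch in the diff above
chapter_regex = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"


def matched_chapter_parts(text: str) -> List[str]:
    """List every substring of `text` that the pattern matches."""
    return [m.group(0) for m in re.finditer(chapter_regex, text)]


for text in (
    "Ch-13 23-13878-pmm Chapter 13 Plan - Kervince Markenzy",
    "Docket Order - Continue Hearing (Auto) Ch 13",
    "Order on Motion To Quash - CH",
    "CHAP - Hearing Continued (Bk Other)",
):
    print(repr(text), "->", matched_chapter_parts(text))
```

The pattern only recognizes the abbreviated forms (`Ch-13`, `Ch 13`, a trailing `CH`), so the spelled-out "Chapter 13" and the "CHAP" prefix are left in place. Extending the court list in the condition would apply this same narrow behavior to any court added there.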

@albertisfu albertisfu assigned grossir and unassigned albertisfu Dec 24, 2024