feat(pacer.email): improve bankruptcy short description parsing #1276
base: main
Solves #912, Solves #914
- simplify parsing by getting rid of cases by court groups
- support multi docket NEF parsing: add examples for deb, ctb, mdb, ndb, nhb, paeb, txnb
- support flsb
- correct wrong parsing for vaeb and mdb after double checking on PACER
- updated paeb_1 example file where creation of example file had broken parsing
1369764 to ecfc625
Wow, you really simplified the code while adding a bunch more test cases. Love it.
How much have you checked vs. the pacer history reports to make sure that your output is correct?
I left a screenshot of the "History/Documents" report on most JSON files to explain the changes. I noticed I need to improve the parsing of a couple of them.
@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "Hearing Held"
+"short_description": "Hearing Held (BK)"
juriscraper/pacer/email.py
Outdated
# Deletes:
# - extra docket number 'components', such as `federal_dn_judge_initials_assigned`
# - Chapter component
On deleting "Chapter", see the nhb1 example.
The email's subject is
Subject:21-10245-BAH Chapter 13 Mykle Lepene Affidavit of Compliance with Discharge Requirements
If we didn't delete the "Chapter" component, the short_description would be
"Chapter 13 Affidavit of Compliance with Discharge Requirements"
but it's different on PACER.
@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "CHAP - Hearing Continued"
+"short_description": "CHAP - Hearing Continued (Bk Other)"
@@ -16,7 +16,7 @@
 "pacer_doc_id": "188040985133",
 "pacer_magic_num": "49963627",
 "pacer_seq_no": "101",
-"short_description": "Notice of Dismissal"
+"short_description": "Notice of Dismissal - CerDocTyp"
@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "Docket Order - Continue Hearing (Auto) Ch 13"
+"short_description": "Docket Order - Continue Hearing (Auto)"
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
"pacer_doc_id": "050057572723",
"pacer_magic_num": "58443666",
"pacer_seq_no": "384",
"short_description": "UST Form 11"
@@ -16,7 +16,7 @@
 "pacer_doc_id": "092051714783",
 "pacer_magic_num": "11582823",
 "pacer_seq_no": "409",
-"short_description": "Order on Motion To Quash"
+"short_description": "Order on Motion To Quash - CH"
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
@@ -4,7 +4,7 @@
 "court_id": "paeb",
 "dockets": [
 {
-"case_name": "Br Cun",
+"case_name": "Brittany Cunningham",
Editing the example source had messed up the parsing of this case
Great. Thanks Gianfranco. Assigning to Alberto for review.
I am missing one more commit; I have been checking the old example files too and doing some minor updates.
Keep "Chapter ..." string in paeb and pamb courts
for more information, see https://pre-commit.ci
"pacer_doc_id": "154018316665",
"pacer_magic_num": "88545762",
"pacer_seq_no": "31",
"short_description": "BNC Certificate of Notice (341 Meeting Notice (Chapter 13))"
"pacer_doc_id": "152032523670",
"pacer_magic_num": "29282863",
"pacer_seq_no": "78",
"short_description": "Chapter 13 Plan"
@albertisfu this is now ready for review
Thanks @grossir. I reviewed the refactor, and it looks good to me.
I just left a comment and a couple of concerns after finding some examples where the conditions for deciding when to keep or remove some suffixes and "Chapter" could be problematic.
return ""
# Some courts have subjects like
# `Multiple Cases "{docket} {case name} Close Adversary Case"`
if "close adversary case" in subject.lower():
I agree with this assumption. I looked for short descriptions that contain `Close Adversary Case`, and it seems safe to return just `Close Adversary Case` in these cases. However, I think we should also consider its quoted version, `"Close Adversary Case"`, as seen in `txnb_multi_1.json`. The short description should simply be `Close adversary case`, as I confirmed in its docket history report.
# these are usually 3 letters long. However, we want to keep some special acronyms
# such as MOR (Merchant of Record?)
# - "NEF: " placeholder
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
I found a couple of examples where removing suffixes outside the exceptions listed here can be problematic.
For instance, in this case the subject contains:
Order -Non-motion related-
But it's being parsed as:
Order -Non related-
Or, in the same case:
Motion to Withdraw/Dismiss Document -bk-
I couldn't find an email example for this one, but I modified a subject and tested its parsing. The result was:
Motion to Withdraw/Dismiss Document
So it seems difficult to determine when these suffixes should be removed and which ones should remain as exceptions. Perhaps we should just add exceptions as we encounter them?
This next one seems easier; maybe keeping suffixes that appear inside parentheses would work?
ORDER to continue/reschedule hearing (BNC-PDF)
Currently the description is being parsed as:
ORDER to continue/reschedule hearing (BNC )
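To make the behavior described in this comment easier to reproduce, here is a minimal sketch that applies the `component_regex` from the diff to the subjects quoted above. The helper names `strip_components` and `strip_components_outside_parens` are hypothetical, and replacing each match with a space and then collapsing whitespace is an assumption chosen because it reproduces the `(BNC )` output; the real parser in `juriscraper/pacer/email.py` may do more (for example, trailing-dash cleanup is not modelled here).

```python
import re

# Pattern copied verbatim from the diff under review.
COMPONENT_REGEX = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"

# Hypothetical: matches one non-nested parenthesised group.
PAREN_GROUP = re.compile(r"(\([^)]*\))")


def strip_components(text: str) -> str:
    """Hypothetical helper: replace every match with a space, then collapse whitespace."""
    return " ".join(re.sub(COMPONENT_REGEX, " ", text).split())


def strip_components_outside_parens(text: str) -> str:
    """Hypothetical variant: leave anything inside parentheses untouched."""
    parts = PAREN_GROUP.split(text)
    # With a single capturing group, odd indexes hold the parenthesised groups.
    cleaned = [
        part if index % 2 else re.sub(COMPONENT_REGEX, " ", part)
        for index, part in enumerate(parts)
    ]
    return " ".join("".join(cleaned).split())


if __name__ == "__main__":
    for subject in (
        "Order -Non-motion related-",
        "ORDER to continue/reschedule hearing (BNC-PDF)",
    ):
        print(repr(strip_components(subject)))
        print(repr(strip_components_outside_parens(subject)))
```

The parenthesis-aware variant keeps `(BNC-PDF)` intact, but it does nothing for `-Non-motion related-`, so it would only address the second kind of example raised in the comment.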
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
if self.court_id in ["paeb", "pamb"]:
    # keeps the "Chapter ..." description on the short description
    chapter_regex = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"
Without looking too hard, I was able to find a couple of examples where the removal of `Chapter...` is not compatible.
For instance, this one:
cacb_2.txt
You can see that in the docket history report, many entries retain "Chapter" in their short descriptions.
Another example comes from `pawb`:
It seems difficult to determine when `Chapter...` should be removed. I'm also not sure whether this behavior is consistent across courts. If it should sometimes be removed and sometimes not within the same court, that's a problem unless we can find a different pattern to decide. However, if it is consistent, we should look for more courts where it should be retained and add them to the condition.
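As a rough aid for checking more courts, the sketch below reports which part of a short description the `paeb`/`pamb` `chapter_regex` from the diff would hit. The helper name `matched_chapter_text` is hypothetical, and the excerpt does not show how the parser actually applies the pattern, so this only exercises the regex itself against descriptions quoted in this thread.

```python
import re
from typing import Optional

# Pattern copied verbatim from the paeb/pamb branch of the diff under review.
CHAPTER_REGEX = re.compile(r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)")


def matched_chapter_text(description: str) -> Optional[str]:
    """Hypothetical helper: return the text the pattern matches, or None."""
    match = CHAPTER_REGEX.search(description)
    return match.group(0) if match else None


if __name__ == "__main__":
    for description in (
        "Docket Order - Continue Hearing (Auto) Ch 13",
        "Order on Motion To Quash - CH",
        "Chapter 13 Plan",
        "BNC Certificate of Notice (341 Meeting Notice (Chapter 13))",
    ):
        print(repr(description), "->", repr(matched_chapter_text(description)))
```

Under this pattern only the abbreviated forms (`Ch 13`, a trailing `CH`) match; a spelled-out `Chapter 13` does not, which fits the stated goal of keeping the "Chapter ..." text for these two courts.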