Enhance Juriscraper to Support Bundling of Separate Opinions #883

Open
8 tasks
flooie opened this issue Jan 24, 2024 · 7 comments

@flooie
Contributor

flooie commented Jan 24, 2024

Issue Description:

A handful of courts publish separate opinions (for example, concurrences and dissents) as distinct entries in their opinion lists, which juriscraper and CourtListener (CL) do not currently support. Without a way to bundle them, case information can be scraped and processed incompletely or in fragments.

Suggested Enhancement:

I propose updating juriscraper to allow for the bundling of separate opinions. This enhancement would ensure that all opinions related to a case are collected and processed together, providing a more comprehensive view of the case proceedings and decisions.

Courts: (in progress list)

  • Connecticut
  • Conn. Court of Appeals
  • West Virginia
  • West Virginia Court of Appeals
  • Michigan Court of Appeals
  • Tennessee Supreme Court
  • Texas Supreme
  • Texas Court of Appeals
@mlissner
Member

To be clear here, what you're proposing is upgrading Juriscraper to return multiple opinion objects under one key, like we have with clusters/opinions in CL itself, right? Assuming so, can you provide a link or screenshot or something as an example?

@flooie
Contributor Author

flooie commented Jan 24, 2024

Yes - I was working this through in my head before laying out my vision.

@flooie
Contributor Author

flooie commented Jan 24, 2024

{'date': '2/14/2023', 'docket': 'SC20164', 'name': 'State v. Juan A. G.-P.', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20627', 'name': 'CT Freedom Alliance, LLC v. Dept. of Education', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20633', 'name': 'Devine v. Fusaro', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20679', 'name': 'Grant v. Commissioner of Correction', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '010combined'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '030concurrence'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '040dissent'}

{'date': '1/24/2023', 'docket': 'SC20597', 'name': 'Solon v. Slater', 'opinion_type': '010combined'}

{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '010combined'}
{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '030concurrence'}

I fixed and rewrote part of Connecticut to take advantage of the opinion_type changes. The records above are excerpts from self.cases.

We can take these results, call a method to combine the multiple opinions into clusters, and only slightly modify CL to save each opinion together with its cluster.
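The combining step described above could be sketched roughly as follows. This is a minimal illustration, not the method that was implemented: the helper name `bundle_opinions`, the grouping key (docket plus date), and the output keys `opinions` and `type` are all assumptions made for the example.

```python
# Flat records in the shape shown in the excerpts above.
cases = [
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "010combined"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "030concurrence"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "040dissent"},
    {"date": "1/24/2023", "docket": "SC20597", "name": "Solon v. Slater", "opinion_type": "010combined"},
]


def bundle_opinions(cases):
    """Group flat records into clusters keyed by (docket, date).

    Hypothetical helper: the output keys ("opinions", "type") are
    illustrative, not an existing juriscraper or CL interface.
    """
    clusters = {}  # dicts keep insertion order, so source order is preserved
    for case in cases:
        key = (case["docket"], case["date"])
        cluster = clusters.setdefault(
            key,
            {
                "docket": case["docket"],
                "date": case["date"],
                "name": case["name"],
                "opinions": [],
            },
        )
        cluster["opinions"].append({"type": case["opinion_type"]})
    return list(clusters.values())


clusters = bundle_opinions(cases)
```

Grouping on docket and date together is a conservative choice for the sketch; whether docket alone is enough depends on the court.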

@mlissner
Member

I'd expect this to mirror the fields in CL pretty closely. Why not do the joining in JS so that CL has a nice JSON object of clusters with nested opinions?

grossir added a commit to grossir/courtlistener that referenced this issue Jan 31, 2024
Supports new juriscraper scraper class and returned objects, and also keeps legacy interface

- Supports: freelawproject/juriscraper#883
- Supports: freelawproject/juriscraper#889
@grossir
Contributor

grossir commented Jan 31, 2024

I checked the changes required on Courtlistener to support this new paradigm, while still supporting the legacy scrapers. I found the following:

  • We can return a nested object, but we must keep a minimal interface (dict keys) for compatibility with cl_scrape_opinions tasks of dup checking
{
    "Docket": {...},
    "OpinionCluster": {...},
    "Opinion": {...},
    "case_names": "",  # for site.hash dup checking
    "download_urls": "",  # for sha1 checking of url content
    "precedential_statuses": "",  # for sha1 checking in case of nev
    "case_dates": "",  # for sorting and dup checking
}

Even if we return objects of the following shape, we would still have to return one item per opinion (because of dup checking), causing somewhat ugly duplication:

{
    "OpinionCluster": {
        "Opinion": [
            {...},
            {...}
        ]
    }
}
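To make the duplication concrete, here is a rough sketch of flattening one nested cluster into one item per opinion while keeping the minimal dup-checking keys. The function name `to_legacy_items` and the inner field names (`case_name`, `date_filed`, `precedential_status`, `download_url`) are assumptions for illustration, and the sample values and URLs are placeholders; this is not the code in the linked branch.

```python
def to_legacy_items(nested):
    """Emit one item per opinion, repeating the cluster-level data.

    Hypothetical helper; inner field names are assumed for this sketch.
    """
    cluster = nested["OpinionCluster"]
    items = []
    for opinion in cluster["Opinion"]:
        items.append(
            {
                "Docket": nested.get("Docket", {}),
                "OpinionCluster": cluster,  # same cluster repeated per item
                "Opinion": opinion,
                "case_names": cluster["case_name"],
                "download_urls": opinion["download_url"],
                "precedential_statuses": cluster["precedential_status"],
                "case_dates": cluster["date_filed"],
            }
        )
    return items


nested = {
    "Docket": {"docket_number": "SC20371"},
    "OpinionCluster": {
        "case_name": "State v. Brandon",
        "date_filed": "1/24/2023",
        "precedential_status": "Published",  # placeholder value
        "Opinion": [
            {"type": "010combined", "download_url": "https://example.com/a.pdf"},
            {"type": "030concurrence", "download_url": "https://example.com/b.pdf"},
            {"type": "040dissent", "download_url": "https://example.com/c.pdf"},
        ],
    },
}

items = to_legacy_items(nested)
```

Every item carries the same cluster and docket, and only the opinion and its URL differ, which is the duplication mentioned above.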

Here is a branch showing the changes needed in CL, which turned out to be rather small. This is still a concept and would have to be tested and improved.

https://github.com/freelawproject/courtlistener/compare/main...grossir:courtlistener:support_juriscraper_nested_objects?expand=1

@mlissner
Member

Gianfranco, it's very OK to change CL as part of this, if it means making the interface better while hitting our design requirements. I'd rather do this now and have something we like instead of being stuck with half measures. Does that change your thinking about approach?

@grossir
Contributor

grossir commented Mar 6, 2024

It took quite some time, but I have a draft that works in integration with Courtlistener (the CL side will be a separate, parallel PR).
First I will paste some nice screenshots, then I will dive into some problems and opportunities I found while working on this.

Results

I used tex as a working scraper to test the new class. As a useful example, we have this recent Supreme Court case, which has an OpinionCluster of 3 opinions.
This is how the cluster looks on my local docker env:
image
image
image

How it currently looks on Courtlistener
image

Also, the scraper captures search_originating_court_information
image
and extra columns for our usual objects

Implementation details

It's better to look at the code, even if there is still pending work. I have written comments extensively.

#952

On Courtlistener:
freelawproject/courtlistener#3864

Besides the "code" code review, I will need some "data" code review, to check whether I am using the nature_of_suit, cause, opinion.type, etc. fields properly.

Of note, I found a way to keep tests for the examples of secondary/deferred pages. For tex it was as simple as tweaking the href leading to the secondary page, so that it points to the precise example file.

Pending work

I still have a bunch of bugs to solve and tests to write for this to be mergeable

  • writing tests for the JSON Validator (I know it is currently not validating nested objects)
  • writing a custom type checker for python dates
  • support deferring attributes
  • adapting texcrimapp and texapp_* to the new tex class
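The first two pending items (validating nested objects and checking Python dates) could look something like the sketch below. It uses only the standard library and a made-up schema format, so it is not the JSON Validator from the PR; `check_types`, `SCHEMA`, and the field names are hypothetical.

```python
from datetime import date, datetime


def check_types(obj, schema, path="root"):
    """Return a list of error strings, recursing into dicts and lists.

    Schema format (assumed for this sketch): a dict maps keys to
    sub-schemas, a one-element list describes every array item, and a
    Python type is checked with isinstance.
    """
    errors = []
    if isinstance(schema, dict):
        if not isinstance(obj, dict):
            return [f"{path}: expected object"]
        for key, sub_schema in schema.items():
            if key not in obj:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(check_types(obj[key], sub_schema, f"{path}.{key}"))
    elif isinstance(schema, list):
        if not isinstance(obj, list):
            return [f"{path}: expected array"]
        for i, item in enumerate(obj):
            errors.extend(check_types(item, schema[0], f"{path}[{i}]"))
    elif schema is date:
        # datetime subclasses date; reject it so only plain dates pass
        if not isinstance(obj, date) or isinstance(obj, datetime):
            errors.append(f"{path}: expected datetime.date")
    elif not isinstance(obj, schema):
        errors.append(f"{path}: expected {schema.__name__}")
    return errors


SCHEMA = {"OpinionCluster": {"date_filed": date, "Opinion": [{"type": str}]}}

good = {"OpinionCluster": {"date_filed": date(2023, 1, 24), "Opinion": [{"type": "010combined"}]}}
bad = {"OpinionCluster": {"date_filed": "1/24/2023", "Opinion": [{"type": 40}]}}

good_errors = check_types(good, SCHEMA)
bad_errors = check_types(bad, SCHEMA)
```

Reporting the full path of each failure (for example `root.OpinionCluster.Opinion[0].type`) is what makes errors in nested objects easy to locate.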

Further work

There is a clear opportunity to scrape people_db objects, like Person, Party, Attorney, and to support them in cl_scrape_opinions. However, this would take more work and testing, since lookups for these objects have to be used.

Some bugs found on the way

Bugs on OpinionSite[Linear] integration with CL: Attributes that we can return but are never picked up in CL (defined on OpinionSite class)

            "dispositions",
            "causes",
            "divisions",
            "docket_attachment_numbers",
            "docket_document_numbers",
            "lower_courts",
            "lower_court_judges",
            "lower_court_numbers",

These are actually used on some sources, so we are not inserting data we do collect. For example, lower_courts is used in tenn, nev, ind, bap1, etc
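One rough way to see how much collected data is being dropped is to check, for a given scraper object, which of the attributes listed above actually hold values. The helper `dropped_attributes` and the stand-in class `ExampleSite` are hypothetical, and the sample values are made up.

```python
# Attribute names from the list above: returned by scrapers, not read by CL.
UNUSED_IN_CL = [
    "dispositions",
    "causes",
    "divisions",
    "docket_attachment_numbers",
    "docket_document_numbers",
    "lower_courts",
    "lower_court_judges",
    "lower_court_numbers",
]


def dropped_attributes(site):
    """Return the unused attribute names for which the site holds data."""
    return [name for name in UNUSED_IN_CL if getattr(site, name, None)]


class ExampleSite:
    """Stand-in for a scraper object; not a real juriscraper class."""

    def __init__(self):
        self.case_names = ["State v. Brandon"]
        self.lower_courts = ["Superior Court"]  # made-up sample value
        self.dispositions = []  # defined but empty, so nothing is lost


dropped = dropped_attributes(ExampleSite())
```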
