Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate http client from requests to httpx async client #739

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ttys0dev
Copy link
Contributor

@ttys0dev ttys0dev commented Oct 6, 2023

No description provided.

@ttys0dev ttys0dev force-pushed the async branch 7 times, most recently from ce26428 to 8dd8337 Compare October 7, 2023 01:05
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting PR. I gave it a look through and it looks fine as far as I can tell, but I worry about how this will affect callers of this library. Won't they all need to be rewritten whever this version goes live?

@ttys0dev
Copy link
Contributor Author

Won't they all need to be rewritten whever this version goes live?

Yes, callers will need to be tweaked to make async calls but that should be fairly straight forwards.

@mlissner
Copy link
Member

Hm, I was afraid of that, this library is used by others as well. If this is an important change to make, we'll want to do a full version bump, or if there's a way to keep it backwards compatible, we'll want to do that.

If we do do the version bump, is there a way for others using the library to upgrade without going full async? We should write something in the changelog about that too before releasing this.

@ttys0dev
Copy link
Contributor Author

if there's a way to keep it backwards compatible, we'll want to do that.

Not really a good way to do this without making things a lot more complex.

If we do do the version bump, is there a way for others using the library to upgrade without going full async?

Yeah, I mean we can generally use async_to_sync wrappers or similar from calling code.

@mlissner
Copy link
Member

So, for many years I've occasionally seen libraries that use tornado or other async frameworks, and whenever I did, I ran away as quickly as I could, knowing that it was a whole universe of bugs I didn't understand or what to learn about.

I'm fully on board now that async is the right architecture for this library, but I'm afraid that if we make this library async-first, that'll have a similar affect on anybody that is considering using it, even if we document the heck out of it. I know as recently as a year ago, I'd have run away from this library as soon as I saw the word async.

So I'm not sure what to do with this PR. Async is the right thing to do architecturally, but the wrong thing to do for the community, at least for now.

A couple thoughts:

  • We fork Juriscraper, and the async version gets a new name. Anybody that wants to maintain the old version is free to do so.
  • We do a major version bump and warn prolifically to those that might upgrade. They can do so, or not, as they deem fit.
  • We don't do this and stick with the wrong approach for a few more years until async is common in the Python world (if ever?)

I'm also aware of the fact that Juriscraper has a number of things about it that are really bad:

  • The FdSys module is obsolete and needs to be removed (the website is scraped is gone).
  • The modular approach to the state scrapers is painful to anybody that has to learn it, and we're in an awkward transition to the linear scrapers from the regular ones.
  • Many of the state scrapers are broken.
  • Async makes more sense.
  • Maybe moving the PACER libraries out to their own tool would increase their popularity
  • Etc.

I can't help but think that the thing to do is abandon ship and start breaking things out. First, we do async PACER in a new library then we do states, one at a time, in a greenfield approach, then we see if there's anything else we care about that's left behind.

@johnhawkinson
Copy link
Contributor

I feel like it was alluded to above, but what's wrong with synchronous wrappers around asynchronous core functions?

  • Maybe moving the PACER libraries out to their own tool would increase their popularity

    I can't help but think that the thing to do is abandon ship and start breaking things out. First, we do async PACER in a new library

I do not think that the PACER/CMECF code is unpopular because it is "buried."
When I needed to deal with CM/ECF parsing for a personal (well) project, I wrote my own parser because I disliked juriscraper's architecture and…I also I suspected it was easier to do so than to wrangle juriscraper's parser to my requirements.

@mlissner
Copy link
Member

What's wrong with synchronous wrappers around asynchronous core functions?

Well, it's just kind of ugly? Everybody that uses Juriscaper would have to sprinkle little async_to_sync wrappers throughout their code all the time, and it just feels like one of those things that makes you think, "Hm, maybe this library isn't for me?" And as a coder that grew up on Python, as soon as I see the word "async", I run.

also I suspected it was easier to do so than to wrangle juriscraper's parser to my requirements.

Sure, so we can fix that too with a clean break.

@johnhawkinson
Copy link
Contributor

What's wrong with synchronous wrappers around asynchronous core functions?

Well, it's just kind of ugly? Everybody that uses Juriscaper would have to sprinkle little async_to_sync wrappers throughout their code all the time,

Sorry, no. I mean we retain our existing interface and make that interface a syncrhonous wrapper around the asynchronous core interface. No changes for juriscraper callers or users, unless they want to use async functions, in which case they have to call the async version.

@mlissner
Copy link
Member

Oh, so you're proposing the django thing of every method having an a___ method. For example, get becomes aget, filter becomes afilter?

@ttys0dev
Copy link
Contributor Author

So, for many years I've occasionally seen libraries that use tornado or other async frameworks, and whenever I did, I ran away as quickly as I could, knowing that it was a whole universe of bugs I didn't understand or what to learn about.

Well native asyncio is a bit different than the many older 3rd party library variants since it's built in to python.

We do a major version bump and warn prolifically to those that might upgrade. They can do so, or not, as they deem fit.

I think this probably the best approach.

And as a coder that grew up on Python, as soon as I see the word "async", I run.

Was the opposite for me generally, non-async functions doing network-IO always seemed to be potential performance foot-gun to be avoided.

I can't help but think that the thing to do is abandon ship and start breaking things out. First, we do async PACER in a new library then we do states, one at a time, in a greenfield approach, then we see if there's anything else we care about that's left behind.

IMO for something like this incremental refactoring is probably a better option.

I mean we retain our existing interface and make that interface a syncrhonous wrapper around the asynchronous core interface. No changes for juriscraper callers or users, unless they want to use async functions, in which case they have to call the async version.

I think it's better to have calling code handle the synchronous wrappers and not implement it in juriscraper directly, we don't really want to make assumptions here about say the asgiref library being used as the wrapper as other sync wrapping techniques may make more sense in non-django/non-asgi based applications.

@mlissner
Copy link
Member

What I like about forking is that it allows the old Juriscraper to continue living on (in infamy?), whereas if we don't fork, it's pretty hard to do bug fixes or anything like that for folks that are on the old version. This feels big enough to me that a fork makes sense. It's also probably time to fix all these many issues, and make a more focused PACER-only library.

we don't really want to make assumptions here about say the asgiref library being used

Are there many others and would they be incompatible? Django made it's decision, I guess, but it feels like if Django can do a-prefixed async functions alongside the regular ones, we should be able to also. (But if we fork, this doesn't matter so much.)

@ttys0dev
Copy link
Contributor Author

if we don't fork, it's pretty hard to do bug fixes or anything like that for folks that are on the old version

I mean, you can create a legacy branch if there's any actual interest in a sync version still(which is kind of an unknown at this point).

Are there many others and would they be incompatible?

Well there's stuff like asyncio.run_coroutine_threadsafe.

Django made it's decision, I guess, but it feels like if Django can do a-prefixed async functions alongside the regular ones, we should be able to also. (But if we fork, this doesn't matter so much.)

Well django internally is more sync oriented still than async and tends to shim things in the reverse direction(ie sync native functions shimmed to have async semantics using thread pools), while this would be native async being shimmed to sync so it's not quite the same.

@mlissner
Copy link
Member

You're focused on the PACER part, right? Why not do a async-native PACER fork in a fresh library? It shouldn't be too terribly hard and it'd be useful for others. We can start there, then do something for the state scrapers if we want to, in a separate approach/library/version/etc. But if we look just at PACER, I think that shouldn't be too terribly hard?

@ttys0dev
Copy link
Contributor Author

You're focused on the PACER part, right?

Well anything doing network IO I was trying to convert.

Why not do a async-native PACER fork in a fresh library?

Maintaining two codebases that basically do the same thing seems to be a lot of extra work for no significant benefit(unless there really are a lot of applications that depend on juriscraper that can't be modified to handle an async API easially).

@mlissner
Copy link
Member

I don't propose we maintain the code here anymore. We would say that it's old and that we're focused on ajuriscraper or whatever we wind up calling it, and that we recommend people use that instead.

@ttys0dev
Copy link
Contributor Author

I don't propose we maintain the code here anymore. We would say that it's old and that we're focused on ajuriscraper or whatever we wind up calling it, and that we recommend people use that instead.

I mean, if the code here wouldn't be maintained then what's the point of creating a separate project?

@mlissner
Copy link
Member

That would let us untether parts of Juriscraper. So PACER stuff goes into a new library and the PACER stuff here goes into a slow death spiral. Meanwhile, the rest of Juriscraper can continue advancing. It also gives a strong signal of "This is something new and different," and lets us think in greenfield kinds of ways.

@ttys0dev
Copy link
Contributor Author

That would let us untether parts of Juriscraper.

It's a bit unclear to me what we want to untether exactly, but I'm not seeing an advantage to creating a separate juriscraper project for that.

So PACER stuff goes into a new library and the PACER stuff here goes into a slow death spiral.

I don't really see how that would be better than just refactoring things here as needed, then we don't have two implementations doing effectively the same thing.

It also gives a strong signal of "This is something new and different," and lets us think in greenfield kinds of ways.

If there's two different versions of juriscraper that have significant overlapping functionality I think it becomes less clear which one projects should be using vs a single version. Also large flag day upgrades tend to be difficult and risky in general so they often are avoided vs say more incremental migrations/refactoring over time.

@mlissner
Copy link
Member

I think the biggest advantage is that when you Google for "pacer python" you'd get a package called pypacer or something, instead of Juriscraper as result four or five. So part of it is what you might call branding.

The second advantage is just making the package smaller in general. If all you want to do is scrape PACER, you get a lean library just for that. If you want to fix a bug in the PACER library, you can do that and run tests more quickly, without worrying about the rest, etc. I much prefer that as a library user, rather than big ones that do a lot more than I need. For example, if we do this, it gets a lot easier to document the library, than if we want to document all of Juriscraper and its various modules.

The third thing is that it'd allow us to open our minds to bigger changes to the API (go async, clean up a few other things), because we can change things without worrying about breaking other people's systems or demanding that they sprinkle async_to_sync-like wappers throughout their code.

@ttys0dev
Copy link
Contributor Author

I think the biggest advantage is that when you Google for "pacer python" you'd get a package called pypacer or something, instead of Juriscraper as result four or five. So part of it is what you might call branding.

I mean, juriscraper is the top active python pacer scraping library in google, the others are either completely abandoned/unmaintained or unrelated to pacer. Splitting things to a separate library may make things worse there.

If you want to fix a bug in the PACER library, you can do that and run tests more quickly, without worrying about the rest, etc.

You can run a subset of tests already, for example like this:

python -m unittest -v tests.local.test_DocketParseTest.DocketParseTest

I much prefer that as a library user, rather than big ones that do a lot more than I need. For example, if we do this, it gets a lot easier to document the library, than if we want to document all of Juriscraper and its various modules.

As long as the library is modular where you can pull in on the functionality you need I don't really think this is a big issue, and having some things in the same library does have benefits in regards to simplifying code sharing/maintenance.

The third thing is that it'd allow us to open our minds to bigger changes to the API (go async, clean up a few other things), because we can change things without worrying about breaking other people's systems or demanding that they sprinkle async_to_sync-like wappers throughout their code.

I still don't see a good reason for forking the library unless there's a longterm plan to maintain both versions which is kinda a PITA.

By the way I tried to find any active projects using juriscraper other than courtlistener but didn't see any so this seems likely to be only a theoretical issue to some degree.

I'd say we should refactor/clean things up here and if anyone complains then see what migration/compatibility strategy appears to best address their concerns at that point. Trying to prematurely anticipate if changing something like going async will cause major downstream issues is kinda hard without knowing specifics regarding the downstream users use case/architecture.

@mlissner
Copy link
Member

Discussed via Slack, and @ttys0dev wins. I'll give this another look early next week.

@mlissner
Copy link
Member

Man, time flies, but it'd be good to return to this, if you're interested, @ttys0dev. The new challenge we have is that a lot of work is being done on Juriscraper now by our new developer, @grossir and by @flooie, so we'd need to have a branch with this that stays up to date with their work until we cut this version, and then we'd quickly need to have a PR in CL that's ready to use the new version.

I think I'd propose the following, but I'd love input from the three of you:

  1. We get this branch good again and get it through the review process again.
  2. We write detailed release notes (and add them to this PR).
  3. @ttys0dev, you keep this branch updated with main (you're good at that).
  4. @ttys0dev, you do a PR for CL to use this branch, and it can even use a revision number from this branch to make tests pass.
  5. We cut Juriscraper 2.0.0.
  6. We cut over from the revision number to version 2.0.0 in CL.

An optional step is to repeat this process for each module of Juriscraper, so we do PACER first, then opinions, then oral arguments, etc, but I'm not sure it's necessary. It would be safer, but it'd also wind up with Juriscaper 2.0, 2.1, 2.2, etc (i.e., lots of breaking releases).

Would the approach above be safe, incremental, and efficient to get this big change done?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants