
Send std::net:http requests #7742

Merged · 11 commits into master from 7727-http-worker · Sep 18, 2024
Conversation

Contributor

@scotttrinh scotttrinh commented Sep 12, 2024

TODO

  • Get batches of pending std::net::http::ScheduledRequest objects
  • Create httpx request task
  • Run this work queue on server
  • Add end-to-end tests with test HTTP server

Member

@fantix fantix left a comment


Overall this looks good!

response = await http_send(request)
request_state = 'Completed'
response_status = response.status_code
response_body = response.content
Member

Suggested change
response_body = response.content
response_body = await response.read()

content may not always be present; we should use read() here. That leads to an interesting question: if we fail only while reading the response body, should we still store the partial response we got?

Contributor Author

read isn't awaitable? It seems like it returns bytes here. Maybe the change is:

Suggested change
response_body = response.content
response_body = response.read()

The documentation doesn't give a really clear picture of how to do what we're trying to do here, which is to save the response bytes without any extra decoding. Although... maybe we want it decoded (to account for compression) and then re-encoded, so that we're saving something that's easy to deal with on the consumer's end?

Member

Oh sorry it should be await response.aread().

without any extra decoding

I suppose you meant the textual codecs (UTF-8, ISO-8859-1, etc.) rather than the content codings (gzip, deflate, etc.), right? There is response.aiter_text(), but then we'd need to handle decoding errors - coming back to the previous question of storing partial responses. I'd prefer we just keep the bytes (only decoding the gzip/deflate/... stuff).

Or maybe we can try to decode it into a str and - like you said - store the encoded bytes with confidence (say, with an additional optional field on the Response indicating the confirmed text encoding, or even always re-encoded as UTF-8), or just store the raw bytes if decoding fails.

Contributor Author

I suppose you meant the textual codecs (UTF-8, ISO-8859-1, etc.) instead of the protocol codes (gzip, deflate, etc.), right?

Ahh, I mostly meant compression, not text encodings, but it looks like response.content already handles decompression:

https://www.python-httpx.org/quickstart/#binary-response-content

Any gzip and deflate HTTP response encodings will automatically be decoded for you. If brotlipy is installed, then the brotli response encoding will be supported. If zstandard is installed, then zstd response encodings will also be supported.

Oh sorry it should be await response.aread().

👍

Or maybe we can try to decode it into str and - like you said - store encoded bytes with confidence (like with an additional optional field on the Response indicating the confirmed text encoding, or even always re-encoded as UTF-8), or just store the raw bytes if we failed to decode.

The main problem I was worried about was consumers having to read a bunch of headers to know how to decompress the bytes into the encoded text. Since that's not actually an issue here, I'm personally fine with just storing the raw bytes and having the consumer decode them themselves (most likely as UTF-8). Although maybe that's relatively hostile, since we imagine these responses will typically only be read at troubleshooting time, and typically via the UI, where raw bytes would be a subpar experience.

On the other hand, I really don't want to make this part of the module overly complicated, and the simplest thing to actually store is the raw bytes. @1st1 any thoughts here?



async def http_send(request: ScheduledRequest) -> httpx.Response:
async with httpx.AsyncClient() as client:
Member

Maybe we can cache an httpx.AsyncClient per tenant to pool connections, but that also leads to racing cookie issues among concurrent requests. So yeah, this is probably fine.

Member

Is the idea of caching AsyncClient to minimize the number of Python objects created? Is it expensive to instantiate? Or would it reuse some underlying resources?

Member

reuse some underlying resources

Yes, it reuses the TCP/HTTP connections with keep-alive, avoiding some connection establishment time (which we don't care about much, since this is supposed to be async and slow anyway). In the case of HTTP/2, it could multiplex concurrent requests over the same TCP connection, allowing higher concurrency with minimal resources.

The cost of creating an AsyncClient is not that high, and its Python objects are also cleared away after __aexit__(). However, the cookie store is shared among all requests on the same client, which may or may not be the expected behavior. For example, if AsyncClient is following redirects and the redirected request depends on a cookie from the initial response, two concurrent requests with redirects on the same domain may step on each other's toes.

At the end of the day, it's still simpler to just use a separate client per request. But I guess in the future we can make it an option/config for the user to decide, if needed.

Contributor Author

It's handy that the client can handle cookies, but I don't think any user (ourselves included) should depend on something stateful like cookies when using this module. Maybe there is a way to disable cookies altogether?

Contributor Author

Oh, one other nice thing we'd get out of sharing a client is that we can set the max connections in the client itself to avoid managing the concurrency directly ourselves.

Member

Maybe there is a way to disable cookies altogether?

encode/httpx#2992 looks like we'll need to hack our own way in.

we can set the max connections in the client itself

Yep, there's also that. And in the case of HTTP/2 it'll effectively be the hardware max connections (like the number of FDs, I believe), which is what we really want, I guess (instead of a logical number of concurrent requests).

Contributor Author

I think it makes sense to have a null cookie jar, share a single AsyncClient per-tenant, and set the max connections on the client. I'll throw that together and we can iterate from there if we want something more robust.
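A minimal sketch of the null-cookie-jar part, assuming the approach of supplying a custom stdlib CookieJar (NullCookieJar is a hypothetical name; the PR may implement this differently):

```python
import http.cookiejar


class NullCookieJar(http.cookiejar.CookieJar):
    """A cookie jar that silently drops every cookie, so responses can
    never mutate shared client state."""

    def set_cookie(self, cookie: http.cookiejar.Cookie) -> None:
        # CookieJar.extract_cookies() funnels every received cookie
        # through set_cookie(), so a no-op here discards all of them.
        pass
```

Since httpx wraps a plain CookieJar passed via its cookies= argument, the shared per-tenant client could then look something like httpx.AsyncClient(cookies=NullCookieJar(), limits=httpx.Limits(max_connections=...)).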


Comment on lines +97 to +99
await asyncio.sleep(
POLLING_INTERVAL.to_microseconds() / 1_000_000.0
)
Member

Are we planning to interrupt this sleep with notifications later?

Contributor Author

"notifications" meaning a Postgres notify event? I think so! I'm not sure exactly how to set that up, but the end goal is definitely push rather than polling (maybe with a longer poll still happening in the case of missed notifications).

Member

@fantix fantix left a comment

Awesome!

For now, treat errors getting the body bytes as a completed request
State is the request's state, and status is the response's status code.
@scotttrinh scotttrinh changed the title from wip to Send std::net:http requests on Sep 18, 2024
@scotttrinh scotttrinh marked this pull request as ready for review September 18, 2024 10:23
@scotttrinh scotttrinh requested a review from fantix September 18, 2024 10:24
Member

@fantix fantix left a comment

Good to go! 👍

@scotttrinh scotttrinh merged commit cbff7eb into master Sep 18, 2024
24 of 26 checks passed
@scotttrinh scotttrinh deleted the 7727-http-worker branch September 18, 2024 16:04