Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1247 Fix race condition in HttpProtocol while having multiple proxy settings #1250

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

chhsiao90
Copy link
Contributor

@chhsiao90 chhsiao90 commented Jul 8, 2024

In HttpProtocol implementation, the client builder was singleton and may be accessed and modified by different threads at same time. The result is that a wrong proxy will be used or a wrong proxy auth will be configured.

To fix it, create a local builder insteand of having a field-level builder.

Fixes #1247

Thank you for contributing to Apache StormCrawler.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

@jnioche
Copy link
Contributor

jnioche commented Jul 10, 2024

This is a substantial change that could have an impact on how things work in SC. It needs careful reviewing, annoyingly I don't have much time to do that at the moment but I hope I will be able to soon. Reviews from other committers would be welcome

@pjfanning
Copy link
Contributor

pjfanning commented Jul 10, 2024

My gut reaction is that adding even more shared state to the classes is not a good way to go.

The more I look at this the more I think there is a major design flaw. The HttpProtocol classes should not be reused for HTTP calls that have different configs. There should be separate HttpProtocol class instances created for each different config. In #1247, there are 2 proxies described, HTTP calls for those 2 proxies should not be made with the same HTTPProtocol instance.

@pjfanning
Copy link
Contributor

The solutions applied in okhttp and httpclient classes seem very different. It would greatly simply review to have 2 different PRs that can be reviewed independently.

In HttpProtocol implementation, the client builder was singleton and may
be accessed and modified by different threads at same time. The result
is that a wrong proxy will be used or a wrong proxy auth will be
configured.

To fix it, create a local builder insteand of having a field-level
builder.

Fixes apache#1247
@chhsiao90
Copy link
Contributor Author

The solutions applied in okhttp and httpclient classes seem very different. It would greatly simply review to have 2 different PRs that can be reviewed independently.

Sure, I'll separate the httpclient in a separated PR.

My gut reaction is that adding even more shared state to the classes is not a good way to go.

Actually, those newly added fields are not shared state. They are just the http options retrieved from the conf.

The easiest and most readable way is to just have a local field conf, and we always build the http client from the conf. But that would introduce unnecessary map.get operations as the conf should keep unchanged when it was loaded.

So I decided to have these option fields to be loaded during the void configure(conf) method call. And I think that's how most other storm crawler classes do, ex: JSoupParserBolt.

What do you think?

The more I look at this the more I think there is a major design flaw. The HttpProtocol classes should not be reused for HTTP calls that have different configs. There should be separate HttpProtocol class instances created for each different config. In #1247, there are 2 proxies described, HTTP calls for those 2 proxies should not be made with the same HTTPProtocol instance.

I think it's might be a common case to round-robin a large proxy pool (may have thousands or more ips). So it makes sense to me to have a single HttpProtocol, and select a proxy from ProxyManager whenever it was called.

If we'd like to have multiple HttpProtocol instances for each proxy, it means that we may round-robin the HttpProtocol in ProtocolFactory. And I'm not sure if it's easy to implement it.

@chhsiao90
Copy link
Contributor Author

Pushed the update to remove the change of httpclient HttpProtocol.

@rzo1
Copy link
Contributor

rzo1 commented Oct 2, 2024

Actually, I am not using proxies for my setup, so I am unable to test it. Would it be possible to add unit tests (leveraging test containers) to spin up a proxy-based setup to demonstrate the correctness of this change?

@jnioche
Copy link
Contributor

jnioche commented Nov 3, 2024

My gut reaction is that adding even more shared state to the classes is not a good way to go.

The more I look at this the more I think there is a major design flaw. The HttpProtocol classes should not be reused for HTTP calls that have different configs. There should be separate HttpProtocol class instances created for each different config. In #1247, there are 2 proxies described, HTTP calls for those 2 proxies should not be made with the same HTTPProtocol instance.

I like this approach. We already have the ability to have more than 1 instance of each protocol, ideally we would be able to configure them differently

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HttpProtocol (both okhttp and apache) race condition while having different proxies in different threads
4 participants