HttpProtocol (both okhttp and apache) race condition while having different proxies in different threads #1247

Open
1 of 3 tasks
chhsiao90 opened this issue Jul 5, 2024 · 2 comments · May be fixed by #1250

@chhsiao90
Contributor

What kind of issue is this?

  • Question. This issue tracker is not the best place for questions. If you want to ask how to do
    something, or to understand why something isn't working the way you expect it to, use StackOverflow
    instead with the label 'stormcrawler': https://stackoverflow.com/questions/tagged/stormcrawler

  • Bug report. If you’ve found a bug, please include a test if you can, it makes it a lot easier to fix things. Use the label 'bug' on the issue.

  • Feature request. Please use the label 'wish' on the issue.

Reproduce steps

To reproduce, run the HttpProtocol main function against many URLs with MultiProxyManager configured.

the crawler.conf

config:
  http.agent.name: test
  http.proxy.manager: org.apache.stormcrawler.proxy.MultiProxyManager
  http.proxy.file: proxies
  http.robots.file.skip: true

the proxies file

http://first:password@proxy1:8888
http://second:password@proxy2:8888

Root cause

The HttpProtocol implementations (both okhttp and apache) are not thread-safe:

  • the same instance, created by ProxyFactory, may be used by different bolts (different workers) at the same time
  • the shared request/client builder is mutated by different bolts/threads at the same time

Example 1 (wrong proxy auth)

  • (Thread 2) builder.setProxy(secondProxy)
  • (Thread 1) builder.setProxy(firstProxy)
  • (Thread 1) builder.setAuth(firstAuth)
  • (Thread 2) builder.setAuth(secondAuth)
  • (Thread 1) builder.build()
  • We'll have firstProxy + secondAuth

Example 2 (wrong proxy used)

  • (Thread 1) builder.setProxy(firstProxy)
  • (Thread 1) builder.setAuth(firstAuth)
  • (Thread 2) builder.setProxy(secondProxy)
  • (Thread 2) builder.setAuth(secondAuth)
  • (Thread 1) builder.build()
  • Now both requests use the second proxy
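
The two interleavings above can be replayed deterministically on a single thread. The sketch below uses a hypothetical `ToyBuilder` whose `setProxy`/`setAuth`/`build` methods only mirror the shape of the calls in the examples; it is not the OkHttp or Apache HttpClient API, and the proxy and auth strings are placeholders.

```java
// Hypothetical stand-in for a shared, mutable client builder.
// It only mimics the setProxy/setAuth/build calls from the examples above.
class SharedBuilderReplay {

    static final class ToyBuilder {
        private String proxy;
        private String auth;

        ToyBuilder setProxy(String proxy) { this.proxy = proxy; return this; }
        ToyBuilder setAuth(String auth) { this.auth = auth; return this; }

        // Snapshot of whatever state the builder holds at this moment.
        String build() { return proxy + " + " + auth; }
    }

    // Example 1: Thread 1 builds after Thread 2 has overwritten only the auth.
    static String replayExample1() {
        ToyBuilder shared = new ToyBuilder();
        shared.setProxy("secondProxy"); // Thread 2
        shared.setProxy("firstProxy");  // Thread 1
        shared.setAuth("firstAuth");    // Thread 1
        shared.setAuth("secondAuth");   // Thread 2
        return shared.build();          // Thread 1
    }

    // Example 2: Thread 1 builds after Thread 2 has overwritten proxy and auth.
    static String replayExample2() {
        ToyBuilder shared = new ToyBuilder();
        shared.setProxy("firstProxy");  // Thread 1
        shared.setAuth("firstAuth");    // Thread 1
        shared.setProxy("secondProxy"); // Thread 2
        shared.setAuth("secondAuth");   // Thread 2
        return shared.build();          // Thread 1
    }

    public static void main(String[] args) {
        System.out.println(replayExample1()); // firstProxy + secondAuth
        System.out.println(replayExample2()); // secondProxy + secondAuth
    }
}
```

In a real topology the order of these calls depends on thread scheduling, so the failure is intermittent rather than reproducible on every run.
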
@jnioche
Contributor

jnioche commented Jul 5, 2024

thanks @chhsiao90, are you able to suggest a fix for it?

@chhsiao90
Contributor Author

Sure, I can open a PR for it.

chhsiao90 added a commit to chhsiao90/incubator-stormcrawler that referenced this issue Jul 8, 2024
In the HttpProtocol implementations, the client builder was a singleton and
could be accessed and modified by different threads at the same time. The
result is that a wrong proxy is used or a wrong proxy auth is
configured.

To fix it, create a local builder instead of having a field-level
builder.

Fixes apache#1247
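
As a rough illustration of the change described in this commit message, the sketch below creates a builder inside each call instead of reusing a field. `ToyBuilder`, `LocalBuilderSketch`, `clientFor`, and `countMismatches` are hypothetical names that mirror the call shape from the issue; the actual patch works on the OkHttp and Apache HttpClient builders, which are not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LocalBuilderSketch {

    // Hypothetical builder; stands in for the real client builders.
    static final class ToyBuilder {
        private String proxy;
        private String auth;
        ToyBuilder setProxy(String proxy) { this.proxy = proxy; return this; }
        ToyBuilder setAuth(String auth) { this.auth = auth; return this; }
        String build() { return proxy + " + " + auth; }
    }

    // The builder is a local variable, so no other thread can reach it.
    static String clientFor(String proxy, String auth) {
        ToyBuilder builder = new ToyBuilder();
        return builder.setProxy(proxy).setAuth(auth).build();
    }

    // Runs many concurrent calls and counts results whose proxy and auth differ.
    static int countMismatches(int calls) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < calls; i++) {
                String id = (i % 2 == 0) ? "first" : "second";
                Callable<String> task = () -> clientFor(id + "Proxy", id + "Auth");
                results.add(pool.submit(task));
            }
            int mismatches = 0;
            for (int i = 0; i < calls; i++) {
                String id = (i % 2 == 0) ? "first" : "second";
                if (!results.get(i).get().equals(id + "Proxy + " + id + "Auth")) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(1000)); // mismatches: 0
    }
}
```

Because each call owns its builder, the proxy and auth set by one thread cannot be overwritten by another before `build()` runs, at the cost of allocating a builder per request.
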
chhsiao90 added a commit to chhsiao90/incubator-stormcrawler that referenced this issue Jul 15, 2024