Blocking fetcher thread #996

Mikwiss · 2022-09-02T10:18:47Z

Thanks again for all your work ! Now, let me expose you our fetcher thread issue.

Resume

Our cluster have 6 worker nodes. We are fetching more than 3 million URLs per day with our topology. It is deployed on 16 worker slots and use 16 fetchers, one by worker slot.

OkClient.HttpProtocol

The worst issue was spotted with the OkClient.HttpProtocol. Sometime, one of the worker nodes step up to 100% CPU usage. For example, the worker 5 in this case:

On StromCrawler board, we can see the fetcher count increase up to 50 (our fetcher limit) :

Worst, in another case, all the topologies are impacted :

All fetchers are impacted, and the topology is running slowly. The only way to fix the problem, is to kill and redeploy the topology. On kill phase, the log confirms some blocking thread:

2022-05-30 06:37:06.557 O.A.S.D.W.WORKER SHUTDOWNHOOK-SHUTDOWNFUNC [INFO] SHUTTING DOWN EXECUTORS
...
2022-05-30 06:37:07.028 O.A.S.E.EXECUTORSHUTDOWN SHUTDOWNHOOK-SHUTDOWNFUNC [INFO] SHUTTING DOWN EXECUTOR FETCHER:[30, 30]
2022-05-30 06:37:07.077 C.D.S.B.FETCHERBOLT THREAD-21-FETCHER-EXECUTOR[30, 30] [ERROR] INTERRUPTED EXCEPTION CAUGHT IN EXECUTE METHOD
2022-05-30 06:37:07.077 C.D.S.B.FETCHERBOLT THREAD-21-FETCHER-EXECUTOR[30, 30] [ERROR] INTERRUPTED EXCEPTION CAUGHT IN EXECUTE METHOD
2022-05-30 06:37:07.077 C.D.S.B.FETCHERBOLT THREAD-21-FETCHER-EXECUTOR[30, 30] [ERROR] INTERRUPTED EXCEPTION CAUGHT IN EXECUTE METHOD
2022-05-30 06:37:07.077 C.D.S.B.FETCHERBOLT THREAD-21-FETCHER-EXECUTOR[30, 30] [ERROR] INTERRUPTED EXCEPTION CAUGHT IN EXECUTE METHOD

HttpClient.HttpProtocol

We had tried to change the protocol to fix this issue. The CPU has never reach again 100%. But periodically, some fetcher threads are not released.

After some days, those “zombie” threads increase. We are often redeploying the topology (for functional update) and obviously, a new deployment reset thread count.

For now, the issue is less critical then the OkClient one, but we are trying to understand. Do you have any ideas or similar case?

sebastian-nagel · 2022-09-02T11:08:52Z

Hi @Mikwiss, for the OKHttp protocol, see #918: OkHttp's internal connection pool implementation does not scale up to 1000 or more open connections. You might want to try to tune your pool configuration. Note: the issue with connection pooling was discovered on Nutch and then ported to Stormcrawler. The best pool configuration depends on how many hosts are crawled, the distribution of URLs over hosts and the configured partitioning. Could you share more information, including which Stormcrawler and Storm versions are used? Also the Storm UI provides insight which bolts in the topology are actually the bottleneck.

Mikwiss · 2022-09-02T14:12:45Z

Hi @sebastian-nagel ! Thanks for reply ! I will check #918 in order to understand.

Each crawler we have crawls only one host. We are currently in SC 2.2 and Storm 2.3.0. We have a task in our backlog to update SC to 2.5 and Storm to 2.4.0.

jnioche · 2022-09-03T05:51:45Z

Thanks @Mikwiss for reporting this issue and thanks @sebastian-nagel for your comment

Each crawler we have crawls only one host

so the 16 fetchers all deal with the same host? Did you chose that over a single Fetcher so that the tuples get distributed evenly across the Parser tasks?

What value do you have in your conf for fetcher.threads.per.queue ?

Mikwiss · 2023-01-10T14:16:39Z

Hi !

Sorry for the delay. To reproduce the issue we have to wait a long time.

So, this issue occurs on only once topology (with specific target/host and 50 fetcher.threads.per.queue). According to our conversation with @jnioche, we decrease the http.content.limit to 10 000 000 (instead of -1).
We still have the same issue :

okhttp : 100% CPU on one worker after a few moment
httpclient : some thread zombie

But, it's seems less violent. So we'll decrease again this parameter. We keep in touch.

jnioche added bug fetcher labels Sep 3, 2022

jnioche modified the milestone: 2.6 Sep 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Blocking fetcher thread #996

Blocking fetcher thread #996

Mikwiss commented Sep 2, 2022

sebastian-nagel commented Sep 2, 2022

Mikwiss commented Sep 2, 2022

jnioche commented Sep 3, 2022

Mikwiss commented Jan 10, 2023

Blocking fetcher thread #996

Blocking fetcher thread #996

Comments

Mikwiss commented Sep 2, 2022

Resume

OkClient.HttpProtocol

HttpClient.HttpProtocol

sebastian-nagel commented Sep 2, 2022

Mikwiss commented Sep 2, 2022

jnioche commented Sep 3, 2022

Mikwiss commented Jan 10, 2023