executeClientScripts semantics #1128

Phyks · 2025-01-08T15:27:20Z

Hi,

At the moment, executeClientScripts does more than simply "execute client scripts": it upgrades the fetcher from a simple node-fetch to a full browser. Running client scripts is a side effect of that upgrade, and the option seems to be used about as much for bot detection evasion (a Puppeteer browser has a stealthier signature than node-fetch) as for actually running client scripts.

This has multiple implications:

Spawning a full browser is quite resource-intensive, so it should be avoided as much as possible.
Bot evasion is probably more easily handled by a dedicated tool such as https://github.com/lexiforest/curl-impersonate?tab=readme-ov-file (a modified, stealthier curl).
Running client scripts can also be handled with a plain DOM fetcher, given that JSDom is already embedded in the codebase. This requires setting parameters in JSDom to enable both in-HTML script execution and external resource loading (https://github.com/jsdom/jsdom?tab=readme-ov-file#executing-scripts).

For these reasons, I believe that executeClientScripts should not be a single binary parameter but a pair of binary parameters: useFullBrowser and executeClientScripts. These would have the following semantics:

executeClientScripts ↓ / useFullBrowser →	false (default)	true
false	curl-backed fetcher / no script execution	Puppeteer without JS
true (default)	curl-backed fetcher / script execution with JSDom	Puppeteer
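
As an illustration, the matrix above can be read as a small decision function. This is only a sketch, not code from the project: the names FetchOptions, FetchStrategy and selectFetchStrategy are hypothetical, and the defaults simply follow the cells marked "(default)" in the table.

```typescript
// Hypothetical sketch of the proposed semantics; none of these names exist in the codebase.
interface FetchOptions {
  executeClientScripts?: boolean;
  useFullBrowser?: boolean;
}

interface FetchStrategy {
  fetcher: "curl" | "puppeteer";
  scriptRuntime: "none" | "jsdom" | "browser";
}

function selectFetchStrategy(options: FetchOptions = {}): FetchStrategy {
  // Assumption: defaults follow the "(default)" cells of the table.
  const useFullBrowser = options.useFullBrowser ?? false;
  const executeClientScripts = options.executeClientScripts ?? true;

  if (useFullBrowser) {
    // Full browser: scripts run in the browser itself, or not at all ("Puppeteer without JS").
    return { fetcher: "puppeteer", scriptRuntime: executeClientScripts ? "browser" : "none" };
  }
  // Lightweight path: curl-backed fetch, with optional script execution in JSDom.
  return { fetcher: "curl", scriptRuntime: executeClientScripts ? "jsdom" : "none" };
}
```

With this reading, a declaration that sets neither parameter would get the curl-backed fetcher with JSDom script execution, and Puppeteer would only be started when useFullBrowser is explicitly true.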

This would likely mean a backward-incompatible change to the service declarations, so I am not sure how best to handle such an evolution.
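
One possible way to soften the transition, offered purely as a sketch, would be to normalise legacy declarations at load time. The code below assumes that a declaration without useFullBrowser is a legacy one, that the legacy flag defaults to false, and that legacy executeClientScripts: true always meant "full browser with scripts". The names DeclaredOptions, NormalisedOptions and migrateOptions are hypothetical.

```typescript
// Hypothetical migration sketch; these names do not exist in the codebase.
interface DeclaredOptions {
  executeClientScripts?: boolean;
  useFullBrowser?: boolean;
}

interface NormalisedOptions {
  executeClientScripts: boolean;
  useFullBrowser: boolean;
}

function migrateOptions(declared: DeclaredOptions): NormalisedOptions {
  if (declared.useFullBrowser !== undefined) {
    // New-style declaration: keep values, apply the proposed default for scripts.
    return {
      useFullBrowser: declared.useFullBrowser,
      executeClientScripts: declared.executeClientScripts ?? true,
    };
  }
  // Legacy declaration (assumption): the single flag meant "use a full browser",
  // with script execution as a side effect, and it defaulted to false.
  const legacyFlag = declared.executeClientScripts ?? false;
  return { useFullBrowser: legacyFlag, executeClientScripts: legacyFlag };
}
```

Under these assumptions, existing declarations would keep their current behaviour: a legacy declaration that omits the flag stays on the plain fetcher without script execution, rather than silently picking up the new JSDom default.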

Best,

Ndpnt · 2025-01-22T10:53:07Z

Hi @Phyks,

Thank you so much for your valuable feedback on this option.
You’ve raised some very relevant points about the current implementation and how it could be improved.

At the moment, the team is focused on other priorities, so I can’t commit to a timeline or guarantee when we will investigate this question further, but I will keep your suggestions in mind.

Thanks again for taking the time to share this feedback, I really appreciate your input!

MattiSG · 2025-01-24T11:53:16Z

Interesting idea, thanks!

If I remember correctly, we had sampling data showing that executeClientScripts was not that prevalent and that the basic fetcher worked in most cases. Adding an intermediate state could be interesting if it both significantly:

Increased bot-blocker evasion.
Decreased resource consumption compared to starting a full Puppeteer.

Before adding complexity to the codebase and config, I believe it would be critical to gather the following data:

Prevalence of executeClientScripts: true in a large sample (ideally whole federation, otherwise I'd suggest Contrib).
Proportion of failures that are corrected by both full Puppeteer and non-JS Puppeteer in a large sample.
Resource consumption difference between full Puppeteer and non-JS Puppeteer (CPU cycles, RAM usage, time to start).
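
For the first data point, a rough count over a set of declarations could look like the sketch below. It assumes a simplified declaration shape (a name plus a terms map whose entries may carry executeClientScripts) that may not match the real schema, and the function name executeClientScriptsPrevalence is hypothetical.

```typescript
// Assumed, simplified declaration shape; the real schema may differ.
interface TermsDeclaration {
  executeClientScripts?: boolean;
}

interface ServiceDeclaration {
  name: string;
  terms: Record<string, TermsDeclaration>;
}

interface Prevalence {
  total: number;   // number of terms declarations inspected
  enabled: number; // those with executeClientScripts explicitly set to true
  ratio: number;   // enabled / total, or 0 for an empty sample
}

function executeClientScriptsPrevalence(services: ServiceDeclaration[]): Prevalence {
  let total = 0;
  let enabled = 0;
  for (const service of services) {
    for (const termsType of Object.keys(service.terms)) {
      total += 1;
      if (service.terms[termsType].executeClientScripts === true) {
        enabled += 1;
      }
    }
  }
  return { total, enabled, ratio: total === 0 ? 0 : enabled / total };
}
```

Running this over a whole collection would give the share of terms that currently rely on the full browser, which is the baseline the other two measurements would be compared against.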
