CheerioCrawler not persisting cookies #2618

taythebot · 2024-08-14T10:27:58Z

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/cheerio (CheerioCrawler)

Issue description

CheerioCrawler is not persisting cookies at all. The session does hold the cookies for request.url, but they are not sent with the next request. Setting them manually in preNavigationHooks does not work either, because session.getCookieString(request.url) returns an empty string there.

1. Create a new CheerioCrawler with useSessionPool: true and persistCookiesPerSession: true
2. Visit a URL that assigns a cookie in its response
3. Visit the URL again
4. The cookie is not set in the request headers

Code sample

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
	minConcurrency: 1,
	maxConcurrency: 10,
	requestHandlerTimeoutSecs: 30,
	maxRequestRetries: 10,
	useSessionPool: true,
	persistCookiesPerSession: true,
	preNavigationHooks: [
		async ({ request, session }, gotOptions) => {
			gotOptions.useHeaderGenerator = true;
			gotOptions.headerGeneratorOptions = {
				browsers: [{ name: 'firefox', minVersion: 115, maxVersion: 115 }],
				devices: ['desktop'],
				operatingSystems: ['windows'],
				locales: ['en-US', 'en'],
			};

			// Cookies are not present here on the second request
			console.log(session.getCookieString(request.url));
		},
	],
	requestHandler: async ({ request, session, addRequests }) => {
		// Cookies are present here
		console.log(session.getCookies(request.url));

		// Requeue same URL with different uniqueKey
		await addRequests([{ url: request.url, uniqueKey: new Date().toString() }]);
	},
});

await crawler.run(['http://localhost:8000']);

Package version

v3.11.1

Node.js version

v20.16.0

Operating system

MacOS Sonoma

Apify platform

Tick me if you encountered this issue on the Apify platform

I have tested this on the `next` release

No response

Other context

Here's a small Python script to test whether Crawlee is properly setting cookies. It sets a cookie on GET / and logs the headers of each incoming request.

#!/usr/bin/env python3

import http.server as SimpleHTTPServer
from http import cookies
import socketserver as SocketServer
import logging

PORT = 8000

class GetHandler(
        SimpleHTTPServer.SimpleHTTPRequestHandler
        ):

    def do_GET(self):
        logging.error(self.headers)
        self.send_response(200)
        self.send_header("Content-type", "text/html")

        cookie = cookies.SimpleCookie()
        cookie['a_cookie'] = "Cookie_Value"
        self.send_header("Set-Cookie", cookie.output(header='', sep=''))

        self.end_headers()
        self.wfile.write(bytes("TEST", 'utf-8'))


Handler = GetHandler
httpd = SocketServer.TCPServer(("", PORT), Handler)

httpd.serve_forever()


B4nan · 2024-08-14T11:00:43Z

Cookies are persisted per session, your second request is (almost certainly) getting a new session.
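
To illustrate what "persisted per session" means, here is a toy model. It is a simplification and not Crawlee's implementation: `ToySession` and its methods are hypothetical names, and real cookie scoping (domain, path, expiry) is ignored. It only shows why a request handled under a different session sees no cookies.

```javascript
// Toy model only: NOT Crawlee's implementation.
// Each session owns its own cookie store, keyed here by URL origin.
class ToySession {
    constructor(id) {
        this.id = id;
        this.cookiesByOrigin = new Map();
    }

    // Record a cookie received in a response for the given URL.
    setCookie(url, name, value) {
        const origin = new URL(url).origin;
        if (!this.cookiesByOrigin.has(origin)) {
            this.cookiesByOrigin.set(origin, new Map());
        }
        this.cookiesByOrigin.get(origin).set(name, value);
    }

    // Build the Cookie header value this session would send to the URL.
    getCookieString(url) {
        const jar = this.cookiesByOrigin.get(new URL(url).origin) ?? new Map();
        return [...jar].map(([name, value]) => `${name}=${value}`).join('; ');
    }
}

const url = 'http://localhost:8000';
const sessionA = new ToySession('session_A');
const sessionB = new ToySession('session_B');

// The first request runs under session A and receives the cookie.
sessionA.setCookie(url, 'a_cookie', 'Cookie_Value');

// A later request under session A sees the cookie; one under session B does not.
const cookiesSeenByA = sessionA.getCookieString(url);
const cookiesSeenByB = sessionB.getCookieString(url);
console.log({ cookiesSeenByA, cookiesSeenByB });
```

In this model the re-queued request from the code sample corresponds to the session B case whenever the pool hands it a different session than the first request used.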

taythebot · 2024-08-14T11:22:18Z

> Cookies are persisted per session, your second request is (almost certainly) getting a new session.

How do I make sure the second request is using the same session?

B4nan · 2024-08-14T12:14:56Z

What are you trying to do?

B4nan · 2024-08-14T12:19:33Z

You could set maxPoolSize: 1, that way there will be only one session. Otherwise I don't think we have a way to force a session id on new requests (but we should add one, that's a good point).
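
A minimal sketch of that suggestion, assuming the pool size is configured through `sessionPoolOptions` on the crawler. Only the options object is shown so the snippet stands alone; `singleSessionOptions` is a name chosen for this example.

```javascript
// Options only; pass them to `new CheerioCrawler(singleSessionOptions)`
// together with the hooks and requestHandler from the code sample.
const singleSessionOptions = {
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        // A single session in the pool, so every request shares its cookies.
        maxPoolSize: 1,
    },
};

console.log(singleSessionOptions.sessionPoolOptions.maxPoolSize);
```

With one session, all requests also share one identity, which gives up the rotation a larger pool provides. The session can still be retired and replaced (for example after errors), so cookies are not guaranteed to survive the whole run.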

taythebot · 2024-08-15T06:16:23Z

> What are you trying to do?

The website I'm trying to scrape has an anti-bot feature where you need to wait in an access queue. The access queue page sends a Refresh header which indicates the number of seconds you need to wait. Afterwards you need to refresh the page to gain access. Once you gain access you are given an access cookie which must be present in all future requests.

When I detect this, I sleep for the required time and then re-queue the same URL. I can't find a way to refresh a page via Cheerio directly, so I have to re-queue it with a different unique key. However, this seems difficult to implement with many sessions, since I cannot specify that the request should go through the same session. Maybe there's a better way to handle this use case in Crawlee that I'm not aware of?
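
The wait time described above could be read from the header with a small helper. This is a sketch under assumptions: `parseRefreshSeconds` is a hypothetical name, and the header is assumed to look like `5` or `5; url=/some/path`, since the exact value the site sends is not shown here.

```javascript
// Hypothetical helper: extract the delay in seconds from a Refresh header value.
// Returns null when the header is missing or does not start with a number.
function parseRefreshSeconds(headerValue) {
    if (typeof headerValue !== 'string') return null;
    // The delay is the leading number, optionally followed by ';' or ',' and a URL.
    const match = /^\s*(\d+(?:\.\d+)?)\s*(?:[;,]|$)/.exec(headerValue);
    return match ? Number(match[1]) : null;
}

console.log(parseRefreshSeconds('10; url=/queue'));
console.log(parseRefreshSeconds(undefined));
```

In a request handler the value would presumably come from `response.headers.refresh`, and a non-null result would mean the queue page was served instead of the real content.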

sriraj66 · 2024-10-09T09:10:45Z

Can you give me the URL of that website?

taythebot added the "bug" label (Something isn't working.) on Aug 14, 2024

fnesveda added the "t-tooling" label (Issues with this label are in the ownership of the tooling team.) on Aug 21, 2024
