useIncognitoPages doesn't rotate fingerprints #2310

Open
1 task
mnmkng opened this issue Jan 30, 2024 · 1 comment
Labels
bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.


mnmkng commented Jan 30, 2024

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

If you run the code below with useIncognitoPages enabled, every request uses the same browser fingerprint (the same user agent). If you comment out useIncognitoPages and uncomment maxOpenPagesPerBrowser: 1 (one page per browser), you get different user agents.

Code sample

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        useFingerprints: true,
        // Uncomment this (and comment out useIncognitoPages below) to get rotating user agents.
        // maxOpenPagesPerBrowser: 1,
    },
    launchContext: {
        useIncognitoPages: true,
    },
    preNavigationHooks: [
        async ({ page }) => {
            // Print the headers of the first request each page sends.
            page.once('request', async (req) => {
                try {
                    const headers = await req.allHeaders();
                    console.dir(headers);
                } catch (e) {
                    console.log('req inspection failed');
                }
            });
        },
    ],
    requestHandler: async ({ request, page, log }) => {
        // api.ipify.org returns the caller's IP as JSON, rendered in a <pre> element.
        const text = await page.innerText('pre');
        log.info(text);
    },
});


await crawler.run([
    'https://api.ipify.org?format=json&a',
    'https://api.ipify.org?format=json&b',
    'https://api.ipify.org?format=json&c',
    'https://api.ipify.org?format=json&d',
    'https://api.ipify.org?format=json&e',
    'https://api.ipify.org?format=json&f',
]);
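
To make the missing rotation easier to spot than by reading six header dumps, the user agents could be tallied instead of printed. The following is only a sketch: createUserAgentTally is a hypothetical helper, not part of Crawlee or Playwright, and the header objects in the example are made up. It relies only on req.allHeaders() returning lower-cased header names.

```javascript
// Hypothetical helper (not a Crawlee API): counts how often each
// user-agent value was seen across inspected requests.
function createUserAgentTally() {
    const counts = new Map();
    return {
        // Pass the object returned by `await req.allHeaders()`.
        record(headers) {
            const ua = headers['user-agent'] ?? '(missing)';
            counts.set(ua, (counts.get(ua) ?? 0) + 1);
        },
        // Number of distinct user agents seen so far.
        distinct() {
            return counts.size;
        },
        // Plain object mapping user agent -> number of requests.
        summary() {
            return Object.fromEntries(counts);
        },
    };
}

// Example with made-up headers standing in for real requests.
const tally = createUserAgentTally();
tally.record({ 'user-agent': 'UA-1' });
tally.record({ 'user-agent': 'UA-1' });
tally.record({ 'user-agent': 'UA-2' });
console.log(tally.distinct(), tally.summary());
```

In the reproduction above, tally.record(headers) would replace console.dir(headers), and tally.distinct() could be logged after crawler.run() finishes. A result of 1 for six requests would match the behavior described in this issue.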

Package version

3.7.2

Node.js version

18

Operating system

MacOS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

mnmkng added the bug label on Jan 30, 2024
B4nan added the t-tooling label on Feb 12, 2024
B4nan added this to the "83rd sprint - Tooling team" milestone on Feb 12, 2024
barjin changed the title from "useIncognitoPages doesn't use fingerprints, even if they are explicitly enabled" to "useIncognitoPages doesn't rotate fingerprints" on Feb 13, 2024

barjin commented Feb 14, 2024

Seems like a sign of a much larger underlying issue:

New sessions / fingerprints / proxyUrls are generated only on a browser launch.
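
A toy model of that statement, for illustration only. None of the names below exist in Crawlee or browser-pool, and the real logic is more involved. The assumption modeled here is just that a fingerprint is chosen once per browser launch and that every page or context opened in that browser inherits it.

```javascript
// Toy model, not Crawlee code. `maxPagesPerBrowser` stands in for anything
// that retires a browser after some number of pages; Infinity models
// incognito contexts that all share a single long-lived browser.
function countDistinctFingerprints(requestCount, maxPagesPerBrowser) {
    let launches = 0;
    let browser = null;
    const seen = new Set();

    for (let i = 0; i < requestCount; i++) {
        // A new fingerprint is generated only when a browser is launched.
        if (browser === null || browser.pagesOpened >= maxPagesPerBrowser) {
            launches += 1;
            browser = { fingerprint: `fingerprint-${launches}`, pagesOpened: 0 };
        }
        // Opening a page (or context) reuses the browser's fingerprint.
        browser.pagesOpened += 1;
        seen.add(browser.fingerprint);
    }
    return seen.size;
}

console.log(countDistinctFingerprints(6, Infinity)); // 1: one browser, one fingerprint
console.log(countDistinctFingerprints(6, 1));        // 6: a new browser for every page
```

Under this simplification, the only way to get a new fingerprint is to retire the browser, which is the pattern both configurations in this comment come down to.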

The following snippet doesn't rotate the fingerprints correctly: all requests are made with a single session. This is because useIncognitoPages was written with Playwright contexts in mind. We relied on the "newPage() creates a separate environment" invariant, so all the pages/contexts are launched in one browser.

sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 1,
    },
},
launchContext: {
   useIncognitoPages: true,
},

The following snippet rotates the fingerprints correctly:

sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 1,
    },
},
launchContext: {
   useIncognitoPages: false,
},

This works because an "expired" session throws away the whole browser instance, so new pages have to launch a whole new browser (compare maxOpenPagesPerBrowser, which does the same thing). This is crazy expensive, though: launching and closing a context 100 times in one browser takes ~3.9 seconds, while launching and closing a browser 100 times takes 40 seconds.
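
Broken down per launch, the numbers above work out roughly as follows. This is plain arithmetic on the two totals quoted in this comment, not a new measurement; the variable names are illustrative.

```javascript
const RUNS = 100;
const contextTotalMs = 3900;  // ~3.9 s for 100 context launch/close cycles
const browserTotalMs = 40000; // 40 s for 100 browser launch/close cycles

const perContextMs = contextTotalMs / RUNS; // 39 ms per context
const perBrowserMs = browserTotalMs / RUNS; // 400 ms per browser
const slowdown = browserTotalMs / contextTotalMs;

console.log(`per context: ${perContextMs} ms`);
console.log(`per browser: ${perBrowserMs} ms`);
console.log(`relaunching the browser is ~${slowdown.toFixed(1)}x slower`);
```

So rotating fingerprints by relaunching the browser costs on the order of ten times more per request than opening a fresh context would.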

The entire browser-pool and session rotation logic is quite convoluted and worth a total rewrite.
