Thanks #1

Open

NikolaiT opened this issue Jan 31, 2021 · 1 comment

@NikolaiT

Just wanted to drop by and say thanks. It's good to be aware of those techniques. It's insanely complex to not get detected.

What kind of scraping setup do you suggest?

I am currently going with something like this, what do you think?

/**
 * This test uses the real Google Chrome browser and not a precompiled puppeteer binary.
 *
 * Furthermore, we start the browser manually and not with puppeteer.
 */
const puppeteer = require('puppeteer-core');
const exec = require('child_process').exec;
const fs = require('fs');

// change this when necessary
const GOOGLE_CHROME_BINARY = '/usr/bin/google-chrome-stable';

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

function execute(command, callback) {
  exec(command, function(error, stdout, stderr) { callback(stdout); });
}

/**
 * Poll browser.log periodically until we see the wsEndpoint
 * that we use to connect to the browser.
 */
async function getWsEndpoint() {
  const wsEndpointFile = './browser.log';
  for (let i = 1; i <= 10; i++) {
    await sleep(500);
    if (fs.existsSync(wsEndpointFile)) {
      const logContents = fs.readFileSync(wsEndpointFile).toString();
      const regex = /DevTools listening on (.*)/i;
      const match = regex.exec(logContents);
      if (match) {
        return match[1].trim();
      }
    }
  }
  console.error('Could not get wsEndpoint');
  process.exit(1);
}

(async () => {
  // start the browser in the background and redirect its stderr
  // (where Chrome prints the DevTools endpoint) to browser.log
  const command = GOOGLE_CHROME_BINARY + ' --remote-debugging-port=9222 --no-first-run --no-default-browser-check 2> browser.log &';
  execute(command, (stdout) => {
    console.log(stdout);
  });

  // now connect to the browser
  // we do not start the browser with puppeteer,
  // because we want to influence the startup process
  // as little as possible
  const browser = await puppeteer.connect({
    browserWSEndpoint: await getWsEndpoint(),
    defaultViewport: null,
  });

  const page = await browser.newPage();
  await page.goto('https://google.com');
  await sleep(1000);
  await page.screenshot({ path: 'bot.png', fullPage: true });

  await page.close();
  await browser.close();
})();
@niespodd (Owner)

niespodd commented Feb 3, 2021

Hi @NikolaiT

The list of fingerprinting/detection surfaces that I have covered so far is barely the tip of the iceberg. In the upcoming weeks I will make some more updates. Stay tuned 😎

Generally, all bot detection technologies work in three "dimensions" and aim to find irregularities:

  1. Zero-Level Traffic Analysis. The anti-bot WAF is installed on top of a CDN service (e.g. Akamai, Cloudflare, PerimeterX) and analyzes various aspects of client traffic. This may be IP reputation (geography, associated ISP, databases like ISDB and more), client bandwidth, request timing etc.
  2. Passive Browser Fingerprinting. The target website asks the browser to execute Javascript code that yields results based on the hardware, browser and browser version used by the client (e.g. WebGL fingerprint, timing attacks); see the sketch after this list.
  3. Behavior Analysis. This is feasible only on websites that require a good number of interactions: clicks, scrolls, typing etc. You will often find it in financial services websites as part of anti-fraud systems. As a matter of fact, there are companies using keystroke characteristics alone to build the "second factor" of 2-FA.
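
To make point 2 concrete, here is a minimal sketch of what a page-side fingerprinting probe can look like: a few lines of browser Javascript reading out the WebGL renderer, one of the classic signals. WEBGL_debug_renderer_info is a standard WebGL extension; the function name and surrounding structure are purely illustrative, not code from any real anti-bot vendor.

// Sketch: how a page might read a WebGL fingerprint (illustrative only)
function collectWebglFingerprint() {
  const canvas = document.createElement('canvas');
  const gl = canvas.getContext('webgl');
  if (!gl) return null; // headless setups without a GPU often fail right here
  const debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
  if (!debugInfo) return null;
  return {
    vendor: gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL),
    renderer: gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL),
    // values like "Google SwiftShader" are a classic headless giveaway
  };
}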

At first sight it may sound overwhelming, but you need to keep in mind that no anti-bot system should block access for regular users.

To put it differently, if the anti-bot system is not 100% sure you are a bot, you are very likely not one, and you will pass the test. The system may assign you a score and, based on that, apply countermeasures, e.g. slow down your requests, display "shadowed" data, or serve a captcha gateway. At that point your job is to polish your scraper and proxies until it perfectly resembles a real browser.


Now, to your question:

What kind of scraping setup do you suggest?

I suggest addressing all three points mentioned above:

  • Make your traffic look legit, e.g. make sure you didn't set a Windows user-agent when puppeteer is running on Linux
  • Go for puppeteer-extra-plugin-stealth; the less you override in the original Chrome, the better it works (i.e. disable the plugins evasion). With this approach, however, because it's public code and anti-bots quickly follow up, it may only work 60% of the time... 🤣
  • Set timeouts between each action, so that it resembles real user behavior (a sketch combining these points follows this list)
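
A minimal sketch putting those three points together, using puppeteer-extra with the stealth plugin. The enabledEvasions.delete call is the plugin's documented way to turn off a single evasion; the user-agent string and delay ranges are just example values, not recommendations for any specific target.

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// the less we override in the original Chrome, the better: drop the plugins evasion
const stealth = StealthPlugin();
stealth.enabledEvasions.delete('navigator.plugins');
puppeteer.use(stealth);

// randomized pause so actions don't fire at machine-regular intervals
const humanDelay = (min = 500, max = 2000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // keep the user agent consistent with the host OS (here: Linux),
  // instead of pretending to be Windows while running on Linux
  await page.setUserAgent(
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
  );

  await page.goto('https://example.com');
  await humanDelay();
  await page.click('a');
  await humanDelay();

  await browser.close();
})();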

I am currently going with something like this, what do you think?

Good idea using the original Chrome. I can't say more than that, because I am not sure whether https://google.com is what you intend to scrape. If that's the case, you'll need some more stealth-iness 😊
