ScriptVeda
Web Scraping · June 24, 2026 · 5 min read

How we scrape large sites at scale without getting blocked

Pulling a hundred rows off a website is a weekend project. Pulling a few million every week, reliably, is a different sport. Here's what actually keeps a large scraper alive.

ScriptVeda Team
Author

The first scraper anyone writes works beautifully. You point it at a page, the data comes back, you feel like a genius. Then you let it run against a few hundred thousand pages, and somewhere around the one-hour mark the site stops answering, or worse, starts handing you polite little block pages instead of data. Getting a hundred rows is a weekend project. Getting a few million every week, reliably, is a different sport.

Here is what actually keeps a large scraper alive, based on the ones we run in production.

Blocking is not one problem. It is about five.

People say "I got blocked" as if it were a single event. It almost never is. A site decides you are a bot from a stack of smaller signals: how fast you ask, what your requests look like, where they come from, whether you run JavaScript like a real browser, and whether your behaviour matches a human's. Fix one and ignore the rest, and you still get caught. You have to handle all of them together.

Proxies: where your requests appear to come from

If thousands of requests pour out of a single IP address, you are done. Proxies spread that traffic across many addresses so no single one looks suspicious. The honest tradeoff is between two kinds:

  • Datacenter proxies are cheap and fast, and easy for big sites to spot and ban in bulk. Fine for softer targets.
  • Residential proxies route through real consumer connections, so they look like ordinary people browsing. They cost more, and on tough sites they are the only thing that works.

On the hard jobs we lean on residential pools from providers like Bright Data and Zyte, and rotate through them so the traffic reads as a crowd rather than one very busy robot.
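To make the rotation idea concrete, here is a minimal sketch of a round-robin pool that skips addresses a site has already banned. It is an illustration, not our production code: the `ProxyPool` class, its method names, and the proxy addresses are all made up for the example (203.0.113.0/24 is a range reserved for documentation).

```python
import itertools


class ProxyPool:
    """Round-robin over proxy URLs, skipping ones marked as banned."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._proxies = list(proxies)
        self._banned = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        # Walk the cycle at most once around, looking for a healthy proxy.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("every proxy in the pool is banned")

    def mark_banned(self, proxy):
        # Call this when a proxy starts returning blocks or captchas.
        self._banned.add(proxy)


# Placeholder addresses, not real proxies.
pool = ProxyPool([
    "http://203.0.113.10:8000",
    "http://203.0.113.11:8000",
    "http://203.0.113.12:8000",
])
```

Each call to `pool.next_proxy()` returns the next usable address, which you would then hand to your HTTP client (with the `requests` library, for instance, through its `proxies` argument).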

Look like a browser, because half the web now demands one

A lot of modern sites build their content with JavaScript, so a plain HTTP request gets you an empty shell. That is where headless browsers come in. We drive real browser engines with Playwright and Selenium so the page loads exactly as it would for a person, including the data that only appears after the scripts run. The same tools let us set believable headers and user agents, so the request does not announce itself as a script in its very first line.
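As a rough sketch of what that looks like with Playwright's synchronous Python API: load the page in headless Chromium, wait for the scripts to settle, and read back the rendered HTML. The helper names `browser_context_options` and `render_page` are ours for this example, and the user agent, locale, and viewport are sample values you would tune per target.

```python
def browser_context_options():
    """Context settings meant to resemble an ordinary desktop browser."""
    return {
        # Example desktop Chrome user agent; an assumption, keep it current.
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "locale": "en-US",
        "viewport": {"width": 1366, "height": 768},
        "extra_http_headers": {"Accept-Language": "en-US,en;q=0.9"},
    }


def render_page(url, timeout_ms=30_000):
    """Return the page HTML after JavaScript has run."""
    # Imported here so this module still loads where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(**browser_context_options())
            page = context.new_page()
            # "networkidle" waits for network activity to quiet down, a
            # simple heuristic for "the scripts have finished loading data".
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Calling `render_page("https://example.com")` needs Playwright and its Chromium build installed. On pages that never go quiet, waiting for a specific element is usually the more dependable signal than `networkidle`.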

Go slow on purpose

The fastest scraper is usually the one that gets banned first. Hammering a site flat out is both the rudest and the dumbest approach. We add delays between requests, cap how many run at once, and spread the load so the site barely notices us. A scraper that finishes in four hours and never gets blocked beats one that races for forty minutes and then dies for a day. Slower is faster once you count the retries.
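Here is one way the two knobs, delay and concurrency cap, can be sketched with `asyncio`. The function name `polite_gather` and its defaults are illustrative assumptions, and `fetch` stands in for whatever coroutine actually downloads a URL.

```python
import asyncio
import random


async def polite_gather(urls, fetch, max_concurrency=4,
                        min_delay=1.0, max_delay=3.0):
    """Fetch every URL with a concurrency cap and a random pause per request."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        # At most max_concurrency workers get past this line at once.
        async with semaphore:
            # A jittered pause, so requests do not arrive on a fixed beat.
            await asyncio.sleep(random.uniform(min_delay, max_delay))
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(url) for url in urls))
```

With the defaults above, four workers each pausing two seconds on average work out to roughly two requests per second before download time is counted. The right numbers depend entirely on the target, so treat these as a starting point rather than a recommendation.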

Assume the site will change, because it will

Websites get redesigned. A class name moves, a layout shifts, and a scraper that was perfect yesterday now quietly collects garbage. The dangerous failure is not the loud crash, it is the silent one where you keep getting rows that are subtly wrong. So we validate as we go: check that prices look like prices, that required fields are present, that today's volume is in the same ballpark as yesterday's. When something drifts, we want an alert, not a nasty surprise three weeks later.
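The checks described above can be quite small. The sketch below assumes a hypothetical row shape with `url`, `title`, and `price` fields and a 30% day-over-day tolerance; both are placeholders to adapt, not values from a real project.

```python
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("url", "title", "price")  # hypothetical schema


def row_problems(row):
    """Return a list of human-readable problems; empty means the row passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            problems.append(f"missing {field}")

    price = row.get("price")
    if price:
        try:
            value = Decimal(str(price))
            if not value.is_finite() or value <= 0:
                problems.append(f"price out of range: {price!r}")
        except InvalidOperation:
            # Typical symptom of a moved selector: "N/A", "Add to cart", etc.
            problems.append(f"price not numeric: {price!r}")
    return problems


def volume_drifted(today_count, yesterday_count, tolerance=0.3):
    """True when today's row count moved more than `tolerance` vs yesterday."""
    if yesterday_count == 0:
        return today_count != 0
    change = abs(today_count - yesterday_count) / yesterday_count
    return change > tolerance
```

A row that passes returns an empty list from `row_problems`. In a pipeline you would count the failing rows per run and raise an alert when that count, or `volume_drifted`, crosses a line you have agreed on in advance.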

Store the raw page first, parse it later

One habit that saves projects: save the raw response before you extract anything from it. Parsing logic changes, you spot a field you missed, the site shifts. If you kept the raw data, you can re-run extraction without re-scraping the whole internet. If you only kept the parsed output, a mistake means starting from zero. Storage is cheap. Re-scraping a few million pages is not.
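A minimal version of that habit, using only the standard library: write each response to a compressed file keyed by a hash of its URL, and run extraction as a separate pass over those files. The function names and the one-file-per-URL layout are assumptions made for the sketch. At real volume you would more likely use object storage and keep dated snapshots, since this layout overwrites the previous copy of a URL.

```python
import gzip
import hashlib
import json
import time
from pathlib import Path


def save_raw(raw_dir, url, body, fetched_at=None):
    """Write the untouched response text to disk before any parsing."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = raw_dir / f"{key}.json.gz"
    record = {
        "url": url,
        "fetched_at": fetched_at or time.time(),
        "body": body,
    }
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(record, f)
    return path


def iter_raw(raw_dir):
    """Yield every stored record so extraction can be re-run offline."""
    for path in sorted(Path(raw_dir).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            yield json.load(f)
```

When the parsing logic changes, you loop over `iter_raw(...)` with the new extractor and never touch the target site again.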

The boring truth

None of this is magic. Scraping at scale is mostly discipline: spread your requests, behave like a browser, slow down, watch for changes, and keep your raw data. The clever one-liner you found online will get you a hundred rows. Everything above is what turns it into a pipeline that delivers clean data every week without anyone babysitting it.

That unglamorous machinery, the part that just keeps working, is what we build for clients. If you have a source you need pulled reliably and at volume, we are glad to take a look.
