
Most teams tune scrapers around code, not the network. The blockers you hit first are shaped by how the web is actually deployed: encryption everywhere, traffic that is heavily automated, and gatekeepers that sit in front of a large share of sites. Grounding your approach in measurable facts makes pipelines steadier and cheaper to run.
The shape of the web, by the numbers
Over 95% of page loads in major browsers occur over HTTPS. This means every request you make must negotiate TLS correctly and consistently. Fingerprints at the TLS and HTTP layers are now part of many defenses, so stable client settings and header order matter as much as your parser.
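A minimal sketch of what "stable client settings" can look like in practice, assuming the widely used Python requests library; the header values are placeholders, and the point is that one pinned set, in one order, is reused for the entire run rather than shuffled per request.

# One shared Session keeps cookies, connection pooling, and the HTTP-layer
# identity consistent across every request in a run. TLS behavior comes from
# the pinned library and OpenSSL versions, so freeze those in your build too.
import requests

# Placeholder values: what matters is that the same set, values, and
# insertion order are reused for the whole run.
BASE_HEADERS = {
    "User-Agent": "pin-one-user-agent-per-run",
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

def make_session() -> requests.Session:
    """Build one session per worker so client settings never drift mid-run."""
    session = requests.Session()
    session.headers.clear()               # drop the library defaults
    session.headers.update(BASE_HEADERS)  # insertion order is preserved
    return session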
Roughly half of global web traffic is automated, and about one third of all traffic is classified as malicious automation. Site owners operate with defensive assumptions by default. If your traffic behaves like the median bot, expect it to be treated accordingly.
About one in five websites sits behind a large reverse proxy and security edge. These intermediaries unify rate limits, fingerprinting, and reputation checks across many domains. If you only test against origin servers, your error profile in production will look very different, with more 403 and 429 responses.
A typical page draws 70 to 80 separate resources and weighs around 2 MB, so even simple collection tasks turn into multi-request workloads. Any retry policy multiplies bandwidth consumption quickly, which makes retries per resource as important to measure as page-level success.
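One way to make that measurement concrete, sketched below with a hypothetical fetch_with_retries helper: count the retries each resource consumes, since the average per resource is the multiplier on your bandwidth budget.

from collections import Counter

import requests

retry_counts = Counter()  # resource URL -> retries spent on it

def fetch_with_retries(session, url, max_retries=3):
    """Fetch one resource, recording how many retries it actually consumed."""
    for attempt in range(max_retries + 1):
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code != 429 and resp.status_code < 500:
                retry_counts[url] += attempt
                return resp
        except requests.RequestException:
            pass
    retry_counts[url] += max_retries      # exhausted the retry budget
    return None

def mean_retries_per_resource():
    """Average retries per resource across the run so far."""
    return sum(retry_counts.values()) / len(retry_counts) if retry_counts else 0.0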
More than 98% of websites use JavaScript. You do not need a browser for every job, but you do need to account for script-driven navigation and API calls. Mapping the minimum fetch set before you scale saves both bandwidth and block risk.
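Before committing a domain to a headless browser, it is worth probing whether the fields you need are already present in the server-rendered HTML. A rough check, assuming BeautifulSoup and a hypothetical CSS selector for the target field:

import requests
from bs4 import BeautifulSoup

def needs_browser(url, css_selector):
    """Return True if the raw HTML lacks the target element, which suggests
    the content is script-rendered and needs a browser or its backing API."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(css_selector) is None

# Hypothetical usage: "div.price" stands in for whatever field you extract.
# if needs_browser("https://example.com/item/123", "div.price"):
#     route this domain to the browser pool or call its JSON API directly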
Around 40% of user traffic now uses IPv6. Many networks and filters evaluate IPv6 differently from IPv4. Dual-stack readiness is a practical way to unlock capacity and reduce contention with crowded IPv4 ranges.
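A quick dual-stack probe using only the standard library: it checks whether a host both resolves and accepts a TCP connection over each address family, so you know IPv6 capacity is actually usable before routing traffic through it.

import socket

def reachable(host, family, port=443, timeout=5.0):
    """Try a TCP connection to host:port restricted to one address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    for af, socktype, proto, _, addr in infos:
        try:
            with socket.socket(af, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)
                return True
        except OSError:
            continue
    return False

print("IPv4:", reachable("example.com", socket.AF_INET))
print("IPv6:", reachable("example.com", socket.AF_INET6))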
Why identity quality beats volume
Because defensive layers work on reputation and consistency, your network identity is often more important than your concurrency budget. IP origin, ASN, and behavior over time are all live signals. Rotating through vast pools of unstable addresses produces noisy fingerprints and higher fail rates, while a smaller set of steady, consumer-grade addresses tends to pass heuristics that punish churn. When you need this style of identity, ISP proxies provide residential-like IP space with data center reliability, helping close the gap between lab and production behavior.
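A sketch of keeping identities steady instead of churning: each target domain is pinned to one proxy from a small, stable pool, so its reputation signals accrue to a consistent address. The pool endpoints below are placeholders for whatever stable addresses you use.

import hashlib
from urllib.parse import urlparse

# Placeholder endpoints; substitute your own small, stable pool.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.net:8080",
    "http://user:pass@proxy-2.example.net:8080",
    "http://user:pass@proxy-3.example.net:8080",
]

def proxy_for(url):
    """Deterministically map a URL's domain to one proxy in the pool."""
    domain = urlparse(url).hostname or ""
    digest = hashlib.sha256(domain.encode()).hexdigest()
    choice = PROXY_POOL[int(digest, 16) % len(PROXY_POOL)]
    return {"http": choice, "https": choice}

# With requests: session.get(url, proxies=proxy_for(url))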
Sizing runs with simple math
If the median page transfer is about 2 MB and you plan to fetch 1 million pages, you are budgeting roughly 2 TB of data just for successful first attempts. With a 15% retry rate at the resource level, that budget can swell by hundreds of gigabytes. Small gains in first-try success save real money and shrink run times, especially when your workload fans out across dozens of domains guarded by shared edges.
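The arithmetic above, written out so it can be re-run with your own numbers; units are decimal (1 TB = 1,000,000 MB).

# Back-of-the-envelope bandwidth budget for a run.
pages = 1_000_000
page_mb = 2.0        # median transfer per page, in MB
retry_rate = 0.15    # resource-level retry rate

first_try_tb = pages * page_mb / 1_000_000
retry_overhead_gb = pages * page_mb * retry_rate / 1_000

print(f"first-attempt budget: ~{first_try_tb:.1f} TB")      # ~2.0 TB
print(f"retry overhead:       ~{retry_overhead_gb:.0f} GB") # ~300 GB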
Concurrency should be set from observed tail latency, not averages. If your p95 time to first byte on guarded endpoints is 800 ms and total page assembly spans 70 resources, firehosing with thousands of parallel connections will only amplify 429 responses. Grow concurrency until the 429 rate ticks up, then back off and hold.
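One way to encode that policy is a small additive-increase, multiplicative-decrease controller keyed to the 429 rate in the latest window of responses; the thresholds and step sizes below are illustrative.

class ConcurrencyController:
    """Grow concurrency while a window is clean, back off when 429s tick up,
    and hold in between."""

    def __init__(self, start=10, floor=2, ceiling=200):
        self.limit = start
        self.floor = floor
        self.ceiling = ceiling

    def update(self, window_statuses):
        """Feed status codes from the last window; returns the new limit."""
        if not window_statuses:
            return self.limit
        rate_429 = window_statuses.count(429) / len(window_statuses)
        if rate_429 > 0.01:          # 429s ticking up: back off
            self.limit = max(self.floor, int(self.limit * 0.7))
        elif rate_429 == 0:          # clean window: probe upward
            self.limit = min(self.ceiling, self.limit + 2)
        return self.limit            # otherwise hold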
What to monitor continuously
Track status code mix with special attention to 403 and 429, handshake failures, and timeouts. Segment by ASN and IP family to spot noisy pools early. Watch for shifts in required subresources that indicate new script gates or bot checks. Tie these metrics to cost, so you can see when a new block pattern makes a target economically unattractive.
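A minimal aggregation sketch for those signals; the record fields (asn, ip_family, status, bytes) are assumptions about what your fetch layer already logs, with a status of None standing in for handshake failures and timeouts.

from collections import defaultdict

def summarize(records, cost_per_gb=0.0):
    """Roll per-request records up into block rate and cost per (ASN, IP family)."""
    out = defaultdict(lambda: {"total": 0, "blocked": 0, "failed": 0, "bytes": 0})
    for r in records:
        seg = out[(r["asn"], r["ip_family"])]
        seg["total"] += 1
        seg["bytes"] += r.get("bytes", 0)
        if r["status"] in (403, 429):
            seg["blocked"] += 1
        elif r["status"] is None or r["status"] >= 500:
            seg["failed"] += 1
    for seg in out.values():
        seg["block_rate"] = seg["blocked"] / seg["total"]
        seg["cost_usd"] = seg["bytes"] / 1e9 * cost_per_gb
    return dict(out)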
Closing notes
Scraping at scale is mostly about aligning with how the public web operates today: encrypted by default, guarded by intermediaries, and complex at the page level. Build from measurements, keep identities clean and consistent, and let the network’s realities set your pacing. The result is quieter traffic, fewer retries, and steadier output.
