For the uninitiated, web scrapers are applications that can connect to the internet and download a web page, just like a human user downloading it to their browser. ‘Price scrapers’ is the name given to scrapers that connect to eCommerce websites to download the page, go through the textual content and determine the price of the product.
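To make that concrete, here is a minimal sketch of the "go through the textual content and determine the price" step. It skips the download and works on a hard-coded sample page. The names `TextExtractor`, `extract_price` and `SAMPLE_PAGE`, as well as the currency pattern, are all made up for illustration. Real product pages are messier and usually need site-specific parsing.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a page and drops the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_price(html):
    """Return the first price-looking number in the page text, or None."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Assumed pattern: a currency marker followed by digits and commas.
    match = re.search(r"(?:Rs\.?|₹|\$)\s*([\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Hypothetical product page; a real scraper would download this instead.
SAMPLE_PAGE = (
    "<html><body><h1>Example Phone</h1>"
    "<span class='price'>Rs. 12,499</span></body></html>"
)

print(extract_price(SAMPLE_PAGE))
```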
For all site owners, scrapers are a pain in the ass.
Search engines also use scrapers, a.k.a. crawlers, to download the pages on your website. No doubt, they’re useful. But crawling is not the kind of activity anybody performs every 20 minutes. So let’s isolate crawlers from scrapers and we’ll see that:
- Scrapers create unwanted, unsolicited, unwelcome traffic
- They’re barely useful to the site owners.
- They consume resources like RAM and CPU by making multiple requests
- They steal information. Yes, it’s stealing, because of the scale at which it happens
For these reasons, site owners often block such scrapers using a few techniques:
- They determine the source (i.e. the client’s) IP
- They check the frequency at which that IP is accessing the site
- If the frequency goes beyond a threshold level (say 25 requests per minute or so), they block the IP and redirect it to some other page
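The three steps above can be sketched as a small per-IP sliding-window counter. This is only an illustration, not how any particular site does it. The class name `RateLimiter`, its parameters and the in-memory storage are assumptions; the default threshold is the "say 25 requests per minute" figure from the list.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Blocks an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=25, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()            # IPs that crossed the threshold

    def allow(self, ip, now=None):
        """Record a request from ip; return False if it should be blocked."""
        if ip in self.blocked:
            return False
        now = time.monotonic() if now is None else now
        window = self.hits[ip]
        window.append(now)
        # Forget requests that have fallen out of the time window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            self.blocked.add(ip)
            return False
        return True
```

A request handler would call `allow(client_ip)` on every hit and send the client to the "other page" whenever it returns `False`. The `now` argument exists only so the behaviour can be tried out without waiting on a real clock.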
Recently I came across another technique. I’m not sure whether it’s unique in the industry, but I found it in use by Snapdeal.
It was a special case of `Geographical IP` based blocking, where the site owners:
- Whitelisted target country’s IPs
- Blacklisted all IPs belonging to hosting providers and proxy sites from the target country
- Did the same, specifically, for hosting providers and proxy sites from other countries
- Non-human clients are redirected to a stale page containing stale information (pricing, or offers, or whatever)
- But will human clients from other countries still be able to access the site directly?
- Not sure. We used a VPN and were still served the stale website.
- So I can say that either they have blacklisted VPNs, or they have probably blacklisted every non-Indian IP
Geo IP based blocking is nothing new, but the way it has been used here is unique.
Here, they have selectively targeted Proxy Sites and Hosting providers.
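Here is a rough sketch of what such a filter could look like, using Python’s standard `ipaddress` module. None of this comes from Snapdeal. The function name `classify` and the address ranges are placeholders (reserved documentation ranges), and the sketch assumes the stricter reading where every IP outside the target country also gets the stale page. A real setup would load the ranges from a GeoIP database and a list of known hosting and proxy networks.

```python
import ipaddress

# Placeholder ranges, NOT real allocations.
TARGET_COUNTRY_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
]
HOSTING_AND_PROXY_RANGES = [
    ipaddress.ip_network("198.51.100.128/25"),  # hosting provider inside the target country
    ipaddress.ip_network("203.0.113.0/24"),     # proxy service in another country
]


def classify(ip):
    """Return 'live' for the real page, 'stale' for the stale copy."""
    addr = ipaddress.ip_address(ip)
    # The blacklist is checked first, so it overrides the country whitelist.
    if any(addr in net for net in HOSTING_AND_PROXY_RANGES):
        return "stale"
    if any(addr in net for net in TARGET_COUNTRY_RANGES):
        return "live"
    # Assumption: anything outside the target country is also served stale data.
    return "stale"


print(classify("198.51.100.10"))   # ordinary IP in the target country
print(classify("198.51.100.200"))  # hosting provider in the target country
```

Checking the blacklist before the whitelist is what makes this selective: a scraper running on a server inside the target country still ends up on the stale page.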
But there is another possible explanation as well.
It could be mere differential pricing for other geographies, which I’m misinterpreting as filtering.
Amazon is known to do this kind of differential pricing. amazon.com accessed directly from India and accessed through a VPN will give you different prices, the latter being lower.