Web Scraping Explained

Julius Černiauskas, CEO of Oxylabs, one of the world’s most prominent providers of web scraping solutions, sheds light on the practice, how it has developed over the years, and what lies ahead.

Let’s start from the beginning. How has web scraping evolved to its current state?

Web scraping, in some sense of the word, began with the first attempts to index pages scattered across the internet. While it wasn’t automated public data collection as we understand it now, crawling the web does much the same thing as scraping.

So, I would say that web scraping began almost with the advent of the World Wide Web, likely sometime in the early 90s, depending on who you ask. That may surprise some, as web scraping still seems like something completely novel, but the technology and its core principles have been with us for a very long time.

Naturally, many people would then ask what changes must have occurred for web scraping to go from something few knew about to nearly a household name. There are a few important steps in its evolution, I think, that impacted scraping both directly and indirectly.

First, the digitalisation of businesses has probably had the most significant impact. Businesses thrive on competition, and competition is all about knowing what to do next for the best results. It would follow that any information on industry peers would be of great benefit. As such, there has always been an implicit drive for data acquisition technologies to develop.

Second, as these technologies developed, they went from being accessible to a select few to being available to a much wider audience. Looking through the same competition lens, demand for public data acquisition spurred companies to improve these technologies and lower operating costs.

In fact, we were surprised by how deeply web scraping has entrenched itself in some industries. We recently conducted a survey of data management practices in the UK and US finance sectors, which revealed that web scraping is about as popular as internal data acquisition practices.

Finally, the prominence and importance of data in all fields, even outside business, have greatly increased. As a side effect, any collection method that promises to ease a usually complicated and drawn-out process gets attention.

So, could you expand upon how the perception of web scraping has changed over the years?

Of course. As I’ve mentioned, web scraping was long in the shadows because doing anything of the kind was extremely resource-intensive and complicated. In fact, many people, even today, don’t know that search engines are founded upon web scraping. They couldn’t exist otherwise.

As it moved towards mainstream consciousness, there was a certain air of suspicion, as with all new technologies. Unfortunately, that suspicion still hasn’t passed, partly because there is no industry-wide regulation or legislation. So, a lot of rules and best practices exist in somewhat of a gray area. We are doing our best to push web scraping out of it by sharing our extensive legal knowledge and ethical best practices.

So, now it exists at a midway point. On the one hand, it’s widely accepted for business use. On the other hand, public perception is still far from how we see it internally, as it’s still considered something slightly unusual.

You’ve mentioned the complicated legal side of web scraping. Could you expand a bit on that?

There has been no direct legislation on web scraping. We closely follow case law and data-related regulations and legislation. These, however, are not exhaustive, which makes the entire process quite fraught with challenges.

The case law so far has pointed us towards an understanding that most publicly accessible data is fair game for web scraping. Even that, however, comes with numerous caveats, ifs, and buts. For example, data containing personally identifiable information is very likely to fall under GDPR, so even if it is publicly accessible, its collection would be subject to additional privacy law requirements.

Then come all the copyright and intellectual property laws and regulations, which, again, may forbid the usage of such materials for commercial purposes without acquiring a license. What falls under copyright law is also quite a complicated subject, making web scraping something akin to navigating a dark maze with lots of traps strewn about.

Recent rulings have stated much the same as we knew already. Some outlets ran dramatic headlines claiming that all web scraping had been legitimised by the hiQ Labs v. LinkedIn case, but, as our Head of Legal Denas Grybauskas put it, the real result was much milder: the decision essentially said that a certain US computer hacking law couldn’t be applied to cases of web scraping.

There’s really only one best practice: to have a legal professional, or a team of them, on hand at all times. Each new source or idea should be carefully reviewed and analysed by legal professionals to ensure no potential risks are involved.

We understand how murky and complicated the legal landscape of web scraping is, so we’re doing our best to share everything we know by speaking at conferences and providing commentary. We press onwards for more clarity regarding legal matters by promoting the idea of self-regulatory practices.

So, even after all of these risks and difficulties, web scraping is still worthwhile to businesses?

Certainly. In fact, some business models depend upon web scraping for their entire existence. A great example would be various price comparison websites, including travel and hotel fare aggregators, such as trivago. There’s no way to run such a business other than web scraping. Costs would be prohibitive if manual data collection were the backbone of the business.

Other businesses significantly improve their decision-making through web scraping. Ecommerce and retail companies, for example, frequently employ such practices for pricing and business intelligence. In a sector where competition is incredibly fierce, every bit helps.

For-profit organisations aren’t the only ones benefiting from web scraping either. There have been many great uses of web scraping that have turned out to further social good. For example, the Billion Prices Project, a macroeconomic research initiative, used web scraping to collect product pricing data at a huge scale, allowing its researchers to measure inflation.
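
To make the inflation example concrete, here is a minimal Python sketch of how scraped prices could be turned into a price index. It is an illustration only, not the Billion Prices Project's actual methodology: the function name `jevons_index`, the product names, and the prices are all hypothetical, and the index used (a Jevons index, the geometric mean of price ratios for products seen in both periods) is one common elementary choice assumed here for simplicity.

```python
from math import exp, log


def jevons_index(base_prices, current_prices):
    """Return the geometric mean of price ratios (current / base).

    Only products present in both periods are compared. Both arguments
    are dicts mapping a product name to a positive price.
    """
    common = [name for name in base_prices if name in current_prices]
    if not common:
        raise ValueError("no overlapping products between the two periods")
    # Average the log price ratios, then exponentiate: a geometric mean.
    log_sum = sum(log(current_prices[n] / base_prices[n]) for n in common)
    return exp(log_sum / len(common))


# Hypothetical scraped prices for two consecutive months.
january = {"milk": 1.00, "bread": 2.00, "eggs": 3.00}
february = {"milk": 1.10, "bread": 2.00, "eggs": 3.30, "rice": 1.50}

index = jevons_index(january, february)
inflation_pct = (index - 1) * 100  # month-over-month change in percent
print(f"Estimated price change: {inflation_pct:.2f}%")
```

Products that appear in only one period (rice, in this made-up data) are skipped. A real pipeline would also have to deal with matching identical products across retailers, weighting categories, and missing observations, none of which is attempted here.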

In the end, web scraping is a tool for public data collection. As data has always been incredibly useful, there’s no reason to believe web scraping is going to disappear any time soon. Our goal at Oxylabs, right now at least, is to further expand the industry through innovation and to bring it closer to complete legitimacy. Everyone should be able to benefit from web scraping without fear.