Web scraping is still a mystery to most, says Aleksandras Šulženko, Product Owner at Oxylabs.io. Unravelling the mystery can reveal new vistas of growth and development.
If you've ever had to copy data from one sheet or system to another, you've done something close to web scraping. Web scraping's primary goal is to move data from the web to storage. However, other processes have to happen to make the data useful, and some of these are often bundled under the term web scraping as well.
Web scraping in a few simple steps
It all begins with the desire to acquire data from external sources. Usually, these sources are websites. Businesses extract publicly available data from ecommerce websites, search engines, and other pages where important information may be stored. For example, an ecommerce business might want to know the pricing of its competitors' products in order to develop specific strategies.
While it can be done manually, automation is preferred. Automation usually means preparing a script in a preferred programming language such as Python, which is commonly used because it has numerous publicly available libraries that aid web scraping.
Usually, the script is written so that it acts like a regular internet user, accessing a specific URL much as a browser would. Once there, the script downloads the source code as HTML. All of this may be repeated as many times as necessary.
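As a minimal sketch of that fetching step, here is roughly what it might look like in Python using the popular requests library. The URL and headers below are placeholders, not a real target:

```python
import requests

URL = "https://example.com/product/123"  # hypothetical target page

# A browser-like User-Agent header is one way the script presents
# itself as a regular internet user.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(URL, headers=headers, timeout=10)
response.raise_for_status()  # stop on 4xx/5xx responses

# Store the page's source code (HTML) locally for later processing
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```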
Life after scraping
Once the HTML is downloaded, the information on the page in question is stored locally. A different script can then search through the HTML for relevant data. For example, if a product's price is required, the script will extract precisely that part and nothing else.
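To illustrate, here is a sketch of that extraction step using the Beautiful Soup library, assuming (purely hypothetically) that the price sits in a span element with the class 'price'; real pages will differ:

```python
from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Hypothetical structure: the price lives in <span class="price">
price_tag = soup.find("span", class_="price")
if price_tag is not None:
    print("Product price:", price_tag.get_text(strip=True))
```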
Unfortunately, the process isn't as simple as it may seem. HTML is a markup language used to create beautiful visuals in browsers, which makes it a terrible candidate for data analysis. Therefore, data extracted from HTML files needs to be normalised to become valuable - a process called parsing.
Parsing is an addition to the scraping code. It is often considered one of the most resource-intensive tasks, as a lot of developer time is needed to write parsers for all page types. In addition, parsers are prone to breaking because they are developed with a particular layout and HTML structure in mind. If the structure changes, the parser breaks. It's such a prevalent issue in web scraping that we provide our own machine-learning-based adaptive parsing solution.
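Short of a machine-learning solution, one common defensive pattern is to try several known selectors in order, so that a single layout change doesn't silently break the whole pipeline. This is only an illustrative sketch; the selectors are invented:

```python
from bs4 import BeautifulSoup

# Hypothetical selectors covering past and present page layouts
PRICE_SELECTORS = ["span.price", "div.product-price", "meta[itemprop=price]"]

def extract_price(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        tag = soup.select_one(selector)
        if tag is not None:
            # <meta> tags carry the value in an attribute, not in text
            return tag.get("content") or tag.get_text(strip=True)
    return None  # nothing matched: the parser needs updating
```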
Finding out you've been blocked
However, everyone with aspirations to make it in web scraping quickly runs into one issue: websites differentiating bots from human visitors. Scripts are bots that run through a list of actions, and website owners and administrators will see them as such. Most administrators will ban the offending IP address without delving deeper.
Getting blocked means losing access to data. That happens even to those with the best intentions. Therefore, workarounds need to be used as blocks are essentially inevitable when web scraping is performed at scale. While there are ways to increase the survival time of an IP address, replacing it is the easiest solution. Of course, that's where proxies come in.
By and large, there are two types of proxies: residential and datacentre. The former are IP addresses assigned by ISPs to regular consumer devices. The latter are IP addresses hosted in datacentres.
Datacentre proxies are significantly faster and more reliable than residential ones. However, they do have the drawback of being allocated and used in sequential address chunks called subnets, making their usage easier to detect.
On the other hand, residential proxies are harder to acquire, slower, and maintaining the same IP address for a long period, where needed, is more difficult. However, when they are used, it's nigh on impossible to detect whether the website is being accessed by a regular internet user or not.
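As a simple illustration of how proxies stand in for a blocked or at-risk IP address, here is a sketch of rotating requests through a pool; the endpoints are placeholders, not real servers:

```python
import itertools
import requests

# Placeholder proxy endpoints, not real servers
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

def fetch_via_proxy(url: str) -> str:
    proxy = next(PROXY_POOL)  # take the next address in the rotation
    response = requests.get(
        url, proxies={"http": proxy, "https": proxy}, timeout=10
    )
    response.raise_for_status()
    return response.text
```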
Ethical and legal intricacies
For some, the idea that data is being collected en masse, even if it is publicly available, creates an uneasy feeling - as if some personal space had been invaded. However, ethical web scraping solutions and proxy providers have a lot of restrictions. Unfortunately, as there are no industry-wide regulations, these restrictions are largely self-imposed.
One of the first steps a web scraping solution provider or user should take is to ensure that the proxies are acquired ethically. As datacentre proxies are usually acquired from businesses or similar entities, little stands in the way of ethical acquisition. However, residential proxies are a completely different beast.
Residential proxies have to be acquired directly from the owners of the devices. Ethical procurement, in our eyes, means that the users whose devices become residential proxies understand the process and get rewarded for their internet use. While we hope that at least a similar mindset will take hold within the industry, other providers have different ideas.
Finally, there are some legal precedents that have provided web scraping with guidelines. Established industry practice says that limiting oneself to publicly accessible data and consulting a legal professional is the way to go. Private and personal data are protected under the GDPR, the CCPA, and regional legislation across the globe. Scraping such information is bound to cause immense harm.
Where does the data end up?
A simple application of web scraped data lies within the ecommerce sector. Dynamic pricing strategies (i.e. automatically adjusting prices based on the actions of competitors or other factors) rely on real-time data acquisition. Such data is then fed into a pipeline that determines the optimal changes according to a ruleset.
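As a toy illustration of such a ruleset (the rule and the numbers are invented for illustration, not a real pricing strategy):

```python
# Toy rule: undercut the lowest competitor by one cent,
# but never drop below a set floor price.
def reprice(our_price: float, competitor_prices: list[float],
            floor: float, undercut: float = 0.01) -> float:
    if not competitor_prices:
        return our_price  # no fresh data, keep the current price
    target = min(competitor_prices) - undercut
    return round(max(target, floor), 2)

print(reprice(19.99, [18.49, 21.00, 17.95], floor=15.00))  # -> 17.94
```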
However, the usefulness of web scraped data doesn't end with completely automated integrations. Even the same ecommerce business can have more use cases. For example, some companies gather customer reviews and comments to measure sentiment on brands, products, or services. These are then used in R&D departments for development purposes or fine-tuning marketing strategies.
Businesses in other industries have been using web scraping at immense scale as well. Cybersecurity has benefited exceptionally from web scraping. There are numerous ways to protect regular users and brands through automated data collection. Cybersecurity companies will scan enormous numbers of websites and marketplaces in order to detect potentially illegal or malicious activity such as piracy, phishing, or data breach dumping.
Companies in the financial sector have derived utility from web scraping as well. Ever since the "Correlating S&P 500 stocks with Twitter data" study showed promising results for market predictions, interest in web scraping and other forms of automated collection has gained steam. Nowadays, a majority of businesses in the financial sector utilise some form of web scraped data.
Yet, web scraping isn't used just for profit or academic research. Numerous projects have used web scraping to further the public good. "Copy, Paste, Legislate" is a great example: it used web scraping to discover nearly identical legislation being pushed across various US states by special interest groups. Meanwhile, as part of our social responsibility initiative, we have partnered with the Lithuanian government to create a tool that uses web scraping and AI to detect highly illegal content.
Conclusion
Web scraping has become a modern necessity for staying competitive in business, helping organisations utilise data to track trends and strategise for the future. The data can be used in real time to keep pricing in line with rival companies, or to track the misuse of data and illegal sales.
The legitimate web scraper can access and use the latest information - and this can be applied in all areas, from commerce and crime prevention right through to helping society unify and become stronger through a shared vision, exposed through data analysis. Data is one of the most valuable commodities in the modern world. We must be careful in both how we acquire it and how we use it.