
Issue 24 - Navigating the digital frontier: Unlocking the potential of web scraped data for official statistics


This blog post aims to demystify the complex landscape of web scraping for official statistics, highlighting both its potential and the challenges it entails. Its goal is to foster an informed and engaged community of professionals dedicated to harnessing the power of web data.

Web scraping has become a popular way of collecting large amounts of data in many fields of scientific research [1]. This development has also led to the increased use of this type of data, widely referred to as big data, in official statistics. Over the past decade, much work has been done in this area, both at the national level, within a considerable number of national statistical institutes (NSIs), and at the international level, particularly by Eurostat (see [2] and the references therein). This blog presents and discusses current problems related to accessing, extracting, and using information from websites, along with potential solutions. The authors' real-world experience in conducting small- and large-scale web scraping studies on various topics is used to illustrate both the advantages and the downsides of using web-scraped data.

The promise and peril of web scraping

Web scraping is a way to collect vast amounts of data at reasonably low cost. Data on the web can be collected once or at regular intervals and used to study various topics, such as labour market trends from online job postings, real estate price fluctuations, or comprehensive price indices. As long as data on the topic studied is available on the internet, it can potentially be used to produce insightful and timely statistics.

However, depending on the topic investigated, there is a major downside: not all data needed may be available on the web, not all data needed may be accessible, and the data collected may not represent the target population well. The latter is essential for the production of reliable, trustworthy statistics (see [3], Chapter 3). Hence, the organization collecting the data must ensure that web data is obtained for a group that is representative of the population for which statistics are to be produced. This essential starting point can already be compromised by selectivity in the types of websites scraped. Since the internet is highly dynamic and access to particular websites may fluctuate daily, it is essential to carefully monitor the quality of the data, and the population covered, during and after the data collection phase. In addition, one has to check whether it is legally allowed to collect web data for the purpose at hand (see [4] for more details). To make the best use of web-scraped data, several technical, legal, and methodological challenges need to be dealt with: handling inaccessible websites, navigating legal and ethical considerations, and ensuring data reliability and representativeness.
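To make the representativeness concern concrete, below is a minimal Python sketch that compares the composition of a scraped dataset with that of a known target frame (for example, a business register). The group names and counts are purely hypothetical, for illustration only.

```python
import pandas as pd

# Hypothetical counts per region: the target frame (e.g. a business register)
# versus the units actually found and scraped on the web.
frame = pd.Series({"North": 400, "East": 300, "South": 200, "West": 100})
scraped = pd.Series({"North": 260, "East": 150, "South": 60, "West": 30})

coverage = (scraped / frame).rename("coverage_rate")
share_gap = (scraped / scraped.sum() - frame / frame.sum()).rename("share_gap")

# Strongly varying coverage rates across groups signal selectivity that has to
# be corrected (e.g. by weighting) before producing official statistics.
print(pd.concat([coverage, share_gap], axis=1))
```

Even a check as simple as this makes selectivity visible early: in the hypothetical figures above, the "West" region is clearly under-covered relative to the frame.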

Dealing with these challenges

Addressing this broad range of challenges requires a multifaceted approach. First, employing big data tools and techniques can significantly enhance the efficiency and scope of web scraping efforts. It helps to use a scraping approach that can collect data from a whole range of websites without being blocked immediately. At the same time, identifying the web scraping tool used, for example via the User-Agent header, as an official scraper of a national statistical institute should be beneficial in the long run. In that sense, it is important to clarify the legal situation of the country under study, and of the NSI itself, before starting the data collection process. Once it is established that web scraping may be applied, this opens up the automated collection of huge amounts of data, followed by the subsequent analysis of that data. Be aware that this usually requires an IT infrastructure (and software) that can handle such volumes. To give an indication of scale: collecting up to 200 pages for each of 500,000 web domains can easily result in a set of 8 million web pages comprising, in total, around 1 TB of data.
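As a small illustration of this practice, the Python sketch below fetches a page while openly identifying the scraper through the User-Agent header and checking the site's robots.txt first. The bot name, URL, and contact address are hypothetical placeholders, not an actual NSI identifier.

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests

# Hypothetical User-Agent string: name, URL, and contact address are
# placeholders, not an official NSI identifier.
USER_AGENT = "ExampleNSIBot/1.0 (+https://nsi.example.org/scraping; webdata@nsi.example.org)"


def allowed_by_robots(url: str) -> bool:
    """Check the site's robots.txt before fetching, as a polite scraper should."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return False  # be conservative when robots.txt cannot be retrieved
    return robots.can_fetch(USER_AGENT, url)


def fetch(url: str) -> str | None:
    """Fetch a page while openly identifying the scraper via the User-Agent header."""
    if not allowed_by_robots(url):
        return None
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    html = fetch("https://example.com/")
    print(len(html) if html else "disallowed or unreachable")
```

Announcing the scraper's identity and honouring robots.txt may cost some coverage in the short term, but it supports the long-run goal of websites recognizing, and eventually whitelisting, official statistical scrapers.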

In addition, it is crucial to develop sound methodologies to estimate the total population of relevant websites and to assess the representativeness of the scraped data. These methodologies must be able to correct for the dynamic nature of the web, ensuring that the data remains relevant and reliable over time. Here, it helps to let go of the mindset, common in NSIs, of collecting data for (small) samples [5] and instead collect data for the entire population; the latter approach is less affected by the dynamic nature of the internet. Alternative approaches are: i) looking for other sources of data, such as administrative data or commercially collected web data, and ii) collaborating with other national or international organizations active on the topic studied (see the next point).
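One possible way to estimate the total number of relevant websites, offered here purely as an illustration and not as the method used by any particular NSI, is capture-recapture estimation based on two independent collection rounds. The figures in the sketch below are hypothetical.

```python
def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Chapman-corrected Lincoln-Petersen estimate of the total population size.

    n1, n2  -- relevant websites found in two independent collection rounds
    overlap -- websites found in both rounds
    """
    # Chapman's correction avoids division by zero and reduces small-sample bias.
    return (n1 + 1) * (n2 + 1) / (overlap + 1) - 1


# Hypothetical figures: 12,000 and 11,500 relevant sites found in two rounds,
# 9,200 of which were found in both.
print(round(lincoln_petersen(12_000, 11_500, 9_200)))  # roughly 15,000
```

More refined estimators exist, but even a simple check like this gives an order-of-magnitude sense of how much of the target population a scrape has actually covered.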

Collaboration and knowledge sharing among NSIs and international bodies like Eurostat are also key. By sharing best practices, tools, and experiences, institutions can more effectively navigate the complexities of web scraping, overcoming common obstacles and harnessing the full potential of web data for official statistics. 

The road ahead

As we look to the future, the role of web scraping in the production of official statistics will undoubtedly continue to grow. Its potential to provide timely, cost-effective insights makes it an invaluable tool in our digital age. It has the additional advantages of considerably reducing the response burden on persons and companies and of enabling more detailed statistics on new topics. However, realizing these benefits requires ongoing efforts to address the challenges described above.

It is, therefore, essential that NSI employees stay informed about the latest developments in using web-scraped data and remain engaged in collaborative efforts. Only by adopting innovative solutions can one navigate the digital frontier more effectively, unlock new insights, and enhance the quality and relevance of official statistics.

In conclusion, web scraping offers a promising avenue for enriching official statistics with timely and relevant data. By embracing its potential while thoughtfully addressing its challenges, statistical institutes can enhance their ability to inform policy, guide decision-making, and contribute to societal progress.

 

References:

1. Wikipedia (2024). Web scraping page.

2. Daas, P. & Maslankowski, J. (2023). Current challenges and possible big data solutions for the use of web data as a source for official statistics. Wiadomości Statystyczne. The Polish Statistician, 68(12), 49-64.

3. Kaptein, M. & van den Heuvel, E. (2022). Statistics for Data Scientists: An Introduction to Probability, Statistics, and Data Analysis. Springer.

4. Apify blog (2024). Is web scraping legal?

5. Molnar, C. (2022). Modeling Mindsets: The Many Cultures of Learning From Data.

Published June 2024
