Issue 26 - Online prices of household appliances and audio-visual, photographic equipment - update

Introduction

As the final project year for the ESSNET WIN project is coming to an end, I would like to touch on a specific phenomenon that Statistics Sweden has encountered during the last few months of exploring data for Use Case 3. The data consists of web-scraped data and cash register data collected throughout 2023. 

 

The initial idea for this use case was to combine the data sources and see if there is a way to increase the value of the data beyond the initial benefits of a streamlined gathering process with reliable data for one company. By analysing sales quantities in relation to online popularity, we aimed to apply these relationships to companies where only web-scraped data is accessible, ultimately estimating sales volumes.

 

Our final report, due later this month, will show our findings. However, for this post, I would like to discuss something we discovered during the process. 

 

The Challenge: Dynamic Product Availability

The initial idea when setting up the scrapers was to web-scrape all products in two specific product groups: TVs and laptops. The plan was then to retain only a sample of the products that remained available online throughout the entire timeframe. 

 

What we came to realise, though, is that the products within our chosen product groups are rotated so frequently that none of them stayed available online for the entire timeframe. Even when we limited the scope to as little as seven weeks, the sample shrank to zero products for both product groups and both companies.

 

We have addressed this issue by changing our approach. Instead of following a fixed sample and performing what we would call classical index calculations for the Consumer Price Index (CPI), we calculate the average price development for the different groups. This lets us see which method comes closest to the actual average derived from the cash register data.
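To make the contrast concrete, here is a minimal sketch on entirely made-up prices. It is an illustration of the general idea only, not our production code: the "classical" calculation is represented by a simple Dutot-style matched-sample ratio, and the alternative is the average price over whatever products are available in each period. Product names, prices, and the choice of index formula are all assumptions for the example.

```python
# Illustrative (made-up) data: product -> price per period, None = not
# available online that period. Products "a" and "b" each drop out once,
# so requiring availability in every period shrinks the matched sample.
prices = {
    "a": [100.0, 105.0, None],
    "b": [200.0, None, 210.0],
    "c": [50.0, 55.0, 60.0],
}

# Classical matched-sample approach: keep only products priced in all periods.
matched = {p: q for p, q in prices.items() if all(v is not None for v in q)}

def matched_index(sample, base=0, period=-1):
    """Dutot-style index: ratio of average prices over the matched sample."""
    base_mean = sum(q[base] for q in sample.values()) / len(sample)
    cur_mean = sum(q[period] for q in sample.values()) / len(sample)
    return 100 * cur_mean / base_mean

# Average-price approach: mean over whatever is available each period.
def average_price(period):
    available = [q[period] for q in prices.values() if q[period] is not None]
    return sum(available) / len(available)

print(sorted(matched))         # -> ['c'] : only one product survives matching
print(matched_index(matched))  # -> 120.0

avg_development = 100 * average_price(2) / average_price(0)
print(round(avg_development, 1))  # -> 115.7
```

Note how the two approaches disagree here precisely because the composition of available products changes between periods, which is the effect the cash register data lets us check against.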

 

A Broader Discussion: Data Gathering Dilemmas

This issue could have been avoided if we had initially set out to web scrape and store enough information to fit a multivariate regression model accounting for changes in the sample. At Statistics Sweden, our general policy is to collect only the data we need. This is a sound approach overall, since it limits the burden on data providers, among other benefits. However, when exploring modern data sources, it has become clear that we don't always know the value of data in advance, or how we will later be able to, or want to, use it. Traditional statistical production typically doesn't involve gathering exhaustive webpage content to derive potential future insights, and doing so raises questions about feasibility, ethics, and practical application. Not to mention that I wouldn't know what to name a column in an SQL database containing all that information in one long string.
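As a sketch of what such a model could look like, the toy regression below fits log price on product characteristics plus a period dummy, the kind of "hedonic" adjustment that could absorb sample churn if those characteristics had been scraped. All products, characteristics, and prices here are invented for illustration; this is not a model we have estimated.

```python
import numpy as np

# Toy data. Columns: screen size (inches), 4K flag, period dummy
# (1 = later period). The later-period products differ from the base
# sample, mimicking the product rotation we observed.
X = np.array([
    [43, 0, 0],
    [55, 1, 0],
    [65, 1, 0],
    [50, 1, 1],
    [55, 0, 1],
    [75, 1, 1],
], dtype=float)
log_price = np.log(np.array([300.0, 550.0, 700.0, 520.0, 430.0, 900.0]))

# Add an intercept and fit by ordinary least squares.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, log_price, rcond=None)

# The coefficient on the period dummy estimates the quality-adjusted
# log price change between the two periods.
period_effect = coef[3]
print(f"quality-adjusted price change: {100 * (np.exp(period_effect) - 1):.1f}%")
```

The point is that the period effect is identified even though no single product appears in both periods, as long as the characteristics driving prices were collected, which is exactly the information our lean scraping setup did not retain.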

 

Future Considerations: "Scrape First, Ask Later"

Unfortunately, I can't end this post with a lesson or recommendation, mainly because I am unsure what the right approach to these challenges is. However, they do raise some interesting questions about how we approach new data sources and the methodological and ethical dilemmas they pose.

 

Is "scrape first, ask later" the proper way to handle web scraping and modern data gathering? 

 

What are the policies and the general approach to modern data gathering in your organisation?

 

Most importantly, what on earth would I call a column holding all the HTML content from a product web page?

 

Petrus Munter

Statistics Sweden

Published October 2024