
Issue 20 - Online prices of household appliances and audio-visual, photographic and information processing equipment

WIN Blog

Introduction

Use case three is now in its final project year, and we would like to take the opportunity to go over some of our findings, lessons, and plans for the last few months of the project. The current members of the use case are Statistics Sweden and the National Statistical Institute of Bulgaria (BNSI). Some of the insights below are specific to one country, and some apply to the use case as a whole.

The full name of the use case is UC3: Online prices of household appliances and audio-visual, photographic, and information-processing equipment. The aim of our work differs between the countries because the prerequisites differ, both in technical terms and in access to data sources. Statistics Sweden has been able to gather both web-scraped data and scanner data (also called cash register data), which is sales data supplied directly by specific companies. BNSI, on the other hand, has only been able to gather data through web scraping and is at an early stage of implementing this source in its production of the Consumer Price Index (CPI). This post focuses on the work with the Swedish sources: we want to learn how the two sources correlate and how a deeper understanding of that correlation can help us get more value out of each source. The main purpose is to increase the quality of the CPI, and the weights and methods referred to below relate to the CPI.

Lessons and challenges

Technical

The technical aspect has been central to this use case, as much of the work has gone into simply getting hold of data. Statistical agencies use different tools for web scraping but face some common challenges. The critical point is that, although there are several ways to get started with web scraping relatively quickly, there is always a risk of data loss due to changes online. When a webpage changes its layout or structure, a script written for the old version can stop working, so it is difficult to keep a script running over time without maintenance. Another challenge concerns how the page itself operates. Gathering data online sometimes means handling more than HTML; for example, pages that load their content through JavaScript make web scraping more intricate.
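
One way to notice this kind of data loss early is to let the scraper fail loudly when a page no longer yields the expected elements. The following is only a minimal sketch using Python's standard library, not the tooling used in the project. The class name `price`, the helper `extract_prices`, and the two HTML snippets are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect the text of elements whose class list contains "price".

    Assumption: price elements contain plain text only (no nested tags).
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Any closing tag ends the current price element.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html, minimum_expected=1):
    """Return scraped prices, raising if fewer than expected were found."""
    parser = PriceExtractor()
    parser.feed(html)
    if len(parser.prices) < minimum_expected:
        raise ValueError(
            f"found {len(parser.prices)} prices, expected at least "
            f"{minimum_expected}; the page layout may have changed"
        )
    return parser.prices


# Hypothetical page before and after a redesign renamed the CSS class.
old_layout = (
    '<ul><li><span class="price">4 990 kr</span></li>'
    '<li><span class="price">7 490 kr</span></li></ul>'
)
new_layout = '<ul><li><span class="amount">4 990 kr</span></li></ul>'

print(extract_prices(old_layout))
try:
    extract_prices(new_layout)
except ValueError as error:
    print("Scrape failed:", error)
```

The same check would also trigger on a page whose prices are filled in by JavaScript after loading, since the static HTML would then contain no price elements at all.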

During the project, these challenges have made it difficult to keep the time series unbroken over longer periods, and this applies to both countries involved. Luckily, both of us have been able to gather data over a long enough timespan to continue our research.

Setting boundaries

Statistics Sweden aims to find the correlation between how "popular" products appear to be online and how many items are actually sold. Understanding this correlation could allow us to estimate sales in areas where we don't have actual sales numbers, that is, where we lack the kind of information that cash register data provides.
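
As an illustration of what measuring such a correlation could look like, the sketch below computes a Spearman rank correlation between web position and units sold. This is an assumption about one possible measure, not a description of the analysis carried out in the project, and the numbers are made up.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (spread_x * spread_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: position 1 is the most popular product on the page.
web_position = [1, 2, 3, 4, 5, 6]
units_sold = [120, 95, 97, 60, 41, 30]

rho = spearman(web_position, units_sold)
print(f"Spearman correlation: {rho:.3f}")
```

A value close to -1 would mean that products listed higher up (a lower position number) tend to sell more, which is the pattern needed for web position to be a useful stand-in for sales.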

This is a relatively new area of research, which means there are several tracks we would like to investigate. Below are a few of the ones we find interesting. However, within the boundaries of the current use case, we will have to commit to one track and follow it through to get a proof of concept.

Weight function estimates

The idea of the method is to estimate a function that describes the relation between the number of items sold and a product's popularity position on the webpage. Without a time limit, the ideal would be to try several ways of doing this estimation and compare which distribution or fitted model performs best. In the current situation, however, we will commit to some form of linear regression.
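
To make the idea concrete, here is a minimal sketch of one form such an estimation could take: an ordinary least squares line of units sold against web position, with the predicted sales turned into weights. The exact model specification has not been decided, so the functions `fit_line` and `estimated_weights`, the clipping of negative predictions, and the data are all illustrative assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimated_weights(positions, intercept, slope):
    """Turn predicted sales into weights that sum to 1.

    Negative predictions are clipped to zero (an assumption of this
    sketch); at least one prediction must be positive.
    """
    predicted = [max(intercept + slope * p, 0.0) for p in positions]
    total = sum(predicted)
    return [value / total for value in predicted]


# Made-up training data from a source where both position and sales are known.
known_positions = [1, 2, 3, 4, 5]
known_sales = [100, 80, 60, 40, 20]
intercept, slope = fit_line(known_positions, known_sales)

# Apply the fitted line where only the web position is observed.
weights = estimated_weights([1, 2, 3], intercept, slope)
print(intercept, slope)
print([round(w, 3) for w in weights])
```

A straight line eventually predicts negative sales for positions far down the list, which is one reason it would be worth comparing it against other functional forms if time allowed.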

Data sources/categories

Another boundary we have had to set concerns how many companies and product areas we have been able to track and gather data for within the project. Right now, the experiment draws on data from two companies and two product groups. This kind of method could prove helpful in some areas and less so in others, and we won't be able to get that broader picture within the project.

Cutoffs in distribution

The estimated weights could also be assigned in different ways, which could influence the method's effectiveness. One option is to follow each product over time and continuously update its weight as its popularity changes. Another option is to divide the distribution into broad bands and "lock in" the weights at the start of the period. In practice, this would mean that, for instance, the 10 most popular products all share the same weight, which is kept fixed over time; the products ranked 11 to 20 form the next band, and so on.
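
The difference between the two options can be sketched as follows. The band size of 10 follows the example above, while the function names, the band weights, and the product positions are made up for illustration.

```python
def band_index(position, band_size=10):
    """Map a 1-based position to a 0-based band: 1-10 -> 0, 11-20 -> 1, ..."""
    return (position - 1) // band_size


def locked_band_weights(start_positions, band_weights, band_size=10):
    """Give each product the weight of the band it was in at the period start.

    The weight then stays fixed, whatever happens to the position later.
    """
    return {
        product: band_weights[band_index(position, band_size)]
        for product, position in start_positions.items()
    }


def continuous_weights(position_history, weight_for_position):
    """Recompute each product's weight at every time step from its position."""
    return {
        product: [weight_for_position(p) for p in positions]
        for product, positions in position_history.items()
    }


# Made-up band weights: one shared weight per band of 10 positions.
band_weights = [0.06, 0.03, 0.01]

# Made-up positions at the start of the period.
start_positions = {"A": 3, "B": 10, "C": 11, "D": 27}
locked = locked_band_weights(start_positions, band_weights)
print(locked)

# Made-up position history for product C, which climbs into the top 10.
history = {"C": [11, 9, 4]}
updated = continuous_weights(history, lambda p: band_weights[band_index(p)])
print(updated)
```

With locked weights, product C keeps the weight of the second band for the whole period. With continuous updating, its weight rises as soon as it enters the top 10, at the cost of having to observe positions at every time step.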

 

Last project year

Lastly, we would like to show some plots of the relationship between sales and web position. The first two come from a company that, as we have learned, places its sale products around position 100 in the popularity ranking. This results in a spike in sales around the 100 mark. Apart from that, the distributions show a decreasing trend, and we are eager to conduct a deeper analysis of this data.

 

Because of the sensitivity of the data, we cannot be more specific about which companies or product groups are being studied.