Issue 03 - Exploring potential new data sources

Welcome to the European Statistical System Collaborative Network (ESSnet) Web Intelligence Network project. As mentioned in our earlier issue, the overall goal of Work Package 3 is to tap into the potential of new data sources, which will later be integrated into the Web Intelligence Hub developed by Eurostat. In parallel with exploring the data sources, we aspire to produce experimental statistics from them, provided that they meet the quality criteria. The experimental results will cover the territories of all consortium members working on the respective experimental statistics product.

Work Package 3 consists of 10 partners: eight National Statistical Institutes and two regional statistical authorities. Our partners are GUS-Statistics Poland, STATA-Statistics Austria, BNSI-Bulgarian National Statistical Institute, INSEE-National Institute of Statistics and Economic Studies, SF-Statistics Finland, SSI-BBB Statistical Office Berlin-Brandenburg, HSL-Statistical Office Hesse, CBS-Statistics Netherlands, SCB-Statistics Sweden, and ONS-Office for National Statistics.

Today’s issue will delve deeper into Work Package 3, dedicated to exploring non-traditional data sources for official statistical production. 

The activities of Work Package 3 are divided into six use cases, each with distinct characteristics and specific goals:

  • Use Case 1 aims to explore new data sources and monitor the real estate market.
  • Use Case 2 aims to derive early estimates of construction activities based on real estate web portals for already built and planned buildings. 
  • Use Case 3 aims to collect data about online prices of household appliances and audiovisual, photographic and information processing equipment by web scraping online shops and, at a later stage, to compare the data with scanner data on the shops’ sales.
  • Use Case 4 aims to develop new indices for tourism statistics, using data from booking portals, air traffic portals, travel agency portals and portals related to the quality of life.
  • Use Case 5 focuses on mass web scraping, primarily to enhance the quality of the business register by linking enterprises to their URLs and predicting their main economic activity codes (NACE).
  • Use Case 6 aims to explore the use of publicly available traffic camera data to produce new indicators. It relies on an unusual data source: images from traffic cameras, together with data from induction loops.

Use cases 1-4 share similar characteristics in terms of data sources and expected experimental indicators, and they adhere to the same pre-defined process steps: “New data sources exploration”, “Programming, production of software”, “Data acquisition and recording”, “Data processing”, “Modelling and interpretation”, and “Dissemination of the experimental statistics and results”. Use cases 5 and 6 take a slightly different approach because of their unusual data sources and do not adhere to these steps.

During the first year, Work Package 3 achieved meaningful results, such as:

  • A checklist used as an assessment tool
  • Justification of the data sources
  • A defined set of mandatory and optional variables to be extracted from the data sources
  • Sets of minimal indicators, based on the mandatory variables
  • Working environments and software solutions set up and successfully tested for the upcoming data collection
  • A literature review focused on URL finding methodology and tools and on the use of business websites to predict the economic activity of enterprises, along with the preparation of training and test sets and the accompanying methodology for URL finding
  • Preparation of the upcoming NACE prediction and classification
  • Exploration of the available ways to assess the model results
  • Implementation of a machine-learning pipeline for publicly accessible traffic camera data

We are also scheduled to begin testing Eurostat’s Web Intelligence Hub with those use cases from our work package that volunteered to do so.

While we have successfully implemented the activities planned for the first project year, our work continues: we constantly monitor the available resources, emerging issues and the quality of the data that will be collected and processed during the second project year. The use cases have already encountered a number of issues, most of them expected. Examples include changes in the structure of web data sources and of the websites themselves, checks for legal and copyright constraints, non-standard variables, and mechanisms that block data extraction (e.g., JavaScript, captchas). Other issues concern the viability of the training and test sets for both URL finding and NACE prediction, the need for regular updates of the data source, and difficulties in comparing results with other partners, since NACE code classification is knowledge-intensive and requires language-specific sources. Because of the unusual data sources in some use cases, we have also encountered issues that cannot be fully solved, such as varying weather and lighting conditions (e.g., snow, rain, darkness). Some of the issues have been solved, while others remain.
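One of the issues named above, a change in the structure of a web source, tends to show up as a mandatory variable that suddenly goes missing from the scraped records. The following is a minimal Python sketch of such a check, not the project's actual implementation: the variable names (`price`, `area_m2`, `location`), the sample records and the 20% alert threshold are all hypothetical and chosen only for illustration.

```python
from typing import Iterable

# Hypothetical mandatory variables for a real estate listing; the actual
# sets defined by the use cases are not listed in this post.
MANDATORY_VARIABLES = ("price", "area_m2", "location")


def missing_share(records: Iterable[dict], variables=MANDATORY_VARIABLES) -> dict:
    """Return, per variable, the share of records where it is absent or empty."""
    records = list(records)
    if not records:
        # No records at all: treat every variable as fully missing.
        return {name: 1.0 for name in variables}
    shares = {}
    for name in variables:
        missing = sum(1 for r in records if r.get(name) in (None, ""))
        shares[name] = missing / len(records)
    return shares


def structure_alerts(records, threshold=0.2, variables=MANDATORY_VARIABLES):
    """List variables whose missing share exceeds the (assumed) threshold."""
    shares = missing_share(records, variables)
    return sorted(name for name, share in shares.items() if share > threshold)


# Made-up records imitating one scraping run after a site redesign.
sample = [
    {"price": 250000, "area_m2": 80, "location": "Sofia"},
    {"price": 310000, "area_m2": None, "location": "Vienna"},
    {"price": 190000, "area_m2": "", "location": "Warsaw"},
    {"price": 420000, "location": "Helsinki"},
]

print(structure_alerts(sample))  # prints ['area_m2']
```

A check of this kind would typically run after every collection round, so that a redesigned page is noticed before incomplete records reach the indicators.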

If you would like to know more about our journey, how we overcome these obstacles and what results we have achieved, please follow our blog and do not hesitate to contact us if you think you have what it takes to contribute to our goals.

You can reach us via email at:

Galya Stateva, gstateva@nsi.bg 

Published 13 April 2022