
Issue 08 - Moving Big Data into statistical production


Moving Big Data into Statistical Production

Welcome to the eighth blog of the European Statistical System Collaborative Network (ESSnet) project. The project team is divided into four work packages, each focusing on a different aspect. This edition focuses on Work Package 2, whose aim is to move Big Data methods and repositories from experimental statistics towards statistical production.

What use cases are we conducting?

Work Package 2 (WP2) has two main use cases:

  • OJA – Online Job Advertisements, which aims to provide reliable information on labour market demand based on data acquired from websites, mostly job offer portals,
  • OBEC – Online Based Enterprise Characteristics, which aims to provide data on enterprise characteristics based on enterprise websites.

What population is used in the use cases?

The population for OJA consists of the largest job offer portals, with data covering all 27 EU countries and selected European countries that are not members of the European Union.

The population of the OBEC use case consists of enterprises with ten or more employees. We collect the list of enterprise URLs using mixed methods, i.e. data from business registers, external databases or URL-finding software dedicated to the Web Intelligence Network. Several different data sources can therefore be used to acquire the URL lists. One of them is official registers, which in many cases make this information publicly available. For example, in Poland, enterprise registers such as KRS or CEIDG are available online, and the website URL can be scraped if the company shares this information in the register. There are also third-party databases that hold information about enterprises in a specific country.

More information can be found in the URL finding methodology:

https://ec.europa.eu/eurostat/cros/content/url-finding-methodology_en
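
Combining URL lists from several sources usually requires some normalisation first, because the same website can be recorded in different forms. The following is only a minimal Python sketch of that idea, not the project's URL-finding software. The source names, enterprise identifiers and the rule of reducing each URL to its host name are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def normalise_url(raw):
    """Reduce a raw URL string to a lower-case host name, or None if unusable."""
    raw = raw.strip()
    if not raw:
        return None
    # Registers often store addresses without a scheme, e.g. "www.example.pl".
    if "://" not in raw:
        raw = "http://" + raw
    try:
        host = urlsplit(raw).hostname  # hostname is returned in lower case
    except ValueError:
        return None
    if not host or "." not in host:
        return None
    return host[4:] if host.startswith("www.") else host


def merge_url_sources(sources):
    """Merge {source: {enterprise_id: raw_url}} into
    {enterprise_id: {host: set_of_sources_reporting_it}}."""
    merged = {}
    for source_name, records in sources.items():
        for enterprise_id, raw_url in records.items():
            host = normalise_url(raw_url)
            if host is None:
                continue
            by_host = merged.setdefault(enterprise_id, {})
            by_host.setdefault(host, set()).add(source_name)
    return merged


# Hypothetical input: two sources reporting URLs for two enterprises.
example_sources = {
    "register": {"E1": "WWW.Example.PL/about", "E2": ""},
    "external_db": {"E1": "https://example.pl", "E2": "http://firma.example.com/"},
}
```

Calling `merge_url_sources(example_sources)` keeps, for each enterprise, the host names found and the sources that reported them, so a host confirmed by more than one source can be treated with more confidence.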

What methods are used to get and process the data?

We mostly acquire data through web scraping, supported by the Web Intelligence Platform, a solution dedicated to the Web Intelligence Network. We use the Selenium engine for better results: it allows us to scroll the web page and go deeper into the website by crawling the links available on subpages.
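
The idea of going deeper into a website by following links on subpages can be illustrated with a small breadth-first crawler. This is a sketch under stated assumptions, not the Web Intelligence Platform itself: it is written in Python with the standard library only, it does not render JavaScript or scroll pages as a Selenium-driven browser would, and the `fetch` function is a hypothetical placeholder that returns the HTML of a page (or `None` if the page cannot be retrieved).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, fetch, max_depth=2):
    """Breadth-first crawl limited to the start URL's host.

    Returns {url: html} for every page reached within max_depth link hops.
    """
    host = urlsplit(start_url).hostname
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            target = urldefrag(urljoin(url, href)).url
            if urlsplit(target).hostname == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages


# Hypothetical in-memory website standing in for real HTTP requests.
example_site = {
    "http://example.com/": '<a href="/about">About</a><a href="http://other.org/">x</a>',
    "http://example.com/about": '<a href="contact#form">Contact</a>',
    "http://example.com/contact": "Contact us",
}
```

With `crawl_site("http://example.com/", example_site.get, max_depth=1)` only the start page and the pages it links to directly are collected; raising `max_depth` lets the crawler reach deeper subpages. External hosts are skipped so the crawl stays within one enterprise website.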

We use text mining, including regular expressions and supervised machine learning, to process the data. Data is checked at every stage: acquisition, processing and dissemination.
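
As an illustration of the regular-expression part of text mining, the sketch below searches the HTML of an enterprise website for links to social media profiles. Social media presence is used here only as a hypothetical example of an enterprise characteristic, and the pattern and the list of platforms are assumptions, not the rules applied in the project.

```python
import re

# Hypothetical pattern: links to a few well-known social media platforms.
SOCIAL_MEDIA_PATTERN = re.compile(
    r"https?://(?:www\.)?(facebook|linkedin|instagram|youtube)\.com/[^\s\"'<>]+",
    re.IGNORECASE,
)


def find_social_media(html):
    """Return {platform: set_of_matching_links} found in the given HTML text."""
    found = {}
    for match in SOCIAL_MEDIA_PATTERN.finditer(html):
        platform = match.group(1).lower()
        found.setdefault(platform, set()).add(match.group(0))
    return found


example_html = (
    '<a href="https://www.facebook.com/acme">FB</a> '
    '<a href="http://LinkedIn.com/company/acme">in</a>'
)
```

A rule of this kind is transparent and easy to check by hand, which is one reason to combine regular expressions with machine learning rather than rely on a model alone.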

How does it work?

Below is an overview of the methods and components used in our project.

[Figure: the components used in the project, numbered (1) to (3) as referenced in the text below]

First, the Web Intelligence Platform, written in Java, acquires the data from websites and stores it in JSON format in the Elasticsearch database (1). The data in Elasticsearch is then accessed via WIN DataLab (2), a Linux-based environment offering various languages for filtering and transforming the data, including Python, R, Java and the native Elasticsearch query language, among others. Data processed this way can be disseminated in tables (3).
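
The filter-and-transform step can be pictured as turning JSON documents into rows of a table. The sketch below is a simplified Python illustration that works on JSON strings directly and does not use an Elasticsearch client. The field names `country` and `occupation` and the sample documents are hypothetical and do not describe the actual structure of the stored data.

```python
import json
from collections import Counter


def count_ads(json_lines, country=None):
    """Count documents per (country, occupation), optionally for one country.

    Returns a sorted list of (country, occupation, count) rows.
    """
    counts = Counter()
    for line in json_lines:
        doc = json.loads(line)
        doc_country = doc.get("country", "unknown")
        if country is not None and doc_country != country:
            continue  # filter step
        counts[(doc_country, doc.get("occupation", "unknown"))] += 1
    # transform step: flatten the counter into table rows
    return sorted((c, o, n) for (c, o), n in counts.items())


# Hypothetical documents, one JSON object per line.
example_docs = [
    '{"country": "PL", "occupation": "analyst", "title": "Data analyst"}',
    '{"country": "PL", "occupation": "analyst", "title": "Junior analyst"}',
    '{"country": "PL", "occupation": "driver", "title": "Bus driver"}',
    '{"country": "DE", "occupation": "analyst", "title": "Analyst"}',
]
```

Calling `count_ads(example_docs, country="PL")` yields one row per occupation for that country, which is the kind of aggregated table that can then be checked and disseminated.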

Is the data quality good enough to treat the data as official statistics?

Neither use case is easy to conduct, because Big Data sources, especially web data, are not of the highest quality. This is why we must apply a complex Big Data quality framework. It includes verifying the data quality at each stage of statistical production, i.e. during data acquisition, data processing and data dissemination. For instance, WP2 members were involved in monitoring the quality of the machine learning model by verifying the results of mapping job occupations to the descriptions of job offers. More information on data quality related to OJA data can be found in Blog No. 5 of the Web Intelligence Network:

https://ec.europa.eu/eurostat/cros/content/issue-5-path-quality-framework-oja-data-source_en
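
One simple way to monitor a classification model of this kind is to compare its output with a manually verified sample. The following Python sketch computes overall accuracy and per-label precision. It is an illustration under assumptions, not the quality framework itself: the occupation labels are placeholders and the choice of these two measures is ours.

```python
from collections import defaultdict


def evaluate_mapping(predicted, verified):
    """Compare model labels with manually verified labels.

    Returns (accuracy, {label: precision}) where precision is the share of
    items assigned to a label by the model that were confirmed by the reviewer.
    """
    if len(predicted) != len(verified):
        raise ValueError("predicted and verified must have the same length")
    if not predicted:
        raise ValueError("at least one verified item is required")
    assigned = defaultdict(int)
    correct = defaultdict(int)
    for model_label, true_label in zip(predicted, verified):
        assigned[model_label] += 1
        if model_label == true_label:
            correct[model_label] += 1
    accuracy = sum(correct.values()) / len(predicted)
    precision = {label: correct[label] / assigned[label] for label in assigned}
    return accuracy, precision


# Hypothetical sample: placeholder occupation labels A, B and C.
model_labels = ["A", "A", "B", "C"]
reviewed_labels = ["A", "B", "B", "C"]
```

A label with low precision points to job offer descriptions that the model tends to assign to the wrong occupation, which is where manual review is most useful.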

Where can I find more information on WIN WP2?

The information in this blog is based on the current work of WP2, including Deliverable 2.1: WP2 1st Interim Progress Report, which you can access by contacting the WP2 Leader: Jacek Maślankowski, j.maslankowski@stat.gov.pl

Published 17 November 2022