Issue 04 - Developing and improving business registers

Welcome to the European Statistical System Collaborative Network (ESSnet) project and the fourth blog. As mentioned in the first blog, the project team is divided into four work packages focusing on different aspects. This edition focuses on Work Package Three, exploring non-traditional sources for official statistical production via six use cases. One use case focuses on business register quality enhancements from web sources.

Statistical business registers (SBRs) are an important cornerstone in official statistics. They provide information on enterprises necessary for producing official figures on business and macroeconomic statistics on a global, national or regional scale. Traditionally, business registers are derived and maintained from administrative data such as Chamber of Commerce (COC) registers combined with surveys. These days most enterprises have one or more websites. These websites may contain valuable information to supplement or improve the business registers. This notion is the subject of this use case, where statistical offices from Austria, Finland, Hessen (Germany), the Netherlands (leading), and Sweden work closely together.

The challenge of using web data to improve business registers may seem simple at first glance: collect texts from websites and derive variables such as enterprise size or economic activity (NACE). In practice, it is far from straightforward. Many websites lack information identifying the company or companies behind them, or that information is difficult to find. Moreover, website designs and the richness of website content vary widely. Therefore, this project develops advanced strategies, methodologies, and tools to transform messy web data into reasonably reliable business register knowledge.

The work is split into two main topics:

Identifying the website(s) belonging to a statistical business register unit, which can be a legal or statistical entity or a group of enterprises belonging together. This process is known as "URL finding".
Interpreting website contents and deriving variables from them, with the main focus on predicting or improving knowledge of the economic activity of businesses, expressed as "NACE codes".

Before a scraper can inspect enterprises' websites, it needs to know the URL to visit. In most cases, this information is not contained in the SBR, so a URL finding phase needs to be executed. The project distinguishes between two ways of doing this:

Feeding enterprise names and optionally other information from the SBR to a search engine. We call this "search mode".
Linking the SBR to other data sources, such as scraped data from other firms, to identify the URL belonging to the unit in the SBR. We call this "linkage mode".
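To make the linkage mode more concrete, here is a minimal Python sketch. It is not the method used by any of the participating offices. It assumes that each scraped record carries a company name and a URL, and it links records to SBR units by comparing normalized names. The field names, the list of legal-form suffixes, the example data and the 0.85 threshold are all hypothetical, and the similarity score is a simple string ratio rather than a calibrated linkage probability.

```python
import re
from difflib import SequenceMatcher

# Hypothetical legal-form suffixes to ignore when comparing names.
LEGAL_FORMS = {"bv", "gmbh", "ab", "oy", "ltd", "nv"}

def normalize(name):
    """Lowercase, drop punctuation and common legal-form suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

def link_urls(sbr_units, scraped_records, threshold=0.85):
    """Return {unit_id: [(url, score), ...]}, best candidate first."""
    links = {}
    for unit_id, unit_name in sbr_units:
        target = normalize(unit_name)
        candidates = []
        for record in scraped_records:
            score = SequenceMatcher(None, target, normalize(record["name"])).ratio()
            if score >= threshold:
                candidates.append((record["url"], round(score, 2)))
        links[unit_id] = sorted(candidates, key=lambda c: -c[1])
    return links

# Made-up example data, for illustration only.
sbr_units = [(1, "Jansen Bakkerij B.V."), (2, "Nordic Timber AB")]
scraped_records = [
    {"name": "Bakkerij Jansen", "url": "https://bakkerij-jansen.example"},
    {"name": "Jansen Bakkerij BV", "url": "https://jansenbakkerij.example"},
    {"name": "Nordic Timber", "url": "https://nordictimber.example"},
]
links = link_urls(sbr_units, scraped_records)
```

In this toy example, "Bakkerij Jansen" is not linked to unit 1 because a plain character-level ratio is sensitive to word order. A real linkage would likely combine several fields, such as address or registration number, instead of relying on names alone.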

The project developed mandatory and optional criteria to judge the quality of URL findings in both modes, such as the percentage of businesses with no, few or multiple URLs found or the frequency distribution of linkage probability.
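As an illustration of one such criterion, the following Python sketch computes the percentage of units with no, exactly one, or several candidate URLs. The bucket boundaries and the input layout (a mapping from unit identifier to a list of URLs) are our own assumptions for this example and are not taken from the project's definitions.

```python
from collections import Counter

def url_coverage(urls_per_unit):
    """Percentage of units with zero, one, or several candidate URLs."""
    if not urls_per_unit:
        raise ValueError("at least one unit is required")

    def bucket(n):
        # Assumed buckets: 0 URLs, exactly 1 URL, 2 or more URLs.
        return "none" if n == 0 else "one" if n == 1 else "multiple"

    counts = Counter(bucket(len(urls)) for urls in urls_per_unit.values())
    total = len(urls_per_unit)
    return {key: 100 * counts[key] / total for key in ("none", "one", "multiple")}

# Made-up URL finding result for four units.
found = {
    1: [],
    2: ["https://a.example"],
    3: ["https://b.example", "https://b-shop.example"],
    4: [],
}
coverage = url_coverage(found)
```

Computing the same shares separately for groups of enterprises, for instance by size class, would show where URL finding works well and where it does not.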

Country experiences on URL finding include:

Austria: using multiple search methods via the Google Search API from R and Selenium to discover URLs. Special consideration has been given to the privacy protection of one-person businesses. Results were compared with Statistics on ICT usage in 2021.
Finland: using DuckDuckGo in combination with an extensive blacklist of URLs to filter out unwanted results, together with the API of the country-code top-level domain registry operator, for URL finding.
Hessen: finding URLs using R, Google search, regular expressions and ML.
Netherlands: linking data from the web scraping company DataProvider to determine URLs for legal units in the SBR. Linkage counts and probabilities for different groups of enterprises have been calculated.
Sweden: testing the use of DataProvider data.
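The idea of combining search results with a blacklist, as in the Finnish approach, can be sketched in a few lines of Python. This is an illustration under our own assumptions and not the Finnish implementation: the blocked domains are hypothetical examples, and the search results are given as a plain list of URLs instead of being fetched from a search engine.

```python
from urllib.parse import urlparse

# Hypothetical blocked domains: sites that rarely are a firm's own website.
BLOCKLIST = {"facebook.com", "linkedin.com", "wikipedia.org"}

def host_of(url):
    """Return the lowercase host name without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def filter_results(urls, blocklist=BLOCKLIST):
    """Keep only URLs whose host is not a blocked domain or a subdomain of one."""
    kept = []
    for url in urls:
        host = host_of(url)
        blocked = any(host == d or host.endswith("." + d) for d in blocklist)
        if not blocked:
            kept.append(url)
    return kept

# Made-up search results for a fictitious enterprise.
results = [
    "https://www.facebook.com/acme",
    "https://acme.example/",
    "https://en.wikipedia.org/wiki/Acme",
]
candidates = filter_results(results)
```

Matching on the host rather than on the full URL string avoids accidentally blocking a firm whose own address merely contains a blocked name somewhere in its path.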

A literature review was performed on detecting the NACE codes of companies from website content. Based on the state of the art in this field, ideas on a common target population have been written down, and a scraping strategy has been developed. The scraping strategy balances completeness (scraping everything) against selectivity (scraping only content that seems relevant for NACE). It includes heuristics for scraping subpages whose link names seem relevant for detecting economic activity. Since machine learning plays a dominant role here, good practices for training and test sets and common criteria for judging model results were developed.
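A link-name heuristic of this kind could look like the following Python sketch, which uses only the standard library. It is a simplified illustration and not the project's scraping strategy: the keyword list is hypothetical, and a production scraper would also need to handle languages, robots.txt, and links to other domains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keywords hinting at pages that describe economic activity.
KEYWORDS = ("about", "product", "service", "company")

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from all <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def interesting_subpages(base_url, html):
    """Return absolute URLs of links whose href or text contains a keyword."""
    collector = LinkCollector()
    collector.feed(html)
    selected = []
    for href, text in collector.links:
        haystack = f"{href} {text}".lower()
        if any(keyword in haystack for keyword in KEYWORDS):
            url = urljoin(base_url, href)
            if url not in selected:
                selected.append(url)
    return selected

# Made-up homepage fragment.
homepage = (
    '<a href="/about-us">About us</a>'
    '<a href="/jobs">Careers</a>'
    '<a href="/p">Our products</a>'
)
subpages = interesting_subpages("https://example.com", homepage)
```

Here the "Careers" link is skipped, while the "About us" and "Our products" pages are selected for scraping in addition to the homepage.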

Country experiences on NACE detection include:

Austria: determined two sets of words (size 200 and 500) for each NACE2 code. Those sets were used for experiments with NACE detection via neural networks.
Netherlands: applied knowledge-based features derived from an existing NACE classification tool to website texts from homepages and 'about us' pages for NACE detection, and experimented with different ML models.
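To give a feeling for how word sets per NACE code can be used, the Python sketch below scores a website text by the share of its words that occur in each set. This is a deliberately simple keyword baseline that we use for illustration; it is not the neural network approach tested in Austria nor the Dutch feature-based models. The two word sets are tiny, hypothetical examples, whereas the Austrian experiments used sets of 200 and 500 words.

```python
import re
from collections import Counter

# Hypothetical, tiny word sets for two NACE codes.
WORD_SETS = {
    "10.71": {"bakery", "bread", "pastry", "cakes"},
    "62.01": {"software", "programming", "developer", "code"},
}

def score_nace(text, word_sets=WORD_SETS):
    """Share of the words in the text that belong to each NACE word set."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(tokens.values()) or 1  # avoid division by zero on empty text
    return {
        code: sum(tokens[word] for word in words) / total
        for code, words in word_sets.items()
    }

def predict_nace(text, word_sets=WORD_SETS):
    """Return the best-scoring NACE code, or None if no set word occurs."""
    scores = score_nace(text, word_sets)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

prediction = predict_nace("Fresh bread and pastry from our family bakery")
```

Scores like these could also serve as input features for a machine learning model rather than as a classifier on their own.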

All in all, the web content of enterprises seems promising for SBR improvements because it is richly available across countries. But there are also many challenges in connecting web content to legal units and deriving high-quality variables from the ever-messy web.

For more information, we refer to the deliverables to appear shortly on this site, especially the WP3 1st interim report and a joint report with WP2 on "URL finding methodology".

See: https://cros.ec.europa.eu/book-page/web-intelligence-network-reports

Published 12 May 2022