Issue 12 - Statistical business registers - an important cornerstone in official statistics

Welcome to the 12th blog of the European Statistical System Collaborative Network (ESSnet) project. As mentioned in the first blog, the project team is divided into four work packages, each focusing on a different aspect. This edition covers Work Package Three, which explores non-traditional sources for official statistical production, and in particular use case number five, which concerns business register quality enhancement from web sources. In this blog we give an update on the state of play described in the blog of April 2022.

Statistical business registers (SBRs) are an important cornerstone in official statistics. They provide the information on the population of enterprises that is needed to produce official figures on business and macroeconomic statistics. Traditionally, business registers are derived and maintained from administrative data, such as Chamber of Commerce (COC) registers, in combination with surveys. These days most enterprises have one or more websites, which may contain valuable information that can supplement or improve the business registers. Other sources, such as domain registry data, news or social media items, may also help to improve them. This idea is the subject of this work, in which statistical offices from Austria, Finland, Hesse (Germany), the Netherlands (leading), and Sweden work closely together.

The work is split into two main topics:

  1. The identification of the website(s) that belong to a unit in the statistical business register. This process is known as “URL finding”.
  2. Interpreting and deriving variables from website contents with the main focus on predicting or improving “NACE codes”.

On the topic of URL finding, the work in the second year includes:

  • Statistics Hesse has continued to develop its R package for URL finding and analyzing enterprise websites. The package, called Escra (enterprise scraper), was already used to produce the results reported last year. Changes were made to the web scraping IT infrastructure (using a proxy, improved scalability, using containers), the automated browser used for scraping (the headless browser Splash) and the machine learning procedure that identifies correct URLs. The procedure has been refined into a two-step approach: it is first checked whether a statistical unit has a website, and then a prediction score is calculated.
  • Statistics Netherlands updated the linkage process with third-party web-scraped data from the company DataProvider. Among other improvements, a logistic regression-based linkage process is now used, which is easier to adapt when new linkage variables become available in the future. Moreover, the quality of the links was evaluated by experts (4*400 samples of mixed link types), leading to a better estimate of the linkage probability. This procedure can be repeated in later years.
  • Statistics Finland is using domain registry data from Traficom to determine websites for statistical units. The API was queried for about half a million entities in the SBR, and the results were filtered and checked in several steps.
  • Statistics Sweden performed a pre-feasibility URL finding study on a sample of the SBR, using Google search. They found that the large majority of enterprises in Sweden show an organisation number on their website, which makes identification easier. They also found that the ‘main’ or ‘about’ page can be scraped in most cases.
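
To make the idea of a logistic regression-based linkage score more concrete, here is a minimal Python sketch. It is not the Statistics Netherlands implementation: the feature names, the weights, the intercept and the 0.5 threshold are all hypothetical and chosen only for illustration. In practice the weights would be fitted on links labelled by experts.

```python
import math

# Hypothetical weights for illustration only; a real model would be
# fitted on expert-labelled links between SBR units and URLs.
WEIGHTS = {
    "name_similarity": 3.0,  # similarity between unit name and site text, 0..1
    "postcode_match": 1.5,   # 1.0 if the registered postcode appears on the site
    "phone_match": 2.0,      # 1.0 if the registered phone number appears on the site
}
INTERCEPT = -3.0


def link_probability(features):
    """Return a logistic score in (0, 1) for one unit/URL candidate pair."""
    z = INTERCEPT + sum(
        weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def best_link(candidates, threshold=0.5):
    """Return the highest-scoring candidate URL, or None if none reaches the threshold."""
    scored = [(link_probability(feats), url) for url, feats in candidates.items()]
    if not scored:
        return None
    score, url = max(scored)
    return url if score >= threshold else None


candidates = {
    "https://example.com": {
        "name_similarity": 0.9,
        "postcode_match": 1.0,
        "phone_match": 1.0,
    },
    "https://example.org": {"name_similarity": 0.2},
}
print(best_link(candidates))
```

In a sketch like this, a newly available linkage variable only needs a new entry in `WEIGHTS` (and a refit), which illustrates why such a model is comparatively easy to adapt.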

On the topic of interpreting and deriving variables, the work in the second year includes:

  • Statistics Austria experimented with improving the NACE prediction quality by including more subpages of a business website, by using a ‘hierarchical modeling approach’ that adds NACE-1 codes, and by using a different model (a more memory-efficient XGBoost model). Taking the NACE hierarchy into account seems to improve the prediction, while using a different model did not yield any improvement. A further in-depth analysis of the results showed that predicting NACE codes is more difficult for larger companies.
  • Statistics Netherlands worked on improving the detection of misclassifications and predictions. They found a promising method to predict whether a registered NACE is incorrect from website texts and SBR variables, which will be tested in 2023 on NACE section R in more detail. Moreover, they worked on a new method for predicting NACE using feature sets with knowledge-specific words; this study will be extended in 2023.
  • Statistics Sweden experimented with NACE detection using the KB-BERT method, which was adapted and extended for the Swedish language. They encountered challenges concerning the difference between website texts and annual reports, the identification of enterprises with multiple activities, and the scraping strategy to use for websites that are in English only.
  • Statistics Hesse experimented with collecting contact information from web sources. They produced an email address collection pipeline based on regular expressions, which supports special character conversions, recognizes ‘semi-protected’ email addresses (such as name[at]google.com), filters irrelevant addresses and categorizes the results by level of interest.
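
As an illustration of the kind of regular-expression pipeline described for email collection, the following Python sketch normalizes ‘semi-protected’ addresses, extracts candidates and drops irrelevant ones. It is not the Statistics Hesse pipeline: the patterns, the function name `extract_emails` and the list of irrelevant prefixes are assumptions made for this example, and the categorization by level of interest is left out.

```python
import re

# Obfuscations such as "name[at]example.com" or "name (dot) org",
# with optional whitespace around and inside the brackets.
AT_PATTERN = re.compile(r"\s*[\[\(\{]\s*at\s*[\]\)\}]\s*", re.IGNORECASE)
DOT_PATTERN = re.compile(r"\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*", re.IGNORECASE)

# Deliberately simple address pattern; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical examples of mailbox names that are not useful as contact data.
IRRELEVANT_PREFIXES = ("noreply", "no-reply", "webmaster")


def extract_emails(text):
    """Return unique, lower-cased email addresses found in text, in order of appearance."""
    normalized = AT_PATTERN.sub("@", text)
    normalized = DOT_PATTERN.sub(".", normalized)
    found = []
    for match in EMAIL_PATTERN.findall(normalized):
        address = match.lower()
        local_part = address.split("@", 1)[0]
        if local_part.startswith(IRRELEVANT_PREFIXES):
            continue  # filter irrelevant addresses
        if address not in found:
            found.append(address)
    return found


sample = (
    "Contact: Info[at]example.com or sales (at) example (dot) org. "
    "Do not write to noreply@example.com."
)
print(extract_emails(sample))
```

Normalizing the obfuscated forms before matching keeps the address pattern itself simple, at the cost of occasionally rewriting ordinary text such as "(at)" that was never meant as part of an address.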

All in all, the results from the ongoing work on correcting, completing, or improving business registers from web data and other online sources indicate that this use case is still very promising. In the coming years the work will continue to arrive at a mix of methods that can be applied for SBR improvement depending on the state of play of a statistical business register and the availability of online sources.

Published 25 April 2023