Issue 22 - Business registers enhancements using web data

WIN Blog

One of the use cases within the WIN Work Package, which explores new online data sources for official statistical production, focuses on business register quality enhancements. In this use case, statistical offices from Austria, Finland, Hesse (Germany), the Netherlands, and Sweden work together to develop methods and techniques for improving statistical business registers using online digital traces. In an earlier blog, we briefly explained the global concept, which we repeat here.

National Statistical Institutes (NSIs) worldwide maintain statistical business registers (SBRs), which are comprehensive databases of information on statistical units such as enterprises within their respective jurisdictions. These registers contain detailed information about each enterprise, including its size, location, economic activity, and administrative details such as contact information and ownership relations. They also capture the relationship with the legal unit and other important relationships among enterprises. The SBRs serve as essential sampling frames for statistical surveys and are indispensable for producing official economic statistics. The UC5 project aims to use online data to enhance SBRs by incorporating better, more detailed, or new information that may be difficult or impractical to acquire through traditional methods.

The work in this use case falls into two main topics: 1) URL finding, a methodology for finding the URLs of businesses, including the linkage of data obtained from third parties, and 2) using those URLs to collect information for SBR enhancements, such as checking the correctness of, or predicting, the economic activity (NACE code) of businesses.

Responsible use of search engines and the automatic reading and interpretation of enterprises' websites are important elements in this work. Search engines can be used in different ways to determine the (possibly complex) relationships between statistical units in the business register and the websites that provide meaningful information on these units. Once these relationships are known, the websites and related media, such as press releases, blogs, advertisements, or web shops, have to be interpreted to retrieve information on the actual activities performed by the enterprises and possibly other variables.
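As a rough illustration of how the relationship between a register unit and a candidate website could be assessed, the sketch below compares a few register attributes with the text of a page and turns the result into a score with a logistic function. It is not the implementation of any of the participating offices: the field names (`name`, `vat`, `postcode`), the features, and the coefficients are all hypothetical, and in practice the coefficients would be estimated from a manually checked sample.

```python
import math
import re
from difflib import SequenceMatcher


def match_features(unit: dict, page_text: str) -> dict:
    """Compare register attributes of one unit with the text of one candidate page."""
    text = page_text.lower()
    # Strip spaces, dots and hyphens so that "ATU 123 456 78" matches "ATU12345678".
    compact = re.sub(r"[\s.\-]", "", text)
    id_found = unit["vat"].lower() in compact
    postcode_found = unit["postcode"].lower() in text
    # Best fuzzy similarity between the registered name and any single line of the page.
    name = unit["name"].lower()
    name_sim = max(
        (SequenceMatcher(None, name, line.strip()).ratio() for line in text.splitlines()),
        default=0.0,
    )
    return {
        "id_found": float(id_found),
        "postcode_found": float(postcode_found),
        "name_sim": name_sim,
    }


# Hypothetical coefficients, chosen by hand for illustration only.
WEIGHTS = {"id_found": 4.0, "postcode_found": 1.5, "name_sim": 3.0}
INTERCEPT = -3.5


def link_probability(features: dict) -> float:
    """Logistic link score in (0, 1); higher means the page more likely belongs to the unit."""
    z = INTERCEPT + sum(WEIGHTS[key] * value for key, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Invented example data: one register unit and two candidate pages.
unit = {"name": "Example Bakery GmbH", "vat": "ATU12345678", "postcode": "1010"}
own_page = "Example Bakery GmbH\nHauptstrasse 1, 1010 Wien\nUID: ATU 123 456 78"
other_page = "Some Other Shop\n8010 Graz"

p_own = link_probability(match_features(unit, own_page))
p_other = link_probability(match_features(unit, other_page))
print(f"own page: {p_own:.3f}, other page: {p_other:.3f}")
```

With these hand-picked weights, a page that shows the exact identifier scores far higher than one that only resembles the name, which reflects the idea that an exact identifier is the strongest evidence for a link.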

In the second and third years of the project, some of the highlights of the work performed in the different countries were:

  • Statistics Hesse improved their URL-finding implementation with new extensions (implemented in R) that use ML techniques to predict the reliability of the URLs found. They compared the URL-finding approach with the use of third-party data for finding enterprise websites. They restricted the third-party data to websites where exact identifiers, such as VAT numbers, were found, because only then was the linkage between statistical unit and website certain. This led to the interesting conclusion that URL finding was overwhelmingly more complete for small enterprises, which is an important aspect of business register completeness.
  • In addition, they worked on detecting contact information and the names of business executives/managers from website imprint pages, which German law requires to be present. A reliable method has been developed and tested with satisfactory results.
  • Statistics Netherlands used more third-party data for finding URLs using probabilistic linkage methods. They developed a logistic regression-based probability link function to link third-party data to the business register. An audit sample was used to calibrate the linkage probability. In addition, a longitudinal analysis was performed on the monthly linking results, showing different patterns of gaps, which were classified into stayers, entries, exits, etc. The results will be used to improve the linking procedure further by taking historical knowledge into account.
  • A two-step approach has been designed to derive NACE. In the first step, misclassifications are detected automatically. In the second step, the top-3 candidate codes are provided to the manual NACE editors, who make the final decision. The misclassification method currently works better on NACE codes with homogeneous activities. Tests with different feature sets are ongoing.
  • Statistics Austria kept their earlier developed URL-finding approach stable and focused on NACE detection improvements. They researched extending the word-driven approach to consider NACE's hierarchical nature. They experimented with different methodological setups, including advanced ML models applying a tailor-made hierarchical classification scheme to predict NACE levels up until the 4th digit for a selected number of NACE subsections.
  • Statistics Finland continued to experiment with using domain registry information for URL finding. In addition, they studied the use of data from open mapping services (Google, Bing, OpenStreetMap) to improve administrative information in the business register. In a test, they already found outdated addresses in the SBR, which suggests that this data source has additional value.
  • Statistics Sweden studied the state of play regarding identifying information on Swedish enterprise websites. They found that most enterprises have an organization number on their website. To detect NACE, they experimented with natural language processing models adapted for the Swedish language in combination with texts from annual reports.
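The two-step NACE approach mentioned in the list above can be sketched as follows. This is only a minimal illustration under stated assumptions: the per-code probabilities are assumed to come from some text classifier that is not shown, the function name `review_unit`, the flagging rule, and the threshold of 0.10 are hypothetical, and the NACE codes and probabilities are invented example values.

```python
def review_unit(registered_code: str, code_probs: dict,
                flag_threshold: float = 0.10, top_k: int = 3) -> dict:
    """Step 1: flag a suspected misclassification. Step 2: propose candidate codes."""
    # Step 1 (assumed rule): the registered code is suspicious when the
    # classifier gives it a probability below the threshold.
    p_registered = code_probs.get(registered_code, 0.0)
    suspected = p_registered < flag_threshold

    # Step 2: only for flagged units, hand the top-k codes to the manual editor.
    candidates = []
    if suspected:
        ranked = sorted(code_probs.items(), key=lambda item: item[1], reverse=True)
        candidates = [code for code, _ in ranked[:top_k]]

    return {"suspected_misclassification": suspected, "candidates": candidates}


# Invented classifier output for one enterprise website.
probs = {"47.11": 0.05, "56.10": 0.60, "10.71": 0.25, "47.24": 0.10}

flagged = review_unit("47.11", probs)    # registered code has low support
accepted = review_unit("56.10", probs)   # registered code is well supported
print(flagged)
print(accepted)
```

Keeping the final decision with the manual editors, as the text describes, means the automatic steps only narrow the search: a unit is either left alone or passed on with a short list of candidates.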

Overall, searching for digital traces of enterprises to improve the business registers of statistical organisations is promising. However, we also see that there are country- and organisation-specific aspects to be faced. The data landscape differs per country, and there are challenges to be solved, such as how to turn messy data sources into reliable statistical information. New methods, such as large language models, may help and can be the subject of further study.

At the end of 2024, the project's work will be summarised in a final deliverable, which will be published freely and openly on this portal. That deliverable is intended to give statistical organisations theoretical and practical knowledge from this work package on how they may apply the methods and techniques in this use case to their own statistical business register.

In addition, a topic-contributed session will be dedicated to the work in this use case at the International Conference on Establishment Statistics (ICES VII) on 18 June 2024 in Glasgow. More information on that session can be found here.

Finally, we want to mention the WIN conference, which will be announced later and held in early 2025. This would be an excellent opportunity to meet in person, share ideas, and discuss further business register improvement ideas in a larger community. Please keep an eye on this website!

Published May 2024
