Online-Based Enterprise Characteristics (OBEC)
Introduction
The Online-Based Enterprise Characteristics (OBEC) use case is being developed as part of the Web Intelligence Network (WIN), a collaborative effort to explore new data sources for statistical purposes. OBEC focuses on retrieving and exploring web-based information about enterprises to provide insights into their characteristics, economic activities, and online presence. This initiative builds upon previous research within the European Statistical System (ESS) and aims to enhance traditional business statistics by incorporating web intelligence methodologies.
The Web Intelligence Hub (WIH) provides the infrastructure for data retrieval and processing, but no official statistical data is currently being produced under this use case.
Methodology
The OBEC methodology is centered on the automated retrieval and classification of enterprise-related data from publicly available online sources, in particular the enterprises' websites. The key steps in this process include:
- Defining the Target Population: The OBEC use case defines its enterprise population based on statistical business registers available at the national statistical authorities.
- Website Identification: Statistical registers include website URLs for many enterprises, but not for all of them. When a URL is missing, it is obtained using search engines.
- Web Content Acquisition: Web scraping tools are used to collect structured and unstructured content from enterprises' websites.
- Classification and Standardization: Extracted content is classified according to established statistical classifications such as NACE (for economic activity). Ontologies and Natural Language Processing (NLP) techniques are employed to perform the classification.
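The content acquisition and classification steps above can be illustrated with a minimal sketch. It is not the WIN implementation: a simple keyword count stands in for the ontology and NLP techniques, the keyword lists in `NACE_KEYWORDS` are hypothetical, and the page is an inline sample rather than a scraped website. Only the NACE Rev. 2 section letters (C, G, I, J) are taken from the official classification.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists per NACE Rev. 2 section (illustrative only).
NACE_KEYWORDS = {
    "C": {"manufacturing", "factory", "production"},      # Manufacturing
    "G": {"shop", "retail", "wholesale", "store"},        # Wholesale and retail trade
    "I": {"hotel", "restaurant", "accommodation"},        # Accommodation and food service
    "J": {"software", "telecommunications", "publishing"} # Information and communication
}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def classify_nace_section(text):
    """Return the NACE section with the most keyword hits, or None if there are none."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    scores = {
        section: sum(1 for t in tokens if t in keywords)
        for section, keywords in NACE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Inline sample page standing in for a scraped enterprise website.
sample_html = """
<html><head><title>Example Hotel</title><style>body {color: red}</style></head>
<body><h1>Welcome to our hotel</h1>
<p>Our restaurant serves breakfast. Book your accommodation online.</p>
<script>var shop = 1;</script></body></html>
"""

text = extract_text(sample_html)
section = classify_nace_section(text)
print(section)
```

A production pipeline would replace the keyword count with a trained classifier or an ontology lookup, and would retrieve pages with a scraper that respects each site's access rules. The two-stage shape (extract text, then map it to a classification code) would stay the same.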
Quality
Ensuring high-quality OBEC data is a key challenge due to the heterogeneous nature of online sources. The Web Intelligence Network is actively working on improving data validation and classification through:
- Data Quality Assessment: The reliability and consistency of extracted data are continuously assessed, with particular attention to missing information, duplicate records, and inconsistencies in enterprise classifications.
- Comparison with Official Sources: Efforts are underway to cross-validate OBEC data against existing business registers and other datasets, identifying gaps and opportunities to improve accuracy.
- Enhancement of Classification Models: The WIN project team is refining NLP-based classification approaches to reduce errors in economic activity coding and enterprise name recognition.
Despite these ongoing improvements, OBEC data remains experimental and should be interpreted with caution when used for statistical purposes.
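The quality checks described in this section (missing information, duplicate records, and classification inconsistencies relative to a business register) can be sketched as follows. The record layout, the field names `id`, `url` and `nace`, and the sample values are all assumptions made for illustration; they do not reflect the actual OBEC data model.

```python
from collections import Counter

# Hypothetical web-derived records: enterprise id, website URL, NACE section.
scraped = [
    {"id": "E1", "url": "https://example.com", "nace": "I"},
    {"id": "E2", "url": None, "nace": "G"},
    {"id": "E3", "url": "https://example.org", "nace": "J"},
    {"id": "E3", "url": "https://example.org", "nace": "J"},  # duplicate record
]

# Hypothetical business register extract: enterprise id -> NACE section.
register = {"E1": "I", "E2": "C", "E3": "J"}


def quality_report(records, register):
    """Summarise missing URLs, duplicate ids, and NACE disagreements with the register."""
    id_counts = Counter(r["id"] for r in records)
    return {
        # Records with no website URL.
        "missing_url": sorted({r["id"] for r in records if not r["url"]}),
        # Enterprise ids that occur more than once.
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        # Web-derived NACE section differs from the register value.
        "nace_mismatch": sorted(
            {r["id"] for r in records
             if r["id"] in register and r["nace"] != register[r["id"]]}
        ),
        # Records that cannot be matched to the register at all.
        "not_in_register": sorted({r["id"] for r in records if r["id"] not in register}),
    }


report = quality_report(scraped, register)
print(report)
```

A mismatch does not by itself show which source is wrong: the register entry may be outdated, or the web classification may be in error, so flagged cases would normally be reviewed rather than corrected automatically.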
Use of OBEC Data
OBEC data is not yet publicly available for statistical use, but the Web Intelligence Network is actively developing the use case.
Future developments aim to:
- Provide controlled access to OBEC data for statistical research.
- Develop guidelines for integrating web intelligence into official business statistics.
- Explore the potential for OBEC data to support economic indicators and digital economy insights.
Statisticians interested in contributing to the OBEC project or exploring its potential applications are encouraged to engage with the Web Intelligence Network and participate in ongoing discussions on web-based business statistics.