Issue 18 - Getting enterprise characteristics based on website data

In the 8th issue of this Blog, referring to “Moving Big Data into Statistical Production”, which was published in November 2022, we mentioned two use cases, OJA (Online Job advertisements) and OBEC (Online Enterprise Characteristics), that are core components of the Web Intelligence Network. We have provided general information about the use cases. Just a brief reminder of what OJA and OBEC stands for:

OJA - the aim to provide reliable information on the labour market demand based on the data acquired from websites, mostly job offer portals
OBEC - aims to provide data on characteristics based on enterprise websites.

If you need more information on that, refer to the Issue 8.

In this Blog, we would like to concentrate on the current state of these two projects.

OBEC – Online Based Enterprise Characteristics

Regarding the OBEC, we explored the viability of attaining comparable results by utilising Web Intelligence Network-related technologies, including Python and NoSQL. Five countries, namely Austria, Bulgaria, Germany, Lithuania, and Poland, have already collected data for tables containing the Social Media Presence (SMP) indicator. This indicator represents the percentage of companies engaged in social media channels across various categories such as Facebook, X, or LinkedIn. It works as presented in the graph below:

Issue 18 image In Austria, the business population amounted to approximately 43 thousand units. In Bulgaria, our analysis covered 26,864 units, while in Germany, out of 29,878 units, URLs were identified for 10,679. Lithuania underwent scrutiny, with 14,652 entities examined, of which 10,856 boasted an online presence through a dedicated website. Unfortunately, 8,237 units were scrapped due to site blockages or unavailability. Results for Poland were derived through two avenues – the first based on the full population of 130 thousand units and the second on a sample of 20 thousand units. Python scripts facilitated both approaches, yielding similar results, yet the discussion favoured those obtained from the full population as more desirable.

Allow us to present more detailed data for Poland. After analysing 51 thousand URLs for the entire population, the SMP indicator as of September 2023 revealed that 36.9% of Polish enterprises with ten or more employees were present on Facebook, 7.3% on X, and 13.4% on LinkedIn. Notably, these values align with official data published by Eurostat based on the ICT Usage and E-commerce in Enterprises survey. Currently, Eurostat reports a general indicator indicating that 47.6% of Polish enterprises utilise social media, emphasising the comparability and reliability of our findings.

Our aim is to extend this use case in the next year to include more European countries. Before we achieve this goal, the next step is to validate the results using methods such as Levenshtein, Jaro and Damerau Levensthein distances.

OJA - Online Job Advertisements

Regarding OJA, most European countries can generate tables based on the dataset delivered by Eurostat, which includes data mostly from 2018 up to the present. This data is quarterly published by Eurostat in the Web Intelligence Hub datalab, with the most recent being the third quarter of 2023.

The majority of countries involved in WIN are calculating a static view of the OJA data, representing the demand on the labour market for specific occupations or skills. However, a distinctive approach has been adopted by Italy, Bulgaria, and Slovenia, which includes a set of tables based on occupations recorded in the 4-digit ISCO (International Standard Classification of Occupation). Two methods of calculating results have been embraced: OJA data stock (as of the day) and OJA data flow (within a month).

Our current focus is on investigating the data quality of this dataset to determine how deeply we can disaggregate the data for publication as official statistics while adhering to the data quality rules of official statistics. Consequently, WIN members are collaborating with Eurostat to concentrate on data validation and quality rules. Presently, we are engaged in annotation exercises with the aim of obtaining information on how accurately the machine learning algorithm has mapped occupations, skills, educational attainment levels, etc. The results will guide us in deciding the extent to which we should publish the data, considering the accuracy thresholds we set (e.g., 1-digit ISCO with 90% accuracy or 2-digit ISCO with 85% accuracy).

In conclusion, we are now equipped with a set of tables containing basic indicators. Our collaboration within the Web Intelligence Network is currently focused on ensuring the quality aspects of the data slated for publication. This necessitates ensuring that our dataset adheres to the quality standards set for official statistics. However, our approach differs from the standards associated with traditional statistical representative surveys. Given that we employ diverse datasets, such as web data, instead of responses to traditional surveys and employ different methods, such as machine learning, rather than traditional interviews, we must approach our data with a unique perspective. It is crucial to recognise that the nature of Big Data renders datasets based on web data more susceptible to fluctuations and instability compared to traditional administrative datasets or datasets derived from statistical surveys.