What benefits and opportunities do new data sources bring?
Our society is undergoing rapid digitalisation. Digital artefacts and online processes form an important part of our everyday life from an organisational, social and economic perspective. The scale and scope of big data (supported by connectivity, platforms and algorithms) have grown ever wider, even compared to other disruptive technologies that have emerged over the course of history. This brings great opportunities but also great challenges.
In a world where smart technology (e.g. related to the web, telecommunications and IoT) is ubiquitous, official statistics are evolving to use new technologies for the production of smart statistics whose quality can be trusted. Trusted Smart Statistics was conceived as a service based on smart systems which:
- embeds auditable and transparent data life cycles,
- ensures the validity and accuracy of the outputs,
- respects data subjects' privacy and protects confidentiality.
Following the digitalisation of our societies, new non-traditional data sources are seen as an opportunity for the production of official statistics, motivated by the multiple potential benefits:
- improved timeliness and accuracy,
- increased level of detail and relevance,
- finer temporal & spatial granularity,
- low production cost.
Quality in OJA data
OJAs are job advertisements published on the web, potentially including information on occupation and location, employer characteristics (e.g. economic activity) and job requirements (e.g. work experience and skills). Currently, the OJA database contains information on more than 100 million distinct OJAs collected from 316 sources in Europe.
While useful for a variety of statistical purposes and providing important advantages compared to traditional data sources, OJA data also carries limitations, in particular:
- OJAs represent only part of job demand, as not all job vacancies are advertised online. Thus, OJAs do not cover the entire population of job ads:
- Some occupations and economic activities are less well represented in web advertisements than others are.
- In some regions, digital tools may not be widespread enough to encourage employers to publish job advertisements online.
- The penetration of OJV markets varies within and across countries and may change over time.
- The volume, variety and quality of the data depend on the selection of portals (also called landscaping), a crucial part of the data collection.
- The classification models and data processing tools may be subject to errors even though they use the most up-to-date technologies.
- The limitations of the models developed to sort and organise such a diverse and complex universe of OJA data have to be considered.
This blog discusses some important data quality dimensions for official statistics, putting them in relation to OJA data.
RELEVANCE
Perhaps the greatest potential strength of OJA data is that it allows statistical information to be produced on previously unexplored phenomena relevant to users. Compared to traditional labour market data sources, it contains previously unavailable information (e.g. skill requirements), greater granularity (see Ascheri et al., 2021, who exploit it in their labour market concentration index for functional urban areas) and better timeliness (useful, for example, for nowcasting labour market series). Data accuracy, assessed through classifier evaluation, and metadata availability are necessary to increase confidence in this newly available information and reap its potential benefits.
Relevance can also be influenced by contextual factors of the institutional environment. For example, the type of agreement with the owner of the big data source (e.g. whether it is binding) can influence its sustainability over time. Legislation and privacy concerns may also limit the usability, and potentially the relevance, of OJA data.
DATA ACCURACY, SELECTIVITY AND REPRESENTATIVENESS
UNECE (2014) stresses the importance of selectivity as a quality dimension of web data and as a sub-dimension of data accuracy. The selectivity of a data source is its degree of representativeness. As discussed above, OJA data suffer from selectivity: because only online job ads are observed in the OJA database, indicators are likely to be biased by the job ads that are never observed. The problem is less serious within occupations with a large fraction of ads published online. This is the case, for example, for IT-related jobs, which are better represented than other occupations in web-collected job ad data and account for very large numbers of ads in the OJA database.
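As a hypothetical illustration of how selectivity might be quantified, the sketch below computes, per occupation, the share of vacancies that appear online, assuming an external benchmark of total vacancies is available. All figures, occupation labels and the threshold are invented for the example, not actual OJA values:

```python
# Sketch: flag occupations whose online coverage may be too low for reliable
# OJA-based indicators. The benchmark totals are hypothetical placeholders
# for an external source such as official job vacancy statistics.

online_ads = {"IT": 95_000, "Construction": 12_000, "Agriculture": 1_500}
benchmark_vacancies = {"IT": 100_000, "Construction": 60_000, "Agriculture": 30_000}

MIN_COVERAGE = 0.5  # illustrative threshold, not an official value

def coverage_by_occupation(ads, benchmark):
    """Return the online coverage ratio per occupation."""
    return {occ: ads.get(occ, 0) / total for occ, total in benchmark.items()}

coverage = coverage_by_occupation(online_ads, benchmark_vacancies)
low_coverage = sorted(occ for occ, c in coverage.items() if c < MIN_COVERAGE)
# In this toy data, IT ads cover most of the demand, while agriculture
# and construction are poorly represented online.
```

Indicators computed only for the well-covered occupations would be less affected by the selectivity bias described above.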
OJA data may also encounter accuracy issues, especially due to the complexity and the volume of data that has to be classified in the data processing phase. Some degree of misclassification may be encountered for all the categorical variables in the OJA.
Finally, for a variety of other reasons, such as scarce information in the ads, inadequate classifiers, or interruptions in the data collection pipeline for some of the selected sources (due to spam, problems with portal/site access, etc.), some variables in the OJA can be affected by a substantial amount of missing values.
COMPARABILITY OVER TIME
Comparability over time relates to the UNECE quality dimension of "time-related factors", such as "Timeliness", "Periodicity", and "Changes through time". In the context of the OJA data collection:
- The tools used for the processing of the data may be subject to errors even though they use the most up-to-date technologies.
- The algorithms for data scraping and processing are still being refined and documented.
Thus, the OJA data may not be directly comparable across time and data releases. The current algorithms and data sources are tuned to the current and most recent data acquisitions; therefore, the OJA data is best suited to cross-sectional comparisons for recent periods of time.
The most accurate time measure in the online job ad dataset is the day on which an ad was scraped. The comparability over time will be improved with the introduction of more standardised statistical processes for regularly produced statistical products.
AUDITABILITY
Some quality factors for the big data validation systems (e.g., performance, security and robustness) have also to be considered in the quality assessment. A way to ensure the quality along the production process is to apply audit checks in the relevant phases of the data lifecycle and a feedback mechanism in the quality assessment process. In the case of the OJA data, given the large size (millions of ads), the statistical variables have to be extracted from text in natural language in an automated way.
As regards the processing of the data (throughput), the UNECE quality framework refers to some general principles, such as system independence, the existence of steady states and the existence of quality gates.
- System independence means that data processing should follow theoretical principles and not be dependent on the system that implements them.
- Steady states are intermediate datasets maintained in the data processing pipeline for control purposes (e.g. classifier evaluation via audit samples).
- Quality gates are points in the data processing pipeline where data quality is assessed (data validation process).
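The throughput principles above can be sketched as a minimal pipeline in which each quality gate validates an intermediate (steady-state) dataset before the next processing step runs. The field names and rules below are illustrative assumptions, not the actual OJA implementation:

```python
# Sketch of quality gates in a processing pipeline: each gate checks an
# intermediate dataset and raises if validation fails, so errors are caught
# at the stage where they occur. Field names and rules are illustrative.

def gate_completeness(records):
    """Quality gate: required fields must be present and non-empty."""
    for r in records:
        if not r.get("title"):
            raise ValueError(f"missing title in record {r}")
    return records

def gate_valid_codes(records, valid_occupations):
    """Quality gate: occupation codes must belong to the classification."""
    for r in records:
        if r["occupation"] not in valid_occupations:
            raise ValueError(f"unknown occupation code: {r['occupation']}")
    return records

def run_pipeline(raw_records, valid_occupations):
    # Steady state 1: raw scraped ads, checked for completeness.
    stage1 = gate_completeness(raw_records)
    # Steady state 2: classified ads, checked against the classification.
    stage2 = gate_valid_codes(stage1, valid_occupations)
    return stage2

ads = [{"title": "Data engineer", "occupation": "ISCO-25"}]
clean = run_pipeline(ads, valid_occupations={"ISCO-25", "ISCO-21"})
```

Because each gate sits between two steady states, a failure points directly at the processing step that introduced the problem, which is the feedback mechanism the framework calls for.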
The ability of such systems to generate accurate, consistent and comparable data needs to be addressed by implementing the specific methodological and quality frameworks proposed for new data sources. When producing official statistics, data quality assessment is a crucial step. Currently, comprehensive analyses of quality standards and quality assessment methods for new data sources are rare, and when working with big data it is almost impossible to state generally applicable quality guidelines. To overcome this issue, and at the same time give users clear guidance on using web data, the formulation of quality indicators is essential.
Latest quality improvements to OJA data
Classifiers are checked continuously to maximise classification performance and increase precision. Recently, a transversal quality check on the location variables was carried out to ensure that specific localised terminology is handled correctly; for example, it was applied to several cities, with a focus on Portugal and Italy. Moreover, the classifier for the educational level was adapted to avoid associating multiple educational levels with a single advertisement.
Additionally, the consistency of the data over time is important for producing high-quality statistics. With this in mind, the classifiers for some of the sources in the data processing were changed to improve the consistency of the time series (e.g. the classifiers used for the variables working time and economic activity).
Challenges in OJA data quality
Two important challenges for OJA data are the irregularity of web content retrieval, which is reflected in irregular time series, and the changing landscape of data sources, i.e. changes in the structure of the portals and websites that are crawled.
Ideally, a fixed set of all relevant data sources would be scraped regularly for a fixed period of time, with data collected daily or downloaded through an API providing complete access. However, in practice, scrapers can fail for several reasons, such as changes in the structure of a website or its temporary overload. In these cases, a delay occurs between two scrapings, lasting from a few days to even months, as already identified in the OJA data scraping. In addition, the market for online ads changes, leading to changes in the list of data sources used to collect OJAs.
There are three problems that cause irregular time series in OJA data, which need to be addressed to track the actual flow of OJAs over time:
- Some ads disappear before they are scraped (during the fixed scraping period), leading to a problem of missing data. This is an "invisible problem": as there is no sampling frame for OJAs, it is not known how many ads go unrecorded or what their characteristics are.
- A scraping run that occurs after a delay collects all the ads that appeared during that delay. This results in an irregular time series for a particular source, with 0 ads observed on all the days without a scraper running and a large number of ads on the date when the scraper finally collected the data. Plotting ad counts over time is therefore unreliable, because the time measure recorded in the OJA data, which usually corresponds to the scraping date, is not a precise measure of the date on which the ad actually appeared online (the "posting date").
- Data from some sources are not collected at all before or after a certain date, leading to a changing source population and therefore limiting the comparability of OJA data over time. This causes irregular time series.
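As a naive illustration of how the second problem might be mitigated, the sketch below spreads the ads collected by a delayed scraping run evenly over the days since the previous run. This is only a simplistic assumption for the example, not the method actually applied to OJA data, which can draw on richer time information:

```python
# Naive sketch: smooth a spike caused by a delayed scraping run by spreading
# its ad count uniformly over the gap days. A real correction would use
# richer time information (e.g. posting dates reported in the ads).

def smooth_delayed_scrape(daily_counts):
    """daily_counts: ads collected per day (0 on days the scraper did not
    run). Each spike is spread over the preceding gap plus the spike day."""
    smoothed = []
    gap = 0
    for count in daily_counts:
        if count == 0:
            gap += 1
            smoothed.append(0)
        else:
            share = count / (gap + 1)
            # Overwrite the zero-filled gap days with the average.
            for i in range(len(smoothed) - gap, len(smoothed)):
                smoothed[i] = share
            smoothed.append(share)
            gap = 0
    return smoothed

# A 3-day outage followed by a spike of 120 ads becomes 30 ads per day.
print(smooth_delayed_scrape([40, 0, 0, 0, 120]))
# [40, 30.0, 30.0, 30.0, 30.0]
```

Note that this preserves the total ad count while removing the artificial spike, but it cannot recover ads that disappeared before being scraped (the first problem) or compensate for sources that dropped out entirely (the third).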
Next steps in OJA data quality improvement
In a global big data quality framework, the quality at all stages of the big data lifecycle should be addressed. Currently, three main quality assessment and improvement tools are being developed to apply a quality framework to OJA data: validation, classifier evaluation and landscaping. Validation is currently the only one that has been fully deployed. Yet, it still needs to extend in two directions: first, some validation rules may also be checked in the early stages of the data processing to give earlier feedback to the system; second, higher levels of validation could be implemented by systematically comparing OJA data distributions with other labour market data like job vacancies.
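One way such a higher-level validation could look is a simple comparison of distributions: the sketch below flags categories where the OJA share deviates strongly from an external benchmark share, such as official job vacancy statistics. All figures, category labels and the tolerance are hypothetical:

```python
# Sketch: compare the OJA occupation distribution with an external benchmark
# (e.g. job vacancy statistics) and flag large deviations. All numbers and
# the tolerance are illustrative, not actual OJA validation rules.

oja_shares = {"IT": 0.40, "Health": 0.20, "Construction": 0.05}
benchmark_shares = {"IT": 0.15, "Health": 0.22, "Construction": 0.18}

TOLERANCE = 0.10  # flag shares differing by more than 10 percentage points

def flag_deviations(oja, benchmark, tol):
    """Return categories where the OJA share deviates from the benchmark."""
    return sorted(
        cat for cat in benchmark
        if abs(oja.get(cat, 0.0) - benchmark[cat]) > tol
    )

print(flag_deviations(oja_shares, benchmark_shares, TOLERANCE))
# ['Construction', 'IT']: IT over-represented, construction under-represented
```

Flagged deviations would not automatically invalidate the data; they would be a signal for analysts to investigate selectivity or classification problems in the affected categories.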
The classifier evaluation has seen great progress but has yet to be tested on the data. This can be done using data annotation processes that provide meaningful information on the variable classification process. The evaluation exploits the richness of the information extracted from online job portals, mainly the full description of the job ads, but also additional text extracted from structured fields (e.g. raw text on job title, salary, etc.). When fully operational, it will help improve the existing classifications, as well as the existing OJA ontologies, by comparing raw job descriptions with final classification outcomes.
Landscaping still needs to be fully integrated into the data quality framework being established, an important next step for OJA quality assurance. Standardised landscaping of data sources is the only tool able to ensure adequate coverage of the population of online job ads, which is a prerequisite for reliable analyses such as comparisons with other labour market data series.
Even the application of standardised source landscaping will not solve two other challenges of the OJA data collection: the irregularity of source crawling and the changing landscape of data sources. These problems are due to changes over time in the set of websites that are crawled. Crawlers and scrapers sometimes fail due to changes in the structure of websites or to their temporary overload. In these cases, data from some websites become unavailable, temporarily or permanently, negatively affecting the comparability of data over time. A number of statistical methods are currently being tested to improve OJA time series, based on the rich time-related information available for each ad (date found online, last date seen online, time information reported in the ad such as "date posted").
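A hedged sketch of how the per-ad time information mentioned above could be used: instead of counting an ad only on its scraping date, it can be counted as active on every day between the dates it was first and last seen online. The field names below are hypothetical, not the actual OJA schema:

```python
# Sketch: build a daily "active ads" series from per-ad first/last seen
# dates, rather than from scraping dates alone. Field names are hypothetical.

from datetime import date, timedelta
from collections import Counter

ads = [
    {"first_seen": date(2022, 7, 1), "last_seen": date(2022, 7, 3)},
    {"first_seen": date(2022, 7, 2), "last_seen": date(2022, 7, 2)},
]

def active_ads_per_day(ads):
    """Count how many ads were online on each day of their visible lifetime."""
    counts = Counter()
    for ad in ads:
        day = ad["first_seen"]
        while day <= ad["last_seen"]:
            counts[day] += 1
            day += timedelta(days=1)
    return dict(sorted(counts.items()))

print(active_ads_per_day(ads))
# 2022-07-02 has two active ads; 2022-07-01 and 2022-07-03 have one each.
```

A stock series of this kind is less sensitive to the exact day a scraper happened to run than a flow series based on scraping dates.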
Finally, an important step in monitoring data quality will be the compilation of a list of quality indicators for systematic monitoring. While some such indicators already exist (for example, those illustrated in the section on validation), others are still being conceptualised (for example, classifier accuracy rates or the number of sources in agreement), and yet others need to be conceived and developed for some quality domains (e.g. quality of processing).
We hope you enjoyed our post on quality in big data.
Published July 2022