
Issue 11 - Online Job Advertisements time series


Online Job Ads Time Series

The Web Intelligence Hub (WIH) team is developing a methodology for building time series of online job ads (OJAs) collected from multiple web sources through web scraping and other data acquisition methods.

When developing OJA time series, the following main points must be considered:

  • the regularity of OJA data ingestion is less than ideal,
  • scrapers can fail for reasons like changes in website structure or temporary overload,
  • APIs may change structure or stop updating, leading to gaps in the data ingestion process,
  • the market for online ads changes, leading to variations in the list of data sources from which data needs to be collected to maintain coverage of the OJA population.

From a methodological point of view, this leads to the following three problems:

  • Missing data:

If a delay occurs between two consecutive ingestion dates, some ads may not be collected. This may lead to an underestimation of the total number of ads and may also alter the distribution of ads across categories (e.g. occupations).

  • Uncertain posting dates:

If data were ingested daily (or even more frequently) from each source, the date on which an ad was posted would be known with certainty. However, this is usually not the case. Ingestion delays induce an irregular data pattern on the affected source, with 0 ads recorded for each day without ingestion and a potentially large number of ads collected on the single day when ingestion is restored. This pattern increases the noise of the time series.
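The zero-then-spike pattern described above can be reproduced with a minimal sketch. It is an illustration only, not the WIH method: the function name `observed_counts`, the toy numbers, and the assumption that no ad expires during the gap (so nothing is lost, only misdated) are all hypothetical.

```python
def observed_counts(true_daily, ingestion_days):
    """Attribute each ad to the first ingestion day on or after its posting day.

    true_daily     -- ads actually posted on each day (list of ints)
    ingestion_days -- set of day indices on which the source was ingested
    Assumption: ads stay online until the next ingestion, so none are missed.
    """
    observed = [0] * len(true_daily)
    pending = 0  # ads posted since the last successful ingestion
    for day, posted in enumerate(true_daily):
        pending += posted
        if day in ingestion_days:
            observed[day] = pending  # everything pending is first seen today
            pending = 0
    return observed


# Ten ads are posted every day, but ingestion fails on days 2, 3 and 4.
true_daily = [10, 10, 10, 10, 10, 10, 10]
ingestion_days = {0, 1, 5, 6}
observed = observed_counts(true_daily, ingestion_days)
print(observed)  # [10, 10, 0, 0, 0, 40, 10]
```

In this toy case the weekly total is preserved, but the daily series shows three empty days followed by a spike four times the true daily level. If ads could expire during the gap, the total would also fall, which is the missing-data problem described earlier.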

  • Incomparable source sets:

Data from some sources are not collected at all before or after a certain date, so the source population changes over time. This limits the comparability of OJA data over time, especially in the medium to long term.

The new method for generating OJA time series uses classic statistical models (survival analysis and chaining) to address the problems above. Its robustness has been assessed by comparing its results with those of other existing methods, in particular building raw time series without adjustments and restricting the data to a subset of sources deemed to offer a stable data flow over time. The method is subject to improvement and revision as improved OJA data become available or new methods are proposed.
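To give an idea of how chaining can handle a changing set of sources, here is a minimal sketch of a generic chain-linked index. It is not the WIH implementation, whose details are not given here: the function name `chained_index`, the source labels and the counts are hypothetical. The sketch assumes that growth between two consecutive periods is measured only on the sources present in both, and that these period-to-period ratios are then multiplied together.

```python
def chained_index(counts_by_period, base=100.0):
    """Chain-link period-to-period growth measured on overlapping sources.

    counts_by_period -- list of dicts {source name: number of ads}, one per period
    Returns an index series starting at `base`.
    """
    index = [base]
    for prev, curr in zip(counts_by_period, counts_by_period[1:]):
        common = prev.keys() & curr.keys()  # sources observed in both periods
        prev_total = sum(prev[s] for s in common)
        curr_total = sum(curr[s] for s in common)
        if prev_total == 0:
            raise ValueError("no overlapping sources with ads; cannot link periods")
        index.append(index[-1] * curr_total / prev_total)
    return index


# Hypothetical data: source C enters in period 1, source A drops out in period 2.
periods = [
    {"A": 100, "B": 50},
    {"A": 110, "B": 55, "C": 200},
    {"B": 66, "C": 240},
]
raw_totals = [sum(p.values()) for p in periods]
index = chained_index(periods)
print(raw_totals)  # [150, 365, 306]
print(index)       # [100.0, 110.0, 132.0]
```

The raw totals jump when source C appears and fall when source A disappears, while the chained index reflects only the growth observed on comparable sources (10% and then 20% in this example).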

Call for feedback

If you have any feedback or comments that could help us improve the OJA data, please don't hesitate to contact us: estat-wih@ec.europa.eu.

We look forward to a fruitful collaboration,

WIH Team

Published February 2023