Results of the Web Intelligence Online Job Advertisement (OJA) Deduplication Challenge

Agnieszka ZAJAC @Eurostat • 19 January 2026
Web intelligence

The European Statistics Awards Programme Web Intelligence competitions aim to stimulate innovation in retrieving data from the World Wide Web for producing European statistics. The first challenge, on deduplication, focused on identifying potential duplicates of job postings published on the web. Deduplication is a basic condition for producing high-quality statistics from online job advertisements, because companies often publish the same advertisement on several web portals. To avoid double counting, postings advertising the same job must be identified and removed using automatic, robust solutions that can process large amounts of data efficiently.

The competition dataset contained 112 000 online job advertisements, retrieved from around 400 websites active in the European Union. The competition organisers took unique, authentic job advertisements and created full, semantic, temporal and partial duplicates of them across different languages, producing a synthetic multilingual dataset for the competition.

The source of the original dataset is the European Web Intelligence Hub, where around 200 million online job advertisements have been collected and classified since July 2018.

The participants had to provide documented scripts in either R or Python that would identify duplicate job advertisements (full duplicate, semantic duplicate, temporal duplicate or partial duplicate). They had to address a number of challenges, including identifying duplicates within a multilingual dataset by applying cross-lingual techniques (identifying semantic duplicates among online job advertisements in different languages) and field mismatch (i.e. job advertisements having different field values that represent the same thing). Handling cross-linguality is particularly important when employers advertise jobs internationally.
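As a rough illustration of the simplest part of this task, the sketch below flags full duplicates by hashing normalised text and partial duplicates by Jaccard similarity over word shingles. It is not any team's solution: the function names, the `FULL` and `PARTIAL` labels, the shingle size and the 0.5 threshold are all assumptions made for the example, and it does not attempt semantic, temporal or cross-lingual matching.

```python
import hashlib
import re
from itertools import combinations


def normalise(text: str) -> str:
    """Lowercase the text and keep only word characters, separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def fingerprint(text: str) -> str:
    """Hash of the normalised text; equal hashes indicate a full duplicate."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences taken from the normalised text."""
    tokens = normalise(text).split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(ads: dict, threshold: float = 0.5) -> list:
    """Compare every pair of ads and return (id_a, id_b, label) tuples.

    The labels and the threshold are hypothetical choices for this sketch.
    """
    results = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(ads.items()), 2):
        if fingerprint(text_a) == fingerprint(text_b):
            results.append((id_a, id_b, "FULL"))
        elif jaccard(shingles(text_a), shingles(text_b)) >= threshold:
            results.append((id_a, id_b, "PARTIAL"))
    return results
```

Comparing every pair grows quadratically with the number of advertisements, so a dataset of the size used in the competition would in practice need some form of candidate selection (for example grouping by company or location, or locality-sensitive hashing) before pairwise scoring.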

The deduplication challenge was launched in December 2022, with final deadlines for submitting duplicates in March 2023 and documentation in April 2023.

Participation was broad: a total of 69 teams, comprising 137 individuals from 17 countries, signed up for this challenge. The results of the evaluation are announced below.

The participants were competing for three types of awards:

  • Accuracy – for identifying as many duplicates as possible within the synthetic dataset created by the organising team. The Accuracy Award addressed the cross-lingual aspects of the competition.
  • AccuracyPlus – a "discovery" prize, rewarding teams that managed to find potential duplicates not identified by the organising team. Because no gold standard exists for such duplicates, the AccuracyPlus scoring was based on inter-team agreement.
  • Reproducibility – for the most reproducible and scalable solutions, suitable for regular production.
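The exact AccuracyPlus scoring rule is not described here. As a purely hypothetical sketch of what scoring by inter-team agreement could look like, the example below treats a candidate pair as a likely duplicate when at least a given share of teams reported it, and credits each team with the number of such consensus pairs it found. The function names, the `min_share` parameter and the counting rule are assumptions for illustration, not the organisers' method.

```python
from collections import Counter


def canonical(pair: tuple) -> tuple:
    """Order the two ad ids so that (a, b) and (b, a) count as the same pair."""
    return tuple(sorted(pair))


def consensus_pairs(submissions: dict, min_share: float = 0.5) -> set:
    """Pairs reported by at least `min_share` of the teams (hypothetical rule)."""
    n_teams = len(submissions)
    votes = Counter()
    for pairs in submissions.values():
        # Building a set first gives each team at most one vote per pair.
        votes.update({canonical(p) for p in pairs})
    return {pair for pair, count in votes.items() if count / n_teams >= min_share}


def team_scores(submissions: dict, min_share: float = 0.5) -> dict:
    """Number of consensus pairs found by each team."""
    consensus = consensus_pairs(submissions, min_share)
    return {
        team: len({canonical(p) for p in pairs} & consensus)
        for team, pairs in submissions.items()
    }
```

A rule of this kind rewards pairs that several independent methods agree on, which is one way to approximate correctness when no labelled reference set is available.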

Accuracy Award winners

Place | Prize | Team name | Team members | Country
1st place | 10 000 EUR | TwoTired | Leonard Mandtler, Axel Forsch, Thomas Lüke | Germany
2nd place | 4 000 EUR | TheDeDuplicators | Jannic Cutura, Dimitris Petridis, Stefan Pasch, Charis Lagonidis | Germany and Greece
3rd place | 3 000 EUR | IDA | Jakub Żerebecki, Mikołaj Tym | Poland

AccuracyPlus Award winners

Place | Prize | Team name | Team members | Country
1st place | 3 000 EUR | Smrek | Samo Kosík, Marek Cedula, Radoslav Čársky | Slovakia
2nd place | 2 000 EUR | StudentiUnibo | Roberto Cornali, Sofia Camilla Todeschini | Italy
3rd place | 1 000 EUR | name not disclosed (winner not yet reached) | |

Reproducibility Award winners

Place | Prize | Team name | Team members | Country
1st place | 10 000 EUR | TheDeDuplicators | Jannic Cutura, Dimitris Petridis, Stefan Pasch, Charis Lagonidis | Germany and Greece
2nd place | 4 000 EUR | Nins | Antoine Palazzolo | France
3rd place | 3 000 EUR | IDA | Jakub Żerebecki, Mikołaj Tym | Poland

Solutions of the Reproducibility Award winners

Because the Reproducibility Award was intended to reward the most thoroughly described and documented solutions with the most innovative and open approach, we are happy to share the solutions that won 1st, 2nd and 3rd prize:

1st place: the TheDeDuplicators solution is available at the following links:

2nd place: the Nins solution is available at the following links:

3rd place: the IDA solution is available at the following link: