Lessons learned from Eurostat’s Deduplication Challenge

This webinar, which took place on 13 May 2024, presented the work of two teams that participated in Eurostat's first deduplication challenge as part of the European Statistics Awards Programme Web Intelligence competition.  

Included in the webinar:

  • Different methods for identifying duplicates in a multilingual dataset, using job advertisements as a case study. These include entity recognition, transformer-based approaches that compare the similarity of the offers' vector embeddings, and experiments with MinHash.
  • Best practices for conducting a data science project (with the deduplication challenge as an example), such as using the Kedro framework for Python, along with a presentation of the Onyxia Datalab.
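
To give a flavour of one of the techniques mentioned above, here is a minimal MinHash sketch in Python. It is not the code of either team: the function names, the 3-word shingles, the 64-value signature length and the sample adverts are all assumptions chosen for illustration, and it uses only the standard library.

```python
import hashlib
import re

NUM_HASHES = 64  # illustrative signature length (an assumption, not a recommendation)


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, varied by an integer seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_seeded_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The share of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical job adverts, invented for this example.
ad_a = "Data engineer wanted in Lyon to build and maintain data pipelines for a retail company"
ad_b = "Data engineer wanted in Paris to build and maintain data pipelines for a retail company"
ad_c = "Experienced pastry chef needed at a small family bakery near the harbour"

sig_a, sig_b, sig_c = (minhash_signature(ad) for ad in (ad_a, ad_b, ad_c))
near_duplicate_score = estimated_jaccard(sig_a, sig_b)
unrelated_score = estimated_jaccard(sig_a, sig_c)
```

Pairs whose estimated similarity exceeds a chosen threshold can then be passed on for closer inspection. Note that MinHash compares surface text, so on its own it would not match the same offer published in two languages; that case is where embedding-based comparison is more likely to help.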

You can view the slides from the presentation (below) and watch the webinar recording on our YouTube channel.

Files