In December 2022, Eurostat launched its first deduplication challenge as part of the Web Intelligence competitions of the European Statistics Awards Programme, which also features a series of nowcasting challenges. Open to all, these competitions aim to "unveil innovative methodologies and valuable data resources that could improve the production of European Statistics".
The deduplication challenge, which ran for around four months, originated from the Online Job Advertisements (OJA) project, through which 200 million job offers have been scraped from the web and classified for statistical purposes since 2018. However, the same offer can be posted on different websites (LinkedIn, Indeed, etc.), in different languages or by different companies. Deduplicating the collected ads is therefore essential to publish unbiased statistics.
A set of 112k offers retrieved from around 400 websites was thus selected and anonymised so that participants could develop and optimise their deduplication approaches. For each job advertisement, several fields were made available:
- A job title;
- A description of the job;
- A location, extracted automatically from the job description;
- A company name, extracted from the description as well;
- The date on which the advertisement was retrieved by the bots of the Web Intelligence Hub (WIH).
From there, the goal was to identify different kinds of duplicates:
- Full duplicates: same job title and description;
- Semantic duplicates: the same job position is advertised, with the same content in terms of job characteristics (e.g. required skills or education) but expressed differently;
- Temporal duplicates: semantic duplicates with varying advertisement retrieval dates;
- Partial duplicates: the advertisements describe the same job position but do not contain the same characteristics.
In May 2024, two teams presented their work in a webinar:
- SPub.Fr, a collaboration between Insee and Dares;
- Nins, which received the second prize for reproducibility.
Several of the approaches presented can be generalised to broader (possibly multilingual) deduplication problems.
First, web-scraped data can be messy, for instance due to leftover HTML tags, special characters and trailing spaces. As in most NLP projects, data processing is a crucial step here, and it can involve different steps depending on how far the cleaning needs to go (a minimal sketch follows the list):
- Standardisation of the text: lowercase, without accents or punctuation;
- Lemmatisation of the texts: reducing words to their base (dictionary) form;
- Stop word removal: getting rid of determiners or pronouns;
- Filtering out the most frequent words that do not carry much meaning;
- Filtering out poorly described offers, for instance, where the description only consists of the cookies disclaimer.
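To make the first steps concrete, here is a minimal preprocessing sketch in Python. The stop-word list and the example advertisement are purely illustrative; a real pipeline would use full, language-specific resources (e.g. from spaCy or NLTK) and add lemmatisation:

```python
import re
import unicodedata

# Illustrative stop words only; a real pipeline would rely on a complete,
# language-specific list (e.g. from spaCy or NLTK).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for"}

def standardise(text: str) -> str:
    """Lowercase, strip accents, drop punctuation and extra whitespace."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Keep only letters, digits and spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def remove_stop_words(text: str) -> str:
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

raw = "  <p>Data   Scientist (H/F) – Télétravail possible!</p> "
without_tags = re.sub(r"<[^>]+>", " ", raw)  # strip leftover HTML tags
print(remove_stop_words(standardise(without_tags)))
# -> "data scientist h f teletravail possible"
```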
Additional features may also need to be computed, either to make the preprocessing more efficient or to obtain more discriminant elements for telling duplicates apart (a sketch follows the list):
- Language detection, for instance with the fastText language identification model;
- Named Entity Recognition (NER) to identify organisations and locations within the job descriptions, for instance with specialised HuggingFace models;
- The length of the description, or of smaller chunks of it, which can be used to speed up later operations.
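As an illustration, the sketch below combines fastText language identification with a multilingual NER pipeline from HuggingFace. The model files and names (lid.176.bin, Davlan/bert-base-multilingual-cased-ner-hrl) are just examples among the available options, not necessarily those used by the teams:

```python
import fasttext
from transformers import pipeline

# Assumes the fastText language identification model (lid.176.bin) has been
# downloaded beforehand; the NER model is one multilingual example among
# several on the HuggingFace Hub.
lang_model = fasttext.load_model("lid.176.bin")
ner = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
)

description = "Acme GmbH is hiring a data engineer in Berlin."

# fastText expects single-line input and returns labels like "__label__en"
labels, _ = lang_model.predict(description.replace("\n", " "))
print(labels[0].replace("__label__", ""))  # e.g. "en"

for entity in ner(description):
    print(entity["entity_group"], entity["word"])  # e.g. ORG Acme GmbH, LOC Berlin
```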
Armed with our clean dataset and new features, it is possible to jump into deduplication. However, instead of going straight for big language models, starting with more intuitive approaches is always interesting.
For instance, one can look for exact matches between pairs of offers based on all of the fields or only a subset of them. Two offers with the same job title, company name, and location, along with close retrieval dates, are very likely duplicates. If the company is known to publish offers in different languages, the same is true for two offers with the same company name, location, and retrieval date.
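A minimal sketch of this exact-matching logic with pandas, using hypothetical column names and values:

```python
import pandas as pd

# Toy dataset with hypothetical fields; the real data holds 112k offers.
offers = pd.DataFrame({
    "title": ["data scientist", "data scientist", "accountant"],
    "company": ["acme", "acme", "acme"],
    "location": ["paris", "paris", "lyon"],
    "retrieval_date": pd.to_datetime(["2023-01-10", "2023-01-11", "2023-01-10"]),
})

# Offers sharing the same title, company and location are exact-match
# candidates; retrieval dates can then be compared within each group.
key_columns = ["title", "company", "location"]
candidates = offers[offers.duplicated(subset=key_columns, keep=False)]
print(candidates)
```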
A next step can be to look for close (rather than exact) matches, for example when two offers are identical except for a few spelling mistakes. A notion of distance or similarity between texts then needs to be introduced. Among the most widely used are the Levenshtein distance, the Jaccard distance, and the Jaro-Winkler similarity. All of them are based on the surface form of the texts, i.e., their vocabulary or the closeness of the spellings.
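These measures can be computed, for instance, with the rapidfuzz library (one option among several):

```python
from rapidfuzz.distance import JaroWinkler, Levenshtein

a = "Senior Data Scientist (M/F)"
b = "Senior Data Scientst (M/F)"  # same title with a spelling mistake

print(Levenshtein.distance(a, b))    # number of edits, here 1
print(JaroWinkler.similarity(a, b))  # close to 1.0

# Jaccard similarity computed on word sets
tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
print(len(tokens_a & tokens_b) / len(tokens_a | tokens_b))
```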
However, those methods can only lead us so far. A more efficient approach is to use MinHash, a probabilistic algorithm to estimate the similarity between two sets. It works by applying several hash functions to our dataset and keeping the minimum hash value for each text, thus converting texts into signatures. The probability that two sets share the same minimum hash value equals their Jaccard similarity, so comparing signatures approximates it cheaply. The Record Linkage toolkit combines this by checking the MinHash similarity between 4-character shingles of our offers and the Jaro-Winkler similarity between company names and locations. Two ads with a computed similarity exceeding a threshold are considered potential duplicates.
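As a sketch of the MinHash idea, here with the datasketch library rather than the exact tooling used during the challenge, shingling and locality-sensitive hashing can be combined as follows:

```python
from datasketch import MinHash, MinHashLSH

def shingles(text, k=4):
    """Overlapping character 4-grams, as used for the offer descriptions."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def signature(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in shingles(text):
        m.update(shingle.encode("utf8"))
    return m

descriptions = {
    "offer_1": "we are looking for a data scientist in paris",
    "offer_2": "we are looking for a data scientist based in paris",
    "offer_3": "accountant needed in lyon",
}

# The LSH index returns candidates whose estimated Jaccard similarity
# exceeds the threshold, without comparing every possible pair.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {key: signature(text) for key, text in descriptions.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

print(lsh.query(signatures["offer_1"]))  # likely ['offer_1', 'offer_2']
```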
The approach's speed, combined with custom optimisations based on detected languages and description length ranges, makes it efficient on big datasets. Nevertheless, MinHash focuses only on monolingual duplicates based on the form of the texts and not on their true meaning. Other approaches then need to be considered.
A first solution could be to translate all of the offers to English with specialised pre-trained models to eliminate the problem. While this idea proved to be quite efficient, other approaches based on embedding the offers into numerical vectors were chosen during the challenge.
One can think of TF-IDF vectorisation, which embeds the data into a very high-dimensional space where each dimension corresponds to a measure of how specific a given word is to an offer compared to the whole corpus. While this approach represents offers by the content that distinguishes them from the others and brings reliable results, it requires further work to reduce the dimension of the embeddings and still does not handle multilingual pairs.
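A minimal TF-IDF sketch with scikit-learn, on toy descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "data scientist position in paris, python and sql required",
    "we hire a data scientist in paris with python and sql skills",
    "accountant needed in lyon",
]

# Each offer becomes a sparse vector whose dimensions weight words by how
# specific they are to that offer relative to the whole corpus.
vectoriser = TfidfVectorizer()
tfidf = vectoriser.fit_transform(descriptions)

print(cosine_similarity(tfidf).round(2))  # offers 0 and 1 score highest together
```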
The most natural solution here is to rely on meaning-based embeddings, which represent the job advertisements in a high-dimensional space. The vectors representing two offers will be geometrically close whenever their meanings are similar, regardless of the language they are written in.
Some models pre-trained on multilingual corpora exist for this, such as the following transformers: Multilingual BERT, XLM-RoBERTa, and Distiluse base multilingual (from sentence-transformers). They analyse each advertisement as a whole: words are not considered independently, so the context and meaning are taken into account, making the models resistant to paraphrasing.
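A minimal sketch with the sentence-transformers library, using one of the multilingual models mentioned above (the exact model version is an assumption):

```python
from sentence_transformers import SentenceTransformer, util

# Downloading the model requires internet access the first time.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

offers = [
    "We are looking for a data scientist in Paris.",
    "Nous recherchons un data scientist à Paris.",  # same offer in French
    "Accountant needed in Lyon.",
]

embeddings = model.encode(offers, normalize_embeddings=True)
print(util.cos_sim(embeddings, embeddings))
# The first two offers should be close despite the language difference.
```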
Choosing higher embedding dimensions, tuning hyperparameters through grid search, using even bigger models, or fine-tuning them to the specific context of our corpus could lead to better results. Still, for our use case, this can quickly become too costly.
Once the embeddings are computed, the idea is to check if the measure of similarity we chose (for instance, cosine similarity) exceeds a given threshold. Optimisations, such as focusing on multilingual companies, parallelising the comparisons per chunk, etc., can be useful to reduce the number of comparisons.
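A possible sketch of such a chunked comparison, assuming L2-normalised embeddings stored in a NumPy array (so that the dot product equals the cosine similarity):

```python
import numpy as np

def candidate_pairs(embeddings: np.ndarray, threshold: float = 0.8,
                    chunk_size: int = 1000):
    """Compare embeddings chunk by chunk to keep memory usage bounded."""
    n = embeddings.shape[0]
    pairs = []
    for start in range(0, n, chunk_size):
        chunk = embeddings[start:start + chunk_size]
        similarities = chunk @ embeddings.T  # (chunk_size, n) cosine similarities
        rows, cols = np.where(similarities >= threshold)
        for i, j in zip(rows + start, cols):
            if i < j:  # keep each pair once and skip self-comparisons
                pairs.append((int(i), int(j)))
    return pairs
```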
While these approaches lead to high recall, they will probably induce poor precision. Additional filters are then necessary to find the actual duplicates among all the candidates and to distinguish between the kinds of duplicates. These filters (combined in the sketch after the list) can be:
- A limit on the duration between two retrieval dates;
- A maximum difference in the length of the descriptions;
- A maximum Levenshtein distance between company names and locations;
- A comparison between the named entities extracted with our NER.
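A sketch combining these filters, with hypothetical field names and thresholds:

```python
from datetime import timedelta

from rapidfuzz.distance import Levenshtein

def passes_filters(offer_a: dict, offer_b: dict,
                   max_days: int = 30,
                   max_length_diff: int = 500,
                   max_name_distance: int = 3) -> bool:
    """Keep a candidate pair only if it satisfies every filter.

    Field names and thresholds are illustrative, not those of the
    winning solutions.
    """
    close_dates = abs(offer_a["retrieval_date"] - offer_b["retrieval_date"]) <= timedelta(days=max_days)
    similar_length = abs(len(offer_a["description"]) - len(offer_b["description"])) <= max_length_diff
    same_company = Levenshtein.distance(offer_a["company"], offer_b["company"]) <= max_name_distance
    same_location = Levenshtein.distance(offer_a["location"], offer_b["location"]) <= max_name_distance
    return close_dates and similar_length and same_company and same_location
```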
If the precision gets high enough, it is possible to use the transitivity of the relation "A is a duplicate of B" by representing our offers with an undirected graph. Each advertisement is a node, and an edge represents the "is a duplicate of" relation. All connected components can then be converted into cliques to account for transitivity and improve recall even more. However, this step amplifies errors: a single false positive can merge two unrelated groups, so the higher the precision of the candidate pairs, the better the results.
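A minimal sketch of this graph-based step with networkx, on hypothetical offer identifiers:

```python
import networkx as nx

# Candidate duplicate pairs that passed the filters (hypothetical ids)
duplicate_pairs = [("offer_1", "offer_2"), ("offer_2", "offer_3"), ("offer_4", "offer_5")]

graph = nx.Graph()
graph.add_edges_from(duplicate_pairs)

# Each connected component is treated as one group of duplicates:
# transitivity turns the edges 1-2 and 2-3 into the clique {1, 2, 3}.
for component in nx.connected_components(graph):
    print(sorted(component))  # e.g. ['offer_1', 'offer_2', 'offer_3']
```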
Overall, many approaches (here quite fast to execute) can be stacked to identify eligible pairs before adding additional layers to discard false positives. One can also check the winning teams' repositories here to see what methods led to the best results.
Regarding reproducibility, several aspects matter beyond the methodology, the code itself and the usual recommendations that can be made about it, such as:
- Choosing the right environment to work in. Coding on one's own computer can lead to several issues, such as the famous "It works on my machine, why doesn't it work on yours?". Leveraging the power of cloud-based data platforms brings huge advantages in terms of flexibility, security, and reproducibility. An example of such a platform is Onyxia, the reference project for the One-Stop Shop; you can check out its implementation with the SSP Cloud.
- Frameworks are often mentioned among data science best practices: they offer a way to structure code so that it is reproducible, maintainable, and modular. Kedro is an example of such a framework in Python (a minimal pipeline sketch follows this list). It provides several useful features, including a pipeline visualisation tool to better represent how objects and functions interact (and in what order) during the workflows. In R, targets is a similarly useful framework.
- Using the version control system Git to store, archive, and share code enhances efficiency when working in teams or with one's future self. The GitHub platform then allows sharing open-source code, such as the repositories for the deduplication challenge mentioned earlier that inspired this blog post!
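As an illustration of how such a framework structures a workflow, here is a minimal Kedro pipeline sketch; the node, dataset, and function names are hypothetical:

```python
from kedro.pipeline import node, pipeline

def clean_offers(raw_offers):
    ...  # standardisation, lemmatisation, stop-word removal

def find_duplicates(clean_offers_df):
    ...  # exact matches, MinHash, embeddings, filters

def create_pipeline(**kwargs):
    # Each node declares its inputs and outputs, so Kedro can infer the
    # execution order and display the workflow in its visualisation tool.
    return pipeline([
        node(clean_offers, inputs="raw_offers", outputs="clean_offers", name="clean"),
        node(find_duplicates, inputs="clean_offers", outputs="duplicate_pairs", name="deduplicate"),
    ])
```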
Published June 2024