
Scraping online data: sources, tools and methodologies. Hands-on analysis of Online Job Advertisements 2nd edition - 2024

Course Leader: Mauro Pelucchi
Target Group: Official statisticians working on big data methodology, data science, and employment and education statistics, as well as in other statistical domains that can profit from this data source.
Entry Qualifications
  • Sound command of English. Participants should be able to make short interventions and to actively participate in discussions
  • Domain knowledge on Labour Market Intelligence
  • Preliminary Big Data knowledge
  • Familiarity with basic analytical techniques
  • Basic programming knowledge
Objectives
  • Understand how to collect and store web data on Online Job Vacancies (OJV)
  • Understand data processing techniques
  • Understand the challenges and issues of web data
  • Gain a basic understanding of techniques for classifying data against standard taxonomies, and of advanced techniques for improving taxonomies
Contents
  • Landscaping the online job market
  • OJV data ingestion (e.g.: source selection, ingestion techniques)
  • Overview of web technology (HTML, CSS, JS, XPath, ...)
  • Scraping vs crawling vs search (including URL discovery via surveys, search engines and crowdsourcing)
  • Data extraction via API (HTTP messages, requests and response codes, POST, REST, JSON format, R package 'httr')
  • Data extraction via scraping tools
  • OJV data processing (e.g.: pipeline, vacancy detection, deduplication)
  • Automatic classification of OJV data (e.g.: multi-language environment, feature extraction, classifiers)
  • Text processing and multi-language environment
  • Classification processes, feature extraction and machine learning
  • Focus on occupation categorization
  • Focus on skill categorization
  • Analysis of OJV data with the Big Data Science Workbench tools
Expected Outcome: A sample script that extracts job vacancies and other data from a web source, cleans them and prepares them for analysis
Training Methods
  • Presentations and lectures
  • Exchange of views/experiences on national practices
  • Exercises/DataLab
Required Reading: None
Suggested Reading: None
Required Preparation: None


Mauro Pelucchi

Colombo Ettore


2nd edition

Practical Information
When: 07–11.10.2024
Duration: 5 days
Where: Cologne, Germany
Organiser: Public Sector GmbH
Application: via National Contact Point
Deadline: 12.08.2024

