
Scraping online data: sources, tools and methodologies. Hands-on analysis of Online Job Advertisements 1st edition - 2024

Course Leader: Mauro Pelucchi
Target Group: Official statisticians working on big data methodology, data science, and employment and education statistics, as well as in other statistical domains that can benefit from this data source.
Entry Qualifications
  • Sound command of English. Participants should be able to make short interventions and to actively participate in discussions
  • Domain knowledge of Labour Market Intelligence
  • Preliminary knowledge of Big Data
  • Familiarity with basic analytical techniques
  • Basic programming knowledge
Objective(s)
  • Understand how to collect and store web data on Online Job Vacancies (OJV)
  • Understand data processing techniques
  • Understand the challenges and issues of web data
  • Gain a basic understanding of data classification techniques based on standard taxonomies and of advanced techniques for improving taxonomies
Contents
  • Landscaping the online job market
  • OJV data ingestion (e.g.: source selection, ingestion techniques)
  • Overview of web technology (HTML, CSS, JS, XPath, ...)
  • Scraping vs crawling vs search (including URL discovery via surveys, search engines and crowdsourcing)
  • Data extraction via API (HTTP messages, requests and response codes, POST, REST, JSON format, R package 'httr')
  • Data extraction via scraping tools
  • OJV data processing (e.g.: pipeline, vacancy detection, deduplication)
  • Automatic classification of OJV data (e.g.: multi-language environment, feature extraction, classifiers)
  • Text processing and multi-language environment
  • Classification processes, feature extraction and machine learning
  • Focus on occupation categorization
  • Focus on skill categorization
  • Analysis of OJV data with the Big Data Science Workbench tools
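
To illustrate the deduplication step listed under OJV data processing, here is a minimal Python sketch that flags duplicate postings by hashing a normalised title, employer and location. The field names (`title`, `company`, `location`) and the sample records are hypothetical, and the course may teach a different method; production pipelines usually need fuzzy matching on top of exact fingerprints.

```python
import hashlib
import re


def fingerprint(ad: dict) -> str:
    """Hash the normalised title, company and location of one ad."""
    parts = []
    for key in ("title", "company", "location"):
        text = ad.get(key, "").lower()
        text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
        text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
        parts.append(text)
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(ads: list) -> list:
    """Keep the first ad seen for each fingerprint."""
    seen = set()
    unique = []
    for ad in ads:
        fp = fingerprint(ad)
        if fp not in seen:
            seen.add(fp)
            unique.append(ad)
    return unique


# Hypothetical sample data: the first two records describe the same vacancy.
ads = [
    {"title": "Data Scientist", "company": "Acme Ltd.", "location": "Cologne"},
    {"title": "data  scientist", "company": "ACME Ltd", "location": "Cologne "},
    {"title": "Statistician", "company": "Acme Ltd.", "location": "Cologne"},
]
unique_ads = deduplicate(ads)
print(len(unique_ads))
```

Exact fingerprints only catch postings that differ in case, punctuation or spacing; reposts with reworded titles would need a similarity measure instead.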
Expected Outcome: A sample script that extracts job vacancies and other data from a web source, cleans them and prepares them for analysis
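
A script of this kind could look roughly like the Python sketch below, which uses only the standard library. It is not the course's own script: the HTML snippet, the `job` and `loc` class names and the record fields are all invented for illustration. For a live source, the page would first be fetched with an HTTP client, subject to the site's terms of use.

```python
from html.parser import HTMLParser
import re

# Hypothetical page fragment standing in for a downloaded job board page.
SAMPLE_HTML = """
<div class="job"><h2> Data   Engineer </h2><span class="loc">Cologne, Germany</span></div>
<div class="job"><h2>Statistician</h2><span class="loc">  Luxembourg </span></div>
"""


class JobParser(HTMLParser):
    """Collect the title (<h2>) and location (<span class="loc">) of each ad."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "job":
            self.jobs.append({"title": "", "location": ""})
        elif tag == "h2" and self.jobs:
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "loc" and self.jobs:
            self._field = "location"

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.jobs:
            self.jobs[-1][self._field] += data


def clean(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


parser = JobParser()
parser.feed(SAMPLE_HTML)
parser.close()

# Cleaned records, ready to be loaded into a table for analysis.
records = [{key: clean(value) for key, value in job.items()} for job in parser.jobs]
print(records)
```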
Training Methods
  • Presentations and lectures
  • Exchange of views/experiences on national practices
  • Exercises/DataLab
Required Reading: None
Suggested Reading: None
Required Preparation: None
Trainer(s)/Lecturer(s)
  • Mauro Pelucchi
  • Colombo Ettore

 

1st edition

Practical Information
When: 13–17.05.2024
Duration: 5 days
Where: Cologne, Germany
Organiser: ICON INSTITUTE Public Sector GmbH
Application via National Contact Point. Deadline: 25.03.2024

 

2nd edition

Practical Information
When: 07–11.10.2024
Duration: 5 days
Where: Cologne, Germany
Organiser: ICON INSTITUTE Public Sector GmbH
Application via National Contact Point. Deadline: 12.08.2024