Scraping online data: sources, tools and methodologies. Hands-on analysis of Online Job Advertisements
Course Leader
Mauro Pelucchi
Target Group
Official statisticians working on big data methodology, data science, and employment and education statistics, as well as in other statistical domains that can benefit from this data source.
Entry Qualifications
Sound command of English. Participants should be able to make short interventions and participate actively in discussions.
Domain knowledge of Labour Market Intelligence
Basic knowledge of Big Data
Familiarity with basic analytical techniques
Familiarity with basic programming
Objective(s)
Understand how to collect and store web data on Online Job Vacancies (OJV)
Understand data processing techniques
Understand the challenges and issues of working with web data
Gain a basic understanding of data classification techniques based on standard taxonomies, and of advanced techniques for improving taxonomies
Contents
Landscaping the online job market
OJV data ingestion (e.g.: source selection, ingestion techniques)
Overview of web technology (HTML, CSS, JS, XPath, ...);
Scraping vs Crawling vs Search (including URL discovery via surveys, search engines and crowdsourcing);
Data extraction via API (HTTP messages, requests and response codes, POST, REST, JSON format, R package 'httr');
Data extraction via scraping tools;
OJV data processing (e.g.: pipeline, vacancy detection, deduplication)
Automatic classification of OJV data (e.g.: multi-language environment, feature extraction, classifiers)
Text processing and multi-language environment
Classification processes, feature extraction and machine learning
Focus on occupation categorization
Focus on skill categorization
Analysis of OJV data with the Big Data Science Workbench tools
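As an illustration of the "Data extraction via API" topic above, the following is a minimal sketch in Python using only the standard library (the course itself lists the R package 'httr'; Python is used here purely as an assumption for illustration). The endpoint URL, the query parameters and the JSON field names are hypothetical. The request is built but never sent, and the response body is a hard-coded sample.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical REST endpoint, used only for illustration.
BASE_URL = "https://example.org/api/jobs"


def build_request(keyword, page=1):
    """Build (but do not send) a GET request asking for JSON results."""
    query = urlencode({"q": keyword, "page": page})
    return Request(
        f"{BASE_URL}?{query}",
        headers={"Accept": "application/json"},
        method="GET",
    )


def parse_response(status, body):
    """Check the HTTP status code, then extract vacancies from the JSON body."""
    if status != 200:
        raise ValueError(f"unexpected HTTP status {status}")
    payload = json.loads(body)
    # The "results", "id" and "title" field names are assumptions.
    return [
        {"id": item["id"], "title": item["title"].strip()}
        for item in payload["results"]
    ]


# Hard-coded sample standing in for a real API response.
SAMPLE_BODY = (
    '{"results": [{"id": 1, "title": " Data Analyst "},'
    ' {"id": 2, "title": "Nurse"}]}'
)

request = build_request("data analyst")
vacancies = parse_response(200, SAMPLE_BODY)
```

In a real pipeline the request would be sent with an HTTP client, and the status code and body would come from the server's response rather than from a sample string.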
Expected Outcome
Sample script that extracts job vacancies and other data from a web source, cleans them and prepares them for analysis
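A script of this kind might look roughly like the sketch below. It is not the course's actual material: the HTML snippet, its tag and class names, and the record fields are all hypothetical, and Python's standard-library HTML parser stands in for whichever scraping tool the course uses. The sketch extracts vacancies from a page, normalises whitespace, and removes exact duplicates.

```python
from html.parser import HTMLParser

# Hypothetical listing page; a real script would download it over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="job"><h2>  Data   Engineer </h2><span class="loc">Milan</span></li>
  <li class="job"><h2>Data Engineer</h2><span class="loc">Milan </span></li>
  <li class="job"><h2>Statistician</h2><span class="loc">Rome</span></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collect the title (<h2>) and location (<span class="loc">) of each job."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "job":
            self.jobs.append({"title": "", "location": ""})
        elif tag == "h2" and self.jobs:
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "loc" and self.jobs:
            self._field = "location"

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.jobs[-1][self._field] += data


def clean(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def extract_jobs(html):
    """Extract, clean and deduplicate vacancies from an HTML page."""
    parser = JobParser()
    parser.feed(html)
    seen = set()
    result = []
    for job in parser.jobs:
        record = {key: clean(value) for key, value in job.items()}
        # Case-insensitive exact match on title and location.
        fingerprint = (record["title"].lower(), record["location"].lower())
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(record)
    return result


jobs = extract_jobs(SAMPLE_HTML)
```

The deduplication here only catches postings that become identical after cleaning; near-duplicates with reworded titles would need a similarity-based approach.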
Training Methods
Presentations and lectures
Exchange of views/experiences on national practices