Scraping online data: sources, tools and methodologies. Hands-on analysis of Online Job Advertisements | |
Course Leader | Mauro Pelucchi |
Target Group | Official statisticians working on big data methodology, data science, and employment and education statistics, as well as in other statistical domains that can benefit from this data source. |
Entry Qualifications | - Sound command of English. Participants should be able to make short interventions and to actively participate in discussions
- Domain knowledge of Labour Market Intelligence
- Basic knowledge of Big Data
- Familiarity with basic analytical techniques
- Familiarity with basic programming
|
Objective(s) | - Understand how to collect and store web data on Online Job Vacancies (OJV)
- Understand data processing techniques
- Understand the challenges and issues of working with web data
- Gain a basic understanding of data classification techniques based on standard taxonomies, and of advanced techniques for improving taxonomies
|
Contents | - Landscaping the online job market
- OJV data ingestion (e.g. source selection, ingestion techniques)
  - Overview of web technologies (HTML, CSS, JS, XPath, ...)
  - Scraping vs. crawling vs. search (including URL discovery via surveys, search engines and crowdsourcing)
  - Data extraction via API (HTTP messages, requests and response codes, POST, REST, JSON format, R package 'httr')
  - Data extraction via scraping tools
- OJV data processing (e.g. pipeline, vacancy detection, deduplication)
- Automatic classification of OJV data (e.g. multi-language environment, feature extraction, classifiers)
  - Text processing and multi-language environment
  - Classification processes, feature extraction and machine learning
  - Focus on occupation categorization
  - Focus on skill categorization
- Analysis of OJV data with the Big Data Science Workbench tools
|
Expected Outcome | A sample script that extracts job vacancies and other data from a web source, cleans them, and prepares them for analysis |
Training Methods | - Presentations and lectures
- Exchange of views/experiences on national practices
- Exercises/DataLab
|
Required Reading | None |
Suggested Reading | None |
Required Preparation | None |
Trainer(s)/Lecturer(s) | Mauro Pelucchi, Colombo Ettore |