Scraping online data: sources, tools and methodologies. Hands-on analysis of Online Job Advertisements
Course Leader | Mauro Pelucchi |
Target Group | Official statisticians working on big data methodology, data science, and employment and education statistics, as well as in other statistical domains that can benefit from this data source. |
Entry Qualifications | Sound command of English (participants should be able to make short interventions and to participate actively in discussions); domain knowledge of Labour Market Intelligence; preliminary Big Data knowledge; familiarity with basic analytical techniques; basic programming knowledge |
Objective(s) | Understand how to collect and store web data on Online Job Vacancies; understand data processing techniques; understand the challenges and issues of web data; gain a basic understanding of data classification techniques on standard taxonomies and of advanced techniques for taxonomy improvement |
Contents | Landscaping the online job market; Online Job Vacancy (OJV) data ingestion (e.g. source selection, ingestion techniques); overview of web technology (HTML, CSS, JS, XPath, ...); scraping vs crawling vs search (including URL discovery via surveys, search engines and crowdsourcing); data extraction via API (HTTP messages, requests and response codes, POST, REST, JSON format, R package 'httr'); data extraction via scraping tools; OJV data processing (e.g. pipeline, vacancy detection, deduplication); automatic classification of OJV data (e.g. multi-language environment, feature extraction, classifiers); text processing and multi-language environment; classification processes, feature extraction and machine learning; focus on occupation categorization; focus on skill categorization; analysis of OJV data with the Big Data Science Workbench tools |
Expected Outcome | A sample script that extracts job vacancies and other data from a web source, cleans them and prepares them for analysis |
Training Methods | Presentations and lectures; exchange of views/experiences on national practices; exercises/DataLab |
Required Reading | None |
Suggested Reading | None |
Required Preparation | None |
Trainer(s)/Lecturer(s) | Mauro Pelucchi; Colombo Ettore |
Practical Information
When | Duration | Where | Organiser | APPLICATION VIA National Contact Point |
13–17.05.2024 | 5 days | Cologne, Germany | ICON INSTITUTE Public Sector GmbH | Deadline: 25.03.2024 |
07–11.10.2024 | 5 days | Cologne, Germany | ICON INSTITUTE Public Sector GmbH | Deadline: 12.08.2024 |
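
As a rough illustration of the kind of script named under Expected Outcome, the sketch below cleans and deduplicates a handful of job postings. It is not course material: the sample records, the field names (`title`, `company`, `location`, `description`) and the helper functions are all hypothetical, and the deduplication rule (same title, company and location after normalisation) is a simple assumption chosen for brevity. A real exercise would obtain the postings from a portal's API or from scraped pages rather than from an inline string.

```python
import html
import json
import re

# Hypothetical sample standing in for a JSON response from a job portal.
RAW_RESPONSE = """
[
  {"title": "  Data   Scientist ", "company": "Acme GmbH",
   "location": "Cologne", "description": "<p>Python &amp; SQL</p>"},
  {"title": "data scientist", "company": "ACME GmbH",
   "location": "cologne", "description": "<p>Python &amp; SQL</p>"},
  {"title": "Statistician", "company": "Example AG",
   "location": "Bonn", "description": "Survey <b>methodology</b>"}
]
"""


def clean_text(value):
    """Strip HTML tags, decode entities and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", value)
    decoded = html.unescape(without_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def dedup_key(posting):
    """Case-insensitive key; assumes these three fields identify a vacancy."""
    return tuple(posting[f].casefold() for f in ("title", "company", "location"))


def clean_and_deduplicate(raw_json):
    """Parse postings, clean every field and keep the first of each duplicate."""
    seen = set()
    result = []
    for posting in json.loads(raw_json):
        cleaned = {field: clean_text(str(value)) for field, value in posting.items()}
        key = dedup_key(cleaned)
        if key in seen:
            continue  # same vacancy already kept
        seen.add(key)
        result.append(cleaned)
    return result


vacancies = clean_and_deduplicate(RAW_RESPONSE)
for vacancy in vacancies:
    print(vacancy["title"], "|", vacancy["company"], "|", vacancy["description"])
```

Exact-key matching of this kind only catches trivially repeated postings; near-duplicates with reworded titles or descriptions would need a similarity-based approach.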