
Issue 17 - Measuring and predicting construction activities using data from online advertisements on internet real estate platforms - update


In issue 7 of this blog, Measuring and Predicting Construction Activities Using Data from Online Advertisements on Internet Real Estate Platforms, we discussed the goal of Use Case 2 in Work Package 3 and outlined our plans for the upcoming project periods. We also provided background on the project's long-term aims: how scraping internet real estate platforms could enrich the portfolio of official statistics with information not yet contained in it and offer early estimates for the real estate market. In this post, we discuss the progress made and the obstacles and challenges encountered so far. Since Use Cases 1 and 2 overlap considerably in the initial project steps of the BREAL model, the previous blog posts 6 and 14 can serve as additional sources of information.

Our first step, "New data sources exploration," led to identifying real estate portals suitable for scraping. The second step, "Programming, Production of Software," allowed us to collect data regularly. Detailed information on how we carried out these tasks can be found in the first interim report. Today, we will focus on the next two steps of the BREAL model: "Data Acquisition and Recording" and "Data Processing."

Since the beginning of 2022 (slightly later for some portals), we have been scraping data from several German and Swedish real estate platforms. The scraping was performed by three offices: the Swedish NSO (SCB) and two regional statistical offices in Germany (HSL and SSO-BBB). While Sweden covered the entire country, the German offices limited their activities to their respective regions: Hesse and the metropolitan area of Berlin/Brandenburg. HSL and SSO-BBB established a permanent data exchange for their scraped data, so that each real estate portal is scraped by only one of the two offices. This reduced the workload of both offices and fostered closer collaboration between them.

Obstacles encountered during data acquisition:

  • Portals changing their layout. Such changes affect more than the visual appearance of a site. More importantly, they typically alter the structure of the source code of a portal's pages, on which data extraction by web scraping relies. Sometimes changes to a site's structure are not even visible to users but still cause scrapers to fail to extract the relevant information (a minimal sketch of how we guard against this follows the list).
  • Missing data. For some advertisements, not all information is available. In particular, missing address information (e.g. street name and house number) makes it difficult to decide whether two advertisements from different portals refer to the same object. Since Use Case 2 focuses on newly constructed buildings, the year of construction of an object and the month of its actual first availability to users are crucial. Whereas the year of construction is available or can be derived consistently in most cases, reliably assigning a new object to a specific month of the year remains a challenge.
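To illustrate the first obstacle, here is a minimal sketch in Python using the BeautifulSoup library. The CSS selectors and field names are purely hypothetical (every portal uses different ones); the point is that a scraper tied to a portal's page structure can be written so that a layout change surfaces as missing values rather than as a silent loss of data:

    from bs4 import BeautifulSoup

    # Hypothetical selectors for one portal's advertisement page; the real
    # selectors differ per portal and break whenever the portal is redesigned.
    SELECTORS = {
        "price": "div.listing-price span.value",
        "living_area": "div.key-facts li.area",
        "construction_year": "div.key-facts li.year",
    }

    def extract_listing(html: str) -> dict:
        """Extract the fields we need from one advertisement page.

        Fields whose selector no longer matches are recorded as None, so a
        layout change shows up as a spike of missing values in monitoring
        instead of silently dropped data.
        """
        soup = BeautifulSoup(html, "html.parser")
        record = {}
        for field, selector in SELECTORS.items():
            node = soup.select_one(selector)
            record[field] = node.get_text(strip=True) if node else None
        return record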

The initial steps of data processing were carried out as soon as the first scraped data became available and primarily involved data cleaning. We quickly realised that the results obtained from different portals varied significantly: while some scrapers (especially those using APIs or JSON-based data) produced well-organised and homogeneous datasets, generating consistent datasets from other portals proved more challenging. During regular use case meetings, we helped each other with various issues and agreed upon the procedures and variables necessary for the subsequent process steps.
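As a simplified illustration of this harmonisation step, the sketch below maps records from two hypothetical portals onto a common schema. The field names, number formats and target variables are invented for this example; the variable list actually agreed within the use case is more extensive:

    # Hypothetical raw records: one portal delivers clean JSON via an API,
    # the other yields loosely formatted strings scraped from HTML.
    record_api = {"preis": 385000, "wohnflaeche": 92.5, "baujahr": 2022}
    record_html = {"price": "385.000 €", "area": "92,5 m²", "year": "2022"}

    def clean_number(value):
        """Turn strings like '385.000 €' or '92,5 m²' into floats."""
        if value is None:
            return None
        if isinstance(value, (int, float)):
            return float(value)
        text = value.replace(".", "").replace(",", ".")
        text = "".join(ch for ch in text if ch.isdigit() or ch == ".")
        return float(text) if text else None

    def to_common_schema(record: dict, mapping: dict) -> dict:
        """Rename portal-specific fields to the agreed common variable names."""
        return {common: clean_number(record.get(original))
                for common, original in mapping.items()}

    common_api = to_common_schema(
        record_api,
        {"price_eur": "preis", "area_sqm": "wohnflaeche", "construction_year": "baujahr"})
    common_html = to_common_schema(
        record_html,
        {"price_eur": "price", "area_sqm": "area", "construction_year": "year"})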

Currently, our work focuses on two tasks: duplicate detection and linking the data to official statistics. Duplicates occur in two ways. First, the same object appears in the data from a single portal over different, often lengthy, periods: because we scrape weekly, active ads overlap considerably, and objects that occur more than once have to be filtered out to avoid overcounting. Removing duplicates in these cases has proven to be relatively straightforward. Second, the same object may be listed on multiple portals at the same time. With the exception of one special case, no dedicated identifier is available, so individual objects must be identified based on their various properties (see the sketch below). This task is closely connected to the second one: linking the results obtained from web scraping to microdata from official statistics. For Germany, it is still being determined whether the law permits linking at the object level or only at an aggregated level, such as postal code areas. These questions are currently being discussed with subject-matter experts and our legal departments.
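The following sketch illustrates the harder, cross-portal case: deciding whether two advertisements without a shared identifier describe the same object by comparing a few of their properties. The attributes and tolerances used here are illustrative only, not the matching rules actually agreed within the use case:

    def probably_same_object(a: dict, b: dict, area_tol: float = 1.0) -> bool:
        """Heuristic cross-portal match based on object properties.

        Two advertisements are treated as the same object if they lie in the
        same postal code area, share the construction year and number of
        rooms, and differ in living area by at most `area_tol` square metres.
        """
        return (a["postal_code"] == b["postal_code"]
                and a["construction_year"] == b["construction_year"]
                and a["rooms"] == b["rooms"]
                and abs(a["area_sqm"] - b["area_sqm"]) <= area_tol)

    # Example: listings scraped from two different portals in the same week.
    ad_portal_1 = {"postal_code": "65189", "construction_year": 2022,
                   "rooms": 3, "area_sqm": 92.5}
    ad_portal_2 = {"postal_code": "65189", "construction_year": 2022,
                   "rooms": 3, "area_sqm": 92.0}

    print(probably_same_object(ad_portal_1, ad_portal_2))  # True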

Some form of linking to official statistics is crucial for the next major step: establishing a model for the early prediction of construction activities. On the basis of the data collected so far, we can build a first model using the scraped data from 2022 and the official statistics published in mid-2023. By the end of 2023, we aim to have a first early prediction based on this model. The model can then be refined with scraped data and official results for subsequent years, starting from mid-2024. Once the model proves reasonably reliable, its results could be published as experimental statistics.
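As a rough sketch of the kind of model we have in mind (the figures below are invented and the actual model specification is still open), the scraped counts of newly advertised, newly built objects per region could be related to the officially reported completions by a simple regression, which would then yield an early prediction for the following year from scraped data alone:

    import numpy as np

    # Hypothetical aggregated counts per region for 2022:
    # x = newly advertised, newly built objects found by web scraping,
    # y = completed dwellings according to official statistics (mid-2023 release).
    scraped_2022 = np.array([120.0, 85.0, 240.0, 60.0, 310.0])
    official_2022 = np.array([150.0, 110.0, 290.0, 70.0, 380.0])

    # Fit a simple linear relationship y ≈ a * x + b by least squares.
    a, b = np.polyfit(scraped_2022, official_2022, deg=1)

    # Early prediction for the following year from scraped counts alone,
    # available well before the official figures are published.
    scraped_2023 = np.array([130.0, 95.0, 220.0, 75.0, 330.0])
    predicted_official_2023 = a * scraped_2023 + b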

Published October 2023