
Issue 25 - Measuring Construction Activities Using Data from Online Advertisements on Internet Real Estate Platforms: Insights in Data Quality Discussions

Issues 7 and 17 of this blog described the overall aim, progress, and obstacles to measuring construction activities using data from online advertisements on internet real estate platforms.

Advertisements have been gathered since early 2022 by scraping several platforms regularly (typically weekly or at least monthly) or using data directly provided by platform owners (contracting). After more than two years of gathering, cleaning, combining, and analysing data, it is time to discuss some results, give insights into our data quality discussion and prepare conclusions!

 

First, the data sources proved relatively stable over an extended period. However, there were some minor changes to the structure of the websites, and the scrapers had to be adapted accordingly. As a result, the data contain a few gaps for some platforms and a few months. Overall, web scraping has worked well as a technique for gathering data so far. Data provided directly by platform owners, of course, is complete and does not show these gaps.
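Gaps of this kind can be located with a simple check on the scraping log. The following is a minimal sketch, not the procedure used in the Use Case: the function name `find_gaps`, the weekly interval and the example dates are assumptions made for illustration.

```python
from datetime import date


def find_gaps(scrape_dates, max_interval_days=7):
    """Return (previous, current) pairs of scrape dates that lie further
    apart than the expected scraping interval."""
    ordered = sorted(set(scrape_dates))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if (current - previous).days > max_interval_days:
            gaps.append((previous, current))
    return gaps


# Hypothetical scrape log for one platform: three weekly scrapes are missing
# between 13 March and 10 April.
log = [date(2023, 3, 6), date(2023, 3, 13), date(2023, 4, 10), date(2023, 4, 17)]
print(find_gaps(log))
```

Each reported pair marks a period in which advertisements that were online only between the two dates cannot appear in the data.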

 

Second, a considerable number of advertisements can be gathered from each platform. As described in earlier blog articles, we cover several platforms: larger and smaller ones, platforms covering only specific regions or the whole country, and platforms specialising in advertising complete construction projects, which typically consist of up to dozens of more or less similar apartments. On average, we covered hundreds of objects each month. How large the number of scraped advertisements is compared to the target population of all newly constructed houses and apartments remains to be investigated in further analysis.

 

Still, additional questions about data quality need to be investigated and discussed. Undercoverage and overcoverage are of major concern in this Use Case.

 

Undercoverage describes the situation in which an apartment or house that is part of the target population does not appear in any of the data sources. This is the case, for example, when a newly constructed apartment or house is not advertised online for sale or rent at the moment of scraping (the advertisement may show up later or never). Often, these are objects built by their owners for their own use. No platform will cover these objects at that point in time, as there is no need to advertise the new building. If, however, there is a stable relationship between the number of advertised objects and the number of objects not advertised, it is still possible to infer the number of newly constructed objects that are not advertised.
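The inference from a stable relationship can be illustrated with a small worked example. This is only a sketch under stated assumptions: it presumes a reference period in which both the number of advertised objects and the total number of newly constructed objects are known, and that the advertised share stays the same afterwards. The function name and all numbers are hypothetical and are not results from the Use Case.

```python
def estimate_not_advertised(advertised_now, advertised_ref, total_ref):
    """Estimate the number of new objects that are not advertised.

    advertised_ref / total_ref is the advertised share observed in a
    reference period; it is assumed to stay stable over time.
    """
    if advertised_ref <= 0 or total_ref < advertised_ref:
        raise ValueError("need 0 < advertised_ref <= total_ref")
    share_advertised = advertised_ref / total_ref
    estimated_total = advertised_now / share_advertised
    return estimated_total - advertised_now


# Illustrative numbers only: 750 of 1000 objects were advertised in the
# reference period, and 300 advertised objects are observed now.
print(estimate_not_advertised(advertised_now=300, advertised_ref=750, total_ref=1000))
```

The estimate is only as good as the stability assumption; if the advertised share differs between regions or over time, the share would have to be estimated separately for each group.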

 

Another form of undercoverage is related to the way and timing of data collection. We scrape most platforms at fixed intervals, say weekly. When an object of our target population is advertised only for a short period, less than one week, we may miss the advertisement for this new apartment or house. This might be the case in areas with high demand for housing units, as many of us and our readers can confirm from our own experience. Experiments have been run to measure this specific form of undercoverage.
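A rough feeling for the size of this effect can be obtained from a simple model. The sketch below is not the experiment mentioned above; it assumes that scrapes happen at exactly fixed intervals and that the time at which an advertisement goes online is uniformly distributed relative to the scraping schedule. Under these assumptions, an advertisement that is online for d days is captured by a scrape every T days with probability min(1, d/T). The function name is hypothetical.

```python
def capture_probability(days_online, scrape_interval_days):
    """Probability that an advertisement is seen by at least one scrape.

    Assumes scrapes at fixed intervals and a publication time that is
    uniformly distributed relative to the scraping schedule.
    """
    if days_online < 0 or scrape_interval_days <= 0:
        raise ValueError("need days_online >= 0 and scrape_interval_days > 0")
    return min(1.0, days_online / scrape_interval_days)


# Weekly scraping: short-lived advertisements are missed more often.
for days in (1, 3, 7, 14):
    p = capture_probability(days, scrape_interval_days=7)
    print(f"online for {days:>2} days: captured with probability {p:.2f}")
```

Under this model, shortening the scraping interval is the only way to raise the capture probability for advertisements that are online for just a few days.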

Overcoverage, on the other hand, means that the very same apartment or house occurs in the data twice or even several times when it should occur only once. This can happen if an object is listed multiple times in one data source, or if the same object appears in several data sources and therefore occurs more than once after the sources are combined. Some of these duplicates are very hard to detect. In larger, newly constructed buildings (which are often part of larger construction projects), many apartments are advertised at the same time, and typically many of these advertisements look very similar, or even identical, regarding price, size, location, and address. Within and especially between platforms, it is very hard to determine whether two advertisements with very similar or identical characteristics refer to the same object.

For example, 32 apartments within a larger housing project are advertised at a given address (in the same city, with identical postal code, street name and house number). Individual apartments may be differentiated by floor and number of rooms. But even on the 2nd floor, there may be more than one apartment of a given size at a given rent. Let’s imagine that these 32 advertisements stem from one platform. There may be good reason to assume that all of them refer to separate apartments. After combining data from several platforms, however, 34 advertisements have exactly the same address details. How should the two additional advertisements, again with similar or identical information, be treated?
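Even when individual duplicates cannot be identified, the number of distinct objects can at least be bounded. The following sketch is one possible way to do this, not the method used in the Use Case. It assumes that a single platform does not list the same object twice; then, for each matching key, the largest count on any one platform is a lower bound and the total number of advertisements is an upper bound. The function name, platform labels and address are hypothetical.

```python
from collections import Counter


def distinct_object_bounds(ads):
    """Return (lower, upper) bounds on the number of distinct objects.

    ads: iterable of (platform, key) pairs, where key holds the
    characteristics used for matching (here: the address).
    Assumes there are no duplicates within a single platform.
    """
    per_platform = Counter(ads)  # (platform, key) -> number of ads
    max_per_key = {}
    for (platform, key), count in per_platform.items():
        max_per_key[key] = max(max_per_key.get(key, 0), count)
    lower = sum(max_per_key.values())   # every cross-platform match is a duplicate
    upper = sum(per_platform.values())  # no advertisement is a duplicate
    return lower, upper


# The situation from the text: 32 ads on one platform, 2 more on another.
ads = [("platform_a", "Example Street 1")] * 32 + [("platform_b", "Example Street 1")] * 2
print(distinct_object_bounds(ads))
```

For the example above, the bounds are 32 and 34: the two additional advertisements either both duplicate objects already seen, or both refer to apartments that the first platform does not list.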

 

Identifying real duplicates without treating distinct objects as duplicates is a tough problem. It becomes even more challenging when measurement errors or missing variables are taken into account. There may be very good reasons to treat advertisements from one or several sources as duplicates when their key variables are identical (same city, postal code, street name, house number, floor number, number of rooms, size in square meters, price to rent or buy). But what if the address information is incomplete (no street name and house number)? Or if one or more variables are missing, or are not identical but only very similar (1150€ vs. 1250€; 81m² vs. 82m²)? Concepts may also differ between platforms, for example how prices are stated or how sizes are calculated and presented, which can lead to key variables that are similar but not identical for the same object.
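One way to make these questions concrete is a matching rule with explicit tolerances and an explicit "undecided" outcome. The sketch below is an illustration only: the field names, the 10% price tolerance and the 2 m² size tolerance are arbitrary assumptions, not values used in the Use Case.

```python
def likely_duplicates(ad_a, ad_b, price_tol=0.10, size_tol=2.0):
    """Compare two advertisements given as dictionaries.

    Returns True (likely duplicates), False (distinct objects) or
    None (undecidable because a key variable is missing).
    """
    exact_keys = ("city", "postal_code", "street", "house_number", "floor", "rooms")
    for key in exact_keys:
        a, b = ad_a.get(key), ad_b.get(key)
        if a is None or b is None:
            return None  # incomplete address or object information
        if a != b:
            return False
    price_a, price_b = ad_a.get("price"), ad_b.get("price")
    size_a, size_b = ad_a.get("size_m2"), ad_b.get("size_m2")
    if None in (price_a, price_b, size_a, size_b):
        return None
    # Relative tolerance for prices, absolute tolerance for sizes.
    price_close = abs(price_a - price_b) <= price_tol * max(price_a, price_b)
    size_close = abs(size_a - size_b) <= size_tol
    return price_close and size_close


base = {"city": "Example City", "postal_code": "00000", "street": "Example Street",
        "house_number": "1", "floor": 2, "rooms": 3, "price": 1150, "size_m2": 81}
similar = dict(base, price=1250, size_m2=82)
no_street = dict(base, street=None)
print(likely_duplicates(base, similar), likely_duplicates(base, no_street))
```

With these tolerances, the pair from the text (1150€ vs. 1250€; 81m² vs. 82m²) is flagged as a likely duplicate, while the advertisement without a street name is left undecided. Whether such a pair really is a duplicate, a corrected advertisement, or a second apartment remains a judgment that the tolerances only encode, not resolve.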

On the other hand, an error in one advertisement for an object may be the reason for an additional – corrected – advertisement for the same object, which may or may not appear as a likely duplicate. Comparing platform data with official statistics on construction permits or completions could help solve this puzzle. However, addresses may not be stored after official statistics on construction permits and completions have been produced. This makes a comparison of official data and scraped data difficult, as addresses are a key variable for this linkage, too.

 

At the aggregate level, we compared the number of advertisements for apartments in newly constructed buildings in Hesse in 2023, taken from one of the largest platforms for real estate advertisements in Germany, with the number of apartments in newly constructed buildings reported by official statistics. The platform covers about 70% of the official figure. However, at a lower aggregation level, say NUTS 3 (‘Landkreise’ in Hesse), coverage varies heavily between urban and rural regions. For the overall goal of an early estimate of construction activities, this implies that such an estimate may be less accurate in rural areas, which have both fewer advertisements and lower coverage.
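The coverage figures discussed here are simple ratios of advertised apartments to officially reported apartments per region. A minimal sketch of the calculation follows; the function name, region labels and counts are invented for illustration and are not results from the Use Case.

```python
def coverage_by_region(ad_counts, official_counts):
    """Return the ratio of advertised to officially reported apartments
    for each region with a positive official count."""
    coverage = {}
    for region, official in official_counts.items():
        if official > 0:
            coverage[region] = ad_counts.get(region, 0) / official
    return coverage


# Invented counts for two hypothetical regions.
ads = {"urban_region": 90, "rural_region": 10}
official = {"urban_region": 100, "rural_region": 40}
for region, share in coverage_by_region(ads, official).items():
    print(f"{region}: {share:.0%}")
```

Regions with small official counts deserve extra care, as a handful of missed or duplicated advertisements changes their coverage ratio considerably.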

 

In summary, scraping advertisements from real estate platforms is an effective method for data collection, but it requires quick action when website structures change to prevent data gaps. One of the most persistent challenges is determining whether advertisements are duplicates or refer to distinct objects, a task that still lacks a one-size-fits-all solution. Compared with official statistics for Hesse in 2023, the coverage of one of the largest platforms is about 70%. At NUTS level 3, a clear urban-rural divide emerges: urban areas are considerably better represented, which suggests that rural regions may be underrepresented in the scraped data. These challenges will be discussed in more detail in the final report of the Use Case.

 

Blog published September 2024
