Issue 07 - Measuring and predicting construction activities using online data

Measuring and Predicting Construction Activities Using Data from Online Advertisements on Internet Real Estate Platforms 

Nowadays, most of the real estate market takes place online: people looking for a house or an apartment to buy or rent typically search on one or more websites. On many online platforms, owners or real estate agencies advertise accommodation for rent or sale, describing in more or less detail its location, price, size, number of rooms, layout, and facilities.

Since data from online real estate platforms can be collected and processed automatically and regularly by scraping technologies, the next step is to investigate whether these data can also be used to measure activity in the construction industry. The question is whether predictions of the volume and recent trends of construction activities, say completed constructions in a given year (and region), can be published much earlier than the official statistics.

The overall aim of Use Case 2 (UC2) of Work Package 3 (WP3) is to gather these online ads and use them to predict construction completions, including objects that are not advertised online and objects that are not for rent or sale at all (e.g., self-owned newly constructed buildings). The result will be an experimental statistic on the yearly number of newly constructed buildings and apartments.

Within the European Statistical System (ESS), official statistics on construction completions are typically published yearly (e.g., "in 2021, # buildings/apartments were newly constructed"). These numbers on the previous year's constructions are published in the middle of the following year. Using data from online real estate platforms, an experimental statistic should provide reliable numbers as early as the end of the same year.

How do we get the early experimental statistic? Online ads are collected continuously (say weekly) by web scraping. After one year of scraping, the advertisements can be linked at the microdata level to official statistics data. The next step is to build a model that predicts the official statistic; this involves microdata linkage between the several online platforms and strategies to correct for undercount in the platform data. With a second year of scraped advertisements, the model can then predict that year's numbers much earlier than the official statistic is published.
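The calibrate-then-predict idea can be illustrated with a deliberately simple ratio model. This is only a sketch under stated assumptions, not the model used in UC2: the function names `fit_ratio` and `predict_completions` are hypothetical, and the post does not say what form the actual model takes. The ratio is estimated in the calibration year, when both the official figure and the scraped ad counts are known, and then applied to the ad count of the following year.

```python
def fit_ratio(official_completions: int, advertised_objects: int) -> float:
    """Estimate how many official completions correspond to one advertised object.

    Hypothetical helper: uses a single calibration year in which both the
    official statistic and the count of distinct advertised objects are known.
    """
    if advertised_objects <= 0:
        raise ValueError("advertised_objects must be positive")
    return official_completions / advertised_objects


def predict_completions(ratio: float, advertised_objects: int) -> int:
    """Apply the calibrated ratio to a new year's count of advertised objects."""
    if advertised_objects < 0:
        raise ValueError("advertised_objects must not be negative")
    return round(ratio * advertised_objects)


if __name__ == "__main__":
    # Purely illustrative numbers, not real data.
    ratio = fit_ratio(official_completions=1200, advertised_objects=800)
    print(predict_completions(ratio, advertised_objects=900))
```

A ratio greater than 1 reflects the undercount in the ads: some new constructions never appear online, so each advertised object stands for more than one completed object. A real model would likely also use region, object type, and timing of the ads.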

What are the benefits of the new data source? The main advantage is its early availability compared to official statistics: buildings and apartments are often advertised online even before they are built. Second, the data can be collected and processed relatively easily using modern web scraping technologies, without burdening respondents or the platforms' servers. Third, real estate advertisements contain a lot of information of interest to potential buyers or tenants, such as facilities or costs, which goes beyond what is gathered for the official statistic on new constructions. Used alone or alongside existing sources, this data lets official statistics offer its users an additional perspective.

Are there drawbacks? Yes, as with all data sources and methods, there are some disadvantages and challenges. First, there is some raw undercount, since not all newly constructed buildings and apartments can be expected to appear on online real estate platforms. For example, some are never advertised because the building is not for rent or sale. However, this is only a statistical problem that standard statistical techniques can solve (e.g., by applying a correction factor). Second, speaking of "a new data source" obscures the fact that not one but several online real estate platforms are scraped regularly. Advertisements for the same object appearing on several platforms (at the same or different times) must therefore be identified to avoid overcounting. Again, this is only a minor issue. Third, to build a model that predicts completed constructions from online ads, the advertised buildings must additionally be linked to the new constructions in the official statistic. Since not all ads contain the full address, this essential linkage step reduces the data available to the model, and a smaller sample typically increases a model's uncertainty. However, this is something statisticians are used to and can deal with. Finally, one question to answer is whether all platforms and the official statistics share the same concepts and definitions needed to predict the official statistic from online data. Even where the concepts and definitions are not identical, they are expected to correspond in a constant way, so a stable relationship exists and a model can be established.
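Identifying the same object across platforms can be sketched as building a comparison key from normalised ad attributes. This is a minimal illustration, not the project's linkage method: the `Ad` fields, the normalisation rules, and the choice of key (address, rounded size, number of rooms) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    """Hypothetical minimal ad record; real ads carry many more attributes."""
    platform: str
    address: str
    size_m2: float
    rooms: int


def normalise_address(address: str) -> str:
    """Lower-case, drop commas and full stops, and collapse whitespace."""
    cleaned = address.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def dedup_key(ad: Ad) -> tuple:
    """Assumed matching key: normalised address, size rounded to 1 m2, rooms."""
    return (normalise_address(ad.address), round(ad.size_m2), ad.rooms)


def count_distinct_objects(ads: list) -> int:
    """Count advertised objects, treating ads with the same key as one object."""
    return len({dedup_key(ad) for ad in ads})


if __name__ == "__main__":
    # Made-up example ads, not scraped data.
    ads = [
        Ad("platform_a", "Main St. 5, Vienna", 75.2, 3),
        Ad("platform_b", "main st 5  vienna", 74.9, 3),  # same flat, second platform
        Ad("platform_a", "Main St. 7, Vienna", 60.0, 2),
    ]
    print(count_distinct_objects(ads))
```

Exact-key matching like this misses ads whose addresses are incomplete or spelled differently, which is the limitation described above; probabilistic record linkage is the usual way to handle such cases.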

Scraping of online advertisements started in early 2022. Once the official numbers on construction completions for 2022 are published in mid-2023, the model for early prediction can be established. After another year of scraping, results for 2023 can then be predicted and published at the end of 2023 instead of mid-2024.

Published 13 October 2022