
Issue 06 - Exploring potential new data source - real estate


In the previous blog entry, Issue 3 - Exploring Potential New Data Sources, we briefly described the use cases carried out under Work Package 3 (WP3). In this entry, we focus on the use case that aims to produce aggregated statistics on the real estate market based on online real estate advertisements.

In addition to monitoring the dynamics of the number of real estate offers for sale and rent, we intend to experiment with using web data in other areas of official statistics, such as constructing the Housing Rental Index (also used to calculate the HICP) or augmenting real estate market studies with new indicators.

The use case involves six partners, national statistical institutes (NSIs) from Bulgaria, Finland, France, Germany and Poland, that obtain data either by scraping or based on agreements with web portal owners. Despite the different data sources, all partners follow the same steps of the BREAL model:

  1. New data sources exploration
  2. Programming, production of software
  3. Data acquisition and recording
  4. Data processing 
  5. Modelling and interpretation
  6. Dissemination of the experimental statistics and results.

New data sources exploration

During the first year of work, we completed the first two steps of the BREAL model. We began with data source exploration, i.e., examining the sources, their availability and usefulness. We started by identifying the initial scope of information to be obtained by all the partners and conducted a review of publicly available websites presenting apartments for sale or rent. This resulted in the selection of several sources for each country to be assessed further. In the case of France and Finland, previous attempts to obtain the data showed that the portals presenting the largest number of offers blocked automated access. Therefore, these NSIs established direct collaborations with the data providers (based on contracts signed for selected periods). Germany also gathers data from one of the portals based on such an agreement.

To assess the usefulness of the data sources, we prepared a set of detailed criteria (see below) related to website availability, timeliness of data and technical aspects of site responses. Based on this assessment, a final list of portals was created for each partner (13 portals altogether).

Software production and data acquisition

For each portal to be scraped, dedicated software was prepared. For some data sources, we decided to use software developed by the Bulgarian NSI in Scrapy. It acquires data from websites using previously indicated selectors and thus provides a more generic way of acquiring the data. Currently, the dedicated scrapers collect data monthly and require constant monitoring for changes in the websites' structure. So far, we have encountered a few changes in the way information is presented on the websites, which required modifying some of the programs. We are considering a more flexible approach that would help avoid such problems.
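To give an idea of the selector-driven approach, a minimal Scrapy spider might look like the sketch below. The portal URL, CSS selectors and field names are hypothetical placeholders, not those of the actual WIN scrapers.

```python
import scrapy


class OffersSpider(scrapy.Spider):
    """Minimal sketch of a selector-driven real estate spider."""

    name = "offers"
    # Hypothetical portal; each real scraper targets a specific website.
    start_urls = ["https://portal.example/listings?page=1"]

    def parse(self, response):
        # One item per advertisement, extracted with pre-defined CSS selectors.
        for ad in response.css("article.listing"):
            yield {
                "price": ad.css("span.price::text").get(),
                "surface": ad.css("span.surface::text").get(),
                "rooms": ad.css("span.rooms::text").get(),
            }
        # Follow pagination until the portal runs out of result pages.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

When a portal changes its layout, only the selectors need updating, which is why this selector-driven style is easier to maintain than hard-coded parsing logic.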

Following the rules of netiquette, we adhere to the limitations indicated in the robots.txt files. The programs pause between requests so as not to overload the portals' servers.
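In Scrapy, these constraints can be enforced through standard settings; the values below are illustrative, not the exact ones used in production:

```python
# settings.py -- illustrative politeness settings
ROBOTSTXT_OBEY = True               # respect disallow rules in robots.txt
DOWNLOAD_DELAY = 5                  # pause (in seconds) between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit the same portal in parallel
AUTOTHROTTLE_ENABLED = True         # back off automatically if responses slow down
```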

More widely available access via APIs would be a great way to gather the data, along with agreements with data providers allowing direct data transfer in an adequate format. It would certainly be less complicated and more automated. Such a solution would be less exposed to changes in website structure and less dependent on NSI staff creating and modifying scrapers. However, obtaining data directly from the administrator also has its constraints, such as lengthy negotiations and uncertainty about the sustainability of the collaboration, to mention a few. The use case partners that use this model of data acquisition have received data only for an indicated period; receiving data in the future requires further negotiations. They therefore risk a break in the time series and, potentially, no alternative way of collecting the data.
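For comparison, where a provider exposes an API, acquisition typically reduces to a few lines of code; the endpoint, parameters and token below are entirely hypothetical:

```python
import requests

# Hypothetical endpoint and credentials; real providers define their own contracts.
response = requests.get(
    "https://api.portal.example/v1/offers",
    params={"market": "rent", "month": "2022-04", "page": 1},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
response.raise_for_status()
offers = response.json()  # structured records instead of HTML to be parsed
```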

Further steps

A significant challenge the use case partners face is processing the data in a way that allows comparison between countries. Due to the different data sources and different NSIs' needs, we decided to create a common framework of mandatory variables to derive, which includes (a sketch of their computation follows the list):

  • number of offers 
  • average price per m2
  • share of offers by price per m2 classes 
  • average surface in m2 
  • share of offers by surface classes
  • average number of rooms 
  • share of offers by number of rooms
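Assuming the cleaned offers are available as a table with price, surface and rooms columns (a simplification of the actual datasets, with hypothetical class boundaries), the indicators could be derived along these lines:

```python
import pandas as pd

# Hypothetical cleaned dataset: one row per advertisement.
offers = pd.DataFrame({
    "price": [250_000.0, 310_000.0, 180_000.0],
    "surface": [50.0, 64.0, 38.0],
    "rooms": [2, 3, 1],
})
offers["price_per_m2"] = offers["price"] / offers["surface"]

indicators = {
    "number_of_offers": len(offers),
    "avg_price_per_m2": offers["price_per_m2"].mean(),
    "avg_surface_m2": offers["surface"].mean(),
    "avg_rooms": offers["rooms"].mean(),
    # Shares by class, here with assumed surface classes in m2.
    "share_by_surface_class": (
        pd.cut(offers["surface"], bins=[0, 40, 60, 80, float("inf")])
        .value_counts(normalize=True)
        .sort_index()
    ),
}
```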

Up until now, we have been working with the scrapers and with data acquired since April 2022 at monthly intervals (with some data gathered from the data providers referring to particular historical periods). At the moment, we are focusing on establishing standardized data cleaning rules. Also, this year we plan to assess the sources again to check whether they are still in line with the rules and criteria previously adopted.
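As an illustration of what such rules might cover (the actual rules are still being agreed among the partners, and the column names and thresholds below are assumptions), a cleaning step could deduplicate re-posted advertisements and drop implausible values:

```python
import pandas as pd

def clean_offers(offers: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning rules; thresholds are assumptions, not agreed standards."""
    out = offers.drop_duplicates(subset=["ad_id"])  # same ad posted repeatedly
    out = out.dropna(subset=["price", "surface"])   # critical variables must be present
    out = out[out["surface"].between(5, 1000)]      # plausible dwelling surface in m2
    out = out[out["price"] > 0]                     # non-positive prices are errors
    return out
```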

If you want to know more about our journey, the problems encountered, and the results we have achieved so far, please follow our blog and do not hesitate to contact us with any questions. You can reach us via email at:

Dominik Dabrowski, D.Dabrowski@stat.gov.pl

Klaudia Peszat, K.Peszat@stat.gov.pl

Detailed criteria of the assessment sheet

  • Captcha – whether a web source uses a captcha or not
  • Robots blocking – whether a web source blocks robots or not
  • JavaScript – whether a web source uses JavaScript or not
  • List of pages – whether the web source has a list of pages with pagination
  • Filter criteria – whether a web source offers content filtering functionality relevant to the use case
  • GET HTTP method – whether a web source uses the GET method for HTTP requests
  • Up-to-date content – whether the web source had new user content published in the last month
  • Number of ads > X – whether the number of ads on the web source is larger than X, e.g., 100, 1,000, 10,000, etc.
  • Structured description – whether the web source presents the content in a structured way or just as plain text
  • HTML Microdata – whether the web source uses HTML Microdata (https://www.w3.org/TR/microdata/)
  • Description schema – whether the web source uses a description schema (https://schema.org/IndividualProduct)
  • HTML code changed every X – frequency of HTML code changes, e.g., 1 year, 2 years, 3 years, etc.
  • Specific time period filter – whether a web source allows scraping content published during a specific time period selected via the content filter
  • Scraping of yesterday's publications – whether a web source allows scraping only the content published yesterday
  • Multilanguage – whether a web source has an option to change language and currency
  • Ratings – whether a web source has an option to rate an offer and leave a comment
  • Cookies and tracking – whether a web source forces the user to accept cookies and tracking
  • Aggregator – whether a web source displays information gathered from many portals
  • Dynamic class tags – whether the web source's code is generated automatically
  • Terms of use – whether the web source's terms of use allow web scraping
  • robots.txt – whether a web source lists relevant pages as disallowed in robots.txt
  • Offers API – whether the website offers an API
  • CDN – whether access is blocked by content delivery network services (like Cloudflare)
  • File extension – whether a URL points to a file extension that does not contain a renderable website (e.g., .xlsx, .docx)
  • HTTP error – whether a URL returns a temporary (HTTP) error
  • Sale URL – whether a URL is 'for sale'
  • Scope of the data – whether the data are representative of the entire territory (some websites concern only a small fraction of the national territory)
  • Frequency of the data delivery/transmission – whether the data are provided at least every month (in our case, since rents are part of the CPI, we target at least a monthly refresh)
  • Representativity of the data – whether the data stand for a significant part of all rental offers (could be 20% or 30%)
  • Data description (metadata) – whether metadata or a minimal data description is delivered along with the data
  • Data completion – whether the rate of non-response/non-completion does not exceed a given threshold for variables considered critical; should be broken down by rent, surface and type of dwelling
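Several of these criteria can be checked automatically when sources are re-assessed. A minimal sketch using only the Python standard library, with a hypothetical portal URL, could look like this:

```python
from urllib import request, robotparser

PORTAL = "https://portal.example"  # hypothetical portal under assessment

# robots.txt criterion: are the listing pages disallowed for crawlers?
rp = robotparser.RobotFileParser(PORTAL + "/robots.txt")
rp.read()
allowed = rp.can_fetch("*", PORTAL + "/listings")

# HTTP error criterion: does the start page answer without an error status?
status = request.urlopen(PORTAL, timeout=10).status

print(f"robots.txt allows listings: {allowed}, HTTP status: {status}")
```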


Published 22 September 2022