
Issue 06 - Exploring potential new data source - real estate


In the previous blog entry, Issue 3 - Exploring Potential New Data Sources, we briefly described the use cases carried out under Work Package 3 (WP3). In this entry, we focus on the use case that aims to produce aggregated statistics on the real estate market based on online real estate advertisements.

In addition to monitoring the dynamics of the number of real estate offers for sale and rent, we intend to experiment with using web data in other areas of official statistics, such as constructing the Housing Rental Index (also used to calculate the HICP) or augmenting real estate market studies with new indicators.

The use case involves six partners, national statistical institutes (NSIs) from Bulgaria, Finland, France, Germany and Poland, that obtain data either by scraping or based on agreements with web portal owners. Despite the different data sources, all partners follow the same steps of the BREAL model:

  1. New data sources exploration
  2. Programming, production of software
  3. Data acquisition and recording
  4. Data processing 
  5. Modelling and interpretation
  6. Dissemination of the experimental statistics and results.

New data sources exploration

During the first year of work, we completed the first two steps of the BREAL model. We began with data source exploration, i.e., examining the sources, their availability and usefulness. We started by identifying the initial scope of information to be obtained by all the partners and conducted a review of publicly available websites presenting apartments for sale or rent. This resulted in the selection of several sources for each country to be assessed further. In the case of France and Finland, previous attempts to obtain the data showed that the portals presenting the largest number of offers blocked automated access. Therefore, these NSIs established direct collaborations with the data providers (based on contracts signed for selected periods). Germany also gathers data from one of the portals based on such an agreement.

To assess the usefulness of the data sources, we prepared a set of detailed criteria (see below) related to website availability, timeliness of data and technical aspects of site responses. Based on this assessment, a final list of portals was created for each partner (13 portals altogether).

Software production and data acquisition

For each portal to be scraped, dedicated software was prepared. For some data sources, we decided to use software developed by the Bulgarian NSI in Scrapy. It acquires data from websites using previously indicated selectors and thus provides a more generic way of acquiring the data. Currently, the dedicated scrapers collect data monthly and require constant monitoring for changes in the websites' structure. So far, we have encountered a few changes in the way information is presented on the websites, which required modifying some of the programs. We are considering a more flexible approach that would help avoid such problems.
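To give an idea of the selector-driven approach, a minimal Scrapy spider might look like the sketch below. The portal URL, CSS selectors and field names are hypothetical placeholders, not those of the actual WIN scrapers.

```python
import scrapy


class OffersSpider(scrapy.Spider):
    """Minimal sketch of a selector-driven real estate spider."""

    name = "offers"
    # Hypothetical portal; each real scraper targets a specific website.
    start_urls = ["https://portal.example/listings?page=1"]

    def parse(self, response):
        # One item per advertisement, extracted with pre-defined CSS selectors.
        for ad in response.css("article.listing"):
            yield {
                "price": ad.css("span.price::text").get(),
                "surface": ad.css("span.surface::text").get(),
                "rooms": ad.css("span.rooms::text").get(),
            }
        # Follow pagination until the portal runs out of result pages.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

When a portal changes its layout, only the selectors need updating, which is why this selector-driven style is easier to maintain than hard-coded parsing logic.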

Following the rules of netiquette, we adhere to the limitations indicated in the robots.txt files. The programs pause between requests so as not to overload the portals' servers.
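In Scrapy, these constraints can be enforced through standard settings; the values below are illustrative, not the exact ones used in production:

```python
# settings.py -- illustrative politeness settings
ROBOTSTXT_OBEY = True               # respect disallow rules in robots.txt
DOWNLOAD_DELAY = 5                  # pause (in seconds) between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit the same portal in parallel
AUTOTHROTTLE_ENABLED = True         # back off automatically if responses slow down
```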

More widely available access via APIs would be a great way to gather the data, along with agreements with data providers allowing direct data transfer in an adequate format. It would certainly be less complicated and more automated. Such a solution would be less exposed to changes in website structure and less dependent on NSI staff creating and modifying scrapers. However, obtaining data directly from the administrator also has its constraints, such as lengthy negotiations and uncertainty about the sustainability of the collaboration, to mention a few. The use case partners that use this model of data acquisition have received data only for an indicated period; receiving data in the future requires further negotiations. They therefore risk a break in the time series and, potentially, no alternative way of collecting the data.
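For comparison, where a provider exposes an API, acquisition typically reduces to a few lines of code; the endpoint, parameters and token below are entirely hypothetical:

```python
import requests

# Hypothetical endpoint and credentials; real providers define their own contracts.
response = requests.get(
    "https://api.portal.example/v1/offers",
    params={"market": "rent", "month": "2022-04", "page": 1},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
response.raise_for_status()
offers = response.json()  # structured records instead of HTML to be parsed
```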

Further steps

A significant challenge the use case partners face is processing the data in a way that allows comparison between countries. Due to the different data sources and different NSIs' needs, we decided to create a common framework of mandatory variables to derive, which includes (a sketch of their computation follows the list):

  • number of offers 
  • average price per m2
  • share of offers by price per m2 classes 
  • average surface in m2 
  • share of offers by surface classes
  • average number of rooms 
  • share of offers by number of rooms
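Assuming the cleaned offers are available as a table with price, surface and rooms columns (a simplification of the actual datasets, with hypothetical class boundaries), the indicators could be derived along these lines:

```python
import pandas as pd

# Hypothetical cleaned dataset: one row per advertisement.
offers = pd.DataFrame({
    "price": [250_000.0, 310_000.0, 180_000.0],
    "surface": [50.0, 64.0, 38.0],
    "rooms": [2, 3, 1],
})
offers["price_per_m2"] = offers["price"] / offers["surface"]

indicators = {
    "number_of_offers": len(offers),
    "avg_price_per_m2": offers["price_per_m2"].mean(),
    "avg_surface_m2": offers["surface"].mean(),
    "avg_rooms": offers["rooms"].mean(),
    # Shares by class, here with assumed surface classes in m2.
    "share_by_surface_class": (
        pd.cut(offers["surface"], bins=[0, 40, 60, 80, float("inf")])
        .value_counts(normalize=True)
        .sort_index()
    ),
}
```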

Up until now, we have been working with the scrapers and with data acquired since April 2022 at monthly intervals (with some data gathered from the data providers referring to particular historical periods). At the moment, we are focusing on establishing standardized data cleaning rules. Also, this year we plan to assess the sources again to check whether they are still in line with the rules and criteria previously adopted.
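As an illustration of what such rules might cover (the actual rules are still being agreed among the partners, and the column names and thresholds below are assumptions), a cleaning step could deduplicate re-posted advertisements and drop implausible values:

```python
import pandas as pd

def clean_offers(offers: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning rules; thresholds are assumptions, not agreed standards."""
    out = offers.drop_duplicates(subset=["ad_id"])  # same ad posted repeatedly
    out = out.dropna(subset=["price", "surface"])   # critical variables must be present
    out = out[out["surface"].between(5, 1000)]      # plausible dwelling surface in m2
    out = out[out["price"] > 0]                     # non-positive prices are errors
    return out
```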

If you want to know more about our journey, the problems encountered, and the results we have achieved so far, please follow our blog and do not hesitate to contact us with any questions. You can reach us via email at:

Dominik Dabrowski, D.Dabrowski@stat.gov.pl

Klaudia Peszat, K.Peszat@stat.gov.pl

Detailed criteria of the assessment sheet

  • Captcha – whether a web source uses a captcha or not
  • Robots blocking – whether a web source blocks robots or not
  • JavaScript – whether a web source uses JavaScript or not
  • List of pages – whether the web source has a list of pages with pagination
  • Filter criteria – whether a web source offers content filtering functionality relevant to the use case
  • GET HTTP method – whether a web source uses the GET method for HTTP requests
  • Up-to-date content – whether the web source had new user content published in the last month
  • Number of ads > X – whether the number of ads on the web source is larger than X, e.g., 100, 1,000, 10,000, etc.
  • Structured description – whether the web source presents the content in a structured way or just as plain text
  • HTML Microdata – whether the web source uses HTML Microdata (https://www.w3.org/TR/microdata/)
  • Description schema – whether the web source uses a description schema (https://schema.org/IndividualProduct)
  • HTML code changed every X – frequency of HTML code changes, e.g., 1 year, 2 years, 3 years, etc.
  • Specific time period filter – whether a web source allows scraping content published during a specific time period selected via the content filter
  • Scraping of yesterday's publications – whether a web source allows scraping only the content published yesterday
  • Multilanguage – whether a web source has an option to change language and currency
  • Ratings – whether a web source has an option to rate an offer and leave a comment
  • Cookies and tracking – whether a web source forces the user to accept cookies and tracking
  • Aggregator – whether a web source displays information gathered from many portals
  • Dynamic class tags – whether the web source's code is generated automatically
  • Terms of use – whether the web source's terms of use allow web scraping
  • robots.txt – whether a web source lists relevant pages as disallowed in robots.txt
  • Offers API – whether the website offers an API
  • CDN – whether access is blocked by content delivery network services (like Cloudflare)
  • File extension – whether a URL points to a file extension that does not contain a renderable website (e.g., .xlsx, .docx)
  • HTTP error – whether a URL returns a temporary (HTTP) error
  • Sale URL – whether a URL is 'for sale'
  • Scope of the data – whether the data are representative of the entire territory (some websites concern only a small fraction of the national territory)
  • Frequency of the data delivery/transmission – whether the data are provided at least every month (in our case, since rents are part of the CPI, we target at least a monthly refresh)
  • Representativity of the data – whether the data stand for a significant part of all rental offers (could be 20% or 30%)
  • Data description (metadata) – whether metadata or a minimal data description is delivered along with the data
  • Data completion – whether the rate of non-response/non-completion does not exceed a given threshold for variables considered critical; should be broken down by rent, surface and type of dwelling
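Several of these criteria can be checked automatically when sources are re-assessed. A minimal sketch using only the Python standard library, with a hypothetical portal URL, could look like this:

```python
from urllib import request, robotparser

PORTAL = "https://portal.example"  # hypothetical portal under assessment

# robots.txt criterion: are the listing pages disallowed for crawlers?
rp = robotparser.RobotFileParser(PORTAL + "/robots.txt")
rp.read()
allowed = rp.can_fetch("*", PORTAL + "/listings")

# HTTP error criterion: does the start page answer without an error status?
status = request.urlopen(PORTAL, timeout=10).status

print(f"robots.txt allows listings: {allowed}, HTTP status: {status}")
```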


Published 22 September 2022