
Issue 06 - Exploring potential new data source - real estate


In the previous blog entry, Issue 3 - Exploring Potential New Data Sources | CROS (europa.eu), we briefly described the use cases carried out under Work Package 3 (WP3). In this entry, we focus on the use case that aims to produce aggregated statistics on the real estate market based on online real estate advertisements.

In addition to monitoring the dynamics of the number of real estate offers for sale and rent, we intend to experiment with using web data in other areas of official statistics, such as constructing the Housing Rental Index (also used to calculate the HICP) or augmenting real estate market studies with new indicators.

The use case involves six partners, national statistical institutes (NSIs) from Bulgaria, Finland, France, Germany and Poland, that obtain data by scraping or through agreements with web portal owners. Despite the different data sources, all partners follow the same steps of the BREAL model:

  1. New data sources exploration
  2. Programming, production of software
  3. Data acquisition and recording
  4. Data processing 
  5. Modelling and interpretation
  6. Dissemination of the experimental statistics and results.

New data sources exploration

During the first year of work, we completed the first two steps of the BREAL model. We began with data source exploration, i.e. examining the sources, their availability and usefulness. We did this by identifying the initial scope of information to be obtained by all the partners and reviewing publicly available websites presenting apartments for sale or rent. This resulted in the selection of several sources per country for further assessment. In the case of France and Finland, previous attempts to obtain the data had shown that the portals presenting the largest number of offers blocked automated acquisition. These NSIs therefore established direct collaboration with the data providers (based on contracts signed for selected periods). Germany also gathers data from one of the portals under such an agreement.

To assess the usefulness of the data sources, we prepared a set of detailed criteria (see below) related to website availability, timeliness of data and technical aspects of site responses. Based on this assessment, a final list of portals was created for each partner (13 portals altogether).

Software production and data acquisition

For each portal to be scraped, dedicated software was prepared. For some data sources, we decided to use software developed by the Bulgarian NSI in Scrapy, which acquires data from websites using previously indicated selectors and provides a more generic way of acquiring the data. Currently, the dedicated scrapers collect data monthly and require constant monitoring for changes in website structure. So far, we have encountered a few changes in the way information is presented on the websites, which required modifying some of the programs. We are considering a more flexible approach that would help avoid such problems.
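The selector-driven idea can be sketched with the Python standard library alone. The production software is written in Scrapy; the portal name, CSS classes and field names below are invented for illustration only. Each portal gets a small configuration mapping output fields to the class attributes that mark them, so adding a portal means adding a configuration entry rather than a new scraper:

```python
from html.parser import HTMLParser

# Hypothetical per-portal configuration: output field -> CSS class marking it.
PORTAL_SELECTORS = {
    "example-portal": {"price": "offer-price", "surface": "offer-area", "rooms": "offer-rooms"},
}

class OfferExtractor(HTMLParser):
    """Collect the text content of elements whose class matches a configured selector."""

    def __init__(self, selectors):
        super().__init__()
        # Invert the mapping: CSS class -> output field name.
        self.selectors = {cls: field for field, cls in selectors.items()}
        self.current = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        for cls in dict(attrs).get("class", "").split():
            if cls in self.selectors:
                self.current = self.selectors[cls]

    def handle_data(self, data):
        if self.current and data.strip():
            self.record[self.current] = data.strip()
            self.current = None

# A made-up fragment of a listing page.
html_snippet = """
<div class="offer">
  <span class="offer-price">450000</span>
  <span class="offer-area">52.5</span>
  <span class="offer-rooms">3</span>
</div>
"""

parser = OfferExtractor(PORTAL_SELECTORS["example-portal"])
parser.feed(html_snippet)
print(parser.record)
```

When a portal changes its markup, only the configuration dictionary needs updating, which is the kind of flexibility the paragraph above refers to.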

Following the rules of netiquette, we adhere to the limitations indicated in the robots.txt files. The programs pause between requests to the portals so as not to overload the administrators' servers.
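Python's standard library can express both of these politeness rules. The robots.txt content below is a made-up example of what a portal might serve; `urllib.robotparser` answers whether a path may be fetched and what crawl delay the site requests:

```python
from urllib import robotparser

# Hypothetical robots.txt content a portal might serve.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # mark the rules as loaded when feeding lines directly

# Listing pages are allowed, administrative pages are not.
assert rp.can_fetch("*", "https://portal.example/offers?page=1")
assert not rp.can_fetch("*", "https://portal.example/admin/login")

# Respect the site's requested delay, with a fallback pause (here 2 s)
# to be used with time.sleep() between consecutive requests.
delay = rp.crawl_delay("*") or 2
```

In a real scraper, `delay` would be passed to `time.sleep()` between requests; Scrapy offers the same behaviour via its `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` settings.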

Wider access via APIs would be a great way to gather the data, along with agreements with data providers allowing direct data transfer in an adequate format. It would certainly be less complicated and more automated, and such a solution would be less exposed to changes in website structure and less dependent on NSI staff creating and modifying scrapers. However, obtaining data directly from the administrator also has its constraints, such as extended negotiation time and uncertainty about the sustainability of the collaboration, to mention a few. The use case partners that use this model of data acquisition have received data only for an indicated period; receiving data in the future requires further negotiations. They thus risk breaks in the time series and potentially no alternative way to collect the data.

Further steps

A significant challenge the use case partners face is processing the data in a way that allows comparison between countries. Due to the different data sources and different NSIs' needs, we decided to create a common framework of mandatory variables to develop:

  • number of offers 
  • average price per m2
  • share of offers by price per m2 classes 
  • average surface in m2 
  • share of offers by surface classes
  • average number of rooms 
  • share of offers by number of rooms
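Once the offers are cleaned, these mandatory variables reduce to simple aggregations. A minimal sketch with the standard library, using invented example offers and an illustrative price-per-m2 class boundary (the real class boundaries are defined within the use case):

```python
from statistics import mean

# Hypothetical cleaned offers: price, surface in m2, number of rooms.
offers = [
    {"price": 280000, "surface": 56.0, "rooms": 2},
    {"price": 450000, "surface": 75.0, "rooms": 3},
    {"price": 240000, "surface": 40.0, "rooms": 2},
]

price_per_m2 = [o["price"] / o["surface"] for o in offers]

stats = {
    "number_of_offers": len(offers),
    "avg_price_per_m2": round(mean(price_per_m2), 2),
    "avg_surface_m2": round(mean(o["surface"] for o in offers), 2),
    "avg_rooms": round(mean(o["rooms"] for o in offers), 2),
}

# Share of offers by price-per-m2 class (class boundary is illustrative).
bands = {"< 6000": 0, ">= 6000": 0}
for p in price_per_m2:
    bands["< 6000" if p < 6000 else ">= 6000"] += 1
shares = {k: v / len(offers) for k, v in bands.items()}
```

The shares by surface class and by number of rooms follow the same banding pattern with different boundaries.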

Up until now, we have been working with scrapers and data acquired since April 2022 at monthly intervals (with some data gathered from the data providers referring to particular historical periods). At the moment, we are focusing on establishing standardized data cleaning rules. This year we also plan to reassess the sources to check whether they still meet the rules and criteria previously imposed.
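Standardized cleaning rules typically combine deduplication with plausibility filters. The sketch below illustrates the idea only; the identifiers and thresholds are invented, not the rules agreed within the use case:

```python
# Hypothetical raw scrape: the same ad can appear twice, and some
# listings carry implausible values (thresholds below are illustrative).
raw = [
    {"id": "a1", "price": 300000, "surface": 55.0},
    {"id": "a1", "price": 300000, "surface": 55.0},  # duplicate listing
    {"id": "b2", "price": 1,      "surface": 60.0},  # implausible price
    {"id": "c3", "price": 280000, "surface": 48.0},
]

def clean(offers, min_price=1000, min_surface=10, max_surface=1000):
    """Drop duplicate ad ids and offers outside plausible value ranges."""
    seen, out = set(), []
    for o in offers:
        if o["id"] in seen:
            continue
        if o["price"] < min_price or not (min_surface <= o["surface"] <= max_surface):
            continue
        seen.add(o["id"])
        out.append(o)
    return out

cleaned = clean(raw)
```

Keeping such rules in one shared function, with country-specific thresholds as parameters, is one way to make the resulting indicators comparable across partners.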

If you want to know more about our journey, the problems encountered and the results we have achieved so far, please follow our blog and do not hesitate to contact us with any questions. You can reach us via email at:

Dominik Dabrowski, D.Dabrowski@stat.gov.pl

Klaudia Peszat, K.Peszat@stat.gov.pl

Detailed criteria of the assessment sheet

  • Captcha: Whether a web source uses a captcha
  • Robots blocking: Whether a web source blocks robots
  • JavaScript: Whether a web source uses JavaScript
  • List of pages: Whether the web source has a list of pages with pagination
  • Filter criteria: Whether a web source offers content filtering functionality relevant to the use case
  • GET HTTP method: Whether a web source uses the GET method for HTTP requests
  • Up-to-date content: Whether the web source has new user content published in the last month
  • Number of ads > X: Whether the number of ads on the web source is bigger than X, e.g. 100, 1000, 10000, etc.
  • Structured description: Whether the web source has a structured presentation of the content or just plain text
  • HTML Microdata: Whether the web source uses HTML Microdata (https://www.w3.org/TR/microdata/)
  • Description schema: Whether the web source uses a description schema (https://schema.org/IndividualProduct)
  • HTML code changed every X: Frequency of HTML code changes, e.g. 1 year, 2 years, 3 years, etc.
  • Specific time period filter: Whether a web source allows scraping content published during a specific time period selected via the content filter
  • Scraping of yesterday's publications: Whether a web source allows scraping only the content published yesterday
  • Multilanguage: Whether a web source has an option to change language and currency
  • Ratings: Whether a web source has an option to rate an offer and leave a comment
  • Cookies and tracking: Whether a web source forces users to accept cookies and tracking information
  • Aggregator: Whether a web source displays information gathered from many portals
  • Dynamic class tags: Whether a web source's code is generated automatically
  • Terms of use: Whether a web source's terms of use allow web scraping
  • robots.txt: Whether a web source lists relevant pages as disallowed in robots.txt
  • Offers API: Whether the website offers an API
  • CDN: Whether access is blocked by content delivery network services (like Cloudflare)
  • File extension: File extensions that do not contain renderable websites (e.g. .xlsx, .docx)
  • HTTP error: Whether a URL returns a temporary (HTTP) error
  • Sale URL: Whether a URL is 'for sale'
  • Scope of the data: Whether the data are representative of the entire territory (some websites cover only a small fraction of the national territory)
  • Frequency of the data delivery/transmission: Whether the data are provided at least every month (in our case, since rents are part of the CPI, we target at least a monthly refresh)
  • Representativity of the data: Whether the data stand for a significant part of all rental offers (e.g. 20% or 30%)
  • Data description (metadata): Whether metadata or a minimal data description is delivered along with the data
  • Data completion: Whether the rate of non-response/non-completion does not exceed a given threshold for variables considered critical; should be broken down by rent, surface and type of dwelling
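Criteria like these can also be applied programmatically once a candidate source has been described. The sketch below encodes only a small, invented subset of the sheet, with an illustrative ads threshold, to show how an assessment run could flag failing criteria:

```python
def failed_criteria(source, min_ads=1000):
    """Return the names of assessment criteria a candidate source fails
    (illustrative subset of the sheet; threshold is an example value)."""
    failed = []
    if source["uses_captcha"]:
        failed.append("Captcha")
    if source["blocks_robots"]:
        failed.append("Robots blocking")
    if source["ads_count"] <= min_ads:
        failed.append("Number of ads > X")
    if not source["terms_allow_scraping"]:
        failed.append("Terms of use")
    return failed

# Two made-up candidate sources.
candidate = {"uses_captcha": False, "blocks_robots": False,
             "ads_count": 25000, "terms_allow_scraping": True}
rejected = {"uses_captcha": True, "blocks_robots": True,
            "ads_count": 500, "terms_allow_scraping": False}

print(failed_criteria(candidate))  # no failed criteria
print(failed_criteria(rejected))
```

Encoding the sheet this way would also make the planned reassessment of sources repeatable from one year to the next.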


Published 22 September 2022 
