In the previous blog entry, Issue 3 - Exploring Potential New Data Sources | CROS (Europa.EU), we briefly described the use cases carried out under Work Package 3 (WP3). In this entry, we focus on the use case that aims to produce aggregated statistics on the real estate market based on online real estate advertisements.
In addition to monitoring the dynamics of the number of real estate offers for sale and rent, we intend to experiment with using web data in other areas of official statistics, such as constructing the Housing Rental Index (also used to calculate the HICP) or augmenting real estate market studies with new indicators.
The use case involves six partners, national statistical institutes (NSIs) from Bulgaria, Finland, France, Germany and Poland, that obtain data either by scraping or under agreements with web portal owners. Despite the different data sources, all partners follow the same steps of the BREAL model:
- New data sources exploration
- Programming, production of software
- Data acquisition and recording
- Data processing
- Modelling and interpretation
- Dissemination of the experimental statistics and results.
New data sources exploration
During the first year of work, we completed the first two steps of the BREAL model. We began with data source exploration, i.e., examining the sources, their availability and usefulness. We started by identifying the initial scope of information to be obtained by all the partners and then reviewed publicly available websites presenting apartments for sale or rent. This resulted in the selection of several sources per country for further assessment. In the case of France and Finland, previous attempts to obtain the data showed that the portals presenting the largest number of offers blocked automated acquisition. These NSIs therefore established a direct collaboration with the data providers (based on contracts signed for selected periods). Germany also gathers data from one of the portals under such an agreement.
In order to assess the usefulness of data sources, we prepared a set of detailed criteria (see below) related to website availability, timeliness of data and technical aspects of site responses. Based on this assessment, a final list of portals was created for each partner (13 portals altogether).
Software production and data acquisition
For each portal to be scraped, dedicated software was prepared. For some data sources, we decided to use software developed by the Bulgarian NSI in Scrapy; it acquires data from websites using previously indicated selectors and provides a more generic way of acquiring the data. Currently, the dedicated scrapers collect data monthly and require constant monitoring of the websites' structure. So far, we have encountered a few changes in the way information is presented on the websites, which required modifying some of the programs. We are considering a more flexible approach that would help avoid such problems.
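To give an idea of what such a dedicated scraper involves, below is a minimal Scrapy spider sketch. The portal URL, CSS selectors and field names are placeholders invented for illustration, not the actual selectors used by the partners.

```python
import scrapy


class OffersSpider(scrapy.Spider):
    """Minimal sketch of a listing scraper; all selectors are placeholders."""

    name = "offers"
    # Hypothetical portal URL, for illustration only.
    start_urls = ["https://example-portal.test/apartments?page=1"]

    def parse(self, response):
        # Each advertisement is assumed to sit in a container with class "offer".
        for offer in response.css("div.offer"):
            yield {
                "title": offer.css("h2.title::text").get(),
                "price": offer.css("span.price::text").get(),
                "surface_m2": offer.css("span.surface::text").get(),
                "rooms": offer.css("span.rooms::text").get(),
            }
        # Follow the pagination link, if one is present.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

When a portal changes its page layout, it is typically these selectors that break, which is why such spiders need the constant monitoring mentioned above.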
Following the rules of netiquette, we adhere to the limitations indicated in the robots.txt files, and the programs insert time breaks between requests sent to the portals so as not to overload the administrators' servers.
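In Scrapy, for example, such self-imposed limits can be declared directly in the project settings; the values below are illustrative rather than the ones we actually apply.

```python
# settings.py -- illustrative politeness settings for a Scrapy project
ROBOTSTXT_OBEY = True                # respect disallow rules in robots.txt
DOWNLOAD_DELAY = 5                   # seconds to wait between requests to a domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # never hit a portal with parallel requests
AUTOTHROTTLE_ENABLED = True          # back off automatically if responses slow down
# Identify the crawler to the administrator; the address is a placeholder.
USER_AGENT = "NSI research crawler (contact@example.org)"
```

AutoThrottle additionally adapts the delay to the server's observed response times, which helps avoid overloading a portal when it slows down under load.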
More widely available access via APIs, together with agreements with data providers allowing direct data transfer in an adequate format, would be a great way to gather the data. It would certainly be less complicated and more automated: such a solution is less exposed to changes in website structure and less dependent on NSI staff creating and modifying scrapers. However, obtaining data directly from the administrator also has its constraints, such as extended negotiation times and uncertainty about the sustainability of the collaboration, to mention a few. The use case partners using this model of data acquisition have received data only for an indicated period, and receiving data in the future requires further negotiations. They thus risk breaks in their time series and potentially losing the possibility to collect the data by other means.
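For comparison, retrieving the same records through a provider API could look like the hypothetical sketch below; the endpoint, parameters and token are invented for illustration and do not correspond to any actual portal.

```python
import requests

# Hypothetical provider API; endpoint, parameters and token are placeholders.
response = requests.get(
    "https://api.example-portal.test/v1/offers",
    params={"month": "2022-04", "type": "rent"},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
response.raise_for_status()
offers = response.json()  # structured records, no HTML parsing required
```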
Further steps
A significant challenge the use case partners face is processing the data in a way that allows comparison between countries. Due to the different data sources and different NSIs' needs, it was decided to create a common framework of mandatory variables, covering the following (a minimal computation sketch follows the list):
- number of offers
- average price per m2
- share of offers by price per m2 classes
- average surface in m2
- share of offers by surface classes
- average number of rooms
- share of offers by number of rooms
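As an illustration only, these indicators could be derived from a table of cleaned offers roughly as in the sketch below. The column names (price, surface_m2, rooms) and the class boundaries are our own assumptions for the example, not the definitions agreed in the framework.

```python
import pandas as pd


def mandatory_indicators(df: pd.DataFrame) -> dict:
    """Compute the mandatory variables from a table of cleaned offers.

    Assumes one offer per row with columns: price, surface_m2, rooms.
    """
    price_m2 = df["price"] / df["surface_m2"]
    # Example class boundaries; the agreed classes may differ.
    price_classes = pd.cut(price_m2, bins=[0, 2000, 4000, 8000, float("inf")])
    surface_classes = pd.cut(df["surface_m2"], bins=[0, 40, 60, 80, float("inf")])
    return {
        "number_of_offers": len(df),
        "avg_price_per_m2": price_m2.mean(),
        "share_by_price_class": price_classes.value_counts(normalize=True).sort_index(),
        "avg_surface_m2": df["surface_m2"].mean(),
        "share_by_surface_class": surface_classes.value_counts(normalize=True).sort_index(),
        "avg_rooms": df["rooms"].mean(),
        "share_by_rooms": df["rooms"].value_counts(normalize=True).sort_index(),
    }
```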
So far, we have been working with scrapers and data acquired since April 2022 at monthly intervals (with some data obtained from the data providers referring to particular historical periods). At the moment, we are focusing on establishing standardized data cleaning rules. This year we also plan to reassess the sources to check whether they still meet the rules and criteria previously imposed.
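By way of example, standardized cleaning rules might look like the following sketch; the thresholds, column names and the clean_offers helper are hypothetical, not the rules finally agreed by the partners.

```python
import pandas as pd


def clean_offers(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning rules; thresholds and columns are assumed."""
    df = df.drop_duplicates(subset=["portal", "offer_id"])       # one row per advertisement
    df = df.dropna(subset=["price", "surface_m2"])                # critical variables must be present
    df = df[(df["surface_m2"] > 5) & (df["surface_m2"] < 1000)]   # drop implausible surfaces
    price_m2 = df["price"] / df["surface_m2"]
    # Trim extreme prices per m2, e.g. outside the 1st-99th percentile range.
    lo, hi = price_m2.quantile([0.01, 0.99])
    return df[price_m2.between(lo, hi)]
```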
If you want to know more about our journey, the problems we encountered, and the results achieved so far, please follow our blog and do not hesitate to contact us with any questions. You can reach us via email at:
Dominik Dabrowski, D.Dabrowski@stat.gov.pl
Klaudia Peszat, K.Peszat@stat.gov.pl
Detailed criteria of the assessment sheet
| Criteria | Description |
| --- | --- |
| Captcha | Whether a web source uses a captcha or not |
| Robots blocking | Whether a web source blocks robots or not |
| JavaScript | Whether a web source uses JavaScript or not |
| List of pages | Whether the web source has a list of pages with pagination |
| Filter criteria | Whether a web source offers content filtering functionality relevant to the use case |
| GET HTTP method | Whether a web source uses the GET method for HTTP requests |
| Up-to-date content | Whether the web source has new user content published in the last month |
| Number of ads > X | Whether the number of ads on the web source is bigger than X, e.g., 100, 1000, 10000, etc. |
| Structured description | Whether the web source presents the content in a structured way or just as plain text |
| HTML Microdata | Whether the web source uses HTML Microdata (https://www.w3.org/TR/microdata/) |
| Description schema | Whether the web source uses a description schema (https://schema.org/IndividualProduct) |
| HTML code changed every X | Frequency of HTML code change, e.g., 1 year, 2 years, 3 years, etc. |
| Specific time period filter | Whether a web source allows scraping content published during a specific time period selected via the content filter |
| Scraping of yesterday's publications | Whether a web source allows scraping only the content published yesterday |
| Multilanguage | Whether a web source has an option to change language and currency |
| Ratings | Whether a web source has an option to rate an offer and leave a comment |
| Cookies and tracking | Whether a web source forces users to accept cookies and tracking |
| Aggregator | Whether a web source displays information gathered from many portals |
| Dynamic class tags | Whether the web source's code is generated automatically |
| Terms of use | Whether a web source's terms of use allow web scraping |
| robots.txt | Whether a web source lists relevant pages as disallowed in robots.txt |
| Offers API | Whether the website offers an API |
| CDN | Whether access is blocked by content delivery network services (like Cloudflare) |
| File extension | Whether a URL points to a file extension that does not contain a renderable website (e.g., .xlsx, .docx) |
| HTTP error | Whether a URL returns a temporary (HTTP) error |
| Sale URL | Whether a URL is 'for sale' |
| Scope of the data | Whether the data are representative of the entire territory (there are examples of specific websites covering only a small fraction of the national territory) |
| Frequency of the data delivery/transmission | Whether the data are provided at least every month (in our case, since rents are part of the CPI, we target at least a monthly refresh) |
| Representativity of the data | Whether the data stand for a significant part of all rental offers (could be 20% or 30%) |
| Data description (metadata) | Whether metadata or a minimal data description is delivered along with the data |
| Data completion | Whether the rate of non-response/non-completion does not exceed a given threshold for variables considered critical; should be broken down by rent/surface/type of dwelling |
Published 22 September 2022