In the previous blog entries, Issue 3 – Exploring Potential New Data Sources and Issue 6 – Exploring Potential New Data Source – Real Estate, we briefly described the use cases carried out under Work Package 3 (WP3) and presented the work done on real estate data. In this entry, we will focus on the data we have gathered. The main goal of the project is to produce aggregated statistics on the real estate market based on online real estate advertisements, but the road to that goal is full of challenges. The crucial ones are data assessment, pre-processing and improvement.
The project is based on entirely new data sources, whose properties, both in terms of distributions and the amount of missing data or data errors, only become known at the moment of acquisition. The way the data are created makes things even harder: the information comes from many different people (users) and can be unpredictable.
When we first obtained the datasets by web scraping, we had to overcome multiple problems that usually do not exist, or whose impact is minimal, in data acquired traditionally within statistical systems. Typically, when we prepare a new statistical project, we set all the rules for the information we want from our respondents. Definitions, metadata, a chosen population, sampling frames or data from other systems are all helpful, and perhaps something we do not appreciate enough in our everyday work. The same applies to the data acquisition tool itself, a report with logically ordered questions and fully standardised sets of answers or well-known classifications. All this makes it possible to keep the acquired data standardised and maintain them within certain predetermined rules. In the case of web data, however, all of this is out of reach. We may not know the population, the reference points, how many different labels a variable may take, the lengths of variables, their definitions, or how the data were pre-processed, deduplicated or re-coded.
All of these unknowns are hard to overcome and narrow our possibilities for utilising the data. While working on the project, which aims to produce statistics on the real estate sale and rental market, we encountered many different obstacles, including:
- Missing data, especially in the case of newly sold objects where the price is not given;
- Duplicated data within a single source: we found the same offers under different IDs, with all other variables identical (see the deduplication sketch after this list);
- Duplicated data across different sources, an even bigger issue for portals that present the same offers in parallel (because of their owners' business relationships);
- Categorised information on values we would like to publish in detail. For example, apartment areas published only as categories do not let us see the maximum values. As a result, the final datasets have to present such variables as they are stored in the sources: the highest floor-level category has to be shown as "ten and more", even though some sources provide de-aggregated data. The same applies to the number of rooms, where we had to adopt the coarsest categorisation used across the sources: one room, two rooms, three rooms, four rooms, five and more;
- Some offers present more than one object, so extracting precise information referring to each apartment is impossible;
- Offers that are out of scope, such as plots of land for sale listed in the apartments category, or apartments for sale with very low prices that in fact refer to apartments for rent.
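To make the deduplication and categorisation issues more concrete, below is a minimal sketch in Python (pandas). The column names and category boundaries are illustrative assumptions, not the actual schema of our datasets.

```python
import pandas as pd

# Illustrative offers table; the column names are assumptions, not our real schema.
offers = pd.DataFrame({
    "offer_id": [101, 102, 103, 104],
    "town":     ["Warszawa", "Warszawa", "Krakow", "Krakow"],
    "price":    [650000, 650000, 480000, 520000],
    "floor":    [3, 3, 12, 5],
    "rooms":    [2, 2, 1, 6],
})

# Within-source duplicates: the same offer appears under different IDs,
# so compare rows while ignoring the ID column and keep the first occurrence.
dedup_cols = [c for c in offers.columns if c != "offer_id"]
offers = offers.drop_duplicates(subset=dedup_cols, keep="first")

# Harmonise to the coarsest categories shared by all sources:
# floor level top-coded at "ten and more", rooms at "five and more".
offers["floor_cat"] = offers["floor"].apply(lambda f: "10+" if f >= 10 else str(f))
offers["rooms_cat"] = offers["rooms"].apply(lambda r: "5+" if r >= 5 else str(r))

print(offers[["offer_id", "town", "price", "floor_cat", "rooms_cat"]])
```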
Working with data within statistical systems also requires preparing them for presentation according to official statistical classifications. In the case of our project, it is very important to show the regional distribution of variables, so the scraped location information has to be matched with the territorial classification (NUTS). This is even harder to accomplish when the location of the sold/rented object is typed in by hand rather than selected from a closed list of possible values. To illustrate the problem: in some cases we found as many as 38 different spellings of the name of a single town.
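To illustrate one way such hand-typed names could be reconciled with a reference list, here is a small sketch using Python's standard difflib. The town names and the similarity cutoff are made up for illustration; in practice the reference list would come from the official territorial register, the cutoff would need careful tuning, and unmatched names would go to manual review.

```python
from difflib import get_close_matches

# Hypothetical reference list; in practice taken from the territorial register.
reference_towns = ["Warszawa", "Wroclaw", "Gdansk", "Poznan"]

# A few of the many hand-typed variants that can appear in scraped offers.
scraped_names = ["warszawa", "Warszaw", "W-wa", "  gdansk "]

for name in scraped_names:
    # Normalise case and whitespace, then find the closest reference entry.
    cleaned = name.strip().title()
    match = get_close_matches(cleaned, reference_towns, n=1, cutoff=0.7)
    print(f"{name!r} -> {match[0] if match else 'NO MATCH (manual review)'}")
```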
Of course, some of the information mentioned above (like the precise number of rooms or the correct location) may be found in the full descriptions of the offers. However, these are long free-text fields containing many different pieces of information, which makes it harder to choose an appropriate extraction method. For this challenge, we see the possibility of using machine learning techniques for text classification. Still, training the models would first require manually classifying the datasets, which would be time-consuming. In the future, it may be possible to prepare and share the results of such a task.
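To show what the text-classification idea might look like, here is a minimal sketch using scikit-learn. The tiny labelled sample is invented purely for illustration; a real model would require a much larger, manually classified set of offer descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented, hand-labelled descriptions; building a real training set would
# be the time-consuming manual classification step mentioned above.
descriptions = [
    "Sunny two-room apartment with a balcony on the third floor",
    "Spacious three-room flat, separate kitchen, near the metro",
    "Cosy studio, one room with a kitchenette, perfect for students",
    "Large three-room apartment after renovation, quiet area",
]
rooms = ["2", "3", "1", "3"]  # manually extracted number of rooms

# TF-IDF features fed into a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(descriptions, rooms)

# Predict the room count for a new, unseen offer description.
print(model.predict(["Bright two-room flat with a new kitchen"]))
```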
Not all of the problems specified above concern the data itself; some relate to the software we use and its vulnerability to change. During the assessment of our scraping programs, we found that for some portals the number of offers acquired varies considerably depending on when we scrape. If a program fails, we have to re-execute it, which means the reference date differs from that of previous acquisitions. An even more difficult situation arises when the HTML structure of a page changes, because it is never known how quickly fixes can be implemented. Obviously, a complete breakdown of the scraper may mean no data gathered for a given period, which may be unacceptable for official statistics.
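One cheap safeguard is a sanity check after every scraping run: compare the number of records obtained with a baseline from recent runs and verify that key fields are filled, so that a silent change in the page structure is flagged rather than slipping into the data. The sketch below is a minimal illustration; the threshold values and field names are assumptions.

```python
def check_scrape(records, baseline_count, tolerance=0.5,
                 required_fields=("price", "town")):
    """Flag a scraping run whose volume or structure deviates from recent runs."""
    issues = []
    # A sudden drop in volume often signals a failed or partial scrape.
    if len(records) < baseline_count * tolerance:
        issues.append(f"only {len(records)} records vs baseline {baseline_count}")
    # Widespread missing key fields usually mean the HTML structure changed.
    for field in required_fields:
        missing = sum(1 for r in records if not r.get(field))
        if missing > len(records) * 0.2:
            issues.append(f"field '{field}' empty in {missing} records")
    return issues

# Example: a run that collapsed to a single incomplete record.
print(check_scrape([{"price": None, "town": "Gdansk"}], baseline_count=1000))
```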
All the challenges we have overcome, and those we are still dealing with, show how complicated it is to work with data obtained from the Internet, particularly from sources unrelated to the statistical system and not using its metadata. For now, we are working on the software to prepare the first aggregated results and look forward to finding appropriate solutions to the remaining problems.
If you want to know more about our journey, the problems encountered, and what results we have achieved so far, please follow our blog and do not hesitate to contact us with any questions. You can reach us via email at:
Dominik Dabrowski, D.Dabrowski@stat.gov.pl
Klaudia Peszat, K.Peszat@stat.gov.pl
Published July 2023