Issue 19 - Integrating Big Data into Tourism Statistics

Looking back on our use case, it is clear that tourism statistics have changed significantly in recent years. As outlined in our previous blogs, evolving tourism trends and technological advances have compelled national statistical institutes (NSIs) to keep adapting their methodologies and IT systems. This adaptation is needed to accommodate data from diverse sources, including unstructured big data, and to improve the consistency and comparability of tourism statistics.

The primary objective of our use case, "Experimental Indices in Tourism Statistics", is to develop experimental indices that leverage big data, particularly data sourced from web portals. Because the tourism industry is multifaceted, the project draws on a diverse array of information beyond traditional tourism data, including variables such as accommodation prices, airline ticket costs, and local transportation expenses. By incorporating such data, we seek to enrich the tourist travel survey, address data gaps, and monitor rapid fluctuations in service prices, especially in response to disruptive events like the COVID-19 pandemic.

Exploring Diverse Data Sources

Since 2022, we have gathered data from leading booking portals such as Booking.com, Hotels.com and Airbnb.com. These platforms have become vital sources of information about accommodation establishments in Poland and Bulgaria. Booking.com alone provided 82,965 unique listings in Poland and 32,873 in Bulgaria, while Hotels.com and Airbnb.com contributed additional information on establishments and amenities. Together with the tourism survey frame, this data forms the basis for our analysis.

Handling Data Integration Challenges

Working with data from booking portals poses particular challenges. The portals differ in data availability, formats, and classification standards, which makes their records hard to compare and link. Overcoming these differences is essential if the sources are to be combined into a single, consistent picture.

Exploring Data Integration Methods

To address the demand for high-quality tourism-related data, we are exploring methods for integrating information from diverse sources. Our study examines several data linkage and deduplication methods: fuzzy matching, Natural Language Processing (NLP), and machine learning. Each handles a different aspect of merging datasets that describe the same establishments in different ways.

  1. Fuzzy Matching

Fuzzy matching techniques, based on algorithms such as Levenshtein distance and Jaro-Winkler similarity, compare text fields and flag potential duplicates. Because they tolerate variations such as spelling errors and typos, they improve the accuracy of deduplication and help preserve the integrity of the integrated dataset. Adding the geodesic distance between two records, computed for example with Vincenty's formulae, makes deduplication more precise: two listings with similar names are treated as the same establishment only if they are also close together on the map.
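The combination of name similarity and geographic distance can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the record fields, the thresholds (0.85 and 100 m) and the sample establishments are all made up for the example, and the haversine formula on a spherical Earth is swapped in for Vincenty's ellipsoidal formulae to keep the code short.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical after case folding."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a sphere (a simplification of Vincenty)."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_m * math.asin(min(1.0, math.sqrt(h)))


def is_probable_duplicate(rec_a, rec_b, min_similarity=0.85, max_distance_m=100.0):
    """Flag a pair only if the names are similar AND the locations are close."""
    similar = name_similarity(rec_a["name"], rec_b["name"]) >= min_similarity
    close = haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]) <= max_distance_m
    return similar and close


# Hypothetical records: one portal listing and two survey-frame entries.
portal = {"name": "Hotel Pod Lipami", "lat": 50.0615, "lon": 19.9370}
frame = {"name": "Hotel pod Lipamy", "lat": 50.0616, "lon": 19.9372}
other = {"name": "Pensjonat Nad Stawem", "lat": 54.3520, "lon": 18.6466}

print(is_probable_duplicate(portal, frame))
print(is_probable_duplicate(portal, other))
```

Requiring both conditions is a deliberate choice in this sketch: name similarity alone would merge identically named establishments in different towns, while distance alone would merge neighbours in the same building.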

  2. Natural Language Processing (NLP)

Natural Language Processing (NLP) plays a pivotal role in integrating textual data from sources such as booking portals and tourism surveys. Sentence-embedding libraries such as Sentence Transformers turn names and descriptions into numeric vectors, and similarity-search libraries such as Faiss compare those vectors efficiently, which helps identify duplicate entries even when the wording differs. Tools are readily available for widely spoken languages like English, but adapting NLP to Polish or Bulgarian requires language-specific resources, such as spaCy and bgNLP, to keep the integration process reliable.
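The comparison step of such a pipeline can be illustrated with a small sketch. It assumes the texts have already been converted to embedding vectors; the three-dimensional vectors and the record identifiers below are hand-made stand-ins, not real model output. The brute-force loop plays the role that a vector index would play at scale: it normalizes each vector to unit length, so the inner product equals cosine similarity, and returns the best-scoring entries.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_inner_product(query, index, k=1):
    """Exact (brute-force) search for the k entries most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for key, vec in index.items():
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), key))
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Hypothetical embeddings of two survey-frame entries and one portal listing.
survey_frame = {
    "frame-001": [0.9, 0.1, 0.0],
    "frame-002": [0.0, 0.2, 0.9],
}
portal_listing = [0.8, 0.2, 0.1]

matches = top_k_inner_product(portal_listing, survey_frame, k=1)
print(matches[0][0])
```

In practice a similarity threshold would still be needed on top of the ranking, since the nearest entry is returned even when no true match exists.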

  3. Machine Learning with K-Nearest Neighbors (K-NN)

Machine learning algorithms, such as the K-Nearest Neighbors (K-NN) technique, offer another framework for data linkage and deduplication. Records are first converted to vectors by combining Term Frequency-Inverse Document Frequency (TF-IDF) weighting with N-gram analysis; K-NN then finds, for each record, the most similar records in the other dataset, which are treated as potential duplicates. Used this way, as a nearest-neighbor search rather than a classifier, K-NN needs no labelled training data, so it can be applied across diverse datasets without prior classification.
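A minimal sketch of this idea, using only the standard library, is shown below. It is an illustration under stated assumptions rather than the method used in the project: it uses character trigrams, a smoothed IDF weight, cosine similarity, and k = 1, and the establishment names are invented for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of the case-folded, space-padded text."""
    padded = f" {text.casefold().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(texts, n=3):
    """One sparse, unit-length TF-IDF vector (a dict) per input text."""
    grams = [Counter(char_ngrams(t, n)) for t in texts]
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(texts)
    vectors = []
    for counts in grams:
        # Smoothed IDF: n-grams that occur in fewer texts get a larger weight.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
               for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def nearest_neighbor(index, vectors):
    """Return (position, similarity) of the vector closest to vectors[index]."""
    candidates = [j for j in range(len(vectors)) if j != index]
    best = max(candidates, key=lambda j: cosine(vectors[index], vectors[j]))
    return best, cosine(vectors[index], vectors[best])


# Hypothetical establishment names from different sources.
names = [
    "Willa Pod Sosnami",
    "Willa pod Sosnami Zakopane",
    "Apartamenty Morskie Oko",
]
vectors = tfidf_vectors(names)
best, score = nearest_neighbor(0, vectors)
print(names[best], round(score, 2))
```

Character n-grams are used here instead of whole words because they stay robust to typos and to inflected forms, which matters for languages such as Polish and Bulgarian.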

Implications and Future Directions

Integrating web-scraped data enriches the tourism survey frame, offering a more comprehensive understanding of accommodation establishments. As we continue to refine and expand our data integration methodologies, we move closer to realizing the transformative potential of big data in shaping the future of tourism analysis.

In 2024, we plan to refine the integration methods further. That work will prepare the ground for implementing methodologies for new indices, so that tourism statistics can provide more timely and comprehensive insights into the industry's dynamics.