Details
Supervisor: Santiago Andrés Azcoitia, Departamento de Señales, Sistemas y Radiocomunicaciones (Department of Signals, Systems and Radiocommunications)
Start date: 1 February 2025
Requirements: Student enrolled in the Grado en Ingeniería y Sistemas de Datos (Bachelor's Degree in Data Engineering and Systems).
Applications: Send your CV and academic transcript to santiago.andres@upm.es before 7 January 2025.
Background
Spurred by the widespread adoption of AI/ML, ‘data’ is becoming a key production factor, comparable in importance to capital, land, or labour in an increasingly digital economy. Despite an ever-growing demand for third-party data in the B2B market, firms are generally reluctant to share their information. This is due to the unique characteristics of ‘data’ as an economic good: it is a freely replicable, nondepletable asset whose value is highly combinatorial and context-specific. As a result, most of these valuable assets remain unexploited in corporate silos.
However, there is already an ecosystem of companies that trade data over the Internet [1]. Some analysts have estimated the potential value of the data economy at $2.5 trillion globally by 2025 [2, 3], and the development of healthy data markets would be key to making the most of AI/ML, which is expected to reach a market of $15-20 trillion in 2030 [4, 5]. Recent studies have identified more than 2,000 data providers offering data products in commercial data marketplaces [6]. Although standards such as W3C’s DCAT v3.0 already exist, the metadata describing data products in commercial data marketplaces neither follows any standard nor respects a common structure. As a result, many features describing data assets (e.g., update frequency, delivery methods, or the volume of data on offer) are found in the plain-language descriptions attached to data products in marketplaces.
Objective
This Master's Thesis aims to use NLP models and techniques, including LLMs, to build a tool that structures the information contained in the descriptions of data products in commercial data marketplaces. The tool will be developed in Python. The student will also analyse the resulting information to provide insights into the data products offered across commercial data markets, answering questions such as what kind of data is being offered, how sellers price their data and at what prices, and how many data providers are using commercial data marketplaces.
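As a rough illustration of the analysis step, the sketch below aggregates already-structured product records to answer the questions listed above. The record keys (`provider`, `category`, `pricing_model`, `price_usd`) and the helper `summarize` are hypothetical names chosen for this example, and the sample records are invented placeholders, not real marketplace data.

```python
from collections import Counter
from statistics import median


def summarize(products: list[dict]) -> dict:
    """Aggregate structured product records into a few headline figures.

    The record keys used here are assumptions made for this sketch,
    not a schema defined by the project.
    """
    # Ignore products whose price could not be extracted.
    prices = [p["price_usd"] for p in products if p.get("price_usd") is not None]
    return {
        # What kind of data is being offered?
        "products_per_category": dict(Counter(p["category"] for p in products)),
        # How do sellers price their data?
        "pricing_models": dict(Counter(p["pricing_model"] for p in products)),
        # How many distinct providers are present?
        "n_providers": len({p["provider"] for p in products}),
        # At what prices?
        "median_price_usd": median(prices) if prices else None,
    }


# Invented placeholder records, for illustration only.
sample = [
    {"provider": "A", "category": "geospatial", "pricing_model": "subscription", "price_usd": 500.0},
    {"provider": "A", "category": "financial", "pricing_model": "one-off", "price_usd": 1200.0},
    {"provider": "B", "category": "geospatial", "pricing_model": "subscription", "price_usd": None},
]

summary = summarize(sample)
print(summary)
```

In practice a dataframe library would likely be more convenient for larger crawls; the standard library is used here only to keep the sketch self-contained.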
Methodology
This research will involve the design and development of an information retrieval tool to structure information about data products based on their descriptions [6]. We will use prompt engineering to refine the queries sent to LLMs, which will be fed with data product descriptions, in order to structure information on the key features that buyers need to know when purchasing data.
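A minimal sketch of what such a prompt-and-parse step could look like is shown below. It is an assumption-laden illustration, not the project's actual design: the field names are taken from the example features mentioned in the Background, the prompt wording is hypothetical, and `fake_llm` is a stand-in for whichever model client is eventually used.

```python
import json
from typing import Callable

# Hypothetical target fields, borrowed from the example features in the Background.
FIELDS = ("update_frequency", "delivery_methods", "data_volume")

# Hypothetical prompt wording; refining this text is the prompt-engineering task.
PROMPT_TEMPLATE = (
    "Extract the following features from the data product description below.\n"
    "Return a JSON object with exactly these keys: {keys}.\n"
    "Use null for any feature the description does not state.\n\n"
    "Description:\n{description}\n"
)


def build_prompt(description: str) -> str:
    """Fill the template with the field list and one product description."""
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), description=description.strip())


def parse_response(raw: str) -> dict:
    """Keep only the expected keys; missing or malformed output yields None values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in FIELDS}


def structure_description(description: str, llm: Callable[[str], str]) -> dict:
    """Send one description through the model callable and normalise the answer."""
    return parse_response(llm(build_prompt(description)))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a canned answer with one unexpected key.
    return json.dumps(
        {"update_frequency": "daily", "delivery_methods": ["API", "S3"], "extra": "ignored"}
    )


record = structure_description("Daily foot-traffic data, delivered via API or S3.", fake_llm)
print(record)
```

Passing the model as a plain callable keeps the extraction logic independent of any particular LLM provider, which fits the modular tool described in the expected results.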
Expected results
This Master's Thesis is expected to produce a modular tool to structure information about data products in data marketplaces, and to provide empirical evidence and insights into the current state of data markets. Optionally, the student will participate in writing a research paper to disseminate the results of the project.
[1] S. Andrés Azcoitia and N. Laoutaris, A Survey of Data Marketplaces and Their Business Models. ACM SIGMOD Record, 51(3), (Sep 2022), ACM, New York, NY, USA.
[2] N. Henke, J. Bughin, M. Chui, J. Manyika, T. Saleh, B. Wiseman, and G. Sethupathy. The Age of Analytics: Competing in a data-driven world. McKinsey Global Institute. Dec. 2016.
[3] G. Micheletti, N. Raczko, C. Moise, D. Osimo, and G. Cattaneo. European DATA Market Study 2021–2023. IDC & The Lisbon Council. May 2023.
[4] PWC Consulting. Sizing the prize: What’s the real value of AI for your business and how can you capitalise? 2017.
[5] J. Bughin, J. Seong, J. Manyika, M. Chui, and R. Joshi. Notes from the AI frontier: Modeling the impact of AI on the world economy. McKinsey Global Institute. 2018
[6] S. Andrés Azcoitia, C. Iordanou and N. Laoutaris, «Understanding the Price of Data in Commercial Data Marketplaces, 2023 IEEE 39th Internatio