Master's Thesis (TFM) Details
Supervisor: Santiago Andrés Azcoitia, Departamento de Señales, Sistemas y Radiocomunicaciones
Start date: 1 February 2025
Requirements: Master's student enrolled in an official ETSIT degree, preferably in Telecommunication Engineering or in Signal Processing and Communications.
Applications: Send your CV and academic transcript to santiago.andres@upm.es before 7 January 2025.
Background
Spurred by the widespread adoption of AI/ML, ‘data’ is becoming a key production factor, comparable in importance to capital, land, or labour in an increasingly digital economy. Despite an ever-growing demand for third-party data in the B2B market, firms are generally reluctant to share their information. This reluctance stems from the unique characteristics of ‘data’ as an economic good: it is a freely replicable, nondepletable asset whose value is highly combinatorial and context-specific. As a result, most of these valuable assets remain unexploited in corporate silos.
There is already an ecosystem of companies that trade data over the Internet [1]. Some analysts have estimated the potential value of the data economy at $2.5 trillion globally by 2025 [2, 3], and the development of healthy data markets would be key to making the most of AI/ML, which is expected to reach a market of $15-20 trillion in 2030 [4, 5]. Recent studies identified more than 2,000 data providers offering data products in commercial data marketplaces [6]. Pricing data assets is a significant challenge for the companies that offer them, and these companies would benefit from a price reference based on what is already on offer in the market.
Objective
This Master's thesis aims to design, build, and optimize prediction models that estimate the value of a data product from already-available information about data products in the market. The models will be developed in Python using libraries such as PyTorch, TensorFlow, or Keras. The student will also carry out an explainability analysis of the resulting models to provide insights into the most relevant features driving the value of data, answering questions such as which characteristics of data are most valuable, and which kinds of data products command lower prices and why.
Methodology
This research will involve the design, development, and optimization of a DNN regressor that predicts the price of a data product from the metadata that describes it [6]. The student will use sentence transformers to capture the semantics of data product descriptions. To interpret the results, the student will apply AI interpretability techniques such as SHAP to explain individual price predictions, and feature importance techniques to understand which features of data drive its price in the market and why.
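To illustrate the attribution idea that SHAP builds on, the following minimal sketch computes exact Shapley values for a single price prediction. It does not use the SHAP library, a DNN, or sentence transformers: `toy_price_model`, the feature names (`volume`, `freshness`, `coverage`), and all numbers are hypothetical stand-ins chosen only to make the example self-contained. Exact enumeration is exponential in the number of features, so this is only practical for toy cases; SHAP relies on approximations for real models.

```python
from itertools import combinations
from math import factorial


def toy_price_model(features):
    """Hypothetical stand-in for a trained price regressor (not the thesis model)."""
    return (100.0
            + 50.0 * features["volume"]
            + 30.0 * features["freshness"]
            + 20.0 * features["volume"] * features["coverage"])


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Evaluate the model with coalition features taken from the instance.
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of `name` when added to coalition `s`.
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical data product and an all-zeros reference point.
instance = {"volume": 2.0, "freshness": 1.0, "coverage": 3.0}
baseline = {"volume": 0.0, "freshness": 0.0, "coverage": 0.0}

phi = shapley_values(toy_price_model, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term between `volume` and `coverage` is split evenly between those two features.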
Expected results
This Master's thesis is expected to produce a DNN regression model that outperforms the state of the art in estimating the price of data products in data marketplaces, and that generates explainable and reasonable predictions based on existing data [6]. Optionally, the student will participate in writing a research paper to disseminate the results of the project.
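One possible way to check whether a model improves on a reference is sketched below. The metric (mean absolute error in log10 space), the median-price baseline, and all prices are assumptions made for illustration only; the proposal does not prescribe an evaluation protocol, and the numbers are invented, not results.

```python
from math import log10
from statistics import median


def log_mae(y_true, y_pred):
    """Mean absolute error in log10 space (all prices must be positive)."""
    errors = [abs(log10(t) - log10(p)) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Invented prices, for illustration only.
train_prices = [10.0, 100.0, 1000.0]
test_prices = [10.0, 100.0, 1000.0]
model_preds = [20.0, 100.0, 500.0]  # hypothetical model output

# Naive baseline: always predict the median training price.
baseline_pred = median(train_prices)
baseline_err = log_mae(test_prices, [baseline_pred] * len(test_prices))
model_err = log_mae(test_prices, model_preds)

print(f"baseline log-MAE: {baseline_err:.3f}")
print(f"model log-MAE:    {model_err:.3f}")
```

Measuring the error in log space is a design choice that treats a factor-of-two miss the same way for a cheap product as for an expensive one; other metrics may suit the project better.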
[1] S. Andrés Azcoitia and N. Laoutaris, A Survey of Data Marketplaces and Their Business Models. ACM SIGMOD Record, 51(3), (Sep 2022), ACM, New York, NY, USA.
[2] N. Henke, J. Bughin, M. Chui, J. Manyika, T. Saleh, B. Wiseman, and G. Sethupathy. The Age of Analytics: Competing in a Data-Driven World. McKinsey Global Institute. Dec. 2016.
[3] G. Micheletti, N. Raczko, C. Moise, D. Osimo, and G. Cattaneo. European DATA Market Study 2021–2023. IDC & The Lisbon Council. May 2023.
[4] PWC Consulting. Sizing the prize: What’s the real value of AI for your business and how can you capitalise? 2017.
[5] J. Bughin, J. Seong, J. Manyika, M. Chui, and R. Joshi. Notes from the AI frontier: Modeling the impact of AI on the world economy. McKinsey Global Institute. 2018.
[6] S. Andrés Azcoitia, C. Iordanou and N. Laoutaris, «Understanding the Price of Data in Commercial Data Marketplaces