Project coordinator: Paolo Papotti
Abstract
This project addresses a pressing need in data science applications: besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as "data cleaning". In this process, data engineers collaborate with domain experts to collect specifications, such as business rules on salaries, physical constraints for molecules, or representative training data. Specifications are then encoded in cleaning programs that are executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today's data, is conducted with a best-effort approach, which does not provide any formal guarantee on the ultimate quality of the data. The goal of InfClean is to rethink the data cleaning field from its assumptions with an inclusive formal framework that radically reduces the human effort in cleaning data. This will be achieved in three steps:
- by laying the theoretical foundations of synthesizing specifications directly with the domain experts;
- by designing and implementing new automated techniques that use external information to identify and repair data errors;
- by modeling the interactive cleaning process with a principled optimization framework that guarantees quality requirements.
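As a rough illustration of what "encoding specifications in cleaning programs" can look like in the traditional, manual process described above, the sketch below checks two business rules on salaries against a handful of rows. It is not InfClean's method or code: the rule wording, the `MIN_MANAGER_SALARY` threshold, the `find_violations` helper, and the sample rows are all hypothetical and chosen only for illustration. It covers error detection only, not repair.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# A rule is a human-readable name plus a predicate that returns True when the row is valid.
Rule = Tuple[str, Callable[[Row], bool]]

# Hypothetical threshold that a domain expert might supply.
MIN_MANAGER_SALARY = 30000

# Hypothetical specifications, hand-encoded by a data engineer.
RULES: List[Rule] = [
    ("salary must be non-negative",
     lambda r: r["salary"] >= 0),
    ("manager salary must reach the minimum",
     lambda r: r["role"] != "manager" or r["salary"] >= MIN_MANAGER_SALARY),
]


def find_violations(rows: List[Row], rules: List[Rule]) -> List[Tuple[int, str]]:
    """Return (row index, rule name) for every rule that a row fails."""
    violations = []
    for index, row in enumerate(rows):
        for name, holds in rules:
            if not holds(row):
                violations.append((index, name))
    return violations


# Made-up raw data for the example.
SAMPLE_ROWS: List[Row] = [
    {"name": "a", "role": "manager", "salary": 50000},
    {"name": "b", "role": "engineer", "salary": -10},
    {"name": "c", "role": "manager", "salary": 1000},
]

if __name__ == "__main__":
    for index, rule_name in find_violations(SAMPLE_ROWS, RULES):
        print(f"row {index} violates: {rule_name}")
```

Every rule here has to be written and maintained by hand, and nothing in the program says how good the data is once the flagged rows are fixed. That is the cost and the missing guarantee the abstract refers to.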
Publications
- R. Cappuzzo, P. Papotti, S. Thirumuruganathan.
Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks.
SIGMOD, 2020. (.pdf) (code)
- F. Geerts, G. Mecca, P. Papotti, D. Santoro.
Cleaning data with Llunatic.
VLDB Journal, 2019. (.pdf) (code)
- P. Huynh, P. Papotti.
A Benchmark for Fact Checking Algorithms Built on Knowledge Bases.
CIKM, 2019. (.pdf) (code)
- P. Huynh, P. Papotti.
Buckle: Evaluating Fact Checking Algorithms Built on Knowledge Bases.
VLDB (demo), 2019. (.pdf) (code)
- N. Ahmadi, P. Huynh, V. Meduri, P. Papotti, S. Ortona.
Mining Expressive Rules in Knowledge Graphs.
Journal of Data and Information Quality (JDIQ), 2020. (.pdf) (code)
- N. Ahmadi, J. Lee, P. Papotti, M. Saeed.
Explainable Fact Checking with Probabilistic Answer Set Programming.
Conference for Truth and Trust Online (TTO), 2019. (.pdf) (code)