InfClean: Effective Inference of Cleaning Programs from Data Annotations

Project coordinator: Paolo Papotti

Abstract

This project addresses a pressing need in data science applications: besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as "data cleaning". In this process, data engineers collaborate with domain experts to collect specifications, such as business rules on salaries, physical constraints for molecules, or representative training data. Specifications are then encoded in cleaning programs to be executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today's data, is conducted with a best-effort approach, which does not provide any formal guarantee on the ultimate quality of the data. The goal of InfClean is to rethink the data cleaning field from its assumptions with an inclusive formal framework that radically reduces the human effort in cleaning data. This will be achieved in three steps:

by laying the theoretical foundations of synthesizing specifications directly with the domain experts;
by designing and implementing new automated techniques that use external information to identify and repair data errors;
by modeling the interactive cleaning process with a principled optimization framework that guarantees quality requirements.

The project will lay a solid foundation for data cleaning, enabling a formal framework for specification synthesis, algorithms for increased automation, and a principled optimizer with quality performance guarantees for the user interaction. It will also broadly enable accelerated information discovery, as well as economic benefits of early, well-informed, trustworthy decisions. To provide the right context for evaluating these new techniques and highlight the impact of the project in different fields, InfClean plans to address its objectives by using real case studies from different domains, including health and biodiversity data.

Publications

R. Cappuzzo, P. Papotti, S. Thirumuruganathan.
Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks.
SIGMOD, 2020. (.pdf) (code)
F. Geerts, G. Mecca, P. Papotti, D. Santoro.
Cleaning data with Llunatic.
VLDB Journal, 2019. (.pdf) (code)
P. Huynh, P. Papotti.
A Benchmark for Fact Checking Algorithms Built on Knowledge Bases.
CIKM, 2019. (.pdf) (code)
P. Huynh, P. Papotti.
Buckle: Evaluating Fact Checking Algorithms Built on Knowledge Bases.
VLDB (demo), 2019. (.pdf) (code)
N. Ahmadi, P. Huynh, V. Meduri, P. Papotti, S. Ortona.
Mining Expressive Rules in Knowledge Graphs.
Journal of Data and Information Quality (JDIQ), 2020. (.pdf) (code)
N. Ahmadi, J. Lee, P. Papotti, M. Saeed.
Explainable Fact Checking with Probabilistic Answer Set Programming.
Conference for Truth and Trust Online (TTO), 2019. (.pdf) (code)

Project code: ANR-18-CE23-0019

Paolo Papotti
Professor at the
Data Science Department
EURECOM
Campus SophiaTech
450 route des Chappes
06410 Biot, France

Tel: +33 (0)4 - 9300 8147
Room 423
papotti at MyInstitutionName .fr
