Data Management Research Center for Human-centered, Efficient, and Scalable Systems

Welcome to UtahDB, the Data Management Lab at the Kahlert School of Computing at the University of Utah. We are located in beautiful Salt Lake City. Our research focuses on designing and developing human-centered, efficient, and scalable data-management systems. Check out this overview presentation to learn more about our lab's current projects and vision. Our lab consists of an amazing group of people, and we are growing!

We are looking for a number of motivated PhD students for Fall 2026. Apply today!

Faculty

Jeff M Phillips

Algorithms for Big Data Analytics: Geometric Data Analysis, Computational Geometry, Coresets and Sketches, Handling Uncertainty, Data Mining, Databases, Machine Learning, Spatial Statistics.

El Kindi Rezig

Data preparation, data discovery, data debugging, data integration, user interfaces, information extraction, data quality, data cleaning, and database usability

Anna Fariha

Data systems usability, Data summarization, Trusted machine learning, Explainable AI, Data exploration and user interfaces, Data quality, Data cleaning, Data debugging, Responsible data management, Data fairness

Research

Democratizing data-driven systems: This project focuses on three key aspects of data system democratization: enhancing usability of data systems for non-experts and experts, providing explanation frameworks to enable understanding of system behavior, and achieving trust and fairness in machine learning.

Data structures for scalable computing: This project focuses on advancing the theory and practice of compact, dynamic, and scalable data structures to tackle the challenges of modern data analysis pipelines. We work on filters, hash tables, trees, and succinct and write-optimized data structures.

Large-scale indexing of raw genomics data: This project focuses on building scalable data processing pipelines for quickly indexing and searching through terabytes of raw genomic, transcriptomic, and metagenomic data.

Efficient parallel graph processing: This project focuses on building highly parallel data structures and algorithms for efficiently processing static, streaming, and dynamic graphs. This project further explores using hardware accelerators such as GPUs for massively parallel processing of dynamic graphs.

Persistent Data Summaries: This project builds summaries for massive data arriving over time. The summaries use little space, are efficient to build and query, and are amenable to data analysis. Moreover, they can be queried with respect to a time window for retrospective analysis.

Data Sketching: We design and implement sketch data structures, which are compressed representations of data with guaranteed trade-offs between space and query accuracy. Our group has designed sketches for quantiles, multi-dimensional data, frequent items, shape-fitting, trajectory data, and many more.
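To make the space/accuracy trade-off concrete, here is a minimal sketch of a frequent-items summary. It uses the classic Misra-Gries algorithm as an illustration, not necessarily the design used in our group's work; the function name `misra_gries` and the toy stream are assumptions made for this example. With at most k - 1 counters over a stream of n items, each estimated count undershoots the true count by at most n/k and never overshoots it.

```python
from collections import Counter


def misra_gries(stream, k):
    """Summarize item frequencies using at most k - 1 counters.

    Illustrative only: classic Misra-Gries, not a specific lab implementation.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream, chosen only for illustration.
stream = list("abacabadabacabae")
k = 4
estimates = misra_gries(stream, k)
exact = Counter(stream)

# Check the guarantee: exact - n/k <= estimate <= exact for every item.
bound = len(stream) / k
for item, true_count in exact.items():
    est = estimates.get(item, 0)
    assert true_count - bound <= est <= true_count
```

Increasing k uses more space and tightens the error bound n/k, which is the kind of guaranteed trade-off described above.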

Spatial Exposome Data: CEDaR is an open exposomic data resource that can be used by researchers across disciplines to increase understanding of the environment and health. Sources of environmental exposure data are sparse, inconsistent, and rarely linked to individuals, making research complicated and difficult. Through CEDaR, we provide a single platform containing cleaned and standardized environmental exposure measures that can be used independently or to create holistic measures of the exposome.

Data Systems on Modern Hardware: This project exploits modern compute hardware such as GPUs and FPGAs, and storage hardware such as PMEM and HBM, to accelerate data systems. Our group designs new algorithmic techniques to model the performance of new hardware and then analyzes data systems in light of these models to accelerate them.

Publications

2025
  • ICDE Ankita Sharma, Jaykumar Tandel, Xuanmao Li, Lanjun Wang, Anna Fariha, Liang Zhang, Syed Arsalan Ahmed Naqvi, Irbaz Bin Riaz, Lei Cao, Jia Zou:
    [Demo] DataMorpher: Automatic Data Transformation Using LLM-based Zero-Shot Code Generation.
2024
  • IEEE BigData Arman Azad, El Kindi Rezig:
    [Workshop] Can Causal DAGs Generate Data-based Explanations of Black-box Models?
  • VLDB Whanhee Cho, Anna Fariha:
    [Demo] UTOPIA: Automatic Pivot Table Assistant.
  • VLDB Zifan Liu, Shaleen Deep, Anna Fariha, Fotis Psallidas, Ashish Tiwari, Avrilia Floratou:
    Rapidash: Efficient Detection of Constraint Violations.
  • WWW Tao Yang, Cuize Han, Chen Luo, Parth Gupta, Jeff M. Phillips, and Qingyao Ai:
    Mitigating Exploitation Bias in Learning to Rank with an Uncertainty-aware Empirical Bayes Approach.
  • SIGCSE Anjali Singh, Anna Fariha, Christopher Brooks, Gustavo Soares, Austin Henley, Ashish Tiwari, Chethan M, Heeryung Choi, Sumit Gulwani:
    Investigating Student Mistakes in Introductory Data Science Programming.
  • EMNLP Soohyeong Kim, Whanhee Cho, Minji Kim, Yong Suk Choi:
    Bidirectional Masked Self-attention and N-gram Span Attention for Constituency Parsing.
2023
  • IEEE BigData Chin-Chia Michael Yeh, Yan Zheng, Menghai Pan, Huiyuan Chen, Zhongfang Zhuang, Junpeng Wang, Liang Wang, Wei Zhang, Jeff M. Phillips, and Eamonn Keogh:
    Sketching Multidimensional Time Series for Fast Discord Mining.
  • SIGMOD Bhavya Chopra, Anna Fariha, Sumit Gulwani, Austin Z. Henley, Daniel Perelman, Mohammad Raza, Sherry Shi, Danny Simmons, Ashish Tiwari:
    [Demo] CoWrangler: Recommender System for Data-Wrangling Scripts.
  • Knowledge and Information Systems Hasan Pourmahmood Aghababa, Jeff M. Phillips:
    An experimental study on classifying spatial trajectories.
  • SIGCSE Rowan Hart, Brian Hays, Connor McMillin, El Kindi Rezig, Gustavo Rodriguez-Rivera, Jeffrey A. Turkstra:
    Eastwood-Tidy: C Linting for Automated Code Style Assessment in Programming Courses.
2022
  • SIGMOD Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, Divesh Srivastava:
    DataPrism: Exposing Disconnect between Data and Systems.
  • SIGMOD Maliha Tashfia Islam, Anna Fariha, Alexandra Meliou, Babak Salimi:
    Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification.
  • CIDR El Kindi Rezig, Anshul Bhandari, Anna Fariha, Benjamin Price, Allan Vanterpool, Andrew Bowne, Lindsey McEvoy, Vijay Gadepally:
    [Abstract] Examples are All You Need: Iterative Data Discovery by Example in Data Lakes.
  • Poly/DMAH@VLDB Andrew Bowne, Lindsey McEvoy, Dhruv Gupta, Cameron Brown, Vijay Gadepally, El Kindi Rezig:
    [Workshop] A Survey of Data Challenges Across a Modernizing Bureaucracy: A New Perspective on Examining Old Government Problems.
  • TKDE Zhao Chang, Dong Xie, Feifei Li, Jeff M. Phillips, Rajeev Balasubramonian:
    Efficient Oblivious Query Processing for Range and kNN Queries.
2021
  • SIGMOD Dong Xie, Jeff M. Phillips, Michael Matheny, and Feifei Li:
    Spatial Independent Range Sampling.
  • SIGMOD Benwei Shi, Zhuoyue Zhao, Yanqing Peng, Feifei Li, and Jeff M. Phillips:
    At-the-time and Back-in-time Persistent Sketches.
  • SIGMOD Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, Alexandra Meliou:
    Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems.
  • VLDB El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood, Michael Stonebraker:
    Horizon: Scalable Dependency-driven Data Cleaning.
  • SIGMOD Anna Fariha, Ashish Tiwari, Alexandra Meliou, Arjun Radhakrishna, Sumit Gulwani:
    [Demo] CoCo: Interactive Exploration of Conformance Constraints for Data Understanding and Data Cleaning.
  • VLDB El Kindi Rezig, Anshul Bhandari, Anna Fariha, Benjamin Price, Allan Vanterpool, Vijay Gadepally, Michael Stonebraker:
    [Demo] DICE: Data Discovery by Example.
  • VLDBJ Debjyoti Paul, Jeff M. Phillips, and Feifei Li:
    Semantic Embedding for Regions of Interest.
  • CIDR El Kindi Rezig:
    [Abstract] Data Cleaning in the Era of Data Science: Challenges and Opportunities.
  • AIStats Benwei Shi and Jeff M. Phillips:
    A Deterministic Streaming Sketch for Ridge Regression.
2020
  • SIGMOD Anna Fariha, Suman Nath, Alexandra Meliou:
    Causality-Guided Adaptive Interventional Debugging.
  • SIGMOD Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani:
    [Demo] ExTuNe: Explaining Tuple Non-conformance.
  • VLDB Anna Fariha, Matteo Brucato, Peter J. Haas, Alexandra Meliou:
    [Demo] SuDocu: Summarizing Documents by Example.
  • VLDB El Kindi Rezig, Ashrita Brahmaroutu, Nesime Tatbul, Mourad Ouzzani, Nan Tang, Timothy G. Mattson, Samuel Madden, Michael Stonebraker:
    [Demo] Debugging Large-Scale Data Science Pipelines using Dagger.
  • CIDR El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Nan Tang, Mourad Ouzzani, Michael Stonebraker:
    Dagger: A Data (not code) Debugger.
  • TKDD Michael Matheny, Dong Xie, and Jeff M. Phillips:
    Scalable Spatial Scan Statistics for Trajectories.
  • Poly/DMAH@VLDB El Kindi Rezig, Allan Vanterpool, Vijay Gadepally, Benjamin Price, Michael J. Cafarella, Michael Stonebraker:
    [Workshop] Towards Data Discovery by Example.
2019
  • VLDB Anna Fariha, Alexandra Meliou:
    Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity.
  • HILDA@SIGMOD El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref, Michael Stonebraker:
    [Workshop] Towards an End-to-End Human-Centric Data Cleaning Framework.
  • VLDB El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, Ahmed K. Elmagarmid:
    [Demo] Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics.
  • SIGSPATIAL Jeff M. Phillips and Pingfan Tang:
    Simple Distances for Trajectories via Landmarks.
  • SIGSPATIAL Mingxuan Han, Michael Matheny, and Jeff M. Phillips:
    The Kernel Spatial Scan Statistic.
  • International Symposium on Computational Geometry Peyman Afshani and Jeff M. Phillips:
    Independent Range Sampling, Revisited Again.
2018
  • SIGSPATIAL Aria Rezaei, Jie Gao, Jeff M. Phillips, and Csaba D. Toth:
    Improved Bounds on Information Dissemination by Manhattan Random Waypoint Model.
  • SIGMOD Anna Fariha, Sheikh Muhammad Sarwar, Alexandra Meliou:
    [Demo] SQuID: Semantic Similarity-Aware Query Intent Discovery.
2017
  • VLDB Dong Xie, Feifei Li, and Jeff M. Phillips:
    Distributed Trajectory Similarity Search.
  • SIGKDD Yan Zheng and Jeff M. Phillips:
    Coresets for Kernel Regression.
2016
  • SIGKDD Mina Ghashami, Edo Liberty, and Jeff M. Phillips:
    Efficient Frequent Directions Algorithm for Sparse Matrices.
  • TKDE Mina Ghashami, Amey Desai, and Jeff M. Phillips:
    Improved Practical Matrix Sketching with Guarantees.
  • ICDE El Kindi Rezig, Eduard C. Dragut, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref:
    [Demo] ORLF: A flexible framework for online record linkage and fusion.
  • SIGSPATIAL Michael Matheny, Raghvendra Singh, Kaiqiang Wang, Liang Zhang and Jeff M. Phillips:
    Scalable Spatial Scan Statistics through Sampling.
  • SIAM Journal on Computing Mina Ghashami, Edo Liberty, Jeff M. Phillips and David P. Woodruff:
    Frequent Directions: Simple and Deterministic Matrix Sketching.
2015
  • SIGKDD Yan Zheng and Jeff M. Phillips:
    L∞ Error and Bandwidth Selection for Kernel Density Estimates of Large Data.
  • VLDB Ahmed R. Mahmood, Ahmed M. Aly, Thamir Qadah, El Kindi Rezig, Anas Daghistani, Amgad Madkour, Ahmed S. Abdelhamid, Mohamed S. Hassan, Walid G. Aref, Saleh M. Basalamah:
    [Demo] Tornado: A Distributed Spatio-Textual Stream Processing System.
  • ICDE El Kindi Rezig, Eduard C. Dragut, Mourad Ouzzani, Ahmed K. Elmagarmid:
    Query-time record linkage and fusion over Web databases.
2014
  • VLDB Mina Ghashami, Jeff M. Phillips, and Feifei Li:
    Continuous Matrix Approximation on Distributed Data.
2013
  • SIGMOD Yan Zheng, Jeffrey Jestes, Jeff M. Phillips, Feifei Li:
    Quality and Efficiency for Kernel Density Estimates in Large Data.
  • PODS Pankaj K. Agarwal, Boris Aronov, Sariel Har-Peled, Jeff M. Phillips, Ke Yi, and Wuzhou Zhang:
    Nearest Neighbor Searching Under Uncertainty II.
  • ACM Symposium on Computational Geometry Amirali Abdullah, Samira Daruki, and Jeff M. Phillips:
    Range Counting Coresets for Uncertain Data.
2012
  • VLDB Jeffrey Jestes, Jeff M. Phillips, Feifei Li:
    Ranking Large Temporal Data.
  • TODS Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi:
    Mergeable Summaries.
  • ICDE Mingwang Tang, Feifei Li, Jeff M. Phillips, Jeffrey Jestes:
    Efficient Threshold Monitoring for Distributed Probabilistic Data.
2011
  • ICDT Peyman Afshani, Pankaj K. Agarwal, Lars Arge, Kasper Green Larsen, and Jeff M. Phillips:
    (Approximate) Uncertain Skylines.
  • SIGMOD Hazem Elmeleegy, Jaewoo Lee, El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid:
    [Demo] U-MAP: a system for usage-based schema matching and mapping.
2010
  • SIGKDD Arvind Agarwal, Jeff M. Phillips, Suresh Venkatasubramanian:
    Universal Multi-Dimensional Scaling.
  • VLDB Badrish Chandramouli, Jeff M. Phillips, Jun Yang:
    Value-Based Notification Conditions in Large-Scale Publish/Subscribe Systems.
  • SIGKDD Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, Suresh Venkatasubramanian, Zhengyuan Zhu:
    Spatial Scan Statistics: Approximations and Performance Study.

Activities


NWDS 2024, Seattle, USA

SIGMOD 2023, Seattle, USA

Data Science Day, 2019

Videos

[2023] Anna Fariha: Research in Computer Science with a focus on Databases/Data-management research
[2023] Anna Fariha: Blame the Data, not the System: How Data Constraints can Help Explain Causes of Data-system Malfunction | NWDS 2023
[2022] El Kindi Rezig: Examples are All You Need: Iterative Data Discovery by Example in Data Lakes | CIDR 2022
[2021] Anna Fariha: Enhancing Usability and Explainability of Data Systems | University of Pennsylvania
[2021] Anna Fariha: Conformance Constraints Discovery: Measuring Trust in Data-Driven Systems | SIGMOD 2021
[2020] Anna Fariha: SuDocu: Summarizing Documents by Example (best demo runner-up) | VLDB 2020

© 2025 University of Utah. All Rights Reserved
