research-article Open Access
- Authors:
- Andra Ionescu TU Delft, Delft, Netherlands
- Zeger Mouw TU Delft, Delft, Netherlands
- Efthimia Aivaloglou TU Delft, Delft, Netherlands
- Asterios Katsifodimos TU Delft, Delft, Netherlands
HILDA 24: Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics
June 2024
Pages 1–5
https://doi.org/10.1145/3665939.3665961
Key Insights from a Feature Discovery User Study
ABSTRACT
Multiple works in data management research focus on automating the processes of data augmentation and feature discovery to save users from having to perform these tasks manually. Yet, this automation often leads to a disconnect with the users, as it fails to consider the specific needs and preferences of the actual end-users of data management systems for machine learning. To explore this issue further, we conducted 19 semi-structured, think-aloud use-case studies based on a scenario in which data specialists were tasked with augmenting a base table with additional features to train a machine learning model. In this paper, we share key insights into the practices of feature discovery on tabular data performed by real-world data specialists derived from our user study. Our research uncovered differences between the user assumptions reported in the literature and the actual practices, as well as some areas where literature and real-world practices align.
Published in
HILDA 24: Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics
June 2024
91 pages
ISBN:9798400706936
DOI:10.1145/3665939
- Program Chairs:
- Jean-Daniel Fekete, Inria & Université Paris-Saclay
- Behrooz Omidvar-Tehrani, AWS AI Labs
- Kexin Rong, Georgia Institute of Technology
- Roee Shraga, Worcester Polytechnic Institute
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2024
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 28 of 56 submissions, 50%