Data Engineering in the Age of Large Language Models: Transforming Data Access, Curation, and Enterprise Interpretation
Abstract
For many years, the guiding principle in enterprise data management has been "garbage-in, garbage-out": the quality of downstream reporting and analyses can only be as good as the quality of the data that enter the system. These words remain true as organizations struggle to gain insight from their enterprise data. Yet 2024 is shaping up to be the year of "garbage-in, garbage-read" for enterprise data interpretation. Large language models (LLMs), such as ChatGPT, have demonstrated an unprecedented capability to convert unstructured text into human-like natural language, answering questions regardless of the source of the input text and summarizing content as part of higher-level reasoning. The impact of LLMs extends well beyond natural language processing alone; they also affect how organizations curate and access information. This article summarizes research on AI techniques that help users get the right data in the right format; automate the evaluation and curation of data; and, finally, apply natural language processing directly to the transformed data to provide enterprise intelligence.