Phil Meredith

Apr 10 · 3 min read

Gen-AI Runs Into a Data Quality Roadblock

Updated: May 8

April 2024

Large Language Models (LLMs) and Generative AI are all the rage right now, but they will only work for organizations that have a solid grasp of the quality of their data. Investing in this technology without first addressing data quality issues can end in disaster.

Think of your Gen-AI initiatives as similar to the autopilot features found in today's modern aircraft. Pilots trust this technology because: A) they know how it works and B) they trust the information being fed into it. In situations where this trust is broken, bad things tend to happen.

As soon as your end users start to get poor results from your Gen-AI solution, they will stop using it. Worse, if they cannot tell poor results from good ones, they may make bad decisions based on the information provided to them. This runs counter to why Gen-AI is being used in the first place.

Trust is the Key to Gen-AI Success

If Gen-AI is allowed to drive certain business processes, it can only mean that it has been thoroughly and rigorously tested. When results do not match expectations, a disconnect exists somewhere in the data that fed the model. Finding this disconnect can be very difficult, and because of this, organizations will face many challenges when developing this type of solution.

Challenges organizations will face when trying to develop Gen-AI solutions:

  • Where is the data? Often a large chunk of an organization's data and its data-decisioning process is tucked away in spreadsheets. Spreadsheets mix text, logic, and math in an ad-hoc fashion that is extremely difficult to feed into an LLM. Context here is very important, and the organization runs the risk of creating a tremendous amount of noise if context is not examined very carefully.

  • How connected is the data? Merger and acquisition activity brings data from different organizations together. It is already difficult enough for one organization to wrangle its own data; imagine having to do this for two or more completely disconnected entities. Any machine learning model will struggle to piece this data together when the unique identifiers or descriptors used vary.

  • Where are the blind spots? Does the organization have a true understanding of the data assets available and their value in building an accurate LLM? Do the modelers have true transparency into the source of the data and what happened to it before it was loaded into the model?

  • What about the context? We simply cannot dump data into an LLM without controlling the boundaries of this information. Not fully understanding context (the who, what, when, where, and so on) will make it impossible to certify the results.
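The context challenge above can be made concrete with a minimal sketch. The class and field names below are illustrative assumptions, not part of any particular product: the idea is simply that a raw value is only certifiable for model input when its who/what/when/where metadata travels with it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContextualRecord:
    """A raw data point wrapped with the who/what/when/where context
    needed before it can be certified for use in an LLM pipeline."""
    value: str   # the raw data point, e.g. a figure from a spreadsheet
    who: str     # owning team or system of record
    what: str    # business meaning of the value
    where: str   # source system or file it came from
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_certifiable(self) -> bool:
        # Any missing piece of context disqualifies the record.
        return all([self.value, self.who, self.what, self.where])

# A number lifted from a spreadsheet with no surrounding context...
orphan = ContextualRecord(value="42000", who="", what="", where="Q3.xlsx")
# ...versus the same number with its context intact.
good = ContextualRecord(value="42000", who="Finance",
                        what="Q3 revenue, USD", where="ERP export")

print(orphan.is_certifiable())  # False
print(good.is_certifiable())    # True
```

Filtering inputs through a gate like this is one way to keep uncontextualized spreadsheet noise out of the model in the first place.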

What Organizations Should Do Today

The promise of Gen-AI can be game-changing but organizations stand to waste precious resources if they do not first emphasize data quality improvement. Here are some steps organizations need to take right away:

  • Implement processes that provide feedback mechanisms. Human feedback needs to be captured early and often. Integrating workflow tooling and using it to capture business-process metrics will provide much-needed insight and assist in the development of the LLM. Implementing these mechanisms has the added benefit of bringing problem-solvers closer to the data, where they can drive change.

  • Replace the spreadsheets! Spreadsheets hold information that is vital to the business, but they store it in a manner that can drive massive unforeseen costs. With or without a roadmap for LLMs, a strategy to replace spreadsheets with more durable solutions should already be in place.

  • Implement a graph database. Graph technology excels at capturing, storing, and analyzing relationships, and those relationships come in very handy when working out context and lineage.
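A production deployment would use an actual graph database, but the lineage idea behind the last step can be sketched in a few lines of plain Python. The asset names here are hypothetical; the point is that once relationships are stored as edges, answering "what fed this model's training set?" is a simple graph traversal.

```python
from collections import defaultdict

class LineageGraph:
    """Minimal in-memory stand-in for a graph database, tracking
    which upstream sources feed each data asset."""

    def __init__(self):
        self.parents = defaultdict(set)  # asset -> its direct upstream sources

    def add_edge(self, source: str, target: str):
        self.parents[target].add(source)

    def lineage(self, asset: str) -> set:
        """Every upstream asset reachable from `asset` (its full lineage)."""
        seen, stack = set(), [asset]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

g = LineageGraph()
g.add_edge("crm_export.csv", "customer_table")   # hypothetical assets
g.add_edge("erp_export.csv", "customer_table")
g.add_edge("customer_table", "llm_training_set")

print(sorted(g.lineage("llm_training_set")))
# ['crm_export.csv', 'customer_table', 'erp_export.csv']
```

When a model's answers look wrong, a traversal like this shows exactly which sources could be responsible, which is the transparency the "blind spots" challenge calls for.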

Summary

Before organizations begin their LLM journey, they must first recognize that data quality and data availability are critically important. Any result of an LLM or Generative AI project hinges on users being able to trust it. If the organization trusts its data, it will have greater trust in the results. Any seed of doubt must be addressed first; otherwise the effort and the results will forever be questioned.