LLMs Hit a Data Quality Roadblock
Large Language Models (LLMs) and Generative AI are all the rage right now, but they will only work for organizations that have a solid grasp of the quality of their data. Investing in LLM technology without first addressing data quality issues can end in disaster.
Some suggest that LLMs can themselves be used to fix data quality issues. On the surface this sounds dangerous, since organizations would be entrusting machines to actually edit data. Peel the onion back, however, and it is not a bad idea if implemented very carefully; organizations often rely on blanket scripts to do much the same thing. The question then becomes: is an LLM used in this manner overkill?
Regardless of how the technology is put to work, the biggest question comes down to trust. Consider the LLM as similar to the auto-pilot feature found in today's aircraft. Pilots trust the auto-pilot because (a) they know how it works and (b) they trust the information being fed into it. In situations where this trust is broken, bad things tend to happen. The same is true for any LLM initiative.
Trust is the Key to LLM Success
An LLM should only be allowed to drive business processes once it has been thoroughly and rigorously tested. When results do not match expectations, the likely cause is a disconnect in the data that fed the model, and finding that disconnect can be very difficult. Because of this, organizations will face several challenges when developing an LLM:
Where is the data? A large chunk of an organization's data, and of its data-driven decisioning, is often tucked away in spreadsheets. Spreadsheets mix text, logic, and math in an ad-hoc fashion that is extremely difficult to feed into an LLM. Context is critical here, and the organization risks creating a tremendous amount of noise if that context is not examined carefully.
How connected is the data? Merger and acquisition activity brings data from different organizations together. It is already difficult enough for one organization to wrangle its own data; imagine having to do so for two or more completely disconnected entities. Any machine learning model will struggle to piece this data together when the unique identifiers or descriptors vary between sources.
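To make the identifier problem concrete, here is a minimal sketch of fuzzy record matching between two merged entities. The company names, similarity function, and threshold are all illustrative assumptions, not a prescribed approach; real entity resolution involves far more than string similarity.

```python
# Sketch: linking customer records from two merged entities whose
# identifiers differ. Names and the threshold are illustrative only.
from difflib import SequenceMatcher

org_a = ["Acme Corp", "Globex Incorporated", "Initech LLC"]
org_b = ["ACME Corporation", "Globex Inc.", "Umbrella Group"]

def similarity(a: str, b: str) -> float:
    # Normalize case before comparing character sequences.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.6  # would be tuned against known-good pairs in practice

matches = []
for a in org_a:
    best = max(org_b, key=lambda b: similarity(a, b))
    if similarity(a, best) >= THRESHOLD:
        matches.append((a, best))

print(matches)
```

Note that "Initech LLC" finds no counterpart: a model fed both datasets blind would have no way of knowing whether that gap is real or an artifact of naming.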
Where are the blind spots? Does the organization have a true understanding of the data assets available and their value in building an accurate LLM? Do the modelers have real transparency into the source of the data and what happened to it before it was loaded into the model?
What about the context? Data simply cannot be dumped into an LLM without controlling the boundaries of the information. Without a full understanding of context (the who, what, when, and where), it will be impossible to certify the results.
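One way to enforce that "who, what, when, where" boundary is to refuse ingestion of any record that arrives without full context. The sketch below is a minimal illustration under assumed field names, not a specific product's schema:

```python
# Sketch: attaching who/what/when/where context to a record before it
# is considered for model ingestion. Field names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ContextualRecord:
    payload: dict   # the raw data itself
    who: str        # originating system or owner
    what: str       # business meaning of the payload
    where: str      # source location (table, file, feed)
    when: str       # capture timestamp, ISO 8601

    def is_certifiable(self) -> bool:
        # A record with any missing context cannot be certified.
        return all([self.who, self.what, self.where, self.when])

rec = ContextualRecord(
    payload={"amount": 125.50},
    who="billing-service",
    what="invoice line item",
    where="erp.invoices",
    when=datetime.now(timezone.utc).isoformat(),
)
print(rec.is_certifiable())
```

A record missing any one of the four fields would fail the gate, keeping uncontextualized data out of the training corpus.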
What Organizations Should Do Today
The promise of LLMs can be game-changing, but organizations stand to waste precious resources if they do not first emphasize data quality improvement. Here are some steps organizations considering developing or using LLMs should take:
Implement processes that provide feedback mechanisms. Human feedback needs to be captured early and often. Integrating workflow and using it to capture business-process metrics will provide much-needed insight and assist in the development of the LLM. These mechanisms have the added benefit of bringing problem-solvers closer to the data and empowering them to bring about change.
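A feedback mechanism can start very small. This sketch logs a reviewer's verdict alongside the exchange it judges; the structure and field names are assumptions for illustration, not any particular platform's API:

```python
# Sketch: a minimal feedback log capturing human review of model output.
from datetime import datetime, timezone

feedback_log: list[dict] = []

def record_feedback(prompt: str, response: str, reviewer: str,
                    accepted: bool, note: str = "") -> dict:
    """Capture a reviewer's verdict alongside the exchange it judges."""
    entry = {
        "prompt": prompt,
        "response": response,
        "reviewer": reviewer,
        "accepted": accepted,
        "note": note,
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
    }
    feedback_log.append(entry)
    return entry

record_feedback("Summarize Q3 revenue", "Revenue rose 4%...",
                reviewer="analyst-7", accepted=False,
                note="Figure conflicts with the ledger")
print(len(feedback_log))
```

Even a log this simple, reviewed regularly, surfaces the disconnects between model output and business reality that are otherwise so hard to find.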
Replace the spreadsheets! Spreadsheets contain information that is vital to the business, but they store it in a manner that can drive massive unforeseen costs. With or without a roadmap for LLMs, a strategy to replace spreadsheets with more durable solutions should already be in place.
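A "more durable solution" can be as modest as a relational store. Here is a sketch of moving spreadsheet-style rows into SQLite, where schema and queries replace ad-hoc cell formulas; the table layout and figures are invented for illustration:

```python
# Sketch: moving spreadsheet-style data into a durable, queryable store.
# Uses an in-memory SQLite database; schema and numbers are illustrative.
import csv
import io
import sqlite3

# Stand-in for an exported spreadsheet tab.
sheet = io.StringIO("region,quarter,revenue\nEMEA,Q1,1200\nAPAC,Q1,950\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT, quarter TEXT, revenue REAL)")
rows = [(r["region"], r["quarter"], float(r["revenue"]))
        for r in csv.DictReader(sheet)]
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
conn.commit()

total = conn.execute("SELECT SUM(revenue) FROM revenue").fetchone()[0]
print(total)
```

Unlike a spreadsheet, the schema makes the structure explicit, so downstream consumers (including an LLM pipeline) receive typed, unambiguous rows rather than a mix of text, logic, and math.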
Implement a graph database. Graph technology excels at capturing, storing, and analyzing relationships, and those relationships come in very handy when establishing context and lineage.
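To illustrate why lineage is a graph problem, here is a minimal in-memory sketch of tracing an asset back to its upstream sources. A real graph database would store and query these edges natively; the asset names and adjacency structure below are invented for the example:

```python
# Sketch: data lineage as a directed graph, traced with a plain
# adjacency map. Asset names are illustrative.
parents = {
    "quarterly_report": ["revenue_table"],
    "revenue_table": ["crm_export", "erp_export"],
    "crm_export": [],
    "erp_export": [],
}

def lineage(node: str) -> set[str]:
    """Return every upstream source feeding a given asset."""
    upstream: set[str] = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in upstream:
                upstream.add(parent)
                stack.append(parent)
    return upstream

print(sorted(lineage("quarterly_report")))
```

Being able to answer "where did this number come from?" in one traversal is exactly the transparency the blind-spot and context questions above demand.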
Before organizations begin their LLM journey, they must first recognize that data quality and data availability are critically important. The outcome of any LLM or Generative AI project hinges on users being able to trust the results. If the organization trusts its data, it will have greater trust in those results. Any seed of doubt requires the organization to correct the problem first; otherwise the effort, and its results, will forever be questioned.