Heading into 2024, the hype generated by the launch of ChatGPT has not deflated.
If you’ve ventured into Generative AI and its Large Language Models (LLMs), you no doubt understand how important data is to an LLM.
Ever heard the phrase “garbage in, garbage out”? If you fuel your body with a steady diet of fast food, you are highly unlikely to win a marathon or be considered healthy by your dietitian.
An alternative could be a diet rich in fruits. Fruits are fantastic; they provide a wealth of vitamins your body needs. However, once again, your body won’t be deemed healthy if you exclusively consume fruits.
You can probably see where I’m going with this: LLM performance does not depend solely on the quality of the data the model is fed. It also requires diverse data!
In this blog post, we will explore the types of data that will help you get the most out of your LLM.
LLM Performance Relies on the Data and… its Metadata
Metadata can be succinctly described as additional data about the data itself. It can assume various shapes and forms, but in recent years, the rapidly increasing volume of data collected by organizations has led to a corresponding surge in metadata.
Metadata serves various purposes, including data storage and management. It aids in organizing and securing the data, among other functions.
Companies such as Alation and Atlan, as well as Databricks with its Unity Catalog, help organizations meet their data cataloging requirements.
Let’s take an example of a customer table, presented as follows:
SELECT * FROM Customer
Technical metadata describes how data is stored, how it is structured, and how datasets relate to one another. Examples include schema definitions, column specifications (type and size) within a table, table joins, and primary and foreign key identifications.
It is used by systems to comprehend the data and assists technical users in organizing and discovering data.
The Customer table is defined as follows:
|[PK] Int customer_id
This information represents technical metadata for the table, and additional details can be found in system tables:
SELECT table_name, table_type FROM information_schema.tables
WHERE table_name = 'Customer'
Technical Metadata for LLM Performance
When organizations discuss feeding metadata into LLMs, technical metadata is usually what they consider first.
In a modern, composable architecture, most of this information can be found within a data warehouse, such as Snowflake or Databricks. In packaged deployments, the Customer Data Platform (CDP) itself holds crucial technical metadata.
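As a minimal sketch of how technical metadata might be turned into LLM context, the Python snippet below reads column names, types, and the primary key for a table and renders them as plain text. It uses an in-memory SQLite database as a stand-in for a real warehouse, and the `describe_table` helper is hypothetical. The `customer_id` and `region_name` columns are taken from the examples in this post; the column types are assumptions.

```python
import sqlite3


def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """Render a table's technical metadata as plain text for a prompt."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    if not rows:
        raise ValueError(f"unknown table: {table}")
    lines = [f"Table {table}:"]
    for _cid, name, col_type, _notnull, _default, pk in rows:
        suffix = " (primary key)" if pk else ""
        lines.append(f"- {name}: {col_type}{suffix}")
    return "\n".join(lines)


# SQLite stands in for the warehouse here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer ("
    "customer_id INTEGER PRIMARY KEY, "
    "region_name TEXT)"
)

schema_text = describe_table(conn, "Customer")
print(schema_text)
```

In a real deployment the same text would more likely be built from the warehouse's own information schema, but the idea is the same: the model sees the structure of the table rather than guessing it.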
Governance metadata is employed for tasks like granting permissions for accessing or using data in specific contexts or by specific users. It may also include data ownership details, data sensitivity (e.g., is it PII?), compliance requirements, and other controls.
This metadata includes information related to permissions, locations, file sizes, and ownership. It plays a vital role in data governance, compliance, controls, and data management.
The system table entry for the Customer table already contains some governance metadata, such as the table owner:
SELECT table_name, table_owner FROM information_schema.tables
WHERE table_name = 'Customer'
Governance Metadata for LLM Performance
In the context of LLMs, governance metadata plays a role in determining the sensitivity of requested information, granting permissions to answer queries, and deciding on the utilization or exclusion of specific data to address user requests.
Again, in a composable architecture where data is centralized within a data warehouse, it’s anticipated that most of the governance metadata will be managed within that system. In a packaged deployment, the CDP will contain governance metadata that can be utilized for LLMs.
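To make the role of governance metadata concrete, here is a small Python sketch of a filter that decides which columns may be shared with an LLM for a given user role. Everything in it is an assumption made for illustration: the `ColumnPolicy` structure, the role names, the `email` column, and the rule that PII is never placed in a prompt. A real system would read these policies from the warehouse or the CDP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnPolicy:
    """Hypothetical governance record for one column."""
    column: str
    is_pii: bool
    allowed_roles: frozenset


def columns_for_prompt(policies, role):
    """Return the columns that may be exposed to the LLM for this role.

    Assumed rule: PII columns are always excluded, and any other column
    is included only if the role is explicitly allowed to read it.
    """
    return [
        p.column
        for p in policies
        if not p.is_pii and role in p.allowed_roles
    ]


policies = [
    ColumnPolicy("customer_id", False, frozenset({"analyst"})),
    ColumnPolicy("region_name", False, frozenset({"analyst", "marketer"})),
    ColumnPolicy("email", True, frozenset({"analyst"})),  # hypothetical PII column
]

marketer_columns = columns_for_prompt(policies, "marketer")
analyst_columns = columns_for_prompt(policies, "analyst")
print(marketer_columns)
print(analyst_columns)
```

Applying the filter before any data reaches the prompt keeps the decision in the governance layer instead of relying on the model to withhold sensitive values.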
Descriptive metadata serves various purposes, including data classification, the addition of descriptions, or the incorporation of other business-related meanings. Examples of descriptive metadata may encompass the description of a profile attribute, the definition and description of an audience segment, or the description of a marketing campaign.
For instance, an attribute like “Customer living in Europe” could be defined within a business application. The elements associated with it constitute its descriptive metadata:
Name: Customer living in Europe
Description: Value set to true for customers living in Europe
Definition: CASE WHEN region_name = 'EMEA' THEN true ELSE false END
Tags: Location, Region
Descriptive Metadata for LLM Performance
This metadata is usually stored in the business applications where these objects are defined. While the data warehouse can be expected to host the technical and governance metadata, most of the business context and descriptive metadata lives in the applications that business users work in.
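The attribute example above can be sketched as a small Python structure that renders descriptive metadata into text an LLM can use as grounding context. The `AttributeDefinition` class, its field names, and the `to_context` format are hypothetical; the values come from the “Customer living in Europe” example.

```python
from dataclasses import dataclass


@dataclass
class AttributeDefinition:
    """Hypothetical container for an attribute's descriptive metadata."""
    name: str
    description: str
    expression: str
    tags: tuple = ()

    def to_context(self) -> str:
        """Render the metadata as plain text for an LLM prompt."""
        lines = [
            f"Attribute: {self.name}",
            f"Description: {self.description}",
            f"Definition: {self.expression}",
        ]
        if self.tags:
            lines.append("Tags: " + ", ".join(self.tags))
        return "\n".join(lines)


europe = AttributeDefinition(
    name="Customer living in Europe",
    description="Value set to true for customers living in Europe",
    expression="CASE WHEN region_name = 'EMEA' THEN true ELSE false END",
    tags=("Location", "Region"),
)

context_text = europe.to_context()
print(context_text)
```

With this text in its context, a model asked about “European customers” has a concrete business definition to rely on instead of inferring one from column names.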
Healthy LLM Performance is Achieved With Diverse Data and Metadata
The way data is supplied to your LLMs can vary: during initial training, as embeddings, or in a retrieval-augmented generation (RAG) architecture. You can find more information in a previous blog post that discusses different approaches to LLMs for enterprises.
However, it’s evident that LLMs will experience limitations in their performance if they are not provided with a diverse range of data and metadata types. Enterprise organizations will soon come to realize that technical data and metadata alone won’t suffice, much like how fruits are an excellent source of vitamins but do not fulfill your protein requirements.
Descriptive metadata and business context, which are often directly defined within adtech and martech software, will serve as a crucial source of information for LLMs, helping to ground them and enhance their performance.
Take Action Today, Build Your Metadata For Tomorrow
While I have observed organizations invest time and effort into creating and implementing a comprehensive taxonomy for objects such as attributes, audiences, and campaigns within their marketing technology stack, it’s undeniable that many others have overlooked this aspect.
Today, failing to accurately describe an audience in a title or description field might cause confusion among (new) users, but it won’t affect the software’s performance. These fields were designed for human understanding, not for machine processing.
However, tomorrow will be different. These fields contain data that will significantly enhance the performance of your generative AI use cases.
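One way to start is a simple audit of which objects lack a usable description. The Python sketch below is only an illustration: the `missing_descriptions` helper, the dictionary layout, the sample audience names, and the three-word threshold are all assumptions, not part of any particular product.

```python
def missing_descriptions(objects, min_words=3):
    """Return the names of objects whose description is empty or too short.

    Each object is assumed to be a dict with a "name" key and an
    optional "description" key.
    """
    flagged = []
    for obj in objects:
        description = (obj.get("description") or "").strip()
        if len(description.split()) < min_words:
            flagged.append(obj["name"])
    return flagged


# Hypothetical sample of saved audiences.
audiences = [
    {
        "name": "Customer living in Europe",
        "description": "Value set to true for customers living in Europe",
    },
    {"name": "Audience 42", "description": ""},
    {"name": "Test copy", "description": None},
]

to_review = missing_descriptions(audiences)
print(to_review)
```

A word-count threshold is a crude proxy for quality, but it is enough to produce a first review list for the people who own these objects.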
ActionIQ is Here to Help, Even with Your Metadata and LLM Performance
ActionIQ stores incredible business context that customers have created over the years. We will ensure that enterprises have the option to utilize this existing knowledge to accelerate the deployment and enhance the performance of their Generative AI initiatives.
Do you have thousands of audiences and campaigns saved without definitions? Not a problem. By early 2024, we will offer solutions to expedite your efforts in completing the business context, adding and reviewing descriptions for your attributes, audiences, and more. ActionIQ will be able to generate this information for you. You will remain in control to edit with your own words.
Reach out to our ActionIQ experts to learn more about how ActionIQ can assist enterprises in achieving success with their LLMs and generative AI investments.