As generative AI moves from the hype phase toward maturity, more enterprises are actively integrating Large Language Models (LLMs) and multimodal models into their products and realizing the challenges involved in the process. This transformative wave, spearheaded by GenAI/LLMs, represents a paradigm shift, empowering machines to process, comprehend and generate information in unprecedented ways, especially in the realms of text, images and videos. Some even argue that this revolution could surpass the impact of the Internet itself.
LLMs boast remarkable generalizability, effortlessly navigating diverse domains right out of the box, commonly denoted as zero-shot inferencing. Their adaptability can be further refined through fine-tuning, and they can be steered with human-language instructions and examples, a practice known as prompt engineering.
However, transitioning from demos to real-world enterprise solutions poses challenges. The industry is grappling with hallucinations (both closed-domain and open-domain), safety concerns such as toxicity and offensive answers, and the substantial effort required to evaluate and benchmark such systems. There are additional hurdles in large enterprises, where technical problems intertwine with organizational and process-oriented issues. For instance, on AI projects, should experts be concentrated in a single group capable of solving problems across different teams, or embedded across groups? Another key consideration is determining which primary problems to tackle first.
At Uniphore, our focus is on developing state-of-the-art AI solutions for enterprises. Throughout this journey, we’ve gleaned insights applicable to enterprises across multiple industries. What we’ve learned extends across all enterprise applications, serving as valuable guidelines for navigating the intricate landscape of AI integration in diverse business settings. Here are the three biggest lessons we’ve learned from the frontline trenches of generative AI development:
Enterprises need one cohesive framework for all AI applications.
For products to be effective, enterprises need a generic framework for the enterprise AI problem that scales both horizontally (across different use cases) and vertically (adding more depth and building blocks while keeping the core intact). We propose a layered approach that breaks the problem into parts that can be developed independently yet remain cohesive:
Knowledge Layer:
In most enterprise use cases, we want the AI models to get reference/context from enterprise documents, not from the internet. To do so, we need to give the core model knowledge of, and connectors to, the enterprise's documents and conversations. The knowledge layer consists of document ingestors backed by a datastore (potentially a vector DB) and connectors for the enterprise's different media and data files.
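As a minimal sketch of this layer, assuming sentence-transformers as the embedding model and FAISS as the vector store (illustrative choices, not a prescribed stack), document ingestion and retrieval might look like this:

```python
# Minimal knowledge-layer sketch: embed enterprise document chunks and index
# them for retrieval. sentence-transformers and FAISS are illustrative choices.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ingest(documents: list[str]) -> faiss.IndexFlatIP:
    """Embed document chunks and store them in an in-memory vector index."""
    vectors = embedder.encode(documents, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product ~ cosine on normalized vectors
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def retrieve(index: faiss.IndexFlatIP, query: str, k: int = 3) -> list[int]:
    """Return the indices of the k document chunks closest to the query."""
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return ids[0].tolist()
```

In production the in-memory index would typically be replaced by a managed vector database, with the connectors feeding it incrementally as new documents and conversations arrive.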
AI Inference Layer:
This layer consists of a series of AI services made up of in-house or third-party models that are the core/brain of the system. While LLMs/large models may form the core, they are not a hammer that can be applied to all solutions. Instead, pre- and post-processing layers of several smaller ML models and guardrails need to be applied to create an enterprise-ready solution. There is also an orchestrator piece that determines which model is called and when.
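A sketch of what such an orchestration layer could look like is below; the pre-processing, routing and guardrail functions are hypothetical stand-ins for the smaller models described above, not an actual product API:

```python
# Illustrative inference-layer orchestrator: smaller pre/post-processing models
# and guardrails wrap the core LLM, and a routing step picks the model per request.
from typing import Callable

def redact_pii(text: str) -> str:
    """Pre-processing stub: a small model or rule pass would scrub PII here."""
    return text.replace("@", "[at]")  # placeholder transformation

def classify_intent(text: str) -> str:
    """Routing stub: a lightweight classifier would pick the task here."""
    return "summarize" if len(text) > 200 else "qa"

def violates_policy(text: str) -> bool:
    """Guardrail stub: a toxicity/safety model would score the output here."""
    return "forbidden" in text.lower()

def call_large_model(prompt: str) -> str:
    """Stand-in for the core LLM call (in-house or third-party)."""
    return f"[LLM output for prompt of {len(prompt)} chars]"

MODELS: dict[str, Callable[[str], str]] = {
    "summarize": lambda t: call_large_model(f"Summarize:\n{t}"),
    "qa": lambda t: call_large_model(f"Answer:\n{t}"),
}

def orchestrate(user_input: str) -> str:
    clean = redact_pii(user_input)        # pre-processing
    task = classify_intent(clean)         # orchestrator: which model to call
    output = MODELS[task](clean)          # core inference
    if violates_policy(output):           # post-processing guardrail
        return "Response withheld by safety policy."
    return output
```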
Co-pilot Apps/APIs:
The apps sit on top of the knowledge and AI inference layers. Examples at this layer are often enterprise-specific (e.g., summarization of conversations, chatbots, question-answering assistance, entity/slot detection, language translation).
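For instance, a conversation-summarization co-pilot is largely a thin composition of the two layers beneath it. The hypothetical sketch below reuses the retrieve and orchestrate functions from the sketches above; the function and prompt shapes are illustrative only:

```python
# Illustrative co-pilot app: a conversation summarizer composing the knowledge
# layer (retrieval) and the AI inference layer (orchestrated generation).
def summarize_conversation(transcript: str, index, documents: list[str]) -> str:
    context_ids = retrieve(index, transcript, k=3)             # knowledge layer
    context = "\n".join(documents[i] for i in context_ids)
    prompt = f"Context:\n{context}\n\nConversation:\n{transcript}\n\nSummary:"
    return orchestrate(prompt)                                 # AI inference layer
```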
Having common metrics to measure AI solutions is essential.
Along with a common framework, we need a common way to measure AI solutions. At Uniphore, we use the following four metrics, in order of priority, to decide go/no-go into product:
Accuracy: Measuring accuracy in generative outputs can be challenging. It involves understanding, from an ML/NLP perspective, how good the model is at the task at hand. Typical metrics include precision/recall (which need to be derived from information segments) and BLEU/ROUGE. For LLMs, there is also an increased need for human ratings-based feedback on the quality of the solution.
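As one concrete example, ROUGE can be scored against reference summaries with the rouge-score package (a common choice, not one mandated here); the reference and generated texts below are made up for illustration:

```python
# Scoring a generated summary against a human reference with ROUGE.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The customer called to dispute a duplicate charge and was refunded."
generated = "Customer disputed a duplicate charge; agent issued a refund."

scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```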
Latency: When an AI output is required in real time, application latency becomes critical and is the second most important metric to optimize for. Several libraries are available for optimizing inference of large models, including CTranslate2, vLLM, TensorRT-LLM and more.
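Whichever serving stack is chosen, latency needs to be measured the same way across solutions. A minimal harness might look like the following, where the generate function is a placeholder for a call into your own endpoint rather than any specific library's API:

```python
# Minimal latency harness: measure p50/p95 of a generation call.
import statistics
import time

def generate(prompt: str) -> str:
    """Placeholder for a call into the serving stack (vLLM, CTranslate2, TensorRT-LLM, ...)."""
    time.sleep(0.05)  # simulate model latency
    return "..."

latencies = []
for _ in range(100):
    start = time.perf_counter()
    generate("Summarize this call transcript ...")
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"p50={statistics.median(latencies)*1000:.0f} ms, "
      f"p95={latencies[int(0.95 * len(latencies))]*1000:.0f} ms")
```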
Concurrency: Depending on the workload, enterprises need to know how many concurrent servers (GPUs) are needed to support the product. In the example of call summarization for call center assistants, if the expectation is that the call center receives summaries for 600 concurrent agents, we need to meet that concurrency while still satisfying the latency requirements.
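A back-of-the-envelope sizing for the 600-agent example might look like the sketch below; every per-agent and per-GPU number here is an assumption to be replaced with your own benchmarks:

```python
# Rough concurrency sizing: how many GPUs to keep up with 600 agents.
concurrent_agents = 600
summaries_per_agent_per_hour = 10   # assumption: calls handled per agent per hour
latency_per_summary_s = 4           # assumption: measured end-to-end latency
batch_slots_per_gpu = 16            # assumption: concurrent requests one GPU sustains

requests_per_hour = concurrent_agents * summaries_per_agent_per_hour
gpu_seconds_needed = requests_per_hour * latency_per_summary_s / batch_slots_per_gpu
gpus_needed = gpu_seconds_needed / 3600
print(f"~{gpus_needed:.1f} GPUs sustained, before headroom for peaks")  # ~0.4 with these numbers
```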
Cost: Because LLM inference has high computational demands, it is critical to consider the availability and cost of GPUs before proposing a generative AI solution in a product. To manage expectations, enterprises need accurate usage calculations with reasonable margins before deploying a product.
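The same sizing numbers feed a simple cost estimate. In the sketch below, the GPU count, hourly rate and margin are placeholders rather than quoted prices:

```python
# Rough monthly cost estimate from the sizing above (all rates are placeholders).
gpus_provisioned = 2          # assumption: minimum for redundancy and peaks
gpu_hourly_rate_usd = 2.50    # assumption: cloud price for the chosen GPU
hours_per_month = 730
safety_margin = 1.2           # 20% buffer on usage estimates

monthly_cost = gpus_provisioned * gpu_hourly_rate_usd * hours_per_month * safety_margin
print(f"Estimated GPU cost: ${monthly_cost:,.0f}/month")  # ~$4,380 with these numbers
```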
Dataset curation is critical to developing new algorithms and models.
As you mature in the AI lifecycle from prompt engineering to fine-tuning to training your own model, you need to curate datasets that help you benchmark new algorithms and models. Regardless of the algorithm, access to datasets the model can learn from is critical, and for the best performance on your use cases you need to fine-tune or pre-train on data that is as close to the production use case as possible. Key dataset curation considerations include:
Internally available data: Consistency and accuracy have been known issues, especially with open-source LLMs. Fine-tuning and training an internal model (if resources are available) can improve the performance of the algorithms. For example, by instruction fine-tuning LLMs for our own use cases, we have been able to develop small, fine-tuned models that perform better than models 10 times their size.
Third-party vendor data: When customer data is scarce, the best way to get data to train or tune the model is through third-party vendors. Many offer domain-specific, off-the-shelf and tailor-made datasets for videos, speech, receipts and more. Getting these datasets annotated for the task, whether manually or synthetically, is critical for training and benchmarking.
Open-source datasets: There is a plethora of open-source datasets that the research community has made available for building next-gen AI applications; examples include HotpotQA, ProsocialDialog, EmpatheticDialogues and MASSIVE.
Using these datasets to complement real-world domain data can help models learn relevant social dialog traits, as well as general domain entities/intents that occur in specific industries (e.g., travel and hospitality, banking); a minimal loading sketch is shown below.
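As a small illustration, assuming the Hugging Face datasets library, an open-source corpus such as HotpotQA can be pulled and reshaped into instruction-tuning records that are then mixed with internal, domain-specific examples; the instruction wording and output file name are illustrative:

```python
# Pull an open-source dataset and reshape it into instruction-tuning records
# that can be mixed with internal, domain-specific examples.
# Requires: pip install datasets
import json
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor", split="train[:1000]")

with open("instruction_tuning_mix.jsonl", "w") as f:
    for row in hotpot:
        record = {
            "instruction": "Answer the question using the provided context.",
            "input": row["question"],
            "output": row["answer"],
        }
        f.write(json.dumps(record) + "\n")
```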
In summary, navigating the enterprise AI revolution involves adopting three crucial strategies: 1) implementing a unified framework for all AI applications, 2) standardizing metrics for benchmarking, and 3) establishing a robust system for data curation and training. Effectively executing these also requires building (or buying) an ML infrastructure and optimization framework to ensure high-speed performance. Stay tuned for more detailed insights on this key component in our upcoming blog posts.
I would like to thank Uniphore's NLP team and the whole AI team for actively contributing to the evolution of our learning towards effective use of generative AI for the enterprise.