The Art And Science Of Generative AI: Understanding, Evaluating, Deploying Foundation Models
From Prompting to Fine-Tuning: A Deep Dive into Large Language Models
On July 06, Abi Aryan (Machine Learning Engineer & LLMOps Expert) joined me on “What’s the BUZZ?” and shared how you can fine-tune and operate your generative AI models. Diving into the world of large language models (LLMs) is akin to embarking on a complex journey. There's an array of factors to consider: cost, industry applicability, interpretability, and more. These models are far from one-size-fits-all solutions; they require a detailed understanding of the end-user and existing infrastructure. Even once they are deployed, the journey doesn't end. They must be evaluated and fine-tuned for specific domains, a process that's both an art and a science. With evolving benchmarks and varied effectiveness, evaluation is as dynamic as the models themselves. Where do you start? Here’s what we’ve talked about…
» Watch the latest episodes on YouTube or listen wherever you get your podcasts. «
Considering Fine-Tuning vs. Prompting
Deciding whether to use large language models is complicated, given their cost and complexity. These models can be hard to interpret due to their generative nature, so it's crucial to ask why such a model is needed in the first place. Generative AI models aren't ideal for every industry or use case, and potential downsides, such as compliance requirements and evaluation challenges, need to be weighed. And the model's performance must be tied to business KPIs before any fine-tuning or specific applications can be discussed.
Even in areas where large language models can be helpful, like copywriting, it's important to set some standards. For example, the content generated should be non-toxic, maintain a balanced sentiment, and not be too politically biased. And depending on your audience, you might need to monitor and fine-tune your models regularly, which can add to the costs. Weigh these costs against the benefits and see if the models help achieve business goals before deciding to fine-tune them.
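To make those standards operational, it can help to run automated checks on drafts before anything goes out. Below is a minimal sketch of that idea; the word lists, thresholds, and scoring helpers are toy placeholders assumed for illustration, and in practice you would swap in a proper moderation or sentiment classifier.

```python
# Minimal sketch of automated checks for generated copy.
# The scoring helpers below are toy heuristics for illustration only;
# swap in a real moderation API or classifier in production.

BLOCKLIST = {"idiot", "stupid", "hate"}          # toy toxicity proxy
POSITIVE = {"great", "excellent", "love"}        # toy sentiment cues
NEGATIVE = {"terrible", "awful", "worst"}


def toxicity_score(text: str) -> float:
    """Fraction of words that hit the blocklist (placeholder metric)."""
    words = text.lower().split()
    return sum(w in BLOCKLIST for w in words) / max(len(words), 1)


def sentiment_score(text: str) -> float:
    """Rough polarity in [-1, 1] based on cue words (placeholder metric)."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(pos + neg, 1)


def passes_content_standards(text: str, max_toxicity=0.0, sentiment_band=0.5) -> bool:
    """Reject toxic drafts and drafts with strongly skewed sentiment."""
    return (toxicity_score(text) <= max_toxicity
            and abs(sentiment_score(text)) <= sentiment_band)


print(passes_content_standards("Our new plan gives you more storage at the same price."))
```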
From Out of the Box to Fine-Tuned: A Language Model's Journey
Understanding the infrastructure and the end-user is key to deciding how to best use a large language model. Some scenarios allow the use of pre-trained models like GPT-3 right out of the box, such as interactive chatbots, while others may require fine-tuning to cater to a specific domain. The latter often involves integrating data from existing databases to ensure relevance and efficiency.
» Fine-tuning has a different cost compared to the cost of inferencing or prompting the model itself; it is almost four times the cost. And unless you have a way to evaluate a model, you don't really know if fine-tuning is working or not. So yes, you're spending that much money, and you don't really know what the output of it is. «
— Abi Aryan
For instance, if you're running an e-commerce company, you probably don't need your model to answer questions about current events. Instead, you might want it to focus on FAQs related to your business. And you might not even need to fine-tune your model much if you're using it for tasks like question-answering or recommendations. But in the end, it's all about understanding your needs and making the best use of the resources you have.
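As a rough illustration of that idea, the sketch below grounds a hosted model in a small FAQ instead of fine-tuning it: retrieve the closest entry, then hand it to the model as context. The embed() helper, the FAQ entries, and the prompt format are assumptions made for the example; a real setup would use an embedding model and your actual database.

```python
# Minimal sketch of grounding answers in a company FAQ instead of fine-tuning.
# embed() is a toy stand-in for a real embedding model, and the FAQ entries
# are illustrative; the resulting prompt would be sent to a hosted LLM.
import math

FAQ = {
    "What is your return policy?": "You can return items within 30 days.",
    "How long does shipping take?": "Standard shipping takes 3-5 business days.",
}


def embed(text: str) -> list[float]:
    """Placeholder embedding: character-frequency vector (use a real model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_grounded_prompt(question: str) -> str:
    """Retrieve the closest FAQ entry and wrap it into a prompt for the model."""
    best_q = max(FAQ, key=lambda q: cosine(embed(q), embed(question)))
    return (
        "Answer using only the context below.\n"
        f"Context: Q: {best_q} A: {FAQ[best_q]}\n"
        f"Question: {question}"
    )


print(build_grounded_prompt("How many days do I have to send something back?"))
```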
Generative vs. Discriminative Models: Differences and Evaluation
Discriminative models add value differently than generative models. For discriminative models, what matters is performance: accuracy, inference speed, and latency. Fine-tuning their parameters is not the key focus; instead, it's about understanding key features and the correlations between variables. In industry, fine-tuning is usually aimed at optimizing for small gains, with the biggest value addition coming from understanding correlation and causation in the data.
» There's one key difference between generative models and discriminative models. With discriminative models, if you were not doing some sort of fine-tuning, you weren't really getting any specific results. A lot of effort was really spent on hyper-tuning the parameters, to the extent that there was almost a level of craziness to it. «
— Abi Aryan
Other factors, such as model latency, also influence a model's performance and its acceptance by users. Benchmarks and scoring-based methods help evaluate model performance, although their effectiveness can vary, and their significance differs by use case. For instance, a chatbot interacting with customers needs to respond quickly, while an internal tool can afford to take a bit longer.
There are publicly available benchmarks and frameworks (such as Eval Harness) that help you figure out how well your model is performing. But these tools might not always be a perfect fit for what you need, as use cases and domains can vary greatly. So, it's a good idea to keep exploring other methods and stay flexible. Not everything will work perfectly for every use case, and it's important to think creatively to solve problems. You need to be open to trying new methods and exploring different ways of doing things to ensure your AI models perform at their best.
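When the public benchmarks don't match your domain, one option is a small evaluation set of your own. The sketch below scores a model on a handful of domain-specific test cases and tracks latency alongside quality; the generate() function, test cases, and exact-match metric are placeholders assumed for illustration, not a recommended setup.

```python
# Minimal sketch of a custom evaluation loop: score a model on a small,
# domain-specific test set and track latency alongside quality.
# generate() is a stand-in for your model or API call; the test cases and
# the exact-match metric are illustrative only.
import statistics
import time

TEST_CASES = [
    {"prompt": "What is your return policy?", "expected": "30 days"},
    {"prompt": "How long does shipping take?", "expected": "3-5 business days"},
]


def generate(prompt: str) -> str:
    """Placeholder model call; replace with your LLM endpoint."""
    return "You can return items within 30 days."


def evaluate(cases) -> dict:
    scores, latencies = [], []
    for case in cases:
        start = time.perf_counter()
        output = generate(case["prompt"])
        latencies.append(time.perf_counter() - start)
        # Toy metric: does the expected phrase appear in the output?
        scores.append(1.0 if case["expected"].lower() in output.lower() else 0.0)
    return {
        "accuracy": sum(scores) / len(scores),
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
    }


print(evaluate(TEST_CASES))
```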
Summary
Large language models are powerful but complex tools that require careful consideration before deployment. Their cost, industry relevance, and interpretability are crucial factors to weigh. The nature of the end-user and the infrastructure also influences the model's usage, with some cases requiring fine-tuning for specific domains. Evaluating these models requires understanding data correlation, causation, and setting performance baselines. As benchmarks and scoring methods vary in effectiveness, finding a reliable evaluation method is a common challenge.
What challenges have you faced when deploying large language models?
Listen to this episode on the podcast: Apple Podcasts | Spotify
» Watch the latest episodes on YouTube or listen wherever you get your podcasts. «
What’s next?
Join us for the upcoming episodes of “What’s the BUZZ?”
August 1 - Scott Taylor, aka “The Data Whisperer”, will let us in on how effective storytelling helps you get your AI projects funded.
August 17 - Supreet Kaur, AI Product Evangelist, and I will talk about how you can upskill your product teams on generative AI.
August 29 - Eric Fraser, Culture Change Executive, will join and share his first-hand experience of how much of his leadership role he is able to do with generative AI.
Follow me on LinkedIn for daily posts about how you can set up & scale your AI program in the enterprise. Activate notifications (🔔) and never miss an update.
Together, let’s turn hype into outcome. 👍🏻
—Andreas