Framework for evaluating Generative AI use cases
  • 01 Apr 2024
  • 10 Minutes to read


Thank you to Barak Turovsky (VP of AI at Cisco; former Head of Product, Google Languages AI) for sharing his expertise and knowledge in our knowledge base.



With the ChatGPT release in November 2022, Large Language Models (LLMs) / Generative AI have taken the world by storm: users either love it or are irritated by it, and investors and companies across many industries are asking whether Generative AI will disrupt established modes of core information functions, from search to content generation to knowledge management. However, many of us are still trying to figure out the trillion-dollar question: what are the actual use cases where Generative AI adds the most value, and how do we monetize them? Moreover, which use cases are most practical to implement and monetize in the short, medium and long term? This post is my humble attempt to suggest a simple, practical framework for understanding ChatGPT's promises and limitations and, most importantly, how ChatGPT applies to different use cases and industries.

I find it helpful to evaluate potential use cases across two dimensions: fluency and accuracy. Another important aspect when evaluating use cases is how high stakes the use case is (represented by colors). Plotting different use cases across those dimensions provides an interesting decision framework I hope you will find useful:

 

Fluency, Accuracy, and use case stakes

A lot of excitement about ChatGPT stems from the “fluency” of its responses, which indeed look extremely natural and human-like. This is an amazing technological achievement, stemming from a) using the ground-breaking Transformer neural network architecture developed and open-sourced by Google in 2017 and b) training Transformer neural networks on a humongous corpus (tens of billions of samples) of dialogues. Again, a huge hat tip to Google, which developed and published research on both the first dialogue model (LaMDA) and the first multilingual translation Large Language Model (M4). However, fluency is different from accuracy, and the stakes involved vary widely across use cases:

  • Need for accuracy: how important is accuracy for the use case? It might not be important when you are writing a poem, but it is very important when providing users with a recommendation for a major purchase.

  • Need for fluency: is a fluent, natural-sounding “story” important for the use case? It would be important when writing a science fiction book, but less important when providing data for a business decision.

  • Stakes: how high-stakes is the use case? What is the risk if the AI gets the answer wrong? For example, the risk of inaccurate answers when using AI to write a poem is much lower than when relying on AI to decide where to book your next vacation or which dishwasher to buy.

Another important view is to evaluate a particular use case at scale, i.e. will your answer change if the task is to leverage AI (assuming no or minimal human intervention) to write millions of poems, or provide millions of answers people will rely on to make important decisions?
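One way to make these dimensions concrete is to record them per use case. The following is only a minimal sketch: the `Level` scale, the `UseCase` fields and every rating below are illustrative assumptions (some, such as the accuracy need of a science fiction book or the fluency need of a dishwasher recommendation, are not stated in this article), not measured values.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Coarse three-step scale (an assumption made for this sketch)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class UseCase:
    name: str
    accuracy_need: Level  # how important accuracy is for the use case
    fluency_need: Level   # how important a natural-sounding "story" is
    stakes: Level         # risk if the AI gets the answer wrong


# Illustrative ratings only, loosely following the examples in the text.
USE_CASES = [
    UseCase("write a poem", Level.LOW, Level.HIGH, Level.LOW),
    UseCase("write a science fiction book", Level.LOW, Level.HIGH, Level.LOW),
    UseCase("recommend a dishwasher", Level.HIGH, Level.MEDIUM, Level.HIGH),
    UseCase("data for a business decision", Level.HIGH, Level.LOW, Level.HIGH),
]


def describe(use_case: UseCase) -> str:
    """Render one use case as a single readable line."""
    return (f"{use_case.name}: accuracy={use_case.accuracy_need.name}, "
            f"fluency={use_case.fluency_need.name}, "
            f"stakes={use_case.stakes.name}")


for uc in USE_CASES:
    print(describe(uc))
```

Writing the ratings down this way forces an explicit answer to each of the three questions before any discussion of technology or monetization starts.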

Putting it all together

When trying to apply LLMs to different use cases, it is critical to define the requirements of a particular use case across those aspects. It is important to understand that getting very high fluency AND very high accuracy is very tricky, due to the following limitations of large generative language models:

  • When LLMs are inaccurate, they are very confidently (i.e. fluently) inaccurate. The closest human analogy is people who can talk very confidently and convincingly about any topic: when they don’t know something, they very confidently make stuff up. Given their charisma and “smooth talk”, we are often carried away by their confident demeanor and believe what they say (especially if we don’t fully understand the topic). Dialogue-trained LLMs demonstrate similar behavior: they are trained to provide you with an answer (even if it is incorrect or doesn't make sense!). Moreover, in addition to producing confidently incorrect answers, LLMs can at times produce offensive answers, or results that introduce or reinforce existing biases.

  • We can’t always assume linear, super-fast improvement in LLM accuracy with more training data: the very nature of LLMs is that, in order to work well, they need to be trained on HUMONGOUS amounts of training data. Both ChatGPT and Google LaMDA models are reportedly trained on billions of words. When Google published its groundbreaking research on the “massively multilingual, massive neural machine translation (M4)” LLM, it was trained on 25+ BILLION sentence pairs, with 50+ BILLION parameters!

  • Therefore, given the humongous size of the training corpus, doubling it might produce only a relatively incremental improvement in accuracy. We saw this phenomenon at Google when launching the first-ever product based on deep neural networks (Google Translate): languages with a large training data corpus (European languages like Portuguese, Spanish etc.) experienced a significant, but still incremental, translation quality improvement, while languages with less training data, like Hindi, experienced a much bigger quality jump. Back to the fluency topic, neural networks demonstrated a much more visible fluency improvement (e.g. translations started to sound much more natural).

  • Therefore, improving the accuracy of LLMs, while possible, is very complex and more of an art than a science: for example, improving accuracy for a topic/domain with insufficient training data might require combining data from other “domains” with more training data, and we shouldn’t assume it can be done easily and quickly.

Back to my previous examples: writing a poem (or a million poems) requires a high degree of fluency, but not a high degree of accuracy. Moreover, it is a fairly low-stakes use case (in terms of the risk of getting something wrong). On the other end of the spectrum, generating supporting data for important business decisions primarily requires high accuracy and fairly low fluency; it is also a high-stakes use case (high risk if the decision is based on wrong data).

Looking at those use cases, I observed an interesting trend: use cases related to improving creator/workplace productivity (writing a poem, composing music, writing children’s books, creating stock images, writing emails/documents/presentations etc.) are less complex and risky, and could be a better fit for current LLM/Generative AI technology (which is amazing in fluency but still has gaps in accuracy) than information seeking/decision support use cases (e.g. getting an answer about what appliance, car insurance or vacation to buy, or data for important business decisions).
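The trend above can be expressed as a rough rule of thumb. The sketch below is a hypothetical heuristic, not a validated scoring model: the 1 to 3 scale, the thresholds, the function name `fit_for_current_llms` and the ratings given to each example are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    accuracy_need: int  # 1 (low) to 3 (high); illustrative scale
    fluency_need: int   # 1 (low) to 3 (high)
    stakes: int         # 1 (low) to 3 (high)


def fit_for_current_llms(uc: UseCase) -> str:
    """Rough heuristic: fluent, low-accuracy, low-stakes tasks fit best."""
    if uc.fluency_need >= 2 and uc.accuracy_need <= 1 and uc.stakes <= 1:
        return "good fit"
    if uc.accuracy_need >= 3 and uc.stakes >= 3:
        return "poor fit without human verification"
    return "case by case"


# Ratings are assumptions made for this example.
poem = UseCase("write a poem", accuracy_need=1, fluency_need=3, stakes=1)
business_data = UseCase("data for a business decision", 3, 1, 3)
email = UseCase("draft a work email", 2, 3, 2)

for uc in (poem, business_data, email):
    print(f"{uc.name}: {fit_for_current_llms(uc)}")
```

The middle category is deliberate: many workplace productivity tasks sit between the two extremes, and whether they work in practice depends on whether a person reviews the output.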

 

What about monetization?

When prioritizing different use cases, the critical question is which of them offer a) large monetization potential and b) realistic implementation potential (i.e. Generative AI technology is mature enough for users to adopt it at scale for this particular use case).

While I believe that total monetization potential is directly correlated with higher-stakes use cases, which are much more complex to implement, I think that the current highest-ROI opportunities are in several use cases that provide a “sweet spot” of sizable monetization potential and practical implementation opportunity using Generative AI in the short to medium term.

One way to look at it is whether the use case can rely on “human labeling/correction” at scale: for example, users who use Generative AI to compose documents, emails or presentations will likely review the draft output and adjust or correct it. This will not only make the Generative AI system better (user feedback further improves the LLMs), but could still deliver a significant (50%-70%) productivity boost that users will be willing to pay a premium for. In this “AI + human” division of duties, AI will be “responsible” for a fluent, smooth “story” (which requires significant effort from many users), while humans will be “responsible” for validating the accuracy of the LLM output. I am sure there are additional use cases that offer a good balance of monetization potential and manageable complexity along similar lines. On the flip side, it is not practical to expect humans to fact-check every high-stakes answer produced by AI (for example, in the search engine use case).
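The “AI + human” division of duties can be sketched as a simple decision rule. This is a hypothetical illustration only: the function `review_policy`, its two inputs, the returned labels and the example classifications are assumptions made for the sketch, not part of any product or library.

```python
def review_policy(high_stakes: bool, human_reviews_every_output: bool) -> str:
    """Suggest how AI output is used, given stakes and review practicality."""
    if human_reviews_every_output:
        # The user edits the draft anyway, so accuracy is checked by a human.
        return "AI drafts for fluency; human validates accuracy"
    if high_stakes:
        # Nobody can fact-check every answer at scale.
        return "not practical yet: no human fact-checks each answer"
    return "AI output can be used with light or no review"


# (high_stakes, human_reviews_every_output); values are illustrative guesses.
examples = {
    "compose emails/documents/presentations": (False, True),
    "search engine answers": (True, False),
    "write millions of poems": (False, False),
}

for name, (high_stakes, reviewed) in examples.items():
    print(f"{name}: {review_policy(high_stakes, reviewed)}")
```

The rule checks review practicality first because a human reviewer absorbs much of the accuracy risk regardless of how the stakes are rated.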

We are in the early stages of the AI Revolution

We live in an exciting (and, to some, rapidly changing and scary) period of human history: a new incarnation of disruptive technology (many compare AI to the invention of electricity or fire!) will impact every aspect of our lives. I consider myself particularly lucky to have worked both on the first major AI breakthrough (using deep neural networks in a first-ever product at enormous scale with Google Translate) and on a new generation of AI that can produce natural, human-like outputs for virtually any topic. As with every new and disruptive technology (think of the early days of railroads, planes etc.), productizing and monetizing this groundbreaking AI technology is both exciting and scary, full of complexities and nuances that must be worked through in order to cross the chasm. I hope that my framework for thinking about and prioritizing Generative AI use cases will be helpful to many of you as you embark on this amazing journey.

This is just the first in a series and I look forward to sharing more of my musings with you soon.

------

Barak Turovsky is VP of AI at Cisco, and former Head of Languages AI product teams at Google (2014–2022), focusing on applying cutting edge AI technologies across Google Translate, Search, Assistant, Ads, Cloud, Chrome, and other products. Most recently, Barak was Executive in Residence at Scale Venture Partners and served as Chief Product Officer (responsible for Product, Engineering and AI teams) for Trax Retail, a late stage startup providing Computer Vision AI solutions for the Retail industry.

Previously, Barak was a product leader within the Google Commerce team, worked as Director of Product in Microsoft’s Mobile & Local Advertising, Head of Mobile Commerce at PayPal and Chief Technical Officer for Telemesser, an Israeli startup.

 


