Split-Second Personalization: How AI Will Keep You Hooked For Hours (And Finish That Movie)
A Glimpse Of The Future: Why Converged AI Experiences Will Be Greater Than The Sum Of Their Parts
I can see it in front of my eyes. And I’m certain that soon enough, we all will be able to — literally.
Innovation in AI is super promising. But it’s happening in separate swim lanes — from basic technologies like chips and foundation models to devices and applications.
Generative AI is largely based on a single modality or medium, such as text, image, audio, or video. Emerging multi-modal models can process more than one modality, but they are not yet widely distributed. For example, these models can analyze an image and describe in written language what is visible and why.
The reality of independent swim lanes won’t stay the same for long. With all its promise for businesses, AI will fundamentally change the products and services we use every day. One industry that stands out to me in particular is entertainment. Here is where I see things connect and converge first…
The State Of Generative AI Innovation
The past 18 months alone have brought us dozens of innovations across the entire technology stack. Hype cycle or not — it’s a good indicator of the pace of innovation and how higher-value products and services benefit from lower-level improvements and efficiencies.
Basic technology
This is the lowest layer in the stack. All platforms and applications are built on top of it:
Chips have been in hot demand since the recent boom in Generative AI started, and newer generations are becoming ever more powerful. For example, NVIDIA has just announced its Blackwell Graphics Processing Units (GPUs), which surpass the current generation, and Groq has designed a Language Processing Unit (LPU) to significantly accelerate text generation.
Small foundation models are emerging and are based on a subset of the training data used to create large ones. This creates more relevant output and makes it more feasible to run these models on the edge.
Multi-modal models such as GPT-4, Gemini, or Apple’s recently published MM1 take an image as input and create a description of it in natural language.
Agents are emerging software components that act on a user’s behalf to achieve a loosely defined goal. An early example of an agent is OpenAI’s GPTs.
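To make the agent concept concrete, here is a minimal sketch in Python of the core loop: the model proposes one action per turn, the runtime executes it, and the observation is fed back in. The CALL/DONE protocol and the tool registry are my own illustrative assumptions, not how OpenAI’s GPTs work under the hood.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def run_agent(goal: str, tools: dict, max_steps: int = 5) -> str:
    """Minimal agent loop: the model picks one action per turn until it is done."""
    messages = [
        {"role": "system", "content": (
            "You pursue the user's goal. Reply with exactly one line per turn: "
            "'CALL <tool> <argument>' to use a tool, or 'DONE <result>' to finish."
        )},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        action = reply.choices[0].message.content.strip()
        if action.startswith("DONE"):
            return action.removeprefix("DONE").strip()
        name, _, argument = action.removeprefix("CALL").strip().partition(" ")
        # Execute the requested tool and feed the observation back to the model.
        observation = tools[name](argument) if name in tools else f"unknown tool: {name}"
        messages.append({"role": "assistant", "content": action})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped after max_steps without finishing."
```

Calling run_agent("Find me an action movie for tonight", {"search": my_search}) would loop until the model replies DONE — the essence of acting on a user’s behalf toward a loosely defined goal.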
Types of modes
There are generally four kinds of modes. Current foundation models generate output in at least one of them:
Text generation via proprietary and open-source LLMs has been available for 18-36 months. GPT-4, Gemini, Mixtral, Falcon, and Llama 2 are examples. The cost of generating text, including code, is converging toward zero.
Image generation has also improved significantly within a short timeframe. For example, earlier versions of DALL-E, Midjourney, and Stable Diffusion had trouble portraying humans and their hands, but improved greatly in less than a year.
Audio generation for synthetic voices and music has recently advanced by leaps and bounds. Whether with stock voices or a clone of your own voice, tools like ElevenLabs or Descript let you create audio from a text script.
Video generation enables users to create moving images of a scene they define via a prompt. Examples include OpenAI’s Sora, RunwayML, PikaLabs, Stable Video, and D-ID. While most tools on the market currently generate around 3 seconds of video, Sora can produce clips of up to 60 seconds — enough for an actual movie scene.
Applications in consumer devices
Key developments and product releases bring AI even closer to the top of the stack — end-user applications and devices:
Inference on the edge is a promising field for running foundation models directly on mobile devices and similar form factors (see the sketch after this list).
Augmented reality headsets and sensors are gaining popularity again, whether for entertainment or business. The most recent examples are headsets such as Apple Vision Pro and Meta Quest, which ring in the next generation of spatial computing.
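As a taste of what inference on the edge already looks like today, here is a minimal sketch using the llama-cpp-python bindings. It assumes you have downloaded a small, quantized model in GGUF format; the file path is a placeholder, not a real artifact. The point: the tokens are generated on the device itself, with no server round-trip.

```python
from llama_cpp import Llama

# Load a small, quantized model entirely on the device; the path is a placeholder.
llm = Llama(
    model_path="./models/small-model-q4.gguf",  # quantized weights small enough for device RAM
    n_ctx=2048,                                 # modest context window keeps memory usage low
)

# Generate locally; no network call leaves the device.
result = llm(
    "Describe the opening scene of an action movie in one sentence:",
    max_tokens=64,
)
print(result["choices"][0]["text"])
```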
While each of these technologies is impressive on its own, they currently require a user to actively take the output of one layer and use it as the input to another. The trick will be connecting them for automated, rapid execution.
Convergence Of AI Capabilities In Entertainment
When all of these capabilities are connected to one another, outputs become inputs, hardware continues to shrink while growing more powerful, and the next frontier of consumer experiences is near. For example, entertainment will become an instantaneous, personalized experience. Your experience will differ from your friends’. If that sounds far-fetched, look at this documentary from this year’s Sundance Festival, which is different every time it is shown.
Video platforms often optimize for watch time. If their service is free, you’re the product: the longer you stay, the more ads they can show you, and the more revenue they make. Hence, increasing your engagement so you stay on the platform is a legitimate goal — and a lucrative one.
What could this look like?
For simplicity’s sake, let’s assume all the components of the stack are greatly enhanced — more capable, more energy-efficient, and smaller in size. So, here we go:
You put on your AR headset to watch an action movie based on a set of recommended titles. Personalization has taken your preferences and watch history into account.
Sensors in your AR headset and smartwatch are monitoring your engagement level as you watch. Does your heart rate go up during the car chase? Are your eyes actively following what’s on the display?
One of the sensors detects that your engagement is dropping. The sensor reading triggers an agent within a split second. The application needs to act fast before you switch to a different movie or turn it off altogether. Every second counts.
The agent’s task is to update the plot and the next few scenes (a minimal sketch of this loop follows the list below):
a) It prompts the multi-modal LLM to describe the next scene, optimizing for engagement, from location to script, dialog, lighting, visuals, music, and more.
b) It hands off subtasks to specialized worker agents that will create the next 60 seconds of the movie along the modes of text, image, audio, and video. For example: drive the hero off a cliff or have them fend off the bad guys.
c) It synchronizes the different streams.
Your movie application renders and plays back the updated scene in your AR headset. Miniaturization and inference at the edge make this feasible.
The sensors in your AR headset and smartwatch continue to measure your engagement and repeat the previous steps if needed. Indeed, your engagement has increased and you are fully immersed in the movie experience again. Mission accomplished!
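Pulled together, the loop above might look roughly like this in Python. Everything here is hypothetical — the sensor readings, the director and worker agents, the player, and the engagement threshold are stand-ins for capabilities that don’t ship as a neat API today — but it shows how the pieces would interlock.

```python
import time

ENGAGEMENT_THRESHOLD = 0.6  # illustrative cutoff; a real system would tune this empirically

def read_engagement(sensors) -> float:
    """Fuse heart rate and gaze into a 0..1 engagement score (toy heuristic)."""
    heart = min(sensors.heart_rate_delta(), 1.0)   # hypothetical device APIs
    gaze = sensors.gaze_on_screen_ratio()
    return 0.5 * heart + 0.5 * gaze

def regenerate_scene(director, workers):
    """Steps (a)-(c): plan the scene, fan out per modality, synchronize the streams."""
    plan = director.describe_next_scene(optimize_for="engagement")   # (a) multi-modal LLM
    streams = {mode: workers[mode].render(plan)                      # (b) specialized workers
               for mode in ("text", "image", "audio", "video")}
    return director.synchronize(streams)                             # (c) align the outputs

def playback_loop(sensors, director, workers, player):
    """Watch the viewer, and rewrite the movie the moment engagement drops."""
    while player.is_playing():
        if read_engagement(sensors) < ENGAGEMENT_THRESHOLD:
            player.queue(regenerate_scene(director, workers))  # act before the viewer bails
        time.sleep(1.0)  # every second counts; poll on a tight cadence
```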
Whatever keeps you hooked to finish that movie (and start the next one) is fair game. Whether it is individual scenes or entire plots, our experiences might differ vastly.
This puts last summer’s strike by Hollywood writers in a different light: your favorite actor’s likeness becomes a digital avatar, and characters can be spun up with a single prompt. So does OpenAI’s recent preview of Sora for Hollywood studios.
Quite literally, the future is bright — but, wherever there is light, there is shadow.
While We’re On This Subject…
The above scenario is just my personal view. It’s the first use case I see emerging when all of these technologies and capabilities converge. But I’m also curious where others in the industry see things going.
That’s why I’ve invited Futurist & Author Bernard Marr to the “What’s the BUZZ?” livestream. We’ll talk about what converging technologies and AI hold for the future of business beyond entertainment.
Join us live on LinkedIn or YouTube:
Become an AI Leader
Join my bi-weekly live stream and podcast for leaders and hands-on practitioners. Each episode features a different guest who shares their AI journey and actionable insights. Learn from your peers how you can lead artificial intelligence, generative AI & automation in business with confidence.
Join us live
April 02 - Bernard Marr, Futurist & Author, will discuss how leaders can prepare for converging technologies and AI.
April 16 - T. Scott Clendaniel, VP & AI Instructor at Analytics-Edge, will share how you can use Generative AI to improve the user experience.
April 30 - Elizabeth Adams, Leader, Responsible AI, will share findings from her research on increasing employee engagement for responsible AI.
May 14 - Randy Bean, Founder of Data & AI Leadership Exchange, will join when we discuss how you can move beyond quick-win use cases for Generative AI. [More details soon…]
Watch the latest episodes or listen to the podcast
Follow me on LinkedIn for daily posts about how you can lead AI in business with confidence. Activate notifications (🔔) and never miss an update.
Together, let’s turn HYPE into OUTCOME. 👍🏻
—Andreas