Generative AI: A Data Privacy Deja Vu?
Understanding The Parallels Of Data Privacy Concerns Between Cloud Computing And Generative AI
The Dual Promise and Peril of Generative AI for Businesses
ChatGPT and generative AI are the latest leap in AI. They promise tremendous productivity gains, new business models, and new opportunities. McKinsey has recently estimated an expected increase in the global gross domestic product (GDP) of up to $4.4 trillion per year, and Morgan Stanley projected earlier this year that generative AI will add $6 trillion in offline spending.
Given so much focus and excitement in the market, leaders want their own business to participate as well. While the innovation potential is high, so are the risks when adopting consumer-grade applications (or even ‘free research previews’) like ChatGPT in an enterprise environment. The parallels to the early days of Cloud Computing are clearly visible. But how can leaders balance innovation and risk in these times of AI-driven disruption?
» Watch the latest episodes on YouTube or listen wherever you get your podcasts. «
Echoes of the Past and Present Dilemmas
About 10 years ago, a new computing paradigm emerged: Cloud Computing. Instead of running your own data centers, you rent computing resources (servers, storage, network) from a provider and pay a fee. But as businesses and IT departments started evaluating “the Cloud”, their concerns grew quickly: Putting our data into someone else’s data center? How is the data handled and by whom? Who has physical access to the facilities? Who can access the data?
And indeed, stories of inadequately secured cloud storage exposing sensitive data have appeared in the news over the years. The result was (in hindsight) some creative architectural constructs, for example hosting the physical hardware in a company’s own data center while having managed-service providers access the systems to provision resources.
Fast-forward 10 years to the current generative AI hype. The worries about someone else hosting or operating your company’s infrastructure have largely gone away. But there’s new trouble on the horizon: The majority of currently available generative AI tools do not provide the safeguards that businesses need to protect their data. And while it might be enticing to “10x your productivity with these 10 AI tools”, you’re likely also 10x-ing your risk exposure at the same time.
That’s because employees using these tools for business purposes can put their company at risk by entering proprietary, sensitive, or confidential data. Business and IT leaders are looking to understand how their providers of choice handle data privacy when incorporating generative AI technology into their products. That sounds like a deja vu.
Businesses’ New Generative AI Predicament
Recent news from Samsung (March 2023) and Amazon (January 2023) underscore the predicament that businesses are in: leveraging innovation quickly AND responsibly. Once a user submits their instructions to a Large Language Model (e.g. via a prompt), that data is sent to the provider that hosts the model. After computation, the model returns the result. But whether the content of the user’s prompt is stored and potentially used during the next model training is a growing concern. This is especially critical in the case of proprietary data, which users might submit inadvertently, because this data or fragments of it can make it into the information that a future model generation returns to anyone. Clearly, that’s a scenario leaders want and need to prevent.
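To make that data flow concrete, here is a minimal sketch in Python, using the OpenAI HTTP API purely as an example provider: whatever text the application (or an employee) puts into the prompt is transmitted to the provider’s endpoint, and the generated result comes back in the response.

```python
# Minimal sketch of a typical LLM API call. The prompt content below
# literally leaves your network boundary and is processed by the provider.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",
        # Anything an employee pastes here is transmitted to the provider,
        # including proprietary or confidential details.
        "messages": [
            {"role": "user", "content": "Summarize our internal roadmap: ..."}
        ],
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```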
Rightfully so, the key questions they are asking yet again: What happens with our data? Who has access to it? How can we prevent information leakage? How and by whom will our data be processed? What will it be used for next? And do the benefits of being an early adopter outweigh the risks?
The answers vary by provider. For example, Microsoft offers developers an option for their Azure OpenAI services to only use the underlying generative AI models for inference. This means developers can receive information that the model generates without sending the users’ data to Microsoft (respectively OpenAI). OpenAI itself has also recently added a data privacy feature to ChatGPT that allows users to use the tool for inference without sharing data with the provider, reacting to the ChatGPT ban in Italy this spring. These capabilities address at least the most immediate concerns. But will they be enough to balance risk and reward?
Navigating Generative AI Data Privacy
Based on the key concerns of Business and IT leaders, there are four layers that businesses can make use of to prevent sensitive data from making it into the provider’s training set. (If you’re aware of any additional ones, please leave a comment…)
Inference-only usage: Check if your vendor offers the ability to use the model for inference only. The intent is that you will only generate results using the model without sharing any data with the vendor. Read your vendor’s fine print to be certain of the scope of what they offer.
Prompt engineering: A simple, yet effective, step can be to tweak your prompts. Understanding how prompts are constructed is crucial. If you don’t give users the opportunity to define their own task via free-form text, your application controls exactly what information is needed and transmitted in order for the LLM to generate and present the results to your end-user (see the first sketch below).
Content filtering: While most vendors apply this technique during model training and to the generated output before it is presented to a user, you can also apply it to the content of any prompts your users submit, before they are forwarded to the provider (see the second sketch below). There are a few vendors and startups already active in this space.
Local copy: There are several Open Source LLM variants available. Creating a local copy of the LLM you want to use (if available) can give your business more control over the data that is sent to the model and what happens with it (see the third sketch below). However, this might not be possible or feasible for every available LLM.
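First, a minimal sketch of the prompt engineering layer, assuming a hypothetical customer support use case: the application fills a fixed template with a few controlled fields instead of forwarding free-form user text, so only the information the LLM actually needs is ever transmitted.

```python
# Instead of forwarding free-form user text, the application fills a fixed
# template with only the fields the model actually needs.
PROMPT_TEMPLATE = (
    "You are a support assistant. Write a polite reply to a customer.\n"
    "Product category: {category}\n"
    "Issue type: {issue_type}\n"
    "Desired tone: {tone}\n"
)

def build_prompt(category: str, issue_type: str, tone: str = "friendly") -> str:
    # Only these three controlled values are ever transmitted to the LLM;
    # names, account numbers, or pasted email threads never enter the prompt.
    allowed_issues = {"billing", "shipping", "defect"}
    if issue_type not in allowed_issues:
        raise ValueError(f"Unsupported issue type: {issue_type}")
    return PROMPT_TEMPLATE.format(category=category, issue_type=issue_type, tone=tone)

print(build_prompt("headphones", "shipping"))
```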
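Second, a minimal sketch of content filtering applied on the way out, assuming simple, hypothetical regex patterns for emails, phone numbers, and API keys; a real deployment would rely on more robust PII detection or one of the vendors mentioned above.

```python
import re

# Hypothetical redaction patterns; extend with project codenames,
# customer identifiers, etc. that are specific to your business.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "API_KEY": re.compile(r"(?i)\b(sk|key)-[A-Za-z0-9]{16,}\b"),
}

def redact(text: str) -> str:
    """Replace matches with a placeholder before the prompt leaves your network."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

user_input = "Summarize the call with jane.doe@example.com, reachable at +1 555 123 4567."
print(redact(user_input))
# -> Summarize the call with [EMAIL REDACTED], reachable at [PHONE REDACTED].
```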
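And third, a minimal sketch of the local copy option, assuming the Hugging Face transformers library is installed; “gpt2” is used purely as a placeholder for whatever openly licensed LLM fits your use case.

```python
# Minimal sketch of local inference: the model weights are downloaded once
# and generation runs on your own hardware, so prompts never leave your
# environment. "gpt2" is a placeholder for an open-source LLM of your choice.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator(
    "Draft a short internal memo about our product launch plan:",
    max_new_tokens=80,
    do_sample=True,
)
print(result[0]["generated_text"])
```

Keep in mind that running a capable model locally shifts the cost to your own infrastructure (GPUs, operations, updates), which is part of the feasibility question raised above.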
Summary
Just like in the early days of Cloud Computing, businesses are growing more concerned about data privacy as they evaluate and adopt generative AI. For Business and IT leaders, it is a matter of trying out the technology and better understanding its limitations and opportunities. However, data privacy concerns are not particularly new or limited to generative AI; they have already existed with Cloud Computing (and before that, too). From using LLMs for inference only to creating and operating a local copy of the model yourself, there are several options along the way. A thorough review of your vendors’ terms and conditions will reveal additional information and show under which circumstances “vendor A” might be more beneficial than “vendor B”.
» Watch the latest episodes on YouTube or listen wherever you get your podcasts. «
What’s next?
Appearances
July 10 - Monday Morning Data Chat with Joe Reis & Matt Housley on whether your business should chase generative AI.
Join us for the upcoming episodes of “What’s the BUZZ?”
July 6 - Abi Aryan, Machine Learning Engineer & LLMOps Expert, will share how you can fine-tune and operate large language models in practice.
August 1 - Scott Taylor, aka “The Data Whisperer”, will let us in on how effective storytelling helps you get your AI project funded.
Follow me on LinkedIn for daily posts about how you can set up & scale your AI program in the enterprise. Activate notifications (🔔) and never miss an update.
Together, let’s turn hype into outcome. 👍🏻
—Andreas