Theorem Agency Team · AI Architecture · 10 min read
How Cloud Providers Are Locking In Your AI Platform (And What to Do About It)
The AI platform lock-in cycle is compressing years of the typical vendor capture playbook into months. The window for intentional choices is smaller than you think.

If you’ve been building on cloud infrastructure for more than a few years, you’ve seen this movie before. The pattern is predictable: a new capability emerges, the hyperscalers race to offer managed services, the managed services are easy to adopt, and then—gradually, then suddenly—you realize you’ve accumulated switching costs that make migration a multi-quarter project nobody wants to fund.
We watched it happen with databases. We watched it happen with container orchestration. We watched it happen with serverless. And now we’re watching it happen with AI.
The difference this time is speed. The AI platform lock-in cycle is compressing years of the typical vendor capture playbook into months. If you’re making architectural decisions about your GenAI stack right now, the window for intentional choices is smaller than you think.
The Playbook Is Familiar
The cloud providers aren’t doing anything particularly novel here. They’re running the same strategy that worked for them in databases, compute, and Kubernetes—just faster.
Consider how AWS approached database services. They started with RDS, which was mostly just managed PostgreSQL or MySQL with some operational convenience. Easy to adopt, theoretically portable. Then came Aurora, with its proprietary storage layer and auto-scaling capabilities. Still “MySQL compatible,” but now with features you couldn’t replicate elsewhere. Then Aurora Serverless, with its automatic capacity management. Each step was individually reasonable. Each step deepened your commitment. By the time you realized you were locked in, the switching cost estimate made finance wince.
The same pattern played out with container orchestration. ECS was easy. Fargate was easier. The proprietary service mesh integrations were convenient. And somewhere along the way, your “containerized” workload became an AWS-shaped workload that would require substantial rearchitecting to run anywhere else.
AI infrastructure is following the identical trajectory, but the timeline is compressed. What took a decade with databases is happening in two years with GenAI platforms. The convenient on-ramps are already deployed, the proprietary integrations are multiplying, and the switching costs are accumulating faster than most teams realize.
Naming Names
Let’s be specific about what each major cloud provider is doing. This isn’t speculation—it’s all documented in their product pages, just not framed as lock-in.
AWS Bedrock represents Amazon’s full-stack approach to AI platform capture. On the surface, it’s a model hosting service with access to Claude, Llama, and other foundation models. Nothing proprietary about the models themselves. But look at the orchestration layer.
Bedrock Agents uses a proprietary orchestration format. The agent definitions, the tool schemas, the action groups—they’re all Bedrock-specific. If you’ve built sophisticated agents using Bedrock’s native constructs, those agents don’t run anywhere else without rewriting. Your orchestration logic is now an AWS asset.
Knowledge Bases in Bedrock presents a similar challenge. It’s convenient—point at an S3 bucket, get a vector-indexed knowledge store for your agents. But the retrieval pipeline, the chunking strategy, the embedding index—all managed by AWS in a way that doesn’t export cleanly. You can’t take that knowledge base and deploy it on GCP or Azure or your own infrastructure. The data is yours; the operationalized knowledge system is theirs.
Fine-tuning through Bedrock creates model weights that live only in AWS. You can customize Claude or Llama for your domain, but those customized weights exist solely within Bedrock’s infrastructure. The fine-tuning investment—which might represent months of data preparation and iteration—doesn’t travel with you.
Azure AI Studio takes a similar approach with different branding. The integrated environment makes it easy to build end-to-end AI applications, but that integration comes through proprietary tooling. Prompt Flow, Microsoft’s orchestration framework, uses a format that doesn’t translate to other platforms. The flows you build, the evaluation pipelines you create, the prompt variants you test—they’re Azure artifacts.
The Cognitive Services integrations add another layer. Azure makes it convenient to combine OpenAI models with their existing AI services—speech, vision, language understanding. Each integration pulls you deeper into a Microsoft-specific architecture. The convenience is real; so is the accumulating dependency.
GCP Vertex AI completes the picture with Google’s approach. Vertex AI Agents and Reasoning Engine provide powerful orchestration capabilities, but they’re Vertex-specific orchestration capabilities. The agentic workflows you build don’t port to other environments without reconstruction.
Vertex’s managed vector search is particularly interesting to examine. It’s fast, it scales well, and it doesn’t support standard export formats. You can put embeddings in; getting them out in a portable format is not a documented capability. Your vector index—potentially the distilled representation of your entire knowledge corpus—becomes a GCP asset.
Google’s proprietary models compound the situation. If you build on Gemini, you’re building on a model that runs only in Google’s infrastructure. Model portability isn’t even theoretically possible.
How Switching Costs Accumulate
Nobody starts a project planning to get locked in. The progression is more subtle than that.
On day one, you make a reasonable choice: “We’ll use managed services to move fast. We can always switch later.” This isn’t wrong, exactly. Managed services do let you move fast. The “switch later” part is where the reasoning breaks down.
By month three, your team has built competency around the platform’s specific APIs and mental models. The engineers know how to debug Bedrock agent configurations. They understand the quirks of Vertex’s retrieval pipeline. This knowledge represents real investment, and it doesn’t transfer to other platforms.
By month six, you’ve integrated the AI platform with your existing infrastructure. The agents authenticate through your identity provider. The observability data flows to your monitoring systems. The outputs feed your downstream applications. Each integration was individually sensible. Collectively, they’ve woven the AI platform into your operational fabric.
By year two, someone suggests expanding to a different cloud for resilience or cost optimization. The migration estimate comes back: six months of engineering work, substantial rewriting of the orchestration layer, and uncertainty about whether the fine-tuned model performance will translate. The project dies in planning.
This isn’t a failure of planning; it’s the success of the vendor’s product strategy. They designed the product to create this outcome.
What “Open” Actually Means
Cloud providers have become adept at the language of openness while building closed systems. “We support open source models” is technically accurate and practically misleading.
Running Llama on Bedrock is possible. The model weights are open. But the orchestration layer wrapping those weights is proprietary. The knowledge retrieval system connecting to those weights is proprietary. The evaluation and monitoring infrastructure around those weights is proprietary. You have an open model embedded in a closed system.
This distinction matters because the model is typically not the hard part. The hard part is the orchestration logic, the evaluation frameworks, the integrations, the operational tooling—all the infrastructure that makes raw model capabilities useful for your specific workflows. If that infrastructure is proprietary, model openness is largely irrelevant to your portability.
True portability requires openness at every layer: model weights you can move, orchestration logic you can run elsewhere, vector stores you can export, observability data you can redirect. Any proprietary component in the stack becomes a potential anchor.
The Questions Nobody’s Asking
When teams evaluate AI platforms, they typically ask about capabilities, pricing, and model selection. They rarely ask the questions that determine long-term flexibility.
Can we export our vector database in a standard format? Most managed vector services don’t support this. Your embeddings go in; they don’t come out in a portable form. If your knowledge base represents significant investment in data preparation and curation, that investment is trapped.
Does our orchestration logic run outside this cloud? If you’ve built agent workflows using a cloud provider’s native orchestration, the answer is almost certainly no. The agent definitions, the tool configurations, the multi-step reasoning chains—they’re platform-specific artifacts.
Can we switch model providers without rewriting agents? This tests both orchestration portability and the degree to which your prompts are tuned to specific model behaviors. Some orchestration frameworks (LangChain, CrewAI) make this relatively easy. Cloud-native orchestration typically doesn't (the sketch after these questions shows what the portable version looks like).
Do our observability tools work elsewhere? AI observability is still immature as a category, which means the tools you adopt now will shape your operational practices for years. If those tools only work with one cloud’s AI services, you’re building operational muscle memory that doesn’t transfer.
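To make the model-switching question concrete, here's a minimal sketch of the portable version using LangChain. The model names and the toy prompt are illustrative, not a recommendation; the point is that the chain definition is code you own, and the model is just a parameter you swap.

```python
# Minimal sketch: the chain definition is provider-agnostic; only the
# model object changes. Model names below are illustrative.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

prompt = ChatPromptTemplate.from_messages([
    ("system", "You summarize support tickets in one sentence."),
    ("human", "{ticket}"),
])

def build_chain(llm):
    # The orchestration logic (prompt | model | parser) lives in your repo;
    # the model is just an argument.
    return prompt | llm | StrOutputParser()

chain_openai = build_chain(ChatOpenAI(model="gpt-4o-mini"))
chain_anthropic = build_chain(ChatAnthropic(model="claude-3-5-sonnet-20241022"))

# Same invocation either way:
# chain_openai.invoke({"ticket": "Customer can't reset their password."})
```

The same shape works with CrewAI or LlamaIndex; what matters is that nothing in the orchestration references a specific cloud's agent service.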
These questions feel premature when you’re trying to get an MVP deployed. They become urgent when you’re trying to understand why a multi-cloud strategy seems impossible.
Architecting for Portability
The alternative to lock-in isn’t avoiding cloud services entirely—that’s impractical for most organizations. The alternative is intentional architecture that preserves optionality.
This starts with orchestration. LangChain, LlamaIndex, CrewAI, and AutoGen all provide orchestration capabilities that run anywhere. The agent logic you write using these frameworks deploys to any cloud—or no cloud. You own the code. You control where it runs. The cloud provider supplies compute, not intellectual property.
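As a rough illustration of what owning the orchestration means, here's a deliberately minimal agent loop written against a thin interface. Everything in it (the LLMClient protocol, the tool names, the reply format) is hypothetical; a real build would use one of the frameworks above, but the control flow would still live in your repository rather than in a managed console.

```python
# Rough sketch: agent control flow as plain Python you own. The LLMClient
# protocol and tool names are hypothetical; plug in any provider's SDK.
from typing import Protocol, Callable

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

# Tools are ordinary functions registered by name -- no cloud-specific
# "action group" or agent-definition format required.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"Order {order_id}: shipped",
}

def run_agent(llm: LLMClient, question: str, max_steps: int = 3) -> str:
    context = question
    for _ in range(max_steps):
        reply = llm.complete(
            f"Question: {context}\n"
            "Reply with either 'TOOL <name> <argument>' or 'ANSWER <text>'."
        )
        if not reply.startswith("TOOL"):
            return reply.removeprefix("ANSWER").strip()
        _, name, arg = reply.split(maxsplit=2)
        context += f"\n{name} returned: {TOOLS[name](arg)}"
    return "No answer within the step budget."
```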
Vector storage requires similar consideration. Self-managed options like PostgreSQL with pgvector or Qdrant give you data portability by default. If you choose a managed vector service, understand whether export is supported and what the export format looks like. Some managed services (Pinecone, for instance) support standard export; others don’t.
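For a sense of what portability-by-default looks like, here's a sketch using pgvector through plain SQL via psycopg. The connection string, table layout, and toy three-dimensional vectors are placeholders; the point is that the index is an ordinary Postgres table, so export is an ordinary SQL dump.

```python
# Sketch: embeddings in plain Postgres via pgvector. Because the index is
# an ordinary table, getting data out is ordinary SQL. Connection string,
# dimensions, and values are placeholders.
import psycopg

with psycopg.connect("postgresql://localhost/knowledge") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " id bigserial PRIMARY KEY,"
        " content text,"
        " embedding vector(3))"  # use your embedding model's dimension in practice
    )
    conn.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
        ("Refunds are accepted within 30 days.", "[0.01, 0.02, 0.03]"),
    )
    # Nearest-neighbour search with the <-> distance operator.
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[0.01, 0.02, 0.03]",),
    ).fetchall()
    # Export is plain SQL -- e.g. in psql:
    # COPY (SELECT id, content, embedding FROM chunks) TO '/tmp/chunks.csv' CSV;
```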
Infrastructure-as-code provides a layer of protection. If your deployment is fully described in Terraform or Pulumi, you have at least a starting point for multi-cloud portability. The IaC won’t automatically translate across providers, but it documents your architecture in a way that makes translation possible.
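A minimal Pulumi sketch of the idea (Terraform works just as well). The resource here is a placeholder; the value is that swapping providers becomes a visible, reviewable diff rather than archaeology.

```python
# Sketch: the deployment described in code. Resource names are placeholders.
import pulumi
import pulumi_aws as aws

# Object storage for document ingestion. Replacing this with a GCS or Azure
# Blob resource is a reviewable change in version control, not guesswork.
corpus_bucket = aws.s3.Bucket("rag-corpus")

pulumi.export("corpus_bucket_name", corpus_bucket.bucket)
```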
Observability deserves particular attention because AI monitoring is evolving rapidly. Building on cloud-specific observability tools means rebuilding your monitoring when you need flexibility. Building on portable tools (or at least tools that export data in standard formats) preserves your operational investment.
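One way to keep that investment portable is to emit traces through OpenTelemetry and decide separately where the data lands. A hedged sketch follows; the span and attribute names are our own conventions, not a standard.

```python
# Sketch: tracing an LLM call with OpenTelemetry so the data can be routed
# to any backend. Span and attribute names are our own conventions.
from opentelemetry import trace

tracer = trace.get_tracer("ai-platform")

def call_model(prompt: str) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt_chars", len(prompt))
        response = "...model call via the provider SDK of your choice..."
        span.set_attribute("llm.response_chars", len(response))
        return response
```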
None of this is free. Portable architectures require more intentional design than managed service consumption. You’re trading convenience for control—a trade that only makes sense if you value the control. For some organizations, the switching costs of cloud lock-in are acceptable. For others, preserving flexibility is a strategic requirement. The important thing is making that choice consciously.
The Timing Problem
The window for portable AI architecture is narrower than it appears. Every month of accumulating lock-in increases the eventual switching cost. But more subtly, every month of building team competency around proprietary tools increases the organizational resistance to changing course.
Engineers build mental models around the systems they use daily. They become fluent in the quirks and capabilities of specific platforms. Suggesting a migration isn’t just suggesting a technical project; it’s suggesting that their hard-won expertise needs to be rebuilt from scratch. That’s a harder sell than pure technical merit would predict.
If portability matters to you, the architectural decisions you make in the next few months will determine your options for the next few years. The convenient on-ramps are designed to look temporary. They’re not.
A Different Approach
We’ve built AI platforms for companies that arrived with partially locked-in architectures and companies that started with a clean slate. The clean slate is dramatically easier. Once proprietary dependencies have accumulated, extraction is slow, expensive work—not impossible, but rarely prioritized against new feature development.
Our approach is to assume portability matters from day one. We build on open-source orchestration frameworks. We use infrastructure-as-code for everything. We structure observability to be provider-agnostic. We document architectures assuming the client will eventually want to modify or move them.
This doesn’t mean avoiding managed services entirely—that would be impractical. It means choosing managed services that don’t create switching barriers, and using proprietary services only when the capability genuinely can’t be replicated with portable alternatives.
If you’re designing your AI platform architecture now, it’s worth thinking through what portability means for your specific situation. What would it cost to move in two years? What dependencies are you accumulating without realizing it? What questions should you be asking vendors that you’re not?
We’ve helped teams work through these questions before the architecture solidifies. Happy to talk through your situation.
theorem.agency builds AI platforms on open-source components that run in any cloud—or on-prem. We handle the architecture, the orchestration, the prompt engineering, and the handover. You own everything.



