20 minute read

This is the second post in a series of technical writeups about architectural patterns for adopting LLMs in SaaS and enterprise-grade applications. Any opinions expressed are solely my own and do not express the views or opinions of my employer. I welcome feedback and discussion—find me via social links on this site!

Introduction

Most LLM integrations start simple: pick a model, configure an API key, and call it from your application. This works fine for consumer apps and internal tools. But enterprise SaaS is a different beast.

Consider a B2B platform where different client companies have different regulatory requirements:

  • Client A (financial services, EU-based) requires all AI processing to use Azure OpenAI Service-hosted models within EU data centers
  • Client B (healthcare) mandates specific model versions that have been audited for their compliance framework
  • Client C (critical infrastructure contractor) has a strict Data Processing Agreement (DPA) governing all digital interactions concerning their data, mandating specific siloed environments from which models may run.

Suddenly, your “just call GPT-4” architecture needs to become a dynamic routing system that respects per-client constraints at runtime.

This post explores the solution space for dynamic model routing—selecting which LLM (or embedding model) to use based on runtime context, with a focus on multi-tenant enterprise scenarios.

TL;DR: We introduce a ModelResolver abstraction—with a constraint model inspired by Kubernetes affinities and taints—that consults per-client configuration to route requests to compliant model instances, with fallback chains that respect client constraints.

The Problem Space

Let’s formalize the challenge. In a multi-tenant SaaS application using LLMs, we need to handle:

Request arrives for Client A
├── Which LLM providers are allowed? (Azure only)
├── Which models are allowed? (GPT-4, but not Claude)
├── Which model versions are allowed? (specific audited versions)
├── What's the fallback chain if primary is unavailable?
└── Which prompt variant works best with the selected model?

This creates several interrelated problems:

1. Static Configuration Doesn’t Scale

The naive approach—environment variables per model—breaks down quickly:

# This worked when you had one model...
LLM_API_KEY=sk-xxx
LLM_MODEL=gpt-4

# But now you need per-client, per-provider, per-model configs
CLIENT_A_AZURE_GPT4_API_KEY=...
CLIENT_A_AZURE_GPT4_ENDPOINT=...
CLIENT_B_OPENAI_GPT4_API_KEY=...
CLIENT_B_BEDROCK_CLAUDE_API_KEY=...
# Combinatorial explosion ensues

2. Prompts and Models Are Coupled

Different models have different strengths. A prompt optimized for GPT-4 may underperform on Claude, and vice versa. If client constraints force you onto a different model family, you may need a different prompt variant.

This is compounded by organizational reality: in many teams, prompt engineering is a distinct role. Prompt editors work within prompt management platforms like LangFuse or LangSmith, iterating on prompt quality and evaluations. These editors shouldn’t need to know which client uses which provider—but they do need to create and maintain model-specific prompt variants. If the coupling between prompts and model configuration is too tight, every model routing change becomes a prompt engineering task, and vice versa.

3. Fallback Chains Must Respect Constraints

When your primary model provider has an outage, you need fallbacks. But for Client A, falling back from Azure to OpenAI direct might violate their data residency requirements—even if both offer the same model.

4. Embeddings Are Especially Tricky

Unlike LLM responses (ephemeral), embeddings are often persisted for RAG use cases. If you store embeddings generated by Model X, you must query with Model X. Switching embedding models mid-stream requires re-indexing.

The Solution Space

Approach 1: Tenant-Isolated Deployments

The nuclear option: deploy separate instances of your service per client, each with its own static configuration.

Tradeoffs: Operationally expensive, doesn’t scale, but provides maximum isolation. Sometimes required for the most sensitive clients.

Approach 2: Feature Flags Per Client

Use a feature flag system to toggle between a small number of pre-configured model setups. Simple to implement, but becomes unwieldy as the number of configurations grows. Hard to express complex constraints like “Azure OR Bedrock, but not OpenAI direct.”

Approach 3: Dynamic Resolution (Our Focus)

Introduce a ModelResolver abstraction that:

  1. Accepts a logical model request (e.g., “I need a GPT-4 class model for this prompt”)
  2. Consults per-client configuration to determine allowed providers/models
  3. Returns a prioritized list of concrete model instances that satisfy constraints
  4. Handles fallback ordering within compliance boundaries

This approach separates what you need (a capable LLM) from how it’s provisioned (which specific deployment).
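
To make the contract concrete, here is a minimal sketch of the resolver interface (the names mirror the proof-of-concept in the appendix; treat the exact signature as illustrative):

from typing import Protocol
from uuid import UUID


class ModelResolverProtocol(Protocol):
    """The caller's view: logical model names in, compliant concrete instances out."""

    async def resolve(
        self,
        model_identifiers: str | list[str],  # logical names, e.g. "gpt-4"
        client_id: UUID | None = None,       # whose compliance constraints apply
        max_fallbacks: int = 3,              # cap on the length of the fallback chain
    ) -> list["ResolvedModel"]:              # prioritized, constraint-compliant instances
        ...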

What About Managed LLM Gateways?

Before rolling your own, it’s worth surveying the growing ecosystem of managed LLM gateways and routing platforms. Several address significant parts of this problem out of the box:

Third-party gateways like Portkey and LiteLLM offer the most relevant feature sets. Portkey provides first-class model allowlists and denylists scoped to workspaces, built-in prompt management with versioning, automatic fallback chains, and can be deployed in your VPC. LiteLLM (open-source, self-hosted) has a deep multi-tenant architecture with “model access groups” that restrict which models are available per org/team/user, plus integrations with prompt management platforms like LangFuse. Both support 100+ LLM providers behind a unified OpenAI-compatible API.

Hyperscaler-native options are strongest for regulated industries. AWS Bedrock offers cross-region inference with automatic failover, IAM-based model access control, and holds FedRAMP High and DoD IL4/IL5 authorization—making it the go-to for government workloads. Azure AI Foundry introduced a native Model Router (GA Nov 2025) that does intelligent per-prompt cross-model routing across 18+ models, with Azure Policy for organizational governance and comparable compliance certifications.

Quality-based routers like Martian take a different approach entirely, using mechanistic interpretability to predict which model will best answer each individual prompt—optimizing for response quality rather than compliance constraints.

Other notable players include Cloudflare AI Gateway (edge-based routing with automatic failover), Kong AI Gateway (strong if you already run Kong for API management), and OpenRouter (marketplace for 500+ models with automatic fallback, though limited per-tenant governance).

Where the gaps remain. These platforms handle multi-provider failover and, in some cases, per-tenant model access control well. However, none of them address the full problem as described in this post:

  • Prompt-model affinity coordination—selecting the right prompt variant based on which model the client’s constraints allow—is application-layer logic that no gateway handles. This is the integration point with prompt management platforms like LangFuse or LangSmith.
  • Embedding model routing with compliance-aware multi-model indexing (indexing with multiple models at write time, querying with the compliant one at read time) is not addressed by any vendor.
  • Data Processing Agreements (DPAs) can sometimes eliminate all third-party gateways.

The practical takeaway: if your needs are primarily multi-provider failover with basic per-tenant access control, a managed gateway like Portkey or LiteLLM can save significant engineering effort. If you need deep prompt-model affinity coordination, embedding compliance for RAG, or must operate within strict regulatory boundaries mandated by DPAs and such, you’ll likely need the custom architecture described below—potentially layered on top of a gateway for the lower-level routing primitives.

Proposed Architecture

Core Concepts

The architecture rests on four key abstractions:

Model Identifier: A logical name for a model. Examples: gpt-4, claude-sonnet-3.5, text-embedding-3-large. These are the names prompt editors and use case developers think in. There is some degree of freedom here: for some companies it makes sense to use fairly specific model references (claude-sonnet-35), whereas for others it makes sense to stay at the level of a model family (claude-sonnet) or even a model “class” (thinking-llm-high). We’ll return to this below.

Provider Identifier: The infrastructure provider hosting a model deployment. Examples: azure (Azure OpenAI Service), openai (OpenAI API), bedrock (Amazon Bedrock), anthropic (Anthropic API). A single model can be available from multiple providers.

Model Instance: A concrete, configured deployment we can instantiate a client for. Each instance is characterized by a provider identifier, a model identifier, a configuration key (for looking up credentials, endpoint URLs, additional provider-specific configuration, etc.), and a default priority (used mainly for ordering fallbacks, as we’ll see below). For example, gpt-4 might have two instances: one on Azure (priority 1) and one on OpenAI direct (priority 2).
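
A minimal sketch of such a record (the appendix code assumes a small dataclass along these lines; the exact shape is an illustration):

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInstance:
    """A concrete deployment of a logical model that we can build a client for."""
    provider: str      # e.g. "azure", "openai", "bedrock", "anthropic"
    config_key: str    # lookup key for credentials, endpoint URL, deployment name, ...
    priority: int = 1  # lower means preferred; used to order fallbacks

    # Note: the logical model identifier is not stored on the instance itself;
    # it is the key under which instances are registered (see the registry below).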

Client Model Relationship: Per-client rules expressing which models and providers are allowed or blocked. This is where compliance constraints live.

Before we dive into details, here’s a diagram showing how it all comes together as a flow through a typical request processing pipeline:

┌─────────────────────────────────────────────────────────────────┐
│                      Request Context                            │
│  (client_id, prompt_tag, model_preferences)                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                      ModelResolver                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Client Config   │  │ Model Instance  │  │ Provider        │  │
│  │ (allow/block)   │  │ Registry        │  │ Priorities      │  │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘  │
│           │                    │                    │           │
│           └────────────────────┼────────────────────┘           │
│                                ▼                                │
│                    Resolve to Model Instances                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Model Factory                              │
│  Instantiate clients for each Model Instance                    │
│  Wrap in RunnableWithFallbacks for resilience                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│              LLM Client (with fallback chain)                   │
│  Primary: Azure GPT-4 → Fallback: Azure GPT-4-turbo             │
│  (both satisfy client constraints)                              │
└─────────────────────────────────────────────────────────────────┘

Client Model Relationship - The Constraint Model

The client configuration model draws inspiration from Kubernetes node affinities and taints/tolerations—a battle-tested system for expressing scheduling constraints. Just as K8s lets you specify both hard (requiredDuringSchedulingIgnoredDuringExecution) and soft (preferredDuringSchedulingIgnoredDuringExecution) affinities, and use taints to repel pods from nodes, this architecture can support both allowlists and blocklists for model/provider combinations, and can potentially distinguish between ‘hard’ and ‘soft’ constraints as well.

There is a trade-off here between expressiveness and maintenance complexity; in most cases it is probably enough to start with an allowlist-only model with ‘hard constraint’ semantics. In practice this means a per-client configuration that explicitly enumerates the allowed providers and/or models.

Each client-model relationship record captures:

  • client_id: The client company this rule applies to
  • type: The kind of constraint (AllowedModel, AllowedProvider, BlockedModel, or BlockedProvider)
  • model_identifier: Which model this rule targets (e.g. gpt-4); used with model-level constraints
  • provider_identifier: Which provider this rule targets (e.g. azure); used with provider-level constraints

A key design choice: when a client has any relationships defined, those become the exclusive source of truth. Only explicitly allowed combinations are permitted. Clients with no relationships inherit system defaults. This closed-world assumption is essential for compliance—you want to be certain that unreviewed models won’t accidentally be used.

Example configurations in plain language:

  • Client A (Azure only): One AllowedProvider row for azure. Any model is fine, but only through Azure deployments.
  • Client B (GPT-4 family only): Two AllowedModel rows for gpt-4 and gpt-4-turbo. Any provider can serve them.
  • Client C (no Anthropic): Two BlockedProvider rows for anthropic and bedrock. Everything else is allowed.

These relationships should be managed via a standard RESTful CRUD API, making them editable (and discoverable) by tech support personnel, e.g. during client onboarding and troubleshooting — in particular, no code changes or deployments should be required to change this configuration.
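
The resolver in the appendix consults a small ClientConstraints helper that encodes these closed-world semantics; its implementation is not shown there, so here is one possible sketch (illustrative only):

from dataclasses import dataclass


@dataclass(frozen=True)
class ClientConstraints:
    """Closed-world view of a client's allow/block rules."""
    allowed_models: frozenset[str] = frozenset()
    allowed_providers: frozenset[str] = frozenset()
    blocked_models: frozenset[str] = frozenset()
    blocked_providers: frozenset[str] = frozenset()

    # .empty() and .from_relationships() constructors omitted for brevity.

    @property
    def is_empty(self) -> bool:
        return not (self.allowed_models or self.allowed_providers
                    or self.blocked_models or self.blocked_providers)

    def allows(self, provider: str, model: str) -> bool:
        # Blocks always win.
        if provider in self.blocked_providers or model in self.blocked_models:
            return False
        # Closed world: if allow rules exist, the combination must match them.
        if self.allowed_providers and provider not in self.allowed_providers:
            return False
        if self.allowed_models and model not in self.allowed_models:
            return False
        return True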

Model Instance Registry

The registry maps each logical model identifier to its available deployments, ordered by priority. For example, gpt-4 might be available through Azure (priority 1, preferred) and OpenAI direct (priority 2, fallback). claude-sonnet-3.5 might be available through Amazon Bedrock (priority 1) and Anthropic’s API (priority 2).

This registry can be defined as a static lookup table in code, loaded from configuration, or eventually managed via an API for full runtime flexibility. The priority ordering can also be overridden globally—critical for incident response when you need to quickly shift traffic away from a failing provider.

The ModelResolver

The ModelResolver is the heart of the system. Given a set of preferred model identifiers and an optional client ID, it:

  1. Loads the client’s constraint configuration
  2. For each preferred model, looks up available instances in the registry
  3. Filters out instances that violate the client’s constraints
  4. Sorts remaining instances by priority (respecting any global provider priority overrides)
  5. Returns a prioritized list of compliant model instances, capped at a configurable maximum for the fallback chain

If no compliant instances exist, the resolver raises an error—surfacing the configuration gap rather than silently using a non-compliant model.

Note also that the above workflow can largely be cached, which matters in latency-sensitive workloads, using TTL-based expiration or external cache invalidation (for example, when client-model relationship configuration changes are submitted via the API).

Integration with LangChain

The resolved model instances integrate naturally with LangChain’s RunnableWithFallbacks. A factory class takes the resolver’s output, instantiates the appropriate provider-specific client for each instance (Azure, OpenAI, Bedrock, Anthropic, etc.), and wraps them in a fallback chain. If the primary model fails, the chain automatically tries the next compliant alternative—all without the caller needing to know about the routing details.
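
For reference, the LangChain primitive doing the heavy lifting is with_fallbacks. A minimal illustration, assuming langchain-openai is installed and Azure credentials/endpoint are supplied via environment variables (deployment names are placeholders; in the real flow both clients come out of the factory rather than being built by hand):

from langchain_openai import AzureChatOpenAI

# Two Azure deployments that both satisfy Client A's constraints.
primary = AzureChatOpenAI(azure_deployment="gpt-4-eastus", api_version="2024-02-01")
fallback = AzureChatOpenAI(azure_deployment="gpt-4-turbo-eastus", api_version="2024-02-01")

# If the primary call fails, the chain transparently retries on the fallback.
llm = primary.with_fallbacks([fallback])
response = llm.invoke("Summarize the termination clause of this contract: ...")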

Prompt-Model Affinity and the Prompt Management Layer

A critical design goal is separation of concerns between prompt authoring and infrastructure routing. Prompt editors—often a distinct role from the engineers building model routing—work within prompt management platforms like LangFuse or LangSmith to craft, version, and evaluate prompts. They shouldn’t need to think about which client uses which provider, but they do need to express which models a given prompt was designed and tested for.

This is where model affinities come in. Each prompt declares the models it works well with as part of its metadata. Crucially, prompts reference only the logical model identifier (e.g. gpt-4)—never the provider or API version. That separation is what allows prompt editors and infrastructure engineers to work independently.

{
  "name": "contract-analysis/default",
  "config": {
    "model_affinities": ["gpt-4", "gpt-4-turbo"],
    "temperature": 0.2
  },
  "version": 3,
  "labels": ["production", "latest"],
  "tags": ["contract-analysis"]
}

When client constraints force a different model family, the system can select an alternative prompt variant. Prompt management platforms make this workflow natural: editors can use version labels to promote variants through staging to production, tags to group interchangeable prompts by use case, and prompt folders to organize related variants.

{
  "name": "contract-analysis/claude",
  "config": {
    "model_affinities": ["claude-sonnet-3.5", "claude-opus-3"],
    "temperature": 0.2
  },
  "version": 2,
  "labels": ["production", "latest"],
  "tags": ["contract-analysis"]
}

At runtime, the system queries for prompts by tag (not by name), retrieves all variants tagged for a given use case, and selects the one whose model affinities align with what the client’s constraints allow. This means a prompt editor can publish a new Claude-optimized variant in LangFuse without any code changes—the routing layer will pick it up automatically for clients whose constraints require Anthropic models.

The use case layer orchestrates this selection: it iterates through candidate prompts (grouped by tag), attempts to resolve a compliant LLM for each prompt’s model affinities, and uses the first successful match. This keeps the use case code generic—it doesn’t hardcode any model or provider, delegating all routing decisions to the resolver.

Special Considerations for Embeddings

Embeddings present unique challenges because they’re often persisted. This introduces a degree of “statefulness” to the system: previous invocations now interact with future ones.

  1. Query embeddings must match index embeddings: You can’t query a vector index created with text-embedding-ada-002 using embeddings from text-embedding-3-large. The vectors live in different embedding spaces.

  2. Model changes require re-indexing: If a client’s constraints change (e.g. they move from Azure to Bedrock), or if you wish to upgrade to newer embedding models across the board, you may need to re-embed their data.

After analyzing many different use cases, I don’t believe there is a one-size-fits-all solution here. However, most solutions fall into one of two approaches that can be chosen on a case-by-case basis:

  • Multi-model indexing: maintain a supported_models configuration per use case. At index time, create embeddings using all supported models. At query time, use the ModelResolver to determine which embedding model the current client permits, and query only the corresponding index partition. This ensures compliance without sacrificing retrieval quality, trading storage cost for routing flexibility. For small-scale use cases, this is probably the best approach: triggering re-indexing whenever models are added (or removed) is fairly straightforward, and it guarantees that every embedding model in use is available at query time.

  • Event-driven reindexing: For use cases where maintaining a separate index per model is not practical, an alternative is to publish an event whenever client-model relationship configuration changes (or whenever a new model that should be supported going forward is added to the system). An event-handling component can then pick up the event, enumerate all affected resources (this will be very implementation-dependent), check whether they are still considered “alive”, and if so trigger re-indexing of their associated embeddings, as sketched below.
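
A sketch of what such a handler’s skeleton might look like; everything here is hypothetical (the event shape, the resource_index store, and the liveness check are all implementation-specific), and the index_document call mirrors the EmbeddingManager shown in the appendix:

from dataclasses import dataclass
from uuid import UUID


@dataclass
class ModelConfigChangedEvent:
    client_id: UUID
    added_models: list[str]    # embedding models that should now be supported
    removed_models: list[str]  # embedding models no longer allowed


async def handle_model_config_changed(event: ModelConfigChangedEvent) -> None:
    """Re-embed the client's live resources after a configuration change."""
    # Enumerate affected resources (documents, knowledge bases, ...); this is
    # the very implementation-dependent part mentioned above.
    resources = await resource_index.list_for_client(event.client_id)

    for resource in resources:
        # Skip resources that are no longer "alive" (archived, expired, deleted).
        if not await resource_index.is_alive(resource.id):
            continue
        # Trigger re-indexing with the newly required embedding models.
        await embedding_manager.index_document(
            document=resource.as_document(),
            supported_models=event.added_models,
        )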

Operational Aspects

Hot Reloading a ‘Global Override’ to Provider Priorities

During an incident, you need to quickly shift traffic away from a failing provider. A simple pattern that accommodates this is a global provider-priority ‘override’ that works without code changes.

This can live as an environment variable or be exposed via an API:

MODEL_PROVIDER_PRIORITIES="azure,bedrock,openai,anthropic"

This single configuration value reorders the fallback preferences across all model instances, instantly redirecting traffic.
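
Wiring the override into the resolver from the appendix might look like the following (relationship_store stands in for whatever persistence-backed store you use; a hypothetical name):

import os

# Parse the override into an ordered provider list; an unset or empty variable
# simply defers to the registry's per-instance priorities.
provider_priorities = [
    p.strip()
    for p in os.environ.get("MODEL_PROVIDER_PRIORITIES", "").split(",")
    if p.strip()
]

resolver = ModelResolver(
    model_registry=MODEL_INSTANCES,
    relationship_store=relationship_store,
    provider_priorities=provider_priorities,
)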

Monitoring and Alerting

Track resolution outcomes to detect configuration problems early:

  • Resolution success/failure rates per client and model—spikes in failures indicate constraint misconfigurations
  • Fallback invocation rates per provider—elevated fallback usage is an early outage indicator
  • Constraint coverage gaps—alert when new client onboarding hasn’t configured model relationships, or when a client has no compliant model for frequently-used prompts
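
As one illustration, these signals could be exposed as counters via the prometheus_client library (metric and label names here are made up):

from prometheus_client import Counter

RESOLUTION_FAILURES = Counter(
    "model_resolution_failures_total",
    "Requests for which no compliant model instance could be resolved",
    ["client_id", "model_identifier"],
)
FALLBACK_INVOCATIONS = Counter(
    "model_fallback_invocations_total",
    "Calls served by a fallback instance rather than the primary",
    ["provider"],
)

# Beware of label cardinality: with many clients, consider aggregating or
# mapping client_id to a coarser tier before using it as a label.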

Caching Client Constraints

Constraint lookups happen on every request, so they need to be fast. Cache client configurations aggressively with TTL-based expiration (e.g. 5 minutes). Consider cache invalidation via the CRUD API that manages constraints—when a tech support engineer updates a client’s allowed models, the cache should be cleared for that client.

Additional Considerations and Tradeoffs

When This Pattern Fits

Dynamic model routing makes sense when:

  • You have enterprise clients with regulatory or compliance requirements (e.g. DPAs, SOC 2, GDPR)
  • Different clients need different providers (data residency, audit requirements)
  • You want operational flexibility to shift traffic during incidents
  • Your prompt library has model-specific variants managed by a dedicated prompt engineering team

When to Consider Alternatives

Simpler approaches may suffice when:

  • All clients can use the same models (no compliance variation)
  • You only use one provider (no routing decisions to make)
  • Your application is single-tenant
  • Model selection is purely based on capability, not compliance

Complexity Costs

This architecture introduces:

  • A new data model (client relationships) that needs CRUD APIs and admin UI
  • Resolution logic that runs on every LLM request (mitigated by caching)
  • Prompt variants that must be maintained per model family—though prompt management platforms like LangFuse and LangSmith help significantly here, enabling prompt editors to manage variants with version labels and evaluation pipelines rather than code changes
  • Embedding index partitioning for RAG use cases

Weigh these costs against the value of per-client configurability.

Conclusion

As LLMs become critical infrastructure for enterprise SaaS, the ability to dynamically route requests based on client constraints becomes essential. The pattern presented here—combining a model registry, per-client relationship configuration inspired by Kubernetes affinities, and a resolver that produces compliant fallback chains—provides a foundation for building this capability.

Key insights:

  1. Separate logical from physical: Model identifiers express what you need; model instances express how it’s deployed
  2. Client constraints are first-class data: Store them in your database with CRUD APIs, not scattered across environment variables
  3. Decouple prompt authoring from routing: Prompt editors working in platforms like LangFuse or LangSmith should declare model affinities, not model configurations—let the routing layer handle the rest
  4. Fallbacks must respect constraints: It’s not enough to fail over—you must fail over to compliant alternatives
  5. Embeddings need special care: Persistence means you can’t switch models without re-indexing

The implementation complexity is non-trivial, but for enterprise SaaS serving regulated industries, dynamic model routing is increasingly becoming table stakes rather than a nice-to-have.


Appendix: Implementation

Didactic Proof-of-Concept: The code samples below are intended as educational illustrations of the architectural concepts discussed above. They demonstrate core patterns and mechanisms but should not be considered production-ready. Real implementations would require additional considerations: comprehensive error handling, monitoring/observability, security hardening, and extensive testing.

Client Model Relationship (Data Model)

The constraint model, implemented as a SQLAlchemy ORM class persisted to PostgreSQL:

import enum
from datetime import datetime
from uuid import UUID, uuid4

from sqlalchemy import func
from sqlalchemy.orm import Mapped, mapped_column

# `Base` is assumed to be the project's SQLAlchemy 2.x declarative base.


class RelationshipType(enum.Enum):
    """Types of client-model relationships."""
    ALLOWED_PROVIDER = "allowed_provider"
    ALLOWED_MODEL = "allowed_model"
    BLOCKED_PROVIDER = "blocked_provider"
    BLOCKED_MODEL = "blocked_model"


class ClientModelRelationship(Base):
    """
    Per-client rules for model/provider allowlists and blocklists.

    When a client has ANY relationships defined, they become the
    exclusive source of truth—only explicitly allowed combinations
    are permitted.

    When no relationships exist, the client inherits system defaults.
    """
    __tablename__ = "client_model_relationships"

    id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
    client_id: Mapped[UUID] = mapped_column(index=True, nullable=False)
    relationship_type: Mapped[RelationshipType] = mapped_column(nullable=False)
    model_identifier: Mapped[str | None] = mapped_column(nullable=True)
    provider_identifier: Mapped[str | None] = mapped_column(nullable=True)
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
    updated_at: Mapped[datetime] = mapped_column(onupdate=func.now())

Example constraint configurations:

# Client A: Azure only, any model
[
    ClientModelRelationship(
        client_id=client_a_id,
        relationship_type=RelationshipType.ALLOWED_PROVIDER,
        provider_identifier="azure",
    ),
]

# Client B: GPT-4 family only, any provider
[
    ClientModelRelationship(
        client_id=client_b_id,
        relationship_type=RelationshipType.ALLOWED_MODEL,
        model_identifier="gpt-4",
    ),
    ClientModelRelationship(
        client_id=client_b_id,
        relationship_type=RelationshipType.ALLOWED_MODEL,
        model_identifier="gpt-4-turbo",
    ),
]

# Client C: Block Anthropic providers entirely
[
    ClientModelRelationship(
        client_id=client_c_id,
        relationship_type=RelationshipType.BLOCKED_PROVIDER,
        provider_identifier="anthropic",
    ),
    ClientModelRelationship(
        client_id=client_c_id,
        relationship_type=RelationshipType.BLOCKED_PROVIDER,
        provider_identifier="bedrock",
    ),
]

Model Instance Registry

MODEL_INSTANCES: dict[str, list[ModelInstance]] = {
    "gpt-4": [
        ModelInstance(provider="azure", config_key="azure_gpt4_eastus", priority=1),
        ModelInstance(provider="openai", config_key="openai_gpt4", priority=2),
    ],
    "gpt-4-turbo": [
        ModelInstance(provider="azure", config_key="azure_gpt4_turbo_eastus", priority=1),
        ModelInstance(provider="openai", config_key="openai_gpt4_turbo", priority=2),
    ],
    "claude-sonnet-3.5": [
        ModelInstance(provider="bedrock", config_key="bedrock_claude_sonnet_35", priority=1),
        ModelInstance(provider="anthropic", config_key="anthropic_claude_sonnet_35", priority=2),
    ],
    "text-embedding-3-large": [
        ModelInstance(provider="azure", config_key="azure_embedding_3_large", priority=1),
        ModelInstance(provider="openai", config_key="openai_embedding_3_large", priority=2),
    ],
}

ModelResolver

@dataclass
class ResolvedModel:
    """Result of model resolution."""
    model_identifier: str
    instance: ModelInstance


class ModelResolver:
    """
    Resolves logical model requests to concrete instances,
    respecting client constraints and system priorities.
    """

    def __init__(
        self,
        model_registry: dict[str, list[ModelInstance]],
        relationship_store: ClientModelRelationshipStore,
        provider_priorities: list[str] | None = None,
    ):
        self._registry = model_registry
        self._relationships = relationship_store
        self._provider_priorities = provider_priorities or []

    async def resolve(
        self,
        model_identifiers: str | list[str],
        client_id: UUID | None = None,
        max_fallbacks: int = 3,
    ) -> list[ResolvedModel]:
        if isinstance(model_identifiers, str):
            model_identifiers = [model_identifiers]

        constraints = await self._load_client_constraints(client_id)
        resolved: list[ResolvedModel] = []

        for model_id in model_identifiers:
            if model_id not in self._registry:
                continue

            instances = self._filter_by_constraints(
                model_id, self._registry[model_id], constraints,
            )
            instances = self._sort_by_priority(instances)

            for instance in instances:
                if len(resolved) >= max_fallbacks:
                    break
                resolved.append(ResolvedModel(
                    model_identifier=model_id,
                    instance=instance,
                ))

        if not resolved:
            raise NoCompliantModelError(
                f"No model instances satisfy constraints for client {client_id}"
            )
        return resolved

    @cached(ttl=300)
    async def _load_client_constraints(
        self, client_id: UUID | None,
    ) -> ClientConstraints:
        if client_id is None:
            return ClientConstraints.empty()
        relationships = await self._relationships.get_for_client(client_id)
        return ClientConstraints.from_relationships(relationships)

    def _filter_by_constraints(
        self,
        model_id: str,
        instances: list[ModelInstance],
        constraints: ClientConstraints,
    ) -> list[ModelInstance]:
        if constraints.is_empty:
            return instances
        return [
            inst for inst in instances
            # ModelInstance does not carry the logical identifier itself,
            # so the caller passes the registry key explicitly.
            if constraints.allows(provider=inst.provider, model=model_id)
        ]

    def _sort_by_priority(self, instances: list[ModelInstance]) -> list[ModelInstance]:
        def priority_key(inst: ModelInstance) -> tuple[int, int]:
            try:
                provider_priority = self._provider_priorities.index(inst.provider)
            except ValueError:
                provider_priority = 999
            return (provider_priority, inst.priority)
        return sorted(instances, key=priority_key)
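
A hypothetical invocation (inside an async context), using the Client A configuration from earlier and a relationship_store placeholder:

resolver = ModelResolver(
    model_registry=MODEL_INSTANCES,
    relationship_store=relationship_store,
)

resolved = await resolver.resolve(
    model_identifiers=["gpt-4", "gpt-4-turbo"],
    client_id=client_a_id,
)
# Client A allows only the "azure" provider, so the OpenAI-direct instances are
# filtered out and the fallback chain becomes:
#   azure_gpt4_eastus -> azure_gpt4_turbo_eastus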

LLM Factory with Fallback Chains

Integration with LangChain’s RunnableWithFallbacks:

class LLMFactory:
    """Factory for creating LLM clients with fallback chains."""

    def __init__(self, resolver: ModelResolver, config_loader: ConfigLoader):
        self._resolver = resolver
        self._config = config_loader

    async def create(
        self,
        model_preferences: list[str],
        client_id: UUID | None = None,
        **kwargs,
    ) -> RunnableWithFallbacks:
        resolved = await self._resolver.resolve(
            model_identifiers=model_preferences,
            client_id=client_id,
        )

        clients = []
        for resolved_model in resolved:
            config = self._config.load(resolved_model.instance.config_key)
            client = self._create_client(
                provider=resolved_model.instance.provider,
                config=config, **kwargs,
            )
            clients.append(client)

        if len(clients) == 1:
            return clients[0]

        primary, *fallbacks = clients
        return primary.with_fallbacks(fallbacks)

    def _create_client(self, provider: str, config: ModelConfig, **kwargs) -> BaseChatModel:
        match provider:
            case "azure":
                return AzureChatOpenAI(
                    azure_endpoint=config.endpoint, api_key=config.api_key,
                    api_version=config.api_version, deployment_name=config.deployment_id,
                    **kwargs,
                )
            case "openai":
                return ChatOpenAI(api_key=config.api_key, model=config.model_name, **kwargs)
            case "bedrock":
                return ChatBedrock(model_id=config.model_id, region_name=config.region, **kwargs)
            case "anthropic":
                return ChatAnthropic(api_key=config.api_key, model=config.model_name, **kwargs)
            case _:
                raise ValueError(f"Unknown provider: {provider}")
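
And a hypothetical end-to-end call through the factory (config_loader is whatever component resolves a config_key to credentials and endpoints; a placeholder here):

factory = LLMFactory(resolver=resolver, config_loader=config_loader)

llm = await factory.create(
    model_preferences=["gpt-4", "gpt-4-turbo"],
    client_id=client_a_id,
    temperature=0.2,
)
# For Client A this yields an AzureChatOpenAI client wrapped with an Azure-only
# fallback; callers invoke it without knowing which provider was selected.
result = await llm.ainvoke("Extract the governing law clause from this contract: ...")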

Use Case: Prompt-Model Coordination

Demonstrates how a use case selects a compatible prompt and LLM at runtime:

class ContractAnalysisUseCase:
    """Use case demonstrating prompt-model coordination."""

    async def analyze(self, contract_text: str, client_id: UUID) -> AnalysisResult:
        # Get candidate prompts for this use case (by tag, not name)
        prompts = await self.prompt_manager.get_by_tag("contract-analysis")

        # Find a prompt whose model affinities intersect with
        # what the client allows
        llm, selected_prompt = await self._resolve_llm_and_prompt(
            prompts=prompts, client_id=client_id,
        )

        chain = selected_prompt.create_chain(llm=llm)
        return await chain.ainvoke({"contract": contract_text})

    async def _resolve_llm_and_prompt(
        self, prompts: list[Prompt], client_id: UUID,
    ) -> tuple[RunnableWithFallbacks, Prompt]:
        """Find compatible LLM and prompt combination."""
        for prompt in prompts:
            try:
                llm = await self.llm_factory.create(
                    model_preferences=prompt.config.model_affinities,
                    client_id=client_id,
                )
                return llm, prompt
            except NoCompliantModelError:
                continue

        raise ConfigurationError(
            f"No prompt variant compatible with client {client_id} constraints"
        )

Embedding Manager with Multi-Model Support

class EmbeddingManager:
    """Manages embeddings with model-awareness for compliant RAG."""

    async def index_document(
        self, document: Document, supported_models: list[str],
    ) -> list[StoredEmbedding]:
        """Index a document with multiple embedding models."""
        embeddings = []
        for model_id in supported_models:
            client = await self.embedding_factory.create(
                model_preferences=[model_id],
                client_id=None,  # System-level, no client constraints
            )
            vector = await client.embed_query(document.content)
            embeddings.append(StoredEmbedding(
                document_id=document.id,
                model_identifier=model_id,
                vector=vector,
            ))
        return embeddings

    async def query(self, query_text: str, client_id: UUID) -> list[Document]:
        """Query using client-compliant embedding model."""
        resolved = await self.resolver.resolve(
            model_identifiers=["text-embedding-3-large", "text-embedding-ada-002"],
            client_id=client_id,
            max_fallbacks=1,
        )
        model_id = resolved[0].model_identifier

        client = await self.embedding_factory.create(
            model_preferences=[model_id], client_id=client_id,
        )
        query_vector = await client.embed_query(query_text)

        return await self.vector_store.similarity_search(
            vector=query_vector,
            model_filter=model_id,  # Only search compatible embeddings
        )

Are you building LLM-powered features for enterprise clients? I’d love to hear about the compliance and routing challenges you’ve encountered.
