Platform Engineering

Dynamic Persona Selection in Real-Time Voice AI

Lightweight topic classification, context-drift detection, and runtime instruction switching in a production voice pipeline

Real-time voice conversations don't stay on topic. A data protection officer starts a session reviewing open assessments — an administrative task — then shifts to asking about the legal basis for international data transfers, then asks about the technical architecture of a proposed cloud migration. Each domain benefits from different expertise, different behavioural constraints, and different tools.

The standard approach in voice AI is to pick one persona at session start and leave it fixed. This guarantees persona-topic mismatch the moment the conversation drifts. An alternative is explicit switching — the user says "switch to legal advisor" — but that requires them to remember what specialists exist and interrupts conversational flow. A third option is a single general-purpose persona that handles everything, sacrificing depth for breadth.

None of these work well in a compliance domain where the difference between a precise and an imprecise answer has regulatory consequences.

This article describes how we built Acompli's dynamic persona selection system — a runtime architecture that continuously monitors conversation topic, detects when the active persona no longer matches the user's domain of inquiry, and switches the governing instructions mid-session without the user ever asking. It runs within the latency budget of a real-time voice conversation, and the user typically doesn't notice the switch happened.


The problem: context drift in advisory conversations

We distinguish context drift from topic switching. A topic switch is explicit: the user announces a new subject. Context drift is gradual. A question about data retention periods (compliance) leads to questions about database backup schedules (technical), which leads to questions about vendor SLA terms (legal). The user doesn't announce "I am now asking about a different domain." They follow a thread of curiosity.

When the active persona is optimised for domain X and the user's inquiry has drifted into domain Y, several things go wrong:

Reduced answer quality. A legal-specialist persona lacks the technical vocabulary for infrastructure questions. Responses become vague or superficial.

Hallucinated expertise. The model, instructed to behave as a domain expert, generates authoritative-sounding but incorrect responses outside its instructed domain.

Tool mismatch. Domain-specific personas carry whitelists of permitted tools. A mismatched persona may lack access to tools the user needs, or expose tools irrelevant to the current context.

Incorrect compliance guidance. In regulated work, a general-purpose persona may suggest a data transfer mechanism that doesn't satisfy the requirements of the applicable jurisdiction. This isn't merely unhelpful — it's actively harmful.

The constraint that makes this hard is latency. Voice conversations are real-time. End-to-end response latency in our pipeline runs 300–800ms from the user finishing their utterance to the system beginning its spoken response. Any classification or switching mechanism that adds perceptible delay will feel like the system has broken.

Context drift in a compliance conversation: the system detects topic shifts and switches the active persona without the user asking

Context drift across a single session: admin → legal → technical, with automatic persona switches at domain boundaries


Architecture: a smart proxy with a parallel classification loop

Rather than building drift detection into the main language model's system prompt (which adds token overhead and complexity to every response), we run classification as a completely separate, non-blocking process alongside the voice relay.

The system sits between the user's browser and the real-time voice API as a bidirectional WebSocket proxy. It intercepts, processes, and augments messages flowing in both directions.

Voice pipeline architecture: classification runs parallel to the voice relay, never in the critical path

The classification branch is fire-and-forget — it never blocks the voice relay

When the proxy intercepts a completed user transcription event from the voice API, it appends the transcript to the session state. If the drift-detection state machine indicates a check is due, the proxy launches a fire-and-forget classification task. This runs concurrently with the ongoing voice relay. The user's current interaction is never blocked.
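The relay-plus-classification split can be sketched as follows. This is a minimal illustration, not the production proxy: the event type string follows the Realtime API's transcription-completed event, and the helper names (`check_due`, `check_drift`) are assumptions.

```python
import asyncio

class VoiceProxy:
    """Minimal sketch of the relay loop with a fire-and-forget classifier."""

    def __init__(self, drift_detector):
        self.drift_detector = drift_detector
        self.transcript = []  # rolling buffer of (role, text) turns

    async def on_voice_event(self, event):
        # Relay path: handle the event immediately; never wait on classification.
        if event["type"] == "conversation.item.input_audio_transcription.completed":
            self.transcript.append(("user", event["transcript"]))
            if self.drift_detector.check_due(len(self.transcript)):
                # Fire-and-forget: the task runs concurrently with the relay,
                # so the user's current interaction is never blocked.
                asyncio.create_task(self.drift_detector.check_drift(self.transcript))
```

The key property is that `on_voice_event` returns without awaiting the classification task; the drift check's result is consumed later, on its own schedule.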

Three components compose the system:

Transcript Monitor. Captures user and assistant transcriptions from real-time voice API events, maintains a rolling buffer, and tracks message counts for classification triggering.

Topic Classifier. A lightweight LLM call (nano-class model, ~150 tokens per call) that analyses recent turns against the current persona and available alternatives. Returns a structured JSON verdict.

Persona Switcher. Applies the classifier's recommendation by constructing and sending a session.update event to the voice API, reconfiguring instructions, tools, and behavioural constraints mid-session.


The persona registry

Personas are stored as versioned database records that can be forked per tenant. Each persona contains:

System instructions — domain-specific behavioural and expertise instructions layered on top of shared base voice instructions. These define role, expertise boundaries, response style, and domain-specific constraints.

Classification hints — keyword signals (e.g., ["legal", "GDPR", "law", "regulation"]) that help the lightweight classifier match user utterances to persona domains without requiring deep semantic understanding.

Tool whitelist — the set of tools this persona is permitted to invoke, intersected at runtime with the user's RBAC permissions. A legal advisor persona cannot invoke risk extraction tools. A compliance specialist cannot access tools reserved for technical architecture review.

Confirmation-required tools — a subset requiring explicit user confirmation before execution. In compliance domains, write operations (creating assessments, updating records) have regulatory consequences.

Quick actions — suggested actions surfaced in the frontend when this persona is active.

Personas are loaded at session start, filtered by status, channel, and type, then cached for the session duration. New personas are added through the skill management API — no code deployment required.
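Concretely, a persona record might look like the following sketch. The field names are assumptions for illustration, not the actual Acompli schema; the `effective_tools` helper mirrors the whitelist-intersected-with-RBAC rule described above.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Illustrative shape of a versioned persona record."""
    id: str
    name: str
    description: str
    system_instructions: str
    classification_hints: list[str] = field(default_factory=list)
    tool_whitelist: set[str] = field(default_factory=set)
    confirmation_required_tools: set[str] = field(default_factory=set)
    quick_actions: list[str] = field(default_factory=list)
    status: str = "active"

def effective_tools(persona: Persona, rbac_permissions: set[str]) -> set[str]:
    # Runtime intersection of the persona's scope with the user's RBAC grants:
    # a tool is exposed only if both the persona and the user are allowed it.
    return persona.tool_whitelist & rbac_permissions
```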


Lightweight topic classification

The classifier uses a nano-class language model chosen specifically for latency and cost. The main conversational model handles user-facing dialogue; using it for classification would add 500–1000ms+ latency per check. The nano model completes in ~20–50ms at ~150 tokens, more than an order of magnitude faster and far cheaper per call, and well within the latency budget even when classification isn't fully parallelised.

The classifier receives a structured prompt containing:

Current persona context. The active persona's ID and domain description.

Recent conversation turns. The last 10 turns (configurable), each capped at 300 characters. The last 3 are marked with a [RECENT] tag to signal recency weighting.

Available persona catalogue. Alternative personas (excluding current), each with ID, name, description, and classification hints.
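Assembled concretely, the prompt construction might look like the sketch below. The wording, the `[RECENT]` tag placement, and the catalogue format are illustrative; the code assumes persona objects that expose `id`, `description`, and `classification_hints`.

```python
def build_classifier_prompt(current, turns, alternatives,
                            max_turns=10, max_chars=300):
    """Sketch of the routing prompt: recent turns plus a persona catalogue."""
    recent = turns[-max_turns:]
    lines = []
    for i, (role, text) in enumerate(recent):
        # Mark the last 3 turns to signal recency weighting to the classifier.
        tag = " [RECENT]" if i >= len(recent) - 3 else ""
        lines.append(f"{role}{tag}: {text[:max_chars]}")

    catalogue = "\n".join(
        f"- {p.id}: {p.description} (hints: {', '.join(p.classification_hints)})"
        for p in alternatives if p.id != current.id   # exclude the active persona
    )
    return (
        f"Active persona: {current.id} ({current.description})\n\n"
        "Conversation:\n" + "\n".join(lines) + "\n\n"
        f"Alternatives:\n{catalogue}\n\n"
        "Focus on the last 2-3 messages. If the topic is ambiguous, stay."
    )
```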

The classifier returns JSON:

{
  "action": "stay" | "switch",
  "recommended_persona_id": "..." | null,
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation (max 20 words)"
}

Several guards are applied before any recommendation influences the system. Invalid actions default to "stay." If the classifier recommends the current persona as the switch target (a logical impossibility that nano models occasionally produce), the recommendation is silently converted to "stay." Confidence must meet the effective threshold.
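These verdict guards can be sketched as a small sanitisation function. This is an illustration of the rules just described, not the production code; the dict shape matches the JSON schema above.

```python
def sanitize_verdict(verdict: dict, current_persona_id: str,
                     known_ids: set[str], threshold: float) -> dict:
    """Apply the guards before a recommendation influences the system."""
    action = verdict.get("action")
    target = verdict.get("recommended_persona_id")
    confidence = float(verdict.get("confidence", 0.0))

    stay = {"action": "stay", "recommended_persona_id": None,
            "confidence": confidence}

    if action not in ("stay", "switch"):
        return stay                      # invalid action defaults to "stay"
    if action == "stay":
        return stay
    if target == current_persona_id:
        return stay                      # nano models occasionally self-recommend
    if target not in known_ids:
        return stay                      # unknown persona ID
    if confidence < threshold:
        return stay                      # below the effective threshold
    return {"action": "switch", "recommended_persona_id": target,
            "confidence": confidence}
```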

The classification prompt instructs the model to focus on the last 2–3 messages, recommend staying if the topic is ambiguous, and only recommend a switch when there is a clear domain mismatch — not for minor topic variations within the same domain.

The drift detection state machine

Classification doesn't run on every message. A state machine tracks history and applies heuristic guards against unnecessary or disruptive switching.

Drift detection state machine: adaptive thresholds and anti-oscillation guards prevent unnecessary switching

Four adaptive mechanisms prevent the system from being either too trigger-happy or too sluggish:

Adaptive check intervals

The effective check interval starts at a base value (default: every 3 user messages) and increases by 1 for every 2 consecutive "stay" results, up to a maximum (default: 8 messages). If the last 6 checks all returned "stay," the conversation is stable — checking less frequently is cheaper and less likely to produce a false positive. The interval resets to the base value whenever a switch occurs.

A cooldown timer (default: 15 seconds) prevents rapid re-checks even if the message threshold is met.
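The interval and cooldown logic can be sketched as a small stateful class. Defaults mirror the values in the article; the class and method names are assumptions.

```python
class AdaptiveInterval:
    """Sketch of the adaptive check interval with a cooldown guard."""

    def __init__(self, base=3, maximum=8, cooldown_s=15.0):
        self.base, self.maximum, self.cooldown_s = base, maximum, cooldown_s
        self.interval = base
        self.consecutive_stays = 0
        self.last_check_at = 0.0

    def record_stay(self):
        self.consecutive_stays += 1
        # Interval grows by 1 for every 2 consecutive "stay" results, capped.
        self.interval = min(self.maximum, self.base + self.consecutive_stays // 2)

    def record_switch(self):
        self.consecutive_stays = 0
        self.interval = self.base        # sensitivity resets after a switch

    def check_due(self, messages_since_check, now):
        if now - self.last_check_at < self.cooldown_s:
            return False                 # cooldown blocks rapid re-checks
        return messages_since_check >= self.interval
```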

Adaptive confidence threshold

The base threshold for actioning a switch is 0.80. After a configurable number of consecutive "stay" results (default: 3), the threshold increases by 0.05 up to a maximum of 0.95. This stability bias makes it progressively harder to trigger a switch in long-stable conversations. The intuition: the longer a persona has been appropriate, the more evidence should be required to justify changing it.
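One reading of this stability bias, sketched below: each consecutive "stay" from the configured count onward raises the bar by one step, capped at the maximum. The exact escalation schedule in production may differ; the function name and parameters are assumptions.

```python
def effective_threshold(consecutive_stays, base=0.80, step=0.05,
                        after=3, maximum=0.95):
    """Confidence required to action a switch, given conversation stability."""
    # No escalation until `after` consecutive "stay" results have accumulated.
    extra_steps = max(0, consecutive_stays - after + 1)
    return min(maximum, base + step * extra_steps)
```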

Content-hash deduplication

Before making an LLM call, the classifier computes an MD5 hash of the conversation text concatenated with the current persona ID. If this matches the previous check, the cached result is returned without an API call. This prevents wasted calls when classification triggers fire but no new meaningful content has been added — for example, when recent turns were dominated by tool calls with minimal user text.
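A minimal sketch of the dedup cache, assuming a single-entry cache keyed on the last hash (class and method names are illustrative):

```python
import hashlib

class DedupCache:
    """Skip the classifier call when the input hasn't meaningfully changed."""

    def __init__(self):
        self._last_hash = None
        self._last_result = None

    def lookup(self, conversation_text: str, persona_id: str):
        # Hash the conversation text concatenated with the current persona ID.
        digest = hashlib.md5(
            f"{conversation_text}|{persona_id}".encode()
        ).hexdigest()
        if digest == self._last_hash:
            return self._last_result     # identical input: reuse, skip API call
        self._last_hash = digest
        return None                      # miss: caller runs the classifier

    def store(self, result):
        self._last_result = result
```

MD5 is fine here because the hash is a cache key, not a security boundary.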

Anti-flip-flop guard

A sliding window tracks the last 4 persona switches. Before actioning a recommendation, the system checks whether the target persona appears in the recent switch history (specifically the last 2 entries). If switching to persona B would create an A→B→A pattern, the switch is suppressed regardless of confidence. This prevents oscillatory behaviour in conversations that straddle domain boundaries.
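The guard can be sketched with a bounded deque (names and window defaults follow the description above; the class itself is an illustration):

```python
from collections import deque

class FlipFlopGuard:
    """Suppress switches that would create an A->B->A oscillation."""

    def __init__(self, window=4, lookback=2):
        self.history = deque(maxlen=window)   # recent switches, newest last
        self.lookback = lookback

    def allows(self, target_persona_id: str) -> bool:
        # Block the switch if the target appears among the last
        # `lookback` switch entries, regardless of classifier confidence.
        recent = list(self.history)[-self.lookback:]
        return target_persona_id not in recent

    def record(self, persona_id: str):
        self.history.append(persona_id)
```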


Runtime persona switching

When the drift detector determines a switch should occur, the system reconfigures the real-time voice session without disrupting the ongoing conversation.

The real-time voice API supports mid-session reconfiguration via a session.update event sent over the WebSocket connection. This carries the complete session configuration — instructions, tool definitions, voice settings, audio configuration. The API applies the new configuration to subsequent responses without requiring a new session.

The switch constructs a session.update as follows:

def build_session_update(persona, user, memory_context):
    # Layered instructions: shared base, then the persona's domain layer,
    # then per-user personalisation.
    instructions = compose(
        VOICE_BASE_INSTRUCTIONS,       # Shared across all personas
        persona.system_instructions,   # Domain-specific layer
        f"You are speaking with {user.name}.",
        memory_context,                # User facts + session history
    )

    tools = intersect(
        persona.tool_whitelist,        # Persona-scoped tools
        user.rbac_permissions,         # Role-based access control
    ) + [switch_persona_meta_tool]     # Always available

    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-1.5",
            "instructions": instructions,
            "tools": tools,
            "tool_choice": "auto",
        }
    }

What changes: System instructions are replaced with the new persona's domain-specific layer. Tool definitions are reconfigured. Pending confirmations are cleared.

What is preserved: Conversation history (the API maintains full context across session updates). Voice characteristics (deliberately locked to the user's preference — changing voice mid-conversation would be jarring). User context (name, memory facts, session summaries remain in instructions). Session identity (same WebSocket connection and session ID).

Deferred switching

This is a critical implementation detail. If the voice API is currently generating a response (tracked via response.created and response.done events), the persona switch is deferred. The new persona is queued, and the switch applies when the response.done event fires. This prevents the model from changing expertise, tone, or tool access partway through a spoken sentence.

async def check_persona_drift(session, openai_ws, client_ws):
    result = await session.drift_detector.check_drift(...)

    if result.should_switch:
        persona = load_persona(result.recommended_persona_id)
        session.drift_state.record_switch(persona.id)

        # Notify frontend of drift detection
        await client_ws.send({"type": "persona_drift_detected", ...})

        if session.response_active:
            # Defer until response.done to avoid a mid-sentence switch
            session.deferred_persona = persona
        else:
            # Apply immediately
            await activate_persona(persona, openai_ws, client_ws)
The frontend receives a persona_drift_detected event followed by persona_switched, allowing the UI to update contextual elements — tool panels, suggested actions, persona indicators — without user action.

Drift-triggered switches suppress the persona's greeting message. A greeting makes sense at session start; mid-conversation it would feel artificial ("Hello, I'm your legal advisor" inserted right after the user asks a legal question).
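The deferral described above amounts to a small state machine around the response lifecycle events. A sketch, with the `response.created` and `response.done` event names taken from the article and everything else (class name, callback shape, the `drift_switch` trigger) an illustrative assumption:

```python
class ResponseGate:
    """Track response lifecycle and flush a deferred persona switch."""

    def __init__(self, activate):
        self.activate = activate          # callback that applies the switch
        self.response_active = False
        self.deferred_persona = None

    def on_event(self, event_type, persona=None):
        if event_type == "response.created":
            self.response_active = True
        elif event_type == "response.done":
            self.response_active = False
            if self.deferred_persona is not None:
                # Apply the queued switch exactly at the response boundary.
                self.activate(self.deferred_persona)
                self.deferred_persona = None
        elif event_type == "drift_switch":
            if self.response_active:
                self.deferred_persona = persona   # defer: model is speaking
            else:
                self.activate(persona)            # apply immediately
```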


Integration with the broader agent system

Cross-channel consistency

The same PersonaDriftDetector module runs in both the voice proxy (as a fire-and-forget async task) and the text-based agent executor (synchronously before the main LLM call in the ReAct loop). This gives consistent detection behaviour across channels.

Voice memory service

A three-tier persistence system captures voice session data:

Tier 1 — Voice session logs. Full transcripts, tool calls, and persona metadata. 90-day retention for audit.

Tier 2 — Session summaries. LLM-generated 2–3 sentence summaries injected into subsequent sessions as context.

Tier 3 — User voice memory. Persistent facts about the user (role, department, projects) extracted from transcripts and injected into persona instructions. Personalisation that survives across sessions and across persona switches.

Memory context is loaded at session start and injected into whichever persona is active. Persona switches don't lose personalisation.

Explicit switching

In addition to implicit drift-detected switching, a _switch_persona meta-tool is always included in tool definitions regardless of active persona. Users can always request a switch verbally ("let me talk to the legal specialist"). The implicit system handles the cases where they don't think to ask.


Design trade-offs we made

Why a nano model instead of the main conversational model

The nano model completes classification in ~20–50ms at ~150 tokens. The main model would take 500–1000ms+ for an equivalent call. In a voice pipeline where end-to-end latency is already 300–800ms, adding 500ms for classification would nearly double perceived response time.

The accuracy trade-off is mitigated by: classification hints providing keyword-level signals that don't require deep semantic understanding; the task being structured as a simple routing decision with a small output space; confidence thresholds and adaptive mechanisms filtering low-confidence results; and errors being self-correcting (a missed detection is caught on the next cycle, a false positive can be reversed).

Why check every 3 messages, not every message

Consecutive messages within the same conversational turn rarely shift domains. Checking every message wastes API calls and increases false-positive risk on momentary topic mentions. Checking every 10+ messages misses genuine drift for too long. The adaptive backoff refines this further — stable conversations stretch to 8-message intervals; after a switch, sensitivity returns to baseline.

Why hard switching, not gradual blending

We considered verbal handoffs ("I'm passing you to our legal specialist"), instruction blending, and multi-persona composition. We rejected all three.

Verbal handoffs draw attention to internal mechanics and may reduce confidence ("why can't the same assistant answer my question?"). Instruction blending creates conflicts between different behavioural constraints. Multi-persona composition introduces ambiguity about which constraints take precedence and which tools are appropriate.

Hard switching — combined with deferred execution at response boundaries — is simpler, more predictable, and in practice smooth enough that users don't report perceiving the transition.

Why single persona, not multi-persona blending

Each persona's system instructions are a complete, self-consistent behavioural specification. Merging two would create conflicts. Each tool whitelist reflects a domain scope. And in a compliance domain, it matters which set of behavioural constraints governed a particular response. Single-persona selection gives a clear audit trail.


Where this is heading

The current system selects one active persona at a time. Conversations that genuinely span multiple domains simultaneously — "what are the legal implications of this technical architecture decision?" — get routed to whichever domain is the stronger signal. The architecture is designed to support multi-persona composition in future, where the system could blend instructions from two specialists for cross-domain queries, though the constraint enforcement and tool scoping problems that introduces are non-trivial.

We're also building a feedback loop where switching outcomes inform threshold tuning per organisation. Different compliance teams have different domain boundaries — what counts as a "legal" question versus a "compliance operations" question varies by team structure. Adaptive thresholds that learn from usage patterns will make the routing increasingly precise over time.

Why this matters for the compliance platform

The voice interface is one surface of the Acompli platform. The same user who reviews DPIAs in the web interface, manages risks in the dashboard, and curates Article 30 records can also talk to the system. When they do, the conversation shouldn't be constrained by which specialist they happened to select at the start.

The drift detection architecture ensures that Acompli's voice assistant brings the right expertise to each phase of a conversation — administrative when reviewing status, legal when discussing transfer mechanisms, technical when exploring architecture — without the user managing that routing themselves. It's a natural extension of how the broader platform works: the system handles the machinery so the human can focus on the substance.

The design principles that shaped this work — parallel processing to preserve latency, adaptive mechanisms to balance sensitivity against stability, deferred execution to avoid disrupting the user's experience — reflect how we think about AI in compliance more broadly. The technology should be invisible. The expertise should feel seamless. And the human should always be in charge.


This article describes the voice pipeline architecture as of March 2026. The lightweight classifier described here was later adapted for semantic drift detection in the AI self-modification system, where the same pattern — structured routing prompt, nano-class model, confidence threshold — detects whether a modified prompt has drifted from its original intent. For related work, see our research on governance-first design and the self-reinforcing data lifecycle.