May 11, 2026

How to build agentic chat with Durable Objects

Julian Curtis-Zilius · 10 min read

At Highlight, agentic chat is one of the most performance-critical parts of our product. Many users interact with our chat, which means the system has to be fast, globally available, and resilient to LLM failures.

We've evolved our chat product multiple times and have ultimately arrived at Cloudflare's Durable Objects as one of the core pieces of the architecture. In this tutorial, we'll walk you through how we built chat at Highlight and the learnings you can apply to your own chat backend.

Why we moved away from a traditional backend

Highlight's first chat backend was built in Python and deployed on a few machines with load balancers. It had two major problems:

Performance. Chat requests were taking over 1,000 ms to respond. Every request had to load the entire conversation history from the database before calling the LLM. For long conversations, this meant repeated database reads on every new message.

Concurrency. When LLM requests failed, conversation state could become corrupted. If a user sent a new message before the previous request finished handling an error, messages ended up out of order, breaking the conversation history. Users had to create a brand new chat every time this happened.

On top of that, it was built heavily around older models (from 2024) that modern ones easily outperform, and we found it increasingly difficult to build on that weak foundation. Deployments were overly complex and largely manual. We needed to change gears to something more modern.

Our first move was porting the backend to TypeScript on Cloudflare Workers, which coding agents knocked out in roughly a week. Workers got us partway there, but concurrency remained unsolved. Each chat thread needed to be locked while a request was in flight, otherwise concurrent appends could interleave messages out of order, and LLM APIs reject malformed message sequences. On top of that, we wanted multiple clients (web, desktop) to be able to connect to the same thread.

Preamble

We returned to our core requirements:

  • Fast: the main bottleneck should be the actual Time to First Token from the underlying LLM provider
  • Globally distributed: lower latency across the globe
  • Fault tolerant: LLM requests fail often, so we should be able to recover the state of the chat in most cases

With these requirements in mind, we picked Durable Objects for this guide. Durable Objects are Cloudflare's primitive for stateful compute: a single-threaded instance with co-located storage. They're inherently suited for applications where you can map each entity to a single Durable Object (DO), like a conversation thread, and they offer many of the guarantees we were already looking for:

  • They are easy to iterate on. Like Cloudflare Workers, changes are deployed instantly across regions.
  • They scale horizontally incredibly well: we can create as many Durable Objects as we want.
  • Durable Objects are stateful: an ID guarantees requests will always hit the same object.

If we ever need to leave the CF platform, there are some emerging alternatives like Rivet Actors, which provide a similar paradigm to DOs.

Prerequisites

Before we begin, you'll want to have:

  • Node.js (we're using v22)
  • A Cloudflare account
  • An Anthropic API key
  • Familiarity with TypeScript
  • Claude Code

Project Setup

We'll start by scaffolding a Workers project. We're using Hono in this example because it's fast, easy to build with, and supports basically every modern JS runtime (serverless, Node, etc.).

Here's the initial prompt we used to scaffold everything with Claude.

Scaffold a Hono typescript project inside of Cloudflare Workers. Add a Durable Object (ChatObject) which uses Drizzle to handle the sqlite schema + migrations inside the DO's constructor.

Durable Objects are backed by SQLite, and you have a couple of ways to interact with this storage. You can use a simple key-value API that stores each value in a row, or use SQLite directly. For our case, we'll want to use SQLite to take advantage of its structure and querying ability. On top of SQLite, we recommend a lightweight ORM like Drizzle. We can define our SQL schema in code and, when we push changes, Drizzle will automatically apply them to the DO.
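
For example, here's a minimal sketch of what such a Drizzle schema can look like. The table and column names are illustrative rather than our exact schema; sseEvents mirrors the event table used for replay later in this post.

// src/db/schema.ts (sketch)
import { sqliteTable, integer, text } from 'drizzle-orm/sqlite-core'

export const messages = sqliteTable('messages', {
  id: integer('id').primaryKey({ autoIncrement: true }),
  conversationId: text('conversation_id').notNull(),
  role: text('role').notNull(), // 'user' | 'assistant'
  content: text('content').notNull(),
  createdAt: integer('created_at', { mode: 'timestamp' }).$defaultFn(() => new Date()),
})

// Stored SSE chunks so a disconnected client can replay the stream
export const sseEvents = sqliteTable('sse_events', {
  id: integer('id').primaryKey({ autoIncrement: true }),
  conversationId: text('conversation_id'),
  event: text('event').notNull(),
  data: text('data').notNull(),
})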

[Figure: project structure file tree showing app/ with wrangler config, node_modules, src/db, chat-object.ts, index.ts, test/, and config files]
Our app's structure looked like this

Now we have the foundation for our project:

  • Cloudflare Worker
  • Durable Object that automatically applies the current schema + can handle future SQL migrations if we need to make changes.
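
For reference, the part of the wrangler config that wires this together looks roughly like the sketch below. The binding and class names match the ChatObject above; the project name, compatibility date, and flags are assumptions, and new_sqlite_classes is what opts the DO into SQLite-backed storage.

// wrangler.jsonc (sketch)
{
  "name": "agentic-chat",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "durable_objects": {
    "bindings": [{ "name": "CHAT_OBJECT", "class_name": "ChatObject" }]
  },
  "migrations": [{ "tag": "v1", "new_sqlite_classes": ["ChatObject"] }]
}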

Defining the SSE Schema

From here, we'll build out the SSE schema we use to stream events to the client. We defined these event types in code first.

  1. Install Zod, a schema validation library. Because Zod schemas exist at runtime, you can later expose them through OpenAPI with a Hono plugin.
pnpm install zod
  2. Add a schema.ts file:
import { z } from 'zod';

export const ChatSSETextEvent = z.object({
  type: z.literal('text'),
  content: z.string(),
  messageId: z.uuid(),
  conversationId: z.uuid(),
});

export const ChatSSEMetadataEvent = z.object({
  type: z.literal('metadata'),
  conversationId: z.uuid(),
  model: z.string(),
  llmProvider: z.string(),
});

export const ChatSSEToolUseEvent = z.object({
  type: z.literal('toolUse'),
  messageId: z.uuid(),
  conversationId: z.uuid(),
  name: z.string(),
  toolId: z.string(),
  input: z.any(),
});

export const ChatSSEErrorEvent = z.object({
  type: z.literal('error'),
  messageId: z.uuid(),
  conversationId: z.uuid(),
  message: z.string(),
  retryable: z.boolean(),
});

export const ChatSSEStreamEvent = z.discriminatedUnion('type', [
  ChatSSETextEvent,
  ChatSSEMetadataEvent,
  ChatSSEToolUseEvent,
  ChatSSEErrorEvent,
]);

We included an error event as well. When a retryable: true error comes in on the client side, you can automatically retry the request. The DO still has the full conversation history in SQLite, so you don't need to resend the entire message list.
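
For example, a client can watch for retryable errors and fire a retry without rebuilding any history. This is a minimal sketch, not our exact API: BASE_URL and the retryOf field are placeholders, and auth headers are omitted.

import { z } from 'zod'
import { ChatSSEStreamEvent } from './lib/schema'

type ChatEvent = z.infer<typeof ChatSSEStreamEvent>

const BASE_URL = 'https://chat.example.com' // placeholder

function handleChatEvent(event: ChatEvent) {
  if (event.type === 'error' && event.retryable) {
    // The DO still holds the full history in SQLite, so the retry only needs
    // the conversation id; there's no need to resend the message list
    void fetch(`${BASE_URL}/chat/${event.conversationId}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ retryOf: event.messageId }), // hypothetical retry flag
    })
  }
}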

Building the Chat API

With the schemas set, we'll want to set up a way to actually stream messages from an LLM. For this, I'm using Vercel's AI SDK. AI SDK has been a popular option for a while now; it's well maintained and supports pretty much every provider out there. Its popularity also means that later down the line, when you want to add telemetry, you can use tools like Braintrust's wrapAISDK.

I asked Claude to update our project:

Let's build out our chat API. Setup Zod validation inside of Hono to accept an incoming JSON body for a new chat request inside our Worker. Setup Vercel AI SDK inside of our ChatObject. I want the DO to return an SSE stream using Hono's SSE stream helper and forward the LLM's events using the schemas in lib/schema.ts. The Durable Object should store the SSE events so they can be replayed in the event of a disconnect.

After Claude finished, we ended up with a DO that had:

  • Hono (the instance inside the DO allows the worker to forward the request so we get full control over the HTTP response)
  • A handleChat function that creates the stream, opens a connection with the LLM, and forwards events back to the client
  • SQLite-backed event storage so we can replay the stream later

(Note: your code may vary, but we'll walk through what our output looked like, and you can ask Claude to match our layout if you prefer it.)

The Durable Object in Detail

Our goal was simple: serving a chat message should require zero reads or writes against our primary database. This was important so that performance would never be bottlenecked by proximity to the Highlight database.

We made several changes to reduce database hits (the first two are sketched after this list):

  • Leveraging Cloudflare KV for billing and entitlement checks
  • Using JWT-based authorization tokens
  • Moving as many processes to the background as possible (conversation titles, abuse detection, telemetry)
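
Here's a rough sketch of what the first two items can look like as Hono middleware. The KV binding, secret name, and claim shape are assumptions, not our exact bindings; this is the flavor of the validateToken middleware used in the Worker route later in this post.

import { createMiddleware } from 'hono/factory'
import { verify } from 'hono/jwt'

export const validateToken = createMiddleware<{
  Bindings: Env
  Variables: { userId: string }
}>(async (c, next) => {
  const token = c.req.header('Authorization')?.replace('Bearer ', '')
  if (!token) return c.json({ error: 'unauthorized' }, 401)

  try {
    // Verifying the JWT requires no primary-database read
    const claims = await verify(token, c.env.JWT_SECRET)
    c.set('userId', String(claims.sub))

    // Billing / entitlement state is read from Workers KV at the edge
    const entitlement = await c.env.ENTITLEMENTS_KV.get(`user:${claims.sub}`, 'json')
    if (!entitlement) return c.json({ error: 'no active plan' }, 403)
  } catch {
    return c.json({ error: 'invalid token' }, 401)
  }

  await next()
})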

We built the DO using Drizzle to handle the local SQLite db. Migrations are applied automatically through the DO's constructor using blockConcurrencyWhile, which ensures no requests are processed until the schema is up to date:

import { DurableObject } from 'cloudflare:workers'
import { EventEmitter } from 'node:events' // requires the nodejs_compat flag
import { drizzle, type DrizzleSqliteDODatabase } from 'drizzle-orm/durable-sqlite'
import { migrate } from 'drizzle-orm/durable-sqlite/migrator'
import migrations from '../drizzle/migrations' // path depends on your drizzle-kit output
import * as db from './db/schema'

export class ChatObject extends DurableObject<Env> {
  private storage: DurableObjectStorage
  private db: DrizzleSqliteDODatabase<typeof db>
  private abortController: AbortController
  private eventEmitter: EventEmitter

  constructor(ctx: DurableObjectState, env: Env) {
    super(ctx, env)
    this.abortController = new AbortController()
    this.eventEmitter = new EventEmitter()
    this.storage = ctx.storage
    this.db = drizzle(this.storage, {
      logger: false,
      schema: {
        ...db,
      },
    })

    // Block incoming requests while we ensure migrations
    // have been performed
    ctx.blockConcurrencyWhile(async () => {
      await migrate(this.db, migrations)
    })
  }

  // Rest of methods
}

The eventEmitter is key. It allows us to forward LLM chunks to any number of connected SSE streams, which we'll cover in the event streaming section.

Streaming from inside the Durable Object

[Diagram: client → Worker (Hono) → ChatObject Durable Object (SQLite, eventEmitter) → Anthropic LLM, with the SSE stream returning to the client]
The full request and SSE return path

The Worker acts as a thin routing layer. When a request comes in, it looks up (or creates) the Durable Object by chat ID and forwards the request:

import { Hono } from "hono";
import { validator } from "hono/validator";
import { ChatObject } from "./chat-object";
import { ChatRequestBody } from "./lib/schema";

export { ChatObject };

type Env = {
  CHAT_OBJECT: DurableObjectNamespace;
  // ...
};

const app = new Hono<{ Bindings: Env }>();

app.post(
  "/chat/:chatId",
  validateToken,
  chatLimiter,
  validator("json", (value, c) => {
    const result = ChatRequestBody.safeParse(value);
    if (!result.success) {
      return c.json({ error: result.error.issues }, 400);
    }
    return result.data;
  }),
  async (c) => {
    const chatId = c.req.param("chatId");
    const body = c.req.valid("json");

    // Perform entitlement / billing checks

    const id = c.env.CHAT_OBJECT.idFromName(chatId);
    const stub = c.env.CHAT_OBJECT.get(id);

    // Forward the entire request to the Durable Object
    // so that the DO can directly return an SSE stream
    const stream = stub.fetch(
      new Request(c.req.url, {
        method: "POST",
        headers: c.req.raw.headers,
        body: JSON.stringify(body),
      })
    );

    // Run background tasks for the chat
    c.executionCtx.waitUntil(
      // Detailed below...
    )

    return stream;
  }
);

export default app;

The Worker doesn't need to know anything about the chat logic. It just handles the routing, body validation, and authentication.

Event Streaming

We designed the event system to support multiple clients receiving the same stream simultaneously: for example, viewing a query on your phone that you started on your computer, or building multiplayer chat. Durable Objects make this straightforward.

Each LLM chunk gets written to persistent storage and emitted to all connected listeners:

/**
 * Writes an event (chunk) to persistent storage (so that
 * resuming may occur) and emits the event to the event stream.
 * Called when LLM updates come in
 */
private async writeEvent(chunk: any) {
  const [event] = await this.db
    .insert(db.sseEvents)
    .values({
      event: 'chunk',
      data: JSON.stringify(chunk),
    })
    .returning()

  this.eventEmitter.emit('sseEvent', {
    data: JSON.stringify(chunk),
    event: 'chunk',
    id: event.id,
  })
}

Any connected client subscribes to the emitter and gets events forwarded:

// In practice the callback stays open (awaiting the in-flight LLM run or the
// client disconnect) so Hono doesn't close the stream immediately
const stream = streamSSE(c, async (stream) => {
  const callback = (event: any) => {
    stream.writeSSE(event)
  }
  this.eventEmitter.on('sseEvent', callback)

  // Detach the listener when the client disconnects so it doesn't leak
  stream.onAbort(() => this.eventEmitter.off('sseEvent', callback))
})
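
Putting those pieces together, here's a hedged sketch of what handleChat can look like inside the DO, assuming the AI SDK's streamText with the Anthropic provider. The model id, the messages table, and the loadMessages helper are illustrative placeholders rather than our exact code.

import { createAnthropic } from '@ai-sdk/anthropic'
import { streamText } from 'ai'
import { streamSSE } from 'hono/streaming'

// ...inside ChatObject
async handleChat(c: Context) {
  const { prompt, conversationId } = await c.req.json()
  const messageId = crypto.randomUUID()

  // Persist the user message to the DO's local SQLite; no primary-DB round trip
  await this.db.insert(db.messages).values({ conversationId, role: 'user', content: prompt })
  const history = await this.loadMessages(conversationId) // local read, hypothetical helper

  const anthropic = createAnthropic({ apiKey: this.env.ANTHROPIC_API_KEY })

  return streamSSE(c, async (stream) => {
    const result = streamText({
      model: anthropic('claude-sonnet-4-5'), // placeholder model id
      messages: history,
      abortSignal: this.abortController.signal,
    })

    // Each text chunk is persisted for replay, emitted to other listeners,
    // and written onto this request's own SSE stream
    for await (const text of result.textStream) {
      const chunk = { type: 'text', content: text, messageId, conversationId }
      await this.writeEvent(chunk)
      await stream.writeSSE({ event: 'chunk', data: JSON.stringify(chunk) })
    }
  })
}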

Reconnection & Replay

When a client reconnects, it can resume right where it left off. The standard mechanism for this is the Last-Event-ID header, which browsers automatically send when an EventSource reconnects. We expose a replay endpoint on the DO:

async handleReplay(c: Context) {
  const lastEventId = c.req.header('Last-Event-ID');
  const startAfter = lastEventId ? parseInt(lastEventId, 10) : -1;

  const events = await this.db
    .select()
    .from(sseEvents)
    .where(
      and(
        eq(sseEvents.conversationId, this.conversationId),
        gt(sseEvents.id, startAfter)
      )
    )
    .orderBy(asc(sseEvents.id));

  return streamSSE(c, async (stream) => {
    for (const event of events) {
      await stream.writeSSE({
        id: String(event.id),
        event: JSON.parse(event.data).type,
        data: event.data,
      });
    }
  });
}

The Worker routes replay requests to the same Durable Object:

app.get('/chat/:conversationId/events', async (c) => {
  const conversationId = c.req.param('conversationId');
  const id = c.env.CHAT_OBJECT.idFromName(conversationId);
  const stub = c.env.CHAT_OBJECT.get(id);
  return stub.fetch(c.req.raw);
});

Client Implementation

Here's an example of a client that connects to the stream and handles reconnection:

function connectToChat(
  conversationId: string,
  onEvent: (event: z.infer<typeof ChatSSEStreamEvent>) => void
) {
  let lastEventId: string | undefined;

  function connect() {
    const url = new URL(`/chat/${conversationId}/events`, BASE_URL);
    const eventSource = new EventSource(url.toString());

    eventSource.addEventListener('text', (e) => {
      lastEventId = e.lastEventId;
      onEvent(JSON.parse(e.data));
    });

    eventSource.addEventListener('metadata', (e) => {
      lastEventId = e.lastEventId;
      onEvent(JSON.parse(e.data));
    });

    eventSource.addEventListener('error', (e) => {
      // EventSource fires 'error' on disconnect; it'll auto-reconnect
      // and send Last-Event-ID, so the DO replays missed events
      console.warn('SSE connection error, reconnecting...');
    });

    return eventSource;
  }

  return connect();
}

Background Processes

We moved as much as we could to the background with queues to allow retries. All of these optimizations were made to ensure that the path from request → LLM response was as fast as possible.

// Run background tasks for the chat
c.executionCtx.waitUntil(
  (async () => {
    // Defer abuse detection, will use KV to block next request
    await abuseDetectionService.runAbuseDetection({
      userId,
      ipAddress,
      userAgent,
      /* etc... */
    })

    // Telemetry logging
    await logEvent(userId, 'chat_request', {
      llm_provider: selectedModel.provider,
      model: selectedModel.meta.displayName,
      conversation_id: conversationId,
      // ...
    })

    // Update the conversation title
    if (newConversation) {
      await updateConversationTitle({ userId, prompt, /* ... */ })
    }
  })()
)

File Handling

One exception to the zero-database-hit rule is file handling. Our desktop client checks the size of the file and, if it's under the maximum Durable Object key-value size (2 MB), uploads it as a Base64-encoded string in the JSON body. This bypasses the entire file-upload flow, making smaller files instantly available to the Durable Object. Uploading the file to our Google Cloud Storage bucket then happens in the background.
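
A rough client-side sketch of that decision follows. The helper names are placeholders, and note that Base64 inflates payloads by roughly a third, so in practice you'd check against the encoded size.

function toBase64(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader()
    reader.onload = () => resolve((reader.result as string).split(',')[1]) // strip the data: prefix
    reader.onerror = () => reject(reader.error)
    reader.readAsDataURL(file)
  })
}

const MAX_INLINE_BYTES = 2 * 1024 * 1024 // matches the SQLite-backed DO value limit

async function attachFile(file: File) {
  if (file.size <= MAX_INLINE_BYTES) {
    // Small file: Base64 goes straight into the chat JSON body, so the DO can
    // use it immediately; the backend copies it to GCS in the background
    return { inlineBase64: await toBase64(file), name: file.name, mimeType: file.type }
  }
  // Larger files take the normal upload path first (uploadViaSignedUrl is hypothetical)
  return { uploadId: await uploadViaSignedUrl(file), name: file.name, mimeType: file.type }
}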

Deploying to Production

If you haven't used wrangler locally before, you'll be asked to sign in with your Cloudflare account when running either of the commands below.

Environment Variables

Store your Anthropic API key as a secret:

npx wrangler secret put ANTHROPIC_API_KEY

This makes it available as c.env.ANTHROPIC_API_KEY inside your Worker and DO without checking it into source control.

Deploying

From there, deploying is a single command:

npx wrangler deploy

Your Worker is now running globally, and Durable Objects will be created on demand as chat requests come in. No provisioning needed; Cloudflare handles everything.

Results

After moving to Durable Objects:

  • P90 time to first token improved by 90%, dropping to 300 ms
  • We stopped running into chat concurrency issues
  • Follow-up messages became faster than the initial message because the Durable Object stayed hot

The move to DOs also proved incredibly valuable once we shifted to agentic chat. We can even spawn multiple DOs as subagents, each with its own resources (thread, storage, etc.), which communicate back to the main agent. We'll be publishing our next blog post covering agentic chat soon, so stay tuned.
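
The mechanics use the same namespace binding shown earlier. Here's a hedged sketch; the /subtask route and payload shape are illustrative, not our production protocol.

// Inside the main agent's DO (sketch)
async spawnSubagent(task: string) {
  // Each subagent gets its own DO, with its own SQLite storage and lifecycle
  const id = this.env.CHAT_OBJECT.newUniqueId()
  const stub = this.env.CHAT_OBJECT.get(id)

  const res = await stub.fetch('https://do/subtask', {
    method: 'POST',
    body: JSON.stringify({ task, parentId: this.ctx.id.toString() }),
  })
  return res.json()
}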

Conclusion

Durable Objects have become a core primitive for us at Highlight, powering chat and enabling many of our other real-time systems like audio transcription. We've also been experimenting with Cloudflare Workflows, which are backed by Durable Objects, and we've found it easy to build robust background processes with them, especially when handling LLM-related faults.