Designing in the age of agents

Notes on how my design process is changing now that AI can do the execution. Six parts, from research to measurement. Where it helps, where it doesn't, and where to start.

The bottleneck moved upstream

Premise

For most of my career, designers and engineers were the constraint. Execution took time. Iteration was expensive. Tooling and process optimized for output speed: reusable components, design systems that pre-decided things, handoff tools that cut translation loss.

AI broke that constraint. A capable team can go from a rough idea to a working interface in an afternoon. The new constraint is the quality of the intent going into the system, not the speed of the hands coming out of it. A team that knows precisely what it’s building, for whom, and why, and that has put that knowledge somewhere agents can actually use, will out-ship a team that just plugs AI into the old process.

The consequences run through everything. Research can’t live in Notion docs nobody opens. Design systems can’t live in Figma files that need human translation to become code. Principles can’t be slogans on a wall. And measurement can’t be a post-launch conversation between product and analytics.

Below is my best attempt at the end-to-end picture. Parts of it require engineering work most teams haven’t done. Some of the tooling is still rough. Where that’s true, I’ll say so.

AI made execution cheap, so the constraint moved upstream. The leverage now sits in how clearly your team knows its users and how well that knowledge is encoded for agents to act on.

The essay walks six layers of the shift: research as queryable infrastructure, an honest brain the org can ask, a design system rebuilt as machine-readable contract, principles you can actually apply, a prototype-is-the-handoff workflow, and a measurement loop that closes back into all of it.

None of this is theoretical, but several pieces require real engineering work. I’ll flag those as we go.

Research as infrastructure

Part 01

Most enterprise product orgs run research as a periodic activity. A team does discovery, interviews a handful of customers, writes a report, moves into execution. Six months later, a PM making a roadmap call can’t find that report, doesn’t know whether its conclusions still hold, and decides from memory.

That’s an architecture problem, not a discipline problem. The research was done well. It was just stored in a format that doesn’t survive contact with how decisions actually get made: quickly, in context, by people who weren’t in the room.

Treat research as infrastructure, not as a deliverable. Your org’s knowledge about its users should be as queryable and current as your production database. Anyone should be able to ask a question and get a grounded, cited answer in seconds. Gaps in those answers should show up as research opportunities instead of hiding inside assumptions.

Where the signal comes from

Enterprise teams have more signal than they realize, spread across surfaces nobody actively watches. Gong or Chorus call transcripts hold some of the most candid product feedback you’ll ever see, because customers tell salespeople things they won’t write in a survey. G2 and Gartner reviews come from verified buyers with enough product experience to form an opinion. NPS open-text, filtered by score band, separates “broken” from “valued.” QBR notes and support ticket clusters show what enterprise users actually run into at scale.

The signal exists. It’s scattered across systems with no shared schema and no way to ask one question across all of it.

Building the pipeline to consolidate the signal is real engineering: API integrations, a chunking and tagging strategy, an embedding model, a vector store. Most design teams don’t have that capability today, and pretending otherwise leads to a shiny demo that collapses the first time someone asks a serious question. Pick one source, build the pipeline well, and prove the value before expanding. Customer interview transcripts are usually the right first source. The interviews are already happening, the transcripts already exist, and the insights inside them are the most consistently misplaced.
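To make the "chunking and tagging strategy" step less abstract, here is a minimal sketch in Python. It assumes transcripts arrive as plain text with one speaker turn per line. The `Chunk` fields, the three-turn chunk size, and the keyword map are all hypothetical placeholders; a real pipeline would more likely tag with a model, and the embedding and vector-store steps are left out entirely.

```python
from dataclasses import dataclass, field

# Hypothetical keyword map; a stand-in for a real tagging model.
TAG_KEYWORDS = {
    "bulk-actions": ["bulk", "select all"],
    "permissions": ["permission", "role", "access"],
}

@dataclass
class Chunk:
    source: str                      # e.g. "interview"
    account: str                     # which customer the signal came from
    text: str
    tags: list = field(default_factory=list)

def chunk_transcript(lines, source, account, max_turns=3):
    """Group consecutive speaker turns into chunks of at most max_turns."""
    return [
        Chunk(source, account, "\n".join(lines[i:i + max_turns]))
        for i in range(0, len(lines), max_turns)
    ]

def tag_chunk(chunk):
    """Attach every tag whose keywords appear in the chunk text."""
    lowered = chunk.text.lower()
    chunk.tags = sorted(
        tag for tag, words in TAG_KEYWORDS.items()
        if any(word in lowered for word in words)
    )
    return chunk
```

Keeping `source` and `account` on every chunk is the part worth copying: without that metadata, later questions like "how many accounts said this" can't be answered.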

The three-tier trust model

Once signal is flowing, you need a model for how much authority to give it. Not all knowledge is equally reliable. A single Slack message from a CS rep and a validated JTBD map synthesized from forty interviews shouldn’t carry the same weight in a product decision, but without explicit structure they will.

Tier 01: Validated insight

What it is: finalized research such as JTBD maps, persona definitions, usability findings, and strategic synthesis. Owned by whoever holds research authority, whether that sits in research, product, or design. Not fixed to one function.

How it is used: citable in product decisions. Changes only through a deliberate review with human approval. A stakeholder with strong intuition cannot update it unilaterally. This boundary is political as much as technical, and enforcing it takes organizational will.

Risk: calcifies if the review process is too heavy. Set a lightweight bar: one reviewer, async, 48-hour turnaround.

Tier 02: Synthesized signal

What it is: AI-generated summaries of recent inputs, such as weekly ticket themes, NPS driver clusters, and interview pattern analysis. Directionally useful. Not individually citable as fact.

How it is used: for hypothesis generation, not decision justification. Refreshed weekly. When a Tier 2 theme appears across three or more weeks and two or more sources, it becomes a candidate for a Tier 1 spike.

Risk: teams treat it as Tier 1 because it's conveniently available. The system must label tier provenance on every answer.

Tier 03: Raw signal

What it is: individual Slack messages, single NPS comments, one support ticket. No individual Tier 3 item carries authority.

How it is used: queries against Tier 3 require a frequency threshold. "Fourteen enterprise accounts mentioned this in the last thirty days" is meaningful. "One person mentioned" is noise that should not surface as insight.

Risk: high volume, thin metadata. The ingestion pipeline needs automatic tagging that gets validated periodically, not assumed correct.
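Two rules in the tier model are mechanical enough to sketch in code: the Tier 2 promotion rule (three or more weeks, two or more sources) and the Tier 3 frequency threshold. The sketch below is one way to express them, not a prescribed implementation. `ThemeSighting` and both function names are hypothetical, and the default of five accounts is an assumption; the essay gives only the two extremes, fourteen accounts as meaningful and one person as noise.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThemeSighting:
    theme: str
    week: str      # any stable week label, e.g. "2025-W14"
    source: str    # e.g. "support", "nps", "interviews"

def tier1_candidates(sightings, min_weeks=3, min_sources=2):
    """Tier 2 themes seen in enough distinct weeks and distinct sources."""
    weeks, sources = {}, {}
    for s in sightings:
        weeks.setdefault(s.theme, set()).add(s.week)
        sources.setdefault(s.theme, set()).add(s.source)
    return sorted(
        theme for theme in weeks
        if len(weeks[theme]) >= min_weeks and len(sources[theme]) >= min_sources
    )

def passes_frequency_threshold(mentions, min_accounts=5):
    """mentions is an iterable of (account, text) pairs.

    Counts distinct accounts, so one account repeating itself stays noise.
    The default threshold is an assumed value, not one from the essay.
    """
    return len({account for account, _ in mentions}) >= min_accounts
```

Counting distinct accounts rather than raw mentions is a deliberate choice: it keeps one vocal customer from looking like a trend.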

One brain, every question

Part 02

Once research is structured and tiered, the org needs a way to consume it. Anyone on the team should be able to ask a question about users and get a grounded, cited answer without filing a research request or waiting for a quarterly synthesis.

The hard part isn’t retrieval. It’s honesty. A language model handed thin or tangential context will still generate a plausible-sounding answer, because the model is optimized to be helpful and refusing feels unhelpful. Building a system that says “I don’t have reliable evidence for this” takes deliberate work: careful prompts, confidence thresholds, ongoing calibration as the model itself shifts. Skip that and you get a system that gives confident wrong answers, which is worse than no system at all.
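A confidence threshold can start as a plain gate in front of the model. The sketch below assumes the retrieval step returns evidence items with a relevance score between 0 and 1. The `Evidence` shape, the 0.6 cutoff, and the two-item minimum are hypothetical values that would need calibration against real questions.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    citation: str
    tier: int          # 1, 2, or 3
    relevance: float   # retrieval score, assumed to fall between 0.0 and 1.0

def answer_or_abstain(question, evidence, min_relevance=0.6, min_items=2):
    """Abstain unless enough sufficiently relevant evidence was retrieved."""
    usable = [e for e in evidence if e.relevance >= min_relevance]
    if len(usable) < min_items:
        return {
            "question": question,
            "answered": False,
            "message": "I don't have reliable evidence for this.",
            "citations": [],
        }
    # Strongest tier first, then most relevant within a tier.
    usable.sort(key=lambda e: (e.tier, -e.relevance))
    return {
        "question": question,
        "answered": True,
        "message": None,  # the model call that writes the answer would go here
        "citations": [(e.tier, e.citation) for e in usable],
    }
```

The gate runs before any generation happens, so thin context never reaches the model as something to be helpful about.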

The interface matters less than the honesty and the underlying knowledge. Start with one interface done well. For most enterprise teams, that’s a Slack bot: one command, natural language, answers tagged with tier and citations. Once that proves out, expand: a Figma plugin, a spec assistant that flags evidence gaps when a PRD is written, and a build-time assistant that asks whether the engineer is about to rebuild something that already exists.

The brain learns, it doesn’t just answer

A static knowledge base decays. User needs change. The product changes. What was true of enterprise procurement workflows in 2023 isn’t necessarily true today. The brain needs a way to grow through use, not only through scheduled maintenance.

Two feedback loops make it work. First, every question the brain can’t confidently answer becomes an entry in a visible evidence-gap log. That log is the org’s research backlog, generated by what people actually need to know rather than by what a researcher predicts they’ll need. Second, when the admin interface accumulates enough answered questions on a topic, those answers become a candidate for a Tier 1 update. A human reviews and approves. The brain gets sharper. The org’s reasoning is preserved rather than walking out the door when a senior researcher leaves.
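The first loop, the evidence-gap log, can begin as a counter. This is a minimal sketch with a hypothetical `EvidenceGapLog` class; it normalizes topics only by case and whitespace, where a real version would need to group differently worded questions about the same thing.

```python
from collections import Counter

class EvidenceGapLog:
    """Counts questions the brain could not answer, grouped by topic."""

    def __init__(self):
        self._gaps = Counter()

    def record(self, topic):
        # Naive normalization; assumes callers pass a short topic label.
        self._gaps[topic.strip().lower()] += 1

    def backlog(self, min_asks=1):
        """Topics asked at least min_asks times, most-asked first."""
        return [(t, n) for t, n in self._gaps.most_common() if n >= min_asks]
```

Sorting by how often a gap was hit is what turns the log into a backlog: the most-asked unanswered question is the strongest candidate for the next research spike.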

The people who maintain the brain aren’t just curating knowledge. They’re deciding what the org’s understanding of its users should be, an editorial role that benefits from deep product and user experience regardless of which function the title sits in.

The system as executable truth

Part 03

A design system that lives only in Figma is a reference document. Someone has to interpret it and translate it into code. In a team using AI generation, a Figma-only system becomes a liability: the agent will ignore it and invent its own token values, producing code that looks right but breaks your design language at every hardcoded hex.

Treat the design system as a machine-readable contract, not as a collection of components. The contract specifies every token, every component, every permitted composition, every prohibited pattern, and the reason for each constraint. Agents read it as context. They don’t generate outside it. When they hit a need the contract doesn’t cover, they surface it rather than work around it.

What the contract contains

Five layers, at minimum.

Design tokens in JSON: every primitive and semantic value exported from Figma Variables or Tokens Studio, versioned alongside code, with one source of truth shared by Figma and engineering.

Component documentation as structured objects: parseable records for each component’s variants, permitted props, when to use it, when not to, and which tokens it references. Not prose.

Composite pattern recipes: the combinations of components that produce standard enterprise UI, such as data tables with bulk actions, permission-gated views, and empty states for read-only roles.

Brand language rules written as constraints, not inspiration. If a rule can’t be applied as a check, it’s a value, not a rule.

And a don’t-do library, the most under-documented layer in most teams. People document what to do. Almost nobody documents what was tried, what was removed, and exactly why. An agent without this layer will confidently regenerate every solution the team already evaluated and threw out. Each entry needs the mechanism of failure, not just the conclusion. “We removed the sidebar nav” doesn’t help. “The sidebar nav tested well in usability sessions but generated a 23% increase in help requests six months post-launch because users in large-feature-set accounts didn’t explore below the fold” prevents misapplication.
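Here is a rough sketch of what "machine-readable" could mean for three of these layers: tokens, structured component docs, and the don’t-do library. The JSON shape, the token names and values, and the `check_usage` function are all hypothetical. The `ContextualToolbar` component, its four-action limit, and the sticky-bottom-bar entry are borrowed from the Tuesday scenario later in this essay.

```python
import json

# Hypothetical contract fragment; a real one would be exported and versioned.
CONTRACT = json.loads("""
{
  "tokens": {"color.action.danger": "#B42318", "space.md": "12px"},
  "components": {
    "ContextualToolbar": {
      "props": ["actions", "selectedCount"],
      "max_primary_actions": 4,
      "tokens": ["color.action.danger", "space.md"]
    }
  },
  "dont_do": [
    {"pattern": "sticky-bottom-bar",
     "mechanism": "hidden by the browser toolbar on 13-inch laptops"}
  ]
}
""")

def check_usage(component, props, primary_actions=0, patterns=()):
    """Return a list of contract violations for a proposed component usage."""
    spec = CONTRACT["components"].get(component)
    if spec is None:
        # Outside the contract: surface the gap instead of improvising.
        return [f"{component}: not in contract, surface as a gap"]
    problems = [
        f"{component}: prop '{p}' not permitted"
        for p in props if p not in spec["props"]
    ]
    if primary_actions > spec["max_primary_actions"]:
        problems.append(
            f"{component}: more than {spec['max_primary_actions']} primary actions"
        )
    for entry in CONTRACT["dont_do"]:
        if entry["pattern"] in patterns:
            problems.append(
                f"rejected pattern '{entry['pattern']}': {entry['mechanism']}"
            )
    return problems
```

Note that the don’t-do entry carries its mechanism of failure, so the violation message explains why the pattern was rejected rather than only that it was.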

Two paths, honest assessment of each

There are two viable approaches to agent-assisted design system work right now. The right choice depends on team composition.

Recommended where available

Code-first: designers build in the system directly

The design system is a codebase. Components live in Storybook. Designers with engineering capability, or design engineers, build components directly in code, using the token schema as their constraint layer. Figma becomes a communication and exploration tool, not a source of truth.

Zero translation loss between design intent and implementation
Reliable for compound components and multi-theme systems
The system improves through actual use, not through documentation about use
Requires a design engineer or a designer who codes fluently

This is how Linear, Vercel, and Stripe operate. Not novel. Just the most direct path to a zero-handoff system.

Viable, with caveats

MCP-assisted: Figma to code via agent

Figma Variables become the token source. Figma MCP plus Cursor or Claude Code reads the component structure and generates code. Token names in the output match the codebase exactly.

Reliable for simple screens with flat token structures
Requires a highly disciplined component library with no custom overrides
Breaks today on compound components, multi-mode themes, and complex interactive states
Needs a designed failure path: what happens when generated code fails token validation?

Improving fast. If you can't do code-first today, build toward it. MCP is a transitional approach, not a permanent architecture.

Whichever path you pick, the prerequisite work is the same and it’s unglamorous: a token schema engineering has actually adopted, a component library audited for consistency with zero custom overrides in production, and component documentation specific enough that an agent doesn’t have to guess. This takes months. It’s what makes everything downstream possible, and most teams skip it because it doesn’t ship a feature.

Tools in this layer

Figma Variables
Tokens Studio
Style Dictionary
Figma MCP
Cursor / Claude Code
Storybook
Chromatic

Principles that interrogate

Part 04

Most product teams have design principles. Almost none of them function. They’re sentences on a slide deck that get cited in kickoffs and forgotten by the third sprint. They fail because they’re unverifiable. “We design for clarity” describes a preference. You can’t apply it to evaluate a specific decision, so it doesn’t actually influence specific decisions.

A principle worth having can be applied as a constraint, not just admired as a value. Specific enough that a language model can ask: given this design description, does it satisfy this principle, violate it, or sit in tension with another? If your principle can’t be used that way, it’s not a principle yet. It’s a value.

What makes a principle evaluable

Four parts.

A short name. A one-sentence statement precise enough to apply to a real decision. A “this means” list of three to five concrete implications. And a “this does not mean” list of the most common misinterpretations. The last two parts are where the work lives. They’re also what makes a principle teachable to anyone who joins after it was written, AI agent included.
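The four-part format maps cleanly onto a small record. The sketch below is one possible encoding; the `Principle` class and its method names are hypothetical. The example borrows the name "Trust is earned in small moments" and the "show consequence before commitment" idea from the walkthrough later in this essay, but the wording of the lists is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Principle:
    id: str
    name: str
    statement: str
    means: list
    does_not_mean: list

    def problems(self):
        """Flag a principle that is not yet in evaluable form."""
        issues = []
        if not 3 <= len(self.means) <= 5:
            issues.append("'this means' should list three to five implications")
        if not self.does_not_mean:
            issues.append("'this does not mean' is empty")
        return issues

    def as_prompt_context(self):
        """Render the principle as plain text an agent can take as context."""
        lines = [f"{self.id} ({self.name}): {self.statement}", "This means:"]
        lines += [f"- {m}" for m in self.means]
        lines.append("This does not mean:")
        lines += [f"- {d}" for d in self.does_not_mean]
        return "\n".join(lines)

# Illustrative content only; the list items are invented for this sketch.
example = Principle(
    id="P03",
    name="Trust is earned in small moments",
    statement="Show the consequence of an action before the user commits to it.",
    means=[
        "Confirmation dialogs name what will change",
        "Destructive actions state whether they are reversible",
        "Counts and item names appear before bulk operations run",
    ],
    does_not_mean=["Adding a confirmation step to every action"],
)
```

A principle like "We design for clarity" fails `problems()` immediately, which is the point: the structure makes the missing work visible.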

One more test: if your principles could apply to any product in your category, they’re not your principles yet. “Reduce cognitive load” works for every enterprise SaaS product ever shipped. “Our product is used during high-stakes decisions under time pressure, which means we never place critical actions behind modals, and we always show the consequence of an action before the user commits to it” is specific enough to use. That specificity comes from knowing your users in a way your competitors don’t, which is why the knowledge work in Parts 01 and 02 carries weight for everything in this one.

Principles in conflict are the interesting case

Good principles will sometimes contradict each other in real situations. A principle about reducing decisions and a principle about surfacing complexity for power users will conflict whenever you’re designing for both a new enterprise admin and a veteran one in the same screen. The resolution is where judgment lives, and judgment doesn’t automate.

The principle agent’s job is to surface the conflict, not resolve it. Show which principles are satisfied, which are violated, which sit in genuine tension, before a decision ships rather than after. The resolution is human, documented in the admin portal, and over time becomes a richer encoding of the principle itself. The principle learns from every conflict it participates in.

The admin portal as institutional memory

The admin portal is the least glamorous and most important piece of this layer. It’s where anyone in the org can bring a question: why does this principle exist, what does it mean here, does this situation deserve an exception, does this experience suggest the principle needs to evolve?

People bring questions proactively, before making decisions, not because they were flagged. The team that maintains the portal doesn’t just answer questions; it curates the answers into documentation, and the documentation compounds. A PM who asks about secondary CTAs in Q1 adds to a record that lets an engineer make the same decision correctly in Q3 without asking anyone. The org gets genuinely smarter about design instead of depending on the same senior people for the same calls.

The prototype is the handoff

Part 05

The traditional handoff gap is a translation problem, not a communication one. Design tools and dev tools speak different languages. Every translation between them drops information. Specific error-message copy gets simplified, a carefully considered empty state gets skipped, and a loading state turns into a spinner with no context.

The goal isn’t to communicate the translation better. It’s to remove the step where translation happens. The designer’s intent and the engineer’s implementation reference the same source: the token schema, the component library, the pattern documentation, and the principle review that came before them.

Two models of prototype-to-production

One version still resembles waterfall: design fully in Figma, generate a prototype from the file, run automated checks and a principle review, open a PR, and have a developer confirm. This works for greenfield work on a mature, disciplined system.

The teams I see getting the most out of AI generation today aren’t doing this for discovery work. They sketch direction in Figma or on paper, then immediately write code alongside the sketch, use AI to generate variations, test those variations in a staging environment with real users, and treat Figma as a record of decisions rather than a source of production truth. Design and code evolve in parallel. The prototype isn’t a deliverable that precedes development. It is the development, done incrementally and tested early.

Both models share the same prerequisites: an adopted token schema, a component library with no custom overrides, and automated checks that validate both. What differs is the sequence.

What the automated checks cover

Before any human review, automated checks should validate:

No hardcoded values appear in generated code (every color, spacing value, and type style references a token).
Every element maps to a known component in the library.
The code passes an accessibility audit.
The output is compared against the don’t-do library for rejected patterns.
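The first check in that list, no hardcoded values, is the easiest to start with. Below is a deliberately crude sketch that scans generated CSS-like source for raw hex colors and pixel values. The regular expressions are heuristics and the function name is hypothetical; the hex pattern will also match an ID selector such as `#add`, so a production check would more likely parse the stylesheet or syntax tree.

```python
import re

# Heuristic patterns: a raw hex color, and a raw pixel measurement.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{3,8}\b")
RAW_PX = re.compile(r"\b\d+(?:\.\d+)?px\b")

def find_hardcoded_values(source):
    """Return (line_number, value) for each raw color or px value found."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern in (HEX_COLOR, RAW_PX):
            for match in pattern.finditer(line):
                findings.append((lineno, match.group()))
    return findings
```

An empty result means every value the heuristics can see goes through a token reference such as `var(--color-action-danger)`. A non-empty result is the checklist for the fallback review.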

A note on accessibility: tools like axe-core catch roughly thirty percent of real accessibility issues. The ones they miss are usually the worst ones: logical reading order, keyboard traps in complex interactive components, screen reader behavior in custom form controls, and focus management in dynamically rendered content. Automated checks are a floor, not a ceiling. Anything that paying enterprise customers will touch needs manual testing with assistive tech, not just a green CI badge.

What zero handoff actually means

It doesn’t mean no communication. It means no information loss. The developer never has to ask what color a component is, which component was intended, or what the empty state should contain. Every one of those questions was answered by the system before the developer saw the work. The PR includes the Figma frame, the principle review output, the token audit, the accessibility scan, the pre-defined success metrics, and a design rationale generated from the knowledge base queries made during the design session. The developer’s job is integration and confirmation, not interpretation.

When developers do ask questions, those questions are worth tracking. A pattern of questions about the same component or flow says the documentation is insufficient, or the component needs to be redesigned to be self-evident. The questions are data, not noise.

Failure mode

The design system has custom overrides

Generated code will contain hardcoded values. Token validation fails constantly. The pipeline breaks before it starts.

Fix

Audit and enforce no-override before introducing AI generation. Months of work. Do it first.

Failure mode

Generation fails and there's no fallback path

Figma MCP breaks on compound components or multi-mode themes. The team gets stuck between a broken pipeline and a manual handoff.

Fix

Design the failure path explicitly. A component that fails token validation falls back to manual review with the token audit as a checklist, not into the void.

Failure mode

Principles conflict and nobody resolves it

The principle review surfaces a tension between two principles. The designer doesn't have time to bring it to the portal. The conflict ships unresolved.

Fix

Principle conflicts block the PR, not just warn. Escalation to the portal takes five minutes and produces documentation. Make the path frictionless.

Failure mode

Accessibility treated as a CI check

axe-core passes. The component ships. Screen reader users hit a keyboard trap in a modal that automated tools didn't catch.

Fix

Define which component categories require manual assistive-tech testing. Complex interactive components always do. Document this on the component itself, not in a testing guide nobody reads.

Closing the loop, honestly

Part 06

Shipping is the moment your hypothesis meets the users you built it for. Everything before shipping is informed inference. Everything after is evidence. The process only improves if the evidence flows back into the system that produced the decision.

Product and design own the signals together, but ownership has to be specific or responsibility diffuses. A division I’ve found useful: product owns the what (did users do the intended thing, at what rate), design owns the how (how much effort did it take, what was the experience of doing it). These are different questions, measured differently. Conflating them is how “the metric is fine” masks a poor experience, and how “users love the design” obscures a retention problem.

The signals that matter and what they reveal

Task completion rate, defined per feature against a pre-launch usability baseline, tells you whether the design communicated the path. Time-to-value tells you whether that path was clear from the start of the session. Error recovery rate (the fraction of users who hit an error and successfully continued) tells you whether your error states actually help. Help-seeking behavior (docs visits, tooltip hovers, support tickets tied to a feature) tells you whether the design explained itself or needed external scaffolding. In-product micro-surveys, a single question asked right after task completion, produce qualitative texture that behavioral data can’t. And drop-off analysis, correlated against specific design moments rather than numbered funnel steps, identifies the frame where intent breaks, not just the stage.

Define these signals before launch, not after. The success criteria for a feature are part of the spec. The threshold at which a deviation triggers a design review is decided in the sprint, not in the post-mortem.
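Writing the criterion down before launch can be as literal as a record with a baseline and a review threshold. In the sketch below, `SuccessCriterion` and `error_recovery_rate` are hypothetical names, and the five-point drop is an assumed threshold chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessCriterion:
    metric: str
    baseline: float   # pre-launch usability baseline, as a fraction
    max_drop: float   # drop below baseline that triggers a design review

    def needs_review(self, observed):
        """True when the observed rate falls too far below the baseline."""
        return (self.baseline - observed) > self.max_drop

def error_recovery_rate(hit_error, recovered):
    """Fraction of users who hit an error and successfully continued.

    Returns None when nobody hit an error, since the rate is undefined.
    """
    return recovered / hit_error if hit_error else None

# Agreed in the sprint, stored with the spec (values are illustrative).
completion = SuccessCriterion("task_completion_rate", baseline=0.68, max_drop=0.05)
```

Because the threshold lives in the spec, the post-launch conversation starts from "the criterion tripped" rather than from an argument about what counts as a problem.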

The loop

Here’s how a real signal moves through the full system, from an anomaly in behavioral data to a sharper principle and a better-grounded next decision. One worked example.

[Interactive figure: signal travel loop]

This example follows a single signal, a drop in task completion rate on an enterprise configuration flow, through every stage of the loop. Each node shows what happens, what decision gets made, and what the system learns.

When the loop breaks

The worked example above shows a clean path. Real loops break, and the breakage modes are worth naming.

Research spikes sometimes produce contradictory findings: new interviews say one thing while existing Tier 1 documents say another. The contradiction is the signal. It means the research was done wrong, or the product changed enough to invalidate earlier findings, or the question is more nuanced than either document captures. Don’t average the findings. Hold the contradiction visible until it’s resolved, and document the resolution so the next contradiction of the same type moves faster.

Principle updates sometimes get blocked. A design team proposes an update to a principle based on clear evidence. A stakeholder disagrees. That’s a governance problem, not a process problem, and no process doc can fully fix it. What the admin portal can do is make the disagreement visible and documented, so it doesn’t recur invisibly in every subsequent decision on the same topic.

Metrics sometimes improve while qualitative feedback gets worse. A redesign raises completion rate from 68% to 79% while NPS comments shift from neutral to frustrated. That usually means the redesign optimized for the primary task at the expense of surrounding context or user confidence. Both signals are real. A framework that tracks only completion rate misses the downstream consequence.

These aren’t edge cases. They happen regularly on any team running a genuine measurement loop. Design the process to surface them, not to assume they won’t occur.

The sequencing question

Closing

If a team tried to implement all six parts at once, it would fail. The scope is too large, the tooling needs prerequisite work, and the organizational change for each part is real. The question that matters: given everything above, where do you actually start?

Three foundations create leverage for everything else. Order matters.

01

Stand up a basic agent from what you already have

Pull the design docs, research reports, and interview or sales-call recordings you can already access into NotebookLM, Claude Projects, or another lightweight tool. That's a research agent, today, with material you already own. Set a weekly reminder to feed it whatever's new: fresh transcripts, support recordings, NPS comments. Add the design system docs on top and it becomes a basic design agent. Layer the principles and it can start evaluating decisions. Each addition compounds, and none of it requires engineering.

Puts a working agent in front of the team this week instead of after a quarter of infrastructure work. The weekly cadence is what keeps it from going stale.

02

Discipline the design system before adding AI generation

Audit every component for custom overrides. Enforce the token schema with engineering. Document every component in the structured format agents need. Slow and unglamorous. Strictly required before either MCP or code-first can work reliably. Teams that skip this spend more time debugging AI-generated code than they save by generating it.

All of Part 05 depends on this. The faster it's done, the sooner the rest of the system pays off.

03

Rewrite two or three principles in evaluable form

Pick principles tied to your most important and most contested decisions. Rewrite them in the four-part format: name, statement, "this means," "this does not mean." Use them in the next design review. If they change the conversation, build the agent around them. If they don't, the problem is in the principles, not the tooling.

Evaluable principles improve review quality immediately with no infrastructure, and they reveal whether the org is ready to let principles constrain decisions. That's a cultural prerequisite for the rest of the system.

Everything else depends on these three. The brain needs the knowledge. The generation pipeline needs the clean system. The admin portal needs the principles. Start with these, and the rest becomes a question of sequencing rather than feasibility.

What this actually looks like on a Tuesday

A day in the system

The six parts above describe a system. This section describes a designer using it. Same feature, same morning, different question at each step. The feature is a bulk action toolbar for a data table. The designer has never built one in this product before.

Before getting into it, a quick vocabulary note, because the words agent and skill get used loosely and mean different things depending on who’s talking.

An agent is an AI that can take a sequence of actions to complete a goal. It doesn’t just answer. It decides what steps to take, takes them, checks the result, and continues. It can use tools, call APIs, read files, write files, and hand work off to other agents.

A skill (sometimes called a tool or function) is a single, bounded capability an agent can invoke. “Search the research corpus” is a skill. “Run axe-core on this component” is a skill. “Evaluate this design description against Principle P03” is a skill. Skills are atomic. Agents orchestrate them.
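The split between skills and agents can be shown in a few lines. This sketch registers skills as plain functions and runs them from a fixed plan. Everything here is hypothetical, including the skill names and the tiny catalog, and it simplifies one thing on purpose: a real agent chooses its next step from the previous result, while this one follows a list.

```python
# Registry of skills: each is a single, bounded capability.
SKILLS = {}

def skill(name):
    """Decorator that registers a function as a named skill."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("search_components")
def search_components(query):
    # Stand-in for a search over component docs.
    catalog = {"contextual action bar": "ContextualToolbar"}
    return catalog.get(query.lower(), "no match")

@skill("count_words")
def count_words(text):
    # Trivial second skill, only here to show that skills are independent.
    return str(len(text.split()))

def run_agent(plan):
    """Run (skill_name, argument) steps in order, stopping at an unknown skill."""
    results = []
    for name, argument in plan:
        if name not in SKILLS:
            results.append((name, "unknown skill"))
            break
        results.append((name, SKILLS[name](argument)))
    return results
```

Each skill knows nothing about the others. The orchestration lives in `run_agent`, which is where a real agent's decision-making would replace the fixed plan.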

A model context is the persistent knowledge an agent carries into every session: your design system docs, your principles, your research corpus. It’s what makes the agent specific to your product rather than generic. In practice, for most teams today, this is a Claude Project or a NotebookLM notebook loaded with the right documents. Not a custom RAG pipeline. Just the right files, in the right place, asked the right questions consistently.

Before opening Figma: ask the brain

The designer opens the Slack bot, or the Claude Project set up with interview transcripts, NPS responses, and research summaries, and asks: “What do we know about how users currently handle bulk actions in the table? Any friction points?”

The brain retrieves: two interview transcripts where enterprise admins described selecting rows one at a time because they missed the checkbox column, a Tier 2 cluster from support tickets about “bulk delete not working on filtered views,” and a Tier 1 note that users managing more than 500 rows treat bulk actions as a primary workflow, not an advanced feature. Citations included. Tier labels included.

The designer now has a grounded starting point. They didn’t invent the user need. They didn’t rely on a PM’s assumption from a kickoff meeting three weeks ago. They have three pieces of evidence from different signal sources, and they know which ones are validated and which are directional. That’s a different quality of intent going into the design.

Inside Figma: the component search skill

The designer wants to build a toolbar that floats above the table when rows are selected. Before drawing anything, they query the component library context: “Is there an existing pattern for a contextual action bar in our system?”

The skill here is search-over-component-docs. The agent finds a ContextualToolbar component, documented with permitted props, a note that it should never contain more than four primary actions, and a don’t-do entry: “We tried a sticky bottom bar for this pattern and removed it. It was hidden by the browser toolbar on 13-inch laptops and generated a 19% increase in ‘where did my toolbar go’ support tickets in the first month.”

The designer uses the existing component. Thirty minutes saved, and a known failure mode avoided, one that nobody would have remembered if it hadn’t been written down.

Hitting a gap: the evidence log as a byproduct

The designer wonders: should bulk delete be a primary button (red, destructive) or live behind a “More” menu? They ask the brain. The brain returns: “I don’t have direct evidence for this specific decision. The closest I have is a Tier 2 cluster about accidental deletions from four months ago, but it’s not specific to bulk operations.”

This is the honest answer. The designer notes it, adds it to the evidence gap log, and makes a judgment call: destructive actions behind a confirmation, not behind a hidden menu. They document the reasoning in the design file. If the same question comes up again, the gap log will surface it as a research candidate. If it never comes up again, no research was wasted on something that didn’t need it.

Finishing the design: the principle review skill

Before sharing anything, the designer describes the design in plain language to the principle agent:

“Bulk action toolbar appears when one or more rows are selected. Shows: count of selected rows, three action buttons (Export, Assign, Delete), and an X to deselect all. Delete triggers a confirmation modal that says ‘Are you sure?’”

The principle agent (a prompt with the four-part principle format as context, evaluating the description) returns:

P01 (Reduce decisions, not options): Satisfied. Three actions is within range. The deselect affordance is explicit.
P03 (Trust is earned in small moments): Flag. The confirmation modal says “Are you sure?” but doesn’t show what will be deleted or whether the action is reversible. Violates “show consequence before commitment.”
P04 (Complexity at the edges): Satisfied. Toolbar only appears in context, not in primary UI.
P05 (Fail loudly, recover gracefully): Flag. No undo path documented. If deletion is irreversible, the modal needs to state that explicitly. If it’s reversible, recovery should be one action.
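If the reviewer's output is going to feed a PR check later, it helps to treat it as structured data rather than prose. This is a minimal sketch under that assumption. The type and field names are illustrative, and the two statuses mirror the "Satisfied" and "Flag" labels in the output above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SATISFIED = "satisfied"
    FLAG = "flag"


@dataclass(frozen=True)
class Finding:
    """One principle's verdict on a design description (illustrative shape)."""
    principle_id: str
    principle: str
    status: Status
    reason: str


findings = [
    Finding("P01", "Reduce decisions, not options", Status.SATISFIED,
            "Three actions is within range."),
    Finding("P03", "Trust is earned in small moments", Status.FLAG,
            "Modal does not show what will be deleted or whether it is reversible."),
    Finding("P04", "Complexity at the edges", Status.SATISFIED,
            "Toolbar only appears in context."),
    Finding("P05", "Fail loudly, recover gracefully", Status.FLAG,
            "No undo path documented."),
]


def open_flags(results: list[Finding]) -> list[str]:
    """IDs of the principles the designer still needs to address."""
    return [f.principle_id for f in results if f.status is Status.FLAG]


flagged = open_flags(findings)
```

Keeping the reason next to each status is the point. A bare pass/fail would lose the part the designer acts on.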

The designer updates the confirmation modal to show the count and item names being deleted. They want to add an undo toast but don’t know if bulk delete in this product is reversible at the database level. So they bring that question to the admin portal: “Is bulk delete reversible? The answer changes what the design should say.”

A PM answers it within the day. The answer gets documented. Now every future designer working on any destructive action in this product has that answer without asking anyone. The org got a little smarter from a question that took five minutes to log.

Handing to engineering: the build-time skill

Before writing a line of code, the engineer queries the build-time assistant: “I’m about to build a bulk action toolbar that appears when table rows are selected. Does a pattern for this exist?”

The assistant surfaces ContextualToolbar, links the documentation, shows the permitted props, and flags the bottom-bar don’t-do entry. The engineer builds against the existing component. The PR they open includes: the Figma frame, the principle review output, the token audit from the CI check, and a short rationale paragraph the designer wrote covering the evidence they used and the judgment calls they made.

The code review takes twenty minutes. Every question about intent is already answered. The reviewer is confirming, not interpreting.

So what is the system, concretely

The scenario above uses four separate agents, each with a specific context and a specific skill. Here’s what each one actually is at minimum viable fidelity, and what the more sophisticated version looks like as the team grows into it.

The Design Brain

Answers questions about users from the research corpus. Surfaces evidence gaps. Labels tier provenance on every answer.

MVP today

Claude Project loaded with interview transcripts, NPS summaries, and research docs. You add to it weekly.

Sophisticated version

RAG pipeline over a vector store (Pinecone or Weaviate). Automatic ingestion from Gong, support tickets, NPS APIs. Tier labels enforced by metadata filters.
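As a rough illustration of tier labels enforced by metadata, the sketch below filters an in-memory corpus by tier before ranking. Keyword overlap stands in for vector similarity, so no vector-store API is shown. The sample chunks are made up, and the assumption that a lower tier number means stronger evidence is mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tier: int    # assumed convention: 1 = strongest evidence
    source: str


def retrieve(corpus: list[Chunk], query: str, max_tier: int) -> list[str]:
    """Filter by tier first, then rank by shared words; every hit carries its tier."""
    terms = set(query.lower().split())
    allowed = [c for c in corpus if c.tier <= max_tier]
    scored = [(len(terms & set(c.text.lower().split())), c) for c in allowed]
    hits = [(score, c) for score, c in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [f"[Tier {c.tier}] {c.text}" for _, c in hits]


# Made-up sample corpus, for illustration only.
corpus = [
    Chunk("users report accidental deletions in tables", 2, "support tickets"),
    Chunk("someone guessed bulk deletions are rare", 3, "chat thread"),
    Chunk("admins export tables weekly", 1, "interviews"),
]

strict = retrieve(corpus, "accidental deletions", max_tier=2)
loose = retrieve(corpus, "accidental deletions", max_tier=3)
```

Because the filter runs before ranking, a weakly sourced chunk cannot reach the answer unless the caller explicitly lets that tier in.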

Component search

Finds existing patterns in the design system before a designer or engineer builds something new. Surfaces don't-do entries.

MVP today

Claude Project loaded with your Storybook MDX files or Notion component docs. Ask it in natural language.

Sophisticated version

Structured search over the component library API. Integrated into a Figma plugin and an IDE extension. Auto-triggered when a new file or component is created.

Principle reviewer

Evaluates a design description against each principle. Returns satisfied, violated, or in tension, with a specific reason for each.

MVP today

A saved Claude prompt with the four-part principles as context. Designer pastes a design description. Takes two minutes.

Sophisticated version

Multimodal input reading actual Figma frames or live UI states. Integrated into the PR process as a required check before review.

Build-time assistant

Checks for existing components and patterns before an engineer builds. Flags redundant work and surfaces relevant don't-do entries.

MVP today

Same Claude Project as component search, surfaced in a different interface: an IDE chat window or a Slack command.

Sophisticated version

Git pre-commit hook or PR comment bot. Runs automatically when a new component file is created. Compares against the component registry and flags matches.
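The registry comparison at the heart of that hook can start out very simple. This sketch assumes components use CamelCase names and flags any registry entry that shares a word with the new name. The registry contents other than ContextualToolbar are hypothetical, and a real hook would read the new file names from git rather than take them as arguments.

```python
import re


def name_words(component_name: str) -> set[str]:
    """Split a CamelCase name into lowercase words: 'DataTable' -> {'data', 'table'}."""
    return {w.lower() for w in re.findall(r"[A-Z][a-z0-9]*", component_name)}


def possible_duplicates(new_name: str, registry: list[str]) -> list[str]:
    """Registry entries that share at least one word with the new component name."""
    new_words = name_words(new_name)
    return sorted(name for name in registry if new_words & name_words(name))


# Hypothetical registry; only ContextualToolbar appears in the scenario.
registry = ["ContextualToolbar", "DataTable", "ConfirmModal"]

matches = possible_duplicates("BulkActionToolbar", registry)
no_matches = possible_duplicates("AvatarStack", registry)
```

Name overlap is a crude signal and will produce false positives. As a prompt to go read the existing docs before building, that is probably acceptable.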

The sophisticated versions require engineering. The MVP versions require an afternoon and some discipline about keeping the documents current. The value isn’t in the automation. It’s in the habits: asking before assuming, logging what can’t be answered, documenting reasoning alongside decisions. The automation just makes those habits lower-friction.

What agents can't replace

Coda

Every part of this process involves AI doing something that used to require a human: retrieving and synthesizing research, generating code from a design, evaluating a decision against a principle, flagging a redundant build. The process exists to make those capabilities useful by surrounding them with the right structure.

What the structure can’t replace is the judgment about what the structure should be. Which questions the brain should be able to answer. Which principles deserve to be in the system. When a pattern is genuinely novel versus when it’s a known problem in new clothes. When the measurement loop is producing real learning versus false confidence.

These aren’t residual tasks left over after automation. They’re the consequential ones. The process here is designed to surface them more clearly and more often, so that the people whose judgment matters most spend their time on those calls rather than on the execution that surrounds them.
