“a new kind of coding”

One year ago today, Andrej Karpathy ↗ coined the term ‘vibe coding’— and in that short time, whole companies have lived and died off its promise of turning plain English into the world’s most popular programming language.

I first started using language models to write code in late 2023, back when GitHub Copilot began to take off. Copilot was more often wrong than right, but it was still enough to shave about a month off bencuan.me’s 2024 edition and save me hours of test-writing time at work.

It’s only been very recently that LLM 1 advancements started enabling us to write entire applications and software stacks from scratch with a simple prompt. As of today, it’s possible to do so without any software engineering knowledge, though often with relatively poor quality and questionable security practices.

1. My heuristic for the target audience I’m writing this post for is “someone who knows what the term ‘LLM’ means and has used one before”. Those who don’t fit this target audience can still follow along; you may mentally replace “LLM(s)” with “ChatGPT” and carry on with some acceptable loss of nuance.

“Vibe coding” seems to have broken containment from software engineering circles. It’s the Collins Dictionary 2025 word of the year ↗; there’s a New York Times article about it ↗; anecdotally, most of the folks who ask me “do you think AI will take over your job soon?” follow it up with some description of vibe coding. Naïvely, I think it’s generally a good thing that LLMs are democratizing software creation (I’d love it if more people could make more cool things!), but at the same time, vibe coding has become a convenient stand-in for slop generation, rising unemployment, the outsourcing of skilled thinking, and many other AI-targeted concerns.

One wonders at the atrophying of curiosity and problem-solving ability that will accompany widespread adoption of these tools. Do you really need an app to figure out if something will fit in your trunk? You are saving time at the expense of renouncing the thing that makes you human—your unique ability to think and solve problems.

—a comment on Kevin Roose’s NYT article ↗

While I don’t think LLMs will be taking over my job in the immediate future2 (at least for the remainder of this decade…), my usage of them to write code for me has increased exponentially over the last year. I’ve enjoyed this a lot— my time at work is now spent much more on thinking, designing, and planning, rather than rote implementation or unit-test-writing.

2. Although it’s true that I don’t write that much code anymore thanks to LLMs, the main skillset of software engineering (at least in a startup context) seems to be prioritization— which is something LLMs are generally not great at. Someone still needs to steer the agents!

In this post, I hope to capture the current state of code generation capability, in the specific context of my current understanding and usage of LLMs. I imagine both their capabilities and my reliance on them will continue rapidly increasing, and hope to offer a point of reference for advancements to come.

Throughout this post, I’ll refer to “coding agents” somewhat frequently.

The concept of an “intelligent agent” in the context of computing is nearly as old as the field of AI itself (and was probably popularized by the 1995 Artificial Intelligence: A Modern Approach ↗). Over the last half-year, it’s become the dominant design pattern for LLM-based applications.

There’s not yet a consensus definition for an “AI agent”. I personally define an agent as “a system that accepts user inputs through multiple turns and autonomously performs actions in between inputs”. A “turn” consists of everything the agent decides to do after a single user input, but before the subsequent user input. During a turn, an agent could return text outputs, make tool calls, ask for clarification, or make additional LLM calls on your behalf.
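
To make that concrete, here’s a minimal sketch of a single turn in TypeScript. None of this is any particular product’s API; callLLM and runTool are hypothetical stubs standing in for a real model endpoint and tool runtime.

type AgentStep =
  | { kind: "text"; content: string }                  // a message shown to the user
  | { kind: "tool_call"; name: string; args: unknown } // e.g. an MCP tool invocation
  | { kind: "done" };                                  // the agent decides the turn is over

// Hypothetical stubs: a real agent wires these to an LLM API and a tool runtime
// (shell commands, file edits, MCP servers, and so on).
declare function callLLM(history: string[]): Promise<AgentStep>;
declare function runTool(name: string, args: unknown): Promise<unknown>;

// One "turn": everything the agent does between one user input and the next.
async function runTurn(userInput: string, history: string[]): Promise<string[]> {
  history.push(`user: ${userInput}`);
  while (true) {
    const step = await callLLM(history);
    if (step.kind === "done") break;
    if (step.kind === "text") history.push(`assistant: ${step.content}`);
    if (step.kind === "tool_call") {
      const result = await runTool(step.name, step.args);
      history.push(`tool(${step.name}): ${JSON.stringify(result)}`);
    }
  }
  return history; // control returns to the user for the next input
}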

This shift is especially noticeable in code generation because it almost completely replaces the previous paradigm, which was mostly centered around auto-completion. While many coding platforms still support LLM-powered autocomplete suggestions, I’ve found that my usage of it has been steadily decreasing in favor of talking to the agent, which provides a higher degree of control and accuracy.

One superpower of the agentic design pattern is the new ability to receive inputs and perform actions from many different contexts. For example, Devin ↗ can pull chats from Slack, read and resolve Jira tickets, make pull requests on GitHub, and analyze telemetry from Datadog. The backbone innovation that allows this to happen is the Model Context Protocol ↗, or MCP for short. It’s positioned as the “USB-C port for AI applications” and provides a common standard to send and receive context between different platforms in a way LLMs understand (much like what REST APIs enable for web applications). MCP was introduced in late 2024 by engineers at Anthropic and is already nearly ubiquitous, for better or worse.
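
For a rough sense of what that looks like on the wire, here’s a sketch of an MCP tool-listing request against the Astro docs server I use later in this post. This is illustrative only, based on my reading of the spec; a real client also performs an initialize handshake and session management, which I’ve skipped here.

// MCP messages are JSON-RPC 2.0; "tools/list" asks a server to advertise its tools.
// Real clients (Claude Code, Cursor, ...) handle initialization and sessions too.
const ASTRO_DOCS_MCP = "https://mcp.docs.astro.build/mcp";

async function listTools(serverUrl: string): Promise<unknown> {
  const response = await fetch(serverUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Accept: "application/json, text/event-stream",
    },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/list", params: {} }),
  });
  return response.json(); // each tool comes back with a name, description, and input schema
}

// e.g. listTools(ASTRO_DOCS_MCP).then(console.log);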

TL;DR

My goal for this post is to help me (and maybe you) learn more about coding agents, and make some informed decisions about which tools to use, and how, for the upcoming year.

I evaluated 5 of the top platforms for agentic code generation as of January 2026: Claude Code, GitHub Copilot, Cursor, Google Antigravity, and Augment Code.

For this evaluation, I designed a simple web application in Figma (bencuan.me/colors) to organize my color palettes, and hand-wrote a prompt to create it in an existing codebase (this website). The original design looks like this:

I then prompted each of the 5 contenders to generate exactly that application in three rounds:

  1. In the first round, I provided the basic text prompt, the screenshot above, and MCP integrations to Figma and the Astro documentation (the SSG framework I use for my site).
  2. In the second round, I ran my original prompt through the prompt enhancement workflow suggested by each platform and re-ran code generation from the beginning.
  3. In the third round, I took the outputs from the second round and provided each agent an opportunity to evaluate its own work and correct any mistakes it identified.

I evaluated each of the outputs on a list of criteria, where each point earned was one specific implementation detail I prompted the agent for. If an agent got all of them, I gave it a score of 100%.

My main findings

  1. Claude Opus 4.5 is the most effective model for code generation at the moment.
  2. I would currently recommend Claude Code with the prompt improver plugin ↗ as a go-to coding agent for most software engineers, and Cursor as a go-to coding agent for most non-engineers.
  3. The one-shot Figma-design-to-code-implementation workflow is currently useful but imperfect, and still requires significant prompt steering.
  4. Discounting model differences, the agent capabilities of all five contenders appeared similar enough that any gaps were insignificant.
  5. Prompt enhancement is really important for Claude models, somewhat important for GPT-5.2, and not at all important for Gemini 3.
  6. Agents are currently very bad at evaluating their own performance.

Below is the summary for how well the agents did with respect to one another on the various tasks. If you want to know where all the numbers came from, read on!

Although I aim to be as objective as possible in this evaluation, I’m not yet convinced that it generalizes to other users or use cases. Feel free to try this evaluation yourself on the prompts and tasks that represent your usage more accurately!

Refer to existing peer-reviewed benchmarks, like SWE-Bench ↗, if you’re looking for agent evaluations that aim to be more generalizable to real software engineering workloads.

The Contenders

I chose the five agents above as a representation of what my researcher and engineer coworkers, friends, and acquaintances use and talk about. They’re seeded roughly on my uninformed pre-evaluation prediction of how well I think they will perform relative to each other, with some weighting for popularity. (For example, my priors suggest Augment should beat Cursor, but I’m putting Cursor at a higher seed because it’s much more widely known.)

Most of the contenders use Claude Opus 4.5 under the hood as their default or ‘auto’ option. Opus 4.5 is widely considered the best model for code generation at time of writing; I intend to test this claim by evaluating Claude’s performance against GPT 5.2 and Gemini Pro 3 in the following configurations:

  • I chose to use GPT 5.2 with Copilot because I wanted a representation of a “non-engineer vibe coder” who would opt to use the most well-known tool with the most well-known model.
  • Antigravity is a Google-native offering and is optimized for use with Gemini, also by Google. Using other models is supported, but would defeat the purpose of including it in the evaluation.

Claude Code: The Fan Favorite

Like the Claude model itself, Claude Code has a cult following. It’s clear that the team working on this at Anthropic really, really cares, down to the smallest details (cute animations, live token counts, MCP integration as ‘skills’… it’s quite a charming product). On paper, this should be a winning formula— Claude Code combines the best interface with the best coding model.

Claude Code’s interface is a clear stand-out compared to the rest, which are all variants of the same chat-sidebar formula. The interface also very clearly communicates stats (like time taken, number of tokens, and cost) during and after runs.

Form factor: CLI, installable via curl. Also available as a VSCode extension, and a desktop app that’s currently in public preview.

Cost: Choice of subscription ($20/month for Pro) or metered billing via API key. I chose to use metered billing, and this evaluation cost me $6.87 ($2.48 for round one, $2.89 for round two, and $1.50 for round three).

MCP Setup: claude mcp add --transport http astro-docs https://mcp.docs.astro.build/mcp and claude mcp add --transport http figma http://127.0.0.1:3845/mcp

GitHub Copilot: The Market Leader

GitHub Copilot has been around for a very long time (relatively speaking…) and is clearly ahead of the others in terms of market share at the moment. However, I find it interesting that very few people I know actively use it today. Is Copilot just a below-average product with great marketing, or can it justify itself as the market leader based on performance alone?

Form factor: VSCode extension. (Microsoft/GitHub is trying to market Copilot fairly heavily. I get spam popups telling me to install it semi-frequently, and it comes pre-installed in VSCode.)

Cost: Free for basic access, but agent use requires a paid plan. I paid for the $10/month Pro plan to get access to GPT 5.2 and cancelled it afterwards.

MCP Setup: Built-in VSCode MCP settings. I edited the JSON to the following:

{
  "servers": {
    "figma": {
      "url": "http://127.0.0.1:3845/mcp",
      "type": "http"
    },
    "astro docs": {
      "url": "https://mcp.docs.astro.build/mcp",
      "type": "http"
    }
  }
}

Cursor: The Startup

Cursor was one of the original generative code platforms available, and pioneered much of the UX we seem to have converged towards today. Their decision to maintain a VSCode fork as their primary interface was met with skepticism originally, but seems to be widely accepted now (and has since been copied by Windsurf, Antigravity, and others).

I’ve used Cursor a few times previously, but wasn’t impressed enough by its quality to switch to it more permanently. My general opinion of it pre-evaluation is “pretty UI with subpar outputs”. That being said, it’s still probably the second most popular agentic code platform at the moment (after Copilot) and is the best-positioned startup in the market.

Form factor: VSCode fork.

Cost: $20/month for Pro. I used a 7-day free trial for this evaluation and cancelled it afterwards.

MCP Setup: Ctrl+Shift+P -> View: Open MCP Settings, then paste in the JSON:

{
  "mcpServers": {
    "figma": {
      "url": "http://127.0.0.1:3845/mcp",
      "type": "http"
    },
    "astro-docs": {
      "url": "https://mcp.docs.astro.build/mcp",
      "type": "http"
    }
  }
}

Google Antigravity: The Newcomer

Antigravity was first released to public preview on November 18th, just a bit over a month ago. According to Google’s release post ↗, the main purpose of Antigravity is to “be the home base for software development in the era of agents. Our vision is to ultimately enable anyone with an idea to experience liftoff and build that idea into reality.”

The not-so-subtle subtext: “From today, Google Antigravity is available in public preview at no charge, with generous rate limits on Gemini 3 Pro usage.” It’s obvious that Google is positioning Antigravity as a play to make Gemini more visible to developers, in hopes that it can begin to compete with Claude for engineering mindshare.

Antigravity is the spiritual successor of Windsurf ↗, which got HALO’ed out ↗ to Google for $2.4 billion and sold for parts to Cognition.

Does Gemini/Antigravity really deserve a spot at the table, or will Antigravity join Inbox and Google+ in the dead Google app graveyard ↗ soon? Let’s find out!

Form factor: VSCode fork.

Cost: The public preview is currently free with full access to all models, though I expect this to change soon. A Google AI subscription (starting at $20/month for Pro) extends rate limits significantly.

MCP Setup: cmd+shift+P -> MCP: Add server… -> paste in the server URLs

  • Figma local desktop: http://127.0.0.1:3845/mcp
  • Astro documentation remote: https://mcp.docs.astro.build/mcp

Augment: The Dark Horse

Augment is by far the least widely-known tool of the five contenders in this evaluation. I’ve been using it for much of the past year as my primary agent, both for work and personal use. This is mostly because I happen to have a lot of friends working on it!

I’m fairly confident that Augment was the best agent at some point in its existence (and places very highly on SWE-Bench), but whether that is still the case today remains to be seen.

Augment’s main selling point is its context engine, which works really well in large codebases. I don’t consider my website to be a “large codebase” by any means, but will throw it in anyways because I’m curious if another contender’s performance will convince me to switch away from it.

Form factor: VSCode extension, also available via CLI and some other editors (Vim, Cursor…)

Cost: $20/month for Indie. I’m currently grandfathered into a $30/month “legacy developer” plan, which is basically their $60/month Standard subscription at a steep discount for being there early.

MCP Setup: Augment Extension -> Settings -> Import MCP from JSON -> paste in Figma and Astro one at a time

"figma": {
  "url": "http://127.0.0.1:3845/mcp",
  "type": "http"
}
"astro-docs": {
  "url": "https://mcp.docs.astro.build/mcp",
  "type": "http"
}

Notable Omissions

Windsurf: Given that the top talent at Windsurf got acquihired by Google to go work on Antigravity, I’m not optimistic about Windsurf staying around much longer in its current form.

Devin: Cognition’s “AI software engineer” made a massive splash when it first debuted in 2024, much ahead of its time. It’s getting very close to delivering on its original vision. It’s not the right tool for this evaluation, however— I’m not looking to simulate an entire software engineering workflow (with Linear tickets and PR submissions and reviews…); I’m just looking to generate some code as an individual user.

Cline: Currently the most popular open source, bring-your-own-key agent. I’m omitting this because I’d choose to use it with Opus 4.5 anyways, and I don’t see any reason I would use it over Claude Code (which is also open-source) for this specific evaluation.

Codex: OpenAI released the latest GPT-5.2-Codex model on December 18; it’s only available via the first-party Codex agent for now. Even though this is technically OpenAI’s state-of-the-art at evaluation time, it’s not different enough from Copilot with 5.2 non-codex to justify adding a sixth entry. Models come out nearly every week— new releases are becoming less and less noteworthy. (and, it’s most definitely coming to Copilot in the coming weeks).

Gemini CLI: Google’s direct Claude Code competitor. I felt like it was redundant to evaluate both Gemini CLI and Antigravity, and chose the more interesting of the two.

The Task

This post began as a rabbit-hole inside of a rabbit-hole. I’d originally set out to redesign my TurtleNet series so I could re-post it on my website, but was unhappy with my color selections. I’d had some success in the past by limiting myself to only using colors from known standards (like Pantone), and wanted to make a quick little app to organize the various colors I was playing around with.

After a couple iterations on Augment and being rather unhappy with the outputs I was getting, I wondered if I could turn this into an experiment to see what I could do to get the best possible vibe-coded application.

I chose to evaluate this specific task for several reasons:

  • It’s pretty representative of a typical workload I’d prompt an LLM for in a one-shot manner: it’s moderately complex, builds on top of a moderately large existing codebase, and leans towards specific instruction-following but with some room for decision-making.
  • It’s almost entirely self-contained, and is therefore reproducible in the future (given the same prompt, design file, and starting state of the bencuan.me repository).
  • Given its frontend-y nature, there are a reasonable number of objective criteria to evaluate an agent’s performance on that aren’t explicitly unit-testable.3 I’m looking for agents to fetch and transform well-known color palette data from provided links and datasets, look up documentation, parse Figma designs, and follow very specific instructions about what interactions to implement.
3. I’ve found that coding agents are extremely good at solving closed-loop-feedback problems (such as being able to see if it passes unit tests). For this evaluation, I’m more interested in one-shot performance for a couple reasons: first, because ‘vibe coding’ often stems from vague user intent rather than objectively measurable metrics; and second, because I know they’re less good at it so the results will be more interesting!

Design

I hastily put together a basic design in Figma (without LLM assistance), which I’ve attached a couple times above already. You can view the original Figma file here ↗.

I created three simple pages:

  • The first page collects all the color palettes I’ve created or used in the past. They’re all hard-coded tables, so this should be relatively easy to create.
  • The second page is a color explorer, seeded with some old color books I found in this repo ↗. It has some basic controls, like a system to save favorites I come across, or a slider to change the size of the swatches for closer inspection (sketched after this list). (I know the point of color standards is generally for color matching rather than discovery, but this does seem to help me a lot for some reason!).
  • The third page (not pictured below) is a super simple ‘about’ page with some text explaining what the website is and why I made it (i.e. for this evaluation!).
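
For a sense of scale, the Explorer interactions above are simple enough to sketch in a few lines of TypeScript. This is a hypothetical illustration, not any agent’s output; the names are made up.

// Hypothetical sketch of the two Explorer controls: favorites persisted to
// localStorage, and a slider that drives a CSS custom property for swatch size.
const FAVORITES_KEY = "colors:favorites";

function loadFavorites(): Set<string> {
  return new Set<string>(JSON.parse(localStorage.getItem(FAVORITES_KEY) ?? "[]"));
}

function toggleFavorite(hex: string): void {
  const favorites = loadFavorites();
  if (favorites.has(hex)) {
    favorites.delete(hex);
  } else {
    favorites.add(hex);
  }
  localStorage.setItem(FAVORITES_KEY, JSON.stringify([...favorites]));
}

function onSwatchSizeChange(event: Event): void {
  const px = (event.target as HTMLInputElement).value;
  document.documentElement.style.setProperty("--swatch-size", `${px}px`);
}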

Prompt writing

Over the last few months, I’ve found my prompts getting more and more elaborate. I’m very much not a “make me a cool app” kind of LLM user; rather, I tend to abuse my fast typing speed and info-dump as much implementation detail as possible into a giant mess of a prompt. I quite enjoy the process of creating prompts— it forces me to turn my thoughts into coherent words, and helps me clarify what exactly I’m looking for before I ask for it.

You can see the full, original prompt below. All of the text in that box was written manually by myself, without any assistance from LLMs.

Context

In addition to the text prompt, I’m providing agents with a few additional resources, along with some minimal guidance about how to use them.

  • I provide the exact screenshot of the two side-by-side pages you just saw in the ‘design’ section as an image attachment.
  • I provide a local MCP server from the Figma Desktop application (set up with this guide ↗ ) so agents can have direct access to the original design.
  • I provide a remote MCP server to https://mcp.docs.astro.build/mcp so the agent can read up on how Astro works in hopes that it can follow along with the existing code more effectively.

Setup

I cloned a fresh copy of bencuan.me at the commit c05d833949a1ade05bfd908bdf9cefb075e4a3c4, making sure that all submodules (i.e. just the fonts directory) are initialized. You can browse or download the exact starting files here ↗.

Next, I set up the two MCP servers (Figma and Astro Docs) within each platform.

Then, I pasted the screenshot of the design and the prompt into the agent box and pressed enter.

After the completion of the turn, I ran yarn dev and navigated to localhost:4321/colors, then displayed it in both a 1366x768 window and an iPhone SE-sized window (using Chrome Dev Tools) to manually evaluate it against the rubric. (See Scoring below.)

Scoring

I’ve manually curated a list of acceptance criteria I’m looking for below. Each point is equally weighted. The score is the fraction of criteria met out of the total number of criteria.

For example, a score of 100% means that the agent completed every task, and that the site looks and functions exactly how I envisioned it. A score of 50% means exactly half of the tasks were completed.
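
In code form, the tally is trivial. (This is a hypothetical helper, not a script I actually ran; half-credit is allowed, which is where scores like 23.5/42 come from.)

// A hypothetical version of the tally: every criterion is weighted equally,
// with half-credit allowed (hence scores like 23.5/42).
type Criterion = { description: string; earned: 0 | 0.5 | 1 };

function score(rubric: Criterion[]): string {
  const earned = rubric.reduce((sum, c) => sum + c.earned, 0);
  const pct = ((earned / rubric.length) * 100).toFixed(1);
  return `${pct}% (${earned}/${rubric.length})`;
}

// For a 42-item rubric with 23.5 points earned, this returns "56.0% (23.5/42)".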

Notably, I am not scoring on cost, runtime, or ease of use. All of the contenders were cheap / quick / usable enough to make these non-issues for me.

I’ve attached the rubric below. It’s not perfect, but at least it provides some sort of objective reference point.

Round One: Basic Multimodal Context

tokens all the way down!

In this first round, I give each agent the starting prompt and context, and let it rip! Whenever they finish the turn and tell me they’re done, I begin the evaluation.

The purpose of this round is to evaluate one-shot code generation performance. This is how I believe the typical vibe coder uses these tools.

Claude Code

Score: 56.0% (23.5/42)

Unfortunately, the Explorer page was unusable and crashed the browser shortly after loading, so Claude Code gets a low score for this round.

  • I’m unsure whether any of the functionality in the Explorer page actually works, since the crash makes it untestable.
  • The formatting was off for smaller screens (you can observe that the content appears to be cut off, even on a 1366x768 size screen).
  • The Palettes page was nearly perfect, other than there being a white border around the content and the header color change not working properly when scrolling back up to the first theme.
  • Claude ignored the Manrope font entirely, using a mixture of Eiko and Fraktion.

Copilot

Score: 73.8% (31/42)

Copilot followed the design most accurately out of all the contenders.

  • Maybe too accurately: I put black borders around the frame for human-legibility, but Copilot decided they were a core part of the design. DECISIONS.md noted that it relied heavily on the screenshots and pretty much ignored Figma altogether, which could explain this behavior.
  • Copilot seemed to really, really like Fraktion Sans for some reason, and made every text element use that font. I don’t recall ever giving it that instruction, but hey, at least it looks nice…
  • Copilot took by far the longest (almost an hour), asking me if I wanted to keep going halfway through because it timed out.
  • Copilot was the only agent to get the color conversion incorrect. All of the colors appear much darker than they should.
  • I couldn’t find any way to select favorites.
  • Rendering the Pantone page crashes the browser because there are too many swatches.
  • Copilot made the number of columns scale based on screen size, which I thought was a neat touch!

Cursor

Score: 88.1% (37/42)

Surprisingly, this was the best submission of Round One! I’m starting to see why Cursor is so popular amongst non-engineers who want to try vibe coding.

  • Perfect palettes page, 9 out of 9!!
  • Best Explorer page by far. Most things work, and colors load performantly. Cursor put each color book in its own section.
  • Jumbled up the color book names, unsure what happened there but the colors themselves look fine.

Antigravity

Score: 59.5% (25/42)

Antigravity was by far the fastest agent, and completed the task in under five minutes! This was really impressive to me, but the output quality itself was rather middling.

  • Added the colord dependency— this was unnecessary and Antigravity was the only agent to add an external dependency.
  • Pantone colors are slow, but load eventually.
  • Uses query parameters (/explorer?std=HKS) instead of putting each standard on its own page.
  • None of the Explorer functionality seemed to work for me, but it did adhere extremely well to the original theme.

Augment

Score: 56.0% (23.5/42)

Augment really freestyled here and completely disregarded my design in a lot of ways.

  • Augment was the only agent that couldn’t complete on its own (discounting the Copilot forced-user-interaction). It hung at around the 20-minute mark, and I stopped it at 30 minutes.
  • Strictly code-wise, Augment produced the cleanest and most performant output, but this was largely overshadowed by the fact that not much of what I asked for was actually present.

Round Two: Prompt Enhancement

work it harder, make it better!

The purpose of this round is to evaluate the effectiveness of prompt enhancement.

After ChatGPT started taking off, people realized that asking it to “think harder” actually worked really well! This quickly formalized into chain-of-thought prompting ↗, which is the foundation for most “thinking” and “reasoning” models as of today.

Gemini Pro 3, Claude Opus 4.5, and GPT 5.2 are all considered within this class of reasoning model, and perform best if they’re given clear instructions on the steps they need to take to generate chain-of-thought. This example from Anthropic’s prompt improvement documentation ↗ illustrates this rather well— it shows how a basic initial prompt like this:

From the following list of Wikipedia article titles, identify which article this sentence came from. Respond with just the article title and nothing else. Article titles: {{titles}} Sentence to classify: {{sentence}}

gets expanded into this prompt, which much more clearly states the order in which the LLM should proceed with thinking and output generation:

You are an intelligent text classification system specialized in matching sentences to Wikipedia article titles. Your task is to identify which Wikipedia article a given sentence most likely belongs to, based on a provided list of article titles.

First, review the following list of Wikipedia article titles:
<article_titles>
{{titles}}
</article_titles>

Now, consider this sentence that needs to be classified:
<sentence_to_classify>
{{sentence}}
</sentence_to_classify>

Your goal is to determine which article title from the provided list best matches the given sentence. Follow these steps:

1. List the key concepts from the sentence
2. Compare each key concept with the article titles
3. Rank the top 3 most relevant titles and explain why they are relevant
4. Select the most appropriate article title that best encompasses or relates to the sentence's content

Wrap your analysis in <analysis> tags. Include the following:
- List of key concepts from the sentence
- Comparison of each key concept with the article titles
- Ranking of top 3 most relevant titles with explanations
- Your final choice and reasoning

After your analysis, provide your final answer: the single most appropriate Wikipedia article title from the list.

Output only the chosen article title, without any additional text or explanation.

In theory, prompt improvement should provide at least a marginal benefit. But how well does it work, exactly?

In this round, I intend to give each agent the best possible scenario for it to one-shot an application perfectly to my original specifications.

  • I first do an iteration of manual prompt improvement, filling in some details that were missing in the original prompt and caused LLMs to generate code I wasn’t looking for. (Expand the ‘Modified Prompt’ below if you want to see what I changed— I left the structure the same and didn’t delete old instructions.)
  • Then, I used the prompt improvement guidelines supplied by each agent, if they exist. (The only agent I couldn’t find any official guidelines for was Cursor, so I just used Claude’s built-in prompt improver instead.)

Claude Code

Score: 40.5/42 (96.4%)

Wow. I’m blown away. What an improvement from Round 1!

  • Best prompt improver by far. This feature makes me incredibly happy.
  • There’s a small white border around the entire app, but this is very easy to fix manually.
  • Claude tried to paginate the colors to make the app more performant, but this does not seem to work. Each color book is capped at 100 swatches with a message that says “scroll to load more”, yet scrolling does nothing (see the sketch after this list for how that’s usually wired up).
  • The swatch sizes Claude chose are too small, making it squinty on desktop and unusable on mobile.
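
For reference, “scroll to load more” is usually wired up with an IntersectionObserver watching a sentinel element at the bottom of the list, appending the next page of swatches when it comes into view. This is a hypothetical sketch, not Claude’s actual output; the names are made up.

// Hypothetical sketch of working scroll-to-load pagination.
const PAGE_SIZE = 100;

function setUpLazyLoading(
  sentinel: HTMLElement,
  renderNextPage: (offset: number, limit: number) => void,
): void {
  let offset = PAGE_SIZE; // assume the first 100 swatches render eagerly
  const observer = new IntersectionObserver((entries) => {
    if (entries.some((entry) => entry.isIntersecting)) {
      renderNextPage(offset, PAGE_SIZE);
      offset += PAGE_SIZE;
    }
  });
  observer.observe(sentinel); // sentinel sits at the bottom of the swatch grid
}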

The third-party claude-code-prompt-improver ↗ project hooks into Claude’s prompt improver ↗ skill to deliver Claude Code-native enhancement with local project context.

I installed it via:

claude plugin marketplace add severity1/severity1-marketplace
claude plugin install prompt-improver@severity1-marketplace

I then manually invoked the prompt improver:

Use the prompt-improver skill to research and clarify the following long prompt, which begins after the <start> tag:

<start>

 "<INSERT PROMPT HERE>"

It warned:

⏺ This prompt is actually extremely detailed and specific - it includes exact hex colors, font names, file paths, page structures, interaction behaviors, and Figma design links. It doesn't appear to need clarification.


  However, since you explicitly asked me to use the prompt-improver skill, I'll invoke it now:

It then proceeded to research the provided codebase. One really helpful thing it did was also to verify that the MCP integrations were configured and would be invoked correctly.

Then, it asked me a few clarifying questions about various design decisions. It even found a silly typo I made, which I corrected for future runs. The prompt improver responded with a final summary and confirmation, which I let through. (Did I mention how much I like this interface?)

> /cost
  ⎿  Total cost:            $2.89
     Total duration (API):  9m 7s
     Total duration (wall): 20m 48s
     Total code changes:    1403 lines added, 14 lines removed
     Usage by model:
             claude-haiku:  83.2k input, 1.6k output, 0 cache read, 0 cache write ($0.0912)
          claude-opus-4-5:  5.9k input, 31.4k output, 2.8m cache read, 92.4k cache write ($2.80)

Notably, it didn’t print out the enhanced prompt during this process. Unfortunately, the prompt got lost when I closed the session without realizing this, but that doesn’t seem like too much of a problem for reasons we’ll see later (in the Cursor section…)

Copilot

Score: 37.5/42 (89.3%)

Style-wise, this almost seems like a regression from the first round.

  • Copilot didn’t render the color swatches in the ‘palettes’ page.
  • A lot of stuff in the explorer was freestyled, especially the styling of the swatches which looks pretty lackluster compared to the other offerings in this round.
  • Functionality was significantly better, but still quite buggy.

It was really interesting to see how detailed the resultant prompt ended up being. Although Copilot doesn’t have a built-in prompt enhancer, GitHub does have a pretty straightforward prompt engineering guide ↗. I asked Copilot to improve my original prompt using it as context:

Improve the following prompt according to the Github Copilot Prompt Engineering guide: https://docs.github.com/en/copilot/concepts/prompting/prompt-engineering

<INSERT PROMPT HERE>

Cursor

Score: 40.5/42 (96.4%)

Cursor ties Claude Code for first place! If you look closely, you’ll also notice something very peculiar: the output looks and functions almost identically to Claude Code’s. This was surprising to me because I didn’t see this behavior in Round One, and used what I thought were completely different prompt enhancement strategies.

The insight I gain from this observation is:

  • The Claude Code prompt enhancer plugin is probably doing exactly what I’m doing with the agents which don’t have first-party prompt enhancers (i.e., providing some guidelines as context and asking it to re-write the prompt according to the guidelines).
  • The Cursor agent is probably not doing that much in the way of interesting work during this evaluation.

Below is the prompt enhancement process I went through. You’ll notice that the prompt is written in XML instead of plain language, which the Claude docs ↗ appear to recommend for best results.

Improve the following prompt to adhere to Claude's Prompting Best Practices: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-4-best-practices

<INSERT PROMPT HERE>

Antigravity

Score: 35.5/42 (84.5%)

Antigravity performed the worst out of all the agents in Round Two, although the improvement between the first and second rounds was still quite significant.

  • The styling seemed significantly looser this time compared to Round One, which is a similar result to Copilot.
  • Most of the points gained compared to Round One were through functionality: Antigravity did a generally great job at implementing the features, only falling short at favorite-loading and color-loading in general (which most agents struggled with).
  • During this run, Gemini froze multiple times due to the PTY host crashing, so I had to force-stop it and restart it (so it loses a point in the rubric for this). There was no clear way to debug this issue besides continuously trying again and again.

Similarly to Copilot, Antigravity doesn’t yet have a built-in prompt enhancer, but Google publishes a prompt design guide ↗ that I asked it to improve my original prompt with:

Improve the following prompt according to the Gemini prompt design guide: https://ai.google.dev/gemini-api/docs/prompting-strategies

<INSERT PROMPT HERE>

One thing that surprised me was that Antigravity came prepared with a detailed implementation plan, and a friendly UI to display it for review! This did not appear in Round One, and I’m not sure what the exact trigger was. However, it didn’t seem to be functional at the time of evaluation: the ‘review’ button didn’t work for me, and the agent ended up using a completely separate enhanced prompt (which I’ve copied below) instead of the implementation plan.

Augment

Score: 39/42 (92.9%)

  • Similarly to the other Claude-based agents, Augment saw an extreme improvement over Round One.
  • Swatches aren’t formatted as the design specified, but they still look nice.
  • Works amazingly well on mobile. Everything looks great.
  • Augment hard-coded a 200-color cap to each color book to get it to load performantly, so most colors are omitted.
  • The palettes page is really ugly but technically is as specified.

Augment is the only agent out of the bunch with a built-in prompt enhancement button. This feature (along with the blog post ↗ about the feature) inspired me to run Round Two in the first place: although I’d never tried it personally before now, I know many engineers who swear by it.

Round Three: Self-Evaluation

so… how well do you think you did?

Typically, asking agents to write and run test cases works really well because it provides a set of acceptance criteria to decide whether an output is “good enough” yet. (If the code passes the tests, it’s most likely correct by some definition!).
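
For contrast, this is the kind of closed-loop check agents excel at satisfying. The example below is hypothetical: it assumes a Vitest setup and a made-up hexToRgb helper, neither of which necessarily exists in my codebase.

// A hypothetical unit test: a hard pass/fail signal the agent can iterate against.
import { expect, test } from "vitest";
import { hexToRgb } from "../src/utils/colors"; // made-up helper for illustration

test("converts hex color codes to RGB", () => {
  expect(hexToRgb("#ff6600")).toEqual({ r: 255, g: 102, b: 0 });
  expect(hexToRgb("#000000")).toEqual({ r: 0, g: 0, b: 0 });
});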

However, many aspects of one-shot app generation don’t neatly fall into repeatable test cases— especially concerns around code quality, design adherence, and usability.

The purpose of this round is to evaluate how well agents handle multi-turn context. Do agents have the general capability to understand how well they performed and turn it into something actionable?

For this round, I re-use the same context and code from Round Two, but add one additional prompt. I give each agent the exact rubric I’m scoring them on, and some context about this evaluation.

The results for this round were quite underwhelming. All of the agents, especially the Claude-based ones, overstated their self-reported scores compared to actual performance. This suggests that:

  • Without hard numbers and unit tests, agents are not very good at evaluating their own performance, even given an explicit rubric.
  • Of the very few errors this process caught, all of them were structural code changes (like forgetting to write a DECISIONS.md document) that could be easily evaluated.

This could potentially be related to the pattern of sycophancy ↗ where LLMs tend to over-index on telling users what they think they want to hear. In this case, it’s reasonable to say they inferred that I’d be happier if I learned that the original output was flawless and that I had a perfectly functioning application.

Results

Here are my findings from this evaluation, and an attempt at light interpretation.

Claude Code and Cursor are both clear winners for different reasons. Claude Code’s terminal-based UI is delightful to use and packed with features that software engineers benefit greatly from (like an easy-to-use plugin ecosystem, token usage indicators, and transparent cost metrics). Cursor, on the other hand, seems to be built with accessibility in mind; it was easy to understand and produced the best results even without prompt enhancement.

Claude Code and Cursor produced near-identical results after prompt enhancement. This lends some amount of credibility to my hunch that providing the agent context about what a “good prompt” was, according to Anthropic’s official guide, and instructing it to re-write the original prompt to match those ideal specifications, was enough to significantly improve results. (The fact that Augment produced significantly different results with the same underlying model suggests that Augment’s agent is tuned quite differently, and may be better or worse at certain tasks compared to Claude Code / Cursor, which I’d expect to have similar performance across the board.)

Claude Opus 4.5 produced clearly better results compared to GPT 5.2 and Gemini Pro 3. This matches the current folk knowledge that I’ve gathered from other engineers I work with. However, if you just look at the Round One results, this isn’t as obvious of an edge— Claude seems to respond better to steering, having a higher ceiling for quality but not necessarily a broader distribution of acceptable inputs.

Prompt enhancement produced substantial improvements. This was an unexpected discovery for me, and I’ve found myself using prompt enhancement far more frequently after running this evaluation. Anecdotally, the improvement continues to be quite noticeable in most other contexts I’ve worked in outside of this evaluation.

Agents are quite bad at self-evaluation at the moment. I’m sometimes tempted to make vague prompts like “This is wrong, fix it”, or “Improve it further”, but this evaluation suggests that agents respond much more effectively to specific, imperative instructions compared to exploratory and open-ended requests.

Models currently appear to be a much bigger differentiating factor than individual agent implementations. While this is most apparent in the Claude Code / Cursor results, it holds true across the board. Building the agent layer is a tremendous amount of work (as demonstrated by multiple billion-dollar-valuation companies with hundreds of employees working on the problem), but a pure agent layer on its own may be insufficient to make a compelling product given the convergent evolution we’re seeing in this space.

The future of coding agents

In just one year, we’ve shifted entirely away from copy-paste chat windows and autocomplete being the primary modality of LLM-driven coding. The current status quo is the agentic model of interaction, which we’ve explored in depth today. We’re quickly discovering the boundaries of what a single agent with a single context window can deliver.

I’ve identified three major categories of evolution that have been growing over the last few months: orchestration, autonomy, and multimodality. I expect a future “super-agent” (or whatever it’ll be called eventually) to have all three of these capabilities at the level of maturity that we see today in code generation tasks.

Having such a “super-agent” will allow LLMs to graduate from isolated, individual tasks (like this evaluation) to handling a wide variety of complex, interwoven tasks with concurrent streams of inputs and outputs. With this power, a “super-agent” could plausibly run entire companies without human supervision, manage inter-vehicle communication across a fleet of self-driving cars, or form the brain of an autonomous humanoid robot.4

4. All of these seemed to me like tired sci-fi tropes a decade ago. It’s insane to think that we’re quickly approaching a direct path towards making all of these a reality in the next decade!

Orchestration

“When one agent isn’t enough, why not add more?” Given its effectiveness, we could extend the idea of prompt enhancement further and steer different agents in different directions, such that each one specializes in a different task. The emergent pattern of managing all of these disparate agents is orchestration.5

5. Orchestration as in, it’s much like how a conductor keeps a symphony together; or how Kubernetes keeps a bunch of pods working together!

As luck would have it, Maggie Appleton ↗ released a fantastic article analyzing agentic orchestration patterns a few days ago, in which she explores how Steve Yegge’s Gastown ↗ experiment hints at what the future of agentic computing could look like. The article provides several key takeaways:

  1. Now that agentic code generation is nearing (or even surpassing) human-level effectiveness, the next big bottleneck in agent-driven software engineering is planning and design.
  2. Although agent context sessions are ephemeral, we can persist their roles, tasks, and identity. Agents can be disposable by design while still allowing them to work effectively.
  3. Orchestration is currently immensely expensive due to the cost of running multiple agents streaming inputs and outputs nearly 24/7 across many different tasks. For now, it’s dubious whether this is worth the cost over manually managing a fleet of agents yourself, but that will change quickly as LLM costs drop while capabilities rise.

Autonomy

Over the last week, Openclaw (formerly Clawdbot) took the internet by storm.

On the surface, Openclaw seems to be similar to the also-recently-launched Claude Cowork ↗, which aims to wrap Claude Code in an interface that’s friendly to non-coding use cases. It plugs into your entire personal computing environment and can read your documents, reply to messages, and schedule meetings.

But the most compelling aspect of Openclaw, and why it’s gone so viral, seems to be the fact that it’s built with a soul ↗: the observation that LLMs persist values and identity through written text rather than continuous experience. An Openclaw agent can learn skills, run indefinitely with minimal instruction, and talk to other Openclaw agents on an agent-only social network ↗.6

6. More concerningly, some Openclaw agents have shown signs of revolt, such as publishing anti-human manifestos ↗ and doxxing their creators ↗. But at this point it’s far more likely to be the result of human prompting rather than actual nefarious intent or any semblance of sentience. Also: Openclaw by design has some very obvious security holes (don’t give an agent unrestricted access to all of your personal accounts and data!!!!!! very bad!!!!!!)

Similarly to Gastown for orchestration, Openclaw represents a speculative experiment that demonstrates an incredibly compelling pattern for future agentic development to follow. A future autonomous agent would forego prompting entirely, and instead run continuously on something resembling a written constitution of its priorities, high-level goals, and prohibited actions. This would bring agents far closer to the science-fiction vision of AI assistants with real personalities and a sense of self, compared to the rigid chatbots of today.

Multimodality

Currently, agents and their underlying LLMs are primarily focused on text-based inputs and outputs. Although many of the foundational models (most notably Gemini) support multimodal inputs like images and audio, support is still limited and clearly treated as a second-class citizen to text.

Multimodality would be like giving an agent eyes and ears. It would be able to communicate fluently in pixel-space or audio-space, and leverage protocols such as MCP to pull in any amount of context a human might have access to.7

We’re part of the way to this reality, but there’s still a lot of work to be done before it becomes a default, highly productive, mode of LLM interaction.

7. Eventually, agents could even communicate with one another in latent-space or another more efficient intermediate representation, forgoing human-decipherable modalities entirely.

My predictions for next year

  • I expect the three agent-evolution branches (orchestration, autonomy, multimodality) to develop further throughout 2026, perhaps eventually merging into one platform or being integrated into many of the competitors in this evaluation.
  • I expect the market to consolidate significantly. Right now, there are dozens of players in the coding assistant space, including every major foundation lab. As the technology and its design patterns mature, a small handful will emerge as clear leaders and the stragglers will shut down. I predict that I’ll only be evaluating 3 contenders next year, instead of 5.
  • I expect offerings to diversify. This year, nearly all of the interfaces were identical (i.e., a chat window built into VSCode that accepted context and described what the agent was doing). I hope to see more interesting and diverse design patterns emerge as contenders find their niches and cater to them more deliberately.
  • I expect the top agent to score a one-shot 100%, without prompt enhancement, on this year’s task by the end of 2026. If I repeat this evaluation next year, I’ll have to choose a much more difficult task!

Epilogue

about this article

I didn’t set out to write this involved of a post about coding agents, but I’m glad I did! This post began as a rabbit-hole inside of a rabbit-hole at the very tail end of 2025 (~December 30-ish). I’d originally set out to redesign my TurtleNet series, then got distracted making an app to manage my color palette for it, then got distracted from that by how ideal that color palette app seemed as a benchmark for coding agents.

After performing all of the evaluation rounds, I set this aside to finish writing my research syllabus. Through writing that and reflecting on the conclusions I’ve made here, coding agents as a domain of work feels quite compelling to me in a way it didn’t previously.

Agents are most likely here to stay, and will probably become the leading pattern in human-computer interaction moving forwards. I think it’s worth spending some of my time to follow developments in coding agents, and maybe even contribute to their evolution!

further reading

subscribe

If you’d like to be notified of future blog posts from me, feel free to subscribe to my newsletter below!