Brilliant yet Clueless – The Superpowers: What AI Agents Actually Deliver in Development
Widget UI rebuilt in days instead of months, Azure costs reduced, network errors discovered – and an MCP server that generates widgets via prompt. Part 2 of our series on AI agents in software development.
Part 2 of the series “Brilliant yet Clueless – How AI Agents Are Changing Our Software Development”
Our Widget Builder UI – it was on the agenda at practically every Tuesday meeting. We should redo that. We know. But then came the next client project, the next sprint, and the topic slipped down the list again. We just kept making small improvements here and there – always one tiny step forward.
At some point, I just started. Not because I suddenly had time, but because I’d already learned a lot from using the AI agent on other projects – what works, what doesn’t, where to begin. Eat your own dogfood: if we tell our clients that AI accelerates development, we should be able to demonstrate that with our own products too.
The Widget UI – Why the First Attempt Wasn’t Enough
The first attempt was sobering. I gave the agent the task of reworking the interface – and the result was okay. Not bad, but not what I had envisioned either. The fault was mine: I had hoped the agent would guess my unspoken expectations. It didn’t, of course. Our brilliant colleague is no mind reader.
The breakthrough came with a simple question: How would I explain this to a brilliant but clueless intern?
So I did the real work: thinking.
First, a style guide – partly AI-generated, but curated and decided by me. Then mockups. Several variants, pitted against each other, compared, discarded, refined. Until there was a template that even our clueless colleague could work with.
From there, the agent built a completely new interface within a few hours. Cleanly structured, aligned with the style guide, easy to follow. Sure, there was still finishing work – smoothing edges, adjusting details. But the project that had been on the list for months was done in a few days.
The lesson: AI doesn’t get better through a better model. It gets better through better specifications. The effort shifts – away from coding, towards thinking and specifying.
Quick note: these aren’t all the insights that make AI work better, of course. But if I wrote everything down here, the article would be far too long. The rest comes from working with us 😄
From AI Chat to MCP Server: How Widget Generation Grew Up
Working on the interface gave me an idea: if the AI can rebuild widgets so well – can it also create new ones?
The first approach was pragmatic: an AI chat directly in our interface. Select options, enter a prompt, an n8n workflow generates the widget in the background. It worked.
But then it became clear: there’s more. If the entire interface were controllable via AI – not just generation, but also testing, configuring, validating – we’d have a different tool in our hands.
That’s how our MCP server for the Widget Builder came about. MCP – the Model Context Protocol – is an open standard that lets an AI model discover and call external tools. Our server exposes a set of endpoints that give an agent everything it needs: What functions are available? How is a widget structured? What rules apply?
In practice: a product manager describes what they need in plain language. The agent generates a working widget from that. What used to mean a prototyping cycle of days now happens in minutes. “Can we test what that would look like?” is no longer an effort – it’s a prompt.
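To make the idea concrete, here is a minimal sketch of what an MCP-style tool registry boils down to: each endpoint is a described function the agent can first discover, then call. The names and payloads are illustrative, not our actual implementation – a real server would use an MCP SDK rather than a plain dictionary.

```python
# Sketch of an MCP-style tool registry: each "endpoint" is a described
# function. The agent first reads the descriptions, then dispatches calls.
# Tool names and payloads are illustrative, not our real implementation.

TOOLS = {}

def tool(name, description):
    """Register a function as a callable tool with a human-readable description."""
    def register(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register

@tool("widgets/create", "Create a widget from a title and a template id")
def create_widget(title, template="default"):
    return {"id": 1, "title": title, "template": template}

@tool("templates/list", "Fetch the available widget templates")
def list_templates():
    return ["default", "chart", "table"]

def describe_tools():
    """What the agent sees first: which tools exist and what they do."""
    return {name: meta["description"] for name, meta in TOOLS.items()}

def call(name, **kwargs):
    """Dispatch a tool call by name - the agent combines these freely."""
    return TOOLS[name]["fn"](**kwargs)

print(describe_tools())
print(call("widgets/create", title="Sales overview"))
```

The point of the pattern: the agent is not locked into a fixed workflow – it reads the descriptions and decides for itself which calls to combine.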
When AI Validates Its Own Work
An insight that sounds trivial but changed everything: AI development only becomes truly good when the agent can validate its own results.
For a long time, I checked the output manually. The agent generates, I review. Works, but doesn’t scale.
The turning point was Playwright – a tool for automated browser testing. The agent opens a browser, clicks through the application, checks whether everything works, reports the result back. No theoretical “should be fine” – a real test with a real browser.
What that changed: I now dare to give the agent bigger tasks. Because I know that obvious errors won’t slip through. And the validation strategies we build in one project can be reused directly in the next.
Azure Costs: What Had Been Running Unnoticed for Months
Everyone knows cloud costs: you should keep an eye on them, but rarely do it thoroughly enough. Too many services, too many places where costs build up unnoticed.
I set the agent loose on our Azure infrastructure – with a structured prompt: infrastructure explained, search criteria defined. Unused resources, oversized instances, orphaned storage accounts.
The result was uncomfortably enlightening. An App Service Plan that had been running on too high a tier since a migration. Storage accounts with no active use. Resource groups with leftovers from test projects. Things that don’t stand out individually – but add up. Systematically, the agent worked through every line, without fatigue, without losing track.
The savings were significant enough to pay for the entire AI investment several times over. And that was just the first pass.
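The shape of such an audit pass can be sketched in a few lines: given resource records (hand-written here; in reality pulled from Azure), flag exactly the patterns we asked the agent to look for. The thresholds and field names are illustrative assumptions, not our actual criteria.

```python
# Sketch of the audit pass described above. Resource records are
# hand-written here; in reality they would come from Azure. Thresholds
# and field names are illustrative, not our actual criteria.

resources = [
    {"name": "plan-prod",    "type": "app_service_plan", "tier": "P2v3", "avg_cpu": 4},
    {"name": "stlegacy01",   "type": "storage_account",  "last_access_days": 400},
    {"name": "rg-test-2022", "type": "resource_group",   "resource_count": 0},
]

def audit(resources):
    """Flag oversized instances, unused storage, and empty resource groups."""
    findings = []
    for r in resources:
        if r["type"] == "app_service_plan" and r.get("avg_cpu", 100) < 10:
            findings.append((r["name"], "oversized: high tier at under 10% CPU"))
        if r["type"] == "storage_account" and r.get("last_access_days", 0) > 180:
            findings.append((r["name"], "no access in over 6 months"))
        if r["type"] == "resource_group" and r.get("resource_count", 1) == 0:
            findings.append((r["name"], "empty resource group"))
    return findings

for name, reason in audit(resources):
    print(f"{name}: {reason}")
```

The agent's advantage is not the rules themselves – it is that it applies them to every single resource, without skipping a line.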
Network Analysis: What We Thought Was Clean
One of our environments had been causing intermittent issues for months. Nothing dramatic, but persistent enough to keep coming up. We looked here, checked there – but could never properly isolate the problem.
At some point, I had the agent analyse the entire network configuration at once. Firewall rules, routes, DNS, everything together. Within minutes, it had found the root cause.
The difference: the agent sees the entire configuration simultaneously and cross-references it. That’s not a domain expertise advantage – it’s a capacity advantage.
What Saves the Most Time Day-to-Day
Beyond the big projects, it’s the small things that add up:
Documentation: The agent clicks through our product, takes screenshots, builds a structured guide from them. What normally costs me half a day – one prompt.
Translations: Context-aware, with consistent domain terminology across the entire product. No comparison to what Google Translate produces.
Scripts: “Write me a script that does X” – the agent delivers the script with error handling and a suggestion for how it fits into the existing pipeline.
The project manager who starts touching code: With AI support, someone on our team without deep programming skills can make small adjustments. The AI explains the code, suggests the change, verifies it. That changed the team dynamic – in a way nobody had expected.
Where the Real Strength Lies
After several weeks in production use, a pattern becomes visible. The agent doesn’t “program better” than a developer. Not for architectural decisions, not for domain-specific logic, not for anything that requires experience.
Its strength is different: it works through large volumes of code and configuration at a speed no human can match. It finds inconsistencies that get lost under time pressure. It doesn’t apply best practices “most of the time” – it applies them every time. And it checks its own work before I have to.
Our code quality has measurably improved – because the agent checks every pull request against our standards and lets nothing slip through.
But the very trait that makes it so powerful – it does what you tell it, no ifs or buts – is also its greatest weakness. We’ll talk about that in Part 3.
Next week: The story of the out-of-office assistant that independently negotiated contracts – and why the result was technically flawless yet still a disaster.
🔧 Tech Corner: From AI Chat to MCP Server
For those who want the details.
The Evolution Stages
Our widget generation went through three stages – and each taught us something important:
Stage 1: AI Chat in the UI
A simple chat interface in our product. The user selects options, enters a prompt, an n8n workflow processes the request in the background and generates a widget. Functional, but limited – the AI could only do what we had explicitly mapped as a workflow.
Stage 2: MCP Server
The leap from “AI receives a task” to “AI has access to an entire toolkit”. Instead of pre-built workflows, we provide API endpoints that the agent can flexibly combine:
┌─────────────────────────────────────────────────┐
│ MCP Server: Widget Builder                      │
│                                                 │
│ API Endpoints:                                  │
│ ├── /widgets/create    → Create widget          │
│ ├── /widgets/configure → Set parameters         │
│ ├── /templates/list    → Fetch templates        │
│ ├── /styleguide/get    → Load design rules      │
│ └── /docs/reference    → Read documentation     │
│                                                 │
│ + Context information:                          │
│   ├── What works, what doesn't                  │
│   ├── Core instructions & rules                 │
│   └── Known limitations                         │
└─────────────────────────────────────────────────┘
Stage 3: Self-Validation
The decisive step: the agent can open a browser with Playwright and test its own work – like a real user. Only then is the cycle complete: Generate → Test → Correct → Done.
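The Generate → Test → Correct → Done cycle can be sketched as a simple retry loop. Both the generator and the validator are stubbed here – in our real setup the validator drives an actual browser via Playwright – so all names and behaviour are illustrative.

```python
# Sketch of the Generate -> Test -> Correct -> Done cycle. Generator and
# validator are stubs: the real validator drives a browser via Playwright.
# The stub generator deliberately fails once, then fixes its output.

def generate(spec, feedback=None):
    """Stand-in for the agent: the first attempt lacks tests, the retry adds them."""
    return {"title": spec["title"], "has_tests": feedback is not None}

def validate(widget):
    """Stand-in for the Playwright run: returns (ok, feedback)."""
    if not widget["has_tests"]:
        return False, "integration test missing"
    return True, None

def build(spec, max_attempts=3):
    """Loop until validation passes, feeding failures back into generation."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        widget = generate(spec, feedback)
        ok, feedback = validate(widget)
        if ok:
            return widget, attempt
    raise RuntimeError(f"gave up after {max_attempts} attempts: {feedback}")

widget, attempts = build({"title": "Sales overview"})
print(f"done after {attempts} attempt(s)")
```

The essential design choice is that validation feedback flows back into the next generation attempt – without that loop, the agent delivers plausible output instead of verified output.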
The Prompt Library: Base vs. Specialisation
Our prompt library in the Git repository is structured in two tiers:
Tier 1 – Base Prompts (always loaded):
/prompts
/base
coding-standards.md → Code conventions, naming, patterns
git-workflow.md → How issues and PRs are created
documentation-style.md → How documentation should look
security-rules.md → What the agent must NOT do
Tier 2 – Specialised Prompts (loaded per task):
/prompts
/specialists
widget-generator.md → Widget structure, components, styling
code-reviewer.md → Review criteria, common errors
network-analyst.md → Network topology, security rules
cost-optimizer.md → Azure services, pricing, benchmarks
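At runtime, the two tiers might be assembled roughly like this: every base prompt is always included, one specialist is picked per task. This is a hedged sketch – the file contents are inlined as strings here instead of being read from the Git repository, and the section layout is illustrative.

```python
# Sketch of assembling a prompt from the two-tier library: all base
# prompts are always loaded, one specialist is chosen per task. Contents
# are inlined here instead of read from the Git repository.

BASE = {
    "coding-standards.md": "Follow our naming conventions and patterns.",
    "git-workflow.md": "Create an issue before code, then a PR referencing it.",
}

SPECIALISTS = {
    "widget-generator.md": "You generate widgets for our dashboard system.",
    "cost-optimizer.md": "You audit Azure resources for unused capacity.",
}

def build_prompt(specialist, task):
    """Concatenate base knowledge, one specialisation, and the concrete task."""
    sections = ["## Base"] + list(BASE.values())
    sections += ["## Specialisation", SPECIALISTS[specialist]]
    sections += ["## Task", task]
    return "\n\n".join(sections)

print(build_prompt("widget-generator.md", "Build a KPI widget for revenue."))
```

Keeping the base tier identical across all specialists is what makes behaviour consistent: the agent reviewing code and the agent generating widgets follow the same conventions.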
Example: Widget Generator Prompt (Simplified)
You are a frontend specialist on our team.
## Your Base Knowledge
[Base prompt: coding-standards.md is loaded automatically]
[Base prompt: git-workflow.md is loaded automatically]
## Your Specialisation
You generate widgets for our dashboard system.
Each widget follows this structure:
- Component: [Framework-specific details]
- Styling: Use exclusively our design system
- Data: Use the standard data model
- Tests: Write at least one integration test
## Your Task
[Populated dynamically per request]
## Rules
- Create a GitHub issue for each change BEFORE writing code
- Justify every design decision in the issue
- Create a PR referencing the issue
- Do NOT touch existing files that aren't part of your task
Why Self-Validation Changes Everything
Without validation, the agent delivers code that looks plausible. With validation, it delivers code that works. The difference sounds small but is enormous in practice. Only when the agent can test its own output do you dare to assign it larger tasks – because you know that obvious errors won’t slip through.
In Part 3’s Tech Corner, we’ll show what happens when the guardrails still aren’t enough – and why system rules outside the AI are sometimes more important than the best prompt.
Agent File #2: The Superpowers
What: AI agent in productive use over several weeks
Result: Widget UI completely rebuilt (in days instead of months), MCP server for widget generation built, Azure costs reduced, network errors found, automated documentation
Lesson: AI doesn’t get better through a better model – but through better specifications. The effort shifts from coding to thinking and specifying. And the real leverage comes when the agent can validate its own work.
This is Part 2 of the four-part series “Brilliant yet Clueless”. Part 1: “The New Colleague” | Part 3: “The Fails” will be published next week.