Development is becoming free. What will you build?
A lot of the debate around AI coding tools is still circling a question that doesn’t really matter anymore: can the model write good code? Sure, sometimes it can. Sometimes it can’t. That was an interesting question when these tools were mostly party tricks and occasional productivity boosts. The question that matters now, especially if you’re a senior developer or you lead teams, is different: can the workflow sustain real engineering over time, or does it fall apart into regressions, rewrites, and the usual “I’ll just do it myself” moment?
The thing that finally pushed this from “interesting” into “oh, this changes things” for me was almost comically mundane. I was at home, near the end of the working day, and I checked on an autonomous run I’d left ticking away in the background. The terminal said “Complete”. My default assumption was that it meant “complete this step”, or “complete this milestone”, and that the next thing I’d see would be some familiar mess that needed tidying up. Instead it meant the whole sequence was done. Twenty milestones, end to end, without me stepping in to fix things, rescue the architecture, or spend the evening doing prompt-whack-a-mole.
If you’ve been trying these tools for a while, you’ll know why that feels unusual. The core failure mode in earlier models wasn’t really code generation. It was continuity. They’d make progress, and then quietly break something that already worked. They’d forget a decision you made ten minutes ago. They’d “fix” a bug by undoing an earlier constraint. You could get value out of them, but only if you stayed on top of them, and at that point you’re not really delegating implementation, you’re just taking on a new kind of management overhead.
This run didn’t feel like that. It felt like a workflow that could actually hold together long enough for the leverage to shift. And when the leverage shifts, the bottleneck shifts with it. Implementation stops being the scarce resource. The scarce resource becomes intent: working out what you want, describing it in a way that can be built, and then proving that the thing you got is the thing you asked for.
If you’ve seen AI coding claims before, you’re probably used to them being supported by a quick toy implementation: a webpage, or tic-tac-toe. This wasn’t a toy exercise. The build target was a Godot game, and the first milestone wasn’t “draw a sprite on screen”. It was building a rendering pipeline based on 3D signed distance fields, and then layering gameplay on top. If you’ve ever done shader work, you already know how quickly this sort of thing goes sideways. A black screen. Shader compilation errors. Garbage output because you’re in the wrong coordinate space. Distance functions that are almost right until you introduce movement and everything warps. Frame jitter because timing and state updates aren’t aligned. It’s the sort of work where “plausible code” is worthless, because the only thing that matters is whether the pipeline actually behaves.
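To make “whether the pipeline actually behaves” concrete: an SDF renderer rests on a distance function and a raymarching loop, and both have to be numerically right, not just plausible. Here’s a minimal CPU-side sketch in Python (the real project would do this in a Godot shader); `sphere_sdf` and `raymarch` are illustrative names, not code from the run:

```python
import math

def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere: negative inside,
    zero on the surface, positive outside."""
    dx, dy, dz = (p[0] - center[0], p[1] - center[1], p[2] - center[2])
    return math.sqrt(dx * dx + dy * dy + dz * dz) - radius

def raymarch(origin, direction, sdf, max_steps=64, epsilon=1e-4):
    """Sphere-trace along a ray: step by the distance the SDF reports,
    which is always safe. Returns the hit distance, or None for a miss."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < epsilon:
            return t
        t += dist
        if t > 100.0:  # ray escaped the scene
            return None
    return None

# A ray fired from z = -5 straight at a unit sphere at the origin
# should hit at t = 4 (5 units to the centre, minus radius 1).
hit = raymarch((0.0, 0.0, -5.0), (0.0, 0.0, 1.0),
               lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0))
```

Get the coordinate space or the distance function subtly wrong and you don’t get a compiler error, you get the warping and black screens described above, which is why the validation story matters more than the generation story.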
What I found interesting was how the run validated its own progress. It didn’t jump straight into building the whole game and hope for the best. It did the same kind of sanity checks most of us do when we’re wiring up something we don’t fully trust yet. Render a gradient to confirm the coordinate system. Render a known primitive to confirm the distance function and camera mapping. Take screenshots and confirm that what should be on screen is, in fact, on screen. Not pixel-perfect comparisons, just the basic qualitative check you’d do yourself: “yes, that’s the right thing, in the right place, behaving the way it should.”
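The screenshot checks were qualitative, but the first two rungs, gradient then known primitive, have crude automated analogues. A hypothetical sketch of that kind of sanity check, on a tiny CPU-side grid rather than a real render target (these helpers are mine, not the run’s):

```python
def gradient_image(width, height):
    """Brightness = normalised x. If the left edge is dark and the
    right edge is bright, the horizontal axis isn't flipped."""
    return [[x / (width - 1) for x in range(width)] for y in range(height)]

def circle_mask(width, height, radius):
    """Rasterise a circle SDF centred on the grid: True where the
    signed distance is negative (inside the primitive)."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    return [[((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 - radius < 0
             for x in range(width)] for y in range(height)]

grid = gradient_image(9, 9)   # coordinate-system check
mask = circle_mask(9, 9, 3)   # known-primitive check
```

The point of checks this simple is that they fail loudly and locally: a flipped axis or a broken distance function shows up at step one, not three milestones later.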
From there it climbed the ladder in a way that will feel familiar: ship rendering, movement, Newtonian physics, screen wrap, enemies, collisions, projectiles, scoring, sound, boss encounters, progression, upgrades, tunable difficulty. There were bugs along the way. Some milestones were incomplete on the first pass. That’s normal. The point is that it converged. By the end of the milestone run those bugs were resolved and the project was feature-complete in the ways the spec described.
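One rung of that ladder as code, for flavour: Newtonian movement with screen wrap is just integration plus a modulo. A hedged sketch (semi-implicit Euler, hypothetical signature, not the project’s actual code):

```python
def step(pos, vel, acc, dt, bounds):
    """Semi-implicit Euler: update velocity first, then position, then
    wrap position toroidally so the ship re-enters on the opposite edge."""
    vel = tuple(v + a * dt for v, a in zip(vel, acc))
    pos = tuple((p + v * dt) % b for p, v, b in zip(pos, vel, bounds))
    return pos, vel

# Drifting off the right-hand edge of an 800x600 screen wraps to the left.
pos, vel = step((799.0, 300.0), (10.0, 0.0), (0.0, 0.0), 1.0, (800.0, 600.0))
```

Each of these mechanics is small on its own; the hard part, and the thing this run got right, is adding them without breaking the ones underneath.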
It’s worth being precise about what “complete” means here, because this is where people tend to talk past each other. I didn’t adversarially test this build. I didn’t go hunting for edge cases. I wasn’t trying to break it. I validated it the way you’d validate a product milestone: you run the checks, you play it, you replay it, and you confirm the promised behaviours are present and coherent. That was the question I cared about: can an autonomous agent take a moderately complex spec and deliver it across many milestones without constant supervision? In this case, yes.
It also made one boundary very clear. The feel and difficulty tuning weren’t quite right. That’s not a failure, it’s a reminder. “Feel” isn’t primarily a correctness problem. It’s a taste and feedback problem. You still need a human to decide what “better” means, and to iterate based on playing the thing.
So why did this work when so many previous attempts haven’t? I don’t think the answer is “prompt engineering”. If anything, prompt engineering is the wrong mental model now. What worked here looked much more like a disciplined development process:
- Specification: define what you’re building in a way that can be decomposed
- Constraints: be explicit about architecture, non-goals, safety boundaries
- Verification: define what “done” means, and how you’ll prove it
In this case the spec didn’t start as a document, it started as an interview-style conversation that forced decisions. When I couldn’t give detail, the assistant offered structured options. That produced a PRD-sized artefact, and then it decomposed that into milestone documents with task lists and test requirements. That decomposition is doing a lot of the heavy lifting, because it keeps work local. One milestone, one definition of done, one verification path. Less drift, fewer accidental rewrites.
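For a sense of shape, here’s a hypothetical milestone document in that style, with one definition of done and one verification path. This is an illustration of the pattern, not one of the actual documents from the run:

```markdown
## Milestone 3: Ship movement

Tasks:
- Map input to thrust and rotation
- Integrate velocity with frame-rate-independent timing

Definition of done:
- Ship accelerates while thrust is held and coasts when released
- Behaviour is identical at 30 and 60 FPS

Verification:
- Automated: unit test on the integration step with a fixed delta
- Manual: screenshot after 2s of thrust shows the ship displaced along its facing
```

Everything the agent needs to decide “am I finished?” lives inside the document, which is what keeps the work local.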
If I compress the workflow into a sentence: I defined a set of features, constrained them with explicit test requirements, and verified the outcome through acceptance testing. The tools were Claude Code running the latest generation of Claude Opus models, with an autonomous loop that iterates through milestones until their checks pass, but the point isn’t the brand names. The point is that the combination of continuity + constraints + verification changes what’s practical to delegate.
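The loop itself doesn’t need to be exotic. A hypothetical sketch of the shape, with `agent` standing in for whatever tool you drive (the run used Claude Code, but this is not its API): each milestone carries its own check command, and the loop retries until that command exits cleanly.

```python
import subprocess

def run_milestone(agent, milestone, max_attempts=5):
    """Drive the agent on one milestone until its checks pass.
    `agent` is any callable that takes a milestone spec and edits the
    working tree; `check_cmd` is a shell command (e.g. a test suite)
    whose exit code defines "done"."""
    for attempt in range(1, max_attempts + 1):
        agent(milestone)
        result = subprocess.run(milestone["check_cmd"], shell=True,
                                capture_output=True)
        if result.returncode == 0:
            return attempt  # verified: checks pass, move on
    raise RuntimeError(f"milestone {milestone['name']} did not converge")

def run_project(agent, milestones):
    """Work through milestones in order; each has its own local
    definition of done, which is what keeps drift bounded."""
    return [run_milestone(agent, m) for m in milestones]
```

The verification step is the load-bearing part: without a check the agent can’t game or skip, “iterate until done” degenerates back into “iterate until plausible”.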
This is where the implications matter for senior developers, tech leads, and engineering managers. The obvious reaction is to jump straight to “does this replace developers?”, and I think that’s a dead end. The more immediate, useful question is: where does senior value sit when implementation throughput changes dramatically? In my view it moves towards:
- Translating needs into buildable specifications
- Defining constraints that prevent architectural drift and unsafe behaviour
- Designing validation so “done” is demonstrable
- Reviewing outputs with a maintenance mindset rather than trusting surface plausibility
In other words, it moves towards the things that good seniors already do, but it makes them even more central.
The sceptics are also right about the hard parts. Agents struggle with long-term maintenance judgement, and they can’t be trusted on security and performance without expert review. If you’ve spent years living with the consequences of “quick fixes”, you should be wary. The responsibility doesn’t disappear. If anything it becomes more important to be disciplined about constraints and verification, because fast output can become fast damage when it isn’t bounded. There’s also a practical boundary I’d hold today: running this unsupervised inside a mature production codebase is risky, because the real constraints in those systems often live outside the repository. They live in integrations, historical compromises, organisational memory, and awkward edge cases that nobody remembers until they break.
All of which leads to a pretty simple recommendation, and it’s one that you can’t outsource to reading: you need to try this yourself. Write a PRD for something moderately complex. Not a toy. Break it into milestones with clear tasks and acceptance tests. Put constraints around scope and safety. Let an autonomous agent run without intervention. Then review what you got with the same seriousness you’d apply to a human developer’s work. You might come away unimpressed, and that’s useful. Or you might discover that some assumptions that felt safe even a few months ago are no longer safe.
Just go out and make something.