Autonomy is here: now agents just do things too
On Karpathy’s autoresearch and what changes when optimization becomes abundant and agents can act on their own
A few nights ago, I watched a tiny GitHub repo make a familiar part of my brain go quiet.
Karpathy’s new project, autoresearch, is about 630 lines of Python. It runs on a single GPU. You don’t touch the training code; you write a Markdown file called program.md that describes the research goal. Then an AI agent edits train.py, runs 5‑minute experiments, and keeps only the changes that improve the metric. While you sleep, it runs something like a hundred experiments, nudging a small language model closer to GPT‑2‑level performance.
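The keep‑if‑better loop at the heart of this is simple enough to sketch. The version below is a toy, not Karpathy’s code: run_experiment stands in for a short train.py run (here, a quadratic with its best value at lr = 0.1), propose_change stands in for the agent editing a knob, and all the names are mine.

```python
import random

def run_experiment(params):
    # Stand-in for a ~5-minute train.py run that reports a metric.
    # Toy loss: a quadratic minimized at lr = 0.1 (lower is better).
    return (params["lr"] - 0.1) ** 2

def propose_change(params):
    # Stand-in for the agent rewriting the training code: nudge one knob.
    candidate = dict(params)
    candidate["lr"] = max(1e-4, candidate["lr"] + random.gauss(0, 0.02))
    return candidate

def overnight(params, n_experiments=100):
    """Greedy loop: only changes that improve the metric survive."""
    best_loss = run_experiment(params)
    for _ in range(n_experiments):
        candidate = propose_change(params)
        loss = run_experiment(candidate)
        if loss < best_loss:  # keep the change only if the metric improved
            params, best_loss = candidate, loss
    return params, best_loss

random.seed(0)
best, loss = overnight({"lr": 0.5})
```

The important property is the greedy acceptance rule: the agent can propose anything, but only improvements get committed, so the metric can never get worse overnight.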
On paper, that doesn’t sound exotic. We’ve had AutoML, hyperparameter tuning, and “let’s grid search the hell out of this” for over a decade. But watching this particular loop – human writes the research plan, agent writes and rewrites the code, GPU hums in the corner – feels different.
It feels like hiring your first great startup employee.
You give them a goal, the keys to a few systems, and some constraints. They come back later with a better version of your world. They make decisions you didn’t specify, run experiments you wouldn’t have queued up, and surface results you barely have time to read.
That’s what autonomy feels like, the first time you see it up close. You point an agent at a problem, give it access and a metric, and it just does things. Once you experience that, it’s very hard to imagine going back to a world where nothing moves unless a human is pushing.
1. Treat agents like early employees
In the past few years, startup culture has rediscovered a simple truth about agency: you can just do things. No permission slip, no committee, just a person who notices something broken and fixes it. We repeat it to founders and early employees like a mantra.
What feels wild about autoresearch is that agency is no longer an exclusively human trait. Your friendly neighborhood autonomous agents can just do things too. You give them a problem, some keys, and a clear metric, and they’ll wander off into the stack to make reality slightly better while you’re doing something else.
Autoresearch is a clean illustration: the “job” is not “be smart,” it’s “within this sandbox, with this data, improve this metric as fast as you can without breaking anything.” You write the research goal, the agent edits the code and runs the experiments. It’s the same relationship you have with a great first hire: you define the arena and the standard, they decide how to move inside it. They just do things.
Operationally, that suggests a new kind of job design:
For growth: “Within this experiments folder and analytics stack, keep running onboarding and pricing tests to maximize LTV/CAC, and log every change.”
For product: “Within this feature flag system, continuously A/B test copy and flows to minimize drop‑off at step two of signup.”
For ML: “Within this training script and dataset, search over architectures and hyperparameters to improve validation loss, and only commit changes that clear the bar.”
If you can describe the role for a scrappy generalist human, you’re very close to describing it for an autonomous loop. The founder’s work shifts from “do the thing” to “specify the job and build the sandbox where agents can safely do it.”
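One way to make those job descriptions concrete is to treat each one as data: an arena, a metric, and the constraints. Everything below – the class name, the fields, the example values – is a hypothetical sketch of that pattern, not an API from autoresearch or anywhere else.

```python
from dataclasses import dataclass, field

@dataclass
class AgentJob:
    """A job spec for an autonomous loop: the arena, the standard, the rules."""
    arena: str        # what the agent is allowed to touch
    metric: str       # what "better" means, frozen up front
    constraints: list = field(default_factory=list)  # what's allowed along the way

growth = AgentJob(
    arena="experiments/ folder + analytics stack",
    metric="maximize LTV/CAC",
    constraints=["log every change"],
)

ml = AgentJob(
    arena="train.py + dataset",
    metric="improve validation loss",
    constraints=["only commit changes that clear the bar"],
)
```

The point of writing it down this way is that the spec, not the agent, is where the founder’s judgment lives – exactly like a role description for a scrappy first hire.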
2. Optimization is now a free primitive
The second shift is more uncomfortable: optimization is not scarce anymore.
Historically, you needed a data science team, ML infra, and real budget to do this kind of work. Standing up pipelines, training loops, logging, dashboards – that was the whole show. Now a single engineer can point a small codebase at a GPU, bolt on an agent loop like autoresearch, and get a level of exploration that used to require a team.
Soon, a semi‑technical founder will be able to do the same for simpler problems. If you can scrape it, you can analyze it. If you can log it, you can optimize it. Computation, reasoning, generation – they’re not literally free, but they are cheap enough to treat as abundant.
If optimization is abundant, the founder question changes from:
“Can we afford to tackle this problem?”
to
“What becomes possible if I assume exploration is basically free?”
The most interesting companies in the next decade will be the ones that re‑run their opportunity set under that assumption. Problems that were “too hard” or “too expensive” ten years ago are suddenly in range. The constraint is no longer “can we try enough things?” but “are we choosing the right problems to point our loops at?”
3. Scrappy beats elegant
There’s another pattern that autoresearch makes hard to ignore.
A nontrivial number of ML people will tell you this isn’t “real” machine learning. It’s not a novel architecture, it’s not a new optimizer, it’s not a clever theorem. It’s a loop: a Markdown file, a training script, a metric, and an agent willing to try things.
That distinction is academically interesting. It’s also exactly the kind of argument that slows you down when the world is changing under your feet.
This era will reward teams that are willing to be scrappy and a little bit brute‑force. Cultures that are comfortable wiring up ugly, high‑throughput loops will discover more, faster, than cultures that wait for an elegant framework to congeal around them.
You can see it in how different ecosystems respond to tools like this. Some people immediately fork the repo and point it at their own models, or their trading strategies, or their internal ranking problems. Others debate whether it fits into a taxonomy.
If you’re a founder, you don’t get paid for taxonomies. You get paid for getting somewhere first and making it real. The market doesn’t care whether your loop is pretty. It cares whether your product gets better while everyone else is still in meetings.
4. Moats are now full‑stack execution
If you take seriously the idea that optimization is abundant and autonomy is real, then you also have to contend with a less comfortable implication: a lot of the technical “moat” story gets weaker.
If everyone can download a decent model, rent a GPU, and strap an agent loop on top, having “an ML team” stops being much of a differentiator. The patterns behind autoresearch – design the arena, freeze the metric, let agents iterate – will spread.
Where does advantage move?
To the boring, hard, full‑stack stuff:
Owning the customer and their trust.
Understanding the problem in its messy real‑world context.
Capturing and structuring the right data exhaust.
Designing systems – product, technical, organizational – that are legible enough for agents to act inside.
Network effects still matter. Distribution still matters. Taste still matters. The companies that keep compounding will be the ones that can pull all of that together and then plug autonomy into it, instead of treating agents as a sidecar.
5. The new superpower is orchestration
Spinning up one agent is a parlor trick. Orchestrating many agents across your company – safely, coherently, and in service of a real strategy – is where this gets interesting.
Autoresearch is a single loop, but you can already see the shape of the next step: multiple agents with different roles, handing off experiments, critiquing each other’s changes, running in a shared environment. A whole research org in code. It will be awkward and failure‑prone at first, the way every early human organization is awkward and failure‑prone.
Founders who learn to design these “agent orgs” have an unfair advantage. They’ll:
Define loops for every important function: growth experiments, pricing, onboarding, risk checks.
Build the observability and safety rails so agents can act without taking the company off a cliff.
Hire people whose job is to write the equivalent of program.md for each loop: clear, sharp definitions of what “better” means and what’s allowed along the way.
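A minimal version of that division of labor – one agent proposes, another vetoes anything outside the rules before it ever runs – can be sketched in a few lines. The roles, the canned proposal queue, and the 20% pricing rule are all invented for illustration; real agent orgs will be much messier.

```python
def proposer(step):
    # Role 1: suggest the next experiment (here, a canned queue).
    queue = [
        {"action": "raise price", "delta": 0.5},
        {"action": "reword onboarding step 2", "delta": 0.0},
    ]
    return queue[step % len(queue)]

def critic(proposal):
    # Role 2: safety rail - veto anything that breaks the constraints.
    return proposal["delta"] <= 0.2  # e.g. "no pricing swings above 20%"

def org_step(step, change_log):
    proposal = proposer(step)
    if critic(proposal):
        change_log.append(proposal)  # only approved changes ship
    return change_log

log = []
for step in range(4):
    org_step(step, log)
```

Here the 50% price change is proposed twice and vetoed twice, while the harmless copy change ships – the critic is the rail that lets the proposer be aggressive without taking the company off a cliff.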
In that world, capital doesn’t disappear, but it changes shape. You’ll spend less of it buying human iteration and more of it buying data, distribution, and time – the ingredients that make your agent lab more valuable than the one next door.
When you can spin up an army of junior teammates who never sleep, the hard part stops being “can we try enough things?” and becomes “are we pointing them at the right questions?”
That’s the question autoresearch left me with. Not “is this real ML?” Not “is this AGI yet?” But something much more practical:
If you had a GPU in the corner and an agent you trusted to work all night, what would you ask it to do first?
And if your answer is “I’m not sure,” that’s the real constraint now.