A year on, where Anthropic's Computer Use actually shipped: Field Notes

When Anthropic released Computer Use as a public beta in late 2024, most of the demos were either browse-and-book travel agents or very nervous-looking screenshots of an AI clicking around someone’s spreadsheet. The category got a lot of column inches. A lot of “this changes everything” and a lot of “this is also a security nightmare.”

A year and change later, here’s what’s actually true.

The hype framing died fast. Inside three months it was clear that an LLM steering a mouse and keyboard, in the wild, on a real desktop, was not going to replace a junior employee any time soon. It was slow. It was occasionally wrong in confident ways. It struggled with anything visually unusual. The compounding error problem, where each step built on the previous step’s interpretation of the screen, made longer tasks unreliable in ways that were hard to predict and embarrassing to demo.

But the technology didn’t go away. It moved sideways into places that turned out to fit it better.

Internal back-office tools, not customer-facing operations

The successful deployments we’ve seen aren’t agents driving public software. They’re agents driving the kind of internal tool that an SMB has built in-house, where the workflow is repetitive, the consequences of error are recoverable, and the human-in-the-loop is sitting two metres away from the screen. Reconciliation. Data entry. Pulling reports out of legacy systems that don’t have APIs. The unsexy stuff.

Legacy system bridges

This is the most genuinely interesting use case. There are still business-critical systems running in Australian SMBs that haven’t been touched since the early 2010s, with no API, no documented database access, and a vendor relationship that’s long expired. Computer Use turns out to be a reasonable bridge for these systems. Not pretty, but functional, and crucially, it doesn’t require the vendor’s cooperation.

QA and testing

Several teams we work with use Computer Use as a smarter, more flexible Selenium for testing internal apps. The model can write its own test plans, identify edge cases a human might not have considered, and adapt when the UI changes. This isn’t what Anthropic’s launch demo showed. It might be the killer use case.

Anything regulated

Almost nothing has shipped here. In medical, legal, and financial work, and in anything that touches customer money, the auditability story is still rough enough that most of our clients in those domains have decided to wait. “An LLM clicked something” remains a hard line item to put in front of a regulator.

The pattern across the successful deployments: the agent isn’t replacing a job, it’s replacing the worst part of a job. The reconciliation that the senior accountant resented. The legacy data-pulling that the operations manager farmed out to whoever was newest. The test runs that took two days every release. These are the wins. They’re undramatic and they keep working.

For SMBs, the lesson is the one Computer Use turned out to teach the whole industry: the agent that browses the open internet is a research demo. The agent that drives your in-house workflow is a tool. They are very different products and they should be evaluated very differently. If you’re being pitched the first when you need the second, push back.

A conductor doesn’t ask the violins to play the timpani’s part. The right instrument in the right place. Twelve months in, that’s where Computer Use earned its seat.
