Last Friday, OpenAI introduced a new coding system called Codex, designed to perform complex programming tasks from natural language commands. Codex moves OpenAI into a new cohort of agentic coding tools that is only beginning to take shape.
From GitHub's early Copilot to contemporary tools like Cursor and Windsurf, most AI coding assistants operate as a very clever form of autocomplete. These tools generally live in an integrated development environment, where users interact directly with the AI-generated code. The prospect of simply assigning a task and returning when it is finished is largely out of reach.
But these new agentic coding tools, led by products like Devin, SWE-Agent, OpenHands and the aforementioned OpenAI Codex, are designed to work without users ever having to look at the code. The goal is to operate like the manager of an engineering team, assigning issues through workplace systems like Asana or Slack and checking in when a solution has been reached.
For believers in highly capable AI, this is the next logical step in a natural progression of automation taking over more and more software work.
"In the beginning, people just wrote code by pressing every keystroke," explained Kilian Lieret, a Princeton researcher and member of the SWE-Agent team. "GitHub Copilot was the first product to offer real auto-complete, and this is the second stage. You're still absolutely in the loop, but sometimes you can take a shortcut."
The goal of agentic systems is to move beyond the developer environment entirely, instead presenting coding agents with an issue and leaving them to resolve it on their own. "We pull things back to the management layer, where I just assign a bug report and the bot tries to fix it completely autonomously," Lieret said.
This is an ambitious goal, and so far it has proven difficult.
After Devin became generally available in late 2024, it drew harsh criticism from YouTube pundits, as well as a critique from an early customer at Answer.AI. The overall impression was a familiar one for vibe-coding veterans: with so many errors, supervising the model takes as much work as completing the task manually. (Although Devin's launch has been a bit rocky, it hasn't stopped investors from recognizing the potential: in March, Devin's parent company, Cognition AI, reportedly raised hundreds of millions of dollars at a valuation of $4 billion.)
Even proponents of the technology caution against unsupervised vibe coding, viewing the new coding agents as powerful elements in a human-supervised development process.
"Right now, and I would say for the foreseeable future, a human has to step in at code review time to look at the code that has been written," said Robert Brennan, CEO of All Hands AI, which maintains OpenHands. "I've seen several people get themselves into a mess by automatically approving every bit of code the agent writes. It gets out of hand fast."
Hallucinations are also a persistent problem. Brennan recalled an incident in which, when asked about an API released after the OpenHands agent's training data cutoff, the agent fabricated details of an API that fit the description. All Hands AI says it is working on systems to catch these hallucinations before they can cause harm, but there is no simple fix.
Arguably, the best measure of progress in agentic programming is the SWE-Bench leaderboards, where developers can test their models against a set of unresolved issues from open GitHub repositories. OpenHands currently holds the top spot on the verified leaderboard, solving 65.8% of the problem set. OpenAI claims that one of the models powering Codex, codex-1, can do better, listing a score of 72.1% in its announcement, although the score comes with some caveats and has not been independently verified.
The concern for many in the technology industry is that high benchmark scores don't necessarily translate into truly hands-off agentic coding. If agentic coders can only solve three of every four problems, they will require significant oversight from human developers, especially when tackling complex systems with multiple stages.
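A rough back-of-the-envelope sketch shows why multiple stages make that oversight burden worse. It assumes, purely for illustration, that every stage succeeds independently at the same 75% rate implied by "three of every four problems". Real tasks are unlikely to behave this neatly, and the function name and stage counts below are hypothetical.

```python
def chance_all_stages_succeed(per_stage_success: float, stages: int) -> float:
    """Probability that every stage succeeds.

    Illustrative assumption: stages are independent and share one success rate.
    """
    if not 0.0 <= per_stage_success <= 1.0:
        raise ValueError("per_stage_success must be between 0 and 1")
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return per_stage_success ** stages


if __name__ == "__main__":
    # 0.75 stands in for "three of every four problems" solved.
    for stages in (1, 2, 4, 8):
        rate = chance_all_stages_succeed(0.75, stages)
        print(f"{stages} stage(s): {rate:.1%} chance that nothing needs fixing")
```

Under these assumptions, the chance of an error-free result falls to roughly 32% at four stages and about 10% at eight, which is why a human reviewer remains part of the picture.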
As with most AI tools, the hope is that improvements to the underlying models will arrive at a steady pace, ultimately allowing agentic coding systems to grow into reliable developer tools. But finding ways to manage hallucinations and other reliability issues will be crucial to getting there.
"I think there is a bit of a sound barrier effect," Brennan said. "The question is, how much trust can you shift to the agents so they take more off your workload at the end of the day?"