OpenAI's Codex: A review after a dozen AI-generated PRs
How good is OpenAI's Codex and its newest codex-1 model? Over the weekend, I set it up and used it to generate a dozen real-life PRs that I ended up merging into my application (some, of course, with a few tweaks). If you're curious how Codex performs in real life, or whether it might even be worth a subscription, this article is for you.
An essential first step: Configuring Codex's environment
What makes Codex useful is that it can execute code: it can run tests, linters, and so on. This way, it can receive instructions and work on your application until it believes the instructions have been followed and the tests and other validations pass.
Unsurprisingly, you need to set up the environment to make this work. It works similarly to any CI setup, just a lot more basic and impractical: Codex runs your code in a Docker container that you can't configure, has no access to the internet after the setup stage, and can't run Docker images itself. This makes setting up the environment a lot more challenging: you need to configure databases manually, and complex external dependencies can be almost impossible to set up. How problematic this is will depend on the complexity of your application.
To me, this is the biggest limitation of Codex at the moment. The good news is that this can "easily" be improved. It isn't surprising that a research preview wouldn't allow the more advanced configuration we have access to in mature CI services.
Here is the interface for setting up your environment:
Once you are done, this setup will run at the start of every task, before internet access is cut off.
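To make this concrete, here is the kind of script that setup stage could run. This is a minimal sketch for a hypothetical Python project (the file names, the `manage.py migrate` step, and the SQLite fallback are placeholders, not my actual stack); the important part is that anything requiring the network has to happen here, before the connection is cut.

```python
"""Hypothetical Codex environment setup script.

Everything that needs the network (package installs, downloads) must
happen at this stage, because internet access is cut off once the task
starts. Commands and file names below are placeholders for illustration.
"""
import subprocess
import sys


def run(cmd: list[str]) -> None:
    """Run a command and abort the setup if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main() -> None:
    # 1. Install dependencies while the network is still available.
    run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
    run([sys.executable, "-m", "pip", "install", "-r", "requirements-dev.txt"])

    # 2. No Docker-in-Docker, so external services can't be started as
    #    images. Point the test suite at something self-contained instead,
    #    e.g. a local SQLite file rather than a managed Postgres instance.
    with open(".env.test", "w") as f:
        f.write("DATABASE_URL=sqlite:///./test.db\n")

    # 3. Do any remaining offline preparation (migrations, seed data, ...)
    #    so Codex can run the tests later without needing the network.
    run([sys.executable, "manage.py", "migrate"])  # placeholder command


if __name__ == "__main__":
    main()
```

Not elegant, but given the constraints above, this kind of workaround is what you're left with.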
Requesting work from Codex
Codex works similarly to Claude Code: it looks at the AGENTS.md instructions, traverses your code, makes changes, and runs the commands it knows about to confirm nothing broke or regressed.
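For reference, those AGENTS instructions are just a plain file (AGENTS.md) at the root of the repository. A minimal, hypothetical example could look like this; the commands are placeholders and will depend on your stack:

```markdown
# AGENTS.md (hypothetical example)

## How to validate changes
- Run the linter: `ruff check .`
- Run the test suite: `python -m pytest -q`

## Conventions
- Keep changes small and focused on the task description.
- Never disable linters, tests, or CI configuration to make checks pass.
```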
I tend to think the success of these tools relies on two things:
The intelligence and memory of the LLM
The technical guardrails put in place to help the LLM produce high-quality work
Intelligence and memory
OpenAI uses codex-1 (a variant of o3) to perform tasks. Clearly, this model is able to produce high-quality code and deal with a lot of complexity. But as always, the problem comes from the size of the work at hand: Instructions tend to be short (otherwise you might as well do the work yourself) and repositories can be large and undocumented. Context windows aren't unlimited and tokens aren't free. So any tool needs to perform a complex balancing act to provide a decent output at a reasonable cost.
In this preview, I found the output quality satisfactory. Codex might go with the wrong approach, but it rarely produced code that I felt was subpar. It felt very similar to having a junior developer on the team: you wouldn't hand them everything, but there is still a lot of work they can do, or at least start, for you.
I was, however, surprised by how "lazy" it can be. I would have expected it to work for extended periods of time until it figured out a solution, but this isn't the case. I suspect this might be a cost-control decision rather than a fundamental limitation of the technology; either way, it significantly limits the types of tasks you can give it.
Amusingly, at one point, I asked it to go through the linter errors one by one and fix them for me (the type of easy task a junior dev could successfully handle). There were dozens of files to change and a total of over 200 errors. That's too much for Codex. Its solution? Disable the linter entirely:
Technical guardrails
Automated tests and linters are a tremendous help to LLMs. What's interesting with Codex, however, is that these guardrails are not fully automated. Unlike a GitHub PR, where you can set up specific checks that must pass before the PR can be merged, Codex decides for itself what it will run. That can be smart, but I found the current implementation lacking: oftentimes, it will skip running the tests entirely. Considering how asynchronous the work is, I would much prefer a static list of scripts it has to run before it can consider its work done.
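To illustrate what I mean, here is a rough sketch of that kind of "definition of done" script: a single fixed entry point that runs every guardrail and fails loudly if any of them fail. The specific tools (ruff, mypy, pytest) are placeholders; the point is that the list is static rather than left to the agent's judgment.

```python
"""Hypothetical fixed validation list for the agent to run.

Instead of letting the agent decide which checks matter, give it one
script it must run (and pass) before considering a task done. The tools
listed here are placeholders; swap in whatever your project uses.
"""
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],                   # linter (placeholder)
    ["mypy", "."],                            # type checker (placeholder)
    [sys.executable, "-m", "pytest", "-q"],   # test suite (placeholder)
]


def main() -> int:
    failed = []
    for cmd in CHECKS:
        print("+", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            failed.append(" ".join(cmd))
    if failed:
        print("Failed checks:", "; ".join(failed))
        return 1
    print("All checks passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

If Codex had to run something like this, and only this, before declaring a task complete, the skipped test runs would stop being an issue.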
Is Codex ready for production use?
Yes. Codex doesn't push to production; instead, it creates PRs. It assumes you have an automated test suite (which would most likely block the deployment on GitHub) and that someone will review the code.
In this context, Codex can already do a lot. I did a small demo of my application the other day and noticed a few issues while doing it. Instead of taking care of them myself, I delegated them to Codex while I worked on more challenging tasks. Later on, I simply merged the PR after confirming the work was what I expected. These weren't complex tasks, but they still saved me some time. I expect that as Codex improves on all fronts, the complexity of tasks offloaded will increase as well.
When I reviewed OpenAI's two previous research previews, Deep Research and Operator, I was excited about what would become possible in the future, but it was too early for those products to become part of my daily toolkit. Codex, however, is a lot more mature and already has a purpose. That makes me more eager than ever to see it improve!