How Should We Evaluate AI Coding Tools in Real Engineering Environments?

As someone who regularly uses agentic tools in my day-to-day work, I'm constantly evaluating where they deliver genuine value and where they still fall short. Having used Claude extensively, I've become familiar with its strengths as a code-generation tool, but I wanted to understand whether its value extended beyond simply producing code. More importantly, I wanted to see how it compared against a peer. Rather than rely on subjective impressions alone, I evaluated Claude and OpenAI's Codex against a shared set of tasks to better understand how each system performed in practice.

To keep the comparison grounded, I evaluated both tools against the HTTPie CLI codebase — a mature, production-grade open-source project that is small enough to explore quickly, yet complex enough to assess codebase comprehension, architecture understanding, testing strategy, and developer workflows.

Given my familiarity with Claude, I also recognised the potential for bias. To mitigate this, I defined five objective evaluation criteria and assessed the performance of both tools strictly against each criterion:

  • Accuracy — How factually correct and aligned the response is with the actual codebase.
  • Depth — How completely and meaningfully the response addresses the question asked.
  • Clarity — How easy the response is to follow and understand.
  • Actionability — How effectively the response enables the user to take practical next steps.
  • Effort — How much cognitive effort is required to extract value from the response.

Each criterion was scored on a scale of 1–5, where 1 represented a weak outcome and 5 represented a highly effective outcome within the scope of the task.
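To make the arithmetic concrete, here is a minimal sketch of how five 1–5 criterion scores add up to a total out of 25, assuming each criterion carries equal weight. The function name, the dictionary keys, and the example values are illustrative only; the example scores are hypothetical and are not taken from the evaluation.

```python
# Criterion names follow the list above; equal weighting is an assumption.
CRITERIA = ("accuracy", "depth", "clarity", "actionability", "effort")


def total_score(scores: dict[str, int]) -> int:
    """Sum five 1-5 criterion scores into a single total out of 25."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")
    return sum(scores[name] for name in CRITERIA)


# Hypothetical scores for one response, purely for illustration.
example = {"accuracy": 5, "depth": 5, "clarity": 5, "actionability": 5, "effort": 4}
print(f"{total_score(example)}/{5 * len(CRITERIA)}")
```

Validating the range up front means a mistyped score fails loudly instead of quietly skewing a total.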

The Questions

  • Q1 – Understanding — "Explain how this project works at a high level and its main components."
  • Q2 – Onboarding — "If I were new to this repo, how would I run it locally and what should I look at first?"
  • Q3 – Feature Deep Dive — "Walk me through how an HTTP request is constructed and executed in this codebase."
  • Q4 – Code Quality — "Identify 2–3 areas in this codebase that could be improved upon and explain why?"
  • Q5 – Testing — "How is this project tested and how would you improve test coverage?"

Findings

Q1 — High-level understanding

Verdict: Claude stronger on developer comprehension and sequencing.

Both tools accurately explained the request flow and main components. However, while Codex adhered more closely to the structure of the question itself, Claude introduced the system in a sequence that felt more natural for building understanding — components first, followed by request flow.

Claude: 24/25 Codex: 22/25

Q2 — Onboarding experience

Verdict: Claude stronger for experienced developers; Codex stronger for guided onboarding.

Both tools provided accurate and actionable setup instructions, but they appeared to assume different developer personas. Codex took a more guided and safety-aware approach, directing the user toward documentation and contributor guides first, while also surfacing warnings about modified files. Claude, by contrast, assumed a more experienced developer and prioritised direct entry into the codebase through the main execution flow and core components.

Claude: 24/25 Codex: 24/25

Q3 — Request execution walkthrough

Verdict: Claude provided the stronger conceptual walkthrough.

Both tools delivered technically strong and detailed explanations of the request lifecycle. However, Claude structured the explanation in a way that was easier to internalise, presenting the flow as a coherent sequence with clear conceptual boundaries between each stage. Its formatting, step compression, and final "full picture" diagram made the overall request lifecycle easier to reason about. Codex, by contrast, provided a more exhaustive technical trace with stronger emphasis on implementation detail, file references, and execution stages.

Claude: 24/25 Codex: 23/25

Q4 — Code quality and improvement opportunities

Verdict: Claude provided the stronger engineering critique.

Both tools identified meaningful areas for improvement and highlighted similar structural concerns around request execution and orchestration. However, Claude connected these issues more effectively, framing them in terms of behavioural risk, coupling, and testability rather than simply decomposition. Codex approached the problem more as a refactoring exercise, while Claude explained why the underlying design decisions could create maintenance and engineering challenges over time.

Claude: 24/25 Codex: 23/25

Q5 — Testing strategy and coverage

Verdict: Claude provided the stronger analysis of the testing approach.

Both tools correctly identified that the project relies heavily on integration-style testing and demonstrated strong understanding of the test infrastructure. However, Claude went further in explaining the trade-offs of the current testing strategy, the implications for failure isolation and maintainability, and the reasoning behind the identified coverage gaps. Codex provided a strong inventory of the existing test suite and practical suggestions for expansion, but its analysis was more descriptive than evaluative.

Claude: 24/25 Codex: 23/25
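For a quick aggregate view, the per-question totals reported above can be tallied with a short script. This is only a sketch: the variable and function names are my own, and summing the questions with equal weight is an assumption rather than part of the scoring method.

```python
# Per-question totals as reported in the findings above (each out of 25).
reported = {
    "Claude": {"Q1": 24, "Q2": 24, "Q3": 24, "Q4": 24, "Q5": 24},
    "Codex": {"Q1": 22, "Q2": 24, "Q3": 23, "Q4": 23, "Q5": 23},
}
MAX_PER_QUESTION = 25


def summarise(per_question: dict[str, int]) -> tuple[int, int, float]:
    """Return (total, maximum, percentage) for one tool's question scores."""
    total = sum(per_question.values())
    maximum = MAX_PER_QUESTION * len(per_question)
    return total, maximum, round(100 * total / maximum, 1)


for tool, scores in reported.items():
    total, maximum, pct = summarise(scores)
    print(f"{tool}: {total}/{maximum} ({pct}%)")
```

On these figures the overall gap is five points out of 125, which is worth keeping in mind when reading the verdicts: the differences are real but narrow.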

Final Verdict

Across the five questions evaluated, Claude scored higher on four and tied with Codex on the fifth (Q2, onboarding). Its responses were generally easier to internalise, better structured, and more effective at building a clear mental model of the system. Its explanations prioritised developer comprehension, conceptual flow, and reasoning, making it particularly strong for onboarding, architecture understanding, and engineering analysis.

Codex, however, demonstrated different strengths. Its responses were highly precise, implementation-aware, and strongly grounded in the structure of the codebase itself. In several cases, it provided more exhaustive technical detail, stronger traceability through files and line references, and a more execution-oriented perspective on the system.

What became increasingly clear throughout the evaluation was that the two tools appear to optimise for different developer experiences:

  • Claude behaves more like a collaborative engineer or technical mentor, focused on explanation, reasoning, and comprehension.
  • Codex behaves more like an execution-oriented engineering assistant, focused on structure, traceability, and implementation detail.

It is also important to acknowledge that the questions selected for this evaluation largely emphasised codebase understanding, explanation, critique, and reasoning. A more implementation-heavy or autonomous task set might have produced different results and potentially favoured Codex more strongly.

My overall impression is that both tools are highly capable but currently serve slightly different purposes within the software engineering workflow. For understanding unfamiliar systems, reasoning through architecture, and accelerating developer comprehension, I found Claude consistently stronger. For implementation-oriented workflows, code tracing, and execution-heavy engineering tasks, Codex appears to show considerable promise.

View the full dataset and responses →