echen @echen

sonnet 4.5 dropped 2 weeks back – “the best coding model yet”

we ran it through our internal agentic coding benchmark – real-world engineering tasks and codebases.

results: indeed, sonnet 4.5 wins! (although gpt-5-codex is close and costs less than half as much)

but something much more interesting surprised me:
> about half the tasks each model failed were passed by the other.

in other words:
> they’re different types of coders.

we did a deep dive on one example.

claude 4.5: the craftsman perfectionist. slow to err, obsessed with correctness, maybe a little neurotic about spacing but ultimately reliable.

gpt-5-codex: the hacker-engineer. exploratory, error-prone, and a bit too eager to improvise.

full benchmark + example deep dive ->

Oct 14, 2025

https://x.com/echen/status/1978239403337130044

Tuesday, October 14, 2025