AI-assisted self-review — what it speeds up, where it falls short

In the previous post I explained why I built CommitBrief: the inability to gain distance from my own diff, existing AI tools not thinking at the diff level, the need for a second pair of eyes without leaving the terminal. This post is about “how much”, not “why”. After three months, I want to put concrete numbers on which workflow layers it plugs into, how much speed it adds, which types of bugs it keeps catching, and where it comes up short.

A quick reminder first: the place where these tools create value is not “let AI write it” but “let AI be a second set of eyes” — as I wrote before, as production gets cheaper, judgment gets more expensive. The value of a tool is how well it assists that expensive judgment. What follows is as much as I’ve been able to measure.

The right leverage point: changes, not files

In the first week I started with the wrong abstraction: handing a file to the AI and saying “review this.” The output was long, generic, and largely irrelevant. Because what a reviewer looks at is not the file — it’s the change coming into that file. Once CommitBrief’s design unit became the diff, three things improved simultaneously: the token budget dropped 10×, severity assignment became meaningful (the question is no longer “is this function well-written” but “does this change introduce a problem”), and the false-positive band narrowed.

In practice, this means the “AI side” of a PR can fit into four or five seconds — fast enough to make it a pre-push reflex.

Three integration layers

There are three layers for plugging CommitBrief into a workflow; each has a different speed profile.

Layer	Command	Speed profile	Typical use
Manual	`commitbrief`	4-5 s, $0.02-0.04	One last look on top of your own eyes before pushing
Pre-push hook	`commitbrief install-hook --hook=pre-push --fail-on=critical`	Cancels push on critical finding	Solo developer / small team
CI step	`commitbrief --fail-on=high`	Soft/hard gate in the PR pipeline	Shared baseline for team PRs
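
The `--fail-on` threshold in the table boils down to an ordered severity comparison. Here is a minimal sketch in Go of how such a gate could work. The `Severity` scale, the `Finding` type, and `shouldFail` are hypothetical names for illustration, not CommitBrief's actual internals, and the real tool's levels may differ.

```go
package main

import "fmt"

// Severity is a hypothetical ordered scale; the real tool's levels may differ.
type Severity int

const (
	Info Severity = iota
	Low
	Medium
	High
	Critical
)

// Finding is an illustrative stand-in for one structured review finding.
type Finding struct {
	Severity Severity
	Message  string
}

// shouldFail reports whether any finding is at or above the threshold.
func shouldFail(findings []Finding, failOn Severity) bool {
	for _, f := range findings {
		if f.Severity >= failOn {
			return true
		}
	}
	return false
}

func main() {
	findings := []Finding{
		{Info, "i18n key present but unused"},
		{High, "NOT NULL column added without a default"},
	}
	// The same findings pass a critical-only gate and trip a high gate.
	fmt.Println("fail-on=critical blocks:", shouldFail(findings, Critical))
	fmt.Println("fail-on=high blocks:", shouldFail(findings, High))
}
```

A gate like this only works when findings arrive as structured data with a severity attached, which is why plain-text output cannot drive it.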

Don’t use all three at once. Pick what fits your setup and stick with it; enabling them all simultaneously increases notification noise, not workflow discipline. For solo work I use the manual command + pre-push hook; I keep the CI step for team projects.

There’s a subtle point about the pre-push hook: the version I initially installed was actually running --staged. At push time the index is usually empty, so the hook silently fell through to a no-op and the push went through anyway. Thinking you set up a gate but actually setting up nothing: I caught this bug in my own audit. In v1.0, pre-push now reads git’s stdin protocol and runs commitbrief diff <remote>..<local>. Being honest about the places the tool hit its own limits is important.

Numbers — my own workflow

Three scenarios, three different speed profiles.

Scenario 1 — Pre-push reflex (solo)

A typical mid-sized PR for me: ~120 lines added, 40 deleted, four or five files. My old workflow:

Review with git diff --staged: 4-7 min
Two more read-throughs (to counter anchoring bias): +3-4 min
Push, open PR

My new workflow:

commitbrief --staged: 4-5 s → read the output: 30-60 s
Fix any Critical/High findings + re-run (cache hit: 13 ms, $0)
Deliberate deep read (I now know what I’m looking for): 2-3 min
Push, open PR

Total: 7-11 min → 3-5 min. About 55% faster at the midpoints, yes, but the real win isn’t in the minutes: an actual bug reaching a reviewer is the most expensive part of this job, and this catches it before it gets there.

Scenario 2 — Team PR round-trip

When a critical finding is caught before it reaches a reviewer, the PR round-trip drops from two days to one. That’s a deliberate understatement: it actually drops further, but the average is dragged up by the “reviewer still caught something else” cycle that’s always there. AI doesn’t fully bypass the human reviewer; it focuses the reviewer’s eyes on hard findings instead of easy ones. That’s the net effect.

Scenario 3 — Cache hit economics

CommitBrief’s cache key: SHA256(diff + system_prompt + provider + model + lang + schema_version). As long as the diff hasn’t changed, the same call takes 13 ms and costs $0. In practice:

Re-review after a rebase: free
“One last look” before pushing: free
Switching provider to compare: not free — it’s a new cache key — but cheap enough to be attractive as a second parallel opinion
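
The cache behavior above follows directly from what goes into the key. Here is a minimal sketch of such a key in Go, assuming the six fields named in the formula. The `reviewInput` type and the length-prefix framing are my own additions for illustration; the formula does not say how the fields are delimited.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// reviewInput holds the six fields from the cache-key formula.
// The struct itself is a hypothetical shape, not the tool's real type.
type reviewInput struct {
	Diff, SystemPrompt, Provider, Model, Lang, SchemaVersion string
}

// cacheKey hashes every field into one SHA-256 digest.
func cacheKey(in reviewInput) string {
	h := sha256.New()
	parts := []string{in.Diff, in.SystemPrompt, in.Provider, in.Model, in.Lang, in.SchemaVersion}
	for _, p := range parts {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
		fmt.Fprintf(h, "%d:", len(p))
		h.Write([]byte(p))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	base := reviewInput{
		Diff:          "+ added line\n",
		SystemPrompt:  "project rules",
		Provider:      "anthropic",
		Model:         "sonnet",
		Lang:          "en",
		SchemaVersion: "1",
	}
	other := base
	other.Provider = "gemini"

	fmt.Println("same input, same key:", cacheKey(base) == cacheKey(base))
	fmt.Println("new provider, new key:", cacheKey(base) != cacheKey(other))
}
```

Because the system prompt is part of the key, editing the project rules also invalidates cached reviews, which is the behavior you want when the standard itself changes.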

Monthly billing math: Anthropic Sonnet 4.6 at an average $0.03/review × ~50 PRs/month × 40% cache-miss rate ≈ $0.60/month. Ollama is zero, just electricity. Negligible compared to the opportunity cost of an hour of human code-review time.

Patterns I keep catching

Over three months I’ve tallied five patterns that CommitBrief (and other providers sharing the same pipeline) catches over and over. Some come from my daily workflow, some from CommitBrief’s own pre-v1.0 self-audit — I reviewed my own project with three different providers using my own project rules and collected the common finding sets. Every single one falls in the “catchable by reading the diff, hard to catch by eye” category.

1. Error wrap duplication. Repeating the same message in a prefix when the %w format specifier already wraps the error. Produces "auth failed: auth failed: invalid token" in logs. Static analysis doesn’t catch it because it’s not technically wrong; it’s a semantic repeat. Compiler is happy, logs are a mess.

2. NOT NULL column migration without a default. Harmless on an empty table, fails on a populated prod table. Typical CommitBrief output: “Migration fails on any table that already has rows; either backfill in a prior migration or add a DEFAULT before the constraint.” Classic category of bugs that don’t show up in staging but do in prod.

3. i18n catalog key present, unused. Common in multi-language projects: you add an EN/TR key for a new prompt but leave a hardcoded string in code instead of calling cat.T(...). Invisible in EN (the string is the same), a functional breakage in TR. In my own project this pattern was one of the most consistent findings in the audit — guard.prompt and setup.welcome keys were sitting in the YAML, but the actual prompts were English literals in Go code. Even more ironic: the Turkish [e/H]: (yes/No) prompt was being rendered, but the parser only accepted y/yes — so when a Turkish user typed e, it was silently treated as “no”. The place I thought had “i18n” actually had “half-baked i18n.”
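
The yes/no half of that bug is easy to show. Here is a minimal sketch in Go of a confirmation parser that follows the prompt's language. The `affirmatives` table and `isYes` are hypothetical names, and the accepted answers are my assumption about what a Turkish prompt should take.

```go
package main

import (
	"fmt"
	"strings"
)

// affirmatives maps a language code to the answers that count as "yes".
// This table is illustrative; a real catalog would own these values.
var affirmatives = map[string][]string{
	"en": {"y", "yes"},
	"tr": {"e", "evet"},
}

// isYes reports whether the answer is affirmative in the given language.
// Anything else, including an unknown language, is treated as "no".
func isYes(lang, answer string) bool {
	a := strings.ToLower(strings.TrimSpace(answer))
	for _, ok := range affirmatives[lang] {
		if a == ok {
			return true
		}
	}
	return false
}

func main() {
	// An English-only parser is equivalent to always passing "en",
	// which is how a Turkish "e" ends up meaning "no".
	fmt.Println(isYes("en", "e")) // false
	fmt.Println(isYes("tr", "e")) // true
}
```

The rendered prompt and the parser have to read from the same source of truth; once they drift apart, the safe default quietly becomes the only reachable answer.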

4. Config knob accepted-but-ignored. Present in the config schema, accepted by config set, documented — but never read anywhere in the code path. The user trusts the docs, sets the value, zero effect. I did this in CommitBrief’s own cache config: setting cache.enabled: false did nothing because the pipeline opened the cache using only the Dir and RepoRoot parameters. This pattern especially tends to split across two PRs — the “define config” PR and the “use config” PR. AI catches the missing second one; the human reviewer loses context in the time between the two PRs.
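
Here is a minimal Go sketch of the cache example, with the ignored flag next to the version that honors it. `CacheConfig`, `Cache`, and both constructors are hypothetical shapes for illustration; only the `cache.enabled` knob and the Dir and RepoRoot parameters come from the description above.

```go
package main

import "fmt"

// CacheConfig mirrors a config section with an enabled knob and a directory.
type CacheConfig struct {
	Enabled bool
	Dir     string
}

// Cache is a placeholder for an opened cache handle.
type Cache struct {
	dir, repoRoot string
}

// openCacheIgnoringFlag reproduces the bug: Enabled is parsed and
// documented, but only Dir and the repo root are ever read.
func openCacheIgnoringFlag(cfg CacheConfig, repoRoot string) *Cache {
	return &Cache{dir: cfg.Dir, repoRoot: repoRoot}
}

// openCache honors the knob; a nil handle means caching is off.
func openCache(cfg CacheConfig, repoRoot string) *Cache {
	if !cfg.Enabled {
		return nil
	}
	return &Cache{dir: cfg.Dir, repoRoot: repoRoot}
}

func main() {
	cfg := CacheConfig{Enabled: false, Dir: ".cache"}
	fmt.Println("buggy version opens a cache:", openCacheIgnoringFlag(cfg, "/repo") != nil)
	fmt.Println("fixed version opens a cache:", openCache(cfg, "/repo") != nil)
}
```

In a diff, the tell is a struct field that gains a parser and a doc entry but never appears on the right-hand side of any expression.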

5. A ref lost between empty layers. A mechanism you think is acting as a gate turns out to be a no-op. CommitBrief’s pre-push hook did exactly this (explained above). Without AI, this kind of bug is usually caught in production or in a security incident; a reviewer who recognizes the pattern can spot it in the diff too.

The common thread across all five: none of them are “code that fails to compile.” They are all the gap between intent and reality. That’s exactly where AI genuinely adds value — validating syntax is already the compiler’s job.

Why project-specific rules matter

The most concrete thing that distinguishes CommitBrief from generic AI review tools: COMMITBRIEF.md lives in the project root and is sent to the LLM as the system prompt. Three rules from CommitBrief’s own COMMITBRIEF.md:

“New dependencies must be MIT / Apache-2.0 / BSD / ISC / MPL-2.0 / LGPL-3.0+; AGPL or proprietary is not accepted.”
“All user-facing strings must go through cat.T(key, ...); hardcoded English / Turkish is rejected.”
“If a new key is added to the i18n catalog, EN/TR parity is mandatory; a key cannot exist on only one side.”

These are not generic “write good code” suggestions — they are this project’s own policies. A generic AI assistant won’t ask “is this dependency MIT or GPL”; an AI reviewing against my rules will. That difference lifts the review from syntax-level to architectural-intent-level — and in a team sharing the same single file, it means “which standards apply” is resolved in code before it’s debated in a review.
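
Some of these rules can also be checked mechanically, which makes a good cross-check on the AI findings. Here is a minimal Go sketch of the EN/TR parity rule, assuming each catalog has already been loaded into a map. The loader is omitted, `missingKeys` is a hypothetical helper, and the catalog values are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// missingKeys returns the keys present in a but absent from b, sorted.
func missingKeys(a, b map[string]string) []string {
	var out []string
	for k := range a {
		if _, ok := b[k]; !ok {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative catalogs; the values are made up for this example.
	en := map[string]string{
		"guard.prompt":  "Continue? [y/N]: ",
		"setup.welcome": "Welcome",
	}
	tr := map[string]string{
		"guard.prompt": "Devam edilsin mi? [e/H]: ",
	}
	fmt.Println("missing in TR:", missingKeys(en, tr)) // missing in TR: [setup.welcome]
	fmt.Println("missing in EN:", missingKeys(tr, en)) // missing in EN: []
}
```

A check like this covers parity only. Whether the code actually calls the catalog instead of hardcoding a string is the part that still needs a reviewer reading the diff.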

Provider choice + cost

Having four providers (Anthropic, OpenAI, Gemini, Ollama) plus two CLI-based providers (claude-cli, gemini-cli) living on a single abstraction enables two things: comparison is cheap, no lock-in.

Provider	Typical review	Strengths	Weaknesses
Anthropic Sonnet 4.6	$0.03	Most consistent reasoning, ephemeral prompt cache (5 min)	Most expensive mid-tier
Gemini 2.5 Pro	$0.01	1M context, generous free tier	Occasionally overly verbose
OpenAI GPT-4o	$0.02	Fast; automatic cache on prefix ≥1024 tokens	Shorter explanations than Sonnet
Ollama (local)	$0	Zero exfiltration, your own hardware	Hardware required, quality depends on model
`claude-cli`	$0*	Existing Claude Code subscription	Plain text — no structured findings
`gemini-cli`	$0*	Existing Gemini CLI auth	Same

* Marginal $0 assuming an existing subscription.

In practice: I use Anthropic Sonnet for daily work (consistency matters). For PRs that need higher sensitivity I send the same diff to Gemini Pro as well and compare the two finding sets. Since the cache key includes the provider, the two runs create separate entries, but the total still rounds to a few cents. Ollama is for code that needs to stay local. I’m not using the CLI providers in my daily workflow yet because their plain-text output bypasses the --fail-on severity gate; if you already pay for the subscription they’re useful for a quick “second glance”, but not for CI integration.

Where it falls short

Three months in, I still run into the tool’s limits. An honest list:

No architectural questioning. It has no answer to “should I build this feature” or “is this abstraction right.” It doesn’t see how a single diff fits into the whole system.
No intent validation. It can’t say “are you building the right thing.” A perfect implementation of the wrong spec comes back green.
Limited cross-file consistency. File-to-file consistency within a single PR is fine, but it can’t contrast against code outside the PR (“this pattern is done X way elsewhere in the project, but you used Y here”). No repo-wide vectorization; a deliberate scope decision in the PRD.
False-positive band exists. Usually at the info level and easy to ignore, but it’s there. Filtering info findings is possible with the OUTPUT.md template.
Prompt injection risk. Adversarial content inside the diff can mislead the LLM. CommitBrief’s mitigation: <project_rules>...</project_rules> XML wrapper + immutability guard + a secret scan before each review. Still not zero risk; for security-sensitive diffs I’d recommend --no-cache + careful reading.

None of these limitations lead to the conclusion “the tool doesn’t work” — it just needs to be placed correctly. CommitBrief is not a replacement for lint; it’s a layer on top of lint + tests, below human review.

Closing thoughts

I want to repeat the line from the first post: if production has gotten cheaper, reviewing what’s produced shouldn’t stay expensive. The concrete takeaway from three months is this: the point where AI adds value is calibrating judgment — not replacing it. The right design for a tool is not to deny that difference; it’s to accept it and bridge both sides.

CommitBrief is what built that bridge for me. If it works for you, I’d love for you to share your COMMITBRIEF.md — seeing how other projects define “good code” is the most instructive part of all this.

github.com/CommitBrief/commitbrief
commitbrief.com
Portfolio page: /en/portfolio/commitbrief