What happened when I asked Claude to be honest about its own work
A colleague's tip, a concept called "satisficing," and a Claude Code plugin that makes AI stop beyond the finish line, not at it.
“Large language models (LLMs) often ‘satisfice’ — they accept good-enough rather than optimize.”
That's what Claude replied when I asked it about creating a self-scoring skill. A colleague had shared a tip: ask your LLM to score its own output, then follow up with something like, "What would you change to improve your score?"
I looked it up: satisficing is a blend of "satisfy" and "suffice." It describes the tendency to settle for a solution that meets the minimum threshold rather than push toward the best possible one. Turns out, it's not just a human thing. When an LLM finishes a task, it tends to stop at "it works" instead of "it's good."
That gap — between working and good — is where a lot of value quietly leaks out, and you might not know it unless you ask.
So we built a quality gate
Claude and I built claude-self-score, a Claude Code plugin that adds a structured quality gate before task completion.
The idea is simple: instead of letting Claude declare "Done!" the moment a task is technically complete, the plugin asks the LLM to evaluate its own work against a defined rubric and only closes the loop if the output actually meets the bar.
No more bare-minimum output.
How it works
Claude scores its own work across 7 dimensions:
Correctness — Does it actually do what it’s supposed to do?
Code quality — Is it clean, readable, maintainable?
Completeness — Are all requirements addressed?
Elegance — Is it the simplest solution that works?
Performance — Does it use resources wisely?
Security — Are there obvious vulnerabilities?
User intent alignment — Does it meet the literal request and the actual intent behind it?
Each dimension is scored on a 1.0–10.0 scale. If the overall average falls below 9.5/10, the plugin identifies specific improvements and asks before implementing them. It also detects dishonest self-scoring patterns — things like score inflation and N/A abuse, where a model might quietly sidestep dimensions it’s underperforming on.
And it runs automatically at the end of every task. No need to manually invoke it.
Why this matters
Without a quality gate, AI assistants do what humans do under time pressure: they stop when something works, not when it’s right.
Structured self-reflection catches things that a passing glance misses — edge cases that weren’t obvious at first, code that functions but would be a headache to maintain, and solutions that answer the question asked while missing the point. That last one is subtle and common. An LLM can satisfy every stated requirement and still produce something that doesn’t serve what you actually need.
The improvement loop also has a compounding effect. Small, targeted fixes — the kind that only surface when you’re forced to look critically — add up.
Oh, and it failed its own bar at first
The crazy thing is, since it runs automatically at the end of each task, the skill ran right after its own creation, scoring and improving itself. The first pass came in at 9.4, just below its own 9.5 threshold.
Basically, the skill designed to catch “good enough” caught itself being good enough.
There’s something humbling and deeply satisfying about that. The system worked exactly as intended, on the thing that built it.
Try it out, and let me know how it works for you.
Have you ever noticed your favorite LLM satisficing? Let's talk about it in the comments.
AI use in this post:
Banner image was created with Claude and Nano Banana 2.


