Key Metrics for Evaluating Large Language Model Summarization Quality

When ROUGE looks “good” but users still complain

You ship a summary feature that clears your offline benchmark with a strong ROUGE score, then support tickets roll in: “It missed the decision,” “It invented a date,” “I can’t skim this.” That gap happens because overlap scores such as ROUGE count words and phrases shared with a reference summary, so they reward shared wording, not shared meaning. If the model copies long phrases that also appear in the reference, ROUGE climbs even when the summary drops the one sentence your users needed. And if the model paraphrases well, ROUGE can fall even when the summary is accurate and complete.
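To make the gap concrete, here is a minimal sketch of a unigram-overlap F1 in the spirit of ROUGE-1. It is a simplified illustration, not the official ROUGE implementation (no stemming, no longest-common-subsequence variant), and the three example sentences are invented for this demonstration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Simplified sketch: lowercases and splits on whitespace only.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples: same wording with a wrong fact vs. new wording, right fact.
reference = "the team agreed to ship the release on friday"
copied = "the team agreed to ship the release on monday"
paraphrase = "everyone decided the launch goes out friday"

copied_score = rouge1_f1(copied, reference)          # about 0.89
paraphrase_score = rouge1_f1(paraphrase, reference)  # 0.25
```

The near-copy with the wrong day scores far higher than the paraphrase with the right day, which is the failure pattern the support tickets describe.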

For a go/no-go call, you need metrics that map to product risks: missing key points, wrong facts, unnecessary repetition, and hard-to-read output. That starts by writing down what a “bad summary” costs you in your specific workflow, then measuring those failure modes directly.

What would make you say no? Defining summary risks in product terms

In practice, you don’t reject a summary because it “scored low.” You reject it because it causes rework, bad decisions, or loss of trust. So define “no-go” as a short list of concrete failure cases tied to your workflow: a meeting recap that drops the owner and deadline, a support-ticket digest that flips the customer’s plan tier, an article summary that states an outcome that never happened.

Write acceptance thresholds as product statements, then convert them into checks. If a missed action item creates follow-up meetings, you need a coverage target (e.g., “captures all decisions and action items”). If a wrong fact could trigger compliance escalation, you need a factuality threshold (“zero invented entities or numbers”). If users stop skimming because summaries ramble, you need redundancy and length bounds.

The hard part is cost: these checks take labeled examples and reviewer time, so pick the smallest “stop-ship” list you can enforce.

Coverage: catching missing key points without scoring “style”

That “smallest stop-ship list” usually starts with coverage, because missing one decision or action item is what forces users back into the full source. In a meeting-notes product, that means a summary can read smoothly and still fail if it drops who owns the next step. In support tickets, it can omit the customer’s environment details and make the handoff useless.

Measure coverage by checking for a fixed set of key items, not by judging prose. Create a lightweight annotation template: decisions, action items (owner + due date), key constraints, and top issues. Have reviewers mark those items in the source, then score whether the summary includes each one (binary) and whether it’s specific enough to act on (e.g., “follow up” is not an action item). This stays stable even when the model paraphrases.
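The key-item check can be scored with very little code once reviewers have filled in the template. The sketch below assumes a hypothetical `KeyItem` record with two reviewer judgments per item; the field names and the choice to report two recall figures are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class KeyItem:
    kind: str         # e.g. "decision", "action_item", "constraint", "issue"
    present: bool     # reviewer: does the summary include this item?
    actionable: bool  # reviewer: is it specific enough to act on?


def coverage(items: list[KeyItem]) -> dict[str, float]:
    """Key-item recall, plus recall counting only actionable mentions."""
    total = len(items)
    if total == 0:
        # Assumption: a source with no key items has nothing to miss.
        return {"recall": 1.0, "actionable_recall": 1.0}
    present = sum(item.present for item in items)
    actionable = sum(item.present and item.actionable for item in items)
    return {"recall": present / total, "actionable_recall": actionable / total}


# Hypothetical reviewer labels for one meeting recap.
items = [
    KeyItem("decision", present=True, actionable=True),
    KeyItem("action_item", present=True, actionable=False),  # "follow up", no owner
    KeyItem("action_item", present=True, actionable=True),
    KeyItem("constraint", present=False, actionable=False),
]
result = coverage(items)  # {"recall": 0.75, "actionable_recall": 0.5}
```

Reporting both numbers keeps a vague mention such as “follow up” from counting as a captured action item.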

The constraint is labor: key-item labeling costs time and gets inconsistent fast. Limit it to your highest-risk fields and sample weekly, then expand only if misses show up in production.

Factuality is the launch blocker: how to quantify hallucinations you can’t tolerate

Weekly sampling catches missing items, but the moment a summary states something that never happened, users stop trusting every future recap. In product terms, factuality failures aren’t just “a little worse”; they change behavior: people re-open the transcript, double-check the ticket, or paste the whole email thread into chat and ask again. That’s why a single invented number, name, or commitment date can be a stop-ship issue even when coverage looks fine.

Quantify this by scoring “atomic claims,” not whole summaries. Have reviewers highlight each distinct claim in the summary (e.g., “Renewal is $24k,” “Next step is a security review on Apr 3,” “Customer is on Pro tier”), then label each as supported, contradicted, or not found in the source. Roll this into a strict rate: unsupported-claims per summary, plus a separate count for “high-impact” claims (numbers, dates, named people, compliance statements). Set your threshold in those units (for example, 0 high-impact unsupported claims across the launch set).
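As a rough sketch of that roll-up, the code below takes reviewer-labeled claims and produces the two numbers described above. The `Claim` structure is hypothetical, and it assumes that both “contradicted” and “not found” count as unsupported; adjust that if your playbook treats them differently.

```python
from dataclasses import dataclass
from typing import Literal

Label = Literal["supported", "contradicted", "not_found"]


@dataclass
class Claim:
    text: str
    label: Label
    high_impact: bool  # numbers, dates, named people, compliance statements


def factuality(summaries: list[list[Claim]]) -> dict:
    """Unsupported claims per summary and the count of high-impact ones."""
    # Assumption: anything not labeled "supported" counts as unsupported.
    unsupported = [c for s in summaries for c in s if c.label != "supported"]
    high_impact = sum(c.high_impact for c in unsupported)
    n = len(summaries)
    return {
        "unsupported_per_summary": len(unsupported) / n if n else 0.0,
        "high_impact_unsupported": high_impact,
        "passes_gate": high_impact == 0,  # example gate: zero high-impact misses
    }


# Hypothetical labels for two summaries.
launch_set = [
    [
        Claim("Renewal is $24k", "supported", True),
        Claim("Customer is on Pro tier", "supported", True),
        Claim("Next step is a security review on Apr 3", "not_found", True),
    ],
    [
        Claim("Customer asked about SSO", "supported", False),
        Claim("Customer was satisfied with the answer", "contradicted", False),
    ],
]
report = factuality(launch_set)
```

Here `report` shows 1.0 unsupported claims per summary and one high-impact unsupported claim, so this set would fail a zero-tolerance gate.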

The hard part is reviewer drift. Two people will disagree unless you give them a short playbook and force them to cite the exact supporting line. Once you can trust the labels, you can tell whether the remaining problem is still correctness or simply too many words.

When summaries feel bloated: redundancy and compression as separate knobs

“Too many words” usually shows up after you’ve fixed the scary errors: the summary is correct, but it repeats itself and still takes as long to read as the source. Treat that as two different problems. Redundancy is wasted space inside the summary (the same point restated, repeated context, duplicated action items). Compression is the ratio between source length and summary length, even if every sentence is unique.

Measure redundancy by marking sentences as “adds a new key item” vs “restates an earlier one,” then score redundant-sentence rate per summary. You can also flag repeated entities and identical claim pairs (“Decision: ship Friday” and later “They agreed to ship Friday”). Measure compression with a simple length target (tokens, characters, or seconds-to-skim) tied to your UI, plus a minimum coverage gate so “short” doesn’t mean “missing.”
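Keeping the two knobs separate is easier when they are separate functions. The sketch below assumes reviewers label each summary sentence as "new" or "restates", and that length is measured in tokens; the function names, label strings, and result strings are all illustrative.

```python
def redundant_sentence_rate(labels: list[str]) -> float:
    """Share of summary sentences a reviewer marked as restating an earlier one."""
    if not labels:
        return 0.0
    return labels.count("restates") / len(labels)


def compression_ratio(source_tokens: int, summary_tokens: int) -> float:
    """Source length divided by summary length (higher means more compressed)."""
    if summary_tokens <= 0:
        raise ValueError("summary_tokens must be positive")
    return source_tokens / summary_tokens


def length_check(summary_tokens: int, max_tokens: int,
                 key_item_recall: float, min_recall: float) -> str:
    """Apply the coverage floor first so a short summary cannot pass by omission."""
    if key_item_recall < min_recall:
        return "missing"
    if summary_tokens > max_tokens:
        return "too_long"
    return "ok"


# Hypothetical example: four sentences, two of them restatements.
rate = redundant_sentence_rate(["new", "restates", "new", "restates"])  # 0.5
ratio = compression_ratio(source_tokens=1200, summary_tokens=150)       # 8.0
status = length_check(150, max_tokens=200, key_item_recall=0.9, min_recall=1.0)
```

In this example `status` is "missing": the summary fits the length target but falls below the coverage floor, so it does not count as a compression win.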

The real constraint is reader tolerance: reviewers will argue about what counts as repetition unless you define it with examples. Once you can separate “repeating” from “still too long,” you can tune prompts and post-processing without breaking factuality.

Readability and tone: because ‘technically correct’ can still be unusable

Once you’ve stopped repetition without breaking factuality, the next complaint is blunt: “I can’t read this.” You’ll see summaries that are accurate but packed with hedges, clause-heavy sentences, and odd phrasing that slows scanning in a UI meant for skimming. The tone can also be wrong for the workflow: a support digest that sounds like marketing copy, or meeting notes that read like a legal disclaimer.

Measure this with checks that map to user effort. Track reading grade level (or average sentence length) and set a ceiling that matches your audience, then pair it with a quick “scan test”: can a reviewer answer three fixed questions (top issue, decision, next step) in 10 seconds from the summary alone? For tone, use a small rubric with two “no-go” labels: overconfident language (“will,” “confirmed”) when the source is uncertain, and mismatched formality (too chatty or too stiff) for your product.
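Some of these checks can be pre-computed before a reviewer opens the summary. The sketch below is a crude approximation under stated assumptions: sentences are split on ., !, and ? (so abbreviations and decimals will be miscounted), and the overconfident-term list contains only the two examples above. A hit only flags a candidate; a reviewer still has to decide whether the source was actually uncertain.

```python
import re

# Assumption: starter list taken from the two examples in the text.
OVERCONFIDENT_TERMS = ("will", "confirmed")


def average_sentence_length(text: str) -> float:
    """Mean words per sentence, using a naive punctuation-based split."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def overconfident_hits(text: str) -> list[str]:
    """Terms worth a reviewer's second look, in order of appearance."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w in OVERCONFIDENT_TERMS]


def scan_pass_rate(passed: list[bool]) -> float:
    """Share of summaries where the reviewer answered all scan questions in time."""
    return sum(passed) / len(passed) if passed else 0.0


# Hypothetical summary text and scan-test results.
summary = "Legal confirmed the contract. The customer will renew in May."
avg_len = average_sentence_length(summary)   # 5.0
hits = overconfident_hits(summary)           # ["confirmed", "will"]
pass_rate = scan_pass_rate([True, True, False, True])  # 0.75
```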

The constraint is subjectivity: you need at least two example summaries per label, or reviewers will score their own preferences. With that in place, you’re ready to turn these metrics into a scorecard stakeholders will accept.

A small scorecard you can defend in a go/no-go meeting

Stakeholders will ask, “So is it good enough?” and you need an answer that points to failure rates, not vibes. Use a short scorecard with gates and one or two “watch” metrics: Coverage (key-item recall; launch gate), Factuality (unsupported claims per summary, with a hard gate of zero high-impact unsupported claims), Redundancy (redundant-sentence rate; watch), Compression (length target with a coverage floor; watch), Readability/Tone (10-second scan pass rate plus two no-go tone labels; launch gate if your UI depends on skimming).
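One way to make the scorecard mechanical is to encode gates and watch metrics as data. The sketch below is illustrative only: every threshold value is a placeholder to be replaced with your own product statements, and the field names are hypothetical. The only rule taken directly from the text is the hard gate of zero high-impact unsupported claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scorecard:
    key_item_recall: float
    high_impact_unsupported: int
    redundant_sentence_rate: float
    over_length_rate: float  # share of summaries above the length target
    scan_pass_rate: float
    tone_no_go_count: int


@dataclass
class Thresholds:
    # Placeholder values, not recommendations.
    min_key_item_recall: float = 0.95
    min_scan_pass_rate: float = 0.90
    max_redundant_sentence_rate: float = 0.10
    max_over_length_rate: float = 0.10
    readability_is_gate: bool = True  # set False if the UI does not depend on skimming


def evaluate(card: Scorecard, t: Optional[Thresholds] = None) -> dict:
    """Return a go/no-go decision with the failed gates and watch metrics."""
    t = t or Thresholds()
    blockers, watch = [], []
    if card.key_item_recall < t.min_key_item_recall:
        blockers.append("coverage")
    if card.high_impact_unsupported > 0:
        blockers.append("factuality")
    if card.scan_pass_rate < t.min_scan_pass_rate or card.tone_no_go_count > 0:
        (blockers if t.readability_is_gate else watch).append("readability_tone")
    if card.redundant_sentence_rate > t.max_redundant_sentence_rate:
        watch.append("redundancy")
    if card.over_length_rate > t.max_over_length_rate:
        watch.append("compression")
    return {"decision": "no-go" if blockers else "go",
            "blockers": blockers, "watch": watch}


# Hypothetical launch-set results.
card = Scorecard(0.97, 0, 0.15, 0.05, 0.92, 0)
outcome = evaluate(card)
```

For this card, `outcome` is a "go" with redundancy on the watch list; a single high-impact unsupported claim would flip it to "no-go" regardless of the other numbers.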

Keep it lightweight: 50–100 representative examples, weekly production sampling, and a reviewer playbook with cite-the-source rules. The real cost is review time, so reserve deep labeling for launches and incidents, and rely on sampling to catch drift before users do.