Why AI Performance Varies Across Different Topics

The whiplash: when AI nails it—and when it doesn’t

You paste a question into an AI assistant and it delivers something you’d happily send to a client: clean structure, sharp phrasing, even decent examples. Then you try a nearby question—same tool, same tone—and it hands back a confident answer that’s subtly off, missing key constraints, or built on facts that don’t exist.

That swing isn’t random. Some requests are “AI-friendly” because they’re basically pattern work: summarizing, reformatting, drafting options, or explaining common concepts. Others are “AI-fragile” because they require current details, local rules, or judgment you’d have to defend under pressure. The hard part is that both can sound equally polished.

If you can name which kind of request you’re making before you hit enter, you stop treating every response like a surprise—and start treating it like a tool with limits.

Are you asking for recall—or for judgment you’d defend in a meeting?

That “tool with limits” feeling usually shows up right when you switch from asking for recall to asking for judgment. A quick definition of a term, a list of common pros and cons, or a template email is mostly retrieval plus formatting. It tends to hold up.

But if you’re asking, “Which option should we choose?” or “What’s the best approach here?”, you’re asking for a call you’d defend in a meeting. That kind of answer depends on priorities, risk tolerance, and constraints the model can’t see. If you don’t supply those, it will still produce a clean recommendation—and you won’t notice what’s missing until someone asks, “Based on what?” The cost is real: you can lose an hour polishing the wrong direction.

A fast check helps: can you point to the facts that would change the answer? If you can’t, you’re probably in judgment land—and the next question is whether those facts are current, local, or proprietary.

How much of this depends on up-to-date, local, or proprietary facts?

In real work, the answer often hinges on details that change faster than your doc does: today’s pricing, a vendor’s latest policy, a new feature rollout, a deadline that moved. If the “facts that would change the answer” live on a webpage that updates weekly, in a Slack thread, or in a client contract, a general-purpose model will guess unless you give it the inputs.

Local rules create the same problem. “Is this compliant?” lands differently in New York than in Texas, and differently in April 2026 than it did two years ago. If you don’t specify where and when, you’ll get something that reads broadly true but may be unusable in your context.

Proprietary facts are the quiet killer: internal metrics, negotiated terms, current headcount, a roadmap slide. The model can’t see them, so it fills the gap with defaults. When you notice those gaps, you’re ready for the next trap: topics with hidden rules and real penalties for being wrong.

When the topic has hidden rules (and the penalty for being wrong is high)

It often looks like a simple question: “Can we do this?” “Do we have to file anything?” “Is this allowed?” You’ll get a crisp, confident answer—because many domains have a plain-language surface that hides a real rulebook underneath. Tax, HR, security, procurement, regulated marketing, accessibility, and contracts all work this way: one missing detail flips the outcome.

The trouble is that “hidden rules” aren’t just extra facts. They’re gates. If a contract crosses a dollar threshold, a policy might require a second approver. If customer data touches a specific system, you may trigger a security review. If a benefit is offered in one state, an extra notice might be mandatory. The model can’t reliably infer which gate you’re near, so it tends to answer for the generic case.

When being wrong creates real fallout—an audit issue, a legal miss, a broken launch—treat the first answer as a map of questions, not a decision. The next move is to force the rulebook into the prompt.

The “sounds right” trap: spotting confident nonsense early

That “map of questions” mindset matters because the most dangerous output isn’t obviously wrong—it’s wrong in a way that reads like it came from someone competent. You’ll see tidy definitions, plausible steps, and a confident tone, even when the answer quietly invents a policy name, cites a rule that doesn’t exist, or fills a missing detail with a “reasonable” default.

A quick tell is false specificity. If it gives exact thresholds, dates, fees, or “required forms” without naming a source you could actually open, assume it’s guessing. Another tell is when it answers the version of the question you wish you asked: it skips edge cases, ignores your constraints, or never asks a single clarifying question in a domain where one missing fact flips the outcome.

Don’t debate it. Force a self-check: ask it to list assumptions, identify which facts would change the recommendation, and provide a short “what would make this wrong” section before you act on it.

Make the same topic easier for the model by changing the ask

That “list assumptions” move works even better when you reshape the request so the model has fewer places to guess. In practice, people ask a broad question (“Is this compliant?”) and get a broad answer. You can often get a more reliable output on the same topic by turning it into a constrained task: “Given these three facts, draft the questions I need to answer before we can decide,” or “Write two options: one that assumes we’re in Texas and one that assumes we’re in New York, and tell me what changes.”

Another useful shift is to ask for structure, not conclusions. “Give me a checklist I can verify,” “Provide a decision tree with the inputs labeled,” or “Draft an email to Legal/IT that asks for the exact data we’re missing.” If the model has to show its inputs, you can see where it’s weak.

This kind of prompting is slower than “just answer,” and you can still end up iterating in circles if the missing facts live in systems you can’t access. That’s usually the moment to switch modes.

When to stop iterating and switch to sources, tools, or experts

That “switch modes” moment usually arrives when your next prompt would be, “Okay, but what’s the real number/date/rule?” If the answer depends on a live source you can open—pricing pages, policy docs, release notes—stop asking the model to guess and go fetch the source. Use the model to generate the exact queries to run, the fields to extract, and the checklist you’ll verify against, then paste the relevant excerpts back in.

If you’re working with internal facts, switch to tools, not cleverer wording. Pull the metric from the dashboard, the clause from the contract, the ticket status from the system of record. “Assume X” saves time until it ships a decision built on the wrong X.

And when the downside is concrete—legal exposure, security risk, payroll mistakes—escalate fast. Use the draft to brief Legal, IT, Finance, or a specialist, and ask them to confirm the one or two gates that actually decide the outcome. Then you can treat the model as a drafting engine again.

A workable default: treat AI output like a draft with a built-in confidence meter

Then you can treat the model as a drafting engine again—but keep one habit: assume every answer carries an invisible “confidence meter,” and your job is to read it. If the output sticks to stable concepts, labels its assumptions, and points to inputs you can verify, you’re probably safe to use it as a draft. If it leans on exact numbers, dates, or “required” steps without a source, treat it like a placeholder.

A simple default works: publish language only after one pass of “what would make this wrong?” and one check against the real system of record. That costs minutes. Skipping it can cost a rework cycle, or worse, a decision you can’t defend.