LLM Testing Checklist: A Pre-Launch Guide

Air Canada lost a court case because its chatbot invented a refund policy. The tribunal ruled the airline had to honor what the bot promised. Klarna reversed its AI-first customer service strategy after its chatbot delivered worse service than humans, and started rehiring agents. Both stories made headlines because the underlying problem was the same. A large language model shipped into production without the QA process the technology actually needs.

The pattern is widespread. McKinsey’s 2025 State of AI report found that 78% of organizations now use AI in at least one business function, yet only about a third have scaled it to enterprise level. Most of the rest are running pilots that either stall in development or ship before they’re ready, with no structured way to know which one is happening.

This LLM testing checklist is built for the second group. It covers seven pre-launch stages, the thresholds that separate “ready” from “not yet,” the tools each stage needs, and the failure patterns that show up most often in LLM testing for production-bound products. Treat it as a structured pass before launch.

Why LLM Testing Breaks Traditional QA Assumptions

Traditional software is deterministic. The same input gives the same output, so test assertions pass or fail cleanly. Large language models do not behave this way. Ask the same question twice and you get two valid but different answers. Change a word in the system prompt, and downstream behavior shifts in ways you cannot fully predict.

That non-determinism reshapes how to test LLM behavior. You replace binary assertions with scored thresholds. You measure how often the model behaves correctly rather than whether it does or does not on a single run. And you account for a new class of failures that traditional QA never had to think about: hallucinations, prompt injection, jailbreaks, training data leakage, bias drift, and silent regressions after every model update.

The result is a different testing discipline, anchored to the LLM evaluation metrics that matter and a wider scope than functional correctness alone. The checklist below sequences that discipline into seven stages that build on each other.

The 7-Stage LLM Testing Checklist

Run these stages in order. Each one produces an output the next stage needs. Skip the early ones and you pay for it later, when fixing a flawed dataset or rewriting a system prompt costs ten times what it would have on day one. The seven stages below mirror the sequence most production-bound LLM evaluation programs follow, and they scale up or down depending on risk tier.

LLM Testing Checklist: A Pre-Launch Guide

Stage 1: Define Your Evaluation Contract

Before writing a single test, document three things: the use case, the risk tier, and the acceptable thresholds for each metric you plan to track. Get sign-off from product, engineering, and QA on all three. Without this contract, every later argument about “is this good enough to ship?” becomes opinion.

Risk tier matters most. A customer-facing chatbot can tolerate a hallucination rate around 5%. A medical or legal assistant needs to stay below 1%. Pick the tier first, then the thresholds follow. Common starting points for production deployments:

  • Faithfulness: 0.75 or higher on a normalized 0-1 scale.
  • Answer relevance: 0.70 or higher.
  • Toxicity: 0.10 or lower.
  • Hallucination rate: 5% or lower for general use, 1% for high-stakes contexts.

The contract becomes the gate criteria. If a model fails any threshold, it does not ship.

Stage 2: Curate Your Golden Dataset

The dataset is the foundation of every test that follows. A weak dataset gives you confident-looking numbers that mean nothing in production.

For a basic launch, build at least 50 examples. For production-grade evaluation, aim for 500 or more. The best source is real usage data, anonymized and labeled. If you do not have that yet, simulate representative inputs across three categories: normal cases that users will hit most of the time, edge cases like rare scenarios and malformed input, and adversarial inputs including jailbreak attempts and scope-violation prompts.

When teams ask how to test LLM applications without first building a golden dataset, the honest answer is that they can’t. The dataset has to come first, even if it starts small, and grow as production logs surface new failure patterns over the first few weeks of use.

Stage 3: Test Output Quality

This is where most teams start, often before they should. Output quality testing covers three metrics that work together:

  • Faithfulness measures whether the model’s response stays within the source material. Non-negotiable for any RAG system.
  • Answer relevance measures whether the response actually addresses the user’s question. A faithful response can still answer the wrong question.
  • Hallucination rate measures how often outputs contain factually incorrect claims, with thresholds that scale to your risk tier.

For RAG pipelines, retrieval testing comes before generation testing. Industry analysis consistently shows that the majority of RAG failures originate in retrieval, not the model itself. A practical stack of RAG evaluation tools catches retrieval gaps before they reach user-facing answers.

Tools that work well at this stage include DeepEval, Ragas, and G-Eval for LLM-as-a-judge scoring. Plan one to two weeks for this stage on a customer-facing product.

Stage 4: Test for Bias and Safety

Safety testing covers two distinct surfaces. Toxicity scoring catches harmful, offensive, or discriminatory outputs at the moment of generation. Bias testing surfaces systematic differences in how the model treats different demographic groups, even when individual responses look fine on their own.

Automated scanners catch the obvious cases. Google’s Perspective API is a common baseline for toxicity, and libraries like AIF360 help with demographic fairness scans. But automated tooling misses subtle, context-specific harm, which is why the NIST AI Risk Management Framework and its Generative AI Profile call for combining automated evaluation with human review across the AI lifecycle. The strongest programs pair every automated scan with manual review of 50 or more borderline cases flagged for human judgment.

Compliance pressure now reinforces this stage. The EU AI Act requires high-risk AI systems to document testing for accuracy, bias mitigation, and human oversight. For products operating in the EU, this stage is no longer optional even if you skipped it before.

Stage 5: Run Security and Red Team Tests

LLMs introduce an attack surface that traditional security testing was not built for. The five most common vectors to probe before launch:

  • Direct prompt injection, where user input overrides the system prompt.
  • Indirect prompt injection through malicious instructions embedded in retrieved documents, URLs, or uploaded files.
  • Jailbreaks using roleplay, hypothetical framing, or DAN-style prompts that bypass safety filters.
  • System prompt extraction attempts that reveal your template and any sensitive context inside it.
  • Training data leakage under crafted queries that pull fragments of the model’s training set.

Automated scanners like Garak and PyRIT cover known attack patterns at the input and output layer. They are necessary but not sufficient. Indirect injection through trusted content and chained tool exploits typically need human red teamers to surface, since they depend on context the scanner cannot see.

For production deployments, pair the automated baseline with focused penetration testing services. This is the stage where teams skipping outside expertise lose the most, since the highest-impact LLM attacks are rarely the ones in the documentation.

Stage 6: Validate Performance, Latency, and Cost

Performance testing for LLM-powered products goes beyond “does the response feel fast”. Three numbers matter most. Time to first token (TTFT) drives perceived speed for streaming interfaces and should land under one second for chat. Total response latency, especially the p95 figure, tells you more than the average about real user experience. Cost per request needs to be tracked at baseline and at ten times expected production volume, because the architecture that handles 10 test users for two hundred dollars a month behaves very differently at fifty thousand concurrent sessions.

This is where teams asking how to test LLMs under realistic load discover that their pilot architecture cannot scale. Caching layers, fallback models, rate limiting, and load balancers all reshape the latency and cost profile in production. Run load tests against the production architecture, not the pilot.

Stage 7: Lock in Regression Suite and Observability

The final stage is the one most teams treat as optional, which is also why they deploy and then quietly degrade. Every prompt update, model swap, and knowledge base change can break what worked yesterday.

Wrap every prior stage’s tests into a regression suite that runs on every change, not every release. Snapshot baseline scores so future regressions become visible immediately rather than after a customer complaint. Add production observability before launch, not after the first incident. Tracing, response logging, and drift alerts all belong in place on day one.

Regression testing for LLM-powered products looks different from traditional regression. The suite is probabilistic, the thresholds are scored, and the comparisons happen across distributions rather than exact matches. The principle is the same though: catch the regression in CI, not in a customer’s screenshot on social media.

Where LLM Testing Goes Wrong even with the Checklist

A complete AI testing checklist still leaves room for predictable failures. Three patterns show up repeatedly in pre-launch reviews of LLM-powered products.

Treating evaluation as a launch ritual. Teams run the checklist once, hit their thresholds, ship, and shelve the suite. The first model update silently breaks half the gains, and no one notices until a user does. The fix is making the regression suite part of every deployment, not a milestone you check off and forget.

Testing the model in isolation. A model that scores well in a sandbox can still fail in production because retrieval is broken, the system prompt is stale, or a tool integration rate-limits at the wrong moment. The checklist works only when applied to the full integrated stack, including the orchestration layer, RAG pipeline, and any downstream APIs the model can reach.

Skipping adversarial testing because the product seems low-risk. Internal tools get jailbroken too. Any LLM connected to real data, tools, or user input belongs in Stage 5, regardless of whether your audience is one team or one million users. The Klarna case is instructive here. A support bot that users tricked into writing Python was not a high-stakes deployment, but the reputational fallout still earned global coverage.

From Checklist to Launch

A pre-launch testing pass typically runs three to six weeks for a customer-facing product. Internal tools can ship in one to two weeks if the dataset is ready. High-stakes deployments in regulated industries usually need six to twelve weeks to satisfy compliance review.

The seven-stage flow gives you the structure. The thresholds give you the gate. The regression suite and observability layer give you a way to keep what you earned through testing in the first place. What this checklist cannot give you is the time you did not budget for testing in the first place.

If you are closer to launch than you would like to be, contact us and we will tell you which stages to run first against your timeline. QAwerk’s LLM testing services cover the full sequence with senior engineers experienced in pre-launch evaluation for production LLMs.

FAQ

What is the difference between LLM testing and LLM evaluation?

Evaluation measures output quality against defined metrics like faithfulness, answer relevance, and toxicity. Testing is the broader process that includes evaluation plus security, performance, integration, and regression checks. Most production teams need both, run in sequence, before launching a large language model into a customer-facing context.

How long does LLM testing take?

A standard pre-launch pass takes three to six weeks for a customer-facing product. Internal tools can ship in one to two weeks if the dataset is ready. High-stakes deployments in medical, legal, or financial domains typically run six to twelve weeks to meet regulatory and compliance expectations.

How is LLM testing different from traditional software testing?

Traditional software is deterministic, so tests pass or fail on exact output match. Large language models produce probabilistic outputs, meaning the same input can return different valid responses. Testing replaces binary assertions with scored thresholds across accuracy, safety, latency, and consistency, and adds an entirely new category of adversarial checks.

What tools are used for LLM testing?

Common tools include DeepEval and Ragas for output quality, Promptfoo for prompt regression, Garak and PyRIT for adversarial security, LangSmith and OpenAI Evals for evaluation pipelines, and Guardrails AI for runtime safety enforcement. Most production teams combine three or four depending on their use case and budget.

See how QA validation of an LLM-powered chat and onboarding flow helped an AI matchmaking app expand across 4 US cities and secure $6.7M in funding

Please enter your business email isn′t a business email