Shahzad Bhatti Welcome to my ramblings and rants!

March 22, 2026

Generative and Agentic AI Design Patterns

Filed under: Computing — admin @ 8:29 pm

Over the past year I’ve built production agentic systems across several domains and shared what I learned along the way: production-grade AI agents with MCP and A2A, a daily minutes assistant with RAG, MCP, and ReAct, rebuilding fintech infrastructure with ReAct and local models, automated PII detection with LangChain and Vertex AI, and API compatibility guardians with LangGraph. I learned a lot building those systems through trial and error. In this post, I share a set of generative and agentic AI patterns drawn from Generative AI Design Patterns and Agentic Design Patterns. I have turned these patterns into hands-on Python examples at github.com/bhatti/agentic-patterns, which runs agentic apps locally via Ollama with open-source models (Qwen, DeepSeek, Llama, Mistral). Each pattern in the repo includes a README, working code, real-world use cases, and best practices.


Quick Start

git clone https://github.com/bhatti/agentic-patterns
cd agentic-patterns
pip install -r requirements.txt
ollama pull llama3
cd patterns/logits-masking && python example.py

See SETUP.md for full setup instructions.



Category 1: Content & Style Control

The first five patterns control and optimize content generation, style, and format:

Pattern 1: Logits Masking

Category: Content & Style Control
Use When: You need to enforce constraints during generation (e.g., valid JSON, banned words)

Problem

When generating structured outputs (like JSON, code, or formatted text), language models can produce invalid sequences that don’t conform to required style rules, schemas, or constraints.

Solution

Logits Masking intercepts the model’s token generation process to enforce constraints during sampling. Three key steps:

  1. Intercept Sampling — Modify logits before token selection
  2. Zero Out Invalid Sequences — Mask invalid tokens (set logits to -inf)
  3. Backtracking — Revert to checkpoint if invalid sequence detected

Use Cases

  • API response generation (ensure valid JSON)
  • Code generation (enforce style guidelines)
  • Content moderation (prevent banned words)
  • Structured data extraction (match specific formats)

Constraints: Requires access to model logits (not available in all APIs). State tracking can be complex for nested structures. Performance overhead from logits processing.

Tradeoffs:

  • ✅ Prevents invalid generation at source
  • ✅ More efficient than post-processing
  • ⚠️ More complex than simple validation
  • ⚠️ May limit model creativity

Code Snippet

from transformers import LogitsProcessor

class JSONLogitsProcessor(LogitsProcessor):
    """Intercept logits and mask invalid JSON tokens."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer  # needed to decode the running prefix

    def __call__(self, input_ids, scores):
        # STEP 1: Intercept sampling
        current_text = self.tokenizer.decode(input_ids[0])

        # STEP 2: Zero out invalid sequences
        for token_id in range(scores.shape[-1]):
            if not self._is_valid_json_token(token_id, current_text):
                scores[0, token_id] = float('-inf')  # Mask invalid

        return scores

Full Example: patterns/logits-masking/example.py


Pattern 2: Grammar Constrained Generation

Category: Content & Style Control
Use When: You need outputs that conform to formal grammar specifications

Problem

Language models often produce text that doesn’t conform to required formats, schemas, or grammars. Unlike simple masking, grammar-constrained generation ensures outputs follow formal grammar specifications.

Solution

Grammar Constrained Generation uses formal grammar specifications to guide token generation. Three implementation approaches:

  1. Grammar-Constrained Logits Processor — Use EBNF grammar to create processor
  2. Standard Data Format — Leverage JSON/XML with existing validators
  3. User-Defined Schema — Use custom schemas (JSON Schema, Pydantic)

Use Cases

  • API configuration generation (OpenAPI specs)
  • Configuration files (YAML, TOML that must parse)
  • Database queries (SQL with guaranteed syntax)
  • Code generation (must compile/parse)

Constraints: Requires grammar definition or schema. Grammar parsing can be computationally expensive. Complex grammars may limit generation speed.

Tradeoffs:

  • ✅ Guarantees grammatical correctness
  • ✅ Works with existing schema languages
  • ⚠️ More complex than simple masking
  • ⚠️ May require grammar expertise

Code Snippet

# Option 1: Formal Grammar
grammar = """
root        ::= endpoint_config
endpoint_config ::= "{" ws endpoint_def ws "}"
endpoint_def    ::= '"endpoint"' ws ":" ws endpoint_obj
"""

# Option 2: JSON Schema
schema = {
    "type": "object",
    "required": ["endpoint"],
    "properties": {
        "endpoint": {
            "type": "object",
            "required": ["name", "method", "path"]
        }
    }
}

# Apply grammar constraints during generation
processor = GrammarConstrainedProcessor(grammar, tokenizer)
logits = processor(input_ids, logits)

Full Example: patterns/grammar/example.py


Pattern 3: Style Transfer

Category: Content & Style Control
Use When: You need to transform content from one style to another

Problem

Content often needs to be transformed from one style to another while preserving core information. Manual rewriting is time-consuming and inconsistent.

Solution

Style Transfer uses AI to transform content between styles. Two approaches:

  1. Few-Shot Learning — Use example pairs in prompt (no training)
  2. Model Fine-Tuning — Fine-tune model on style pairs

Use Cases

  • Professional communication (notes to emails)
  • Content adaptation (academic to blog posts)
  • Brand voice (maintain consistent tone)
  • Platform adaptation (different social media styles)

Constraints: Few-shot limited by context window. Fine-tuning requires training data. Style consistency can vary.

Tradeoffs:

  • ✅ Few-shot: Quick, no training needed
  • ✅ Fine-tuning: Better consistency
  • ⚠️ Few-shot: May not capture nuances
  • ⚠️ Fine-tuning: Requires data collection

Code Snippet

# Option 1: Few-Shot Learning
examples = [
    StyleExample(
        input_text="urgent: need meeting minutes by friday",
        output_text="Subject: Urgent: Meeting Minutes Needed\n\nDear [Recipient],\n\n..."
    )
]

transfer = FewShotStyleTransfer(examples)
result = transfer.transfer_style("quick update: deadline moved")

# Option 2: Fine-Tuning
training_data = [
    {"prompt": "Convert notes to email", "completion": "Professional email..."}
]
fine_tuned_model = fine_tune_model(base_model, training_data)

Full Example: patterns/style-transfer/example.py


Pattern 4: Reverse Neutralization

Category: Content & Style Control
Use When: You need to generate content in a specific personal style that zero-shot can’t capture

Problem

When you need content in a specific, personalized style, zero-shot prompting fails because the model doesn’t know your unique writing style.

Solution

Reverse Neutralization uses a two-stage fine-tuning approach, applied at inference time:

  1. Generate Neutral Form — Create content in a neutral, standardized format
  2. Fine-Tune Style Converter — Train a model to convert neutral → your style
  3. Inference — Use the fine-tuned model for style conversion

Use Cases

  • Personal blog writing (technical content to your style)
  • Brand voice (consistent voice across content)
  • Documentation style (match organization’s style guide)
  • Communication templates (your personal email style)

Constraints: Requires fine-tuning. Needs training data (neutral → style pairs). Two-stage process.

Tradeoffs:

  • ✅ Learns your specific style
  • ✅ Consistent results
  • ✅ Captures personal nuances
  • ⚠️ Requires data collection and training
  • ⚠️ Less flexible (need retraining to change style)

Code Snippet

# Step 1: Generate neutral form
neutral_generator = NeutralGenerator()
neutral = neutral_generator.generate_neutral("API Authentication")

# Step 2-3: Create training dataset and fine-tune
pairs = [
    StylePair(neutral="Technical doc...", styled="Your blog style...")
]
fine_tuned_model = fine_tune_on_preferences(pairs)

# Step 4: Use fine-tuned model
converter = StyleConverter(fine_tuned_model)
styled = converter.convert_to_style(neutral)

Full Example: patterns/reverse-neutralization/example.py


Pattern 5: Content Optimization

Category: Content & Style Control
Use When: You need to optimize content for specific performance goals (e.g., open rates, conversions)

Problem

When creating content for specific purposes, you need to optimize for outcomes. Traditional A/B testing is limited — it’s manual, time-consuming, and doesn’t learn patterns.

Solution

Content Optimization uses preference-based fine-tuning (DPO) to train a model to generate content that wins in comparisons:

  1. Generate Pair — Create two variations from same prompt
  2. Compare — Test and pick winner based on metrics
  3. Create Dataset — Collect preference pairs (prompt, chosen, rejected)
  4. Fine-Tune with DPO — Train model on preferences
  5. Use Optimized Model — Generate better-performing content

Use Cases

  • Email marketing (optimize subject lines for open rates)
  • E-commerce (optimize product descriptions for conversions)
  • Social media (optimize posts for engagement)
  • Landing pages (optimize copy for sign-ups)

Constraints: Requires preference data collection. DPO fine-tuning is computationally intensive. Need clear optimization metrics.

Tradeoffs:

  • ✅ Learns from all comparisons
  • ✅ Scales to many variations
  • ✅ Model internalizes winning patterns
  • ⚠️ Requires training data (100+ pairs)
  • ⚠️ More complex than A/B testing

Code Snippet

# Step 1: Generate pair
generator = ContentGenerator()
var_a, var_b = generator.generate_pair("New product launch")

# Step 2: Compare and pick winner
comparator = ContentComparator(optimization_goal="open_rate")
pair = comparator.compare(ContentPair(prompt, var_a, var_b))

# Step 3-4: Create dataset and fine-tune
preferences = [PreferenceExample(prompt, chosen, rejected)]
dpo_trainer = PreferenceTuner()
optimized_model = dpo_trainer.fine_tune(preferences)

# Step 5: Use optimized model
optimized_generator = OptimizedContentGenerator(optimized_model)
result = optimized_generator.generate_optimized("Newsletter")

Full Example: patterns/content-optimization/example.py


Category 2: Adding Knowledge / RAG Stack

Patterns 6–12 augment LLMs with external knowledge sources for accessing up-to-date information, private data, and knowledge beyond the model’s training cutoff.


Pattern 6: Basic RAG (Retrieval-Augmented Generation)

Category: Adding Knowledge
Use When: You need to augment LLM responses with external knowledge sources

Problem

LLMs have three key knowledge limitations:

  • Static Knowledge Cutoff — Trained on data up to a specific date
  • Model Capacity Limits — Can’t store all knowledge in parameters
  • Lack of Private Data Access — No access to internal documents or databases

Solution

Basic RAG uses trusted knowledge sources when generating LLM responses. Two pipelines:

Indexing Pipeline (preparatory):

  • Load documents → Chunk into manageable pieces → Store in searchable index

Retrieval-Generation Pipeline (runtime):

  • Retrieve relevant chunks for query → Ground prompt with retrieved context → Generate response using LLM

Use Cases

  • Product documentation (answer questions about features/APIs)
  • Company knowledge base (query internal wikis/policies)
  • Customer support (accurate answers from support docs)
  • Research assistance (search through papers/documents)
  • Legal/compliance (query regulations/guides)

Tradeoffs:

  • ✅ Access to up-to-date and private knowledge
  • ✅ Can handle large knowledge bases
  • ✅ Transparent (can cite sources)
  • ⚠️ Requires indexing infrastructure
  • ⚠️ Retrieval quality affects response quality

Code Snippet

# INDEXING PIPELINE
loader = DocumentLoader()
documents = loader.load_documents("product_docs")

splitter = TextSplitter(chunk_size=500, chunk_overlap=50)
chunks = []
for doc in documents:
    chunks.extend(splitter.split_document(doc))

index = Index()
index.add_chunks(chunks)

# RETRIEVAL-GENERATION PIPELINE
retriever = Retriever(index, top_k=3)
generator = RAGGenerator(retriever)

result = generator.generate("How do I authenticate with the API?")
# Returns answer with source citations

Full Example: patterns/basic-rag/example.py


Pattern 7: Semantic Indexing

Category: Adding Knowledge
Use When: You need semantic understanding beyond keywords, or have complex content (images, tables, code)

Problem

Traditional keyword-based indexing has limitations:

  • Semantic Understanding — Misses meaning (“car” and “automobile” are different keywords)
  • Complex Content — Struggles with images, tables, code blocks, structured data
  • Context Loss — Fixed-size chunking breaks up related content
  • Multimedia — Can’t effectively index images, videos, or other media

Solution

Semantic Indexing uses embeddings (vector representations) to capture meaning:

  1. Embeddings — Encode text/images into fixed vector representations for semantic meaning
  2. Semantic Chunking — Divide text into meaningful segments based on semantic content
  3. Image/Video Handling — Use OCR or vision models for embedding generation
  4. Table Handling — Organize and extract key information from structured data
  5. Contextual Retrieval — Preserve context with hierarchical chunking
  6. Hierarchical Chunking — Multi-level chunking (document → section → paragraph)

Use Cases

  • Technical documentation (code examples, API docs, tutorials)
  • Research papers (find by concept, not keywords)
  • Product catalogs (search by features, not names)
  • Multimedia content (images, videos with descriptions)

Code Snippet

# CONCEPT 1: EMBEDDINGS
import math
import re
from dataclasses import dataclass, field
from typing import List, Optional

from sentence_transformers import SentenceTransformer

class EmbeddingGenerator:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def generate_embedding(self, text: str) -> List[float]:
        return self.model.encode(text).tolist()

    def cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        magnitude1 = math.sqrt(sum(a * a for a in vec1))
        magnitude2 = math.sqrt(sum(a * a for a in vec2))
        return dot_product / (magnitude1 * magnitude2) if magnitude1 * magnitude2 > 0 else 0.0

# CONCEPT 2: SEMANTIC CHUNKING
@dataclass
class SemanticChunk:
    id: str
    text: str
    embedding: Optional[List[float]] = None
    chunk_type: str = "text"  # text, code, table, image
    parent_id: Optional[str] = None
    children_ids: List[str] = field(default_factory=list)

class SemanticChunker:
    def chunk_by_structure(self, content: str) -> List[SemanticChunk]:
        """Chunk respecting document structure (headers, sections, paragraphs)."""
        chunks = []
        sections = re.split(r'\n(#{2,3}\s+.+?)\n', content)
        current_section = None
        chunk_index = 0
        for part in sections:
            if part.strip().startswith('#'):
                if current_section:
                    chunks.append(SemanticChunk(id=f"chunk-{chunk_index}", text=current_section))
                    chunk_index += 1
                current_section = part + "\n"
            else:
                current_section = (current_section or "") + part
        if current_section:
            chunks.append(SemanticChunk(id=f"chunk-{chunk_index}", text=current_section))
        return chunks

# CONCEPTS 5 & 6: HIERARCHICAL CHUNKING & CONTEXTUAL RETRIEVAL
class ContextualRetriever:
    def retrieve_with_context(self, query: str, top_k: int = 3,
                              include_context: bool = True) -> List[SemanticChunk]:
        query_embedding = self.embedding_generator.generate_embedding(query)
        scored_chunks = []
        for chunk in self.chunks.values():
            if chunk.embedding:
                similarity = self.embedding_generator.cosine_similarity(
                    query_embedding, chunk.embedding
                )
                scored_chunks.append((similarity, chunk))
        scored_chunks.sort(key=lambda x: x[0], reverse=True)
        top_chunks = [chunk for _, chunk in scored_chunks[:top_k]]
        if include_context:
            contextual_chunks = []
            for chunk in top_chunks:
                contextual_chunks.append(chunk)
                # Add parent for context
                if chunk.parent_id and chunk.parent_id in self.chunks:
                    parent = self.chunks[chunk.parent_id]
                    if parent not in contextual_chunks:
                        contextual_chunks.append(parent)
                # Add children for detail
                for child_id in (chunk.children_ids or []):
                    if child_id in self.chunks:
                        child = self.chunks[child_id]
                        if child not in contextual_chunks:
                            contextual_chunks.append(child)
            return contextual_chunks
        return top_chunks

Full Example: patterns/semantic-indexing/example.py


Pattern 8: Indexing at Scale

Category: Adding Knowledge
Use When: Your RAG system needs to handle large-scale knowledge bases with evolving, time-sensitive information

Problem

RAG systems in production face critical challenges as knowledge bases grow:

  • Data Freshness — Recent findings obsolete old guidelines
  • Contradictory Content — Multiple versions of information cause confusion
  • Outdated Content — Old information remains in index, leading to incorrect answers

Solution

Indexing at Scale uses metadata and temporal awareness:

  1. Document Metadata — Use timestamps, version numbers, source information
  2. Temporal Tagging — Tag chunks with creation/update dates, expiration dates
  3. Contradiction Detection — Identify and prioritize newer over older contradictory content
  4. Outdated Content Management — Automatically deprecate or flag outdated information

Code Snippet

@dataclass
class TemporalMetadata:
    created_at: datetime
    updated_at: datetime
    expires_at: Optional[datetime] = None
    version: str = "1.0"
    source: str = ""
    authority: str = "medium"  # high, medium, low

_AUTHORITY_RANK = {"low": 0, "medium": 1, "high": 2}

class ContradictionDetector:
    def _resolve_contradiction(self, chunk_a, chunk_b):
        # Prefer the newer chunk
        if chunk_a.metadata.updated_at > chunk_b.metadata.updated_at:
            return "chunk_a"
        if chunk_a.metadata.updated_at < chunk_b.metadata.updated_at:
            return "chunk_b"
        # Same date: rank authority explicitly (a raw string compare would be wrong)
        if _AUTHORITY_RANK[chunk_a.metadata.authority] > _AUTHORITY_RANK[chunk_b.metadata.authority]:
            return "chunk_a"
        return "chunk_b"

# KNOWLEDGE BASE WITH TEMPORAL AWARENESS
kb = HealthcareGuidelinesKB()
kb.add_guideline(
    content="CDC recommends masks required in public",
    source="CDC",
    date=datetime(2021, 7, 15),
    authority="high"
)

result = kb.query("Should I wear a mask?", prefer_recent=True)
# Returns most recent guidelines, flags contradictions

Full Example: patterns/indexing-at-scale/example.py


Pattern 9: Index-Aware Retrieval

Category: Adding Knowledge
Use When: Basic RAG fails due to vocabulary mismatches, fine details, or holistic answers requiring multiple concepts

Problem

Users ask questions in natural language (“How do I log in?”), but your API documentation uses technical terminology (“OAuth 2.0 authentication”, “access token”). Basic RAG fails because “log in” ≠ “authentication” ≠ “OAuth 2.0”.

Solution

Index-Aware Retrieval uses four advanced retrieval techniques:

  1. Hypothetical Document Embedding (HyDE) — Generate hypothetical answer first, then match chunks to that answer
  2. Query Expansion — Translate user terms to technical terms used in chunks
  3. Hybrid Search — Combine keyword (BM25) and semantic (embedding) search with weighted average
  4. GraphRAG — Store documents in graph database, retrieve related chunks after finding initial match

Code Snippet

# TECHNIQUE 1: HYPOTHETICAL DOCUMENT EMBEDDING (HyDE)
class HyDEGenerator:
    def retrieve_with_hyde(self, query: str, chunks: List[DocumentChunk], top_k: int = 3):
        # Step 1: Generate hypothetical answer
        hypothetical_answer = self.generate_hypothetical_answer(query)
        # "To authenticate, use OAuth 2.0 access token..."

        # Step 2: Embed hypothetical answer (not original query)
        hyde_embedding = embedding_generator.generate_embedding(hypothetical_answer)

        # Step 3: Find chunks similar to hypothetical answer
        scored_chunks = []
        for chunk in chunks:
            similarity = cosine_similarity(hyde_embedding, chunk.embedding)
            scored_chunks.append((chunk, similarity))

        return sorted(scored_chunks, key=lambda x: x[1], reverse=True)[:top_k]

# TECHNIQUE 2: QUERY EXPANSION
class QueryExpander:
    def expand_query(self, query: str) -> str:
        term_translations = {
            "log in": ["authentication", "oauth", "access token"],
            "error": ["error code", "status code", "exception"]
        }
        expanded_terms = [query]
        for user_term, tech_terms in term_translations.items():
            if user_term in query.lower():
                expanded_terms.extend(tech_terms)
        return " ".join(expanded_terms)

# TECHNIQUE 3: HYBRID SEARCH (BM25 + Semantic)
class HybridRetriever:
    def retrieve(self, query: str, top_k: int = 5):
        bm25_score = bm25_scorer.score(query, chunk)
        semantic_score = cosine_similarity(query_embedding, chunk.embedding)
        # alpha = 0.4 means 40% BM25, 60% semantic
        hybrid_score = 0.4 * bm25_score + 0.6 * semantic_score
        return sorted_chunks_by_score[:top_k]

# TECHNIQUE 4: GRAPHRAG
class GraphRAG:
    def retrieve_related(self, initial_chunk_id: str, depth: int = 1):
        related_ids = list(graph[initial_chunk_id])
        frontier = related_ids
        for _ in range(depth - 1):
            # Expand one hop: flatten the neighbors of the current frontier
            frontier = [nid for rid in frontier for nid in graph[rid]]
            related_ids.extend(frontier)
        return [chunks[cid] for cid in related_ids]

Full Example: patterns/index-aware-retrieval/example.py


Pattern 10: Node Postprocessing

Category: Adding Knowledge
Use When: Retrieved chunks have issues like ambiguous entities, conflicting content, obsolete information, or are too verbose

Problem

Your RAG system retrieves legal document chunks with issues: ambiguous entities (“Apple” could be company or fruit), conflicting interpretations of the same law, obsolete regulations superseded by new ones, and verbose chunks with only small relevant sections.

Solution

Node Postprocessing improves retrieved chunks through a pipeline:

  1. Reranking — Use more accurate models (like BGE) to rerank chunks
  2. Hybrid Search — Combine BM25 and semantic retrieval
  3. Query Expansion and Decomposition — Expand queries and break into sub-queries
  4. Filtering — Remove obsolete, conflicting, or irrelevant chunks
  5. Contextual Compression — Extract only relevant parts from verbose chunks
  6. Disambiguation — Resolve ambiguous entities and clarify context

Code Snippet

# TECHNIQUE 1: RERANKING (BGE-style Cross-Encoder)
# In production: from sentence_transformers import CrossEncoder
# model = CrossEncoder('BAAI/bge-reranker-base')

# TECHNIQUE 5: CONTEXTUAL COMPRESSION
class ContextualCompressor:
    def compress(self, chunk: DocumentChunk, query: str, max_length: int = 200):
        query_words = set(query.lower().split())
        sentences = chunk.content.split('.')
        relevant_sentences = [
            s for s in sentences
            if query_words & set(s.lower().split())
        ]
        compressed_content = '. '.join(relevant_sentences[:3]) + '.'
        return DocumentChunk(id=chunk.id + "_compressed", content=compressed_content[:max_length])

# TECHNIQUE 6: DISAMBIGUATION
class Disambiguator:
    def disambiguate(self, chunks: List[DocumentChunk], query: str):
        entity_contexts = {
            "apple": {
                "company": ["technology", "iphone", "corporate"],
                "fruit": ["nutrition", "eating", "food"]
            }
        }
        query_words = set(query.lower().split())
        for chunk in chunks:
            for entity, contexts in entity_contexts.items():
                if entity in chunk.content.lower():
                    entity_type = determine_from_context(entity, query_words, chunk.content)
                    if entity_type:
                        chunk.entities.append(f"{entity}:{entity_type}")
        return chunks

# COMPLETE POSTPROCESSING PIPELINE
def query_with_postprocessing(question: str):
    expanded = query_processor.expand_query(question)
    candidates = hybrid_retriever.retrieve(expanded, top_k=10)  # (chunk, score) pairs
    filtered = chunk_filter.filter_obsolete([c for c, _ in candidates])
    filtered = chunk_filter.filter_by_relevance(filtered, threshold=0.3)
    reranked = reranker.rerank(question, filtered, top_k=5)
    disambiguated = disambiguator.disambiguate([c for c, _ in reranked], question)
    compressed = [compressor.compress(c, question) for c in disambiguated]
    return compressed

Full Example: patterns/node-postprocessing/example.py


Pattern 11: Trustworthy Generation

Category: Adding Knowledge
Use When: RAG systems need to build user trust by preventing hallucination, providing citations, and detecting out-of-domain queries

Problem

Users lose trust because the system answers questions outside its knowledge domain, answers lack citations, and it provides confident answers when retrieval actually failed.

Solution

Trustworthy Generation builds user trust through multiple mechanisms:

  1. Out-of-Domain Detection — Detect when knowledge base doesn’t contain relevant information
  2. Embedding Distance Checking — Measure similarity between query and retrieved chunks
  3. Citations — Provide source citations for all factual claims
  4. Self-RAG Workflow — 6-step self-reflective process to verify responses
  5. Guardrails — Prevent generation of unsafe or unreliable content

Code Snippet

# OUT-OF-DOMAIN DETECTION
class OutOfDomainDetector:
    def is_out_of_domain(self, query: str, chunks: List[DocumentChunk]) -> Tuple[bool, str]:
        if chunks:
            query_embedding = embedding_generator.generate_embedding(query)
            min_distance = min([
                1 - cosine_similarity(query_embedding, chunk.embedding)
                for chunk in chunks
            ])
            if min_distance > threshold:
                return True, "Query too far from knowledge base"
        if not has_domain_keywords(query):
            return True, "Query lacks domain-specific terminology"
        if not chunks:
            return True, "No relevant chunks found"
        return False, ""

# SELF-RAG WORKFLOW (6 Steps)
class SelfRAGProcessor:
    def process(self, query: str, retrieved_chunks: List[DocumentChunk]):
        # STEP 1: Generate initial response
        initial_response = generate_initial_response(query, retrieved_chunks)
        # STEP 2: Chunk the response
        response_chunks = chunk_response(initial_response)
        # STEP 3: Check whether chunk needs citation
        for chunk in response_chunks:
            chunk.needs_citation = needs_citation(chunk.text)
        # STEP 4: Lookup sources
        for chunk in response_chunks:
            if chunk.needs_citation:
                chunk.sources = lookup_sources(chunk.text, retrieved_chunks)
        # STEP 5: Incorporate citations
        final_response = incorporate_citations(response_chunks)
        # STEP 6: Add warnings
        warnings = generate_warnings(response_chunks)
        return {"response": final_response, "warnings": warnings}

# COMPLETE TRUSTWORTHY GENERATION PIPELINE
def query_with_trustworthiness(question: str):
    is_ood, reason = out_of_domain_detector.is_out_of_domain(question, chunks)
    if is_ood:
        return {"response": f"Cannot answer: {reason}", "out_of_domain": True}
    result = self_rag.process(question, retrieved_chunks)
    passed, reason = guardrails.check(question, result, retrieved_chunks)
    if not passed:
        result["response"] = f"Cannot provide reliable answer: {reason}"
    return result

Full Example: patterns/trustworthy-generation/example.py


Pattern 12: Deep Search

Category: Adding Knowledge
Use When: Complex information needs require iterative retrieval, multi-hop reasoning, or comprehensive research across multiple sources

Problem

Investment analysts need comprehensive research on companies/industries. Basic RAG retrieves a few chunks and provides incomplete answers. They need a system that iteratively explores multiple sources, identifies gaps, and follows up on missing information.

Solution

Deep Search uses an iterative loop that retrieves and thinks until a good enough answer is found or a time/cost budget is exhausted:

Code Snippet

class DeepSearchOrchestrator:
    def __init__(self, budget: Budget):
        self.retriever = MultiSourceRetriever()  # Web, APIs, knowledge bases
        self.reasoner = LLMReasoner()
        self.budget = budget  # Time/cost constraints

    def search(self, query: str, depth: int = 2) -> DeepSearchResult:
        root_section = self._create_section(query)
        sections = [root_section]
        sections_to_expand = [root_section]
        current_depth = 0

        while current_depth < depth:
            current_depth += 1
            exhausted, reason = self.budget.is_exhausted()
            if exhausted:
                break
            next_sections = []
            for section in sections_to_expand:
                gaps = self.reasoner.identify_gaps(query, section.answer, section.sources)
                follow_ups = self.reasoner.generate_follow_ups(query, gaps)
                for follow_up in follow_ups:
                    subsection = self._create_section(follow_up)
                    section.subsections.append(subsection)
                    sections.append(subsection)
                    next_sections.append(subsection)
            sections_to_expand = next_sections
            is_good_enough, quality = self.reasoner.assess_answer_quality(
                query, root_section.answer, sections
            )
            if is_good_enough:
                break

        final_answer = self.reasoner.final_synthesis(query, sections)
        return DeepSearchResult(query, final_answer, sections, self.all_sources)

@dataclass
class Budget:
    max_iterations: int = 5
    max_time_seconds: float = 60.0
    max_cost_dollars: float = 1.0
    # Running totals checked by is_exhausted()
    iterations_used: int = 0
    time_used: float = 0.0
    cost_used: float = 0.0

    def is_exhausted(self) -> Tuple[bool, str]:
        if self.iterations_used >= self.max_iterations:
            return True, "max_iterations"
        if self.time_used >= self.max_time_seconds:
            return True, "max_time"
        if self.cost_used >= self.max_cost_dollars:
            return True, "max_cost"
        return False, ""

# USAGE
analyst = MarketResearchAnalyst()
result = analyst.research(
    query="What factors should I consider when evaluating TechCorp as an investment?",
    max_iterations=10,
    max_time_seconds=30.0
)

Full Example: patterns/deep-search/example.py


Category 3: LLM Reasoning

Patterns 13–16 address reasoning and task specialization: how to get step-by-step or multi-path reasoning from LLMs.


Pattern 13: Chain of Thought (CoT)

Category: LLM Reasoning
Use When: Problems require multistep reasoning, logical deduction, or an auditable reasoning trace

Problem

Foundational models suffer from critical limitations on math, logical deduction, and sequential reasoning:

  • Zero-shot often fails when the problem requires multistep reasoning
  • Black-box answers with no insight into how the conclusion was reached
  • Misinterpretation of rules

Solution

Chain of Thought (CoT) prompts request a step-by-step reasoning process before the final answer. Three variants:

  1. Zero-shot CoT — Append “Think step by step” (no examples)
  2. Few-shot CoT — Provide examples (question → step-by-step reasoning → answer). RAG gives fish; few-shot CoT shows how to fish.
  3. Auto CoT — Sample questions → generate reasoning for each with zero-shot CoT → use as few-shot examples for the actual query

Code Snippet

# VARIANT 1: ZERO-SHOT COT
ZERO_SHOT_COT_SUFFIX = "\n\nThink step by step. Show your reasoning and then state the final conclusion."

def zero_shot_cot(policy: str, case_description: str, question: str, llm=None) -> CoTResult:
    prompt = f"{policy}\n\nCase: {case_description}\n\nQuestion: {question}{ZERO_SHOT_COT_SUFFIX}"
    full_response = llm(prompt)
    return CoTResult(question=question, reasoning=..., conclusion=..., variant="zero_shot")

# VARIANT 2: FEW-SHOT COT — "show how to fish"
FEW_SHOT_EXAMPLES = """
Example 1:
Q: Customer purchased 10 days ago, unopened, has receipt. Eligible for full refund?
A: Step 1: Within 30 days? Yes. Step 2: Unopened? Yes. Step 3: Receipt? Yes.
   Conclusion: Yes, full refund.
"""
def few_shot_cot(policy: str, case_description: str, question: str, llm=None) -> CoTResult:
    prompt = f"{policy}\n\n{FEW_SHOT_EXAMPLES}\n\nNew question:\nQ: {question}\n\nCase: {case_description}\n\nA:"
    return ...

# VARIANT 3: AUTO COT — build few-shot automatically
def auto_cot(policy: str, case_description: str, question: str, num_demos: int = 2, llm=None) -> CoTResult:
    demos = []
    for sample_q in question_pool[:num_demos]:
        response = llm(f"{policy}\n\nQuestion: {sample_q}\n\nThink step by step.")
        demos.append(f"Q: {sample_q}\nA:\n{response}\n")
    prompt = f"{policy}\n\n" + "\n".join(demos) + f"\n\nNew question:\nQ: {question}\n\nCase: {case_description}\n\nA:"
    return ...

# REFUND ELIGIBILITY ADVISOR
advisor = RefundEligibilityAdvisor(policy=REFUND_POLICY)
result = advisor.check_eligibility(case, variant="few_shot")  # zero_shot | few_shot | auto_cot
# result.reasoning, result.conclusion

Full Example: patterns/chain-of-thought/example.py


Pattern 14: Tree of Thoughts (ToT)

Category: LLM Reasoning
Use When: Strategic tasks with multiple plausible paths; single linear CoT is insufficient

Problem

Many tasks that demand strategic thinking cannot be solved by a single multistep reasoning path:

  • Single-path limitation — CoT follows one sequence; if that path is wrong, the answer suffers
  • Branching decisions — Multiple plausible next steps
  • Need for exploration — Best solution often requires exploring several directions

Solution

Tree of Thoughts treats problem-solving as tree search with four components:

  1. Thought generation — From current state, generate N possible next steps
  2. Path evaluation — Score each partial solution (0–100) for promise
  3. Beam search (top K) — Keep only the top K states; prune the rest
  4. Summary generation — Produce a concise summary and answer from the best path

Code Snippet

class TreeOfThoughts:
    def generate_thoughts(self, state: str, step: int, problem: str) -> List[str]:
        """Generate N possible next thoughts from current state."""
        return thoughts

    def evaluate_state(self, state: str, problem: str) -> float:
        """Score path promise (0-1). Correctness, progress, potential."""
        return score

    def solve(self, problem: str) -> ToTResult:
        beam = [(0.5, initial_state, [], 0)]
        for step in range(1, self.max_steps + 1):
            candidates = []
            for score, state, path, _ in beam:
                thoughts = self.generate_thoughts(state, step, problem)
                for thought in thoughts:
                    new_state = state + f"\nStep {step}: " + thought
                    new_score = self.evaluate_state(new_state, problem)
                    candidates.append((new_score, new_state, path + [thought], step))
            beam = sorted(candidates, key=lambda x: -x[0])[:self.beam_width]
        _, best_state, best_path, _ = beam[0]
        summary = self.generate_summary(problem, best_state)
        return ToTResult(..., solution_summary=summary, reasoning_path=best_path)

# INCIDENT ROOT-CAUSE ANALYZER
analyzer = IncidentRootCauseAnalyzer()
result = analyzer.analyze("API latency spiked; DB, cache, dependencies in use.")
# result.solution_summary, result.reasoning_path

Full Example: patterns/tree-of-thoughts/example.py


Pattern 15: Adapter Tuning

Category: LLM Reasoning
Use When: You need a foundation model to perform a specialized task with a small dataset and want to keep base weights frozen while training only a small adapter (e.g., LoRA)

Problem

Incoming tickets must be routed to billing, technical, sales, or general. Prompt-only classification can be brittle, and fully fine-tuning a model for such a narrow task is expensive when you only have a few hundred labeled tickets.

Solution

Adapter tuning (PEFT) has three key aspects:

  1. Teaches the foundation model a specialized task — Train on input-output pairs
  2. Foundation weights frozen; only a small adapter is updated — LoRA or adapter layers are trained
  3. Training dataset can be smaller — Often a few hundred to a few thousand high-quality pairs suffice

Code Snippet

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class TicketIntentRouter:
    def __init__(self):
        self._pipeline = Pipeline([
            ("foundation", TfidfVectorizer(max_features=2000)),  # frozen after fit
            ("adapter", LogisticRegression(max_iter=500)),       # only this is "trained"
        ])

    def train(self, examples: List[TicketExample]) -> None:
        texts = [ex.text for ex in examples]
        labels = [ex.intent for ex in examples]
        self._pipeline.fit(texts, labels)

    def predict(self, text: str) -> AdapterTuningResult:
        pred = self._pipeline.predict([text])[0]
        probs = self._pipeline.predict_proba([text])[0]
        return AdapterTuningResult(intent=pred, confidence=float(probs.max()))

router = TicketIntentRouter()
router.train(train_examples)  # 200–2000 (text, intent) pairs
result = router.predict("I was charged twice, please refund.")
# result.intent -> "billing"

Full Example: patterns/adapter-tuning/example.py


Pattern 16: Evol-Instruct

Category: LLM Reasoning
Use When: You need to teach a pretrained model new, complex tasks from private data by evolving simple instructions into harder ones, generating answers, and instruction tuning (SFT/LoRA)

Problem

A company wants a model that answers complex policy questions from internal docs while keeping the data private. Manually creating thousands of hard (question, answer) pairs is expensive.

Solution

Evol-Instruct in four steps:

  1. Evolve instructions — From seed questions, create harder variants: deeper (constraints, hypotheticals), more concrete (“list 3 reasons”), multi-step (combine two questions)
  2. Generate answers — For each instruction, produce a high-quality answer (LLM with access to your private context)
  3. Evaluate and filter — Score each (instruction, answer) 1–5; keep only examples above a threshold
  4. Instruction tuning — SFT on an open-weight model (Llama, Gemma) using the filtered dataset; PEFT/LoRA for efficient training

Code Snippet

# STEP 1: Evolve instructions
def evolve_instructions(seeds: List[str]) -> List[str]:
    evolved = list(seeds)
    for q in seeds:
        # Deeper: add constraints/hypotheticals
        evolved.append(f"{q} Assume the employee is remote and outside the US.")
        # Concrete: ask for enumerated specifics
        evolved.append(f"List 3 specific reasons: {q}")
    # Multi-step: combine consecutive questions
    evolved += [f"{a} Then explain how that interacts with: {b}"
                for a, b in zip(seeds, seeds[1:])]
    return evolved

# STEP 2: Generate answers (LLM + policy context)
qa_pairs = generate_answers(all_instructions)

# STEP 3: Score and filter (LLM or model; 1-5)
scored = [score_instruction_answer(ia) for ia in qa_pairs]
filtered = [ex for ex in scored if ex.score >= 4]

# STEP 4: SFT-ready dataset (chat format) -> then HuggingFace SFT/LoRA
sft_dataset = [{"messages": [{"role": "user", "content": ex.instruction},
                             {"role": "assistant", "content": ex.answer}]}
               for ex in filtered]
# Train with transformers + peft + trl SFTTrainer

Full Example: patterns/evol-instruct/example.py


Category 4: Reliability & Evaluation

Patterns 17–20 focus on evaluation, safety, and reliability: using LLMs to judge quality, and guard against harmful or off-policy outputs.


Pattern 17: LLM as Judge

Category: Reliability
Use When: You need nuanced evaluation of model or human outputs with scores and justifications to drive feedback loops, filtering, or training

Problem

Teams must evaluate thousands of support replies for helpfulness, tone, accuracy, clarity, and completeness. Human review does not scale; simple metrics (length, keyword match) miss nuance.

Solution

LLM as Judge uses an LLM to score and justify outputs against a scoring rubric. Three options:

  1. Prompting — Criteria and instructions in the prompt; LLM returns score (1–5) per criterion and brief justification. Temperature=0 for consistency.
  2. ML — Create rubric, collect historical (item, scores) data, train a classification model to replicate the rubric at scale.
  3. Fine-tuning — Fine-tune a model as a dedicated judge on your rubric and labeled data.

Code Snippet

SUPPORT_REPLY_CRITERIA = """
- Helpfulness: Addresses the customer's question; actionable next steps.
- Tone: Professional, empathetic.
- Accuracy: Factually correct.
- Clarity: Easy to read; no unnecessary jargon.
- Completeness: Covers the main ask.
"""

def build_judge_prompt(item: str, criteria: str) -> str:
    return f"""You are evaluating a customer support reply. Score 1-5 per criterion with brief justification.
Criteria: {criteria}
Reply: --- {item} ---
Scores:"""

# Invoke judge with temperature=0 for consistency
raw = run_judge(build_judge_prompt(reply, SUPPORT_REPLY_CRITERIA))
result = parse_judge_response(raw, reply)
# result.scores -> [CriterionScore(criterion="Helpfulness", score=4, justification="..."), ...]

Full Example: patterns/llm-as-judge/example.py


Pattern 18: Reflection

Category: Reliability
Use When: You invoke the LLM via a stateless API and want it to correct or improve its first response without the user sending a follow-up.

Problem

The API must return a short apology email for a delayed shipment. A single LLM call may omit an order reference, sound generic, or lack a clear next step.

Solution

Reflection does not return the first response to the client:

  1. First call → initial response
  2. Evaluate — send the initial response to an evaluator; get feedback
  3. Modified prompt — original request + initial response + feedback
  4. Second call → revised response; return the revised response

Code Snippet

def run_reflection(user_prompt: str) -> ReflectionResult:
    initial_response = generate_initial(user_prompt)       # First call
    feedback, notes = evaluate(user_prompt, initial_response)  # Evaluator
    modified_prompt = (
        f"Original request:\n{user_prompt}\n\n"
        f"Your previous response:\n---\n{initial_response}\n---\n\n"
        f"Feedback to apply:\n{feedback}\n\nProduce an improved version."
    )
    revised_response = generate_revised(modified_prompt)  # Second call
    return ReflectionResult(initial_response, feedback, revised_response)
# Return revised_response to client; initial_response is not sent.

Full Example: patterns/reflection/example.py


Pattern 19: Dependency Injection

Category: Reliability
Use When: Developing and testing GenAI apps is hard because LLM output is nondeterministic and models change quickly; you need code to be LLM-agnostic, so inject LLM and tool calls as dependencies

Problem

Developing and testing is hard: LLM output is nondeterministic, APIs change, and you want CI and local dev without API keys.

Solution

Dependency Injection: Pass LLM and tool calls into the pipeline as dependencies. Production uses real implementations; tests and dev use mocks that return hardcoded, deterministic results.

Code Snippet

# Pipeline accepts dependencies; no direct LLM calls inside
from typing import Callable

def run_ticket_pipeline(
    ticket_text: str,
    summarize_fn: Callable[[str], str],
    suggest_action_fn: Callable[[str, str], str],
) -> TicketResult:
    summary = summarize_fn(ticket_text)
    suggested_action = suggest_action_fn(ticket_text, summary)
    return TicketResult(summary=summary, suggested_action=suggested_action)

# Production: real implementations
result = run_ticket_pipeline(ticket, real_summarize, real_suggest_action)

# Tests: mocks (hardcoded, deterministic)
result = run_ticket_pipeline(ticket, mock_summarize, mock_suggest_action)
assert result.summary == "Customer reports an issue..."

Full Example: patterns/dependency-injection/example.py


Pattern 20: Prompt Optimization

Category: Reliability
Use When: You want better results from prompt engineering, but changing the foundation model would force repeating all manual trials; use a repeatable optimization loop over a pipeline instead

Solution

Prompt optimization as four components — (1) Pipeline of steps that use the prompt (prompt is a parameter), (2) Dataset to evaluate on, (3) Evaluator that scores each output, (4) Optimizer that proposes candidates and picks the best by score.

Code Snippet

def run_pipeline(prompt_template: str, ticket: str) -> str:
    return generate_fn(prompt_template, ticket)

dataset = get_dataset()

def evaluate_summary(summary: str, ticket: str) -> float:
    return 0.0  # ... length, key-info, or LLM-as-Judge

best_prompt, best_score = optimize_prompt(
    candidate_prompts=["Summarize in one sentence.", "Write a one-line summary.", ...],
    dataset=dataset,
    run_fn=lambda p, t: run_pipeline(p, t),
    eval_fn=evaluate_summary,
)
# When model changes: re-run optimize_prompt with same dataset/evaluator
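
For reference, here is a minimal sketch of what optimize_prompt can look like; the exhaustive scoring loop is my assumption, and the repo may search more cleverly:

def optimize_prompt(candidate_prompts, dataset, run_fn, eval_fn):
    """Score every candidate prompt on the dataset; keep the best."""
    best_prompt, best_score = None, float("-inf")
    for prompt in candidate_prompts:
        score = sum(eval_fn(run_fn(prompt, t), t) for t in dataset) / len(dataset)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score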

Full Example: patterns/prompt-optimization/example.py


Category 5: Tools, Agents & Efficiency

Patterns 21–32 extend LLMs with tool calling, code execution, multi-agent collaboration, and production efficiency techniques.


Pattern 21: Tool Calling

Category: Tools & Agents
Use When: You need the model to act: calling APIs, looking up live order status, or searching internal systems

Solution

Bind tools to the model; run a LangGraph with an assistant node (LLM) and ToolNode (executes tools). Conditional routing: if the last message has tool_calls, run tools and loop back.

Code Snippet

from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.graph import END, MessagesState, StateGraph
from langgraph.prebuilt import ToolNode

@tool
def lookup_order_status(order_id: str) -> str:
    """Look up order in OMS."""
    return '{"status":"shipped",...}'

llm = ChatOllama(model="llama3.2").bind_tools([lookup_order_status])

def call_model(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

def route_tools_or_end(state: MessagesState):
    # If the last message requested tools, run them; otherwise finish
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END

workflow = StateGraph(MessagesState)
workflow.add_node("assistant", call_model)
workflow.add_node("tools", ToolNode([lookup_order_status]))
workflow.set_entry_point("assistant")
workflow.add_conditional_edges("assistant", route_tools_or_end)
workflow.add_edge("tools", "assistant")
app = workflow.compile()

Full Example: patterns/tool-calling/example.py
Dependencies: pip install -r patterns/tool-calling/requirements.txt; Ollama with a tool-capable model (e.g. llama3.2)


Pattern 22: Code Execution

Category: Tools & Agents
Use When: The task needs an artifact (diagram, plot, query); the model should emit a DSL and a sandbox runs it

Solution

Code execution: Prompt the model for DSL (low temperature). A sandbox writes temp files, runs dot, python (restricted), or a DB driver with timeouts and allowlists. LangGraph can wire generate_dsl → execute_sandbox as a linear graph.

Code Snippet

from langgraph.graph import END, StateGraph

workflow = StateGraph(CodeExecutionState)
workflow.add_node("generate_dsl", node_generate_dsl)
workflow.add_node("sandbox", execute_in_sandbox)
workflow.set_entry_point("generate_dsl")
workflow.add_edge("generate_dsl", "sandbox")
workflow.add_edge("sandbox", END)
app = workflow.compile()
final = app.invoke({"user_request": "..."})
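
To make the sandbox step concrete, here is a hypothetical helper that renders model-emitted DOT under a timeout with a fixed argv and no shell; run_dot_sandboxed and its defaults are illustrative, not the repo's API:

import os
import subprocess
import tempfile

def run_dot_sandboxed(dsl: str, timeout_s: float = 5.0) -> bytes:
    """Render model-generated DOT to PNG with a timeout and fixed argv."""
    with tempfile.NamedTemporaryFile("w", suffix=".dot", delete=False) as f:
        f.write(dsl)
        path = f.name
    try:
        # No shell; only the allowlisted binary and flags are invoked
        proc = subprocess.run(
            ["dot", "-Tpng", path],
            capture_output=True, timeout=timeout_s, check=True,
        )
        return proc.stdout  # PNG bytes
    finally:
        os.unlink(path)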

Full Example: patterns/code-execution/example.py
Dependencies: pip install -r patterns/code-execution/requirements.txt; optional Graphviz (brew install graphviz)


Pattern 23: Multi-Agent Collaboration

Category: Tools & Agents
Use When: Work is multistep, multi-domain, and long-running; a single agent hits cognitive, tool, and tuning limits

Solution

Multi-agent collaboration: Define agents with narrow mandates and clear handoffs. Patterns include hierarchical (planner delegates), prompt chaining (sequential pipelines), peer-to-peer / blackboard (shared store), and parallel execution.

Code Snippet

from langgraph.graph import END, StateGraph

g = StateGraph(MultiAgentState)
g.add_node("plan", node_plan)
g.add_node("technical", node_technical)
g.add_node("compliance", node_compliance)
g.add_node("merge", node_merge)
g.add_node("critic", node_critic)
g.add_node("finalize", node_finalize)
g.set_entry_point("plan")
# Assumed topology: plan fans out to both specialists; merge joins them
g.add_edge("plan", "technical")
g.add_edge("plan", "compliance")
g.add_edge("technical", "merge")
g.add_edge("compliance", "merge")
g.add_edge("merge", "critic")
g.add_edge("critic", "finalize")
g.add_edge("finalize", END)
app = g.compile()
result = app.invoke({"user_request": "..."})

Full Example: patterns/multiagent-collaboration/example.py


Pattern 24: Small Language Model

Category: Efficiency & Deployment
Use When: Frontier models are too large or too expensive to self-host; you want smaller models, distillation, quantization, or faster decoding

Solution

  1. Knowledge distillation — Train a student on teacher soft targets; KL divergence aligns token distributions
  2. Quantization — 4-bit / 8-bit weights (BitsAndBytesConfig, NF4) shrink footprint
  3. Speculative decoding — A small draft model proposes tokens; a large target verifies in parallel

Code Snippet

# Distillation: minimize KL(student || teacher) on teacher softmax + CE on labels
# Quantization: BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", ...)
# Speculative decoding: vLLM speculative_config={"model": draft_id, "num_speculative_tokens": k}
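
As a concrete example of quantization, a 4-bit NF4 load with transformers + bitsandbytes; the model id is illustrative:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Weights are stored in 4-bit NF4; compute happens in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", quantization_config=bnb_config
)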

Full Example: patterns/small-language-model/example.py


Pattern 25: Prompt Caching

Category: Efficiency & Deployment
Use When: The same or similar prompts hit your LLM repeatedly and you want lower latency, lower cost, and less load

Solution

  1. Client-side exact cache — Hash (model, params, messages) ? store response
  2. Framework caches — LangChain InMemoryCache / SQLiteCache via set_llm_cache
  3. Semantic cache — Embeddings to match paraphrases; return cached answer if similarity ≥ threshold
  4. Server-side prompt caching — Anthropic / OpenAI may cache eligible long prompts inside the API

Code Snippet

# Exact: sha256(f"{model}\n{prompt}") -> response
# Semantic: cosine(embed(query), embed(cached_prompt)) >= threshold
# LangChain: set_llm_cache(SQLiteCache(database_path="..."))
# Provider: Anthropic cache_control / OpenAI automatic prefix caching (see docs)
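
A minimal client-side exact-cache sketch, where generate_fn stands in for your LLM call:

import hashlib

_cache: dict = {}

def cached_generate(model: str, prompt: str, generate_fn) -> str:
    # Key on everything that affects the output: model + full prompt
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate_fn(prompt)  # cache miss: call the LLM once
    return _cache[key]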

Full Example: patterns/prompt-caching/example.py


Pattern 26: Inference Optimization

Category: Efficiency & Deployment
Use When: You self-host LLMs and must maximize throughput, cut latency, and control KV-cache memory

Solution

  1. Continuous batching (dynamic batching) — Requests enter and leave at fine granularity; vLLM (PagedAttention) and SGLang reduce padding waste
  2. Speculative decoding — Draft + target models (see Pattern 24)
  3. Prompt compression — Remove redundancy in system + RAG context to shrink KV footprint

Code Snippet

# Continuous batching: use vLLM / SGLang / TensorRT-LLM — not hand-rolled pad batches
# Speculative decoding: vLLM speculative_config={...} (see Pattern 24)
# Prompt compression: dedupe, summarize, or learned compressors before .generate()
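
A naive prompt-compression sketch for step 3 that deduplicates repeated lines across retrieved chunks; production systems use summarization or learned compressors:

def compress_context(chunks) -> str:
    """Drop lines that repeat verbatim across retrieved chunks."""
    seen, kept = set(), []
    for chunk in chunks:
        for line in chunk.splitlines():
            norm = line.strip().lower()
            if norm and norm not in seen:
                seen.add(norm)
                kept.append(line)
    return "\n".join(kept)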

Full Example: patterns/inference-optimization/example.py


Pattern 27: Degradation Testing

Category: Efficiency & Deployment
Use When: You need load testing that matches LLM inference behavior with TTFT, end-to-end latency, token throughput, and RPS under rising concurrency

Key Metrics

  • TTFT — Time from request to first token (streaming)
  • EERL — End-to-end request latency (wall time to last token)
  • Output tokens / second — Generation throughput
  • Requests / second — Completed requests per second at a given concurrency

Code Snippet

# Per request: ttft_s, eerl_s, output_tokens -> tok/s = tokens / (eerl_s - ttft_s)
# Aggregate: p95_ttft, p95_eerl, mean tok/s, rps = n / wall_time
# Tools: LLMPerf, LangSmith traces
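
The metric math in runnable form, assuming you record TTFT, end-to-end latency, and output token counts per request:

import statistics
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_s: float          # time to first token
    eerl_s: float          # end-to-end request latency
    output_tokens: int

    @property
    def tokens_per_second(self) -> float:
        return self.output_tokens / max(self.eerl_s - self.ttft_s, 1e-9)

def aggregate(runs, wall_time_s: float) -> dict:
    p95 = lambda xs: sorted(xs)[int(0.95 * (len(xs) - 1))]
    return {
        "p95_ttft_s": p95([m.ttft_s for m in runs]),
        "p95_eerl_s": p95([m.eerl_s for m in runs]),
        "mean_tok_per_s": statistics.mean(m.tokens_per_second for m in runs),
        "rps": len(runs) / wall_time_s,
    }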

Full Example: patterns/degradation-testing/example.py


Pattern 28: Long-Term Memory

Category: Memory & Agents
Use When: LLM calls are stateless; you need continuity across sessions with working, episodic, procedural, and semantic memory

Solution

  1. Working memory — Recent turns / scratch context (sliding window)
  2. Episodic memory — Dated interactions (“what we did”)
  3. Procedural memory — Playbooks and tool recipes
  4. Semantic memory — Stable facts; typically embedding search (Mem0, custom RAG-over-memories)

Code Snippet

# Mem0: memory.add(messages, user_id=...); memory.search(query, user_id=...)
# Four layers: working (deque), episodic (log), procedural (playbooks), semantic (vector / KV)
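
A skeleton of the four layers; the class and method names are illustrative, not Mem0's API:

from collections import deque

class AgentMemory:
    def __init__(self, window: int = 10):
        self.working = deque(maxlen=window)   # sliding window of recent turns
        self.episodic = []                    # dated interaction log
        self.procedural = {}                  # name -> playbook text
        self.semantic = {}                    # fact key -> value (or a vector store)

    def remember_turn(self, role: str, text: str) -> None:
        self.working.append((role, text))

    def log_episode(self, date: str, summary: str) -> None:
        self.episodic.append({"date": date, "summary": summary})

    def recall_fact(self, key: str):
        return self.semantic.get(key)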

Full Example: patterns/long-term-memory/example.py
Dependencies: pip install mem0ai openai chromadb for Mem0-aligned version


Pattern 29: Template Generation

Category: Setting Safeguard
Use When: You need repeatable, reviewable customer-facing text; full free-form generation is too variable or mixes facts with creativity unsafely

Solution

  1. Prompt the model to output a template only, with explicit placeholders ([CUSTOMER_NAME], [ORDER_ID], …)
  2. Human / comms reviews the template (per locale/product), not every send
  3. Fill slots in code; optional second LLM pass only for lint or translation
  4. Few-shot examples in the prompt show approved shapes so new templates stay grounded

Code Snippet

# Low temp + few-shot -> template with [SLOT_NAME]
# validate required [ORDER_ID], [CUSTOMER_NAME] present
# fill_template(template, slots_from_crm)
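
A sketch of steps 2-3: validate that required placeholders survived generation, then fill slots in code; the helper names are illustrative:

import re

REQUIRED_SLOTS = {"[CUSTOMER_NAME]", "[ORDER_ID]"}

def validate_template(template: str) -> None:
    missing = {s for s in REQUIRED_SLOTS if s not in template}
    if missing:
        raise ValueError(f"Template missing required slots: {missing}")

def fill_template(template: str, slots: dict) -> str:
    out = template
    for name, value in slots.items():
        out = out.replace(f"[{name}]", value)
    leftover = re.findall(r"\[[A-Z_]+\]", out)
    if leftover:
        raise ValueError(f"Unfilled slots: {leftover}")
    return out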

Full Example: patterns/template-generation/example.py


Pattern 30: Assembled Reformat

Category: Setting Safeguard
Use When: A full LLM-generated page can hallucinate high-risk attributes (battery chemistry, hazmat, allergens, medical claims)

Solution

  1. Risk registry — chemistry, Wh, hazmat, etc. from structured sources only
  2. Assemble deterministic blocks (specs, shipping, legal)
  3. Optional LLM for tone/SEO only, conditioned on the assembled facts
  4. Validate — banned claims, chemistry contradictions

Code Snippet

# facts = load_pim(sku)  # BatteryChemistry.NIMH, ...
# page = render_compliance_block(facts)  # deterministic
# fluff = llm_marketing(facts)  # constrained; validate_high_risk(page + fluff, facts)
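
A sketch of the deterministic assembly step; the field names are illustrative stand-ins for your PIM schema:

def render_compliance_block(facts: dict) -> str:
    # No LLM in this path: every high-risk attribute comes from the PIM record
    return "\n".join([
        f"Battery chemistry: {facts['chemistry']}",
        f"Capacity: {facts['capacity_wh']} Wh",
        f"Hazmat class: {facts.get('hazmat', 'none')}",
    ])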

Full Example: patterns/assembled-reformat/example.py


Pattern 31: Self-Check

Category: Setting Safeguard
Use When: You can obtain per-token logprobs from inference and want a statistical signal to flag uncertain or fragile generations for review

Solution

  1. Logits → softmax → probabilities (p_i)
  2. Logprob (log p) for the sampled token (APIs often return this directly)
  3. Flag tokens with low (p) or small margin to the second-best token
  4. Perplexity on a sequence: PPL = exp(-mean(logprobs))

Code Snippet

# p_i = exp(logprob_i); flag if p_i < threshold
# PPL = exp(-mean(logprobs))  # natural-log probs per token
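
The same math in runnable form, assuming the API returns natural-log probabilities for each sampled token:

import math

def flag_uncertain_tokens(logprobs, p_threshold: float = 0.3):
    """Indices of sampled tokens whose probability falls below the threshold."""
    return [i for i, lp in enumerate(logprobs) if math.exp(lp) < p_threshold]

def perplexity(logprobs) -> float:
    """PPL = exp(-mean(logprobs)); higher means the model was less sure."""
    return math.exp(-sum(logprobs) / len(logprobs))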

Full Example: patterns/self-check/example.py


Pattern 32: Guardrails

Category: Setting Safeguard
Use When: You must enforce security, privacy, moderation, and alignment around LLM and RAG systems

Solution

  1. Prebuilt — Gemini safety settings; OpenAI Moderation API; hosted provider filters
  2. Custom — PII redaction, banned topics, allowlists, regex injection detectors, LLM-as-Judge (Pattern 17)
  3. Compose — an apply_guardrails(text, scanners) pipeline; scan the query, then the answer

Code Snippet

# apply_guardrails(user_query, [pii_redact, banned_topic])
# answer = engine.query(sanitized); apply_guardrails(answer, [pii_redact, moderation])
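
A sketch of the compose option; the scanners are toys (the SSN regex is not production-grade) but show the rewrite-or-raise contract:

import re

def pii_redact(text: str) -> str:
    # Toy pattern: redact US-style SSNs
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED_SSN]", text)

def banned_topic(text: str) -> str:
    if "medical diagnosis" in text.lower():
        raise ValueError("blocked: banned topic")
    return text

def apply_guardrails(text: str, scanners) -> str:
    # Each scanner either rewrites the text or raises to block it
    for scan in scanners:
        text = scan(text)
    return text

# sanitized = apply_guardrails(user_query, [pii_redact, banned_topic])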

Full Example: patterns/guardrails/example.py


Category 6: Agentic Behavior Patterns

Patterns 33–50 align with Agentic Design Patterns (Antonio Gulli): specialized agent roles, orchestration, and production agentic systems.


Pattern 33: Prompt Chaining

Category: Agentic Orchestration
Use When: A task benefits from sequential decomposition where each LLM call has one job and its structured output feeds the next step

Solution

Code Snippet

# state = classify(q); state = decompose(state); state = answer(state); state = format(state)
# Or LangGraph: add_node per step, linear edges
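
A minimal chain sketch where llm is any prompt-in, text-out callable; the step prompts are illustrative:

def run_chain(question: str, llm) -> str:
    # Each call has one job; its output feeds the next step
    category = llm(f"Classify this question in one word: {question}")
    steps = llm(f"Break this {category} question into numbered steps: {question}")
    draft = llm(f"Answer the question using these steps:\n{steps}\n\nQ: {question}")
    return llm(f"Reformat as a short, customer-friendly reply:\n{draft}")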

Full Example: patterns/prompt-chaining/example.py


Pattern 34: Routing

Category: Agentic Orchestration
Use When: You must classify or direct each request to the right handler

Solution

  1. Rule-based routing for deterministic paths
  2. Embedding similarity to handler descriptions or labeled exemplars
  3. LLM routing with JSON schema: route, confidence, optional rationale
  4. ML classifier on features for scale and SLOs

Code Snippet

# RunnableBranch (langchain_core): (predicate, runnable), ..., default
# Or: rules_first = route_rules(text); if conf < 0.9: route_llm(text)
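
A rules-first router sketch; llm_route_json is a hypothetical helper returning a dict with "route" and "confidence":

RULES = {"refund": "billing", "invoice": "billing", "password": "technical"}

def route(text: str, llm_route_json) -> str:
    # Deterministic rules first; fall back to the LLM router on no match
    lowered = text.lower()
    for keyword, handler in RULES.items():
        if keyword in lowered:
            return handler
    decision = llm_route_json(text)  # hypothetical: {"route": ..., "confidence": ...}
    return decision["route"] if decision["confidence"] >= 0.7 else "general"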

Full Example: patterns/routing/example.py


Pattern 35: Parallelization

Category: Agentic Orchestration
Use When: Independent subtasks can run together, such as research fan-out, analytics partitions, or parallel validators

Solution

Code Snippet

# LCEL: RunnableParallel(gather=..., analyze=..., verify=...) | RunnableLambda(merge)
# stdlib: ThreadPoolExecutor; submit each branch; as_completed -> dict
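
A stdlib-only sketch of the fan-out/merge flow; the three branch functions are stand-ins for real LLM or tool calls:

from concurrent.futures import ThreadPoolExecutor, as_completed

def gather(q): return f"sources for {q}"
def analyze(q): return f"analysis of {q}"
def verify(q): return f"checks for {q}"

def fan_out(question: str) -> dict:
    branches = {"gather": gather, "analyze": analyze, "verify": verify}
    results = {}
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Submit every independent branch, then merge as each completes.
        futures = {pool.submit(fn, question): name
                   for name, fn in branches.items()}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

print(fan_out("Is the vendor SOC 2 compliant?"))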

Full Example: patterns/parallelization/example.py


Pattern 36: Learning and Adaptation

Category: Agentic Learning
Use When: Systems must improve from experience, such as via RL (PPO with a clipped surrogate for stability) or preference alignment (DPO, without a separate reward model)

Solution

  1. RL agents — collect trajectories → advantage estimates → PPO-style clipped ratio to limit destructive updates
  2. LLM alignment — RLHF path (reward model + PPO) vs DPO (direct policy update from chosen/rejected completions)
  3. Online / memory — replay, regularization, retrieval over past successes

Code Snippet

# PPO: clip ratio r to [1-eps, 1+eps]; surrogate min(r*A, clip(r)*A)
# DPO: preference loss on log pi(y_w) - log pi(y_l) vs reference (see TRL / papers)
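
A minimal numeric sketch of PPO's clipped surrogate for a single (ratio, advantage) pair; eps = 0.2 is the commonly used default:

def ppo_clipped_surrogate(ratio: float, advantage: float,
                          eps: float = 0.2) -> float:
    """min(r*A, clip(r, 1-eps, 1+eps)*A) limits destructive updates."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A large ratio earns no extra credit outside the trust region:
print(ppo_clipped_surrogate(ratio=1.8, advantage=1.0))   # 1.2, not 1.8
# With a negative advantage, the pessimistic (clipped) branch wins:
print(ppo_clipped_surrogate(ratio=0.5, advantage=-1.0))  # -0.8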

Full Example: patterns/learning-adaptation/example.py


Pattern 37: Exception Handling and Recovery

Category: Agentic Reliability
Use When: Agents, chains, and tools must survive failures by detecting and classifying errors, retrying wisely, and falling back to degraded paths

Solution

  1. Detect — Structured errors, validation, guardrails, timeouts
  2. Classify — Transient vs permanent vs policy
  3. Handle — Exponential backoff, circuit breaker, fallback model or cache
  4. Recover — Idempotent retries, compensation, checkpoint resume

Code Snippet

from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    is_recoverable: Callable[[BaseException], bool],
) -> T:
    """
    Try ``primary``; on a recoverable exception, invoke ``fallback``.

    Args:
        primary: Preferred code path (e.g. frontier model).
        fallback: Degraded path (e.g. smaller model or cached stub).
        is_recoverable: Whether to use fallback for this exception type.

    Returns:
        Result from primary or fallback.

    Raises:
        Re-raises if the primary fails with a non-recoverable error.
    """
    try:
        return primary()
    except Exception as exc:
        if not is_recoverable(exc):
            raise
        return fallback()

Full Example: patterns/exception-handling-recovery/example.py


Pattern 38: Human-in-the-Loop (HITL)

Category: Agentic Safety
Use When: Automation must yield to people for quality, compliance, or risk

Solution

  1. Triggers — Low confidence, high stakes, novel situations, regulatory rules, sampling
  2. Review — Queues, rubrics, SLAs, multi-level approval
  3. Feedback — Labels and edits → datasets, policies, routing
  4. Orchestration — LangGraph interrupt / human nodes; workflow engines with wait states

Code Snippet

# if stakes == HIGH or conf < tau: enqueue(HumanReviewTicket)
# LangGraph: interrupt_before=[human_node]; resume with Command
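
A sketch of the trigger logic; HumanReviewTicket, the queue, and the tau threshold are illustrative stand-ins for your review infrastructure:

from dataclasses import dataclass
from queue import Queue

@dataclass
class HumanReviewTicket:
    action: str
    confidence: float
    reason: str

review_queue: "Queue[HumanReviewTicket]" = Queue()

def maybe_escalate(action: str, confidence: float,
                   high_stakes: bool, tau: float = 0.8) -> bool:
    """Enqueue for human review on low confidence or high stakes."""
    if high_stakes or confidence < tau:
        reason = "high stakes" if high_stakes else "low confidence"
        review_queue.put(HumanReviewTicket(action, confidence, reason))
        return True   # paused for review
    return False      # safe to auto-execute

print(maybe_escalate("refund $9", 0.95, high_stakes=False))   # False
print(maybe_escalate("refund $9000", 0.95, high_stakes=True)) # True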

Full Example: patterns/human-in-the-loop/example.py


Pattern 39: Agentic RAG (Knowledge Retrieval)

Category: Agentic Knowledge
Use When: You need up-to-date, source-grounded answers with embeddings, semantic search, chunking, vector stores, and advanced variants

Solution

  1. Chunk → embed → vector DB; measure relevance via cosine / distance metrics
  2. Hybrid retrieval (dense + sparse) where lexical match matters
  3. Graph RAG for entity-centric queries; agentic RAG for query rewrite, tool retrieval, multi-hop
  4. LangChain LCEL / LangGraph for pipelines and cycles

Code Snippet

# LCEL: RunnablePassthrough.assign(context=retriever) | prompt | llm
# LangGraph: retrieve -> grade_documents -> [rewrite_query | generate]

Full Example: patterns/agentic-rag/example.py
See also Patterns 6–12 for in-depth implementations of each RAG component.


Pattern 40: Resource-Aware Optimization

Category: Agentic Efficiency
Use When: You must optimize LLM and agent workloads for cost, latency, capacity, and graceful degradation

Solution

  1. Budgets and tiered models — Estimate $ per request
  2. Route by priority and load (Pattern 34)
  3. Prune / summarize context; cache (25); smaller models (24)
  4. Degrade — Fewer tools, shorter answers, async handoff

Code Snippet

# if budget.remaining() < need: summarize(history) or tier = "small"
# if degradation == MINIMAL: tool_gate.disable_heavy_tools()
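
A sketch of a budget gate for tiered model selection; the per-token prices are assumptions for illustration, not real provider rates:

from dataclasses import dataclass

@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd

# Assumed, illustrative prices; substitute your provider's real rates.
COST_PER_1K_TOKENS = {"small": 0.0002, "large": 0.003}

def pick_tier(budget: Budget, est_tokens: int) -> str:
    need = est_tokens / 1000 * COST_PER_1K_TOKENS["large"]
    return "large" if budget.remaining() >= need else "small"

budget = Budget(limit_usd=0.01, spent_usd=0.009)
print(pick_tier(budget, est_tokens=2000))  # "small": large tier exceeds budget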

Full Example: patterns/resource-aware-optimization/example.py


Pattern 41: Reasoning Techniques (Agentic)

Category: Agentic Reasoning
Use When: You need a structured approach to complex Q&A using CoT, ToT, self-correction, PAL / code-aided reasoning, ReAct, RLVR, debates (CoD), deep research

Solution

Use the technique map in patterns/reasoning-techniques/README.md: CoT (13), ToT (14), Reflection (18), Deep Search (12), ReAct / tools (21), PAL-style code (22), multi-agent debates (23), prompt / workflow optimization (20).

Code Snippet

from typing import Callable

def language_agent_tree_search_stub(
    frontier: list[str],
    expand_fn: Callable[[str], list[str]],
    score_fn: Callable[[str], float],
    beam_width: int = 2,
) -> list[tuple[str, float]]:
    """
    Minimal beam-style selection (stand-in for Language Agent Tree Search).

    LATS in the literature expands **language** states/actions, scores children
    with a value model or LLM critic, and prunes—unlike a flat ToT breadth list.

    Args:
        frontier: Current candidate partial solutions or thoughts.
        expand_fn: Callable taking one candidate, returning child strings.
        score_fn: Callable taking a string, returning higher-is-better score.
        beam_width: Max states to keep after scoring.

    Returns:
        Top ``beam_width`` (candidate, score) pairs.
    """
    children: list[tuple[str, float]] = []
    for node in frontier:
        for ch in expand_fn(node):
            children.append((ch, float(score_fn(ch))))
    children.sort(key=lambda x: x[1], reverse=True)
    return children[:beam_width]

Full Example: patterns/reasoning-techniques/example.py


Pattern 42: Evaluation and Monitoring (Agentic)

Category: Agentic Observability
Use When: You need performance tracking, A/B tests, compliance evidence, latency SLOs, token/cost telemetry, custom quality metrics (LLM-as-judge), and multi-agent traces

Solution

Instrument calls, aggregate SLAs, run experiments with guardrail metrics, store audit evidence, trace multi-agent workflows.

Code Snippet

# trace_id + span per LLM/tool call; tokens += prompt_tokens + completion_tokens; export to OTLP
# ab_variant(user_key, "exp", ("a","b")); compare judge_score & p95_latency

Full Example: patterns/evaluation-monitoring/example.py


Pattern 43: Prioritization

Category: Agentic Scheduling
Use When: Competing tasks (support queues, cloud jobs, trading paths, security incidents) must be ordered using multi-criteria scores, dynamic re-ranking, and resource-aware scheduling

Solution

Weighted dimensions (urgency, impact, effort, SLA, security), recompute on events, integrate with routing (34) and capacity (40).

Code Snippet

# score = w1*urgency + w2*importance - w3*effort + w4*f(sla) + w5*security
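
A runnable sketch of that weighted score; the weights and task values are illustrative:

def priority_score(urgency: float, importance: float, effort: float,
                   sla_hours_left: float, security_risk: float) -> float:
    sla_pressure = 1.0 / max(sla_hours_left, 0.1)  # tighter SLA, higher score
    return (0.30 * urgency + 0.25 * importance - 0.15 * effort
            + 0.15 * sla_pressure + 0.15 * security_risk)

tasks = {
    "patch CVE": priority_score(0.9, 0.9, 0.3, 4.0, 1.0),
    "refactor docs": priority_score(0.2, 0.4, 0.5, 72.0, 0.0),
}
for name, score in sorted(tasks.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")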

Full Example: patterns/prioritization/example.py


Pattern 44: Memory Management

Category: Agentic State
Use When: Agents need short-term context, long-term persistence, episodic retrieval, procedural playbooks, and privacy-aware storage

Solution

Tier memory (working, episodic, procedural, semantic); extract and retrieve selectively; persist orchestrator state with MemorySaver.

Code Snippet

# LangGraph: compile(..., checkpointer=MemorySaver()); thread_id in config
# External: memory.search(query, user_id=...) for semantic / episodic layers

Full Example: patterns/memory-management/example.py


Pattern 45: Planning and Task Decomposition

Category: Agentic Orchestration
Use When: You need explicit task graphs, dependencies, and valid execution order

Decompose goals into a DAG of subtasks with dependencies. The planner agent determines which tasks to run in parallel vs. sequentially based on dependency analysis.
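
A stdlib sketch of dependency-ordered execution: each get_ready() batch has all dependencies satisfied and can run in parallel. The task names and plan are illustrative:

from graphlib import TopologicalSorter

plan = {
    "draft_report": {"gather_data", "analyze"},
    "analyze": {"gather_data"},
    "gather_data": set(),
    "review": {"draft_report"},
}

ts = TopologicalSorter(plan)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())  # deps satisfied: safe to run in parallel
    for task in ready:
        print("run:", task)
        ts.done(task)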

Full Example: patterns/planning-task-decomposition/example.py


Pattern 46: Goal Setting and Monitoring

Category: Agentic Governance
Use When: SMART goals, progress vs. targets, deviation detection, strategy updates

The goal-monitor agent tracks metrics against defined targets, detects when progress deviates from expected trajectories, and adjusts strategy when needed.

Full Example: patterns/goal-setting-monitoring/example.py


Pattern 47: MCP Integration (Agentic)

Category: Tooling / Integration
Use When: Model Context Protocol servers — discovery, tools/list, tools/call — secure composition with Pattern 21

Model Context Protocol (MCP) provides a standardized interface between agents and external resources. Agents discover available tools at runtime through the protocol, call them with structured inputs, and receive structured outputs.

Full Example: patterns/mcp-integration/example.py


Pattern 48: Inter-Agent Communication

Category: Distributed Agents
Use When: Message envelopes, routing, correlation, capability discovery (A2A-style) with Pattern 23

Agent-to-Agent (A2A) communication defines structured message schemas and communication protocols for inter-agent coordination. Agents send typed messages (task assignments, results, status updates, requests for clarification) through a message bus or shared workspace.

Full Example: patterns/inter-agent-communication/example.py


Pattern 49: Safety Guardian

Category: Safety / Compliance
Use When: Multi-layer defense, risk thresholds, shutdown paths beyond single guardrail scanners (extends Pattern 32)

The safety guardian agent implements three-tier protection: pre-action guardrails (evaluate the proposed action before execution), in-process monitoring (enforce scope and resource constraints during execution), and post-action auditing. Includes prompt injection detection for agents that process external content.

Full Example: patterns/safety-guardian/example.py


Pattern 50: Exploration and Discovery

Category: Search / Learning
Use When: Explore vs. exploit, novel environments, hypothesis cycles (pairs with Patterns 12, 14, 41, 36)

Implements a multi-armed bandit or curiosity-driven strategy that balances exploitation (using known-good approaches) with exploration (trying new approaches to discover if they’re better). Scores update from outcomes, so the agent continuously refines its strategy distribution.
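
A minimal epsilon-greedy sketch of the explore/exploit loop; the arm names are illustrative strategies:

import random

class EpsilonGreedy:
    def __init__(self, arms: list[str], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}  # running mean reward per arm

    def select(self) -> str:
        if random.random() < self.epsilon:            # explore
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)  # exploit

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

bandit = EpsilonGreedy(["rag", "tool_call", "direct_answer"])
arm = bandit.select()
bandit.update(arm, reward=1.0)  # reward comes from the task outcome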

Full Example: patterns/exploration-discovery/example.py


Pattern Comparison Matrix

| # | Pattern | Complexity | Training Required | Best For |
|---|---------|------------|-------------------|----------|
| 1 | Logits Masking | Medium | No | Valid JSON, banned words |
| 2 | Grammar Constrained Generation | High | No | API configs, schemas |
| 3 | Style Transfer | Low–Medium | Optional | Notes to emails |
| 4 | Reverse Neutralization | High | Yes | Your writing style |
| 5 | Content Optimization | High | Yes | Open rates, conversions |
| 6 | Basic RAG | Medium | No | Documentation, knowledge bases |
| 7 | Semantic Indexing | High | No | Technical docs, multimedia, tables |
| 8 | Indexing at Scale | High | No | Healthcare guidelines, policies |
| 9 | Index-Aware Retrieval | High | No | Technical docs, API docs |
| 10 | Node Postprocessing | High | No | Legal docs, medical records |
| 11 | Trustworthy Generation | High | No | Medical Q&A, legal research |
| 12 | Deep Search | High | No | Market research, due diligence |
| 13 | Chain of Thought | Low–Medium | No | Policy eligibility, math, compliance |
| 14 | Tree of Thoughts | High | No | Root-cause analysis, design exploration |
| 15 | Adapter Tuning | Medium–High | Yes (adapter only) | Intent routing, content moderation |
| 16 | Evol-Instruct | High | Yes (SFT/LoRA) | Policy Q&A, compliance playbooks |
| 17 | LLM as Judge | Low–Medium | No | Support quality, model evaluation |
| 18 | Reflection | Medium | No | Drafts, code, plans |
| 19 | Dependency Injection | Low–Medium | No | Fast deterministic tests with mocks |
| 20 | Prompt Optimization | Medium–High | No | Summarization, copy, classification |
| 21 | Tool Calling | Medium–High | No | APIs, live data, actions (ReAct) |
| 22 | Code Execution | Medium–High | No | Diagrams, plots, SQL |
| 23 | Multi-Agent Collaboration | High | No | Vendor review, incidents, research crews |
| 24 | Small Language Model | Medium–High | Optional | Cost, VRAM, throughput |
| 25 | Prompt Caching | Low–Medium | No | Repeated prompts, long prefixes |
| 26 | Inference Optimization | Medium–High | No | Self-hosted throughput, KV memory |
| 27 | Degradation Testing | Medium–High | No | TTFT, EERL, tok/s, RPS; LLMPerf |
| 28 | Long-Term Memory | Medium–High | No | Stateful assistants, personalization |
| 29 | Template Generation | Low–Medium | No | Transactional email/SMS |
| 30 | Assembled Reformat | Medium–High | No | PDPs with hazmat/battery risk |
| 31 | Self-Check | Medium | No | Logprobs, perplexity, uncertainty triage |
| 32 | Guardrails | Medium–High | No | Security, moderation, PII |
| 33 | Prompt Chaining | Low–Medium | No | Sequential workflows, structured handoffs |
| 34 | Routing | Low–High | Optional | Intent → handler, tools, subgraph |
| 35 | Parallelization | Low–Medium | No | Research, analytics, multimodal |
| 36 | Learning and Adaptation | High | Yes | RL, preferences, online drift |
| 37 | Exception Handling & Recovery | Low–High | No | Agent tools, chains, APIs |
| 38 | Human-in-the-Loop (HITL) | Low–High | No | Moderation, fraud, trading, safety |
| 39 | Agentic RAG | Medium–High | No | Fresh knowledge, multi-hop retrieval |
| 40 | Resource-Aware Optimization | Medium–High | No | Cost/latency budgets, degradation |
| 41 | Reasoning Techniques | Low–Very High | Varies | CoT, ToT, ReAct, PAL, debates |
| 42 | Evaluation and Monitoring | Medium–High | No | LLM-judge metrics, multi-agent spans |
| 43 | Prioritization | Low–High | No | Support, cloud jobs, trading, security |
| 44 | Memory Management | Medium–High | No | LangGraph threads, episodic retrieval |
| 45 | Planning & Task Decomposition | Medium–High | No | DAG tasks, dependencies |
| 46 | Goal Setting & Monitoring | Medium | No | SMART goals, deviation detection |
| 47 | MCP Integration | Medium | No | Tool servers, discovery |
| 48 | Inter-Agent Communication | Medium–High | No | Messages, routing, A2A |
| 49 | Safety Guardian | High | No | Layered safety, shutdown paths |
| 50 | Exploration & Discovery | Medium | No | Explore/exploit, novel domains |

Choosing the Right Pattern

| If you need… | Use… |
|--------------|------|
| Enforce constraints during generation | Pattern 1: Logits Masking |
| Formal grammar compliance | Pattern 2: Grammar Constrained Generation |
| Transform content style quickly | Pattern 3: Style Transfer (Few-Shot) |
| Consistent personal style | Pattern 4: Reverse Neutralization |
| Optimize for performance metrics | Pattern 5: Content Optimization |
| External knowledge augmentation | Pattern 6: Basic RAG |
| Semantic search or complex content | Pattern 7: Semantic Indexing |
| Large-scale, evolving knowledge with freshness | Pattern 8: Indexing at Scale |
| Handle vocabulary mismatches | Pattern 9: Index-Aware Retrieval |
| Ambiguous entities, conflicting or verbose chunks | Pattern 10: Node Postprocessing |
| Prevent hallucination and build user trust | Pattern 11: Trustworthy Generation |
| Comprehensive research with multi-hop reasoning | Pattern 12: Deep Search |
| Multistep reasoning or auditable reasoning trace | Pattern 13: Chain of Thought |
| Explore multiple strategies or hypotheses | Pattern 14: Tree of Thoughts |
| Specialize a foundation model with small labeled dataset (100s–1000s pairs) | Pattern 15: Adapter Tuning |
| Teach a model new tasks from private data | Pattern 16: Evol-Instruct |
| Scalable, nuanced evaluation with scores and justifications | Pattern 17: LLM as Judge |
| Self-correction in stateless APIs without user follow-up | Pattern 18: Reflection |
| Develop and test GenAI pipelines without flaky LLM calls | Pattern 19: Dependency Injection |
| Find good prompts, re-run when model changes | Pattern 20: Prompt Optimization |
| Call APIs, live systems, or tools (not only RAG) | Pattern 21: Tool Calling |
| Diagrams, plots, or queries as DSL executed in sandbox | Pattern 22: Code Execution |
| Multiple specialized roles, decomposition, parallel work | Pattern 23: Multi-Agent Collaboration |
| Run on smaller GPUs, cut cost, speed up decoding | Pattern 24: Small Language Model |
| Avoid recomputing repeated or paraphrased prompts | Pattern 25: Prompt Caching |
| Higher throughput and lower KV pressure on self-hosted LLMs | Pattern 26: Inference Optimization |
| Load tests with LLM-native metrics (TTFT, EERL, tok/s, RPS) | Pattern 27: Degradation Testing |
| Durable user context beyond raw chat history | Pattern 28: Long-Term Memory |
| On-brand, reviewable customer email/SMS | Pattern 29: Template Generation |
| Product pages where wrong specs are unacceptable | Pattern 30: Assembled Reformat |
| Flag uncertain generations using token probabilities | Pattern 31: Self-Check |
| Policy enforcement (PII, banned topics, moderation) | Pattern 32: Guardrails |
| Reliable multi-step workflows with structured handoffs | Pattern 33: Prompt Chaining |
| Pick the right tool, model, or specialist path | Pattern 34: Routing |
| Run independent tasks concurrently, then merge | Pattern 35: Parallelization |
| Improve from rewards, preferences, or streaming feedback | Pattern 36: Learning and Adaptation |
| Agents and chains that survive tool/API failures | Pattern 37: Exception Handling & Recovery |
| People in the loop for high-stakes decisions | Pattern 38: HITL |
| Gulli-level RAG with agentic retrieval loops | Pattern 39: Agentic RAG |
| Cost/latency-aware agents with graceful degradation | Pattern 40: Resource-Aware Optimization |
| Map of reasoning methods tied to implementations | Pattern 41: Reasoning Techniques |
| Production observability: latency, tokens, A/B tests | Pattern 42: Evaluation and Monitoring |
| Rank competing tasks or incidents | Pattern 43: Prioritization |
| LangGraph-style memory tiers + checkpointing | Pattern 44: Memory Management |
| Explicit task DAGs and dependency order | Pattern 45: Planning & Task Decomposition |
| SMART goals and deviation from targets | Pattern 46: Goal Setting & Monitoring |
| MCP tool servers with discovery and secure calls | Pattern 47: MCP Integration |
| Agent message fabric / A2A-style coordination | Pattern 48: Inter-Agent Communication |
| Layered safety beyond I/O scanners | Pattern 49: Safety Guardian |
| Explore vs. exploit in open-ended search | Pattern 50: Exploration & Discovery |

Takeaways

Here are major takeaways from these agentic patterns:

Enforce constraints early. Logits Masking and Grammar Constrained Generation prevent bad output at the token level. The same logic applies to Guardrails: put them in the runtime layer, not in the system prompt.

RAG is a stack you build layer by layer. Start with Basic RAG. When vocabulary gaps break retrieval, add Semantic Indexing. When contradictions surface, add Indexing at Scale. When queries don’t match chunks, add Index-Aware Retrieval. When retrieved chunks are noisy, add Node Postprocessing. Each pattern fixes the failure mode of the one before.

Structure the reasoning. Chain of Thought, Tree of Thoughts, and ReAct all treat reasoning as something to engineer. Adding “think step by step” costs one line and measurably improves multi-step accuracy. Tree of Thoughts costs more but handles problems where a single reasoning path gets stuck.

You need less data than you think for specialization. Adapter Tuning and Evol-Instruct both produce strong task-specific models from hundreds of examples, not millions. Evolving seed questions into harder variants and filtering by quality gives you a curriculum worth training on. The bottleneck is usually data quality, not quantity.

The operational patterns matter as much as the modeling ones. Prompt Caching, Inference Optimization, and Degradation Testing rarely appear in research papers. They're the difference between a working demo and a system that holds up under real traffic.


I have added examples for all 50 patterns at: github.com/bhatti/agentic-patterns

Run the setup in ten minutes with SETUP.md, then explore whichever patterns are most relevant to what you’re building.

Technology Stack

  • Ollama — Local model serving
  • LangChain — LLM orchestration
  • CrewAI — Multi-agent systems
  • LangGraph — Stateful agent workflows
  • HuggingFace — Model hub and transformers
  • PyTorch — Direct model access and logits manipulation

March 13, 2026

Load Testing Applications That Actually Scale: A Practitioner’s Guide

Filed under: Computing — admin @ 3:55 pm

In past projects, I saw that most engineering teams ran load tests before major launches and rarely at any other time. The assumption is that if a code change is small, performance is probably fine. In practice, that assumption fails regularly. A runtime upgrade can change memory allocation patterns, garbage collection behavior, and connection handling in ways that only appear under load. A third-party library upgrade can introduce synchronous blocking where there was none before. A new database index can shift query planner behavior and affect read latency at scale. None of these surface in functional tests. None of them are visible in code review. They show up under load, in production, usually at the worst possible time.

Performance testing isn’t a pre-launch ceremony. It’s part of how you understand and maintain your system’s behavior as your code evolves, your dependencies change, and your traffic grows. This guide covers the full scope: the test types and what each one tells you, how to design meaningful tests, what metrics to collect, which tools to use, how to handle dependencies in your tests, and how to make this a regular part of your development process rather than a one-time event.


Why Performance Testing Gets Skipped

Teams often skip performance testing due to setup time, cost, or slow feedback loops. These constraints are legitimate, but they lead to a familiar outcome: performance problems get discovered in production. Another common pattern I have observed is that many teams don’t have a clear baseline picture of how their application actually behaves. They don’t know their normal memory footprint. They don’t know which code paths are hot. They don’t know at what concurrency level their database connection pool saturates or when their cache hit rate starts degrading. Without a baseline, you can’t detect regressions, you can’t capacity plan accurately, and you can’t tell a normal traffic spike from an actual problem. The goal of performance testing is to know your system well enough to predict how it behaves and catch it when behavior changes unexpectedly.


Performance Testing in the SDLC

The most effective teams don’t treat performance testing as a separate phase; instead, they integrate it into their regular development process at multiple levels.

  • During development: I have found profiling tools like JProbe/YourKit for Java, pprof for Go, the V8 profiler for Node.js, and Xcode Instruments for Swift/Objective-C incredibly useful for finding hot code paths, memory leaks, and concurrency issues.
  • During code review: Another common pattern that I have found useful is flagging changes to caching, database queries, serialization, or hot code paths for load testing before merge.
  • Nightly CI/CD pipelines: Load testing on every commit would be excessive, but a subset of tests can run as part of the nightly build so regressions are fixed before they reach production.
  • On a regular schedule: Another option is to run full-scale load and soak tests on a defined cadence, such as weekly.
  • Before major releases: Comprehensive tests covering all scenarios (average load, peak load, stress, spikes, soak) can run against a production-representative environment.
  • After significant dependency upgrades: Runtime upgrades, major library version bumps, and infrastructure changes all deserve their own performance test pass.

The Testing Taxonomy

Following are different types of performance tests:

Profiling

Profiling instruments your application during execution and shows you exactly where time and memory are spent: which functions consume CPU, which allocate the most memory, where goroutines or threads block. You can run profiling locally before code review so that you understand the bottlenecks that already exist in your code. Load testing tells you how those bottlenecks behave when many users hit them simultaneously. Most runtimes include profiling support (Go’s pprof, Node.js’s built-in CPU and heap profiler, Python’s cProfile), so you can also enable profiling in a test environment if needed.

Load Testing

Load testing applies a realistic, expected workload and verifies the system meets defined performance targets. The workload mirrors production traffic: request distribution, concurrency level, and payload shapes. The goal isn’t to break anything. It’s to confirm the system handles its designed workload within acceptable response times and error rates. Any change that could affect throughput (a code change in a hot path, a dependency upgrade, a configuration change, a schema migration) warrants a load test.

Stress Testing

Stress testing pushes load well beyond expected levels to find where the system breaks and how it breaks. At what point does performance degrade? What component fails first? Does the system fail gracefully or catastrophically, possibly corrupting state? In past projects, I found that a practical target in cloud environments is 10x your expected peak load. This accounts for real-world variability: viral traffic events, bot traffic, cascading retries from upstream services, and faster growth than planned. Stress tests also expose whether your failure modes are safe. When your system can’t keep up, what happens? Does it queue requests until it runs out of memory? Does it reject new connections cleanly with meaningful errors? Does retry behavior from clients amplify load, turning a recoverable spike into a full outage?

Spike Testing

Spike testing applies an abrupt load increase, not a gradual ramp but a sharp jump, to learn how the system absorbs and recovers from it. This simulates promotional emails going out, products appearing in the news, scheduled batch jobs triggering thousands of concurrent operations, or a mobile app push notification causing a synchronized rush of API calls. Spike testing can identify problems like cold-start latency when new instances initialize, connection pool exhaustion when concurrency jumps faster than the pool replenishes, cache stampedes when many concurrent requests miss cache simultaneously, and auto-scaling lag when the metric-to-action delay is too long. After the spike, watch recovery. Latency should return to baseline. Resource utilization should drop. If it doesn’t, the system is carrying forward pressure that will degrade subsequent traffic.

Soak Testing

Soak testing runs a moderate, sustained load over an extended period, from several hours to several days. The load level isn’t extreme; the duration is the point, because it uncovers problems that only occur after long runs, such as:

  • Memory leaks: Usage climbs slowly and continuously. The system that runs fine for 30 minutes may run out of heap after 8 hours. This is especially important to test after runtime or library upgrades, which can change allocator behavior.
  • Connection leaks: Database or HTTP connections that aren’t properly released accumulate until the pool is exhausted.
  • Thread accumulation: Background threads that don’t terminate properly compound over time.
  • Disk exhaustion: Log files that aren’t rotated, or temporary files that aren’t cleaned up, fill disk gradually.
  • Cache degradation: Caches misconfigured for their access patterns may perform well initially and degrade as the working set evolves.
  • GC pressure: Garbage collection that runs cleanly initially can become increasingly frequent and pause-heavy as heap fragmentation grows over time.

Scalability Testing

Scalability testing validates that your system scales up to absorb increasing load and scales back down when load subsides. Cloud infrastructure assumes elastic scaling, so scalability testing verifies that assumption. It helps verify that: the metric driving scale-up (CPU, request rate, queue depth) actually reaches its threshold under realistic load; the scaling event actually reduces the pressure that triggered it; scale-up happens fast enough that users don’t experience degradation during the lag; and scale-down doesn’t trigger an immediate scale-up cycle, creating instability. In practice, auto-scaling, especially the first scale event, can take several minutes, so make sure you have some extra capacity to handle increased load during that window.

Volume Testing

Many performance characteristics change materially as data grows. Index scan times increase. Query planner behavior shifts. Cache hit rates drop as the working set outgrows cache size. Search latency that is acceptable at 50 million records may become unacceptable at 250 million. Test at your current production data volume, then at projected volumes for 1 and 3 years out. The time to address data-growth challenges in your architecture is before you reach that scale.

Recovery Testing

Recovery testing applies an abnormal condition like a dependency failure, a network partition, a resource exhaustion event and measures how long the system takes to return to normal operation. The key questions: does the system recover at all? How long does recovery take? What’s the user-visible impact during the recovery window?


Handling Dependencies in Your Tests

One of the practical decisions in every load test is what to do about dependencies like external APIs, third-party services, internal microservices, payment processors, identity providers, email services, and so on. You have two approaches, and which one you choose depends on what your test is trying to answer.

Mock Dependencies When You’re Focused on Your Own Code

When your goal is to validate your application’s internal performance (memory footprint, CPU usage, throughput of your business logic, efficiency of your data access layer), mocking external dependencies is often the right call. You will, however, need a well-designed mock that returns realistic response payloads with configurable latency. Mocking lets you:

  • Isolate your application’s performance characteristics from the noise of external variability
  • Simulate dependency failure modes (timeouts, errors, slow responses) in a controlled way
  • Run tests without consuming third-party quotas or generating costs in external systems
  • Reproduce specific latency profiles to understand how your code behaves under different dependency performance conditions

Include Real Dependencies When Integration Behavior Matters

When your goal is to validate end-to-end system behavior, including the interaction effects between your system and its dependencies, use real dependencies or realistic stubs deployed under your control. The reason this matters: under load, dependencies behave differently than they do at idle. For example, higher latency in dependencies can propagate, creating back-pressure in your system that a mock would never reveal. Dependencies that are slow, throttled, or unavailable under load can:

  • Exhaust your connection pools (connections held open waiting for a slow response)
  • Fill your request queues (new requests queueing behind slow in-flight requests)
  • Trigger retry storms (your retry logic amplifying load on an already-struggling dependency)
  • Surface timeout and circuit-breaker behavior that only activates under real latency conditions

If you include real third-party services in your load test, be explicit about two things: you may consume quota and generate costs, and their performance becomes part of your results. When a dependency is slow, it appears as latency in your own metrics — know what you’re measuring.

A practical middle ground: deploy internal stubs for your external dependencies. A stub is a service you control that returns realistic responses with configurable behavior. Unlike a mock in a test harness, a stub runs as a real service and participates in your actual network topology. It lets you test realistic integration behavior without the unpredictability or cost of real external services.

Watch for Automatic Retry Amplification

Another factor that can skew results from performance testing is automated retries at various layers when a request fails or times out. Under load, this multiplies traffic. If your application generates 400 write operations per second against a dependency, and that dependency starts returning errors, your client may retry each failed request two or three times, suddenly generating 800 to 1,200 operations per second against an already-struggling system. In your load tests, verify that your retry behavior is bounded and doesn’t turn a manageable degradation into a cascading failure. Exponential backoff with jitter, retry budgets, and circuit breakers all exist to prevent this.
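
As a sketch of what bounded retry behavior looks like, here is capped exponential backoff with full jitter; the attempt and delay limits are illustrative:

import random
import time

def call_with_retries(call, max_attempts: int = 3,
                      base_delay: float = 0.1, max_delay: float = 2.0):
    """Retry a callable with capped, jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error, don't amplify
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter spreads retries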


Design Your Load Model

Before writing a test script, model the load you intend to generate. A poorly designed load model produces results that feel meaningful but don’t correspond to anything real.

Use Production Traffic Patterns as Your Starting Point

Study your actual production metrics. Identify:

  • Average requests per second across a normal operating period
  • Peak requests per second during your highest-traffic periods
  • Request distribution across endpoints: what percentage of traffic hits each API? Most services have a small number of high-traffic endpoints and many low-traffic ones.
  • Read/write ratio: most production services are read-heavy; your load model should reflect that
  • Payload characteristics — average request and response sizes
  • User session behavior: are users authenticated? Do requests carry session state? Do later requests in a workflow depend on earlier ones?
  • Geographic distribution: does your traffic come from one region or many?

Use Stepped Load Progression

Ramp load gradually rather than jumping to peak immediately. A stepped approach produces distinct data points at each level, making it easier to identify where behavior changes.

Hold each step long enough for metrics to stabilize and for any auto-scaling events to complete. If your auto-scaling policy triggers after 5 minutes of sustained high CPU, your steps need to run for at least 7-10 minutes. Steps that are too short produce transient data that doesn’t represent steady-state behavior.

Model Think Time

Real users don’t send requests as fast as possible. They read pages, fill forms, wait for results, and make decisions. Think time, the pause between user actions, should be randomized within a realistic range based on observed production behavior. Omitting think time concentrates load artificially, inflates concurrency counts, and produces results that don’t correspond to real user behavior.

Model Transaction Workflows, Not Just Endpoints

A user doesn’t hit /api/checkout. They authenticate, browse products, add items to a cart, enter payment details, and confirm an order. Each step depends on the previous step and carries state forward. Test complete workflows. Measure the whole transaction, not just individual request latency. This reveals which step in the workflow breaks first under load, which is your actual bottleneck. For transactional workflows, count the full transaction as your unit of measurement, not individual requests. A checkout that takes 12 requests and completes in 3 seconds is different from one that requires 12 requests and only completes 60% of the time under load.


The Test Environment

Your test environment is the single largest source of invalid load test results. Get this wrong and every metric, analysis, and conclusion downstream becomes unreliable.

Match Production Infrastructure

The test environment should match production in:

  • Instance types, sizes, and counts
  • Database configuration: connection pool size, cache allocation, index configuration, replica count
  • Caching layers and their sizes (this is a common miss: a cache sized to 10% of production will warm and evict very differently)
  • Auto-scaling configuration and thresholds
  • Load balancer and network configuration
  • All service configurations that affect throughput or latency

Pay particular attention to cache sizes. Under-sized caches in test environments produce unrealistically high cache miss rates, which increases database load and makes your results look worse than production will be. Over-sized caches make things look better.

Use Representative Data Volumes

Test environments with small datasets produce misleading results. A database with 1 million rows behaves differently from one with 100 million rows in ways that are significant and non-linear. Index performance, query planner behavior, partition routing, and cache hit rates all change with data volume. Populate your test environment with data that reflects realistic production scale before running meaningful performance tests.

Isolate the Test Environment Completely

I have seen a load test take down a production environment because they shared common infrastructure. A test environment that shares any infrastructure with production (databases, message queues, caching clusters, network paths, logging infrastructure) creates two simultaneous problems: invalid test results (because production traffic contaminates your measurements) and potential production incidents (because your load test contaminates production systems). Shared test environments that connect to production messaging buses, Kafka clusters, or databases have caused outages. Enforce complete isolation.

Account for Test Data Accumulation

Load tests generate real data. After many test runs, your test database accumulates records, logs grow, and storage fills. Plan your test data lifecycle from the start, e.g., how you populate data before tests, whether you clean up between runs, and how you prevent accumulated test data from affecting your test environment’s performance over time.

Document Your Environment Specification

Version-control your environment definition alongside your test scripts. When you compare results across time, you need to know that what changed was the system under test, not the test environment. An environment specification that exists only in someone’s memory cannot be reproduced reliably.


Metrics: Collect the Right Things

Load testing generates a lot of data. The teams that extract the most value don’t collect more metrics; they collect the right metrics and actually analyze them.

Latency

Track percentiles, not averages. Averages hide tail behavior that determines user experience.

  • P50 — what the median user experiences
  • P90 — your common-case ceiling; nine in ten requests complete within this
  • P99 — your near-worst case; one in a hundred users waits this long
  • P99.9 — your extreme tail; relevant for high-volume services where 0.1% is still thousands of users

The gap between P50 and P99.9 tells you about consistency. A wide gap means some users experience good performance while others experience unacceptable degradation. Systems under load often hold P50 steady while P99 climbs.
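
To make the percentile arithmetic concrete, here is a small nearest-rank computation over sample latencies; the values are made up to show how a heavy tail separates P50 from P99:

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; fine for load-test reporting."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

latencies = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]  # ms
for p in (50, 90, 99):
    print(f"P{p} = {percentile(latencies, p)} ms")  # P50=14, P90=250, P99=900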

Throughput

  • Requests per second: raw system throughput
  • Successful transactions per second: throughput filtered by correctness; throughput with a 20% error rate is not good throughput
  • Throughput per resource unit: requests per CPU core or per GB of memory; helps with capacity planning

Error Rates

  • Fault rate — server-side failures (5xx responses)
  • Error rate — client rejections, throttled requests, timeouts
  • Error distribution — which specific errors, at what load levels

Don’t aggregate errors into a single rate. A 2% error rate composed entirely of timeouts tells you something different from a 2% error rate of connection refused responses. Decompose your error data and correlate specific error types with the load levels at which they appear.

Resource Utilization

Collect these for every component: application servers, databases, caches, message queues, load balancers, and load generators:

  • CPU: overall and per-core; watch for single-threaded bottlenecks where overall CPU looks fine but one core is maxed
  • Memory: heap usage, GC frequency and pause duration, swap usage; track memory over time in soak tests to detect leaks
  • Disk I/O: read and write throughput, queue depth, utilization percentage; relevant for databases and any service that writes logs or temp files
  • Network I/O: ingress and egress bytes per second, connection counts, dropped packets
  • Thread and connection pool utilization: active threads, queued requests, pool exhaustion events

Application-Level Metrics

  • Cache hit/miss/eviction rates: degrading hit rates under load reveal cache sizing or key distribution problems
  • Queue depths: growing queues indicate consumers can’t keep pace with producers
  • Database connection pool saturation: one of the most common failure modes under load
  • GC pause duration and frequency: GC pressure under load causes latency spikes that don’t show up in CPU metrics directly
  • Retry rates: high retry rates indicate a dependency is struggling, and may be amplifying load
  • Circuit breaker state: how often circuit breakers open under load, and what triggers them

Dependency-Level Metrics

When you include real dependencies in your test, monitor them as carefully as your own service:

  • Response latency from each dependency (P50, P99)
  • Error rates from each dependency
  • Dependency-side resource utilization (if you have access)
  • Message bus ingress and egress (if applicable)
  • Partition utilization for distributed storage systems

When a dependency is slow or erroring, that signal propagates through your system as elevated latency and errors in your own metrics. You need dependency-level metrics to trace the source.

Availability

Define availability targets before testing:

  • Service availability: percentage of requests that succeed
  • Per-endpoint availability: some endpoints degrade before others; measure them independently
  • Dependency availability: availability of each system your service calls

Business-Level Metrics

The most important metrics are often furthest from the infrastructure:

  • Orders completed per minute
  • Successful authentication rate
  • Payment processing completion rate
  • Data write confirmation rate

Infrastructure metrics tell you what the system is doing. Business metrics tell you what users are experiencing. A system where P99 latency stays within SLA but checkout completion drops 15% under load has a problem that infrastructure metrics alone won’t reveal clearly.


Tools

Over the years, I have used various commercial and open-source tools like LoadRunner, Grinder, and Tsung, several of which are no longer well maintained. Here are common tools that can be used for load testing today:

For Simple Endpoint Testing

  • ab (Apache Bench) and Hey: Command-line tools that generate load against a single endpoint. No scripting required, fast to start.
  • Vegeta: Generates load at a constant request rate, independent of server response time. This distinction matters: when your server responds slowly, most tools automatically reduce request rate. Vegeta maintains the configured rate as latency climbs, which means you observe back-pressure and degradation accurately.
echo "GET https://api.example.com/users/123" | vegeta attack -rate=500 -duration=60s | vegeta report
  • k6: Scripted in JavaScript, distributed as a single Go binary. k6 handles multi-step scenarios natively, supports parameterized test data, models think time, and exposes rich built-in metrics. It integrates with Prometheus, CloudWatch, and Grafana for analysis, and supports threshold-based pass/fail in CI pipelines.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp to 100 users
    { duration: '5m', target: 100 },   // hold at average load
    { duration: '2m', target: 500 },   // ramp to peak
    { duration: '5m', target: 500 },   // hold at peak
    { duration: '1m', target: 1000 },  // spike
    { duration: '2m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99% of requests under 500ms
    http_req_failed: ['rate<0.01'],    // less than 1% error rate
  },
};

export default function () {
  // Step 1: authenticate
  const loginRes = http.post('https://api.example.com/auth/login', {
    username: `user_${__VU % 10000}@example.com`,
    password: 'password',
  });
  check(loginRes, { 'login succeeded': (r) => r.status === 200 });

  const token = loginRes.json('token');

  // Step 2: fetch catalog (read operation)
  const catalogRes = http.get('https://api.example.com/catalog?page=1', {
    headers: { Authorization: `Bearer ${token}` },
  });
  check(catalogRes, { 'catalog loaded': (r) => r.status === 200 });

  sleep(Math.random() * 3 + 1); // think time: 1-4 seconds

  // Step 3: place order (write operation)
  const orderRes = http.post(
    'https://api.example.com/orders',
    JSON.stringify({ item_id: Math.floor(Math.random() * 1000), quantity: 1 }),
    { headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' } }
  );
  check(orderRes, { 'order placed': (r) => r.status === 201 });

  sleep(Math.random() * 2 + 1);
}
  • Apache JMeter: JMeter supports complex scenarios through a GUI, handles correlation between requests, has a broad plugin ecosystem, and has extensive enterprise adoption.
  • Locust: Pure Python, code-defined test scenarios (not XML), a built-in web UI for real-time monitoring, distributed mode via a controller/worker model, and trivially scriptable.

For Distributed Load Generation

AWS Distributed Load Testing: When a single machine can’t generate the volume you need, this solution orchestrates load across multiple instances, accepts JMeter scripts as the test definition, and streams results to time-series storage for analysis. Use it when your bandwidth or TPS requirements exceed what a single load generator can produce.

For Observability During Tests

You can use the following monitoring stacks to gather performance metrics:

  • Prometheus + Grafana: commonly used for infrastructure and application metrics; k6 exports directly to Prometheus
  • CloudWatch: native AWS monitoring; integrates with most AWS services and many load testing tools
  • Distributed tracing (Jaeger, Zipkin, AWS X-Ray): essential for understanding latency in distributed systems; propagate correlation IDs through every service boundary so you can trace a slow request to the specific component that caused it

Without distributed tracing, diagnosing latency in a multi-service system under load is largely guesswork.


Execution

  • Warm Up Before Measuring: JIT compilation, connection pool initialization, cache population, and DNS resolution all affect early request latency. Build a ramp-up period into every test. Discard metrics from the warmup phase. Measure steady-state behavior only.
  • Verify Your Load Generator Isn’t the Bottleneck: Before trusting any results, confirm: load generator CPU stays well below saturation (under 70%), network I/O doesn’t approach the bandwidth ceiling, and the tool achieves the TPS you configured, not a lower number due to local resource constraints. If you configure 1,000 TPS but the generator only achieves 600, your results reflect the generator’s limits, not your system’s.
  • Notify Dependent Teams Before Testing: If your test environment shares any infrastructure with other teams, notify them before running high-volume tests. Unexpected load from your tests against a shared component (a database, a message bus, a routing layer) can cause problems for teams who have no idea a load test is running.
  • Run Each Scenario in Isolation First: Test each scenario independently before running combinations. An isolated test that reveals a problem gives you more diagnostic information than a combined test that reveals the same problem buried in noise from other scenarios.
  • Don’t Overwrite Previous Results: Each test run should write to a new, timestamped output file. Overwriting results from a previous run is a common mistake when running iterative tests in a loop. You lose the ability to compare across runs.
  • Pause Between Runs: Allow the system to fully drain between test iterations: connections close, queues clear, resource utilization returns to baseline. Residual load from one run contaminates the starting conditions of the next.

Common Pitfalls

  • Testing a single endpoint and calling it done. A service’s behavior under load isn’t determined by any single endpoint. Test complete workflows, including the paths that matter most to users.
  • Ignoring dependencies. When your dependencies are slow or unavailable, your service appears slow. When your service hammers a dependency with load, the dependency may degrade and create a feedback loop. Model dependency behavior explicitly: mock it when you want to isolate your own code; use real dependencies or realistic stubs when integration behavior matters.
  • Mismatch between test environment and production. Different hardware, different cache sizes, different connection pool limits, different network latency profiles, any of these make test results non-transferable to production. Document your environment specification. Validate that it matches production before trusting results.
  • Small data volumes. A test environment with 1% of production data volume produces optimistic results. Populate test data to realistic scale.
  • Running load tests once. Performance characteristics change with every code change, every dependency upgrade, and every growth milestone. A load test you ran six months ago tells you about a system that no longer exists.
  • Ignoring ramp-down. Verify that resource utilization returns to baseline after load subsides. A system that doesn’t recover cleanly carries forward pressure that degrades subsequent traffic.
  • Not collecting metrics from all layers. Application-level metrics without infrastructure metrics leave you guessing about root cause. Infrastructure metrics without application or business-level metrics leave you unable to quantify user impact. Collect all three.
  • Stopping tests when something goes wrong instead of analyzing the failure mode. When a stress test surfaces a failure, that’s the point. Note what failed, under what conditions, and how the system behaved. Stopping the test immediately loses the degradation data that tells you whether the failure mode is safe or catastrophic.

Analysis

  • Establish a Baseline Before Comparing Anything: Every metric needs a reference point. P99 latency of 300ms is good or bad depending entirely on what P99 looks like at baseline load. Run a baseline test with minimal concurrent users before escalating. Capture that baseline explicitly. Compare every subsequent measurement against it.
  • Separate Signal from Noise: A single high-latency data point is noise. A systematic increase in P99 as concurrency crosses 500 users is signal. Look for the pattern: where does behavior change? At what load level? After what duration? What resource metric correlates with the change?
  • Trace Latency to Its Source: When you observe elevated latency, resist looking first at application CPU. Latency accumulates in many places: network round trips between services, database query execution, lock contention, GC pause accumulation, connection pool queuing, and downstream dependency latency. Distributed tracing lets you follow a slow request through every component it touched and attribute the latency precisely. Fix the actual source, not the nearest visible symptom.
  • Investigate Unexpectedly Good Results: If your system performs better than expected under load, investigate before celebrating. Unexpected improvement often means your test isn’t exercising the paths you intended such as caches warming too aggressively, load not reaching the components you think it is, or test data creating unrealistic access patterns. Results you can’t explain aren’t results you can rely on.
  • Generate Comparative Reports: A report listing numbers has limited value. A report comparing those numbers to your baseline, to your previous test run, and to your defined thresholds has significant value. For each metric, capture:
    – Current result
    – Baseline value
    – SLA or target threshold
    – Previous test result (regression or improvement?)
    – Load level at which the metric was captured

Store test results in a queryable format over time.


Building a Continuous Performance Practice

The teams with the most reliable services don’t treat performance testing as a project. They treat it as a discipline with regular cadence.

  • Define performance goals and revisit them annually. Goals should include throughput targets, latency percentiles, error rate limits, resource utilization ceilings, and headroom targets (how much capacity should remain available at peak). As your traffic patterns change, your service evolves, and your SLAs tighten, these goals need updating.
  • Automate pass/fail thresholds in CI. Encode your performance targets as pipeline gates. A change that increases P99 latency by 40% under load should fail the build, the same way a change that breaks a unit test fails the build.
  • Run performance canaries in production. Continuously exercise production endpoints at low volume from monitoring infrastructure. Track latency, error rates, and throughput over time. Detect gradual degradation before users do.
  • Assign a performance owner on each service team. Performance improvements don’t happen without someone watching the metrics, reviewing throttling rules, identifying regressions, and driving improvements.
  • Review results across time for patterns. Look at all your load test results over the past quarter. Which metrics trend in the wrong direction? Which components appear repeatedly in bottleneck analysis? Patterns across multiple tests reveal systemic issues that any individual test misses.
  • Share what you learn. Performance problems and their solutions are valuable organizational knowledge. Document them. Share them across teams. The team dealing with connection pool exhaustion today is probably not the first team to hit that issue.

The Pre-Test Checklist

Before any load test:

  • [ ] Test objectives and pass/fail thresholds defined in writing before execution
  • [ ] Test environment completely isolated from production
  • [ ] Test environment infrastructure matches production configuration (instance types, cache sizes, connection pools, scaling settings)
  • [ ] Test data populated to realistic production scale
  • [ ] Dependent services decided: mock, stub, or real — with rationale documented
  • [ ] Monitoring dashboards active for all components, including load generators
  • [ ] Dependent team on-call contacts notified
  • [ ] Output file naming prevents overwrites between iterations
  • [ ] Previous test results available for comparison

During execution:

  • [ ] Baseline captured before escalating load
  • [ ] Load generator resource utilization verified (not the bottleneck)
  • [ ] Error rates monitored in real time — abnormal errors trigger a pause for investigation
  • [ ] Each step held long enough for metrics to stabilize
  • [ ] Auto-scaling events logged with timestamps

After execution:

  • [ ] Results compared to defined thresholds and previous runs
  • [ ] Anomalies investigated before conclusions are drawn
  • [ ] Root cause documented for any threshold violations
  • [ ] Action items assigned with owners and deadlines
  • [ ] Test results stored in versioned, queryable storage
  • [ ] Environment cleanup completed (test data, log files, temporary resources)

Putting It Together

Any code change can affect performance. A dependency upgrade, a new index, a configuration tweak, a framework version bump — all of these can change memory footprint, CPU usage, throughput, and latency in ways that don’t appear until you run real load. The only reliable way to catch these changes before they affect users is to make performance testing a routine part of how you build and ship software, not something you do once before a big launch.

Start with profiling to understand where time and memory go in your own code. Add load tests to your CI pipeline to catch regressions early. Run soak tests to find memory and connection leaks. Stress test to 10x your expected peak so you know what your ceiling looks like and how you fail when you hit it. Test with real dependency behavior when integration effects matter, and mock dependencies when you want to isolate your own code.

Collect metrics at every layer such as application, infrastructure, and business so you can connect a latency spike to its root cause and quantify its user impact. Store results over time so you can detect gradual regressions before they become incidents. The goal is to know your system well enough that production behavior matches what you measured in testing.

