AWS Certified Generative AI Developer - Professional (AIP-C01) Ultimate Cheat Sheet
Your Quick Reference Study Guide
This cheat sheet covers the core concepts, terms, and definitions you need to know for the AWS Certified Generative AI Developer - Professional (AIP-C01). We've distilled the most important domains, topics, and critical details to help your exam preparation.
💡 Note: While this study guide highlights essential concepts, it's designed to complement—not replace—comprehensive learning materials. Use it for quick reviews, last-minute prep, or to identify areas that need deeper study before your exam.
Foundation Model Integration, Data Management, and Compliance (31%)
Smart Chunking & Provenance Anchors
Split and normalize docs (fixed, semantic, overlapping, adaptive); attach anchors/timestamps for precise retrieval.
Key Insight
Overlap boosts recall but raises cost and duplicate noise; chunk size must preserve semantics and allow provenance tracing.
Common Mistakes
- Assuming more overlap always helps — increases cost and retrieval noise.
- Using tiny chunks that lose context and reduce relevance.
- Skipping source IDs/timestamps — destroys provenance and freshness checks.
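The overlap/provenance tradeoff above can be sketched in a few lines. This is a minimal illustration, not a production chunker; the `source_id` and field names are illustrative placeholders.

```python
def chunk_text(text, chunk_size=200, overlap=50, source_id="doc-001"):
    """Fixed-size chunking with overlap; each chunk carries provenance anchors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({
            "text": piece,
            "source_id": source_id,             # provenance: which document
            "char_start": start,                # provenance: exact span for citation
            "char_end": start + len(piece),
        })
        if start + chunk_size >= len(text):     # last window already covered the tail
            break
    return chunks
```

Note how a 50-token overlap means each boundary region is indexed twice: better recall, but more storage and duplicate hits to dedupe at query time.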
Foundation Model Selection & Sizing
Choose FMs by modality, context window, latency, throughput, cost, license and tuning options to meet accuracy and risk.
Key Insight
Match model size and context window to latency/cost/compliance needs; use RAG or PEFT before escalating to a bigger model.
Common Mistakes
- Defaulting to the largest model — ignores latency, cost, and diminishing returns.
- Skipping retrieval or PEFT and expecting domain accuracy out‑of‑the‑box.
- Treating latency and throughput as the same — optimizing one can hurt the other.
Model Lifecycle & Retirement
Version models, provenance, and compatibility; use canaries, validated rollbacks, and retention policies for safe FM ops.
Key Insight
True reproducibility = model binary + training-data snapshot + metadata + runtime environment, not just a version number.
Common Mistakes
- Assuming a numeric version or binary-only versioning guarantees reproducibility.
- Swapping an alias without canary/validation — causes silent behavior changes in production.
- Treating rollback as redeploying a binary — ignores schema, feature-store, and downstream compatibility.
Amazon Bedrock (Managed FM API Layer)
Managed API access to third‑party and AWS foundation models — handles inference/routing, not vector storage or arbitrary model hosting.
Key Insight
Bedrock exposes provider FMs via API — it is not a vector DB or arbitrary model host; you must implement retries, fallbacks, and governance.
Common Mistakes
- Expecting Bedrock to act as a managed vector DB that stores/indexes your app data for RAG.
- Assuming you can upload and run arbitrary model binaries in Bedrock.
- Thinking Bedrock removes the need for retries, fallbacks, IAM controls, or audit integration.
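Since retries and fallbacks are your responsibility, a provider-agnostic sketch of retry-with-fallback helps; the invoker callables here would wrap bedrock-runtime InvokeModel calls, and the broad `except` is a simplification (production code should catch throttling errors specifically).

```python
import time

def invoke_with_fallback(invokers, prompt, max_retries=2, backoff_s=0.01):
    """Try each model invoker in order; retry transient failures with
    exponential backoff before falling back to the next model."""
    last_err = None
    for invoke in invokers:                      # ordered: primary first, fallback last
        for attempt in range(max_retries + 1):
            try:
                return invoke(prompt)
            except Exception as err:             # simplification; scope this in production
                last_err = err
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all models failed") from last_err
```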
FM Data-Gate: Validation Workflows
Automated structural, semantic, and safety checks plus synthetic tests to catch tokenization, preprocessing, and multimodal errors.
Key Insight
Combine targeted synthetic edge-cases with unit + integration + regression tests: unit finds parser/tokenizer bugs; integration/regression catch drift.
Common Mistakes
- Relying on synthetic data alone to validate real-world FM behavior.
- Using uniform/noise perturbations as adversaries — misses realistic modality corruptions.
- Skipping integration/regression tests because unit tests passed.
Bedrock Data Automation (BDA) — Async Multimodal Extractor
Managed async extractor that converts multimodal unstructured content into structured outputs written to S3; needs post‑processing and validation.
Key Insight
BDA is inference/processing only and writes results to S3 asynchronously — it does NOT train models, act as a low‑latency endpoint, or auto‑index into vector stores.
Common Mistakes
- Expecting BDA to fine‑tune or train foundation models.
- Treating BDA as a low‑latency/synchronous inference endpoint.
- Assuming outputs are auto‑indexed into vector stores or are PII‑clean without validation.
Similarity Metrics & Normalization
Compare and preprocess embeddings—pick metric and norm to trade retrieval quality, index size, and latency.
Key Insight
Dot-product == cosine only if vectors are L2-normalized; choose metric to match embedding geometry and index engine.
Common Mistakes
- Treating dot-product and cosine as always interchangeable — only equal with L2-normalized vectors.
- Applying L2-normalization blindly — it can hurt models that encode useful magnitude information.
- Calling PCA supervised — PCA is unsupervised and preserves variance, not class separability.
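The dot-product/cosine equivalence is worth seeing concretely — a pure-Python sketch:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # cosine similarity = dot product of the L2-normalized vectors
    return dot(l2_normalize(a), l2_normalize(b))
```

On raw vectors the two metrics disagree (dot product scales with magnitude); after L2 normalization they are identical — which is why many index engines store normalized vectors and use the cheaper dot product.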
OpenSearch + Neural Plugin for Bedrock
OpenSearch vector search (knn_vector/dense_vector): index precomputed embeddings, tune engine, space_type, and shard/replica settings.
Key Insight
knn_vector uses k‑NN engines (FAISS/HNSW) and index.knn params; dense_vector needs script scoring—mapping controls metric/latency tradeoffs.
Common Mistakes
- Assuming OpenSearch auto-generates embeddings — embeddings must be produced externally (Bedrock, SageMaker, etc.).
- Treating knn_vector and dense_vector as interchangeable — they require different index settings, plugins, and scoring.
- Thinking more shards always reduce latency — extra shards increase fan‑out and coordination overhead.
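A representative `knn_vector` index body for the OpenSearch k-NN plugin, shown as a Python dict; the dimension (768) must match your external embedding model, and the engine/space_type/parameter values are tuning choices, not defaults to copy blindly.

```python
# Index body for the OpenSearch k-NN plugin; embeddings are produced
# externally (Bedrock, SageMaker, etc.) and indexed precomputed.
knn_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",                 # graph-based ANN
                    "engine": "faiss",
                    "space_type": "l2",             # pick to match embedding geometry
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
            "source_id": {"type": "keyword"},       # provenance metadata beside the vector
        }
    },
}
```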
ANN — HNSW, IVF & Quantization Tuning
Approximate nearest-neighbor indexes (HNSW/IVF/quant) trade tiny recall for big latency and memory gains in RAG.
Key Insight
Tuning is a 3‑way tradeoff: recall vs latency vs memory — adjust efConstruction/efSearch, nprobe/cluster count, and quantization accordingly.
Common Mistakes
- Expecting ANN to always match exact k-NN top-k — small ranking differences are normal.
- Assuming efConstruction only affects build time — poor construction hurts query recall.
- Increasing IVF partitions without raising nprobe can reduce, not improve, recall.
Bedrock KB — Managed Vector Store & Provenance
Bedrock provides a hosted vector store with hierarchical docs and provenance-aware retrieval to ground LLM answers with traceable sources.
Key Insight
Managed vectors speed integration but still require semantic chunking, a metadata schema, and explicit sync to ensure accurate, provable retrieval.
Common Mistakes
- Treating Bedrock's store as a drop‑in replacement for custom sharding or advanced index tuning.
- Assuming hierarchical grouping alone guarantees relevance — chunking and metadata design determine quality.
- Skipping ingestion/sync setup — Bedrock won't auto-sync source systems unless configured.
Prompt Engineering (Templates & Context Windows)
Design, iterate, and validate instruction templates and context flows to produce predictable, testable FM outputs.
Key Insight
Control intent, output schema, and context order — well-structured templates + retrieval beat just longer prompts.
Common Mistakes
- Assuming longer prompts always improve output quality
- Expecting a single prompt to work unchanged across models or contexts
- Believing prompting alone can replace fine-tuning or retrieval augmentation
Hallucination Detection & Mitigation
Detect and reduce fabricated outputs using grounding (RAG), verification chains, provenance, and conservative fallbacks.
Key Insight
Grounding + automated verification (source checks, answer validation, conservative responses) is the primary defense — decoding tweaks alone won't fix hallucinations.
Common Mistakes
- Treating model token-level confidence as factual correctness
- Assuming retrieval guarantees elimination of hallucinations
- Relying on greedy/low-temperature decoding alone to prevent fabrications
Implementation and Integration (26%)
Agentic AI — Router & Orchestrator
Models + routing rules that map intent to tools/agents, coordinate steps, and manage short‑term state.
Key Insight
Routing, state, and validation are distinct responsibilities — good routing sends work, memory preserves context, validation guarantees correctness.
Common Mistakes
- Assuming agents share one implicit global memory — syncing/consistency must be designed explicitly.
- Skipping runtime safeguards — omitting timeouts, circuit breakers, or result validation.
- Swapping rule routing for a learned router without benchmarking latency, cost, and error modes.
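The routing-vs-validation distinction can be made concrete with a minimal rule-based sketch; the intent keywords and agent names are hypothetical.

```python
ROUTES = {                      # hypothetical intent -> agent mapping
    "billing": "billing_agent",
    "search": "retrieval_agent",
}

def route_intent(utterance, routes=ROUTES, default="human_escalation"):
    """Routing decides *where* work goes; unmatched intents hit a safe default."""
    text = utterance.lower()
    for keyword, agent in routes.items():
        if keyword in text:
            return agent
    return default

def validate_result(result, max_len=4000):
    """Validation is a separate responsibility: it decides whether the
    routed agent's output is actually usable."""
    return isinstance(result, str) and 0 < len(result) <= max_len
```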
Bedrock Agents — AWS Managed Orchestration
AWS-managed agent orchestrator (default ReAct) that connects FMs, APIs, and data — integrations required for memory, databases, and guardrails.
Key Insight
Bedrock provides orchestration and connectors, not built‑in persistent memory or automatic production guardrails.
Common Mistakes
- Assuming Bedrock Agents include persistent long‑term memory or a built‑in vector DB.
- Deploying without developer guardrails — skipping timeouts, validation, or circuit breakers.
- Thinking Bedrock is closed‑box and can't call external APIs or third‑party models.
Batching Strategies — Static / Dynamic / Micro / Continuous
Group inference inputs to trade throughput vs per-item latency; pick by SLOs, token variance, and API limits.
Key Insight
Larger batches raise throughput but add queuing and tail latency; use micro/dynamic batching with size/time caps for tight SLOs.
Common Mistakes
- Assuming bigger batches always improve throughput — memory/context or API rate limits often cap gains.
- Believing batching always reduces per-request latency — queuing and tail-latency can increase end-to-end time.
- Ignoring token-length variance and padding — variable lengths inflate compute and ruin throughput estimates.
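A greedy dynamic-batching sketch that enforces both a batch-size cap and a token budget; a production server would add a time cap so short queues still flush quickly.

```python
def build_batches(requests, max_batch=8, max_tokens=1024):
    """Greedy dynamic batching: close the current batch when adding the next
    request would exceed the size cap or the token budget."""
    batches, current, current_tokens = [], [], 0
    for request_id, tokens in requests:      # requests = (id, token_count) pairs
        if current and (len(current) >= max_batch
                        or current_tokens + tokens > max_tokens):
            batches.append(current)          # close batch before it overflows
            current, current_tokens = [], 0
        current.append(request_id)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Token-length variance shows up immediately here: one long request can force a batch of size 1, which is exactly why throughput estimates based on average length mislead.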
Bedrock Provisioned Throughput (Model Units — MUs)
Buy dedicated Bedrock Model Units for guaranteed tokens/minute throughput; intended for sustained, high-volume inference.
Key Insight
Provisioned MUs reserve throughput and are billed hourly per model/region — they guarantee capacity, not fixed per-request latency.
Common Mistakes
- Treating provisioned throughput as the same as on‑demand serverless — provisioned is reserved, not per-call autoscale.
- Expecting billing per API call — MUs are hourly charges (commit terms alter price), not per-invocation fees.
- Assuming provisioned removes latency variability — input size and model compute still cause per-request latency differences.
Vector Stores (RAG Indexes)
Vector DBs for RAG: index build/update, chunking, similarity tuning, access control, and monitoring.
Key Insight
Embedding-model version + chunking choices set retrieval quality — change either and you must reindex or retune.
Common Mistakes
- Swapping embeddings from a new model into an old index without reindexing breaks nearest-neighbor relevance.
- Treating reindexing as instant — plan for long rebuilds; use versioned indices and atomic index swaps.
- Over-chunking to boost recall — tiny chunks fragment context and reduce coherent answers; match chunk size to query/context length.
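The versioned-index/atomic-swap pattern above can be sketched as a tiny in-memory model (it mirrors the alias-swap pattern in OpenSearch and similar engines; the class and index names are illustrative):

```python
class AliasedIndexStore:
    """Versioned indices behind an alias: rebuild off the serving path,
    then repoint the alias in one atomic step."""

    def __init__(self):
        self.indices = {}    # version name -> index contents
        self.alias = None    # the name queries actually resolve to

    def build(self, name, data):
        self.indices[name] = data    # slow rebuild, not yet visible to queries

    def swap(self, name):
        if name not in self.indices:
            raise KeyError(name)
        self.alias = name            # single pointer update = atomic cutover

    def query(self):
        return self.indices[self.alias]
```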
GenAI Security & Governance
Runtime sandboxing, IAM/VPC isolation, KMS/secrets controls, tenant routing, quotas, and telemetry partitioning.
Key Insight
Isolation must be layered: network + IAM + runtime sandbox + telemetry scoping — missing any layer enables leakage.
Common Mistakes
- Relying on namespaces/ACLs alone — logical separation doesn't stop shared-pipeline leaks.
- Assuming encryption prevents prompt-injection or inference-time data exfiltration.
- Creating per-tenant endpoints but sharing logs/metrics — telemetry still mixes tenant data unless partitioned.
Cost‑Aware Model Cascades
Route requests across models (static or dynamic) to balance cost, latency, and output quality.
Key Insight
Start cheap and escalate using calibrated confidence/telemetry — avoid calling every model in parallel to save compute.
Common Mistakes
- Always start with the smallest model — triggers costly fallbacks and higher end-to-end latency.
- Routing purely by per-call token cost — ignores model capability and confidence, causing quality drops.
- Designing dynamic routing that invokes all candidate models in parallel — wastes compute and inflates cost.
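A sequential cascade in miniature — note the tiers are tried in order, never in parallel; each tier is a callable returning `(answer, confidence)`, where the calibrated-confidence source is an assumption of this sketch.

```python
def cascade(prompt, tiers, threshold=0.8):
    """Cost-aware cascade: call the cheapest tier first and escalate only
    when calibrated confidence falls below the threshold."""
    answer, confidence = None, 0.0
    for model in tiers:                  # ordered cheap -> expensive
        answer, confidence = model(prompt)
        if confidence >= threshold:
            break                        # good enough: don't pay for larger models
    return answer, confidence
```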
API Gateway: FM Front Door
Use API Gateway to validate/transform requests, enforce auth/throttling, and normalize errors for Bedrock FMs.
Key Insight
Gateway offloads shaping/early validation and consistent errors, but it doesn't replace backend safety, retries, or fine-grained auth.
Common Mistakes
- Relying on API Gateway to prevent hallucinations or guarantee model input safety.
- Skipping backend validation because transformations ran at the gateway.
- Expecting API Gateway to auto-retry model invocations or fully hide timeouts from clients.
Token Streaming & Backpressure
Send model output token-by-token (SSE/WebSocket/HTTP stream) to lower perceived latency for real-time UIs.
Key Insight
Streaming improves perceived latency, not always total latency — you must frame/reassemble chunks, enforce backpressure, and integrate with gateways.
Common Mistakes
- Assuming streaming reduces total compute or end-to-end latency
- Treating SSE and WebSocket as functionally identical
- Expecting each chunk to be complete JSON — ignoring framing/reassembly
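The framing/reassembly point is easy to demonstrate: streamed chunks split a JSON payload at arbitrary character boundaries, so intermediate buffers fail to parse and only the complete buffer succeeds.

```python
import json

def reassemble_stream(chunks):
    """Buffer streamed text fragments and parse only the full payload.
    Returns the parsed result and how many mid-stream buffers failed."""
    parsed, failed = None, 0
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            parsed = json.loads(buffer)   # succeeds only once complete
        except json.JSONDecodeError:
            failed += 1                   # expected mid-stream; keep buffering
    return parsed, failed
```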
OpenAPI for GenAI (API‑First)
Define FM-facing HTTP/JSON endpoints, schemas and metadata (rate limits, streaming hints) so integrations are consistent.
Key Insight
OpenAPI documents the API surface (and can carry streaming/extensions metadata) but does NOT implement runtime enforcement or reveal model internals.
Common Mistakes
- Believing an OpenAPI file enforces rate limits/auth at runtime
- Thinking OpenAPI describes model internals or training data
- Assuming versioned spec auto-resolves backward compatibility
AI Safety, Security, and Governance (20%)
Bedrock Guardrails (ApplyGuardrails)
Runtime pre/post checks that block, redact, label, and enforce decoupled safety policies on Bedrock model calls.
Key Insight
Guardrails are an enforcement middleware around model calls—not model fine‑tuning—and must be orchestrated with logging, redaction, and human review.
Common Mistakes
- Thinking guardrails change model weights—they only intercept and transform I/O.
- Assuming ApplyGuardrails removes all hallucinations or guarantees accuracy.
- Expecting ApplyGuardrails to auto-provide full audit/compliance records without extra config.
Prompt Injection & Jailbreak Defense
Detect and block attempts to override system instructions — apply layered runtime detectors, context separation, and RBAC.
Key Insight
Injection payloads can come from user input, RAG-retrieved docs, or tool outputs — treat all context as untrusted and enforce provenance and signed tool outputs.
Common Mistakes
- Believing keyword removal or a single regex will stop all injections.
- Only checking external user prompts—ignoring model-generated context, RAG hits, or tool outputs.
- Assuming encrypted logs or 'immutable' system prompts alone prevent runtime exfiltration.
Least‑Privilege for Foundation Models (Bedrock)
Scope Bedrock/FM access with IAM/ABAC, resource policies, and short‑lived scoped tokens to separate inference, tuning, and administration.
Key Insight
AuthN ≠ AuthZ: combine ABAC attributes, resource policies and scoped STS tokens to restrict inference vs customization.
Common Mistakes
- Relying on TLS or a shared API key as 'secure enough' — still need fine‑grained authorization
- Using one broad IAM role or wildcard (e.g., "bedrock:*") across tenants to 'simplify' access
- Assuming ABAC alone removes the need for scoped policy statements or explicit denies
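A scoped policy shape, shown as a Python dict: inference allowed on one specific model, customization explicitly denied. The region, model ID, and Sid values are illustrative placeholders.

```python
# Least-privilege sketch: Allow inference on one model, Deny customization.
least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInvokeOneModel",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/EXAMPLE-MODEL-ID",
        },
        {
            "Sid": "DenyCustomization",
            "Effect": "Deny",    # an explicit Deny overrides any Allow
            "Action": ["bedrock:CreateModelCustomizationJob"],
            "Resource": "*",
        },
    ],
}
```

Contrast this with `"bedrock:*"` on `"*"`: the explicit Deny plus a narrow Resource ARN is what actually separates inference from customization.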
IAM for GenAI: Roles, Policies & MFA
Apply least‑privilege JSON policies, use short‑lived roles/STS, federation for users, and MFA for humans in GenAI flows.
Key Insight
Policy eval rules matter: permissions union; explicit Deny overrides. Use role separation, STS sessions and SCPs for boundaries.
Common Mistakes
- Using the root account or long‑lived IAM user keys for routine tasks
- Assuming MFA changes authorization or reduces granted permissions
- Thinking multiple attached policies conflict — they combine; explicit Deny still beats Allows
Forensic Traceability — Hash Chains, Merkle, WORM & KMS
Provable, append-only records of prompts/interactions using hash chains, Merkle proofs, signatures, WORM storage, and KMS.
Key Insight
Integrity requires cryptographic anchors + secure key separation + independent notarization; storage alone isn't proof.
Common Mistakes
- Assuming raw, unredacted prompts are safe to store for debugging or compliance.
- Believing a single S3 bucket automatically makes logs immutable and tamper-proof.
- Thinking encryption or a hash chain alone proves integrity without key separation or notarization.
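The hash-chain idea above in miniature: each entry commits to the previous entry's hash, so tampering with any record breaks every later link. (Key separation, signatures, and notarization are still needed on top, as the insight says.)

```python
import hashlib
import json

GENESIS = "0" * 64

def append_record(chain, record):
    """Append an entry that commits to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(record, sort_keys=True)   # canonical serialization
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"record": record, "prev": prev_hash, "hash": entry_hash})
    return chain

def verify_chain(chain):
    """Recompute every link; any tampered record invalidates the chain."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```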
CloudTrail — API & Config Audit Trail
AWS service that records API calls and config changes; enable data events to capture S3/Lambda and FM-related activity.
Key Insight
CloudTrail records and retains events but doesn't enforce actions; data events and full payload capture must be explicitly enabled.
Common Mistakes
- Expecting CloudTrail to include full request/response payloads by default.
- Relying on CloudTrail to prevent or block unauthorized actions.
- Assuming data-plane events (S3/Lambda object details) are recorded without enabling them.
Data Masking & Privacy Tech (DP, SMPC, HE)
Pick masking, differential privacy, or crypto (SMPC/HE) by data type — balance re‑id risk vs utility.
Key Insight
Noise must be calibrated to sensitivity and ε,δ; cryptographic methods protect computation/privacy but add cost and limit analytic utility.
Common Mistakes
- Assuming any added noise = differential privacy — noise must match sensitivity and ε,δ.
- Believing DP eliminates all re‑identification risk — it bounds worst‑case leakage, not absolute safety.
- Treating masking/token removal as the same as DP — masking hides fields; DP provides statistical privacy guarantees.
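The calibration point — noise scale is fixed by sensitivity and ε, not chosen ad hoc — can be shown with the Laplace mechanism (pure ε-DP; the δ term applies to Gaussian-style mechanisms not shown here):

```python
import random

def laplace_noise(sensitivity, epsilon):
    """Sample Laplace(0, b) with scale b = sensitivity / epsilon.
    |X| for a Laplace(0, b) variable is Exponential with mean b."""
    b = sensitivity / epsilon
    magnitude = random.expovariate(1.0 / b)
    return magnitude if random.random() < 0.5 else -magnitude

def dp_count(true_count, epsilon):
    """A counting query has sensitivity 1, so the noise scale is 1/epsilon --
    that calibration is what makes this differential privacy, not the mere
    presence of noise."""
    return true_count + laplace_noise(1.0, epsilon)
```

Smaller ε means a larger noise scale and stronger privacy; the utility cost is explicit in `b = sensitivity / epsilon`.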
Grounding & Source Attribution (RAG Provenance)
Ensure outputs link to retrieved passages, citations and retrieval metadata so answers are verifiable and auditable.
Key Insight
A citation or URL alone isn't proof — surface the supporting passage, retrieval context, and metadata (score, timestamp, doc id).
Common Mistakes
- Assuming any citation or URL guarantees the answer is correct.
- Thinking grounding removes all hallucinations so human review isn't needed.
- Relying on a high retrieval score alone as proof that a source is authoritative.
Operational Efficiency and Optimization for GenAI Applications (12%)
Inference Deployment Patterns (SageMaker & Bedrock)
Map real‑time, async/queue, serverless and multi‑model patterns to SageMaker/Bedrock using latency vs cost tradeoffs.
Key Insight
Pick by traffic shape: steady high‑QPS → dedicated real‑time; bursty/low‑QPS → serverless or async; MMEs save on model loads but don't remove GPU/memory limits.
Common Mistakes
- Assuming MMEs are always cheaper — ignores request frequency, model load latency and caching
- Believing MMEs remove GPU/memory limits — instance sizing, sharding, and model size still constrain you
- Treating asynchronous inference as streaming/real‑time — async adds queueing and higher end‑to‑end latency
Token Accounting & Tracking
Log and reconcile input + output + system tokens per request with the model tokenizer, and monitor aggregated trends for cost drift.
Key Insight
Billing = input + output + hidden/system tokens; use the model's tokenizer server‑side and time‑series alerts to catch drift and leaks.
Common Mistakes
- Estimating tokens from character count — tokenizer rules (byte‑pair) differ widely
- Trusting client‑side estimates without server reconciliation — billed tokens may differ
- Ignoring output and system tokens — they can be a large portion of cost
Token Window & Quota Control (TPM/RPM)
Measure and control every token (system, assistant, retrieved) to meet context windows, cost, and quota SLAs.
Key Insight
All messages and retrieved context consume tokens — use the model tokenizer to count, then truncate, compress, chunk, cache, or stream to fit the context window.
Common Mistakes
- Treating character count as tokens — always measure with the target model's tokenizer.
- Over-compressing/truncating context and losing required facts — test output quality after each reduction.
- Caching prompts without versioning/validation — caches go stale or leak private data.
Model & Infra Right-Sizing (GPU, Inferentia, Graviton)
Match model variant and accelerator to SLAs: benchmark accuracy vs latency/cost, then optimize with quantization and compilation.
Key Insight
The cheapest/fastest real-world choice is empirical — benchmark model variants on the target instance/accelerator, then use quantization/compilation and right‑size instances.
Common Mistakes
- Deploying the largest model by default — may break latency and cost SLAs without benchmark data.
- Choosing hardware by peak FLOPS only — memory bandwidth, drivers, and kernel support change real latency.
- Assuming batching always lowers latency — batching can increase per-request and tail latency if misused.
GenAI Observability — Tokens, Traces & SLOs
Collect traces, metrics, and token-aware telemetry (time-to-first-token, per-token latency/cost, quality) tied by causal IDs for SLO-led alerting.
Key Insight
Correlate prompt → tokens → response with causal IDs; instrument model latency, time-to-first-token, per-token latency, hallucination rates, and token‑level costs.
Common Mistakes
- Logging prompts alone — that won't reconstruct flows; use traces, causal IDs, and metrics.
- Equating SLOs with SLAs — alert on measurable SLOs and error budgets, not legal SLAs.
- Watching only latency/availability — misses quality and token-cost signals (hallucination, relevance, per-token cost).
Vector DB Monitoring — Latency, Recall & Index Health
Track p50/p95/p99 latency, QPS, embedding-similarity distributions, recall@k/MRR/nDCG, freshness, and ingestion/compaction health.
Key Insight
Low latency ≠ good retrieval — combine operational metrics (p95/p99, compaction, replication, ingestion lag) with retrieval quality (recall@k, MRR, nDCG).
Common Mistakes
- Only monitoring query latency — ignoring recall, freshness, and embedding drift.
- Assuming high similarity scores guarantee correct answers — ranking and context matter.
- Believing an existing index equals a healthy one — check staleness, compaction, and replication status.
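The core quality metric is small enough to compute inline — recall@k against a labeled relevance set:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of truly relevant documents that appear in the top-k results.
    Monitor alongside p95/p99 latency: a fast index can still return the
    wrong documents."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)
```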
Testing, Validation, and Troubleshooting (11%)
Responsible Model Validation (CI + H2H)
CI-driven automated + human-in-the-loop tests for quality, safety, hallucination checks, and controlled rollouts.
Key Insight
Automated metrics catch numeric regressions; human review and adversarial/synthetic tests expose hallucinations and safety failures.
Common Mistakes
- Relying only on automated metrics to declare production readiness.
- Treating p < 0.05 as proof of business impact without effect-size/context.
- Relying only on unit tests; skipping integration/canary and human review.
Drift Detection & Remediation
Continuously detect input, concept, and performance drift via stats, embedding divergence, telemetry, and triggerable SLO alerts.
Key Insight
Correlate embedding/statistical shifts with labeled performance and infra telemetry to separate transient anomalies from real drift.
Common Mistakes
- Retraining immediately on any statistical shift without impact analysis.
- Monitoring inputs only; ignoring outputs and labeled performance.
- Treating a single spike as persistent drift; not correlating with infra or user-change signals.
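One common statistical-shift detector is the Population Stability Index (PSI); a sketch for a bounded score/feature, with the usual rule-of-thumb thresholds noted as assumptions to tune per feature (< 0.1 stable, 0.1–0.25 moderate, > 0.25 investigate before retraining):

```python
import math

def psi(baseline, live, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index between a baseline and a live sample."""
    width = (hi - lo) / bins

    def proportions(xs):
        counts = [0] * bins
        for x in xs:
            idx = min(bins - 1, max(0, int((x - lo) / width)))
            counts[idx] += 1
        return [max(c / len(xs), eps) for c in counts]   # eps avoids log(0)

    base_p, live_p = proportions(baseline), proportions(live)
    return sum((lp - bp) * math.log(lp / bp)
               for bp, lp in zip(base_p, live_p))
```

Per the insight above, a high PSI alone is a trigger for investigation (correlate with labeled performance and infra telemetry), not an automatic retrain.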
RAG (Retrieval‑Augmented Generation): Index→Embed→Ground
End-to-end retrieval + grounding: index, chunk, embed, and trace to locate and stop hallucinations.
Key Insight
Hallucinations often originate in retrieval—indexing, chunking, embedding model or similarity params—not only the LLM.
Common Mistakes
- Blaming the LLM first — skipping inspection of index quality and retrieval logs.
- Defaulting to larger chunks — oversized chunks dilute context and reduce relevance.
- Treating retrieved sources as authoritative — citations ≠ correctness.
Prompt Governance: Versioning, Testing & Rollouts
Manage prompts like code: version, template, test, stage rollouts, and monitor metrics to prevent regressions.
Key Insight
Treat prompts as deployable artifacts—use CI/CD, unit/regression tests and canary/A‑B rollouts to trace regressions to wording.
Common Mistakes
- Treating prompts as informal — skipping versioning and approvals.
- Hot-fixing tiny wording changes in prod without tests or a canary rollout.
- Relying only on raw I/O logs — skipping structured tests and metrics for regressions.
Similar Cheat Sheets
- CCNA Exam v1.1 (200-301) Cheat Sheet
- AWS Certified Cloud Practitioner (CLF-C02) Cheat Sheet
- AWS Certified AI Practitioner (AIF-C01) Cheat Sheet
- Exam AI-900: Microsoft Azure AI Fundamentals Cheat Sheet
- Google Cloud Professional Cloud Architect Cheat Sheet
- Google Cloud Security Operations Engineer Exam Cheat Sheet