Google Cloud Professional Cloud Architect Ultimate Cheat Sheet
Your Quick Reference Study Guide
This cheat sheet covers the core concepts, terms, and definitions you need to know for the Google Cloud Professional Cloud Architect exam. We've distilled the most important domains, topics, and critical details to help your exam preparation.
💡 Note: While this study guide highlights essential concepts, it's designed to complement—not replace—comprehensive learning materials. Use it for quick reviews, last-minute prep, or to identify areas that need deeper study before your exam.
About This Cheat Sheet: This study guide covers core concepts for Google Cloud Professional Cloud Architect. It highlights key terms, definitions, common mistakes, and frequently confused topics to support your exam preparation.
Use this as a quick reference alongside comprehensive study materials.
Designing and Planning a Cloud Solution Architecture (25%)
TCO — Full Lifecycle Cost
Estimate all lifecycle costs — migration, licensing, infra, ops, training, downtime — to compare architectures and make an informed choice.
Key Insight
TCO = upfront + ongoing + hidden costs (migration, training, downtime). Lower initial spend can raise lifecycle cost.
Common Mistakes
- Counting only upfront cloud bills or CapEx and ignoring migration, ops, training, and downtime.
- Assuming lift-and-shift always yields lower TCO than refactor or platform-native redesign.
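The TCO insight above is simple arithmetic; a toy comparison (every figure hypothetical) shows how a lower upfront bill can still lose over the lifecycle:

```python
# Toy 3-year TCO comparison; every figure here is hypothetical.
def tco(upfront, monthly_ops, migration, training, downtime, months=36):
    """TCO = upfront + ongoing + hidden costs over the planning horizon."""
    return upfront + monthly_ops * months + migration + training + downtime

# Lift-and-shift: cheap to start, higher run-rate (unoptimized VMs).
lift_shift = tco(upfront=10_000, monthly_ops=8_000, migration=20_000,
                 training=5_000, downtime=2_000)
# Refactor: bigger upfront engineering spend, lower managed-service run-rate.
refactor = tco(upfront=80_000, monthly_ops=4_500, migration=40_000,
               training=15_000, downtime=1_000)

print(lift_shift, refactor)  # 325000 298000: refactor wins despite 8x upfront
```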
Scalability & Performance Targets
Quantified load, concurrency, latency and throughput targets that drive autoscaling, partitioning, caching and resource selection.
Key Insight
Translate SLA numbers into autoscale thresholds, partitioning/sharding and cache tiers — trade-offs exist between latency, throughput, consistency.
Common Mistakes
- Treating autoscaling as a substitute for capacity planning and load/performance testing.
- Blaming latency solely on the network instead of app design, caching, or storage choices.
- Provisioning for peak as 'typical' instead of using elasticity and cost-aware scaling policies.
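A sketch of "translate SLA numbers into autoscale thresholds", using made-up load-test figures:

```python
import math

# Hypothetical numbers: turn an SLA into concrete autoscale settings.
peak_rps = 1_200          # load-test measured peak requests/sec
per_instance_rps = 150    # sustainable rate per instance at target p95 latency
headroom = 0.30           # spare capacity so scaling lag doesn't breach the SLO

max_instances = math.ceil(peak_rps * (1 + headroom) / per_instance_rps)
# Scale out before instances saturate: target ~77% of sustainable utilization.
target_utilization = 1 / (1 + headroom)

print(max_instances, round(target_utilization, 2))  # 11 0.77
```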
Inference Modes: Batch vs Online vs Cache
Choose online for low‑latency SLAs, batch for high throughput/cost savings, caching/hybrid for repeated or bursty loads.
Key Insight
Match SLA to traffic: provisioned online (Vertex AI Endpoints) for steady low latency; Vertex AI Batch Prediction for throughput; add caching (Memorystore) for repeated requests.
Common Mistakes
- Assuming autoscaling removes cold starts — model load and I/O still cause latency spikes.
- Writing off batch as 'too slow' — micro‑batches/streaming and precompute can meet near‑real‑time needs.
- Assuming serverless endpoints always beat VMs — cold starts and platform limits impact latency and cost.
HA & Failover: Patterns and RTO/RPO Tradeoffs
Design multi‑zone/region redundancy, global LBs, and replicated storage (Spanner/Cloud SQL HA/Cloud Storage) with tested failover runbooks.
Key Insight
HA = redundancy + orchestrated failover: active‑active for minimal RTO, active‑passive to save cost; data layer choices (sync vs async) set RPO.
Common Mistakes
- Spreading VMs across zones but keeping a single‑region DB — you still have a single point of failure.
- Treating more replicas as instant recovery — ignores replication lag and promotion orchestration.
- Relying on backups as HA — restores are manual with high RTO, not automatic failover.
VPC Design (Virtual Private Cloud)
Plan subnets/IPs, routing, peering/Shared VPC, hybrid links and firewall zones to meet connectivity and security.
Key Insight
Design IP addressing and routing first: peering isn't transitive, Shared VPC centralizes control, and VPN/Interconnect don't replace firewalls.
Common Mistakes
- Assuming VPC peering is transitive and will route via a third VPC.
- Treating cloud firewalls as stateless — GCP firewalls track sessions (return traffic allowed).
- Believing VPN/Interconnect provide app-layer segmentation so you can skip firewall policies.
VPC Network Peering (Private Backbone)
Private, high‑bandwidth internal link between VPCs using Google’s backbone; exchanges routes directly but has topology and DNS limits.
Key Insight
Peering exchanges routes but is non‑transitive, rejects overlapping CIDRs, and does not provide private DNS or internet/on‑prem transit.
Common Mistakes
- Treating peering as transitive (A→B and B→C ⇒ A→C).
- Expecting automatic private DNS name resolution across peered VPCs.
- Attempting peering with overlapping IP ranges — routes will be rejected.
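Since peering rejects overlapping ranges, a pre-flight check with Python's stdlib `ipaddress` is cheap insurance:

```python
# Peering rejects overlapping ranges; a pre-flight check with stdlib ipaddress.
import ipaddress

def overlaps(cidr_a: str, cidr_b: str) -> bool:
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

print(overlaps("10.0.0.0/16", "10.0.128.0/17"))  # True -> peering would fail
print(overlaps("10.0.0.0/16", "10.1.0.0/16"))    # False -> safe to peer
```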
Multicloud Integration: Data Gravity & Trust
Place and operate workloads across on‑prem and clouds by balancing data gravity, identity, networking, and migration trade‑offs.
Key Insight
Data gravity dictates placement: keep compute near large datasets, federate identity, and minimize cross‑cloud egress.
Common Mistakes
- Assuming identical ML pipelines run unchanged across clouds
- Treating 'retire' as immediate deletion without retention/compliance check
- Believing one integration approach fits all workloads
Anthos — Hybrid & Multicloud Kubernetes
Anthos provides a consistent Kubernetes control plane, policy, and lifecycle across on‑prem and other clouds.
Key Insight
Anthos centralizes control and policy but does not remove node/hardware ops; use it to run workloads where data lives, not to magically move data.
Common Mistakes
- Treating Anthos as fully managed GKE with no infrastructure operations
- Assuming Anthos automatically migrates or replicates data to GCP
- Expecting Cloud Run on Anthos to behave identically to fully managed Cloud Run
Compute Platform Selection — GKE • Cloud Run • App Engine • Functions • VMs
Map workload traits to GCP compute: VMs for control, GKE for containers, Cloud Run/App Engine/Functions for managed ops.
Key Insight
Trade control vs. managed: Compute Engine = max control; GKE = orchestrated containers; Cloud Run = stateless containers; App Engine = opinionated PaaS; Cloud Functions = event-driven code.
Common Mistakes
- Assuming modernization requires a full rewrite — use Strangler Fig for incremental moves
- Treating managed services as zero‑ops; they still need integration, config, and can lock you in
- Equating lift‑and‑shift with cloud‑native; misses operational, scaling, and cost tradeoffs
Data Migration & Schema Evolution — CDC, Dual‑Write, Expand→Contract
Move and evolve data with minimal downtime: CDC, dual‑write, expand‑then‑contract, backfills, schema registries and cut‑over plans.
Key Insight
Design as expand‑then‑contract + schema versioning + automated reconciliation; assume transient inconsistency and plan rollback
Common Mistakes
- Treating dual‑writes as automatically consistent — they introduce drift and need reconciliation
- Assuming CDC guarantees cross‑system transactional consistency
- Skipping rollback or verification because staged cutover 'should' be safe
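Because dual-writes drift, a reconciliation pass is part of the design, not an afterthought. A minimal sketch, with plain dicts standing in for the old and new databases (all names illustrative):

```python
# Dual-writes drift; a minimal reconciliation pass comparing row checksums.
# Plain dicts stand in for the old and new databases here.
import hashlib

def checksum(row: dict) -> str:
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source: dict, target: dict) -> dict:
    missing = [k for k in source if k not in target]
    drifted = [k for k in source
               if k in target and checksum(source[k]) != checksum(target[k])]
    return {"missing": missing, "drifted": drifted}

old_db = {1: {"email": "a@x.com"}, 2: {"email": "b@x.com"}, 3: {"email": "c@x.com"}}
new_db = {1: {"email": "a@x.com"}, 2: {"email": "B@x.com"}}  # drift + gap

report = reconcile(old_db, new_db)
print(report)  # {'missing': [3], 'drifted': [2]}
```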
Managing and Provisioning Infrastructure (18%)
Hybrid Connectivity — Interconnect, HA‑VPN & Cloud Router
Link on‑prem/multi‑cloud: Interconnect for bandwidth/SLA, HA‑VPN for encrypted failover, Cloud Router for BGP.
Key Insight
Physical links (Dedicated/Partner Interconnect) give capacity & SLA; Cloud Router only exchanges BGP routes — it doesn't add redundancy or bandwidth.
Common Mistakes
- Assuming Partner Interconnect is always cheaper or lower latency than Dedicated.
- Relying on Cloud Router for physical redundancy — it's a routing control plane only.
- Picking VPN because it's 'cheapest' while ignoring sustained throughput, latency, SLA, and the operational cost of many tunnels.
Subnet & IP Design — VPC, GKE Pods/Services
Plan CIDRs for VPCs, nodes, pods and services; reserve GKE secondary ranges and prevent overlaps across projects/VPCs.
Key Insight
Primary subnet CIDR is effectively immutable for planning; use reserved secondary ranges for GKE and avoid overlaps — overlaps break peering and complicate migrations.
Common Mistakes
- Assuming alias IPs auto‑resolve cross‑project or cross‑VPC CIDR overlaps.
- Treating ClusterIP addresses as coming from the node subnet — they come from the service CIDR.
- Thinking subnet size only affects IP count — it also affects peering, routing, firewall rules, and migrations.
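Back-of-envelope pod-range sizing, assuming a VPC-native cluster at GKE's default of 110 max pods per node (where GKE carves a /24 per node out of the pod secondary range):

```python
# Back-of-envelope GKE IP planning (VPC-native clusters).
# Assumption: default max 110 pods/node, so GKE carves a /24 per node
# out of the pod secondary range.
import ipaddress

pod_range = ipaddress.ip_network("10.4.0.0/14")   # secondary range for pods
per_node = ipaddress.ip_network("10.4.0.0/24").num_addresses  # 256 addresses

max_nodes = pod_range.num_addresses // per_node
print(max_nodes)  # 1024 nodes before the pod range is exhausted
```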
BigQuery — ML Canonical Store
Structured, analytical store for ML: partition/cluster, SQL feature engineering, and Vertex AI batch I/O.
Key Insight
Partition+cluster to cut scan costs; ideal for batch training/feature joins — not for sub‑100ms online lookups.
Common Mistakes
- Believing batch predictions can only write to Cloud Storage — BigQuery can be a direct prediction target.
- Assuming streaming inserts are zero-latency or unlimited — they incur latency and are quota‑controlled.
- Using BigQuery for low‑latency online prediction/lookup — it's analytical, not an OLTP or real‑time KV store.
GCS — Model Artifacts & Batch Storage
Durable object storage for model artifacts: choose storage class + lifecycle + versioning for cost and compliance.
Key Insight
Pick storage class for access pattern and use lifecycle to automate cost cuts; versioning/retention holds block deletions.
Common Mistakes
- Trying to delete noncurrent versions without bucket versioning — versioning must be enabled.
- Expecting POSIX-style, low-latency behavior for many small reads/writes — object storage suits large objects/sequential I/O.
- Assuming lifecycle rules run instantly or override holds — lifecycle is asynchronous and cannot bypass retention holds/locks.
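A lifecycle policy sketch in the JSON shape accepted by `gsutil lifecycle set policy.json gs://BUCKET` (bucket name and rule values illustrative):

```python
import json

# Lifecycle policy sketch; thresholds are illustrative, tune per workload.
policy = {
    "rule": [
        # Demote artifacts older than 30 days to a colder storage class.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        # Delete noncurrent versions 90 days after they are replaced
        # (requires versioning to be enabled on the bucket).
        {"action": {"type": "Delete"},
         "condition": {"daysSinceNoncurrentTime": 90}},
    ]
}

print(json.dumps(policy, indent=2))
```

Note the second rule silently does nothing unless bucket versioning is on, echoing the first mistake above.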
Accelerator Choice — CPU, GPU, TPU, Edge
Pick CPU/GPU/TPU/edge based on model precision, memory/IO, latency, throughput, and provisioning cost limits.
Key Insight
Match the bottleneck: FLOPS-bound → GPU/TPU; memory/IO-bound → more RAM or better interconnect; latency-sensitive → edge/quantized models.
Common Mistakes
- Assuming adding accelerators always gives linear speedup and lower cost.
- Treating TPUs as drop-in GPU replacements without code/XLA/op changes.
- Ignoring memory, network and IO limits — they can dominate latency and cost.
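The "match the bottleneck" insight as a roofline-style check; the hardware numbers below are illustrative, not vendor specs:

```python
# Roofline-style bottleneck check: compute-bound vs memory-bound.
# Both hardware numbers below are illustrative, not vendor specs.
peak_flops = 100e12        # accelerator peak throughput, FLOP/s
mem_bandwidth = 1.5e12     # memory bandwidth, bytes/s

ridge_point = peak_flops / mem_bandwidth   # FLOP/byte where the bound flips
model_intensity = 12.0     # measured FLOP per byte moved for this model

bound = "compute" if model_intensity > ridge_point else "memory"
print(round(ridge_point, 1), bound)  # below the ridge: more FLOPS won't help
```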
Cloud Load Balancers — Type, Scope, Health
Choose HTTP(S)/TCP/UDP and global vs regional LB, wire health checks to autoscalers, and weigh CDN/SSL trade-offs for HA
Key Insight
Global LB provides anycast IP and cross-region failover but NOT automatic app-state replication; scope choice must consider latency, compliance, and DR.
Common Mistakes
- Expecting perfectly even traffic distribution — weights, capacity and proximity affect routing.
- Assuming a global LB obviates cross-region data replication or DR configuration.
- Using only TCP/ICMP health checks and missing HTTP(S) endpoint or status-code validations.
Vertex AI Pipelines — Compile, Submit, Schedule
Managed service to compose, compile (YAML/JSON), submit and schedule end-to-end ML pipelines on Vertex.
Key Insight
A compiled pipeline is an artifact, not a running job — you must submit it and configure per-step resources, retries, and retention.
Common Mistakes
- Assuming a compiled pipeline auto-runs without submission to Vertex.
- Expecting automatic per-step scaling without specifying machine types/worker pools.
- Believing pipelines must run on a separate Kubeflow cluster and can't call CustomJobs.
Dataflow (Apache Beam): Batch & Streaming Prep
Cloud Dataflow is the managed Apache Beam runner for scalable batch/stream transforms with BigQuery, GCS, Pub/Sub, Bigtable and other sources/sinks.
Key Insight
Beam = SDK/model; Dataflow = managed runner — choose windowing, triggers, connectors and autoscaling settings to meet latency and consistency SLAs.
Common Mistakes
- Thinking Dataflow is streaming-only; it also handles optimized bounded (batch) jobs.
- Believing you must write raw Beam code every time — templates, SQL, and Flex options exist.
- Relying on autoscaling alone for bursts — use windowing, triggers and backpressure controls.
Vertex AI Prediction: Online vs Batch
Managed serving: low‑latency online endpoints for real‑time; batch jobs (GCS I/O) for high‑throughput offline inference.
Key Insight
Pick by SLA: online for sub‑second user requests; batch for cost‑efficient bulk inference that tolerates minutes/hours delay.
Common Mistakes
- Expecting batch jobs to meet sub‑second, user‑facing latency
- Assuming online and batch have identical cost, SLA, and performance profiles
- Ignoring endpoint cold‑starts and autoscaler warm‑up delays
Vertex AI Prebuilt APIs — Use & Tradeoffs
Managed multimodal ML APIs for fast integration — tradeoffs: cost, latency, rate limits, and domain accuracy.
Key Insight
Fastest to deploy but always model cost/latency/throughput, secure auth/data flow, and plan fallbacks for domain gaps.
Common Mistakes
- Assuming prebuilt APIs run locally or on‑prem by default
- Skipping call‑volume, payload size, or feature‑choice cost modelling
- Treating remote API latency, rate limits, and cold‑starts as negligible
Security and Compliance (18%)
IAM — Roles, Service Accounts & Least Privilege
Control which principals do what on GCP resources; use scoped roles, service accounts, and Workload Identity to enforce least privilege.
Key Insight
Grant the minimal role at the narrowest scope; prefer predefined/custom roles + Workload Identity over primitive roles or long‑lived keys.
Common Mistakes
- Assuming a custom role created in one project is automatically available org‑wide.
- Treating predefined roles as always least‑privilege; many include extra permissions.
- Using long‑lived service account keys for GKE instead of Workload Identity.
ML Threats — Data, Model & Supply‑Chain Attacks
Attacks can poison data, extract or steal models, or exploit CI/CD/runtime; defend via access control, KMS, monitoring, and provenance.
Key Insight
Threats span the entire ML lifecycle — prevention + provenance + detection are required; access controls alone don’t stop extraction or insider/supply‑chain attacks.
Common Mistakes
- Believing adversarial attacks only affect image models; text, tabular, and time‑series are vulnerable.
- Thinking encrypting model artifacts prevents poisoning; poisoning occurs in training/supply chain before encryption.
- Assuming anonymization or strong access controls alone prevent model leakage or extraction.
Cloud DLP (Data Loss Prevention) — PII Obscuring
Detect, classify and transform PII on GCP (redact, mask, tokenize, pseudonymize) chosen by re‑ID risk.
Key Insight
Pseudonymization keeps re‑ID paths; anonymization aims to break them irreversibly — pick technique by legal risk, re‑ID likelihood, and control of key material.
Common Mistakes
- Treating pseudonymization as irreversible anonymization
- Assuming enabling DLP alone prevents data exfiltration (no access, logging, perimeter controls)
- Believing aggregation/k‑anonymity guarantees zero re‑identification risk
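Why pseudonymization is not anonymization, sketched with a keyed token (key handling is simplified for illustration; a managed analogue exists in DLP's crypto-based transforms). Whoever holds the key can re-link tokens to people, which is exactly the retained re-ID path:

```python
# Pseudonymization sketch: deterministic keyed tokens for PII.
# Anyone holding the key can re-link tokens to people, so this is
# pseudonymization, NOT anonymization.
import hashlib
import hmac

SECRET_KEY = b"example-key-store-in-secret-manager"  # placeholder only

def pseudonymize(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

t1 = pseudonymize("alice@example.com")
t2 = pseudonymize("alice@example.com")
print(t1 == t2)  # True: same input -> same token, so joins still work
```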
ML Privacy: PII/PHI/PCI Controls
Apply classification, minimization, de‑id, DP, tokenization, encryption and strict model access to stop sensitive‑data leakage.
Key Insight
Models can memorize and leak training records — combine de‑identification, differential privacy, access controls, audit logs and validation to reduce leakage.
Common Mistakes
- Assuming de‑identification permanently removes regulatory obligations
- Believing synthetic data eliminates all privacy risk
- Thinking trained models can't leak PII so no model access controls are needed
Analyzing and Optimizing Processes (15%)
CI/CD Pipelines — Gated, Immutable Releases
Automate build→test→release with immutable artifacts, environment gates, security scans, and rollback plans.
Key Insight
CI produces signed immutable artifacts; CD must promote those through gated stages (tests, canary, approvals) not rebuild per env.
Common Mistakes
- Believing that installing a CI tool equals CI/CD — process, tests, and gating must be designed too.
- Treating Continuous Delivery as auto-deploy to prod — require explicit gating/approvals for production.
- Using one pipeline template for all services, ignoring per-service tests, quotas, and rollback needs.
Model & Data Lineage — Reproducible Provenance
Capture end-to-end provenance (data, transforms, code, params, checksums) so models and datasets are reproducible and auditable.
Key Insight
Lineage = directed causal graph linking inputs, transforms, runs, and artifacts; snapshots (versioning) alone don't show causality.
Common Mistakes
- Assuming cloud auto-captures complete lineage — you must instrument pipelines and record hashes, params, and run IDs.
- Storing raw file copies only — missing schema, transform params, and code hashes breaks reproducibility.
- Believing lineage metadata alone ensures compliance — combine with IAM, retention, and signed evidence.
Autoscaling Patterns — MIGs & GKE HPA/VPA
Pick horizontal vs vertical scaling; use MIGs/HPA/VPA plus forecasting, headroom, triggers, cooldowns to meet SLOs.
Key Insight
Reserve forecasted base capacity (procure/reserve) and handle spikes with reactive HPA/VPA; always add headroom and cooldowns to protect SLOs.
Common Mistakes
- Assuming horizontal scaling always outperforms vertical resizing
- Scaling to observed peak with no headroom or variability analysis
- Believing HPA only uses CPU metrics
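The HPA scaling decision is one formula, and it applies to any metric, not just CPU:

```python
import math

# Kubernetes HPA core formula:
# desired = ceil(current_replicas * current_metric / target_metric)
def hpa_desired(current_replicas: int, current_metric: float,
                target_metric: float) -> int:
    return math.ceil(current_replicas * current_metric / target_metric)

# CPU at 90% against a 60% target: scale 4 -> 6 replicas.
print(hpa_desired(4, 90, 60))    # 6
# Works for custom metrics too, e.g. queue depth per replica.
print(hpa_desired(3, 400, 100))  # 12
```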
FinOps: Cost ↔ Capacity Trade-offs
Balance cost, capacity, availability and performance with budgeting, forecasting, labels, showback/chargeback and policy
Key Insight
Match discounts and procurement to stable patterns, use autoscaling for variability, and enforce labels + showback to tie spend to outcomes.
Common Mistakes
- Assuming autoscaling always lowers cloud costs
- Buying CUDs without verifying steady usage windows
- Treating labels and tagging as optional bookkeeping
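The CUD break-even check behind "verify steady usage windows", with illustrative prices (not Google's published rates):

```python
# Break-even check before buying a 1-year committed use discount (CUD).
# Prices and discount rate are illustrative, not Google's published rates.
on_demand_hourly = 0.10
cud_discount = 0.37                 # fraction off the on-demand rate
cud_hourly = on_demand_hourly * (1 - cud_discount)

# A commitment bills every hour of the term whether the VM runs or not,
# so it only saves money above this steady-utilization floor:
breakeven_utilization = cud_hourly / on_demand_hourly    # = 1 - cud_discount

print(round(breakeven_utilization, 2))  # 0.63 -> need >63% steady usage
```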
Managing Implementation (12%)
IaC — Versioned, Reproducible Infra
Define cloud infrastructure in versioned code to provision, audit, and reproduce environments automatically.
Key Insight
Idempotent declarations + remote state and locking = safe multi‑env rollouts, plan reviews, and drift detection.
Common Mistakes
- Assuming IaC always means declarative; some tools are imperative with different guarantees
- Checking secrets into code or state files instead of using a secrets manager
- Skipping plan/review/CI and applying changes directly to production
Blue‑Green Releases — Swap Whole Envs
Run two production-identical environments and flip traffic to deploy or rollback with near-zero downtime.
Key Insight
Blue‑green swaps entire environments — it's fast for stateless services but requires DB/compatibility strategies for stateful systems.
Common Mistakes
- Underestimating the cost of running duplicate production infrastructure
- Assuming rollback is trivial when databases or external state are involved
- Forgetting session affinity, connection draining, and health checks when switching traffic
Terraform IaC — GCP Remote State & Least-Privilege
Declarative provisioning with Terraform on GCP; manage remote state, per-workspace SAs, and secret handling for safe ops
Key Insight
State is the source-of-truth and may contain secrets—use encrypted remote state, locking, per-workspace service accounts, and Secret Manager.
Common Mistakes
- Assuming Terraform state contains no sensitive data; state can include secrets and identifiers.
- Granting Terraform a broad Owner role to avoid permission errors instead of least-privilege SAs.
- Committing secrets or plain variables to repo instead of using Secret Manager or encrypted backends.
Audit & Access Transparency Logs — The Evidence Chain
Admin, Data Access, and Access Transparency logs show who/what accessed resources; export, retain, and protect them for compliance and forensics.
Key Insight
Admin Activity is on by default; Data Access often isn’t. Access Transparency shows Google staff access. Exports + IAM + retention policies are needed to preserve evidence.
Common Mistakes
- Expecting Access Transparency to include full request/response payloads — it typically shows access events and metadata.
- Treating exported logs as tamper-proof without bucket/object locks, strict IAM, and retention settings.
- Relying on monitoring alerts as a replacement for durable, exported audit logs in post-incident forensics.
Solution and Operations Excellence (12%)
Backup & DR Postures (RTO/RPO + Runbooks)
Backups, replication, retention and runbooks tailored to meet RTO/RPO; test full failovers, not just file restores.
Key Insight
RTO/RPO drive posture choice—snapshots/log‑shipping vs pilot‑light/warm‑standby/multi‑site; runbooks/automation set real RTO.
Common Mistakes
- Relying on frequent snapshots alone, ignoring transaction logs and consistency.
- Equating single-file restore tests with full DR readiness.
- Assuming failover is automatic, forgetting DNS, certs, data lag and tested runbooks.
Error Budget (SLO-driven Risk Control)
Allowed unreliability (1−SLO) measured via SLIs/rolling windows; use burn‑rate to gate releases.
Key Insight
Error budget = complement of SLO; measure all SLIs (errors, latency, correctness), track burn rate, and apply tiered actions when thresholds hit.
Common Mistakes
- Treating error budget as the SLO instead of its complement.
- Counting only outages—ignoring latency, incorrect responses, and partial failures.
- Assuming budgets reset instantly; ignoring rolling windows and burn‑rate math.
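Burn-rate math in a few lines (traffic numbers illustrative):

```python
# Error-budget burn rate for an availability SLO; numbers illustrative.
slo = 0.999                         # 99.9% over a 30-day rolling window
error_budget = 1 - slo              # fraction of requests allowed to fail

observed_error_rate = 0.004         # 0.4% of requests failing right now
burn_rate = round(observed_error_rate / error_budget, 2)   # 4x normal spend

days_to_exhaustion = round(30 / burn_rate, 2)
print(burn_rate, days_to_exhaustion)  # 4.0 7.5 -> page, don't just ticket
```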
SLIs / SLOs / SLAs — Error Budgets
Quantitative health metrics (SLIs), targets (SLOs) and contracts (SLAs); use error budgets to drive releases and alerts.
Key Insight
SLO = allowed unreliability over a rolling window; error budget = remaining allowed failure; burn rate dictates throttling/rollback.
Common Mistakes
- Treating error budget as financial budget instead of allowed unreliability.
- Mixing SLOs with SLAs — SLOs are internal targets; SLAs are contractual with penalties.
- Assuming error budgets reset instantly at window boundaries rather than using rolling/defined windows.
Data Drift — Input & Label Shifts
Distribution shifts in inputs or labels that reduce model generalization; detect at feature-level, validate impact, then respond (retrain, recalibrate, or fix data).
Key Insight
Types matter: covariate (inputs), prior (label freq), concept (label meaning). Use feature stats, PSI, and unlabeled proxies to detect early.
Common Mistakes
- Assuming all detected drift immediately breaks model performance — always quantify impact first.
- Monitoring only outputs/ops metrics (latency/throughput) and ignoring feature-distribution checks.
- Automatically retraining on drift without diagnosing data quality, features, or label issues first.
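PSI from the insight above, computed over binned feature fractions. The thresholds in the comment are a common rule of thumb, not a standard; tune per feature:

```python
import math

# Population Stability Index between training and serving feature histograms.
# Common rule of thumb (tune per feature): <0.1 stable, 0.1-0.25 watch,
# >0.25 investigate before reacting.
def psi(expected, actual):
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

train_dist = [0.25, 0.25, 0.25, 0.25]   # per-bin fractions at training time
serve_dist = [0.40, 0.30, 0.20, 0.10]   # per-bin fractions in production

score = psi(train_dist, serve_dist)
print(round(score, 3))  # 0.228 -> worth a look, but quantify impact first
```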
Cloud Build CI/CD (cloudbuild.yaml & triggers)
Managed CI/CD that runs cloudbuild.yaml steps, produces artifacts, and can invoke deployments or Vertex AI pipelines.
Key Insight
Build steps share one workspace; triggers are filterable/skippable; Cloud Build invokes training but won't long‑term store models or run heavy GPU/TPU jobs itself.
Common Mistakes
- Assuming triggers run on every commit — forget to filter by branch/tag or allow skip flags.
- Expecting Cloud Build to auto-version/store models long-term instead of pushing to Artifact Registry/GCS.
- Thinking Cloud Build runs long GPU/TPU training jobs directly (it invokes, doesn't replace training infra).
Gated Deployments & Pipeline Parameters
Use conditional steps, automated/manual gates, scoped params and runbooks to coordinate code, schema, and data changes.
Key Insight
Promotions must coordinate binaries + DB/state migrations + runbook steps; params need scoping and secret handling for reproducibility and safety.
Common Mistakes
- Assuming promotion only moves binaries — skipping DB/state migration coordination breaks releases.
- Treating pipeline params as non-sensitive or ephemeral — you can expose secrets or lose reproducibility.
- Believing automation removes the need for human approvals during high-risk rollouts.
Monitoring & Alerting — Noise Suppression
Choose metrics/logs/traces, set severity-based alerts and routes, and suppress noise to cut false pages.
Key Insight
Alert on customer impact (SLO breaches), not every symptom; map severity → responders → channel.
Common Mistakes
- Paging on every alert instead of using severities and escalation paths
- Relying only on logs for diagnosis; no metrics/traces for fast root cause
- Overzealous dedup/suppression that hides distinct incidents or escalations
ML Inference Latency Troubleshooting
Profile CPU/GPU, threads, queues, I/O and network; reproduce single-request and synthetic tails to isolate root cause.
Key Insight
High tail latency usually stems from single-threading, queuing, or client/network—scaling replicas often won’t fix it.
Common Mistakes
- Assuming undersized VM/instance is always the cause without profiling
- Switching to GPUs without benchmarked per-request and cold-start checks
- Adding instances to hide latency without addressing single-thread, queue, or client-side delays
Regression Testing — Stage-Gated, Fast, Targeted
Automated suites that catch reintroduced bugs; run the right scope at the right pipeline stage to keep gates fast.
Key Insight
Run narrow, fail-fast regression pre-merge; run full suites in CI/pre-prod and gate promotions by risk and test stability.
Common Mistakes
- Treating regression suite as only unit tests
- Assuming a larger suite always improves safety (ignores feedback speed)
- Running regression checks only post-release instead of pre-merge/pre-deploy
ML Metrics — AUC-ROC/PR & MAE/RMSE/RMSLE
Classification ranking metrics (AUC-ROC/AUC-PR) and regression error metrics (MAE/RMSE/RMSLE); choose by class imbalance and error characteristics.
Key Insight
Use AUC-PR when positives are rare; RMSE penalizes large errors; RMSLE measures relative error but breaks with zeros/negatives.
Common Mistakes
- Relying on AUC-ROC for highly imbalanced positive classes
- Treating AUC-PR and AUC-ROC as interchangeable
- Using RMSLE on negative or zero-valued targets (invalid)
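RMSE vs RMSLE on the same absolute errors, showing the absolute-vs-relative distinction:

```python
import math

# RMSE punishes large absolute errors equally; RMSLE measures relative error
# (log1p makes it undefined for values <= -1, hence the zero/negative caveat).
def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def rmsle(y, p):
    return math.sqrt(sum((math.log1p(a) - math.log1p(b)) ** 2
                         for a, b in zip(y, p)) / len(y))

y_true = [10, 1000]
y_pred = [20, 1010]   # both off by 10, very different relative errors

print(round(rmse(y_true, y_pred), 2))    # 10.0 -- treats both errors equally
print(round(rmsle(y_true, y_pred), 3))   # dominated by the 10 -> 20 miss
```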
Progressive Rollouts & Blast‑Radius Control
Use blue‑green/canary/rolling + feature flags, kill‑switches and traffic shaping; gate with SLIs/SLOs and observability.
Key Insight
Roll out to small, observable traffic slices and require thresholded SLI gates — rollback only with aggregated, contextual signals to avoid flapping.
Common Mistakes
- Treating readiness and liveness probes as interchangeable.
- Applying steady‑state SLI thresholds to deployment verification unchanged.
- Triggering immediate rollback on any single error — causes flapping.
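An anti-flapping gate sketch: require several consecutive breached evaluation windows before rolling back (threshold and window values illustrative):

```python
# Anti-flap rollback gate: roll back only after N consecutive breached
# evaluation windows, never on a single bad sample. Values illustrative.
def should_rollback(error_rates, threshold=0.05, consecutive=3):
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False

# One transient spike: hold steady.
print(should_rollback([0.01, 0.09, 0.01, 0.02]))   # False
# Sustained breach across three windows: roll back.
print(should_rollback([0.02, 0.06, 0.07, 0.08]))   # True
```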
Unified Telemetry: Metrics, Logs, Traces & Probes
Instrument metrics, structured logs, traces and synthetic probes; correlate them to detect anomalies, find root cause, and resolve incidents faster.
Key Insight
Correlate golden signals with domain metrics, traces and probes; interpret SLI deltas with load/noise/upstream context before blaming code.
Common Mistakes
- Monitoring only golden signals, ignoring domain-specific metrics and logs.
- Using metrics alone for root-cause diagnosis, skipping logs and traces.
- Treating every SLI delta as a deployment regression, ignoring noise/upstream causes.
Similar Cheat Sheets
- CCNA Exam v1.1 (200-301) Cheat Sheet
- AWS Certified Cloud Practitioner (CLF-C02) Cheat Sheet
- Google Cloud Certified Generative AI Leader Cheat Sheet
- AWS Certified AI Practitioner (AIF-C01) Cheat Sheet
- Exam AI-900: Microsoft Azure AI Fundamentals Cheat Sheet
- Google Cloud Security Operations Engineer Exam Cheat Sheet