
AWS Certified Data Engineer - Associate (DEA-C01) Ultimate Cheat Sheet

4 Domains • 34 Concepts • Approx. 5 pages

Your Quick Reference Study Guide

This cheat sheet covers the core concepts, terms, and definitions you need to know for the AWS Certified Data Engineer - Associate (DEA-C01). We've distilled the most important domains, topics, and critical details to help your exam preparation.

💡 Note: While this study guide highlights essential concepts, it's designed to complement—not replace—comprehensive learning materials. Use it for quick reviews, last-minute prep, or to identify areas that need deeper study before your exam.


About This Cheat Sheet: This study guide covers core concepts for AWS Certified Data Engineer - Associate (DEA-C01). It highlights key terms, definitions, common mistakes, and frequently confused topics to support your exam preparation.

Use this as a quick reference alongside comprehensive study materials.

Provided by GetMocka.com

Data Ingestion and Transformation

34%

Kinesis Data Streams — Shards, Scale & Ordering

Shard-based streaming: per-shard ordering and fixed throughput; scale by shard count or use on‑demand mode.

Key Insight

Throughput and ordering are per-shard — add shards to raise total capacity; hot partition keys still throttle.

Often Confused With

Amazon Kinesis Data Firehose • AWS Lambda • Amazon MSK

Common Mistakes

  • Assuming unlimited independent consumers per shard
  • Believing record ordering is guaranteed across the whole stream
  • Thinking adding shards raises per-shard limits or instantly fixes hot-key throttling
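Shard math follows directly from the published per-shard write limits (1 MiB/s and 1,000 records/s). A minimal sketch of capacity planning, assuming keys hash evenly (a hot key still pins one shard regardless of total shard count):

```python
import math

# Published per-shard write limits for Kinesis Data Streams.
SHARD_MB_PER_S = 1.0
SHARD_RECORDS_PER_S = 1000

def shards_needed(mb_per_s: float, records_per_s: float) -> int:
    """Minimum shard count to absorb an aggregate write workload,
    assuming partition keys distribute evenly across shards."""
    return max(
        math.ceil(mb_per_s / SHARD_MB_PER_S),
        math.ceil(records_per_s / SHARD_RECORDS_PER_S),
    )

# 5 MiB/s of small records -> bandwidth-bound: 5 shards
print(shards_needed(5, 2000))    # 5
# 0.5 MiB/s but 4,500 records/s -> record-rate-bound: 5 shards
print(shards_needed(0.5, 4500))  # 5
```

Whichever dimension (bytes or records) demands more shards wins; on-demand mode does this sizing for you but still throttles hot keys per shard.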

Kinesis Data Firehose — Buffer, Transform, Deliver

Managed delivery pipeline that buffers, optionally Lambda-transforms, converts formats, compresses, and writes to common destinations such as S3, Redshift, and OpenSearch.

Key Insight

Firehose is for simple delivery and light transforms with at-least-once semantics; use Streams for replay/stateful needs.

Often Confused With

Amazon Kinesis Data Streams • AWS Lambda • Amazon Redshift

Common Mistakes

  • Expecting exactly-once delivery — Firehose is at-least-once; duplicates can occur
  • Assuming unlimited in-flight transforms — Lambda transforms have size, time, and resource limits
  • Believing Firehose auto-creates Redshift tables or target schemas
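A Firehose transformation Lambda must echo each `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed` and base64-encoded data. A minimal sketch (the added `processed` field is a stand-in transform, not a real requirement):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda sketch: decode each record,
    apply a trivial transform, and return it in the required shape.
    A trailing newline keeps delivered S3 files line-delimited."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["processed"] = True  # stand-in for a real transform
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": out}
```

Records returned as `Dropped` are silently discarded; `ProcessingFailed` records are delivered to the error prefix, so duplicates on retry remain possible (at-least-once).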

Glue ETL & Crawlers (Serverless Spark)

Serverless Spark ETL + crawlers: auto-schema discovery, Data Catalog management, and S3-staged warehouse loads.

Key Insight

Crawlers discover metadata; Glue jobs do transforms — use S3 staging and bulk-load patterns for Redshift/Snowflake.

Often Confused With

AWS Glue Data Catalog • Amazon EMR

Common Mistakes

  • Treating Glue as batch-only — it supports interactive and streaming Spark jobs.
  • Relying on crawlers to infer perfect schemas for nested/heterogeneous data.
  • Expecting direct optimal writes to Redshift/Snowflake without S3 staging or job tuning.

Glue Data Catalog (Metadata Store)

Hive-compatible metadata repo for tables, partitions, and schema versions — stores pointers, not files.

Key Insight

Catalog holds metadata only; crawlers/ETL populate it and partition sync/query visibility often requires explicit actions (MSCK/ALTER or projection).

Often Confused With

AWS Glue ETL • Hive Metastore

Common Mistakes

  • Thinking the catalog contains actual data files instead of metadata pointers.
  • Believing crawlers alone will always infer correct nested schemas — use custom classifiers or explicit schemas.
  • Assuming partition sync and safe schema propagation to downstream jobs happen automatically.

MWAA — Managed Airflow (DAGs on S3)

Managed Apache Airflow: DAGs/plugins stored in S3; AWS runs the Airflow infrastructure, but you own networking, IAM, and heavy task scaling.

Key Insight

MWAA handles Airflow infra/scale but not your network/IAM or heavy data processing — offload big jobs and secure VPC/roles.

Often Confused With

AWS Glue Workflows • AWS Step Functions

Common Mistakes

  • Assuming MWAA auto-configures VPC/IAM — you must supply VPC, subnets, SGs, and execution roles.
  • Running heavy/long data jobs on MWAA workers — offload to EMR, Glue, or Lambda for scale and cost control.
  • Assuming plugin/DAG changes deploy instantly or unchanged — test plugin compatibility; expect S3/scheduler sync delays.

Glue Workflows — Glue-centric ETL Orchestration

Serverless DAGs that coordinate Glue jobs, crawlers and triggers for ETL pipelines; designed for Glue-first orchestration.

Key Insight

Glue Workflows orchestrate Glue components and simple retries natively; use Step Functions/Lambda for cross-service or advanced flows.

Often Confused With

AWS Step Functions • Amazon Managed Workflows for Apache Airflow (MWAA)

Common Mistakes

  • Thinking Glue Workflows can natively orchestrate any AWS service — they manage Glue jobs/crawlers/triggers only.
  • Assuming Glue Workflows replace Step Functions for complex, multi‑service or human‑approval flows — they are Glue‑centric.
  • Expecting separate workflow runtime charges — you pay for the underlying Glue jobs/crawlers; the workflow metadata has no runtime fee.

AWS Orchestration: EventBridge, Step Functions, Glue, MWAA, ECS

Pick the orchestrator by state, runtime, retry logic, scheduling and external integrations.

Key Insight

Step Functions = stateful, long-running workflows & complex retries; EventBridge routes events; MWAA runs Airflow DAGs; Glue Workflows are Glue‑centric.

Often Confused With

EventBridge • Step Functions • Glue Workflows

Common Mistakes

  • Treating EventBridge like a stateful orchestrator
  • Relying on retries without idempotency or compensating actions
  • Assuming services are interchangeable; ignoring checkpointing, execution limits and costs
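The retry-with-state point is concrete in Amazon States Language: retries and catches are declared per task. A minimal sketch, expressed as a Python dict; the state name and Glue job name are hypothetical:

```python
import json

# Amazon States Language sketch: one Glue job step with
# exponential-backoff retries and a catch-all failure state.
state_machine = {
    "StartAt": "StartNightlyEtl",
    "States": {
        "StartNightlyEtl": {
            "Type": "Task",
            # .sync = Step Functions waits for the Glue job to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "nightly-etl"},
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 10,
                "BackoffRate": 2.0,
                "MaxAttempts": 3,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}],
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}
print(json.dumps(state_machine, indent=2))
```

EventBridge has no equivalent of `Retry`/`Catch` state — it only routes; the retried Glue job must itself be idempotent, as the mistakes above note.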

Detect & Fix Data Skew: broadcast, shuffle, salting, repartition

Detect uneven partitioning/task times, choose broadcast vs shuffle, and use salting or repartitioning to rebalance.

Key Insight

Broadcast only when the small table fits executor memory; salting redistributes hot keys; always confirm optimizer behavior with the explain plan and task metrics.

Often Confused With

Broadcast joins • Shuffle joins • Repartitioning

Common Mistakes

  • Using broadcast join when the 'small' table exceeds executor memory
  • Adding partitions instead of changing the join key or applying salting
  • Trusting optimizer hints without validating the explain plan and task metrics
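Salting is just appending a bounded random suffix to the hot key so its rows spread over several partitions. A framework-free sketch of the idea (`HOT_KEYS` and the bucket count are illustrative; in Spark you would add a salt column and join on key + salt, duplicating the small side across all salt values):

```python
import random

HOT_KEYS = {"tenant-42"}  # keys known (from task metrics) to dominate
SALT_BUCKETS = 8

def salted_key(key: str) -> str:
    """Spread a hot key across SALT_BUCKETS synthetic partitions;
    cold keys pass through unchanged."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

keys = [salted_key("tenant-42") for _ in range(1000)]
print(len(set(keys)))  # up to 8 distinct partitions instead of 1
```

The cost is the replicated small side (one copy per salt bucket), so keep `SALT_BUCKETS` only as large as the skew requires.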


Data Store Management

26%

S3 — Data Lake Object Store

Immutable, key-based object storage for data lakes; you must pick formats, layout, lifecycle and cost controls.

Key Insight

S3 is object (not POSIX): objects are immutable and key-addressed. Performance & cost come from format, partitioning, and lifecycle rules.

Often Confused With

EBS • EFS

Common Mistakes

  • Treating S3 like a POSIX/block filesystem (expecting file locks or in-place updates).
  • Assuming built-in indexing/schema or fast queries without columnar formats and partitioning.
  • Thinking S3 provides multi-object atomic transactions or that costs are only stored bytes.
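"Performance comes from layout" usually means Hive-style partition prefixes, which let engines prune whole prefixes instead of listing every object. A small sketch of such a key builder (the `lake/` prefix and table name are hypothetical):

```python
from datetime import date

def partitioned_key(table: str, d: date, file_id: str) -> str:
    """Build a Hive-style partitioned S3 key (key=value path segments)
    so Athena/Glue can prune partitions by prefix."""
    return (f"lake/{table}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/{file_id}.parquet")

print(partitioned_key("orders", date(2024, 3, 7), "part-0001"))
# lake/orders/year=2024/month=03/day=07/part-0001.parquet
```

Zero-padded month/day keeps lexicographic and chronological order aligned, which matters for prefix listings and partition projection patterns.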

DynamoDB Streams (Table CDC)

Per-item change stream for DynamoDB; use with Lambda/Kinesis for CDC — TTL deletions appear as REMOVE records.

Key Insight

Streams are at-least-once with a short retention window, preserve order per shard/partition key, and show TTL expirations as REMOVE events.

Often Confused With

Kinesis Data Streams • Kinesis Data Firehose

Common Mistakes

  • Expecting exactly-once delivery — Streams are at-least-once (handle duplicates).
  • Assuming change records are retained indefinitely — retention window is short (~24h).
  • Believing TTL deletes are instantaneous or invisible — they're asynchronous and appear as REMOVE records.
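TTL expirations can be told apart from user deletes: the stream record for a TTL delete carries a `userIdentity` naming the DynamoDB service principal. A small filter sketch over the documented record shape:

```python
def is_ttl_expiry(record: dict) -> bool:
    """True when a Streams REMOVE record was produced by TTL expiry
    (service principal) rather than an application delete."""
    ident = record.get("userIdentity", {})
    return (
        record.get("eventName") == "REMOVE"
        and ident.get("type") == "Service"
        and ident.get("principalId") == "dynamodb.amazonaws.com"
    )

ttl_rec = {"eventName": "REMOVE",
           "userIdentity": {"type": "Service",
                            "principalId": "dynamodb.amazonaws.com"}}
user_rec = {"eventName": "REMOVE"}
print(is_ttl_expiry(ttl_rec), is_ttl_expiry(user_rec))  # True False
```

A consumer archiving deletes, for instance, can route TTL expiries to cold storage while alerting on unexpected manual REMOVEs.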

Data Catalogs — Metadata Index & Governance

Centralized metadata index linking schemas, lineage, owners and business terms for discovery and governance.

Key Insight

Stores metadata and pointers (not raw data); automated crawlers need config and human curation to stay accurate.

Often Confused With

Data lake • Data dictionary • Metadata repository

Common Mistakes

  • Thinking crawlers alone guarantee complete, correct metadata
  • Treating the catalog as a data store — it holds metadata and pointers only
  • Equating lineage with audit logs

Glue Crawlers — Schema Inference

Sample-based scanners that infer columns, types and partitions into the Glue Data Catalog; useful but fallible.

Key Insight

Inference is sample-driven and pattern-based — sampling bias, ambiguous formats, and path patterns cause wrong types or missed partitions.

Often Confused With

AWS Glue ETL jobs • Partition discovery

Common Mistakes

  • Assuming crawlers infer perfect types/nullability — sampling and ambiguous formats can misclassify
  • Believing crawlers modify or delete source files — they only read and update the Data Catalog
  • Expecting partition discovery with zero config — path patterns or classifiers are often required

Data Lifecycle: Retention, Tiering & TTL

Policy rules to classify, promote/demote, and expire data (hot→warm→cold→archive), balancing cost, latency, SLAs, and legal requirements.

Key Insight

Age is only one signal—combine access frequency, SLA/business value, retrieval cost, and legal holds to decide tiering

Often Confused With

S3 Lifecycle Policies • Backup/Retention Policies • Database TTL

Common Mistakes

  • Assuming cold/archive is always cheapest—ignores retrieval and per-request fees
  • Believing colder tiers mean reduced durability or no immediate access
  • Applying one static policy to all datasets without monitoring or reclassification
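The "age is only one signal" point can be made concrete with a toy tiering rule that layers the signals in priority order. The thresholds below are illustrative, not AWS recommendations:

```python
def pick_tier(days_since_access: int, monthly_reads: int,
              legal_hold: bool) -> str:
    """Toy tiering decision: legal holds trump everything, access
    frequency trumps age, and age alone only demotes cold data."""
    if legal_hold:
        return "retain-current-tier"
    if monthly_reads > 10:
        return "hot"            # frequently read, whatever its age
    if days_since_access < 90:
        return "warm"
    if days_since_access < 365:
        return "cold"
    return "archive"

print(pick_tier(400, 0, False))   # archive
print(pick_tier(400, 50, False))  # hot: frequency overrides age
```

Note the second call: a year-old dataset read 50 times a month stays hot, which is exactly the case an age-only policy gets wrong.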

S3 Lifecycle & Intelligent‑Tiering

Combine lifecycle rules and Intelligent‑Tiering to automate storage-class transitions, expirations, version handling, and cost control.

Key Insight

Intelligent‑Tiering auto-optimizes across its supported access tiers; Glacier/Deep Archive require explicit lifecycle rules or the opt‑in archive access tiers.

Often Confused With

Glacier / Glacier Deep Archive • S3 Object Lock • Versioning Lifecycle

Common Mistakes

  • Expecting lifecycle rules to bypass Object Lock or legal holds — they do not
  • Assuming lifecycle actions take effect instantly once a rule is created
  • Thinking enabling versioning alone prevents object deletion forever
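A lifecycle configuration in the shape boto3's `put_bucket_lifecycle_configuration` expects ties the pieces together: transitions, current-version expiry, and noncurrent-version cleanup. Prefix and day counts are illustrative:

```python
# Lifecycle rule sketch: tier logs/ down over time, expire after a
# year, and clean up old versions left behind by versioning.
lifecycle = {
    "Rules": [{
        "ID": "logs-tiering",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
    }]
}
```

Note the last element: without `NoncurrentVersionExpiration`, versioned buckets accumulate (and bill for) every superseded object version indefinitely.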

Row vs Columnar: CSV → Parquet

Convert row formats (CSV/Avro) to Parquet to shrink storage and speed analytics via column pruning and predicate pushdown.

Key Insight

Parquet stores data in row‑groups and column‑chunks; predicate pushdown + column pruning drastically cut S3 I/O — but decompression and decoding add CPU cost.

Often Confused With

Avro • ORC • CSV

Common Mistakes

  • Assuming columnar always wins—bad for small‑row OLTP or frequent single‑row writes.
  • Believing higher compression always lowers query cost—ignores CPU/decompression overhead.
  • Thinking Parquet files are mutable—updates require rewrite/merge/compaction, not in‑place edits.
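A back-of-envelope model shows why pruning dominates: scanned bytes shrink multiplicatively with column selection and partition selectivity. The equal-column-size assumption is a simplification real data rarely matches:

```python
def scanned_gb(total_gb: float, cols_total: int, cols_read: int,
               partition_fraction: float) -> float:
    """Rough scan estimate for a columnar, partitioned layout: only
    selected columns in matching partitions are read. Assumes roughly
    equal column sizes (a simplification)."""
    return total_gb * (cols_read / cols_total) * partition_fraction

# 1 TB table, reading 4 of 40 columns, one day out of a year:
print(round(scanned_gb(1000, 40, 4, 1 / 365), 3))  # ~0.274 GB
```

The same query against unpartitioned CSV scans the full 1,000 GB: a ~3,650x difference, which is what shows up directly in Athena's per-TB bill.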

Schema Evolution & Data Modeling

Pick schema‑on‑write vs schema‑on‑read, partitioning, indexing, compression and denormalization based on dominant query SLAs and cost.

Key Insight

Schema‑on‑write = higher ingest/validation cost but predictable, fast reads; schema‑on‑read = flexible but higher runtime/query cost — choose by read SLAs.

Often Confused With

Schema-on-read • Schema-on-write • Semi-structured vs Unstructured

Common Mistakes

  • Assuming NoSQL requires no schema or validation—implicit schemas and contracts are still needed.
  • Always normalizing to save storage—can cripple analytic read performance; denormalize where queries dominate.
  • Thinking schema‑on‑read is always cheaper—frequent queries pay parsing and runtime costs.

Data Operations and Support

22%

AWS Lambda — Serverless Functions

Event-driven, short-lived serverless compute for stateless tasks that scale with events.

Key Insight

Use Lambda for stateless, low-latency tasks — enforce the 15‑min timeout, treat /tmp as ephemeral, and plan for concurrency/quota limits.

Often Confused With

AWS Fargate • Amazon EC2 • AWS Step Functions

Common Mistakes

  • Assuming unlimited runtime — Lambda maximum is 15 minutes.
  • Relying on /tmp as persistent storage across invocations.
  • Ignoring account/regional concurrency limits and throttling.
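The "/tmp is ephemeral" point translates into a specific coding pattern: use it as a best-effort warm-start cache, never as durable state. A sketch (the cache path and payload are illustrative):

```python
import json
import os
import time

CACHE = "/tmp/lookup.json"  # may survive between warm invocations — or not

def load_lookup() -> dict:
    """Best-effort /tmp cache: reuse it if the execution environment
    was kept warm, rebuild it transparently if not."""
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)
    data = {"loaded_at": time.time()}  # stand-in for a real fetch
    with open(CACHE, "w") as f:
        json.dump(data, f)
    return data

def handler(event, context):
    lookup = load_lookup()
    return {"statusCode": 200, "warm": "loaded_at" in lookup}
```

Anything that must outlive an invocation belongs in S3, DynamoDB, or EFS; the handler above works identically whether or not /tmp was wiped.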

Lambda ETL — Stream & Record Transforms

Serverless record-level ETL and stream consumers using Event Source Mapping, batching, and async Destinations.

Key Insight

Good for lightweight, stateless transforms and small-window streaming; for shuffle-heavy or large-state ETL use Glue/EMR, and design for idempotency and backpressure.

Often Confused With

AWS Glue • Amazon EMR • Kinesis Data Analytics

Common Mistakes

  • Expecting Lambda to replace Spark/EMR for large, shuffle-heavy ETL.
  • Assuming exactly-once processing — ESM yields at-least-once; build idempotency.
  • Using Destinations for synchronous calls — Destinations apply only to async invocations.
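With `ReportBatchItemFailures` enabled on the event source mapping, a stream consumer can return only the failed sequence numbers so the ESM retries those instead of the whole batch. A sketch over the Kinesis record shape (`process` is a hypothetical, idempotent worker):

```python
def process(rec: dict) -> None:
    """Hypothetical record worker — must be idempotent, since the
    event source mapping delivers at-least-once."""
    if rec["kinesis"]["data"] == "bad":
        raise ValueError("poison record")

def handler(event, context):
    """Partial-batch-response handler: collect failures and return them
    in the shape the ESM expects (requires ReportBatchItemFailures)."""
    failures = []
    for rec in event["Records"]:
        try:
            process(rec)
        except Exception:
            failures.append(
                {"itemIdentifier": rec["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}

event = {"Records": [
    {"kinesis": {"sequenceNumber": "1", "data": "ok"}},
    {"kinesis": {"sequenceNumber": "2", "data": "bad"}},
]}
print(handler(event, None))  # {'batchItemFailures': [{'itemIdentifier': '2'}]}
```

Without this, one poison record forces the whole batch to be retried, reprocessing the successful records too — which is exactly why idempotency comes first.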

Redshift — Columnar MPP Warehouse (RA3 / Provisioned)

Managed columnar MPP SQL warehouse for BI/ELT; tune via distribution/sort keys, compression, WLM, and materialized views

Key Insight

Pick distribution keys to colocate join columns (avoid skew); sort keys cut I/O for range scans but need maintenance; WLM controls concurrency/latency

Often Confused With

Amazon Athena • Redshift Spectrum

Common Mistakes

  • Choosing any column as a distribution key — leads to severe data skew and slow joins
  • Assuming sort keys keep rows perfectly ordered after updates — VACUUM/maintenance required
  • Thinking cluster size alone fixes latency — WLM queue/configuration directly impacts concurrency and response time
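The key-choice advice maps directly onto DDL. An illustrative sketch (table and column names are hypothetical): distribute on the join column so matching rows colocate, sort on the range-filter column so scans skip blocks:

```python
# Illustrative Redshift DDL held as a string for inspection:
# DISTKEY colocates joins on customer_id; SORTKEY lets range scans
# on order_date skip disk blocks via zone maps.
ddl = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
"""
print(ddl)
```

If `customer_id` were heavily skewed (one customer owning most rows), this DISTKEY would concentrate those rows on one slice — the "any column works" mistake above.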

Athena — Serverless SQL on S3 (Presto/Trino)

Serverless interactive SQL that queries data in S3 (schema-on-read); billed per TB scanned and uses Glue for metadata

Key Insight

Athena queries files in place — performance hinges on file format, partitioning, compression, and file size, not 'serverless magic'

Often Confused With

Amazon Redshift • Amazon EMR

Common Mistakes

  • Believing Athena stores or manages your data — it only queries S3 in place
  • Assuming Athena is always faster than warehouses — speed depends on file layout and concurrency
  • Expecting automatic cataloging/optimization — you must define or crawl metadata and optimize file layout
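Because Athena bills on bytes scanned (with a 10 MB per-query minimum), the file-layout point is really a cost formula. A sketch, treating the common $5/TB list price as an assumption since pricing varies by region:

```python
def athena_cost_usd(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate an Athena query's cost from bytes scanned, applying
    the 10 MB per-query minimum. price_per_tb is an assumption."""
    tb = max(bytes_scanned, 10 * 1024**2) / 1024**4
    return round(tb * price_per_tb, 6)

full_scan = athena_cost_usd(500 * 1024**3)  # 500 GB of raw CSV
pruned    = athena_cost_usd(5 * 1024**3)    # Parquet + partition pruning
print(full_scan, pruned)  # ~2.44 vs ~0.024
```

The 100x cost gap comes entirely from layout, not from Athena itself — the same "serverless" engine ran both queries.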

Pipeline Orchestration & Resilience

Coordinate automated pipelines with idempotency, retries, checkpoints and multi‑region failover to meet RTO/RPO.

Key Insight

Design for idempotency + exponential backoff + state checkpoints; RTO/RPO determine replay vs automated failover.

Often Confused With

High Availability • Disaster Recovery • CI/CD

Common Mistakes

  • Blind retries without idempotency or backoff cause duplicate processing or amplify overload.
  • Single‑region replication ≠ full resiliency — region failures still break SLAs without multi‑region strategy.
  • Monitoring/alerts alone don't recover pipelines — you need automated remediation or practiced runbooks.
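"Exponential backoff" specifically means capped growth plus jitter, so a fleet of failed workers does not retry in lockstep. A minimal full-jitter sketch (base and cap values are illustrative):

```python
import random

def backoff_delays(attempts: int, base: float = 0.5,
                   cap: float = 30.0) -> list[float]:
    """Full-jitter exponential backoff: the ceiling doubles each
    attempt up to `cap`, and each delay is a uniform draw below the
    ceiling, de-synchronizing concurrent retriers."""
    return [random.uniform(0, min(cap, base * 2 ** n))
            for n in range(attempts)]

delays = backoff_delays(5)
print([round(d, 2) for d in delays])
```

Pairing this with idempotent operations is what makes retries safe: without idempotency, each retry is a potential duplicate write rather than a recovery.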

Glue Job Monitoring: Metrics, Logs & Bookmarks

Use CloudWatch metrics, continuous logs and Glue bookmarks to detect failures, diagnose root causes, and tune DPU/IO.

Key Insight

Bookmarks record processed state but don't guarantee idempotency; continuous logging gives executor traces but adds cost, and Glue metrics are often coarse.

Often Confused With

Job Bookmarks • Continuous Logging • CloudWatch Metrics

Common Mistakes

  • Assuming job bookmarks make a job fully idempotent — they only track state and can still skip or duplicate data.
  • Expecting continuous logging to be on and free — it's opt‑in and adds cost/latency/volume.
  • Interpreting high DPU use as a signal to add DPUs — bottlenecks may be skew, I/O, or GC, not compute.

Glue Data Quality — In‑Job Checks & Error Handling

Deequ‑based Glue data checks and in‑job validations to detect, route, or stop bad records; tune for cost vs latency.

Key Insight

A pass only covers declared rules and sampled stats; in‑transit checks catch bad rows earlier but add CPU/latency — use routing, quarantine, or fail‑fast handling.

Often Confused With

Deequ • At‑rest validation • Glue Data Catalog

Common Mistakes

  • Treating a rule pass as proof of full downstream correctness.
  • Swapping in‑transit and at‑rest checks without changing handling or SLA.
  • Assuming in‑job validation must abort on the first bad record.
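"Route instead of abort" is a simple partitioning pattern: rows failing any rule are diverted to a quarantine set with the names of the rules they broke. A framework-free sketch (the rules themselves are illustrative):

```python
def route(records: list[dict], rules: dict) -> tuple[list, list]:
    """Split records into (good, quarantined) instead of failing the
    job; quarantined entries carry the names of the violated rules."""
    good, quarantine = [], []
    for rec in records:
        errs = [name for name, check in rules.items() if not check(rec)]
        if errs:
            quarantine.append((rec, errs))
        else:
            good.append((rec, errs))
    return good, quarantine

rules = {
    "amount_positive": lambda r: r.get("amount", 0) > 0,
    "has_id": lambda r: "id" in r,
}
good, bad = route([{"id": 1, "amount": 5}, {"amount": -2}], rules)
print(len(good), len(bad))  # 1 1
```

The quarantine output feeds a separate remediation or replay pipeline; the main flow keeps its SLA while bad rows wait for inspection.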

Deequ & DQDL — Spark Data Quality Checks

Open‑source Spark library (Deequ) + DQDL DSL to declare profiling and constraint checks; it detects and reports issues but does not fix them.

Key Insight

Deequ computes metrics and evaluates constraints on Spark/JVM; it reports violations (no auto‑fix). Use separate remediation pipelines; Glue Data Quality merely surfaces the results.

Often Confused With

Glue Data Quality • ETL transformation code • Automated data‑fix tools

Common Mistakes

  • Thinking Deequ transforms or writes rows instead of only computing metrics.
  • Assuming Deequ is proprietary to AWS or only runs inside Glue.
  • Using DQDL as a general-purpose ETL language instead of a declarative rule DSL.

Data Security and Governance

18%

AWS IAM — Identities & JSON Policies

Global service to create users, groups, roles and JSON policies for API auth/authz; favor roles and temporary creds.

Key Insight

Policies are evaluated together (explicit deny wins); use roles/temporary creds and least‑privilege — never daily root keys.

Often Confused With

IAM role • Resource-based policies

Common Mistakes

  • Using root or long‑lived IAM user credentials for routine tasks (MFA doesn't make this best practice).
  • Attaching broad managed policies (e.g., AdministratorAccess) instead of scoping least privilege.
  • Thinking IAM is regional or that services can assume groups (groups only bundle users).
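Least privilege in practice means scoping both actions and resources. A sketch of a read-only policy for one prefix of one bucket (bucket and prefix names are hypothetical); note that `ListBucket` targets the bucket ARN while `GetObject` targets object ARNs, a distinction broad managed policies hide:

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read objects only under raw/ in this one bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-lake/raw/*",
        },
        {
            # Listing is a bucket-level action, scoped by prefix.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["raw/*"]}},
        },
    ],
}
print(json.dumps(policy, indent=2))
```

Attach this to a role assumed with temporary credentials, not to a long-lived user, per the insight above.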

IAM Role — Temporary, Assumable Identity

An assignable identity with a permission policy and a trust policy; principals assume it to get temporary STS creds.

Key Insight

Trust policy = who can assume; permission policy = what the session can do — assuming yields temporary credentials only.

Often Confused With

IAM user • Instance profile

Common Mistakes

  • Treating a role like a user that holds long‑lived access keys.
  • Confusing the trust policy with permission grants (trust only permits assumption).
  • Assuming any service or account can assume a role without configuring the trust principal.

Lake Formation: Named Grants, LF‑Tags & Row Filters

Enforce table/column/row permissions in the Glue Data Catalog via Lake Formation grants, LF‑Tags (TBAC) and data filters

Key Insight

Lake Formation grants (not IAM alone) gate Glue-catalog access; LF‑Tags give tag-based inheritance; row filters restrict rows (not mask).

Often Confused With

IAM policies • S3 bucket policies • Glue resource policies

Common Mistakes

  • Assuming IAM policies alone block Glue/Athena access — Lake Formation grants may also be required
  • Believing LF‑Tags fully replace named grants — TBAC complements, doesn’t always substitute
  • Treating row-level filters like masking — they exclude rows; masking requires transformation

Least‑Privilege: IAM + Lake Formation + Boundaries + Secrets

Enforce least privilege by combining scoped IAM policies, permission boundaries, role separation, Lake Formation grants, and secrets management.

Key Insight

AuthZ is layered: IAM role scoping + Lake Formation catalog grants + permission boundaries/secrets; mis‑scoped explicit denies can block access.

Often Confused With

Authentication • IAM policies • Data tagging enforcement

Common Mistakes

  • Assuming authentication equals authorization — being signed in ≠ having data access
  • Believing IAM policies and Lake Formation perms are interchangeable
  • Thinking classifying/tagging data auto-enforces access without corresponding policies

KMS & Envelope Encryption (Masking/Anonymization)

Use KMS CMKs to wrap per-object DEKs; envelope encryption + masking reduces KMS calls and limits exposure.

Key Insight

Wrap per-object DEKs with CMKs—DEKs cut KMS usage but must not be reused; encrypted DEKs can safely accompany ciphertext.

Often Confused With

SSE-KMS • Client-side encryption

Common Mistakes

  • Thinking envelope encryption removes KMS — CMKs still wrap/unwrap DEKs.
  • Reusing one DEK to save cost — increases blast radius; use per-object/file DEKs.
  • Believing encrypted DEK exposure is unsafe — the encrypted DEK can be stored; only plaintext DEK must be protected.
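The envelope flow itself is simple to sketch. The class below is a deliberate stand-in for KMS with no real cryptography — in AWS you would call `kms:GenerateDataKey` and `kms:Decrypt`, and the CMK never leaves KMS — but the wrap/store/unwrap shape is the same:

```python
import secrets

class FakeKms:
    """Stand-in for KMS to show the envelope-encryption flow only.
    NOT real cryptography: it just maps opaque wrapped blobs to keys."""
    def __init__(self):
        self._store = {}  # wrapped-DEK blob -> plaintext DEK

    def generate_data_key(self):
        plaintext = secrets.token_bytes(32)  # fresh per-object DEK
        wrapped = secrets.token_hex(8)       # opaque "ciphertext" blob
        self._store[wrapped] = plaintext
        return plaintext, wrapped

    def decrypt(self, wrapped):
        return self._store[wrapped]

kms = FakeKms()
dek, wrapped_dek = kms.generate_data_key()
# Encrypt the object locally with `dek` (e.g. AES-GCM), then discard
# the plaintext DEK; store `wrapped_dek` next to the ciphertext —
# the wrapped form is safe at rest.
assert kms.decrypt(wrapped_dek) == dek  # one KMS call per read, not per byte
```

This is why envelope encryption cuts KMS usage: KMS is invoked once per object (wrap/unwrap), while the bulk encryption happens locally with the DEK.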

S3 Data Lake: Encryption, Access & Lifecycle

Design S3 with correct encryption (SSE options/client-side), tight policies, prefixes, lifecycle rules, versioning, and Object Lock.

Key Insight

Encryption doesn't change IAM/ACLs; default bucket encryption affects new uploads only; Object Lock is required for immutability.

Often Confused With

SSE-S3 • SSE-KMS • S3 Versioning

Common Mistakes

  • Assuming server-side encryption prevents unauthorized reads — IAM/policies still control access.
  • Expecting default bucket encryption to affect existing objects — it applies only to new uploads.
  • Believing versioning alone ensures immutability — use Object Lock/retention for compliance.

CloudWatch Logs & Insights — Tamper‑Aware Pipeline Monitor

Centralize CloudTrail/S3/Lambda logs in CloudWatch Logs + Insights to detect tampering and access/config changes for audits.

Key Insight

Monitor access, config and data events across regions with immutable storage and log‑file validation — encryption alone won't show tampering

Often Confused With

AWS CloudTrail • AWS Config

Common Mistakes

  • Assuming CloudTrail records S3 object/data‑plane events by default — data events must be explicitly enabled
  • Treating encryption or centralization as integrity checks — they don't detect deletions or config tampering
  • Keeping logs in the producer account/bucket and expecting protection — a compromised producer can alter its own logs

AWS CloudTrail — API Audit Trail

Captures AWS control‑plane API activity; delivers events to S3, CloudWatch Logs, or CloudTrail Lake for forensic queries.

Key Insight

CloudTrail records API (management) events; S3/KMS object‑level (data) events are optional and cost extra — enable multi‑region trails, log‑file validation, and data events as needed.


Often Confused With

Amazon CloudWatch Logs • CloudTrail Lake

Common Mistakes

  • Expecting CloudTrail to capture application or OS logs — it records AWS API activity only
  • Failing to enable data events or multi‑region logging — you'll miss S3/KMS object‑level and cross‑region activity
  • Relying on CloudTrail alone for real‑time alerts — integrate with CloudWatch/EventBridge/GuardDuty for notifications

KMS — CMK/DEK, Grants & Rotation

CMK types and policies; use envelope encryption (DEK encrypted by CMK); manage grants, rotation, and multi‑Region keys.

Key Insight

CMKs gate usage — IAM + key policy/grants both matter; envelope encryption keeps DEKs transient; rotation creates new key versions, not instant data re‑encryption.

Often Confused With

AWS‑managed CMKs (aws/*) • Client‑side encryption • S3 SSE options (SSE‑S3 / SSE‑KMS)

Common Mistakes

  • Treating AWS‑managed CMKs as having the same control as customer CMKs
  • Thinking CMK rotation immediately prevents decrypting already encrypted data
  • Assuming IAM permissions alone let you decrypt without key policy/grant

Data Lineage — Provenance & Audit Trail

Capture end‑to‑end provenance (source, transform, actor, timestamp, row ID) to prove SARs, deletions, and impact for audits.

Key Insight

Regulatory proofs need linkable, row/event‑level lineage combined with retention, access controls and deletion workflows — schema maps alone often fall short.

Often Confused With

Audit logs • Data catalog (Glue / Lake Formation) • Data quality metrics

Common Mistakes

  • Relying on generic logs that lack linkable lineage fields (source/row/actor/timestamp)
  • Assuming schema‑level lineage satisfies SARs or deletion requests
  • Thinking lineage automatically enforces data quality or deletion without workflows

© 2026 Mocka.ai - Your Exam Preparation Partner


Certification Overview

Duration: 120 min
Questions: 65
Passing: 72%
Level: Intermediate

Cheat Sheet Content

34 Key Concepts
4 Exam Domains
