
AWS Certified Data Engineer - Associate (DEA-C01) Ultimate Cheat Sheet

4 Domains • 34 Concepts • Approx. 5 pages

Your Quick Reference Study Guide

This cheat sheet covers the core concepts, terms, and definitions you need to know for the AWS Certified Data Engineer - Associate (DEA-C01). We've distilled the most important domains, topics, and critical details to help your exam preparation.

💡 Note: While this study guide highlights essential concepts, it's designed to complement—not replace—comprehensive learning materials. Use it for quick reviews, last-minute prep, or to identify areas that need deeper study before your exam.


About This Cheat Sheet: This study guide covers core concepts for AWS Certified Data Engineer - Associate (DEA-C01). It highlights key terms, definitions, common mistakes, and frequently confused topics to support your exam preparation.

Use this as a quick reference alongside comprehensive study materials.

Provided by GetMocka.com

Data Ingestion and Transformation

34%

Kinesis Data Streams — Shards, Scale & Ordering

Shard-based streaming: per-shard ordering and fixed throughput; scale by shard count or use on‑demand mode.

Key Insight

Throughput and ordering are per-shard — add shards to raise total capacity; hot partition keys still throttle.

Often Confused With

Amazon Kinesis Data Firehose • AWS Lambda • Amazon MSK

Common Mistakes

  • Assuming unlimited independent consumers per shard
  • Believing record ordering is guaranteed across the whole stream
  • Thinking adding shards raises per-shard limits or instantly fixes hot-key throttling
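Shard math follows directly from the published per-shard write limits (1 MiB/s and 1,000 records/s). A minimal sketch of capacity planning, assuming keys hash evenly (a hot key still pins one shard regardless of total shard count):

```python
import math

# Published per-shard write limits for Kinesis Data Streams.
SHARD_MB_PER_S = 1.0
SHARD_RECORDS_PER_S = 1000

def shards_needed(mb_per_s: float, records_per_s: float) -> int:
    """Minimum shard count to absorb an aggregate write workload,
    assuming partition keys distribute evenly across shards."""
    return max(
        math.ceil(mb_per_s / SHARD_MB_PER_S),
        math.ceil(records_per_s / SHARD_RECORDS_PER_S),
    )

# 5 MiB/s of small records -> bandwidth-bound: 5 shards
print(shards_needed(5, 2000))    # 5
# 0.5 MiB/s but 4,500 records/s -> record-rate-bound: 5 shards
print(shards_needed(0.5, 4500))  # 5
```

Whichever dimension (bytes or records) demands more shards wins; on-demand mode does this sizing for you but still throttles hot keys per shard.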

Kinesis Data Firehose — Buffer, Transform, Deliver

Managed delivery pipeline that buffers, optionally Lambda-transforms, converts formats, compresses, and writes to common destinations such as S3, Redshift, and OpenSearch.

Key Insight

Firehose is for simple delivery and light transforms with at-least-once semantics; use Streams for replay/stateful needs.

Often Confused With

Amazon Kinesis Data Streams • AWS Lambda • Amazon Redshift

Common Mistakes

  • Expecting exactly-once delivery — Firehose is at-least-once; duplicates can occur
  • Assuming unlimited in-flight transforms — Lambda transforms have size, time, and resource limits
  • Believing Firehose auto-creates Redshift tables or target schemas
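A Firehose transformation Lambda must echo each `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed` and base64-encoded data. A minimal sketch (the added `processed` field is a stand-in transform, not a real requirement):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda sketch: decode each record,
    apply a trivial transform, and return it in the required shape.
    A trailing newline keeps delivered S3 files line-delimited."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["processed"] = True  # stand-in for a real transform
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": out}
```

Records returned as `Dropped` are silently discarded; `ProcessingFailed` records are delivered to the error prefix, so duplicates on retry remain possible (at-least-once).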

Glue ETL & Crawlers (Serverless Spark)

Serverless Spark ETL + crawlers: auto-schema discovery, Data Catalog management, and S3-staged warehouse loads.

Key Insight

Crawlers discover metadata; Glue jobs do transforms — use S3 staging and bulk-load patterns for Redshift/Snowflake.

Often Confused With

AWS Glue Data Catalog • Amazon EMR

Common Mistakes

  • Treating Glue as batch-only — it supports interactive and streaming Spark jobs.
  • Relying on crawlers to infer perfect schemas for nested/heterogeneous data.
  • Expecting direct optimal writes to Redshift/Snowflake without S3 staging or job tuning.

Glue Data Catalog (Metadata Store)

Hive-compatible metadata repo for tables, partitions, and schema versions — stores pointers, not files.

Key Insight

Catalog holds metadata only; crawlers/ETL populate it and partition sync/query visibility often requires explicit actions (MSCK/ALTER or projection).

Often Confused With

AWS Glue ETL • Hive Metastore

Common Mistakes

  • Thinking the catalog contains actual data files instead of metadata pointers.
  • Believing crawlers alone will always infer correct nested schemas — use custom classifiers or explicit schemas.
  • Assuming partition sync and safe schema propagation to downstream jobs happen automatically.

MWAA — Managed Airflow (DAGs on S3)

Managed Apache Airflow: DAGs/plugins stored in S3; AWS runs the Airflow infrastructure, but you own networking, IAM, and heavy task scaling.

Key Insight

MWAA handles Airflow infra/scale but not your network/IAM or heavy data processing — offload big jobs and secure VPC/roles.

Often Confused With

AWS Glue Workflows • AWS Step Functions

Common Mistakes

  • Assuming MWAA auto-configures VPC/IAM — you must supply VPC, subnets, SGs, and execution roles.
  • Running heavy/long data jobs on MWAA workers — offload to EMR, Glue, or Lambda for scale and cost control.
  • Assuming plugin/DAG changes deploy instantly or unchanged — test plugin compatibility; expect S3/scheduler sync delays.

Glue Workflows — Glue-centric ETL Orchestration

Serverless DAGs that coordinate Glue jobs, crawlers and triggers for ETL pipelines; designed for Glue-first orchestration.

Key Insight

Glue Workflows orchestrate Glue components and simple retries natively; use Step Functions/Lambda for cross-service or advanced flows.

Often Confused With

AWS Step Functions • Amazon Managed Workflows for Apache Airflow (MWAA)

Common Mistakes

  • Thinking Glue Workflows can natively orchestrate any AWS service — they manage Glue jobs/crawlers/triggers only.
  • Assuming Glue Workflows replace Step Functions for complex, multi‑service or human‑approval flows — they are Glue‑centric.
  • Expecting separate workflow runtime charges — you pay for the underlying Glue jobs/crawlers; the workflow metadata has no runtime fee.

AWS Orchestration: EventBridge, Step Functions, Glue, MWAA, ECS

Pick the orchestrator by state, runtime, retry logic, scheduling and external integrations.

Key Insight

Step Functions = stateful, long-running workflows & complex retries; EventBridge routes events; MWAA runs Airflow DAGs; Glue Workflows are Glue‑centric.

Often Confused With

EventBridge • Step Functions • Glue Workflows

Common Mistakes

  • Treating EventBridge like a stateful orchestrator
  • Relying on retries without idempotency or compensating actions
  • Assuming services are interchangeable; ignoring checkpointing, execution limits and costs
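The retry-with-state point is concrete in Amazon States Language: retries and catches are declared per task. A minimal sketch, expressed as a Python dict; the state name and Glue job name are hypothetical:

```python
import json

# Amazon States Language sketch: one Glue job step with
# exponential-backoff retries and a catch-all failure state.
state_machine = {
    "StartAt": "StartNightlyEtl",
    "States": {
        "StartNightlyEtl": {
            "Type": "Task",
            # .sync = Step Functions waits for the Glue job to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "nightly-etl"},
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 10,
                "BackoffRate": 2.0,
                "MaxAttempts": 3,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}],
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}
print(json.dumps(state_machine, indent=2))
```

EventBridge has no equivalent of `Retry`/`Catch` state — it only routes; the retried Glue job must itself be idempotent, as the mistakes above note.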

Detect & Fix Data Skew: broadcast, shuffle, salting, repartition

Detect uneven partitioning/task times, choose broadcast vs shuffle, and use salting or repartitioning to rebalance.

Key Insight

Broadcast only when the small table fits executor memory; salting redistributes hot keys; always confirm optimizer behavior with the explain plan and task metrics.

Often Confused With

Broadcast joins • Shuffle joins • Repartitioning

Common Mistakes

  • Using broadcast join when the 'small' table exceeds executor memory
  • Adding partitions instead of changing the join key or applying salting
  • Trusting optimizer hints without validating the explain plan and task metrics
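Salting is just appending a bounded random suffix to the hot key so its rows spread over several partitions. A framework-free sketch of the idea (`HOT_KEYS` and the bucket count are illustrative; in Spark you would add a salt column and join on key + salt, duplicating the small side across all salt values):

```python
import random

HOT_KEYS = {"tenant-42"}  # keys known (from task metrics) to dominate
SALT_BUCKETS = 8

def salted_key(key: str) -> str:
    """Spread a hot key across SALT_BUCKETS synthetic partitions;
    cold keys pass through unchanged."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

keys = [salted_key("tenant-42") for _ in range(1000)]
print(len(set(keys)))  # up to 8 distinct partitions instead of 1
```

The cost is the replicated small side (one copy per salt bucket), so keep `SALT_BUCKETS` only as large as the skew requires.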


Data Store Management

26%

S3 — Data Lake Object Store

Immutable, key-based object storage for data lakes; you must pick formats, layout, lifecycle and cost controls.

Key Insight

S3 is object (not POSIX): objects are immutable and key-addressed. Performance & cost come from format, partitioning, and lifecycle rules.

Often Confused With

EBS • EFS

Common Mistakes

  • Treating S3 like a POSIX/block filesystem (expecting file locks or in-place updates).
  • Assuming built-in indexing/schema or fast queries without columnar formats and partitioning.
  • Thinking S3 provides multi-object atomic transactions or that costs are only stored bytes.
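"Performance comes from layout" usually means Hive-style partition prefixes, which let engines prune whole prefixes instead of listing every object. A small sketch of such a key builder (the `lake/` prefix and table name are hypothetical):

```python
from datetime import date

def partitioned_key(table: str, d: date, file_id: str) -> str:
    """Build a Hive-style partitioned S3 key (key=value path segments)
    so Athena/Glue can prune partitions by prefix."""
    return (f"lake/{table}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/{file_id}.parquet")

print(partitioned_key("orders", date(2024, 3, 7), "part-0001"))
# lake/orders/year=2024/month=03/day=07/part-0001.parquet
```

Zero-padded month/day keeps lexicographic and chronological order aligned, which matters for prefix listings and partition projection patterns.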

DynamoDB Streams (Table CDC)

Per-item change stream for DynamoDB; use with Lambda/Kinesis for CDC — TTL deletions appear as REMOVE records.

Key Insight

Streams are at-least-once with a short retention window, preserve order per shard/partition key, and show TTL expirations as REMOVE events.

Often Confused With

Kinesis Data Streams • Kinesis Data Firehose

Common Mistakes

  • Expecting exactly-once delivery — Streams are at-least-once (handle duplicates).
  • Assuming change records are retained indefinitely — retention window is short (~24h).
  • Believing TTL deletes are instantaneous or invisible — they're asynchronous and appear as REMOVE records.
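TTL expirations can be told apart from user deletes: the stream record for a TTL delete carries a `userIdentity` naming the DynamoDB service principal. A small filter sketch over the documented record shape:

```python
def is_ttl_expiry(record: dict) -> bool:
    """True when a Streams REMOVE record was produced by TTL expiry
    (service principal) rather than an application delete."""
    ident = record.get("userIdentity", {})
    return (
        record.get("eventName") == "REMOVE"
        and ident.get("type") == "Service"
        and ident.get("principalId") == "dynamodb.amazonaws.com"
    )

ttl_rec = {"eventName": "REMOVE",
           "userIdentity": {"type": "Service",
                            "principalId": "dynamodb.amazonaws.com"}}
user_rec = {"eventName": "REMOVE"}
print(is_ttl_expiry(ttl_rec), is_ttl_expiry(user_rec))  # True False
```

A consumer archiving deletes, for instance, can route TTL expiries to cold storage while alerting on unexpected manual REMOVEs.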

Data Catalogs — Metadata Index & Governance

Centralized metadata index linking schemas, lineage, owners and business terms for discovery and governance.

Key Insight

Stores metadata and pointers (not raw data); automated crawlers need config and human curation to stay accurate.

Often Confused With

Data lake • Data dictionary • Metadata repository

Common Mistakes

  • Thinking crawlers alone guarantee complete, correct metadata
  • Treating the catalog as a data store — it holds metadata and pointers only
  • Equating lineage with audit logs

Glue Crawlers — Schema Inference

Sample-based scanners that infer columns, types and partitions into the Glue Data Catalog; useful but fallible.

Key Insight

Inference is sample-driven and pattern-based — sampling bias, ambiguous formats, and path patterns cause wrong types or missed partitions.

Often Confused With

AWS Glue ETL jobs • Partition discovery

Common Mistakes

  • Assuming crawlers infer perfect types/nullability — sampling and ambiguous formats can misclassify
  • Believing crawlers modify or delete source files — they only read and update the Data Catalog
  • Expecting partition discovery with zero config — path patterns or classifiers are often required

Data Lifecycle: Retention, Tiering & TTL

Policy rules to classify, promote/demote, and expire data (hot→warm→cold→archive), balancing cost, latency, SLAs, and legal requirements.

Key Insight

Age is only one signal—combine access frequency, SLA/business value, retrieval cost, and legal holds to decide tiering

Often Confused With

S3 Lifecycle Policies • Backup/Retention Policies • Database TTL

Common Mistakes

  • Assuming cold/archive is always cheapest—ignores retrieval and per-request fees
  • Believing colder tiers mean reduced durability or no immediate access
  • Applying one static policy to all datasets without monitoring or reclassification
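The "age is only one signal" point can be made concrete with a toy tiering rule that layers the signals in priority order. The thresholds below are illustrative, not AWS recommendations:

```python
def pick_tier(days_since_access: int, monthly_reads: int,
              legal_hold: bool) -> str:
    """Toy tiering decision: legal holds trump everything, access
    frequency trumps age, and age alone only demotes cold data."""
    if legal_hold:
        return "retain-current-tier"
    if monthly_reads > 10:
        return "hot"            # frequently read, whatever its age
    if days_since_access < 90:
        return "warm"
    if days_since_access < 365:
        return "cold"
    return "archive"

print(pick_tier(400, 0, False))   # archive
print(pick_tier(400, 50, False))  # hot: frequency overrides age
```

Note the second call: a year-old dataset read 50 times a month stays hot, which is exactly the case an age-only policy gets wrong.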

S3 Lifecycle & Intelligent‑Tiering

Combine lifecycle rules and Intelligent‑Tiering to automate storage-class transitions, expirations, version handling, and cost control.

Key Insight

Intelligent‑Tiering auto-optimizes across its supported access tiers; Glacier/Deep Archive require explicit lifecycle rules or the opt‑in archive access tiers.

Often Confused With

Glacier / Glacier Deep Archive • S3 Object Lock • Versioning Lifecycle

Common Mistakes

  • Expecting lifecycle rules to bypass Object Lock or legal holds — they do not
  • Assuming lifecycle actions take effect instantly once a rule is created
  • Thinking enabling versioning alone prevents object deletion forever
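A lifecycle configuration in the shape boto3's `put_bucket_lifecycle_configuration` expects ties the pieces together: transitions, current-version expiry, and noncurrent-version cleanup. Prefix and day counts are illustrative:

```python
# Lifecycle rule sketch: tier logs/ down over time, expire after a
# year, and clean up old versions left behind by versioning.
lifecycle = {
    "Rules": [{
        "ID": "logs-tiering",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
    }]
}
```

Note the last element: without `NoncurrentVersionExpiration`, versioned buckets accumulate (and bill for) every superseded object version indefinitely.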

Row vs Columnar: CSV → Parquet

Convert row formats (CSV/Avro) to Parquet to shrink storage and speed analytics via column pruning and predicate pushdown.

Key Insight

Parquet stores data in row‑groups and column‑chunks; predicate pushdown + column pruning drastically cut S3 I/O — but decompression and decoding add CPU cost.

Often Confused With

Avro • ORC • CSV

Common Mistakes

  • Assuming columnar always wins—bad for small‑row OLTP or frequent single‑row writes.
  • Believing higher compression always lowers query cost—ignores CPU/decompression overhead.
  • Thinking Parquet files are mutable—updates require rewrite/merge/compaction, not in‑place edits.
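A back-of-envelope model shows why pruning dominates: scanned bytes shrink multiplicatively with column selection and partition selectivity. The equal-column-size assumption is a simplification real data rarely matches:

```python
def scanned_gb(total_gb: float, cols_total: int, cols_read: int,
               partition_fraction: float) -> float:
    """Rough scan estimate for a columnar, partitioned layout: only
    selected columns in matching partitions are read. Assumes roughly
    equal column sizes (a simplification)."""
    return total_gb * (cols_read / cols_total) * partition_fraction

# 1 TB table, reading 4 of 40 columns, one day out of a year:
print(round(scanned_gb(1000, 40, 4, 1 / 365), 3))  # ~0.274 GB
```

The same query against unpartitioned CSV scans the full 1,000 GB: a ~3,650x difference, which is what shows up directly in Athena's per-TB bill.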

Schema Evolution & Data Modeling

Pick schema‑on‑write vs schema‑on‑read, partitioning, indexing, compression and denormalization based on dominant query SLAs and cost.

Key Insight

Schema‑on‑write = higher ingest/validation cost but predictable, fast reads; schema‑on‑read = flexible but higher runtime/query cost — choose by read SLAs.

Often Confused With

Schema-on-read • Schema-on-write • Semi-structured vs Unstructured

Common Mistakes

  • Assuming NoSQL requires no schema or validation—implicit schemas and contracts are still needed.
  • Always normalizing to save storage—can cripple analytic read performance; denormalize where queries dominate.
  • Thinking schema‑on‑read is always cheaper—frequent queries pay parsing and runtime costs.

Data Operations and Support

22%

AWS Lambda — Serverless Functions

Event-driven, short-lived serverless compute for stateless tasks that scale with events.

Key Insight

Use Lambda for stateless, low-latency tasks — enforce the 15‑min timeout, treat /tmp as ephemeral, and plan for concurrency/quota limits.

Often Confused With

AWS Fargate • Amazon EC2 • AWS Step Functions

Common Mistakes

  • Assuming unlimited runtime — Lambda maximum is 15 minutes.
  • Relying on /tmp as persistent storage across invocations.
  • Ignoring account/regional concurrency limits and throttling.
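The "/tmp is ephemeral" point translates into a specific coding pattern: use it as a best-effort warm-start cache, never as durable state. A sketch (the cache path and payload are illustrative):

```python
import json
import os
import time

CACHE = "/tmp/lookup.json"  # may survive between warm invocations — or not

def load_lookup() -> dict:
    """Best-effort /tmp cache: reuse it if the execution environment
    was kept warm, rebuild it transparently if not."""
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)
    data = {"loaded_at": time.time()}  # stand-in for a real fetch
    with open(CACHE, "w") as f:
        json.dump(data, f)
    return data

def handler(event, context):
    lookup = load_lookup()
    return {"statusCode": 200, "warm": "loaded_at" in lookup}
```

Anything that must outlive an invocation belongs in S3, DynamoDB, or EFS; the handler above works identically whether or not /tmp was wiped.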

Lambda ETL — Stream & Record Transforms

Serverless record-level ETL and stream consumers using Event Source Mapping, batching, and async Destinations.

Key Insight

Good for lightweight, stateless transforms and small-window streaming; for shuffle-heavy or large-state ETL use Glue/EMR, and design for idempotency and backpressure.

Often Confused With

AWS Glue • Amazon EMR • Kinesis Data Analytics

Common Mistakes

  • Expecting Lambda to replace Spark/EMR for large, shuffle-heavy ETL.
  • Assuming exactly-once processing — ESM yields at-least-once; build idempotency.
  • Using Destinations for synchronous calls — Destinations apply only to async invocations.
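With `ReportBatchItemFailures` enabled on the event source mapping, a stream consumer can return only the failed sequence numbers so the ESM retries those instead of the whole batch. A sketch over the Kinesis record shape (`process` is a hypothetical, idempotent worker):

```python
def process(rec: dict) -> None:
    """Hypothetical record worker — must be idempotent, since the
    event source mapping delivers at-least-once."""
    if rec["kinesis"]["data"] == "bad":
        raise ValueError("poison record")

def handler(event, context):
    """Partial-batch-response handler: collect failures and return them
    in the shape the ESM expects (requires ReportBatchItemFailures)."""
    failures = []
    for rec in event["Records"]:
        try:
            process(rec)
        except Exception:
            failures.append(
                {"itemIdentifier": rec["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}

event = {"Records": [
    {"kinesis": {"sequenceNumber": "1", "data": "ok"}},
    {"kinesis": {"sequenceNumber": "2", "data": "bad"}},
]}
print(handler(event, None))  # {'batchItemFailures': [{'itemIdentifier': '2'}]}
```

Without this, one poison record forces the whole batch to be retried, reprocessing the successful records too — which is exactly why idempotency comes first.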

Redshift — Columnar MPP Warehouse (RA3 / Provisioned)

Managed columnar MPP SQL warehouse for BI/ELT; tune via distribution/sort keys, compression, WLM, and materialized views

Key Insight

Pick distribution keys to colocate join columns (avoid skew); sort keys cut I/O for range scans but need maintenance; WLM controls concurrency/latency

Often Confused With

Amazon Athena • Redshift Spectrum

Common Mistakes

  • Choosing any column as a distribution key — leads to severe data skew and slow joins
  • Assuming sort keys keep rows perfectly ordered after updates — VACUUM/maintenance required
  • Thinking cluster size alone fixes latency — WLM queue/configuration directly impacts concurrency and response time
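The key-choice advice maps directly onto DDL. An illustrative sketch (table and column names are hypothetical): distribute on the join column so matching rows colocate, sort on the range-filter column so scans skip blocks:

```python
# Illustrative Redshift DDL held as a string for inspection:
# DISTKEY colocates joins on customer_id; SORTKEY lets range scans
# on order_date skip disk blocks via zone maps.
ddl = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
"""
print(ddl)
```

If `customer_id` were heavily skewed (one customer owning most rows), this DISTKEY would concentrate those rows on one slice — the "any column works" mistake above.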

Athena — Serverless SQL on S3 (Presto/Trino)

Serverless interactive SQL that queries data in S3 (schema-on-read); billed per TB scanned and uses Glue for metadata

Key Insight

Athena queries files in place — performance hinges on file format, partitioning, compression, and file size, not 'serverless magic'

Often Confused With

Amazon Redshift • Amazon EMR

Common Mistakes

  • Believing Athena stores or manages your data — it only queries S3 in place
  • Assuming Athena is always faster than warehouses — speed depends on file layout and concurrency
  • Expecting automatic cataloging/optimization — you must define or crawl metadata and optimize file layout
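Because Athena bills on bytes scanned (with a 10 MB per-query minimum), the file-layout point is really a cost formula. A sketch, treating the common $5/TB list price as an assumption since pricing varies by region:

```python
def athena_cost_usd(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate an Athena query's cost from bytes scanned, applying
    the 10 MB per-query minimum. price_per_tb is an assumption."""
    tb = max(bytes_scanned, 10 * 1024**2) / 1024**4
    return round(tb * price_per_tb, 6)

full_scan = athena_cost_usd(500 * 1024**3)  # 500 GB of raw CSV
pruned    = athena_cost_usd(5 * 1024**3)    # Parquet + partition pruning
print(full_scan, pruned)  # ~2.44 vs ~0.024
```

The 100x cost gap comes entirely from layout, not from Athena itself — the same "serverless" engine ran both queries.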

Pipeline Orchestration & Resilience

Coordinate automated pipelines with idempotency, retries, checkpoints and multi‑region failover to meet RTO/RPO.

Key Insight

Design for idempotency + exponential backoff + state checkpoints; RTO/RPO determine replay vs automated failover.

Often Confused With

High Availability • Disaster Recovery • CI/CD

Common Mistakes

  • Blind retries without idempotency or backoff cause duplicate processing or amplify overload.
  • Single‑region replication ≠ full resiliency — region failures still break SLAs without multi‑region strategy.
  • Monitoring/alerts alone don't recover pipelines — you need automated remediation or practiced runbooks.
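"Exponential backoff" specifically means capped growth plus jitter, so a fleet of failed workers does not retry in lockstep. A minimal full-jitter sketch (base and cap values are illustrative):

```python
import random

def backoff_delays(attempts: int, base: float = 0.5,
                   cap: float = 30.0) -> list[float]:
    """Full-jitter exponential backoff: the ceiling doubles each
    attempt up to `cap`, and each delay is a uniform draw below the
    ceiling, de-synchronizing concurrent retriers."""
    return [random.uniform(0, min(cap, base * 2 ** n))
            for n in range(attempts)]

delays = backoff_delays(5)
print([round(d, 2) for d in delays])
```

Pairing this with idempotent operations is what makes retries safe: without idempotency, each retry is a potential duplicate write rather than a recovery.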

Glue Job Monitoring: Metrics, Logs & Bookmarks

Use CloudWatch metrics, continuous logs and Glue bookmarks to detect failures, diagnose root causes, and tune DPU/IO.

Key Insight

Bookmarks record processed state but don't guarantee idempotency; continuous logging gives executor traces but adds cost, and Glue metrics are often coarse.

Often Confused With

Job Bookmarks • Continuous Logging • CloudWatch Metrics

Common Mistakes

  • Assuming job bookmarks make a job fully idempotent — they only track state and can still skip or duplicate data.
  • Expecting continuous logging to be on and free — it's opt‑in and adds cost/latency/volume.
  • Interpreting high DPU use as a signal to add DPUs — bottlenecks may be skew, I/O, or GC, not compute.

Glue Data Quality — In‑Job Checks & Error Handling

Deequ‑based Glue data checks and in‑job validations to detect, route, or stop bad records; tune for cost vs latency.

Key Insight

A pass only covers declared rules and sampled stats; in‑transit checks catch bad rows earlier but add CPU/latency — use routing, quarantine, or fail‑fast handling.

Often Confused With

Deequ • At‑rest validation • Glue Data Catalog

Common Mistakes

  • Treating a rule pass as proof of full downstream correctness.
  • Swapping in‑transit and at‑rest checks without changing handling or SLA.
  • Assuming in‑job validation must abort on the first bad record.
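"Route instead of abort" is a simple partitioning pattern: rows failing any rule are diverted to a quarantine set with the names of the rules they broke. A framework-free sketch (the rules themselves are illustrative):

```python
def route(records: list[dict], rules: dict) -> tuple[list, list]:
    """Split records into (good, quarantined) instead of failing the
    job; quarantined entries carry the names of the violated rules."""
    good, quarantine = [], []
    for rec in records:
        errs = [name for name, check in rules.items() if not check(rec)]
        if errs:
            quarantine.append((rec, errs))
        else:
            good.append((rec, errs))
    return good, quarantine

rules = {
    "amount_positive": lambda r: r.get("amount", 0) > 0,
    "has_id": lambda r: "id" in r,
}
good, bad = route([{"id": 1, "amount": 5}, {"amount": -2}], rules)
print(len(good), len(bad))  # 1 1
```

The quarantine output feeds a separate remediation or replay pipeline; the main flow keeps its SLA while bad rows wait for inspection.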

Deequ & DQDL — Spark Data Quality Checks

Open‑source Spark library (Deequ) + DQDL DSL to declare profiling and constraint checks; it detects and reports issues but does not fix them.

Key Insight

Deequ computes metrics and evaluates constraints on Spark/JVM; it reports violations (no auto‑fix). Use separate remediation pipelines; Glue Data Quality merely surfaces the results.

Often Confused With

Glue Data Quality • ETL transformation code • Automated data‑fix tools

Common Mistakes

  • Thinking Deequ transforms or writes rows instead of only computing metrics.
  • Assuming Deequ is proprietary to AWS or only runs inside Glue.
  • Using DQDL as a general-purpose ETL language instead of a declarative rule DSL.

Data Security and Governance

18%

AWS IAM — Identities & JSON Policies

Global service to create users, groups, roles and JSON policies for API auth/authz; favor roles and temporary creds.

Key Insight

Policies are evaluated together (explicit deny wins); use roles/temporary creds and least‑privilege — never daily root keys.

Often Confused With

IAM role • Resource-based policies

Common Mistakes

  • Using root or long‑lived IAM user credentials for routine tasks (MFA doesn't make this best practice).
  • Attaching broad managed policies (e.g., AdministratorAccess) instead of scoping least privilege.
  • Thinking IAM is regional or that services can assume groups (groups only bundle users).
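Least privilege in practice means scoping both actions and resources. A sketch of a read-only policy for one prefix of one bucket (bucket and prefix names are hypothetical); note that `ListBucket` targets the bucket ARN while `GetObject` targets object ARNs, a distinction broad managed policies hide:

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read objects only under raw/ in this one bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-lake/raw/*",
        },
        {
            # Listing is a bucket-level action, scoped by prefix.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["raw/*"]}},
        },
    ],
}
print(json.dumps(policy, indent=2))
```

Attach this to a role assumed with temporary credentials, not to a long-lived user, per the insight above.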

IAM Role — Temporary, Assumable Identity

An assignable identity with a permission policy and a trust policy; principals assume it to get temporary STS creds.

Key Insight

Trust policy = who can assume; permission policy = what the session can do — assuming yields temporary credentials only.

Often Confused With

IAM user • Instance profile

Common Mistakes

  • Treating a role like a user that holds long‑lived access keys.
  • Confusing the trust policy with permission grants (trust only permits assumption).
  • Assuming any service or account can assume a role without configuring the trust principal.

Lake Formation: Named Grants, LF‑Tags & Row Filters

Enforce table/column/row permissions in the Glue Data Catalog via Lake Formation grants, LF‑Tags (TBAC) and data filters

Key Insight

Lake Formation grants (not IAM alone) gate Glue-catalog access; LF‑Tags give tag-based inheritance; row filters restrict rows (not mask).

Often Confused With

IAM policies • S3 bucket policies • Glue resource policies

Common Mistakes

  • Assuming IAM policies alone block Glue/Athena access — Lake Formation grants may also be required
  • Believing LF‑Tags fully replace named grants — TBAC complements, doesn’t always substitute
  • Treating row-level filters like masking — they exclude rows; masking requires transformation

Least‑Privilege: IAM + Lake Formation + Boundaries + Secrets

Enforce least privilege by combining scoped IAM policies, permission boundaries, role separation, Lake Formation grants, and secrets management.

Key Insight

AuthZ is layered: IAM role scoping + Lake Formation catalog grants + permission boundaries/secrets; mis‑scoped explicit denies can block access.

Often Confused With

Authentication • IAM policies • Data tagging enforcement

Common Mistakes

  • Assuming authentication equals authorization — being signed in ≠ having data access
  • Believing IAM policies and Lake Formation perms are interchangeable
  • Thinking classifying/tagging data auto-enforces access without corresponding policies

KMS & Envelope Encryption (Masking/Anonymization)

Use KMS CMKs to wrap per-object DEKs; envelope encryption + masking reduces KMS calls and limits exposure.

Key Insight

Wrap per-object DEKs with CMKs—DEKs cut KMS usage but must not be reused; encrypted DEKs can safely accompany ciphertext.

Often Confused With

SSE-KMS • Client-side encryption

Common Mistakes

  • Thinking envelope encryption removes KMS — CMKs still wrap/unwrap DEKs.
  • Reusing one DEK to save cost — increases blast radius; use per-object/file DEKs.
  • Believing encrypted DEK exposure is unsafe — the encrypted DEK can be stored; only plaintext DEK must be protected.
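The envelope flow itself is simple to sketch. The class below is a deliberate stand-in for KMS with no real cryptography — in AWS you would call `kms:GenerateDataKey` and `kms:Decrypt`, and the CMK never leaves KMS — but the wrap/store/unwrap shape is the same:

```python
import secrets

class FakeKms:
    """Stand-in for KMS to show the envelope-encryption flow only.
    NOT real cryptography: it just maps opaque wrapped blobs to keys."""
    def __init__(self):
        self._store = {}  # wrapped-DEK blob -> plaintext DEK

    def generate_data_key(self):
        plaintext = secrets.token_bytes(32)  # fresh per-object DEK
        wrapped = secrets.token_hex(8)       # opaque "ciphertext" blob
        self._store[wrapped] = plaintext
        return plaintext, wrapped

    def decrypt(self, wrapped):
        return self._store[wrapped]

kms = FakeKms()
dek, wrapped_dek = kms.generate_data_key()
# Encrypt the object locally with `dek` (e.g. AES-GCM), then discard
# the plaintext DEK; store `wrapped_dek` next to the ciphertext —
# the wrapped form is safe at rest.
assert kms.decrypt(wrapped_dek) == dek  # one KMS call per read, not per byte
```

This is why envelope encryption cuts KMS usage: KMS is invoked once per object (wrap/unwrap), while the bulk encryption happens locally with the DEK.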

S3 Data Lake: Encryption, Access & Lifecycle

Design S3 with correct encryption (SSE options/client-side), tight policies, prefixes, lifecycle rules, versioning, and Object Lock.

Key Insight

Encryption doesn't change IAM/ACLs; default bucket encryption affects new uploads only; Object Lock is required for immutability.

Often Confused With

SSE-S3 • SSE-KMS • S3 Versioning

Common Mistakes

  • Assuming server-side encryption prevents unauthorized reads — IAM/policies still control access.
  • Expecting default bucket encryption to affect existing objects — it applies only to new uploads.
  • Believing versioning alone ensures immutability — use Object Lock/retention for compliance.

CloudWatch Logs & Insights — Tamper‑Aware Pipeline Monitor

Centralize CloudTrail/S3/Lambda logs in CloudWatch Logs + Insights to detect tampering and access/config changes for audits.

Key Insight

Monitor access, config and data events across regions with immutable storage and log‑file validation — encryption alone won't show tampering

Often Confused With

AWS CloudTrail • AWS Config

Common Mistakes

  • Assuming CloudTrail records S3 object/data‑plane events by default — data events must be explicitly enabled
  • Treating encryption or centralization as integrity checks — they don't detect deletions or config tampering
  • Keeping logs in the producer account/bucket and expecting protection — a compromised producer can alter its own logs

AWS CloudTrail — API Audit Trail

Captures AWS control‑plane API activity; delivers events to S3, CloudWatch Logs, or CloudTrail Lake for forensic queries.

Key Insight

CloudTrail records API (management) events; S3/KMS object‑level (data) events are optional and cost extra — enable multi‑region trails, log‑file validation, and data events as needed.


Often Confused With

Amazon CloudWatch Logs • CloudTrail Lake

Common Mistakes

  • Expecting CloudTrail to capture application or OS logs — it records AWS API activity only
  • Failing to enable data events or multi‑region logging — you'll miss S3/KMS object‑level and cross‑region activity
  • Relying on CloudTrail alone for real‑time alerts — integrate with CloudWatch/EventBridge/GuardDuty for notifications

KMS — CMK/DEK, Grants & Rotation

CMK types and policies; use envelope encryption (DEK encrypted by CMK); manage grants, rotation, and multi‑Region keys.

Key Insight

CMKs gate usage — IAM + key policy/grants both matter; envelope encryption keeps DEKs transient; rotation creates new key versions, not instant data re‑encryption.

Often Confused With

AWS‑managed CMKs (aws/*) • Client‑side encryption • S3 SSE options (SSE‑S3 / SSE‑KMS)

Common Mistakes

  • Treating AWS‑managed CMKs as having the same control as customer CMKs
  • Thinking CMK rotation immediately prevents decrypting already encrypted data
  • Assuming IAM permissions alone let you decrypt without key policy/grant

Data Lineage — Provenance & Audit Trail

Capture end‑to‑end provenance (source, transform, actor, timestamp, row ID) to prove SARs, deletions, and impact for audits.

Key Insight

Regulatory proofs need linkable, row/event‑level lineage combined with retention, access controls and deletion workflows — schema maps alone often fall short.

Often Confused With

Audit logs • Data catalog (Glue / Lake Formation) • Data quality metrics

Common Mistakes

  • Relying on generic logs that lack linkable lineage fields (source/row/actor/timestamp)
  • Assuming schema‑level lineage satisfies SARs or deletion requests
  • Thinking lineage automatically enforces data quality or deletion without workflows

© 2026 Mocka.ai - Your Exam Preparation Partner


Certification Overview

Duration: 120 min
Questions: 65
Passing: 72%
Level: Intermediate

Cheat Sheet Content

34 Key Concepts
4 Exam Domains
