Study Path Agent
System Design with Machine Learning
122 topics across 7 chapters
Chapter 1
System Design Foundations (for ML systems)
1
Requirements, constraints, and trade-offs (latency, cost, accuracy, freshness)
2
Design docs and communication (diagrams, APIs, assumptions, failure modes)
3
Distributed Systems Basics
4 subtopics
4
Consistency models, CAP, and read/write trade-offs
5
Sharding, partitioning, replication, and rebalancing
6
Time, ordering, idempotency, retries, and deduplication
7
Fault tolerance patterns (timeouts, circuit breakers, bulkheads)
8
Storage Systems (choosing the right datastore)
4 subtopics
9
Relational modeling, indexes, transactions, and query planning basics
10
NoSQL patterns: key-value, document, wide-column (when and why)
11
Data lake vs warehouse concepts (batch analytics foundations)
12
Search and vector databases (ANN indexes, recall/latency trade-offs)
13
Messaging and Streaming Basics
4 subtopics
14
Queues vs pub/sub (work distribution vs fan-out)
15
Kafka-style concepts: partitions, consumer groups, offsets
16
Delivery semantics (at-most/at-least/exactly-once) and implications
17
Stream processing basics (windows, watermarks, late events)
18
Caching and Performance Primitives
4 subtopics
19
Caching patterns (cache-aside, read-through, write-through, write-back)
20
Cache invalidation and consistency strategies (TTL, stampede protection)
21
CDN and edge concepts (latency reduction, global distribution)
22
Rate limiting, load shedding, backpressure, and graceful degradation
23
API Design and Service Interfaces
4 subtopics
24
REST vs gRPC and contract-driven APIs
25
Pagination, filtering, batching, and async APIs for heavy workloads
26
Versioning and backward compatibility strategies
27
AuthN/AuthZ integration points (tokens, scopes, service identity)
Chapter 2
ML and Data Fundamentals (for system designers)
28
Problem framing: objective, constraints, and success metrics
29
Data quality, labeling, bias, and sampling pitfalls
30
Evaluation basics: train/val/test, leakage, confidence intervals
31
Feature engineering basics (numerical, categorical, text, embeddings)
32
Model families and when to use them
4 subtopics
33
Linear models, tree-based models, and calibration basics
34
Deep learning basics (training dynamics, overfitting, generalization)
35
Ranking and recommenders (two-tower, matrix factorization, learning-to-rank)
36
LLM basics: prompting vs fine-tuning, context limits, hallucinations
37
Offline vs online inference trade-offs (latency, freshness, cost)
38
Responsible AI basics (fairness, transparency, human-in-the-loop)
Chapter 3
Data and Feature Pipelines
39
Ingestion design (batch vs streaming, CDC, event modeling)
40
Data validation (schema checks, constraints, anomaly detection)
41
ETL/ELT orchestration and pipeline correctness
4 subtopics
42
Orchestrators and DAGs (scheduling, dependencies, retries)
43
Idempotent pipelines, backfills, and reproducible reruns
44
Handling late/dirty data and schema evolution safely
45
Cost optimization (partitioning, file sizes, incremental processing)
46
Data versioning and lineage (datasets, code, configs, artifacts)
47
Feature stores and feature management
4 subtopics
48
Feature definitions, ownership, reuse, and documentation
49
Training set generation and point-in-time joins
50
Point-in-time correctness (preventing training/serving skew)
51
Online feature serving (latency budgets, caching, TTLs)
52
Online/offline feature consistency and skew detection
53
Real-time joins and freshness (state stores, enrichment, SLAs)
Chapter 4
Training and Experimentation Infrastructure
54
Experiment tracking and reproducibility (metrics, configs, seeds, artifacts)
55
Training data preparation (sampling, weighting, class imbalance)
56
Distributed training fundamentals
4 subtopics
57
Data parallel vs model parallel vs pipeline parallel (mental models)
58
Checkpointing, restartability, and handling preemptions
59
Mixed precision and throughput bottlenecks (I/O vs compute)
60
Distributed training stack concepts (collectives, parameter servers)
61
Hyperparameter tuning and reliable comparisons
4 subtopics
62
Search strategies (grid, random, Bayesian) and when to use each
63
Early stopping and pruning without biasing results
64
Parallelization and scheduling (trial concurrency, quotas, fairness)
65
Preventing leakage and over-tuning on validation/test sets
66
Compute management (GPU pools, quotas, scheduling, spot/preemptible)
67
Model registry and artifact management (versions, metadata, lineage)
68
CI/CD for ML (tests, validation gates, reproducible builds)
Chapter 5
Model Serving and Online ML Systems
69
Serving architectures (online, async, batch, edge) and SLA design
4 subtopics
70
Synchronous vs asynchronous inference (queues, callbacks, polling)
71
Request/response schemas (inputs, outputs, errors, metadata, tracing)
72
Multi-model routing and traffic splitting (by tenant, region, cohort)
73
Fallbacks and graceful degradation (stale model, heuristic, cached)
74
Model packaging and dependency management (containers, runtimes, ABI)
75
Low-latency inference optimization
4 subtopics
76
Quantization, pruning, distillation (accuracy vs latency trade-offs)
77
Hardware selection (CPU/GPU/accelerators) and concurrency models
78
Caching inference outputs and embeddings safely (keying, TTL, privacy)
79
Warmup, model loading, memory management, and tail latency controls
80
Scaling inference (autoscaling, batching, pooling, multi-tenancy)
81
Experimentation and safe rollouts (A/B, canary, shadow, holdouts)
82
Retrieval + ranking system design (search/recs)
4 subtopics
83
Candidate generation (rules, ANN, two-stage architectures)
84
Online feature computation for ranking (budgets, caching, consistency)
Search and vector databases (ANN indexes, recall/latency trade-offs) (see Chapter 1)
85
Re-ranking, diversity, and multi-objective optimization (business + user)
86
LLM serving patterns (RAG, tools, guardrails)
4 subtopics
87
RAG architecture (chunking, retrieval, reranking, citations, caching)
88
Tool/function calling (schemas, sandboxing, timeouts, determinism)
89
Guardrails and policy enforcement (filters, routing, refusals, PII rules)
90
LLM security and safety (prompt injection, data exfiltration, jailbreaks)
Chapter 6
Operations: Reliability, Monitoring, and MLOps
91
SLOs/SLIs for ML products (latency, availability, quality, freshness)
92
Observability foundations (logs, metrics, traces, correlation IDs)
93
Model monitoring and feedback loops
4 subtopics
94
Data drift vs concept drift (detection signals and limitations)
95
Ground truth collection (delayed labels, human review, weak labels)
96
Alerting design (thresholds, burn rates, noise reduction, on-call)
97
Mitigating feedback loops and unintended behavior (exploration, guardrails)
Experimentation and safe rollouts (A/B, canary, shadow, holdouts) (see Chapter 5)
98
Incident response for ML systems (triage, rollback, data/model blame)
99
Cost management and capacity planning (GPU cost, caching, batching)
100
Case studies and system design interview practice (ML-focused)
4 subtopics
101
Design exercise: end-to-end recommender (retrieval, ranking, evaluation)
102
Design exercise: real-time fraud detection (streaming, latency, labels)
103
Design exercise: LLM customer support bot (RAG, safety, monitoring, cost)
104
Interview frameworks and pitfalls (assumptions, bottlenecks, measurement)
Chapter 7
Security, Privacy, and Governance for ML Systems
105
Threat modeling for ML systems (assets, adversaries, attack surfaces)
106
Privacy engineering and PII handling
4 subtopics
107
De-identification/anonymization basics and common failure modes
108
Data retention, deletion, and subject rights workflows
109
Secure data sharing (least privilege, clean rooms, aggregate reporting)
110
Differential privacy concepts (noise, privacy budget, utility trade-offs)
111
Access control and secrets (service identity, key management, rotation)
112
Compliance and auditability (logging, approvals, traceability, evidence)
113
ML-specific threats (data poisoning, evasion, model stealing, membership inference)
LLM security and safety (prompt injection, data exfiltration, jailbreaks) (see Chapter 5)
114
Governance workflows (review gates, model cards, data approvals, change control)