# Understanding Bottlenecks
## What is a Bottleneck?

A bottleneck is the component that limits your system’s overall performance. No matter how fast the other parts are, the system can only go as fast as its slowest component.
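For intuition, the end-to-end capacity of a serial pipeline is the minimum of its stages' capacities; a tiny sketch (the stage names and numbers are invented for illustration):

```python
# Hypothetical per-stage capacities in requests/second
stages = {"load_balancer": 10_000, "app_server": 2_500, "database": 800}

# A serial pipeline can only go as fast as its slowest stage
bottleneck = min(stages, key=stages.get)
capacity = stages[bottleneck]

print(f"Bottleneck: {bottleneck} at {capacity} req/s")
# Speeding up any non-bottleneck stage leaves end-to-end capacity unchanged
```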
## Types of Bottlenecks

### 1. CPU Bottleneck

**Symptoms:** High CPU usage, slow computations
```python
# ❌ CPU-bound operation blocking the event loop
def calculate_fibonacci(n: int) -> int:
    if n <= 1:
        return n
    return calculate_fibonacci(n - 1) + calculate_fibonacci(n - 2)

# ✅ Solution: Use an efficient algorithm or offload to a worker
from functools import lru_cache

@lru_cache(maxsize=1000)
def calculate_fibonacci_cached(n: int) -> int:
    if n <= 1:
        return n
    return calculate_fibonacci_cached(n - 1) + calculate_fibonacci_cached(n - 2)
```

```java
// ❌ CPU-bound operation
public long calculateFibonacci(int n) {
    if (n <= 1) return n;
    return calculateFibonacci(n - 1) + calculateFibonacci(n - 2);
}

// ✅ Solution: Use an efficient algorithm with memoization.
// Note: plain get/put is used instead of computeIfAbsent, because a
// recursive computeIfAbsent on a ConcurrentHashMap throws
// IllegalStateException ("Recursive update").
private final Map<Integer, Long> cache = new ConcurrentHashMap<>();

public long calculateFibonacciCached(int n) {
    if (n <= 1) return n;
    Long cached = cache.get(n);
    if (cached != null) return cached;
    long result = calculateFibonacciCached(n - 1) + calculateFibonacciCached(n - 2);
    cache.put(n, result);
    return result;
}
```

### 2. Memory Bottleneck
**Symptoms:** High memory usage, OOM errors, GC pauses
```python
# ❌ Loading everything into memory
def process_large_file(filename: str) -> list:
    with open(filename) as f:
        data = f.readlines()  # Loads entire file into memory!
    return [process(line) for line in data]

# ✅ Solution: Stream processing
def process_large_file_streaming(filename: str):
    with open(filename) as f:
        for line in f:  # Reads one line at a time
            yield process(line)
```

```java
// ❌ Loading everything into memory
public List<String> processLargeFile(String filename) throws IOException {
    List<String> lines = Files.readAllLines(Path.of(filename)); // Loads all!
    return lines.stream().map(this::process).collect(Collectors.toList());
}

// ✅ Solution: Stream processing
// (Caller should close the returned stream, e.g. with try-with-resources)
public Stream<String> processLargeFileStreaming(String filename) throws IOException {
    return Files.lines(Path.of(filename)) // Streams line by line
            .map(this::process);
}
```

### 3. Database Bottleneck
**Symptoms:** Slow queries, connection pool exhaustion, high DB CPU
```python
# ❌ N+1 query problem
def get_orders_with_items(user_id: str) -> list:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        # This runs a query for EACH order!
        order.items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
    return orders

# ✅ Solution: Use a JOIN or a batch query
def get_orders_with_items_optimized(user_id: str) -> list:
    return db.query("""
        SELECT o.*, i.*
        FROM orders o
        LEFT JOIN items i ON o.id = i.order_id
        WHERE o.user_id = ?
    """, user_id)
```

```java
// ❌ N+1 query problem
public List<Order> getOrdersWithItems(String userId) {
    List<Order> orders = db.query("SELECT * FROM orders WHERE user_id = ?", userId);
    for (Order order : orders) {
        // This runs a query for EACH order!
        order.setItems(db.query("SELECT * FROM items WHERE order_id = ?", order.getId()));
    }
    return orders;
}

// ✅ Solution: Use a JOIN or a batch query
public List<Order> getOrdersWithItemsOptimized(String userId) {
    return db.query("""
        SELECT o.*, i.*
        FROM orders o
        LEFT JOIN items i ON o.id = i.order_id
        WHERE o.user_id = ?
        """, userId);
}
```

### 4. Network/I/O Bottleneck
**Symptoms:** High network latency, waiting on external services
## Finding Bottlenecks

### Step 1: Monitor Resource Utilization

| Resource | Tool | Warning Signs |
|---|---|---|
| CPU | `top`, `htop`, APM metrics | >80% sustained |
| Memory | `free`, `vmstat` | >90% used, frequent GC |
| Disk | `iostat`, `iotop` | High I/O wait times |
| Network | `netstat`, `ss` | Packet loss, high latency |
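The same checks can be automated. A minimal sketch using only the standard library (Unix-only; the thresholds are illustrative, and a library such as psutil is the usual choice for richer metrics):

```python
import os
import shutil

def resource_warnings(cpu_threshold: float = 0.8, disk_threshold: float = 0.9) -> list[str]:
    """Return warning strings for resources that look saturated."""
    warnings = []
    # 1-minute load average divided by core count approximates CPU pressure
    load_1m = os.getloadavg()[0]
    cores = os.cpu_count() or 1
    if load_1m / cores > cpu_threshold:
        warnings.append(f"CPU: load {load_1m:.2f} across {cores} cores")
    usage = shutil.disk_usage("/")
    if usage.used / usage.total > disk_threshold:
        warnings.append(f"Disk: {usage.used / usage.total:.0%} full")
    return warnings

print(resource_warnings())
```

In practice a check like this would feed an alerting system rather than print to stdout.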
### Step 2: Profile Your Code

```python
import cProfile
import pstats

def profile_function(func):
    """Decorator to profile a function"""
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()
        result = func(*args, **kwargs)
        profiler.disable()

        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)  # Top 10 slowest

        return result
    return wrapper

@profile_function
def my_slow_function():
    # Your code here
    pass
```

```java
// Use JVM profilers: JProfiler, YourKit, or async-profiler
// Or simple timing:
public class SimpleProfiler {
    public static <T> T profile(String name, java.util.function.Supplier<T> operation) {
        long start = System.nanoTime();
        T result = operation.get();
        long duration = System.nanoTime() - start;
        System.out.printf("%s took %.2fms%n", name, duration / 1_000_000.0);
        return result;
    }
}

// Usage
User user = SimpleProfiler.profile("getUser", () -> userService.get(id));
```

### Step 3: Trace Requests End-to-End
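Production systems delegate this to a tracing SDK and backend, but the core idea — nested timed spans that attribute latency to each unit of work — can be sketched in a few lines (these names are illustrative, not any real tracing API):

```python
import time
from contextlib import contextmanager

SPANS: list[tuple[str, float]] = []  # (name, duration in seconds)

@contextmanager
def span(name: str):
    """Time a named unit of work and record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, time.perf_counter() - start))

# Simulated request with nested spans (the delays are invented)
with span("handle_request"):
    with span("auth"):
        time.sleep(0.01)
    with span("db_query"):
        time.sleep(0.05)

# A parent span's time includes its children, so inspect the leaves too
for name, duration in sorted(SPANS, key=lambda s: -s[1]):
    print(f"{name}: {duration * 1000:.1f}ms")
```

Real tracers add trace IDs propagated across service boundaries so spans from different machines can be stitched into one request timeline.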
Profiling finds slow code inside one service; in a distributed system, use tracing tools (Jaeger, Zipkin, AWS X-Ray) to follow each request across service boundaries and see which hop contributes the most latency.

## Common Solutions

| Bottleneck Type | Solutions |
|---|---|
| CPU | Optimize algorithms, caching, horizontal scaling |
| Memory | Streaming, pagination, efficient data structures |
| Database | Indexing, query optimization, caching, read replicas |
| Network | Caching, compression, connection pooling |
| External APIs | Caching, async calls, circuit breakers, timeouts |
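Caching appears in almost every row above. A minimal in-memory TTL cache sketch shows the idea (real systems typically reach for `functools.lru_cache`, Redis, or Memcached instead):

```python
import time

class TTLCache:
    """Tiny time-based cache: entries expire after ttl seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: dict = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # Expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl=0.05)
cache.set("user:42", {"name": "Ada"})
print(cache.get("user:42"))  # Hit while fresh
time.sleep(0.06)
print(cache.get("user:42"))  # None after expiry
```

The TTL is the knob that trades staleness for load reduction: a longer TTL absorbs more traffic but serves older data.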
## Advanced: Latency vs Throughput Bottlenecks

Understanding the difference is crucial for senior engineers:
| Type | Symptom | Diagnosis | Solution |
|---|---|---|---|
| Latency | High response times | Profile shows slow operations | Optimize the slow code |
| Throughput | Requests queue up | Resources saturated | Add capacity or optimize resource usage |
| Both | Slow AND queueing | Everything is red | Triage: fix biggest impact first |
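The throughput row can be made concrete with back-of-the-envelope arithmetic (all numbers hypothetical): 10 workers that each spend 200ms per request can sustain 50 req/s, and any arrival rate above that makes the queue grow without bound:

```python
def max_throughput(workers: int, service_time_s: float) -> float:
    """Requests/second the system can sustain before queueing begins."""
    return workers / service_time_s

def queue_growth(arrival_rate: float, workers: int, service_time_s: float) -> float:
    """Requests added to the queue per second (0 if under capacity)."""
    return max(0.0, arrival_rate - max_throughput(workers, service_time_s))

print(max_throughput(workers=10, service_time_s=0.2))  # 50.0 req/s capacity
print(queue_growth(80, 10, 0.2))  # 30.0 req/s piling up: a throughput bottleneck
print(queue_growth(40, 10, 0.2))  # 0.0: under capacity, latency alone matters
```

This is why a latency fix (shrinking `service_time_s`) also raises capacity, while adding workers raises capacity without touching per-request latency.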
## Deep Dive: Production Bottleneck Investigation

Here’s how senior engineers approach bottleneck investigation in production:
### Step 1: Establish Baseline Metrics

Before optimizing, you need to know what “normal” looks like. Track these key metrics:
| Category | Metrics to Track |
|---|---|
| Request Latency | P50, P95, P99 response times |
| Database | Query times by operation, connection pool usage |
| External APIs | Call durations by service, error rates |
| Resources | CPU, memory, disk I/O, network |
Tools:
- APM (Application Performance Monitoring): Datadog, New Relic, Dynatrace
- Metrics: Prometheus + Grafana
- Distributed Tracing: Jaeger, Zipkin, AWS X-Ray
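An APM computes the percentile metrics above for you, but the mechanics are simple; a nearest-rank sketch with the standard library:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in ms; note the two outliers
latencies = [12, 15, 14, 200, 16, 13, 18, 17, 950, 14]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies, p)}ms")
```

With only 10 samples, P95 and P99 both land on the single worst request — which is exactly why tail percentiles, not averages, are what you track: two outliers are invisible in the mean but dominate the tail.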
### Step 2: Identify the Hot Path

The hot path is the code that runs most frequently or consumes the most resources; profile under realistic load and focus optimization effort there first.
## Real-World Case Study: E-Commerce Checkout Bottleneck

**Situation:** The checkout page takes 8 seconds to load during sales events.
### Investigation Process

**Step 1:** Add timing instrumentation to each component.
**Step 2:** Diagnose the root cause.
The inventory service was making one database query per item:
- Cart with 100 items = 100 database queries
- Each query ~60ms = 6 seconds total
**Step 3:** Fix with a batch query.
```sql
-- BEFORE: N+1 queries (100 queries for 100 items)
SELECT * FROM inventory WHERE sku = 'SKU001';
SELECT * FROM inventory WHERE sku = 'SKU002';
-- ... 98 more queries

-- AFTER: Single batch query
SELECT * FROM inventory WHERE sku IN ('SKU001', 'SKU002', ...);
```

### Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| P50 Latency | 6.2s | 0.4s | 93% reduction |
| P99 Latency | 12s | 0.8s | 93% reduction |
| DB Queries | 103 | 5 | 95% reduction |
| Conversion Rate | 2.1% | 3.8% | 81% increase |
### Key Takeaways

- Instrument first: timing data, not guesswork, pointed at the inventory service.
- N+1 query patterns are a classic hidden bottleneck; batching turned ~100 round trips into one.
- Fixing a performance bottleneck moved a business metric: conversion rose from 2.1% to 3.8%.

## What’s Next?
You’ve completed the Foundations section! You now understand:
- Why system design matters for LLD
- Scalability fundamentals
- Latency and throughput metrics
- How to find and fix bottlenecks
**Continue your journey:** Explore the other HLD Concepts sections to deepen your understanding of distributed systems.