Understanding Bottlenecks
What is a Bottleneck?
A bottleneck is the component that limits your system’s overall performance. No matter how fast other parts are, the system can only go as fast as its slowest component.
Types of Bottlenecks
1. CPU Bottleneck
Symptoms: High CPU usage, slow computations
2. Memory Bottleneck
Symptoms: High memory usage, OOM errors, GC pauses
3. Database Bottleneck
Symptoms: Slow queries, connection pool exhaustion, high DB CPU
4. Network/I/O Bottleneck
Symptoms: High network latency, waiting on external services
Finding Bottlenecks
Step 1: Monitor Resource Utilization
| Resource | Tool | Warning Signs |
|---|---|---|
| CPU | top, htop, metrics | >80% sustained |
| Memory | free, vmstat | >90%, frequent GC |
| Disk | iostat, iotop | High wait times |
| Network | netstat, ss | Packet loss, high latency |
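To make the table concrete, here is an illustrative check using the third-party `psutil` library (assumed to be installed); the thresholds mirror the warning signs above:

```python
# Illustrative sketch using the psutil library (pip install psutil);
# thresholds mirror the warning signs in the table above.
import psutil

def check_resources():
    warnings = []
    cpu = psutil.cpu_percent(interval=1)      # % CPU averaged over a 1s sample
    mem = psutil.virtual_memory().percent     # % of physical memory in use
    if cpu > 80:
        warnings.append(f"CPU sustained high: {cpu:.0f}%")
    if mem > 90:
        warnings.append(f"Memory pressure: {mem:.0f}%")
    return warnings

if __name__ == "__main__":
    for warning in check_resources():
        print("WARNING:", warning)
```

In production you would scrape these numbers into a metrics system rather than print them, but the thresholds and the questions stay the same.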
Step 2: Profile Your Code
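A minimal sketch using Python’s built-in `cProfile`; `handle_request` is a hypothetical stand-in for your own hot code:

```python
# Profiling sketch with the standard library; handle_request is a stand-in
# for your own request handler / hot code path.
import cProfile
import pstats

def handle_request():
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    handle_request()
profiler.disable()

# Print the 10 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```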
Step 3: Trace Requests End-to-End
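Production systems use distributed tracing tools such as Jaeger or Zipkin for this, but the idea fits in a short single-process sketch: time each stage of a request and see where the time actually went. The stage names below are examples:

```python
# Single-process stand-in for request tracing: record how long each stage
# of one request takes, then look at where the time actually went.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # milliseconds

with span("auth"):
    time.sleep(0.01)    # placeholder for the auth service call
with span("db_query"):
    time.sleep(0.12)    # placeholder for the database call
with span("render"):
    time.sleep(0.02)    # placeholder for response rendering

# The slowest span is the first bottleneck candidate.
for name, ms in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{name:>10}: {ms:6.1f} ms")
```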
Common Solutions
Section titled “Common Solutions”| Bottleneck Type | Solutions |
|---|---|
| CPU | Optimize algorithms, caching, horizontal scaling |
| Memory | Streaming, pagination, efficient data structures |
| Database | Indexing, query optimization, caching, read replicas |
| Network | Caching, compression, connection pooling |
| External APIs | Caching, async calls, circuit breakers, timeouts |
Advanced: Latency vs Throughput Bottlenecks
Understanding the difference is crucial for senior engineers:
| Type | Symptom | Diagnosis | Solution |
|---|---|---|---|
| Latency | High response times | Profile shows slow operations | Optimize the slow code |
| Throughput | Requests queue up | Resources saturated | Add capacity or optimize resource usage |
| Both | Slow AND queueing | Latency and saturation metrics both alarming | Triage: fix the biggest impact first |
Deep Dive: Production Bottleneck Investigation
Here’s how senior engineers approach bottleneck investigation in production:
Step 1: Establish Baseline Metrics
Before optimizing, you need to know what “normal” looks like. Track these key metrics:
| Category | Metrics to Track |
|---|---|
| Request Latency | P50, P95, P99 response times |
| Database | Query times by operation, connection pool usage |
| External APIs | Call durations by service, error rates |
| Resources | CPU, memory, disk I/O, network |
Tools:
- APM (Application Performance Monitoring): Datadog, New Relic, Dynatrace
- Metrics: Prometheus + Grafana
- Distributed Tracing: Jaeger, Zipkin, AWS X-Ray
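In practice these percentiles come from your APM or a Prometheus histogram, but if all you have is raw response times (for example, from access logs), a quick nearest-rank sketch looks like this:

```python
# Nearest-rank percentiles from raw latency samples (milliseconds).
import math

def percentile(samples, p):
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))      # nearest-rank method
    return ordered[max(rank - 1, 0)]

latencies_ms = [12, 15, 14, 250, 18, 16, 13, 900, 17, 14]  # example data
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```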
Step 2: Identify the Hot Path
The hot path is the code that runs most frequently or consumes the most resources:
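One low-tech way to surface the hot path is a timing decorator that accumulates total time per function across many calls; the function names below are hypothetical:

```python
# Timing decorator that accumulates total time per function, so the hot
# path shows up as the largest total. Function names are hypothetical.
import time
from collections import defaultdict
from functools import wraps

totals = defaultdict(float)
calls = defaultdict(int)

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            totals[fn.__name__] += time.perf_counter() - start
            calls[fn.__name__] += 1
    return wrapper

@timed
def load_profile():
    time.sleep(0.001)

@timed
def rank_items():
    time.sleep(0.02)     # 20x slower per call -> dominates total time

for _ in range(100):
    load_profile()
    rank_items()

for name in sorted(totals, key=totals.get, reverse=True):
    print(f"{name}: {totals[name] * 1000:.0f} ms across {calls[name]} calls")
```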
Real-World Examples
Example 1: Facebook News Feed Database Bottleneck
Company: Meta (Facebook)
Scenario: Facebook News Feed was loading slowly for users. Investigation revealed a database bottleneck causing high latency.
Implementation: Identified and fixed N+1 query problem:
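The snippet below is an illustrative sketch of the general N+1 fix (batch the lookups into a single `IN (...)` query), shown against an in-memory SQLite database rather than Meta’s actual code or schema:

```python
# Illustrative N+1 fix against an in-memory SQLite DB (not Meta's code).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER)")
db.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO authors VALUES (?, ?)", [(i, f"user{i}") for i in range(5)])
db.executemany("INSERT INTO posts VALUES (?, ?)", [(i, i % 5) for i in range(20)])

# BEFORE: 1 query for the posts + 1 query per post for its author (N+1)
posts = db.execute("SELECT id, author_id FROM posts").fetchall()
feed_slow = [
    (post_id, db.execute("SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()[0])
    for post_id, author_id in posts
]

# AFTER: fetch all needed authors in one batched IN (...) query (2 queries total)
author_ids = {author_id for _, author_id in posts}
placeholders = ",".join("?" * len(author_ids))
rows = db.execute(f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
                  tuple(author_ids)).fetchall()
names = dict(rows)
feed_fast = [(post_id, names[author_id]) for post_id, author_id in posts]

assert feed_slow == feed_fast   # same feed, far fewer queries
```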
Why This Matters:
- Scale: Billions of users, trillions of posts
- Impact: 48x latency reduction
- User Experience: Faster feed loading increases engagement
- Result: Reduced database load by 98%
Real-World Impact:
- Queries: Reduced from 81 to 1 query per feed load
- Latency: 2.4 seconds → 50ms (48x improvement)
- Database Load: 98% reduction in queries
Example 2: Twitter Timeline CPU Bottleneck
Company: Twitter (now X)
Scenario: Timeline generation was slow during peak hours. CPU usage was maxed out, causing high latency.
Implementation: Identified CPU bottleneck in ranking algorithm:
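An illustrative sketch of common CPU fixes for ranking, not Twitter’s actual algorithm: cache expensive scores and use a partial selection (`heapq.nlargest`) instead of fully sorting every candidate on every request:

```python
# Illustrative ranking optimization: cache expensive scores and select the
# top-k without a full sort. Not Twitter's actual ranking code.
import heapq
from functools import lru_cache

@lru_cache(maxsize=100_000)
def score(tweet_id: int) -> float:
    # Placeholder for an expensive feature computation / model call.
    return (tweet_id * 2654435761 % 1000) / 1000.0

def top_k_timeline(candidate_ids, k=50):
    # O(n log k) partial selection instead of an O(n log n) full sort,
    # and cached scores avoid recomputing for tweets seen before.
    return heapq.nlargest(k, candidate_ids, key=score)

if __name__ == "__main__":
    candidates = list(range(10_000))
    print(top_k_timeline(candidates, k=5))
```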
Why This Matters:
- Scale: Millions of timeline requests per second
- Impact: 20-50x latency reduction
- Resource Usage: CPU reduced from 100% to 30%
- Result: Handles 10x more traffic with same hardware
Real-World Impact:
- Latency: 2-5 seconds → 100-200ms (20-50x improvement)
- CPU Usage: 100% → 30% (70% reduction)
- Capacity: 10x more traffic handled
Example 3: Netflix Streaming Memory Bottleneck
Company: Netflix
Scenario: Video encoding service was running out of memory when processing large video files. OOM errors caused encoding failures.
Implementation: Fixed memory bottleneck with streaming:
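An illustrative sketch of the streaming pattern (not Netflix’s encoder code): process the file in fixed-size chunks so memory stays flat regardless of file size:

```python
# Streaming sketch: read a large file in fixed-size chunks instead of
# loading it all into memory. Not Netflix's encoder code.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per chunk keeps memory usage flat

def process_video_streaming(path: str) -> str:
    digest = hashlib.sha256()          # stand-in for per-chunk encoding work
    with open(path, "rb") as src:
        while chunk := src.read(CHUNK_SIZE):
            digest.update(chunk)       # memory in use stays ~CHUNK_SIZE
    return digest.hexdigest()

# BEFORE (for contrast): open(path, "rb").read() pulls the entire file into
# RAM, so a 50 GB source file needs 50 GB of memory and risks OOM kills.
```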
Why This Matters:
- Scale: Thousands of videos encoded daily
- Impact: Eliminated OOM errors, 2x throughput increase
- Reliability: Encoding no longer fails due to memory
- Result: Can process larger videos with same hardware
Real-World Impact:
- Memory Usage: 100% → 3% (97% reduction)
- Reliability: OOM errors eliminated
- Throughput: 2x increase in encoding speed
Example 4: Amazon Checkout Network Bottleneck
Company: Amazon
Scenario: Checkout page was slow during Prime Day. Investigation revealed external payment API was the bottleneck.
Implementation: Fixed network bottleneck with caching and async processing:
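An illustrative sketch of caching plus async calls bounded by timeouts for a slow external dependency; the payment function and 60-second TTL are hypothetical stand-ins, not Amazon’s implementation:

```python
# Cache results from a slow external call and bound it with a timeout.
# The payment call and TTL are hypothetical, not Amazon's implementation.
import asyncio
import time

_cache = {}            # token -> (cached_at, result)
CACHE_TTL = 60         # seconds

async def call_payment_api(token):
    await asyncio.sleep(0.5)            # stand-in for the slow external call
    return {"token": token, "status": "authorized"}

async def get_payment_status(token):
    hit = _cache.get(token)
    if hit and time.monotonic() - hit[0] < CACHE_TTL:
        return hit[1]                   # cache hit: no network round trip
    # Bound the slow path so checkout never hangs on the external API.
    result = await asyncio.wait_for(call_payment_api(token), timeout=2.0)
    _cache[token] = (time.monotonic(), result)
    return result

async def main():
    print(await get_payment_status("tok_123"))   # slow path: hits the "API"
    print(await get_payment_status("tok_123"))   # fast path: served from cache

asyncio.run(main())
```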
Why This Matters:
- Scale: 100K checkout requests per second during Prime Day
- Impact: 10x latency reduction, 15% conversion increase
- User Experience: Faster checkout increases conversions
- Result: Handles traffic spikes without degradation
Real-World Impact:
- Latency: 500ms → 50ms (10x improvement)
- Conversion Rate: +15% increase
- Revenue: Millions in additional revenue during Prime Day
Real-World Case Study: E-Commerce Checkout Bottleneck
Situation: Checkout page takes 8 seconds to load during sales events.
Investigation Process
Step 1: Add timing instrumentation to each component
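A minimal sketch of what that instrumentation might look like; the component functions are hypothetical and the sleeps mimic the measured behavior:

```python
# Per-component timing for the checkout flow; component calls are simulated
# with sleeps that mimic the measured behavior.
import time

def timed_call(name, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{name}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

def load_checkout():
    timed_call("user_service", lambda: time.sleep(0.05))
    timed_call("inventory_service", lambda: time.sleep(6.0))   # the outlier
    timed_call("pricing_service", lambda: time.sleep(0.08))
    timed_call("payment_service", lambda: time.sleep(0.12))

load_checkout()
# inventory_service dominates at ~6 s -> that's where to dig deeper.
```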
Step 2: Diagnose the root cause
The inventory service was making one database query per item:
- Cart with 100 items = 100 database queries
- Each query ~60ms = 6 seconds total
Step 3: Fix with batch query
```sql
-- BEFORE: N+1 queries (100 queries for 100 items)
SELECT * FROM inventory WHERE sku = 'SKU001';
SELECT * FROM inventory WHERE sku = 'SKU002';
-- ... 98 more queries

-- AFTER: Single batch query
SELECT * FROM inventory WHERE sku IN ('SKU001', 'SKU002', ...);
```
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| P50 Latency | 6.2s | 0.4s | 93% reduction |
| P99 Latency | 12s | 0.8s | 93% reduction |
| DB Queries | 103 | 5 | 95% reduction |
| Conversion Rate | 2.1% | 3.8% | 81% increase |
Key Takeaways
What’s Next?
You’ve completed the Foundations section! You now understand:
- Why system design matters for LLD
- Scalability fundamentals
- Latency and throughput metrics
- How to find and fix bottlenecks
Continue your journey: Explore other HLD Concepts sections to deepen your understanding of distributed systems.