Understanding Bottlenecks

Find the weakest link before it breaks

A bottleneck is the component that limits your system’s overall performance. No matter how fast other parts are, the system can only go as fast as its slowest component.
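
To make the "slowest component" idea concrete, here is a minimal sketch. It assumes every request passes through every stage in series; the stage names and capacities are hypothetical, not measurements from a real system.

```python
# Hypothetical per-stage capacities, in requests per second.
stage_capacity_rps = {
    "load_balancer": 50_000,
    "app_server": 8_000,
    "database": 1_200,
    "cache": 100_000,
}


def find_bottleneck(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the stage with the lowest capacity and that capacity.

    In a serial pipeline the whole system cannot exceed this number.
    """
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


bottleneck, system_rps = find_bottleneck(stage_capacity_rps)
print(f"Bottleneck: {bottleneck}, max system throughput: {system_rps} req/s")
```

Under this model, doubling the capacity of any stage other than the bottleneck leaves the system limit unchanged.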

Common bottleneck types and their symptoms:

  • CPU: High CPU usage, slow computations
  • Memory: High memory usage, OOM errors, GC pauses
  • Database: Slow queries, connection pool exhaustion, high DB CPU
  • Network and external services: High network latency, waiting on external services

| Resource | Tool | Warning Signs |
|----------|------|---------------|
| CPU | top, htop, metrics | >80% sustained |
| Memory | free, vmstat | >90%, frequent GC |
| Disk | iostat, iotop | High wait times |
| Network | netstat, ss | Packet loss, high latency |
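
The word "sustained" in the table matters: a single spike above 80% is rarely a problem, while a full window above it usually is. The sketch below shows one possible way to encode that rule. The class name, the window size, and the sample readings are assumptions made for illustration; a real setup would feed it values from a monitoring agent.

```python
from collections import deque


class SustainedThreshold:
    """Flags a metric only when every sample in a full window exceeds the threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest `window` samples

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alarm condition holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


# Hypothetical CPU readings in percent, one per sampling interval.
cpu_alarm = SustainedThreshold(threshold=80.0, window=3)
readings = [85, 95, 70, 82, 88, 91]
alarms = [cpu_alarm.observe(r) for r in readings]
print(alarms)
```

Only the last reading triggers the alarm here, because the dip to 70 resets the streak.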

Common solutions by bottleneck type:

| Bottleneck Type | Solutions |
|-----------------|-----------|
| CPU | Optimize algorithms, caching, horizontal scaling |
| Memory | Streaming, pagination, efficient data structures |
| Database | Indexing, query optimization, caching, read replicas |
| Network | Caching, compression, connection pooling |
| External APIs | Caching, async calls, circuit breakers, timeouts |
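
Of the solutions listed for external APIs, the circuit breaker is the least self-explanatory, so here is a minimal sketch of the idea. It is not a production implementation: the class, its parameters (`max_failures`, `reset_after_s`), and the injectable `clock` are all assumptions chosen to keep the example short and testable.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of waiting on a dependency known to be down.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # any success resets the failure count
        return result
```

Failing fast keeps threads and connections from piling up behind a slow dependency, which is how one slow external API turns into a system-wide bottleneck.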

Advanced: Latency vs Throughput Bottlenecks

Understanding the difference is crucial for senior engineers:

| Type | Symptom | Diagnosis | Solution |
|------|---------|-----------|----------|
| Latency | High response times | Profile shows slow operations | Optimize the slow code |
| Throughput | Requests queue up | Resources saturated | Add capacity or optimize resource usage |
| Both | Slow AND queueing | Everything is red | Triage: fix biggest impact first |
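
The distinction in the table can be worked through with simple arithmetic. The sketch below uses a deliberately simplified model, assumed for illustration: a pool of identical workers, a fixed service time per request, and capacity equal to `workers / service_time`. Real systems start queueing before they reach 100% utilization, so treat the result as a first approximation.

```python
def classify(service_time_s, workers, arrival_rps, latency_target_s):
    """Classify a bottleneck under a simplified fixed-service-time model."""
    capacity_rps = workers / service_time_s       # max requests/s the pool can finish
    too_slow = service_time_s > latency_target_s  # a single request already misses the target
    saturated = arrival_rps > capacity_rps        # work arrives faster than it completes
    if too_slow and saturated:
        return "both"
    if too_slow:
        return "latency"
    if saturated:
        return "throughput"
    return "healthy"


# Hypothetical numbers: plenty of workers, but each request takes 2 s.
print(classify(service_time_s=2.0, workers=100, arrival_rps=10, latency_target_s=0.5))
# Hypothetical numbers: fast requests (50 ms), but 500 req/s hit 10 workers.
print(classify(service_time_s=0.05, workers=10, arrival_rps=500, latency_target_s=0.5))
```

In the first case adding servers does not help, because no request is waiting in a queue. In the second case the code is already fast, so capacity is the lever.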

Deep Dive: Production Bottleneck Investigation

Here’s how senior engineers approach bottleneck investigation in production:

Before optimizing, you need to know what “normal” looks like. Track these key metrics:

| Category | Metrics to Track |
|----------|------------------|
| Request Latency | P50, P95, P99 response times |
| Database | Query times by operation, connection pool usage |
| External APIs | Call durations by service, error rates |
| Resources | CPU, memory, disk I/O, network |
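
For readers who have not computed percentiles by hand, the sketch below shows one common definition, the nearest-rank method. Monitoring tools may interpolate or use approximate histograms instead, so their numbers can differ slightly. The latency samples are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds, including one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 18, 17, 250, 19]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

The median stays low while P95 and P99 expose the outlier, which is why tail percentiles belong in a baseline alongside P50.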

Tools:

  • APM (Application Performance Monitoring): Datadog, New Relic, Dynatrace
  • Metrics: Prometheus + Grafana
  • Distributed Tracing: Jaeger, Zipkin, AWS X-Ray

The hot path is the code that runs most frequently or consumes the most resources.
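
One way to find the hot path from timing records is to rank operations by total time rather than by time per call. The sketch below assumes you already collect `(operation, duration_ms)` pairs somewhere; the function name and the sample operations are hypothetical.

```python
from collections import defaultdict


def rank_hot_paths(records):
    """Aggregate (operation, duration_ms) pairs and sort by total time, highest first."""
    calls = defaultdict(int)
    total_ms = defaultdict(float)
    for operation, duration_ms in records:
        calls[operation] += 1
        total_ms[operation] += duration_ms
    rows = [(op, calls[op], total_ms[op]) for op in total_ms]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical trace: one expensive call and many cheap ones.
records = [("render_template", 400.0)] + [("load_user", 5.0)] * 200
for operation, count, total in rank_hot_paths(records):
    print(f"{operation}: {count} calls, {total:.0f} ms total")
```

Here the cheap call dominates because it runs 200 times, so it is the better optimization target even though each invocation looks harmless.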

Example 1: Facebook News Feed Database Bottleneck

Company: Meta (Facebook)

Scenario: Facebook News Feed was loading slowly for users. Investigation revealed a database bottleneck causing high latency.

Implementation: Identified and fixed an N+1 query problem by replacing the per-item queries with a single batched query.

Why This Matters:

  • Scale: Billions of users, trillions of posts
  • Impact: 48x latency reduction
  • User Experience: Faster feed loading increases engagement
  • Result: Reduced database load by 98%

Real-World Impact:

  • Queries: Reduced from 81 to 1 query per feed load
  • Latency: 2.4 seconds → 50ms (48x improvement)
  • Database Load: 98% reduction in queries

Example 2: Twitter Timeline CPU Bottleneck

Company: Twitter (now X)

Scenario: Timeline generation was slow during peak hours. CPU usage was maxed out, causing high latency.

Implementation: Identified a CPU bottleneck in the ranking algorithm.

Why This Matters:

  • Scale: Millions of timeline requests per second
  • Impact: 20-50x latency reduction
  • Resource Usage: CPU reduced from 100% to 30%
  • Result: Handles 10x more traffic with same hardware

Real-World Impact:

  • Latency: 2-5 seconds → 100-200ms (20-50x improvement)
  • CPU Usage: 100% → 30% (70% reduction)
  • Capacity: 10x more traffic handled

Example 3: Netflix Streaming Memory Bottleneck

Company: Netflix

Scenario: Video encoding service was running out of memory when processing large video files. OOM errors caused encoding failures.

Implementation: Fixed the memory bottleneck by streaming video data in chunks instead of loading entire files into memory.
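
The streaming idea can be sketched as follows. This is a generic illustration, not Netflix's encoder: a SHA-256 checksum stands in for the real per-chunk work because it can be updated incrementally, and the chunk size is an arbitrary assumption.

```python
import hashlib
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive chunks so only one chunk is held in memory at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def checksum_streaming(stream, chunk_size=CHUNK_SIZE):
    """Process a binary stream incrementally instead of calling stream.read() once."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(stream, chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a large file; a real caller would pass open(path, "rb").
data = b"frame" * 100_000
print(checksum_streaming(io.BytesIO(data), chunk_size=4096))
```

Peak memory is bounded by the chunk size rather than by the file size, which is what lets larger inputs run on the same hardware.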

Why This Matters:

  • Scale: Thousands of videos encoded daily
  • Impact: Eliminated OOM errors, 2x throughput increase
  • Reliability: Encoding no longer fails due to memory
  • Result: Can process larger videos with same hardware

Real-World Impact:

  • Memory Usage: 100% → 3% (97% reduction)
  • Reliability: OOM errors eliminated
  • Throughput: 2x increase in encoding speed

Example 4: Amazon Checkout Network Bottleneck

Company: Amazon

Scenario: Checkout page was slow during Prime Day. Investigation revealed external payment API was the bottleneck.

Implementation: Fixed the network bottleneck with caching and async processing.
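
As a rough illustration of combining caching with async calls, the sketch below puts a small time-to-live cache in front of an awaited remote call that has a timeout. Everything here is an assumption for the example: the `TTLCache` class, the `fetch_with_cache` helper, and the idea that the cached data is something safe to reuse (for example rate tables), not the payment itself.

```python
import asyncio
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl_s)


async def fetch_with_cache(cache, key, fetch, timeout_s=1.0):
    """Return a cached value if present; otherwise await fetch(key) with a timeout."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    # asyncio.wait_for raises a timeout error if the remote call takes too long.
    value = await asyncio.wait_for(fetch(key), timeout=timeout_s)
    cache.put(key, value)
    return value
```

A caller would pass its own coroutine function as `fetch`. Repeated requests for the same key within the TTL then skip the network entirely, and a slow dependency can stall a request for at most `timeout_s`.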

Why This Matters:

  • Scale: 100K checkout requests per second during Prime Day
  • Impact: 10x latency reduction, 15% conversion increase
  • User Experience: Faster checkout increases conversions
  • Result: Handles traffic spikes without degradation

Real-World Impact:

  • Latency: 500ms → 50ms (10x improvement)
  • Conversion Rate: +15% increase
  • Revenue: Millions in additional revenue during Prime Day

Real-World Case Study: E-Commerce Checkout Bottleneck

Situation: Checkout page takes 8 seconds to load during sales events.

Step 1: Add timing instrumentation to each component
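
A minimal sketch of what such instrumentation could look like is shown here. The component names and the `time.sleep` calls are placeholders for real work, and a production service would more likely send these timings to its metrics or tracing system than keep them in a dictionary.

```python
import time
from contextlib import contextmanager

timings_ms = {}  # component name -> accumulated milliseconds


@contextmanager
def timed(name):
    """Measure the wall-clock time of the enclosed block and add it to timings_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings_ms[name] = timings_ms.get(name, 0.0) + elapsed_ms


def load_checkout_page():
    # Placeholder work; the sleeps stand in for real service calls.
    with timed("cart"):
        time.sleep(0.01)
    with timed("inventory"):
        time.sleep(0.05)
    with timed("pricing"):
        time.sleep(0.01)


load_checkout_page()
slowest = max(timings_ms, key=timings_ms.get)
print(f"Slowest component: {slowest}")
```

Once each component reports its own duration, the slowest one stands out immediately instead of being hidden inside a single end-to-end number.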

Step 2: Diagnose the root cause

The inventory service was making one database query per item:

  • Cart with 100 items = 100 database queries
  • Each query ~60ms = 6 seconds total

Step 3: Fix with batch query

```sql
-- BEFORE: N+1 queries (100 queries for 100 items)
SELECT * FROM inventory WHERE sku = 'SKU001';
SELECT * FROM inventory WHERE sku = 'SKU002';
-- ... 98 more queries

-- AFTER: Single batch query
SELECT * FROM inventory WHERE sku IN ('SKU001', 'SKU002', ...);
```
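
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| P50 Latency | 6.2s | 0.4s | 93% reduction |
| P99 Latency | 12s | 0.8s | 93% reduction |
| DB Queries | 103 | 5 | 95% reduction |
| Conversion Rate | 2.1% | 3.8% | 81% increase |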
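
The batch query from Step 3 can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the `quantity` column, the sample data, and both function names are made up for the example, and the real inventory service's schema and database are not described in the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [(f"SKU{i:03d}", i) for i in range(1, 101)],
)


def stock_n_plus_one(conn, skus):
    """One query per SKU: len(skus) round trips to the database."""
    result = {}
    for sku in skus:
        row = conn.execute(
            "SELECT quantity FROM inventory WHERE sku = ?", (sku,)
        ).fetchone()
        if row is not None:
            result[sku] = row[0]
    return result


def stock_batched(conn, skus):
    """One query for all SKUs, with one bound parameter per SKU."""
    if not skus:
        return {}
    placeholders = ", ".join("?" for _ in skus)
    query = f"SELECT sku, quantity FROM inventory WHERE sku IN ({placeholders})"
    return dict(conn.execute(query, list(skus)).fetchall())


cart = [f"SKU{i:03d}" for i in range(1, 101)]
assert stock_batched(conn, cart) == stock_n_plus_one(conn, cart)
```

Only the placeholders are built with string formatting; the SKU values themselves are passed as bound parameters, which avoids SQL injection. Databases limit the number of bound parameters per statement, so very large carts would need to be split into several batches.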


You’ve completed the Foundations section! You now understand:

  • Why system design matters for LLD
  • Scalability fundamentals
  • Latency and throughput metrics
  • How to find and fix bottlenecks

Continue your journey: Explore other HLD Concepts sections to deepen your understanding of distributed systems.