Understanding Bottlenecks
What is a Bottleneck?
A bottleneck is the component that limits your system’s overall performance. No matter how fast other parts are, the system can only go as fast as its slowest component.
Types of Bottlenecks
1. CPU Bottleneck
Symptoms: High CPU usage, slow computations
2. Memory Bottleneck
Symptoms: High memory usage, OOM errors, GC pauses
3. Database Bottleneck
Symptoms: Slow queries, connection pool exhaustion, high DB CPU
4. Network/I/O Bottleneck
Symptoms: High network latency, waiting on external services
Finding Bottlenecks
Step 1: Monitor Resource Utilization
| Resource | Tool | Warning Signs |
|---|---|---|
| CPU | top, htop, metrics | >80% sustained |
| Memory | free, vmstat | >90%, frequent GC |
| Disk | iostat, iotop | High wait times |
| Network | netstat, ss | Packet loss, high latency |
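To make the table concrete, here is an illustrative check using the third-party `psutil` library (assumed to be installed); the thresholds mirror the warning signs above:

```python
# Illustrative sketch using the psutil library (pip install psutil);
# thresholds mirror the warning signs in the table above.
import psutil

def check_resources():
    warnings = []
    cpu = psutil.cpu_percent(interval=1)      # % CPU averaged over a 1s sample
    mem = psutil.virtual_memory().percent     # % of physical memory in use
    if cpu > 80:
        warnings.append(f"CPU sustained high: {cpu:.0f}%")
    if mem > 90:
        warnings.append(f"Memory pressure: {mem:.0f}%")
    return warnings

if __name__ == "__main__":
    for warning in check_resources():
        print("WARNING:", warning)
```

In production you would scrape these numbers into a metrics system rather than print them, but the thresholds and the questions stay the same.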
Step 2: Profile Your Code
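A minimal sketch using Python’s built-in `cProfile`; `handle_request` is a hypothetical stand-in for your own hot code:

```python
# Profiling sketch with the standard library; handle_request is a stand-in
# for your own request handler / hot code path.
import cProfile
import pstats

def handle_request():
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    handle_request()
profiler.disable()

# Print the 10 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```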
Step 3: Trace Requests End-to-End
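Production systems use distributed tracing tools such as Jaeger or Zipkin for this, but the idea fits in a short single-process sketch: time each stage of a request and see where the time actually went. The stage names below are examples:

```python
# Single-process stand-in for request tracing: record how long each stage
# of one request takes, then look at where the time actually went.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # milliseconds

with span("auth"):
    time.sleep(0.01)    # placeholder for the auth service call
with span("db_query"):
    time.sleep(0.12)    # placeholder for the database call
with span("render"):
    time.sleep(0.02)    # placeholder for response rendering

# The slowest span is the first bottleneck candidate.
for name, ms in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{name:>10}: {ms:6.1f} ms")
```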
Common Solutions
Section titled “Common Solutions”| Bottleneck Type | Solutions |
|---|---|
| CPU | Optimize algorithms, caching, horizontal scaling |
| Memory | Streaming, pagination, efficient data structures |
| Database | Indexing, query optimization, caching, read replicas |
| Network | Caching, compression, connection pooling |
| External APIs | Caching, async calls, circuit breakers, timeouts |
Advanced: Latency vs Throughput Bottlenecks
Understanding the difference is crucial for senior engineers:
| Type | Symptom | Diagnosis | Solution |
|---|---|---|---|
| Latency | High response times | Profile shows slow operations | Optimize the slow code |
| Throughput | Requests queue up | Resources saturated | Add capacity or optimize resource usage |
| Both | Slow AND queueing | Latency and saturation metrics both alarming | Triage: fix the biggest impact first |
Deep Dive: Production Bottleneck Investigation
Here’s how senior engineers approach bottleneck investigation in production:
Step 1: Establish Baseline Metrics
Before optimizing, you need to know what “normal” looks like. Track these key metrics:
| Category | Metrics to Track |
|---|---|
| Request Latency | P50, P95, P99 response times |
| Database | Query times by operation, connection pool usage |
| External APIs | Call durations by service, error rates |
| Resources | CPU, memory, disk I/O, network |
Tools:
- APM (Application Performance Monitoring): Datadog, New Relic, Dynatrace
- Metrics: Prometheus + Grafana
- Distributed Tracing: Jaeger, Zipkin, AWS X-Ray
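In practice these percentiles come from your APM or a Prometheus histogram, but if all you have is raw response times (for example, from access logs), a quick nearest-rank sketch looks like this:

```python
# Nearest-rank percentiles from raw latency samples (milliseconds).
import math

def percentile(samples, p):
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))      # nearest-rank method
    return ordered[max(rank - 1, 0)]

latencies_ms = [12, 15, 14, 250, 18, 16, 13, 900, 17, 14]  # example data
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```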
Step 2: Identify the Hot Path
The hot path is the code that runs most frequently or consumes the most resources:
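One low-tech way to surface the hot path is a timing decorator that accumulates total time per function across many calls; the function names below are hypothetical:

```python
# Timing decorator that accumulates total time per function, so the hot
# path shows up as the largest total. Function names are hypothetical.
import time
from collections import defaultdict
from functools import wraps

totals = defaultdict(float)
calls = defaultdict(int)

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            totals[fn.__name__] += time.perf_counter() - start
            calls[fn.__name__] += 1
    return wrapper

@timed
def load_profile():
    time.sleep(0.001)

@timed
def rank_items():
    time.sleep(0.02)     # 20x slower per call -> dominates total time

for _ in range(100):
    load_profile()
    rank_items()

for name in sorted(totals, key=totals.get, reverse=True):
    print(f"{name}: {totals[name] * 1000:.0f} ms across {calls[name]} calls")
```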
Real-World Examples
Example 1: Facebook News Feed Database Bottleneck
Company: Meta (Facebook)
Scenario: Facebook News Feed was loading slowly for users. Investigation revealed a database bottleneck causing high latency.
Implementation: Identified and fixed N+1 query problem:
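The snippet below is an illustrative sketch of the general N+1 fix (batch the lookups into a single `IN (...)` query), shown against an in-memory SQLite database rather than Meta’s actual code or schema:

```python
# Illustrative N+1 fix against an in-memory SQLite DB (not Meta's code).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER)")
db.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO authors VALUES (?, ?)", [(i, f"user{i}") for i in range(5)])
db.executemany("INSERT INTO posts VALUES (?, ?)", [(i, i % 5) for i in range(20)])

# BEFORE: 1 query for the posts + 1 query per post for its author (N+1)
posts = db.execute("SELECT id, author_id FROM posts").fetchall()
feed_slow = [
    (post_id, db.execute("SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()[0])
    for post_id, author_id in posts
]

# AFTER: fetch all needed authors in one batched IN (...) query (2 queries total)
author_ids = {author_id for _, author_id in posts}
placeholders = ",".join("?" * len(author_ids))
rows = db.execute(f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
                  tuple(author_ids)).fetchall()
names = dict(rows)
feed_fast = [(post_id, names[author_id]) for post_id, author_id in posts]

assert feed_slow == feed_fast   # same feed, far fewer queries
```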
Why This Matters:
- Scale: Billions of users, trillions of posts
- Impact: 48x latency reduction
- User Experience: Faster feed loading increases engagement
- Result: Reduced database load by 98%
Real-World Impact:
- Queries: Reduced from 81 to 1 query per feed load
- Latency: 2.4 seconds → 50ms (48x improvement)
- Database Load: 98% reduction in queries
Example 2: Twitter Timeline CPU Bottleneck
Company: Twitter (now X)
Scenario: Timeline generation was slow during peak hours. CPU usage was maxed out, causing high latency.
Implementation: Identified CPU bottleneck in ranking algorithm:
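An illustrative sketch of common CPU fixes for ranking, not Twitter’s actual algorithm: cache expensive scores and use a partial selection (`heapq.nlargest`) instead of fully sorting every candidate on every request:

```python
# Illustrative ranking optimization: cache expensive scores and select the
# top-k without a full sort. Not Twitter's actual ranking code.
import heapq
from functools import lru_cache

@lru_cache(maxsize=100_000)
def score(tweet_id: int) -> float:
    # Placeholder for an expensive feature computation / model call.
    return (tweet_id * 2654435761 % 1000) / 1000.0

def top_k_timeline(candidate_ids, k=50):
    # O(n log k) partial selection instead of an O(n log n) full sort,
    # and cached scores avoid recomputing for tweets seen before.
    return heapq.nlargest(k, candidate_ids, key=score)

if __name__ == "__main__":
    candidates = list(range(10_000))
    print(top_k_timeline(candidates, k=5))
```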
Why This Matters:
- Scale: Millions of timeline requests per second
- Impact: 20-50x latency reduction
- Resource Usage: CPU reduced from 100% to 30%
- Result: Handles 10x more traffic with same hardware
Real-World Impact:
- Latency: 2-5 seconds → 100-200ms (20-50x improvement)
- CPU Usage: 100% → 30% (70% reduction)
- Capacity: 10x more traffic handled
Example 3: Netflix Streaming Memory Bottleneck
Company: Netflix
Scenario: Video encoding service was running out of memory when processing large video files. OOM errors caused encoding failures.
Implementation: Fixed memory bottleneck with streaming:
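An illustrative sketch of the streaming pattern (not Netflix’s encoder code): process the file in fixed-size chunks so memory stays flat regardless of file size:

```python
# Streaming sketch: read a large file in fixed-size chunks instead of
# loading it all into memory. Not Netflix's encoder code.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per chunk keeps memory usage flat

def process_video_streaming(path: str) -> str:
    digest = hashlib.sha256()          # stand-in for per-chunk encoding work
    with open(path, "rb") as src:
        while chunk := src.read(CHUNK_SIZE):
            digest.update(chunk)       # memory in use stays ~CHUNK_SIZE
    return digest.hexdigest()

# BEFORE (for contrast): open(path, "rb").read() pulls the entire file into
# RAM, so a 50 GB source file needs 50 GB of memory and risks OOM kills.
```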
Why This Matters:
- Scale: Thousands of videos encoded daily
- Impact: Eliminated OOM errors, 2x throughput increase
- Reliability: Encoding no longer fails due to memory
- Result: Can process larger videos with same hardware
Real-World Impact:
- Memory Usage: 100% → 3% (97% reduction)
- Reliability: OOM errors eliminated
- Throughput: 2x increase in encoding speed
Example 4: Amazon Checkout Network Bottleneck
Company: Amazon
Scenario: Checkout page was slow during Prime Day. Investigation revealed external payment API was the bottleneck.
Implementation: Fixed network bottleneck with caching and async processing:
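An illustrative sketch of caching plus async calls bounded by timeouts for a slow external dependency; the payment function and 60-second TTL are hypothetical stand-ins, not Amazon’s implementation:

```python
# Cache results from a slow external call and bound it with a timeout.
# The payment call and TTL are hypothetical, not Amazon's implementation.
import asyncio
import time

_cache = {}            # token -> (cached_at, result)
CACHE_TTL = 60         # seconds

async def call_payment_api(token):
    await asyncio.sleep(0.5)            # stand-in for the slow external call
    return {"token": token, "status": "authorized"}

async def get_payment_status(token):
    hit = _cache.get(token)
    if hit and time.monotonic() - hit[0] < CACHE_TTL:
        return hit[1]                   # cache hit: no network round trip
    # Bound the slow path so checkout never hangs on the external API.
    result = await asyncio.wait_for(call_payment_api(token), timeout=2.0)
    _cache[token] = (time.monotonic(), result)
    return result

async def main():
    print(await get_payment_status("tok_123"))   # slow path: hits the "API"
    print(await get_payment_status("tok_123"))   # fast path: served from cache

asyncio.run(main())
```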
Why This Matters:
- Scale: 100K checkout requests per second during Prime Day
- Impact: 10x latency reduction, 15% conversion increase
- User Experience: Faster checkout increases conversions
- Result: Handles traffic spikes without degradation
Real-World Impact:
- Latency: 500ms → 50ms (10x improvement)
- Conversion Rate: +15% increase
- Revenue: Millions in additional revenue during Prime Day
Real-World Case Study: E-Commerce Checkout Bottleneck
Situation: Checkout page takes 8 seconds to load during sales events.
Investigation Process
Step 1: Add timing instrumentation to each component
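A minimal sketch of what that instrumentation might look like; the component functions are hypothetical and the sleeps mimic the measured behavior:

```python
# Per-component timing for the checkout flow; component calls are simulated
# with sleeps that mimic the measured behavior.
import time

def timed_call(name, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{name}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

def load_checkout():
    timed_call("user_service", lambda: time.sleep(0.05))
    timed_call("inventory_service", lambda: time.sleep(6.0))   # the outlier
    timed_call("pricing_service", lambda: time.sleep(0.08))
    timed_call("payment_service", lambda: time.sleep(0.12))

load_checkout()
# inventory_service dominates at ~6 s -> that's where to dig deeper.
```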
Step 2: Diagnose the root cause
The inventory service was making one database query per item:
- Cart with 100 items = 100 database queries
- Each query ~60ms = 6 seconds total
Step 3: Fix with batch query
```sql
-- BEFORE: N+1 queries (100 queries for 100 items)
SELECT * FROM inventory WHERE sku = 'SKU001';
SELECT * FROM inventory WHERE sku = 'SKU002';
-- ... 98 more queries

-- AFTER: Single batch query
SELECT * FROM inventory WHERE sku IN ('SKU001', 'SKU002', ...);
```
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| P50 Latency | 6.2s | 0.4s | 93% reduction |
| P99 Latency | 12s | 0.8s | 93% reduction |
| DB Queries | 103 | 5 | 95% reduction |
| Conversion Rate | 2.1% | 3.8% | 81% increase |
Key Takeaways
What’s Next?
You’ve completed the Foundations section! You now understand:
- Why system design matters for LLD
- Scalability fundamentals
- Latency and throughput metrics
- How to find and fix bottlenecks
Continue your journey: Explore other HLD Concepts sections to deepen your understanding of distributed systems.