Database Indexing Strategies
Why Indexes Matter
Without indexes, databases must scan every row to find data. With indexes, databases can jump directly to the data they need.
Index Structure Basics
An index is a separate data structure that maps search keys to data locations.
Key Concept: Index stores keys (what you search by) mapped to locations (where the data is).
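The key concept above can be sketched in a few lines of Python, with a plain dict standing in for the index. The rows, column names, and values are all hypothetical:

```python
# Minimal sketch: an "index" is a lookup structure kept beside the data.
# Here a dict maps the search key (email) to the row's position in the
# table; the list offset stands in for a disk address.
rows = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "email": "alan@example.com"},
    {"id": 3, "email": "grace@example.com"},
]

# Build the index: key -> location
email_index = {row["email"]: i for i, row in enumerate(rows)}

def find_by_email(email):
    # Jump straight to the row instead of scanning every entry
    pos = email_index.get(email)
    return rows[pos] if pos is not None else None
```

Without `email_index`, `find_by_email` would have to loop over every row; with it, one dict lookup finds the location.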
Strategy 1: B-Tree Index
B-tree (Balanced Tree) is the most common index structure in SQL databases. It’s a self-balancing tree that keeps data sorted.
How B-Tree Works
Search Process:
- Start at root
- Compare key with node values
- Follow appropriate branch
- Repeat until leaf node
- Time: O(log n) - logarithmic time!
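The search steps above can be sketched as a toy, search-only B-tree with hand-built nodes. This is an illustration of the descent, not a production structure (no insertion or balancing):

```python
# Toy B-tree node: sorted keys plus child pointers. Search descends one
# branch per level, giving O(log n) comparisons overall.
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted list of keys
        self.children = children  # None for leaf nodes

def search(node, key):
    # Find the first slot where key could fit in this node
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True               # found in this node
    if node.children is None:
        return False              # leaf reached: key absent
    return search(node.children[i], key)  # follow the branch

# Hand-built tree over keys 10..70 (illustrative data)
root = Node([40], [
    Node([20], [Node([10]), Node([30])]),
    Node([60], [Node([50]), Node([70])]),
])
```

Each call touches one node per level, so a tree over a million keys needs only around 20 comparisons.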
B-Tree Characteristics
Pros:
- Fast reads: O(log n) lookups
- Range queries: Efficient “WHERE x BETWEEN a AND b”
- Sorted data: Data stays sorted
- Mature: Well-understood, optimized
Cons:
- Slower writes: Must rebalance tree
- Random I/O: updates touch scattered pages rather than writing sequentially
- Write amplification: One write can cause multiple disk writes
Used in: PostgreSQL, MySQL, SQL Server, Oracle
Strategy 2: LSM Tree
LSM (Log-Structured Merge) Tree optimizes writes by batching them in memory and merging to disk.
How LSM Tree Works
Key Concept: Writes go to fast memory first, then flush to disk in batches. Background process merges files.
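A minimal sketch of that flow, with a dict as the memtable and in-memory lists standing in for on-disk segment files. All names and the tiny flush threshold are illustrative; a real engine would also use Bloom filters and binary search on segments:

```python
# Toy LSM tree: writes land in an in-memory dict (memtable); when it
# fills, it is flushed as an immutable sorted segment. Reads check the
# memtable first, then segments from newest to oldest.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []            # newest segment last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # fast in-memory write
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of sorted data
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest first
            for k, v in segment:
                if k == key:
                    return v
        return None

lsm = TinyLSM(memtable_limit=2)
lsm.put("a", 1)
lsm.put("b", 2)   # memtable full: flushed to a sorted segment
lsm.put("a", 9)   # newer value shadows the flushed one
```

Note how a read may have to consult several places (memtable plus segments): that is the read amplification listed below, and compaction exists to merge segments back down.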
LSM Tree Characteristics
Pros:
- Fast writes: Sequential writes, batched operations
- High throughput: Can handle millions of writes/second
- Write-optimized: Designed for write-heavy workloads
- Compression: Merging compacts data
Cons:
- Slower reads: May need to check multiple files
- Read amplification: One read may touch multiple files
- Compaction overhead: Background merging uses resources
Used in: Cassandra, RocksDB, LevelDB, HBase
Strategy 3: Inverted Index
An inverted index maps words to the documents containing them. It is used in search engines and full-text search.
How Inverted Index Works
Key Concept: Instead of “document → words”, store “word → documents” for fast keyword lookups.
Inverted Index Example
Documents:
- Doc1: “The cat sat on the mat”
- Doc2: “The dog chased the cat”
- Doc3: “A bird flew away”
Inverted Index:
the → [Doc1, Doc2]
cat → [Doc1, Doc2]
dog → [Doc2]
bird → [Doc3]
sat → [Doc1]
mat → [Doc1]
chased → [Doc2]
flew → [Doc3]
away → [Doc3]
Query: “cat”
- Lookup “cat” in index → [Doc1, Doc2]
- Result: Doc1 and Doc2 contain “cat”
- Fast! No need to scan all documents
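The example above can be reproduced in a few lines. Lowercasing and whitespace splitting are simplifying assumptions here; real analyzers tokenize more carefully:

```python
# Build the inverted index from the three documents above: lowercase,
# split on whitespace, and map each word to the documents containing it.
docs = {
    "Doc1": "The cat sat on the mat",
    "Doc2": "The dog chased the cat",
    "Doc3": "A bird flew away",
}

inverted = {}
for doc_id, text in docs.items():
    # set() deduplicates repeated words within one document
    for word in set(text.lower().split()):
        inverted.setdefault(word, []).append(doc_id)

def lookup(word):
    # One dict lookup instead of scanning every document
    return sorted(inverted.get(word.lower(), []))
```

`lookup("cat")` returns both documents without touching Doc3 at all.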
Inverted Index Characteristics
Pros:
- Fast search: O(1) word lookup
- Boolean queries: AND, OR, NOT operations
- Phrase queries: Find exact phrases
- Ranking: Can rank results by relevance
Cons:
- Storage overhead: Index can be large
- Update cost: Must update index when documents change
- Complexity: More complex than simple indexes
Used in: Elasticsearch, Lucene, Solr, full-text search
Real-World Examples
Different indexing strategies are used by major companies based on their workload patterns:
B-Tree Index: PostgreSQL and MySQL
The Challenge: Most applications are read-heavy. Need fast lookups and range queries.
The Solution: PostgreSQL and MySQL use B-Tree indexes:
- Read performance: O(log n) lookups, fast range queries
- Use case: SQL databases, read-heavy workloads
- Examples: User lookups, product searches, reporting
Example: Find user by email:
- Without index: scan up to 1 million rows (slow)
- With B-Tree index: about 20 comparisons, since log₂(1,000,000) ≈ 20 (fast)
- Orders of magnitude faster
Impact: Used by millions of applications. Fast reads. Excellent for SQL databases.
LSM Tree: Cassandra and RocksDB
The Challenge: Write-heavy workloads (logs, metrics, time-series data). Need high write throughput.
The Solution: Cassandra and RocksDB use LSM Trees:
- Write performance: Sequential writes, batched operations
- Use case: Write-heavy workloads, NoSQL databases
- Examples: Logs, metrics, time-series data
Example: Write 1 million log entries:
- B-Tree: Random writes, slow (minutes)
- LSM Tree: Sequential writes, fast (seconds)
- 100x faster writes
Impact: Handles millions of writes/second. Used by major platforms (Facebook, Netflix). Essential for write-heavy workloads.
Inverted Index: Elasticsearch and Google Search
The Challenge: Search engines need fast keyword lookups across billions of documents.
The Solution: Elasticsearch and Google use inverted indexes:
- Search performance: O(1) word lookup, fast boolean queries
- Use case: Full-text search, search engines
- Examples: Product search, document search, web search
Example: Search for “laptop” in 1 billion documents:
- Without index: Scan all documents (hours)
- With inverted index: Direct lookup (milliseconds)
- 1000x faster
Impact: Powers Google Search, Elasticsearch. Fast keyword search. Handles billions of documents.
Hybrid Approach: Modern Databases
The Challenge: Modern applications need both fast reads and fast writes.
The Solution: Many databases use hybrid approaches:
- PostgreSQL: B-Tree indexes for reads, plus a write-ahead log that turns commits into sequential writes
- MongoDB: B-Tree indexes with write optimizations
- DynamoDB: LSM Tree with read optimizations
Example: E-commerce platform:
- Product catalog: B-Tree index (read-heavy)
- Order processing: LSM Tree (write-heavy)
- Search: Inverted index (keyword search)
Impact: Optimized for different workloads. Best of both worlds. Handles complex requirements.
B-Tree vs LSM Tree: When to Use What?
| Aspect | B-Tree | LSM Tree |
|---|---|---|
| Read Performance | Fast (O(log n)) | ⚠️ Slower (may check multiple files) |
| Write Performance | ⚠️ Slower (must rebalance) | Very fast (batched) |
| Range Queries | Excellent | ⚠️ Good |
| Storage | Efficient | ⚠️ More overhead (multiple files) |
| Use Case | Read-heavy, SQL databases | Write-heavy, NoSQL databases |
Index Design Best Practices
1. Index Frequently Queried Columns
Rule: Index columns used in WHERE, JOIN, and ORDER BY clauses.
2. Composite Indexes for Multi-Column Queries
Rule: For queries with multiple WHERE conditions, create composite indexes.
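A quick way to confirm a composite index is actually used is to inspect the query plan. A sketch using SQLite’s EXPLAIN QUERY PLAN (the table, columns, and index name are made up for illustration):

```python
# Sketch: a composite index on (country, city) serves a query that
# filters on both columns. SQLite's plan names the index it chose.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, country TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_users_country_city ON users(country, city)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM users WHERE country = 'USA' AND city = 'Austin'"
).fetchall()
# The plan row mentions the composite index instead of a full table scan
print(plan)
```

The same check works in PostgreSQL and MySQL via their own EXPLAIN output.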
3. Don’t Over-Index
Rule: Monitor index usage. Remove unused indexes. Balance read speed vs write performance.
LLD ↔ HLD Connection
How indexing strategies affect your code design:
Index-Aware Query Design
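One concrete illustration of index-aware query design: the same logical query can or cannot use an index depending on how it is written. A SQLite sketch (schema and names are hypothetical) where wrapping the indexed column in a function forces a full scan:

```python
# Sketch: how query shape decides whether an index is usable. An
# expression on the indexed column hides it from the planner; comparing
# the raw column lets the planner use the index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users(email)")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)

# Index-defeating: LOWER(email) forces a scan of every row
scan = plan("SELECT id FROM users WHERE LOWER(email) = 'a@b.com'")

# Index-aware: comparing the stored column directly uses the index
seek = plan("SELECT id FROM users WHERE email = 'a@b.com'")
```

This is why application code should filter on stored column values where possible, or pair computed filters with expression indexes (covered below under Advanced Indexing Patterns).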
Deep Dive: Production Indexing Considerations
B-Tree: Advanced Performance Characteristics
Write Amplification in B-Trees
Write Amplification: One logical write causes multiple physical writes.
Example:
- Logical write: Update one row
- Physical writes:
- Write to data page
- Write to index page (B-tree node)
- Write to WAL (Write-Ahead Log)
- Total: 3-5x amplification
Production Impact:
- SSD wear: More writes = shorter SSD lifespan
- Throughput: Limited by write amplification
- Latency: Higher latency due to multiple writes
Mitigation:
- Batch writes: Group multiple updates
- Delayed index updates: Update index asynchronously
- SSD optimization: Use SSDs with high write endurance
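The batching mitigation can be sketched with a hypothetical buffered writer that turns many logical writes into few physical flushes. The class, counter, and flush size are illustrative, not a real database API:

```python
# Sketch of write batching: buffer updates and apply them in one pass,
# amortizing per-write index/WAL overhead across the whole batch.
class BatchedWriter:
    def __init__(self, flush_size=100):
        self.buffer = []
        self.flush_size = flush_size
        self.physical_flushes = 0

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One physical write covers the entire batch
            self.physical_flushes += 1
            self.buffer = []

w = BatchedWriter(flush_size=100)
for i in range(1000):
    w.write(i)
```

A thousand logical writes become ten physical flushes; the trade-off is that buffered data is lost on a crash unless a log protects it.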
B-Tree Node Size Optimization
Node Size: Affects tree height and performance.
Small Nodes (4KB):
- Taller tree: more levels, so more node reads per lookup
- Cache-friendly: individual nodes fit easily in memory
- More I/O: more nodes to traverse per search
Large Nodes (64KB):
- Shallower tree: fewer levels, fewer node reads per lookup
- Cache-unfriendly: fewer nodes fit in cache
- Waste: a lookup may read unused data within a node
Production Defaults:
- PostgreSQL: 8KB pages (good balance)
- MySQL InnoDB: 16KB pages (default)
- SQL Server: 8KB pages
Best Practice: Use database defaults unless profiling shows issues
LSM Tree: Production Tuning
Section titled “LSM Tree: Production Tuning”Compaction Strategies Comparison
Section titled “Compaction Strategies Comparison”Size-Tiered Compaction (STCS):
How it works:
- Small files merged into larger files
- Files organized by size tiers
Pros:
- Simple to understand
- Good for write-heavy workloads
- Low write amplification
Cons:
- High read amplification (many files to check)
- Space amplification (temporary duplicates)
Use case: Write-heavy, read-light workloads
Leveled Compaction (LCS):
How it works:
- Files organized into levels
- Each level is 10x larger than previous
- Files within level don’t overlap
Pros:
- Low read amplification (fewer files to check)
- Predictable space usage
- Better for read-heavy workloads
Cons:
- Higher write amplification (more rewriting)
- More complex
Use case: Read-heavy, balanced workloads
Production Example: Cassandra
- Default: Size-tiered compaction
- Alternative: Leveled compaction (tunable)
- Choice: Based on read/write ratio
LSM Tree Write Stall Problem
Problem: Compaction can’t keep up with writes.
Symptoms:
- Write latency spikes: Writes stall waiting for compaction
- Disk I/O saturation: Compaction consumes all I/O
- Throughput degradation: System slows down
Solutions:
Solution 1: Throttle Writes
```python
import time

class ThrottledLSMWriter:
    # Assumes compaction_lag, threshold, and memtable are maintained
    # elsewhere in the storage engine.
    def write(self, key, value):
        # Check compaction lag
        if self.compaction_lag > self.threshold:
            # Throttle: slow down writes
            time.sleep(0.1)
        self.memtable.write(key, value)
```

Solution 2: Increase Compaction Threads
- More threads = faster compaction
- Trade-off: More CPU/memory usage
Solution 3: Tune Compaction Parameters
- Adjust compaction thresholds
- Balance between write speed and read performance
Inverted Index: Production Optimization
Index Size Management
Challenge: Inverted indexes can be larger than the original data.
Example:
- Original documents: 10GB
- Inverted index: 15GB (150% of original!)
Why so large:
- Stores word → document mappings
- Posting lists (lists of document IDs)
- Term frequencies, positions
Optimization Techniques:
1. Stop Words Removal
- Remove common words (“the”, “a”, “an”)
- Reduction: 20-30% index size
2. Stemming
- Reduce words to root (“running” → “run”)
- Reduction: 10-20% index size
3. Compression
- Compress posting lists (delta encoding)
- Reduction: 50-70% index size
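Delta encoding, the compression technique mentioned above, is simple to sketch: store the gaps between sorted document IDs instead of the IDs themselves, since small gaps take fewer bits to encode. The posting list values here are illustrative:

```python
# Sketch of delta (gap) encoding for a posting list: differences between
# consecutive document IDs are small and compress well.
def delta_encode(doc_ids):
    gaps, prev = [], 0
    for doc_id in doc_ids:        # doc_ids must be sorted ascending
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap              # running sum restores the original ID
        doc_ids.append(total)
    return doc_ids

postings = [1000, 1003, 1010, 1011, 1050]
```

`delta_encode(postings)` yields `[1000, 3, 7, 1, 39]`; a variable-length integer coder would then store each small gap in one byte instead of four.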
Production Example: Elasticsearch
- Uses compression by default
- Index size: Typically 30-50% of original data
- Trade-off: Slightly slower queries (decompression overhead)
Query Performance Optimization
Challenge: Complex queries can be slow.
Optimization Techniques:
1. Boolean Query Optimization
Bad:
```python
# Query all documents, then filter
results = index.search("term1 AND term2 AND term3")
# Checks all documents with term1
```

Good:

```python
# Start with rarest term
results = index.search("rare_term AND common_term")
# Checks fewer documents first
```

Rule: Process terms from rarest to most common
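The rarest-first rule can be sketched as a set intersection that starts from the shortest posting list, so each step can only shrink the candidate set. The lists below are toy data, not a real query engine:

```python
# Sketch of rarest-first AND evaluation over posting lists.
def intersect_rarest_first(posting_lists):
    # Sort by list length: rarest term first
    ordered = sorted(posting_lists, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= set(postings)
        if not result:            # early exit: the AND is already empty
            break
    return sorted(result)

common = [1, 2, 3, 4, 5, 6, 7, 8]   # frequent term
mid    = [2, 4, 6, 8]
rare   = [4, 8]                     # rare term: start here
```

Starting from `rare` means no step ever examines more than two candidates, regardless of how long `common` is.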
2. Caching Frequent Queries
Pattern: Cache query results.
```python
class CachedSearchEngine:
    def __init__(self, index, cache):
        self.index = index
        self.cache = cache

    def search(self, query):
        # Check cache first
        cache_key = f"query:{hash(query)}"
        cached = self.cache.get(cache_key)
        if cached:
            return cached

        # Execute query
        results = self.index.search(query)

        # Cache results
        self.cache.set(cache_key, results, ttl=300)
        return results
```

Benefit: 10-100x faster for repeated queries
Index Maintenance: Production Considerations
Section titled “Index Maintenance: Production Considerations”Index Rebuilding Strategies
Section titled “Index Rebuilding Strategies”When to Rebuild:
- Fragmentation: Index becomes fragmented (slow queries)
- Statistics outdated: Query planner uses wrong estimates
- Schema changes: Adding/removing indexes
Strategies:
1. Online Rebuild (Zero Downtime)
- PostgreSQL: REINDEX CONCURRENTLY
- MySQL: Online DDL (5.6+)
- Benefit: No downtime
- Cost: Slower, uses more resources
2. Offline Rebuild (Faster)
- PostgreSQL: REINDEX
- MySQL: ALTER TABLE ... DROP INDEX, ADD INDEX
- Benefit: Faster
- Cost: Table locked during rebuild
Production Pattern:
- Use online rebuild for production (zero downtime)
- Use offline rebuild during maintenance windows
Index Monitoring and Maintenance
What to Monitor:
1. Index Usage
```sql
-- PostgreSQL: Find unused indexes
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

2. Index Bloat
```sql
-- PostgreSQL: Find large indexes (bloat candidates)
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 1000000000  -- >1GB
ORDER BY pg_relation_size(indexrelid) DESC;
```

3. Index Fragmentation
- Monitor index scan performance
- Alert on degradation
- Rebuild when fragmentation >20%
Production Automation:
```python
class IndexMaintenance:
    def monitor_indexes(self):
        unused = self.find_unused_indexes()
        bloated = self.find_bloated_indexes()

        # Alert on unused indexes
        if unused:
            self.alert(f"Unused indexes: {unused}")

        # Schedule rebuild for bloated indexes
        if bloated:
            self.schedule_rebuild(bloated)
```

Performance Benchmarks: Real-World Index Performance
B-Tree vs LSM Tree: Production Comparison
| Metric | B-Tree | LSM Tree |
|---|---|---|
| Read Latency (point) | 1-5ms | 5-20ms |
| Read Latency (range) | 5-10ms | 10-50ms |
| Write Latency | 5-20ms | 1-5ms |
| Write Throughput | 1K-10K ops/sec | 10K-100K ops/sec |
| Read Throughput | 10K-50K ops/sec | 5K-20K ops/sec |
| Space Amplification | 1.1x | 1.5-2x |
| Write Amplification | 3-5x | 2-10x (during compaction) |
Key Insights:
- B-Tree: Better for read-heavy workloads
- LSM Tree: Better for write-heavy workloads
- Hybrid: Some engines layer read optimizations on top of LSM storage (e.g., RocksDB’s Bloom filters and block indexes)
Index Impact on Query Performance
Benchmark: 1M row table, query by indexed column
| Scenario | Without Index | With Index | Improvement |
|---|---|---|---|
| Point lookup | 500ms | 2ms | 250x faster |
| Range query | 800ms | 5ms | 160x faster |
| Join on indexed column | 2000ms | 50ms | 40x faster |
| Order by indexed column | 1500ms | 10ms | 150x faster |
Key Insight: Indexes provide 40-250x performance improvement!
Advanced Indexing Patterns
Pattern 1: Partial Indexes
Pattern: Index only a subset of rows.
Example:
```sql
-- Index only active orders
CREATE INDEX idx_active_orders
ON orders(order_date)
WHERE status = 'active';
```

Benefits:
- Smaller index (only active orders)
- Faster queries (fewer rows to scan)
- Less maintenance overhead
Use case: Queries that filter by common condition
Pattern 2: Covering Indexes
Pattern: Index contains all columns needed for the query.
Example:
```sql
-- Query: SELECT name, email FROM users WHERE country = 'USA'
-- Covering index includes all columns the query needs
CREATE INDEX idx_users_country_name_email
ON users(country, name, email);
```

Benefits:
- Index-only scan: Don’t need to read table
- Faster queries: 2-5x faster than regular index
- Less I/O: No table access needed
Trade-off: Larger index (more columns)
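SQLite makes the index-only scan visible: its query plan says “COVERING INDEX” when the index satisfies the whole query. A sketch mirroring the covering-index example above (schema and index name are illustrative):

```python
# Sketch: verify an index-only scan via SQLite's query plan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (country TEXT, name TEXT, email TEXT)")
conn.execute(
    "CREATE INDEX idx_users_country_name_email "
    "ON users(country, name, email)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name, email FROM users WHERE country = 'USA'"
).fetchall()
# Every column the query touches lives in the index, so the plan
# reports a covering-index search: the table itself is never read.
print(plan)
```

PostgreSQL shows the equivalent as an “Index Only Scan” in EXPLAIN output.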
Pattern 3: Expression Indexes
Pattern: Index on a computed expression.
Example:
```sql
-- Index on lowercased email (case-insensitive search)
CREATE INDEX idx_users_email_lower
ON users(LOWER(email));

-- A query filtering on LOWER(email) can now use the index
```

Benefits:
- Enables index usage for computed queries
- Faster than full table scan
Use case: Queries with functions/expressions
Key Takeaways
What’s Next?
Congratulations! You’ve completed the Database & Storage Systems section. You now understand:
- Relational databases (ACID, normalization, indexing)
- Isolation levels (read committed, repeatable read, serializable)
- Scaling strategies (read replicas, connection pooling)
- Sharding & partitioning (horizontal vs vertical)
- NoSQL databases (Document, Key-Value, Column-family, Graph)
- Database selection (decision framework)
- Indexing strategies (B-trees, LSM trees, inverted indexes)
Next up: Explore Caching & Performance Optimization — Learn caching strategies to make your systems even faster!