
NoSQL Databases

Beyond relational: flexible data models for modern applications

NoSQL (Not Only SQL) refers to non-relational databases that use flexible data models. They’re designed for scalability, performance, and handling unstructured/semi-structured data.


Document Databases

Document databases store data as documents (JSON, BSON, XML). Documents are self-contained and can have nested structures.


Key Characteristics:

  • Flexible schema: Each document can have different fields
  • Nested data: Store related data together
  • No JOINs: Related data in same document
  • JSON-like: Easy to work with in applications

Examples: MongoDB, CouchDB, Amazon DocumentDB


User Document in MongoDB:

{
  "_id": 123,
  "name": "Alice",
  "email": "[email protected]",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "zip": "94102"
  },
  "orders": [
    {
      "order_id": 1,
      "date": "2024-01-15",
      "items": [
        {"product": "Laptop", "price": 1000},
        {"product": "Mouse", "price": 20}
      ],
      "total": 1020
    }
  ]
}

Benefits:

  • All user data in one document
  • No JOINs needed
  • Easy to read/write
  • Flexible (can add fields easily)
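As a minimal sketch of what "easy to read/write" means in practice, here is a document like the one above round-tripped with the pymongo driver (the local MongoDB instance and the database/collection names are assumptions for illustration):

from pymongo import MongoClient

# Hypothetical connection and names; any MongoDB instance works the same way
client = MongoClient("mongodb://localhost:27017")
users = client["shop"]["users"]

# Write: the whole nested document (address, orders) goes in with one insert
users.insert_one({
    "_id": 123,
    "name": "Alice",
    "address": {"city": "San Francisco", "zip": "94102"},
    "orders": [{"order_id": 1, "total": 1020}],
})

# Read: a single lookup returns the user with all nested data, no JOINs
user = users.find_one({"_id": 123})
print(user["orders"][0]["total"])  # 1020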

Key-Value Stores

Key-value stores are the simplest NoSQL databases. They store data as key-value pairs.


Key Characteristics:

  • Simple: Just key-value pairs
  • Fast: O(1) lookups by key
  • Limited queries: Can only query by key
  • Great for caching: Fast access patterns

Examples: Redis, DynamoDB, Memcached



Common Use Cases:

  • Caching: Store frequently accessed data
  • Session storage: User sessions
  • Configuration: App settings
  • Feature flags: Toggle features
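A minimal sketch of the session-storage use case with the redis-py client (the key scheme and TTL are illustrative; assumes a local Redis instance):

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a session as JSON under a single key, expiring after 30 minutes
session = {"user_id": 123, "cart": ["laptop", "mouse"]}
r.set("session:user123", json.dumps(session), ex=1800)

# O(1) lookup by key; returns None once the TTL has passed
raw = r.get("session:user123")
if raw is not None:
    print(json.loads(raw)["cart"])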

Column-Family Stores

Column-family stores organize data by columns instead of rows. Data is stored in column families, optimized for reading specific columns.


Key Characteristics:

  • Column-oriented: Data stored by columns
  • Wide tables: Can have many columns
  • Efficient reads: Read only needed columns
  • Time-series: Great for time-series data

Examples: Cassandra, HBase, Amazon Keyspaces


Time-Series Data in Cassandra:

| Row Key  | Timestamp        | Temperature | Humidity | Pressure |
|----------|------------------|-------------|----------|----------|
| sensor:1 | 2024-01-01 10:00 | 25°C        | 60%      | 1013     |
| sensor:1 | 2024-01-01 11:00 | 26°C        | 58%      | 1014     |
| sensor:1 | 2024-01-01 12:00 | 27°C        | 55%      | 1015     |

Benefits:

  • Efficient to read all temperatures
  • Can add new columns easily
  • Optimized for time-series queries
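A minimal sketch of this layout using the DataStax cassandra-driver (the cluster address, keyspace, and table are assumptions for illustration):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("telemetry")  # hypothetical keyspace

# Partition by sensor, cluster by timestamp: one sensor's readings are
# stored together and ordered by time.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id   text,
        ts          timestamp,
        temperature double,
        humidity    double,
        pressure    double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

# Read only the columns you need for one sensor
rows = session.execute(
    "SELECT ts, temperature FROM readings WHERE sensor_id = %s",
    ("sensor:1",),
)
for row in rows:
    print(row.ts, row.temperature)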

Graph Databases

Graph databases store data as nodes (entities) and edges (relationships). They are optimized for relationship queries.


Key Characteristics:

  • Nodes: Entities (users, products, etc.)
  • Edges: Relationships (friends, purchases, etc.)
  • Traversals: Follow relationships efficiently
  • Relationship queries: “Find friends of friends”

Examples: Neo4j, Amazon Neptune, ArangoDB


Social Network Graph:

Nodes:
- User(id: 1, name: "Alice")
- User(id: 2, name: "Bob")
- User(id: 3, name: "Charlie")
- Product(id: 10, name: "Laptop")
Edges:
- (Alice) -[FRIENDS]-> (Bob)
- (Bob) -[FRIENDS]-> (Charlie)
- (Alice) -[PURCHASED]-> (Laptop)
- (Bob) -[LIKES]-> (Laptop)

Query: “Find products liked by friends of Alice”

  • Start at Alice
  • Traverse FRIENDS edges → Bob
  • Traverse LIKES edges → Laptop
  • Result: Laptop
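Expressed as an actual query, this traversal is a couple of lines of Cypher, shown here through the official neo4j Python driver (the connection details and property names are assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Two hops: Alice -> FRIENDS -> friend -> LIKES -> product
query = """
MATCH (:User {name: $name})-[:FRIENDS]->(:User)-[:LIKES]->(p:Product)
RETURN DISTINCT p.name AS product
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["product"])  # Laptop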

Major companies use NoSQL databases for different use cases:

Case Study: eBay's Product Catalog (MongoDB)

The Challenge: eBay stores product listings with widely varying structures; electronics and clothing, for example, share few attributes.

The Solution: eBay uses MongoDB for product catalog:

  • Flexible schema: Each product category has different fields
  • Nested data: Product details, images, reviews in single document
  • Scale: Millions of products, fast queries

Why Document DB?

  • Products have different structures (electronics vs clothing)
  • Related data together (no JOINs needed)
  • Fast reads (single document lookup)

Example: Product document:

{
  "id": 12345,
  "name": "iPhone 15",
  "category": "Electronics",
  "specs": {
    "storage": "256GB",
    "color": "Blue"
  },
  "reviews": [
    {"user": "John", "rating": 5},
    {"user": "Jane", "rating": 4}
  ]
}

Impact: Handles millions of products. Fast product pages. Flexible schema for different categories.

Case Study: Twitter's Caching Layer (Redis)

The Challenge: Twitter needs fast access to user sessions, timelines, and cached data, all of which are simple key-value lookups.

The Solution: Twitter uses Redis extensively:

  • Sessions: session:user123 → session data
  • Timelines: timeline:user123 → cached timeline
  • Counters: likes:tweet456 → like count

Why Key-Value?

  • Simple lookups (O(1) access)
  • Fast (in-memory)
  • Perfect for caching

Example: Get user timeline:

  • Key: timeline:user123
  • Value: Cached timeline data
  • Access: O(1) lookup, instant response

Impact: Timeline loads in milliseconds. Handles billions of keys. Essential for Twitter’s performance.

Case Study: Netflix's Viewing History (Cassandra)

The Challenge: Netflix stores time-series data (viewing history, recommendations, analytics) and needs to read specific columns efficiently.

The Solution: Netflix uses Cassandra:

  • Time-series data: User viewing history by timestamp
  • Column-oriented: Read specific columns efficiently
  • Scale: Billions of rows, petabytes of data

Why Column-Family?

  • Efficient column reads (only read needed columns)
  • Time-series optimized (append-heavy workloads)
  • Scales horizontally

Example: User viewing history:

  • Row key: user:12345
  • Columns: 2024-01-01:movie1, 2024-01-02:movie2, etc.
  • Query: Read all columns for user (efficient)

Impact: Handles billions of viewing records. Fast analytics queries. Scales to petabytes.

Case Study: LinkedIn's Connection Graph (Neo4j)

The Challenge: LinkedIn needs to find connections between users efficiently. "People you may know" requires graph traversals.

The Solution: LinkedIn uses Neo4j:

  • Nodes: Users, companies, skills
  • Edges: Connections, works_at, has_skill
  • Queries: “Find friends of friends who work at Google”

Why Graph DB?

  • Efficient relationship traversals
  • Complex queries (friends of friends)
  • Natural fit for social networks

Example: Find connections:

  • Start: User A
  • Traverse: A → friends → B → friends → C
  • Result: C is a friend of a friend

Impact: Fast connection discovery. “People you may know” in milliseconds. Handles billions of relationships.

Polyglot Persistence: Multiple Databases Together

The Challenge: Modern applications have diverse data needs; no single database fits them all.

The Solution: Companies use multiple databases:

  • PostgreSQL: User accounts, orders (ACID transactions)
  • MongoDB: Product catalog, content (flexible schema)
  • Redis: Sessions, cache (fast lookups)
  • Elasticsearch: Search (full-text search)

Example: E-commerce platform:

  • PostgreSQL: User accounts, orders, payments
  • MongoDB: Product catalog, reviews
  • Redis: Shopping cart, sessions
  • Elasticsearch: Product search

Impact: Right tool for each job. Optimized performance. Handles complex requirements.
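As a sketch of how one read path might compose two of these stores (all names and wiring are hypothetical; error handling omitted):

import json
import redis
from pymongo import MongoClient

cache = redis.Redis(decode_responses=True)   # sessions, carts, hot data
catalog = MongoClient()["shop"]["products"]  # flexible product documents

def get_product(product_id: int):
    # Fast path: check the cache first
    cached = cache.get(f"product:{product_id}")
    if cached is not None:
        return json.loads(cached)
    # Slow path: read the catalog, then populate the cache
    product = catalog.find_one({"_id": product_id})
    if product is not None:
        cache.set(f"product:{product_id}", json.dumps(product), ex=300)
    return product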



SQL vs NoSQL: Quick Comparison

| Aspect       | SQL            | NoSQL                 |
|--------------|----------------|-----------------------|
| Schema       | Fixed, rigid   | Flexible, dynamic     |
| Queries      | Complex JOINs  | Simple lookups        |
| Scale        | Vertical       | Horizontal            |
| Transactions | ACID           | Eventually consistent |
| Use Case     | Financial, ERP | Social media, IoT     |



Deep Dive: Production Patterns and Advanced Considerations

Section titled “Deep Dive: Production Patterns and Advanced Considerations”

Document Databases: Schema Evolution in Production

Section titled “Document Databases: Schema Evolution in Production”

Reality: Document databases are schema-flexible, not schema-less.

Production Challenge: Schema changes still require migration planning.

Example: Adding Required Field

Before:

{
  "_id": 123,
  "name": "Alice",
  "email": "[email protected]"
}

After (New Required Field):

{
  "_id": 123,
  "name": "Alice",
  "email": "[email protected]",
  "phone": "123-456-7890"  // NEW REQUIRED FIELD
}

Migration Strategy:

class UserMigration:
    def __init__(self, collection, legacy_system):
        self.collection = collection        # pymongo collection being migrated
        self.legacy_system = legacy_system  # hypothetical client for the old store

    def fetch_phone_from_legacy_system(self, user_id):
        # Look up the missing value wherever it currently lives
        return self.legacy_system.get_phone(user_id)

    def migrate_user(self, user_doc):
        # Backfill only documents that predate the new field
        if 'phone' not in user_doc:
            user_doc['phone'] = self.fetch_phone_from_legacy_system(user_doc['_id'])
            self.collection.update_one(
                {'_id': user_doc['_id']},
                {'$set': {'phone': user_doc['phone']}},
            )
        return user_doc

Production Pattern:

  1. Add field as optional (backward compatible)
  2. Backfill existing documents (background job)
  3. Make field required in application logic
  4. Eventually enforce at database level

Document Size Limits

Problem: Documents have size limits.

Limits:

  • MongoDB: 16MB per document
  • CouchDB: No hard limit, but performance degrades >1MB
  • DynamoDB: 400KB per item

Production Impact:

  • Large documents: Slow to transfer, memory intensive
  • Sharding: Large documents harder to shard efficiently

Solution: Reference Pattern

Instead of:

{
  "_id": 123,
  "name": "Alice",
  "orders": [
    { /* 1000 orders embedded */ }
  ]
}

Use References:

{
  "_id": 123,
  "name": "Alice",
  "order_ids": [1, 2, 3, ...]  // References
}

Benefit: Smaller documents, better sharding, faster queries


Key-Value Stores: Distributed Coordination Patterns

Pattern 1: Distributed Counters

Challenge: Incrementing counters atomically across distributed systems.

Solution: Redis INCR

class DistributedCounter:
    def __init__(self, redis_client):
        self.redis = redis_client

    def increment(self, key, amount=1):
        # Atomic increment on the Redis server
        return self.redis.incrby(key, amount)

    def decrement(self, key, amount=1):
        return self.redis.decrby(key, amount)

    def get(self, key):
        return int(self.redis.get(key) or 0)

Production Use Cases:

  • Page views: Track views across servers
  • Rate limiting: Count requests per user
  • Voting: Count votes in real-time
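The rate-limiting use case, for instance, is a few lines on top of an atomic counter (the window size and key scheme are illustrative; note the small race between INCR and EXPIRE if the process dies in between):

def allow_request(redis_client, user_id, limit=100, window_seconds=60):
    key = f"rate:{user_id}"
    count = redis_client.incr(key)  # atomic, safe across servers
    if count == 1:
        # First request in this window: start the expiry clock
        redis_client.expire(key, window_seconds)
    return count <= limit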

Pattern 2: Distributed Locks

Challenge: Coordinate access to shared resources across distributed systems.

Solution: Redis SETNX with TTL

from contextlib import contextmanager

class LockAcquisitionError(Exception):
    pass

class DistributedLock:
    def __init__(self, redis_client):
        self.redis = redis_client

    def acquire(self, lock_key, ttl_seconds=10):
        # SET NX EX is atomic: set only if the key does not already exist,
        # and expire it so a crashed holder cannot deadlock everyone else
        acquired = self.redis.set(
            lock_key,
            "locked",
            nx=True,         # only set if not exists
            ex=ttl_seconds,  # expire after TTL
        )
        return acquired is not None

    def release(self, lock_key):
        self.redis.delete(lock_key)

    @contextmanager
    def lock(self, lock_key, ttl_seconds=10):
        if self.acquire(lock_key, ttl_seconds):
            try:
                yield
            finally:
                self.release(lock_key)
        else:
            raise LockAcquisitionError("Could not acquire lock")

Production Considerations:

  • TTL: Prevents deadlocks (lock expires)
  • Renewal: Extend TTL for long operations
  • Fencing tokens: Prevent stale locks

Pattern 3: Pub/Sub Messaging

Challenge: Notify multiple services of events.

Solution: Redis Pub/Sub

import json

class EventPublisher:
    def __init__(self, redis_client):
        self.redis = redis_client

    def publish(self, channel, message):
        self.redis.publish(channel, json.dumps(message))


class EventSubscriber:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.pubsub = redis_client.pubsub()

    def subscribe(self, channel, handler):
        self.pubsub.subscribe(channel)
        for message in self.pubsub.listen():
            # listen() also yields subscribe confirmations; skip them
            if message['type'] == 'message':
                data = json.loads(message['data'])
                handler(data)

Production Use Cases:

  • Cache invalidation: Notify all servers to clear cache
  • Event distribution: Distribute events to multiple consumers
  • Real-time updates: Push updates to connected clients
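Wiring the two classes together for cache invalidation might look like this (the channel name and handler are illustrative; subscribe() blocks, so it normally runs in a dedicated process or thread):

import redis

r = redis.Redis(decode_responses=True)

def clear_local_cache(event):
    print("invalidating", event["key"])

# In a background worker:
#   EventSubscriber(r).subscribe("cache-invalidation", clear_local_cache)

# Any app server can then announce a change:
EventPublisher(r).publish("cache-invalidation", {"key": "product:123"})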

Column-Family Stores: Production Considerations

Section titled “Column-Family Stores: Production Considerations”

Challenge: Wide rows (many columns) can become very large.

Example: Time-Series Data

Row Structure:

Row Key: sensor:1
Columns:
  timestamp:2024-01-01-10:00 → temperature:25
  timestamp:2024-01-01-10:01 → temperature:26
  timestamp:2024-01-01-10:02 → temperature:27
  ... (millions of columns)

Problem: Row becomes too large, slow to read.

Solution: Row Partitioning

Partition by Time Window:

Row Key: sensor:1:2024-01-01
Columns: Only columns for that day
Row Key: sensor:1:2024-01-02
Columns: Only columns for next day

Benefit: Smaller rows, faster reads, better distribution
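In application code, the partitioning often amounts to how you compose the row key, as in this small helper (the key scheme mirrors the example above):

from datetime import datetime

def partitioned_row_key(sensor_id: int, ts: datetime) -> str:
    # One row per sensor per day keeps any single row bounded in size
    return f"sensor:{sensor_id}:{ts:%Y-%m-%d}"

print(partitioned_row_key(1, datetime(2024, 1, 1, 10, 0)))  # sensor:1:2024-01-01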


Challenge: Column-family stores accumulate many versions (tombstones, updates).

Solution: Compaction

Types:

  • Size-tiered compaction: Merge small files into larger ones
  • Leveled compaction: Organize into levels, merge within levels
  • Time-window compaction: Compact by time windows

Production Impact:

  • Write amplification: Compaction rewrites data (2-10x)
  • Disk I/O: High during compaction
  • Performance: Compaction can slow down reads/writes

Best Practice: Schedule compaction during low-traffic periods
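In Cassandra, the compaction strategy is per-table configuration; a time-series table would typically opt into time-window compaction, roughly like this (run with a cassandra-driver session as in the earlier sketch; keyspace and table names are illustrative):

# Align compaction windows with how the data is queried (daily here)
session.execute("""
    ALTER TABLE telemetry.readings
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
""")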


Graph Databases: Production Patterns

Pattern 1: Relationship Traversal Optimization

Section titled “Pattern 1: Relationship Traversal Optimization”

Challenge: Deep traversals can be slow.

Example: “Friends of Friends” Query

Naive Approach:

MATCH (user:User {id: 123})-[:FRIENDS]->(friend)-[:FRIENDS]->(fof)
RETURN fof

Problem: May traverse millions of relationships.

Optimized Approach:

MATCH (user:User {id: 123})-[:FRIENDS*2..2]->(fof)
WHERE fof.id <> 123 // Exclude self
RETURN DISTINCT fof
LIMIT 100 // Limit results

Production Techniques:

  • Limit depth: Don’t traverse too deep
  • Limit results: Use LIMIT clause
  • Index relationships: Index on relationship properties
  • Caching: Cache common traversals

Pattern 2: Graph Partitioning

Challenge: Large graphs don't fit on a single machine.

Solution: Graph Partitioning

Strategies:

  • Vertex-cut: Split vertices across machines
  • Edge-cut: Split edges across machines
  • Hybrid: Combination of both

Production Example: Neo4j Fabric

  • Sharding: Distributes graph across multiple databases
  • Query routing: Routes queries to appropriate shards
  • Cross-shard queries: Merges results from multiple shards

Trade-off: Cross-shard queries are slower (network overhead)


NoSQL Performance Benchmarks: Real-World Numbers

Section titled “NoSQL Performance Benchmarks: Real-World Numbers”
| Database Type             | Read Latency | Write Latency | Throughput       | Use Case                 |
|---------------------------|--------------|---------------|------------------|--------------------------|
| Document (MongoDB)        | 1-5ms        | 5-20ms        | 10K-50K ops/sec  | General purpose          |
| Key-Value (Redis)         | 0.1-1ms      | 0.1-1ms       | 100K-1M ops/sec  | Caching, sessions        |
| Column-Family (Cassandra) | 1-10ms       | 5-50ms        | 50K-200K ops/sec | Time-series, wide tables |
| Graph (Neo4j)             | 5-50ms       | 10-100ms      | 1K-10K ops/sec   | Relationship queries     |

Key Insights:

  • Key-Value: Fastest (in-memory)
  • Document: Good balance (flexible + performant)
  • Column-Family: Best for writes (LSM trees)
  • Graph: Optimized for traversals (not raw speed)

Anti-Pattern 1: Recreating JOINs in Document Databases

Problem: Trying to do complex JOINs in document databases.

Bad:

// Chaining JOIN-style lookups in MongoDB (works, but gets expensive fast)
db.users.aggregate([
  { $lookup: { from: "orders", ... } },   // Expensive!
  { $lookup: { from: "payments", ... } }  // Very expensive!
])

Good:

// Denormalize data into documents
{
"_id": 123,
"name": "Alice",
"recent_orders": [ /* embedded */ ],
"payment_info": { /* embedded */ }
}

Lesson: Design for NoSQL’s strengths, not SQL patterns


Anti-Pattern 2: Ignoring Consistency Guarantees

Section titled “Anti-Pattern 2: Ignoring Consistency Guarantees”

Problem: Assuming eventual consistency means “eventually correct”.

Reality: Eventual consistency can leave data permanently inconsistent if conflicting writes are never detected and reconciled.

Example:

  • User updates profile on Node A
  • User reads profile from Node B (stale)
  • User makes decision based on stale data
  • Result: A wrong decision whose effects persist even after the replicas converge

Solution: Use read-after-write consistency, version vectors
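A common building block for this is optimistic versioning: every document carries a version counter, and a write applies only if the version the writer read is still current. A minimal sketch with pymongo (the field and collection names are illustrative):

from pymongo import MongoClient

users = MongoClient()["app"]["users"]

def update_profile(user_id, changes: dict) -> bool:
    doc = users.find_one({"_id": user_id})
    # Conditional write: matches only if nobody bumped the version since our read
    result = users.update_one(
        {"_id": user_id, "version": doc["version"]},
        {"$set": {**changes, "version": doc["version"] + 1}},
    )
    return result.modified_count == 1  # False means stale read: re-read and retry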


Anti-Pattern 3: Over-Normalizing in Document DBs

Section titled “Anti-Pattern 3: Over-Normalizing in Document DBs”

Problem: Normalizing like SQL (separate collections for everything).

Bad:

// Over-normalized (like SQL)
Users collection
Orders collection
OrderItems collection
Products collection
// Need multiple queries to get order!

Good:

// Denormalized (NoSQL style)
{
  "_id": "order:123",
  "user": { "id": 456, "name": "Alice" },  // Embedded
  "items": [
    { "product": "Laptop", "price": 1000 }  // Embedded
  ]
}
// Single query gets everything!

Lesson: Denormalize for read performance



Now that you understand different database types, let’s learn how to choose the right database for your use case:

Next up: Choosing the Right Database — Decision framework for database selection and mapping domain models to storage.