When I set out to build s3-like-storage, I wanted to understand the fundamental challenges of distributed storage systems. Here's what I learned.
Key Design Decisions
1. Consistent Hashing
Traditional hash-based distribution (hash(key) mod N) reshuffles most keys when nodes are added or removed. Consistent hashing with virtual nodes, sketched after this list, provides:
- Minimal data movement during rebalancing
- Even load distribution
- Graceful handling of node failures
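Here is a minimal sketch of such a ring. The names (Ring, AddNode, Locate), the SHA-1 hash, and the 128 virtual nodes per physical node are illustrative assumptions, not the project's actual code:

```go
// hashring.go - a minimal consistent-hash ring with virtual nodes (illustrative only).
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

type Ring struct {
	vnodes int               // virtual nodes per physical node
	keys   []uint64          // sorted hash positions on the ring
	owners map[uint64]string // hash position -> physical node
}

func hashKey(s string) uint64 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

func NewRing(vnodes int) *Ring {
	return &Ring{vnodes: vnodes, owners: map[uint64]string{}}
}

// AddNode places `vnodes` points for the node on the ring.
func (r *Ring) AddNode(node string) {
	for i := 0; i < r.vnodes; i++ {
		h := hashKey(fmt.Sprintf("%s#%d", node, i))
		r.owners[h] = node
		r.keys = append(r.keys, h)
	}
	sort.Slice(r.keys, func(a, b int) bool { return r.keys[a] < r.keys[b] })
}

// Locate returns the node owning an object key: the first ring
// position clockwise from the key's hash.
func (r *Ring) Locate(objectKey string) string {
	h := hashKey(objectKey)
	i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
	if i == len(r.keys) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.keys[i]]
}

func main() {
	ring := NewRing(128)
	for _, n := range []string{"node-a", "node-b", "node-c"} {
		ring.AddNode(n)
	}
	fmt.Println(ring.Locate("bucket1/photos/cat.jpg"))
}
```

Because each physical node owns many scattered positions, removing one node only reassigns the keys that mapped to its positions, roughly 1/N of the keyspace, and spreads them across the surviving nodes.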
2. Replication Strategy
We chose 3-way replication with the following write path:
Client → Primary → Ack
Primary → Replica 1 → Replica 2 (async)
Writes are acknowledged once the primary confirms the local write; replication to the remaining two nodes happens asynchronously to bring the object up to full three-copy durability.
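A rough sketch of that path in code, assuming hypothetical PutObject, storeLocal, and sendToReplica helpers rather than the project's actual handlers:

```go
// writepath.go - illustrative acknowledge-then-replicate write path.
package main

import (
	"fmt"
	"log"
)

type Object struct {
	Key  string
	Data []byte
}

// storeLocal persists the object on the primary (stubbed here).
func storeLocal(obj Object) error {
	// write to local disk, fsync, update metadata...
	return nil
}

// sendToReplica forwards the object to one replica node (stubbed here).
func sendToReplica(node string, obj Object) error {
	// network call to the replica...
	return nil
}

// PutObject acknowledges the client once the primary has the object,
// then replicates to the remaining nodes in the background.
func PutObject(obj Object, replicas []string) error {
	if err := storeLocal(obj); err != nil {
		return err // client sees the failure; nothing was acked
	}
	go func() {
		for _, node := range replicas {
			if err := sendToReplica(node, obj); err != nil {
				// a real system would retry or repair this copy
				log.Printf("replicate %s to %s failed: %v", obj.Key, node, err)
			}
		}
	}()
	return nil // ack to the client
}

func main() {
	err := PutObject(Object{Key: "bucket1/a.txt", Data: []byte("hi")},
		[]string{"node-b", "node-c"})
	fmt.Println("acked:", err == nil)
}
```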
3. Metadata vs. Data Separation
PostgreSQL handles metadata (bucket info, object metadata, ACLs) while the actual object bytes live on the distributed data nodes. This separation, illustrated after the list below, allows:
- ACID transactions for metadata
- Scalable data storage
- Independent scaling of each layer
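As a rough illustration of that split, a metadata record carries everything needed to locate and authorize an object while holding none of its bytes. The field names below are hypothetical, not the actual schema:

```go
// metadata.go - illustrative shape of the metadata vs. data split.
package main

import (
	"fmt"
	"time"
)

// ObjectMeta is what the PostgreSQL metadata store holds: location,
// size, integrity, and access-control information, but no object bytes.
type ObjectMeta struct {
	Bucket    string // e.g. "photos"
	Key       string // e.g. "2024/cat.jpg"
	Size      int64  // bytes
	ETag      string // content hash for integrity checks
	ACL       string // owner / public-read / ...
	CreatedAt time.Time
	// Where the bytes actually live: the replica set chosen by the
	// consistent-hash ring, recorded so reads don't depend on a stale ring.
	DataNodes []string // e.g. ["node-a", "node-d", "node-f"]
}

func main() {
	meta := ObjectMeta{
		Bucket: "photos", Key: "2024/cat.jpg", Size: 1 << 20,
		ETag: "abc123", ACL: "private",
		CreatedAt: time.Now(),
		DataNodes: []string{"node-a", "node-d", "node-f"},
	}
	fmt.Printf("%+v\n", meta)
}
```

Keeping placement alongside the rest of the record is one reason ACID transactions matter here: object creation, ACL changes, and replica locations can commit together or not at all.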
Challenges Encountered
Split-Brain Prevention
Network partitions can cause split-brain scenarios, where isolated groups of nodes each believe they should keep accepting writes. Our solution uses a quorum-based approach where writes require acknowledgment from a majority of the replica set.
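The quorum arithmetic behind this is small; a minimal sketch, assuming a replica set of N = 3 and a write quorum of a strict majority:

```go
// quorum.go - minimal sketch of majority-quorum acknowledgment (N=3, W=2).
package main

import "fmt"

// quorum returns the minimum number of acknowledgments required so that
// two disjoint groups of nodes can never both commit a write.
func quorum(n int) int {
	return n/2 + 1
}

// writeSucceeded reports whether enough replicas acknowledged the write.
func writeSucceeded(acks, n int) bool {
	return acks >= quorum(n)
}

func main() {
	const n = 3
	fmt.Println(quorum(n))            // 2
	fmt.Println(writeSucceeded(1, n)) // false: a minority partition cannot commit
	fmt.Println(writeSucceeded(2, n)) // true
}
```

Because two disjoint groups can never both hold a majority, at most one side of a partition can accept writes.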
Large File Handling
Multipart upload was essential for large files. As sketched after this list, we implemented:
- Part tracking in metadata store
- Parallel part uploads
- Atomic completion with manifest
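A simplified sketch of the part tracking and atomic completion, with hypothetical names (RecordPart, Complete) and in-memory state standing in for the metadata store:

```go
// multipart.go - illustrative multipart upload tracking and completion.
package main

import (
	"fmt"
	"sort"
	"sync"
)

type Part struct {
	Number int
	ETag   string
	Size   int64
}

// Upload tracks the parts of one in-progress multipart upload.
// In the real system this state lives in the metadata store.
type Upload struct {
	mu    sync.Mutex
	parts map[int]Part
}

func NewUpload() *Upload { return &Upload{parts: map[int]Part{}} }

// RecordPart is called after a part's bytes are safely on a data node;
// parts may arrive in parallel and in any order.
func (u *Upload) RecordPart(p Part) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.parts[p.Number] = p
}

// Complete assembles the manifest in part order. Committing the manifest
// as a single metadata write is what makes completion atomic: readers
// either see the whole object or none of it.
func (u *Upload) Complete() ([]Part, error) {
	u.mu.Lock()
	defer u.mu.Unlock()
	manifest := make([]Part, 0, len(u.parts))
	for _, p := range u.parts {
		manifest = append(manifest, p)
	}
	sort.Slice(manifest, func(i, j int) bool {
		return manifest[i].Number < manifest[j].Number
	})
	// verify the part sequence is contiguous before committing
	for i, p := range manifest {
		if p.Number != i+1 {
			return nil, fmt.Errorf("missing part %d", i+1)
		}
	}
	return manifest, nil
}

func main() {
	u := NewUpload()
	u.RecordPart(Part{Number: 2, ETag: "b", Size: 5 << 20})
	u.RecordPart(Part{Number: 1, ETag: "a", Size: 5 << 20})
	manifest, err := u.Complete()
	fmt.Println(manifest, err)
}
```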
Performance Tuning
Key optimizations that improved throughput:
- Connection pooling: Reuse TCP connections to nodes
- Buffer sizing: 64KB buffers for network I/O (see the sketch after this list)
- Async I/O: Non-blocking disk operations
- Redis caching: Hot metadata in memory
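As one example, the 64KB buffer choice boils down to a fixed-size buffered copy on the streaming path. A minimal sketch, with streamObject as a hypothetical name:

```go
// buffers.go - sketch of a 64KB buffered copy for network I/O.
package main

import (
	"bytes"
	"fmt"
	"io"
)

// streamObject copies object bytes with a fixed 64KB buffer: large
// enough to keep per-syscall overhead low, small enough to stay cheap
// under many concurrent streams.
func streamObject(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, 64*1024)
	return io.CopyBuffer(dst, src, buf)
}

func main() {
	var out bytes.Buffer
	n, err := streamObject(&out, bytes.NewReader(make([]byte, 1<<20)))
	fmt.Println(n, err) // 1048576 <nil>
}
```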
The final system achieves 1,200 req/s for small files and 600 MB/s for large file streaming.