Building a Real-Time Collaboration Platform
Real-time collaboration has become a fundamental requirement for modern web applications. Whether it's Google Docs, Figma, or Notion, users expect seamless, instantaneous updates across multiple clients. In this story, we'll dive deep into how we built a scalable real-time collaboration platform from scratch.
The Challenge
Our client needed a collaborative document editing platform that could:
- Support 10,000+ concurrent users
- Deliver updates with <100ms latency
- Handle network failures gracefully
- Scale horizontally across multiple servers
Technical Requirements
We had to keep every client consistent despite network partitions, conflicting edits, and connection quality that varied widely from client to client.
Architecture Overview
We chose WebSockets for bidirectional communication and Redis Pub/Sub for message broadcasting across server instances.
// WebSocket server setup with Redis adapter
import { createServer } from 'http';
import { Server } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

// Assumed here to be a plain Node HTTP server; the original setup of `server` is not shown
const server = createServer();
const io = new Server(server, {
  cors: { origin: process.env.CLIENT_URL }
});

// Redis setup for scaling across multiple instances:
// one connection publishes, a duplicate connection subscribes
const pubClient = createClient({ url: process.env.REDIS_URL });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);
io.adapter(createAdapter(pubClient, subClient));

// Connection handling
io.on('connection', (socket) => {
  console.log('Client connected:', socket.id);

  // Each document maps to its own room
  socket.on('document:join', (documentId) => {
    socket.join(`doc:${documentId}`);
  });

  // Relay the edit to everyone else in the room (the sender is excluded)
  socket.on('document:edit', (data) => {
    socket.to(`doc:${data.documentId}`).emit('document:update', data);
  });
});
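To make the role of Pub/Sub concrete, here is a toy in-memory model of the fan-out. It is a sketch of the idea only, not how the Redis adapter is implemented: `Bus` stands in for Redis Pub/Sub and `Instance` for one server process, and both names (along with the `delivered` log) are hypothetical.

```typescript
// Toy model: a shared bus stands in for Redis Pub/Sub (hypothetical names).
type Message = { room: string; payload: string; origin: string };

class Bus {
  private subscribers: Array<(msg: Message) => void> = [];

  subscribe(handler: (msg: Message) => void): void {
    this.subscribers.push(handler);
  }

  publish(msg: Message): void {
    this.subscribers.forEach((handler) => handler(msg));
  }
}

class Instance {
  // room name -> ids of clients connected to THIS instance
  private rooms = new Map<string, Set<string>>();
  // log of "clientId:payload" deliveries made by this instance
  readonly delivered: string[] = [];

  constructor(private bus: Bus) {
    // Every instance hears every published message and delivers it
    // only to its own local members of the room.
    bus.subscribe((msg) => this.deliverLocally(msg));
  }

  join(clientId: string, room: string): void {
    const members = this.rooms.get(room) ?? new Set<string>();
    members.add(clientId);
    this.rooms.set(room, members);
  }

  // A client on this instance sends an edit; the bus carries it everywhere.
  edit(senderId: string, room: string, payload: string): void {
    this.bus.publish({ room, payload, origin: senderId });
  }

  private deliverLocally(msg: Message): void {
    const members = this.rooms.get(msg.room);
    if (!members) return;
    members.forEach((clientId) => {
      // Like socket.to(room), skip the client that sent the edit.
      if (clientId !== msg.origin) this.delivered.push(`${clientId}:${msg.payload}`);
    });
  }
}

const bus = new Bus();
const serverA = new Instance(bus);
const serverB = new Instance(bus);
serverA.join('alice', 'doc:1');
serverB.join('bob', 'doc:1');
serverA.edit('alice', 'doc:1', 'hello');
```

Alice and Bob are connected to different instances, yet Bob still receives the edit because both instances subscribe to the same bus. Without the shared bus, `serverA` would only ever reach its own clients.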
Operational Transforms
For conflict resolution, we implemented Operational Transformation (OT), a technique that dates back to late-1980s groupware research and was later popularized by Google Docs.
Tip
OT allows multiple users to edit the same document simultaneously by transforming operations based on concurrent edits. This ensures eventual consistency without locking.
The core concept is simple but powerful:
- Each edit operation has a position and content
- When operations conflict, they're transformed to account for other edits
- All clients eventually converge to the same state
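// Operation shape inferred from the fields the function uses
type Operation =
  | { type: 'insert'; position: number; content: string }
  | { type: 'delete'; position: number; length: number };

// Returns op2 adjusted so that it can be applied after op1
function transform(op1: Operation, op2: Operation): Operation {
  // If op1 inserts at or before op2, shift op2's position forward
  if (op1.type === 'insert' && op2.position >= op1.position) {
    return {
      ...op2,
      position: op2.position + op1.content.length
    };
  }
  // If op1 deletes before op2, shift op2's position back,
  // clamping it to the start of the deleted range
  if (op1.type === 'delete' && op2.position > op1.position) {
    return {
      ...op2,
      position: Math.max(op1.position, op2.position - op1.length)
    };
  }
  return op2;
}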
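As a quick check of the convergence property, the sketch below applies two concurrent inserts in opposite orders and compares the results. It carries its own compact copy of `transform` so that it runs standalone; the `Operation` shape is inferred from the fields the function uses, and `applyOp` is a hypothetical helper.

```typescript
type Operation =
  | { type: 'insert'; position: number; content: string }
  | { type: 'delete'; position: number; length: number };

// Same rules as the transform above: returns op2 adjusted to apply after op1.
function transform(op1: Operation, op2: Operation): Operation {
  if (op1.type === 'insert' && op2.position >= op1.position) {
    return { ...op2, position: op2.position + op1.content.length };
  }
  if (op1.type === 'delete' && op2.position > op1.position) {
    return { ...op2, position: Math.max(op1.position, op2.position - op1.length) };
  }
  return op2;
}

// Hypothetical helper: applies one operation to a string.
function applyOp(text: string, op: Operation): string {
  if (op.type === 'insert') {
    return text.slice(0, op.position) + op.content + text.slice(op.position);
  }
  return text.slice(0, op.position) + text.slice(op.position + op.length);
}

const base = 'Hello world';
const aliceOp: Operation = { type: 'insert', position: 0, content: '>> ' };
const bobOp: Operation = { type: 'insert', position: 5, content: ',' };

// Replica 1 applies Alice's edit first, then Bob's edit transformed against it.
const replica1 = applyOp(applyOp(base, aliceOp), transform(aliceOp, bobOp));
// Replica 2 applies Bob's edit first, then Alice's edit transformed against it.
const replica2 = applyOp(applyOp(base, bobOp), transform(bobOp, aliceOp));

console.log(replica1 === replica2, replica1); // true '>> Hello, world'
```

This simplified transform has limits worth knowing. Two inserts at the same position are both shifted by the `>=` check, so the replicas end up with the inserted text in different orders; a production version needs a deterministic tie-break, such as comparing client IDs. Overlapping deletes are not handled either.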
Performance Optimization
We implemented several optimizations to hit our <100ms latency target:
1. Connection Pooling
Reusing database connections reduced overhead by 40%.
2. Message Batching
Instead of broadcasting every keystroke, we batched operations every 50ms:
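class OperationBatcher {
  private batch: Operation[] = [];
  private timer: NodeJS.Timeout | null = null;

  // Assumption: the broadcast function is injected; the original listing
  // called this.broadcast without showing where it came from
  constructor(private broadcast: (ops: Operation[]) => void) {}

  add(operation: Operation) {
    this.batch.push(operation);
    // The first operation of a batch starts the 50ms window
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), 50);
    }
  }

  private flush() {
    if (this.batch.length > 0) {
      this.broadcast(this.batch);
      this.batch = [];
    }
    this.timer = null;
  }
}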
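One way to wire a batcher like this to a transport, sketched under assumptions: the `send` callback is injected through the constructor, and a public `flushNow` method lets a caller drain the batch early, for example just before a socket closes. `Batcher`, `send`, and `flushNow` are hypothetical names, not part of the listing above.

```typescript
// Generic variant of the batcher; `send` and `flushNow` are hypothetical additions.
class Batcher<T> {
  private batch: T[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private send: (items: T[]) => void, // called once per window with all queued items
    private windowMs = 50               // the 50ms batching window from the article
  ) {}

  add(item: T): void {
    this.batch.push(item);
    // The first item of a window starts the timer; later items just queue up.
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flushNow(), this.windowMs);
    }
  }

  // Sends whatever is queued immediately and cancels the pending timer.
  flushNow(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.batch.length > 0) {
      const items = this.batch;
      this.batch = [];
      this.send(items);
    }
  }
}

// Usage: three keystrokes inside one window become a single message.
const sent: string[][] = [];
const batcher = new Batcher<string>((items) => sent.push(items));
batcher.add('H');
batcher.add('i');
batcher.add('!');
batcher.flushNow(); // e.g. called just before the socket closes
```

The trade-off is that every operation can wait up to one full window before it is sent, so the window length counts directly against the latency budget.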
3. Delta Compression
We used binary delta compression to reduce message size by 60% on average.
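The delta format itself is not shown here. As a simplified, text-only stand-in for the idea, the sketch below encodes a change as the span that differs between two versions, found by trimming their common prefix and suffix. `Delta`, `makeDelta`, and `applyDelta` are hypothetical names, and a real binary delta format would be considerably more involved.

```typescript
// A delta says: starting at `start`, drop `removed` characters and insert `inserted`.
type Delta = { start: number; removed: number; inserted: string };

function makeDelta(before: string, after: string): Delta {
  // Length of the common prefix
  let start = 0;
  const maxPrefix = Math.min(before.length, after.length);
  while (start < maxPrefix && before[start] === after[start]) start++;

  // Length of the common suffix, never overlapping the prefix
  let suffix = 0;
  const maxSuffix = maxPrefix - start;
  while (
    suffix < maxSuffix &&
    before[before.length - 1 - suffix] === after[after.length - 1 - suffix]
  ) {
    suffix++;
  }

  return {
    start,
    removed: before.length - start - suffix,
    inserted: after.slice(start, after.length - suffix),
  };
}

function applyDelta(before: string, delta: Delta): string {
  return (
    before.slice(0, delta.start) +
    delta.inserted +
    before.slice(delta.start + delta.removed)
  );
}

const oldText = 'The quick brown fox';
const newText = 'The quick red fox';
const delta = makeDelta(oldText, newText);
const rebuilt = applyDelta(oldText, delta);

console.log(delta);   // { start: 10, removed: 5, inserted: 'red' }
console.log(rebuilt); // The quick red fox
```

Only the three characters of `'red'` plus two numbers cross the wire instead of the full 17-character string, and the saving grows with document size.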
Warning
Be careful with aggressive compression – it can increase CPU usage. We found that gzip level 6 was the sweet spot for our use case.
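For reference, Node's built-in `zlib` module takes the gzip level as an option. The snippet below is a minimal round trip at level 6; the payload is made up for illustration, and the real size and CPU trade-off depends on the data, so it is worth measuring on actual traffic.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Made-up payload: a batch of similar operations, which compresses well.
const ops = Array.from({ length: 200 }, (_, i) => ({
  type: 'insert',
  position: i,
  content: 'x',
}));
const raw = Buffer.from(JSON.stringify(ops), 'utf8');

// Level 6 is the setting mentioned above; levels run from 0 (none) to 9 (smallest output).
const compressed = gzipSync(raw, { level: 6 });
const restored = gunzipSync(compressed).toString('utf8');

console.log(`raw: ${raw.length} bytes, gzip level 6: ${compressed.length} bytes`);
```

The synchronous calls keep the example short, but they block the event loop while they run. A busy server would more likely use the asynchronous `gzip` variant or compress at the transport layer.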
Results
After 3 months of development and optimization:
- ✅ Supporting 12,000+ concurrent users
- ✅ Average latency: 73ms (27% under target)
- ✅ 99.9% uptime over 6 months
- ✅ Successfully scaled to 5 geographic regions
Key Takeaways
- Choose the right consistency model: OT worked great for text, but CRDTs might be better for other use cases
- Optimize the critical path: 80% of our performance gains came from 3 key optimizations
- Monitor everything: We instrumented every component with metrics from day one
- Plan for failure: Network issues are inevitable – graceful degradation is key
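The recovery mechanism is not described here. One common pattern for graceful degradation, shown purely as an assumption, is to buffer edits while the client is offline and retry the connection with capped exponential backoff. All names and the 500ms and 30s limits below are hypothetical.

```typescript
// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Buffers outgoing items while the connection is down.
class OfflineQueue<T> {
  private pending: T[] = [];
  private online = true;

  constructor(private send: (item: T) => void) {}

  setOnline(online: boolean): void {
    this.online = online;
    if (online) {
      // Replay buffered items in the order they were made.
      const items = this.pending;
      this.pending = [];
      items.forEach((item) => this.send(item));
    }
  }

  push(item: T): void {
    if (this.online) this.send(item);
    else this.pending.push(item);
  }
}

// Usage: edits made while offline are delivered once the connection returns.
const delivered: string[] = [];
const queue = new OfflineQueue<string>((item) => delivered.push(item));
queue.setOnline(false);
queue.push('edit-1');
queue.push('edit-2');
queue.setOnline(true);
```

In an OT system, replayed operations would still have to be transformed against whatever the server accepted in the meantime, so a queue like this is only the first half of reconnect handling.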
Danger
Never trust the client! Always validate and sanitize operations on the server. We caught several attempts to inject malicious operations during beta testing.
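A minimal sketch of what server-side validation could look like, assuming the insert/delete operation shape used by the transform function and a hypothetical `MAX_INSERT_LENGTH` limit. The actual checks used on the platform are not listed, so treat this as one possible starting point.

```typescript
type Operation =
  | { type: 'insert'; position: number; content: string }
  | { type: 'delete'; position: number; length: number };

const MAX_INSERT_LENGTH = 10000; // hypothetical per-operation limit

// Returns a well-typed Operation, or null if the payload must be rejected.
function validateOperation(input: unknown, docLength: number): Operation | null {
  if (typeof input !== 'object' || input === null) return null;
  const op = input as Record<string, unknown>;

  // Position must be an integer inside the current document.
  const position = op.position;
  if (typeof position !== 'number' || !Number.isInteger(position)) return null;
  if (position < 0 || position > docLength) return null;

  if (op.type === 'insert') {
    const content = op.content;
    if (typeof content !== 'string') return null;
    if (content.length === 0 || content.length > MAX_INSERT_LENGTH) return null;
    // Rebuild the object so unexpected extra fields are dropped.
    return { type: 'insert', position, content };
  }

  if (op.type === 'delete') {
    const count = op.length;
    if (typeof count !== 'number' || !Number.isInteger(count)) return null;
    // The deleted range must stay within the document.
    if (count <= 0 || position + count > docLength) return null;
    return { type: 'delete', position, length: count };
  }

  return null; // unknown operation type
}

// Usage: a payload with an extra field is accepted but stripped down.
const cleaned = validateOperation(
  { type: 'insert', position: 2, content: 'hi', evil: true },
  5
);
```

In the server listing, a check like this would run inside the `document:edit` handler before the operation is relayed to the room. Authorization, meaning whether this user may edit this document at all, is a separate check.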
What's Next?
We're exploring CRDTs (Conflict-free Replicated Data Types) for our next iteration, which could simplify our architecture and improve offline support.
Have questions about real-time architecture? Want to discuss scaling challenges? Reach out to us on GitHub or Twitter.