BackendDevTime Team

Building a Real-Time Collaboration Platform

How we architected a scalable WebSocket solution handling 10k+ concurrent users with minimal latency.

#websockets #scalability #nodejs #redis


Real-time collaboration has become a fundamental requirement for modern web applications. Whether it's Google Docs, Figma, or Notion, users expect seamless, instantaneous updates across multiple clients. In this story, we'll dive deep into how we built a scalable real-time collaboration platform from scratch.

The Challenge

Our client needed a collaborative document editing platform that could:

  • Support 10,000+ concurrent users
  • Deliver updates with <100ms latency
  • Handle network failures gracefully
  • Scale horizontally across multiple servers

Technical Requirements

We had to maintain consistency across all clients while dealing with network partitions, conflicting edits, and varying client connection qualities.

Architecture Overview

We chose WebSockets for bidirectional communication and Redis Pub/Sub for message broadcasting across server instances.

```typescript
// WebSocket server setup with Redis adapter
import { createServer } from 'http';
import { Server } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

const server = createServer();
const io = new Server(server, {
  cors: { origin: process.env.CLIENT_URL }
});

// Redis setup for scaling across multiple instances
const pubClient = createClient({ url: process.env.REDIS_URL });
const subClient = pubClient.duplicate();

await Promise.all([pubClient.connect(), subClient.connect()]);

io.adapter(createAdapter(pubClient, subClient));

// Connection handling
io.on('connection', (socket) => {
  console.log('Client connected:', socket.id);

  socket.on('document:join', (documentId: string) => {
    socket.join(`doc:${documentId}`);
  });

  socket.on('document:edit', (data: { documentId: string }) => {
    // Broadcast to everyone else in the document's room
    socket.to(`doc:${data.documentId}`).emit('document:update', data);
  });
});
```

Operational Transforms

For conflict resolution, we implemented Operational Transformation (OT), a technique popularized by Google Docs.

Tip

OT allows multiple users to edit the same document simultaneously by transforming operations based on concurrent edits. This ensures eventual consistency without locking.

The core concept is simple but powerful:

  1. Each edit operation has a position and content
  2. When operations conflict, they're transformed to account for other edits
  3. All clients eventually converge to the same state
```typescript
interface Operation {
  type: 'insert' | 'delete';
  position: number;
  content: string; // inserted text (empty for deletes)
  length: number;  // number of characters removed (0 for inserts)
}

// Transform op2 so it can be applied after op1 has already been applied
function transform(op1: Operation, op2: Operation): Operation {
  // If op1 inserts at or before op2's position, shift op2 right
  if (op1.type === 'insert' && op2.position >= op1.position) {
    return {
      ...op2,
      position: op2.position + op1.content.length
    };
  }

  // If op1 deletes before op2's position, shift op2 left
  if (op1.type === 'delete' && op2.position > op1.position) {
    return {
      ...op2,
      position: Math.max(op1.position, op2.position - op1.length)
    };
  }

  return op2;
}
```
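To see convergence in action, here is a small self-contained sketch: two clients apply the same pair of concurrent inserts in opposite orders and still end up with identical text. The `apply` helper and the sample edits are illustrative; the transform rules mirror the logic described above.

```typescript
type Op =
  | { type: 'insert'; position: number; content: string }
  | { type: 'delete'; position: number; length: number };

// Same transform rules as above, written against a tagged-union Op type
function transform(op1: Op, op2: Op): Op {
  if (op1.type === 'insert' && op2.position >= op1.position) {
    return { ...op2, position: op2.position + op1.content.length };
  }
  if (op1.type === 'delete' && op2.position > op1.position) {
    return { ...op2, position: Math.max(op1.position, op2.position - op1.length) };
  }
  return op2;
}

// Illustrative helper: apply a single operation to a document string
function apply(doc: string, op: Op): string {
  if (op.type === 'insert') {
    return doc.slice(0, op.position) + op.content + doc.slice(op.position);
  }
  return doc.slice(0, op.position) + doc.slice(op.position + op.length);
}

// Two concurrent edits against the document "hello"
const opA: Op = { type: 'insert', position: 5, content: ' world' }; // client A
const opB: Op = { type: 'insert', position: 0, content: 'say ' };   // client B

// Client A applies its own edit, then B's edit transformed against it
const docA = apply(apply('hello', opA), transform(opA, opB));
// Client B applies its own edit, then A's edit transformed against it
const docB = apply(apply('hello', opB), transform(opB, opA));

console.log(docA, docB); // both converge to "say hello world"
```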

Performance Optimization

We implemented several optimizations to hit our <100ms latency target:

1. Connection Pooling

Reusing database connections reduced overhead by 40%.
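The idea, stripped to its core, looks something like this. This is a toy generic pool, not our actual driver configuration (real drivers such as `pg.Pool` also handle waiting, timeouts, and health checks); all names here are illustrative.

```typescript
// Toy connection pool: reuse released connections instead of opening new ones.
class SimplePool<T> {
  private idle: T[] = [];
  private created = 0;

  constructor(private factory: () => T, private max: number) {}

  acquire(): T {
    const reused = this.idle.pop();
    if (reused !== undefined) return reused; // reuse beats reconnecting
    if (this.created >= this.max) throw new Error('pool exhausted');
    this.created++;
    return this.factory();
  }

  release(conn: T) {
    this.idle.push(conn);
  }

  get size() {
    return this.created; // total connections ever opened
  }
}

const pool = new SimplePool(() => ({ id: Math.random() }), 10);
const a = pool.acquire();
pool.release(a);
const b = pool.acquire(); // same underlying connection, no new handshake
```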

2. Message Batching

Instead of broadcasting every keystroke, we batched operations every 50ms:

```typescript
class OperationBatcher {
  private batch: Operation[] = [];
  private timer: NodeJS.Timeout | null = null;

  constructor(private broadcast: (ops: Operation[]) => void) {}

  add(operation: Operation) {
    this.batch.push(operation);

    // Start a flush window on the first operation of a batch
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), 50);
    }
  }

  private flush() {
    if (this.batch.length > 0) {
      this.broadcast(this.batch);
      this.batch = [];
    }
    this.timer = null;
  }
}
```
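Wiring a batcher up looks like this. To keep the behavior easy to trace, this sketch uses an explicit `flush()` instead of the 50ms timer; the accumulate-then-flush shape is the same.

```typescript
type Op = { type: 'insert' | 'delete'; position: number };

// Simplified batcher variant: explicit flush() in place of the timer
class ManualBatcher {
  private batch: Op[] = [];

  constructor(private broadcast: (ops: Op[]) => void) {}

  add(op: Op) {
    this.batch.push(op);
  }

  flush() {
    if (this.batch.length > 0) {
      this.broadcast(this.batch);
      this.batch = [];
    }
  }
}

const sent: Op[][] = [];
const batcher = new ManualBatcher((ops) => sent.push(ops));

batcher.add({ type: 'insert', position: 0 });
batcher.add({ type: 'insert', position: 1 });
batcher.add({ type: 'delete', position: 0 });
batcher.flush();

console.log(sent.length); // one broadcast carrying three operations
```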

3. Delta Compression

We used binary delta compression to reduce message size by 60% on average.

Warning

Be careful with aggressive compression – it can increase CPU usage. We found that gzip level 6 was the sweet spot for our use case.
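For reference, tuning the compression level in Node looks like this. The payload here is a made-up stand-in for a batched operation list; the 60% figure above came from our real traffic, not this example.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Hypothetical batched-operations payload (real batches are operation arrays)
const payload = Buffer.from(
  JSON.stringify(
    Array.from({ length: 100 }, (_, i) => ({
      type: 'insert',
      position: i,
      content: 'x'
    }))
  )
);

// Levels run 1 (fastest) to 9 (smallest); 6 balanced size vs. CPU for us
const compressed = gzipSync(payload, { level: 6 });
const roundTrip = gunzipSync(compressed);

console.log(payload.length, compressed.length); // compressed is far smaller
```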

Results

After 3 months of development and optimization:

  • ✅ Supporting 12,000+ concurrent users
  • ✅ Average latency: 73ms (27% under target)
  • ✅ 99.9% uptime over 6 months
  • ✅ Successfully scaled to 5 geographic regions

Key Takeaways

  1. Choose the right data structure: OT worked great for text, but CRDT might be better for other use cases
  2. Optimize the critical path: 80% of our performance gains came from 3 key optimizations
  3. Monitor everything: We instrumented every component with metrics from day one
  4. Plan for failure: Network issues are inevitable – graceful degradation is key

Danger

Never trust the client! Always validate and sanitize operations on the server. We caught several attempts to inject malicious operations during beta testing.
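A minimal sketch of what server-side validation can look like. The specific rules and limits here are illustrative, not our exact policy:

```typescript
const MAX_CONTENT_LENGTH = 10_000; // illustrative limit

// Reject operations that are malformed or out of bounds for the current doc
function validateOperation(op: any, docLength: number): boolean {
  if (op === null || typeof op !== 'object') return false;
  if (op.type !== 'insert' && op.type !== 'delete') return false;
  if (!Number.isInteger(op.position) || op.position < 0 || op.position > docLength) {
    return false;
  }
  if (op.type === 'insert' &&
      (typeof op.content !== 'string' || op.content.length > MAX_CONTENT_LENGTH)) {
    return false;
  }
  if (op.type === 'delete' &&
      (!Number.isInteger(op.length) || op.length < 1 ||
       op.position + op.length > docLength)) {
    return false;
  }
  return true;
}

// Against a 5-character document:
validateOperation({ type: 'insert', position: 3, content: 'hi' }, 5);  // accepted
validateOperation({ type: 'delete', position: 4, length: 5 }, 5);      // rejected: deletes past end
validateOperation({ type: 'insert', position: -1, content: '' }, 5);   // rejected: negative position
```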

What's Next?

We're exploring CRDTs (Conflict-free Replicated Data Types) for our next iteration, which could simplify our architecture and improve offline support.


Have questions about real-time architecture? Want to discuss scaling challenges? Reach out to us on GitHub or Twitter.
