BackendDevTime Team

Building a Real-Time Collaboration Platform

How we architected a scalable WebSocket solution handling 10k+ concurrent users with minimal latency.

#websockets #scalability #nodejs #redis


Real-time collaboration has become a fundamental requirement for modern web applications. Whether it's Google Docs, Figma, or Notion, users expect seamless, instantaneous updates across multiple clients. In this story, we'll dive deep into how we built a scalable real-time collaboration platform from scratch.

The Challenge

Our client needed a collaborative document editing platform that could:

  • Support 10,000+ concurrent users
  • Deliver updates with <100ms latency
  • Handle network failures gracefully
  • Scale horizontally across multiple servers

Technical Requirements

We had to maintain consistency across all clients while dealing with network partitions, conflicting edits, and varying client connection qualities.

Architecture Overview

We chose WebSockets for bidirectional communication and Redis Pub/Sub for message broadcasting across server instances.

```typescript
// WebSocket server setup with Redis adapter
import { createServer } from 'http';
import { Server } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

const server = createServer();
const io = new Server(server, {
  cors: { origin: process.env.CLIENT_URL }
});

// Redis setup for scaling across multiple instances
const pubClient = createClient({ url: process.env.REDIS_URL });
const subClient = pubClient.duplicate();

await Promise.all([pubClient.connect(), subClient.connect()]);

io.adapter(createAdapter(pubClient, subClient));

// Connection handling
io.on('connection', (socket) => {
  console.log('Client connected:', socket.id);

  socket.on('document:join', (documentId: string) => {
    socket.join(`doc:${documentId}`);
  });

  socket.on('document:edit', (data: { documentId: string }) => {
    // Broadcast to everyone else in the document's room
    socket.to(`doc:${data.documentId}`).emit('document:update', data);
  });
});
```

Operational Transforms

For conflict resolution, we implemented Operational Transformation (OT), a technique popularized by Google Docs.

Tip

OT allows multiple users to edit the same document simultaneously by transforming operations based on concurrent edits. This ensures eventual consistency without locking.

The core concept is simple but powerful:

  1. Each edit operation has a position and content
  2. When operations conflict, they're transformed to account for other edits
  3. All clients eventually converge to the same state
```typescript
interface Operation {
  type: 'insert' | 'delete';
  position: number;
  content: string; // inserted text (empty for deletes)
  length: number;  // number of characters removed (0 for inserts)
}

// Transform op2 so it can be applied after op1 has already been applied
function transform(op1: Operation, op2: Operation): Operation {
  // If op1 inserts at or before op2's position, shift op2 right
  if (op1.type === 'insert' && op2.position >= op1.position) {
    return {
      ...op2,
      position: op2.position + op1.content.length
    };
  }

  // If op1 deletes before op2's position, shift op2 left
  if (op1.type === 'delete' && op2.position > op1.position) {
    return {
      ...op2,
      position: Math.max(op1.position, op2.position - op1.length)
    };
  }

  return op2;
}
```
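To see convergence in action, here is a small self-contained sketch: two clients apply the same pair of concurrent inserts in opposite orders and still end up with identical text. The `apply` helper and the sample edits are illustrative; the transform rules mirror the logic described above.

```typescript
type Op =
  | { type: 'insert'; position: number; content: string }
  | { type: 'delete'; position: number; length: number };

// Same transform rules as above, written against a tagged-union Op type
function transform(op1: Op, op2: Op): Op {
  if (op1.type === 'insert' && op2.position >= op1.position) {
    return { ...op2, position: op2.position + op1.content.length };
  }
  if (op1.type === 'delete' && op2.position > op1.position) {
    return { ...op2, position: Math.max(op1.position, op2.position - op1.length) };
  }
  return op2;
}

// Illustrative helper: apply a single operation to a document string
function apply(doc: string, op: Op): string {
  if (op.type === 'insert') {
    return doc.slice(0, op.position) + op.content + doc.slice(op.position);
  }
  return doc.slice(0, op.position) + doc.slice(op.position + op.length);
}

// Two concurrent edits against the document "hello"
const opA: Op = { type: 'insert', position: 5, content: ' world' }; // client A
const opB: Op = { type: 'insert', position: 0, content: 'say ' };   // client B

// Client A applies its own edit, then B's edit transformed against it
const docA = apply(apply('hello', opA), transform(opA, opB));
// Client B applies its own edit, then A's edit transformed against it
const docB = apply(apply('hello', opB), transform(opB, opA));

console.log(docA, docB); // both converge to "say hello world"
```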

Performance Optimization

We implemented several optimizations to hit our <100ms latency target:

1. Connection Pooling

Reusing database connections reduced overhead by 40%.
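The idea, stripped to its core, looks something like this. This is a toy generic pool, not our actual driver configuration (real drivers such as `pg.Pool` also handle waiting, timeouts, and health checks); all names here are illustrative.

```typescript
// Toy connection pool: reuse released connections instead of opening new ones.
class SimplePool<T> {
  private idle: T[] = [];
  private created = 0;

  constructor(private factory: () => T, private max: number) {}

  acquire(): T {
    const reused = this.idle.pop();
    if (reused !== undefined) return reused; // reuse beats reconnecting
    if (this.created >= this.max) throw new Error('pool exhausted');
    this.created++;
    return this.factory();
  }

  release(conn: T) {
    this.idle.push(conn);
  }

  get size() {
    return this.created; // total connections ever opened
  }
}

const pool = new SimplePool(() => ({ id: Math.random() }), 10);
const a = pool.acquire();
pool.release(a);
const b = pool.acquire(); // same underlying connection, no new handshake
```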

2. Message Batching

Instead of broadcasting every keystroke, we batched operations every 50ms:

```typescript
class OperationBatcher {
  private batch: Operation[] = [];
  private timer: NodeJS.Timeout | null = null;

  constructor(private broadcast: (ops: Operation[]) => void) {}

  add(operation: Operation) {
    this.batch.push(operation);

    // Start a flush window on the first operation of a batch
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), 50);
    }
  }

  private flush() {
    if (this.batch.length > 0) {
      this.broadcast(this.batch);
      this.batch = [];
    }
    this.timer = null;
  }
}
```
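Wiring a batcher up looks like this. To keep the behavior easy to trace, this sketch uses an explicit `flush()` instead of the 50ms timer; the accumulate-then-flush shape is the same.

```typescript
type Op = { type: 'insert' | 'delete'; position: number };

// Simplified batcher variant: explicit flush() in place of the timer
class ManualBatcher {
  private batch: Op[] = [];

  constructor(private broadcast: (ops: Op[]) => void) {}

  add(op: Op) {
    this.batch.push(op);
  }

  flush() {
    if (this.batch.length > 0) {
      this.broadcast(this.batch);
      this.batch = [];
    }
  }
}

const sent: Op[][] = [];
const batcher = new ManualBatcher((ops) => sent.push(ops));

batcher.add({ type: 'insert', position: 0 });
batcher.add({ type: 'insert', position: 1 });
batcher.add({ type: 'delete', position: 0 });
batcher.flush();

console.log(sent.length); // one broadcast carrying three operations
```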

3. Delta Compression

We used binary delta compression to reduce message size by 60% on average.

Warning

Be careful with aggressive compression – it can increase CPU usage. We found that gzip level 6 was the sweet spot for our use case.
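For reference, tuning the compression level in Node looks like this. The payload here is a made-up stand-in for a batched operation list; the 60% figure above came from our real traffic, not this example.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Hypothetical batched-operations payload (real batches are operation arrays)
const payload = Buffer.from(
  JSON.stringify(
    Array.from({ length: 100 }, (_, i) => ({
      type: 'insert',
      position: i,
      content: 'x'
    }))
  )
);

// Levels run 1 (fastest) to 9 (smallest); 6 balanced size vs. CPU for us
const compressed = gzipSync(payload, { level: 6 });
const roundTrip = gunzipSync(compressed);

console.log(payload.length, compressed.length); // compressed is far smaller
```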

Results

After 3 months of development and optimization:

  • ✅ Supporting 12,000+ concurrent users
  • ✅ Average latency: 73ms (27% under target)
  • ✅ 99.9% uptime over 6 months
  • ✅ Successfully scaled to 5 geographic regions

Key Takeaways

  1. Choose the right data structure: OT worked great for text, but CRDT might be better for other use cases
  2. Optimize the critical path: 80% of our performance gains came from 3 key optimizations
  3. Monitor everything: We instrumented every component with metrics from day one
  4. Plan for failure: Network issues are inevitable – graceful degradation is key

Danger

Never trust the client! Always validate and sanitize operations on the server. We caught several attempts to inject malicious operations during beta testing.
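A minimal sketch of what server-side validation can look like. The specific rules and limits here are illustrative, not our exact policy:

```typescript
const MAX_CONTENT_LENGTH = 10_000; // illustrative limit

// Reject operations that are malformed or out of bounds for the current doc
function validateOperation(op: any, docLength: number): boolean {
  if (op === null || typeof op !== 'object') return false;
  if (op.type !== 'insert' && op.type !== 'delete') return false;
  if (!Number.isInteger(op.position) || op.position < 0 || op.position > docLength) {
    return false;
  }
  if (op.type === 'insert' &&
      (typeof op.content !== 'string' || op.content.length > MAX_CONTENT_LENGTH)) {
    return false;
  }
  if (op.type === 'delete' &&
      (!Number.isInteger(op.length) || op.length < 1 ||
       op.position + op.length > docLength)) {
    return false;
  }
  return true;
}

// Against a 5-character document:
validateOperation({ type: 'insert', position: 3, content: 'hi' }, 5);  // accepted
validateOperation({ type: 'delete', position: 4, length: 5 }, 5);      // rejected: deletes past end
validateOperation({ type: 'insert', position: -1, content: '' }, 5);   // rejected: negative position
```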

What's Next?

We're exploring CRDTs (Conflict-free Replicated Data Types) for our next iteration, which could simplify our architecture and improve offline support.


Have questions about real-time architecture? Want to discuss scaling challenges? Reach out to us on GitHub or Twitter.
