
A Developer's Guide to WebSocket Connection Monitoring

Learn how to monitor WebSocket connections effectively — health checks, reconnection strategies, metrics collection, and debugging persistent connection issues.


HookWatch Team

March 24, 2026

WebSockets give you something HTTP can't: a persistent, bidirectional connection between client and server. That's what makes them perfect for real-time features — chat, live dashboards, collaborative editing, streaming data. But persistent connections come with persistent problems.

An HTTP request either works or it doesn't. You get a response code, measure the latency, and move on. A WebSocket connection, on the other hand, can silently degrade. The connection might technically be open but no longer receiving data. The server might be alive but the connection has entered a half-open state where neither side knows it's broken. A client might reconnect in a tight loop, hammering your server without you realising it.

Monitoring WebSocket connections requires different tools and strategies than monitoring HTTP endpoints. This guide covers what to track, how to detect common failure modes, and how to build a monitoring setup that actually works.

Why WebSocket Monitoring Is Different

HTTP monitoring is stateless. You send a request, get a response, measure the result. Every interaction is independent. If a health check succeeds now, you know the endpoint is working right now.

WebSocket monitoring is stateful. You're tracking connections that persist for minutes, hours, or days. The questions you need to answer are fundamentally different:

HTTP Monitoring | WebSocket Monitoring
Is the endpoint responding? | Are connections staying alive?
What's the response latency? | What's the message delivery latency?
What's the error rate? | What's the reconnection rate?
How many requests per second? | How many concurrent connections?
Is the response correct? | Are messages being received in order?

The Core Metrics

1. Connection Count and Churn

Track the number of active connections over time, along with connection/disconnection rates:

Go
type WebSocketMetrics struct {
    ActiveConnections   int64
    TotalConnections    int64   // cumulative since start
    TotalDisconnections int64
    ConnectionsPerMin   float64
    DisconnectsPerMin   float64
}

What healthy looks like: Active connections should track your user activity patterns — rising during business hours, falling overnight. Connection and disconnection rates should be roughly equal over any 5-minute window.

Warning signs:

  • Active connections dropping suddenly while your user count stays the same — your server or a proxy is terminating connections
  • Disconnect rate significantly exceeding connection rate — clients are being dropped faster than they can reconnect
  • Connection rate spiking without a corresponding user increase — clients are reconnecting in a tight loop (reconnection storm)

2. Message Throughput and Latency

Track messages sent and received, with latency measurement:

Go
// Server-side message tracking
func (ws *WebSocketServer) sendMessage(conn *Connection, msg Message) error {
    start := time.Now()
    err := conn.WriteJSON(msg)
    duration := time.Since(start)

    metrics.RecordMessageSent(duration, err)
    return err
}

For end-to-end latency (how long it takes a message to go from server to client and back), implement a ping measurement:

JavaScript
// Client-side latency measurement
function measureLatency(ws) {
	const start = performance.now();
	ws.send(JSON.stringify({ type: 'ping', timestamp: start }));

	// Server echoes back with type 'pong'
	ws.addEventListener('message', function handler(event) {
		const msg = JSON.parse(event.data);
		if (msg.type === 'pong') {
			const latency = performance.now() - msg.timestamp;
			reportLatency(latency);
			ws.removeEventListener('message', handler);
		}
	});
}

// Measure every 30 seconds
setInterval(() => measureLatency(ws), 30000);

3. Error Classification

Not all disconnections are equal. Classify them:

Go
type DisconnectReason string

const (
    DisconnectNormal      DisconnectReason = "normal_close"      // 1000
    DisconnectGoingAway   DisconnectReason = "going_away"        // 1001 (page nav)
    DisconnectProtocolErr DisconnectReason = "protocol_error"    // 1002
    DisconnectTimeout     DisconnectReason = "ping_timeout"      // no pong received
    DisconnectServerError DisconnectReason = "server_error"      // 1011
    DisconnectNetworkLoss DisconnectReason = "network_loss"      // no close frame
    DisconnectRateLimit   DisconnectReason = "rate_limited"      // too many messages
)

Normal closes (1000, 1001) are expected — users navigate away, tabs close. Protocol errors and timeouts are problems. Network loss (the connection drops without a close frame) is the hardest to detect and the most common cause of "ghost connections."

4. Reconnection Behaviour

Track how often clients reconnect and how quickly:

  • Reconnection rate per client — how many times each client has reconnected in the last hour
  • Time between reconnections — is it getting shorter (reconnection storm) or stable?
  • Reconnection success rate — what percentage of reconnection attempts succeed on the first try?

Detecting Common Failure Modes

Half-Open Connections

The most insidious WebSocket problem. One side thinks the connection is open, the other side has closed it (or the network path between them has broken). This happens when:

  • A mobile device switches from Wi-Fi to cellular
  • An intermediate proxy silently drops the connection
  • The server process crashes without sending a close frame

Detection: Application-Level Heartbeats

The WebSocket protocol has built-in ping/pong frames, but many proxies and load balancers don't forward them. Implement heartbeats at the application level:

Go
// Server-side heartbeat
func (c *Connection) heartbeatLoop() {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            c.lastPingSent = time.Now()
            err := c.conn.WriteJSON(map[string]string{"type": "heartbeat"})
            if err != nil {
                c.handleDisconnect(DisconnectNetworkLoss)
                return
            }

            // If no ack within 10 seconds, the connection is dead.
            // Assumes the read loop sets c.lastPongReceived when a
            // "heartbeat_ack" arrives; in real code, guard these
            // timestamps with a mutex or use atomics.
            time.AfterFunc(10*time.Second, func() {
                if c.lastPongReceived.Before(c.lastPingSent) {
                    c.handleDisconnect(DisconnectTimeout)
                    c.conn.Close()
                }
            })
        case <-c.done:
            return
        }
    }
}
JavaScript
// Client-side heartbeat response
ws.addEventListener('message', (event) => {
	const msg = JSON.parse(event.data);
	if (msg.type === 'heartbeat') {
		ws.send(JSON.stringify({ type: 'heartbeat_ack' }));
	}
});

Reconnection Storms

When your server restarts or a network disruption affects many clients simultaneously, all of them try to reconnect at once. This can overwhelm your server and cause a cascading failure.

Detection: Monitor connection rate. If it exceeds 10x the normal rate within a 1-minute window, you're experiencing a reconnection storm.
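That heuristic can be implemented as a small detector that compares each minute's connection count to a smoothed baseline. A sketch — the `stormDetector` name, smoothing factor, and seeding behaviour are illustrative:

```go
package main

// stormDetector flags minutes where the connection rate exceeds a
// multiple of its exponentially weighted baseline (10x matches the
// heuristic above).
type stormDetector struct {
	baseline  float64 // smoothed connections/min
	alpha     float64 // smoothing factor, e.g. 0.1
	threshold float64 // e.g. 10.0
}

// observe feeds in the connection count for the last minute and reports
// whether it looks like a reconnection storm.
func (d *stormDetector) observe(connsLastMin float64) bool {
	if d.baseline == 0 {
		// seed the baseline with the first sample
		d.baseline = connsLastMin
		return false
	}
	if connsLastMin > d.threshold*d.baseline {
		// don't fold storm samples into the baseline, or the storm
		// would raise its own alarm threshold
		return true
	}
	d.baseline = d.alpha*connsLastMin + (1-d.alpha)*d.baseline
	return false
}
```

Excluding anomalous samples from the baseline is the important design choice here: otherwise a long storm quickly "normalises" itself and the alert clears while the problem persists.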

Prevention: Implement exponential backoff with jitter on the client side:

JavaScript
class ReconnectingWebSocket {
	constructor(url) {
		this.url = url;
		this.reconnectAttempt = 0;
		this.maxDelay = 30000; // 30 seconds
		this.connect();
	}

	connect() {
		this.ws = new WebSocket(this.url);

		this.ws.onopen = () => {
			this.reconnectAttempt = 0; // Reset on success
		};

		this.ws.onclose = (event) => {
			if (event.code !== 1000) {
				// Not a normal close
				this.scheduleReconnect();
			}
		};
	}

	scheduleReconnect() {
		const baseDelay = Math.min(1000 * Math.pow(2, this.reconnectAttempt), this.maxDelay);
		const jitter = baseDelay * 0.5 * Math.random(); // 0-50% jitter
		const delay = Math.min(baseDelay + jitter, this.maxDelay); // keep jitter under the cap

		this.reconnectAttempt++;
		setTimeout(() => this.connect(), delay);
	}
}

Message Ordering and Loss

WebSocket guarantees in-order delivery over a single connection. But when a client reconnects, there's a gap. Messages sent between the old connection closing and the new one opening are lost.

Detection: Include a sequence number in your messages:

Go
type Message struct {
    Sequence int64       `json:"seq"`
    Type     string      `json:"type"`
    Data     interface{} `json:"data"`
}

The client tracks the last sequence number received. On reconnection, it sends this number to the server, which replays any missed messages. If you see clients consistently requesting replays, your reconnection logic has gaps.
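The server side of that replay scheme can be as simple as a bounded, ordered buffer per channel. A sketch under those assumptions — `ReplayBuffer`, its size, and the "needs full resync" flag are illustrative choices, not a prescribed design:

```go
package main

import "sync"

// Message matches the struct above.
type Message struct {
	Sequence int64       `json:"seq"`
	Type     string      `json:"type"`
	Data     interface{} `json:"data"`
}

// ReplayBuffer keeps the last max messages, ordered by Sequence, so a
// reconnecting client can request everything after its last-seen seq.
type ReplayBuffer struct {
	mu   sync.Mutex
	max  int
	msgs []Message
}

func NewReplayBuffer(max int) *ReplayBuffer {
	return &ReplayBuffer{max: max}
}

func (b *ReplayBuffer) Append(m Message) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.msgs = append(b.msgs, m)
	if len(b.msgs) > b.max {
		// drop the oldest messages once over capacity
		b.msgs = b.msgs[len(b.msgs)-b.max:]
	}
}

// Since returns messages with Sequence > lastSeen, plus a flag that is
// false when the buffer no longer covers that point — the client is too
// far behind and needs a full resync instead of a replay.
func (b *ReplayBuffer) Since(lastSeen int64) ([]Message, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.msgs) > 0 && b.msgs[0].Sequence > lastSeen+1 {
		return nil, false
	}
	var out []Message
	for _, m := range b.msgs {
		if m.Sequence > lastSeen {
			out = append(out, m)
		}
	}
	return out, true
}
```

Counting how often `Since` returns the resync flag gives you the "clients consistently requesting replays" metric directly.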

Memory Leaks from Accumulated Connections

Each WebSocket connection holds state on the server — buffers, goroutines, user context. If connections aren't cleaned up properly (e.g., the close handler doesn't fire due to a network issue), memory usage grows over time.

Detection: Correlate your active connection count with your process's memory usage. If memory grows linearly while connection count stays flat, you have a leak.

Go
// Periodic cleanup of stale connections
func (s *WebSocketServer) cleanupStaleConnections() {
    ticker := time.NewTicker(1 * time.Minute)
    for range ticker.C {
        s.mu.Lock()
        for id, conn := range s.connections {
            if time.Since(conn.lastActivity) > 5*time.Minute {
                conn.conn.Close()
                delete(s.connections, id)
                metrics.RecordDisconnect(DisconnectTimeout)
            }
        }
        s.mu.Unlock()
    }
}

Infrastructure Monitoring

Proxy and Load Balancer Configuration

WebSocket connections require special handling at the infrastructure level:

Nginx
# nginx WebSocket proxy configuration
location /ws {
    proxy_pass http://websocket_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    # Critical: set timeouts appropriate for long-lived connections
    proxy_read_timeout 3600s;  # 1 hour
    proxy_send_timeout 3600s;
    proxy_connect_timeout 10s;
}

Common issue: Default proxy timeouts (60 seconds) silently close idle WebSocket connections. If your heartbeat interval is longer than the proxy timeout, connections get dropped between heartbeats.

Monitor: track connection durations. If most connections last exactly 60 seconds (or 300 seconds, or whatever your proxy's default timeout is), your proxy is killing them.

Connection Limits

Operating systems and reverse proxies have connection limits:

Bash
# Check system limits
ulimit -n              # Per-process file descriptor limit
sysctl net.core.somaxconn  # Listen backlog queue size

# Check current connections
ss -s                  # Socket statistics summary

Monitor your connection count against these limits. At 80% capacity, start alerting. At 95%, you're about to start dropping connections.
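A process can check its own descriptor usage against those limits from inside the application. This sketch is Linux-specific (it counts entries in `/proc/self/fd`), and the alert thresholds mirror the 80%/95% guidance above:

```go
package main

import (
	"os"
	"syscall"
)

// fdUsage returns the number of open file descriptors and the soft
// RLIMIT_NOFILE limit for this process (Linux only).
func fdUsage() (open int, limit uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, 0, err
	}
	return len(entries), rl.Cur, nil
}

// fdAlertLevel maps usage to the alerting bands described above.
func fdAlertLevel(open int, limit uint64) string {
	frac := float64(open) / float64(limit)
	switch {
	case frac >= 0.95:
		return "critical"
	case frac >= 0.80:
		return "warning"
	default:
		return "ok"
	}
}
```

Export the open count and limit as gauges alongside your connection metrics so you can see descriptor pressure and connection count on the same graph.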

Building a Monitoring Dashboard

A useful WebSocket monitoring dashboard shows:

Real-time Panel

  • Active connections (current count)
  • Messages per second (in/out)
  • Connection rate (new connections per minute)
  • Disconnection rate (by reason)

Health Indicators

  • Heartbeat success rate (should be >99.9%)
  • Average message latency (p50, p95, p99)
  • Reconnection storm indicator (connection rate vs baseline)

Historical Panel

  • Connection count over time (24h, 7d)
  • Error rate trends
  • Connection duration distribution (are connections lasting as long as expected?)
  • Memory usage correlated with connection count

Integrating With Your Monitoring Stack

WebSocket metrics fit naturally into the same observability pipeline as your HTTP metrics. Export them via Prometheus, StatsD, or whatever your stack uses:

Go
var (
    wsConnections = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "websocket_active_connections",
        Help: "Number of active WebSocket connections",
    })
    wsMessageLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "websocket_message_latency_seconds",
        Help:    "Message delivery latency",
        Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0},
    })
    wsDisconnects = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "websocket_disconnections_total",
        Help: "Total disconnections by reason",
    }, []string{"reason"})
)

func init() {
    // Collectors must be registered before they appear on /metrics
    prometheus.MustRegister(wsConnections, wsMessageLatency, wsDisconnects)
}

For teams using [HookWatch](https://hookwatch.dev), the WebSocket service provides built-in connection monitoring and real-time event streaming — the same infrastructure that powers webhook delivery monitoring also tracks WebSocket connection health, reconnection patterns, and message delivery across your endpoints.

Conclusion

WebSocket monitoring is fundamentally about tracking the health of persistent connections over time. Unlike HTTP monitoring, where each request is independent, WebSocket monitoring requires understanding connection lifecycles, detecting silent failures, and preventing cascading disconnection events.

The key takeaways:

  1. Implement application-level heartbeats — don't rely on TCP or WebSocket protocol-level pings
  2. Track reconnection patterns — they reveal client-side and infrastructure problems before they become outages
  3. Classify disconnection reasons — not all disconnects are errors, and knowing the difference prevents alert fatigue
  4. Monitor infrastructure limits — file descriptors, proxy timeouts, and connection caps are the most common causes of WebSocket issues at scale
  5. Correlate connection count with resource usage — memory leaks from unclean disconnections are common and hard to catch without correlation

WebSocket connections are powerful, but they require active monitoring to stay reliable. The investment in proper observability pays off every time you catch a reconnection storm or a half-open connection leak before your users notice.

Tags: websocket, monitoring, real-time, debugging, reliability
