A Developer's Guide to WebSocket Connection Monitoring
Learn how to monitor WebSocket connections effectively — health checks, reconnection strategies, metrics collection, and debugging persistent connection issues.
HookWatch Team
March 24, 2026
WebSockets give you something HTTP can't: a persistent, bidirectional connection between client and server. That's what makes them perfect for real-time features — chat, live dashboards, collaborative editing, streaming data. But persistent connections come with persistent problems.
An HTTP request either works or it doesn't. You get a response code, measure the latency, and move on. A WebSocket connection, on the other hand, can silently degrade. The connection might technically be open but no longer receiving data. The server might be alive but the connection has entered a half-open state where neither side knows it's broken. A client might reconnect in a tight loop, hammering your server without you realising it.
Monitoring WebSocket connections requires different tools and strategies than monitoring HTTP endpoints. This guide covers what to track, how to detect common failure modes, and how to build a monitoring setup that actually works.
Why WebSocket Monitoring Is Different
HTTP monitoring is stateless. You send a request, get a response, measure the result. Every interaction is independent. If a health check succeeds now, you know the endpoint is working right now.
WebSocket monitoring is stateful. You're tracking connections that persist for minutes, hours, or days. The questions you need to answer are fundamentally different:
| HTTP Monitoring | WebSocket Monitoring |
|---|---|
| Is the endpoint responding? | Are connections staying alive? |
| What's the response latency? | What's the message delivery latency? |
| What's the error rate? | What's the reconnection rate? |
| How many requests per second? | How many concurrent connections? |
| Is the response correct? | Are messages being received in order? |
The Core Metrics
1. Connection Count and Churn
Track the number of active connections over time, along with connection/disconnection rates:
type WebSocketMetrics struct {
ActiveConnections int64
TotalConnections int64 // cumulative since start
TotalDisconnections int64
ConnectionsPerMin float64
DisconnectsPerMin float64
}
What healthy looks like: Active connections should track your user activity patterns — rising during business hours, falling overnight. Once traffic is steady, connection and disconnection rates should roughly balance over any 5-minute window.
Warning signs:
- Active connections dropping suddenly while your user count stays the same — your server or a proxy is terminating connections
- Disconnect rate significantly exceeding connection rate — clients are being dropped faster than they can reconnect
- Connection rate spiking without a corresponding user increase — clients are reconnecting in a tight loop (reconnection storm)
2. Message Throughput and Latency
Track messages sent and received, with latency measurement:
// Server-side message tracking
func (ws *WebSocketServer) sendMessage(conn *Connection, msg Message) error {
start := time.Now()
err := conn.WriteJSON(msg)
duration := time.Since(start)
metrics.RecordMessageSent(duration, err)
return err
}
For end-to-end latency (how long it takes a message to travel from client to server and back), implement an application-level ping:
// Client-side latency measurement
function measureLatency(ws) {
  const start = performance.now();
  ws.send(JSON.stringify({ type: 'ping', timestamp: start }));
  // Server echoes back with type 'pong' and the original timestamp
  const handler = (event) => {
    const msg = JSON.parse(event.data);
    // Match the pong to this ping, ignoring unrelated messages
    if (msg.type === 'pong' && msg.timestamp === start) {
      reportLatency(performance.now() - msg.timestamp);
      ws.removeEventListener('message', handler);
    }
  };
  ws.addEventListener('message', handler);
  // Clean up if the pong never arrives, so listeners don't accumulate
  setTimeout(() => ws.removeEventListener('message', handler), 10000);
}
// Measure every 30 seconds
setInterval(() => measureLatency(ws), 30000);
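The client code above assumes the server echoes each ping back as a pong, preserving the client's timestamp so the round trip can be computed against the client's own clock. A sketch of that echo logic, independent of any particular WebSocket library (the `pingMsg` type and `handlePing` helper are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

type pingMsg struct {
	Type      string  `json:"type"`
	Timestamp float64 `json:"timestamp"`
}

// handlePing returns the pong payload to write back, or nil if the
// message is not a ping and should be handled elsewhere.
func handlePing(raw []byte) []byte {
	var msg pingMsg
	if err := json.Unmarshal(raw, &msg); err != nil || msg.Type != "ping" {
		return nil
	}
	// Echo the client's timestamp untouched: the client compares it
	// against its own clock, so no clock synchronisation is needed.
	pong, _ := json.Marshal(pingMsg{Type: "pong", Timestamp: msg.Timestamp})
	return pong
}

func main() {
	out := handlePing([]byte(`{"type":"ping","timestamp":1234.5}`))
	fmt.Println(string(out)) // {"type":"pong","timestamp":1234.5}
}
```

Echoing the timestamp rather than having the server attach its own avoids any dependency on clock synchronisation between client and server.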
3. Error Classification
Not all disconnections are equal. Classify them:
type DisconnectReason string
const (
DisconnectNormal DisconnectReason = "normal_close" // 1000
DisconnectGoingAway DisconnectReason = "going_away" // 1001 (page nav)
DisconnectProtocolErr DisconnectReason = "protocol_error" // 1002
DisconnectTimeout DisconnectReason = "ping_timeout" // no pong received
DisconnectServerError DisconnectReason = "server_error" // 1011
DisconnectNetworkLoss DisconnectReason = "network_loss" // no close frame
DisconnectRateLimit DisconnectReason = "rate_limited" // too many messages
)
Normal closes (1000, 1001) are expected — users navigate away, tabs close. Protocol errors and timeouts are problems. Network loss (the connection drops without a close frame) is the hardest to detect and the most common cause of "ghost connections."
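The code-based part of this classification can be a small mapping function. A sketch reusing the reason values above (`DisconnectUnknown` and the function name are additions for illustration):

```go
package main

import "fmt"

type DisconnectReason string

const (
	DisconnectNormal      DisconnectReason = "normal_close"
	DisconnectGoingAway   DisconnectReason = "going_away"
	DisconnectProtocolErr DisconnectReason = "protocol_error"
	DisconnectServerError DisconnectReason = "server_error"
	DisconnectUnknown     DisconnectReason = "unknown"
)

// classifyCloseCode maps RFC 6455 close codes onto the reasons above.
// Timeouts and network loss never produce a close frame at all, so they
// are detected by the heartbeat machinery, not by this function.
func classifyCloseCode(code int) DisconnectReason {
	switch code {
	case 1000:
		return DisconnectNormal
	case 1001:
		return DisconnectGoingAway
	case 1002:
		return DisconnectProtocolErr
	case 1011:
		return DisconnectServerError
	default:
		return DisconnectUnknown
	}
}

func main() {
	fmt.Println(classifyCloseCode(1001)) // going_away
}
```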
4. Reconnection Behaviour
Track how often clients reconnect and how quickly:
- Reconnection rate per client — how many times each client has reconnected in the last hour
- Time between reconnections — is it getting shorter (reconnection storm) or stable?
- Reconnection success rate — what percentage of reconnection attempts succeed on the first try?
Detecting Common Failure Modes
Half-Open Connections
The most insidious WebSocket problem. One side thinks the connection is open, the other side has closed it (or the network path between them has broken). This happens when:
- A mobile device switches from Wi-Fi to cellular
- An intermediate proxy silently drops the connection
- The server process crashes without sending a close frame
Detection: Application-Level Heartbeats
The WebSocket protocol has built-in ping/pong frames, but the browser WebSocket API doesn't expose them to JavaScript, and intermediaries handle them inconsistently. Implement heartbeats at the application level:
// Server-side heartbeat
func (c *Connection) heartbeatLoop() {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for {
select {
case <-ticker.C:
c.lastPingSent = time.Now()
err := c.conn.WriteJSON(map[string]string{"type": "heartbeat"})
if err != nil {
c.handleDisconnect(DisconnectNetworkLoss)
return
}
// If no pong within 10 seconds, consider the connection dead.
// (lastPingSent/lastPongReceived are read here but written elsewhere,
// so in real code guard them with a mutex or use atomics.)
time.AfterFunc(10*time.Second, func() {
if c.lastPongReceived.Before(c.lastPingSent) {
c.handleDisconnect(DisconnectTimeout)
c.conn.Close()
}
})
case <-c.done:
return
}
}
}
// Client-side heartbeat response
ws.addEventListener('message', (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'heartbeat') {
ws.send(JSON.stringify({ type: 'heartbeat_ack' }));
}
});
Reconnection Storms
When your server restarts or a network disruption affects many clients simultaneously, all of them try to reconnect at once. This can overwhelm your server and cause a cascading failure.
Detection: Monitor connection rate. If it exceeds 10x the normal rate within a 1-minute window, you're experiencing a reconnection storm.
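That rule of thumb can be implemented as a comparison against a smoothed baseline. A sketch using an exponentially weighted moving average (`StormDetector`, `alpha`, and `threshold` are illustrative names and values, not from any library):

```go
package main

import "fmt"

// StormDetector keeps an exponentially weighted moving average of the
// per-minute connection rate and flags samples far above it.
type StormDetector struct {
	baseline  float64 // smoothed connections-per-minute
	alpha     float64 // smoothing factor, e.g. 0.1
	threshold float64 // storm multiplier, e.g. 10 per the rule of thumb above
}

// Observe feeds one minute's connection count and reports whether it
// looks like a reconnection storm. Storm samples are excluded from the
// baseline so a sustained storm doesn't normalise itself away.
func (d *StormDetector) Observe(connectionsPerMin float64) bool {
	if d.baseline > 0 && connectionsPerMin > d.threshold*d.baseline {
		return true
	}
	if d.baseline == 0 {
		d.baseline = connectionsPerMin
	} else {
		d.baseline = d.alpha*connectionsPerMin + (1-d.alpha)*d.baseline
	}
	return false
}

func main() {
	d := &StormDetector{alpha: 0.1, threshold: 10}
	for i := 0; i < 30; i++ {
		d.Observe(100) // steady state: ~100 connections/min
	}
	fmt.Println(d.Observe(2500)) // true
}
```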
Prevention: Implement exponential backoff with jitter on the client side:
class ReconnectingWebSocket {
constructor(url) {
this.url = url;
this.reconnectAttempt = 0;
this.maxDelay = 30000; // 30 seconds
this.connect();
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
this.reconnectAttempt = 0; // Reset on success
};
this.ws.onclose = (event) => {
if (event.code !== 1000) {
// Not a normal close
this.scheduleReconnect();
}
};
}
scheduleReconnect() {
const baseDelay = Math.min(1000 * Math.pow(2, this.reconnectAttempt), this.maxDelay);
const jitter = baseDelay * 0.5 * Math.random(); // 0-50% jitter
const delay = baseDelay + jitter;
this.reconnectAttempt++;
setTimeout(() => this.connect(), delay);
}
}
Message Ordering and Loss
WebSocket guarantees in-order delivery over a single connection. But when a client reconnects, there's a gap. Messages sent between the old connection closing and the new one opening are lost.
Detection: Include a sequence number in your messages:
type Message struct {
Sequence int64 `json:"seq"`
Type string `json:"type"`
Data interface{} `json:"data"`
}
The client tracks the last sequence number received. On reconnection, it sends this number to the server, which replays any missed messages. If you see clients consistently requesting replays, your reconnection logic has gaps.
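On the server side, the replay can come from a bounded buffer of recent messages. A sketch reusing the Message struct above (`ReplayBuffer` and its methods are hypothetical):

```go
package main

import "fmt"

type Message struct {
	Sequence int64       `json:"seq"`
	Type     string      `json:"type"`
	Data     interface{} `json:"data"`
}

// ReplayBuffer keeps the last max messages so a reconnecting client can
// request everything after the last sequence number it saw.
type ReplayBuffer struct {
	max  int
	msgs []Message // ordered by Sequence
}

func (b *ReplayBuffer) Add(m Message) {
	b.msgs = append(b.msgs, m)
	if len(b.msgs) > b.max {
		b.msgs = b.msgs[len(b.msgs)-b.max:]
	}
}

// Since returns the messages after lastSeq. ok is false when the buffer
// no longer reaches back that far, meaning the client needs a full
// resync rather than a replay.
func (b *ReplayBuffer) Since(lastSeq int64) (missed []Message, ok bool) {
	if len(b.msgs) > 0 && b.msgs[0].Sequence > lastSeq+1 {
		return nil, false
	}
	for _, m := range b.msgs {
		if m.Sequence > lastSeq {
			missed = append(missed, m)
		}
	}
	return missed, true
}

func main() {
	b := &ReplayBuffer{max: 100}
	for i := int64(1); i <= 5; i++ {
		b.Add(Message{Sequence: i, Type: "update"})
	}
	missed, ok := b.Since(3)
	fmt.Println(len(missed), ok) // 2 true
}
```

The `ok` flag matters: a bounded buffer will eventually evict messages, and a client that was offline too long should get a full state sync, not a partial replay.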
Memory Leaks from Accumulated Connections
Each WebSocket connection holds state on the server — buffers, goroutines, user context. If connections aren't cleaned up properly (e.g., the close handler doesn't fire due to a network issue), memory usage grows over time.
Detection: Correlate your active connection count with your process's memory usage. If memory grows linearly while connection count stays flat, you have a leak.
// Periodic cleanup of stale connections
func (s *WebSocketServer) cleanupStaleConnections() {
ticker := time.NewTicker(1 * time.Minute)
for range ticker.C {
s.mu.Lock()
for id, conn := range s.connections {
if time.Since(conn.lastActivity) > 5*time.Minute {
conn.conn.Close()
delete(s.connections, id)
metrics.RecordDisconnect(DisconnectTimeout)
}
}
s.mu.Unlock()
}
}
Infrastructure Monitoring
Proxy and Load Balancer Configuration
WebSocket connections require special handling at the infrastructure level:
# nginx WebSocket proxy configuration
location /ws {
proxy_pass http://websocket_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
# Critical: set timeouts appropriate for long-lived connections
proxy_read_timeout 3600s; # 1 hour
proxy_send_timeout 3600s;
proxy_connect_timeout 10s;
}
Common issue: Default proxy timeouts (60 seconds) silently close idle WebSocket connections. If your heartbeat interval is longer than the proxy timeout, connections get dropped between heartbeats.
Monitor: track connection durations. If most connections last exactly 60 seconds (or 300 seconds, or whatever your proxy's default timeout is), your proxy is killing them.
Connection Limits
Operating systems and reverse proxies have connection limits:
# Check system limits
ulimit -n # Per-process file descriptor limit
sysctl net.core.somaxconn # Listen backlog queue size
# Check current connections
ss -s # Socket statistics summary
Monitor your connection count against these limits. At 80% capacity, start alerting. At 95%, you're about to start dropping connections.
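Those thresholds can be checked in-process. A Unix-only sketch that reads the file-descriptor limit via `syscall.Getrlimit` (`capacityStatus` and its labels are illustrative):

```go
package main

import (
	"fmt"
	"syscall"
)

// capacityStatus applies the 80%/95% thresholds above to the active
// connection count versus the file-descriptor limit.
func capacityStatus(active, fdLimit uint64) string {
	ratio := float64(active) / float64(fdLimit)
	switch {
	case ratio >= 0.95:
		return "critical"
	case ratio >= 0.80:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	// The soft RLIMIT_NOFILE value is the ceiling on how many sockets
	// (plus files, pipes, etc.) this process can hold at once.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err == nil {
		fmt.Printf("fd limit: %d, status at 900 connections: %s\n",
			rl.Cur, capacityStatus(900, rl.Cur))
	}
}
```

Note that each WebSocket consumes one descriptor on the server, on top of whatever the process needs for files, logs, and upstream connections, so alerting well below the raw limit is deliberate.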
Building a Monitoring Dashboard
A useful WebSocket monitoring dashboard shows:
Real-time Panel
- Active connections (current count)
- Messages per second (in/out)
- Connection rate (new connections per minute)
- Disconnection rate (by reason)
Health Indicators
- Heartbeat success rate (should be >99.9%)
- Average message latency (p50, p95, p99)
- Reconnection storm indicator (connection rate vs baseline)
Historical Panel
- Connection count over time (24h, 7d)
- Error rate trends
- Connection duration distribution (are connections lasting as long as expected?)
- Memory usage correlated with connection count
Integrating With Your Monitoring Stack
WebSocket metrics fit naturally into the same observability pipeline as your HTTP metrics. Export them via Prometheus, StatsD, or whatever your stack uses:
var (
wsConnections = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "websocket_active_connections",
Help: "Number of active WebSocket connections",
})
wsMessageLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "websocket_message_latency_seconds",
Help: "Message delivery latency",
Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0},
})
wsDisconnects = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "websocket_disconnections_total",
Help: "Total disconnections by reason",
}, []string{"reason"})
)

func init() {
	// Metrics must be registered to appear on the /metrics endpoint
	// (or declare them with promauto to register automatically).
	prometheus.MustRegister(wsConnections, wsMessageLatency, wsDisconnects)
}
For teams using [HookWatch](https://hookwatch.dev), the WebSocket service provides built-in connection monitoring and real-time event streaming — the same infrastructure that powers webhook delivery monitoring also tracks WebSocket connection health, reconnection patterns, and message delivery across your endpoints.
Conclusion
WebSocket monitoring is fundamentally about tracking the health of persistent connections over time. Unlike HTTP monitoring, where each request is independent, WebSocket monitoring requires understanding connection lifecycles, detecting silent failures, and preventing cascading disconnection events.
The key takeaways:
- Implement application-level heartbeats — don't rely on TCP or WebSocket protocol-level pings
- Track reconnection patterns — they reveal client-side and infrastructure problems before they become outages
- Classify disconnection reasons — not all disconnects are errors, and knowing the difference prevents alert fatigue
- Monitor infrastructure limits — file descriptors, proxy timeouts, and connection caps are the most common causes of WebSocket issues at scale
- Correlate connection count with resource usage — memory leaks from unclean disconnections are common and hard to catch without correlation
WebSocket connections are powerful, but they require active monitoring to stay reliable. The investment in proper observability pays off every time you catch a reconnection storm or a half-open connection leak before your users notice.