How to Never Miss a Failed Cron Job Again
Cron jobs fail silently by default. Learn how to add failure alerts, dead man's switches, and monitoring to your scheduled tasks so nothing slips through.
HookWatch Team
March 24, 2026
Somewhere right now, a cron job is failing on a server and nobody knows about it. The backup script that stopped working two weeks ago. The report generator that's been silently timing out since the last deployment. The database cleanup task that ran out of disk space and quit.
Cron is one of the oldest and most reliable tools in the Unix ecosystem. It's also one of the worst at telling you when something goes wrong.
The Fundamental Problem with Cron
Cron's job is simple: run a command at a scheduled time. It does this extremely well. What it doesn't do is care about the outcome.
By default, cron mails any output a job produces — whether the run succeeded or failed — to the local mail spool of the user who owns the crontab. In practice, this means the output goes absolutely nowhere. Most servers don't have a local MTA configured, nobody checks /var/mail/deploy, and even if they did, distinguishing a failed run from a successful one requires parsing unstructured text.
# This cron job will fail silently if the script exits non-zero
0 2 * * * /opt/scripts/nightly-backup.sh
There's no built-in retry. No alerting. No history of past executions. If you want to know whether your cron job ran successfully at 2 AM, you have to go looking.
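Cron's mail behaviour can at least be pointed somewhere useful. If the server has a working MTA, a MAILTO line routes job output to a real mailbox — a stopgap rather than monitoring, but better than the default (the address below is a placeholder):

```
# Route all job output from this crontab to a mailbox someone actually reads
# (requires a working MTA on the server)
MAILTO=ops@example.com
0 2 * * * /opt/scripts/nightly-backup.sh
```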
Why This Matters More Than You Think
Failed cron jobs tend to have a compounding effect:
- Data gaps — a reporting job that fails for three days means your weekly report has three days of missing data, and someone has to manually backfill it
- Resource exhaustion — a cleanup job that stops running leads to disk space or database bloat that eventually takes down production
- Compliance violations — audit log exports, data retention purges, and GDPR deletion tasks have legal deadlines that don't care about your cron problems
- Security drift — certificate renewal, key rotation, and vulnerability scan cron jobs failing silently is how you end up with expired certs in production
The insidious part is the lag between failure and discovery. A nightly backup that fails on Monday might not be discovered until Friday when someone needs to restore data. By then, you've lost a week of backups.
Strategy 1: Wrapper Scripts with Exit Code Handling
The simplest improvement is wrapping your cron commands in a script that checks the exit code and sends a notification on failure.
#!/bin/bash
# /opt/scripts/cron-wrapper.sh
COMMAND="$*"
OUTPUT=$("$@" 2>&1)
EXIT_CODE=$?
if [ "$EXIT_CODE" -ne 0 ]; then
# Truncate and strip quotes/newlines so the output can't break the JSON payload
SNIPPET=$(printf '%s' "$OUTPUT" | head -c 500 | tr -d '"\n\\')
curl -X POST "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
-H "Content-Type: application/json" \
-d "{
\"text\": \"Cron job failed on $(hostname)\",
\"attachments\": [{
\"color\": \"danger\",
\"fields\": [
{\"title\": \"Command\", \"value\": \"$COMMAND\", \"short\": false},
{\"title\": \"Exit Code\", \"value\": \"$EXIT_CODE\", \"short\": true},
{\"title\": \"Output\", \"value\": \"$SNIPPET\", \"short\": false}
]
}]
}"
fi
# Propagate the original exit code so cron still sees the failure
exit "$EXIT_CODE"
Then your crontab becomes:
0 2 * * * /opt/scripts/cron-wrapper.sh /opt/scripts/nightly-backup.sh
Pros: Simple, no external dependencies beyond a webhook URL.
Cons: Only catches non-zero exit codes. If the script hangs forever, the wrapper hangs with it. If the server is down, the cron never fires and no alert is sent. You're also not tracking execution history.
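One more exit-code trap worth knowing before moving on: a shell pipeline reports only the last command's status, so a failing dump command piped into gzip, for example, still looks like success to a wrapper. A minimal demonstration:

```shell
#!/bin/bash
# Without pipefail, only the LAST command in a pipeline sets the exit status
false | cat
plain=$?

# With pipefail, any failing stage propagates to the pipeline's exit status
set -o pipefail
false | cat
strict=$?

echo "plain=$plain strict=$strict"
```

Adding set -o pipefail near the top of cron scripts that use pipelines closes this gap.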
Strategy 2: Dead Man's Switch (Heartbeat Monitoring)
A dead man's switch flips the monitoring model: instead of alerting when something fails, you alert when something doesn't check in.
The concept is straightforward:
- Your cron job pings a monitoring endpoint when it completes successfully
- The monitoring service expects a ping every N minutes/hours
- If the ping doesn't arrive within the expected window, an alert fires
# Cron job with heartbeat
0 2 * * * /opt/scripts/nightly-backup.sh && curl -s https://your-monitor.com/ping/backup-job
This catches several failure modes that wrapper scripts miss:
- The job never starts — server is down, crond isn't running, crontab was accidentally deleted
- The job hangs — it's stuck waiting on a lock, a network call, or a full disk
- The job is slower than expected — it usually takes 10 minutes but today it's taken 3 hours and counting
The && operator ensures the ping only fires on success (exit code 0). If the backup script fails, the curl never executes, and the monitoring service raises an alert when the expected ping doesn't arrive.
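Some heartbeat services also accept an explicit failure ping, which lets the monitor distinguish "the job failed" from "the job never ran". A sketch, assuming hypothetical /ping and /ping/.../fail endpoints on your monitoring service:

```
# Ping the success URL on exit 0, otherwise ping a separate fail URL
0 2 * * * /opt/scripts/nightly-backup.sh && curl -fsS https://your-monitor.com/ping/backup-job || curl -fsS https://your-monitor.com/ping/backup-job/fail
```

Note one quirk of the a && b || c pattern: if the job succeeds but the success curl itself fails, the fail ping fires too — a harmless false alarm, and arguably the behaviour you want when the network is flaky.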
Strategy 3: Structured Logging and Execution Tracking
For teams running more than a handful of cron jobs, you need execution history — not just alerts.
This means logging every execution with:
- Start time and end time (duration)
- Exit code
- Stdout and stderr output (truncated to a reasonable size)
- Which server ran it (critical if you have jobs on multiple machines)
#!/bin/bash
# Enhanced cron wrapper with structured logging (requires jq for safe JSON escaping)
JOB_ID=$(uuidgen)
JOB_NAME="$1"
shift
COMMAND="$*"
START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
OUTPUT=$("$@" 2>&1)
EXIT_CODE=$?
END_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Send structured execution record; jq -Rs escapes arbitrary text into a JSON string
curl -s -X POST "https://your-api.com/cron/executions" \
-H "Content-Type: application/json" \
-d "{
\"id\": \"$JOB_ID\",
\"job\": \"$JOB_NAME\",
\"host\": \"$(hostname)\",
\"command\": $(printf '%s' "$COMMAND" | jq -Rs .),
\"exit_code\": $EXIT_CODE,
\"started_at\": \"$START_TIME\",
\"finished_at\": \"$END_TIME\",
\"output\": $(printf '%s' "$OUTPUT" | head -c 2000 | jq -Rs .)
}"
# Propagate the original exit code
exit "$EXIT_CODE"
Now you have a queryable history. You can answer questions like:
- "When did the backup job last succeed?"
- "How long does the cleanup task typically take? Is it getting slower?"
- "Which server ran the reporting job last night?"
Strategy 4: Centralised Cron Management
The strategies above work well when you have a few jobs on a few servers. But as your infrastructure grows, managing crontabs across dozens of machines becomes its own problem.
Centralised cron management moves the job definitions and execution tracking into a single system:
- Define jobs in one place (not scattered across server crontabs)
- Track execution status and history across all servers
- Get unified alerting — one dashboard for all your scheduled tasks
- Run jobs manually when you need to retry a failure or test a change
This is one of the reasons we built cron monitoring into [HookWatch](https://hookwatch.dev). The CLI agent runs on your servers and pulls job definitions from the cloud. When a job executes, it reports the result back — exit code, output, duration. If a job fails or misses its schedule, you get an alert through Slack, Discord, email, or Telegram. And you can trigger manual re-runs from the dashboard without SSH-ing into the server.
Handling Common Failure Modes
Beyond basic alerting, here are specific failure modes you should design for:
Overlapping Executions
A job scheduled every 5 minutes takes 7 minutes to run. Now you have two instances running simultaneously, potentially corrupting data.
# Use flock to prevent overlapping
* * * * * flock -n /tmp/myworker.lock /opt/scripts/process-queue.sh
The -n flag makes flock fail immediately if the lock is held, rather than waiting. Combine this with your monitoring: if the job is consistently taking longer than its interval, that's a metric worth alerting on.
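The refusal is instant and unambiguous, which you can verify outside cron. A small sketch (the lock file path is illustrative): hold a lock on a file descriptor, then watch a second flock -n on the same file fail immediately:

```shell
#!/bin/bash
# Hold an exclusive lock on fd 9, as a long-running job would
LOCKFILE=/tmp/flock-demo.lock   # illustrative path
exec 9>"$LOCKFILE"
flock -n 9

# A second non-blocking attempt on the same file fails instead of waiting,
# exactly as the overlapping cron run would
if flock -n "$LOCKFILE" -c true; then
  result="acquired"             # would mean overlap was allowed
else
  result="lock busy"            # -n refused immediately
fi
echo "$result"

# Release the lock by closing the descriptor
exec 9>&-
```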
Timeout Protection
A job that hangs indefinitely blocks the next execution and wastes resources.
# Kill the job after 30 minutes
0 * * * * timeout 1800 /opt/scripts/hourly-sync.sh
timeout sends SIGTERM after the specified number of seconds. The job's exit code becomes 124, which your wrapper script can detect and report specifically as a timeout.
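The exit-code convention is easy to check for yourself. A short sketch that deliberately times out a command and branches on code 124:

```shell
#!/bin/bash
# A 5-second sleep killed after 2 seconds: timeout reports exit code 124
timeout 2 sleep 5
code=$?

if [ "$code" -eq 124 ]; then
  status="timed out"
else
  status="exited $code"
fi
echo "$status"
```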
Environment Issues
Cron runs with a minimal environment — no .bashrc, no PATH customisation, no environment variables from your shell. This is the single most common cause of "it works when I run it manually but fails in cron."
# Explicit environment in crontab
PATH=/usr/local/bin:/usr/bin:/bin
DATABASE_URL=postgres://localhost/myapp
0 2 * * * /opt/scripts/nightly-backup.sh
Or source your environment explicitly in the script:
#!/bin/bash
source /opt/scripts/.env
# Now DATABASE_URL and other vars are available
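A useful debugging trick for "works manually, fails in cron": reproduce cron's stripped environment with env -i before blaming the script. This isn't cron itself, just an approximation of its minimal environment:

```shell
#!/bin/bash
# Run a command with an empty environment plus cron's typical bare-bones PATH
cron_path=$(env -i PATH=/usr/bin:/bin /bin/sh -c 'echo "$PATH"')
echo "PATH under cron-like env: $cron_path"
```

If your script breaks under env -i the same way it breaks under cron, you've found an environment dependency rather than a logic bug.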
Daylight Saving Time
Cron uses the system's local time zone. During DST transitions, a job scheduled at 2:30 AM might run twice (when clocks fall back) or not at all (when clocks spring forward). Note that the CRON_TZ variable is supported by cronie and some other modern implementations but not by every cron; if yours lacks it, run the system clock in UTC or avoid scheduling jobs inside the 1–3 AM transition window.
# Use UTC to avoid DST issues
CRON_TZ=UTC
30 2 * * * /opt/scripts/nightly-report.sh
Building an Alert Strategy That Works
Not all cron failures are equal. Your alerting should reflect this:
Critical (immediate page)
- Payment processing jobs
- Security certificate renewal
- Database backups
- Compliance-related tasks
Warning (Slack/email within minutes)
- Report generation
- Cache warming
- Data synchronisation with non-critical services
Informational (dashboard only)
- Analytics aggregation
- Log rotation
- Temporary file cleanup
The worst thing you can do is send every failure to the same channel at the same severity. Alert fatigue kills monitoring faster than no monitoring at all.
A Practical Checklist
If you're starting from zero, here's a prioritised checklist:
- Audit your crontabs — run crontab -l on every server, document what exists and what it does
- Categorise by criticality — separate "the business stops" jobs from "nice to have" jobs
- Add dead man's switches to critical jobs first — even a simple ping to an HTTP endpoint is better than nothing
- Add wrapper scripts for failure notification — Slack webhook, email, whatever your team actually checks
- Implement flock for jobs that must not overlap — this prevents an entire class of subtle bugs
- Add timeout to long-running jobs — a job that hangs for 6 hours is worse than one that fails immediately
- Set up execution history — even if it's just appending to a log file at first, you need history to debug patterns
- Centralise when it makes sense — once you have more than 10 jobs across multiple servers, managing individual crontabs doesn't scale
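The "even just a log file" version of execution history from the checklist can be sketched in a few lines. The paths, the run_logged name, and the CRON_HISTORY_LOG override are all illustrative:

```shell
#!/bin/bash
# Minimal execution history: append one line per run to a log file
LOG="${CRON_HISTORY_LOG:-/tmp/cron-history.log}"   # illustrative default

run_logged() {
  local start code
  start=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  "$@"; code=$?
  # One record per execution: timestamp, host, exit code, command
  echo "$start host=$(hostname) exit=$code cmd=$*" >> "$LOG"
  return "$code"
}

run_logged true
run_logged false
tail -n 2 "$LOG"
```

Even this crude history answers "when did it last succeed?" with a grep, which is the point: start with something greppable, graduate to something queryable.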
Conclusion
Cron is a brilliant piece of engineering that has run the world's scheduled tasks for over 40 years. Its one weakness — silent failure — is entirely solvable with the right monitoring strategy.
The goal isn't to replace cron. It's to add the observability layer that cron was never designed to provide. Whether you use wrapper scripts, dead man's switches, or a dedicated monitoring tool like [HookWatch](https://hookwatch.dev), the important thing is that no scheduled task on your infrastructure should be able to fail without someone knowing about it.
Start with your most critical jobs. Get alerts working. Then expand coverage. Your future self — the one who doesn't get woken up at 3 AM because a backup failed silently for a week — will thank you.