How to Never Miss a Failed Cron Job Again
Cron jobs fail silently by default. Learn how to add failure alerts, dead man's switches, and monitoring to your scheduled tasks so nothing slips through.
HookWatch Team
March 24, 2026
Somewhere right now, a cron job is failing on a server and nobody knows about it. The backup script that stopped working two weeks ago. The report generator that's been silently timing out since the last deployment. The database cleanup task that ran out of disk space and quit.
Cron is one of the oldest and most reliable tools in the Unix ecosystem. It's also one of the worst at telling you when something goes wrong.
The Fundamental Problem with Cron
Cron's job is simple: run a command at a scheduled time. It does this extremely well. What it doesn't do is care about the outcome.
By default, cron mails any output a job produces — whether the run succeeded or failed — to the local mail spool of the user who owns the crontab. In practice, this means the output goes absolutely nowhere. Most servers don't have a local MTA configured, nobody checks /var/mail/deploy, and even if they did, distinguishing a failed run from a successful one requires parsing unstructured text.
# This cron job will fail silently if the script exits non-zero
0 2 * * * /opt/scripts/nightly-backup.sh
There's no built-in retry. No alerting. No history of past executions. If you want to know whether your cron job ran successfully at 2 AM, you have to go looking.
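Cron's mail behaviour can at least be pointed somewhere useful. If the server has a working MTA, a MAILTO line routes job output to a real mailbox — a stopgap rather than monitoring, but better than the default (the address below is a placeholder):

```
# Route all job output from this crontab to a mailbox someone actually reads
# (requires a working MTA on the server)
MAILTO=ops@example.com
0 2 * * * /opt/scripts/nightly-backup.sh
```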
Why This Matters More Than You Think
Failed cron jobs tend to have a compounding effect:
- Data gaps — a reporting job that fails for three days means your weekly report has three days of missing data, and someone has to manually backfill it
- Resource exhaustion — a cleanup job that stops running leads to disk space or database bloat that eventually takes down production
- Compliance violations — audit log exports, data retention purges, and GDPR deletion tasks have legal deadlines that don't care about your cron problems
- Security drift — certificate renewal, key rotation, and vulnerability scan cron jobs failing silently is how you end up with expired certs in production
The insidious part is the lag between failure and discovery. A nightly backup that fails on Monday might not be discovered until Friday when someone needs to restore data. By then, you've lost a week of backups.
Strategy 1: Wrapper Scripts with Exit Code Handling
The simplest improvement is wrapping your cron commands in a script that checks the exit code and sends a notification on failure.
#!/bin/bash
# /opt/scripts/cron-wrapper.sh
COMMAND="$*"
OUTPUT=$("$@" 2>&1)
EXIT_CODE=$?
if [ "$EXIT_CODE" -ne 0 ]; then
# Truncate and strip quotes/newlines so the output can't break the JSON payload
SNIPPET=$(printf '%s' "$OUTPUT" | head -c 500 | tr -d '"\n\\')
curl -X POST "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
-H "Content-Type: application/json" \
-d "{
\"text\": \"Cron job failed on $(hostname)\",
\"attachments\": [{
\"color\": \"danger\",
\"fields\": [
{\"title\": \"Command\", \"value\": \"$COMMAND\", \"short\": false},
{\"title\": \"Exit Code\", \"value\": \"$EXIT_CODE\", \"short\": true},
{\"title\": \"Output\", \"value\": \"$SNIPPET\", \"short\": false}
]
}]
}"
fi
# Propagate the original exit code so cron still sees the failure
exit "$EXIT_CODE"
Then your crontab becomes:
0 2 * * * /opt/scripts/cron-wrapper.sh /opt/scripts/nightly-backup.sh
Pros: Simple, no external dependencies beyond a webhook URL.
Cons: Only catches non-zero exit codes. If the script hangs forever, the wrapper hangs with it. If the server is down, the cron never fires and no alert is sent. You're also not tracking execution history.
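One more exit-code trap worth knowing before moving on: a shell pipeline reports only the last command's status, so a failing dump command piped into gzip, for example, still looks like success to a wrapper. A minimal demonstration:

```shell
#!/bin/bash
# Without pipefail, only the LAST command in a pipeline sets the exit status
false | cat
plain=$?

# With pipefail, any failing stage propagates to the pipeline's exit status
set -o pipefail
false | cat
strict=$?

echo "plain=$plain strict=$strict"
```

Adding set -o pipefail near the top of cron scripts that use pipelines closes this gap.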
Strategy 2: Dead Man's Switch (Heartbeat Monitoring)
A dead man's switch flips the monitoring model: instead of alerting when something fails, you alert when something doesn't check in.
The concept is straightforward:
- Your cron job pings a monitoring endpoint when it completes successfully
- The monitoring service expects a ping every N minutes/hours
- If the ping doesn't arrive within the expected window, an alert fires
# Cron job with heartbeat
0 2 * * * /opt/scripts/nightly-backup.sh && curl -s https://your-monitor.com/ping/backup-job
This catches several failure modes that wrapper scripts miss:
- The job never starts — server is down, crond isn't running, crontab was accidentally deleted
- The job hangs — it's stuck waiting on a lock, a network call, or a full disk
- The job is slower than expected — it usually takes 10 minutes but today it's taken 3 hours and counting
The && operator ensures the ping only fires on success (exit code 0). If the backup script fails, the curl never executes, and the monitoring service raises an alert when the expected ping doesn't arrive.
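Some heartbeat services also accept an explicit failure ping, which lets the monitor distinguish "the job failed" from "the job never ran". A sketch, assuming hypothetical /ping and /ping/.../fail endpoints on your monitoring service:

```
# Ping the success URL on exit 0, otherwise ping a separate fail URL
0 2 * * * /opt/scripts/nightly-backup.sh && curl -fsS https://your-monitor.com/ping/backup-job || curl -fsS https://your-monitor.com/ping/backup-job/fail
```

Note one quirk of the a && b || c pattern: if the job succeeds but the success curl itself fails, the fail ping fires too — a harmless false alarm, and arguably the behaviour you want when the network is flaky.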
Strategy 3: Structured Logging and Execution Tracking
For teams running more than a handful of cron jobs, you need execution history — not just alerts.
This means logging every execution with:
- Start time and end time (duration)
- Exit code
- Stdout and stderr output (truncated to a reasonable size)
- Which server ran it (critical if you have jobs on multiple machines)
#!/bin/bash
# Enhanced cron wrapper with structured logging (requires jq for safe JSON escaping)
JOB_ID=$(uuidgen)
JOB_NAME="$1"
shift
COMMAND="$*"
START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
OUTPUT=$("$@" 2>&1)
EXIT_CODE=$?
END_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Send structured execution record; jq -Rs escapes arbitrary text into a JSON string
curl -s -X POST "https://your-api.com/cron/executions" \
-H "Content-Type: application/json" \
-d "{
\"id\": \"$JOB_ID\",
\"job\": \"$JOB_NAME\",
\"host\": \"$(hostname)\",
\"command\": $(printf '%s' "$COMMAND" | jq -Rs .),
\"exit_code\": $EXIT_CODE,
\"started_at\": \"$START_TIME\",
\"finished_at\": \"$END_TIME\",
\"output\": $(printf '%s' "$OUTPUT" | head -c 2000 | jq -Rs .)
}"
# Propagate the original exit code
exit "$EXIT_CODE"
Now you have a queryable history. You can answer questions like:
- "When did the backup job last succeed?"
- "How long does the cleanup task typically take? Is it getting slower?"
- "Which server ran the reporting job last night?"
Strategy 4: Centralised Cron Management
The strategies above work well when you have a few jobs on a few servers. But as your infrastructure grows, managing crontabs across dozens of machines becomes its own problem.
Centralised cron management moves the job definitions and execution tracking into a single system:
- Define jobs in one place (not scattered across server crontabs)
- Track execution status and history across all servers
- Get unified alerting — one dashboard for all your scheduled tasks
- Run jobs manually when you need to retry a failure or test a change
This is one of the reasons we built cron monitoring into [HookWatch](https://hookwatch.dev). The CLI agent runs on your servers and pulls job definitions from the cloud. When a job executes, it reports the result back — exit code, output, duration. If a job fails or misses its schedule, you get an alert through Slack, Discord, email, or Telegram. And you can trigger manual re-runs from the dashboard without SSH-ing into the server.
Handling Common Failure Modes
Beyond basic alerting, here are specific failure modes you should design for:
Overlapping Executions
A job scheduled every 5 minutes takes 7 minutes to run. Now you have two instances running simultaneously, potentially corrupting data.
# Use flock to prevent overlapping
* * * * * flock -n /tmp/myworker.lock /opt/scripts/process-queue.sh
The -n flag makes flock fail immediately if the lock is held, rather than waiting. Combine this with your monitoring: if the job is consistently taking longer than its interval, that's a metric worth alerting on.
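The refusal is instant and unambiguous, which you can verify outside cron. A small sketch (the lock file path is illustrative): hold a lock on a file descriptor, then watch a second flock -n on the same file fail immediately:

```shell
#!/bin/bash
# Hold an exclusive lock on fd 9, as a long-running job would
LOCKFILE=/tmp/flock-demo.lock   # illustrative path
exec 9>"$LOCKFILE"
flock -n 9

# A second non-blocking attempt on the same file fails instead of waiting,
# exactly as the overlapping cron run would
if flock -n "$LOCKFILE" -c true; then
  result="acquired"             # would mean overlap was allowed
else
  result="lock busy"            # -n refused immediately
fi
echo "$result"

# Release the lock by closing the descriptor
exec 9>&-
```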
Timeout Protection
A job that hangs indefinitely blocks the next execution and wastes resources.
# Kill the job after 30 minutes
0 * * * * timeout 1800 /opt/scripts/hourly-sync.sh
timeout sends SIGTERM after the specified number of seconds. The job's exit code becomes 124, which your wrapper script can detect and report specifically as a timeout.
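The exit-code convention is easy to check for yourself. A short sketch that deliberately times out a command and branches on code 124:

```shell
#!/bin/bash
# A 5-second sleep killed after 2 seconds: timeout reports exit code 124
timeout 2 sleep 5
code=$?

if [ "$code" -eq 124 ]; then
  status="timed out"
else
  status="exited $code"
fi
echo "$status"
```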
Environment Issues
Cron runs with a minimal environment — no .bashrc, no PATH customisation, no environment variables from your shell. This is the single most common cause of "it works when I run it manually but fails in cron."
# Explicit environment in crontab
PATH=/usr/local/bin:/usr/bin:/bin
DATABASE_URL=postgres://localhost/myapp
0 2 * * * /opt/scripts/nightly-backup.sh
Or source your environment explicitly in the script:
#!/bin/bash
source /opt/scripts/.env
# Now DATABASE_URL and other vars are available
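A useful debugging trick for "works manually, fails in cron": reproduce cron's stripped environment with env -i before blaming the script. This isn't cron itself, just an approximation of its minimal environment:

```shell
#!/bin/bash
# Run a command with an empty environment plus cron's typical bare-bones PATH
cron_path=$(env -i PATH=/usr/bin:/bin /bin/sh -c 'echo "$PATH"')
echo "PATH under cron-like env: $cron_path"
```

If your script breaks under env -i the same way it breaks under cron, you've found an environment dependency rather than a logic bug.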
Daylight Saving Time
Cron uses the system's local time zone. During DST transitions, a job scheduled at 2:30 AM might run twice (when clocks fall back) or not at all (when clocks spring forward). Note that the CRON_TZ variable is supported by cronie and some other modern implementations but not by every cron; if yours lacks it, run the system clock in UTC or avoid scheduling jobs inside the 1–3 AM transition window.
# Use UTC to avoid DST issues
CRON_TZ=UTC
30 2 * * * /opt/scripts/nightly-report.sh
Building an Alert Strategy That Works
Not all cron failures are equal. Your alerting should reflect this:
Critical (immediate page)
- Payment processing jobs
- Security certificate renewal
- Database backups
- Compliance-related tasks
Warning (Slack/email within minutes)
- Report generation
- Cache warming
- Data synchronisation with non-critical services
Informational (dashboard only)
- Analytics aggregation
- Log rotation
- Temporary file cleanup
The worst thing you can do is send every failure to the same channel at the same severity. Alert fatigue kills monitoring faster than no monitoring at all.
A Practical Checklist
If you're starting from zero, here's a prioritised checklist:
- Audit your crontabs — run crontab -l on every server, document what exists and what it does
- Categorise by criticality — separate "the business stops" jobs from "nice to have" jobs
- Add dead man's switches to critical jobs first — even a simple ping to an HTTP endpoint is better than nothing
- Add wrapper scripts for failure notification — Slack webhook, email, whatever your team actually checks
- Implement flock for jobs that must not overlap — this prevents an entire class of subtle bugs
- Add timeout to long-running jobs — a job that hangs for 6 hours is worse than one that fails immediately
- Set up execution history — even if it's just appending to a log file at first, you need history to debug patterns
- Centralise when it makes sense — once you have more than 10 jobs across multiple servers, managing individual crontabs doesn't scale
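The "even just a log file" version of execution history from the checklist can be sketched in a few lines. The paths, the run_logged name, and the CRON_HISTORY_LOG override are all illustrative:

```shell
#!/bin/bash
# Minimal execution history: append one line per run to a log file
LOG="${CRON_HISTORY_LOG:-/tmp/cron-history.log}"   # illustrative default

run_logged() {
  local start code
  start=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  "$@"; code=$?
  # One record per execution: timestamp, host, exit code, command
  echo "$start host=$(hostname) exit=$code cmd=$*" >> "$LOG"
  return "$code"
}

run_logged true
run_logged false
tail -n 2 "$LOG"
```

Even this crude history answers "when did it last succeed?" with a grep, which is the point: start with something greppable, graduate to something queryable.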
Conclusion
Cron is a brilliant piece of engineering that has run the world's scheduled tasks for over 40 years. Its one weakness — silent failure — is entirely solvable with the right monitoring strategy.
The goal isn't to replace cron. It's to add the observability layer that cron was never designed to provide. Whether you use wrapper scripts, dead man's switches, or a dedicated monitoring tool like [HookWatch](https://hookwatch.dev), the important thing is that no scheduled task on your infrastructure should be able to fail without someone knowing about it.
Start with your most critical jobs. Get alerts working. Then expand coverage. Your future self — the one who doesn't get woken up at 3 AM because a backup failed silently for a week — will thank you.