Part of my Linux Mastery Road to Cloud series — Domain 12: System Monitoring & Scheduling


When I first set up my home server at limonlab.online and exposed it to the real internet, I thought the hard part was over. Nginx was running. WordPress was live. MySQL was connected. Job done.

Then a week later, the site slowed to a crawl. PHP-FPM had spawned so many worker processes that the server was eating itself alive. I had no monitoring in place. I had no idea it was happening. I found out because the page just stopped loading.

That incident taught me something I should have learned earlier: a system you can’t observe is a system you can’t operate. This post covers the monitoring and scheduling tools that every Linux sysadmin and cloud engineer needs to actually understand — not just memorize, but use when things are on fire.


The Problem With “Everything Looks Fine”

Most servers look fine until they don’t. A runaway process accumulates over hours. A cron job silently fails for weeks. Disk fills up not because of one big file but because of thousands of rotated log backups nobody configured a retention policy for.

The tools in this domain exist for one purpose: to give you situational awareness before problems become outages.


1. Real-Time Process Monitoring: top and htop

top is the first thing I run when something feels wrong. It gives you a live view of CPU usage, memory pressure, load averages, and the processes consuming the most resources.

top

The three numbers after “load average” are the system load averaged over the last 1, 5, and 15 minutes. On Linux, load counts processes that are running, waiting for a CPU, or blocked in uninterruptible I/O. On a single-core machine, a sustained load above 1.0 means work is queuing. On a 4-core machine, the same is true above 4.0. This number is your first signal that something is working harder than it should be.

Press P (uppercase, so Shift+p) to sort by CPU, M (Shift+m) to sort by memory. Press k and type a PID to kill a process without leaving top.

htop is the same idea with a more readable interface and mouse support. If it’s installed, I use it instead. If not, top is always there.

What I actually look for:

  • Any single process consuming >80% CPU consistently
  • Memory usage climbing without dropping back down (memory leak)
  • Load average spiking well above the number of CPU cores
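
The last check in that list is easy to script. This is a minimal sketch, not a monitoring tool: load_status is a helper name I made up for illustration, and it assumes a Linux box where /proc/loadavg and nproc are available.

```shell
#!/usr/bin/env bash
# Compare the 1-minute load average against the number of CPU cores.

# load_status LOAD CORES -> prints "overloaded" if LOAD exceeds CORES, else "ok".
# (Hypothetical helper; awk does the floating-point comparison that [ ] cannot.)
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load > cores) print "overloaded"; else print "ok"
    }'
}

# Live check, only where the Linux-specific sources exist.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg   # first field is the 1-minute average
    cores=$(nproc)
    echo "1-min load $one_min on $cores cores: $(load_status "$one_min" "$cores")"
fi
```

A single reading above the core count is not an emergency. What matters is whether it stays there, which is why top shows all three averages.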

2. Memory: free

free -h

The -h flag gives human-readable output (MiB and GiB instead of raw kibibytes). The key column to watch is available, not free. Available is the kernel’s estimate of how much memory new workloads can use without swapping, and it includes cache that can be reclaimed immediately. Linux is aggressive about using memory for cache, so a server showing nearly zero free memory isn’t necessarily under pressure.

When I diagnosed the PHP-FPM incident, free -h showed available memory dropping to near zero with no cache to reclaim. That’s the real warning sign.
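
If you want that warning sign as a number, the same value free reports is in /proc/meminfo. The sketch below is one way to read it. available_pct is a hypothetical helper name, and any alert threshold you compare it against is your own choice, not a standard.

```shell
#!/usr/bin/env bash
# Report available memory as a percentage of total memory.

# available_pct [FILE] -> prints MemAvailable as a whole-number percentage
# of MemTotal. FILE defaults to /proc/meminfo (Linux only).
available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "available memory: $(available_pct)%"
fi
```

Watching this value trend toward zero over hours is a better leak indicator than any single snapshot.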


3. Disk Space: df and du

df -h

Shows disk usage per mounted filesystem. The column that matters is Use%. If your root filesystem (/) is at 95%+, you are in a bad situation. Many services start failing silently before you hit 100% — they just can’t write logs, can’t create temp files, can’t complete transactions.
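
A rough way to turn that Use% check into something a scheduled job can run, assuming the POSIX df -P output format. full_filesystems and the 90% limit are illustrative choices of mine, and the parsing will misread mount points that contain spaces.

```shell
#!/usr/bin/env bash
# List filesystems whose usage is at or above a threshold.

# full_filesystems THRESHOLD -> reads `df -P` output on stdin and prints
# "mountpoint usage%" for each filesystem at or above THRESHOLD percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {          # skip the header row
        use = $5                         # Capacity column, e.g. "96%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# -P forces one line per filesystem so the columns stay predictable.
df -P | full_filesystems 90
```

No output means nothing is over the limit.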

When I’ve seen full disks, the cause was almost always one of three things:

  • Unrotated or undeleted log files
  • Backup files accumulating with no retention limit
  • A database growing faster than expected

du tells you where the space went:

du -sh /var/log/*
du -sh /var/backups/*

-s gives a summary total per item, not every subdirectory. -h makes it readable. Run these when df shows high usage and you need to find the culprit.

One specific trick I use often:

du -sh /* 2>/dev/null | sort -rh | head -10

This shows the 10 largest directories at the root level, sorted by size. It’s fast triage.


4. Disk I/O: iostat and iotop

Normal CPU and memory numbers don’t always tell the whole story. Sometimes the problem is disk I/O: a process reading or writing so heavily that it bottlenecks everything else.

iostat -x 2

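The -x flag shows extended stats including %util, the percentage of time the disk was busy. If this is consistently near 100% on a single spinning disk, the disk is the bottleneck, not your CPU or RAM. On SSDs and RAID arrays, which serve requests in parallel, a high %util is a weaker signal, so check the await (latency) columns as well.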
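
To pick out the busy devices without reading the whole table, you can filter on %util. This is a sketch under assumptions: busy_disks is a made-up helper, it finds the %util column by name because its position differs between sysstat versions, and it expects decimal points in the numbers (run iostat with LC_ALL=C if your locale uses commas).

```shell
#!/usr/bin/env bash
# Print devices whose %util is at or above a threshold.

# busy_disks THRESHOLD -> reads `iostat -x` output on stdin and prints
# "device %util" for every device at or above THRESHOLD.
busy_disks() {
    awk -v limit="$1" '
        # Header row: remember which column holds %util.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Device rows follow the header until the next blank line.
        col && NF >= col { if ($col + 0 >= limit + 0) print $1, $col }
        NF == 0 { col = 0 }
    '
}

# Usage (needs the sysstat package):
#   iostat -x 2 3 | busy_disks 90
```

The first report iostat prints covers the time since boot, so the later reports are the ones that describe what is happening now.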

iotop

iotop shows which processes are responsible for the I/O. This is how I diagnosed a runaway tar backup job that was reading and writing gigabytes in the background while the site was responding slowly. The CPU was fine. iotop showed one process doing 200MB/s of disk writes. Mystery solved.

iotop requires root or sudo. If it’s not installed on Debian or Ubuntu, run sudo apt install iotop. iostat comes from the sysstat package.


5. Cron: Scheduling Without Surprises

Cron is the classic Linux job scheduler. It runs commands at scheduled times using a simple syntax. Most servers use it for backups, cleanup tasks, log rotation triggers, and health checks.

The Syntax

* * * * * /path/to/command
│ │ │ │ │
│ │ │ │ └── Day of week (0-6, Sunday=0)
│ │ │ └──── Month (1-12)
│ │ └────── Day of month (1-31)
│ └──────── Hour (0-23)
└────────── Minute (0-59)

A * means “every”. So 30 2 * * * means “at 2:30 AM every day.” 0 */6 * * * means “every 6 hours, on the hour.”

crontab -e    # Edit your cron jobs
crontab -l    # List current cron jobs

The PATH Problem (This Will Bite You)

This is the most common reason cron jobs silently fail: the PATH inside a cron environment is not the same as your interactive shell’s PATH.

When you run a command by name in your terminal, the shell finds it by searching the directories in your PATH, which usually include /usr/local/bin. Cron starts jobs with a minimal PATH by default, something like /usr/bin:/bin. So if your script calls a command installed in /usr/local/bin, cron won’t find it, and the job fails with a “command not found” error you never see.

The fix: Always use full absolute paths in cron jobs, or explicitly set PATH at the top of your crontab:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

30 2 * * * /usr/bin/tar -czf /var/backups/site-$(date +\%F).tar.gz /var/www/html
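
You can approximate what cron will see before you trust a job to it. The sketch below runs a lookup with an empty environment and a minimal PATH. in_cron_path is a hypothetical helper, and PATH=/usr/bin:/bin is only a stand-in for your cron’s real default, which can differ by distribution.

```shell
#!/usr/bin/env bash
# Check whether a command resolves under a minimal, cron-like PATH.

# in_cron_path COMMAND -> succeeds if COMMAND is found with PATH=/usr/bin:/bin
# and no other environment variables (env -i clears them).
in_cron_path() {
    env -i PATH=/usr/bin:/bin /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}

for cmd in tar backup.sh; do
    if in_cron_path "$cmd"; then
        echo "$cmd: found"
    else
        echo "$cmd: NOT found with a cron-like PATH, use an absolute path"
    fi
done
```

Anything reported as not found needs either an absolute path in the crontab line or a PATH= line above it.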

Don’t Let Output Disappear

By default, cron emails a job’s output to the owner of the crontab through the local mail system. On most servers no mail delivery is set up, so that email goes nowhere. You’ll never know if your backup failed.

Redirect output to a log file instead:

30 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1

>> /var/log/backup.log appends stdout to a log file.
2>&1 also sends stderr to the same file.
Now when something breaks, you have evidence.
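
Plain redirection gives you the output but not when the job ran or how it exited. One possible refinement is a small wrapper that records both. run_logged is a name I invented for this sketch, and the log format is arbitrary.

```shell
#!/usr/bin/env bash
# Run a command and append its output, timestamps, and exit status to a log.

# run_logged LOGFILE COMMAND [ARGS...]
# Appends COMMAND's stdout and stderr to LOGFILE between START/END markers
# and returns COMMAND's own exit status.
run_logged() {
    local log=$1 status
    shift
    {
        echo "[$(date '+%F %T')] START: $*"
        "$@"
        status=$?
        echo "[$(date '+%F %T')] END: exit status $status"
    } >> "$log" 2>&1
    return "$status"
}

# Usage inside a script that cron calls:
#   run_logged /var/log/backup.log /usr/local/bin/backup.sh
```

Because the wrapper passes the exit status through, anything that calls it can still react to a failed backup.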


6. Systemd Timers: The Modern Way to Schedule

Systemd timers are the newer alternative to cron. They’re more powerful and more debuggable — you can check status, see last run time, and read output with standard journalctl commands.

A timer consists of two files: a .service that defines what runs, and a .timer that defines when.

Example — a daily backup timer:

/etc/systemd/system/backup.service

[Unit]
Description=Daily site backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
User=www-data

/etc/systemd/system/backup.timer

[Unit]
Description=Run backup daily at 2:30 AM

[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true

[Install]
WantedBy=timers.target

Reload systemd so it picks up the new unit files, then enable and start the timer:

sudo systemctl daemon-reload
sudo systemctl enable --now backup.timer

Check its status:

systemctl status backup.timer
systemctl list-timers --all

Persistent=true is important — it means if the system was off at 2:30 AM, the timer will run the job as soon as the system boots. Cron skips missed jobs silently. Systemd timers don’t.

Debugging Timers

The most common failure I’ve seen: a bash syntax error in the script the service calls. The timer fires, the service starts, the script crashes, and nothing seems to have run. Finding out what happened is simple:

journalctl -u backup.service

Every output line, every error message, every exit code — it’s all there. This is the biggest practical advantage of systemd timers over cron: logging is built in.
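
When you only need a yes or no answer, for example from another health check, systemctl show exposes the last result as properties. This sketch parses that output. last_run_ok is a hypothetical helper, and it only describes the most recent run, so it complements journalctl rather than replacing it.

```shell
#!/usr/bin/env bash
# Decide whether a service's most recent run succeeded.

# last_run_ok -> reads `systemctl show UNIT -p Result -p ExecMainStatus`
# output on stdin, prints a one-line summary, and succeeds only if the
# Result property is "success".
last_run_ok() {
    local line result="" status=""
    while IFS= read -r line; do
        case $line in
            Result=*)         result=${line#Result=} ;;
            ExecMainStatus=*) status=${line#ExecMainStatus=} ;;
        esac
    done
    echo "result=${result:-unknown} exit_status=${status:-unknown}"
    [ "$result" = "success" ]
}

# Usage on a systemd host:
#   systemctl show backup.service -p Result -p ExecMainStatus | last_run_ok \
#       || journalctl -u backup.service -n 50
```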


7. Checking Open Files: lsof

One situation that confuses a lot of people: df shows the disk is full, but when you delete a large log file, the disk doesn’t free up. The space is still gone.

This happens because a running process still has the file open. On Linux, a deleted file’s disk space isn’t released until all file descriptors pointing to it are closed.

lsof | grep deleted

This shows every file that has been deleted but is still held open by a running process. Run it with sudo to see processes owned by other users. You’ll typically see a log file that a service is still writing to even though you deleted it. The fix: restart the service, or send it a signal to reopen its log files (many daemons do this on SIGHUP; nginx uses SIGUSR1).
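
If lsof isn’t installed, the same information is visible under /proc on Linux, because each open descriptor is a symlink whose target gets a “ (deleted)” suffix once the file is unlinked. A minimal sketch for a single process follows. deleted_open_files is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List a process's open file descriptors that point at deleted files.

# deleted_open_files PID -> prints "fd target" for each descriptor of PID
# whose target has been deleted (Linux only: relies on /proc/PID/fd).
deleted_open_files() {
    local fd target
    for fd in /proc/"$1"/fd/*; do
        # readlink fails for descriptors we cannot inspect; skip those.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') echo "${fd##*/} $target" ;;
        esac
    done
}

# Usage: pass the PID of the service you suspect, as root if it is not yours.
#   deleted_open_files 1234
```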


The Mindset Shift

Every tool in this domain answers one of three questions:

  1. What is the system doing right now? top, htop, free, iostat, iotop
  2. What has the system scheduled to do? crontab -l, systemctl list-timers
  3. What did the system do, and did it succeed? journalctl, log files, cron redirect output

When something breaks on a production server, you work backward through those three questions. You don’t guess. You look.

That’s the discipline this domain teaches. Not the syntax of cron — you can look that up. The discipline is knowing to check iotop before assuming it’s a CPU problem. Knowing to check lsof | grep deleted before rebooting because the disk “should” have space. Knowing to read journalctl -u service-name before touching the script.

The server tells you what’s wrong. These tools are how you listen.


This post is part of my Linux Mastery Road to Cloud series. I’m documenting every domain I study on my journey from zero IT background to a Junior Cloud/DevOps role. The incidents I write about happened on my live server at limonlab.online — a real Ubuntu box exposed to the internet, generating real traffic and real problems.
