Ticket Simulations Part 1 — What I Learned Investigating Real Server Problems - LimonLab

This is part of my ongoing Linux learning series where I document everything I study, including the mistakes, the confusions, and the breakthroughs. Domain 4 covers Process and Resource Management — one of the most important domains for anyone targeting a Junior Cloud or DevOps role.

Background

Domain 4 is not just about knowing commands. It is about understanding what is actually happening inside a Linux system when things go wrong. Processes consuming too much CPU, zombie processes that cannot be killed, memory exhaustion that takes down your database — these are real production problems. In this session I worked through 10 ticket-style simulations on my live Hetzner VPS running limonlab.online.

Part 1 covers tickets 1 through 5.

Ticket 1 — High Load Average: WordPress Site Slowing Down

Alert: Load average above 4.0 on a 2 vCPU machine. WordPress responding slowly.

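The first command I always run when load is high is top. It immediately showed a php-fpm worker (PID 7431) consuming 98.7% CPU for over 12 minutes. A load average of 4.21 on a 2 vCPU machine means the system is seriously struggling: once the figure stays above 2.0 on a 2 vCPU server, more tasks want to run than there are cores, so they queue for CPU time. (On Linux the load average also counts tasks in uninterruptible I/O wait, so a high number does not always mean the CPU is the bottleneck. Here top confirmed that it was.)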
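
The rule of thumb above is easy to script. This is a minimal sketch, assuming a Linux host where /proc/loadavg and nproc are available; load_status is a made-up helper name, not a standard tool.

```shell
#!/usr/bin/env bash
# load_status LOAD NCPU
# Prints "overloaded" when the load average exceeds the CPU count, else "ok".
# (load_status is a hypothetical helper for illustration.)
load_status() {
  awk -v load="$1" -v ncpu="$2" \
    'BEGIN { if (load + 0 > ncpu + 0) print "overloaded"; else print "ok" }'
}

# Compare the live 1-minute load average with the number of CPUs.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r one_min _ < /proc/loadavg
  echo "1-min load ${one_min} on $(nproc) CPU(s): $(load_status "$one_min" "$(nproc)")"
fi
```

With the numbers from this ticket, load_status 4.21 2 reports overloaded.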

Next I ran strace -p 7431 to see exactly what the process was doing at the system call level. But I hit a permission error:

strace: attach: ptrace(PTRACE_SEIZE, 7431): Operation not permitted

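This confused me at first. Even with sudo it failed. The setting I found behind it was ptrace_scope, a kernel security control from the Yama module. Checking /proc/sys/kernel/yama/ptrace_scope showed value 1, which limits unprivileged processes to tracing only their own descendants. Root (anything holding CAP_SYS_PTRACE) is normally exempt at that level, so a failure under plain sudo may also involve another restriction, such as a container or sandbox policy. What worked for me was switching to a full root shell with sudo -i first, then running strace.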
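
For reference, the possible ptrace_scope values can be decoded with a small helper. This is a sketch only: describe_ptrace_scope is a hypothetical name, and the descriptions are my paraphrase of the kernel's Yama documentation.

```shell
#!/usr/bin/env bash
# describe_ptrace_scope VALUE
# Maps a Yama ptrace_scope value to a short description (paraphrased).
describe_ptrace_scope() {
  case "$1" in
    0) echo "classic: same-user processes may trace each other" ;;
    1) echo "restricted: unprivileged processes may only trace their descendants" ;;
    2) echo "admin-only: only processes with CAP_SYS_PTRACE may trace" ;;
    3) echo "no attach: tracing is disabled until reboot" ;;
    *) echo "unknown" ;;
  esac
}

# Report the current setting if the Yama module exposes it.
scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
  value=$(cat "$scope_file")
  echo "ptrace_scope=${value}: $(describe_ptrace_scope "$value")"
fi
```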

Once attached, strace showed this repeating thousands of times per second:

brk(NULL) = 0x55a3f2c1000
brk(NULL) = 0x55a3f2c1000
brk(NULL) = 0x55a3f2c1000

brk() is a memory management syscall: it queries or moves the end of the process heap. Seeing it repeat with no file I/O, no network calls, and no sleep means the process is spinning on the CPU in a tight loop. This is the smoking gun for an infinite loop.
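
When the output scrolls too fast to read, tallying syscall names makes the pattern obvious. A minimal sketch, assuming the strace output has been saved or piped in as plain lines; count_syscalls is a hypothetical helper, and the sample lines are the ones shown above.

```shell
#!/usr/bin/env bash
# count_syscalls: reads strace output on stdin and prints "<syscall> <count>",
# most frequent first. Lines that do not start with "name(" are ignored.
count_syscalls() {
  awk -F'[(]' '/^[a-z_0-9]+\(/ { n[$1]++ }
               END { for (s in n) print s, n[s] }' | sort -k2,2nr -k1,1
}

# Sample input taken from the trace in this ticket.
printf '%s\n' \
  'brk(NULL) = 0x55a3f2c1000' \
  'brk(NULL) = 0x55a3f2c1000' \
  'brk(NULL) = 0x55a3f2c1000' | count_syscalls
# prints: brk 3
```

A single syscall dominating the tally, with no read, write, or poll calls next to it, points to the same conclusion as eyeballing the trace.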

I killed the process with kill 7431 (SIGTERM) and load dropped from 4.21 to 2.11 within seconds. But killing it was just the band-aid. The real investigation was checking the php-fpm error log:

WARNING: child 7431, script 'wp-content/plugins/wp-statistics/includes/class-wp-statistics-hits.php',
max execution time (300) will be reached in 1 seconds

The wp-statistics WordPress plugin was recording visitor data to MySQL on every page hit. A bot from IP 94.102.49.12 was hammering the site repeatedly, and each hit triggered an expensive database write. The MySQL table had grown too large and queries were getting slow. The PHP worker stayed tied up on a single request, burning CPU without doing anything useful, for 5 full minutes until PHP's built-in safety net (max_execution_time) fired.
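
A quick way to spot a client like that is to count requests per IP in the web server's access log. This is a sketch under assumptions: top_clients is a hypothetical helper, it expects the client IP as the first field of each line (as in the common nginx and Apache log formats), and the sample lines below are invented stand-ins for real log entries.

```shell
#!/usr/bin/env bash
# top_clients [N]: reads access-log lines on stdin and prints "<count> <ip>"
# for the N busiest clients (default 5), busiest first.
top_clients() {
  awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' |
    sort -k1,1nr -k2,2 | head -n "${1:-5}"
}

# Invented sample lines; on a real server, feed in the access log instead.
printf '%s\n' \
  '94.102.49.12 - - "GET / HTTP/1.1" 200' \
  '94.102.49.12 - - "GET / HTTP/1.1" 200' \
  '203.0.113.5 - - "GET /about HTTP/1.1" 200' | top_clients 2
```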

What I learned:

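strace patterns: brk() looping = CPU spin, a recvfrom() call that never returns (shown as <unfinished ...>) = stuck waiting on network/database
ptrace_scope can stop strace from attaching; at value 1 root is normally exempt, and a full root shell (sudo -i) is what worked for me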
Killing the process fixes the symptom, investigating the logs finds the cause
300 second max_execution_time in php.ini is far too long for WordPress — 60 seconds is more appropriate

Ticket 2 — Zombie Processes: Why Kill Does Not Work

Report: Teammate noticed zombie processes in top and cannot kill them.

top showed three php-fpm processes with Z in the status column and zero CPU, zero memory. My teammate had already tried kill -9 on them and nothing happened.

Running pstree -aps 8821 showed:

systemd,1
  └─php-fpm8.3,892 master process
        └─php-fpm8.3,8821 [defunct]

The fix was sudo systemctl restart php8.3-fpm. After restart all three zombies disappeared immediately.

Why kill -9 never works on zombies:

A zombie process has already finished running. It has no code executing, no memory allocated, nothing to kill. It is just a record in the kernel’s process table — waiting for its parent to collect its exit status using wait(). You cannot kill a record.

When the parent (php-fpm master) failed to collect the exit status of its finished children, those children became zombies. Restarting php-fpm killed the parent process. The zombie children became orphans and were automatically re-parented to systemd (PID 1). systemd always calls wait() on its children — which is exactly what the original parent failed to do — and the zombies disappeared.

The one-line mental model:

Zombie = child finished but parent never collected the death certificate. Fix the parent, not the zombie.
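
Since the fix targets the parent, it helps to list each zombie next to its parent PID. A minimal sketch, assuming a procps-style ps; list_zombies is a hypothetical helper that filters ps output read from stdin.

```shell
#!/usr/bin/env bash
# list_zombies: reads lines of "pid ppid stat comm" on stdin and prints one
# line per process whose state starts with Z (zombie).
list_zombies() {
  awk '$3 ~ /^Z/ { print "zombie " $1 " parent " $2 " (" $4 ")" }'
}

# Feed it the live process table (no header, one -o option per column).
if command -v ps >/dev/null 2>&1; then
  ps -e -o pid= -o ppid= -o stat= -o comm= | list_zombies
fi
```

For the process tree shown in this ticket, the output would name 892 (the php-fpm master) as the parent to restart.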

Ticket 3 — Backup Script Competing With Production: Using renice

Situation: A mysqldump backup script is running during the day and making the site sluggish. Need to lower its CPU priority without stopping it.

top showed mysqldump (PID 9842) consuming 42.3% CPU with NI value of 0 (default niceness). The fix:

sudo renice -n 19 -p 9842

Nice values range from -20 (highest priority) to 19 (lowest priority). Setting 19 makes the process as “nice” as possible — it yields CPU to every other process whenever they need it.

After renice, top showed:

mysqldump CPU dropped from 42.3% to 14.2%
PR (kernel priority) changed from 20 to 39
NI column now shows 19
CPU line showed 12.4 ni — the percentage being used by niced processes

The backup continued running and completing its work. It just stopped competing aggressively with nginx and php-fpm.

Important rule: Only root can set negative nice values (higher priority). Normal users can only increase the nice value (lower priority). This is a deliberate security decision — otherwise any user could boost their process and starve the system.
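
The lowering direction can be tried safely without root. The sketch below uses a background sleep as a stand-in for the backup job (an assumption for illustration; the real case was mysqldump) and reads the nice value back with ps.

```shell
#!/usr/bin/env bash
# Start a harmless stand-in workload in the background.
sleep 30 &
pid=$!

before=$(ps -o ni= -p "$pid" | tr -d ' ')
# Raising the nice value (lowering priority) of your own process needs no root.
renice -n 19 -p "$pid" >/dev/null
after=$(ps -o ni= -p "$pid" | tr -d ' ')

echo "nice value of PID ${pid}: ${before} -> ${after}"

# Clean up the stand-in process.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

To start a job at low priority from the outset, nice -n 19 <command> applies the same setting at launch.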

Ticket 4 — Too Many Open Files: php-fpm Failing on Image Uploads

Alert: php-fpm error log showing Too many open files. WordPress image uploads failing.

First I needed to find which process was hitting the limit:

lsof -n | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn | head -n 3

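Output showed PID 9012 (a php-fpm worker) at the top with 847 entries. lsof also lists memory-mapped files and the working directory, so this is an upper bound on open file descriptors rather than an exact count. Then I checked its limit: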

cat /proc/9012/limits

Max open files    1024    4096    files

Soft limit was 1024 and the process had 847 open — dangerously close. At peak traffic (multiple simultaneous requests each opening PHP files, WordPress plugin files, MySQL connections, uploaded images) it briefly spiked past 1024, threw the error, then dropped back down when some connections closed.
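
The same comparison can be read straight from /proc, which counts only real file descriptors. A minimal sketch, assuming a Linux /proc filesystem; fd_usage is a hypothetical helper, and inspecting another user's process this way requires root.

```shell
#!/usr/bin/env bash
# fd_usage PID: prints "<open fds> <soft limit>" for the given process.
fd_usage() {
  local pid=$1 open soft
  # Each entry in /proc/PID/fd is one open file descriptor.
  open=$(ls "/proc/${pid}/fd" | wc -l)
  # Column 4 of the "Max open files" row is the soft limit.
  soft=$(awk '/^Max open files/ { print $4 }' "/proc/${pid}/limits")
  echo "$((open)) ${soft}"
}

# Demo on the current shell; on the server the argument would be 9012.
fd_usage $$
```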

Before fixing I checked the parent:

ps -o ppid= -p 9012

Output: 892 — the php-fpm master process, started by systemd. This determined the fix. If it had been started from my shell, ulimit -n 4096 would work. But systemd-managed services inherit limits from systemd, not from your shell. So:

sudo systemctl edit php8.3-fpm

[Service]
LimitNOFILE=4096

Then I ran sudo systemctl daemon-reload and restarted the service with sudo systemctl restart php8.3-fpm. Verification via /proc/[new-pid]/limits confirmed both soft and hard limits now showed 4096.

Key lesson: ulimit only affects your current shell and the processes started from it afterwards. For any service managed by systemd, you must use LimitNOFILE in the unit file. Always identify where the process comes from before choosing your fix.
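
The scoping of ulimit is easy to demonstrate with a subshell. This is a small sketch; the value 512 is arbitrary and only needs to be at or below the hard limit.

```shell
#!/usr/bin/env bash
# Soft limit on open files in the current shell.
before=$(ulimit -Sn)

# Change the soft limit inside a subshell only, then read it back there.
inner=$( ulimit -Sn 512 && ulimit -Sn )

# The parent shell never sees the change.
after=$(ulimit -Sn)

echo "parent before=${before}, subshell=${inner}, parent after=${after}"
```

The change lives and dies with the shell that made it, which is why it never reaches a service that systemd starts on its own.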

Ticket 5 — Capping MySQL and php-fpm CPU Before Heavy Traffic

Situation: Big match night on stadiumbuzz.live. Want to prevent MySQL and php-fpm from consuming all CPU and starving the rest of the system.

Applied CPU quotas via systemd drop-in overrides:

sudo systemctl edit mysql

[Service]
CPUQuota=50%

sudo systemctl edit php8.3-fpm

[Service]
CPUQuota=80%

After daemon-reload and restart, verification:

systemctl show mysql | grep CPUQuota
CPUQuotaPerSecUSec=500ms

systemctl show php8.3-fpm | grep CPUQuota
CPUQuotaPerSecUSec=800ms

systemd converts percentages to microseconds per second internally. 50% = 500ms of CPU time per second allowed.

Then I verified at the kernel level directly via cgroups:

cat /sys/fs/cgroup/system.slice/mysql.service/cpu.max
50000 100000

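This means 50,000 microseconds of CPU time per 100,000 microsecond period, which is exactly 50% of one CPU (a quarter of this 2 vCPU machine's total capacity). If the cgroup file shows the right number, the kernel is actually enforcing it. This is the most trustworthy verification step, because it checks what the kernel enforces rather than just the systemd configuration.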
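
The arithmetic can be wrapped in a helper so the check is repeatable for any service. A minimal sketch, assuming the cgroup v2 cpu.max format of "<quota> <period>" (or "max <period>" when no limit is set); cpu_max_percent is a hypothetical name.

```shell
#!/usr/bin/env bash
# cpu_max_percent: reads a cgroup v2 cpu.max line on stdin and prints the quota
# as a percentage of one CPU, or "unlimited" when the quota is "max".
cpu_max_percent() {
  awk '{ if ($1 == "max") print "unlimited"; else printf "%.0f\n", $1 * 100 / $2 }'
}

echo "50000 100000" | cpu_max_percent   # prints 50

# On the server, read the live value (path as used in this ticket).
cpu_max_file=/sys/fs/cgroup/system.slice/mysql.service/cpu.max
if [ -r "$cpu_max_file" ]; then
  cpu_max_percent < "$cpu_max_file"
fi
```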

Important note on CPUQuota values:

On a 2 vCPU server, 200% quota = using both cores fully
Setting quota above your actual hardware maximum = effectively uncapped
Always set quota below the real hardware ceiling for it to have any effect
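
These three rules reduce to one comparison against the hardware ceiling of 100% per vCPU. The helper below is a sketch for sanity-checking a value before putting it in a drop-in; quota_effect is a hypothetical name, and it expects whole-number percentages.

```shell
#!/usr/bin/env bash
# quota_effect QUOTA_PERCENT NCPU
# Says whether a CPUQuota value would actually constrain a service.
quota_effect() {
  local quota=$1 ncpu=$2
  local ceiling=$(( ncpu * 100 ))   # 100% per vCPU
  if [ "$quota" -ge "$ceiling" ]; then
    echo "no effect: ${quota}% >= ${ceiling}% hardware ceiling"
  else
    echo "caps at ${quota}% of one CPU ($(( quota / ncpu ))% of the machine)"
  fi
}

quota_effect 50 2    # the MySQL quota from this ticket
quota_effect 250 2   # above the ceiling of a 2 vCPU server
```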

Summary — Part 1

Ticket	Problem	Key Tool	Fix
1	PHP worker infinite loop	strace, top	kill -15, fix plugin
2	Zombie processes	pstree, top	restart parent service
3	Backup competing with production	renice	renice -n 19
4	Too many open files	lsof, /proc/[pid]/limits	LimitNOFILE in systemd
5	No CPU limits before traffic spike	systemctl, cgroups	CPUQuota in unit file

Part 2 covers tickets 6 through 10 — SSH disconnects killing processes, OOM killer taking down MySQL, job control, signal theory, and a cross-topic investigation combining everything.
