Continuing from Part 1. This post covers tickets 6 through 10 from my Domain 4 Round 2 simulations — the more advanced topics including job control, OOM killer behavior, signal theory, /proc filesystem deep dive, and a full cross-topic investigation.


Ticket 6 — SSH Disconnection Killed My Long-Running Process

Situation: A long database optimization command was running manually. SSH connection dropped. After reconnecting — is the process still running or is it dead?

The instinct is to run jobs but that only shows processes attached to the current shell session. After reconnecting you have a brand new session — nothing shows up.

The right approach for finding a process when you don’t know its name:

ps aux --sort=start_time | tail -20

Sorting by start time and checking the bottom of the list shows recently started processes. I found it:

root  12001  4.2  0.2  mysqlcheck --optimize --all-databases -u root -p
TTY: ?

The ? in the TTY column means no controlling terminal — the process survived the disconnect. To confirm why:

cat /proc/12001/status | grep PPid
PPid: 1

Parent is PID 1, systemd. When the SSH session died, the shell that launched mysqlcheck died too. The process became an orphan and the kernel re-parented it to PID 1. Re-parenting does not protect a process by itself: it kept running only because the SIGHUP that normally follows a hangup either never reached it or was ignored.

It survived by luck, not by design.
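
The same two checks (parent PID and controlling terminal) can be read straight from /proc. This is a minimal sketch assuming a Linux /proc layout; the helper names ppid_of and tty_nr_of are made up for illustration. A tty_nr of 0 in /proc/PID/stat means no controlling terminal, which is what ps displays as ?.

```bash
# Hypothetical helpers; they read /proc directly, so ps is not required.

# Print the parent PID of the given process.
ppid_of() {
    awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}

# Print the controlling terminal number (0 means none, shown as ? by ps).
# The sed strips "pid (comm) " first because comm may contain spaces.
tty_nr_of() {
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $5 }'
}

# Example: inspect the current shell.
echo "ppid=$(ppid_of $$) tty_nr=$(tty_nr_of $$)"
```

For the process in this ticket, ppid_of 12001 printing 1 together with tty_nr_of 12001 printing 0 would match what ps and /proc/12001/status reported.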

The right way to run long processes:

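# nohup: explicitly ignores SIGHUP
# (-p prompts for a password, which a background job cannot read from the
#  terminal; supply credentials via an option file such as ~/.my.cnf)
nohup mysqlcheck --optimize --all-databases -u root &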

# tmux — persistent terminal session, reconnect anytime
tmux new -s dboptimize
mysqlcheck --optimize --all-databases -u root -p
# detach: Ctrl+b then d
# reconnect: tmux attach -t dboptimize

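# systemd-run: runs as a transient systemd unit with no terminal attached,
# so -p cannot prompt; supply credentials via an option file such as ~/.my.cnf
systemd-run mysqlcheck --optimize --all-databases -u root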

For practical daily use tmux is the best option: you keep full output visibility, can reconnect at any time, and it works for any long-running task.

Why SSH disconnect kills processes: When a terminal closes, the kernel sends SIGHUP (signal 1, “hangup”) to the session's shell, and the shell passes it on to its jobs. The name comes from the era of physical serial terminals where a physical disconnection literally “hung up” the line. Same concept applies today with SSH. The default action for SIGHUP is to terminate, so any process that does not catch or ignore it dies immediately.
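
Whether a given process would survive a hangup can be checked before the connection drops. The sketch below assumes Linux, where /proc/PID/status exposes the ignored-signal set as the hex bitmask SigIgn; ignores_sighup is a name invented here, not a standard tool.

```bash
# Hypothetical helper: succeed if the process currently ignores SIGHUP.
# SigIgn is a hex bitmask where signal N is bit N-1, so SIGHUP (signal 1)
# is the lowest bit, which shows up as an odd last hex digit.
ignores_sighup() {
    mask=$(awk '/^SigIgn:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$mask" ] || return 2          # process gone or unreadable
    case "$mask" in
        *[13579bdfBDF]) return 0 ;;
        *)              return 1 ;;
    esac
}

# Demo: a sleep started under nohup should report SIGHUP as ignored.
nohup sleep 30 >/dev/null 2>&1 &
demo_pid=$!
sleep 1                                 # give nohup time to exec sleep
if ignores_sighup "$demo_pid"; then
    demo_result=ignored
else
    demo_result=not_ignored
fi
echo "PID $demo_pid: SIGHUP is $demo_result"
kill "$demo_pid"
```

A process that only catches SIGHUP with a handler (rather than ignoring it) appears in SigCgt instead, so a "not ignored" result here does not always mean it will die.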


Ticket 7 — MySQL Died With No Warning: The OOM Killer

Alert: MySQL suddenly died. WordPress showing database connection errors. Nobody touched the server.

First step was checking the systemd journal with error filter:

journalctl -u mysql -p err --since "10 minutes ago"
kernel: Out of memory: Killed process 981 (mysqld) score 744 oom_score_adj 0
systemd: mysql.service: Failed with result 'oom-kill'.
systemd: Started MySQL Community Server.

The OOM (Out of Memory) killer fired and killed MySQL. systemd already auto-restarted it (last line) so the site was recovering. But the real question was — why did memory get exhausted?

dmesg --since "20 minutes ago" | grep -i "oom\|kill\|memory"
11:28:14  php-fpm8.3 invoked oom-killer
11:28:14  Killed process 9012 (php-fpm) score 672
11:28:14  Killed process 9013 (php-fpm) score 671
11:28:31  mysqld invoked oom-killer
11:28:31  Killed process 981 (mysqld) score 744

The full chain: a traffic spike caused php-fpm workers to request more RAM. System was already at 3.4GB out of 3.7GB used. OOM killer fired and killed the php-fpm workers first (scores 672, 671). That did not free enough memory. 17 seconds later MySQL itself needed memory — OOM killer fired again and killed MySQL with score 744 (highest = biggest target).

Understanding OOM scores:

  • Adjustment range (oom_score_adj): -1000 to +1000
  • Higher score = more likely to be killed first
  • Lower score = more protected
  • The base score is calculated from memory usage (the more RAM a process uses, the higher its score), and oom_score_adj is then applied on top of it
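
To see these scores on a live system, one option is to read them directly from /proc. This is a sketch only; oom_candidates is a name invented here. It prints the current oom_score, the oom_score_adj, the PID and the command name, highest score first.

```bash
# Hypothetical helper: list the processes the OOM killer would currently
# prefer. Columns (tab separated): oom_score, oom_score_adj, PID, command.
oom_candidates() {
    limit="${1:-5}"
    for dir in /proc/[0-9]*; do
        # A process can exit between the glob and the read, so skip failures.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s\t%s\t%s\t%s\n' "$score" "$adj" "${dir#/proc/}" "$name"
    done | sort -rn | head -n "$limit"
}

# Example: show the top three candidates.
oom_candidates 3
```

Running this before and after a change to oom_score_adj is a quick way to confirm that the adjustment actually moved the process down the list.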

The fix — protect MySQL permanently:

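# Temporary (lost when mysqld restarts). PID 981 was the killed process,
# so look up the PID of the restarted server instead of reusing it.
echo -900 | sudo tee /proc/$(pidof -s mysqld)/oom_score_adj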

# Permanent via systemd (opens an editor for a drop-in override)
sudo systemctl edit mysql
# add these two lines, save, then restart the service:
[Service]
OOMScoreAdjust=-900

Why -900 and not -1000? Because -1000 makes the process completely immune: even if MySQL itself is causing the memory exhaustion, the kernel cannot kill it. That can hang your entire server. -900 means heavily protected, but the kernel can still kill it as an absolute last resort.

Critical Linux gotcha I hit during this ticket:

# This FAILS even with sudo
sudo echo -1000 > /proc/981/oom_score_adj

# This WORKS
echo -1000 | sudo tee /proc/981/oom_score_adj

Why? sudo only elevates the echo command. The > redirection is handled by your current shell before sudo runs. Your shell does not have root — so opening the /proc file fails. With tee, the tee command itself runs as root and does the writing. This pattern appears constantly in Linux administration — any time you need to redirect to a file that requires root, use sudo tee.
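
The ordering can be demonstrated without root. In this small sketch the command name no_such_command_demo is deliberately fake, and VALUE and /path/to/file in the closing comments are placeholders.

```bash
# Demo without sudo: the shell opens the redirect target itself, before it
# even tries to run the command on the left.
workdir=$(mktemp -d)

# no_such_command_demo is deliberately not a real command. It fails with
# "not found", yet the output file exists afterwards because the calling
# shell had already created it.
no_such_command_demo > "$workdir/opened_by_shell" 2>/dev/null || true

if [ -e "$workdir/opened_by_shell" ]; then created=yes; else created=no; fi
echo "file created by the shell itself: $created"
rm -rf "$workdir"

# The same ordering is why "sudo echo VALUE > /path/to/file" fails on a
# root-only file: the unprivileged shell does the open. Two patterns that
# move the open into a root process instead:
#   echo VALUE | sudo tee /path/to/file > /dev/null
#   sudo sh -c 'echo VALUE > /path/to/file'
```

The > /dev/null after tee only silences the copy that tee echoes to the terminal; the write to the file still happens inside the root-owned tee process.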


Ticket 8 — Ctrl+Z Froze My Import: Job Control

Report: Teammate running a CSV import script accidentally pressed Ctrl+Z. Terminal frozen. Import not running.

Running jobs showed:

[1]+  Stopped    php8.3 import_csv.php

Ctrl+Z sends SIGTSTP to the process — it pauses it completely. Not crashed, not killed. All work done so far is intact in memory.

Two ways to resume:

fg %1    # bring back to foreground — terminal occupied, watch progress
bg %1    # resume in background — terminal free for other commands

Both send SIGCONT to the process which resumes it from exactly where it paused.

Complete job control signal map:

Ctrl+Z    → SIGTSTP  → pauses process → becomes stopped job
fg %1     → SIGCONT  → resumes in foreground
bg %1     → SIGCONT  → resumes in background
Ctrl+C    → SIGINT   → interrupts process (terminates it unless the signal is caught)
jobs      → shows all stopped/background jobs in current session
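
The stop and resume cycle can be observed from a script by watching the State field in /proc/PID/status. This sketch makes one substitution that should be stated plainly: it sends SIGSTOP rather than SIGTSTP, because a script has no terminal to press Ctrl+Z in and SIGTSTP can be discarded outside an interactive session. Both put the process into the same stopped state (T). The helper name state_of is invented here.

```bash
# Hypothetical helper: print the one-letter state (R, S, T, Z, ...) of a PID.
state_of() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

sleep 30 >/dev/null 2>&1 &
pid=$!

kill -s STOP "$pid"          # stand-in for Ctrl+Z (which sends SIGTSTP)
sleep 1
stopped_state=$(state_of "$pid")

kill -s CONT "$pid"          # what fg and bg send to resume the job
sleep 1
resumed_state=$(state_of "$pid")

kill -s TERM "$pid"          # clean up the demo process
echo "after STOP: $stopped_state, after CONT: $resumed_state"
```

A stopped process keeps its memory and open files, which is why the import in this ticket lost nothing while it was paused.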

Important gotcha: If your teammate does bg %1 and then closes the terminal — the import dies anyway. Background jobs are still children of that shell. SSH disconnect → shell dies → SIGHUP → import dies. For a long import the correct solution is tmux or nohup before starting, not after.


Ticket 9 — Teaching a Junior: kill -9 and /proc

Situation: Junior teammate asks why everyone uses kill -9 for everything, and what /proc actually is.

On kill signals — the correct workflow:

kill -15 PID        # SIGTERM — polite request, process can clean up
# wait 10-30 seconds
kill -2 PID         # SIGINT — same as Ctrl+C, another graceful option
# still running?
kill -9 PID         # SIGKILL — nuclear weapon, last resort only

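kill -9 works on nearly anything because SIGKILL cannot be caught, blocked, or ignored: the kernel terminates the process without running any of its code. (The exceptions are zombies, which are already dead, and processes stuck in uninterruptible sleep.) But that is exactly the problem. The process gets zero chance to: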

  • Save its current state
  • Close open files cleanly
  • Flush database write buffers
  • Complete in-progress transactions

Using kill -9 on a database process mid-write is a reliable way to corrupt data. Use SIGTERM first, always.
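
The SIGTERM-first workflow is easy to wrap in a function so nobody has to remember the waiting step. This is a minimal sketch, not a hardened tool; graceful_kill and its return codes are assumptions made for this example.

```bash
# Hypothetical helper: ask politely with SIGTERM, wait, then escalate.
# Usage: graceful_kill PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed,
# 2 if the signal could not be sent (no such process or no permission).
graceful_kill() {
    pid=$1
    timeout=${2:-30}
    waited=0

    kill -s TERM "$pid" 2>/dev/null || return 2

    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -s KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (PID is a placeholder):
#   graceful_kill 13421 30 || echo "had to escalate or could not signal"
```

A return code of 1 is worth logging: it means the process never got to clean up, so whatever it was writing should be checked.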

What /proc actually is:

/proc is a virtual filesystem. Nothing in it exists on disk. The kernel generates everything inside it on the fly when you read it. Tools like top, ps, and lsof are all just reading /proc under the hood.

Most useful files inside /proc/PID/:

File            What it contains
cmdline         Exact command and arguments that started the process
status          Process name, PID, PPID, memory usage, current state
limits          Soft and hard resource limits (open files, CPU, memory)
oom_score_adj   OOM killer adjustment: lower to protect, raise to target
cwd             Symlink to the process's current working directory
fd/             Directory of all open file descriptors
environ         Environment variables at launch time

Useful trick for cmdline — it stores arguments separated by null bytes which looks garbled raw:

cat /proc/981/cmdline | tr '\0' ' '

tr '\0' ' ' replaces null bytes with spaces making it human readable.
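
Several of these files can be pulled together into a one-screen summary. The function below is a sketch under the assumption of a standard Linux /proc; proc_report is a hypothetical name, and reading another user's cwd or fd/ needs root.

```bash
# Hypothetical helper: print a short report for one PID using only /proc.
proc_report() {
    p="/proc/$1"
    [ -d "$p" ] || { echo "no such process: $1" >&2; return 1; }

    # cmdline is NUL separated; turn the separators into spaces.
    printf 'cmdline:   %s\n' "$(tr '\0' ' ' < "$p/cmdline")"
    # cwd is a symlink to the current working directory.
    printf 'cwd:       %s\n' "$(readlink "$p/cwd")"
    # Each entry in fd/ is one open file descriptor.
    printf 'open fds:  %s\n' "$(ls "$p/fd" | wc -l)"
    # Fourth column of the "Max open files" row is the soft limit.
    printf 'fd limit:  %s\n' "$(awk '/^Max open files/ { print $4 }' "$p/limits")"
}

# Example: report on the current shell.
proc_report $$
```

Comparing "open fds" against "fd limit" is a fast way to spot a process that is about to hit "Too many open files".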

Beyond PID folders — system-wide /proc:

/proc/loadavg     # raw load average — exactly what top reads
/proc/meminfo     # complete memory breakdown
/proc/cpuinfo     # CPU details
/proc/sys/kernel/ # live kernel tuning parameters

Ticket 10 — Cross-Topic: Mystery Python Script Consuming Everything

Alert: New developer pushed a Python thumbnail generation script. Nobody knows if it’s running, stuck, or finished. Memory climbing. Site slowing.

top showed python3 (PID 13421) consuming 61% CPU, 681MB RAM, 18 minutes runtime. Swap already at 923MB used out of 2GB.

One thing I now always check: the swap line at the top of top, not just individual process memory. 923MB swap used across the whole system means memory has been under pressure for a while — the kernel has been pushing pages to disk to make room. Combined with only 312MB free RAM, one more memory spike and the OOM killer fires.
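
The same numbers that top and free display come from /proc/meminfo, so the check can be scripted. A minimal sketch follows; mem_pressure is a made-up name, and the values in /proc/meminfo are in kB, converted to MB here.

```bash
# Hypothetical helper: print available RAM and swap in use, both in MB.
mem_pressure() {
    awk '
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { total = $2 }
        /^SwapFree:/     { unused = $2 }
        END {
            printf "available_mb=%d swap_used_mb=%d\n",
                   avail / 1024, (total - unused) / 1024
        }
    ' /proc/meminfo
}

mem_pressure
```

MemAvailable is usually a better figure than "free" for this purpose because it includes cache the kernel can reclaim without swapping.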

Before touching the process I checked exactly what it was doing:

cat /proc/13421/cmdline | tr '\0' ' '
python3 /var/www/limonlab.online/scripts/generate_thumbnails.py
--input /var/www/limonlab.online/wp-content/uploads/
--output /var/www/limonlab.online/wp-content/thumbnails/

This told me two important things:

  1. It is writing to a separate /thumbnails/ directory — original uploads are safe
  2. It can be safely killed without data loss to originals — some incomplete thumbnails may exist but those are easily cleaned up

Then I checked if it was actually making progress or stuck:

ls -lth /var/www/limonlab.online/wp-content/thumbnails/ | head -10

Timestamps showed files modified at 15:41 — current time was 15:42. Script was actively working, not stuck.
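
The "is it still making progress" check can be turned into a yes/no test. This sketch relies on find's -mmin option, which is a GNU/BSD extension rather than strict POSIX; recently_modified is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if any file under DIR changed in the
# last N minutes.
# Usage: recently_modified DIR MINUTES
recently_modified() {
    find "$1" -type f -mmin "-$2" 2>/dev/null | head -n 1 | grep -q .
}

# Example (the directory is the one from this ticket):
#   if recently_modified /var/www/limonlab.online/wp-content/thumbnails/ 2; then
#       echo "still writing thumbnails"
#   else
#       echo "no new output for 2 minutes, possibly stuck"
#   fi
```

Running it twice a few minutes apart gives more confidence than a single snapshot, since a script can pause briefly between batches.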

The decision: This is where being a junior sysadmin means knowing when NOT to act unilaterally. The script is legitimate, doing real work, and the developer who pushed it needs to know about the resource impact. Before killing it I would escalate:

“There is a Python thumbnail script consuming 61% CPU and 681MB RAM for 18 minutes. Site is slowing down and swap is at 923MB. Script is actively processing. Should I kill it, throttle it, or let it finish?”

While waiting for the response — safe non-destructive action:

sudo renice -n 19 -p 13421

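This immediately reduces CPU contention with nginx and php-fpm without stopping the script or losing any work. The script keeps running at the lowest priority. It does not reduce the script's memory use, so swap pressure stays until the script finishes or is stopped.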
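
To confirm that a renice actually took effect, the nice value can be read back from /proc/PID/stat, where it is field 19. The sketch below demonstrates this on a throwaway sleep process so it needs no sudo (raising the nice value of your own process is always allowed); nice_of is a name invented for the example.

```bash
# Hypothetical helper: print the nice value of a PID.
# After stripping "pid (comm) ", field 19 of /proc/PID/stat becomes field 17.
nice_of() {
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $17 }'
}

sleep 30 >/dev/null 2>&1 &
pid=$!

before=$(nice_of "$pid")
renice -n 19 -p "$pid" >/dev/null
after=$(nice_of "$pid")

kill "$pid"                  # clean up the demo process
echo "nice before=$before after=$after"
```

The NI column in top shows the same value, so either view works for a quick sanity check.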

If senior says kill it:

kill -15 13421      # SIGTERM first
# wait 30 seconds
kill -9 13421       # only if still running

# clean up zero-byte files (partially written thumbnails need a separate check)
find /var/www/limonlab.online/wp-content/thumbnails/ -type f -size 0 -delete

The investigation framework I use for every ticket:

1. WHAT is misbehaving?     → top, ps aux
2. WHO started it?          → ps -o ppid, pstree
3. WHAT is it doing?        → strace, cat /proc/PID/cmdline
4. HOW LONG has it been?    → ps -o etime= -p PID (TIME in ps aux is CPU time, not elapsed)
5. WHAT is it touching?     → lsof -p PID
6. HOW BAD is it?           → free -h, /proc/PID/limits, /proc/PID/status
7. CAN I safely stop it?    → understand what it's writing to first

Summary — Part 2

Ticket  Problem                                 Key Tool                  Fix
6       Long process at risk on SSH disconnect  ps aux, /proc/PID/status  tmux or nohup before running
7       OOM killer took down MySQL              journalctl, dmesg         OOMScoreAdjust=-900 in systemd
8       Ctrl+Z froze import script              jobs                      fg or bg to resume
9       Junior misusing kill -9                 /proc, kill signals       teach SIGTERM-first workflow
10      Python script consuming all resources   top, cmdline, renice      escalate + renice as safe action

Final Thoughts on Domain 4

Process and resource management is one of those domains where the commands are easy to memorize but the real skill is the investigation mindset. Knowing top and kill is not enough. You need to understand why a process is behaving the way it is before you touch it.

The biggest thing I took from these simulations: always check /proc before acting. It holds more information about a running process than almost any other tool. And always think about what a process is writing to before killing it — a process writing to a database or an active file needs a graceful shutdown, not a kill -9.
