Systemd in Production: Service Management Beyond the Basics

I spent years managing services with Docker Compose, screen sessions, and the occasional nohup'd process. They all worked — until they didn't. A server reboot at 3 AM, a process that silently died, logs scattered across random files. Eventually, every deployment I ran ended up under systemd, not because it's trendy, but because it solves problems I kept hitting.

This post covers how I use systemd in production: writing service files that survive reboots, managing services day-to-day, working with journald for logging, and hardening units for security. It's not a reference manual — it's the patterns I've settled on after running systemd-managed services across several machines.

Why systemd for Production?

Before systemd, managing background services on Linux meant writing init scripts, managing PID files, and praying nothing crashed at 2 AM. systemd changed that by providing:

Automatic restart — services that crash come back without manual intervention
Dependency ordering — your app starts after PostgreSQL is actually ready, not just after the process spawns
Centralized logging — no more stdout >> /var/log/myapp.log 2>&1, everything goes to journald
Resource tracking — cgroup integration shows exactly what each service is using
Socket activation — services start on-demand when a connection arrives (useful for low-traffic daemons)

If you're still using nohup or screen for production services, systemd is the upgrade you're looking for.

Writing Production-Grade Service Files

A good service file is the foundation of reliable service management. Here's the template I use for every new service:

The Base Template

[Unit]
Description=My Production Service
After=network-online.target
Wants=network-online.target
 
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yaml
Restart=always
RestartSec=5
User=myapp
Group=myapp
WorkingDirectory=/var/lib/myapp
 
# Environment
Environment=NODE_ENV=production
EnvironmentFile=/etc/myapp/myapp.env
 
# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectHome=yes
ProtectSystem=full
ReadWritePaths=/var/lib/myapp /var/log/myapp
 
# Resource limits
LimitNOFILE=65536
LimitNPROC=4096
 
[Install]
WantedBy=multi-user.target

Let me break down why each section matters.

The [Unit] Section

Description=My Production Service
After=network-online.target
Wants=network-online.target

After vs Wants vs Requires — This is the most common point of confusion and getting it wrong causes subtle boot failures.

After only affects ordering (when things start)
Wants is a soft dependency (if the target fails, your service still starts)
Requires is a hard dependency (if the target fails, your service fails too)

I use Wants in the template rather than Requires because most services can handle a temporary network absence — a web API might fail its first request but recover on the retry. Requires is appropriate for services that genuinely cannot function without the dependency: a database that must reach a remote replica, or a worker that must connect to a message broker at startup. For everything else, Wants gives you the ordering benefit without the hard failure coupling.
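For the hard-dependency case, here is a minimal sketch of what the drop-in could look like. Requires= on its own does not impose any ordering, so it is paired with After=. The helper name render_hard_dependency and the unit name postgresql.service are placeholders; the real unit name depends on your distribution.

```shell
# Hypothetical helper: print a [Unit] drop-in that makes a service
# hard-depend on another unit and start only after it.
render_hard_dependency() {
    dep=$1
    printf '[Unit]\n'
    printf 'Requires=%s\n' "$dep"
    printf 'After=%s\n' "$dep"
}

# Example: preview the drop-in for a PostgreSQL-backed service.
render_hard_dependency postgresql.service
```

The output can be pasted into the editor opened by systemctl edit myapp.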

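After=network-online.target: This is important. network.target is reached as soon as network management starts, not when the network is actually configured. If your service needs to make outbound connections at startup, use network-online.target. Keep in mind that this target only waits for connectivity when a wait service such as systemd-networkd-wait-online.service or NetworkManager-wait-online.service is enabled, so check that one is active on your distribution. The difference can save you from debugging startup race conditions.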

The [Service] Section

Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yaml
Restart=always
RestartSec=5
User=myapp
Group=myapp
WorkingDirectory=/var/lib/myapp

Type=simple — The default, and correct for most modern applications. Your process runs in the foreground, systemd tracks it directly. No forking, no PID files, no complexity.

Restart=always — The killer feature. If your process exits for any reason, systemd brings it back. This is the systemd equivalent of Docker's restart: unless-stopped.

A quick warning: Restart=always combined with a crashing service can create restart storms. systemd rate-limits starts through StartLimitIntervalSec (default 10 seconds) and StartLimitBurst (default 5), both of which belong in the [Unit] section. If a unit is started more than 5 times within 10 seconds, systemd stops trying and marks it as failed. Note the interaction with RestartSec: with a 5-second delay between attempts, a service cannot reach that many starts in 10 seconds, so the default limit never trips and the unit keeps restarting indefinitely. If you want systemd to give up eventually, widen the window (for example StartLimitIntervalSec=300) so that the burst fits inside it.

RestartSec=5: Wait 5 seconds before restarting. The default is 100 ms, so without this a crashing service restarts almost immediately, eating CPU and flooding logs. The delay gives you time to notice and intervene.
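The restart delay and the start rate limit interact, and the arithmetic is worth checking. The sketch below uses a deliberately rough model: it assumes the process crashes instantly, so consecutive starts are spaced exactly RestartSec apart. The function name is made up for illustration.

```shell
# Rough model: systemd refuses the (burst + 1)th start inside the interval.
# With starts spaced restart_sec apart, that start happens burst * restart_sec
# seconds after the first one. Real services run for a while before crashing,
# so they need even longer.
start_limit_can_trip() {
    restart_sec=$1
    burst=$2
    interval=$3
    [ $((restart_sec * burst)) -lt "$interval" ]
}

# Template values against the defaults: RestartSec=5, burst 5, interval 10s.
if start_limit_can_trip 5 5 10; then
    echo "start limit can trip"
else
    echo "start limit never trips; the unit restarts forever"
fi
```

With the template's values the second branch is taken, since 25 seconds of restart delay do not fit into a 10-second window. Widening the interval to 300 seconds makes the limit reachable again.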

User=myapp / Group=myapp — Never run services as root. Each service gets its own system user. If a service is compromised, the blast radius is limited to that user's permissions.

WorkingDirectory=/var/lib/myapp — Sets the working directory. Useful for services that expect to find relative paths or need a specific data directory.

Resource Limits

LimitNOFILE=65536
LimitNPROC=4096

Many applications (databases, web servers, message queues) need more file descriptors than the default soft limit of 1024. Set LimitNOFILE explicitly rather than relying on the application to raise it with setrlimit(). systemd applies these limits to the service's main process before it starts, and every child process inherits them.

Environment Variables

Environment=NODE_ENV=production
EnvironmentFile=/etc/myapp/myapp.env

Two approaches for configuring your service:

Environment= — For individual variables that are universal and rarely change. Hardcoding NODE_ENV=production in the unit file is fine because it's the same everywhere.

EnvironmentFile= — For secrets, per-deployment settings, or anything that differs between environments. The file path is in the unit file, but the values live separately. This is how I manage API keys, database URLs, and staging vs production differences.

The file format is simple key-value pairs:

# /etc/myapp/myapp.env
DATABASE_URL=postgres://user:pass@localhost:5432/myapp
REDIS_URL=redis://localhost:6379
LOG_LEVEL=info

One detail worth knowing: EnvironmentFile does not support variable expansion or shell features. No $(command), no ${VAR:-default}. It's a straight key-value parser. If you need that, wrap your service in a script that sources the file before exec'ing the application.
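A minimal sketch of such a wrapper, assuming a hypothetical helper called run_with_env. Because the file is now read by the shell rather than by systemd, shell quoting rules apply: values containing spaces or characters like & must be quoted.

```shell
# Hypothetical wrapper: load an env file with full shell semantics,
# then replace the shell with the real program.
run_with_env() (
    env_file=$1
    shift
    set -a              # export every variable assigned from here on
    . "$env_file"       # parsed as shell, so ${VAR:-default} works
    set +a
    exec "$@"           # the program takes over this process
)

# A wrapper script referenced by ExecStart= would end with a call like:
#   run_with_env /etc/myapp/myapp.env /usr/local/bin/myapp --config /etc/myapp/config.yaml
```

Since exec replaces the shell instead of spawning a child, systemd still tracks the application itself as the main process, and Type=simple keeps working.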

The [Install] Section

WantedBy=multi-user.target

This defines when the service starts at boot. multi-user.target is the standard multi-user, non-GUI system state. Most server services should use this.

Service Types: When to Use What

The Type directive is worth understanding because getting it wrong causes subtle issues.

Type	Behavior	When to Use
`simple`	systemd considers the service started as soon as `ExecStart` runs	Most modern apps (Node.js, Go, Python)
`forking`	The process forks, parent exits, child continues	Legacy daemons (older databases, traditional Unix services)
`oneshot`	Runs once, systemd waits for it to complete	One-time setup tasks, boot scripts
`notify`	Process sends `READY=1` via sd_notify()	Apps that signal readiness explicitly (e.g., after loading config)
`dbus`	Service registers on D-Bus bus	D-Bus activated services

In production, simple is right 90% of the time. If your application runs in the foreground (most modern apps do), use Type=simple. Only reach for forking if you're dealing with a legacy daemon that insists on forking.

The notify type is useful for services with slow startup: the app calls sd_notify(0, "READY=1") once initialization is done, and systemd holds back units ordered after it until that message arrives.
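For a service whose entry point is a shell script, the systemd-notify command can send the same readiness message. This is a sketch under two assumptions: the unit sets Type=notify, and it also sets NotifyAccess=all, because systemd-notify runs as a child process rather than as the main PID. The function name is illustrative.

```shell
# Signal readiness to systemd from a shell-based service.
notify_ready() {
    # NOTIFY_SOCKET is only set when systemd expects notifications,
    # so the script still works when started by hand.
    if [ -n "${NOTIFY_SOCKET:-}" ]; then
        systemd-notify --ready --status="Initialization complete"
    fi
}

# ... slow startup work (load config, warm caches) goes here ...
notify_ready
```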

Managing Services Day-to-Day

Here are the commands I actually use in production, not the full reference.

Standard Operations

# Check if a service is running (good for monitoring scripts)
systemctl is-active myapp
 
# Check if a service is enabled at boot
systemctl is-enabled myapp
 
# Detailed status with logs and process info
systemctl status myapp
 
# Restart and check status in one flow
systemctl restart myapp && systemctl status myapp --no-pager
 
# Reload config without restarting (requires ExecReload= in the unit, e.g. one that sends SIGHUP)
systemctl reload myapp
 
# See all failed units at a glance
systemctl --failed

systemctl status is usually the first command I run during debugging. It shows the process state (running, exited, failed), the last few log lines from journald, and the exit code if the service crashed. The restart counter shows up in those log lines ("restart counter is at N"), and systemctl show myapp -p NRestarts prints it directly. A restart count climbing steadily is a tell-tale sign of a service that's crashing and being respawned, which is worth investigating even if the service appears to be running.

I script systemctl is-active in monitoring checks. It returns exit code 0 if the service is active, non-zero otherwise. No parsing of status output needed.
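A minimal sketch of such a check, using a hypothetical check_units helper and example unit names. It relies only on the exit code of systemctl is-active; the --quiet flag suppresses the state text.

```shell
# Hypothetical monitoring check: return non-zero if any listed unit is not active.
check_units() {
    failed=0
    for unit in "$@"; do
        if ! systemctl is-active --quiet "$unit"; then
            echo "CRITICAL: $unit is not active" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Typical call from a cron job or monitoring agent (unit names are examples):
#   check_units myapp nginx || exit 2
```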

After Editing a Unit File

# Always do this after modifying a .service file
systemctl daemon-reload
 
# Then restart the service
systemctl restart myapp

Forgetting daemon-reload is the most common mistake. systemd caches unit files — editing them does nothing until you reload. The reload takes milliseconds and has no effect on running services.
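systemd can tell you when a reload is pending. The sketch below reads the NeedDaemonReload property and reloads only when it reports yes. It assumes a systemd version whose systemctl show supports --value; the helper name is made up.

```shell
# Run daemon-reload only when systemd reports that the unit file on disk
# differs from the version it has loaded.
reload_if_needed() {
    unit=$1
    if [ "$(systemctl show "$unit" -p NeedDaemonReload --value)" = "yes" ]; then
        systemctl daemon-reload
        echo "reloaded"
    else
        echo "up to date"
    fi
}

# Example use in a deploy script:
#   reload_if_needed myapp && systemctl restart myapp
```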

Overrides Without Modifying the Original

# Edit overrides (creates /etc/systemd/system/myapp.service.d/override.conf)
systemctl edit myapp
 
# See the unit file followed by every drop-in override that applies to it
systemctl cat myapp
 
# Show all properties of a running service
systemctl show myapp

systemctl edit is one of my favorite features. I can add environment-specific overrides (different memory limits in staging vs production) without touching the original unit file shipped by the package manager. The override lives in /etc/systemd/system/ which takes precedence over /usr/lib/systemd/system/.
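To make the memory-limit case concrete, here is a sketch of a drop-in written by a script instead of the interactive editor. The helper name is hypothetical, and MemoryMax= assumes the host uses the unified cgroup hierarchy (cgroup v2). The third argument exists so the helper can be tried against a scratch directory first.

```shell
# Hypothetical helper: write a memory-limit drop-in for a unit and
# print the path of the file it created.
write_memory_override() {
    unit=$1
    limit=$2
    root=${3:-/etc/systemd/system}
    dir="$root/$unit.service.d"
    mkdir -p "$dir"
    printf '[Service]\nMemoryMax=%s\n' "$limit" > "$dir/override.conf"
    echo "$dir/override.conf"
}

# Dry run into a temporary directory instead of /etc:
scratch=$(mktemp -d)
cat "$(write_memory_override myapp 512M "$scratch")"
```

When writing to the real location as root, follow up with systemctl daemon-reload and a restart of the service.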

Logging with Journald

Before systemd, every service had its own logging setup — some wrote to files, some to syslog, some to stdout that nobody captured. Journald centralizes all of it.

Daily Journal Usage

# Follow logs for a service (like tail -f)
journalctl -u myapp -f
 
# Last 50 lines with errors
journalctl -u myapp -n 50 -p err
 
# Logs since yesterday
journalctl -u myapp --since yesterday
 
# Logs for a specific time window
journalctl -u myapp --since "09:00" --until "09:30"
 
# See disk usage
journalctl --disk-usage
 
# Follow all system errors in real time
journalctl -p err -f

The -u flag filters by unit name. The -p flag filters by priority (emerg, alert, crit, err, warning, notice, info, debug); naming a single level shows that level and everything more severe. Combined, they make finding production issues fast.

Structured Logging

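Journald stores structured metadata alongside each message, not just text. It attaches trusted fields such as _PID, _UID, and _SYSTEMD_UNIT to every entry, and you can match on them directly:

# Filter by unit and specific fields
journalctl -u myapp _PID=1234
journalctl -u myapp _UID=1000

Note that journald does not parse JSON written to stdout: each line is stored as plain text in the MESSAGE field. To attach custom fields, the application has to write through the native journal protocol, for example sd_journal_send() or a language binding for it. Fields logged that way can be matched with FIELD=value just like the built-in ones, which makes querying specific events much easier than grep'ing through log files. Running journalctl with -o json prints each entry with all of its fields.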
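Custom fields have to reach journald through its native protocol. From a shell, one way to try this out is logger --journald, which reads FIELD=value lines from standard input (this assumes a util-linux build with systemd support). The helper below only formats the entry; its name and the field names are examples.

```shell
# Hypothetical helper: format a journal entry with custom fields.
# Field names should consist of uppercase letters, digits, and underscores.
journal_entry() {
    printf 'MESSAGE=%s\n' "$1"
    shift
    for field in "$@"; do
        printf '%s\n' "$field"
    done
}

# Preview the entry that would be logged:
journal_entry "order shipped" ORDER_ID=1234 PRIORITY=6

# To actually log and query it:
#   journal_entry "order shipped" ORDER_ID=1234 PRIORITY=6 | logger --journald
#   journalctl ORDER_ID=1234
```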

Journal Configuration

My production journald config (/etc/systemd/journald.conf):

[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=1G
SystemMaxFileSize=100M
MaxFileSec=1week
ForwardToSyslog=no

Key settings:

Storage=persistent — ensures logs survive reboots (writes to /var/log/journal/)
SystemMaxUse=1G — caps journal disk usage at 1GB
MaxFileSec=1week — rotates files weekly

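With the default Storage=auto, journald writes to /var/log/journal/ only if that directory already exists; otherwise logs go to /run/log/journal/, which is volatile and lost on reboot. Some distributions create the directory and some do not, so for production, always set Storage=persistent explicitly.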
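Under the default Storage=auto, whether logs persist comes down to whether /var/log/journal/ exists. A quick check, sketched with a hypothetical helper whose optional argument exists only so it can be pointed at a scratch path:

```shell
# Report where journald keeps logs when Storage=auto is in effect.
journal_storage_mode() {
    dir=${1:-/var/log/journal}
    if [ -d "$dir" ]; then
        echo "persistent"
    else
        echo "volatile"
    fi
}

journal_storage_mode
```

If this prints volatile, either set Storage=persistent in journald.conf or create the directory, then restart systemd-journald.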

Vacuuming (When You Need Space)

# Remove logs older than 2 weeks
journalctl --vacuum-time=2weeks
 
# Remove logs until total size is under 500MB
journalctl --vacuum-size=500M
 
# Remove logs older than 30 days
journalctl --vacuum-time=30d

I run these in cron for machines with tight disk, but with SystemMaxUse=1G in the config, manual vacuuming is rarely needed.

Security Hardening

systemd has built-in security features that act as a lightweight sandbox. They're not a replacement for SELinux or AppArmor, but they raise the bar significantly.

The Standard Hardening Set

I apply these to every production service:

[Service]
# Dynamically allocate a system user — no manual user creation needed
DynamicUser=yes
 
# Prevent privilege escalation
NoNewPrivileges=yes
 
# Isolate /tmp — the service sees its own private /tmp
PrivateTmp=yes
 
# Block access to /home, /root, /run/user
ProtectHome=yes
 
# Make /usr and /etc read-only
ProtectSystem=full
 
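# Explicitly allow only specific write paths. With DynamicUser=yes, prefer
# StateDirectory=myapp and LogsDirectory=myapp, which create the directories
# and hand ownership to the service's user on every start
ReadWritePaths=/var/lib/myapp /var/log/myapp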

What this does in practice:

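DynamicUser=yes creates a transient system user for the service, so there is no need to run useradd before deploying. The user exists only while the service is running and is removed on stop. It suits stateless services that don't need persistent ownership of files. It also implies PrivateTmp=yes, NoNewPrivileges=yes, ProtectSystem=strict, and ProtectHome=read-only, so several of the explicit lines are redundant but harmless.
If an attacker compromises the service process, they can't gain privileges through setuid binaries or file capabilities (NoNewPrivileges)
They can't access other users' home directories (ProtectHome=yes)
They can't modify system binaries or configuration (ProtectSystem=full)
They can only write to explicitly allowed directories (ReadWritePaths). This holds because DynamicUser=yes forces ProtectSystem=strict; with a static User=, set ProtectSystem=strict yourself, since full leaves /var writable.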

If the service needs persistent file ownership (databases, stateful applications), stick with a static User= / Group=. For everything else, DynamicUser=yes is cleaner — one less user to manage, one less attack surface.

Advanced Hardening Options

[Service]
# Comments must sit on their own line; systemd treats trailing text as part of the value

# Network isolation
# No network access at all (only a private loopback interface remains)
PrivateNetwork=yes
# Only specific socket families
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

# Filesystem restrictions
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectControlGroups=yes

# Capability dropping: keep only what's needed
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE

# System call filtering
SystemCallFilter=@system-service
SystemCallArchitectures=native

When to use these:

PrivateNetwork=yes: for batch jobs or workers that need no network access at all, inbound or outbound
CapabilityBoundingSet=CAP_NET_BIND_SERVICE: for web servers that need to bind to ports < 1024
SystemCallFilter=@system-service: restricts the service to a safe set of system calls

These options are declarative — they don't require additional tools or policies. You can layer them incrementally. Start with the standard hardening set, then add more as you understand the service's needs.

Verifying Hardening

# Check what security settings are active
systemd-analyze security myapp
 
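# This produces an exposure score from 0.0 (fully locked down) to 10.0 (fully exposed)
# and lists which protections are enabled/disabled

systemd-analyze security scores your service's exposure level, so lower is better. A unit with no hardening typically lands above 9 and is labeled UNSAFE. Pushing the score close to 0 takes extensive hardening that may break functionality, so a mid-range score is a reasonable target for most services.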
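The score is an exposure level, so lower means more locked down, and that makes it easy to gate on in a deploy check. The sketch below assumes the summary line has the form "Overall exposure level for <unit>: <score> <LABEL>"; both helper names and the 6.0 threshold are arbitrary examples.

```shell
# Extract the numeric score from `systemd-analyze security <unit>` output (stdin).
exposure_score() {
    sed -n 's/.*Overall exposure level for [^:]*: \([0-9][0-9.]*\).*/\1/p'
}

# Succeed only if the score is at or below the given maximum.
score_within() {
    awk -v score="$1" -v max="$2" 'BEGIN { exit !(score + 0 <= max + 0) }'
}

# Intended use:
#   score=$(systemd-analyze security myapp | exposure_score)
#   score_within "$score" 6.0 || echo "myapp exposure too high: $score"
```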

Boot Optimization

Slow boot times matter when you're iterating on infrastructure or dealing with frequent reboots. systemd provides tools to diagnose and fix them.

# Total boot time
systemd-analyze
 
# Which services take the longest
systemd-analyze blame
 
# The critical chain (what's slowing boot)
systemd-analyze critical-chain
 
# Generate a visual SVG for detailed analysis
systemd-analyze plot > boot.svg

systemd-analyze blame is my first stop for boot optimization. It shows each service and how long it took to start, sorted slowest first. I've found cases where a service with After=network-online.target was waiting for DHCP timeout, adding 30 seconds to boot for no reason.

Common boot slowdowns:

Services with After=network-online.target when they don't actually need network
Heavy initialization in ExecStartPre scripts
Timeouts from services waiting for unavailable resources

Timers: Cron on Steroids

systemd timers are cron replacements with better reliability guarantees. If the system was off when a timer was supposed to fire, cron misses it. systemd can catch up.

# /etc/systemd/system/db-backup.timer
[Unit]
Description=Daily database backup
 
[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=1h
 
[Install]
WantedBy=timers.target

# /etc/systemd/system/db-backup.service
[Unit]
Description=Database backup job
 
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup-db
User=backup

Persistent=true — the killer feature. If the system was down during the scheduled time, the timer fires immediately after boot. Cron loses that event entirely.

RandomizedDelaySec=1h — prevents the thundering herd problem when multiple timers fire at the same calendar time.

Enable and start the timer, not the service:

systemctl enable --now db-backup.timer
systemctl list-timers

What I've Learned Running systemd in Production

After is ordering, Requires is dependency — confusing these causes subtle startup failures that only appear after a reboot.
RestartSec prevents restart loops — a 5-second delay is usually enough. Without it, a crashing service floods the journal and burns CPU.
daemon-reload is easy to forget — edit a unit file, nothing happens, you restart the service, and it runs the old config. Run daemon-reload after every unit change.
Persistent logging is not guaranteed by default: with Storage=auto, logs survive a reboot only if /var/log/journal/ exists. Set Storage=persistent in journald.conf explicitly. I've learned this the hard way.
systemctl edit is better than modifying unit files directly — overrides survive package updates and keep the original install clean.
Service-level hardening is cheap and effective: NoNewPrivileges, PrivateTmp, ProtectSystem, and ProtectHome take 10 seconds to add and prevent entire classes of exploits.

Key Takeaways

Start with the template — Type=simple, Restart=always, User=myapp, After=network-online.target covers 90% of production services.
Use journald with persistent storage — one journalctl -u myapp -f command replaces hunting through log files.
Layer security hardening incrementally: systemd-analyze security reports an exposure score, where lower is better. Start with NoNewPrivileges and PrivateTmp, then add more as needed.
Timers over cron — Persistent=true catches missed events after downtime. RandomizedDelaySec prevents load spikes.
systemctl edit for overrides — keeps the original unit file untouched and makes configuration management cleaner.
Boot optimization is iterative — systemd-analyze blame identifies the slowest services. Often one misconfigured dependency is responsible for most of the delay.