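In operations work, systems inevitably run into sudden problems: a service fails to start, CPU is pegged, a disk alarm fires, network latency spikes. When every second counts during troubleshooting, a set of emergency scripts prepared in advance often helps you locate the problem quickly, or even resolve the incident outright, which makes them a genuine lifeline.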
I. Script List and Use Cases
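| Script | Core command | Typical scenario | Risk notes |
|---|---|---|---|
| 1. check_port | `lsof -i:$PORT` | Service fails to start; check for a port conflict | None |
| 2. clean_zombie | `ps -ef \| grep defunct`, `kill -9` | System is sluggish; zombie processes are piling up | |
| 3. free_cache | `echo 3 > /proc/sys/vm/drop_caches` | Memory is low; the application may hit OOM | Temporary measure; lowers the cache hit rate and may cause brief I/O fluctuation |
| 4. log_clean | `cat /dev/null > $LOGFILE` | Disk is full; a single large log file must be cleared urgently | Confirm or back up first! Blind cleanup can lose log data |
| 5. conn_count | `netstat -an \| grep ESTABLISHED` | Connection count spikes; suspected application fault or attack | |
| 6. cpu_hog | `top -Hp $PID` | A single process is at 100% CPU; locate the offending thread | High-risk operation; after locating the thread, analyze further with tools such as jstack and proceed with caution |
| 7. disk_io | `iostat -x 1 3` | Application stalls; suspected disk I/O bottleneck | Interpreting the output takes some experience |
| 8. net_latency | `mtr -r -c 10 $host` | Application disconnects; network latency or jitter | Some networks may block mtr packets |
| 9. find_bigfile | `du -ah $DIR \| sort -rh \| head -n 10` | | |
| 10. sys_snapshot | `uname`, `top`, `free`, `df`, `ss` | Complex incident; quickly collect system state for analysis | Output is verbose; best redirected to a file |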
II. Script Details and Examples
1. Check port usage: check_port
#!/bin/bash
# Author: SysOps
# Description: Check whether the given port is in use and list details of the process holding it.
if [ $# -ne 1 ]; then
echo "Usage: ${0##*/} <port_number>"
exit 1
fi
PORT="$1"
# Validate that the input is numeric
if ! [[ $PORT =~ ^[0-9]+$ ]]; then
echo "Error: Port must be a number."
exit 2
fi
echo "[INFO] Checking status of port: $PORT"
# Use ss (more efficient than netstat) to check the port
if ss -tuln | grep -q ":${PORT} "; then
echo "[WARN] Port $PORT is in use. Details:"
echo "----------------------------------------"
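# Use lsof for more detailed process information; fall back to ss if lsof is unavailable.
# Keep output numeric (-nP for lsof, -n for ss); otherwise port 80 may print as "http" and the grep misses it.
if command -v lsof &> /dev/null; then
lsof -nP -i :"${PORT}" -sTCP:LISTEN || ss -tulnp | grep ":${PORT} "
else
ss -tulnp | grep ":${PORT} "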
fi
echo "----------------------------------------"
else
echo "[INFO] Port $PORT is free."
fi
Pass the port to probe as an argument to the script.
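The script above only checks that the argument is numeric. As an optional hardening step, the sketch below also rejects values outside the valid port range of 1 to 65535. The function name `is_valid_port` is hypothetical and not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper (not in the original script): succeed only for ports 1-65535.
is_valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  # At most 5 digits, so the numeric comparison below stays well within range.
  [ "${#1}" -le 5 ] || return 1
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}
```

It could replace the regex test in check_port, for example `is_valid_port "$PORT" || { echo "Error: invalid port." >&2; exit 2; }`.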
2. Clean up zombie processes: clean_zombie
#!/bin/bash
# Author: SysOps
# Description: Find zombie processes and try to clean them up. Asks the user for confirmation.
echo "[INFO] Scanning for zombie processes..."
# Match on the STAT column only: "grep -w Z" misses states such as "Zs" and can match a command named Z
zombies=$(ps -eo pid,stat,comm | awk '$2 ~ /^Z/')
if [ -z "$zombies" ]; then
echo "[INFO] Great! No zombie processes found."
exit 0
fi
echo "[WARN] Found zombie processes:"
echo "PID STAT COMMAND"
echo "$zombies"
echo "----------------------------------------"
# Look up each zombie's parent PID; the parent is usually what needs attention
zombie_pids=$(echo "$zombies" | awk '{print $1}')
for pid in $zombie_pids; do
ppid=$(ps -o ppid= -p $pid 2>/dev/null | tr -d ' ')
echo "[INFO] Zombie PID: $pid, Parent PID: $ppid (Command: $(ps -o comm= -p $ppid 2>/dev/null))"
done
read -p "-> Attempt to send SIGCHLD to the parent processes? (y/N): " confirm
if [[ $confirm == [yY] || $confirm == [yY][eE][sS] ]]; then
for pid in $zombie_pids; do
ppid=$(ps -o ppid= -p $pid 2>/dev/null | tr -d ' ')
if [ -n "$ppid" ]; then
echo "[INFO] Sending SIGCHLD to parent process $ppid..."
kill -s SIGCHLD $ppid
fi
done
sleep 2
echo "[INFO] Re-scanning for zombies..."
ps -eo pid,stat,comm | awk '$2 ~ /^Z/' | grep . || echo "[INFO] Zombies cleared."
else
echo "[INFO] Operation cancelled."
fi
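A zombie has already exited, so sending `kill -9` to the zombie itself has no effect. It disappears only when its parent reaps it with `wait()`, or when the parent exits and the zombie is re-parented to init. For that reason it can help to see which parents are accumulating zombies before deciding whether to restart one. The following is a minimal sketch under that assumption; the function name `zombies_by_parent` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: reads "PID PPID STAT" lines on stdin and prints
# "PPID COUNT" for every parent that owns at least one zombie,
# with the highest count first.
zombies_by_parent() {
  awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' |
    sort -k2,2nr -k1,1n
}
```

On a live system it could be fed with `ps -eo pid=,ppid=,stat= | zombies_by_parent`.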
3. Free the memory cache: free_cache
#!/bin/bash
# Author: SysOps
# Description: Drop system caches (PageCache, dentries, inodes). Requires root privileges.
if [ "$(id -u)" -ne 0 ]; then
echo "Error: This script must be run as root (use sudo)." >&2
exit 1
fi
echo "[INFO] Current memory status:"
free -h
echo
echo "[INFO] Flushing caches..."
# Run sync to write dirty pages to disk before dropping caches
sync
# Writing 3 to drop_caches clears PageCache, dentries and inodes.
echo 3 > /proc/sys/vm/drop_caches
echo "[INFO] Caches flushed. New memory status:"
free -h
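Page cache is reclaimable, so a low "free" figure alone does not mean the system is short of memory. Before dropping caches, it may be worth checking the kernel's own estimate, the `MemAvailable` field in `/proc/meminfo`. The sketch below assumes a kernel that exposes that field; the function name `mem_available_mb` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print MemAvailable in MiB.
# $1 is an optional meminfo-format file; defaults to /proc/meminfo.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

If the reported value is still comfortable, dropping caches is unlikely to help and will only cost cache hits.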
4. (Safe) log cleanup: log_clean
#!/bin/bash
# Author: SysOps
# Description: Safely clean up a log file. Offers backup and truncate options.
if [ $# -ne 1 ]; then
echo "Usage: ${0##*/} </path/to/logfile.log>"
exit 1
fi
LOGFILE="$1"
if [ ! -f "$LOGFILE" ]; then
echo "Error: Log file '$LOGFILE' does not exist." >&2
exit 2
fi
SIZE_MB=$(du -m "$LOGFILE" | cut -f1)
echo "[WARN] Log file: $LOGFILE"
echo "[WARN] Current size: $SIZE_MB MB"
# Offer the options
echo
echo "Choose an action:"
echo "1) Create a backup (.bak) and then truncate the original log."
echo "2) Just truncate the log (clear all content)."
echo "3) Exit (do nothing)."
read -p "Enter your choice (1/2/3): " choice
case $choice in
1)
backup_file="${LOGFILE}.$(date +%Y%m%d_%H%M%S).bak"
echo "[INFO] Creating backup: $backup_file"
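# Abort if the copy fails (for example, no free space) so the log is never truncated without a backup
if ! cp "$LOGFILE" "$backup_file"; then
echo "Error: backup failed; log file left untouched." >&2
rm -f "$backup_file"
exit 4
fi
gzip "$backup_file" # compress the backup to save space
echo "[INFO] Backup saved and compressed."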
;;
2)
# Nothing to do here; fall through to the truncate step below
;;
3)
echo "[INFO] Exiting."
exit 0
;;
*)
echo "Invalid choice. Exiting." >&2
exit 3
;;
esac
if [[ $choice == 1 || $choice == 2 ]]; then
echo "[INFO] Truncating log file..."
: > "$LOGFILE" # ": >" is a more common idiom than "cat /dev/null >"
echo "[INFO] Done. New size: $(du -h "$LOGFILE" | cut -f1)"
fi
Pass the log file path as an argument to the script.
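The script truncates rather than deletes. One reason this matters: if a process still has the log open, removing the file with `rm` does not free the disk space until that process closes it, whereas truncation keeps the same inode and releases the space immediately. The sketch below illustrates the inode behaviour; the function name `truncate_in_place` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: truncate a file and confirm that the inode is unchanged
# and the file is now empty. Returns non-zero if either check fails.
truncate_in_place() {
  local file="$1" before after
  before=$(ls -i "$file" | awk '{print $1}')   # inode number before
  : > "$file"                                   # truncate to zero length
  after=$(ls -i "$file" | awk '{print $1}')    # inode number after
  [ "$before" = "$after" ] && [ ! -s "$file" ]
}
```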
5. View the live connection count: conn_count
#!/bin/bash
# Author: SysOps
# Description: Show the number of ESTABLISHED connections, system-wide or for a given port.
if [ $# -eq 1 ]; then
PORT="$1"
if ! [[ $PORT =~ ^[0-9]+$ ]]; then
echo "Error: Port must be a number." >&2
exit 2
fi
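# Match the local address:port (column 4); anchoring on end of line would match the peer port instead
COUNT=$(ss -tan | awk -v p=":${PORT}$" '$1 == "ESTAB" && $4 ~ p {n++} END {print n+0}')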
echo "[INFO] Current ESTABLISHED connections on port $PORT: $COUNT"
else
TOTAL=$(ss -tan | grep -c 'ESTAB')
echo "[INFO] Total current ESTABLISHED connections: $TOTAL"
fi
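When the connection count spikes, the split across TCP states is often as informative as the ESTABLISHED total (for example, a large number of TIME-WAIT or SYN-RECV sockets). A minimal sketch of such a breakdown follows; it assumes the usual `ss -tan` layout with a header line and the state in the first column, and the function name `count_by_state` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: reads `ss -tan` output on stdin and prints
# "COUNT STATE" lines, highest count first. The header line is skipped.
count_by_state() {
  awk 'NR > 1 { c[$1]++ } END { for (s in c) print c[s], s }' | sort -rn
}
```

Usage would be `ss -tan | count_by_state`.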
6. Locate the CPU hog: cpu_hog
#!/bin/bash
# Author: SysOps
# Description: Find the threads using the most CPU inside the given process.
if [ $# -ne 1 ]; then
echo "Usage: ${0##*/} <pid>"
exit 1
fi
PID="$1"
# Check that the process exists
if ! ps -p "$PID" > /dev/null; then
echo "Error: Process with PID $PID does not exist." >&2
exit 2
fi
echo "[INFO] Analyzing threads for process PID: $PID"
echo "[INFO] Press 'q' to quit the top display."
echo "----------------------------------------"
# Refresh at a high rate to see results sooner
top -H -d 0.5 -p "$PID"
Pass the PID of the target process as an argument to the script.
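For Java processes, the table above suggests following up with jstack. `top -H` shows thread IDs in decimal, while jstack prints the native thread ID as a hexadecimal `nid=0x...` value, so the ID has to be converted before searching the dump. A minimal sketch of that conversion, with the hypothetical function name `tid_to_nid`:

```bash
#!/bin/bash
# Hypothetical helper: convert a decimal thread ID (as shown by top -H)
# to the 0x-prefixed hex form used in jstack's "nid=" field.
tid_to_nid() {
  printf '0x%x\n' "$1"
}
```

Assuming `$PID` is the Java process and `$TID` the busy thread from top, something like `jstack "$PID" | grep -A 20 "nid=$(tid_to_nid "$TID")"` would then show that thread's stack.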
7. Investigate disk I/O: disk_io
#!/bin/bash
# Author: SysOps
# Description: Show disk I/O usage. Requires the sysstat package.
if ! command -v iostat &> /dev/null; then
echo "Error: iostat command not found. Please install the 'sysstat' package." >&2
exit 1
fi
echo "[INFO] Disk I/O statistics (refreshing every 2 seconds, 5 times)..."
iostat -dxhtm 2 5
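Since the iostat output takes some experience to read, a small filter that lists only the busy devices can be a useful first pass. The sketch below locates the `%util` column from the header line rather than hard-coding its position, because the column order differs between sysstat versions. It assumes plain `iostat -dx` output with the device name in the first column (the `-h` option may change that layout on some versions); the function name `busy_devices` and the default threshold of 80 are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: reads `iostat -dx` output on stdin and prints
# "DEVICE %util" for devices whose %util exceeds the threshold ($1, default 80).
busy_devices() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    # Header line: remember which column holds %util
    /^Device/ { col = 0; for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    # Data line: compare that column numerically against the threshold
    col && NF >= col && $col + 0 > t { print $1, $col }
  '
}
```

Usage would be `iostat -dx 2 5 | busy_devices 80`. Note that the first report iostat prints covers the period since boot, so later reports are the ones that reflect current load.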
8. Investigate network latency: net_latency
#!/bin/bash
# Author: SysOps
# Description: Use mtr to investigate latency and packet loss to a target host.
if [ $# -ne 1 ]; then
echo "Usage: ${0##*/} <hostname_or_ip>"
exit 1
fi
TARGET="$1"
# Check that mtr is installed
if ! command -v mtr &> /dev/null; then
echo "Error: mtr command not found. Please install it." >&2
exit 1
fi
echo "[INFO] Tracing route and measuring latency to $TARGET (10 pings)..."
mtr --report --report-wide --no-dns --timeout 3 --interval 1 -c 10 "$TARGET" 2>/dev/null
if [ $? -ne 0 ]; then
echo "Note: mtr might have been interrupted or failed. Trying a simpler output..."
mtr --report -c 4 "$TARGET"
fi
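To pick out problem hops from a long report, the output can be filtered by loss percentage. The sketch below assumes the common `mtr --report` line layout, where each hop line starts with a marker such as `1.|--`, followed by the host and then the loss column. The function name `lossy_hops` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: reads an `mtr --report` on stdin and prints
# "HOP HOST LOSS" for hops whose loss exceeds the threshold ($1, default 0).
lossy_hops() {
  local threshold="${1:-0}"
  awk -v t="$threshold" '
    $1 ~ /^[0-9]+\.\|--$/ {
      loss = $3
      sub(/%$/, "", loss)              # strip the trailing percent sign, if any
      if (loss + 0 > t) print $1, $2, $3
    }'
}
```

Loss that appears at an intermediate hop but not at the final hop is often just that router rate-limiting its replies, so the last hop is usually the one to judge by.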
9. Find large files: find_bigfile
#!/bin/bash
# Author: SysOps
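# Description: Find the 10 largest files under the given directory.
SEARCH_PATH="${1:-/}" # defaults to the root directory, which usually needs sudo
DEPTH_LEVEL="${2:-3}" # default search depth of 3 levels, to keep the run time down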
echo "[INFO] Finding top 10 largest files in $SEARCH_PATH (max depth: $DEPTH_LEVEL). This may take a while..."
echo
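# Use find with a depth limit to avoid walking the whole filesystem, then sort the du output
find "$SEARCH_PATH" -maxdepth "$DEPTH_LEVEL" -type f -exec du -h {} + 2>/dev/null | sort -rh | head -n 10
# $? would only report head's exit status, so check find's own status via PIPESTATUS
if [ "${PIPESTATUS[0]}" -ne 0 ]; then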
echo
echo "Note: Some files were not accessible due to permissions."
echo "Try running with sudo for a complete list."
fi
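When the goal is simply to list everything above a size threshold rather than a top 10, `find -size` avoids running du on every file. The sketch below is one possible variant: it stays on a single filesystem with `-xdev`, which helps when only one mount point is full. It assumes a find that supports the `M` size suffix (GNU and BSD find do); the function name `files_over_mb` and the default of 100 are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list regular files under $1 larger than $2 MiB
# (default 100), without crossing into other mounted filesystems.
files_over_mb() {
  local dir="$1" min_mb="${2:-100}"
  find "$dir" -xdev -type f -size +"${min_mb}M" 2>/dev/null
}
```

For example, `files_over_mb /var/log 500` would list log files over 500 MiB.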
10. One-step system information collection: sys_snapshot
#!/bin/bash
# Author: SysOps
# Description: Quickly collect key system state for troubleshooting and post-incident review.
SNAPSHOT_FILE="system_snapshot_$(hostname)_$(date +%Y%m%d_%H%M%S).log"
{
echo "==== SYSTEM SNAPSHOT ===="
echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime)"
echo
echo "==== OS RELEASE ===="
cat /etc/*release 2>/dev/null
echo
echo "==== KERNEL INFO ===="
uname -a
echo
echo "==== CPU (Top 5 processes) ===="
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 6
echo
echo "==== MEMORY Usage ===="
free -h
echo
echo "==== DISK Usage ===="
df -hT | grep -v tmpfs
echo
echo "==== TOP 5 Memory Processes ===="
ps -eo pid,user,%cpu,%mem,comm --sort=-%mem | head -n 6
echo
echo "==== NETWORK Summary (ss) ===="
ss -s
echo
echo "==== Snapshot saved to: $SNAPSHOT_FILE ===="
} | tee "$SNAPSHOT_FILE"
echo "[INFO] Done. System snapshot saved to: $SNAPSHOT_FILE"
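The snapshot script repeats the same "print a header, run a command, print a blank line" pattern for every section. If the script grows, a small wrapper can cut the repetition and also skip tools that are not installed instead of printing an error. This is a sketch of one way to do it; the function name `section` is hypothetical and not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper: print a section header, then run the given command.
# If the command is not installed, note that it was skipped.
section() {
  local title="$1"
  shift
  echo "==== $title ===="
  if command -v "$1" > /dev/null 2>&1; then
    "$@" 2>&1
  else
    echo "(skipped: $1 not found)"
  fi
  echo
}
```

Inside the braces of the original script, `section "KERNEL INFO" uname -a` or `section "MEMORY Usage" free -h` would replace three lines each. Sections that use a pipeline, such as the `ps ... | head -n 6` ones, would still need to be written out or wrapped in their own function.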
III. Usage Tips and Caveats
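- Permissions: some scripts (such as `free_cache`, `log_clean`, and `disk_io`) may need `sudo` privileges. Configure sudoers appropriately or switch to the root user.
- Proceed with caution: double-check before running commands such as `kill` and `log_clean`. When clearing logs, prefer truncating in place with `cat /dev/null > file.log`, or use log rotation.
- Tool dependencies: make sure the required commands are installed (such as `lsof`, `mtr`, and `iostat`, which is usually part of the `sysstat` package).
- Adapt to your environment: these scripts are generic templates. Modify and tune them for your own company's environment and processes, and keep them in a shared `~/bin` directory or distribute them to a standard path.
- Understand what you run: scripts are powerful tools, but understanding the underlying commands and what their output means matters more. Avoid becoming a "script kiddie" who can only run scripts.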
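The dependency check from the list above can be done up front in one place, so that a missing tool is discovered before an incident rather than during one. A minimal sketch, with the hypothetical function name `check_deps`:

```bash
#!/bin/bash
# Hypothetical helper: report every command from the argument list that is
# not installed. Returns 0 if all are present, 1 otherwise.
check_deps() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "Missing command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_deps lsof mtr iostat ss` once on each host would show which packages still need to be installed.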
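Conclusion
A craftsman who wants to do good work must first sharpen his tools. Add these ten scripts to your toolbox, practice with them, and understand how they work. When the next incident hits, you will be able to respond calmly, resolve it quickly, and earn the title of operations "firefighter".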