Proper Server Reboot Practices: Lessons from a Kernel Upgrade Incident
After a painful P1 incident caused by a kernel upgrade, this article outlines how to properly reboot a Linux server by documenting pre‑restart state, verifying post‑restart consistency with process and port snapshots, and using simple commands such as ps, netstat, diff, and last to ensure service continuity.
"Lao He's 1001 Nights" (《老何的1001夜》); Qunar colleagues will get the reference.
Triggered by an Incident
A kernel upgrade caused a P1 incident. It was a painful lesson.
What Is a Proper Server Reboot?
A proper reboot means that after the server restarts, everything it exposes to the outside behaves exactly as it did before, apart from the changes you expected (e.g., the upgraded kernel version).
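The "expected change" half of that definition can be checked as well. The following is a minimal sketch, assuming the expected change is a new kernel release and that you know the target release string in advance; the function name expect_kernel is hypothetical.

```shell
#!/bin/sh
# expect_kernel: check that the kernel release is the one we planned for.
#   $1: the release string we expect after the reboot
#   $2: the release string to check (defaults to the live `uname -r`)
# Returns 0 on a match, 1 otherwise.
expect_kernel() {
    expected=$1
    actual=${2:-$(uname -r)}
    if [ "$expected" = "$actual" ]; then
        echo "kernel is $actual, as expected"
    else
        echo "kernel is $actual, expected $expected" >&2
        return 1
    fi
}
```

After the reboot, calling expect_kernel with the release you intended to install (whatever string your package manager reported) confirms that the one change you wanted did in fact happen.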
When You Know the Server's Users
If you know who uses the server, start with communication: ask the users which services they run, list the corresponding processes, and after the reboot verify that the same processes are running.
When You Know Nothing About the Server
If you have no prior knowledge, follow this logical procedure:
Capture a snapshot of the system before reboot.
After reboot, compare the current snapshot with the pre‑restart snapshot.
If the two snapshots match, the reboot is successful.
If they differ, investigate the differences or consult the last logged‑in user.
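The last two steps amount to a pass/fail check on two files. A minimal sketch, assuming the snapshots were saved as plain-text files; the function name check_reboot is hypothetical, and it uses diff -u (unified output) for the report.

```shell
#!/bin/sh
# check_reboot: compare a pre-reboot snapshot file with a post-reboot one.
# Returns 0 when they are identical, 1 when they differ (after printing
# the diff), and 2 when either file cannot be read.
check_reboot() {
    before=$1
    after=$2
    if [ ! -r "$before" ] || [ ! -r "$after" ]; then
        echo "missing snapshot: $before or $after" >&2
        return 2
    fi
    if diff -u "$before" "$after"; then
        echo "snapshots match: reboot looks consistent"
        return 0
    fi
    echo "snapshots differ: investigate the lines above" >&2
    return 1
}
```

Keeping "file missing" separate from "files differ" matters in practice: a snapshot written to a directory that is cleared at boot would otherwise look like a failed reboot.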
How to Capture a System Snapshot
The simplest way is to record running processes and listening ports on Linux using the following commands:
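sudo ps -ef | sort -k 8 > processlist
sudo netstat -an | grep LIST | sort -k 5 > portlist
Run them once before the reboot and once after, saving each run under its own name (for example processlist_before_reboot and processlist_after_reboot) so the two can be compared later.
Generally, a system snapshot consists of its process list and port list; if both are identical before and after reboot, the system state can be considered unchanged (application‑level issues are analyzed separately).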
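One practical wrinkle: ps -ef output carries PID, PPID, start-time, and CPU-time columns, which change on every boot, so two raw process lists will rarely be byte-for-byte identical. The sketch below is one way to deal with that, not the author's method. It assumes the usual eight-column ps -ef layout (UID PID PPID C STIME TTY TIME CMD) and that comparing only the owner and the command line is good enough; the function name stable_process_list is hypothetical.

```shell
#!/bin/sh
# stable_process_list: read `ps -ef` output on stdin and keep only the
# owner (column 1) and the command line (column 8 onward). The header
# line and the PID, PPID, C, STIME, TTY and TIME columns are dropped
# because they differ after every boot. Output is sorted so that two
# snapshots line up for comparison.
stable_process_list() {
    awk 'NR > 1 {
        line = $1
        for (i = 8; i <= NF; i++) line = line " " $i
        print line
    }' | sort
}

# Example use (file name taken from the article):
#   ps -ef | stable_process_list > processlist_before_reboot
```

Duplicate lines are kept on purpose, so a service that came back with fewer worker processes still shows up as a difference. Some entries, such as kernel worker threads, may still vary between boots and need to be judged by eye.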
How to Compare System Snapshots
Use the diff command to see differences:
diff -c processlist_before_reboot processlist_after_reboot
Find Recent Users of the System
The last command shows recent logins. If you don't know who owns the server, run:
last
Review the most recent users; recognizing familiar usernames is a good habit.
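On a busy machine the raw output of last can run to hundreds of lines. A minimal sketch for reducing it to a short list of names, assuming the login name is the first field of each line; the function name recent_users is hypothetical.

```shell
#!/bin/sh
# recent_users: read `last` output on stdin and print each distinct login
# name once, in order of first appearance (most recent login first).
# Skips blank lines, the "reboot" and "shutdown" pseudo-users, and the
# trailing "wtmp begins ..." line.
recent_users() {
    awk 'NF > 0 && $1 != "reboot" && $1 != "shutdown" && $1 != "wtmp" && !seen[$1]++ { print $1 }'
}

# Example use:
#   last -n 50 | recent_users
```

The result is a list of people to ask, nothing more; a name at the top only means that account logged in most recently.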
Conclusion
The core principle remains unshakable: for any change, you must know the expected outcome and devise a method to verify that the result is correct.
He Weiping
Qunar / Travel Vacation Division – Search technology researcher and database researcher, translator of the first Chinese PostgreSQL manual and the third edition of Programming Perl. Focuses on search, distributed systems, databases, and cluster design. 18 years of IT experience.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.