
85 Essential Ops Rules Every Engineer Should Follow

This article presents a comprehensive list of 85 practical operations rules covering capacity planning, monitoring, automation, security, documentation, budgeting, team management, and incident handling, offering actionable guidance for building reliable, scalable, and efficient IT infrastructure.

Efficient Ops

Jan 31, 2018


85 Essential Operations Rules

1. Prioritize capacity first, then optimize; ignoring this leads to downtime.

2. When using Postgres, make sure every environment is set up consistently for WAL, Slony replication, snapshot technology, and on‑disk database versioning.

3. Avoid adding “optimizations” that become operational burdens; ensure tools are fully handed over before further development.

4. Keep it simple – KISS (Keep It Simple, Stupid).

5. Use caching cautiously; it can hinder horizontal scaling, and its purpose should be to improve end‑user performance, not merely to raise site capacity.

6. Don't write all code yourself or outsource everything; use the right tool at the right time.

7. Effective negotiation starts with research and feasible proposals; choose the right vendor when needed.

8. Maintain N+1 redundancy; in a two‑server pair, keep each server at or below 49% load so that either one can absorb the other's traffic if it fails, and prefer an N+2 architecture when possible.

9. Data loss is unacceptable; its cost far exceeds preventive measures.

10. Parallelize wherever possible, e.g., using MogileFS for real‑time data replication.

11. Read manuals (e.g., RAID cards) to catch hidden details.

12. Identify bottlenecks and trace them layer by layer (disk, memory, CPU).

13. Conduct regular capacity‑management programs; without data trends you cannot see weak points.

14. Embrace change and avoid fostering failure.

15. Don't set traps for yourself; your work should empower future tasks.

16. Ops‑written code should be tooling, not application software.

17. Value project managers, technical writers, and financial analysts in the ops team.

18. Monitor everything; use alerts for anomalies and collect data for trend analysis.

19. Regularly review trend data across all areas.

20. Keep monitoring clean; a noisy system is useless.

21. Ensure monitoring is simple enough for anyone in the company to use.

22. Only check things you can actually improve; do not spend time on checks that cannot lead to action.

23. Publish inspection reports with data for easy consumption.

24. Assign owners to every technical point.

25. Provide backup personnel for critical roles.

26. Keep hiring continuously, even without open slots.

27. Be self‑disciplined; constantly improve regardless of confidence.

28. Benchmark against other companies; look outward.

29. Attend one major industry event per year.

30. Purchase what you need, not what you want; prioritize simplicity and safety.

31. Always act in the best business interest, even if it means leaving.

32. Implement formal accountability: record commitments and track fulfillment.

33. Avoid failing more than twice; distinguish between accidental and systemic errors.

34. Be ruthless when necessary.

35. Sign off on work you complete; completion matters.

36. Become useful to others.

37. Collaborate with startups; offer expertise and receive free products.

38. Treat capacity as a business/product issue; make costs transparent for every request.

39. Continuously break budgets; ops often consume the most spend.

40. Test old processes with tools before assuming they still work.

41. Document everything; make newcomers ask for procedures.

42. Draw a large diagram of your data‑center network topology.

43. Create a flowchart for each product’s business process.

44. Use a FAQ/Wiki so “how‑to‑fix” articles are easy to find.

45. Ensure anyone can be replaced.

46. Recognize that many work better from home than the office.

47. Bundle orders for better discounts and terms when buying hardware in bulk.

48. Maintain long‑term relationships with suppliers.

49. Equip every ops engineer with remote‑work gear (handheld, Wi‑Fi, large monitor).

50. Avoid being trapped by legacy OS standards; don’t force Windows if Mac works better.

51. Have a rational procurement process; align technical and financial budgets.

52. Hold weekly meetings to follow up on actions and accountability.

53. Build a separate escalation system to isolate dev‑code issues from production impact.

54. Integrate ops considerations (scalability, monitoring, reliability) from design through product development.

55. Adopt security/audit standards (Sarbanes‑Oxley, WebTrust, SAS 70, PCI) early with ticketing tools.

56. Simplify redundant login processes; without true scalability the system fails.

57. Standard‑edition Oracle or SQL Server is worthwhile if you stay within its limits.

58. Consider free databases like Postgres or MySQL; choose based on transaction integrity needs.

59. Design capacity with 20‑30% headroom above daily peak load.

60. Read free economic magazines; they provide valuable insights.

61. Enforce separation of duties: developers should not have production access; ops controls permissions.

62. Control access points; enable two‑factor authentication.

63. Log access to production environments; capture screenshots where possible.

64. Ensure redundant VPN access points for production; don’t rely on a single VPN.

65. Use LDAP authentication even for small fleets.

66. Learn Windows if you need a Windows server; don’t avoid it out of ignorance.

67. Provide reliable Wi‑Fi everywhere (office, conference rooms, entrances).

68. Recognize that ops staff often sacrifice personal time for rapid incident response.

69. Centralize all product outcomes in a relational DB and replicate to remote sites.

70. Automate processes for OS/product releases, file distribution, and log analysis.

71. Use an ops database as the source of truth for automation.

72. Classify servers as offline, online, or production; use configuration tools (cfengine, rsync) for online state.

73. Export logs before decommissioning or rebuilding devices.

74. If scaling too fast for optimization, lock defaults and revisit only when necessary.

75. Accept that ops engineers will inevitably make critical mistakes (e.g., an accidental rm -rf /).

76. Keep the team environment fun; foster ownership, not managerial micromanagement.

77. Achieving 99.999% availability provides flexibility for redundancy and rapid changes.

78. If you can guarantee 99.999% uptime, promise 100% service to customers.

79. Preserve the ability to roll back releases; avoid relying on ad‑hoc fixes.

80. Design every step with the end‑user’s service quality as the primary goal.

81. Aim to get it right the first time; rework wastes resources.

82. Learn from peers and allies; share experiences to improve collectively.

83. Hire talent that sets a high standard and can be a role model.

84. Distinguish IT from ops; a good ops manager can handle both, but a traditional IT engineer may struggle with internet‑scale ops.

85. At the start of a new job or each year, fight for a realistic budget based on historical data.
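
Rules 8 and 59 both come down to simple arithmetic, so a small sketch may help make them concrete. The Python below is illustrative only: the function names (`max_safe_utilization`, `required_capacity`, `servers_needed`) and the sample numbers are assumptions introduced here, not part of the original rules. It also assumes load is spread evenly across servers and reads "headroom over daily peaks" as extra capacity on top of the peak.

```python
import math


def max_safe_utilization(servers: int, tolerated_failures: int = 1) -> float:
    """Highest per-server utilization (0..1) at which the surviving servers
    can still carry the full load after `tolerated_failures` servers fail.
    Assumes load is balanced evenly across servers (an assumption)."""
    survivors = servers - tolerated_failures
    if servers <= 0 or survivors <= 0:
        raise ValueError("need at least one surviving server")
    return survivors / servers


def required_capacity(daily_peak: float, headroom: float = 0.25) -> float:
    """Capacity to provision so the daily peak leaves `headroom` spare,
    e.g. headroom=0.25 means 25% on top of the peak."""
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    return daily_peak * (1 + headroom)


def servers_needed(daily_peak: float, per_server_capacity: float,
                   headroom: float = 0.25, spare: int = 1) -> int:
    """Servers required for peak plus headroom, plus `spare` extra servers
    (spare=1 for N+1, spare=2 for N+2)."""
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    base = math.ceil(required_capacity(daily_peak, headroom) / per_server_capacity)
    return base + spare


if __name__ == "__main__":
    # Hypothetical numbers: a 1000 req/s daily peak, 300 req/s per server.
    print(max_safe_utilization(servers=2))          # two-server pair
    print(required_capacity(1000, headroom=0.25))   # peak plus 25% headroom
    print(servers_needed(1000, 300, headroom=0.25, spare=1))
```

With two servers and one tolerated failure the ceiling is 50% per server, which is why rule 8 tells you to stay below it. The same function shows that N+2 on four servers gives the same 50% ceiling while surviving two failures.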

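To put the 99.999% figure from rules 77 and 78 in perspective, the downtime it allows can be computed directly. This is a minimal sketch; the function name `allowed_downtime_minutes` and the choice of a 365-day year are assumptions made for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` at the given availability,
    expressed as a percentage (e.g. 99.999)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return days * MINUTES_PER_DAY * unavailable_fraction


if __name__ == "__main__":
    # Compare common availability targets over a 365-day year.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of total downtime per year, which gives a sense of how little room there is for unplanned outages.
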

