Understanding Logs, Their Value, and Practices for Observability and Operations
This article explains what logs are, when to record them, their importance in troubleshooting, performance optimization, security monitoring, and business decisions, and describes how centralized logging, metrics, tracing, and tools like ELK, Prometheus, and OpenTracing enable effective observability in modern distributed systems.
A log is a time‑ordered record of events: it captures what happened and when, providing precise system information for error tracing, performance analysis, security monitoring, and auditing.
Logs are categorized by severity levels such as FATAL, WARNING, NOTICE, DEBUG, and TRACE; projects typically define a threshold so that only logs at or above a certain level are persisted.
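To make the threshold idea concrete, here is a minimal sketch of a leveled logger in Go; the `Level` and `Logger` types and all names are illustrative, not from the article:

```go
package main

import (
	"fmt"
	"os"
)

// Level orders severities from most to least verbose.
type Level int

const (
	TRACE Level = iota
	DEBUG
	NOTICE
	WARNING
	FATAL
)

var names = map[Level]string{
	TRACE: "TRACE", DEBUG: "DEBUG", NOTICE: "NOTICE",
	WARNING: "WARNING", FATAL: "FATAL",
}

// Logger persists only messages at or above its configured threshold.
type Logger struct {
	Threshold Level
}

func (l Logger) Log(lv Level, msg string) {
	if lv < l.Threshold {
		return // below the threshold: dropped at the source
	}
	fmt.Fprintf(os.Stdout, "[%s] %s\n", names[lv], msg)
}

func main() {
	logger := Logger{Threshold: WARNING}
	logger.Log(DEBUG, "cache miss for key user:42")  // filtered out
	logger.Log(WARNING, "retrying upstream request") // printed
	logger.Log(FATAL, "cannot open data directory")  // printed
}
```

Filtering at the source, rather than after collection, is what keeps verbose TRACE/DEBUG logging from imposing storage and I/O costs in production.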
In large‑scale architectures, logs are a core component for troubleshooting, performance optimization, product operation decisions, and detecting security threats such as failed logins or abnormal access patterns.
Distributed log operations require centralized collection, processing, and storage; solutions like the ELK stack (Elasticsearch, Logstash, Kibana) offer powerful search, analysis, and visualization capabilities.
Application Performance Management (APM) unifies logs, metrics, and tracing to achieve observability; metrics are aggregated time‑series data often collected with Prometheus, while tracing records request‑level call chains across services.
Tracing captures request‑scoped spans, enabling developers to understand call relationships and latency; OpenTracing provides a vendor‑neutral API with implementations such as Zipkin, Jaeger, and SkyWalking.
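To illustrate what a span records, here is a hand‑rolled sketch, not the OpenTracing API itself; the `Span` struct and every field name are illustrative simplifications of what tracers like Zipkin or Jaeger capture:

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified request-scoped unit of work: every span in one
// request shares a TraceID, and ParentID links the call chain together.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Start    time.Time
	Duration time.Duration
}

func startSpan(traceID, spanID, parentID, name string) *Span {
	return &Span{TraceID: traceID, SpanID: spanID, ParentID: parentID,
		Name: name, Start: time.Now()}
}

// finish records how long the span took and emits it.
func (s *Span) finish() {
	s.Duration = time.Since(s.Start)
	fmt.Printf("trace=%s span=%s parent=%s op=%s took=%v\n",
		s.TraceID, s.SpanID, s.ParentID, s.Name, s.Duration)
}

func main() {
	root := startSpan("t1", "s1", "", "GET /checkout")
	child := startSpan("t1", "s2", "s1", "db.query")
	time.Sleep(10 * time.Millisecond) // simulated downstream work
	child.finish()
	root.finish()
}
```

The shared trace ID is what lets a backend reassemble spans emitted by different services into one end‑to‑end call tree with per‑hop latency.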
The Go example below demonstrates batch log retrieval over SSH, executing the same grep concurrently across many instances.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"runtime"
	"sync"
)

// WaitGroup shared by the concurrent SSH workers.
var wg sync.WaitGroup

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	instancesHost := getInstances()
	wg.Add(len(instancesHost))
	for _, host := range instancesHost {
		go sshCmd(host)
	}
	wg.Wait()
	fmt.Println("over!")
}

// sshCmd runs the grep query on one remote host.
func sshCmd(host string) {
	defer wg.Done()
	logPath := "/xx/xx/xx/"
	logShell := "grep 'FATAL' xx.log.20230207"
	// ssh options must be passed with -o; the remote command is sent as a
	// single argument so "cd ... && grep ..." runs in the remote shell.
	cmd := exec.Command("ssh",
		"-o", "PasswordAuthentication=no",
		"-o", "ConnectTimeout=1",
		"-l", "root", host,
		fmt.Sprintf("cd %s && %s", logPath, logShell))
	out, err := cmd.CombinedOutput()
	fmt.Printf("exec: %s\n", cmd)
	if err != nil {
		// log.Fatalf here would kill every goroutine; report and return instead.
		log.Printf("cmd.Run() failed on %s: %s\n", host, err)
	}
	fmt.Printf("combined out:\n%s\n", string(out))
}

// getInstances returns the list of instance IPs to query.
func getInstances() []string {
	return []string{
		"x.x.x.x",
		"x.x.x.x",
		"x.x.x.x",
	}
}

Good log practices include clear messages, consistent formatting, appropriate severity levels, minimal performance impact, and alignment with business needs; bad practices involve ambiguous, unstructured, overly verbose, insecure, or improperly leveled logs that hinder diagnosis.
Building a robust log‑operation platform that integrates logging, metrics, and tracing enables real‑time monitoring, alerting, and root‑cause analysis, thereby improving reliability and observability in cloud‑native environments.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.