Understanding Logs, Their Value, and Practices for Observability and Operations
This article explains what logs are, when to record them, their importance in troubleshooting, performance optimization, security monitoring, and business decisions, and describes how centralized logging, metrics, tracing, and tools like ELK, Prometheus, and OpenTracing enable effective observability in modern distributed systems.
A log is a time‑ordered record of events: it captures what happened and when, providing precise system information for error tracing, performance analysis, security monitoring, and auditing.
Logs are categorized by severity levels such as FATAL, WARNING, NOTICE, DEBUG, and TRACE; projects typically define a threshold so that only logs at or above a certain level are persisted.
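To make the threshold idea concrete, here is a minimal sketch of a leveled logger in Go; the `Level` and `Logger` types and all names are illustrative, not from the article:

```go
package main

import (
	"fmt"
	"os"
)

// Level orders severities from most to least verbose.
type Level int

const (
	TRACE Level = iota
	DEBUG
	NOTICE
	WARNING
	FATAL
)

var names = map[Level]string{
	TRACE: "TRACE", DEBUG: "DEBUG", NOTICE: "NOTICE",
	WARNING: "WARNING", FATAL: "FATAL",
}

// Logger persists only messages at or above its configured threshold.
type Logger struct {
	Threshold Level
}

func (l Logger) Log(lv Level, msg string) {
	if lv < l.Threshold {
		return // below the threshold: dropped at the source
	}
	fmt.Fprintf(os.Stdout, "[%s] %s\n", names[lv], msg)
}

func main() {
	logger := Logger{Threshold: WARNING}
	logger.Log(DEBUG, "cache miss for key user:42")  // filtered out
	logger.Log(WARNING, "retrying upstream request") // printed
	logger.Log(FATAL, "cannot open data directory")  // printed
}
```

Filtering at the source, rather than after collection, is what keeps verbose TRACE/DEBUG logging from imposing storage and I/O costs in production.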
In large‑scale architectures, logs are a core component for troubleshooting, performance optimization, product operation decisions, and detecting security threats such as failed logins or abnormal access patterns.
Distributed log operations require centralized collection, processing, and storage; solutions like the ELK stack (Elasticsearch, Logstash, Kibana) offer powerful search, analysis, and visualization capabilities.
Application Performance Management (APM) unifies logs, metrics, and tracing to achieve observability; metrics are aggregated time‑series data often collected with Prometheus, while tracing records request‑level call chains across services.
Tracing captures request‑scoped spans, enabling developers to understand call relationships and latency; OpenTracing provides a vendor‑neutral API with implementations such as Zipkin, Jaeger, and SkyWalking.
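To illustrate what a span records, here is a hand‑rolled sketch, not the OpenTracing API itself; the `Span` struct and every field name are illustrative simplifications of what tracers like Zipkin or Jaeger capture:

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified request-scoped unit of work: every span in one
// request shares a TraceID, and ParentID links the call chain together.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Start    time.Time
	Duration time.Duration
}

func startSpan(traceID, spanID, parentID, name string) *Span {
	return &Span{TraceID: traceID, SpanID: spanID, ParentID: parentID,
		Name: name, Start: time.Now()}
}

// finish records how long the span took and emits it.
func (s *Span) finish() {
	s.Duration = time.Since(s.Start)
	fmt.Printf("trace=%s span=%s parent=%s op=%s took=%v\n",
		s.TraceID, s.SpanID, s.ParentID, s.Name, s.Duration)
}

func main() {
	root := startSpan("t1", "s1", "", "GET /checkout")
	child := startSpan("t1", "s2", "s1", "db.query")
	time.Sleep(10 * time.Millisecond) // simulated downstream work
	child.finish()
	root.finish()
}
```

The shared trace ID is what lets a backend reassemble spans emitted by different services into one end‑to‑end call tree with per‑hop latency.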
The Go example below demonstrates batch log retrieval over SSH, executing the same grep concurrently across many instances.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"runtime"
	"sync"
)

// WaitGroup shared by the concurrent SSH workers.
var wg sync.WaitGroup

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	instancesHost := getInstances()
	wg.Add(len(instancesHost))
	for _, host := range instancesHost {
		go sshCmd(host)
	}
	wg.Wait()
	fmt.Println("over!")
}

// sshCmd runs the grep query on one remote host.
func sshCmd(host string) {
	defer wg.Done()
	logPath := "/xx/xx/xx/"
	logShell := "grep 'FATAL' xx.log.20230207"
	// ssh options must be passed with -o; the remote command is sent as a
	// single argument so "cd ... && grep ..." runs in the remote shell.
	cmd := exec.Command("ssh",
		"-o", "PasswordAuthentication=no",
		"-o", "ConnectTimeout=1",
		"-l", "root", host,
		fmt.Sprintf("cd %s && %s", logPath, logShell))
	out, err := cmd.CombinedOutput()
	fmt.Printf("exec: %s\n", cmd)
	if err != nil {
		// log.Fatalf here would kill every goroutine; report and return instead.
		log.Printf("cmd.Run() failed on %s: %s\n", host, err)
	}
	fmt.Printf("combined out:\n%s\n", string(out))
}

// getInstances returns the list of instance IPs to query.
func getInstances() []string {
	return []string{
		"x.x.x.x",
		"x.x.x.x",
		"x.x.x.x",
	}
}

Good log practices include clear messages, consistent formatting, appropriate severity levels, minimal performance impact, and alignment with business needs; bad practices involve ambiguous, unstructured, overly verbose, insecure, or improperly leveled logs that hinder diagnosis.
Building a robust log‑operation platform that integrates logging, metrics, and tracing enables real‑time monitoring, alerting, and root‑cause analysis, thereby improving reliability and observability in cloud‑native environments.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.