How 360 Scaled Network Monitoring to 1 TB Daily Traffic: Lessons in Operations Automation
This article details how Qihoo 360 built large‑scale traffic analysis, VxLAN deployment, and SDN practices to automate network operations, improve visibility, and support security, while sharing real‑world challenges, solutions, and a Q&A on network automation strategies.
1. Role of Software in Qihoo Network
The first task the software and network teams took on was outbound traffic analysis. Initially we used Cacti to poll device traffic, which showed only the load on each 1 G or 10 G link, with no insight into applications, IP addresses, or protocols.
After two years of effort we analyzed nearly 1 TB of outbound traffic, pinpointing which IP addresses generated which traffic, identifying protocols, and mapping traffic for hundreds of core data‑center machines.
Full‑flow collection feeds the security team, enabling them to detect and respond to intrusion incidents.
Our large‑scale architecture requires re‑hashing traffic so that packets of the same TCP connection are directed to a single host.
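A minimal sketch of that re-hashing idea, assuming a pool of analysis hosts and a direction-independent hash over the connection 5-tuple. The host names and the choice of SHA-256 are illustrative assumptions, not details from the talk.

```python
import hashlib
import ipaddress

def flow_key(src_ip, src_port, dst_ip, dst_port, proto=6):
    """Build a key that is identical for both directions of a connection."""
    a = (int(ipaddress.ip_address(src_ip)), src_port)
    b = (int(ipaddress.ip_address(dst_ip)), dst_port)
    lo, hi = sorted((a, b))  # ordering the endpoints removes the direction
    return (proto, lo, hi)

def pick_host(key, hosts):
    """Map a flow key to one analysis host with a stable hash."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]

# Hypothetical pool of analysis hosts.
HOSTS = ["analyzer-01", "analyzer-02", "analyzer-03", "analyzer-04"]

request = flow_key("10.0.0.1", 40000, "8.8.8.8", 443)
reply = flow_key("8.8.8.8", 443, "10.0.0.1", 40000)
print(pick_host(request, HOSTS) == pick_host(reply, HOSTS))
```

Sorting the two endpoints before hashing is what keeps request and reply packets together. Plain modulo hashing reshuffles most flows when the pool size changes, so a consistent-hashing scheme may be preferable if hosts are added or removed often.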
We built a platform that, given a source IP, lists every network device on the path and shows per‑port traffic and error statistics, so faults can be localised quickly. This capability rests on two years of accumulated address and path data.
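The lookup can be pictured with a small sketch. It assumes the collected data has been reduced to two tables: an ARP-derived map from host IP to access port, and an LLDP-derived adjacency map. The device names, port names, and the shortest-path search are illustrative assumptions; a production system would more likely follow the collected routing tables hop by hop.

```python
from collections import deque

# Hypothetical inventory, as it might look after periodic collection.
ARP = {"10.1.2.3": ("tor-a1", "Eth1/7")}  # host IP -> (access switch, port)
LLDP = {                                  # device -> {neighbor: local port}
    "tor-a1": {"agg-a": "Eth1/49"},
    "agg-a": {"tor-a1": "Eth2/1", "core-1": "Eth1/1"},
    "core-1": {"agg-a": "Eth3/1"},
}

def path_for(src_ip, egress, arp=ARP, lldp=LLDP):
    """Return the access port and the devices between a host and an egress device."""
    if src_ip not in arp:
        return None
    first_hop, access_port = arp[src_ip]
    prev = {first_hop: None}
    queue = deque([first_hop])
    while queue:  # breadth-first search over LLDP adjacencies
        dev = queue.popleft()
        if dev == egress:
            break
        for nbr in lldp.get(dev, {}):
            if nbr not in prev:
                prev[nbr] = dev
                queue.append(nbr)
    if egress not in prev:
        return None
    hops = []
    dev = egress
    while dev is not None:  # walk back from the egress to the access switch
        hops.append(dev)
        dev = prev[dev]
    hops.reverse()
    # Pair each device with the local port facing the next hop (None at the end).
    uplinks = [lldp[a][b] for a, b in zip(hops, hops[1:])] + [None]
    return {"access": (first_hop, access_port), "hops": list(zip(hops, uplinks))}

print(path_for("10.1.2.3", "core-1"))
```

Each (device, port) pair in the result is the natural key for pulling the per-port traffic and error counters that the platform displays.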
We also collect LLDP, routing tables, ARP, etc., but collection is periodic and cannot provide real‑time results.
Bandwidth monitoring shows network quality across regions, yet it cannot detect packet loss below 1 %; we are researching a Microsoft‑style full‑mesh monitoring solution, with probe results stored in Hadoop for deeper analysis.
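To see why small loss rates need dedicated probing, here is a rough sketch of full-mesh probe bookkeeping. The node names, the 0.1 % threshold, and the result format are assumptions made for illustration; nothing here describes the system under evaluation.

```python
import math
from itertools import permutations

def probe_pairs(nodes):
    """Every ordered (source, destination) pair in a full mesh."""
    return list(permutations(nodes, 2))

def probes_needed(loss_rate, confidence=0.99):
    """Probes required to observe at least one loss with the given confidence,
    assuming independent losses at a constant rate."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - loss_rate))

def lossy_pairs(results, threshold=0.001):
    """results maps (src, dst) -> (sent, received); return pairs at or above threshold."""
    flagged = {}
    for pair, (sent, received) in results.items():
        if sent == 0:
            continue
        loss = (sent - received) / sent
        if loss >= threshold:
            flagged[pair] = loss
    return flagged

# Hypothetical probe results for three sites.
results = {
    ("bj", "sh"): (10000, 10000),
    ("sh", "bj"): (10000, 9950),
    ("bj", "gz"): (10000, 9999),
}
print(len(probe_pairs(["bj", "sh", "gz"])))
print(probes_needed(0.001))
print(lossy_pairs(results))
```

The probe count grows quickly as the target loss rate shrinks, and the number of pairs grows quadratically with the number of nodes, which is one reason to keep the raw results in a batch store such as Hadoop.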
Using ExaBGP we implemented automated block and rate‑limit rules for DDoS mitigation.
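As an illustration of the blocking half, the sketch below builds the text commands that a helper process can write to standard output for ExaBGP to read. The next-hop address, the use of the RFC 7999 BLACKHOLE community, and the function names are assumptions; the exact command grammar should be checked against the ExaBGP version in use. Rate limiting would normally go through FlowSpec rules and is not shown.

```python
import ipaddress
import sys

BLACKHOLE_COMMUNITY = "65535:666"  # well-known BLACKHOLE community (RFC 7999)
NEXT_HOP = "192.0.2.1"             # hypothetical discard next-hop

def _host_prefix(victim_ip):
    """Turn a single address into a host route (/32 for IPv4, /128 for IPv6)."""
    host = ipaddress.ip_address(victim_ip)
    return f"{host}/{32 if host.version == 4 else 128}"

def block_command(victim_ip, next_hop=NEXT_HOP):
    """Command announcing a blackhole route for the attacked address."""
    return (f"announce route {_host_prefix(victim_ip)} "
            f"next-hop {next_hop} community [{BLACKHOLE_COMMUNITY}]")

def unblock_command(victim_ip, next_hop=NEXT_HOP):
    """Command withdrawing the blackhole route once the attack has passed."""
    return f"withdraw route {_host_prefix(victim_ip)} next-hop {next_hop}"

def send(command, stream=sys.stdout):
    """ExaBGP reads one command per line from the helper's stdout."""
    stream.write(command + "\n")
    stream.flush()

if __name__ == "__main__":
    send(block_command("203.0.113.5"))
```

Validating the address with `ipaddress` before emitting anything keeps a malformed detector alert from turning into a malformed BGP announcement.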
Automation is challenged by heterogeneous vendor APIs and differing command versions, making integration painful.
2. VxLAN Practice on Qihoo Hardware
We initially used OpenStack with VLAN for private‑cloud network access, which required stacking and caused redundancy issues; we avoided stacking where possible.
VM live migration was impossible because configurations were inconsistent. We moved OVS to a DPDK datapath, offloaded VxLAN checksum and encapsulation, and validated OVS+VxLAN integration.
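For readers unfamiliar with what the encapsulation step adds, the following sketch packs and parses the 8-byte VXLAN header defined in RFC 7348. It only illustrates the header layout; it is not how OVS or a NIC performs the work, and the function names are made up for this example.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN
VNI_VALID_FLAG = 0x08  # the "I" flag: the VNI field is valid

def vxlan_header(vni):
    """Pack the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits). Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VNI_VALID_FLAG << 24, vni << 8)

def parse_vni(header):
    """Extract the VNI from a VXLAN header, checking the I flag first."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VNI_VALID_FLAG:
        raise ValueError("VNI flag not set")
    return vni_word >> 8

header = vxlan_header(5000)
print(header.hex(), parse_vni(header))
```

On the wire this header sits inside an outer Ethernet, IP, and UDP header, which for IPv4 without VLAN tags adds about 50 bytes per packet. Building those headers and their checksums for every packet is the work that offloading takes off the CPU.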
Eventually we deployed EVPN, working with vendors to support the technology; EVPN is now a mainstream direction for our OpenStack private‑cloud topology.
3. Our Team’s View on SDN
After four years, SDN is entering a maturity plateau; users now focus on concrete problems SDN should solve, and vendors have clearer solution roadmaps.
We advocate a "southbound" approach to abstract vendor differences so that a user can configure VLANs without worrying about the underlying device brand, and a "northbound" approach that provides visualizations and APIs for business teams to retrieve needed data.
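A small sketch of what such a southbound abstraction might look like: the caller asks for a VLAN and a per-vendor driver renders the commands. The two vendor dialects, the device names, and the class names are all hypothetical and do not correspond to any real product's CLI.

```python
from abc import ABC, abstractmethod

class SwitchDriver(ABC):
    """Vendor-neutral interface that callers program against."""

    @abstractmethod
    def create_vlan(self, vlan_id, name):
        """Return the device commands that create the VLAN."""

class VendorADriver(SwitchDriver):
    # Hypothetical hierarchical CLI dialect.
    def create_vlan(self, vlan_id, name):
        return [f"vlan {vlan_id}", f" name {name}", "exit"]

class VendorBDriver(SwitchDriver):
    # Hypothetical single-line "set" dialect.
    def create_vlan(self, vlan_id, name):
        return [f"set vlans {name} vlan-id {vlan_id}"]

DRIVERS = {"vendor-a": VendorADriver(), "vendor-b": VendorBDriver()}

# Hypothetical inventory: device name -> vendor key.
INVENTORY = {"tor-a1": "vendor-a", "tor-b1": "vendor-b"}

def plan_vlan(devices, vlan_id, name, inventory=INVENTORY):
    """Render per-device commands for one VLAN without exposing vendor details."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    return {dev: DRIVERS[inventory[dev]].create_vlan(vlan_id, name) for dev in devices}

print(plan_vlan(["tor-a1", "tor-b1"], 120, "web-tier"))
```

The same registry can back the northbound side: an API or dashboard calls `plan_vlan` (or a read-only equivalent) and never needs to know which brand sits behind each device name.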
Q&A
Q: Should we build our own network automation system or combine open‑source monitoring tools?
A: Use a commercial network management suite (e.g., SolarWinds) for device discovery and alerting, then develop custom frameworks for specific automation needs.
Q: How does flow analysis help detect intrusion events?
A: By capturing full traffic streams via port mirroring and hashing, we can reconstruct protocols, extract LLDP information, and provide detailed logs to the security team.
Q: How do you map a source IP to the physical path and devices?
A: We periodically collect ARP tables and routing information via SNMP and other methods, building a database that can be queried to display the complete path.
Q: What challenges remain in bandwidth monitoring?
A: Detecting sub‑1 % packet loss is still difficult; we are exploring full‑mesh monitoring stored in Hadoop for finer granularity.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.