If you’ve spent time supporting AI infrastructure, whether that’s a GPU training cluster, a fleet of inference nodes, or a multi-tenant model serving platform, you’ve probably noticed something: the network telemetry tools that served you well in a traditional data center feel slightly out of place here. Not useless. Just not quite designed for this.
The traffic patterns are different. The failure modes are different. The things you need to catch early are different. And if you’re running NetFlow or sFlow collection – which you should be – understanding where that data genuinely helps versus where you’re looking at the wrong instrument is the difference between a useful monitoring stack and a false sense of coverage.
Why AI Traffic Is Different
Most of the networking intuition you’ve built over a career was forged on north-south traffic – clients reaching services, users reaching the internet, workloads reaching storage. Even in modern microservices environments with heavy east-west traffic, flows are relatively short-lived, heterogeneous in size, and largely TCP-based with normal congestion dynamics.
AI training breaks most of those assumptions simultaneously.
A distributed training job across a GPU cluster is synchronous in a way that most networked workloads are not. Every GPU in a collective operation – say, an AllReduce synchronizing gradients across 512 nodes – must complete its contribution before any node can advance to the next compute step. The network becomes the synchronization barrier. This creates genuinely unusual traffic characteristics:
Most flows are elephant flows
A gradient synchronization pass over a large cluster moves gigabytes per node in a coordinated burst, not a trickle of small transactions. Whether your flow collector sees this depends on your fabric. If your cluster runs collective communication over TCP/IP – common in cloud-based training environments or clusters without dedicated RDMA hardware – you will see sustained multi-gigabit flows where traditional workloads generate thousands of small ones. If your cluster runs RoCEv2 or InfiniBand, this traffic bypasses the IP stack and is largely invisible to NetFlow and sFlow collection. Know your fabric before assuming flow data covers your compute tier.
Bursts are synchronized across the whole cluster
When AllReduce begins, all nodes transmit simultaneously. This creates incast conditions at aggregation switches – a convergence of many flows to a small number of destinations at the same moment. This is a known and serious problem in AI fabric design. If your collective communication runs over TCP/IP, this shows up in flow data as correlated burst events across many source IPs simultaneously. On RDMA fabrics, the incast problem is equally real and equally damaging to training throughput, but it is not visible to a flow analyzer.
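On a TCP/IP fabric, those correlated burst events are easy to surface programmatically once you have per-source rates bucketed by time window. The sketch below – with illustrative node names, rates, and thresholds, not values from any real collector – counts how many training-node sources burst in the same window:

```python
# Sketch: spot synchronized burst (incast-like) windows in TCP flow data by
# counting how many training-node sources exceed a rate threshold in the
# same time bucket. All rates and thresholds here are illustrative.

def burst_windows(rates_by_window, rate_threshold, min_sources):
    """Return windows where at least min_sources nodes burst simultaneously."""
    return [w for w, rates in rates_by_window.items()
            if sum(1 for r in rates.values() if r > rate_threshold) >= min_sources]

# Gbps per source per 1-second window; window 2 shows a cluster-wide burst.
observed = {
    0: {"n1": 0.2, "n2": 0.3, "n3": 0.1},
    1: {"n1": 0.4, "n2": 0.2, "n3": 0.3},
    2: {"n1": 9.1, "n2": 8.7, "n3": 9.4},
}
print(burst_windows(observed, rate_threshold=5.0, min_sources=3))  # [2]
```

Requiring a minimum number of simultaneous sources is what distinguishes a cluster-wide AllReduce burst from a single node doing a large transfer.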
Conversation patterns are structured and repetitive
GPU-to-GPU communication during AllReduce follows strict topological patterns. In a ring AllReduce each node communicates only with its two neighbors; in a tree AllReduce communication follows a fixed hierarchy. These patterns repeat with every training step. On TCP/IP-based training fabrics this structure is visible in flow data: a fixed set of source-destination IP pairs exchanging large volumes at regular intervals, which makes baseline deviation relatively easy to detect. On RDMA fabrics, this structure exists at the hardware layer but does not surface in flow records.
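Because the pair set is fixed, baseline deviation can be detected with a simple set difference. Here's a minimal sketch under that assumption – the baseline pairs and observed pairs are hypothetical, standing in for data you'd pull from flow records:

```python
# Sketch: detect drift in the fixed src->dst conversation pattern of a
# ring AllReduce by diffing today's pair set against a recorded baseline.
# All IP pairs here are illustrative.

baseline_pairs = {("10.0.1.11", "10.0.1.12"),
                  ("10.0.1.12", "10.0.1.13"),
                  ("10.0.1.13", "10.0.1.11")}

def pattern_drift(observed_pairs, baseline=baseline_pairs):
    """Return pairs that appeared or disappeared relative to the baseline."""
    return {"new": observed_pairs - baseline,
            "missing": baseline - observed_pairs}

observed = {("10.0.1.11", "10.0.1.12"),
            ("10.0.1.12", "10.0.1.13"),
            ("10.0.1.13", "10.0.1.99")}  # unexpected neighbor
print(pattern_drift(observed))
```

A non-empty "new" or "missing" set during a steady-state training run is worth investigating: ring membership changed, a node was replaced, or traffic is being rerouted.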
The traffic alternates between distinct phases
Training jobs cycle between compute phases and I/O phases. The compute phase is dominated by GPU-to-GPU gradient exchange, typically running over RDMA (RoCEv2 or InfiniBand) – traffic that is largely invisible to NetFlow and sFlow collection. The I/O phase – dataset loading from distributed storage, checkpoint writes – uses different source/destination IP ranges, different timing, and potentially different protocols. Whether I/O phase traffic appears as TCP in your flow data depends on your storage backend: NFS, iSCSI, and S3-compatible object storage all use TCP; RDMA-based parallel filesystems such as Lustre with RDMA, GPFS, DAOS, or NVMe-oF do not. Know which you’re running before assuming flow data covers your storage tier.
Where Flow Data Actually Earns Its Keep
Spotting the Slow Node
This is the use case that converts skeptics. Distributed training is only as fast as its slowest participant – the straggler problem is real and expensive. When a training job is consistently underperforming its expected throughput, the first instinct is to look at compute metrics: GPU utilization, memory bandwidth, CUDA errors. Those are valid. But the problem is often in the network, and it shows up first in flow data.
If you’ve defined your GPU nodes as a custom group in NFA and you’re watching per-node flow volumes during a training run, a node that’s sending substantially less data than its peers stands out immediately. Not because NFA tells you the GPU is underperforming – it can’t do that – but because the network traffic is the observable symptom of a node not pulling its weight in the collective operation.

SCOPE BOUNDARY
What flow data shows you here: byte and packet counts per source IP, per time window. What it does not show: why the node is slow – driver issue, hardware fault, misconfigured NCCL, bad cable at the NIC. Flow data gets you to the right machine. The diagnosis happens elsewhere.
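The per-node comparison itself is trivial once the byte counts are in hand. This sketch assumes you've exported per-node sent-byte totals (the counts and IPs below are illustrative) and uses the cluster median, which is robust to the straggler skewing the statistic:

```python
# Sketch: flag candidate straggler nodes from per-node sent-byte totals.
# Byte counts here are illustrative; in practice they would come from a
# flow-data export or query grouped by source IP over the training window.

from statistics import median

def find_stragglers(bytes_by_node, fraction=0.7):
    """Return nodes whose sent volume falls below a fraction of the
    cluster median (median is robust to the straggler itself)."""
    cutoff = fraction * median(bytes_by_node.values())
    return [node for node, sent in bytes_by_node.items() if sent < cutoff]

# Node .14 sent far less than its peers during the run.
observed = {
    "10.0.1.11": 412_000_000_000,
    "10.0.1.12": 405_000_000_000,
    "10.0.1.13": 418_000_000_000,
    "10.0.1.14": 120_000_000_000,  # candidate straggler
}
print(find_stragglers(observed))  # ['10.0.1.14']
```

The 0.7 fraction is a starting point, not a recommendation – tune it against the spread you observe across healthy runs.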
Catching Checkpoint and Dataset I/O Problems
The I/O phases of a training job generate traffic that is completely distinct from the GPU-to-GPU communication – different source and destination IP ranges, different timing, and on TCP-based storage backends, a different protocol. If your storage infrastructure uses NFS, iSCSI, or S3-compatible object storage, checkpoint writes and dataset reads will appear as TCP flows in NFA and give you a useful visibility window into whether I/O is occurring at the expected frequency and volume.
If your storage fabric uses RDMA-based transport – Lustre with RDMA, GPFS, DAOS, or NVMe-oF – those flows will not appear meaningfully in NetFlow or sFlow records, for the same reason that intra-cluster GPU traffic is largely invisible. Verify your storage protocol before treating the absence of storage traffic in NFA as a problem, or its presence as complete coverage.
For TCP-backed storage, flow data can tell you whether checkpoint writes are happening at expected intervals, whether dataset loading is completing before the next compute phase begins, and whether storage traffic is bleeding onto the compute fabric – a misconfiguration that creates unnecessary contention. Those are real findings, available from flow collection with no additional instrumentation, provided the transport assumption holds.
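The "checkpoints at expected intervals" check reduces to comparing gaps between observed write bursts against the schedule. A minimal sketch, with illustrative timestamps standing in for flow-record times of training-to-storage traffic spikes:

```python
# Sketch: check that checkpoint-write traffic recurs at the expected
# interval. Timestamps are illustrative; in practice they'd come from flow
# records of Training Nodes -> Storage traffic exceeding a volume threshold.

def checkpoint_gaps(write_times, expected_interval_s, tolerance_s=60):
    """Return gaps between consecutive checkpoint bursts that deviate
    from the expected interval by more than the tolerance."""
    gaps = [b - a for a, b in zip(write_times, write_times[1:])]
    return [g for g in gaps if abs(g - expected_interval_s) > tolerance_s]

# Checkpoints expected every 30 minutes; one appears to have been skipped.
times = [0, 1800, 3600, 7200]  # seconds since job start
print(checkpoint_gaps(times, expected_interval_s=1800))  # [3600]
```

This only makes sense for frameworks that checkpoint on a fixed step or time interval – as noted later, schedules keyed to loss improvement produce irregular gaps by design.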
Validating That Your Segmentation Actually Works
AI infrastructure is almost always built with logical segmentation: training nodes in one VLAN or subnet, management traffic on another, storage traffic on a third. The intent is to isolate management access, prevent job interference, and control what communicates with what.
The gap between intended segmentation and actual segmentation is often embarrassing. Flow data closes that gap. Using NFA’s custom group feature to define your infrastructure tiers and then querying for inter-group conversations gives you a ground-truth view of what’s actually communicating, independent of what your firewall rules say they’re supposed to do.
A training container that’s been misconfigured and is opening connections outside its expected scope will show up in that query. A shadow GPU node that someone spun up and forgot to register will appear as an unexpected source IP. A dependency on an external model hub that nobody documented will show up as a recurring outbound flow to a CDN IP range. None of this requires deep packet inspection. It’s all visible in flow metadata.
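The inter-group comparison amounts to mapping each flow endpoint to its group and checking the pair against an allowed list. A sketch under assumed group subnets and policy (all ranges and pairs below are hypothetical):

```python
# Sketch: compare observed inter-group conversations against intended
# segmentation policy. Group memberships and flows are illustrative.

from ipaddress import ip_address, ip_network

GROUPS = {
    "gpu_training": ip_network("10.0.1.0/24"),
    "storage": ip_network("10.0.2.0/24"),
    "management": ip_network("10.0.9.0/24"),
}
ALLOWED_PAIRS = {
    ("gpu_training", "gpu_training"),
    ("gpu_training", "storage"),
    ("management", "gpu_training"),
}

def group_of(ip):
    addr = ip_address(ip)
    for name, net in GROUPS.items():
        if addr in net:
            return name
    return "unknown"  # not in any defined group -- itself a finding

def policy_violations(flows):
    """Return (src, dst) flows whose group pair is not in the allowed set."""
    return [(s, d) for s, d in flows
            if (group_of(s), group_of(d)) not in ALLOWED_PAIRS]

flows = [("10.0.1.5", "10.0.2.7"),    # training -> storage: allowed
         ("10.0.1.5", "203.0.113.9")]  # training -> external: violation
print(policy_violations(flows))
```

Note that "unknown" endpoints surface automatically – the shadow GPU node and the undocumented model-hub dependency both fall out of the same query.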
The Security Case: Models Are Worth Stealing
A fine-tuned LLM trained on proprietary data, with a carefully curated instruction set and RLHF pass, represents a significant capital investment. It also sits on a storage server somewhere with an IP address that’s reachable from inside your network. Flow-based monitoring is not a DLP solution, but it is a very good early warning system for bulk exfiltration, because exfiltration at scale requires moving large volumes of data, and volume is exactly what NetFlow measures.
The alert configuration for this in NFA is straightforward: define your model storage servers as a custom group, define your approved external destinations as another group – cloud backup endpoints, specific CDN ranges, known partner IPs – and alert on any significant outbound volume from the model storage group to destinations not in the approved list. Add a time-of-day rule that flags large transfers outside business hours. That covers the most common exfiltration patterns with a few minutes of configuration.
What you get when an alert fires: source IP, destination IP, protocol, bytes transferred, duration. What you don’t get: what the data was. That determination requires host-level logging or a packet capture. Flow data is the trigger. Investigation uses other tools.
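The trigger logic described above can be expressed compactly. In this sketch the subnet, approved range, thresholds, and off-hours window are all illustrative placeholders, not recommendations:

```python
# Sketch of the exfiltration alert logic: flag outbound volume from the
# model storage group to non-approved destinations, with a stricter
# threshold outside business hours. All values here are illustrative.

from ipaddress import ip_address, ip_network

MODEL_STORAGE = ip_network("10.0.3.0/24")
APPROVED_EXTERNAL = [ip_network("198.51.100.0/24")]  # e.g. backup endpoint

def should_alert(src, dst, bytes_sent, hour,
                 day_limit=50 * 2**30, night_limit=5 * 2**30):
    """True if an outbound transfer from model storage looks suspicious."""
    if ip_address(src) not in MODEL_STORAGE:
        return False
    if any(ip_address(dst) in net for net in APPROVED_EXTERNAL):
        return False
    limit = night_limit if hour >= 23 or hour < 6 else day_limit
    return bytes_sent > limit

# 8 GiB to an unknown host at 2 am trips the off-hours rule.
print(should_alert("10.0.3.10", "203.0.113.50", 8 * 2**30, hour=2))  # True
```

The asymmetric day/night thresholds mirror the time-of-day rule: a transfer volume that is routine during business hours deserves scrutiny at 2 am.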
What to Actually Configure in NFA for AI Infrastructure
Here are the alert rules and custom groups that make sense specifically for GPU cluster environments.
Custom Groups to Define First
Everything useful in NFA for AI environments flows from having your infrastructure logically grouped. Define these before building any alert rules or dashboard queries:
| Group Name | What Goes In It | Why It Matters |
|---|---|---|
| GPU Training Nodes | IP subnet(s) of compute nodes in your training cluster | Source of all intra-cluster flow analysis |
| Storage / Checkpointing | NAS, object store, or distributed filesystem node IPs – only TCP-backed backends provide useful flow visibility | Separates the I/O phase from the compute phase traffic, where protocol support allows |
| Model Storage | Servers holding trained model artifacts and datasets | Primary target for exfiltration alerting |
| Approved External | Cloud provider ranges, model hub CDN IPs, and known partner networks | Baseline for unexpected outbound flow detection |
| Management Plane | Jump hosts, monitoring servers, orchestration nodes | Keep management traffic out of the training cluster analysis |
Alert Rules Worth Configuring
| Alert | Trigger | What You’re Catching |
|---|---|---|
| Outbound volume spike from model storage | Bytes from Model Storage group to non-Approved External destinations exceed X GB in Y minutes | Bulk data exfiltration attempt |
| First-ever external connection from sensitive hosts | Any flow from Model Storage or Training Nodes to an IP not in any defined group | Unknown external connection: malware callout, misconfigured container, or supply chain issue |
| Off-hours large transfer | Significant outbound volume from any sensitive group between 11 pm and 6 am | Covers the most common exfiltration timing window |
| Training node traffic dropout | Flow volume from a specific training node IP deviates significantly below its own established baseline during an active job window – threshold should be derived from observed variance across previous training runs, not a fixed percentage | Straggler node detection – a network symptom of a compute or connectivity problem |
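The variance-derived threshold in the dropout rule can be sketched in a few lines. The historical volumes below are illustrative, and the three-sigma multiplier is a starting point rather than a recommendation:

```python
# Sketch: derive a per-node dropout threshold from observed variance across
# previous training runs, rather than a fixed percentage. Historical
# per-window byte volumes here are illustrative.

from statistics import mean, stdev

def dropout_threshold(historical_bytes, k=3.0):
    """Alert threshold: k standard deviations below the node's own mean."""
    return mean(historical_bytes) - k * stdev(historical_bytes)

history = [410e9, 405e9, 418e9, 412e9, 407e9]  # bytes per window, past runs
threshold = dropout_threshold(history)
current = 300e9
print(current < threshold)  # True -> dropout alert fires
```

Deriving the threshold per node from that node's own history matters because link speeds, placement in the topology, and role in the collective can make a fixed cluster-wide percentage either too noisy or too insensitive.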
A Representative Data Explorer Query
For the straggler detection use case, here’s the query logic to run in NFA’s Data Explorer during or after a training run:
Group By: Source IP
Order By: Bytes (ascending)
Filter: Source IP in group [GPU Training Nodes]
AND Destination IP in group [GPU Training Nodes]
Narrow By: Interfaces facing your AI fabric
Time range: Duration of the training job
Report type: Traffic volume (bytes)

What you’re looking for: a roughly uniform distribution of sent bytes across all training node IPs. Sort ascending to surface low senders at the top. Any node sitting significantly below the cluster mean across the full training window is your candidate straggler.
For checkpoint write visibility, applicable only if your storage backend uses TCP transport, the same structure applies with Source IP in [GPU Training Nodes] and Destination IP in [Storage / Checkpointing], grouped by time interval. You’re looking for periodic traffic that matches your checkpoint schedule. If your framework checkpoints on elapsed time or loss improvement rather than a fixed step interval, the pattern will not be regular, and the absence of regularity is not itself a signal of a problem.
The bottom line
Flow analysis has been a standard network operations tool for decades. In AI environments, it can still be useful, but mainly for the parts of the infrastructure that run over normal TCP/IP networks. Flow data shows traffic volume, direction, timing, and communication patterns; it won’t tell you anything about GPU performance or diagnose RDMA fabric issues.
Where it tends to help most is with things like spotting large outbound transfers, validating network segmentation, observing dataset loading or checkpoint traffic on TCP-based storage, and supporting capacity planning with real traffic data.
Tools like Noction Flow Analyzer work well for this kind of visibility because they let you group infrastructure components, explore traffic patterns, and create versatile alerts. If you already collect flow data from your GPU infrastructure, organizing it around your compute, storage, and external networks can make it easier to notice when traffic patterns start to drift from the norm.