Receive flow steering
Today's increasing bandwidth and faster networking hardware have made it difficult for a single CPU to keep up. Multiple cores and packages have helped matters on the transmit side, but the receive side is trickier. Tom Herbert's receive packet steering (RPS) patches, which we looked at back in November, provide a way to steer packets to particular CPUs based on a hash of the packet's protocol data. Those patches were applied to the network subsystem tree and are bound for 2.6.35, but now Herbert is back with an enhancement to RPS that attempts to steer packets to the CPU on which the receiving application is running: receive flow steering (RFS).
RFS uses the RPS hash table to store the CPU of an application when it calls recvmsg() or sendmsg(). Instead of picking an arbitrary CPU based on the hash and a CPU mask optionally set by an administrator, as RPS does, RFS tries to use the CPU where the receiving application is running. Based on the hash calculated on the incoming packet, RFS can look up the "proper" CPU and assign the packet there.
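At its core, RFS is a hash-indexed map from flow to CPU: the socket calls write the CPU they ran on, and the receive path reads that value back to pick a target CPU. A deliberately simplified sketch of the two halves (the names, table size, and layout here are made up for illustration and are not the kernel's actual code):

    #define FLOW_TABLE_SIZE 4096  /* power of two; example value only */

    static unsigned short desired_cpu[FLOW_TABLE_SIZE];

    /* recvmsg()/sendmsg() side: remember where the application is running. */
    static void rfs_record(unsigned int flow_hash, unsigned int this_cpu)
    {
        desired_cpu[flow_hash & (FLOW_TABLE_SIZE - 1)] = this_cpu;
    }

    /* Receive side: steer a packet with this flow hash toward that CPU. */
    static unsigned int rfs_lookup(unsigned int flow_hash)
    {
        return desired_cpu[flow_hash & (FLOW_TABLE_SIZE - 1)];
    }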
The RPS CPU masks, which can be set via sysfs for each device (and per queue for devices with multiple queues), specify which CPUs can be assigned to process a packet. But dynamically changing those values introduces the possibility of out-of-order packets. For RPS, with its largely static CPU masks, that was not necessarily a big problem. For RFS, however, multiple threads reading from the same socket, while potentially bouncing around to different CPUs, would cause the CPU value in the hash table to change frequently, increasing the likelihood of out-of-order packets.
For RFS, that was considered to be a "non-starter", Herbert said, so a different approach was required. To eliminate the out-of-order packets, two types of hash tables are created, both indexed by the hash calculated from the packet information. The global rps_sock_flow_table is populated by the recvmsg() or sendmsg() call with the CPU number where the application is running (this is called the "desired" CPU). Each device queue then gets a rps_dev_flow_table, which contains the most recent CPU used to handle packets for that connection (which is called the "current" CPU). In addition, the value of the tail queue counter for the current CPU's backlog queue is stored in the rps_dev_flow_table entry.
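The shape of those two tables can be sketched as follows; this is a simplification for illustration, as the real kernel structures add RCU bookkeeping, sizing, and other details:

    /* Global table: the "desired" CPU, written by recvmsg()/sendmsg(). */
    struct rps_sock_flow_table {
        unsigned int mask;        /* number of entries minus one */
        unsigned short ents[];    /* desired CPU, indexed by flow hash */
    };

    /* One entry per flow in a per-device-queue table. */
    struct rps_dev_flow {
        unsigned short cpu;       /* "current" CPU for this flow */
        unsigned int last_qtail;  /* backlog tail counter when last enqueued */
    };

    struct rps_dev_flow_table {
        unsigned int mask;
        struct rps_dev_flow flows[];
    };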
The two CPU values are compared when deciding which CPU will process the packet (which is done in get_rps_cpu()). If the current CPU (as determined from the rps_dev_flow_table hash table) is unset (presumably for the first packet) or that CPU is offline, the desired CPU (from rps_sock_flow_table) is used. If the two CPU values are the same, that CPU is obviously used; but if both are valid CPU numbers that differ, the backlog tail queue counter is consulted.
Backlog queues have a queue head counter that gets incremented when packets are removed from the queue. Using that and the queue length, a queue tail counter value can be calculated. That is what gets stored in rps_dev_flow_table. When the kernel makes its decision about which CPU to assign the packet to, it needs to consider both the current (really last used by the kernel) CPU and the desired (last used by an application for sending or receiving) CPU.
The kernel compares the current CPU's queue tail counter (as stored in the hash table) with that CPU's queue head counter. If the tail counter is less than or equal to the head counter, all of the packets that this connection put on the queue have already been processed, which in turn means that switching to the desired CPU will not result in out-of-order packets.
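Putting those pieces together, the decision might look roughly like the following sketch, loosely modeled on what get_rps_cpu() does with the simplified structures above (the helper name and sentinel value are illustrative):

    #define RPS_NO_CPU 0xffff  /* sentinel: no "current" CPU recorded yet */

    /*
     * 'head' is the head counter of the current CPU's backlog queue;
     * flow->last_qtail is the tail counter recorded when this flow last
     * enqueued a packet there.
     */
    static unsigned int choose_rfs_cpu(unsigned int desired_cpu,
                                       struct rps_dev_flow *flow,
                                       unsigned int head)
    {
        unsigned int cur_cpu = flow->cpu;

        /* First packet for this flow, or the old CPU went offline. */
        if (cur_cpu == RPS_NO_CPU || !cpu_online(cur_cpu))
            return desired_cpu;

        /* The application has not moved; nothing to decide. */
        if (cur_cpu == desired_cpu)
            return cur_cpu;

        /*
         * Tail <= head: every packet this flow queued on the old CPU has
         * already been processed, so moving to the desired CPU cannot
         * reorder packets within the flow.
         */
        if ((int)(head - flow->last_qtail) >= 0)
            return desired_cpu;

        /* Otherwise stay on the old CPU to preserve ordering. */
        return cur_cpu;
    }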
Herbert's current patch is for TCP, but RFS should be "usable for other flow oriented protocols". The benefit is that it can achieve better CPU locality for the processing of the packet, both by the kernel and the application itself. Depending on various factors (cache hierarchy and the application are given as examples), it can and does increase the packets per second that can be processed, as well as lowering the latency before a packet gets processed. But, interestingly, "on simple benchmarks, we don't necessarily see improvement and sometimes see degradation".
For more complex benchmarks, the performance increase looks to be significant. Herbert gave numbers for a netperf run where the transactions per second went from 104K without either RFS or RPS, to 290K for the best RPS configuration, and to 303K with RFS and RPS. A different test, with 100 threads handling an RPC-like request/response with some user-space work being done, was even more dramatic. That test showed 103K, 174K, and 223K respectively, but also showed a marked decrease in the latency for both RPS and RPS + RFS.
These patches are coming from Google, which has been known to process a few packets using the Linux kernel. If RFS is being used on production systems at Google, that would seem to bode well for its reliability and performance beyond just benchmarks. The patches were posted on April 2 and seemed to be generally well-received, so it's a little early to tell when they might make it into the mainline, but it seems rather likely that we will see them in either 2.6.35 or 2.6.36.
Index entries for this article
Kernel: Networking
Posted Apr 8, 2010 7:44 UTC (Thu) by intgr (subscriber, #39733)
1. Daemon gets scheduled to processor A
2. All incoming packets are steered to processor A
3. Processor gets loaded due to processing of incoming packets
4. Scheduler decides that the daemon could get more work done and reschedules it to processor B
5. Repeat

So the daemon and packet processing thread would be playing cat-and-mouse and repeatedly jumping through different processors.

I guess for just plain TCP, the processing overhead can be quite low, but if you add iptables processing, packet capturing, etc., it can become significant.
Posted Aug 3, 2010 14:15 UTC (Tue) by SEJeff (guest, #51588)
watch -n1 cat /proc/interrupts
Look for the line that says "Rescheduling interrupts". If your apps are loading up with some sort of I/O, such as SSD block I/O or very fast network I/O, the scheduler tends to flip out. That causes your app to slow down due to being rescheduled. The solution is to set up the affinity of your block devices, network cards, or applications.
In short... depending on your hardware and applications, it can still very much be an issue.
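For the application part of that advice, a process can be pinned to a CPU with sched_setaffinity(); here is a minimal user-space example (the CPU number is an arbitrary choice for illustration):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);  /* pin this process to CPU 2 (example) */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }

IRQ affinity for network cards and block devices is set separately, for example through /proc/irq/<N>/smp_affinity.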
Posted Aug 6, 2011 2:41 UTC (Sat) by ZhuYanhai (guest, #44977)
1. Daemon gets scheduled to processor A
2. Processor A gets very busy
3. Scheduler decides that the daemon could get more work done and reschedules it to processor B
4. Repeat

So I think the scheduler should already have some methods to detect and prevent such thrashing.