[netperf-talk] global question concerning Netperf test and SMP support
Rick Jones
rick.jones2 at hp.com
Fri Mar 30 10:23:11 PDT 2012
Indeed, if one system can only achieve 7 Gbit/s over loopback, I would
not have terribly high expectations of its ability via a network interface.
> Ok for the netstat and ethtool stats. This is a good alternative to the CPU
> utilization provided by Netperf. I will look in this direction.
If you use a later version of netperf - eg 2.5.0 or top of trunk - you
can ask it to emit the ID and utilization of the most utilized CPU on
either side, using the omni output selection mechanism:
raj at tardy:~/netperf2_trunk$ src/netperf -c -C -H 192.168.1.3 -- -o
throughput,local_cpu_util,local_cpu_peak_util,local_cpu_peak_id,remote_cpu_util,remote_cpu_peak_util,remote_cpu_peak_id
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.1.3 () port 0 AF_INET : demo
Throughput,Local CPU Util %,Local Peak Per CPU Util %,Local Peak Per CPU
ID,Remote CPU Util %,Remote Peak Per CPU Util %,Remote Peak Per CPU ID
941.36,8.28,28.50,0,44.55,88.71,0
http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection
> How could it be faster in TCP than in UDP... Is my server so limiting?
Differences in offload support by the NIC and in protocol behaviour. For
UDP, if you send more than the MTU less the IP and UDP header sizes, the
IP datagram carrying the UDP datagram will be fragmented, and if any of
those fragments is lost, the entire IP datagram, and by extension the UDP
datagram, is effectively lost. There is also the issue of overflowing the
receiving socket, as UDP has no flow control.
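For example, if you want to compare TCP and UDP with IP fragmentation out
of the picture, keep the UDP send size at or below the MTU less the IP and
UDP headers. A rough sketch, assuming a standard 1500-byte Ethernet MTU
(so 1500 - 20 - 8 = 1472 bytes) and the same destination as above:

$ netperf -t UDP_STREAM -H 192.168.1.3 -- -m 1472
$ netstat -su        # on the receiver; "packet receive errors" hints at
                     # socket-buffer overflow

The test-specific -m option sets the send message size; netstat -su is
just one way to see whether the receiver dropped datagrams for lack of
socket buffer space.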
> >rick: Based on how I interpret your question, the TCP/IP stack is
> >fully SMP.
> >However... a single "flow" (eg TCP connection) will not make use of the
> >services of more than one or possibly two CPUs on either end. One,
> >unless one binds the netperf/netserver to a CPU other than the one
> >taking interrupts from the NIC.
>
> Ok for this, but I read that it is better to get the TCP connection and
> the NIC interrupts on the same CPU or group of CPUs for memory access.
Well, it is most efficient if the interrupt and netperf run on the same
CPU - it minimizes cache-to-cache transfers. However, if one is CPU-cycle
bound, the decrease in efficiency can (at least sometimes) be made up for
by the doubling of available CPU cycles that comes from involving a
second CPU.
If that second CPU shares a cache somewhere in the hierarchy, the loss of
efficiency will not be as large.
Going beyond being on the same CPU, it is next best if everything is in
the same NUMA node. A cache may not be shared, but the memory accesses
will be local rather than remote.
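As a concrete sketch of trying both placements yourself: netperf's global
-T option binds netperf and/or netserver to particular CPUs, and an IRQ's
smp_affinity mask shows (and can set) where the NIC's interrupts land.
The IRQ number below is only a placeholder:

$ cat /proc/irq/45/smp_affinity      # which CPUs may service IRQ 45
$ netperf -T 0,0 -H 192.168.1.3      # bind local netperf to CPU 0 and
                                     # remote netserver to CPU 0

Run it once with netperf on the interrupt CPU and once on a different CPU
(ideally one sharing a cache, then one in another NUMA node) and compare
the throughput and the per-CPU utilization shown earlier.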
Some of the confusion may stem from terminology. The terminology I use
is: there are processors (ie chips), which have cores, and those cores
may have threads. At the OS level, all the threads/cores of the
processor(s) in the system will be presented as CPUs.
It probably doesn't help when something like /proc/cpuinfo uses
"processor" as a label - in a place where I would say CPU.
Others may have different thoughts on terminology.
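On Linux, lscpu summarizes that mapping in one place. Purely as an
illustration (the numbers are made up), a single-socket machine with four
cores and two hardware threads per core shows up to the OS as 8 CPUs:

$ lscpu | egrep 'Socket|Core|Thread|^CPU\(s\)'
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1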
> For my server, the interrupts are shared out between my 8 cores due to
> architecture considerations.
You mean to say that the IRQ processing of a given IRQ is shared between
cores?
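/proc/interrupts is one way to check - each IRQ is a row and each CPU a
column, so you can see whether a single IRQ's counts really are spread
across the cores, or whether a multi-queue NIC simply has several IRQs,
each landing on its own core. The interface name, IRQ numbers and counts
below are made up, and the CPU columns are trimmed:

$ grep eth0 /proc/interrupts
 45:  1234567        0        0        0   PCI-MSI-edge   eth0-rx-0
 46:        0  7654321        0        0   PCI-MSI-edge   eth0-rx-1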
> For my client, the interrupts are located on a single CPU.
> Is it the spinlock which determines which core processes the TCP/IP stack?
No, I do not believe that a spinlock determines on which core TCP is
processed.
> A last question concerning TCP/IP stack: TCP/IP input and TCP/IP output
> are distinct, could and should they run in a separate core?
Well, they are distinct, but they have intersection points. Where
things run is a very complicated "It depends" sort of thing, and I will
probably get some of it wrong :)
If we ignore TCP for the moment and just think about traffic going in
one direction - say UDP (ie no ACKs) - and consider UDP_STREAM on the
netperf side, the outbound processing will go from the sendto() call
made by netperf to the queueing of the packet to the NIC or transmit
queue on the CPU on which netperf runs. The NIC driver still must
process transmit completions from the NIC. Where that happens will
depend on the way the NIC and driver work. It could check for transmit
completions on the way down the stack, in which case those will happen
on the same CPU as netperf runs on. It could instead rely on transmit
completion interrupts. Those will happen on whatever CPU(s) get
assigned the IRQ associated with the transmit completion interrupts.
On inbound, the NIC will raise an interrupt, the driver will be invoked
on that CPU, it will then pass up through IP and UDP on that CPU, and
then go to wake up the receiving process. If that process is not bound
to a specific CPU, the scheduler will attempt to run it on the same CPU.
That was the simple case. When there are ACKs, blocking on full sockets,
and intra-stack flow control, the story becomes more complicated as to
where "outbound" and "inbound" processing takes place :)
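One way to see where that processing actually lands on your system is to
watch per-CPU user and softirq time while a test runs, eg with mpstat
from the sysstat package (just an illustration, not the only way):

$ netperf -t UDP_STREAM -H 192.168.1.3 -l 30 &
$ mpstat -P ALL 1     # %usr shows where netperf runs, %soft shows where
                      # the driver/stack softirq work is being done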
It would be good if you could tell us a bit more about this server - it
has 8 cores, but are they all in one processor/chip, or in distinct
chips? Is the system NUMA or UMA? What sort of processor frequency?
All those sorts of things.
happy benchmarking,
rick jones