[netperf-talk] global question concerning Netperf test and SMP support

Rick Jones rick.jones2 at hp.com
Fri Mar 30 10:23:11 PDT 2012


Indeed, if one system can only achieve 7 Gbit/s over loopback, I would 
not have terribly high expectations of its ability via a network interface.
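For reference, that loopback ceiling can be measured with netperf
itself; a minimal sketch, assuming a netserver is already running on
the same machine:

# single-stream TCP over loopback for 30 seconds, with CPU utilization
# reported for both ends
netperf -H 127.0.0.1 -t TCP_STREAM -l 30 -c -C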

> Ok for the netstat and ethtool stats. This is a good alternative to the
> CPU utilization provided by Netperf. I will look in that direction.

If you use a later version of netperf - eg 2.5.0 or top of trunk - you 
can ask it to emit the ID and utilization of the most-utilized CPU on 
either side, using the omni output selection mechanism:

raj at tardy:~/netperf2_trunk$ src/netperf -c -C -H 192.168.1.3 -- -o 
throughput,local_cpu_util,local_cpu_peak_util,local_cpu_peak_id,remote_cpu_util,remote_cpu_peak_util,remote_cpu_peak_id
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET : demo
Throughput,Local CPU Util %,Local Peak Per CPU Util %,Local Peak Per CPU 
ID,Remote CPU Util %,Remote Peak Per CPU Util %,Remote Peak Per CPU ID
941.36,8.28,28.50,0,44.55,88.71,0

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection


> How could it be faster in TCP than in UDP... Is my server so limiting?

A difference in offload support by the NIC, and a difference in 
behaviour.  For UDP, if you send messages larger than the MTU less the 
IP and UDP header sizes, the IP datagram carrying the UDP datagram will 
be fragmented, and if any of the IP fragments is lost, the entire IP 
datagram - and by extension the UDP datagram - is effectively lost. 
There is also the issue of overflowing the receiving socket's buffer, 
as UDP has no flow control.
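One way to see the fragmentation side of that, as a sketch - 192.168.1.3
is just the remote host from the example above, and the exact counter
names printed by netstat vary a bit by platform:

# 1400-byte sends fit within a 1500-byte MTU, so no IP fragmentation
netperf -H 192.168.1.3 -t UDP_STREAM -l 30 -- -m 1400

# look for fragmentation, reassembly failures and receive buffer errors
netstat -s | egrep -i "fragment|reassembl|buffer errors"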


>  >rick: Based on how I interpret your question, the TCP/IP stack is
>  >fully SMP.
>  >However... a single "flow" (eg TCP connection) will not make use of the
>  >services of more than one or possibly two CPUs on either end. One,
>  >unless one binds the netperf/netserver to a CPU other than the one
>  >taking interrupts from the NIC.
>
> Ok for this, but I read that it is better to get the TCP connection and
> the NIC interrupts on the same CPU or group of CPUs for memory access.

Well, it is most efficient if the interrupt and netperf run on the same 
CPU - it minimizes cache-to-cache transfers.  However, if one is CPU 
cycle bound, the decrease in efficiency can (at least sometimes) be 
made up for by doubling the number of available CPU cycles through 
involving a second CPU.

If that second CPU shares a cache in the hierarchy then the loss of 
efficiency will not be as large.
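Netperf's global -T option makes it easy to experiment with that; a
sketch, assuming the remote NIC's interrupts land on CPU 0 (consistent
with the remote peak CPU ID of 0 in the example output above), and
192.168.1.3 again being that remote host:

# netperf and netserver bound to the same CPUs as the interrupts
netperf -H 192.168.1.3 -T 0,0 -c -C

# push netserver onto a different remote CPU - more cycles available,
# but more cache-to-cache traffic
netperf -H 192.168.1.3 -T 0,1 -c -C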

Short of being on the same CPU, it is next best if everything is at 
least in the same NUMA node.  A cache may not be shared, but the memory 
accesses will be local rather than remote.
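On Linux one can check that locality directly; a sketch, with eth0 and
node 0 as hypothetical names for your NIC and its node:

# NUMA node the NIC hangs off (-1 means the platform reports no locality)
cat /sys/class/net/eth0/device/numa_node

# node <-> CPU/memory layout, then keep netperf on the NIC's node
numactl --hardware
numactl --cpunodebind=0 --membind=0 netperf -H 192.168.1.3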

Some of the confusion may stem from terminology.  The terminology I use 
is that there are processors (ie chips) which have cores, and those 
cores may have threads.  At the OS level, all those threads/cores of 
the processor(s) in the system will be presented as CPUs.

It probably doesn't help when something like /proc/cpuinfo uses 
"processor" as a label - in a place where I would say CPU.

Others may have different thoughts on terminology.
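For what it is worth, one can see how a given Linux box presents itself
with something like the following sketch (the field names in
/proc/cpuinfo vary a bit between architectures):

# logical CPUs as the OS presents them
grep -c ^processor /proc/cpuinfo

# number of physical chips, and cores per chip
grep "physical id" /proc/cpuinfo | sort -u | wc -l
grep "cpu cores" /proc/cpuinfo | sort -u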

> For my server, the interrupts are shared out between my 8 cores due to
> architecture considerations.

You mean to say that the IRQ processing of a given IRQ is shared between 
cores?
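A quick way to check is to watch the per-CPU interrupt counts while
traffic is flowing; a sketch, assuming the NIC's IRQs are labelled with
the interface name (eth0 and IRQ 42 are hypothetical here):

# one column per CPU - do the NIC's IRQ counts grow on one CPU or several?
watch -d "grep eth0 /proc/interrupts"

# which CPUs a given IRQ is allowed to be serviced on
cat /proc/irq/42/smp_affinity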

> For my client, the interrupts are located on a single CPU.
> Is it the spinlock which determines which core processes the TCP/IP stack?

No, I do not believe that a spinlock determines on which core TCP is 
processed.

> A last question concerning TCP/IP stack: TCP/IP input and TCP/IP output
> are distinct, could and should they run in a separate core?

Well, they are distinct, but they have intersection points.  Where 
things run is a very complicated "It depends" sort of thing, and I will 
probably get some of it wrong :)

If we ignore TCP for the moment and just think about traffic going in 
one direction - say UDP (ie no ACKs) - and consider UDP_STREAM on the 
netperf side, the outbound processing will go from the sendto() call 
made by netperf to the queueing of the packet to the NIC or transmit 
queue on the CPU on which netperf runs.   The NIC driver still must 
process transmit completions from the NIC.  Where that happens will 
depend on the way the NIC and driver work.  It could check for transmit 
completions on the way down the stack, in which case those will happen 
on the same CPU as netperf runs on.  It could instead rely on transmit 
completion interrupts.  Those will happen on whatever CPU(s) get 
assigned the IRQ associated with the transmit completion interrupts.
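To see where that lands on a given system, one can pin the local
netperf to a known CPU and watch per-CPU utilization during the run; a
sketch, assuming the sysstat mpstat utility is available:

# bind only the local netperf to CPU 2 and run a UDP_STREAM test
netperf -H 192.168.1.3 -t UDP_STREAM -l 30 -T 2, &

# per-CPU utilization once a second - transmit-completion work shows up
# either on CPU 2 or on whichever CPU takes the NIC's transmit IRQ
mpstat -P ALL 1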

On inbound, the NIC will raise an interrupt, the driver will be invoked 
on that CPU, it will then pass up through IP and UDP on that CPU, and 
then go to wake up the receiving process.  If that process is not bound 
to a specific CPU, the scheduler will attempt to run it on the same CPU.

That was the simple case.  When there are ACKs, blocking on full 
sockets, and intra-stack flow control, the story becomes more 
complicated as to where "outbound" and "inbound" processing takes place :)
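The omni output selectors from above can help untangle it; a sketch of
a request/response test with a 2.5.0-or-later netperf, which exercises
inbound and outbound processing on both systems and reports the busiest
CPU on each side:

netperf -t TCP_RR -c -C -H 192.168.1.3 -- -o 
throughput,local_cpu_peak_util,local_cpu_peak_id,remote_cpu_peak_util,remote_cpu_peak_id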

It would be good if you could tell us a bit more about this server - it 
has 8 cores, but are they all in one processor/chip, or distinct chips? 
Is the system NUMA or UMA?  What sort of processor frequency?  All 
those sorts of things.
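On Linux, a couple of commands would show most of that; a sketch:

lscpu                # sockets, cores per socket, threads per core, CPU MHz
numactl --hardware   # NUMA nodes and which CPUs/memory belong to each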

happy benchmarking,

rick jones

