[netperf-talk] CPU Utilization Issues
Rick Jones
rick.jones2 at hp.com
Fri Feb 5 14:12:53 PST 2010
Lentz, Benjamin A. wrote:
> Thank you for responding. Some responses below:
>
>
>>>
>>>The CPU utilization is so high, in fact, that Oracle Clusterware
>>>begins node eviction procedures when this test is performed while
>>>Clusterware is running.
>
>
>>How high is "so high?"
>
>
> This is difficult to quantify but perfectly valid. The behavior we've
> seen is that the system is unresponsive via ssh, even when the
> interfaces being load tested are different from those being used to
> access the system via SSH. It also causes a hang long enough to make
> Oracle Clusterware believe that the node has hung and that it needs to
> begin the node eviction process in order for the cluster's data
> integrity to be maintained.
If you add the -c and -C options to netperf, what CPU utilization does
it report?
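As a cross-check on whatever -c and -C report, a per-core view can help,
since a single core pegged at 100% can hide behind a modest average. What
follows is only a minimal Python sketch, not part of netperf. It assumes a
Linux /proc/stat with the usual user, nice, system, idle, iowait, irq,
softirq and steal columns, and the function names are made up for
illustration.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and anything that is not "cpuN".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Assumed column order: user nice system idle iowait irq softirq steal.
        # Later columns (guest, guest_nice) are already counted in user/nice.
        vals = [int(v) for v in parts[1:9]]
        idle = vals[3] + (vals[4] if len(vals) > 4 else 0)
        total = sum(vals)
        cores[parts[0]] = (total - idle, total)
    return cores


def utilization(before, after):
    """Percent busy per core between two parse_proc_stat() snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def per_core_utilization(interval=1.0):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_proc_stat(f.read())
    return utilization(before, after)
```

Calling per_core_utilization() on each system while the test runs would
show whether one core is saturated while the others sit idle.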
BTW, is this the system on which netperf is running that is thought to
be dead, or the one that is the target of the netperf test (i.e. where
netserver is running)?
>> The netperf UDP_STREAM test has no flow-control of its own.
>> Netperf will sit there and send UDP datagrams as fast as it can.
>> If there is no intra-stack flow control, it will take a core to
>> 100% CPU utilization. If the path length of the stack is long
>> enough relative to the "oomph" of the core, it will go to 100%
>> utilization on a core even on a stack with intra-stack flow control
>> if that bottleneck is reached before the link itself.
>
>
> Is it possible that our interconnect throughput is over running the
> throughput in other areas of the system?
Perhaps. Can you provide information such as the output of
/proc/interrupts and then /proc/irq/<irq>/smp_affinity for some of the NICs?
Is the irqbalance service running?
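If it is easier than pasting the raw files, something along these lines
could gather that information. It is a rough Python sketch that assumes
the common Linux /proc/interrupts layout (a header row of CPU names, then
"IRQ: one count per CPU, description") and a hex bitmask in smp_affinity.
The function names and the "eth0" default are placeholders, not anything
from your systems.

```python
def parse_interrupts(text, device):
    """Return {irq: (per_cpu_counts, description)} for lines mentioning device."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        # Only numbered IRQs have an smp_affinity file; skip NMI, LOC, etc.
        if not sep or not label.isdigit():
            continue
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus]]
        desc = " ".join(fields[ncpus:])
        if device in desc:
            result[int(label)] = (counts, desc)
    return result


def affinity_cpus(mask):
    """Decode an smp_affinity hex mask such as '00000000,00000003' to CPU ids."""
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def report(device="eth0"):
    """Print interrupt counts and allowed CPUs for each IRQ of `device`."""
    with open("/proc/interrupts") as f:
        irqs = parse_interrupts(f.read(), device)
    for irq, (counts, desc) in sorted(irqs.items()):
        with open("/proc/irq/%d/smp_affinity" % irq) as f:
            cpus = affinity_cpus(f.read())
        print(irq, desc, "counts:", counts, "allowed CPUs:", cpus)
```

If all of a NIC's interrupts land on one CPU, and that is also where
netperf or netserver is running, that core would be a likely candidate
for the bottleneck.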
>>Does Oracle Clusterware try to send traffic through the same NIC while
>>the test is running?
>
>
> Yes, indeed, and this is another reason why we believe there is a
> conflict with testing the interconnect with netperf whilst Clusterware
> is running. Oracle has confirmed that we shouldn't be doing this at the
> same time.
>
> However! Our vendor argues that they can indeed test with netperf while
> Clusterware is running, and they identify this as a problem with our
> environment.
You are 100% certain they are running a UDP_STREAM and not a TCP_STREAM
test? Are you using the exact same command line as they are? They
haven't done something like ./configure --enable-intervals and "paced"
the UDP_STREAM test?
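To illustrate why pacing would matter, here is a small Python sketch of
the two behaviours. It is not netperf's code, only an approximation of
the idea: with burst=0 the loop sends as fast as the stack will accept
datagrams, while with burst > 0 it sends that many and then sleeps. The
names udp_send, burst and interval are invented for the example.

```python
import errno
import socket
import time


def udp_send(dest, payload_size=1024, duration=0.2, burst=0, interval=0.0):
    """Send UDP datagrams to dest for `duration` seconds; return the count sent.

    burst == 0: unpaced, send continuously with no flow control.
    burst > 0:  paced, sleep `interval` seconds after every `burst` sends.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_size
    sent = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            try:
                sock.sendto(payload, dest)
            except OSError as exc:
                # Some stacks report a full local queue; treat it as a drop.
                if exc.errno != errno.ENOBUFS:
                    raise
                continue
            sent += 1
            if burst and sent % burst == 0:
                time.sleep(interval)
    finally:
        sock.close()
    return sent
```

The unpaced call keeps a core busy for the whole run, while the paced
one spends most of its time asleep, which would leave room for other
traffic and other processes.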
>>Out of curiosity, what led you to want to run a UDP_STREAM test over
>>say a TCP_STREAM test?
>
>
> This is a good question. Our vendor is following Oracle Metalink
> Documents 810394.1 and 563566.1, which appear to make reference to
> netperf.
>
> Oddly enough, all TCP_STREAM testing the vendor did seems to work fine.
TCP_STREAM benefits from TCP's flow control. It will not run wild and
potentially fill the driver's transmit queue. Any other traffic will
then have a much better chance of actually getting through.
> Certainly! We, as an Oracle and Red Hat customer, wishing to provide our
> vendor with compelling evidence that our interconnect is actually
> operating correctly, find ourselves entirely at the mercy of your
> utility; our vendor refuses to make further progress with our systems
> without being 100% certain that netperf confirms that our network
> connection between our two systems is operating flawlessly.
I love to see people using netperf. However, I have to remind folks that
netperf is a *performance benchmark* - it is *not* a functional test
tool, even though it can often uncover functional problems in a stack
or NIC or whatnot.
If the desire is to see that the interconnect is working properly, I
would think that "successful" netperf tests without the clustering
software running would be sufficient. Of course I am a networking guy,
not a clustering or database type :)
rick jones