Thursday, August 2, 2012

Comparing 10GbE Cards

10GbE network interfaces deliver lower latency as well as higher bandwidth.  Even if you are not close to saturating the 1GbE cards in your setup, it might be worth considering the next generation of networking technology on the basis of latency alone.  In this post, I will compare a few 10GbE cards which I have had the privilege to try out.  I do have a favorite card at this point, which I will reveal at the end (hint: it has more to do with the software stack than the hardware itself).

Wired magazine online ran a series of articles earlier in the year reporting that over 60% of the network ports in the data centers managed by the internet giants were 10GbE.  The potential latency of InfiniBand (a similar but slightly older technology) is known to be under 5 microseconds.  By contrast, it is fairly typical to see 1GbE latencies hover in the 30 to 150 microsecond range (latency depends on the size of the packet payload).

In order to push a network interface to handle this much data, a modern computer with a PCIe slot that has enough lanes is required (10GbE cards typically use an x8 slot, so the x16 slot intended for a graphics card is sufficient).  Achieving the lower latencies this new hardware is capable of is challenging for the operating system (in my case Linux), since the kernel TCP/IP stack and the Berkeley sockets API begin to become the bottleneck.  Almost every manufacturer has attempted to solve this problem in its own way, providing a software workaround which achieves higher performance than what is directly available via the kernel and the standard API.
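As an aside, a quick way to confirm that a card actually negotiated enough PCIe lanes is to look at the link status reported by lspci (the 03:00.0 address below is just a placeholder for wherever your NIC shows up):

 # lspci | grep -i ethernet                 # find the NIC's PCI address
 # lspci -vv -s 03:00.0 | grep -i width     # 03:00.0 is a placeholder; LnkCap/LnkSta should both report Width x8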


Method


To test a pair of cards, I plugged them into the x16 slots on two cluster nodes and cabled them directly to each other (so no switch in between).  I then configured them for Ethernet, assigned IP addresses, and ran some benchmarks:
 
 # modprobe <driver-module>     # e.g. mlx4_en or sfc, depending on the card
 # service openibd start
 # ifconfig eth1 192.168.3.10
 # iperf -s
 # NPtcp

And on the other node:
 
 # modprobe <driver-module>     # same module as on the first node
 # service openibd start
 # ifconfig eth1 192.168.3.11
 # iperf -c 192.168.3.10
 # NPtcp -h 192.168.3.10

The iperf test mostly checks bandwidth and, for me, is just a basic sanity test. The more interesting test is NetPIPE (NPtcp), which measures latency at various packet sizes.
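Latency numbers at this scale are sensitive to scheduling jitter, so it can help to pin the benchmark to a core; a minimal sketch using taskset (core 0 is an arbitrary choice):

 # taskset -c 0 NPtcp                      # receiver, pinned to core 0
 # taskset -c 0 NPtcp -h 192.168.3.10      # transmitter on the other node, also pinned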

Testing RDMA latency on a card that provides it is a simple matter of running a bundled utility (-s indicates the payload in bytes and -t the number of iterations):


# rdma_lat -c -s32 -t500 192.168.3.10
 2935: Local address:  LID 0000, QPN 000000, PSN 0xb1c951 RKey 0x70001901 VAddr 0x00000001834020
 2935: Remote address: LID 0000, QPN 000000, PSN 0xcbb0d3, RKey 0x001901 VAddr 0x0000000165b020
 
 Latency typical: 0.984693 usec
 Latency best   : 0.929267 usec
 Latency worst  : 15.8892 usec
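The peer node runs the same utility without the remote address, which (as is the usual convention for these bundled benchmarks) makes it wait for the incoming connection:

 # rdma_lat -c -s32 -t500       # started first on 192.168.3.10; no peer address means it acts as the server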


Mellanox 



This card is widely used and very popular.  Mellanox has considerable experience with InfiniBand products, and has been able to produce cards which are capable of transporting TCP/IP traffic on top of InfiniBand technology.  For them this is mostly accomplished with kernel drivers which are part of the OFED software stack.  I found that it is best to get a snapshot of this suite of packages directly from Mellanox for one of the supported distributions (all of them, ultimately, a variation on Red Hat Linux).  Although Debian Wheezy had OFED packages in its repository, they were not recent enough for one of the newer cards I was trying.  For these reasons, I ended up dual booting my cluster into Oracle Enterprise Linux (OEL 6.1 specifically).  Debian Wheezy was able to run this card as an Ethernet interface (using the kernel TCP/IP stack); it's just that fancy things like InfiniBand and RDMA were not accessible.
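Once the OFED stack is up, a quick way to confirm that it actually sees the card, and which personality each port is running, is ibv_devinfo:

 # ibv_devinfo | grep -E 'hca_id|state|link_layer'     # expect state PORT_ACTIVE and link_layer Ethernet (or InfiniBand)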

I also managed to test a Mellanox ConnectX-3 card, but I found that its performance was not (statistically) discernible from the ConnectX-2.  If you asked me to figure out which card was in a box from its benchmarks, I would not be able to separate the ConnectX-2 and ConnectX-3 -- although presumably the newer revision does have some advantage which I did not find.


Solarflare 


Solarflare makes several models of 10GbE cards.  The value Solarflare adds is mostly in its OpenOnload technology, an alternative network stack which runs mostly in user space.  This software accesses a so-called virtual NIC interface on the card to speed up network interaction, bypassing the standard kernel TCP/IP stack.  Just like the Mellanox cards, I found that Debian Wheezy could recognize the cards and run them with the Linux kernel TCP stack, but the special drivers (OpenOnload) needed to run on OEL (I hope to attempt to build the sfc kernel driver on Wheezy soon).
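For completeness, checking which kernel driver an interface is bound to is straightforward with standard tools (eth1 here is just the interface name used earlier in this post):

 # ethtool -i eth1      # reports the bound driver (sfc for these cards) and its version
 # lsmod | grep sfc     # confirm the kernel module is loaded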


Measurements and Summary



Below is a summary of the measurements I made on these cards using various TCP stacks (vanilla indicates the Linux 3.2.0 kernel stack is being used) and RDMA.


 Communication Type      Card                  Distro  K Mod    Msg bytes  Latency
 vanilla TCP             Solarflare            OEL     sfc          32     17  usec
 vanilla TCP             Solarflare            OEL     sfc        1024     18  usec
 vanilla TCP             Mellanox ConnectX-3   OEL     mlx4_en      32     12  usec
 vanilla TCP             Mellanox ConnectX-3   OEL     mlx4_en    1024     16  usec
 vanilla TCP             Mellanox ConnectX-2   OEL     mlx4_en      32     13  usec
 vanilla TCP             Mellanox ConnectX-2   OEL     mlx4_en    1024      9  usec
 Onload user-space TCP   Solarflare            OEL     sfc          32     2.4 usec
 Onload user-space TCP   Solarflare            OEL     sfc        1024     3.6 usec
 RDMA                    Mellanox ConnectX-3   OEL     mlx4_ib      32     1.0 usec
 RDMA                    Mellanox ConnectX-3   OEL     mlx4_ib    1024     3.0 usec
 RDMA                    Mellanox ConnectX-2   OEL     mlx4_ib      32     1.0 usec
 RDMA                    Mellanox ConnectX-2   OEL     mlx4_ib    1024     3.0 usec


The big surprise in this investigation is OpenOnload (more info can be had from this presentation).  This driver is activated selectively using user-space system call interposition (so you can choose which applications run on it).  It does not require the application to be rewritten, recompiled, or rebuilt in any way.  This means, in particular, that closed-source third-party software can use it.  It's this extra flexibility which really has my attention.  Without coding to a fancy/complicated API, a developer can use familiar programming tools to create systems with low networking latency.
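To make that concrete, here is roughly what running an unmodified program under OpenOnload looks like (my_tcp_app is a stand-in for any existing binary, and EF_POLL_USEC is one of the EF_* tuning variables Solarflare documents; the exact value is illustrative):

 # onload ./my_tcp_app                         # interposes on the socket calls of an unmodified binary
 # EF_POLL_USEC=100000 onload ./my_tcp_app     # optionally busy-poll the virtual NIC for lower latency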


