Friday, July 20, 2012

TIPC: communication on a cluster

Tasty.


Inter-process communication on a cluster is an interesting niche.  10GbE networking hardware is not just capable of higher bandwidth, it also has lower latency.  Using the network as an inter-process communication mechanism within a cluster makes sense, since two processes which need to talk may not be running on the same node.  In fact, in some instances (e.g. replication) it is likely a very good idea that such processes do not run on the same cluster node.

Most software people have some experience with TCP/IP, since it's pretty much running the show on the internet (HTTP, for instance).  Less common is the use of UDP, which provides fewer guarantees (no reliable delivery) but has the added flexibility of being connection-less, so it can be quite good for broadcasting information.  Both of these options are somewhat legacy, and other alternatives have been imagined specifically for single-hop networks in the context of clusters (so LAN distances).

TIPC (Transparent Inter-Process Communication) is one such option, invented by Jon Paul Maloy, an engineer at Ericsson Canada.  It is now an open source project with several contributors.

Alternatives


There are alternatives to TIPC which are similar (of these I only know Spread -- as of version 4.0.0 -- well from experience) and which may be worth investigating as well:

  • RDS (Reliable Datagram Sockets)
  • PGM (Pragmatic General Multicast)
  • QSM (Quicksilver Scalable Multicast)
  • Spread Toolkit

The Spread Toolkit is a user-space solution consisting of a daemon running on every node, a library for applications to talk to it, messaging semantics based on groups (similar to the concept of IRC, actually), and an inter-daemon protocol implemented as a token ring over UDP.  The Spread documentation contains some fairly clear language that it is not meant to be used for real-time applications.  Unfortunately, it found its way into some real-time applications that I care about.  One particular pitfall was running the token ring through several servers and firewalls with different endianness (SPARC and x86) and different speed characteristics.  The combination was somewhat of a disaster from a performance perspective, since the inter-daemon flow control code seemed to work better when the nodes in the Spread network were more equal.

Linux Installation of TIPC and Version Determination


TIPC was mainly developed on Linux as a loadable kernel module.  You can install the userland tools on Debian (my cluster runs Wheezy) with:

 # sudo apt-get install tipcutils

To see which version(s) of the TIPC module are present under /lib/modules, you can use this command:

 # strings $(find /lib/modules/ -name tipc.ko) | grep ^version=
version=1.6.4
version=2.0.0

A grep for TIPC in /var/log/syslog is also a good idea after you have the kernel module loaded.  In my case TIPC 2.0.0 was running on the 3.2.0 x86_64 Linux kernel.  Some notes I have on TIPC versions:


  • 1.6.x: had a native API as well as Berkeley sockets (via AF_TIPC)
  • 1.7.x: uncommon to find in a Linux distro; was ported to Solaris (and I tried that out)
  • 2.0.x: found commonly with Linux kernels past 2.6.30; the native API was dropped

I think there may be code out there for a 2.1.x release, but I have not tried it.  The main admin interface to TIPC is the tipc-config command, which can report statistics, list open ports, configure bearers, and set window sizes.  One gotcha I found was that I had to build it from source in order to get it to properly configure the window size on the broadcast link.


Node Addressing and Scope


Nodes in the cluster are addressed using a Z.C.N triple consisting of the zone (a cluster of clusters) Z, cluster C, and node N.  If you are like me and just have the one cluster, you live in 1.1.N for all of your nodes (the limit is 4095 of them, in contrast to the 255 of an IPv4 octet).  When opening a socket, the API allows the programmer to select the scope of the socket as node, cluster, or zone.  This lets you publish a service that is running somewhere in the cluster but is accessible to all nodes there.  The same is usually accomplished with VIP management when using TCP.
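
To make the addressing concrete, here is a tiny sketch using the address helpers that ship in linux/tipc.h (TIPC 2.0 era); the 1.1.10 node is just the example address from above:

/* Pack and unpack a Z.C.N address, and show the scope constants. */
#include <linux/tipc.h>
#include <stdio.h>

int main(void)
{
    /* tipc_addr() packs zone.cluster.node into the 32-bit TIPC address. */
    __u32 node = tipc_addr(1, 1, 10);   /* node 1.1.10 */

    printf("1.1.10 packs to 0x%08x\n", node);
    printf("zone=%u cluster=%u node=%u\n",
           tipc_zone(node), tipc_cluster(node), tipc_node(node));

    /* Scope constants used when binding a name to a socket: */
    printf("TIPC_ZONE_SCOPE=%d TIPC_CLUSTER_SCOPE=%d TIPC_NODE_SCOPE=%d\n",
           TIPC_ZONE_SCOPE, TIPC_CLUSTER_SCOPE, TIPC_NODE_SCOPE);
    return 0;
}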

Socket Types


The socket types that are supported are:


  • SOCK_STREAM: byte-oriented reliable stream much like TCP
  • SOCK_DGRAM: message-oriented unreliable datagram multicast like UDP
  • SOCK_RDM: message-oriented reliable datagram multicast
  • SOCK_SEQPACKET: message-oriented reliable stream


Message-oriented communication is nicer to program with in some ways, since you don't have to deal with partial messages being sent or received (only whole messages are ever delivered).  The largest message size (or cumulative message size with scatter-gather calls like recvmsg/sendmsg) is 66000 bytes.  This number is important to keep in mind since it has an effect on things like the effective MAX_IOV.

SOCK_STREAM is interesting as well, since sockets on several nodes in the cluster can bind to the same port name, which causes TIPC to balance traffic between those service endpoints.  You can see how some choices have been made to gear TIPC up for highly available services.

The socket type I found to be the most immediately interesting was SOCK_RDM (message-oriented, reliable multicast).  Being able to send the same message to multiple endpoints at the same time seems really powerful.  The reliable part requires some explanation.  From my reading of various forum postings, the message is reliably delivered to another node when it is acknowledged by that node.  If the message cannot be delivered, it is possible to set things up so that the first 1024 bytes of that message are returned to the sending socket (letting the sending application know that it failed to be received).  The best practice for being sure that an application has received a message is to have that application send an explicit acknowledgement message back to the sender.  For more information please consult the older or newer documentation.
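
Here is a minimal sketch of what a SOCK_RDM multicast sender looks like with the AF_TIPC socket API.  The name type 1000 and instance range 0..99 are made up for illustration, and turning TIPC_DEST_DROPPABLE off is, as far as I can tell, how you ask for the returned-message behaviour described above:

/* Minimal SOCK_RDM multicast sender sketch (name type and range are illustrative). */
#include <linux/tipc.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_TIPC
#define SOL_TIPC 271   /* from linux/socket.h, in case libc does not define it */
#endif

int main(void)
{
    int sd = socket(AF_TIPC, SOCK_RDM, 0);
    if (sd < 0) { perror("socket"); return 1; }

    /* Ask TIPC to return (the head of) undeliverable messages instead of
       silently dropping them. */
    int off = 0;
    if (setsockopt(sd, SOL_TIPC, TIPC_DEST_DROPPABLE, &off, sizeof(off)) < 0)
        perror("setsockopt(TIPC_DEST_DROPPABLE)");

    /* Multicast to every socket bound to name type 1000,
       instances 0..99, anywhere in the cluster. */
    struct sockaddr_tipc dst;
    memset(&dst, 0, sizeof(dst));
    dst.family = AF_TIPC;
    dst.addrtype = TIPC_ADDR_MCAST;
    dst.addr.nameseq.type = 1000;
    dst.addr.nameseq.lower = 0;
    dst.addr.nameseq.upper = 99;

    const char *msg = "hello, cluster";
    if (sendto(sd, msg, strlen(msg) + 1, 0,
               (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("sendto");

    close(sd);
    return 0;
}

Every socket in the cluster bound to a name that overlaps {1000, 0..99} gets its own copy of the message.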

Configuration


I get the module loaded using the following bash script on my cluster (the node machines are node010 and node011, which are DRBL-assigned names based on their IP addresses in the 192.168.1.x range):

#!/bin/bash

mussh -h node010 -h node011 -c '
# bring up the 10GbE interface (sfc driver); node0XX gets 192.168.3.XX
modprobe sfc
export SUFFIX=`hostname | sed "s/node0//g"`
ifconfig eth1 192.168.3.$SUFFIX
# load TIPC and make eth1 a bearer; the node becomes 1.1.$SUFFIX
modprobe tipc
tipc-config -netid=1234 -addr=1.1.$SUFFIX -be=eth:eth1
lsmod | grep tipc
# tipc-config built from source, used to set the broadcast-link window
/home/mark/git/tipcutils/tipc-config/tipc-config -lw=broadcast-link/100
'

Here I am running the TIPC network through the 10GbE eth1 interfaces using the sfc kernel driver.  For example, cluster node node010 has the 1GbE address 192.168.1.10 on eth0 and the 10GbE address 192.168.3.10 on eth1, and this latter interface is a TIPC bearer where the node is known uniquely as 1.1.10.  I find that this scheme is easy to remember.

Functional Addressing


I already mentioned that you can open a socket with cluster scope, which makes it accessible to the whole cluster (client sockets connect to the port without naming a specific node).  This is very neat, since you can easily set up services within the cluster as if it were one big happy computer (ideal for dealing with a cluster).

The concept of functional addressing also allows you to have sub-ranges of instances within a port name, which lets messages be filtered based on how each socket is set up.  If a server binds {portname,lower,upper} (where lower is less than or equal to upper), it is saying in effect that it is interested in messages addressed to {portname,lower}, {portname,lower+1}, ..., {portname,upper}.  Suppose we had a mini-blogging service where portname is 999 and the instance identifies a particular user id -- listening to {portname,instance} then gives you a single user's feed.  A bit of a contrived example, but you get the idea; a sketch of it in code follows below.
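
Here is that contrived mini-blogging example sketched against the AF_TIPC socket API.  The port name 999 comes from the text; the 0..9999 user id range and the function names are made up for illustration:

/* Sketch of functional addressing with a made-up mini-blogging service. */
#include <linux/tipc.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Server: one socket handles every user id in [0,9999], cluster-wide. */
static int bind_feed_range(void)
{
    int sd = socket(AF_TIPC, SOCK_RDM, 0);
    struct sockaddr_tipc srv;

    memset(&srv, 0, sizeof(srv));
    srv.family = AF_TIPC;
    srv.addrtype = TIPC_ADDR_NAMESEQ;
    srv.scope = TIPC_CLUSTER_SCOPE;   /* reachable from any node */
    srv.addr.nameseq.type = 999;      /* the port name */
    srv.addr.nameseq.lower = 0;
    srv.addr.nameseq.upper = 9999;

    if (sd < 0 || bind(sd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("bind");
        return -1;
    }
    return sd;
}

/* Client: post to one user's feed by name only -- no node address needed. */
static int post_to_user(int sd, __u32 user, const char *text)
{
    struct sockaddr_tipc dst;

    memset(&dst, 0, sizeof(dst));
    dst.family = AF_TIPC;
    dst.addrtype = TIPC_ADDR_NAME;
    dst.addr.name.name.type = 999;
    dst.addr.name.name.instance = user;
    dst.addr.name.domain = 0;         /* 0 = let TIPC find it anywhere in scope */

    return sendto(sd, text, strlen(text) + 1, 0,
                  (struct sockaddr *)&dst, sizeof(dst));
}

int main(void)
{
    int srv = bind_feed_range();
    int cli = socket(AF_TIPC, SOCK_RDM, 0);
    char buf[128] = "";

    if (srv < 0 || cli < 0)
        return 1;
    if (post_to_user(cli, 42, "hello user 42") < 0)
        perror("sendto");
    if (recvfrom(srv, buf, sizeof(buf), 0, NULL, NULL) > 0)
        printf("feed 42 got: %s\n", buf);
    return 0;
}

The nice part is that the client never mentions a node address: if the server later moves to another node, or a second server binds an overlapping range, the client code does not change.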

Name Subscriptions


Another mechanism exists for programs to subscribe to significant events, like a program binding to a particular port name and instance range, or a node going down completely.  I have not really played with this functionality very much, but it seems like a very useful way of coding fail-safe scenarios (not sending a message if there is nobody bound and listening).
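
For completeness, here is roughly what a name subscription looks like, based on my reading of the topology service interface in linux/tipc.h.  The watched name type 999 is the example from above, and I have not battle-tested this, so treat it as a sketch:

/* Sketch: subscribe to publish/withdraw events for name type 999. */
#include <linux/tipc.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);
    struct sockaddr_tipc topsrv;
    struct tipc_subscr sub;
    struct tipc_event evt;

    /* The topology service listens on the reserved name {TIPC_TOP_SRV, TIPC_TOP_SRV}. */
    memset(&topsrv, 0, sizeof(topsrv));
    topsrv.family = AF_TIPC;
    topsrv.addrtype = TIPC_ADDR_NAME;
    topsrv.addr.name.name.type = TIPC_TOP_SRV;
    topsrv.addr.name.name.instance = TIPC_TOP_SRV;

    if (sd < 0 || connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0) {
        perror("connect to topology service");
        return 1;
    }

    /* Watch for anyone binding/unbinding name type 999, instances 0..9999. */
    memset(&sub, 0, sizeof(sub));
    sub.seq.type = 999;
    sub.seq.lower = 0;
    sub.seq.upper = 9999;
    sub.timeout = TIPC_WAIT_FOREVER;
    sub.filter = TIPC_SUB_SERVICE;

    if (send(sd, &sub, sizeof(sub), 0) != sizeof(sub)) {
        perror("send subscription");
        return 1;
    }

    /* Each event says whether a matching name was published or withdrawn. */
    while (recv(sd, &evt, sizeof(evt), 0) == sizeof(evt)) {
        printf("%s: {%u,%u..%u} on port <%u:%u>\n",
               evt.event == TIPC_PUBLISHED ? "published" : "withdrawn",
               sub.seq.type, evt.found_lower, evt.found_upper,
               evt.port.node, evt.port.ref);
    }
    return 0;
}

A fail-safe sender could, for example, refuse to publish until it has seen a TIPC_PUBLISHED event for the name it is about to send to.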

Summary


TIPC is a really neat idea for communication between processes running in a cluster.  It makes use of several useful abstractions and conveniences which are not available with TCP directly.  It also does away with much of the complication that TCP needs in order to deal with higher-latency connections (a WAN context, for instance).  I would be very excited to learn that TIPC got the attention of the kernel-bypass work that seems to be useful for speeding up networking over 10GbE network cards.  If this happens, I get the best of both worlds (a nice cluster API, and very low latency).

