The kernel connection multiplexer
KCM functionality
In its simplest form, a KCM socket can be thought of as a sort of wrapper attached to an ordinary TCP socket:
Within the kernel, an object called a "psock" sits between the application (which is using the KCM socket) and the actual TCP connection to the world. Outgoing messages are sent as formatted by the application. Incoming data is buffered until the kernel sees (via a mechanism to be discussed momentarily) that a full message has arrived; that message is then made available to the application via the KCM socket. The psock module ensures that messages are sent and received as atomic units, freeing the application from having to manage that aspect of the protocol.
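To make the framing problem concrete, here is a sketch of the buffering logic that applications currently carry in user space and that the psock layer takes over. It assumes a hypothetical protocol that prefixes each message with a four-byte, big-endian payload length; the function name is illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical framing: each message is preceded by a four-byte,
 * big-endian length field covering the payload only.  Given buf
 * holding len bytes of accumulated stream data, return the size of
 * the first complete frame (header plus payload), or 0 if more data
 * must be read first. */
static size_t first_frame_len(const uint8_t *buf, size_t len)
{
    uint32_t msglen;

    if (len < 4)
        return 0;                      /* header not complete yet */
    msglen = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
             ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    if (len < 4 + (size_t)msglen)
        return 0;                      /* payload not complete yet */
    return 4 + (size_t)msglen;
}
```

An application repeats exactly this bookkeeping in a read loop, stripping complete frames from the front of the buffer; KCM moves that work into the kernel.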
Applications are often split into multiple processes, though, each of which can deal with incoming messages in the same way. Multiplexers to dispatch messages are thus built into many applications. But, while the kernel is handling message receipt, it can also deal with the multiplexing. So the actual architecture of KCM looks a bit more like this:
Each of the KCM sockets is seen as equivalent by the kernel — an incoming message can just as well be passed to any one of them. What actually happens, of course, is that the kernel chooses one of the processes that is actually waiting for a message when one arrives; if there are no waiting processes, the message will be queued.
Things get more complicated on the other side of the multiplexer as well, in that there could be value in having multiple TCP connections to the same destination. A phone handset, for example, might connect to a service over both broadband and WiFi. In such a situation, the multiplexer could choose between those connections for outgoing messages, and accept incoming messages on any of them. And, indeed, that's what KCM does:
Thus, KCM can be thought of as implementing a sort of poor hacker's multipath TCP, where the application is charged with setting up the connections over the various paths.
There is one final detail: how is the TCP data stream broken up into discrete messages? One could certainly envision building some sort of framing mechanism into KCM, as has been done in the kernel's reliable datagram sockets layer, but that would limit its flexibility when it comes to implementing existing protocols. If KCM is to be usable for an existing message-oriented mechanism, there must be a way to tell KCM how the framing of messages is done.
The solution here is perhaps predictable, since it is increasingly being used as the way to extend kernel functionality: use the Berkeley packet filter (BPF) subsystem. Whenever some data shows up at the psock level, it is passed to a BPF program for evaluation. Normally, the program will examine the data, determine the length of the message, and return that value. A return value of zero indicates that the length cannot be determined yet — more data must be received before trying again. A negative number indicates some sort of protocol error; when that happens, KCM will stop servicing the affected connection and signal an error to user space.
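For a protocol that prefixes each message with a four-byte, big-endian payload length (a hypothetical but common framing), the required program can be tiny. This sketch uses the classic BPF macros from <linux/filter.h>; the array name and the framing are illustrative assumptions:

```c
#include <linux/filter.h>

/* Classic-BPF length parser for a hypothetical framing in which each
 * message begins with a four-byte, big-endian payload-length field
 * that does not include itself.  BPF_LD|BPF_W|BPF_ABS loads a 32-bit
 * word from the data in host byte order, so the program just adds
 * the header size and returns the total frame length. */
struct sock_filter frame_len_prog[] = {
    BPF_STMT(BPF_LD  | BPF_W | BPF_ABS, 0),  /* A = payload length  */
    BPF_STMT(BPF_ALU | BPF_ADD | BPF_K, 4),  /* A += header size    */
    BPF_STMT(BPF_RET | BPF_A, 0),            /* return frame length */
};
```

A positive return gives KCM the total message length; returning zero or a negative value would signal "not enough data yet" or a protocol error, as described above.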
API details
An application wanting to use KCM starts by creating a new multiplexer; this is done with a socket() call:
    #include <linux/kcm.h>

    kcm_socket = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
If additional KCM sockets (attached to the same multiplexer) are needed, they can be created by calling accept() on the initial KCM socket. Normally, accept() is not applicable to datagram sockets, so this usage, while perhaps a little surprising, is arguably a reasonable way of overloading this system call.
The application must also establish one or more TCP connections to the remote side and attach them to the multiplexer. Once such a socket is open and connected, its parameters should be placed into a kcm_attach structure:
    struct kcm_attach {
        int fd;
        int bpf_type;
        union {
            int bpf_fd;
            struct sock_fprog fprog;
        };
    };
Here, fd is the file descriptor for the open socket. The rest of this structure exists so that the application can supply the BPF program that will help split the incoming data stream into messages. If bpf_type is KCM_BPF_TYPE_FD, then bpf_fd contains a file descriptor corresponding to a BPF program that has been loaded with the bpf() system call. Otherwise, if bpf_type is KCM_BPF_TYPE_PROG, then fprog points to a program to be loaded directly at this time.
The provision of two ways to load the BPF program may look a bit strange. It is, in fact, a combination of new and old ways of solving this particular problem. The bpf() approach is the newer way of doing things, while the sock_fprog approach has been used to load BPF programs into the network subsystem until now. It would not be entirely surprising to see a request from reviewers to narrow the interface down to just one of the two, most likely the bpf() method.
In any case, once the structure has been filled in, it is passed to a new ioctl() command (SIOCKCMATTACH) to connect the socket to the multiplexer. There is also a SIOCKCMUNATTACH operation to disconnect a socket.
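Putting the pieces together, building the attach request might look like the following sketch. The structure and constants are reproduced from the patch posting so the example is self-contained; the numeric value of KCM_BPF_TYPE_FD is an assumption, and the interface could still change before merging:

```c
#include <string.h>
#include <linux/filter.h>   /* struct sock_fprog */

/* As described in the patch posting; a real application would take
 * these from <linux/kcm.h>.  The constant's value is an assumption. */
#define KCM_BPF_TYPE_FD 1

struct kcm_attach {
    int fd;
    int bpf_type;
    union {
        int bpf_fd;
        struct sock_fprog fprog;
    };
};

/* Fill an attach request binding a connected TCP socket (tcp_fd) to
 * the multiplexer, using a framing program already loaded with the
 * bpf() system call (bpf_fd).  In a real program the result would be
 * handed to ioctl(kcm_fd, SIOCKCMATTACH, &req). */
static struct kcm_attach kcm_attach_request(int tcp_fd, int bpf_fd)
{
    struct kcm_attach req;

    memset(&req, 0, sizeof(req));
    req.fd = tcp_fd;
    req.bpf_type = KCM_BPF_TYPE_FD;
    req.bpf_fd = bpf_fd;
    return req;
}
```

Additional TCP connections are attached the same way, one ioctl() per socket; SIOCKCMUNATTACH reverses the operation.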
Once the pieces are connected together, the application can use the KCM socket(s) like any other datagram socket. Messages can be sent and received with interfaces like sendmsg() and recvmsg() (or write() and read()), poll() can be used to wait for a message to arrive, and so on. The whole structure will persist until the last KCM socket is closed, at which point it is torn down.
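On the send side, KCM transmits whatever buffer the application hands it as one atomic message, so the application still constructs the wire framing itself. A sketch for the hypothetical four-byte length-prefix protocol (the helper name is illustrative):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build an outgoing frame for a hypothetical protocol that prefixes
 * each message with a four-byte, big-endian payload length.  out is
 * assumed large enough for the header plus payload; returns the total
 * frame size. */
static size_t build_frame(uint8_t *out, const uint8_t *payload,
                          uint32_t plen)
{
    out[0] = (uint8_t)(plen >> 24);
    out[1] = (uint8_t)(plen >> 16);
    out[2] = (uint8_t)(plen >> 8);
    out[3] = (uint8_t)plen;
    memcpy(out + 4, payload, plen);
    return 4 + (size_t)plen;
}
```

The resulting frame would go out with a single send() on the KCM socket; on the receive side, each recv() then yields exactly one complete message, with the kernel's BPF parser having done the reassembly.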
One might wonder why this mechanism is being proposed for the kernel, given that applications have been solving this problem in user space for years. There is no explicit justification offered in the patch set, but one can imagine that it would involve performance improvements (avoiding copying or retransmission of data), the ability to do smart load balancing, and having a single implementation used by all. The "to do" list in the patch posting, which says that "TLS-in-kernel is a separate initiative", suggests that there may be efforts to push other functionality into the kernel in the future. Whether these efforts will succeed remains to be seen, but there is clearly a lot of interest in adding interesting functionality to the Linux networking stack.
Index entries for this article: Kernel/Networking
Posted Sep 21, 2015 23:33 UTC (Mon)
by flewellyn (subscriber, #5047)
This kind of task seems to me to be a user-space task. And I don't really see the justification for putting it in-kernel, vis a vis performance.
Besides which, if you want message-oriented service with reliable delivery, there's RUDP, DCCP, SCTP, and others. Why force TCP into doing something it's not designed for?
Posted Sep 21, 2015 23:55 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
> Why force TCP into doing something it's not designed for?
Posted Sep 22, 2015 7:13 UTC (Tue)
by dgm (subscriber, #49227)
Posted Sep 22, 2015 8:25 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Personally, I always wished to be able to do "socket(AF_TLS, SOCK_SEQPACKET)" and be able to use encryption transparently.
Posted Sep 22, 2015 9:09 UTC (Tue)
by dlang (guest, #313)
Posted Sep 22, 2015 12:25 UTC (Tue)
by k3ninho (subscriber, #50375)
K3n.
Posted Sep 22, 2015 17:37 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Sep 22, 2015 18:24 UTC (Tue)
by dlang (guest, #313)
Posted Sep 22, 2015 20:25 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted May 19, 2016 6:01 UTC (Thu)
by wahern (subscriber, #37304)
1) Con: Which would you prefer: having to upgrade your kernel when a TLS issue arises, or OpenSSL? Keep in mind, we also must appreciate the cost of upgrading incurred by our network peers. Upgrading and evolving TLS across the global network would be even slower if OpenSSL weren't the predominate server-side library. At one of the companies I work at, we're still using 2.6 kernels. Apple has tried to shift away from OpenSSL, but their CoreCrypto stack is pretty limited, both wrt to certificate management APIs and cipher mode support. They simply bit off more than they could chew, but few notice because few people use the stack or demand much of it, or when they do they either control the server side or simply are oblivious to the problems.
Adding another TLS implementation to the heap probably isn't a good idea. Which is why libressl and BoringSSL are so committed to practical API compatibility; and keeping code in sync between themselves and, as much as possible, with OpenSSL. Imagine trying to roll out something like ALPN. That would be much uglier than the current problems wrt HTTP/2 if there was another widely used stack.
2) Pro: IIRC, proposals for kernel-based TLS support which have come closest to inclusion keep the handshake and PKI management in userspace. The kernel layer is only invoked once the session is established, and perhaps in initial routing based on SNI. That aspect of the protocol changes the least, at least going forward and assuming the latest crop of cipher mode schemes remain viable.
Posted Sep 22, 2015 19:29 UTC (Tue)
by smckay (guest, #103253)
Posted Sep 22, 2015 20:04 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Sep 22, 2015 20:15 UTC (Tue)
by smckay (guest, #103253)
Posted Sep 22, 2015 20:40 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Sep 22, 2015 20:49 UTC (Tue)
by smckay (guest, #103253)
Posted Sep 22, 2015 21:14 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Can be done in userspace, but no existing TLS library can serialize its internal state for inter-process transmission (bad OpenSSL, bad).
Posted Sep 24, 2015 19:24 UTC (Thu)
by luto (subscriber, #39314)
Also, TLS supports things like renegotiation, so it's not as simple as "create a connection, set up session keys, and you're done".
Posted Sep 24, 2015 20:35 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
> Also, TLS supports things like renegotiation, so it's not as simple as "create a connection, set up session keys, and you're done".
Also, having a dedicated userspace component is also a possibility - i.e. the kernel happily demuxes everything during the normal operation but punts requests back to userspace if anything is abnormal (like requests for renegotiation or PFS key renewal).
Posted Sep 25, 2015 9:49 UTC (Fri)
by hkario (subscriber, #94864)
Posted May 19, 2016 6:12 UTC (Thu)
by wahern (subscriber, #37304)
It also might be the case that TLS session resumption permits rekeying, and that might still be supported by libressl and BoringSSL. In that case applications could still do a little dance to rekey a session.
Posted Sep 23, 2015 7:40 UTC (Wed)
by lsl (guest, #86508)
Posted Sep 22, 2015 17:38 UTC (Tue)
by mm7323 (subscriber, #87386)
For small messages you can normally just receive into a large buffer and get a header and message in one go, so long as the sender is careful about using MSG_MORE or TCP_CORK correctly when sending. You still have to handle the case it doesn't come off like that, but the efficiency of handling lots of small messages over TCP can be made reasonably efficient with some care. KCM will also always add overhead in touching payloads an extra time for the BPF message length determination.
Posted Sep 22, 2015 20:29 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Sep 22, 2015 21:41 UTC (Tue)
by mm7323 (subscriber, #87386)
You aren't doing something funny with signals are you, and interrupting all your system calls to get EINTR?
Posted Sep 22, 2015 21:49 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Sep 24, 2015 8:21 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
Modern high-performance network servers are often limited by the context switching overhead, so there are several projects that instead move _everything_ to userspace, including the network stack itself. And results are often quite impressive - just check http://www.seastar-project.org/ with DPDK.
But this doesn't really scale - to get the best performance you have to dedicate a whole network card to one application. And then you lose many important kernel features (firewalls, iptables, shapers, security). It would be nice if we could get _most_ of DPDK performance by adding kernel features like KCM.
Posted Sep 24, 2015 16:41 UTC (Thu)
by mm7323 (subscriber, #87386)
Really it all smacks of the old Berkeley sockets API not being appropriate to high performance applications though. I'm not sure this kmux thing is the answer to that, though it causes no harm either.
Posted Sep 25, 2015 18:20 UTC (Fri)
by nix (subscriber, #2304)
(I wrote a library that does more or less exactly what this is proposing in a previous job, in, oh, 2000 lines, except it worked for arbitrary stream file descriptors, not just TCP network sockets. It always read() big lumps at once (or the maximum available, if no big lump was available). It was not hard to write.)
Posted Oct 4, 2015 5:48 UTC (Sun)
by toyotabedzrock (guest, #88005)
Posted Sep 22, 2015 7:53 UTC (Tue)
by kugel (subscriber, #70540)
Posted Sep 22, 2015 12:43 UTC (Tue)
by tshow (subscriber, #6411)
From the looks of it, the advantage is that the other end of the connection doesn't know the difference, and so just talks TCP. I'd *love* to use SCTP, and have wanted to be able to since 2000 or so, but invariably I have a Linux server that can speak SCTP and a bunch of clients (windows machines, game consoles, phones, tablets...) which will probably never have SCTP in their stack, regardless of how fervently I wish for it.
Believe me; I've been wishing REALLY hard for a decade and a half, and it hasn't been working.
So, I'm tunneling reliable-ish message-oriented traffic over TCP and UDP (a lot of what I'm doing is games, so it's TCP for bulk data transfers and UDP for time-critical game data). This would appear to make things cleaner on the server side.
Posted Sep 22, 2015 14:33 UTC (Tue)
by dlang (guest, #313)
Posted Sep 22, 2015 16:16 UTC (Tue)
by smckay (guest, #103253)
Posted Sep 22, 2015 16:26 UTC (Tue)
by butlerm (subscriber, #13312)
> In order to delineate message in a TCP stream for receive in KCM, the
One of the major advantages of doing this is you can avoid waking up the user process until a complete message is received. It also allows you to efficiently load balance incoming TCP datagrams on the same connection across multiple threads or processes. No user space demultiplexing thread required.
Posted Sep 23, 2015 11:05 UTC (Wed)
by jezuch (subscriber, #52988)
Posted Sep 23, 2015 15:46 UTC (Wed)
by nybble41 (subscriber, #55106)
Posted Sep 24, 2015 21:21 UTC (Thu)
by tdalman (guest, #41971)
I also would like to see SCTP to be used more in real-life applications. Unfortunately, it seems that up until now almost all use cases are only found in the academic area :-(
Posted Sep 25, 2015 10:34 UTC (Fri)
by jezuch (subscriber, #52988)
Well, yeah :) "Any problem in computer science can be solved by another layer of indirection", after all ;)
Posted Sep 22, 2015 7:53 UTC (Tue)
by bytelicker (guest, #92320)
Maybe it tries to solve a problem that was never a problem, but it does not mean that it is a worthless addition. I think this potentially could be very useful for future projects.
Posted Sep 22, 2015 16:47 UTC (Tue)
by shemminger (subscriber, #5739)
Posted Sep 22, 2015 17:35 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Sep 23, 2015 4:31 UTC (Wed)
by wahern (subscriber, #37304)
Using AF_KCM you'll still have to go through the trouble of compiling a BPF parser as part of your build, and it would be exceptionally limited. One might as well just use Ragel, which is immensely more powerful and expressive.
People in the telecommunication industry use ASN.1 to send and receive billions, nay trillions, of tiny messages each day. But you'll almost never see them using a low-level ASN.1 library as in OpenSSL, parsing messages field-by-field. They use ASN.1 compilers that generate the code to pack and unpack all the data into application data structures. There's only a single open source project that does this, asn1c, which is pretty good. You can feed the parsers it generates arbitrary numbers of bytes and attempt to dequeue parsed messages. It reduces what would be a tortuous and bug-ridden I/O, buffering, and parsing module into: while (buflen = socket_read(buf)) { parser_write(parser, buf, buflen); while ((msg = parser_getmsg(parser))) { ... } }. Notice how that pattern works whether the socket source is blocking or non-blocking.
Ragel is a general purpose tool that allows you to easily implement this pattern for any arbitrary protocol, in C and many other languages. Not only will it simplify your code, but your parser will invariably be much faster than handwritten C code. All my Ragel-based parsers, for example, blow away ffmpeg's container parsers, which are implemented using the typical parser pattern--read a length tag, buffer length-bytes, then parse all the fields individually with applicable conditional logic.
Posted Sep 23, 2015 5:36 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
And telecom people? There's a reason why so many telecom protocols are message-based.
Posted Sep 24, 2015 7:45 UTC (Thu)
by pbonzini (subscriber, #60935)
KCM solves a real problem, and it solves it without requiring changes to the routing equipment between the two ends. It's not optimal, especially on the sending side (if you're doing high-throughput 1:1 comms it doesn't help, for example), but it helps a lot.
Posted Sep 22, 2015 21:05 UTC (Tue)
by caitlinbestler (guest, #32532)
This would be a very thin library if the kernel-to-kernel communications were done with SCTP.
Posted Sep 22, 2015 21:30 UTC (Tue)
by smckay (guest, #103253)
It'll be a major improvement for many protocols that do a "read-4-octet-length-and-then-the-packet" song-and-dance.
Middleboxes...
Absolutely. "SSL Terminator" is about the worst pattern for high-performance services. You have _multiple_ kernel switches with lots of TLB thrashing for each packet as a result.
They are optional (and should not be used, anyway). Just terminate the connection in this case.
> kernel implements a message parser. For this we chose to employ BPF
> which is applied to the TCP stream. BPF code parses application layer
> messages and returns a message length. Nearly all binary application
> protocols are parsable in this manner, so KCM should be applicable
> across a wide range of applications.
