Security API/Insanity

From wiki.zmanda.com
Jump to navigation Jump to search

Yes, it's that bad. This is a collection of notes on bugs that exist in the security API implementation. Trying to fix any of these bugs will, invariably, cause other things to fail. So we live with them until we can delete the whole mess.

Multiple Services

Amandad is written as if it can handle multiple simultaneous services (it has a nice serviceq and everything), but in reality that's silly - it would have no idea which service a particular packet applied to, since nothing is tagged. Clients, fortunately, never try to run more than one service at a time.

Subsequent Protocol Transactions

Amandad is also written as if it would exit when the single service is complete. In fact, it will exit at a multiple of 30 seconds after it has started, if there are no active services, or if it gets an EOF on a socket from a TCPM-based auth. However, we routinely see the same amandad handle back-to-back transactions over the same TCPM connection. How does this work?

server side

It's a bug. Well, two bugs.

First, when the callback for security_accept is invoked, the internal function pointer (tcp_conn.accept_fn) is never cleared. Second, when the connection is closed, a refcounting error results in the read event handler for the tcpm connection being left on. When the frame containing the first packet for the subsequent transaction arrives, that handler is invoked. The frame is for a different stream than the first transaction (that stream has been closed in both directions, in fact). Once the frame is decoded, sec_tcp_conn_read_callback tries to find an existing stream that it corresponds to. Failing that, it checks accept_fn; if this were NULL, it would reject the frame as unhandled. However, it's not NULL, so sec_tcp_conn_read_callback calls it as if this were the first frame in the connection, parsing and passing the packet along. Amandad plays along, treating the packet like the first packet it's ever seen.

client side

On the client side, it also doesn't look like the tcp connection would be maintained. How does this work?

It's a bug. A pretty subtle one. Well, maybe it's a feature, but if so it's a subtle feature.

Planner is the most common app to make two protocol transactions to the same host back-to-back. In particular, it will run SERVICE noop, followed immediately by SERVICE sendsize. As the noop request winds down, protocol_sendreq calls back to handle_result as its continuation. However, despite the very computer-sciencey term "continuation", the protocol code does other stuff after this callback returns. In particular, it calls security_close after handle_result is done. If handle_result didn't do anything more, then security_close would unref the tcp_conn object and close the connection to amandad (or maybe not; see below). However, handle_result immediately starts a new request, while the tcp_conn object still exists. This new request piggybacks on the existing tcp_conn, adding a reference count to it, so when security_close finally does get around to unreffing the tcp_conn, its refcount does not reach 0.

security_close vs. security_close_connection

There are two methods to close a security handle. The first, security_close, does about what you'd expect: stops reading on the connection and frees the handle. Due to a (different) refcounting bug, it leaves the tcp_conn object in place, though. The second, security_close_connection, doesn't do much but set some flags that are never checked elsewhere, and as far as I can tell is a no-op. Where it's used, though, it introduces another refcounting bug, as it calls sec_tcp_conn_put(rh->rc), but doesn't set that value to NULL, and thus continues to use it without holding a registered reference.

Half-closed Connections

How does amandad know when a stream should be closed? It should ensure that both the service and the client have signaled EOF first, right?

No. The code looks like it works that way, because it checks whether as->ev_write is non-NULL, which makes you think it's checking whether it's seen EOF from the client. But this field is never used (all writes from the client to the service are done with full_write, which is a bug in its own right). So amandad will unconditionally shut down a stream when it receives EOF from the service. It doesn't do much when it gets an EOF from the client, but luckily this never happens first.