<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://codahale.com//atom.xml" rel="self" type="application/atom+xml" /><link href="https://codahale.com//" rel="alternate" type="text/html" /><updated>2025-01-29T20:24:50+00:00</updated><id>https://codahale.com//atom.xml</id><title type="html">codahale.com</title><entry><title type="html">The Joy Of Duplexes</title><link href="https://codahale.com//the-joy-of-duplexes/" rel="alternate" type="text/html" title="The Joy Of Duplexes" /><published>2022-04-03T13:48:00+00:00</published><updated>2022-04-03T13:48:00+00:00</updated><id>https://codahale.com//the-joy-of-duplexes</id><content type="html" xml:base="https://codahale.com//the-joy-of-duplexes/"><![CDATA[<p class="message">This article is about cryptography, which is most commonly used in high-risk scenarios.
It’s an extremely subtle and complex subject and it’s one in which I’m absolutely not an expert.
The constructions, schemes, and protocols I discuss are not standardized, nor have they been analyzed by a professional
or anyone else who is assuming risk for whatever you might use them for.
The vibe of this post is “hey wow, look at these neat ideas” and not “everyone should use duplexes”.
Govern yourself accordingly.</p>

<p>I’ve been playing around with a little side project lately which makes heavy use of a relatively new type of
cryptographic primitive–the duplex–and I figured I should take the time to share what I’ve learned and some of the
pretty profound impacts the duplex can have on designing cryptographic protocols.</p>

<p>In this post, we’ll talk about permutations, sponge functions, and duplexes–what they are, how they’re built, what
makes them secure, and how they’re used.
We’ll use a sponge function to build our own hash function, calculate message authentication codes, generate
pseudo-random data, and make a stream cipher.
We’ll use a duplex to do authenticated encryption, make cryptographic transcripts, create digital signatures, and
build an authenticated public key encryption scheme (a.k.a. signcryption).
We’ll even design our own authenticated session encryption protocol with only a duplex and an elliptic curve.</p>

<p>In order to do all of this, though, we need to understand what a duplex is, which means we first need to understand what
a permutation is and how it’s used to build a sponge.
It’s a long post, so pack a lunch.
Let’s go.</p>

<h2 id="permutations">Permutations</h2>

<p>A permutation, as the name implies, shuffles things around: it’s a function which maps inputs to outputs so that no two
inputs produce the same output.
We pass it a fixed-size array of bits \(x\); it returns another fixed-size array \(y\).
The only way to get \(y\) back out is to pass in \(x\) again.</p>

<p>A good permutation from a cryptographic perspective is one which hides patterns in the input.
In technical terms, it should be hard to distinguish a permutation from a random oracle.
(A random oracle is a theoretical service which maps inputs to random outputs via an infinite secret lookup
table;
it’s impossible by definition to detect any pattern between inputs and outputs.)
Ideally, the only way to tell the difference between a permutation and a random oracle is by knowing exactly what the
permutation is and running the algorithm in reverse.</p>

<p>In that way, a block cipher with a fixed key is a kind of permutation: each unique plaintext block is mapped to a unique
ciphertext block.
(If it didn’t map them uniquely, you’d have some ciphertexts which decrypt to the wrong plaintext!)
A hash algorithm, on the other hand, is not a permutation: some inputs produce the same outputs via collision.
(This gets kind of important later, so keep it in mind.)</p>

<p>Finally, a good permutation should be constant-time: the operations it performs shouldn’t be dependent on the value of
the input block.
This means that an adversary can’t use <a href="https://codahale.com/a-lesson-in-timing-attacks/">timing attacks</a> to learn about
any of the data we’re passing to our permutation function.</p>

<p>Examples of permutation algorithms include <a href="https://keccak.team/keccakp.html">Keccak-<em>p</em></a> (used in
<a href="https://csrc.nist.gov/publications/detail/fips/202/final">SHA-3</a>),
<a href="https://keccak.team/xoodoo.html">Xoodoo</a> (used in <a href="https://keccak.team/xoodyak.html">Xoodyak</a>), and 
<a href="https://gimli.cr.yp.to">Gimli</a> (used in <a href="https://github.com/jedisct1/libhydrogen">libhydrogen</a>).</p>

<p>For now, imagine we have a permutation function \(f: \{0,1\}^b \rightarrow \{0,1\}^b\) which maps blocks of \(b\)
bits to other blocks of \(b\) bits in a way which makes it hard to tell what the input was.</p>
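<p>To make this concrete, here’s a sketch of such a function in Python using the splitmix64 finalizer–a known 64-bit bijection built from additions, XOR-shifts, and multiplications by odd constants. It is absolutely not cryptographically strong; it only illustrates the shape of the interface: every input maps to a unique output.</p>

```python
MASK = (1 << 64) - 1  # we work on fixed-size 64-bit blocks

def toy_permutation(x: int) -> int:
    """The splitmix64 finalizer: a 64-bit bijection (each step is invertible).
    A stand-in for a real permutation like Keccak-p; NOT secure."""
    x = (x + 0x9E3779B97F4A7C15) & MASK                 # addition mod 2^64 is invertible
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK   # xorshift + odd multiply: invertible
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
    return (x ^ (x >> 31)) & MASK

# No two inputs share an output: a permutation, not a hash.
assert len({toy_permutation(i) for i in range(100_000)}) == 100_000
```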

<p>Let’s use it to make a sponge.</p>

<h2 id="sponges">Sponges</h2>

<p>The sponge, in cryptography, is a function which <strong>absorbs</strong> an input string of arbitrary length and then <strong>squeezes</strong>
an output string of arbitrary length.
(Like a sponge.
Get it?)
It was originally defined by Guido Bertoni, Joan Daemen (co-inventor of AES), Michaël Peeters, and Gilles Van Assche
as <a href="https://keccak.team/files/SpongeFunctions.pdf">part of their design process of Keccak</a> for the NIST SHA-3
competition.
(Their design won the competition, making SHA-3 the most widely deployed sponge scheme to date.)</p>

<h3 id="sponge-as-function">Sponge As Function</h3>

<p>From the outside, a sponge is a function which takes an arbitrarily-long input value \(M\) and a non-negative integer
\(\ell\) and returns an \(\ell\)-bit value:</p>

\[\begin{gather*}
\text{Sponge}(M, \ell): \{0,1\}^* \times \mathbb{N} \rightarrow \{0,1\}^\ell
\end{gather*}\]

<p>The implementation of a sponge starts with our permutation function \(f\) and adds a <strong>state variable</strong>
\(S: \{0,1\}^b\) which is split into two sections: an outer section of \(r\) bits and an inner section of \(c=b-r\)
bits.
We call \(b\) the sponge’s <strong>width</strong>, \(r\) the sponge’s <strong>rate</strong>, and \(c\) the sponge’s <strong>capacity</strong>.</p>

<p>Finally, we add a <strong>padding function</strong>, \(P\), which maps a string \(M\) of arbitrary length to a sequence of \(r\)-bit
blocks.
(The usual choice appends a \(1\) bit, then \((-|M|-2) \bmod r\) \(0\) bits, then a final \(1\) bit, so the padded
message is a whole number of \(r\)-bit blocks.
Nothing fancy.)</p>
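<p>A byte-oriented sketch of that padding rule (simplified: this version always adds at least two whole bytes, where the real bit-level rule merges the two \(1\) bits when there’s only room for one byte of padding):</p>

```python
def pad(message: bytes, rate_bytes: int) -> bytes:
    """Simplified byte-level pad10*1: append 0x01, then zero bytes, then 0x80,
    so the result is a whole number of rate-sized blocks."""
    zeros = (-(len(message) + 2)) % rate_bytes
    return message + b"\x01" + b"\x00" * zeros + b"\x80"

padded = pad(b"hello", 8)
assert len(padded) % 8 == 0  # always a whole number of blocks
assert pad(b"", 8) == b"\x01" + b"\x00" * 6 + b"\x80"
```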

<p>Next, we’ll use these to build the two phases of a sponge function: \(\text{Absorb}\) and \(\text{Squeeze}\).</p>

<h3 id="absorbing-data">Absorbing Data</h3>

<p>We start with the sponge’s state \(S\) initialized to some zero value; it doesn’t really matter what as long as it’s a
constant:</p>

\[\begin{gather*}
S \gets \{0\}^b
\end{gather*}\]

<p>We take the input string \(M\) and turn it into a sequence of \(r\)-sized blocks with the padding function \(P\).
For each block \(M_i\), we update the outer section of \(S\) by XORing the two, then we run the entirety of \(S\)
through the permutation:</p>

\[\begin{gather*}
M_{0..n} \gets P(M) \\
 S_{0..r} \gets S_{0..r} \oplus M_0 \\
S \gets f(S) \\
\dots \\
S_{0..r} \gets S_{0..r} \oplus M_n \\
S \gets f(S) \\
\end{gather*}\]

<p>Notice that we’re only writing to \(S_{0..r}\), the outer section of \(S\), but we pass the whole of \(S\) to our
permutation function, making the final value of \(S\) dependent on both sections.</p>

<h3 id="squeezing-data">Squeezing Data</h3>

<p>Once we’re done absorbing data, we can squeeze an \(\ell\)-bit value \(Z\) out of the state.</p>

<p>If \(\ell \le r\), we return the first \(\ell\) bits of the outer section of \(S\):</p>

\[\begin{gather*}
Z \gets S_{0..\ell} \\
S \gets f(S)
\end{gather*}\]

<p>If \(\ell&gt;r\), we iterate the process, returning bits from \(S_{0..r}\) and updating \(S\) with our permutation:</p>

\[\begin{gather*}
Z_{0..r} \gets S_{0..r} \\
S \gets f(S) \\
Z_{r..\ell} \gets S_{0..\min{(r, \ell-r)}} \\
S \gets f(S) \\
\dots
\end{gather*}\]

<p>Again, notice that we’re only reading from outer section of \(S\), but each successive value of \(S\) depends on both
the outer and inner sections of the previous value.</p>

<p>The resulting construction looks like this (from 
<a href="https://keccak.team/files/SpongeDuplex.pdf"><em>Duplexing the sponge</em></a> by Bertoni et al.):</p>

<p><img src="/assets/img/sponge.png" alt="A visualization of the sponge construction" /></p>
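<p>Putting the two phases together, here’s a complete toy sponge in Python with \(b=64\), \(r=32\), and \(c=32\), using the splitmix64 mixing function as a stand-in permutation and a simplified byte-level padding rule. These parameters are far too small to be secure; the point is only to show the absorb/squeeze mechanics.</p>

```python
MASK = (1 << 64) - 1
RATE = 4  # bytes in the outer section; the other 4 bytes are the capacity

def permute(x: int) -> int:
    """splitmix64 finalizer: a 64-bit bijection standing in for Keccak-p.
    NOT cryptographically strong; for illustration only."""
    x = (x + 0x9E3779B97F4A7C15) & MASK
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
    return (x ^ (x >> 31)) & MASK

def pad(message: bytes) -> bytes:
    """Simplified byte-level pad10*1: 0x01, zero bytes, 0x80."""
    zeros = (-(len(message) + 2)) % RATE
    return message + b"\x01" + b"\x00" * zeros + b"\x80"

def sponge(message: bytes, out_len: int) -> bytes:
    state = 0  # S = 0^b; the low 4 bytes are the outer section
    # Absorb: XOR each padded block into the outer section, then permute.
    padded = pad(message)
    for i in range(0, len(padded), RATE):
        block = int.from_bytes(padded[i:i + RATE], "little")
        state = permute(state ^ block)
    # Squeeze: read the outer section, permuting between output blocks.
    out = b""
    while len(out) < out_len:
        out += (state & 0xFFFFFFFF).to_bytes(RATE, "little")
        state = permute(state)
    return out[:out_len]
```

Note that squeezing is incremental: an 8-byte output is a prefix of a 16-byte output for the same message.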

<h3 id="picking-parameters">Picking Parameters</h3>

<p>When absorbing or squeezing, we only directly write to and read from the outer section of the sponge’s state,
\(S_{0..r}\).
The inner section, \(S_{r..b}\), is only used as input for our permutation function \(f\), and the only thing we write
to it is the output of \(f\).
The inner section of \(S\) acts as a reservoir of bits that an adversary either has to hold constant or never gets to see,
depending on what we’re doing.
This means that the ratio of the sponge’s rate \(r\) to its capacity \(c\) is critical.</p>

<p>If our sponge’s capacity is too small relative to its rate, it becomes feasible for an adversary to use the output of a
sponge function to recreate the sponge’s entire state.
For example, a sponge with \(c=1\) would mean an adversary with an \(r\)-bit block of output would only need to try
\(2\) possible values in order to recover the full contents of \(S\), at which point they could re-create the sponge’s
previous states and produce arbitrary outputs.</p>

<p>If our sponge’s rate is too small relative to its capacity, its performance suffers due to the overhead of the
permutation function.
For example, a sponge with \(r=1\) would have to run the permutation function for each bit of input and output.</p>

<p>The designers of sponge functions pick their \(r\) and \(c\) values carefully to balance speed and security.
This will become clearer as we start building cryptographic operations with our sponge.</p>

<h3 id="message-digests">Message Digests</h3>

<p>The natural use for a sponge function is as a cryptographic hash function.
We pick a digest length, \(n\), and pass our message \(m\) to get back an \(n\)-bit digest:</p>

\[\begin{gather*}
d \gets \text{Sponge}(m, n)
\end{gather*}\]

<p>This is essentially how SHA-3 works.</p>
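<p>You don’t have to build your own to try this: Python’s <code class="language-plaintext highlighter-rouge">hashlib</code> ships the real thing. SHA3-256 is a Keccak sponge with \(b=1600\), \(r=1088\), and \(c=512\):</p>

```python
import hashlib

# SHA3-256: absorb the message, squeeze a 256-bit digest.
digest = hashlib.sha3_256(b"The Joy Of Duplexes").hexdigest()
assert len(digest) == 64  # 256 bits, hex-encoded

# The well-known answer for the empty message:
assert hashlib.sha3_256(b"").hexdigest() == (
    "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a"
)
```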

<p>Two things make this secure: first, the strength of our permutation function \(f\); second, the fact that the inner
section of \(S\) acts as a reservoir of unmodifiable bits.</p>

<p>Consider an adversary trying to find a message \(m\) which produces a fixed digest \(d\) (i.e. trying to break its
pre-image resistance) with a sponge function whose parameters are \(b=384\), \(r=256\), \(c=128\).
To keep it simple, we’ll imagine they’re trying to find a short \(m\) such that \(|m|≤r\) (i.e. the absorb phase 
consists of a single iteration of \(f\), and \(S\) is all zeroes to begin with).</p>

<p>Our adversary is therefore trying to find some state value
\(S'=P(m')||\{0\}^c\) such that \(f(S')\) begins with the digest \(d\).
If our permutation is strong–if it’s hard to distinguish between our permutation and a random oracle–our adversary is
stuck having to try all of the \(2^{128}\) possible values of the inner section of \(S\).</p>

<p>Note that a sponge function isn’t vulnerable to length-extension attacks, unlike Merkle–Damgård constructions
(e.g. MD5, SHA-1, SHA-256): given \(H(m)\), it’s not possible to easily calculate \(H(m||x)\).
A digest reveals at most the outer section of \(S\), so an adversary can’t reconstruct the state they’d need to keep
absorbing data; and because the input is always padded, \(H(m)\) is actually \(H(P(m))\), where
\(P(m||x) \not= P(m)||P(x)\).</p>

<h3 id="message-authentication-codes">Message Authentication Codes</h3>

<p>The fact that sponges aren’t vulnerable to length-extension attacks means we can securely extend our message digest
construction to calculate message authentication codes (MACs).
We concatenate the secret key \(k\) and message \(m\) into a single input and squeeze an \(n\)-bit authentication tag
\(t\):</p>

\[\begin{gather*}
t \gets \text{Sponge}(k||m, n)
\end{gather*}\]

<p>This is essentially how
<a href="https://www.nist.gov/publications/sha-3-derived-functions-cshake-kmac-tuplehash-and-parallelhash">cSHAKE and KMAC</a> work.</p>

<p>It superficially resembles the kind of naive MAC construction that
<a href="http://netifera.com/research/flickr_api_signature_forgery.pdf">gets people in trouble</a>, e.g. \(\text{MD5}(k||m)\), but
remember that the sponge, unlike most hash functions, isn’t vulnerable to length-extension attacks.
As long as \(k\) is constant size, it’s secure.</p>
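<p>A sketch of that construction with Python’s <code class="language-plaintext highlighter-rouge">hashlib</code> and a fixed-size key (in practice, reach for the standardized KMAC rather than this bare \(H(k||m)\) form):</p>

```python
import hashlib
import hmac  # for hmac.compare_digest, a constant-time comparison

def sponge_mac(key: bytes, message: bytes) -> bytes:
    # Sponge(k || m, n): safe with a sponge because there's no length extension.
    # The key must be a fixed size, or k || m becomes ambiguous.
    assert len(key) == 32
    return hashlib.sha3_256(key + message).digest()

def verify_mac(key: bytes, message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sponge_mac(key, message), tag)

key = bytes(32)  # demo key; use a random secret in practice
tag = sponge_mac(key, b"pay me $100")
assert verify_mac(key, b"pay me $100", tag)
assert not verify_mac(key, b"give Mal $5", tag)
```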

<p>(You can run into trouble if \(k\) is variable size.
Proper standards like KMAC pad the key to a constant size as well as add the output length and other domain separation
data to the input.
If you’re trying to create a MAC of multiple pieces of information, you have to ensure those pieces don’t overlap in
the input space, either by using an existing scheme like
<a href="https://www.nist.gov/publications/sha-3-derived-functions-cshake-kmac-tuplehash-and-parallelhash">TupleHash</a> or by
developing your own padding scheme.
More about that in a bit.)</p>

<p>Let’s look at what makes our sponge-based MAC function secure.
For simplicity’s sake, let’s assume \(|k||m|\) and \(t\) are both less than \(r\) (i.e. both absorb and squeeze phases
involve a single iteration of \(f\)).
Our MAC algorithm looks like this:</p>

\[\begin{gather*}
m_0 \gets P(k||m) \\
S_{0..r} \gets S_{0..r} \oplus m_0 \\
S \gets f(S) \\
t \gets S_{0..n}
\end{gather*}\]

<p>Consider an adversary who is trying to forge a tag \(t'\) for a message \(m'\) without knowing the key \(k\).
Even if they can see an arbitrary number of messages and tags created with \(k\), they’re still stuck trying to recreate
the inner section of \(S\) after the absorb phase.
If that’s big enough, then trying to forge a tag is infeasible.</p>

<h3 id="extendable-output-functions-xofs">Extendable-Output Functions (XOFs)</h3>

<p>So far we’ve only been using short outputs of the sponge, but it’s entirely possible to produce longer outputs.
Let’s say we need a lot of pseudo-random data we can derive from a high-entropy value \(x\).
We can pass \(x\) to our sponge function and squeeze as many bits as we need:</p>

\[\begin{gather*}
z \gets \text{Sponge}(x, n)
\end{gather*}\]

<p>Remember that the squeeze phase, when used to produce multiple blocks of output, looks like this:</p>

\[\begin{gather*}
z_0 \gets S_{0..r} \\
S \gets f(S) \\
z_1 \gets S_{0..r} \\
S \gets f(S) \\
z_2 \gets S_{0..r} \\
S \gets f(S) \\
\dots
\end{gather*}\]

<p>Each additional block of output can be calculated in an incremental fashion, meaning that it’s easy to build an
abstraction on top of this which produces an indefinite series of bits for a given input: an extendable-output function
(XOF).
How secure is it?</p>
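<p>SHAKE128 and SHAKE256–the XOFs standardized alongside SHA-3–are in <code class="language-plaintext highlighter-rouge">hashlib</code> too, and show the incremental property nicely: a shorter squeeze is a prefix of a longer one from the same input.</p>

```python
import hashlib

seed = b"a high-entropy seed value"
stream = hashlib.shake_256(seed).digest(1024)  # squeeze 1 KiB of output

# The squeeze is incremental: shorter outputs are prefixes of longer ones.
assert hashlib.shake_256(seed).digest(64) == stream[:64]
# A different seed produces an entirely different stream.
assert hashlib.shake_256(b"another seed").digest(1024) != stream
```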

<p>Consider an adversary who doesn’t know the seed value \(x\) but can see a single block of output \(z_i\).
What would they need to do to predict the next block \(z_{i+1}\)?
If the sponge’s capacity were zero, then it would be simple: \(z_{i+1}=f(z_i)\).
Likewise, an adversary trying to determine the previous block could simply calculate \(z_{i-1}=f^{-1}(z_i)\).
But a sponge with a large enough capacity makes it infeasible for an adversary to predict output without the seed value
\(x\).</p>

<p>Now, remember that \(f\) is a permutation and not a compression function–it’s shuffling bits around in \(S\) without
collision.
This is where that matters.
A compression function has some probability that \(f(S) = f(S')\).
As that function is iterated, that probability increases, and the number of possible values for \(S\) decreases,
reducing the security.
Thankfully, we’re using a permutation for \(f\), which means we’re not creating those internal collisions as we produce
output.
Thus, the security of our sponge stays constant relative to the strength of \(f\), allowing us to produce a long stream
of effectively pseudo-random bits.</p>

<h3 id="stream-ciphers">Stream Ciphers</h3>

<p>Since we’ve got an unbounded stream of pseudo-random bits, let’s go ahead and make a stream cipher.
We pass our sponge function a concatenated key \(k\) and nonce \(n\) and XOR our plaintext \(p\) with the output:</p>

\[\begin{gather*}
c \gets p \oplus \text{Sponge}(k||n, |p|)
\end{gather*}\]

<p>Decrypting is the same thing, only with the ciphertext \(c\) as input:</p>

\[\begin{gather*}
p \gets c \oplus \text{Sponge}(k||n, |c|)
\end{gather*}\]

<p>As with the XOF, this is secure if our permutation is strong and the sponge’s capacity is big enough.
An adversary who knows \(n\) but not \(k\) will be left trying to reconstruct the inner section of \(S\), unable to
reconstruct the output, and therefore unable to decrypt \(c\).
And because we can incrementally produce the keystream, we can encrypt streams of data without buffering.</p>
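<p>A sketch of that stream cipher with SHAKE256 standing in for our sponge (assuming fixed-size key and nonce, since bare concatenation is ambiguous otherwise):</p>

```python
import hashlib

def stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Sponge(k || n, |data|) as a keystream; XOR is its own inverse, so this
    both encrypts and decrypts. Unauthenticated: a sketch, not a real cipher."""
    assert len(key) == 32 and len(nonce) == 16  # fixed sizes keep k || n unambiguous
    keystream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))

key, nonce = bytes(32), bytes(16)  # demo values; use random secrets in practice
ciphertext = stream_xor(key, nonce, b"attack at dawn")
assert stream_xor(key, nonce, ciphertext) == b"attack at dawn"
```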

<p>Of course, stream ciphers by themselves are not very secure.
If we accidentally re-use a nonce, an adversary can calculate \(p_0 \oplus p_1\), which violates the confidentiality of
the plaintexts.
Also, an adversary can flip bits in the ciphertext to produce alternative plaintexts that we might not be able to detect
(e.g. <code class="language-plaintext highlighter-rouge">pay me $100</code> could turn into <code class="language-plaintext highlighter-rouge">give Mal $5</code>).
We <em>could</em> solve this problem with the sponge function, but it starts to get gnarly: once we’re done encrypting the
plaintext we’d need to pass the ciphertext to another sponge function to create a MAC, but then we’re talking about a
two-pass algorithm.</p>

<p>To elegantly solve this problem–and many, many others–we’ll need to graduate from the sponge and build a duplex.</p>

<h2 id="duplexes">Duplexes</h2>

<p>A duplex, like a sponge, has a permutation function \(f\), a rate \(r\), a capacity \(c\), a state variable \(S\)
divided into an \(r\)-bit outer section and a \(c\)-bit inner section, and a padding function \(P\).
The big difference is that a duplex is a stateful object, not a function–instead of a single absorb phase and a single
squeeze phase, it maintains its state across an arbitrary sequence of absorb and squeeze phases.</p>

<p>The duplex was invented by Bertoni, Daemen, Peeters, and Van Assche (again) as <a href="https://keccak.team/files/SpongeDuplex.pdf">part of their continued work on
permutation-based cryptography</a>.
Examples of duplexes include Mike Hamburg’s <a href="https://eprint.iacr.org/2017/003.pdf">STROBE</a> and Daemen et al.’s
<a href="https://tosc.iacr.org/index.php/ToSC/article/view/8618/8184">Xoodyak</a>, a finalist in NIST’s Lightweight Cryptography
competition. (Xoodyak’s my personal favorite.)</p>

<h3 id="duplex-as-object">Duplex As Object</h3>

<p>It’s helpful to consider the duplex as an object with an API:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(\sigma) \\
\text{Squeeze}(\ell): \mathbb{N} \rightarrow \{0,1\}^\ell
\end{gather*}\]

<p>\(\text{Init}\) creates a new duplex, initializing \(S\) with a zero value:</p>

\[\begin{gather*}
S \gets \{0\}^b
\end{gather*}\]

<p>\(\text{Absorb}(\sigma)\) updates the duplex’s state with an arbitrary string \(\sigma\) in exactly the same way as the
<a href="#absorbing-data">absorb phase of our sponge function</a>.</p>

<p>\(\text{Squeeze}(\ell)\) returns an \(\ell\)-bit string and updates the duplex’s state, iterating and truncating as
necessary just like the <a href="#squeezing-data">squeeze phase of our sponge function</a>.</p>

<p>Our \(\text{Sponge}(M, \ell)\) function, when mapped to a duplex, becomes this:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(M) \\
\text{Squeeze}(\ell)
\end{gather*}\]

<p>The duplex construction itself looks like this (from
<a href="https://keccak.team/files/SpongeDuplex.pdf"><em>Duplexing the sponge</em></a> by Bertoni et al.):</p>

<p><img src="/assets/img/duplex.png" alt="A visualization of the duplex construction" /></p>

<p>There are two major differences between a sponge function and a duplex.</p>

<p>First, a duplex’s inputs and outputs are non-commutative.
That is, \(\text{Absorb}(a)\) followed by \(\text{Absorb}(b)\) is not the same as \(\text{Absorb}(a||b)\), and
\(\text{Squeeze}(a)\) followed by \(\text{Squeeze}(b)\) is not the same as \(\text{Squeeze}(a+b)\).
With a duplex, those produce entirely different states and thus entirely different outputs.</p>

<p>Second, where the sponge is limited to a single absorb phase and a single squeeze phase, the duplex can alternate
between series of \(\text{Absorb}\) operations and \(\text{Squeeze}\) operations indefinitely.</p>

<p>These two changes enable us to build a frankly staggering array of cryptographic operations.</p>
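<p>Here’s a toy duplex-shaped object in Python. Instead of a real permutation over a partitioned state, it fakes the state update by hashing the whole state with SHA3-512 and emits only half of it–a sketch, not a real duplex–but it keeps the API and the two properties above: framed, non-commutative operations, and arbitrary interleaving of \(\text{Absorb}\) and \(\text{Squeeze}\).</p>

```python
import hashlib

class Duplex:
    """A duplex-shaped sketch: stateful Absorb/Squeeze. The SHA3-512 'state
    update' stands in for XORing padded blocks into a permutation state."""

    def __init__(self):
        self.state = bytes(64)  # the zero-initialized state S

    def absorb(self, data: bytes) -> None:
        # Frame each call so absorb(a); absorb(b) differs from absorb(a + b).
        framed = b"A" + len(data).to_bytes(8, "big") + data
        self.state = hashlib.sha3_512(self.state + framed).digest()

    def squeeze(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            self.state = hashlib.sha3_512(self.state + b"S").digest()
            out += self.state[:32]  # emit only the "outer" half of the state
        return out[:n]

# Non-commutative: two absorbs are not the same as one concatenated absorb.
a = Duplex(); a.absorb(b"foo"); a.absorb(b"bar")
b = Duplex(); b.absorb(b"foobar")
assert a.squeeze(16) != b.squeeze(16)
```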

<h3 id="authenticated-encryption">Authenticated Encryption</h3>

<p>Let’s start with our original motivation for the duplex: authenticated encryption.
The sponge gave us an XOF and thus a stream cipher, but in order to authenticate ciphertexts we’d need to use another
sponge function to create a MAC or something.
We can do better with the duplex.</p>

<p>We initialize a duplex, absorb a key \(k\) and nonce \(n\) and XOR our plaintext \(p\) with the output much like we did
with the sponge, but after that we absorb \(p\) and squeeze an \(m\)-bit authentication tag \(t\):</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(k) \\
\text{Absorb}(n) \\
c \gets p \oplus \text{Squeeze}(|p|) \\
\text{Absorb}(p) \\
t \gets \text{Squeeze}(m)
\end{gather*}\]

<p>Superficially, this looks a little like the
<a href="https://moxie.org/2011/12/13/the-cryptographic-doom-principle.html">dreaded</a> MAC-then-Encrypt construction because
we’re absorbing \(p\), then squeezing a tag \(t\).
The critical difference, however, is that \(t\) isn’t only derived from \(p\)–it’s derived from the same state which
produced the keystream to make \(c\).
By absorbing \(p\), we ensure that \(t\) seals both the plaintext \(p\) and the ciphertext \(c\).</p>

<p>We can see how this plays out when we decrypt \(c\):</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(k) \\
\text{Absorb}(n) \\
p' \gets c \oplus \text{Squeeze}(|c|) \\
\text{Absorb}(p') \\
t' \gets \text{Squeeze}(m) \\
t \stackrel{?}{=} t'
\end{gather*}\]

<p>If an adversary modifies \(c\), then \(p' \not= p\) and therefore \(t' \not= t\).</p>

<p>This can be extended to a full AEAD by absorbing the associated data \(d\) with \(\text{Absorb}(d)\) after absorbing the nonce.
In fact, if we’ve got more than one piece of data to authenticate, we can add more \(\text{Absorb}\) calls.
There’s no need to devise our own padding scheme here.
(<a href="https://codahale.com/a-lesson-in-timing-attacks/">You’ll still need to use a constant-time comparison for the authentication tag, though.</a>)</p>
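<p>The whole scheme fits in a few lines of Python. This sketch fakes the duplex by hashing the whole state with SHA3-512 and emitting half of it, so it illustrates the flow–key, nonce, keystream, absorb the plaintext, squeeze a tag–rather than being a real permutation-based AEAD:</p>

```python
import hashlib
import hmac  # for hmac.compare_digest, a constant-time comparison

class Duplex:
    """Duplex-shaped sketch: SHA3-512 stands in for the permutation."""
    def __init__(self):
        self.state = bytes(64)
    def absorb(self, data: bytes) -> None:
        framed = len(data).to_bytes(8, "big") + data
        self.state = hashlib.sha3_512(self.state + framed).digest()
    def squeeze(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            self.state = hashlib.sha3_512(self.state).digest()
            out += self.state[:32]
        return out[:n]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> tuple[bytes, bytes]:
    d = Duplex()
    d.absorb(key)
    d.absorb(nonce)
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, d.squeeze(len(plaintext))))
    d.absorb(plaintext)               # the tag depends on the plaintext AND
    return ciphertext, d.squeeze(16)  # the keyed state that made the keystream

def decrypt(key: bytes, nonce: bytes, ciphertext: bytes, tag: bytes) -> bytes:
    d = Duplex()
    d.absorb(key)
    d.absorb(nonce)
    plaintext = bytes(c ^ k for c, k in zip(ciphertext, d.squeeze(len(ciphertext))))
    d.absorb(plaintext)
    if not hmac.compare_digest(d.squeeze(16), tag):
        raise ValueError("invalid ciphertext")  # never release the plaintext
    return plaintext

key, nonce = bytes(32), bytes(16)  # demo values; use random secrets in practice
c, t = encrypt(key, nonce, b"pay me $100")
assert decrypt(key, nonce, c, t) == b"pay me $100"
```

Flipping a single bit of the ciphertext changes the recovered plaintext, which changes the transcript, which changes the tag: decryption fails closed.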

<p>It’s worth noting that here, unlike with the sponge function, we’re starting to build cryptographic constructions
which integrate the properties of multiple other constructions.
We begin with what looks like a stream cipher but then turns into something more like a MAC, producing both
ciphertext and tag with the same underlying primitive.
In contrast, the GCM mode (e.g. AES-128-GCM) combines a block cipher (e.g. AES), a counter mode of operation, GMAC
(a polynomial MAC algorithm), and a number of very specific padding functions to produce the same results.</p>

<h3 id="cryptographic-transcripts">Cryptographic Transcripts</h3>

<p>The fact that the \(\text{Absorb}\) operations aren’t commutative opens up some interesting possibilities.
For example, we can create a transcript of higher-level operations and have cryptographic assurances that the transcript
has not been altered or that another party performing the same actions hasn’t diverged from the same sequence.
That’s a pretty abstract concept, so let’s use the example of checking that two user records actually refer to the same user.
We initialize a duplex, absorb the pieces of data which identify the user, and finally squeeze a cryptographic
identifier:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(\texttt{user.email}) \\
\text{Absorb}(\texttt{user.name}) \\
\text{Absorb}(\texttt{user.phone\_number}) \\
I \gets \text{Squeeze}(n) \\
\end{gather*}\]

<p>This is similar to using a hash function
\(H(E||N||P)\), but because the duplex pads its inputs, it preserves each piece of data as an independent input.
We can even add semantic information to ensure domain separation:</p>

\[\begin{gather*}
\text{Absorb}(\texttt{"phone\_number"+user.phone\_number})
\end{gather*}\]

<p>This might be a little heavy-handed for user identifiers, but the ability to create a transcript of operations with
cryptographic assurance of their semantic integrity turns out to be incredibly useful.</p>
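<p>The difference is easy to demonstrate in Python. With plain concatenation into a hash, the boundaries between fields vanish; length-prefixing each field (a stand-in for the duplex’s per-\(\text{Absorb}\) framing) keeps them distinct:</p>

```python
import hashlib

# Plain concatenation is ambiguous: ("bob", "@example.com") and
# ("bob@", "example.com") hash to the same identifier.
assert (hashlib.sha3_256(b"bob" + b"@example.com").digest()
        == hashlib.sha3_256(b"bob@" + b"example.com").digest())

def transcript_id(*fields: bytes) -> bytes:
    """Length-prefix each field, mimicking the duplex's framed absorbs."""
    h = hashlib.sha3_256()
    for field in fields:
        h.update(len(field).to_bytes(8, "big") + field)
    return h.digest()

# Framed fields are preserved as independent inputs: no ambiguity.
assert transcript_id(b"bob", b"@example.com") != transcript_id(b"bob@", b"example.com")
```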

<h3 id="digital-signatures">Digital Signatures</h3>

<p>By combining our duplex with a well-defined elliptic curve like <a href="https://ristretto.group">ristretto255</a>, we can turn
a cryptographic transcript into a sophisticated digital signature.
For example, let’s create an <a href="https://datatracker.ietf.org/doc/html/rfc8032">EdDSA</a>-style
<a href="https://en.wikipedia.org/wiki/Schnorr_signature">Schnorr signature</a> of a message \(m\) given the signer’s key pair
\((d, Q)\).</p>

<h4 id="moon-math-primer">Moon Math Primer</h4>

<p>But first, we need to talk about elliptic curves.</p>

<p>Elliptic curves are awesome but don’t make a lot of intuitive sense to folks who are familiar with other cryptographic
things like hash functions and encryption algorithms, so here’s a quick primer on what’s in play.</p>

<p>A <em>scalar</em> is like a regular integer in that you can do arbitrary arithmetic with it.
Scalars are usually notated using lowercase variables like \(d\) and \(k\).
You can turn an arbitrary string of bits \(\{0,1\}^n\) into a scalar by reducing them to fit into the space of possible
scalar values:</p>

\[\{0,1\}^n \bmod \ell \in \{0 \dots \ell-1\}\]

<p>A <em>point</em> is a pair of coordinates identifying a point on the elliptic curve, notated using uppercase variables like \(Q\)
and \(G\).
\(G\) is the <em>generator point</em>, which is a point on the curve everyone agreed to use as a constant.
You can add points together and you can subtract one point from another, but you can’t multiply or divide them.</p>

<p>You can multiply a point \(Q\) by a scalar \(d\) (notated \([d]Q\)) by adding \(Q\) to itself \(d\) times.
Calculating \([x]Y\) is super cheap; going from \([x]Y\) to either \(x\) or \(Y\) is functionally impossible.</p>

<p>Private keys are scalars which are picked at random from the set of all possible scalars.
A public key \(Q\) for a private key \(d\) is calculated by adding the generator point \(G\) to itself \(d\) times:
\(Q=[d]G\).
Again, it’s easy to go from private key to public key; it’s effectively impossible to go from public key to private key.</p>

<p>A longer, gentler introduction to the subject
<a href="https://fangpenlin.com/posts/2019/10/07/elliptic-curve-cryptography-explained/">can be found here</a>.</p>
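<p>If curves feel too abstract, the same one-way structure shows up in the multiplicative group of integers mod a prime, which is easy to play with in Python. This is an analogy for the curve operations, not elliptic-curve code: <code class="language-plaintext highlighter-rouge">pow(g, d, p)</code> plays the role of \([d]G\), and the choice of \(p\) and \(g\) here is purely illustrative.</p>

```python
import secrets

p = 2**255 - 19   # a large prime (Curve25519's field prime, reused here as a modulus)
g = 2             # our stand-in for the generator point G

# Private keys are random scalars; public keys are cheap to derive
# via fast exponentiation...
d = secrets.randbelow(p - 1)
Q = pow(g, d, p)  # analogue of Q = [d]G

# ...but recovering d from Q is the discrete-logarithm problem: infeasible.

# Scalar multiplication commutes, which is what makes Diffie-Hellman work:
a, b = secrets.randbelow(p - 1), secrets.randbelow(p - 1)
assert pow(pow(g, a, p), b, p) == pow(pow(g, b, p), a, p)  # [a][b]G = [b][a]G
```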

<p>Ok, back to our digital signature scheme.</p>

<h4 id="signing-a-message">Signing A Message</h4>

<p>First, we initialize a duplex and use it to absorb the message \(m\) and the signer’s public key \(Q\):</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q) \\
\text{Absorb}(m)
\end{gather*}\]

<p>Next, we generate a random commitment scalar \(k\), then calculate and absorb the commitment point \(I\):</p>

\[\begin{gather*}
k \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
I \gets [k]G \\
\text{Absorb}(I)
\end{gather*}\]

<p>Finally, we squeeze a challenge scalar \(r\) and calculate the proof scalar \(s\):</p>

\[\begin{gather*}
r \gets \text{Squeeze}(512) \bmod \ell \\
s \gets dr+k
\end{gather*}\]

<p>The resulting signature is
\(I||s\).</p>

<h4 id="verifying-a-signature">Verifying A Signature</h4>

<p>The signature can be verified like this:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q) \\
\text{Absorb}(m) \\
\text{Absorb}(I) \\
r' \gets \text{Squeeze}(512) \bmod \ell \\
I' \gets [s]G - [r']Q \\
I \stackrel{?}{=} I'
\end{gather*}\]
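<p>Here’s the sign/verify pair as runnable Python, using a multiplicative group mod a prime in place of the curve (so \([k]G\) becomes <code class="language-plaintext highlighter-rouge">pow(g, k, p)</code>, and \([s]G - [r]Q\) becomes a multiply by a modular inverse) and a SHAKE256 transcript in place of the duplex. Every parameter and encoding choice is illustrative only:</p>

```python
import hashlib
import secrets

# Toy Schnorr signatures in a multiplicative group mod a prime, standing in
# for the elliptic-curve version. Illustrative only; not real curve code.
p = 2**255 - 19
g = 2
q = p - 1  # we reduce scalars mod the group order

def challenge(Q: int, m: bytes, I: int) -> int:
    """Stand-in for the duplex transcript: hash Q, m, and I to a scalar."""
    h = hashlib.shake_256()
    for part in (Q.to_bytes(32, "big"), m, I.to_bytes(32, "big")):
        h.update(len(part).to_bytes(8, "big") + part)
    return int.from_bytes(h.digest(64), "big") % q

def sign(d: int, Q: int, m: bytes) -> tuple[int, int]:
    k = secrets.randbelow(q)     # random commitment scalar
    I = pow(g, k, p)             # commitment "point": I = [k]G
    r = challenge(Q, m, I)       # challenge depends on Q, m, AND I
    s = (d * r + k) % q          # proof scalar
    return I, s

def verify(Q: int, m: bytes, I: int, s: int) -> bool:
    r = challenge(Q, m, I)
    # Check [s]G - [r]Q = I, i.e. g^s / Q^r = I (inverse via Fermat's little theorem).
    return pow(g, s, p) * pow(pow(Q, r, p), p - 2, p) % p == I

d = secrets.randbelow(q)         # signer's key pair (d, Q)
Q = pow(g, d, p)
I, s = sign(d, Q, b"a signed message")
assert verify(Q, b"a signed message", I, s)
assert not verify(Q, b"a forged message", I, s)
```

The math is the same as above: \(g^s \cdot Q^{-r} = g^{dr+k-dr} = g^k = I\).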

<p>Now, EdDSA signatures are entirely possible to create with regular hash algorithms (<a href="https://ed25519.cr.yp.to">Ed25519</a>
uses SHA-512), but by using a duplex we get a number of benefits.</p>

<p>For one, it allows us to easily establish cryptographic dependence between values.
For Schnorr signatures, we need the challenge scalar \(r\) to be dependent on all the inputs: \(Q\), \(m\), and \(I\).
If we don’t include the signer’s public key \(Q\), then it’s possible to find another public key which produces the
same signature for the same message.
If we don’t include the message \(m\), then what are we even signing?
Likewise, if we don’t include the commitment point \(I\), then it becomes possible to <a href="https://blog.trailofbits.com/2021/02/19/serving-up-zero-knowledge-proofs/">produce an infinite number of
alternate valid signature/public key pairs given a signed message</a>.</p>

<p>For two, we’re not limited to signing a single bitstring.
We can easily sign arbitrarily structured data as long as it can be traversed in a deterministic fashion.
A set of financial transactions, a complex mathematical structure, a conversation between services, whatever.
You can even use duplexes to <a href="https://merlin.cool">securely structure zero-knowledge proofs</a>.</p>

<p>For these more complex protocols, it’s helpful to think of the duplex as establishing a transcript of all the values
being used.
If we absorb all the inputs, the outputs will always be cryptographically dependent on them, giving adversaries far fewer
opportunities to modify them.</p>

<h3 id="signcryption">Signcryption</h3>

<p>Now that we’ve got encryption and digital signatures going, we can combine the two to produce a signcryption scheme,
which provides both confidentiality and authenticity in the public key setting.
Given a sender’s key pair \((d_S, Q_S)\), a receiver’s public key \(Q_R\), and a plaintext \(p\), we start by
initializing a duplex and absorbing everyone’s public keys:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q_S) \\
\text{Absorb}(Q_R)
\end{gather*}\]

<p>Next we generate an ephemeral key pair \((d_E, Q_E)\) and absorb the public key:</p>

\[\begin{gather*}
d_E \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
Q_E \gets [d_E]G \\
\text{Absorb}(Q_E)
\end{gather*}\]

<p>After that we calculate the <a href="https://en.wikipedia.org/wiki/Elliptic-curve_Diffie–Hellman">Elliptic Curve Diffie-Hellman</a>
(ECDH) shared secret \(K\) between the ephemeral private key \(d_E\) and the receiver’s public key \(Q_R\), and absorb
it, effectively using it as a symmetric key:</p>

\[\begin{gather*}
K \gets [d_E]Q_R \\
\text{Absorb}(K)
\end{gather*}\]

<p>Now that we’ve got a duplex keyed with a secret, we encrypt and absorb \(p\):</p>

\[\begin{gather*}
c \gets p \oplus \text{Squeeze}(|p|) \\
\text{Absorb}(p) \\
\end{gather*}\]

<p>At this point we have a duplex which is effectively a transcript of all the critical information we have: the sender’s
public key, the receiver’s public key, the ephemeral public key, the ECDH shared secret, and the plaintext.
If we squeezed an authentication tag here, we’d essentially have created
<a href="https://en.wikipedia.org/wiki/Integrated_Encryption_Scheme#Formal_description_of_ECIES">ECIES</a> using just two
cryptographic primitives.
But we can improve on that, so let’s integrate our Schnorr signature scheme:</p>

\[\begin{gather*}
k \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
I \gets [k]G \\
\text{Absorb}(I) \\
r \gets \text{Squeeze}(512) \bmod \ell \\
s \gets {d_S}r+k
\end{gather*}\]

<p>The final ciphertext is
\(C=Q_E||c||I||s\), which we can unsigncrypt (yes, that’s what they call it) like this:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q_S) \\
\text{Absorb}(Q_R) \\
\text{Absorb}(Q_E) \\
K \gets [d_R]Q_E \\
\text{Absorb}(K) \\
p' \gets c \oplus \text{Squeeze}(|c|) \\
\text{Absorb}(p') \\
\text{Absorb}(I) \\
r' \gets \text{Squeeze}(512) \bmod \ell \\
I' \gets [s]G - [r']Q_S \\
I \stackrel{?}{=} I'
\end{gather*}\]

<p>The receiver doesn’t know the ephemeral private key \(d_E\), so they calculate the ECDH shared secret as \([d_R]Q_E\)
because \([d_E]Q_R=[d_R]Q_E=[d_Ed_R]G\).
Likewise, they don’t know the sender’s private key \(d_S\), so they verify the signature using the sender’s public key \(Q_S\).</p>
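<p>Here’s the whole signcrypt/unsigncrypt flow as a runnable Python sketch. Two loud caveats: the “curve” is a tiny multiplicative subgroup mod a prime standing in for Ristretto255, and the duplex is faked with SHAKE-256 over a transcript. All the names (<code>ToyDuplex</code>, <code>signcrypt</code>, <code>unsigncrypt</code>) are invented for this post, and nothing here is remotely secure:</p>

```python
import hashlib
import secrets

# Toy group: the order-Q subgroup of Z_P* (P = 2Q + 1, both prime).
P, Q, G = 2039, 1019, 4

class ToyDuplex:
    """Fake duplex: Absorb/Squeeze via SHAKE-256 over a transcript."""
    def __init__(self):
        self.t = b""
    def absorb(self, data: bytes) -> None:
        self.t += len(data).to_bytes(8, "big") + data
    def squeeze(self, n: int) -> bytes:
        out = hashlib.shake_256(self.t).digest(n)
        self.absorb(b"sq" + out)
        return out

def enc(x: int) -> bytes:
    return x.to_bytes(2, "big")

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def challenge(d: ToyDuplex) -> int:
    return int.from_bytes(d.squeeze(8), "big") % Q  # r <- Squeeze mod l

def signcrypt(d_s: int, q_s: int, q_r: int, p_txt: bytes):
    d = ToyDuplex()
    d.absorb(enc(q_s)); d.absorb(enc(q_r))
    d_e = secrets.randbelow(Q - 1) + 1        # ephemeral key pair
    q_e = pow(G, d_e, P)
    d.absorb(enc(q_e))
    d.absorb(enc(pow(q_r, d_e, P)))           # K = [d_E]Q_R
    c = xor(p_txt, d.squeeze(len(p_txt)))     # encrypt...
    d.absorb(p_txt)                           # ...and absorb
    k = secrets.randbelow(Q - 1) + 1          # Schnorr commitment
    i = pow(G, k, P)
    d.absorb(enc(i))
    r = challenge(d)
    s = (d_s * r + k) % Q
    return q_e, c, i, s

def unsigncrypt(d_r: int, q_s: int, q_r: int, q_e: int, c, i, s):
    d = ToyDuplex()
    d.absorb(enc(q_s)); d.absorb(enc(q_r)); d.absorb(enc(q_e))
    d.absorb(enc(pow(q_e, d_r, P)))           # K = [d_R]Q_E
    p_txt = xor(c, d.squeeze(len(c)))
    d.absorb(p_txt)
    d.absorb(enc(i))
    r = challenge(d)
    i2 = (pow(G, s, P) * pow(q_s, Q - r, P)) % P  # I' = [s]G - [r']Q_S
    return p_txt if i2 == i else None

d_s = secrets.randbelow(Q - 1) + 1; q_s = pow(G, d_s, P)
d_r = secrets.randbelow(Q - 1) + 1; q_r = pow(G, d_r, P)
q_e, c, i, s = signcrypt(d_s, q_s, q_r, b"hello")
recovered = unsigncrypt(d_r, q_s, q_r, q_e, c, i, s)
```

<p>Because both sides drive the duplex through the same sequence of operations, the receiver derives the same keystream and the same challenge scalar, and the Schnorr check \(I'=[s]G-[r']Q_S\) closes the loop.</p>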

<p>The really cool thing is that our signcryption construction offers a much stronger degree of security than most because
using a duplex allows us to seamlessly integrate our encryption and digital signature algorithms.</p>

<h4 id="security-through-integration">Security Through Integration</h4>

<p>Because we’re using a duplex, our signcryption scheme is stronger than the generic schemes which combine separate
public key encryption and digital signature algorithms: Encrypt-Then-Sign (\(\mathcal{EtS}\)) and Sign-then-Encrypt
(\(\mathcal{StE}\)).</p>

<p>An adversary attacking an \(\mathcal{EtS}\) scheme can strip the signature from someone else’s encrypted message and
replace it with their own, potentially allowing them to trick the recipient into decrypting the message for them.
That’s possible because the signature is of the ciphertext itself, which the adversary knows.</p>

<p>With our scheme, on the other hand, the digital signature isn’t of the ciphertext alone.
A standard Schnorr signature scheme like Ed25519 derives the challenge scalar \(r\) from a hash of the signer’s
public key and the message being signed (i.e. the ciphertext).
Our challenge scalar, on the other hand, is derived from the duplex’s state, which depends on (among other things) the
ECDH shared secret \(K\) and the plaintext message \(p\).
Unless our adversary already knows \(K\) (the secret key that \(p\) is encrypted with) and \(p\) (the plaintext
they’re trying to trick someone into giving them), they can’t create their own signature (which they’re trying to do
in order to trick someone into giving them \(p\)).
You can see how that might be hard for them.</p>

<p>An adversary attacking an \(\mathcal{StE}\) scheme can decrypt a signed message sent to them and re-encrypt it for
someone else, potentially posing as the original sender.
For example, you send a sternly-worded letter to your manager, informing them that you hate your job and are quitting.
Your boss decrypts the message, keeps the signed plaintext, and several months later encrypts it with your new manager’s
public key and sends it to her.
If the message is sufficiently generic (e.g. <code class="language-plaintext highlighter-rouge">I hate this job and am quitting.</code>), it looks like an authentic message
from you to your new manager and you’re out of a job.</p>

<p>Our duplex-based scheme isn’t vulnerable to this kind of attack, thankfully, because, again, the challenge scalar \(r\)
is dependent on the duplex’s state, which in addition to the plaintext message \(p\) also includes the ECDH shared
secret \(K\) and all the public keys.
An adversary can duplicate the duplex’s state prior to the \(\text{Absorb}(K)\) operation, but no further.</p>

<h4 id="security-through-composition">Security Through Composition</h4>

<p>The fact that we can tightly integrate our encryption and digital signature algorithms without falling prey to the same
attacks as the generic constructions means we end up with a stronger scheme than even very modern standards, like
<a href="https://www.rfc-editor.org/rfc/rfc9180.html">HPKE</a>.
HPKE provides an authenticated scheme in the form of the \(\text{AuthEncap}\) KEM, but it’s vulnerable to a very sneaky
type of attack: key compromise impersonation.
<a href="https://eprint.iacr.org/2006/252.pdf">Key compromise impersonation</a> (KCI) occurs when a scheme allows an adversary who
compromises a user’s private key to forge messages from other users which the compromised user will believe are
authentic.</p>

<p>For example, let’s say our adversary compromises the victim’s private key \(d_V\).
The adversary can take an arbitrary public key–let’s say Beyoncé’s key, \(Q_B\)–and use it to calculate the shared
secret Beyoncé <em>would</em> use if she actually sent the victim a message: \([d_V]Q_B=[d_B]Q_V=[d_Bd_V]G\).
If the authenticity of the message depends entirely on only the sender and receiver knowing that shared secret, as with
HPKE or ECIES, the adversary could pose as Beyoncé to their victim, which would be an emotionally crushing kind of
catfishing.</p>

<p>Our signcryption scheme, on the other hand, is immune to these particular shenanigans.
The receiver of a message can verify the signature but cannot create one themselves; therefore, an adversary with the
same power as the receiver (i.e. with knowledge of their private key) cannot forge a signature from someone else.</p>

<h4 id="security-through-simplicity">Security Through Simplicity</h4>

<p>Finally, it’s worth stepping back and appreciating the parsimony of this approach.
It’s totally possible to build signcryption constructions out of traditional primitives like block ciphers, but the
degree of complexity (and the resulting risk of screwing something up) is dramatically higher.
For example, a design with similar properties might pass the ECDH shared secret point to HKDF-SHA2-512 and extract an
AES key, CTR nonce, and HMAC-SHA2-512 key, then use Ed25519 (which also requires SHA2-512) to sign (but not transmit)
the MAC of the ciphertext.
Like, I <em>think</em> that’d work?</p>

<p>But you’re talking about an implementation which requires AES, SHA2-512, the twisted Edwards form of Curve25519, and
probably the Montgomery form of Curve25519 (for X25519) as primitives, plus implementations of HKDF, HMAC, CTR mode,
and Ed25519.</p>

<p>Compare that to our construction, which requires a duplex (e.g. Xoodyak) and an elliptic curve (e.g. Ristretto255).
(Admittedly, Ristretto255 is built on the twisted Edwards form of Curve25519, but NIST’s P-256 would work, too.)</p>

<h3 id="session-encryption">Session Encryption</h3>

<p>As a final example of what you can do with a duplex, let’s talk about session encryption.
Imagine we’ve got two people that want to securely talk to each other over a (reasonably) reliable network (e.g. TCP).
They want to know that their conversation is confidential (i.e. no one can eavesdrop) and authentic (i.e. no one can
alter what’s said).
(Obviously these two should use TLS 1.3 and be done with it, but let’s see what we can do ourselves.)</p>

<p>We’ve got a sender (with a static key pair \((d_S, Q_S)\)) and a receiver (with a static key pair \((d_R, Q_R)\)).</p>

<p>The sender begins by generating an ephemeral key pair \((d_S^E, Q_S^E)\) and transmitting their ephemeral public key
\(Q_S^E\) to the receiver.
The receiver generates their own ephemeral key pair \((d_R^E, Q_R^E)\) and transmits their ephemeral public key
\(Q_R^E\) back to the sender.
They both initialize duplexes and absorb all of the public keys in the same order:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q_S) \\
\text{Absorb}(Q_S^E) \\
\text{Absorb}(Q_R) \\
\text{Absorb}(Q_R^E)
\end{gather*}\]

<p>Next, both sender and receiver calculate two ECDH shared secrets: \(K_0\), between the receiver’s ephemeral key pair
and the sender’s static key pair, and \(K_1\), between the sender’s ephemeral key pair and the receiver’s static key
pair.</p>

<p>On the sender’s side:</p>

\[\begin{gather*}
K_0 \gets [d_S^E]Q_R \\
K_1 \gets [d_S]Q_R^E \\
\text{Absorb}(K_0) \\
\text{Absorb}(K_1)
\end{gather*}\]

<p>On the receiver’s side:</p>

\[\begin{gather*}
K_0 \gets [d_R]Q_S^E \\
K_1 \gets [d_R^E]Q_S \\
\text{Absorb}(K_0) \\
\text{Absorb}(K_1)
\end{gather*}\]
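<p>To check that both parties really do arrive at the same duplex state, here’s a Python sketch of the key agreement. As before, the “curve” is a toy subgroup mod a small prime (standing in for Ristretto255), the duplex is faked with a transcript, and all the names are invented; nothing here is secure:</p>

```python
import secrets

P, Q, G = 2039, 1019, 4  # toy order-Q subgroup of Z_P*

class ToyDuplex:
    """Fake duplex: Absorb modeled as appending to a transcript."""
    def __init__(self):
        self.t = b""
    def absorb(self, data: bytes) -> None:
        self.t += len(data).to_bytes(8, "big") + data

def enc(x: int) -> bytes:
    return x.to_bytes(2, "big")

def keypair():
    d = secrets.randbelow(Q - 1) + 1
    return d, pow(G, d, P)

d_s, q_s = keypair()    # sender's static key pair
d_se, q_se = keypair()  # sender's ephemeral key pair
d_r, q_r = keypair()    # receiver's static key pair
d_re, q_re = keypair()  # receiver's ephemeral key pair

def keyed_duplex(k0: int, k1: int) -> ToyDuplex:
    """Absorb all four public keys, then K0 and K1, in a fixed order."""
    d = ToyDuplex()
    for x in (q_s, q_se, q_r, q_re, k0, k1):
        d.absorb(enc(x))
    return d

# Sender: K0 = [d_S^E]Q_R, K1 = [d_S]Q_R^E.
sender = keyed_duplex(pow(q_r, d_se, P), pow(q_re, d_s, P))
# Receiver: K0 = [d_R]Q_S^E, K1 = [d_R^E]Q_S.
receiver = keyed_duplex(pow(q_se, d_r, P), pow(q_s, d_re, P))
```

<p>Both computations of \(K_0\) equal \([d_S^Ed_R]G\) and both computations of \(K_1\) equal \([d_Sd_R^E]G\), so the two transcripts come out identical.</p>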

<p>Next, both sender and receiver clone their duplexes into <em>incoming</em> and <em>outgoing</em> duplexes (after all, a duplex is
just a state variable \(S\)).
Using their <em>outgoing</em> duplexes, they both create Schnorr signatures of the <em>outgoing</em> duplexes’ states.
For example, on the sender’s side:</p>

\[\begin{gather*}
k_S \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
I_S \gets [k_S]G \\
\text{Absorb}(I_S) \\
r_S \gets \text{Squeeze}(512) \bmod \ell \\
s_S \gets d_Sr_S+k_S
\end{gather*}\]

<p>The sender transmits their signature
\(I_S||s_S\) to the receiver and the receiver responds with their signature \(I_R||s_R\), at which point both parties
use their <em>incoming</em> duplexes to verify the signatures.
For example, on the receiver’s side:</p>

\[\begin{gather*}
\text{Absorb}(I_S) \\
r_S' \gets \text{Squeeze}(512) \bmod \ell \\
I_S' \gets [s_S]G - [r_S']Q_S \\
I_S \stackrel{?}{=} I_S'
\end{gather*}\]

<p>If the signatures are valid, the two parties have established a confidential and authentic channel of communication to
each other.
When the sender wants to send a message, they can use the <em>outgoing</em> duplex’s current state to encrypt the message and
generate an authentication tag:</p>

\[\begin{gather*}
c \gets p \oplus \text{Squeeze}(|p|) \\
\text{Absorb}(p) \\
t \gets \text{Squeeze}(m)
\end{gather*}\]

<p>Likewise, when they receive a message, they can use their <em>incoming</em> duplex to decrypt and authenticate:</p>

\[\begin{gather*}
p' \gets c \oplus \text{Squeeze}(|c|) \\
\text{Absorb}(p') \\
t' \gets \text{Squeeze}(m) \\
t \stackrel{?}{=} t'
\end{gather*}\]
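<p>Here’s a sketch of that send/receive loop in Python, starting from an already-shared post-handshake state. Again, the duplex is faked with SHAKE-256 over a transcript, <code>seal</code> and <code>open_</code> are names I made up, and nothing here is secure:</p>

```python
import hashlib

class ToyDuplex:
    """Fake duplex: Absorb/Squeeze via SHAKE-256 over a transcript."""
    def __init__(self, seed: bytes = b""):
        self.t = seed
    def absorb(self, data: bytes) -> None:
        self.t += len(data).to_bytes(8, "big") + data
    def squeeze(self, n: int) -> bytes:
        out = hashlib.shake_256(self.t).digest(n)
        self.absorb(b"sq" + out)
        return out

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def seal(d: ToyDuplex, p: bytes):
    """c = p XOR Squeeze(|p|); Absorb(p); t = Squeeze(m)."""
    c = xor(p, d.squeeze(len(p)))
    d.absorb(p)
    return c, d.squeeze(16)

def open_(d: ToyDuplex, c: bytes, t: bytes):
    p = xor(c, d.squeeze(len(c)))
    d.absorb(p)
    # A real implementation would compare tags in constant time.
    return p if d.squeeze(16) == t else None

# Stand-in for the state both parties share after the handshake.
out_d = ToyDuplex(b"post-handshake state")
inc_d = ToyDuplex(b"post-handshake state")

c1, t1 = seal(out_d, b"first message")
c2, t2 = seal(out_d, b"second message")
m1 = open_(inc_d, c1, t1)   # the states roll forward in lockstep
m2 = open_(inc_d, c2, t2)

# A tampered ciphertext fails authentication.
fresh = ToyDuplex(b"post-handshake state")
open_(fresh, c1, t1)
tampered = open_(fresh, bytes([c2[0] ^ 1]) + c2[1:], t2)
```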

<p>Instead of a static key, this effectively creates a session encryption system with a rolling key, with the duplex’s
state being cryptographically dependent on every prior operation.
In order to eavesdrop, an adversary would have to reconstruct the inner section of the duplex’s state.
If an adversary alters a message, the receiver’s state would diverge from the sender’s, making any communication
impossible.
Like our signcryption construction, the handshake provides what’s called “insider security”, in that it maintains
confidentiality if the sender is compromised and authenticity if the receiver is compromised.</p>

<p>Again, this is a toy protocol.
Our sender and receiver should use TLS.
But it’s worth noting that we created a toy mutually-authenticated key exchange with streaming, authenticated
encryption in what might be a couple of hundred lines of code, in a form that’s remarkably easy to document and
analyze, all using just two cryptographic primitives.</p>

<p>Neat, huh?</p>

<h2 id="tldr">tl;dr</h2>

<p>The duplex is a profoundly useful cryptographic primitive which allows us to build sophisticated cryptosystems in a 
robust and reliable way.
You can use a duplex to implement message digests, MACs, PRFs, stream ciphers, AEADs, and more using a single
cryptographic primitive whose security reduces entirely to the strength of a single permutation function.
Duplexes make it easier to build high-level constructions and protocols, but it’s such a new concept that relatively
little work has been done in terms of standardizing specific constructions or developing security proof techniques for
analyzing duplex-based protocols.
Hopefully everyone who makes it to the end of this overly-long article will be as enthusiastic as I am about the future
of the cryptographic duplex (and permutation-based cryptography, generally).
Here’s a <a href="https://www.youtube.com/watch?v=2R2gb0MKJlo">ten-hour video of some mountains</a> as a reward.</p>

<hr />

<h2 id="updated-april-03-2022">Updated April 03, 2022</h2>

<ul>
  <li>Clarified the definition of permutation. Thanks, <a href="https://lobste.rs/s/9ssv1p/joy_duplexes#c_dcvfes">quasi_qua_quasi</a>!</li>
</ul>

<hr />

<p><em>Thanks to various Fiascans and Jeff Hodges for reviewing this post.
Thanks to Niki Hale for suggesting putting a video at the bottom to reward diligent folks such as yourself.
Any mistakes in this article are mine, not theirs.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which integration brings simplification.]]></summary></entry><entry><title type="html">Work Is Work</title><link href="https://codahale.com//work-is-work/" rel="alternate" type="text/html" title="Work Is Work" /><published>2020-01-12T15:52:00+00:00</published><updated>2020-01-12T15:52:00+00:00</updated><id>https://codahale.com//work-is-work</id><content type="html" xml:base="https://codahale.com//work-is-work/"><![CDATA[<p class="message">Every time I’ve written or spoken about organizational design, I’ve regretted it.
There’s something about staking out a position on it which manages to prove me wrong a few years
later. But I’ve been having some long thinks about it again, and here’s what I’ve got. Strap the
fuck in.</p>

<p>At some point in time, every organization realizes that it’s slowing down. Features take longer to
ship, folks spend more and more time in meetings, and everyone gets panicky about estimation and
planning. If we’ve internalized essentialist bullshit about “B players hiring C players”, maybe we
get nervous about The Bar and hiring. If we’re <a href="https://www.washingtonpost.com/archive/opinions/1978/07/16/the-rituals-of-baseball/520382d5-2b17-4480-8574-442f380a6c00/">anxious and uncertain</a>, we might get
religion about Agile or Scrum or whatever. If we’re prone to modernist flights of fancy, we might
decide to spend our <a href="https://mcfunley.com/choose-boring-technology">innovation tokens</a> trying to disrupt the 200,000 year old industry of
getting humans to work together. If we’re the type to do our eight and hit the gate, we chalk it up
as the price of success.</p>

<p>These approaches rarely yield results.</p>

<h2 id="our-dogmatic-slumber">Our Dogmatic Slumber</h2>

<p>Most explanations of organizational success or failure are crap. <a href="https://en.wikipedia.org/wiki/Emic_and_etic">Emic</a> accounts–i.e., those
from within the organization–are limited to those concepts and narratives which exist within the
organization. They may structurally serve as self-narratives and foster a sense of in-group identity
and purpose, but because they come from the epistemological equivalent of a gravity well, their
explanatory power outside of that organization is usually terrible.</p>

<p>If we take the etic perspective–i.e. from outside the organization–we can see that emic
explanations for an organization’s success or failure have readily available counterfactuals in
other organizations. If agile, flat organizations, code reviews, monorepos, open offices, fancy type
systems, etc. were actually the causal factors they’re purported to be, then why do so many
organizations adopt those practices without success? Why are there other successful organizations
which lack those practices? How can we tell the difference between <em>cum hoc, ergo propter hoc</em>
just-so stories and actual causal factors?</p>

<p>More importantly, can we determine the <a href="https://plato.stanford.edu/entries/supervenience/">supervenience</a> of some set of factors on organizational
performance, not just in a particular context but across all possible organizations? That is, are
there necessary, <em>a priori</em> truths of organizational performance?</p>

<p>As it happens, there are.</p>

<h2 id="corporate-americas-next-top-model">Corporate America’s Next Top Model</h2>

<p>If you squint hard enough, <strong>an organization doing work is just an incredibly complex, dynamic,
distributed, parallel process</strong>. We have very good tools for understanding the rough outlines of how
those work, dating back at least to Menabrea’s 1842 sketch of Babbage’s Analytical
Engine. Now, I won’t pretend to have developed <a href="https://en.wikipedia.org/wiki/Psychohistory_(fictional)">psychohistory</a>, and given how difficult it
is to predict the behavior of even <a href="https://en.wikipedia.org/wiki/Three-body_problem">simple dynamic systems</a>, a fully predictive model of
organizational success certainly seems impossible.</p>

<p>But like Kant attempting to derive the necessary preconditions of subjective experience in
<a href="https://en.wikipedia.org/wiki/Critique_of_Pure_Reason"><em>Critique Of Pure Reason</em></a>, we can sketch out the <em>boundaries</em> of what an organization is
capable of and the dynamics of that as it grows. In the same way that the <em>a priori</em> knowledge that
ten pounds weighs more than five pounds informs how much shit we can expect to fit into a bag,
modeling organizations as parallel processes can inform the way we design them.</p>

<p>What happens inside those boundaries is a matter of execution and effort; what happens outside those
boundaries is impossible.</p>

<h2 id="the-ceiling-is-low">The Ceiling Is Low</h2>

<p><strong>The work capacity of an organization scales, at most, linearly as new members are added.</strong>
Each new member in an organization adds a constant number of possible work hours to the total
possible work hours of the company’s existing employees. <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s Law</a> states that given a
fixed task, a parallel solution utilizing \(N\) processors will run faster than a sequential
solution by <em>at most</em> a factor of \(N\).</p>

<p>As parallel resources are added, the total time spent in the parallelizable portion of the task
amortizes to zero; in contrast, the total time spent in the sequential portion of the task never
drops below a floor value. This is as true for a group of people trying to write software as it is
for a group of CPUs trying to model the behavior of stars in the galaxy. Our intuition tells us that
larger organizations do exhibit superlinear behaviors, but this literally cannot be the case if
hiring is the only variable in the equation. Therefore, our only hope for superlinear productivity
lies in changing the task which is being executed. Thankfully, work capacity is not the same as
productivity.</p>
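<p>Amdahl’s bound is easy to compute for yourself. A quick sketch (the function name is mine, not a standard API):</p>

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Maximum speedup of a fixed task with the given serial
    fraction when run on n parallel workers (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# A perfectly parallel task scales linearly with headcount...
print(amdahl_speedup(0.0, 10))  # 10.0
# ...but even a 10% sequential portion caps the speedup at 10x,
# no matter how many workers are added.
print(round(amdahl_speedup(0.1, 10), 2))      # 5.26
print(round(amdahl_speedup(0.1, 10_000), 2))  # 9.99
```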

<p>As an organization hires more employees, work on productivity improvements must be a constant
priority. Internal tooling, training, and services must be developed and fielded to ensure that all
members are able to work on problems of continuously increasing impact. <strong>The ceaseless pursuit of
force multipliers is the only possible route to superlinear productivity improvements as an
organization grows.</strong></p>

<p>Finally, it must be emphasized that this linear bound on work capacity is a ceiling, not a floor.
One cannot do better than linear, but one can certainly do worse. There are many other factors which
act as a drag on work capacity, and organization-wide improvements in productivity are critical in
mitigating them.</p>

<h2 id="the-floor-is-lava">The Floor Is Lava</h2>

<p><strong>Contention costs grow superlinearly as new members are added.</strong>
Parallel solutions to tasks are rarely perfectly concurrent (and indeed, such tasks are rightfully
called “embarrassingly parallel”), and often require some sequential critical sections. The line of
CPUs or people waiting to enter a critical section can be modeled as a queue, which allows us to use
queueing theory to understand how the queue’s cycle time changes as the queue size grows. If we
model the line for a sequential section as a \(G/G/1\) queue, which is to say, without making any
assertions about the arrival process or service time distribution but assuming a single queue server
(i.e. only one CPU or person can hold the lock), we arrive at <a href="https://en.wikipedia.org/wiki/Kingman%27s_formula">Kingman’s Formula</a> for the mean
wait time:</p>

\[\mathbb E(W_q) \approx \left( \frac{\rho}{1-\rho} \right) \left( \frac{c_a^2+c_s^2}{2}\right) \tau\]

<p>Notably, the wait time of a queue increases non-linearly with respect to \(\rho\) (utilization) and
quadratically with respect to \(c_a\) (the coefficient of variation for arrivals) and \(c_s\) (the
coefficient of variation for service times). (This is the quantified form of the intuition that
queues are either empty or overflowing.)</p>
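<p>Kingman’s formula is simple enough to play with directly. A quick sketch (again, the function name is mine):</p>

```python
def kingman_wait(rho: float, ca: float, cs: float, tau: float) -> float:
    """Approximate mean wait time in a G/G/1 queue: utilization rho,
    coefficients of variation ca (arrivals) and cs (service), and
    mean service time tau (Kingman's formula)."""
    return (rho / (1.0 - rho)) * ((ca ** 2 + cs ** 2) / 2.0) * tau

# With ca = cs = 1 and a one-hour mean service time, watch wait
# times explode as utilization climbs:
for rho in (0.5, 0.9, 0.99):
    print(rho, round(kingman_wait(rho, 1.0, 1.0, 1.0), 1))
```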

<p>The non-linearity of this should give us pause, as increasing the number of people contending for a
shared resource is the same thing as increasing \(\rho\). If contention on those resources is
unmanaged, organizational growth can result in catastrophic increases in wait time. At some point,
adding new members can cause the organization’s overall productivity to decrease instead of
increase, as the increase in wait time due to contention is greater than the increase in work
capacity. (This is the organizational version of the latency spikes we see as servers become
overloaded.)</p>

<p>These shared resources aren’t necessarily physical things, like bathrooms or printers; they can be
digital, like files in a source code repository or tickets in a bug tracker, or organizational, like
code reviews or work assignments. As with writing highly-concurrent applications, building
high-performing organizations requires a careful and continuous search for shared resources, and
developing explicit strategies for mitigating their impact on performance.</p>

<p>A commonly applied but rarely successful strategy is using external resources–e.g. consultants,
agencies, staff augmentation–as an end-run around contention on internal resources. While the
consultants can indeed move quickly in a low-contention environment, integrating their work product
back into the contended resources often has the effect of ballooning \(c_s\) (the variation of
service times, or how long a critical section is held). This produces a quadratic spike in wait
times which increases utilization which in turn produces a superlinear spike in wait times.
(Queueing theory is a harsh mistress.) Successful strategies for reducing contention include
increasing the number of instances of a shared resource (e.g., adding bathrooms as we add employees)
and developing stateless heuristics for coordinating access to shared resources (e.g., grouping
employees into teams).</p>

<p>As with heavily layered applications, the more distance between those designing the organization and
the work being done, the greater the risk of unmanaged points of contention. Top-down organizational
methods can lead to subdivisions which seem like parallel efforts when listed on a slide but which
are, in actuality, highly interdependent and interlocking. <strong>Staffing highly sequential efforts as
if they were entirely parallel leads to catastrophe.</strong></p>

<h2 id="hell-is-other-people">Hell Is Other People</h2>

<p><strong>Coherence costs grow quadratically as new members are added.</strong>
Working on complex tasks using parallel resources (or with a group of people) requires
communication. A group of \(3\) has \(3\) dyads; a group of \(4\) has \(6\); a group of \(5\) has
\(10\); a group of \(N\) people has \(\frac{N^2-N}{2}\) possible dyads. Point-to-point communication
(i.e., talking to each other) can be modeled as the activation of a subset of those dyads.</p>

<p>While some organizations are chattier than others, this communication is essential for the sharing
of information and the coordination of action. But it’s not free. Communication takes time. If the
relative percentage of people who need to talk to each other to get something done stays constant as
the organization grows (i.e., \(x\%\) of all dyads), <strong>the total time spent communicating will grow
quadratically as the work capacity of the organization grows linearly</strong>.</p>
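<p>The dyad count is worth computing to feel the quadratic growth (the function name is mine):</p>

```python
def dyads(n: int) -> int:
    """Number of possible communication pairs among n people."""
    return n * (n - 1) // 2

print([dyads(n) for n in (3, 4, 5)])  # [3, 6, 10]
# A tenfold increase in headcount yields a ~hundredfold increase
# in possible pairwise communication channels:
print(dyads(10), dyads(100))  # 45 4950
```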

<p>We can consider group meetings as a batching strategy to reduce the number of entities involved in
point-to-point communications, but the effectiveness of this strategy depends heavily on the
relative overlap of groups and the group structures. The degree to which the groups overlap is
essentially the same factor as the percentage of dyads required for communication. If the group
sizes are bounded, the growth of coherence costs will be reduced by a constant factor but still grow
quadratically. It may be tempting to try to punt on coherence and just ride dirty, but even subtle
forms of incoherence have <a href="https://www.holmesreport.com/latest/article/the-cost-of-poor-communications">massive business costs</a>. The only scalable strategy for containing
coherence costs is to limit the number of people an individual needs to talk to in order to do their
job to a constant factor.</p>

<p>In terms of organizational design, this means limiting both the types and numbers of consulted
constituencies in the organization’s process. Each additional person or group in a <a href="https://en.wikipedia.org/wiki/Responsibility_assignment_matrix">responsibility
assignment matrix</a> geometrically increases the area of that matrix. Each additional
responsibility assignment in that matrix geometrically increases the cost of organizational
coherence.</p>

<p>It’s also worth noting that these pair-wise communications don’t need to be formal, planned, or even
well-known in order to have costs. Neither your employee handbook nor your calendar are accurate
depictions of <a href="https://safetydifferently.com/the-varieties-of-human-work/">how work in the organization is done</a>. Unless your organization is staffed with
zombies, members of the organization will constantly be subverting standard operating procedure in
order to get actual work done. Even ants improvise. An accurate accounting of these hidden costs can
only be developed via an honest, blameless, and continuous end-to-end analysis of the work as it is
happening.</p>

<h2 id="principles-from-beyond-space-and-time">Principles From Beyond Space And Time</h2>

<h3 id="keep-the-work-parallel-the-groups-small-and-the-resources-local">Keep the work parallel, the groups small, and the resources local</h3>

<p>When presented with a set of problems which grow superlinearly intractable as \(N\) increases, our
best bet is to keep \(N\) small. If the organization’s intent is to increase value delivery by
hiring more people, work efforts <strong>must</strong> be as independent as possible. Leaders should develop
practices and processes to ensure that the work efforts which their strategies consider parallel are
actually parallel. Shared resources should be continuously managed for contention, and where
possible, the resources a group needs should be colocated with that group (e.g., if the work
involves a lot of design, staff a designer to that group). <a href="https://en.wikipedia.org/wiki/Combined_arms">Combined arms doctrine</a> isn’t just
for soldiers.</p>

<h3 id="prioritize-the-development-of-force-multipliers">Prioritize the development of force multipliers</h3>

<p>If an organization is largely working on the same types of problems it was in previous years, it’s
cause for concern. Teams dedicated to internal tooling should be staffed and given the explicit
direction of building tools and optimizing processes to help increase their co-workers’
productivity. If the percentage of the organization dedicated to improving how the organization
works begins to fall, ask yourself–have we hit a global or local maximum? Go long on high-leverage
tools, but stay grounded in whether or not they actually help.</p>

<h3 id="if-possible-factor-work-products-into-independent-modules-if-not-grow-slowly-and-optimize">If possible, factor work products into independent modules; if not, grow slowly and optimize</h3>

<p>If your work product–e.g. codebase, documents, etc.–can be factored into independent modules, do
so. The key word there is <em>independent</em>. Slicing your shit up into a hundred microservices will not
help you if everyone needs to change ten of them to get anything done. Some problems are not
particularly amenable to partition; these are also problems which do not benefit much from
additional workers. If the problem is a fixed point, look at ways of optimizing the sequential
portion of the work. Know that throwing bodies at that problem will produce a clusterfuck.</p>

<h3 id="scale-organizational-efforts-across-a-portfolio-of-synergistic-products">Scale organizational efforts across a portfolio of synergistic products</h3>

<p>Most smart businesses start out with a single product. They go long on their product hypothesis, put
their eggs in a single basket, and swing for the fences. If they’re lucky enough to get traction,
they double down on this. Over. And over. And over. At some point, they’ve got several battalions’
worth of people milling around, all trying to figure out who owns the turboencabulator UI and
whether or not the new marzelvanes will be fully antigravic by the big fall marketing push.</p>

<p>To avoid that, organization leaders should keep the development of a product portfolio as an
explicit goal. Feature or product ideas which are complementary to the organization’s overall
business strategy but don’t naturally coexist with the main product can be developed as separate
products by independent teams. We have evidence that <a href="https://cacm.acm.org/magazines/2020/1/241718-in-search-of-the-shortest-possible-schedule/fulltext">software development schedules can be shortened
by, at most, 25%</a>; it should be easy to pick between a single product in 18 weeks or two
products in 24 weeks.</p>

<p>Successful new products can be incrementally integrated with the existing products where it makes
sense, and tooling, libraries, and frameworks can be developed by force multiplier teams to reduce
both time-to-market of new products and the carrying costs of existing products. Unsuccessful new
products can be gracefully removed from market at dramatically lower cost than features of similar
complexity. After all, removing a feature from a product involves the same contention and coherence
costs as adding a feature. It’s rare for a feature to present carrying cost greater than the cost of
its removal, thus the prevalence of <a href="https://en.wikipedia.org/wiki/Ghost_ship">Flying Dutchman</a> features.</p>

<p>As a concrete example of the virtue of a product portfolio, imagine Amazon Web Services as a single
product, staffed by a hundred thousand doomed souls and fronted by a UI which is just a series of
buttons allowing you to provision and operate virtual machines, databases, data lakes, robotics
applications, augmented reality apps, IoT dinguses, and more. Such a creature would implode under
its own weight.</p>

<p>Instead, Amazon Web Services is a portfolio of synergistic products. EC2 has its own independent set
of features, developed and operated by an independent set of employees. When its needs can be
provided by another AWS product (e.g., storing virtual machine images on S3, or sending metrics to
CloudWatch), a cross-product integration is introduced. This product structure enables the highly
concurrent organizational structure that enables Amazon to field a <a href="https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Services">blistering number</a> of new
products every year while continuing to support and develop existing products. <a href="https://en.wikipedia.org/wiki/Amazon_SimpleDB">Failed
services</a> can be sunsetted or drawn down without disrupting the rest of the organization.</p>

<h3 id="keep-responsibility-assignment-matrices-small-sparse-and-local">Keep responsibility assignment matrices small, sparse, and local</h3>

<p>As an organization matures, ad-hoc roles are often developed into full teams. This specialization is
often critical for building internal economies of scale, but the formalization of new constituencies
should be kept in check. Each column in your responsibility assignment matrix expands the possible
set of required interactions geometrically; each assignment in the matrix is a coordination point
requiring waiting. Where a matrix indicates a high-touch relationship between two groups (e.g., a
group of engineers working on a feature and the lawyers trying to ensure the legal compliance of
that feature), efforts should be made to reduce the cost of that interaction by colocating their
members (e.g., embed a lawyer with the engineers).</p>

<h3 id="prioritize-asynchronous-information-distribution-over-synchronous">Prioritize asynchronous information distribution over synchronous</h3>

<p>A significant source of <a href="https://en.wikipedia.org/wiki/Lean_services#Value_Demand_and_Failure_Demand">failure demand</a> for meetings and status updates is the desire of
organizational leaders to keep abreast of who’s doing what. This situational awareness is indeed
important, but trying to maintain it by calling meetings, messaging people on Slack, and catching
people in the hallways is a significant systemic drag on organizational productivity.</p>

<p>A better model for staying informed of developments as the organization scales is for groups to
publish status updates as part of the regular cadence of their work. Leaders can asynchronously read
these updates and, should the need arise, initiate additional, synchronous conversation to ask
questions, provide feedback, etc.</p>

<p>Synchronous meetings should be reserved for low-latency collaboration on complex issues; likewise,
collaboration should be reserved for synchronous meetings.</p>

<h3 id="what-happens-inside-the-boundaries-is-important">What happens inside the boundaries is important</h3>

<p>That we know some of the boundaries of organizational performance and their dynamics doesn’t excuse
us from using our empathy to build humane organizations. Companies are groups of people being
compensated for having to spend some of their finite lifetimes not being with their partners,
children, pets, or super weird hobbies. They deserve to be members of organizations which honor that
time by ensuring that their work has value and meaning. There is no mathematical model to guide us
to that goal.</p>

<hr />

<p><em>Thanks to various Fiascans for reviewing this post. Any mistakes in this article are mine, not
theirs.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which returns diminish.]]></summary></entry><entry><title type="html">Towards A Safer Footgun</title><link href="https://codahale.com//towards-a-safer-footgun/" rel="alternate" type="text/html" title="Towards A Safer Footgun" /><published>2017-06-08T00:00:00+00:00</published><updated>2017-06-08T00:00:00+00:00</updated><id>https://codahale.com//towards-a-safer-footgun</id><content type="html" xml:base="https://codahale.com//towards-a-safer-footgun/"><![CDATA[<p>Modern symmetric encryption is built around <a href="https://tools.ietf.org/html/rfc5116">Authenticated Encryption with Associated
Data</a> (AEAD) constructions: combinations of ciphers and message authentication codes which
provide strong guarantees of both <em>confidentiality</em> and <em>integrity</em>. These constructions avoid the
<a href="https://moxie.org/blog/the-cryptographic-doom-principle/">“doom principle”</a> which made so many older cryptosystems vulnerable to online attacks, but many
of the standard AEAD constructions have problems of their own.</p>

<h2 id="whats-an-aead">What’s an AEAD?</h2>

<p>An AEAD is essentially a pair of functions:</p>

\[\begin{array}{rccc}
\operatorname{Encrypt}\colon &amp; (K, N, M, D) &amp; \to &amp; (C, T)\\
\operatorname{Decrypt}\colon &amp; (K, N, C, T, D) &amp; \to &amp; (M) \coprod \varnothing\\
\end{array}\]

<p>Given a key \(K\), a nonce \(N\), a message \(M\), and some associated data \(D\), the
\(\operatorname{Encrypt}\) function returns a ciphertext \(C\) and an authentication tag \(T\). (The
associated data is authenticated but not encrypted.) Passing the same key, nonce, data, ciphertext, and tag to
the \(\operatorname{Decrypt}\) function will return the original message \(M\). If <em>any</em> bit of the
key, nonce, data, ciphertext, or tag are altered, however, \(\operatorname{Decrypt}\) will return
nothing (or an error).</p>
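<p>As a concrete sketch of that interface, here’s a minimal seal/open pair built on the JDK’s
AES-GCM implementation. (The class and method names here are mine; this illustrates the AEAD shape,
not a recommendation of a particular construction.)</p>

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class AeadExample {
  // Encrypt(K, N, M, D) -> (C, T): the JDK appends the 128-bit tag T to the ciphertext C.
  static byte[] seal(byte[] key, byte[] nonce, byte[] message, byte[] data)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
        new GCMParameterSpec(128, nonce));
    cipher.updateAAD(data); // D is authenticated but not encrypted
    return cipher.doFinal(message);
  }

  // Decrypt(K, N, C, T, D) -> M, or an AEADBadTagException if any input was altered.
  static byte[] open(byte[] key, byte[] nonce, byte[] ciphertextAndTag, byte[] data)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
        new GCMParameterSpec(128, nonce));
    cipher.updateAAD(data);
    return cipher.doFinal(ciphertextAndTag);
  }
}
```

<p>Flipping a single bit of the key, nonce, ciphertext, tag, or data makes
<code class="language-plaintext highlighter-rouge">open</code> throw instead of returning a garbled
message.</p>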

<h2 id="fantastical-constructions">Fantastical constructions…</h2>

<p>The addition of <a href="https://tools.ietf.org/html/rfc5246#section-6.2.3.3">AEAD-based ciphersuites</a> to TLS 1.2 gave the world a set of standard AEAD
constructions, including the ubiquitous <a href="https://tools.ietf.org/html/rfc5288">AES-GCM</a> and the more recent
<a href="https://tools.ietf.org/html/rfc7905">ChaCha20-Poly1305</a>. These have found their way into <a href="https://golang.org/pkg/crypto/cipher/#AEAD">standard libraries</a>,
<a href="https://download.libsodium.org/doc/secret-key_cryptography/ietf_chacha20-poly1305_construction.html">high-level libraries</a>, <a href="https://github.com/kubernetes/kubernetes/blob/f89d2493f100bf7268b8778d7e077e7d5383c50b/staging/src/k8s.io/apiserver/pkg/storage/value/encrypt/aes/aes.go#L69-L71">open-source projects</a>, and <a href="https://d0.awsstatic.com/whitepapers/KMS-Cryptographic-Details.pdf">online services</a>, largely
displacing a small ecosystem of lightly-scrutinized, ad-hoc constructions. This is a positive
development, but it’s worth noting that the proliferation of these constructions outside of the
context for which they were developed–i.e. TLS–has exposed us to additional cryptographic
challenges.</p>

<h2 id="and-where-to-find-them">…and where to find them</h2>

<p>TLS 1.2 uses AEADs to protect records–batches of data being sent or received via a TLS
connection–and the standard IETF AEAD constructions are designed to take advantage of specific
details of that context. For one, TLS connections are protected using randomly generated,
per-session keys. For two, TLS records have unique, sequenced identifiers to prevent an attacker
from dropping or duplicating records without detection.</p>

<p>As a result, all IETF AEAD standards use 96-bit nonces derived from the record identifiers, which
are guaranteed to be unique inside a single connection. Because the keys are also scoped to a single
connection, there is less risk of accidentally re-using a key/nonce pair. That’s an important
consideration, because these constructions’ guarantees of confidentiality and integrity are
completely broken if a key/nonce pair is ever re-used.</p>
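<p>The deterministic construction RFC 7905 (and later TLS 1.3) uses is simple: pad the 64-bit record
sequence number to 96 bits and <code class="language-plaintext highlighter-rouge">xor</code> it with a
per-connection IV. A sketch (the class and method names are mine):</p>

```java
class RecordNonce {
  // Derive a unique 96-bit nonce from a per-connection IV and a record sequence
  // number, as in RFC 7905 / TLS 1.3: nonce = iv xor pad(seq).
  static byte[] nonceFor(byte[] connectionIv, long sequenceNumber) {
    byte[] nonce = connectionIv.clone(); // 12 bytes
    for (int i = 0; i < 8; i++) {
      // xor the sequence number into the low-order 8 bytes, big-endian
      nonce[11 - i] ^= (byte) (sequenceNumber >>> (8 * i));
    }
    return nonce;
  }
}
```

<p>Within a connection, each sequence number yields a distinct nonce, so no coordination or random
generation is needed.</p>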

<h2 id="every-cloud-has-a-granite-lining">Every cloud has a granite lining</h2>

<p>AES-GCM provides confidentiality by using AES in counter mode to encrypt the data. A 128-bit
counter, which is derived from the nonce, is encrypted and the resulting key stream is <code class="language-plaintext highlighter-rouge">xor</code>-ed with
the plaintext to produce the ciphertext. If a key/nonce pair is used for two different messages, an
attacker can recover \(M_1 \oplus M_2\), which will reveal information about both messages.
ChaCha20-Poly1305 uses ChaCha20, a stream cipher, which is vulnerable to the same attack should a 
nonce ever be reused.</p>
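<p>That failure mode is easy to demonstrate with AES in counter mode (the confidentiality half of
AES-GCM): encrypting two messages under the same key and counter block produces ciphertexts whose
<code class="language-plaintext highlighter-rouge">xor</code> equals the
<code class="language-plaintext highlighter-rouge">xor</code> of the plaintexts, because the
identical keystream cancels out. A small sketch (the all-zero key and IV are deliberate misuse, for
illustration only):</p>

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class NonceReuse {
  // Encrypt with AES-CTR under a fixed key and counter block (deliberately misused here).
  static byte[] ctr(byte[] key, byte[] iv, byte[] input) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(input);
  }

  static byte[] xor(byte[] a, byte[] b) {
    byte[] out = new byte[Math.min(a.length, b.length)];
    for (int i = 0; i < out.length; i++) out[i] = (byte) (a[i] ^ b[i]);
    return out;
  }

  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16]; // all-zero key, for illustration only
    byte[] iv = new byte[16];  // the same counter block for both messages
    byte[] m1 = "attack at dawn!!".getBytes();
    byte[] m2 = "attack at dusk!!".getBytes();
    byte[] c1 = ctr(key, iv, m1);
    byte[] c2 = ctr(key, iv, m2);
    // The keystream cancels out: C1 xor C2 == M1 xor M2, leaking plaintext structure.
    System.out.println(java.util.Arrays.equals(xor(c1, c2), xor(m1, m2))); // prints "true"
  }
}
```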

<p>AES-GCM and ChaCha20-Poly1305 both provide integrity by hashing the ciphertext with polynomial
message authentication codes (GHASH for the former, Poly1305 for the latter). These types of
algorithms are vulnerable to <a href="https://eprint.iacr.org/2013/144.pdf">forgery attacks</a> should an authentication key ever be
reused. ChaCha20-Poly1305’s construction <a href="https://eprint.iacr.org/2017/239.pdf">is resilient to this sort of attack</a> because the
authentication key is derived from the ChaCha20 keystream, but AES-GCM is <a href="https://eprint.iacr.org/2013/157.pdf">very fragile</a>
and fails catastrophically in the event of key/nonce reuse.</p>

<p>Because key/nonce reuse is essentially impossible in the context of a TLS connection, these
constructions trade nonce-misuse resistance for performance. But what about in other contexts?</p>

<h2 id="betting-ones-shirt">Betting one’s shirt</h2>

<p>Without a naturally-occurring nonce like TLS’s record identifiers, developers often <a href="https://github.com/kubernetes/kubernetes/blob/f89d2493f100bf7268b8778d7e077e7d5383c50b/staging/src/k8s.io/apiserver/pkg/storage/value/encrypt/aes/aes.go#L69-L71">generate random
nonces</a> on the assumption that the probability of two operations picking the same 96-bit
value is low enough to safely ignore. For a few operations, that’s certainly correct, but thanks to
the <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday problem</a>, those probabilities get big quick. At what point does that become
unsafe?</p>
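<p>We can put numbers on that with the standard birthday-bound approximation: among \(n\) random
96-bit nonces, the probability of at least one collision is roughly \(1 - e^{-n(n-1)/2^{97}}\). A
quick sketch (the class name is mine):</p>

```java
class NonceCollision {
  // Birthday-bound approximation for a collision among n random 96-bit nonces:
  // p ~= 1 - exp(-n(n-1)/2^97). expm1 keeps precision for tiny probabilities.
  static double collisionProbability(double n) {
    return -Math.expm1(-n * (n - 1) / Math.pow(2, 97));
  }

  public static void main(String[] args) {
    for (double n : new double[] {1e6, 1e9, 4.3e9, 1e12}) {
      System.out.printf("%.1e nonces: p = %.2e%n", n, collisionProbability(n));
    }
  }
}
```

<p>The probability stays below \(2^{-32}\) only while the number of messages is within a small
factor of \(2^{32}\).</p>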

<p>Thankfully, NIST made <a href="http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf">concrete recommendations</a> regarding this:</p>

<blockquote>
  <p>The probability that the authenticated encryption function ever will be invoked with the same IV 
and the same key on two (or more) distinct sets of input data shall be no greater than 
2<sup>-32</sup>.</p>
</blockquote>

<blockquote>
  <p>…unless an implementation only uses 96-bit IVs that are generated by the deterministic 
construction:</p>

  <p>The total number of invocations of the authenticated encryption function shall not exceed
2<sup>32</sup>, including all IV lengths and all instances of the authenticated encryption 
function with the given key.</p>
</blockquote>

<p>If we’re randomly generating nonces, then, we should encrypt no more than 4,294,967,296 messages
with the same key. That’s a lot, right? Well, no.</p>

<p>As an example, let’s assume we’ve got a web application running on a bunch of independent servers
and we’re using AES-GCM or ChaCha20-Poly1305 to secure something in each request with a shared key.
At 100 requests a second, we won’t need to rotate keys for 16 months. At 1,000 requests per second,
we’ll need to rotate keys every 50 days or so. At 10,000 requests per second, we’ll be rotating keys
more than once a week. At 100,000 requests per second, we’ll be rotating keys more than twice a day.</p>
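<p>Those intervals fall straight out of dividing the \(2^{32}\) budget by the request rate; a
trivial sketch (the class and method names are mine):</p>

```java
class KeyRotation {
  // Seconds until a key hits the 2^32-invocation budget at a given request rate.
  static double secondsUntilRotation(double requestsPerSecond) {
    return Math.pow(2, 32) / requestsPerSecond;
  }

  public static void main(String[] args) {
    for (double rate : new double[] {100, 1_000, 10_000, 100_000}) {
      System.out.printf("%,9.0f req/sec: rotate every %6.1f days%n",
          rate, secondsUntilRotation(rate) / 86_400);
    }
  }
}
```

<p>That works out to roughly 497 days at 100 requests per second, 50 days at 1,000, 5 days at
10,000, and half a day at 100,000, matching the estimates above.</p>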

<p>What’s a developer to do?</p>

<h2 id="a-safer-footgun">A safer footgun</h2>

<p>The solution, broadly speaking, is a class of AEAD constructions called “nonce-misuse
resistant”: they don’t immediately fall apart should a key/nonce pair be re-used. That doesn’t mean
they can tolerate arbitrary amounts of re-use, but rather that their security bounds degrade much
more slowly with each re-used nonce.</p>

<p>The most practical nonce-misuse resistant AEAD proposal, I think, is <a href="https://eprint.iacr.org/2017/168">AES-GCM-SIV</a>, an
improvement of <a href="https://eprint.iacr.org/2015/102.pdf">GCM-SIV</a> designed for use in Google’s <a href="https://www.chromium.org/quic">QUIC</a> protocol. QUIC allows
clients to prove they’re actually using a particular IP address by issuing cryptographic
“source-address” tokens.</p>

<blockquote>
  <p>Since a central allocation system for nonces is not operationally viable, random selection of
nonces is the only possibility. AES-GCM’s limit of 2<sup>32</sup> random nonces (per key) suggests 
that, even if the system rotated these secret keys daily, it could not issue more than about 50K
tokens per second. However, in order to process DDoS attacks the system may need to sustain 
issuance of several hundred million per second.</p>
</blockquote>

<p>As a result, AES-GCM-SIV is designed with a few important properties:</p>

<ol>
  <li>For encryption, it uses AES, which is well-studied and widely supported in hardware thanks to
AES-NI.</li>
  <li>For authentication, it uses a MAC called POLYVAL, which is essentially GCM’s GHASH without the
byte swapping. As a result, it can leverage Intel’s <code class="language-plaintext highlighter-rouge">PCLMULQDQ</code> instruction to achieve speeds
only slightly slower than GCM, despite requiring two full passes on each message.</li>
  <li>Its security bounds degrade almost linearly with nonce reuse, improving on GCM-SIV’s quadratic 
degradation, and making it ideal for use in systems like QUIC where GCM’s per-key limits are
infeasible.</li>
  <li>Unlike constructions like <a href="https://cr.yp.to/snuffle/xsalsa-20081128.pdf">XSalsa20</a>, which protect against nonce reuse by increasing 
the nonce size, AES-GCM-SIV still uses 96-bit nonces. For a system like QUIC, which may peak at
10<sup>8</sup> operations per second, this matters. At that rate, XSalsa20 would require 9.6Gb/s 
more bandwidth.</li>
</ol>
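<p>That last bandwidth figure is just nonce-size arithmetic: XSalsa20’s 192-bit nonces carry 96 bits
more per message than a 96-bit nonce does. A sketch (the class name is mine):</p>

```java
class NonceBandwidth {
  public static void main(String[] args) {
    double opsPerSecond = 1e8;           // QUIC's peak rate, per the figure above
    double extraBitsPerNonce = 192 - 96; // XSalsa20 nonce minus a 96-bit nonce
    // 10^8 messages/sec * 96 extra bits = 9.6e9 bits/sec of pure nonce overhead.
    System.out.printf("%.1f Gb/s%n", opsPerSecond * extraBitsPerNonce / 1e9); // prints "9.6 Gb/s"
  }
}
```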

<p>AES-GCM-SIV is <a href="https://tools.ietf.org/html/draft-irtf-cfrg-gcmsiv-05">not yet an IETF standard</a>, but it’s in progress. Once
standardized, however, I can see it being the recommended construction for authenticated encryption.</p>

<p>As you may have guessed, I’ve written a <a href="https://github.com/codahale/aes-gcm-siv">Java library</a> which implements the most recent draft
specification for AES-GCM-SIV. It’s based on the AES and GHASH implementations in BouncyCastle, so
it doesn’t leverage the performance and security benefits of AES-NI and <code class="language-plaintext highlighter-rouge">PCLMULQDQ</code> instructions,
making it around twice as slow as AES-GCM. It’s still plenty fast (~40-50µs for a 1KiB message),
however.</p>

<h2 id="tldr">tl;dr</h2>

<p>AEADs have been a critical development in making symmetric encryption safe to use, but most standard
constructions were developed to be used in contexts with deterministic nonces, like TLS’s record
identifiers. In contexts with long-lived keys and where nonces must be randomly generated, the
Birthday Problem makes nonces much more likely to collide in systems performing many operations.
Should a nonce be reused, the security guarantees of most standard AEADs are null and void–an
attacker can recover plaintexts, forge messages, etc.</p>

<p>Cryptographers have been working on “nonce-misuse resistant” AEADs, which have security bounds which
degrade more gracefully when nonces are re-used. Of these, AES-GCM-SIV is the most promising and,
once standardized, will be a better choice than AES-GCM in contexts where nonces must be generated
randomly.</p>

<hr />

<h2 id="updated-june-10-2017">Updated June 10, 2017</h2>

<ul>
  <li>Fixed my assertions about ChaCha20-Poly1305 forgery attacks. Thanks,
<a href="https://twitter.com/dchest/status/863295115477094400">@dchest</a>!</li>
</ul>

<hr />

<p><em>Thanks to <a href="https://twitter.com/tqbf">Thomas Ptacek</a> for reviewing this post. Any mistakes in this
article are mine, not his.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which encryption's splash damage is somewhat nerfed.]]></summary></entry><entry><title type="html">usl4j And You</title><link href="https://codahale.com//usl4j-and-you/" rel="alternate" type="text/html" title="usl4j And You" /><published>2017-05-31T00:00:00+00:00</published><updated>2017-05-31T00:00:00+00:00</updated><id>https://codahale.com//usl4j-and-you</id><content type="html" xml:base="https://codahale.com//usl4j-and-you/"><![CDATA[<p>My friend Jeff wrote <a href="https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/">a thing</a> a while back which contained a cornucopia of truth, one of my
favorite bits being the following:</p>

<blockquote>
  <p>“It’s slow” is the hardest problem you’ll ever debug. “It’s slow” might mean one or more of the
number of systems involved in performing a user request is slow. It might mean one or more of the
parts of a pipeline of transformations across many machines is slow. “It’s slow” is hard, in part,
because the problem statement doesn’t provide many clues to location of the flaw. Partial 
failures, ones that don’t show up on the graphs you usually look at, are lurking in a dark corner. 
And, until the degradation becomes very obvious, you won’t receive as many resources (time, money,
and tooling) to solve it. Dapper and Zipkin were built for a reason.</p>
</blockquote>

<p>I’ve been thinking about this lately, and I think another factor which makes this such a hard
problem is that even if you have the ability to segment performance telemetry by space (i.e. which
subsystem is slow) it’s not guaranteed that doing so will actually find the problem. As with
performance optimization, if you don’t find any hotspots, the problem is often systemic, not local,
and as such requires a different set of tools to resolve.</p>

<p>In this post, I’ll introduce you to <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>, the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">Universal Scalability Law</a>, and
<a href="https://github.com/codahale/usl4j">usl4j</a>, a Java library for modeling system performance given a small set of real-world
measurements.</p>

<h2 id="littles-law">Little’s Law</h2>

<p>First, let’s go over <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>, a simple equation describing the behavior of
queues. It formally describes the relationship of queue size (\(N\)), throughput (\(X\)), and
latency (\(R\)):</p>

\[\begin{array}{rcl}
N &amp; = &amp; XR\\
X &amp; = &amp; N/R\\
R &amp; = &amp; N/X\\
\end{array}\]

<p>For example, consider a coffee shop which takes an average of 90 seconds to make an order
(\(R=90\)). A customer places an order, on average, every 60 seconds (\(X=1/60\)). The average
number of in-flight orders, therefore, is \(N=XR=1/60 \times 90 = 1.5\).</p>
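<p>The coffee shop arithmetic is one line in code; here’s a trivial sketch of all three forms of the
law (the class and method names are mine):</p>

```java
class LittlesLaw {
  static double queueSize(double throughput, double latency) { return throughput * latency; }
  static double throughput(double queueSize, double latency) { return queueSize / latency; }
  static double latency(double queueSize, double throughput) { return queueSize / throughput; }

  public static void main(String[] args) {
    double x = 1.0 / 60; // an order placed every 60 seconds
    double r = 90;       // 90 seconds to make an order
    System.out.printf("%.1f%n", queueSize(x, r)); // prints "1.5"
  }
}
```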

<p><a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> is incredibly helpful in terms of being able to predict systems’ behaviors by
modeling them as big-ass queues. A simple auto-scaling system, given a threshold latency of \(R\)
seconds, can monitor the number of concurrent requests \(N\) over a set of servers. When the number
of requests per second \(X\) begins to approach \(N/R\), the system can bring new servers online to
increase the capacity and stay under the threshold latency.</p>

<p>Given any two parameters, <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> allows us to derive the third, but what if we want to
predict a system’s behavior given only a single parameter?</p>

<h2 id="the-universal-scalability-law">The Universal Scalability Law</h2>

<p>The <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">Universal Scalability Law</a>, developed by <a href="http://www.perfdynamics.com/Bio/njg.html">Neil J. Gunther</a>, is a model which combines
<a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s Law</a> and <a href="https://en.wikipedia.org/wiki/Gustafson%27s_law">Gustafson’s Law</a> to produce a nonlinear model which can be used to
predict a system’s behavior:</p>

\[X(N) = \dfrac{\lambda N}{1+\sigma(N-1)+\kappa N(N-1)}\]

<p>It describes the expected throughput of a system (\(X\)) at a given level of concurrency (\(N\))
as the nonlinear relationship of three parameters:</p>

<ul>
  <li>\(\sigma\), the overhead of contention</li>
  <li>\(\kappa\), the overhead of crosstalk</li>
  <li>\(\lambda\), how fast the system operates in ideal conditions</li>
</ul>

<p>(For a more in-depth explanation, I highly recommend reading <a href="https://www.xaprb.com/">Baron Schwartz’s</a> excellent book,
<a href="https://www.vividcortex.com/resources/universal-scalability-law/">Practical Scalability Analysis with the Universal Scalability Law</a>.)</p>

<p>Given a set of <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> parameters \(\{\sigma,\kappa,\lambda\}\), we can use the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a>
equation to pin down one parameter of <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>, allowing us to make predictions given the
value of any single parameter:</p>

\[\begin{array}{rcc}
R(N) &amp; = &amp; \dfrac{1 + \sigma(N-1) + \kappa N(N-1)}{\lambda}\\
R(X) &amp; = &amp; \dfrac{-\sqrt{X^2 (\kappa^2 + 2 \kappa (\sigma - 2) + \sigma^2) + 2 \lambda X (\kappa - \sigma) + \lambda^2} + \kappa X + \lambda - \sigma X}{2 \kappa X^2}\\
X(R) &amp; = &amp; \dfrac{\kappa - \sigma + \sqrt{\sigma^2 + \kappa^2 + 2 \kappa (2 \lambda R + \sigma - 2)}}{2 \kappa R}\\
N(R) &amp; = &amp; \dfrac{\kappa - \sigma + \sqrt{\sigma^2 + \kappa^2 + 2 \kappa (2 \lambda R + \sigma - 2)}}{2 \kappa}\\
\end{array}\]

<p>We can even predict the maximum concurrency of a system:</p>

\[N_{max} = \left\lfloor \sqrt{\dfrac{1 - \sigma}{\kappa}} \right\rfloor\]
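<p>The closed forms above translate directly into code. Here’s a minimal sketch of \(X(N)\) and
\(N_{max}\) (the class is mine, and the parameter values are made up for illustration):</p>

```java
class UslModel {
  final double sigma, kappa, lambda; // contention, crosstalk, ideal speed

  UslModel(double sigma, double kappa, double lambda) {
    this.sigma = sigma;
    this.kappa = kappa;
    this.lambda = lambda;
  }

  // X(N): expected throughput at a given level of concurrency.
  double throughputAtConcurrency(double n) {
    return (lambda * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1));
  }

  // N_max: the concurrency at which throughput peaks.
  double maxConcurrency() {
    return Math.floor(Math.sqrt((1 - sigma) / kappa));
  }

  public static void main(String[] args) {
    UslModel model = new UslModel(0.02, 1e-4, 955); // illustrative values only
    System.out.printf("N_max = %.0f%n", model.maxConcurrency()); // prints "N_max = 98"
  }
}
```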

<p>But where do \(\sigma\), \(\kappa\), and \(\lambda\) come from? In order to determine the
<a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> parameters for a system, we must first gather a set of measurements of the system’s
behavior.</p>

<h2 id="building-a-model">Building a model</h2>

<p>These measurements must be of two of the three parameters of <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>: mean response time
(in seconds), throughput (in requests per second), and concurrency (i.e. the number of concurrent
clients).</p>

<p>Because response time tends to be a property of load (i.e. it rises as throughput or concurrency
rises), the dependent variable in our tests should be mean response time. This leaves either
throughput or concurrency as our independent variable, but thanks to <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> it doesn’t
matter which one we use. Because the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> is defined in terms of concurrency (\(N\)) and
throughput (\(X\)), it’s more straight-forward to keep these measurements in those terms.</p>

<p>For the purposes of discussion, let’s say we measure throughput as a function of the number of
concurrent clients sending requests as fast as they can. After our load testing is done, we should
have a set of measurements shaped like this:</p>

<table>
  <thead>
    <tr>
      <th>Concurrency</th>
      <th>Throughput</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>955.16</td>
    </tr>
    <tr>
      <td>2</td>
      <td>1878.91</td>
    </tr>
    <tr>
      <td>3</td>
      <td>2688.01</td>
    </tr>
    <tr>
      <td>4</td>
      <td>3548.68</td>
    </tr>
    <tr>
      <td>5</td>
      <td>4315.54</td>
    </tr>
    <tr>
      <td>6</td>
      <td>5130.43</td>
    </tr>
    <tr>
      <td>7</td>
      <td>5931.37</td>
    </tr>
    <tr>
      <td>8</td>
      <td>6531.08</td>
    </tr>
  </tbody>
</table>

<p>Next comes the hard part: we need to use a nonlinear solver to generate optimal coefficients for the
<a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> which fit these measurements.</p>

<h2 id="using-usl4j">Using usl4j</h2>

<p>Luckily for you, I wrote <a href="https://github.com/codahale/usl4j">usl4j</a>, a Java library which uses <a href="http://ddogleg.org/">DDogleg</a>’s
Levenberg-Marquardt least-squares optimizer to build a fully-parameterized <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> model given a
set of measurements:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">com.codahale.usl4j.Measurement</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.codahale.usl4j.Model</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>

<span class="kd">class</span> <span class="nc">Example</span> <span class="o">{</span>
  <span class="kt">void</span> <span class="nf">buildModel</span><span class="o">()</span> <span class="o">{</span>
    <span class="kt">double</span><span class="o">[][]</span> <span class="n">points</span> <span class="o">=</span> <span class="o">{</span> <span class="o">{</span><span class="mi">1</span><span class="o">,</span> <span class="mf">955.16</span><span class="o">},</span> <span class="o">{</span><span class="mi">2</span><span class="o">,</span> <span class="mf">1878.91</span><span class="o">},</span> <span class="o">{</span><span class="mi">3</span><span class="o">,</span> <span class="mf">2688.01</span><span class="o">}</span> <span class="o">};</span> <span class="c1">// etc.</span>
  
    <span class="c1">// Map the points to measurements of concurrency and </span>
    <span class="c1">// throughput, then build a model from them. </span>
    <span class="nc">Model</span> <span class="n">model</span> <span class="o">=</span> <span class="nc">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">points</span><span class="o">)</span>
                        <span class="o">.</span><span class="na">map</span><span class="o">(</span><span class="nc">Measurement</span><span class="o">.</span><span class="na">ofConcurrency</span><span class="o">()::</span><span class="n">andThroughput</span><span class="o">)</span>
                        <span class="o">.</span><span class="na">collect</span><span class="o">(</span><span class="nc">Model</span><span class="o">.</span><span class="na">toModel</span><span class="o">());</span>
    
    <span class="c1">// Predict the throughput for various levels of</span>
    <span class="c1">// possible concurrency.</span>
    <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">10</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">200</span><span class="o">;</span> <span class="n">i</span><span class="o">+=</span><span class="mi">10</span><span class="o">)</span> <span class="o">{</span>
      <span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">printf</span><span class="o">(</span><span class="s">"At %d concurrent clients, expect %f req/sec\n"</span><span class="o">,</span> 
        <span class="n">i</span><span class="o">,</span> <span class="n">model</span><span class="o">.</span><span class="na">throughputAtConcurrency</span><span class="o">(</span><span class="n">i</span><span class="o">));</span>
    <span class="o">}</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The resulting data looks something like this:</p>

<p><img src="/assets/img/usl.png" alt="Graph of actual throughput vs. USL-predicted values." /></p>

<p>usl4j allows you to calculate all of the parameters of <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>: \(N(X)\), \(N(R)\),
\(X(N)\), \(X(R)\), \(R(N)\), \(R(X)\), as well as \(N_\text{max}\) and \(X_\text{max}\).</p>

<h2 id="continuous-measurement">Continuous Measurement</h2>

<p>Because we can build <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> models using submaximal testing (i.e. not testing to overload), the
measurements we use aren’t necessarily restricted to test bench experiments. We can take real 
measurements from live systems and continuously build models from them, as building a model from a
small set of measurements is very fast (i.e. &lt;100µs).</p>

<p>Building <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> models in realtime can augment dashboards and alerting systems with answers to
critical questions:</p>

<ul>
  <li>How close is the system to the predicted maximum throughput or concurrency?</li>
  <li>How close are we to the point at which our latency will be over SLA?</li>
  <li>How much would adding another server help our latency?</li>
</ul>

<p>Building <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> models over historical data allows you to quantitatively answer key questions
about how the system has changed over time with regard to scalability:</p>

<ul>
  <li>A decreased \(\sigma\) value means decreased contention (e.g. better database lock scheduling or 
a new lock-free data structure).</li>
  <li>A decreased \(\kappa\) value means decreased crosstalk (e.g. removing false cache sharing, or 
reduced fanout in a distributed system).</li>
  <li>An increased \(\lambda\) value means the system has increased in unloaded performance (e.g. a 
new compiler optimization or runtime version).</li>
</ul>

<h2 id="usl-for-all">USL For All</h2>

<p>Because the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> has a strong physical basis, its parameters can indicate where effort is best
spent: increasing \(\sigma\), \(\kappa\), or \(\lambda\). Unlike existing observability tools,
the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> can allow you to pinpoint what kind of process is behind “it’s slow”, even if the
slowness isn’t limited to a subset of the system.</p>

<p>The <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> is a powerful, accessible tool for modeling the behavior of software systems, and
it’s my firm hope that making it easily automatable with <a href="https://github.com/codahale/usl4j">usl4j</a> leads to its adoption in
observability platforms.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which measurements and models are made.]]></summary></entry><entry><title type="html">Risky Business Requires Active Operators</title><link href="https://codahale.com//risky-business-requires-active-operators/" rel="alternate" type="text/html" title="Risky Business Requires Active Operators" /><published>2017-05-19T00:00:00+00:00</published><updated>2017-05-19T00:00:00+00:00</updated><id>https://codahale.com//risky-business-requires-active-operators</id><content type="html" xml:base="https://codahale.com//risky-business-requires-active-operators/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Feb 23, 2017.</p>

<p>It’s tempting, as developers, to consider automation an absolute good. We are, after all, paid to
automate work. But as the complexity of a system passes some hazy, ill-defined threshold, we need to
be very clear about the <strong>risks of automation</strong> in order to successfully and safely wield its power.</p>

<p><img src="/assets/img/terraform.jpg" alt="Ok Google, tell Siri to say “Alexa: file my taxes”." /></p>

<p>As the co-founder of Skyliner, a continuous delivery launch platform for AWS, I’m obviously
convinced of the virtues of automating huge swaths of traditionally manual administration tasks. I
mean, I helped build a product which spins up a full multi-environment AWS architecture at the press
of a button—so why am I counseling caution?</p>

<h2 id="you-had-one-job">You had one job</h2>

<p>It doesn’t take much for a system to become complex, especially when you take into account its human
operators. Unlike simple systems, which can often be analyzed in terms of failure probabilities, the
safety of complex systems must be considered holistically: that is, human operators included.</p>

<p>Automation, by removing human intervention from a task, removes the possibility of human error.
Unlike a human operator, a Roomba never gets bored of vacuuming your living room, never goes on
vacation, and never takes a sick day. But automation also removes the possibility of human error
correction. Unlike a human operator, a Roomba will diligently paint your floors with the unexpected
mess your new puppy left in the hallway. As a result, in order to build safe automation we need to
consider the role of human operators in keeping complex systems safe.</p>

<p><img src="/assets/img/wayminute.jpg" alt="An Uber self-driving SUV, self-driving like a jerk in downtown SF." /></p>

<h2 id="when-good-days-go-bad">When good days go bad</h2>

<p>It’s worth noting that complex systems aren’t homogenous blobs of complexity. Some components and
their activities are relatively straight-forward and some are more complex. When looking at change
boundaries and automation, we should focus on those with a higher degree of downside risk which
can’t be effectively hedged with automation.</p>

<p>For example, launching an EC2 instance has a small amount of downside risk—if it doesn’t come up
cleanly, you’re paying for a useless instance—which can be very effectively hedged by using
auto-scaling groups. In contrast, deploying a new version of your application to production has a
large amount of downside risk—if the new version contains a <a href="https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/">subtle but horrible
bug</a>, you lose
<a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf">money</a>—and this risk can’t always be hedged
via blue/green deploys, canary deploys, or other types of automation.</p>

<p>Around these kinds of change boundaries, where the downside risk is high and correctness can’t be
automatically verified, we must ensure that our systems keep human operators active. After all,
they’re the only ones who will be able to detect and
<a href="https://www.youtube.com/watch?v=pzzQ42D9Srw">respond</a> to such incidents to keep the system safe.</p>

<h2 id="on-the-care-and-feeding-of-human-operators">On the care and feeding of human operators</h2>

<p>An active human operator is one who:</p>

<ul>
  <li>is <strong>aware</strong> they are crossing a change boundary</li>
  <li><strong>understands</strong> doing so may cause an incident</li>
  <li>is <strong>prepared</strong> to respond should one occur</li>
</ul>

<p>In order to make human operators aware of change boundaries and their ambient risks, we must design
automation systems which require their active participation in order to perform some actions around
those boundaries. If a system is purely autonomous, any distinctions between safer and riskier
actions inside it are invisible to a human operator. Similarly, actions performed by human operators
should not traverse those change boundaries as an unintended consequence—they should be explicitly
labeled as such and not intermediated by other, unrelated actions. By making change boundaries
visible and explicit, we also help prepare human operators to respond to possible incidents. The
more proximal an action is to its effect, the easier it becomes for us to reason about the immediate
cause of an incident while it’s happening.</p>

<h2 id="a-combination-emergency-brakegas-pedal">A combination emergency brake/gas pedal</h2>

<p>For example, consider the relatively common practice of running pending database migrations as part
of a successful deploy. On the happy path, this is a huge time-saver, and removes the potential for
human error: never again will Gary forget to run his goddamn migrations! Off of the <a href="https://en.wikipedia.org/wiki/Happy_path">happy
path</a>, though, this type of automation can make incidents
more likely and harder to respond to.</p>

<p>To begin with, this combines the crossing of two risky change boundaries into a single action.
Database migrations, especially on large, live databases, are notorious for causing incidents. A
poorly-written migration can take milliseconds on a developer’s laptop, seconds in QA, and hours in
production, all while holding an exclusive lock and thrashing the disk.</p>

<p>Coupling these two actions not only increases the probability of failure, it also obscures their
connection, which slows down incident response: a deploy is centered around the application, not the
database, and a responder will be primed to look at potential bugs in the application before
considering the possibility of a problem with the migration.</p>

<p>Further, it complicates any automated responses to deploy failures in general. Any attempts to
restore the system to safety must take into consideration not only that the last deploy may have
modified the application’s external dependencies, but also that a migration may have left the database
in an indeterminate state, requiring manual intervention.</p>

<p><img src="/assets/img/ansible.jpg" alt="git commit -m &quot;Fix problems with puppy messes&quot;" /></p>

<h2 id="putting-the-pilot-in-auto-pilot">Putting the pilot in “auto-pilot”</h2>

<p>In designing Skyliner, we intentionally chose to make deploying changes to production an active
process. By default, Skyliner will build commits and deploy them to QA when you push them, but
promoting that build to production requires an actual human to push a button.</p>

<p>We also chose to avoid making that button press on your keyboard. We’re very familiar with command
line tooling for deploys, and while command line tooling is amazing for automation, it often does a
very poor job of actually engaging human operators in considering the surrounding context of an
action:</p>

<p><img src="/assets/img/s3-outage.jpg" alt="You said four zeros, right? No? Really? Well, you got four zeros." /></p>

<p>Unlike a purely automatic system or a command line tool, Skyliner places users in an environment
which primes both their expectations of what will happen when they push that button and their
situational awareness of events after they push that button:</p>

<p><img src="/assets/img/skyliner-screenshot.png" alt="This is my day job." /></p>

<p>Making changes to complex systems is inherently risky, and only some of that risk can be mitigated
by automation. Managing the remaining, necessary risk is the job of human operators like you and me.
Safe, successful automation must empower us, not relegate us to the sidelines.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which I peddle artisanally-curated, locally-sourced, farm-to-table change boundaries.]]></summary></entry><entry><title type="html">Virtual Machines Are Fleeting Things</title><link href="https://codahale.com//virtual-machines-are-fleeting-things/" rel="alternate" type="text/html" title="Virtual Machines Are Fleeting Things" /><published>2017-05-18T00:00:00+00:00</published><updated>2017-05-18T00:00:00+00:00</updated><id>https://codahale.com//virtual-machines-are-fleeting-things</id><content type="html" xml:base="https://codahale.com//virtual-machines-are-fleeting-things/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Feb 23, 2017.</p>

<p><img src="/assets/img/mono-no-aware.jpg" alt="Fine weather, isn't it?" /></p>

<p>Last month, AWS sent us an email regretfully informing us that the hardware running one of our
instances had begun to fail. They gave us two weeks to move our data, if we could, to a new
instance. I smiled, hit the archive button, and confidently went about the rest of my day. We’d long
since terminated that instance, and our software had already moved on. At other companies, an
instance retirement email means a day’s work for someone. My happy indifference to the loss of
instance <code class="language-plaintext highlighter-rouge">i-0b3baeac82b2b8461</code> was a direct result of a key design decision we made when building
Skyliner: Skyliner performs everything — builds and deploys — by launching new, impermanent EC2
instances. The result is a dramatically simpler system, and thus a more reliable and operable
system.</p>

<p>At the lowest level, deploying with fresh, purpose-specific instances means there is zero risk of
accidental state from the last deploy: no conflicting system packages, no mangled symlinks, no
zombie processes, no old configuration files. This radically reduces the path dependence of builds
and deploys, which in turn increases their reliability and operability. When a deploy fails, you
don’t need to consider the entire history of a server—you can be confident that the failure is
scoped to that deploy’s (hopefully small) diff.</p>

<p>The cumulative effect of this elimination of accidental state is a radically reduced divergence from
an ideal state over time. If an engineer manually logs into one of the servers to debug something,
whatever changes they may make—installing a package, modifying a configuration file, taking a heap
dump—automatically disappear after the next deploy, instead of when said engineer diligently
remembers to roll back all of the changes they made when trying to diagnose an outage at 3am. If a
developer forgets to delete their temporary files or debug logging is accidentally enabled for
everything, those files vanish after the next deploy. Like
<a href="https://en.wikipedia.org/wiki/Apoptosis">apoptosis</a>, continuously destroying and creating instances
makes it nearly impossible to accumulate error.</p>

<p>Instance impermanence is also a helpful invariant for steering applications themselves towards more
reliable architectures. The ability to depend on local storage as long-term persistence is a kind of
<a href="https://en.wikipedia.org/wiki/Affordance">false affordance</a>: it implicitly makes the promise to
application developers that <code class="language-plaintext highlighter-rouge">/tmp</code> is not, despite the name, actually temporary. A classic example
of this is web applications which store session data as temporary files (<em>e.g. Rails circa 2006</em>).
When working on and deploying to permanent servers, it’s easy for a developer (<em>e.g. me circa early
2006</em>) to build software on top of that false affordance of permanence (<em>e.g. authentication circa
early 2006</em>) with significant reliability and operability problems (<em>e.g. inode exhaustion from old
session files, logging out all users from cleaning up what I thought were old session files, kernel
locks and corrupted data from trying to store the sessions on an NFS server circa mid-2006</em>) which
would not have happened with an alternate architecture (<em>e.g. storing sessions in memcached circa
late 2006</em>). (It was a different time.)</p>

<p>Unsurprisingly, this invariant is also helpful in guiding infrastructure applications like Skyliner
towards simpler, more reliable architectures. In particular, it collapses provisioning, deployment,
and configuration management into a single, highly reliable pathway. Need more capacity? Launch new
instances. Want to deploy your changes? Launch new instances. Want to change the configuration?
Launch new instances. In systems with separate provisioning, deployment, and configuration
management processes, it’s common for the less frequently used processes to be slow, complex, and
unreliable: deploys are pleasant, installing a new package for an application takes half a day
wrestling with Puppet, and provisioning additional capacity is a full-on devops adaptation of
Voltaire’s <em>Candide</em>. By combining those pathways, we ensure that an improvement for one is an
improvement for all, and its constant operation gives us confidence in our ability to perform more
infrequent tasks like provisioning and configuration.</p>

<p>Finally, a focus on new instances means being able to take advantage of key AWS technologies like
autoscaling groups. Every Skyliner application is deployed in an autoscaling group, which makes
recovering from instance crashes or retirement automatic. Skyliner is itself a Skyliner application,
so when I read that email about poor instance <code class="language-plaintext highlighter-rouge">i-0b3baeac82b2b8461</code>, I could simply smile and get
back to work.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which the pain of attachment is avoided.]]></summary></entry><entry><title type="html">On The Difficulty Of Conjuring Up A Dryad</title><link href="https://codahale.com//on-the-difficulty-of-conjuring-up-a-dryad/" rel="alternate" type="text/html" title="On The Difficulty Of Conjuring Up A Dryad" /><published>2017-05-17T00:00:00+00:00</published><updated>2017-05-17T00:00:00+00:00</updated><id>https://codahale.com//on-the-difficulty-of-conjuring-up-a-dryad</id><content type="html" xml:base="https://codahale.com//on-the-difficulty-of-conjuring-up-a-dryad/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Nov 29, 2016.</p>

<p><img src="/assets/img/apollo-and-daphne.jpg" alt="Apollo &amp; Daphne / Veronese" /></p>

<p>When we started building Skyliner, our goal was to make deploys on AWS safe, reliable, and easy. To
accomplish this at scale, we made some key design and implementation decisions. In this post, I’ll
tell you what those decisions were, why they work, and how we built a system which is reliable
enough to even deploy itself.</p>

<hr />

<h2 id="use-a-finite-state-machine">Use A Finite-State Machine</h2>

<p>Our earliest major design decision was to model the Skyliner deploy process as a <a href="https://en.wikipedia.org/wiki/Finite-state_machine">Finite-State
Machine</a> (FSM), with transitions from one state
to another associated with specific conditions and actions. For example, a deploy in the
<code class="language-plaintext highlighter-rouge">rollout-wait</code> state will check the newly-launched instances of the deploy. If the instances are up
and running, the deploy is advanced via <code class="language-plaintext highlighter-rouge">rollout-ok</code> to the <code class="language-plaintext highlighter-rouge">evaluate-wait</code> state. If the instances
have failed to launch, the deploy is advanced via <code class="language-plaintext highlighter-rouge">rollout-failed</code> to the <code class="language-plaintext highlighter-rouge">rollback</code> state. If the
instances are still launching, the deploy is kept in the <code class="language-plaintext highlighter-rouge">rollout-wait</code> state via
<code class="language-plaintext highlighter-rouge">rollout-in-progress</code>.</p>

<p><img src="/assets/img/skyliner-deploy.png" alt="A Skyliner deploy." /></p>

<p>Using an FSM allows us not just to exhaustively determine that all possible states of a deploy are
handled, but also to decompose our own code into small, comprehensively-tested, state-specific
functions. It also allows us to extract state management as a first-class concern; unlike many deploy
tools which keep this state in memory, we store an append-only history of deploy states. As a
result, our code is reentrant: it can be interrupted at any time and safely resumed later. If one of
our servers crashes, another can seamlessly take its place with no disruption in service.</p>
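
<p>The transitions described above can be sketched as small, state-specific functions that advance
an append-only log of states. This is a minimal illustration, not Skyliner’s actual code: the state
and transition names come from this post, but the function shapes and arguments are assumed.</p>

```python
# A toy sketch of the deploy FSM described above. State and transition names
# are from the post; everything else is an illustrative assumption.

def check_rollout(instances_up: bool, still_launching: bool) -> tuple[str, str]:
    """Handler for the rollout-wait state: returns (transition, next state)."""
    if still_launching:
        return ("rollout-in-progress", "rollout-wait")  # stay where we are
    if instances_up:
        return ("rollout-ok", "evaluate-wait")
    return ("rollout-failed", "rollback")

def advance(history: list[str], instances_up: bool, still_launching: bool) -> list[str]:
    """Advance a deploy by one tick. The history is an append-only log of
    states, so an interrupted worker loses nothing and another worker can
    resume from the last recorded state."""
    if history[-1] == "rollout-wait":
        _, next_state = check_rollout(instances_up, still_launching)
        return history + [next_state]
    return history  # other states would have their own handlers
```

<p>Because <code class="language-plaintext highlighter-rouge">advance</code> is a pure function of the
recorded history plus the observed world, each state handler can be tested exhaustively in
isolation.</p>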

<h2 id="use-a-reliable-coordinator">Use A Reliable Coordinator</h2>

<p>Unlike some deploy tools which require a single “master” server to coordinate deploys, Skyliner uses
Amazon’s Simple Queue Service (SQS)—a highly-available, scalable, reliable message queue service—as
a distributed clock to advance each deploy’s state.</p>

<p>SQS has a very robust model for dealing with failures: when a consumer polls the server for a new
message, it specifies a visibility timeout — a period during which that message will not be visible
to other workers. If the consumer successfully processes the message, it deletes it from the queue.
If the consumer crashes, the visibility timeout elapses and the message becomes visible to another
consumer. Similarly, when sending a message to a queue, one can specify a delay — a period of time
during which that message will not be visible to any consumer. We use the delay and the visibility
timeouts to create “ticks” for deploys.</p>

<p>When a deploy is started, we send an SQS message with the deploy ID, environment, etc. using a delay
of e.g. 10 seconds. After 10 seconds, it becomes visible to a Skyliner background thread, which
receives it using a visibility timeout of e.g. 10 seconds. The thread looks up the deploy’s current
state and takes any appropriate action to advance it. If the deploy has finished, the thread deletes
the message from the queue. Otherwise, the message is left in the queue to reappear after another 10
seconds has passed.</p>

<p>If the instance handling the deploy crashes, the deploy’s message becomes visible after another 10
seconds, and another instance will receive it and advance the deploy.</p>
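
<p>The delay/visibility-timeout “tick” mechanism can be illustrated with an in-memory stand-in for
SQS. To be clear, this is a toy model built for this explanation only: Skyliner uses SQS itself, and
the <code class="language-plaintext highlighter-rouge">TickQueue</code> class, its manual clock, and its
method names are all invented here.</p>

```python
import heapq
import itertools

class TickQueue:
    """An in-memory toy imitating the two SQS behaviors described above:
    a send delay, and a visibility timeout that makes an un-deleted message
    reappear. `now` is a manual clock, advanced by hand for illustration."""

    def __init__(self):
        self._heap = []  # (visible_at, seq, body) tuples, ordered by time
        self._seq = itertools.count()
        self.now = 0

    def send(self, body, delay=10):
        # A delay hides the message from all consumers until now + delay.
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), body))

    def receive(self, visibility_timeout=10):
        # Nothing is visible yet: behave like an empty poll.
        if not self._heap or self._heap[0][0] > self.now:
            return None
        _, seq, body = heapq.heappop(self._heap)
        # The message stays "in flight": unless deleted, it becomes
        # visible again after the visibility timeout elapses.
        handle = (self.now + visibility_timeout, seq, body)
        heapq.heappush(self._heap, handle)
        return handle

    def delete(self, handle):
        self._heap.remove(handle)
        heapq.heapify(self._heap)
```

<p>A worker loop then does exactly what the post describes: receive a message, look up the deploy’s
state, advance it, and delete the message only if the deploy has finished; otherwise the message is
left alone and reappears after the visibility timeout, producing the next tick.</p>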

<h2 id="use-blue-green-deploys">Use Blue-Green Deploys</h2>

<p>Skyliner uses <a href="http://martinfowler.com/bliki/BlueGreenDeployment.html">blue-green deploys</a>. Instead
of modifying servers in-place and hoping nothing goes wrong, we leverage EC2’s elasticity and launch
an entirely new set of instances running the new version of your software. If the new version passes
healthchecks, we roll forward by terminating the old instances. If the new version doesn’t come up
cleanly, we roll back by terminating the new instances.</p>

<p>As a result, deploys on Skyliner are:</p>

<ol>
  <li>
    <p><strong>Reliable.</strong> Both rolling forward and backward require only the termination of EC2 instances—no
coordinating system package upgrades and downgrades, no modifying local Git checkouts or switching
symlinks.</p>
  </li>
  <li>
    <p><strong>Safe.</strong> At no point in time is your application’s capacity reduced, so there’s no need to wait
until a scheduled maintenance period to deploy.</p>
  </li>
  <li>
    <p><strong>Fast.</strong> Compared to other safe deploy strategies, Skyliner doesn’t have to reduce the rollout
 rate in order to maintain capacity. In a traditional data center, the set of servers an 
 application can run on is typically fixed, and in order to maintain a minimum capacity, deploys 
 are usually done in a “rolling” fashion, either making changes one server at a time or in small 
 batches. Consequently, rolling deploys take \(\mathcal{O}(\frac{N}{M})\) time for \(N\) total 
 servers and batches of size \(M\). Blue-green deploys, on the other hand, can perform all their 
 operations in parallel, and take \(\mathcal{O}(1)\) time—only as long as the worst-case time of any 
 single instance.</p>
  </li>
</ol>
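
<p>To make the asymptotics above concrete, here is the arithmetic with hypothetical numbers (the
fleet size, batch size, and per-batch time are all made up for illustration):</p>

```python
# Hypothetical fleet: 100 servers, rolled out in batches of 5,
# with each batch (or a single fresh instance) taking ~2 minutes.
n_servers, batch_size, minutes_per_batch = 100, 5, 2

# Rolling deploy: O(N/M) sequential batches.
rolling_minutes = (n_servers / batch_size) * minutes_per_batch

# Blue-green deploy: every new instance launches in parallel, so total
# time is bounded by the slowest single instance, independent of N.
blue_green_minutes = minutes_per_batch
```

<p>With these numbers the rolling deploy takes 40 minutes while the blue-green deploy takes about
2, and the gap only widens as the fleet grows.</p>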

<h2 id="skyliner-on-skyliner">Skyliner On Skyliner</h2>

<p>The end result of these decisions is that several times a day, we can use Skyliner to deploy a new
version of itself. When a build is finished, we use the old version to start the deploy, which
launches EC2 instances with the new version of Skyliner. For a brief moment, both are running
simultaneously until a background thread rolls the deploy forward and begins the termination of the
EC2 instances running the old version. As those shut down, the instances running the new version
oversee the cleanup of the deploy which brought them into existence.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which deploys are made boring.]]></summary></entry><entry><title type="html">The Happy Genius Of My Household</title><link href="https://codahale.com//the-happy-genius-of-my-household/" rel="alternate" type="text/html" title="The Happy Genius Of My Household" /><published>2017-05-16T00:00:00+00:00</published><updated>2017-05-16T00:00:00+00:00</updated><id>https://codahale.com//the-happy-genius-of-my-household</id><content type="html" xml:base="https://codahale.com//the-happy-genius-of-my-household/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Aug 29, 2016.</p>

<p>When viewed from one angle, we’re in the middle of an infrastructure software renaissance: a
cornucopia of distributed databases, self-healing meshes, and software-defined anythings, all
radiating with potential to help you get your own applications up and running. From a different
angle however, the average software developer resembles a modern Tantalus, ever-grasping at an
ever-retreating image of operability and reliability.</p>

<p>When we started work on Skyliner, we knew that in building a product that helps people quickly get
their applications up and running in the cloud we would have to make a number of hard architectural
decisions not just for ourselves but also for our customers. One of the biggest of these—and one I
remain very confident about—was the decision to avoid container platforms like Swarm and Kubernetes
entirely. We believe that by focusing on a single cloud provider—in our case, AWS—and tapping into
the economies of scale which make cloud computing so attractive, we can deliver a better application
platform for our customers. Here’s why.</p>

<h2 id="pick-a-cloud-any-cloud">Pick a cloud. Any cloud.</h2>

<p>To start, a bit of devops apostasy: the idea of multi-cloud architectures is, for the vast majority
of companies, a complete and total boondoggle.</p>

<p>The idea of jumping from Amazon to Microsoft to Google all nimbly-pimbly is an attractive one, and
for some very constrained workloads the financial upside of pricing arbitrage might even be worth
it. For the other 99% of the market, however, the reality of the situation is that a truly
multi-cloud architecture must either limit itself to the lowest common denominator of the supported
providers’ functionality or else foist on its users the responsibility for resolving the
inconsistencies in feature sets and behaviors (probably at 4am during an outage).</p>

<p>Historically, most companies have settled on a single provider, but with the rising hype about
containers some companies have decided to invest in systems which limit their requirements to the
multi-cloud lowest common denominator but provide a platform on top of that which is so universal
that the affordances and amenities that customers require could be built on top of it. In essence:
<em>what if the cloud was built on top of us</em>?</p>

<p>The potential upside of these container platforms is well-marketed but the costs are far less widely
acknowledged.</p>

<h2 id="you-only-get-one-hill-to-die-on-so-choose-wisely">You only get one hill to die on, so choose wisely.</h2>

<p>A viable production install of a container platform usually looks something like this. First, you
set up a clustered configuration service (e.g. Consul, Etcd, ZooKeeper). Second, you set up a
clustered scheduler service (e.g. Kubernetes, Swarm, etc.). Third, you set up a set of workers which
actually run your software. At every step of the process, you alone are responsible for all
operational tasks associated with keeping these systems running 24/7.</p>

<p>Once you manage to set all that up, you’ll only have a blank slate upon which you can begin the work
of building your own application platform. Log aggregation? Metrics collection? Databases? Backups?
Storage? Alerting? Load balancing? Key management? Service discovery? Authentication? You’ll need to
find solutions for all of these problems, set them up, integrate them, and then keep them running
for the rest of your business’s natural life. (Not that you need it all for the business, but once
you get locked into a serious container platform, the tendency is to push it as far as you can.)</p>

<p>Even if you cobble together a suitable application platform, you’ll still be on the wrong side of
the economic dynamics which make cloud computing so appealing. To wit: there is a large team of
people at Amazon whose job it is to keep the load balancers working 24/7, and because they operate
at a vast scale, they can offer this service for $16 a month per load balancer. That’s about 12
minutes per month of a software developer’s time. It’s unlikely that a jerry-built ball of Nginx and
HAProxy can beat that value proposition.</p>

<p>Unless you’re in the business of building platforms, it’s wise to consider the set of
responsibilities you already have—the features you need to ship, the bugs you need to fix, the
markets you need to address, the customers you need to help—and ask yourself if just participating
in your cloud provider’s economies of scale wouldn’t make more sense.</p>

<p>At Skyliner, we came to the conclusion that the point of running our software in the cloud was to
make life easier—not more exciting—and we think we’ve hit a fine balance in how to accomplish this.</p>

<h2 id="use-containers-not-too-much-mostly-for-packaging">Use containers. Not too much. Mostly for packaging.</h2>

<p>Skyliner uses containers purely as an application packaging system, allowing you to build a
container image on your laptop, build server, etc. and have a high degree of confidence that it can
be successfully deployed. Skyliner doesn’t use registries, scheduling, service discovery,
virtualized networking, or any other advanced features. Instead, we use AWS services with proven
reliability like S3, Autoscaling, and Elastic Load Balancing — services which have seen almost a
<a href="https://aws.amazon.com/blogs/aws/amazon_ec2_beta/">decade</a> of continuous use and improvement. Each
instance in your application’s autoscaling group downloads the container image from S3, loads it,
and runs it locally. It’s simple, it’s reliable, and it’s managed entirely by AWS.</p>

<p>While there’s a lot of interesting technology on the horizon, the point of Skyliner is to help you
run your applications today. Even if you’re excited by the potential of containers, you already have
software you need to run—your own—and you probably won’t be helped in that by adding additional
operational tasks to your long list of responsibilities.</p>

<h2 id="if-you-lived-here-youd-be-home-by-now">If you lived here, you’d be home by now.</h2>

<p>As part of building Skyliner, we actively sought out economies of scale and technical advantages in
AWS—places where Amazon teams provide far better value than a small company could on their own. For
example:</p>

<ul>
  <li>Elastic Load Balancing can offer superior service by virtue of the fact that Amazon controls the 
networking fabric in ways that Amazon customers cannot;</li>
  <li>Key Management Service is backed by cryptographic hardware but is orders of magnitude cheaper than 
going it alone;</li>
  <li>S3’s storage pricing was industry-changing when it launched a decade ago and it’s only gotten
cheaper;</li>
  <li>a t2.medium instance on EC2 is about half the cost of a c4.8xlarge instance when normalized for 
CPU and memory.</li>
</ul>

<p>With Skyliner, you can leverage the best parts of the existing AWS ecosystem without having to be an
expert. We do the curation for you. It’s our job to do the work of researching each service,
weighing its value proposition, selecting the winners, and connecting it all together. As a
customer, you get some amazing functionality directly out of the box:</p>

<ul>
  <li>A full, tamper-proof audit log of all AWS operations performed on your account via CloudTrail.</li>
  <li>Load balancers with HTTP/2, WebSockets, and a modern, managed TLS stack via ELB’s Application Load 
Balancers.</li>
  <li>Scalable log aggregation with search and alerting via CloudWatch Logs.</li>
  <li>Free SSL certificates which renew automatically via Amazon Certificate Manager.</li>
  <li>Secure configuration parameter storage via Key Management Service.</li>
  <li>Metric aggregation and alerting with CloudWatch.</li>
  <li>Resilience to data center outages via Autoscaling Groups and multi-availability zone Virtual 
Private Clouds.</li>
</ul>

<p>The core component of what we do at Skyliner is turn our expertise with AWS into integrated,
reliable features for you, and because Skyliner runs everything on your own AWS account, you’re not
limited to what we can do. Want to set up a managed database server with RDS? Want to use Kinesis
for stream processing? Redshift for big data analytics? SQS for background jobs? CloudFront for
faster web pages? Go ahead and set it up. Where possible, we add things like subnets and security
groups to make it easy for you to set things up yourself, even if we don’t automate their management
yet.</p>

<p>As software developers, we understand the allure of shiny new technologies, but ultimately we
decided that we prefer the quiet satisfaction of sleeping through a night while on call. After all,
we’re not just building a platform for our customers—we run our applications on Skyliner, too.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which clouds are considered.]]></summary></entry><entry><title type="html">On Containers</title><link href="https://codahale.com//on-containers/" rel="alternate" type="text/html" title="On Containers" /><published>2016-08-30T00:00:00+00:00</published><updated>2016-08-30T00:00:00+00:00</updated><id>https://codahale.com//on-containers</id><content type="html" xml:base="https://codahale.com//on-containers/"><![CDATA[<p>As you may or may not know, I founded a company
(<a href="https://www.skyliner.io">Skyliner</a>) with some friends at the beginning of the
year, focused on making it easier to set up and operate web applications on
<a href="https://aws.amazon.com">AWS</a>.</p>

<p>I just wrote an article <a href="https://blog.skyliner.io/the-happy-genius-of-my-household-2f76efba535a">about our approach to containers</a>
and the economics underlying that approach.</p>

<p>My co-founder <a href="http://mcfunley.com">Dan McKinley</a>–he of the “choose boring
technology” motto–wrote an excellent article
<a href="https://blog.skyliner.io/no-way-out-but-through-1db41c648697">on how Skyliner deploys your applications</a>
and why we built it like that.</p>

<p>My other co-founder, Marc Hedlund–previously the VP of Engineering at Stripe
and Etsy before that–needs to finish his blog post about why we’re not a
traditionally VC-funded company. <em>*ahem*</em></p>

<p>I have some thoughts about how Skyliner is the philosophical extension of my
work with projects like Dropwizard, but I’ll save them for later.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which I write a blog post somewhere else.]]></summary></entry><entry><title type="html">Fan-In</title><link href="https://codahale.com//fan-in/" rel="alternate" type="text/html" title="Fan-In" /><published>2016-02-01T00:00:00+00:00</published><updated>2016-02-01T00:00:00+00:00</updated><id>https://codahale.com//fan-in</id><content type="html" xml:base="https://codahale.com//fan-in/"><![CDATA[<p>For a long time, my go-to question when interviewing a candidate was about
building <a href="https://www.twitter.com">Twitter</a>. I was looking to see how they
thought about scalability and so, in classic programming interview cargo cult
style, I would try to lead the candidate down the road of designing a system
which could handle the pub/sub-y part of Twitter’s timeline.</p>

<p>(This was years ago, mind you, back when Twitter’s product design was simple
enough that telling engineers at Twitter how to do their jobs was a common
hobby.)</p>

<h2 id="the-bieber-problem">The Bieber Problem</h2>

<p>My question’s focus was the
<a href="http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html">Bieber problem</a>:
given that Mr. Bieber is now followed by over 75 <em>million</em> people, how do you
build a timeline system in which his adoring fans can, with low latency, see his
tweets? Candidates would toss something up on the board, I’d ask questions about
why they went with a particular approach or how they thought a particular bit
would handle increased load, and they’d toss more stuff up on the board. It was
great.</p>

<p>(Like almost all interview questions, it was ultimately just a vehicle for my
own prejudices and superstitions but it passed for clever at the time and no
one, including myself, noticed.)</p>

<h2 id="the-other-bieber-problem">The <em>Other</em> Bieber Problem</h2>

<p>Some years later, I described this question to a friend of mine over drinks at
one of the City bars where people argue about databases and eat tater tots. He
was a third-wave Twitter old blood, depending on how one counts his rings, and
he smiled and nodded and told me about the other Bieber problem.</p>

<blockquote>
  <p>Fan-out’s rough, but that’s mostly parallelizable. The real rough part is that
all of his fans reply.</p>
</blockquote>

<p>I also smiled and nodded and, thinking only of computers, commiserated: “Wow,
that sounds hard.” Like most things, I hadn’t actually thought it through.</p>

<h2 id="the-small-nice-thing">The Small, Nice Thing</h2>

<p>The other day I shared <a href="/a-small-nice-thing/">a small observation</a> I’d made about programming
in Clojure and compared it to a number of other languages I’ve used over the
years. Unlike some of the other things I’ve written, it was not an advocacy
piece. I had found a neat bit of Clojure where everything lined up
perfectly—like the congruence of azimuth and street which slots the sunset
neatly between skyscrapers on certain days—and wanted to share. I had hoped to
return to my long-neglected blog by publishing a quick observation about some
recent work of mine, the way I’d done back when I’d tried to explain
<a href="/what-makes-jersey-interesting-parameter-classes/">what I liked</a> about <a href="/what-makes-jersey-interesting-injection-providers/">JAX-RS</a> to folks in the Ruby community.</p>

<p>(Again, this was years ago, back when Rails was new and shiny enough that
telling rails-core how to do their job was a common hobby.)</p>

<p>And so I wrote and published a post comparing the pair-wise addition of
sequences of numbers in some of the languages with which I’ve worked. I had
hoped to shake my reluctance to write, but what happened instead confirmed the
dread I’d felt about writing at all.</p>

<p>Some folks—friends, mostly—were happy to see me return to writing, which was
nice to hear. A few—again, mostly friends—suggested more idiomatic
solutions, and I updated the post with their suggestions. Others—mostly
Clojure programmers—said they’d noticed the same thing or shared other little
serendipitous bits of code, which made me smile.</p>

<p>Most people, though, wanted to argue. In retrospect, it’s a classic bikeshed
setup—a simple bit of code about which everyone can have an opinion—and
that’s exactly what happened. Folks mentioned me to tell me how it could be done
in languages I haven’t used and don’t care about, to tell me how much better
statically-typed languages are (as if I hadn’t spent years programming in
statically-typed languages), to suggest that I cherry-picked this case in order
to better advocate for Clojure, to shit-talk the languages I’ve used,
&amp;c. &amp;c. &amp;c.</p>

<p>This wasn’t entirely unexpected—lord knows I’ve written some fairly incendiary
shit and lord knows I’ve argued on the internet about computers before—but
what took me by surprise was the degree to which these fairly garden-variety
responses fucked up my day.</p>

<p>In order to explain why, though, I need to back up a bit.</p>

<h2 id="the-arc-of-the-choral-universe">The Arc Of The Choral Universe</h2>

<p>At the dawn of time, when the earth was still new and everyone hosted their own
websites, responses to a piece of writing arrived as either emails or more
pieces of public writing. If your work moved someone to respond—agreement,
correction, outrage, whatever—they would either send you an email or publish
something themselves.</p>

<p>Later, when the animals had been named and everyone was running their blogs on
insecure bits of PHP, responses would arrive as comments on your post or
pingbacks to new posts on other insecure bits of PHP.</p>

<p>Later still, when people began to walk on two legs and everyone gathered in
central forums and link aggregators, responses would collect in long-running
threads as commenters battled for karma and other imaginary things.</p>

<p>Now, as the stars begin to dim and humans dip and swerve in flocks of social
media ephemera, responses are instantaneous and direct and physical, our nascent
haptic helpers tugging gently at our sleeves to let us know that someone,
somewhere, has had an opinion at us.</p>

<p>At each stage we gain something and we lose something. I miss what we’ve lost
most recently. When comments were emails and other blog posts, they could be
ignored. It’s a simple thing to add a filter to, e.g., mark all emails
containing the word “bcrypt” as read and move them to a folder one never
checks. Emails and web sites don’t come find you; they don’t interrupt
conversations of yours to interject their opinions; they don’t make your watch
subtly tap you on the wrist. They wait.</p>

<p>Twitter, though, is different. It’s <em>immediate</em>. There are no messages left
unread, no inbox, no filters, no delay, no curation. Tweets cause notifications,
which are instantly pushed to the devices you carry with you daily. Twitter’s
also <em>convex</em>. Everyone’s connected to everyone else, with only blocks and protected
accounts to hide behind. You can find anyone, <a href="http://www.tmz.com/2016/01/28/neil-degrasse-tyson-bob-earth-flat-beef/">join any conversation</a>,
<a href="http://www.buzzfeed.com/michaelblackmon/you-wear-cool-pants#.hd0qaY060">spectate any exchange</a>, <a href="https://broadly.vice.com/en_us/article/we-interviewed-the-youths-who-tweet-fuck-me-daddy-at-the-pope">say anything</a>. All conversations are
accessible, to the point that a tweet about someone which doesn’t notify a
referent isn’t a tweet—it’s a <em>subtweet</em>.</p>

<p>These are precisely the fishbowl dynamics which made me fall in love with
Twitter, but they also have a deeply sinister side.</p>

<h2 id="the-attention-lens">The Attention Lens</h2>

<p>It’s a common enough phenomenon: someone says something on Twitter, perhaps
about <a href="https://twitter.com/mallelis/status/691776454082859008">how cute sleepy dogs are</a>, which suggests a particular
reply. Someone sees the place where a joke would go, so
<a href="https://twitter.com/Marina_Berger/status/691776631384494080">they crack it</a>. Everyone laughs. Someone else sees the same empty place
where a joke would go, so <a href="https://twitter.com/lukedones/status/691776840441171968">they also make the joke</a>. Everyone
laughs. Another person makes <a href="https://twitter.com/_BVM/status/691777198295027713">the joke</a>, and <a href="https://twitter.com/esammer/status/691777252523114497">another</a>, and
<a href="https://twitter.com/jerrykuch/status/691778026317684736">another</a>, and <a href="https://twitter.com/gratuitous_arp/status/692209380826152960">another</a>.</p>

<p>(It’s a good joke, for what it’s worth.)</p>

<p>I’ve started thinking of this as an <em>attention lens</em>: small, human amounts of
individual attention are refracted through social media to converge on a single
person, producing the effect of infinite attention at the focal point. Even in
the event that everyone means well, the experience is <a href="https://www.youtube.com/watch?v=HPeattKV74A">surreal</a> for
the person at the focal point of the lens.</p>

<p>This dynamic, like most, is objectively worse for people who are not
stereotypically authoritative, which on the internet means everyone who isn’t a
white, straight, cis guy between 18 and 50 years old. The inhibitions to
interrupt are lowered, the desire to interject is higher, and as a consequence
the mentions pile in that much more. Attention lenses also underlie the
asymmetric dynamics of mass harassment on Twitter. Like an amplification DDoS
attack, each individual participant need only contribute a handful of messages
to flood the target’s mentions. Combine that with a small set of leaders
<a href="https://en.wikipedia.org/wiki/Thomas_Becket#Assassination">indirectly coordinating</a> the daily hate and you have the blueprints for
a fuckboy mention-laser capable of melting steel beams with collimated rays of
anime avatars.</p>

<p>Needless to say, this worries me.</p>

<h2 id="the-empty-space">The Empty Space</h2>

<p>It’s tempting to say that this is a social problem, to add “check someone’s
mentions before mentioning them” to the endless scroll of ignored netiquette, to
sigh and lament that harassment will always be with us, but that’s fundamentally
a cop-out. The problem is that we build social software with no affordances for
the limits of individual attention in the name of expedience, of engagement, of
the “marketplace of ideas”, of democracy itself.</p>

<p>The idea that we would not want to see something—to not <em>consume</em>
something—rankles the <a href="http://www.alamut.com/subj/ideologies/pessimism/califIdeo_I.html">Californian ideology</a>. While our giant
advertising companies perfect centralized spam-blocking technology, we debate
whether it’s ethical for individuals to <a href="http://www.npr.org/2015/09/21/442308407/apple-ignites-debate-over-ad-blocking-software">opt out of displaying ads</a> in
their browsers. (<em>Something there is that doesn’t love a wall,</em> it’s true, but
only <em>our</em> walls. <em>Theirs</em>, obviously, make good neighbors of us all.)</p>

<p>It’s patently true that <a href="http://danilocampos.com/2014/07/the-least-twitter-could-do/">Twitter could do more</a> but it’s also patently
true that this isn’t one of Twitter’s priorities. They’re an advertising company
with a sliding stock price, not a charity, and it doesn’t make business sense to
bolt on a bunch of power-user-only features which can only result in fewer
sponsored tweets hitting eyeballs. This is the kind of system I’ve let into my
life, though. This is who intermediates my interactions with most of my friends.</p>

<p>I don’t have a solution for this. It’s a set of tensions I think humanity will
have to resolve eventually, but it seems unlikely that boom/bust funding of
winner-take-all companies will produce systems that respect the limits of human
attention.</p>

<p>We’re all deeply in love with the possibilities of fan-out, but what’ll get us
in the end is the fan-in.</p>

<h2 id="updated-february-05-2016">Updated February 05, 2016</h2>

<p>Almost a year ago, <a href="http://natdudley.com/">Nat Dudley</a> wrote <a href="http://natdudley.com/blog/twitter-ux-and-bullying/">an insightful analysis</a> of
how Twitter’s UX leads to, as she puts it, “clusterfucks”:</p>

<blockquote>
  <p>We’re placing a huge expectation on individuals to strictly adhere to
behaviour that is in direct contrast to the behaviour Twitter’s design
encourages them to do.</p>
</blockquote>

<p>As you’d expect from a UX designer, it’s a much more focused and practical
article than this one and includes some suggested changes to help fix
things.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which attention is lensed.]]></summary></entry></feed>