<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://codahale.com//atom.xml" rel="self" type="application/atom+xml" /><link href="https://codahale.com//" rel="alternate" type="text/html" /><updated>2025-01-29T20:24:50+00:00</updated><id>https://codahale.com//atom.xml</id><title type="html">codahale.com</title><entry><title type="html">The Joy Of Duplexes</title><link href="https://codahale.com//the-joy-of-duplexes/" rel="alternate" type="text/html" title="The Joy Of Duplexes" /><published>2022-04-03T13:48:00+00:00</published><updated>2022-04-03T13:48:00+00:00</updated><id>https://codahale.com//the-joy-of-duplexes</id><content type="html" xml:base="https://codahale.com//the-joy-of-duplexes/"><![CDATA[<p class="message">This article is about cryptography, which is most commonly used in high-risk scenarios.
It’s an extremely subtle and complex subject and it’s one in which I’m absolutely not an expert.
The constructions, schemes, and protocols I discuss are not standardized, nor have they been analyzed by a professional
or anyone else who is assuming risk for whatever you might use them for.
The vibe of this post is “hey wow, look at these neat ideas” and not “everyone should use duplexes”.
Govern yourself accordingly.</p>

<p>I’ve been playing around with a little side project lately which makes heavy use of a relatively new type of
cryptographic primitive–the duplex–and I figured I should take the time to share what I’ve learned and some of the
pretty profound impacts the duplex can have on designing cryptographic protocols.</p>

<p>In this post, we’ll talk about permutations, sponge functions, and duplexes–what they are, how they’re built, what
makes them secure, and how they’re used.
We’ll use a sponge function to build our own hash function, calculate message authentication codes, generate
pseudo-random data, and make a stream cipher.
We’ll use a duplex to do authenticated encryption, make cryptographic transcripts, create digital signatures, and
build an authenticated public key encryption scheme (a.k.a. signcryption).
We’ll even design our own authenticated session encryption protocol with only a duplex and an elliptic curve.</p>

<p>In order to do all of this, though, we need to understand what a duplex is, which means we first need to understand what
a permutation is and how it’s used to build a sponge.
It’s a long post, so pack a lunch.
Let’s go.</p>

<h2 id="permutations">Permutations</h2>

<p>A permutation, as the name implies, shuffles things around: it’s a function which maps inputs to outputs so that no two
inputs produce the same output.
We pass it a fixed-size array of bits \(x\); it returns another fixed-size array \(y\).
The only way to get \(y\) back out is to pass in \(x\) again.</p>

<p>A good permutation from a cryptographic perspective is one which hides patterns in the input.
In technical terms, it should be hard to distinguish a permutation from a random oracle.
(A random oracle is a theoretical service which maps inputs to random outputs via an infinite secret lookup
table;
it’s impossible by definition to detect any pattern between inputs and outputs.)
Ideally, the only way to tell the difference between a permutation and a random oracle is by knowing exactly what the
permutation is and running the algorithm in reverse.</p>

<p>In that way, a block cipher with a fixed key is a kind of permutation: each unique plaintext block is mapped to a unique
ciphertext block.
(If it didn’t map them uniquely, you’d have some ciphertexts which decrypt to the wrong plaintext!)
A hash algorithm, on the other hand, is not a permutation: some inputs produce the same outputs via collision.
(This gets kind of important later, so keep it in mind.)</p>

<p>Finally, a good permutation should be constant-time: the operations it performs shouldn’t be dependent on the value of
the input block.
This means that an adversary can’t use <a href="https://codahale.com/a-lesson-in-timing-attacks/">timing attacks</a> to learn about
any of the data we’re passing to our permutation function.</p>

<p>Examples of permutation algorithms include <a href="https://keccak.team/keccakp.html">Keccak-<em>p</em></a> (used in
<a href="https://csrc.nist.gov/publications/detail/fips/202/final">SHA-3</a>),
<a href="https://keccak.team/xoodoo.html">Xoodoo</a> (used in <a href="https://keccak.team/xoodyak.html">Xoodyak</a>), and 
<a href="https://gimli.cr.yp.to">Gimli</a> (used in <a href="https://github.com/jedisct1/libhydrogen">libhydrogen</a>).</p>

<p>For now, imagine we have a permutation function \(f: \{0,1\}^b \rightarrow \{0,1\}^b\) which maps blocks of \(b\)
bits to other blocks of \(b\) bits in a way which makes it hard to tell what the input was.</p>
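<p>To make this concrete, here’s a sketch of such a function in Python using the splitmix64 finalizer–a known 64-bit bijection built from additions, XOR-shifts, and multiplications by odd constants. It is absolutely not cryptographically strong; it only illustrates the shape of the interface: every input maps to a unique output.</p>

```python
MASK = (1 << 64) - 1  # we work on fixed-size 64-bit blocks

def toy_permutation(x: int) -> int:
    """The splitmix64 finalizer: a 64-bit bijection (each step is invertible).
    A stand-in for a real permutation like Keccak-p; NOT secure."""
    x = (x + 0x9E3779B97F4A7C15) & MASK                 # addition mod 2^64 is invertible
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK   # xorshift + odd multiply: invertible
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
    return (x ^ (x >> 31)) & MASK

# No two inputs share an output: a permutation, not a hash.
assert len({toy_permutation(i) for i in range(100_000)}) == 100_000
```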

<p>Let’s use it to make a sponge.</p>

<h2 id="sponges">Sponges</h2>

<p>The sponge, in cryptography, is a function which <strong>absorbs</strong> an input string of arbitrary length and then <strong>squeezes</strong>
an output string of arbitrary length.
(Like a sponge.
Get it?)
It was originally defined by Guido Bertoni, Joan Daemen (co-inventor of AES), Michaël Peeters, and Gilles Van Assche
as <a href="https://keccak.team/files/SpongeFunctions.pdf">part of their design process of Keccak</a> for the NIST SHA-3
competition.
(Their design won the competition, making SHA-3 the most widely deployed sponge scheme to date.)</p>

<h3 id="sponge-as-function">Sponge As Function</h3>

<p>From the outside, a sponge is a function which takes an arbitrarily-long input value \(M\) and a non-negative integer
\(\ell\) and returns an \(\ell\)-bit value:</p>

\[\begin{gather*}
\text{Sponge}(M, \ell): \{0,1\}^* \times \mathbb{N} \rightarrow \{0,1\}^\ell
\end{gather*}\]

<p>The implementation of a sponge starts with our permutation function \(f\) and adds a <strong>state variable</strong>
\(S: \{0,1\}^b\) which is split into two sections: an outer section of \(r\) bits and an inner section of \(c=b-r\)
bits.
We call \(b\) the sponge’s <strong>width</strong>, \(r\) the sponge’s <strong>rate</strong>, and \(c\) the sponge’s <strong>capacity</strong>.</p>

<p>Finally, we add a <strong>padding function</strong>, \(P\), which maps a string \(M\) of arbitrary length to a sequence of \(r\)-bit
blocks.
(The usual choice appends a \(1\) bit, then \((-|M|-2) \bmod r\) \(0\) bits, then a final \(1\) bit, so the padded
message is a whole number of \(r\)-bit blocks.
Nothing fancy.)</p>
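<p>A byte-oriented sketch of that padding rule (simplified: this version always adds at least two whole bytes, where the real bit-level rule merges the two \(1\) bits when there’s only room for one byte of padding):</p>

```python
def pad(message: bytes, rate_bytes: int) -> bytes:
    """Simplified byte-level pad10*1: append 0x01, then zero bytes, then 0x80,
    so the result is a whole number of rate-sized blocks."""
    zeros = (-(len(message) + 2)) % rate_bytes
    return message + b"\x01" + b"\x00" * zeros + b"\x80"

padded = pad(b"hello", 8)
assert len(padded) % 8 == 0  # always a whole number of blocks
assert pad(b"", 8) == b"\x01" + b"\x00" * 6 + b"\x80"
```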

<p>Next, we’ll use these to build the two phases of a sponge function: \(\text{Absorb}\) and \(\text{Squeeze}\).</p>

<h3 id="absorbing-data">Absorbing Data</h3>

<p>We start with the sponge’s state \(S\) initialized to some zero value; it doesn’t really matter what as long as it’s a
constant:</p>

\[\begin{gather*}
S \gets \{0\}^b
\end{gather*}\]

<p>We take the input string \(M\) and turn it into a sequence of \(r\)-sized blocks with the padding function \(P\).
For each block \(M_i\), we update the outer section of \(S\) by XORing the two, then we run the entirety of \(S\)
through the permutation:</p>

\[\begin{gather*}
M_{0..n} \gets P(M) \\
 S_{0..r} \gets S_{0..r} \oplus M_0 \\
S \gets f(S) \\
\dots \\
S_{0..r} \gets S_{0..r} \oplus M_n \\
S \gets f(S) \\
\end{gather*}\]

<p>Notice that we’re only writing to \(S_{0..r}\), the outer section of \(S\), but we pass the whole of \(S\) to our
permutation function, making the final value of \(S\) dependent on both sections.</p>

<h3 id="squeezing-data">Squeezing Data</h3>

<p>Once we’re done absorbing data, we can squeeze an \(\ell\)-bit value \(Z\) out of the state.</p>

<p>If \(\ell \le r\), we return the first \(\ell\) bits of the outer section of \(S\):</p>

\[\begin{gather*}
Z \gets S_{0..\ell} \\
S \gets f(S)
\end{gather*}\]

<p>If \(\ell&gt;r\), we iterate the process, returning bits from \(S_{0..r}\) and updating \(S\) with our permutation:</p>

\[\begin{gather*}
Z_{0..r} \gets S_{0..r} \\
S \gets f(S) \\
Z_{r..\ell} \gets S_{0..\min{(r, \ell-r)}} \\
S \gets f(S) \\
\dots
\end{gather*}\]

<p>Again, notice that we’re only reading from outer section of \(S\), but each successive value of \(S\) depends on both
the outer and inner sections of the previous value.</p>

<p>The resulting construction looks like this (from 
<a href="https://keccak.team/files/SpongeDuplex.pdf"><em>Duplexing the sponge</em></a> by Bertoni et al.):</p>

<p><img src="/assets/img/sponge.png" alt="A visualization of the sponge construction" /></p>
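<p>Putting the two phases together, here’s a complete toy sponge in Python with \(b=64\), \(r=32\), and \(c=32\), using the splitmix64 mixing function as a stand-in permutation and a simplified byte-level padding rule. These parameters are far too small to be secure; the point is only to show the absorb/squeeze mechanics.</p>

```python
MASK = (1 << 64) - 1
RATE = 4  # bytes in the outer section; the other 4 bytes are the capacity

def permute(x: int) -> int:
    """splitmix64 finalizer: a 64-bit bijection standing in for Keccak-p.
    NOT cryptographically strong; for illustration only."""
    x = (x + 0x9E3779B97F4A7C15) & MASK
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
    return (x ^ (x >> 31)) & MASK

def pad(message: bytes) -> bytes:
    """Simplified byte-level pad10*1: 0x01, zero bytes, 0x80."""
    zeros = (-(len(message) + 2)) % RATE
    return message + b"\x01" + b"\x00" * zeros + b"\x80"

def sponge(message: bytes, out_len: int) -> bytes:
    state = 0  # S = 0^b; the low 4 bytes are the outer section
    # Absorb: XOR each padded block into the outer section, then permute.
    padded = pad(message)
    for i in range(0, len(padded), RATE):
        block = int.from_bytes(padded[i:i + RATE], "little")
        state = permute(state ^ block)
    # Squeeze: read the outer section, permuting between output blocks.
    out = b""
    while len(out) < out_len:
        out += (state & 0xFFFFFFFF).to_bytes(RATE, "little")
        state = permute(state)
    return out[:out_len]
```

Note that squeezing is incremental: an 8-byte output is a prefix of a 16-byte output for the same message.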

<h3 id="picking-parameters">Picking Parameters</h3>

<p>When absorbing or squeezing, we only directly write to and read from the outer section of the sponge’s state,
\(S_{0..r}\).
The inner section, \(S_{r..b}\), is only used as input for our permutation function \(f\), and the only thing we write
to it is the output of \(f\).
The inner section of \(S\) acts as a reservoir of bits that an adversary either has to hold constant or never gets to see,
depending on what we’re doing.
This means that the ratio of the sponge’s rate \(r\) to its capacity \(c\) is critical.</p>

<p>If our sponge’s capacity is too small relative to its rate, it becomes feasible for an adversary to use the output of a
sponge function to recreate the sponge’s entire state.
For example, a sponge with \(c=1\) would mean an adversary with an \(r\)-bit block of output would only need to try
\(2\) possible values in order to recover the full contents of \(S\), at which point they could re-create the sponge’s
previous states and produce arbitrary outputs.</p>

<p>If our sponge’s rate is too small relative to its capacity, its performance suffers due to the overhead of the
permutation function.
For example, a sponge with \(r=1\) would have to run the permutation function for each bit of input and output.</p>

<p>The designers of sponge functions pick their \(r\) and \(c\) values carefully to balance speed and security.
This will become clearer as we start building cryptographic operations with our sponge.</p>

<h3 id="message-digests">Message Digests</h3>

<p>The natural use for a sponge function is as a cryptographic hash function.
We pick a digest length, \(n\), and pass our message \(m\) to get back an \(n\)-bit digest:</p>

\[\begin{gather*}
d \gets \text{Sponge}(m, n)
\end{gather*}\]

<p>This is essentially how SHA-3 works.</p>
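<p>You don’t have to build your own to try this: Python’s <code class="language-plaintext highlighter-rouge">hashlib</code> ships the real thing. SHA3-256 is a Keccak sponge with \(b=1600\), \(r=1088\), and \(c=512\):</p>

```python
import hashlib

# SHA3-256: absorb the message, squeeze a 256-bit digest.
digest = hashlib.sha3_256(b"The Joy Of Duplexes").hexdigest()
assert len(digest) == 64  # 256 bits, hex-encoded

# The well-known answer for the empty message:
assert hashlib.sha3_256(b"").hexdigest() == (
    "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a"
)
```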

<p>Two things make this secure: first, the strength of our permutation function \(f\); second, the fact that the inner
section of \(S\) acts as a reservoir of unmodifiable bits.</p>

<p>Consider an adversary trying to find a message \(m\) which produces a fixed digest \(d\) (i.e. trying to break its
pre-image resistance) with a sponge function whose parameters are \(b=384\), \(r=256\), \(c=128\).
To keep it simple, we’ll imagine they’re trying to find a short \(m\) such that \(|m|≤r\) (i.e. the absorb phase 
consists of a single iteration of \(f\), and \(S\) is all zeroes to begin with).</p>

<p>Our adversary is therefore trying to find some state value
\(S'=P(m')||\{0\}^c\) such that \(f(S')\) begins with the digest \(d\).
If our permutation is strong–if it’s hard to distinguish between our permutation and a random oracle–our adversary is
stuck having to try all of the \(2^{128}\) possible values of the inner section of \(S\).</p>

<p>Note that a sponge function isn’t vulnerable to length-extension attacks, unlike Merkle–Damgård constructions
(e.g. MD5, SHA-1, SHA-256): given \(H(m)\), it’s not possible to easily calculate \(H(m||x)\).
A digest reveals at most the outer section of \(S\), so an adversary can’t reconstruct the state they’d need to keep
absorbing data; and because the input is always padded, \(H(m)\) is actually \(H(P(m))\), where
\(P(m||x) \not= P(m)||P(x)\).</p>

<h3 id="message-authentication-codes">Message Authentication Codes</h3>

<p>The fact that sponges aren’t vulnerable to length-extension attacks means we can securely extend our message digest
construction to calculate message authentication codes (MACs).
We concatenate the secret key \(k\) and message \(m\) into a single input and squeeze an \(n\)-bit authentication tag
\(t\):</p>

\[\begin{gather*}
t \gets \text{Sponge}(k||m, n)
\end{gather*}\]

<p>This is essentially how
<a href="https://www.nist.gov/publications/sha-3-derived-functions-cshake-kmac-tuplehash-and-parallelhash">cSHAKE and KMAC</a> work.</p>

<p>It superficially resembles the kind of naive MAC construction that
<a href="http://netifera.com/research/flickr_api_signature_forgery.pdf">gets people in trouble</a>, e.g. \(\text{MD5}(k||m)\), but
remember that the sponge, unlike most hash functions, isn’t vulnerable to length-extension attacks.
As long as \(k\) is constant size, it’s secure.</p>
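<p>A sketch of that construction with Python’s <code class="language-plaintext highlighter-rouge">hashlib</code> and a fixed-size key (in practice, reach for the standardized KMAC rather than this bare \(H(k||m)\) form):</p>

```python
import hashlib
import hmac  # for hmac.compare_digest, a constant-time comparison

def sponge_mac(key: bytes, message: bytes) -> bytes:
    # Sponge(k || m, n): safe with a sponge because there's no length extension.
    # The key must be a fixed size, or k || m becomes ambiguous.
    assert len(key) == 32
    return hashlib.sha3_256(key + message).digest()

def verify_mac(key: bytes, message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sponge_mac(key, message), tag)

key = bytes(32)  # demo key; use a random secret in practice
tag = sponge_mac(key, b"pay me $100")
assert verify_mac(key, b"pay me $100", tag)
assert not verify_mac(key, b"give Mal $5", tag)
```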

<p>(You can run into trouble if \(k\) is variable size.
Proper standards like KMAC pad the key to a constant size as well as add the output length and other domain separation
data to the input.
If you’re trying to create a MAC of multiple pieces of information, you have to ensure those pieces don’t overlap in
the input space, either by using an existing scheme like
<a href="https://www.nist.gov/publications/sha-3-derived-functions-cshake-kmac-tuplehash-and-parallelhash">TupleHash</a> or by
developing your own padding scheme.
More about that in a bit.)</p>

<p>Let’s look at what makes our sponge-based MAC function secure.
For simplicity’s sake, let’s assume \(|k||m|\) and \(t\) are both less than \(r\) (i.e. both absorb and squeeze phases
involve a single iteration of \(f\)).
Our MAC algorithm looks like this:</p>

\[\begin{gather*}
m_0 \gets P(k||m) \\
S_{0..r} \gets S_{0..r} \oplus m_0 \\
S \gets f(S) \\
t \gets S_{0..n}
\end{gather*}\]

<p>Consider an adversary who is trying to forge a tag \(t'\) for a message \(m'\) without knowing the key \(k\).
Even if they can see an arbitrary number of messages and tags created with \(k\), they’re still stuck trying to recreate
the inner section of \(S\) after the absorb phase.
If that’s big enough, then trying to forge a tag is infeasible.</p>

<h3 id="extendable-output-functions-xofs">Extendable-Output Functions (XOFs)</h3>

<p>So far we’ve only been using short outputs of the sponge, but it’s entirely possible to produce longer outputs.
Let’s say we need a lot of pseudo-random data we can derive from a high-entropy value \(x\).
We can pass \(x\) to our sponge function and squeeze as many bits as we need:</p>

\[\begin{gather*}
z \gets \text{Sponge}(x, n)
\end{gather*}\]

<p>Remember that the squeeze phase, when used to produce multiple blocks of output, looks like this:</p>

\[\begin{gather*}
z_0 \gets S_{0..r} \\
S \gets f(S) \\
z_1 \gets S_{0..r} \\
S \gets f(S) \\
z_2 \gets S_{0..r} \\
S \gets f(S) \\
\dots
\end{gather*}\]

<p>Each additional block of output can be calculated in an incremental fashion, meaning that it’s easy to build an
abstraction on top of this which produces an indefinite series of bits for a given input: an extendable-output function
(XOF).
How secure is it?</p>
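<p>SHAKE128 and SHAKE256–the XOFs standardized alongside SHA-3–are in <code class="language-plaintext highlighter-rouge">hashlib</code> too, and show the incremental property nicely: a shorter squeeze is a prefix of a longer one from the same input.</p>

```python
import hashlib

seed = b"a high-entropy seed value"
stream = hashlib.shake_256(seed).digest(1024)  # squeeze 1 KiB of output

# The squeeze is incremental: shorter outputs are prefixes of longer ones.
assert hashlib.shake_256(seed).digest(64) == stream[:64]
# A different seed produces an entirely different stream.
assert hashlib.shake_256(b"another seed").digest(1024) != stream
```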

<p>Consider an adversary who doesn’t know the seed value \(x\) but can see a single block of output \(z_i\).
What would they need to do to predict the next block \(z_{i+1}\)?
If the sponge’s capacity were zero, then it would be simple: \(z_{i+1}=f(z_i)\).
Likewise, an adversary trying to determine the previous block could simply calculate \(z_{i-1}=f^{-1}(z_i)\).
But a sponge with a large enough capacity makes it infeasible for an adversary to predict output without the seed value
\(x\).</p>

<p>Now, remember that \(f\) is a permutation and not a compression function–it’s shuffling bits around in \(S\) without
collision.
This is where that matters.
A compression function has some probability that \(f(S) = f(S')\).
As that function is iterated, that probability increases, and the number of possible values for \(S\) decreases,
reducing the security.
Thankfully, we’re using a permutation for \(f\), which means we’re not creating those internal collisions as we produce
output.
Thus, the security of our sponge stays constant relative to the strength of \(f\), allowing us to produce a long stream
of effectively pseudo-random bits.</p>

<h3 id="stream-ciphers">Stream Ciphers</h3>

<p>Since we’ve got an unbounded stream of pseudo-random bits, let’s go ahead and make a stream cipher.
We pass our sponge function a concatenated key \(k\) and nonce \(n\) and XOR our plaintext \(p\) with the output:</p>

\[\begin{gather*}
c \gets p \oplus \text{Sponge}(k||n, |p|)
\end{gather*}\]

<p>Decrypting is the same thing, only with the ciphertext \(c\) as input:</p>

\[\begin{gather*}
p \gets c \oplus \text{Sponge}(k||n, |c|)
\end{gather*}\]

<p>As with the XOF, this is secure if our permutation is strong and the sponge’s capacity is big enough.
An adversary who knows \(n\) but not \(k\) will be left trying to reconstruct the inner section of \(S\), unable to
reconstruct the output, and therefore unable to decrypt \(c\).
And because we can incrementally produce the keystream, we can encrypt streams of data without buffering.</p>
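<p>A sketch of that stream cipher with SHAKE256 standing in for our sponge (assuming fixed-size key and nonce, since bare concatenation is ambiguous otherwise):</p>

```python
import hashlib

def stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Sponge(k || n, |data|) as a keystream; XOR is its own inverse, so this
    both encrypts and decrypts. Unauthenticated: a sketch, not a real cipher."""
    assert len(key) == 32 and len(nonce) == 16  # fixed sizes keep k || n unambiguous
    keystream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))

key, nonce = bytes(32), bytes(16)  # demo values; use random secrets in practice
ciphertext = stream_xor(key, nonce, b"attack at dawn")
assert stream_xor(key, nonce, ciphertext) == b"attack at dawn"
```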

<p>Of course, stream ciphers by themselves are not very secure.
If we accidentally re-use a nonce, an adversary can calculate \(p_0 \oplus p_1\), which violates the confidentiality of
the plaintexts.
Also, an adversary can flip bits in the ciphertext to produce alternative plaintexts that we might not be able to detect
(e.g. <code class="language-plaintext highlighter-rouge">pay me $100</code> could turn into <code class="language-plaintext highlighter-rouge">give Mal $5</code>).
We <em>could</em> solve this problem with the sponge function, but it starts to get gnarly: once we’re done encrypting the
plaintext we’d need to pass the ciphertext to another sponge function to create a MAC, but then we’re talking about a
two-pass algorithm.</p>

<p>To elegantly solve this problem–and many, many others–we’ll need to graduate from the sponge and build a duplex.</p>

<h2 id="duplexes">Duplexes</h2>

<p>A duplex, like a sponge, has a permutation function \(f\), a rate \(r\), a capacity \(c\), a state variable \(S\)
divided into an \(r\)-bit outer section and a \(c\)-bit inner section, and a padding function \(P\).
The big difference is that a duplex is a stateful object, not a function–instead of a single absorb phase and a single
squeeze phase, it maintains its state across an arbitrary sequence of absorb and squeeze phases.</p>

<p>The duplex was invented by Bertoni, Daemen, Peeters, and Van Assche (again) as <a href="https://keccak.team/files/SpongeDuplex.pdf">part of their continued work on
permutation-based cryptography</a>.
Examples of duplexes include Mike Hamburg’s <a href="https://eprint.iacr.org/2017/003.pdf">STROBE</a> and Daemen et al.’s
<a href="https://tosc.iacr.org/index.php/ToSC/article/view/8618/8184">Xoodyak</a>, a finalist in NIST’s Lightweight Cryptography
competition. (Xoodyak’s my personal favorite.)</p>

<h3 id="duplex-as-object">Duplex As Object</h3>

<p>It’s helpful to consider the duplex as an object with an API:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(\sigma) \\
\text{Squeeze}(\ell): \mathbb{N} \rightarrow \{0,1\}^\ell
\end{gather*}\]

<p>\(\text{Init}\) creates a new duplex, initializing \(S\) with a zero value:</p>

\[\begin{gather*}
S \gets \{0\}^b
\end{gather*}\]

<p>\(\text{Absorb}(\sigma)\) updates the duplex’s state with an arbitrary string \(\sigma\) in exactly the same way as the
<a href="#absorbing-data">absorb phase of our sponge function</a>.</p>

<p>\(\text{Squeeze}(\ell)\) returns an \(\ell\)-bit string and updates the duplex’s state, iterating and truncating as
necessary just like the <a href="#squeezing-data">squeeze phase of our sponge function</a>.</p>

<p>Our \(\text{Sponge}(M, \ell)\) function, when mapped to a duplex, becomes this:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(M) \\
\text{Squeeze}(\ell)
\end{gather*}\]

<p>The duplex construction itself looks like this (from
<a href="https://keccak.team/files/SpongeDuplex.pdf"><em>Duplexing the sponge</em></a> by Bertoni et al.):</p>

<p><img src="/assets/img/duplex.png" alt="A visualization of the duplex construction" /></p>

<p>There are two major differences between a sponge function and a duplex.</p>

<p>First, a duplex’s inputs and outputs are non-commutative.
That is, \(\text{Absorb}(a)\) followed by \(\text{Absorb}(b)\) is not the same as \(\text{Absorb}(a||b)\), and
\(\text{Squeeze}(a)\) followed by \(\text{Squeeze}(b)\) is not the same as \(\text{Squeeze}(a+b)\).
With a duplex, those produce entirely different states and thus entirely different outputs.</p>

<p>Second, where the sponge is limited to a single absorb phase and a single squeeze phase, the duplex can alternate
between series of \(\text{Absorb}\) operations and \(\text{Squeeze}\) operations indefinitely.</p>

<p>These two changes enable us to build a frankly staggering array of cryptographic operations.</p>
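<p>Here’s a toy duplex-shaped object in Python. Instead of a real permutation over a partitioned state, it fakes the state update by hashing the whole state with SHA3-512 and emits only half of it–a sketch, not a real duplex–but it keeps the API and the two properties above: framed, non-commutative operations, and arbitrary interleaving of \(\text{Absorb}\) and \(\text{Squeeze}\).</p>

```python
import hashlib

class Duplex:
    """A duplex-shaped sketch: stateful Absorb/Squeeze. The SHA3-512 'state
    update' stands in for XORing padded blocks into a permutation state."""

    def __init__(self):
        self.state = bytes(64)  # the zero-initialized state S

    def absorb(self, data: bytes) -> None:
        # Frame each call so absorb(a); absorb(b) differs from absorb(a + b).
        framed = b"A" + len(data).to_bytes(8, "big") + data
        self.state = hashlib.sha3_512(self.state + framed).digest()

    def squeeze(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            self.state = hashlib.sha3_512(self.state + b"S").digest()
            out += self.state[:32]  # emit only the "outer" half of the state
        return out[:n]

# Non-commutative: two absorbs are not the same as one concatenated absorb.
a = Duplex(); a.absorb(b"foo"); a.absorb(b"bar")
b = Duplex(); b.absorb(b"foobar")
assert a.squeeze(16) != b.squeeze(16)
```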

<h3 id="authenticated-encryption">Authenticated Encryption</h3>

<p>Let’s start with our original motivation for the duplex: authenticated encryption.
The sponge gave us an XOF and thus a stream cipher, but in order to authenticate ciphertexts we’d need to use another
sponge function to create a MAC or something.
We can do better with the duplex.</p>

<p>We initialize a duplex, absorb a key \(k\) and nonce \(n\) and XOR our plaintext \(p\) with the output much like we did
with the sponge, but after that we absorb \(p\) and squeeze an \(m\)-bit authentication tag \(t\):</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(k) \\
\text{Absorb}(n) \\
c \gets p \oplus \text{Squeeze}(|p|) \\
\text{Absorb}(p) \\
t \gets \text{Squeeze}(m)
\end{gather*}\]

<p>Superficially, this looks a little like the
<a href="https://moxie.org/2011/12/13/the-cryptographic-doom-principle.html">dreaded</a> MAC-then-Encrypt construction because
we’re absorbing \(p\), then squeezing a tag \(t\).
The critical difference, however, is that \(t\) isn’t only derived from \(p\)–it’s derived from the same state which
produced the keystream to make \(c\).
By absorbing \(p\), we ensure that \(t\) seals both the plaintext \(p\) and the ciphertext \(c\).</p>

<p>We can see how this plays out when we decrypt \(c\):</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(k) \\
\text{Absorb}(n) \\
p' \gets c \oplus \text{Squeeze}(|c|) \\
\text{Absorb}(p') \\
t' \gets \text{Squeeze}(m) \\
t \stackrel{?}{=} t'
\end{gather*}\]

<p>If an adversary modifies \(c\), then \(p' \not= p\) and therefore \(t' \not= t\).</p>

<p>This can be extended to a full AEAD by absorbing the associated data \(d\) with \(\text{Absorb}(d)\) after absorbing the nonce.
In fact, if we’ve got more than one piece of data to authenticate, we can add more \(\text{Absorb}\) calls.
There’s no need to devise our own padding scheme here.
(<a href="https://codahale.com/a-lesson-in-timing-attacks/">You’ll still need to use a constant-time comparison for the authentication tag, though.</a>)</p>
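<p>The whole scheme fits in a few lines of Python. This sketch fakes the duplex by hashing the whole state with SHA3-512 and emitting half of it, so it illustrates the flow–key, nonce, keystream, absorb the plaintext, squeeze a tag–rather than being a real permutation-based AEAD:</p>

```python
import hashlib
import hmac  # for hmac.compare_digest, a constant-time comparison

class Duplex:
    """Duplex-shaped sketch: SHA3-512 stands in for the permutation."""
    def __init__(self):
        self.state = bytes(64)
    def absorb(self, data: bytes) -> None:
        framed = len(data).to_bytes(8, "big") + data
        self.state = hashlib.sha3_512(self.state + framed).digest()
    def squeeze(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            self.state = hashlib.sha3_512(self.state).digest()
            out += self.state[:32]
        return out[:n]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> tuple[bytes, bytes]:
    d = Duplex()
    d.absorb(key)
    d.absorb(nonce)
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, d.squeeze(len(plaintext))))
    d.absorb(plaintext)               # the tag depends on the plaintext AND
    return ciphertext, d.squeeze(16)  # the keyed state that made the keystream

def decrypt(key: bytes, nonce: bytes, ciphertext: bytes, tag: bytes) -> bytes:
    d = Duplex()
    d.absorb(key)
    d.absorb(nonce)
    plaintext = bytes(c ^ k for c, k in zip(ciphertext, d.squeeze(len(ciphertext))))
    d.absorb(plaintext)
    if not hmac.compare_digest(d.squeeze(16), tag):
        raise ValueError("invalid ciphertext")  # never release the plaintext
    return plaintext

key, nonce = bytes(32), bytes(16)  # demo values; use random secrets in practice
c, t = encrypt(key, nonce, b"pay me $100")
assert decrypt(key, nonce, c, t) == b"pay me $100"
```

Flipping a single bit of the ciphertext changes the recovered plaintext, which changes the transcript, which changes the tag: decryption fails closed.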

<p>It’s worth noting that here, unlike with the sponge function, we’re starting to build cryptographic constructions
which integrate the properties of multiple other constructions.
We begin with what looks like a stream cipher but then turns into something more like a MAC, producing both
ciphertext and tag with the same underlying primitive.
In contrast, the GCM mode (e.g. AES-128-GCM) combines a block cipher (e.g. AES), a counter mode of operation, GMAC
(a polynomial MAC algorithm), and a number of very specific padding functions to produce the same results.</p>

<h3 id="cryptographic-transcripts">Cryptographic Transcripts</h3>

<p>The fact that the \(\text{Absorb}\) operations aren’t commutative opens up some interesting possibilities.
For example, we can create a transcript of higher-level operations and have cryptographic assurances that the transcript
has not been altered or that another party performing the same actions hasn’t diverged from the same sequence.
That’s a pretty abstract concept, so let’s use the example of checking that two user records actually refer to the same user.
We initialize a duplex, absorb the pieces of data which identify the user, and finally squeeze a cryptographic
identifier:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(\texttt{user.email}) \\
\text{Absorb}(\texttt{user.name}) \\
\text{Absorb}(\texttt{user.phone\_number}) \\
I \gets \text{Squeeze}(n) \\
\end{gather*}\]

<p>This is similar to using a hash function
\(H(E||N||P)\), but because the duplex pads its inputs, it preserves each piece of data as an independent input.
We can even add semantic information to ensure domain separation:</p>

\[\begin{gather*}
\text{Absorb}(\texttt{"phone\_number"+user.phone\_number})
\end{gather*}\]

<p>This might be a little heavy-handed for user identifiers, but the ability to create a transcript of operations with
cryptographic assurance of their semantic integrity turns out to be incredibly useful.</p>
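<p>The difference is easy to demonstrate in Python. With plain concatenation into a hash, the boundaries between fields vanish; length-prefixing each field (a stand-in for the duplex’s per-\(\text{Absorb}\) framing) keeps them distinct:</p>

```python
import hashlib

# Plain concatenation is ambiguous: ("bob", "@example.com") and
# ("bob@", "example.com") hash to the same identifier.
assert (hashlib.sha3_256(b"bob" + b"@example.com").digest()
        == hashlib.sha3_256(b"bob@" + b"example.com").digest())

def transcript_id(*fields: bytes) -> bytes:
    """Length-prefix each field, mimicking the duplex's framed absorbs."""
    h = hashlib.sha3_256()
    for field in fields:
        h.update(len(field).to_bytes(8, "big") + field)
    return h.digest()

# Framed fields are preserved as independent inputs: no ambiguity.
assert transcript_id(b"bob", b"@example.com") != transcript_id(b"bob@", b"example.com")
```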

<h3 id="digital-signatures">Digital Signatures</h3>

<p>By combining our duplex with a well-defined elliptic curve like <a href="https://ristretto.group">ristretto255</a>, we can turn
a cryptographic transcript into a sophisticated digital signature.
For example, let’s create an <a href="https://datatracker.ietf.org/doc/html/rfc8032">EdDSA</a>-style
<a href="https://en.wikipedia.org/wiki/Schnorr_signature">Schnorr signature</a> of a message \(m\) given the signer’s key pair
\((d, Q)\).</p>

<h4 id="moon-math-primer">Moon Math Primer</h4>

<p>But first, we need to talk about elliptic curves.</p>

<p>Elliptic curves are awesome but don’t make a lot of intuitive sense to folks who are familiar with other cryptographic
things like hash functions and encryption algorithms, so here’s a quick primer on what’s in play.</p>

<p>A <em>scalar</em> is like a regular integer in that you can do arbitrary arithmetic with it.
Scalars are usually notated using lowercase variables like \(d\) and \(k\).
You can turn an arbitrary string of bits \(\{0,1\}^n\) into a scalar by reducing them to fit into the space of possible
scalar values:</p>

\[\{0,1\}^n \bmod \ell \in \{0 \dots \ell-1\}\]

<p>A <em>point</em> is a pair of coordinates identifying a point on the elliptic curve, notated using uppercase variables like \(Q\)
and \(G\).
\(G\) is the <em>generator point</em>, which is a point on the curve everyone agreed to use as a constant.
You can add points together and you can subtract one point from another, but you can’t multiply or divide them.</p>

<p>You can multiply a point \(Q\) by a scalar \(d\) (notated \([d]Q\)) by adding \(Q\) to itself \(d\) times.
Calculating \([x]Y\) is super cheap; going from \([x]Y\) to either \(x\) or \(Y\) is functionally impossible.</p>

<p>Private keys are scalars which are picked at random from the set of all possible scalars.
A public key \(Q\) for a private key \(d\) is calculated by adding the generator point \(G\) to itself \(d\) times:
\(Q=[d]G\).
Again, it’s easy to go from private key to public key; it’s effectively impossible to go from public key to private key.</p>

<p>A longer, gentler introduction to the subject
<a href="https://fangpenlin.com/posts/2019/10/07/elliptic-curve-cryptography-explained/">can be found here</a>.</p>
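<p>If curves feel too abstract, the same one-way structure shows up in the multiplicative group of integers mod a prime, which is easy to play with in Python. This is an analogy for the curve operations, not elliptic-curve code: <code class="language-plaintext highlighter-rouge">pow(g, d, p)</code> plays the role of \([d]G\), and the choice of \(p\) and \(g\) here is purely illustrative.</p>

```python
import secrets

p = 2**255 - 19   # a large prime (Curve25519's field prime, reused here as a modulus)
g = 2             # our stand-in for the generator point G

# Private keys are random scalars; public keys are cheap to derive
# via fast exponentiation...
d = secrets.randbelow(p - 1)
Q = pow(g, d, p)  # analogue of Q = [d]G

# ...but recovering d from Q is the discrete-logarithm problem: infeasible.

# Scalar multiplication commutes, which is what makes Diffie-Hellman work:
a, b = secrets.randbelow(p - 1), secrets.randbelow(p - 1)
assert pow(pow(g, a, p), b, p) == pow(pow(g, b, p), a, p)  # [a][b]G = [b][a]G
```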

<p>Ok, back to our digital signature scheme.</p>

<h4 id="signing-a-message">Signing A Message</h4>

<p>First, we initialize a duplex and use it to absorb the message \(m\) and the signer’s public key \(Q\):</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q) \\
\text{Absorb}(m)
\end{gather*}\]

<p>Next, we generate a random commitment scalar \(k\), then calculate and absorb the commitment point \(I\):</p>

\[\begin{gather*}
k \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
I \gets [k]G \\
\text{Absorb}(I)
\end{gather*}\]

<p>Finally, we squeeze a challenge scalar \(r\) and calculate the proof scalar \(s\):</p>

\[\begin{gather*}
r \gets \text{Squeeze}(512) \bmod \ell \\
s \gets dr+k
\end{gather*}\]

<p>The resulting signature is
\(I||s\).</p>

<h4 id="verifying-a-signature">Verifying A Signature</h4>

<p>The signature can be verified like this:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q) \\
\text{Absorb}(m) \\
\text{Absorb}(I) \\
r' \gets \text{Squeeze}(512) \bmod \ell \\
I' \gets [s]G - [r']Q \\
I \stackrel{?}{=} I'
\end{gather*}\]
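<p>Here’s the sign/verify pair as runnable Python, using a multiplicative group mod a prime in place of the curve (so \([k]G\) becomes <code class="language-plaintext highlighter-rouge">pow(g, k, p)</code>, and \([s]G - [r]Q\) becomes a multiply by a modular inverse) and a SHAKE256 transcript in place of the duplex. Every parameter and encoding choice is illustrative only:</p>

```python
import hashlib
import secrets

# Toy Schnorr signatures in a multiplicative group mod a prime, standing in
# for the elliptic-curve version. Illustrative only; not real curve code.
p = 2**255 - 19
g = 2
q = p - 1  # we reduce scalars mod the group order

def challenge(Q: int, m: bytes, I: int) -> int:
    """Stand-in for the duplex transcript: hash Q, m, and I to a scalar."""
    h = hashlib.shake_256()
    for part in (Q.to_bytes(32, "big"), m, I.to_bytes(32, "big")):
        h.update(len(part).to_bytes(8, "big") + part)
    return int.from_bytes(h.digest(64), "big") % q

def sign(d: int, Q: int, m: bytes) -> tuple[int, int]:
    k = secrets.randbelow(q)     # random commitment scalar
    I = pow(g, k, p)             # commitment "point": I = [k]G
    r = challenge(Q, m, I)       # challenge depends on Q, m, AND I
    s = (d * r + k) % q          # proof scalar
    return I, s

def verify(Q: int, m: bytes, I: int, s: int) -> bool:
    r = challenge(Q, m, I)
    # Check [s]G - [r]Q = I, i.e. g^s / Q^r = I (inverse via Fermat's little theorem).
    return pow(g, s, p) * pow(pow(Q, r, p), p - 2, p) % p == I

d = secrets.randbelow(q)         # signer's key pair (d, Q)
Q = pow(g, d, p)
I, s = sign(d, Q, b"a signed message")
assert verify(Q, b"a signed message", I, s)
assert not verify(Q, b"a forged message", I, s)
```

The math is the same as above: \(g^s \cdot Q^{-r} = g^{dr+k-dr} = g^k = I\).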

<p>Now, EdDSA signatures are entirely possible to create with regular hash algorithms (<a href="https://ed25519.cr.yp.to">Ed25519</a>
uses SHA-512), but by using a duplex we get a number of benefits.</p>

<p>For one, it allows us to easily establish cryptographic dependence between values.
For Schnorr signatures, we need the challenge scalar \(r\) to be dependent on all the inputs: \(Q\), \(m\), and \(I\).
If we don’t include the signer’s public key \(Q\), then it’s possible to find another public key which produces the
same signature for the same message.
If we don’t include the message \(m\), then what are we even signing?
Likewise, if we don’t include the commitment point \(I\), then it becomes possible to <a href="https://blog.trailofbits.com/2021/02/19/serving-up-zero-knowledge-proofs/">produce an infinite number of
alternate valid signature/public key pairs given a signed message</a>.</p>

<p>For two, we’re not limited to signing a single bitstring.
We can easily sign arbitrarily structured data as long as it can be traversed in a deterministic fashion.
A set of financial transactions, a complex mathematical structure, a conversation between services, whatever.
You can even use duplexes to <a href="https://merlin.cool">securely structure zero-knowledge proofs</a>.</p>

<p>For these more complex protocols, it’s helpful to think of the duplex as establishing a transcript of all the values
being used.
If we absorb all the inputs, the outputs will always be cryptographically dependent on them, giving adversaries far fewer
opportunities to modify them.</p>

<h3 id="signcryption">Signcryption</h3>

<p>Now that we’ve got encryption and digital signatures going, we can combine the two to produce a signcryption scheme,
which provides both confidentiality and authenticity in the public key setting.
Given a sender’s key pair \((d_S, Q_S)\), a receiver’s public key \(Q_R\), and a plaintext \(p\), we start by
initializing a duplex and absorbing everyone’s public keys:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q_S) \\
\text{Absorb}(Q_R)
\end{gather*}\]

<p>Next we generate an ephemeral key pair \((d_E, Q_E)\) and absorb the public key:</p>

\[\begin{gather*}
d_E \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
Q_E \gets [d_E]G \\
\text{Absorb}(Q_E)
\end{gather*}\]

<p>After that we calculate the <a href="https://en.wikipedia.org/wiki/Elliptic-curve_Diffie–Hellman">Elliptic Curve Diffie-Hellman</a>
(ECDH) shared secret \(K\) between the ephemeral private key \(d_E\) and the receiver’s public key \(Q_R\), and absorb
it, effectively using it as a symmetric key:</p>

\[\begin{gather*}
K \gets [d_E]Q_R \\
\text{Absorb}(K)
\end{gather*}\]

<p>Now that we’ve got a duplex keyed with a secret, we encrypt and absorb \(p\):</p>

\[\begin{gather*}
c \gets p \oplus \text{Squeeze}(|p|) \\
\text{Absorb}(p) \\
\end{gather*}\]

<p>At this point we have a duplex which is effectively a transcript of all the critical information we have: the sender’s
public key, the receiver’s public key, the ephemeral public key, the ECDH shared secret, and the plaintext.
If we squeezed an authentication tag here, we’d essentially have created
<a href="https://en.wikipedia.org/wiki/Integrated_Encryption_Scheme#Formal_description_of_ECIES">ECIES</a> using just two
cryptographic primitives.
But we can improve on that, so let’s integrate our Schnorr signature scheme:</p>

\[\begin{gather*}
k \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
I \gets [k]G \\
\text{Absorb}(I) \\
r \gets \text{Squeeze}(512) \bmod \ell \\
s \gets {d_S}r+k
\end{gather*}\]

<p>The final ciphertext is
\(C=Q_E||c||I||s\), which we can unsigncrypt (yes, that’s what they call it) like this:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q_S) \\
\text{Absorb}(Q_R) \\
\text{Absorb}(Q_E) \\
K \gets [d_R]Q_E \\
\text{Absorb}(K) \\
p' \gets c \oplus \text{Squeeze}(|c|) \\
\text{Absorb}(p') \\
\text{Absorb}(I) \\
r' \gets \text{Squeeze}(512) \bmod \ell \\
I' \gets [s]G - [r']Q_S \\
I \stackrel{?}{=} I'
\end{gather*}\]

<p>The receiver doesn’t know the ephemeral private key \(d_E\), so they calculate the ECDH shared secret as \([d_R]Q_E\)
because \([d_E]Q_R=[d_R]Q_E=[d_Ed_R]G\).
Likewise, they don’t know the sender’s private key \(d_S\), so they verify the signature using the sender’s public key \(Q_S\).</p>
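<p>Here’s the whole signcrypt/unsigncrypt flow as a runnable Python sketch. Two loud caveats: the “curve” is a tiny multiplicative subgroup mod a prime standing in for Ristretto255, and the duplex is faked with SHAKE-256 over a transcript. All the names (<code>ToyDuplex</code>, <code>signcrypt</code>, <code>unsigncrypt</code>) are invented for this post, and nothing here is remotely secure:</p>

```python
import hashlib
import secrets

# Toy group: the order-Q subgroup of Z_P* (P = 2Q + 1, both prime).
P, Q, G = 2039, 1019, 4

class ToyDuplex:
    """Fake duplex: Absorb/Squeeze via SHAKE-256 over a transcript."""
    def __init__(self):
        self.t = b""
    def absorb(self, data: bytes) -> None:
        self.t += len(data).to_bytes(8, "big") + data
    def squeeze(self, n: int) -> bytes:
        out = hashlib.shake_256(self.t).digest(n)
        self.absorb(b"sq" + out)
        return out

def enc(x: int) -> bytes:
    return x.to_bytes(2, "big")

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def challenge(d: ToyDuplex) -> int:
    return int.from_bytes(d.squeeze(8), "big") % Q  # r <- Squeeze mod l

def signcrypt(d_s: int, q_s: int, q_r: int, p_txt: bytes):
    d = ToyDuplex()
    d.absorb(enc(q_s)); d.absorb(enc(q_r))
    d_e = secrets.randbelow(Q - 1) + 1        # ephemeral key pair
    q_e = pow(G, d_e, P)
    d.absorb(enc(q_e))
    d.absorb(enc(pow(q_r, d_e, P)))           # K = [d_E]Q_R
    c = xor(p_txt, d.squeeze(len(p_txt)))     # encrypt...
    d.absorb(p_txt)                           # ...and absorb
    k = secrets.randbelow(Q - 1) + 1          # Schnorr commitment
    i = pow(G, k, P)
    d.absorb(enc(i))
    r = challenge(d)
    s = (d_s * r + k) % Q
    return q_e, c, i, s

def unsigncrypt(d_r: int, q_s: int, q_r: int, q_e: int, c, i, s):
    d = ToyDuplex()
    d.absorb(enc(q_s)); d.absorb(enc(q_r)); d.absorb(enc(q_e))
    d.absorb(enc(pow(q_e, d_r, P)))           # K = [d_R]Q_E
    p_txt = xor(c, d.squeeze(len(c)))
    d.absorb(p_txt)
    d.absorb(enc(i))
    r = challenge(d)
    i2 = (pow(G, s, P) * pow(q_s, Q - r, P)) % P  # I' = [s]G - [r']Q_S
    return p_txt if i2 == i else None

d_s = secrets.randbelow(Q - 1) + 1; q_s = pow(G, d_s, P)
d_r = secrets.randbelow(Q - 1) + 1; q_r = pow(G, d_r, P)
q_e, c, i, s = signcrypt(d_s, q_s, q_r, b"hello")
recovered = unsigncrypt(d_r, q_s, q_r, q_e, c, i, s)
```

<p>Because both sides drive the duplex through the same sequence of operations, the receiver derives the same keystream and the same challenge scalar, and the Schnorr check \(I'=[s]G-[r']Q_S\) closes the loop.</p>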

<p>The really cool thing is that our signcryption construction offers a much stronger degree of security than most because
using a duplex allows us to seamlessly integrate our encryption and digital signature algorithms.</p>

<h4 id="security-through-integration">Security Through Integration</h4>

<p>Because we’re using a duplex, our signcryption scheme is stronger than the generic schemes which combine separate
public key encryption and digital signature algorithms: Encrypt-Then-Sign (\(\mathcal{EtS}\)) and Sign-then-Encrypt
(\(\mathcal{StE}\)).</p>

<p>An adversary attacking an \(\mathcal{EtS}\) scheme can strip the signature from someone else’s encrypted message and
replace it with their own, potentially allowing them to trick the recipient into decrypting the message for them.
That’s possible because the signature is of the ciphertext itself, which the adversary knows.</p>

<p>With our scheme, on the other hand, the digital signature isn’t of the ciphertext alone.
A standard Schnorr signature scheme like Ed25519 derives the challenge scalar \(r\) from a hash of the signer’s
public key and the message being signed (i.e. the ciphertext).
Our challenge scalar, on the other hand, is derived from the duplex’s state, which depends on (among other things) the
ECDH shared secret \(K\) and the plaintext message \(p\).
Unless our adversary already knows \(K\) (the secret key that \(p\) is encrypted with) and \(p\) (the plaintext
they’re trying to trick someone into giving them), they can’t create their own signature (which they’re trying to do
in order to trick someone into giving them \(p\)).
You can see how that might be hard for them.</p>

<p>An adversary attacking an \(\mathcal{StE}\) scheme can decrypt a signed message sent to them and re-encrypt it for
someone else, potentially posing as the original sender.
For example, you send a sternly-worded letter to your manager, informing them that you hate your job and are quitting.
Your boss decrypts the message, keeps the signed plaintext, and several months later encrypts it with your new manager’s
public key and sends it to her.
If the message is sufficiently generic (e.g. <code class="language-plaintext highlighter-rouge">I hate this job and am quitting.</code>), it looks like an authentic message
from you to your new manager and you’re out of a job.</p>

<p>Our duplex-based scheme isn’t vulnerable to this kind of attack, thankfully, because, again, the challenge scalar \(r\)
is dependent on the duplex’s state, which in addition to the plaintext message \(p\) also includes the ECDH shared
secret \(K\) and all the public keys.
An adversary can duplicate the duplex’s state prior to the \(\text{Absorb}(K)\) operation, but no further.</p>

<h4 id="security-through-composition">Security Through Composition</h4>

<p>The fact that we can tightly integrate our encryption and digital signature algorithms without falling prey to the same
attacks as the generic constructions means we end up with a stronger scheme than even very modern standards, like
<a href="https://www.rfc-editor.org/rfc/rfc9180.html">HPKE</a>.
HPKE provides an authenticated scheme in the form of the \(\text{AuthEncap}\) KEM, but it’s vulnerable to a very sneaky
type of attack: key compromise impersonation.
<a href="https://eprint.iacr.org/2006/252.pdf">Key compromise impersonation</a> (KCI) occurs when a scheme allows an adversary who
compromises a user’s private key to forge messages from other users which the compromised user will believe are
authentic.</p>

<p>For example, let’s say our adversary compromises the victim’s private key \(d_V\).
The adversary can take an arbitrary public key–let’s say Beyoncé’s key, \(Q_B\)–and use it to calculate the shared
secret Beyoncé <em>would</em> use if she actually sent the victim a message: \([d_V]Q_B=[d_B]Q_V=[d_Bd_V]G\).
If the authenticity of the message depends entirely on only the sender and receiver knowing that shared secret, as with
HPKE or ECIES, the adversary could pose as Beyoncé to their victim, which would be an emotionally crushing kind of
catfishing.</p>

<p>Our signcryption scheme, on the other hand, is immune to these particular shenanigans.
The receiver of a message can verify the signature but cannot create one themselves; therefore, an adversary with the
same power as the receiver (i.e. with knowledge of their private key) cannot forge a signature from someone else.</p>

<h4 id="security-through-simplicity">Security Through Simplicity</h4>

<p>Finally, it’s worth stepping back and appreciating the parsimony of this approach.
It’s totally possible to build signcryption constructions out of traditional primitives like block ciphers, but the
degree of complexity (and the resulting risk of screwing something up) is dramatically higher.
For example, a design with similar properties might pass the ECDH shared secret point to HKDF-SHA2-512 and extract an
AES key, CTR nonce, and HMAC-SHA2-512 key, then use Ed25519 (which also requires SHA2-512) to sign (but not transmit)
the MAC of the ciphertext.
Like, I <em>think</em> that’d work?</p>

<p>But you’re talking about an implementation which requires AES, SHA2-512, the twisted Edwards form of Curve25519, and
probably the Montgomery form of Curve25519 (for X25519) as primitives, plus implementations of HKDF, HMAC, CTR mode,
and Ed25519.</p>

<p>Compare that to our construction, which requires a duplex (e.g. Xoodyak) and an elliptic curve (e.g. Ristretto255).
(Admittedly, Ristretto255 is built on the twisted Edwards form of Curve25519, but NIST’s P-256 would work, too.)</p>

<h3 id="session-encryption">Session Encryption</h3>

<p>As a final example of what you can do with a duplex, let’s talk about session encryption.
Imagine we’ve got two people that want to securely talk to each other over a (reasonably) reliable network (e.g. TCP).
They want to know that their conversation is confidential (i.e. no one can eavesdrop) and authentic (i.e. no one can
alter what’s said).
(Obviously these two should use TLS 1.3 and be done with it, but let’s see what we can do ourselves.)</p>

<p>We’ve got a sender (with a static key pair \((d_S, Q_S)\)) and a receiver (with a static key pair \((d_R, Q_R)\)).</p>

<p>The sender begins by generating an ephemeral key pair \((d_S^E, Q_S^E)\) and transmitting their ephemeral public key
\(Q_S^E\) to the receiver.
The receiver generates their own ephemeral key pair \((d_R^E, Q_R^E)\) and transmits their ephemeral public key
\(Q_R^E\) back to the sender.
They both initialize duplexes and absorb all of the public keys in the same order:</p>

\[\begin{gather*}
\text{Init}() \\
\text{Absorb}(Q_S) \\
\text{Absorb}(Q_S^E) \\
\text{Absorb}(Q_R) \\
\text{Absorb}(Q_R^E)
\end{gather*}\]

<p>Next, both sender and receiver calculate two ECDH shared secrets: \(K_0\), between the receiver’s ephemeral key pair
and the sender’s static key pair, and \(K_1\), between the sender’s ephemeral key pair and the receiver’s static key
pair.</p>

<p>On the sender’s side:</p>

\[\begin{gather*}
K_0 \gets [d_S^E]Q_R \\
K_1 \gets [d_S]Q_R^E \\
\text{Absorb}(K_0) \\
\text{Absorb}(K_1)
\end{gather*}\]

<p>On the receiver’s side:</p>

\[\begin{gather*}
K_0 \gets [d_R]Q_S^E \\
K_1 \gets [d_R^E]Q_S \\
\text{Absorb}(K_0) \\
\text{Absorb}(K_1)
\end{gather*}\]
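<p>To check that both parties really do arrive at the same duplex state, here’s a Python sketch of the key agreement. As before, the “curve” is a toy subgroup mod a small prime (standing in for Ristretto255), the duplex is faked with a transcript, and all the names are invented; nothing here is secure:</p>

```python
import secrets

P, Q, G = 2039, 1019, 4  # toy order-Q subgroup of Z_P*

class ToyDuplex:
    """Fake duplex: Absorb modeled as appending to a transcript."""
    def __init__(self):
        self.t = b""
    def absorb(self, data: bytes) -> None:
        self.t += len(data).to_bytes(8, "big") + data

def enc(x: int) -> bytes:
    return x.to_bytes(2, "big")

def keypair():
    d = secrets.randbelow(Q - 1) + 1
    return d, pow(G, d, P)

d_s, q_s = keypair()    # sender's static key pair
d_se, q_se = keypair()  # sender's ephemeral key pair
d_r, q_r = keypair()    # receiver's static key pair
d_re, q_re = keypair()  # receiver's ephemeral key pair

def keyed_duplex(k0: int, k1: int) -> ToyDuplex:
    """Absorb all four public keys, then K0 and K1, in a fixed order."""
    d = ToyDuplex()
    for x in (q_s, q_se, q_r, q_re, k0, k1):
        d.absorb(enc(x))
    return d

# Sender: K0 = [d_S^E]Q_R, K1 = [d_S]Q_R^E.
sender = keyed_duplex(pow(q_r, d_se, P), pow(q_re, d_s, P))
# Receiver: K0 = [d_R]Q_S^E, K1 = [d_R^E]Q_S.
receiver = keyed_duplex(pow(q_se, d_r, P), pow(q_s, d_re, P))
```

<p>Both computations of \(K_0\) equal \([d_S^Ed_R]G\) and both computations of \(K_1\) equal \([d_Sd_R^E]G\), so the two transcripts come out identical.</p>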

<p>Next, both sender and receiver clone their duplexes into <em>incoming</em> and <em>outgoing</em> duplexes (after all, a duplex is
just a state variable \(S\)).
Using their <em>outgoing</em> duplexes, they both create Schnorr signatures of the <em>outgoing</em> duplexes’ states.
For example, on the sender’s side:</p>

\[\begin{gather*}
k_S \stackrel{\$}{\gets} \{0,1\}^{512} \bmod \ell \\
I_S \gets [k_S]G \\
\text{Absorb}(I_S) \\
r_S \gets \text{Squeeze}(512) \bmod \ell \\
s_S \gets d_Sr_S+k_S
\end{gather*}\]

<p>The sender transmits their signature
\(I_S||s_S\) to the receiver and the receiver responds with their signature \(I_R||s_R\), at which point both parties
use their <em>incoming</em> duplexes to verify the signatures.
For example, on the receiver’s side:</p>

\[\begin{gather*}
\text{Absorb}(I_S) \\
r_S' \gets \text{Squeeze}(512) \bmod \ell \\
I_S' \gets [s_S]G - [r_S']Q_S \\
I_S \stackrel{?}{=} I_S'
\end{gather*}\]

<p>If the signatures are valid, the two parties have established a confidential and authentic channel of communication to
each other.
When the sender wants to send a message, they can use the <em>outgoing</em> duplex’s current state to encrypt the message and
generate an authentication tag:</p>

\[\begin{gather*}
c \gets p \oplus \text{Squeeze}(|p|) \\
\text{Absorb}(p) \\
t \gets \text{Squeeze}(m)
\end{gather*}\]

<p>Likewise, when they receive a message, they can use their <em>incoming</em> duplex to decrypt and authenticate:</p>

\[\begin{gather*}
p' \gets c \oplus \text{Squeeze}(|c|) \\
\text{Absorb}(p') \\
t' \gets \text{Squeeze}(m) \\
t \stackrel{?}{=} t'
\end{gather*}\]
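<p>Here’s a sketch of that send/receive loop in Python, starting from an already-shared post-handshake state. Again, the duplex is faked with SHAKE-256 over a transcript, <code>seal</code> and <code>open_</code> are names I made up, and nothing here is secure:</p>

```python
import hashlib

class ToyDuplex:
    """Fake duplex: Absorb/Squeeze via SHAKE-256 over a transcript."""
    def __init__(self, seed: bytes = b""):
        self.t = seed
    def absorb(self, data: bytes) -> None:
        self.t += len(data).to_bytes(8, "big") + data
    def squeeze(self, n: int) -> bytes:
        out = hashlib.shake_256(self.t).digest(n)
        self.absorb(b"sq" + out)
        return out

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def seal(d: ToyDuplex, p: bytes):
    """c = p XOR Squeeze(|p|); Absorb(p); t = Squeeze(m)."""
    c = xor(p, d.squeeze(len(p)))
    d.absorb(p)
    return c, d.squeeze(16)

def open_(d: ToyDuplex, c: bytes, t: bytes):
    p = xor(c, d.squeeze(len(c)))
    d.absorb(p)
    # A real implementation would compare tags in constant time.
    return p if d.squeeze(16) == t else None

# Stand-in for the state both parties share after the handshake.
out_d = ToyDuplex(b"post-handshake state")
inc_d = ToyDuplex(b"post-handshake state")

c1, t1 = seal(out_d, b"first message")
c2, t2 = seal(out_d, b"second message")
m1 = open_(inc_d, c1, t1)   # the states roll forward in lockstep
m2 = open_(inc_d, c2, t2)

# A tampered ciphertext fails authentication.
fresh = ToyDuplex(b"post-handshake state")
open_(fresh, c1, t1)
tampered = open_(fresh, bytes([c2[0] ^ 1]) + c2[1:], t2)
```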

<p>Instead of a static key, this effectively creates a session encryption system with a rolling key, with the duplex’s
state being cryptographically dependent on every prior operation.
In order to eavesdrop, an adversary would have to reconstruct the inner section of the duplex’s state.
If an adversary alters a message, the receiver’s state would diverge from the sender’s, making any communication
impossible.
Like our signcryption construction, the handshake provides what’s called “insider security”, in that it maintains
confidentiality if the sender is compromised and authenticity if the receiver is compromised.</p>

<p>Again, this is a toy protocol.
Our sender and receiver should use TLS.
But it’s worth noting that we created a toy mutually-authenticated key exchange with streaming, authenticated
encryption in what might be a couple of hundred lines of code, in a form that’s remarkably easy to document and
analyze, all using just two cryptographic primitives.</p>

<p>Neat, huh?</p>

<h2 id="tldr">tl;dr</h2>

<p>The duplex is a profoundly useful cryptographic primitive which allows us to build sophisticated cryptosystems in a 
robust and reliable way.
You can use a duplex to implement message digests, MACs, PRFs, stream ciphers, AEADs, and more using a single
cryptographic primitive whose security reduces entirely to the strength of a single permutation function.
Duplexes make it easier to build high-level constructions and protocols, but it’s such a new concept that relatively
little work has been done in terms of standardizing specific constructions or developing security proof techniques for
analyzing duplex-based protocols.
Hopefully everyone who makes it to the end of this overly-long article will be as enthusiastic as I am about the future
of the cryptographic duplex (and permutation-based cryptography, generally).
Here’s a <a href="https://www.youtube.com/watch?v=2R2gb0MKJlo">ten-hour video of some mountains</a> as a reward.</p>

<hr />

<h2 id="updated-april-03-2022">Updated April 03, 2022</h2>

<ul>
  <li>Clarified the definition of permutation. Thanks, <a href="https://lobste.rs/s/9ssv1p/joy_duplexes#c_dcvfes">quasi_qua_quasi</a>!</li>
</ul>

<hr />

<p><em>Thanks to various Fiascans and Jeff Hodges for reviewing this post.
Thanks to Niki Hale for suggesting putting a video at the bottom to reward diligent folks such as yourself.
Any mistakes in this article are mine, not theirs.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which integration brings simplification.]]></summary></entry><entry><title type="html">Work Is Work</title><link href="https://codahale.com//work-is-work/" rel="alternate" type="text/html" title="Work Is Work" /><published>2020-01-12T15:52:00+00:00</published><updated>2020-01-12T15:52:00+00:00</updated><id>https://codahale.com//work-is-work</id><content type="html" xml:base="https://codahale.com//work-is-work/"><![CDATA[<p class="message">Every time I’ve written or spoken about organizational design, I’ve regretted it.
There’s something about staking out a position on it which manages to prove me wrong a few years
later. But I’ve been having some long thinks about it again, and here’s what I’ve got. Strap the
fuck in.</p>

<p>At some point in time, every organization realizes that it’s slowing down. Features take longer to
ship, folks spend more and more time in meetings, and everyone gets panicky about estimation and
planning. If we’ve internalized essentialist bullshit about “B players hiring C players”, maybe we
get nervous about The Bar and hiring. If we’re <a href="https://www.washingtonpost.com/archive/opinions/1978/07/16/the-rituals-of-baseball/520382d5-2b17-4480-8574-442f380a6c00/">anxious and uncertain</a>, we might get
religion about Agile or Scrum or whatever. If we’re prone to modernist flights of fancy, we might
decide to spend our <a href="https://mcfunley.com/choose-boring-technology">innovation tokens</a> trying to disrupt the 200,000 year old industry of
getting humans to work together. If we’re the type to do our eight and hit the gate, we chalk it up
as the price of success.</p>

<p>These approaches rarely yield results.</p>

<h2 id="our-dogmatic-slumber">Our Dogmatic Slumber</h2>

<p>Most explanations of organizational success or failure are crap. <a href="https://en.wikipedia.org/wiki/Emic_and_etic">Emic</a> accounts–i.e., those
from within the organization–are limited to those concepts and narratives which exist within the
organization. They may structurally serve as self-narratives and foster a sense of in-group identity
and purpose, but because they come from the epistemological equivalent of a gravity well, their
explanatory power outside of that organization is usually terrible.</p>

<p>If we take the etic perspective–i.e. from outside the organization–we can see that emic
explanations for an organization’s success or failure have readily available counterfactuals in
other organizations. If agile, flat organizations, code reviews, monorepos, open offices, fancy type
systems, etc. were actually the causal factors they’re purported to be, then why do so many
organizations adopt those practices without success? Why are there other successful organizations
which lack those practices? How can we tell the difference between <em>cum hoc, ergo propter hoc</em>
just-so stories and actual causal factors?</p>

<p>More importantly, can we determine the <a href="https://plato.stanford.edu/entries/supervenience/">supervenience</a> of some set of factors on organizational
performance, not just in a particular context but across all possible organizations? That is, are
there necessary, <em>a priori</em> truths of organizational performance?</p>

<p>As it happens, there are.</p>

<h2 id="corporate-americas-next-top-model">Corporate America’s Next Top Model</h2>

<p>If you squint hard enough, <strong>an organization doing work is just an incredibly complex, dynamic,
distributed, parallel process</strong>. We have very good tools for understanding the rough outlines of how
those work, dating back at least to Menabrea’s 1842 sketch of Babbage’s Analytical
Engine. Now, I won’t pretend to have developed <a href="https://en.wikipedia.org/wiki/Psychohistory_(fictional)">psychohistory</a>, and given how difficult it
is to predict the behavior of even <a href="https://en.wikipedia.org/wiki/Three-body_problem">simple dynamic systems</a>, a fully predictive model of
organizational success certainly seems impossible.</p>

<p>But like Kant attempting to derive the necessary preconditions of subjective experience in
<a href="https://en.wikipedia.org/wiki/Critique_of_Pure_Reason"><em>Critique Of Pure Reason</em></a>, we can sketch out the <em>boundaries</em> of what an organization is
capable of and the dynamics of that as it grows. In the same way that the <em>a priori</em> knowledge that
ten pounds weighs more than five pounds informs how much shit we can expect to fit into a bag,
modeling organizations as parallel processes can inform the way we design them.</p>

<p>What happens inside those boundaries is a matter of execution and effort; what happens outside those
boundaries is impossible.</p>

<h2 id="the-ceiling-is-low">The Ceiling Is Low</h2>

<p><strong>The work capacity of an organization scales, at most, linearly as new members are added.</strong>
Each new member in an organization adds a constant number of possible work hours to the total
possible work hours of the company’s existing employees. <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s Law</a> states that given a
fixed task, a parallel solution utilizing \(N\) processors will run faster than a sequential
solution by <em>at most</em> a factor of \(N\).</p>

<p>As parallel resources are added, the total time spent in the parallelizable portion of the task
amortizes to zero; in contrast, the total time spent in the sequential portion of the task never
drops below a floor value. This is as true for a group of people trying to write software as it is
for a group of CPUs trying to model the behavior of stars in the galaxy. Our intuition tells us that
larger organizations do exhibit superlinear behaviors, but this literally cannot be the case if
hiring is the only variable in the equation. Therefore, our only hope for superlinear productivity
lies in changing the task which is being executed. Thankfully, work capacity is not the same as
productivity.</p>
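<p>Amdahl’s bound is easy to compute for yourself. A quick sketch (the function name is mine, not a standard API):</p>

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Maximum speedup of a fixed task with the given serial
    fraction when run on n parallel workers (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# A perfectly parallel task scales linearly with headcount...
print(amdahl_speedup(0.0, 10))  # 10.0
# ...but even a 10% sequential portion caps the speedup at 10x,
# no matter how many workers are added.
print(round(amdahl_speedup(0.1, 10), 2))      # 5.26
print(round(amdahl_speedup(0.1, 10_000), 2))  # 9.99
```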

<p>As an organization hires more employees, work on productivity improvements must be a constant
priority. Internal tooling, training, and services must be developed and fielded to ensure that all
members are able to work on problems of continuously increasing impact. <strong>The ceaseless pursuit of
force multipliers is the only possible route to superlinear productivity improvements as an
organization grows.</strong></p>

<p>Finally, it must be emphasized that this linear bound on work capacity is a ceiling, not a floor.
One cannot do better than linear, but one can certainly do worse. There are many other factors which
act as a drag on work capacity, and organization-wide improvements in productivity are critical in
mitigating them.</p>

<h2 id="the-floor-is-lava">The Floor Is Lava</h2>

<p><strong>Contention costs grow superlinearly as new members are added.</strong>
Parallel solutions to tasks are rarely perfectly concurrent (and indeed, such tasks are rightfully
called “embarrassingly parallel”), and often require some sequential critical sections. The line of
CPUs or people waiting to enter a critical section can be modeled as a queue, which allows us to use
queueing theory to understand how the queue’s cycle time changes as the queue size grows. If we
model the line for a sequential section as a \(G/G/1\) queue, which is to say, without making any
assertions about the arrival process or service time distribution but assuming a single queue server
(i.e. only one CPU or person can hold the lock), we arrive at <a href="https://en.wikipedia.org/wiki/Kingman%27s_formula">Kingman’s Formula</a> for the mean
wait time:</p>

\[\mathbb E(W_q) \approx \left( \frac{\rho}{1-\rho} \right) \left( \frac{c_a^2+c_s^2}{2}\right) \tau\]

<p>Notably, the wait time of a queue increases non-linearly with respect to \(\rho\) (utilization) and
quadratically with respect to \(c_a\) (the coefficient of variation for arrivals) and \(c_s\) (the
coefficient of variation for service times). (This is the quantified form of the intuition that
queues are either empty or overflowing.)</p>
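<p>Kingman’s formula is simple enough to play with directly. A quick sketch (again, the function name is mine):</p>

```python
def kingman_wait(rho: float, ca: float, cs: float, tau: float) -> float:
    """Approximate mean wait time in a G/G/1 queue: utilization rho,
    coefficients of variation ca (arrivals) and cs (service), and
    mean service time tau (Kingman's formula)."""
    return (rho / (1.0 - rho)) * ((ca ** 2 + cs ** 2) / 2.0) * tau

# With ca = cs = 1 and a one-hour mean service time, watch wait
# times explode as utilization climbs:
for rho in (0.5, 0.9, 0.99):
    print(rho, round(kingman_wait(rho, 1.0, 1.0, 1.0), 1))
```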

<p>The non-linearity of this should give us pause, as increasing the number of people contending for a
shared resource is the same thing as increasing \(\rho\). If contention on those resources is
unmanaged, organizational growth can result in catastrophic increases in wait time. At some point,
adding new members can cause the organization’s overall productivity to decrease instead of
increase, as the increase in wait time due to contention is greater than the increase in work
capacity. (This is the organizational version of the latency spikes we see as servers become
overloaded.)</p>

<p>These shared resources aren’t necessarily physical things, like bathrooms or printers; they can be
digital, like files in a source code repository or tickets in a bug tracker, or organizational, like
code reviews or work assignments. As with writing highly-concurrent applications, building
high-performing organizations requires a careful and continuous search for shared resources, and
developing explicit strategies for mitigating their impact on performance.</p>

<p>A commonly applied but rarely successful strategy is using external resources–e.g. consultants,
agencies, staff augmentation–as an end-run around contention on internal resources. While the
consultants can indeed move quickly in a low-contention environment, integrating their work product
back into the contended resources often has the effect of ballooning \(c_s\) (the variation of
service times, or how long a critical section is held). This produces a quadratic spike in wait
times which increases utilization which in turn produces a superlinear spike in wait times.
(Queueing theory is a harsh mistress.) Successful strategies for reducing contention include
increasing the number of instances of a shared resource (e.g., adding bathrooms as we add employees)
and developing stateless heuristics for coordinating access to shared resources (e.g., grouping
employees into teams).</p>

<p>As with heavily layered applications, the more distance between those designing the organization and
the work being done, the greater the risk of unmanaged points of contention. Top-down organizational
methods can lead to subdivisions which seem like parallel efforts when listed on a slide but which
are, in actuality, highly interdependent and interlocking. <strong>Staffing highly sequential efforts as
if they were entirely parallel leads to catastrophe.</strong></p>

<h2 id="hell-is-other-people">Hell Is Other People</h2>

<p><strong>Coherence costs grow quadratically as new members are added.</strong>
Working on complex tasks using parallel resources (or with a group of people) requires
communication. A group of \(3\) has \(3\) dyads; a group of \(4\) has \(6\); a group of \(5\) has
\(10\); a group of \(N\) people has \(\frac{N^2-N}{2}\) possible dyads. Point-to-point communication
(i.e., talking to each other) can be modeled as the activation of a subset of those dyads.</p>

<p>While some organizations are chattier than others, this communication is essential for the sharing
of information and the coordination of action. But it’s not free. Communication takes time. If the
relative percentage of people who need to talk to each other to get something done stays constant as
the organization grows (i.e., \(x\%\) of all dyads), <strong>the total time spent communicating will grow
quadratically as the work capacity of the organization grows linearly</strong>.</p>
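<p>The dyad count is worth computing to feel the quadratic growth (the function name is mine):</p>

```python
def dyads(n: int) -> int:
    """Number of possible communication pairs among n people."""
    return n * (n - 1) // 2

print([dyads(n) for n in (3, 4, 5)])  # [3, 6, 10]
# A tenfold increase in headcount yields a ~hundredfold increase
# in possible pairwise communication channels:
print(dyads(10), dyads(100))  # 45 4950
```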

<p>We can consider group meetings as a batching strategy to reduce the number of entities involved in
point-to-point communications, but the effectiveness of this strategy depends heavily on the
relative overlap of groups and the group structures. The degree to which the groups overlap is
essentially the same factor as the percentage of dyads required for communication. If the group
sizes are bounded, the growth of coherence costs will be reduced by a constant factor but still grow
quadratically. It may be tempting to try to punt on coherence and just ride dirty, but even subtle
forms of incoherence have <a href="https://www.holmesreport.com/latest/article/the-cost-of-poor-communications">massive business costs</a>. The only scalable strategy for containing
coherence costs is to limit the number of people an individual needs to talk to in order to do their
job to a constant factor.</p>

<p>In terms of organizational design, this means limiting both the types and numbers of consulted
constituencies in the organization’s process. Each additional person or group in a <a href="https://en.wikipedia.org/wiki/Responsibility_assignment_matrix">responsibility
assignment matrix</a> geometrically increases the area of that matrix. Each additional
responsibility assignment in that matrix geometrically increases the cost of organizational
coherence.</p>

<p>It’s also worth noting that these pair-wise communications don’t need to be formal, planned, or even
well-known in order to have costs. Neither your employee handbook nor your calendar are accurate
depictions of <a href="https://safetydifferently.com/the-varieties-of-human-work/">how work in the organization is done</a>. Unless your organization is staffed with
zombies, members of the organization will constantly be subverting standard operating procedure in
order to get actual work done. Even ants improvise. An accurate accounting of these hidden costs can
only be developed via an honest, blameless, and continuous end-to-end analysis of the work as it is
happening.</p>

<h2 id="principles-from-beyond-space-and-time">Principles From Beyond Space And Time</h2>

<h3 id="keep-the-work-parallel-the-groups-small-and-the-resources-local">Keep the work parallel, the groups small, and the resources local</h3>

<p>When presented with a set of problems which grow superlinearly intractable as \(N\) increases, our
best bet is to keep \(N\) small. If the organization’s intent is to increase value delivery by
hiring more people, work efforts <strong>must</strong> be as independent as possible. Leaders should develop
practices and processes to ensure that the work efforts which their strategies consider parallel are
actually parallel. Shared resources should be continuously managed for contention, and where
possible, the resources a group needs should be colocated with that group (e.g., if the work
involves a lot of design, staff a designer to that group). <a href="https://en.wikipedia.org/wiki/Combined_arms">Combined arms doctrine</a> isn’t just
for soldiers.</p>

<h3 id="prioritize-the-development-of-force-multipliers">Prioritize the development of force multipliers</h3>

<p>If an organization is largely working on the same types of problems it was in previous years, it’s
cause for concern. Teams dedicated to internal tooling should be staffed and given the explicit
direction of building tools and optimizing processes to help increase their co-workers’
productivity. If the percentage of the organization dedicated to improving how the organization
works begins to fall, ask yourself–have we hit a global or local maximum? Go long on high-leverage
tools, but stay grounded in whether or not they actually help.</p>

<h3 id="if-possible-factor-work-products-into-independent-modules-if-not-grow-slowly-and-optimize">If possible, factor work products into independent modules; if not, grow slowly and optimize</h3>

<p>If your work product–e.g. codebase, documents, etc.–can be factored into independent modules, do
so. The key word there is <em>independent</em>. Slicing your shit up into a hundred microservices will not
help you if everyone needs to change ten of them to get anything done. Some problems are not
particularly amenable to partition; these are also problems which do not benefit much from
additional workers. If the problem is a fixed point, look at ways of optimizing the sequential
portion of the work. Know that throwing bodies at that problem will produce a clusterfuck.</p>

<h3 id="scale-organizational-efforts-across-a-portfolio-of-synergistic-products">Scale organizational efforts across a portfolio of synergistic products</h3>

<p>Most smart businesses start out with a single product. They go long on their product hypothesis, put
their eggs in a single basket, and swing for the fences. If they’re lucky enough to get traction,
they double down on this. Over. And over. And over. At some point, they’ve got several battalions’
worth of people milling around, all trying to figure out who owns the turboencabulator UI and
whether or not the new marzelvanes will be fully antigravic by the big fall marketing push.</p>

<p>To avoid that, organization leaders should keep the development of a product portfolio as an
explicit goal. Feature or product ideas which are complementary to the organization’s overall
business strategy but don’t naturally coexist with the main product can be developed as separate
products by independent teams. We have evidence that <a href="https://cacm.acm.org/magazines/2020/1/241718-in-search-of-the-shortest-possible-schedule/fulltext">software development schedules can be shortened
by, at most, 25%</a>; it should be easy to pick between a single product in 18 weeks or two
products in 24 weeks.</p>

<p>Successful new products can be incrementally integrated with the existing products where it makes
sense, and tooling, libraries, and frameworks can be developed by force multiplier teams to reduce
both time-to-market of new products and the carrying costs of existing products. Unsuccessful new
products can be gracefully removed from market at dramatically lower cost than features of similar
complexity. After all, removing a feature from a product involves the same contention and coherence
costs as adding a feature. It’s rare for a feature to present carrying cost greater than the cost of
its removal, thus the prevalence of <a href="https://en.wikipedia.org/wiki/Ghost_ship">Flying Dutchman</a> features.</p>

<p>As a concrete example of the virtue of a product portfolio, imagine Amazon Web Services as a single
product, staffed by a hundred thousand doomed souls and fronted by a UI which is just a series of
buttons allowing you to provision and operate virtual machines, databases, data lakes, robotics
applications, augmented reality apps, IoT dinguses, and more. Such a creature would implode under
its own weight.</p>

<p>Instead, Amazon Web Services is a portfolio of synergistic products. EC2 has its own independent set
of features, developed and operated by an independent set of employees. When its needs can be
provided by another AWS product (e.g., storing virtual machine images on S3, or sending metrics to
CloudWatch), a cross-product integration is introduced. This product structure enables the highly
concurrent organizational structure that enables Amazon to field a <a href="https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Services">blistering number</a> of new
products every year while continuing to support and develop existing products. <a href="https://en.wikipedia.org/wiki/Amazon_SimpleDB">Failed
services</a> can be sunsetted or drawn down without disrupting the rest of the organization.</p>

<h3 id="keep-responsibility-assignment-matrices-small-sparse-and-local">Keep responsibility assignment matrices small, sparse, and local</h3>

<p>As an organization matures, ad-hoc roles are often developed into full teams. This specialization is
often critical for building internal economies of scale, but the formalization of new constituencies
should be kept in check. Each column in your responsibility assignment matrix expands the possible
set of required interactions geometrically; each assignment in the matrix is a coordination point
requiring waiting. Where a matrix indicates a high-touch relationship between two groups (e.g., a
group of engineers working on a feature and the lawyers trying to ensure the legal compliance of
that feature), efforts should be made to reduce the cost of that interaction by colocating their
members (e.g., embed a lawyer with the engineers).</p>

<h3 id="prioritize-asynchronous-information-distribution-over-synchronous">Prioritize asynchronous information distribution over synchronous</h3>

<p>A significant source of <a href="https://en.wikipedia.org/wiki/Lean_services#Value_Demand_and_Failure_Demand">failure demand</a> for meetings and status updates is the desire of
organizational leaders to keep abreast of who’s doing what. This situational awareness is indeed
important, but trying to maintain it by calling meetings, messaging people on Slack, and catching
people in the hallways is a significant systemic drag on organizational productivity.</p>

<p>A better model for staying informed of developments as the organization scales is for groups to
publish status updates as part of the regular cadence of their work. Leaders can asynchronously read
these updates and, should the need arise, initiate additional, synchronous conversation to ask
questions, provide feedback, etc.</p>

<p>Synchronous meetings should be reserved for low-latency collaboration on complex issues; likewise,
collaboration should be reserved for synchronous meetings.</p>

<h3 id="what-happens-inside-the-boundaries-is-important">What happens inside the boundaries is important</h3>

<p>That we know some of the boundaries of organizational performance and their dynamics doesn’t excuse
us from using our empathy to build humane organizations. Companies are groups of people being
compensated for having to spend some of their finite lifetimes not being with their partners,
children, pets, or super weird hobbies. They deserve to be members of organizations which honor that
time by ensuring that their work has value and meaning. There is no mathematical model to guide us
to that goal.</p>

<hr />

<p><em>Thanks to various Fiascans for reviewing this post. Any mistakes in this article are mine, not
theirs.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which returns diminish.]]></summary></entry><entry><title type="html">Towards A Safer Footgun</title><link href="https://codahale.com//towards-a-safer-footgun/" rel="alternate" type="text/html" title="Towards A Safer Footgun" /><published>2017-06-08T00:00:00+00:00</published><updated>2017-06-08T00:00:00+00:00</updated><id>https://codahale.com//towards-a-safer-footgun</id><content type="html" xml:base="https://codahale.com//towards-a-safer-footgun/"><![CDATA[<p>Modern symmetric encryption is built around <a href="https://tools.ietf.org/html/rfc5116">Authenticated Encryption with Associated
Data</a> (AEAD) constructions: combinations of ciphers and message authentication codes which
provide strong guarantees of both <em>confidentiality</em> and <em>integrity</em>. These constructions avoid the
<a href="https://moxie.org/blog/the-cryptographic-doom-principle/">“doom principle”</a> which made so many older cryptosystems vulnerable to online attacks, but many
of the standard AEAD constructions have problems of their own.</p>

<h2 id="whats-an-aead">What’s an AEAD?</h2>

<p>An AEAD is essentially a pair of functions:</p>

\[\begin{array}{rccc}
\operatorname{Encrypt}\colon &amp; (K, N, M, D) &amp; \to &amp; (C, T)\\
\operatorname{Decrypt}\colon &amp; (K, N, C, T, D) &amp; \to &amp; (M) \coprod \varnothing\\
\end{array}\]

<p>Given a key \(K\), a nonce \(N\), a message \(M\), and some associated data \(D\), the
\(\operatorname{Encrypt}\) function returns a ciphertext \(C\) and an authentication tag \(T\). (The
associated data is authenticated but not encrypted.) Passing the same key, nonce, data, ciphertext, and tag to
the \(\operatorname{Decrypt}\) function will return the original message \(M\). If <em>any</em> bit of the
key, nonce, data, ciphertext, or tag are altered, however, \(\operatorname{Decrypt}\) will return
nothing (or an error).</p>
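<p>As a concrete sketch of that interface, here’s a minimal seal/open pair built on the JDK’s
AES-GCM implementation. (The class and method names here are mine; this illustrates the AEAD shape,
not a recommendation of a particular construction.)</p>

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class AeadExample {
  // Encrypt(K, N, M, D) -> (C, T): the JDK appends the 128-bit tag T to the ciphertext C.
  static byte[] seal(byte[] key, byte[] nonce, byte[] message, byte[] data)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
        new GCMParameterSpec(128, nonce));
    cipher.updateAAD(data); // D is authenticated but not encrypted
    return cipher.doFinal(message);
  }

  // Decrypt(K, N, C, T, D) -> M, or an AEADBadTagException if any input was altered.
  static byte[] open(byte[] key, byte[] nonce, byte[] ciphertextAndTag, byte[] data)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
        new GCMParameterSpec(128, nonce));
    cipher.updateAAD(data);
    return cipher.doFinal(ciphertextAndTag);
  }
}
```

<p>Flipping a single bit of the key, nonce, ciphertext, tag, or data makes
<code class="language-plaintext highlighter-rouge">open</code> throw instead of returning a garbled
message.</p>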

<h2 id="fantastical-constructions">Fantastical constructions…</h2>

<p>The addition of <a href="https://tools.ietf.org/html/rfc5246#section-6.2.3.3">AEAD-based ciphersuites</a> to TLS 1.2 gave the world a set of standard AEAD
constructions, including the ubiquitous <a href="https://tools.ietf.org/html/rfc5288">AES-GCM</a> and the more recent
<a href="https://tools.ietf.org/html/rfc7905">ChaCha20-Poly1305</a>. These have found their way into <a href="https://golang.org/pkg/crypto/cipher/#AEAD">standard libraries</a>,
<a href="https://download.libsodium.org/doc/secret-key_cryptography/ietf_chacha20-poly1305_construction.html">high-level libraries</a>, <a href="https://github.com/kubernetes/kubernetes/blob/f89d2493f100bf7268b8778d7e077e7d5383c50b/staging/src/k8s.io/apiserver/pkg/storage/value/encrypt/aes/aes.go#L69-L71">open-source projects</a>, and <a href="https://d0.awsstatic.com/whitepapers/KMS-Cryptographic-Details.pdf">online services</a>, largely
displacing a small ecosystem of lightly-scrutinized, ad-hoc constructions. This is a positive
development, but it’s worth noting that the proliferation of these constructions outside of the
context for which they were developed–i.e. TLS–has exposed us to additional cryptographic
challenges.</p>

<h2 id="and-where-to-find-them">…and where to find them</h2>

<p>TLS 1.2 uses AEADs to protect records–batches of data being sent or received via a TLS
connection–and the standard IETF AEAD constructions are designed to take advantage of specific
details of that context. For one, TLS connections are protected using randomly generated,
per-session keys. For two, TLS records have unique, sequenced identifiers to prevent an attacker
from dropping or duplicating records without detection.</p>

<p>As a result, all IETF AEAD standards use 96-bit nonces derived from the record identifiers, which
are guaranteed to be unique inside a single connection. Because the keys are also scoped to a single
connection, there is less risk of accidentally re-using a key/nonce pair. That’s an important
consideration, because these constructions’ guarantees of confidentiality and integrity are
completely broken if a key/nonce pair is ever re-used.</p>
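<p>The deterministic construction RFC 7905 (and later TLS 1.3) uses is simple: pad the 64-bit record
sequence number to 96 bits and <code class="language-plaintext highlighter-rouge">xor</code> it with a
per-connection IV. A sketch (the class and method names are mine):</p>

```java
class RecordNonce {
  // Derive a unique 96-bit nonce from a per-connection IV and a record sequence
  // number, as in RFC 7905 / TLS 1.3: nonce = iv xor pad(seq).
  static byte[] nonceFor(byte[] connectionIv, long sequenceNumber) {
    byte[] nonce = connectionIv.clone(); // 12 bytes
    for (int i = 0; i < 8; i++) {
      // xor the sequence number into the low-order 8 bytes, big-endian
      nonce[11 - i] ^= (byte) (sequenceNumber >>> (8 * i));
    }
    return nonce;
  }
}
```

<p>Within a connection, each sequence number yields a distinct nonce, so no coordination or random
generation is needed.</p>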

<h2 id="every-cloud-has-a-granite-lining">Every cloud has a granite lining</h2>

<p>AES-GCM provides confidentiality by using AES in counter mode to encrypt the data. A 128-bit
counter, which is derived from the nonce, is encrypted and the resulting key stream is <code class="language-plaintext highlighter-rouge">xor</code>-ed with
the plaintext to produce the ciphertext. If a key/nonce pair is used for two different messages, an
attacker can recover \(M_1 \oplus M_2\), which will reveal information about both messages.
ChaCha20-Poly1305 uses ChaCha20, a stream cipher, which is vulnerable to the same attack should a 
nonce ever be reused.</p>
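<p>That failure mode is easy to demonstrate with AES in counter mode (the confidentiality half of
AES-GCM): encrypting two messages under the same key and counter block produces ciphertexts whose
<code class="language-plaintext highlighter-rouge">xor</code> equals the
<code class="language-plaintext highlighter-rouge">xor</code> of the plaintexts, because the
identical keystream cancels out. A small sketch (the all-zero key and IV are deliberate misuse, for
illustration only):</p>

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class NonceReuse {
  // Encrypt with AES-CTR under a fixed key and counter block (deliberately misused here).
  static byte[] ctr(byte[] key, byte[] iv, byte[] input) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(input);
  }

  static byte[] xor(byte[] a, byte[] b) {
    byte[] out = new byte[Math.min(a.length, b.length)];
    for (int i = 0; i < out.length; i++) out[i] = (byte) (a[i] ^ b[i]);
    return out;
  }

  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16]; // all-zero key, for illustration only
    byte[] iv = new byte[16];  // the same counter block for both messages
    byte[] m1 = "attack at dawn!!".getBytes();
    byte[] m2 = "attack at dusk!!".getBytes();
    byte[] c1 = ctr(key, iv, m1);
    byte[] c2 = ctr(key, iv, m2);
    // The keystream cancels out: C1 xor C2 == M1 xor M2, leaking plaintext structure.
    System.out.println(java.util.Arrays.equals(xor(c1, c2), xor(m1, m2))); // prints "true"
  }
}
```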

<p>AES-GCM and ChaCha20-Poly1305 both provide integrity by hashing the ciphertext with polynomial
message authentication codes (GHASH for the former, Poly1305 for the latter). These types of
algorithms are vulnerable to <a href="https://eprint.iacr.org/2013/144.pdf">forgery attacks</a> should an authentication key ever be
reused. ChaCha20-Poly1305’s construction <a href="https://eprint.iacr.org/2017/239.pdf">is resilient to this sort of attack</a> because the
authentication key is derived from the ChaCha20 keystream, but AES-GCM is <a href="https://eprint.iacr.org/2013/157.pdf">very fragile</a>
and fails catastrophically in the event of key/nonce reuse.</p>

<p>Because key/nonce reuse is essentially impossible in the context of a TLS connection, these
constructions trade nonce-misuse resistance for performance. But what about in other contexts?</p>

<h2 id="betting-ones-shirt">Betting one’s shirt</h2>

<p>Without a naturally-occurring nonce like TLS’s record identifiers, developers often <a href="https://github.com/kubernetes/kubernetes/blob/f89d2493f100bf7268b8778d7e077e7d5383c50b/staging/src/k8s.io/apiserver/pkg/storage/value/encrypt/aes/aes.go#L69-L71">generate random
nonces</a> on the assumption that the probability of two operations picking the same 96-bit
value is low enough to safely ignore. For a few operations, that’s certainly correct, but thanks to
the <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday problem</a>, those probabilities get big quick. At what point does that become
unsafe?</p>
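<p>We can put numbers on that with the standard birthday-bound approximation: among \(n\) random
96-bit nonces, the probability of at least one collision is roughly \(1 - e^{-n(n-1)/2^{97}}\). A
quick sketch (the class name is mine):</p>

```java
class NonceCollision {
  // Birthday-bound approximation for a collision among n random 96-bit nonces:
  // p ~= 1 - exp(-n(n-1)/2^97). expm1 keeps precision for tiny probabilities.
  static double collisionProbability(double n) {
    return -Math.expm1(-n * (n - 1) / Math.pow(2, 97));
  }

  public static void main(String[] args) {
    for (double n : new double[] {1e6, 1e9, 4.3e9, 1e12}) {
      System.out.printf("%.1e nonces: p = %.2e%n", n, collisionProbability(n));
    }
  }
}
```

<p>The probability stays below \(2^{-32}\) only while the number of messages is within a small
factor of \(2^{32}\).</p>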

<p>Thankfully, NIST made <a href="http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf">concrete recommendations</a> regarding this:</p>

<blockquote>
  <p>The probability that the authenticated encryption function ever will be invoked with the same IV 
and the same key on two (or more) distinct sets of input data shall be no greater than 
2<sup>-32</sup>.</p>
</blockquote>

<blockquote>
  <p>…unless an implementation only uses 96-bit IVs that are generated by the deterministic 
construction:</p>

  <p>The total number of invocations of the authenticated encryption function shall not exceed
2<sup>32</sup>, including all IV lengths and all instances of the authenticated encryption 
function with the given key.</p>
</blockquote>

<p>If we’re randomly generating nonces, then, we should encrypt no more than 4,294,967,296 messages
with the same key. That’s a lot, right? Well, no.</p>

<p>As an example, let’s assume we’ve got a web application running on a bunch of independent servers
and we’re using AES-GCM or ChaCha20-Poly1305 to secure something in each request with a shared key.
At 100 requests a second, we won’t need to rotate keys for 16 months. At 1,000 requests per second,
we’ll need to rotate keys every 50 days or so. At 10,000 requests per second, we’ll be rotating keys
more than once a week. At 100,000 requests per second, we’ll be rotating keys more than twice a day.</p>
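<p>Those intervals fall straight out of dividing the \(2^{32}\) budget by the request rate; a
trivial sketch (the class and method names are mine):</p>

```java
class KeyRotation {
  // Seconds until a key hits the 2^32-invocation budget at a given request rate.
  static double secondsUntilRotation(double requestsPerSecond) {
    return Math.pow(2, 32) / requestsPerSecond;
  }

  public static void main(String[] args) {
    for (double rate : new double[] {100, 1_000, 10_000, 100_000}) {
      System.out.printf("%,9.0f req/sec: rotate every %6.1f days%n",
          rate, secondsUntilRotation(rate) / 86_400);
    }
  }
}
```

<p>That works out to roughly 497 days at 100 requests per second, 50 days at 1,000, 5 days at
10,000, and half a day at 100,000, matching the estimates above.</p>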

<p>What’s a developer to do?</p>

<h2 id="a-safer-footgun">A safer footgun</h2>

<p>The solution, broadly speaking, is a class of AEAD constructions called “nonce-misuse
resistant”: they don’t immediately fall apart should a key/nonce pair be re-used. That doesn’t mean
they can tolerate arbitrary amounts of re-use, but rather that their security bounds degrade much
more slowly with each re-used nonce.</p>

<p>The most practical nonce-misuse resistant AEAD proposal, I think, is <a href="https://eprint.iacr.org/2017/168">AES-GCM-SIV</a>, an
improvement of <a href="https://eprint.iacr.org/2015/102.pdf">GCM-SIV</a> designed for use in Google’s <a href="https://www.chromium.org/quic">QUIC</a> protocol. QUIC allows
clients to prove they’re actually using a particular IP address by issuing cryptographic
“source-address” tokens.</p>

<blockquote>
  <p>Since a central allocation system for nonces is not operationally viable, random selection of
nonces is the only possibility. AES-GCM’s limit of 2<sup>32</sup> random nonces (per key) suggests 
that, even if the system rotated these secret keys daily, it could not issue more than about 50K
tokens per second. However, in order to process DDoS attacks the system may need to sustain 
issuance of several hundred million per second.</p>
</blockquote>

<p>As a result, AES-GCM-SIV is designed with a few important properties:</p>

<ol>
  <li>For encryption, it uses AES, which is well-studied and widely supported in hardware thanks to
AES-NI.</li>
  <li>For authentication, it uses a MAC called POLYVAL, which is essentially GCM’s GHASH without the
byte swapping. As a result, it can leverage Intel’s <code class="language-plaintext highlighter-rouge">PCLMULQDQ</code> instruction to achieve speeds
only slightly slower than GCM, despite requiring two full passes on each message.</li>
  <li>Its security bounds degrade almost linearly with nonce reuse, improving on GCM-SIV’s quadratic 
degradation, and making it ideal for use in systems like QUIC where GCM’s per-key limits are
infeasible.</li>
  <li>Unlike constructions like <a href="https://cr.yp.to/snuffle/xsalsa-20081128.pdf">XSalsa20</a>, which protect against nonce reuse by increasing 
the nonce size, AES-GCM-SIV still uses 96-bit nonces. For a system like QUIC, which may peak at
10<sup>8</sup> operations per second, this matters. At that rate, XSalsa20 would require 9.6Gb/s 
more bandwidth.</li>
</ol>
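<p>That last bandwidth figure is just nonce-size arithmetic: XSalsa20’s 192-bit nonces carry 96 bits
more per message than a 96-bit nonce does. A sketch (the class name is mine):</p>

```java
class NonceBandwidth {
  public static void main(String[] args) {
    double opsPerSecond = 1e8;           // QUIC's peak rate, per the figure above
    double extraBitsPerNonce = 192 - 96; // XSalsa20 nonce minus a 96-bit nonce
    // 10^8 messages/sec * 96 extra bits = 9.6e9 bits/sec of pure nonce overhead.
    System.out.printf("%.1f Gb/s%n", opsPerSecond * extraBitsPerNonce / 1e9); // prints "9.6 Gb/s"
  }
}
```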

<p>AES-GCM-SIV is <a href="https://tools.ietf.org/html/draft-irtf-cfrg-gcmsiv-05">not yet an IETF standard</a>, but it’s in progress. Once
standardized, however, I can see it being the recommended construction for authenticated encryption.</p>

<p>As you may have guessed, I’ve written a <a href="https://github.com/codahale/aes-gcm-siv">Java library</a> which implements the most recent draft
specification for AES-GCM-SIV. It’s based on the AES and GHASH implementations in BouncyCastle, so
it doesn’t leverage the performance and security benefits of AES-NI and <code class="language-plaintext highlighter-rouge">PCLMULQDQ</code> instructions,
making it around twice as slow as AES-GCM. It’s still plenty fast (~40-50µs for a 1KiB message),
however.</p>

<h2 id="tldr">tl;dr</h2>

<p>AEADs have been a critical development in making symmetric encryption safe to use, but most standard
constructions were developed to be used in contexts with deterministic nonces, like TLS’s record
identifiers. In contexts with long-lived keys and where nonces must be randomly generated, the
Birthday Problem makes nonces much more likely to collide in systems performing many operations.
Should a nonce be reused, the security guarantees of most standard AEADs are null and void–an
attacker can recover plaintexts, forge messages, etc.</p>

<p>Cryptographers have been working on “nonce-misuse resistant” AEADs, which have security bounds which
degrade more gracefully when nonces are re-used. Of these, AES-GCM-SIV is the most promising and,
once standardized, will be a better choice than AES-GCM in contexts where nonces must be generated
randomly.</p>

<hr />

<h2 id="updated-june-10-2017">Updated June 10, 2017</h2>

<ul>
  <li>Fixed my assertions about ChaCha20-Poly1305 forgery attacks. Thanks,
<a href="https://twitter.com/dchest/status/863295115477094400">@dchest</a>!</li>
</ul>

<hr />

<p><em>Thanks to <a href="https://twitter.com/tqbf">Thomas Ptacek</a> for reviewing this post. Any mistakes in this
article are mine, not his.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which encryption's splash damage is somewhat nerfed.]]></summary></entry><entry><title type="html">usl4j And You</title><link href="https://codahale.com//usl4j-and-you/" rel="alternate" type="text/html" title="usl4j And You" /><published>2017-05-31T00:00:00+00:00</published><updated>2017-05-31T00:00:00+00:00</updated><id>https://codahale.com//usl4j-and-you</id><content type="html" xml:base="https://codahale.com//usl4j-and-you/"><![CDATA[<p>My friend Jeff wrote <a href="https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/">a thing</a> a while back which contained a cornucopia of truth, one of my
favorite bits being the following:</p>

<blockquote>
  <p>“It’s slow” is the hardest problem you’ll ever debug. “It’s slow” might mean one or more of the
number of systems involved in performing a user request is slow. It might mean one or more of the
parts of a pipeline of transformations across many machines is slow. “It’s slow” is hard, in part,
because the problem statement doesn’t provide many clues to location of the flaw. Partial 
failures, ones that don’t show up on the graphs you usually look at, are lurking in a dark corner. 
And, until the degradation becomes very obvious, you won’t receive as many resources (time, money,
and tooling) to solve it. Dapper and Zipkin were built for a reason.</p>
</blockquote>

<p>I’ve been thinking about this lately, and I think another factor which makes this such a hard
problem is that even if you have the ability to segment performance telemetry by space (i.e. which
subsystem is slow) it’s not guaranteed that doing so will actually find the problem. As with
performance optimization, if you don’t find any hotspots, the problem is often systemic, not local,
and as such requires a different set of tools to resolve.</p>

<p>In this post, I’ll introduce you to <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>, the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">Universal Scalability Law</a>, and
<a href="https://github.com/codahale/usl4j">usl4j</a>, a Java library for modeling system performance given a small set of real-world
measurements.</p>

<h2 id="littles-law">Little’s Law</h2>

<p>First, let’s go over <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>, a simple equation describing the behavior of
queues. It formally describes the relationship of queue size (\(N\)), throughput (\(X\)), and
latency (\(R\)):</p>

\[\begin{array}{rcl}
N &amp; = &amp; XR\\
X &amp; = &amp; N/R\\
R &amp; = &amp; N/X\\
\end{array}\]

<p>For example, consider a coffee shop which takes an average of 90 seconds to make an order
(\(R=90\)). A customer places an order, on average, every 60 seconds (\(X=1/60\)). The average
number of in-flight orders, therefore, is \(N=XR=1/60 \times 90 = 1.5\).</p>
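<p>The coffee shop arithmetic is one line in code; here’s a trivial sketch of all three forms of the
law (the class and method names are mine):</p>

```java
class LittlesLaw {
  static double queueSize(double throughput, double latency) { return throughput * latency; }
  static double throughput(double queueSize, double latency) { return queueSize / latency; }
  static double latency(double queueSize, double throughput) { return queueSize / throughput; }

  public static void main(String[] args) {
    double x = 1.0 / 60; // an order placed every 60 seconds
    double r = 90;       // 90 seconds to make an order
    System.out.printf("%.1f%n", queueSize(x, r)); // prints "1.5"
  }
}
```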

<p><a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> is incredibly helpful in terms of being able to predict systems’ behaviors by
modeling them as big-ass queues. A simple auto-scaling system, given a threshold latency of \(R\)
seconds, can monitor the number of concurrent requests \(N\) over a set of servers. When the number
of requests per second \(X\) begins to approach \(N/R\), the system can bring new servers online to
increase the capacity and stay under the threshold latency.</p>

<p>Given any two parameters, <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> allows us to derive the third, but what if we want to
predict a system’s behavior given only a single parameter?</p>

<h2 id="the-universal-scalability-law">The Universal Scalability Law</h2>

<p>The <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">Universal Scalability Law</a>, developed by <a href="http://www.perfdynamics.com/Bio/njg.html">Neil J. Gunther</a>, is a model which combines
<a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s Law</a> and <a href="https://en.wikipedia.org/wiki/Gustafson%27s_law">Gustafson’s Law</a> to produce a nonlinear model which can be used to
predict a system’s behavior:</p>

\[X(N) = \dfrac{\lambda N}{1+\sigma(N-1)+\kappa N(N-1)}\]

<p>It describes the expected throughput of a system (\(X\)) at a given level of concurrency (\(N\))
as the nonlinear relationship of three parameters:</p>

<ul>
  <li>\(\sigma\), the overhead of contention</li>
  <li>\(\kappa\), the overhead of crosstalk</li>
  <li>\(\lambda\), how fast the system operates in ideal conditions</li>
</ul>

<p>(For a more in-depth explanation, I highly recommend reading <a href="https://www.xaprb.com/">Baron Schwartz’s</a> excellent book,
<a href="https://www.vividcortex.com/resources/universal-scalability-law/">Practical Scalability Analysis with the Universal Scalability Law</a>.)</p>

<p>Given a set of <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> parameters \(\{\sigma,\kappa,\lambda\}\), we can use the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a>
equation to pin down one parameter of <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>, allowing us to make predictions given the
value of any single parameter:</p>

\[\begin{array}{rcc}
R(N) &amp; = &amp; \dfrac{1 + \sigma(N-1) + \kappa N(N-1)}{\lambda}\\
R(X) &amp; = &amp; \dfrac{-\sqrt{X^2 (\kappa^2 + 2 \kappa (\sigma - 2) + \sigma^2) + 2 \lambda X (\kappa - \sigma) + \lambda^2} + \kappa X + \lambda - \sigma X}{2 \kappa X^2}\\
X(R) &amp; = &amp; \dfrac{\kappa - \sigma + \sqrt{\sigma^2 + \kappa^2 + 2 \kappa (2 \lambda R + \sigma - 2)}}{2 \kappa R}\\
N(R) &amp; = &amp; \dfrac{\kappa - \sigma + \sqrt{\sigma^2 + \kappa^2 + 2 \kappa (2 \lambda R + \sigma - 2)}}{2 \kappa}\\
\end{array}\]

<p>We can even predict the maximum concurrency of a system:</p>

\[N_{max} = \left\lfloor \sqrt{\dfrac{1 - \sigma}{\kappa}} \right\rfloor\]
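<p>The closed forms above translate directly into code. Here’s a minimal sketch of \(X(N)\) and
\(N_{max}\) (the class is mine, and the parameter values are made up for illustration):</p>

```java
class UslModel {
  final double sigma, kappa, lambda; // contention, crosstalk, ideal speed

  UslModel(double sigma, double kappa, double lambda) {
    this.sigma = sigma;
    this.kappa = kappa;
    this.lambda = lambda;
  }

  // X(N): expected throughput at a given level of concurrency.
  double throughputAtConcurrency(double n) {
    return (lambda * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1));
  }

  // N_max: the concurrency at which throughput peaks.
  double maxConcurrency() {
    return Math.floor(Math.sqrt((1 - sigma) / kappa));
  }

  public static void main(String[] args) {
    UslModel model = new UslModel(0.02, 1e-4, 955); // illustrative values only
    System.out.printf("N_max = %.0f%n", model.maxConcurrency()); // prints "N_max = 98"
  }
}
```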

<p>But where do \(\sigma\), \(\kappa\), and \(\lambda\) come from? In order to determine the
<a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> parameters for a system, we must first gather a set of measurements of the system’s
behavior.</p>

<h2 id="building-a-model">Building a model</h2>

<p>These measurements must be of two of the three parameters of <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>: mean response time
(in seconds), throughput (in requests per second), and concurrency (i.e. the number of concurrent
clients).</p>

<p>Because response time tends to be a property of load (i.e. it rises as throughput or concurrency
rises), the dependent variable in our tests should be mean response time. This leaves either
throughput or concurrency as our independent variable, but thanks to <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> it doesn’t
matter which one we use. Because the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> is defined in terms of concurrency (\(N\)) and
throughput (\(X\)), it’s more straight-forward to keep these measurements in those terms.</p>

<p>For the purposes of discussion, let’s say we measure throughput as a function of the number of
concurrent clients sending requests as fast as they can. After our load testing is done, we should
have a set of measurements shaped like this:</p>

<table>
  <thead>
    <tr>
      <th>Concurrency</th>
      <th>Throughput</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>955.16</td>
    </tr>
    <tr>
      <td>2</td>
      <td>1878.91</td>
    </tr>
    <tr>
      <td>3</td>
      <td>2688.01</td>
    </tr>
    <tr>
      <td>4</td>
      <td>3548.68</td>
    </tr>
    <tr>
      <td>5</td>
      <td>4315.54</td>
    </tr>
    <tr>
      <td>6</td>
      <td>5130.43</td>
    </tr>
    <tr>
      <td>7</td>
      <td>5931.37</td>
    </tr>
    <tr>
      <td>8</td>
      <td>6531.08</td>
    </tr>
  </tbody>
</table>

<p>Next comes the hard part: we need to use a nonlinear solver to generate optimal coefficients for the
<a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> which fit these measurements.</p>

<h2 id="using-usl4j">Using usl4j</h2>

<p>Luckily for you, I wrote <a href="https://github.com/codahale/usl4j">usl4j</a>, a Java library which uses <a href="http://ddogleg.org/">DDogleg</a>’s
Levenberg-Marquardt least-squares optimizer to build a fully-parameterized <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> model given a
set of measurements:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">com.codahale.usl4j.Measurement</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.codahale.usl4j.Model</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>

<span class="kd">class</span> <span class="nc">Example</span> <span class="o">{</span>
  <span class="kt">void</span> <span class="nf">buildModel</span><span class="o">()</span> <span class="o">{</span>
    <span class="kt">double</span><span class="o">[][]</span> <span class="n">points</span> <span class="o">=</span> <span class="o">{</span> <span class="o">{</span><span class="mi">1</span><span class="o">,</span> <span class="mf">955.16</span><span class="o">},</span> <span class="o">{</span><span class="mi">2</span><span class="o">,</span> <span class="mf">1878.91</span><span class="o">},</span> <span class="o">{</span><span class="mi">3</span><span class="o">,</span> <span class="mf">2688.01</span><span class="o">}</span> <span class="o">};</span> <span class="c1">// etc.</span>
  
    <span class="c1">// Map the points to measurements of concurrency and </span>
    <span class="c1">// throughput, then build a model from them. </span>
    <span class="nc">Model</span> <span class="n">model</span> <span class="o">=</span> <span class="nc">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">points</span><span class="o">)</span>
                        <span class="o">.</span><span class="na">map</span><span class="o">(</span><span class="nc">Measurement</span><span class="o">.</span><span class="na">ofConcurrency</span><span class="o">()::</span><span class="n">andThroughput</span><span class="o">)</span>
                        <span class="o">.</span><span class="na">collect</span><span class="o">(</span><span class="nc">Model</span><span class="o">.</span><span class="na">toModel</span><span class="o">());</span>
    
    <span class="c1">// Predict the throughput for various levels of</span>
    <span class="c1">// possible concurrency.</span>
    <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">10</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">200</span><span class="o">;</span> <span class="n">i</span><span class="o">+=</span><span class="mi">10</span><span class="o">)</span> <span class="o">{</span>
      <span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">printf</span><span class="o">(</span><span class="s">"At %d concurrent clients, expect %f req/sec\n"</span><span class="o">,</span> 
        <span class="n">i</span><span class="o">,</span> <span class="n">model</span><span class="o">.</span><span class="na">throughputAtConcurrency</span><span class="o">(</span><span class="n">i</span><span class="o">));</span>
    <span class="o">}</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The resulting data looks something like this:</p>

<p><img src="/assets/img/usl.png" alt="Graph of actual throughput vs. USL-predicted values." /></p>

<p>usl4j allows you to calculate all of the parameters of <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a>: \(N(X)\), \(N(R)\),
\(X(N)\), \(X(R)\), \(R(N)\), \(R(X)\), as well as \(N_\text{max}\) and \(X_\text{max}\).</p>

<h2 id="continuous-measurement">Continuous Measurement</h2>

<p>Because we can build <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> models using submaximal testing (i.e. not testing to overload), the
measurements we use aren’t necessarily restricted to test bench experiments. We can take real 
measurements from live systems and continuously build models from them, as building a model from a
small set of measurements is very fast (i.e. &lt;100µs).</p>

<p>Building <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> models in realtime can augment dashboards and alerting systems with answers to
critical questions:</p>

<ul>
  <li>How close is the system to the predicted maximum throughput or concurrency?</li>
  <li>How close are we to the point at which our latency will be over SLA?</li>
  <li>How much would adding another server help our latency?</li>
</ul>

<p>Building <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> models over historical data allows you to quantitatively answer key questions
about how the system has changed over time with regard to scalability:</p>

<ul>
  <li>A decreased \(\sigma\) value means decreased contention (e.g. better database lock scheduling or 
a new lock-free data structure).</li>
  <li>A decreased \(\kappa\) value means decreased crosstalk (e.g. removing false cache sharing, or 
reduced fanout in a distributed system).</li>
  <li>An increased \(\lambda\) value means the system has increased in unloaded performance (e.g. a 
new compiler optimization or runtime version).</li>
</ul>

<h2 id="usl-for-all">USL For All</h2>

<p>Because the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> has a strong physical basis, its parameters can indicate where effort is best
spent: increasing \(\sigma\), \(\kappa\), or \(\lambda\). Unlike existing observability tools,
the <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> can allow you to pinpoint what kind of process is behind “it’s slow”, even if the
slowness isn’t limited to a subset of the system.</p>

<p>The <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">USL</a> is a powerful, accessible tool for modeling the behavior of software systems, and
it’s my firm hope that making it easily automatable with <a href="https://github.com/codahale/usl4j">usl4j</a> leads to its adoption in
observability platforms.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which measurements and models are made.]]></summary></entry><entry><title type="html">Risky Business Requires Active Operators</title><link href="https://codahale.com//risky-business-requires-active-operators/" rel="alternate" type="text/html" title="Risky Business Requires Active Operators" /><published>2017-05-19T00:00:00+00:00</published><updated>2017-05-19T00:00:00+00:00</updated><id>https://codahale.com//risky-business-requires-active-operators</id><content type="html" xml:base="https://codahale.com//risky-business-requires-active-operators/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Feb 23, 2017.</p>

<p>It’s tempting, as developers, to consider automation an absolute good. We are, after all, paid to
automate work. But as the complexity of a system passes some hazy, ill-defined threshold, we need to
be very clear about the <strong>risks of automation</strong> in order to successfully and safely wield its power.</p>

<p><img src="/assets/img/terraform.jpg" alt="Ok Google, tell Siri to say “Alexa: file my taxes”." /></p>

<p>As the co-founder of Skyliner, a continuous delivery launch platform for AWS, I’m obviously
convinced of the virtues of automating huge swaths of traditionally manual administration tasks. I
mean, I helped build a product which spins up a full multi-environment AWS architecture at the press
of a button—so why am I counseling caution?</p>

<h2 id="you-had-one-job">You had one job</h2>

<p>It doesn’t take much for a system to become complex, especially when you take into account its human
operators. Unlike simple systems, which can often be analyzed in terms of failure probabilities, the
safety of complex systems must be considered holistically: that is, human operators included.</p>

<p>Automation, by removing human intervention from a task, removes the possibility of human error.
Unlike a human operator, a Roomba never gets bored of vacuuming your living room, never goes on
vacation, and never takes a sick day. But automation also removes the possibility of human error
correction. Unlike a human operator, a Roomba will diligently paint your floors with the unexpected
mess your new puppy left in the hallway. As a result, in order to build safe automation we need to
consider the role of human operators in keeping complex systems safe.</p>

<p><img src="/assets/img/wayminute.jpg" alt="An Uber self-driving SUV, self-driving like a jerk in downtown SF." /></p>

<h2 id="when-good-days-go-bad">When good days go bad</h2>

<p>It’s worth noting that complex systems aren’t homogenous blobs of complexity. Some components and
their activities are relatively straight-forward and some are more complex. When looking at change
boundaries and automation, we should focus on those with a higher degree of downside risk which
can’t be effectively hedged with automation.</p>

<p>For example, launching an EC2 instance has a small amount of downside risk—if it doesn’t come up
cleanly, you’re paying for a useless instance—which can be very effectively hedged by using
auto-scaling groups. In contrast, deploying a new version of your application to production has a
large amount of downside risk—if the new version contains a <a href="https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/">subtle but horrible
bug</a>, you lose
<a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf">money</a>—and this risk can’t always be hedged
via blue/green deploys, canary deploys, or other types of automation.</p>

<p>Around these kinds of change boundaries, where the downside risk is high and correctness can’t be
automatically verified, we must ensure that our systems keep human operators active. After all,
they’re the only ones who will be able to detect and
<a href="https://www.youtube.com/watch?v=pzzQ42D9Srw">respond</a> to such incidents to keep the system safe.</p>

<h2 id="on-the-care-and-feeding-of-human-operators">On the care and feeding of human operators</h2>

<p>An active human operator is one who:</p>

<ul>
  <li>is <strong>aware</strong> they are crossing a change boundary</li>
  <li><strong>understands</strong> doing so may cause an incident</li>
  <li>is <strong>prepared</strong> to respond should one occur</li>
</ul>

<p>In order to make human operators aware of change boundaries and their ambient risks, we must design
automation systems which require their active participation in order to perform some actions around
those boundaries. If a system is purely autonomous, any distinctions between safer and riskier
actions inside it are invisible to a human operator. Similarly, actions performed by human operators
should not traverse those change boundaries as an unintended consequence—they should be explicitly
labeled as such and not intermediated by other, unrelated actions. By making change boundaries
visible and explicit, we also help prepare human operators to respond to possible incidents. The
more proximal an action is to its effect, the easier it becomes for us to reason about the immediate
cause of an incident while it’s happening.</p>

<h2 id="a-combination-emergency-brakegas-pedal">A combination emergency brake/gas pedal</h2>

<p>For example, consider the relatively common practice of running pending database migrations as part
of a successful deploy. On the happy path, this is a huge time-saver, and removes the potential for
human error: never again will Gary forget to run his goddamn migrations! Off of the <a href="https://en.wikipedia.org/wiki/Happy_path">happy
path</a>, though, this type of automation can make incidents
more likely and harder to respond to.</p>

<p>To begin with, this combines the crossing of two risky change boundaries into a single action.
Database migrations, especially on large, live databases, are notorious for causing incidents. A
poorly-written migration can take milliseconds on a developer’s laptop, seconds in QA, and hours in
production, all while holding an exclusive lock and thrashing the disk.</p>

<p>Coupling these two actions not only increases the probability of failure, it also obscures their
connection, which slows down incident response: a deploy is centered around the application, not the
database, and a responder will be primed to look at potential bugs in the application before
considering the possibility of a problem with the migration.</p>

<p>Further, it complicates any automated responses to deploy failures in general. Any attempts to
restore the system to safety must take into consideration not only that the last deploy may have
modified the application’s external dependencies, but also that a migration may have left the database
in an indeterminate state, requiring manual intervention.</p>

<p><img src="/assets/img/ansible.jpg" alt="git commit -m &quot;Fix problems with puppy messes&quot;" /></p>

<h2 id="putting-the-pilot-in-auto-pilot">Putting the pilot in “auto-pilot”</h2>

<p>In designing Skyliner, we intentionally chose to make deploying changes to production an active
process. By default, Skyliner will build commits and deploy them to QA when you push them, but
promoting that build to production requires an actual human to push a button.</p>

<p>We also chose to avoid making that button press on your keyboard. We’re very familiar with command
line tooling for deploys, and while command line tooling is amazing for automation, it often does a
very poor job of actually engaging human operators in considering the surrounding context of an
action:</p>

<p><img src="/assets/img/s3-outage.jpg" alt="You said four zeros, right? No? Really? Well, you got four zeros." /></p>

<p>Unlike a purely automatic system or a command line tool, Skyliner places users in an environment
which primes both their expectations of what will happen when they push that button and their
situational awareness of events after they push that button:</p>

<p><img src="/assets/img/skyliner-screenshot.png" alt="This is my day job." /></p>

<p>Making changes to complex systems is inherently risky, and only some of that risk can be mitigated
by automation. Managing the remaining, necessary risk is the job of human operators like you and me.
Safe, successful automation must empower us, not relegate us to the sidelines.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which I peddle artisanally-curated, locally-sourced, farm-to-table change boundaries.]]></summary></entry><entry><title type="html">Virtual Machines Are Fleeting Things</title><link href="https://codahale.com//virtual-machines-are-fleeting-things/" rel="alternate" type="text/html" title="Virtual Machines Are Fleeting Things" /><published>2017-05-18T00:00:00+00:00</published><updated>2017-05-18T00:00:00+00:00</updated><id>https://codahale.com//virtual-machines-are-fleeting-things</id><content type="html" xml:base="https://codahale.com//virtual-machines-are-fleeting-things/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Feb 23, 2017.</p>

<p><img src="/assets/img/mono-no-aware.jpg" alt="Fine weather, isn't it?" /></p>

<p>Last month, AWS sent us an email regretfully informing us that the hardware running one of our
instances had begun to fail. They gave us two weeks to move our data, if we could, to a new
instance. I smiled, hit the archive button, and confidently went about the rest of my day. We’d long
since terminated that instance, and our software had already moved on. At other companies, an
instance retirement email means a day’s work for someone. My happy indifference to the loss of
instance <code class="language-plaintext highlighter-rouge">i-0b3baeac82b2b8461</code> was a direct result of a key design decision we made when building
Skyliner: Skyliner performs everything — builds and deploys — by launching new, impermanent EC2
instances. The result is a dramatically simpler system, and thus a more reliable and operable
system.</p>

<p>At the lowest level, deploying with fresh, purpose-specific instances means there is zero risk of
accidental state from the last deploy: no conflicting system packages, no mangled symlinks, no
zombie processes, no old configuration files. This radically reduces the path dependence of builds
and deploys, which in turn increases their reliability and operability. When a deploy fails, you
don’t need to consider the entire history of a server—you can be confident that the failure is
scoped to that deploy’s (hopefully small) diff.</p>

<p>The cumulative effect of this elimination of accidental state is a radically reduced divergence from
an ideal state over time. If an engineer manually logs into one of the servers to debug something,
whatever changes they may make—installing a package, modifying a configuration file, taking a heap
dump—automatically disappear after the next deploy, instead of when said engineer diligently
remembers to roll back all of the changes they made when trying to diagnose an outage at 3am. If a
developer forgets to delete their temporary files or debug logging is accidentally enabled for
everything, those files vanish after the next deploy. Like
<a href="https://en.wikipedia.org/wiki/Apoptosis">apoptosis</a>, continuously destroying and creating instances
makes it nearly impossible to accumulate error.</p>

<p>Instance impermanence is also a helpful invariant for steering applications themselves towards more
reliable architectures. The ability to depend on local storage as long-term persistence is a kind of
<a href="https://en.wikipedia.org/wiki/Affordance">false affordance</a>: it implicitly makes the promise to
application developers that <code class="language-plaintext highlighter-rouge">/tmp</code> is not, despite the name, actually temporary. A classic example
of this is web applications which store session data as temporary files (<em>e.g. Rails circa 2006</em>).
When working on and deploying to permanent servers, it’s easy for a developer (<em>e.g. me circa early
2006</em>) to build software on top of that false affordance of permanence (<em>e.g. authentication circa
early 2006</em>) with significant reliability and operability problems (<em>e.g. inode exhaustion from old
session files, logging out all users from cleaning up what I thought were old session files, kernel
locks and corrupted data from trying to store the sessions on an NFS server circa mid-2006</em>) which
would not have happened with an alternate architecture (<em>e.g. storing sessions in memcached circa
late 2006</em>). (It was a different time.)</p>

<p>Unsurprisingly, this invariant is also helpful in guiding infrastructure applications like Skyliner
towards simpler, more reliable architectures. In particular, it collapses provisioning, deployment,
and configuration management into a single, highly reliable pathway. Need more capacity? Launch new
instances. Want to deploy your changes? Launch new instances. Want to change the configuration?
Launch new instances. In systems with separate provisioning, deployment, and configuration
management processes, it’s common for the less frequently used processes to be slow, complex, and
unreliable: deploys are pleasant, installing a new package for an application takes half a day
wrestling with Puppet, and provisioning additional capacity is a full-on devops adaptation of
Voltaire’s <em>Candide</em>. By combining those pathways, we ensure that an improvement for one is an
improvement for all, and its constant operation gives us confidence in our ability to perform more
infrequent tasks like provisioning and configuration.</p>

<p>Finally, a focus on new instances means being able to take advantage of key AWS technologies like
autoscaling groups. Every Skyliner application is deployed in an autoscaling group, which makes
recovering from instance crashes or retirement automatic. Skyliner is itself a Skyliner application,
so when I read that email about poor instance <code class="language-plaintext highlighter-rouge">i-0b3baeac82b2b8461</code>, I could simply smile and get
back to work.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which the pain of attachment is avoided.]]></summary></entry><entry><title type="html">On The Difficulty Of Conjuring Up A Dryad</title><link href="https://codahale.com//on-the-difficulty-of-conjuring-up-a-dryad/" rel="alternate" type="text/html" title="On The Difficulty Of Conjuring Up A Dryad" /><published>2017-05-17T00:00:00+00:00</published><updated>2017-05-17T00:00:00+00:00</updated><id>https://codahale.com//on-the-difficulty-of-conjuring-up-a-dryad</id><content type="html" xml:base="https://codahale.com//on-the-difficulty-of-conjuring-up-a-dryad/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Nov 29, 2016.</p>

<p><img src="/assets/img/apollo-and-daphne.jpg" alt="Apollo &amp; Daphne / Veronese" /></p>

<p>When we started building Skyliner, our goal was to make deploys on AWS safe, reliable, and easy. To
accomplish this at scale, we made some key design and implementation decisions. In this post, I’ll
tell you what those decisions were, why they work, and how we built a system which is reliable
enough to even deploy itself.</p>

<hr />

<h2 id="use-a-finite-state-machine">Use A Finite-State Machine</h2>

<p>Our earliest major design decision was to model the Skyliner deploy process as a <a href="https://en.wikipedia.org/wiki/Finite-state_machine">Finite-State
Machine</a> (FSM), with transitions from one state
to another associated with specific conditions and actions. For example, a deploy in the
<code class="language-plaintext highlighter-rouge">rollout-wait</code> state will check the newly-launched instances of the deploy. If the instances are up
and running, the deploy is advanced via <code class="language-plaintext highlighter-rouge">rollout-ok</code> to the <code class="language-plaintext highlighter-rouge">evaluate-wait</code> state. If the instances
have failed to launch, the deploy is advanced via <code class="language-plaintext highlighter-rouge">rollout-failed</code> to the <code class="language-plaintext highlighter-rouge">rollback</code> state. If the
instances are still launching, the deploy is kept in the <code class="language-plaintext highlighter-rouge">rollout-wait</code> state via
<code class="language-plaintext highlighter-rouge">rollout-in-progress</code>.</p>

<p><img src="/assets/img/skyliner-deploy.png" alt="A Skyliner deploy." /></p>

<p>Using an FSM allows us not just to exhaustively determine that all possible states of a deploy are
handled, but also to decompose our own code into small, comprehensively-tested, state-specific
functions. It also allows us to extract state management as a first-class concern; unlike many deploy
tools which keep this state in memory, we store an append-only history of deploy states. As a
result, our code is reentrant: it can be interrupted at any time and safely resumed later. If one of
our servers crashes, another can seamlessly take its place with no disruption in service.</p>
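
<p>The transitions described above can be sketched as small, state-specific functions that advance
an append-only log of states. This is a minimal illustration, not Skyliner’s actual code: the state
and transition names come from this post, but the function shapes and arguments are assumed.</p>

```python
# A toy sketch of the deploy FSM described above. State and transition names
# are from the post; everything else is an illustrative assumption.

def check_rollout(instances_up: bool, still_launching: bool) -> tuple[str, str]:
    """Handler for the rollout-wait state: returns (transition, next state)."""
    if still_launching:
        return ("rollout-in-progress", "rollout-wait")  # stay where we are
    if instances_up:
        return ("rollout-ok", "evaluate-wait")
    return ("rollout-failed", "rollback")

def advance(history: list[str], instances_up: bool, still_launching: bool) -> list[str]:
    """Advance a deploy by one tick. The history is an append-only log of
    states, so an interrupted worker loses nothing and another worker can
    resume from the last recorded state."""
    if history[-1] == "rollout-wait":
        _, next_state = check_rollout(instances_up, still_launching)
        return history + [next_state]
    return history  # other states would have their own handlers
```

<p>Because <code class="language-plaintext highlighter-rouge">advance</code> is a pure function of the
recorded history plus the observed world, each state handler can be tested exhaustively in
isolation.</p>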

<h2 id="use-a-reliable-coordinator">Use A Reliable Coordinator</h2>

<p>Unlike some deploy tools which require a single “master” server to coordinate deploys, Skyliner uses
Amazon’s Simple Queue Service (SQS)—a highly-available, scalable, reliable message queue service—as
a distributed clock to advance each deploy’s state.</p>

<p>SQS has a very robust model for dealing with failures: when a consumer polls the server for a new
message, it specifies a visibility timeout — a period during which that message will not be visible
to other workers. If the consumer successfully processes the message, it deletes it from the queue.
If the consumer crashes, the visibility timeout elapses and the message becomes visible to another
consumer. Similarly, when sending a message to a queue, one can specify a delay — a period of time
during which that message will not be visible to any consumer. We use the delay and the visibility
timeouts to create “ticks” for deploys.</p>

<p>When a deploy is started, we send an SQS message with the deploy ID, environment, etc. using a delay
of e.g. 10 seconds. After 10 seconds, it becomes visible to a Skyliner background thread, which
receives it using a visibility timeout of e.g. 10 seconds. The thread looks up the deploy’s current
state and takes any appropriate action to advance it. If the deploy has finished, the thread deletes
the message from the queue. Otherwise, the message is left in the queue to reappear after another 10
seconds has passed.</p>

<p>If the instance handling the deploy crashes, the deploy’s message becomes visible after another 10
seconds, and another instance will receive it and advance the deploy.</p>
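
<p>The delay/visibility-timeout “tick” mechanism can be illustrated with an in-memory stand-in for
SQS. To be clear, this is a toy model built for this explanation only: Skyliner uses SQS itself, and
the <code class="language-plaintext highlighter-rouge">TickQueue</code> class, its manual clock, and its
method names are all invented here.</p>

```python
import heapq
import itertools

class TickQueue:
    """An in-memory toy imitating the two SQS behaviors described above:
    a send delay, and a visibility timeout that makes an un-deleted message
    reappear. `now` is a manual clock, advanced by hand for illustration."""

    def __init__(self):
        self._heap = []  # (visible_at, seq, body) tuples, ordered by time
        self._seq = itertools.count()
        self.now = 0

    def send(self, body, delay=10):
        # A delay hides the message from all consumers until now + delay.
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), body))

    def receive(self, visibility_timeout=10):
        # Nothing is visible yet: behave like an empty poll.
        if not self._heap or self._heap[0][0] > self.now:
            return None
        _, seq, body = heapq.heappop(self._heap)
        # The message stays "in flight": unless deleted, it becomes
        # visible again after the visibility timeout elapses.
        handle = (self.now + visibility_timeout, seq, body)
        heapq.heappush(self._heap, handle)
        return handle

    def delete(self, handle):
        self._heap.remove(handle)
        heapq.heapify(self._heap)
```

<p>A worker loop then does exactly what the post describes: receive a message, look up the deploy’s
state, advance it, and delete the message only if the deploy has finished; otherwise the message is
left alone and reappears after the visibility timeout, producing the next tick.</p>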

<h2 id="use-blue-green-deploys">Use Blue-Green Deploys</h2>

<p>Skyliner uses <a href="http://martinfowler.com/bliki/BlueGreenDeployment.html">blue-green deploys</a>. Instead
of modifying servers in-place and hoping nothing goes wrong, we leverage EC2’s elasticity and launch
an entirely new set of instances running the new version of your software. If the new version passes
healthchecks, we roll forward by terminating the old instances. If the new version doesn’t come up
cleanly, we roll back by terminating the new instances.</p>

<p>As a result, deploys on Skyliner are:</p>

<ol>
  <li>
    <p><strong>Reliable.</strong> Both rolling forward and backward require only the termination of EC2 instances—no
coordinating system package upgrades and downgrades, no modifying local Git checkouts or switching
symlinks.</p>
  </li>
  <li>
    <p><strong>Safe.</strong> At no point in time is your application’s capacity reduced, so there’s no need to wait
until a scheduled maintenance period to deploy.</p>
  </li>
  <li>
    <p><strong>Fast.</strong> Compared to other safe deploy strategies, Skyliner doesn’t have to reduce the rollout
 rate in order to maintain capacity. In a traditional data center, the set of servers an 
 application can run on is typically fixed, and in order to maintain a minimum capacity, deploys 
 are usually done in a “rolling” fashion, either making changes one server at a time or in small 
 batches. Consequently, rolling deploys take \(\mathcal{O}(\frac{N}{M})\) time for \(N\) total 
 servers and batches of size \(M\). Blue-green deploys, on the other hand, can perform all their 
 operations in parallel, and take \(\mathcal{O}(1)\) time—only as long as the worst-case time of any 
 single instance.</p>
  </li>
</ol>
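
<p>To make the asymptotics above concrete, here is the arithmetic with hypothetical numbers (the
fleet size, batch size, and per-batch time are all made up for illustration):</p>

```python
# Hypothetical fleet: 100 servers, rolled out in batches of 5,
# with each batch (or a single fresh instance) taking ~2 minutes.
n_servers, batch_size, minutes_per_batch = 100, 5, 2

# Rolling deploy: O(N/M) sequential batches.
rolling_minutes = (n_servers / batch_size) * minutes_per_batch

# Blue-green deploy: every new instance launches in parallel, so total
# time is bounded by the slowest single instance, independent of N.
blue_green_minutes = minutes_per_batch
```

<p>With these numbers the rolling deploy takes 40 minutes while the blue-green deploy takes about
2, and the gap only widens as the fleet grows.</p>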

<h2 id="skyliner-on-skyliner">Skyliner On Skyliner</h2>

<p>The end result of these decisions is that several times a day, we can use Skyliner to deploy a new
version of itself. When a build is finished, we use the old version to start the deploy, which
launches EC2 instances with the new version of Skyliner. For a brief moment, both are running
simultaneously until a background thread rolls the deploy forward and begins the termination of the
EC2 instances running the old version. As those shut down, the instances running the new version
oversee the cleanup of the deploy which brought them into existence.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which deploys are made boring.]]></summary></entry><entry><title type="html">The Happy Genius Of My Household</title><link href="https://codahale.com//the-happy-genius-of-my-household/" rel="alternate" type="text/html" title="The Happy Genius Of My Household" /><published>2017-05-16T00:00:00+00:00</published><updated>2017-05-16T00:00:00+00:00</updated><id>https://codahale.com//the-happy-genius-of-my-household</id><content type="html" xml:base="https://codahale.com//the-happy-genius-of-my-household/"><![CDATA[<p class="message">This article was originally posted on blog.skyliner.io on Aug 29, 2016.</p>

<p>When viewed from one angle, we’re in the middle of an infrastructure software renaissance: a
cornucopia of distributed databases, self-healing meshes, and software-defined anythings, all
radiating with potential to help you get your own applications up and running. From a different
angle however, the average software developer resembles a modern Tantalus, ever-grasping at an
ever-retreating image of operability and reliability.</p>

<p>When we started work on Skyliner, we knew that in building a product that helps people quickly get
their applications up and running in the cloud we would have to make a number of hard architectural
decisions not just for ourselves but also for our customers. One of the biggest of these—and one I
remain very confident about—was the decision to avoid container platforms like Swarm and Kubernetes
entirely. We believe that by focusing on a single cloud provider—in our case, AWS—and tapping into
the economies of scale which make cloud computing so attractive, we can deliver a better application
platform for our customers. Here’s why.</p>

<h2 id="pick-a-cloud-any-cloud">Pick a cloud. Any cloud.</h2>

<p>To start, a bit of devops apostasy: the idea of multi-cloud architectures is, for the vast majority
of companies, a complete and total boondoggle.</p>

<p>The idea of jumping from Amazon to Microsoft to Google all nimbly-pimbly is an attractive one, and
for some very constrained workloads the financial upside of pricing arbitrage might even be worth
it. For the other 99% of the market, however, the reality of the situation is that a truly
multi-cloud architecture must either limit itself to the lowest common denominator of the supported
providers’ functionality or else foist on its users the responsibility for resolving the
inconsistencies in feature sets and behaviors (probably at 4am during an outage).</p>

<p>Historically, most companies have settled on a single provider, but with the rising hype about
containers some companies have decided to invest in systems which limit their requirements to the
multi-cloud lowest common denominator but provide a platform on top of that which is so universal
that the affordances and amenities that customers require could be built on top of it. In essence:
<em>what if the cloud was built on top of us</em>?</p>

<p>The potential upside of these container platforms is well-marketed but the costs are far less widely
acknowledged.</p>

<h2 id="you-only-get-one-hill-to-die-on-so-choose-wisely">You only get one hill to die on, so choose wisely.</h2>

<p>A viable production install of a container platform usually looks something like this. First, you
set up a clustered configuration service (e.g. Consul, Etcd, ZooKeeper). Second, you set up a
clustered scheduler service (e.g. Kubernetes, Swarm, etc.). Third, you set up a set of workers which
actually run your software. At every step of the process, you alone are responsible for all
operational tasks associated with keeping these systems running 24/7.</p>

<p>Once you manage to set all that up, you’ll only have a blank slate upon which you can begin the work
of building your own application platform. Log aggregation? Metrics collection? Databases? Backups?
Storage? Alerting? Load balancing? Key management? Service discovery? Authentication? You’ll need to
find solutions for all of these problems, set them up, integrate them, and then keep them running
for the rest of your business’s natural life. (Not that you need it all for the business, but once
you get locked into a serious container platform, the tendency is to push it as far as you can.)</p>

<p>Even if you cobble together a suitable application platform, you’ll still be on the wrong side of
the economic dynamics which make cloud computing so appealing. To wit: there is a large team of
people at Amazon whose job it is to keep the load balancers working 24/7, and because they operate
at a vast scale, they can offer this service for $16 a month per load balancer. That’s about 12
minutes per month of a software developer’s time. It’s unlikely that a jerry-built ball of Nginx and
HAProxy can beat that value proposition.</p>

<p>Unless you’re in the business of building platforms, it’s wise to consider the set of
responsibilities you already have—the features you need to ship, the bugs you need to fix, the
markets you need to address, the customers you need to help—and ask yourself if just participating
in your cloud provider’s economies of scale wouldn’t make more sense.</p>

<p>At Skyliner, we came to the conclusion that the point of running our software in the cloud was to
make life easier—not more exciting—and we think we’ve hit a fine balance in how to accomplish this.</p>

<h2 id="use-containers-not-too-much-mostly-for-packaging">Use containers. Not too much. Mostly for packaging.</h2>

<p>Skyliner uses containers purely as an application packaging system, allowing you to build a
container image on your laptop, build server, etc. and have a high degree of confidence that it can
be successfully deployed. Skyliner doesn’t use registries, scheduling, service discovery,
virtualized networking, or any other advanced features. Instead, we use AWS services with proven
reliability like S3, Autoscaling, and Elastic Load Balancing — services which have seen almost a
<a href="https://aws.amazon.com/blogs/aws/amazon_ec2_beta/">decade</a> of continuous use and improvement. Each
instance in your application’s autoscaling group downloads the container image from S3, loads it,
and runs it locally. It’s simple, it’s reliable, and it’s managed entirely by AWS.</p>

<p>While there’s a lot of interesting technology on the horizon, the point of Skyliner is to help you
run your applications today. Even if you’re excited by the potential of containers, you already have
software you need to run—your own—and you probably won’t be helped in that by adding additional
operational tasks to your long list of responsibilities.</p>

<h2 id="if-you-lived-here-youd-be-home-by-now">If you lived here, you’d be home by now.</h2>

<p>As part of building Skyliner, we actively sought out economies of scale and technical advantages in
AWS—places where Amazon teams provide far better value than a small company could on their own. For
example:</p>

<ul>
  <li>Elastic Load Balancing can offer superior service by virtue of the fact that Amazon controls the 
networking fabric in ways that Amazon customers cannot;</li>
  <li>Key Management Service is backed by cryptographic hardware but is orders of magnitude cheaper than 
going it alone;</li>
  <li>S3’s storage pricing was industry-changing when it launched a decade ago and it’s only gotten
cheaper;</li>
  <li>a t2.medium instance on EC2 is about half the cost of a c4.8xlarge instance when normalized for 
CPU and memory.</li>
</ul>

<p>With Skyliner, you can leverage the best parts of the existing AWS ecosystem without having to be an
expert. We do the curation for you. It’s our job to do the work of researching each service,
weighing its value proposition, selecting the winners, and connecting it all together. As a
customer, you get some amazing functionality directly out of the box:</p>

<ul>
  <li>A full, tamper-proof audit log of all AWS operations performed on your account via CloudTrail.</li>
  <li>Load balancers with HTTP/2, WebSockets, and a modern, managed TLS stack via ELB’s Application Load 
Balancers.</li>
  <li>Scalable log aggregation with search and alerting via CloudWatch Logs.</li>
  <li>Free SSL certificates which renew automatically via Amazon Certificate Manager.</li>
  <li>Secure configuration parameter storage via Key Management Service.</li>
  <li>Metric aggregation and alerting with CloudWatch.</li>
  <li>Resilience to data center outages via Autoscaling Groups and multi-availability zone Virtual 
Private Clouds.</li>
</ul>

<p>The core component of what we do at Skyliner is turn our expertise with AWS into integrated,
reliable features for you, and because Skyliner runs everything on your own AWS account, you’re not
limited to what we can do. Want to set up a managed database server with RDS? Want to use Kinesis
for stream processing? Redshift for big data analytics? SQS for background jobs? CloudFront for
faster web pages? Go ahead and set it up. Where possible, we add things like subnets and security
groups to make it easy for you to set things up yourself, even if we don’t automate their management
yet.</p>

<p>As software developers, we understand the allure of shiny new technologies, but ultimately we
decided that we prefer the quiet satisfaction of sleeping through a night while on call. After all,
we’re not just building a platform for our customers—we run our applications on Skyliner, too.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which clouds are considered.]]></summary></entry><entry><title type="html">On Containers</title><link href="https://codahale.com//on-containers/" rel="alternate" type="text/html" title="On Containers" /><published>2016-08-30T00:00:00+00:00</published><updated>2016-08-30T00:00:00+00:00</updated><id>https://codahale.com//on-containers</id><content type="html" xml:base="https://codahale.com//on-containers/"><![CDATA[<p>As you may or may not know, I founded a company
(<a href="https://www.skyliner.io">Skyliner</a>) with some friends at the beginning of the
year, focused on making it easier to set up and operate web applications on
<a href="https://aws.amazon.com">AWS</a>.</p>

<p>I just wrote an article <a href="https://blog.skyliner.io/the-happy-genius-of-my-household-2f76efba535a">about our approach to containers</a>
and the economics underlying that approach.</p>

<p>My co-founder <a href="http://mcfunley.com">Dan McKinley</a>–he of the “choose boring
technology” motto–wrote an excellent article
<a href="https://blog.skyliner.io/no-way-out-but-through-1db41c648697">on how Skyliner deploys your applications</a>
and why we built it like that.</p>

<p>My other co-founder, Marc Hedlund–previously the VP of Engineering at Stripe
and Etsy before that–needs to finish his blog post about why we’re not a
traditionally VC-funded company. <em>*ahem*</em></p>

<p>I have some thoughts about how Skyliner is the philosophical extension of my
work with projects like Dropwizard, but I’ll save them for later.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which I write a blog post somewhere else.]]></summary></entry><entry><title type="html">Fan-In</title><link href="https://codahale.com//fan-in/" rel="alternate" type="text/html" title="Fan-In" /><published>2016-02-01T00:00:00+00:00</published><updated>2016-02-01T00:00:00+00:00</updated><id>https://codahale.com//fan-in</id><content type="html" xml:base="https://codahale.com//fan-in/"><![CDATA[<p>For a long time, my go-to question when interviewing a candidate was about
building <a href="https://www.twitter.com">Twitter</a>. I was looking to see how they
thought about scalability and so, in classic programming interview cargo cult
style, I would try to lead the candidate down the road of designing a system
which could handle the pub/sub-y part of Twitter’s timeline.</p>

<p>(This was years ago, mind you, back when Twitter’s product design was simple
enough that telling engineers at Twitter how to do their jobs was a common
hobby.)</p>

<h2 id="the-bieber-problem">The Bieber Problem</h2>

<p>My question’s focus was the
<a href="http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html">Bieber problem</a>:
given that Mr. Bieber is now followed by over 75 <em>million</em> people, how do you
build a timeline system in which his adoring fans can, with low latency, see his
tweets? Candidates would toss something up on the board, I’d ask questions about
why they went with a particular approach or how they thought a particular bit
would handle increased load, and they’d toss more stuff up on the board. It was
great.</p>

<p>(Like almost all interview questions, it was ultimately just a vehicle for my
own prejudices and superstitions but it passed for clever at the time and no
one, including myself, noticed.)</p>

<h2 id="the-other-bieber-problem">The <em>Other</em> Bieber Problem</h2>

<p>Some years later, I described this question to a friend of mine over drinks at
one of the City bars where people argue about databases and eat tater tots. He
was a third-wave Twitter old blood, depending on how one counts his rings, and
he smiled and nodded and told me about the other Bieber problem.</p>

<blockquote>
  <p>Fan-out’s rough, but that’s mostly parallelizable. The real rough part is that
all of his fans reply.</p>
</blockquote>

<p>I also smiled and nodded and, thinking only of computers, commiserated: “Wow,
that sounds hard.” Like most things, I hadn’t actually thought it through.</p>

<h2 id="the-small-nice-thing">The Small, Nice Thing</h2>

<p>The other day I shared <a href="/a-small-nice-thing/">a small observation</a> I’d made about programming
in Clojure and compared it to a number of other languages I’ve used over the
years. Unlike some of the other things I’ve written, it was not an advocacy
piece. I had found a neat bit of Clojure where everything lined up
perfectly—like the congruence of azimuth and street which slots the sunset
neatly between skyscrapers on certain days—and wanted to share. I had hoped to
return to my long-neglected blog by publishing a quick observation about some
recent work of mine, the way I’d done back when I’d tried to explain
<a href="/what-makes-jersey-interesting-parameter-classes/">what I liked</a> about <a href="/what-makes-jersey-interesting-injection-providers/">JAX-RS</a> to folks in the Ruby community.</p>

<p>(Again, this was years ago, back when Rails was new and shiny enough that
telling rails-core how to do their job was a common hobby.)</p>

<p>And so I wrote and published a post comparing the pair-wise addition of
sequences of numbers in some of the languages with which I’ve worked. I had
hoped to shake my reluctance to write, but what happened instead confirmed the
dread I’d felt about writing at all.</p>

<p>Some folks—friends, mostly—were happy to see me return to writing, which was
nice to hear. A few—again, mostly friends—suggested more idiomatic
solutions, and I updated the post with their suggestions. Others—mostly
Clojure programmers—said they’d noticed the same thing or shared other little
serendipitous bits of code, which made me smile.</p>

<p>Most people, though, wanted to argue. In retrospect, it’s a classic bikeshed
setup—a simple bit of code about which everyone can have an opinion—and
that’s exactly what happened. Folks mentioned me to tell me how it could be done
in languages I haven’t used and don’t care about, to tell me how much better
statically-typed languages are (as if I hadn’t spent years programming in
statically-typed languages), to suggest that I cherry-picked this case in order
to better advocate for Clojure, to shit-talk the languages I’ve used,
&amp;c. &amp;c. &amp;c.</p>

<p>This wasn’t entirely unexpected—lord knows I’ve written some fairly incendiary
shit and lord knows I’ve argued on the internet about computers before—but
what took me by surprise was the degree to which these fairly garden-variety
responses fucked up my day.</p>

<p>In order to explain why, though, I need to back up a bit.</p>

<h2 id="the-arc-of-the-choral-universe">The Arc Of The Choral Universe</h2>

<p>At the dawn of time, when the earth was still new and everyone hosted their own
websites, responses to a piece of writing arrived as either emails or more
pieces of public writing. If your work moved someone to respond—agreement,
correction, outrage, whatever—they would either send you an email or publish
something themselves.</p>

<p>Later, when the animals had been named and everyone was running their blogs on
insecure bits of PHP, responses would arrive as comments on your post or
pingbacks to new posts on other insecure bits of PHP.</p>

<p>Later still, when people began to walk on two legs and everyone gathered in
central forums and link aggregators, responses would collect in long-running
threads as commenters battled for karma and other imaginary things.</p>

<p>Now, as the stars begin to dim and humans dip and swerve in flocks of social
media ephemera, responses are instantaneous and direct and physical, our nascent
haptic helpers tugging gently at our sleeves to let us know that someone,
somewhere, has had an opinion at us.</p>

<p>At each stage we gain something and we lose something. I miss what we’ve lost
most recently. When comments were emails and other blog posts, they could be
ignored. It’s a simple thing to add a filter to, e.g., mark all emails
containing the word “bcrypt” as read and move them to a folder one never
checks. Emails and web sites don’t come find you; they don’t interrupt
conversations of yours to interject their opinions; they don’t make your watch
subtly tap you on the wrist. They wait.</p>

<p>Twitter, though, is different. It’s <em>immediate</em>. There are no messages left
unread, no inbox, no filters, no delay, no curation. Tweets cause notifications,
which are instantly pushed to the devices you carry with you daily. Twitter’s
also <em>convex</em>. Everyone’s connected to everyone else, with only blocks and protected
accounts to hide behind. You can find anyone, <a href="http://www.tmz.com/2016/01/28/neil-degrasse-tyson-bob-earth-flat-beef/">join any conversation</a>,
<a href="http://www.buzzfeed.com/michaelblackmon/you-wear-cool-pants#.hd0qaY060">spectate any exchange</a>, <a href="https://broadly.vice.com/en_us/article/we-interviewed-the-youths-who-tweet-fuck-me-daddy-at-the-pope">say anything</a>. All conversations are
accessible, to the point that a tweet about someone which doesn’t notify a
referent isn’t a tweet—it’s a <em>subtweet</em>.</p>

<p>These are precisely the fishbowl dynamics which made me fall in love with
Twitter, but they also have a deeply sinister side.</p>

<h2 id="the-attention-lens">The Attention Lens</h2>

<p>It’s a common enough phenomenon: someone says something on Twitter, perhaps
about <a href="https://twitter.com/mallelis/status/691776454082859008">how cute sleepy dogs are</a>, which suggests a particular
reply. Someone sees the place where a joke would go, so
<a href="https://twitter.com/Marina_Berger/status/691776631384494080">they crack it</a>. Everyone laughs. Someone else sees the same empty place
where a joke would go, so <a href="https://twitter.com/lukedones/status/691776840441171968">they also make the joke</a>. Everyone
laughs. Another person makes <a href="https://twitter.com/_BVM/status/691777198295027713">the joke</a>, and <a href="https://twitter.com/esammer/status/691777252523114497">another</a>, and
<a href="https://twitter.com/jerrykuch/status/691778026317684736">another</a>, and <a href="https://twitter.com/gratuitous_arp/status/692209380826152960">another</a>.</p>

<p>(It’s a good joke, for what it’s worth.)</p>

<p>I’ve started thinking of this as an <em>attention lens</em>: small, human amounts of
individual attention are refracted through social media to converge on a single
person, producing the effect of infinite attention at the focal point. Even in
the event that everyone means well, the experience is <a href="https://www.youtube.com/watch?v=HPeattKV74A">surreal</a> for
the person at the focal point of the lens.</p>

<p>This dynamic, like most, is objectively worse for people who are not
stereotypically authoritative, which on the internet means everyone who isn’t a
white, straight, cis guy between 18 and 50 years old. The inhibitions to
interrupt are lowered, the desire to interject is higher, and as a consequence
the mentions pile in that much more. Attention lenses also underlie the
asymmetric dynamics of mass harassment on Twitter. Like an amplification DDoS
attack, each individual participant need only contribute a handful of messages
to flood the target’s mentions. Combine that with a small set of leaders
<a href="https://en.wikipedia.org/wiki/Thomas_Becket#Assassination">indirectly coordinating</a> the daily hate and you have the blueprints for
a fuckboy mention-laser capable of melting steel beams with collimated rays of
anime avatars.</p>

<p>Needless to say, this worries me.</p>

<h2 id="the-empty-space">The Empty Space</h2>

<p>It’s tempting to say that this is a social problem, to add “check someone’s
mentions before mentioning them” to the endless scroll of ignored netiquette, to
sigh and lament that harassment will always be with us, but that’s fundamentally
a cop-out. The problem is that we build social software with no affordances for
the limits of individual attention in the name of expedience, of engagement, of
the “marketplace of ideas”, of democracy itself.</p>

<p>The idea that we would not want to see something—to not <em>consume</em>
something—rankles the <a href="http://www.alamut.com/subj/ideologies/pessimism/califIdeo_I.html">Californian ideology</a>. While our giant
advertising companies perfect centralized spam-blocking technology, we debate
whether it’s ethical for individuals to <a href="http://www.npr.org/2015/09/21/442308407/apple-ignites-debate-over-ad-blocking-software">opt out of displaying ads</a> in
their browsers. (<em>Something there is that doesn’t love a wall,</em> it’s true, but
only <em>our</em> walls. <em>Theirs</em>, obviously, make good neighbors of us all.)</p>

<p>It’s patently true that <a href="http://danilocampos.com/2014/07/the-least-twitter-could-do/">Twitter could do more</a> but it’s also patently
true that this isn’t one of Twitter’s priorities. They’re an advertising company
with a sliding stock price, not a charity, and it doesn’t make business sense to
bolt on a bunch of power-user-only features which can only result in fewer
sponsored tweets hitting eyeballs. This is the kind of system I’ve let into my
life, though. This is who intermediates my interactions with most of my friends.</p>

<p>I don’t have a solution for this. It’s a set of tensions I think humanity will
have to resolve eventually, but it seems unlikely that boom/bust funding of
winner-take-all companies will produce systems that respect the limits of human
attention.</p>

<p>We’re all deeply in love with the possibilities of fan-out, but what’ll get us
in the end is the fan-in.</p>

<h2 id="updated-february-05-2016">Updated February 05, 2016</h2>

<p>Almost a year ago, <a href="http://natdudley.com/">Nat Dudley</a> wrote <a href="http://natdudley.com/blog/twitter-ux-and-bullying/">an insightful analysis</a> of
how Twitter’s UX leads to, as she puts it, “clusterfucks”:</p>

<blockquote>
  <p>We’re placing a huge expectation on individuals to strictly adhere to
behaviour that is in direct contrast to the behaviour Twitter’s design
encourages them to do.</p>
</blockquote>

<p>As you’d expect from a UX designer, it’s a much more focused and practical
article than this one and includes some suggested changes to help fix
things.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In which attention is lensed.]]></summary></entry></feed>