bdunagan

How Faster Compute, More Data, and Better Algorithms Scaled AI to ChatGPT

Sat, 31 May 2025 12:00:00 +0000

Unsplash - @viazavier

I heard someone comment that NeurIPS (the Conference on Neural Information Processing Systems, a leading machine learning conference) has grown in interest and narrowed in focus over the last two decades: there were 500 people in 2003 discussing many different AI fields, 1,500 people in 2013 focused on machine learning, and 15,000 people in 2023 only talking about Transformers.

Why? Results from the Transformer neural network architecture keep improving as compute, data, and parameters scale up. There are now 2-trillion parameter LLMs trained on 30-trillion token data sets and running on thousands of GPUs. Billions of people now use LLMs in Google Search’s AI Overview and OpenAI’s ChatGPT.

Let’s walk through the three fundamental forces that have pushed the field forward at such a fast pace: faster compute, more data, and better algorithms.

Faster Compute

LLMs don’t run on CPUs. They run on GPUs. In Building a 10-Million Parameter LLM with 300 Lines of Python and Training It in 10 Minutes, I tried running an 800k-parameter LLM on a CPU and a GPU: the GPU was 118x faster (2 hours vs 1 minute). CPUs are designed for serial execution of any set of instructions; they can handle running an OS, writing a Word document, and resizing a photo. GPUs are designed for parallelized execution of a specific set of instructions: math, specifically for graphics. It’s even in the name. GPU stands for Graphics Processing Unit, as they were originally designed for computer games.

CPUs have a small number of processing cores (8-core or even 64-core), but GPUs have thousands of cores. The Nvidia GeForce RTX 4090 ($1,600 MSRP) has 16,384 cores on a single card. Nvidia wrote a huge software framework called CUDA to abstract away the low-level details of all those cores, so that game developers could write simpler high-level software to leverage them in parallel.

Compute: 100 GFLOPS to 100 TFLOPS in 20 years

Moreover, the amount of compute per card has been growing incredibly fast. Over the last twenty years, the compute power of a single GPU card has grown 1,000x, from 10^2 GFLOPS to 10^5 GFLOPS. FLOPS stands for floating-point operations per second. GFLOPS stands for giga-FLOPS, or 10^9 FLOPS, and TFLOPS (tera-FLOPS) is 1,000x larger at 10^12 FLOPS. The latest GPU cards clock in at 100 TFLOPS.

It’s purely coincidental that neural networks, like computer graphics, also require processing massive amounts of math that can be parallelized, but GPUs are a fantastic fit for neural networks. Fundamentally, GPUs are so much faster than CPUs at machine learning specifically because they are designed for solving that specific numerical problem incredibly efficiently, processing math computation over thousands of cores in parallel.

Tools such as PyTorch abstract away CUDA and tailor the experience to neural network programming. With PyTorch, developers can simply assign models to “the GPU”, without any lower-level understanding of blocks, threads, or cores. In fact, I built a 10-million parameter LLM in the post linked above by training and running inference on a single GPU.

However, a 2-trillion parameter model does not train or run on “the GPU”. PyTorch isn’t magic and reinforces the lesson that LLMs are a leaky abstraction at scale. To train a model that requires thousands of individual GPUs, the model needs to be split up using parallelism, referred to as 3D parallelism (as there are three independent axes):

Data Parallelism (DP): Data is split into batches. Parameter sharding is a memory optimization on top of this with tools like ZeRO (Zero Redundancy Optimizer) and FSDP (Fully Sharded Data Parallel). This is for training or high-volume inference.
Tensor Parallelism (TP): Tensors are split within a single layer.
Pipeline Parallelism (PP): The pipeline between layers is split.

Meta said that pre-training Llama 4 Behemoth (the 2T-parameter model) required 32,000 GPUs, but let’s focus on the simpler task of running the model. Llama 4 Behemoth might be a 2T-parameter model, but it uses Mixture-of-Experts (see “Better Algorithms”) and so only has 288B active parameters.

In practice, running the model might use an architecture like this using Nvidia 80GB H100s: 1 DP x 2 PP x 8 TP or 16 GPUs. That setup enables two axes of parallelism: splitting the model across layers for pipeline parallelism and then splitting within the individual layers for tensor parallelism. In fact, for a single request, pipeline parallelism is used simply to fit the model into GPU memory; when there are multiple requests, pipeline parallelism enables micro-batching to fully utilize the GPUs as each request progresses through the model layers.
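To make the GPU arithmetic concrete, here is a minimal sketch of how a 1 DP x 2 PP x 8 TP layout could map GPU ranks to coordinates. The function name parallel_layout and the rank ordering (tensor-parallel index varying fastest) are assumptions for illustration; real frameworks pick their own orderings.

```python
from itertools import product

def parallel_layout(dp, pp, tp):
    """Map each GPU rank to a (dp, pp, tp) coordinate.

    Assumption: the tensor-parallel index varies fastest, so the GPUs
    that split a single layer get consecutive ranks.
    """
    coords = product(range(dp), range(pp), range(tp))
    return dict(enumerate(coords))

layout = parallel_layout(dp=1, pp=2, tp=8)
print(len(layout))  # 16 GPUs = 1 * 2 * 8

# Ranks that hold the first pipeline stage (the first group of layers).
stage0 = [rank for rank, (_, stage, _) in layout.items() if stage == 0]
print(stage0)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each pipeline stage here is one group of eight GPUs, and each group splits the tensors inside its layers eight ways.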

Combining all of these axes of parallelism enables LLMs to train and run quickly and efficiently at scale, but keep in mind that this entire description is all to generate a single token. After that token is generated, it’s appended to the context window, and the entire process runs again. That workflow cannot run in parallel because those steps are dependent. Similar to most at-scale cloud services, the actual infrastructure that undergirds being able to use ChatGPT is incredibly complicated.
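The sequential dependency between tokens can be sketched with a toy loop. The next_token function below is a hypothetical stand-in for a full forward pass across all of those GPUs; it only returns a deterministic pseudo-random token id so the example runs anywhere.

```python
import random

def next_token(context, vocab_size=10):
    """Hypothetical stand-in for a forward pass: returns one token id.

    The result depends on the whole context, which is the point: token
    N+1 cannot be computed until token N exists.
    """
    rng = random.Random(sum(context) + len(context))
    return rng.randrange(vocab_size)

def generate(prompt, max_new_tokens):
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # one full pass per generated token
        context.append(token)        # the new token feeds the next pass
    return context

out = generate([1, 2, 3], max_new_tokens=5)
print(out)
```

The work inside each call can be parallelized across GPUs, but the loop itself cannot, because every iteration reads the output of the previous one.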

At a much higher level, we can look at how increasing amounts of compute has decreased the costs of generating tokens.

Cost: GPT-3: $60, GPT-3.5: $20, Llama 3.2: $0.06

Token costs have plummeted by 1,000x over the last two years, from $60/million tokens for GPT-3 to $0.06/million tokens for Llama 3.2. Faster compute is making this massive amount of numerical computation far cheaper.

More Data

As compute becomes cheaper, there is an accelerating need for high-quality training data. There are training data sets for vision, speech, and other modalities, but let’s focus on text and see how training data sets have grown over just the last eight years.

Training Words: GPT-1: 1B, GPT-3: 1T, Llama 4: 20T

Since 2018, the training size for large language models has grown 20,000x. GPT-1 was trained on 1B words. Wikipedia has 55 million articles across all languages, comprising 29 billion words, and Llama 4 was trained on almost 1,000x more data, including text, image, and video datasets.

Better Algorithms

Faster compute is great. More data is great. Neither matters if the algorithm (the neural network model) cannot embed the nuance of the data. That’s where Transformers come in. The Transformer architecture drastically increased the number of parameters for neural networks but enabled them to store far more nuance in a parallelizable data flow. More recently, Mixture-of-Experts (MoE) networks pushed the number of parameters even higher and increased the accuracy of the model while keeping the number of “active” parameters much lower.

Let’s walk through how quickly deep learning neural network models have grown in terms of parameter count.

Parameters: 1.5B to 2T in 8 years

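Measured from GPT-1’s 117 million parameters, that’s roughly a 17,000x increase in eight years; even measured from GPT-2’s 1.5 billion, it’s more than 1,000x. For every parameter in GPT-1, there are about 17,000 in Llama 4 Behemoth, and every one of those parameters can learn a little bit more about the data.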

The Bitter Lesson

Every graph above is log scale.

The three dimensions driving the current wave of AI are all increasing exponentially. The pace of change feels so blisteringly fast these days precisely because all three drivers are advancing so quickly. We keep shipping faster compute. We keep expanding the data sets to train AI on. We keep improving the algorithms to train. Progress on any one of these would push AI forward. Our current exponential progress on all three made ChatGPT a reality.

In fact, there are a growing number of examples, from Google’s AlphaZero to DeepSeek’s R1, where compute and search (self-play or reinforcement learning) scale results better than any human-derived heuristic systems that researchers can come up with. Better algorithms, such as Transformers and Mixture-of-Experts, that leverage faster compute and more data continue to produce increasingly accurate results.

Rich Sutton wrote an influential essay in 2019 titled “The Bitter Lesson”, arguing that the best way forward for AI is leveraging compute and search:

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin…One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.”

Building a 10-Million Parameter LLM with 300 Lines of Python and Training It in 10 Minutes

Sat, 23 Mar 2024 12:00:00 +0000

DALL-E: "2:1 size picture of toy building blocks halfway through building a complicated structure"

OpenAI’s ChatGPT, Google’s Gemini, Meta’s Llama2, and Mistral’s Mixtral are all examples of large language models (LLMs). They are general-purpose machine learning models that can handle a wide variety of tasks. They seem magical, but we can build a small-scale 10-million parameter example of an LLM in around 300 lines of Python and train it in ten minutes on Google Colab for free. As we’ll see, creating and training a model is easy; achieving good performance is difficult.

The steps we’ll go through are:

Creating a data set to train on
Creating a machine learning model
Training the model on the data set
Running inference on the model
Thinking about leaky abstractions

Throughout this post, I’ll be using machine learning terminology. See my post A Glossary for Understanding Large Language Models in AI such as OpenAI’s GPT-4, Meta’s Llama2, and Google’s Gemini for any unfamiliar terms.

Let’s get started. All the code is available at github.com/bdunagan/bdunaganGPT.

Data Set

For a small data set, I compiled a single text document with the contents of every blog post on this blog, bdunagan.com. The blog is written in Jekyll and markdown, so I wrote a Python script (create_data_set.py) to concatenate all the files in the “_posts” folder (195 files) into a single document at bdunagan.com.txt. We’ll use this text file as the training data for our small Transformer model.

The training data has 363,859 characters with a vocabulary of 107 using character-level tokens, as opposed to subword tokens like OpenAI’s tiktoken and Google’s sentencepiece.
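Character-level tokenization is simple enough to sketch in a few lines. The short string below is a stand-in for the contents of bdunagan.com.txt, and the helper names (stoi, itos, encode, decode) are illustrative rather than taken from the notebook.

```python
text = "hello blog"  # stand-in for the contents of bdunagan.com.txt

# The vocabulary is every distinct character in the training text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> token id
itos = {i: ch for ch, i in stoi.items()}      # token id -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(len(chars))               # vocabulary size: 7 for this stand-in
print(encode("hello"))          # [4, 2, 5, 5, 6]
print(decode(encode("hello")))  # hello
```

Run against the real text file, the same approach would produce the 107-character vocabulary described above.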

Even in this simple example, I had to go through ten iterations of generating the data: manually inspecting, training, and testing it to see what data worked well. For instance, the model was confused by HTML and Liquid tags, so I removed those. I also needed to remove Jekyll’s front matter, which is metadata for a particular blog post. Because it’s a text machine learning model, I didn’t even need to worry about labeling the data because the next predicted token (the label) is the next character in the text document.
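The "labels for free" idea can be shown with a small sketch: the target sequence is just the input shifted by one position. The function name make_example and the use of ord() as a stand-in tokenizer are assumptions for illustration.

```python
def make_example(tokens, start, block_size):
    """Return (inputs, targets) where targets are inputs shifted by one."""
    x = tokens[start : start + block_size]
    y = tokens[start + 1 : start + block_size + 1]
    return x, y

tokens = [ord(c) for c in "blog post"]  # stand-in token ids
x, y = make_example(tokens, start=0, block_size=4)

# Each prefix of x is a training context; the matching entry in y is its label.
for t in range(len(x)):
    print(x[: t + 1], "->", y[t])
```

One block of four tokens yields four training examples, with no manual labeling involved.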

Data collection and cleaning is an incredibly important and time-intensive task for any machine learning application. Without it, “garbage in, garbage out”.

There are many, many public data sets to choose from to start training a machine learning model, including HuggingFace’s data sets and TensorFlow’s catalog.

Model

Our model is ~300 lines of Python, available as a Jupyter Notebook on GitHub (bdunaganGPT.ipynb), and Google Colab makes it easy to import.

Click on this link to open the Jupyter Notebook in Google Colab.
Alternatively, you can visit https://colab.research.google.com/, click File > Open Notebook, select “GitHub”, and paste the URL for bdunaganGPT.ipynb.

Open the Jupyter Notebook file from GitHub in Google Colab

To complete the training in 10 minutes, we need a GPU. Change the processor by clicking on the arrow in the top right and selecting “Change runtime type”. Select the free T4 GPU, click “Save”, then click “Connect” next to the arrow.

Click on the arrow and select "Change runtime type"

Select the T4 GPU, click "Save", then click "Connect"

After the notebook loads in Google Colab and the correct processor is connected, press the Play button.

Press the play button to train and run this model for free in Google Colab

Google Colab will run this notebook: training the 10-million parameter model, running inference, and outputting 100 new tokens, all in 10 minutes for free using the T4 GPU with the following hyperparameters:

GPTTest(batch_size=32, block_size=128, max_iters=1400, learning_rate=1e-3, n_embd=128, n_head=4, n_layer=52, dropout=0.1, device=device)

The model achieves a loss of 1.680 and produces the following text:

"Store 200 nice detail handled details of Launch and files or an useful base Amazon Vier. I also read"

Some people will find this output astonishing. Others will find it laughably bad. Both are right. The difference is in expectation and understanding of the building blocks of the neural network.

Before training, the model knows nothing; it doesn’t understand or write English. Its only training for knowing what character (token) to write next is based solely on my blog’s contents. After running a training loop 1,400 times on one GPU, the model was able to string together characters into English words, capitalize the first word, and add some spaces and even a period. It’s not ChatGPT though, which took months to train on thousands of GPUs with billions of tokens.

This model is based on Andrej Karpathy’s excellent nanoGPT model from his Zero-to-Hero YouTube course, a fantastic and accessible deep dive into deep neural networks. Karpathy is a cofounder of OpenAI and was the Director of AI at Tesla before recently returning to OpenAI. Learn about how to write a model like this in his “Let’s build GPT: from scratch, in code, spelled out” YouTube video.

I updated the Python code in three ways:

PyTorch’s Multihead Attention: I switched from Karpathy’s version of Multihead Attention to PyTorch’s version using an attention mask.
Positional Encoding: I experimented with the original sin() and cos() positional encoding function with no learned parameters from Google’s “Attention Is All You Need” paper, but testing showed that Karpathy’s simpler learned positional encoding reduced loss a bit more.
Toggles: I added three different toggles to experiment with changing the model architecture. Two of them were the above items, and the third was disabling residual connections.
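For reference, the fixed sin() and cos() encoding from the paper can be sketched in plain Python. This is a minimal version assuming the paper’s formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name is illustrative and this is not the notebook’s code.

```python
import math

def sinusoidal_encoding(block_size, n_embd):
    """Build a block_size x n_embd table of fixed positional encodings."""
    pe = [[0.0] * n_embd for _ in range(block_size)]
    for pos in range(block_size):
        for i in range(0, n_embd, 2):  # i is the even column index (2i in the paper)
            angle = pos / (10000 ** (i / n_embd))
            pe[pos][i] = math.sin(angle)
            if i + 1 < n_embd:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(block_size=128, n_embd=128)
print(len(pe), len(pe[0]))  # 128 128
print(pe[0][:4])            # position 0 alternates sin(0)=0.0 and cos(0)=1.0
```

The table has no trainable values, which is the contrast with the learned positional embedding that performed slightly better here.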

None of my changes improved the model’s original performance, but they were useful for experimentation.

Let’s compare it to GPT-3, using the table from OpenAI’s 2020 research paper titled “Language Models are Few-Shot Learners”:

Hyperparameter table from GPT-3 paper

Hyperparameters are attributes about the model and training, rather than weights within the model that we’re tuning. Each of these models was trained on 300 billion tokens, meaning each model had the same massive set of training data to learn from, but there are a number of hyperparameters listed that define the structure of each model. Let’s walk through each hyperparameter, including the equivalent variable name in our bdunaganGPT version:

n_params (parameters): The number of total weights (parameters) in the model.
n_layers (n_layer): The number of aggregate layers in the model. Each layer is composed of multiple components.
d_model (n_embd): The dimensionality of the embedding vector.
n_heads (n_head): The number of attention heads in each multi-head attention block.
d_head (head_size): The dimensionality of each attention head, determined by the two hyperparameters above: n_embd // n_head.
Batch Size (batch_size): The number of examples in each batch as it’s processed.
Learning Rate (learning_rate): The step size of how much we update the weights on every back propagation.
Context Window (block_size): The number of tokens each model reads in to predict the next token. All GPT-3 models used 2048 tokens.
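A rough sketch of how these hyperparameters relate to each other follows. The 12 * n_layer * n_embd^2 estimate is a common rule of thumb rather than the notebook’s exact count: it assumes four d x d attention projections plus a feed-forward block with 4x expansion, and it ignores embeddings, biases, and layer norms. The function names are illustrative.

```python
def head_size(n_embd, n_head):
    """Each head gets an equal slice of the embedding vector."""
    assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
    return n_embd // n_head

def approx_block_params(n_layer, n_embd):
    """Rule-of-thumb weight count for the Transformer blocks only.

    Assumes ~4*d^2 attention weights and ~8*d^2 feed-forward weights
    per layer; embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layer * n_embd ** 2

# bdunaganGPT hyperparameters from the GPTTest(...) call
print(head_size(128, 4))               # 32
print(approx_block_params(52, 128))    # 10223616, about 10 million

# Largest GPT-3 row: 96 layers, d_model 12288, 96 heads
print(head_size(12288, 96))            # 128
print(approx_block_params(96, 12288))  # 173946175488, about 175 billion
```

The same formula lands near both the 10-million and the 175-billion parameter counts, which is one way to see that the two models differ mostly in size.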

With these hyperparameters, OpenAI configured GPT-3, just like we configured bdunaganGPT. GPT-3 is the same basic architecture as the model we ran. GPT-3 is just much bigger.

There is a significant caveat that models on the scale of GPT-3 can be optimized in various ways. For instance, OpenAI’s paper on GPT-3 refers to “alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.” But broadly speaking, the architectures are the same.

Training

The loss calculation is independent of our architecture, so we can test the loss with different architectures, processors, and hyperparameters to see which minimizes the training times and loss. For instance, doubling the layers might reduce the loss by a bit but double the training time, which might be an unacceptable tradeoff. I used cross entropy loss as the loss function for the model feedback. It’s the same as negative log likelihood and is derived from KL divergence.

Let’s look at how the loss decreases over time. With random weights, the loss should be around -ln(1/n) where n is the number of possible tokens. In my case, there are 107 possible characters in bdunagan.com.txt, so the baseline loss is around 4.67. The loss quickly decreases though. By saving the loss for each iteration, we can use Matplotlib to plot how the loss changes over time.

plt.plot(steps, losses)

The significant noise in the loss graph is caused by the variance in the batches. As the LLM progressively trains across different batches, the loss goes up and down locally while still decreasing on average, and the plot visually confirms that the model is improving with more training.
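Two quick checks on the loss curve can be sketched with the standard library: the random-weight baseline of -ln(1/n), and a moving average to see the trend through the batch noise. The moving_average helper and the sample loss values are illustrative assumptions, not output from the notebook.

```python
import math

def cross_entropy(probs, target):
    """Negative log of the probability assigned to the correct token."""
    return -math.log(probs[target])

# A model with random weights is roughly a uniform guess over 107 characters.
n = 107
uniform = [1.0 / n] * n
baseline = cross_entropy(uniform, target=0)
print(round(baseline, 3))  # 4.673

def moving_average(values, window):
    """Average each value with the (window - 1) values before it."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

noisy_losses = [4.0, 2.0, 3.0, 1.0]  # made-up values for illustration
print(moving_average(noisy_losses, window=2))  # [3.0, 2.5, 2.0]
```

The raw values bounce up and down, while the averaged values fall steadily, which is the pattern the plot shows at a larger scale.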

The problem is that this model is just one combination of hyperparameters. There could be a different set that produces an even better model. In theory, we could test each combination of hyperparameters to find the optimal one, but in practice, large models take months to train. We cannot exhaustively search the possible space for the minimum loss.

To get a sense of how the loss changes, let’s look at a small set of combinations, all tested on Google Colab’s T4 processor:

Hyperparameters	#1	#2	#3	#4
Parameters	5,451	31,339	212,331	1,621,611
Layers	4	8	16	32
Embedding Vector	8	16	32	64
Heads	4	8	16	32
Batch Size	8	16	32	64
Block Size	32	128	128	256
Learning Rate	1e-2	1e-3	1e-4	1e-5
Dropout	0.0	0.1	0.2	0.3
Iterations	1,000	1,000	1,000	1,000
Training Time(s)	19	36	77	1218
Validation Loss	2.627	2.588	2.770	3.253

More parameters don’t immediately translate to lower loss. With a lower learning rate, the 1.6M-parameter model ended with a higher loss than the 30k-parameter model after the same number of training iterations.

Even with one particular hyperparameter combination, more training isn’t always better. As an example of overfitting, the table below shows the validation loss growing while the training loss continues to shrink, so even the number of iterations is an important hyperparameter.

Iterations	Training Loss	Validation Loss
500	2.0993	2.1271
1,500	1.2805	1.6006
3,000	0.9815	1.7457

An 800k parameter model with batch_size=64, block_size=128, max_iters=3000, learning_rate=1e-3, n_embd=64, n_head=8, n_layer=16, dropout=0.0
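One simple way to read the overfitting table is to pick the checkpoint with the lowest validation loss and watch the gap between the two losses. This sketch uses the three rows from the table; the variable names are illustrative.

```python
# (iterations, training loss, validation loss) from the table above
results = [
    (500, 2.0993, 2.1271),
    (1500, 1.2805, 1.6006),
    (3000, 0.9815, 1.7457),
]

# The best stopping point minimizes validation loss, not training loss.
best_iters, _, best_val = min(results, key=lambda row: row[2])
print(best_iters, best_val)  # 1500 1.6006

# A widening gap between validation and training loss signals overfitting.
gaps = [round(val - train, 4) for _, train, val in results]
print(gaps)  # [0.0278, 0.3201, 0.7642]
```

Stopping at the lowest validation loss is the idea behind early stopping, a common guard against overfitting.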

Unsurprisingly, changes to the Transformer architecture lead to significant changes in the final loss. Let’s see the effect of disabling various parts of a 200k-parameter model over 2,000 iterations with the following hyperparameters: 4 layers, 4 heads, 16 batch size, 32 block size, 1e-3 learning rate, 64 embedding vector, and 0.0 dropout.

Model Notes	Validation Loss
Original architecture	1.916
No normalization	1.926
No positional encoding	2.026
No attention	2.554
No residual connections	3.162

At this small scale, certain architectural pieces are less necessary than others are. Attention is the fundamental insight of the “Attention Is All You Need” paper from Google, and that layer is clearly helpful at capturing information for the model to use.

However, the effect of residual connections is even more powerful. These connections are the + operations that add the input to the output of a layer, and backpropagation over residual connections lets the model avoid vanishing gradients: the derivative of an addition passes the gradient through unchanged, so the final loss flows back to each layer directly. (Watch Karpathy’s micrograd YouTube video for a detailed mathematical explanation.) By cobbling together these architectural patterns, we can avoid plateauing in the loss function over many training iterations.
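The vanishing-gradient effect can be illustrated with a toy stack of scalar layers. This is a sketch under strong simplifying assumptions: each "layer" is just tanh(weight * x) with a shared weight, and the gradient is tracked by hand with the chain rule. It is not the Transformer block from the notebook.

```python
import math

def input_gradient(x, weight, n_layers, residual):
    """Return d(output)/d(input) for a stack of scalar tanh layers."""
    grad = 1.0
    for _ in range(n_layers):
        h = math.tanh(weight * x)
        local = weight * (1.0 - h * h)  # derivative of tanh(weight * x)
        if residual:
            x = x + h                   # residual: add the input back
            grad *= 1.0 + local         # the "+ x" contributes the 1.0
        else:
            x = h
            grad *= local               # every factor is below 0.5 here
    return grad

plain = input_gradient(0.5, weight=0.5, n_layers=16, residual=False)
skip = input_gradient(0.5, weight=0.5, n_layers=16, residual=True)
print(plain)  # tiny: the gradient has nearly vanished after 16 layers
print(skip)   # at least 1.0: the addition keeps a direct path open
```

Without the addition, sixteen factors smaller than 0.5 multiply together and the signal disappears; with it, each factor is at least 1.0.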

All these training passes take time, but the processor type drastically changes how long. Let’s compare how long training takes for different processors that Google Colab has available.

Google Colab - Processor Options

We’ll use the 800k-parameter LLM to accentuate the differences in times with the following hyperparameters: 16 layers, 1,000 iterations, 8 heads, 64 batch size, 128 block size, 1e-3 learning rate, 64 embedding, and 0.0 dropout. Note that Google Colab provides CPU and T4 for free when resources are available, but the A100 and V100 are part of the paid plan.

Processor	Training Time (seconds)
CPU	7568
T4 GPU	117
V100 GPU	69
A100 GPU	64

GPUs are 118x faster than CPUs are.

GPUs greatly accelerate training performance, even in this very small example. The model trained in two hours on the CPU and in one minute on the GPU. The Nvidia A100 GPU enabled the model to train 118x faster than the CPU did, opening up opportunities for scaling that wouldn’t have been practical with CPUs.

Inference

We’ve quantified the loss across a variety of different hyperparameters, but we have no idea what the output looks like for each tier of loss. Again, before training, the model knows nothing about English. During training, all it sees is my blog’s content, and it tries to replicate that content to minimize the loss. Let’s qualify the loss by seeing the output text at various points:

4.0 Validation Loss:

==rv8\ Exlu(6~@ '/)1/U⌘N4vbyYW&?jeuP48-ea>"6\\R ;J!>:3~ "H~25 V" djP1Qw%uI<\d”>—y7 dna+ZHa%Cal<(-:?7

3.0 Validation Loss:

T .septheeh.s 9 nt> e yr. ye dmnild a .aueo_r%lme ol e aslen r hu.noco W Un elenQdhautf t neep

2.5 Validation Loss:

alder t Cfo t o s TalySL mlo deo dotersincolrwfit rto me ontade. iom Pk gs "indodeacronero andn. t'

2.0 Validation Loss:

ayigh, Ontiss thippated to so peckup to with spoickin't nox a rele davestrn difUpearanges and frotuI

1.6 Validation Loss:

I going turn into a key-vs, partners, remarker of these; additions tha steps disconsforw was easy li

Look at how the text changes from random letters to almost coherent words and then to almost reasonable sentences. A magical quality of deep neural networks and LLMs in particular is the ability to continually decrease the loss with the right hyperparameters and more data in a reasonable amount of time with GPUs. They continue to absorb more context and leverage that knowledge to generate new tokens that accurately resemble the existing data set.

Fine-Tuning

What we built is not ChatGPT. What we did is referred to as pre-training. The model is trained on a large corpus with a loss function that measures how well the output matches the corpus. Regardless of how much we improved the loss, the output would remain a steady stream of sentences that sound like my blog posts, not answers to questions.

The next step is called fine-tuning: feeding in thousands of example conversations, along with reinforcement learning from human feedback (RLHF), to nudge the model’s pre-trained weights to appear to respond to questions with answers or write accurate responses to prompts.

Leaky Abstraction

There is a significant difference between reading about machine learning and large language models and actually attempting to create one. At one point during his YouTube series, Karpathy notes that deep neural networks are a “leaky abstraction”: understanding the internals of each aspect is critical to avoid pitfalls. Let’s look at three different instances in our simple model:

Hyperparameter Tuning: Simply increasing the iterations for certain hyperparameter combinations increased our model’s validation loss rather than decreasing it. Overfitting is a huge problem in machine learning in general.
Residual Connections: Before these connections, models suffered from vanishing gradients because the loss function’s output didn’t flow back to each layer separately and instead passed through each layer. But without understanding the math behind how the derivative can bypass layers with an addition, it wouldn’t be clear that residual connections would be a way around that. See the following code blocks as a comparison:

# Residual connection in Transformer Block
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
# No residual connection in Transformer Block
x = self.sa(self.ln1(x))
x = self.ffwd(self.ln2(x))
# Subtle bug in residual connection in Transformer Block
x = self.ln1(x) # BUG: x is overridden and no longer passes through, but this works for small networks
attn_mask = nn.Transformer.generate_square_subsequent_mask(T).to(self.device)
output, _ = self.sa(x, x, x, attn_mask=attn_mask, need_weights=False, is_causal=True)
x = x + output
x = x + self.ffwd(self.ln2(x))

Matrix, Layer, and Model Analysis: Every level of the model requires analysis and optimization to reduce the loss to the minimum, including studying matrix sizes to ensure they are doing what they are expected to do and making sure the network is set up correctly. These systems are incredibly complex at scale, with a host of places that could silently increase loss.

Understanding the inner workings of neural networks enables us to optimize and tune models in a way that relying on the abstraction alone would not.

Moreover, a deeper knowledge both demystifies and highlights the magical quality of LLMs. ChatGPT is an LLM that frequently feels like talking to a person. Fundamentally, it is an LLM that is simply predicting the most appropriate next token in a way that minimizes the loss. I still find it amazing that these steps can lead to such an authentic interaction.

A Glossary for Understanding Large Language Models in AI such as OpenAI's GPT-4, Meta's Llama2, and Google's Gemini

Sat, 16 Mar 2024 12:00:00 +0000

DALL-E: "picture of an open book 16:9"

OpenAI’s GPT-4, Meta’s Llama2, and Google’s Gemini are all forms of large language models (LLMs), a subset of deep neural networks, which are a subset of machine learning algorithms. LLMs feel magical, but at their core, these models are token-prediction algorithms. The fundamental building blocks of these machine learning models are quite simple but capable of achieving astounding results at scale.

Let’s walk through a number of words that come up for LLMs.

Glossary

At the highest level, we care about two aspects:

Model: The model is the actual algorithm for processing a given input context and generating an output.
Weights: Weights are numerical constants being multiplied with (weights) or added to (biases) the inputs, like (w*x + b). For simplicity, weights and biases are both referred to as weights. They’re also referred to as parameters. GPT-3 has 175 billion parameters. Talking about the number of weights for a neural network gives an immediate sense of the complexity of the network, but the number does not translate directly into accuracy.

Going a level deeper, we can think about how we train and use the model and weights:

Converting "LLMs understand numbers, not text." into a numerical vector using tiktoken

Tokens: GPT-3 was trained on a massive corpus of text. However, neural networks don’t “read” text. They take a set of numbers as input and output a set of numbers that minimize the loss function. These numbers are called tokens. For large language models (LLMs), text is converted into these tokens that typically represent subword chunks (not individual characters but not words) for the network to read as input and to write as output.
Training: To optimize the weights of the neural network, we take a large corpus of data and iteratively run a forward pass (inference) on small sections of it to generate an output. The output is then compared to our desired output, and the difference is called the loss. The loss is passed backwards through the network using calculus to adjust the weights, so that the next forward pass is more accurate. The goal of training is to minimize the loss of a model’s inference.
Inference: When consumers use ChatGPT, the model is running inference (a forward pass) to read an input to generate an output.
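The training loop described above can be sketched on the smallest possible "network": a single weight w fitting y = w * x. The data, the function name train, and the step sizes are made up for illustration; the derivative is written out by hand instead of using an autograd library.

```python
def train(learning_rate, steps):
    """Fit y = w * x with gradient descent and return the final weight."""
    data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # the true weight is 3.0
    w = 0.0
    for _ in range(steps):
        # Forward pass and loss: mean squared error over the data.
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        # Backward pass: derivative of the loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # Update: step against the gradient, scaled by the step size.
        w -= learning_rate * grad
    return w

print(train(learning_rate=0.05, steps=200))    # converges to about 3.0
print(train(learning_rate=0.0001, steps=200))  # too small: still far from 3.0
print(train(learning_rate=0.3, steps=20))      # too large: overshoots and diverges
```

A real LLM runs the same forward, loss, backward, and update cycle, just with billions of weights and with the derivatives computed automatically.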

Below models and weights, we can dive into the inner workings of neural networks. Let’s go through them in quasi-top-down order:

Hyperparameters: These refer to the actual configuration of the neural network, like having 2 layers of 100 neurons or 4 layers of 50 neurons. Optimizing hyperparameters is a second-order optimization on top of optimizing the weights of one iteration of the neural network.

Neuron

Neuron: The neuron, also referred to as a node or a perceptron, is the central building block of the neural network. It’s a simple mathematical function, designed to represent a biological neuron. The neuron takes a set of input values (x_n), multiplies each by a weight (w_n), adds them all together, adds a bias (b), and finally passes that value through a non-linear function, also called an activation function, to produce an output (y).
Loss Function: The loss function is the final part of the neural network, only present during training, and compares the output of the forward pass of the network to the expected output. Again, the goal of training is to minimize the loss.
Forward Pass: The forward pass of a network is processing an input through the network into an output. For an LLM, the output is a set of probabilities returned by the softmax function to decide the likelihood of the next token within the token space. That probability distribution is then sampled to decide on the predicted token.
Backward Pass: The backward pass of a network is the crucial part of training neural networks. We take the output of the loss function and feed it backwards through the network, using the chain rule from calculus to compute the derivative of the loss with respect to each weight.
Gradient Descent: Once the backward pass has calculated the derivative of the loss with respect to a weight, we multiply that local slope by a small learning rate and move the weight in the opposite direction, toward a local minimum of the loss.
Learning Rate: We are using gradient descent to move the weights incrementally toward a local minimum of the loss. The learning rate is the step size of these increments: too small, and the network will take too long to reach the minimum; too large, and the network will overshoot it.
Cross Entropy Loss: We need a loss function to help the neural network learn how to improve based on a single positive number that we want to minimize, and popular ones include cross entropy loss and mean squared error (MSE) loss. Cross entropy loss takes the softmax of the logits and then the mean of the negative log likelihood, giving us a single number. Negative log likelihood (NLL) is simply the negative log of the probability assigned to the correct token. The cross entropy loss function is calculating the distance (KL divergence) between the predicted probability distribution and the true probability distribution. Simplifying that mathematical formula results in the negative log likelihood.
Softmax: This is a normalization function that translates a set of numbers to a set of probabilities between zero and one, enabling the result to be handled as a standard probability range. However, instead of a basic normalization function that divides each value by the sum of all values, softmax uses e^x / sum(e^x). The exponential component both highlights the maximum value by increasing the distance between it and the other values, unlike basic normalization, and is differentiable, unlike hardmax.
Logits: It’s the unit of measurement for a log scale (logistic unit). We use this term for the output of the penultimate neural network layer before we normalize the output using the softmax function (to get final probabilities for the next token) or before we calculate cross entropy loss (to get a loss number to minimize). Logits are unnormalized log-probabilities because both of those subsequent functions include a softmax calculation, which exponentiates and normalizes the values. Logits are not the final values because we want the neural network’s output to be in the form of probabilities with the maximum value exaggerated through the softmax function.
Logprobs: After we use softmax on the output to get the probabilities of the next token, we take the log to get logprobs, and this extra calculation helps with numerical stability for small probability values due to the way computers store real numbers. For example, .000000001 and .0000000001 are 10x different, but their precision could be lost; however, log10(.000000001) is -9 and log10(.0000000001) is -10, and their precision won’t be lost in storage. Surfacing logprobs from an LLM provides more nuance to the LLM’s confidence in its own output and what alternative responses would have been. OpenAI started providing logprobs for its selected output tokens and alternative tokens in 2023.
Batch Size: The model cannot process the entire training data set in one pass, and it would lose valuable information if it processed each example independently. The batch size refers to the number of examples that are processed concurrently in a single forward pass. For example, GPT-3 had a batch size of 3.2 million tokens, so every forward pass had 3.2 million tokens of examples to process together to form a better understanding.
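
To tie the terms above together, here is a minimal sketch in plain Python. It treats a three-value list of logits as the only trainable parameters, which is an assumption made purely for illustration (a real network computes logits from weights). The function names and the starting values are hypothetical. Subtracting the maximum before exponentiating is a standard numerical-stability trick that the glossary does not mention.

```python
import math

def log_softmax(logits):
    """Logprobs: the log of the softmax, computed with the log-sum-exp trick."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def softmax(logits):
    """Probabilities: e^x / sum(e^x), each positive and summing to one."""
    return [math.exp(lp) for lp in log_softmax(logits)]

def cross_entropy(logits, target):
    """Negative log likelihood of the correct token index."""
    return -log_softmax(logits)[target]

def gradient_step(logits, target, learning_rate):
    """One gradient descent step; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a three-token vocabulary
target = 1                # index of the "true" next token
losses = []
for _ in range(50):
    losses.append(cross_entropy(logits, target))
    logits = gradient_step(logits, target, learning_rate=0.5)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss shrinks as the logit for the target token is pushed up relative to the others. Trying a much smaller or much larger learning_rate is a quick way to see the trade-off described in the Learning Rate entry.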

Attention block from "Attention Is All You Need" paper

Attention: Neural networks do not automatically absorb relationship information. For example, in the sentence “The house is on the market.”, “house” and “market” have a relationship, but it’s not the same relationship as the two words in the sentence “The house is next to the market.” Previous architectures, like recurrent neural networks (RNNs), added dependencies between stages of the model to capture better relationship information, but these dependencies prevented them from being efficiently parallelizable. In a 2017 research paper from Google titled “Attention Is All You Need”, researchers proposed capturing that contextual information in a new set of trainable weights in the form of matrices called “Query”, “Key”, and “Value”. The paper was focused on machine translation, using an encoder block for the source text and a decoder block for the destination text. However, attention has become the central architectural insight for modern LLMs because models can absorb relationship information of surrounding tokens in an efficiently parallelizable way.
Cross-Attention: This version of attention has the keys and values come from the encoder while the queries come from the decoder. Cross-attention is tailored to machine translation, where the model is attempting both to understand the source language in its entirety and to predict the destination language output. For example, a model translating French to English would see “Bonne après-midi.” as input to the encoder block but only “Good” (not “ afternoon”) as input to the decoder block. Its job is to use the entire French context and a partial English context to predict the next English token. In practice, this approach outperformed the RNN-based translation models that preceded it.
Self-Attention: This version of attention has the keys and values generated from the same input as the queries, using a decoder block and no encoder block. Reusing our example, the input would be “Good” as tokens, and based on training data, the model would predict “ afternoon” as the next token.
Multi-Head Attention: One layer of self-attention only lets us represent a single relationship between two tokens, because the softmax function highlights only one possibility. Adding more heads enables the layer to capture more relationships to absorb more information about the tokens. For example, “The hungry dog ate breakfast.” has multiple relationships: “hungry” and “dog”, “dog” and “ate”, “hungry” and “ate”, “ate” and “breakfast”, and “hungry” and “breakfast”. We want the LLM to absorb as many relationships as possible to most accurately predict what to say next.
Encoder: There can be two modules for attention: an encoder and a decoder. A model for machine translation uses both. The encoder takes the source language, and the decoder takes the destination language. And keep in mind that encoding is not the same as embedding. Moreover, there are encoder-only models like BERT.
Decoder: The decoder can take a destination language for machine translation, or it can be used outside of machine translation for token prediction, like for GPT.
Tokenization: The input needs to be translated into numerical values (tokens) for the model to interpret mathematically. This process is called tokenization. Each token is pre-assigned an integer, so the set of tokens in the input becomes a vector of numbers. The token-to-integer lookup is defined statically in advance. For example, Google uses a sub-word tokenization algorithm called sentencepiece, and OpenAI uses one called tiktoken (with 50,257 possible tokens).
Embedding: Tokens are used as a lookup into an N-dimensional vector space, called an embedding space. Each token’s vector is random initially and adjusted through back propagation during training to bring similar tokens closer to each other and dissimilar tokens further from each other. Keep in mind that embedding is not the same as attention and does not retain token positions.
Positional Encoding: Attention captures relationships between tokens, but it does not include actual position. Encoding the position of each token provides more information for the model to utilize. The 2017 “Attention is All You Need” paper from Google decided to use a sin/cos formula based on the embedding vector dimension, embedding vector index, and token index in order. These encoding values are simply added to the embedding vectors’ values (keeping the dimensionality the same). An alternative approach is to create a second embedding with the dimensions of the context length and the original embedding vector and add that to the original embedding vector. This second option is called learned positional encoding because it lets the neural network optimize the weights and can lead to even better performance.
Vanishing Gradient: A key design choice in the attention paper was adding the raw input of a layer to the layer’s output through a residual connection. During backpropagation, this connection lets the gradient pass back to the initial layers without being diminished by intermediate layers, so the early layers keep learning. Other deep architectures found the gradient vanished as it traveled back through the layers.
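
As a rough illustration of the Attention and Self-Attention entries above, here is a single-head sketch in plain Python of the formula from the paper, softmax(Q K^T / sqrt(d_k)) V. The identity weight matrices and the three toy token vectors are placeholders chosen for readability; in a real model the Query, Key, and Value matrices are learned. The causal mask (each token only looks at itself and earlier tokens) is what a GPT-style decoder uses and is an assumption of this sketch.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a list of token vectors."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    outputs, all_weights = [], []
    for i, query in enumerate(q):
        # Causal mask: token i only scores itself and earlier tokens.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in k[: i + 1]]
        weights = softmax(scores)
        # Each output is a weighted mix of the visible value vectors.
        mixed = [sum(w * value[c] for w, value in zip(weights, v)) for c in range(len(v[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

identity = [[1.0, 0.0], [0.0, 1.0]]       # placeholder for learned weights
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
outputs, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of weights sums to one, and each row can be computed independently of the others, which is the property that makes attention parallelizable where an RNN is not.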

The context window and next token prediction over three inference runs

Context Window: As token-prediction algorithms, LLMs take an input to produce exactly one token. The first inference produces the first output token, based on the input. The second inference produces the second token, based on the input and the first output token. And the process continues, sliding the context window along by one token per inference run. When using ChatGPT, the displayed response is actually the model being run over and over again to take the question and on-going response to produce the next token.
Context Length: This is the total number of tokens that the neural network can take as input. In papers, researchers refer to this as the block size. Because it’s the total, LLMs like GPT include the output token length in the count, so that the LLM can keep the beginning of the input in the context window while generating the end of the output. Otherwise, the LLM would “forget” what the beginning of the initial input was while attempting to generate the end of the output.
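
The generation loop described above can be sketched in a few lines of Python. The toy_model function is a hypothetical stand-in for an LLM (it just returns the last token plus one), and the generate helper and its parameter names are illustrative rather than any library’s API.

```python
def generate(model, prompt, max_new_tokens, context_length):
    """Run one inference per new token, feeding each output back in as input."""
    tokens = list(prompt)
    windows = []
    for _ in range(max_new_tokens):
        window = tokens[-context_length:]  # keep only the most recent tokens
        windows.append(window)
        tokens.append(model(window))       # one inference run yields one token
    return tokens, windows

def toy_model(window):
    """Hypothetical stand-in for an LLM: predicts the last token plus one."""
    return window[-1] + 1

tokens, windows = generate(toy_model, [1, 2, 3], max_new_tokens=3, context_length=4)
print(tokens)   # the prompt followed by three generated tokens
print(windows)  # what the model saw on each of the three runs
```

On the third run the window no longer contains the first prompt token, which is the “forgetting” that a longer context length postpones. The sketch assumes context_length is at least one.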

Data Sets

Labeled Data Set: Training a model requires an input and a desired output, referred to as the label. The model trains its parameters to minimize the difference (loss) between the actual output and the desired output. For the text prediction tasks that an LLM performs, the label is the next token in the data set. This data set is split into three subsets: training data set (60%), validation data set (20%), and test data set (20%). (Percentages vary.) Consumer LLMs like ChatGPT are trained on huge data sets. For instance, GPT-3 was trained on 300 billion tokens, so in our breakdown, that would be 180 billion tokens for the training set, 60 billion tokens for the validation set, and 60 billion tokens for the test set.
Training Data Set: This subset of data is used to actually train the model.
Validation Data Set: This subset of data is used to validate the loss of the model but not to train the model, such as when running many different configurations of hyperparameters and finding the lowest loss. At a higher level though, the model is being trained on this subset simply by optimizing for the model where those hyperparameters generate the lowest loss.
Test Data Set: This subset of data is used to test the loss of the final model using data that the model has never seen to identify underfitting or overfitting.
Overfitting: By training models on a specific set of data, the parameters can become too optimized on its quirks. For instance, training a model on thousands of pictures of a cow in a grass field might lead to a model that cannot identify a cow standing on a road. It extracted the wrong information from the data and overfit: no grass, no cow.
Underfitting: If the model does not have access to enough data, it can underfit and not be able to complete the task. In the example of the cow in a grass field above, training a model on only ten images of cows would not provide enough information for the model to be able to identify a cow in a new image. Well-tuned models neither overfit nor underfit; they absorb enough information to complete the task but are able to generalize beyond the trained data set to work on new inputs.
Zero-Shot Learning: A model is able to complete a task without ever having seen it before. ChatGPT was so surprising because it did very well at tasks that it had never been trained on, like “write a haiku about why ChatGPT’s service is overloaded” (which OpenAI had on its status page for a while).
Few-Shot Learning: A model is able to complete a task having only seen a couple instances of it before.
PyTorch: This is a popular Python library created by Meta (then Facebook) for writing machine learning code. It’s a complement to the pandas and NumPy libraries and competes with Google’s TensorFlow.
Pre-training: “Training” encompasses the entire process of preparing a model for usage, but technically, optimizing weights is referred to as pre-training.
Fine-Tuning: After pre-training a model, we use a smaller set of examples, on the order of thousands, to fine-tune the weights of the model to solve more specific tasks, such as answering questions.
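
To make the Labeled Data Set entry above concrete, here is a small sketch of the 60/20/20 split and of how next-token labels fall out of raw text. The function names are hypothetical, the corpus is a list of 300 fake token IDs standing in for GPT-3’s 300 billion tokens, and splitting by position (rather than shuffling documents first) is a simplifying assumption.

```python
def split_dataset(tokens, train=0.6, validation=0.2):
    """Split a token list into training, validation, and test subsets."""
    n = len(tokens)
    train_end = round(n * train)
    validation_end = train_end + round(n * validation)
    return tokens[:train_end], tokens[train_end:validation_end], tokens[validation_end:]

def next_token_examples(tokens, block_size):
    """Pair each block of input tokens with its label: the token that follows."""
    return [
        (tokens[i : i + block_size], tokens[i + block_size])
        for i in range(len(tokens) - block_size)
    ]

corpus = list(range(300))  # stand-in for 300 token IDs
train_set, validation_set, test_set = split_dataset(corpus)
examples = next_token_examples(train_set, block_size=4)

print(len(train_set), len(validation_set), len(test_set))
print(examples[0])
```

No one has to label the data by hand: every position in the text supplies its own label, which is a large part of why LLMs can be trained on such large data sets.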

Finally, let’s touch on a number of general terms that come up for the current state of machine learning:

Transformer model from "Attention Is All You Need" paper

Transformers: This is a type of neural network architecture that Google researchers proposed in 2017 in “Attention Is All You Need” that utilizes attention as a building block. It’s called a transformer because the model transforms the data in different ways and the model needed a name to differentiate itself from RNNs, which were the state-of-the-art at the time.
GPT: This is an acronym for generative pre-trained transformers that OpenAI came up with, first seen with their GPT-1 model. ChatGPT has been so successful that transformer and GPT are frequently used interchangeably.
LLM: This is an acronym for large language models. GPT is a specific type of LLM.
Foundation Model (FM): This is a type of AI model that has been trained on a large set of data and can be applied to a large set of use cases. Due to the success of ChatGPT and other LLMs, there is a growing set of companies operationalizing foundation model pipelines, referred to as FMOps or LLMOps.
Multi-Layer Perceptron: This is a simple type of neural network that has multiple layers of perceptrons, also referred to as neurons.
Convolutional Neural Networks (CNNs): Commonly used for image analysis, this type of neural network has convolutional layers in between feedforward (unidirectional) layers. These convolutional layers act as filters on the data, sliding across the context and extracting certain features into its weights. Google DeepMind released a popular version called WaveNet for processing audio.
Recurrent Neural Networks (RNNs): Before transformers, this was the best neural network for machine translation. It preserves state inside each layer to use the previous output in the next input, giving the network a memory but also introducing dependencies between iterations and neurons. For comparison, the attention architecture in transformers also adds a form of memory but without the dependencies, so transformers can parallelize more efficiently.
Small Language Models (SLMs): Given the success of ChatGPT, companies are exploring how to build their own small-scale models but with a small set of domain-specific data to focus the model on a set of tasks. General LLMs like GPT-4 and Gemini take months to train and significant resources even for inference. Shrinking the number of parameters makes the model smaller and more efficient without reducing performance at specific tasks, to the point where these models can run on-device instead of in a data center via the cloud.

Foundation

Neural networks have been around for decades, but access to data and compute power at scale has enabled these models to cross a threshold from useful to magical. ChatGPT became an instant sensation because people did not realize a computer could respond to questions like a person. Still, at its core, a large language model such as ChatGPT is a token-prediction algorithm, and understanding how an LLM is built helps demystify and ground its performance. This glossary provides a reasonable foundation for understanding the details.

SudokuAI 1.1: Solve Sudoku Instantly at Night with Dark Mode

Fri, 16 Feb 2024 12:00:00 +0000

SudokuAI 1.1 with Dark Mode support

SudokuAI now supports Dark Mode. This app is an open-source side project that I wrote to learn Swift. Learn more about it:

Available now.

Prompt Engineering for Stories: A Generative AI Children's Book using ChatGPT and Midjourney

Sat, 28 Jan 2023 12:00:00 +0000

Midjourney Prompt: “a panda wearing sunglasses and lying down beach bed in cancun”

ChatGPT and Midjourney are both examples of generative AI, a form of machine learning that lets users create artwork based on specific inputs, such as text or images. Reading about the possibilities of this technology doesn’t have the same impact as personally trying it. It’s amazing.

I first learned about Midjourney from Stratechery’s interview with its founder, David Holz, who previously co-founded Leap Motion. I had no idea that Midjourney was Discord’s most popular server or that Midjourney is one of the biggest consumers of GPUs in the cloud. The images are amazing. Type in text, and out comes a professional image.

Midjourney Prompt: “panda reading a book on the beach”

Midjourney, released in July 2022, creates images in 60 seconds, solely through Discord. In the Stratechery interview, Holz talks about how the team was using Discord as an easy way to test the software and then realized how much fun it was, so they decided to make the tool available only through Discord. Midjourney uses a freemium business model, with a couple dozen free images and paid tiers.

OpenAI’s ChatGPT, released in November 2022, is also remarkable. It’s a text chatbot, based on the GPT-3.5 large language model (LLM), and can respond to an impressive variety of questions. Ask it for an essay comparing communism and capitalism, and the essay is ready in less than a minute. Ask for a Python script to ingest a CSV file. Done. Ask for a sonnet about why ChatGPT is at capacity. It displays that on its status page.

Both services take text as an input, referred to as a prompt, and run it through an advanced machine learning model to output a generative AI result. To explore the two services, I thought it would be fun to create children’s books, writing a prompt for ChatGPT so that the output could be fed directly into Midjourney for story artwork. I did one pass myself and then another with my family, and we printed all of them on Shutterfly as 6x6 Instant Books for $4 each.

Prompt Engineering

Prompts are a very new concept, but there is already a term for iterating on them for better results: prompt engineering. My goal was to create a prompt for ChatGPT that would generate a story, and the story sentences would serve as the prompts for artwork in Midjourney. I went with two family favorites: birds and pandas.

ChatGPT: “Write a seven-sentence children’s story titled “Falcon Finds Home” about a peregrine falcon that wakes up in a box, escapes, and flies home above the forest and stream to find its family and rainbow-colored nest, with each sentence written as a prompt.”

The peregrine falcon woke up inside a small box, feeling confused and alone.

The falcon quickly realized it needed to escape from the box and began looking for a way out.

Using its powerful wings, the falcon broke through the top of the box and flew into the open air.

The falcon soared above the forest, the trees a blur of green beneath it.

The falcon followed the babbling stream, knowing it would lead it home.

As the falcon flew, a rainbow appeared in the sky, guiding the falcon towards its destination.

Finally, the falcon saw its colorful nest built high in a tree, and it knew it was home.

ChatGPT wrote the above story within ten seconds, including the numbering. I made slight alterations to the text, replacing “It” with “The falcon”, to keep the context of the story for Midjourney in every line.

It’s a story and a bit unnerving. I came up with the idea in thirty seconds, and I was able to create all of the assets for a book in around ten minutes, lay it out in Shutterfly in ten minutes, and ship it in a week for $20.

Book cover for "Falcon Finds Home"

Book layout for "Falcon Finds Home"

ChatGPT: “Write a children’s story titled “Great Wall of Pandas” in seven sentences about a family of panda that must journey over the Great Wall of China to make it home to their bamboo forest.”

The Great Wall of Pandas is a story about a family of pandas who live in a bamboo forest in China. One day, the mother panda, Peaches, and the father panda, Mr. Panda, decide to take their baby panda, Bao, on a journey to see the Great Wall of China. As they walk along the Great Wall, Bao the Panda asks his parents many questions about the history of the wall and how it was built to protect the country. Bao the Panda is so fascinated by the Great Wall that he wants to climb to the top, so with the help of his parents, he does just that. From the top of the wall, Bao the Panda sees a beautiful view of the country below. When it’s time to go home, the family of pandas has a great journey back to their bamboo forest. Bao the Panda can’t wait to tell all of his friends about his adventure and is excited for more adventures with his family.

Book cover for "Great Wall of Pandas"

Book layout for "Great Wall of Pandas"

Connecting ChatGPT to Midjourney to create visual stories from prompts creates an AI-driven creativity pipeline. With prompt engineering, one person can fuse words and art into a story in minutes instead of months.

6x6 Instant Books from Shutterfly

Creativity

My family was uninterested in ChatGPT, but Midjourney was magical. Type in words, and out comes an image. Visualizing ideas was far more interesting for the family than getting the computer to generate a story. They wanted to write their own stories and use Midjourney for the artwork.

They wrote two books, one about a fox and a panda and one about birds, and then we laid out the stories with the artwork from Midjourney on Shutterfly. In fact, printing 6x6 books for $4 with Shutterfly was as much of a hit as Midjourney.

ChatGPT and Midjourney aren’t perfect. I generated many images with Midjourney where the panda had ten toes. ChatGPT repeatedly ignored my requests to use the character’s name in each sentence or to incorporate styles like “cinematic” or “sketch”. Still, this is the state of what’s available now. Those shortcomings will be resolved soon.

SudokuAI: Instantly Solve Any Sudoku Puzzle with Your iPhone

Sat, 31 Dec 2022 12:00:00 +0000

Before I started this side project, I had tried Sudoku a handful of times. Then, my friend sent me this excellent article by Peter Norvig titled Solving Every Sudoku Puzzle, and I wrote about the algorithm in How to Solve Any Sudoku Puzzle in One Second. Productizing that algorithm into an iPhone app seemed like a fun reason to learn Swift.

My original vision for this app was one screen: a camera view that automatically recognized a Sudoku puzzle, highlighted the numbers with Apple’s yellow box (like QR codes get), and solved it on the fly, superimposing the calculated numbers over the image in real time. I thought I could learn Swift, SwiftUI, and iOS’s image recognition. This is not that app.

Brainstorming is a fantastic avenue to thinking about what’s possible, but all projects, especially side projects like this one, are constrained by time and resources. SudokuAI is a far simpler app. It does have image recognition, but the process doesn’t involve floating yellow boxes. It solves puzzles in real time, but the result isn’t superimposed on a real-time image. SudokuAI also covers a couple more use cases than the original version did: creating a new Sudoku puzzle, using an image from Photos, or getting just a hint.

Most importantly, SudokuAI is shipping on the App Store now. It’s the most important aspect because without shipping, people don’t have the opportunity to try the app.

Shipping a product highlights the tension between Product Management and Engineering. Product Management wants to deliver the best experience to the customer, but Engineering is tasked with implementing the experience. Features that sound simple or fun to customers can translate into untenable requests for Engineering, like floating yellow boxes with superimposed numbers for someone who has never written Swift working on a side project. Delivering a new product requires understanding the needs of customers in addition to the resources of the team and finding the best intersection between them.

SudokuAI List

SudokuAI Puzzle

I iterated many times on the design both to simplify the interaction and to enable me to implement it.

Simple is Hard

My original vision for the app was derailed as soon as I looked into Apple’s documentation for image recognition. I thought that those yellow rectangles might be part of an API for VisionKit. They were not, and I wasn’t interested in building them. Moreover, the original version didn’t account for creating new puzzles, importing digital photos of puzzles, or getting a hint.

I pivoted to having a list of puzzles. The user could try a pre-loaded puzzle, create a new one, or import one from the camera or photo roll, and they could watch autoplay, get a hint, or solve it instantly. That list of use cases seemed reasonable and achievable. Still, those use cases could be surfaced with many different user interfaces.

I settled on buttons along the top of the list view for puzzle ingestion: plus, camera, photo. Toolbar items like these are a standard iOS style.

Toolbar

The puzzle interaction workflows were far harder to get right. I started with the music player analogy: back, play/pause, next.

Actions - v1

Actions - v2

Interacting with them wasn’t a great experience though. “Back” should always be a reset for the puzzle. “Next” doesn’t cover both use cases: hint and solve. I abandoned the music analogy and switched to words: “Solve”, “Hint”, and “AI” (for autoplay). Tapping “AI” switched the buttons to “Pause AI” and “Reset AI”.

Actions - v3

That interaction felt awkward. I iterated on the words with “Solve” and “AI”. Tapping “Solve” switched to “Answer” and “Hint” with “Reset”. Tapping “AI” switched to “Reset” and “Pause” with a play rate slider. The slider would allow users to slow down or speed up autoplay, seeing what choices the algorithm made to proceed through the problem space.

Actions - v4

That workflow was far too awkward. This app was supposed to be simple. I returned to the music player analogy with a small tweak: “Reset” (Back), “Play”/“Pause”, “Hint” (Next), “Solve” (End).

Actions - v5 (Final)

These buttons addressed the use cases and were easy to implement, and they translated well to the high-level pitch for the app:

Solve Sudoku Instantly
Tap for a Hint
How to Solve Any Puzzle

I got tired of adding puzzles manually into the app and added a toolbar item for importing a default set of puzzles: Norvig’s example along with the three hardest puzzles I could find.

The final addition to the app was the tutorial. The first question during beta testing was what exactly the app did, so I added the tutorial and a question mark button to bring up the Help section any time.

SudokuAI is not a complicated or flashy app, but it achieves its goals.

Technical Difficulties

I’ve used Objective-C, Cocoa, and Interface Builder (NIBs/XIBs) for over a decade, writing 50k lines of code to create the user interface for Retrospect Backup for Mac, but Apple’s moved on to Swift. I had zero knowledge of Swift when I started this app.

The initial problem was learning enough Swift to implement the algorithm, but productizing R&D doesn’t just mean wrapping a UI around some research. Features like “Tap for a Hint” meant deciding how to store and surface the intermediate steps of the algorithm, and the “Play”/“Pause” buttons required the algorithm to slow down and occasionally stop at the user’s request. Adding the plumbing for those user-facing features required extending the original algorithm to support new use cases.

Next, when should the app actually solve a puzzle? When the user hits “Solve”? When it’s created? The design choice depends on the engineering capabilities. Luckily, solving any known puzzle takes 1.2 seconds at maximum, about 1400 steps in the algorithm. Based on that, I created an upper bound, declaring “No Solution” if the algorithm exceeds 2000 steps in its search. Because the time was short enough, I opted to solve a puzzle for every change. By always having a solution or failing after a certain number of tries, the app could then display if there was a solution, providing great feedback to users who are editing a puzzle.
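
The step budget and the stored intermediate steps can be sketched together. The following Python is an illustration of the idea, not the app’s Swift code: run_with_budget and fake_solver are hypothetical names, and the fake solver simply pretends to succeed on its final step so that the budget logic can be exercised on its own.

```python
MAX_STEPS = 2000  # the upper bound quoted in the text

def run_with_budget(steps, max_steps=MAX_STEPS):
    """Consume solver steps until the puzzle is solved or the budget runs out.

    `steps` is any iterable of (description, solved) pairs. The returned
    history is what a hint or autoplay feature could replay one step at a time.
    """
    history = []
    for description, solved in steps:
        if len(history) >= max_steps:
            return "No Solution", history
        history.append(description)
        if solved:
            return "Solved", history
    return "No Solution", history

def fake_solver(total_steps):
    """Hypothetical stand-in for the real search: succeeds on its last step."""
    for n in range(1, total_steps + 1):
        yield f"step {n}", n == total_steps

status_ok, history_ok = run_with_budget(fake_solver(1400))
status_bad, history_bad = run_with_budget(fake_solver(5000))
print(status_ok, len(history_ok))
print(status_bad, len(history_bad))
```

Because the solver is a generator, pausing is just a matter of not asking for the next step, and a hint is the next entry in the history.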

The UI is built in SwiftUI. SwiftUI is drastically different from working with XIBs in Interface Builder. In fact, Apple discontinued Interface Builder as a standalone app in 2011 and rolled it into Xcode, well before introducing SwiftUI. SwiftUI is declarative code (as opposed to imperative code), meaning you write what you want, not exactly how to build it. The following code becomes a list with PuzzleCell as the table view’s cells and “SudokuAI” as the title.

NavigationView {
  List {
    ForEach(items) { item in
      PuzzleCell(item: item)
    }
  }
  .listStyle(PlainListStyle())
  .navigationTitle("SudokuAI")
}

With SwiftUI, I could easily write UI code (with only a few conditional statements) that worked on any iPhone and iPad in both portrait and landscape. Coming from Interface Builder, the experience was both transformative and incredibly frustrating: amazing because very little code can become a functioning UI and annoying because that UI might not be exactly what I wanted. Learning SwiftUI informed the app’s design by showing me what was easy and shippable.

Beyond solving the puzzle, the app needed to ingest puzzles, either by user creation or by import and image recognition. The goal was to do a reasonable job of recognizing a Sudoku puzzle in an image and extracting the numbers from it. VNRecognizeTextRequest in Apple’s Vision framework was a great API to accomplish this with minimal effort, but it took a number of tries to find the correct parameters. Thanks to Apple’s Locating and Displaying Recognized Text sample project for making attribute testing easy for sample puzzle images.

textRecognitionRequest = VNRecognizeTextRequest(completionHandler: recognizeTextHandler)
textRecognitionRequest.recognitionLevel = .accurate
textRecognitionRequest.usesLanguageCorrection = true
textRecognitionRequest.recognitionLanguages = ["en-US"]
textRecognitionRequest.usesCPUOnly = false
textRecognitionRequest.revision = VNRecognizeTextRequestRevision1

For a given image, the Vision API provided a list of character guesses with their location in the image. The app needed to estimate the spacing between the recognized numbers, including where the puzzle started and stopped within the image, so the code first guesses the surrounding box size and then assigns each character to a square, adding blanks for spaces that contain no number. The algorithm works well if the user crops the image to only include the puzzle. To help the user correct any errors, I overlay the recognized puzzle on top of the image, so that the user can tap on each square to correct the number. The app uses the same workflow to create a new puzzle.
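
Here is one way the assignment step could work, sketched in Python rather than the app’s Swift. Everything in it is an assumption for illustration: the to_grid name, the (digit, x, y) tuple format, and especially the shortcut of estimating the cell size from the outermost recognized digits, which only works if at least one digit sits in the first and last row and column.

```python
def to_grid(observations, size=9):
    """Place recognized digits into a size x size grid.

    `observations` is a list of (digit, x, y) tuples giving each digit's
    center in image coordinates. Blank squares are returned as ".".
    """
    xs = [x for _, x, _ in observations]
    ys = [y for _, _, y in observations]
    left, top = min(xs), min(ys)
    # The outermost digit centers are assumed to span size - 1 cells.
    cell_w = (max(xs) - left) / (size - 1) or 1.0
    cell_h = (max(ys) - top) / (size - 1) or 1.0
    grid = [["."] * size for _ in range(size)]
    for digit, x, y in observations:
        col = min(size - 1, max(0, round((x - left) / cell_w)))
        row = min(size - 1, max(0, round((y - top) / cell_h)))
        grid[row][col] = digit
    return grid

# Three made-up detections from a puzzle with 50-pixel cells.
observations = [("4", 35, 35), ("7", 135, 235), ("9", 435, 435)]
grid = to_grid(observations)
print("\n".join("".join(row) for row in grid))
```

A sketch like this fails in exactly the way the text warns about: if the image includes anything beyond the puzzle, the estimated box is wrong, which is why cropping and tap-to-correct matter.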

Image Overlay on Puzzle

With camera support, photo roll integration, and puzzle creation, SudokuAI gives users a couple different workflows for ingesting puzzles. Solving puzzles instantly on any change allows the app to display a status like “Ready to Solve” or “44 hints left”. Letting users tap to change each square ensures any image recognition errors can be easily fixed. SwiftUI enables me to bundle these features into a simple UI.

Available Now

SudokuAI is available for free from the App Store. It is compatible with iOS 15 and later on iPhone and iPad. The source code is available on GitHub at bdunagan/SudokuAI under the MIT license.

Despite its name, the underlying algorithm is simply constraint propagation and search, neither of which is AI (artificial intelligence) or ML (machine learning). SudokuAI just seemed like a catchy name.

How to Solve Any Sudoku Puzzle in One Second

Sat, 31 Dec 2022 11:00:00 +0000

Unsplash - @iamfrancismorgan

My friend pointed out an excellent article by Peter Norvig titled Solving Every Sudoku Puzzle. Norvig is an Education Fellow at Stanford Institute for Human-Centered AI, co-authored “Artificial Intelligence: A Modern Approach”, the most popular AI textbook, and served as a director of research and search quality at Google.

This is Sudoku: fill in every blank square with a number between 1 and 9 such that every row, every column, and every 3x3 group has every digit.

Example Sudoku Puzzle

I was a complete novice to Sudoku when I started this project. I had casually played it before, but my naive mental algorithm was a mix of constraint propagation and search, without any systematic method to it. In his article, Norvig walked through how to use constraint propagation and backtracking search to quickly solve any valid Sudoku puzzle. His Python version was able to do so in one second for the world’s hardest puzzles.

I skimmed Norvig’s general approach (summarized below) and then attempted to fill in the details myself with a Ruby implementation. The process took a couple iterations and pushed me to understand the problem space far better than I would have simply reading the article and Norvig’s Python implementation.

Let’s walk through the different challenges.

Problem Space

Sudoku has on the order of 10^21 valid solution grids, and the minimum number of supplied values for a puzzle with a unique solution is 17. A researcher actually spent a year proving that no 16-value Sudoku puzzles exist, according to MIT Tech Review. Fittingly, Norvig’s example puzzle (pictured above) has 17 values.

Each square in the 9x9 grid can have a value between 1 and 9. These squares are grouped into “units”: columns, rows, and 3x3 groups. Each unit has nine squares (peers), so each unit has exactly one of every digit. If a square in a unit already has a value assigned, like “4” in the first square (A1 in Sudoku parlance), then no other square in that unit can contain that value. By converting the grid that we see into a representation of known and potential values (A1->4, A2->12356789...), we can evaluate each unit according to a set of constraints.

Constraint Propagation

There are two constraints that we can use to reduce the space of possibilities:

Elimination: If a square has only one possible value, eliminate that value from all of its peers.
Assignment: If a square is the only one in a unit that can still hold a certain value, that value must belong to that square.

For every elimination, we can attempt an assignment, and for every assignment, we can attempt further eliminations, recursively narrowing the potential value space for the puzzle.

In practice, easy and medium difficulty Sudoku puzzles can be solved without search by simply running eliminations and assignments recursively.

Search

Constraint propagation only gets us so far. After we’ve limited the potential values using eliminations and assignments, we need to search the problem space: choose a possible value for a square and see if it’s correct. A backtracking search leverages constraints to prune a given branch and backtrack as soon as the remaining possibilities on that branch are not valid. I used a backtracking depth-first search in my implementation.
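The two constraints and the backtracking search fit together in a compact sketch. This is not the implementation from the repo or a line-by-line port. It follows the general structure of Norvig's approach, and the module and method names (SudokuSketch, assign, eliminate, search) are chosen here for illustration.

```ruby
module SudokuSketch
  DIGITS   = "123456789"
  ROWS     = ("A".."I").to_a
  COLS     = ("1".."9").to_a
  SQUARES  = ROWS.product(COLS).map(&:join)
  UNITS    = COLS.map { |c| ROWS.map { |r| r + c } } +
             ROWS.map { |r| COLS.map { |c| r + c } } +
             ROWS.each_slice(3).flat_map { |rs|
               COLS.each_slice(3).map { |cs| rs.product(cs).map(&:join) }
             }
  UNITS_OF = SQUARES.to_h { |s| [s, UNITS.select { |u| u.include?(s) }] }
  PEERS    = SQUARES.to_h { |s| [s, UNITS_OF[s].flatten.uniq - [s]] }

  module_function

  # Assignment: keep only digit d in square s by eliminating every other
  # candidate. Returns values, or nil if a contradiction appears.
  def assign(values, s, d)
    others = values[s].chars - [d]
    ok = others.all? { |other| eliminate(values, s, other) }
    ok ? values : nil
  end

  # Elimination: remove d from square s, then propagate both constraints.
  def eliminate(values, s, d)
    return values unless values[s].include?(d) # already eliminated
    values[s] = values[s].delete(d)
    return nil if values[s].empty?             # no candidates left
    if values[s].size == 1
      last = values[s]                         # s is settled: clear it from peers
      return nil unless PEERS[s].all? { |p| eliminate(values, p, last) }
    end
    UNITS_OF[s].each do |unit|                 # only one place left for d?
      places = unit.select { |sq| values[sq].include?(d) }
      return nil if places.empty?
      return nil if places.size == 1 && !assign(values, places.first, d)
    end
    values
  end

  # Start with every digit possible everywhere, then assign the given values.
  def parse_grid(grid)
    values = SQUARES.to_h { |s| [s, DIGITS] }
    SQUARES.zip(grid.chars).each do |s, ch|
      next unless DIGITS.include?(ch)
      return nil unless assign(values, s, ch)
    end
    values
  end

  # Backtracking depth-first search on the square with the fewest candidates.
  def search(values)
    return nil if values.nil?
    return values if values.values.all? { |v| v.size == 1 }
    s = SQUARES.select { |sq| values[sq].size > 1 }.min_by { |sq| values[sq].size }
    values[s].each_char do |d|
      solved = search(assign(values.dup, s, d)) # copy so a failed branch is discarded
      return solved if solved
    end
    nil
  end

  def solve(grid)
    search(parse_grid(grid))
  end
end
```

Calling SudokuSketch.solve with an 81-character grid string returns a hash of square to digit, or nil when the grid is contradictory. Each branch works on its own copy of the values hash, so backtracking is just returning nil and trying the next digit.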

Backtracking Depth-First Search (BDFS)

Implementation Details

Norvig wrote his implementation in Python. I wrote mine in Ruby. They’re both great, high-level languages, but they have important differences that only manifested when I tested difficult puzzles.

In particular, Python evaluates generator expressions lazily. When one is passed to some(), each element e is not computed until the loop reaches it, so the first time one succeeds, some() returns and the rest are never evaluated.

def some(seq):
    "Return some element of seq that is true."
    for e in seq:
        if e: return e
    return False

Ruby’s collection methods are eager by default (laziness requires opting in with Enumerator::Lazy), so my function checks an instance variable instead to break out of the search.
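For comparison, Ruby's standard library does offer an opt-in lazy enumerator, and a rough equivalent of some() can be sketched with it. This is an illustration rather than the approach used in the repo. The try_branch method is a hypothetical stand-in for one expensive search branch.

```ruby
# Hypothetical stand-in for one search branch: returns a result or nil.
def try_branch(n, log)
  log << n # record that this branch was actually evaluated
  n >= 3 ? "solved with #{n}" : nil
end

evaluated = []

# .lazy defers the map, so find stops at the first truthy result and the
# remaining branches are never evaluated.
result = [1, 2, 3, 4, 5].lazy.map { |n| try_branch(n, evaluated) }.find { |r| r }
```

After this runs, evaluated holds only the branches that were tried before the first success. Without .lazy, map would evaluate all five branches before find looked at any of them.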

Norvig picked an excellent puzzle as the example because there are only 17 initial numbers, compared to 36 in easy puzzles and 22 in hard ones. Moreover, the puzzle seems designed to trip up naive search algorithms. My early iterations failed on his example because squares with the smallest number of potential values needed the larger number assigned, and my code guessed the lowest one and could never recover after that. It was a great example that ensured I didn’t prematurely stop improving the algorithm.

However, only testing with one puzzle is not a good QA practice, so I built out a test framework with 60 puzzles of varying difficulty to ensure my implementation worked on a variety of puzzles beyond Norvig’s example. To ease debugging, I added two functions for displaying the current state of the puzzle grid in Terminal: one for the actual values and one for the potential values. The visual aid helped debug issues with the search algorithm.

def self.format_grid(grid)
    # Print grid as numbers and dots.
    grid.chars.each_slice(9).to_a.each_with_index do |row, i|
        log (row.slice(0,3).join + "|" + row.slice(3,3).join + "|" + row.slice(6,3).join).split("").join(" ")
        log "------+------+------" if i % 3 == 2 && i < 8
    end
end

Sudoku Grid in Terminal

def display_values(values)
    values.values.each_slice(9).to_a.each_with_index do |row, i|
        log row[0..2].collect {|r| r.center(9)}.join(" ") + "|" + row[3..5].collect {|r| r.center(9)}.join(" ") + "|" + row[6..8].collect {|r| r.center(9)}.join(" ")
        log "-----------------------------+-----------------------------+-----------------------------" if i % 3 == 2 && i < 8
    end
    log ""
end

Sudoku Potential Values in Terminal
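A test framework over many puzzles also needs a way to confirm that a solver's output is a real solution. As a hedged sketch (the method name valid_solution? is hypothetical and not taken from the repo), a checker can verify that every row, column, and 3x3 group contains each digit exactly once:

```ruby
# Returns true when an 81-character grid string is a complete, valid solution.
def valid_solution?(grid)
  return false unless grid.match?(/\A[1-9]{81}\z/)

  rows  = grid.chars.each_slice(9).to_a
  cols  = rows.transpose
  # Take three rows at a time, then cut them into three 3x3 groups.
  boxes = rows.each_slice(3).flat_map { |band|
    band.transpose.each_slice(3).map(&:flatten)
  }

  (rows + cols + boxes).all? { |unit| unit.sort.join == "123456789" }
end
```

A check like this only validates the finished grid. Confirming that the solution matches the original puzzle's given values would need a separate comparison.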

After I finished debugging my version, I ported Norvig’s Python implementation into Ruby to see how they compared. Norvig’s algorithm was vastly faster than mine was: from 3x faster (0.004 seconds vs 0.01 seconds) for easy puzzles to 178x faster for the hardest puzzle available (Unsolvable #28): 0.2 seconds vs 34 seconds.

The implementations were close for puzzles that only required constraint propagation, but mine was much slower at search for two reasons:

Overzealous Constraint Propagation: My version wasted a tremendous amount of time performing constraint propagation on every square during every search, rather than focusing on the square being tested and its affected peers in each unit.
Wasted Data Conversions: My version passed each grid to be tested as a string to the next search, forcing each run to recreate the potential values of each square in the grid rather than leaving the internal representation as a set of known/potential values for each square.

I preserved my Ruby approach along with my Ruby port of the Python version in a GitHub repo for comparison along with the sudoku.csv test framework and results: bdunagan/SudokuAI.

SudokuAI

As an intersection between Product Management and Engineering, I thought productizing Norvig’s algorithm into an iPhone app written in Swift would be a fun extension to this side project. Read more about the process in SudokuAI: Instantly Solve Any Sudoku Puzzle with Your iPhone.

Adopting a Continuous Delivery (CD) Jenkins DevOps Pipeline

Sat, 29 Oct 2022 12:00:00 +0000

Unsplash - @chuttersnap

More than a decade ago, Retrospect Engineering set up a daily build system for our Windows and Mac products. Once a day, a cronjob would kick off a task to build the latest version of our codebase and email the results to the team. The daily build was a fantastic way to automate the build process: it gave us a repeatable build, ensured the main codebase compiled in production, and stored builds that we could use to narrow down any issue that arose during regression testing.

Since then, continuous integration (CI) workflows have become a common practice for Engineering teams. CI and continuous delivery (CD) are both processes to automate building, testing, and in the case of delivery, packaging and deploying a product to the final platform for customers. This DevOps automation creates a pipeline from codebase to customers, eliminating human bottlenecks and manual steps to increase the efficiency and performance of an engineering team.

Google’s DevOps Research and Assessment (DORA) group studied thousands of teams to understand the practices and metrics for high-performing teams vs low-performing teams, and continuous delivery is a fundamental building block for high-performing teams. By relying on an automation pipeline, engineering teams can ship more features faster and better tested while reducing burnout.

Recently, we decided it was time to upgrade our infrastructure to a continuous delivery workflow to improve our efficiency as an Engineering team. Let’s walk through the original homegrown build system, what our overall goals for automation were, and how we extended the homegrown build system into a continuous delivery pipeline with Jenkins to improve our delivery performance.

Homegrown Build System

Since the mid-2000s, Retrospect has had a daily build system for Windows and Mac and a manual build for Linux.

Homegrown Build System

On Windows, we’ve used Visual Build Pro to generate builds. Visual Build Pro checked out the source code from GitHub Enterprise, compiled the appropriate project, posted the build output to a NAS share, and emailed the Engineering team with the results. When we added automated testing, we updated Visual Build Pro to kick that off as well.

On Mac, we wrote our own build system. The first version was a shell script, run daily as a cron job. That lasted a couple years until we rewrote it in Ruby. The Ruby version has endured quite well. In the intervening years, we’ve added support for building different versions of the Mac application, running regressions with our automated test framework, signing them, notarizing them, and uploading the final artifacts to Amazon S3. The Ruby build script is an end-to-end build process for our Mac product.

On Linux, we have never made changes to the client agent frequently enough to justify automating the build process. We built it manually for every change.

Continuous Delivery Goals

Our homegrown system had elements of continuous integration, but there were many manual steps. Let’s walk through what we needed from a continuous delivery pipeline:

Continuous Delivery: We wanted a single system that supported an end-to-end pipeline, starting when a pull request was merged, building, testing, packaging, deploying to Amazon S3 for download by customers, and notifying the Engineering team of the results.
On-Premise: We use GitHub Enterprise, so we needed a system that could run on-premise and control the local testing infrastructure.
Cross Platform: Retrospect has Windows, Mac, and Linux elements to build as well as cross-platform dependencies. We needed to build a Mac artifact and then transfer it to Windows and vice versa. Moreover, we needed to check out our codebase at a single consistent point across platforms.
Build Commit with Status: As a team, we are accustomed to having a build commit to master/main on GitHub that identifies the exact bits that were compiled and shipped, and the build commits keep version numbers and cross-platform artifacts up-to-date within the codebase. We wanted to consolidate the two commits that we currently had into a single commit and mark it with a status from the CD pipeline.

Our goal was to leverage our current build processes, knit them together into a single build pipeline, and extend that into continuous delivery.

Third-Party Build Pipeline

Building a homegrown system has the fantastic advantage of doing exactly what it was coded for and the unfortunate downside of doing nothing else. Only the internal team pushes the functionality forward. We needed to leverage a third-party solution to achieve our CD goals with minimal commitment.

We looked at Jenkins, GoCD, Bamboo, CircleCI, and GitHub Actions; Jenkins was open source, worked on-premise, and seemed to be popular enough that we could google for help. Jenkins was designed to be a CI system with a plugin architecture, and it had thousands of plugins. We thought it made sense to try it first.

Overall, the setup process took about three weeks of one engineer’s time, spread out over three months. We have nine stages in our pipeline using three nodes (2 Windows, 1 Mac) to perform the following steps through a versioned Groovy script.

Third-Party Build Pipeline

Auto Trigger: We wanted the new system to start a new build when someone merged a pull request to master. As we use GitHub, its Webhooks were the natural choice. We originally set up GitHub Webhooks to only send Jenkins data on pull request changes, but because it didn’t actually include the master commit, Jenkins didn’t trigger the build. We had to send the push event from GitHub for the Jenkins Git plugin to notify our pipeline. However, because the pipeline was triggered by pushes, the later build commit auto-triggered a new build, creating a feedback loop. We had to add special code to detect and suppress builds based on the build commit.
Delay: We set a 30-minute delay before the pipeline starts in case an engineer merges several pull requests in quick succession (which did happen to us in practice).
Code Checkout: We have two platforms that we’re building on, and we needed the codebase to be in the same state on both to avoid the race condition of building different commits on different platforms. We reduced this race condition from 45 minutes to 15 seconds by separating out the checkout stages and moving them to the beginning. We did experiment with checking out a specific Git hash, but that approach ran into issues when we tried to push a new build commit back to master.
Build Dependencies: Retrospect for Windows includes a Mac binary and vice versa. The build pipeline needed to build the dependencies on both platforms and then copy them to the other platform via an Amazon S3 artifact staging folder.
Commit Version and Dependencies: The team was accustomed to the version of the product and the above dependencies that they use in development being updated by the build system. In the past, we had the separate build systems commit these, but that approach would re-introduce the race condition from above. If one platform committed to master, the other platform would need to fetch that change or its push would fail, but because it was fetching a change, it could fetch too much. Either build commit push would fail if someone had merged another pull request, but we were trying to avoid silent bugs, rather than the build process failing. We consolidated the two platform commits into one by transferring all of the updated files from Windows onto Mac via the S3 artifact staging folder and then committing all of them.
Build Applications: Each platform node would build Retrospect client agent and engine application as before, using the previous build systems.
Package Applications: Jenkins was quite handy for creating the folder structure and moving files around for packaging. On Mac, we did need to code sign and notarize the binaries. These steps used to be manual, and the team really appreciated automating them.
Deploy Applications: Our deployment process is quite straightforward: upload the packages to S3 and mark them as public.
Test Applications: On Mac, the original build system continued to run automated testing on the finalized product. On Windows, Jenkins downloaded the finished binaries to a new node where it built the test framework and then ran it against the new binaries. Our tests are quite extensive, as we make backup software, so the tests actually extended the build pipeline from two hours to nine hours.
Email Team: Jenkins sent an email at the end with the status of the build and linked (if successful) to the staged final products.

It’s a long list, but it ticks off all of our requirements. Jenkins provides us with an infrastructure for managing multiple build systems on different platforms as nodes in a single pipeline, and its Groovy script support enables us to knit those build systems together into a DevOps process.

The resulting Groovy script is 500 lines, and that count excludes the 2k-line Ruby file for building Mac and the extensive Visual Build Pro script on Windows. The option to pull the Groovy script from source code (GitHub Enterprise in our case) highlights the DevOps mindset of repeatability and automation.

While the team is really happy that Jenkins is working now, the setup process was not smooth. It took a significant amount of time to understand what Jenkins could provide us, how to connect what we needed into what it could deliver, and then codify that process in Groovy. Beyond that expertise barrier, Jenkins has bugs, like all software. We hit a number of issues where we eventually found open JIRA issues without resolution, so we worked around them.
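One detail from the Auto Trigger step, suppressing builds that were triggered by the pipeline's own build commit, can be sketched outside of Jenkins. The real logic lives in the Groovy script and is not shown here, so this Ruby version is only an illustration under stated assumptions: the marker string and the method name are hypothetical, and it assumes the "ref" and "head_commit" message fields of GitHub's push event payload.

```ruby
require "json"

# Hypothetical marker that the pipeline would put in its own commit messages.
BUILD_COMMIT_MARKER = "[build commit]"

# Decide whether a GitHub push event should start the pipeline.
# Assumes the push payload's "ref" and "head_commit" => "message" fields.
def trigger_build?(payload_json, branch: "master")
  payload = JSON.parse(payload_json)

  # Ignore pushes to any other branch.
  return false unless payload["ref"] == "refs/heads/#{branch}"

  # Ignore the pipeline's own build commit to avoid a feedback loop.
  message = payload.dig("head_commit", "message").to_s
  !message.include?(BUILD_COMMIT_MARKER)
end
```

Matching on a commit message marker is one possible design. Checking the commit author against a dedicated build account would be another, and either one breaks the loop of a build commit triggering the next build.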

High-Performing Team

CI/CD workflows are a necessary part of a high-performing Engineering team. While we had a good system for building and testing the software, we were missing the full continuous delivery pipeline. Like all Engineering projects, there was an opportunity cost associated with the infrastructure upgrade, but resolving this technical debt allows the team to focus on features rather than manual processes.

Thanks to the engineer who took on this project and overcame the hurdles we found in Jenkins to wrap up this project and to the entire team for helping out when needed! Moving to cross-platform continuous integration and continuous delivery had been a goal for years, so a huge shout-out to the Retrospect Engineering team for ticking it off this summer.

Anomaly Detection with Retrospect Backup

Sun, 06 Mar 2022 12:00:00 +0000

Unsplash - @kaip

The rise of ransomware, and ransomware-as-a-service more recently, poses a huge threat to businesses around the world with a projected $20B payout in 2021, a 100% year-over-year increase for the last four years. Ransomware is now a vast ecosystem with many different forms of attacks. Many attackers have their own versions of ransomware, and these are called variants. Each variant has the same purpose, but it uses a different mechanism or simply a different naming convention. The majority of ransomware variants and all of the top 10 forms for 2021 followed the same attack pattern: infiltrate a computer, encrypt its files, and then rename the files with a different extension. Businesses need to detect ransomware as early as possible to stop the threat and remediate those resources.

Anomaly detection in Retrospect Backup 18.5, launched last month, identifies changes in an environment that warrant the attention of the IT team. Administrators can tailor anomaly detection to their business’s specific systems using customizable filtering and thresholds for each of their backup policies, and those anomalies are aggregated on Retrospect Management Console across the entire business’s Retrospect Backup instances or a partner’s client base with a notification area for responding to those anomalies.

Retrospect Backup detects all of the major ransomware variants using an algorithm that focuses on file metadata anomalies for behavior-based monitoring. According to Coveware, the top variants are always changing, with over 50% changing every quarter. The key to detection is combining technologies such as signature detection in processes with file-based irregularities. Using a multi-pronged defense, with immutable backups, anomaly detection, and other security layers, businesses will know when they’re being attacked and will have the tools to remediate it and move on.

Technical Deep Dive

As a data protection solution, Retrospect Backup has a significant footprint in a business’s computer environment with visibility into endpoints, servers, NAS volumes, and even cloud storage. To detect anomalies, Retrospect Backup provides a per-policy option for filtering and threshold to decide whether or not certain file changes are an anomaly with options for notifications. Let’s walk through each:

Filtering: Configure a filter to identify the files to observe. Retrospect lets administrators tailor this to file types, paths, dates, or specific attributes, and the built-in filter focuses on office documents, photos, and movies.
Threshold: Set the threshold for the alert. If the percentage of new or changed files out of the total number of files matched by the filter is greater than or equal to the threshold, Retrospect will create an anomaly event.
Notification: Access notifications on Retrospect Management Console, receive them immediately in an email, and find them in the Execution History and Backup Report. Retrospect surfaces the notification for anomaly detection in the best place for an organization.

Anomaly Detection Diagram

The diagram shows the volume being monitored as a whole, the subset of files that match the “Anomaly Detection” filter, and the files that are new or changed within that subset. Retrospect generates an alert if the percentage meets or exceeds the threshold.
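The threshold arithmetic can be illustrated with a short Ruby sketch. This is not Retrospect's implementation, which is not shown here. The FileEntry struct, the extension list standing in for the filter, and the anomaly? method are all hypothetical.

```ruby
# Hypothetical record for one file seen during a backup.
FileEntry = Struct.new(:path, :changed)

# Hypothetical filter: a few document, photo, and movie extensions.
WATCHED_EXTENSIONS = %w[.docx .xlsx .pdf .jpg .png .mov .mp4].freeze

# True when the share of new or changed files among the filtered files
# is greater than or equal to the threshold percentage.
def anomaly?(files, threshold_percent)
  watched = files.select { |f| WATCHED_EXTENSIONS.include?(File.extname(f.path).downcase) }
  return false if watched.empty? # nothing matched the filter

  changed_percent = 100.0 * watched.count(&:changed) / watched.size
  changed_percent >= threshold_percent
end
```

Files outside the filter do not count toward either side of the ratio, so a burst of changed log or temp files would not trip the alert in this model.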

Customers can simply enable “Anomaly Detection” in each policy, select the appropriate filter, and assign a threshold. Retrospect takes care of the rest.

Retrospect Backup for Windows - Anomaly Detection

Retrospect Backup for Mac - Anomaly Detection

When an anomaly is detected, Retrospect can generate an email notification and surface the anomaly in the application and on Retrospect Management Console. It also provides API integration for businesses to tailor the workflow to their processes using Script Hooks, so the organizations can take further actions, like posting to a Slack channel or even stopping the backup until the problem is assessed.

Free Update

Congratulations to the Retrospect Engineering team for releasing such a huge update! Retrospect Backup 18 included immutable backups for ransomware protection, and with this free update, we’re extending that ransomware focus to anomaly detection.

Containerizing our Web Development Toolchain with Docker

Sat, 19 Feb 2022 12:00:00 +0000

Unsplash - @exdigy

Containerization has been around for decades since chroot, but the technology has become ubiquitous as large tech companies have invested in platforms and tools like Kubernetes to make containers easier to use and scale. While I’d been using containers with Heroku for years, I didn’t realize how pervasive they were until I read an ACM Queue article in 2016 about Google’s Borg, Omega, and Kubernetes, especially this section:

“The isolation and dependency minimization provided by containers have proved quite effective at Google, and the container has become the sole runnable entity supported by the Google infrastructure.”

Containers are how Google can scale services, by abstracting away the physical infrastructure and building orchestration tools to manage them, and they fit into a broader context with the rise of DevOps, handing Engineering an excellent building block for creating software delivery pipelines and deploying applications.

Containers are one level of abstraction up from virtual machines. Virtual machines (VMs) allow you to run an operating system in an isolated environment in a server, virtualizing the hardware, but containers allow you to run a single application within an isolated environment, virtualizing the operating system. Compared to virtual machines, containers are smaller, start faster, and consume fewer resources, so you can run more on a single physical hardware system while maintaining isolation between them.

Virtual Machine vs Container

You’ll find many diagrams like this one online, but it’s an oversimplification. Containers are so small and fast because they rely on the host OS’s kernel, but they actually need their own minimal guest OS as well to handle tasks like package management, configuration, and other miscellaneous OS features. The application’s libraries and dependencies as well as the small guest OS are all wrapped up in “Libs”. As an example, the Alpine Linux OS is 5MB, and applications running on it start in less than a second.

Docker made containers accessible and has become very popular since its launch in 2013 with support for Windows, Mac, and Linux. Docker containers are instances of a Docker image, and each image is a stack of immutable layers, based on the commands of your Dockerfile. This layered immutability has multiple benefits, allowing you to only change the image based on new or changed commands and to store layers separately in Docker Hub, the hosted service for distributing public Docker images. Clients can download layers concurrently for performance.
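To see why layered immutability means only new or changed commands cause work, here is a simplified Ruby model. It is not how Docker computes layer IDs: real builds also take file contents for COPY and ADD into account. In this sketch each layer's ID depends only on its parent's ID and its own command, and layer_ids is a hypothetical helper.

```ruby
require "digest"

# Simplified model: a layer's ID is a hash of its parent's ID plus its command.
def layer_ids(commands)
  parent = ""
  commands.map do |command|
    parent = Digest::SHA256.hexdigest(parent + command)[0, 12]
  end
end

old_build = layer_ids([
  "FROM ruby:2.7.5-alpine3.14",
  "RUN apk add build-base",
  "RUN bundle install"
])
new_build = layer_ids([
  "FROM ruby:2.7.5-alpine3.14",
  "RUN apk add build-base git", # the only command that changed
  "RUN bundle install"
])

# Layers before the first changed command can be reused; the changed layer
# and everything stacked on top of it get new IDs and must be rebuilt.
reusable = old_build.zip(new_build).take_while { |a, b| a == b }.size
```

In this model the base image layer is reused, while the unchanged final command still gets a new ID because its parent changed. That is a common reason to put slow, rarely changing steps early in a Dockerfile.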

Toolchain Upgrade

Before the transition to Docker, our web app toolchain was a local macOS environment for development, GitHub for source control, and Heroku for deployment. Retrospect Engineering has used Heroku for web app deployment since 2011, starting with Retrospect.com, our Ruby-on-Rails website, and then with Retrospect Management Console, our hosted service for backup analytics. Heroku, now owned by Salesforce, has been a fantastic deployment target. Deploying a new version is as simple as pushing to a Git repository.

Development Toolchain

Instead of installing and managing upgrades for Ruby/Rails/Gems/Postgres/Redis on macOS, we manage them all within a Docker container running on our Mac development computers. We finally decided to migrate from a local macOS environment to Docker because of upgrade issues. Every Ruby/Rails/gem/macOS version update hit installation issues which we only resolved after finding a blog post for our specific combination of versions. Upgrades were painful enough that we postponed them, a sure sign that we needed a deeper toolchain upgrade.

Docker provided that next-generation environment, where upgrading a Ruby version was as simple as changing the base Docker image that we depend on. Every upgrade starts with a clean slate in the image, so there are far fewer conflicts between different versions or different packages. Using containers for development work increased the Engineering team’s efficiency.

One Container

The website is a Ruby-on-Rails environment with only a SQLite dependency for local development, so we use a simple Docker configuration file (Dockerfile). Here is our short file for the website:

FROM ruby:2.7.5-alpine3.14 as retrospectdotcom

# Set up environment: build-base for "developer tools", sqlite for SQLite, postgres for Postgres, git for Git, and tzdata for weird Rails 5 error.
RUN apk update && apk add --no-cache build-base sqlite-dev postgresql-client postgresql-dev tzdata git

# Add Gemfiles for Bundler.
RUN mkdir /retrospectdotcom
COPY ./Gemfile /retrospectdotcom
COPY ./Gemfile.lock /retrospectdotcom
WORKDIR /retrospectdotcom

# Set up gems
RUN bundle install
RUN rm -rf /retrospectdotcom

# Run rails server
CMD ["bundle", "exec", "rails", "server", "-b", "0.0.0.0"]

Let’s walk through each chunk:

Base Image: We use the official Ruby Alpine Linux image as the base image. Common tools like Ruby publish official images to Docker Hub.
APK: Alpine Package Keeper (APK) is an extremely lightweight package manager. One of the reasons we need a base image inside the container is for the package management functionality, so that we can download the necessary libraries for our application’s environment.
Gemfiles: We create the directory for our application, copy in the Gemfiles, run bundle install to install the required libraries, and finally remove both Gemfiles. This image is specifically for local development, so we want to bind mount to the host’s application folder and not depend on any application code within the container.
Gemfile.lock: We actually created a separate Dockerfile for running bundle update because it’s such an infrequent task to generate a new Gemfile.lock, and we wanted it to be an intentional action.
Rails Server: We run the rails server command to start the application.

From Terminal, we use the Dockerfile to build the image. Then, to run an instance of the image, we use a bind mount, reference a Docker environment file, and include -it to allow interactive debugging with the pry-byebug gem.

docker build -t retrospectdotcom .
docker run -it --mount type=bind,source="$(pwd)",target=/retrospectdotcom --env-file Docker.env -p 3000:3000 retrospectdotcom

With that, we have the website running in a Docker container with the following features:

Local Development: It’s accessible at http://localhost:3000.
Live Editing: The bind mount means the container’s app folder references the app folder on the host under source control, so that we can do live editing and commit our changes as if the app was not running in a container.
Live Debugging: The -it parameter enables tty mode for live interaction, which means we can use debugger for breakpointing.

You can also override the default command by passing a command at the end. For instance, to run rails console, use:

docker run -it --mount type=bind,source="$(pwd)",target=/retrospectdotcom --env-file Docker.env -p 3000:3000 retrospectdotcom bundle exec rails console

Instead of typing the whole line every time, we set up a shell alias in ~/.bash_profile:

alias rs='docker run -it --mount type=bind,source="$(pwd)",target=/retrospectdotcom --env-file Docker.env -p 3000:3000 retrospectdotcom'
# The above command becomes
rs bundle exec rails console

Multiple Containers

Retrospect Management Console is a Ruby-on-Rails web application that has two dependencies: Postgres and Redis. We needed all three services running and able to communicate with each other. This configuration required Docker Compose to handle the multiple microservices.

With Docker Compose, we can set up a docker-compose.yml file to describe the containers, dependencies, volumes, and configurations. Docker Compose even supports health checks to periodically verify that a service is operating correctly.

version: '2'
services:
  postgresdb:
    image: 'postgres:11.2-alpine'
    user: postgres
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
    env_file:
      - 'Docker.env'
  redis:
    image: 'redis:4.0-alpine'
    volumes:
      - redisdata:/data
    ports:
      - "6379:6379"
  console:
    depends_on:
      'postgresdb':
        condition: service_healthy
      'redis':
        condition: service_started
    command: sh -c "rm -f tmp/pids/server.pid && bundle exec rails s -p 3000 -b '0.0.0.0'"
    build: .
    stdin_open: true
    tty: true
    ports:
      - '3000:3000'
    volumes:
      - .:/consoledotcom
    env_file:
      - 'Docker.env'
    entrypoint: "/docker-entry.sh"
  sidekiq:
    build: .
    command: bundle exec sidekiq -C config/sidekiq.yml
    volumes:
      - .:/consoledotcom
    links:
      - postgresdb
      - redis
    env_file:
      - 'Docker.env'
volumes:
  pgdata: 
  redisdata:

For our docker-entry.sh file, we added logic to run any pending Rails migrations if the database exists or to create and set up the database if it does not:

# Run postgres tasks.
if psql -h "$PGHOST" -U "$PGUSER" "$PGDBNAME" -c '' 2>&1; then
    # Apply any migrations.
    echo "Running 'bundle exec rake db:migrate'"
    bundle exec rake db:migrate
else
    # If no database, run db:setup (db:create + db:schema:load + db:seed)
    # Note this will attempt all of those actions above twice (development + test) 
    # and might include log errors due to https://github.com/rails/rails/issues/27299
    echo "Running 'bundle exec rake db:setup'"
    bundle exec rake db:setup
fi

# To load database dump, run the following from the *host* Terminal (because Docker can't see the host file system):
# pg_restore --verbose --clean --no-acl --no-owner -h localhost -U postgres -d dbname latest.dump

# Run "command" supplied by docker-compose.
exec bundle exec "$@"

As with the website, we configured Docker to use the host system’s web application folder under source control via a bind mount.

To launch the containers, first build the images, then bring the services up.

docker-compose -f docker-compose.yml build
docker-compose -f docker-compose.yml up

Live debugging is one extra step. We needed stdin_open: true and tty: true entries under the console container in docker-compose.yml. Then, we needed to add debugger in the Rails app where we wanted to breakpoint. Finally, we needed to attach to the container after it launched.

docker ps
docker attach container_name

To detach, type the escape sequence Ctrl+P then Ctrl+Q.

Since the Postgres database now runs in a container, we needed to modify our psql command to access it. Fortunately, Docker makes it easy:

# https://stackoverflow.com/questions/27673563/how-to-get-into-psql-of-a-running-postgres-container
psql --host localhost --port 5432 --username postgres

Finally, here are a couple more handy Docker commands:

# Clean up Docker
docker system prune
# Check storage usage (because Docker is only allocated a certain amount of space per its preferences):
docker system df
# Run sh in an instance
docker exec -it container_name sh

Headless Automation Testing

Automation testing is significantly more complicated with Docker. You can’t just drive a browser because there is no screen. We needed to update our Rails RSpec and Capybara setup to use a headless environment for the integration tests. After testing different options, we settled on Selenium Grid with a Google Chrome container using Docker Compose, based on Plymouth Software’s excellent blog post.

With this environment, RSpec can run through our 711 tests for the website and generate screenshots on the failures, covering the site and the embedded partner and support portal.

Here is the docker-compose.test.yml file for reference:

services:
  web:
    environment:
      HUB_URL: http://chrome:4444/wd/hub # <-- Add the HUB_URL environment variable
    depends_on:
      - chrome # <-- Link to the chrome container
    ports:
      - "3000:3000"
    build: .
    command: sh -c "rm -f tmp/pids/server.pid && bundle exec rspec"
    stdin_open: true
    tty: true
    env_file: Docker.env
    volumes:
      - type: bind
        source: .
        target: /retrospectdotcom
  chrome:
    image: selenium/standalone-chrome:4.1.0-20211123 # this version should match that of the selenium-webdriver gem (see Gemfile)
    ports:
      - "4444:4444"
      - "7900:7900"
    volumes:
      - /dev/shm:/dev/shm

Here is a snippet from rails_helper.rb that works for us:

require 'byebug'

# Support devise
require 'devise'
RSpec.configure do |config|
  config.include Devise::Test::ControllerHelpers, :type => :controller
  config.include FactoryBot::Syntax::Methods
end

# Add Capybara settings and remote web driver. See docker-compose.test.yml for more details.
# NOTE: screenshots for failing tests appear in #{Rails.root}/tmp/screenshots.
Capybara.server = :webrick
Capybara.register_driver :chrome_headless do |app|
  chrome_capabilities = ::Selenium::WebDriver::Remote::Capabilities.chrome('goog:chromeOptions' => { 'args': %w[no-sandbox headless disable-gpu window-size=1400,1400] })

  if ENV['HUB_URL']
    Capybara::Selenium::Driver.new(app,
                                   browser: :remote,
                                   url: ENV['HUB_URL'],
                                   desired_capabilities: chrome_capabilities)
  else
    Capybara::Selenium::Driver.new(app,
                                   browser: :chrome,
                                   desired_capabilities: chrome_capabilities)
  end
end

RSpec.configure do |config|
  config.before(:each, type: :system) do
    driven_by :chrome_headless

    Capybara.app_host = "http://#{IPSocket.getaddress(Socket.gethostname)}:3000"
    Capybara.server_host = IPSocket.getaddress(Socket.gethostname)
    Capybara.server_port = 3000
  end
end

Getting the setup exactly right was tricky, but this is the code that works for us. We frequently referred to the RSpec site as well.

Dev Setup

The proof that the toolchain upgrade was a success came when we rolled out Docker to the wider web development team. Previously, setting up Ruby, Rails, and all of the gems was a multi-hour process with frequent web searches to fix environment issues. With Docker, the developers downloaded Docker, built the image with one Terminal command, and launched the multi-container application with a second command. It took five minutes.

Building a Technical Documentation DocOps/DevOps Pipeline

Sat, 05 Feb 2022 12:00:00 +0000

Unsplash - @syinq

Retrospect Backup is designed to be easy to use for non-technical customers, but it has a deep set of features for the technical subset. To understand its capabilities, customers need well-written, up-to-date documentation, in the form of user guides and knowledgebase articles in their language, with targeted content accessible as a URL. Retrospect has been around for 33 years and supports six languages, so achieving that goal is a significant ongoing DocOps/DevOps project.

In the past, we used Adobe FrameMaker for our user guides and a Salesforce Force.com Site for our knowledgebase. Using FrameMaker meant only someone with FrameMaker knowledge could update the user guides, typically an outside contractor. It was difficult to see what changed when updating the content, so the team had to trust spot-checking. Using Force.com meant Support could manage the knowledgebase content, but it wasn’t easy and had no search capabilities at the time. Moreover, language support for both simply multiplied the required tasks.

All these barriers led to fewer documentation updates, detracting from our customer experience and increasing Support tickets from confused customers.

Documentation Pipeline

To simplify the process of keeping documentation current and adding new information, we updated our documentation DocOps/DevOps pipeline with a set of goals and the following tools:

Markup: Documents need to be readable plain text with markup => AsciiDoc and Markdown
Source Control and WYSIWYG: Documents need to be versioned, diffable, and editable by non-technical staff => Git and GitHub
Languages: Documents need to be easily localized => Gengo
Deployment, Search, and Links: Document generation needs to be automated and deployable with search and deep linking => Website, Prince, and Swiftype

Documentation Pipeline

Markup: AsciiDoc and Markdown

Markup languages like AsciiDoc and Markdown are excellent at embedding sophisticated styles and layouts in plain text. AsciiDoc in particular was designed for generating books from markup, so it works well for both HTML generation and PDF generation. We use AsciiDoc for most parts of our documentation now, both user guides and KB articles, and mix in Markdown documents when it’s easier.

Converting Adobe FrameMaker to AsciiDoc was painful. The Retrospect Backup for Windows User’s Guide was 500 pages long, and the Retrospect Backup for Mac User’s Guide was 300 pages long. Each was localized into six languages: English, German, Spanish, French, Italian, and Japanese. First, we exported from FrameMaker to HTML, but the HTML was very messy. Next, we used many passes of regular expressions in Ruby to clean it up, and finally, we used pandoc to convert from HTML to AsciiDoc.

Compared to the Adobe FrameMaker conversion, we had a far easier time ingesting the SFDC Article pages that populated our Force.com knowledgebase. We used a Ruby gem to read the objects in from the API and write them out to files with bits of Ruby to convert them into Markdown. A small subset of KB articles had already been localized, and those were saved as kb_name.language.adoc (e.g. error_1101.ja.adoc).
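The kb_name.language.adoc naming convention can be sketched in a few lines of Ruby. This is an illustration rather than our production code: the parse_kb_filename helper, the list of language codes beyond "ja", and the assumption that an article without a language suffix is English are all hypothetical.

```ruby
# Hypothetical sketch: split "kb_name.language.adoc" into its parts.
# Assumed ISO 639-1 codes for the six supported languages.
LANGUAGES = %w[en de es fr it ja].freeze

def parse_kb_filename(filename)
  name, suffix = File.basename(filename, ".adoc").split(".", 2)
  # Assumption: no language suffix (or an unknown one) means English.
  language = LANGUAGES.include?(suffix) ? suffix : "en"
  { name: name, language: language }
end

p parse_kb_filename("error_1101.ja.adoc")
p parse_kb_filename("error_1101.adoc")
```

Keeping the language in the filename means a localized article sits next to its English source in the same directory and the same diff.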

With all of our documentation now in a plain text format, we could easily add, edit, delete, and localize it.

Language: Gengo

Gengo translates content within a few hours at a low price point. We use them for all of our localization needs. Compared to our previous service, it has saved us thousands of dollars and months of waiting. Getting a string translated for a couple dollars in an hour by a professional sounded absurd to us before we discovered Gengo.

Source Control and WYSIWYG: Git and GitHub

We switched to Git and GitHub almost a decade ago: Migrating Retrospect from SVN to GitHub Enterprise. By adding our documentation to Git, we could easily version the content and diff any changes. We now have 402 KB articles in Git, and our user guides total 104k lines. When we recently added a new chapter in the user’s guide, it only took 10 minutes and became a GitHub pull request to be diff’d and reviewed. When Support writes up a new KB article, Engineering can add it to the production website in about 10 minutes.

Moreover, GitHub provides an excellent WYSIWYG editor for AsciiDoc and Markdown files, with rich styles and preview available.

GitHub - WYSIWYG Editor

GitHub - WYSIWYG Preview

Using Git and GitHub allows us to integrate our documentation into our existing build and deployment pipeline while providing editor abilities to non-technical staff.

Deployment, Search, Links: Website, Prince, and Swiftype

The documentation is now in plain text and versioned, but we still need to deploy it with search and links. Most companies use WordPress for their websites, but we use Ruby-on-Rails. The tradeoff is extensibility. Ruby-on-Rails is a less common choice, so fewer web developers have experience with it compared to WordPress. However, Rails lets us quickly build new functionality: Building Deep, Scalable Customer Relationship Management Tools.

For the knowledgebase, we built a new page with categories and links to all of the articles. To populate it, we wrote a simple Rails initializer to read in all of the articles in the directory and store their frontmatter in memory to be used later by the KB index page.
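As a rough illustration of that initializer, here is a minimal sketch in plain Ruby. It assumes YAML-style front matter between "---" lines and a "category" key. The function names, the key names, and the two sample articles are hypothetical, and a real initializer would read from the app's article directory instead of a temporary one.

```ruby
require "yaml"
require "tmpdir"

# Assumed layout: YAML front matter between two "---" lines at the top of the file.
FRONT_MATTER = /\A---\s*\n(.*?)\n---\s*\n/m

# Hypothetical helper: return a file's front matter as a Hash ({} if absent).
def read_front_matter(path)
  match = File.read(path).match(FRONT_MATTER)
  match ? (YAML.safe_load(match[1]) || {}) : {}
end

# Build the in-memory index once at boot: articles grouped by category.
def build_kb_index(dir)
  articles = Dir.glob(File.join(dir, "*.adoc")).sort.map do |path|
    read_front_matter(path).merge("slug" => File.basename(path, ".adoc"))
  end
  articles.group_by { |article| article.fetch("category", "Uncategorized") }
end

# Demo with two throwaway articles instead of a real Rails app.
KB_INDEX = Dir.mktmpdir do |dir|
  File.write(File.join(dir, "error_1101.adoc"),
             "---\ntitle: Error 1101\ncategory: Troubleshooting\n---\n= Error 1101\n")
  File.write(File.join(dir, "cloud_setup.adoc"),
             "---\ntitle: Cloud Setup\ncategory: Cloud\n---\n= Cloud Setup\n")
  build_kb_index(dir)
end

KB_INDEX.each do |category, articles|
  puts "#{category}: #{articles.map { |a| a['title'] }.join(', ')}"
end
```

Reading the files once at boot keeps the index page free of disk access per request, at the cost of needing a restart or deploy to pick up a new article.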

Website - Knowledgebase

The user’s guides took more effort. We could easily build a section of the website for Documentation. The trick was handling the chapters well. We wanted to enable Support to link directly to specific sections of chapters without loading the entirety of the user’s guide in a single page. We manually build the chapters in the guide to allow custom sorting, but we dynamically build the subchapters by reading and parsing the AsciiDoc files within Rails. With both, we can display any chapter along with a sidebar of the overall structure of the guide for context.
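The subchapter parsing can be approximated with a short Ruby sketch. It assumes subchapters are level-2 AsciiDoc headings ("== Title") and that anchors use an empty id prefix with "-" as the separator. The section_id and subchapters helpers are hypothetical, the id rule is a simplification of what asciidoctor generates, and the regex would also match a "== " line inside a listing block.

```ruby
# Hypothetical approximation of an anchor id: lowercase, with runs of
# non-alphanumeric characters collapsed to a single "-".
def section_id(title)
  title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
end

# Collect level-2 headings only; "= Chapter" and "=== Detail" are skipped.
def subchapters(asciidoc)
  asciidoc.each_line.filter_map do |line|
    match = line.match(/\A==\s+(.+?)\s*\z/)
    { title: match[1], anchor: section_id(match[1]) } if match
  end
end

chapter = <<~ADOC
  = Restoring

  == Restoring Files
  Some text.

  === A Deeper Section

  == Backup Sets & Scripts
ADOC

subchapters(chapter).each do |sub|
  puts "#{sub[:title]} -> ##{sub[:anchor]}"
end
```

The sidebar only needs the title and anchor for each subchapter, so a line scan like this avoids rendering the whole chapter just to build navigation.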

Website - User's Guide

It took some time to work out the correct AsciiDoc syntax to integrate well with Rails via asciidoctor. For instance, every chapter needed the following frontmatter:

----
:showtitle:
:idprefix:
:idseparator: -

= Chapter Title [[chapter-title]]
----

Another hurdle was incorporating ERB code into AsciiDoc, so that we could use our Rails helper functions, like short references to our S3 assets:

+++<%= image_tag(s3_image("docs/example.png"), :class => "whitepaper_image") %>+++

The user’s guides also needed to be converted into PDFs for offline usage. We use Prince to convert them from a single HTML page into 300+ page PDFs, wrapped into a Rails rake task.

Because the knowledgebase and user’s guides are on the website, all of their content is indexed by Google and others. For search, we use Swiftype to index the content, rank the results when necessary, and integrate a simple search field at the top of the page.

Alternatives

We are an engineering-focused company, and many companies optimize their documentation pipeline for other goals.

Full Service: Services like ZenDesk provide a complete customer service experience, from ticketing to knowledgebase.
Domain-Specific Documentation: There are other projects to leverage, like Slate, to create documentation for a specific purpose, like API documentation.

There are clear downsides to our approach. Retrospect Engineering is the bottleneck for feature development compared to using a third-party service. I touched on these sorts of trade-offs in Building Deep, Scalable Customer Relationship Management Tools. The issue comes down to what is best for a company at a given time.

Internal Documentation

Along with customer-facing documentation, we migrated our internal documentation from MediaWiki to AsciiDoc hosted on GitHub Wikis: 542 pages. GitHub Wikis are version controlled, in plain text, with universal search, and with access via a local repo. The wiki lives in its own repo but under the Wiki feature, so that we have access to the better WYSIWYG editor with sidebar support and have the universal search separate the wiki entries correctly.

For the migration, we used Git’s wiki extension to clone MediaWiki to a Git repo then used pandoc to convert the pages from MediaWiki format to AsciiDoc. Most of the pages are old and unused, another reason why we wanted the content in GitHub. It’s easy to see the entirety of the wiki as local text files rather than a long HTML list. The biggest downside was migrating the 300 embedded images and files from MediaWiki. We opted to migrate selectively when we found one missing in a relevant wiki page in GitHub. In the end, we only migrated a couple dozen files.

Lowering Barriers

Retrospect Engineering has consistently found that better tools and workflows lead to significantly improved experiences. We found that with Git and GitHub, and we found that with these documentation pipeline upgrades. The improved DocOps/DevOps pipeline reduced our documentation schedule from weeks to days.

Leveraging Forever-Incremental Backup Technology for Customer-Centric Data Protection

Sun, 02 Jan 2022 12:00:00 +0000

Unsplash - @benjaminlehman

Let’s say you have a computing environment, and you need to protect its contents over time. You need a backup: a versioned snapshot of the systems for point-in-time restores. Any backup solution will be optimized for certain criteria–simplicity, completeness, flexibility, etc–but each one needs to perform backups. Let’s walk through the three different types:

Full Backup: This is a complete point-in-time snapshot of a volume. It takes up the most space.
Differential Backup: This is a backup that only depends on one full backup. If you have two differential backups based on one full backup, each differential backup contains every change since that full backup, so it takes up more space than an incremental backup.
Incremental Backup: This is a backup that depends on a full backup and a number of incremental backups. This is the smallest backup format you can have because it only takes up as much space as necessary to preserve point-in-time restores for each recovery point.
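A small worked example makes the space difference concrete. The numbers are assumptions chosen for illustration, not measurements: a 100 GB volume, 2 GB of non-overlapping changes per day, and four backups after the initial full.

```ruby
FULL_GB = 100         # assumed size of the protected volume
DAILY_CHANGE_GB = 2   # assumed non-overlapping changes per day
LATER_BACKUPS = 4     # backups taken after the initial full backup

# Total storage for the initial full backup plus the later backups.
def storage_used(strategy)
  later = (1..LATER_BACKUPS).sum do |day|
    case strategy
    when :full         then FULL_GB                # a complete copy every time
    when :differential then DAILY_CHANGE_GB * day  # everything since the full
    when :incremental  then DAILY_CHANGE_GB        # only since the last backup
    end
  end
  FULL_GB + later
end

%i[full differential incremental].each do |strategy|
  puts "#{strategy}: #{storage_used(strategy)} GB"
end
# full: 500 GB
# differential: 120 GB
# incremental: 108 GB
```

The gap between differential and incremental widens with every additional backup, because each differential repeats the changes already captured by the one before it.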

Retrospect Backup’s core engine uses forever-incremental file-based backup technology to protect individual files and folders as well as full systems. Forever-incremental backup allows you to only back up file and system changes since the previous backup, minimizing backup time and storage costs.

Forever-Incremental Backup

When you restore, Retrospect will combine these incremental backups into a synthetic full backup, giving you a point-in-time restore for a set of files or a full system.
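Conceptually, building a synthetic full backup is a replay of the incremental changes on top of the full backup, up to the chosen recovery point. The sketch below is a toy model, not Retrospect's actual storage format: each backup is a hash of path to file version, and nil marks a file deleted since the previous backup.

```ruby
# Toy model: replay the first `up_to` incremental backups over the full backup.
def synthetic_full(full, incrementals, up_to:)
  incrementals.first(up_to).reduce(full.dup) do |snapshot, changes|
    changes.each do |path, version|
      if version.nil?
        snapshot.delete(path)     # file was deleted in this increment
      else
        snapshot[path] = version  # file was added or changed
      end
    end
    snapshot
  end
end

full = { "a.txt" => "v1", "b.txt" => "v1" }
incrementals = [
  { "a.txt" => "v2" },                 # first incremental: a.txt changed
  { "b.txt" => nil, "c.txt" => "v1" }  # second: b.txt deleted, c.txt added
]

p synthetic_full(full, incrementals, up_to: 1)
# {"a.txt"=>"v2", "b.txt"=>"v1"}
p synthetic_full(full, incrementals, up_to: 2)
# {"a.txt"=>"v2", "c.txt"=>"v1"}
```

Choosing a different value for up_to yields a different point-in-time view without ever storing a second full copy.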

This customer-centric approach allows businesses to leverage the speed of file-based recovery, restoring a file to the Desktop or to the original location. IT administrators can even restore an entire folder structure based on settings like “Do not replace newer files”, allowing IT to recover a large folder structure in place on the affected server or endpoint without determining whether files have been touched more recently than the backup.
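The “Do not replace newer files” behavior boils down to a per-file comparison. Here is a minimal sketch of that decision, assuming modification times are the only signal. The restore_plan helper and the sample files are hypothetical, and Retrospect's real matching logic is not described here.

```ruby
# Hypothetical sketch: decide per file whether a restore should overwrite it.
# Both arguments map a path to a modification time. A path missing from the
# destination is always restored.
def restore_plan(backup_mtimes, destination_mtimes)
  backup_mtimes.to_h do |path, backed_up_mtime|
    current_mtime = destination_mtimes[path]
    newer_on_disk = !current_mtime.nil? && current_mtime > backed_up_mtime
    [path, newer_on_disk ? :keep_existing : :restore]
  end
end

backup = {
  "report.docx" => Time.utc(2021, 12, 1),
  "budget.xlsx" => Time.utc(2021, 12, 1),
  "notes.txt"   => Time.utc(2021, 12, 1)
}
on_disk = {
  "report.docx" => Time.utc(2021, 12, 15),  # edited after the backup
  "budget.xlsx" => Time.utc(2021, 11, 20)   # older than the backup
}

p restore_plan(backup, on_disk)
# {"report.docx"=>:keep_existing, "budget.xlsx"=>:restore, "notes.txt"=>:restore}
```

Because the decision is made file by file, an administrator can restore over a live folder without first auditing which files changed after the backup.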

Retrospect Backup’s goal is to protect your entire environment the way you need it, giving you as much flexibility as possible.

Workflow Components

Retrospect Backup leverages a number of components to optimize its forever-incremental file-based backup workflow:

System State: Retrospect includes system state, application state, disk layout, and boot information to support disaster recovery scenarios. To capture an accurate backup of a Windows system, Retrospect utilizes Windows Volume Shadow Copy Service (VSS) to snapshot the system, so that with the other information, it can perform a full disaster recovery (DR), also known as bare-metal recovery (BMR).
Filtering: Businesses need to be able to include or exclude files based on how critical they are to the business. Music and movies can be excluded while preserving the ability to perform a bare-metal recovery.
Destinations: Retrospect supports a wide range of destinations–disk, NAS, tape, and cloud–and backups can be transferred between different storage locations.
Scheduling: Organizations have their own unique business flow, and Retrospect allows them to schedule their data protection strategies based on their business needs.
Scripts: Retrospect supports different types of actions on data, including backup, replication, archive, and transfers with support for block-level incremental backup (BLIB) and file-level deduplication for backup scenarios. Transfers allow customers to make a copy of a backup in another storage destination, with support for different destination types, filtering, and scheduling. Customers can make one backup of a source and then move it multiple places without touching the source again. Customers can even enable automatic backup healing by using a “Verify” script to auto-heal their backups. If any issues are found, Retrospect will back up the affected files in the next backup. It’s a flexible approach that you can apply to each backup set on your own schedule to detect and fix integrity issues automatically.
ProactiveAI: Retrospect’s unique ProactiveAI policy-based scheduling enables businesses to intelligently protect their infrastructure based on a policy window. Endpoints that go on and off the network will be protected as soon as they join, and with remote backup, remote workers are protected even outside of the office. See Optimizing Retrospect’s Algorithm for Resource Scheduling for a technical deep dive.

Retrospect's Backup Stack

The combination and deep integration of all of these components form Retrospect’s backup stack and provide businesses with a flexible backup solution that extends to all major operating systems and includes application-level backup for Microsoft Exchange and Microsoft SQL Server. Customers can deploy Retrospect wherever it makes the most sense: on a VM, in the cloud, or on an old desktop.

Image-Based Backup Comparison

An alternative backup method is image-based backup. Image-based backup uses the blocks of data on a volume to read and protect data at a system level.

Many enterprise data protection solutions use this approach. They focus on the entire volume, copying all used blocks on a volume for a full backup and then subsequent changed blocks for incremental backups. This approach allows customers to restore entire systems, but it loses the file-level granularity of file-based backup.

To allow better flexibility, a number of solutions also support file-level restore. This granular recovery lets businesses quickly choose a point-in-time recovery for a system, launch and attach to it, and then retrieve the individual file that they needed. This workflow enables IT administrators to quickly retrieve a lost file.

However, I don’t know of any solutions that integrate file-level restore into the original system along with the ability to leave files that are newer in that folder structure, as Retrospect can.

Customer-Centric Approach

Businesses use Retrospect’s customer-centric approach to data protection to quickly restore individual files, large folders, or entire systems with incremental-forever file-based backup, system state snapshotting, filtering, scheduling, and deep integration with destinations.

Scaling Data Ingestion for a Web Application to 250 Million Requests

Thu, 30 Dec 2021 12:00:00 +0000

Unsplash - @luckybeanz

Retrospect Management Console is Retrospect’s hosted service where businesses can connect their instances of Retrospect Backup to aggregate, analyze, and manage their backup environments. See MVP and Iteration: How We Built a Web Application for Backup Analytics for more details about why we built it.

Retrospect Management Console

Stack

Here is our current stack for Retrospect Management Console:

Web framework: Ruby-on-Rails
Background processing: Sidekiq
Auto-Scaling: Rails Autoscale
Authentication: Devise Ruby gem
Database: Postgres
Deployment: Salesforce Heroku
Source control: GitHub Enterprise

Data Ingestion

Data requests are like ocean waves hitting a service. Ingestion is trivial for small waves, but the larger and more frequent they get, the harder it is to finish ingesting before the next wave. The service might recover for a bit, but then it gets knocked over again with an even larger wave. And of course, you can’t see the waves, so you have no warning.

As Retrospect Management Console has grown in usage, we hit two constraints: request number and request size. There were more requests, and more requests had a lot of data, up to 11MB.

In the beginning, we didn’t know what the problem was beyond Heroku telling us that the service was down. The memory consumption was far too much, and increasing the running instances (dynos) didn’t resolve the problem.

Heroku Metrics

We initially focused on the memory consumption and lowered the maximum allowed amount. That ceiling temporarily resolved the 500 errors, but the service still went down with a big wave and didn’t recover without a couple manual restarts.

After investigating the problem, it was clear that our background processing framework, SuckerPunch, was the bottleneck. It ran on background Ruby threads in the same instance and was built for sending the occasional email in the background, not processing millions of large data requests every day. We had chosen it initially to ship the MVP, but it was time to move on.

We considered switching to Amazon SES or a similar cloud service, but we settled on Sidekiq. Sidekiq was created for scalable background processing in Ruby. It ran background instances and handed off jobs to them. In fact, SuckerPunch was written as a lighter version of Sidekiq, so it fit well into our existing workflow.

Sidekiq Dashboard

We switched to Sidekiq in April 2021. The transition was five lines of code, and it immediately resolved our scaling issues in both directions. Since then, the background instances have processed 250 million requests at 1.1 million requests per day, including 20MB requests. 20MB is a lot of JSON backup data. The service has been able to handle 300 requests per second.
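To put those numbers in perspective, here is the back-of-the-envelope arithmetic in Ruby. It uses only the figures quoted above and assumes the daily rate is spread evenly across the day, which real traffic is not.

```ruby
TOTAL_REQUESTS   = 250_000_000
REQUESTS_PER_DAY = 1_100_000
BURST_PER_SECOND = 300
SECONDS_PER_DAY  = 86_400

# Average rate if the daily volume arrived perfectly evenly.
average_per_second = REQUESTS_PER_DAY / SECONDS_PER_DAY.to_f
# How long it takes to reach the total at the daily rate.
days_at_daily_rate = TOTAL_REQUESTS / REQUESTS_PER_DAY.to_f
# How far the burst rate sits above the average.
burst_vs_average = BURST_PER_SECOND / average_per_second

puts "Average: #{average_per_second.round(1)} requests/second"    # 12.7
puts "Days at that rate: #{days_at_daily_rate.round(1)}"          # 227.3
puts "Burst vs. average: #{burst_vs_average.round(1)}x"           # 23.6
```

The roughly 24x gap between the average and the burst rate is the wave problem in numbers: the service has to be sized for the wave, not the average.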

Auto-Scaling

In addition to Sidekiq, we needed to automatically scale our web instances and worker instances up and down based on the amount of traffic coming in. If we underprovisioned, we would not be able to process the amount of data coming in. If we overprovisioned, we would be paying too much for the amount of data we needed to process.

Heroku offers a service for web instances, but to cover both types, we chose Rails Autoscale. Rails Autoscale automatically scales instances based on queue time for requests, and it has a number of options for scaling up and scaling down along with limiting the minimum and maximum number of instances.
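The idea behind queue-time scaling can be sketched as a single decision function. This is not Rails Autoscale's actual algorithm or API: the target_instances helper, the thresholds, and the one-instance step are all assumptions for illustration.

```ruby
# Hypothetical sketch: scale up when requests wait too long in the queue,
# scale down when they barely wait, and stay within the configured limits.
def target_instances(current:, queue_time_ms:, min: 1, max: 10,
                     upscale_ms: 100, downscale_ms: 25)
  desired =
    if queue_time_ms > upscale_ms
      current + 1
    elsif queue_time_ms < downscale_ms
      current - 1
    else
      current
    end
  desired.clamp(min, max)
end

puts target_instances(current: 2, queue_time_ms: 250)   # 3  (scale up)
puts target_instances(current: 4, queue_time_ms: 50)    # 4  (hold)
puts target_instances(current: 4, queue_time_ms: 5)     # 3  (scale down)
puts target_instances(current: 10, queue_time_ms: 250)  # 10 (capped at max)
```

Queue time is a useful signal because it measures how long work waits before an instance picks it up, which reflects customer impact more directly than CPU or memory usage.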

Rails Autoscale Settings

It also has a nice usage page to show how much money it’s saving you by not overprovisioning.

250 Million Requests

We’ve been thrilled by how Retrospect Management Console has grown from a product idea to solve customer problems into a shipping product that has handled 250 million requests since the summer. By monitoring the service and investigating each bottleneck in turn, the service has been able to scale up to meet customer needs, using excellent solutions like Sidekiq and Rails Autoscale.

Retrospect, Inc. Turns 10

Mon, 01 Nov 2021 12:00:00 +0000

Unsplash - @adigold1

Retrospect, Inc. incorporated ten years ago today: November 1, 2011. Retrospect Backup was originally owned by Dantz Development until Dantz was acquired by EMC in 2004. In 2010, Sonic Solutions acquired Retrospect Backup from EMC and then itself was acquired by Rovi. The leaders of Retrospect approached the management team at Rovi about the idea of spinning off the product, and Rovi agreed to a deal.

Thus, Retrospect, Inc. was born, focused solely on data protection. See our Wikipedia page for more details. Eight years later, we were acquired by StorCentric. We’re part of a larger organization now, with ways to grow the business and protect more data.

While Retrospect, Inc. has only been around for a decade, Retrospect Backup has been protecting businesses for 32 years now. Retrospect Backup doesn’t collect analytics, but a couple years ago, I did a back-of-the-envelope calculation for how much data we protect: 100PB. In fact, we now have individual customers who have over 1PB protected by Retrospect Backup. For a long list of features we’ve shipped since 2011, see “Retrospect Celebrates 30 Years of Data Protection”.

I’m extremely grateful to be a part of such a fantastic team and to work on a product that has helped so many recover from data loss. Thanks to everyone who is or has been a part of Retrospect!

MVP and Iteration: How We Built a Web Application for Backup Analytics

Sat, 30 Oct 2021 12:00:00 +0000

Unsplash - @halacious

Retrospect is a small company, so when we build new products or features, our goal is to find the intersection between what customers need and what our team can build, deliver, sell, and support.

In 2017, feedback from customers and partners frequently touched on how difficult it was to monitor and manage multiple instances of Retrospect Backup. Retrospect Backup is software that is installed on Windows and Macs, either servers or desktops, so administrators needed to log into each computer and look at the status in the application or configure email reporting for each instance. The process was painful and manual.

The Retrospect team talked about different solutions to this, including third-party monitoring service integration, better email reporting, or a web UI for each instance. Eventually, we agreed that the best long-term solution would be our own hosted service.

With a hosted service, customers and partners could log into a single web application and see an aggregated view of their entire backup infrastructure or, in the case of partners, all of their clients’ backup infrastructures through a single pane of glass. That one-sentence product pitch was the goal for Retrospect Management Console, focused initially on monitoring and analytics and eventually on management.

MVP: Minimum Viable Product

Building a minimum viable product or MVP represents a compromise between Sales, Engineering, and Product Management. Sales is looking at competitors and wanting to ship the equivalent solution as soon as possible. Engineering is trying to architect and deploy a well-designed product. Product Management is representing the customer and how this product will address their pain points. Finding a balance between these competing views comes down to a well-defined product roadmap that satisfies all three perspectives and begins with the MVP.

The value of the MVP is shipping and feedback. Shipping an MVP means the Engineering team has built a foundation for the product with a small set of features along with a packaging and deployment process. Feedback means you have customers–internally and externally–who are interested enough to give you feedback to help you iterate toward a better future product.

In our case, we wanted to ship the MVP for a hosted service that a customer could sign into and see an aggregated view from multiple Retrospect Backup instances. That translated into the following requirements:

Hosted Service Platform: Heroku
Database: Postgres
Application Framework: Ruby-on-Rails
Authentication: Devise
Security: SSL encryption
Data Processing: SuckerPunch
Integration: Retrospect Backup connection and JSON data
Dashboard: Aggregated view of multiple instances

The actual customer use case is the final point because the others represent the foundation of the product. We couldn’t get data into the service without them. The product requirement for the dashboard was a set of useful statistics and charts, and we could build more after shipping and getting feedback.

For comparison, we added the following features in future releases after the MVP:

Scalable Data Processing: Sidekiq
Detailed Instance Monitoring: Backup Report, Sources List, Scripts List, Backup Sets List
Interactivity: Pause/Stop Support for Activities and Script Creation

We started building in May 2018 and shipped the MVP in August 2018. After testing it in-house for a month, we released it to the public as a beta. It required the latest version of Retrospect Backup to connect and send analytics data.

Retrospect Management Console - MVP

Retrospect Backup Integration

MVP Tradeoffs

When you’re building an MVP, you also need to acknowledge the tradeoffs. The problem with shipping an MVP is those early decisions build the foundation for the product, and as teams continue to build features on top of them, they also find it harder to justify the time commitment of changing them.

For us, we wanted to ship as soon as possible, which meant utilizing our existing knowledge of Ruby-on-Rails. We could have built the service as a Node.js application with React or Angular, but we weren’t as familiar with that approach. However, switching to a different application framework now would be a significant project because of the features we would need to replicate.

Moreover, we chose to have the API endpoint for the integration be served by the user-facing application. That design choice simplified the architecture in the beginning. We only needed one web instance running to both ingest data from Retrospect Backup and also serve the dashboard. However, as the service scaled, we needed to address that bottleneck.

MVPs are a fantastic approach to getting a product into customers’ hands and iterating based on their feedback, but your team needs to acknowledge the Engineering tradeoffs that come with that approach.

Feedback and Iteration

With the MVP shipped, we revisited the product roadmap. Sales wanted more management abilities to compete with other services. Product Management pushed for easier workflows for customers to use the service. Support pointed out issues that customers had. Engineering evaluated performance optimizations for scaling the service.

We’ve added a number of features since the MVP shipped, including the following:

Navigation: Better navigation for organizations with managed organizations and multiple Retrospect Backup instances
Detailed Monitoring: More insight into Retrospect Backup instances via Backup Report, Sources list, Scripts list, Backup Sets list
Interactivity: Pause/Stop support for activities, shared scripts, and backup set/script creation
Status: Detailed status for each Retrospect Backup instance’s service connectivity
Scalable Data Processing: Sidekiq and Rails Autoscale

Web applications are particularly suited to MVPs and iteration based on feedback because the company controls the delivery. We only update Retrospect Backup every six months because customers have repeatedly told us they don’t want updates more frequently. However, with a web application, we can update the application far more frequently, and the customer is only aware if there are new features.

For Retrospect Management Console, we’re on deployment #396, so on average, we’ve shipped 2.5 times per week since Fall 2018.

Retrospect Management Console - Current

Deeper Product Integration

Owning a hosted service allowed us to integrate and streamline the user experience for Retrospect Backup. In May 2021, we launched Retrospect Backup 18 with deeper integration with Retrospect Management Console.

When you download a trial of Retrospect Backup, the download link points to Retrospect Management Console. The service automatically creates an account for the embedded email address and license and then creates a personalized version of the application installer with license and Retrospect Management Console UUID included. After installation, Retrospect Backup automatically creates a public/private keypair and uploads it to the service, so that the service can bundle it in each client installer for that Retrospect instance.

Retrospect Backup - Add Clients

Retrospect Management Console - Client Installers

In Retrospect Backup, the customer can copy a single download link for the Retrospect Backup Client agent installer and send it to their entire company. When each employee downloads and installs it, their computer will be automatically added to the Retrospect Backup instance and start getting protected, without any administrative setup.

By leveraging Retrospect Management Console, the Engineering team was able to streamline the entire end-to-end process of trial download to automatic protection for an entire organization, requiring only a couple clicks from the administrator.

Building an MVP and then iterating based on feedback enabled the Retrospect Engineering team to ship an integrated backup service that helps companies easily onboard, protect, and manage their backup environment from a single pane of glass.

Jekyll Tip: Absolute URLs for Links and Images in RSS Feed

Fri, 29 Oct 2021 23:00:00 +0000

When I write a post and add an image, the image lives in the “files” folder. When Jekyll builds the static site locally, I can see the image inline. When I push the site to GitHub Pages, I can see the image online in the post with my web browser. But when I view the feed in an RSS reader (Reeder for me), no image.

The issue is the URL for the image is a relative path instead of an absolute path:

# Relative Path
/files/example.jpg
# Absolute Path
https://bdunagan.com/files/example.jpg

The RSS reader doesn’t know what the base path is. But if I wrote it with the absolute URL format locally, I couldn’t see it until I pushed the post live.

Web browsers handle relative URLs, but RSS readers are not consistent about it. You can add “xml:base” to your feed, but there is no guarantee that the RSS reader will observe it.

Below is a quick Jekyll hack to replace relative URLs for links and images with absolute URLs in /feed.xml, using Liquid’s replace method:

{{ post.content | replace: 'src=&quot;/', 'src=&quot;https://bdunagan.com/' }}

Here is the code in the context of my entire feed.xml file:

---
layout: null
---
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>{{ site.title }}</title>
    <description>{{ site.description }}</description>
    <link>{{ site.url }}</link>
    <atom:link href="{{ site.url }}/feed.xml" rel="self" type="application/rss+xml" />
    {% for post in site.posts limit:20 %}
      <item>
        <title>{{ post.title }}</title>
        <description>{{ post.content | xml_escape | replace: '' | replace: '' | replace: '' | replace: '' | replace: 'src=&quot;/', 'src=&quot;https://bdunagan.com/' | replace: 'href=&quot;/', 'href=&quot;https://bdunagan.com/' }}</description>
        <pubDate>{{ post.date | date: "%a, %d %b %Y %H:%M:%S %z" }}</pubDate>
        <link>{{ site.url }}{{ post.url }}</link>
        <guid isPermaLink="true">{{ site.url }}{{ post.url }}</guid>
      </item>
    {% endfor %}
  </channel>
</rss>

By replacing the relative links with absolute links in the RSS feed, the post is viewable in development and in production as well as in an RSS reader, without any modifications.
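To see what the replace filters do to the escaped HTML, here is a minimal Python sketch of the same string substitution. It is not part of the Jekyll build; the `absolutize` helper and the sample markup are made up for illustration, and `html.escape` stands in for Liquid's `xml_escape`.

```python
from html import escape

SITE_URL = "https://bdunagan.com"  # base URL used in the feed template


def absolutize(escaped_html: str, site_url: str = SITE_URL) -> str:
    """Rewrite root-relative src/href attributes in already-escaped HTML."""
    for attr in ("src", "href"):
        # After escaping, an attribute such as src="/x" reads src=&quot;/x&quot;
        escaped_html = escaped_html.replace(
            f"{attr}=&quot;/", f"{attr}=&quot;{site_url}/"
        )
    return escaped_html


# Hypothetical post body with one relative image, one relative link,
# and one link that is already absolute.
post_html = (
    '<p><img src="/files/example.jpg"> '
    '<a href="/about">About</a> '
    '<a href="https://example.com/">Elsewhere</a></p>'
)
feed_body = absolutize(escape(post_html, quote=True))
print(feed_body)
```

Absolute links are left alone because they do not start with a slash. One limitation of plain string replacement is that a protocol-relative URL such as `//cdn.example.com/x.js` would also match and be rewritten incorrectly.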

How I Managed Cash Flow at a Bootstrapped Startup

Sun, 24 Oct 2021 23:00:00 +0000

Unsplash - @bkotynski

Retrospect, Inc. was a bootstrapped startup of sorts. We were spun out in 2011, responsible for a product that was 22 years old, with around thirty people at the company. We had an accounting firm, Accretive (now Countsy), with a fantastic and helpful advisor there, and they maintained our financials in NetSuite. From 2016 until our acquisition in 2019, I was Chief Operating Officer (COO), so I was responsible for the financials and the cash flow.

The cash flow of a small company is composed of two pieces: Accounts Payable (AP) and Accounts Receivable (AR). AP is the list of vendors to whom we owe money with our payment deadline. AR is the list of customers that owe us money with their payment deadline. Sales closes opportunities (bookings) and hands off booked orders to Operations; Operations fulfills those orders and bills customers (billings), transitioning those orders into AR and over to Finance; and finally, Finance follows up on AR and collects money from customers for those orders (collections). This back-office process is how product turns into cash, and, for us, it involved many manual steps.

The core component of a bootstrapped startup is cash in the bank. We needed to know our current financial position, predict the company’s cash flow, and minimize past-due AR. Let’s walk through how we did that with a simple spreadsheet.

Predicting Cash Flow

I’m not referring to complicated machine learning models to predict cash flow. We needed basic spreadsheet-driven estimates for our expenses and our cash. We did not have a subscription service at that point, so we did not deal with annual recurring revenue (ARR) or subscription churn rate. Our focus was simply understanding our current business, selling software licenses online and in the channel as well as support contracts and collecting payment from customers.

AP was consistent for us every month. Countsy did a great job maintaining our NetSuite instance, so that I could drill into any department and see their expenses for the last three years or the last quarter. For us, the largest expense was payroll. Beyond that, we had sales rep contractors, our European support team, our accounting team, and a long list of credit card charges. Every quarter, we had to account for quarterly bonuses as well as quarterly bills, such as Salesforce. When we grew our Sales staff, I knew how much those additional people were going to cost on a monthly basis. There were occasional out-of-the-blue bills, such as the yearly online store fee, but in general, our burn rate was easy to predict.

AR took more effort. Most of our customers had NET30 payment schedules, meaning that they had 30 days to pay us after we sent the invoice. However, when I started as COO, we had an AR forecast of one week, meaning we knew which customers were going to pay us what amount for the next seven days. Think about those two data points. Our customers had 30 days to pay us, and we only predicted payments a week out. We had better data than a one-week horizon.

To predict cash as far out as possible, I built an “AR Forecast” spreadsheet in Google Sheets. Using a report in NetSuite, I copied every customer’s open invoices with amount, purchase date, and payment terms into the spreadsheet, and I used formulas to note when the invoice was due. For each customer, I learned what their past payment schedule was and manually predicted around when they would pay us. The spreadsheet allowed me to extend our AR forecast from 1 week to 6 weeks as well as accurately track our total AR.
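The spreadsheet logic can be sketched in a few lines of Python. This is an illustration rather than the actual Google Sheets formulas: the customers, amounts, dates, and per-customer lag values are all invented, and the lag table stands in for the manual judgment about each customer's payment habits.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Invoice:
    customer: str
    amount: float
    invoice_date: date
    net_days: int  # 30 for NET30 terms


# Hypothetical lag (days past the due date) learned from each customer's history.
TYPICAL_LAG_DAYS = {"Big Distributor": 14}


def due_date(inv: Invoice) -> date:
    return inv.invoice_date + timedelta(days=inv.net_days)


def expected_payment(inv: Invoice, lags: dict) -> date:
    return due_date(inv) + timedelta(days=lags.get(inv.customer, 0))


def weekly_forecast(invoices, today, weeks=6, lags=TYPICAL_LAG_DAYS):
    """Sum expected collections into weekly buckets, starting from today."""
    buckets = defaultdict(float)
    for inv in invoices:
        days_out = (expected_payment(inv, lags) - today).days
        week = max(days_out, 0) // 7  # payments already overdue land in week 0
        if week < weeks:
            buckets[week] += inv.amount
    return [round(buckets[w], 2) for w in range(weeks)]


invoices = [
    Invoice("Big Distributor", 10000.0, date(2017, 12, 4), 30),
    Invoice("Small Reseller", 2500.0, date(2017, 12, 6), 30),
    Invoice("New Customer", 1200.0, date(2017, 12, 28), 30),
]
forecast = weekly_forecast(invoices, today=date(2018, 1, 1))
print(forecast)
```

Each position in the returned list is one week of expected cash, so extending the horizon is just a matter of raising `weeks`, provided the open invoices reach that far.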

As a bootstrapped startup, tracking AR and predicting payments were critical to understanding our runway and spotting any hurdles.

Minimizing Past-Due AR

In addition to only having a 1-week cash prediction, we had a huge past-due AR balance. According to NetSuite, many customers had unpaid invoices that were over 90 days old. Some open invoices were over a year old.
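A standard way to size this problem is an aging report that groups open invoices by how late they are. The sketch below is a generic version with conventional bucket boundaries and invented invoices; it is not the NetSuite report itself.

```python
from datetime import date

BUCKETS = ["current", "1-30", "31-60", "61-90", "90+"]


def aging_bucket(due: date, today: date) -> str:
    """Classify an open invoice by the number of days past its due date."""
    days_late = (today - due).days
    if days_late <= 0:
        return "current"
    if days_late <= 30:
        return "1-30"
    if days_late <= 60:
        return "31-60"
    if days_late <= 90:
        return "61-90"
    return "90+"


def aging_report(open_invoices, today: date) -> dict:
    """Total the open amounts per bucket; open_invoices is (due_date, amount) pairs."""
    totals = {name: 0.0 for name in BUCKETS}
    for due, amount in open_invoices:
        totals[aging_bucket(due, today)] += amount
    return totals


# Invented open invoices: (due date, amount)
open_invoices = [
    (date(2018, 7, 15), 4000.0),  # not yet due
    (date(2018, 6, 10), 1500.0),  # 20 days late
    (date(2018, 3, 1), 900.0),    # about four months late
    (date(2017, 5, 1), 2200.0),   # over a year late
]
report = aging_report(open_invoices, today=date(2018, 6, 30))
print(report)
```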

All customers had their own processes for paying vendors. Large distributors’ Finance departments ran like clockwork, but they were still idiosyncratic. For instance, one consistently sent checks every week for all of the open invoices that had passed NET30 that week, but the key phrase was “sent checks”. The checks were marked as sent in their system, but they always managed to arrive two weeks later. Adding up the times, the checks arrived six weeks after we sold the products: NET45, not NET30. Still, large customers were consistent.

Smaller customers varied in terms of how prompt they were at paying. We had a number of small distributors that were great at paying on time. Others needed reminders. One change we made was ensuring there was an employee who was responsible for following up on past-due AR.

A past-due invoice did not necessarily mean that we hadn’t been paid. All customers need to submit their payment with remittance information, which lists what invoices the payment covers. Sometimes, remittance information is incorrect, leading to accounting not closing the correct open invoice. Other times, accounting has the correct information but has not yet updated the open invoice.

Our huge past-due AR balance was a mix of all three. Accounting only processed invoices every other week, so there were frequently unprocessed invoices. A subset of remittance information was incorrect, so while the customer thought the invoice was paid, our system did not. Working with customers resolved a long list of these that had built up over a couple years. Finally, a couple customers were consistently late at paying. Our diligent back-office person worked with them to figure out a payment schedule that we both agreed on.

By minimizing our past-due AR, I knew what our real AR was, so I could predict payments more accurately.

Profit: EBITDA vs Cash

We didn’t need to worry about EBITDA–shorthand for earnings before interest, taxes, depreciation, and amortization. It’s a common accounting measure for a company’s financial health. As a bootstrapped startup, cash was critical, and EBITDA was only useful for our corporate taxes. We followed cash basis accounting (as opposed to accrual basis accounting), and we focused on profit in terms of cash to decide questions like hiring, bonuses, or layoffs.

Understanding Data and Processes

Understanding a small company’s cash flow isn’t hard, but it does require patience and tools. NetSuite is a fantastic back office tool, but it’s not necessarily ground truth. We couldn’t simply click on “AR Report” and accept the sum at the bottom. The data in the report wasn’t accurate in multiple ways. Digging into the actual data, customer by customer and vendor by vendor, helped us see what our actual AR was, allowed us to predict our cash up to six weeks out using a simple spreadsheet, and enabled us to manage the cash flow for our bootstrapped startup.

Technical Deep Dive on Object Lock, Ransomware Protection, and Immutable Backups in Retrospect Backup

Sun, 17 Oct 2021 12:00:00 +0000

Unsplash - @myfotocanva

Ransomware is a huge global threat to businesses around the world. The problem for companies is that their storage is always connected with full access for admins. When ransomware gets the administrative credentials, it has full access too. There is no policy to say that no one, not even the administrator, can change this file for a set amount of time.

Cloud Object Lock does just that. Because cloud storage providers like Amazon S3 control the API, they can add features like Object Lock. This lock is a retention policy for a specific version of a file that is locked from changes from every user, including the administrator. You can think of this as a virtual air-gap in the cloud because there is no way, short of closing the account, to delete that file before the retention date has passed.

Retrospect was one of the first data protection solutions to add ransomware protection using immutable backups: “Retrospect Backup 18: Ransomware Protection”. Retrospect utilizes Object Lock technology in major cloud storage providers to set a retention policy for cloud backups to ensure no one, not even the root user, can delete them during the retention window.

Retrospect User Interface

Creating an immutable backup set with Retrospect Backup is easy. There is a single checkbox in the user interface to enable it and a number of days to specify:

Retrospect Backup - Immutable Retention Policy

However, there is a lot of functionality underneath that checkbox to create immutable backups. Let’s dive into the technical details.

Forever-Incremental Backup’s Rolling Window

Retrospect Backup uses ProactiveAI for policy-driven scheduling and forever-incremental backup technology to minimize backup sizes while ensuring a point-in-time restore. The first backup is a full backup and every subsequent backup is called an incremental backup. Those incremental backups depend on previous backups. If a file doesn’t change, it doesn’t get backed up again.

Ordinarily, this workflow is a fantastic combination of minimizing storage while providing a backup that can perfectly recreate a point-in-time snapshot of the volume being protected. But that changes if you’re concerned the previous backups might be deleted. If a file is no longer locked, it can be deleted maliciously. Retrospect Backup needs to create backups where every backup within the rolling window of immutability is a fully contained point-in-time snapshot of the volume.

Retrospect Backup - Immutable Retention Policy

Retrospect Backup accounts for the rolling window in two ways:

File Matching: Retrospect adjusts its file matching to take into account retention policy for a given backed up file. A file that is outside of the retention policy is no longer considered to be backed up, and Retrospect will back it up again.
ProactiveAI Scheduling: ProactiveAI determines the next date the script will run and backs up any file that will fall out of the retention policy by that date with forever-incremental backup, predicting into the future to ensure the file is protected at all times.

The consequence of this change is Retrospect will back up any file that is not protected in an immutable backup. Let’s say you back up every week and you set the retention policy for 4 weeks. Retrospect will back up every file every four weeks, regardless of whether it changed, because it needs to keep those files in the ransomware protection’s rolling window.
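The weekly example can be simulated with a small Python sketch. This is my own illustration of the matching rule described above, not Retrospect's implementation. It assumes a locked copy counts as still protected if its lock expires exactly at the next scheduled run, which reproduces the four-week cadence; the function names and dates are invented.

```python
from datetime import date, timedelta


def needs_backup(changed, last_backup, retention, next_run):
    """Decide whether a file must be copied again in the current run.

    A file is re-copied when it is new or changed, or when its locked copy
    would fall out of the retention window before the next scheduled run.
    """
    if changed or last_backup is None:
        return True
    lock_expires = last_backup + retention
    return lock_expires < next_run  # assumption: expiring at next_run is still covered


def simulate(start, runs, interval, retention):
    """Return the dates on which an unchanged file gets backed up."""
    last_backup = None
    backed_up_on = []
    for k in range(runs):
        today = start + k * interval
        if needs_backup(False, last_backup, retention, today + interval):
            last_backup = today
            backed_up_on.append(today)
    return backed_up_on


# Weekly backups with a 4-week retention policy, for a file that never changes.
dates = simulate(date(2021, 10, 3), runs=9,
                 interval=timedelta(weeks=1), retention=timedelta(weeks=4))
print(dates)
```

The unchanged file is copied on the first run and then again every fourth week. With a plain forever-incremental backup it would be copied only once.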

This process ensures that customers always have immutable backups with complete point-in-time restores. Retrospect preserves forever-incremental backups, yet there is never a time when a backup depends on an out-of-policy file.

Cloud Storage Providers

There are two types of approaches from cloud storage providers: per-object policies and per-bucket policies. Per-object policies can be applied granularly to specific versions of an object at the time of creation, and they can vary within a bucket. Per-bucket policies are created for an entire bucket and are applied uniformly to every new version of any object in that bucket.

To compare with Retrospect:

Per-Object Policy: You can create Backup Set A with an immutable retention policy of 2 days and Backup Set B with an immutable retention policy of 6 months in the same bucket, and the bucket does not need to have a bucket-wide policy.
Per-Bucket Policy: You can only set a bucket-wide policy for immutable retention, so every new object is set to that retention period, regardless of what you have set in Retrospect.

Cloud storage providers with per-object policies are Amazon S3, Wasabi, Backblaze B2, MinIO, and Microsoft Azure Blob Storage (Preview - September 2021), while those with per-bucket policies are Google Cloud Storage and Microsoft Azure Blob Storage.

There are also different policy modes:

Compliance Mode: The policy is time-based and enforced for every user, including administrators.
Governance Mode: The policy is time-based, but a user with special permissions can shorten or remove it. Separately, a legal hold does not expire and stays in place until a user with that permission removes it.

See Amazon S3 documentation for more information.

Retrospect Backup uses Compliance Mode for its immutable backups. When you create an immutable backup, there is no permission level that will allow you to delete that version of the backup files. The root account cannot delete them. The only way to delete them is to close the account.

Because there is a way to ultimately delete the files, it’s important to use multi-factor authentication (MFA) for your root account on the cloud storage provider.

Retention Policy Dates

Let’s walk through the user interfaces for retention policy dates in the different cloud storage providers.

Amazon S3

Below is Amazon S3’s Retention Mode UI.

Amazon S3 Object Lock - Retention Policy

You’ll see it specifies the mode, the “Retain Until Date”, and the version of an object that you’re applying this to. Retrospect Backup does this step automatically when creating an immutable backup.
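For readers who want to see the same three fields at the API level, here is a hedged Python sketch using the AWS SDK's `put_object_retention` call. It illustrates the S3 API, not Retrospect's code. The bucket, key, and version ID are placeholders, and the bucket is assumed to have been created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def retention_request(bucket, key, version_id, days, now=None):
    """Build the arguments for S3's PutObjectRetention call."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "VersionId": version_id,  # the lock applies to one specific version
        "Retention": {
            "Mode": "COMPLIANCE",  # enforced for every user, including root
            "RetainUntilDate": now + timedelta(days=days),
        },
    }


def apply_retention(request):
    """Send the request to S3; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party SDK, imported lazily so the builder has no dependencies

    boto3.client("s3").put_object_retention(**request)


# Placeholder names for illustration only.
request = retention_request(
    "example-backup-bucket", "backups/example.rdb", "EXAMPLE_VERSION_ID",
    days=30, now=datetime(2021, 10, 17, tzinfo=timezone.utc),
)
print(request["Retention"]["RetainUntilDate"])
```

In compliance mode, a retain-until date can be extended later but not shortened, so it is worth testing with a short window first.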

For Microsoft Azure Blob Storage and Google Cloud Storage, you will need to create the retention policy manually because they only support per-bucket policies.

Click on any object and scroll down in “Properties” to “Object Lock retention”.

Amazon S3

Wasabi

Click on any object and “File Details” appears.

Wasabi

Backblaze B2

Click on any object’s blue link and “Details” appears.

Backblaze B2

Cyberduck

Select any object and click “Info” then “Metadata”.

Cyberduck

Microsoft Azure Blob Storage

For a container, select “Access Policy”. Note: Per-object (blob) version locking in preview (September 2021)

Microsoft Azure Blob Storage

Google Cloud Storage

When viewing a list of files, see the “Retention expiration date” column.

Google Cloud Storage

Viewing and Deleting Versions

One important nuance is how to view versions of a file. Only Amazon S3 and Cyberduck show versions. See below. Other interfaces choose to display a simplified version of the actual underlying content while preventing you from taking certain actions, like deletion.

Object Lock - Show Versions

One underlying feature is a delete marker. When you delete an object in a versioned bucket on Amazon S3, the file is not deleted. You are adding a delete marker as the next version of that file, and Amazon S3 understands it should not display that in the interface without “Show Versions” enabled.

Object Lock - Delete Markers

Let’s look at the difference between deleting an object (“delete”) and deleting an object version (“permanently delete”):

Object Lock - Delete

In Wasabi or Backblaze, you don’t see versions, even though they are there for buckets with Object Lock enabled. Wasabi won’t let you delete files through their interface, but if an attacker added a delete marker to your file using an API, the file would no longer show up in Wasabi. You would have to use Cyberduck or another API client to see that the locked files were indeed still there.
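Checking through the API can be sketched in Python. The function below takes a response shaped like the one the S3 `ListObjectVersions` call returns, with `Versions` and `DeleteMarkers` lists, and reports keys that look deleted but still have stored versions. The sample response is hand-written for illustration; a real check would fetch it with an S3-compatible client and handle pagination.

```python
def hidden_but_present(response: dict) -> dict:
    """Map each key whose latest entry is a delete marker to its surviving version IDs."""
    hidden = {
        marker["Key"]
        for marker in response.get("DeleteMarkers", [])
        if marker.get("IsLatest")
    }
    surviving = {}
    for version in response.get("Versions", []):
        if version["Key"] in hidden:
            surviving.setdefault(version["Key"], []).append(version["VersionId"])
    return surviving


# Hand-written sample in the shape of a ListObjectVersions response.
sample_response = {
    "Versions": [
        {"Key": "backups/a.rdb", "VersionId": "v1", "IsLatest": False},
        {"Key": "backups/b.rdb", "VersionId": "v7", "IsLatest": True},
    ],
    "DeleteMarkers": [
        {"Key": "backups/a.rdb", "VersionId": "dm1", "IsLatest": True},
    ],
}
found = hidden_but_present(sample_response)
print(found)
```

Here `backups/a.rdb` would not appear in a simplified file listing, yet its locked version `v1` is still stored.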

Video Walkthrough

I recorded a detailed video of the use cases and step-by-step walkthroughs on both Windows and Mac platforms as well as this technical deep dive into how Retrospect’s ransomware protection works and how Object Lock is implemented across cloud storage providers.

Ransomware Protection with Retrospect Backup's Immutable Backups

Note that much of this content is also available on the Retrospect website, under Ransomware Protection and Technical Deep Dive on Ransomware Protection, Object Lock, and Immutable Backups. It took me a bit to compile, but the broad overview and then technical details have really helped clarify people’s understanding of ransomware, Object Lock, and Immutable backups–both inside the company and for our partners and customers.

How to Protect Cloud Storage with Retrospect Backup's Cloud Data Protection

Sat, 16 Oct 2021 13:00:00 +0000

Unsplash - @ilumire

Cloud storage provides a low-cost, scalable, offsite storage location for a multitude of corporate assets. Companies use cloud storage for all sorts of data, from website assets to corporate storage to unstructured data used for analysis.

Cloud Data

Retrospect has used Amazon S3 for storing various assets for the last decade. Today, we store tens of terabytes of content on it: website assets, sales collateral, customer logs, Engineering builds, and, of course, corporate backups.

Cloud data storage is fantastic for its infinite supply, worldwide accessibility, and abstraction. It allows companies to move up the storage stack and not manage the physical bits of hardware that are required to store vast quantities of data–all for an affordable per-data-unit cost. Amazon S3 costs $0.023 per GB per month, and other providers like Wasabi and Backblaze B2 are even more affordable.

To achieve durability of up to 11 9s like Amazon S3, cloud storage providers built redundancies into their infrastructure to protect your data from hardware failure. Objects are stored in multiple copies in a single zone and in multiple different zones. Two data centers can go offline, and you’ll still have your data. Awesome.

Unless you delete a file accidentally. Or ransomware compromises your account and encrypts your data. Or a malicious insider deletes every bucket.

Cloud Data Loss

Working at a backup company, I hear stories about data loss a lot. Bad things happen. You can protect yourself against one type of data loss, only to be hit by another. Cloud data is no exception.

Retrospect Backup can now protect your cloud data with Cloud Data Protection. Retrospect treats cloud storage as a first-class storage volume and integrates cloud volumes into the rest of Retrospect’s deep feature set. Cloud Data Protection is included for an unlimited number of cloud volumes in Retrospect Backup’s server-level editions on both Windows and Mac platforms, and Retrospect integrates cloud storage from all the major cloud storage providers: Amazon S3, Microsoft Azure Blob Storage, Google Cloud Storage, Wasabi, Backblaze B2, and MinIO.

I made a video covering common use cases and step-by-step walkthroughs for Retrospect Backup for Windows and Retrospect Backup for Mac:

How to Protect Cloud Assets with Retrospect Backup's Cloud Data Protection

Learn more at retrospect.com/cloud_data_protection.

Deep Integration

Retrospect Backup has a deep feature set, and Cloud Data Protection allows organizations to incorporate cloud data into their standard backup strategies. Let’s walk through a number of facets:

On-Premise or Cloud: You can back up the cloud volume to on-premise storage, back up the cloud data to a different cloud from a cloud server instance, or back up a local volume and restore it to the cloud. A cloud volume has the same level of protection and interoperability as a local volume.
Incremental-Forever Backups: Retrospect uses incremental-forever backup technology to ensure each backup is as compact as possible while still providing point-in-time restore, and cloud volumes are included.
Scheduling: Retrospect supports cloud volumes under scheduled scripts or policy-based scheduling called ProactiveAI, which prioritizes backups based on a backup window and availability across local volumes, NAS shares, servers, endpoints, email accounts, and now cloud volumes.
Filtering: Retrospect includes a granular set of filtering options for companies to include only critical data or exclude extraneous data, based on volume information or file information. For instance, you can easily write a filter to only back up cloud data that was created in the last 90 days and is smaller than 100MB.
Cloud Backup: Retrospect can also use cloud storage as a destination for backups, with support for over 20 different cloud storage providers.
Replication: In addition to backup, Retrospect can replicate between cloud volumes, local volumes, NAS shares, servers, endpoints, or any combination, with scheduling and filtering available.
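As a concrete reading of the filtering example above, the rule "created in the last 90 days and smaller than 100MB" amounts to a simple predicate. This Python sketch is only an illustration of that rule, not Retrospect's filter engine, and it assumes 100MB means 100 × 1024 × 1024 bytes.

```python
from datetime import datetime, timedelta, timezone

MAX_BYTES = 100 * 1024 * 1024  # assumption: 100MB interpreted as 100 MiB
MAX_AGE = timedelta(days=90)


def include_object(created: datetime, size_bytes: int, now: datetime) -> bool:
    """Return True when an object is recent enough and small enough to back up."""
    return (now - created) <= MAX_AGE and size_bytes < MAX_BYTES


now = datetime(2021, 10, 16, tzinfo=timezone.utc)
recent_small = include_object(datetime(2021, 9, 1, tzinfo=timezone.utc), 5_000_000, now)
recent_large = include_object(datetime(2021, 9, 1, tzinfo=timezone.utc), 500_000_000, now)
old_small = include_object(datetime(2021, 1, 1, tzinfo=timezone.utc), 5_000_000, now)
print(recent_small, recent_large, old_small)
```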

Retrospect Backup’s goal is to ensure your data is protected from the wide array of possibilities for data loss. With Cloud Data Protection in Retrospect Backup, you can back up and restore cloud storage with the same policy-driven workflows that you use for on-premise systems and storage.

Congratulations to the Retrospect Engineering team for shipping this excellent feature in such a deeply integrated fashion! With the continued migration to the cloud, data protection that is agnostic to volume type or location helps companies protect their data in the way that works for their business constraints.

Jekyll Tip: Adding Years and Post Counts using Liquid

Fri, 15 Oct 2021 23:00:00 +0000

After 13 years and 177 posts, this blog is a bit long for a single scrolling index, so I added years to delineate the content as well as post count per year.

That involved a bit of programming in Liquid, a template language created by Shopify and used by Jekyll. Liquid does not expose support for hash creation, even though Jekyll can provide hashes like site. Fortunately, Jekyll supports the group_by_exp filter, available in Jekyll v3.4+.

Below is the code I used:

<ul class="posts">
  {% assign posts_per_year = site.posts | group_by_exp: "post", "post.date | date: '%Y'" %}
  {% for post in site.posts %}
    {% assign year = post.date | date: "%Y" %}
    {% for current_year in posts_per_year %}
      {% if last_year != year and current_year.name == year %}
        <li class="year">{{ year }} - {{ current_year.size }}</li>
      {% endif %}
    {% endfor %}
    {% assign last_year = year %}

    <li>
      <span class="datetime muted" data-time="{{ post.date }}">{{ post.date | date: "%B %-d %Y" }}</span>
      <a href="{{ post.url }}">{{ post.title }}</a>
      <p>{{ post.description }}</p>
    </li>
  {% endfor %}
</ul>
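For readers less familiar with Liquid, the same idea is easier to see in Python: group the posts by year, then print a header with the count before each group. This is an analogy with made-up post titles, not code that runs in Jekyll, and it assumes the posts are already sorted newest first, as site.posts is.

```python
from datetime import date
from itertools import groupby

# Made-up posts, newest first.
posts = [
    ("Post C", date(2021, 10, 29)),
    ("Post B", date(2021, 10, 15)),
    ("Post A", date(2020, 5, 1)),
]


def index_lines(posts):
    """Return index lines: a 'year - count' header followed by that year's titles."""
    lines = []
    for year, group in groupby(posts, key=lambda post: post[1].year):
        items = list(group)  # consume the group once so it can be counted
        lines.append(f"{year} - {len(items)}")
        lines.extend(f"  {title}" for title, _ in items)
    return lines


lines = index_lines(posts)
print("\n".join(lines))
```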

I’ve been using Jekyll for eight years now. In 2013, I converted from Wordpress’s database-driven approach to Jekyll’s static HTML-generated approach: “Why I moved from Wordpress to Jekyll”. It continues to work well.

And for those who use Jekyll, you can post Liquid as text in a post by using {% raw %}{% endraw %}.