This is the first post of 2020, so *happy new year* to you all!

I’ve been a huge fan of LLVM since 11 years ago, when I started playing with it to JIT data structures such as AVL trees, then to JIT restricted AST trees, and later to JIT native code from TensorFlow graphs. Since then, LLVM has evolved into one of the most important compiler ecosystems and is nowadays used by many important open-source projects.

One cool project that I recently became aware of is Gandiva. Gandiva was developed by Dremio and later donated to Apache Arrow (**kudos to the Dremio team for that**). The main idea of Gandiva is that it provides a compiler to generate LLVM IR that can operate on batches of Apache Arrow data. Gandiva was written in C++ and comes with many functions already implemented, which can be combined to build an expression tree that is then JIT’ed using LLVM. A nice consequence of this design is that LLVM can automatically optimize complex expressions, add native vectorization for the target platform (such as AVX) while operating on Arrow batches, and execute native code to evaluate the expressions.

The image below gives an overview of Gandiva:

In this post I’ll build a very simple expression parser supporting a limited set of operations that I will use to filter a Pandas DataFrame.

In this section I’ll show how to create a simple expression manually using the tree builder from Gandiva.

Before building our parser, let’s manually construct a simple expression with Gandiva. First, we will create a simple Pandas DataFrame with numbers from 0.0 to 9.0:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.gandiva as gandiva

# Create a simple Pandas DataFrame
df = pd.DataFrame({"x": [1.0 * i for i in range(10)]})
table = pa.Table.from_pandas(df)
schema = pa.Schema.from_pandas(df)
```

We converted the DataFrame to an Arrow Table. It is important to note that in this case it was a zero-copy operation: Arrow isn’t copying data from Pandas and duplicating the DataFrame. We also get the `schema` from the table, which contains the column types and other metadata.

After that, we want to use Gandiva to build the following expression to filter the data:

`(x > 2.0) and (x < 6.0)`

This expression will be built using nodes from Gandiva:

```python
builder = gandiva.TreeExprBuilder()

# Reference the column "x"
node_x = builder.make_field(table.schema.field("x"))

# Make two literals: 2.0 and 6.0
two = builder.make_literal(2.0, pa.float64())
six = builder.make_literal(6.0, pa.float64())

# Create a function for "x > 2.0"
gt_five_node = builder.make_function("greater_than",
                                     [node_x, two], pa.bool_())

# Create a function for "x < 6.0"
lt_ten_node = builder.make_function("less_than",
                                    [node_x, six], pa.bool_())

# Create an "and" node for "(x > 2.0) and (x < 6.0)"
and_node = builder.make_and([gt_five_node, lt_ten_node])

# Make the expression a condition and create a filter
condition = builder.make_condition(and_node)
filter_ = gandiva.make_filter(table.schema, condition)
```

This code looks a little more complex, but it is easy to understand: we are basically creating the nodes of a tree that represents the expression shown earlier. Here is a graphical representation of what it looks like:
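The tree the builder constructs can also be written out as plain data. Here is a stdlib-only sketch with illustrative node types (these are simplified stand-ins of my own, not Gandiva’s real classes):

```python
from dataclasses import dataclass
from typing import List, Union

# Simplified, illustrative stand-ins for Gandiva's tree nodes (not its real API).
@dataclass
class Field:
    name: str

@dataclass
class Literal:
    value: float

@dataclass
class Function:
    name: str
    children: List["Node"]

@dataclass
class And:
    children: List["Node"]

Node = Union[Field, Literal, Function, And]

# (x > 2.0) and (x < 6.0)
expr = And([
    Function("greater_than", [Field("x"), Literal(2.0)]),
    Function("less_than", [Field("x"), Literal(6.0)]),
])
```

The Gandiva builder calls in the previous block construct essentially this shape, node by node.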

Unfortunately, I haven’t found a way to dump the generated LLVM IR using Arrow’s Python bindings; however, we can use the C++ API to build the same tree and then look at the generated LLVM IR:

```cpp
auto field_x = field("x", float32());
auto schema = arrow::schema({field_x});
auto node_x = TreeExprBuilder::MakeField(field_x);

auto two = TreeExprBuilder::MakeLiteral((float_t) 2.0);
auto six = TreeExprBuilder::MakeLiteral((float_t) 6.0);

auto gt_five_node = TreeExprBuilder::MakeFunction("greater_than",
                                                  {node_x, two}, arrow::boolean());
auto lt_ten_node = TreeExprBuilder::MakeFunction("less_than",
                                                 {node_x, six}, arrow::boolean());

auto and_node = TreeExprBuilder::MakeAnd({gt_five_node, lt_ten_node});
auto condition = TreeExprBuilder::MakeCondition(and_node);

std::shared_ptr<Filter> filter;
auto status = Filter::Make(schema, condition, TestConfiguration(), &filter);
```

The code above is equivalent to the Python code, but uses the C++ Gandiva API. Now that we have built the tree in C++, we can get the LLVM module and dump its IR. The generated IR is full of boilerplate and JIT’ed functions from the Gandiva registry; the important parts are shown below:

```llvm
; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define internal zeroext i1 @less_than_float32_float32(float, float) local_unnamed_addr #0 {
  %3 = fcmp olt float %0, %1
  ret i1 %3
}

; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define internal zeroext i1 @greater_than_float32_float32(float, float) local_unnamed_addr #0 {
  %3 = fcmp ogt float %0, %1
  ret i1 %3
}

(...)
%x = load float, float* %11
%greater_than_float32_float32 = call i1 @greater_than_float32_float32(float %x, float 2.000000e+00)
(...)
%x11 = load float, float* %15
%less_than_float32_float32 = call i1 @less_than_float32_float32(float %x11, float 6.000000e+00)
```

As you can see, in the IR there are calls to the functions `less_than_float32_float32` and `greater_than_float32_float32`, which are the (in this case very simple) Gandiva functions that do float comparisons. Note the type specialization of each function, encoded in its name suffix.

What is quite interesting is that LLVM applies all of its optimizations to this code and generates efficient native code for the target platform, while Gandiva and LLVM take care of ensuring that memory alignment is correct so that extensions such as AVX can be used for vectorization.

The IR I showed above isn’t actually the code that gets executed; the optimized version is. In the optimized code we can see that LLVM inlined the functions, as shown in the excerpt below:

```llvm
%x.us = load float, float* %10, align 4
%11 = fcmp ogt float %x.us, 2.000000e+00
%12 = fcmp olt float %x.us, 6.000000e+00
%not.or.cond = and i1 %12, %11
```

You can see that the expression is now much simpler after optimization, as LLVM applied its powerful optimizations and inlined the Gandiva functions.

Now we want to implement something similar to Pandas’ `DataFrame.query()` function using Gandiva. The first problem we face is that we need to parse a string such as `(x > 2.0) and (x < 6.0)`; then we have to build the Gandiva expression tree using Gandiva’s tree builder and evaluate that expression on the Arrow data.

Instead of implementing full parsing of the expression string, I’ll use Python’s `ast` module to parse valid Python code and build an Abstract Syntax Tree (AST) of the expression, which I’ll later use to emit the Gandiva/LLVM nodes.
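To get a feel for what the `ast` module produces, here is a stdlib-only look at the tree for our example expression (written with `&` rather than `and`, since that is the form our small parser supports):

```python
import ast

# Parse the filter expression. Note "&" instead of "and": Python's "and"
# yields a BoolOp node, while "&" yields a BinOp with a BitAnd operator,
# which maps directly onto Gandiva's make_and().
tree = ast.parse("(x > 2.0) & (x < 6.0)")
expr = tree.body[0]

print(type(expr.value).__name__)       # top-level node: BinOp
print(type(expr.value.op).__name__)    # its operator: BitAnd
print(type(expr.value.left).__name__)  # left operand: Compare, i.e. (x > 2.0)
```

These are exactly the node types (`BinOp`, `Compare`, `Name`, numeric literals) that the visitor class handles.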

The heavy lifting of parsing the string is thus delegated to the Python `ast` module, and our work is mostly walking this tree and emitting the Gandiva nodes based on it. The code for visiting the nodes of the Python AST and emitting Gandiva nodes is shown below:

```python
class LLVMGandivaVisitor(ast.NodeVisitor):
    def __init__(self, df_table):
        self.table = df_table
        self.builder = gandiva.TreeExprBuilder()
        self.columns = {f.name: self.builder.make_field(f)
                        for f in self.table.schema}
        self.compare_ops = {
            "Gt": "greater_than",
            "Lt": "less_than",
        }
        self.bin_ops = {
            "BitAnd": self.builder.make_and,
            "BitOr": self.builder.make_or,
        }

    def visit_Module(self, node):
        return self.visit(node.body[0])

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        op_name = node.op.__class__.__name__
        gandiva_bin_op = self.bin_ops[op_name]
        return gandiva_bin_op([left, right])

    def visit_Compare(self, node):
        op = node.ops[0]
        op_name = op.__class__.__name__
        gandiva_comp_op = self.compare_ops[op_name]
        comparators = self.visit(node.comparators[0])
        left = self.visit(node.left)
        return self.builder.make_function(gandiva_comp_op,
                                          [left, comparators], pa.bool_())

    def visit_Num(self, node):
        return self.builder.make_literal(node.n, pa.float64())

    def visit_Expr(self, node):
        return self.visit(node.value)

    def visit_Name(self, node):
        return self.columns[node.id]

    def generic_visit(self, node):
        return node

    def evaluate_filter(self, llvm_mod):
        condition = self.builder.make_condition(llvm_mod)
        filter_ = gandiva.make_filter(self.table.schema, condition)
        result = filter_.evaluate(self.table.to_batches()[0],
                                  pa.default_memory_pool())
        arr = result.to_array()
        pd_result = arr.to_numpy()
        return pd_result

    @staticmethod
    def gandiva_query(df, query):
        df_table = pa.Table.from_pandas(df)
        llvm_gandiva_visitor = LLVMGandivaVisitor(df_table)
        mod_f = ast.parse(query)
        llvm_mod = llvm_gandiva_visitor.visit(mod_f)
        results = llvm_gandiva_visitor.evaluate_filter(llvm_mod)
        return results
```

As you can see, the code is pretty straightforward, as I’m not supporting every possible Python expression but only a small subset. What we do in this class is basically convert Python AST nodes such as Comparators and BinOps (binary operations) into Gandiva nodes. I’m also changing the semantics of the `&` and `|` operators to represent AND and OR respectively, just like in Pandas’ `query()` function.

The next step is to create a simple Pandas extension using the `gandiva_query()` method we created:

```python
@pd.api.extensions.register_dataframe_accessor("gandiva")
class GandivaAccessor:
    def __init__(self, pandas_obj):
        self.pandas_obj = pandas_obj

    def query(self, query):
        return LLVMGandivaVisitor.gandiva_query(self.pandas_obj, query)
```

And that is it! Now we can use this extension to do things such as:

```python
df = pd.DataFrame({"a": [1.0 * i for i in range(nsize)]})
results = df.gandiva.query("a > 10.0")
```

We have registered a Pandas extension called `gandiva` that is now a first-class citizen of Pandas DataFrames.

Let’s now create a DataFrame with 50 million floats and use the new `query()` method to filter it:

```python
df = pd.DataFrame({"a": [1.0 * i for i in range(50000000)]})
df.gandiva.query("a < 4.0")

# This will output:
# array([0, 1, 2, 3], dtype=uint32)
```

Note that the returned values are the indexes satisfying the condition we implemented, so this differs from Pandas’ `query()`, which returns the data already filtered.

I did some benchmarks and found that Gandiva is usually faster than Pandas; however, I’ll leave proper benchmarks for a future post on Gandiva, as the goal of this one was to show how you can use it to JIT expressions.

That’s it! I hope you liked the post as much as I enjoyed exploring Gandiva. It seems we will see more and more tools coming up with Gandiva acceleration, especially for SQL parsing/projection/JITing. Gandiva is much more than what I just showed, but you can get started now to understand more of its architecture and how to build the expression trees.

– Christian S. Perone

Cite this article as: Christian S. Perone, "Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions," in *Terra Incognita*, 19/01/2020, http://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/.

There are, however, other senses that we can use to monitor the training of neural networks, such as **sound**. Sound is one of the perspectives that is currently very poorly explored in the training of neural networks. Human hearing can be very good at distinguishing small perturbations in characteristics such as rhythm and pitch, even when these perturbations are very short in time or subtle.

For this experiment, I made a very simple example with a synthesized sound made from the gradient norm of each layer at each training step of a convolutional neural network trained on MNIST, using different settings such as different learning rates, optimizers, momentum, etc.

You’ll need to install PyAudio and PyTorch to run the code (*at the end of this post*).

This segment represents a training session with gradients from 4 layers during the first 200 steps of the first epoch, using a batch size of 10. The higher the pitch, the higher the norm of a layer; a short silence indicates a new batch. Note the gradient norms increasing over time.

Same as above, but with a higher learning rate.

Same as above, but with a learning rate so high that the network diverges; pay attention to the high pitch when the norms explode, followed by the divergence.

Same setting, but with a high learning rate of 1.0 and a batch size of 256. Note how the gradients explode and then produce NaNs, causing the final sound.

This is using Adam in the same setting as the SGD.
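The pitch mapping driving all of these clips is simple; below is a stdlib-only sketch of it (the constants match the full script at the end of the post, but the helper names here are mine):

```python
import math

FS = 44100         # sample rate (Hz)
DURATION = 0.01    # length of each tone (seconds)
BASE_FREQ = 200.0  # base pitch (Hz)

def norm_to_freq(grad_norm):
    """Map a layer's gradient norm to a pitch: higher norm -> higher pitch."""
    return BASE_FREQ + grad_norm * 100.0

def sine_tone(freq, fs=FS, duration=DURATION, amplitude=0.1):
    """Synthesize a short pure sine tone as a list of float samples."""
    n = int(fs * duration)
    return [amplitude * math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

# A gradient norm of 1.5 plays a 350 Hz tone for 10 ms.
samples = sine_tone(norm_to_freq(1.5))
```

In the real script below, one such tone is emitted per layer per batch, with a short silence separating batches.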

For those who are interested, here is the entire source code I used to make the sound clips:

```python
import pyaudio
import numpy as np
import wave

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)
        self.ordered_layers = [self.conv1,
                               self.conv2,
                               self.fc1,
                               self.fc2]

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def open_stream(fs):
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paFloat32,
                    channels=1,
                    rate=fs,
                    output=True)
    return p, stream


def generate_tone(fs, freq, duration):
    npsin = np.sin(2 * np.pi * np.arange(fs*duration) * freq / fs)
    samples = npsin.astype(np.float32)
    return 0.1 * samples


def train(model, device, train_loader, optimizer, epoch):
    model.train()

    fs = 44100
    duration = 0.01
    f = 200.0
    p, stream = open_stream(fs)

    frames = []
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()

        norms = []
        for layer in model.ordered_layers:
            norm_grad = layer.weight.grad.norm()
            norms.append(norm_grad)

            tone = f + ((norm_grad.numpy()) * 100.0)
            tone = tone.astype(np.float32)
            samples = generate_tone(fs, tone, duration)
            frames.append(samples)

        silence = np.zeros(samples.shape[0] * 2,
                           dtype=np.float32)
        frames.append(silence)

        optimizer.step()

        # Just 200 steps per epoch
        if batch_idx == 200:
            break

    wf = wave.open("sgd_lr_1_0_bs256.wav", 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(p.get_sample_size(pyaudio.paFloat32))
    wf.setframerate(fs)
    wf.writeframes(b''.join(frames))
    wf.close()

    stream.stop_stream()
    stream.close()
    p.terminate()


def run_main():
    device = torch.device("cpu")

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=256, shuffle=True)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    for epoch in range(1, 2):
        train(model, device, train_loader, optimizer, epoch)


if __name__ == "__main__":
    run_main()
```

Cite this article as: Christian S. Perone, "Listening to the neural network gradient norms during training," in *Terra Incognita*, 04/08/2019, http://blog.christianperone.com/2019/08/listening-to-the-neural-network-gradient-norms-during-training/.

Using the same code shown earlier, the animations below show the training of an ensemble of 40 models, each a 2-layer MLP with 20 hidden units, in different settings. These visualizations are really nice for understanding the convergence differences when using (or not using) bootstrap or randomized priors.

This is a training session without bootstrapping the data or adding a randomized prior; it’s just naive ensembling:

This is the ensemble with the addition of the randomized prior (an MLP with the same architecture, with random and fixed weights):

$$Q_{\theta_k}(x) = f_{\theta_k}(x) + p_k(x)$$

The final model \(Q_{\theta_k}(x)\) is the k-th model of the ensemble, fitting the function \(f_{\theta_k}(x)\) together with an untrained prior \(p_k(x)\):

This is an ensemble with the randomized prior functions and data bootstrap:

This is an ensemble with a fixed prior (Sin) and bootstrapping:

Cite this article as: Christian S. Perone, "Visualizing network ensembles with bootstrap and randomized priors," in *Terra Incognita*, 20/07/2019, http://blog.christianperone.com/2019/07/visualizing-network-ensembles-with-bootstrap-and-randomized-priors/.


Cite this article as: Christian S. Perone, "Uncertainty Estimation in Deep Learning (PyData Lisbon / July 2019)," in *Terra Incognita*, 18/07/2019, http://blog.christianperone.com/2019/07/uncertainty-estimation-in-deep-learning-pydata-lisbon-july-2019/.

Not a lot of people working with the Python scientific ecosystem are aware of NEP 18 (*the dispatch mechanism for NumPy’s high-level array functions*). Given the importance of this protocol, I decided to write this short introduction to the new dispatcher, which will certainly bring a **lot of benefits** to the Python scientific ecosystem.

If you have used PyTorch, TensorFlow, Dask, etc., you certainly noticed the similarity of their API contracts with Numpy’s. And it’s not by accident: Numpy’s API is one of the most fundamental and widely-used APIs for scientific computing. Numpy is so pervasive that it ceased to be only an API and is becoming more of a protocol or an API specification.

There are countless advantages when an entire ecosystem adopts an API specification. The wide adoption of the Numpy protocol will certainly help create a standardized API that provides seamless integration across implementations, even for ones that are quite different from the `ndarray`.

The main goal of NEP 18 is to provide a protocol, materialized in the method named `__array_function__`, that allows developers of frameworks such as PyTorch, TensorFlow, Dask, etc., to provide their own implementations of the Numpy API specification.

The `__array_function__` protocol is still experimental and will be enabled by default in the next Numpy release (v1.17); however, on v1.16 you can enable it by setting the environment variable `NUMPY_EXPERIMENTAL_ARRAY_FUNCTION` to 1.
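Since the flag is read when NumPy is imported, it has to be set before the first `import numpy` in your program; for example:

```python
import os

# Must be set before NumPy is first imported (only needed on v1.16;
# on v1.17+ the protocol is enabled by default).
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"
```

Setting it in the shell that launches Python works just as well.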

It’s easy to give an example of how the protocol can be implemented for PyTorch, in order to get a feeling of what it will allow you to do.

First, we define the protocol that we will use to monkey-patch PyTorch (*but you can change the framework itself, of course, that’s the whole point*):

```python
HANDLED_FUNCTIONS = {}

def __array_function__(self, func, types, args, kwargs):
    if func not in HANDLED_FUNCTIONS:
        return NotImplemented
    if not all(issubclass(t, torch.Tensor) for t in types):
        return NotImplemented
    return HANDLED_FUNCTIONS[func](*args, **kwargs)
```

After that, we create a simple decorator to populate the `HANDLED_FUNCTIONS` dictionary with PyTorch’s implementations of the Numpy API:

```python
def implements(numpy_function):
    def decorator(func):
        HANDLED_FUNCTIONS[numpy_function] = func
        return func
    return decorator
```

Later, we just have to monkey-patch PyTorch (*not required if you change the PyTorch code itself*) and provide a PyTorch implementation of the Numpy operations:

```python
torch.Tensor.__array_function__ = __array_function__

@implements(np.sum)
def npsum(arr, axis=0):
    return torch.sum(arr, dim=axis)
```

And that’s it! You can now do things like this:

```python
>>> torch_tensor = torch.ones(10)
>>> type(torch_tensor)
torch.Tensor
>>> result = np.sum(torch_tensor)
>>> result
tensor(10.)
>>> type(result)
torch.Tensor
```

This is quite amazing, given that you can easily mix APIs and write a single piece of code that works with PyTorch, Numpy, or other frameworks, independently of the multi-dimensional array implementation, dispatching to the correct backend implementation.
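Stripped of NumPy and PyTorch, the dispatch mechanics boil down to a dictionary lookup keyed by the API function; here is a pure-Python toy of the same pattern (all names here are illustrative, not part of NEP 18):

```python
HANDLED = {}

def implements(api_func):
    """Register an override implementation for a given API function."""
    def decorator(func):
        HANDLED[api_func] = func
        return func
    return decorator

class MyArray:
    """A toy array type that wants to intercept API calls made on it."""
    def __init__(self, data):
        self.data = data

def api_sum(arr):
    """Stand-in for a np.sum-like entry point: dispatch to the override."""
    override = HANDLED.get(api_sum)
    if override is not None and isinstance(arr, MyArray):
        return override(arr)
    return sum(arr)  # fallback for plain sequences

@implements(api_sum)
def my_sum(arr):
    return MyArray(sum(arr.data))

print(api_sum([1, 2, 3]))                # falls back to the default: 6
print(api_sum(MyArray([1, 2, 3])).data)  # dispatched to my_sum: 6
```

The real protocol is richer (it inspects the `types` of all arguments and lets implementations return `NotImplemented`), but the registry-plus-lookup core is the same.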

Dask already implements the protocol, so you don’t have to monkey-patch it:

```python
>>> import numpy as np
>>> import dask.array as da
>>> dask_array = da.random.random((5000, 1000), chunks=(1000, 1000))
>>> type(dask_array)
dask.array.core.Array
>>> u, s, v = np.linalg.svd(dask_array)  # Numpy API !
>>> u
>>> s
>>> v
```

As you can see, when we used the singular value decomposition (SVD) from Numpy, we were able to define the Dask computation graph using the Numpy API:

```python
>>> v.visualize()
```

And we can now compute the result using Dask’s lazy evaluation:

```python
>>> result = v.compute()
>>> result
array([[ 0.03178809,  0.03147971,  0.0313492 , ...,  0.03201226,
         0.03142089,  0.03180248],
       [ 0.00272678, -0.0182263 ,  0.00839066, ...,  0.01195828,
         0.01681703, -0.05086809],
       [-0.00851417,  0.0411336 ,  0.02738713, ...,  0.03447079,
         0.04614097,  0.02292196],
       ...,
```

And you can do the same with CuPy as well.

The new dispatch protocol opens the door for seamless integration among frameworks in the Python scientific ecosystem. I hope more and more frameworks adopt it in the near future.

*– Christian S. Perone*

Cite this article as: Christian S. Perone, "Numpy dispatcher: when Numpy becomes a protocol for an ecosystem," in *Terra Incognita*, 06/07/2019, http://blog.christianperone.com/2019/07/numpy-dispatcher-when-numpy-becomes-a-protocol-for-an-ecosystem/.

Cite this article as: Christian S. Perone, "Benford law on GPT-2 language model," in *Terra Incognita*, 14/06/2019, http://blog.christianperone.com/2019/06/benford-law-on-gpt-2-language-model/.

I was experimenting with the approach described in “Randomized Prior Functions for Deep Reinforcement Learning” by Ian Osband et al., presented at NeurIPS 2018, where they devised a very simple and practical method for uncertainty estimation using bootstrap and randomized priors, and decided to share the PyTorch code.

I really like bootstrap approaches; in my opinion, they are usually the easiest methods to implement and they provide very good posterior approximations, with deep connections to Bayesian approaches but without having to deal with variational inference. They actually show in the paper that, in the linear case, the method provides a Bayes posterior.

The main idea of the method is to use the bootstrap to provide a non-parametric data perturbation together with randomized priors, which are nothing more than randomly initialized networks:

$$Q_{\theta_k}(x) = f_{\theta_k}(x) + p_k(x)$$

The final model \(Q_{\theta_k}(x)\) is the k-th model of the ensemble, fitting the function \(f_{\theta_k}(x)\) together with an untrained prior \(p_k(x)\).
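In plain terms, the composition is just adding a frozen random function to a trainable one; a stdlib-only toy of the idea (illustrative names of mine, not the paper's code):

```python
import random

def make_prior(seed):
    """A 'randomized prior': a random affine function, frozen after creation."""
    rng = random.Random(seed)
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return lambda x: a * x + b

def with_prior(f, prior, prior_scale=1.0):
    """Q_k(x) = f_k(x) + scale * p_k(x); only f_k would ever be trained."""
    return lambda x: f(x) + prior_scale * prior(x)

prior = make_prior(seed=42)
f = lambda x: 0.5 * x  # stand-in for the trained network f_k
Q = with_prior(f, prior)
```

Each ensemble member gets its own seed, so the priors disagree with each other away from the data, which is exactly where the uncertainty shows up.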

Let’s go to the code. The first class is a simple MLP with 2 hidden layers and Glorot initialization:

```python
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 20)
        self.l2 = nn.Linear(20, 20)
        self.l3 = nn.Linear(20, 1)

        nn.init.xavier_uniform_(self.l1.weight)
        nn.init.xavier_uniform_(self.l2.weight)
        nn.init.xavier_uniform_(self.l3.weight)

    def forward(self, inputs):
        x = self.l1(inputs)
        x = nn.functional.selu(x)
        x = self.l2(x)
        x = nn.functional.selu(x)
        x = self.l3(x)
        return x
```

Then we define a class that takes the model and the prior and produces the final model result:

```python
class ModelWithPrior(nn.Module):
    def __init__(self,
                 base_model: nn.Module,
                 prior_model: nn.Module,
                 prior_scale: float = 1.0):
        super().__init__()
        self.base_model = base_model
        self.prior_model = prior_model
        self.prior_scale = prior_scale

    def forward(self, inputs):
        with torch.no_grad():
            prior_out = self.prior_model(inputs)
            prior_out = prior_out.detach()
        model_out = self.base_model(inputs)
        return model_out + (self.prior_scale * prior_out)
```

And that’s basically it! As you can see, it’s a very simple method: we just created a custom `forward()` to avoid computing/accumulating gradients for the prior network and then summed its output (after scaling) with the model prediction.

To train it, you just have to use a different bootstrap for each ensemble model, as in the code below:

```python
def train_model(x_train, y_train, base_model, prior_model):
    model = ModelWithPrior(base_model, prior_model, 1.0)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

    for epoch in range(100):
        model.train()
        preds = model(x_train)
        loss = loss_fn(preds, y_train)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return model
```

and using a sampler with replacement (bootstrap) as in:

```python
dataset = TensorDataset(...)
bootstrap_sampler = RandomSampler(dataset, True, len(dataset))
train_dataloader = DataLoader(dataset,
                              batch_size=len(dataset),
                              sampler=bootstrap_sampler)
```
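The bootstrap itself is nothing more than drawing dataset indices with replacement; in stdlib terms (an illustrative helper of mine, not from the code above):

```python
import random

def bootstrap_indices(n, seed=None):
    """Draw n indices from range(n) with replacement:
    one independent resample per ensemble member."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

# Each ensemble member trains on its own resample, so some examples
# repeat and others are left out (~36.8% on average for large n).
idx = bootstrap_indices(100, seed=0)
```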

In this case, I used the same small dataset used in the original paper:

After training it with a simple MLP prior as well, the results for the uncertainty are shown below:

If we look at just the priors, we will see the variation of the untrained networks:

We can also visualize the individual model predictions showing their variability due to different initializations as well as the bootstrap noise:

Now, what is also quite interesting is that we can change the prior to, let’s say, a fixed sine function:

```python
class SinPrior(nn.Module):
    def forward(self, input):
        return torch.sin(3 * input)
```

Then, when we train the same MLP model but this time using the sine prior, we can see how it affects the final prediction and uncertainty bounds:

If we show each individual model, we can see the effect of the prior contribution to each individual model:

I hope you liked it; these are quite amazing results for a simple method that at least passes the linear “sanity check”. I’ll explore pre-trained networks in place of the prior to see their different effects on predictions; it’s a very interesting way to add simple priors.

Cite this article as: Christian S. Perone, "Randomized prior functions in PyTorch," in *Terra Incognita*, 24/03/2019, http://blog.christianperone.com/2019/03/randomized-prior-functions-in-pytorch/.

Cite this article as: Christian S. Perone, "PyData Montreal slides for the talk: PyTorch under the hood," in *Terra Incognita*, 26/02/2019, http://blog.christianperone.com/2019/02/pydata-montreal-slides-for-the-talk-pytorch-under-the-hood/.

Here is the full table of contents for those interested:

- Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions (1/19/2020)
- Listening to the neural network gradient norms during training (8/4/2019)
- Visualizing network ensembles with bootstrap and randomized priors (7/20/2019)
- Uncertainty Estimation in Deep Learning (PyData Lisbon / July 2019) (7/18/2019)
- Numpy dispatcher: when Numpy becomes a protocol for an ecosystem (7/6/2019)
- Benford law on GPT-2 language model (6/14/2019)
- Randomized prior functions in PyTorch (3/24/2019)
- PyData Montreal slides for the talk: PyTorch under the hood (2/26/2019)
- 10 years of blogging ! (1/16/2019)
- A sane introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) (1/2/2019)
- Introducing EuclidesDB – A machine learning feature database (11/27/2018)
- Time Maps: visualizing discrete events from Brazilian presidential election candidates (10/31/2018)
- Análise bayesiana dos microdados do ENEM do Rio Grande do Sul (10/28/2018)
- Benford’s law emerges from deep language model (10/18/2018)
- PyTorch 1.0 tracing JIT and LibTorch C++ API to integrate PyTorch into NodeJS (10/2/2018)
- Concentration inequalities – Part I (8/23/2018)
- NLP word representations and the Wittgenstein philosophy of language (5/23/2018)
- PyTorch – Internal Architecture Tour (3/12/2018)
- Privacy-preserving sentence semantic similarity using InferSent embeddings and secure two-party computation (1/22/2018)
- New prime on the block (1/4/2018)
- The effective receptive field on CNNs (11/12/2017)
- Benford’s law – Index (11/6/2017)
- The same old historicism, now on AI (7/30/2017)
- Introduction to Word Embeddings (2/8/2017)
- Nanopipe: connecting the modern babel (10/17/2016)
- JIT native code generation for TensorFlow computation graphs using Python and LLVM (8/22/2016)
- Convolutional Neural Networks – Architectural Zoo (6/2/2016)
- Voynich Manuscript: word vectors and t-SNE visualization of some patterns (1/16/2016)
- Convolutional hypercolumns in Python (1/11/2016)
- Convolutional Neural Networks Presentation (10/15/2015)
- Deep learning – Convolutional neural networks and feature extraction with Python (8/19/2015)
- Google’s S2, geometry on the sphere, cells and Hilbert curve (8/14/2015)
- Luigi’s Codex Seraphinianus deep learning dreams (8/11/2015)
- Real time Drone object tracking using Python and OpenCV (1/26/2015)
- Arduino and OLED display to monitor Redis for fun and profit (1/14/2015)
- Recebendo dados de baloes meteorologicos da Aeronautica (7/28/2014)
- Simple and effective coin segmentation using Python and OpenCV (6/22/2014)
- Despesas de Custeio e Lei de Benford (6/9/2014)
- Universality, primes and space communication (1/9/2014)
- The beauty of Bitcoin P2P network (12/13/2013)
- Protocoin – a pure Python Bitcoin protocol implementation (11/22/2013)
- Mapa de calor dos dados de acidentes de transito do DataPoa (11/16/2013)
- Book Suggestion: Codex Seraphinianus (11/5/2013)
- Machine Learning :: Cosine Similarity for Vector Space Models (Part III) (9/12/2013)
- Rastreamento em tempo real de avioes em Porto Alegre utilizando Raspberry Pi + Radio UHF (SDR RTL2832U) (2/5/2013)
- Raspberry Pi & Arduino: a laser pointer communication and a LDR voltage sigmoid (8/19/2012)
- Genetic Programming and a LLVM JIT for restricted Python AST expressions (8/15/2012)
- Raspberry Pi no Brasil (8/2/2012)
- Review Board – Review Requests plotting using Python (7/7/2012)
- Accessing HP Cloud OpenStack Nova using Python and Requests (2/2/2012)
- Announce: Stallion v0.2 released ! (12/15/2011)
- Announce: ‘Stallion’ – Python Package Manager (12/4/2011)
- Hacking into Python objects internals (11/23/2011)
- C++11 user-defined literals and some constructions (11/21/2011)
- Machine Learning :: Text feature extraction (tf-idf) – Part II (10/3/2011)
- Machine Learning :: Text feature extraction (tf-idf) – Part I (9/18/2011)
- The Interactive Robotic Painting Machine ! (8/17/2011)
- Pyevolve – new repository at Github (8/10/2011)
- News: Algorithm searches for models that best explain experimental data (8/2/2011)
- C++0x :: Introduction to some amazing features (7/28/2011)
- The bad good news for optimization (7/16/2011)
- Hiatus (5/20/2011)
- PyCon 2011: Genetic Programming in Python (3/14/2011)
- Using pyearthquake to plot Japan USGS earthquake data into the near real-time MODIS satellite imagery (3/12/2011)
- Invite: PyCon US 2011 – Genetic Programming in Python (3/9/2011)
- Health and Genetic Algorithms (2/4/2011)
- Google Analytics Visualization (2/2/2011)
- Darwin on the track (11/19/2010)
- Capturing and comparing writers profiles through Markov chains (9/28/2010)
- PyOhio 2010: Genetic Programming in Python (8/31/2010)
- New SIGEVOlution Volume 5 Issue 1 (5/27/2010)
- The future can be written in RPython now (5/24/2010)
- Pyevolve 0.6rc1 released ! (4/25/2010)
- Riemann Zeta function visualizations with Python (2/20/2010)
- New issue of SIGEVOlution (Volume 4 Issue 3) (2/17/2010)
- News: Using genetic algorithms to optimise current and future health planning (1/28/2010)
- New issue of SIGEVOlution, Volume 4 Issue 2 (1/18/2010)
- Exploring real-time Haiti USGS Earthquake data with near real-time MODIS Aqua+Terra satellite imagery using Python (1/16/2010)
- Pyevolve in action, solving a 100 cities TSP problem (12/17/2009)
- A method for JIT’ing algorithms and data structures with LLVM (11/22/2009)
- Pyevolve on SIGEVOlution (11/9/2009)
- Pyevolve benchmark on different Python flavors (10/19/2009)
- Successful pyevolve multiprocessing speedup for Genetic Programming (10/11/2009)
- Meanwhile, at the Hall of Justice! (10/5/2009)
- Beautiful Django (10/1/2009)
- On the irreversibility of evolution (9/27/2009)
- New SIGEvolution issue – Volume 3 Issue 4! (9/22/2009)
- n-queens problem using Pyevolve (9/19/2009)
- Word is smart, but OpenOffice is wise (9/18/2009)
- Send Subversion commit messages to Twitter (8/12/2009)
- An analysis of Benford’s law applied to Twitter (8/11/2009)
- Announce: python-gstringc 1.0 (8/8/2009)
- Pyevolve: announce of the user group (8/4/2009)
- GLib GString wrapper for Python (8/4/2009)
- Jinja2 in a Java JSP (7/29/2009)
- Approximating Pi number using Genetic Programming (7/22/2009)
- Genetic Programming and Flex layouts (6/29/2009)
- FISL 10 – “Forum Internacional de Software Livre” (6/25/2009)
- Benford’s Law and the Iran’s election (6/17/2009)
- Genetic Programming meets Python (6/8/2009)
- The Darwin’s cake experiment (6/7/2009)
- Prime Numbers and the Benford’s Law (5/9/2009)
- Evolving autopilots could boost space slingshots (5/7/2009)
- Pyevolve: Sourceforge.net Nomination (5/6/2009)
- ‘Evolutionary Algorithms’ Mimic Natural Evolution In Silico And Lead To Innovative Solutions For Complex Problems (5/6/2009)
- Pyevolve: Genetic Algorithms + Interactive Mode (4/27/2009)
- Delicious.com, checking user numbers against Benford’s Law (4/24/2009)
- Benford’s Law meets Python and Apple Stock Prices (4/23/2009)
- More Twitter 3D videos (4/23/2009)
- Digital Archeology Reveals Dinosaur Details using Genetic Algorithms (4/15/2009)
- Evolutionary Computation (4/14/2009)
- Computação Evolutiva (4/14/2009)
- Computer Program Self-Discovers Laws of Physics (4/5/2009)
- Pyevolve: LaTeX documentation (3/31/2009)
- TSP on Nokia N73 using PyS60 and Pyevolve (3/25/2009)
- Good news for Python community (3/25/2009)
- Genetic Algorithms on cellphones ! Pyevolve on Nokia N73 (Symbian + PyS60) (3/21/2009)
- Happy Birthday Genetic Argonaut ! (3/20/2009)
- Travelling Salesman on Sony PSP using Stackless Python + Pyevolve (3/10/2009)
- Google Search Insight for GA and GP (3/9/2009)
- Rosenbrock’s banana function and friends (3/6/2009)
- Considerations regarding Karl Popper’s philosophy of science and Evolutionary Algorithms (3/4/2009)
- Pyevolve on Sony PSP ! (genetic algorithms on PSP) (3/3/2009)
- Pyevolve: the effect of Linear Scaling (3/2/2009)
- Python: 3D real-time debugging and function call structure (2/25/2009)
- Twitter in 3D ! (2/22/2009)
- Genetic Algorithms help us to understand how we see (2/16/2009)
- Good news: IronPython and Jython compatibility ! (2/16/2009)
- Pyevolve profiling dot graph (2/14/2009)
- Considerações sobre a filosofia da ciência de Karl Popper e Algoritmos Evolutivos (2/13/2009)
- The Pyevolve Interactive Mode on Mac OS (2/12/2009)
- Happy Darwin day ! (2/12/2009)
- Python: acessing near real-time MODIS images and fire data from NASA’s Aqua and Terra satellites (2/11/2009)
- Considerações sobre o determinismo e livre-arbítrio para uma futura e distante ciência (2/9/2009)
- Pyevolve: work on the distributed Genetic Algorithm support (2/7/2009)
- A beautiful example of cooperation between theory and experiment, the new Boron form (2/7/2009)
- The new Pyevolve blog site (2/7/2009)


Those aforementioned issues make it very confusing for newcomers to understand these concepts, and I’m often confronted by people who were unfortunately misled by many tutorials. For that reason, I decided to write a sane introduction to these concepts and elaborate more on their relationships and hidden interactions while trying to explain every step of formulations. I hope to bring something new to help people understand these principles.

Maximum likelihood estimation is a method, or principle, used to estimate the parameter or parameters of a model given observations. It is commonly abbreviated as MLE and is also known as the method of maximum likelihood. As the name suggests, this principle works by maximizing the likelihood; therefore, the key to understanding maximum likelihood estimation is to first understand what a likelihood is and why someone would want to maximize it in order to estimate model parameters.

Let’s start with the definition of the likelihood function for continuous case:

$$\mathcal{L}(\theta | x) = p_{\theta}(x)$$

The left term reads as “the likelihood of the parameters \(\theta\), given data \(x\)”. Now, what does that mean? It means that in the continuous case, the likelihood of the model \(p_{\theta}(x)\) with the parametrization \(\theta\) and data \(x\) is the probability density function (pdf) of the model with that particular parametrization.

Although this is the most common likelihood representation, you should note that the notation \(\mathcal{L}(\cdot | \cdot)\) in this case doesn’t mean the same as the conditional notation, so be careful with this overload, because it is usually left implicit and is a frequent source of confusion. Another representation of the likelihood that is often used is \(\mathcal{L}(x; \theta)\), which is better in the sense that it makes clear that it’s not a conditional; however, it makes the likelihood look like a function of the data rather than of the parameters.

The model \(p_{\theta}(x)\) can be any distribution, and to make things concrete, let’s say that we assume the data-generating distribution to be a univariate Gaussian distribution, which we define below:

$$

\begin{align}

p(x) & \sim \mathcal{N}(\mu, \sigma^2) \\

p(x; \mu, \sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}} \exp{ \bigg[-\frac{1}{2}\bigg( \frac{x-\mu}{\sigma}\bigg)^2 \bigg] }

\end{align}

$$

If you plot this probability density function with different parametrizations, you’ll get something like the plots below, where the red distribution is the standard Gaussian \(p(x) \sim \mathcal{N}(0, 1.0)\):
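Concretely, these densities can be evaluated directly from the definition above; here is a minimal sketch in plain Python (the evaluation points are only illustrative):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of the univariate Gaussian N(mu, sigma2) evaluated at x."""
    return (1.0 / math.sqrt(2.0 * math.pi * sigma2)) * \
        math.exp(-0.5 * (x - mu) ** 2 / sigma2)

# Densities of the standard Gaussian N(0, 1) at a few points
for x in (-2.0, 0.0, 2.0):
    print(f"p({x:+.1f}) = {gaussian_pdf(x):.4f}")
```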

As you can see in the probability density function (pdf) plot above, the density of \(x\) at various realizations is shown on the y-axis. Another source of confusion here is that people often take these values as probabilities, because in plots of normals the curve always stays below 1; however, the probability density function doesn’t give you probabilities but densities. The constraint on the pdf is that it must integrate to one:

$$\int_{-\infty}^{+\infty} f(x)dx = 1$$

So, it is completely normal to have densities larger than 1 in many points for many different distributions. Take for example the pdf for the Beta distribution below:

As you can see, the pdf shows densities above one for many parametrizations of the distribution, while still integrating to 1 and satisfying the second axiom of probability: the unit measure.
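This is easy to check numerically; the sketch below evaluates the Beta pdf directly from its definition (the Beta(5, 5) parametrization is just an example):

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1.0 - x) ** (b - 1)

# Beta(5, 5) concentrates mass around 0.5, so its density there is above 1,
# while the pdf still integrates to 1 over (0, 1)
print(beta_pdf(0.5, 5, 5))  # a density, not a probability
```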

So, returning to our original principle of maximum likelihood estimation, what we want is to maximize the likelihood \(\mathcal{L}(\theta | x)\) for our observed data. In practical terms, we want to find the parameters \(\theta\) of our model for which the likelihood that this model generated our data is maximized; in other words, we ask which **parameters of this model are most plausible** to have generated this observed data, or which parameters make this sample most probable.

For the case of our univariate Gaussian model, what we want is to find the parameters \(\mu\) and \(\sigma^2\), which, for notational convenience, we collapse into a single parameter vector:

$$\theta = \begin{bmatrix}\mu \\ \sigma^2\end{bmatrix}$$

These are the statistics that completely define our univariate Gaussian model. So let’s formulate the problem of maximum likelihood estimation:

$$

\begin{align}

\hat{\theta} &= \mathrm{arg}\max_\theta \mathcal{L}(\theta | x) \\

&= \mathrm{arg}\max_\theta p_{\theta}(x)

\end{align}

$$

This says that we want to obtain the maximum likelihood estimate \(\hat{\theta}\) that approximates \(p_{\theta}(x)\) to an underlying “true” distribution \(p_{\theta^*}(x)\) by maximizing the likelihood of the parameters \(\theta\) given data \(x\). You shouldn’t confuse a maximum likelihood estimate \(\hat{\theta}(x)\), which is a realization of the maximum likelihood estimator for the data \(x\), with the maximum likelihood estimator \(\hat{\theta}\), so pay attention to disambiguate them in your head.

However, we need to incorporate multiple observations in this formulation, and by adding multiple observations we end up with a complex joint distribution:

$$\hat{\theta} = \mathrm{arg}\max_\theta p_{\theta}(x_1, x_2, \ldots, x_n)$$

This joint distribution needs to take into account the interactions between all observations, and here is where we make a strong assumption: we state that the **observations are independent**. For independent random variables, the following holds:

$$p_{\theta}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p_{\theta}(x_i)$$

This means that since \(x_1, x_2, \ldots, x_n\) don’t contain information about each other, we can write the joint probability as a product of their marginals.

Another assumption is that these random variables are **identically distributed**, which means that they come from the same generating distribution, allowing us to model them with the same distribution parametrization.

Given these two assumptions, together known as IID (independent and identically distributed), we can formulate our maximum likelihood estimation problem as:

$$\hat{\theta} = \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i)$$

Note that MLE doesn’t require you to make these assumptions; however, many problems will appear if you don’t, such as having a different distribution for each sample or having to deal with joint probabilities.

Given that these densities can be very small, multiplying many of them in the product above can quickly underflow to values too small to represent. Here is where the logarithm function makes its way into the likelihood. The log function is strictly monotonically increasing, which preserves the location of the *extrema*, and it has a very nice property:

$$\log ab = \log a + \log b $$

Where the logarithm of a product is the sum of the logarithms, which is very convenient for us, so we’ll apply the logarithm to the likelihood to maximize what is called the log-likelihood:

$$

\begin{align}

\hat{\theta} &= \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i) \\

&= \mathrm{arg}\max_\theta \sum_{i=1}^{n} \log p_{\theta}(x_i) \\

\end{align}

$$

As you can see, we went from a product to a summation, which is much more convenient. Another reason for applying the logarithm is that we often take the derivative and solve for the parameters, and it is much easier to work with a summation than with a multiplication.
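The numerical side of this argument is easy to demonstrate: multiplying many small densities underflows in double precision, while summing their logarithms stays well behaved. A minimal sketch (the density values are purely illustrative):

```python
import math

# Pretend these are densities evaluated at 100 observations
densities = [1e-5] * 100

# The naive product underflows to exactly 0.0 in double precision,
# since 1e-5 ** 100 = 1e-500 is far below the smallest positive double
product = 1.0
for d in densities:
    product *= d
print(product)

# The log-likelihood is a perfectly representable finite number
log_likelihood = sum(math.log(d) for d in densities)
print(log_likelihood)
```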

We can also conveniently average the log-likelihood (*this only multiplies the objective by a constant and doesn’t move the maximum*):

$$

\begin{align}

\hat{\theta} &= \mathrm{arg}\max_\theta \sum_{i=1}^{n} \log p_{\theta}(x_i) \\

&= \mathrm{arg}\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i) \\

\end{align}

$$

This is also convenient because it removes the dependency on the number of observations. We also know that, by the law of large numbers, the following holds as \(n\to\infty\):

$$

\frac{1}{n} \sum_{i=1}^{n} \log \, p_{\theta}(x_i) \approx \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]

$$

As you can see, we’re approximating the expectation with the **empirical expectation** defined by our dataset \(\{x_i\}_{i=1}^{n}\). This is an important point and it is usually implicitly assumed.
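This approximation is easy to verify with a small Monte Carlo sketch: sampling from a standard Gaussian and averaging the log-density approaches the exact expectation \(-\frac{1}{2}\log(2\pi) - \frac{1}{2}\) (the seed and sample size below are arbitrary choices):

```python
import math
import random

random.seed(0)

def log_gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return -0.5 * math.log(2.0 * math.pi * sigma2) \
        - 0.5 * (x - mu) ** 2 / sigma2

# Exact expectation of log p(x) under x ~ N(0, 1)
exact = -0.5 * math.log(2.0 * math.pi) - 0.5  # approx -1.4189

# Empirical expectation over a large i.i.d. sample (law of large numbers)
n = 100_000
empirical = sum(log_gaussian_pdf(random.gauss(0.0, 1.0))
                for _ in range(n)) / n
print(exact, empirical)
```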

The weak law of large numbers can be bounded using a Chebyshev bound; if you are interested in concentration inequalities, I’ve written an article about them where I discuss the Chebyshev bound.

To finish our formulation, given that we usually minimize objectives, we can formulate the same maximum likelihood estimation as the minimization of the negative of the log-likelihood:

$$

\hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]

$$

This is exactly the same thing, with the negation just turning the maximization problem into a minimization problem.
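For the univariate Gaussian, this minimization has the well-known closed-form solution: the sample mean and the (biased, divide-by-\(n\)) sample variance. The sketch below checks numerically that perturbing either estimate only increases the average negative log-likelihood (the data values are illustrative):

```python
import math

def avg_nll(data, mu, sigma2):
    """Average negative log-likelihood of N(mu, sigma2) on the data."""
    n = len(data)
    return 0.5 * math.log(2.0 * math.pi * sigma2) \
        + sum((x - mu) ** 2 for x in data) / (2.0 * n * sigma2)

data = [1.2, 0.7, 3.1, 2.4, 1.9, 0.5, 2.2]

# Closed-form Gaussian MLE: sample mean and biased sample variance
mu_hat = sum(data) / len(data)
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

# Any perturbation of the estimates increases the objective
best = avg_nll(data, mu_hat, sigma2_hat)
for d in (-0.1, 0.1):
    assert best <= avg_nll(data, mu_hat + d, sigma2_hat)
    assert best <= avg_nll(data, mu_hat, sigma2_hat + d)
```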

It is well known that maximizing the likelihood is the same as minimizing the Kullback-Leibler divergence, also known as the KL divergence. This is very interesting because it links a measure from **information theory** with the maximum likelihood principle.

The KL divergence is defined as:

$$

\begin{equation}

D_{KL}( p || q)=\int p(x)\log\frac{p(x)}{q(x)} \ dx

\end{equation}

$$

There are many intuitions for understanding the KL divergence; I personally like the perspective of likelihood ratios. However, there is plenty of material about it that you can easily find, and it’s out of the scope of this introduction.

The KL divergence is basically the expectation of the log-likelihood ratio under the \(p(x)\) distribution. What we’re doing below is just rephrasing it by using some identities and properties of the expectation:

$$

\begin{align}

D_{KL}[p_{\theta^*}(x) \, \Vert \, p_\theta(x)] &= \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \frac{p_{\theta^*}(x)}{p_\theta(x)} \right] \\

\label{eq:logquotient}

&= \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \,p_{\theta^*}(x) - \log \, p_\theta(x) \right] \\

\label{eq:linearization}

&= \underbrace{\mathbb{E}_{x \sim p_{\theta^*}(x)} \left[\log \, p_{\theta^*}(x) \right]}_{\text{Negative entropy of } p_{\theta^*}(x)} - \underbrace{\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]}_{\text{Expected log-likelihood}}

\end{align}

$$

In the formulation above, we first use the fact that the logarithm of a quotient equals the difference of the logs of the numerator and denominator (equation \(\ref{eq:logquotient}\)). After that we use the linearity of the expectation (equation \(\ref{eq:linearization}\)), which tells us that \(\mathbb{E}\left[X + Y\right] = \mathbb{E}\left[X\right]+\mathbb{E}\left[Y\right]\). In the end, we are left with two terms: the first one, on the left, is the **negative entropy** of \(p_{\theta^*}(x)\), and the one on the right you can recognize as the **negative of the log-likelihood** that we saw earlier.
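The KL divergence can also be checked numerically against its Monte Carlo estimate. For two Gaussians with equal unit variance, the closed form reduces to \(D_{KL}[\mathcal{N}(\mu_1, 1) \, \Vert \, \mathcal{N}(\mu_2, 1)] = (\mu_1 - \mu_2)^2 / 2\); a minimal sketch (seed and sample size are arbitrary):

```python
import math
import random

random.seed(42)

def log_pdf(x, mu, sigma2):
    return -0.5 * math.log(2.0 * math.pi * sigma2) \
        - 0.5 * (x - mu) ** 2 / sigma2

# Closed form: KL( N(0,1) || N(1,1) ) = (0 - 1)^2 / 2 = 0.5
closed_form = 0.5

# Monte Carlo: expectation of the log-ratio under the first distribution
n = 100_000
kl_mc = sum(log_pdf(x, 0.0, 1.0) - log_pdf(x, 1.0, 1.0)
            for x in (random.gauss(0.0, 1.0) for _ in range(n))) / n
print(closed_form, kl_mc)
```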

If we want to minimize the KL divergence with respect to \(\theta\), we can ignore the first term, since it doesn’t depend on \(\theta\) in any way, and in the end we have exactly the same maximum likelihood formulation that we saw before:

$$

\begin{eqnarray}

\require{cancel}

\hat{\theta} &=& \mathrm{arg}\min_\theta \cancel{\mathbb{E}_{x \sim p_{\theta^*}(x)} \left[\log \, p_{\theta^*}(x) \right]} - \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]\\

&=& \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]

\end{eqnarray}

$$

A very common scenario in Machine Learning is supervised learning, where we have data points \(x_n\) and their labels \(y_n\) building up our dataset \( D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \} \), where we’re interested in estimating the conditional probability of \(\textbf{y}\) given \(\textbf{x}\), or more precisely \( P_{\theta}(Y | X) \).

To extend the maximum likelihood principle to the conditional case, we just have to write it as:

$$

\hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{(x, y) \sim p_{\theta^*}(x, y)}\left[\log \, p_{\theta}(y | x) \right]

$$

And then it can be easily generalized to formulate the linear regression:

$$

y \mid x \sim \mathcal{N}(x^T \theta, \sigma^2) \\

\sum_{i=1}^{n} \log p_{\theta}(y_i | x_i) = -n \log \sigma - \frac{n}{2} \log{2\pi} - \sum_{i=1}^{n}{\frac{( x_i^T \theta - y_i )^2}{2\sigma^2}}

$$

In that case, you can see that we end up with a sum of squared errors whose optimum is at the same location as that of the mean squared error (MSE). So minimizing the MSE is equivalent to maximizing the likelihood of a Gaussian model.
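This equivalence is easy to verify numerically. The sketch below fits a one-parameter model \(y = \theta x\) (no intercept) by least squares and checks that the resulting \(\hat{\theta}\) also minimizes the only \(\theta\)-dependent part of the Gaussian negative log-likelihood, the sum of squared errors (the data points are illustrative):

```python
def sum_squared_errors(data, theta):
    """The theta-dependent part of the Gaussian negative log-likelihood:
    the sigma and 2*pi terms are constants with respect to theta."""
    return sum((theta * x - y) ** 2 for x, y in data)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# Least-squares solution for y = theta * x (normal equation in 1-D):
# theta = sum(x * y) / sum(x * x)
theta_hat = sum(x * y for x, y in data) / sum(x * x for x, _ in data)

# The same theta_hat minimizes the squared-error objective
best = sum_squared_errors(data, theta_hat)
assert best <= sum_squared_errors(data, theta_hat + 0.01)
assert best <= sum_squared_errors(data, theta_hat - 0.01)
print(theta_hat)
```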

The maximum likelihood estimation has very interesting properties, but it gives us only **point estimates**, which means that we cannot reason about the distribution of these estimates. In contrast, Bayesian inference can give us a full distribution over the parameters and therefore allows us to **reason about the posterior distribution**.

I’ll write more about Bayesian inference and sampling methods such as those from the Markov Chain Monte Carlo (MCMC) family in another article; for now, I’ll continue by showing the relationship of the maximum likelihood estimator with the maximum a posteriori (MAP) estimator.

Although the maximum a posteriori, also known as MAP, also provides us with a point estimate, it is a Bayesian concept that incorporates a prior over the parameters. We’ll also see that the MAP has a strong connection with the regularized MLE estimation.

We know from the Bayes rule that we can get the posterior from the product of the likelihood and the prior, normalized by the evidence:

$$

\begin{align}

p(\theta \vert x) &= \frac{p_{\theta}(x) p(\theta)}{p(x)} \\

\label{eq:proport}

&\propto p_{\theta}(x) p(\theta)

\end{align}

$$

In equation \(\ref{eq:proport}\), since we’re concerned with optimization, we drop the normalizing evidence \(p(x)\) and keep a proportional posterior, which is very convenient because the marginalization involved in computing \(p(x)\) requires integration and is intractable in many cases.

$$

\begin{align}

\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} p_{\theta}(x) p(\theta) \\

&= \mathop{\rm arg\,max}\limits_{\theta} \prod_{i=1}^{n} p_{\theta}(x_i) p(\theta) \\

&= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i=1}^{n} \underbrace{\log p_{\theta}(x_i)}_{\text{Log-likelihood}} + \underbrace{\log p(\theta)}_{\text{Log-prior}}

\end{align}

$$

In the formulation above, we followed the same steps as described earlier for the maximum likelihood estimator: we assume independence and an identical distributional setting, and then apply the logarithm to switch from a product to a summation. As you can see in the final formulation, this is equivalent to the maximum likelihood estimation with an additional term coming from the prior.

We can also easily recover the exact maximum likelihood estimator by using a uniform prior \(p(\theta) \sim \textbf{U}(\cdot, \cdot)\). This means that every possible value of \(\theta\) is equally weighted, so the prior only contributes a constant to the objective:

$$

\begin{align}

\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i) + \log p(\theta) \\

&= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i) + \text{constant} \\

&= \underbrace{\mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i)}_{\text{Equivalent to maximum likelihood estimation (MLE)}} \\

\end{align}

$$

And there you are: the MAP with a uniform prior is equivalent to MLE. It is also easy to show that a Gaussian prior recovers the L2-regularized MLE, which is quite interesting, given that it provides insights and a new perspective on the regularization terms that we usually use.
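For a concrete illustration of that last point, consider estimating the mean of a Gaussian with known unit variance under a zero-mean Gaussian prior \(\mathcal{N}(0, \tau^2)\). Standard conjugate-prior algebra gives \(\theta_{MAP} = \sum_i x_i / (n + 1/\tau^2)\), the same shrinkage toward zero an L2 penalty would produce; a minimal sketch (the data and \(\tau^2\) are illustrative):

```python
# Observations assumed drawn from N(theta, 1); prior: theta ~ N(0, tau2)
data = [2.3, 1.8, 2.9, 2.1, 2.5]
tau2 = 1.0  # prior variance

# MLE: just the sample mean
theta_mle = sum(data) / len(data)

# MAP with the conjugate Gaussian prior (known unit variance):
# theta_map = sum(x_i) / (n + 1/tau2); the prior shrinks the estimate
# toward 0, exactly like an L2 (ridge) penalty on theta
theta_map = sum(data) / (len(data) + 1.0 / tau2)

print(theta_mle, theta_map)  # the MAP estimate is pulled toward zero
```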

I hope you liked this article ! The next one will be about Bayesian inference with posterior sampling, where we’ll show how we can reason about the posterior distribution and not only on point estimates as seen in MAP and MLE.

*– Christian S. Perone*

Cite this article as: Christian S. Perone, "A sane introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP)," in *Terra Incognita*, 02/01/2019, http://blog.christianperone.com/2019/01/mle/.
