]]>

Not a lot of people working with the Python scientific ecosystem are aware of the NEP 18 (*dispatch mechanism for NumPy’s high-level array functions*). Given the importance of this protocol, I decided to write this short introduction to the new dispatcher that will certainly bring a **lot of benefits** for the Python scientific ecosystem.

If you used PyTorch, TensorFlow, Dask, etc, you certainly noticed the similarity of their API contracts with Numpy. And it’s not by accident, Numpy’s API is one of the most fundamental and widely-used APIs for scientific computing. Numpy is so pervasive, that it ceased to be only an API and it is becoming more a protocol or an API specification.

There are countless advantages when an an entire ecosystem adopts an API specification. The wide adoption of the Numpy protocols will certainly help create a standardized API that will provide seamless integration *across implementations, *even for the ones that are quite different than the `ndarray`

.

The main goal of the NEP 18, is to provide a protocol that is materialized on the method named `__array_function__`

, where it will allow developers of frameworks such as PyTorch, TensorFlow, Dask, etc, to provide their own implementations of the Numpy API specification.

The `__array_function__`

protocol is still experimental and will be enabled by default on the next Numpy v.1.17, however, on the v1.16 you can enable it by setting the environment variable called `NUMPY_EXPERIMENTAL_ARRAY_FUNCTION`

to 1.

It’s easy to provide an example of how the protocol can be implemented in PyTorch in order to get a feeling of what it will allow you to do.

First, we define the protocol that we will use to monkey-patch PyTorch (*but you can change the framework itself, of course, that’s the whole point*):

HANDLED_FUNCTIONS = {} def __array_function__(self, func, types, args, kwargs): if func not in HANDLED_FUNCTIONS: return NotImplemented if not all(issubclass(t, torch.Tensor) for t in types): return NotImplemented return HANDLED_FUNCTIONS[func](*args, **kwargs)

After that, we create a simple decorator to set the `HANDLED_FUNCTIONS`

dictionary with PyTorch’s implementation of the Numpy API:

def implements(numpy_function): def decorator(func): HANDLED_FUNCTIONS[numpy_function] = func return func return decorator

Later, we just have to monkey patch PyTorch (*not required if you change PyTorch code itself*) and provide a Numpy implementation of the PyTorch operations:

torch.Tensor.__array_function__ = __array_function__ @implements(np.sum) def npsum(arr, axis=0): return torch.sum(arr, dim=axis)

And that’s it ! You can now do things like this:

>>> torch_tensor = torch.ones(10) >>> type(torch_tensor) torch.Tensor >>> result = np.sum(torch_tensor) >>> result tensor(10.) >>> type(result) torch.Tensor

This is quite amazing, given that you can easily mix APIs together and write a single piece of code that will work in PyTorch, Numpy or other frameworks independently of the multi-dimensional array implementation and diverging to the correct backend implementation.

Dask is already implementing the protocol, so you don’t have to monkey-patch it:

>>> import numpy as np >>> import dask.array as da >>> dask_array = da.random.random((5000, 1000), chunks=(1000, 1000)) >>> type(dask_array) dask.array.core.Array >>> u, s, v = np.linalg.svd(dask_array) # Numpy API ! >>> u

>>> s

>>> v

As you can see, when we used the singular decomposition from Numpy, we were able to define the Dask computational graph using the Numpy API:

>>> v.visualize()

And we can now compute the result using Dask lazy eval:

>>> result = v.compute() >>> result array([[ 0.03178809, 0.03147971, 0.0313492 , ..., 0.03201226, 0.03142089, 0.03180248], [ 0.00272678, -0.0182263 , 0.00839066, ..., 0.01195828, 0.01681703, -0.05086809], [-0.00851417, 0.0411336 , 0.02738713, ..., 0.03447079, 0.04614097, 0.02292196], ...,

And you can do the same with CuPy as well.

The new dispatch protocol will open the door for seamless integration among frameworks on the Python scientific ecosystem. I hope that more and more frameworks will start to adopt it in near future.

*– Christian S. Perone*

I was experimenting with the approach described in “Randomized Prior Functions for Deep Reinforcement Learning” by Ian Osband et al. at NPS 2018, where they devised a very simple and practical method for uncertainty using bootstrap and randomized priors and decided to share the PyTorch code.

I really like bootstrap approaches, and in my opinion, they are usually the easiest methods to implement and provide very good posterior approximation with deep connections to Bayesian approaches, without having to deal with variational inference. They actually show in the paper that in the linear case, the method provides a Bayes posterior.

The main idea of the method is to have bootstrap to provide a non-parametric data perturbation together with randomized priors, which are nothing more than just random initialized networks.

$$Q_{\theta_k}(x) = f_{\theta_k}(x) + p_k(x)$$

The final model \(Q_{\theta_k}(x)\) will be the k model of the ensemble that will fit the function \(f_{\theta_k}(x)\) with an untrained prior \(p_k(x)\).

Let’s go to the code. The first class is a simple MLP with 2 hidden layers and Glorot initialization :

class MLP(nn.Module): def __init__(self): super().__init__() self.l1 = nn.Linear(1, 20) self.l2 = nn.Linear(20, 20) self.l3 = nn.Linear(20, 1) nn.init.xavier_uniform_(self.l1.weight) nn.init.xavier_uniform_(self.l2.weight) nn.init.xavier_uniform_(self.l3.weight) def forward(self, inputs): x = self.l1(inputs) x = nn.functional.selu(x) x = self.l2(x) x = nn.functional.selu(x) x = self.l3(x) return x

Then later we define a class that will take the model and the prior to produce the final model result:

class ModelWithPrior(nn.Module): def __init__(self, base_model : nn.Module, prior_model : nn.Module, prior_scale : float = 1.0): super().__init__() self.base_model = base_model self.prior_model = prior_model self.prior_scale = prior_scale def forward(self, inputs): with torch.no_grad(): prior_out = self.prior_model(inputs) prior_out = prior_out.detach() model_out = self.base_model(inputs) return model_out + (self.prior_scale * prior_out)

And it’s basically that ! As you can see, it’s a very simple method, in the second part we just created a custom forward() to avoid computing/accumulating gradients for the prior network and them summing (after scaling) it with the model prediction.

To train it, you just have to use different bootstraps for each ensemble model, like in the code below:

def train_model(x_train, y_train, base_model, prior_model): model = ModelWithPrior(base_model, prior_model, 1.0) loss_fn = nn.MSELoss() optimizer = torch.optim.Adam(model.parameters(), lr=0.05) for epoch in range(100): model.train() preds = model(x_train) loss = loss_fn(preds, y_train) optimizer.zero_grad() loss.backward() optimizer.step() return model

and using a sampler with replacement (bootstrap) as in:

dataset = TensorDataset(...) bootstrap_sampler = RandomSampler(dataset, True, len(dataset)) train_dataloader = DataLoader(dataset, batch_size=len(dataset), sampler=bootstrap_sampler)

In this case, I used the same small dataset used in the original paper:

After training it with a simple MLP prior as well, the results for the uncertainty are shown below:

If we look at just the priors, we will see the variation of the untrained networks:

We can also visualize the individual model predictions showing their variability due to different initializations as well as the bootstrap noise:

Now, what is also quite interesting, is that we can change the prior to let’s say a fixed sine:

class SinPrior(nn.Module): def forward(self, input): return torch.sin(3 * input)

Then, when we train the same MLP model but this time using the sine prior, we can see how it affects the final prediction and uncertainty bounds:

If we show each individual model, we can see the effect of the prior contribution to each individual model:

I hope you liked, these are quite amazing results for a simple method that at least pass the linear “sanity check”. I’ll explore some pre-trained networks in place of the prior to see the different effects on predictions, it’s a very interesting way to add some simple priors.

]]>Here is the full table of contents for those interested:

- Uncertainty Estimation in Deep Learning (PyData Lisbon / July 2019) (7/18/2019)
- Numpy dispatcher: when Numpy becomes a protocol for an ecosystem (7/6/2019)
- Benford law on GPT-2 language model (6/14/2019)
- Randomized prior functions in PyTorch (3/24/2019)
- PyData Montreal slides for the talk: PyTorch under the hood (2/26/2019)
- 10 years of blogging ! (1/16/2019)
- A sane introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) (1/2/2019)
- Introducing EuclidesDB – A machine learning feature database (11/27/2018)
- Time Maps: visualizing discrete events from Brazilian presidential election candidates (10/31/2018)
- Análise bayesiana dos microdados do ENEM do Rio Grande do Sul (10/28/2018)
- Benford’s law emerges from deep language model (10/18/2018)
- PyTorch 1.0 tracing JIT and LibTorch C++ API to integrate PyTorch into NodeJS (10/2/2018)
- Concentration inequalities – Part I (8/23/2018)
- NLP word representations and the Wittgenstein philosophy of language (5/23/2018)
- PyTorch – Internal Architecture Tour (3/12/2018)
- Privacy-preserving sentence semantic similarity using InferSent embeddings and secure two-party computation (1/22/2018)
- New prime on the block (1/4/2018)
- The effective receptive field on CNNs (11/12/2017)
- Benford’s law – Index (11/6/2017)
- The same old historicism, now on AI (7/30/2017)
- Introduction to Word Embeddings (2/8/2017)
- Nanopipe: connecting the modern babel (10/17/2016)
- JIT native code generation for TensorFlow computation graphs using Python and LLVM (8/22/2016)
- Convolutional Neural Networks – Architectural Zoo (6/2/2016)
- Voynich Manuscript: word vectors and t-SNE visualization of some patterns (1/16/2016)
- Convolutional hypercolumns in Python (1/11/2016)
- Convolutional Neural Networks Presentation (10/15/2015)
- Deep learning – Convolutional neural networks and feature extraction with Python (8/19/2015)
- Google’s S2, geometry on the sphere, cells and Hilbert curve (8/14/2015)
- Luigi’s Codex Seraphinianus deep learning dreams (8/11/2015)
- Real time Drone object tracking using Python and OpenCV (1/26/2015)
- Arduino and OLED display to monitor Redis for fun and profit (1/14/2015)
- Recebendo dados de baloes meteorologicos da Aeronautica (7/28/2014)
- Simple and effective coin segmentation using Python and OpenCV (6/22/2014)
- Despesas de Custeio e Lei de Benford (6/9/2014)
- Universality, primes and space communication (1/9/2014)
- The beauty of Bitcoin P2P network (12/13/2013)
- Protocoin – a pure Python Bitcoin protocol implementation (11/22/2013)
- Mapa de calor dos dados de acidentes de transito do DataPoa (11/16/2013)
- Book Suggestion: Codex Seraphinianus (11/5/2013)
- Machine Learning :: Cosine Similarity for Vector Space Models (Part III) (9/12/2013)
- Rastreamento em tempo real de avioes em Porto Alegre utilizando Raspberry Pi + Radio UHF (SDR RTL2832U) (2/5/2013)
- Raspberry Pi & Arduino: a laser pointer communication and a LDR voltage sigmoid (8/19/2012)
- Genetic Programming and a LLVM JIT for restricted Python AST expressions (8/15/2012)
- Raspberry Pi no Brasil (8/2/2012)
- Review Board – Review Requests plotting using Python (7/7/2012)
- Accessing HP Cloud OpenStack Nova using Python and Requests (2/2/2012)
- Announce: Stallion v0.2 released ! (12/15/2011)
- Announce: ‘Stallion’ – Python Package Manager (12/4/2011)
- Hacking into Python objects internals (11/23/2011)
- C++11 user-defined literals and some constructions (11/21/2011)
- Machine Learning :: Text feature extraction (tf-idf) – Part II (10/3/2011)
- Machine Learning :: Text feature extraction (tf-idf) – Part I (9/18/2011)
- The Interactive Robotic Painting Machine ! (8/17/2011)
- Pyevolve – new repository at Github (8/10/2011)
- News: Algorithm searches for models that best explain experimental data (8/2/2011)
- C++0x :: Introduction to some amazing features (7/28/2011)
- The bad good news for optimization (7/16/2011)
- Hiatus (5/20/2011)
- PyCon 2011: Genetic Programming in Python (3/14/2011)
- Using pyearthquake to plot Japan USGS earthquake data into the near real-time MODIS satellite imagery (3/12/2011)
- Invite: PyCon US 2011 – Genetic Programming in Python (3/9/2011)
- Health and Genetic Algorithms (2/4/2011)
- Google Analytics Visualization (2/2/2011)
- Darwin on the track (11/19/2010)
- Capturing and comparing writers profiles through Markov chains (9/28/2010)
- PyOhio 2010: Genetic Programming in Python (8/31/2010)
- New SIGEVOlution Volume 5 Issue 1 (5/27/2010)
- The future can be written in RPython now (5/24/2010)
- Pyevolve 0.6rc1 released ! (4/25/2010)
- Riemann Zeta function visualizations with Python (2/20/2010)
- New issue of SIGEVOlution (Volume 4 Issue 3) (2/17/2010)
- News: Using genetic algorithms to optimise current and future health planning (1/28/2010)
- New issue of SIGEVOlution, Volume 4 Issue 2 (1/18/2010)
- Exploring real-time Haiti USGS Earthquake data with near real-time MODIS Aqua+Terra satellite imagery using Python (1/16/2010)
- Pyevolve in action, solving a 100 cities TSP problem (12/17/2009)
- A method for JIT’ing algorithms and data structures with LLVM (11/22/2009)
- Pyevolve on SIGEVOlution (11/9/2009)
- Pyevolve benchmark on different Python flavors (10/19/2009)
- Successful pyevolve multiprocessing speedup for Genetic Programming (10/11/2009)
- Meanwhile, at the Hall of Justice! (10/5/2009)
- Beautiful Django (10/1/2009)
- On the irreversibility of evolution (9/27/2009)
- New SIGEvolution issue – Volume 3 Issue 4! (9/22/2009)
- n-queens problem using Pyevolve (9/19/2009)
- Word is smart, but OpenOffice is wise (9/18/2009)
- Send Subversion commit messages to Twitter (8/12/2009)
- An analysis of Benford’s law applied to Twitter (8/11/2009)
- Announce: python-gstringc 1.0 (8/8/2009)
- Pyevolve: announce of the user group (8/4/2009)
- GLib GString wrapper for Python (8/4/2009)
- Jinja2 in a Java JSP (7/29/2009)
- Approximating Pi number using Genetic Programming (7/22/2009)
- Genetic Programming and Flex layouts (6/29/2009)
- FISL 10 – “Forum Internacional de Software Livre” (6/25/2009)
- Benford’s Law and the Iran’s election (6/17/2009)
- Genetic Programming meets Python (6/8/2009)
- The Darwin’s cake experiment (6/7/2009)
- Prime Numbers and the Benford’s Law (5/9/2009)
- Evolving autopilots could boost space slingshots (5/7/2009)
- Pyevolve: Sourceforge.net Nomination (5/6/2009)
- ‘Evolutionary Algorithms’ Mimic Natural Evolution In Silico And Lead To Innovative Solutions For Complex Problems (5/6/2009)
- Pyevolve: Genetic Algorithms + Interactive Mode (4/27/2009)
- Delicious.com, checking user numbers against Benford’s Law (4/24/2009)
- Benford’s Law meets Python and Apple Stock Prices (4/23/2009)
- More Twitter 3D videos (4/23/2009)
- Digital Archeology Reveals Dinosaur Details using Genetic Algorithms (4/15/2009)
- Evolutionary Computation (4/14/2009)
- Computação Evolutiva (4/14/2009)
- Computer Program Self-Discovers Laws of Physics (4/5/2009)
- Pyevolve: LaTeX documentation (3/31/2009)
- TSP on Nokia N73 using PyS60 and Pyevolve (3/25/2009)
- Good news for Python community (3/25/2009)
- Genetic Algorithms on cellphones ! Pyevolve on Nokia N73 (Symbian + PyS60) (3/21/2009)
- Happy Birthday Genetic Argonaut ! (3/20/2009)
- Travelling Salesman on Sony PSP using Stackless Python + Pyevolve (3/10/2009)
- Google Search Insight for GA and GP (3/9/2009)
- Rosenbrock’s banana function and friends (3/6/2009)
- Considerations regarding Karl Popper’s philosophy of science and Evolutionary Algorithms (3/4/2009)
- Pyevolve on Sony PSP ! (genetic algorithms on PSP) (3/3/2009)
- Pyevolve: the effect of Linear Scaling (3/2/2009)
- Python: 3D real-time debugging and function call structure (2/25/2009)
- Twitter in 3D ! (2/22/2009)
- Genetic Algorithms help us to understand how we see (2/16/2009)
- Good news: IronPython and Jython compatibility ! (2/16/2009)
- Pyevolve profiling dot graph (2/14/2009)
- Considerações sobre a filosofia da ciência de Karl Popper e Algoritmos Evolutivos (2/13/2009)
- The Pyevolve Interactive Mode on Mac OS (2/12/2009)
- Happy Darwin day ! (2/12/2009)
- Python: acessing near real-time MODIS images and fire data from NASA’s Aqua and Terra satellites (2/11/2009)
- Considerações sobre o determinismo e livre-arbítrio para uma futura e distante ciência (2/9/2009)
- Pyevolve: work on the distributed Genetic Algorithm support (2/7/2009)
- A beautiful example of cooperation between theory and experiment, the new Boron form (2/7/2009)
- The new Pyevolve blog site (2/7/2009)

]]>

Those aforementioned issues make it very confusing for newcomers to understand these concepts, and I’m often confronted by people who were unfortunately misled by many tutorials. For that reason, I decided to write a sane introduction to these concepts and elaborate more on their relationships and hidden interactions while trying to explain every step of formulations. I hope to bring something new to help people understand these principles.

The maximum likelihood estimation is a method or principle used to estimate the parameter or parameters of a model given observation or observations. Maximum likelihood estimation is also abbreviated as MLE, and it is also known as the method of maximum likelihood. From this name, you probably already understood that this principle works by maximizing the likelihood, therefore, the key to understand the maximum likelihood estimation is to first understand what is a likelihood and why someone would want to maximize it in order to estimate model parameters.

Let’s start with the definition of the likelihood function for continuous case:

$$\mathcal{L}(\theta | x) = p_{\theta}(x)$$

The left term means “the likelihood of the parameters \(\theta\), given data \(x\)”. Now, what does that mean ? It means that in the continuous case, the likelihood of the model \(p_{\theta}(x)\) with the parametrization \(\theta\) and data \(x\) is the probability density function (pdf) of the model with that particular parametrization.

Although this is the most used likelihood representation, you should pay attention that the notation \(\mathcal{L}(\cdot | \cdot)\) in this case doesn’t mean the same as the conditional notation, so be careful with this overload, because it is always implicitly stated and it is also often a source of confusion. Another representation of the likelihood that is often used is \(\mathcal{L}(x; \theta)\), which is better in the sense that it makes it clear that it’s not a conditional, however, it makes it look like the likelihood is a function of the data and not of the parameters.

The model \(p_{\theta}(x)\) can be any distribution, and to make things concrete, let’s say that we are assuming that the data generating distribution is an univariate Gaussian distribution, which we define below:

$$

\begin{align}

p(x) & \sim \mathcal{N}(\mu, \sigma^2) \\

p(x; \mu, \sigma^2) & \sim \frac{1}{\sqrt{2\pi\sigma^2}} \exp{ \bigg[-\frac{1}{2}\bigg( \frac{x-\mu}{\sigma}\bigg)^2 \bigg] }

\end{align}

$$

If you plot this probability density function with different parametrizations, you’ll get something like the plots below, where the red distribution is the standard Gaussian \(p(x) \sim \mathcal{N}(0, 1.0)\):

As you can see in the probability density function (pdf) plot above, the likelihood of \(x\) at variously given realizations are showed in the y-axis. Another source of confusion here is that usually, people take this as a probability, because they usually see these plots of normals and the likelihood is always below 1, however, the probability density function doesn’t give you probabilities but densities. The constraint on the pdf is that it must integrate to one:

$$\int_{-\infty}^{+\infty} f(x)dx = 1$$

So, it is completely normal to have densities larger than 1 in many points for many different distributions. Take for example the pdf for the Beta distribution below:

As you can see, the pdf shows densities above one in many parametrizations of the distribution, while still integrating into 1 and following the second axiom of probability: the unit measure.

So, returning to our original principle of maximum likelihood estimation, what we want is to maximize the likelihood \(\mathcal{L}(\theta | x)\) for our observed data. What this means in practical terms is that we want to find the parameters \(\theta\) of our model where the likelihood that this model generated our data is maximized, we want to find which **parameters of this model are most plausible** to have generated this observed data, or what are the parameters that make this sample most probable ?

For the case of our univariate Gaussian model, what we want is to find the parameters \(\mu\) and \(\sigma^2\), which for convenient notation we collapse into a single parameter vector:

$$\theta = \begin{bmatrix}\mu \\ \sigma^2\end{bmatrix}$$

Because these are the statistics that completely define our univariate Gaussian model. So let’s formulate the problem of the maximum likelihood estimation:

$$

\begin{align}

\hat{\theta} &= \mathrm{arg}\max_\theta \mathcal{L}(\theta | x) \\

&= \mathrm{arg}\max_\theta p_{\theta}(x)

\end{align}

$$

This says that we want to obtain the maximum likelihood estimate \(\hat{\theta}\) that approximates \(p_{\theta}(x)\) to a underlying “true” distribution \(p_{\theta^*}(x)\) by maximizing the likelihood of the parameters \(\theta\) given data \(x\). You shouldn’t confuse a maximum likelihood estimate \(\hat{\theta}(x)\) which is a realization of the maximum likelihood estimator for the data \(x\), with the maximum likelihood estimator \(\hat{\theta}\), so pay attention to disambiguate it in your head.

However, we need to incorporate multiple observations in this formulation, and by adding multiple observations we end up with a complex joint distribution:

$$\hat{\theta} = \mathrm{arg}\max_\theta p_{\theta}(x_1, x_2, \ldots, x_n)$$

That needs to take into account the interactions between all observations. And here is where we make a strong assumption: we state that the **observations are independent**. Independent random variables mean that the following holds:

$$p_{\theta}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p_{\theta}(x_i)$$

Which means that since \(x_1, x_2, \ldots, x_n\) don’t contain information about each other, we can write the joint probability as a product of their marginals.

Another assumption that is made, is that these random variables are **identically distributed**, which means that they came from the same generating distribution, which allows us to model it with the same distribution parametrization.

Given these two assumptions, which are also known as IID (independently and identically distributed), we can formulate our maximum likelihood estimation problem as:

$$\hat{\theta} = \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i)$$

Note that MLE doesn’t require you to make these assumptions, however, many problems will appear if you don’t to it, such as different distributions for each sample or having to deal with joint probabilities.

Given that in many cases these densities that we multiply can be very small, multiplying one by the other in the product that we have above we can end up with very small values. Here is where the logarithm function makes its way to the likelihood. The log function is a strictly monotonically increasing function, that preserves the location of the *extrema* and has a very nice property:

$$\log ab = \log a + \log b $$

Where the logarithm of a product is the sum of the logarithms, which is very convenient for us, so we’ll apply the logarithm to the likelihood to maximize what is called the log-likelihood:

$$

\begin{align}

\hat{\theta} &= \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i) \\

&= \mathrm{arg}\max_\theta \sum_{i=1}^{n} \log p_{\theta}(x_i) \\

\end{align}

$$

As you can see, we went from a product to a summation, which is much more convenient. Another reason for the application of the logarithm is that we often take the derivative and solve it for the parameters, therefore is much easier to work with a summation than a multiplication.

We can also conveniently average the log-likelihood (*given that we’re just including a multiplication by a constant*):

$$

\begin{align}

\hat{\theta} &= \mathrm{arg}\max_\theta \sum_{i=1}^{n} \log p_{\theta}(x_i) \\

&= \mathrm{arg}\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i) \\

\end{align}

$$

This is also convenient because it will take out the dependency on the number of observations. We also know, that through the law of large numbers, the following holds as \(n\to\infty\):

$$

\frac{1}{n} \sum_{i=1}^{n} \log \, p_{\theta}(x_i) \approx \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]

$$

As you can see, we’re approximating the expectation with the **empirical expectation** defined by our dataset \(\{x_i\}_{i=1}^{n}\). This is an important point and it is usually implictly assumed.

The weak law of large numbers can be bounded using a Chebyshev bound, and if you are interested in concentration inequalities, I’ve made an article about them here where I discuss the Chebyshev bound.

To finish our formulation, given that we usually minimize objectives, we can formulate the same maximum likelihood estimation as the minimization of the negative of the log-likelihood:

$$

\hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]

$$

Which is exactly the same thing with just the negation turn the maximization problem into a minimization problem.

It is well-known that maximizing the likelihood is the same as minimizing the Kullback-Leibler divergence, also known as the KL divergence. Which is very interesting because it links a measure from **information theory** with the maximum likelihood principle.

The KL divergence is defined as:

$$

\begin{equation}

D_{KL}( p || q)=\int p(x)\log\frac{p(x)}{q(x)} \ dx

\end{equation}

$$

There are many intuitions to understand the KL divergence, I personally like the perspective on the likelihood ratios, however, there are plenty of materials about it that you can easily find and it’s out of the scope of this introduction.

The KL divergence is basically the expectation of the log-likelihood ratio under the \(p(x)\) distribution. What we’re doing below is just rephrasing it by using some identities and properties of the expectation:

$$

\begin{align}

D_{KL}[p_{\theta^*}(x) \, \Vert \, p_\theta(x)] &= \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \frac{p_{\theta^*}(x)}{p_\theta(x)} \right] \\

\label{eq:logquotient}

&= \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \,p_{\theta^*}(x) – \log \, p_\theta(x) \right] \\

\label{eq:linearization}

&= \mathbb{E}_{x \sim p_{\theta^*}(x)} \underbrace{\left[\log \, p_{\theta^*}(x) \right]}_{\text{Entropy of } p_{\theta^*}(x)} – \underbrace{\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]}_{\text{Negative of log-likelihood}}

\end{align}

$$

In the formulation above, we’re first using the fact that the logarithm of a quotient is equal to the difference of the logs of the numerator and denominator (equation \(\ref{eq:logquotient}\)). After that we use the linearization of the expectation(equation \(\ref{eq:linearization}\)), which tells us that \(\mathbb{E}\left[X + Y\right] = \mathbb{E}\left[X\right]+\mathbb{E}\left[Y\right]\). In the end, we are left with two terms, the first one in the left is the **entropy** and the one in the right you can recognize as the **negative of the log-likelihood** that we saw earlier.

If we want to minimize the KL divergence for the \(\theta\), we can ignore the first term, since it doesn’t depend of \(\theta\) in any way, and in the end we have exactly the same maximum likelihood formulation that we saw before:

$$

\begin{eqnarray}

\require{cancel}

\theta^* &=& \mathrm{arg}\min_\theta \cancel{\mathbb{E}_{x \sim p_{\theta^*}(x)} \left[\log \, p_{\theta^*}(x) \right]} – \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]\\

&=& \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]

\end{eqnarray}

$$

A very common scenario in Machine Learning is supervised learning, where we have data points \(x_n\) and their labels \(y_n\) building up our dataset \( D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \} \), where we’re interested in estimating the conditional probability of \(\textbf{y}\) given \(\textbf{x}\), or more precisely \( P_{\theta}(Y | X) \).

To extend the maximum likelihood principle to the conditional case, we just have to write it as:

$$

\hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(y | x)}\left[\log \, p_{\theta}(y | x) \right]

$$

And then it can be easily generalized to formulate the linear regression:

$$

p_{\theta}(y | x) \sim \mathcal{N}(x^T \theta, \sigma^2) \\

p_{\theta}(y | x) = -n \log \sigma – \frac{n}{2} \log{2\pi} – \sum_{i=1}^{n}{\frac{\| x_i^T \theta – y_i \|}{2\sigma^2}}

$$

In that case, you can see that we end up with a sum of squared errors that will have the same location of the optimum of the mean squared error (MSE). So you can see that minimizing the MSE is equivalent of maximizing the likelihood for a Gaussian model.

The maximum likelihood estimation has very interesting properties but it gives us only **point estimates**, and this means that we cannot reason on the distribution of these estimates. In contrast, Bayesian inference can give us a full distribution over parameters, and therefore will allow us to **reason about the posterior distribution**.

I’ll write more about Bayesian inference and sampling methods such as the ones from the Markov Chain Monte Carlo (MCMC) family, but I’ll leave this for another article, right now I’ll continue showing the relationship of the maximum likelihood estimator with the maximum a posteriori (MAP) estimator.

Although the maximum a posteriori, also known as MAP, also provides us with a point estimate, it is a Bayesian concept that incorporates a prior over the parameters. We’ll also see that the MAP has a strong connection with the regularized MLE estimation.

We know from the Bayes rule that we can get the posterior from the product of the likelihood and the prior, normalized by the evidence:

$$

\begin{align}

p(\theta \vert x) &= \frac{p_{\theta}(x) p(\theta)}{p(x)} \\

\label{eq:proport}

&\propto p_{\theta}(x) p(\theta)

\end{align}

$$

In the equation \(\ref{eq:proport}\), since we’re worried about optimization, we cancel the normalizing evidence \(p(x)\) and stay with a proportional posterior, which is very convenient because the marginalization of \(p(x)\) involves integration and is intractable for many cases.

$$

\begin{align}

\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} p_{\theta}(x) p(\theta) \\

&= \mathop{\rm arg\,max}\limits_{\theta} \prod_{i=1}^{n} p_{\theta}(x_i) p(\theta) \\

&= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i=1}^{n} \underbrace{\log p_{\theta}(x_i)}_{\text{Log likelihood}} \underbrace{p(\theta)}_{\text{Prior}}

\end{align}

$$

In this formulation above, we just followed the same steps as described earlier for the maximum likelihood estimator, we assume independence and an identical distributional setting, followed later by the logarithm application to switch from a product to a summation. As you can see in the final formulation, this is equivalent as the maximum likelihood estimation multiplied by the prior term.

We can also easily recover the exact maximum likelihood estimator by using a uniform prior \(p(\theta) \sim \textbf{U}(\cdot, \cdot)\). This means that every possible value of \(\theta\) will be equally weighted, meaning that it’s just a constant multiplication:

$$

\begin{align}

\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i) p(\theta) \\

&= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i) \, \text{constant} \\

&= \underbrace{\mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i)}_{\text{Equivalent to maximum likelihood estimation (MLE)}} \\

\end{align}

$$

And there you are, the MAP with a uniform prior is equivalent to MLE. It is also easy to show that a Gaussian prior can recover the L2 regularized MLE. Which is quite interesting, given that it can provide insights and a new perspective on the regularization terms that we usually use.

I hope you liked this article ! The next one will be about Bayesian inference with posterior sampling, where we’ll show how we can reason about the posterior distribution and not only on point estimates as seen in MAP and MLE.

*– Christian S. Perone*

Past week I released the first public version of EuclidesDB. EuclidesDB is a multi-model machine learning feature database that is tightly coupled with PyTorch and provides a backend for including and querying data on the model feature space.

For more information, see the GitHub repository or the documentation.

Some features of EuclidesDB are listed below:

- Written in C++ for performance;
- Uses protobuf for data serialization;
- Uses gRPC for communication;
- LevelDB integration for database serialization;
- Many indexing methods implemented (Annoy, Faiss, etc);
- Tight PyTorch integration through libtorch;
- Easy integration for new custom fine-tuned models;
- Easy client language binding generation;
- Free and open-source with permissive license;

And here is a diagram of the overall architecture:

]]>This last presidential election in Brazil was heavily marked by *huge amounts* of money being funneled to digital agencies and all kinds of targeting businesses that used Twitter, WhatsApp, and even SMS messages to propagate their content using their targeting strategies. Even before the elections, Cambridge Analytica was recorded mentioning their involvement in Brazil.

What makes Brazil so vulnerable for these micro-targeting companies, in my opinion, is the widespread ingenuity regarding digital platforms. An example of this ingenuity was the wide-spreading of applications that were allegedly developed to monitor politicians and give information about them, to help you decide your vote, bookmark politicians, etc. But in reality, it was more than clear that these applications were just capturing data (such as geolocation, personal opinions, demographics, etc) about their users with the intention to sell it later or use themselves for targeting. I even saw journalists and some very well-known people supporting these applications. Simply put, most of the time, when you don’t pay for a product (or application), you’re the product.

One very interesting work is the experiment done by Wu Youyou in 2014 where he showed that a simple regularized linear model was better or equal in accuracy to identify some personality traits using Facebook likes, this study used more than 80k participants data:

This graph above shows that with 70 likes from your Facebook, the linear model was more accurate than the evaluation of a friend of you and with more than 150 likes it can reach the accuracy of the evaluation of your family. Now you can understand why social data is so important for these companies to identify personality traits and content that you’re most susceptible.

In this year, one of the candidates didn’t participate much on the debates before the second round and used mostly digital platforms to reach voters, so Twitter became a very important medium that all candidates explored in some way. The idea of this post is to use a discrete event visualization technique called **Time Maps** which was extended to Twitter visualizations by Max C. Watson in his work “*Time Maps: A Tool for Visualizing Many Discrete Events Across Multiple Timescales*” (paper available here). It is unfortunate that not a lot of people use these visualizations because they are *very interesting* for the visualization of activity patterns in *multiple time-scales* on a single plot.

The main idea behind time maps is that you can visualize the time after and before the events for the entire discrete time events. This can be easily understood by looking at the visual explanations done by Max C. Watson.

As you can see in the right, the plot is pretty straightforward, it might take some time for you to realize what is the meaning of the *x* and *y*-axes, but once you grasp the concept, you’ll see that they are quite easy to interpret and how many patterns it can reveal on a single plot.

Time maps were an adaptation from the chaotic field where they were initially developed to study the timing of water drops.

One way to easily understand it is to look at these two series below and their respective time maps:

But before plotting the time maps, let’s explore some basic visualizations from the two candidates who got into the second round of the general elections last week.

I’ll focus only on the two candidates who got into the second round of the elections, their names are Jair Bolsonaro (president elected) and Fernando Haddad (not elected). These first plots will show the number of tweets per day during the year of 2018 and with some red marks indicating the first and second rounds of the elections:

In these plots, we can see that Jair Bolsonaro was more active before the general elections and that for both candidates, the aggregated number of tweets per day always peaked before each election round, with Jair Bolsonaro peaks happening a little earlier than Fernando Haddad. I also marked with a black vertical line the day that Jair Bolsonaro was stabbed in the streets of Brazil, you can see a clear drop of activity with a slow recovery after it.

Let’s now see the time of day profile for each candidate to check the hours of the day that the candidates were quieter and more active:

These profiles tell us very interesting information, that the candidates were most active between 3pm and 4pm, but for Jair Bolsonaro, it seems that the 3pm time of the day is really the time when he was most active by a significant margin. What is really interesting is that there is no tweet whatsoever between 6am and 8am for Fernando Haddad.

Let’s look now the distribution differences between 2017 and 2018 for each candidate:

As we can see from these plots, Jair Bolsonaro was as active in 2017 as in 2018, while Fernando Haddad was not so much active in 2017 with a huge bump in a number of tweets in the year of 2018 (election year). Something that is interesting, is that the pattern from Jair Bolsonaro to tweet more at 1pm shifted to 3pm in 2018, while for Haddad it changed also from 1pm to 2pm. It can be hypothesized that before they were less involved and used to tweet after lunch, but during election year this routine changed (*assuming that it’s not their staff who is managing the account for them*), so there is not only more tweets but also a distributional shift in the hour of the day.

These are the time maps for Jair Bolsonaro. The first is the time map colored by the hour of the day and the second time map is a heat map to see the density of points in the time map.

And these are the time maps for Fernando Haddad:

Now, these are very interesting time maps. You can clearly see in the Jair Bolsonaro time map that there are two stripes: vertical on the left and horizontal on the top that shows the first and last tweets of the day respectively. It’s a slow but steady activity of tweeting, with a concentration on the heat map on the 1-day bands. In Fernando Haddad, you can see that the stripes are still visible but much less concentrated. There are also two main blobs in the heat map of Fernando Haddad, one in the bottom left showing fast tweets that are probably from a specific event and then the blob on the top right showing the usual activity.

If you are interested in understanding more about these plots, please take a look on Max Watson blog article where he explains some interesting cases such as the tweets from the White House account.

If you are curious about how Twitter bots appear on time maps, here is an example where I plot the tweets from the CozinhaBot, that keeps posting some random recipes on twitter:

As you can see, the pattern is very regular, in the heat map we can see the huge density spot before the 2 hr ticks, which means that this bot has a very well known and regular pattern, as opposed to the human-produced patterns we saw before. These plots don’t have a small amount of dots because it has fewer tweets, but because they follow a *very regular interval*, this plot contains nearly the same amount of tweets we saw from the previous examples of the presidential candidates. This is very interesting because not only can be used to spot twitter bots but also to identify which tweets were posted out of the bot pattern.

I hope you liked !

*– Christian S. Perone*

Este tutorial apresenta uma análise breve dos microdados do ENEM do Rio Grande do Sul do ano de 2017. O principal objetivo é entender os fatores que impactam na performance dos participantes do ENEM dado fatores como renda familiar e tipo de escola. Neste tutorial são apresentados dois modelos: regressão linear e regressão linear hierárquica.

View the code on Gist.

]]>