Theory and Practice

Weak but Functional Pigeons

2018-04-30T13:52:00.000+01:00

A bird's eye view of [Razborov, Resolution Lower Bounds for the Weak Functional Pigeonhole Principle, 2003].

Resolution refutations are interesting, among other reasons, because they correspond to how SAT solvers work. A (general) resolution refutation $P$ is a sequence $C_1,\ldots,C_L$ of clauses such that (a) each clause is either an axiom or the resolution of two previous clauses, and (b) the last clause is empty. As usual, a clause is a set of literals, which are possibly negated variables. The length $L(P)$ of the refutation $P$ is its number of clauses. For practical SAT solving algorithms, it is true that if the solver says in time $T$ that the axioms are UNSAT then there exists a resolution refutation $P$ of length $L(P) \le O(T)$. Conversely, if there exists a resolution refutation $P$ of length $L(P)$, then you may get lucky and obtain an UNSAT answer from your solver in time $T\le O(L(P))$.

The resolution rule is the following: $$ \frac{A\cup\{l\}\qquad \{\bar{l}\}\cup B}{A \cup B} $$ We erase a literal $l$ and its negation $\bar{l}$, and keep the other literals.
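To make the rule concrete, here is a minimal Python sketch. The encoding is my assumption, not something fixed by the text: a literal is a nonzero integer (`k` for a variable, `-k` for its negation, as in DIMACS files), a clause is a `frozenset` of literals, and `resolve` is a hypothetical helper name.

```python
from typing import FrozenSet

Clause = FrozenSet[int]  # literal k means x_k, literal -k means its negation


def resolve(c1: Clause, c2: Clause, lit: int) -> Clause:
    """Resolve c1 (which contains lit) with c2 (which contains -lit)."""
    if lit not in c1 or -lit not in c2:
        raise ValueError("c1 must contain lit and c2 must contain its negation")
    # Erase the literal and its negation, keep everything else.
    return (c1 - {lit}) | (c2 - {-lit})


# A short refutation of the axioms {x1}, {-x1, x2}, {-x2}.
axioms = [frozenset({1}), frozenset({-1, 2}), frozenset({-2})]
step1 = resolve(axioms[0], axioms[1], 1)  # derives {x2}
empty = resolve(step1, axioms[2], 2)      # derives the empty clause
```

The refutation here is the sequence of the three axioms followed by `step1` and `empty`, so its length is 5.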

The first lower bound on $L(P)$ was given by Haken, for resolution refutations $P$ that express the pigeonhole principle: for $m+1$ pigeons and $m$ holes, it is impossible that each pigeon goes in at least one hole and each hole gets at most one pigeon. The proof is allegedly complicated. (I should probably read it.) Some simplification of this and other lower bounds came with [Ben-Sasson, Wigderson, Short proofs are narrow, JACM 2001], which shows that $$ w(P') \le O(\sqrt{n \lg L(P)}) \tag{BW} $$ where $w(P')$ is the maximum size of a clause in $P'$ (the width) and $n$ is the number of variables. The idea of the proof is as follows: you give me some refutation $P$ and I'll give you back a refutation $P'$ of width $O(\sqrt{n \lg L(P)})$ by iterating the following process:

find the literal $l$ that occurs most often in clauses longer than the axioms
reorganize the proof to get rid of the said occurrences:
1. derive $\bar{l}$
2. by resolving with $\bar{l}$, erase $l$ from axioms
3. replay old proof but with $l$ removed
(these three chunks are concatenated to give a new, thinner refutation)

From $w \le O(\sqrt{n \lg L})$ and $w \ge \Omega(f(n))$, it follows that $L \ge \Omega\bigl(\exp(\alpha \frac{f(n)^2}{n})\bigr)$ for some $\alpha \gt 0$. So, lower bounds on width $w \gt \omega(\sqrt{n})$ imply exponential lower bounds on length $L \ge \Omega\bigl( \exp(\alpha n^\beta)\bigr)$ for some $\alpha,\beta \gt 0$. To finish off the lower bound for pigeons, Ben-Sasson and Wigderson show that indeed $w \ge \Omega(n)$ using some expander stuff that I'm skipping here. On the other hand, if $w \le O(\sqrt{n})$, then the result $w \le O(\sqrt{n \lg L})$ is of no help in deriving lower bounds on $L$.

Now let's get to the subject of this post, the weak and functional pigeonhole principle. Weak means that instead of having $m+1$ pigeons we have $p \gt m$ pigeons, where $m$ is the number of holes. (One may think that if there are lots and lots of pigeons, then you might be able to notice quicker that you can't find homes for all. So, potentially, the weak principle might have shorter proofs.) Functional means that you put each pigeon not in at least one hole but in exactly one hole. (Again, the extra no-pigeon-cloning axioms might make it possible to find shorter refutations.) It turns out that in this case there are indeed refutations of width $O(\sqrt{n})$, so the (BW) lemma doesn't help. That's where Razborov comes in.
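For readers who want to feed these principles to a SAT solver, here is a sketch of how the axioms could be generated. The variable numbering and the function names `var` and `functional_php` are my own choices for illustration; the clauses themselves are just the three kinds of axioms described above.

```python
from itertools import combinations


def var(i, j, holes):
    """DIMACS-style number of x_ij (pigeon i in hole j); i and j are 0-based."""
    return i * holes + j + 1


def functional_php(pigeons, holes):
    """Clauses of the functional pigeonhole principle, as lists of literals."""
    clauses = []
    for i in range(pigeons):  # pigeon i goes in at least one hole
        clauses.append([var(i, j, holes) for j in range(holes)])
    for j in range(holes):  # hole j gets at most one pigeon
        for i1, i2 in combinations(range(pigeons), 2):
            clauses.append([-var(i1, j, holes), -var(i2, j, holes)])
    for i in range(pigeons):  # functional part: pigeon i goes in at most one hole
        for j1, j2 in combinations(range(holes), 2):
            clauses.append([-var(i, j1, holes), -var(i, j2, holes)])
    return clauses
```

Calling `functional_php(p, m)` with `p` much larger than `m` gives the weak functional version; dropping the last loop gives the non-functional one.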

The general plan is quite similar, except we replace the width by a pseudowidth:

1. upper bound the pseudowidth in terms of the length $L$, and
2. lower bound the pseudowidth

From 1 and 2, we get a lower bound for the length $L$.

So, what's the pseudowidth? Roughly, it is the number of pigeons that are given many options. To say that more precisely, we need some notation. Setting variable $x_{ij}$ means that pigeon $i$ goes in hole $j$. For the functional principle, it turns out that we can use only positive literals: essentially, we systematically replace $\bar{x}_{ij}$ by $(x_{i1}\lor\ldots\lor x_{i(j-1)})\lor(x_{i(j+1)}\lor\ldots\lor x_{im})$. Then, we can count how many options pigeon $i$ is given by clause $C$: $$ d_i(C) \mathrel{:=} |C \cap \{\,x_{ij} : j\in[m]\,\}| $$ We say that a pigeon has many options if this number is over some fixed threshold which depends on the pigeon. Let $\mathbf{d}=(d_1,\ldots,d_p)$ be a vector with such thresholds, and let $w_{\mathbf{d}}(C)$ be the number of pigeons that have many options in clause $C$, according to thresholds $\mathbf{d}$. The pseudowidth of a refutation $P$ is then $w_{\mathbf{d}}(P) \mathrel{:=} \max_{C\in P} w_{\mathbf{d}}(C)$. Razborov shows that one can always find thresholds $\mathbf{d}$ such that $$ w_{\mathbf{d}}(P') \le O(w_0 + \lg L(P)) $$ More precisely, if we give Razborov a refutation $P$, he can reply with another refutation $P'$ and some thresholds $\mathbf{d}$ such that the above holds. The catch is that he might introduce a few extra axioms. In fact, $P'$ is obtained from $P$ by replacing some clauses with $(w_0,\mathbf{d})$-axioms, so $L(P')=L(P)$. A $(w_0,\mathbf{d})$-axiom mentions exactly $w_0$ pigeons, and each pigeon $i$ that is mentioned is given exactly $d_i$ options. Step 2 of the proof, which lower-bounds the pseudowidth, needs to somehow deal with these extra axioms.
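The definitions of $d_i(C)$ and $w_{\mathbf{d}}(C)$ can be sketched in a few lines of Python. Two assumptions are mine: clauses are sets of pairs `(i, j)` standing for the positive literal $x_{ij}$ with 0-based pigeons, and 'over the threshold' is read as $d_i(C) \ge d_i$ (the paper's exact convention may differ).

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Var = Tuple[int, int]  # (i, j) stands for x_ij: pigeon i goes in hole j


def options(clause: Iterable[Var]) -> Counter:
    """d_i(C): how many holes the clause offers to each pigeon i."""
    return Counter(i for (i, _j) in set(clause))


def pseudowidth(clause: Iterable[Var], d: Sequence[int]) -> int:
    """w_d(C): number of pigeons i whose option count reaches d[i] (assumed >=)."""
    return sum(1 for i, k in options(clause).items() if k >= d[i])


clause = {(0, 0), (0, 1), (1, 0)}   # pigeon 0 has two options, pigeon 1 has one
width = pseudowidth(clause, d=(2, 2))
```

The pseudowidth of a whole refutation would then be the maximum of `pseudowidth(C, d)` over its clauses.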

The main idea of the second part of the proof is to map clauses to certain vector spaces, and look at how the dimension of those spaces grows. Let $V(C)$ be the vector space associated to clause $C$. As long as the pseudowidth is below some threshold $O\bigl(\frac{m}{\lg p}\bigr)$, it turns out that new dimensions are seldom added. More precisely, if a resolution step produces $C$ out of $C_0$ and $C_1$, then $V(C) \;\subseteq\; V(C_0) + V(C_1)$. So, $$\dim V(C) \le \sum_{C'}\dim V(C')$$ where $C'$ ranges over the axioms used in deriving $C$. For the original pigeonhole axioms, the dimension is 0. For the extra axioms introduced in step 1 of the proof, the dimension is some small number. For the empty clause, the dimension is some big number. So, either we reach the big number because we have many extra axioms, which means the refutation is long, or the pseudowidth is $\ge\Omega\bigl(\frac{m}{\lg p}\bigr)$. In the latter case, we have the lower bound on pseudowidth that we wanted to obtain in step 2.

From $\frac{m}{\lg p} \le w_0 + \lg L$ it looks like $L\ge \exp \Omega(\frac{m}{\lg p})$. The lower bound is in fact only slightly worse: $$ L \ge \exp \Omega \biggl( \frac{m}{(\lg p)^2}\biggr) $$ The reason is the first case, in which many extra axioms are introduced. To see why, you'd have to look in the paper to see what exactly I meant by ‘small number’ and ‘big number’ in the previous paragraph.

Nitpick. I stop here because the post is already long. Go read the paper: it has some great ideas! (Although, it does have quite a few small/silly mistakes as well. Somewhat unusually for me, I did not find them confusing.)

Edit 20180430: Fixed the dimension claim, by making the sum range over axioms. (Initially, I said that some dimension inequality holds for each resolution step, which is true but too weak for what follows.)

LYM Inequality

2018-04-20T06:16:00.000+01:00

A simple proof.

We start with universe $[n]=\{1,\ldots,n\}$, and consider families of sets ${\cal F}\subseteq 2^{[n]}$. Such a family is called an antichain when its sets are not related by $\subset$; that is, $F_1 \not\subset F_2$ for all $F_1,F_2\in{\cal F}$. Sperner's Theorem says that, for any antichain ${\cal F}$, $$ |{\cal F}| \le \binom{n}{\lfloor n/2 \rfloor} $$ and it is a consequence of the more general LYM inequality, which also holds for antichains: $$ \sum_{F \in {\cal F}} \binom{n}{|F|}^{-1} \le 1 $$
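Both statements are easy to check on small examples. The sketch below uses exact rational arithmetic; the helper names `is_antichain` and `lym_sum` are mine, and the middle layer of $2^{[4]}$ is used only as a convenient test family.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def is_antichain(family):
    """True if no set of the family is a proper subset of another."""
    sets = [frozenset(s) for s in family]
    return not any(a < b or b < a for a, b in combinations(sets, 2))


def lym_sum(family, n):
    """Left-hand side of the LYM inequality for a family over [n]."""
    return sum(Fraction(1, comb(n, len(s))) for s in family)


n = 4
# All subsets of size floor(n/2): an antichain that meets both bounds with equality.
middle_layer = [set(c) for c in combinations(range(1, n + 1), n // 2)]
```

For the middle layer the LYM sum is exactly 1 and the family has $\binom{4}{2}=6$ sets, so neither inequality can be improved in general.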

It is easy to see that Sperner follows from LYM because $\binom{n}{\lfloor n/2 \rfloor} \ge \binom{n}{k}$ for all $n$ and $k$. It is perhaps not so easy to see why the LYM inequality holds. Here is a proof by induction on $n$. The base case $n=0$ is easy. Otherwise, we can assume the induction hypothesis $$ \sum_{\substack{F\\ F\in{\cal F}\\x\not\in F}} \binom{n-1}{|F|}^{-1} \le 1 $$ for any $x\in[n]$. Then $$ \begin{aligned} 1 \;&\ge\; \frac{1}{n} \sum_{x \in [n]} \sum_{\substack{F\\ F\in{\cal F}\\x\not\in F}} \binom{n-1}{|F|}^{-1} &&\text{average induction hypothesis over all $x\in [n]$} \\\;&=\; \frac{1}{n} \sum_{x \in [n]} \sum_{\substack{F\in{\cal F}}} [x\not\in F] \binom{n-1}{|F|}^{-1} &&\text{use indicator function $[{\cdot}]$} \\\;&=\; \sum_{\substack{F\in{\cal F}}} \frac{1}{n} \binom{n-1}{|F|}^{-1} \sum_{x \in [n]} [x\not\in F] &&\text{swap sums} \\\;&=\; \sum_{\substack{F\in{\cal F}}} \frac{1}{n} \binom{n-1}{|F|}^{-1} (n-|F|) &&\text{compute the inner sum} \\\;&=\; \sum_{\substack{F\in{\cal F}}} \binom{n}{|F|}^{-1} &&\text{property of binomial coefficients} \end{aligned} $$

We just proved the LYM inequality. Or did we? Where did I use that ${\cal F}$ is an antichain? I didn't, so the proof above must be wrong. Can you find the mistake and fix it? (Hint: The mistake is subtle but silly, the fix is simple.)


The problem is in the step that introduces the indicator function $[{\cdot}]$. That can be done only if $\binom{n-1}{|F|}\ne 0$. This happens most of the time, except when $|F|=n$. Such a case must be treated separately. If $|F|=n$ then $|{\cal F}|=1$ because ${\cal F}$ is an antichain. (There — we used the antichain property.) Thus, the inequality holds for this case as well.

Cap Sets

2017-06-10T07:29:00.000+01:00

Why it's difficult to avoid arithmetic progressions.

The following result was established in 2016:

Let $A$ be a subset of $\mathbb{F}_3^n$ containing no three-term arithmetic progression. Then, $|A|=o(2.8^n)$.

This has been described as a big breakthrough. Yet, the proof is simple enough to fit in 2 pages.

The elements of $\mathbb{F}_3^n$ are vectors with $n$ components, each component being an integer modulo 3. Vectors $a_1,a_2,a_3$ are said to form a three-term arithmetic progression when $a_2-a_1=a_3-a_2$. Because we are working modulo $3$, this is equivalent to $a_1+a_2+a_3=0$. Thus, we can rephrase the result as follows.

Let $A$ be a subset of $\mathbb{F}_3^n$ such that $a_1+a_2+a_3\ne0$ for all distinct $a_1,a_2,a_3\in A$. Then, $|A|=o(2.8^n)$.

If we were to not require $a_1,a_2,a_3$ to be distinct, then the hypothesis would essentially be just ‘false’, because any nonempty set contains a trivial three-term arithmetic progression if we are allowed to pick the same element three times. An equivalent formulation is to say ‘not all equal’ instead of ‘distinct’, because $a+a+b \ne 0$ if $a \ne b$.

Let $A$ be a subset of $\mathbb{F}_3^n$ such that $a_1+a_2+a_3\ne0$ for $a_1,a_2,a_3\in A$ not all equal. Then, $|A|=o(2.8^n)$.
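For very small $n$ the hypothesis can be checked by brute force. This is only a sanity-check sketch (the function names are mine, and the search is exponential), but it makes the 'distinct elements summing to zero' formulation concrete.

```python
from itertools import combinations, product


def is_cap_set(points):
    """True if no three distinct points of F_3^n sum to zero coordinate-wise."""
    return not any(
        all((a + b + c) % 3 == 0 for a, b, c in zip(p, q, r))
        for p, q, r in combinations(set(points), 3)
    )


def largest_cap_set(n):
    """Exhaustive search from large to small subsets; feasible only for tiny n."""
    space = list(product(range(3), repeat=n))
    for size in range(len(space), 0, -1):
        for subset in combinations(space, size):
            if is_cap_set(subset):
                return subset
    return ()
```

For example, `len(largest_cap_set(2))` is 4, out of the 9 points of $\mathbb{F}_3^2$.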

The overall idea is to find lower and upper bounds for the dimension of some vector space of polynomials. We consider the set $M_n^d$ of monomials over $n$ variables that have total degree $\le d$ and all powers from $\{0,1,2\}$. The restriction on powers is not serious because $x^3=x$ by Fermat's little theorem. (One may think that $x^2=1$, but that is true only for nonzero $x$.) Let us write $S_n^d$ for the vector space of polynomials written with monomials from $M_n^d$. We have that $\dim S_n^d=|M_n^d|$; let us denote this quantity by $m_d$. Observe that $m_{\infty}=3^n$; indeed, if we allow any total degree, then polynomials can represent any function $\mathbb{F}_3^n\to\mathbb{F}_3$, by combining indicator polynomials $I_a(x)\stackrel{\text{def}}=\prod_{k=1}^n \bigl(1- (x_k-a_k)^2\bigr)$.

We define two subsets of $\mathbb{F}_3^n$: $$\begin{align} X &\stackrel{\text{def}}= \{\,a_1+a_2 \mid\text{$a_1,a_2\in A$ distinct}\,\} \\ Y &\stackrel{\text{def}}= \{\,-a_3 \mid a_3\in A\,\} \end{align}$$ These sets are disjoint, if $A$ satisfies our hypothesis. We will consider the subspace $V$ of the polynomials that vanish outside $Y$. We will derive a lower bound on $\dim V$, by simply counting points outside $Y$; and we will derive an upper bound on $\dim V$, using a lemma which says, roughly, that ‘any polynomial that is zero on all of $X$ is zero on much of $Y$’. The reasoning will work for an arbitrary $d$; when comparing the upper with the lower bound, we shall pick for $d$ a convenient value.

The bounds we aim to prove are the following: $$ m_d - 3^n + |A| \le \dim V \le 2 m_{d/2} $$

Let us first see how we use these bounds, and we will soon return to how to prove them. By the correspondence $x^k \leftrightarrow x^{2-k}$, we see that monomials of total degree $\le d$ are in a one-to-one correspondence with monomials of total degree $\ge 2n-d$. Thus, $m_d$ equals $3^n-m_{2n-d-1}$. It follows that $|A| \le 2 m_{d/2} + m_{2n-d-1}$. If we pick $d$ such that $d/2=2n-d-1$, we get $|A| \le 3 m_{(2n-1)/3} \le 3m_{2n/3}$. Finally, one can show that $m_{2n/3}=o(2.8^n)$.
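The counting identities used here can be checked mechanically. The sketch below computes $m_d$ with a small dynamic program over exponent vectors; `count_monomials` is a name I made up, and `d` is assumed to be an integer.

```python
def count_monomials(n, d):
    """m_d: number of exponent vectors in {0,1,2}^n whose sum is <= d."""
    if d < 0:
        return 0
    ways = [1] + [0] * (2 * n)  # ways[s] = number of vectors so far with sum s
    for _ in range(n):
        new = [0] * (2 * n + 1)
        for s, w in enumerate(ways):
            if w:
                for e in (0, 1, 2):  # exponent of the next variable
                    new[s + e] += w
        ways = new
    return sum(ways[: min(d, 2 * n) + 1])


n = 5
total = count_monomials(n, 2 * n)  # no effective degree restriction
```

With this, one can confirm both $m_{\infty}=3^n$ and $m_d = 3^n - m_{2n-d-1}$ for every $d$ between $0$ and $2n$.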

Why is $m_{2n/3}=o(2.8^n)$?

We can use a Chernoff bound. Several useful inequalities involving probabilities have the form $$ \Pr(X\in S) \le \mathbb{E}f(X) $$ where $X$ is a random variable, and $f$ is a function such that $f(x)\ge [x \in S]$. The notation $[\phi]$ stands for $1$ when $\phi$ holds, and $0$ otherwise. It is very easy to prove this: $$\begin{align} \Pr(X \in S) & = \sum_{x} [x \in S] \Pr(X=x) \\ &\le \sum_x f(x) \Pr(X=x) = \mathbb{E} f(X) \end{align}$$ We pick $S \stackrel{\text{def}}= \{\,x\mid x\le 0\,\}$ and $f(x) \stackrel{\text{def}}= e^{-\alpha x}$. This tells us that $\Pr(X \le 0) \le \mathbb{E} e^{-\alpha X}$ for any $\alpha\ge 0$. Enough background – let us move back to $m_{2n/3}$.

The number $m_{2n/3}$ says in how many ways we can choose $n$ numbers from the set $\{0,1,2\}$ such that their sum is $\le 2n/3$. In other words, $m_{2n/3} = 3^n \cdot \Pr(Y_1+\cdots +Y_n\le 2n/3)$, where $Y_1,\ldots,Y_n$ are i.i.d. random variables taking values (uniformly) in $\{0,1,2\}$. If we define $X_k \stackrel{\text{def}}= Y_k-2/3$ and we use the inequality from above, we can calculate $$\begin{align} \Pr\biggl(\sum_{i=1}^n Y_i \le \frac{2n}{3}\biggr) & = \Pr\biggl(\sum_{i=1}^n X_i \le 0\biggr) \\&\le \mathbb{E} \biggl(e^{-\alpha\sum_{i=1}^n X_i}\biggr) = \mathbb{E} \biggl( \prod_{i=1}^n e^{-\alpha X_i}\biggr) = \prod_{i=1}^n \mathbb{E} e^{-\alpha X_i} \\&= \biggl( \frac{e^{-\frac{4}{3}\alpha}+e^{-\frac{1}{3}\alpha}+e^{\frac{2}{3}\alpha}}{3} \biggr)^n \end{align}$$ Now we optimize for $\alpha\ge0$, and we get that $m_{2n/3} \lt 2.75510462^n$.
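The optimization over $\alpha$ is easy to reproduce numerically. Multiplying the bound by $3^n$ shows that $m_{2n/3}\le b(\alpha)^n$ with $b(\alpha)=e^{-\frac{4}{3}\alpha}+e^{-\frac{1}{3}\alpha}+e^{\frac{2}{3}\alpha}$, which is convex in $\alpha$. The sketch below minimizes it by ternary search; the search interval $[0,3]$ and the function names are my choices.

```python
from math import exp


def base(alpha):
    """b(alpha) = 3 * E[exp(-alpha * X)] for X uniform on {-2/3, 1/3, 4/3}."""
    return exp(-4 * alpha / 3) + exp(-alpha / 3) + exp(2 * alpha / 3)


def minimize_convex(f, lo, hi, iterations=200):
    """Ternary search for the minimizer; valid because f is convex on [lo, hi]."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2


best_alpha = minimize_convex(base, 0.0, 3.0)
best_base = base(best_alpha)  # approximately 2.7551, attained near alpha = 0.52
```

Any base strictly below $2.8$ is enough for the $o(2.8^n)$ claim.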

Now let us get back to proving the bounds on $\dim V$.

The lower bound is easy. The definition of $V$ is $\{\,P\in S_n^d \mid\text{$P(a)=0$ for $a\in \mathbb{F}_3^n\setminus Y$}\,\}$. The dimension of $S_n^d$ is $m_d$, and each of the $|\mathbb{F}_3^n\setminus Y|$ constraints reduces the dimension by at most $1$. Done.

Why does adding a constraint $P(a)=0$ reduce the dimension by at most $1$?

Consider a basis $P_1,\ldots,P_k$. Without loss of generality, we can assume that $P_i(a)\in\{0,1\}$ for all $i$. We partition $P_1,\ldots,P_k$ based on whether $P_i(a)$ is $0$ or $1$: for $Q_1,\ldots,Q_{k_1}$, we have $Q_i(a)=0$ for all $i$; for $R_1,\ldots,R_{k_2}$, we have $R_i(a)=1$ for all $i$; and $k=k_1+k_2$. If $k_2=0$, then the dimension is not reduced at all, as witnessed by the initial basis $P_1,\ldots,P_k$. If $k_2\gt 0$, then the dimension is reduced by $1$, as witnessed by the basis formed by $Q_1,\ldots,Q_{k_1}$ together with $R_2-R_1,R_3-R_1,\ldots,R_{k_2}-R_1$. [This is an instance of the rank-nullity theorem.]

The upper bound isn't quite so easy. My understanding is that the upper bound is the tear that led to the breakthrough. We'll do it in two steps, corresponding to these inequalities: $$ \dim V \le |\Sigma| \le 2 m_{d/2} $$ where $\Sigma$ is a maximal support of a polynomial in $V$.

For the first inequality, we will show the contrapositive: if a polynomial $P \in V$ has support $\Sigma$ with $|\Sigma|\lt\dim V$, then $\Sigma$ is not maximal. We do this by finding another polynomial $Q \in V$ whose support is nonempty and disjoint from $\Sigma$. The support of $P+Q$ will be a strict superset of $\Sigma$.

Why does $Q$ exist?

We reuse the argument for why a constraint reduces the dimension by at most $1$. We start with space $V$, and each additional constraint $Q(a)=0$ reduces the dimension by at most $1$. Thus, the space $\{\,Q\in V\mid\text{$Q(a)=0$ for $a\in\Sigma$}\,\}$ has positive dimension.

For the second inequality, we show that any $P\in V$ has a support of size $\le 2 m_{d/2}$. I'll start with an example. Take $P({\bf x})=x_1 x_2 x_3+x_1^2$. This is an element of $S_3^3$ because we use $3$ variables and the total degree of each monomial is $\le 3$. The polynomial $P({\bf x}+{\bf y})$ will have monomials that use both $x$-variables and $y$-variables, but still have total degree $\le 3$: $$\begin{align} P({\bf x}+{\bf y}) = \left\{ \begin{aligned} &x_{1} x_{2} x_{3} + x_{2} x_{3} y_{1} + x_{1} x_{3} y_{2} + x_{3} y_{1} y_{2} \\ &+ x_{1} x_{2} y_{3} + x_{2} y_{1} y_{3} + x_{1} y_{2} y_{3} + y_{1} y_{2} y_{3} \\ &+ x_{1}^{2} + 2 \, x_{1} y_{1} + y_{1}^{2} \end{aligned} \right. \end{align}$$

For each monomial, in addition to the total degree (the sum of all powers), we can define an $x$-degree (the sum of all powers on $x$-variables), and a $y$-degree (the sum of all powers on $y$-variables). Since the total degree is $\le3$, it follows that the $x$-degree is $\le3/2$ or the $y$-degree is $\le3/2$ (or both). Let's put on the first line all those monomials with $x$-degree $\le3/2$, and on the second line the rest: $$ P({\bf x}+{\bf y}) = \left\{ \begin{aligned} & x_{3} y_{1} y_{2} + x_{2} y_{1} y_{3} + x_{1} y_{2} y_{3} + y_{1} y_{2} y_{3} + 2 x_{1} y_{1} + y_{1}^{2} \\ & + x_{1} x_{2} x_{3} + x_{2} x_{3} y_{1} + x_{1} x_{3} y_{2} + x_{1} x_{2} y_{3} + x_{1}^{2} \end{aligned} \right. $$

Now we group monomials on the first line by $x$, and we group monomials on the second line by $y$. $$ P({\bf x}+{\bf y}) = \left\{ \begin{aligned} & x_{3} y_{1} y_{2} + x_{2} y_{1} y_{3} + x_{1} (y_{2} y_{3} + 2 y_1) + (y_{1} y_{2} y_{3} + y_{1}^{2}) \\ & + (x_{1} x_{2} x_{3} + x_1^2) + x_{2} x_{3} y_{1} + x_{1} x_{3} y_{2} + x_{1} x_{2} y_{3} \end{aligned} \right. $$

Introducing some new notation, we can write the above as $$ P({\bf x}+{\bf y}) = \left\{ \begin{aligned} & x_{3} F_{x_3}({\bf y}) + x_{2} F_{x_2}({\bf y}) + x_{1} F_{x_1}({\bf y}) + F_1({\bf y}) \\ & + G_1({\bf x}) + y_{1} G_{y_1}({\bf x}) + y_{2} G_{y_2}({\bf x}) + y_{3} G_{y_3}({\bf x}) \end{aligned} \right. $$
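As a sanity check on the regrouping, one can evaluate both sides at every pair of points of $\mathbb{F}_3^3$. In the sketch below, `decomposed` spells out the $F$ and $G$ polynomials read off from the grouped form of the running example; the function names are mine.

```python
from itertools import product


def P(x):
    """The running example P(x) = x1*x2*x3 + x1^2 over F_3."""
    x1, x2, x3 = x
    return (x1 * x2 * x3 + x1 * x1) % 3


def decomposed(x, y):
    """Right-hand side: low x-degree terms first, then low y-degree terms."""
    x1, x2, x3 = x
    y1, y2, y3 = y
    first = (x3 * (y1 * y2) + x2 * (y1 * y3)
             + x1 * (y2 * y3 + 2 * y1) + (y1 * y2 * y3 + y1 * y1))
    second = ((x1 * x2 * x3 + x1 * x1) + y1 * (x2 * x3)
              + y2 * (x1 * x3) + y3 * (x1 * x2))
    return (first + second) % 3


points = list(product(range(3), repeat=3))
identity_holds = all(
    P(tuple((a + b) % 3 for a, b in zip(x, y))) == decomposed(x, y)
    for x in points
    for y in points
)
```

Each of the eight summands is a product of a function of ${\bf x}$ alone and a function of ${\bf y}$ alone, which is what makes the rank argument work.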

Finally, we instantiate this equality for all $({\bf x},{\bf y})\in A^2$. For an example, let's say $A=\{100,020\}$, where $100$ is a compact notation for the point $(1,0,0)\in\mathbb{F}_3^3$. Then, $$\begin{align} \begin{bmatrix} P(100+100) & P(100+020) \\ P(020+100) & P(020+020) \end{bmatrix} = &\begin{bmatrix} 0 \\ 0 \end{bmatrix}_{x_3} \begin{bmatrix} F_{x_3}(100) & F_{x_3}(020) \end{bmatrix}\\ &+ \begin{bmatrix} 0 \\ 2 \end{bmatrix}_{x_2} \begin{bmatrix} F_{x_2}(100) & F_{x_2}(020) \end{bmatrix} + \cdots \end{align}$$

The matrix corresponding to the term $x_3 F_{x_3}({\bf y})$ was factored into a column vector corresponding to $x_3$, and a row vector corresponding to $F_{x_3}({\bf y})$.

Of course, we can do the same manipulations for any polynomial $P\in S_n^d$, obtaining $$\begin{align} P({\bf x}+{\bf y}) = \sum_{m\in M_n^{d/2}} m({\bf x}) F_m({\bf y}) + \sum_{m\in M_n^{d/2}} m({\bf y}) G_m({\bf x}) \end{align} $$

and we can instantiate this equation for all $({\bf x},{\bf y})\in A^2$ to obtain a matrix identity. If $P\in V$, which implies that $P$ vanishes on $X$, then its matrix is diagonal. On the other hand, the matrix of each term $m({\bf x})F_m({\bf y})$ has rank 1, and similarly the matrix of each term $m({\bf y})G_m({\bf x})$. Because rank is subadditive, the matrix of a $P \in V$ has rank $\le 2 m_{d/2}$, which means that it has $\le2m_{d/2}$ nonzero elements on the diagonal. In other words, $P(-a) = P(a + a)\ne 0$ for $\le 2 m_{d/2}$ points $a\in A$.

This concludes the proof.

This post follows [Ellenberg, Gijswijt, On large subsets of $F_q^n$ with no three-term arithmetic progression], which presents a slightly more general result, for $\mathbb{F}_q^n$ rather than $\mathbb{F}_3^n$.

[Stefan Kiefer provided feedback on several drafts of this post. Thanks!]

Open Access

2017-05-06T08:07:00.003+01:00

Preachy post about how I'm not preachy on the subject of open access.

I believe that scientific knowledge should be free to all. For me, this is a core value. Like with other core values, I would love it if more people shared my opinion, but I am not the preacher type. Nevertheless, I appreciate the work of those who are the preacher type. Also, I think I have the right to act according to my core values, as long as I'm not stepping on others'.

How can we achieve free access to scientific knowledge? Granted, much of scientific knowledge is free, and the problem is that people aren't looking for it. But, still, a significant part of scientific knowledge remains behind paywalls.

I believe that there is exactly one way to achieve true, lasting, and meaningful change: reach a critical mass of people who hold the core value that scientific knowledge should be free. How do we get there? There is no royal way. Multiple tools must be used. Talk about it. Write about it. Change rules and guidelines of organizations. Set up procedures that make sharing (rather than not sharing) the default.

But, don't lose sight of the big picture. Rules and procedures are not the goal. Changing hearts and minds is the goal. And how people act is a barometer of how close we are to the goal. For example, do people post their articles on arXiv? Those who don't because it's slightly inconvenient or because they think it's beneath them and an admin person should do it, those people are not true believers, even though they might be preachers.

Changing rules and procedures is a tool for achieving change. Looking at how many papers are easy to access online is not how you measure progress. Looking at how individuals act is.

POPL 2017

2017-02-07T09:47:00.001+00:00

Some things I learned by attending POPL talks.

Monadic Second Order Logic on Finite Sequences. After two weeks, I do not quite remember what the main contribution was. I do remember that during this presentation I finally understood symbolic automata. As you know, traditional formalisms, such as finite automata, work over finite alphabets. If we want to use such formalisms to talk about programs, we have a problem. Often, we want the letters to be (references to) objects, but the number of objects is unbounded. Other times, we want letters to be integers, but the number of integers is also unbounded. A solution is to take an existing formalism, such as finite automata, and extend it to work over infinite alphabets. One way to do so is to use nominal automata. Roughly, instead of considering just an alphabet $\Sigma$, we also consider a group $G$ that acts on the alphabet, and has a finite number of orbits; for example, we could take all permutations, $G={\rm Sym}(\Sigma)$. Then, nominal automata let you define a language $L$ as long as it is invariant under $G$'s actions: $l_1\ldots l_n \in L$ iff $\pi(l_1)\ldots \pi(l_n) \in L$, for all $\pi \in G$. Register automata offer a concrete way to think about nominal automata. In register automata, you have one extra action – you can store the current letter in a register, and you also have an extra guard – you can test if the current letter equals the letter in a register.
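Here is a tiny illustration of the register idea, written as plain Python rather than in any automata library. The language (words whose first letter occurs again later) and the function name are my own example; the point is that the only operations on letters are 'store' and 'test for equality', so the language is invariant under renaming letters.

```python
def first_letter_repeats(word):
    """One-register automaton: store the first letter, accept if it reappears."""
    state, register = "start", None
    for letter in word:
        if state == "start":
            register = letter  # extra action: store the current letter
            state = "waiting"
        elif state == "waiting" and letter == register:  # extra guard: equality
            state = "accept"
    return state == "accept"


word = [5, 7, 5]
renamed = [{5: "a", 7: "b"}[l] for l in word]  # apply a renaming of letters
```

Both `word` and `renamed` are accepted, whatever infinite alphabet the letters are drawn from.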

An alternative to nominal automata is given by symbolic automata. Intuitively, nominal automata allow you to state requirements that are simple (they only involve equality), but can talk about several letters in the input word. Symbolic automata do the converse: You are allowed to use an arbitrary logic to express your tests, but you may only refer to the current letter when doing so; in particular, there is no moral equivalent of the memory. When I say ‘arbitrary logic’, I mean that the definition of symbolic automata is modular – you can choose your logic later and you should do so explicitly.
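A rough sketch of the symbolic side, under a big simplifying assumption: guards are arbitrary Python callables on the current letter. That is enough to run the automaton on a word, but a real implementation needs guards from a logic with decidable satisfiability, otherwise questions such as emptiness cannot be answered. The class and the example language are mine.

```python
class SymbolicAutomaton:
    """Nondeterministic automaton whose transitions carry predicates on letters."""

    def __init__(self, initial, accepting, transitions):
        self.initial = initial
        self.accepting = set(accepting)
        self.transitions = transitions  # state -> list of (guard, next state)

    def accepts(self, word):
        current = {self.initial}
        for letter in word:
            # A guard only sees the current letter; there is no memory of others.
            current = {
                target
                for state in current
                for guard, target in self.transitions.get(state, [])
                if guard(letter)
            }
        return bool(current & self.accepting)


# Words over the integers that contain at least one even letter.
has_even = SymbolicAutomaton(
    initial=0,
    accepting=[1],
    transitions={
        0: [(lambda x: x % 2 == 1, 0), (lambda x: x % 2 == 0, 1)],
        1: [(lambda x: True, 1)],
    },
)
```

Compared with the register automaton, the tests can be arbitrary arithmetic, but no letter can be remembered and compared with a later one.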

Now back to the main result. As I said, I didn't quite get it (or, possibly, I forgot in the two intervening weeks). But, it was something of the following form. Büchi, Elgot, and Trakhtenbrot (1957,1958) showed that MSO over finite words defines regular languages, by giving a procedure which builds an NFA from an MSO formula. So, if somebody gives you an MSO formula, you can say whether it is satisfiable by building the NFA, and checking if its language is nonempty. Loris D'Antoni and Margus Veanes define a sort of MSO(T) – MSO modulo what you're allowed to require of the current letter – and then explain how to build a symbolic automaton for that. Thus, one can check satisfiability of MSO(T), assuming that satisfiability(?) in T is decidable. (BTW, here's a nice set of slides by Moshe Vardi, which mention the result from 1957: Logic, Automata, Games, and Algorithms.)

My understanding was also helped by a chat I had with Margus prior to the talk. I bombarded him with questions and he patiently answered all. I also had a chat with Loris, but mostly about Automata Tutor. I'm happy to report that my son (almost 8 years old now) started this week to solve problems from that website. He thinks it's fun! (But his approach is somewhat too random for my taste … although I occasionally observe a spark in his eyes, followed by what seems to be the execution of a plan.)

LOIS: Syntax and Semantics and Learning Nominal Automata. LOIS stands for looping over infinite sets. It is an extension of C++ that lets you do what the name says. Suppose that someone gives you a finite graph $G$ and asks if it is connected. It is possible to answer this question with a simple program, which checks if each vertex $x \in V(G)$ can reach all vertices $V(G)$ of $G$; that is, ${\it connected}(G) := \bigwedge_{x \in V(G)} \bigl({\it reach}(x) = V(G)\bigr)$, where $\it reach$ is implemented by BFS or DFS. I'm not going to write the code, because it's hopefully clear what I mean. If we try to run the same code on an infinite graph, though, the program won't terminate. In fact, we'd also have a termination problem when trying to construct such a graph, although we could conceivably ‘solve’ it by some laziness. Yet, with LOIS you can use (almost) the same program you use for finite graphs, and it will terminate. For example, you can construct a countably infinite random graph, which contains every finite graph as a subgraph, and ask whether it is connected. How does it work? It does symbolic manipulations. So, not quite any infinite set works: they must be definable (although I didn't quite get in which logic). In particular, sets must be countable. This restriction is what makes it impossible (unfortunately ☺) to solve undecidable problems such as universality of register automata.
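For reference, the finite-graph program could look like the sketch below. It is ordinary Python, not LOIS, and it assumes the graph is given as a dictionary from each vertex to its neighbours (listing both directions of every undirected edge).

```python
from collections import deque


def reach(graph, start):
    """Set of vertices reachable from start, by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def connected(graph):
    """The conjunction over all x of reach(x) == V(G)."""
    vertices = set(graph)
    return all(reach(graph, x) == vertices for x in graph)


path = {1: [2], 2: [1, 3], 3: [2]}
two_parts = {1: [2], 2: [1], 3: []}
```

The loops over `graph` and over the queue are exactly the ones that fail to terminate when the vertex set is infinite and represented explicitly.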

You can play with LOIS's implementation.

Finding out whether a random graph is connected seems like an artificial example. Are there any real applications? It depends what you mean by ‘real’. But, I can say that there was at least one (other) POPL paper that can be viewed as an application: Learning Nominal Automata. Angluin's algorithm (1987) learns regular languages by using two queries: (1) ‘Is this word in the (secret) language?’ and (2) ‘Does this automaton represent the (secret) language?’ Nowadays, as I mentioned, we care about automata that work over infinite alphabets. It turns out that one can use Angluin's algorithm unchanged (or at least without major modifications – I'm not sure) to learn nominal automata. The authors do not use LOIS, but another language NLambda. It also has infinite sets but, as the name implies, it's functional.

One of the authors of LOIS is Eryk Kopczyński, whose code on Topcoder I used to read regularly more than ten years ago. Eryk mentioned this puzzle: Assume that P=NP. Input: An NP-complete problem $p$. Output: An algorithm that solves (any instance of) $p$ in polynomial time.

Component-Based Synthesis for Complex APIs. You have access to some existing functions, and you have to implement a function whose signature is given. Often, a straight-line program would do. This paper shows how to automate the task of finding such straight-line programs, by doing a search guided by types. More specifically, the approach is to count. Suppose you want to write a function with the type ${\rm string}\to{\rm int}$, and you have at your disposal two functions: $$\begin{align*} f &: {\rm string} \to {\rm string} * {\rm string} \\ g &: {\rm string} * {\rm string} * {\rm string} \to {\rm int} \\ \end{align*}$$ You can call $f$ twice and then $g$: $$\begin{align*} &\{\,{\rm string}\,\} \\ &f \\ &\{\,{\rm string} * {\rm string}\,\} \\ &f \\ &\{\,{\rm string} * {\rm string} * {\rm string}\,\} \\ &g \\ &\{\,{\rm int}\,\} \end{align*}$$ I chose this notation to get across an observation made by Hongseok Yang: the method has some similarities to separation logic. Of course, just counting types has a problem: In the example above, the second call to $f$ has two strings on which we could call it, and just by looking at types there's no way to choose. Similarly, the last call to $g$ needs to fix some order for the $3$ arguments, and types provide no guidance. In such cases, a human would need to intervene. The hope is that types are sufficiently precise to make such situations rare.

The presentation used a formulation in terms of Petri nets, and cast the synthesis problem as a reachability question. This is a bit of a red herring, because it makes it sound like the complexity of synthesis is horrible. However, the synthesis works in a way similar to bounded model checking: it puts a bound on the length of the program and tries to find a solution; if it fails, it increases the bound and repeats.
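The counting view and the increasing bound can be sketched as follows. This is my own toy reconstruction, not the tool from the paper: a state is a multiset of available types, a component consumes some types and produces others, and a depth-limited search is rerun with larger and larger bounds. The names `fire` and `synthesize` are hypothetical.

```python
from collections import Counter


def fire(state, consumes, produces):
    """Apply a component to a multiset of types; None if inputs are missing."""
    have, need = Counter(state), Counter(consumes)
    if any(have[t] < k for t, k in need.items()):
        return None
    return tuple(sorted((have - need + Counter(produces)).elements()))


def synthesize(components, start, goal, max_length=6):
    """Try program lengths 0, 1, 2, ... and return the first program found."""
    goal = tuple(sorted(goal))

    def search(state, budget):
        if state == goal:
            return []
        if budget == 0:
            return None
        for name, (consumes, produces) in components.items():
            nxt = fire(state, consumes, produces)
            if nxt is not None:
                rest = search(nxt, budget - 1)
                if rest is not None:
                    return [name] + rest
        return None

    for bound in range(max_length + 1):
        program = search(tuple(sorted(start)), bound)
        if program is not None:
            return program
    return None


components = {
    "f": (["string"], ["string", "string"]),
    "g": (["string", "string", "string"], ["int"]),
}
program = synthesize(components, ["string"], ["int"])  # ['f', 'f', 'g']
```

As in the discussion above, the result only says which components to call and how often; which string each call receives is left undetermined by the types.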

Others. I also enjoyed other presentations, but I don't want to make this post too long. (Or, rather, I don't want to spend too long on this post.) These other presentations include: Thread Modularity at Many Levels: a Pearl in Compositional Verification (or, ‘how to avoid auxiliary variables for the benefit of automation’), Polymorphism, subtyping and type inference in MLsub (in which types get rather flexible), Exact Bayesian Inference by Symbolic Disintegration (sounded a bit like abstract nonsense, but was very entertaining), Coupling proofs are probabilistic product programs (from which I learned a card ‘trick’). Also, I was happy to see friends; for example, Hongseok.

Learning from Interpretations

2015-11-12T16:37:00.001+00:00

The LFI-Problog algorithm, for doing inference on probabilistic logic programs.

This is a brief summary of what I learned from [Gutmann et al., Learning the Parameters of Probabilistic Logic Programs from Interpretations, 2011]. It's also an abridged record of what I presented in the probabilistic programming reading group at Oxford. Well, abridged in content but not in presentation. The presentation here was done in one go. On the first go, I tend to write rather chatty text, which I don't like. But, others say they actually prefer it! Go figure. Anyway …

ProbLog can be seen as a concise way to describe probability distributions over bitvectors. Here is an example:

  p :: foo(X).
  bar() :- foo(X).

In addition to the above, we know, from the type of $X$, that $X \in \{1,2\}$. (In ProbLog it is more complicated, but I'll just assume the types of variables are given and fixed.) The probability distribution this program describes has the type $\bigl(\{{\it foo}(1), {\it foo}(2), {\it bar}() \} \to 2\bigr) \to [0,1]$. Earlier I said ‘bitvector’ because this type is isomorphic to $2^3 \to [0,1]$, once we fix an order on the atoms ${\it foo}(1)$, ${\it foo(2)}$, ${\it bar}()$. More precisely, the distribution is the following:

world	000	001	010	011	100	101	110	111
probability	$qq$	$0$	$0$	$qp$	$0$	$pq$	$0$	$pp$

Here, $p$ is the parameter that occurs in the ProbLog program, and I used $q$ to denote $1-p$. The table is produced as follows. First, identify the input atoms: these are the groundings of those facts labeled with probabilities. In our case, the input atoms are ${\it foo}(1)$ and ${\it foo}(2)$. Second, give a truth assignment to the input atoms and fix the probability. For example, if we set ${\it foo}(1)=1$ and ${\it foo}(2)=0$, then the probability is $p \times (1-p)$: the $p$ for ${\it foo}(1)$ and the $1-p$ for ${\it foo}(2)$. Finally, complete the world: this is done by applying the derivation rules until a fixed-point is reached. This will be a least fixed-point, so we say we use least fixed-point semantics. All the worlds that can't be generated by this process (pick inputs arbitrarily, then do least fixed-point) get probability $0$.

Note that inputs should not appear as heads of any derivation rule. Otherwise, you're in a bit of a pickle if you try to apply the algorithm from above.
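The recipe above is small enough to run by brute force. Here is a minimal Python sketch that rebuilds the table for the two-line program. The name `world_distribution` is made up for this post, and the derivation rule is hard-coded rather than parsed from ProbLog, so take it as an illustration only.

```python
from fractions import Fraction
from itertools import product

def world_distribution(p):
    """Map each world (foo(1), foo(2), bar()) to its probability."""
    dist = {world: Fraction(0) for world in product((0, 1), repeat=3)}
    for foo1, foo2 in product((0, 1), repeat=2):
        # Probability of this truth assignment to the input atoms.
        weight = Fraction(1)
        for bit in (foo1, foo2):
            weight *= p if bit else 1 - p
        # Least fixed point of the single rule bar() :- foo(X).
        bar = 1 if (foo1 or foo2) else 0
        dist[(foo1, foo2, bar)] += weight
    return dist

dist = world_distribution(Fraction(1, 3))
for world, prob in sorted(dist.items()):
    print(''.join(map(str, world)), prob)
```

Worlds that the process never generates, such as 001, keep the probability 0 they started with.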

OK, now we have a way to describe probability distributions. In fact, parameterized probability distributions because $p$ is a parameter. The next task is to do learning. And by learning I mean MLE (maximum likelihood estimation).

What is MLE? In MLE you observe some event which you assume comes from some parameterized distribution, and you want to set the parameters to maximize the probability of the observed event. For example, if we observe ${\it foo}(1)=1$ and ${\it foo}(2)=0$, then we compute the probability of this event to be $p(1-p)$, and we maximize it by taking $p=1/2$. Or, we could observe just ${\it bar}()=1$, which has the probability $1-(1-p)^2=p(2-p)$, maximized by taking $p=1$. In this latter case, we say that the ${\it foo}$s are latent (or hidden) variables: we need to think about them to compute the probability, but we don't observe them.
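If you don't trust the calculus, a crude grid search confirms both maximizers. This is only a sanity check under my own choice of grid (multiples of 1/100), not how anyone would do MLE in practice.

```python
from fractions import Fraction

# Candidate parameter values 0, 1/100, ..., 1.
grid = [Fraction(k, 100) for k in range(101)]

# Likelihood of observing foo(1)=1 and foo(2)=0.
best_observed = max(grid, key=lambda p: p * (1 - p))

# Likelihood of observing only bar()=1; the foos are latent.
best_latent = max(grid, key=lambda p: 1 - (1 - p) ** 2)

print(best_observed, best_latent)
```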

MLE is very simple in principle, but the expression that you're supposed to optimize becomes unwieldy for examples of even moderate size, especially if latent variables are involved. One general optimization strategy is the EM algorithm. This is an iterative algorithm, which I'll illustrate on some examples.

Look at this table which lists for each observation of ${\it foo}(1),{\it foo}(2)$ how we should pick $p$:

observation	best $p$
00	$0$
01	$1/2$
10	$1/2$
11	$1$

The best $p$ is the average of ${\it foo}(1)$ and ${\it foo}(2)$! But what do we do if we don't observe the $\it foo$s? Well, we still have some expectation for their values, given what we observed.

Let's see what this means for the case in which we observed ${\it bar}()=1$. First, we guess $p=1/2$. Under this guess, $\mathop{\rm E}\bigl({\it foo}(1)\mid {\it bar}()=1\bigr)$ is $$\begin{align*} \frac{pq+pp}{qp+pq+pp} = \frac{q+p}{q+q+p} = \frac{1}{1+q} = \frac{2}{3} \end{align*}$$

By symmetry, $\mathop{\rm E}\bigl({\it foo}(2)\mid {\it bar}()=1\bigr)$ is also $2/3$. So, their average is also $2/3$. (Remember: We take averages of expectations.) Thus, we update $p:=2/3$ (and $q=1/3$).

In the next iteration, we do the same, but with the new value of $p$: $$\begin{align*} \frac{pq+pp}{qp+pq+pp} = \frac{1}{1+q} = \frac{3}{4} \end{align*}$$

If we keep doing this, then we end up with $p=1$, which is the same solution we got when we applied MLE directly, by maximizing $1-(1-p)^2$. (You can check that the recurrence from above defines a sequence $q_n=1/n$. So, when $n\to\infty$ we have $q_n \to 0$ and $p_n \to 1$.) For the general case, you can find a proof in Wikipedia that applying one EM step doesn't reduce the likelihood (likelihood = the probability of the observed event). The proof boils down to Gibbs' inequality. Since the likelihood doesn't decrease, we may hope it increases.
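The iterations above are easy to replay with exact fractions. The sketch below hard-codes the observation ${\it bar}()=1$ and the two-atom example; `em_step` is a name I made up, and nothing here is general EM.

```python
from fractions import Fraction

def em_step(p):
    """One EM step for the observation bar() = 1."""
    q = 1 - p
    # E(foo(1) | bar() = 1); by symmetry E(foo(2) | bar() = 1) is the same.
    expectation = (p * q + p * p) / (q * p + p * q + p * p)
    # The new parameter is the average of the two expectations.
    return (expectation + expectation) / 2

p = Fraction(1, 2)
history = [p]
for _ in range(5):
    p = em_step(p)
    history.append(p)
print(history)
```

The values of $q=1-p$ along the way are $1/2, 1/3, 1/4, \ldots$, matching the sequence mentioned above.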

The EM algorithm can be seen as some sort of gradient ascent, specialized for maximizing likelihoods. In particular, it is a numeric algorithm, not a symbolic one.

Let's see where we are. We have a language for describing parametrized distributions. We have a numeric algorithm for estimating good values for the parameters. Each iteration of the numeric algorithm works by averaging some expectations. The rest is about how to compute the expectations efficiently. For this, I'll switch to a slightly more complicated example — the one from the paper.

  0.1 :: burglary.
  0.2 :: earthquake.
  0.7 :: awake(X).  % X is one of { mary, john }
  alarm :- burglary.
  alarm :- earthquake.
  calls(X) :- awake(X), alarm.

Suppose we observe that ${\it alarm}()=1$ and ${\it calls}({\rm John})=0$. First, we ground the program. For efficiency, we also throw away stuff that can't influence what we observed. In this case, we throw away mary.

  0.1 :: burglary.
  0.2 :: earthquake.
  0.7 :: awake(john).
  alarm :- burglary.
  alarm :- earthquake.
  calls(john) :- awake(john), alarm.

Second, we build a formula whose models are the least fixed-points of the above program; that is, the worlds that can have nonzero probability. This step is very easy if there are no cyclic dependencies. $$\bigl({\it alarm} \leftrightarrow ({\it burglary} \lor {\it earthquake})\bigr) \land \bigl({\it calls\_john} \leftrightarrow ({\it awake\_john} \land {\it alarm})\bigr) $$

Third, we simplify the formula according to the observation. $$\begin{align*} &\bigl(1 \leftrightarrow ({\it burglary} \lor {\it earthquake})\bigr) \land \bigl(0 \leftrightarrow ({\it awake\_john} \land 1)\bigr) \\ &\quad=({\it burglary} \lor {\it earthquake}) \land \lnot{\it awake\_john} \end{align*}$$

Let $\phi$ be this formula, corresponding to our observation. We want to compute the expectation $\mathop{\rm E}(f\mid \phi=1)$ for $f={\it burglary}$ and for $f={\it earthquake}$. By definition of conditional probabilities (and using $\mathop{\rm E}{X}=\Pr(X=1)$), $$ \mathop{\rm E} (f \mid \phi=1) = \Pr(f=1\mid \phi=1) = \frac{\mathop{\rm E}(\phi\land f)}{\mathop{\rm E}\phi} $$
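Before bringing in BDDs, it may help to see the same quantities computed the slow way, by enumerating all assignments to the input atoms. This is a sketch with names of my own choosing (`PROB`, `phi`, `expectation`); it is exponential in the number of atoms, which is exactly why one wants something smarter.

```python
from fractions import Fraction
from itertools import product

# Current parameter values for the grounded input atoms.
PROB = {
    'burglary': Fraction(1, 10),
    'earthquake': Fraction(1, 5),
    'awake_john': Fraction(7, 10),
}

def phi(w):
    """The simplified formula: (burglary or earthquake) and not awake_john."""
    return (w['burglary'] or w['earthquake']) and not w['awake_john']

def expectation(g):
    """E(g): total weight of the assignments on which g holds."""
    names = list(PROB)
    total = Fraction(0)
    for bits in product((False, True), repeat=len(names)):
        w = dict(zip(names, bits))
        weight = Fraction(1)
        for name in names:
            weight *= PROB[name] if w[name] else 1 - PROB[name]
        if g(w):
            total += weight
    return total

e_phi = expectation(phi)
conditional = {
    f: expectation(lambda w, f=f: phi(w) and w[f]) / e_phi
    for f in PROB
}
print(e_phi, conditional)
```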

In other words, the task is to evaluate the reliability polynomial of $\phi$ and of $\phi\land f$. Both of these tasks are easy once we have $\phi$ represented as a BDD. More precisely, they are linear in the size of the BDD. I expect it is obvious why this is so. :-) In our case, if the BDD has size $n$, then we would update the parameters in $\sim 4n$ steps:

compute $\mathbin{\rm E}\phi$ in $n$ steps
compute $\mathbin{\rm E}(\phi \land {\it burglary})$ in $n$ steps
compute $\mathbin{\rm E}(\phi \land {\it earthquake})$ in $n$ steps
compute $\mathbin{\rm E}(\phi \land {\it awake\_john})$ in $n$ steps
update $p_{\it burglary}:= \frac{\mathbin{\rm E}(\phi \land {\it burglary})}{\mathbin{\rm E}\phi}$ in $1$ step
update $p_{\it earthquake}:= \frac{\mathbin{\rm E}(\phi \land {\it earthquake})}{\mathbin{\rm E}\phi}$ in $1$ step
update $p_{\it awake}:= \frac{1}{2} \biggl( \frac{\mathbin{\rm E}(\phi \land {\it awake\_john})}{\mathbin{\rm E}\phi} +p_{\it awake} \biggr)$ in $1$ step. (Here we averaged two expectations: $\mathbin{\rm E}({\it awake\_john}\mid \phi=1)$ and $\mathbin{\rm E}({\it awake\_mary}\mid \phi=1)$. The latter is just $\mathbin{\rm E}({\it awake\_mary})=p_{\it awake}$.)

The final trick is the observation that the four conjoined expectations ($\mathbin{\rm E}(\phi\land f)$ for $f\in\{{\it burglary}, {\it earthquake}, {\it awake\_john}, {\it awake\_mary}\}$) can all be computed in time linear in the size of the BDD. More precisely, the time is $O(m+n)$, where $m$ is the number of expectations being computed, and $n$ is the size of the BDD. First, you construct a slightly improper BDD that has each of ${\it burglary}$, $\it earthquake$, $\it awake\_john$, $\it awake\_mary$ on every path from root to leaves. (The paper doesn't do this. In fact, in an implementation you wouldn't do it either. But, the equivalent computation that avoids constructing this pseudo-BDD is slightly annoying to describe.) Then, you tag each node with two numbers, $\alpha$ and $\beta$. Both can be understood in terms of a downward random walk which, at a node labeled by $\ell$, takes the 1-branch with probability $\Pr(\ell=1)$. For example, at a node labeled by $\it burglary$ we take the 1-branch with probability $p_{\it burglary}$. Thinking in terms of this random walk, the $\alpha$ of node $x$ is the probability that starting at $x$ you end up at a $1$-leaf. Clearly, these numbers can be filled by one bottom-up traversal of the BDD. The $\beta$ of node $x$ is the probability that starting at the root we end up visiting $x$. Clearly, these numbers can be filled by one top-down traversal of the BDD.

How are these $\alpha$ and $\beta$ tags to be used? Well, the $\alpha$ on the root is $\mathop{\rm E}\phi$, which we wanted to know. It is the probability that the formula $\phi$ evaluates to $1$. We can decompose this probability into a sum over all paths from the root to a $1$-leaf. When we want to evaluate $\mathbin{\rm E}(\phi\land f)$, we are interested in summing only over those paths that have $f=1$. Since we made sure that all paths test the value of $f$ exactly once, we can just look at each node $x$ labeled by $f$ and follow its 1-branch to the child $x_1$: $$ \mathop{\rm E}(\phi\land f) = \sum_{\text{$x$ labeled by $f$}} \beta(x) \Pr(f=1)\, \alpha(x_1) $$ (Summing $\alpha(x)\beta(x)$ over these nodes would also count the paths with $f=0$, and so would give back $\mathop{\rm E}\phi$.)
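To make the two traversals concrete, here is a sketch on a hand-built pseudo-BDD for $({\it burglary} \lor {\it earthquake}) \land \lnot{\it awake\_john}$, with every variable tested on every path. The node names and the dictionary encoding are my own assumptions, not the paper's data structure. For each variable, the code sums over the nodes labeled by it: the probability of reaching the node, times the probability of taking its 1-branch, times the $\alpha$ of the 1-child.

```python
from fractions import Fraction

PROB = {
    'burglary': Fraction(1, 10),
    'earthquake': Fraction(1, 5),
    'awake_john': Fraction(7, 10),
}

# node -> (variable, 0-child, 1-child); parents are listed before children.
# 'F' and 'T' are the 0-leaf and the 1-leaf.
NODES = {
    'root': ('burglary', 'e0', 'e1'),
    'e0': ('earthquake', 'a0', 'a1'),
    'e1': ('earthquake', 'a1', 'a1'),   # redundant test, kept on purpose
    'a0': ('awake_john', 'F', 'F'),     # redundant test, kept on purpose
    'a1': ('awake_john', 'T', 'F'),
}

def tag(nodes, prob, root='root'):
    order = list(nodes)
    # alpha: bottom-up, probability of ending at the 1-leaf.
    alpha = {'F': Fraction(0), 'T': Fraction(1)}
    for x in reversed(order):
        var, lo, hi = nodes[x]
        alpha[x] = (1 - prob[var]) * alpha[lo] + prob[var] * alpha[hi]
    # beta: top-down, probability of visiting the node.
    beta = {x: Fraction(0) for x in order + ['F', 'T']}
    beta[root] = Fraction(1)
    for x in order:
        var, lo, hi = nodes[x]
        beta[lo] += (1 - prob[var]) * beta[x]
        beta[hi] += prob[var] * beta[x]
    return alpha, beta

def conjoined_expectations(nodes, prob, alpha, beta):
    result = {var: Fraction(0) for var in prob}
    for x, (var, lo, hi) in nodes.items():
        result[var] += beta[x] * prob[var] * alpha[hi]
    return result

alpha, beta = tag(NODES, PROB)
conjoined = conjoined_expectations(NODES, PROB, alpha, beta)
print(alpha['root'], conjoined)
```

Each node is touched a constant number of times, so the whole thing is linear in the size of the pseudo-BDD.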

Some comments. After reading the paper, I was worried about two things. First, what do you do if there are cycles in the program? Second, are these BDDs small enough in practice? So, I looked up subsequent work, and I saw that both issues are better addressed in [Fierens et al., Inference and Learning in Probabilistic Logic Programs using Weighted Boolean Formulas, 2015]. I only skimmed this paper (it's much longer!), so what I say below about it may be wrong.

Cycles. I was worrying about cycles because they gave me some headaches recently, while working on [Grigore, Yang, Abstraction Refinement Guided by a Learnt Probabilistic Model, 2016]. The paper by Fierens et al. gives references to two standard ways to handle cycles. I don't think it gives more detail. Hongseok and I did use one of those methods, but we had to throw in a few more approximations and ideas to get something that works in any reasonable time. The reference I like is [Lee, A model-theoretic counterpart of loop formulas, 2005]. (The ‘novel’ part of this paper is supposed to be what to do with cycles if you also have negations. We didn't use that part. But, the paper also reviews what you do with cycles when you don't have negations, and I think that review is very readable.)

By the way, now that I have read this article carefully, I am convinced that it is very closely related to the learning part in the article by Hongseok and me. I should write up some proper, in-depth comparison.

BDD size. I was worried about size partly because the problem and a solution (automatic theory splitting) are mentioned in the paper I summarize here, and partly because I tend to be worried about size when the word ‘BDD’ is mentioned. Automatic theory splitting simply means that you decompose the formula into a conjunction of parts that don't share variables, and build one BDD for each part, rather than one BDD for the whole thing. I suspect it's not too often that you can actually do this. Also, this amounts to using a restricted form of decision-DNNF. To remind you, a decision-DNNF has two types of nodes: decision nodes, which behave like those in a BDD, and conjunction nodes. A conjunction node requires that its two (decision-DNNF) children don't share variables. In general, conjunction nodes and decision nodes can be interspersed. Automatic theory splitting amounts to disallowing conjunctions below decisions. The paper by Fierens et al. doesn't have this limitation: it uses more results from the area of knowledge compilation to build decision-DNNFs.
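The splitting step itself is just connected components. Below is a sketch that assumes the formula is already given as a list of conjuncts, each represented only by its set of variables; that representation and the name `split_conjuncts` are my own, and the paper may well organize this differently.

```python
def split_conjuncts(conjuncts):
    """Group conjuncts (given as sets of variables) into parts that share no variables.

    Returns a sorted list of groups, each a list of conjunct indices.
    """
    parent = list(range(len(conjuncts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # variable -> index of some conjunct that mentions it
    for i, variables in enumerate(conjuncts):
        for v in variables:
            if v in owner:
                parent[find(i)] = find(owner[v])
            else:
                owner[v] = i

    groups = {}
    for i in range(len(conjuncts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

parts = split_conjuncts([{'a', 'b'}, {'b', 'c'}, {'d'}, {'d', 'e'}, {'f'}])
print(parts)
```

One BDD would then be built per group, and the expectations of the parts multiply because the parts are independent.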

Right, I didn't say what an ‘interpretation’ is. It's what I call above ‘observation’.

Finding Counterexamples from Parsing Conflicts

2015-10-05T12:12:00.003+01:00

How the CUP parser generator explains conflicts.

I found the abstract of [Isradisaikul, Myers, Finding Counterexamples from Parsing Conflicts, 2015] quite exciting. After reading the paper, I think it delivers what the title and the abstract promise: better errors for parser generators. If you have ever worked with an LR parser generator, then you know that sometimes it reports shift/reduce or reduce/reduce conflicts. The error usually points to a couple of grammar rules and says ‘here is the problem’. I never feel like I understand what exactly the problem is until I have an example of what could go wrong. So, once I see the warning/error from the parser generator, I try to come up with examples. Well, according to this paper, the CUP parser generator gives you examples. How cool is that?

Still, I'm not completely enthusiastic about the paper. I think the algorithm could be better described. For example, I would have preferred to see pseudocode. I realize that it is not easy to add pseudocode, and I realize that the pseudocode could be too large to be useful. But. I think that adding pseudocode would've forced the presentation to be more precise, which is a good thing. I think that striving for small pseudocode would've forced isolating the core idea, which is a good thing. The heuristics that embellish the core idea could remain in prose, as they are now.

Now let's get technical, a bit. Recall that the ambiguity problem is in P for NFAs (nondeterministic finite automata), and undecidable for CFGs (context free grammars).

The ambiguity problem for NFAs asks whether there are two runs that accept the same word. To solve it, we explore pairs $(q_1,q_2)$ of states that can be reached by the same word. For pairs with $q_1=q_2$ we also care whether they were reached through distinct runs. Formally, we define a graph that has a transition $(q_1,q_2,b) \to (q'_1,q'_2,b')$ when the NFA has transitions $t_1 = \bigl(q_1 \stackrel{\ell}{\to} q'_1\bigr)$ and $t_2 = \bigl(q_2 \stackrel{\ell}{\to} q'_2\bigr)$ for some letter $\ell$ and $b'=b \lor (t_1 \ne t_2)$. The third component is a boolean that keeps track of whether the runs have diverged. We then ask if there is a path $(q,q,0) \leadsto (q',q',1)$ with $q$ initial and $q'$ final. We answer this question with BFS (breadth first search). If the NFA has $m$ transitions and $n$ states, then the graph we defined has $\le 2m^2$ edges and $2n^2$ vertices. Thus, the BFS takes $O(m^2+n^2)$ time. This, by the way, you can find in [Even, On Information Lossless Automata of Finite Order, 1965].
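Here is a sketch of that BFS in Python. It follows the description above literally: both runs start in the same initial state and must end in the same final state. The encoding of the NFA as a list of `(source, letter, target)` triples is an assumption of mine.

```python
from collections import defaultdict, deque

def is_ambiguous(transitions, initial, final):
    """Search for a path (q, q, False) ~> (q', q', True), q initial and q' final."""
    by_source = defaultdict(list)
    for t in transitions:
        by_source[t[0]].append(t)
    start = [(q, q, False) for q in initial]
    seen = set(start)
    queue = deque(start)
    while queue:
        q1, q2, diverged = queue.popleft()
        if diverged and q1 == q2 and q1 in final:
            return True
        for t1 in by_source[q1]:
            for t2 in by_source[q2]:
                if t1[1] != t2[1]:
                    continue  # both copies must read the same letter
                vertex = (t1[2], t2[2], diverged or t1 != t2)
                if vertex not in seen:
                    seen.add(vertex)
                    queue.append(vertex)
    return False

# Two runs on 'aa' that split and rejoin.
diamond = [(0, 'a', 1), (0, 'a', 2), (1, 'a', 3), (2, 'a', 3)]
# The runs split, but then read different letters.
fork = [(0, 'a', 1), (0, 'a', 2), (1, 'a', 3), (2, 'b', 3)]
print(is_ambiguous(diamond, {0}, {3}), is_ambiguous(fork, {0}, {3}))
```

Indexing transitions by source state keeps the total work proportional to the number of edges and vertices of the product graph.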

(I have posted this previously as a problem on my blog and on SPOJ. I believe my reference solution for SPOJ is slightly different from what I describe above, but I'm too lazy to check.)

The paper by Isradisaikul and Myers essentially uses the same trick, of running two copies of the machine in parallel. Except in their case, the machine is not an NFA but an LALR parser. If you wonder how the problem could be undecidable, the answer is simple: an LALR parser has an infinite number of configurations, because of the stack. Now, I should say that many other papers are based on the same trick of running two copies in parallel. What Isradisaikul and Myers do is exploit the structure of an LALR parser a bit to squeeze out some efficiency.

One thing they do is that they distinguish nonunifying from unifying counterexamples. In the terminology from above, a nonunifying counterexample is a word that corresponds to a path $(q,q,0) \leadsto ({\cdot},{\cdot},1)$ with $q$ initial; a unifying counterexample is a word that corresponds to a path $(q,q,0) \leadsto (q',q',1)$ with $q$ initial and $q'$ final. The idea is that in some cases you can find where the paths diverge (the conflict) but it's difficult to find the continuation that gives two different parse trees. In fact, they give up after some timeout. (Another way to describe the difference is to say that a unifying counterexample remains a counterexample even for a GLR parser.)

There are other optimizations they propose. But, frankly, I didn't parse those carefully. So, if you want to add better error messages to your favorite parser generator, then you'll have to read the paper yourself, not just this post. :)

Verdi

2015-09-19T18:28:00.001+01:00

A system for implementing and verifying distributed algorithms.

The abstract of the paper [Wilcox et al., Verdi: a Framework for Implementing and Formally Verifying Distributed Systems, PLDI 2015] seemed quite exciting. Alas, I can't say I recommend reading the paper. For my taste, its information density is way too low. I can't say that I recommend reading the code either, but that's only because I didn't read it. All indications point to it being a cool thing to read.

One worthwhile thing I learned from the paper is that there is a new and cool consensus algorithm: Raft. I should probably check it out.

There is one thing that bugged me from the beginning to the end of the article. The article states that,

Verdi's key conceptual contribution is the use of verified system transformers to separate concerns of correctness and fault tolerance.

What does this mean? Basically, you implement your algorithm assuming the network is perfect. Then, if you want to run it on a network that, say, stutters, you invoke one of these transformers and you get an implementation for a stuttering network. (And, of course, the proof of correctness gets transformed as well.) That's all nice. But. But I studied networks in my undergrad degree, and I distinctly recall spending endless boring hours discussing the solution to this kind of issue: add an abstraction layer. So, naturally, I was expecting to see how this code transformation approach (which, by the way, is a functor) compares to abstraction layers. Alas, there is no mention of the latter.

This tension (between the abstraction layers that I expected and the functors I was given) reminded me of another paper on my reading list: [Gu et al., Deep Specifications and Certified Abstraction Layers, POPL 2015]. Its abstract seemed to imply that they explain how one is supposed to write formal specifications that correspond to the informal idea of abstraction layers.

Push/Pull for Transactions

2015-09-12T20:17:00.000+01:00

A unified model for understanding many transactional systems.

There is little doubt about how one should write concurrent programs. Like in all programming, you must care about correctness first, efficiency second. If you care about correctness, you do not want to use compare-and-swap. You don't even want to use locks. What you want to use is atomic blocks. When you mark a piece of code as atomic, you effectively say, ‘I want to reason about the correctness of this code as if nothing runs concurrently’. Now it is up to the language implementor to ensure that the as if part holds. The part of the language implementation that is responsible for this is called a transactional system.

When the transactional system is implemented, there is a similar tension between correctness and efficiency. At one end of the spectrum, one could use a global lock that each atomic block must hold while executing. This is correct: Atomic blocks run as if nothing runs concurrently because nothing runs concurrently. (I am assuming that every statement in the program is in an atomic block: If a statement is not in an atomic block, then it is by default wrapped in a tiny atomic block, made just for it.) At the other end of the spectrum, you let everything run concurrently, and pray that there is no interference. This is really fast. (And incorrect, obviously.) Needless to say, no one wants an implementation at either end of the spectrum. But, crafting something in the middle seems like a dark art.
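The correct-but-slow end of the spectrum fits in a few lines. The sketch below is my own illustration in Python, not anything from the paper: `atomic` is a hypothetical context manager backed by one global lock.

```python
import threading
from contextlib import contextmanager

# One lock for the whole program; reentrant, so atomic blocks may nest.
_GLOBAL_LOCK = threading.RLock()

@contextmanager
def atomic():
    """Run the body as if nothing else runs, because nothing else does."""
    with _GLOBAL_LOCK:
        yield

shared = {'counter': 0}

def worker(repetitions):
    for _ in range(repetitions):
        with atomic():
            shared['counter'] += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(shared['counter'])
```

Every atomic block serializes on the same lock, so the final count is always 40000, and there is never any parallelism inside atomic blocks.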

This dark art may not remain so dark for long. At least one recent paper aims to shed some light on it. The paper is [Koskinen, Parkinson, The Push/Pull Model of Transactions, PLDI 2015].

Here is how it works. We start with some underlying sequential language. It can be pretty much anything: We model its statements as relations on stores. Then we add a few concurrency features: fork, atomic, and operations on the shared state. (The latter are called methods in the paper.) Now we have a language for concurrent programming, and we can give it an operational semantics, which just encodes the intuitive idea that atomic blocks aren't interrupted by other threads. This is the reference semantics.

The second step is to compile a program with atomic blocks into one without. Instead of atomic blocks, we will use a small set of operations, the most important of which are push and pull. For this compilation to be correct, certain conditions need to hold. (These conditions appear in Figure 4b as assumptions. And, by the way, the paper does not describe this step as ‘compilation’. I do.)

And here's the bang.

As long as the compilation is correct, the program with push and pull is guaranteed to do the same as the given one, the one with atomic. This is a big deal because the correctness conditions in Figure 4b are fairly simple, and because they allow you to do rather weird-looking stuff.

Now let's step back. I described above my understanding of the paper and its claims. I think the idea is really cool and useful. But.

The paper did not convince me that the conditions in Figure 4b are sufficient. The authors know this: They repeatedly say that for lack of space they include only a proof sketch, and that their proof (from the technical report) needs to be checked mechanically in the future. (By the way, the idea of the proof is a bisimulation that doesn't observe the insides of atomic blocks.) On the one hand, I rather trust Erik and Matt. (Although my trust was slightly shaken when I saw that t in Figure 1 is unused. Just joking.) On the other hand, this kind of proof seems extremely brittle. The authors are right: a mechanically checked proof would be cool.

Now go see the paper for yourself: [Koskinen, Parkinson, The Push/Pull Model of Transactions, PLDI 2015].

Coverability for Vector Addition Systems

2015-07-29T17:11:00.000+01:00

Vector addition systems are a model of computation. For this model, coverability is one of the easiest decision problems. This post presents an old (1978) upper bound, with accompanying code.

A VAS (vector addition system) is a model of computation that pops up occasionally. In this model, the state is a vector of $d$ nonnegative integers. At each time step, the state changes from $v$ to $v+\delta$. The shift $\delta$ is chosen nondeterministically from some given finite set $\Delta$. If states $V$ are active now, then states $V'$ are active next, where $$ V' \;=\; \{\,v+\delta\mid\text{$v \in V$ and $\delta \in \Delta$ and $v+\delta \ge 0$}\,\} $$ Here's how you compute $V'$ from $V$ in Python:

def check_many(d, vs):
  for v in vs:
    assert d == len(v)
def check_one(d, *vs):
  check_many(d, vs)
def vas_step(d, Delta, vs):
  check_many(d, Delta)
  check_many(d, vs)
  ws = set()
  for v in vs:
    for delta in Delta:
      w = tuple(v[i] + delta[i] for i in range(d))
      if all(wi >= 0 for wi in w):
        ws.add(w)
  return ws

The reachability problem asks whether there is some run going from $u$ to $w$. The coverability problem asks whether there is some run going from $u$ to $\ge w$. Here's how coverability looks in Python:

def LE(u, v):
  assert len(u) == len(v)
  return all(u[i] <= v[i] for i in range(len(u)))
def vas_cover(d, Delta, u, w):
  check_many(d, Delta)
  check_one(d, u, w)
  vs = [u]
  while len(vs) > 0 and not any(LE(w, v) for v in vs):
    vs = vas_step(d, Delta, vs)
  return len(vs) > 0

For example, the code

print(vas_cover(2, [[1,-1], [-2,1]], [3,2], [10,10]))
print(vas_cover(2, [[1,-1], [-1,2]], [3,2], [10,10]))

gives

False
True

The code I gave has a problem, though. This doesn't terminate:

print(vas_cover(2, [[1,0]], [0,0], [0,1]))

Can we fix it? One way to fix it is to stop after some time if we still didn't find a vector $\ge w$. But, after how much time? We can find one answer in [Rackoff, The Covering and Boundedness Problems for Vector Addition Systems, 1978]. The bound given by Rackoff depends on $\Delta$ and $w$ but not on $u$. Let $N$ be the biggest absolute value of a number occurring in $\Delta$ or $w$. Then it is sufficient to try $L_d$ steps, where $$\begin{align} L_0 &= 1 \\ L_k &= (N \cdot L_{k-1})^k + L_{k-1} &&\text{for $k \gt 0$} \end{align}$$ The bounds $L_0, L_1, L_2, \ldots$ are of order $N^0, N^1, N^{1 \cdot 2 + 2}, N^{1\cdot 2\cdot 3 + 2\cdot 3+3}, \ldots$; very roughly, $L_d \in O(N^{d\cdot d!})$. Python code again:

def rackoff_bound(d, Delta, w):
  check_many(d, Delta)
  check_one(d, w)
  # N is the biggest absolute value of a number occurring in Delta or w.
  N = max(abs(x) for vec in list(Delta) + [w] for x in vec)
  bound = 1
  for k in range(1, d + 1):
    bound = (N * bound) ** k + bound
  print('bound',bound)
  return bound

def rackoff_vas_cover(d, Delta, u, w):
  check_many(d, Delta)
  check_one(d, u, w)
  vs = [u]
  for i in range(rackoff_bound(d, Delta, w)):
    if any(LE(w, v) for v in vs):
      return True
    vs = vas_step(d, Delta, vs)
  return False
print(rackoff_vas_cover(2, [[1,-1], [-2,1]], [3,2], [10,10]))
print(rackoff_vas_cover(2, [[1,-1], [-1,2]], [3,2], [10,10]))
print(rackoff_vas_cover(2, [[1,0]], [0,0], [0,1]))

The output:

bound 12111
False
bound 12111
True
bound 6
False

We can solve the case that previously didn't terminate, using a bound of 6. But, why is this bound sufficient? Rackoff's proof uses two tricks: induction on coordinates, and a bounding box. Let $I \subseteq \{0,1,\ldots,d-1\}$ be a set of coordinate indices. For a vector $v \in \mathbb{Z}^d$, let $v[I] \in \mathbb{Z}^{|I|}$ be its restriction to the coordinates indicated by $I$. We can lift the same notation to sets of vectors; for example, if $\Delta \subset \mathbb{Z}^d$, then $\Delta[I] \subset \mathbb{Z}^{|I|}$ and $|\Delta[I]| \le |\Delta|$. We will prove by induction the following stronger statement:

Fix the target $w$, and the set of moves $\Delta$. For all $u$ and $I$, if there exists a run from $u[I]$ to $w'[I]$ using moves from $\Delta[I]$ such that $w'[I]\ge w[I]$, then there exists such a run of length $\lt L_{|I|}$.

We will assume there exists a run, and we will show that there exists one of length $\lt L_{|I|}$, by induction on the size of $I$.

The base case $|I|=0$ holds because $w'[\emptyset] \ge w[\emptyset]$ always holds.

For the inductive case, we use a bounding box of size $S$: that is, we split runs into those that use only coordinates $\lt S$, and the others. If all coordinates are $\lt S$, then interesting runs have length $\lt S^{|I|}$. (If $v_1[I]=v_2[I]$ and $v_1[I] \leadsto v_2[I]$ is a subrun, then we simply cut it out.) But, maybe it's not possible to stay within this box. Then, there is some vector $v$ that is the first one outside the box. We cut our run into two pieces, $u[I] \leadsto v[I]$ and $v[I] \leadsto w'[I]$, which we analyze separately. The first piece is of length $\lt S^{|I|}$ for the same reason as before. Let's move to the second piece, $v[I] \leadsto w'[I]$. Let $I' \subset I$ be those coordinates of $v$ that are still within the bounding box. Then, by the induction hypothesis, there is a run of length $\lt L_{|I'|}$ from $v[I']$ to some vector that is $\ge w[I']$, using moves from $\Delta[I']$. What happens with the coordinates $I \setminus I'$? In $v$ they are $\ge S$, and each of the at most $L_{|I'|}-1$ moves lowers them by at most $N$, so they can become no smaller than $S-N(L_{|I'|}-1)$. We would like $w'[I \setminus I'] \ge w[I \setminus I']$, so we pick $S$ such that $S-N(L_{|I'|}-1) \ge N$; that is, we pick $S \stackrel{{\rm def}}{=} N \cdot L_{|I|-1}$, which is enough because $|I'| \le |I|-1$ and the $L_k$ are increasing. With this choice, we get exactly the inductive case from the definition of $L_k$.

That concludes the proof.

Remark: Apart from presentation issues, there are two minor differences between what I said above and what you will find in Rackoff's paper. The first is that he does the induction on sets of indices of the form $\{0,1,\ldots,k-1\}$ and justifies this by saying at some point ‘without loss of generality’. That's perfectly fine once you understand the proof, but it confused me for a while, although I can't say why. I decided to just go through all subsets of $\{0,1,\ldots,d-1\}$. The second is that I measure the length of runs by the number of moves, while Rackoff's paper uses the number of states. I may have off-by-one errors, so I wasn't too explicit about this above. :p (But, if some such errors slipped through, please let me know so I can fix the text and the code.)

What does this give us? A coverability algorithm that works in $O(d \cdot |\Delta|^{N^{d\cdot d!}})$ time. Not terribly fast, but still better than Ackermannian or non-primitive recursive.

I should say that there are some ways to speed up the algorithm, although they probably won't help with the asymptotics. One obvious change is to add a test in the loop of rackoff_vas_cover: if vs is empty, then answer False. A less obvious change is to run the whole thing backward. [update 20150731: Actually, there is a better upper bound for the backward algorithm. [Bozzelli, Ganty, Complexity Analysis for the Backward Coverability Algorithm for VASS, 2011] shows that the runtime of the algorithm below is of the same order of magnitude as $L_d$: doubly-exponential. Compare their Theorem 1 with their Theorem 2. Their Lemma 5 describes the algorithm from below, with minor differences.]

def vas_cover_backward(d, Delta, u, w):
  check_many(d, Delta)
  check_one(d, u, w)
  vs = [tuple(w)]
  while True:
    if any(LE(v, u) for v in vs):
      return True
    us = set()
    for v1 in vs:
      for delta in Delta:
        u1 = tuple(max(0, v1[i] - delta[i]) for i in range(d))
        if all(not LE(v2, u1) for v2 in vs):
          us.add(u1)
    if len(us) == 0:
      return False
    vs = set(v for v in vs if all(not LE(v, u1) for u1 in us))
    vs |= us
  assert False
print(vas_cover_backward(2, [[1,-1], [-2,1]], [3,2], [10,10]))
print(vas_cover_backward(2, [[1,-1], [-1,2]], [3,2], [10,10]))
print(vas_cover_backward(2, [[1,0]], [0,0], [0,1]))

After the $k$th iteration, the set vs contains the minimal vectors that can reach $\ge w$ in $\le k$ steps. I know this algorithm from Sylvain Schmitz and from Section 2.2.2 of the algo-wqo lecture notes. [update: Sylvain also pointed out a mistake I had in a previous version of the code.] The algorithm is presented there in the more general setting of WSTSs (well-structured transition systems). (A WSTS is essentially a transition system for which $a \le b$ encodes what we'd intuitively describe as ‘$b$ is an abstraction of $a$’ in program analysis.) Since abstraction came into discussion, I should say that midway through writing this post I googled for more recent presentations of Rackoff's proof. I found an article that presents the proof in a more general (and abstract!) setting: [Lazic, Schmitz, The Ideal View of Rackoff's Coverability Technique, 2015]. I didn't read it, so I don't know what it says.

Let's move to a slightly higher vantage point. The more complicated the model of computation, the harder its decision problems. The more precise the question being asked by a decision problem, the harder to answer it. VASs are more complicated than finite automata, and not too far behind Turing machines. Coverability is one of the least precise questions one could ask of them, and it is doable in exponential space using the algorithms from above. Reachability is a more precise question, so solving it seems harder. Its story is in this post by Lipton: An EXPSPACE lower bound. So, one way to make life harder is by asking more precise questions. Another way is to make the model of computation more complicated: coverability is Ackermann-hard when you add reset [Schnoebelen, Revisiting Ackermann-hardness for Lossy Counter Machines and Reset Petri-Nets, 2010].

Tree Buffers

2015-04-27T23:17:00.001+01:00

Circular buffers are one of the most fundamental and pervasive data structures. They are an efficient implementation for buffering linear sequences. Tree buffers are a more general data structure.

You have heard of circular buffers: Alice reads out a sequence $x_0$, $x_1$, $x_2$, … of facts. Bob's memory has size $h$, and he stores $x_k$ at address $k \bmod h$. So, Bob always remembers the last $h$ facts that Alice read.
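A minimal sketch of Bob's memory in Python, assuming nothing beyond the description above. The class and method names (`CircularBuffer`, `push`, `recent`) are made up for illustration.

```python
class CircularBuffer:
    """Remembers the last h items of a sequence (Bob's memory)."""

    def __init__(self, h):
        self.h = h
        self.slots = [None] * h
        self.count = 0  # number of facts heard so far

    def push(self, x):
        # Fact number k is stored at address k mod h.
        self.slots[self.count % self.h] = x
        self.count += 1

    def recent(self):
        """The remembered facts, oldest first."""
        start = max(0, self.count - self.h)
        return [self.slots[k % self.h] for k in range(start, self.count)]


buf = CircularBuffer(3)
for fact in ["x0", "x1", "x2", "x3", "x4"]:
    buf.push(fact)
print(buf.recent())
```

Each `push` takes constant time and overwrites the oldest fact, which is the behaviour the tree version tries to preserve.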

Let's see one way to generalize the problem. Instead of reading out a sequence of facts, Alice gives reasons for the stated facts. In other words, Alice describes a tree, not a sequence. For example, Alice could describe the tree

by saying $$\begin{align*} &\text{$0:a$} \\ &\text{$1:b$, follows from $0$}\\ &\text{$2:b$, follows from $0$}\\ &\text{$3:c$, follows from $1$} \end{align*}$$ Each node has an identifier and a label. In this example, the identifiers are $0$, $1$, $2$, and $3$, but they could be anything that doesn't repeat. The labels put on nodes, on the other hand, could repeat. Indeed, the label $b$ repeats in the example above. You may think of identifiers as memory addresses, and of labels as memory content.

Alice could describe the same tree by saying $$\begin{align*} &\text{$0:a$} \\ &\text{$2:b$, follows from $0$}\\ &\text{$1:b$, follows from $0$}\\ &\text{$3:c$, follows from $1$} \end{align*}$$

In the case of the circular buffer, Bob recalls only the $h$ most recent facts. What would the equivalent be for trees? However we define the equivalent, the following properties are desirable:

What constitutes a recent fact should depend on the structure of the tree, not on the order in which Alice decided to describe the tree.
The case of a skinny, chain-like tree should simplify to the case handled by circular buffers.

For example, if the last thing Alice said was ‘$y$ follows from $x$’ (omitting the label of $y$), then we'd expect to have in memory $h$ ancestors of $y$. But, this implies that one timestep earlier, just before Alice said ‘$y$ follows from $x$’, we must have had in memory $h-1$ ancestors of $x$. If $x$ can be any of the nodes that Alice mentioned previously, then we must keep in memory all the nodes that Alice mentioned! We just can't do better unless we ask for a little help from Alice: We'd like her to also tell Bob if she isn't going to add any more children to a node.

A node is implicitly active when Alice adds it to the tree, then the node possibly receives some children, then the node is explicitly made inactive, and then the node receives no more children. In particular, a node is always active immediately after being added to the tree. In this context, a natural generalization of the linear case is the following: Ensure that Bob keeps in memory $h$ recent ancestors for each active node. Another way to say this is that we want to keep in memory those nodes that are at distance $\le h$ from some active node.
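To make the target concrete, here is a naive Python sketch that computes, from scratch, which nodes are at distance $\le h$ from some active node (following the sentence above as written). It is only a specification to compare against, not the constant-time algorithm; the names `needed_nodes`, `parent`, and `active` are assumptions made for this example.

```python
def needed_nodes(parent, active, h):
    """Map each needed node to its distance from the nearest active node.

    parent: dict from node id to parent id (None for the root)
    active: set of active node ids
    A node is needed if walking up from some active node reaches it
    in at most h steps.
    """
    dist = {}
    for a in active:
        node, d = a, 0
        while node is not None and d <= h:
            if dist.get(node, h + 1) <= d:
                break  # a closer active node already handled the ancestors
            dist[node] = d
            node, d = parent[node], d + 1
    return dist


# The tree from the example: 1 and 2 follow from 0, and 3 follows from 1.
parent = {0: None, 1: 0, 2: 0, 3: 1}
print(needed_nodes(parent, {3}, 1))
```

Recomputing this after every message costs time proportional to the size of the tree, which is exactly what Bob wants to avoid.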

Of course, Bob would like to not keep in memory much more than necessary. Also, Bob wants to process each message from Alice in constant time. Can he do this? The task is difficult because one message from Alice may change the height of many nodes — too many to process in constant time. You can check this by playing with the tree below.

(Update: The slides from CAV2015 have a better click-and-see interface.)

Instructions: Click on a green node to add a child. Control-click a green node to deactivate it. The $h$ is fixed to be $3$. The nodes are labeled by their distance to an active node. The green nodes are active, the yellow nodes are inactive but needed, the red nodes are not needed. Example: To simulate a linear buffer, repeatedly do the following: (1) click on the green node, (2) control-click on the same node.

Yet, Bob can do it! For details, see the paper and the code. There, you will also find a generic application for this data structure: tracing nondeterministic automata, not necessarily finite. This application has special cases like

providing error traces during runtime verification, and
providing functionality similar to that of regexp capturing groups

RIP Herman Conjecture (2005-2015)

2015-04-03T18:32:00.000+01:00

In 1990, Herman proposed a sweet self-stabilizing protocol. In 2005, McIver and Morgan conjectured that its running time is $\frac{4}{27} N^2$. Well … they were right.

Do you remember playing tag as a kid? What about hide-and-seek? Of course you do. In these games one kid is special: the one running after others in tag, or the seeker in hide-and-seek. (Although ‘unlucky’ would perhaps be a better word.) The game usually has rules for picking this kid. For example, the seeker is the person first found on the previous round of the game. The problem is that these rules don't apply to the first round. To pick the first seeker (or the first tagger) kids sit around in a circle and perform a well known ritual: They repeatedly recite a chant, for each syllable pointing to the next kid in the circle. Whoever gets the last syllable is out of the circle. The last person to remain is the chosen special person. This algorithm has an obvious problem: You have to choose where you start pointing in the first place, and once you do that the outcome is completely determined. Thus, all appearance of randomness is illusory. It only appears random because the outcome is somewhat hard to figure out quickly. But not impossible: Josephus was able to do it.
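The determinism is easy to see in code. Below is a small Python sketch of the chanting ritual, assuming kids are numbered $0,\ldots,n-1$ around the circle and that each chant has a fixed number of syllables; the function name `last_kid` and its parameters are made up for illustration.

```python
def last_kid(n, syllables, start=0):
    """Return the kid who remains after repeatedly chanting around the circle.

    The chant starts by pointing at kid `start`; the kid who gets the last
    syllable leaves, and the next chant starts with the following kid.
    """
    kids = list(range(n))
    pos = start
    while len(kids) > 1:
        pos = (pos + syllables - 1) % len(kids)
        kids.pop(pos)       # this kid is out of the circle
        pos %= len(kids)    # wrap around if the last kid in the list left
    return kids[0]


print(last_kid(7, 3))           # the outcome depends only on n, syllables, start
print(last_kid(7, 3, start=1))
```

No randomness appears anywhere: once the starting point is chosen, the result is fixed.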

It's possible to do better, using a sweet little algorithm of Herman. Kids start standing in a circle, as before. Some of them (maybe all) have chocolate coins. On each round, all kids with chocolate coins toss them. If they get heads, they have to pass the coin to the left. If they get tails, they get to keep the coin. And here comes the sweet part: If a kid has two coins, THEY GET TO EAT THEM! Anyway, since coins only get eaten in pairs, at some point only one coin will be left if you started with an odd number of coins. The kid that holds this last coin is chosen as the seeker (or the tagger).
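A rough simulation of the protocol in Python, as a sketch under a few assumptions: kids are numbered $0,\ldots,N-1$, ‘left’ is taken to mean index $+1$ modulo $N$, and each kid starts with at most one coin. The names `herman_round` and `herman` are made up for this example.

```python
import random


def herman_round(coins, n, rng):
    """One synchronous round; coins is the set of kids holding one coin."""
    counts = {}
    for p in coins:
        # Heads: pass the coin to the left neighbour. Tails: keep it.
        q = (p + 1) % n if rng.random() < 0.5 else p
        counts[q] = counts.get(q, 0) + 1
    # A kid who ends up with two coins eats both of them.
    return {p for p, c in counts.items() if c == 1}


def herman(coins, n, seed=0):
    """Run rounds until at most one coin is left; return (coins, rounds)."""
    rng = random.Random(seed)
    coins = set(coins)
    rounds = 0
    while len(coins) > 1:
        coins = herman_round(coins, n, rng)
        rounds += 1
    return coins, rounds


winner, rounds = herman({0, 5, 10}, 16, seed=1)
print(winner, rounds)
```

A kid can receive a coin only from one neighbour, so nobody ever holds more than two coins, and the parity of the number of coins never changes.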

Obviously, the outcome is random and unpredictable, because it depends on the coin tosses. But, how long does this take? In 1990, Herman showed that the expected time is $O(N^2 \log N)$, where $N$ is the number of kids. (This is independent of the number of chocolate coins you started with. The initial number of chocolate coins could be any odd number $3\le K\le N$.) In 2004, Fribourg et al. improved the upper bound to $2N^2$. In 2005, Nakata improved the upper bound to $0.936 N^2$. Also in 2005, McIver and Morgan showed that the upper bound needs to be $\ge\frac{4}{27}N^2$, and conjectured that it is $\frac{4}{27}N^2$. Subsequently, various papers showed upper bounds of $0.64N^2$, $0.521N^2$, $0.167N^2$, and $0.156N^2$. Now we know that McIver and Morgan were right:

Maria Bruna, Radu Grigore, Stefan Kiefer, Joel Ouaknine, and James Worrell, Proving the Herman-Protocol Conjecture, 2015

Here's how the proof goes.

First, we describe configurations of chocolates by the distances between them. For example, $(1,2,3,4,2,3,1)$ describes the following situation:

The chocolate is brown and the arrow on the right shows from where we start. We call these numbers gaps $(g_0,\ldots,g_6)$, and their sum is $N=16$, the number of kids. After we normalize them so their sum is $1$, we just call them ‘ex’: In this case $(x_0,\ldots,x_6)= \bigl(\frac{1}{16},\frac{2}{16},\frac{3}{16},\frac{4}{16}, \frac{2}{16},\frac{3}{16},\frac{1}{16}\bigr)$. Then we define the function $$ f(x_0,\ldots,x_{K-1}) = \hskip-1em \sum_{\substack{ 0\le i_0\lt i_1\lt i_2\lt K\\ \text{$i_2-i_1$ and $i_1-i_0$ odd}}} \hskip-2em x_{i_0} x_{i_1} x_{i_2} - \hskip-1em \sum_{\substack{ 0\le i_0\lt i_1\lt i_2\lt i_3\lt i_4\lt K\\ \text{$i_4-i_3, \ldots, i_1-i_0$ all odd}}} \hskip-2em 24 x_{i_0} x_{i_1} x_{i_2} x_{i_3} x_{i_4} $$ and show that $4 N^2 f(x_0,\ldots,x_{K-1})$ is an upper bound for the expected stabilization time if the initial configuration is described by $(x_0,\ldots,x_{K-1})$. Finally, we show that in the simplex $$ D \;=\; \{\,(x_0,\ldots,x_{K-1}) \mid \text{$x_0+\cdots+x_{K-1}=1$ and $x_0,\ldots,x_{K-1}\ge0$}\,\} $$ the maximum of $f$ is $1/27$: $$ \max_{{\bf x}\in D} f({\bf x}) = \frac{1}{27} $$
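To get a feel for $f$, here is a direct Python transcription of the formula above, using exact fractions. It is a brute-force sketch for small $K$ only; the helper names `alternating` and `term` are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations


def f(x):
    """Evaluate f(x_0, ..., x_{K-1}) as defined above, by brute force."""
    K = len(x)

    def alternating(idx):
        # All differences between consecutive chosen indices are odd.
        return all((b - a) % 2 == 1 for a, b in zip(idx, idx[1:]))

    def term(idx):
        p = Fraction(1)
        for i in idx:
            p *= x[i]
        return p

    s3 = sum((term(t) for t in combinations(range(K), 3) if alternating(t)),
             Fraction(0))
    s5 = sum((term(t) for t in combinations(range(K), 5) if alternating(t)),
             Fraction(0))
    return s3 - 24 * s5


gaps = [1, 2, 3, 4, 2, 3, 1]                  # the example configuration
x = [Fraction(g, sum(gaps)) for g in gaps]    # normalized so the sum is 1
print(f(x))
print(f([Fraction(1, 3)] * 3))                # three equally spaced coins
```

For three equally spaced coins only the single triple $(0,1,2)$ contributes, so $f = \frac{1}{27}$, which matches the claimed maximum.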

This last part is trickier than it may seem. For details, see the paper.

Why I use MOOCs

2014-12-26T16:08:00.001+00:00

I saw a Google Plus post saying ‘I hope to hear less about MOOCs in 2015’. Well, I hope to hear more. Here's why.

The author of the Google Plus post is a popular blogger, Daniel Lemire. The Google Plus post links to a blog post, in which Lemire explains why MOOC advocates don't understand what they do, and concludes that MOOCs are probably doomed. The two main arguments are:

MOOCs are supposed to be open, but they are in fact closed, and
content itself is of little value.

None of these arguments is remotely relevant to why I use MOOCs. I will first explain why and how I use MOOCs. Then I'll come back to Lemire's arguments.

For my work, I get most of the information from articles and books. I never pay myself for articles because all the prices I've seen are ridiculous. For this reason, I sympathize with the recent open-access movement. Books, however, often provide sufficient value to justify their cost. So, I buy them.

I don't use MOOCs for work. I use MOOCs either in the morning while waking up to the smell of coffee, or in the evening while cuddling in bed. I use MOOCs to find out things for which I wouldn't otherwise have time. I use MOOCs because they have a game-like quality that keeps me addicted. What quality am I talking about? I'm talking about the quality of homeworks to come with deadlines and points. They are exactly like the small tasks that you find in games. For me, homeworks are the core of MOOCs, not lectures. In fact, I rarely look at the lectures, unless I can't do the homeworks otherwise.

In short, I play MOOCs just like I play games.

Let's get back now to Lemire's arguments.

First, suppose I agree that MOOCs are closed. So what? I only care whether I pay or not; I don't care about idealistic definitions of openness. By Lemire's own account, Facebook is closed as well. Plenty of people still use it. That's because most people are pragmatic about openness.

Second, Lemire says colleges offer something of value, but that something is not content — content is cheaply available online and from books. Instead, according to Lemire, the value of colleges resides in (a) diplomas, (b) physical meetings, and (c) an ‘experience’. (I don't know what point (c) is supposed to mean.) Here, I kind of agree. Colleges indeed do not offer content: for my work, I do indeed easily get content from Google and Amazon. And indeed physical meetings are valuable. More precisely, it is great to interact regularly with colleagues and lecturers.

But, what does this have to do with MOOCs? Oh, it must be that some people see MOOCs as an alternative to colleges. That is indeed not a very balanced alternative: colleges win hands down. But, I see MOOCs as an alternative to Candy Crush. And, here, MOOCs win hands down.

Long live the MOOCs, the best games I ever played!

Dynamic Dispatch

2014-10-31T20:34:00.000+00:00

In which I explain how to solve puzzle questions involving dynamic dispatch.

Goto is considered harmful. For example, it is tedious to specify which statement to execute next with absolute precision. Hence, modern languages like INTERCAL do not have goto. Instead they have the much better comefrom.

10 comefrom 20
20 print "forever"

The statement comefrom is so much better than goto. The exercise of building the flowgraph of a program stops being boring. Instead, reconstructing the flowgraph feels like solving an entertaining puzzle. The statements of the program are the pieces of the puzzle. Initially, you can't tell much by looking around them. You have to spread your net wide until you find two pieces that click. You join them, and then repeat, trying to find another fit.

Popular languages like Java are worse in this respect. For the most part, the flow of control is rather clear, boring. There is one exception, though: dynamic dispatch. I have seen several puzzles that, at their core, were about figuring out what dynamic dispatch does.

Alas, if you understand how dynamic dispatch is implemented, all these puzzles become rather boring. It's not as bad as it sounds, though, because most people don't know or don't want to know how dynamic dispatch is implemented. Even better, there are different ways of implementing it. Adepts of one implementation find it difficult to communicate with adepts of the other.

Let's look at an example.

class A {}
class B extends A {}
class X { void f(A a) { } }
class Y extends X { void f(A a) {} void f(B b) {} }
public class Main {
  public static void main(String[] args) {
    A a; B b; X x; Y y;
    a = b = new B();
    x = y = new Y();
    x.f(a); x.f(b);
    y.f(a); y.f(b);
  }
}

Let's pretend to be a compiler. First, we get rid of classes.

void f_X_A(X this, A a) { }
void f_Y_A(Y this, A a) { }
void f_Y_B(Y this, B b) { }
void Main_main_StringA(String[] args) {
  a = b = new_B();
  x = y = new_Y();
  f_X_A(x,a); f_X_B(x,b);
  f_Y_A(y,a); f_Y_B(y,b);
}

The argument this used to be implicit, but is now explicit. The function names are decorated with the types of the arguments: otherwise we'd confuse them. At the call sites, we plugged in function names based on the static types; that is, we don't track the flow of any execution. Oh … oops … the function f_X_B is called but not defined. The closest one is f_X_A. Perhaps we should call that one? Sure. Why not? (says the compiler to itself)

void f_X_A(X this, A a) { }
void f_Y_A(Y this, A a) { }
void f_Y_B(Y this, B b) { }
void Main_main_StringA(String[] args) {
  a = b = new_B();
  x = y = new_Y();
  f_X_A(x,a); f_X_A(x,b);
  f_Y_A(y,a); f_Y_B(y,b);
}

This doesn't feel right. I was supposed to dispatch something at runtime, but the code from above does no work at runtime. Oh, right, let's see which function overrides which. In this case, f_Y_A overrides f_X_A: the first argument is subtyped ($Y \lt: X$), the others are exactly the same. So, I'll introduce some code that does that dispatch.

void f_X_A(X this, A a) {
  if (this instanceof Y) { f_Y_A((Y) this, a); }
  else { /* old body */ }
}
void f_Y_A(Y this, A a) { }
void f_Y_B(Y this, B b) { }
void Main_main_StringA(String[] args) {
  a = b = new_B();
  x = y = new_Y();
  f_X_A(x,a); f_X_A(x,b);
  f_Y_A(y,a); f_Y_B(y,b);
}
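The two steps above (pick a signature from static types, then pick an override from the dynamic type of the receiver) can be played out mechanically. The following Python sketch is a toy model of those rules for the example, not a description of any real compiler; the tables `SUPER` and `METHODS` and the function names are made up for illustration.

```python
SUPER = {"A": None, "B": "A", "X": None, "Y": "X"}  # class -> superclass
METHODS = {"X": ["A"], "Y": ["A", "B"]}  # class -> parameter types of its f's


def ancestors(c):
    """Yield c, its superclass, and so on up the chain."""
    while c is not None:
        yield c
        c = SUPER[c]


def static_choice(recv_static, arg_static):
    """Compile time: choose the parameter type of f from static types only."""
    visible = {p for c in ancestors(recv_static) for p in METHODS.get(c, [])}
    # Walking up from the argument type tries the closest match first.
    return next(p for p in ancestors(arg_static) if p in visible)


def dispatch(recv_static, recv_dynamic, arg_static):
    """Run time: find the override of the chosen signature."""
    param = static_choice(recv_static, arg_static)
    owner = next(c for c in ancestors(recv_dynamic)
                 if param in METHODS.get(c, []))
    return f"{owner}.f({param})"


# x and y both refer to a Y object; a and b both refer to a B object.
for recv, arg in [("X", "A"), ("X", "B"), ("Y", "A"), ("Y", "B")]:
    print(f"{recv.lower()}.f({arg.lower()}) ->", dispatch(recv, "Y", arg))
```

Note that the dynamic type of the argument never enters the computation: only the receiver is dispatched on at runtime.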

Notes. The implementation from above is what I'd call C++-like. The alternative is the Java-like implementation, which I won't describe here. Even the C++-like one I described is a caricature. The trouble is that once all details are put in, the description becomes pretty horrendous. Still, the details that do appear above are usually enough to figure out what happens in $99\%$ of puzzles.

Finally, what we really care about is not the implementation. We only care about having some simple rules that let us figure out the control flow of any program. The particular set of rules exemplified above is easy to remember as a set of actions that a compiler does. (The Java compiler certainly does not do this.) Pretending that a compiler does this and then that is just a useful mnemonic.

Why Do We Fail?

2014-10-28T08:44:00.000+00:00

What is there to do if a problem refuses to be knocked down? There's a generic problem solving strategy for that.

Suppose you work on a problem and fail to solve it. You could take a break, step back, and try a different technique. If after repeated attempts the problem fails to fall, then it is time for a change in strategy. Instead of seeing patterns in the problem itself, try to see a pattern in your failed attempts. Ask yourself: Why did all these attempts fail? What do they have in common? There are two things that may happen. One is that you find a criterion that lets you quickly rule out techniques that won't work, thus saving time. Another is that you'll find you cannot solve the problem: all techniques are doomed to fail.

Let's see how this advice applies in three situations: writing correct programs, proving $P \ne NP$, and automatically proving properties of programs. The last is what I've been working on for a while.

Coding. Designing and coding are similar activities. The difference is in scale, level of detail, and precision. When designing, you erect a tall skeleton. To work quickly, you leave out much detail, but you try to be precise. Nevertheless, designs are less precise than code, because code has to meet the minimal standard of being understood by a compiler. When coding, you flesh out the details, one organ at a time. An ‘organ’ is a functional unit, which you should code in one sitting.

There are two tricks that improve the quality of code considerably. These apply to individual coding sessions, not across coding sessions.

Begin by clarifying to yourself the high-level structure of the code. You should be able to explain why the structure is correct and good. You should also know the time and space complexity. Only once this is done, you can move on to writing the first line of code. There are some caveats to this rule of thumb. First, if it takes more than 15 minutes to clarify the structure, then you are probably trying to code more than you can in one sitting. In such a situation, the best attitude is to scale down your ambitions for the day. Second, the level of detail depends on experience: more experienced coders need less detail. As a rough guide, you should know what functions you will write, and what each of them does.
Once you finish writing the code, move away from the keyboard for 5 minutes. Then come back with a new attitude. Pretend the author is a rookie programmer (not you!), so some bugs must be lurking. Get into the meanest mindset you can — your job is to break the code.

For the second trick, the problem you are trying to solve is this: find an input that causes the program to exhibit a bug. If repeated attempts failed to exhibit a bug, then it's time for a change in strategy. Instead of trying to find a new input that may cause the program to fail, look at the ones that you already tried and make sure you understand why they don't uncover a bug. There are two things that may happen. One is that you find a criterion that lets you quickly rule out classes of inputs. The criterion would usually be a proof that the program is correct on a class of inputs. Another thing that may happen is that you realize that the program always works. That's when you find a full proof.

In other words, the search for bugs is an instance of the general strategy laid out at the beginning.

P versus NP. Some results in theoretical computer science are known as ‘barriers’. These are results that say a certain proof technique will not work. One example is that small monotone circuits cannot solve the Clique problem.

Let me remind you what the Clique problem is. Think of a graph with $n$ vertices. It can be described by a bitstring of length $\binom{n}{2}$. Each bit says whether its corresponding edge is in the graph. For example, for a graph with vertices 1, 2, 3, and 4, the possible edges are 12, 13, 14, 23, 24, 34. So, the bitstring $110110$ represents the graph that contains edges 12, 13, 23, 24. The Clique problem asks for a family of boolean circuits, one for each number of vertices, that takes such a bitstring and outputs whether the graph contains a clique of size $k$. The constant $k$ is fixed in advance. A clique is a subset of vertices that are all pairwise connected. In the previous example, 123 is a clique, because 12, 13, and 23 are all edges of the graph. Thus, if $k$ were 3, the answer would be ‘yes’; but, if $k$ were 4, the answer would be ‘no’.
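The encoding is easy to check by brute force. Here is a short Python sketch, assuming vertices are numbered from 1 and edges are listed in the order used above; the function names `edges_from_bits` and `has_clique` are made up for illustration.

```python
from itertools import combinations


def edges_from_bits(n, bits):
    """Decode a bitstring into a set of edges, in the order 12, 13, ..., 23, ..."""
    pairs = list(combinations(range(1, n + 1), 2))
    assert len(bits) == len(pairs), "need one bit per possible edge"
    return {p for p, b in zip(pairs, bits) if b == "1"}


def has_clique(n, bits, k):
    """Brute force: does some set of k vertices have all its pairs as edges?"""
    edges = edges_from_bits(n, bits)
    return any(all(e in edges for e in combinations(c, 2))
               for c in combinations(range(1, n + 1), k))


print(sorted(edges_from_bits(4, "110110")))
print(has_clique(4, "110110", 3), has_clique(4, "110110", 4))
```

This is a program, not a circuit, and it takes time exponential in $k$; it only illustrates what the circuits are asked to compute.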

The result I mentioned says that if you try to build circuits using only AND and OR gates, then the circuits will get big. More precisely, there is some $k$ for which the size of the circuit for $n$ vertices is not polynomial in $n$. The argument is based on approximations of the AND and OR gates, and it is asymmetric in how it handles AND and OR. Once negations are allowed, you can easily simulate an AND with an OR and some NOTs. So, the same argument stops working, because it may not handle AND and OR asymmetrically anymore. We do not know what happens if NOT gates are allowed. We suspect that big circuits are still needed, but don't know for sure. If we were able to prove that the result still holds in the presence of NOT gates, then we would know that $P \ne NP$. The reason is a theorem ‘whose exact authorship is apparently quite difficult to establish’ [1]:

Theorem. Let $\{f_n\}_{n \in \mathbb{N}}$ be a family of boolean functions, where $f_n$ has $n$ inputs. Let $\mathcal{L}$ be the language whose words of length $n$ are the models of $f_n$. Let $M$ be a Turing machine that recognizes the language $\mathcal{L}$. Then $$ T_M(n) S_M(n) = \Omega(L(f_n))$$ where $T_M$ and $S_M$ are the time and space used by the Turing machine, and $L(f_n)$ is the size of the smallest circuit that computes $f_n$.

Proof sketch: Given a Turing machine and a fixed $n$, you can build a circuit that handles inputs of size $n$ just like the machine. We say that Turing machines describe families of circuits; and we say the families of circuits are uniform, because they have a concise description — the machine. If no family of circuits is smaller than a certain limit, then certainly no uniform family of circuits is smaller than that limit.

In this storyline, several people attempt to prove $P \ne NP$ and fail. One of them steps back and asks why do these attempts fail? The result is a nice theorem about boolean circuits and the Clique problem. This theorem says that any attempt that does not seriously consider negation, one way or another, is doomed to fail to prove $P \ne NP$. So, this storyline is an instance of the general strategy laid out at the beginning.

Parametric Static Analysis. Static analyzers are programs that goggle other programs, looking for bugs. A naive way to analyze a program is to run it and see what it does. The trouble with this approach is that it takes forever, especially if you run the program on all possible inputs. Static analyzers do use the naive approach, but with a twist: they approximate the semantics. As an aside, note that something very similar happened in the Clique barrier: It is difficult to track what the circuit does exactly, but easier to track what the circuit does approximately. Approximation is a fundamental tool in static analysis, and it is studied systematically in the area of abstract interpretation.

A parameterized static analyzer is one that can try various approximations, tweakable by parameters. Given a program and a potential bug, the question is whether some setting of the analyzer's parameters would succeed in ruling out the bug.

Let's apply the general strategy to this problem.

The first step is to try several parameter settings. If all of them fail to rule out the bug, then it is time to ask why. We analyze the failures and we conclude that some other parameter settings won't rule out the bug either. This is exactly what we did in [Zhang et al., On Abstraction Refinement for Program Analyses in Datalog, 2014].

Which attempts to analyze? The general strategy involves making some attempts at solving the problem. The hope is then that analyzing the failed attempts helps us solve the problem. But, not all attempts are created equal. If an attempt is too simple and fails for trivial reasons, then analyzing it won't give us too much information. Thus, the attempts should seriously jab at the problem.

In the case of writing code, the test cases should be tough. One option is to cover a corner case. Another option is to make them involved enough to exercise most of the code. (Look at A Torture Test for $\TeX$ to see what I mean.)

In the case of $P \ne NP$ you should look at serious proof attempts. In the case of parametric static analysis $\ldots$ I'm still working out what it means. Literally. That's what I'm working on now. In On Abstraction Refinement for Program Analyses in Datalog we focused on cheap attempts. But, that's like trying to break code by throwing at it the smallest test cases: a fairly good strategy, actually, but not quite the best.

Conclusion. Remember: If you work on a problem, first try to solve it in earnest, and record your attempts. If after repeated attempts the problem fails to fall, then it is time for a change in strategy. Instead of seeing patterns in the problem itself, try to see a pattern in your failed attempts. Ask yourself: Why did all these attempts fail? What do they have in common? There are two things that may happen. One is that you find a criterion that lets you quickly rule out techniques that won't work, thus saving time. Another is that you'll find you cannot solve the problem: all techniques are doomed to fail.

Notes. For more unsolicited advice on coding from yours truly, see Advice to Beginner Programmers. (But you should know that when I say ‘rule’ I don't mean ‘law’.) For the Clique barrier, see the expository article [Gowers, Razborov's Method of Approximations, 2009]. The theorem and the quote [1] are from [Razborov, Lower Bounds for Monotone Complexity of Boolean Functions, 1986]. For more on parametric static analyses, you could see a rather long previous post (Datalog and MaxSat: an Unexpected Match) or a rather abstract video (On Abstraction Refinement for Program Analyses in Datalog).

Research and Startups

2014-10-13T06:51:00.001+01:00

In which I argue that research is an extreme form of startup.

Folklore says that nine in ten startups fail. The number 0.9 comes out of the hat of a cat. But, there seems to be agreement that very few startups are highly successful. Some plod along like zombies for a long time. Only a few become Twitters or get acquired by Twitter (or some other social network).

The situation is seen as satisfactory. Startups carry out activities that have high risk and high reward. Any one particular startup is more likely to fail than succeed. But, the sheer number of startups ensures that we have a steady stream of innovation. This stream is undoubtedly useful for society.

The situation — high risk and high reward — is even more extreme for research.

Every now and then I encounter the opinion that startups rock and research sucks; for example, a few months ago, when I was reading Antifragile. The book extols the virtues of entrepreneurship. It even goes as far as asking for more safety nets so that entrepreneurs take more risks. It's OK to fail, because progress is made by trial and error. At the same time, researchers are seen as a useless bunch. They spend their time on theories, instead of on what they should be doing: trying and erring. Sure, every once in a while they build an atomic bomb, but that's an exception!

As if the startup that becomes Facebook is the rule.

(I still recommend reading the book. It is entertaining and thought provoking, even if rude.)

I like to think of big companies, startups, and research in terms of a travel metaphor. You live in a small village. On one side you see a deep forest. Across the road there is a huge mountain. What would you rather do?

You could take the bus to visit the city. The bus trip may involve some unexpected delays, but it's rather predictable nevertheless. That bus is the service that big companies offer. There may be big differences between two companies — how comfortable the chair is, how entertaining the driver is, how often the bus is on time. But, no matter which bus you pick, it's a safe bet that you will get to the city.
You could go on a hiking expedition to the summit of the mountain. You might find this a lot more exciting than a bus trip. But you also need to make sure you have enough stamina and resolve. There may be a few footpaths along the way, but they are few and far between, badly marked, and you'll certainly not get to the summit by just following them. This expedition is what startups do. Their aim is clear: the summit is in sight. It's also fairly sure that someone will get there eventually. But it's not clear at all that you are sufficiently well trained to get there before night falls. Even if you are, you might get unlucky and take a turn in the road that makes the trip more difficult than necessary.
Finally, you could go into the deep forest. Here, there are no footpaths, and you have no idea what you'll see, or whether you'll be able to get back home. For some, this might be even more exciting than a hiking expedition. This trip through the deep forest is what research does. It discovers hidden gems that you couldn't even imagine if you didn't bump into them, mostly by accident. That is not to say that finding gems in a forest is done only by pure luck. You need survival skills. You need to have a sense of direction. You need to fend off animals. You need to be able to climb trees. You need to chart a map as you go, and use it to avoid going in circles. And you have better chances if you join a group, but it can be difficult for people to agree on what is the best course of action.

Nitpicking Distance

2014-10-05T20:09:00.000+01:00

I recently wrote down what I thought was a completely standard definition of distance on graphs. I was asked if I'm sure that the definition is well-behaved. Well … let's see.

Let's define a distance on a directed graph: To go from $x$ to $z$, we first go to $y$, and then take one more step. $$ d(x,z) = 1 + \min \{\,d(x,y)\mid y \to z \,\} $$ Question: Is this definition good? Suppose the digraph is $0 \leftrightarrow 1$. According to the equation above, $$\begin{align} &\bbox[2px,border:2px solid red,pink]{d(0,1)} \\ &= 1 + \min \{ d(0,0) \} &&\text{by decomposing $0\leadsto 1$ as $0\leadsto 0\to 1$} \\ &= 1 + d(0,0) \\ &= 1 + 1 + \min \{ d(0,1) \} &&\text{by decomposing $0\leadsto 0$ as $0\leadsto 1\to 0$} \\ &= \bbox[2px,border:2px solid red,pink]{2 + d(0,1)} \end{align}$$ Something is fishy. There are in fact at least two problems with the definition. One problem is illustrated above: It leads to weird consequences when applied for $x=z$. Another problem: What happens when the minimum is taken over the empty set? This second problem leads to a third: What is the type of the function $d$? Let's say $$ d : V \times V \to \mathbb{N} \cup \{\infty\} $$ Now $\infty$ was promoted to a proper distance value. We can agree that $\min\emptyset=\infty$, by convention. Also, we can solve the equation $d(0,1)=2+d(0,1)$: The only solution in $\mathbb{N}\cup\{\infty\}$ is $\infty$. It's nice that we can solve that equation, but we probably don't want $d(0,1)$ to be $\infty$. It's time to fix the definition by adding a special case for $x=z$. $$\begin{align} d(x,z) = \begin{cases} 0 &\text{if $x=z$} \\ 1 + \min\{\,d(x,y)\mid y\to z\,\} &\text{otherwise} \end{cases} \end{align}$$

Question: How can we tell for sure that $d$ is uniquely determined by the equation above? This question really has two parts:

Is there always a solution?
Is there at most one solution?

The first subquestion is suitably addressed by a sledgehammer that goes by the name of the Knaster–Tarski theorem. In the limited form that we need here, it says that an order-preserving function on a complete lattice has a fixed-point. In our case, the lattice is the set $L\;=\;V \times V \to \mathbb{N}\cup\{\infty\}$. The partial order on the lattice is defined pointwise: $$ d \le e \quad\stackrel\Delta=\quad \bigl(\forall x \forall y\; d(x,y) \le e(x,y)\bigr)$$ The order-preserving function is $D:L\to L$, defined by $$ D(d)(x,z) = \begin{cases} 0 &\text{if $x=z$} \\ 1 + \min\{\,d(x,y) \mid y\to z\,\} &\text{otherwise} \end{cases} $$ Let's check that it is order-preserving. We pick $d,e : L$ such that $d\le e$, and we'd like to show that $D(d)\le D(e)$. The case $x=z$ reduces to showing $0\le 0$. The case $x\ne z$ reduces to showing that $$ \min\{\,d(x,y)\mid y\to z\,\} \le \min\{\,e(x,y)\mid y\to z\,\}$$ which follows from the assumption $d\le e$. At this point, we know that the conditions of the Knaster–Tarski theorem are fulfilled, so $D$ has at least one fixed-point. In other words, there is at least one function $d$ that satisfies our recursive equation, no matter on which graph we apply it. (This is true even for infinite graphs.)

Now on to the second subquestion: Is there at most one function in $L$ that satisfies the recursive equation? Let's assume that $D(d)=d$ and $D(e)=e$. By way of contradiction, assume also that $d\ne e$; that is, $d(x,z)\ne e(x,z)$ for some $(x,z)\in V\times V$. Since $d$ and $e$ are fixed-points of $D$, the second assumption is equivalent to $$ D(d)(x,z) \ne D(e)(x,z)$$ If $x=z$, then $D(d)(x,z)=0=D(e)(x,z)$, which is a contradiction. If $x\ne z$, then $$\min\{\,d(x,y)\mid y\to z\,\} \ne \min\{\,e(x,y)\mid y\to z\,\}$$ Without loss of generality, let's assume that $$\min\{\,d(x,y)\mid y\to z\,\} \lt \min\{\,e(x,y)\mid y\to z\,\}$$ We fix $y$ to make the left hand side minimum, and now we have $$d(x,y) \lt e(x,y) \qquad d(x,y) \lt d(x,z)$$ (Exercise: How do we know that $d(x,y)\lt d(x,z)$? Why not $d(x,y)=d(x,z)=\infty$?) To recap, if $d\ne e$, then it must be that $d(x,z)\lt e(x,z)$ (or, symmetrically, $d(x,z)\gt e(x,z)$), and then in turn it must be that $d(x,y)\lt e(x,y)$ and $d(x,y)\lt d(x,z)$ for some $y$. But, this process can't be repeated forever, because $\mathbb{N}$ has no infinite descending chains. Done. Again, this also works for infinite graphs.
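On a finite graph, the unique fixed point can also be computed by brute force. Here is a minimal sketch, assuming the graph is given as a list of vertices and a list of arcs `(y, z)`; the names `apply_D` and `distance` are made up for illustration. It starts from the function that is $\infty$ everywhere and applies $D$ until nothing changes, which takes at most $|V|$ rounds.

```python
import math

def apply_D(d, vertices, arcs):
    """One application of D, i.e. the right-hand side of the recursive equation."""
    preds = {z: [y for (y, t) in arcs if t == z] for z in vertices}
    new = {}
    for x in vertices:
        for z in vertices:
            if x == z:
                new[(x, z)] = 0
            else:
                # min over the empty set is infinity, by convention
                new[(x, z)] = 1 + min((d[(x, y)] for y in preds[z]), default=math.inf)
    return new

def distance(vertices, arcs):
    """Iterate D from the all-infinity function until a fixed point is reached."""
    d = {(x, z): math.inf for x in vertices for z in vertices}
    while True:
        new = apply_D(d, vertices, arcs)
        if new == d:
            return d
        d = new
```

For the digraph $0 \leftrightarrow 1$, `distance([0, 1], [(0, 1), (1, 0)])` gives $d(0,1)=1$, as one would hope.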

You may expect that $d(x,y)=\infty$ means that $x$ and $y$ are disconnected. This is not the case. Let $V=\mathbb{N}\to\{0,1\}$, and say there are arcs $x\leftrightarrow y$ when $x$ and $y$ differ in exactly one place. In this graph you can get from any sequence to any other sequence, but you may need to walk forever to do so. (A very similar example would be to take $V=\mathbb{R}$ and define arcs in terms of decimal representations of real numbers.)

Believe it or not, after all this work, we're not done yet. To be allowed to use the word ‘distance’ with a straight face, we should check some axioms. Here's what Wikipedia says: $$\begin{align} &d : V\times V \to \mathbb{R} \\ &d(x,y)=0 \iff x = y \\ &d(x,y)=d(y,x) \\ &d(x,z)\le d(x,y)+d(y,z) \end{align}$$ I suppose we're in trouble, because $\mathbb{N}\cup\{\infty\}$ is not a subset of $\mathbb{R}$. Also, if the graph is directed then the sort-of distance we defined isn't symmetric. Oh, well.

The other axioms should hold. For finite connected undirected graphs we do have a metric. For finite directed strongly connected graphs we have a quasimetric, although Wikipedia says this terminology is not completely standard. (Is it?) I wouldn't lose sleep about calling it a distance for other graphs, though.

PS: Feel free to nitpick, obviously.

The Effect of Discouragement

2014-09-13T08:40:00.000+01:00

How many books will one write? (Or blog posts, for that matter.)

Terry Tao recently explained why most bad books are written by good authors. The basic idea is that bad authors are discouraged and give up writing. The model he used describes each author by the probability $p$ of producing a good book. But, there's no model for discouragement. Instead, it just sneaks in in an example. Let's try to rectify that. We add a probability $q_0$ of giving up after writing a good book, and a probability $q_1$ of giving up after writing a bad book.

How many books will one write on average, in this model? There's a picture below, for $q_0=1\%$ and various values of $q_1$. The $x$-axis is $p$. So, for example, we see that if all written books are good ($p=1$), then we expect 100 books. But, if all books are bad ($p=0$) and $q_1=2\%$, then we expect 50 books. [Edit: I should perhaps state explicitly what I see in this graph: The number of books you'll write doesn't depend too much on how good you are. Unless you are very good ($p>0.8$), in which case you'll write a lot of books.]

In case you are curious, here's how I drew this. One writes $n\ge1$ books if one writes $n-1$ books without giving up and gives up after the $n$th. Thus, the probability of writing $n$ books is $\alpha^{n-1}\beta$, where $$\begin{align} \alpha &= p(1-q_0) + (1-p)(1-q_1) \\ \beta &= pq_0 + (1-p)q_1 \end{align}$$ If we define $$G(x) = \sum_{n\ge 1} \alpha^{n-1} \beta x^n$$ then the expected value we search for is $G'(1)$. Since $G$ is a geometric series, it's easy to compute the sum. $$\begin{align} G(x) &= \beta x \sum_{n\ge 0} (\alpha x)^n \\ &= \beta x\frac{1-(\alpha x)^\infty}{1-\alpha x} \\ &= \frac{\beta x}{1-\alpha x} \end{align}$$ Differentiate: $$\begin{align} G'(x) &=\frac{\beta(1-\alpha x)+\alpha\beta x}{(1-\alpha x)^2} \\ &=\frac{\beta}{(1-\alpha x)^2} \end{align}$$ So, the plot you see above is for the function $$ \frac{\beta}{(1-\alpha)^2} $$ Since $\alpha+\beta=1$, this is simply $1/\beta$.
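The closed form is easy to sanity-check numerically. The sketch below (function names are made up for illustration) compares $\beta/(1-\alpha)^2$ with a truncated version of the series $\sum_{n\ge1} n\,\alpha^{n-1}\beta$; the truncation point is an arbitrary choice.

```python
def alpha_beta(p, q0, q1):
    """Probabilities of continuing (alpha) and of giving up (beta) after one book."""
    alpha = p * (1 - q0) + (1 - p) * (1 - q1)
    beta = p * q0 + (1 - p) * q1
    return alpha, beta

def expected_books(p, q0, q1):
    """Closed form G'(1) = beta / (1 - alpha)^2."""
    alpha, beta = alpha_beta(p, q0, q1)
    return beta / (1 - alpha) ** 2

def expected_books_series(p, q0, q1, terms=10_000):
    """Truncated sum of n * P(write exactly n books)."""
    alpha, beta = alpha_beta(p, q0, q1)
    return sum(n * alpha ** (n - 1) * beta for n in range(1, terms + 1))
```

With $q_0=1\%$ and $q_1=2\%$, both functions give about 100 books for $p=1$ and about 50 books for $p=0$, matching the endpoints read off the plot.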

Edit 2: There are two things that are kinda misleading above. Let me try to rectify, if only briefly. One: Tao shows that most bad books could be written by good authors, not that they are. Two: In the model from above with $q_0$ and $q_1$ it is not true that most bad books are written by bad authors.

Beta Function

2014-09-11T07:32:00.000+01:00

How to compute $\int_0^1 dp\, p^k(1-p)^{n-k}$ without symbol manipulation.

The symbols I used to write it might be a giveaway. We'll imagine throwing a coin and obtaining sequences of heads and tails like $$HTTTTHTHHH$$ This one has probability $p^5(1-p)^5$. If $p$ is the probability of obtaining H, $n$ is the length of the sequence, and $k$ is the number of Hs, then the general form is $p^k(1-p)^{n-k}$. The chance of obtaining some sequence with $k$ Hs is $$ \binom{n}{k} p^k(1-p)^{n-k}$$

But, what if we don't know the value of $p$? Instead, we just have some guess about where its value is. (For example, perhaps we just know it's very likely in $[0.4,0.6]$.) In general, we'd model this with a probability density: We say that value $p$ happens with probability $h(p)\,dp$, where the function $h(p)$ is the probability density. Then, to get an expectation for the probability of seeing $k$ Hs, we average over the possible values of $p$: $$ \int_0^1 dp\, h(p) \binom{n}{k} p^k(1-p)^{n-k}$$ The integral is from 0 to 1, because $h(p)$ must be $0$ elsewhere; after all, we know that $p$ is a probability.

If we have no idea what value $p$ has, then we should expect to see all values of $k$ with equal probability. The possible values are $0,1,\ldots,n$, so the probability is $1/(n+1)$. How do we express in terms of $h(p)$ that we have no idea what value $p$ has? That's right, we simply choose a uniform distribution $h(p)=1$, to say that we don't prefer one guess over another. Hence, $$ \int_0^1 dp\, \binom{n}{k} p^k(1-p)^{n-k} = \frac{1}{n+1} $$ and $$ \int_0^1 dp\, p^k(1-p)^{n-k} = \frac{1}{(n+1)\binom{n}{k}} $$
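Since the argument is fast and loose, a numerical check is reassuring. This is a minimal sketch, with made-up function names, that compares a midpoint-rule approximation of the integral with the closed form $1/\bigl((n+1)\binom{n}{k}\bigr)$; the number of steps is an arbitrary choice.

```python
from math import comb

def beta_integral_numeric(n, k, steps=10_000):
    """Midpoint-rule approximation of the integral of p^k (1-p)^(n-k) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        total += p ** k * (1 - p) ** (n - k)
    return h * total

def beta_integral_closed(n, k):
    """The value obtained from the coin-tossing argument."""
    return 1 / ((n + 1) * comb(n, k))
```

For example, with $n=10$ and $k=5$ both agree with $1/2772$ to many digits.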

Sure, this reasoning is fast and loose. Nevertheless, this is what I did when I had to compute the integral, and, because I thought it's kinda cute, I'll probably not forget the trick. If you want a more rigorous derivation, Wikipedia has one.

Watching Horn

2014-08-26T06:45:00.002+01:00

How to use watched literals (or, rather, vertices) to do breadth-first search in Horn hyperdigraphs.

Satisfiability solvers use a technique called ‘watched literals’. I'll present it here in the context of Horn clauses, for two reasons: (1) it is a slightly unusual context, and (2) I actually needed to implement this for my work.

Horn clauses can be difficult to visualize. So, I'll use an equivalent formulation in terms of directed hypergraphs. Directed hypergraphs are less famous than their undirected relatives. A directed (hyper)arc is a pair $(X, Y)$ of vertex sets. If you are familiar with SAT, think of $X$ as the set of negative literals, and $Y$ the set of positive literals. Then, a Horn formula corresponds to a hypergraph with $|Y| \le 1$ for all arcs. (And a Horn clause corresponds to a hypergraph arc.) A definite Horn formula corresponds to the case $|Y|=1$.

We want to perform breadth-first search on definite Horn hyperdigraphs.

Let's first clarify what reachability means. For digraphs, if vertex $x$ is reachable and the digraph has an arc $(x,y)$, then vertex $y$ is reachable. For hyperdigraphs, if all vertices in $X$ are reachable and the hyperdigraph has an arc $(X, Y)$, then some vertices in $Y$ are reachable. Saying that some vertex is reachable sounds a bit weird. I chose this definition so that it corresponds to satisfiability of formulas. And, the weirdness doesn't matter as we look at the definite Horn case $|Y|=1$. (But, I should say there are other definitions of reachability on hyperdigraphs.)

How should we implement this?

For digraphs, we use a flag on each vertex to put them in one of three categories: (1) not seen, (2) seen, (3) done. At every step we pick a seen vertex $x$. Then we mark all its successors that were previously not seen as seen. And then we mark $x$ as done. This schema works for every traversal. For breadth-first search, we must pick vertices in a certain order. One way is to use a queue. But, if you are interested in distances, there is a more convenient way. We keep two sets, the current level and the next level, of seen nodes. The $x$ is picked from the current level, its successors added to the next. Also, $x$ is removed from the current level when marked done. (Not strictly necessary: The invariant that these two levels contain only seen nodes can be relaxed.) When the current level becomes empty, it gets swapped with the next level. The process ends when both levels are empty.
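The two-level scheme for plain digraphs can be sketched as follows, assuming the digraph is a dictionary from each vertex to its list of successors (the name `bfs_levels` is made up). A vertex is 'seen' as soon as it has an entry in `dist`, and 'done' once the loop has gone past it.

```python
def bfs_levels(succ, start):
    """Breadth-first search that keeps a current level and a next level."""
    dist = {start: 0}              # seen (or done) vertices and their distances
    current, depth = [start], 0
    while current:
        nxt = []
        for x in current:          # pick a seen vertex from the current level
            for y in succ.get(x, ()):
                if y not in dist:  # y was not seen before
                    dist[y] = depth + 1
                    nxt.append(y)
            # x is now done
        current, depth = nxt, depth + 1   # swap levels
    return dist
```

The distance of a vertex is simply the number of swaps that happened before it was seen, which is why this variant is convenient.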

If we try to do the same for hypergraphs, we run into an efficiency problem. Previously, given $x$, we had to find all arcs of the form $(x,y)$. That's just the adjacency list representation of digraphs. But now, given $x$, we must find all arcs of the form $(X,y)$ such that $x\in X$ and all vertices of $X$ are seen (or done). This is a considerably more difficult problem, solved by watching literals.

Consider an arc $(X,y)$. If all nodes in $X$ are done, then $y$ must be seen or done. To ensure this is the case, we need a quick way to find arcs all of whose sources are done. The trick is to think of a data structure that enforces the negation: For each arc $(X,y)$ there is at least one vertex in $X$ that is not done. When we fail to maintain the invariant, it is because an arc has the required property.

How do we make sure that every arc has an undone vertex? Well, we designate one particular vertex, which is undone, as being watched. Then, we still use the adjacency list representation, as for digraphs. For digraphs, the adjacency list maps a vertex $x$ to the list of arcs of the form $(x,y)$. For Horn hyperdigraphs, the adjacency list maps a vertex $x$ to the list of arcs $(X,y)$ such that $x \in X$ is the designated watched vertex for this arc.

Finally, let's see what happens when we pick a seen vertex $x$ in the traversal algorithm. This time we begin by marking $x$ as done. This is like promising to have done the work before actually doing it. So, we'd better do it. We have to go through the adjacency list of $x$, because $x$ cannot be a watched vertex anymore. For each of the arcs, we must pick a different undone vertex to watch. For each arc $(X,y)$ we simply scan all members of $X$ until we find a suitable replacement. If there isn't any, then all sources are done, so we must mark $y$ as seen if it was not seen.
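Here is a minimal sketch of the whole traversal, not the code I actually used. It assumes arcs are given as pairs `(X, y)` with `X` a list of vertices, that the traversal starts from a given collection of seen vertices, and that an arc with no sources makes its target seen at level 0; the names `horn_bfs`, `watch`, and `see` are made up for illustration.

```python
def horn_bfs(arcs, start):
    """Breadth-first search on a definite Horn hyperdigraph using watched vertices.

    Returns a dict mapping each reachable vertex to the level at which it was seen.
    """
    arcs = [(list(X), y) for X, y in arcs]
    level = {}     # seen or done vertices -> level
    done = set()
    watch = {}     # vertex -> indices of the arcs that currently watch it
    current = []

    def see(v, lvl, bucket):
        if v not in level:
            level[v] = lvl
            bucket.append(v)

    for v in start:
        see(v, 0, current)
    for i, (X, y) in enumerate(arcs):
        if X:
            watch.setdefault(X[0], []).append(i)   # nothing is done yet, so any source will do
        else:
            see(y, 0, current)                     # assumption: no sources means seen at once

    depth = 0
    while current:
        nxt = []
        for x in current:
            done.add(x)                            # promise first, then do the work
            for i in watch.pop(x, []):
                X, y = arcs[i]
                # scan the sources for another vertex that is not done
                replacement = next((v for v in X if v not in done), None)
                if replacement is None:
                    see(y, depth + 1, nxt)         # all sources are done
                else:
                    watch.setdefault(replacement, []).append(i)
        current, depth = nxt, depth + 1
    return level
```

For instance, with arcs $(\{a,b\},c)$, $(\{a\},b)$, $(\{c\},d)$ and start vertex $a$, the levels are $a\mapsto0$, $b\mapsto1$, $c\mapsto2$, $d\mapsto3$. Each arc is only looked at when its watched vertex becomes done, rather than every time one of its sources does.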

That's it.

By the way, according to Blogger this is my 100000000th post.

Genius! How Not to Be Wrong

2014-08-18T09:57:00.000+01:00

Why math is for everybody.

The book How Not to Be Wrong says that math can enrich your life. It doesn't matter whether you like the math drills from school. Those are for math like running is for football: useful, but not quite the fun part. Unless you turn pro. In that case, running exercises are necessary, rather than merely useful.

I agree with this message wholeheartedly. But I learned something from Kahneman's Thinking Fast and Slow: If you find yourself completely agreeing, then you're probably not paying enough attention. It's easy to do so, because attention is a scarce resource. People instinctively tend to conserve it.

For a while I was worried that I agree too much with Ellenberg. (He is the author of How Not to Be Wrong.) But then I got to the section about genius, and I stopped worrying.

This section is aligned with the main message. It says that it's stupid to give up math because someone else is better than you. That's not the problem. With that I agree. But then, Ellenberg starts saying why you should go on. One reason is that more brains will make math advance faster. That's not the problem either. But then he says things like this:

It can be hard for me to make this case, because I was one of those prodigious kids myself. […] [I] won a neckful of medals in math contests. […] That group of young stars produced many excellent mathematicians. […] But most of the mathematicians I work with now weren't ace mathletes at thirteen; they developed their abilities and talents on a different time-scale. Should they have given up in middle school?

Well, … this comes after a few sections warning of the dangers of ignoring base rates. Having a neckful of medals in math contests is a very rare event almost by definition. Given this, the statement that ‘most mathematicians don't have a neckful of medals’ contains very close to 0 bits of information.

Let's put it differently. Suppose a country has about 150000 students in a certain school year. Out of those, 10000 participate in math contests. Out of those, 30 do really well, and get some medals at the national level. (Incidentally, I think these figures are in the right ballpark for Romania.) Now, out of these 150000 students, 20 become professional mathematicians. Most professional mathematicians (say, 15) come from those without medals. The rest (5 in this case) come from those with medals. And yet, having a medal gives you a chance of 17% of becoming a mathematician, while not having a medal gives you a chance of 0.01% of becoming one. That is, not having a medal decreases your chances more than 1600 times. (Again incidentally, this reasoning mirrors an argument from Ellenberg's book.)
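Spelled out with the made-up figures from above, the arithmetic is: $$\begin{align} P(\text{mathematician}\mid\text{medal}) &= \frac{5}{30} \approx 17\% \\ P(\text{mathematician}\mid\text{no medal}) &= \frac{15}{150000-30} \approx 0.01\% \\ \frac{5/30}{15/149970} &\approx 1666 \end{align}$$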

So, yes, most mathematicians were not mathletes. But, no, this is not such great news for your math future as it may sound.

It is easy to misinterpret what I said above, so I'll linger on the point. Suppose you are a student and you don't have a ‘neckful of medals’. Do I think you should not try to become a mathematician? I most certainly do not think that. Do I think being a mathematician is a perfectly valid career choice for you? Yes, I most certainly do think that. But, the reason why I think so is not that most mathematicians weren't mathletes. You see, I can agree with the conclusion but disagree with the argument.

Why, then, should you not care about medals? I'll tell you some of my reasons.

My family encouraged me to take part in the Romanian Math Olympiad. But, I never did well. And I was never worried about it. In fact, I thought that being a mathlete would be rather boring. The reason is the following conversation about the problems of one contest. Me: ‘So how would you solve problem 1?’ Other: ‘Oh, you use Theorem X.’ Me: ‘Ah, I didn't know Theorem X. How do you prove it?’ Other: ‘I don't know. But it doesn't matter. You can just use it.’ Me: ‘What about problem 2?’ Other: ‘For that one you use Theorem Y.’ Me: ‘Hmm. I don't know that one either. Can you tell me why this theorem is true?’ Other: ‘I don't remember now. But it's in the textbook.’ After going through a few more problems, I decided that I could do a lot better in these contests if I knew a bunch of theorems, even if I didn't know why they're OK. I thought that knowing theorems without understanding why they hold is no fun. And, more often than not, it seemed like the theorem was more interesting than the problem. It felt like someone knew the theorem and said ‘I'll make a problem out of it by applying some make-up’. Then, the contestants' job was to remove the make-up, and recognize the theorem underneath. This simply doesn't sound like fun.

I should disclose that this memory is so vague that I'm not sure I actually had that conversation. Whether I had it or not, it captures my point of view from high-school.

I no longer hold that view; not entirely. It's true: one can do well in these contests by memorizing theorems. But, one can do really well only by knowing many theorems and techniques. And one simply can't remember so many things unless one also understands them. Another thing I realized in the meantime is that contests cover a tiny corner of math.

So, here's why you need not worry about contests. Your goal should be to have fun. Understanding is the supreme kind of fun. Figuring out on your own something that you didn't know feels awesome. (I sometimes joke that it's better than an orgasm.) This feeling I describe doesn't come often, and certainly not after solving easy problems. The harder the problem, the better the feeling. But, it can be difficult to keep going. Here is where contests come in: they are a great motivational tool. But, only for people with a certain psyche. Contests aren't good in themselves. Contests are good because they make you prepare for them. The work you do while preparing for them is the valuable part. And also the part where you'll ultimately have most fun.

I couldn't see the value of contests because I was not preparing for them. I just showed up. Of course they weren't fun!

The important thing is to keep punching hard problems. Whether you use a contest as an excuse to do it is a secondary concern. (Obviously, the problems need to be within your reach; otherwise, you can't punch them.)

Going back to the book, I concede that my objection amounts to nitpicking. Which makes me worry again that I didn't pay enough attention. I plan to do a second reading with a more critical eye.

Here's a link: Ellenberg, How Not to Be Wrong.

ROSE 1995

2014-08-17T15:46:00.000+01:00

What I remember from the first computer-related conference I attended.

I have rather vague memories about this conference, from almost 20 years ago. But, I managed to find its program in some dark corner of the Internet. From it, I learned several things. The name stands for Romanian Open Systems Event. It was in the beginning of November, so I must have missed school to attend. It was in 1995, so I was just starting Year 10 of school.

I also remember several things that aren't in the program. Two of these memories are most vivid.

First, that's when I learned of Java. There was a presentation by some marketing guy from Sun. (According to the program, it was Darryl Parker.) I had never seen such a good presentation before. He really did convince me that Java is the future of the Internet. Even now I remember that one main selling point of Java is ‘Write Once, Run Everywhere’. And (I'm almost sure) I remember it from that presentation. I learned in the meantime that people are prone to mistake presentation quality for content quality.

Second, at some point Linus Torvalds was working on his laptop. Nobody dared to interrupt, but my father nudged me, and I did. The only thing I knew about him at the time was that he was supposed to be the celebrity guest of the conference. I remember he was extremely nice. When I asked what he was doing, he said he was preparing his presentation, which was in one of the next sessions. ‘What do you use?’ ‘Powerpoint. People sometimes think I'm anti-Microsoft, but I'm not. Powerpoint is really good.’ And that's why I learned to use Powerpoint. I don't recall the rest. But I do recall that he patiently waited until I finished quizzing him, despite having an unfinished presentation to make later on. I haven't met him since.

Sometimes people describe Linus as nasty. I find it very hard to agree, and I think it's because of that meeting. Another result of that meeting was that 5 years later I decided to try Linux. I'm glad I did — it fits my style. (Powerpoint doesn't.)

Coursera Courses

2014-08-07T10:31:00.000+01:00

In this post I review some of the Coursera courses that I completed.

Algorithms: Design and Analysis, Part 2. The course covers greedy, dynamic programming, and NP-completeness. The lectures were very clear, not too fast and not too slow. Caveat: I watched lectures only when I needed to. My strategy was to try to solve the problems. If I had trouble, I looked at the slides. If the slides were cryptic, then I looked at the lectures. So, I ended up seeing very few lectures. I even learned an algorithm that I knew I should know but didn't: Karger's. I think it's my favorite example of a randomized algorithm.

Introduction to Genetics and Evolution. The course covers the basic math of genetics. I'll give two examples of things I remember now, more than one year after.

One idea is the Hardy-Weinberg equilibrium. Suppose each individual has one of the three genotypes $aa$, $aA$, and $AA$. I'll use the fancy word allele: it means one of $a$ and $A$. Reproduction works as follows: Randomly choose an allele from each of the two parents, and put them together. Hardy-Weinberg says that in an infinite population, the percent of each allele is constant. So, if you start with half $a$s and half $A$s, you'll remain with half $a$s and half $A$s.
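A tiny sketch makes the claim concrete. It assumes random mating in an infinite population, so that frequencies evolve deterministically; the function names are made up for illustration.

```python
def allele_a(aa, aA, AA):
    """Frequency of allele a, given the genotype frequencies."""
    return aa + aA / 2

def next_generation(aa, aA, AA):
    """Genotype frequencies after one round of random mating."""
    a = aa + aA / 2      # chance that a randomly chosen allele is a
    A = AA + aA / 2      # chance that a randomly chosen allele is A
    return a * a, 2 * a * A, A * A
```

Starting from genotype frequencies $(0.5, 0.2, 0.3)$, the next generation has $(0.36, 0.48, 0.16)$: the genotypes changed, but the frequency of $a$ is $0.6$ both before and after.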

The other idea is genetic drift. It says what happens when the population is not infinite. What happens is that you'll have random variations. Eventually, these lead to the extinction of one of the alleles.

Programming Languages. This course uses ML, Racket, and Ruby to illustrate basic principles of programming languages. As usual in this kind of course, you write a few interpreters. The best time to take this course is after you've learned your second language. (I learned BASIC when I was 6; Pascal and C++ came when I was 14. I think I would've benefited most from this course when I was 15. So, if you are 15, then you should try it!)

The Modern World: Global History since 1760. What I remember most vividly about this course is rather strange. It is the soothing voice of the lecturer. Even when I was disagreeing with what he said, I felt compelled to keep listening.

From what I recall, the quizzes were boring: They simply tested a bunch of facts that happened to be mentioned in lectures. The videos, on the other hand, always told a nice story. These stories were clearly addressed to someone living in 2013. For example, the lectures not only said that the West was weak in 1760. They stressed this fact repeatedly, because a world in which the West is weak and insignificant is really difficult to imagine for the typical user of Coursera.

Natural Language Processing. For this course, like for the Algorithms one, I did problems first. I looked at the slides and watched the lectures only when I needed to. I ended up looking at most slides and almost no lecture video. Which, I guess, means that the slides are self-sufficient. From what I recall, doing the homeworks involved a substantial effort.

The topics covered are things like probabilistic grammars, tagging problems (e.g., noun, verb, …), log-linear models, learning.

Discrete Optimization. Like for the other computer-related courses, I went directly for the homeworks. In addition, I delayed doing most homeworks until the last week. That turned out to be a huge mistake. I simply couldn't do the homeworks at the rate of one per day, as I planned. In fact, I thought at some point that I wouldn't even get a passing grade. I did, but I also learned that Coursera sometimes has tough courses. So, what were the homeworks like? I recall: knapsack, graph coloring, euclidean traveling salesman. There was also some combination of bin packing and euclidean traveling salesman. More importantly, I recall some tricks that I learned while doing the homeworks. For example, for knapsack I used two tricks to get a good solution. The standard dynamic programming runs in time proportional to the total weight. This is troublesome because weight can be very large. In that case, use a trick from physics: pick your units of measurement. Strictly speaking, the unit of measure doesn't matter. But, when one works in kilometers, one usually doesn't keep track of millimeters. Thus, by ‘picking the unit’ I actually mean ‘ignore the less important bits’. When ignoring bits, you'll want to round up. This way, you never try to fit more than possible in the knapsack. The other trick is to repeat this process using the items that remained out, with better precision.
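The first trick can be sketched as follows. This is a reconstruction under assumptions, not my homework code: items are `(value, weight)` pairs with positive integer weights, and the parameter `unit` (a made-up name) is the coarse unit of measurement. Weights are rounded up and the capacity is rounded down, so whatever fits in coarse units also fits in reality.

```python
def knapsack_scaled(items, capacity, unit=1):
    """0/1 knapsack by dynamic programming over weights measured in coarse units.

    Returns (total value, indices of the chosen items). With unit=1 the answer
    is optimal; larger units are faster but may miss the optimum.
    """
    cap = capacity // unit                         # round the capacity down
    weights = [-(-w // unit) for _, w in items]    # round the weights up
    # best[c] = best (value, chosen indices) with at most c coarse units
    best = [(0, ())] * (cap + 1)
    for i, (value, _) in enumerate(items):
        w = weights[i]
        for c in range(cap, w - 1, -1):            # downwards, so each item is used once
            candidate = best[c - w][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - w][1] + (i,))
    return best[cap]
```

For items `[(60, 10), (100, 20), (120, 30)]` and capacity 50, `unit=1` finds the optimum 220, while `unit=7` settles for 180 but runs over a table of 8 entries instead of 51. The second trick (rerunning on the leftover items with better precision) is not shown.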

For travelling salesmen, I wrote some complicated solution that barely got medium points. It did two main things, as I recall. First, it used dynamic programming to do local optimizations. Second, it greedily improved the tour by using a set of moves that I noticed empirically. For an example of a move, think of a crossing of edges. These two edges are far apart in the current best tour, so local optimizations won't notice the crossing. But, of course, you can look at all pairs of edges and check for intersections. Once you notice the intersection, it's clear what to do to shorten the tour.

A Brief History of Humankind. This course was like the other history course: trivial (useless?) quizzes, but fascinating lectures. Where the Modern History course would talk about personalities and their decisions, the Humankind History course would talk about the way of life of a typical person. In this sense, the two courses complement each other nicely.

The most vivid image that this course left in my mind is that of ‘shared imaginary worlds’. Why does money have value? Because we believe that others believe that money has value. This kind of circular reasoning has a certain appeal for a computer scientist.

Linear and Integer Programming. This course was too easy for my taste. They do recommend a resource that looks intriguing: the convex optimization book by Stephen Boyd.

Automata. This course had two kinds of homework: multiple-choice quizzes and programming assignments. Unexpectedly, I enjoyed the quizzes more. Each quiz question tries to test whether you understood some concept. You get treated as a computer program: The quiz presents you with some (smallish) input data, you have to provide the right output. If you repeat the quiz, you get some other input data that you have to process. For one of the first quizzes, I managed to get a question wrong three times in a row. I was trying to do it quickly, and I ended up being very slow. The fourth time I moved slowly and checked my work. I think this phenomenon appears often in real life: People make fun of those that appear slow, but the slow ones often finish first.

Back to the course. Why did I not like the programming assignments? Because I was supposed to modify some Python scripts that were kinda crap.

To answer some questions, I had to look at the slides. These were extremely clear — I never needed to look at the video lectures.

Social and Economic Networks: Models and Analysis. I learned something new from almost every video lecture. The quizzes were rather easy, but only after I learned the vocabulary. It wasn't possible to grasp the definitions from the slides, but the videos were very clear. In other words, if you take this course and you don't already work in social networks, you'll have to watch the videos. Which is a good thing.

Here's an example of the kind of stuff I remember. Social networks are graphs. In various circumstances it makes sense to talk about the utility of a social network. A simple way to define this is as the sum of utilities of each vertex. For an example, think of the co-author graph. The vertices are authors, the edges represent an ongoing collaboration. One particular author benefits from the time given by their co-authors. In turn, this particular author offers their time to their co-authors. In the simplest model, the time is divided equally. Thus, the utility of a vertex $x$ is $1+\sum_{y \in N(x)} 1/|N(y)|$. (The $1$ comes from the time spent by $x$ on their own stuff, and $1/|N(y)|$ appears because $y$ divides its time equally between its neighbors $N(y)$, one of those neighbors being $x$.) Once you have such a simple model, you can ask questions like: Which configurations are pairwise stable (meaning that there is no local incentive to create or delete edges)? Which configurations correspond to global optima? For the answers, you'll have to take the course. (Well, you could find out the answers yourself, of course.)
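The utility formula translates directly into code. In this sketch, the graph is assumed to be a dictionary from each vertex to the set of its neighbors; the function names are made up.

```python
def utility(neighbors, x):
    """Utility of vertex x: 1 for own work, plus an equal share of each co-author's time."""
    return 1 + sum(1 / len(neighbors[y]) for y in neighbors[x])

def network_utility(neighbors):
    """Utility of the whole network: the sum of the utilities of its vertices."""
    return sum(utility(neighbors, x) for x in neighbors)
```

In a triangle every author has utility $1 + 2\cdot\frac12 = 2$. In a star with three leaves, the center has utility $4$ while each leaf has only $1+\frac13$.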

Statistical Mechanics: Algorithms and Computations. This course is mostly about Markov chains. You can think of Markov chains as a way to sample complicated probability distributions, or as a way to estimate integrals. Come to think of it, the course is more about sampling and integrating than about Markov chains. For example, when you get to the quantum stuff, you see that computing path integrals is better done using Levy paths. These are a direct sampling technique, which works better than the Markov chain approach, in this case.

Also, these techniques can be used for optimization problems. One example is the traveling salesman problem, which makes an appearance in one of the reviews above. I was conceited, and expected that the program I wrote with more than one day of work would do better. After all, I looked at the Python program provided by the instructors and it seemed very similar, except: (1) it did no local optimization using dynamic programming, (2) the vocabulary of moves was a small subset of the one I accumulated. And yet, their implementation worked much better. Wat?? I had to take a look. It turns out there was only one difference: They sometimes allowed a move that made the tour worse. (They used simulated annealing, rather than greedy.) In other words, my solution was getting stuck in local optima. I'm not sure why I didn't think of this when I was doing the discrete optimization course. (It seems obvious with hindsight.) A good rule of thumb: Whenever you use greedy for an optimization problem, at least consider trying simulated annealing.
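To make the difference concrete, here is a minimal sketch of simulated annealing for the euclidean traveling salesman. It is not the instructors' program: the only move is the reversal of a segment of the tour (the move that undoes a crossing), and the geometric cooling schedule and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def tour_length(points, tour):
    """Length of the closed tour that visits points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def anneal(points, steps=20_000, t_start=1.0, t_end=1e-3, seed=0):
    """Return (best tour found, its length)."""
    rng = random.Random(seed)
    n = len(points)
    tour = list(range(n))
    cur = tour_length(points, tour)
    best, best_len = tour[:], cur
    for s in range(steps):
        t = t_start * (t_end / t_start) ** (s / steps)     # temperature decreases over time
        i, j = sorted(rng.sample(range(n), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]   # reverse a segment
        cand_len = tour_length(points, cand)
        # Greedy would accept only improvements. Annealing sometimes accepts
        # a worse tour, with a probability that shrinks as t decreases.
        if cand_len <= cur or rng.random() < math.exp((cur - cand_len) / t):
            tour, cur = cand, cand_len
            if cur < best_len:
                best, best_len = tour[:], cur
    return best, best_len
```

Deleting the `or rng.random() < ...` part of the condition turns this back into the greedy approach that gets stuck in local optima.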

Starting Soon. Out of these courses, the ones that start soon are the following:

A Brief History of Humankind on August 10.
Automata on September 1.
Social and Economic Networks: Models and Analysis on September 21.

At Most R

2014-07-30T22:05:00.000+01:00

How to encode a cardinality constraint as boolean constraints.

Motivation. All problems in NP can be encoded as a SAT problem, because SAT is NP-complete. Often you need to say that $x_1+\cdots+x_n \le r$. For example, you might want to do this to reduce MAXSAT to SAT. (And we know that MAXSAT is useful for static analysis.) MAXSAT is an optimization problem; SAT is its decision version. The general strategy to reduce optimization to decision is to use binary search. So that's what we'll do.

The input of MAXSAT is a set $\{C_1,\ldots,C_n\}$ of clauses. The question is what is the maximum number of clauses that can be satisfied. To solve this, we ask several SAT questions of the following form: $$\begin{align} (C_1 \lor y_1) \land (C_2 \lor y_2) \land\ldots\land (C_n \lor y_n) \land {\it encode}(y_1+\cdots+y_n \le r) \end{align}$$

If $r=0$, then all the auxiliary variables $y_1,\ldots,y_n$ must be $0$. In this case, the question asks if all clauses can be satisfied. On the other hand, if $r=n$, then the answer is yes, by setting $y_1=\cdots=y_n=1$. In general, the question asks the following: ‘Is it possible to satisfy $n-r$ clauses (simultaneously)?’

How to Do It. I knew for some time that efficient encodings exist, but I didn't look them up. At some point I needed the case $r=1$, and I was able to find a good solution. Now that I read how the general $r$ works I think the encoding is kinda cute. The encoding was found by Carsten Sinz in 2005.

Here's an example for $r=6$:

10100001011010110111
 1100001011010110111
  110001011010110111
   11001011010110111
    1101011010110111
     111011010110111
      11111010110111
       1111110110111
        111111110111

What's happening here? I'm pushing the 1s to the right. Once I see a group of more than $6$ I say BAD!. But how exactly do we push 1s with a boolean circuit? And how many auxiliary variables do we need? Clearly, the gray ones are just the original values. But what's the general rule to compute the others?

Here's the same computation, but more explicit this time.

10100001011010110111
0000000
1000000            drop
 1000000           shift
 1100000           drop
  1100000          shift
   1100000         shift
    1100000        shift
     1100000       shift
     1110000       drop
      1110000      shift
      1111100      drop
       1111100     shift
       1111110     drop
        1111110    shift
        1111111    drop

This time the original variables are in the first line, and all the others are auxiliaries. There are two kinds of steps that alternate. First, there are steps in which the group of $r+1$ auxiliaries is shifted to the right. Second, there are steps where a group of consecutive 1s is dropped from the first row. These 1s are dropped only if they would end up adjacent to the 1s that were dropped earlier. I omit some of these steps when they are not interesting. For example, if no 1s are dropped, then I omit the dropping step.

Hongseok Yang described this process as ‘moving a basket underneath the original values, and collecting the 1s’. I think that's a pretty good image to have in mind.

It should now be clear how this is done with a boolean circuit. Dropping is done with AND-gates, between the auxiliary from the left and the original from above. Next, any boolean circuit can be transformed into a conjunctive normal form, using the Tseitin transformation.

Details. If the original variables are $x_1,\ldots,x_n$, then we introduce auxiliaries $y_i^j$ for $1 \le i \le r+1$ and $0 \le j$ and $i+j \le n$. The clauses are $$\begin{align} y_i^{j-1} &\to y_i^j &&\text{for shifting} \\ y_{i-1}^j \land x_{i+j} &\to y_i^j &&\text{for dropping} \\ y_1^j\land y_2^j \land\ldots\land y_{r+1}^j &\to 0 && \text{for detecting bad situations} \end{align}$$

Yes, I am being sloppy with some boundary situations. Also, I'm using slightly more variables than necessary: Can you see how batches of $r$ auxiliaries (rather than batches of $r+1$ auxiliaries) are enough?
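To make the boundary cases concrete, here is a minimal sketch of a sequential-counter encoding in the same spirit. It is not the exact indexing used above, nor necessarily the clause set from Sinz's paper: the counter variable `s(i, j)` (my naming) reads ‘at least $j$ of $x_1,\ldots,x_i$ are 1’, clauses are lists of nonzero integers, and `extends` is a brute-force check standing in for a SAT solver.

```python
from itertools import product

def at_most(n, r):
    """Clauses forcing x_1 + ... + x_n <= r, and the total number of variables.

    Variables 1..n are the x's. The clauses only force s(i, j) to be true
    when the count is reached; they never force it to be false.
    """
    def s(i, j):                                   # 1 <= i <= n, 1 <= j <= r+1
        return n + (i - 1) * (r + 1) + j
    clauses = []
    for i in range(1, n + 1):
        for j in range(1, r + 2):
            if i > 1:
                clauses.append([-s(i - 1, j), s(i, j)])          # shifting
            if j == 1:
                clauses.append([-i, s(i, 1)])                    # dropping the first 1
            elif i > 1:
                clauses.append([-s(i - 1, j - 1), -i, s(i, j)])  # dropping one more 1
    clauses.append([-s(n, r + 1)])                 # r+1 ones is the bad situation
    return clauses, n + n * (r + 1)

def extends(clauses, total, xs):
    """Brute force: can the values xs for x_1..x_n be extended to satisfy all clauses?"""
    for aux in product([False, True], repeat=total - len(xs)):
        val = (None,) + tuple(xs) + aux            # val[v] is the value of variable v
        if all(any(val[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

clauses, total = at_most(3, 1)
print(extends(clauses, total, (True, False, False)))   # one 1: allowed
print(extends(clauses, total, (True, False, True)))    # two 1s: rejected
```

The sketch uses about $n(r+1)$ auxiliaries and $2n(r+1)$ clauses, each with at most three literals.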

Others. There's also an encoding based on dividing the array of booleans in half. You compute for each half if you have at least $k$ 1s, for $1 \le k \le r+1$. You then aggregate the results for halves into the result for the whole array.
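The aggregation step can be illustrated on plain values rather than clauses. In the sketch below (the name `at_least` is mine), each `and`/`any` would become a gate in the circuit: the whole array has at least $k$ 1s exactly when some split $a+b=k$ has at least $a$ in the left half and at least $b$ in the right half.

```python
def at_least(bits, r):
    """Return t with t[k] true iff bits contains at least k ones, for 0 <= k <= r+1."""
    if len(bits) <= 1:
        one = bool(bits and bits[0])
        return [True, one] + [False] * r
    mid = len(bits) // 2
    left, right = at_least(bits[:mid], r), at_least(bits[mid:], r)
    # Aggregate the two halves: try every way of splitting k between them.
    return [any(left[a] and right[k - a] for a in range(k + 1))
            for k in range(r + 2)]

bits = [int(c) for c in "10100001011010110111"]
print(at_least(bits, 6)[7])   # is the constraint "at most 6 ones" violated?
```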

Using either of these methods you can encode equality constraints, like $x_1+\cdots+x_n=r$.

Depth First Search in Python

2014-07-24T21:17:00.000+01:00

Everybody thinks they understand depth-first search. The algorithm is indeed simple. But — I think — it is deceptively simple. In this post I play with some code.

Given a graph like

g = { 0 : [1, 2]
    , 1 : [0, 3]
    , 2 : [0, 3]
    , 3 : [1, 2] }

you are most likely to see DFS implemented as some variation of the following:

def dfs_a(g, root):
  seen = set()
  def rec(x):
    seen.add(x)
    for y in g[x]:
      if y not in seen:
        rec(y)
  rec(root)

This is not bad, but it has two issues. One is that it hides lots of cool features of DFS. Another is that Python is crap at recursion.

Let's first expand the code to expose a few properties of the algorithm.

def dfs_b(g):
  seen, done = {}, {}
  time = 0
  def previsit(x):
    nonlocal time
    time += 1
    seen[x] = time
    print('previsit',x,'at time',time)
  def postvisit(x):
    nonlocal time
    time += 1
    done[x] = time
    print('postvisit',x,'at time',time)
  def rec(y):
    for z in g[y]:
      if z in done:
        print('cross arc from',y,'to',z)
        continue
      if z in seen:
        print('back arc from',y,'to',z)
        continue
      previsit(z)
      rec(z)
      postvisit(z)
  for x in g.keys():
    if x in done:
      print('already visited',x)
      continue
    previsit(x)
    rec(x)
    postvisit(x)

A few things to notice:

Sometimes you'll want to iterate over all vertices, instead of using one root.
There is a time when you first get to a vertex, and a time when you leave the vertex. These times are well-bracketed. That is, if the times for vertex $x$ are $(a_x,b_x)$ and those for vertex $y$ are $(a_y,b_y)$, then these two intervals are either disjoint, or one includes the other.
The post-order of the vertices is what you get if you grep for postvisit. The pre-order of the vertices is what you get if you grep for previsit. The post-order is useful, for example, in Kosaraju's algorithm to find strongly connected components.
There is a difference between back arcs and cross arcs! To distinguish them, you need two bits per vertex. In this implementation I used the dictionaries seen and done. (Strictly speaking, an arc into a vertex that is already done is a forward arc if it points to a descendant, and a cross arc otherwise; the code above reports both kinds as cross arcs.) If you know you don't have cycles, you can get rid of seen. (You might know this, say, if you look at the graph of Git commits.) If you know you don't have cross arcs, you can get rid of done. (Note: A graph is reducible iff its set of back arcs doesn't depend on the order of the for loops above.)
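The well-bracketing claim is easy to check mechanically. Below is a small sketch: `dfs_times` is a silent variant of dfs_b that only records the two times, and `well_bracketed` (a helper of mine) tests every pair of intervals.

```python
def dfs_times(g):
    """Entry and exit times of every vertex, numbered as in dfs_b."""
    seen, done = {}, {}
    time = 0
    def rec(x):
        nonlocal time
        time += 1
        seen[x] = time                 # previsit
        for y in g[x]:
            if y not in seen:
                rec(y)
        time += 1
        done[x] = time                 # postvisit
    for x in g:
        if x not in seen:
            rec(x)
    return seen, done

def well_bracketed(seen, done):
    """True iff any two intervals (seen[v], done[v]) are disjoint or nested."""
    spans = [(seen[v], done[v]) for v in seen]
    return all(b < c or d < a or (a < c and d < b) or (c < a and b < d)
               for (a, b) in spans for (c, d) in spans if (a, b) != (c, d))

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
seen, done = dfs_times(g)
print(sorted(done, key=done.get))      # the post-order
print(well_bracketed(seen, done))
```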

What about complexity? If the graph has $m$ arcs and $n$ vertices, then the time is $\Theta(m+n)$ and the space is $\Theta(n)$.

Now on to the second problem. Run this code:

n = 0
try:
  def go():
    global n
    n += 1
    go()
  go()
except RecursionError:
  print('gave up at',n)

On my computer it says gave up at 999. (CPython's default recursion limit is 1000.) Even puny graphs have 999 vertices!

But we can make the call stack explicit. For this, we need to notice what state rec has. Very little: the current vertex and the state of the iterator. The transformation into the code below is almost automatic.

def dfs_c(g):
  seen, done = {}, {}
  time = 0
  def previsit(x):
    nonlocal time
    time += 1
    seen[x] = time
    print('previsit',x,'at time',time)
  def postvisit(x):
    nonlocal time
    time += 1
    done[x] = time
    print('postvisit',x,'at time',time)
  for x, ys in g.items():
    if x in done:
      print('already visited',x)
      continue
    previsit(x)
    stack = [(x, iter(ys))]
    while stack:
      y, i = stack.pop()
      try:
        z = next(i)
        stack.append((y, i))
        if z in done:
          print('cross arc from',y,'to',z)
          continue
        if z in seen:
          print('back arc from',y,'to',z)
          continue
        previsit(z)
        stack.append((z, iter(g[z])))
      except StopIteration:
        postvisit(y)

I must admit: This looks a bit scary. One reason why it does is that it keeps explicit all these distinctions that could be of use. For example, post-visiting is useful for computing SCCs, but it is not always needed. So, it's now time to strip all these, and go back to the first piece of code. The only modification I'll keep is the change from recursion to a loop.

def dfs_d(g, root):
  seen = {root}
  stack = [(root, iter(g[root]))]
  while stack:
    try:
      x, i = stack.pop()
      y = next(i)
      stack.append((x, i))
      if y not in seen:
        seen.add(y)
        stack.append((y, iter(g[y])))
    except StopIteration:
      pass

This is almost simple. But, hey, if I google for ‘depth-first search python’, then I get this code by Edward Mann.

def dfs(graph, start):
    visited, stack = set(), [start]
    while stack:
        vertex = stack.pop()
        if vertex not in visited:
            visited.add(vertex)
            stack.extend(graph[vertex] - visited)
    return visited

This code is even simpler! Yes, it is, and you should use it: compared to the other top results from Google that I saw, this one is clearly the best. Note that it expects each graph[vertex] to be a set, unlike the lists in g above. It has only one problem: the memory use is $\Theta(m)$ rather than $\Theta(n)$, because a vertex can sit on the stack once per incoming arc. This means that for dense graphs you might want to write a bit of extra code to deal with iterators.
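To see the difference in memory use, one can instrument both variants and run them on a complete graph. This is a rough sketch: the two `peak_stack_*` functions are my own instrumented copies, not code from either source.

```python
def peak_stack_sets(graph, start):
    """Largest stack size reached by the set-based DFS quoted above."""
    visited, stack, peak = set(), [start], 1
    while stack:
        vertex = stack.pop()
        if vertex not in visited:
            visited.add(vertex)
            stack.extend(graph[vertex] - visited)
            peak = max(peak, len(stack))
    return peak

def peak_stack_iters(graph, start):
    """Largest stack size reached when the stack holds (vertex, iterator) pairs."""
    seen, stack, peak = {start}, [(start, iter(graph[start]))], 1
    while stack:
        x, it = stack[-1]
        for y in it:                   # resume where we left off for x
            if y not in seen:
                seen.add(y)
                stack.append((y, iter(graph[y])))
                peak = max(peak, len(stack))
                break
        else:                          # no unseen neighbour left: x is finished
            stack.pop()
    return peak

n = 10
complete = {v: set(range(n)) - {v} for v in range(n)}
print(peak_stack_sets(complete, 0), peak_stack_iters(complete, 0))
```

On the complete graph the first peak grows quadratically with the number of vertices, while the second never exceeds it.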

What's the point of the other variants? The point is that they are more general. By studying them, you should get a better feeling of what DFS does. For example, you might be able to decode this:

Algorithm for finding strongly connected components:

using DFS on the original graph, sort vertices in reverse post-order
using DFS on the reversed graph, find components; for the outer loop of this DFS use the order built at step 1
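One possible decoding of these two steps, reusing the stack of iterators from above, is sketched below. The helper names `dfs_trees` and `scc` are mine, and the sketch assumes that every vertex appears as a key of g.

```python
def dfs_trees(g, order):
    """One list per DFS tree, vertices in post-order; roots are tried in the given order."""
    seen, trees = set(), []
    for root in order:
        if root in seen:
            continue
        seen.add(root)
        tree, stack = [], [(root, iter(g[root]))]
        while stack:
            x, it = stack[-1]
            for y in it:
                if y not in seen:
                    seen.add(y)
                    stack.append((y, iter(g[y])))
                    break
            else:                      # iterator exhausted: postvisit x
                stack.pop()
                tree.append(x)
        trees.append(tree)
    return trees

def scc(g):
    """Strongly connected components of g, each as a list of vertices."""
    rev = {x: [] for x in g}           # the reversed graph
    for x, ys in g.items():
        for y in ys:
            rev[y].append(x)
    # Step 1: post-order of the original graph.
    post = [x for tree in dfs_trees(g, g) for x in tree]
    # Step 2: each DFS tree of the reversed graph is one component,
    # provided the roots are tried in reverse post-order.
    return dfs_trees(rev, reversed(post))

g2 = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3], 5: []}
print(scc(g2))
```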