Research on Search

New-style Python classes

2012-08-21T16:11:00.000+01:00

Python 2.2 introduced a new kind of class, called new-style classes. New-style classes add some convenient functionality to classic classes. A new-style class is recognized by having object as a (direct or indirect) base class.
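
As a minimal sketch of what the object base class buys you, the hypothetical Temperature class below (my own illustration, not taken from any library) uses properties, a feature that only works fully on new-style classes in Python 2. In Python 3 every class is new-style, so the explicit base is optional there.

```python
class Temperature(object):  # inheriting from object makes this new-style in Python 2
    """Hypothetical example class used only for illustration."""

    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def celsius(self):
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        # The setter lets us validate plain attribute assignment.
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._celsius = value

    @property
    def fahrenheit(self):
        return self._celsius * 9.0 / 5.0 + 32.0


t = Temperature(100.0)
t.celsius = 25.0                              # routed through the setter
is_new_style = isinstance(Temperature, type)  # True for new-style classes
```

Reading t.fahrenheit afterwards recomputes the value from the stored Celsius figure, so the two views never drift apart.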

Python caching decorators

2012-08-09T17:18:00.000+01:00

Implementing a dynamic programming algorithm is made much easier in Python by using a caching/memoization decorator.
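
A minimal sketch of such a decorator, assuming the decorated function takes only hashable positional arguments. The names memoize and fib are mine, chosen for illustration; in Python 3.2 and later, functools.lru_cache offers a ready-made alternative.

```python
import functools


def memoize(func):
    """Cache results of func, keyed by its positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


@memoize
def fib(n):
    # Naive recursion becomes linear-time once sub-results are cached.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Because the name fib is rebound to the wrapper, the recursive calls inside the function body also go through the cache.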

Python's handling of default parameter values

2012-05-11T13:15:00.000+01:00

Python's handling of default parameter values becomes tricky when one uses a mutable object (e.g., list or dictionary) as a default value.

Default parameter values are evaluated when the function definition is executed. This means that the expression is evaluated once, when the function is defined, and that the same "pre-computed" value is used for each call. If the function modifies the object (e.g. by appending an item to a list), the default value is in effect modified.

This has to be borne in mind when writing recursive Python programs.
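
A small sketch of the pitfall and of the usual workaround with a None sentinel. The function names are hypothetical.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # and the very same object is reused on every call.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Common idiom: use None as a sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared(1)
second = append_shared(2)   # same list object as `first`, now [1, 2]
third = append_fresh(1)     # [1]
fourth = append_fresh(2)    # [2], a separate list
```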

Python's splat operator

2012-05-11T10:47:00.001+01:00

Python actually has a splat operator (as in Perl or Ruby) that can unpack the arguments out of a list, tuple, or dictionary: func(*args) unpacks a list or tuple into positional arguments, and func(**kwargs) unpacks a dictionary into keyword arguments.
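
For example, with a hypothetical function describe:

```python
def describe(name, age, city="unknown"):
    return "%s, %d, %s" % (name, age, city)


args = ("Ada", 36)
kwargs = {"city": "London"}

positional = describe(*args)        # same as describe("Ada", 36)
mixed = describe(*args, **kwargs)   # same as describe("Ada", 36, city="London")
```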

The Hungarian algorithm in clustering evaluation

2011-09-07T11:21:00.002+01:00

The Hungarian algorithm (aka Kuhn–Munkres algorithm or Munkres assignment algorithm) can solve the assignment problem in polynomial time, O(n^3). It can be used to find the optimal one-to-one mapping from discovered clusters to the ground-truth categories, which serves as the basis for some performance measures of clustering.
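
To make the mapping step concrete, here is a sketch that finds the best cluster-to-category mapping by brute force over permutations. This is deliberately not the Hungarian algorithm: it is only feasible for a handful of clusters, but it returns the same optimal mapping that an O(n^3) solver (for instance SciPy's linear_sum_assignment) would find. The function name and the toy data are my own, and the sketch assumes there are no more clusters than categories.

```python
from itertools import permutations


def best_mapping_accuracy(clusters, labels):
    """Accuracy under the best one-to-one mapping of cluster ids to categories."""
    cluster_ids = sorted(set(clusters))
    label_ids = sorted(set(labels))
    # contingency[c][l] = number of items in cluster c whose true category is l
    contingency = {c: {l: 0 for l in label_ids} for c in cluster_ids}
    for c, l in zip(clusters, labels):
        contingency[c][l] += 1

    best_hits, best_map = -1, None
    for perm in permutations(label_ids, len(cluster_ids)):
        hits = sum(contingency[c][l] for c, l in zip(cluster_ids, perm))
        if hits > best_hits:
            best_hits, best_map = hits, dict(zip(cluster_ids, perm))
    return best_hits / float(len(labels)), best_map


clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["b", "b", "a", "a", "a", "c", "c", "b"]
accuracy, mapping = best_mapping_accuracy(clusters, labels)
```

On this toy data the best mapping sends cluster 0 to "b", cluster 1 to "a" and cluster 2 to "c", which labels 6 of the 8 items correctly.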

Fastest membership test in Python

2011-08-28T16:14:00.003+01:00

What is the most efficient method to check whether an item is in a given group or not? In Python, it seems that set (or frozenset) would be slightly faster than dict and much much faster than list.
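
A rough way to check this on your own machine, using timeit on a worst-case lookup (an item that is not present). The container size and repeat count are arbitrary choices of mine, and the absolute numbers will vary between machines and Python versions.

```python
import timeit


def time_membership(n=10000, repeats=200):
    """Time `missing in container` for list, dict, set and frozenset."""
    items = list(range(n))
    containers = {
        "list": items,
        "dict": dict.fromkeys(items),
        "set": set(items),
        "frozenset": frozenset(items),
    }
    missing = -1  # not in any container, so the list is scanned to the end
    return {
        name: timeit.timeit(lambda c=container: missing in c, number=repeats)
        for name, container in containers.items()
    }


timings = time_membership()
for name in sorted(timings, key=timings.get):
    print("%-10s %.6f s" % (name, timings[name]))
```

The list lookup is a linear scan, while the hash-based containers answer in roughly constant time, which is where the large gap comes from.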

Submodular functions

2011-08-26T20:02:00.002+01:00

Intuitively, a submodular function over the subsets demonstrates "diminishing returns", which is related to the concept of marginal utility in economics. Its usefulness for machine learning is well explained and illustrated by the Beyond Convexity tutorial. There is also a Matlab toolbox for submodular function optimization, developed by Andreas Krause.
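
A toy illustration of diminishing returns, using set coverage (a standard example of a submodular function). The sensor data and function names are made up for this sketch.

```python
def coverage(sets, chosen):
    """f(S) = number of distinct elements covered by the chosen sets."""
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def marginal_gain(sets, chosen, extra):
    """Increase in coverage from adding one more set to `chosen`."""
    return coverage(sets, chosen | {extra}) - coverage(sets, chosen)


sensors = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}
small = frozenset(["a"])
large = frozenset(["a", "b"])                    # a superset of `small`
gain_small = marginal_gain(sensors, small, "c")  # adds elements 4, 5, 6
gain_large = marginal_gain(sensors, large, "c")  # adds only 5, 6
```

Adding "c" helps less once "b" is already chosen, because "b" has covered element 4: the gain drops from 3 to 2.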

L1 regularisation is efficient for selecting relevant features

2011-08-26T15:24:00.004+01:00

Andrew Ng has proven in his ICML-2004 paper that sample complexity grows linearly in the number of irrelevant features when using L2 regularisation (in logistic regression, support vector machine, and back-propagation neural network), but only logarithmically when using L1 regularisation (in logistic regression).

New linear-time algorithm for suffix array construction

2011-07-03T13:44:00.002+01:00

Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt: Linear Work Suffix Array Construction, Journal of the ACM (JACM), Volume 53 Issue 6, November 2006.
As the authors have said, this algorithm narrows the gap between suffix tree and suffix array, which are widely used and largely interchangeable index structures on strings and sequences. Usually theoreticians prefer the former due to linear-time construction algorithms and more explicit structure while practitioners prefer the latter due to its simplicity and space efficiency. Now there is one more reason for practitioners to stick with suffix array.
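
For readers who have not met the data structure: a suffix array lists the start positions of all suffixes of a string in lexicographic order. The sketch below builds one naively by sorting the suffixes. It is emphatically not the linear-time algorithm from the paper, only an illustration of what that algorithm outputs.

```python
def naive_suffix_array(text):
    """Start positions of the suffixes of text, in lexicographic order.

    Sorting whole suffixes is far from linear time; this only shows
    what a suffix array contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = naive_suffix_array("banana")
# Suffixes in order: a, ana, anana, banana, na, nana
```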

Research Impact for REF

2011-07-01T09:01:00.006+01:00

The British government's emphasis on the practical impact of research in REF reminds me of Feynman's following words.

Physics [research] is like sex: sure, it may give some practical results, but that's not why we do it.

DiveRS'11

2011-06-17T23:32:00.002+01:00

Pablo Castells, Jun Wang, Ruben Lara, and Dell Zhang are organising an ACM RecSys-2011 workshop on Novelty and Diversity in Recommender Systems (DiveRS). A special issue of ACM TIST in the scope of the workshop will be announced after the conference. Authors of accepted papers will be invited to submit an extended version.

A couple of metrics

2011-06-17T23:19:00.002+01:00

It is often desirable to measure the dissimilarity or distance between items using a proper metric.

The Jaccard coefficient can be converted to a metric by subtracting it from 1.
The Kullback–Leibler divergence can be converted to a metric by taking the square root of its symmetric version, the Jensen–Shannon divergence.
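
Both conversions fit in a few lines. This sketch assumes that items are given as sets for the Jaccard case and as discrete probability distributions of equal length (lists summing to 1) for the Jensen–Shannon case. The function names are mine, and logarithms are taken base 2.

```python
import math


def jaccard_distance(a, b):
    """1 minus the Jaccard coefficient of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / float(len(a | b))


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]  # the mixture distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding
```

Comparing each distribution with the mixture m rather than with the other distribution is what keeps the divergence finite even when p and q have different supports.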

A Poor Man's Parallel Processing

2010-10-04T00:18:00.003+01:00

A very crude, but often good enough, method to achieve parallel processing (e.g., on multi-core computers) is to partition the large input data file into small chunks, run the program on each chunk in parallel, and then merge the output files back together. Fortunately, this can be done easily with judicious use of two Unix utilities: split and cat.

nDCG

2010-09-10T21:28:00.003+01:00

The choice of the gain and discount function for the popular IR performance measure normalised Discounted Cumulative Gain (nDCG) has been discussed and empirically justified in a CIKM-2009 paper through analysis of variance (ANOVA).
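
For reference, a sketch of nDCG with one common choice of gain and discount: gain 2^rel - 1 and discount log2(rank + 1). These particular functions are an assumption on my part for illustration, not necessarily the ones the paper recommends.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances listed in ranked order."""
    return sum((2 ** rel - 1) / math.log(rank + 1, 2)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that is already sorted by relevance scores exactly 1, and any other ordering of the same items scores less.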

LNRE

2010-08-11T15:09:00.002+01:00

Here is a good tutorial with Matlab examples about Statistical Estimation for Large Numbers of Rare Events (LNRE).

VLFeat - a computer vision toolbox

2010-06-18T21:56:00.002+01:00

The VLFeat open source computer vision library implements popular

feature extraction algorithms (such as SIFT, MSER, and quick shift),
clustering algorithms (such as integer k-means, hierarchical k-means, and agglomerative information bottleneck), and
matching algorithms (such as randomized kd-trees).

It is written in C for efficiency and compatibility, with interfaces in MATLAB for ease of use, and detailed documentation throughout.

Bloom filters and Locality Sensitive Hashing

2010-06-01T09:24:00.002+01:00

Locality Sensitive Hashing (LSH) with l bits is achieved by carrying out l independent random cuts of the Euclidean space: if two data points are on the same side of all these cuts, they are very likely to be nearest neighbours. In this sense, I think Bloom filters (which also rely on a number of independent hash functions) can be conceptually considered as the extreme case of LSH: each of their cuts tries to separate one data point from the rest.
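
A sketch of the "random cuts" idea, assuming each cut is a random hyperplane through the origin (one common LSH family, suited to angular similarity). The function names and the seed are arbitrary choices of mine.

```python
import random


def make_cuts(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes, each given by a Gaussian normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(point, cuts):
    """One bit per cut: which side of the hyperplane the point falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(cut, point)) >= 0 else 0
        for cut in cuts
    )


cuts = make_cuts(num_bits=8, dim=3)
sig_a = lsh_signature([1.0, 2.0, 3.0], cuts)
sig_b = lsh_signature([2.0, 4.0, 6.0], cuts)     # same direction as a
sig_c = lsh_signature([-1.0, -2.0, -3.0], cuts)  # opposite direction
```

Points pointing the same way share a signature, while the opposite point lands on the other side of every cut, so its signature is the bitwise complement.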

An application of Bloom filters

2010-05-31T10:22:00.004+01:00

It is said that Google's BigTable uses Bloom filters to reduce the disk lookups for non-existent rows or columns.
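
To illustrate why this saves lookups, here is a toy Bloom filter. It is my own sketch, not BigTable's implementation: the sizes are arbitrary, and the k hash functions are simulated by salting SHA-256. A negative answer is always correct, so the expensive lookup can be skipped; a positive answer may occasionally be a false positive.

```python
import hashlib


class BloomFilter(object):
    """Toy Bloom filter over string keys (illustrative sizes only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Simulate independent hash functions by salting one strong hash.
        for i in range(self.num_hashes):
            data = ("%d:%s" % (i, key)).encode("utf-8")
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


rows = BloomFilter()
for key in ("row1", "row2", "row3"):
    rows.add(key)
```

A caller would check rows.might_contain(key) first and go to disk only when it returns True.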

A suffix tree implementation with Unicode support

2010-05-04T00:00:00.005+01:00

It seems that there is currently no suffix tree implementation with Unicode support publicly available online. So I adapted Thomas Mailund's suffix tree implementation in C with a Python binding and put it here. The changes that I made to the code were mainly to make it support Unicode text and be compatible with newer versions of Python. It also includes an example program all_comsubstr.py that illustrates the extraction of common substrings from two Chinese strings (encoded in UTF-8).

Longest Common Substring

2010-05-03T22:53:00.002+01:00

Given two strings, S of length m and T of length n, their longest common substrings can be found in O(m+n) time using a generalised suffix tree, or in O(mn) time through dynamic programming (e.g., the Python code here).
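
A sketch of the dynamic programming approach (my own version, not necessarily the linked code). The table entry for (i, j) holds the length of the longest common suffix of S[:i] and T[:j], and only the previous row is kept, so the table is filled in O(mn) steps using O(n) extra space.

```python
def longest_common_substrings(s, t):
    """Return (length, set of all longest common substrings of s and t)."""
    best_len, best = 0, set()
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common suffix ending at s[i-2] and t[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best = curr[j], {s[i - curr[j]:i]}
                elif curr[j] == best_len:
                    best.add(s[i - curr[j]:i])
        prev = curr
    return best_len, best
```

For example, longest_common_substrings("ABAB", "BABA") finds both "ABA" and "BAB", each of length 3.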

Bayesian inference for the Gaussian

2010-03-28T22:53:00.005+01:00

Given the prior probability
$p(\mu) = \mathcal{N}(x_0,\sigma_0^2)$
and the likelihood
$p(x_1|\mu) = \mathcal{N}(\mu,\sigma_1^2)$,
the expectation of the posterior probability
$p(\mu|x_1)$
has a very simple and elegant form:
$(\alpha x_0 + \beta x_1) / (\alpha + \beta)$
where
$\alpha = 1/(\sigma_0^2)$ and $\beta = 1/(\sigma_1^2)$
are the precisions.

Please refer to Bishop's PRML book section 2.3.6.
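
A quick numeric check of the formula, as a sketch with hypothetical function names. The posterior variance, 1 / (alpha + beta), is the standard companion result.

```python
def posterior_mean(prior_mean, prior_var, observation, noise_var):
    """Precision-weighted average of the prior mean and the observation."""
    alpha = 1.0 / prior_var   # precision of the prior
    beta = 1.0 / noise_var    # precision of the likelihood
    return (alpha * prior_mean + beta * observation) / (alpha + beta)


def posterior_variance(prior_var, noise_var):
    """Precisions add, so the posterior variance is 1 / (alpha + beta)."""
    return 1.0 / (1.0 / prior_var + 1.0 / noise_var)
```

With a vague prior (mean 0, variance 4) and a precise observation (value 10, variance 1), the posterior mean is 8: it is pulled most of the way towards the observation.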

Comparing Data Analysis Packages

2010-02-03T14:56:00.002+00:00

A succinct comparison of data analysis packages, including R, Matlab, SciPy, Excel, SAS, SPSS and Stata, can be found here. I recently tried Stata, but found its language syntax ugly and awkward.

The myth about the Internet

2009-11-11T11:58:00.004+00:00

Walter Willinger et al. recently published a paper in which the scale-free network model of the preferential attachment type for the Internet is said to be a myth, as it is based on fundamentally flawed traceroute data. Furthermore, they criticize the currently popular data-fitting approach to network science and argue that it should be replaced by the reverse-engineering approach.

Large networks are not modular

2009-08-12T23:24:00.008+01:00

A pretty striking finding in the WWW'08 paper from Leskovec et al. is that in nearly every network dataset they examined, there are tight but almost trivial communities at very small scales (up to around 100 nodes), while at larger scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like".

Spectral Graph Partitioning

2009-07-30T23:52:00.005+01:00

There are a number of methods in the family of Spectral Graph Partitioning, including the traditional min-cut and various balanced cut criteria (such as ratio-cut, average-cut, normalized-cut and minmax-cut). Each method uses a different objective function and consequently a different definition of partition (cluster) indicator vector. The following two tutorials on Spectral Clustering both contain a good summary of these methods.
[1] Spectral Clustering, ICML 2004 Tutorial by Chris Ding
[2] A Tutorial on Spectral Clustering by Ulrike von Luxburg