Posted by

Brian Keng

on June 4, 2018

Share to

More Posts...

Bayesian Table Tennis: Building a Hierarchical Table Tennis Ranking Model

by Brian Keng | June 4, 2018

At Rubikloud, we are big fans of iterative development for our software, for our ML models and especially for our table tennis ranking system! Last year I wrote about Building A Table Tennis Ranking Model using the Bradley-Terry model and Google sheets. The model gives a single rating for each player (not unlike the Elo rating system in Chess). One drawback of this model is that it has no measure of “confidence”. For example, a new player could have a single lucky game against the reigning champion and suddenly soar to the top of the rankings, but should we be that confident in their ranking? Enter Bayesian Hierarchical Modelling.

Hierarchical Bayesian Models

Many of the parametric models you know and love (such as linear regression or MLPs), use a maximum likelihood estimate (MLE) or maximum a posterior (MAP) to estimate the parameters \(\theta\) of a model with observations \(x\):

\hat{\theta}_{MLE} &= \underset{\theta \in \Theta}{\arg\max}\{\mathcal{L}(\theta, x)\} \\
\hat{\theta}_{MAP} &= \underset{\theta \in \Theta}{\arg\max}\{\mathcal{L}(\theta, x) + g(\theta)\}\\

where \(\mathcal{L}(\theta, x)\) is the log-likelihood function and regularizer (or prior) \(g(\theta)\) on the parameters.  In the case of our Bradley-Terry model, the parameters \(\hat{\theta}\) are the estimates for our players’ ratings. The nice thing about this formulation is that we can apply your favourite optimization technique (e.g. gradient descent) to find the parameters that maximize Equation 1. The downside is that we only get a point estimate of the parameters, there’s no concept of how “confident” we are in the estimates. Or another way to put it: there’s no measure of uncertainty.

Bayesian hierarchical modelling takes a slightly different philosophical approach. Every parameter of interest in the model is a random variable with its own distribution. This means we can ask our usual questions of a random variable such as: what’s the mean? median? standard deviation?  \(P(a < \theta < b)\)? With a distribution for each parameter, we can now quantify the uncertainty of our estimate in any way we wish. Cool right?

Bayesian modelling always starts off with a statement of Bayes theorem describing the relationship between the parameters of interest and the data:

P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)} = \frac{\text{likelihood}\cdot\text{prior}}{\text{marginal likelihood}}\tag{2}

The likelihood and prior (aka regularizer) from our MAP estimate above are included there along with a constant factor in the denominator. The MAP estimate attempts to find a point-estimate that maximizes the RHS of Equation 2, while a full Bayesian analysis attempts to find the distribution specified by \(\theta\). The biggest downside of doing a fully Bayesian analysis is that it’s hard to compute. In general, there is no closed form solution and we must rely on approximation methods such as variational methods or MCMC methods that generate samples from the distribution.

Once we have a distribution for each parameter, we can generate the credible interval (or region for multi-variate distributions) which involves finding an interval \((a,b)\) such that \(P(a<\theta<b)=1-\alpha\). For example, we might want to find a 50% credible interval for each player’s rating.

A Bayesian Ranking Model

To define our Bayesian hierarchical model, we need to specify the likelihood and prior functions from Equation 2 (the marginal likelihood is a constant so we don’t need to specify it). We’re going to follow the Bradley-Terry model, where we assume that the probability of player \(i\) beating player \(j\) is:

P(i > j) = \frac{p_i}{p_i + p_j} \tag{3}

where \(p_i\) is the rating for player \(i\). With this idea, we can define our likelihood function by modelling with player \(i\) having \(w_{i,j}\) wins against player \(j\) in \(N_{i,j}\) games as a binomial distribution. Additionally, we’re going to take the easy route and assume a flat prior with support \([0,1]\). Putting it all together with observations \(x\) (all the wins and loses):

P(x|p_1,\ldots,p_K) &= \sum_{i=1}^N  \sum_{j=1}^i \binom{N_{i,j}}{w_{i,j}}\big(\frac{p_i}{p_i + p_j}\big)^{w_{i,j}}\big(1-\frac{p_i}{p_i + p_j}\big)^{N_{i,j}-w_{i,j}} \
P(p_1,\ldots,p_K) &= \sum_{i=1}^N \text{const} \ \tag{4}

In this case, we’re assuming that each player’s rating has equal probability of being any possible rating in \([0,1]\). In certain cases, this might cause some issues fitting the model (especially if we’re trying to compute a MAP estimate) but we’ll be using the data regularizing scheme from my original post Building A Table Tennis Ranking Model. The added benefit here is that the MAP estimate for the models are identical.


In recent years, Bayesian statistical packages have come a long way. There are simple interfaces with heavily optimized engines to fit Bayesian models. The most popular Bayesian statistics package is Stan, but I ended up using PyMC3 because I like the native Python interface. For a simple model like this, there’s not much difference in performance so usability is the main concern.

The nice thing about these packages is that you can take Equation 4 and basically translate it directly, here’s a code snippet of my model in PyMC3:

As you can see it has a relatively simple syntax. In the first loop we’re just setting up the priors, and in the second loop we’re going through each pair of observations and adding a term to the likelihood function. From this, we have the MAP estimate and we have a trace of samples drawn from the distribution of each player’s rating. You can find the full implementation here:

The Official Rubikloud Table Tennis Rankings V2

I’m happy to announce that V2 of our table tennis rankings is now in production. To simplify the UI, instead of plotting each individual’s distribution, I simply show each person’s 50% credible interval centered on the MAP estimate:

Table Tennis Rankings Chart

The scores are put on a logarithmic scale and scaled from 1 to 1000. Visually it looks like this:

Table Tennis Whisker Chart

As you can see, some players have a very wide range while others have a very tight range. This shows the “confidence” the model has in the rating. The new changes have been well received with players now having more interest in trying to move up the rankings.

Rubikloud is a great place to work. We tackle interesting problems from recommender systems to our table tennis rankings. If you enjoy table tennis and get excited about building probabilistic models, Rubikloud is always looking for ambitious and curious data scientists: