A Framework for Distributed Inference with Emerging Hardware.
NB: A static version of this interactive article has been published on arXiv and can be cited as:
@article{Ortiz2021visualGBP,
  title   = {A visual introduction to Gaussian Belief Propagation},
  author  = {Ortiz, Joseph and Evans, Talfan and Davison, Andrew J.},
  journal = {arXiv preprint arXiv:2107.02308},
  year    = {2021},
}
See our short video introduction to the article here.
In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), a probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems.
Bayesian probability theory is the fundamental framework for dealing with uncertain data, and is at the core of practical systems in machine learning and robotics.
There are various possible algorithms for probabilistic inference, many of which take advantage of specific problem structure for fast performance.
Efficient inference on models represented by large, dynamic and highly inter-connected graphs, however, remains computationally challenging and is already a limiting factor in real embodied systems.
Inference on such graphs will only become more important as general purpose perception systems strive towards more heterogeneous and dynamic representations containing abstractions (such as objects or concepts) with continually changing relationships.
As we search for the best algorithms for probabilistic inference, we should recall the "Bitter Lesson" of machine learning (ML) research: "general methods that leverage computation are ultimately the most effective, and by a large margin".
Looking to the future, we anticipate a long-term trend towards a "hardware jungle" of parallel, heterogeneous and distributed compute resources.
To scale arbitrarily and leverage all available computation in this "hardware jungle", our key hypothesis is that we need inference methods which can operate with distributed local processing / storage and message passing communication, without the need for a global view or coordination of the whole model.
Fortunately, a well-known inference algorithm exists which has these properties: Loopy Belief Propagation (BP).
BP guarantees exact marginal computation with one sequence of forward-backward message passing in tree-structured graphs.
One issue that has precluded the use of general Loopy BP is that it lacks convergence guarantees. However, a second and perhaps more relevant issue given modern hardware trends, is that its computational properties have not fitted the dominant processing paradigms of recent decades, CPUs and GPUs. Consequently, other factor graph inference algorithms have been preferred which take advantage of global problem structure to operate much more rapidly and robustly than loopy BP on a CPU.
We believe that now is the right time to re-evaluate the properties of loopy BP given the rise of graph-like computing hardware and sparse models.
Indeed, there has been a recent resurgence of interest in the algorithm.
In this article, we give an introduction to Gaussian Belief Propagation as a strong general purpose algorithmic and representational framework for large scale distributed inference. Throughout, we present a range of examples across 1D and 2D geometric estimation and image processing in the form of interactive simulations which allow the reader to experiment with and understand the properties of GBP. We hope that these simulations emphasize the key properties of GBP which we summarize below.
In the remainder of the article, we first introduce the GBP algorithm and show that it is intimately linked to solving a linear system of equations $Ax = b$. We then explore 4 key practical details for applying GBP to real problems: handling non-linear relationships, handling non-Gaussian data distributions, using local message schedules, and exploiting hierarchical structure to aid convergence.
Inference is the problem of estimating statistical properties of unknown variables $X$ from known or observed quantities $D$ (the data). For example, one might be interested in inferring tomorrow's weather (X) from historic data (D), or the 3D structure of an environment (X) from a video sequence (D).
Bayesian inference proceeds by first defining a probabilistic model $p(X, D)$ that describes the relationships between data and variables, and then using the sum and product rules of probability to compute the posterior distribution $p(X \mid D)$, the distribution over the variables given the observed data.
Given the posterior, we can compute various properties of $X$: for example, the single most likely configuration, $X_{\text{MAP}} = \arg\max_X \, p(X \mid D)$, or the marginal posterior distribution of each individual variable, $p(x_i \mid D) = \sum_{X \setminus x_i} p(X \mid D)$.
These two calculations are known as maximum a posteriori (MAP) inference and marginal inference respectively. An important difference is that MAP inference produces a point estimate while marginal inference retains information about uncertainty.
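To make the distinction concrete, here is a minimal sketch on a toy discrete joint distribution (the 2x2 table is our own illustrative example, not from the article): MAP inference returns a single configuration, while marginal inference returns a distribution for each variable.

```python
import numpy as np

# Hypothetical joint distribution p(x1, x2) over two binary variables,
# stored as a 2x2 table (rows: x1, columns: x2). Values are illustrative.
p = np.array([[0.30, 0.10],
              [0.15, 0.45]])

# MAP inference: the single most likely joint configuration.
x1_map, x2_map = np.unravel_index(np.argmax(p), p.shape)
print("MAP estimate:", (x1_map, x2_map))   # -> (1, 1)

# Marginal inference: a full distribution per variable,
# obtained by summing out the other variable.
p_x1 = p.sum(axis=1)   # p(x1)
p_x2 = p.sum(axis=0)   # p(x2)
print("p(x1) =", p_x1)  # retains uncertainty about x1
print("p(x2) =", p_x2)
```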
The Hammersley-Clifford theorem tells us that any positive joint distribution $p(X)$ can be represented as a product of factors $f_i$, one per clique, where a clique is a subset of variables $X_i$ in which each variable is connected to all others: $p(X) = \prod_i f_i(X_i)$.
Factorized representations can be very convenient as they expose structure in a model.
Factor graphs are the natural visual representation of this factorization and can be very useful for reasoning about problems.
In the factor graphs in this article, circles and squares represent variable and factor nodes respectively, with edges connecting each factor to the variables it depends on.
An example of a simple factor graph is shown in the diagram on the right.
By explicitly representing the factors as nodes in the graph, factor graphs clearly emphasize the conditional independence structure of the problem: the lack of a factor directly connecting two variables means they are conditionally independent.
Factor graphs can also be presented as energy based models, in which each factor is written as $f_i(X_i) \propto e^{-E_i(X_i)}$ with an associated energy $E_i$.
In energy based models, finding the most likely variable configuration is equivalent to minimizing the negative log probability, or the sum of the factor energies: $X_{\text{MAP}} = \arg\max_X \, p(X) = \arg\min_X \sum_i E_i(X_i)$.
Belief propagation (BP) is an algorithm for marginal inference, i.e. it computes the marginal posterior distribution for each variable from the set of factors that make up the joint posterior. BP is intimately linked to factor graphs by the following property: BP can be implemented as iterative message passing on the posterior factor graph. The algorithm operates by iteratively updating a node's locally stored belief by sending and receiving messages from neighbouring nodes. Each iteration consists of 3 phases:
Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor's belief on the receiving node's variables.
Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes' variables to produce a message that expresses the factor's belief over the receiving node's variables.
Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.
These 3 operations fully define the algorithm and their equations are presented below.

Belief update: $b_i(x_i) \propto \prod_{a \in N(i)} m_{f_a \rightarrow x_i}(x_i)$

Factor-to-variable message: $m_{f_a \rightarrow x_i}(x_i) = \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{x_j \rightarrow f_a}(x_j)$

Variable-to-factor message: $m_{x_i \rightarrow f_a}(x_i) = \prod_{c \in N(i) \setminus a} m_{f_c \rightarrow x_i}(x_i)$
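To make these operations concrete, here is a minimal sum-product sketch on a toy discrete chain $x_1 - f_a - x_2 - f_b - x_3$ (the factor tables and function names are our own). Since this graph is a tree, a single forward and backward sweep produces the exact marginals.

```python
import numpy as np

# Toy chain factor graph:  x1 -- f_a -- x2 -- f_b -- x3   (binary variables).
# Factor tables are illustrative, indexed as f[x_left, x_right].
f_a = np.array([[4.0, 1.0],
                [1.0, 3.0]])
f_b = np.array([[2.0, 1.0],
                [1.0, 5.0]])

def normalise(m):
    return m / m.sum()

# Factor-to-variable message: multiply the factor by the incoming
# variable-to-factor message, then sum out (marginalise) the other variable.
def factor_to_var(factor, incoming_msg, sum_axis):
    if sum_axis == 0:     # marginalise the left variable -> message to the right one
        return normalise(incoming_msg @ factor)
    else:                 # marginalise the right variable -> message to the left one
        return normalise(factor @ incoming_msg)

ones = np.ones(2)  # variable-to-factor message from a leaf with no other factors

# Forward sweep: x1 -> f_a -> x2 -> f_b -> x3.
m_fa_to_x2 = factor_to_var(f_a, ones, sum_axis=0)
m_fb_to_x3 = factor_to_var(f_b, m_fa_to_x2, sum_axis=0)  # x2's msg to f_b = m_fa_to_x2

# Backward sweep: x3 -> f_b -> x2 -> f_a -> x1.
m_fb_to_x2 = factor_to_var(f_b, ones, sum_axis=1)
m_fa_to_x1 = factor_to_var(f_a, m_fb_to_x2, sum_axis=1)  # x2's msg to f_a = m_fb_to_x2

# Belief update: product of all incoming factor-to-variable messages.
belief_x1 = normalise(m_fa_to_x1)
belief_x2 = normalise(m_fa_to_x2 * m_fb_to_x2)
belief_x3 = normalise(m_fb_to_x3)
print(belief_x1, belief_x2, belief_x3)
```

Note how the variable-to-factor messages appear implicitly: because $x_2$ has only one other neighbour, its message to $f_b$ is exactly the message it received from $f_a$, and vice versa.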
Belief propagation was originally developed for graphs that are trees.
As BP was originally developed for trees, its application to loopy graphs was at first empirical; it was later shown that the fixed points of loopy BP correspond to stationary points of an approximation of the free energy known as the Bethe free energy (see Appendix A).
As the Bethe free energy is non-convex, loopy BP is not guaranteed to converge and even when it does it may converge to the wrong marginals.
Empirically, however, BP generally converges to the true marginals, although for very loopy graphs it can fail.
Most interesting problems have loopy structures and so for the remainder of the article we will use BP to refer to loopy BP. So far, although we have outlined the BP equations, we have not specified the form of the factors, messages or beliefs. From here, we focus on Gaussian belief propagation which is a special form of continuous BP for Gaussian models.
We are interested in Gaussian models in which all factors and therefore the joint posterior are univariate / multivariate Gaussian distributions. Gaussians are a convenient choice for a number of reasons: (1) they accurately represent the distribution for many real world events, and (2) they are closed under the operations used in inference, as products, marginals and conditionals of Gaussians are themselves Gaussian.
A Gaussian factor or in general any Gaussian distribution can be written in the exponential form $p(x) \propto e^{-E(x)}$ with a quadratic energy function.
There are two ways to write the quadratic energy, which correspond to the two common parameterizations of multivariate Gaussian distributions: the moments form, $E(x) = \tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)$, with mean $\mu$ and covariance $\Sigma$, and the canonical form, $E(x) = \tfrac{1}{2} x^\top \Lambda x - \eta^\top x$, with precision matrix $\Lambda = \Sigma^{-1}$ and information vector $\eta = \Lambda \mu$.
The canonical form is often preferred when performing inference, for two main reasons. Firstly, taking a product is simple in the canonical form, so it is easy to form the posterior from the factors. Secondly, the precision matrix is sparse and relates closely to the structure of the factor graph.
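As a quick illustration of this convenience (parameter and helper names here are our own), converting between the two parameterizations amounts to a matrix inversion, and the product of two Gaussian factors in canonical form is obtained simply by summing their information vectors and precision matrices.

```python
import numpy as np

def moments_to_canonical(mu, Sigma):
    """Convert (mean, covariance) to (information vector, precision matrix)."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

def canonical_to_moments(eta, Lam):
    """Convert (information vector, precision matrix) back to (mean, covariance)."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma

# Two illustrative 2D Gaussian factors over the same variables.
eta1, Lam1 = moments_to_canonical(np.array([1.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]]))
eta2, Lam2 = moments_to_canonical(np.array([0.0, 2.0]), np.array([[0.5, 0.0], [0.0, 0.5]]))

# The product of the two factors is another Gaussian whose canonical
# parameters are simply the sums of the individual parameters.
eta_prod, Lam_prod = eta1 + eta2, Lam1 + Lam2
mu_prod, Sigma_prod = canonical_to_moments(eta_prod, Lam_prod)
print(mu_prod, Sigma_prod)
```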
The precision matrix describes direct associations or conditional dependence between variables. If entry $(i,j)$ of the precision matrix is zero then equivalently, there is no factor that directly connects $x_i$ and $x_j$ in the graph. You can see this in the default preset graph in the figure below where $\Lambda_{13}=\Lambda_{31}=0$ and $x_1$ and $x_3$ have no factor directly connecting them.
On the other hand, the covariance matrix describes induced correlations between variables and is dense as long as the graph is one single connected component. Unlike the canonical form, the moments form is unable to represent unconstrained distributions, as can be seen by selecting the unanchored preset graph in which there is only relative positional information. We encourage the reader to check out the other preset graphs and edit the graph to explore Gaussian models and the relationship with the canonical form.
The joint distribution corresponding to a factor graph in which all factors are Gaussian can be represented as a single multivariate Gaussian distribution (since the energy terms are additive) in the canonical form: $p(X) \propto \exp\big( -\tfrac{1}{2} X^\top \Lambda X + \eta^\top X \big)$, where the joint parameters $\eta$ and $\Lambda$ are formed by summing the (zero-padded) information vectors and precision matrices of the individual factors.
Marginal inference computes the per-variable marginal posterior distributions.
In the moments form, the marginal distribution of $x_i$ is $p(x_i) = \mathcal{N}(x_i; \mu_i, \Sigma_{ii})$, where $\mu_i$ is the $i$-th element of $\mu = \Lambda^{-1} \eta$ and $\Sigma_{ii}$ is the $i$-th diagonal element (or block) of the covariance matrix $\Sigma = \Lambda^{-1}$.
We can therefore summarize inference in Gaussian models as solving the linear system of equations $Ax=b \Leftrightarrow \Lambda \mu = \eta$. MAP inference solves for $\mu$ while marginal inference solves for both $\mu$ and the block diagonal elements of $\Lambda^{-1}$.
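This equivalence is easy to check in code. Below is a minimal sketch with a toy three-variable chain of our own devising: factor contributions are summed into the joint $(\eta, \Lambda)$, MAP inference is a linear solve, and marginal inference additionally reads off the diagonal of $\Lambda^{-1}$.

```python
import numpy as np

# Toy chain x1 -- x2 -- x3 with a unary (anchor) factor on x1 and two
# pairwise smoothness factors; values are illustrative.
n = 3
eta, Lam = np.zeros(n), np.zeros((n, n))

# Unary factor anchoring x1 near 1.0 with precision 10.
eta[0] += 10.0 * 1.0
Lam[0, 0] += 10.0

# Pairwise factors penalising (x_i - x_j)^2 with precision 5,
# i.e. zero eta contribution and a 2x2 precision block [[5, -5], [-5, 5]].
for i, j in [(0, 1), (1, 2)]:
    Lam[np.ix_([i, j], [i, j])] += 5.0 * np.array([[1.0, -1.0], [-1.0, 1.0]])

# MAP inference: solve Lambda mu = eta.
mu = np.linalg.solve(Lam, eta)

# Marginal inference: per-variable means and the diagonal of Lambda^{-1}.
Sigma = np.linalg.inv(Lam)
marginal_vars = np.diag(Sigma)
print(mu, marginal_vars)
```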
Having introduced Gaussian models, we now discuss Gaussian Belief Propagation (GBP), a form of BP applied to Gaussian models. Due to the closure properties of Gaussians, the beliefs and messages are also Gaussians and GBP operates by storing and passing around information vectors and precision matrices. The GBP equations are a specific case of the BP equations presented earlier and we provide details on the GBP equations in Appendix B.
Unlike general loopy BP, GBP is guaranteed to compute the exact marginal means on convergence, although the same is unfortunately not true for the variances, which often converge to the true marginal variances but are sometimes overconfident for very loopy graphs.
The interactive figure below aims to build intuition for GBP by exploring the effect of individual messages. For easy visualization and interpretation of the beliefs, we examine 3 spatial estimation problems with increasing "loopiness": a chain, a loop and a grid. Click on a variable node to send messages to its adjacent variables and observe how neighbouring beliefs are updated. You will see that GBP converges to the true marginals regardless of the order in which messages are passed.
We have introduced Gaussian Belief Propagation in its basic form as a probabilistic inference algorithm for Gaussian estimation problems. However, to solve real practical problems with GBP, we often need a number of extra details and tricks which we discuss in this section.
Although requiring all factors to be Gaussian is a convenient assumption, most interesting problems involve non-linear relationships and / or non-Gaussian data distributions, both of which result in non-Gaussian factors. GBP can be extended to handle these problems by linearizing the non-linear relationships and using covariance scaling to handle non-Gaussian data distributions.
A factor is usually created given some observed data $d$ that we model as $d \sim h(X) + \epsilon$, where $h$ simulates the data generation process from the subset of variables $X$ that the factor connects and $\epsilon$ is additive noise.
For linear functions $h(X) = \mathtt{J} X + c$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, \Sigma_\epsilon)$, the energy is quadratic in $X$ and we can rearrange the energy so that the factor is in the Gaussian canonical form, with $\Lambda_f = \mathtt{J}^\top \Sigma_\epsilon^{-1} \mathtt{J}$ and $\eta_f = \mathtt{J}^\top \Sigma_\epsilon^{-1} (d - c)$.
If $h$ is non-linear, we linearize it about the current estimate $X_0$ using a first-order Taylor expansion, $h(X) \approx h(X_0) + \mathtt{J} (X - X_0)$, where $\mathtt{J}$ is the Jacobian, which yields an approximate Gaussian factor with $\Lambda_f = \mathtt{J}^\top \Sigma_\epsilon^{-1} \mathtt{J}$ and $\eta_f = \mathtt{J}^\top \Sigma_\epsilon^{-1} \big( \mathtt{J} X_0 + d - h(X_0) \big)$.
After linearization, the posterior is a Gaussian approximation of the true posterior and inference is performed by successively solving linearized versions of the underlying non-linear problem (as in non-linear least squares optimization).
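A sketch of how such a linearized factor might be constructed (the helper names are our own, and a finite-difference Jacobian is used purely for brevity; analytic Jacobians are normal in practice). The example measurement function is a range-bearing model like the one in the figure below.

```python
import numpy as np

def numerical_jacobian(h, X0, eps=1e-6):
    """Finite-difference Jacobian of h at X0."""
    y0 = h(X0)
    J = np.zeros((y0.size, X0.size))
    for k in range(X0.size):
        dX = np.zeros_like(X0)
        dX[k] = eps
        J[:, k] = (h(X0 + dX) - y0) / eps
    return J

def linearised_factor(h, X0, d, Sigma_eps):
    """Gaussian (eta, Lambda) approximation of the factor around X0."""
    J = numerical_jacobian(h, X0)
    Lam_eps = np.linalg.inv(Sigma_eps)
    Lam_f = J.T @ Lam_eps @ J
    eta_f = J.T @ Lam_eps @ (J @ X0 + d - h(X0))
    return eta_f, Lam_f

# Illustrative non-linear measurement: 2D range and bearing from a robot at
# X[:2] to a landmark at X[2:].
def range_bearing(X):
    r, l = X[:2], X[2:]
    diff = l - r
    return np.array([np.linalg.norm(diff), np.arctan2(diff[1], diff[0])])

X0 = np.array([0.0, 0.0, 2.0, 1.0])   # current estimates [robot, landmark]
d = np.array([2.3, 0.5])              # observed measurement
eta_f, Lam_f = linearised_factor(range_bearing, X0, d, Sigma_eps=0.01 * np.eye(2))
print(eta_f, Lam_f)
```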
To see how this linearization works, consider a robot moving in a plane that measures the 2D distance and angle to a landmark also in the plane, where the current estimates for the position of the robot and landmark are $r_0$ and $l_0$ respectively, and the observed measurement $d = h(r_0, l_0)$. In the interactive figure below, we show both the true non-linear factor and the Gaussian approximated factor with $r$ held constant at $r_0$.
The accuracy of the approximate Gaussian factor depends on the linearity of the function $h$ at the linearization point.
As $h$ is reasonably smooth, the linear approximation is good close to the linearization point $l_0$, while further away, the approximation can degrade.
In practice, during optimization we can avoid this region of poor approximation by relinearizing frequently.
As GBP is local, a just-in-time approach to linearization is possible, in which each factor is relinearized individually whenever the current variable estimates move sufficiently far from its linearization point.
A second cause of non-Gaussian factors is non-Gaussian data distributions.
We usually model observed data as coming from a Gaussian distribution: $d \sim h(X) + \epsilon$, by choosing $\epsilon$ to be Gaussian noise.
Although this is generally sensible, real data often contains outliers that are poorly described by a Gaussian noise model and can badly skew the estimate.
Robust data distributions (or M-estimators), such as the Huber energy, address this by replacing the quadratic energy with one that grows more slowly for large residuals, reducing the influence of outliers.
Robust energy functions can make GBP much more generally useful - for example, they are crucial for bundle adjustment.
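One common way to realise a robust factor inside a Gaussian framework is covariance scaling: when the Mahalanobis distance of the residual exceeds the Huber threshold, the factor's precision is scaled down so that its quadratic energy matches the Huber energy at the current residual. The sketch below uses our own variable names; the scaling formula follows from the standard Huber loss, and in a full implementation the scaled precision would be used when forming the factor's $(\eta_f, \Lambda_f)$, so that both parameters are down-weighted together.

```python
import numpy as np

def huber_scaled_precision(r, Lam_eps, delta):
    """Down-weight a factor's precision so that its quadratic energy matches
    the Huber energy at the current residual r (a sketch of covariance scaling)."""
    M = np.sqrt(r @ Lam_eps @ r)          # Mahalanobis distance of the residual
    if M <= delta:
        return Lam_eps                    # inlier: plain Gaussian factor
    # For M > delta the Huber energy is delta*M - delta^2/2; choosing the
    # scale w so that 0.5*w*M^2 equals this energy gives:
    w = delta * (2.0 * M - delta) / M**2
    return w * Lam_eps

# Illustrative use: a large residual gets its precision heavily down-weighted,
# so the corresponding measurement pulls the estimate far less strongly.
Lam_eps = np.eye(2) / 0.1**2              # measurement precision (sigma = 0.1)
print(huber_scaled_precision(np.array([0.05, 0.0]), Lam_eps, delta=1.0))  # unchanged
print(huber_scaled_precision(np.array([2.0, 0.0]), Lam_eps, delta=1.0))   # scaled down
```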
The interactive figure below gives intuition for the effect of robust factors for the task of 1D line fitting.
The variables we are estimating are the $y$ values at fixed intervals along the $x$ axis and the blue circles and lines show the mean and standard deviation of the beliefs.
The red squares are measurements that produce data factors in the graph, and there are also smoothness factors between adjacent variable nodes encouraging their $y$ values to be close.
So far, we have assumed that all variable and factor nodes broadcast messages at each iteration in a synchronous fashion, where all nodes absorb and broadcast messages in parallel. In fact, this is far from a requirement and as GBP is entirely local, messages can be sent arbitrarily and asynchronously.
It turns out that the message schedule can have a dramatic effect on the rate of convergence. For example, swapping synchronous updates for random message passing tends to improve convergence, while a fixed "round-robin" schedule can do even better.
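Since the per-message computations are identical under any ordering, a schedule is simply a choice of which messages to send at each step. Below is a small sketch of the three preset orderings used in the figure below, written over the edges of a chain graph (the representation and function names are our own).

```python
import random

# Edges of a chain factor graph with n variable nodes, written as
# (sender, receiver) pairs of variable indices; each pair stands for the
# variable-to-factor and factor-to-variable messages along that edge.
def chain_edges(n):
    return [(i, i + 1) for i in range(n - 1)] + [(i + 1, i) for i in range(n - 1)]

def synchronous_schedule(n, iters):
    # Every message is sent at every iteration, in parallel.
    return [chain_edges(n) for _ in range(iters)]

def random_schedule(n, iters, seed=0):
    # One randomly chosen message per step.
    rng = random.Random(seed)
    edges = chain_edges(n)
    return [[rng.choice(edges)] for _ in range(iters)]

def sweep_schedule(n):
    # Left-to-right then right-to-left: exact for a chain after one pass.
    forward = [(i, i + 1) for i in range(n - 1)]
    backward = [(i + 1, i) for i in reversed(range(n - 1))]
    return [forward + backward]

print(sweep_schedule(4))
```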
In the figure below, we explore message scheduling using the 1D line fitting task once again. The underlying factor graph is a chain (no loops), and so GBP will converge after one sweep of messages from left to right and back again. You can send messages through the graph using the preset schedules (synchronous, random or sweep) or create your own schedule by clicking on a variable node to send messages outwards.
Playing around with different schedules in this 1D estimation task highlights two important properties of GBP.
First, GBP can converge with an arbitrary message passing schedule.
As a consequence, GBP can readily operate in systems with no global clock and varying local compute budgets, such as on neuromorphic hardware or between a group of distributed devices.
The second property is that GBP can achieve approximate local convergence without global convergence.
Due to the factorized structure of GBP, message passing can be focused on the region of the graph that is currently of interest, reaching approximate convergence there without having to pass messages through the rest of the graph.
In the figure below, we explore attention-driven message passing for the task of image denoising.
Propagating information from one node to another with GBP takes the same number of iterations as the number of hops between the nodes. For nearby nodes in a local region, information can be communicated in a small number of iterations and consensus can be reached quickly, while for distant nodes, a global consensus can take many more iterations to be established. This is an inherent property of local algorithms and can be summarized as: low frequency errors decay more slowly than high frequency errors.
Regular grid structured graphs appear a lot in computer vision (e.g. image segmentation) and in discretized boundary value problems (e.g. solving for the temperature profile along a rod).
Accelerating convergence in such grid graphs has been well-studied in the field of multigrid methods.
Multigrid methods can only be applied to graphs with a grid-like structure where it is possible to build equivalent coarsened representations.
In general, most problems are more unstructured and it is not clear how to build a coarsened or abstracted representation of the original problem.
In the general case, we see two possible ways to build hierarchy into a model. First, a network could be trained to directly predict specific abstractions that form long range connections when included in the graph. Second, the graph could contain additional constraints that define a generic hierarchical structure (much like a neural network), with the abstractions themselves also being inferred.
Solving real non-linear problems with GBP is done by iteratively solving linearized Gaussian versions of the true non-linear problem.
This general pattern of successively solving linearized problems underpins many different non-linear inference methods.
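The pattern can be seen in miniature in a Gauss-Newton style loop: each iteration builds the linearized Gaussian model around the current estimate, solves it, and relinearizes. The toy one-dimensional problem below is our own, and the scalar divide stands in for the linear solve (or the GBP message passing) that a real system would perform.

```python
import numpy as np

# Toy non-linear problem: estimate x from a measurement d = h(x) + noise,
# with h(x) = exp(x). Values are illustrative.
def h(x):  return np.exp(x)
def J(x):  return np.exp(x)      # dh/dx

d, lam_eps = 5.0, 100.0          # measurement and its (scalar) precision
x = 0.0                          # initial linearization point

for _ in range(10):
    # Build the linearized Gaussian model around the current estimate ...
    Jx = J(x)
    Lam = Jx * lam_eps * Jx
    eta = Jx * lam_eps * (Jx * x + d - h(x))
    # ... solve it (here a scalar divide; in general a linear solve or GBP) ...
    x_new = eta / Lam
    # ... and relinearize around the new estimate.
    if abs(x_new - x) < 1e-10:
        break
    x = x_new

print(x, np.log(5.0))            # converges towards log(5)
```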
There are efficient libraries that solve such linearized problems very rapidly on a CPU, typically by exploiting global problem structure with sparse direct linear solvers.
GBP is just one of many possible algorithms that can be used to solve the linearized Gaussian model. To place GBP amongst other methods, we present an overview of a number of related methods for MAP and marginal inference for Gaussian models in the table below. As a reminder, inference in Gaussian models is equivalent to solving the linear system $\Lambda \mu = \eta$, for $\mu$ in MAP inference and for $\mu$ and the diagonal elements of $\Lambda^{-1}$ in marginal inference. You can hover over the circles in the figure to explore how GBP relates to other methods.
With so many different inference methods, choosing which method to use can be a challenge in itself. Judging the speed of each method is complex and depends on both the sparsity structure of $\Lambda$ and on the implementation on the available hardware. Our key argument in this article is that we want a general method that is local, probabilistic and iterative which led us towards Gaussian Belief Propagation.
Other notable candidate methods that are local, probabilistic and iterative include Expectation Propagation (EP) and its variants.
We envisage that ML systems of the future will be large scale, heterogeneous and distributed and as such will require flexible and scalable probabilistic inference algorithms. In this article, we argued that Gaussian Belief Propagation is a strong candidate algorithm as it is local, probabilistic, iterative and asynchronous. Additionally, we showed 1) how GBP is much more general with a prescription for handling non-linear factors and robust energy functions, 2) how GBP can operate in an attention-driven fashion and 3) how hierarchical structure can help convergence. We hope that this visual introduction will encourage more researchers and practitioners to look into GBP as an alternative to existing inference algorithms.
We see many exciting directions for future research around GBP and provide a GBP Library Notebook as a starting point for the interested reader.
Some directions we are most excited about are improving theoretical guarantees and using learned factors.
We encourage the interested reader to explore our supplementary GBP playground. The playground consists of 2 interactive diagrams based around 2D pose graphs. In the first, you can construct your own pose graph, set the initialization and then choose how messages are passed through the graph. The second is a simulation of a robot exploring a 2D environment with landmarks.
We are grateful to many researchers with whom we have discussed some of the ideas in this paper, especially from the Dyson Robotics Lab and Robot Vision Group at Imperial College London. We would particularly like to thank Xiaofan Mu, Raluca Scona, Riku Murai, Edgar Sucar, Seth Nabarro, Tristan Laidlow, Nanfeng Liu, Shuaifeng Zhi, Kentaro Wada and Stefan Leutenegger.
This standard derivation of the belief propagation equations follows Yedidia et al.
We begin by writing the posterior as a product of the factors: $p(X) = \frac{1}{Z} \prod_a f_a(X_a)$, where $Z$ is a normalization constant and $X_a$ is the subset of variables that factor $a$ depends on.
We first consider the form of the free energy for a tree. For tree graphs, the distribution $q(X)$ can be written exactly in terms of its factor and variable marginals: $q(X) = \prod_a b_a(X_a) \prod_i b_i(x_i)^{1 - d_i}$, where $d_i$ is the number of factors connected to variable $x_i$. Substituting this form into the variational free energy $F(q) = \sum_X q(X) E(X) + \sum_X q(X) \log q(X)$, with $E(X) = -\sum_a \log f_a(X_a)$, gives the Bethe free energy: $F_B = \sum_a \sum_{X_a} b_a(X_a) \log \frac{b_a(X_a)}{f_a(X_a)} + \sum_i (1 - d_i) \sum_{x_i} b_i(x_i) \log b_i(x_i)$.
Belief propagation can be derived via minimization of the Bethe free energy subject to two constraints.
The first is a marginalization constraint: $b_i(x_i) = \sum_{X_a \setminus x_i} b_{a}(X_a)$, and the second is a normalization constraint: $\sum_{x_i} b_i(x_i) = 1$.
With these constraints, we can form the Lagrangian and then set the derivatives with respect to the beliefs to zero, which gives: $b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} \exp(\lambda_{ai}(x_i))$ and $b_i(x_i) \propto \exp\big( \tfrac{1}{d_i - 1} \sum_{a \in N(i)} \lambda_{ai}(x_i) \big)$, where $\lambda_{ai}(x_i)$ are the Lagrange multipliers for the marginalization constraints.
We now choose the Lagrange multiplier to be $\exp(\lambda_{ai}(x_i)) = m_{x_i \rightarrow f_a}(x_i) = \prod_{c \in N(i) \setminus a} m_{f_c \rightarrow x_i}(x_i)$. This is the familiar variable-to-factor message equation, and substituting it into the above equations yields the belief propagation fixed point equations (the first of which the reader will recognize as the belief update rule): $b_i(x_i) \propto \prod_{a \in N(i)} m_{f_a \rightarrow x_i}(x_i)$ and $b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} m_{x_i \rightarrow f_a}(x_i)$.
Using the marginalization condition $b_i(x_i) = \sum_{X_a \setminus x_i} b_a(X_a)$, we can derive an equation for the messages in terms of other messages and produce the factor-to-variable message equation: $m_{f_a \rightarrow x_i}(x_i) = \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{x_j \rightarrow f_a}(x_j)$.
This result tells us that the fixed-points of loopy belief propagation are local stationary points of the Bethe free energy and because the Bethe energy is bounded from below, BP always has a fixed point.
BP variants have been developed using more accurate or convex approximations of the free energy.
Here we present the Gaussian belief propagation equations, where messages and beliefs are parameterized with the Gaussian canonical form. The GBP equations are a specific case of the BP equations presented in the main paper.
To compute the belief at a variable node, you take the product of incoming messages from all adjacent factors; in the canonical form this product reduces to summing the message parameters: $\eta_{b_i} = \sum_{a \in N(i)} \eta_{a \rightarrow i}$ and $\Lambda_{b_i} = \sum_{a \in N(i)} \Lambda_{a \rightarrow i}$.
To send a message at a variable node to an adjacent factor, the variable node takes the product of the messages from all other adjacent factors: $\eta_{i \rightarrow a} = \sum_{c \in N(i) \setminus a} \eta_{c \rightarrow i}$ and $\Lambda_{i \rightarrow a} = \sum_{c \in N(i) \setminus a} \Lambda_{c \rightarrow i}$.
To send a message from a factor node to a variable node, the factor aggregates messages from all other adjacent variable nodes before marginalising out all other adjacent variable nodes: $m_{f \rightarrow x_i}(x_i) \propto \int f(X_f) \prod_{j \in N(f) \setminus i} m_{x_j \rightarrow f}(x_j) \, \mathrm{d}(X_f \setminus x_i)$.
To see how this computation is implemented for Gaussian models with the canonical form, consider a factor $f$ connected to 3 variable nodes $[x_1, x_2, x_3]$ where we are computing the message to variable $x_1$. The factor $f([x_1, x_2, x_3]) = \mathcal{N}^{-1}([x_1, x_2, x_3]; \eta_f, \Lambda_f)$ is a Gaussian over the variables and can be divided up as follows:

$$\eta_f = \begin{bmatrix} \eta_{f_1} \\ \eta_{f_2} \\ \eta_{f_3} \end{bmatrix}, \qquad \Lambda_f = \begin{bmatrix} \Lambda_{f_{11}} & \Lambda_{f_{12}} & \Lambda_{f_{13}} \\ \Lambda_{f_{21}} & \Lambda_{f_{22}} & \Lambda_{f_{23}} \\ \Lambda_{f_{31}} & \Lambda_{f_{32}} & \Lambda_{f_{33}} \end{bmatrix}.$$

To form the message to $x_1$, we first add the incoming variable-to-factor messages from $x_2$ and $x_3$ onto their corresponding blocks, giving conditioned parameters $(\eta', \Lambda')$ with $\eta'_2 = \eta_{f_2} + \eta_{2 \rightarrow f}$, $\Lambda'_{22} = \Lambda_{f_{22}} + \Lambda_{2 \rightarrow f}$ and similarly for $x_3$. We then marginalise out $x_2$ and $x_3$: writing $a$ for the block belonging to $x_1$ and $b$ for the remaining blocks, marginalisation in the canonical form is a Schur complement,

$$\eta_{f \rightarrow x_1} = \eta'_a - \Lambda'_{ab} \Lambda_{bb}'^{-1} \eta'_b, \qquad \Lambda_{f \rightarrow x_1} = \Lambda'_{aa} - \Lambda'_{ab} \Lambda_{bb}'^{-1} \Lambda'_{ba}.$$
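Below is a sketch of this factor-to-variable computation in code (the block handling and helper names are our own): the incoming variable-to-factor messages are added onto the blocks for $x_2$ and $x_3$, which are then marginalised out with the Schur complement to give the message to $x_1$.

```python
import numpy as np

def factor_to_variable_message(eta_f, Lam_f, dims, target, incoming):
    """Message from a Gaussian factor (eta_f, Lam_f) to one of its variables.

    dims:     list with the dimension of each adjacent variable
    target:   index of the receiving variable
    incoming: {index: (eta, Lam)} variable-to-factor messages from the others
    """
    eta, Lam = eta_f.copy(), Lam_f.copy()
    starts = np.cumsum([0] + dims)

    # Condition on the incoming messages: add them onto the corresponding blocks.
    for j, (eta_j, Lam_j) in incoming.items():
        s, e = starts[j], starts[j + 1]
        eta[s:e] += eta_j
        Lam[s:e, s:e] += Lam_j

    # Marginalise out every variable except the target (Schur complement).
    a = np.arange(starts[target], starts[target + 1])
    b = np.setdiff1d(np.arange(eta.size), a)
    Lam_ab_Lam_bb_inv = Lam[np.ix_(a, b)] @ np.linalg.inv(Lam[np.ix_(b, b)])
    eta_msg = eta[a] - Lam_ab_Lam_bb_inv @ eta[b]
    Lam_msg = Lam[np.ix_(a, a)] - Lam_ab_Lam_bb_inv @ Lam[np.ix_(b, a)]
    return eta_msg, Lam_msg

# Illustrative factor over three 1D variables [x1, x2, x3].
Lam_f = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
eta_f = np.array([1.0, 0.0, 0.0])
msg = factor_to_variable_message(
    eta_f, Lam_f, dims=[1, 1, 1], target=0,
    incoming={1: (np.array([0.5]), np.array([[1.0]])),
              2: (np.array([0.0]), np.array([[1.0]]))})
print(msg)
```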
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise.
For attribution in academic contexts, please cite this work as
J. Ortiz, T. Evans, A. Davison. A visual introduction to Gaussian Belief Propagation, 2021.
BibTeX citation
@article{Ortiz2021visualGBP,
  title   = {A visual introduction to Gaussian Belief Propagation},
  author  = {Ortiz, Joseph and Evans, Talfan and Davison, Andrew J.},
  journal = {arXiv preprint arXiv:2107.02308},
  year    = {2021},
}