A visual introduction to Gaussian Belief Propagation

A Framework for Distributed Inference with Emerging Hardware.

NB: A static version of this interactive article has been published on arXiv and can be cited as:

@article{Ortiz2021visualGBP,
  title = {A visual introduction to Gaussian Belief Propagation},
  author = {Ortiz, Joseph and Evans, Talfan and Davison, Andrew J.},
  journal={arXiv preprint arXiv:2107.02308},
  year = {2021},
}

See our short video introduction to the article here.


In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), a probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP relies only on local information for its updates and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems.

Introduction

Bayesian Probabilistic Inference

Bayesian probability theory is the fundamental framework for dealing with uncertain data, and is at the core of practical systems in machine learning and robotics. A probabilistic model relates unknown variables of interest to observable, known or assumed quantities and most generally takes the form of a graph whose connections encode those relationships. Inference is the process of forming the posterior distribution to determine properties of the unknown variables, given the observations, such as their most probable values or their full marginal distributions.

There are various possible algorithms for probabilistic inference, many of which take advantage of specific problem structure for fast performance. Efficient inference on models represented by large, dynamic and highly inter-connected graphs however remains computationally challenging and is already a limiting factor in real embodied systems. Inference on such graphs will only become more important as general purpose perception systems strive towards more heterogeneous and dynamic representations containing abstractions (such as objects or concepts) with continually changing relationships. In AI, there has been a recent trend towards sparse representations and graphs, which are well-suited for representing sparse high dimensional data. Part of this trend has been driven by the fact that graphs are a natural representation in certain domains, and evidence of this is the rise of graph neural networks and graph-based probabilistic inference frameworks. Another part is driven by the idea that massive sparsity is required to scale AI compute, and evidence of this is the scaling of neural networks with gated computation and the development of novel AI processors designed specifically for high performance on sparse representations. We believe that inference on large scale and dynamic probabilistic graphs will be an important part of this trend towards sparse computing to scale AI compute.

The Bitter Lesson

As we search for the best algorithms for probabilistic inference, we should recall the "Bitter Lesson" of machine learning (ML) research: "general methods that leverage computation are ultimately the most effective, and by a large margin". A reminder of this is the success of CNN-driven deep learning, which is well suited to GPUs, the most powerful widely-available processors in recent years.

Looking to the future, we anticipate a long-term trend towards a "hardware jungle" of parallel, heterogeneous, distributed and asynchronous computing systems which communicate in a peer-to-peer manner. This will be at several levels: across networks of multiple smart devices operating in the same environment; across the many sensors, actuators and processors within individual embodied devices; and even within single processor chips themselves. For single processor chips, a trend towards multicore designs with distributed on-core memory and ad-hoc communication is rapidly emerging. This design has the key aim of increasing performance while reducing power usage by minimizing the "bits x millimetres" of data movement during computation.

To scale arbitrarily and leverage all available computation in this "hardware jungle", our key hypothesis is that we need inference methods which can operate with distributed local processing / storage and message passing communication, without the need for a global view or coordination of the whole model. The opposite proposition would be a global centralized inference method, for example via a monolithic cloud-based processor. This centralized approach would not leverage most of the available compute resources, and even if feasible it may not be desirable for communication bandwidth or privacy reasons.

Gaussian Belief Propagation

Fortunately, a well-known inference algorithm exists which has these properties: Loopy Belief Propagation. Belief Propagation (BP) was invented in the 1980s, and is an algorithm for calculating the marginals of a joint distribution via local message passing between nodes in a factor graph. Factor graphs are a type of bipartite graphical model that connects variables via factors which represent independent relationships.

BP guarantees exact marginal computation with one sequence of forward-backward message passing in tree-structured graphs, but empirically produces good results when applied to "loopy" graphs with cycles. Unlike graph neural networks, which learn edge and node updates that are applied over a fixed number of message passing steps, loopy BP applies probabilistic message passing updates with iterative convergent behaviour. Although loopy BP has been successful in a few applications such as error-correcting codes, it has not yet been applied to broader machine learning problems.

One issue that has precluded the use of general Loopy BP is that it lacks convergence guarantees. However, a second and perhaps more relevant issue given modern hardware trends, is that its computational properties have not fitted the dominant processing paradigms of recent decades, CPUs and GPUs. Consequently, other factor graph inference algorithms have been preferred which take advantage of global problem structure to operate much more rapidly and robustly than loopy BP on a CPU.

We believe that now is the right time to re-evaluate the properties of loopy BP given the rise of graph-like computing hardware and sparse models. Indeed there has been a recent resurgence in interest in the algorithm; in particular, one notable work demonstrated that BP on novel graph processor hardware can achieve a 24x speed improvement over standard methods for bundle adjustment. We focus on the special case of Gaussian Belief Propagation (GBP), in which all factors are Gaussian functions of their dependent variables, and we also infer Gaussian-distributed marginals. GBP has both more extensive mathematical guarantees and stronger empirical performance than general loopy BP. It is also simple to implement, with all messages between nodes taking the form of usually low-dimensional Gaussians, which can be represented by small vectors and matrices.

In this article, we give an introduction to Gaussian Belief Propagation as a strong general purpose algorithmic and representational framework for large scale distributed inference. Throughout, we present a range of examples across 1D and 2D geometric estimation and image processing in the form of interactive simulations which allow the reader to experiment with and understand the properties of GBP. We hope that these simulations emphasize the key properties of GBP which we summarize below.

In the remainder of the article, we first introduce the GBP algorithm and show that it is intimately linked to solving a linear system of equations $Ax = b$. We then explore 4 key practical details for applying GBP to real problems: extending GBP to handle non-linear relationships, handling non-Gaussian data distributions, using local message schedules, and lastly exploiting hierarchical structure to help convergence.

Technical Introduction

Probabilistic Inference

Inference is the problem of estimating statistical properties of unknown variables $X$ from known or observed quantities $D$ (the data). For example, one might be interested in inferring tomorrow's weather (X) from historic data (D), or the 3D structure of an environment (X) from a video sequence (D).

Bayesian inference proceeds by first defining a probabilistic model $p(X, D)$ that describes the relationships between data and variables, and then using the sum and product rules of probability (the sum rule is $p(X) = \sum_Y p(X, Y)$ and the product rule is $p(X, Y) = p(Y \rvert X) p(X)$) to form the posterior distribution $p(X \rvert D) = \frac{p(X, D)}{p(D)}$. The posterior summarizes our belief about $X$ after seeing $D$ and can be used for decision making or other downstream tasks.
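As a tiny concrete illustration of these rules, the sketch below (with made-up numbers) forms the posterior over a binary unknown from a joint probability table; all values are purely illustrative.

import numpy as np

# Hypothetical joint p(X, D) over a binary unknown X and a binary observation D.
# Rows index X, columns index D; the numbers are invented for illustration.
p_joint = np.array([[0.32, 0.08],    # X = 0
                    [0.12, 0.48]])   # X = 1

d_observed = 1
p_D = p_joint[:, d_observed].sum()           # sum rule: p(D) = sum_X p(X, D)
p_X_given_D = p_joint[:, d_observed] / p_D   # product rule rearranged: p(X|D) = p(X, D) / p(D)

print(p_X_given_D)           # posterior belief over X after seeing D = 1
print(p_X_given_D.argmax())  # most probable value of X given D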

Given the posterior, we can compute various properties of $X$, for example:

  1. The most likely configuration of the variables $X_{\text{MAP}} = \text{arg max}_X p(X \rvert D)$, or
  2. The marginal posteriors $p(x_i \rvert D) = \sum_{X \setminus x_i} p(X \rvert D)$, which summarize our belief about each individual variable given $D$.

These two calculations are known as maximum a posteriori (MAP) inference and marginal inference respectively. An important difference is that MAP inference produces a point estimate while marginal inference retains information about uncertainty.

Factor Graphs

The Hammersley-Clifford theorem tells us that any positive joint distribution $p(X)$ can be represented as a product of factors $f_i$, one per clique, where a clique is a subset of variables $X_i$ in which each variable is connected to all others:

p(X) = \prod_i f_i(X_i) ~.

Factorized representations can be very convenient as they expose structure in a model. Factor graphs are the natural visual representation of this factorization and can be very useful for reasoning about problems.

In the factor graphs in this article, circles and squares represent variable and factor nodes respectively, with edges connecting each factor to the variables it depends on. An example of a simple factor graph is shown in the diagram on the right. By explicitly representing the factors as nodes in the graph, factor graphs clearly emphasize the conditional independence structure of the problem - the lack of a factor directly connecting two variables means they are conditionally independent. Mathematically, two variables $x_i$ and $x_j$ are conditionally independent given all other variables $X_{-ij}$ if: p(x_i, x_j | X_{-ij}) = p(x_i | X_{-ij}) p(x_j | X_{-ij}) ~. An equivalent condition is: p(x_i | x_j, X_{-ij}) = p(x_i | X_{-ij}) ~. Intuitively, if $X_{-ij}$ causes both $x_i$ and $x_j$, then if we know $X_{-ij}$ we don't need to know about $x_i$ to predict $x_j$ or about $x_j$ to predict $x_i$. Conditional independence is often written in shorthand as: $x_i \bot x_j | X_{-ij}$.

Factor graphs can also be presented as energy based models where each factor $f_i$ defines an energy $E_i \geq 0$ associated with a subset of the variables $X_i$: f_i(X_i) \propto e^{ - E_i(X_i)} ~. This formalism is closely related to the Boltzmann distribution in statistical physics, which gives the probability of a state $i$ as a function of the energy of the state and the temperature of the system: p_i = \frac{e^{-E_i / k T}}{ \sum_j e^{-E_j / k T}} ~, where $k$ is the Boltzmann constant, $T$ is the temperature of the system and $j$ sums over all available states.

In energy based models, finding the most likely variable configuration is equivalent to minimizing the negative log probability or the sum of factor energies: X_{\text{MAP}} = \text{arg min}_X - \log p(X) = \text{arg min}_X \sum_i E_i(X_i) ~.

The Belief Propagation Algorithm

Belief propagation (BP) is an algorithm for marginal inference, i.e. it computes the marginal posterior distribution for each variable from the set of factors that make up the joint posterior. BP is intimately linked to factor graphs by the following property: BP can be implemented as iterative message passing on the posterior factor graph. The algorithm operates by iteratively updating a node's locally stored belief by sending and receiving messages from neighbouring nodes. Each iteration consists of 3 phases:

Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor's belief on the receiving node's variables. (Although the belief update is listed here as a phase of BP, the variable beliefs are ephemeral properties of the graph that can be computed at any time from the most recent factor-to-variable messages in the graph. BP can proceed simply by alternating variable-to-factor and factor-to-variable message passing, however in practice all implementations compute the belief at each iteration as a third phase for two reasons: 1) it is useful to track convergence, 2) the beliefs can be sent instead of variable-to-factor messages, as the true variable-to-factor message can be recovered at the factor node by dividing the belief by the latest factor-to-variable message.)
Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes' variables to produce a message that expresses the factor's belief over the receiving node's variables.
Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.
These 3 operations fully define the algorithm and their equations are presented below. (The equations below assume discrete random variables, for the simpler notation of using summation rather than integration in the factor-to-variable message passing equation, and for consistency with Bishop's textbook.)
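To make the three phases concrete, here is a minimal numpy sketch of one sweep of sum-product BP on a tiny chain of two-state variables; the factor tables and names are invented for illustration, and the result is checked against the brute-force marginal.

import numpy as np

# Sum-product BP on a 3-variable chain x1 - f12 - x2 - f23 - x3, two states each.
f12 = np.array([[4.0, 1.0], [1.0, 3.0]])   # f12[x1, x2], illustrative values
f23 = np.array([[2.0, 1.0], [1.0, 5.0]])   # f23[x2, x3], illustrative values

# Variable-to-factor messages: x1 and x3 have no other factors, so uniform.
m_x1_to_f12 = np.ones(2)
m_x3_to_f23 = np.ones(2)

# Factor-to-variable messages: weight the factor by the incoming messages,
# then sum out the other variables.
m_f12_to_x2 = (f12 * m_x1_to_f12[:, None]).sum(axis=0)
m_f23_to_x2 = (f23 * m_x3_to_f23[None, :]).sum(axis=1)

# Belief update: product of incoming factor-to-variable messages, normalized.
b_x2 = m_f12_to_x2 * m_f23_to_x2
b_x2 /= b_x2.sum()

# Check against the brute-force marginal of the joint, which is proportional to f12 * f23.
joint = f12[:, :, None] * f23[None, :, :]
print(b_x2, joint.sum(axis=(0, 2)) / joint.sum())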

Green curves represent variable belief distributions. Left: BP on a tree. Right: BP with synchronous updates applied to a graph with a loop.

Belief propagation was originally developed for graphs that are trees (for tree graphs, any two nodes are connected by exactly one path), and the updates were designed such that the beliefs converge to the exact marginals after one sweep of messages from a root node to the leaf nodes and back. For models with arbitrary conditional independence structure, including cycles or "loops", loopy BP iteratively applies the same message passing rules to all nodes. The simplest variant of loopy BP sends messages from all nodes at every iteration in a synchronous fashion. The videos on the right illustrate how BP is applied to trees and graphs with loops.

As BP was originally developed for trees, its application to loopy graphs was at first empirical. Theoretical grounds for applying the same update rules to loopy graphs were later developed that explain loopy BP as an approximate variational inference method in which inference is cast as an optimization problem. Instead of directly minimizing the factor energies (as is done in MAP inference), loopy BP minimizes the KL divergence between the posterior and a variational distribution which we use as a proxy for the marginals after optimization. Loopy BP can be derived via constrained minimization of an approximation of the KL divergence known as the Bethe free energy (see Appendix A for the derivation).

As the Bethe free energy is non-convex, loopy BP is not guaranteed to converge and, even when it does, it may converge to the wrong marginals. Empirically, however, BP generally converges to the true marginals, although for very loopy graphs it can fail.

Most interesting problems have loopy structures and so for the remainder of the article we will use BP to refer to loopy BP. So far, although we have outlined the BP equations, we have not specified the form of the factors, messages or beliefs. From here, we focus on Gaussian belief propagation which is a special form of continuous BP for Gaussian models.

Gaussian Models

We are interested in Gaussian models in which all factors and therefore the joint posterior are univariate / multivariate Gaussian distributions. Gaussians are a convenient choice for a number of reasons: (1) they accurately represent the distribution for many real world events, (2) they have a simple analytic form, (3) complex operations can be expressed with simple formulae and (4) they are closed under marginalization, conditioning and taking products (up to normalization).

A Gaussian factor or in general any Gaussian distribution can be written in the exponential form $p(x) \propto e^{-E(x)}$ with a quadratic energy function. There are two ways to write the quadratic energy which correspond to the two common parameterizations of multivariate Gaussian distributions: the moments form (so called as it is parameterized by the first moment and second central moment of the distribution) and the canonical form. The key properties of each of these parameterizations are summarized in the table below.

The canonical form is often preferred when performing inference, for two main reasons. Firstly, taking a product is simple in the canonical form, so it is easy to form the posterior from the factors. Secondly, the precision matrix is sparse and relates closely to the structure of the factor graph.
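To see how convenient products are in the canonical form, the short sketch below (with illustrative numbers of our own) multiplies two Gaussian factors by summing their information vectors and precision matrices, then converts back to moments form for interpretation.

import numpy as np

# Moments form (mu, Sigma) <-> canonical form (eta, Lambda): Lambda = Sigma^-1, eta = Lambda mu.
def to_canonical(mu, Sigma):
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

# Two Gaussian factors over the same 2D variable; the numbers are illustrative only.
eta1, Lam1 = to_canonical(np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]]))
eta2, Lam2 = to_canonical(np.array([2.0, 0.0]), np.array([[0.5, 0.0], [0.0, 0.5]]))

# Taking a product in the canonical form is just addition of the parameters.
eta_prod, Lam_prod = eta1 + eta2, Lam1 + Lam2

# Back to moments form.
Sigma_prod = np.linalg.inv(Lam_prod)
mu_prod = Sigma_prod @ eta_prod
print(mu_prod, Sigma_prod)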

The precision matrix describes direct associations or conditional dependence between variables. If entry $(i,j)$ of the precision matrix is zero then equivalently, there is no factor that directly connects $x_i$ and $x_j$ in the graph. You can see this in the default preset graph in the figure below where $\Lambda_{13}=\Lambda_{31}=0$ and $x_1$ and $x_3$ have no factor directly connecting them.

On the other hand, the covariance matrix describes induced correlations between variables and is dense as long as the graph is one single connected component. Unlike the canonical form, the moments form is unable to represent unconstrained distributions, as can be seen by selecting the unanchored preset graph in which there is only relative positional information. We encourage the reader to check out the other preset graphs and edit the graph to explore Gaussian models and the relationship with the canonical form.

From Gaussian Inference to Linear Algebra

The joint distribution corresponding to a factor graph in which all factors are Gaussian can be represented as a single multivariate Gaussian distribution (since the energy terms are additive) in the canonical form: P(X) \propto \exp( - \frac{1}{2} X^\top \Lambda X + \eta^\top X) ~. MAP inference corresponds to computing the parameters $X_{\text{MAP}}$ that maximize the above joint distribution. As usual, we can compute the gradient of the log-probability: \nabla_X \log P(X) = - \Lambda X + \eta ~, and solve for $\nabla_X \log P(X) = 0$. From here, we see that MAP inference in a Gaussian system reduces simply to solving $X_{\text{MAP}} = \Lambda^{-1} \eta = \mu$, which as expected is equal to the mean.

Marginal inference computes the per-variable marginal posterior distributions. In the moments form, the marginal distribution of $x_i$ is: p(x_i) = \int p(X) dX_{-i} \propto \exp\big( -\frac{1}{2}(x_i - \mu_i)^\top \Sigma_{ii}^{-1} (x_i - \mu_i) \big) ~, where the mean parameter $\mu_i$ is the $i^{th}$ element of the joint mean vector and the covariance $\Sigma_{ii}$ is entry $(i,i)$ of the joint covariance matrix. The vector of marginal means for all variables is therefore the joint mean vector $\mu = \Lambda^{-1} \eta = X_{\text{MAP}}$, and the marginal variances are the diagonal entries of the joint covariance matrix $\Sigma = \Lambda^{-1}$.

We can therefore summarize inference in Gaussian models as solving the linear system of equations $Ax=b \Leftrightarrow \Lambda \mu = \eta$. MAP inference solves for $\mu$ while marginal inference solves for both $\mu$ and the block diagonal elements of $\Lambda^{-1}$.
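The sketch below illustrates this equivalence on a small hand-built precision matrix (the values are arbitrary): MAP inference is a linear solve, and marginal inference additionally reads off the diagonal of the inverse precision matrix.

import numpy as np

# A small, arbitrary Gaussian model in canonical form (symmetric positive definite Lambda).
Lam = np.array([[2.0, -1.0, 0.0],
                [-1.0, 2.0, -1.0],
                [0.0, -1.0, 2.0]])
eta = np.array([1.0, 0.0, 1.0])

# MAP inference: solve Lambda mu = eta.
mu = np.linalg.solve(Lam, eta)

# Marginal inference additionally needs the diagonal of Sigma = Lambda^-1.
Sigma = np.linalg.inv(Lam)
marginal_means = mu              # equal to the MAP estimate for Gaussian models
marginal_vars = np.diag(Sigma)
print(marginal_means, marginal_vars)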

Gaussian Belief Propagation

Having introduced Gaussian models, we now discuss Gaussian Belief Propagation (GBP), a form of BP applied to Gaussian models. Due to the closure properties of Gaussians, the beliefs and messages are also Gaussians, and GBP operates by storing and passing around information vectors and precision matrices. The GBP equations are a specific case of the BP equations presented earlier and we provide details on the GBP equations in Appendix B.

Unlike general loopy BP, GBP is guaranteed to compute the exact marginal means on convergence, although the same is unfortunately not true for the variances, which often converge to the true marginal variances but are sometimes overconfident for very loopy graphs. Although GBP does not in general have convergence guarantees, there are some known convergence conditions as well as methods to improve the chances of convergence (see chapter 22 in ). Message damping is commonly used to speed up and improve the chances of convergence in very loopy graphs; it both empirically and theoretically improves convergence without affecting the fixed points of GBP. The idea behind message damping is to use momentum to reduce the chance of oscillation by replacing the message at time $t$ with a combination of the messages at times $t$ and $t-1$: \tilde{m}_{t} = m_{t}^\beta \tilde{m}_{t-1}^{(1 - \beta)} ~, which is a weighted sum in log-space: \log \tilde{m}_{t} = \beta \, \log m_{t} + (1 - \beta) \, \log \tilde{m}_{t-1} ~. Standard BP is recovered when the damping parameter $\beta = 1$, while $\beta = 0$ corresponds to not updating the message and re-sending the message from the previous iteration. Message damping can be applied to both the variable-to-factor and factor-to-variable messages, however we find that applying it just to factor-to-variable messages is sufficient. For GBP, message damping corresponds to damping the information vector and precision matrix as a weighted sum: \tilde{\eta}_{t} = \beta \, \eta_{t} + (1 - \beta) \, \tilde{\eta}_{t-1} \;\;\; \text{and} \;\;\; \tilde{\Lambda}_{t} = \beta \, \Lambda_{t} + (1 - \beta) \, \tilde{\Lambda}_{t-1} ~.
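A minimal sketch of message damping on the canonical parameters follows; the function and variable names are ours, not from any particular GBP implementation.

import numpy as np

def damp_message(eta_new, Lam_new, eta_prev, Lam_prev, beta=0.7):
    """Weighted sum of the current and previous factor-to-variable message parameters.
    beta = 1 recovers standard (undamped) GBP; beta = 0 re-sends the previous message."""
    eta_damped = beta * eta_new + (1.0 - beta) * eta_prev
    Lam_damped = beta * Lam_new + (1.0 - beta) * Lam_prev
    return eta_damped, Lam_damped

# Illustrative 1D message parameters.
eta_d, Lam_d = damp_message(np.array([2.0]), np.array([[4.0]]),
                            np.array([1.0]), np.array([[3.0]]))
print(eta_d, Lam_d)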

The interactive figure below aims to build intuition for GBP by exploring the effect of individual messages. For easy visualization and interpretation of the beliefs, we examine 3 spatial estimation problems with increasing "loopiness": a chain, a loop and a grid. Click on a variable node to send messages to its adjacent variables and observe how neighbouring beliefs are updated. You will see that GBP converges to the true marginals regardless of the order in which messages are passed.

Beyond the standard algorithm

We have introduced Gaussian Belief Propagation in its basic form as a probabilistic inference algorithm for Gaussian estimation problems. However, to solve real practical problems with GBP, we often need a number of extra details and tricks which we discuss in this section.

Non-Gaussian factors

Although requiring all factors to be Gaussian is a convenient assumption, most interesting problems involve non-linear relationships and / or non-Gaussian data distributions, both of which result in non-Gaussian factors. GBP can be extended to handle these problems by linearizing the non-linear relationships and using covariance scaling to handle non-Gaussian data distributions.

Non-linear Relationships

A factor is usually created given some observed data $d$ that we model as $d \sim h(X) + \epsilon$, where $h$ simulates the data generation process from the subset of variables $X$ (we are overloading $X$ here by using it to denote a subset of the variables for ease of notation) and $\epsilon \sim \mathcal{N}(0, \Sigma_n)$ is Gaussian noise. Rearranging, we see that the residual is Gaussian distributed $r = d - h(X) \sim \mathcal{N}(0, \Sigma_n)$, allowing us to form the factor with energy: E(X) = \frac{1}{2}(h(X) - d)^\top \Sigma_n^{-1} (h(X) - d) ~.

For linear functions $h(X) = \mathtt{J} X + c$, the energy is quadratic in $X$ and we can rearrange the energy so that the factor is in the Gaussian canonical form as: E(X) = \frac{1}{2} X^\top \Lambda X - \eta^\top X \;\;\;\; \text{, where} \; \eta = \mathtt{J}^\top \Sigma_n^{-1} (d - c) \; \text{and} \; \Lambda = \mathtt{J}^\top \Sigma_n^{-1} \mathtt{J} ~.

If $h$ is non-linear (it could be any non-linear function, for example a trained neural network or a Gaussian process), the energy is no longer quadratic in $X$, meaning the factor is not Gaussian-distributed. To restore the Gaussian form, it is standard to use a first-order Taylor expansion about the current estimate $X_{0}$ to approximate the factor as a Gaussian: h(X) \approx h(X_{0}) + \mathtt{J} (X - X_{0}) ~. Here $\mathtt{J}$ is now the Jacobian matrix and the factor can be written in the same form as above but with $c = h(X_{0}) - \mathtt{J} X_{0}$.
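The sketch below assembles the canonical parameters of a linearized factor following the equations above; the measurement function, Jacobian and noise values are invented for illustration.

import numpy as np

def linearized_factor(h, jac, X0, d, Sigma_n):
    """Approximate a non-linear factor as a Gaussian in canonical form by
    linearizing h about X0: h(X) ~= h(X0) + J (X - X0), so c = h(X0) - J X0."""
    J = jac(X0)
    c = h(X0) - J @ X0
    Lam_n_inv = np.linalg.inv(Sigma_n)
    eta = J.T @ Lam_n_inv @ (d - c)
    Lam = J.T @ Lam_n_inv @ J
    return eta, Lam

# Toy 1D range measurement between two scalar positions X = [x_a, x_b] (hypothetical).
h = lambda X: np.array([np.abs(X[1] - X[0])])
jac = lambda X: np.array([[-np.sign(X[1] - X[0]), np.sign(X[1] - X[0])]])  # valid for X[0] != X[1]
eta, Lam = linearized_factor(h, jac, X0=np.array([0.0, 1.0]),
                             d=np.array([1.2]), Sigma_n=np.array([[0.01]]))
print(eta, Lam)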

After linearization, the posterior is a Gaussian approximation of the true posterior and inference is performed by successively solving linearized versions of the underlying non-linear problem (as in non-linear least squares optimization).

To see how this linearization works, consider a robot moving in a plane that measures the 2D distance and angle to a landmark also in the plane, where the current estimates for the position of the robot and landmark are $r_0$ and $l_0$ respectively, and the observed measurement $d = h(r_0, l_0)$. In the interactive figure below, we show both the true non-linear factor and the Gaussian approximated factor with $r$ held constant at $r_0$.

The accuracy of the approximate Gaussian factor depends on the linearity of the function $h$ at the linearization point. As $h$ is reasonably smooth, the linear approximation is good close to the linearization point $l_0$, while further away the approximation can degrade. In practice, during optimization we can avoid this region of poor approximation by relinearizing frequently. As GBP is local, a just-in-time approach to linearization can be used in which factors are relinearized individually when the current estimate of the adjacent variables strays significantly from the linearization point.

Non-Gaussian data distributions

A second cause of non-Gaussian factors is non-Gaussian data distributions. We usually model observed data as coming from a Gaussian distribution: $d \sim h(X) + \epsilon$, by choosing $\epsilon$ to be Gaussian noise. Although this is generally sensible, true data distributions often have stronger tails or are more tightly peaked. In these cases, to retain the Gaussian form for GBP, we use an approximate Gaussian data distribution via covariance scaling. The covariance of the approximate Gaussian is chosen such that the quadratic energy matches the true non-Gaussian energy at that residual, as shown in the figure on the right.

Robust data distributions (or M-estimators), such as the Huber energy, are a common class of non-Gaussian data distributions which have greater probability mass in the tails to reduce sensitivity to outliers. The Huber energy is a continuous function that is quadratic close to the mean and transitions to a linear function for large residuals to reduce the energy cost of outliers (the distribution induced by the Huber energy is a Gaussian distribution close to the mean and a Laplace distribution in the tails; the probability density function for the Laplace distribution is: p(x ; \mu, b) = \frac{1}{2b} \exp\big(\frac{-|x - \mu|}{b}\big) ~): E_{\text{huber}}(r) = \begin{cases} \frac{1}{2} r^\top \Sigma_n^{-1} r, & \text{if}\ \rvert r\rvert < t \\ A \;+ \; B \rvert r \rvert, & \text{otherwise} ~. \end{cases} The approximate quadratic energy for the Huber distribution can be found by solving $\frac{1}{2} r^\top \Sigma_{\text{sc}}^{-1} r = E_{\text{huber}}(r)$ to give the diagonal scaled covariance: \Sigma_{\text{sc}} = \begin{cases} \Sigma_n , & \text{if}\ \rvert r\rvert < t \\ \frac{ r^\top \Sigma_n^{-1} r}{2 E_{\text{huber}}(r)} \Sigma_n, & \text{otherwise} ~. \end{cases}
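A minimal sketch of covariance scaling for a scalar Huber factor follows; the linear branch constants are chosen for continuity at the threshold, and the threshold and noise values are purely illustrative.

import numpy as np

def huber_scaled_variance(r, sigma2_n, t):
    """Return the scaled noise variance for a scalar residual r so that the
    quadratic energy 0.5 r^2 / sigma_sc matches the Huber energy at r."""
    quad = 0.5 * r**2 / sigma2_n
    if abs(r) < t:
        return sigma2_n
    # Linear branch of the Huber energy, continuous with the quadratic branch at |r| = t.
    e_huber = 0.5 * t**2 / sigma2_n + (t / sigma2_n) * (abs(r) - t)
    return (quad / e_huber) * sigma2_n   # inflated variance down-weights the outlier

print(huber_scaled_variance(r=0.05, sigma2_n=0.01, t=0.1))  # inlier: variance unchanged
print(huber_scaled_variance(r=1.0, sigma2_n=0.01, t=0.1))   # outlier: much larger variance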

Robust energy functions can make GBP much more generally useful - for example, they are crucial for bundle adjustment and for sharp image denoising (as we will see in a later figure). More generally, our interpretation is that robust factors can play a similar role to non-linearities in neural networks, activating or deactivating messages in the graph.

The interactive figure below gives intuition for the effect of robust factors for the task of 1D line fitting. The variables we are estimating are the $y$ values at fixed intervals along the $x$ axis, and the blue circles and lines show the mean and standard deviation of the beliefs. The red squares are measurements that produce data factors in the graph, and there are also smoothness factors between all adjacent variable nodes encouraging the $y$ values to be close. There are two types of factors in this example: smoothness factors and data factors. Smoothness factors encourage neighbouring nodes to take similar values and the measurement function is simply the difference between the values: h(y_i, y_j) = y_i - y_j ~. For a measurement at $(x_m, y_m)$ which is spatially in between the nodes at $x_i$ and $x_j$, the measurement function for the data factor is: h(y_i, y_j) = (1 - \lambda) y_i + \lambda y_j \:\:\:\:\text{where}\:\:\: \lambda = \frac{x_m - x_i}{x_j - x_i} ~. You can add your own data factors by clicking on the canvas and a diagram of the factor graph is in the bottom right of the figure. Press play to run synchronous GBP and observe that a Huber energy can disregard outliers and retain step discontinuities in the data, unlike the standard squared loss.

Local updates and Scheduling

So far, we have assumed that all variable and factor nodes broadcast messages at each iteration in a synchronous fashion, where all nodes absorb and broadcast messages in parallel. In fact, this is far from a requirement and as GBP is entirely local, messages can be sent arbitrarily and asynchronously.

It turns out that the message schedule can have a dramatic effect on the rate of convergence. For example, swapping synchronous updates for random message passing tends to improve convergence, while a fixed "round-robin" schedule can do even better. Better yet, if each message requires some unit of computation (and therefore energy), it's possible to prioritize sending messages that we think will contribute most to the overall convergence of the system (there is evidence that the brain may apply a similar economical principle). This is the idea behind residual belief propagation (RBP) and similar variants, which form a message queue according to the norm of the difference from the previous message.
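A small sketch of the residual scheduling idea: score each candidate message by how much its canonical parameters have changed since it was last sent, and send the largest change first. The message representation and names here are our own illustration.

import heapq
import numpy as np

def message_residual(eta_new, Lam_new, eta_prev, Lam_prev):
    """Norm of the change in the canonical message parameters."""
    return np.linalg.norm(eta_new - eta_prev) + np.linalg.norm(Lam_new - Lam_prev)

# Hypothetical candidate messages keyed by edge id, with (new eta, new Lambda, prev eta, prev Lambda).
candidates = {
    "f0->x1": (np.array([2.0]), np.array([[4.0]]), np.array([1.9]), np.array([[4.0]])),
    "f1->x2": (np.array([0.5]), np.array([[1.0]]), np.array([3.0]), np.array([[2.0]])),
}

# Build a max-priority queue (heapq is a min-heap, so negate the residual).
queue = [(-message_residual(*params), edge) for edge, params in candidates.items()]
heapq.heapify(queue)
_, next_edge = heapq.heappop(queue)
print("send", next_edge, "first")   # the edge whose message changed the most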

In the figure below, we explore message scheduling using the 1D line fitting task once again. The underlying factor graph is a chain (no loops) and so will converge after one sweep of messages from left to right and back again. You can send messages through the graph using the preset schedules (synchronous, random or sweep) or create your own schedule by clicking on a variable node to send messages outwards.

Playing around with different schedules for surface estimation highlights two important properties of GBP. First, GBP can converge with an arbitrary message passing schedule. As a consequence, GBP can readily operate in systems with no global clock and varying local compute budgets such as on neuromorphic hardware or between a group of distributed devices.

The second property is that GBP can achieve approximate local convergence without global convergence. Due to the factorized structure of GBP, global inference is achieved by jointly solving many interdependent local subproblems. There are many instances in which we might only be interested in local solutions - in these cases, GBP can operate in a just-in-time or attention-driven fashion, focusing processing on parts of the graph to solve local subproblems as the task demands. Local message passing can yield accurate relative local solutions which estimate the marginals up to global corrections that come from more distant parts of the graph. One simple example is mapping two connected rooms: an accurate local map of one room can be constructed by focusing processing on the part of the factor graph in that room. For some applications this may be sufficient, while for others it may be important to build a map with an accurate absolute position, which may require longer range message passing between the parts of the graph corresponding to each separate room. This attention-driven scheduling can be very economical with compute and energy, only sending the most task-critical messages.

In the figure below we explore attention-driven message passing for image denoising (as there are no long-range connections in the image denoising graph, local message passing can produce the true local marginals, since the effect of more distant parts of the graph is negligible). Image denoising is the 2D equivalent of the surface estimation problem from the previous figure. The only difference is that previously the variable nodes were at discrete locations while the data factors could be at any location, whereas now the data factors are at the same discrete locations as the variable nodes, with one per node. We also revisit the use of robust energy functions with GBP via covariance scaling, which is crucial for sharp denoising.

Multiscale Learning

Propagating information from one node to another with GBP takes the same number of iterations as the number of hops between the nodes. For nearby nodes in a local region, information can be communicated in a small number of iterations and consensus can be reached quickly, while for distant nodes, a global consensus can take many more iterations to be established. This is an inherent property of local algorithms and can be summarized as: low frequency errors decay more slowly than high frequency errors.

Regular grid structured graphs appear a lot in computer vision (e.g. image segmentation) and in discretized boundary value problems (e.g. solving for the temperature profile along a rod). Accelerating convergence in such grid graphs has been well-studied in the field of Multigrid methods. One simple approach is to coarsen the grid which transforms low frequency errors into higher frequency errors that decay faster. After convergence in the coarsened grid, the solution is used to initialize inference in the original grid which now has smaller low frequency errors. This is the idea behind coarse-to-fine optimization which is used in many grid-based problems where it is simple to build a coarser graph. In one notable work, the authors demonstrate much faster inference for stereo, optical flow and image restoration with multiscale BP.

Multigrid methods can only be applied to graphs with a grid-like structure where it is possible to build equivalent coarsened representations. In general, most problems are more unstructured and it is not clear how to build a coarsened or abstracted representation of the original problem. In the general case, we see two possible ways to build hierarchy into a model. First, a network could be trained to directly predict specific abstractions that form long range connections when included in the graph. Second, the graph could contain additional constraints that define a generic hierarchical structure (much like a neural network) and then the abstractions themselves are also inferred.

Solving real non-linear problems with GBP is done by iteratively solving linearized Gaussian versions of the true non-linear problem. This general pattern of successively solving linearized problems underpins many different non-linear inference methods. There are efficient libraries for non-linear inference which use trust region methods like Gauss-Newton or line search to guide the repeated linear steps. Trust region methods approximate the energy using a model within a trust region; for example, the Gauss-Newton method uses a quadratic model, meaning the factors are approximated as Gaussians as in GBP. Line search methods choose a descent direction and then a step size at each iteration. In trust region methods, the most expensive step is solving the linear system, while for line search methods choosing the direction is the most expensive part.

GBP is just one of many possible algorithms that can be used to solve the linearized Gaussian model. To place GBP amongst other methods, we present an overview of a number of related methods for MAP and marginal inference for Gaussian models in the table below. As a reminder, inference in Gaussian models is equivalent to solving the linear system $\Lambda \mu = \eta$, for $\mu$ in MAP inference and for $\mu$ and the diagonal elements of $\Lambda^{-1}$ in marginal inference. You can hover over the circles in the figure to explore how GBP relates to other methods.

With so many different inference methods, choosing which method to use can be a challenge in itself. Judging the speed of each method is complex and depends on both the sparsity structure of $\Lambda$ and on the implementation on the available hardware. Our key argument in this article is that we want a general method that is local, probabilistic and iterative which led us towards Gaussian Belief Propagation.

Other notable candidate methods that are local, probabilistic and iterative are Expectation Propagation (EP) and Barfoot's algorithm. EP is generally not node-wise parallel and simplifies to GBP in the special case when it is node-wise parallel, while Barfoot's algorithm involves extra communication edges and is yet to be applied to real problems. For these reasons GBP stands out as the extreme case that maximizes parallelism and minimizes communication - two principles that are at the core of scalable and low-power computation.

Conclusions

We envisage that ML systems of the future will be large scale, heterogeneous and distributed and as such will require flexible and scalable probabilistic inference algorithms. In this article, we argued that Gaussian Belief Propagation is a strong candidate algorithm as it is local, probabilistic, iterative and asynchronous. Additionally, we showed 1) how GBP is much more general with a prescription for handling non-linear factors and robust energy functions, 2) how GBP can operate in an attention-driven fashion and 3) how hierarchical structure can help convergence. We hope that this visual introduction will encourage more researchers and practitioners to look into GBP as an alternative to existing inference algorithms.

We see many exciting directions for future research around GBP and provide a GBP Library and Notebook as a starting point for the interested reader. Some directions we are most excited about are improving theoretical guarantees, using learned factors, introducing discrete variables, combining GBP with GNNs, incrementally abstracting factor graphs, investigating numerical precision for messages, using GBP for distributed learning in overparameterized networks and lastly unifying iterative inference with test-time self-supervised learning.

Supplementary Playground

We encourage the interested reader to explore our supplementary GBP playground. The playground consists of 2 interactive diagrams based around 2D pose graphs. In the first, you can construct your own pose graph, set the initialization and then choose how messages are passed through the graph. The second is a simulation of a robot exploring a 2D environment with landmarks.

Acknowledgments

We are grateful to many researchers with whom we have discussed some of the ideas in this paper, especially from the Dyson Robotics Lab and Robot Vision Group at Imperial College London. We would particularly like to thank Xiaofan Mu, Raluca Scona, Riku Murai, Edgar Sucar, Seth Nabarro, Tristan Laidlow, Nanfeng Liu, Shuaifeng Zhi, Kentara Wada and Stefan Leutenegger.

Appendix A:
Variational BP Derivation

This standard derivation of the belief propagation equations follows Yedidia et al. in showing that the BP update rules follow from constrained minimization of an approximate free energy known as the Bethe free energy.

We begin by writing the posterior as a product of the factors: p(X) = \frac{1}{Z} \prod_a f_a(X_a) = \frac{1}{Z} \prod_a e^{-E(X_a)} ~. The goal of variational inference is to find a variational distribution $q(X)$ that approximates the posterior well by minimizing the Kullback-Leibler divergence between the variational distribution and the posterior. The KL divergence is a non-negative asymmetric similarity measure that has a minimum of 0 when $p = q$. \begin{aligned} KL(q \lvert \rvert p) &= \sum_{X} q(X) \log \frac{q(X)}{p(X)} \\ &= \sum_{X} q(X) \log q(X) - \sum_{X} q(X) \log p(X) \\ &= -H_q(X) - \mathop{\mathbb{E}}_q [\log p(X)] \\ &= -H_q(X) - \sum_{a} \mathop{\mathbb{E}}_q [\log f_a(X_a)] + \log(Z) \\ &= F(p, q) + \log(Z) \end{aligned} Above, we defined the free energy: F(p, q) = -H_q(X) - \sum_{a} \mathop{\mathbb{E}}_q [\log f_a(X_a)] ~, where the first term is the negative of the entropy and the second term is known as the average energy because $-\log f_a(X_a) = E(X_a)$. The free energy has a minimum value of $-\log(Z)$, and by minimizing this free energy we also minimize the KL divergence.

We first consider the form of the free energy for a tree. For tree graphs, the distribution $q(X)$ can be written in the form: q(X) = \prod_{i} b_i(x_i)^{1 - d_i} \prod_{a} b_{a}(X_a) ~, where the first product is over variables and the second is over the factors. $b_i(x_i)$ is the marginal distribution over variable $x_i$, $b_a(X_a)$ is the joint marginal distribution over variables $X_a$ that connect to factor $f_a$, and $d_i$ is the degree of variable node $i$ (the number of nodes neighbouring node $i$). Plugging this into the expression for the entropy, we get: H_{tree}(X) = -\sum_{i} (1 - d_i) \sum_{x_i} b_i(x_i) \log b_i(x_i) - \sum_{a} \sum_{X_a} b_{a}(X_a) \log b_{a}(X_a) ~. Similarly, the average energy can be written as: - \sum_{a} \mathop{\mathbb{E}}_q [\log f_a(X_a)] = - \sum_{a} \sum_{X_a} b_a(X_a) \log f_a(X_a) ~. Putting this together gives the free energy for tree graphs: F_{tree} = - \sum_{i} (d_i-1) \sum_{x_i} b_i(x_i) \log b_i(x_i) + \sum_{a} \sum_{X_a} b_{a}(X_a) \log \frac{b_{a}(X_a)}{ f_{a}(X_a)} ~. For general factor graphs with loops, $F \neq F_{tree}$. The Bethe approximation is to use the free energy for a tree to approximate the free energy for arbitrary loopy graphs. The resulting approximate free energy is known as the Bethe free energy.

Belief propagation can be derived via minimization of the Bethe free energy subject to two constraints. The first is a marginalization constraint: $b_i(x_i) = \sum_{X_a \setminus i} b_{a}(X_a)$, and the second is a normalization constraint: $\sum_{x_i} b_i(x_i) = 1$. With these constraints, we can form the Lagrangian and then set the derivatives with respect to the parameters to zero: L = F_{Bethe} + \sum_i \gamma_i \bigg\{ 1 - \sum_{x_i} b_i(x_i) \bigg\} + \sum_a \sum_{i \in N(a)} \sum_{x_i} \lambda_{ai}(x_i) \bigg\{ b_i(x_i) - \sum_{X_a \setminus i} b_{a}(X_a) \bigg\} \frac{\partial L}{\partial b_{i}(x_i)} = 0 \;\;\;\;\;\; \Rightarrow b_{i}(x_i) \propto \prod_{a \in N(i)} \exp \big(\lambda_{ai}(x_i) \big) \frac{\partial L}{\partial b_{a}(X_a)} = 0 \;\;\;\; \Rightarrow b_{a}(X_a) \propto f_{a}(X_a) \prod_{i \in N(a)}\exp \big( \lambda_{ai}(x_i) \big)

We now choose the Lagrange multiplier to be $\exp(\lambda_{ai}(x_i)) = m_{x_i \rightarrow f_a}(x_i) = \prod_{c \in N(i) \setminus a} m_{f_c \rightarrow x_i}(x_i)$. This is the familiar variable-to-factor message equation and substituting this into the above equations yields the belief propagation fixed point equations (the first of which the reader will recognize as the belief update rule). b_i(x_i) \propto \prod_{a \in N(i)} m_{x_i \rightarrow f_a} (x_i) \propto \prod_{a \in N(i)} m_{f_a \rightarrow x_i}(x_i) b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} m_{x_i \rightarrow f_a}(x_i) = f_a(X_a) \prod_{i \in N(a)} \prod_{c \in N(i) \setminus a} m_{f_c \rightarrow x_i} (x_i)

Using the marginalization condition, we can derive an equation for the messages in terms of other messages and produce the factor-to-variable message equation: \begin{aligned} m_{f_a \rightarrow x_i}(x_i) &= \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} \; \prod_{b \in N(j) \setminus a} m_{f_b \rightarrow x_j}(x_j) \\ &= \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{x_j \rightarrow f_a}(x_j) \end{aligned}

This result tells us that the fixed-points of loopy belief propagation are local stationary points of the Bethe free energy and because the Bethe energy is bounded from below, BP always has a fixed point.

BP variants have been developed using more accurate or convex approximations of the free energy, however a detailed discussion of the theory behind BP is beyond the scope of this article and we refer the reader to for an in-depth review.

Appendix B:
GBP Equations

Here we present the Gaussian belief propagation equations, where messages and beliefs are parameterized with the Gaussian canonical form. The GBP equations are a specific case of the BP equations presented in the main paper.

Belief Update

To compute the belief at a variable node, you take the product of incoming messages from all adjacent factors: b_{i}(x_i) = \prod_{s \in N(i)} m_{f_s \rightarrow x_i} ~, where $N(i)$ denotes the neighbours of node $i$. A Gaussian message has the canonical form: m = \mathcal{N}^{-1}(x; \eta, \Lambda) \propto \exp(-\frac{1}{2} x^\top \Lambda x + \eta^\top x) ~, and so taking products of these messages is equivalent to summing the respective information vectors and precision matrices. The belief parameters $\eta_{b_i}$ and $\Lambda_{b_i}$ are therefore: \eta_{b_i} = \sum_{s \in N(i)} \eta_{f_s \rightarrow x_i} \:\:\:\:\text{and}\:\:\:\: \Lambda_{b_i} = \sum_{s \in N(i)} \Lambda_{f_s \rightarrow x_i} ~.
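A minimal sketch of this update, assuming each incoming message is stored as an (eta, Lambda) pair (the names are ours); the variable-to-factor message in the next section follows the same pattern but omits the recipient factor's message.

import numpy as np

def belief_update(incoming):
    """Belief at a variable node: sum the information vectors and precision
    matrices of all incoming factor-to-variable messages."""
    eta_b = sum(eta for eta, _ in incoming)
    Lam_b = sum(Lam for _, Lam in incoming)
    return eta_b, Lam_b

# Two illustrative 1D messages to the same variable.
msgs = [(np.array([1.0]), np.array([[2.0]])), (np.array([0.5]), np.array([[1.0]]))]
eta_b, Lam_b = belief_update(msgs)
print(np.linalg.solve(Lam_b, eta_b), np.linalg.inv(Lam_b))  # belief mean and covariance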

Variable to factor message passing

To send a message at a variable node to an adjacent factor, the variable node takes the product of the messages from all other adjacent factors: m_{x_i \rightarrow f_j} = \prod_{s \in N(i) \setminus j} m_{f_s \rightarrow x_i} ~. Similar to the belief update step, the outgoing message parameters can be computed simply by summing the message parameters from all other adjacent factors: \eta_{x_i \rightarrow f_j} = \sum_{s \in N(i) \setminus j} \eta_{f_s \rightarrow x_i} \:\:\:\:\text{and}\:\:\:\: \Lambda_{x_i \rightarrow f_j} = \sum_{s \in N(i) \setminus j} \Lambda_{f_s \rightarrow x_i} ~.

Factor to variable message passing

To send a message from a factor node to a variable node, the factor aggregates messages from all other adjacent variable nodes before marginalising out all other adjacent variable nodes: m_{f_j \rightarrow x_i} = \sum_{X_j \setminus x_i} f_j(X_j) \prod_{k \in N(j) \setminus i} m_{x_k \rightarrow f_j} ~.

To see how this computation is implemented for Gaussian models with the canonical form, consider a factor $f$ connected to 3 variable nodes $[x_1, x_2, x_3]$ where we are computing the message to variable $x_1$. The factor $f([x_1, x_2, x_3]) = \mathcal{N}^{-1}([x_1, x_2, x_3]; \eta_f, \Lambda_f)$ is a Gaussian over the variables and can be divided up as follows: \eta_{f} = \begin{bmatrix} \eta_{f\_1} \\ \eta_{f\_2} \\ \eta_{f\_3} \end{bmatrix} \:\:\:\:\text{and}\:\:\:\: \Lambda_{f} = \begin{bmatrix} \Lambda_{f\_11} & \Lambda_{f\_12} & \Lambda_{f\_13} \\ \Lambda_{f\_21} & \Lambda_{f\_22} & \Lambda_{f\_23} \\ \Lambda_{f\_31} & \Lambda_{f\_32} & \Lambda_{f\_33} \end{bmatrix} ~. The first part of the computation for the message to $x_1$ is to take the product of the factor distribution and messages coming from the other adjacent variable nodes ($x_2$ and $x_3$). This yields a Gaussian with the following parameters: \eta^\prime_{f} = \begin{bmatrix} \eta_{f\_1} \\ \eta_{f\_2} + \eta_{x_2 \rightarrow f} \\ \eta_{f\_3} + \eta_{x_3 \rightarrow f} \end{bmatrix} \:\:\:\:\text{and}\:\:\:\: \Lambda^\prime_{f} = \begin{bmatrix} \Lambda_{f\_11} & \Lambda_{f\_12} & \Lambda_{f\_13} \\ \Lambda_{f\_21} & \Lambda_{f\_22} + \Lambda_{x_2 \rightarrow f} & \Lambda_{f\_23} \\ \Lambda_{f\_31} & \Lambda_{f\_32} & \Lambda_{f\_33} + \Lambda_{x_3 \rightarrow f} \end{bmatrix} ~. To complete message passing from this factor, we must marginalise out all variables apart from the variable $x_1$ which is the recipient of the message. The formula for marginalising a Gaussian in the canonical form is given in Eustice et al. For the joint Gaussian distribution over variables $a$ and $b$ parameterized by: \eta = \begin{bmatrix} \eta_{a} \\ \eta_{b} \end{bmatrix} \:\:\:\:\text{and}\:\:\:\: \Lambda = \begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} ~, the marginal distribution over $a$ after marginalising out $b$, has parameters: \eta_{Ma} = \eta_{a} - \Lambda_{ab} \Lambda_{bb}^{-1} \eta_{b} \:\:\:\:\text{and}\:\:\:\: \Lambda_{Ma} = \Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba} ~. To apply these formulae to the partitioned joint Gaussian parameterized by $\eta^\prime_{f}$ and $\Lambda^\prime_{f}$, we first reorder the vector and matrix to bring the output variable to the top (in our example the recipient variable $x_1$ is already at the top, so we do not need to reorder). Then we identify the subblocks $a = x_1$ and $b = [x_2, x_3]$ and apply the above marginalization equations to form the parameters of the outgoing message.
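The sketch below condenses these steps for a factor over scalar variables, following the conditioning and marginalization formulas above; the variable names, dictionary layout and numbers are our own illustration.

import numpy as np

def factor_to_variable_message(eta_f, Lam_f, incoming, recipient, dims):
    """Message from a factor to one adjacent variable, in canonical form.
    incoming maps each *other* adjacent variable to its variable-to-factor
    message (eta, Lambda); dims gives each variable's slice in the factor ordering."""
    eta, Lam = eta_f.copy(), Lam_f.copy()
    for var, (eta_m, Lam_m) in incoming.items():   # add in the other variables' messages
        idx = np.arange(dims[var].start, dims[var].stop)
        eta[idx] += eta_m
        Lam[np.ix_(idx, idx)] += Lam_m
    a = np.arange(dims[recipient].start, dims[recipient].stop)
    b = np.array([i for i in range(len(eta)) if i not in a])
    # Marginalize out the non-recipient variables using the canonical-form formulas above.
    Lam_ab, Lam_bb_inv = Lam[np.ix_(a, b)], np.linalg.inv(Lam[np.ix_(b, b)])
    eta_msg = eta[a] - Lam_ab @ Lam_bb_inv @ eta[b]
    Lam_msg = Lam[np.ix_(a, a)] - Lam_ab @ Lam_bb_inv @ Lam[np.ix_(b, a)]
    return eta_msg, Lam_msg

# Illustrative factor over scalar variables [x1, x2, x3] and messages from x2 and x3.
dims = {"x1": slice(0, 1), "x2": slice(1, 2), "x3": slice(2, 3)}
eta_f = np.array([1.0, 0.0, 0.5])
Lam_f = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
incoming = {"x2": (np.array([0.2]), np.array([[1.0]])),
            "x3": (np.array([0.1]), np.array([[0.5]]))}
print(factor_to_variable_message(eta_f, Lam_f, incoming, "x1", dims))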

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise.

Citation

For attribution in academic contexts, please cite this work as

J. Ortiz, T. Evans, A. Davison. A visual introduction to Gaussian Belief Propagation, 2021.

BibTeX citation

@article{Ortiz2021visualGBP,
  title = {A visual introduction to Gaussian Belief Propagation},
  author = {Ortiz, Joseph and Evans, Talfan and Davison, Andrew J.},
  journal={arXiv preprint arXiv:2107.02308},
  year = {2021},
}