Toward a unified theory of efficient, predictive, and sparse coding
Matthew Chalk^(a,b,1), Olivier Marre^(b), and Gašper Tkačik^(a)

^(a) Department of Physical Sciences, Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria; and ^(b) Sorbonne Universités, Université Pierre et Marie Curie Paris 06, INSERM, CNRS, Institut de la Vision, 75012 Paris, France

neural coding | prediction | information theory | sparse coding | efficient coding

Sensory neural circuits perform a myriad of computations, which allow us to make sense of and interact with our environment. For example, neurons in the primary visual cortex encode information about local edges in an image, while neurons in higher-level areas encode more complex features, such as textures or faces. A central aim of sensory neuroscience is to develop a mathematical theory to explain the purpose and nature of such computations and, ultimately, predict neural responses to stimuli from first principles. The influential "efficient coding" theory posits that sensory circuits encode maximal information about their inputs given internal constraints, such as metabolic costs and/or noise (1–4); similar ideas have recently been applied in genetic and signaling networks (5, 6). While conceptually simple, this theory has been extremely successful in predicting a host of different neural response properties from first principles. Despite these successes, however, there is often confusion in the literature, due to a lack of consensus on (i) what sensory information is relevant (and thus should be encoded) and (ii) the internal constraints (determining what information can be encoded).

One area of potential confusion is between different ideas of why and how neural networks may need to make predictions. For example, given low noise, efficient coding predicts that neurons should remove statistical dependencies in their inputs so as to achieve nonredundant, statistically independent responses (3, 4, 7–9). This can be implemented within a recurrent network where neurons encode a prediction error equal to the difference between their received inputs and an internally generated expectation, hence performing "predictive coding" (10–13). However, Bialek and coworkers (14, 15) recently proposed an alternative theory, in which neurons are hypothesized to preferentially encode sensory information that can be used to predict the future, while discarding other nonpredictive information (14–17). While both theories assume that neural networks make predictions, they are not equivalent: one describes how neurons should compress incoming signals, and the other describes how neurons should selectively encode only predictive signals. Signal compression requires encoding surprising stimuli not predicted by past inputs; these are not generally the same as predictive stimuli, which are informative about the future (16).

Another type of code that has been studied extensively is "sparse coding": a population code in which a relatively small number of neurons are active at any one time (18). While there are various reasons why a sparse code may be advantageous (19–21), previous work has shown that sparse coding emerges naturally as a consequence of efficient coding of natural sensory signals with a sparse latent structure (i.e., generated by combining many sensory features, few of which are present at any one time) (22). Sparse coding has been successful in predicting many aspects of sensory neural responses (23, 24), notably the orientation and motion selectivity of neurons in the primary visual cortex (25–29). Nonetheless, it is unclear how sparse coding is affected by other coding objectives, such as efficiently predicting the future from past inputs. An attempt to categorize the diverse types of efficient coding is presented in SI Appendix, Efficient Coding Models.

To consistently organize and compare these different ideas, we present a unifying framework based on the information bottleneck (IB) (30).

Significance

Sensory neural circuits are thought to efficiently encode incoming signals. Several mathematical theories of neural coding formalize this notion, but it is unclear how these theories relate to each other and whether they are even fully consistent. Here we develop a unified framework that encompasses and extends previous proposals. We highlight key tradeoffs faced by sensory neurons; we show that trading off future prediction against efficiently encoding past inputs generates qualitatively different predictions for neural responses to natural visual stimulation. Our approach is a promising first step toward theoretically explaining the observed diversity of neural responses.

Author contributions: M.C., O.M., and G.T. designed research, performed research, and wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.

To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1711114115/-/DCSupplemental.



A central goal in theoretical neuroscience is to predict the response properties of sensory neurons from first principles. To this end, “efficient coding” posits that sensory neurons encode maximal information about their inputs given internal constraints. There exist, however, many variants of efficient coding (e.g., redundancy reduction, different formulations of predictive coding, robust coding, sparse coding, etc.), differing in their regimes of applicability, in the relevance of signals to be encoded, and in the choice of constraints. It is unclear how these types of efficient coding relate or what is expected when different coding objectives are combined. Here we present a unified framework that encompasses previously proposed efficient coding models and extends to unique regimes. We show that optimizing neural responses to encode predictive information can lead them to either correlate or decorrelate their inputs, depending on the stimulus statistics; in contrast, at low noise, efficiently encoding the past always predicts decorrelation. Later, we investigate coding of naturalistic movies and show that qualitatively different types of visual motion tuning and levels of response sparsity are predicted, depending on whether the objective is to recover the past or predict the future. Our approach promises a way to explain the observed diversity of sensory neural responses, as due to multiple functional goals and constraints fulfilled by different cell types and/or circuits.


Edited by Charles F. Stevens, The Salk Institute for Biological Studies, La Jolla, CA, and approved November 20, 2017 (received for review June 22, 2017)

In our work, a small set of optimization parameters determines the goals and constraints faced by sensory neurons. Previous theories correspond to specific values of these parameters. We investigate the conditions under which different coding objectives have conflicting or synergistic effects on neural responses and explore qualitatively unique coding regimes.

Efficient Coding with Varying Objectives/Constraints

We consider a temporal stimulus, x_{−∞:t} ≡ (..., x_{t−1}, x_t), which elicits neural responses, r_{−∞:t} ≡ (..., r_{t−1}, r_t). We seek a neural code described by the probability distribution p(r_t | x_{−∞:t}), such that neural responses within a temporal window of length τ encode maximal information about the stimulus at lag ∆ given fixed information about past inputs (Fig. 1A). This problem can be formalized using the IB framework (30–32) by seeking a code, p(r_t | x_{−∞:t}), that maximizes the objective function:

L_{p(r_t | x_{−∞:t})} = I(R_{t−τ:t}; X_{t+∆}) − γ I(R_t; X_{−∞:t}),   [1]

where the first term (to be maximized) is the mutual information between the responses between t − τ and t and the stimulus at time t + ∆, while the second term (to be constrained) is the mutual information between the response at time t and past inputs (which we call the coding capacity, C). A constant, γ, controls the tradeoff between coding fidelity and compression. This objective function can be expanded as

L_{p(r_t | x_{−∞:t})} = ⟨ log p(x_{t+∆} | r_{t−τ:t}) − log p(x_{t+∆}) − γ log p(r_t | x_{−∞:t}) + γ log p(r_t) ⟩_{p(r,x)}.   [2]

Previously, we showed that, in cases where it is not possible to compute this objective function directly, one can use approximations of p(x_{t+∆} | r_{t−τ:t}) and p(r_t) to obtain a lower bound, L̃ ≤ L, that can be maximized tractably (31) (SI Appendix, General Framework). From Eqs. 1 and 2, we see that the optimal coding strategy depends on three factors: the decoding lag, ∆; the code length, τ; and the coding capacity, C (determined by γ).
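To make the two terms of Eq. 1 concrete, the sketch below evaluates them in closed form for the simplest possible case: a scalar instantaneous code r_t = w·x_t + η applied to a unit-variance AR(1) stimulus with unit additive noise. This is an illustration under these stated assumptions, not the authors' implementation; the function name and parameter values are ours.

```python
import numpy as np

def ib_objective_instantaneous(w, gamma, a=0.89, delta=2):
    """Eq. 1 for a scalar code r_t = w*x_t + eta, eta ~ N(0,1), applied to a
    unit-variance AR(1) stimulus with autocorrelation a**|k| (an assumption).
    Both mutual-information terms are Gaussian and have closed forms."""
    var_r = w**2 + 1.0                       # response variance (signal + unit noise)
    # capacity term: with tau = 0 the response only depends on x_t,
    # so I(R_t; X_past) = I(R_t; X_t) = 0.5*log(1 + w^2)
    i_past = 0.5 * np.log(var_r)
    # predictive term: corr(r_t, x_{t+delta})^2 = (w*a^delta)^2 / var_r
    rho2 = (w * a**delta)**2 / var_r
    i_future = -0.5 * np.log(1.0 - rho2)
    return i_future - gamma * i_past

# Example: a tighter capacity constraint (larger gamma) favours smaller encoding gains.
for gamma in (0.05, 0.5):
    ws = np.linspace(0.01, 10, 1000)
    best_w = ws[np.argmax([ib_objective_instantaneous(w, gamma) for w in ws])]
    print(gamma, round(best_w, 2))
```

Sweeping w makes the tradeoff explicit: a larger encoding gain increases the predictive term but is penalized through the capacity term in proportion to γ.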

Fig. 1. Schematic of modeling framework. (A) A stimulus (stim.) (Upper) elicits a response in a population of neurons (Lower). We look for codes where the responses within a time window of length τ maximize information encoded about the stimulus at lag ∆, subject to a constraint on the information about past inputs, C. (B) For a given stimulus, the optimal code depends on three parameters: τ, ∆, and C. Previous work on efficient temporal coding generally looked at τ > 0 and ∆ < 0 (blue sphere). Recent work posited that neurons encode maximal information about the future (∆ > 0) but only treated instantaneous codes, τ ∼ 0 (red plane). Our theory is valid in all regimes, but we focus in particular on ∆ > 0 and τ > 0 (black sphere). (C) We further explore how optimal codes change when there is a sparse latent structure in the stimulus (natural image patch; Right) vs. when there is none (filtered noise; Left).


Previous theories of neural coding correspond to specific regions within the 3D parameter space spanned by ∆, τ, and C (Fig. 1B). For example, temporal redundancy reduction (3, 33) occurs (i) at low internal noise (i.e., high C), (ii) where the objective is to encode the recent past (∆ < 0), and (iii) where information about the stimulus can be read out by integrating neural responses over time (τ ≫ 0). Increasing the internal noise (i.e., decreasing C) results in a temporally redundant "robust" code (34–37) (blue sphere in Fig. 1B). Recent work positing that neurons efficiently encode information about the future (∆ > 0) looked exclusively at near-instantaneous codes, where τ ∼ 0 (red plane in Fig. 1B) (15, 38–40). Here, we investigate the relation between these previous works and focus on the (previously unexplored) case of neural codes that are both predictive (∆ > 0) and temporal (τ > 0) and have varying signal to noise (variable C) (black sphere in Fig. 1B).

To specialize our theory to the biologically relevant case, we later investigate efficient coding of natural stimuli. A hallmark of natural stimuli is their sparse latent structure (18, 22, 25, 26): stimulus fragments can be constructed from a set of primitive features (e.g., image contours), each of which occurs rarely (Fig. 1C). Previous work showed that, in consequence, redundancy between neural responses is minimized by maximizing their sparsity (SI Appendix, Efficient Coding Models) (22). Here, we investigated what happens when the objective is not to minimize redundancy but rather to efficiently predict future stimuli given finite coding capacity.

Results

Dependence of Neural Code on Coding Objectives.

Our initial goal was to understand the influence of different coding objectives in the simplest scenario, where a single neuron linearly encodes a 1-d input. In this model, the neural response at time t is r_t = Σ_{k=0}^{τ_w} w_k x_{t−k} + η_t, where w = (w_0, ..., w_{τ_w}) are the linear coding weights and η_t is a Gaussian noise with unit variance.* With 1-d stimuli that have Gaussian statistics, the IB objective function takes a very simple form:

L = −(1/2) log ⟨ ( x_{t+∆} − Σ_{k=0}^{τ} u_k r_{t−k} )² ⟩ − (γ/2) log ⟨ r_t² ⟩,   [3]

where u = (u_0, ..., u_τ) are the optimal linear readout weights used to reconstruct the stimulus at time t + ∆ from the responses between t − τ and t. Thus, the optimal code is the one that minimizes the mean-squared reconstruction error at lag ∆, constrained by the variance of the neural response (relative to the noise variance).†

Initially, we investigated "instantaneous" codes, where τ = 0, so that the stimulus at time t + ∆ is estimated from the instantaneous neural response at time t (Fig. 2A). We considered three different stimulus types, which are shown in Fig. 2B. With a "Markov" stimulus (Fig. 2B, Top and SI Appendix, Methods for Simulations in the Main Text), with a future trajectory that depended solely on the current state, x_t, the neurons only needed to encode x_t to predict the stimulus at a future time, x_{t+∆}. Thus, when τ = 0, we observed the trivial solution where r_t ∝ x_t, irrespective of the decoding lag, ∆ (Fig. 2 C and D and SI Appendix, Fig. S2A).

* τ_w is the encoding filter length, not to be confused with τ, the decoding filter length.

† We omitted the constant stimulus entropy term, ⟨log p(x_{t+∆})⟩, from Eq. 3 and the noise entropy term, ⟨log p(r_t | x_{−∞:t})⟩ [since, with no loss of generality, we assume a fixed-amplitude additive noise (32)].
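The Gaussian form of Eq. 3 can be optimized numerically for a temporal encoding filter. The sketch below does this for an AR(1) stimulus with autocovariance a^|k| and an instantaneous readout (τ = 0); it is a minimal illustration with hypothetical parameter choices, not the code used for the simulations reported in Fig. 2.

```python
import numpy as np
from scipy.optimize import minimize

def neg_eq3(w, gamma=0.1, a=0.89, delta=2):
    """Negative of Eq. 3 for a single neuron, r_t = sum_k w_k x_{t-k} + eta_t,
    with unit-variance Gaussian noise, instantaneous decoding (tau = 0), and a
    unit-variance AR(1) stimulus with autocovariance C(k) = a**|k| (an assumption)."""
    lags = np.arange(len(w))
    C = a ** np.abs(lags[:, None] - lags[None, :])   # stimulus autocovariance matrix
    var_r = w @ C @ w + 1.0                          # response variance (unit noise)
    cov_xr = w @ (a ** (delta + lags))               # cov(x_{t+delta}, r_t)
    mse = 1.0 - cov_xr**2 / var_r                    # error of the optimal linear readout
    return 0.5 * np.log(mse) + 0.5 * gamma * np.log(var_r)

w0 = np.random.randn(10) * 0.1
w_opt = minimize(neg_eq3, w0, args=(0.1, 0.89, 2)).x
print(np.round(w_opt / np.abs(w_opt).max(), 2))      # normalised filter shape
```

For a Markov (AR1) stimulus this recovers a filter concentrated at lag 0, consistent with the trivial solution r_t ∝ x_t described above.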


Fig. 2. Dependence of optimal code on decoding lag, ∆; code length, τ; and coding capacity, C. (A) We investigated two types of code: instantaneous codes, where τ = 0 (C and D), and temporal codes, where τ > 0 (E and F). (B) Training stimuli (stim.) used in our simulations. Markov stimulus: future only depends on the present state. Two-timescale stimulus: sum of two Markov processes that vary over different timescales (slow stimulus component is shown in red). Inertial stimulus: future depends on present position and velocity. (C) Neural responses to probe stimulus (dashed lines) after optimization (opt.) with varying ∆ and τ = 0. Responses are normalized by the final steady-state value. (D) Correlation (corr.) index after optimization with varying ∆ and C. This index measures the correlation between responses at adjacent time steps normalized by the stimulus correlation at adjacent time steps (i.e., ⟨r_t r_{t+1}⟩/⟨r_t²⟩ divided by ⟨x_t x_{t+1}⟩/⟨x_t²⟩). Values greater/less than one indicate that neurons temporally correlate (red)/decorrelate (blue) their input. Filled circles show the parameter values used in C. (E and F) Same as C and D but with code optimized for τ ≫ 0. Plots in E correspond to responses to probe stimulus (dashed lines) at varying coding capacity and fixed decoding lag (i.e., ∆ = 3; indicated by dashed lines in F).
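The correlation index defined in the Fig. 2 caption can be written as a short function (assuming zero-mean response and stimulus traces; the function name is ours):

```python
import numpy as np

def correlation_index(r, x):
    """Lag-1 response correlation normalised by the lag-1 stimulus correlation
    (Fig. 2 caption). Values > 1 indicate temporal correlation (smoothing);
    values < 1 indicate decorrelation (whitening). Assumes zero-mean r and x."""
    resp = np.mean(r[:-1] * r[1:]) / np.mean(r**2)
    stim = np.mean(x[:-1] * x[1:]) / np.mean(x**2)
    return resp / stim
```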

With a "two-timescale" stimulus constructed from two Markov processes that vary over different timescales (Fig. 2B, Middle), the optimal solution was a low-pass filter that selectively encoded the predictive, slowly varying part of the stimulus. The strength of the low-pass filter increased monotonically with ∆ (Fig. 2 C and D and SI Appendix, Fig. S2A). Finally, with an "inertial" stimulus, with a future trajectory that depended on both the previous state, x_t, and velocity, x_t − x_{t−1} (Fig. 2B, Bottom), the optimal solution was a high-pass filter so as to encode information about velocity. The strength of this high-pass filter also increased monotonically with ∆ (Fig. 2 C and D and SI Appendix, Fig. S2A, Bottom).

With an instantaneous code, varying the coding capacity, C, only rescales responses (relative to the noise amplitude) so as to alter their signal-to-noise ratio. However, the response shape is left unchanged (regardless of the stimulus statistics) (Fig. 2D). In contrast, with temporally extended codes, where τ > 0 (so the stimulus at time t + ∆ is estimated from the integrated responses between time t − τ and t) (Fig. 2A), the optimal neural code varies with the coding capacity, C. As with previous efficient coding models, at high C (i.e., high signal-to-noise ratio), neurons always decorrelated their input, regardless of both the stimulus statistics and the decoding lag, ∆ (to achieve nonredundant responses) (SI Appendix, Efficient Coding Models), while decreasing C always led to more correlated responses (to achieve a robust code) (SI Appendix, Efficient Coding Models) (36). However, unlike in previous efficient coding models, at low to intermediate values of C (i.e., intermediate to low signal-to-noise ratio), the optimal code was qualitatively altered by varying the decoding lag, ∆. With the Markov stimulus, increasing ∆ had no effect; with the two-timescale stimulus, it led to low-pass filtering; and with the inertial stimulus, it led to stronger high-pass filtering.

Taken together, "phase diagrams" for optimal, temporally extended codes show how regimes of decorrelation/whitening (high-pass filtering) and of smoothing (low-pass filtering) are preferred depending on the coding capacity, C, and decoding lag, ∆.

We verified that a qualitatively similar transition from low- to high-pass filtering is also observed with higher-dimensional stimuli and/or more neurons. Importantly, we show that these phase diagrams depend in an essential way on the stimulus statistics already in the linear Gaussian case. We next examined what happens for non-Gaussian, high-dimensional stimuli.

Efficient Coding of Naturalistic Stimuli.

Natural stimuli exhibit a strongly non-Gaussian statistical structure, which is essential for human perception (22, 41). A large body of work has investigated how neurons could efficiently represent such stimuli by encoding their nonredundant or independent components (4). Under fairly general conditions (e.g., that stimuli have a sparse latent structure), this is equivalent to finding a sparse code: a form of neural population code in which only small fractions of neurons are active at any one time (22). For natural images, this leads to neurons that are selective for spatially localized image contours, with receptive fields (RFs) that are qualitatively similar to the RFs of V1 simple cells (25, 26). For natural movies, this leads to neurons selective for a particular motion direction, again similar to observations in area V1 (27). However, an independent (sparse) temporal code has only been shown to be optimal (i) when the goal is to maximize information about past inputs (i.e., ∆ < 0) and (ii) at low noise (i.e., at high capacity; C ≫ 0). We were interested, therefore, in what happens when these two criteria are violated: for example, when neural responses are optimized to encode predictive information (i.e., for ∆ ≥ 0).

To explore these questions, we modified the objective function of Eq. 3 to deal with multidimensional stimuli and the non-Gaussian statistics of natural images (SI Appendix, General Framework). Specifically, we generalized the second term of Eq. 3 to allow optimization of the neural code with respect to higher-order (i.e., beyond covariance) response statistics. This was done by approximating the response distribution p(r) by a Student t distribution, with shape parameter, ν, learned directly from data (SI Appendix, Eq. S5) (31).


Crucially, our modification permits—but does not enforce by hand—sparse neural responses (42). For nonsparse, Gaussian stimuli, the IB algorithm returns ν → ∞, so that the Student t distribution is equivalent to a Gaussian distribution, and we obtain the results of the previous section; for natural image sequences, it replicates previous sparse coding results in the limit ∆ < 0 and C ≫ 0 (SI Appendix, Fig. S5), without introducing any tunable parameters.

We investigated how the optimal neural code for naturalistic stimuli varied with the decoding lag, ∆, while keeping coding capacity, C, and code length, τ, constant. Stimuli were constructed from 10 × 10-pixel patches drifting stochastically across static natural images (Fig. 3A, SI Appendix, Methods for Simulations in the Main Text, and SI Appendix, Fig. S3). Gaussian white noise was added to these inputs (but not to the decoded variable, X_{t+∆}) (SI Appendix, Methods for Simulations in the Main Text). Neural encoding weights were optimized with two different decoding lags: for ∆ = −6, the goal was to encode past stimuli, while for ∆ = 1, the goal was to predict the near future. Fig. 3B confirms that the codes indeed are optimal for recovering either the past (∆ = −6) or the future (∆ = 1), as desired.

After optimization at both values of ∆, individual neurons were selective to local oriented edge features (Fig. 3 C and D) (25). Varying ∆ qualitatively altered the temporal features encoded by each neuron, while having little effect on their spatial selectivity. Consistent with previous results on sparse temporal coding (27), with ∆ = −6, single cells were responsive to stimuli moving in a preferred direction, as evidenced by spatially displaced encoding filters at different times (Fig. 3C and SI Appendix, Fig. S6 A–C) and a high "directionality index" (Fig. 3E). In contrast, with ∆ = 1, cells responded equally to stimuli moving in either direction perpendicular to their encoded stimulus orientation. This was evidenced by spatiotemporally separable RFs (SI Appendix, Fig. S6 D–F) and directionality indexes near zero.
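The role of the shape parameter ν can be illustrated by fitting a zero-mean Student t distribution to simulated response samples: heavy-tailed (sparse-like) responses yield a small ν, whereas Gaussian responses drive ν toward large values, recovering the Gaussian marginal term of Eq. 3. The snippet below uses an off-the-shelf maximum-likelihood fit rather than the variational update used in the paper; the data are synthetic stand-ins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
r_gauss = rng.normal(size=50_000)              # stand-in for non-sparse responses
r_sparse = rng.standard_t(df=3, size=50_000)   # stand-in for sparse, heavy-tailed responses

for label, r in [("gaussian", r_gauss), ("sparse", r_sparse)]:
    # fit shape (nu) and scale (omega) of a Student t with mean fixed at zero
    nu, loc, omega = stats.t.fit(r, floc=0.0)
    print(label, round(nu, 1), round(omega, 2))
```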

Fig. 3. Efficient coding of naturalistic stimuli. (A) Movies were constructed from a 10 × 10-pixel patch (red square), which drifted stochastically across static natural images. (B) Information encoded [i.e., reconstruction (recon.) quality] by neural responses about the stimulus at varying lag (i.e., reconstruction lag) after optimization with ∆ = −6 (blue) and ∆ = 1 (red). (C) Spatiotemporal encoding filters for four example neurons after optimization with ∆ = −6. (D) Same as C for ∆ = 1. (E) Directionality index of neural responses after optimization with ∆ = −6 and ∆ = 1. The directionality index measures the percentage change in response to a grating stimulus moving in a neuron's preferred direction vs. the same stimulus moving in the opposite direction.
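The directionality index reported in Fig. 3E can be formalized roughly as follows; the exact normalization is not stated in the caption, so the definition below (normalizing by the preferred-direction response) is an assumption.

```python
def directionality_index(resp_pref, resp_opp):
    """Percentage change in response between a grating drifting in the preferred
    direction and the same grating drifting in the opposite direction.
    resp_pref and resp_opp are mean response amplitudes; the normalisation by
    resp_pref is our assumption, not taken from the paper."""
    return 100.0 * (resp_pref - resp_opp) / resp_pref

print(directionality_index(1.0, 0.2))    # strongly direction selective -> 80
print(directionality_index(1.0, 0.98))   # direction agnostic -> ~2
```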


This qualitative difference between the two types of code for naturalistic movies was highly surprising, and we sought to understand its origins.

Tradeoff Between Sparsity and Predictive Power.

To gain an intuitive understanding of how the optimal code varies with decoding lag, ∆, we constructed artificial stimuli from overlapping Gaussian bumps, which drifted stochastically along a single spatial dimension (Fig. 4A and SI Appendix, Methods for Simulations in the Main Text; a minimal generator for this kind of stimulus is sketched at the end of this section). While simple, this stimulus captured two key aspects of the naturalistic movies. First, Gaussian bumps drifted smoothly in space, resembling stochastic global motion over the image patches; second, the stimulus had a sparse latent structure. We optimized the neural code with ∆ ranging from −2 to 2, holding the coding capacity, C, and code length, τ, constant. Fig. 4B confirms that the highest performance was achieved when the reconstruction was evaluated at the same lag for which each model was trained. This simpler setup recapitulated the surprising result that we obtained with naturalistic stimuli: namely, when ∆ < 0, neurons were selective to a single preferred motion direction, while when ∆ ≥ 0, neurons responded equally to stimuli moving from either direction to their RF (Fig. 4 C and D).

Predicting the future state of the stimulus requires estimating its current motion direction and speed. How is it possible, then, that optimizing the code for predictions (∆ > 0) results in neurons being unselective to motion direction? This paradox is resolved by realizing that it is the information encoded by the entire neural population that counts, not the information encoded by individual neurons. Indeed, when we looked at the information encoded by the neural population, we did find what we had originally expected: when optimized with ∆ > 0, the neural population as a whole encoded significantly more information about the stimulus velocity than its position (relative to when ∆ < 0), despite the fact that individual neurons were unselective to motion direction (Fig. 4 E and F).

The change in coding strategy that is observed as one goes from encoding the past (∆ < 0) to the future (∆ > 0) is in part due to a tradeoff between maintaining a sparse code and cells responding quickly to stimuli within their RF. Intuitively, to maintain highly selective (and thus sparse) responses, neurons first have to wait to process and recognize the "complete" stimulus feature; unavoidably, however, this entails a processing delay, which leads to poor predictions. This can be seen in Fig. 4 G and H, which shows how both the response sparsity and the delay to stimuli within a cell's RF decrease with ∆. In SI Appendix, Supplementary Simulations, we describe in detail why this tradeoff between efficiency and prediction leads to direction-selective filters when ∆ < 0 but not when ∆ > 0 (SI Appendix, Fig. S7).

Beyond the effects on the optimal code of the various factors explored in detail in this paper, our framework further generalizes previous efficient and sparse coding results to the factors listed in SI Appendix, Table S1 and discussed in SI Appendix, Supplementary Simulations. For example, decreasing the capacity, C (while holding ∆ constant at −2), resulted in neurons being unselective to stimulus motion (SI Appendix, Fig. S8A), with a similar result observed for increased input noise (SI Appendix, Fig. S8B). Thus, far from being generic, traditional sparse temporal coding, in which neurons responded to local motion, was only observed in a specific regime (i.e., ∆ < 0, C ≫ 0, and low input noise).
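For readers who want to reproduce the flavor of these simulations, here is the minimal drifting Gaussian-bump generator referred to above. All parameter values (patch size, number of bumps, drift statistics) are illustrative stand-ins; the exact generative process used in the paper is described in SI Appendix, Methods for Simulations in the Main Text.

```python
import numpy as np

def gaussian_bump_movie(n_frames=500, n_pix=32, n_bumps=2, width=2.0,
                        drift_std=0.5, seed=0):
    """Sketch of a drifting Gaussian-bump stimulus: a few bumps (sparse latent
    structure) drift as random walks on a ring (circular boundary conditions).
    Parameter values are illustrative, not those used in the paper."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0, n_pix, size=n_bumps)       # bump centres
    vel = rng.normal(0, drift_std, size=n_bumps)    # smooth drift velocities
    grid = np.arange(n_pix)
    movie = np.zeros((n_frames, n_pix))
    for t in range(n_frames):
        vel += rng.normal(0, 0.1, size=n_bumps)     # stochastic velocity jitter
        pos = (pos + vel) % n_pix
        d = np.abs(grid[None, :] - pos[:, None])
        d = np.minimum(d, n_pix - d)                # circular (ring) distance
        movie[t] = np.exp(-0.5 * (d / width) ** 2).sum(axis=0)
    return movie
```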
Discussion

Efficient coding has long been considered a central principle for understanding early sensory representations (1, 3), with well-understood implications and generalizations (23, 37).

Fig. 4. Efficient coding of a "Gaussian-bump" stimulus. (A) Stimuli (stim.) consisted of Gaussian bumps that drifted stochastically along a single spatial dimension (dim.) (with circular boundary conditions). (B) Information encoded by neural responses about the stimulus at varying lag, ∆_test, after optimization with varying ∆_train. Black dots indicate the maximum for each column. (C) Response of an example neuron to a test stimulus (Upper) after optimization with ∆ = −2 (blue), ∆ = 0 (green), and ∆ = 2 (red) (Lower). (D) Spatiotemporal encoding filters for an example neuron after optimization with different ∆. (E) Circular correlation between the reconstructed speed of a moving Gaussian blob and its true speed vs. the circular correlation between the reconstructed position and its true position, obtained from neural responses optimized with ∆ = ±2 (red and blue curves). Curves were obtained by varying γ in Eq. 3 to find codes with different coding capacities. (F) Linear reconstruction of the stimulus trajectory obtained from neural responses optimized with ∆ = ±2 (red and blue curves). The full stimulus is shown in grayscale. While coding capacity was chosen to equalize the mean reconstruction error for both models (vertical dashed line in E), the reconstructed trajectory was much smoother after optimization with ∆ = 2 than with ∆ = −2. (G) Response sparsity (defined as the negentropy of neural responses) vs. ∆ (dots indicate individual neurons; the line indicates population average). (H) Delay between stimulus presented at a neuron's preferred location and each neuron's maximum response vs. ∆.
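Fig. 4G quantifies response sparsity as the negentropy of neural responses. One standard way to estimate negentropy from samples (not necessarily the estimator used by the authors) is to subtract a histogram-based entropy estimate from the entropy of a Gaussian with matched variance:

```python
import numpy as np

def negentropy(r, n_bins=100):
    """Plug-in negentropy estimate: entropy of a variance-matched Gaussian minus a
    histogram estimate of the response entropy. Larger values indicate sparser
    (more non-Gaussian) responses. The binning and estimator are our assumptions."""
    r = np.asarray(r) - np.mean(r)
    h_gauss = 0.5 * np.log(2 * np.pi * np.e * np.var(r))
    density, edges = np.histogram(r, bins=n_bins, density=True)
    p = density * np.diff(edges)                  # per-bin probabilities
    nz = p > 0
    h_emp = -np.sum(p[nz] * np.log(density[nz]))  # differential entropy estimate
    return h_gauss - h_emp
```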

It has been successful in predicting many aspects of neural responses in early sensory areas directly from the low-order statistics of natural stimuli (7, 22, 32, 43, 44) and has even been extended to higher-order statistics and central processing (45, 46). However, a criticism of the standard theory is that it treats all sensory information as equal, despite empirical evidence that neural systems prioritize behaviorally relevant (and not just statistically likely) stimuli (47).

To overcome this limitation, Bialek and coworkers (14, 15) proposed a modification to the standard efficient coding theory, positing that neural systems are set up to efficiently encode information about the future given fixed information about the past. This is motivated by the fact that stimuli are only useful for performing actions when they are predictive about the future. The implications of such a coding objective have remained relatively unexplored. Existing work only considered the highly restrictive scenario where neurons maximize information encoded in their instantaneous responses (15, 38, 40). In this case (and subject to some additional assumptions, such as Gaussian stimulus statistics and instantaneous encoding filters), predictive coding is formally equivalent to slow feature analysis (39). This is the exact opposite of standard efficient coding models, which (at low noise/high capacity) predict that neurons should temporally decorrelate their inputs (3, 33).

We developed a framework to clarify the relation between different versions of the efficient coding theory (14, 30, 31). We investigated what happens when the neural code is optimized to efficiently predict the future (i.e., ∆ > 0 and τ > 0) (Fig. 1B). In this case, the optimal code depends critically on the coding capacity (i.e., signal-to-noise ratio), which describes how much information the neurons can encode about their input. At high capacity (i.e., low noise), neurons always temporally decorrelate their input. At finite capacity (i.e., mid to high noise), however, the optimal neural code varies qualitatively depending on whether the goal is to efficiently predict the future or reconstruct the past.

When we investigated efficient coding of naturalistic stimuli, we found solutions that are qualitatively different from known sparse coding results, in which individual neurons are tuned to directional motion of local edge features (27). In contrast, we found that neurons optimized to encode the future are selective for motion speed but not direction (Fig. 3 and SI Appendix, Fig. S6). Surprisingly, however, the neural population as a whole encodes motion even more accurately in this case (Fig. 4E). We show that these changes are due to an implicit tradeoff between maintaining a sparse code and responding quickly to stimuli within each cell's RF (Fig. 4 G and H).

It is notable that, in our simulations, strikingly different conclusions are reached by analyzing single-neuron responses vs. the population responses. Specifically, looking only at single-neuron responses would lead one to conclude that, when optimized for predictions, neurons did not encode motion direction; looking at the neural population responses reveals that the opposite is true. This illustrates the importance of population-level analyses of neural data and how, in many cases, single-neuron responses can give a false impression of which information is represented by the population.

A major challenge in sensory neuroscience is to derive the observed cell-type diversity in sensory areas from a normative theory. For example, in visual area V1, one observes a range of different cell types, some of which have spatiotemporally separable RFs and others do not (48, 49). The question arises, therefore, whether the difference between cell types emerges because different subnetworks fulfill qualitatively different functional goals. One hypothesis, suggested by our work, is that cells with separable RFs have evolved to efficiently encode the future, while cells with nonseparable RFs evolved to efficiently encode the past. More generally, the same hypothesis could explain the existence of multiple cell types in the mammalian retina, with each cell type implementing an optimal code for a particular choice of optimization parameters (e.g., coding capacity or prediction lag). Testing such hypotheses rigorously against quantitative data would require us to generalize our work to nonlinear encoding and decoding models (SI Appendix, Table S1).


Here, we focused on a linear decoder to lay a solid theoretical foundation and permit direct comparison with previous sparse and robust coding models, which also assumed a linear decoder (25–27, 35, 36). In addition, a linear decoder forces our algorithm to find a neural code for which information can be easily extracted by downstream neurons performing biologically plausible operations. While the linearity assumptions simplify our analysis, the framework can easily accommodate nonlinear encoding and decoding. For example, we previously used a "kernel" encoding model, where neural responses are described by a nonparametric and nonlinear function of the input (31). Others have similarly used a deep convolutional neural network as an encoder (50).

As mentioned earlier, predictive coding has been used to describe several different approaches.

Clarifying the relationship between inequivalent definitions of predictive coding and linking them mathematically to coding efficiency provided one of the initial motivations for our work. In past work, alternative coding theories are often expressed using very different mathematical frameworks, impeding comparison between them and sometimes leading to confusion. In contrast, by using a single mathematical framework to compare different theories—efficient, sparse, and predictive coding—we were able to see exactly how they relate to each other, the circumstances under which they make opposing or similar predictions, and what happens when they are combined.

1. Attneave F (1954) Some informational aspects of visual perception. Psychol Rev 61:183–193.
2. Linsker R (1988) Self-organization in a perceptual network. IEEE Computer 21:105–117.
3. Barlow HB (1961) Possible principles underlying the transformation of sensory messages. Sensory Communication, ed Rosenblith WA (MIT Press, Cambridge, MA), pp 217–234.
4. Simoncelli EP, Olshausen BA (2001) Natural image statistics and neural representation. Ann Rev Neurosci 24:1193–1216.
5. Tkačik G, Bialek W (2016) Information processing in living systems. Ann Rev Condens Matter Phys 7:89–117.
6. Bialek W (2012) Biophysics: Searching for Principles (Princeton Univ Press, Princeton), pp 353–468.
7. Atick JJ, Redlich AN (1992) What does the retina know about natural scenes? Neural Comput 4:196–210.
8. Field DJ (1987) Relations between the statistics of natural images and the response properties of cortical cells. J Opt Soc Am A 4:2379–2394.
9. Kersten D (1987) Predictability and redundancy of natural images. J Opt Soc Am A 4:2395–2400.
10. Rao RP, Ballard DH (1999) Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat Neurosci 2:79–87.
11. Boerlin M, Deneve S (2011) Predictive coding of dynamical variables in balanced spiking networks. PLoS Comput Biol 7:e1001080.
12. Srinivasan MV, Laughlin SB, Dubs A (1982) Predictive coding: A fresh view of inhibition in the retina. Proc R Soc B 216:427–459.
13. Druckmann S, Hu T, Chklovskii DB (2017) A mechanistic model of early sensory processing based on subtracting sparse representations. Adv Neural Inf Process Syst 25:1979–1987.
14. Bialek W, De Ruyter Van Steveninck R, Tishby N (2006) Efficient representation as a design principle for neural coding and computation. Proceedings of the IEEE International Symposium on Information Theory, pp 659–663. Available at ieeexplore.ieee.org/abstract/document/4036045/. Accessed December 7, 2017.
15. Palmer SE, Marre O, Berry MJ II, Bialek W (2015) Predictive information in a sensory population. Proc Natl Acad Sci USA 112:6908–6913.
16. Salisbury J, Palmer S (2016) Optimal prediction in the retina and natural motion statistics. J Stat Phys 162:1309–1323.
17. Heeger DJ (2017) Theory of cortical function. Proc Natl Acad Sci USA 114:1773–1782.
18. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14:481–487.
19. Barlow HB (1972) Single units and sensation: A neuron doctrine for perceptual psychology? Perception 1:371–394.
20. Baum EB, Moody J, Wilczek F (1988) Internal representations for associative memory. Biol Cybern 59:217–228.
21. Field DJ (1994) What is the goal of sensory coding? Neural Comput 6:559–601.
22. Hyvärinen A, Hurri J, Hoyer PO (2009) Natural Image Statistics (Springer, Berlin).
23. Smith EC, Lewicki MS (2006) Efficient coding of natural sounds. Nature 439:978–982.
24. Theunissen FE (2003) From synchrony to sparseness. Trends Neurosci 26:61–64.
25. Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381:607–609.
26. Bell AJ, Sejnowski TJ (1997) The "independent components" of natural scenes are edge filters. Vis Res 37:3327–3338.
27. van Hateren JH, van der Schaaf A (1998) Independent component filters of natural images compared with simple cells in primary visual cortex. Proc Biol Sci 265:359–366.
28. van Hateren JH, Ruderman DL (1998) Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proc R Soc Lond B Biol Sci 265:2315–2320.
29. Olshausen BA (2002) Sparse codes and spikes. Probabilistic Models of the Brain: Perception and Neural Function, eds Rao RPN, Olshausen BA, Lewicki MS (MIT Press, Cambridge, MA), pp 257–272.
30. Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. arXiv:physics/0004057.
31. Chalk M, Marre O, Tkačik G (2016) Relevant sparse codes with variational information bottleneck. Adv Neural Inf Process Syst 29:1957–1965.
32. Chechik G, Globerson A, Tishby N, Weiss Y (2005) Information bottleneck for Gaussian variables. J Mach Learn Res 6:165–188.
33. Dan Y, Atick JJ, Reid RC (1996) Efficient coding of natural scenes in the lateral geniculate nucleus: Experimental test of a computational theory. J Neurosci 16:3351–3362.
34. Karklin Y, Simoncelli EP (2011) Efficient coding of natural images with a population of noisy linear-nonlinear neurons. Adv Neural Inf Process Syst 24:999–1007.
35. Doi E, Lewicki MS (2005) Sparse coding of natural images using an overcomplete set of limited capacity units. Adv Neural Inf Process Syst 17:377–384.
36. Doi E, Lewicki MS (2014) A simple model of optimal coding for sensory systems. PLoS Comput Biol 10:e1003761.
37. Tkačik G, Prentice JS, Balasubramanian V, Schneidman E (2010) Optimal population coding by noisy spiking neurons. Proc Natl Acad Sci USA 107:14419–14424.
38. Creutzig F, Sprekeler H (2008) Predictive coding and the slowness principle: An information-theoretic approach. Neural Comput 20:1026–1041.
39. Berkes P, Wiskott L (2005) Slow feature analysis yields a rich repertoire of complex cell properties. J Vis 5:579–602.
40. Buesing L, Maass W (2010) A spiking neuron as information bottleneck. Neural Comput 22:1961–1992.
41. Oppenheim AV, Lim JS (1981) The importance of phase in signals. Proc IEEE 69:529–541.
42. Olshausen BA, Millman KJ (2000) Learning sparse codes with a mixture-of-Gaussians prior. Advances in Neural Information Processing Systems 12, eds Solla SA, Leen TK, Müller KR (MIT Press, Cambridge, MA), pp 841–847.
43. Doi E, et al. (2012) Efficient coding of spatial information in the primate retina. J Neurosci 32:16256–16264.
44. Balasubramanian V, Sterling P (2009) Receptive fields and functional architecture in the retina. J Physiol 587:2753–2767.
45. Tkačik G, Prentice JS, Victor JD, Balasubramanian V (2010) Local statistics in natural scenes predict the saliency of synthetic textures. Proc Natl Acad Sci USA 107:18149–18154.
46. Hermundstad AM, et al. (2014) Variance predicts salience in central sensory processing. eLife, 10.7554/eLife.03722.
47. Machens CK, Gollisch T, Kolesnikova O, Herz AVM (2005) Testing the efficiency of sensory coding with optimal stimulus ensembles. Neuron 47:447–456.
48. DeAngelis GC, Ohzawa I, Freeman R (1995) Receptive-field dynamics in the central visual pathways. Trends Neurosci 18:451–458.
49. Priebe NJ, Lisberger SG, Movshon JA (2006) Tuning for spatiotemporal frequency and speed in directionally selective neurons of macaque striate cortex. J Neurosci 26:2941–2950.
50. Alemi A, Fischer I, Dillon JV, Murphy K (2016) Deep variational information bottleneck. arXiv:1612.00410.


ACKNOWLEDGMENTS. This work was supported by Agence Nationale de Recherche (ANR) Trajectory, the French State program Investissements d'Avenir managed by the ANR [LIFESENSES: ANR-10-LABX-65], a European Commission Grant (FP7-604102), NIH Grant U01NS090501, and an AVIESAN-UNADEV Grant (to O.M.), and Austrian Science Fund Grant FWF P25651 (to G.T.).


Towards a unified theory of efficient, predictive and sparse coding: Supplementary information
Matthew Chalk, Olivier Marre, Gašper Tkačik

1 Efficient coding models

The efficient coding hypothesis posits that sensory systems have evolved to transmit maximal information about incoming sensory signals, given internal resource constraints (such as internal noise, and/or metabolic cost) [1, 2, 3, 4, 5, 6, 7]. It has been successful in predicting a host of different neural response properties from first principles. Nonetheless, there is often confusion in the literature, due to the fact that different authors have made very different assumptions about: (i) what sensory information is relevant (and thus, should be encoded); (ii) the internal constraints (determining what information can be encoded). In the following we provide a brief (non-exhaustive) overview of the various types of efficient coding model that have been proposed (illustrated in SI Fig 1), so as to clarify the relation between them.

1.1 Redundancy reduction

Efficient coding models have usually assumed that the goal of sensory processing is to encode maximal information about all incoming signals, given internal constraints. In the low-noise limit, this implies that neurons should remove redundancies in their inputs, to achieve statistically independent responses [5, 6, 7, 8, 9]. Considering, for the moment, only second-order statistics, redundancy reduction implies that neurons should whiten incoming signals, to achieve decorrelated responses. Previous work showed that this can explain many aspects of low-level visual neuron responses, such as the centre-surround receptive fields (RFs) of neurons in the retina [3, 10], and the temporal filtering properties of neurons in the LGN [11].

More generally, to achieve independent responses, neurons must also remove higher-order (i.e. beyond covariance) statistical redundancies in their inputs. One way to do this is via 'sparse coding', where only a small proportion of neurons are active at any one time [12]. Indeed, given certain assumptions about the statistical structure of sensory signals (i.e. that they are generated by linearly combining a set of independent, sparsely distributed features), maximising the sparsity of neural responses is equivalent to maximising their independence [13, 14, 15]. In a seminal paper, Olshausen & Field showed that learning a sparse code of natural images results in local orientated filters that closely resemble the RFs of V1 simple cells [16]. Since then, sparse coding has been used to model several other aspects of low-level visual neuron responses [17], in addition to coding by auditory [18] and olfactory [19] neurons.

It has been proposed that statistically independent responses could be achieved if sensory neurons encode a 'prediction error', equal to the difference between their input and an internal prediction generated by the network [10, 20]. This could be implemented in a hierarchical network, with feed-forward signals transmitting an error signal and feed-back signals transmitting a prediction [20]. Alternatively, recent works have shown how predictive coding could be implemented within a single, densely connected recurrent spiking network [21, 22].
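At the level of second-order statistics, the redundancy reduction described above amounts to whitening. A minimal sketch (symmetric/ZCA whitening of a zero-mean stimulus ensemble; an illustration, not taken from any of the cited models):

```python
import numpy as np

def whitening_filter(X):
    """Symmetric (ZCA) whitening: a linear transform W such that the responses
    R = X @ W.T are decorrelated with unit variance (noise-free, second-order
    redundancy reduction). X is a (samples x dimensions) array of zero-mean stimuli."""
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

# Example: decorrelating a correlated 2-d Gaussian stimulus.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=20_000)
W = whitening_filter(X)
print(np.round(np.cov(X @ W.T, rowvar=False), 2))   # approximately the identity
```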

[SI Figure 1 schematic: the different model classes are grouped according to noise level, stimulus statistics (sparse or not), and whether responses are integrated over time; boxes include redundancy reduction [5-9], predictive coding [10, 20-22], robust coding [23-28], sparse coding/ICA [12-19], robust sparse coding [24-26], coding of relevant/predictive information [33-39], and slow feature analysis [37-39].]

Figure 1: Schematic of various types of efficient coding model, with corresponding references.

1.2 Robust coding

The mutual information between responses, R, and stimulus, X, can be expressed as I(R; X) = H(R) − H(R|X), where H(R) is the response entropy and H(R|X) is the noise entropy. At low noise, where the second, noise entropy term is negligible, information is maximised by maximising the response entropy, H(R) (via redundancy reduction). At higher noise, it becomes important to minimise the noise entropy, H(R|X), leading to qualitatively different predicted neural responses [4]. In a seminal paper, Atick & Redlich showed that varying the signal-to-noise ratio leads to a qualitative change in the predicted neural code, with neurons whitening their inputs at low noise (to minimise redundancy) and smoothing their inputs at high noise (to average out the noise) [23]. Interestingly, their model was able to explain how the shape of retinal ganglion cell (RGC) RFs varies with visual contrast.

More recently, several authors proposed 'robust coding' models detailing how 'sensory noise' (added to the sensory input) and 'neural noise' (added to the neural responses) alter the predicted neural responses [24, 25, 26, 27]. These models were able to account for various further aspects of RGC responses, including how the shape and overlap of RGC RFs vary with visual eccentricity. Further, they showed how sparse coding varies with the amplitude of neural and sensory noise. For example, Karklin & Simoncelli [26] showed that at low noise enforcing sparsity leads to local orientated spatial filters (as in [16]), while at high noise it leads to circularly symmetric spatial filters. Tkacik et al. studied how the recurrent connectivity of an efficient spiking network should vary with the signal-to-noise ratio [28]. Interestingly, they found that at low signal-to-noise, information maximisation predicts an attractor-like structure of the neural code. While highly redundant, this type of code allows the network to mitigate the effects of noise.

Given time-varying stimulus statistics, the speed at which neurons can adapt to efficiently encode their inputs is limited by the need to collect new statistics. Interestingly, this was observed for motion-sensitive (H1) neurons in the fly visual system, which not only adapt to efficiently encode new input statistics [29], but whose speed of adaptation approaches the physical limits imposed by statistical sampling and noise [30].
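The whitening-versus-smoothing transition described by Atick & Redlich can be reproduced in a toy setting: two correlated Gaussian pixels, corrupted by input noise, encoded by a linear filter under a fixed output-power budget. The parametrisation below (splitting power between the stimulus 'sum' and 'difference' channels) and all numerical values are our own illustrative choices, not the original model.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def info_two_pixel(theta, sigma_in, rho=0.8, power=10.0):
    """I(R;X) for a 2-pixel Gaussian stimulus (pixel correlation rho) encoded by a
    symmetric linear filter, with input noise of variance sigma_in**2, unit output
    noise, and fixed total response power. theta splits the power between the
    'sum' (smoothing) and 'difference' (whitening) eigen-channels of the stimulus."""
    lam = np.array([1 + rho, 1 - rho])                 # stimulus eigen-variances
    frac = np.array([np.cos(theta)**2, np.sin(theta)**2])
    w2 = power * frac / (lam + sigma_in**2)            # squared filter gain per channel
    signal_plus_noise = w2 * (lam + sigma_in**2) + 1.0
    noise_only = w2 * sigma_in**2 + 1.0
    return 0.5 * np.sum(np.log(signal_plus_noise / noise_only))

for sigma_in in (0.1, 2.0):
    res = minimize_scalar(lambda th: -info_two_pixel(th, sigma_in),
                          bounds=(0.0, np.pi / 2), method="bounded")
    frac_diff = np.sin(res.x)**2
    print(f"input noise {sigma_in}: power fraction on difference channel = {frac_diff:.2f}")
```

At low input noise the optimum splits power nearly equally across the two channels (a whitened, decorrelated output); at high input noise it shifts power onto the high-variance 'sum' channel (smoothing).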

1.3 Coding relevant information

An alternative hypothesis is that, rather than encoding all sensory signals, neural circuits preferentially encode behaviourally relevant signals. Indeed, Machens et al. found that grasshopper auditory neurons are optimised to efficiently encode behaviourally relevant vocalisation signals, rather than the sound signals most commonly found in their environment [31]. Further, in higher-level sensory areas, many neurons are specialised for encoding features that are behaviourally relevant (e.g. faces) [32], rather than features that are statistically likely (e.g. clouds). A difficulty here is that, except in special cases, it is hard to know which sensory information is relevant to an organism.


To overcome this, Bialek & colleagues proposed that a minimal criterion for a stimulus to be behaviourally relevant is that it can be used to predict what will happen in the future. This led them to hypothesise that sensory neural circuits are set up to encode maximal information about stimuli that are predictive about the future, given a constraint on the information encoded about previous inputs [33, 34, 35]. While intriguing, there is currently little theoretical work exploring the neural implications of this idea. Further, previous work only considered a highly restrictive scenario where neurons are assumed to encode information redundantly, via their instantaneous responses [35, 36, 37]. Interestingly, Creutzig et al. [37] showed that, in this case (and given some further assumptions, such as linear gaussian stimulus statistics), efficiently encoding the future is equivalent to 'slow feature analysis' (SFA), a method for extracting slowly varying components from quickly varying input signals, used previously to account for the response properties of complex cells in area V1 [38, 39].
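To make the connection to SFA concrete, here is a minimal linear SFA sketch (whiten the input, then take the direction whose temporal derivative has minimal variance). It implements only the slowness objective referred to above, not the predictive-coding/IB algorithm itself; names are ours.

```python
import numpy as np

def slow_feature(X):
    """Minimal linear slow feature analysis: whiten the input, then return the
    direction along which the temporal derivative has the smallest variance
    (the 'slowest' unit-variance linear feature). X is (time x dimensions), zero mean."""
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T   # whitening transform
    Z = X @ W.T
    dZ = np.diff(Z, axis=0)
    d_evals, d_evecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    v = d_evecs[:, 0]                                     # smallest-eigenvalue direction
    return v @ W                                          # filter in input space
```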

2 General framework

We consider a stimulus represented by the time series, {..., Y_{t−1}, Y_t}, which is corrupted by additive gaussian white noise to produce an input, {..., X_{t−1}, X_t}, received by a population of sensory neurons. We ask what is the optimal neural code, p(R_t | X_{−∞:t}), such that responses in a time window from t − τ to t encode maximal information about the stimulus between time t + ∆_1 and t + ∆_2, constrained on the total information encoded about previous inputs, up to time t. This can be achieved by maximising the following 'information bottleneck' (IB) objective function [40]:

L_{p(R_t | X_{−∞:t})} = I(R_{t−τ:t}; Y_{t+(∆_1:∆_2)}) − γ I(R_{t−τ:t}; X_{−∞:t})    (1)

The first term denotes the mutual information between R_{t−τ:t} and Y_{t+(∆_1:∆_2)}, to be maximised, and the second term denotes the mutual information between R_{t−τ:t} and X_{−∞:t}, to be constrained. A constant, γ, determines the strength of this constraint, and thus the tradeoff between coding fidelity and compression.

The above objective function is valid for modeling predictive coding, when ∆_1 > 0 & ∆_2 > 0. However, we wanted a framework that would: (i) give non-trivial solutions for all ∆_1 and ∆_2; (ii) allow comparison with previous efficient coding models. That the first criterion is not satisfied by the above objective function can be seen by setting [∆_1, ∆_2] = [−∞, 0] and Y = X, in which case the two terms of equation 1 are proportional, and the maximisation is unconstrained. To overcome this, we considered an alternative objective function:

L̃_{p(r_t | x_{−∞:t})} = I(R_{t−τ:t}; Y_{t+(∆_1:∆_2)}) − γτ I(R_t; X_{−∞:t})    (2)

where we replaced R_{t−τ:t} in the second, constraint term with the instantaneous response, R_t. If the responses at each time point are conditionally independent, this expression gives a lower bound on the previous IB objective function. Further, when [∆_1, ∆_2] = [−∞, 0] and X = Y, maximising L̃ is equivalent to minimising the temporal redundancy of neural responses (and exactly the same when γ = 1). Thus the objective function is equally applicable for modeling efficient coding of past inputs (∆_1 & ∆_2 < 0) and predictive coding of future inputs (∆_1 & ∆_2 > 0).

Finally, note that, while in general the decoding window could be of arbitrary length, to limit the number of free parameters in our analysis, we considered the case where ∆_1 = ∆_2, so that the decoding window is limited to a single time-bin of lag ∆. Setting X = Y (i.e. zero external noise) gives the objective function shown in equation 1 in the main text. After performing these simplifications, the objective function can be expanded as follows:

L_{p(r_t | x_{−∞:t})} = ⟨ log p(y_{t+∆} | r_{t−τ:t}) + γτ log p(r_t) − γτ log p(r_t | x_{−∞:t}) ⟩_{p(r,x,y)}    (3)

where, for notational simplicity, we have omitted the constant stimulus entropy term. Unfortunately, in many cases, this objective function cannot be computed tractably. Instead, we can compute an approximate lower bound, L̃ < L, that can be evaluated tractably. To do this, we replace the distributions p(y_{t+∆} | r_{t−τ:t}) and p(r_t) with approximate distributions, q(y_{t+∆} | r_{t−τ:t}) ∈ Q_{Y|R} and q(r_t) ∈ Q_R (where Q_{Y|R} and Q_R denote parametric families of distributions for which the expectations can be computed tractably) [41]. We then maximise the resulting lower bound, L̃, via alternate updates on p(r_t | x_{−∞:t}), q(r_t) and q(y_{t+∆} | r_{t−τ:t}).
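The first term of the resulting bound can be estimated from samples by fitting the approximate decoder and evaluating its log-likelihood. A sketch for a linear-gaussian choice of Q_{Y|R} (function and variable names are ours, and the least-squares fit stands in for the closed-form update given below):

```python
import numpy as np

def decoder_bound_term(R_lagged, Y_future):
    """Monte-Carlo estimate of <log q(y_{t+delta} | r_{t-tau:t})> for a
    linear-gaussian decoder q fitted by least squares.
    R_lagged: (samples x tau*Nr) stacked lagged responses.
    Y_future: (samples x Ny) target stimuli at lag delta."""
    Y = np.asarray(Y_future).reshape(len(Y_future), -1)
    U, *_ = np.linalg.lstsq(R_lagged, Y, rcond=None)     # decoding weights
    resid = Y - R_lagged @ U
    Lam = np.atleast_2d(np.cov(resid, rowvar=False))     # error covariance
    inv = np.linalg.inv(Lam)
    quad = np.einsum('ti,ij,tj->t', resid, inv, resid)
    logdet = np.linalg.slogdet(2 * np.pi * Lam)[1]
    return np.mean(-0.5 * (quad + logdet))
```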

2.1 Model description

We considered a linear encoding model, with neural responses sampled from a multivariate gaussian, N(r_t | μ_t, Σ). The mean response, μ_t, was obtained by linearly filtering the stimulus, μ_t = Σ_{k=1}^{τ_w} W_k x_{t−k+1}, where W_k is an N_r × N_x matrix denoting the spatial encoding filter at lag k. Σ is an N_r × N_r symmetric noise covariance matrix. N_r and N_x denote the number of neurons and stimulus dimensions, respectively.

As described above, to formulate a tractable lower bound for the IB objective function, we had to approximate the decoding distribution, p(y_{t+∆} | r_{t−τ:t}), and the response distribution, p(r_t). We approximated the decoding distribution with a linear gaussian model, q(y_{t+∆} | r_{t−τ:t}) = N(y_{t+∆} | ŷ_{t+∆}, Λ), with mean, ŷ_{t+∆}, obtained by linearly filtering the responses according to ŷ_{t+∆} = Σ_{k=1}^{τ} U_k r_{t−k+1}, where U_k is an N_x × N_r matrix denoting the decoding filter at lag k. Λ is an N_x × N_x symmetric error covariance matrix.

Previous work has shown that efficient coding of natural stimuli can be achieved via a 'sparse' code, where individual neurons are selective for rarely occurring (i.e. sparse) stimulus features. To allow for sparse coding solutions in our framework, we approximated the response distribution p(r_t) using a student-t distribution, q(r_t) = Π_{i=1}^{N_r} Student(r_{i,t} | 0, ω_i², ν_i), where r_{i,t} denotes the response of the ith neuron at time t, and ω_i² and ν_i are the scale and shape parameters of the student-t distribution, respectively. For the initial simulations with gaussian stimulus statistics, shown in fig. 2, we considered the limit where ν_i → ∞ (i.e. where q(r_t) is gaussian). In later simulations, shape parameters for each neuron, ν_i, were learned from data.
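For concreteness, responses can be drawn from this encoding model as follows (a bare-bones sketch; array shapes and names are our own conventions):

```python
import numpy as np

def sample_responses(X, W, Sigma, seed=0):
    """Draw responses from the linear-gaussian encoding model described above:
    mu_t = sum_k W_k x_{t-k+1}, r_t ~ N(mu_t, Sigma).
    X is (time x Nx); W is (tau_w x Nr x Nx), with W[k] the filter at lag k;
    Sigma is the (Nr x Nr) noise covariance."""
    rng = np.random.default_rng(seed)
    tau_w, Nr, _ = W.shape
    T = X.shape[0]
    R = np.zeros((T, Nr))
    for t in range(tau_w - 1, T):
        window = X[t - tau_w + 1:t + 1][::-1]    # x_t, x_{t-1}, ..., x_{t-tau_w+1}
        mu = np.einsum('kij,kj->i', W, window)
        R[t] = rng.multivariate_normal(mu, Sigma)
    return R
```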

2.2 Optimisation algorithm

Parameters of the encoding distribution (W & Σ), decoding distribution (U & Λ) and response distribution (ω & ν) were learned using a variational IB algorithm, as described in [41]. First we initialise the parameters of the encoding distribution W , and Σ. Next we perform recursive updates of the decoding distribution (U & Λ) and response distribution (ω & ν) parameters, followed by updates of the encoding distribution (W & Σ). This sequence is repeated until the parameters converge. As a full derivation is given in [41], here we restrict ourselves to describing how each of the model parameters are updated on each iteration of the algorithm. Decoding distribution. On each iteration, the parameters of the decoding distribution, q (yt+∆ |rt−τ :t ) = N (yt+∆t |U rt−τ :t , Λ) are updated according to: −1 U ← Cyt+∆ r Crr

where U is anNy × * rt  .. trix, Crr =  . rt−τ hyt+∆ (rt , . . . rt−τ )i.

Λ ← Cyt+∆ yt+∆ − Cyt+∆ r U T

(4)

τNr matrix, U = (U0 , · · · , Uτ −1 ). Crr is an [τ Nr × τ Nr ] covariance ma+   (rt , . . . rt−τ ) and Cyt+∆ r is an [τ Ny × τ Nr ] covariance matrix, Cyt+∆ r =

Response distribution. As stated the marginal response distribution by  QNr earlier, we approximated a student-t distribution, q(rt ) = i=1 Student ri,t |0, ωi2 , νi , with shape and scale parameters for each

4

neuron νi , and ωi , respectively. Substituting q(rt ) into the second term of equation 3, gives: !+ * Nr 2 X ri,t 1 νi + 1 − log ωi2 + f (νi ), hlog q (rt )i = − log 1 + 2 ν 2 ω 2 i i i=1

(5)

  −log Γ ν2 . Parameters, νi , and where the summation is taken over Nr neurons, and f (νi ) = log Γ ν+1 2 ωi , are updated on each iteration, to maximise gaussian

2 stimuli,

2hlog q (rt )i. Note that, for non-sparse, + const., the IB algorithm returns νi → ∞ and ωi2 = ri,t , in which case hlog q (ri,t )i = − 12 log ri,t as in equation 3 of the main text. Unfortunately, the expectation shown above cannot be evaluated in closed form. Instead, we use a variational approximation to construct a lower bound that can be tractably maximised. Following this procedure, as detailed in [41], the scale parameter is updated on each iteration according to,

ω_i² ← ⟨ ξ_{ti} r_{ti}² ⟩,        (6)

where ⟨r_{ti}²⟩ denotes the mean squared response of the ith neuron at time t, and ξ_{ti} is an additional variational parameter, updated on each trial according to ξ_{ti} = (ν_i + 1)/(ν_i + ⟨r_{ti}²⟩/ω_i²). (Note that in the limit where ν_i → ∞, so that the response distribution is near gaussian, ξ_{ti} = 1 and ω_i² = ⟨r_{ti}²⟩.) The shape parameter, ν_i, is found numerically on each iteration by solving:

ψ(ν_i/2) − log(ν_i/2) = 1 + ψ(a_i) − log a_i + ⟨ log ξ_{ti} − ξ_{ti} ⟩,        (7)

where ψ(·) is the digamma function, and a_i = (ν_i^{old} + 1)/2.
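To make these updates concrete, the sketch below implements equations (6) and (7) with NumPy/SciPy, assuming the expected squared responses ⟨r_{ti}²⟩ are supplied as an array. The bracketing interval for the root solve and the large-ν fallback are assumptions, not part of the original algorithm.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_student_t_params(r2, omega2, nu):
    """One update of the Student-t scale/shape parameters, eqs. (6)-(7) (sketch).

    r2     : (T, Nr) expected squared responses <r_{ti}^2>.
    omega2 : (Nr,) current scale parameters.
    nu     : (Nr,) current shape parameters (float array).
    """
    # Variational weights: xi_{ti} = (nu_i + 1) / (nu_i + <r_{ti}^2> / omega_i^2).
    xi = (nu + 1.0) / (nu + r2 / omega2)

    # Scale update, eq. (6): omega_i^2 <- < xi_{ti} r_{ti}^2 >  (average over t).
    omega2_new = np.mean(xi * r2, axis=0)

    # Shape update, eq. (7): solve psi(nu/2) - log(nu/2) = rhs numerically.
    a = 0.5 * (nu + 1.0)
    rhs = 1.0 + digamma(a) - np.log(a) + np.mean(np.log(xi) - xi, axis=0)
    nu_new = np.empty_like(nu, dtype=float)
    for i in range(len(nu)):
        f = lambda v, c=rhs[i]: digamma(v / 2.0) - np.log(v / 2.0) - c
        hi = 1e6                       # assumed upper bracket
        if f(hi) < 0:                  # root beyond bracket: effectively gaussian
            nu_new[i] = hi
        else:
            nu_new[i] = brentq(f, 1e-3, hi)
    return omega2_new, nu_new, xi
```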

Encoding distribution. The encoding distribution is described by p(r_t | x_{t−τ_w:t}) = N(r_t | W x_{t−τ_w:t}, Σ). On each trial, the noise covariance is updated according to:

Σ^{−1} ← (1/γ) Σ_{k=0}^{τ−1} U_k^T Λ^{−1} U_k + Ω^{−1} ⟨Ξ_t⟩,        (8)

where U_k denotes the decoding filter at lag k, and Ω and Ξ_t are N_r × N_r diagonal covariance matrices with diagonal elements Ω_{ii} = ω_i² and (Ξ_t)_{ii} = ξ_{ti}, respectively. The update for W is given by:

w ← (H_f + γ H_p)^{−1} b,        (9)

where w is an [N_x τ_w N_r × 1] vector, obtained by stacking the transposed encoding filters, w = vec(W_1^T, . . . , W_{τ_w}^T).

To express H_p and H_f, we start by defining the time series (· · · , z_{t−1}, z_t), where z_t = (x_t, x_{t−1}, . . . , x_{t−τ_w+1})^T. H_p is then defined as an [N_x τ_w N_r × N_x τ_w N_r] block-diagonal matrix,

H_p = diag( (H_p)_{11}, (H_p)_{22}, . . . ),    where (H_p)_{ii} = (1/ω_i²) ⟨ ξ_{i,t} z_t z_t^T ⟩.

H_f is an [N_x τ_w N_r × N_x τ_w N_r] square matrix, composed of blocks (H_f)_{ij}, where

(H_f)_{ij} = Σ_{k=1}^{τ} Σ_{m=1}^{τ} u_{ki}^T Λ^{−1} u_{mj} ⟨ z_{t−k+1} z_{t−m+1}^T ⟩,

and u_{ki} is the ith column of the [N_y × N_r] matrix U_k.


Figure 2: Dependence of encoding filters on decoding lag, ∆, code length, τ, and coding capacity, C. (a) Encoding filters optimised with varying ∆ and τ = 0. Encoding filters are normalised to have the same value at lag 0. (b) Same as (a), but with filters optimised with τ ≫ 0. Plots correspond to filters at varying coding capacity, C, & fixed decoding lag (∆ = 3).

Finally, b is an [N_x τ_w N_r × 1] vector, defined by b = (b_1^T, b_2^T, . . .)^T, where

b_i = Σ_{k=1}^{τ} ⟨ z_{t−k+1} y_{t+∆}^T ⟩ Λ^{−1} u_{ki}.
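For illustration, the sketch below assembles H_p, H_f, and b from empirical averages and applies the updates in equations (8) and (9). The array shapes, the use of plain empirical means, and the block ordering of w are assumptions for concreteness, not a transcription of the original code.

```python
import numpy as np

def update_encoder(z_lagged, y_future, U, Lam, omega2, xi, gamma):
    """Encoder updates, eqs. (8)-(9) (sketch with assumed array shapes).

    z_lagged : (T, tau, D) array; z_lagged[t, k] holds z_{t-k}  (D = Nx * tau_w).
    y_future : (T, Ny) array; row t holds y_{t+Delta}.
    U        : (tau, Ny, Nr) array of decoding filters U_k.
    Lam      : (Ny, Ny) decoder error covariance.
    omega2   : (Nr,) Student-t scale parameters.
    xi       : (T, Nr) variational weights xi_{ti}.
    gamma    : bottleneck trade-off parameter.
    """
    T, tau, D = z_lagged.shape
    Nr = omega2.shape[0]
    Lam_inv = np.linalg.inv(Lam)

    # Eq. (8): Sigma^{-1} <- (1/gamma) sum_k U_k^T Lam^{-1} U_k + Omega^{-1} <Xi_t>.
    Sigma_inv = sum(U[k].T @ Lam_inv @ U[k] for k in range(tau)) / gamma \
                + np.diag(xi.mean(axis=0) / omega2)

    # Lagged covariances <z_{t-k} z_{t-m}^T>, shape (tau, tau, D, D).
    Cz = np.einsum('tkd,tme->kmde', z_lagged, z_lagged) / T

    # H_p: block-diagonal, block i = (1/omega_i^2) <xi_{i,t} z_t z_t^T>.
    Hp = np.zeros((Nr * D, Nr * D))
    for i in range(Nr):
        block = np.einsum('t,td,te->de', xi[:, i], z_lagged[:, 0], z_lagged[:, 0]) / T
        Hp[i * D:(i + 1) * D, i * D:(i + 1) * D] = block / omega2[i]

    # H_f: block (i, j) = sum_{k,m} (u_{ki}^T Lam^{-1} u_{mj}) <z_{t-k} z_{t-m}^T>.
    Hf = np.zeros((Nr * D, Nr * D))
    for i in range(Nr):
        for j in range(Nr):
            coef = np.array([[U[k][:, i] @ Lam_inv @ U[m][:, j]
                              for m in range(tau)] for k in range(tau)])
            Hf[i * D:(i + 1) * D, j * D:(j + 1) * D] = np.einsum('km,kmde->de', coef, Cz)

    # b: block i = sum_k <z_{t-k} y_{t+Delta}^T> Lam^{-1} u_{ki}.
    b = np.zeros(Nr * D)
    for i in range(Nr):
        b[i * D:(i + 1) * D] = sum(
            (z_lagged[:, k].T @ y_future / T) @ Lam_inv @ U[k][:, i] for k in range(tau))

    # Eq. (9): w <- (H_f + gamma * H_p)^{-1} b.
    w = np.linalg.solve(Hf + gamma * Hp, b)
    # Each row collects the filter coefficients for one neuron (assumed block ordering).
    return Sigma_inv, w.reshape(Nr, D)
```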

3 Methods for simulations in the main text

3.1 Neural coding of 1-d gaussian time series

For the initial simulations, shown in figure 2 in the main text, we considered three different 1-d time series. Stimulus 1 ('markov stim.') was generated from an AR1 process that evolved in time according to the recurrence relation x_t = a x_{t−1} + b η_t, where η_t is drawn from a standard normal distribution, and a = 0.89 and b = 0.48. Stimulus 2 ('two timescales') was constructed by summing two AR1 series according to x_t = ρ x_t^{slow} + √(1 − ρ²) x_t^{fast}, with ρ = 0.47. x^{slow} and x^{fast} were both generated from an AR1 process, with parameters a = 0.97, b = 0.23, and a = 0.67, b = 0.73, respectively. Stimulus 3 ('inertial') was generated from an AR2 process; the stimulus at time t was given by x_t = a x_{t−1} + b x_{t−2} + c η_t, where a = 1.65, b = −0.68 & c = 0.13. In all cases, parameters were chosen such that the stimulus had unit variance (and zero mean). The autocovariance of each stimulus, used to optimise neural responses, was computed analytically for each set of stimulus statistics. We added zero noise to the inputs (i.e. X = Y). Neural responses were obtained by linearly filtering the stimulus, as described in the main text (with encoding filters of length 60). We optimised the encoding filters by maximising the IB objective function, separately for each stimulus. For panels 2c-d we used decoding filters of length τ = 1; for panels 2e-f we used decoding filters of length τ = 60. In each case, we performed the optimisation with decoding lags ranging from ∆ = 1 to ∆ = 10, and a range of different bottleneck parameters, γ (which determined the channel capacity, C). SI fig. 2 plots the optimal encoding filters for each stimulus, with varying τ, C and ∆.
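As an illustration, the three stimuli can be generated as follows. This is a minimal sketch: the sample length and random seed are arbitrary choices, and the published results were computed from analytical autocovariances rather than sampled sequences.

```python
import numpy as np

def ar1(a, b, T, rng):
    """AR1 process: x_t = a * x_{t-1} + b * eta_t, with eta_t ~ N(0, 1)."""
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + b * rng.standard_normal()
    return x

def ar2(a, b, c, T, rng):
    """AR2 process: x_t = a * x_{t-1} + b * x_{t-2} + c * eta_t."""
    x = np.zeros(T)
    for t in range(2, T):
        x[t] = a * x[t - 1] + b * x[t - 2] + c * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
T = 100_000                                    # arbitrary sample length

stim1 = ar1(0.89, 0.48, T, rng)                # 'markov stim.'
rho = 0.47                                     # 'two timescales'
stim2 = rho * ar1(0.97, 0.23, T, rng) + np.sqrt(1 - rho**2) * ar1(0.67, 0.73, T, rng)
stim3 = ar2(1.65, -0.68, 0.13, T, rng)         # 'inertial'
```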


Figure 3: Motion trajectory of drifting natural image patches. (a) Autocorrelation of motion speeds along the x-axis. (b) Example trajectories in the x and y directions. (c) Full motion trajectory, in 2d.

3.2 Neural coding of naturalistic movie stimuli

For figure 3, we considered neural coding of naturalistic movie stimuli, consisting of stochastically drifting static images. Images were taken from the van Hateren natural image database (www.kyb.tuebingen.mpg.de/?id=227). Each image was normalised so that the pixels had zero mean and unit variance. Each trial began with a 10×10 patch at a random position, {x_cord(0), y_cord(0)}, of the image. Movies were constructed by sliding the patch across the image. The position along the x-axis varied according to an AR2 process, described by x_cord(t) = x_cord(t − 1) + v_x(t − 1), where v_x(t) = a v_x(t − 1) − b η(t), with a = 0.95 and b = 0.16 (see SI fig. 3), and η is a zero-mean gaussian process. The position of the patch along the y-axis evolved according to the same dynamics. Trials where the patch reached the border of the image were excluded from the training data. The input to each neuron was created by adding gaussian white noise (with standard deviation of 0.1) to the stimulus. We trained spatio-temporal encoding filters, W, with temporal length 3. We used a Student-t approximation for the response distribution (see SI fig. 5 for a comparison with results obtained using a gaussian approximation of the response distribution). Filters were initialised with uncorrelated white noise, of magnitude 10^{−2}. We first learned filters with ∆ = −6. We then learned filters with increasing values of ∆; the encoding weights obtained for each ∆ were used as the initial conditions for ∆ + 1. We also adjusted the bottleneck parameter, γ, so that the channel capacity remained constant across all ∆ (C ≈ 32.5 bits). To compute the 'directionality index' for each neuron, shown in fig. 3e, we first presented model neurons with drifting sinusoidal grating stimuli of varying phase, direction, and speed, and thus obtained, for each model neuron, the preferred phase/direction/speed that elicited the strongest response. The 'directionality index' was then computed for each neuron by comparing the neuron's maximal response to its preferred stimulus with its response to a similar stimulus moving in the opposite direction.
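As an example of the last step, a directionality index can be computed along the following lines. This is a sketch only: the formula DI = (R_pref − R_opp)/(R_pref + R_opp) is a common convention that we assume here, and `model_response` is a hypothetical function returning a neuron's mean response to a drifting grating with the given parameters.

```python
import numpy as np

def directionality_index(model_response, phases, directions, speeds):
    """Sketch: find the preferred grating, then compare with the opposite direction.

    model_response(phase, direction, speed) -> mean response (hypothetical helper).
    """
    # Find the grating parameters eliciting the strongest response.
    best, r_pref = None, -np.inf
    for ph in phases:
        for d in directions:
            for s in speeds:
                r = model_response(ph, d, s)
                if r > r_pref:
                    best, r_pref = (ph, d, s), r
    ph, d, s = best
    # Response to the same grating drifting in the opposite direction.
    r_opp = model_response(ph, (d + np.pi) % (2 * np.pi), s)
    # Assumed convention: DI = (R_pref - R_opp) / (R_pref + R_opp).
    return (r_pref - r_opp) / (r_pref + r_opp)
```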

3.3 Drifting blob stimuli

For figure 4, we considered neural responses to 'drifting blob' stimuli, as shown in figure 4a. For this stimulus, there were 20 stimulus dimensions (i.e. 'pixels'), arranged along a single spatial axis. Note that, to simplify our analysis, we considered circular boundary conditions, so that each stimulus dimension corresponded to an angular coordinate, θ, arranged in equally spaced intervals between −π and π − 2π/20. Blob-like stimulus features were described by a (wrapped) gaussian, with standard deviation σ_blob = 0.45, time-varying position, θ_blob(t), and amplitude, A(t). On each trial, the stimulus was constructed by adding two blob-like features, with varying amplitude and position. The position of each blob varied according to an AR2 process: θ_blob(t) = θ_blob(t − 1) + v(t − 1), where v(t) = a v(t − 1) − b η(t), with a = 0.90 and b = 0.14. The amplitude of each blob varied according to an AR1 process.
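A minimal sketch of the blob construction is given below. It approximates the wrapped gaussian by evaluating the circular distance to the blob centre, and leaves the amplitude dynamics as a constant placeholder, since the AR1 amplitude parameters are not reproduced in this excerpt.

```python
import numpy as np

N_PIX = 20
theta = -np.pi + 2 * np.pi * np.arange(N_PIX) / N_PIX   # angular pixel coordinates
SIGMA_BLOB = 0.45

def blob(theta_blob, amplitude):
    """One blob evaluated at the pixel coordinates (gaussian in circular distance)."""
    d = np.angle(np.exp(1j * (theta - theta_blob)))      # circular distance to centre
    return amplitude * np.exp(-0.5 * (d / SIGMA_BLOB) ** 2)

def drift_positions(T, a=0.90, b=0.14, rng=np.random.default_rng(0)):
    """Blob position: theta(t) = theta(t-1) + v(t-1), with v(t) = a v(t-1) - b eta(t)."""
    th, v = np.zeros(T), np.zeros(T)
    th[0] = rng.uniform(-np.pi, np.pi)
    for t in range(1, T):
        v[t] = a * v[t - 1] - b * rng.standard_normal()
        th[t] = th[t - 1] + v[t - 1]
    return th

# Stimulus on one trial: the sum of two independently drifting blobs
# (constant unit amplitudes stand in for the AR1 amplitude dynamics).
T = 500
stim = sum(np.stack([blob(th, 1.0) for th in drift_positions(T, rng=np.random.default_rng(s))])
           for s in (1, 2))
```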



Figure 4: (a) Schematic, where the goal is to reconstruct a stimulus in an extended temporal window (shaded blue). (b) We plotted the relative information encoded by two neurons about a 'two-timescale' stimulus (fig. 2b) at varying channel capacity, C. The x-axis is the total information (or channel capacity) encoded by both neurons; the y-axis is the fraction of information encoded by the less active neuron. In blue we plot the results for an instantaneous code (τ = 0); the red plot corresponds to a temporally extended code (τ = 30). In both cases, at low channel capacity only one neuron is active; increasing the channel capacity above a certain threshold leads to the second neuron being active. The required threshold increases with the code length, τ.