
Bayesian Multioutput Feedforward Neural Network Comparison: A Conjugate Prior Approach Vivien Rossi and Jean-Pierre Vila, Member, IEEE,

Abstract—A Bayesian method for the comparison and selection of multi-output feedforward neural network topology, based on predictive capability, is proposed.¹ As a measure of the prediction fitness potential, an expected utility criterion is considered, which is consistently estimated by a sample-reuse computation. As opposed to classic point-prediction-based cross-validation methods, this expected utility is defined from the logarithmic score of the neural model predictive probability density. It is shown how the advocated choice of a conjugate probability distribution as prior for the parameters of a competing network allows a consistent approximation of the network posterior predictive density. A comparison of the performances of the proposed method with those of usual selection procedures based on classic cross-validation and information-theoretic criteria is performed, first on a simulated case study and then on a well-known food analysis dataset.

Index Terms—Feedforward neural network, Bayesian model selection, conjugate prior distribution, empirical Bayes methods, expected utility criterion.

I. INTRODUCTION

The issue of selecting a right network topology is one of the most debated in feedforward multilayer neural network modeling. A bias/variance trade-off has to be satisfied [16], [7] to get close to some optimal model complexity (number of layers and neurons), protecting as much as possible from the two contradictory effects of overfitting and underfitting. However, the dynamics of the bias and variance errors can in general only be estimated through the estimation of a huge number of varied neural network models and their comparison on a test data set, which is hardly feasible in practice. Less time-consuming on-line and off-line strategies have therefore been proposed, which belong to several classes, heuristically or more statistically oriented. Constructive algorithms [19] come within the first category, while more complex constructive-destructive methods often come within the second [30]. More general methods are the so-called regularization techniques [29], based on implicit structure optimization [28]. They consider a fixed topology and constrain the network parameters in some way, for example by adding penalty (or weight decay) terms to the cost function, in order to avoid saturation of the units. Regularization techniques using penalty term addition can be considered as statistically Bayesian, since this penalty can be associated with a prior probability on the weight and bias parameters [9].

The authors are with UMR Analyse des Systèmes et Biométrie, INRA-ENSAM, 2 Place P. Viala, 34060 Montpellier, France.
¹ This paper is a full extension of the contributed paper "Multioutput Feedforward Neural Network Selection: A Bayesian approach", given by the authors at the IEEE-INNS 2003 International Joint Conference on Neural Networks (pp. 495-500 in the proceedings).

The well-known MacKay's Bayesian

framework for backpropagation [22], originally designed for single-output networks, comes within these approaches. For a given neural model, this Bayesian formalism leads to the so-called evidence, which estimates how likely the model is given the available dataset, and which can thus be used in model selection. In recent years MacKay's paradigm has been used in several applied fields, such as mathematical finance [25], [17], [40]. However, these approaches present several critical points, such as the imprecise relationship between the generalization performance of the network and the advocated Bayesian selection criteria [22], [36], the treatment of the hyperparameters [9], [39], [23], the usually assumed independence of the Gaussian weight priors and, more crucially, scarcely controllable approximations of the posteriors. The evidence method, for example, relies on a critical Gaussian approximation of the posterior parameter distribution (in which the hyperparameters, the weight decay terms, are fixed to the values maximizing the evidence). It has been observed that this approximation breaks down when the ratio of the dataset size to the number of network parameters is too small [22], [36]. Markov chain Monte Carlo methods have been proposed to replace this Gaussian approximation [27], requiring however skilled simulation expertise and greater computing effort. More general and statistically oriented methods are the information-theoretic model comparison criteria [10], such as Akaike's AIC, BIC, Mallows' Cp, RIC and NIC [26], which also combine some measure of fit with a penalty term to account for model complexity. However, it has been observed that, according to the situation considered, the performance of these criteria is rather sensitive to the type of penalty [35], especially in the case of neural networks [2]. Other statistical tools such as asymptotic inferential tests, e.g.
likelihood ratio, Wald's or Rao's Lagrange multiplier tests [33], can also be used to compare feedforward neural models, but they are explicitly restricted to the comparison of nested neural models. Finally, one of the most attractive comparison procedures, even if computer-intensive, is still cross-validation (CV), because of its genericity and its limited probability assumption requirements (e.g. an exchangeability assumption). However, it has been shown that CV can be inconsistent (unless an appropriate data division is done [34]), as are the methods asymptotically equivalent to it (e.g. AIC and Cp). Moreover, CV is often too conservative and tends to select unnecessarily large models. To counteract these defects, a CV-like Bayesian nonlinear model comparison procedure, inspired by a classic utility criterion [6], has been developed and adjusted to the issue of the comparison of single-output feedforward neural networks


[37]. An extension of this Bayesian approach to multiresponse regression models has been designed recently [31].
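For reference, the point-prediction cross-validation that the proposed criterion is contrasted with can be sketched as follows. This is a generic leave-one-out sketch in which polynomial degree stands in for model complexity (an illustration under simplifying assumptions, not the paper's multivariate procedure):

```python
import numpy as np

# Generic leave-one-out, point-prediction CV: polynomial degree plays the
# role of model complexity (a stand-in for network size; illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 40)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, 40)  # smooth signal + noise

def loo_cv(x, y, degree):
    """Sum of squared leave-one-out prediction errors for a polynomial fit."""
    sse = 0.0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        coef = np.polyfit(x[keep], y[keep], degree)
        sse += (np.polyval(coef, x[i]) - y[i]) ** 2
    return sse

scores = {d: loo_cv(x, y, d) for d in range(1, 8)}
best = min(scores, key=scores.get)  # degree with the smallest LOO error
```

As the text notes, such point-prediction CV scores each model only through squared errors of point forecasts; the criterion developed below replaces the squared error with the logarithmic score of a full predictive density.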

The present paper proposes an adaptation of this approach to the comparison of multioutput multilayer neural networks, with a specific recourse to Bayesian conjugate prior theory and to the so-called empirical Bayes approach. This particular Bayesian framework offers the advantage of allowing the introduction of data-respectful priors while permitting a complete analytical treatment of the posterior and predictive densities. Moreover, the method allows comparison of networks differing by their topologies as well as by their input variable sets. For a given data set, this predictive network performance comparison approach can be used to select, among neural nets of varied complexity, the net achieving the best compromise between complexity and generalization. Admittedly, several valuable works such as [27], [18] suggest that the predictive accuracy of a Bayesian network is not sensitive to the number of hidden units (given enough units not to underfit) and that there is no need to optimize their number and organization, provided sensible priors and adequate posterior approximations are used for the network parameters. From a practical point of view, however, and from that of the general non-Bayesian neural net user, there is still a need for simple, efficient, standard neural net comparison and selection tools, sufficiently generic to be relieved as much as possible of specialized and controversial issues such as the hierarchisation of net parameters, the relevant choice of parameter and hyperparameter priors, and the efficient design of posterior approximations, and relieved also of the specialized simulation apparatus these involve (e.g. MCMC techniques).

In this spirit, this paper puts the emphasis on a general analytical approach, leading to a well-known closed form for a consistent approximation of the neural net predictive density and confining all numerical aspects to the final evaluation of the proposed utility criterion. The paper is organized as follows. In Section II the statistical framework of the neural model selection problem is set up. In Section III the building elements of the expected-utility-based criterion are considered. Convergent approximations of the parameter posterior and posterior predictive densities, allowing the sample-reuse calculation of the expected utility, are developed in Section IV. Section V shows how this predictive density estimation procedure can easily be adapted to take full account of the structural multi-modality of the likelihood function of a feedforward neural model. In Section VI this Bayesian procedure is applied to a simulated predictive neural network selection problem and then to a well-known benchmark test in spectroscopy. The performances of the procedure are compared with those of AIC, BIC and classic CV procedures. Appendix I briefly recalls elements of Bayesian theory used in the construction of the expected utility criterion. Appendix II provides the proofs of the convergence of the approximations of the parameter posterior and posterior predictive densities used in the criterion.

II. MULTI-OUTPUT FEEDFORWARD NEURAL MODELING FRAMEWORK

A multi-output feedforward neural network model M is a multi-response nonlinear regression model which, under the assumption of Gaussian additive errors, can be written as

Model M:

    y_i = f(x_i, \theta) + \varepsilon_i    (1)

where the nonlinear mapping f results from the network topology, with x as input vector and \theta as parameter vector (the set of all weights and biases of the network) [29]. For i \in \{1,\dots,n\}: y_i \in \mathbb{R}^d, x_i \in \mathbb{R}^l, \theta \in \Theta a compact subset of \mathbb{R}^q, and \varepsilon_i \sim N_d(0, \Sigma) with \Sigma \in S \subset \mathbb{R}^{d \times d}, where S is the set of all positive definite symmetric matrices of dimension d \times d. We shall consider, more often than the variance-covariance matrix \Sigma, the precision matrix \Lambda = \Sigma^{-1}.

Let us denote:
• Z_n = \{(x_i, y_i), i = 1,\dots,n\}, the available data set, made of n i.i.d. random data points (x, y).
• y_{1:n} = (y_1,\dots,y_n) and x_{1:n} = (x_1,\dots,x_n).
• \dot f_{x_i,\theta} = \big(\partial f(x_i,\theta)/\partial\theta_j\big), j \in \{1,\dots,q\}. We shall suppose that these derivatives exist for all the neural networks considered.

Other notations:
• N_q(\cdot|\mu,\Sigma): the q-dimensional Gaussian probability density with expectation \mu and covariance matrix \Sigma.
• Wi_d(\cdot|\alpha,\beta): the d-dimensional Wishart density with parameters \alpha and \beta.
• St_d(\cdot|\mu,\Psi,\alpha): the d-dimensional Student density with parameters \mu, \Psi and \alpha.
• To alleviate notations, integration with respect to \theta and \Lambda over their whole membership set \Theta \times S will be denoted throughout the paper by \int instead of \int_{\Theta \times S}.

Given Z_n and a set M of J feedforward neural models \{M^j, j = 1,\dots,J\}, with E(y|M^j, x) = f^j(x, \theta^j), the issue of interest is to select the best neural model, M^*, in some predictive sense.

III. THE EXPECTED-UTILITY-BASED CRITERION

To perform this selection we follow the maximum-expected-utility approach [6], for which the optimal model choice is M^* such that

    \bar u(M^*|Z_n) = \sup_{M^j \in M} \bar u(M^j|Z_n)    (2)

where

    \bar u(M^j|Z_n) = \int u(M^j, y, x|Z_n)\, p((x,y)|Z_n)\, dy\, dx    (3)

in which u(M^j, y, x|Z_n) is a given utility function and p((x,y)|Z_n) is a probability density representing actual beliefs about (x, y) having observed Z_n. But p((x,y)|Z_n) in (3) is generally not available. We then search for a consistent estimate of \bar u(M^j|Z_n) for each M^j \in M. Following Bernardo and Smith [6] we consider the


n partitions of Z_n: Z_n = [Z_{n-1}(i), (x_i, y_i)] for 1 \le i \le n, where Z_{n-1}(i) denotes the data set Z_n after withdrawal of the data point (x_i, y_i). If we select k of these data points at random (without replacement), we have by the strong law of large numbers, under regular assumptions, as n and k grow to infinity [6], [31]:

    \frac{1}{k}\sum_{i=1}^{k} u(M^j, y_i, x_i|Z_{n-1}(i)) - \int u(M^j, y, x|Z_n)\, p((x,y)|Z_n)\, dy\, dx \xrightarrow{a.s.} 0    (4)

The expected utility of model M^j \in M can then be consistently approximated by

    U_j = \frac{1}{k}\sum_{i=1}^{k} u(M^j, y_i, x_i|Z_{n-1}(i))

Furthermore, as we are interested in comparing models from a predictive distribution point of view, as suggested in [6] we take as utility function the logarithmic score

    u(M^j, y, x|Z_n) = \log p(y|M^j, x, Z_n)    (5)

In (5), p(y|M^j, x, Z_n) is the posterior predictive density under model M^j of a response y at x, given the past observations Z_n and an appropriate prior density for the parameters of the neural network model M^j. Let us note that with this choice of utility function, (4) is similar to the predictive sample reuse criterion of [14], which considers the product of conditional predictive densities. We then decide to take as M^* the model M^j \in M maximizing

    U_j = \frac{1}{k}\sum_{i=1}^{k} \log p(y_i|M^j, x_i, Z_{n-1}(i))    (6)

This procedure selects, on a sample-reuse basis, the model under which the data set Z_n achieves the highest level of some internal consistency: the best model is that which, on the whole, most favors the likelihood of each observation with respect to the others. The next section will be devoted to the calculation of a convergent approximation \hat p of the parameter posterior predictive density p, for each neural network model M^j, leading to the practical criterion

    \hat U_j = \frac{1}{k}\sum_{i=1}^{k} \log \hat p(y_i|M^j, x_i, Z_{n-1}(i))    (7)

such that, given k,

    \hat U_j \xrightarrow[n\to\infty]{a.s.} U_j    (8)

IV. POSTERIOR PREDICTIVE DENSITIES: A CONSISTENT APPROXIMATION

In order to compute (6) we need a posterior predictive density for the response at a given x, under model M^j, conditional on the training set Z_n, for each M^j \in M. For a given network M as in (1), such a posterior is defined by

    p(y|x, T, Z_n) = \int p(y|x, \theta, \Lambda)\, p(\theta, \Lambda|T, Z_n)\, d\theta\, d\Lambda    (9)

In (9), p(y|x, \theta, \Lambda) is given by model (1) and p(\theta, \Lambda|T, Z_n) is a (\theta, \Lambda) posterior probability density, with T a vector of hyperparameters. This (\theta, \Lambda) posterior density is obtained by Bayes' theorem from a given (\theta, \Lambda) prior density p(\theta, \Lambda|T):

    p(\theta, \Lambda|T, Z_n) = \frac{p(Z_n|\theta, \Lambda)\, p(\theta, \Lambda|T)}{\int p(Z_n|\theta, \Lambda)\, p(\theta, \Lambda|T)\, d\theta\, d\Lambda}    (10)

For a given (\theta, \Lambda) prior, the computation of the posterior (10) is generally intractable. One possible approach to consistently estimate (9) is to use a technique of Bayesian learning for neural networks. These techniques are based on (\theta, \Lambda) sampling from an MCMC estimation of the posterior density p(\theta, \Lambda|T, Z_n) (see for example [18], [12], [20]). However, such MCMC integrations frequently suffer from instability [15], which can impair the relevance of the final utility criterion estimation. In addition, another major and preliminary difficulty of this Bayesian training approach is of course the (\theta, \Lambda) prior choice itself, which in spite of several attractive approaches [21], [27] remains a critical issue, lacking a general, easy-to-handle answer, especially for non-Bayesians. These difficulties led us to consider an analytical treatment of the parameter posterior and posterior predictive density estimations, from a well-known class of parameter priors.

A. (\theta, \Lambda) prior density

Let us note that under the assumptions of model (1) the probability density of (y_{1:n}|x_{1:n}, \theta, \Lambda) belongs to the exponential family:

    p(y_{1:n}|x_{1:n}, \theta, \Lambda) = \frac{|\Lambda|^{n/2}}{(2\pi)^{nd/2}} \exp\Big\{ -\frac{1}{2}\sum_{i=1}^{n} \|y_i - f(x_i, \theta)\|_{\Lambda}^2 \Big\}
      = c \times g(\theta, \Lambda) \times \exp\Big\{ -\frac{1}{2}\,\mathrm{tr}\Big[\Big(\sum_{i=1}^{n} y_i y_i'\Big)\Lambda\Big] + \sum_{i=1}^{n} f(x_i, \theta)' \Lambda y_i \Big\}    (11)

with c = (2\pi)^{-nd/2} and g(\theta, \Lambda) = |\Lambda|^{n/2} \exp\big\{ -\frac{1}{2}\sum_{i=1}^{n} \|f(x_i, \theta)\|_{\Lambda}^2 \big\}.

This suggests taking as (\theta, \Lambda) prior density the conjugate density with respect to the likelihood p(y_{1:n}|x_{1:n}, \theta, \Lambda), thus ensuring tractability of the related posterior. Indeed, the fundamental advantage of a conjugate prior density is to provide the related posterior density very easily since, because of a closure property, both densities belong to the same family of probability distributions [3]. From (11) and by the definition of conjugate families for regular exponential families of probability distributions, we easily have


    p(\theta, \Lambda|T) = K[T]^{-1} [g(\theta, \Lambda)]^{\tau_0} \exp\Big\{ \sum_{i=1}^{n} f(x_i, \theta)' \Lambda T_{2i} - \frac{1}{2}\,\mathrm{tr}\big[T_1 \Lambda\big] \Big\}
      = K[T]^{-1} |\Lambda|^{\tau_0 n/2} \exp\Big\{ -\frac{\tau_0}{2}\sum_{i=1}^{n} \|f(x_i, \theta)\|_{\Lambda}^2 + \sum_{i=1}^{n} f(x_i, \theta)' \Lambda T_{2i} - \frac{1}{2}\,\mathrm{tr}\big[T_1 \Lambda\big] \Big\}    (12)

with T_1 a d \times d symmetric matrix, T_{2i} an \mathbb{R}^d vector for i = 1,\dots,n, K[T]^{-1} a normalizing constant and T = (\tau_0, T_1, T_{21},\dots,T_{2n}) a set of hyperparameters. For an interpretation of p(\theta, \Lambda|T), see the final remark of § IV-C.

B. (\theta, \Lambda) posterior density

Under model M the parameter posterior density associated with the prior p(\theta, \Lambda|T) is then given by (see Appendix I):

    p(\theta, \Lambda|Z_n, T) = p(\theta, \Lambda|T + t(y_{1:n}))    (13)

with T + t(y_{1:n}) = (\tau_0 + 1,\; T_1 + \sum_{i=1}^{n} y_i y_i',\; T_{21} + y_1,\dots,T_{2n} + y_n).

From (12) and (13) we have

    p(\theta, \Lambda|Z_n, T) = K[T + t(y_{1:n})]^{-1} |\Lambda|^{(\tau_0+1)n/2} \exp\Big\{ -\frac{\tau_0+1}{2}\sum_{i=1}^{n} \|f(x_i, \theta)\|_{\Lambda}^2 + \sum_{i=1}^{n} f(x_i, \theta)' \Lambda (T_{2i} + y_i) - \frac{1}{2}\,\mathrm{tr}\Big[\Big(T_1 + \sum_{i=1}^{n} y_i y_i'\Big)\Lambda\Big] \Big\}    (14)

At this point, we have to decide how to treat the hyperparameters T: we could try to integrate them out, but only under the problematic choice of a second-level prior and with other possible drawbacks [23]. In the present case, a more tractable and natural approach is to optimize them by maximizing the prior density of the observations themselves:

    p(y_{1:n}|x_{1:n}, T) = \int p(y_{1:n}|x_{1:n}, \theta, \Lambda)\, p(\theta, \Lambda|T)\, d\theta\, d\Lambda    (15)

It can be shown that

    p(y_{1:n}|x_{1:n}, T) = \prod_{i=1}^{n} p(y_i|x_i, T) \underset{n\to\infty}{\simeq} \prod_{i=1}^{n} N_d\Big(y_i \,\Big|\, f(x_i, \theta_0), \frac{a}{2}\beta(T)^{-1}\Big)    (16)

where the i-th factor in the right-hand side is the value at y_i of the d-dimensional normal density with mean f(x_i, \theta_0) and inverse covariance matrix \frac{a}{2}\beta(T)^{-1}, with

• a = n\tau_0 + 2
• \beta(T) = \frac{1}{2}\sum_{i=1}^{n} \big[\tau_0 f(x_i, \theta_0) f(x_i, \theta_0)' - f(x_i, \theta_0) T_{2i}' - T_{2i} f(x_i, \theta_0)'\big] + \frac{1}{2} T_1
• \theta_0 = \arg\min_\theta \det\Big[\sum_{i=1}^{n} \big(T_{2i}/\tau_0 - f(x_i, \theta)\big)\big(T_{2i}/\tau_0 - f(x_i, \theta)\big)'\Big]

p(y_{1:n}|x_{1:n}, T) is then asymptotically maximized by

    \tau_0 = 1,\quad T_1 = \sum_{i=1}^{n} y_i y_i',\quad T_{2i} = y_i,\; i = 1,\dots,n    (17)

a setting under which \theta_0 and \frac{2}{n}\beta(T) are equal to \hat\theta_n and \hat\Lambda_n^{-1}, where \hat\theta_n and \hat\Lambda_n are the maximum likelihood estimates of \theta and \Lambda, given by [33]:

    \hat\theta_n = \arg\min_\theta \det\Big[\sum_{i=1}^{n} \big(y_i - f(x_i, \theta)\big)\big(y_i - f(x_i, \theta)\big)'\Big]
    \hat\Lambda_n^{-1} = \hat\Sigma_n = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i, \hat\theta_n)\big)\big(y_i - f(x_i, \hat\theta_n)\big)'    (18)

An intuitive idea of this optimal setting can be reached from (15) by seeing that p(y_{1:n}|x_{1:n}, T) \le p(y_{1:n}|x_{1:n}, \hat\theta_n, \hat\Lambda_n). The maximization of p(y_{1:n}|x_{1:n}, T) will be favored, as n grows to infinity, by choosing a setting for T such that the prior density p(\theta, \Lambda|T) loads more and more, in priority, a neighborhood of (\hat\theta_n, \hat\Lambda_n). A simple look at (12) and (11) shows that this is achieved by the setting (17). Let us note that this setting is related to the so-called empirical Bayes approach [24].

From now on, we shall only consider the setting (17) for the hyperparameters, and thus T will no longer appear in the expression of the prior and posterior densities of the parameters. We then have from (14)

    p(\theta, \Lambda|Z_n) = K_n |\Lambda|^{n} \exp\Big\{ -\sum_{i=1}^{n} \|y_i - f(x_i, \theta)\|_{\Lambda}^2 \Big\}    (19)

where K_n = K^{-1}[T + t(y_{1:n})] is the normalizing constant. However, with a parameter posterior such as (19), the computation of the posterior predictive density (9) is intractable for a general neural model f. Let us consider then a convergent approximation of p(\theta, \Lambda|Z_n), allowing the computation of a convergent approximation of p(y|x, Z_n) under model M.

C. An L_1-convergent approximation of the parameter posterior density

Let H be the following set of assumptions for model M:
H1: x_i \in X, a compact subset of \mathbb{R}^l, i = 1,\dots,n.
H2: The model function f(x, \theta) is of class C^1 both in x and \theta (this assumption is satisfied by the usual networks with differentiable transfer functions in their units).

Let

    \hat p(\theta, \Lambda|Z_n) = N_q\big(\theta\,|\,\hat\theta_n, V_\theta\big)\; Wi_d\Big(\Lambda\,\Big|\,n + \frac{d+1}{2}, V_\Lambda\Big)    (20)

with

    V_\theta = \Big(2\sum_{i=1}^{n} \dot f_{x_i,\hat\theta_n}' \hat\Lambda_n \dot f_{x_i,\hat\theta_n}\Big)^{-1}
    V_\Lambda = \sum_{i=1}^{n} \big(y_i - f(x_i, \hat\theta_n)\big)\big(y_i - f(x_i, \hat\theta_n)\big)'

Let us recall now that under general conditions there exist limit values \theta^* and \Lambda^* to which the maximum likelihood estimates \hat\theta_n and \hat\Lambda_n converge almost surely with n under model M [38], [1]. These values are the true parameter values when model M is the correct one. When model M is incorrect (which is always the case in neural network modelling of actual data), \theta^* and \Lambda^* are the parameter values minimizing the Kullback-Leibler information criterion between the true (x, y) data distribution and the (x, y) distribution induced by model M. Moreover, the parameter posterior distribution concentrates around these limit values \theta^*, \Lambda^* (see [4], [5] and especially [1] for details). The following lemma extends this concentration property to the distribution of density \hat p(\theta, \Lambda|Z_n).

Lemma 1: Suppose assumptions H are satisfied. Let A be a measurable set of \Theta \times S which contains an open neighborhood of the limit parameter values (\theta^*, \Lambda^*). Then

    \lim_{n\to\infty} \hat P(A) = 1 \quad a.s.

where \hat P is the probability measure associated with the density \hat p(\theta, \Lambda|Z_n). This lemma ensures the consistency of \hat p(\theta, \Lambda|Z_n), i.e. its asymptotic concentration at (\theta^*, \Lambda^*).

Theorem 1: Under assumptions H,

    \lim_{n\to\infty} \int \big|\hat p(\theta, \Lambda|Z_n) - p(\theta, \Lambda|Z_n)\big|\, d\theta\, d\Lambda = 0 \quad a.s.    (21)

Remark: In the same way it could have been shown that

    \int \big|\hat p(\theta, \Lambda) - p(\theta, \Lambda)\big|\, d\theta\, d\Lambda \xrightarrow[n\to\infty]{} 0 \quad a.s.    (22)

with

    \hat p(\theta, \Lambda) = N_q\big(\theta\,|\,\hat\theta_n, 2V_\theta\big)\; Wi_d\Big(\Lambda\,\Big|\,\frac{1}{2}(n + d + 1), \frac{1}{2}V_\Lambda\Big)    (23)

(23) shows that, unsurprisingly, the conjugate prior p(\theta, \Lambda|T) with the setting (17) takes the form of a "data-respectful" distribution for n sufficiently large. Most remarkable is that the form of this prior approximation and that of the posterior (20) also respect the usual Bayesian choices for these kinds of parameters, thus confirming the interest of this conjugate approach.

D. An L_1-convergent approximation of the posterior predictive density

By definition

    p(y|x, Z_n) = \int p(y|x, \theta, \Lambda)\, p(\theta, \Lambda|Z_n)\, d\theta\, d\Lambda    (24)

Let

    \hat p(y|x, Z_n) = \int \hat p(y|x, \theta, \Lambda)\, \tilde p(\theta, \Lambda|Z_n)\, d\theta\, d\Lambda    (25)

with \tilde p(\theta, \Lambda|Z_n) an L_1-convergent approximation of the parameter (\theta, \Lambda) posterior density, and

    \hat p(y|x, \theta, \Lambda) = \frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}} \exp\Big\{ -\frac{1}{2}\|y - f(x, \hat\theta_n)\|_{\Lambda}^2 \Big\}    (26)

Theorem 2: Under assumptions H,

    \lim_{n\to\infty} \int \big|\hat p(y|x, Z_n) - p(y|x, Z_n)\big|\, dy = 0 \quad a.s.    (27)

Now take \tilde p(\theta, \Lambda|Z_n) equal to \hat p(\theta, \Lambda|Z_n) as given by (20), and let

    \hat p_n(y|x, Z_n) = St_d\Big(y\,\Big|\,f(x, \hat\theta_n), \frac{n+1}{n}\hat\Lambda_n, 2n + 2\Big)    (28)

Corollary 1: Under assumptions H,

    \lim_{n\to\infty} \int \big|\hat p_n(y|x, Z_n) - p(y|x, Z_n)\big|\, dy = 0 \quad a.s.    (29)

Proof: bringing \tilde p(\theta, \Lambda|Z_n) \equiv \hat p(\theta, \Lambda|Z_n) into (25), with (26), leads easily to (28).

A tractable convergent approximation \hat p(y|x, Z_n) of the posterior predictive density p(y|x, Z_n) under model M is now available, which can be applied to each model M^j \in M. According to (7), a consistent approximation \hat U_j of the expected utility of model M^j can now be computed, for j = 1,\dots,J.

V. MANAGING THE NEURAL MODEL LIKELIHOOD MULTI-MODALITY

The posterior predictive density approximation proposed in the previous section to compute the expected utility approximation \hat U of a given neural model M assumes that \hat\theta_n in (18) is the argument of the minimum of the quadratic cost function \det\big[\sum_{i=1}^{n} (y_i - f(x_i, \theta))(y_i - f(x_i, \theta))'\big], or equivalently the argument of the maximum of the related likelihood. It has been shown that for a general likelihood function the uniqueness of this optimum is ultimately satisfied under regularity conditions as the data set size n increases [13]. But for a multilayer perceptron model there are always several families of equivalent local optima. These families are connected with two types of symmetry transformation, corresponding to parameter-sign changes and neuron interchanges [11]. These transformations lead to equivalent network input-output mappings. More precisely, for an H-hidden-layer network with m_h neurons on hidden layer h, the overall symmetry factor is SF = \prod_{h=1}^{H} m_h!\, 2^{m_h} [37]. This shows that each local mode of the likelihood function (or local minimum of the sum-of-squares surface) belongs to a class of SF equivalent optima. The total number TNC of such classes can hardly be determined analytically in general. But a reasonable exploration of the network parameter space can reveal the NC most attractive of such classes. The missing remaining classes, of lower attractiveness and lower


contribution to the topology of the likelihood surface, will not have much consequence for n sufficiently large. Let \hat\theta_{c,s} be the location of the s-th local likelihood optimum within the c-th class, with 1 \le c \le NC and 1 \le s \le SF. Let \hat p(\theta, \Lambda|\hat\theta_{c,s}) and \hat p(\theta, \Lambda|Z_n, \hat\theta_{c,s}) be the parameter prior and parameter posterior approximations computed respectively from (23) and (20), for \hat\theta_n = \hat\theta_{c,s}. Under the assumption that the overlap between all the prior densities \hat p(\theta, \Lambda|\hat\theta_{c,s}) is negligible and that the NC \times SF local optima all have the same probability of being reached by the parameter estimation procedure, it can be shown that a reliable approximation of the neural network posterior predictive density is, instead of (28), given by

    \hat p(y|x, Z_n) = \Big(\sum_{c=1}^{NC} K_c\Big)^{-1} \sum_{c=1}^{NC} K_c\, \hat p(y|x, Z_n, \hat\theta_c)    (30)

where
• \hat p(y|x, Z_n, \hat\theta_c) is given by (28), in which \hat\theta_c can be taken as any of the SF equivalent local optimal arguments \hat\theta_{c,s}, 1 \le s \le SF.
• K_c = \int p(y_{1:n}|x_{1:n}, \theta, \Lambda)\, \hat p(\theta, \Lambda|\hat\theta_c)\, d\theta\, d\Lambda = (2\pi)^{q/2}\, \pi^{d(d-1)/4}\, \frac{\prod_{j=1}^{d} \Gamma\big(\alpha + \frac{1-j}{2}\big)}{|v|^{\alpha}}    (31)

with \alpha = n + \frac{d+1}{2} and v = \sum_{i=1}^{n} \big(y_i - f(x_i, \hat\theta_c)\big)\big(y_i - f(x_i, \hat\theta_c)\big)' = n\hat\Lambda_n^{-1}.

Let us note that under the same assumptions, the minimum squared error loss prediction of the neural network at x given Z_n is given by the mean of the posterior predictive density (30):

    \hat y_{|x,Z_n} = \Big(\sum_{c=1}^{NC} K_c\Big)^{-1} \sum_{c=1}^{NC} K_c\, f(x, \hat\theta_c)    (32)
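The symmetry-factor count and the mixture prediction (32) are straightforward to compute. A minimal numpy sketch, in which the network function `f`, the per-class estimates and the per-class weights `K` are hypothetical placeholders:

```python
import numpy as np
from math import factorial

def symmetry_factor(hidden_sizes):
    # SF = prod_h m_h! * 2**m_h: neuron permutations and sign flips
    # per hidden layer give SF equivalent optima within each class.
    sf = 1
    for m in hidden_sizes:
        sf *= factorial(m) * 2 ** m
    return sf

def mixture_prediction(x, theta_hats, K, f):
    # Eq. (32): K_c-weighted average of the per-class network outputs.
    # f and theta_hats stand in for a trained network and its per-class
    # optima (placeholders, not the paper's code).
    K = np.asarray(K, dtype=float)
    preds = np.array([f(x, th) for th in theta_hats])
    return (K[:, None] * preds).sum(axis=0) / K.sum()
```

For instance, `symmetry_factor([5])` gives 5! * 2^5 = 3840 equivalent optima per class for a single hidden layer of 5 neurons, which is why only one representative per class is needed in (30).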

VI. CASE STUDIES

The U-criterion as given by (7) has been compared with usual model selection criteria able to deal with correlated multioutput responses, on a simulated and on an actual neural network selection problem. In each of the following case studies, N is the size of the available dataset, from which n data points are sampled at random to compute the U-criterion (7) with k = n (which is of course the best choice for k, but also the most costly) and the CV, AIC and BIC criteria. The MSEP (mean squared error of prediction) on the remaining N - n data points is also considered, but as a reference criterion. For a given neural model M:

• The CV criterion is defined as CV = \sum_{i=1}^{n} \|y_i - f(x_i, \hat\theta_{n-1}[i])\|_{Q^{-1}}^2, where \hat\theta_{n-1}[i] is the maximum likelihood estimate of \theta on Z_{n-1}[i] and Q is the empirical variance-covariance matrix of the \{y_i\}_{i=1,\dots,n}.
• For the AIC and BIC criteria the usual forms are considered [10]: AIC = -2\log L(\hat\theta_n, \hat\Lambda_n) + 2K and BIC = -2\log L(\hat\theta_n, \hat\Lambda_n) + K\log n, where K is the total number of parameters of the neural model and L(\hat\theta_n, \hat\Lambda_n) is the neural model maximum likelihood.
• The MSEP is defined as MSEP = \sum_{i=n+1}^{N} \big(y_i - f(x_i, \hat\theta_n)\big)' Q^{-1} \big(y_i - f(x_i, \hat\theta_n)\big), after the network has been trained on the first data subset of size n.

A. A simulated case study

Let us consider the following nine feedforward fully connected neural structures with three inputs x^1, x^2, x^3 and two outputs y^1, y^2:
• NNi: one hidden layer of i neurons (6i + 2 parameters), i = 1,\dots,6.
• N6N3: two hidden layers of 6 and 3 neurons respectively (53 parameters).
• N6N8: two hidden layers of 6 and 8 neurons respectively (98 parameters).
• N7N10: two hidden layers of 7 and 10 neurons respectively (130 parameters).

N = 1000 data points were independently and identically simulated from network NN5 for a given set of parameter values, with x^1 \sim U[-10, 10], x^2 \sim N(3, 5^2), x^3 \sim U[-1, 7] and an additive Gaussian noise \varepsilon on the two outputs, \varepsilon \sim N(0, \Sigma) with

    \Sigma = \begin{pmatrix} 1.75 & 0.8 \\ 0.8 & 2.5 \end{pmatrix}

The first n = 500 data points were used to compute the scores reached by the nine networks according to the U, CV, AIC and BIC criteria respectively. The remaining 500 data points were used to compute the MSEP of each network on this test data subset. All the results are shown in Table I (winning scores are starred; note that the U-criterion has to be maximized while the other three criteria and the MSEP have to be minimized). The U- and CV-criteria select the right network, NN5, as does the MSEP on the test data. However, one can note that the score reached by NN5 contrasts with those of the other eight networks more sharply according to the U-criterion than according to the CV-criterion, and even than according to the MSEP. On the other hand, AIC and BIC behave very badly, simply ranking the networks according to their growing complexity. Table II concisely sums up these behaviours through the pairwise Wilcoxon rank correlation coefficients of the criteria and the MSEP on the test data set.

TABLE I
SCORINGS OF THE NINE NEURAL MODELS ACCORDING TO THE FOUR CRITERIA AND THE MSEP - CASE STUDY A

Networks      U         CV        AIC         BIC        MSEP
NN1        -8.5176   39.164    435.9818*   462.3683*   45.122
NN2        -7.9224   23.715    446.8827    493.0592    37.946
NN3        -5.0229    5.9486   452.7242    518.6906    10.707
NN4        -2.2254    1.2371   459.5458    545.3020     6.3251
NN5         0.6812*   1.0973*  464.6480    570.1941     2.0014*
NN6        -0.4401    1.1437   475.6331    600.9692     2.8412
N6N3       -2.4138    1.1932   501.6511    628.2354     6.5689
N6N8       -2.2699    1.3934   601.6449    924.8800     6.2608
N7N10      -2.6198    1.3334   664.0071   1092.8        7.7527

(* winning score in each column)
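For reference, the AIC and BIC columns follow the usual Gaussian plug-in forms defined above. A minimal sketch (the network fit itself is assumed; only its residual matrix is needed):

```python
import numpy as np
from math import log, pi

def aic_bic(residuals, K):
    # residuals: (n, d) array of y_i - f(x_i, theta_hat) for a fitted model;
    # K: total number of model parameters. Gaussian MLE plug-in:
    # -2 log L = n*d*log(2*pi) + n*log det(Sigma_hat) + n*d.
    R = np.asarray(residuals, dtype=float)
    n, d = R.shape
    sigma_hat = (R.T @ R) / n          # ML estimate of the error covariance
    _, logdet = np.linalg.slogdet(sigma_hat)
    m2ll = n * d * log(2.0 * pi) + n * logdet + n * d
    return m2ll + 2 * K, m2ll + K * log(n)   # (AIC, BIC)
```

Both criteria share the same fit term and differ only in the complexity penalty (2K versus K log n), which is why they rank models identically here once the penalty dominates.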


TABLE II
WILCOXON RANK CORRELATIONS OF THE CRITERIA SCORINGS - CASE STUDY A

            U        CV       AIC-BIC    MSEP
U           1        0.9167   -0.4834    1
CV                   1        -0.5334    0.9167
AIC-BIC                        1        -0.4834
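The rank agreement summarized above can be approximated directly from the Table I columns. The sketch below uses a plain Spearman rank correlation (one common convention; the exact "Wilcoxon rank correlation" computation used here is not spelled out, so the values differ slightly from the table):

```python
import numpy as np

def rank_corr(a, b):
    """Spearman rank correlation for tie-free score vectors."""
    ra = np.argsort(np.argsort(a))   # ranks of a
    rb = np.argsort(np.argsort(b))   # ranks of b
    d = ra - rb
    n = len(a)
    return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

# Columns of Table I (case study A); U is maximized, the others minimized,
# so agreement of U with the MSEP shows up as a strong *negative* correlation.
u    = np.array([-8.5176, -7.9224, -5.0229, -2.2254, 0.6812, -0.4401, -2.4138, -2.2699, -2.6198])
aic  = np.array([435.9818, 446.8827, 452.7242, 459.5458, 464.6480, 475.6331, 501.6511, 601.6449, 664.0071])
msep = np.array([45.122, 37.946, 10.707, 6.3251, 2.0014, 2.8412, 6.5689, 6.2608, 7.7527])

rank_corr(u, msep)    # strongly negative: U ranking closely tracks the MSEP ranking
rank_corr(aic, msep)  # both minimized, so a negative value here signals disagreement
```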

to CV). The pairwise Wilcoxon rank correlation coefficients displayed by Table IV express strikingly these respective performances and confirm the quite satisfying behaviour of the U-criterion with respect to the MSEP. TABLE III S CORINGS OF THE SEVEN NEURAL MODELS ACCORDING TO THE FOUR CRITERIA AND THE

B. The spectroscopic Tecator data The previous Bayesian approach (U-criterion) was applied to the selection of a 2-output feedforward neural network for the Tecator meat data [8], [36]. The data recorded by a Tecator spectrometer (the Infratec Food and Feed Analyser) are available in the Statlib, by courtesy of the Tecator Company and H. H. Thodberg (http://lib.stat.cmu.edu/datasets/tecator). In [37], the single-output version of the proposed approach was applied to the selection of a multilayer perceptron for the prediction of the fat content of a meat sample on the basis of its near infrared absorbance spectrum as available in the Tecator data set. The results were compared to that of the MacKay’s Bayesian evidence method [22] used by Thodberg [36]. The goal is now to select a network which best predicts both the fat and protein meat contents. 1) The data: Following Thodberg recommendations the first n = 172 samples of the Tecator data set are used for computing the four selection criteria for each competing model. The 43 next ones are used to compute the MSEP of each model. The input variables are 13 preprocessed principal components of the spectra. The 2 output variables are the fat and protein meat contents. 2) The competing networks: 7 feedforward neural models with 13 inputs and 2 outputs are considered. These models were derived from the single-output network fp with 3 neurons on a single hidden layer, previously selected by the U-criterion for the fat prediction problem [37]. These 7 competing neural models, f1 , f2 , f3 , f4 , f5 , f6 , f7 , are made of a single hidden layer with 1, 2, 3, 4, 5, 6 and 7 fully connected neurons respectively. Table III shows the score reached by each of the 7 models for each of the 4 criteria U, CV, AIC, BIC, on the 172 first samples of the data set and the MSEP of the related networks on the 43 next samples. 
One can note that the U-ranking of the networks is much closer to the MSEPranking, than are the other three criteria rankings. The two best networks according to the U-criterion, f4 , f5 , are also the two best ones according to the MSEP. Idem for the two worst ones, f1 , f2 . With regard to the small size of the training data set with respect to the average parameter number of the competing networks, the performance of the U-criterion is rather satisfying. That of the CV-criterion is not so good, because of the conservative trend of CV which tends to favor unduly complex structures (CV has ranked the seven networks according to their decreasing complexity). The respective AIC and BIC-rankings are even much more unsatisfying, since unsurprisingly, these two criteria have penalised too much the network complexity and have simply ranked the seven networks according to their growing complexity (in contrast

TABLE III
CRITERIA SCORES AND MSEP - CASE STUDY B

Networks       f1         f2         f3         f4         f5         f6         f7
U           -3.7639    -2.4911    -1.7563    -1.5830    -1.7277    -1.7796    -1.7510
CV           1.9333     1.1056     0.6947     0.6162     0.6105     0.6048     0.5678
AIC        135.1183   172.6568   196.1045   229.3208   259.3319   290.8199   322.3335
BIC        166.8199   232.5376   284.1645   345.5600   403.7504   463.4175   523.1103
MSEP         6.7284     5.6680     2.4688     2.1241     2.0931     2.2661     2.8895
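For reference, AIC and BIC penalize the maximized log-likelihood by the parameter count, which for these fully connected single-hidden-layer networks grows linearly with the hidden-layer width. A hedged sketch of the standard definitions (the exact likelihood and parameter accounting behind the Table III values are not reproduced here; the count below assumes one bias per unit):

```python
import math

def n_params(n_in=13, n_hidden=3, n_out=2):
    """Weights and biases of a fully connected 1-hidden-layer MLP."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

def aic(loglik, k):
    """Akaike information criterion."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion for n observations."""
    return -2.0 * loglik + k * math.log(n)

# complexity of the seven competing networks f1..f7
ks = [n_params(n_hidden=h) for h in range(1, 8)]
# → [18, 34, 50, 66, 82, 98, 114]
```

The linear growth of `ks` with the hidden-layer width is what drives the monotone AIC and BIC rankings observed above.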

TABLE IV
WILCOXON RANK CORRELATIONS OF THE CRITERIA SCORINGS - CASE STUDY B

             U        CV       AIC-BIC
U            1
CV           0.5714   1
AIC-BIC     -0.5714  -1        1
MSEP         0.8214   0.5357  -0.5357
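The correlations of Table IV can be checked directly from Table III: rank each criterion from best to worst in its own direction (U higher is better, the others lower is better), then correlate the rankings. A sketch with a hand-rolled Spearman coefficient (assuming this is the rank correlation meant by the table caption):

```python
U    = [-3.7639, -2.4911, -1.7563, -1.5830, -1.7277, -1.7796, -1.7510]
CV   = [1.9333, 1.1056, 0.6947, 0.6162, 0.6105, 0.6048, 0.5678]
MSEP = [6.7284, 5.6680, 2.4688, 2.1241, 2.0931, 2.2661, 2.8895]

def ranks(scores, higher_is_better=False):
    """Rank 1 = best network under the criterion."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(ra, rb):
    """Spearman rank correlation (no ties)."""
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

rU, rCV, rM = ranks(U, higher_is_better=True), ranks(CV), ranks(MSEP)
print(round(spearman(rU, rM), 4))   # 0.8214, as in Table IV
print(round(spearman(rCV, rM), 4))  # 0.5357
print(round(spearman(rU, rCV), 4))  # 0.5714
```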

VII. CONCLUSION

This paper shows how the richness of information and the robustness attached to predictive probability distributions can benefit the selection of a multi-output feedforward neural net topology. The proposed Bayesian method relies upon a convergent approximation, built from a conjugate parameter prior density, of the neural net predictive probability distribution. This predictive distribution is used to define an expected utility criterion which can be consistently estimated on a sample-reuse basis. For a given data set, this criterion detects the neural model, in a given set, which on the whole most favors the likelihood of each observation with respect to the others. As compared to the evidence approach, which could be readily extended to multioutput networks, our posterior density approximations are normal and Wishart rather than normal, leading to multivariate Student approximations for the predictive densities. The behaviour of the criterion is compared, on simulated and actual neural model selection problems, with the behaviours of classic model selection criteria such as the point-prediction-based cross-validation criterion and the information-based AIC and BIC criteria. Both comparisons reveal the satisfactory trade-off reached by this Bayesian criterion between the fitness induced by structural neural complexity and the generalization capability offered by simpler structures. Moreover, the greater small-data-set robustness of the criterion with respect to that of the classic point-wise cross-validation criterion is also evidenced. Finally, because of its analytic basis, the computing cost of such a utility criterion is comparable to that of the standard cross-validation criterion and generally lower than that of the criteria based on MCMC Bayesian learning, without the problem of efficient stopping rules met by these last criteria.
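The expected-utility criterion at the heart of this comparison is, in essence, a sample-reuse average of log predictive densities rather than of squared point errors. A schematic illustration with a univariate Gaussian predictive as a stand-in for the paper's multivariate Student predictive (all names here are hypothetical, and the toy predictive is not the paper's construction):

```python
import math

def log_predictive_score(pairs, predictive):
    """Sample-reuse estimate of the expected utility: average log
    predictive density of each held-out point given the rest."""
    total = 0.0
    for i, (x, y) in enumerate(pairs):
        rest = pairs[:i] + pairs[i + 1:]
        mu, sigma = predictive(rest, x)  # predictive density parameters
        total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - (y - mu) ** 2 / (2 * sigma ** 2))
    return total / len(pairs)

def toy_predictive(rest, x):
    """Toy stand-in: sample mean/std of the remaining responses."""
    ys = [y for _, y in rest]
    m = sum(ys) / len(ys)
    v = sum((y - m) ** 2 for y in ys) / len(ys)
    return m, max(math.sqrt(v), 1e-6)

data = [(0.0, 0.1), (1.0, -0.2), (2.0, 0.05), (3.0, 0.0)]
u = log_predictive_score(data, toy_predictive)
```

A higher score favors models whose whole predictive density, not just its mode, supports the held-out observations; this is what distinguishes the U-criterion from point-wise cross-validation.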


APPENDIX I
BAYESIAN PRELIMINARIES

Proposition 1 (Bernardo and Smith [6]): Let Z = (z_1, ..., z_ℓ) be a random sample from a w-dimensional regular exponential family distribution, whose likelihood is given by

\[ p(Z|\phi) = \Big[\prod_{j=1}^{\ell} s(z_j)\Big]\, g(\phi)^{\ell} \exp\Big\{\sum_{i=1}^{w} c_i\, \psi_i(\phi) \sum_{j=1}^{\ell} h_i(z_j)\Big\} \quad (33) \]

Then the conjugate prior density of the parameter vector φ has the form

\[ p(\phi|T) = K[T]^{-1}\, g(\phi)^{\tau_0} \exp\Big\{\sum_{i=1}^{w} c_i\, \psi_i(\phi)\, \tau_i\Big\}, \qquad \phi \in \Phi \quad (34) \]

where T = (τ_0, τ_1, ..., τ_w), the vector of hyperparameters, is such that

\[ K[T] = \int_{\Phi} g(\phi)^{\tau_0} \exp\Big\{\sum_{i=1}^{w} c_i\, \psi_i(\phi)\, \tau_i\Big\}\, d\phi < \infty \quad (35) \]

Proposition 2 (Bernardo and Smith [6]): Under the assumptions of Proposition 1:

(i) the posterior density for φ is

\[ p(\phi|Z, T) = p(\phi|T + t_\ell(Z)) \quad (36) \]

where

\[ T + t_\ell(Z) = \Big(\tau_0 + \ell,\ \tau_1 + \sum_{j=1}^{\ell} h_1(z_j),\ \ldots,\ \tau_w + \sum_{j=1}^{\ell} h_w(z_j)\Big) \]

(ii) the predictive density for future observations Z̄ = (z̄_1, ..., z̄_m) is

\[ p(\bar{Z}|Z, T) = p(\bar{Z}|T + t_\ell(Z)) = \Big[\prod_{j=1}^{m} s(\bar{z}_j)\Big]\, \frac{K\big[T + t_\ell(Z) + t_m(\bar{Z})\big]}{K\big[T + t_\ell(Z)\big]} \quad (37) \]

where

\[ t_m(\bar{Z}) = \Big(m,\ \sum_{j=1}^{m} h_1(\bar{z}_j),\ \ldots,\ \sum_{j=1}^{m} h_w(\bar{z}_j)\Big) \]

The adaptation of these results to the context of multiresponse nonlinear regression introduced in Section 2 is straightforward. In this context, ℓ = 1, z_1 = y_{1:n} and dim(z_1) = nd.

APPENDIX II
PROOFS

A. Proof of Lemma 1

Let us first show that the expectation of the probability distribution of density p̂(θ, Λ|Z_n) converges to (θ*, Λ*) as defined in IV-C. By definition of the normal and Wishart probability distributions, and by the almost sure convergence of the maximum likelihood estimators (θ̂_n, Λ̂_n) to (θ*, Λ*), it comes

\[ E_{\hat{p}}\big[(\theta, \Lambda)\,|\,Z_n\big] = \Big(\hat{\theta}_n,\ \frac{2n+d+1}{2n}\,\hat{\Lambda}_n\Big) \xrightarrow[n\to\infty]{} (\theta^*, \Lambda^*) \quad \text{a.s.} \]

Let us show now that the variance of the probability distribution of density p̂(θ, Λ|Z_n) tends to zero as n grows to infinity.

• Let β_n be the inverse of the variance-covariance matrix of θ:

\[ \beta_n = V_\theta^{-1} = 2\sum_{i=1}^{n} \dot{f}'_{x_i,\hat{\theta}_n}\, \hat{\Lambda}_n\, \dot{f}_{x_i,\hat{\theta}_n} = 2n\,\tilde{\beta}_n, \qquad \tilde{\beta}_n = \frac{1}{n}\sum_{i=1}^{n} \dot{f}'_{x_i,\hat{\theta}_n}\, \hat{\Lambda}_n\, \dot{f}_{x_i,\hat{\theta}_n} \]

Let us show that β_n grows to infinity with n. Let

\[ \dot{\beta}_n = \frac{1}{n}\sum_{i=1}^{n} \dot{f}'_{x_i,\theta^*}\, \Lambda^*\, \dot{f}_{x_i,\theta^*} \]

According to the strong law of large numbers, as the x_i are i.i.d., one has

\[ \lim_{n\to\infty} \dot{\beta}_n = E_x\big[\dot{f}'_{x,\theta^*}\, \Lambda^*\, \dot{f}_{x,\theta^*}\big] = \tilde{\beta} \quad \text{a.s.} \]

As the {x_i} belong to a compact set and f is C¹, we can deduce that

\[ \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \dot{f}'_{x_i,\hat{\theta}_n}\, \hat{\Lambda}_n\, \dot{f}_{x_i,\hat{\theta}_n} = \tilde{\beta} \quad \text{a.s.} \]

Let us show that β̃ is positive definite. For all u ∈ IR^q and all n, u'β̃_n u ≥ 0, and then u'β̃u ≥ 0. If u is such that u'β̃u = 0, we have for all i ∈ IN

\[ \lim_{n\to\infty} \|\dot{f}_{x_i,\hat{\theta}_n}\, u\|_{\hat{\Lambda}_n} = \|\dot{f}_{x_i,\theta^*}\, u\|_{\Lambda^*} = 0 \quad \text{a.s.} \]

A u ≠ 0 would imply that f does not depend on all the parameters θ, which contradicts the definition of f. Then u must be equal to zero and β̃ is positive definite. Hence V_θ^{-1} ∼ 2nβ̃ as n → ∞ and V_θ → 0.


• Let us study the variance of Λ. Let λ_ij be the ij-th term of the matrix Λ, which follows the Wishart distribution included in (20). According to [32] about the Wishart distribution,

\[ \lambda_{ii} \sim \frac{2}{n}\,\big(\hat{\Lambda}_n\big)_{ii}\, \chi^2_{2n+d+1} \]

then

\[ V(\lambda_{ii}) = \frac{8\,\big(\hat{\Lambda}_n\big)_{ii}^2\,(2n+d+1)}{n^2} \xrightarrow[n\to\infty]{} 0 \quad (38) \]

Let l_{i,j} be the d-dimensional vector with the i-th and j-th components equal to 1 and the others equal to 0. According to [32],

\[ \lambda_{ii} + \lambda_{jj} + 2\lambda_{ij} \sim \frac{2}{n}\,\big(l'_{i,j}\hat{\Lambda}_n l_{i,j}\big)\, \chi^2_{2n+d+1} \]

and

\[ V(\lambda_{ii} + \lambda_{jj} + 2\lambda_{ij}) = \frac{8\,\big(l'_{i,j}\hat{\Lambda}_n l_{i,j}\big)^2\,(2n+d+1)}{n^2} \xrightarrow[n\to\infty]{} 0 \]

Then V(λ_ij) → 0 as n → ∞.

Let A be a measurable subset of Θ × S including an open neighbourhood of (θ*, Λ*). There exists ϵ such that B_ϵ(θ*, Λ*) ⊂ A, where B_ϵ(θ*, Λ*) is the closed parallelotope of side ϵ centered at (θ*, Λ*). As (θ̂_n, Λ̂_n) converges a.s. to (θ*, Λ*), there exists N ∈ IN such that for all n > N,

\[ B_{\epsilon/2}(\hat{\theta}_n, \hat{\Lambda}_n) \subset B_\epsilon(\theta^*, \Lambda^*) \quad \text{a.s.} \]

Then

\[ \hat{P}\big(B_{\epsilon/2}(\hat{\theta}_n, \hat{\Lambda}_n)\big) \le \hat{P}\big(B_\epsilon(\theta^*, \Lambda^*)\big) \le \hat{P}(A) \quad \text{a.s.} \]

Let η_i, i = 1, ..., q, q+1, ..., q + d(d+1)/2, denote the q components of θ and the d(d+1)/2 components of Λ, let η̂_i^n = E_{p̂_n}(η_i), and let K = q + d(d+1)/2 be the total number of network model parameters. Then

\[ \hat{P}\big(B_{\epsilon/2}(\hat{\theta}_n, \hat{\Lambda}_n)\big) = 1 - \hat{P}\big(\{\eta : \max_{i=1,\ldots,K} |\eta_i - \hat{\eta}_i^n| > \epsilon/2\}\big) \]

According to the Markov inequality, for all i = 1, ..., K, one has

\[ \hat{P}\big(|\eta_i - \hat{\eta}_i^n| > \epsilon/2\big) \le \frac{4\,V(\eta_i)}{\epsilon^2} \]

As we showed previously, lim_{n→∞} V(η_i) = 0 for all i = 1, ..., K. Hence, for all ε > 0 there exists N_i ∈ IN such that 4V(η_i)/ϵ² ≤ ε/K for n > N_i. Let N = max{N_i, i = 1, ..., K}. For all n > N, one then has

\[ \hat{P}\big(\max_{i=1,\ldots,K} |\eta_i - \hat{\eta}_i^n| > \epsilon/2\big) \le \varepsilon \]

Finally, for all ε > 0 there exists N ∈ IN such that for all n > N,

\[ \hat{P}(A) \ge 1 - \varepsilon \quad \text{a.s.} \]

B. Proof of Theorem 1

Let β_n = 2 Σ_{i=1}^{n} f˙'_{x_i,θ̂_n} Λ̂_n f˙_{x_i,θ̂_n}, as in the proof of Lemma 1. Let C be a compact subset of Θ × S including an open neighbourhood of (θ*, Λ*), and let C^c be the subset of Θ × S complementary to C. Then

\[ \int \big|\hat{p}(\theta,\Lambda|Z_n) - p(\theta,\Lambda|Z_n)\big|\, d\theta d\Lambda = \int_C |\hat{p} - p| + \int_{C^c} |\hat{p} - p| \le \int_C |\hat{p} - p| + \int_{C^c} \hat{p} + \int_{C^c} p \quad (39) \]

By Lemma 1,

\[ \int_{C^c} \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda \xrightarrow[n\to\infty]{} 0 \quad \text{a.s.} \]

Moreover, by consistency of posterior densities [1],

\[ \int_{C^c} p(\theta,\Lambda|Z_n)\, d\theta d\Lambda \xrightarrow[n\to\infty]{} 0 \quad \text{a.s.} \]

To prove the theorem it remains to show that ∫_C |p̂(θ,Λ|Z_n) − p(θ,Λ|Z_n)| dθdΛ → 0 a.s. Let C_S = Proj_Θ(C) × S, where Proj_Θ(C) denotes the projection of C upon Θ, and let us note that C ⊂ C_S. We are going to show the stronger result

\[ \lim_{n\to\infty} \int_{C_S} \big|\hat{p}(\theta,\Lambda|Z_n) - p(\theta,\Lambda|Z_n)\big|\, d\theta d\Lambda = 0 \quad \text{a.s.} \]

From (19),

\[ p(\theta,\Lambda|Z_n) = K_n\, |\Lambda|^n \exp\Big\{-\sum_{i=1}^{n} \|y_i - f(x_i,\theta)\|^2_\Lambda\Big\} \quad (40) \]

and from (20),

\[ \hat{p}(\theta,\Lambda|Z_n) = \hat{K}_n\, |\Lambda|^n \exp\Big\{-\frac{1}{2}\|\theta - \hat{\theta}_n\|^2_{\beta_n} - \sum_{i=1}^{n} \|y_i - f(x_i,\hat{\theta}_n)\|^2_\Lambda\Big\} \quad (41) \]

where K̂_n is a normalizing constant. Let us denote E_{p̂,C_S}[·] = ∫_{C_S} [·] p̂(θ,Λ|Z_n) dθdΛ.

Let us show that the Kullback-Leibler distance between the distributions p̂ and p over C_S tends almost surely to 0 as n grows to infinity. This will result in their convergence in L1 norm:

\[ K_{C_S}(\hat{p}, p) = \int_{C_S} \hat{p}(\theta,\Lambda|Z_n) \log\frac{\hat{p}(\theta,\Lambda|Z_n)}{p(\theta,\Lambda|Z_n)}\, d\theta d\Lambda = \log\frac{\hat{K}_n}{K_n} + E_{\hat{p},C_S}\Big[-\frac{1}{2}\|\theta - \hat{\theta}_n\|^2_{\beta_n} - \sum_{i=1}^{n} \|y_i - f(x_i,\hat{\theta}_n)\|^2_\Lambda + \sum_{i=1}^{n} \|y_i - f(x_i,\theta)\|^2_\Lambda\Big] \quad (42) \]

Let us consider the successive terms of K_{C_S}(p̂, p):

• Let E^n_{C_S} = E_{p̂,C_S}[−½‖θ − θ̂_n‖²_{β_n}], which is finite. Since ‖θ − θ̂_n‖²_{β_n} ∼ χ²_q under p̂, it comes immediately

\[ 0 > E^n_{C_S} > E_{\hat{p}}\Big[-\frac{1}{2}\|\theta - \hat{\theta}_n\|^2_{\beta_n}\Big] = -\frac{q}{2} \]

• For all i = 1, ..., n, in a neighbourhood of θ̂_n it holds


\[ \|y_i - f(x_i,\theta)\|^2_\Lambda = \|y_i - f(x_i,\hat{\theta}_n) + \dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\|^2_\Lambda + o(\|\theta - \hat{\theta}_n\|^2) = \|y_i - f(x_i,\hat{\theta}_n)\|^2_\Lambda + \|\dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\|^2_\Lambda + 2\,\langle y_i - f(x_i,\hat{\theta}_n),\ \dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\rangle_\Lambda + o(\|\theta - \hat{\theta}_n\|^2) \quad (43) \]

and

\[ -\sum_{i=1}^{n} \|y_i - f(x_i,\hat{\theta}_n)\|^2_\Lambda + \sum_{i=1}^{n} \|y_i - f(x_i,\theta)\|^2_\Lambda = \sum_{i=1}^{n} \|\dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\|^2_\Lambda + 2\sum_{i=1}^{n} \langle y_i - f(x_i,\hat{\theta}_n),\ \dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\rangle_\Lambda + n\, o(\|\theta - \hat{\theta}_n\|^2) \quad (44) \]

According to the definition of the Wishart probability distribution, E_p̂[Λ] = ((2n+d+1)/(2n)) Λ̂_n. Then

\[ E_{\hat{p},C_S}\Big[\sum_{i=1}^{n} \|\dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\|^2_\Lambda\Big] = E_{\hat{p},C_S}\Big[E_{\hat{p}}\Big[\sum_{i=1}^{n} \|\dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\|^2_\Lambda \,\Big|\, \theta\Big]\Big] = E_{\hat{p},C_S}\Big[\frac{2n+d+1}{2n}\sum_{i=1}^{n} \|\dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\|^2_{\hat{\Lambda}_n}\Big] = \frac{2n+d+1}{2n}\, E_{\hat{p},C_S}\Big[\frac{1}{2}\|\theta - \hat{\theta}_n\|^2_{\beta_n}\Big] = -\frac{2n+d+1}{2n}\, E^n_{C_S} \quad (45) \]

In the same way,

\[ E_{\hat{p},C_S}\Big[\sum_{i=1}^{n} \langle y_i - f(x_i,\hat{\theta}_n),\ \dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\rangle_\Lambda\Big] = \frac{2n+d+1}{2n}\, E_{\hat{p},C_S}\Big[\sum_{i=1}^{n} \langle y_i - f(x_i,\hat{\theta}_n),\ \dot{f}_{x_i,\hat{\theta}_n}(\theta - \hat{\theta}_n)\rangle_{\hat{\Lambda}_n}\Big] = \frac{2n+d+1}{2n}\, E_{\hat{p},C_S}[0] = 0 \quad (46) \]

since (θ̂_n, Λ̂_n) are the least squares estimators of (θ, Λ).

The Kullback-Leibler distance between p̂ and p over C_S then becomes

\[ K_{C_S}(\hat{p}, p) = \log\frac{\hat{K}_n}{K_n} + E^n_{C_S} - \frac{2n+d+1}{2n}\, E^n_{C_S} + E_{\hat{p},C_S}\big[n\, o(\|\theta - \hat{\theta}_n\|^2)\big] \quad (47) \]

Let us show now that lim_{n→∞} E_{p̂,C_S}[n o(‖θ − θ̂_n‖²)] = 0. It was shown in the proof of Lemma 1 that β_n ∼ 2nβ̃ as n grows to infinity, with β̃ a positive definite matrix. By the equivalence of norms on IR^q there exist α_1 and α_2 positive such that

\[ \alpha_1 \|\theta - \hat{\theta}_n\|^2_{\tilde{\beta}} \le \|\theta - \hat{\theta}_n\|^2 \le \alpha_2 \|\theta - \hat{\theta}_n\|^2_{\tilde{\beta}} \]

and then, for n large,

\[ \frac{\alpha_1}{2}\, \|\theta - \hat{\theta}_n\|^2_{\beta_n} \le n\|\theta - \hat{\theta}_n\|^2 \le \frac{\alpha_2}{2}\, \|\theta - \hat{\theta}_n\|^2_{\beta_n} \]

As ‖θ − θ̂_n‖²_{β_n} ∼ χ²_q under p̂, it comes

\[ \frac{\alpha_1 q}{2} \le E_{\hat{p}}\big[n\|\theta - \hat{\theta}_n\|^2\big] \le \frac{\alpha_2 q}{2} \]

There exist then α̃_1 and α̃_2 positive such that

\[ \tilde{\alpha}_1 \le E_{\hat{p},C_S}\big[n\|\theta - \hat{\theta}_n\|^2\big] \le \tilde{\alpha}_2 \]

Let us come back to the study of the o(‖θ − θ̂_n‖²) in (43). For each couple (x_i, y_i), let us note g_n^i(θ) = o(‖θ − θ̂_n‖²), so that

\[ \sum_{i=1}^{n} g_n^i(\theta) = n\, o(\|\theta - \hat{\theta}_n\|^2) \]

Let

\[ \tilde{g}_n^i(\theta) = \frac{g_n^i(\theta)}{\|\theta - \hat{\theta}_n\|^2} \qquad \text{and} \qquad \tilde{g}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \tilde{g}_n^i(\theta) \]

Then lim_{θ→θ̂_n} g̃_n(θ) = 0. As lim_{n→∞} θ̂_n = θ* a.s., for all i ∈ IN* we have lim_{n→∞} g̃_n^i(θ*) = 0. Moreover, as the {x_i} belong to the compact subset X and the model function f is C¹ with respect to x and θ, the last convergence is uniform with respect to x:

\[ \forall\, \varepsilon > 0,\ \exists\, N_\varepsilon : \forall\, x \in \mathcal{X},\ \forall\, n > N_\varepsilon, \quad |\tilde{g}_n^i(\theta^*)| < \varepsilon \]

and hence

\[ \lim_{n\to\infty} \tilde{g}_n(\theta^*) = 0 \]

Now let us introduce g̃_n in the expectation of interest:


\[ E_{\hat{p},C_S}\big[n\, o(\|\theta - \hat{\theta}_n\|^2)\big] = E_{\hat{p},C_S}\big[\tilde{g}_n(\theta)\, n\|\theta - \hat{\theta}_n\|^2\big] \]

For all ε > 0, let V_ε be the ball of radius ε centred at (θ*, Λ*), with ε chosen sufficiently small such that V_ε ⊂ C_S. Then

\[ E_{\hat{p},C_S}\big[n\, o(\|\theta - \hat{\theta}_n\|^2)\big] = \int_{C_S\setminus V_\varepsilon} \tilde{g}_n(\theta)\, n\|\theta - \hat{\theta}_n\|^2\, \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda + \int_{V_\varepsilon} \tilde{g}_n(\theta)\, n\|\theta - \hat{\theta}_n\|^2\, \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda \quad (48) \]

As p̂ is consistent, for all ε > 0 such that V_ε ⊂ C_S there exists N_ε such that for all n > N_ε,

\[ \int_{C_S\setminus V_\varepsilon} n\|\theta - \hat{\theta}_n\|^2\, \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda < \varepsilon \]

Then for all n > N_ε,

\[ E_{\hat{p},C_S}\big[n\, o(\|\theta - \hat{\theta}_n\|^2)\big] \le \sup_{\theta\in C_S} \tilde{g}_n(\theta)\, \varepsilon + \sup_{\theta\in V_\varepsilon} \tilde{g}_n(\theta)\, \tilde{\alpha}_2 \]

which implies

\[ \lim_{n\to\infty} E_{\hat{p},C_S}\big[n\, o(\|\theta - \hat{\theta}_n\|^2)\big] = 0 \quad (49) \]

Finally, let us show that lim_{n→∞} log(K̂_n/K_n) = 0. To alleviate notations let us denote, from (40) and (41),

\[ p = K_n \times h_n \qquad \text{and} \qquad \hat{p} = \hat{K}_n \times \hat{h}_n \]

Because of (44), (45) and (46), for n → ∞,

\[ E_{\hat{p},C_S}\Big[\log\frac{\hat{h}_n}{h_n}\Big] \to 0 \quad \text{and} \quad E_{\hat{p},C_S}\Big[\log\frac{h_n}{\hat{h}_n}\Big] \to 0 \quad \text{a.s.} \quad (50) \]

Let us follow a reductio ad absurdum by assuming that lim_{n→∞} K̂_n/K_n ≠ 1.

i) Suppose first that lim_{n→∞} K̂_n/K_n < 1. Since p(θ, Λ|Z_n) is consistent,

\[ \lim_{n\to\infty} \int_{C_S} K_n h_n\, d\theta d\Lambda = 1 \quad (51) \]

so that E_{p̂,C_S}[h_n/ĥ_n] = (K̂_n/K_n) ∫_{C_S} K_n h_n dθdΛ tends to lim_{n→∞} K̂_n/K_n < 1. Due to the convexity of the exponential function and the consistency of p̂(θ, Λ|Z_n), there exists N such that for n > N the Jensen inequality can be applied to E_{p̂,C_S}[h_n/ĥ_n] and gives

\[ \exp E_{\hat{p},C_S}\Big[\log\frac{h_n}{\hat{h}_n}\Big] \le E_{\hat{p},C_S}\Big[\frac{h_n}{\hat{h}_n}\Big] \]

By (50), the left-hand side tends to 1, which contradicts lim_{n→∞} E_{p̂,C_S}[h_n/ĥ_n] < 1.

ii) Suppose now that lim_{n→∞} K̂_n/K_n > 1. By a similar reasoning, for sufficiently great n,

\[ \log E_{\hat{p},C_S}\Big[\frac{\hat{h}_n}{h_n}\Big] < 0 \quad (52) \]

Due to the convexity of the −log function and the possibility of applying the Jensen inequality to E_{p̂,C_S}[ĥ_n/h_n], we have by (50)

\[ -\log E_{\hat{p},C_S}\Big[\frac{\hat{h}_n}{h_n}\Big] \le E_{\hat{p},C_S}\Big[\log\frac{h_n}{\hat{h}_n}\Big] \xrightarrow[n\to\infty]{} 0 \]

which contradicts (52). From i) and ii) we can deduce that lim_{n→∞} log(K̂_n/K_n) = 0, and then, from (47) and (49), that

\[ K_{C_S}(\hat{p}, p) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s.} \quad (53) \]

This completes the proof of Theorem 1, since Kullback convergence dominates L1 convergence over C_S, and then over C.

C. Proof of Theorem 2

\[ D = \int \big|\hat{p}(y|x,Z_n) - p(y|x,Z_n)\big|\, dy = \int \Big|\int \hat{p}(y|\theta,\Lambda,x)\, \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda - \int p(y|\theta,\Lambda,x)\, p(\theta,\Lambda|Z_n)\, d\theta d\Lambda\Big|\, dy \]
\[ = \int \Big|\int p(y|\theta,\Lambda,x)\big[\hat{p}(\theta,\Lambda|Z_n) - p(\theta,\Lambda|Z_n)\big]\, d\theta d\Lambda + \int \hat{p}(\theta,\Lambda|Z_n)\big[\hat{p}(y|\theta,\Lambda,x) - p(y|\theta,\Lambda,x)\big]\, d\theta d\Lambda\Big|\, dy \]
\[ \le \iint p(y|\theta,\Lambda,x)\, \big|\hat{p}(\theta,\Lambda|Z_n) - p(\theta,\Lambda|Z_n)\big|\, d\theta d\Lambda\, dy + \iint \hat{p}(\theta,\Lambda|Z_n)\, \big|\hat{p}(y|\theta,\Lambda,x) - p(y|\theta,\Lambda,x)\big|\, d\theta d\Lambda\, dy \]


By Fubini's theorem,

\[ D \le \int \big|\hat{p}(\theta,\Lambda|Z_n) - p(\theta,\Lambda|Z_n)\big|\, d\theta d\Lambda + \int \hat{p}(\theta,\Lambda|Z_n) \int \big|\hat{p}(y|\theta,\Lambda,x) - p(y|\theta,\Lambda,x)\big|\, dy\, d\theta d\Lambda = T_1 + T_2 \]

As p̂(θ, Λ|Z_n) is assumed to be an L1-convergent approximation of p(θ, Λ|Z_n), T_1 tends to zero as n grows to infinity. Let us show that the same is true for T_2.

Let h(θ, θ̂_n) = ∫ |p̂(y|θ,Λ,x) − p(y|θ,Λ,x)| dy. Obviously 0 ≤ h(·,·) ≤ 2. The mapping h is continuous and h(θ̂_n, θ̂_n) = 0 for all n ∈ IN*. As lim_{n→∞} (θ̂_n, Λ̂_n) = (θ*, Λ*) a.s., we deduce that lim_{n→∞} h(θ*, θ̂_n) = 0. Moreover, for all ε > 0 there exist a neighbourhood V_ε of (θ*, Λ*) and an integer N_1 such that, for almost all (θ, Λ) ∈ V_ε and all n > N_1, we have h(θ, θ̂_n) < ε/2. Let us now split T_2 according to V_ε:

\[ T_2 = \int \hat{p}(\theta,\Lambda|Z_n)\, h(\theta,\hat{\theta}_n)\, d\theta d\Lambda = \int_{V_\varepsilon} + \int_{V_\varepsilon^c} \le \int_{V_\varepsilon^c} \hat{p}(\theta,\Lambda|Z_n)\, h(\theta,\hat{\theta}_n)\, d\theta d\Lambda + \varepsilon/2 \le 2\int_{V_\varepsilon^c} \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda + \varepsilon/2 \]

Due to the consistency of p̂(θ, Λ|Z_n) as n grows to infinity, there exists an integer N_2 such that for all n > N_2,

\[ \int_{V_\varepsilon^c} \hat{p}(\theta,\Lambda|Z_n)\, d\theta d\Lambda < \varepsilon/4 \]

and then T_2 < ε. It follows that D tends to zero as n grows to infinity.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the three anonymous referees for detailed comments which helped to improve the presentation of the paper.

REFERENCES

[1] C. Abraham and B. Cadre, "Asymptotic properties of posterior distributions derived from misspecified models," C.R. Acad. Sci. Paris, Ser. I, vol. 335, pp. 495-498, 2002.
[2] U. Anders and O. Korn, "Model selection in neural networks," Neural Networks, vol. 12, pp. 309-323, 1999.
[3] J.O. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer, 1985.
[4] R.H. Berk, "Limiting behavior of posterior distributions when the model is incorrect," Ann. Math. Statist., vol. 37, pp. 51-58, 1966.
[5] R.H. Berk, "Consistency a posteriori," Ann. Math. Statist., vol. 41, pp. 894-906, 1970.
[6] J.M. Bernardo and A.F.M. Smith, Bayesian Theory. Chichester: Wiley, 2000.
[7] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[8] C. Borggaard and H.H. Thodberg, "Optimal minimal neural interpretation of spectra," Anal. Chemistry, vol. 64, pp. 545-551, 1992.
[9] W.L. Buntine and A.S. Weigend, "Bayesian backpropagation," Complex Syst., vol. 5, pp. 603-643, 1991.
[10] K.P. Burnham and D.R. Anderson, Model Selection and Inference. New York: Springer, 1998.

[11] A.M. Chen, H. Lu, and R. Hecht-Nielsen, "On the geometry of feedforward neural-network error surfaces," Neural Comput., vol. 5, pp. 910-927, 1993.
[12] J.F.G. De Freitas, M.A. Niranjan, A.H. Gee, and A. Doucet, "Sequential Monte Carlo methods to train neural network models," Neural Computation, vol. 12, no. 4, pp. 955-993, 2000.
[13] R.V. Foutz, "On the unique consistent solution to the likelihood equations," J. Amer. Statist. Ass., vol. 72, pp. 147-148, 1977.
[14] S. Geisser and W.F. Eddy, "A predictive approach to model selection," J. Amer. Statist. Ass., vol. 74, pp. 153-160, 1979.
[15] A.E. Gelfand, "Model determination using sampling-based methods," in Markov Chain Monte Carlo in Practice, W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, Eds. London: Chapman and Hall, 1996.
[16] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma," Neural Computation, vol. 4, pp. 1-58, 1992.
[17] R. Gencay and M. Qi, "Pricing and hedging derivative securities with neural networks: Bayesian regularization, early stopping and bagging," IEEE Trans. Neural Networks, vol. 12, pp. 726-734, 2001.
[18] D. Husmeier, W.D. Penny, and S.J. Roberts, "An empirical evaluation of Bayesian sampling with hybrid Monte Carlo for training neural network classifiers," Neural Networks, vol. 12, pp. 677-705, 1999.
[19] T.Y. Kwok and D.Y. Yeung, "Constructive algorithms for structure learning in feedforward neural networks for regression problems," IEEE Trans. Neural Networks, vol. 8, pp. 448-472, 1997.
[20] J. Lampinen and A. Vehtari, "Bayesian approach for neural networks--review and case studies," Neural Networks, vol. 14, pp. 257-274, 2001.
[21] H.K.H. Lee, "A noninformative prior for neural networks," Machine Learning, vol. 50, pp. 197-212, 2003.
[22] D.J.C. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Comput., vol. 4, pp. 448-472, 1992.
[23] D.J.C. MacKay, "Hyperparameters: optimize or integrate out?" in Maximum Entropy and Bayesian Methods. The Netherlands: Kluwer, 1995.
[24] J.S. Maritz and T. Lwin, Empirical Bayes Methods, 2nd ed. London: Chapman and Hall, 1989.
[25] M.C. Medeiros, A. Veiga, and C.E. Pedreira, "Modeling exchange rates: smooth transitions, neural networks and linear models," IEEE Trans. Neural Networks, vol. 12, pp. 755-764, 2001.
[26] N. Murata, S. Yoshizawa, and S. Amari, "A criterion for determining the number of parameters in an artificial neural network model," in Artificial Neural Networks: Proceedings of ICANN-91, T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, Eds., vol. 1, pp. 9-14. Amsterdam: North Holland, 1991.
[27] R.M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics 118. New York: Springer, 1996.
[28] O. Nelles, Nonlinear System Identification. New York: Springer, 2001.
[29] B.D. Ripley, Pattern Recognition and Neural Networks. Cambridge: Cambridge Univ. Press, 1996.
[30] I. Rivals and L. Personnaz, "Neural-network construction and selection in nonlinear modeling," IEEE Trans. Neural Networks, vol. 14, pp. 804-819, 2003.
[31] V. Rossi and J.P. Vila, "Bayesian selection of multiresponse nonlinear regression model," Rapport de Recherche No 04-01, Groupe de Biostatistique et d'Analyse des Systèmes, ENSAM-INRA-UMII, 36 p., 2004.
[32] G.A.F. Seber, Multivariate Observations. New York: Wiley, 1984.
[33] G.A.F. Seber and C.J. Wild, Nonlinear Regression. New York: Wiley, 1989.
[34] J. Shao, "Linear model selection by cross-validation," J. Amer. Statist. Ass., vol. 88, pp. 486-494, 1993.
[35] M. Stone, "Cross-validatory choice and assessment of statistical predictions," J. Royal Statist. Soc. B, vol. 36, pp. 111-147 (with discussion), 1974.
[36] H.H. Thodberg, "A review of Bayesian neural networks with an application to near infrared spectroscopy," IEEE Trans. Neural Networks, vol. 7, pp. 56-72, 1996.
[37] J.P. Vila, V. Wagner, and P. Neveu, "Bayesian nonlinear model selection and neural networks: a conjugate prior approach," IEEE Trans. Neural Networks, vol. 11, pp. 265-278, 2000.
[38] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, pp. 1-25, 1982.
[39] D.H. Wolpert, "On the use of evidence in neural networks," in Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann, pp. 539-546, 1993.
[40] B.L. Zhang, R. Coggins, M.A. Jabri, D. Dersch, and B. Flower, "Multiresolution forecasting for futures trading using wavelet decomposition," IEEE Trans. Neural Networks, vol. 12, pp. 765-775, 2001.


Vivien Rossi was born in Montpellier, France, in 1976. He received the Ph.D. degree in statistics from the Ecole Nationale Supérieure Agronomique de Montpellier in 2004. He is currently with the Probability and Statistics Team, Institute of Mathematics and Modelling of Montpellier, University of Montpellier II. His main research interests include Bayesian statistics, nonlinear filtering (especially particle methods), nonparametric estimation and neural networks.

Jean-Pierre Vila (M'98) received the Docteur-Ingénieur degree in mathematical statistics from the University of Paris XI-Orsay in 1985. He is currently Research Director in the Department of Applied Mathematics and Computer Science of INRA, the French National Agronomical Research Institute. His main research interests include statistics and control theory of nonlinear dynamical systems, with special regard to nonparametric estimation, neural networks, filtering theory, and their applications in the life sciences.