Support Vector Machine in Finance

Benjamin Bruder, Research & Development, Lyxor Asset Management, Paris ([email protected])

Tung-Lam Dao, Research & Development, Lyxor Asset Management, Paris ([email protected])

Thierry Roncalli, Research & Development, Lyxor Asset Management, Paris ([email protected])

July 2011

Abstract
We review in this note the well-known machine learning technique called the support vector machine (SVM). According to Vapnik [1998], this technique can be employed in different contexts such as classification, regression or density estimation. In this paper, we first give an overview of the method and of its numerical implementation, and then bridge it to financial applications such as stock selection or the prediction of market trends.

Keywords: Machine learning, statistical learning, support vector machine, regression, classification, stock selection.
JEL classification: C0, G11, G17.

1 Introduction

The support vector machine is an important part of Statistical Learning Theory. It was first introduced in the early 1990s by Boser et al. (1992) and has contributed important applications in various domains such as pattern recognition (for example handwritten digit or image recognition) and bioinformatics. This technique can be employed in different contexts such as classification, regression or density estimation according to Vapnik [1998]. Recently, different applications in the financial field have been developed in two main directions. The first one employs the SVM as a non-linear estimator in order to forecast the market trend or volatility. In this context, the SVM is used as a regression technique, with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using the SVM as a classification technique which aims at stock selection in trading strategies (for example long/short strategies). In this paper, we review the support vector machine and its application in finance from both points of view. The literature of this recent field is quite diversified and divergent, with many approaches and different techniques. We first give an overview of the SVM from its basic construction to its extensions, including the multi-class classification problem. We then present different numerical implementations and bridge them to financial applications.


This paper is organized as follows. In Section 2, we recall the framework of support vector machine theory based on the approach proposed in Chapelle (2002). We next work out various implementations of this technique from both the primal and the dual problems in Section 3. The extension of the SVM to multi-class classification is discussed in Section 4. We finish with the introduction of the SVM in the financial domain via an example of stock selection in Sections 5 and 6.

2 Support vector machine at a glance

We attempt to give an overview of the support vector machine method in this section. In order to introduce the basic idea of the SVM, we start with a first discussion of the classification method via the concepts of hard margin and soft margin classification. As the work pioneered by Vapnik and Chervonenkis (1971) established a framework for Statistical Learning Theory, the so-called "VC theory", we then give a brief introduction with basic notation and the important Vapnik-Chervonenkis theorem for the Empirical Risk Minimization (ERM) principle. The extension of ERM to Vicinal Risk Minimization (VRM) will also be discussed.

2.1 Basic ideas of SVM

We illustrate here the basic ideas of the SVM as a classification method. The main advantage of the SVM is that it can not only be described very intuitively in the context of linear classification but also extended in an elegant way to the non-linear case. Let us define the training dataset consisting of pairs of input/output points (x_i, y_i), with 1 ≤ i ≤ n. Here the input vector x_i belongs to some space X whereas the output y_i belongs to {-1, 1} in the case of binary classification. The output y_i is used to identify the two possible classes.

2.1.1 Hard margin classification

The simplest idea of linear classification is to look at the whole set of inputs {x_i} ⊂ X and search for a hyperplane which separates the data into two classes according to the labels y_i = ±1. It consists of constructing a linear discriminant function of the form:

h(x) = w^\top x + b

where the vector w is the weight vector and b is called the bias. The hyperplane is defined by the following equation:

H = {x : h(x) = w^\top x + b = 0}

This hyperplane divides the space X into two regions: the region where the discriminant function takes positive values and the region where it takes negative values. The hyperplane is also called the decision boundary. The classification is linear because this boundary depends on the data in a linear way. We now define the notion of margin. In Figure 1 (reprinted from Ben-Hur et al., 2010), we give a geometric interpretation of the margin in a linear SVM. Let x_+ and x_- be the closest points to the hyperplane on the positive and the negative side. The circled data points are the support vectors, that is the points closest to the decision boundary (see Figure 1). The vector w is normal to the hyperplane; we denote its norm by \|w\| = \sqrt{w^\top w} and its direction by \hat{w} = w / \|w\|.

Figure 1: Geometric interpretation of the margin in a linear SVM.

We assume that x_+ and x_- are equidistant from the decision boundary. They determine the margin with which the two classes of points of the dataset D are separated:

m_D(h) = \frac{1}{2} \hat{w}^\top (x_+ - x_-)

Geometrically, this margin is just half the distance between the two closest points from either side of the hyperplane H, projected onto the direction \hat{w}. We use the equations that define the relative positions of these points with respect to the hyperplane H:

h(x_+) = w^\top x_+ + b = a
h(x_-) = w^\top x_- + b = -a

where a > 0 is some constant. As the normal vector w and the bias b are determined only up to a scaling factor, we can simply divide them by a and renormalize these equations. This is equivalent to setting a = 1 in the above expressions, and we finally get

m_D(h) = \frac{1}{2} \hat{w}^\top (x_+ - x_-) = \frac{1}{\|w\|}

The basic idea of the maximum margin classifier is to determine the hyperplane which maximizes the margin. For a separable dataset, we can define the hard margin SVM as the following optimization problem:

\min_{w,b} \frac{1}{2} \|w\|^2    (1)
u.c.  y_i (w^\top x_i + b) \geq 1,  i = 1,\dots,n

Here, y_i (w^\top x_i + b) \geq 1 is just a compact way to express the relative position of the two classes of data points with respect to the hyperplane H. In fact, we have w^\top x_i + b \geq 1 for the class y_i = 1 and w^\top x_i + b \leq -1 for the class y_i = -1.


The historical approach to solving this quadratic program is to map the primal problem to the dual problem. We give here the main result, while the detailed derivation can be found in Appendix A. Via the KKT theorem, this approach gives us the following optimal solution (w*, b*):

w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i x_i

where \alpha^\star = (\alpha_1^\star, \dots, \alpha_n^\star) is the solution of the dual optimization problem with dual variable \alpha = (\alpha_1, \dots, \alpha_n) of dimension n:

\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j
u.c.  \alpha_i \geq 0,  i = 1,\dots,n

We remark that the above optimization problem is a quadratic program in the vector space R^d with n linear inequality constraints. It may become meaningless if it has no solution (the dataset is inseparable) or too many solutions (instability of the decision boundary with respect to the data). The questions of the existence of a solution to this problem and of the sensitivity of the solution to the dataset are very difficult. A quantitative characterization can be found in the later discussion of the Vapnik-Chervonenkis framework. We present here an intuitive view of this problem, which depends on two main factors. The first one is the dimension of the space of functions h(x) which determine the decision boundary. In the linear case, it is simply determined by the dimension of the couple (w, b). If the dimension of this function space is too small, as in the linear case, it is possible that no linear solution exists, i.e. the dataset cannot be separated by a simple linear classifier. The second factor is the number of data points, which enter the optimization program via the n inequality constraints. If the number of constraints is too large, the solution may not exist either. In order to overcome this problem, we must increase the dimension of the optimization problem. There are two possible ways to do this. The first one consists of relaxing the inequality constraints by introducing additional variables which tolerate violations of the strict separation: we allow the separation to be made with a certain error (some data points on the wrong side). This technique was first introduced by Cortes and Vapnik (1995) under the name of "soft margin SVM". The second one consists of using a non-linear classifier which directly extends the function space to a higher dimension. The use of a non-linear classifier can rapidly increase the dimension of the optimization problem, which raises a computational problem. An elegant way to overcome it is to employ the notion of kernel. In the next discussions, we clarify these two approaches and then finish this section by introducing two general frameworks of this learning theory.
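To make the dual formulation concrete, here is a minimal numerical sketch that solves it with a generic solver (scipy's SLSQP) on a linearly separable toy sample. It also imposes the equality constraint \sum_i \alpha_i y_i = 0, which comes from the bias term in the full derivation and reappears as Λ^\top y = 0 in problem (5) of Section 3; a dedicated QP or SMO solver would be used in practice, and all parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    # Dual of the hard-margin SVM: max sum_i alpha_i - 0.5 sum_ij alpha_i alpha_j y_i y_j x_i' x_j
    # subject to alpha_i >= 0 and (from the bias term) sum_i alpha_i y_i = 0.
    n = len(y)
    D = (y[:, None] * y[None, :]) * (X @ X.T)       # D_ij = y_i y_j x_i' x_j
    fun = lambda a: 0.5 * a @ D @ a - a.sum()       # minimize the negative dual objective
    res = minimize(fun, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, None)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x
    w = (alpha * y) @ X                             # w* = sum_i alpha_i* y_i x_i
    sv = np.argmax(alpha)                           # any support vector (alpha_i > 0)
    b = y[sv] - X[sv] @ w                           # b* from a support vector
    return w, b
```

The decision rule is then simply sign(w^\top x + b); the sketch only makes sense when the toy data are indeed separable.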

2.1.2 Soft margin classification

In fact, the inequality constraints described above, y_i (w^\top x_i + b) \geq 1, ensure that all data points are well classified with respect to the optimal hyperplane. As the data may be inseparable, an intuitive way to overcome this is to relax the strict constraints by introducing additional variables \xi_i, i = 1,\dots,n, called slack variables. They allow a certain error to be committed in the classification via the new constraints:

y_i (w^\top x_i + b) \geq 1 - \xi_i,  i = 1,\dots,n    (2)

For \xi_i > 1, the data point x_i is completely misclassified, whereas 0 \leq \xi_i \leq 1 can be interpreted as a margin error. With this definition of the slack variables, \sum_{i=1}^{n} \xi_i is directly related to the number of misclassified points. In order to fix the expected error in the classification problem, we introduce an additional term C \sum_{i=1}^{n} \xi_i^p in the objective function and rewrite the optimization problem as follows:

\min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i^p    (3)
u.c.  y_i (w^\top x_i + b) \geq 1 - \xi_i,  \xi_i \geq 0,  i = 1,\dots,n

Here, C is the parameter used to fix our desired level of error and p \geq 1 is the usual way to ensure the convexity of the additional term (this is equivalent to defining an Lp norm on the slack vector \xi \in R^n). The soft-margin solution of the SVM problem can be interpreted as a regularization technique of the kind found in different optimization problems such as regression, filtering or matrix inversion. The same result is recovered with a regularization technique later, when we discuss the possible use of kernels. Before switching to the next discussion on non-linear classification with the kernel approach, we remark that the soft margin SVM problem is now of higher dimension d + 1 + n. However, the computation cost is not increased. Thanks to the KKT theorem, we can turn this primal problem into a dual problem with simpler constraints. We can also work directly with the primal problem by performing a trivial optimization over \xi. The primal problem is then no longer a quadratic program; however, it can be solved by Newton optimization or conjugate gradient as demonstrated in Chapelle (2007).

2.1.3 Non-linear classification, kernel approach

The second approach to improve the classification is to employ a non-linear SVM. In the context of the SVM, we insist that the construction of the non-linear discriminant function h(x) consists of two steps. We first map the data space X of dimension d to a feature space F of higher dimension N via a non-linear transformation \varphi : X \to F, then a hyperplane is constructed in the feature space F as presented before:

h(x) = w^\top \varphi(x) + b

Here, the resulting vector z = (z_1, \dots, z_N) = \varphi(x) is an N-component vector in the space F, hence w is also a vector of size N. The hyperplane H = {z : w^\top z + b = 0} defined in F is no longer a linear decision boundary in the initial space X:

B = {x : w^\top \varphi(x) + b = 0}

At this stage, the generalization to the non-linear case helps us to avoid the problem of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, if we consider a quadratic transformation, it leads to a feature space of dimension N = d(d + 3)/2. The main question is how to construct the separating hyperplane in the feature space. The answer is to employ the mapping to the dual problem. In this way, our N-dimensional problem turns again into the following n-dimensional optimization problem with dual variable \alpha:

\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \varphi(x_i)^\top \varphi(x_j)
u.c.  \alpha_i \geq 0,  i = 1,\dots,n

Indeed, the expansion of the optimal solution w* has the following form:

w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i \varphi(x_i)

In order to solve the quadratic program, we do not need the explicit form of the non-linear mapping but only the kernel K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j), which is usually assumed to be symmetric. If we provide only the kernel K(x_i, x_j) to the optimization problem, it is enough to construct later the hyperplane H in the feature space F or the decision boundary in the data space X. The discriminant function can be computed as follows, thanks to the expansion of the optimal w* on the initial data x_i, i = 1,\dots,n:

h(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + b

From this expression, we can construct the decision function which can be used to classify a given input x as f(x) = sign(h(x)). For a given non-linear function \varphi(x), we can compute the kernel K(x_i, x_j) via the scalar product of the two vectors in the space F. However, the converse does not hold unless the kernel satisfies the condition of Mercer's theorem (1909). Here, we list some standard kernels which are already widely used in the pattern recognition domain:

i. Polynomial kernel: K(x, y) = (x^\top y + 1)^p
ii. Radial basis kernel: K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)
iii. Neural network kernel: K(x, y) = \tanh(a x^\top y - b)
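The three kernels above and the discriminant function h(x) translate directly into code. The following is a minimal numpy sketch with illustrative parameter values (p, σ, a, b are not taken from the text); it assumes that the coefficients α and the bias b have already been obtained from the dual problem.

```python
import numpy as np

# The three standard kernels listed above, with illustrative parameter values.
def polynomial_kernel(x, y, p=2):
    return (x @ y + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def neural_network_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * (x @ y) - b)

def discriminant(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    # h(x) = sum_i alpha_i y_i K(x, x_i) + b, and the classifier is f(x) = sign(h(x))
    return sum(a_i * y_i * kernel(x, x_i)
               for a_i, y_i, x_i in zip(alpha, y_train, X_train)) + b
```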

2.2 ERM and VRM frameworks

We finish the review of the SVM by discussing briefly the general framework of Statistical Learning Theory which includes the SVM. Without entering into details such as the important theorem of Vapnik and Chervonenkis (1998), we would like to give a more general view of the SVM by answering questions such as how to approach the SVM as a regression, or how to interpret the soft-margin SVM as a regularization technique.

2.2.1 Empirical Risk Minimization framework

The Empirical Risk Minimization framework was studied by Vapnik and Chervonenkis in the 1970s. In order to present the main idea, we first fix some notation. Let (x_i, y_i), 1 \leq i \leq n, be the training dataset of input/output pairs. The dataset is supposed to be generated i.i.d. from an unknown distribution P(x, y). The dependency between the input x and the output y is characterized by this distribution. For example, if the input x has a distribution P(x) and the output is related to x via a function y = f(x) altered by a Gaussian noise N(0, \sigma^2), then P(x, y) reads

P(x, y) = P(x) N(f(x) - y, \sigma^2)

We remark in this example that if \sigma \to 0 then N(0, \sigma^2) tends to a Dirac distribution, which means that the relation between input and output can be exactly determined by the maximum of the distribution P(x, y). Estimating the function f(x) is fundamental. In order to measure the estimation quality, we compute the expected value of a loss function with respect to the distribution P(x, y). We define here the loss function in two different contexts:

1. Classification: l(f(x), y) = I_{f(x) \neq y}, where I is the indicator function.
2. Regression: l(f(x), y) = (f(x) - y)^2

The objective of statistical learning is to determine the function f in a certain function space F which minimizes the expected loss, or risk:

R(f) = \int l(f(x), y) \, dP(x, y)

As the distribution P(x, y) is unknown, the expected loss cannot be evaluated. However, with the available training dataset {x_i, y_i}, one can compute the empirical risk as follows:

R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)

In the limit of a large dataset, n \to \infty, we expect the convergence R_{emp}(f) \to R(f) for every tested function f thanks to the law of large numbers. However, is the learning function f which minimizes R_{emp}(f) the one minimizing the true risk R(f)? The answer to this question is no. In general, there is an infinite number of functions f which can learn the training dataset perfectly, f(x_i) = y_i ∀i. In fact, we have to restrict the function space F in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a space of functions F was first studied in the VC theory via the concept of VC dimension (1971) and the important VC theorem, which gives an upper bound on the probability P{sup_{f \in F} |R(f) - R_{emp}(f)| > \varepsilon}. A common way to restrict the function space is to impose a regularization condition. We denote by \Omega(f) a measure of regularity; the regularized problem then consists of minimizing the regularized risk:

R_{reg}(f) = R_{emp}(f) + \lambda \Omega(f)

Here \lambda is the regularization parameter and \Omega(f) can be, for example, an Lp norm on some deviation of f.
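As a small illustration of the two formulas above, the following sketch evaluates the empirical risk for the classification and regression losses, together with a regularized risk for a linear model; the choice \Omega(f) = \|w\|^2 is one common example of regularizer and not the only one allowed by the text.

```python
import numpy as np

def empirical_risk_classification(f, X, y):
    # R_emp = (1/n) sum_i 1{f(x_i) != y_i}
    return np.mean(np.array([f(x) for x in X]) != y)

def regularized_risk_regression(w, b, X, y, lam):
    # R_reg = R_emp + lambda * Omega(f), here with Omega(f) = ||w||^2 for a linear model
    residuals = X @ w + b - y
    return np.mean(residuals ** 2) + lam * np.dot(w, w)
```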

2.2.2 Vapnik and Chervonenkis theory

We are not going to discuss the VC theory of statistical learning machines in detail, but only recall the most important results concerning the characterization of the complexity of a function class. In order to quantify the trade-off between the overfitting problem and the inseparable data problem, Vapnik and Chervonenkis introduced the very important concept of the VC dimension, together with the theorem which characterizes the convergence of the empirical risk function. First, the VC dimension is introduced to measure the complexity of the class of functions F.

Definition 1 The VC dimension of a class of functions F is defined as the maximum number of points that can be exactly learned by a function of F:

h = \max \{ |X| : X \subset \mathcal{X}, \text{ such that } \forall b \in \{-1, 1\}^{|X|}, \exists f \in F, \forall x_i \in X, f(x_i) = b_i \}    (4)

With the definition of the VC dimension, we now present the VC theorems, which are a very powerful tool to control the upper bound on the convergence of the empirical risk to the true risk function. These theorems allow us to have a clear idea of this bound in terms of the available information and the number of observations n in the training set. Using these theorems, we can control the trade-off between overfitting and underfitting. The relation between the factors or coordinates of the vector x and the VC dimension is given in the following theorem:

Theorem 1 (VC theorem for hyperplanes) Let F be the set of hyperplanes in R^d:

F = \{ x \mapsto sign(w^\top x + b), w \in R^d, b \in R \}

then the VC dimension is d + 1.

This theorem gives the explicit relation between the VC dimension and the number of factors or coordinates of the input vector of the training set. It can be used with the next theorem in order to evaluate the information necessary for a good classification or regression.

Theorem 2 (Vapnik and Chervonenkis) Let F be a class of functions of VC dimension h. Then for any distribution Pr and for any sample {(x_i, y_i)}_{i=1,\dots,n} drawn from this distribution, the following inequality holds:

Pr\left\{ \sup_{f \in F} |R(f) - R_{emp}(f)| > \varepsilon \right\} < 4 \exp\left\{ \left[ \frac{h}{n} \left( 1 + \ln \frac{2n}{h} \right) - \left( \varepsilon - \frac{1}{n} \right)^2 \right] n \right\}

An important corollary of the VC theorem is the upper bound for the convergence of the empirical risk to the true risk:

Corollary 1 Under the same hypotheses as the VC theorem, the following inequality holds with probability 1 - \eta:

\forall f \in F,  R(f) - R_{emp}(f) \leq \sqrt{ \frac{ h \left( \ln \frac{2n}{h} + 1 \right) - \ln \frac{\eta}{4} }{n} } + \frac{1}{n}

We skip the proofs of these theorems and postpone the discussion of their importance for practical use to Section 6, as the overfitting and underfitting problems are very present in financial applications.

2.2.3 Vicinal Risk Minimization framework

The Vicinal Risk Minimization (VRM) framework was formally developed in the work of Chapelle et al. (2000). In the ERM framework, the risk is evaluated by using the empirical probability distribution:

dP_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x) \delta_{y_i}(y)

where \delta_{x_i}(x), \delta_{y_i}(y) are Dirac distributions located at x_i and y_i respectively. In the VRM framework, instead of dP_{emp}, the Dirac distribution is replaced by a density estimate in the vicinity of x_i:

dP_{vic}(x, y) = \frac{1}{n} \sum_{i=1}^{n} dP_{x_i}(x) \delta_{y_i}(y)

Hence, the vicinal risk is defined as follows:

R_{vic}(f) = \int l(f(x), y) \, dP_{vic}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \int l(f(x), y_i) \, dP_{x_i}(x)

In order to illustrate the difference between the ERM framework and the VRM framework, let us consider the following example of linear regression. In this case, our loss function is l(f(x), y) = (f(x) - y)^2, where the learning function is of the form f(x) = w^\top x + b. Assuming that the vicinal probability density dP_{x_i}(x) is approximated by a white noise of variance \sigma^2, the vicinal risk is calculated as follows:

R_{vic}(f) = \frac{1}{n} \sum_{i=1}^{n} \int (f(x) - y_i)^2 \, dP_{x_i}(x)
           = \frac{1}{n} \sum_{i=1}^{n} \int (f(x_i + \varepsilon) - y_i)^2 \, dN(0, \sigma^2)
           = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \sigma^2 \|w\|^2

It is equivalent to the regularized risk minimization problem R_{vic}(f) = R_{emp}(f) + \sigma^2 \|w\|^2 with parameter \sigma^2 and an L2 penalty constraint.
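The identity above can be checked numerically: for a linear model and a Gaussian vicinity, a Monte Carlo estimate of the vicinal risk should coincide with the empirical risk plus \sigma^2 \|w\|^2. A minimal sketch, with purely illustrative data and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
w, b = np.array([0.8, -1.5, 0.3]), 0.1

emp = np.mean((X @ w + b - y) ** 2)
# Monte Carlo estimate of (1/n) sum_i E[(f(x_i + eps) - y_i)^2], eps ~ N(0, sigma^2 I)
eps = sigma * rng.standard_normal((5000, n, d))
vic = np.mean(((X + eps) @ w + b - y) ** 2)
print(vic, emp + sigma ** 2 * np.dot(w, w))   # the two values should be close
```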

3 Numerical implementations

In this section, we discuss explicitly the two possible ways to implement the SVM algorithm. As discussed above, the kernel approach can be applied directly in the dual problem and it leads to the simple form of a quadratic program. We discuss the dual approach first, for historical reasons. A direct implementation of the primal problem is a little more delicate, which is why it was implemented much later by Chapelle (2007) using the Newton optimization method and the conjugate gradient method. According to Chapelle, in terms of complexity both approaches have more or less the same efficiency, while in some contexts the latter offers some advantage in solution precision.

3.1 Dual approach

We discuss here in more detail the two main applications of the SVM, namely the classification problem and the regression problem, within the dual approach. The reason for the historical choice of this approach is simply that it offers the possibility to obtain a standard quadratic program whose numerical implementation is well established. Here, we summarize the results presented in Cortes and Vapnik (1995), where the notion of soft-margin SVM was introduced. We next discuss the extension to regression.

3.1.1 Classification problem

As introduced in the last section, the classification encounters two main problems: overfitting and underfitting. If the dimension of the function space is too large, the result will be very sensitive to the input; a small change in the data can then cause an instability in the final result. The second problem concerns non-separable data, in the sense that the function space is too small, so that we cannot obtain a solution which minimizes the risk function. In both cases, a regularization scheme is necessary to make the problem well-posed. In the first case, one should restrict the function space by imposing some condition and working with a specific function class (the linear case for example). In the second case, one needs to extend the function space by introducing some tolerated error (the soft-margin approach) or by working with a non-linear transformation.

a) Linear SVM with soft-margin approach

In the work of Cortes and Vapnik (1995), the notion of soft margin was first introduced by accepting that there will be some error in the classification. They characterize this error by additional variables \xi_i associated with each data point x_i. These parameters intervene in the classification via the constraints. For a given hyperplane, the constraint y_i (w^\top x_i + b) \geq 1 means that the point x_i is well classified and lies outside the margin. When we change this condition to y_i (w^\top x_i + b) \geq 1 - \xi_i with \xi_i \geq 0, i = 1,\dots,n, it first allows the point x_i to be well classified but inside the margin for 0 \leq \xi_i < 1. For the value \xi_i > 1, there is a possibility that the input x_i is misclassified. As written above, the primal problem becomes an optimization with respect to both the margin and the total committed error:

\min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \, F\left( \sum_{i=1}^{n} \xi_i^p \right)
u.c.  y_i (w^\top x_i + b) \geq 1 - \xi_i,  \xi_i \geq 0,  i = 1,\dots,n

Here, p is the degree of regularization. We remark that only for the choice p \geq 1 does the soft margin have a unique solution. The function F(u) is usually chosen as a convex function with F(0) = 0, for example F(u) = u^k.

In the following we consider two specific cases: (i) the hard-margin limit (no slack term); (ii) the L1 penalty with F(u) = u, p = 1. We define the dual vector Λ = (\alpha_1, \dots, \alpha_n) and the output vector y = (y_1, \dots, y_n). In order to write the optimization problem in vector form, we also define the operator D = (D_{ij})_{n \times n} with D_{ij} = y_i y_j x_i^\top x_j.

i. Hard-margin limit. As shown in Appendix A.1, this problem can be mapped to the following dual problem:

\max_{Λ} Λ^\top 1 - \frac{1}{2} Λ^\top D Λ    (5)
u.c.  Λ^\top y = 0,  Λ \geq 0

ii. L1 penalty with F(u) = u, p = 1. In this case the associated dual problem is given by:

\max_{Λ} Λ^\top 1 - \frac{1}{2} Λ^\top D Λ    (6)
u.c.  Λ^\top y = 0,  0 \leq Λ \leq C1

The full derivation is given in Appendix A.2.

Remark 1 For the case with L2 penalty (F(u) = u, p = 2), we will demonstrate in the next discussion that it is a special case of the kernel approach for the hard-margin case. Hence, the dual problem is written exactly as in the hard-margin case with an additional regularization term 1/(2C) added to the matrix D:

\max_{Λ} Λ^\top 1 - \frac{1}{2} Λ^\top \left( D + \frac{1}{2C} I \right) Λ    (7)
u.c.  Λ^\top y = 0,  Λ \geq 0

b) Non-linear SVM with kernel approach

The second possibility to extend the function space is to employ a non-linear transformation \varphi(x) from the initial space X to the feature space F and then construct the hard-margin problem. This approach leads to the same dual problems with the use of an explicit kernel K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j) instead of x_i^\top x_j. In this case, the D operator is the matrix D = (D_{ij})_{n \times n} with elements:

D_{ij} = y_i y_j K(x_i, x_j)

With this convention, the first two quadratic programs above can be rewritten in the context of non-linear classification by replacing the D operator by this new definition with the kernel. We finally remark that the case of the soft-margin SVM with quadratic penalty (F(u) = u, p = 2) can also be seen as a hard-margin SVM with a modified kernel. We introduce a new transformation

\tilde{\varphi}(x_i) = \left( \varphi(x_i), 0, \dots, y_i / \sqrt{2C}, \dots, 0 \right)

where the element y_i / \sqrt{2C} is at position i + \dim(\varphi(x_i)), and a new vector

\tilde{w} = \left( w, \xi_1 \sqrt{2C}, \dots, \xi_n \sqrt{2C} \right)

In this new representation, the objective function \|w\|^2 / 2 + C \sum_{i=1}^{n} \xi_i^2 becomes simply \|\tilde{w}\|^2 / 2, whereas the inequality constraint y_i (w^\top \varphi(x_i) + b) \geq 1 - \xi_i becomes y_i (\tilde{w}^\top \tilde{\varphi}(x_i) + b) \geq 1. Hence, we obtain a hard-margin SVM with a modified kernel which can be computed simply:

\tilde{K}(x_i, x_j) = \tilde{\varphi}(x_i)^\top \tilde{\varphi}(x_j) = K(x_i, x_j) + \frac{\delta_{ij}}{2C}

This kernel is consistent with the QP program in the last remark. In summary, the linear SVM is nothing else than a special case of the non-linear SVM within the kernel approach. In the following, we study the SVM problem only for the two cases with hard and soft margin within the kernel approach. After obtaining the optimal vector Λ* by solving the associated QP program described above, we can compute b by the KKT condition and then derive the decision function f(x). We remind that w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i \varphi(x_i).

i. For the hard-margin case, the KKT condition given in Appendix A.1 is:

\alpha_i^\star \left[ y_i \left( w^{\star\top} \varphi(x_i) + b^\star \right) - 1 \right] = 0

We notice that for values \alpha_i > 0, the inequality constraint becomes an equality. As the inequality constraint becomes an equality constraint, these points are the closest points to the optimal frontier and they are called support vectors. Hence, b can be computed easily from a given support vector (x_i, y_i) as follows:

b^\star = y_i - w^{\star\top} \varphi(x_i)

In order to enhance the precision of b*, we evaluate this value as the average over the set SV of support vectors:

b^\star = \frac{1}{n_{SV}} \sum_{i \in SV} \left( y_i - \sum_{j \in SV} \alpha_j^\star y_j \varphi(x_j)^\top \varphi(x_i) \right)
        = \frac{1}{n_{SV}} \sum_{i \in SV} \left( y_i - \sum_{j \in SV} \alpha_j^\star y_j K(x_i, x_j) \right)

ii. For the soft-margin case, the KKT condition given in Appendix A.2 is slightly different:

\alpha_i^\star \left[ y_i \left( w^{\star\top} \varphi(x_i) + b^\star \right) - 1 + \xi_i \right] = 0

However, if \alpha_i satisfies the condition 0 \leq \alpha_i \leq C, then we can show that \xi_i = 0. The condition 0 \leq \alpha_i \leq C defines the subset of training points (support vectors) which are closest to the separation frontier. Hence, b can be computed by exactly the same expression as in the hard-margin case. From the optimal values of the triple (Λ*, w*, b*), we can construct the decision function which can be used to classify a given input x as follows:

f(x) = sign\left( \sum_{i=1}^{n} \alpha_i^\star y_i K(x, x_i) + b^\star \right)    (8)
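The following is a minimal numerical sketch of this procedure for the soft-margin dual (6) with a Gaussian kernel: the dual is solved with a generic SLSQP solver, the bias is averaged over support vectors and the decision function (8) is evaluated. Parameter values, the support-vector threshold and the choice of solver are illustrative; a dedicated QP or SMO implementation would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_gram(A, B, sigma=1.0):
    # Gram matrix of the radial basis kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svm_dual_fit(X, y, C=1.0, sigma=1.0):
    n = len(y)
    K = rbf_gram(X, X, sigma)
    D = (y[:, None] * y[None, :]) * K            # D_ij = y_i y_j K(x_i, x_j)
    fun = lambda a: 0.5 * a @ D @ a - a.sum()    # negative of the dual objective (6)
    jac = lambda a: D @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y}]
    res = minimize(fun, np.zeros(n), jac=jac, bounds=[(0.0, C)] * n,
                   constraints=cons, method="SLSQP")
    alpha = res.x
    sv = alpha > 1e-6                            # support vectors: alpha_i > 0
    # bias averaged over the support vectors, as in the expression for b* above
    b = np.mean([y[i] - np.sum(alpha[sv] * y[sv] * K[i, sv]) for i in np.where(sv)[0]])
    return alpha, b

def svm_decision(X_train, y_train, alpha, b, X_new, sigma=1.0):
    K = rbf_gram(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)    # decision function (8)
```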

3.1.2 Regression problem

In the last sections, we have discussed the SVM problem only in the classification context. In this section, we show how the regression problem can be interpreted as an SVM problem. As discussed in the general frameworks of statistical learning (ERM or VRM), the SVM problem consists of minimizing the risk function R_{emp} or R_{vic}. The risk function can be computed via the loss function l(f(x), y) which defines our objective (classification or regression). Explicitly, the risk function is calculated as:

R(f) = \int l(f(x), y) \, dP(x, y)

where the distribution dP(x, y) can be computed in the ERM framework or in the VRM framework. For the classification problem, the loss function is defined as l(f(x), y) = I_{f(x) \neq y}, which means that we count an error whenever the given point is misclassified. The minimization of the risk function for the classification can then be mapped to the maximization of the margin 1/\|w\|. For the regression problem, the loss function is l(f(x), y) = (f(x) - y)^2, which means that we count the loss as the regression error.

Remark 2 We have chosen here the least-square error as the loss just for illustration. In general, it can be replaced by any positive function F of f(x) - y. Hence, the loss function in its general form is l(f(x), y) = F(f(x) - y). We remark that the least-square case corresponds to the L2 norm; the simplest generalization is then to take the loss function as an Lp norm, l(f(x), y) = |f(x) - y|^p. We show later that the special case with L1 brings the regression problem to a form similar to the soft-margin classification.

In the last discussion on classification, we concluded that the linear SVM problem is just a special case of the non-linear SVM within the kernel approach. Hence, we work here directly with the non-linear case where the training vector x is already transformed by a non-linear mapping \varphi(x). Therefore, the approximating function of the regression reads f(x) = w^\top \varphi(x) + b. In the ERM framework, the risk function is estimated simply as the empirical sum over the dataset:

R_{emp} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2

whereas in the VRM framework, if we assume that dP_{x_i}(x) is a Gaussian noise of variance \sigma^2, then the risk function reads:

R_{vic} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \sigma^2 \|w\|^2

The risk function in the VRM framework can be interpreted as a regularized form of the risk function in the ERM framework. We rewrite the risk function after renormalizing it by the factor 2\sigma^2:

R_{vic} = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i^2

with C = 1/(2\sigma^2 n). Here, we have introduced the new variables \xi = (\xi_i)_{i=1,\dots,n} which satisfy y_i = f(x_i) + \xi_i = w^\top \varphi(x_i) + b + \xi_i. The regression problem can now be written as a QP program with equality constraints as follows:

\min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i^2
u.c.  y_i = w^\top \varphi(x_i) + b + \xi_i,  i = 1,\dots,n

In the present form, the regression looks very similar to the SVM problem for classification. We notice that the regression problem in the context of SVM can easily be generalized in two possible ways:

• The first way is to introduce a more general loss function F(f(x_i) - y_i) instead of the least-square loss function. This generalization can lead to other types of regression such as the ε-SV regression proposed by Vapnik (1998).

• The second way is to introduce a weight distribution \omega_i for the empirical distribution instead of the uniform distribution:

dP_{emp}(x, y) = \sum_{i=1}^{n} \omega_i \delta_{x_i}(x) \delta_{y_i}(y)

As financial quantities depend more on the recent past, an asymmetric weight distribution in favor of recent data would improve the estimator. The idea of this generalization is quite similar to an exponential moving average. By doing this, we recover the results obtained in Gestel et al. (2001) and in Tay and Cao (2002) for the LS-SVM formalism. For example, we can choose the weight distribution as proposed in Tay and Cao (2002): \omega_i = 2i / (n(n + 1)) (linear distribution) or \omega_i = 1 / (1 + \exp(a - 2ai/n)) (exponential weight distribution).

Our least-square regression problem can again be mapped to a dual problem after introducing the Lagrangian. Detailed calculations are given in Appendix A. We give here the principal result, which again involves the kernel K_{ij} = K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j) for treating the non-linearity. As in the classification case, we consider only two problems, which are similar to the hard margin and the soft margin in the context of regression.

i. Least-square SVM regression: The regression problem discussed above is similar to the hard-margin problem. Here, we have to keep the regularization parameter C as it defines the tolerated error for the regression. This problem with the L2 constraint is equivalent to a hard margin with a modified kernel. The quadratic optimization program is given as follows (a short numerical sketch of this problem is provided at the end of this subsection):

\max_{Λ} Λ^\top y - \frac{1}{2} Λ^\top \left( K + \frac{1}{2C} I \right) Λ    (9)
u.c.  Λ^\top 1 = 0


ii. ε-SVM regression: The ε-SVM regression problem was introduced by Vapnik (1998) in order to have a formalism similar to the soft-margin SVM. He proposed to employ the loss function in the following form:

l(f(x), y) = (|y - f(x)| - ε) I_{|y - f(x)| \geq ε}

The ε-SVM loss function is just a generalization of the L1 error. Here, ε is an additional tolerance parameter which allows us not to count regression errors smaller than ε. Inserting this loss function into the expression of the risk function, we obtain the objective of the optimization problem:

R_{vic} = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (|f(x_i) - y_i| - ε) I_{|y_i - f(x_i)| \geq ε}

Because the two sets {y_i - f(x_i) \geq ε} and {y_i - f(x_i) \leq -ε} are disjoint, we can break the function I_{|y_i - f(x_i)| \geq ε} into two terms:

I_{|y_i - f(x_i)| \geq ε} = I_{y_i - f(x_i) - ε \geq 0} + I_{f(x_i) - y_i - ε \geq 0}

We introduce the slack variables \xi and \xi' as in the last case, which satisfy the conditions \xi_i \geq y_i - f(x_i) - ε and \xi'_i \geq f(x_i) - y_i - ε. Hence, we obtain the following optimization problem:

\min_{w,b,\xi,\xi'} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi'_i)
u.c.  w^\top \varphi(x_i) + b - y_i \leq ε + \xi_i,   \xi_i \geq 0,   i = 1,\dots,n
      y_i - w^\top \varphi(x_i) - b \leq ε + \xi'_i,  \xi'_i \geq 0,  i = 1,\dots,n

Remark 3 We remark that our approach gives exactly the same result as the traditional approach discussed in the work of Vapnik (1998), in which the objective function is constructed by minimizing the margin with additional terms defining the regression error. These terms are controlled by the pair of slack variables.

The dual problem in this case can be obtained by performing the same calculation as for the soft-margin SVM:

\max_{Λ,Λ'} (Λ - Λ')^\top y - ε (Λ + Λ')^\top 1 - \frac{1}{2} (Λ - Λ')^\top K (Λ - Λ')    (10)
u.c.  (Λ - Λ')^\top 1 = 0,  0 \leq Λ, Λ' \leq C1

For the particular case with ε = 0, we obtain:

\max_{Λ} Λ^\top y - \frac{1}{2} Λ^\top K Λ
u.c.  Λ^\top 1 = 0,  |Λ| \leq C1

After the optimization procedure using a QP program, we obtain the optimal vector Λ* and then compute b* by the KKT condition:

w^\top \varphi(x_i) + b - y_i = 0

for the support vectors (x_i, y_i) (see Appendix A.3 for more detail). In order to have a good accuracy for the estimation of b, we average over the set SV of support vectors and obtain:

b^\star = \frac{1}{n_{SV}} \sum_{i \in SV} \left( y_i - \sum_{j=1}^{n} \alpha_j^\star K(x_i, x_j) \right)

The SVM regressor is then given by the following formula:

f(x) = \sum_{i=1}^{n} \alpha_i^\star K(x, x_i) + b^\star
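Two short numerical sketches close this subsection. First, the least-square SVM problem (9): one can check that the first-order conditions of (9) together with its equality constraint reduce to a linear system in (Λ, b), which is solved directly with numpy below (the Gaussian kernel and the values of C and σ are illustrative).

```python
import numpy as np

def lssvm_fit(X, y, C=10.0, sigma=1.0):
    # KKT system of problem (9):  [A  1] [Lambda]   [y]      with A = K + I/(2C)
    #                             [1' 0] [  b   ] = [0]
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian kernel Gram matrix
    A = K + np.eye(n) / (2.0 * C)
    M = np.zeros((n + 1, n + 1))
    M[:n, :n], M[:n, n], M[n, :n] = A, 1.0, 1.0
    sol = np.linalg.solve(M, np.append(y, 0.0))
    return sol[:n], sol[n]                        # (Lambda, b)

def lssvm_predict(X_train, Lambda, b, X_new, sigma=1.0):
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K @ Lambda + b                         # f(x) = sum_i Lambda_i K(x, x_i) + b
```

Second, the ε-SVM regression (10) is available off the shelf; the sketch below uses scikit-learn's SVR on hypothetical data, where C, ε and the kernel width are again illustrative.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 200).reshape(-1, 1)
y = np.sin(t).ravel() + 0.1 * rng.standard_normal(200)

# epsilon-SV regression with a Gaussian kernel; gamma = 1/(2 sigma^2)
model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=0.5)
model.fit(t, y)
y_hat = model.predict(t)
```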

3.2 Primal approach

We now discuss the possibility of a direct implementation of the primal problem. This problem has been proposed and studied by Chapelle (2007). In this work, the author argues that both the primal and the dual implementations have the same complexity, of the order O(max(n, d) min(n, d)^2). Indeed, according to the author, the primal problem might give a more accurate solution as it treats directly the quantity that one is interested in. This can easily be understood via the special case of an LS-SVM linear estimator, where both the primal and the dual problems can be solved analytically. The main idea of the primal implementation is to rewrite the constrained optimization problem as an unconstrained problem by performing a trivial minimization over the slack variables \xi. We then obtain:

\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} L\left( y_i, w^\top \varphi(x_i) + b \right)    (11)

Here, we have L(y, t) = (y - t)^p for the regression problem whereas L(y, t) = \max(0, 1 - yt)^p for the classification problem. In the case of a quadratic loss or L2 penalty, the function L(y, t) is differentiable with respect to the second variable, hence one can obtain the zero-gradient equation. In the case where L(y, t) is not differentiable, such as L(y, t) = \max(0, 1 - yt), we have to approximate it by a regular function. Assuming that L(y, t) is differentiable with respect to t, we obtain:

w + C \sum_{i=1}^{n} \frac{\partial L}{\partial t}\left( y_i, w^\top \varphi(x_i) + b \right) \varphi(x_i) = 0

which leads to the following representation of the solution w:

w = \sum_{i=1}^{n} \beta_i \varphi(x_i)

By introducing the kernel K_{ij} = K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j), we rewrite the primal problem as follows:

\min_{\beta,b} \frac{1}{2} \beta^\top K \beta + C \sum_{i=1}^{n} L\left( y_i, K_i^\top \beta + b \right)    (12)

where K_i is the i-th column of the matrix K. We note that this is now an unconstrained optimization problem which can be solved by gradient descent whenever L(y, t) is differentiable. In Appendix A, we present the detailed derivation of the primal implementation for the case of quadratic loss and soft-margin classification.
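A minimal sketch of this primal implementation for classification with the squared hinge loss L(y, t) = max(0, 1 - yt)^2, which is differentiable in t, is given below. Plain gradient descent with a fixed step size is used only for illustration; Chapelle (2007) relies on Newton or conjugate gradient methods, which converge much faster, and the step size and iteration count here are arbitrary.

```python
import numpy as np

def primal_ksvm_fit(K, y, C=1.0, lr=1e-3, n_iter=2000):
    # Gradient descent on the kernelized primal (12): 0.5 beta'K beta + C sum_i max(0, 1 - y_i f_i)^2
    n = len(y)
    beta, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        f = K @ beta + b                           # f_i = K_i' beta + b
        slack = np.maximum(0.0, 1.0 - y * f)       # active terms of the squared hinge loss
        grad_beta = K @ beta - 2.0 * C * K @ (y * slack)
        grad_b = -2.0 * C * np.sum(y * slack)
        beta -= lr * grad_beta
        b -= lr * grad_b
    return beta, b
```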


3.3 Model selection - cross validation procedure

The possibility to enlarge or restrict the function space gives us the possibility to obtain a solution to the SVM problem. However, the choice of the additional parameters, such as the error tolerance C in the soft-margin SVM or the kernel parameter in the extension to the non-linear case, is fundamental. How can we choose these parameters for a given dataset? In this section, we discuss the calibration procedure, the so-called "model selection", which aims to determine the set of parameters of the SVM. This discussion is essentially based on the results presented in Chapelle's thesis (2002). In order to define the calibration procedure, let us first define the test function which is used to evaluate the SVM problem. In the case where we have a lot of data, we can follow the traditional cross-validation procedure by dividing the whole data into two independent sets: the training set and the validation set. The training set {x_i, y_i}_{1 \leq i \leq n} is used for the optimization problem whereas the validation set {x'_i, y'_i}_{1 \leq i \leq m} is used to evaluate the error via the following test function:

T = \frac{1}{m} \sum_{i=1}^{m} \psi(-y'_i f(x'_i))

where \psi(x) = I_{x > 0}, with I_A the standard notation for the indicator function. In the case where we do not have enough data for the SVM problem, we can employ the training set directly to evaluate the error via the "leave-one-out" error. Let f^0 be the classifier obtained with the full training set and f^p be the one with the point (x_p, y_p) left out. The error is defined by the test of the decision rule f^p on the missing point (x_p, y_p) as follows:

T = \frac{1}{n} \sum_{p=1}^{n} \psi(-y_p f^p(x_p))

We focus here on the first test error function with an available validation dataset. However, the error function requires the step function \psi, which is discontinuous and can cause some difficulty if we want to determine the best parameter selection via the optimal test error. In order to perform the search for the minimal test error, by gradient descent for example, we should smooth the test error by regularizing the step function:

\tilde{\psi}(x) = \frac{1}{1 + \exp(-Ax + B)}

The choice of the parameters A and B is important. If A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.
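In the simplest setting, the validation test error T can also be minimized by an exhaustive grid search rather than by differentiating a smoothed version of it. The sketch below does this for a binary problem with labels ±1, using scikit-learn's SVC; the grids for C and the kernel parameter are illustrative, and the smoothed test error discussed above is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

def validation_error(model, X_val, y_val):
    # Test function T: fraction of validation points with -y f(x) > 0, i.e. misclassified
    f = model.decision_function(X_val)
    return np.mean(-y_val * f > 0)

def select_parameters(X_tr, y_tr, X_val, y_val,
                      C_grid=(0.1, 1.0, 10.0, 100.0),
                      gamma_grid=(0.01, 0.1, 1.0)):
    best = None
    for C in C_grid:
        for gamma in gamma_grid:
            model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
            err = validation_error(model, X_val, y_val)
            if best is None or err < best[0]:
                best = (err, C, gamma)
    return best   # (validation error, C*, gamma*)
```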

4 Extension to SVM multi-classification

The single SVM classification (binary classification) discussed in the last section is very well established and has become a standard method for various applications. However, the extension to the multi-class classification problem is not straightforward. This problem remains a very active research topic in the pattern recognition domain. In this section, we give a quick overview of this progressing field and of some practical implementations.


4.1 Basic idea of multi-classification

The multiclass SVM can be formulated as follows. Let (x_i, y_i)_{i=1,\dots,n} be the training set of data with characteristics x ∈ R^d under the classification criterion y. For example, the training data belong to m different classes labeled from 1 to m, which means that y ∈ {1,\dots,m}. Our task is to determine a classification rule F : R^d → {1,\dots,m} based on the training data, which aims to predict to which class a test point x_t belongs by evaluating the decision rule F(x_t). Recently, many important contributions have advanced the field both in accuracy and in complexity (i.e. reduction of computation time). The extensions have been developed in two main directions. The first one consists of dividing the multi-classification problem into many binary classification problems by using the "one-against-all" or the "one-against-one" strategy. The next step is to construct the decision function in the recognition phase. The implementation of the decision for the "one-against-all" strategy is based on the maximum output among all binary SVMs. The outputs are usually mapped into probability estimates, as proposed by different authors such as Platt (1999). For the "one-against-one" strategy, in order to take the right decision, the Max Wins algorithm is adopted: the resulting class is the one voted for by the majority of binary classifiers. Both techniques encounter the limitation of complexity and high computation time. Another improvement in the same direction, the binary decision tree (SVM-BDT), was recently proposed by Madzarov et al. (2009). This technique has proved able to speed up the computation time. The second direction consists of generalizing the kernel concept in the SVM algorithm to a more general form. This method treats the multi-classification problem directly by writing a general form of the large margin problem. It is again mapped into the dual problem by incorporating the kernel concept. Crammer and Singer (2001) introduced an efficient algorithm which decomposes the dual problem into multiple optimization problems that can then be solved by a fixed-point algorithm.

4.2 Implementations of multiclass SVM

We describe here the two principal implementations of the SVM for the multi-classification problem. The first one concerns a direct application of binary SVM classifiers; however, the recognition phase requires a careful choice of the decision strategy. We next describe and implement the multiclass kernel-based SVM algorithm, which is a more elegant approach.

Remark 4 Before discussing the details of the two implementations, we remark that there exist other implementations of the SVM, such as the application of Nonnegative Matrix Factorization (Poluru V. K. et al., 2009) in the binary case by rewriting the SVM problem in the NMF framework. The extension of this application to the multi-classification case would be an interesting topic for future work.

4.2.1 Decomposition into multiple binary SVM

The two most popular extensions of the single SVM classifier to a multiclass SVM classifier use the one-against-all strategy and the one-against-one strategy. Recently, another technique utilizing a binary decision tree has required less effort in training the data and is much faster in the recognition phase, with a complexity of order O(log_2 m). All these techniques employ the above SVM implementation directly.

a) One-against-all strategy: In this case, we construct m single SVM classifiers in order to separate the training data of each class from the rest of the classes. Let us consider the construction of the classifier separating class k from the rest. We start by attributing the response z_i = 1 if y_i = k and z_i = -1 for all y_i ∈ {1,\dots,m} \ {k}. Applying this construction for all classes, we finally obtain the m classifiers f_1(x),\dots,f_m(x). For a test point x, the decision rule is obtained by the maximum of the outputs given by these m classifiers:

y = argmax_{k ∈ {1,\dots,m}} f_k(x)

In order to avoid the error coming from the fact that we compare outputs corresponding to different classifiers, we can map the output of each SVM into the same form of probability, as proposed by Platt (1999):

\hat{Pr}(\omega_k | f_k(x)) = \frac{1}{1 + \exp(A_k f_k(x) + B_k)}

where \omega_k is the label of the k-th class. This quantity can be interpreted as a measure of the acceptance probability of the class \omega_k for a given point x with output f_k(x). However, nothing guarantees that \sum_{k=1}^{m} \hat{Pr}(\omega_k | f_k(x)) = 1, hence we have to renormalize this probability:

\hat{Pr}(\omega_k | x) = \frac{\hat{Pr}(\omega_k | f_k(x))}{\sum_{j=1}^{m} \hat{Pr}(\omega_j | f_j(x))}

In order to obtain these probabilities, we have to calibrate the parameters (A_k, B_k). This can be done by performing maximum likelihood on the training set (Platt, 1999). A short sketch of this strategy is given after this list.

b) One-against-one strategy: Another way to employ the binary SVM classifier is to construct N_c = m(m - 1)/2 binary classifiers which separate all pairs of classes (\omega_i, \omega_j). We denote the ensemble of classifiers C = {f_1,\dots,f_{N_c}}. In the recognition phase, we evaluate all possible outputs f_1(x),\dots,f_{N_c}(x) over C for a given point x. These outputs can be mapped to the response function of each classifier, sign f_k(x), which determines to which class the point x belongs with respect to the classifier f_k. We denote by N_1,\dots,N_m the number of times that the point x is classified in the classes \omega_1,\dots,\omega_m respectively. Using the responses we can construct a probability distribution \hat{Pr}(\omega_k | x) over the set of classes {\omega_k}. This probability is again used to decide the recognition of x.

c) Binary decision tree: Both methods above are quite easy to implement as they employ the binary solver directly. However, they both suffer from a high computation time. We now discuss the technique proposed recently by Madzarov et al. (2009), which uses a binary decision tree strategy. Thanks to the binary tree, the technique gains both in complexity and in computation time. It needs only m - 1 classifiers, which do not always run on the whole training set during construction. By construction, recognizing a test point x requires only O(log_2 m) evaluations when descending the tree. Figure 2 illustrates how this algorithm works for classifying 7 classes.
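A minimal sketch of the one-against-all strategy with Platt-type probability outputs follows. It relies on scikit-learn, where SVC(probability=True) fits a sigmoid of the form 1/(1 + exp(A f(x) + B)) on each binary classifier and OneVsRestClassifier builds the m one-against-all problems; the data below are purely hypothetical and the labels random, so the example only illustrates the interface.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = rng.integers(0, 3, size=300)            # m = 3 classes labeled 0, 1, 2 (hypothetical data)

ova = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, probability=True))
ova.fit(X, y)

decision = ova.decision_function(X[:5])     # raw outputs f_1(x), ..., f_m(x)
proba = ova.predict_proba(X[:5])            # renormalized class probabilities
y_hat = ova.predict(X[:5])                  # argmax decision rule
```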

4.2.2 Multiclass Kernel-based Vector Machines

A more general and elegant formalism can be obtained for multi-classification by generalizing the kernel concept. In this discussion, we follow the approach given in the work of Crammer and Singer (2001), but with a more geometrical explanation. We demonstrate that this approach can be interpreted as a simultaneous combination of the "one-against-all" and "one-against-one" strategies.

Figure 2: Binary decision tree strategy for the multi-classification problem

As in the linear case, we have to define a decision function. For the binary case, f(x) = sign(h(x)) where h(x) defines the boundary (i.e. f(x) = +1 if x belongs to class 1 whereas f(x) = -1 if x belongs to class 2). In the multiclass case, the decision function must also indicate the class index. In the work of Crammer and Singer (2001), the decision rule F : R^d → {1,\dots,m} is constructed as follows:

F(x) = argmax_{k ∈ {1,\dots,m}} W_k^\top x

Here, W is the d × m weight matrix in which each column W_k corresponds to a d × 1 weight vector. Therefore, we can write the weight matrix as W = (W_1 W_2 \dots W_m). We remind that the vector x is of dimension d. In fact, the vector W_k corresponding to the k-th class can be interpreted as the normal vector of the hyperplane in the binary SVM. It characterizes the sensitivity of a given point x to the k-th class. The quantity W_k^\top x is similar to a "score" that we attribute to the class \omega_k.

Remark 5 This construction looks quite similar to the "one-against-all" strategy. The main difference is that in the "one-against-all" strategy, the vectors W_1,\dots,W_m are constructed independently, one by one, with binary SVMs, whereas within this formalism they are constructed simultaneously. We will show in the following that the selection rule of this approach is more similar to the "one-against-one" strategy.

Remark 6 In order to have an intuitive geometric interpretation, we treat here the case of the linear classifier. However, the generalization to the non-linear case is straightforward when we replace x_i^\top x_j by \varphi(x_i)^\top \varphi(x_j). This step introduces the notion of kernel K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j).

By definition, W_k is the vector defining the boundary which distinguishes the class \omega_k from the rest. It is a normal vector to the boundary and points towards the region occupied by class \omega_k. Assume that we are able to separate all the data correctly with the classifier W. For any point (x, y), when we compute the position of x with respect to the two classes \omega_y and \omega_k for all k ≠ y,


we must find that x belongs to class \omega_y. As W_k defines the vector pointing towards the class \omega_k, when we compare a class \omega_y to a class \omega_k, it is natural to define the vector W_y - W_k as the vector pointing towards class \omega_y but not \omega_k. As a consequence, W_k - W_y is the vector pointing towards class \omega_k but not \omega_y. When x is well classified, we must have (W_y^\top - W_k^\top) x > 0 (i.e. the class \omega_y has the best score). In order to have a margin as in the binary case, we impose strictly that (W_y^\top - W_k^\top) x \geq 1 ∀ k ≠ y. This condition can be written for all k = 1,\dots,m by adding \delta_{y,k} (the Kronecker symbol) as follows:

(W_y^\top - W_k^\top) x + \delta_{y,k} \geq 1

Therefore, solving the multi-classification problem for the training set (x_i, y_i)_{i=1,\dots,n} is equivalent to finding W satisfying:

(W_{y_i}^\top - W_k^\top) x_i + \delta_{y_i,k} \geq 1  ∀ i, k

We notice here that w = W_i - W_j is a normal vector to the separation boundary H_w = {z | w^\top z + b_{ij} = 0} between the two classes \omega_i and \omega_j. Hence the width of the margin between the two classes is, as in the binary case:

M(H_w) = \frac{1}{\|w\|}

Maximizing the margin is equivalent to minimizing the norm \|w\|. Indeed, we have \|w\|^2 = \|W_i - W_j\|^2 \leq 2 (\|W_i\|^2 + \|W_j\|^2). In order to maximize all the margins at the same time, it turns out that we have to minimize the L2 norm of the matrix W:

\|W\|_2^2 = \sum_{i=1}^{m} \|W_i\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{d} W_{ij}^2

Finally, we obtain the following optimization problem:

\min_{W} \frac{1}{2} \|W\|^2
u.c.  (W_{y_i}^\top - W_k^\top) x_i + \delta_{y_i,k} \geq 1  ∀ i = 1,\dots,n,  k = 1,\dots,m

The extension to the similar case with soft margin can be formulated easily by introducing the slack variables \xi_i corresponding to each training point. As before, these slack variables allow a point to be classified within the margin. The minimization problem now becomes:

\min_{W,\xi} \frac{1}{2} \|W\|^2 + C \, F\left( \sum_{i=1}^{n} \xi_i^p \right)
u.c.  (W_{y_i}^\top - W_k^\top) x_i + \delta_{y_i,k} \geq 1 - \xi_i,  \xi_i \geq 0  ∀ i, k

Remark 7 Within the ERM or VRM frameworks, we can construct the risk function via the loss function l(x) = I_{F(x) ≠ y} for a pair of data (x, y). For example, in the ERM framework, we have:

R_{emp}(W) = \frac{1}{n} \sum_{i=1}^{n} I_{F(x_i) ≠ y_i}

The classification problem is now equivalent to finding the optimal matrix W* which minimizes the empirical risk function. In the binary case, we have seen that the optimization of the risk


function is equivalent to maximizing the margin, i.e. minimizing \|w\|^2, under linear constraints. We remark that in the VRM framework, this problem can be tackled exactly as in the binary case. In order to prove the equivalence between minimizing the risk function and the large margin principle, we look for a linear upper bound of the indicator function I_{F(x) ≠ y}. As shown in Crammer and Singer (2001), we consider the following function:

g(x, y; k) = (W_k^\top - W_y^\top) x + 1 - \delta_{y,k}

In fact, we can prove that

I_{F(x) ≠ y} \leq g(x, y) = \max_{k} g(x, y; k)  ∀ (x, y)

We first remark that g(x, y; y) = (W_y^\top - W_y^\top) x + 1 - \delta_{y,y} = 0, hence g(x, y) \geq g(x, y; y) = 0. If the point (x_i, y_i) satisfies F(x_i) = y_i, then W_{y_i}^\top x_i = \max_k W_k^\top x_i and I_{F(x) ≠ y}(x_i) = 0. In this case, it is obvious that I_{F(x) ≠ y}(x_i) \leq g(x_i, y_i). If instead F(x_i) ≠ y_i, then W_{y_i}^\top x_i < \max_k W_k^\top x_i and I_{F(x) ≠ y}(x_i) = 1. In this case, g(x_i, y_i) = \max_k W_k^\top x_i - W_{y_i}^\top x_i + 1 \geq 1. Hence, we obtain again I_{F(x) ≠ y}(x_i) \leq g(x_i, y_i). Finally, we obtain an upper bound on the risk function by the following expression:

R_{emp}(W) \leq \frac{1}{n} \sum_{i=1}^{n} \max_{k} \left[ (W_k^\top - W_{y_i}^\top) x_i + 1 - \delta_{y_i,k} \right]

If the data are separable, then the optimal value of the risk function is zero. If one requires that the upper bound on the risk function is zero, then the W* which optimizes this bound must be the one optimizing R_{emp}(W). The minimization can be expressed as:

\max_{k} \left[ (W_k^\top - W_{y_i}^\top) x_i + 1 - \delta_{y_i,k} \right] = 0  ∀ i

or in the same form as the large margin problem:

(W_{y_i}^\top - W_k^\top) x_i + \delta_{y_i,k} \geq 1  ∀ i, k

Following the traditional route for solving this problem, we map it into the dual problem as in the binary classification case. The detail of this mapping is given in Crammer and Singer (2001). We summarize here their important result in the dual form with the dual variables \eta_i of dimension m, i = 1,\dots,n. Define \tau_i = 1_{y_i} - \eta_i, where 1_{y_i} is the zero column vector except for a one at the y_i-th element; then in the case of soft margin with p = 1 and F(u) = u we have the dual problem:

\max_{\tau} Q(\tau) = -\frac{1}{2} \sum_{i,j} (x_i^\top x_j) \tau_i^\top \tau_j + \frac{1}{C} \sum_{i=1}^{n} \tau_i^\top 1_{y_i}
u.c.  \tau_i \leq 1_{y_i}  and  \tau_i^\top 1 = 0  ∀ i

We remark again that we obtain a quadratic program which involves only the inner products between all pairs of vectors x_i, x_j. Hence the generalization to the non-linear case is straightforward with the introduction of the kernel concept. The general problem is finally written by replacing the factor x_i^\top x_j by the kernel K(x_i, x_j):

\max_{\tau} Q(\tau) = -\frac{1}{2} \sum_{i,j} K(x_i, x_j) \tau_i^\top \tau_j + \frac{1}{C} \sum_{i=1}^{n} \tau_i^\top 1_{y_i}    (13)
u.c.  \tau_i \leq 1_{y_i}  and  \tau_i^\top 1 = 0  ∀ i    (14)


The optimal solution of this problem allows us to evaluate the classification rule:

H(x) = \arg\max_{r=1,\dots,m} \left\{ \sum_{i=1}^{n} \tau_{i,r} K(x, x_i) \right\}    (15)

For a small number of classes m, we can implement the above optimization with a traditional QP program using a matrix of size mn × mn. However, for a large number of classes, we must employ an efficient algorithm, as even storing an mn × mn matrix is already a complicated problem. Crammer and Singer introduced an interesting algorithm which solves this optimization problem efficiently both in storage and in computation speed.
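Once the dual variables τ have been obtained from (13)-(14), evaluating the classification rule (15) is a simple matrix operation; a minimal numpy sketch is given below. For the linear case, a ready-made alternative is, for instance, scikit-learn's LinearSVC with the multi_class="crammer_singer" option.

```python
import numpy as np

def multiclass_decision(tau, K_new):
    # Decision rule (15): H(x) = argmax_r sum_i tau[i, r] K(x, x_i)
    # tau has shape (n, m); K_new has shape (n_test, n) with entries K(x, x_i).
    scores = K_new @ tau                    # (n_test, m) matrix of class scores
    return np.argmax(scores, axis=1) + 1    # classes labeled 1..m as in the text
```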

5 SVM-regression in finance

Recently, different applications in the financial field have been developed in two main directions. The first one employs the SVM as a non-linear estimator in order to forecast the market trend or volatility. In this context, the SVM is used as a regression technique, with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using the SVM as a classification technique which aims at stock selection in trading strategies (for example long/short strategies). The SVM regression can be considered as a non-linear filter for time series or as a regression for evaluating a score. We first discuss how to employ the SVM regression as an estimator of the trend of a given asset. The estimated trend can be used later in momentum strategies such as a trend-following strategy. We next use the SVM as a method for constructing stock scores for a long/short strategy.

5.1 Numerical tests on SVM-regressors

We test here the efficiency of the different regressors discussed above. They can be distinguished by the form of the loss function ($L_1$-type or $L_2$-type) or by the form of the non-linear kernel. We do not focus yet on the calibration of the SVM parameters and reserve it for the next discussion on the trend extraction of financial time series, with a full description of the cross-validation procedure. For a given time series $y_t$, we would like to regress the data on the training vector $x = t = (t_i)_{i=1,\ldots,n}$. Let us consider two models of time series. The first model is simply a deterministic trend perturbed by a white noise:
$$
y_t = (t - a)^3 + \sigma N(0, 1) \qquad (16)
$$
The second model for our tests is the Black-Scholes model of the stock price:
$$
\frac{\mathrm{d}S_t}{S_t} = \mu_t\,\mathrm{d}t + \sigma_t\,\mathrm{d}B_t \qquad (17)
$$
We notice here that the studied signal is $y_t = \ln S_t$. The parameters of the model are the annualized return $\mu = 5\%$ and the annualized volatility $\sigma = 20\%$. We consider the regression over a period of one year, corresponding to $N = 260$ trading days. The first test consists of comparing the $L_1$-regressor and the $L_2$-regressor for the Gaussian kernel (see Figures 3-4). As shown in Figures 3 and 4, the $L_2$-regressor seems to be more suitable for the regression. Indeed, over many tests on data simulated from model (17), we observe that the $L_2$-regressor is more stable than the $L_1$-regressor (i.e. $L_1$ is more sensitive to the training data set). In the second test, we compare different $L_2$ regressions corresponding to four typical kernels: 1. Linear, 2. Polynomial, 3. Gaussian, 4. Sigmoid. A sketch of this comparison on simulated data is given below.
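As a hedged illustration of this kernel comparison, the sketch below simulates one path of model (17) and fits scikit-learn's `SVR` with the four kernels. Note that `SVR` implements the $\varepsilon$-insensitive ($L_1$-type) loss, so it only approximates the $L_2$-regressor discussed above, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Simulate one year of Black-Scholes log-prices (model (17)), mu = 5%, sigma = 20%
rng = np.random.default_rng(0)
n, mu, sigma, dt = 260, 0.05, 0.20, 1.0 / 260
log_s = np.cumsum((mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n))

t = np.arange(n, dtype=float).reshape(-1, 1)  # training vector x = t

# Compare the four typical kernels (parameter values are illustrative)
kernels = {
    "linear":  SVR(kernel="linear", C=10.0, epsilon=0.01),
    "poly":    SVR(kernel="poly", degree=3, C=10.0, epsilon=0.01),
    "rbf":     SVR(kernel="rbf", gamma=1e-3, C=10.0, epsilon=0.01),
    "sigmoid": SVR(kernel="sigmoid", gamma=1e-3, C=10.0, epsilon=0.01),
}
for name, model in kernels.items():
    fit = model.fit(t, log_s).predict(t)
    rmse = np.sqrt(np.mean((fit - log_s) ** 2))
    print(f"{name:8s} in-sample RMSE = {rmse:.4f}")
```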

Figure 3: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (16)

Figure 4: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (17)

Figure 5: Comparison of different regression kernels for model (16)

Figure 6: Comparison of different regression kernels for model (17)

5.2 SVM-Filtering for forecasting the trend of a signal

Here, we employ SVM as a non-linear filtering technique for extracting the hidden trend of a time series. The regression principle was explained in the last discussion. We now apply this technique to estimate the derivative of the trend, $\hat{\mu}_t$, and then plug it into a trend-following strategy.

5.2.1 Description of the trend-following strategy

We choose here the simplest trend-following strategy, whose exposure is given by:
$$
e_t = m\,\frac{\hat{\mu}_t}{\hat{\sigma}_t^2}
$$
with $m$ the risk tolerance and $\hat{\sigma}_t$ the estimator of the volatility given by:
$$
\hat{\sigma}_t^2 = \frac{1}{T}\int_0^T \sigma_t^2\,\mathrm{d}t = \frac{1}{T}\sum_{i=t-T+1}^{t}\ln^2\frac{S_i}{S_{i-1}}
$$
In order to limit the risk of explosion of the exposure $e_t$, we cap it between a lower bound $e_{\min}$ and an upper bound $e_{\max}$:
$$
e_t^{\star} = \max\left(\min\left(m\,\frac{\hat{\mu}_t}{\hat{\sigma}_t^2},\, e_{\max}\right),\, e_{\min}\right)
$$
The wealth of the portfolio is then given by the following expression:
$$
W_{t+1} = W_t + W_t\left[e_t^{\star}\left(\frac{S_{t+1}}{S_t} - 1\right) + \left(1 - e_t^{\star}\right)r_t\right]
$$
A minimal numerical sketch of this exposure and wealth update is given below.
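The following sketch implements the capped exposure and the wealth recursion above, assuming that a trend estimate $\hat{\mu}_t$ (for example from the SVM filter of the next subsection) is already available and expressed on the same time scale as $\hat{\sigma}_t^2$; the parameter values are illustrative.

```python
import numpy as np

def trend_following_wealth(prices, mu_hat, m=1.0, T=20, e_min=-2.0, e_max=2.0, r=0.0, w0=100.0):
    """Sketch of the capped trend-following strategy described above.

    prices : array of prices S_t
    mu_hat : array of trend estimates (e.g. from the SVM filter), same length as prices
    """
    log_ret = np.diff(np.log(prices))
    wealth = [w0]
    for t in range(T, len(prices) - 1):
        # rolling volatility estimator: (1/T) * sum of squared log-returns
        sigma2_hat = np.mean(log_ret[t - T:t] ** 2)
        e_star = np.clip(m * mu_hat[t] / sigma2_hat, e_min, e_max)   # capped exposure e*_t
        gross = prices[t + 1] / prices[t] - 1.0
        wealth.append(wealth[-1] * (1.0 + e_star * gross + (1.0 - e_star) * r))
    return np.array(wealth)
```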

5.2.2 SVM-Filtering

We now discuss how to build a cross-validation procedure which can learn the trend of a given signal. We employ the moving average as a benchmark to compare with this new filter. An important parameter in moving-average filtering is the estimation horizon $T$, so we use this horizon as a reference to calibrate our SVM-filtering. For the sake of simplicity, we study here only the SVM-filter with Gaussian kernel and $L_2$ penalty. The two typical parameters of the SVM-filter are $C$ and $\sigma$: $C$ allows a certain level of error in the regression curve, while $\sigma$ characterizes the horizon of estimation and is directly proportional to $T$. We propose two schemes for the validation procedure, both based on the following division of the data: training set, validation set and testing set. In the first scheme, we fix the kernel parameter $\sigma = T$ and optimize the error-tolerance parameter $C$ on the validation set. This scheme is comparable to our moving-average benchmark. The second scheme consists of optimizing the couple of parameters $(C, \sigma)$ on the validation set. In this case, we let the validation data decide the estimation horizon. This scheme is more complicated to interpret, as $\sigma$ is now a dynamic parameter. However, by relating $\sigma$ to the local horizon, we gain additional insight into the changes in the price of the underlying asset. For example, we can determine from the historical data whether the underlying asset undergoes a period with a long or a short trend, which can help to recognize additional signatures such as the cycle between long and short trends. The two schemes are reported in the following algorithm.

Figure 7: Cross-validation procedure for determining the optimal values $C^{\star}$, $\sigma^{\star}$ (historical data up to today are split into a training set of length $T_1$ followed by a validation set of length $T_2$; the forecast is then made over the next $T_2$ days)

Algorithm 1 SVM trend filtering
procedure SVM_Filter(X, y, T)
    Divide data into training set Dtrain, validation set Dvalid and testing set Dtest
    Regression on the training data Dtrain
    Construct the SVM prediction on the validation set Dvalid
    if fixed horizon then
        σ = T
        Compute the prediction error Error(C) on Dvalid
        Minimize Error(C) and obtain the optimal parameter C*
    else
        Compute the prediction error Error(σ, C) on Dvalid
        Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
    end if
    Use the optimal parameters to predict the trend on the testing set Dtest
end procedure
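A possible numerical sketch of the second (dynamic-horizon) scheme of Algorithm 1, using scikit-learn's `SVR` with a Gaussian kernel and a plain grid search, is given below; the grids and the mapping $\gamma = 1/(2\sigma^2)$ are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVR

def svm_filter(t_train, y_train, t_valid, y_valid, t_test,
               C_grid=(0.1, 1.0, 10.0, 100.0), sigma_grid=(5.0, 10.0, 20.0, 40.0)):
    """Sketch of Algorithm 1 (dynamic-horizon scheme): pick (C, sigma) by validation error."""
    best, best_err = None, np.inf
    for C in C_grid:
        for sigma in sigma_grid:
            model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
            model.fit(t_train.reshape(-1, 1), y_train)
            err = np.mean((model.predict(t_valid.reshape(-1, 1)) - y_valid) ** 2)
            if err < best_err:
                best, best_err = (C, sigma), err
    C_star, sigma_star = best
    # refit with the optimal parameters and predict the trend on the testing set
    model = SVR(kernel="rbf", C=C_star, gamma=1.0 / (2.0 * sigma_star ** 2))
    model.fit(t_train.reshape(-1, 1), y_train)
    return model.predict(t_test.reshape(-1, 1)), (C_star, sigma_star)
```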

5.2.3 Backtesting

We first check the SVM-filter on simulated data given by the Black-Scholes model of the price. We consider a stock price with annualized return $\mu = 10\%$ and annualized volatility $\sigma = 20\%$. The regression is based on one trading year of data ($n = 260$ days) with a fixed horizon of one month ($T = 20$ days). In Figure 8, we present the result of the SVM trend prediction with fixed horizon $T = 20$, whereas Figure 9 presents the SVM trend prediction for the second scheme.

5.3 SVM for multivariate regression

As a regression method, SVM can also be employed for multivariate regression. Assume that we consider a universe of $d$ stocks $X = \left(X^{(i)}\right)_{i=1,\ldots,d}$ over a period of $n$ dates. The performance of the index or of an individual stock that we are interested in is given by $y$. We are looking for the prediction of the value of $y_{n+1}$ by using the regression on the historical data $(X_t, y_t)_{t=1,\ldots,n}$. In this case, the different stocks play the role of the factors of the vectors in the training set. We can also apply other regressions, such as the prediction of the performance of a stock based on the available information of all the factors.

5.3.1 Multivariate regression

We first test here the efficiency of the multivariate regression on a simulated model. Assume that all the factors follow a Brownian motion:
$$
\mathrm{d}X_t^{(i)} = \mu_t\,\mathrm{d}t + \sigma_t\,\mathrm{d}B_t^{(i)} \qquad \forall i = 1, \ldots, d
$$

Figure 8: SVM-filtering with the fixed-horizon scheme

Figure 9: SVM-filtering with the dynamic-horizon scheme


Let $(y_t)_{t=1,\ldots,n}$ be the vector to be regressed, which is related to the input $X$ by a function $y_t = f(X_t) = W_t^T X_t$. We would like to regress the vector $y = (y_t)_{t=2,\ldots,n}$ on the historical data $(X_t)_{t=1,\ldots,n-1}$ by SVM-regression. This regression is given by the function $y_t = F(X_{t-1})$. Hence, the prediction of the future performance $y_{n+1}$ is given by:
$$
\mathbb{E}\left[y_{n+1}\,|\,X_n\right] = F(X_n)
$$
In Figure 10, we present the results obtained with the Gaussian kernel and the $L_1$ and $L_2$ penalty conditions, whereas in Figure 11, we compare the results obtained with different types of kernel. Here, we consider just a simple scheme with a lag of one trading day for the regression. In all figures, we notice this lag in the prediction of the value of $y$.

Figure 10: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for the simulated multivariate model

5.3.2 Backtesting

6 SVM-classification in finance

In this section, we discuss the second application of SVM in finance, as a stock classifier. We first test our implementations of the binary classifier and of the multi-classifier. We then employ the SVM technique to study two different problems: (i) the recognition of sectors and (ii) the construction of an SVM score for a stock-picking strategy.

6.1 Test of SVM-classifiers

For the binary classification problem, we consider both approaches (dual/primal) to determine the boundary between two given classes, based on the available information of each data point.

Figure 11: Comparison of different kernels for multivariate regression

For the multiclassification problem, we first extend the binary classifier to the multi-class case by using a binary decision tree (SVM-BDT). This algorithm has been shown to be more efficient than the traditional approaches such as "one-against-all" or "one-against-one", both in computation time and in precision. The general approach of multi-SVM will then be compared to SVM-BDT.

6.1.1 Binary-SVM classifier

Let us compare here the two proposed approaches (dual/primal) for solving the SVM-classification problem numerically. In order to realize the test, we consider a random training data set of $n$ vectors $x_i$ with classification criterion $y_i = \operatorname{sign}(x_i)$. We present here the comparison of the two classification approaches with a linear kernel. The result of the primal approach is directly obtained with the software of O. Chapelle², which is implemented with the $L_2$ penalty condition. Our dual solver is implemented for both $L_1$ and $L_2$ penalty conditions, simply by employing a QP program. In Figure 12, we show the classification results obtained by both methods with the $L_2$ penalty condition. We next test the non-linear classification by using the Gaussian kernel (RBF kernel) with the binary dual solver. We generate the simulated data in the same way as in the last example, with $x \in \mathbb{R}^2$. The result of the classification is illustrated in Figure 13 for the RBF kernel with parameters $C = 0.5$ and $\sigma = 2$.³

² The free software of O. Chapelle can be found at the following website: http://olivier.chapelle.cc/primal/

³ We used here the "plotlssvm" function of the LS-SVM toolbox for the graphical illustration. A similar result was also obtained by using the "trainlssvm" function of the same toolbox.
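The non-linear test can be reproduced, at least in spirit, with the following sketch based on scikit-learn's `SVC`; the way the two-dimensional labels are generated is an illustrative assumption, and $\sigma = 2$ is translated into scikit-learn's parametrization $\gamma = 1/(2\sigma^2)$.

```python
import numpy as np
from sklearn.svm import SVC

# Two-dimensional simulated data, labelled by the sign of a non-linear score (illustrative choice)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)

# RBF kernel with C = 0.5 and sigma = 2; scikit-learn uses gamma = 1 / (2 sigma^2)
sigma = 2.0
clf = SVC(kernel="rbf", C=0.5, gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)
print("in-sample accuracy:", clf.score(X, y))
```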

Figure 12: Comparison between the dual algorithm and the primal algorithm

Figure 13: Illustration of non-linear classification with the Gaussian kernel

6.1.2 Multi-SVM classifier

We first test the implementation of SVM-BDT on simulated data $(x_i)_{i=1,\ldots,n}$ which are generated randomly. We suppose that these data are distributed in $N_c$ classes. In order to test our multi-SVM implementation efficiently, the response vector $y = (y_1, \ldots, y_n)$ is supposed to depend only on the first coordinate of the data vector:
$$
z = U(0,1), \qquad x_1 = N_c z, \qquad y = [x_1] + \epsilon\, N(0,1), \qquad x_i = U(0,1) \quad \forall i > 1
$$
Here $[a]$ denotes the integer part of $a$. We could generate our simulated data in a much more general way, but it would then be very hard to visualize the result of the classification. With the above choice of simulated data, we can see that in the case $\epsilon = 0$ the data are separable along the axis $x_1$. In the geometric view, the space $\mathbb{R}^d$ is divided into $N_c$ zones along the axis $x_1$: $\mathbb{R}^{d-1}\times[0,1[, \ldots, \mathbb{R}^{d-1}\times[N_c - 1, N_c[$. The boundaries are simply the hyperplanes $\mathbb{R}^{d-1}$ crossing $x_1 = 1, \ldots, N_c - 1$. When we introduce some noise on the coordinate $x_1$ ($\epsilon > 0$), the training set is no longer separable by this ensemble of linear hyperplanes: there will be some misclassified points and some deformation of the boundaries thanks to the non-linear kernel. For the sake of simplicity, we assume that the data $(x, y)$ are already gathered by group. In Figures 14 and 15, we present the classification results for in-sample data and out-of-sample data in the case $\epsilon = 0$ (i.e. separable data). We then introduce the noise in the data coordinate $x_1$ with $\epsilon = 0.2$. A sketch of this data-generation scheme is given below.
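The sketch below generates this simulated data set and fits a multi-class SVM; scikit-learn's `SVC` (which handles the multi-class case internally) is used only as a stand-in for the SVM-BDT implementation described in the text, and the rounding of $y$ to integer labels is an assumption of the sketch.

```python
import numpy as np
from sklearn.svm import SVC

def simulate_classes(n=500, d=2, n_classes=10, eps=0.2, seed=0):
    """Simulated data where the class depends only on the first coordinate (see text)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    X[:, 0] = n_classes * rng.uniform(0.0, 1.0, size=n)          # x_1 = N_c * U(0,1)
    y = np.floor(X[:, 0]) + eps * rng.standard_normal(n)          # y = [x_1] + eps * N(0,1)
    y = np.clip(np.round(y), 0, n_classes - 1).astype(int)        # map back to class labels
    return X, y

X, y = simulate_classes(eps=0.2)
# stand-in for the SVM-BDT classifier of the text
clf = SVC(kernel="rbf", C=20.0, gamma=0.5).fit(X, y)
print("in-sample accuracy:", clf.score(X, y))
```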

Figure 14: Illustration of multiclassification with SVM-BDT for in-sample data

Figure 15: Illustration of multiclassification with SVM-BDT for out-of-sample data

Figure 16: Illustration of multiclassification with SVM-BDT for $\epsilon = 0$

Figure 17: Illustration of multiclassification with SVM-BDT for $\epsilon = 0.2$

6.2 SVM for classification

We employ here the multi-SVM algorithm on the constituents of the Eurostoxx 300 index. Our goal is to determine the boundaries between the various sectors to which the index constituents belong. As the algorithm contains two main parts, classification and prediction, we can classify our stocks via their common properties resulting from the available factors. The number of misclassified stocks, or the classification error, gives us an estimation of the quality of the sector definition. We next study the recognition phase on the ensemble of tested data.

6.2.1 Classification of stocks by sectors

In order to classify the stocks composing the Eurostoxx 300 index, we consider the $N_{\mathrm{train}} = 100$ most representative stocks in terms of value. In order to establish the multiclass SVM classification using the binary decision tree, we sort the $N_{\mathrm{train}} = 100$ assets by sector. We then employ the SVM-BDT to compute the binary separators of the decision tree. In Figure 18, we present the classification result with the Gaussian kernel and the $L_2$ penalty condition. For $\sigma = 2$ and $C = 20$, we are able to correctly classify the 100 assets over the ten main sectors: Oil & Gas, Industrials, Financials, Telecommunications, Health Care, Basic Materials, Consumer Goods, Technology, Utilities, Consumer Services. In order to check the efficiency of the classification, we test the prediction quality on a test set composed of $N_{\mathrm{test}} = 50$ assets. In Figure 19, we compare the SVM-BDT result with the true sector distribution of these 50 assets. We obtain in this case a rate of correct prediction of about 58%.

Figure 18: Multiclassification with SVM-BDT on the training set

Figure 19: Prediction efficiency of SVM-BDT on the validation set

6.2.2 Calibration procedure

As discussed above in the implementation part of the SVM solver, two kinds of parameters play an important role in the classification process. The first parameter, $C$, controls the error tolerance of the margin, and the second concerns the choice of kernel ($\sigma$ for the Gaussian kernel, for example). In the last example, we optimized the couple of parameters $(C, \sigma)$ in order to obtain classifiers which do not commit any error on the training set. However, this is meaningful only if the sectors are correctly defined, and nothing guarantees that the given notion of sectors is the most appropriate one. Hence, the classification process should consist of two steps: (i) determine the binary SVM classifiers on the training data set and (ii) calibrate the parameters on the validation set. In fact, we decide to optimize the couple of parameters $(C, \sigma)$ by minimizing the realized error on the validation set, because the error committed on the training set (learning set) must always be smaller than the one on the validation set (unknown set). In the second phase, we can redefine the sectors in the sense that, if an asset is misclassified, we change its sector label and repeat the optimization on the validation set until convergence. At the end of the calibration procedure, we expect to obtain first a new recognition of sectors and second a multi-classifier for new assets. As SVM uses the training set to learn the classification, it must commit fewer errors on this set than on the validation set. We propose here to optimize the SVM parameters by minimizing the error on the validation set. We use the same error function defined in Section 3, but apply it on the validation data set $\mathcal{V}$:
$$
\mathrm{Error} = \frac{1}{\operatorname{card}(\mathcal{V})}\sum_{i\in\mathcal{V}} \psi\left(-y_i' f(x_i')\right)
$$
where $\psi(x) = \mathbb{I}_{\{x>0\}}$, with $\mathbb{I}_A$ the standard notation of the indicator function. However, this error function involves the step function $\psi$, which is discontinuous and can cause some difficulty if we want to determine the best parameters via the optimal test error. In order to perform the search for the minimal test error by gradient descent, for example, we smooth the test error by regularizing the step function as:
$$
\tilde{\psi}(x) = \frac{1}{1 + \exp(-Ax + B)}
$$
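The following sketch computes this validation error in both its hard and smoothed versions; the values of $A$ and $B$ are left as inputs since their choice is discussed below.

```python
import numpy as np

def validation_error(y_valid, f_valid, A=None, B=0.0):
    """Validation error (1 / |V|) * sum psi(-y_i * f(x_i)).

    y_valid : labels in {-1, +1} on the validation set
    f_valid : decision values f(x_i) of the classifier
    If A is given, the step function is replaced by the smoothed version
    psi_tilde(x) = 1 / (1 + exp(-A x + B)); A and B are illustrative inputs.
    """
    margin = -y_valid * f_valid
    if A is None:
        psi = (margin > 0).astype(float)          # hard indicator I{x > 0}
    else:
        psi = 1.0 / (1.0 + np.exp(-A * margin + B))
    return psi.mean()
```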

The choice of the parameters $A$ and $B$ is important: if $A$ is too small, the approximation error is too large, whereas if $A$ is too large, the test error is not smooth enough for the minimization procedure.

6.2.3 Recognition of sectors

By construction, the SVM classifier is a very efficient method for recognizing and classifying a new element with respect to a given number of classes. However, it is not able by itself to recognize the sectors or to introduce a new, correct definition of the available sectors over a universe of available data (stocks). In finance, the classification by sector is related more to the origin of the stock than to its intrinsic behaviour in the market. This may introduce some problems in a trading strategy if a stock is misclassified, for example in the case of a pair-trading strategy. Here, we try to overcome this weak point of SVM by introducing a method which modifies the initial definition of the sectors. The main idea of the sector-recognition procedure is the following. We divide the available data into two sets: a training set and a validation set. We employ the training set to learn the classification and the validation set to optimize the SVM parameters. We start


with the initial definition of the given sectors. Within each iteration, we learn the training set in order to determine the classifiers, then we compute the validation error. An optimization procedure on the validation error helps us to determine the optimal parameters of the SVM. For each ensemble of optimal parameters, we may encounter some errors on the training set. If the validation error is smaller than a certain threshold with no error on the training set, we have reached the optimal configuration of the sector definition. If there are errors on the training set, we relabel the misclassified data points and define new sectors with this correction. All the sector labels are changed by this rule for both the training and validation sets. The iteration procedure is repeated until no error is committed on the training set for a given expected threshold of error on the validation set. This sector-recognition procedure is summarized in the following algorithm, and a short sketch of the relabelling loop is given after it.

Algorithm 2 Sector recognition by SVM classification
procedure SVM_SectorRecognition(X, y, ε)
    Divide the historical data into a training set T and a validation set V
    Initiate the sector labels with the physical sector names: Sec_1^0, ..., Sec_m^0
    while E_T > ε do
        while E_V > ε do
            Compute the SVM separators for the labels Sec_1, ..., Sec_m on T for given (C, σ)
            Construct the SVM predictor from the separators Sec_1, ..., Sec_m
            Compute the error E_V on the validation set
            Update the parameters (C, σ) until convergence of E_V
        end while
        Compute the error E_T on the training set
        Identify the misclassified points of the training set
        Relabel the misclassified points, then update the definition of the sectors
    end while
end procedure
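A simplified sketch of this relabelling loop is given below. The stopping rule, the parameter grid and the choice of relabelling a misclassified point with its predicted sector are assumptions of the sketch, not prescriptions of the original procedure.

```python
import numpy as np
from sklearn.svm import SVC

def recognize_sectors(X_train, y_train, X_valid, y_valid, eps=0.05, max_iter=20):
    """Simplified sketch of Algorithm 2: relabel misclassified training points until
    the classifier is consistent with the (possibly redefined) sector labels."""
    y_train = y_train.copy()
    best = None
    for _ in range(max_iter):
        # inner calibration: pick (C, sigma) minimizing the validation error
        best, best_err = None, np.inf
        for C in (1.0, 10.0, 100.0):
            for sigma in (0.5, 1.0, 2.0):
                clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2)).fit(X_train, y_train)
                err = 1.0 - clf.score(X_valid, y_valid)
                if err < best_err:
                    best, best_err = clf, err
        pred = best.predict(X_train)
        misclassified = pred != y_train
        if not misclassified.any() and best_err <= eps:
            break
        # redefine the sectors: misclassified points take the predicted label (assumption)
        y_train[misclassified] = pred[misclassified]
    return best, y_train
```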

6.3 SVM for score construction and stock selection

Traditionally, in order to improve the stock picking, we rank the stocks by constructing a "score" based on all the characteristics (so-called factors) of the considered stock. We require that the construction of this global quantity (a combination of factors) satisfies some classification criterion, for example the performance. We denote by $(x_i)_{i=1,\ldots,n}$ the data, with $x_i$ the ensemble of factors of the $i$-th stock. The classification criterion, such as the performance, is denoted by the vector $y = (y_i)_{i=1,\ldots,n}$. The aim of the SVM classifier in this problem is to recognize which stocks belong to the high/low performance class (overperforming/underperforming). More precisely, we have to identify a separation boundary as a function of the score and the performance, $f(x, y)$. Hence, SVM stock picking consists of two steps: (i) construction of the factor ensemble (i.e. harmonizing all characteristics of a given stock, such as the price, the risk, macro-properties, etc., into comparable quantities); (ii) application of the SVM-classifier algorithm with an adaptive choice of parameters. In the following, we first give a brief description of the score constructions and then establish the backtest of the stock-picking strategy.

6.3.1 Probit model for score construction

We summarize here briefly the main idea of the score construction by the Probit model. Assume that a set of training data $(x_i, y_i)_{i=1,\ldots,n}$ is available, where $x$ is the vector of factors and $y$ is the binary response. We look for a conditional probability distribution of the random variable $Y$ for a given point $X$. This probability distribution can be used later to predict the response of a new data point $x_{\mathrm{new}}$. The probit model estimates this conditional probability in the form:
$$
\Pr\left(Y = 1\,|\,X\right) = \Phi\left(X^T\beta + \alpha\right)
$$

with $\Phi(x)$ the cumulative distribution function (CDF) of the standard normal distribution. The couple of parameters $(\alpha, \beta)$ can be obtained by maximum likelihood estimation. The choice of the function $\Phi(x)$ is quite natural when we work with a binary random variable, because it gives a symmetric probability distribution.

Remark 8 We remark that this model can be written in another form with the introduction of a hidden random variable $Y^{\star} = X^T\beta + \alpha + \epsilon$, where $\epsilon \sim N(0,1)$. Hence, $Y$ can be interpreted as an indicator for whether $Y^{\star}$ is positive:
$$
Y = \mathbb{I}_{\{Y^{\star} > 0\}} = \begin{cases} 1 & \text{if } Y^{\star} > 0 \\ 0 & \text{otherwise} \end{cases}
$$

In finance, we can employ this model for the score construction. We define the binary variable $Y$ from the relative return of a given asset with respect to the benchmark: $Y = 1$ if the return of the asset is higher than the one of the benchmark and $Y = 0$ otherwise. Hence, $\Pr(Y = 1\,|\,X)$ is the probability for the given asset with factor vector $X$ to outperform. Naturally, we can define this quantity as a score measuring the probability of gain over the benchmark:
$$
S = \Pr\left(Y = 1\,|\,X\right)
$$
In order to estimate the regression parameters $(\alpha, \beta)$, we maximize the log-likelihood function:
$$
\mathcal{L}(\alpha,\beta) = \sum_{i=1}^{n}\left[ y_i \ln\Phi\left(x_i^T\beta + \alpha\right) + (1-y_i)\ln\left(1 - \Phi\left(x_i^T\beta + \alpha\right)\right)\right]
$$
Using the parameters estimated by maximum likelihood, we can predict the score of a given asset with factor vector $X$ as follows:
$$
\hat{S} = \Phi\left(X^T\hat{\beta} + \hat{\alpha}\right)
$$
The probability distribution of the score $\hat{S}$ can then be computed by the empirical formula $\hat{F}(s) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{\{\hat{S}_i \leq s\}}$. The empirical test of this regression is realized on simulated data, where the binary variable $Y$ is defined from the $d$ factors by $Y = \mathbb{I}_{\{X^T\beta_0 + \alpha_0 + \epsilon > 0\}}$.
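A minimal sketch of this maximum-likelihood estimation, using SciPy's general-purpose optimizer rather than a dedicated probit routine, could look as follows; the numerical safeguards are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_probit(X, y):
    """Maximum-likelihood estimation of (alpha, beta) in Pr(Y = 1 | X) = Phi(X'beta + alpha)."""
    n, d = X.shape

    def neg_log_likelihood(theta):
        alpha, beta = theta[0], theta[1:]
        p = np.clip(norm.cdf(X @ beta + alpha), 1e-10, 1 - 1e-10)   # avoid log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(neg_log_likelihood, x0=np.zeros(d + 1), method="BFGS")
    return res.x[0], res.x[1:]

def probit_score(X, alpha, beta):
    # S_hat = Phi(X' beta_hat + alpha_hat)
    return norm.cdf(X @ beta + alpha)
```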

Here, the parameters of the model, $\alpha_0$ and $\beta_0$, are chosen as $\alpha_0 = 0.1$ and $\beta_0 = 1$. We employ the Probit regression in order to determine the score of $n = 500$ data points in the cases $d = 2$ and $d = 5$. The comparisons between the Probit score and the simulated score are presented in Figures 20-22.

Figure 20: Comparison between the simulated score and the Probit score for $d = 2$

6.3.2 SVM score construction

We now discuss how to employ SVM to construct the score for a given ensemble of assets. In the work of G. Simon (2005), the SVM score is constructed by using the SVM-regression algorithm. In fact, with the SVM-regression algorithm, we are able to forecast the future performance $\mathbb{E}[\mu_{t+1}\,|\,X_t] = \hat{\mu}_t$ based on the present ensemble of factors; this value can then be employed directly as the prediction in a trend-following strategy, without the need for a score construction. We propose here another utilization of the SVM algorithm, based on SVM-classification, for building scores which allow us to implement long/short strategies by using selection curves. Our main idea for the SVM-score construction is very similar to the Probit model. We first define a binary variable $Y_i = \pm 1$ associated with each asset $x_i$. This variable characterizes the performance of the asset with respect to the benchmark: if $Y_i = -1$, the stock underperforms, whereas if $Y_i = 1$ the stock outperforms. We next employ the binary SVM-classification to separate the universe of stocks into two classes: high performance and low performance.

Figure 21: Comparison between the simulated score CDF and the Probit score CDF for $d = 2$

Figure 22: Comparison between the simulated score PDF and the Probit score PDF for $d = 2$


Finally, we define the score of each stock as its distance to the decision boundary.

6.3.3 Selection curve

In order to construct a simple strategy, of the long/short type for example, we must be able to establish a selection rule based on the score obtained by the Probit model or the SVM regression. Depending on the strategy (long, short or long/short), we want to build a selection curve which determines the portion of assets selected for a certain level of error. For a long strategy, we prefer to buy a certain portion of high-performance stocks with the knowledge of the possible committed error. To do so, we define a selection curve for which the score plays the role of the parameter:
$$
Q(s) = \Pr\left(S \geq s\right), \qquad E(s) = \Pr\left(S \geq s\,|\,Y = 0\right) \qquad \forall\, s \in [0,1]
$$
This parametric curve can be traced in the square $[0,1]\times[0,1]$, as shown in Figure 23. On the x-axis, $Q(s)$ defines the quantile corresponding to the stock selection among the considered universe of stocks. On the y-axis, $E(s)$ defines the committed error corresponding to this stock selection; precisely, for a certain quantile, it measures the chance that we pick a badly performing stock. Two trivial limits are the points $(0,0)$ and $(1,1)$: the first point corresponds to the limit with no selection, whereas the second point corresponds to the limit where everything is selected. A good score-construction method should give a selection curve that is as convex as possible, because it guarantees a selection with fewer errors.

Figure 23: Selection curve of the long strategy for simulated data and the Probit model

Figure 24: Probit scores for Eurostoxx data with $d = 20$ factors

Reciprocally, for a short strategy, the selection curve can be obtained by tracing the following parametric curve:
$$
Q(s) = \Pr\left(S \leq s\right), \qquad E(s) = \Pr\left(S \leq s\,|\,Y = 1\right) \qquad \forall\, s \in [0,1]
$$
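The empirical selection curve can be computed from a vector of scores and realized outcomes as in the following sketch, which covers both the long and the short definitions above.

```python
import numpy as np

def selection_curve(scores, y, long_side=True):
    """Empirical selection curve: Q(s) = Pr(S >= s), E(s) = Pr(S >= s | Y = 0) for a long
    selection (and the symmetric definition with S <= s, Y = 1 for a short selection)."""
    thresholds = np.sort(np.unique(scores))
    Q, E = [], []
    for s in thresholds:
        if long_side:
            selected = scores >= s
            bad = y == 0          # badly performing stocks that we would wrongly buy
        else:
            selected = scores <= s
            bad = y == 1          # well performing stocks that we would wrongly short
        Q.append(selected.mean())
        E.append((selected & bad).sum() / max(bad.sum(), 1))
    return np.array(Q), np.array(E)
```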

Here, $Q(s)$ determines the quantile of low-performance stocks to be shorted, while $E(s)$ helps us to avoid selling the high-performance ones. As the selection curve is independent of the score definition, it is an appropriate quantity for comparing different scoring techniques. In the following, we employ the selection curve to compare the score constructions of the Probit model and of the SVM approach. Figure 24 shows the comparison of the selection curves constructed by the SVM score and the Probit score on the training set. Here, we did not perform any calibration of the SVM parameters.

6.3.4 Backtesting and comparison

As presented in the last discussion on the regression, we have to build a cross-validation procedure to optimize the SVM parameters. We follow the traditional routine by dividing the data into three independent sets: (i) a training set, (ii) a validation set and (iii) a testing set. The classifier is obtained from the training set, whereas its optimal parameters $(C, \sigma)$ are obtained by minimizing the fitting error on the validation set. The efficiency of the SVM algorithm is finally checked on the testing set. We summarize the cross-validation procedure in the algorithm below. In order to make the training set close to both the validation data and the testing data, we decide to divide the data in the following time order: validation set, training set and testing set. In this way, the prediction score on the testing set contains more information from the recent past.

Algorithm 3 SVM score construction
procedure SVM_Score(X, y)
    Divide data into training set Dtrain, validation set Dvalid and testing set Dtest
    Classify the training data by using the high/low performance criterion
    Compute the decision boundary on Dtrain
    Construct the SVM score on Dvalid by using the distance to the decision boundary
    Compute the prediction error and classification error Error(σ, C) on Dvalid
    Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
    Use the optimal parameters to compute the final SVM score on the testing set Dtest
end procedure
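A possible sketch of Algorithm 3 with scikit-learn's `SVC` is given below; the signed distance to the decision boundary is obtained with `decision_function`, and the parameter grids are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def svm_score(X_train, y_train, X_valid, y_valid, X_test,
              C_grid=(1.0, 10.0, 100.0), sigma_grid=(0.5, 1.0, 2.0)):
    """Sketch of Algorithm 3: y in {-1, +1} marks under/over-performance vs. the benchmark."""
    best, best_err = None, np.inf
    for C in C_grid:
        for sigma in sigma_grid:
            clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2)).fit(X_train, y_train)
            err = 1.0 - clf.score(X_valid, y_valid)   # classification error on the validation set
            if err < best_err:
                best, best_err = clf, err
    # the SVM score of a stock is its signed distance to the decision boundary
    return best.decision_function(X_test)
```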

We now employ this procedure to compute the SVM score on the universe of stocks of the Eurostoxx index. Figure 25 presents the construction of the score based on the training set and the validation set. The SVM parameters are optimized on the validation set, while the final score construction uses both the training and validation sets in order to work with the largest data ensemble.

Figure 25: SVM scores for Eurostoxx data with $d = 20$ factors

7 Conclusion

The support vector machine is a well-established method with a very wide use in various domains. From the financial point of view, this method can be used to recognize and to predict high-performance stocks. Hence, SVM is a good indicator for building efficient trading strategies over a universe of stocks. Within this paper, we first revisited the basic idea of SVM in both the classification and regression contexts. The extension to the case of multi-classification was


also discussed in detail, and various applications of this technique were introduced. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first one consists of using SVM as a signal filter; the advantage of the method is that we can calibrate the model parameters by using only the available data. The second application is to employ SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we dealt with SVM classification. The two main applications discussed in the scope of this paper are the score construction and the sector recognition. Both resulting pieces of information are important for building momentum strategies, which are at the core of modern asset management.

References

[1] Allwein E. L. et al. (2000), Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, Journal of Machine Learning Research, 1, pp. 113-141.

[2] At A. (2005), Optimisation d'un Score de Stock Screening, Rapport de stage-ENSAE, Société Générale Asset Management.

[3] Basak D., Pal S. and Patranabis D.J. (2007), Support Vector Regression, Neural Information Processing, 11, pp. 203-224.

[4] Ben-Hur A. and Weston J. (2010), A User's Guide to Support Vector Machines, Methods in Molecular Biology, 609, pp. 223-239.

[5] Burges C. J. C. (1998), A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, pp. 121-167.

[6] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning and Prior Knowledge, PhD thesis, Paris 6.

[7] Chapelle O. et al. (2002), Choosing Multiple Parameters for Support Vector Machines, Machine Learning, 46, pp. 131-159.

[8] Chapelle O. (2007), Training a Support Vector Machine in the Primal, Neural Computation, 19, pp. 1155-1178.

[9] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20, pp. 273-297.

[10] Crammer K. and Singer Y. (2001), On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2, pp. 265-292.

[11] Gestel T. V. et al. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12, pp. 809-820.

[12] Madzarov G. et al. (2009), A Multi-class SVM Classifier Utilizing Binary Decision Tree, Informatica, 33, pp. 233-241.

[13] Milgram J. et al. (2006), "One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition.

[14] Potluru V. K. et al. (2009), Efficient Multiplicative Updates for Support Vector Machines, Proceedings of the 2009 SIAM Conference on Data Mining.

[15] Simon G. (2005), L'Econométrie Non Linéaire en Gestion Alternative, Rapport de stage-ENSAE, Société Générale Asset Management.

[16] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48, pp. 847-861.

[17] Tsochantaridis I. et al. (2004), Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.

[18] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.

A Dual problem of SVM

In the traditional approach, the SVM problem is first mapped to its dual problem and then solved by a QP program. We present here the detailed derivation of the dual problem in both the hard-margin and soft-margin SVM cases.

A.1 Hard-margin SVM classifier

Let us start with the hard-margin SVM problem for classification:
$$
\min_{w,b}\; \frac{1}{2}\left\|w\right\|^2 \qquad \text{u.c.}\quad y_i\left(w^T x_i + b\right) \geq 1 \quad i = 1, \ldots, n
$$
In order to get the dual problem, we construct the Lagrangian of the inequality constraints by introducing the positive Lagrange multipliers $\Lambda = (\alpha_1, \ldots, \alpha_n) \geq 0$:
$$
\mathcal{L}(w, b, \Lambda) = \frac{1}{2}\left\|w\right\|^2 - \sum_{i=1}^{n}\alpha_i y_i\left(w^T x_i + b\right) + \sum_{i=1}^{n}\alpha_i
$$
Minimizing the Lagrangian with respect to $(w, b)$, we obtain the following equations:
$$
\frac{\partial \mathcal{L}}{\partial w^T} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0
$$
Inserting these results into the Lagrangian, we obtain the dual objective function $\mathcal{L}_D$:
$$
\mathcal{L}_D(\Lambda) = \Lambda^T 1 - \frac{1}{2}\Lambda^T D\Lambda
$$
with $D_{ij} = y_i y_j x_i^T x_j$ and the constraints $\Lambda^T y = 0$ and $\Lambda \geq 0$. Thanks to the KKT theorem, the initial optimization problem is equivalent to maximizing the dual objective function $\mathcal{L}_D(\Lambda)$:
$$
\max_{\Lambda}\; \Lambda^T 1 - \frac{1}{2}\Lambda^T D\Lambda \qquad \text{u.c.}\quad \Lambda^T y = 0, \quad \Lambda \geq 0
$$


A.2 Soft-margin SVM classifier

We turn now to the soft-margin SVM classifier in the $L_1$ case, $F(u) = u$ and $p = 1$. We first write down the primal problem:
$$
\min_{w,b,\xi}\; \frac{1}{2}\left\|w\right\|^2 + C\,F\!\left(\sum_{i=1}^{n}\xi_i^p\right) \qquad \text{u.c.}\quad y_i\left(w^T x_i + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad i = 1, \ldots, n
$$
For both cases, we construct the Lagrangian by introducing the couple of Lagrange multipliers $(\Lambda, \mu)$ for the $2n$ constraints:
$$
\mathcal{L}(w, b, \Lambda, \mu) = \frac{1}{2}\left\|w\right\|^2 + C\,F\!\left(\sum_{i=1}^{n}\xi_i\right) - \sum_{i=1}^{n}\alpha_i\left[y_i\left(w^T x_i + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{n}\mu_i\xi_i
$$
with the constraints $\Lambda \geq 0$ and $\mu \geq 0$ on the Lagrange multipliers. Minimizing the Lagrangian with respect to $(w, b, \xi)$ gives:
$$
\frac{\partial \mathcal{L}}{\partial w^T} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial \xi} = C\,1 - \Lambda - \mu = 0
$$
Inserting these results into the Lagrangian leads to the dual problem:
$$
\max_{\Lambda}\; \Lambda^T 1 - \frac{1}{2}\Lambda^T D\Lambda \qquad \text{u.c.}\quad \Lambda^T y = 0, \quad 0 \leq \Lambda \leq C1 \qquad (18)
$$
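For small data sets, the dual problem (18) can be solved directly with a general-purpose constrained optimizer, as in the sketch below; a dedicated QP or SMO solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C=1.0):
    """Solve the dual (18): max 1'L - 0.5 L'DL  s.t.  L'y = 0, 0 <= L <= C."""
    n = len(y)
    D = (y[:, None] * y[None, :]) * (X @ X.T)     # D_ij = y_i y_j x_i' x_j

    def negative_dual(lam):
        return -(lam.sum() - 0.5 * lam @ D @ lam)

    res = minimize(negative_dual,
                   x0=np.zeros(n),
                   method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda lam: lam @ y})
    lam = res.x
    w = ((lam * y)[:, None] * X).sum(axis=0)
    sv = (lam > 1e-6) & (lam < C - 1e-6)          # margin support vectors
    b = np.mean(y[sv] - X[sv] @ w) if sv.any() else 0.0
    return w, b, lam
```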

A.3 ε-SV regression

We study here the $\varepsilon$-SV regression. We first write down the primal problem with all the constraints:
$$
\min_{w,b,\xi}\; \frac{1}{2}\left\|w\right\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i'\right)
$$
$$
\text{u.c.}\quad w^T x_i + b - y_i \leq \varepsilon + \xi_i, \quad y_i - w^T x_i - b \leq \varepsilon + \xi_i', \quad \xi_i \geq 0, \quad \xi_i' \geq 0 \quad i = 1, \ldots, n
$$
In this case, we have $4n$ inequality constraints. Hence, we construct the Lagrangian by introducing the positive Lagrange multipliers $(\Lambda, \Lambda', \mu, \mu')$. The Lagrangian of this primal problem reads:
$$
\mathcal{L}(w, b, \Lambda, \Lambda', \mu, \mu') = \frac{1}{2}\left\|w\right\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i'\right) - \sum_{i=1}^{n}\mu_i\xi_i - \sum_{i=1}^{n}\mu_i'\xi_i'
$$
$$
\qquad - \sum_{i=1}^{n}\alpha_i\left[w^T\phi(x_i) + b - y_i + \varepsilon + \xi_i\right] - \sum_{i=1}^{n}\beta_i\left[-w^T\phi(x_i) - b + y_i + \varepsilon + \xi_i'\right]
$$
with $\Lambda = (\alpha_i)_{i=1,\ldots,n}$, $\Lambda' = (\beta_i)_{i=1,\ldots,n}$ and the constraints $\Lambda, \Lambda', \mu, \mu' \geq 0$ on the Lagrange multipliers. Minimizing the Lagrangian with respect to $(w, b, \xi, \xi')$ gives:
$$
\frac{\partial \mathcal{L}}{\partial w^T} = w - \sum_{i=1}^{n}\left(\alpha_i - \beta_i\right)\phi(x_i) = 0, \qquad \frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{n}\left(\beta_i - \alpha_i\right) = 0,
$$
$$
\frac{\partial \mathcal{L}}{\partial \xi} = C\,1 - \Lambda - \mu = 0, \qquad \frac{\partial \mathcal{L}}{\partial \xi'} = C\,1 - \Lambda' - \mu' = 0
$$
Inserting these results into the Lagrangian leads to the dual problem:
$$
\max_{\Lambda,\Lambda'}\; \left(\Lambda - \Lambda'\right)^T y - \varepsilon\left(\Lambda + \Lambda'\right)^T 1 - \frac{1}{2}\left(\Lambda - \Lambda'\right)^T K\left(\Lambda - \Lambda'\right) \qquad (19)
$$
$$
\text{u.c.}\quad \left(\Lambda - \Lambda'\right)^T 1 = 0, \qquad 0 \leq \Lambda, \Lambda' \leq C1
$$
When $\varepsilon = 0$, the term $\varepsilon\left(\Lambda + \Lambda'\right)^T 1$ in the objective function disappears, and we can reduce the optimization problem by the change of variable $\Lambda - \Lambda' \rightarrow \Lambda$; the inequality constraint for the new variable reads $|\Lambda| \leq C1$. The dual problem can be solved by a QP program which gives the optimal solution $\Lambda^{\star}$. In order to compute $b$, we use the KKT conditions:
$$
\alpha_i\left[w^T\phi(x_i) + b - y_i + \varepsilon + \xi_i\right] = 0, \qquad \beta_i\left[y_i - w^T\phi(x_i) - b + \varepsilon + \xi_i'\right] = 0,
$$
$$
\left(C - \alpha_i\right)\xi_i = 0, \qquad \left(C - \beta_i\right)\xi_i' = 0
$$
We remark that the two last conditions give us $\xi_i = 0$ for $0 < \alpha_i < C$ and $\xi_i' = 0$ for $0 < \beta_i < C$. This result directly implies the following condition for the support vectors $(x_i, y_i)$ of the training set:
$$
w^T\phi(x_i) + b - y_i = 0
$$
We denote by SV the set of support vectors. Using the condition $w = \sum_{i=1}^{n}\left(\alpha_i - \beta_i\right)\phi(x_i)$ and averaging over the support vectors, we finally obtain:
$$
b = \frac{1}{n_{SV}}\sum_{i\in SV}\left(y_i - (z)_i\right)
$$
with $z = K\left(\Lambda - \Lambda'\right)$.

B Newton optimization for the primal problem

We consider here the Newton optimization scheme for solving the unconstrained primal problem:
$$
\min_{\beta, b}\; \mathcal{L}_P(\beta, b) = \min_{\beta, b}\; \frac{1}{2}\beta^T K\beta + C\sum_{i=1}^{n} L\!\left(y_i, K_i^T\beta + b\right)
$$


The required condition for this scheme is that the loss function $L(y, t)$ be differentiable. We first study the case of the quadratic loss, where $L(y, t)$ is differentiable, and then the soft-margin case, where $L(y, t)$ has to be regularized.

B.1 Quadratic loss function

In the quadratic loss case, the penalty function has a suitable form:
$$
L\left(y_i, f(x_i)\right) = \max\left(0, 1 - y_i f(x_i)\right)^2
$$
This function is differentiable everywhere and its derivative reads:
$$
\frac{\partial L(y, t)}{\partial t} = 2y\left(yt - 1\right)\mathbb{I}_{\{yt \leq 1\}}
$$
However, the second derivative is not defined at the point $yt = 1$. In order to avoid this problem, we consider directly the objective as a function of the vector $\beta$ and perform a quasi-Newton optimization, in which the second derivative is replaced by an approximation of the Hessian matrix. The gradient of the objective function with respect to the vector $(b, \beta)^T$ is given by:
$$
\nabla\mathcal{L}_P = \begin{pmatrix} 2C\,1^T I^0 1 & 2C\,1^T I^0 K \\ 2C\,K I^0 1 & K + 2C\,K I^0 K \end{pmatrix}\begin{pmatrix} b \\ \beta \end{pmatrix} - 2C\begin{pmatrix} 1^T I^0 y \\ K I^0 y \end{pmatrix}
$$
and the pseudo-Hessian matrix is given by:
$$
H = \begin{pmatrix} 2C\,1^T I^0 1 & 2C\,1^T I^0 K \\ 2C\,K I^0 1 & K + 2C\,K I^0 K \end{pmatrix}
$$
The Newton iteration then consists of updating the vector $(b, \beta)^T$ until convergence as follows:
$$
\begin{pmatrix} b \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} b \\ \beta \end{pmatrix} - \gamma H^{-1}\nabla\mathcal{L}_P
$$

For the soft-margin case, the penalty function has the following form L (yi , f (xi )) = max (0, 1 − yi f (xi )) which requires a regularization. A differentiable approximation is to use the following penalty function:  0 if yt > 1 + h  (1+h−yt)2 L (y, t) = if |1 − yt| ≤ h 4h  1 − yt if yt < 1 − h

47