A Filter Feature Selection Method for Clustering

Pierre-Emmanuel JOUVE, Nicolas NICOLOYANNIS
LABORATOIRE ERIC, Université Lumière - Lyon 2
Bâtiment L, 5 avenue Pierre Mendès-France, 69676 BRON cedex, FRANCE
(http://eric.univ-lyon2.fr)
[email protected] ; [email protected]

Abstract. High dimensional data is a challenge for the KDD community. Feature Selection (FS) is an efficient preprocessing step for dimensionality reduction, thanks to the removal of redundant and/or noisy features. Few, and mostly recent, FS methods have been proposed for clustering. Furthermore, most of them are "wrapper" methods that require the use of clustering algorithms to evaluate the selected feature subsets. Because of this reliance on clustering algorithms, which often require parameter settings (such as the number of clusters), and because of the lack of a consensual criterion for evaluating clustering quality in different subspaces, the wrapper approach cannot be considered a universal way to perform FS within the clustering framework. We therefore propose and evaluate in this paper a "filter" FS method. This approach is completely independent of any clustering algorithm. It is based on two specific indices that assess the adequacy between two sets of features. As these indices exhibit very specific and interesting properties as far as their computational cost is concerned (they require only one dataset scan), the proposed method can be considered effective not only in terms of result quality but also in terms of execution time.

1 Introduction

High dimensional data is a common problem for data miners. Feature Selection (FS) makes it possible to choose relevant original features and is an effective dimensionality reduction technique. A relevant feature for a learning task can be defined as one whose removal degrades the learning accuracy. By removing the non-relevant features, data size is reduced, while learning accuracy and comprehensibility may improve or, at least, remain the same. Learning can occur within two contexts: supervised or unsupervised (clustering). While there are many methods for FS in the supervised context [1], there are only very few, mostly recent, FS methods for the unsupervised context. This may be explained by the fact that it is easier to select features for supervised learning than for clustering: in the supervised context, one knows a priori what has to be learned, whereas this is not the case for clustering, and it may therefore be hard to determine which features are relevant to the learning task. FS for clustering may then be described as the task of selecting relevant features for the underlying clusters [3].

Among the proposed FS methods [2][4][9][5][11][3], most are "wrapper" methods: they evaluate the candidate feature subsets with the learning algorithm itself (which later uses the selected features for the actual learning). In clustering, a wrapper method evaluates the candidate feature subsets with a clustering algorithm (for example, K-means is used in [2][9] and the EM algorithm in [5]). Although wrapper methods for supervised learning have several disadvantages (computational cost, lack of robustness across different learning algorithms), they remain interesting in applications where accuracy is important. But, unlike supervised learning, which has a consensual way of evaluating accuracy, there is no unanimous criterion to estimate the accuracy of a clustering (furthermore, such a criterion would have to perform well in different subspaces). These limitations make wrapper methods for clustering very disadvantageous. In this paper we propose and evaluate a "filter" method for FS. A filter method is, by definition, independent of clustering algorithms, and thus completely avoids the issue of the lack of unanimity in the choice of a clustering criterion. The proposed method is based on two indices for assessing the adequacy between two sets of attributes (i.e., for determining whether two sets of features carry the same information). Moreover, the proposed method needs only one dataset scan, which gives it a great advantage over other methods as far as execution time is concerned.

2 Introductory Concepts and Formalisms

This section is essential for the presentation of our FS method: it introduces two indices for evaluating to what extent two sets of attributes carry the same information (in the rest of this paper, this evaluation is called the adequacy assessment between two sets of attributes). Throughout the paper we consider a clustering problem involving a dataset DS composed of a set O of n objects described by a set SA of l attributes.

Notation 1.
O = {o_i, i = 1..n}, a set of n objects (of a given dataset);
SA = {A_1, .., A_l}, the set of l attributes (features) describing the objects of O;
o_i = [o_{i1}, .., o_{il}], an object of O, where o_{ij} is the value of o_i for attribute (feature) A_j (this value may be numerical, categorical, ...).

2.1 Notion of Link

In the categorical data framework, the notion of similarity between objects of a dataset is commonly used; in this paper, we substitute for it an extension of this notion that may be applied to any type of data (categorical or quantitative). It is called the link according to an attribute and is defined as follows.

Definition 1 (Link between two objects). We associate with each attribute A_i a function denoted link_i which defines a link (a kind of similarity) or a non-link (a kind of dissimilarity) according to the attribute A_i between two objects of O:

$$
\mathrm{link}_i(o_{ai}, o_{bi}) =
\begin{cases}
1 & \text{if a particular condition determining a link (according to } A_i\text{) between objects } o_a \text{ and } o_b \text{ is verified}\\
0 & \text{otherwise (non-link)}
\end{cases}
\tag{1}
$$


Examples:
- For a categorical attribute A_i, we can naturally define link_i as follows:
$$
\mathrm{link}_i(o_{ai}, o_{bi}) =
\begin{cases}
1 & \text{if } o_{ai} = o_{bi}\\
0 & \text{otherwise}
\end{cases}
$$
- For a quantitative attribute A_i, we can for instance define link_i as follows:
$$
\mathrm{link}_i(o_{ai}, o_{bi}) =
\begin{cases}
1 & \text{if } |o_{ai} - o_{bi}| \le \delta, \text{ with } \delta \text{ a threshold fixed by the user}\\
0 & \text{otherwise}
\end{cases}
$$
- For a quantitative attribute A_i, we can also discretize it and then use the definition of link_i proposed for categorical attributes.
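As an illustration, these link functions translate directly into code. The sketch below is minimal; the function names and the threshold value are ours, not the paper's:

```python
def link_categorical(v_a, v_b):
    """Link according to a categorical attribute: 1 if both objects share the value."""
    return 1 if v_a == v_b else 0

def link_quantitative(v_a, v_b, delta):
    """Link according to a quantitative attribute: 1 if the values differ by at most delta."""
    return 1 if abs(v_a - v_b) <= delta else 0

# Example: two objects described by (color, size)
o_a = ("red", 3.1)
o_b = ("red", 3.4)
print(link_categorical(o_a[0], o_b[0]))        # 1 (same category)
print(link_quantitative(o_a[1], o_b[1], 0.5))  # 1 (|3.1 - 3.4| <= 0.5)
```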

2.2 Assessment of the adequacy between a set of attributes SA and a subset SA* of SA (SA* ⊆ SA)

To assess the adequacy between SA = {A_1, .., A_l} (the whole set of attributes describing the objects of the dataset DS) and SA* = {A*_1, .., A*_m}, a subset of SA (SA* ⊆ SA), we use four indices defined in [7]. These indices make it possible to assess to what extent the two sets of attributes contain the same information about the objects of the dataset. They are first presented in a relatively intuitive way; their mathematical formulation is given afterwards. Let us consider couples ((o_a, o_b), (A*_j, A_i)) composed of: (1) a couple of objects (o_a, o_b) such that a < b; (2) a couple of attributes (A*_j, A_i) made up of an attribute A*_j ∈ SA* and an attribute A_i ∈ SA such that A*_j ≠ A_i. The indices are then "intuitively" defined as follows:

- LL(SA*, SA) counts the number of couples ((o_a, o_b), (A*_j, A_i)) such that:
  1. there is a link according to A*_j between o_a and o_b: link*_j(o*_{aj}, o*_{bj}) = 1,
  2. there is a link according to A_i between o_a and o_b: link_i(o_{ai}, o_{bi}) = 1.

- L̄L̄(SA*, SA) counts the number of couples ((o_a, o_b), (A*_j, A_i)) such that:
  1. there is a non-link according to A*_j between o_a and o_b: link*_j(o*_{aj}, o*_{bj}) = 0,
  2. there is a non-link according to A_i between o_a and o_b: link_i(o_{ai}, o_{bi}) = 0.

- LL̄(SA*, SA) counts the number of couples ((o_a, o_b), (A*_j, A_i)) such that:
  1. there is a link according to A*_j between o_a and o_b: link*_j(o*_{aj}, o*_{bj}) = 1,
  2. there is a non-link according to A_i between o_a and o_b: link_i(o_{ai}, o_{bi}) = 0.

- L̄L(SA*, SA) counts the number of couples ((o_a, o_b), (A*_j, A_i)) such that:
  1. there is a non-link according to A*_j between o_a and o_b: link*_j(o*_{aj}, o*_{bj}) = 0,
  2. there is a link according to A_i between o_a and o_b: link_i(o_{ai}, o_{bi}) = 1.

More formally, the indices are defined as follows:

$$
LL(SA^*, SA) = \sum_{a=1}^{n} \sum_{b=a+1}^{n} \sum_{i=1}^{l} \sum_{\substack{j = 1..m \\ A^*_j \ne A_i}} \mathrm{link}_i(o_{ai}, o_{bi}) \times \mathrm{link}^*_j(o^*_{aj}, o^*_{bj})
\tag{2}
$$

$$
\bar{L}\bar{L}(SA^*, SA) = \sum_{a=1}^{n} \sum_{b=a+1}^{n} \sum_{i=1}^{l} \sum_{\substack{j = 1..m \\ A^*_j \ne A_i}} \bigl(1-\mathrm{link}_i(o_{ai}, o_{bi})\bigr) \times \bigl(1-\mathrm{link}^*_j(o^*_{aj}, o^*_{bj})\bigr)
\tag{3}
$$


$$
\bar{L}L(SA^*, SA) = \sum_{a=1}^{n} \sum_{b=a+1}^{n} \sum_{i=1}^{l} \sum_{\substack{j = 1..m \\ A^*_j \ne A_i}} \mathrm{link}_i(o_{ai}, o_{bi}) \times \bigl(1-\mathrm{link}^*_j(o^*_{aj}, o^*_{bj})\bigr)
\tag{4}
$$

$$
L\bar{L}(SA^*, SA) = \sum_{a=1}^{n} \sum_{b=a+1}^{n} \sum_{i=1}^{l} \sum_{\substack{j = 1..m \\ A^*_j \ne A_i}} \bigl(1-\mathrm{link}_i(o_{ai}, o_{bi})\bigr) \times \mathrm{link}^*_j(o^*_{aj}, o^*_{bj})
\tag{5}
$$
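The four counts can be computed directly from these definitions by enumerating every couple of objects and every admissible couple of attributes. The following sketch (function and variable names are ours) is a naive O(n²·l·m) implementation intended only to make the definitions concrete; the contingency-table shortcut described in the remark further below is what makes the method practical.

```python
from itertools import combinations

def four_indices(objects, links, star_idx):
    """Naive computation of the four counts LL, L-bar-L-bar, L-L-bar and
    L-bar-L between SA* (the attributes whose indices are in star_idx) and
    the full attribute set SA, directly from definitions (2)-(5).

    objects : list of tuples, one tuple of attribute values per object
    links   : list of link functions, links[i](v_a, v_b) in {0, 1}, one per attribute
    """
    l = len(links)
    LL = LbarLbar = LLbar = LbarL = 0
    for o_a, o_b in combinations(objects, 2):   # every couple of objects, a < b
        for j in star_idx:                      # attribute A*_j of SA*
            for i in range(l):                  # attribute A_i of SA
                if i == j:                      # A*_j must differ from A_i
                    continue
                lj = links[j](o_a[j], o_b[j])   # link according to A*_j
                li = links[i](o_a[i], o_b[i])   # link according to A_i
                LL       += lj * li             # link on both
                LbarLbar += (1 - lj) * (1 - li) # non-link on both
                LLbar    += lj * (1 - li)       # link on A*_j, non-link on A_i
                LbarL    += (1 - lj) * li       # non-link on A*_j, link on A_i
    return LL, LbarLbar, LLbar, LbarL
```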

It is shown in [7] that the level of adequacy between SA* and SA can be characterized by the previous indices (LL, L̄L̄, LL̄, L̄L) and that a strong adequacy between SA and SA* is associated with high values of LL and L̄L̄. However, the meaning of "high values" is not completely intuitive, so we have also determined in [7] the statistical laws (two different binomial laws) followed by the indices LL and L̄L̄ under the assumption of non-adequacy. This has then allowed us to derive (thanks to a normal approximation and a standardisation) two indices Aq1(SA*, SA) and Aq2(SA*, SA) which respectively characterize how significantly high LL and L̄L̄ are. Under the non-adequacy assumption, both indices follow a standardized normal law N(0, 1); writing N = LL + L̄L̄ + LL̄ + L̄L for the total number of couples, they are defined as follows:

$$
Aq_1(SA^*, SA) = \frac{LL - \dfrac{(LL + L\bar{L})(LL + \bar{L}L)}{N}}
{\sqrt{\dfrac{(LL + L\bar{L})(LL + \bar{L}L)}{N}\left(1 - \dfrac{(LL + L\bar{L})(LL + \bar{L}L)}{N^2}\right)}},
\qquad Aq_1(SA^*, SA) \hookrightarrow \mathcal{N}(0,1)
$$

$$
Aq_2(SA^*, SA) = \frac{\bar{L}\bar{L} - \dfrac{(\bar{L}\bar{L} + \bar{L}L)(\bar{L}\bar{L} + L\bar{L})}{N}}
{\sqrt{\dfrac{(\bar{L}\bar{L} + \bar{L}L)(\bar{L}\bar{L} + L\bar{L})}{N}\left(1 - \dfrac{(\bar{L}\bar{L} + \bar{L}L)(\bar{L}\bar{L} + L\bar{L})}{N^2}\right)}},
\qquad Aq_2(SA^*, SA) \hookrightarrow \mathcal{N}(0,1)
$$

Consequently, we can say that the adequacy between SA* and SA is strong if the values Aq1(SA*, SA) and Aq2(SA*, SA) are simultaneously significantly high. To put it simply, the higher Aq1(SA*, SA) and Aq2(SA*, SA) are simultaneously, the stronger the adequacy between SA* and SA.

Important Remark: We have shown in [7] that the indices Aq1(SA*, SA) and Aq2(SA*, SA) exhibit a very interesting and specific property concerning their computation:

IF the l(l-1)/2 specific contingency tables crossing each pair of attributes of SA are built (this requires only one dataset scan, and O(l(l-1)/2 · n) comparisons if all attributes of SA are categorical, resp. O(l(l-1)/2 · n²) comparisons if some attributes of SA are numerical and not discretized),

THEN it is possible to compute Aq1(SA*, SA) and Aq2(SA*, SA) for any subset SA* of SA without accessing the original dataset. This may be done with a complexity in O(l(l-1)/2), just by accessing the l(l-1)/2 specific contingency tables.
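This remark is the computational heart of the method: one pass over the data fills, for every unordered pair of attributes, a contingency table crossing their values; from these tables the four counts, and hence Aq1 and Aq2, can be derived for any candidate subset SA* without touching the data again. The sketch below illustrates the idea for categorical attributes with the equality link; all names are ours, and this is only one possible realisation, not the authors' implementation.

```python
from itertools import combinations
from math import comb, sqrt

def build_pair_tables(objects, l):
    """Single scan of a categorical dataset: for every unordered pair of
    attributes {A_i, A_j}, accumulate the contingency table crossing their values."""
    tables = {pair: {} for pair in combinations(range(l), 2)}
    for o in objects:                               # the single pass over the data
        for i, j in combinations(range(l), 2):
            cell = (o[i], o[j])
            tables[(i, j)][cell] = tables[(i, j)].get(cell, 0) + 1
    return tables

def pair_link_counts(table, n):
    """From one contingency table, derive the link co-occurrence counts over
    all C(n,2) couples of objects for that pair of attributes (equality link)."""
    both = sum(comb(c, 2) for c in table.values())      # link on both attributes
    rows, cols = {}, {}
    for (u, v), c in table.items():
        rows[u] = rows.get(u, 0) + c
        cols[v] = cols.get(v, 0) + c
    same_first = sum(comb(c, 2) for c in rows.values())    # link on the first attribute
    same_second = sum(comb(c, 2) for c in cols.values())   # link on the second attribute
    return both, same_first, same_second, comb(n, 2)

def adequacy(subset, counts, l):
    """Aq1(SA*, SA) and Aq2(SA*, SA) for a candidate subset of attribute indices,
    computed only from the precomputed pair counts (no access to the dataset)."""
    LL = LbLb = LLb = LbL = 0
    for j in subset:                                    # A*_j in SA*
        for i in range(l):                              # A_i in SA, A_i != A*_j
            if i == j:
                continue
            both, same_first, same_second, npairs = counts[(min(i, j), max(i, j))]
            link_star = same_second if j > i else same_first  # link count on A*_j
            link_sa = same_first if j > i else same_second    # link count on A_i
            LL += both
            LLb += link_star - both                     # link on A*_j, non-link on A_i
            LbL += link_sa - both                       # non-link on A*_j, link on A_i
            LbLb += npairs - link_star - link_sa + both # non-link on both
    N = LL + LbLb + LLb + LbL
    e1 = (LL + LLb) * (LL + LbL) / N                    # expected LL under non-adequacy
    e2 = (LbLb + LbL) * (LbLb + LLb) / N                # expected L-bar-L-bar
    aq1 = (LL - e1) / sqrt(e1 * (1 - e1 / N))
    aq2 = (LbLb - e2) / sqrt(e2 * (1 - e2 / N))
    return aq1, aq2

# Usage sketch: one scan, then any number of subset evaluations.
# counts = {k: pair_link_counts(t, len(objects))
#           for k, t in build_pair_tables(objects, l).items()}
# aq1, aq2 = adequacy({0, 2, 5}, counts, l)
```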

3 The New Filter Feature Selection Method for Clustering

The method we propose is based on the indices Aq1(SA*, SA) and Aq2(SA*, SA). The basic idea is to discover the subset of SA which is most in adequacy with SA (that is to say, the subset which seems to best carry the information of SA). To do so, we use the indices Aq1(SA*, SA) and Aq2(SA*, SA) to derive a single new measure which reflects the adequacy between SA and SA* (SA* ⊆ SA). The aim is then to discover the subset of SA which optimizes this new measure. The new adequacy measure, named fit(SA, SA*), is based on the fact that a great adequacy between SA* and SA is characterized by simultaneously very high values of Aq1(SA*, SA) and Aq2(SA*, SA). It is defined as follows:

$$
\mathrm{fit}(SA, SA^*) =
\begin{cases}
\sqrt{(\tilde{aq}_1 - Aq_1(SA^*, SA))^2 + (\tilde{aq}_2 - Aq_2(SA^*, SA))^2} & \text{if } Aq_1(SA^*, SA) > 0 \text{ and } Aq_2(SA^*, SA) > 0\\
+\infty & \text{otherwise}
\end{cases}
$$

We can see that, in a certain way, this function corresponds to a "distance" between two subsets of attributes from the point of view of their adequacy with SA. More precisely, this measure may be seen as the "distance", from the adequacy-with-SA point of view, between a virtual subset of attributes (whose values for Aq1 and Aq2 would be respectively ãq1 and ãq2) and the set of attributes SA*. Indeed, we set ãq1 = ãq2 = high values in order to give this particular virtual set of attributes the aspect of an ideal set of attributes from the point of view of adequacy with SA. Consequently, the lower the value of this measure (i.e., the smaller the "distance"), the higher the adequacy between SA and SA* can be considered to be.
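A direct transcription of fit(SA, SA*) reads as follows; aq1_target and aq2_target stand for the reference values ãq1 and ãq2, and the concrete value 100.0 is purely illustrative (the paper only requires them to be high):

```python
from math import sqrt, inf

def fit(aq1, aq2, aq1_target=100.0, aq2_target=100.0):
    """Distance of a candidate subset, described by its (Aq1, Aq2) values, to a
    virtual 'ideal' subset with very high adequacy to SA; lower is better."""
    if aq1 > 0 and aq2 > 0:
        return sqrt((aq1_target - aq1) ** 2 + (aq2_target - aq2) ** 2)
    return inf   # subsets whose adequacy is not significantly positive are rejected
```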

The filter Feature Selection method for clustering that we propose is based on the use of this measure: it consists in looking for the subset of SA which minimizes the fit(SA, SA*) function. This search could be exhaustive, but that would imply a far too high computational cost; to reduce it, we decided to use a genetic algorithm (GA) in order to perform only a partial exploration of the space of subsets of SA¹. The GA used is defined as follows: (1) a chromosome codes (corresponds to) a subset of SA; (2) each gene of the chromosome codes an attribute of SA (so there are l genes); (3) each gene of a chromosome has a binary value: the gene value is 1 (resp. 0) if its associated attribute is present (resp. absent) in the subset of SA coded by the chromosome (a sketch of this search is given after the algorithm below).

¹ Note that we could have used any other optimization method; this is a debatable, arbitrary choice. Using greedy approaches instead would limit the computational cost to a greater extent. However, this choice is not the point here.


The algorithm of the FS method is given below; we should note that: (1) it requires only one scan of the dataset (due to the properties of Aq1 and Aq2); (2) it needs to store some contingency tables (due to Aq1 and Aq2), but this corresponds to a small amount of memory (typically fitting into the computer's main memory); (3) its complexity is small (it is quadratic in the number of attributes of the dataset and completely independent of the number of objects once the needed contingency tables have been built), again due to the properties of Aq1 and Aq2; (4) it can deal with numerical, categorical, or mixed categorical and numerical data. (However, we should note that the computational cost associated with the creation of the needed contingency tables may be excessive for quantitative attributes, since the complexity is then quadratic in the number of objects; it is therefore more interesting to treat categorical data or discretized numerical data, for which the complexity of contingency table creation is linear in the number of objects.)

Algorithm: Filter Feature Selection for Clustering
1. In only one scan of the dataset, derive the l(l-1)/2 contingency tables necessary for computing the previously presented adequacy indices.
2. Run the Genetic Algorithm using the fitness function fit(SA, SA*).
3. Select the best subspace found by the Genetic Algorithm.
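The sketch below shows one minimal way to implement the search with the binary-chromosome encoding described above; it is deliberately stripped down (tournament selection, one-point crossover, bit-flip mutation, elitism on the best chromosome), uses the paper's parameter values as defaults, and takes the fitness as a callable (for instance built from adequacy() and fit() above). It is not the authors' implementation.

```python
import random

def ga_feature_selection(fitness, l, pop_size=30, generations=500,
                         crossover_rate=0.98, mutation_rate=0.4, seed=0):
    """Search for the subset of SA (a binary chromosome of length l) minimising
    the supplied fitness function; returns the selected attribute indices."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) <= fitness(b) else b

    best = min(pop, key=fitness)
    for _ in range(generations):
        new_pop = [best[:]]                              # elitism
        while len(new_pop) < pop_size:
            p1, p2 = tournament()[:], tournament()[:]
            if rng.random() < crossover_rate:            # one-point crossover
                cut = rng.randrange(1, l)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                if rng.random() < mutation_rate:         # bit-flip mutation
                    pos = rng.randrange(l)
                    child[pos] = 1 - child[pos]
                new_pop.append(child)
        pop = new_pop[:pop_size]
        best = min(pop + [best], key=fitness)
    return [i for i, bit in enumerate(best) if bit == 1]

# Usage sketch, tying the pieces together:
# fitness = lambda c: (fit(*adequacy({i for i, b in enumerate(c) if b}, counts, l))
#                      if any(c) else float("inf"))
# selected = ga_feature_selection(fitness, l)
```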

4 Experimental Evaluations

We evaluated our method on both synthetic datasets and UCI [10] datasets.

4.1 Experiment #1: Experimental evaluation on synthetic datasets

Description: The objective was to test whether or not our method is able to detect the relevant attributes. We therefore built synthetic datasets of 1000 objects characterized by 9 relevant attributes (A1, A2, A3, A4, A5, A6, A7, A8, A9) and by a set of (l - 9) noisy attributes. More precisely: objects o1 to o250 (resp. o251 to o500) (resp. o501 to o750) (resp. o751 to o1000) all have the same value D for attributes A1, A2, A3 (resp. A3, A4, A5) (resp. A5, A6, A7) (resp. A7, A8, A9); for the remaining attributes, a value among A, B and C is randomly assigned to each object (each value having probability 1/3). The composition of the datasets is illustrated in Fig. 1; it shows that only the first 9 attributes are relevant and that each dataset has a four-cluster structure.

Fig. 1. Synthetic Datasets

The experiments were the following: we ran several FS processes on 6 datasets made up of the 1000 objects characterized by attributes A1, ..., A9 and respectively by 9, 18, 27, 36, 81 and 171 noisy attributes. Consequently, the datasets were respectively composed of 18 attributes (50% of them relevant), 27 attributes (1/3 of them relevant), 36 attributes (25% relevant), 45 attributes (20% relevant), 90 attributes (10% relevant) and 180 attributes (5% relevant). For each of the 6 datasets, we then ran 5 series of 5 FS processes, each series being characterized by the number of generations of the GA. For the first series the number of generations was 50; it was set to 100, 500, 1000 and 2500 for the second, third, fourth and fifth series. The other parameters of the GA were: number of chromosomes per generation = 30; crossover rate = 0.98; mutation rate = 0.4; elitism = yes.

Analysis of the Results: Results are presented in Fig. 2 and require some explanation. Let us note first that each of the (6 × 5 × 5 = 150) FS processes led to a subset of attributes including the 9 relevant attributes (A1, ..., A9). The various curves of Fig. 2 therefore describe how many noisy attributes were selected along with the 9 relevant attributes for each series of 5 FS processes. They detail, for each series: (1) the average percentage of noisy attributes selected by the 5 FS processes of the series; (2) the smallest percentage of selected noisy attributes (the "best" FS process of the series); (3) the greatest percentage of selected noisy attributes (the "worst" FS process of the series).

Fig. 2. Experiments on Synthetic Datasets

The first interesting point is the ability of the method not to omit relevant attributes in its selection, even when there are very few relevant attributes (5%) and when, simultaneously, the number of generations of the GA is very small (50) (for such low numbers of generations one can consider that the optimization process of the GA has not reached its end).


As for the percentage of selected non-relevant (noisy) attributes, we see that:
- it is null (resp. nearly null) for datasets including at least 25% (resp. 20%) of relevant attributes, even for very small numbers of generations (50);
- for datasets including 10% or fewer relevant attributes, the selection of the optimal subset of attributes (SA* = {A1, A2, A3, A4, A5, A6, A7, A8, A9}) is obtained for numbers of generations greater than or equal to 1000.
The method thus seems very efficient: the indices as well as the fitness function really capture what a good subset of attributes is, and the optimization process allows the discovery of the optimal subset without implying a disproportionate computing time. (As an indication, for the dataset including 180 attributes, the number of non-empty subsets of SA is 2^180 ≈ 1.53 × 10^54, while the greatest number of tested subsets (for 2500 generations, assuming subsets are evaluated only once by the GA) is 2500 × 30 = 75000; comparing these two values shows the effectiveness of the search process.) Thus, on these (relatively naive) synthetic examples, the method we propose seems very effective. Finally, note that using clustering algorithms on the "reduced" dataset would lead to the "good" clustering in 4 clusters and that the associated computing time would be reduced by a factor of 2 to 20 (resp. 4 to 400) for algorithms with a complexity linear (resp. quadratic) in the number of attributes.
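For reproducibility, datasets of this kind can be generated with a few lines of code. This is a sketch under our reading of the construction above (the function name, the string value labels and the seeding are ours):

```python
import random

def make_synthetic_dataset(n_noisy, n_objects=1000, seed=0):
    """1000 objects, 9 relevant attributes A1..A9 plus n_noisy noisy attributes.
    Objects 1-250, 251-500, 501-750 and 751-1000 share the value 'D' on
    attributes (A1,A2,A3), (A3,A4,A5), (A5,A6,A7) and (A7,A8,A9) respectively;
    every other value is drawn uniformly from {'A', 'B', 'C'}."""
    rng = random.Random(seed)
    fixed_per_cluster = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}, {6, 7, 8}]  # 0-based indices
    data = []
    for idx in range(n_objects):
        cluster = idx * 4 // n_objects                  # four equal-sized clusters
        fixed = fixed_per_cluster[cluster]
        row = tuple('D' if a in fixed else rng.choice(['A', 'B', 'C'])
                    for a in range(9 + n_noisy))
        data.append(row)
    return data

# e.g. the six datasets of the experiment:
# datasets = [make_synthetic_dataset(k) for k in (9, 18, 27, 36, 81, 171)]
```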

4.2 Experiment #2: Experimental evaluation on UCI datasets

Description: The objective of these experiments was to determine whether clusterings obtained by considering a subset of the whole set of attributes (a subset of SA) selected by our FS method exhibit an equivalent or better level of quality than clusterings obtained by considering the whole set of attributes (SA). For this, we used two well-known datasets of the UCI Repository [10]: the Mushrooms dataset and the Small Soybean Diseases dataset. More precisely, we used our FS method to select a subset of the attributes of these datasets:
- For Small Soybean Diseases, 9 attributes (plant-stand, precip, temp, area-damaged, stem-cankers, canker-lesion, int-discolor, sclerotia, fruit-pods) were selected among the 35 initial attributes.
- For Mushrooms, 15 attributes (bruises?, odor, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, spore-print-color, population, habitat) were selected among the 22 initial attributes.

Then, we clustered the objects of the Small Soybean Diseases (resp. Mushrooms) dataset by considering either the whole set of 35 (resp. 22) attributes or the 9 (resp. 15) attributes selected by our FS method. We used the K-Modes [6] categorical data clustering method. Different parameters (different numbers of clusters) were used so as to generate clusterings with different numbers of clusters².

² For each parameter setting (number of clusters), we carried out 10 different experiments and kept the clustering corresponding to the best value of the QKM criterion (see footnote 3). This was done to minimize the effect of initialization on K-Modes.
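As an illustration of this protocol (clustering on either attribute set, Q_KM always evaluated on the full attribute set, best of 10 random initializations kept), here is a sketch that assumes the open-source Python `kmodes` package; the paper itself uses the K-Modes algorithm of [6], not this particular implementation, and the helper names (qkm, best_qkm, X_full, X_selected) are ours.

```python
from collections import Counter
from kmodes.kmodes import KModes   # third-party package, assumed available

def qkm(X, labels):
    """Q_KM criterion of a partition of X: number of attribute mismatches
    between each object and the mode of its cluster, summed over all objects."""
    cost = 0
    for c in set(labels):
        members = [x for x, lab in zip(X, labels) if lab == c]
        modes = [Counter(col).most_common(1)[0][0] for col in zip(*members)]
        cost += sum(v != m for x in members for v, m in zip(x, modes))
    return cost

def best_qkm(X_cluster, X_eval, n_clusters, n_runs=10, seed=0):
    """Cluster on X_cluster (full or selected attributes), always evaluate Q_KM
    on X_eval (the full attribute set), and keep the best of n_runs runs."""
    best = None
    for r in range(n_runs):
        km = KModes(n_clusters=n_clusters, init='Huang', n_init=1,
                    random_state=seed + r)
        labels = km.fit_predict(X_cluster)
        cost = qkm(X_eval, labels)
        best = cost if best is None else min(best, cost)
    return best

# for k in range(2, 26):   # 2..25 clusters for Mushrooms
#     print(k, best_qkm(X_full, X_full, k), best_qkm(X_selected, X_full, k))
```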


To sum up, for the Mushrooms (resp. Small Soybean Diseases) dataset, we performed clusterings (using the K-Modes method) into 2, 3, 4, ..., 24, 25 (resp. 2, 3, ..., 9, 10) clusters, either by considering the whole set of attributes or by considering the selected subset. To assess the quality of these clusterings we used an internal validity measure: the QKM criterion³. Obviously, we computed the value of this criterion by taking into account the whole set of attributes of the considered dataset, even when the clustering was obtained by considering only the selected subset.

Analysis of the Results:

Mushrooms Dataset: We see (Fig. 3) that the quality of the clusterings (according to the QKM criterion) obtained by clustering either on the whole set of attributes or on the selected subset is quite similar. This shows the efficiency of our method, since clusterings obtained with the FS preprocessing step are as good as those obtained without it.

Fig. 3. Experiments on Mushrooms Dataset

Small Soybean Diseases Dataset: Results (Fig. 4) are similar to those obtained with the Mushrooms dataset, which gives another piece of evidence of the quality of the selected subset and of our method. Note also that the number of selected attributes is quite low (about 25% of the initial attributes).

Fig. 4. Experiments on S. S. Diseases Dataset

Remark: More extensive experiments (involving different criteria and methodologies for quality checking, such as external validity measures, and also involving different clustering methods such as the Kerouac method [8]) are presented in [7]. These experiments also confirm the efficiency of our FS method. Further experiments, concerning the composition (in terms of objects) of each cluster of the obtained clusterings, show that, for a given number of clusters, the clusterings obtained by treating the whole set of attributes and those obtained by treating the selected subset have a very similar composition.

³ QKM is the criterion to be optimized (minimized) by the K-Modes method.

5 Further Discussion and Conclusion

In short, we propose a new filter FS method for clustering which, in addition to the classical advantages of filter methods (see the introduction), exhibits several other interesting points: (1) it requires only one scan of the dataset, contrary to almost all other filter approaches; (2) it has a small memory cost; (3) its algorithmic complexity is rather low and completely independent of the number of objects once some needed contingency tables have been built; (4) it can deal with numerical, categorical, or mixed categorical and numerical data, contrary to other filter FS methods such as [3]; (5) similarly to [3], this method selects attributes from the initial set of attributes rather than building new virtual attributes as approaches based on factorial analysis or multi-dimensional scaling do, which is particularly interesting if one wants to build an easily interpretable model. Experiments have shown the efficiency of our approach with respect to clustering quality, dimensionality reduction, noisy data handling, and computing time. We think that the power of this method essentially lies in the efficiency of the Aq1 and Aq2 indices. Many potential improvements may therefore be investigated, such as substituting a greedy optimization method for the GA, or modifying the structure of the GA so as to seek not the "optimal" subset of attributes but the best subset of attributes containing a fixed number of attributes.

References

1. M. Dash and H. Liu: Feature selection for classification. International Journal of Intelligent Data Analysis, 1(3), (1997)
2. M. Dash and H. Liu: Feature selection for clustering. In Proc. of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), (2000)
3. M. Dash, K. Choi, P. Scheuermann and H. Liu: Feature Selection for Clustering - A Filter Solution. In Proc. of the International Conference on Data Mining (ICDM02), (2002) 115–122
4. M. Devaney and A. Ram: Efficient feature selection in conceptual clustering. In Proc. of the International Conference on Machine Learning (ICML), (1997) 92–97
5. J. G. Dy and C. E. Brodley: Visualization and interactive feature selection for unsupervised data. In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), (2000) 360–364
6. Z. Huang: A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. Research Issues on Data Mining and Knowledge Discovery, (1997)
7. P. E. Jouve: Clustering and Knowledge Discovery in Databases. PhD thesis, Lab. ERIC, University Lyon II, France, (2003)
8. P. E. Jouve and N. Nicoloyannis: KEROUAC, an Algorithm for Clustering Categorical Data Sets with Practical Advantages. In Proc. of the International Workshop on Data Mining for Actionable Knowledge (PAKDD03), (2003)
9. Y. S. Kim, W. N. Street, and F. Menczer: Feature selection in unsupervised learning via evolutionary search. In Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2000) 365–369
10. Merz, C., Murphy, P.: UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html (1996)
11. L. Talavera: Feature selection and incremental learning of probabilistic concept hierarchies. In Proc. of the International Conference on Machine Learning (ICML), (2000)