BIOINFORMATICS

ORIGINAL PAPER

Vol. 24 no. 24 2008, pages 2908–2914 doi:10.1093/bioinformatics/btn506

Data and text mining

A new rule-based algorithm for identifying metabolic markers in prostate cancer using tandem mass spectrometry

Melanie Osl1,∗, Stephan Dreiseitl1,2, Bernhard Pfeifer1, Klaus Weinberger3, Helmut Klocker4, Georg Bartsch4, Georg Schäfer5, Bernhard Tilg1, Armin Graber3,6 and Christian Baumgartner1

1 Institute of Biomedical Engineering, University for Health Sciences, Medical Informatics and Technology, Hall in Tyrol, 2 Department of Software Engineering, Upper Austria University of Applied Sciences, Hagenberg, 3 Biocrates Life Sciences AG, Innsbruck, 4 University Clinic for Urology, 5 Institute for Pathology, Innsbruck Medical University, Innsbruck and 6 Institute for Bioinformatics, University for Health Sciences, Medical Informatics and Technology, Hall in Tyrol, Austria

Received on May 16, 2008; revised on September 3, 2008; accepted on September 22, 2008

Advance Access publication September 24, 2008

Associate Editor: Thomas Lengauer

ABSTRACT

Motivation: Prostate cancer is the most prevalent tumor in males and its incidence is expected to increase as the population ages. Prostate cancer is treatable by excision if detected at an early enough stage. The challenges of early diagnosis require the discovery of novel biomarkers and tools for prostate cancer management.

Results: We developed a novel feature selection algorithm termed associative voting (AV) for identifying biomarker candidates in prostate cancer data measured via targeted metabolite profiling MS/MS analysis. We benchmarked our algorithm against two standard entropy-based and correlation-based feature selection methods [Information Gain (IG) and ReliefF (RF)] and observed that, on a variety of classification tasks in prostate cancer diagnosis, our algorithm identified subsets of biomarker candidates that are both smaller and show higher discriminatory power than the subsets identified by IG and RF. A literature study confirms that the highest ranked biomarker candidates identified by AV have independently been identified as important factors in prostate cancer development.

Availability: The algorithm can be downloaded from the following URL: http://biomed.umit.at/page.cfm?pageid=516

Contact: [email protected]

1 INTRODUCTION

1.1 Clinical question

Prostate cancer is the most prevalent tumor in males in developed countries and a major cause of death due to malignancy. The problem will be aggravated by increasing life expectancy, since prostate cancer is most frequent in elderly men. Curative treatment of prostate cancer is possible, provided that the tumor is diagnosed at an organ-confined stage and completely removed. However, early diagnosis is hampered by the lack of symptoms and markers. Thus, novel diagnostic and prognostic tools for prostate cancer management are urgently needed (Dhanasekaran et al., 2001; Tomlins et al., 2006).

A major challenge in reducing prostate cancer mortality is the discovery of plasma prognostic markers that allow indolent and aggressive tumors to be distinguished, together with the establishment of early detection, risk assessment and treatment programs. An international project on prostate cancer, funded in the framework of the IMGuS (Institute for Medical Genome Research and Systems Biology, Vienna) research program, pursues a systems biology approach, analyzing genomic, proteomic and metabolomic components of samples from prostate cancer patients (Herwig et al., 2007). Its consortium aims to identify a new set of such diagnostic and prognostic molecular signatures and markers and to reveal their inherent biological functions. The metabolomic analysis comprises a set of quantitative targeted assays applied to serum samples from patients with aggressive [Gleason score (GS) 8–10] and non-aggressive (GS6) tumors and from healthy age-matched controls. Metabolite concentration profiling shows potential to discover multivariate biomarker sets that assist in early diagnosis, disease staging and subtyping at the molecular level, and will open up the opportunity to develop individually adapted forms of treatment in different cancers, especially once metabolic changes can be characterized on a comprehensive scale (Baumgartner and Graber, 2008; Weinberger and Graber, 2005).

∗ To whom correspondence should be addressed.


1.2 Biochemical background

A metabolomic approach seems promising because tumor cells exhibit defined changes in their intermediary metabolism due to two distinct influences. First, the internal alterations associated with transformation and immortalization have profound effects on gene transcription patterns and, consequently, influence protein levels and enzyme activities, which leads to a range of abnormal metabolite concentrations. Second, cells in solid tumor tissues frequently encounter external limitations such as ischemia and, thus, constrained availability of oxygen and of various substrates for energy generation and synthetic processes. Therefore, these tissues show a characteristic, fermentation-like phenotype with elevated production of certain organic acids such as pyruvate and lactate and an increased turnover in amino acid catabolism (Boros et al., 2002). Supplementary characteristics should be measured in specific cancers. For example, as far as prostate cancer is concerned, most metabolic studies have focused on the role of androgens, such as testosterone, and their biosynthesis and degradation, while research is also directed to nutritional effects (Dagnelie et al., 2004). Another aspect where metabolomics could greatly enhance the understanding of the pathomechanisms in prostate cancer is the large area of intracellular signaling. Up to now, the characterization of signaling pathways has mainly focused on protein phosphorylation and kinase cascades, although various metabolites, such as the sphingolipids ceramide, sphingosine and sphingosine-1-phosphate, frequently act as intracellular second messengers. Although all three sphingolipids are structurally similar, they regulate different signal transduction pathways with high specificity and are directly involved in the regulation of many cellular processes, in particular cell proliferation and apoptosis (Jaffrezou and Laurent, 2004).

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

1.3 Feature selection for biomarker discovery

A recent biomarker discovery strategy in human disease involves the search for novel diagnostic, prognostic and predictive markers in massive and complex datasets gathered from modern MS/MS profiling platforms (Baumgartner et al., 2004). This process is highly data driven and requires sophisticated data mining concepts for identifying, verifying and interpreting robust and generalizable biomarkers. Analytical and computational challenges are consequences of the properties of high-dimensional data spaces, such as high variability, presence of strong correlations and confounding due to the multimodality of heterogeneous and dynamic processes in cancer biology that are inherent in experimental profiling data (Clarke et al., 2008). Feature selection, perhaps the most widely used approach to the analysis of high-dimensional data spaces before classification and biomedical interpretation, reduces the dimensionality of data significantly and searches for those feature subsets that show superior discriminatory and predictive performance. However, many popular approaches do not optimally reflect the characteristics of given MS/MS data structures, and thus the apparent need arises for alternative advanced data analysis strategies for identifying novel biomarker candidates in metabolomic datasets. Hence, we propose a novel feature selection method for the identification of biomarker candidates distinguishing patients with aggressive and non-aggressive stages of prostate cancer and healthy controls.

1.4 Filter-based feature selection

Our feature selection algorithm for identifying metabolic markers in prostate cancer is a filter approach, a particular sub-category of feature selection (Hall and Holmes, 2003). Filter approaches use an evaluation criterion to assess the discriminatory power of the attributes. As a result, a ranked list is returned to the user. Popular methods like Information Gain (IG) (Quinlan, 1993) and ReliefF (RF) (Kononenko, 1994) apply entropy- and correlation-based evaluation criteria, respectively.

The IG of an attribute reflects the amount of entropy of the class labels that can be explained by the attribute. Given an attribute a_i, its IG, IG(a_i), with respect to class c_j is defined as the difference between the entropy of class c_j and the conditional entropy of class c_j given a_i. The overall IG of attribute a_i is the sum over all IGs with respect to all class labels. IG is a univariate filter approach, because it evaluates each attribute separately.

RF is the extension of Relief to noisy, incomplete and multi-class datasets. The main idea of Relief is that the values of a significant attribute are correlated with the attribute values of an instance of the same class, and uncorrelated with the attribute values of an instance of the other class. For a given instance, Relief determines its two nearest neighbors: one from the same class, and one from the other class. Then it estimates the value of an attribute a_i by the difference between the conditional probabilities P(different value of a_i | nearest instance from different class) and P(different value of a_i | nearest instance from same class). Note that the nearest instances are identified according to the sum of differences over all attributes, so that Relief as well as RF are multivariate filter approaches.

In our algorithm, we evaluate attributes by a rule-based evaluation criterion. More precisely, we evaluate attributes by a special form of association rules. Most previous work on the use of association rules for feature selection has been done in the area of text mining. Do et al. (2006) use scores based on how often features appear in rules, but do this in an unsupervised manner. Wiratunga et al. (2004) employ boosting of decision stumps (one-element rules); the selected features are those that occur in these stumps. The work by Foschi et al. (2003) considers only the features that occur in the best rule in the context of image mining.
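The two baseline criteria described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation benchmarked in this paper: the function names are hypothetical, IG is computed in its standard aggregate form H(class) − H(class | attribute) on an attribute that is assumed to be discretized already, and the second function is basic two-class Relief (one nearest hit and one nearest miss per instance) rather than the ReliefF extension, with numeric attributes assumed to be scaled to [0, 1].

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | attribute) for one discretized attribute (univariate)."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Conditional entropy: class entropy within each attribute value, weighted by frequency.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def relief_weights(X, y):
    """Basic two-class Relief weights; assumes each class has at least two instances."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def dist(a, b):
        # Sum of differences over all attributes: this makes the method multivariate.
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for k in range(d):
            # Reward attributes that differ on the nearest miss, penalize those that differ on the nearest hit.
            weights[k] += (abs(X[i][k] - X[miss][k]) - abs(X[i][k] - X[hit][k])) / n
    return weights


# Toy data (made up for illustration): attribute 0 separates the classes, attribute 1 does not.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
ranking = sorted(range(2), key=lambda k: relief_weights(X, y)[k], reverse=True)
```

Either score can be used to sort attributes into the ranked list that a filter approach returns; in the toy data, `ranking` places attribute 0 ahead of attribute 1.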

2 METHODS

2.1 Data collection

2.1.1 Blood serum sample collection and processing

Serum samples were obtained from the Prostate Cancer Biobank in Innsbruck, Austria. The serum procurement, data management and blood collection protocols were approved by the local Ethical Review Board. Blood samples from patients diagnosed with prostate cancer and from healthy controls were obtained from the Prostate Cancer Screening program open to the general public in Tyrol (Bartsch et al., 2001). After informed consent of the patients, blood samples were drawn by venous puncture using Sarstedt 9 ml z-gel serum monovettes; serum was obtained by centrifugation (4 min, 1800 g) and frozen in 2 ml cryovials (Simport) at −80 °C. At first use the sera were distributed into 250 µl aliquots to avoid repeated freeze–thaw cycles.

2.1.2 Patient and control cohorts

One hundred and fourteen serum samples from control screeners and 206 serum samples obtained prior to treatment from prostate cancer patients who underwent radical prostatectomy after cancer diagnosis (121 GS6, 85 GS8-10) were studied. The inclusion criteria for prostate cancer patients required the histopathological assessment of radical prostatectomy specimens. Histopathological results were confirmed by a second independent pathologist from the State Hospital Klagenfurt in Austria. The age-matched normal controls were defined by prostate specific antigen (PSA) measurements covering a time period of at least 3 years with no increase of PSA above 0.5 ng/ml. Clinical data of the patients were retrieved from the clinical databases and the patients' history records.

2.1.3 MS-analysis and data preprocessing

Metabolite concentrations were detected in blood sera (100 µl per sample) of the 320 patients by targeted MS/MS analysis (Weinberger and Graber, 2005). In detail, sample aliquots were directly extracted in Folch solution (i.e. glyco- and phospholipids, and oxidized fatty acids) or derivatized (i.e. amino acids, biogenic amines, acylcarnitines, reducing mono- and oligosaccharides) with a liquid-handling system, and analyzed either by FIA-MS/MS or LC-MS/MS combined with MRM, precursor and NL scans using a 4000 Q TRAP system equipped with an electrospray source. Isotope correction was performed on the mass profile of lipids (Eibl et al., 2008). Concentrations were calculated from the raw MS spectra by reference to a wide range of appropriate internal standards (stable isotopes) and filtered based on their signal-to-noise threshold (>4) and overall percentage of missing values (