Sample size determination in clinical proteomic profiling


DOI 10.1002/pmic.200800417

Proteomics 2009, 9, 74–86

RESEARCH ARTICLE

Sample size determination in clinical proteomic profiling experiments using mass spectrometry for class comparison

David A. Cairns1, Jennifer H. Barrett2, Lucinda J. Billingham3, Anthea J. Stanley1, George Xinarianos4, John K. Field4, Phillip J. Johnson3, Peter J. Selby1 and Rosamonde E. Banks1

1 Clinical and Biomedical Proteomics Group, Cancer Research UK Clinical Centre, Leeds Institute of Molecular Medicine, St. James’s University Hospital, Leeds, UK
2 Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, St. James’s University Hospital, Leeds, UK
3 Cancer Research UK Institute for Cancer Studies, School of Medicine, University of Birmingham, Edgbaston, Birmingham, UK
4 Roy Castle Lung Cancer Research Programme, Cancer Research Centre, University of Liverpool, Liverpool, UK

Mass spectrometric profiling approaches such as MALDI-TOF and SELDI-TOF are increasingly being used in disease marker discovery, particularly in the lower molecular weight proteome. However, little consideration has been given to the issue of sample size in experimental design. The aim of this study was to develop a protocol for the use of sample size calculations in proteomic profiling studies using MS. These sample size calculations can be based on a simple linear mixed model which allows the inclusion of estimates of biological and technical variation inherent in the experiment. The use of a pilot experiment to estimate these components of variance is investigated and is shown to work well when compared with larger studies. Examination of data from a number of studies using different sample types and different chromatographic surfaces shows the need for sample- and preparation-specific sample size calculations.

Received: May 12, 2008 Revised: June 25, 2008 Accepted: July 11, 2008

Keywords: False discovery rate / Mass spectrometry / Power / Sample size / Type I error

1 Introduction

A major aim of many clinical and biomedical proteomic research studies is the identification of new biomarkers, i.e. characteristics that are objectively measured and evaluated as indicators of normal biologic processes, pathogenic processes or pharmacologic responses to a therapeutic intervention [1]. The discovery of a diagnostic or prognostic biomarker can also lead to an increased understanding of the mechanism of disease [2]. Major advances in proteomic technologies and bioinformatics have led to a great number of studies investigating the potential of these approaches in biomarker discovery, but criticisms have been made about poor experimental design [3–5]. Problems in many aspects have been recognised, including the issue of clinical proteomic profiling experiments being undertaken using too few biologically distinct samples to draw firm conclusions [6–8], thus failing to maximise the potential information from valuable biological samples and expensive laboratory procedures.

Correspondence: Dr. David A. Cairns, Cancer Research UK Clinical Centre, Leeds Institute of Molecular Medicine, St. James’s University Hospital, Beckett Street, Leeds LS9 7TF, UK
E-mail: [email protected]
Fax: +44-113-2429886

Abbreviations: FDR, false discovery rate; QC, quality control

© 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim


Although recommendations have been outlined for improving studies using simple ideas in experimental design (e.g. randomisation, matching for confounders, and replication [9]), there is little in the literature about sample size calculation for statistically powerful proteomic profiling experiments, whether using MS or other proteomic technologies. The only work currently available focuses on 2-D difference in-gel electrophoresis (2-D DIGE) [6, 10–12].

Profiling studies undertaken using MS, for instance using MALDI-TOF or SELDI-TOF, can be complex in nature, observing samples from a number of different patient groups, e.g. healthy and benign disease controls and patients with early and late stage transitional cell carcinoma (TCC) [13]. However, even in complex designs the key questions often concern the comparison of two groups. In the example given above this could be the comparison of grouped controls (healthy and benign disease) with cancer cases (early and late stage), or alternatively the specific subgroup comparison which may be of primary interest, e.g. low versus high stage TCC. Focus on simple comparisons allows the development of a strategy for the determination of appropriate sample sizes.

Pilot studies, which use a small number of samples analysed in a manner as similar as possible to the full study, can be used for the prior estimation of possible response levels or to elicit parameters such as variance component estimates for use in sample size calculations. Additionally, they can inform the choice of optimal laboratory conditions to use in the main study. The pilot study need not be very large (possibly based on only five to ten samples), but should mimic the sample handling and profile generation that will occur in the full study (including randomisation of biological samples over experimental arrays, profile generation parameters and determination of profiles in replicate).
The profiles should then be processed (baseline subtracted, peaks detected and normalised if appropriate) and the variance components for each peak estimated. Alternatively, estimates of variance components can be obtained from similar data from recent experiments, e.g. data from the controls of another experiment using the same type of biological sample. The choice of past experiment may require some careful consideration of how similar, and how different, the two studies (previous and planned) are.

The variance in the proteomic profile of a group of samples from subjects with a particular phenotype (subsequently referred to as a class) can be decomposed into two general components: biological variance and technical variance. Biological variance can be defined as the heterogeneity within a class of subjects, e.g. the variation in the specific proteome within a group of healthy controls, or between different groups, e.g. healthy versus diseased, and can be due to genetic, epigenetic or environmental factors [11]. Another source of biological variation is within-subject variation, representing changes within a subject over time. This would require repeated sampling of the same subject at different time-points and is not considered further here. Technical variance can be defined as the variation due to the technical procedures

employed. This may be at the pre-analytical stage, for example differences in sample processing [14, 15], or analytical, arising from experimental processes such as depletion [16], calibration [3] or instrument performance and chromatographic separation [15]. Technical variance can be measured by the variance in the determination of the proteomic profile of replicates of the same biological sample.

In this paper a careful consideration of sample size determination in proteomic profiling experiments for biomarker discovery using MS is described, although the general methods and principles could equally be applied to other profiling techniques, e.g. 2-D PAGE. Procedures are described for estimation of the variance inherent in biological samples and profile determination, and an approach to the problem of multiple testing is developed through consideration of the false discovery rate (FDR). A pre-study protocol for a proteomic profiling experiment is developed that combines sample size calculations with recommendations on the use of a pilot study to assess variability. A simulation of the pilot study procedure, based on proteomic mass spectra from a group of patients with lung cancer and healthy controls, is conducted to evaluate the effect of pilot study size on the estimation of full study sample size requirements. Further study sets are also used to illustrate the influence of sample type/methodology on the calculations.
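The biological/technical decomposition just described is estimated later in the paper with a linear mixed model fitted in R. As a purely illustrative sketch (function names are ours, not the paper's), for a balanced design with m technical replicates per subject the same two components can be recovered by a simple method-of-moments calculation:

```python
import statistics

def variance_components(samples):
    """Moment estimates of biological and technical variance from log2 peak
    intensities. `samples` is a list of per-subject replicate lists, each of
    equal length m (a hypothetical balanced design)."""
    m = len(samples[0])
    # technical variance: pooled within-subject variance across replicates
    tau2 = statistics.mean(statistics.variance(reps) for reps in samples)
    # variance of subject means estimates sigma2 + tau2/m, so subtract tau2/m
    means = [statistics.mean(reps) for reps in samples]
    sigma2 = max(statistics.variance(means) - tau2 / m, 0.0)
    return sigma2, tau2

# three subjects, duplicate determinations (m = 2)
sigma2, tau2 = variance_components([[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]])
print(sigma2, tau2)  # 15.0 2.0
```

A mixed-model fit (as used in the paper) handles unbalanced designs more gracefully, but the moment estimator makes the decomposition explicit.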

2 Materials and methods

2.1 Biological samples, generation and preprocessing of spectra

The main study set used was data generated from a comparison of EDTA plasma samples collected from 52 healthy controls and 55 patients with lung cancer prior to treatment, with the aim of finding potential biomarkers for the diagnosis of lung cancer. All plasma samples used in this study were obtained from individuals recruited in the Liverpool Lung Project (LLP) [17, 18]. Several peaks of interest are now awaiting identification prior to submission of the results for publication. All samples were processed within 1 h of venepuncture according to a standardised operating procedure and stored at −80°C. Details of the calibration of the SELDI PCS Enterprise 4000 and the preparation of CM10 (weak cation exchange) sample chips using the Biomek robot are as described previously [14]. Samples were analysed in duplicate following complete randomisation throughout all replicates.

A quality control (QC) sample, formed by pooling EDTA plasma from six healthy individuals, was used on three QC chips run on the first day of the analysis to define a reference set indicating the normal within-run technical variability of the profiling technique. Ideally the QC sample would be formed by pooling samples from all subjects in the study, but this was not possible here due to the limited quantities of biological samples. To assess between-chip and between-day technical variation, the QC sample was also included on


a single spot on each SELDI chip used in the analysis, ensuring equal use of spots A to H for this purpose. For the main run, 32 chips were used (excluding the three QC chips), with 31 plasma samples being rerun following replicate analyses and QC. The QC samples run throughout the study were used to examine the technical variance of peak intensities using the mean and CV in a spectral representation for the detected peaks. Automated in-house QC methods were utilised to monitor the experimental run and determine replicate reruns (Perkins et al., manuscript in preparation).

All spectra were baseline-subtracted using a two-stage Loess algorithm similar to that described in work by Wagner et al. [19]. After baseline subtraction, peaks were detected in the spectra using an in-house algorithm [20, 21], resulting in a matrix of intensity measurements at peak clusters in the same format as that provided by Ciphergen Express 3.0 (Ciphergen Biosystems, Fremont, CA). It should be noted that no normalisation of the intensity axis, for example to TIC, was undertaken, as we have previously found this to introduce additional problems [22]. For illustrative comparative purposes, profile data from serum samples from healthy controls, patients with renal failure and patients post-renal transplantation (Thompson et al., manuscript in preparation) and urine samples from patients with renal cell carcinoma (Sim et al., manuscript in preparation) were also examined following similar protocols [14].

2.2 Sample size calculations for the linear mixed model

To provide a sample size calculation it is necessary to define an appropriate statistical test for the hypothesis of interest at the design stage of the study.
As a linear mixed model for each peak intensity has been used with success previously to identify differential expression [13, 14], and is a method which allows the inclusion of random effects for different components of variance, this model was chosen as the basis of the sample size calculations. In addition, as it has been observed that the distribution of the logs of peak intensities is often normal, both in our laboratory (data not shown) and in other investigations [23], log intensities rather than raw intensities were considered. Logarithms to base 2 were chosen as they provide an intuitive interpretation of differences in terms of fold changes (a difference of one unit (log2(2)) represents a doubling in intensity).

A further concept which must be defined is the feature-selected proteomic profile. This is the proteomic profile that is obtained from MS data after some form of peak detection (e.g. [20, 24, 25]) has been applied to the data to identify features or “peaks” in the profile defined by their m/z and intensity values. The intensity of each peak p has variance S_p² across samples within the class, and this can be split into two components: biological variance (σ_p²) and technical variance (τ_p²). In the sample size formulae in the following section, m is the number of technical replicates of each sample and n is


the total number of samples or biological replicates in each class. Biological replicates are defined as distinct biological samples from a class, and technical replicates as repeated determinations of the proteomic profile of a biological sample. Furthermore, D is the difference in class means which these studies are being powered to detect. In contrast to clinical trials of specific treatments, where differences in effect between groups (e.g. five-year survival) may be as low as 10%, good serum or plasma biomarkers may be expected to show much greater magnitudes of change, with currently used markers often differing several-fold and in extreme cases even several hundred- or thousand-fold, for example CA125 between- or within-patients in ovarian cancer [26, 27].

With these definitions in place we developed the following scheme for sample size calculation, initially considering the simplest situation with only one peak and one technical replicate (m = 1). In this case the technical and biological variance are subsumed into the overall estimate of variance, and this can be estimated by the sample variance for that peak. The remaining two parameters to be specified are the significance level α and the power 1 − β. These quantities are derived from the two kinds of statistical error that can occur when testing a hypothesis. If the null hypothesis (H0) is rejected when it is true, then a type I error has occurred. If the null hypothesis is not rejected when it is false, then a type II error has occurred. These are formally defined as probabilities α and β through the relations α = P(type I error) = P(reject H0 when H0 is true) and β = P(type II error) = P(fail to reject H0 when H0 is false). The significance level for a statistical test is an upper bound on α, and the power of a test is defined as the probability of correctly rejecting the null hypothesis when it is false, i.e. power = 1 − β = P(reject H0 when H0 is false).
In the sample size formulae in the following section, z_{α/2} and z_β are the upper 100(α/2)% and 100β% points of the standard normal distribution corresponding to these specified constants. When considering only one peak, the null hypothesis is that there is no difference between the means of the (logged) intensities in the two classes, against the alternative hypothesis that there is a difference. The required sample size S in each group to detect a difference of magnitude D when the variance in the peak is S_p² is given by:

S = 2 ((z_{α/2} + z_β) / D)² S_p²   (1)

as shown in any standard textbook describing sample size methodology (e.g. Section 3.2.1 in ref. [28]). The required sample size should be taken as the smallest integer larger than the S obtained from the above calculation. This can be extended to a design with technical replicates using the approach suggested by Dobbin and Simon for transcriptomic studies [29], where the measured variance is decomposed into biological variation and technical variation. The number of biologically distinct samples S in each class is given by:


S = 2 ((z_{α/2} + z_β) / D)² (τ_p²/m + σ_p²)   (2)

When m = 1 this reduces to Eq. (1), since S_p² = τ_p² + σ_p². However, a proteomic profiling experiment undertaken using MS generally includes technical replicates, as this allows QC procedures to be undertaken (Perkins et al., in preparation). Variance components analysis [30] was used to estimate these two sources of variation in the data, using the lmer() function in the lme4 package [31] for the R software environment for statistical computing (R Development Core Team, Vienna, Austria).

In proteomic profiling experiments, typically several hundred peaks will be tested for differences in mean intensity at the biomarker discovery stage. To reduce the chance of false positive results, a stringent value of α must be chosen. An appropriate level can be chosen by considering the expected FDR [32, 33]. One definition of the FDR is

FDR = E[ #FD / (#FD + #TD) ]   (3)

where #FD is the number of false discoveries and #TD is the number of true discoveries (and the FDR is defined to be zero when #FD + #TD = 0). Following the arguments of Dobbin and Simon [29], assume that each protein peak falls into one of two categories: peaks that are not differentially expressed and peaks that are differentially expressed by some amount D. Then the expected number of false discoveries is E[#FD] = α(1 − π)P, where π is the proportion of peaks that are truly differentially expressed and P is the total number of protein peaks, and similarly the expected number of true discoveries is E[#TD] = (1 − β)πP. This leads to the expected FDR being approximated by:

E[FDR] ≈ α(1 − π) / (α(1 − π) + (1 − β)π)   (4)

This relationship was then used to investigate various choices of α and β for their appropriateness.
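The calculations implied by Eqs. (2) and (4) are straightforward to script. The paper's own analysis was done in R; the following Python transcription is an illustrative sketch (function names are ours), using the inverse normal CDF for the z values:

```python
from math import ceil, log2
from statistics import NormalDist

def samples_per_class(d, sigma2, tau2, m=1, alpha=0.001, power=0.95):
    """Eq. (2): biological samples per class to detect a difference d in mean
    log2 intensity, given biological variance sigma2, technical variance tau2
    and m technical replicates per sample."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)   # upper alpha/2 point of the standard normal
    z_b = z(power)           # upper beta point (power = 1 - beta)
    s = 2 * ((z_a + z_b) / d) ** 2 * (tau2 / m + sigma2)
    return ceil(s)           # round up to a whole number of samples

def expected_fdr(alpha, power, pi):
    """Eq. (4): approximate expected FDR for significance level alpha,
    power 1 - beta and proportion pi of truly differentially expressed peaks."""
    return alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)

# Median variance components for the low mass range (Table 1):
print(samples_per_class(log2(1.25), 0.016, 0.013, m=2, power=0.80))  # 8, as in Table 3
print(round(expected_fdr(0.001, 0.95, 0.05), 3))                     # 0.02, as in Table 2
```

Substituting other rows of Table 1 and other choices of m, power and fold change reproduces the general pattern of the tables that follow.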

2.3 Selection of the variance estimates for sample size calculation

The sample size formulae require estimates of the different components of variance. Firstly, as there are multiple peaks, there are multiple variances, and one value must be chosen for use in sample size determination. In the design of cDNA microarray experiments, various summary measures have been proposed [34], and here we investigate the appropriateness of the median, 90th percentile and maximum variance across peaks in this context. Secondly, we decided to take the maximal component across classes for each of these summaries, i.e. the larger of the values for the healthy controls and the lung cancer cases, for use in our sample size calculations.
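As an illustrative sketch (our own function names; the paper's analysis was done in R), the three candidate summaries for a vector of per-peak variance estimates can be computed as follows; note that `statistics.quantiles` with its default exclusive method is only one of several reasonable percentile definitions:

```python
import statistics

def variance_summaries(per_peak_variances):
    """Median, 90th percentile and maximum of per-peak variance estimates,
    the three summaries compared in this section (illustrative sketch)."""
    v = sorted(per_peak_variances)
    deciles = statistics.quantiles(v, n=10)   # deciles[8] is the 90th percentile
    return {"median": statistics.median(v), "p90": deciles[8], "max": v[-1]}

print(variance_summaries([float(i) for i in range(1, 11)]))
# {'median': 5.5, 'p90': 9.9, 'max': 10.0}
```

In the paper this summary would be computed separately for each class and each component, with the larger class value carried into the sample size formula.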

2.4 Construction of pilot study data

In addition to using the full dataset in the above, to simulate a pilot study, ten case and ten control samples were randomly selected from the lung cancer study and used to estimate variance components for the sample size calculations. These results were then compared with those obtained using estimates derived from the full study. Further, to evaluate the effect of the size of the pilot study, 1000 pilot studies, each consisting of random samples of three, five or ten cases and controls, were selected, variance components estimated and sample sizes evaluated. Evaluation of these results was undertaken using histograms and kernel density estimates (using a Gaussian kernel function) of the resulting distributions. These were obtained using the hist() and density() functions in R.
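The resampling loop behind this simulation can be sketched as follows, here on synthetic data rather than the real lung cancer spectra and with our own function names (the paper used R, with hist() and density() for the equivalent visualisation):

```python
import random
import statistics

def pilot_variance_estimates(full_class, n_pilot, n_sim=1000, seed=42):
    """Draw n_sim simulated pilot studies of n_pilot subjects from one class
    and return the per-pilot variance estimates for a single peak.
    `full_class` is a list of log2 intensities, one per subject (synthetic
    here; the paper resampled the real lung cancer dataset)."""
    rng = random.Random(seed)
    return [statistics.variance(rng.sample(full_class, n_pilot))
            for _ in range(n_sim)]

# synthetic "full study": 50 subjects with log2 intensities around 10
source = random.Random(0)
full = [source.gauss(10.0, 0.5) for _ in range(50)]
for n in (3, 5, 10):
    est = pilot_variance_estimates(full, n)
    print(n, round(statistics.stdev(est), 3))  # estimates scatter less as the pilot grows
```

The interesting output is the spread of the estimates across simulated pilots, which narrows as the pilot size increases; this is exactly what the histograms and density plots in the paper display.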

3 Results and discussion

3.1 Technical and biological variance

An example proteomic profile in duplicate for a patient with lung cancer is shown in Fig. 1, the raw spectra being indicated by the solid dark line and detected common peaks by crosses (in the expanded plot in the lower left panel). Spectral preprocessing resulted in 261 peaks in the low mass range (2–10 kDa) and 182 peaks in the medium mass range (10–20 kDa).

The contribution of technical variance to the overall variance can be seen in Fig. 2, where both the mean and CV profiles are shown. For the technical variance, within-chip (Fig. 2a), between-chip/within-day (Figs. 2b–c) and between-day (Fig. 2d) variance are shown for six QC replicates. The mean and CV spectra are very similar in all situations, indicating the stability of the profiling technique over a number of days and also across a number of chips. The largest variance is seen in the regions of the spectra where the intensity is very low, e.g. between 7000 and 8000 Da, this being the only place where the CV approaches 200%, the median value being approximately 20%. Further summaries of the mean and CV distributions of the intensity in each of these situations are given in Table S2 of Supporting Information; the third-quartile values there show that the CV is small for the vast majority of peaks, ranging from 21% for samples run on the same day to 48% for profiles generated on consecutive days for three quarters of the detected peaks.

3.2 Pilot study to determine sample size for a proteomic profiling experiment undertaken by MS

Using proteomic profiles of ten samples from each of two classes (healthy controls and lung cancer patients) to give an example of both a pilot study and a sample size calculation,


Figure 1. Duplicate proteomic spectra from plasma sample of a patient with lung cancer. First replicate determination is shown in the top row of the panel and second replicate in the second row. The low mass range (2–10 kDa) is shown in the first column of the panel and the medium mass range (10–20 kDa) in the second column of the panel. The bottom left panel shows an expanded view of that spectrum in the 8–10 kDa region with detected peaks indicated by crosses.

the estimated variance components are given in Table 1. As the estimated variance components in the medium mass range are smaller than those in the low mass range, the subsequent sample size calculations are based only on those for the low mass range. This will result in sample sizes larger than those suggested by the medium mass range, but in practice the same number of samples would be used for each analysis. It is apparent how a small number of peaks with very large variance make the maximum of all the estimated components very large, many times larger than for the majority of peaks. This is more apparent here than in the CV plots in Fig. 2 because this measure of the variance is not scaled by the mean, and these peaks have a high intensity.

Table 2 shows the expected FDR for a few reasonable choices of π and a few common choices of α and β. The expected FDR is sensitive to the significance level and the proportion of proteins that are truly differentially expressed. It is clear that the expected FDR is greatest when the proportion of truly differentially expressed proteins is small and

vice versa [32], and also that the expected FDR decreases as α decreases. Given 261 detected peaks (in the low mass range), a significance level of α = 0.001 and either 1, 10 or 25 truly differentially expressed peaks, we have an expected FDR of 0.260, 0.251 and 0.236 respectively, which on average should result in fewer than 0.3 false discoveries. These results are for 95% power, which should on average result in 0.05, 0.5 or 1.25 true discoveries not being identified in this situation. For more conventional significance levels such as 0.01 and 0.05, the number of expected false discoveries increases. The sensitivity of the expected FDR to the power 1 − β is less apparent, but the FDR clearly increases as power decreases, although power is not as critical a factor as α and π. Similar results are seen where only 50 peaks are detected (a number more common when using Ciphergen Express 3.0 (Fremont, CA), which has more conservative default peak detection settings), where similar trends are observed in the final two columns (Table S1 of Supporting Information). In the analysis that follows the power is set to two values for illustration:


Figure 2. Mean (solid line and left ordinate) and CV (dashed line and right ordinate) spectra for intensity in technical replicates. The four graphs refer to six multiple determinations (technical replicates) of a pooled QC sample where (a) six samples are run concurrently on the same chip, (b–c) six samples are all run on each of 2 days but on different chips approximately at equally spaced time intervals, (d) six samples are run on different chips on consecutive days. Summary statistics for the distribution of the mean and CV for intensity are given in Table S2 of Supporting Information.

1 − β = 0.95, as used in the calculations above, and also 1 − β = 0.8 for comparability with other studies. The size of detectable difference D chosen for the assigned significance level and power is set to a range of values: log2(1.25), log2(1.5) and log2(2), representing a 1.25-fold, 1.5-fold and 2-fold change.

In Table 3 estimated sample sizes for a variety of different design schemes are shown, where m = 2 and 4; D = log2(1.25), log2(1.5) and log2(2); z_{α/2} = Φ⁻¹(1 − 0.0005) = 3.291; z_β = Φ⁻¹(0.95) = 1.645 and z_β = Φ⁻¹(0.8) = 0.842. Variance components for the calculation are taken from Table 1, with the larger value across the two classes being chosen to calculate the sample sizes shown. This choice ensures that, if

the variance estimates are accurate, then for the column showing the 90th percentile of variances the desired power should be achieved for at least 90% of peaks overall. Considering the last column in Table 3, which presents sample sizes based on the maximum variance components across peaks, it can be seen that larger numbers of samples are required to power a study to detect a small fold change where peaks have a large biological and technical variance. These sample sizes may be too large to be considered practical for a proteomic profiling experiment, and the large maximum variance may be driven by just one highly variable peak. However, the sample sizes obtained using the 90th percentiles of the estimated variance components to


Table 1. Summary statistics for the components of variance for biological variation (σ_p²) and technical variation (τ_p²), calculated from log2-transformed intensities for each mass range considered in the study, for each class of subjects

Mass range          Variance component   Class     Median   90th Percentile   Maximum
Low (2–10 kDa)      Biological (σ_p²)    Control   0.016    0.055             0.233
                                         Case      0.015    0.119             0.226
                    Technical (τ_p²)     Control   0.010    0.049             2.298
                                         Case      0.013    0.152             2.279
Medium (10–20 kDa)  Biological (σ_p²)    Control   0.003    0.008             0.111
                                         Case      0.004    0.011             0.267
                    Technical (τ_p²)     Control   0.002    0.027             1.159
                                         Case      <0.001   0.031             1.136

Summaries (median, 90th percentile and maximum) are listed for each class (ten healthy controls and ten lung cancer patients, referred to as controls and cases), separated by component (biological and technical).

Table 2. Expected FDR and expected false negative rate for identifying differentially expressed proteins, for various values of the significance level α and power 1 − β and various proportions π of truly differentially expressed peaks in a peak-detected proteomic profile containing 200 peaks

π       α       1 − β   Truly DE peaks   Approx. expected FDR   Expected FD in 200 peaks   Expected truly DE peaks missed
0.005   0.001   0.95    1                0.173                  0.199                      0.05
0.005   0.001   0.90    1                0.181                  0.199                      0.1
0.005   0.001   0.80    1                0.199                  0.199                      0.2
0.05    0.001   0.95    10               0.020                  0.19                       0.5
0.05    0.001   0.90    10               0.021                  0.19                       1
0.05    0.001   0.80    10               0.023                  0.19                       2
0.2     0.001   0.95    40               0.004                  0.16                       2
0.2     0.001   0.90    40               0.004                  0.16                       4
0.2     0.001   0.80    40               0.005                  0.16                       8
0.005   0.01    0.95    1                0.677                  1.99                       0.05
0.005   0.01    0.90    1                0.689                  1.99                       0.1
0.005   0.01    0.80    1                0.713                  1.99                       0.2
0.05    0.01    0.95    10               0.167                  1.9                        0.5
0.05    0.01    0.90    10               0.174                  1.9                        1
0.05    0.01    0.80    10               0.192                  1.9                        2
0.2     0.01    0.95    40               0.040                  1.6                        2
0.2     0.01    0.90    40               0.043                  1.6                        4
0.2     0.01    0.80    40               0.048                  1.6                        8
0.005   0.05    0.95    1                0.913                  9.95                       0.05
0.005   0.05    0.90    1                0.917                  9.95                       0.1
0.005   0.05    0.80    1                0.926                  9.95                       0.2
0.05    0.05    0.95    10               0.500                  9.5                        0.5
0.05    0.05    0.90    10               0.514                  9.5                        1
0.05    0.05    0.80    10               0.543                  9.5                        2
0.2     0.05    0.95    40               0.174                  8                          2
0.2     0.05    0.90    40               0.182                  8                          4
0.2     0.05    0.80    40               0.200                  8                          8

Equivalent results for profiles containing 50 peaks are given in Table S1 of Supporting Information.

detect a 1.5-fold change with high power (1 − β = 0.95) and a stringent significance level (α = 0.001) are more reasonable, at 14–19 in each class. If the median variance is chosen, then

the desired power will only be expected to be achieved for 50% of the peaks; although this is not desirable, it may be a necessary compromise when biological samples are

Table 3. Estimated required sample sizes for various permutations of parameters for the analysis of EDTA plasma samples on CM10 chips, based on a two-class comparison

D           1 − β   m   Median    90th Percentile   Maximum
log2(1.25)  0.80    2   8 (16)    44 (88)           456 (912)
            0.80    4   7 (28)    31 (124)          267 (1068)
            0.95    2   11 (22)   62 (124)          650 (1300)
            0.95    4   10 (40)   44 (176)          380 (1520)
log2(1.5)   0.80    2   3 (6)     14 (28)           138 (276)
            0.80    4   2 (8)     10 (40)           81 (324)
            0.95    2   4 (8)     19 (38)           197 (394)
            0.95    4   3 (12)    14 (56)           116 (464)
log2(2)     0.80    2   1 (2)     5 (10)            48 (96)
            0.80    4   1 (4)     4 (16)            28 (112)
            0.95    2   2 (4)     7 (14)            68 (136)
            0.95    4   1 (4)     5 (20)            40 (160)

The first number in each cell is the number of distinct biological samples required in each group, and the figure in parentheses is the total number of experimental units (samples multiplied by replicates) required. Various choices of the number of technical replicates m, power 1 − β, difference in classes D (in terms of log2(fold change)) and estimates of the variance components (median, 90th percentile and maximum observed value of the biological and technical variance components of each peak) are shown. In all calculations the significance level α = 0.001 is chosen to keep the expected FDR low.

limited. Table 3 also demonstrates that, in situations such as these, a good level of power can be obtained by increasing the number of technical replicates while using a reduced number of biological replicates.

Some of the sample sizes calculated in Table 3 are very small, e.g. those for the median variance with larger fold changes. It is never sensible to perform such small experiments, for a combination of reasons both experimental and concerning limitations in inference. Table 3 shows that for large differences between classes with the median variance value, these calculations suggest a single biological replicate is sufficient when we have four technical replicates. However, as this design would not represent the biological variation in the population, making any conclusions limited, common sense requires that a larger sample size be used. An alternative statistical reason for rejecting extremely small sample sizes derives from considering an alternative statistical test which does not rely on assumptions about the distribution of the data. Applying a Wilcoxon–Mann–Whitney test to this problem, at least four biological replicates in each class would be required to achieve a result significant at the 5% level for a two-sided hypothesis, and for a 0.1% significance level, as considered in Table 3, this rises to at least seven biological replicates. Considerations such as these should encourage caution when using the output of sample size calculations, ensuring that they do not override common sense in experimental design and statistical analysis.
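These minima follow from the smallest two-sided p-value a Wilcoxon–Mann–Whitney test can attain with n samples per class, namely 2/C(2n, n), reached when the two classes separate completely in rank. A quick sketch (our own helper names) confirms the thresholds quoted above:

```python
from math import comb

def min_two_sided_p(n):
    """Smallest two-sided Wilcoxon-Mann-Whitney p-value with n per class:
    complete separation is one of C(2n, n) equally likely rank orderings
    under the null, doubled for the two-sided test."""
    return 2 / comb(2 * n, n)

def min_n_per_class(alpha):
    """Smallest n per class at which significance at level alpha is attainable."""
    n = 2
    while min_two_sided_p(n) > alpha:
        n += 1
    return n

print(min_n_per_class(0.05))   # 4, as stated in the text
print(min_n_per_class(0.001))  # 7, as stated in the text
```

This floor is independent of the observed data: below it, even a perfectly separated experiment cannot reach significance.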

3.3 Comparing the pilot study with data from other studies

Data from other proteomic profiling experiments, using different sample types or different chip surfaces, were examined to determine whether the pilot study was producing sample size determinations that were broadly applicable. For each study the variance components for the peaks in the profile were obtained as before, and results are shown for each of the sample and chip types used (Table 4). The rows of Table 4A describing the study based on plasma and the rows relating to a 1.5-fold change in Table 3 are directly comparable and can be seen to be quite similar. This gives some indication that estimating variance from a pilot study can lead to similar sample size determinations as estimating it from a large (identical) study. This is investigated further in the following section, where simulated pilot studies of different sizes are considered. Table 4 additionally gives insight into other aspects of the profiling experiments. Generally, larger sample sizes are required for studies where serum is the sample type than for plasma and urine. This reflects a larger amount of variation in the profile, with many of the peaks known to be derived from cleavage of proteins during clotting and complement activation [14]. There is no reason to expect serum to have inherently greater technical variation except at the pre-analytical level, and in all cases the samples used here had been processed according to standard protocols. However, clotting/complement activation and proteolytic activity do vary with disease, and this may also contribute to the enhanced variance in this fluid.

Table 4. (A) Estimated required sample sizes for various permutations of parameters for a 1.5-fold change and (B) similar results for a twofold change in serum

| Sample type | Chip | Δ | 1 − β | m | Median | 90th percentile | Maximum |
|---|---|---|---|---|---|---|---|
| A Serum | CM10 | log2(1.5) | 0.80 | 2 | 17 (34) | 76 (152) | 373 (746) |
| | | | 0.80 | 4 | 13 (52) | 57 (228) | 259 (1036) |
| | | | 0.95 | 2 | 24 (48) | 108 (216) | 532 (1064) |
| | | | 0.95 | 4 | 19 (76) | 80 (320) | 370 (1480) |
| | H50 | log2(1.5) | 0.80 | 2 | 10 (20) | 56 (112) | 180 (360) |
| | | | 0.80 | 4 | 9 (36) | 40 (160) | 136 (544) |
| | | | 0.95 | 2 | 15 (30) | 79 (158) | 256 (512) |
| | | | 0.95 | 4 | 13 (52) | 57 (228) | 194 (776) |
| | IMAC-Cu | log2(1.5) | 0.80 | 2 | 48 (96) | 139 (278) | 486 (972) |
| | | | 0.80 | 4 | 34 (136) | 96 (384) | 333 (1332) |
| | | | 0.95 | 2 | 68 (136) | 199 (398) | 694 (1388) |
| | | | 0.95 | 4 | 48 (192) | 136 (544) | 475 (1900) |
| Plasma | CM10 | log2(1.5) | 0.80 | 2 | 4 (8) | 15 (30) | 132 (264) |
| | | | 0.80 | 4 | 4 (16) | 11 (44) | 79 (316) |
| | | | 0.95 | 2 | 6 (12) | 22 (44) | 188 (376) |
| | | | 0.95 | 4 | 5 (20) | 16 (64) | 112 (448) |
| Urine | CM10 | log2(1.5) | 0.80 | 2 | 9 (18) | 22 (44) | 151 (302) |
| | | | 0.80 | 4 | 7 (28) | 16 (64) | 90 (360) |
| | | | 0.95 | 2 | 13 (26) | 31 (62) | 216 (432) |
| | | | 0.95 | 4 | 10 (40) | 22 (88) | 128 (512) |
| | IMAC-Cu | log2(1.5) | 0.80 | 2 | 8 (16) | 21 (42) | 110 (220) |
| | | | 0.80 | 4 | 6 (24) | 14 (56) | 63 (252) |
| | | | 0.95 | 2 | 11 (22) | 29 (58) | 156 (312) |
| | | | 0.95 | 4 | 9 (36) | 19 (76) | 90 (360) |
| B Serum | CM10 | log2(2) | 0.80 | 2 | 6 (12) | 26 (52) | 128 (256) |
| | | | 0.80 | 4 | 5 (20) | 20 (80) | 89 (356) |
| | | | 0.95 | 2 | 8 (16) | 37 (74) | 182 (364) |
| | | | 0.95 | 4 | 7 (28) | 28 (112) | 127 (508) |
| | H50 | log2(2) | 0.80 | 2 | 4 (8) | 19 (38) | 62 (124) |
| | | | 0.80 | 4 | 3 (12) | 14 (56) | 47 (188) |
| | | | 0.95 | 2 | 5 (10) | 27 (54) | 88 (176) |
| | | | 0.95 | 4 | 5 (20) | 19 (76) | 67 (268) |
| | IMAC-Cu | log2(2) | 0.80 | 2 | 17 (34) | 48 (96) | 167 (334) |
| | | | 0.80 | 4 | 12 (48) | 33 (132) | 114 (456) |
| | | | 0.95 | 2 | 24 (48) | 68 (136) | 238 (476) |
| | | | 0.95 | 4 | 17 (68) | 47 (188) | 163 (652) |

The first number in each cell is the number of distinct biological samples required in each group and the figure in parentheses is the total number of experimental units required, for various choices of the number of technical replicates m, power 1 − β, and various estimates of the variance components (median, 90th percentile and maximum observed value of the biological and technical variance components of each peak). In all calculations the significance level α = 0.001 is chosen to keep the expected FDR low.

Table 4B shows the sample sizes required to detect a twofold change in serum on each chip type, a common choice in sample size calculations, and shows that many MS studies of serum have at least been powered to detect changes of this magnitude. It can also be seen that larger numbers of samples are needed to conduct equally powerful experiments on different ProteinChips using the same sample types. The pattern shown in Table 4, of IMAC-Cu requiring larger numbers than H50 and CM10 ProteinChips, has been observed in other experiments. These differences may reflect a greater amount of inherent variability in the experimental process for these chip types in a SELDI experiment, although peaks associated with clotting/complement activation appear to make up a greater component of the profile on IMAC chips than on CM10 chips, for example, and therefore biological variance may also contribute to the differences between chips. This aspect has not been specifically explored here. However, the data overall provide evidence that different sample sizes can be required depending on fluid type, chip surface and disease type, implying that appropriate study-specific pilot studies should be performed when calculating sample sizes.

3.4 The effect of the size of a pilot study

Figure 3 shows the calculated sample sizes from 1000 simulated pilot studies (constructed by taking random subsets of the full study data) with (a) three, (b) five and (c) ten samples in each class, where α = 0.001, 1 − β = 0.95, Δ = log2(1.5), m = 2 and the 90th percentiles of the biological and technical variance components are estimated from each pilot study simulation. The solid line from the abscissa in each panel indicates the sample size calculated for these parameters when estimating the variance components from all the samples in the study (S = 22). There is clearly quite a large spread in the sample sizes obtained from the simulated pilot studies, some greater than and some less than the value obtained from the full study dataset, forming a fairly symmetric distribution with a mode close to the sample size calculated from the entire study sample.

Figure 3. Histograms (dashed bars) and kernel density estimates showing the distribution of calculated sample sizes from 1000 simulated pilot studies of size (a) three, (b) five and (c) ten samples in each class, where α = 0.001, 1 − β = 0.95, Δ = log2(1.5), m = 2 and the biological and technical variance components are estimated from each simulated pilot study, with the 90th percentile of the variances used in the sample size calculation. The solid line from the abscissa in each panel indicates the sample size (22) calculated for these parameters when estimating variance components from all of the samples in the study.

Figure 3 also shows that as the size of the simulated pilot study increases, the tails of the distribution of calculated sample sizes become lighter, with an increasing number of values close to the sample size predicted from the whole sample. Considering just the range 17–27 samples, i.e. five either side of the true value, there are 640, 713 and 717 simulations in that range from the 1000 pilot studies of size three, five and ten, respectively. This indicates both the increase in accuracy as the pilot study sample size increases and the near equivalence of the slightly larger pilot studies of five and ten samples. The phenomena shown in Fig. 3 are a consequence of the variability inherent in estimating variances, but the evidence here suggests that when estimating the 90th percentile of the variances from a pilot study, a reasonable estimate is obtained even with five to ten samples – making the undertaking of a pilot study of limited cost relative to the benefit of a statistically powerful experiment. These simulations illustrate what could occur when selecting samples randomly for a pilot experiment. No attempt was made here to stratify the classes according to known factors, e.g. tumour stage for the lung cancer cases and age and sex for all subjects; a pilot study should be large enough to be representative of the heterogeneity within a defined class, and careful control of these factors will lead to less variability in the sample size calculation.
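The resampling exercise behind Figure 3 can be sketched end to end: estimate per-peak variance components from a random subset of samples, take the 90th percentile across peaks, and recompute the sample size. Everything below is synthetic and illustrative: the peak count, the variance magnitudes and the balanced layout are assumptions, and the variance components are obtained from the method-of-moments identities that a balanced linear mixed model fit would also give.

```python
import random
from math import ceil, log2
from statistics import NormalDist, quantiles

def variance_components(peak):
    """Method-of-moments estimates (biological, technical) of the variance
    components from a balanced table peak[sample][replicate]."""
    s, m = len(peak), len(peak[0])
    means = [sum(row) / m for row in peak]
    grand = sum(means) / s
    ms_within = sum((y - mu) ** 2
                    for row, mu in zip(peak, means) for y in row) / (s * (m - 1))
    ms_between = m * sum((mu - grand) ** 2 for mu in means) / (s - 1)
    return max((ms_between - ms_within) / m, 0.0), ms_within

def samples_per_group(delta, var_bio, var_tech, m, alpha=0.001, power=0.95):
    # Two-sided normal-approximation sample size per group.
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z ** 2 * (var_bio + var_tech / m) / delta ** 2)

random.seed(1)
S, M, PEAKS = 22, 2, 100  # synthetic "full study" dimensions (assumed)
study = [[[random.gauss(b, 0.3) for _ in range(M)]
          for b in (random.gauss(0.0, 0.4) for _ in range(S))]
         for _ in range(PEAKS)]

def pilot_sample_size(pilot_size):
    """Sample size implied by one simulated pilot of `pilot_size` samples."""
    idx = random.sample(range(S), pilot_size)
    comps = [variance_components([peak[i] for i in idx]) for peak in study]
    vb90 = quantiles([vb for vb, _ in comps], n=10)[8]  # 90th percentile
    vt90 = quantiles([vt for _, vt in comps], n=10)[8]
    return samples_per_group(log2(1.5), vb90, vt90, m=M)

sizes = [pilot_sample_size(5) for _ in range(200)]  # distribution, cf. Fig. 3b
```

Plotting `sizes` for pilot sizes of three, five and ten reproduces the qualitative behaviour of Figure 3: a roughly symmetric distribution whose tails shrink as the pilot grows.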

4 Concluding remarks

Sample size is one of the most important issues in experimental design in clinical proteomic studies, and yet justification for the choice of sample numbers is often not attempted. Equally, the power is often not indicated when only a limited number of subjects are available for study. This study highlights some of the issues involved in making such decisions, particularly illustrating the possibility of performing sample size calculations for proteomic profiling experiments using MS by estimating variance components from pilot experiments or previous experimental data and selecting the significance level and power to control the FDR. A clear dependence on both the technical and biological variance can be seen, reflecting the need for sample size calculations to be undertaken not just for each technique, but also for each laboratory and for each disease/sample type. It is unwise to take decisions (on sample size calculations or methods of analysis) based on published results without at least applying cursory checks using in-house data. This applies generally to assumptions such as normality (for some parametric statistical tests), symmetry and/or similarity of distributions (for some nonparametric statistical tests) and homogeneity of variance (e.g. for t-tests and ANOVA, or when the variance depends on the mean of the signal in spots from 2-D DIGE experiments [35]). In the situation described here in particular, carefully chosen, relevant data should be used to provide the estimates of variance.
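The link between the chosen significance level, the power and the expected FDR can be made concrete with a standard approximation: the expected proportion of false positives among reported peaks is m0·α / (m0·α + m1·(1 − β)), where m0 peaks are truly null and m1 truly differential. The split of 90 null to 10 differential peaks below is purely an assumed scenario for illustration.

```python
def expected_fdr(n_null, n_diff, alpha, power):
    """Approximate expected FDR: expected false positives divided by
    expected total positives, treating the tests as independent."""
    false_pos = n_null * alpha
    true_pos = n_diff * power
    return false_pos / (false_pos + true_pos)

# Assumed scenario: 100 peaks, 10 of which truly differ between classes.
loose = expected_fdr(90, 10, alpha=0.05, power=0.80)    # roughly a third false
strict = expected_fdr(90, 10, alpha=0.001, power=0.95)  # under 1% false
```

This is why the tables fix α = 0.001: at the conventional 5% level, a substantial fraction of the reported peaks would be expected to be false leads.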


There are also other issues of study design of arguably greater importance, such as matching for potential confounding factors and removing sources of potential bias that could invalidate the findings even if an adequate sample size is used. In all types of proteomic profiling study, the design must be carefully considered to ensure that appropriate allowance is made for potential confounding factors. Confounding effects are effects contributed by various factors that cannot be separated by the design under study [36]. Typical confounders include age, sex and inconsistencies in sample handling, each of which could be related both to the class from which the sample is drawn and to the profile obtained. Careful statistical methods to disentangle these effects have been presented, but are only considered occasionally [13, 14, 37, 38]. Additionally, in the interest of generalising results, it can be important to stratify for clinical factors according to population proportions when assembling a case group, for example including a representative stage or grade mix for a specific cancer type, unless the question relates to analysis of specific subgroups. The sample size formulae given are only approximate, because in reality the variance parameters are not known exactly and the distribution of the test statistic under the alternative hypothesis is a t-distribution rather than a normal distribution. However, these are necessary and accepted approximations in sample size calculation, and the approximation improves as the estimated sample size increases. This also indicates that the smaller figures in Table 3 should be treated with a great deal of caution, as they violate the assumptions of the calculation method and suggest experiments too small to give real confidence. There are also various ways in which the methods described in this paper could be extended to address further issues.
This work considers the simplest statistical question that can be posed in the analysis of features from a proteomic profile, a comparison of two groups, but various other questions can be posed, whether they concern prognosis [29, 39], classification accuracy [40] or diagnostic ability through receiver operating characteristic curves [41]. The method described could be adapted to other simple scenarios by considering the estimation of the variance and determining similarly the significance level and power that will appropriately control the FDR. It could also be suggested that for comparisons of more than two groups a nested ANOVA should be the test on which to base sample size calculations; however, it seems more appropriate to base them on the pairwise comparisons to which a significant ANOVA result will generally lead. Additionally, alternatives to controlling the FDR, such as the false positive report probability [42], can be incorporated into this protocol if some knowledge of the prior odds of differential expression is available, and in situations where more complicated models are required, power can be examined through simulation if


a suitable model for the disease and technological process is available. It is hoped that this work and the prestudy protocol it describes will assist researchers in planning proteomic studies and add to the body of literature which is leading to ever more robust mass spectrometric proteomic profiling studies.

The financial support of Cancer Research UK is gratefully acknowledged. In addition, the authors would like to thank Dr. Douglas Thompson and Dr. Andrew Mooney for permission to use the chronic renal allograft study data and Dr. Sheryl Sim for permission to use the renal cell carcinoma study data to illustrate the methodology presented in this manuscript. The authors have declared no conflict of interest.

5 References

[1] Atkinson, A. J., Colburn, W. A., DeGruttola, V. G., DeMets, D. L. et al., Biomarkers and surrogate endpoints: Preferred definitions and conceptual framework. Clin. Pharm. Ther. 2001, 69, 89–95.

[2] Hoorn, E. J., Hoffert, J. D., Knepper, M. A., The application of DIGE-based proteomics to renal physiology. Nephron Physiol. 2006, 104, 61–72.

[3] Hu, J., Coombes, K. R., Morris, J. S., Baggerly, K. A., The importance of experimental design in proteomic mass spectrometry experiments: Some cautionary tales. Brief Funct. Genomics Proteomics 2005, 3, 322–331.

[4] Baggerly, K. A., Morris, J. S., Coombes, K. R., Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 2004, 20, 777–785.

[5] Ransohoff, D. F., Bias as a threat to the validity of cancer molecular-marker research. Nat. Rev. Cancer 2005, 5, 142–149.

[6] Karp, N. A., Lilley, K. S., Design and analysis issues in quantitative proteomics studies. Proteomics 2007, 7, 42–50.

[7] Becker, S., Cazares, L. H., Watson, P., Lynch, H. et al., Surfaced-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) differentiation of serum protein profiles of BRCA-1 and sporadic breast cancer. Ann. Surg. Oncol. 2004, 11, 907–914.

[8] Tolson, J., Bogumil, R., Brunst, E., Beck, H. et al., Serum protein profiling by SELDI mass spectrometry: Detection of multiple variants of serum amyloid alpha in renal cancer patients. Lab. Invest. 2004, 84, 845.

[9] Mischak, H., Apweiler, R., Banks, R. E., Conaway, M. et al., Clinical proteomics: A need to define the field and to begin to set adequate standards. Proteomics Clin. Appl. 2007, 1, 148–156.

[10] Karp, N. A., McCormick, P. S., Russell, M. R., Lilley, K. S., Experimental and statistical considerations to avoid false conclusions in proteomics studies using differential in-gel electrophoresis. Mol. Cell. Proteomics 2007, 6, 1354–1364.

[11] Karp, N. A., Spencer, M., Lindsay, H., O'Dell, K., Lilley, K. S., Impact of replicate types on proteomic expression analysis. J. Proteome Res. 2005, 4, 1867–1871.

[12] Hunt, S. M., Thomas, M. R., Sebastian, L. T., Pedersen, S. K. et al., Optimal replication and the importance of experimental design for gel-based quantitative proteomics. J. Proteome Res. 2005, 4, 809–819.

[13] Munro, N. P., Cairns, D. A., Clarke, P., Rogers, M. et al., Urinary biomarker profiling in transitional cell carcinoma. Int. J. Cancer 2006, 119, 2642–2650.

[14] Banks, R. E., Stanley, A. J., Cairns, D. A., Barrett, J. H. et al., Influences of blood sample processing on low-molecular-weight proteome identified by surface-enhanced laser desorption/ionization mass spectrometry. Clin. Chem. 2005, 51, 1637–1649.

[15] Coombes, K. R., Fritsche, H. A., Jr., Clarke, C., Chen, J. N. et al., Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization. Clin. Chem. 2003, 49, 1615–1623.

[16] Anderle, M., Roy, S., Lin, H., Becker, C., Joho, K., Quantifying reproducibility for differential proteomics: Noise analysis for protein liquid chromatography-mass spectrometry of human serum. Bioinformatics 2004, 20, 3575–3582.

[17] Field, J. K., Youngson, J. H., The Liverpool Lung Project: A molecular epidemiological study of early lung cancer detection. Eur. Respir. J. 2002, 20, 464–479.

[18] Field, J. K., Smith, D. L., Duffy, S., Cassidy, A., The Liverpool Lung Project research protocol. Int. J. Oncol. 2005, 27, 1633–1645.

[19] Wagner, M., Naik, D., Pothen, A., Protocols for disease classification from mass spectrometry data. Proteomics 2003, 3, 1692–1698.

[20] Rogers, M. A., Clarke, P., Noble, J., Munro, N. P. et al., Proteomic profiling of urinary proteins in renal cancer by surface enhanced laser desorption ionization and neural-network analysis: Identification of key issues affecting potential clinical utility. Cancer Res. 2003, 63, 6971–6983.

[21] Barrett, J. H., Cairns, D. A., Application of the random forest classification method to peaks detected from mass spectrometric proteomic profiles of cancer patients and controls. Stat. Appl. Genet. Mol. Biol. 2008, 7, 4.

[22] Cairns, D. A., Thompson, D., Perkins, D. N., Stanley, A. J. et al., Proteomic profiling using mass spectrometry – does normalising by total ion current potentially mask some biological differences? Proteomics 2008, 8, 21–27.

[23] Izmirlian, G., Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Ann. NY Acad. Sci. 2004, 1020, 154–174.

[24] Morris, J. S., Coombes, K. R., Koomen, J., Baggerly, K. A., Kobayashi, R., Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 2005, 21, 1764–1775.

[25] Yasui, Y., Pepe, M., Thompson, M. L., Adam, B. L. et al., A data-analytic strategy for protein biomarker discovery: Profiling of high-dimensional proteomic data for cancer detection. Biostatistics 2003, 4, 449–463.

[26] Hogdall, E. V., Christensen, L., Kjaer, S. K., Blaakaer, J. et al., CA125 expression pattern, prognosis and correlation with serum CA125 in ovarian tumor patients. From The Danish "MALOVA" ovarian cancer study. Gynecol. Oncol. 2007, 104, 508–515.

[27] Willemse, P. H., Aalders, J. G., de Bruyn, H. W., Mulder, N. H. et al., CA-125 in ovarian cancer: Relation between half-life, doubling time and survival. Eur. J. Cancer 1991, 27, 993–995.

[28] Chow, S.-C., Shao, J., Wang, H., Sample Size Calculations in Clinical Research, Marcel Dekker, New York 2003.

[29] Dobbin, K., Simon, R., Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 2005, 6, 27–38.

[30] Snedecor, G. W., Cochran, W. G., Statistical Methods, Iowa State University Press, Ames 1989.

[31] Bates, D., Fitting linear mixed models in R. R News 2005, 5, 27–30.

[32] Benjamini, Y., Hochberg, Y., Controlling the false discovery rate – a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological) 1995, 57, 289–300.

[33] Storey, J. D., A direct approach to false discovery rates. J. R. Stat. Soc. B 2002, 64, 479–498.

[34] Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M. et al., Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002, 30, e15.

[35] Karp, N. A., Lilley, K. S., Maximising sensitivity for detecting changes in protein expression: Experimental design using minimal CyDyes. Proteomics 2005, 5, 3105–3115.

[36] Chow, S.-C., Liu, J.-P., Design and Analysis of Clinical Trials: Concept and Methodologies, Wiley, New York, Chichester 1998.

[37] Mary-Huard, T., Daudin, J. J., Baccini, M., Biggeri, A., Bar-Hen, A., Biases induced by pooling samples in microarray experiments. Bioinformatics 2007, 23, i313–i318.

[38] Timms, J. F., Arslan-Low, E., Gentry-Maharaj, A., Luo, Z. et al., Preanalytic influence of sample handling on SELDI-TOF serum protein profiles. Clin. Chem. 2007, 53, 645–656.

[39] Hsieh, F. Y., Lavori, P. W., Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Control. Clin. Trials 2000, 21, 552–560.

[40] Dobbin, K. K., Simon, R. M., Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 2007, 8, 101–117.

[41] Obuchowski, N. A., Lieber, M. L., Wians, F. H., Jr., ROC curves in clinical chemistry: Uses, misuses, and possible solutions. Clin. Chem. 2004, 50, 1118–1125.

[42] Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., Rothman, N., Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J. Natl. Cancer Inst. 2004, 96, 434–442.


www.proteomics-journal.com