The assessment of data reproducibility is essential for application of microarray technology to exploration of biological pathways and disease states. Technical variability in data analysis largely depends on signal intensity. Within that context, the reproducibility of individual probe sets has not been hitherto addressed.
We used an extraordinarily large replicate data set derived from human placental trophoblast to analyze probe-specific contribution to variability of gene expression. We found that signal variability, in addition to being signal-intensity dependant, is probe set-specific. Importantly, we developed a novel method to quantify the contribution of this probe set-specific variability. Furthermore, we devised a formula that incorporates a priori-computed, replicate-based information on probe set- and intensity-specific variability in determination of expression changes even without technical replicates.
The strategy of incorporating probe set-specific variability is superior to analysis based on arbitrary fold-change thresholds. We recommend its incorporation to any computation of gene expression changes using high-density DNA microarrays. A Java application implementing our T-score is available at http://www.sadovsky.wustl.edu/tscore.html.
The introduction of microarray technology has enabled investigators to profile the expression of a large number of genes, derived from diverse biological conditions, in a single experiment. Nevertheless, these experiments are expensive, and the cost is amplified by replication when data reliability is lessened. Technical and biological variability, key determinants of microarray reliability, are critical for assessing which genes are differentially expressed. The level of expression of gene products is estimated using a set of oligonucleotide probes (termed here "probe set"). Within the context of technology-related variability, the reproducibility of individual probes has been insufficiently addressed. [1–6]. Importantly, the impact of probe-specific reliability on data reproducibility, which directly influences experimental design, data analysis and interpretation, remains largely unexplored. This problem is further amplified with the use of oligonucleotide microarrays, which may be more susceptible to probe-specific variability than spotted cDNA arrays [7–10].
Static fold change metrics as an unbiased predictor of differentially expressed genes depend on the assumption of constant coefficient of variation . Since violation of this assumption is relatively common in microarray data, many methods have been designed to circumvent its requirement [2, 12–14]. These methods model variance as a function of intensity, and assume independence of probe sets. Other methods, including popular permutation tests and t-tests that do not explicitly rely on this assumption require replication in order to assess the variability associated with each probe set [15–17]. We proposed to use intensity-corrected measures of variance and a correlation test to determine if the assumption of probe set and variance independence is valid. Using a large replicate data set we found that signal variability in microarray data is in fact probe set-specific, and developed a novel method to integrate a priori-generated, replicate-based information on signal intensity and probe set variability into profiling differential gene expression. The signal intensity as well as probe set information serves as a database for future experiments performed without technical replicates. The use of our method enhances the significance of differences found using reliable probe sets and diminishes the significance of differences found using unreliable probe sets.
Results and Discussion
We have developed a large replicate data set, based on gene expression in a single pool of primary placental trophoblast cells. The cells were divided to three groups and exposed to two different peroxisome proliferator activated receptor gamma (PPARγ) ligands, troglitazone or GW7845, or to control, as described in Methods. Prior to hybridization each labeled RNA sample was divided to five aliquots. Each of the five cRNA aliquots was sequentially hybridized to identical lot number U95A, U95B, U95C, U95D, and U95E arrays, resulting in expression data for approximately 60 000 genes for each of the three conditions, for a total of 180 000 probe sets, each sampled in five replicates using a total of 75 chips.
The five replicates for each experimental condition i and probe set j yielded a mean () and standard deviation (). We formulated an estimate of the standard deviation () as a function of signal intensity using locally weighted scatter smooth plot (LOESS) local regression (using the PROC LOESS command in SAS). LOESS regression restricts attention to a small window of the data and fits a regression to that data. The window is then shifted and another local regression is calculated. These sequential regressions are combined to yield a LOESS curve [18–20]. Thus the LOESS curve indicates the average / estimated standard deviation associated with any given mean intensity, based on the replicate data set of ~180 000 sets of five replicates. To find a probe-specific effect, we compared the observed standard deviation to the estimated standard deviation. We define αi,jas:
where the function f is the LOESS regression, which returns the average standard deviation () for a given intensity level (). By using the ratio of observed standard deviation to estimated standard deviation, αi,jrepresents a measure of residual variance for each probe set after correction for intensity.
To ensure that αi,jis a useful and correct measure of probe set residual variance and not subject to low-intensity related bias, we initially demonstrated that αi,jis independent of signal intensity. As shown in Fig. 1A, αi,jwas essentially unchanged across the range of signal intensities observed in our experiments. This finding is not unexpected, because the denominator of αi,jreflects the expected standard deviation for any given intensity. To demonstrate that the variance of αi,jis also independent of signal intensity, we calculated outlier rates by bins of signal intensity. We defined 'alpha outliers' as having an αi,jin the top 10 % of all values, and calculated the outlier rate for bins of increasing expression values. As shown in Fig. 1B, we found no obvious dependence of outlier rate on binned expression level, indicating that the variance of αi,jis approximately constant over the entire expression range (linear regression test, p = 0.1413).
Having shown that αi,jand its variance are independent of intensity, we sought to examine the utility of αi,jas an unbiased predictor of probe set reliability. For that purpose, we plotted the standard deviation of each set of five replicates as a function of their mean signal intensity. We next identified the probe sets that exhibited the highest or lowest 5 % of αi,jin two of the three experimental paradigms (cell exposed to troglitazone or GW7845), and examined their ability to predict αi,jin the third paradigm (control). If variance was independent of probe set, the information derived from the first two paradigms would fail to predict a similar pattern for the third paradigm (control) and all noted probe sets would exhibit the expected standard deviation and lie near or on the line αi,j= 1 in the third paradigm. As shown in Fig. 2, data points exhibiting extreme values in the first two paradigms (left panel) successfully predicted a similar value in the third paradigm (right panel). We next sought to further quantify the presence of probe set-specific variability. Under the null hypothesis that all probe sets have a similar inherent variability, α1,j, α2,j, and α3,jare independent observations of a probe set's intensity-corrected variability. If αi,jis truly independent of the probe set j, then each observation of αi,jshould be independent of the other two and the correlation between α1,j,• α2,j, and α3,jshould be zero. If, however, probe sets that have high variance in one paradigm tend to also have high variance in other paradigms, α1,j, α2,j, and α3,jwould exhibit a positive correlation. Pearson correlations for αi,jwere computed roughly between all three paradigms (i) across all probe sets (j). As shown in Table 1, we found that all three pair-wise correlations were significant. It should be noted that the observed raw correlation of approximately 0.3 reflects the fact that α is a sample variance of only five observations, and therefore exhibits high variance by itself. Postulating an ideal case in which α is constant across all paradigms, we created random normal variables based on the variability implied by α, and found that the computed average "ideal correlation" was 0.46 for each of the three pair-wise correlations. As an alternative measure of intensity-independent variance we used a variance-stabilizing transformation for microarray data . In this approach, variance is again modeled as a function of intensity, but the model is based on normally-distributed error terms. Those error terms were estimated and transformation derived , resulting in intensity-independent variance across the full range of expression. The standard deviation of the transformed data then serves as an alternate measure of intensity-corrected variance. Using the same logic as previously detailed, if all probe sets have equal reliability we expect the standard deviation of the transformed data to exhibit no correlation across the three paradigms. As shown earlier, the standard deviations of the transformed data exhibited strong correlation across the three paradigms (Table 1, right column). Because these standard deviations are corrected for intensity by transformation, the correlation in observed standard deviation is due to the probe set variability. Taken together, our analysis supports the concept that individual probe sets exhibit unique variances, and underscores the need for a custom-made, probe set-specific approach for detection of expression differences.
To incorporate intensity and probe set-specific variability into determination of differentially expressed genes, we introduce novel methodology and an ad hoc t-statistic, called T-score. We have previously shown  that for two conditions i = 1 and i = 2 on probe set j, our intensity corrected T-score is derived as follows:
Where f(zj) gives the predicted standard deviation value corresponding to the observed mean intensity of x1,jand x2,jderived from the LOESS regression. This T-score was shown to be independent of signal intensity. To correct for probe set-specific variability we multiplied the predicted standard deviation by (since αi,jis a ratio, we define to be the geometric mean of αi,jacross all i paradigms) and obtained an intensity- and gene-specific estimate for the underlying variance. This makes sense intuitively, since αi,jis the ratio of observed to expected standard deviation, and is the mean of αi,jover the three paradigms. We noted that each of the 60 000 calculations of is based on only 15 observations. Therefore, a few observations for each probe set may inevitably result in falsely deviant values for . Cognizant of this possibility, we diminished the effect of extreme outliers of by using in its place. This function has the convenient properties of being symmetric with respect to the geometric mean and bringing extreme outliers closer to the null effect of one. The combined T-score becomes:
While issues of normality and degrees of freedom confound the distribution of T-score, we can simplify our analysis by using the T-score as a ranking statistic to properly order genes for their statistical significance. An empirical null distribution of T-score, generated using replicate RNA samples, is presented in Fig. 3. In addition, a set of p-values derived from this replicate set and associated with the T-Score values is available at http://www.sadovsky.wustl.edu/tscore.html. Since the T-score is intensity and probe set independent, the null distribution applies to any gene expression change using this platform.
The replicate-based a priori generation of creates a database of probe set-specific variability coefficients that can be used for computation of T-score even from data derived without replicates, as conducted by most researchers. Instead of estimating variance with costly replicates, we contend that similar results can be obtained using estimates of variance generated from the intensity and probe set database we have created. Nevertheless, this simplification is not necessary. If an experiment has replicates the sample sizes can be carried through for their effect on the distribution of the mean. The T-score becomes:
and n1 and n2 represent the number of observations in paradigm 1 and 2, respectively. Here, we would disregard the observed variances and use the intensity and probe reliability index to create an estimate of the variance. A Bayesian approach could also be used with arbitrary degrees of freedom to incorporate the observed sample variances.
We verified that the T-score is superior to fold-change methodology at identifying differentially expressed genes. For this purpose, we randomly selected 52 probe sets from the gene population in the U95A set in which there was an agreement (26 sets) or disagreement (26 sets) between Affymetrix Fold Change and T-score results across different paradigms. We used real-time quantitative PCR to assess the expression change of these 52 genes, and defined these results as our "gold standard". We then correlated these "gold standard" results with the expression changes as predicted using several methods. These methods included Fold Change, difference after transformation, T without
(correcting solely for intensity, eq. 3), T-score (correcting for intensity and probe set-specific variability, eq. 4), and Cyber-T, a t-test based method that combines the empirical variance of a replicate set with the local background of intensity-dependent variance. After variance-stabilizing transformation, the biggest absolute difference in the transformed data should represent the most significant change in gene expression. Using the Pearson Correlation test we determined the strength of association of each of these methods with the change in expression level of these 52 genes, as determined by quantitative RT-PCR. The results are shown in Table 2. Clearly, the t-test based methods are superior to both Affymetrix Fold Change and absolute difference using transformed data. The Spearman Rank Correlation, which disregards data distribution and solely uses rank orders, exhibited similar results (Table 2). To determine the significance of the correlation differences we formulated a permutation test designed to define the null distribution of the correlations. Under the null hypothesis (two prediction methods are equivalent) each value was equally likely to appear under either method. The null distribution is therefore composed of correlations under which predictive values are permuted within each probe set j. The difference in either Pearson or Spearman Correlation coefficients between T-score and Fold-Change was found to be significant at p < 0.0002. T-score also performed better than the difference after transformation (p < 0.0002). The difference in correlations between T-score, Cyber-T , and T (without probe set correction) was not statistically significant.
Generation of replicate data in microarray experiments can be utilized to assess variability, and consequently enhance data consistency . Whereas our approach to identify bias related to intensity-dependent variance is consistent with that of others [1, 2, 13, 15, 23, 24], we have also shown that consideration of probe set-specific variance is critical, given that the often-used assumption of probe set independence is false. Practically, we demonstrated that small expression differences detected by a more reliable probe set might be more important than larger expression differences detected by a less reliable probe set. Therefore, our analysis indicates that an optimal method for determining differentially expressed genes must account and correct for reliability of each individual probe set. While previous efforts have used replication to distinguish non-functional, or highly unreliable probe sets for elimination [1–5], none has incorporated probe set-specific reliability into expression change statistics. Our methodology is the first that allows independent use of previously obtained information on probe set reliability in subsequent experiments. Therefore, inference from prior estimates of probe set-specific variances into new experiments could not have been utilized using previously published approaches.
We were reassured by the fact that the correlation of our T-score method was similar to that of the t-test based Cyber-T . However, while exhibiting a similar performance, Cyber-T does not incorporate an a priori-defined gene correction factor, and relies on replication for estimation of variance. In contrast, our T-score approach integrates previously defined probe set-specific variance (via our database of αj), defined by means of additional experimental paradigms (e.g., other ligands) for the same probe sets. Thus T-score, which independently accounts for intensity and probe set variances, may be utilized in array experiments even when performed without replication. Nevertheless, when adequate replicates are available, t-test is a suitable approach. It is also important to note that our analysis exclusively focuses on technical variability. While we demonstrate that a previously defined probe-specific variance can substitute for technical replicates, biological replicates are paramount to enhance accuracy of microarray-based expression analysis. In addition, the p-values associated with our empirically derived T-score do not include a correction for multiple comparisons, which should be accounted for when comparisons of expression level among thousands of genes are made [25, 26].
Another fundamental strength of our approach is the novel analysis of an extraordinarily large data set. Although our downstream analysis demonstrated the superiority of our methodology over an arbitrary cut-off approach, our results are limited by the fact that a fraction of our analyzed gene pool was expressed at a low level across all three paradigms. This might have led to an erroneous estimation of the specific variability of some probe sets. It should also be noted that our analysis is based on Affymetrix U95A-E gene-chip microarrays. Affymetrix has recently generated a new chip set (U133). While this may represent a technological advancement, the principles underlying our novel methodology are not addressed by the new chip-set. Whereas T-score values reported here are applicable only to the U95A-E set, the basis and principles underlying our analysis are applicable to any oligonucleotide microarray. We not only provide the first definitive proof that probe set-specific variability exists, but also offer the first generic methodology designed to utilize this information without performing costly replicate experiments. Probe reliability information based on cRNA targets that are expressed at high and low levels could be generated by biotechnology companies specializing in microarrays. This information, provided in conjunction with commonly available changes in p-value and determination of transcript presence, may serve to correct for technical variability in array experiments performed without replication.
We have developed a large replicate data set, based on gene expression in a single pool of primary trophoblast cells. Procurement of the placentas used in this study was approved by the human studies committee at Washington University School of Medicine, St. Louis, Missouri, USA. Primary human trophoblasts were prepared from three normal term human placentas as previously described  with previously published modifications , and cultured in 10 cm plates as previously detailed by our lab . Four hours after plating, the medium was replaced with fresh medium supplemented by the PPARγ ligands troglitazone (10 μM) (Biomol, Plymouth Meeting, PA) or GW7845 (1 μM, a gift from GlaxoWellcome), or by dimethylsulfoxide (DMSO) vehicle control. Fresh media and ligands were added after 24 h in culture. After 48 h the cells were collected for RNA. Total RNA was isolated using Tri-reagent (MRC, Cincinnati, OH, USA) and purified using RNeasy (Qiagen, Valencia, CA, USA). RNA samples (30 μg) from three placentas were mixed, and the mixture was used for double stranded cDNA synthesis using Superscript Choice system (InVitrogen Life Technologies, Carlsbad, CA, USA) and a T7T21 oligonucleotide primer (GenSet, La Jolla, CA, USA). Biotin-labeled RNA was synthesized by in vitro transcription using Enzo Bioarray RNA labeling kit (Enzo Diagnostics, Farmingdale, NY, USA). The RNA was fragmented and divided into five identical aliquots (15 μg of cRNA per aliquot). Each of the five cRNA aliquots was added to hybridization cocktail and sequentially hybridized to identical lot number U95A, U95B, U95C, U95D, and U95E arrays. All arrays were hybridized, washed, stained, and scanned using standard Affymetrix protocols. Together, we used identical cRNA samples to probe the expression of approximately 60 000 genes for each of the three conditions, for a total of 180 000 probe sets, each sampled in five replicates using a total of 75 chips.
Standard Affymetrix protocols using control oligonucleotide B2 were used for proper scanning and grid alignment, and samples were pre-tested using a control cRNA mix from Escherichia coli bioB, bioC, bioD, and P1 cre recombinase, used for monitoring of hybridization, washing, and staining conditions as well as reference samples for normalizing between experiments. Immediately after hybridization the chips were placed in the Affymetrix GeneChip Fluidics Station 400 and sequentially processed for low stringency wash, followed by high stringency wash, streptavidin / phycoerythrin stain, repeat low stringency wash, anti-streptavidin antibody stain, a second streptavidin / phycoerythrin stain and a final low stringency wash. After washing and staining each chip was placed in the Affymetrix Gene-Chip array scanner for image capture and conversion to numerical output using the Microarray Analysis Suite version 5.0. Comparison between chips using the Affymetrix protocol was performed using 'baseline chip' intensity values, normalized to average signal intensity.
Using this information we recently determined that technical variability in our data set depends on signal intensity . Ignoring this factor results in bias of expression changes toward genes with low signal intensity (Fig. 4).
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t -test and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509-519. 10.1093/bioinformatics/17.6.509.
Colantuoni C, Henry G, Zeger S, Pevsner J: Local mean normalization of microarray element signal intensities across an array surface: quality control and correction of spatially systematic artifacts. Biotechniques. 2002, 32: 1316-1320.
Lee ML, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA. 2000, 97: 9834-9839. 10.1073/pnas.97.18.9834.
Schaiff WT, Carlson MG, Smith SD, Levy R, Nelson DM, Sadovsky Y: Peroxisome proliferator-activated receptor-g modulates differentiation of human trophoblast in a ligand-specific manner. J Clin Endocrinol Metab. 2000, 85: 3874-3881. 10.1210/jc.85.10.3874.
This research was supported by NIH R01 ES11597-01 and the Siteman Cancer Center GeneChip Core Facility, Washington University School of Medicine, St Louis, MO, USA. We thank Elena Sadovsky and Lori Rideout for technical assistance. We thank Tim Willson from GlaxoWellcome for generously providing GW7845. We also thank Jeff Milbrandt and Mark Watson (both from Washington University) for discussions.
Authors and Affiliations
Department of Obstetrics and Gynecology, and Cell Biology and Physiology, Washington University School of Medicine, St. Louis, MO, 63110, USA
Vikram Budhraja, W Timothy Schaiff & Yoel Sadovsky
Department of Mathematics, Washington University, St. Louis, MO, 63130, USA
VB conceived the study's idea, generated the methods, created the mathematical models, wrote and programmed the T-score software, created the web tool, performed the PCR experiments and wrote the manuscript, all as part of his undergraduate training in YS's lab. ES oversaw the conception and creation of the mathematical models, and reviewed the validity of the approach and T-score. WTS carried out the experiments related to placental cell culture and sample processing, and reviewed the data. YS supervised the entire project and coordinated the team's efforts. He conceived the idea and generated the methods with VB, reviewed the development of the mathematical model and supervised all aspects of the experimental work, data assembly and manuscript generation. All authors read and approved the final manuscript.
Additional File 1: The regressions of alpha for the three different pair-wise comparisons, presented in Table 1, displaying the ability of alpha from one paradigm to predict alpha from another paradigm. (DOC 794 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Budhraja, V., Spitznagel, E., Schaiff, W.T. et al. Incorporation of gene-specific variability improves expression analysis using high-density DNA microarrays
BMC Biol1, 1 (2003). https://doi.org/10.1186/1741-7007-1-1