ExAtlas: Meta-analysis of gene expression data

ExAtlas Overview and Help          

Contents

  1. General Information
  2. Register with ExAtlas
  3. How to use ExAtlas? (Step-by-step instructions)
  4. List of terms
  5. Disclaimer: terms of use

1. General Information

ExAtlas is software for meta-analysis of gene expression data. In contrast to other software, it compares multi-component data sets and generates results for all combinations (e.g., all gene expression profiles vs. all GO annotations). Its functions include:
(1) standard meta-analysis: fixed and random effects (DerSimonian-Laird), z-score and Fisher's methods;
(2) global correlation analysis between different gene expression data sets;
(3) gene set enrichment among upregulated and downregulated genes;
(4) gene set overlap (e.g., between upregulated genes and Gene Ontology (GO) sets of genes);
(5) gene association (e.g., find sets of coregulated genes in two similar cell types or tissues, or find targets of transcription factors that are significantly enriched among upregulated or downregulated genes);
(6) statistical analysis of gene expression (similar to NIA Array Analysis). Expression profile data can be uploaded manually or extracted from the Gene Expression Omnibus (GEO) database. In particular, users can combine samples from multiple data sets (and possibly different platforms) in GEO database, assess the quality of data, and perform statistical analysis.
(7) preloaded public data: several of the most popular public data sets (e.g., GNF, BrainScope, Gene Ontology, KEGG, GAD phenotypes) are pre-loaded for immediate use.

NOTE: The error message "Operation failed!" means that a suspicious command was detected. Please send a message to the administrator and explain the problem.

ExAtlas has been developed in the Laboratory of Genetics and Genomics at the National Institute on Aging (NIA/NIH).
How to cite ExAtlas: Sharov, A.A., Schlessinger, D., and Ko, M.S.H. 2015. ExAtlas: An interactive online tool for meta-analysis of gene expression data. J. Bioinform. Comput. Biol., DOI: 10.1142/S0219720015500195.

The workflow in ExAtlas is shown in Fig. 1 below.

Fig. 1. Workflow in ExAtlas - software for gene expression meta-analysis.

For example, a user may search the GEO database for specific terms such as "kidney", "muscle", or "T-cells", and the software provides information on samples where these terms are found. The user then selects samples from the list and the software generates a gene expression profile matrix. ExAtlas can evaluate the quality of the data so that low-quality samples can be removed. Alternatively, expression profile data can be uploaded manually. The gene expression profile matrix can then be used for ANOVA, pair-wise comparison between tissues or cell types, Principal Component Analysis (PCA), and for making scatter-plots, expression profiles of individual genes, and heatmaps. Several gene expression matrices (e.g., GNF data on tissue/organ expression profiles) are pre-loaded in the software as public resources and are available to every user. Each gene expression matrix can be compared using correlation analysis with any other expression matrix.

Another important type of data is a gene set (or "geneset"). Each geneset file combines multiple genesets. The ExAtlas software stores many preloaded public geneset files, including Gene Ontology (GO), KEGG pathways, and BIOCARTA pathways. Gene set enrichment analysis (PAGE) is used to compare a gene expression matrix file with a geneset file. It evaluates whether genes that are upregulated or downregulated in each tissue or cell type are enriched in specific genesets (e.g., GO or KEGG). Another option is to generate a new geneset file that contains genes that are significantly upregulated or downregulated in each tissue or cell type. This geneset file can then be tested for geneset overlap. The overlap is evaluated using the hypergeometric distribution. Results of analysis are presented as color-coded tables or bar chart profiles.
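A geneset overlap test of this kind can be sketched in a few lines of Python. This is only an illustration, not the ExAtlas implementation: the function name overlap_pvalue and the example numbers are hypothetical, and it assumes a one-sided (upper-tail) hypergeometric test.

```python
from math import comb


def overlap_pvalue(universe, set_a, set_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: total number of genes considered
    set_a, set_b: sizes of the two genesets
    overlap: number of genes shared by both genesets
    """
    total = comb(universe, set_b)  # all ways to draw set_b genes
    upper = min(set_a, set_b)      # largest possible overlap
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20000 genes, genesets of 200 and 150 genes, 10 shared.
p = overlap_pvalue(20000, 200, 150, 10)
```

A small p-value suggests that the two genesets share more genes than expected by chance.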

Fig. 2. The main menu in ExAtlas.

The main menu of the program (Fig. 2) includes pull-down menus for selecting data files (expression profile matrices, genesets, samples, and outputs), buttons for opening these files (used for visualization and for starting analysis), and buttons to search the GEO database and upload custom data files. The bottom portion of the main menu is used for downloading or deleting data files, as well as editing file names, file descriptions, column headers, and geneset names and descriptions. File editing also includes options to delete or copy selected expression profiles or genesets.

2. Register with ExAtlas

Registration is optional; you can log in as a guest (click the "Start using ExAtlas" button) and do the analysis. However, registration has many benefits: you can keep your profile and uploaded files, and save results of analysis for future sessions. We will not use your e-mail address except to notify you of a new feature of ExAtlas (less than 1 message per year) and will not release it to any third party. To register, click on the "Register here" link on the front page.

3. How to use ExAtlas? Step-by-step instructions

List of tasks you can do with ExAtlas
  1. Search GEO database for gene expression data, extract data, and make a matrix
  2. Open gene expression data and statistical results (ANOVA)
  3. Plot a heatmap for the gene expression profile matrix
  4. Principal Component Analysis (PCA)
  5. Search for a gene and display the expression profile for this gene
  6. Pair-wise comparison of expression profiles of tissues or cell types
  7. Standard meta-analysis
  8. Correlation between different gene expression data sets
  9. Exploring the output file for correlation and other analyses
  10. Geneset enrichment analysis of up/down-regulated genes
  11. Generate a file with differentially-expressed genesets
  12. Explore a geneset file and/or analyze gene overlaps with another file
  13. Evaluate quality of individual samples and remove low-quality samples
  14. Upload files for analysis (formats, normalization, editing, copying)
  15. Edit files
  16. Practice session

A short tutorial for each of these tasks can be found below.

3.1. Search GEO database for gene expression data

After you log in, first select the organism species. Then click the button "Find samples in GEO". In the form that appears, type in search terms (comma separated) and click the "Search" button. Search terms may include tissues, cell types/lines, treatments, knockout genes, or GEO accession numbers (series or samples). If multiple terms are entered, then search results are sorted by the decreasing number of matching terms, and then by the decreasing number of hits within each data series. You can specify terms to avoid, such as cancer, tumor, biopsy, patient, hepatitis, HIV. Use the "Next page" button to progress through the list of results. Links to GEO series (e.g., GSE23310) and samples (e.g., GSM571945) lead to the GEO database website, where you can read more details about the data. Use checkboxes to select samples to be extracted (these selections are stored even if you go to the previous or next page). When you have finished selecting samples, edit the name of the new samples file and click the button "Save samples". Alternatively, you can add selected samples to an already existing samples file (button "Add samples").

After you save the samples, they will appear in the web page. Alternatively, you can select any previously saved file with samples and open it from the main menu (Fig. 2). The next step is to edit the list of samples. Some samples can be deleted or copied to another file. An important step is to edit descriptions of samples, because samples with identical descriptions within the same data set (GSE series) are considered replications and will be placed together. Thus, replication numbers or array ID numbers should be removed from sample descriptions. If a description is not clear, click on the sample ID and check the detailed description in the GEO database, then edit the sample description in the list of samples. After the list is finalized, generate a combined matrix of all samples by clicking the button "Generate matrix". Although samples from different data series (GSE accession numbers) can be combined in one list of samples, in many cases it is better to save each data series separately, upload corresponding data from GEO, and later combine data series using the batch-normalization method (see Edit files). Downloading and processing the data takes some time. Thus, the "interruption screen" appears (see Fig. 3):

Fig. 3. Interruption screen is used for long computational tasks.

In this window you can check your task, cancel the task, or close the window without cancelling the task. The link to "Log file" is provided so that you can check the status of your task. Keep reloading the log file to see changes. If you click "Check your task" but it is not finished, then the screen will say "Your task is not finished!". Results will be shown when the task is finished. If data come from different array platforms, expression profiles are combined based on gene symbol; if multiple probes are available for a gene, then the best probe is used, i.e., the probe with either higher statistical significance (F-statistic) or higher average signal intensity (if there are no replications). However, if all samples are obtained with the same array platform, then redundant probes are not removed, and thus a gene can be represented by multiple probes.

If you cannot find a specific data set that you know exists in GEO, this may have resulted from data filtering. Your data set may have been filtered out because the array platform type is a cDNA array, tiling array, genomic array, exon array, non-matching species, or RNA-seq. If you believe that the data was filtered out by mistake, please send a note to the webmaster. Currently, the GEO database has no uniform format for processed RNA-seq data, and thus automated download is not possible. However, you can upload RNA-seq data (e.g., from Cufflinks) manually using the "Upload data file" button (Fig. 2).

3.2. Open gene expression data and statistical results (ANOVA)

If you select a file with gene expression profiles in the main menu (Fig. 2) and then click the button "Open", a new screen (or tab) will open in your browser, which allows you to display expression profiles in various ways (Fig. 4). If you open this file for the first time after uploading, then you may need to wait till the statistical analysis is finished. If the data contains too many columns, the interruption screen (Fig. 3) may appear while the analysis is performed.


Fig. 4. Open expression profile matrix - screen capture.

From this screen you can plot a heatmap, do Principal Component Analysis (PCA), make a scatterplot that displays differentially-expressed genes, and search for a specific gene to display its expression profile. Other functions (see sections 3.6 - 3.11) include meta-analysis, correlation with another gene expression data set, gene set enrichment analysis (e.g., for functional annotations), generating sets of significant/specific genes, evaluating data quality, downloading statistical results (ANOVA), downloading raw data, normalizing data with the quantile method (Bolstad et al., 2003), and removing redundant probes (leaving the best probe for each gene).

Statistical analysis of gene expression data is based on the single-factor ANalysis Of VAriance (ANOVA). The program calculates the F-statistic, which is the ratio of factor variance (i.e., variance between averages for factor levels) to the error variance. The F-statistic is then used to estimate the P-value according to the theoretical F-distribution. Because in microarray analysis we simultaneously evaluate changes among several thousands of genes, it is necessary to adjust results for multiple hypothesis testing. The False Discovery Rate (FDR) shows the expected proportion of false positives among genes that are considered significant; it is estimated from p-values using the method of Benjamini-Hochberg. FDR ≤ 0.05 and fold change ≥ 2 are used as default criteria of statistical significance. The error model attempts to get a better estimate for the true error variance than the error variance estimated from data (we call it 'empirical error variance'). In ExAtlas, we use the maximum of the empirical error variance and the error variance averaged across 500 genes with similar average expression. This error model was proposed in the NIA Array Analysis software as a method to reduce the number of false positives.
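The Benjamini-Hochberg adjustment mentioned above can be illustrated with a short Python sketch. It shows the standard procedure and is not taken from the ExAtlas source; the function names are hypothetical, and the thresholds in is_significant are the defaults quoted elsewhere in this guide (FDR ≤ 0.05 and change ≥ 2 fold).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    fdr = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr


def is_significant(fdr, fold_change, max_fdr=0.05, min_fold=2.0):
    """Apply the default significance criteria (FDR and fold change)."""
    return fdr <= max_fdr and fold_change >= min_fold


# Hypothetical p-values for four genes.
fdr_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

In this example only the first gene has an adjusted value below 0.05, so it is the only one that could pass the default FDR criterion.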

Additional options for running ANOVA are available if you choose to "Run ANOVA again", which is a button in the section "Other functions" near the bottom of the web page (Fig. 4). This button is not available for public data sets, but you can make your own copy of public data (hint: use the "Edit" button for "expression profiles" in section "File management", Fig. 2; select all samples and save them as your own data set), and then run ANOVA with custom parameters or with a custom annotation file for the array platform. When running ANOVA again you can select one of the following error models: 1 = Actual error variance for each probe, 2 = Average error variance for probes with similar expression level, 3 = Bayesian correction of error variance (Baldi & Long 2001), 4 = Maximum between actual and expected average error variances, 5 = Maximum between actual and Bayesian error variances. You can select a cutoff expression value (probes with maximum value below the cutoff are ignored), modify the threshold z-value used to remove outliers, modify the proportion of probes with high error variances to ignore in error models, or modify the number of probes in a sliding window to average error variance. In addition, you can set an option to use the probe ID if the gene symbol is missing. This option allows processing of non-annotated probes.

If the input file has no replications, then the error variance is estimated based on the assumption that at least half of gene expression values (log-transformed) represent random deviations from the average, and fewer than half of the values correspond to the effects of factors. First, genes are sorted according to their average log-expression, and then error variance is estimated within a sliding window of 500 genes with similar expression. Absolute deviations of the log-expression of each gene from its mean expression value in all samples (i.e., |x-M|) are then combined for all 500 genes into one data set. For example, if the data matrix has 15 columns (=samples), then there will be 7500 deviation values (15 x 500) within the sliding window. The error variance is then estimated as the median of these deviation values divided by 0.675 (which is the inverse half-normal cumulative distribution for 0.5):

Error Variance = median(deviations)/0.675.
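The estimate for a single sliding window can be sketched in Python as follows. This is only an illustration of the formula as written above, not the ExAtlas code: the function name window_error_estimate is hypothetical, and the sorting of genes and the movement of the 500-gene window are left out.

```python
from statistics import median


def window_error_estimate(window_rows):
    """Apply median(|x - M|) / 0.675 to one window of genes.

    window_rows: list of rows, one per gene, each holding the
    log-expression values of that gene in all samples.
    """
    deviations = []
    for row in window_rows:
        gene_mean = sum(row) / len(row)  # M for this gene
        deviations.extend(abs(x - gene_mean) for x in row)
    return median(deviations) / 0.675


# Hypothetical window with two genes and two samples.
estimate = window_error_estimate([[1.0, 3.0], [2.0, 2.0]])
```

With 500 genes and 15 samples, the list of deviations would hold the 7500 values mentioned in the text.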

In ExAtlas, ANOVA is run when you open the gene expression profile matrix for the first time. If the file has too many samples, the interruption screen (Fig. 3) may appear while the task is running. A tab-delimited text file with ANOVA results can be downloaded by clicking the button "Get ANOVA output" at the bottom of the screen that appears after you open any gene expression matrix file (see Fig. 4).

3.3. Plot a heatmap for the gene expression profile matrix

To plot a heatmap (see example in Fig. 5A), select gene filtering parameters (FDR threshold and fold change threshold), kind of filtering, and kind of clustering in the upper portion of the "Open expression profile" screen (Fig. 4). You can check the box "Show replications" if you want to see data for individual replications. Then click the button "Make heatmap". Filtering of genes is important, first, to save processing time, and second, to make the heatmap easier to view. Non-significant genes (FDR > 0.05) only add noise to the heatmap and are better filtered out. A fold-change threshold of 2 is recommended. After the heatmap is displayed, you can download the filtered and sorted matrix (as a tab-delimited text file) by using the link "Matrix file" at the top of the page. This file can then be examined in Excel. Because of the large number of genes, gene symbols may not be visible and are represented by a gray area (or lines) at the left side of the heatmap. However, if you click in the row header area, the gene name and expression profile will be displayed.


Fig. 5. Example of a heatmap (A) and a scatterplot (B) for GNF mouse v.3 data

The bottom portion of the screen is designed for editing the heatmap. For example, you can change the maximum value and click the "Re-plot the matrix" button. If the maximum value is reduced, the colors will become darker; if the maximum value is increased, the colors will become lighter. Then, you can delete or move columns and rows using menu fields. For example, you can select a column (or a column range) and move it before another selected column.

3.4. Principal Component Analysis (PCA)

Principal component analysis (PCA) can be launched using the same filtering parameters as for heatmap generation. You can check the box "Show replications" if you want to see data for individual replications. Click the button "PCA" located near the top of the "Open expression profile" screen (Fig. 4). PCA is computed using the Singular Value Decomposition (SVD) method that generates eigenvectors both for rows and columns of the log-transformed data matrix (Gabriel 1971. Biometrika 58: 453-467; Chapman et al. 2002. Bioinformatics. 18: 202-204). For plotting of tissues and genes (biplot) we used column projections. The advantage of the biplot compared to a traditional PCA is that the user can visually explore associations between genes and tissues. ExAtlas generates 2-dimensional and 3-dimensional (based on VRML) biplots (Fig. 6). All biplots (including 3D) are interactive; each gene is a hyperlink to its annotation and expression pattern. To view PCA in 3-dimensions you need a VRML viewer, for example FreeWRL or Cortona3d.

Fig. 6. PCA and biplot of mouse gene expression in various tissues (GNF database). A and B = 2-D biplot for tissues and genes, respectively; C = 3-D PCA; D = 3-D biplot for tissues (green spheres) and genes (blue cubes).

Checkbox "PC gene clusters" (Fig. 4) is used to identify 2 clusters of genes that are positively and negatively correlated with each principal component (Fig. 7). The degree of gene expression change within a specific PC is measured by the slope of regression of log-transformed gene expression versus the corresponding eigenvector, multiplied by the range of values within the eigenvector. A gene is associated with the most correlated PC; however, two additional conditions should be met: (a) the degree of gene expression change exceeds the threshold (default = 2-fold change), and (b) the absolute value of correlation exceeds the threshold (default = 0.7). These two parameters can be changed in the menu: "Correlation (PCA cluster)" and "Fold change (PCA cluster)" (Fig. 4).
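The assignment rule above can be sketched in Python. This is a rough illustration under stated assumptions, not the ExAtlas code: the function names are hypothetical, expression values are assumed to be log10-transformed (so a 2-fold change corresponds to log10(2)), and thresholds are applied with "greater than or equal to".

```python
from math import log10, sqrt


def pc_statistics(expr, eigvec):
    """Correlation and expression change of one gene along one PC."""
    n = len(expr)
    mean_v = sum(eigvec) / n
    mean_e = sum(expr) / n
    sxx = sum((v - mean_v) ** 2 for v in eigvec)
    syy = sum((e - mean_e) ** 2 for e in expr)
    sxy = sum((v - mean_v) * (e - mean_e) for v, e in zip(eigvec, expr))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # constant vector: no association
    slope = sxy / sxx                             # regression of expr on eigvec
    corr = sxy / sqrt(sxx * syy)                  # Pearson correlation
    change = slope * (max(eigvec) - min(eigvec))  # slope times eigenvector range
    return corr, change


def assign_gene_to_pc(expr, eigvecs, min_fold=2.0, min_corr=0.7):
    """Return (PC index, direction) for the most correlated PC, or None."""
    stats = [pc_statistics(expr, v) for v in eigvecs]
    best = max(range(len(stats)), key=lambda k: abs(stats[k][0]))
    corr, change = stats[best]
    if abs(corr) >= min_corr and abs(change) >= log10(min_fold):
        return best, ("positive" if corr > 0 else "negative")
    return None
```

For example, a gene with log-expression [0.0, 0.5, 1.0] and eigenvectors [1, -2, 1] and [-1, 0, 1] would be placed in the positive cluster of the second component.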

Fig. 7. Gene clustering based on principal components

3.5. Search for a gene and display the expression profile

Type a gene symbol or GenBank accession number in section "Find genes" near the middle of the "Open expression profile" screen (Fig. 4) and click the button "Search". You can specify which category of gene description to search (gene symbol, GenBank accession, gene name, probe ID). If many genes (or probes) match your search, all of them will be displayed, and then you can select individual genes or probes. Checkbox "Sort" can be checked if you wish to sort cell/tissue types according to the expression of the gene you search. When the gene (or probe) is found, ExAtlas generates a histogram with the gene expression profile (Fig. 8) and a table of expression in each tissue/cell type. The histogram shows average log-expression values for each cell type or tissue relative to the global median; to see values for individual replications, click the button "Show replications".

Fig. 8. Expression change of Hoxa1 after induction of 137 transcription factors in mouse ES cells

From the screen with gene expression histogram (Fig. 8) you can search for other genes with the same expression profile using correlation threshold and fold change threshold (which are applied simultaneously).

3.6. Pair-wise comparison of expression profiles of tissues or cell types

Select two tissues or cell types that you want to compare from the pull-down menus in the section "Pairwise comparison" near the middle of the "Open expression profile" screen (Fig. 4). As a baseline, you can use median expression instead of a second tissue. Then select parameters (FDR threshold and fold change threshold) and click the button "Scatter-plot". The scatterplot is a graph where each point represents one gene with x-coordinate = log expression in tissue #2 (or median) and y-coordinate = log expression in tissue #1. Gray dots = non-significant genes, red dots = significant upregulated genes, and green dots = significant downregulated genes. Statistical significance is based on z-values, which are estimated from the ANOVA error variance by the equation:
z = |m1 - m2| / sqrt[ErrVar*((1/n1)+(1/n2))],
where mi and ni are the mean and sample size for tissue i, and ErrVar is the error variance estimated from the error model of ANOVA. From z-values the program estimates p-values, and finally, FDR values which are used to evaluate statistical significance. To display the list of significant genes click on the link "List of over-expressed genes" or "List of under-expressed genes". If you click on probe ID (or gene symbol) in the list you get the expression profile of the gene. The list of genes can be further examined for significant overlap with various genesets for functional annotation (e.g., GO, KEGG). The menu for comparison is located just above the table of significant genes. The list of genes is also available as tab-delimited text (link at the top of the screen).
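The z-value equation above translates directly into Python. The sketch below is illustrative only: the function names are hypothetical, and the conversion to a p-value assumes a two-sided test against the standard normal distribution, which the text does not state explicitly.

```python
from math import erfc, sqrt


def pairwise_z(m1, m2, n1, n2, err_var):
    """z = |m1 - m2| / sqrt[ErrVar*((1/n1)+(1/n2))]."""
    return abs(m1 - m2) / sqrt(err_var * (1.0 / n1 + 1.0 / n2))


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-value (assumed test)."""
    return erfc(z / sqrt(2.0))


# Hypothetical gene: mean log-expression 2.0 vs 1.0, two replications each.
z = pairwise_z(2.0, 1.0, 2, 2, 0.25)
p = two_sided_p(z)
```

The resulting p-values would then be converted to FDR values before the significance thresholds are applied.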

If you use median expression profile for comparison (as control) then an additional feature is recorded in the output table: a z-value that characterizes gene specificity (column header "Specificity"). This z-value is estimated by comparing log-expression in a given tissue (mi) with average expression in other tissues (M) that are not correlated with this tissue (see details here).

3.7. Standard meta-analysis

The goal of standard meta-analysis is to integrate information from multiple independent studies. It can increase statistical power and reduce false-positive effects. ExAtlas implements the four most popular methods: Fisher's, Z-score, Fixed effects, and Random effects. The first three methods are relevant only if the combined studies implement exactly the same methodology (e.g., same cell lines, same reagents, and same equipment). In practice, the methodologies often differ between studies, and thus the Random effects method appears most relevant. Fisher's method combines log-transformed p-values from m studies and generates a chi-square statistic with 2m degrees of freedom:
chi-square = -2 * [ln(p1) + ln(p2) + ... + ln(pm)],
where pi is the p-value from study i.

Z-score method combines z-scores (i.e., ratio of mean effect to the S.D. of effect) of different studies using weights which are estimated from sample sizes. Here the term "effect" means logratio of gene expression change.

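Fixed effects method estimates a weighted sum of effects (i.e., logratio of gene expression change), where weights are inverse to variance:
x = sum(wi*xi) / sum(wi), with wi = 1/vi,
where xi and vi are the effect and its variance in study i.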

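Random effects method takes into account the variance of heterogeneity between studies (DerSimonian-Laird); thus, the weights are adjusted for heterogeneity:
wi = 1/(vi + T),
where vi is the variance of the effect in study i and T is the estimated variance of heterogeneity between studies.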
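Three of these methods can be sketched in Python using their standard textbook forms. The code is illustrative and not the ExAtlas implementation: all function names are hypothetical, and the Z-score method is left out because the exact weights used by ExAtlas are not specified here.

```python
from math import exp, factorial, log


def fisher_combined(pvalues):
    """Fisher's method: chi-square statistic (2m d.f.) and combined p-value."""
    m = len(pvalues)
    chi2 = -2.0 * sum(log(p) for p in pvalues)
    half = chi2 / 2.0
    # Exact upper tail of a chi-square distribution with even d.f. = 2m.
    p_combined = exp(-half) * sum(half ** k / factorial(k) for k in range(m))
    return chi2, p_combined


def fixed_effects(effects, variances):
    """Inverse-variance weighted effect and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * x for w, x in zip(weights, effects)) / total
    return estimate, 1.0 / total


def random_effects(effects, variances):
    """DerSimonian-Laird estimate: effect, its variance, heterogeneity."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed, _ = fixed_effects(effects, variances)
    q = sum(w * (x - fixed) ** 2 for w, x in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Weights are adjusted by adding tau2 to each within-study variance.
    estimate, variance = fixed_effects(effects, [v + tau2 for v in variances])
    return estimate, variance, tau2
```

When the studies agree closely, the heterogeneity estimate is zero and the random effects result coincides with the fixed effects result.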

The first step in meta-analysis is to specify gene expression data to be combined. When you open a file with gene expression profiles (Fig. 4), use section 2 (Pairwise comparison), and select the sample (cell type or tissue) that you want to examine and a baseline sample for comparison (e.g., control or median profile). Then click the button "Meta-analysis". A new screen will open (Fig. 9) where you can add data for meta-analysis. Specify another expression profile data set and select the sample of interest and a baseline sample, and then click the button "Add data". If all data sets use the same array platform, then the meta-analysis is done for each probe ID (there may be multiple probe IDs for the same gene). Otherwise, the meta-analysis is done for each gene symbol. If the new data set belongs to a different species, first select the new species, and then select the data file. Gene symbols will be converted to the first species using HomoloGene. To delete a data pair, use the corresponding checkbox and then click on the button "Delete checked data". Part 3 in the menu (Fig. 9) allows you to save your meta-analysis design for the future: fill in the description field (optional), make sure that the selected file is "--- New file ---", then click on the button "Save metaanalysis". A pop-up window will appear where you type in the file name and click "Ok". You can also load one of the previously saved meta-analysis designs: select the file you need and click the button "Load metaanalysis".


Fig. 9. Screen for compiling data for meta-analysis.

When all data sets for meta-analysis are assembled, select parameters for meta-analysis (FDR threshold and fold-change threshold), and click the button "Start analysis". The output page shows the number of significant genes for each method of meta-analysis. Click on the number of genes to display the list of genes and corresponding statistics. Effects are shown as either logratio (log10) (default) or as fold change. The format of effects can be selected above the output table that shows the number of significant genes for each method. The list of significant genes can be further explored for significant overlap with various data sets.

3.8. Correlation between different gene expression data sets

To characterize the effect of treatments on gene expression profiles it is often necessary to examine correlations between different gene expression data sets. For example, the change of expression of genes following the induction of various individual transcription factors in ES cells was compared with gene expression profiles in various tissues and cell types (Nishiyama et al. 2011). Results indicated that some transcription factors (e.g., Ascl1, Gata3, Myod1, Sfpi1) induced tissue-specific genes. To estimate correlations, open the first file with gene expression profiles, then click the "Correlation" button in the section "Other tasks" (Fig. 4). This will take you to the next screen where you can select the second file with gene expression profiles (Fig. 10). The second file can be the same as the first one if you wish to generate an auto-correlation matrix. If you want to compare gene expression change between different species, then select a species for comparison. The screen will be reloaded with a list of data for that species. Use the FDR threshold and fold change threshold to limit the number of genes. Lower values of FDR and higher values of fold change correspond to more stringent filtering.

Fig. 10. Screen for correlation analysis of two data sets with gene expression profiles.

The algorithm for estimating correlations is the following.

  1. Log-transform gene expression data and run ANOVA for each file
  2. For each gene, select the best probe (with highest F-statistics, ANOVA)
  3. In each file, select genes based on FDR and fold-change thresholds (default: FDR ≤ 0.05 and change ≥ 2 fold); FDR is calculated from ANOVA and it indicates the significance of gene expression change in all tissues or cell-types; fold change is calculated from highest and lowest expression in all tissues or cell-types
  4. Find common genes that are selected for both files - these genes will be used for estimating correlations
  5. Subtract median expression value (or other baseline value) in each row. User can select "control" as a baseline. Because all expression values are already log-transformed in ANOVA, the subtraction yields a logratio value
  6. Take column i from the first matrix and estimate its correlation (Pearson or Spearman) with the column j in the second matrix. This correlation value is placed in column i and row j in the output table.
All these steps are done automatically after you click on the button "Estimate correlation matrix". Before you start the task, specify the output file name and its description (edit suggested name). Because estimating correlations usually takes several minutes, an interruption web page appears where you need to check the status of your task.
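Steps 4 - 6 of the algorithm can be sketched in Python as follows. This is an illustration rather than the ExAtlas code: the function names and the dictionary-based data layout are hypothetical, only Pearson correlation is shown, and steps 1 - 3 (ANOVA, best-probe selection, and filtering) are assumed to have been applied already.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def correlation_matrix(matrix1, matrix2):
    """matrix1, matrix2: dict gene -> list of log-expression values.

    Returns a table where table[j][i] is the correlation of column i
    of the first matrix with column j of the second matrix.
    """
    common = sorted(set(matrix1) & set(matrix2))  # step 4: common genes

    def logratio_columns(matrix):
        # Step 5: subtract the row median, then transpose rows to columns.
        rows = [[v - median(matrix[g]) for v in matrix[g]] for g in common]
        return [list(col) for col in zip(*rows)]

    cols1 = logratio_columns(matrix1)
    cols2 = logratio_columns(matrix2)
    # Step 6: correlate every column pair.
    return [[pearson(c1, c2) for c1 in cols1] for c2 in cols2]
```

Genes that are present in only one of the two matrices are ignored, which mirrors the "common genes" step.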

If you check the box "Identify coregulated genes", then ExAtlas will identify lists of genes that are both upregulated or both downregulated in the two data files, provided that the correlation is positive and significant (z ≥ 2) and the Expected Proportion of False Positives (EPFP) is smaller than the specified threshold (default threshold = 0.5). The algorithm for finding positively coregulated genes is based on the analysis of data points in the positive quadrant (i.e., x>0 and y>0). Negatively coregulated genes are identified in the same way in the negative quadrant. First, logratios of gene expression change are all replaced by their ranks. If the null hypothesis is true (no correlation), then the genes are expected to have a uniform random distribution in the positive quadrant (Fig. 11A). To estimate EPFP for a gene with rank rx in the expression profile from file #1 and rank ry in the expression profile from file #2, we estimate the density of dots/genes in a rectangle with lower left corner at (rx,ry) coordinates (Fig. 11B, dark-shaded area) and compare it with the density of dots in two adjacent rectangles to the left and down (light-shaded areas). EPFP equals the density of dots in a light-shaded rectangle (which serves as a baseline) divided by the density of dots in the dark-shaded rectangle. Because we have two light-shaded rectangles, EPFP is estimated twice, and then we select the larger value (to be conservative in our assessment). Because EPFP may not monotonically decrease with increasing rank rx and ry, it is forced to decrease monotonically. In particular, if EPFP(rx1,ry1) > EPFP(rx,ry), and rx1 > rx, and ry1 > ry, then EPFP(rx1,ry1) is set equal to EPFP(rx,ry).

Fig. 11. Estimating Expected Proportion of False Positives (EPFP) for coregulated genes: (A) scatter-plot of gene expression rank in the positive quadrant if there is no correlation. (B) The same plot if gene expression profiles are correlated; numbers indicate gene counts. The density of dots/genes in the dark-shaded rectangle, 130/(132+130)/(130+87)=0.002287, is compared with the density of dots/genes in two light-shaded rectangles: 132/(132+130)/(132+651)=0.000643 and 87/(651+87)/(130+87)=0.000543. Two estimates of EPFP are generated for the gene at the lower left corner of the dark-shaded rectangle (with expression ranks rx=651+132=783 and ry=651+87=738): EPFP1 = 0.000643/0.002287 = 0.281 and EPFP2 = 0.000543/0.002287 = 0.237. The greater value is selected: EPFP = 0.281. (C) All coregulated genes with EPFP≤0.3 are highlighted (magenta).
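The arithmetic in the caption of Fig. 11 can be reproduced with a short Python sketch. The function name epfp_from_counts and its argument names are hypothetical; the four arguments are the gene counts in the four rectangles around the point (rx,ry), and the monotonic adjustment described in the text is not included.

```python
def epfp_from_counts(lower_left, upper_left, lower_right, upper_right):
    """EPFP for the gene at the lower left corner of the dark rectangle.

    upper_right is the count in the dark-shaded rectangle; upper_left and
    lower_right are the counts in the two light-shaded rectangles.
    """
    left = lower_left + upper_left      # genes with lower x-rank (rx)
    right = lower_right + upper_right   # genes with higher x-rank
    below = lower_left + lower_right    # genes with lower y-rank (ry)
    above = upper_left + upper_right    # genes with higher y-rank

    dark = upper_right / (above * right)
    light_left = upper_left / (above * left)
    light_down = lower_right / (below * right)
    # Two estimates are made; the larger (more conservative) one is kept.
    return max(light_left / dark, light_down / dark)


# Gene counts from Fig. 11B.
epfp = epfp_from_counts(651, 132, 87, 130)
print(round(epfp, 3))  # 0.281
```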

To identify oppositely coregulated genes (i.e. upregulation in file #1 associated with downregulation in file #2 and vice versa), set "Direction of change (file #2)" to "Reversed" (Fig. 10). Then gene expression change for File #2 is inverted (multiplied by -1).

3.9. Exploring the output file

When the output table for correlation analysis is generated, results are saved in the output file, which is opened automatically. Output files can also be opened manually from the pull-down selection list in the main menu (Fig. 2); after selecting a file, click the "Open" button. When the output screen is displayed (Fig. 12), it can be used to plot the full output table as a heatmap (section #1) or to plot bar charts for rows and columns of the output table (section #2). Examples of output graphs are shown in Fig. 13. When plotting the full table, select which values to plot. Plotting options depend on the type of analysis and generally include z-values, which indicate the significance of correlation. In addition, correlation values and/or the number of associated genes are provided. After you have selected which table to plot, click the button "Plot output table". You can also plot profiles for individual rows and columns of the output table by selecting respective rows or columns in section #2 "Profiles of rows, columns, and cells". Values are sorted in profiles from high to low because sorting is convenient for functional annotations of genes (e.g., Gene Ontology or pathways).

Fig. 12. Open output file screen: results of correlation analysis.

Fig. 13. Example of correlation matrix (A) and profile for a single row/column (B).

3.10. Geneset enrichment analysis of up/down-regulated genes

Geneset enrichment analysis is used to evaluate if specific genesets (such as Gene Ontology or KEGG pathways) are over-represented among upregulated and/or downregulated genes. The advantage of geneset enrichment analysis compared to a simple overlap of genesets is that no thresholds are used for selecting differentially expressed genes. In particular, geneset enrichment analysis can find significant associations with functional genesets even if there are no significantly upregulated genes based on standard criteria (e.g., FDR ≤ 0.05 and change ≥ 2 fold). Among various existing methods for geneset enrichment analysis we use Parametric Analysis of Gene Enrichment (PAGE) (Kim & Volsky 2005, PMID: 15941488) because of its simplicity and reliability (Zhang et al. 2010, PMID: 20092628). PAGE is based on the comparison of the average expression change in a specific subset of genes, xset, with the average expression change in all genes, xall:
z = (xset - xall)*sqrt(nset)/SDall,
where nset is the size of the gene set and SDall is the standard deviation of expression change among all genes. This method is modified here by applying the equation to the subset of N top upregulated genes and, separately, to the subset of N top downregulated genes rather than to all genes combined (here we use N = 25% of all genes). This modification allows one to detect enrichment of the same gene set among both upregulated and downregulated genes. Upregulation or downregulation is estimated relative to the median expression of each gene or to a user-specified baseline (e.g., "control"). The probability distribution of expression change within subsets of N upregulated or downregulated genes is not normal; however, because we compare averages for large sets of genes (usually, nset > 50), the probability distribution of these averages is close to normal based on the central limit theorem. Thus, it is reasonable to use the equation above as an approximation.
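
The PAGE equation and its modification can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ExAtlas implementation: the function names are hypothetical, expression changes are passed as a dictionary of gene symbol to log-ratio, the sample standard deviation is used for SDall, and "all genes" in the modified version is taken to mean all genes within the selected top subset.

```python
import math
import statistics


def page_z(changes, geneset):
    """z = (xset - xall) * sqrt(nset) / SDall for one geneset.

    changes - dict: gene symbol -> expression change (e.g., log-ratio)
    geneset - set of gene symbols
    """
    all_values = list(changes.values())
    x_all = statistics.mean(all_values)
    sd_all = statistics.stdev(all_values)   # assumption: sample SD
    set_values = [changes[g] for g in geneset if g in changes]
    n_set = len(set_values)
    if n_set == 0:
        return 0.0                           # geneset not represented
    x_set = statistics.mean(set_values)
    return (x_set - x_all) * math.sqrt(n_set) / sd_all


def page_z_top(changes, geneset, fraction=0.25, up=True):
    """Apply the same equation to the top upregulated (up=True) or
    top downregulated (up=False) fraction of genes only."""
    ranked = sorted(changes, key=changes.get, reverse=up)
    n_top = max(2, int(len(ranked) * fraction))
    subset = {g: changes[g] for g in ranked[:n_top]}
    return page_z(subset, geneset)


example = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
print(round(page_z(example, {"c", "d"}), 4))  # 1.0954
```

With up=False the genes are ranked from the most downregulated, so a geneset enriched among strongly downregulated genes yields a negative z in this sketch.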

Fig. 14. Screen for starting geneset enrichment analysis (PAGE)

To start PAGE analysis, select the geneset file using pull-down list (Fig. 14). To use geneset file for a different species, first select species. The screen will be reloaded with a list of data for that species; after that select the geneset file. To identify associated genes (e.g., target genes with binding sites of transcription factor which at the same time responded to the induction or knockdown of the same transcription factor) check the box "Identify associated genes". Use EPFP threshold and fold change threshold to limit the number of associated genes. Lower values of EPFP and higher values of fold change correspond to more stringent filtering.

Viewing the output file is similar to that for correlation analysis. You can plot a matrix heatmap or profile for individual columns or rows. If associated genes were identified they will appear in the profile (as in Fig. 13B). If the list of genes is too long it is truncated. To see the full list of genes (Fig. 15A), click on the row header. In addition, at the end of the list you will find a rankplot that shows graphically the enrichments of genes that belong to the given geneset among either upregulated or downregulated genes (Fig. 15B).

Fig. 15. List of associated genes (A) and a rankplot (B). In this specific case, genes from the geneset are enriched among both upregulated and downregulated genes, but the enrichment is stronger among upregulated genes.

3.11. Generate a file with differentially-expressed genesets

ExAtlas automates the generation of genesets of upregulated and downregulated genes, which can be later used for comparison with other data sets. Expression of each gene is compared to the baseline expression, which can be selected as a median expression value (default) or expression in some specific tissue/organ or cell line. Conditions of statistical significance are defined by the FDR threshold and fold change threshold. An additional condition is gene specificity, which narrows the list down to specific genes only. Specificity is measured by z-value, as explained in the pair-wise comparison section. To select highly specific genes, use z-values ≥ 6. Before starting the task, don't forget to edit the name and description of the output geneset file, then click the button "Save significant genes". When the task is finished, the output file displays a histogram of the number of significantly upregulated (orange) and downregulated (dark blue) genes (Fig. 16).

Fig. 16. Histogram of the number of significantly upregulated (orange) and downregulated (dark blue) genes after the induction of various transcription factors in mouse ES cells.

3.12. Explore a geneset file and/or analyze gene overlaps with another file

When you open a geneset file from the main menu (Fig. 2), a new window appears which allows the user to find a geneset with a specific name/description (use button "Search") or select a geneset from the alphabetically ordered list of all genesets (use button "Display genes") (Fig. 17). The second portion of the menu is designed for starting the analysis of geneset overlap. You can either select another geneset file or simply paste a list of genes into the provided text area. Then select parameters of statistical significance (FDR threshold and fold enrichment threshold) and click the button "Overlap analysis". The program identifies common genes for each pair of genesets from the first and second geneset files. If the number of overlapping genes is greater than expected by chance, the program uses the hypergeometric distribution to evaluate the significance of gene enrichment. This is the traditional way of analyzing gene enrichment, which is a simpler alternative to the more sophisticated PAGE method described above.

Fig. 17. Open geneset of targets of transcription factors in mouse ES cells.

When a specific geneset is selected, then the full list of member genes is displayed. From this screen you can test the significance of overlap with any other available geneset data, such as GO, KEGG, etc.

3.13. Evaluate quality of samples and remove low-quality samples

Click the button "Data quality" near the bottom of the "Open expression profile" screen (Fig. 4) to run the quality control program. If the data file is large, the interruption screen (Fig. 3) may appear as discussed above. Quality control checks (a) correlation of log10-transformed expression of housekeeping genes with standard data (RNA-seq), and (b) consistency between replications. Consistency of replications is assessed by a modified standard deviation (SD) of the log-transformed expression in each sample from the tissue-specific median (where outliers with z > 3.5 are not used for estimating the median). In general, SD < 0.1 means good quality, and SD > 0.3 means bad quality. Correlation of expression of housekeeping genes is usually in the range from 0.5 to 0.95. If it falls below 0.5, then the quality may be low. Checkboxes located near each sample allow the user to select samples with low quality for deletion.

3.14. Upload files for analysis (formats, normalization, etc.)

The "Upload data file" button in the main menu (Fig. 2) is used to open the screen for file upload (Fig. 18). You can either browse for the file to be uploaded (button "Browse..") or paste the text of the file into the provided text area. Then, select the type of file (i.e., Gene expression profile matrix, Gene set file, Samples file, List of geneset, Output file, or Annotation file). If you want to store the file under a different name, type in the file name in the "Rename file as:" field. Fill out the file description. If the file with the gene expression profile table does not include information on the array platform, then you need to select the array platform. If the array platform is not present in the pull-down menu list, you need to upload a file with platform annotation, which should include at least 3 columns: "probe ID", "gene symbol", and "gene name". You can add more columns that specify GenBank accession numbers, Entrez ID, or Unigene ID. If gene symbols or GenBank accession numbers are used as probe ID, then select the "Gene symbols" or "public-genebank" platform annotation, respectively.

Fig. 18. Screen for uploading custom data files.

Here is a brief description of file formats.
The gene expression profile is a tab-delimited text that follows MIAME standards. All matrix files downloaded from GEO can be directly uploaded to ExAtlas. The file has header lines that start with "!" sign. However, these lines are optional. You can upload a file even without these lines if you specify platform for the gene expression profile file. Header lines are followed by a table with data lines that specify the intensity of feature signals. Here is an example of a gene expression profile matrix file:

!Series_title	"Gene expression of human soft tissue sarcoma"
!Series_geo_accession	"GSE2719"
!Series_pubmed_id	"15994966"
!Series_summary	"Gene expression profiles of 39 human sarcoma samples (GSM 52571-GSM52609)..."
!Series_type	"Expression profiling by array"
!Series_platform_id	"GPL96"
!Series_platform_taxid	"9606"
!Series_sample_taxid	"9606"

!Sample_title	"brain"	"stomach"	"colon"	"pancreas"	"prostate" ...
!Sample_geo_accession	"GSM52556"	"GSM52557"	"GSM52558"	"GSM52559"	"GSM52560" ...
!Sample_taxid_ch1	"9606"	"9606"	"9606"	"9606"	"9606" ...
!Sample_data_row_count	"22283"	"22283"	"22283"	"22283"	"22283" ...
!series_matrix_table_begin
"ID_REF"	"GSM52556"	"GSM52557"	"GSM52558"	"GSM52559"	"GSM52560" ...
"1007_s_at"	2867.1	1780.8	1921.8	2486.1	4151.4	...
"1053_at"	216.4	196.8	145.3	127.1	109.7	...
"117_at"	135	121	157.2	162.6	267.8	...
"121_at"	916.1	1075.7	922	2192.9	1198.8	...
"1255_g_at"	149.8	35.5	32.7	96.3	47.6	...
..................................................................
!series_matrix_table_end

Sample names are taken from the line "!Sample_title" or from the line of column headers that follows "!series_matrix_table_begin". Column headers for replication samples should match exactly (case-sensitive). It is not required to reorder columns so that all replications are placed together; replication samples are recognized by column headers even if they are separated by other samples in the table. ExAtlas can process 2-dye arrays that use reference RNA consistently as one of the channels (e.g., Cy5 or Cy3). In this case, two columns that correspond to the same array (channel #1 and channel #2) should be placed together and the column representing reference RNA should be named "reference". If data are log-transformed or Z-value transformed, then select the transformation type from the pull-down menu.
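
If you want to check a matrix file before uploading it, the format described above can be read with a short Python script. The sketch below is only an illustration of the file layout, not the parser used by ExAtlas; the function names parse_series_matrix and replicate_groups are hypothetical, and values that cannot be read as numbers are stored as missing (None).

```python
import csv
import io


def parse_series_matrix(text):
    """Split a tab-delimited matrix file into header lines, column
    headers, and data rows. Header lines (starting with "!") are optional."""
    header, columns, rows = {}, None, {}
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or not fields[0].strip():
            continue                                  # skip blank lines
        key = fields[0]
        if key.startswith("!"):
            if key not in ("!series_matrix_table_begin",
                           "!series_matrix_table_end"):
                header[key[1:]] = fields[1:]          # repeated keys overwrite
        elif columns is None:
            columns = fields[1:]                      # line of column headers
        else:
            values = []
            for item in fields[1:]:
                try:
                    values.append(float(item))
                except ValueError:
                    values.append(None)               # missing value
            rows[key] = values
    return header, columns, rows


def replicate_groups(sample_names):
    """Group column indexes by exactly matching (case-sensitive) names."""
    groups = {}
    for index, name in enumerate(sample_names):
        groups.setdefault(name, []).append(index)
    return groups
```

For example, replicate_groups(["brain", "liver", "brain"]) returns {"brain": [0, 2], "liver": [1]}, which shows that replications are recognized by name even when they are not adjacent, whereas "Brain" and "brain" would form two separate groups.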

Because background subtractions may result in negative values, some array scanning programs avoid negatives by adding some constant value to signal intensity (e.g., 50 or 100). Usually this does not cause problems, but low-expressed genes may show weaker expression fold-change. If you would like to remove this constant value, then select "adjustment" value from the pull-down menu.

After you upload a new gene expression profile it will appear in the main menu. When you try to open it for the first time, it will run ANOVA (which may take some time).

Alternatively, you can compile gene expression data column-by-column from one or multiple tab-delimited text tables. To use this option, select the "Compile expression profile" option from the pull-down list "Select file type:". Type in the file name in the field "Rename file as" and add a description. Select the array platform if applicable, then browse to select the first data table and click the "Upload" button. After the table is parsed and column headers are displayed on the screen, select the columns to be extracted, specify their usage (Probe ID/tracking ID, Gene ID/name, or Gene expression), and edit column headers if needed. If you have specified an array platform, use the column with probe ID as "Probe/tracking ID". Alternatively, select a column as Gene ID/name if it has gene symbols, GenBank accession numbers, Entrez gene IDs, or Ensembl gene IDs. Please edit column headers as 'symbol', 'refseq', 'genbank', 'entrez', or 'ensembl'. Probe/tracking ID or Gene ID/name should be common for all data files that are assembled together. When these data are uploaded, you can choose another data table and extract data from it until all data are compiled. It is necessary to specify Gene ID/name in at least one of the tables. For example, you can upload an annotation table where both Probe ID/tracking ID and Gene ID/name are present. At any time you can edit sample names to make them meaningful and ensure that replications have exactly the same sample names (case-sensitive). If you have 2-dye arrays and one channel is used for reference RNA, then edit the column name as 'reference'. In this case reference expression will be used for normalization as follows: norm(x) = x*My/y, where x is the signal intensity for the sample, y is the signal intensity for the reference, and My is the geometric mean of all reference values.
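
The reference normalization norm(x) = x*My/y can be illustrated with the following Python sketch. It is not the ExAtlas code: the function name is hypothetical, and it assumes that the values passed in belong to one gene measured on several 2-dye arrays, so that My is the geometric mean of the reference values supplied to the function.

```python
from statistics import geometric_mean


def normalize_to_reference(sample, reference):
    """norm(x) = x * My / y for paired sample/reference intensities.

    sample    - signal intensities x for the sample channel
    reference - signal intensities y for the reference channel
    """
    m_y = geometric_mean(reference)   # My: geometric mean of reference values
    return [x * m_y / y for x, y in zip(sample, reference)]


# Two arrays with the same sample/reference ratio (2-fold) give the
# same normalized value even though raw intensities differ 4-fold.
print(normalize_to_reference([100.0, 400.0], [50.0, 200.0]))
```

The example shows the purpose of the procedure: array-to-array differences that affect both channels equally cancel out, and only the ratio to the reference, rescaled by My, remains.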

In a geneset data file (tab-delimited text), each line corresponds to one geneset. The first item is the geneset ID, the second is the geneset description (which may be blank or duplicate the ID), followed by all genes that belong to this geneset. Because some lines are rather long, geneset files may not always open correctly in Excel. A geneset file may include header lines that all start with "!". Here is an example of a geneset file:

CITRATE_CYCLE_TCA_CYCLE	CITRATE_CYCLE_TCA_CYCLE	Idh3g	Pdha2	Fh1	Suclg1	Idh2	Pcx	Pdha1	Idh3b	Sucla2	Mdh1	Suclg2 ...
ETHER_LIPID_METABOLISM	ETHER_LIPID_METABOLISM	Pla2g4e	Pla2g7	Pla2g12a	Pla2g4a	Lpcat4	Agps	Pafah2	Pla2g3	Pla2g2f	Ppap2a ...
..........................................................................................................
An alternative acceptable format of geneset files uses comma-separated lists of gene symbols:

CITRATE_CYCLE_TCA_CYCLE	CITRATE_CYCLE_TCA_CYCLE	Idh3g,Pdha2,Fh1,Suclg1,Idh2,Pcx,Pdha1,Idh3b,Sucla2,Mdh1,Suclg2,...
ETHER_LIPID_METABOLISM	ETHER_LIPID_METABOLISM	Pla2g4e,Pla2g7,Pla2g12a,Pla2g4a,Lpcat4,Agps,Pafah2,Pla2g3,Pla2g2f,Ppap2a,...
..........................................................................................................
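
Both geneset formats shown above (tab-separated genes and comma-separated genes) can be read with the same few lines of Python. This is a hedged sketch for checking your own files before upload; the function name parse_geneset_file is hypothetical and the code is not taken from ExAtlas.

```python
def parse_geneset_file(text):
    """Return dict: geneset ID -> (description, list of gene symbols).

    Accepts genes separated by tabs, by commas, or by a mix of both.
    Header lines starting with "!" and blank lines are skipped.
    """
    genesets = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.split("\t")
        set_id = fields[0]
        description = fields[1] if len(fields) > 1 else ""
        genes = []
        for item in fields[2:]:
            # a field may hold one gene or a comma-separated list
            genes.extend(g.strip() for g in item.split(",") if g.strip())
        genesets[set_id] = (description, genes)
    return genesets
```

For example, the line "CITRATE_CYCLE_TCA_CYCLE", tab, "CITRATE_CYCLE_TCA_CYCLE", tab, "Idh3g,Pdha2,Fh1" is parsed into the same gene list as the line where the three genes are separated by tabs.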
Sample files (tab-delimited text) have 4 columns: (1) series ID from GEO, (2) Platform ID, (3) Sample ID, and (4) sample title/name. Samples with identical titles within the same data series are considered as replications. Check title spelling, spaces, and character case, because in the case of mismatch replications will not be recognized. Example:
GSE6290	GPL1261	GSM144590	renal corpuscle
GSE6290	GPL1261	GSM144591	renal corpuscle
GSE6290	GPL1261	GSM144594	Early Proximal Tubule
GSE6290	GPL1261	GSM144595	Early Proximal Tubule
GSE6290	GPL1261	GSM144596	Medullary Collecting Duct
GSE6290	GPL1261	GSM144597	Medullary Collecting Duct
GSE6290	GPL1261	GSM144603	sshaped_body
GSE6290	GPL1261	GSM144604	sshaped_body
GSE6290	GPL1261	GSM144605	sshaped_body
............................................................
An annotation file has at least 3 columns: (1) Probe ID, (2) Gene symbol, and (3) Gene name. Additional columns may show accession number, Entrez, Ensembl, Unigene or other IDs. Do not use multiple gene symbols in the second column! If a probe matches multiple symbols, then select the best symbol for annotation. If you need to show other matching gene symbols, then make multiple copies of the line with this probe ID in the gene expression profile data and modify the probe ID (enter a unique new ID), which will be associated with alternative symbols. An annotation file always has a line with column headers and may include optional header lines that start with "!".

NIA-oligo	Gene symbol	Gene name	GenBank	Entrez
Z00000225-1	Wdr74	WD repeat domain 74	NM_134139.1,NM_134139.1	107071
Z00000233-1	Tro	trophinin	NM_001002272.2,NM_001002272.2	56191
Z00000238-1	Edf1	endothelial differentiation-related factor 1	NM_021519.1,NM_021519.1	59022
Z00000241-1	Pfn1	profilin 1	NM_011072.2,NM_011072.2	18643
Z00000244-1	Rabep1	rabaptin, RAB GTPase binding effector protein 1	AK163126.1,AK163126.1	54189
.........................................................................
Output files may include one or several tab-delimited tables. When you perform any analysis in ExAtlas (correlation, gene enrichment, significant genes, etc.) you can then download the output file to explore its format. Any tab-delimited table with first line of column headers and with the first column as row headers can be uploaded as output file for plotting as a heatmap. No additional formatting is needed.

Lists of genes (official gene symbols) can be uploaded to explore the enrichment of various genesets for functional annotations (e.g., for comparison with GO-terms, KEGG pathways). Genes can be formatted in one column or pasted as comma-separated text. After the list of genes is uploaded, select the geneset file for comparison (e.g., GO_mouse_geneset), specify parameters (FDR and fold enrichment) and click "Enrichment analysis". When the output opens, click on the button "Get profile".

3.15. Edit files

ExAtlas supports minor editing of uploaded files (except platform annotations). If you made a mistake during file upload, you can fix it using the editing tool. In particular, users can rename the file, edit its annotation, or specify a different microarray platform for gene expression profiles. More editing options are available for gene expression profiles and geneset files. In particular, users can select gene expression profiles (e.g., microarray samples) or genesets and either delete them or copy them to another file. If gene expression profiles are copied to an already existing file, then the user can choose to co-normalize data in various ways: (a) by the quantile method, (b) by equalizing global median values for each gene, or (c) by equalizing median values for selected samples within each data set. For example, if two projects have data on gene expression profiles in normal liver, then the user can select all liver samples in each data set and then use option (c). Options (b) and (c) represent a batch-normalization procedure, which is often used for combining heterogeneous data sets. Because batch-normalization generates better results than the quantile method, we suggest not combining different data series from GEO in the "Search GEO database" option, but saving each series separately and later combining them using batch-normalization.

3.16. Practice session

  1. Search on Google for "ExAtlas"; log in as guest (click the button or type "guest" in the login box).
  2. Open expression profile data set: public-GNFv3_mouse_tissues. Set fold change threshold to 4 and click "Make a heatmap". Select "liver" column and assign it to move before "kidney" column; also change "Maximum value" to 2.5. Then click the "Re-plot the matrix" button.
  3. Save heatmap "Matrix file" in your "practice" folder. Open the file in Excel to see genes associated with each tissue.
  4. Close the heatmap and click on "PCA" button (you may select "show replications" option). If you have VRML plug-in, view the PCA in 3D. Click on the positive PC1-associated cluster to see the list of genes. Find over-represented GO-annotations in this list of genes. Use other genesets: KEGG pathways and MGI phenotypes for functional annotations of genes.
  5. Do pair-wise comparison of prefrontal_cerebral_cortex with cerebral_cortex, click on individual genes to see their profiles. Display the list of over-expressed genes and do functional annotations (GO, KEGG).
  6. Search for specific genes (e.g., Acer1, Foxl2, Gata3, Sox2). Find genes with similar expression profiles. Do functional annotations of these genes.
  7. Open expression profile data set: public-NIA_induction_137TFs_mouse. Push button "Correlation" to start correlation analysis of this data set and GNF ver.3 data on expression profiles in mouse tissues/organs. Set fold change threshold 1.5 for TF induction data and 2 for GNF data. Select ES cells as a background for GNF and median expression for TF induction. Start the analysis, check the status of the job in the log file. When the correlation matrix is ready, display it. Download the resulting correlation matrix as text file and open it in Excel.
  8. Push button "Significant genes" to generate a geneset file with significantly up-regulated and down-regulated genes. You may select a certain specificity level (e.g., 4 or 7). When results are ready, estimate the overlap with GO-annotations data.
  9. Click "Find samples in GEO" in the main menu. Type in a search term (e.g., kidney, pancreas, skin, brain cortex) and start selecting samples from various data sets. The idea is to find differences between cell types or developmental stages of each organ. Add other tissues for a background (e.g., from the GNF database). Save samples.
  10. Edit sample names so that replications have identical names (case sensitive). Generate the data file with gene expression profiles. Open the file and check the quality of data. Delete low-quality data. Reanalyze the data and plot the heatmap and PCA.

4. List of terms

ANOVA
is ANalysis Of VAriance, a statistical technique for detecting statistical significance. The major advantage of ANOVA versus a simple t-test is that variances are averaged over all factor levels, thus the statistics become more stable. In ANOVA we calculate the F-statistic, which is then used to estimate the P-value and determine if the variation between means is significant. Testing multiple hypotheses with ANOVA (as in the case of microarray data) requires some modifications in ANOVA: variance averaging and FDR.
Array annotation
is a file with probes (or clones) in the microarray with annotations. The file is a tab-delimited text file with headers in the first row. The following three columns are required: The first column is probe ID (oligo ID), which should match to the gene ID in the data file that you analyze. Gene ID can be either a number or a word. The second column is gene symbol. The third column is gene annotation. The file may have additional columns if necessary (e.g., gene bank accession number, Unigene, Ensembl, Entrez, MGI, etc.). These columns should have headers to be displayed in all tables.
Biplot
was proposed by Gabriel (1971. Biometrika 58: 453-467). This is a method for plotting together rows and columns of the data matrix, which can be used for examining associations between genes (rows) and tissues/experiments (columns). The technique is based on the Singular Value Decomposition (SVD) method.
Web references:
SVD and PCA for microarrays
Biplot and SVD
Clustering
In ExAtlas, three methods of clustering are implemented: (1) hierarchical clustering, (2) "diagonal" clustering, and (3) PCA-based clustering. Hierarchical clustering is applied to genes and/or tissues/samples with a distance matrix and average linking. "Diagonal" clustering is designed for plotting sparse matrices. It attempts to place high values near the diagonal by permutation of rows and columns. PCA-based clustering is done as follows: a gene is associated with a specific principal component (PC) based on the highest correlation, provided that the change of gene expression along the PC (see figure below) is greater than the selected fold change threshold.

Two clusters of genes are identified with each principal component: those that are positively and negatively correlated with PC.
EPFP (Expected Proportion of False Positives)
Expected Proportion of False Positives is applied in ExAtlas to the sets of genes associated with two different properties (e.g., coregulated in different tissues, or being targets of transcription factors and, in addition, activated by these transcription factors). EPFP is inverse to the enrichment ratio as compared to the null hypothesis of no association between the examined properties. It indicates what proportion of false positives to expect in the set of genes which we consider as significantly associated with two different properties.
Error model
is the model of error variance used in ANOVA for determining statistical significance of differential gene expression. The error model attempts to get a better estimate for the true error variance than the error variance estimated from data (we call it 'actual error variance'). In ExAtlas we use the maximum of actual error variance and error variance averaged across 500 genes with similar average expression. This error model was proposed in the NIA Array Analysis software and was shown to reduce the number of false positives.
Error variance
is the variance of replications within groups. It is estimated from the sum of squared differences between data and corresponding group means, divided by the number of degrees of freedom. Error variance can be used directly in ANOVA or indirectly via the error model and variance averaging.
FDR (false discovery rate)
is the proportion of false positives among all genes that we consider significant. FDR can be viewed as an equivalent of a P-value in experiments with multiple hypotheses testing. In microarray experiments we simultaneously test null-hypotheses for all genes. If there are 20000 genes on a chip, then by using P-value=0.05 we will consider 5% of genes significant even if null-hypotheses are true for all genes (i.e., no differential expression). It means that we will get 1000 false positives! This example shows that the P-value is meaningless for multiple hypotheses testing. A possible solution of the problem is to use Bonferroni correction by multiplying the P-value by the total number of genes. This method ensures no false positives with a probability of 95%; however, it is too stringent because we can tolerate some small proportion of false positives. FDR is an intermediate method between the P-value and Bonferroni correction; it is equal to the proportion of false positives among all genes that we consider significant. The equation is
FDR(r) = min over i ≥ r of (pi*N/i),
where r is the rank of a gene ordered by increasing p-values, pi is the p-value for the gene with rank i, and N is the total number of genes tested (Benjamini, Y. & Hochberg, Y., 1995. J Roy Stat Soc B 57: 289-300). The FDR value increases monotonically with increasing p-value (or decreasing t-statistics or F-statistics).
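
The FDR calculation of Benjamini and Hochberg can be sketched in Python as shown below, using the same symbols as in the definition (rank r, p-value pi for the gene with rank i, and N genes). This is an illustration of the standard procedure with a hypothetical function name, not the ExAtlas source code.

```python
def fdr_values(p_values):
    """Benjamini-Hochberg FDR for each p-value (returned in input order).

    For the gene with rank r (ranked by increasing p-value) the FDR is the
    minimum of p_i * N / i over all ranks i >= r, which makes the FDR
    non-decreasing with increasing p-value.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    fdr = [0.0] * n
    running_min = 1.0
    # walk from the largest p-value (rank N) down to the smallest (rank 1)
    for rank in range(n, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * n / rank)
        fdr[k] = running_min
    return fdr


print(fdr_values([0.01, 0.04, 0.03, 0.005]))
```

In this small example the two lowest p-values (0.005 and 0.01) both receive FDR of about 0.02, and the two highest (0.03 and 0.04) both receive FDR of about 0.04. With the Bonferroni correction the same p-values would become 0.02, 0.04, 0.12 and 0.16, which illustrates why Bonferroni is more stringent.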
F-statistics
is a ratio of factor variance to the error variance in ANOVA. F-statistics is then used to estimate the P-value according to theoretical F-distribution. The P-value is then used for determining if the variation between means is significant. If multiple hypotheses are tested, then FDR is estimated from P-values.
Gene expression
is the intensity of transcription (mRNA synthesis from DNA template) in a cell. Gene expression profile is the data on expression of all genes (or majority of genes) in the genome. It is also called "global gene expression profile". Each cell type or tissue has its specific gene expression profile, which is measured either by microarrays or with high-throughput sequencing (RNA-seq).
Microarray
is a slide with numerous probes that represent various genes of some biological species. Probes are either oligonucleotides that range in length from 25 to 60 bases or cDNA clones. The quality of data from cDNA arrays is usually low because cDNAs often include non-specific regions. Thus, cDNA arrays are excluded from ExAtlas search. Microarrays are hybridized with labeled cDNA synthesized from an mRNA sample of some tissue. The intensity of the label (radioactive or fluorescent) of each spot on a microarray indicates the expression of each gene. One-color arrays show the absolute expression level of each gene. Two-color arrays can indicate the relative expression level of the same gene in two samples that are labeled with different colors and mixed before hybridization. One of these samples can be a universal reference, which helps to compare samples that were hybridized on different arrays.
Organism species
ExAtlas supports the analysis of the following 32 species: human, mouse, rat, rhesus monkey, macaque, chimpanzee, dog, sheep, pig, cow, horse, rabbit, chicken, turkey, xenopus frog, zebrafish, rainbow trout, salmon, fruit fly, nematode, thale cress, rice, soybean, tomato, maize, yeast (2 species), salmonella, bacteria (5 species). However, public data sets are currently available only for human, mouse, and rat. Organism species have to be selected from the main menu in ExAtlas before you start any analysis in order to avoid the confusion of combining incompatible data on different species.
Outliers
are data that are suspiciously different from other data from the same experiment. Outliers can be detected using the z-value: z=|x-Mean|/SD, where x is the tested value, Mean is the mean value for the same experiment, and SD is the standard deviation from the mean. In ANOVA, SD is calculated as the square root of the mean square error (MSE). Values with high z-values can be outliers. How to determine what z-value to select for outlier removal? The answer depends on the volume of data. If you analyze 22000 genes with 12 1-color arrays, then you have 264000 numbers. Assuming no real outliers, the highest z-value is expected to be 4.6. To be sure that you remove real outliers you need to select the value z somewhat higher than 4.6, for example z=6 or z=8. If you think the data have problems you may want to remove more outliers by reducing the z-value. If you don't want to remove any outliers, select z=10000. Removing outliers means replacing them with missing values.
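
The value of 4.6 quoted above can be checked with a short calculation. The Python sketch below assumes normally distributed errors and defines the "expected highest z-value" as the threshold that about one value out of the whole data set would exceed by chance (in either direction); this definition and both function names are assumptions made for illustration.

```python
from statistics import NormalDist


def expected_max_z(n_values):
    """z exceeded (in absolute value) by about one out of n_values
    normally distributed numbers."""
    return NormalDist().inv_cdf(1 - 1 / (2 * n_values))


def mask_outliers(values, mean, sd, z_max=6.0):
    """Replace values with z = |x - mean| / sd above z_max by None
    (a missing value)."""
    return [None if abs(x - mean) / sd > z_max else x for x in values]


print(round(expected_max_z(22000 * 12), 1))  # 4.6
```

The threshold grows slowly with the volume of data, which is why thresholds such as z=6 or z=8 remain safe for much larger data sets.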
Overlap analysis
A common way to annotate a set of genes (e.g., significantly upregulated or downregulated) is to compare it with already available annotated gene sets, e.g., Gene Ontology (GO). If the number of common (=overlapping) genes is greater than expected by chance, then a hypergeometric distribution is used to evaluate the significance of gene overlap: z = (q-p)/sqrt[p*(1-p)*(N-n)/(N-1)/n], where z = z-value; p = number of genes in the annotated set, n, divided by the total number of annotated genes, N; and q = number of overlapping genes divided by the number of genes in your initial set. See also section 3.12.
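
For readers who want to reproduce the calculation, the z-value equation above is transcribed literally into Python below. The function name and the argument name n_initial (the number of genes in your initial set) are hypothetical; N, n, p and q follow the definitions in the text, and no claim is made that ExAtlas computes the value in exactly this way.

```python
import math


def overlap_z(n_overlap, n_initial, n, N):
    """z = (q - p) / sqrt(p * (1 - p) * (N - n) / (N - 1) / n)

    n_overlap - number of overlapping genes
    n_initial - number of genes in the initial set (hypothetical name)
    n         - number of genes in the annotated set
    N         - total number of annotated genes
    """
    p = n / N                      # proportion expected by chance
    q = n_overlap / n_initial      # observed proportion
    return (q - p) / math.sqrt(p * (1 - p) * (N - n) / (N - 1) / n)


# 20 of 50 genes fall into an annotated set of 100 genes out of 1000
print(round(overlap_z(20, 50, 100, 1000), 2))
```

In the example, 10% overlap is expected by chance (p = 0.1) while 40% is observed (q = 0.4), so the z-value is strongly positive.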
PCA
Principal Component Analysis (PCA) is a multivariate analysis technique which finds major patterns in data variability. In mathematical terms, it is finding eigenvalues and corresponding eigenvectors (=principal components, PC). Most important are first few principal components that explain most of observed variance; the rest of them are mostly random fluctuations. Thus, by plotting data versus first 2 or 3 PC we can reduce dimensionality of the data without much loss of information. Singular Value Decomposition (SVD) is a more generic method than PCA which identifies eigenvectors both for the rows (=genes) and columns (=tissues) of the data matrix. In fact, both gene-points and tissue-points can be plotted on the same graph using technique called "biplot" which is implemented in our software.
Rankplot (rank-plot)
It is used to show graphically the enrichment of genes that belong to the given geneset among either upregulated or downregulated genes (see Fig. 15B). First, genes are sorted according to their expression change (e.g., after manipulation of a transcription factor), then the proportion of genes from the geneset (e.g., geneset of target genes with binding site[s] of the transcription factor) is estimated in a sliding window (e.g., N = 300-500 genes).
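
The sliding-window calculation behind a rankplot can be sketched as follows. This is a simplified illustration, not the ExAtlas plotting code: the function name is hypothetical, the window moves one gene at a time, and the input is assumed to be a list of gene symbols already sorted by expression change.

```python
def rankplot_proportions(ranked_genes, geneset, window=300):
    """Proportion of geneset members in each sliding window along
    the list of genes sorted by expression change."""
    members = [1 if gene in geneset else 0 for gene in ranked_genes]
    n = len(members)
    if n == 0:
        return []
    window = min(window, n)
    count = sum(members[:window])
    proportions = [count / window]
    for start in range(1, n - window + 1):
        # add the gene entering the window, drop the gene leaving it
        count += members[start + window - 1] - members[start - 1]
        proportions.append(count / window)
    return proportions


print(rankplot_proportions(["a", "b", "c", "d"], {"a", "b"}, window=2))
# [1.0, 0.5, 0.0]
```

Plotting these proportions against gene rank gives a curve that is elevated at one or both ends of the ranked list when the geneset is enriched among upregulated and/or downregulated genes.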
Replication
is an independent repeat of an experiment. Biological replicates should be truly independent. For example, shRNA experiments should use different shRNA sequences as replications. Transgenic clones should be derived independently and used as replications. In practice it may be difficult to achieve absolute independence of replicates, but it is very important to reduce dependency between replicates to a minimum. For example, it is better to take samples from different animals than from the same animal, unless you are interested in a particular animal. If sample preparation requires multiple steps, it is best if samples are separated from the very beginning, rather than from some intermediate step.
Specificity of genes
Gene specificity is characterized by a z-value, which is estimated by comparing log-expression in a given tissue (mi) with the average log-expression in other tissues (M) that are not correlated with tissue i: z = |mi - M| / SD, where SD is the standard deviation of gene expression in those other tissues (the same tissues used for estimating M). A tissue is considered correlated with the given tissue i if its multi-dimensional distance to tissue i is < 1/3*(maximum distance between tissues). Low specificity corresponds to z-values below 3. High specificity corresponds to z-values above 6.
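A small sketch of this calculation for one gene is shown below. It assumes that a matrix of distances between tissues has already been computed (the text does not say which distance measure is used) and that SD is the sample standard deviation; both are assumptions made for illustration, and the function name is hypothetical.

```python
import statistics


def specificity_z(log_expression, distances, i):
    """Specificity z-value of one gene in tissue i.

    log_expression: log-expression of the gene in each tissue
    distances: square matrix of distances between tissues (assumed given)
    """
    max_distance = max(max(row) for row in distances)
    threshold = max_distance / 3.0
    # Keep only tissues that are NOT correlated with tissue i,
    # i.e., whose distance to tissue i is at least 1/3 of the maximum.
    others = [log_expression[j] for j in range(len(log_expression))
              if j != i and distances[i][j] >= threshold]
    if len(others) < 2:
        raise ValueError("need at least two uncorrelated tissues")
    mean_other = statistics.mean(others)
    sd_other = statistics.stdev(others)  # sample SD (an assumption)
    if sd_other == 0:
        raise ValueError("expression does not vary in the other tissues")
    return abs(log_expression[i] - mean_other) / sd_other
```

With the thresholds given above, a returned value below 3 would be read as low specificity and a value above 6 as high specificity.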
Statistical significance
means rejection of a null hypothesis, H0, that two samples have the same probability distribution. H0 is tested using some statistic (e.g., t or F); if its value appears in the tail of the theoretical probability distribution for this statistic, and hence the probability of observing such a value under H0 drops below some threshold (usually P=0.05), then we consider the difference between the 2 samples significant. This does not guarantee that H0 was indeed false. A case where H0 is true but we consider the difference between means statistically significant is called a "false positive". If we did not detect significant differences but H0 was false, then it is called a "false negative". When multiple hypotheses are tested, the meaning of statistical significance becomes more complicated (see FDR).
Universal reference
is a mixture of cDNA that represents (almost) all genes of a species, with standardized relative abundance. Universal reference is synthesized from mRNA of various tissues. Universal reference can be used as a second sample for hybridization on 2-color microarrays. Then all other samples become comparable via the universal reference.
Variance averaging
is averaging the error variance for genes with similar average expression level (=intensity). Variance averaging is a method for stabilizing t- or F-statistics in microarray experiments with a small number of replications. Error variance often depends on the average intensity of genes (usually it increases as intensity decreases). Thus, variance should be averaged only for genes with similar intensity. First, genes are sorted according to their average intensity, and then the average error variance is estimated in a sliding window of 500 or 1000 genes. We do not recommend reducing the size of the sliding window below 500. Some genes may have unusually high error variance because of outlier values. To avoid the effect of these genes on the averaged error variance, it is better to remove the top 1% or 5% of error variance values before averaging. Average error variance can be used in ANOVA instead of the actual error variance, or it can be combined with the actual error variance according to the error model.
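The procedure can be sketched as follows. The text does not specify whether the highest variances are removed globally or within each window, so this sketch assumes trimming within each window; the function and parameter names are hypothetical, and the code is not the ExAtlas implementation.

```python
def averaged_error_variance(mean_intensity, error_variance,
                            window=500, trim_fraction=0.01):
    """Average error variance over genes with similar mean intensity.

    mean_intensity, error_variance: one value per gene, in the same order.
    Returns one smoothed variance per gene, in the original gene order.
    """
    n_genes = len(mean_intensity)
    # Sort gene indices by average intensity.
    order = sorted(range(n_genes), key=lambda g: mean_intensity[g])
    sorted_variance = [error_variance[g] for g in order]
    window = min(window, n_genes)
    half = window // 2
    smoothed = [0.0] * n_genes
    for rank, gene in enumerate(order):
        # Window centered on the gene's rank, clipped at both ends.
        start = min(max(rank - half, 0), n_genes - window)
        values = sorted(sorted_variance[start:start + window])
        # Drop the highest variances (assumed: trimming within the window).
        keep = max(len(values) - int(len(values) * trim_fraction), 1)
        smoothed[gene] = sum(values[:keep]) / keep
    return smoothed
```

Re-sorting every window keeps the sketch short but is slow for large arrays; a production version would update the window incrementally.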
VRML
stands for Virtual Reality Modeling Language. It is an object-oriented language for describing 3D objects. To view the image you need a VRML viewer (e.g., FreeWRL or Cortona3D). Web resources: Floppy's Web 3D, Web 3D Consortium.
Z-value
Z-value is a deviation from the mean in the standard normal distribution. It is the same as the t-statistic if the number of degrees of freedom is sufficiently large. P-values can be estimated from z-values as follows: p = 2*(1 - cnd(|z|)), where cnd = cumulative normal distribution. Then p-values can be used to estimate FDR.
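The conversion from z-value to two-sided p-value can be written with the Python standard library alone. This is a minimal sketch of the formula above; the function names are chosen for illustration.

```python
import math


def cnd(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided_p(z):
    """Two-sided p-value for a z-value: p = 2*(1 - cnd(|z|))."""
    return 2.0 * (1.0 - cnd(abs(z)))
```

For very large |z| the subtraction 1 - cnd(|z|) loses numerical precision; the mathematically equivalent expression math.erfc(abs(z) / math.sqrt(2.0)) is more accurate in the far tail.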

5. Disclaimer: terms of use

This software is provided "AS IS". NIA makes no warranties, expressed or implied, including no representation or warranty with respect to the performance of the software and derivatives or their safety, effectiveness, or commercial viability. NIA does not warrant the merchantability or fitness of the software and derivatives for any particular purpose, or that they may be exploited without infringing the copyrights, patent rights or property rights of others. NIA shall not be liable for any claim, demand or action for any loss, harm, illness or other damage or injury arising from access to or use of the software or associated information, including without limitation any direct, indirect, incidental, exemplary, special or consequential damages.

Feedback
Contact Alexei Sharov sharoval@mail.nih.gov if you have problems with ExAtlas or suggestions for improvement.