Gene expression data sets

We consider three gene expression data sets that are provided by the Broad Institute as part of its "Cancer Program Data Sets" and that were analyzed in (Hoshida et al. 2007). There, the samples were clustered using additional data sets, and the resulting clusters were then verified by gene set enrichment analysis. Our goal is to re-identify these clusters without the additional data sets.

The datasets can be downloaded via:

The following data sets have been downloaded:

Data sets and results:

Description of the data sets

The first data set is from (Van't Veer et al. 2002), where the goal was to find a gene signature that predicts the outcome of a cancer therapy. We removed array S54 because it is an outlier. The resulting data set has 97 samples and 1213 probe sets, with a skewness of 0.45 and an excess kurtosis of 0.93 after standardization. In (Hoshida et al. 2007), 3 subclasses were found. These classes are biologically meaningful: 50 of the 61 cases in classes 1 and 2 were estrogen receptor positive, compared with only 3 of the 36 cases in class 3.
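To make the estrogen receptor counts above concrete, the following sketch computes the two proportions (about 82% versus about 8%) and a one-sided Fisher-style tail probability from the hypergeometric distribution. The source does not report such a test; the choice of test and all names here (`hypergeom_upper_tail`, `frac_12`, `frac_3`) are assumptions made for illustration only.

```python
from math import comb

# Counts taken from the text: ER-positive cases per group of classes.
er_pos_12, n_12 = 50, 61   # classes 1 and 2 combined
er_pos_3, n_3 = 3, 36      # class 3

frac_12 = er_pos_12 / n_12
frac_3 = er_pos_3 / n_3


def hypergeom_upper_tail(k_obs, n_draw, n_success, n_total):
    """Return P(X >= k_obs) for a hypergeometric X.

    X counts successes among n_draw items drawn without replacement
    from n_total items, of which n_success are successes.
    """
    k_max = min(n_draw, n_success)
    denom = comb(n_total, n_draw)
    num = sum(
        comb(n_success, k) * comb(n_total - n_success, n_draw - k)
        for k in range(k_obs, k_max + 1)
    )
    return num / denom


# Illustrative check (not part of the original analysis): how likely is it
# to see at least 50 ER-positive cases among the 61 samples of classes 1
# and 2 if ER status were unrelated to class membership?
n_total = n_12 + n_3                 # 97 samples
n_er_pos = er_pos_12 + er_pos_3      # 53 ER-positive samples
p_value = hypergeom_upper_tail(er_pos_12, n_12, n_er_pos, n_total)

print(f"ER+ fraction, classes 1 and 2: {frac_12:.2f}")
print(f"ER+ fraction, class 3:         {frac_3:.2f}")
print(f"one-sided tail probability:    {p_value:.3g}")
```

A very small tail probability would be consistent with the statement that the subclasses track estrogen receptor status, but it is only a sanity check on the quoted counts.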

The second data set is from (Su et al. 2002), where gene expression from human and mouse samples across a diverse set of tissues, organs, and cell lines was profiled. The goal was to provide a reference for the normal mammalian transcriptome. The data set contains 102 samples and 5565 probe sets, with a skewness of 0.15 and an excess kurtosis of 1.3 after standardization. Our goal is to re-identify the tissue types.

The third data set is from (Rosenwald et al. 2002) and consists of 180 samples and 661 probe sets, with a skewness of -0.05 and an excess kurtosis of 0.35 after standardization. The goal was to predict survival after chemotherapy. In (Hoshida et al. 2007), 3 classes were found: "OxPhos" (oxidative phosphorylation), "BCR" (B-cell response), and "HR" (host response). Our goal is to identify these subclasses directly by biclustering.
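The skewness and excess kurtosis quoted for the three data sets can be computed along the following lines. This is a minimal sketch under stated assumptions: the text does not say how the data were standardized, so we assume each probe set (row) is scaled to zero mean and unit variance and that the moments are then pooled over all entries. The function names and the toy Gaussian matrix are hypothetical and stand in for the real expression data.

```python
import math
import random


def standardize_rows(matrix):
    """Scale each row (probe set) to mean 0 and variance 1.

    Assumes no row is constant; a constant row would give a zero
    standard deviation and must be filtered out beforehand.
    """
    out = []
    for row in matrix:
        n = len(row)
        mean = sum(row) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
        out.append([(x - mean) / sd for x in row])
    return out


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments.

    skewness        = m3 / m2**1.5
    excess kurtosis = m4 / m2**2 - 3   (0 for a Gaussian)
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Toy stand-in for an expression matrix: 50 probe sets x 200 samples.
random.seed(0)
toy = [[random.gauss(0.0, 1.0) for _ in range(200)] for _ in range(50)]

# Pool all standardized entries and compute the two moments.
flat = [x for row in standardize_rows(toy) for x in row]
skew, exkurt = skewness_and_excess_kurtosis(flat)
print(f"skewness: {skew:.3f}, excess kurtosis: {exkurt:.3f}")
```

For Gaussian toy data both values should be close to zero. Positive excess kurtosis, as reported for all three data sets, indicates heavier tails than a Gaussian.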