Lymphomas originate in lymphatic cells of the lymphoid system. The main types of cells are T lymphocytes and B lymphocytes (see Figure 12.1). The lymphomas are divided into Hodgkin lymphoma and non-Hodgkin lymphoma. This division is for historical reasons; Hodgkin lymphoma is caused by an abnormal B lymphocyte that is easily recognized under a microscope. These cells are called Reed-Sternberg cells. There are several subtypes of Hodgkin lymphoma, but they are all malignant.
All other lymphomas are referred to as non-Hodgkin lymphomas. A large number of different non-Hodgkin lymphomas exist. The most common types are diffuse large B-cell lymphoma (DLBCL), which constitutes about 31% of all lymphomas, and follicular lymphoma (FL), which constitutes about 22% of all lymphomas. The two related diseases, chronic lymphocytic leukemia (CLL) and small lymphocytic lymphoma (SLL), together account for 7% of all lymphomas.
DLBCL is curable in less than 50% of patients. Lymphoma is diagnosed by histopathological examination of the cells from a biopsy.
Immunohistochemistry may be required to distinguish between the individual types of non-Hodgkin lymphoma. Non-Hodgkin lymphomas are staged according to the Ann Arbor Staging System, which divides the lymphomas into four stages according to how much they have spread in the body.
The Ann Arbor Staging System describes the spreading of the disease in stages I–IV. An International Prognostic Index has been developed that takes into account five clinical observations: age, stage, spreading, performance status, and serum lactate dehydrogenase levels. It adds one point for each of the five poor prognostic factors: a total of 0–1 means low, 2–3 means medium, and 4–5 means high risk.
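The point counting behind the International Prognostic Index can be illustrated with a short sketch. It only encodes what is stated above: one point per poor prognostic factor and the three risk bands. The cutoffs that decide whether an individual factor counts as poor are not given here, so the sketch takes them as already evaluated true/false flags; the function names and dictionary keys are hypothetical.

```python
# Hypothetical helper names; the five factors are passed in as booleans
# because the clinical cutoffs for each factor are not specified in the text.
FACTORS = ("age", "stage", "spreading", "performance_status", "serum_ldh")


def ipi_score(poor_factors):
    """Return the number of poor prognostic factors present (0 to 5)."""
    missing = [name for name in FACTORS if name not in poor_factors]
    if missing:
        raise ValueError(f"missing factors: {missing}")
    return sum(1 for name in FACTORS if poor_factors[name])


def ipi_risk_group(score):
    """Map a score to the risk band: 0-1 low, 2-3 medium, 4-5 high."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if score <= 1:
        return "low"
    if score <= 3:
        return "medium"
    return "high"


# Toy patient with three poor prognostic factors.
patient = {
    "age": True,
    "stage": True,
    "spreading": False,
    "performance_status": False,
    "serum_ldh": True,
}
score = ipi_score(patient)
group = ipi_risk_group(score)
```

For this toy patient three factors are flagged, which places the patient in the medium risk band.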
The treatment of non-Hodgkin lymphoma follows the standard cancer therapies like radiation therapy, chemotherapy, immunotherapy, and bone marrow transplantation.
Cancer Diagnostics with DNA Microarrays, by Steen Knudsen. Copyright © 2006 John Wiley & Sons, Inc.
13.1 MICROARRAY STUDIES OF LYMPHOMA
13.1.1 The Stanford Group
Alizadeh et al. (2000) published a Nature paper on microarray analysis of the three most common types of lymphoma. A special Lymphochip was designed, containing 12,069 genes from a germinal B-cell library, additional genes from libraries created from specific lymphomas, and genes known to be involved in cancer. In total, 17,856 cDNA clones were included on the Lymphochip. This chip was applied to 96 normal and malignant samples including the three most common types of non-Hodgkin lymphomas: diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and chronic lymphocytic leukemia (CLL). Hierarchical clustering revealed two distinct subtypes of DLBCL, hitherto unrecognized in clinical practice. One subtype was referred to as germinal center B-like DLBCL, the other activated B-like DLBCL. Patients with the former subtype had a significantly better overall survival than patients with the latter subtype.
In 2001 the same group published results with a reduced set of six genes measured by quantitative real-time PCR (Lossos et al., 2001). The performance of the six genes for classification was validated in a test set and was found to be independent of the International Prognostic Index and added to its power.
13.1.2 The Boston Group
Shipp et al. (2002) published a Nature Medicine paper detailing the study of 58 DLBCL and 19 FL patients using the Affymetrix HuGeneFL GeneChip. They built a weighted-voting classifier to classify outcome. In cross-validation tests the classifier produced two groups of patients with very different five-year overall survival rates (70% versus 12%). The performance was distinct from the International Prognostic Index. They compared their findings to the Alizadeh results. Hierarchical clustering based on some of the same genes that Alizadeh had used on their Lymphochip resulted in a similar division into germinal center B-like and activated B-like clusters. These clusters, however, did not have any significant difference in outcome, suggesting that the association between clusters and outcome in the Alizadeh cohort may have been indirect or incidental. Going the other way, however, using the classifier built on the Shipp data to predict outcome in the Alizadeh data was more successful.
13.1.3 The NIH Group
Rosenwald et al. (2003a) published a large study of 240 DLBCL patients studied with the Lymphochip. In addition to the previously identified subgroups of DLBCL, they discovered a third group and called it type 3 DLBCL. Seventeen genes were selected for a classifier of outcome that was shown to be independent of the International Prognostic Index.
In 2003 the group published an update (Wright et al., 2003), where they used a Bayes rule predictor of membership of one of the two major DLBCL subgroups. They applied this predictor to the Boston group dataset and found that, with this subgrouping, the two clusters had significant difference in outcome.
13.1.4 The NCI Group
Dave et al. (2004) published a study of 191 follicular lymphoma patients using Affymetrix HG-U133A arrays. An expression pattern associated with the length of survival was determined in a training set of 95 specimens. A molecular predictor of survival was constructed from these genes and validated in an independent test set of 96 specimens.
The predictor was later criticized by Robert Tibshirani (2005) as being a fragile result. He was not able to find any association between gene expression and survival in the dataset using standard methods.
13.2 META-CLASSIFICATION OF LYMPHOMA
In the review of breast cancer studies, a meta-classifier based on principal components was shown. Joining datasets via their principal components is a supervised process: the class labels of the samples are used both for gene selection and for adjusting the signs of the components. It is possible, however, to extract and join components in a completely unsupervised manner, independent of the class relationships of the samples. Independent component analysis (ICA) is very well suited for this. The main difference between ICA and PCA is that ICA extracts statistically independent components that are non-Gaussian, whereas components extracted by PCA can be pure noise (Gaussian). One advantage of ICA in this context is that it is possible to skip the gene selection step that is often used before PCA to reduce noise or uninteresting variation. Instead we can just eliminate genes that have no variation at all, either because they are not expressed in any of the samples, or because their expression is constant.
Independent component analysis is computationally much more complex than PCA: it uses iteration from a random starting point to arrive at the final components.
As an example, an ICA-based MetaClassifier was built for predicting how well lymphoma patients respond to chemotherapy.
Three studies have made their raw data available: NIH, Stanford, and Boston. All datasets were normalized with logit. For each study the 500 genes with the maximum variance were extracted and 7 independent components were extracted using the R implementation of fastICA (Hyvarinen, 1999), without row normalization. The performance is not significantly influenced by whether 500 or 1000 genes are used or whether 5 or 10 components are extracted.
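The variance-based gene selection that precedes the ICA step can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the expression values are already normalized and stored in a dictionary keyed by gene identifier, and the function name, gene names, and values are hypothetical. Genes with no variation at all are dropped, and the remaining genes are ranked by variance.

```python
from statistics import pvariance


def top_variance_genes(expression, n_genes=500):
    """Return up to n_genes gene IDs ranked by decreasing variance.

    expression maps a gene ID to its list of normalized values,
    one value per sample. Genes with zero variance are excluded.
    """
    variances = {gene: pvariance(values) for gene, values in expression.items()}
    varying = [gene for gene, var in variances.items() if var > 0]
    varying.sort(key=lambda gene: variances[gene], reverse=True)
    return varying[:n_genes]


# Toy data: four genes measured in three samples (made-up values).
expression = {
    "geneA": [1.0, 1.0, 1.0],   # constant expression, dropped
    "geneB": [0.0, 2.0, 4.0],
    "geneC": [1.0, 1.5, 2.0],
    "geneD": [0.0, 5.0, 10.0],
}
selected = top_variance_genes(expression, n_genes=2)
```

In this toy case the two genes with the largest variance, geneD and geneB, are kept. The reduced matrix would then be handed to an ICA implementation such as fastICA.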
For each sample, the projections on the 7 components were extracted from the estimated mixing matrix A:
X = SA,
where X is the transposed expression matrix of genes versus samples, S contains the independent components, and A is a linear mixing matrix. We assume that independent components correspond to fundamental biological processes or pathways, and that each component describes all the genes that participate in one such process or pathway. The advantage is that not all genes from that pathway or process are necessary to measure the activity of the component, making it possible to extract components from different array platforms with different subsets of genes and to compare them afterwards.
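The dimensions of the decomposition can be written out explicitly. The orientation is not spelled out in the text, so the following is one reading, stated as an assumption: X is arranged with g genes as rows and n samples as columns, and k components are extracted (k = 7 in the example). The symbols g, n, k and the element indices are introduced here only for illustration.

```latex
X_{g \times n} = S_{g \times k}\, A_{k \times n},
\qquad
x_{ij} = \sum_{c=1}^{k} s_{ic}\, a_{cj}
```

Under this reading, column c of S holds the gene weights of component c, and column j of A holds the k projections of sample j, which is consistent with reading the sample projections from the mixing matrix.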
13.2.1 Matching of Components
The order and sign of the independent components are arbitrary and have to be matched across the three datasets. We perform this matching based on the above assumption that each component describes the genes that participate in some biological process or pathway, or describes genes that are coordinately expressed. Thus, despite the severe limitations in matching genes between platforms, we should be able to identify similar components because they have more genes in common than dissimilar components. The genes for each platform are converted to their RefSeq IDs and matched to the genes on the other platforms based on this. Now we are able to calculate a correlation coefficient between components, because each component is merely a weighted sum of genes. We calculate the Pearson correlation coefficient between the weights for those genes that have been matched across platforms using RefSeq. Those components that have the highest correlation coefficient are matched. The sign of one component is adjusted if its highest correlation coefficient to another component is negative. Components that cannot be matched to components extracted from other platforms with a correlation coefficient higher than 0.3 or below -0.3 are discarded.
It is important to realize that while the many false negatives and false positives obtained when matching genes across platforms make it difficult to build a gene-based classifier, they do not prevent us from matching components, because all we need is a difference in correlation coefficient. This difference can be detected even in the presence of false negatives and false positives.
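The matching procedure can be sketched in code. This is a simplified illustration under stated assumptions, not the implementation used for the MetaClassifier: each component is represented as a dictionary from RefSeq ID to gene weight, one dataset is treated as the reference, "highest correlation" is taken to mean largest in absolute value, and a minimum of three shared genes is required before a correlation is computed. The function names, the min_shared parameter, and the gene identifiers are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def match_components(reference, other, threshold=0.3, min_shared=3):
    """Match each reference component to its best partner in `other`.

    Returns a list of (ref_index, other_index, r, adjusted) tuples, where
    `adjusted` is the partner with its sign flipped if r is negative.
    Reference components whose best |r| does not exceed the threshold
    are discarded.
    """
    matches = []
    for i, ref in enumerate(reference):
        best = None
        for j, comp in enumerate(other):
            shared = sorted(set(ref) & set(comp))  # genes matched via RefSeq
            if len(shared) < min_shared:
                continue
            r = pearson([ref[g] for g in shared], [comp[g] for g in shared])
            if best is None or abs(r) > abs(best[1]):
                best = (j, r)
        if best is None or abs(best[1]) <= threshold:
            continue  # no sufficiently correlated partner: discard
        j, r = best
        sign = -1.0 if r < 0 else 1.0
        adjusted = {gene: sign * weight for gene, weight in other[j].items()}
        matches.append((i, j, r, adjusted))
    return matches


# Toy components on two platforms (made-up IDs and weights).
reference = [
    {"NM_1": 1.0, "NM_2": 2.0, "NM_3": 3.0, "NM_4": 4.0},
    {"NM_1": 1.0, "NM_2": -1.0, "NM_3": -1.0, "NM_4": 1.0},
]
other = [
    {"NM_1": -1.1, "NM_2": -2.0, "NM_3": -2.9, "NM_4": -4.2, "NM_9": 0.5},
    {"NM_1": 0.2, "NM_2": 0.1, "NM_3": 0.3, "NM_4": 0.2},
]
matches = match_components(reference, other)
```

In the toy data the first reference component is strongly anticorrelated with the first component of the other platform, so the two are matched and the partner's sign is flipped. The second reference component has no partner beyond the threshold and is discarded. A fuller version would also prevent two reference components from claiming the same partner.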
Figure 13.1 shows two such independent components from the three d