An Overview of Mathematical Models for RNA Sequence-based Glioblastoma Subclassification

Molecular marker-based glioblastoma (GBM) subclassification is emerging as a key factor in personalized GBM treatment planning. Multiple genetic alterations, including methylation status and mutations, have been proposed for GBM subclassification. RNA-Sequence (RNA-Seq)-based molecular profiling of GBM is widely implemented and readily quantifiable. Machine learning (ML) algorithms have been reported as applicable methods that can consistently subgroup GBM. In this study, we systematically evaluated the applicability of commonly used ML algorithms trained on The Cancer Genome Atlas Glioblastoma Multiforme (TCGA-GBM) dataset and cross-validated on the Chinese Glioma Genome Atlas (CGGA) dataset. The ML algorithms studied include binomial and multinomial logistic regression, linear discriminant analysis, decision trees, k-nearest neighbors, Gaussian naive Bayes, support vector machines, random forests, gradient boosting, multi-layer perceptrons, and voting ensembles. We found support vector machines, multi-layer perceptrons, and voting ensembles best equipped to assign GBM cases to the correct molecular subgroups without histological studies. © 2021 aioncology.org. All rights reserved.

Introduction
A study carried out by Verhaak et al. (2010) provided evidence of four distinct and clinically relevant subtypes of glioblastoma multiforme (GBM) distinguishable by genomic abnormalities [1]. These include a Mesenchymal subtype defined by NF1 mutation, a Classical subtype with notable EGFR abnormalities, a Proneural subtype typified by distinct PDGFRA and IDH1 events, and a Neural subtype defined by abnormal expression of neuron markers such as NEFL [1,2]. Each of these also exhibited unique alterations and mutations in numerous genetic markers [1,2]. More recent studies proposed that the Neural subtype arose from contamination by nonmalignant samples, and GBM is therefore now grouped into three subtypes: Mesenchymal, Classical, and Proneural [3].
Emerging studies indicate the clinical significance of these subclassifications in predicting treatment effectiveness [2,4-6]. Cases of each of the three subtypes respond uniquely to therapies, which may guide personalized treatment [6]. The Classical and Proneural subtypes correspond with better therapeutic response than the Mesenchymal subtype, which often exhibits more aggressive features than the other subtypes [1,7]. Meanwhile, the Classical and Mesenchymal subtypes may benefit from more intensive treatment regimens (concurrent chemo-/radio-therapy or more than 4 cycles of chemotherapy) [1]. Given this, reliable identification of GBM subtypes is of exceptional importance for personalized GBM treatment. Complex genetic markers, such as mutations and methylation, that identify individual subtypes with high accuracy have been proposed [8]. However, identifying patients likely to benefit from intensive treatment for improved overall survival (OS) rates constitutes a more straightforward and clinically significant binary classification problem. Leveraging genetic expression levels to predict GBM subtype or likely response to intensive treatment can be accomplished by employing one of many available machine learning (ML) classification algorithms. Potentially applicable models include probabilistic models, tree-based classifiers, connectionist models, ensemble models, and more. Existing studies highlight the potential value of employing ML models with RNA-Seq data to solve cancer classification tasks, and suggest the relative effectiveness of support vector machines (SVMs) and ensemble models such as random forests (RFs) in some contexts [9,10]. However, further research also illustrates the value of gradient boosting and connectionist models in solving problems with similar quantitative data [5,11].
GBM subtype classification with gradient boosting classifiers and radiomics data has also yielded encouraging results, suggesting both the potential applicability of boosting models and the importance of bolstering systematic subtype classification efforts [12]. To identify the most appropriate models for GBM subtype classification, we evaluated the performance of these approaches as well as numerous other traditionally employed model types for comparison.
This study aims to make advances toward the goal of reliably labelling subtypes based on continuous genetic data while comparing the applicability of various options in this context. To achieve this goal, numerous popular classification models are trained and evaluated with metrics that account for class imbalances. Cross-validation is then performed across datasets to verify the consistency of each model. Consequently, this study constitutes an overview and experimental comparison of mathematical algorithms for GBM subclassification based on RNA-sequence (RNA-Seq) of key genetic markers.

Public datasets
To train and experimentally compare models prior to cross-validation, The Cancer Genome Atlas Glioblastoma Multiforme (TCGA-GBM) dataset has been used. For cross-validation, models have been tested on GBM data from the Chinese Glioma Genome Atlas (CGGA) dataset [13]. Both are available through the GlioVis platform (http://gliovis.bioinfo.cnio.es/, last accessed April 10th, 2021). Basic information on the patients from the TCGA-GBM and CGGA datasets included in the study is presented in Tables 1 and 2. While each set includes numerous gene expression level features and labeled subtypes, the labels in the CGGA set describe only the Mesenchymal, Classical, and Proneural subtypes. The TCGA set has two separate label features, one describing all four subtypes and one describing only these three subtypes.
Classification was performed on the feature lacking the Neural subclassification.

Patients included in the study
The TCGA dataset includes cases of patients with the following subtype, gender, IDH mutation, and age distributions. Cases labeled Neural when all subclassifications are present are split as shown in Figure 1A, B. The patterns of abnormalities in EGFR, NF1, PDGFRA, and IDH1 expression levels described by Verhaak et al. hold true in the absence of the Neural subtype: Classical cases show high EGFR levels, Mesenchymal cases show low NF1 levels, and Proneural cases show distinctive PDGFRA/IDH1 levels. The CGGA dataset used for cross-validation includes the same expression level features alongside the following distribution (Table 3).

Biomarkers selection for GBM subclassification
Verhaak et al. highlight differences in gene expression of EGFR, NF1, PDGFRA, and IDH1, but also describe mutations in TP53, RB1, PIK3R1, ERBB2, and numerous other potential biomarkers [1]. To improve model performance, a total of up to 44 biomarkers are employed in training and predicting subtypes. A heatmap of Spearman's rank correlation coefficients for each biomarker by subtype is shown in Figure 2.
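The per-biomarker correlation analysis above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the cohort, the expression values, and the helper `subtype_correlations` are all invented here, and only the one-vs-rest Spearman computation itself reflects the method described.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 120
# Synthetic cohort: EGFR expression is shifted upward in "Classical" cases.
subtype = rng.choice(["Classical", "Mesenchymal", "Proneural"], size=n)
expr = pd.DataFrame({
    "EGFR": rng.normal(0, 1, n) + 2.0 * (subtype == "Classical"),
    "NF1": rng.normal(0, 1, n),
})

def subtype_correlations(expr, subtype, target):
    """Spearman's rho of each biomarker against a one-vs-rest subtype indicator."""
    indicator = (subtype == target).astype(int)
    return {gene: spearmanr(expr[gene], indicator)[0] for gene in expr.columns}

# Correlations of every biomarker with membership in the Classical subtype;
# a heatmap would stack one such row per subtype.
rhos = subtype_correlations(expr, subtype, "Classical")
```

Computing one such dictionary per subtype and stacking the rows yields the matrix behind a heatmap like Figure 2.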

Algorithms for GBM subclassification
The models considered and compared in this study are available from the scikit-learn package [14,15]. While hyperparameter selection varied for each model type, 5-fold cross-validation within the TCGA dataset has been performed to select values when not otherwise specified. Included here are brief summaries of each model type. Data is scaled or standardized as required for models that need predictors on comparable scales or that assume predictors follow a normal distribution. Model performances are then evaluated by calculating macro-averaged F-scores to account for the subtype representation imbalances exhibited by both the TCGA and CGGA datasets. Average accuracy, precision, and recall are also calculated and compared for a more holistic understanding of model performance.
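The macro-averaged F-score used throughout weights each subtype equally, so minority classes count as much as the majority class. A minimal sketch with invented labels (not the study's predictions):

```python
from sklearn.metrics import f1_score, accuracy_score

# Toy labels only, to show the metric: one Proneural case is misread as
# Classical and one as Mesenchymal, mirroring the error pattern reported later.
y_true = ["Classical", "Classical", "Mesenchymal", "Proneural", "Proneural", "Proneural"]
y_pred = ["Classical", "Classical", "Mesenchymal", "Classical", "Proneural", "Mesenchymal"]

# Macro averaging: compute F1 per subtype, then take the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)
```

Here accuracy is 4/6 ≈ 0.667, while the macro F-score (≈ 0.656) is pulled down by the poorly recalled Proneural class, which is exactly why it is preferred under class imbalance.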

Binomial and multinomial logistic regression (LR).
LRs are statistical classification models that leverage a logistic representation of data to predict the probability of cases belonging to various classes [16]. Performing logistic regression involves optimizing a cost function and requires regularization, as excluding unlikely values can combat overfitting that would otherwise deteriorate performance in the context of high-dimensional data. The sklearn model used employs L2 regularization.

Linear discriminant analysis (LDA).
LDA also employs probabilistic models [17]. New data is classified by generating linear boundaries between inputs and determining the probability of a novel vector belonging in each defined region. In the case of scikit-learn's implementation, probabilities are calculated under the assumption that prediction data conforms to a Gaussian distribution.

Decision trees (DTs).
DTs learn by splitting data into subsets based on the features that result in the highest class purity following each split [18]. Even after selecting preferable stopping criteria and pruning methods designed to minimize overfitting, single decision trees are often referred to as weak learners due to poor performance. However, they are fast, easily interpretable, and can be leveraged effectively in ensemble models; both the random forest and gradient boosting models tested in this study involve classification by majority vote across numerous distinct decision trees. An example built with a max depth of 3 using Gini impurity as a metric splits on FGFR3, ASCL1, SNCG, CASP5, ERBB2, and ERBB3; dividing training data on these features results in leaves with the highest subclass purities (Figure 3).
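The three model families above can be instantiated in scikit-learn as sketched below. This is not the authors' exact configuration; the synthetic data and specific hyperparameter values here are illustrative, with only the L2 penalty for LR and the depth-3 Gini tree taken from the text.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 44))    # 44 biomarker expression features, as in this study
y = rng.integers(0, 3, size=90)  # 3 subtype labels (synthetic)

models = {
    # LR with L2 regularization; expression levels are standardized first.
    "LR": make_pipeline(StandardScaler(),
                        LogisticRegression(penalty="l2", max_iter=1000)),
    # LDA assumes Gaussian class-conditional densities.
    "LDA": LinearDiscriminantAnalysis(),
    # A single shallow tree using Gini impurity, as in Figure 3.
    "DT": DecisionTreeClassifier(max_depth=3, criterion="gini"),
}
for name, model in models.items():
    model.fit(X, y)
preds = {name: m.predict(X[:5]) for name, m in models.items()}
```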

K-Nearest Neighbors (KNN).
KNN is a simple model that calculates distances between a given vector and all training points [19]. The majority class among a given number of closest points is returned as the prediction. While efficient to train, KNN can struggle with high-dimensional data. K-d trees are employed in this context.
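A minimal sketch of KNN with a k-d tree index, on toy one-dimensional data rather than the study's biomarkers:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters of toy "expression" values.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# algorithm="kd_tree" selects the k-d tree neighbor search mentioned above.
knn = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
knn.fit(X, y)
```

A query near a cluster takes the majority class of its 3 nearest neighbors, so `knn.predict([[0.05]])` returns class 0 and `knn.predict([[5.05]])` returns class 1.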

Gaussian Naive Bayes (GNB).
GNB classifiers are probabilistic models. Although they assume the conditional independence of predictors, they can sometimes perform surprisingly well even when this assumption is false [20]. To make predictions, the likelihoods of a case belonging to each subclass are calculated and compared. These likelihoods are products of the prior probability and the likelihoods for each individual predictor. Gaussian Naive Bayes determines these individual likelihoods for continuous features by assuming each feature follows a normal distribution.

Support Vector Machines (SVMs).
SVMs are kernel-based models that create hyperplanes to separate data based on subclassification. While there are numerous usable kernel functions, the popular linear (SVCLIN) and radial basis (SVCRBF) kernels are tested here.

Random forests (RFs).
RFs are bootstrap aggregating ensemble models that create numerous decision trees. Each tree is trained using random subsets of the training data sampled with replacement, ideally resulting in many weak learners with unique strengths and weaknesses. Predictions are then made using each tree, and the most common result serves as the final prediction.
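These three families can be sketched together as below. The data is synthetic and the hyperparameters are illustrative defaults, not the study's tuned values; only the choice of linear and RBF kernels for the SVMs follows the text.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
# Two synthetic classes separated in a 5-feature space.
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

clfs = {
    "GNB": GaussianNB(),                 # Gaussian per-feature likelihoods
    "SVC-linear": SVC(kernel="linear"),  # SVCLIN in the text
    "SVC-rbf": SVC(kernel="rbf"),        # SVCRBF in the text
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: clf.fit(X, y).score(X, y) for name, clf in clfs.items()}
```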

Gradient Boosting (GBOOST).
Boosting models are ensemble models consisting of many weak learners; decision trees are used here as well. Each subsequent tree is built to optimize a loss function, ideally resulting in many unique and increasingly effective trees. However, this additional optimization also has the capacity to render GBOOST models subject to overfitting.
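The overfitting tendency noted above can be seen by comparing training and held-out scores in a sketch like the following; the data, split, and number of estimators are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Noisy, partially overlapping classes so that memorization is possible.
X = np.vstack([rng.normal(0, 1.5, (60, 8)), rng.normal(1, 1.5, (60, 8))])
y = np.array([0] * 60 + [1] * 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Many sequential trees, each fit to the loss gradient of its predecessors.
gb = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
train_score = gb.score(X_tr, y_tr)
test_score = gb.score(X_te, y_te)
```

With enough boosting rounds the training score approaches 1.0 while the held-out score lags behind, which is the gap that regularization (fewer estimators, shallower trees, shrinkage) is meant to close.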

Multi-Layer Perceptron (MLP).
MLP was also studied to represent connectionist approaches to classification. Neural networks are flexible and potent, but the variety of potentially performant architectures, activation functions, and numerous other options for further optimization brings additional complexity and a lack of interpretability. In this case, MLPs with a single hidden layer of 10 neurons provided a relatively superior balance of performance and simplicity in cross-validation within the TCGA dataset. Consequently, this architecture underlies the classification results described in this comparison. Stochastic gradient descent is used to train the network.

Voting Ensemble (VOTE).
VOTE models can be created using other classification models. Ideally, taking the majority vote of multiple unique performant models can result in greater accuracy, should the constituent models compensate for each other's weaknesses. In this case, a voting model employing the most performant instance of each of the other model types is tested.
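The single-hidden-layer MLP and a hard-voting ensemble over it can be sketched as follows. The hidden layer of 10 neurons and SGD training follow the text; the synthetic data and the particular estimators placed in the ensemble are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 6)), rng.normal(2.5, 1, (50, 6))])
y = np.array([0] * 50 + [1] * 50)

# One hidden layer of 10 neurons, trained with stochastic gradient descent.
mlp = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                    max_iter=2000, random_state=0)

# Hard voting: the final label is the majority vote of the fitted estimators.
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svc", SVC(kernel="linear")),
                ("mlp", mlp)],
    voting="hard",
)
vote.fit(X, y)
score = vote.score(X, y)
```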

Three Subtypes Training and Testing with TCGA Dataset
The accuracies of five-fold cross-validation for the most performant of each model type are presented in Table 3. The average classification F-score across models is 0.702. The best average F-score for a single model, attributable to logistic regression (LR), is 0.759. Notably, LR models trained on data with many dimensions can fall prey to overfitting; cross-validation results will reveal that the voting model, with an F-score of 0.747, proves more consistent. Accuracies by subtype are presented in Figure 4A, B. While models can more reliably identify the Classical subtype, they universally have difficulty correctly identifying Proneural subtypes.
The misclassified Proneural subtypes are predicted to be Classical and Mesenchymal with similar frequencies (Figure 5C).

Three Subtypes Cross Validation with CGGA Dataset
Accuracies of predictions made for the CGGA dataset using models trained on the TCGA dataset are presented in Table 4. Models such as LR had greater drops in accuracy than the SVMs, the MLP, and the ensemble models, but all types saw a drop in accuracy across the board. The average accuracy was 59.86%, and the best accuracy, attributable to the SVM employing a linear kernel function, was 63.39%. Interestingly, accuracy by subtype reveals a similar pattern to the one observed in the TCGA dataset: the Classical subtype is in general more easily identified, while Proneural subtypes are more easily misidentified (Figure 5A, B).
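The cross-dataset protocol amounts to fitting on one cohort and scoring on another. A hedged sketch with two synthetic cohorts, where the `cohort` helper and the batch shift between them are invented stand-ins for the TCGA/CGGA datasets:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

def cohort(shift, n=100):
    """Synthetic two-class cohort; `shift` mimics a batch effect between datasets."""
    X = np.vstack([rng.normal(0, 1, (n, 4)) + shift,
                   rng.normal(2, 1, (n, 4)) + shift])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = cohort(shift=0.0)  # "TCGA-like" training cohort
X_test, y_test = cohort(shift=0.5)    # "CGGA-like" cohort with a systematic shift

clf = SVC(kernel="linear").fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
cross_acc = clf.score(X_test, y_test)
```

Even a modest systematic shift between cohorts lowers `cross_acc` relative to `train_acc`, which mirrors the across-the-board accuracy drop reported here.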

Intensive Treatment Response Training and Testing
Predicting the likely response to intensive treatment based on subtype group rather than individual subtype leads to an unsurprising increase in accuracy in the TCGA-GBM dataset. The average F-score across models is now 0.782, and the SVM with a linear kernel predicted subtype category with an F-score of 0.832 (Table 5, Figure 6). Intensive treatment response was cross-validated in the CGGA dataset (Table 6, Figure 7). Although drops in accuracy are again observable, the scores remain higher than in the three-subtype task. On average, the models predicted prospective treatment response with an F-score of 0.631. The support vector machine using the radial basis function kernel performed best with an accuracy of 0.686.
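Collapsing the three subtypes into a binary target can be sketched as a simple label mapping. The particular grouping below (Classical/Mesenchymal together, following the intensive-treatment observation cited in the introduction) and the group names are assumptions for illustration, not the authors' exact label definition:

```python
# Hypothetical grouping: subtypes reported to benefit from intensive regimens
# vs. the remainder. Group names are invented for this sketch.
GROUP = {
    "Classical": "intensive",
    "Mesenchymal": "intensive",
    "Proneural": "standard",
}

def to_binary(labels):
    """Map three-subtype labels onto the binary treatment-response target."""
    return [GROUP[label] for label in labels]
```

The same classifiers are then trained against this binary target instead of the three subtype labels.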

Conclusion
Our study showed that F-scores around 0.750 for the training set and just over 0.600 for cross-validation far outclass guessing, but still leave much to be desired. We suspect that overlap between biomarkers, as well as the limited number of cases with the relevant data within these sets, constitutes a partial explanation for why these scores fall short. Another possible explanation involves potential systematic differences in demographics and in the determination of expression levels between the two datasets, which may also contribute to the gap in performance. Simplifying the problem by classifying prospective response to intensive treatment does improve these scores to over 0.800 for the training set and over 0.675 for the testing set, but at the cost of identifying the individual subtype. To better understand these metrics and how they may be improved, a closer look can be taken at performance by model type.
Outside of single decision trees and linear discriminant analysis, no models performed particularly well or poorly in comparison to the rest. This constitutes the fundamental takeaway regarding comparisons of models; the similar performance across the board indicates that the inability to reach higher accuracies may primarily be attributable to factors beyond model selection. Additionally, the lack of improvement seen in ensemble models, and the voting model in particular, can be explained by a lack of unique weaknesses or strengths among the different model types in this context. However, the final scores, while similar, have some differences worth consideration.
Following cross-validation with the CGGA dataset, it becomes clear that SVMs, MLPs, and voting ensemble models ultimately performed best. The relatively poor performance of some models may be explained by their weaknesses. Though some regularization has been performed prior to cross-validation between datasets, LR models can suffer from overfitting. GNB models must assume conditional independence between the biomarkers used for prediction. While the tenuous nature of this assumption in this context may slightly lower performance, GNB performed decently enough in cross-validation to warrant recognition. However, DTs and KNN models, while easily interpreted, are ostensibly unable to outperform other, more elaborate models for this particular problem. Such shortcomings explain the superior performances of SVMs, connectionist models, and ensemble models both in the literature concerning classification of RNA-Seq data and here. Determining the reasons underlying the gaps in performance between these models proves more difficult given the largely similar outcomes.
The relatively high performance of SVMs may be attributable to their ability to perform well with high-dimensional data. This may partially explain the slightly superior results in cross-validation with the CGGA dataset when compared to tree-based ensemble models. While both RF and GBOOST models outperformed the weakest models, they achieved marginally lower scores in cross-validation. Particularly in the case of GBOOST, the higher relative scores prior to cross-validation may indicate some degree of overfitting. While this can also present a problem for MLP and VOTE models, their performances, and the latter's ability to balance the weaknesses of its estimators, suggest distinct value in evaluating RNA-Seq data. Fundamentally though, the similarity of scores attributable to all of the more performant models indicates that real improvement in subtype classification will require work beyond theoretically informed model selection.
Other avenues to improving model performances in the future could include further examination of individual subtype accuracies. Though Proneural cases were often misidentified as both the Classical and Mesenchymal subtypes, the slightly higher rate of misclassifications as Mesenchymal suggests that emphasizing the differences between Proneural and Mesenchymal subtypes may prove valuable in future endeavors; further preprocessing may be tailored to limit Proneural misclassification specifically. However, increasing data quantity constitutes the most promising path to substantial performance improvements. Particularly when employing demonstrably promising models like MLP, additional cases may be necessary for reaching the desired consistency.
Although tuning models and experimenting with further preprocessing could increase overall performance, the consistency of prediction accuracy across models indicates that the keys to a sizable improvement in performance lie elsewhere. That said, the evidence indicates that SVMs, MLPs, and ensemble models are best equipped to handle both classification problems with continuous genetic data in this context.