We use empirical learning curves to evaluate and compare the data scaling properties of two neural networks (NNs) and two gradient boosting decision tree (GBDT) models trained on four cell line drug screening datasets. The learning curves are accurately fitted with a power law model, providing a framework for assessing the data scaling behavior of these models.
The curves demonstrate that no single model dominates in prediction performance across all datasets and training sizes, suggesting that the shape of these curves depends on the specific pairing of ML model and dataset. The multi-input NN (mNN), in which gene expressions of cancer cells and molecular drug descriptors are fed into separate subnetworks, outperforms a single-input NN (sNN), where the cell and drug features are concatenated into a single input layer. In contrast, a GBDT with hyperparameter tuning outperforms both NNs at the lower range of training set sizes for two of the tested datasets, whereas the mNN consistently performs better at the higher range of training sizes. Moreover, the trajectory of the curves suggests that increasing the sample size is expected to further improve the prediction scores of both NNs. These observations demonstrate the benefit of using learning curves to evaluate prediction models, providing a broader perspective on their overall data scaling characteristics.
A fitted power law learning curve provides a forward-looking metric for analyzing prediction performance and can serve as a co-design tool to guide experimental biologists and computational scientists in planning future experiments in prospective research studies.
In ML-driven cancer research, a common question is whether existing predictive models can be further improved with more training data. Given recent advances in artificial neural networks (NNs), deep learning (DL) methods have become a favored approach across a variety of scientific disciplines for discovering hidden patterns in large volumes of complex data. This trend is also observed in medical applications, including the prediction of drug response in cancer cell lines [10,11,12,13,14]. Regardless of the learning algorithm, supervised learning models are expected to improve generalization performance with increasing amounts of high-quality labeled data. Generalization performance refers to the aggregated accuracy of model predictions on a set of unseen data samples. Analytically estimating the learning capacity of models is a challenging task. Alternatively, given a dataset and a learning algorithm, the projected improvement of predictions with an increasing number of training samples can be estimated empirically by using learning curves.
A learning curve is a plot of the generalization performance of a supervised learning model as a function of training set size (Fig. 1). These curves have been explored as an efficient method for modeling the power law relationship, \(s(m) = am^b\), between the generalization score s (such as generalization error or accuracy) and the number of training samples m, where a and b are the two parameters of the power law model. The power law characteristics of learning curves can provide insights into the data scaling behavior of drug response prediction models, which otherwise could not be investigated by merely analyzing single-value performance measures obtained with the full training set size.
A major bottleneck in utilizing learning curves, however, is the computational cost of generating them, which is particularly high for DL models and large datasets. While learning curves have been explored in a variety of small-scale applications with classical ML [15,16,17,18], only a few recent studies have applied DL methods to large benchmark datasets in vision and text applications [19,20,21]. To the best of our knowledge, learning curves of drug response prediction models have not been previously explored.
In this paper, we utilize learning curves to evaluate the data scaling properties of drug response prediction models in cancer cell lines. The objective of fitting a power law expression to raw learning curve data is twofold: (1) efficient and accurate estimation of prediction performance on a larger dataset that is not yet available or is computationally prohibitive, and (2) fair and systematic comparison of prediction models across learning algorithms and datasets. To accomplish these objectives, we develop an efficient computational workflow that leverages high-performance computing (HPC) resources to conduct the large-scale analysis, and we use it to perform a systematic comparison between classical ML and DL models on large pharmacogenomic drug response datasets. We use this workflow to generate learning curve data with gradient boosting decision tree (GBDT) models and NNs, where each model is trained on four large drug response datasets of cancer cell lines. To assess the data scaling trajectory of each dataset-model pair, the power law expression is fitted to the raw learning curve data to generate a learning curve and uncertainty estimates of the curve. We apply this methodology to analyze sixteen dataset-model combinations.
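As a concrete illustration of the core step of such a workflow, the sketch below subsamples nested training sets of increasing size, trains a GBDT on each subset, and records the test error, yielding the raw learning curve points. It assumes NumPy arrays X and y and the LightGBM scikit-learn interface; the subset sizes, split ratio, hyperparameters, and function name are placeholders, not the settings used in the study.

```python
# Minimal sketch of generating raw learning curve data, assuming X (cell +
# drug features) and y (drug response) are NumPy arrays. Subset sizes and
# hyperparameters below are illustrative placeholders.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def learning_curve_scores(X, y, subset_sizes, seed=0):
    """Train a GBDT on nested training subsets and return (m_k, test MAE) pairs."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=seed)
    order = np.random.default_rng(seed).permutation(len(X_tr))  # fixed shuffle -> nested subsets
    scores = []
    for m_k in subset_sizes:
        idx = order[:m_k]                                        # first m_k shuffled samples
        model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.1)
        model.fit(X_tr[idx], y_tr[idx])
        scores.append((m_k, mean_absolute_error(y_te, model.predict(X_te))))
    return scores
```

Repeating this loop over multiple random data splits, as done in the study, yields the distribution of scores at each training size from which the curve and its uncertainty can be estimated.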
Theoretical [22, 23] and empirical [15, 20, 21] studies demonstrate that learning curves of predictive models are characterized by a power law relationship between the training set size m and the generalization score s, \(s(m) = am^b\), where a and b are the fitted parameters.
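For reference, a minimal curve-fitting sketch is shown below; it fits \(s(m) = am^b\) to the raw (m_k, score) points with SciPy. The initial parameter guess and function names are assumptions, not values taken from the study.

```python
# Hedged sketch of fitting the power law s(m) = a * m^b to raw learning
# curve points; the initial guess p0 is an assumption.
import numpy as np
from scipy.optimize import curve_fit

def power_law(m, a, b):
    return a * np.power(m, b)

def fit_power_law(sizes, scores):
    """Return fitted (a, b) and the fitted curve evaluated at the given sizes."""
    sizes = np.asarray(sizes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    (a, b), _ = curve_fit(power_law, sizes, scores, p0=(1.0, -0.3), maxfev=10000)
    return a, b, power_law(sizes, a, b)
```

For an error metric such as MAE, the fitted exponent b is expected to be negative, so the curve decreases as the training size grows.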
Learning curves provide intuitive insight into the data scaling behavior of prediction performance, as opposed to single-value performance measures obtained with the entire set of training samples. The shape of these curves facilitates comparison between ML models by illustrating a global trajectory of model improvement. Thus, learning curves can be utilized for quantifying the learning capacity of prediction models with increasing amounts of training data.
Classical ML algorithms such as GBDT use only the feature values and ignore the arrangement of features in a dataset. Since we want to compare the learning curves of classical ML and DL models, we used molecular descriptors as drug representations, in which the ordering of features is not intended to carry meaningful information. The descriptors were generated by using the Mordred software package [30]. The full descriptor set comprises 1,826 features, including both 2-D and 3-D molecular structure descriptors. Since most of the 3-D descriptors resulted in invalid (NaN) values for the majority of compounds, we retained only the 2-D descriptors, providing a total of 1,613 drug features.
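A minimal sketch of this kind of descriptor generation is shown below, using RDKit and the Mordred calculator restricted to 2-D descriptors. The NaN-filtering threshold and function name are illustrative assumptions rather than the exact preprocessing used in the study.

```python
# Illustrative sketch of computing 2-D Mordred descriptors from SMILES and
# dropping columns dominated by invalid values; the threshold is an assumption.
import pandas as pd
from rdkit import Chem
from mordred import Calculator, descriptors

def compute_2d_descriptors(smiles_list, max_nan_frac=0.5):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    calc = Calculator(descriptors, ignore_3D=True)    # restrict to 2-D descriptors
    df = calc.pandas(mols)
    df = df.apply(pd.to_numeric, errors="coerce")     # descriptor errors -> NaN
    return df.loc[:, df.isna().mean() <= max_nan_frac]
```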
Note that each cancer cell line was screened against multiple drugs and, vice versa, each drug was tested on multiple cell lines. Thus, although each cell-drug combination is unique in the training set T, the feature vectors of individual cells, \(\boldsymbol{x}_c\), and drugs, \(\boldsymbol{x}_d\), appear multiple times in T. The analysis of how this redundancy in feature space affects prediction models and learning curves is beyond the scope of this paper and provides a topic for further investigation.
NNs can be designed to enhance learning from a particular feature type of cell or drug [11, 31]. In such models, the prediction performance depends on the availability, quality, and diversity of that specific feature type in a training set. In contrast, our primary objective is to gain insight into the overall prediction improvement with an increasing number of training samples. Thus, we refrained from using architectures that focus on learning from specific feature types of cells or drugs.
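The two NN designs compared in this study differ only in how the cell and drug features enter the network. Below is a minimal Keras-style sketch of that distinction; the layer widths, activations, and variable names are placeholder assumptions, not the architectures trained in the study.

```python
# Minimal sketch contrasting the single-input (sNN) and multi-input (mNN)
# designs; layer sizes and activations are illustrative placeholders.
from tensorflow import keras
from tensorflow.keras import layers

def build_snn(n_cell_feats, n_drug_feats):
    """sNN: cell and drug features concatenated into one input vector."""
    inp = keras.Input(shape=(n_cell_feats + n_drug_feats,))
    x = layers.Dense(1000, activation="relu")(inp)
    x = layers.Dense(500, activation="relu")(x)
    out = layers.Dense(1)(x)                       # drug response regression
    return keras.Model(inp, out)

def build_mnn(n_cell_feats, n_drug_feats):
    """mNN: separate subnetworks encode cells and drugs before merging."""
    cell_in = keras.Input(shape=(n_cell_feats,), name="gene_expression")
    drug_in = keras.Input(shape=(n_drug_feats,), name="drug_descriptors")
    c = layers.Dense(500, activation="relu")(cell_in)
    d = layers.Dense(500, activation="relu")(drug_in)
    x = layers.Concatenate()([c, d])
    x = layers.Dense(500, activation="relu")(x)
    out = layers.Dense(1)(x)
    return keras.Model([cell_in, drug_in], out)
```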
Because of the large difference in performance between dGBDT and the other models, we analyze the learning curves of dGBDT separately in Fig. 5. Since LightGBM is highly parallelizable and converges faster than the NNs, we train 1,000 dGBDT models (\(N=20\), \(K=50\)) with each of the four datasets, as shown in Fig. 5a. Note that the three regions of the learning curve (the small-data region, the power law region, and the plateau) are apparent in each plot.
Learning curves generated by using dGBDT for multiple data splits of each of the datasets in Table 1. a The entire set of learning curve scores, \(LC_{raw}\), where each data point is the mean absolute error of predictions computed on test set E as a function of the training set size \(m_k\). A subset of scores in which the sample size is above \(m_k^{min}\) (dashed black line) was considered for curve fitting. b Three curves were generated to represent the fit: \(q_{0.1}\) (blue curve) and \(q_{0.9}\) (green curve) representing the variability of the fit, and \(\tilde{y}\) (black curve) representing the learning curve fit
To assess the utility of learning curves as a global metric for evaluating prediction models, we collected \(LC_{raw}\) for the hGBDT, sNN, and mNN models for each dataset. To obtain error scores appropriate for a power law fit, we qualitatively selected, based on empirical observations of the plots in Fig. 5b, a range of \(m_k\) that excludes the small-data region for each dataset. The remaining data points were used to obtain \(\tilde{y}\), \(q_{0.1}\), and \(q_{0.9}\), and the corresponding power law fits, as shown in Fig. 6, where the fits for \(q_{0.1}\) and \(q_{0.9}\) are represented by the shaded regions. The selected \(m_k\) range for each dataset is summarized in Table 3, together with the goodness-of-fit measure \(MAE_{fit}\) of the power law fits.
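A small sketch of this aggregation step is shown below: the N repeated scores at each training size are summarized into the 0.1 and 0.9 quantiles and a central estimate (taken here as the median, which is an assumption about how \(\tilde{y}\) is defined), and each summary series can then be fitted with the power law routine sketched earlier.

```python
# Sketch of summarizing N repeated scores per training size into quantile
# curves; the use of the median for the central curve is an assumption.
import numpy as np

def summarize_lc_raw(lc_raw):
    """lc_raw: dict mapping m_k -> list of N test MAE scores (one per data split)."""
    sizes = np.array(sorted(lc_raw))
    q10 = np.array([np.quantile(lc_raw[m], 0.1) for m in sizes])
    q90 = np.array([np.quantile(lc_raw[m], 0.9) for m in sizes])
    central = np.array([np.median(lc_raw[m]) for m in sizes])
    return sizes, q10, central, q90

# Each of q10, central, and q90 can then be passed to fit_power_law(sizes, ...)
# to obtain the central learning curve fit and its shaded variability band.
```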
As Fig. 6 indicates, mNN outperforms sNN across the entire range of the explored training sizes on every dataset, despite the two NNs having a similar number of trainable parameters. This superiority of mNN can be attributed to the separate encoding of gene expressions and drug descriptors within the individual input subnetworks, which enhances the overall model learning. Moreover, mNN exhibits the lowest prediction error at the full sample size for all datasets. On both GDSC datasets, however, no single model dominates across the entire \(m_k\) range: hGBDT outperforms both NNs at the lower range but performs worse than the NNs as the training size increases. Another important observation is the different trajectories of the curves among the datasets. On CTRP, for example, the slope of the mNN curve is considerably steeper than that of sNN. Thus, mNN is expected to exhibit a higher rate of improvement in prediction score if the training size further increases. On NCI-60, however, while the NNs exhibit similar curves over the majority of the observed range, mNN shows a sign of convergence and begins to transition from the power law region to the plateau.