Algorithm

Pharmacological Data

We used data from the Genomics of Drug Sensitivity in Cancer (GDSC) project (release 2). In this project, 714 cancer cell lines were screened against 138 anticancer drugs, and IC50 values were calculated by curve fitting. Of these 714 cell lines, 16 were pancreatic cancer cell lines, and we extracted the pharmacological screening data for these 16 lines. In the original data, log IC50 values range from -11 (most sensitive) to +13.6 (most resistant). However, the most resistant values are not measured values; they are extrapolated IC50 values that have no clinical relevance and could distort model training. We therefore restricted the log IC50 range to -7 to +7 to avoid misinterpreting the data.
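As an illustration of the range restriction, the following minimal sketch clamps log IC50 values into the interval from -7 to +7. The text does not say whether out-of-range values were clamped to the bounds or discarded, so clamping is an assumption here, and the function name and sample values are hypothetical.

```python
def clip_log_ic50(values, lower=-7.0, upper=7.0):
    """Clamp each log IC50 value into the closed interval [lower, upper].

    Assumption: out-of-range values are moved to the nearest bound.
    Dropping them instead would be an equally plausible reading of the text.
    """
    return [min(max(v, lower), upper) for v in values]


# Hypothetical values spanning the range reported for the original data.
raw = [-11.0, -3.2, 0.5, 9.8, 13.6]
clipped = clip_log_ic50(raw)
```

Values already inside the interval pass through unchanged, so only the extrapolated extremes are affected.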

Chemical Features

To develop the QSAR models, we calculated 863 chemical descriptors (1D, 2D and 3D) and 10 types of fingerprints using PaDEL software. These descriptors include constitutional, topological, geometric, electrostatic, hydrophobic and many other types, listed in Table. Before calculating the descriptors, we downloaded the structures of the molecules (SDF format) from PubChem and drew the structures of unavailable molecules with the PubChem editor. The structures were then converted to 3D and energy-minimized with OpenBabel, and the minimized structures were used for descriptor calculation.

QSAR Models

We developed a separate QSAR model for each of the 16 pancreatic cancer cell lines using SMOreg. SMOreg uses the sequential minimal optimization algorithm to train a support vector machine for regression, with polynomial or Gaussian (RBF) kernels. To run SMOreg with the RBF kernel, we used the command-line version of the Weka machine learning tool (version 3.6.6). The chemical descriptors calculated by PaDEL were used as input. Not all descriptors are relevant for determining the IC50 value, so non-informative descriptors had to be removed to obtain robust QSAR models. We applied the remove-useless function followed by the CfsSubsetEval module implemented in Weka to select relevant descriptors. CfsSubsetEval considers the predictive ability of each attribute (chemical descriptor) together with the redundancy among attributes, and selects a subset of attributes that are highly correlated with the class but have low inter-correlation.
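The trade-off that correlation-based feature selection makes can be sketched with the standard CFS merit score, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation. This is a simplified illustration using absolute Pearson correlations, not Weka's CfsSubsetEval implementation; the function names and toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (constant columns would be
    dropped beforehand, e.g. by a remove-useless style filter).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cfs_merit(features, target):
    """CFS merit of a descriptor subset (list of columns) for a target."""
    k = len(features)
    # Mean absolute correlation between each descriptor and the target.
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    # Mean absolute correlation between every pair of descriptors.
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data: d1 tracks the target perfectly, d2 is only weakly related.
log_ic50 = [1.0, 2.0, 3.0, 4.0, 5.0]
d1 = [2.0, 4.0, 6.0, 8.0, 10.0]
d2 = [1.0, -1.0, 0.0, 1.0, -1.0]

merit_single = cfs_merit([d1], log_ic50)
merit_redundant = cfs_merit([d1, d1], log_ic50)  # duplicate adds nothing
merit_weak = cfs_merit([d1, d2], log_ic50)       # weak descriptor lowers merit
```

A subset search would keep the descriptor set with the highest merit, which is why redundant or weakly predictive descriptors tend to be left out.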


Cross-Validation

Cross-validation is essential for guarding against under- and over-fitting, so we used 10-fold cross-validation in model building. The original dataset was randomly divided into 10 parts; nine were used for training and the remaining one exclusively for testing. This process was repeated 10 times, so that each part served once as the test set, which generated 10 predictive models. Finally, to evaluate the performance of the QSAR models, we calculated the Pearson correlation coefficient (R) and the root mean square error (RMSE).
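The procedure can be sketched in a few lines. The example below is only an outline under stated assumptions: a one-variable least-squares line stands in for the SMOreg model, R and RMSE are computed on the pooled out-of-fold predictions (the text does not say whether metrics were pooled or averaged per fold), and all function names and data are hypothetical.

```python
import random
from math import sqrt


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares line; a stand-in for the real regression model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def rmse(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def cross_validate(xs, ys, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, repeat for every fold."""
    predicted = [None] * len(xs)
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test_idx:
            predicted[i] = slope * xs[i] + intercept
    # Metrics are computed on the pooled out-of-fold predictions.
    return pearson_r(ys, predicted), rmse(ys, predicted)


# Hypothetical, perfectly linear data: one descriptor and one response.
descriptor = [float(i) for i in range(30)]
response = [0.5 * x - 3.0 for x in descriptor]
r_value, rmse_value = cross_validate(descriptor, response)
```

Because every sample is predicted exactly once by a model that never saw it during training, the two metrics reflect performance on unseen data.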


Clustering of Cell Lines

To cluster the pancreatic cancer cell lines, we used the hclust function available in R, based on expression data for 109 important genes obtained from the Cancer Cell Line Encyclopedia (CCLE). The Pearson correlation coefficient (R) between the expression values of these genes was used as input for hclust.
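For readers who want to see the idea outside R, here is a small pure-Python sketch of correlation-based agglomerative clustering. It rests on several assumptions not stated in the text: correlations are taken between cell lines across genes, the correlation is converted to a dissimilarity as 1 - R, and complete linkage is used (the default method of hclust). The cell line names and expression values are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def correlation_distance(profiles):
    """Pairwise dissimilarity 1 - R between expression profiles."""
    names = sorted(profiles)
    return {(p, q): 1.0 - pearson_r(profiles[p], profiles[q])
            for p in names for q in names if p < q}


def complete_linkage(profiles):
    """Merge the closest clusters until one remains; return the merge list."""
    dist = correlation_distance(profiles)

    def d(a, b):
        return dist[(a, b)] if a < b else dist[(b, a)]

    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                link = max(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        link, i, j = best
        merges.append((clusters[i], clusters[j], link))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


# Hypothetical expression profiles (four genes per cell line).
expression = {
    "line_A": [1.0, 2.0, 3.0, 4.0],
    "line_B": [2.0, 4.0, 6.0, 8.5],
    "line_C": [4.0, 3.0, 2.0, 1.0],
}
merge_order = complete_linkage(expression)
```

In this toy input, line_A and line_B have nearly identical profiles and are merged first, while the anti-correlated line_C joins last at the maximum dissimilarity.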