Algorithm to develop ntEGFR server

Algorithm of ntEGFR server

Three QSAR models were developed for this server:

1. QSAR model based on wild-type EGFR:

This model was developed using 128 experimentally validated anti-EGFR quinazoline derivative inhibitors. To provide an unbiased evaluation of our models, we randomly divided the dataset into training (80% of inhibitors) and validation (20% of inhibitors) sets. In summary, we created three datasets called wild_whole, wild_train and wild_valid, which contain 128, 103 and 25 inhibitors, respectively.

2. QSAR model based on L858R mutant EGFR:

We also developed a QSAR model using 56 L858R mutant EGFR inhibitors, which are imidazothiazole and pyrazolopyrimidine derivatives. As with the wild-type datasets, we created three datasets called mutant_whole, mutant_train and mutant_valid, consisting of 56, 42 (80%) and 14 (20%) inhibitors, respectively.

3. QSAR model based on both wild-type and mutant EGFR:

To develop models for predicting inhibitors against both wild-type and mutant EGFR, we created a combined (hybrid) dataset consisting of 184 inhibitors (128 wild + 56 mutant) (Figure 1).
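
The random division into training and validation sets can be sketched as follows. This is only an illustration, not the code used to build the server: the compound identifiers, the helper name split_dataset and the seed are hypothetical, and the validation set sizes are passed as the explicit counts given above.

```python
import random


def split_dataset(compound_ids, n_valid, seed=0):
    """Randomly split compound IDs into (train, valid), holding out n_valid."""
    if not 0 < n_valid < len(compound_ids):
        raise ValueError("n_valid must be between 1 and len(compound_ids) - 1")
    shuffled = list(compound_ids)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical identifiers; the real datasets hold structures and activities.
wild_whole = [f"wild_{i:03d}" for i in range(1, 129)]
mutant_whole = [f"mutant_{i:03d}" for i in range(1, 57)]

wild_train, wild_valid = split_dataset(wild_whole, n_valid=25)
mutant_train, mutant_valid = split_dataset(mutant_whole, n_valid=14)
hybrid_whole = wild_whole + mutant_whole

print(len(wild_train), len(wild_valid))      # 103 25
print(len(mutant_train), len(mutant_valid))  # 42 14
print(len(hybrid_whole))                     # 184
```

Fixing the seed keeps the validation set identical across runs, so models trained at different times are evaluated on the same held-out inhibitors.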

Datasets:

The datasets were compiled from the literature and comprise three types: (1) wild-type EGFR inhibitors, (2) L858R mutant EGFR inhibitors, and (3) hybrid inhibitors (wild-type and mutant combined).

Figure 1: Flow chart showing the training and validation datasets used in developing the prediction models.

Descriptor Calculation:

For the QSAR models, we calculated descriptors using several software packages, including Dragon, V-life, Web-Cdk, PowerMv and PaDEL, together with docking-based energy descriptors. These descriptors fall into different categories, such as topological, molecular and constitutional descriptors.

Figure 2: Pictorial representation of the ntEGFR method.

Feature Selection:

Feature selection is an important step in QSAR modeling. Some descriptors contribute negatively to model performance, so it is necessary to identify them and remove them from the model. For this purpose, we used the CfsSubsetEval feature selection method in Weka, which retains the most informative descriptors. We then used an F-stepping approach to further reduce the number of descriptors without any significant change in model performance.
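
CfsSubsetEval is based on the correlation-based feature selection (CFS) idea: a good descriptor subset correlates strongly with the activity while its members correlate weakly with each other. The sketch below computes the standard CFS merit score, k * r_cf / sqrt(k + k(k-1) * r_ff), for a candidate subset. It is a simplified illustration and not Weka's implementation: the use of absolute Pearson correlation, the function names and the descriptor values are all assumptions made for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def cfs_merit(columns, activity):
    """CFS merit of a descriptor subset (assumed form, see lead-in)."""
    k = len(columns)
    if k == 0:
        return 0.0
    # Mean descriptor-activity correlation (relevance).
    r_cf = sum(abs(pearson(col, activity)) for col in columns) / k
    # Mean descriptor-descriptor correlation (redundancy).
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Made-up values for illustration only.
activity = [5.1, 6.0, 6.8, 8.1, 9.0]
d1 = [1.0, 2.0, 3.0, 4.0, 5.0]
d2 = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly redundant with d1

print(cfs_merit([d1], activity))
print(cfs_merit([d1, d2], activity))  # adding a redundant descriptor gives no gain
```

A subset search (for example, greedy forward selection) would call cfs_merit on candidate subsets and keep the one with the highest score.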

Machine learning techniques:

For model building, we used both SVM light and the Weka-based SMOreg and SVMreg regression methods. Our findings suggest that SMOreg performs better than SVM light. We therefore developed the final QSAR models using SMOreg.

Performance Evaluation:

The performance of the constructed models was evaluated using five-fold cross-validation and leave-one-out cross-validation (LOOCV). In LOOCV, each molecule is used once for testing while the remaining (n-1) molecules are used for training. The performance of the methods was computed using the following measures:

1. Correlation coefficient (R)

2. Coefficient of determination (R²)

3. Mean absolute error (MAE)

4. Root mean square error (RMSE)
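
The evaluation procedure and the four measures can be sketched as below. This is a minimal illustration under stated assumptions: a one-descriptor least-squares line stands in for the SMOreg model, the descriptor and activity values are made up, and R² is computed as 1 - SS_res/SS_tot, which may differ from the definition used for the server (the square of R is another common choice). All function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line through (x, y) points; a stand-in for the real regressor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_predictions(xs, ys, k, seed=0):
    """Predict each molecule from a model trained without it (k = n gives LOOCV)."""
    n = len(ys)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of molecules")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    predictions = [0.0] * n
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            predictions[i] = slope * xs[i] + intercept
    return predictions


def correlation_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def r_squared(actual, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up descriptor values and activities, for illustration only.
descriptor = [0.5, 1.1, 1.9, 2.4, 3.2, 3.8, 4.5, 5.1, 5.9, 6.6]
activity = [5.2, 5.6, 6.1, 6.3, 7.0, 7.2, 7.9, 8.1, 8.8, 9.0]

for label, k in (("five-fold", 5), ("LOOCV", len(activity))):
    predicted = cross_validated_predictions(descriptor, activity, k)
    print(label,
          round(correlation_r(activity, predicted), 3),
          round(r_squared(activity, predicted), 3),
          round(mae(activity, predicted), 3),
          round(rmse(activity, predicted), 3))
```

Because every prediction comes from a model that never saw the molecule being predicted, the four measures describe performance on unseen compounds, not the fit to the training data.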