
DBpred Help Page

This is the help page for DBpred, a method developed for predicting DNA-interacting residues in a protein. This page provides different types of information on DBpred. To present the information in a structured form, we have divided it into the following topics.

  1. Datasets: Datasets were generated from DNA-interacting protein structures in PDB (April 2019 release).
  2. Evaluation of models: Standard protocols were used for evaluating the models developed in this study.
  3. Algorithm: Standard algorithms were used for developing models that predict DNA-interacting residues.
  4. Help on Web Pages: Help pages with screenshots.


Importance of DNA-binding

DNA-protein interaction is one of the most crucial interactions in biological systems; it decides the fate of many processes, such as transcription, regulation of gene expression, splicing, and many more. To explore the mechanisms underlying these biological processes, it is essential to recognize the specific residues in a protein sequence that interact with DNA.

In this study, we present a method, DBpred, that predicts the DNA-interacting residues in a protein from its primary structure. We downloaded the benchmark dataset of the method hybridNAP for training and validation purposes.


Datasets

We downloaded the benchmark dataset of the method hybridNAP for training and validation purposes. This dataset comprises 817 proteins for training and 47 proteins for validation; after pre-processing the data, we were left with 646 proteins for training and 47 proteins for validation.

Prediction models were developed on the training dataset, which contains 15636 DNA-interacting and 298503 non-interacting residues. They were validated on the validation dataset, which consists of 875 interacting and 9078 non-interacting residues, and on an independent dataset that contains 232 interacting and 776 non-interacting residues. A pattern size of 17 was used in all cases.
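
A pattern size of 17 means that each residue is described by a window of 17 consecutive residues. The sketch below shows one common way to build such patterns, not necessarily the one used by DBpred. It assumes that each pattern is centred on the residue it describes and that the sequence ends are padded with a dummy residue "X". The names make_patterns, WINDOW and PAD are hypothetical, and the example sequence is made up.

```python
# Sketch only: cut a protein sequence into overlapping patterns of size 17.
# Assumptions (not confirmed by the DBpred page): each pattern is centred on
# the residue it represents, and the termini are padded with "X".
WINDOW = 17   # pattern size stated in the Datasets section
PAD = "X"     # assumed dummy residue used for padding


def make_patterns(sequence, window=WINDOW, pad=PAD):
    """Return one pattern per residue, centred on that residue."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = pad * half + sequence.upper() + pad * half
    return [padded[i:i + window] for i in range(len(sequence))]


example_sequence = "MKRLSTAVDNQEGHIL"   # made-up 16-residue sequence
patterns = make_patterns(example_sequence)
print(len(patterns), patterns[0])
```

Under these assumptions a protein of N residues yields N patterns, so the residue counts above are also the numbers of patterns in each dataset.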


Model Evaluation

In this study, we used standard procedures for evaluating the performance of our models. First, we divided the data into training and validation datasets in a ratio of 80% to 20%. All training and testing was performed on the training dataset, and the final performance of each model was evaluated on the validation or independent dataset. The evaluation is briefly described below.
  • Five-fold cross-validation: Five-fold cross-validation was performed to evaluate the performance of the different models developed in this study. In this technique, the dataset is divided into five sets, of which four are used to train the model and the fifth is used to test its performance. This process is repeated five times, so that each set is used once for testing. The final performance is reported as the average of the performance obtained on the five sets.
  • External Validation: The five-fold cross-validation described above is an internal validation, in which the same dataset is used for both training and testing, so a model may be over-optimized. To measure the realistic performance of our models, we also performed external validation, in which a model developed on the training dataset is evaluated on a validation or independent dataset. Because the training and validation datasets do not share sequences, the data used for training and for validation are different, and the measured performance is realistic.
  • Performance Measures: In this study, we used both threshold-dependent and threshold-independent parameters to evaluate the performance of our models. As threshold-dependent measures, we used the standard parameters sensitivity, specificity, and accuracy. As a threshold-independent measure, we used the area under the ROC curve (AUC) to assess overall performance.
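
The evaluation steps described in this section can be sketched in plain Python. This is a minimal illustration, not the DBpred code: the function names, the default threshold of 0.5 and the toy labels and scores are all assumptions made for the example.

```python
# Sketch only: five-fold splitting and the performance measures named above.
# Function names, the 0.5 threshold and the toy data are hypothetical.

def five_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; every item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def threshold_measures(y_true, scores, threshold=0.5):
    """Return (sensitivity, specificity, accuracy) at a score threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


def auc_roc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = DNA-interacting) and predicted scores, made up for the example.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
sens, spec, acc = threshold_measures(labels, scores)
auc = auc_roc(labels, scores)
splits = list(five_fold_indices(10))
```

In a real run, the threshold-dependent measures and the AUC would be computed on the held-out set of each of the five splits and then averaged, as described above.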

Algorithm

We used amino acid binary profiles, physicochemical-property-based binary profiles, PSSM profiles, and a combination of all three as input features.
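
As an illustration of the first feature type, the sketch below builds an amino acid binary profile by one-hot encoding each residue of a pattern. It makes two assumptions that are not stated on this page: each residue is encoded as a 20-dimensional vector, and the padding residue "X" (or any non-standard residue) is encoded as all zeros. The names AMINO_ACIDS and binary_profile are hypothetical.

```python
# Sketch only: amino acid binary (one-hot) profile of a sequence pattern.
# Assumptions: 20 values per residue, and all zeros for "X" or any
# non-standard residue. The actual DBpred encoding may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues


def binary_profile(pattern):
    """Return a flat list of 0/1 values, 20 per residue in the pattern."""
    features = []
    for residue in pattern.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            one_hot[index] = 1
        features.extend(one_hot)
    return features


# A made-up pattern of size 17 with a single alanine at the centre.
profile = binary_profile("XXXXXXXXAXXXXXXXX")
print(len(profile))
```

With a pattern size of 17, this encoding gives 17 × 20 = 340 binary features per residue.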
  • Machine Learning and Deep Learning Techniques: In this study, machine learning and deep learning techniques were used for developing models. The major techniques used are random forest (RF), XGBoost (XGB), k-nearest neighbors (KNN), logistic regression (LR), Gaussian naive Bayes (GNB), decision tree (DT), and a one-dimensional convolutional neural network (1D-CNN). These techniques were implemented using the Python libraries scikit-learn, TensorFlow, and Keras.

Help Pages

The DBpred server discriminates between DNA-interacting and non-interacting residues in a given sequence.

This page provides help on the different modules of the server. The major modules are:
Sequence: This page lets users enter multiple sequences in FASTA format and select the desired prediction method. Users can view the results online and download them in ".txt", ".png", and ".pdf" file formats.





PSSM Profile: This module allows users to predict the DNA-interacting residues in a given protein sequence using evolutionary information in the form of a PSSM. Users are asked to submit either a single protein sequence or a small number of sequences in FASTA format. An option to upload a file containing the sequences is also provided for cases where the number of sequences is large.



Hybrid Profile: This module allows users to predict DNA-binding residues in a protein from its primary structure by using a hybrid of three different features: the amino acid binary (AAB) profile, the physicochemical-properties-based binary (PCB) profile, and the Position-Specific Scoring Matrix (PSSM) profile. Users can submit multiple protein sequences at a time in FASTA format.
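
All three modules take their input in FASTA format. The sketch below shows what multi-sequence FASTA input looks like and how it can be checked locally before submission. It is not part of the server; the parse_fasta function and the two example records are hypothetical.

```python
# Sketch only: read multi-sequence FASTA text before submitting it.
# parse_fasta and the example records are hypothetical.

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())   # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example_input = """>protein_1
MKRLSTAVDN
QEGHIL
>protein_2
ACDEFGHIKLMNPQRSTVWY
"""
records = parse_fasta(example_input)
for name, sequence in records:
    print(name, len(sequence))
```

Each record starts with a ">" header line followed by one or more lines of sequence, and several records can be placed one after another in the same input.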