Protein General Modules

 

 

In this chapter we describe the small programs developed in our group; these programs can be used as building blocks to develop complex prediction modules. The question arises how this package differs from existing software libraries or modules such as BioPerl. With BioPerl or similar packages, one needs knowledge of computer programming in order to use the modules/subroutines. In the GPSR package we have developed small programs that can be run by anyone with little knowledge of computers. The following are the important programs included in this package.

 

Program

Purpose

·            fasta2sfasta

Convert fasta format to single fasta format

·            pro2aac

To calculate amino acid composition of protein

·            pro2aac_nt

To calculate amino acid composition of N-terminal (nt) residues of a protein

·            pro2aac_ct

To calculate amino acid composition of C-terminal (ct) residues of a protein

·            pro2aac_rest

To calculate amino acid composition of a protein after removing N- and C-terminal residues

·         pro2aac_split

To calculate split amino acid composition (SSAC) of a protein

·         pro2dpc

To calculate dipeptide composition of protein

·         pro2dpc_nt

To calculate dipeptide composition of N-terminal (nt) residues of a protein

·         pro2dpc_ct

To calculate dipeptide composition of C-terminal (ct) residues of a protein

·         pro2tpc

To calculate tripeptide composition of protein

·         add_cols

To add columns of two files

·         col2svm

To generate SVM_light input format

·         col_mult

To multiply each column of an input file by a number

·         col_mult_sel

To multiply selected columns by a number

·         col_rem

To remove selected columns from a file

·         col_ext

To extract selective columns from a file

·         col_corr

To compute the correlation coefficient between two columns

·         col_avg

To calculate the column-wise average of two files

·         seq2pssm_imp

To calculate PSSM matrix in column format without any normalization

·         pssm_n1

To normalize a PSSM profile using the formula 1/(1+e^(-x))

·         pssm_n2

To normalize pssm profile based on (numb -min)/(max -min) formula

·         pssm_n3

To normalize pssm profile based on (numb -min)*100/(max -min) formula

·         pssm_n4

To normalize a PSSM profile using the formula 1/(1+e^(-x/100))

·         pssm_comp

To compute PSSM composition (400 points)

·         col_sig

To compute the significance of columns in two column files

·         pssm2pat

To generate patterns of given size from PSSM matrix

·         pssm_smooth

To smooth a PSSM profile for plotting

·         seq2motif

To create motifs by sliding window of user defined length with option of adding terminal X

·         motif2bin

To make binary input from the multifasta motif file

·         blast_similarity

To perform blast

 

Title
Description
 
fasta2sfasta (Convert fasta format to single fasta format)
Fasta format
The fasta format (Pearson format) is used to represent peptide or nucleic acid sequences using single-letter codes. Each record begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol at its start.
Single fasta format
Our programs take input sequences in single fasta format. Therefore, a fasta file should first be converted into single fasta format. In the single fasta format the description and the sequence data are merged into a single line, with two hash marks (##) separating the description from the sequence data.
Usage
fasta2sfasta -i seq.fa -o seq.sfa
-i
Input file name having sequence in fasta format
-o
Output file name that gives sequence in single fasta format
seq.fa
>seq_1
MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
>seq_2
ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.sfa
>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
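The conversion described above can be sketched in a few lines of Python. This is only an illustration of the single fasta layout, not the fasta2sfasta program itself; the function name `fasta_to_sfasta` is hypothetical, and joining wrapped sequence lines without a separator is an assumption.

```python
def fasta_to_sfasta(fasta_text):
    """Merge each FASTA record into one '>description##sequence' line."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append(header + "##" + "".join(chunks))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append(header + "##" + "".join(chunks))
    return records


example = ">seq_1\nMRNRGFGRRELL\nVAMAMLVSVTG\n>seq_2\nACGRGFGVKLAC\n"
for record in fasta_to_sfasta(example):
    print(record)
```

Each output line keeps the ">" description, followed by "##" and the full sequence on the same line.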

Title
Description

 

pro2aac (To calculate amino acid composition of protein)

The amino acid composition of a protein is simply the percentage of each amino acid present in that protein. The aim of calculating the composition is to transform protein sequences of variable length into fixed-length feature vectors. This is a crucial step when classifying proteins with machine-learning techniques, because these techniques require fixed-length patterns. In addition, converting a protein sequence to a vector of 20 dimensions using amino acid composition encapsulates the properties of the protein in that vector.

 

The composition of each of the 20 natural amino acids was calculated using the following equation:

$$\text{Composition of amino acid } i = \frac{\text{Total number of amino acid } i}{\text{Total number of all amino acids in protein}} \times 100$$

where i can be any amino acid.

Usage

pro2aac -i seq.sfa -o seq.out

-i

Input file name (sequences in single fasta format)

-o

Output file name (amino acid composition)

seq.sfa

>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE

>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF

seq.out

# Amino Acid Composition of proteins

# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,

19.18, 1.37, 4.11, 5.48, 2.74, 9.59, 1.37, 1.37, 0.00, 9.59, 4.11, 1.37, ...... 2.74,

19.18, 6.85, 5.48, 2.74, 6.85, 8.22, 1.37, 1.37, 1.37, 5.48, 4.11, 4.11, ...... 2.74,

 

Vector

20 dimensions (i.e. one composition value for each of the 20 amino acids)
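As a minimal sketch of the equation above (not the pro2aac program itself), the composition can be computed as follows. The function name `amino_acid_composition` is hypothetical; the column order follows the header line of the sample output, and residues outside the 20 standard letters are counted in the total but get no column, which is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order of the sample output


def amino_acid_composition(sequence):
    """Return the percentage of each amino acid, in AMINO_ACIDS order."""
    sequence = sequence.upper()
    total = len(sequence)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * sequence.count(aa) / total for aa in AMINO_ACIDS]


seq_1 = ("MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLG"
         "VYVPATGTTAAIE")
print(", ".join(f"{value:.2f}" for value in amino_acid_composition(seq_1)))
```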

 

 

Title
Description
 
pro2aac_nt (To calculate amino acid composition of N-terminal (nt) residues of a protein)
It is well known that some proteins have an N-terminal signal sequence that is responsible for transporting the whole protein into a specific subcellular compartment such as the lysosome, endoplasmic reticulum, mitochondria, or chloroplast. Evidence indicates that divergent N-terminal sequences also influence the catalytic behavior, protein-protein interactions, and intracellular distribution of enzymes. Reports show that the N-terminal signal sequence can vary from 13 to 36 amino acid residues in length and carries all the information needed to localize the protein to a specific location. Therefore, N-terminal information can be exploited through amino acid composition features to predict the subcellular localization of a protein. For example:

[Figure: schematic of a protein from N- to C-terminus with the first 5 N-terminal residues (5 nt) marked]
 
Usage
pro2aac_nt -i seq.sfa -o seq.out -n 5
-i
Input file name
-o
Output file name
-n
Number of residues to calculate composition from N-terminal
seq.sfa
>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.out
# Amino Acid Composition of 5 n-terminal residues of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
 0.00, 0.00, 0.00, 0.00, 0.00, 20.00, 0.00, 0.00, 0.00, 0.00, 20.00 ..... 0.00,
20.00, 20.00, 0.00, 0.00, 0.00, 40.00, 0.00, 0.00, 0.00, 0.00,  ..... 0.00,
Vector
20 dimension

 

 

 

Title
Description
 
pro2aac_ct (To calculate amino acid composition of C-terminal (ct) residues of a protein)
While the N-terminus of a protein often contains targeting signals, the C-terminus can contain retention signals for protein sorting. The most common ER retention signal is the amino acid sequence -KDEL (or -HDEL) at the C-terminus, which keeps the protein in the endoplasmic reticulum and prevents it from entering the secretory pathway. The C-terminus of proteins can also be modified post-translationally, most commonly by the addition of a lipid anchor that allows the protein to be inserted into a membrane without having a transmembrane domain. The C-terminal domain of RNA polymerase II typically consists of up to 52 repeats of the sequence Tyr-Ser-Pro-Thr-Ser-Pro-Ser. Other proteins often bind this C-terminal domain in order to activate polymerase activity. The domain is involved in the initiation of DNA transcription, the capping of the RNA transcript, and attachment to the spliceosome for RNA splicing. Therefore, the information at the C-terminus can be exploited through amino acid composition features to predict different classes of proteins. For example:

[Figure: schematic of a protein from N- to C-terminus with the last 5 C-terminal residues marked]
 
Usage
pro2aac_ct -i seq.sfa -o seq.out -n 5
-i
Input file name
-o
Output file name
-n
Number of residues to calculate composition from C-terminal
seq.sfa
>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.out
# Amino Acid Composition of 5 c-terminal residues of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
40.00, 0.00, 0.00,20.00, 0.00, 0.00, 0.00,20.00, 0.00, 0.00, 0.00, 0.00, 0.00, .... 0.00,
 0.00,20.00, 0.00, 0.00,40.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ...... 0.00,
Vector
20 dimension

 

 

 

Title
Description
 
pro2aac_rest (To calculate amino acid composition of a protein after removing N-, and C-terminal residues)
This program is used to calculate the composition of the remaining part of a protein after removing a specified number of residues from the N- and C-termini. Transmembrane proteins, for example, carry a membrane-spanning signal in the middle of the protein. This program can be used to calculate the amino acid composition of the middle part, which has been used successfully to classify protein families. For example:

[Figure: schematic of a protein from N- to C-terminus with 5 residues removed at each terminus]
 
 
Usage
pro2aac_rest -i seq.sfa -o seq.out -n 5 -c 5
-i
Input file name
-o
Output file name
-n
Number of residues removed from N-terminal
-c
Number of residues removed from C-terminal
seq.sfa
>seq_1##AAAAACCCCCGGGGG
>seq_2##CCCGCAAAAASNMKL
seq.out
# Amino Acid Composition of protein after removing 5 n-terminal and 5 c-terminal residues
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
 0.00, 100.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..... 0.00,
100.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ..... 0.00,
Vector
20 dimension
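The N-terminal, C-terminal, and "rest" variants differ only in which part of the sequence is passed to the composition calculation. The following sketch shows that slicing step under the assumption that `-n` and `-c` count residues from the respective ends; the function names are hypothetical and this is not the code of the programs themselves.

```python
def n_terminal(sequence, n):
    """First n residues of the sequence."""
    return sequence[:n]


def c_terminal(sequence, n):
    """Last n residues of the sequence."""
    return sequence[-n:] if n > 0 else ""


def rest(sequence, n, c):
    """Sequence after removing n N-terminal and c C-terminal residues."""
    end = max(len(sequence) - c, n)  # empty result if nothing remains
    return sequence[n:end]


print(n_terminal("AAAAACCCCCGGGGG", 5))
print(c_terminal("AAAAACCCCCGGGGG", 5))
print(rest("AAAAACCCCCGGGGG", 5, 5))
```

The resulting fragment can then be fed to any composition routine; for the example sequence the middle part consists only of C, which agrees with the 100.00 in the C column of the sample output.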

 

 

 

Title
Description
 
pro2aac_split (To calculate split amino acid composition (SSAC) of a protein)
It has been reported that some sequence motifs occur in specific regions of a protein. Therefore, instead of computing the composition of the whole sequence, it is useful to split the sequence into equal parts. The composition of each part is calculated separately and the results are joined into one vector, so that region-specific features are retained. Some reports show that this strategy increases prediction accuracy. The advantage of SSAC over standard amino acid composition is that it gives greater weight to proteins that have a signal at either the N- or C-terminus. For example:

[Figure: schematic of a protein from N- to C-terminus divided into equal parts]
Usage
pro2aac_split -i seq.sfa -o seq.out -n 3
-i
Input file name
-o
Output file name
-n
Number of parts to split the protein into (here 3, i.e. three equal parts of the whole protein)
seq.sfa
>seq_1##AAAAACCCCCGGGGG
>seq_2##CCCGCAAAAASNMKL
seq.out
# Amino Acid Composition of 3 equal parts of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
 0.00, 4.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 1.00, 1.00, 1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00,
Vector
60 dimension (20*3 parts)
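A minimal sketch of the split calculation follows. Two points are assumptions rather than documented behavior of pro2aac_split: the values in the sample output above look like residue counts per part (5.00 for five residues) rather than percentages, so the sketch returns counts, and when the length is not divisible by the number of parts the remainder is placed in the last part. All names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_equal(sequence, parts):
    """Split into `parts` pieces; any remainder goes to the last piece."""
    size = len(sequence) // parts
    pieces = [sequence[i * size:(i + 1) * size] for i in range(parts - 1)]
    pieces.append(sequence[(parts - 1) * size:])
    return pieces


def split_counts(sequence, parts):
    """Concatenate the 20 per-residue counts of every piece."""
    vector = []
    for piece in split_equal(sequence, parts):
        vector.extend(float(piece.count(aa)) for aa in AMINO_ACIDS)
    return vector


vector = split_counts("AAAAACCCCCGGGGG", 3)
print(len(vector))                      # 20 values per part
print(vector[0], vector[21], vector[45])  # A in part 1, C in part 2, G in part 3
```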

 

 

 

Title
Description
 
pro2dpc (To calculate dipeptide composition of protein)
 
The dipeptide composition of a protein is simply the percentage of each pair of adjacent amino acids present in that protein. The aim of calculating the composition is to transform protein sequences of variable length into fixed-length feature vectors. This is a crucial step when classifying proteins with machine-learning techniques, because these techniques require fixed-length patterns. In addition, converting a protein sequence to a vector of 400 dimensions using dipeptide composition encapsulates the properties of neighboring amino acids.

 
 
The composition of each of the 400 dipeptides was calculated using the following equation:
$$\text{dpep}(i) = \frac{D_i}{\text{Total number of dipeptides in the protein}} \times 100$$
 
where dpep(i) is the percentage composition of dipeptide type i and D_i is the number of dipeptides of type i. A protein of N residues contains N - 1 overlapping dipeptides; in the example that follows, the 15-residue sequence has 14 dipeptides, of which 4 are AA (28.571%).
Usage
pro2dpc -i seq.sfa -o seq.out
-i
Input file name
-o
Output file name
seq.sfa
>seq_2##AAAAACCCCCGGGGG
seq.out
#AA , AC , AD ,….., CC ,…..,  CG , ….. , GG ,….., YY,
28.571, 7.143, 0.000, ….., 28.571,….., 7.143, ….., 28.571,….., 0.000
Vector
400 dimension (20*20)
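The sample values above can be reproduced with a short sketch that counts overlapping dipeptides. This is an illustration, not the pro2dpc program; `dipeptide_composition` is a hypothetical name, and pairs containing non-standard residues are counted in the total but have no column, which is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# 400 column labels in the order AA, AC, AD, ..., YY
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Percentage of each of the 400 dipeptides among overlapping pairs."""
    total = len(sequence) - 1
    if total < 1:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


composition = dipeptide_composition("AAAAACCCCCGGGGG")
for label in ("AA", "AC", "CC", "CG", "GG"):
    print(label, round(composition[DIPEPTIDES.index(label)], 3))
```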

Title
Description
 
pro2dpc_nt (To calculate dipeptide composition of N-terminal (nt) residues of a protein)
It is well known that some proteins have an N-terminal signal sequence that is responsible for transporting the whole protein into a specific subcellular compartment such as the lysosome, endoplasmic reticulum, mitochondria, or chloroplast. Evidence indicates that divergent N-terminal sequences also influence the catalytic behavior, protein-protein interactions, and intracellular distribution of enzymes. Reports show that the N-terminal signal sequence can vary from 13 to 36 amino acid residues in length and carries all the information needed to localize the protein to a specific location. Therefore, N-terminal information can be exploited through dipeptide composition features to predict the subcellular localization of a protein.
Usage
pro2dpc_nt -i seq.sfa -o seq.out -n 5
-i
Input file name
-o
Output file name
-n
Number of residues to calculate dipeptide composition from N-terminal
seq.sfa
>seq_2##AAAAACCCCCGGGGG
seq.out
# Dipeptide composition of 5 n-terminal residues of proteins
#AA , AC , AD ,….., CC ,…..,  CG , ….. , GG ,….., YY,
100.00, 0.000, 0.000, ….., 00.000,….., 0.000, ….., 00.000,….., 0.000
Vector
400 dimension

Title
Description
 
pro2dpc_ct (To calculate dipeptide composition of C-terminal (ct) residues of a protein)
 
While the N-terminus of a protein often contains targeting signals, the C-terminus can contain retention signals for protein sorting. The most common ER retention signal is the amino acid sequence -KDEL (or -HDEL) at the C-terminus, which keeps the protein in the endoplasmic reticulum and prevents it from entering the secretory pathway. The C-terminus of proteins can also be modified post-translationally, most commonly by the addition of a lipid anchor that allows the protein to be inserted into a membrane without having a transmembrane domain. The C-terminal domain of RNA polymerase II typically consists of up to 52 repeats of the sequence Tyr-Ser-Pro-Thr-Ser-Pro-Ser. Other proteins often bind this C-terminal domain in order to activate polymerase activity. The domain is involved in the initiation of DNA transcription, the capping of the RNA transcript, and attachment to the spliceosome for RNA splicing. Therefore, the information at the C-terminus can be exploited through dipeptide composition features to predict different classes of proteins.
Usage
pro2dpc_ct -i seq.sfa -o seq.out -n 5
-i
Input file name
-o
Output file name
-n
Number of residues to calculate dipeptide composition from C-terminal
seq.sfa
>seq_2##AAAAACCCCCGGGGG
seq.out
# Dipeptide composition of 5 c-terminal residues of proteins
#AA , AC , AD ,….., CC ,…..,  CG , ….. , GG ,….., YY,
0.000, 0.000, 0.000, ….., 00.000,….., 0.000, ….., 100.000,….., 0.000
Vector
400 dimension

Title
Description
 
pro2tpc (To calculate tripeptide composition of protein)
The tripeptide composition of a protein is simply the percentage of each run of three adjacent amino acids present in that protein. The aim of calculating the composition is to transform protein sequences of variable length into fixed-length feature vectors. This is a crucial step when classifying proteins with machine-learning techniques, because these techniques require fixed-length patterns. In addition, converting a protein sequence to a vector of 8000 dimensions using tripeptide composition encapsulates the properties of neighboring amino acids.
 

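The composition of each of the 8000 tripeptides was calculated using the following equation:
 
$$\text{tripep}(i) = \frac{T_i}{\text{Total number of tripeptides in the protein}} \times 100$$
where tripep(i) is the percentage composition of tripeptide type i and T_i is the number of tripeptides of type i. A protein of N residues contains N - 2 overlapping tripeptides.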
Usage
pro2tpc -i seq.sfa -o seq.out
-i
Input file name
-o
Output file name
seq.sfa
>seq_2##AAAAACCCCCGGGGG
seq.out
# Tripeptide Composition of Protein
#AAA ,AAC ,AAD ,AAE ,AAF , …..,,YYW ,YYY
23.0769 , 7.6923, 0.000, 00.000, 0.000, ….., 0.000 , 0.000
Vector
8000 dimension
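Because most of the 8000 tripeptide columns are zero for any single protein, a sketch of the calculation can store only the tripeptides that actually occur; every tripeptide missing from the result has composition 0 in the full vector. This sparse representation and the name `tripeptide_composition` are illustrative choices, not how pro2tpc is implemented.

```python
from collections import Counter


def tripeptide_composition(sequence):
    """Return {tripeptide: percentage} for tripeptides present in the sequence."""
    total = len(sequence) - 2  # number of overlapping tripeptides
    if total < 1:
        return {}
    counts = Counter(sequence[i:i + 3] for i in range(total))
    return {tri: 100.0 * n / total for tri, n in counts.items()}


composition = tripeptide_composition("AAAAACCCCCGGGGG")
for tri in ("AAA", "AAC"):
    print(tri, round(composition[tri], 4))
```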

Title
Description
 
add_cols (To add columns of two files)
This program is used to build hybrid methods: two different features of a sequence (e.g. amino acid composition and dipeptide composition) are combined into a more informative hybrid feature vector.
Usage
add_cols -i se1.out -c se2.out  -o seq.out
-i
First input file (columns to be added)
-c
Second input file (columns to be added)
-o
Output file name
se1.out
Amino Acid Composition of proteins
# A , C , D , E , F , G , ….., Y,
33.33,33.33, 0.00, 0.00, 0.00,33.33, ….., 0.00
se2.out
# Dipeptide Composition of Protein
#AA , AC , ….., YY
28.571,7.143…..,0.00
seq.out
# Amino Acid Composition of proteins  # Dipeptide Composition of Protein
# A , C , D , E , F , G , ….,  #AA , AC, .…, YY
33.33,33.33, 0.00, 0.00, 0.00, 33.33,……, 28.571,7.143…..,0.00
Vector
420 (20 for amino acid + 400 dipeptide composition)
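Judging from the sample output, "adding" columns here means placing the columns of the second file after those of the first, row by row. A minimal sketch of that operation, with hypothetical helper names and the assumption that both files list the same sequences in the same order:

```python
def parse_row(line):
    """Parse one comma-separated data line, ignoring empty trailing fields."""
    return [float(field) for field in line.split(",") if field.strip()]


def add_columns(rows_a, rows_b):
    """Append the columns of rows_b to the matching rows of rows_a."""
    if len(rows_a) != len(rows_b):
        raise ValueError("both files must contain the same number of rows")
    return [a + b for a, b in zip(rows_a, rows_b)]


aac = [parse_row("33.33,33.33, 0.00,")]
dpc = [parse_row("28.571,7.143,")]
print(add_columns(aac, dpc))
```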

 

 

 

 

Title
Description
 
col_mult (To multiply each column of an input file by a number)
This program multiplies each column of the input file by a specific number. It is used especially in hybrid methods to give the features comparable weight. Suppose one wants to make a hybrid file of amino acid and dipeptide composition. If the two compositions are combined directly, the amino acid composition values are very high relative to the dipeptide composition values. The performance of the SVM will then be nearly the same as with amino acid composition alone, because the weight of the dipeptide features is diluted. If instead the amino acid composition is multiplied by 0.1 (or the dipeptide composition by 10) before the two are combined, there is a chance that performance will increase.
Usage
col_mult -i se1.out -o se1_mult -n 0.1
-i
Input file name
-o
Output file name
-n
Number by which each column is multiplied
se1.out
Amino Acid Composition of proteins
# A , C , D , E , F , G , ….., Y,
33.33, 33.33, 0.00, 0.00, 0.00, 33.33, ….., 0.00
se1_mult
3.333000, 3.333000, 0.000000, 0.000000, 0.000000, 3.333000,….. , 0.000000,
Vector
Same as in input file

Title
Description
 
col2svm (To generate SVM_light input format)
This program converts a composition output file into the format used for SVM training. In SVM_light format, (1) each line starts with +1 or -1, denoting the positive or negative class of the sequence, and (2) each value is preceded by its running index, written as index:value.
Usage
col2svm -i se1.out -o svm.out -s +1
-i
Input file name
-o
Output file name
-s
Class label for SVM (+1 or -1)
se1.out
Amino Acid Composition of proteins
# A, C, D, E, F, G,… Y,
33.33, 33.33, 0.00, 0.00, 0.00, 33.33, ….., 0.00
svm.out
+1 1:33.330000 2:33.330000 3:0.000000 4:0.000000 5:0.000000 6:33.330000 ……. 20:0.000000
Vector
20 dimension
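The line layout shown in svm.out can be sketched as follows. The function name `to_svm_light` is hypothetical, and the six-decimal formatting simply mirrors the sample output above.

```python
def to_svm_light(values, label):
    """Format one feature row as '<label> 1:v1 2:v2 ...'."""
    features = " ".join(
        f"{index}:{value:.6f}" for index, value in enumerate(values, start=1)
    )
    return f"{label} {features}"


print(to_svm_light([33.33, 33.33, 0.0, 0.0, 0.0, 33.33], "+1"))
```

Positive and negative examples are written with labels +1 and -1 respectively, one sequence per line.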

Title
Description
 
col_mult_sel (To multiply selected columns by a number)
Instead of multiplying every column, this program multiplies only a selected range of columns by a specific number; in the example, columns 1 to 3 are multiplied by 10.
Usage
col_mult_sel -i se1.out -o se1_mult -n 10 -a 1 -b 3
-i
Input file name
-o
Output file name
-n
Number by which the selected columns are multiplied
-a
Number of the first column (e.g. 1)
-b
Number of the last column (e.g. 3)
se1.out
Amino Acid Composition of proteins
# A , C , D , E , F , G , ….., Y,
33.33, 33.33, 0.00, 0.00, 0.00, 33.33, ….., 0.00
se1_mult
333.300000, 333.300000, 0.000000, 0.000000, 0.000000, 33.330000,….., 0.000000
Vector
Same as in input file
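A sketch of the column scaling follows. It assumes 1-based, inclusive column numbers, which matches the example where `-a 1 -b 3` changes the first three values; multiplying every column, as col_mult does, is the special case where the range covers the whole row. The function name is hypothetical.

```python
def multiply_columns(row, factor, first=1, last=None):
    """Multiply columns first..last (1-based, inclusive) of a row by factor."""
    if last is None:
        last = len(row)  # default: scale every column
    return [
        value * factor if first <= index <= last else value
        for index, value in enumerate(row, start=1)
    ]


row = [33.33, 33.33, 0.0, 0.0, 0.0, 33.33]
print(multiply_columns(row, 10, first=1, last=3))
print(multiply_columns(row, 0.1))
```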

 

 

 

 

Title
Description
 
col_rem (To remove selective columns from a file)
This program removes specific columns from a file. For example, you can remove the composition of A, C, and D from the whole file to check the importance of these amino acids in a prediction method.
Usage
perl col_rem -i seq.out -o seq.rm -a 1 -b 3
-i
Input file name
-o
Output file name
-a
Number of the first column to remove (e.g. 1)
-b
Number of the last column to remove (e.g. 3)
seq.out
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
18.60, 2.33, 4.65, 5.81, 5.81, 8.14, 1.16, 1.16, 0.00, 8.14, 3.49, 1.16, 3.49, 0.00, 13.95, 4.65, 8.14, 5.81, 0.00, 3.49,
seq.rm
5.810000,5.810000,8.140000,1.160000,1.160000,0.000000,8.140000,3.490000, 1.160000,3.490000,0.000000,13.950000,4.650000,8.140000,5.810000,0.000000,3.490000
Vector
Total number = total number of columns in the input file - number of removed columns
E.g. 17 = 20 - 3

 

 

 

Title
Description
 
col_ext (To extract selective columns from a file)
This program extracts only specific columns from a file. In this example we take only the amino acid composition of F, G, H, I, K, and L (columns 5 to 10) as input for the SVM.
Usage
col_ext -i seq.out -o seq.ext -a 5 -b 10
-i
Input file name
-o
Output file name
-a
Number of the first column to extract (e.g. 5)
-b
Number of the last column to extract (e.g. 10)
seq.out
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
18.60, 2.33, 4.65, 5.81, 5.81, 8.14, 1.16, 1.16, 0.00, 8.14, 3.49, 1.16, 3.49, 0.00, 13.95, 4.65, 8.14, 5.81, 0.00, 3.49,
seq.ext
5.81, 8.14, 1.16, 1.16, 0.00, 8.14
Vector
Total number = number of columns selected from the input file
E.g. 6 = columns 5 to 10
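Extracting and removing a column range are complementary slices of the same row. The sketch below assumes 1-based, inclusive column numbers, which is consistent with the sample outputs (columns 5 to 10 give six values; removing three columns from 20 leaves 17). The function names are hypothetical.

```python
def extract_columns(row, first, last):
    """Keep only columns first..last (1-based, inclusive)."""
    return row[first - 1:last]


def remove_columns(row, first, last):
    """Drop columns first..last (1-based, inclusive) and keep the rest."""
    return row[:first - 1] + row[last:]


row = [18.60, 2.33, 4.65, 5.81, 5.81, 8.14, 1.16, 1.16, 0.00, 8.14,
       3.49, 1.16, 3.49, 0.00, 13.95, 4.65, 8.14, 5.81, 0.00, 3.49]
print(extract_columns(row, 5, 10))
print(len(remove_columns(row, 1, 3)))
```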

 

 

 

Title
Description
 
col_corr (To compute the correlation coefficient between two columns)
The correlation coefficient indicates the strength and direction of a linear relationship between two variables. It varies between -1 and 1. The closer the coefficient is to either -1 or 1, the stronger the correlation between the variables: a value of 1 indicates a perfectly increasing linear relationship, -1 a perfectly decreasing linear relationship, and 0 no linear correlation. The example shows the correlation between the composition of amino acids A and G in the file.
Usage
col_corr -i pos -a 1 -b 6
-i
Input file name
-a
Number of the first column to compare (e.g. 1)
-b
Number of the second column to compare (e.g. 6)
pos
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95, 10.10, 1.95, 2.28, 5.54, 2.28, 8.14, 3.91, 8.47, 6.84, 0.98, 2.93,
15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95, 10.10, 1.95, 2.28, 5.54, 2.28, 8.14, 3.91, 8.47, 6.84, 0.98, 2.93,
12.83, 1.60, 8.56, 2.14, 2.67, 6.95, 6.95, 2.67, 1.60, 9.09, 3.74, 3.74, 6.42, 6.42, 4.28, 5.35, 6.42, 5.35, 1.07, 2.14,
12.30, 1.60, 8.56, 2.14, 2.67, 6.95, 6.95, 2.67, 1.60, 9.09, 3.74, 3.74, 6.42, 6.42, 4.28, 5.35, 6.42, 5.88, 1.07, 2.14,
13.76, 0.53, 4.76, 4.23, 3.70, 5.29, 1.06, 3.17, 2.12, 8.47, 3.17, 3.70, 13.76, 1.59, 4.76, 9.52, 8.99, 5.82, 1.06, 0.53,
output
0.749 (a positive correlation between column 1 and 6)
Vector
Not applicable (the output is a single correlation coefficient)
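The description above matches the Pearson correlation coefficient, so the sketch below uses that formula. Whether col_corr applies exactly this estimator is an assumption, and the sketch is not claimed to reproduce the reported value digit for digit; the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


column_a = [15.31, 15.31, 12.83, 12.30, 13.76]  # column 1 (A) of file pos
column_g = [9.45, 9.45, 6.95, 6.95, 5.29]       # column 6 (G) of file pos
print(round(pearson(column_a, column_g), 3))
```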

 

 

Title
Description
 
col_avg (To calculate the column-wise average of two files)
Here the value in each column of one file is added to the value in the corresponding column of another file, and the mean is calculated. This can be used to generate average features from two different files (belonging to the same protein family) as input for machine learning techniques. For instance, 15.31 (1st column of file pos1) + 6.87 (1st column of file pos2) = 22.18, and 22.18/2 = 11.09 (file out).
Note: Both files must have the same number of columns and rows.
Usage
col_avg -a pos1 -b pos2 -o out
-a
First input file name
-b
Second input file name
-o
Output file name
pos1
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95,10.10, 1.95, 2.28, 5.54, 2.28, \
8.14, 3.91, 8.47, 6.84, 0.98, 2.93,
15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95,10.10, 1.95, 2.28, 5.54, 2.28, \
8.14, 3.91, 8.47, 6.84, 0.98, 2.93,
12.83, 1.60, 8.56, 2.14, 2.67, 6.95, 6.95, 2.67, 1.60, 9.09, 3.74, 3.74, 6.42, 6.42, \
4.28, 5.35, 6.42, 5.35, 1.07, 2.14,
pos2
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
6.87, 1.29, 6.87, 2.15, 0.86, 6.87, 1.72, 4.72, 5.58, 9.87, 1.29, 5.15, 4.72, 5.15, 1.72,13.73, 9.01,10.30, 1.72, 0.43,
9.87, 1.29, 7.30, 3.00, 0.86, 8.58, 1.29, 4.72, 4.29, 9.87, 0.86, 3.43, 5.15, 4.29, 3.43,11.16, 7.73, 10.73, 1.72, 0.43,
12.64, 1.10, 4.40, 1.10, 2.75, 9.34, 0.55, 6.59, 3.30, 9.34, 2.20, 6.59, 6.04, 5.49, 4.40, 6.59, 7.14, 9.34, 0.55, 0.55,
out
11.09; 1.295; 7.18; 3.03; 1.57; 8.16; 1.675; 3.99; 3.765; 9.985; 1.62; 3.715; 5.13; 3.715; 4.93; 8.82; 8.74; 8.57; 1.35; 1.68; 0
12.59; 1.295; 7.395; 3.455; 1.57; 9.015; 1.46; 3.99; 3.12; 9.985; 1.405; 2.855; 5.345; 3.285; 5.785; 7.535; 8.1; 8.785; 1.35; 1.68; 0
12.735; 1.35; 6.48; 1.62; 2.71; 8.145; 3.75; 4.63; 2.45; 9.215; 2.97; 5.165; 6.23; 5.955; 4.34; 5.97; 6.78; 7.345; 0.81; 1.345; 0
Vector
Total number of columns (same as in the input files)
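The averaging step can be sketched as an element-wise mean of two tables with identical shape. The function name `average_rows` is hypothetical; the shape check reflects the note that both files need the same number of rows and columns.

```python
def average_rows(rows_a, rows_b):
    """Element-wise mean of two tables with the same number of rows and columns."""
    if len(rows_a) != len(rows_b):
        raise ValueError("files must have the same number of rows")
    averaged = []
    for a, b in zip(rows_a, rows_b):
        if len(a) != len(b):
            raise ValueError("rows must have the same number of columns")
        averaged.append([(x + y) / 2 for x, y in zip(a, b)])
    return averaged


pos1 = [[15.31, 1.30, 7.49]]
pos2 = [[6.87, 1.29, 6.87]]
print([round(value, 3) for value in average_rows(pos1, pos2)[0]])
```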

 

 

Title
Description
 
col_sig (To compute the significance of columns in two column files)
This program calculates the significance of each column across two different files, for example to compare the positive and negative files of amino acid composition. For each column the output gives the percentage difference between positive and negative, its significance, the average of the column in the positive file and in the negative file, and the corresponding standard deviations.
Usage
col_sig -i file1 -j file2 >out
-i
Input file1 of positive example
-j
Input file2 of negative example
file1
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y
7.88,1.27,4.05,6.39,3.62,7.88,2.13,6.61,5.75,10.23,3.62,2.98,3.41,5.11,5.54,6.39,4.47,7.03,1.70,3.83
5.46,1.12,7.14,6.72,3.78,5.46,1.68,4.76,6.58,10.64,0.98,7.42,2.52,4.06,4.90,9.38,8.54,5.04,0.98,2.80
8.96,2.06,4.82,7.58,4.13,6.20,0.69,4.82,7.58,13.10,1.37,2.75,4.82,8.96,6.89,4.13,2.75,6.20,0.00,2.64
file2
# Amino Acid Composition of proteins
# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y
8.55,0.65,3.94,9.86,2.63,8.55,0.00,3.28,11.84,13.81,2.63,3.28,1.31,3.94,3.94,6.57,1.31,8.55,0.65,1.89
13.29,2.53,3.16,2.53,5.69,14.55,1.89,5.69,3.16,6.32,3.16,3.16,6.96,2.53,6.32,6.32,2.53,6.32,3.16,2.18
9.88,0.48,7.45,8.91,5.18,3.56,0.48,4.70,11.02,9.88,2.10,6.48,3.89,4.21,3.24,5.99,5.02,2.91,0.16,4.87
out
# Parameters Measured: % Difference, Significance, Average1, Average2, Standard Deviation(SD1), SD2
Column 1: -34.83, -490.31, 7.43, 10.57, 0.88, 0.39
Column 2: 19.44, 69.33, 1.48, 1.22, 0.33, 0.42
Column 3: 9.51, 53.98, 5.34, 4.85, 0.29, 1.50
Column 4: -2.89, -28.15, 6.90, 7.10, 0.39, 1.04
Column 5: -15.71, -234.15, 3.84, 4.50, 0.16, 0.39
Column 6: -30.79, -145.77, 6.51, 8.89, 0.18, 3.07
Column 7: 61.49, 218.36, 1.50, 0.79, 0.46, 0.17
Column 8: 16.83, 408.83, 5.4, 4.56, 0.33, 0.07
Column 9: -26.55, -214.22, 6.64, 8.67, 0.54, 1.35
Column 10: 12.34, 240.14, 11.32, 10.01, 1.02, 0.07
Column 11: -27.64, -193.90, 1.99, 2.63, 0.35, 0.30
Column 12: 1.76, 6.98, 4.38, 4.31, 0.94, 1.25
Column 13: -12.27, -115.47, 3.58, 4.05, 0.71, 0.09
Column 14: 51.68, 241.21, 6.04, 3.56, 1.68, 0.37
Column 15: 24.79, 185.57, 5.78, 4.50, 0.64, 0.73
Column 16: 5.22, 41.72, 6.63, 6.30, 1.44, 0.17
Column 17: 56.04, 174.63, 5.26, 2.95, 1.44, 1.19
Column 18: 2.69, 17.94, 6.09, 5.93, 0.06, 1.74
Column 19: -38.94, -72.75, 0.89, 1.32, 0.516, 0.67
Column 20: 3.47, 15.63, 3.093, 2.98, 0.26, 1.09

 

 

 

 



Title
Description
 
seq2pssm_imp (To calculate PSSM matrix in column format without any normalization)
The PSSM for each sequence is generated by performing a PSI-BLAST search against a specific database (e.g. nr) for a given number of iterations (e.g. 3) with an e-value cut-off of 0.001. For a sequence of N residues, the PSSM is represented by an N × 20 matrix. Each element of this matrix, m[i, j], provides information on the evolutionary conservation of residue type j at sequence position i. For example:
 
 
 
 
Usage
seq2pssm_imp -i seq1.fa -o pssm.out -d nr
-i
Input file in fasta format (do not use single fasta format)
-o
Output file
-d
Database against which PSSM profile is generated
seq1.fa
>1BISA PDBID   CHAIN_SEQUENC
GSHMHGQVDCSPGIWQLDCTHLEGKVILVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTVHTDNGSNFTSTTVKAACEWAGIKQEFGIPYNPQSQGVIESMNKELK
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
……….

 

 

Title
Description
 
pssm_n1 (To normalize a PSSM profile using the formula 1/(1+e^(-x)))
The values of a PSSM matrix vary over a large range, which makes SVM training difficult. Thus every element of the PSSM is normalized using 1/(1+e^(-x)).
Various other formulae can also be used for normalization.
Usage
pssm_n1 -i pssm.out -o pssm_n1
-i
Input file containing the PSSM profile generated with (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o
Output file having normalized value
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
……….
pssm_n1
G, 0.5, 5.19e-131, 3.73e-44, 1.39e-87, 5.19e-131, 1, 1.39e-87, 1.93e-174, 1.39e-87, 1.93e-174, 5.19e-131, 0.5, 1.39e-87, 1.39e-87, 1.39e-87, 0.5, 1.39e-87, 5.19e-131, 1.39e-87, 5.19e-131
S, 1, 3.73e-44, 0.5, 0.5, 1.39e-87, 0.5, 3.73e-44, 1.39e-87, 0.5, 1.39e-87, 3.73e-44, 1, 3.73e-44, 0.5, 3.73e-44, 1, 1, 1.39e-87, 5.19e-131, 1.39e-87
H, 1.39e-87, 5.19e-131, 3.73e-44, 0.5, 3.73e-44, 1.39e-87, 1, 5.19e-131, 3.73e-44, 5.19e-131, 1.39e-87, 1, 1.39e-87, 0.5, 0.5, 3.73e-44, 1.39e-87, 5.19e-131, 1.39e-87, 1
M, 3.73e-44, 3.73e-44, 5.19e-131, 1.39e-87, 0.5, 5.19e-131, 1.39e-87, 1, 3.73e-44, 1, 1, 1.39e-87, 1.39e-87, 0.5, 3.73e-44, 3.73e-44, 3.73e-44, 1, 3.73e-44, 3.73e-44
H, 1.39e-87, 5.19e-131, 3.73e-44, 0.5, 3.73e-44, 1.39e-87, 1, 5.19e-131, 3.73e-44, 5.19e-131, 1.39e-87, 1, 1.39e-87, 0.5, 0.5, 3.73e-44, 1.39e-87, 5.19e-131, 1.39e-87, 1
G, 0.5, 5.19e-131, 3.73e-44, 1.39e-87, 5.19e-131, 1, 1.39e-87, 1.93e-174, 1.39e-87, 1.93e-174, 5.19e-131, 0.5, 1.39e-87, 1.39e-87, 1.39e-87, 0.5, 1.39e-87, 5.19e-131, 1.39e-87, 5.19e-131
Q, 3.73e-44, 5.19e-131, 0.5, 1, 5.19e-131, 1.39e-87, 0.5, 5.19e-131, 1, 1.39e-87, 0.5, 0.5, 3.73e-44, 1, 1, 0.5, 3.73e-44, 1.39e-87, 1.39e-87, 3.73e-44
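The 1/(1+e^(-x)) scaling can be sketched in Python as follows. The function names are hypothetical, and the two-branch form is a choice made here so that exp() is never called with a large positive argument; it is not taken from the pssm_n1 source.

```python
import math


def logistic(x):
    """Return 1 / (1 + e^(-x)) without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_n1(matrix):
    """Apply the logistic function to every element of a PSSM (list of rows)."""
    return [[logistic(v) for v in row] for row in matrix]


# A few scores taken from the sample profile above.
print(normalize_n1([[0, -100, -300, 599]])[0])
```

Because the raw scores are in the hundreds, most results land extremely close to 0 or 1, as the sample output shows.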
 

 

 

 

Title
Description
 
pssm_n2 (To normalize a PSSM profile using the (numb - min)/(max - min) formula)
The values in a PSSM span a wide range, which makes SVM training difficult. Every element numb of the PSSM is therefore normalized using (numb - min)/(max - min), where min and max are the smallest and largest PSSM values, so that the result lies between 0 and 1.
 
 
 
 
 
Usage
pssm_n2 -i pssm.out -o pssm_n2
-i
Input file containing the PSSM profile generated by seq2pssm_imp (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o
Output file containing the normalized values
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
……….
pssm_n2
G, 0.26, 0.06, 0.20, 0.13, 0.06, 0.66, 0.13, 0, 0.13, 0, 0.06, 0.26, 0.13, 0.13, 0.13, 0.26, 0.13, 0.06, 0.13, 0.06
S, 0.33, 0.20, 0.26, 0.26, 0.13, 0.26, 0.20, 0.13, 0.26, 0.13, 0.20, 0.33, 0.20, 0.26, 0.20, 0.53, 0.33, 0.13, 0.06, 0.13
H, 0.13, 0.06, 0.20, 0.26, 0.20, 0.13, 0.79, 0.06, 0.20, 0.06, 0.13, 0.33, 0.13, 0.26, 0.26, 0.20, 0.13, 0.06, 0.13, 0.40
M, 0.20, 0.20, 0.06, 0.13, 0.26, 0.06, 0.13, 0.33, 0.20, 0.40, 0.59, 0.13, 0.13, 0.26, 0.20, 0.20, 0.20, 0.33, 0.20, 0.20
H, 0.13, 0.06, 0.20, 0.26, 0.20, 0.13, 0.79, 0.06, 0.20, 0.06, 0.13, 0.33, 0.13, 0.26, 0.26, 0.20, 0.13, 0.06, 0.13, 0.40
G, 0.26, 0.06, 0.20, 0.13, 0.06, 0.66, 0.13, 0, 0.13, 0, 0.06, 0.26, 0.13, 0.13, 0.13, 0.26, 0.13, 0.06, 0.13, 0.06
Q, 0.20, 0.06, 0.26, 0.40, 0.06, 0.13, 0.26, 0.06, 0.33, 0.13, 0.26, 0.26, 0.20, 0.59, 0.33, 0.26, 0.20, 0.13, 0.13, 0.2
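A minimal Python sketch of the (numb - min)/(max - min) scaling is given below. The function name and its parameters are hypothetical. It assumes that min and max are taken over the whole profile rather than per row or per column; the sample values are consistent with min = -400 and max = 1099 (for instance (0 + 400)/1499 is about 0.2668), but the program itself may determine them differently.

```python
def minmax_scale(matrix, factor=1.0, lo=None, hi=None):
    """Scale every element to factor * (v - lo) / (hi - lo).

    lo and hi default to the smallest and largest value in the matrix
    (an assumption); pass them explicitly to reuse known extremes.
    """
    values = [v for row in matrix for v in row]
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        # A constant matrix has no range to scale over.
        return [[0.0 for _ in row] for row in matrix]
    return [[factor * (v - lo) / (hi - lo) for v in row] for row in matrix]


# Scores from the sample, scaled with the extremes implied by the sample output.
print(minmax_scale([[0, 599, -400]], lo=-400, hi=1099)[0])
```

Calling the same function with factor=100 gives the percentage form used by pssm_n3.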

 

 

 

 

 

Title
Description
 
pssm_n3 (To normalize a PSSM profile using the (numb - min)*100/(max - min) formula)
The values in a PSSM span a wide range, which makes SVM training difficult. Every element numb of the PSSM is therefore normalized using (numb - min)*100/(max - min), so that the result lies between 0 and 100.
Usage
pssm_n3 -i pssm.out -o pssm_n3
-i
Input file containing the PSSM profile generated by seq2pssm_imp (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o
Output file containing the normalized values
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
pssm_n3
G, 26.68, 6.67, 20.01, 13.34, 6.67, 66.64, 13.34, 0, 13.34, 0, 6.67, 26.68, 13.34, 13.34, 13.34, 26.68, 13.34, 6.67, 13.34, 6.67
S, 33.35, 20.01, 26.68, 26.68, 13.34, 26.68, 20.01, 13.34, 26.68, 13.34, 20.01, 33.35, 20.01, 26.68, 20.01, 53.36, 33.35, 13.34, 6.67, 13.34
H, 13.34, 6.67, 20.01, 26.68, 20.01, 13.34, 79.98, 6.67, 20.01, 6.67, 13.34, 33.35, 13.34, 26.68, 26.68, 20.01, 13.34, 6.67, 13.34, 40.02
M, 20.01, 20.01, 6.67, 13.34, 26.68, 6.67, 13.34, 33.35, 20.01, 40.02, 59.97, 13.34, 13.34, 26.68, 20.01, 20.01, 20.01, 33.35, 20.01, 20.01
H, 13.34, 6.67, 20.01, 26.68, 20.01, 13.34, 79.98, 6.67, 20.01, 6.67, 13.34, 33.35, 13.34, 26.68, 26.68, 20.01, 13.34, 6.67, 13.34, 40.02
G, 26.68, 6.67, 20.01, 13.34, 6.67, 66.64, 13.34, 0, 13.34, 0, 6.67, 26.68, 13.34, 13.34, 13.34, 26.68, 13.34, 6.67, 13.34, 6.67
Q, 20.01, 6.67, 26.68, 40.02, 6.67, 13.34, 26.68, 6.67, 33.35, 13.34, 26.68, 26.68, 20.01, 59.97, 33.35, 26.68, 20.01, 13.34

 

 

 

Title
Description
 
pssm_n4 (To normalize a PSSM profile using the 1/(1+e^(-x/100)) formula)
The values in a PSSM span a wide range, which makes SVM training difficult. Every element x of the PSSM is therefore normalized using 1/(1+e^(-x/100)); dividing by 100 first spreads the results over the range 0 to 1 instead of pushing them to the extremes.
Usage
pssm_n4 -i pssm.out -o pssm_n4
-i
Input file containing the PSSM profile generated by seq2pssm_imp (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o
Output file containing the normalized values
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
……….
pssm_n4
G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.0474258731775668, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.047, 0.11, 0.04
S, 0.73, 0.26, 0.5, 0.5, 0.11, 0.5, 0.26, 0.11, 0.5, 0.11, 0.26, 0.73, 0.26, 0.5, 0.26, 0.98, 0.73, 0.11, 0.04, 0.11
H, 0.11, 0.04, 0.26, 0.5, 0.26, 0.11, 0.99, 0.04, 0.26, 0.04, 0.11, 0.73, 0.119202922022118, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88
M, 0.26, 0.26, 0.04, 0.11, 0.5, 0.04, 0.11, 0.73, 0.26, 0.88, 0.99, 0.11, 0.11, 0.5, 0.26, 0.26, 0.26, 0.73, 0.26, 0.26
H, 0.11, 0.047, 0.26, 0.5, 0.26, 0.11, 0.99, 0.047, 0.26, 0.04, 0.11, 0.73, 0.11, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88
G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.04, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.04, 0.11, 0.04
Q, 0.26, 0.04, 0.5, 0.88, 0.04, 0.11, 0.5, 0.04, 0.73, 0.11, 0.5, 0.5, 0.26, 0.99, 0.73, 0.5, 0.26, 0.11, 0.11, 0.26
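The 1/(1+e^(-x/100)) scaling can be checked against the sample with a short Python sketch. The function name and the divisor parameter are hypothetical; only the formula comes from the description above.

```python
import math


def normalize_n4(matrix, divisor=100.0):
    """Apply 1 / (1 + e^(-x / divisor)) to every element of a PSSM.

    With scores in the range of the sample (a few hundred to about a
    thousand), x / divisor stays small enough for math.exp to be safe.
    """
    return [[1.0 / (1.0 + math.exp(-v / divisor)) for v in row] for row in matrix]


# Scores from the sample profile: 0, -300, -200 and 200.
print(normalize_n4([[0, -300, -200, 200]])[0])
```

The unrounded entries that appear in the sample output (0.0474258731775668 for a score of -300 and 0.119202922022118 for a score of -200) agree with this formula; the remaining sample values look like the same numbers cut to two decimals.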

 

 

 

Title
Description
 
pssm_comp (To compute PSSM composition (400 values))
Here the PSSM is converted into a vector of dimension 400 by computing, for each of the 20 amino acid types in the protein sequence, the composition of the scores for each of the 20 amino acid types in the profile. Each PSSM column therefore contributes 20 values instead of one. Every element of this vector is then divided by the length of the sequence. The resulting vector of 400 elements is used as the input feature for SVM.
Usage
pssm_comp -i pssm_n4 -o pssm_n4.out
-i
Input file containing the PSSM profile generated by seq2pssm_imp (seq2pssm_imp -i seq1.fa -o pssm.out -d nr.02) and then scaled by pssm_n4 (pssm_n4.pl -i pssm.out -o pssm_n4)
-o
Output file containing the 400 elements
pssm_n4
G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.04, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.04, 0.11, 0.04
S, 0.73, 0.26, 0.5, 0.5, 0.11, 0.5, 0.26, 0.11, 0.5, 0.11, 0.26, 0.73, 0.26, 0.5, 0.26, 0.98, 0.73, 0.11, 0.04, 0.11
H, 0.11, 0.04, 0.26, 0.5, 0.26, 0.11, 0.99, 0.047, 0.26, 0.047, 0.11, 0.73, 0.11, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88
M, 0.26, 0.26, 0.04, 0.11, 0.5, 0.04, 0.11, 0.73, 0.26, 0.88, 0.99, 0.11, 0.11, 0.5, 0.26, 0.26, 0.26, 0.73, 0.26, 0.26
H, 0.11, 0.04, 0.26, 0.5, 0.26, 0.11, 0.99, 0.04, 0.26, 0.04, 0.11, 0.73, 0.11, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88
G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.04, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.04, 0.11, 0.04
Q, 0.26, 0.04, 0.5, 0.88, 0.04, 0.11, 0.5, 0.04, 0.73, 0.11, 0.5, 0.5, 0.26, 0.99, 0.73, 0.5, 0.26, 0.11, 0.11, 0.26
pssm_n4.out
0.98, 0.50, 0.11, 0.26, 0.11, 0.50, 0.11, 0.26, 0.26, 0.26, 0.26, 0.11, 0.26, 0.26894142, 0.26, 0.73, 0.50, 0.50, 0.04, 0.11, 0.50, 0.99, 0.04, 0.01, 0.11, 0.04, 0.04, 0.26, 0.04, 0.26, 0.26, 0.04, 0.04, 0.04, 0.04, 0.26, 0.26, 0.26, 0.11, 0.11, ……..
Vector
400
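One common way to build such a 400-value composition is sketched below in Python. It rests on assumptions that the description does not confirm: rows are grouped by the residue letter in the first field, the 20 groups are ordered alphabetically by one-letter code, and each group's column sums are divided by the sequence length. The function name is hypothetical, and the ordering of values in the program's real output may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 groups


def pssm_composition(residues, matrix):
    """Sum the PSSM rows of each residue type and divide by sequence length.

    Returns a flat list of 20 * 20 = 400 values.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    sums = [[0.0] * 20 for _ in range(20)]
    for aa, row in zip(residues, matrix):
        if aa not in index:
            continue  # ignore non-standard residues such as X
        block = sums[index[aa]]
        for j, value in enumerate(row):
            block[j] += value
    length = len(residues) or 1  # avoid division by zero on empty input
    return [value / length for block in sums for value in block]


# Toy profile: three residues, each row filled with a single value.
vector = pssm_composition(["G", "G", "S"], [[1.0] * 20, [3.0] * 20, [6.0] * 20])
print(len(vector))
```

Because the result has a fixed length regardless of the protein's size, proteins of different lengths can be given to the same SVM model.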

 

 

 

 

Title
Description
 
pssm2pat (To generate patterns of a given size from a PSSM)
Here we generate patterns from the PSSM for a given window size. For a window of 5, the PSSM rows of the two residues upstream and the two residues downstream of the central residue are concatenated, in sequence order, with the row of the central residue itself. For example, the pattern for GSHMH joins the rows of all five residues into a vector of 100 values that represents the central residue H. For residues at the start of the sequence, the missing upstream rows are filled with zeros (0).
Usage
pssm2pat -i pssm.out -o pssm_pat -w 5
-i
Input file containing the PSSM profile generated by seq2pssm_imp (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o
Output file
-w
Window size of the patterns generated from the PSSM
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
pssm_pat
# Pattern Window size 5 generated from PSSM matrix. Each line represents pattern for central residue
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-300,-100,-200,-300,599,-200,-400,-200,-400,-300,0,-200,-200,-200,0,-200,-300,-200,-300,100,-100,0,0,-200,0,-100,-200,0,-200,-100,100,-100,0,-100,400,100,-200,-300,-200,-200,-300,-100,0,-100,-200,799,-300,-100,-300,-200,100,-200,0,0,-100,-200,-300,-200,200
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-300,-100,-200,-300,599,-200,-400,-200,-400,-300,0,-200,-200,-200,0,-200,-300,-200,-300,100,-100,0,0,-200,0,-100,-200,0,-200,-100,100,-100,0,-100,400,100,-200,-300,-200,-200,-300,-100,0,-100,-200,799,-300,-100,-300,-200,100,-200,0,0,-100,-200,-300,-200,200,-100,-100,-300,-200,0,-300,-200,100,-100,200,499,-200,-200,0,-100,-100,-100,100,-100,-100

…………………………………………………

Vector
20*window size (20*5=100)
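The windowing described for pssm2pat can be sketched in Python as follows. The function name is hypothetical. Zero-filled rows at the start of the sequence follow the description and the sample output; using the same zero rows at the end of the sequence is an assumption.

```python
def pssm_patterns(matrix, window=5):
    """Return one flattened pattern of len(row) * window values per residue."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    width = len(matrix[0]) if matrix else 0
    # Rows of zeros stand in for the positions beyond either end of the sequence.
    padding = [[0] * width for _ in range(half)]
    padded = padding + matrix + padding
    patterns = []
    for i in range(len(matrix)):
        flat = []
        for row in padded[i:i + window]:
            flat.extend(row)
        patterns.append(flat)
    return patterns


# Toy profile with 2 columns instead of 20, window of 3.
toy = [[1, 2], [3, 4], [5, 6]]
for pattern in pssm_patterns(toy, window=3):
    print(pattern)
```

With a real 20-column profile and a window of 5, each pattern has 100 values, and the first pattern starts with 40 zeros, as in the sample output above.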

 

 

 

Title
Description
 
pssm_smooth (To generate a smoothed PSSM profile for plotting)
Here we generate a smoothed matrix for a given window size. For a window of 5, the rows of the two residues upstream and the two residues downstream are taken together with the row of the central residue; in each column the five values are added and divided by five to give the average. Each residue is still represented by 20 values, but each value now also reflects the neighbouring residues, so a plot of the profile gives a smoother graph.
Usage
pssm_smooth -i pssm.out -o smooth.out -w 5
-i
Input file containing the PSSM profile generated by seq2pssm_imp (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o
Output file
-w
Window size used for smoothing
pssm.out
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100
H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200
G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300
Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
smooth.out
G, -40, -340, -160, -200, -300, 379.2, -60.2, -400, -200, -380, -200.2, 0, -260, -160, -200, 40, -200, -320, -280, -260
S, -80, -340, -160, -160, -260, 219.4, 139.6, -380, -180, -360, -180.2, 20, -260, -120, -160, 20, -200, -320, -280, -160
H, -80, -340, -160, -160, -260, 219.4, 139.6, -380, -180, -360, -180.2, 20, -260, -120, -160, 20, -200, -320, -280, -160
M, -100, -340, -140, -80, -260, 59.6, 179.6, -360, -120, -320, -120.2, 20, -240, 19.8, -100, 20, -180, -300, -280, -120
H, -120, -340, -120, 0, -260, -100.2, 219.6, -340, -60, -280, -60.2, 20, -220, 159.6, -40, 20, -160, -280, -280, -80
G, -160, -380, -120, 40, -280, -140.2, 239.6, -360, -40, -280, -40.2, 0, -220, 259.4, 0, -60, -200, -280, -260, -60
Q, -140, -380, -100, 80, -320, -140.2, 79.8, -360, 0, -260, -0.2, -20, -200, 359.2, 20, -40, -180, -260, -260, -120
Vector
20
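The column-wise averaging can be sketched in Python as a centred moving average. The function name is hypothetical, and the treatment of the sequence ends is an assumption: here positions outside the sequence count as zero and the sum is always divided by the window size. The description does not say how pssm_smooth handles the ends, so its output near the termini may differ from this sketch.

```python
def smooth_pssm(matrix, window=5):
    """Average each column over a centred window of rows."""
    half = window // 2
    n = len(matrix)
    width = len(matrix[0]) if matrix else 0
    smoothed = []
    for i in range(n):
        totals = [0.0] * width
        for k in range(i - half, i + half + 1):
            if 0 <= k < n:  # rows beyond the sequence ends contribute zero
                for j, value in enumerate(matrix[k]):
                    totals[j] += value
        smoothed.append([total / window for total in totals])
    return smoothed


# Toy profile with a single column.
print(smooth_pssm([[10], [20], [30], [40], [50]], window=5))
```

The output keeps the shape of the input (one row of 20 values per residue), which is why the vector size stays at 20.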

 

 

 

Title
Description
 
seq2motif (To create motifs with a sliding window of user-defined length)
This program creates motifs of a defined length by sliding a window along the sequence. Optionally, ‘X’ is added at both ends of the sequence so that the terminal residues also get complete patterns. The motifs can then be used to generate binary patterns.
Usage
seq2motif -i seq1.fa -o motif.out -w 5 -x y
-i
Input file in single fasta format
-o
Output file
-w
Window size to create a pattern
-x
Whether ‘X’ is added at the ends of the sequence to complete the terminal patterns: y (for yes) or n (for no)
seq1.fa
>seq_1##GSHMHGQVDCSPGIWQLDCTHLEGK
motif_1.out
>seq_1
XXGSH
XGSHM
GSHMH
SHMHG
HMHGQ
MHGQV
HGQVD
GQVDC
QVDCS
VDCSP
DCSPG
CSPGI
SPGIW
PGIWQ
GIWQL
IWQLD
WQLDC
QLDCT
LDCTH
DCTHL
CTHLE
THLEG
HLEGK
LEGKX
EGKXX
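The sliding window with optional 'X' padding can be sketched in Python as follows. The function name and the pad parameter are hypothetical; the behaviour with padding is modelled on the sample output above.

```python
def sequence_motifs(sequence, window=5, pad=True):
    """Return all overlapping motifs of the given window length.

    With pad=True, window // 2 'X' characters are added at both ends so
    that every residue of the original sequence is the centre of one motif.
    """
    if pad:
        flank = "X" * (window // 2)
        sequence = flank + sequence + flank
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


motifs = sequence_motifs("GSHMHGQVDCSPGIWQLDCTHLEGK", window=5)
print(len(motifs), motifs[0], motifs[-1])
```

For the 25-residue sample sequence this yields 25 motifs, from XXGSH to EGKXX. Without padding the same call yields 21 motifs, and the first two and last two residues are never at the centre of a window.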

 

 

Title
Description
 
motif2bin (To make binary input from a multi-FASTA motif file)
It converts the fixed-length motifs in a multi-FASTA motif file into binary patterns, written in column format.
Usage
motif2bin -i motif_1.out -o bin.out  -x y
-i
Input multi-FASTA motif file (generated by seq2motif.pl -i seq1.fa -o motif.out -w 5)
-o
Output file
-x
If additional X is added in pattern then y (for yes), or n (for no)
motif_1.out
>seq_1
XXGSH
XGSHM
GSHMH
SHMHG
HMHGQ
MHGQV
HGQVD
GQVDC
QVDCS
VDCSP
DCSPG
CSPGI
SPGIW
PGIWQ
GIWQL
IWQLD
WQLDC
QLDCT
LDCTH
DCTHL
CTHLE
THLEG
HLEGK
LEGKX
EGKXX
 
bin.out
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
…………………………………………
Vector
20*window size (20*5=100)
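The binary encoding can be sketched in Python as a one-hot code per residue. The function name and the use_x parameter are hypothetical. The alphabetical residue order and the extra final position for 'X' are inferred from the sample rows above, where the motif XXGSH appears to be written with 21 values per residue (105 in total); without the 'X' position each residue takes 20 values, giving 20*window size.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order


def motif_to_binary(motif, use_x=True):
    """One-hot encode a motif, residue by residue.

    With use_x=True an extra position is reserved for the padding
    character 'X'; otherwise 'X' is encoded as all zeros.
    """
    alphabet = AMINO_ACIDS + ("X" if use_x else "")
    bits = []
    for residue in motif:
        block = [0] * len(alphabet)
        position = alphabet.find(residue)
        if position >= 0:  # unknown characters stay all-zero
            block[position] = 1
        bits.extend(block)
    return bits


encoded = motif_to_binary("XXGSH")
print(len(encoded), [i for i, bit in enumerate(encoded) if bit])
```

Each residue sets exactly one position, so the number of ones in a row equals the number of encodable residues in the motif.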
 

 

           

 

Title
Description
 
blast_similarity (To perform a BLAST search)
Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown protein in the mouse, a scientist will typically perform a BLAST search of the protein database (nr) to see if humans carry a similar protein; BLAST will identify sequences in the previously known database that resemble the mouse protein based on similarity of sequence.
 
Usage
blast_similarity -i fasta -d nr -j 3 -e 1 -o blast.out
-i
Input file in FASTA format
-o
Output file containing the results
-d
Database to search with BLAST
-j
Number of iterations
-e
E-value cutoff
fasta
>amla_1
ASDATAYAACVAYANMANNNAMAKLAWQAPTCAGYAAKTGCVQRATRQOPKALVNAASDREW
>amla_2
ACDEFGHIKLMNPQRSTVWMRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLG
blast.out
>amla_1  Zero    Zero
>amla_2  ref|NP_216584.1| blaC [Mycobacterium tuberculosis H37Rv] >gi|158...     2e-27