Protein General Modules

This chapter describes small programs developed in our group; these programs can be used as building blocks to develop complex prediction modules. How does this package differ from existing software libraries or modules such as BioPerl? With BioPerl or similar packages, one needs programming knowledge in order to use the modules and subroutines. The GPSR package instead provides small programs that can be run by anyone with little knowledge of computers. The following important programs are included in this package.

Program	Purpose
· fasta2sfasta	Convert fasta format to single fasta format
· pro2aac	To calculate amino acid composition of protein
· pro2aac_nt	To calculate amino acid composition of N-terminal (nt) residues of a protein
· pro2aac_ct	To calculate amino acid composition of C-terminal (ct) residues of a protein
· pro2aac_rest	To calculate amino acid composition of a protein after removing N- and C-terminal residues
· pro2aac_split	To calculate split amino acid composition (SSAC) of a protein
· pro2dpc	To calculate dipeptide composition of protein
· pro2dpc_nt	To calculate dipeptide composition of N-terminal (nt) residues of a protein
· pro2dpc_ct	To calculate dipeptide composition of C-terminal (ct) residues of a protein
· pro2tpc	To calculate tripeptide composition of protein
· add_cols	To add columns of two files
· col2svm	To generate SVM_light input format
· col_mult	To multiply each column of input file with a number
· col_mult_sel	To multiply selected columns with a number
· col_rem	To remove selected columns from a file
· col_ext	To extract selective columns from a file
· col_corr	To compute correlation coefficient between two columns
· col_avg	To calculate average column of two files
· seq2pssm_imp	To calculate PSSM matrix in column format without any normalization
· pssm_n1	To normalize pssm profile based on 1/(1+e^(-x)) formula
· pssm_n2	To normalize pssm profile based on (numb-min)/(max-min) formula
· pssm_n3	To normalize pssm profile based on (numb-min)*100/(max-min) formula
· pssm_n4	To normalize pssm profile based on 1/(1+e^(-x/100)) formula
· pssm_comp	To compute PSSM composition (400 points)
· col_sig	Significance of columns in two column files
· pssm2pat	To generate patterns of given size from PSSM matrix
· pssm_smooth	To generate a smoothed pssm profile for plotting
· seq2motif	To create motifs by sliding window of user defined length with option of adding terminal X
· motif2bin	To make binary input from the multifasta motif file
· blast_similarity	To perform blast

Title	Description
Fasta format	fasta2sfasta (Convert fasta format to single fasta format). The FASTA format (Pearson format) is used to represent peptide or nucleic acid sequences using single-letter codes. Each record begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol.
Single fasta format	Our programs take input sequences in single fasta format, so a fasta file should first be converted into single fasta format. In the single fasta format the description and sequence data are merged into a single line, with two hash marks (##) separating the description from the sequence data.
Usage	fasta2sfasta -i seq.fa -o seq.sfa
-i	Input file name having sequence in fasta format
-o	Output file name that gives sequence in single fasta format
seq.fa	>seq_1 MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE >seq_2 ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.sfa	>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE >seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
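
The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the fasta2sfasta source: the function name `fasta_to_single_fasta` is hypothetical, and it assumes each record's description line starts with ">" and that sequence lines are simply joined without separators.

```python
def fasta_to_single_fasta(fasta_text):
    """Merge each FASTA record into one line: >description##SEQUENCE."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append(header + "##" + "".join(chunks))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append(header + "##" + "".join(chunks))
    return "\n".join(records)


example = ">seq_1\nMRNRGFGRRELL\nVAMAMLVSVTGC\n>seq_2\nACGRGFGVKLAC\n"
print(fasta_to_single_fasta(example))
```

Each output line then carries one complete record, which is the form the other programs in this chapter take as input.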

Title

Description

pro2aac (To calculate amino acid composition of protein)

The amino acid composition in a protein is simply the percentage of the different amino acids represented in a particular protein. The aim of calculating the composition of proteins is to transform the variable length of protein sequences to fixed length feature vectors. This is an important and most crucial step during classification of proteins using machine-learning techniques because they require fixed length patterns. In addition the conversion of a protein sequence to a vector of 20 dimensions using amino acid composition will encapsulate the properties of the protein into the vector.

The composition of each of the 20 natural amino acids is calculated using the following equation:

\[ \text{Composition of amino acid } i = \frac{\text{Total number of amino acid } i}{\text{Total number of all amino acids in protein}} \times 100 \]

where i can be any of the 20 amino acids.

Usage

pro2aac -i seq.sfa -o seq.out

-i

Input file name contains single fasta format

-o

Output file name gives amino acid composition

seq.sfa

>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE

>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF

seq.out

# Amino Acid Composition of proteins

# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,

19.18, 1.37, 4.11, 5.48, 2.74, 9.59, 1.37, 1.37, 0.00, 9.59, 4.11, 1.37, ...... 2.74,

19.18, 6.85, 5.48, 2.74, 6.85, 8.22, 1.37, 1.37, 1.37, 5.48, 4.11, 4.11, ...... 2.74,

Vector

20 dimensions (i.e. the composition of each of the 20 amino acid types is generated)
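
The equation above can be sketched in Python as follows. This is an illustrative sketch rather than the pro2aac source: the names `amino_acid_composition` and `parse_single_fasta` are hypothetical, and the denominator is assumed to be the full sequence length.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order used in the output files


def parse_single_fasta(line):
    """Split a single fasta line into (description, sequence) at '##'."""
    description, _, sequence = line.strip().partition("##")
    return description, sequence


def amino_acid_composition(sequence):
    """Return the 20 percentages, one per amino acid, in AMINO_ACIDS order."""
    seq = sequence.upper()
    total = len(seq)  # assumption: every residue counts toward the total
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * seq.count(aa) / total for aa in AMINO_ACIDS]


line = ">seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE"
_, seq = parse_single_fasta(line)
print(", ".join(f"{v:.2f}" for v in amino_acid_composition(seq)))
```

Whatever the length of the input protein, the result is always a vector of 20 values, which is what makes it usable as a fixed-length input for machine-learning methods.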

Title

Description

pro2aac_nt (To calculate amino acid composition of N-terminal (nt) residues of a protein)

It is well known that some proteins have an N-terminal signal sequence that is responsible for transporting the whole protein into a specific subcellular compartment such as the lysosome, endoplasmic reticulum, mitochondria, or chloroplast. Evidence indicates that divergent N-terminal sequences also influence the catalytic behavior, protein-protein interactions, and intracellular distribution of enzymes. Reports show that N-terminal signal sequences vary from 13 to 36 amino acid residues in length and contain all the information needed for localization to a specific compartment. Therefore, N-terminal information can be exploited through the amino acid composition feature to predict the subcellular localization of a protein. For example:

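(Schematic: a sequence running from the N-terminus (N) to the C-terminus (C), with the first 5 residues at the N-terminus (5 nt) marked.)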

Usage

pro2aac_nt -i seq.sfa -o seq.out -n 5

-i

Input file name

-o

Output file name

-n

Number of residues to calculate composition from N-terminal

seq.sfa

>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE

>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF

seq.out

# Amino Acid Composition of 5 n-terminal residues of proteins

# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,

0.00, 0.00, 0.00, 0.00, 0.00, 20.00, 0.00, 0.00, 0.00, 0.00, 20.00 ..... 0.00,

20.00, 20.00, 0.00, 0.00, 0.00, 40.00, 0.00, 0.00, 0.00, 0.00, ..... 0.00,

Vector

20 dimension

Title

Description

pro2aac_ct (To calculate amino acid composition of C-terminal (ct) residues of a protein)

While the N-terminus of a protein often contains targeting signals, the C-terminus can contain retention signals for protein sorting. The most common ER retention signal is the amino acid sequence -KDEL (or -HDEL) at the C-terminus, which keeps the protein in the endoplasmic reticulum and prevents it from entering the secretory pathway. The C-terminus of proteins can also be modified post-translationally, most commonly by the addition of a lipid anchor that allows the protein to be inserted into a membrane without having a transmembrane domain. The C-terminal domain of RNA polymerase II typically consists of up to 52 repeats of the sequence Tyr-Ser-Pro-Thr-Ser-Pro-Ser. Other proteins often bind this C-terminal domain in order to activate polymerase activity. The domain is involved in the initiation of DNA transcription, the capping of the RNA transcript, and attachment to the spliceosome for RNA splicing. Therefore, the information at the C-terminus can be utilized through the amino acid composition feature to predict different classes of proteins. For example:

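(Schematic: a sequence running from the N-terminus (N) to the C-terminus (C), with the last 5 residues at the C-terminus marked.)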

Usage

pro2aac_ct -i seq.sfa -o seq.out -n 5

-i

Input file name

-o

Output file name

-n

Number of residues to calculate composition from C-terminal

seq.sfa

>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE

>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF

seq.out

# Amino Acid Composition of 5 c-terminal residues of proteins

# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,

40.00, 0.00, 0.00,20.00, 0.00, 0.00, 0.00,20.00, 0.00, 0.00, 0.00, 0.00, 0.00, .... 0.00,

0.00,20.00, 0.00, 0.00,40.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ...... 0.00,

Vector

20 dimension
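
The N-terminal and C-terminal compositions (pro2aac_nt and pro2aac_ct) differ from the whole-protein composition only in which residues are counted. The sketch below illustrates this under stated assumptions: the function name `terminal_composition` and its `terminus` argument are hypothetical, and sequences shorter than n residues are assumed to be used in full.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def terminal_composition(sequence, n, terminus="N"):
    """Percent composition of the first (N) or last (C) n residues."""
    if n <= 0:
        raise ValueError("n must be a positive number of residues")
    seq = sequence.upper()
    if terminus == "N":
        region = seq[:n]
    else:
        region = seq[max(0, len(seq) - n):]
    if not region:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * region.count(aa) / len(region) for aa in AMINO_ACIDS]


seq = "MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE"
print(terminal_composition(seq, 5, "N"))  # counts only MRNRG
print(terminal_composition(seq, 5, "C"))  # counts only TAAIE
```

With n = 5 each residue contributes 20%, which is why the sample outputs contain only multiples of 20.00.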

Title

Description

pro2aac_rest (To calculate amino acid composition of a protein after removing N-, and C-terminal residues)

This program calculates the composition of the remaining part of a protein after removing a specified number of residues from the N- and C-termini. Transmembrane proteins have membrane-spanning signals in the middle of the protein. This program can be used to calculate the amino acid composition of the middle part, which has been used successfully in classifying protein families. For example:

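(Schematic: a sequence running from the N-terminus (N) to the C-terminus (C), with 5 residues removed at each end.)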

Usage

pro2aac_rest -i seq.sfa -o seq.out -n 5 -c 5

-i

Input file name

-o

Output file name

-n

Number of residues removed from N-terminal

-c

Number of residues removed from C-terminal

seq.sfa

>seq_1##AAAAACCCCCGGGGG

>seq_2##CCCGCAAAAASNMKL

seq.out

# Amino Acid Composition of protein after removing 5 n-terminal and 5 c-terminal residues

# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,

0.00, 100.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..... 0.00,

100.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ..... 0.00,

Vector

20 dimension
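
A minimal sketch of the same idea in Python is given below. The function name `rest_composition` is hypothetical, and the behavior for sequences too short to leave any middle part (an all-zero vector) is an assumption, since the original does not describe it.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rest_composition(sequence, n_remove, c_remove):
    """Percent composition after trimming residues from both termini."""
    seq = sequence.upper()
    end = len(seq) - c_remove
    middle = seq[n_remove:end] if end > n_remove else ""
    if not middle:
        return [0.0] * len(AMINO_ACIDS)  # assumption: nothing left to count
    return [100.0 * middle.count(aa) / len(middle) for aa in AMINO_ACIDS]


# Trimming 5 residues from each end of AAAAACCCCCGGGGG leaves CCCCC.
print(rest_composition("AAAAACCCCCGGGGG", 5, 5))
```

For the two sample sequences the remaining middle parts are CCCCC and AAAAA, which corresponds to the 100.00 entries in the C and A columns of the sample output.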

Title

Description

pro2aac_split (To calculate split amino acid composition (SSAC) of a protein)

It has been reported that some sequence motifs are present in specific regions of a protein. Therefore, instead of computing the composition of the whole sequence, it is useful to split the sequence into equal parts. The composition of each part is calculated separately, so that region-specific motifs are captured, and the resulting vectors are joined one after another. Some reports show that this strategy increases prediction accuracy. The advantage of SSAC over standard amino acid composition is that it gives greater weight to proteins that have a signal at either the N or C terminus. For example:

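(Schematic: a sequence running from the N-terminus (N) to the C-terminus (C), divided into equal parts.)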

Usage

pro2aac_split -i seq.sfa -o seq.out -n 3

-i

Input file name

-o

Output file name

-n

Number of parts to split the protein into (here 3, i.e. three equal parts of the whole protein)

seq.sfa

>seq_1##AAAAACCCCCGGGGG

>seq_2##CCCGCAAAAASNMKL

seq.out

# Amino Acid Composition of 3 equal parts of proteins

# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,

5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,

0.00, 4.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 5.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 1.00, 1.00, 1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00,

Vector

60 dimension (20*3 parts)
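
The split composition can be sketched as below. Note that the sample output above appears to list residue counts per part (5.00 for five A residues) rather than percentages, so this sketch returns counts by default and offers percentages as an option. The function name `split_composition`, the `as_percent` flag, and the rule that any leftover residues go into the last part are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_composition(sequence, parts, as_percent=False):
    """Concatenate one 20-value vector per part of the sequence."""
    seq = sequence.upper()
    size = len(seq) // parts
    vector = []
    for p in range(parts):
        start = p * size
        # Assumption: the last part absorbs any remainder residues.
        end = len(seq) if p == parts - 1 else start + size
        chunk = seq[start:end]
        for aa in AMINO_ACIDS:
            count = chunk.count(aa)
            if as_percent:
                vector.append(100.0 * count / len(chunk) if chunk else 0.0)
            else:
                vector.append(float(count))
    return vector


vec = split_composition("AAAAACCCCCGGGGG", 3)
print(len(vec))  # 20 values for each of the 3 parts
```

The vector length grows with the number of parts (20 values per part), so a larger -n gives finer positional detail at the cost of more dimensions.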

Title

Description

pro2dpc (To calculate dipeptide composition of protein)

The dipeptide composition in a protein is simply the percentage of the different adjacent pairs of amino acids represented in a particular protein. The aim of calculating the composition of proteins is to transform the variable length of protein sequences to fixed length feature vectors. This is an important and most crucial step during classification of proteins using machine-learning techniques because they require fixed length patterns. In addition the conversion of a protein sequence to a vector of 400 dimensions using dipeptide composition will encapsulate the properties of the neighboring amino acids.

The composition of each of the 400 possible dipeptides is calculated using the following equation:

\[ \mathrm{dpep}(i) = \frac{D_i}{\text{Total number of dipeptides in the protein}} \times 100 \]

where dpep(i) is the composition of dipeptide type i and D_i is the number of dipeptides of type i. For a protein of N residues, the total number of dipeptides is N - 1 (14 for the 15-residue example below, so four AA pairs give 28.571).

Usage

pro2dpc -i seq.sfa -o seq.out

-i

Input file name

-o

Output file name

seq.sfa

>seq_2##AAAAACCCCCGGGGG

seq.out

#AA , AC , AD ,….., CC ,….., CG , ….. , GG ,….., YY,

28.571, 7.143, 0.000, ….., 28.571,….., 7.143, ….., 28.571,….., 0.000

Vector

400 dimension (20*20)
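
A minimal Python sketch of the dipeptide composition follows. It is an illustration rather than the pro2dpc source: the function name `dipeptide_composition` is hypothetical, overlapping adjacent pairs are counted, and pairs containing non-standard residues are assumed to be ignored in the numerator.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# 400 dipeptides in the order AA, AC, AD, ..., YY
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return 400 percentages, one per dipeptide, in DIPEPTIDES order."""
    seq = sequence.upper()
    total = len(seq) - 1  # number of overlapping adjacent pairs
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


comp = dict(zip(DIPEPTIDES, dipeptide_composition("AAAAACCCCCGGGGG")))
print(f"AA={comp['AA']:.3f} AC={comp['AC']:.3f} CC={comp['CC']:.3f}")
```

For the sample sequence this reproduces the values shown in seq.out (28.571 for AA, CC, and GG; 7.143 for AC and CG).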

Title	Description
	pro2dpc_nt (To calculate dipeptide composition of N-terminal (nt) residues of a protein) It is well known that some proteins have an N-terminal signal sequence that is responsible for transporting the whole protein into a specific subcellular compartment such as the lysosome, endoplasmic reticulum, mitochondria, or chloroplast. Evidence indicates that divergent N-terminal sequences also influence the catalytic behavior, protein-protein interactions, and intracellular distribution of enzymes. Reports show that N-terminal signal sequences vary from 13 to 36 amino acid residues in length and contain all the information needed for localization to a specific compartment. Therefore, N-terminal information can be exploited through the dipeptide composition feature to predict the subcellular localization of a protein.
Usage	pro2dpc_nt -i seq.sfa -o seq.out -n 5
-i	Input file name
-o	Output file name
-n	Number of residues to calculate dipeptide composition from N-terminal
seq.sfa	>seq_2##AAAAACCCCCGGGGG
seq.out	# Dipeptide composition of 5 n-terminal residues of proteins #AA , AC , AD ,….., CC ,….., CG , ….. , GG ,….., YY, 100.00, 0.000, 0.000, ….., 00.000,….., 0.000, ….., 00.000,….., 0.000
Vector	400 dimension

Title	Description
	pro2dpc_ct (To calculate dipeptide composition of C-terminal (ct) residues of a protein) While the N-terminus of a protein often contains targeting signals, the C-terminus can contain retention signals for protein sorting. The most common ER retention signal is the amino acid sequence -KDEL (or -HDEL) at the C-terminus, which keeps the protein in the endoplasmic reticulum and prevents it from entering the secretory pathway. The C-terminus of proteins can also be modified post-translationally, most commonly by the addition of a lipid anchor that allows the protein to be inserted into a membrane without having a transmembrane domain. The C-terminal domain of RNA polymerase II typically consists of up to 52 repeats of the sequence Tyr-Ser-Pro-Thr-Ser-Pro-Ser. Other proteins often bind this C-terminal domain in order to activate polymerase activity. The domain is involved in the initiation of DNA transcription, the capping of the RNA transcript, and attachment to the spliceosome for RNA splicing. Therefore, the information at the C-terminus can be utilized through the dipeptide composition feature to predict different classes of proteins.
Usage	pro2dpc_ct -i seq.sfa -o seq.out -n 5
-i	Input file name
-o	Output file name
-n	Number of residues to calculate dipeptide composition from C-terminal
seq.sfa	>seq_2##AAAAACCCCCGGGGG
seq.out	# Dipeptide composition of 5 c-terminal residues of proteins #AA , AC , AD ,….., CC ,….., CG , ….. , GG ,….., YY, 0.000, 0.000, 0.000, ….., 00.000,….., 0.000, ….., 100.000,….., 0.000
Vector	400 dimension

Title

Description

pro2tpc (To calculate tripeptide composition of protein)

The tripeptide composition in a protein is simply the percentage of the three adjacent amino acids represented in a particular protein. The aim of calculating the composition of proteins is to transform the variable length of protein sequences to fixed length feature vectors. This is an important and most crucial step during classification of proteins using machine-learning techniques because they require fixed length patterns. In addition the conversion of a protein sequence to a vector of 8000 dimensions using tripeptide composition will encapsulate the properties of the neighboring amino acids.

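The composition of each of the 8000 possible tripeptides is calculated using the following equation:

\[ \mathrm{tpep}(i) = \frac{T_i}{\text{Total number of tripeptides in the protein}} \times 100 \]

where tpep(i) is the composition of tripeptide type i and T_i is the number of tripeptides of type i. For a protein of N residues, the total number of tripeptides is N - 2 (13 for the 15-residue example below, so three AAA triplets give 23.0769).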

Usage

pro2tpc -i seq.sfa -o seq.out

-i

Input file name

-o

Output file name

seq.sfa

>seq_2##AAAAACCCCCGGGGG

seq.out

# Tripeptide Composition of Protein

#AAA ,AAC ,AAD ,AAE ,AAF , ….. ,YYW ,YYY

23.0769 , 7.6923, 0.000, 00.000, 0.000, ….., 0.000 , 0.000

Vector

8000 dimension
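
Tripeptide composition follows the same sliding-window pattern with a window of three residues. The sketch below is written for a general window length k (k = 3 gives the 8000-dimension vector); the function name `kmer_composition` is hypothetical and the code is an illustration, not the pro2tpc source.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k=3):
    """Return (labels, percentages) for all 20**k peptides of length k."""
    seq = sequence.upper()
    labels = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        return labels, [0.0] * len(labels)
    counts = Counter(seq[i:i + k] for i in range(total))
    # Counter returns 0 for peptides that never occur in the sequence.
    return labels, [100.0 * counts[label] / total for label in labels]


labels, values = kmer_composition("AAAAACCCCCGGGGG", k=3)
print(len(values), labels[0], f"{values[0]:.4f}", labels[1], f"{values[1]:.4f}")
```

Because most of the 8000 entries are zero for any single protein, the vector is very sparse, which is worth keeping in mind when choosing between dipeptide and tripeptide features.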

Title	Description
	add_cols (To add columns of two files) This program is used to build a hybrid method. Two different features of a sequence (e.g. amino acid composition and dipeptide composition) are joined to make a more informative hybrid feature vector.
Usage	add_cols -i se1.out -c se2.out -o seq.out
-i	Input file (first column file for add)
-c	Input file (second column file for add)
-o	Output file name
se1.out	Amino Acid Composition of proteins # A , C , D , E , F , G , ….., Y, 33.33,33.33, 0.00, 0.00, 0.00,33.33, ….., 0.00
se2.out	# Dipeptide Composition of Protein #AA , AC , ….., YY 28.571,7.143…..,0.00
seq.out	# Amino Acid Composition of proteins # Dipeptide Composition of Protein # A , C , D , E , F , G , …., #AA , AC, .…, YY 33.33,33.33, 0.00, 0.00, 0.00, 33.33,……, 28.571,7.143…..,0.00
Vector	420 (20 for amino acid + 400 dipeptide composition)
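
Joining two feature files amounts to appending the columns of the second file to the matching row of the first. The sketch below assumes comma-separated values with header lines starting with "#", as in the samples above; the helper names `read_columns` and `add_columns` are hypothetical.

```python
def read_columns(text):
    """Parse comma-separated numeric rows, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A trailing comma leaves an empty field, which is dropped here.
        rows.append([float(x) for x in line.split(",") if x.strip()])
    return rows


def add_columns(rows_a, rows_b):
    """Append the columns of rows_b to the matching rows of rows_a."""
    if len(rows_a) != len(rows_b):
        raise ValueError("both files must contain the same number of rows")
    return [a + b for a, b in zip(rows_a, rows_b)]


aac = read_columns("# A , C , D\n33.33,33.33, 0.00,\n")
dpc = read_columns("#AA , AC\n28.571,7.143\n")
print(add_columns(aac, dpc))
```

Row order matters: row k of both files must describe the same sequence, otherwise the hybrid vector mixes features of different proteins.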

Title	Description
	col_mult (To multiply each column of input file with a number) This program multiplies each column of the input file by a specific number. It is used especially in the hybrid case to give the features equal weight. Suppose one wants to make a hybrid file of amino acid and dipeptide composition. If the two are joined directly, the amino acid composition values are very high relative to the dipeptide composition values. The performance of the SVM will then be nearly the same as with amino acid composition alone, because the weight of the dipeptide composition is diluted. If the amino acid composition is multiplied by 0.1 (as in the example below), or the dipeptide composition by 10, before the two are joined, there is a chance that performance will increase.
Usage	col_mult -i se1.out -o se1_mult -n 0.1
-i	Input file name
-o	Output file name
-n	Number by which each column is multiplied
se1.out	Amino Acid Composition of proteins # A , C , D , E , F , G , ….., Y, 33.33, 33.33, 0.00, 0.00, 0.00, 33.33, ….., 0.00
se1_mult	3.333000, 3.333000, 0.000000, 0.000000, 0.000000, 3.333000,….. , 0.000000,
Vector	Same as in input file

Title	Description
	col2svm (To generate SVM_light input format) This program converts a composition output file into the format used for SVM training. In SVM_light format, (1) each line starts with +1 or -1, denoting the positive or negative class of the sequence, respectively, and (2) each value is preceded by its column index (e.g. 1:33.330000).
Usage	col2svm -i se1.out -o svm.out -s +1
-i	Input file name
-o	Output file name
-s	Class for svm (+1 or -1)
se1.out	Amino Acid Composition of proteins # A, C, D, E, F, G,… Y, 33.33, 33.33, 0.00, 0.00, 0.00, 33.33, ….., 0.00
svm.out	+1 1:33.330000 2:33.330000 3:0.000000 4:0.000000 5:0.000000 6:33.330000 ……. 20:0.000000
Vector	20 dimension
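
The format conversion can be sketched as follows. The function name `to_svm_light` is hypothetical; the six-decimal formatting follows the sample svm.out line above, and every column is written out, including zeros, as in that sample.

```python
def to_svm_light(row, label):
    """Format one feature row as '<label> 1:v1 2:v2 ...'."""
    if label not in ("+1", "-1"):
        raise ValueError("label must be '+1' or '-1'")
    features = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(row, start=1))
    return f"{label} {features}"


print(to_svm_light([33.33, 33.33, 0.0, 0.0, 0.0, 33.33], "+1"))
```

Positive and negative examples are typically converted separately with the two labels and then combined into one training file.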

Title	Description
	col_mult_sel (To multiply selected columns with a number) Instead of multiplying every column, here only columns 1 to 3 are multiplied by a specific number (10).
Usage	col_mult_sel -i se1.out -o se1_mult -n 10 -a 1 -b 3
-i	Input file name
-o	Output file name
-n	Number by which the selected columns are multiplied
-a	Number of the first column (e.g. 1)
-b	Number of the last column (e.g. 3)
se1.out	Amino Acid Composition of proteins # A , C , D , E , F , G , ….., Y, 33.33, 33.33, 0.00, 0.00, 0.00, 33.33, ….., 0.00
se1_mult	333.300000, 333.300000, 0.000000, 0.000000, 0.000000, 33.330000,….., 0.000000
Vector	Same as in input file
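
Both col_mult and col_mult_sel can be viewed as one operation: scaling a range of columns, where the range defaults to the whole row. The sketch below illustrates this with a hypothetical function `scale_columns`; column numbers are 1-based and inclusive, matching the -a and -b options.

```python
def scale_columns(rows, factor, first=1, last=None):
    """Multiply columns first..last (1-based, inclusive) by factor."""
    scaled = []
    for row in rows:
        stop = len(row) if last is None else last
        scaled.append([
            value * factor if first <= index <= stop else value
            for index, value in enumerate(row, start=1)
        ])
    return scaled


rows = [[33.33, 33.33, 0.0, 0.0, 0.0, 33.33]]
# Scale only columns 1 to 3 by 10; the other columns are left unchanged.
print([f"{v:.6f}" for v in scale_columns(rows, 10, first=1, last=3)[0]])
# Scale every column by 0.1.
print([f"{v:.6f}" for v in scale_columns(rows, 0.1)[0]])
```

Scaling changes the relative weight of features but not the length of the vector.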

Title	Description
	col_rem (To remove selected columns from a file) This program removes specific columns from a file. For example, you can remove the composition of A, C, and D (columns 1 to 3) from the whole file to check the importance of these amino acids in a prediction method.
Usage	col_rem -i seq.out -o seq.rm -a 1 -b 3
-i	Input file name
-o	Output file name
-a	Number of the first column to remove (e.g. 1)
-b	Number of the last column to remove (e.g. 3)
seq.out	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y, 18.60, 2.33, 4.65, 5.81, 5.81, 8.14, 1.16, 1.16, 0.00, 8.14, 3.49, 1.16, 3.49, 0.00, 13.95, 4.65, 8.14, 5.81, 0.00, 3.49,
seq.rm	5.810000,5.810000,8.140000,1.160000,1.160000,0.000000,8.140000,3.490000, 1.160000,3.490000,0.000000,13.950000,4.650000,8.140000,5.810000,0.000000,3.490000
Vector	Total number = total number of columns in the input file minus the number of removed columns, e.g. 17 = 20 - 3

Title	Description
	col_ext (To extract selected columns from a file) This program takes only specific columns from a file. In this example we take only the amino acid composition of F, G, H, I, K, and L (columns 5 to 10) as input for the SVM.
Usage	col_ext -i seq.out -o seq.ext -a 5 -b 10
-i	Input file name
-o	Output file name
-a	Number of the first column to take (e.g. 5)
-b	Number of the last column to take (e.g. 10)
seq.out	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y, 18.60, 2.33, 4.65, 5.81, 5.81, 8.14, 1.16, 1.16, 0.00, 8.14, 3.49, 1.16, 3.49, 0.00, 13.95, 4.65, 8.14, 5.81, 0.00, 3.49,
seq.ext	5.81, 8.14, 1.16, 1.16, 0.00, 8.14
Vector	Total number = number of columns selected from the input file, e.g. 6 (columns 5 to 10)
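
Removing and extracting column ranges (col_rem and col_ext) are complementary slicing operations. The sketch below uses hypothetical function names `remove_columns` and `extract_columns`; column numbers are 1-based and inclusive, matching the -a and -b options.

```python
def extract_columns(rows, first, last):
    """Keep only columns first..last (1-based, inclusive)."""
    return [row[first - 1:last] for row in rows]


def remove_columns(rows, first, last):
    """Drop columns first..last (1-based, inclusive)."""
    return [row[:first - 1] + row[last:] for row in rows]


row = [18.60, 2.33, 4.65, 5.81, 5.81, 8.14, 1.16, 1.16, 0.00, 8.14,
       3.49, 1.16, 3.49, 0.00, 13.95, 4.65, 8.14, 5.81, 0.00, 3.49]
print(extract_columns([row], 5, 10)[0])      # columns F..L
print(len(remove_columns([row], 1, 3)[0]))   # 20 columns minus 3 removed
```

Comparing classifier performance with and without a block of columns is a simple way to judge how much those features contribute.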

Title	Description
	col_corr (To compute correlation coefficient between two columns) The correlation coefficient indicates the strength and direction of a linear relationship between two random variables. It varies between -1 and 1. The closer the coefficient is to either -1 or 1, the stronger the correlation between the variables: 1 indicates a perfectly increasing linear relationship, -1 a perfectly decreasing linear relationship, and 0 no linear correlation. The example shows the correlation between amino acids A and G (columns 1 and 6) in the file.
Usage	col_corr -i pos -a 1 -b 6
-i	Input file name
-a	Number of column (eg 1)
-b	Number of column (eg 6)
pos	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y, 15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95, 10.10, 1.95, 2.28, 5.54, 2.28, 8.14, 3.91, 8.47, 6.84, 0.98, 2.93, 15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95, 10.10, 1.95, 2.28, 5.54, 2.28, 8.14, 3.91, 8.47, 6.84, 0.98, 2.93, 12.83, 1.60, 8.56, 2.14, 2.67, 6.95, 6.95, 2.67, 1.60, 9.09, 3.74, 3.74, 6.42, 6.42, 4.28, 5.35, 6.42, 5.35, 1.07, 2.14, 12.30, 1.60, 8.56, 2.14, 2.67, 6.95, 6.95, 2.67, 1.60, 9.09, 3.74, 3.74, 6.42, 6.42, 4.28, 5.35, 6.42, 5.88, 1.07, 2.14, 13.76, 0.53, 4.76, 4.23, 3.70, 5.29, 1.06, 3.17, 2.12, 8.47, 3.17, 3.70, 13.76, 1.59, 4.76, 9.52, 8.99, 5.82, 1.06, 0.53,
output	0.749 (a positive correlation between column 1 and 6)
Vector	Not applicable (the output is a single correlation coefficient)
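
A correlation coefficient between two columns can be sketched as below using the standard Pearson formula. The exact formula used by col_corr is not stated in this chapter, so this is an assumption and the result may differ slightly from the program's output; the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("columns must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Columns 1 (A) and 6 (G) taken from the five rows of the 'pos' file.
col_a = [15.31, 15.31, 12.83, 12.30, 13.76]
col_g = [9.45, 9.45, 6.95, 6.95, 5.29]
print(round(pearson(col_a, col_g), 3))
```

A positive value means the two compositions tend to rise and fall together across the proteins in the file.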

Title	Description
	col_avg (To calculate average column of two files) Here the value in each column of one file is added to the value in the corresponding column of another file, and the mean is calculated. It can be used to generate an average feature from two different files (belonging to the same protein family) as input for machine learning techniques. For instance, 15.31 (1st column of file pos1) + 6.87 (1st column of file pos2) = 22.18, and 22.18/2 = 11.09 (file out). Note: both files must have the same number of columns and rows.
Usage	col_avg -a pos1 -b pos2 -o out
-a	First input file name
-b	Second input file name
-o	Output file name
pos1	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y, 15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95,10.10, 1.95, 2.28, 5.54, 2.28, 8.14, 3.91, 8.47, 6.84, 0.98, 2.93, 15.31, 1.30, 7.49, 3.91, 2.28, 9.45, 1.63, 3.26, 1.95,10.10, 1.95, 2.28, 5.54, 2.28, 8.14, 3.91, 8.47, 6.84, 0.98, 2.93, 12.83, 1.60, 8.56, 2.14, 2.67, 6.95, 6.95, 2.67, 1.60, 9.09, 3.74, 3.74, 6.42, 6.42, 4.28, 5.35, 6.42, 5.35, 1.07, 2.14,
pos2	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y, 6.87, 1.29, 6.87, 2.15, 0.86, 6.87, 1.72, 4.72, 5.58, 9.87, 1.29, 5.15, 4.72, 5.15, 1.72,13.73, 9.01,10.30, 1.72, 0.43, 9.87, 1.29, 7.30, 3.00, 0.86, 8.58, 1.29, 4.72, 4.29, 9.87, 0.86, 3.43, 5.15, 4.29, 3.43,11.16, 7.73, 10.73, 1.72, 0.43, 12.64, 1.10, 4.40, 1.10, 2.75, 9.34, 0.55, 6.59, 3.30, 9.34, 2.20, 6.59, 6.04, 5.49, 4.40, 6.59, 7.14, 9.34, 0.55, 0.55,
out	11.09; 1.295; 7.18; 3.03; 1.57; 8.16; 1.675; 3.99; 3.765; 9.985; 1.62; 3.715; 5.13; 3.715; 4.93; 8.82; 8.74; 8.57; 1.35; 1.68; 0 12.59; 1.295; 7.395; 3.455; 1.57; 9.015; 1.46; 3.99; 3.12; 9.985; 1.405; 2.855; 5.345; 3.285; 5.785; 7.535; 8.1; 8.785; 1.35; 1.68; 0 12.735; 1.35; 6.48; 1.62; 2.71; 8.145; 3.75; 4.63; 2.45; 9.215; 2.97; 5.165; 6.23; 5.955; 4.34; 5.97; 6.78; 7.345; 0.81; 1.345; 0
Vector	Total number of column (same as in input file)
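
The element-wise average can be sketched as follows. The function name `average_rows` is hypothetical; the shape check reflects the note above that both files must have the same number of rows and columns.

```python
def average_rows(rows_a, rows_b):
    """Element-wise mean of two equally shaped tables of numbers."""
    if len(rows_a) != len(rows_b):
        raise ValueError("files must have the same number of rows")
    averaged = []
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same number of columns")
        averaged.append([(a + b) / 2 for a, b in zip(row_a, row_b)])
    return averaged


pos1 = [[15.31, 1.30, 7.49]]
pos2 = [[6.87, 1.29, 6.87]]
print([round(v, 3) for v in average_rows(pos1, pos2)[0]])
```

For the first three columns of the first rows of pos1 and pos2 this gives 11.09, 1.295, and 7.18, in line with the start of the sample out file.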

Title	Description
	col_sig (Significance of columns in two column files) This program calculates the significance of each column across two files, for example to compare the amino acid composition files of positive and negative examples. For each column it reports the percent difference between the positive and negative files, its significance, the average of the column in the positive file and in the negative file, and the corresponding standard deviations.
Usage	col_sig -i file1 -j file2 >out
-i	Input file1 of positive example
-j	Input file2 of negative example
file1	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y 7.88,1.27,4.05,6.39,3.62,7.88,2.13,6.61,5.75,10.23,3.62,2.98,3.41,5.11,5.54,6.39,4.47,7.03,1.70,3.83 5.46,1.12,7.14,6.72,3.78,5.46,1.68,4.76,6.58,10.64,0.98,7.42,2.52,4.06,4.90,9.38,8.54,5.04,0.98,2.80 8.96,2.06,4.82,7.58,4.13,6.20,0.69,4.82,7.58,13.10,1.37,2.75,4.82,8.96,6.89,4.13,2.75,6.20,0.00,2.64
file2	# Amino Acid Composition of proteins # A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y 8.55,0.65,3.94,9.86,2.63,8.55,0.00,3.28,11.84,13.81,2.63,3.28,1.31,3.94,3.94,6.57,1.31,8.55,0.65,1.89 13.29,2.53,3.16,2.53,5.69,14.55,1.89,5.69,3.16,6.32,3.16,3.16,6.96,2.53,6.32,6.32,2.53,6.32,3.16,2.18 9.88,0.48,7.45,8.91,5.18,3.56,0.48,4.70,11.02,9.88,2.10,6.48,3.89,4.21,3.24,5.99,5.02,2.91,0.16,4.87
out	# Parameters Measured: % Difference, Significance, Average1, Average2, Standard Deviation(SD1), SD2 Column 1: -34.83, -490.31, 7.43, 10.57, 0.88, 0.39 Column 2: 19.44, 69.33, 1.48, 1.22, 0.33, 0.42 Column 3: 9.51, 53.98, 5.34, 4.85, 0.29, 1.50 Column 4: -2.89, -28.15, 6.90, 7.10, 0.39, 1.04 Column 5: -15.71, -234.15, 3.84, 4.50, 0.16, 0.39 Column 6: -30.79, -145.77, 6.51, 8.89, 0.18, 3.07 Column 7: 61.49, 218.36, 1.50, 0.79, 0.46, 0.17 Column 8: 16.83, 408.83, 5.4, 4.56, 0.33, 0.07 Column 9: -26.55, -214.22, 6.64, 8.67, 0.54, 1.35 Column 10: 12.34, 240.14, 11.32, 10.01, 1.02, 0.07 Column 11: -27.64, -193.90, 1.99, 2.63, 0.35, 0.30 Column 12: 1.76, 6.98, 4.38, 4.31, 0.94, 1.25 Column 13: -12.27, -115.47, 3.58, 4.05, 0.71, 0.09 Column 14: 51.68, 241.21, 6.04, 3.56, 1.68, 0.37 Column 15: 24.79, 185.57, 5.78, 4.50, 0.64, 0.73 Column 16: 5.22, 41.72, 6.63, 6.30, 1.44, 0.17 Column 17: 56.04, 174.63, 5.26, 2.95, 1.44, 1.19 Column 18: 2.69, 17.94, 6.09, 5.93, 0.06, 1.74 Column 19: -38.94, -72.75, 0.89, 1.32, 0.516, 0.67 Column 20: 3.47, 15.63, 3.093, 2.98, 0.26, 1.09

Title	Description
	seq2pssm_imp (To calculate PSSM matrix in column format without any normalization) The PSSM for each sequence is generated by performing a PSI-BLAST search against a specific database (e.g. nr) using a chosen number of iterations (e.g. 3) with an e-value cut-off of 0.001. For a sequence of N residues, the PSSM is represented by an N × 20 matrix. Each element of this matrix, m[i, j], provides information on the evolutionary conservation of residue type j at sequence position i. For example:
Usage	seq2pssm_imp -i seq1.fa -o pssm.out -d nr
-i	Input file in fasta format (do not use single fasta format)
-o	Output file
-d	Database against which PSSM profile is generated
seq1.fa	>1BISA PDBID CHAIN_SEQUENC GSHMHGQVDCSPGIWQLDCTHLEGKVILVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTVHTDNGSNFTSTTVKAACEWAGIKQEFGIPYNPQSQGVIESMNKELK
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100 ……….

Title	Description
	pssm_n1 (To normalize pssm profile based on 1/(1+e^(-x)) formula) The values of a PSSM matrix vary over a large range, which makes SVM training difficult. Thus every element x of the PSSM is normalized using 1/(1+e^(-x)). Various other formulae can also be used for normalization.
Usage	pssm_n1 -i pssm.out -o pssm_n1
-i	Input file having pssm profile generated by using (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o	Output file having normalized value
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100 ……….
pssm_n1	G, 0.5, 5.19e-131, 3.73e-44, 1.39e-87, 5.19e-131, 1, 1.39e-87, 1.93e-174, 1.39e-87, 1.93e-174, 5.19e-131, 0.5, 1.39e-87, 1.39e-87, 1.39e-87, 0.5, 1.39e-87, 5.19e-131, 1.39e-87, 5.19e-131 S, 1, 3.73e-44, 0.5, 0.5, 1.39e-87, 0.5, 3.73e-44, 1.39e-87, 0.5, 1.39e-87, 3.73e-44, 1, 3.73e-44, 0.5, 3.73e-44, 1, 1, 1.39e-87, 5.19e-131, 1.39e-87 H, 1.39e-87, 5.19e-131, 3.73e-44, 0.5, 3.73e-44, 1.39e-87, 1, 5.19e-131, 3.73e-44, 5.19e-131, 1.39e-87, 1, 1.39e-87, 0.5, 0.5, 3.73e-44, 1.39e-87, 5.19e-131, 1.39e-87, 1 M, 3.73e-44, 3.73e-44, 5.19e-131, 1.39e-87, 0.5, 5.19e-131, 1.39e-87, 1, 3.73e-44, 1, 1, 1.39e-87, 1.39e-87, 0.5, 3.73e-44, 3.73e-44, 3.73e-44, 1, 3.73e-44, 3.73e-44 H, 1.39e-87, 5.19e-131, 3.73e-44, 0.5, 3.73e-44, 1.39e-87, 1, 5.19e-131, 3.73e-44, 5.19e-131, 1.39e-87, 1, 1.39e-87, 0.5, 0.5, 3.73e-44, 1.39e-87, 5.19e-131, 1.39e-87, 1 G, 0.5, 5.19e-131, 3.73e-44, 1.39e-87, 5.19e-131, 1, 1.39e-87, 1.93e-174, 1.39e-87, 1.93e-174, 5.19e-131, 0.5, 1.39e-87, 1.39e-87, 1.39e-87, 0.5, 1.39e-87, 5.19e-131, 1.39e-87, 5.19e-131 Q, 3.73e-44, 5.19e-131, 0.5, 1, 5.19e-131, 1.39e-87, 0.5, 5.19e-131, 1, 1.39e-87, 0.5, 0.5, 3.73e-44, 1, 1, 0.5, 3.73e-44, 1.39e-87, 1.39e-87, 3.73e-44
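
The logistic normalization can be sketched in Python as below. The function names `sigmoid` and `normalize_row` are hypothetical; the two-branch form is a common way to avoid floating-point overflow for large |x| and is a choice made here, not a description of how pssm_n1 is implemented.

```python
import math


def sigmoid(x):
    """Compute 1 / (1 + e^(-x)) without overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so z lies in (0, 1)
    return z / (1.0 + z)


def normalize_row(scores):
    """Apply the logistic function to every score of one PSSM row."""
    return [sigmoid(s) for s in scores]


print([f"{v:.3g}" for v in normalize_row([0, -300, -100, 599])])
```

With raw scores in the hundreds, as in pssm.out above, the results saturate at values extremely close to 0, exactly 0.5 (for a score of 0), or 1, as the sample output shows. The package also lists pssm_n4, which divides x by 100 before applying the same function.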

Title	Description
	pssm_n2 (To normalize a PSSM profile with the (numb - min)/(max - min) formula) PSSM values span a wide range, which makes SVM training difficult. Every element of the PSSM is therefore rescaled with (numb - min)/(max - min), where numb is the element being scaled, so that all values fall between 0 and 1. For example:
Usage	pssm_n2 -i pssm.out -o pssm_n2
-i	Input file containing the PSSM profile generated with (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o	Output file containing the normalized values
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100 ……….
pssm_n2	G, 0.26, 0.06, 0.20, 0.13, 0.06, 0.66, 0.13, 0, 0.13, 0, 0.06, 0.26, 0.13, 0.13, 0.13, 0.26, 0.13, 0.06, 0.13, 0.06 S, 0.33, 0.20, 0.26, 0.26, 0.13, 0.26, 0.20, 0.13, 0.26, 0.13, 0.20, 0.33, 0.20, 0.26, 0.20, 0.53, 0.33, 0.13, 0.06, 0.13 H, 0.13, 0.06, 0.20, 0.26, 0.20, 0.13, 0.79, 0.06, 0.20, 0.06, 0.13, 0.33, 0.13, 0.26, 0.26, 0.20, 0.13, 0.06, 0.13, 0.40 M, 0.20, 0.20, 0.06, 0.13, 0.26, 0.06, 0.13, 0.33, 0.20, 0.40, 0.59, 0.13, 0.13, 0.26, 0.20, 0.20, 0.20, 0.33, 0.20, 0.20 H, 0.13, 0.06, 0.20, 0.26, 0.20, 0.13, 0.79, 0.06, 0.20, 0.06, 0.13, 0.33, 0.13, 0.26, 0.26, 0.20, 0.13, 0.06, 0.13, 0.40 G, 0.26, 0.06, 0.20, 0.13, 0.06, 0.66, 0.13, 0, 0.13, 0, 0.06, 0.26, 0.13, 0.13, 0.13, 0.26, 0.13, 0.06, 0.13, 0.06 Q, 0.20, 0.06, 0.26, 0.40, 0.06, 0.13, 0.26, 0.06, 0.33, 0.13, 0.26, 0.26, 0.20, 0.59, 0.33, 0.26, 0.20, 0.13, 0.13, 0.2
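The scaling above can be sketched in Python as follows. This is a minimal illustration, not the pssm_n2 source: it assumes the profile has already been parsed into a list of numeric rows and that min and max are taken over the whole profile (the sample output is consistent with one range shared by all columns, but the text does not state this). The names `minmax_normalize` and `scale` are hypothetical; `scale=100` gives the (numb - min) × 100/(max - min) variant.

```python
def minmax_normalize(rows, scale=1.0):
    """Rescale every value with (value - min) * scale / (max - min).

    Assumption: min and max are computed over the whole matrix,
    not per row or per column.
    """
    values = [v for row in rows for v in row]
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant matrix has no range to scale against.
        return [[0.0 for _ in row] for row in rows]
    span = hi - lo
    return [[(v - lo) * scale / span for v in row] for row in rows]


if __name__ == "__main__":
    toy_profile = [[-400, 0], [600, 100]]  # toy numbers, not real PSSM output
    print(minmax_normalize(toy_profile))
    print(minmax_normalize(toy_profile, scale=100))
```

Dividing by a single global range keeps the relative ordering of all scores in the profile, which is one reasonable reading of the formula.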

Title	Description
	pssm_n3 (To normalize a PSSM profile with the (numb - min) × 100/(max - min) formula) PSSM values span a wide range, which makes SVM training difficult. Every element of the PSSM is therefore rescaled with (numb - min) × 100/(max - min), which maps the values onto a scale from 0 to 100.
Usage	pssm_n3 -i pssm.out -o pssm_n3
-i	Input file containing the PSSM profile generated with (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o	Output file containing the normalized values
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
pssm_n3	G, 26.68, 6.67, 20.01, 13.34, 6.67, 66.64, 13.34, 0, 13.34, 0, 6.67, 26.68, 13.34, 13.34, 13.34, 26.68, 13.34, 6.67, 13.34, 6.67 S, 33.35, 20.01, 26.68, 26.68, 13.34, 26.68, 20.01, 13.34, 26.68, 13.34, 20.01, 33.35, 20.01, 26.68, 20.01, 53.36, 33.35, 13.34, 6.67, 13.34 H, 13.34, 6.67, 20.01, 26.68, 20.01, 13.34, 79.98, 6.67, 20.01, 6.67, 13.34, 33.35, 13.34, 26.68, 26.68, 20.01, 13.34, 6.67, 13.34, 40.02 M, 20.01, 20.01, 6.67, 13.34, 26.68, 6.67, 13.34, 33.35, 20.01, 40.02, 59.97, 13.34, 13.3, 26.68, 20.01, 20.01, 20.01, 33.35, 20.01, 20.01 H, 13.34, 6.67, 20.01, 26.68, 20.01, 13.34, 79.98, 6.67, 20.01, 6.67, 13.34, 33.35, 13.34, 26.68, 26.68, 20.01, 13.34, 6.67, 13.34, 40.02 G, 26.68, 6.67, 20.01, 13.34, 6.67, 66.64, 13.34, 0, 13.34, 0, 6.67, 26.68, 13.34, 13.34, 13.34, 26.68, 13.34, 6.67, 13.34, 6.67 Q, 20.01, 6.67, 26.68, 40.02, 6.67, 13.34, 26.68, 6.67, 33.35, 13.34, 26.68, 26.68, 20.01, 59.97, 33.35, 26.68, 20.01, 13.34

Title	Description
	pssm_n4 (To normalize a PSSM profile with the 1/(1 + e^(-x/100)) formula) PSSM values span a wide range, which makes SVM training difficult. Every element of the PSSM is therefore rescaled with the logistic function 1/(1 + e^(-x/100)), where x is the PSSM value, so that all values fall between 0 and 1.
Usage	pssm_n4 -i pssm.out -o pssm_n4
-i	Input file containing the PSSM profile generated with (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o	Output file containing the normalized values
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200,-300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100 ……….
pssm_n4	G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.0474258731775668, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.047, 0.11, 0.04 S, 0.73, 0.26, 0.5, 0.5, 0.11, 0.5, 0.26, 0.11, 0.5, 0.11, 0.26, 0.73, 0.26, 0.5, 0.26, 0.98, 0.73, 0.11, 0.04, 0.11 H, 0.11, 0.04, 0.26, 0.5, 0.26, 0.11, 0.99, 0.04, 0.26, 0.04, 0.11, 0.73, 0.119202922022118, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88 M, 0.26, 0.26, 0.04, 0.11, 0.5, 0.04, 0.11, 0.73, 0.26, 0.88, 0.99, 0.11, 0.11, 0.5, 0.26, 0.26, 0.26, 0.73, 0.26, 0.26 H, 0.11, 0.047, 0.26, 0.5, 0.26, 0.11, 0.99, 0.047, 0.26, 0.04, 0.11, 0.73, 0.11, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88 G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.04, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.04, 0.11, 0.04 Q, 0.26, 0.04, 0.5, 0.88, 0.04, 0.11, 0.5, 0.04, 0.73, 0.11, 0.5, 0.5, 0.26, 0.99, 0.73, 0.5, 0.26, 0.11, 0.11, 0.26
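A short Python sketch of the logistic scaling 1/(1 + e^(-x/100)) is given here for illustration. It assumes the profile is already a list of numeric rows; `logistic` and `logistic_scale` are hypothetical names, and the `divisor` parameter is added only so that the same code can show the unscaled 1/(1 + e^(-x)) form with `divisor=1`. For x = -300 the function returns about 0.047, in line with the sample values above.

```python
import math


def logistic(x):
    """Numerically stable 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x)
    # so that math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


def logistic_scale(rows, divisor=100.0):
    """Apply 1 / (1 + e^(-value / divisor)) to every element."""
    return [[logistic(v / divisor) for v in row] for row in rows]


if __name__ == "__main__":
    print(logistic_scale([[0, -300, 599]]))
```

Dividing by 100 before applying the logistic function keeps scores of a few hundred away from the flat ends of the curve, where every value would otherwise collapse to nearly 0 or 1.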

Title	Description
	pssm_comp (To compute PSSM composition (400 values)) Here the PSSM matrix is converted into a vector of dimension 400. The PSSM scores are accumulated for every combination of residue type in the protein sequence and PSSM column (20 × 20 = 400 combinations), so each column contributes 20 values instead of one. Every element of this vector is then divided by the length of the sequence. The resulting 400 elements are used as the input features for the SVM.
Usage	pssm_comp -i pssm_n4 -o pssm_n4.out
-i	Input file containing the PSSM profile generated with (seq2pssm_imp -i seq1.fa -o pssm.out -d nr.02) and then scaled with (pssm_n4.pl -i pssm.out -o pssm_n4)
-o	Output file containing the 400 elements
pssm_n4	G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.04, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.04, 0.11, 0.04 S, 0.73, 0.26, 0.5, 0.5, 0.11, 0.5, 0.26, 0.11, 0.5, 0.11, 0.26, 0.73, 0.26, 0.5, 0.26, 0.98, 0.73, 0.11, 0.04, 0.11 H, 0.11, 0.04, 0.26, 0.5, 0.26, 0.11, 0.99, 0.047, 0.26, 0.047, 0.11, 0.73, 0.11, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88 M, 0.26, 0.26, 0.04, 0.11, 0.5, 0.04, 0.11, 0.73, 0.26, 0.88, 0.99, 0.11, 0.11, 0.5, 0.26, 0.26, 0.26, 0.73, 0.26, 0.26 H, 0.11, 0.04, 0.26, 0.5, 0.26, 0.11, 0.99, 0.04, 0.26, 0.04, 0.11, 0.73, 0.11, 0.5, 0.5, 0.26, 0.11, 0.04, 0.11, 0.88 G, 0.5, 0.04, 0.26, 0.11, 0.04, 0.99, 0.11, 0.01, 0.11, 0.01, 0.04, 0.5, 0.11, 0.11, 0.11, 0.5, 0.11, 0.04, 0.11, 0.04 Q, 0.26, 0.04, 0.5, 0.88, 0.04, 0.11, 0.5, 0.04, 0.73, 0.11, 0.5, 0.5, 0.26, 0.99, 0.73, 0.5, 0.26, 0.11, 0.11, 0.26
pssm_n4.out	0.98, 0.50, 0.11, 0.26, 0.11, 0.50, 0.11, 0.26, 0.26, 0.26, 0.26, 0.11, 0.26, 0.26894142, 0.26, 0.73, 0.50, 0.50, 0.04, 0.11, 0.50, 0.99, 0.04, 0.01, 0.11, 0.04, 0.04, 0.26, 0.04, 0.26, 0.26, 0.04, 0.04, 0.04, 0.04, 0.26, 0.26, 0.26, 0.11, 0.11, ……..
Vector	400
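One common way to build such a 400-value composition is sketched below in Python. It is an interpretation under stated assumptions, not the pssm_comp source: rows are summed per residue type, residue types and PSSM columns are both taken in the order ACDEFGHIKLMNPQRSTVWY, and residues outside that alphabet are ignored. The function name `pssm_composition` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of types and columns


def pssm_composition(residues, rows):
    """Return a 400-element vector from a PSSM.

    residues: the protein sequence, one letter per PSSM row
    rows:     one list of 20 scores per residue
    """
    length = len(residues)
    if length == 0 or length != len(rows):
        raise ValueError("need one PSSM row per residue")
    totals = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    for aa, row in zip(residues, rows):
        if aa not in totals:
            continue  # skip non-standard residues (assumption)
        for j, score in enumerate(row[:20]):
            totals[aa][j] += score
    # Flatten the 20 x 20 table and divide by the sequence length.
    return [totals[aa][j] / length for aa in AMINO_ACIDS for j in range(20)]


if __name__ == "__main__":
    vector = pssm_composition("GG", [[1.0] * 20, [3.0] * 20])
    print(len(vector), vector[100:103])
```

Because the result always has 400 elements, proteins of different lengths can be fed to the same SVM model.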

Title	Description
	pssm2pat (To generate patterns of a given size from a PSSM matrix) Here patterns of a chosen window size are generated from the PSSM matrix. For a window of 5, the PSSM rows of the two residues upstream and the two residues downstream are taken together with the row of the central residue, and all five rows are concatenated in sequence order. For example, the pattern for GSHMH joins the rows of all five residues into a vector of 100 values that represents the central residue H. For residues at the start of the sequence, the missing upstream rows are filled with zeros (0).
Usage	pssm2pat -i pssm.out -o pssm_pat -w 5
-i	Input file containing the PSSM profile generated with (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o	Output file
-w	Window size of the patterns generated from the PSSM matrix
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
pssm_pat	# Pattern Window size 5 generated from PSSM matrix. Each line represents pattern for central residue 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-300,-100,-200,-300,599,-200,-400,-200,-400,-300,0,-200,-200,-200,0,-200,-300,-200,-300,100,-100,0,0,-200,0,-100,-200,0,-200,-100,100,-100,0,-100,400,100,-200,-300,-200,-200,-300,-100,0,-100,-200,799,-300,-100,-300,-200,100,-200,0,0,-100,-200,-300,-200,200 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-300,-100,-200,-300,599,-200,-400,-200,-400,-300,0,-200,-200,-200,0,-200,-300,-200,-300,100,-100,0,0,-200,0,-100,-200,0,-200,-100,100,-100,0,-100,400,100,-200,-300,-200,-200,-300,-100,0,-100,-200,799,-300,-100,-300,-200,100,-200,0,0,-100,-200,-300,-200,200,-100,-100,-300,-200,0,-300,-200,100,-100,200,499,-200,-200,0,-100,-100,-100,100,-100,-100 …………………………………………………
Vector	20 × window size (20 × 5 = 100)
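The windowing step can be sketched in Python as below. This is an illustration only: it assumes the profile is a list of equal-length numeric rows, that the window size is odd, and that missing neighbours at both ends are filled with rows of zeros (the text states this only for the start of the sequence). The name `pssm_to_patterns` is hypothetical.

```python
def pssm_to_patterns(rows, window=5):
    """Concatenate each row with its neighbours into one flat pattern."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    if not rows:
        return []
    half = window // 2
    width = len(rows[0])
    padding = [[0] * width for _ in range(half)]  # zero rows for missing neighbours
    padded = padding + [list(row) for row in rows] + padding
    patterns = []
    for i in range(len(rows)):
        flat = []
        for row in padded[i:i + window]:
            flat.extend(row)
        patterns.append(flat)
    return patterns


if __name__ == "__main__":
    toy = [[1, 2], [3, 4], [5, 6]]  # toy rows with 2 columns instead of 20
    for pattern in pssm_to_patterns(toy, window=3):
        print(pattern)
```

With 20 columns per residue and a window of 5, every pattern has 100 values, one pattern per residue.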

Title	Description
	pssm_smooth (To generate a smoothed PSSM profile for plotting) Here a smoothed matrix is generated using a chosen window size. For a window of 5, the PSSM rows of the two residues upstream and the two residues downstream are taken together with the row of the current residue; the five values in each column are added and divided by five to give their average. Each residue is still described by 20 values, but every value now also reflects the neighbouring residues, so a plot of the profile gives a smoother curve.
Usage	pssm_smooth -i pssm.out -o smooth.out -w 5
-i	Input file containing the PSSM profile generated with (seq2pssm_imp.pl -i seq1.fa -o pssm.out -d nr.02)
-o	Output file
-w	Window size used for averaging
pssm.out	G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 S, 100, -100, 0, 0, -200, 0, -100, -200, 0, -200, -100, 100, -100, 0, -100, 400, 100, -200, -300, -200 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 M, -100, -100, -300, -200, 0, -300, -200, 100, -100, 200, 499, -200, -200, 0, -100, -100, -100, 100, -100, -100 H, -200, -300, -100, 0, -100, -200, 799, -300, -100, -300, -200, 100, -200, 0, 0, -100, -200, -300, -200, 200 G, 0, -300, -100, -200, -300, 599, -200, -400, -200, -400, -300, 0, -200, -200, -200, 0, -200, -300, -200, -300 Q, -100, -300, 0, 200, -300, -200, 0, -300, 100, -200, 0, 0, -100, 499, 100, 0, -100, -200, -200, -100
smooth.out	G, -40, -340, -160, -200, -300, 379.2, -60.2, -400, -200, -380, -200.2, 0, -260, -160, -200, 40, -200, -320, -280, -260 S, -80, -340, -160, -160, -260, 219.4, 139.6, -380, -180, -360, -180.2, 20, -260, -120, -160, 20, -200, -320, -280, -160 H, -80, -340, -160, -160, -260, 219.4, 139.6, -380, -180, -360, -180.2, 20, -260, -120, -160, 20, -200, -320, -280, -160 M, -100, -340, -140, -80, -260, 59.6, 179.6, -360, -120, -320, -120.2, 20, -240, 19.8, -100, 20, -180, -300, -280, -120 H, -120, -340, -120, 0, -260, -100.2, 219.6, -340, -60, -280, -60.2, 20, -220, 159.6, -40, 20, -160, -280, -280, -80 G, -160, -380, -120, 40, -280, -140.2, 239.6, -360, -40, -280, -40.2, 0, -220, 259.4, 0, -60, -200, -280, -260, -60 Q, -140, -380, -100, 80, -320, -140.2, 79.8, -360, 0, -260, -0.2, -20, -200, 359.2, 20, -40, -180, -260, -260, -120
Vector	20
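A moving-average smoothing of this kind can be sketched in Python as follows. The edge handling is an assumption: neighbours that fall outside the sequence are treated as zero and the sum is always divided by the window size, which the text does not specify and which was not checked against the sample output. The name `smooth_pssm` is hypothetical.

```python
def smooth_pssm(rows, window=5):
    """Average each column over a window centred on every residue."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    if not rows:
        return []
    half = window // 2
    width = len(rows[0])
    smoothed = []
    for i in range(len(rows)):
        sums = [0.0] * width
        for k in range(i - half, i + half + 1):
            if 0 <= k < len(rows):  # positions outside the sequence add zero
                for j in range(width):
                    sums[j] += rows[k][j]
        smoothed.append([s / window for s in sums])
    return smoothed


if __name__ == "__main__":
    print(smooth_pssm([[5], [10], [15]], window=3))
```

The output keeps one row of the same width per residue, so it can be plotted in place of the raw profile.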

Title	Description
	seq2motif (To create motifs with a sliding window of user-defined length) This program creates overlapping motifs of a defined length, one centred on each residue. Optionally, 'X' is added at the ends of the sequence so that the terminal residues also get complete patterns. Binary patterns can then be generated from the motifs.
Usage	seq2motif -i seq1.fa -o motif.out -w 5 -x y
-i	Input file in single fasta format
-o	Output file
-w	Window size used to create a pattern
seq1.fa	>seq_1##GSHMHGQVDCSPGIWQLDCTHLEGK
motif_1.out	>seq_1 XXGSH XGSHM GSHMH SHMHG HMHGQ MHGQV HGQVD GQVDC QVDCS VDCSP DCSPG CSPGI SPGIW PGIWQ GIWQL IWQLD WQLDC QLDCT LDCTH DCTHL CTHLE THLEG HLEGK LEGKX EGKXX
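The sliding window with 'X' padding can be sketched in Python as shown here. It is a minimal illustration that assumes an odd window size and padding on both ends; the name `sequence_to_motifs` and the `pad_char` parameter are hypothetical.

```python
def sequence_to_motifs(sequence, window=5, pad_char="X"):
    """Return one motif of length `window` centred on each residue."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    # Pad both ends so the first and last residues also sit in the centre.
    padded = pad_char * half + sequence + pad_char * half
    return [padded[i:i + window] for i in range(len(sequence))]


if __name__ == "__main__":
    motifs = sequence_to_motifs("GSHMHGQVDCSPGIWQLDCTHLEGK", window=5)
    print(" ".join(motifs))
```

For the 25-residue example sequence this gives 25 motifs, starting with XXGSH and ending with EGKXX as in the sample output.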

Title	Description
	motif2bin (To make binary input from the multifasta motif file) It converts the fixed-length motifs of a multifasta motif file into binary input in column format.
Usage	motif2bin -i motif_1.out -o bin.out -x y
-i	Input multifasta motif file (generated with seq2motif.pl -i seq1.fa -o motif.out -w 5)
-o	Output file
-x	y (yes) if the additional X was added to the patterns, otherwise n (no)
motif_1.out	>seq_1 XXGSH XGSHM GSHMH SHMHG HMHGQ MHGQV HGQVD GQVDC QVDCS VDCSP DCSPG CSPGI SPGIW PGIWQ GIWQL IWQLD WQLDC QLDCT LDCTH DCTHL CTHLE THLEG HLEGK LEGKX EGKXX
bin.out	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 …………………………………………
Vector	20 × window size (20 × 5 = 100); the example rows above carry a 21st position per residue for the padding symbol X
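The binary encoding can be sketched in Python as below. It assumes one block per residue with the amino acids in the order ACDEFGHIKLMNPQRSTVWY and, when X padding is in use, a 21st position for X at the end of each block; this layout is inferred from the sample rows (the first 1 appears at position 21) rather than stated in the text. The name `motif_to_binary` and the `include_x` parameter are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order


def motif_to_binary(motif, include_x=True):
    """One-hot encode a motif, one block per residue."""
    alphabet = AMINO_ACIDS + ("X" if include_x else "")
    vector = []
    for residue in motif:
        block = [0] * len(alphabet)
        index = alphabet.find(residue)
        if index >= 0:  # unknown symbols stay all-zero (assumption)
            block[index] = 1
        vector.extend(block)
    return vector


if __name__ == "__main__":
    encoded = motif_to_binary("XXGSH")
    print(len(encoded), [i for i, bit in enumerate(encoded) if bit])
```

Each residue contributes exactly one 1, so a motif of five residues yields five set bits regardless of the block size.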

Title	Description
	blast_similarity (To perform a BLAST search) The Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search lets a researcher compare a query sequence with a library or database of sequences and identify library sequences that resemble the query above a certain threshold. For example, following the discovery of a previously unknown protein in the mouse, a scientist will typically run a BLAST search against the protein database (nr) to see whether humans carry a similar protein; BLAST will identify sequences in the database that resemble the mouse protein based on sequence similarity.
Usage	blast_similarity -i fasta -d nr -j 3 -e 1 -o blast.out
-i	Input file in fasta format
-o	Output file containing the results
-d	Database to search with BLAST
-j	Number of iterations
-e	E-value cutoff
fasta	>amla_1 ASDATAYAACVAYANMANNNAMAKLAWQAPTCAGYAAKTGCVQRATRQOPKALVNAASDREW >amla_2 ACDEFGHIKLMNPQRSTVWMRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLG
blast.out	>amla_1 Zero Zero >amla_2 ref\|NP_216584.1\| blaC [Mycobacterium tuberculosis H37Rv] >gi\|158... 2e-27

Title	Description
	fasta2sfasta (Convert fasta format to single fasta format) Fasta format (Pearson format) is used to represent peptide or nucleic acid sequences using single-letter codes. Each record begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol. Single fasta format: our programs take their input sequences in single fasta format, so a fasta file must first be converted into single fasta format. In single fasta format the description and the sequence data are merged into a single line, with two hash marks (##) separating the description from the sequence data.
Usage	fasta2sfasta -i seq.fa -o seq.sfa
-i	Input file name having sequences in fasta format
-o	Output file name that gives the sequences in single fasta format
seq.fa	>seq_1
	MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
	>seq_2
	ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.sfa	>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
	>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
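The conversion from fasta to single fasta format can be sketched in Python as follows. This is an illustrative sketch rather than the fasta2sfasta source: it assumes the fasta text is already in memory, joins wrapped sequence lines, and ignores blank lines and any text before the first description line. The name `fasta_to_single_fasta` is hypothetical.

```python
def fasta_to_single_fasta(text):
    """Merge each fasta record into one 'description##sequence' line."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(header + "##" + "".join(chunks))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append(header + "##" + "".join(chunks))
    return records


if __name__ == "__main__":
    example = ">seq_1\nMRNRGFG\nRRELLVA\n>seq_2\nACGRGFG\n"
    for record in fasta_to_single_fasta(example):
        print(record)
```

Keeping one record per line means the later programs can read a protein with a single line read and one split on "##".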

Title	Description
	pro2aac (To calculate amino acid composition of protein)

Title	Description
	pro2aac_nt (To calculate amino acid composition of N-terminal (nt) residues of a protein) [Diagram: protein drawn from the N-terminus (N) to the C-terminus (C), with a terminal segment labelled "5 nt".]
Usage	pro2aac_nt -i seq.sfa -o seq.out -n 5
-i	Input file name
-o	Output file name
-n	Number of residues from the N-terminus used to calculate the composition
seq.sfa	>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
	>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.out	# Amino Acid Composition of 5 n-terminal residues of proteins
	# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
	0.00, 0.00, 0.00, 0.00, 0.00, 20.00, 0.00, 0.00, 0.00, 0.00, 20.00 ..... 0.00,
	20.00, 20.00, 0.00, 0.00, 0.00, 40.00, 0.00, 0.00, 0.00, 0.00, ..... 0.00,
Vector	20 dimension
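The N-terminal composition can be sketched in Python as below, assuming percentages over the first n residues and the column order ACDEFGHIKLMNPQRSTVWY shown in the output header. The function name `nterm_composition` is hypothetical. For seq_1 the first five residues are MRNRG, which gives 20 for G, M and N and 40 for R.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order from the output header


def nterm_composition(sequence, n=5):
    """Percentage of each amino acid among the first n residues."""
    segment = sequence[:n]
    if not segment:
        raise ValueError("sequence is empty or n is not positive")
    # Divide by the segment length so short sequences still sum to 100.
    return [100.0 * segment.count(aa) / len(segment) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    composition = nterm_composition("MRNRGFGRRELLVAMAMLVS", n=5)
    print(", ".join("%.2f" % value for value in composition))
```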

Title	Description
	pro2aac_ct (To calculate amino acid composition of C-terminal (ct) residues of a protein) [Diagram: protein drawn from the N-terminus (N) to the C-terminus (C), with a terminal segment labelled "5nt".]
Usage	pro2aac_ct -i seq.sfa -o seq.out -n 5
-i	Input file name
-o	Output file name
-n	Number of residues from the C-terminus used to calculate the composition
seq.sfa	>seq_1##MRNRGFGRRELLVAMAMLVSVTGCARHASGARPASTTLPAGADLADRFAELERRYDARLGVYVPATGTTAAIE
	>seq_2##ACGRGFGVKLACNMNNACRTYFSDVAMAMLVSVTGCARHASGARPASTTLPAGADLADIEYRADERFAFCSTF
seq.out	# Amino Acid Composition of 5 c-terminal residues of proteins
	# A , C , D , E , F , G , H , I , K , L , M , N , P , Q , R , S , T , V , W , Y,
	40.00, 0.00, 0.00,20.00, 0.00, 0.00, 0.00,20.00, 0.00, 0.00, 0.00, 0.00, 0.00, .... 0.00,
	0.00,20.00, 0.00, 0.00,40.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ...... 0.00,
Vector	20 dimension
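The C-terminal case can be sketched in Python starting from a single fasta record, as below. The sketch assumes the record has the form description##sequence and that the composition is a percentage over the last n residues; the name `cterm_composition` is hypothetical. For seq_2 the last five residues are FCSTF, which gives 20 for C, S and T and 40 for F.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order from the output header


def cterm_composition(record, n=5):
    """Return (name, percentages) for the last n residues of a record."""
    header, _, sequence = record.strip().partition("##")
    # sequence[-0:] would return the whole sequence, so guard n explicitly.
    segment = sequence[-n:] if n > 0 else ""
    if not segment:
        raise ValueError("record has no sequence or n is not positive")
    name = header.lstrip(">")
    values = [100.0 * segment.count(aa) / len(segment) for aa in AMINO_ACIDS]
    return name, values


if __name__ == "__main__":
    name, values = cterm_composition(">seq_2##ADIEYRADERFAFCSTF", n=5)
    print(name, ", ".join("%.2f" % value for value in values))
```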

Title

Description

pro2aac_rest (To calculate amino acid composition of a protein after removing N- and C-terminal residues)

[Diagram: protein drawn from the N-terminus (N) to the C-terminus (C), with two segments labelled "5 nt".]

Usage

pro2aac_rest -i seq.sfa -o seq.out -n 5 -c 5

-i