This page briefly describes the GPSR.1.0 and GPSR.2.0 packages. Both packages provide programs that can be used as building blocks for developing complex prediction modules. GPSR.1.0 mainly contains small Perl programs for bioinformatics problems, whereas GPSR.2.0 contains small Perl and R based programs for biostatistics and chemoinformatics.
The following important programs are included in the GPSR.1.0 package:
Program | Purpose | Usage |
fasta2sfasta | Convert fasta format to single fasta format | fasta2sfasta -i seq.fa -o seq.sfa |
pro2aac | To calculate amino acid composition of protein | pro2aac -i seq.sfa -o seq.out |
pro2aac_nt | To calculate amino acid composition of N-terminal (nt) residues of a protein | pro2aac_nt -i seq.sfa -o seq.out -n 5 |
pro2aac_ct | To calculate amino acid composition of C-terminal (ct) residues of a protein | pro2aac_ct -i seq.sfa -o seq.out -n 5 |
pro2aac_rest.pl | To calculate amino acid composition of a protein after removing N- and C-terminal residues | pro2aac_rest -i seq.sfa -o seq.out -n 5 -c 5 |
pro2aac_split | To calculate split amino acid composition (SAAC) of a protein | pro2aac_split -i seq.sfa -o seq.out -n 3 |
pro2dpc | To calculate dipeptide composition of protein | pro2dpc -i seq.sfa -o seq.out |
pro2dpc_nt | To calculate dipeptide composition of N-terminal (nt) residues of a protein | pro2dpc_nt -i seq.sfa -o seq.out -n 5 |
pro2dpc_ct | To calculate dipeptide composition of C-terminal (ct) residues of a protein | pro2dpc_ct -i seq.sfa -o seq.out -n 5 |
pro2tpc | To calculate tripeptide composition of protein | pro2tpc -i seq.sfa -o seq.out |
add_cols | To add columns of two files | add_cols -i se1.out -c se2.out -o seq.out |
col2svm | To generate input in SVM_light format | col2svm -i se1.out -o svm.out -s +1 |
col_mult | To multiply each column of the input file by a number | col_mult -i se1.out -o se1_mult -n 0.1 |
col_mult_sel | To multiply selected columns by a number | col_mult_sel -i se1.out -o se1_mult -n 10 -a 1 -b 3 |
col_rem | To remove selected columns from a file | perl col_rem -i seq.out -o seq.rm -a 1 -b 2 |
col_ext | To extract selected columns from a file | col_ext -i seq.out -o seq.ext -a 5 -b 10 |
col_corr | To compute the correlation coefficient between two columns | col_corr -i pos -a 1 -b 6 |
col_avg | To calculate the average of the columns of two files | col_avg -a pos1 -b pos2 -o out |
seq2pssm_imp | To calculate PSSM matrix in column format without any normalization | seq2pssm_imp -i seq1.fa -o pssm.out -d nr |
pssm_n1 | To normalize a PSSM profile using the formula 1/(1 + e^(-x)) | pssm_n1 -i pssm.out -o pssm_n1 |
pssm_n2 | To normalize a PSSM profile using the formula (x - min)/(max - min) | pssm_n2 -i pssm.out -o pssm_n2 |
pssm_n3 | To normalize a PSSM profile using the formula (x - min)*100/(max - min) | pssm_n3 -i pssm.out -o pssm_n3 |
pssm_n4 | To normalize a PSSM profile using the formula 1/(1 + e^(-x/100)) | pssm_n4 -i pssm.out -o pssm_n4 |
pssm_comp | To compute PSSM composition (400 points) | pssm_comp -i pssm_n4 -o pssm_n4.out |
col_sig | To compute the significance of columns in two column files | col_sig -i file1 -j file2 > out |
pssm2pat | To generate patterns of given size from PSSM matrix | pssm2pat -i pssm.out -o pssm_pat -w 5 |
pssm_smooth | To generate a smoothed PSSM profile for plotting | pssm_smooth -i pssm.out -o pssm_pat -w 5 |
seq2motif | To create motifs using a sliding window of user-defined length, with the option of adding terminal X | seq2motif -i seq1.fa -o motif.out -w 5 -x y |
motif2bin | To make binary input from the multifasta motif file | motif2bin -i motif_1.out -o bin.out -x y |
blast_similarity | To perform a BLAST similarity search | blast_similarity -i fasta -d nr -j 3 -e 1 -o blast.out |
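The four PSSM normalization formulas listed for pssm_n1 to pssm_n4 can be sketched in Python as follows. This is only an illustration of the formulas, not the code of the Perl programs. The function `normalize_matrix`, the scheme labels "n1" to "n4", and the choice to take min and max over the whole matrix are assumptions made for this sketch; the actual programs may compute min and max differently.

```python
import math


def pssm_n1(x):
    """pssm_n1 formula: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def pssm_n2(x, lo, hi):
    """pssm_n2 formula: (x - min) / (max - min)."""
    return (x - lo) / (hi - lo)


def pssm_n3(x, lo, hi):
    """pssm_n3 formula: (x - min) * 100 / (max - min)."""
    return (x - lo) * 100.0 / (hi - lo)


def pssm_n4(x):
    """pssm_n4 formula: 1 / (1 + e^(-x/100))."""
    return 1.0 / (1.0 + math.exp(-x / 100.0))


def normalize_matrix(matrix, scheme):
    """Apply one scheme ("n1".."n4") to every score in a list-of-rows matrix.

    Assumption: min and max are taken over the whole matrix.
    """
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if scheme in ("n2", "n3") and hi == lo:
        raise ValueError("min-max scaling is undefined when all scores are equal")
    transforms = {
        "n1": pssm_n1,
        "n2": lambda v: pssm_n2(v, lo, hi),
        "n3": lambda v: pssm_n3(v, lo, hi),
        "n4": pssm_n4,
    }
    scale = transforms[scheme]
    return [[scale(v) for v in row] for row in matrix]


scores = [[-2, 0], [2, 6]]  # made-up scores, not real PSSM output
print(normalize_matrix(scores, "n2"))
```

The logistic variants (n1, n4) map any score into the interval (0, 1) without needing the score range, while the min-max variants (n2, n3) depend on the smallest and largest score seen.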
The GPSR.2.0 package contains the following Perl and R based programs:
Installation of gpsR version 2.0:
gpsR version 2.0 is a collection of programs written in Perl and R. Before using it, make sure that Perl and R are installed on your operating system. (Perl is installed by default on Unix-based operating systems.)
To check whether Perl is installed on your system, type the following command:
perl -v
If it is installed, the command prints the Perl version number.
To check whether R is installed on your system, type the following command:
R --version
If it is installed, the command prints the R version number.
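The two checks above can also be run from a script. The following is a minimal sketch that looks for `perl` and `R` on the PATH and, if found, runs the same commands (`perl -v` and `R --version`). The helper names `tool_version` and `check_prerequisites` are hypothetical and are not part of the package.

```python
import shutil
import subprocess


def tool_version(command, flag):
    """Return the first line of the tool's version banner,
    or None if the command is not found on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, flag], capture_output=True, text=True, timeout=30
    )
    # Some tools print the banner to stderr, so fall back to it.
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else ""


def check_prerequisites():
    """Run the same two checks as above: `perl -v` and `R --version`."""
    return {
        "perl": tool_version("perl", "-v"),
        "R": tool_version("R", "--version"),
    }


if __name__ == "__main__":
    for name, version in check_prerequisites().items():
        status = version if version is not None else "not found on PATH"
        print(f"{name}: {status}")
```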
GPSR.2.0 is divided into five parts:
Tools for Chemo-informatics: Part A
This part covers the case where you are developing a method for classifying molecules, for example into inhibitors and non-inhibitors, using binary descriptors, i.e. descriptors that take the value 0 or 1, such as fingerprints/descriptors from PaDEL. In this situation, we recommend the following programs.
Program | Dependency | Purpose |
desc_imp_a.pl | i. R ii. desc_imp_a.R | Gives n most important descriptors for predicting positive and negative examples (n given by user) |
desc_sel_a.pl | i. R ii. desc_sel_a.R iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors. |
desc_graph_a.pl | i. R ii. desc_graph_a.R | Creates a barplot of descriptor importance (in terms of IDD) for the important descriptors |
desc_mod_a.pl | i. R ii. desc_mod_a.R | Modifies the binary descriptors based on their relative frequency in the positive and negative datasets |
desc_clust_a.pl | i. R ii. chem_desc_clust_a.R | Performs clustering of descriptors (i.e. column-wise) with graphical representation. |
chem_clust_a.pl | i. R ii. chem_desc_clust_a.R | Performs clustering of chemicals (i.e. row-wise) with graphical representation. |
sim_chem_a.pl | i. R ii. fingerprint package in R iii. sim_chem_a.R | Finds the most similar chemical from the database of chemicals based on distance between descriptors of chemicals. |
Example | Description |
desc_imp_a.pl | Calculates the importance of each descriptor (in terms of IDD/IDR/IDL) for classifying positive and negative examples and writes a file listing the top n important descriptors (the user selects the value of n) |
usage | desc_imp_a.pl -i file.pos -j file.neg -n 10 -s 4 |
file.pos | Comma-separated file in which rows represent the samples to classify and columns represent descriptors, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0 0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0 0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0 0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0 |
file.neg | Comma-separated file in which rows represent the samples to classify and columns represent descriptors, e.g. 0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0 0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0 0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0 0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0 |
-n = 10 | Number of top descriptors to select (here n is 10) |
-s = 4 | Measure used to calculate the importance of descriptors: 4 - IDD, 5 - IDR, 6 - IDL |
Output will be | i) out.desc_imp_a, a tab-separated file giving the top n descriptors with their names and their IDD/IDR/IDL values |
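To make the input format concrete, the sketch below parses the file.pos and file.neg examples shown above (one sample per row, one descriptor per column) and computes how often each binary descriptor is set in each dataset. The frequency difference is used here only as a simple illustration; it is not the IDD, IDR or IDL measure, whose definitions are not given on this page. The function names `parse_descriptors` and `column_frequency` are hypothetical.

```python
def parse_descriptors(text):
    """Parse comma-separated rows (one sample per line) into lists of ints."""
    return [
        [int(v) for v in line.split(",")]
        for line in text.splitlines()
        if line.strip()
    ]


def column_frequency(rows):
    """Fraction of samples in which each descriptor (column) equals 1."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


# Example rows copied from the file.pos and file.neg entries above.
POS = """0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0"""
NEG = """0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0"""

pos_freq = column_frequency(parse_descriptors(POS))
neg_freq = column_frequency(parse_descriptors(NEG))

# Per-descriptor frequency difference; an illustration only, not IDD/IDR/IDL.
diff = [p - q for p, q in zip(pos_freq, neg_freq)]
for index, d in enumerate(diff):
    if d != 0:
        print(f"column {index + 1}: pos={pos_freq[index]:.2f} neg={neg_freq[index]:.2f}")
```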
Tools for Chemo-informatics: Part B
This part covers the case where you are developing a regression method (for example, predicting the IC50 value of chemicals) using binary descriptors, i.e. descriptors that take the value 0 or 1. In this situation, we recommend the following programs.
Program | Dependency | Purpose |
desc_imp_b.pl | i. R ii. desc_imp_b.R | Gives n most important descriptors for predicting positive and negative examples (n given by user) |
desc_sel_b.pl | i. R ii. desc_sel_b.R iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors |
desc_graph_b.pl | i. R ii. desc_graph_b.R | Creates a barplot of descriptor importance (in terms of IDD) for the important descriptors |
desc_clust_b.pl | i. R ii. chem_desc_clust_b.R | Performs clustering of descriptors (i.e. column-wise) with graphical representation. |
chem_clust_b.pl | i. R ii. chem_desc_clust_b.R | Performs clustering of chemicals (i.e. row-wise) with graphical representation. |
sim_chem_b.pl | i. R ii. fingerprint package in R iii. sim_chem_b.R | Finds the most similar chemical from the database of chemicals based on distance between descriptors of chemicals. |
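The idea behind a similarity search such as sim_chem (finding the closest chemical in a database by the distance between descriptor vectors) can be sketched as follows. The page does not say which distance the program uses, so Hamming distance on binary fingerprints is an assumption here, and the names `hamming_distance`, `most_similar` and the example chemicals are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar(query, database):
    """Return (name, distance) of the database entry closest to the query.

    database maps a chemical name to its binary fingerprint.
    """
    distances = ((name, hamming_distance(query, fp)) for name, fp in database.items())
    return min(distances, key=lambda pair: pair[1])


# Made-up fingerprints for illustration only.
database = {
    "chem_A": [0, 1, 1, 0],
    "chem_B": [1, 1, 0, 0],
    "chem_C": [0, 0, 0, 1],
}
query = [0, 1, 1, 1]
print(most_similar(query, database))
```

Other measures, such as the Tanimoto coefficient, are also common for fingerprints; swapping the distance function is enough to change the criterion in this sketch.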
Tools for Chemo-informatics: Part C
This part covers the case where you are developing a classification method (for example, inhibitors versus non-inhibitors) using descriptors with real values. In this situation, we recommend the following programs.
Program | Dependency | Purpose |
desc_imp_c.pl | i. R ii. desc_imp_c.R | Gives n most important descriptors for predicting positive and negative examples (n given by user) |
desc_sel_c.pl | i. R ii. desc_sel_c.R iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors |
desc_graph_c.pl | i. R ii. desc_graph_c.R | Creates a barplot of descriptor importance (in terms of IDD) for the important descriptors |
desc_clust_c.pl | i. R ii. chem_desc_clust_c.R | Performs clustering of descriptors (i.e. column-wise) with graphical representation. |
chem_clust_c.pl | i. R ii. chem_desc_clust_c.R | Performs clustering of chemicals (i.e. row-wise) with graphical representation. |
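Removing very similar descriptors, as desc_sel does, can be illustrated with a simple greedy filter on real-valued descriptor columns. The similarity criterion used by the package is not stated on this page; this sketch assumes that two descriptors are "very similar" when the absolute Pearson correlation between them reaches a threshold (0.9 here, an arbitrary choice). The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_descriptors(columns, threshold=0.9):
    """Keep a column only if it is not too correlated with any kept column.

    columns is a list of descriptor columns; returns the kept column indices.
    """
    kept = []
    for i, col in enumerate(columns):
        if all(abs(pearson(col, columns[j])) < threshold for j in kept):
            kept.append(i)
    return kept


# Made-up descriptor columns: the second is an exact multiple of the first.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
print(select_descriptors(columns))
```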
Tools for Chemo-informatics: Part D
This part covers the case where you are developing a regression method based on real-valued responses (say, IC50 values) using descriptors with real values. In this situation, we recommend the following programs.
Program | Dependency | Purpose |
desc_imp_d.pl | i. R ii. desc_imp_d.R | Gives the n most important descriptors based on their correlation with the response (n given by user). An additional file listing all descriptors with their correlation values is also written as output |
desc_sel_d.pl | i. R ii. desc_sel_d.R iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors |
desc_graph_d.pl | i. R ii. desc_graph_d.R | Creates a barplot of descriptor importance (in terms of IDD) for the important descriptors |
desc_clust_d.pl | i. R ii. chem_desc_clust_d.R | Performs clustering of descriptors (i.e. column-wise) with graphical representation. |
chem_clust_d.pl | i. R ii. chem_desc_clust_d.R | Performs clustering of chemicals (i.e. row-wise) with graphical representation. |
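Ranking descriptors by their correlation with the response, as described for desc_imp_d, can be sketched as follows. This assumes Pearson correlation and ranking by absolute value, neither of which is confirmed on this page; the function names, descriptor names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(descriptors, response, n):
    """Return the n (name, r) pairs with the largest |r| against the response.

    descriptors maps a descriptor name to its column of values.
    """
    scored = [(name, pearson(values, response)) for name, values in descriptors.items()]
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:n]


# Made-up data: response could stand for measured IC50 values.
descriptors = {
    "d1": [1, 2, 3, 4],
    "d2": [4, 1, 3, 2],
    "d3": [4, 3, 2, 1],
}
response = [10, 20, 30, 40]
for name, r in rank_by_correlation(descriptors, response, 2):
    print(f"{name}\t{r:.2f}")
```

Ranking by absolute value keeps strongly negatively correlated descriptors, which can be as informative for regression as positively correlated ones.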
Miscellaneous
These programs are used for file preparation and manipulation, and can be helpful in any bioinformatics or chemoinformatics work.
Program | Dependency | Purpose |
make_selectedfile.R | R | Extracts specific columns from input file and writes in output file |
shiftcol.pl | perl | Shifts the 2 columns in a file and writes in an output file |
rem_identicalcol.R | R | Removes identical columns in a file and writes unique columns in output file |
matrix_optimization.pl | R | For a given positive and negative dataset of protein sequences, this program optimizes the substitution matrix, which can then be used to classify positive and negative examples |
randomizefile.pl | perl | Shuffles the rows of a file randomly and writes them to an output file. (It can also extract a user-defined number of lines randomly from the input file and write them to the output file.) |
mean.pl | R | Calculates the row-wise or column-wise mean of a file in CSV format. |
median.pl | R | Calculates the row-wise or column-wise median of a file in CSV format. |
stdev.pl | R | Calculates the row-wise or column-wise standard deviation of a file in CSV format. |
stderr.pl | R | Calculates the row-wise or column-wise standard error of a file in CSV format. |
correlation.pl | R | Calculates the correlation between all columns of a file or between 2 columns. |
barplot.pl | R | Draws a barplot between 2 properties. |
roc.pl | R and R-libraries | Draws a ROC plot. |
PSSM-pattern.pl | gpsr_1.0, blastpgp (for psiblast). | Makes PSSM profile of positive and negative patterns for prediction at residue level (see gpsr_1.0 manual for residue level prediction). |
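The column-wise statistics computed by mean.pl, median.pl, stdev.pl and stderr.pl can be illustrated with the Python standard library. This sketch assumes a purely numeric CSV without a header row, uses the sample standard deviation, and takes the standard error as the standard deviation divided by the square root of the number of values; the package's own conventions may differ. The function name `column_stats` is hypothetical.

```python
import csv
import io
import math
import statistics


def column_stats(csv_text):
    """Return one dict of statistics per column of a numeric CSV string."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [[float(v) for v in row] for row in reader if row]
    stats = []
    for col in zip(*rows):  # zip(*rows) transposes rows into columns
        sd = statistics.stdev(col)  # sample standard deviation, needs >= 2 rows
        stats.append(
            {
                "mean": statistics.mean(col),
                "median": statistics.median(col),
                "stdev": sd,
                "stderr": sd / math.sqrt(len(col)),
            }
        )
    return stats


# Made-up data: three rows, two columns.
example = "1,10\n2,20\n3,30\n"
for index, column in enumerate(column_stats(example), start=1):
    print(index, column)
```

Row-wise statistics follow from the same function by transposing the input first.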