Positive dataset: The positive datasets for both chemokines and chemokine receptors
were downloaded from http://cytokine.medic.kumamoto-u.ac.jp/. The chemokine dataset had
431 protein sequences, whereas the chemokine receptor dataset had
314 protein sequences. CD-HIT software was used to remove highly homologous
sequences from these datasets at a cut-off of 90%. Thus, no two
sequences in either dataset were more than 90% similar, and the datasets used are
therefore non-redundant. After CD-HIT filtering, 193 chemokine and 96 chemokine receptor
sequences remained. These two sets were used
as the positive datasets for developing the method.
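The redundancy-removal step can be illustrated with a small sketch. This is not CD-HIT itself: it is a simplified greedy filter that processes sequences from longest to shortest and keeps a sequence only if it is at most 90% similar to every sequence already kept. The `identity` function, built on Python's `difflib`, is a crude stand-in for a real alignment-based identity, and the names `identity` and `remove_redundant` are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity: matched characters divided by the shorter length.

    Assumption: this stands in for an alignment-based sequence identity.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / shorter


def remove_redundant(seqs: dict[str, str], cutoff: float = 0.9) -> dict[str, str]:
    """Greedy filter: visit sequences longest first and keep one only if
    its similarity to every already kept sequence is at most `cutoff`."""
    kept: dict[str, str] = {}
    by_length = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if all(identity(seq, rep) <= cutoff for rep in kept.values()):
            kept[name] = seq
    return kept


# Toy sequences for illustration only; "b" is a prefix of "a".
toy = {
    "a": "MKVSAALLCLLLIAATFIPQGLAQ",
    "b": "MKVSAALLCLLLIAATFIPQGLA",
    "c": "QPDAINAPVTCCYNFTNRKI",
}
nonredundant = remove_redundant(toy)
```

In this toy input, "b" is dropped because it is fully contained in the longer "a", while the unrelated "c" is retained.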
Negative dataset: A systematic approach was adopted to select the negative datasets
used for training the method. Protein sequences were retrieved from UniProt using the
“BUTNOT chemokines” criterion. A BLAST search was then performed against the database of
these sequences for every protein in the positive dataset. The resulting hits were
sequences that were homologous to chemokines but performed a different function.
From these hits, 193 proteins were selected and used as the negative
dataset for chemokines. A similar strategy was used to obtain 96 protein sequences
as the negative dataset for chemokine receptors.
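The selection of negatives from BLAST hits might look like the following sketch. It assumes the hits are available in BLAST's 12-column tabular format (query ID, subject ID, ..., E-value, bit score). The text does not state an E-value threshold or how the final proteins were chosen from the hits, so `max_evalue` and the "first distinct hits" rule are assumptions, and `select_negatives` is a hypothetical name.

```python
def select_negatives(blast_lines, positive_ids, n, max_evalue=1e-3):
    """Return up to n distinct subject IDs from tabular BLAST hits.

    Assumptions: 12-column tabular output with the subject ID in column 2
    and the E-value in column 11; the threshold and the order-based
    selection are illustrative, not taken from the original method.
    """
    seen = set(positive_ids)  # never pick a known positive as a negative
    selected = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue and subject not in seen:
            seen.add(subject)
            selected.append(subject)
            if len(selected) == n:
                break
    return selected


# Made-up hits for illustration only.
hits = [
    "CCL2\tP1\t45.0\t90\t40\t2\t1\t90\t5\t94\t1e-20\t85.0",
    "CCL2\tP2\t30.0\t60\t40\t1\t1\t60\t3\t62\t0.5\t30.0",
    "CXCL8\tP1\t41.0\t80\t45\t1\t1\t80\t7\t86\t1e-15\t70.0",
    "CXCL8\tP3\t38.0\t75\t44\t2\t1\t75\t2\t76\t1e-8\t55.0",
]
negatives = select_negatives(hits, {"CCL2", "CXCL8"}, n=2)
```

Setting `n` to the size of the positive set (193 for chemokines, 96 for chemokine receptors) would give balanced positive and negative datasets, as described above.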
Chemokines are divided into four subfamilies:
- CXC or alpha chemokines
- CC or beta chemokines
- C or gamma chemokines, and
- CX3C or delta chemokines
Chemokine receptors are classified into the same four subfamilies as the chemokines.
Because the C and CX3C subfamilies had very few examples, they were
merged into a single "joint class". Thus, the server classifies any query
sequence into one of three subfamilies: CC, CXC, or the joint subfamily (consisting of
the C and CX3C subfamilies).
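The relabelling from four subfamilies to three classes can be written as a simple lookup. This is a minimal sketch; the label strings and the names `SUBFAMILY_TO_CLASS` and `to_class` are illustrative assumptions, not identifiers from the server.

```python
from collections import Counter

JOINT = "joint"

# Four biological subfamilies mapped onto the three classes used for prediction.
SUBFAMILY_TO_CLASS = {
    "CC": "CC",
    "CXC": "CXC",
    "C": JOINT,     # too few examples to form its own class
    "CX3C": JOINT,  # too few examples to form its own class
}


def to_class(subfamily: str) -> str:
    """Map a subfamily label to its training class (case-insensitive)."""
    return SUBFAMILY_TO_CLASS[subfamily.strip().upper()]


# Made-up labels for illustration only.
labels = ["CC", "cxc", "C", "CX3C", "CC"]
class_counts = Counter(to_class(label) for label in labels)
```

An unknown label raises a `KeyError`, which makes mislabelled entries visible instead of silently assigning them to a class.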