Download Datasets
 



Positive dataset: The positive datasets for both chemokines and chemokine receptors were downloaded from http://cytokine.medic.kumamoto-u.ac.jp/. The chemokine dataset contained 431 protein sequences and the chemokine receptor dataset contained 314. CD-HIT was used to remove highly homologous sequences from both datasets at a cut-off of 90%, so that no two sequences in either dataset are more than 90% similar and the datasets are nonredundant. After this filtering, 193 chemokine and 96 chemokine receptor sequences remained. These two sets were used as the positive datasets for developing the method.
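The redundancy-removal step can be illustrated with a small sketch. This is not CD-HIT itself: it is a toy greedy filter that only mimics the general idea (process the longest sequences first, and keep a sequence only if it is not too similar to one already kept). The function names `identity` and `remove_redundant` are hypothetical, and `difflib.SequenceMatcher` is used as a rough stand-in for the alignment-based measure CD-HIT computes, so the numbers it produces will not match CD-HIT's.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for an alignment-based measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(seqs: dict[str, str], cutoff: float = 0.9) -> dict[str, str]:
    """Greedy filter: keep a sequence only if it is at most `cutoff`
    similar to every sequence that has already been kept."""
    kept: dict[str, str] = {}
    # Longest first, so each group of similar sequences is
    # represented by its longest member.
    by_length = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if all(identity(seq, rep) <= cutoff for rep in kept.values()):
            kept[name] = seq
    return kept


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    example = {
        "seq1": "MKVSAALLCLLLIAATFIPQGLA",
        "seq2": "MKVSAALLCLLLIAATFIPQGLS",  # near-duplicate of seq1
        "seq3": "WWWWWWWWWW",               # unrelated to seq1
    }
    print(sorted(remove_redundant(example)))
```

For datasets of realistic size the real CD-HIT program should be used; the sketch is quadratic in the number of sequences and is meant only to make the 90% cut-off concrete.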

Negative dataset: A systematic approach was adopted to select the negative datasets used for training the method. Protein sequences were retrieved from UniProt using the “BUTNOT chemokines” criterion. A BLAST search was then performed against the database of these sequences for every protein in the positive dataset. The resulting hits were sequences that are homologous to chemokines but perform a different function. From these, 193 proteins were selected and used as the negative dataset for chemokines. A similar strategy was used to obtain 96 protein sequences as the negative dataset for chemokine receptors.
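As a rough sketch of the selection step, the code below collects unique hit identifiers from BLAST tabular output and draws a fixed number of them, so that the negative set matches the positive set in size. Several details are assumptions not stated above: the input is taken to be the standard 12-column tabular format (`-outfmt 6`), the E-value threshold `max_evalue` is an arbitrary illustrative value, and the random draw in `pick_negatives` is only one possible way of choosing the 193 (or 96) proteins. Both function names are hypothetical.

```python
import random


def homologous_hits(blast_tab: str, max_evalue: float = 1e-3) -> list[str]:
    """Collect unique subject IDs from BLAST tabular output.

    Assumes the default 12 tab-separated columns, where column 2 is the
    subject ID and column 11 is the E-value.
    """
    hits: list[str] = []
    seen: set[str] = set()
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue and subject not in seen:
            seen.add(subject)
            hits.append(subject)
    return hits


def pick_negatives(hits: list[str], n: int, seed: int = 0) -> list[str]:
    """Draw n distinct hits (assumed strategy: seeded random sample)."""
    if len(hits) < n:
        raise ValueError(f"only {len(hits)} hits available, {n} requested")
    return random.Random(seed).sample(hits, n)
```

With the chemokine dataset, `n` would be 193; with the chemokine receptor dataset, 96.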

Chemokines are divided into four subfamilies:
  • CXC or alpha chemokines
  • CC or beta chemokines
  • C or gamma chemokines, and
  • CX3C or delta chemokines
Chemokine receptors are classified into the same four subfamilies as the chemokines. Because the C and CX3C subfamilies had very few examples, they were merged into a single "joint" class. Thus, the server classifies any query sequence into three subfamilies: CC, CXC and joint (consisting of the C and CX3C subfamilies).
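The merging of the four subfamilies into three classes amounts to a simple relabelling, which might be written as follows. The names `SUBFAMILY_TO_CLASS` and `server_class` are hypothetical and are not taken from the server's code; the sketch only shows how training labels could be collapsed before a classifier is trained.

```python
# Map each of the four subfamilies to one of the three classes.
# C and CX3C share the "joint" class because each has few examples.
SUBFAMILY_TO_CLASS = {
    "CXC": "CXC",
    "CC": "CC",
    "C": "joint",
    "CX3C": "joint",
}


def server_class(subfamily: str) -> str:
    """Return the three-way class label for a four-way subfamily label."""
    key = subfamily.strip().upper()
    if key not in SUBFAMILY_TO_CLASS:
        raise ValueError(f"unknown subfamily: {subfamily!r}")
    return SUBFAMILY_TO_CLASS[key]


if __name__ == "__main__":
    for label in ("CXC", "CC", "C", "CX3C"):
        print(label, "->", server_class(label))
```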