In this study, we propose a novel vector space model approach to virus classification, as an alternative to traditional approaches, using two types of genome sequences: DNA and CDS. For DNA sequences, the k-mer approach was adopted for pattern extraction, and the entropy of the pattern frequency distribution among classes was used for pattern weighting. For CDS sequences, pattern extraction was instead based on identifying distinctive protein functions, which were formed by CDS clustering, and a weighting method similar to the tf * idf approach was proposed for these protein functions. The experimental data were downloaded from NCBI and comprised 1,877 viruses in 35 classes (virus families). With an SVM classifier, the highest classification accuracies were 94.7% and 91.3% for DNA and CDS sequences, respectively. This study not only proposes a novel approach to virus classification but also provides a new methodology for comparative genomic analysis.
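The DNA branch described above (k-mer extraction followed by entropy-based weighting) can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the normalization used here (one minus the class entropy divided by its maximum) is an assumption, and the function names `kmer_counts`, `class_entropy`, and `entropy_weights` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def class_entropy(freq_by_class):
    """Shannon entropy (in bits) of one pattern's frequency distribution across classes."""
    total = sum(freq_by_class.values())
    if total == 0:
        return 0.0
    probs = [f / total for f in freq_by_class.values() if f > 0]
    return -sum(p * log2(p) for p in probs)


def entropy_weights(class_counts):
    """Assumed weighting: 1 - H / H_max, so a pattern confined to one class
    gets weight 1 and a pattern spread evenly over all classes gets weight 0."""
    classes = list(class_counts)
    max_h = log2(len(classes)) if len(classes) > 1 else 1.0
    patterns = set().union(*(class_counts[c] for c in classes))
    return {
        p: 1.0 - class_entropy({c: class_counts[c][p] for c in classes}) / max_h
        for p in patterns
    }


# Toy example with two made-up classes; real input would be per-family k-mer totals.
toy = {
    "family_A": kmer_counts("AAAAA", 2),    # AA occurs 4 times
    "family_B": kmer_counts("AAAAACCC", 2), # AA x4, AC x1, CC x2
}
weights = entropy_weights(toy)
```

In this toy input, `AA` is equally frequent in both classes and so carries no discriminative weight, while `CC` appears in only one class and receives the maximum weight. The weighted k-mer vectors would then be passed to a classifier such as an SVM, as the abstract describes.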
Relation:
The 11th IEEE International Conference on Bioinformatics and Bioengineering (BIBE2011)