This work presents an external-memory approach for extracting maximal repeats from whole-genome
sequences, together with statistics of these repeats across classes, where the definition of a class
depends on the statistics to be computed. A heuristic method is adopted that combines a
bucket-sort-like approach with a Chinese term extraction approach. The bucket-sort-like method sorts
the suffixes of DNA sequences stored in files, and the term extraction method extracts maximal
repeats by scanning the sorted suffixes while computing the statistics of these repeats. The statistics of
these repeats across classes may be useful for sequence classification and species identification.
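
The two phases described above can be illustrated with a minimal in-memory sketch. It is not the paper's implementation: the function name `maximal_repeats`, the `prefix_len` parameter, and the use of a plain occurrence count per class as the statistic are assumptions made for illustration, and a dictionary of lists stands in for the bucket files that an external-memory version would write to disk.

```python
from collections import defaultdict


def maximal_repeats(seqs, labels, prefix_len=2):
    """Return {repeat: {class_label: occurrence_count}} for maximal repeats.

    Illustrative sketch only: everything is held in memory, whereas the
    approach described in the text keeps suffixes in files.
    """
    # Phase 1 (bucket-sort-like): group suffixes by their leading characters.
    # In an external-memory setting each bucket would be a separate file.
    buckets = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for pos in range(len(seq)):
            buckets[seq[pos:pos + prefix_len]].append((seq[pos:], sid, pos))

    # Sort inside each bucket and visit buckets in key order.
    ordered = []
    for key in sorted(buckets):
        ordered.extend(sorted(buckets[key]))

    # Phase 2 (scan): the common prefix of two adjacent sorted suffixes is a
    # repeat that cannot be extended to the right (right-maximal).
    candidates = set()
    for (a, _, _), (b, _, _) in zip(ordered, ordered[1:]):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > 0:
            candidates.add(a[:n])

    # Collect every occurrence of each candidate.
    lengths = {len(w) for w in candidates}
    occurrences = defaultdict(list)
    for suffix, sid, pos in ordered:
        for n in lengths:
            if n <= len(suffix) and suffix[:n] in candidates:
                occurrences[suffix[:n]].append((sid, pos))

    # Keep candidates that are also left-maximal: their occurrences must be
    # preceded by at least two different characters. A sequence start counts
    # as a distinct "character" per sequence.
    result = {}
    for repeat, places in occurrences.items():
        left = {
            seqs[sid][pos - 1] if pos > 0 else ("^", sid)
            for sid, pos in places
        }
        if len(left) >= 2:
            counts = defaultdict(int)
            for sid, _ in places:
                counts[labels[sid]] += 1  # assumed statistic: count per class
            result[repeat] = dict(counts)
    return result


if __name__ == "__main__":
    # Two toy sequences, each assigned to its own hypothetical class.
    print(maximal_repeats(["AACGT", "CACGG"], ["a", "b"]))
```

In this toy input, `CG` is repeated but is always preceded by `A`, so it is dropped in favour of the longer repeat `ACG`. The occurrence-collection step is quadratic in the worst case and is kept only for readability.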
Relation: Asian Journal of Health and Information Sciences 1(3): 276-295