Improving accuracy for cancer classification with a new algorithm for genes selection
BMC Bioinformatics 2012, 13:298 doi:10.1186/1471-2105-13-298
Published: 13 November 2012
Abstract (provisional)
Background
Although the classification of cancer tissue samples based on gene expression data
has advanced considerably in recent years, improving accuracy remains a major challenge.
One difficulty is establishing an effective method that can select a parsimonious
set of relevant genes. So far, most gene selection methods in the literature focus
on screening individual genes or pairs of genes without considering possible interactions
among genes. Here we introduce a new computational method named the Binary Matrix
Shuffling Filter (BMSF). It not only overcomes the difficulty associated with the
search schemes of traditional wrapper methods and the overfitting problem in a high-dimensional
search space but also takes potential gene interactions into account during gene selection.
This method, coupled with a Support Vector Machine (SVM) for implementation, often selects
a very small number of genes, which makes the model easier to interpret.
Results
We applied our method to 9 two-class gene expression datasets involving human cancers.
During the gene selection process, the set of genes to be kept in the model was recursively
refined and repeatedly updated according to the effect of a given gene on the contributions
of other genes in reference to their usefulness in cancer classification. The small
number of informative genes selected from each dataset leads to significantly improved
leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for
multiple classifiers. The selected genes also generalize well across classifiers: multiple
commonly used classifiers achieved LOOCV accuracy equivalent to or much higher
than that reported in the literature.
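To make the evaluation metric concrete, the following is a minimal sketch of how leave-one-out accuracy can be computed for a two-class dataset restricted to a handful of selected genes. It is an illustration only, not the authors' code: the function names, the toy expression values, and the nearest-centroid classifier (used here as a dependency-free stand-in for the SVM mentioned in the abstract) are all assumptions.

```python
from typing import Callable, List, Sequence

Sample = Sequence[float]
Classifier = Callable[[List[Sample], List[int], Sample], int]


def nearest_centroid(train_x: List[Sample], train_y: List[int], query: Sample) -> int:
    """Stand-in classifier (NOT the SVM used in the paper): predict the
    class whose mean expression profile is closest to the query sample."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        # Per-gene mean over all training samples of this class.
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((q - c) ** 2 for q, c in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(x: List[Sample], y: List[int],
                   classify: Classifier = nearest_centroid) -> float:
    """Hold out each sample once, train on the rest, and return the
    fraction of held-out samples that are predicted correctly."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if classify(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


# Toy data invented for illustration: 6 samples x 2 selected genes.
toy_x = [[1.0, 0.9], [1.2, 1.1], [0.8, 1.0],
         [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_accuracy = loocv_accuracy(toy_x, toy_y)
```

Any classifier with the same call signature can be passed as `classify`, which is one way to compare several classifiers on the same selected gene set. Note that if gene selection itself uses all samples before LOOCV is run, the resulting accuracy estimate can be optimistic.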
Conclusions
A gene's contribution to binary cancer classification is better evaluated
after adjusting for the joint effect of a large number of other genes.
A computationally efficient search scheme was provided to search effectively
in the extensive feature space that includes possible interactions of many genes.
The algorithm's performance on 9 datasets suggests that it is possible to
improve the accuracy of cancer classification by a large margin when the joint effects of
many genes are considered.