ABSTRACT: BACKGROUND: Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms.

RESULTS: This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors.

CONCLUSIONS: We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgibin/predict.cgi.
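The general idea of a minimalist classifier-based ensemble can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not specify the feature selection filter or the contribution score, so a simple greedy forward selection on validation AUC is swapped in here as a stand-in. The four synthetic "predictors", the function names (`auc`, `fit_lr`, `select_minimal_ensemble`) and the `min_gain` threshold are all assumptions made for illustration.

```python
import math
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_lr(X, y, epochs=200, lr=0.5):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(model, X):
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def columns(X, idx):
    """Keep only the predictor columns listed in idx."""
    return [[row[j] for j in idx] for row in X]

def select_minimal_ensemble(X_tr, y_tr, X_va, y_va, min_gain=0.005):
    """Greedy forward selection (an assumed stand-in for the paper's filter):
    keep adding the predictor that most improves validation AUC of the
    LR ensemble, and stop once the gain falls below min_gain."""
    chosen, best = [], 0.5
    remaining = list(range(len(X_tr[0])))
    while remaining:
        trials = []
        for j in remaining:
            idx = chosen + [j]
            model = fit_lr(columns(X_tr, idx), y_tr)
            trials.append((auc(predict(model, columns(X_va, idx)), y_va), j))
        score, j = max(trials)
        if score - best < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen, best

if __name__ == "__main__":
    rng = random.Random(0)

    def make(n):
        # Synthetic outputs of four hypothetical individual predictors.
        X, y = [], []
        for _ in range(n):
            label = rng.randint(0, 1)
            a = label + rng.gauss(0, 1.0)   # informative predictor
            b = a + rng.gauss(0, 0.05)      # near-duplicate of a (redundant)
            c = label + rng.gauss(0, 1.0)   # independent, complementary predictor
            d = rng.gauss(0, 1.0)           # uninformative predictor
            X.append([a, b, c, d])
            y.append(label)
        return X, y

    X_tr, y_tr = make(150)
    X_va, y_va = make(150)
    chosen, score = select_minimal_ensemble(X_tr, y_tr, X_va, y_va)
    print("selected predictor columns:", chosen, "validation AUC: %.3f" % score)
```

In this toy setting a redundant predictor adds little validation AUC once its near-duplicate is in the ensemble, which mirrors the abstract's point that a small, complementary subset can match a larger one. A real pipeline would use held-out or cross-validated data and the outputs of actual localization predictors rather than synthetic columns.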