By validating the kDAP against existing clinical factors, we envision its application to emerging problems where no suitable factors exist.

Keywords: models, data analysis protocol, predictable good performance, cancer prediction, parameter selection

== Introduction ==

The US Food and Drug Administration MicroArray Quality Control (MAQC) project is a community-wide effort to analyze the technical performance and practical use of emerging biomarker technologies (such as DNA microarrays, genome-wide association studies and next generation sequencing) for clinical application and risk/safety assessment. A major objective of the second phase of the project (MAQC-II) is to evaluate the performance of microarray-based classifiers for clinical use.1 To facilitate this investigation, the MAQC-II project obtained three large clinical data sets containing approximately 700 samples. These data profile three types of cancer (breast cancer, neuroblastoma and multiple myeloma) and were generated with Affymetrix or Agilent microarray technologies. The MAQC-II organized these samples into six clinical end points, two positive controls and two negative controls (Table 1).

== Table 1. Data set properties for 10 clinical end points. ==

Provided by the University of Texas MD Anderson Cancer Center (Houston, TX, USA).2 Provided by the Myeloma Institute for Research and Therapy at the University of Arkansas for Medical Sciences (Little Rock, AR, USA).3 Provided by the Children’s Hospital of the University of Cologne, Germany.4

The MAQC-II project extensively evaluated common practices for classifier development and validation, such as dealing with an exceedingly large feature space (that is, the 'curse of dimensionality'), selecting the best performing model among those developed (that is, the multiple comparisons problem) and estimating the performance of the classifiers for future prediction (that is, cross-validation (CV) versus external validation (EV)).
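The contrast between a cross-validation estimate and an external validation estimate, together with the optimism introduced by picking the best of several models, can be illustrated with a minimal sketch. It assumes a toy k-nearest neighbor classifier and synthetic Gaussian data. The function names (`knn_predict`, `cross_validate`, `make_samples`), the candidate values of k and the data are all hypothetical and are not taken from MAQC-II.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(
        train,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of `test` samples whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def cross_validate(data, k, folds=5):
    """Internal estimate: mean accuracy over `folds` held-out partitions."""
    scores = []
    for i in range(folds):
        held_out = data[i::folds]
        kept = [s for j, s in enumerate(data) if j % folds != i]
        scores.append(accuracy(kept, held_out, k))
    return sum(scores) / folds

def make_samples(rng, n, shift):
    """Synthetic two-class data (illustrative only, not MAQC-II data)."""
    samples = []
    for i in range(n):
        label = i % 2
        features = [rng.gauss(label * shift, 1.0) for _ in range(5)]
        samples.append((features, label))
    return samples

rng = random.Random(0)
training_set = make_samples(rng, 60, shift=2.0)
external_set = make_samples(rng, 40, shift=2.0)  # never used for selection

# Model selection: compare several candidate k values by CV on training data.
cv_by_k = {k: cross_validate(training_set, k) for k in (1, 3, 5, 7)}
best_k = max(cv_by_k, key=cv_by_k.get)

# External validation: score the selected model once on the untouched set.
ev_accuracy = accuracy(training_set, external_set, best_k)
print(best_k, cv_by_k[best_k], ev_accuracy)
```

Because `best_k` is chosen as the maximum over several CV scores, that CV score tends to be optimistic. The accuracy on `external_set`, which played no part in the selection, is the less biased estimate of future performance.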
An unbiased way to determine best practices for classifier development and validation is to systematically explore the entire parameter space of various classification algorithms. However, due to the overwhelming number of modeling parameters that contribute to classifier performance, the MAQC-II consortium decided that it was not administratively feasible to conduct such a study. Consequently, 36 MAQC-II analysis teams from academia, industry and the Food and Drug Administration selected their own methods and parameter spaces to build classifiers using the same labeled data sets and then submitted them to MAQC-II. Among the 19 779 classification models submitted by the 36 teams, 9742 were k-nearest neighbor-based (KNN-based) models (that is, 49.3% of the total). Analyzing these KNN classifiers, we made two key observations. First, KNN models generally performed well compared with more complicated models, a finding that is also in line with previous studies.5,6 Second, there were large variations in prediction performance among KNN models submitted by different teams (Supplementary Figure S1). Thus, the main goals of this study were (1) to motivate the use of classifiers such as KNN that capture nonlinear interactions between features as opposed to main effects; (2) to investigate the modeling factors that contribute to the variations in KNN classifier performance; (3) to develop a robust KNN data analysis protocol (kDAP) that can provide reliable KNN models for clinical use; (4) to show how this kDAP can be applied to a newly generated clinical data set and (5) to validate the KNN predictor results through both biological interpretation and comparison with practical clinical risk factors. As shown in Figure 1, we develop the kDAP using MAQC-II data and assess its clinical use by comparing its performance to existing clinical factors for risk stratification.

== Figure 1. ==

Neuroblastoma case study to show clinical applications of the KNN classifier. We designed a method to test whether KNN produces classifiers of good clinical relevance. First, we developed our approach using MAQC-II gene expression data. Then, we applied this approach to additional neuroblastoma data and compared it to existing clinical factors for risk.

== Background ==

Besides being popular in the MAQC-II project, KNN is also a common classification method in the literature, including Nature series journals,7,8 Proceedings of the National Academy of Sciences9,10,11 and the New England Journal of Medicine.12,13 The KNN classifier assigns a label to a new unknown sample by considering the labels of the k most.