About Curt Breneman (RPI)
Curt Breneman was born in Santa Monica, California, in 1956. He earned a B.S. in Chemistry at UCLA in 1980 and a Ph.D. in Chemistry at UC Santa Barbara (with an emphasis on Physical Organic and Computational Chemistry) in 1987. Following two years of postdoctoral research at Yale University, Dr. Breneman joined the faculty of the Department of Chemistry at Rensselaer Polytechnic Institute (RPI), where he began a program in molecular recognition and computational chemistry based on his concept of "Transferable Atom Equivalents" (TAEs) as building blocks for describing the electronic and reactive character of molecules. Dr. Breneman currently holds the rank of Full Professor in the RPI Department of Chemistry and Chemical Biology, and is taking a leading role in the Center for Biocomputation and Quantitation in Rensselaer's new Center for Biotechnology and Interdisciplinary Studies.
The Breneman research group specializes in the development of new molecular property descriptors and machine learning methods that can be applied to a diverse set of physical and biochemical problems. Of paramount interest are methods that increase the information content of molecular descriptors, and machine learning techniques that can exploit this information to create fully validated, predictive property models. Current application areas include pharmaceutical ADME prediction, virtual high-throughput screening of drug candidates, protein chromatography modeling (HIC and ion-exchange), and polymer property prediction.
A Hard Look at Predictive Modeling: How Much Data is Enough?
Curt M. Breneman, Director, RECCR Center, RPI Department of Chemistry, 110 8th St, Troy, NY 12180, USA
A frequent concern within the predictive cheminformatics community is knowing, in any given situation, when sufficient data exist to create a usable model with quantifiable reliability on specific test sets. This problem is particularly acute in modeling toxicity, since assay data are often confounded by large in vivo error bars and multiple mechanisms of action. Coupled with a proliferation of easily available molecular property descriptors and non-linear modeling methods, this creates an environment conducive to highly "local" models that are valid only within the confined chemical space of the training set. This phenomenon frequently defies objective measurement, since the size and scope of a model's applicability domain depend on a number of factors, including the assay endpoint, the descriptor set, and the details of the machine learning method used.
This presentation will summarize our efforts to predict model performance in a number of cases using different descriptor sets, by analyzing each model's stability to a loss of training information together with the similarity of the training data to the test cases. This analysis involved the development of the "Rank Order Entropy" (ROE) metric, as well as an equivalent floating-point metric, for determining whether enough data are present to support the production of a predictive model. Examples of their use in training and test cases will be provided in the presentation.