The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

Commun Chem. 2024 Jun 12;7(1):134. doi: 10.1038/s42004-024-01220-4.

Scott H Snyder ¹, Patricia A Vignaux ¹, Mustafa Kemal Ozalp ¹, Jacob Gerlach ¹, Ana C Puhl ¹, Thomas R Lane ¹, John Corbett ¹, Fabio Urbina ², Sean Ekins ³

Affiliations

1 Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA.
2 Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA. fabio@collaborationspharma.com.
3 Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA. sean@collaborationspharma.com.

PMID: 38866916 DOI: 10.1038/s42004-024-01220-4

Abstract

Recent advances in machine learning (ML) have produced newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, and few-shot learning (FSLC) models, which offer predictive power with extremely small datasets. These new architectures are promising, yet the 'no free lunch' theorem suggests that no single algorithm can outperform all others on every possible task. Here, we explore the capabilities of classical (SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a 'Goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determine the optimal algorithm strategy. When datasets are small (< 50 molecules), FSLC models tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are larger and of sufficient size, classical models perform best. The optimal model therefore likely depends on the size and diversity of the available dataset. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset.
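
The size and diversity thresholds described in the abstract can be read as a simple decision rule. The sketch below is only an illustration of that reading, not the authors' code: the function name, the enum, and the boolean `is_diverse` flag are hypothetical, the abstract does not say how diversity is measured, and it gives no recommendation for 50-240 molecule datasets that are not diverse, so that case is returned as unspecified. Treating 240 as an inclusive upper bound is also an assumption.

```python
from enum import Enum


class ModelFamily(Enum):
    """Model families compared in the abstract (labels are illustrative)."""
    FEW_SHOT = "few-shot learning (FSLC)"
    TRANSFORMER = "transformer (e.g. MolBART)"
    CLASSICAL = "classical ML (e.g. SVR)"
    UNSPECIFIED = "no recommendation stated in the abstract"


def suggest_model_family(n_molecules: int, is_diverse: bool) -> ModelFamily:
    """Map dataset size and diversity to the family the abstract favors.

    `is_diverse` is a hypothetical yes/no flag; the abstract does not
    define a diversity metric or cutoff.
    """
    if n_molecules < 0:
        raise ValueError("n_molecules must be non-negative")
    if n_molecules < 50:
        # Small datasets: few-shot models tend to do best.
        return ModelFamily.FEW_SHOT
    if n_molecules <= 240:  # assumption: 240 is treated as inclusive
        # Small-to-medium datasets: transformers win only when diverse.
        if is_diverse:
            return ModelFamily.TRANSFORMER
        # The abstract is silent on non-diverse datasets in this range.
        return ModelFamily.UNSPECIFIED
    # Larger datasets: classical models perform best.
    return ModelFamily.CLASSICAL


if __name__ == "__main__":
    for size, diverse in [(30, True), (120, True), (120, False), (5000, False)]:
        print(size, diverse, suggest_model_family(size, diverse).value)
```

A rule like this is at most a starting heuristic; the abstract's own wording ("tend to", "likely depends") indicates the boundaries are soft.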

Products

Cat. No.    Product Name                                 Category/Application
HY-L066     FDA Approved & Pharmacopeial Drug Library    Drug Repurposing Series
