Artificial Intelligence in Oncology Drug Discovery

1.- Understanding the current possibilities and expectations of drug discovery in oncology

This article is the first part of a series dedicated to Artificial Intelligence (AI) in drug discovery and development in oncology. It is an introductory overview of drug discovery (see Fig 1) that poses, and answers, key questions about the application of AI to drug discovery in oncology.

Fig 1. Overview of key components of drug discovery and development. Graphic from DrugBank. The present article focuses on the discovery phase.

What is the current state of AI in drug discovery, in figures?

AI has become increasingly relevant within the pharmaceutical industry [1].

To date, $13.8B has been invested in companies and partnerships leveraging AI in drug discovery, within a consolidating AI-enabled industry. More than 40 pharma companies and more than 230 start-ups are using AI for drug discovery. Over 205 companies claim to offer AI-based drug discovery services or technologies in lead identification, optimization and generation, and more than 115 drugs have been developed with the aid of AI technology. Its use has also been associated with a positive economic impact, with estimated cost savings of 25%. The AI-based drug discovery market is projected to grow at an annualized rate of 25% during the period 2022-2035 [2].
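To make the projected growth rate concrete, the short sketch below compounds the 25% annualized rate over the stated period. It assumes the rate compounds once per year over 13 periods (2022 to 2035); the function name is illustrative, and no starting market size is assumed because the text does not give one.

```python
def compound_growth_factor(annual_rate: float, years: int) -> float:
    """Return the overall multiplier after compounding annual_rate for `years` periods."""
    return (1.0 + annual_rate) ** years


# Assumption: yearly compounding from 2022 to 2035, i.e. 13 periods.
factor = compound_growth_factor(0.25, 2035 - 2022)
print(f"Implied market multiple over the period: {factor:.1f}x")  # prints 18.2x
```

Under these assumptions, a 25% annualized rate implies a market roughly 18 times its 2022 size by 2035.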

What are the general applications and techniques of AI that are being applied to drug discovery? 

The subfields of AI, which are organized around its general applications, include reasoning, knowledge representation, planning, learning, natural language and sequence processing, perception (including artificial vision), and the ability to move and manipulate objects. All of these subfields are applicable to drug discovery.

AI techniques in drug discovery are numerous. A few examples, from older to more recent, are:

  1. Search-tree algorithms and expert rule-based systems, which are still widely used in retrosynthesis to find optimal synthesis routes.
  2. Traditional machine learning relying on manual feature extraction, such as decision-tree-based methods, dimensionality reduction and clustering techniques applied to sparse multidimensional data, which are still extensively used by biologists.
  3. Deep representational learning, which has evolved into a vast AI discipline illustrated by large-scale deep neural networks with highly sophisticated training regimes and architectures (including graph neural networks, transformers, deep reinforcement learning, etc.) and which holds the greatest promise for drug discovery.
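As an illustration of the first item in the list above, here is a minimal sketch of a rule-based search tree for retrosynthesis. Everything in it is hypothetical: the molecule names, the rule table, the step costs and the set of purchasable building blocks are toy placeholders, and the depth-limited exhaustive search stands in for the far more elaborate search strategies used by real retrosynthesis tools.

```python
# Hypothetical toy rules: product -> list of (precursors, step_cost).
RULES = {
    "target": [(("amide_A", "amine_B"), 1.0), (("ester_C",), 2.5)],
    "amide_A": [(("acid_D", "amine_E"), 1.0)],
    "ester_C": [(("acid_D", "alcohol_F"), 0.5)],
}
# Hypothetical building blocks that can be bought and need no synthesis.
PURCHASABLE = {"amine_B", "acid_D", "amine_E", "alcohol_F"}


def cheapest_route(molecule, depth=5):
    """Return (cost, steps) for the cheapest route, or (inf, []) if none is found."""
    if molecule in PURCHASABLE:
        return 0.0, []
    if depth == 0 or molecule not in RULES:
        return float("inf"), []
    best_cost, best_steps = float("inf"), []
    for precursors, step_cost in RULES[molecule]:
        total, steps = step_cost, []
        for precursor in precursors:
            cost, sub_steps = cheapest_route(precursor, depth - 1)
            total += cost
            steps += sub_steps
        if total < best_cost:
            best_cost = total
            best_steps = steps + [(molecule, precursors)]
    return best_cost, best_steps


cost, steps = cheapest_route("target")
print(cost)  # 2.0
for product, precursors in steps:
    print(product, "<-", " + ".join(precursors))
```

The search expands each rule that can produce a molecule, recurses into its precursors, and keeps the route with the lowest summed cost. Expert rule-based systems differ mainly in how the rule table is built and how the tree is pruned.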

What differentiates general drug discovery tools from cancer-specific applications?

The subfield of oncological drug discovery shares most of the challenges and characteristics of broader drug discovery. Nonetheless, unlike other therapeutic areas, cancer is a more complex disease in which target-based drug discovery built on isolated mechanisms and targets frequently fails in the clinic, in particular as a result of poor efficacy [3]. In the case of viral infections, for example, it may be enough to target a certain protease required for replication or a receptor required for cell entry [3]. In contrast, cancer therapeutics require the targeting of the essential biological capabilities that govern tumor development in humans, summarized in the 14 Hallmarks of Cancer (Fig 2).

Fig 2. The 14 Hallmarks of Cancer.

Toward this end, newer approaches facilitated by AI-based tools are emerging, although they are still evolving. One example is phenotypic screening, which incorporates some of the cellular complexity of biology. It attempts to combine disease-relevant biology, such as the transcriptome, with large numbers of compounds that can be screened [5]. AI is also increasingly used to advance our understanding of functional genomics, helping to decode how gene expression is regulated under normal conditions and in complex diseases like cancer.

What is expected of AI in oncology drug discovery?

In the drug discovery field, clinical testing in human patients is the most expensive and difficult step. Drug candidates still fail 25% of the time on toxicity in humans and about 50% of the time on efficacy [6]. Estimates are even worse in oncology; hence the first expectation is that AI could help reduce clinical failure, and not only the time and cost of the current drug discovery phases.

Over the last two decades, drug discovery tools have increasingly integrated AI techniques in a stepwise fashion, each focused on a narrow vertical task such as QSAR (quantitative structure-activity relationship) property prediction, molecular dynamics simulations for the analysis of allosteric modulation, high-throughput virtual screening or retrosynthesis. This has improved efficiency and reduced costs at those early steps of drug discovery. Of note, all these methods focus mainly on the traditional physicochemical and structural aspects of compound generation, where quality is often encoded by unidimensional metrics of activity and properties. They do not cover the likely biological consequences for the organism’s dynamics, which are only assessed at later stages, in clinical trials. As a result, an “AI-discovered compound” today carries no guarantee of success in clinical trials. Hence, the full potential of AI in drug discovery is not yet realized. It will only be reached once the complexity of cancer biology (as recapitulated in the Hallmarks of Cancer) and the efficacy and toxicity of drugs in the human body can be modeled computationally, even if that complexity does not need to be completely understood by humans.
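To illustrate what a unidimensional activity metric looks like in the simplest QSAR setting, the sketch below fits a one-descriptor linear model by ordinary least squares. The descriptor and activity values are synthetic numbers invented for this example, not measurements, and the names `descriptor` and `activity` are placeholders; real QSAR models use many descriptors and more robust learners.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic toy data: one molecular descriptor and one activity value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(descriptor, activity)
predicted = slope * 2.5 + intercept  # predicted activity for a new descriptor value
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

A model of this kind scores a compound on a single activity axis, which is exactly why it says nothing about downstream efficacy or toxicity in the organism.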

This will not happen at once; it will be a long, collaborative road, and the availability of foundation models+ trained on multimodal omic data++ is the first milestone to reach. Nonetheless, in the meantime, as we approach this goal, we will become increasingly capable of using AI to reduce clinical failure, helping to decide at the drug discovery stage which drug candidates should move to clinical development based on more accurate predictions of their safety and efficacy. It will also pave the way for better patient selection and predictive biomarkers, towards a more personalized medicine.

+ Foundation models are large-scale deep learning models trained (usually by self-supervised learning) on vast quantities of data, resulting in a model that can be adapted to a wide range of downstream tasks.

++ Multimodal refers to different modalities of data such as images, text, biological sequences, signals and tabular data. In cancer drug development, multimodal omic data refers to different sources and modalities of data, ranging from the physicochemical and structural properties of molecules to multi-omic biological data (e.g., genomic, transcriptomic, epigenomic, reactome, metabolomic, microbiome) and clinical data (e.g., toxicity, tumor response, patient outcomes).

How could we realize the full potential of AI in drug discovery?

  • AI is limited by the quality of the data it has to process. Curating data to raise its quality is of paramount importance, with a focus on the quality of labeling and the assessment of the signal-to-noise ratio. This matters especially for biological omic data, which is characterized by sparsity, noise and batch effects introduced by biological readouts. These readouts are highly dependent on the experimental system or assay used and are thus often not reproducible. Correction of batch effects is needed (e.g., for the integration of single-cell RNA-seq data [7] or histopathology images [8]).
  • Small datasets, proprietary undisclosed datasets and disparate efforts are not enough to train large-scale foundation models. Collaboration and investment are required to compile large datasets of omic data for computational oncological drug discovery.
  • Deep representational learning (DL) is under-represented in cancer biology compared with the ubiquitous use of traditional machine learning and heavy feature engineering, mainly because of small datasets and limited computational resources and experience. Yet DL holds the greatest potential for multi-omic data, as it can handle both multimodal data and raw biological sequences while minimizing information loss.
  • Large-scale foundation models are required in cancer biology, following contemporary examples from the general domain in natural language and/or images such as DALL-E 2 [9], CLIP [10] and BLOOM [11]. In the field of drug discovery, AlphaFold [12] is currently being explored as a foundation model for different downstream drug discovery tasks involving the 3D structure of target proteins.
  • The industry should back and incentivize open-source projects and help grow the community of AI/DL software engineers in drug discovery. Relevant open-source initiatives are DeepPurpose [13] and DeepChem [14]. Open-source tools coupled with public data would lower the barrier to applying AI in drug discovery for small and medium-sized biotechs.
  • Finally, experimental validation is always required in computational drug discovery.
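The batch-effect correction mentioned in the first point of this list can be sketched in its simplest form as per-batch mean centering of a single feature. This is an assumption-laden toy, not the integration methods cited in [7] or [8]: the expression values and batch labels are synthetic, and centering also removes any true biological difference that happens to coincide with the batch.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples (one feature at a time)."""
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]


# Synthetic toy data: batch "B" reads systematically 3 units higher than batch "A".
expression = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
batch_labels = ["A", "A", "A", "B", "B", "B"]

corrected = center_by_batch(expression, batch_labels)
print(corrected)  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

After centering, the systematic offset between the two batches is gone and only the within-batch variation remains, which is the basic idea that more sophisticated integration methods refine.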

AI can be applied at every stage of drug discovery. In the next article, we will review concrete applications of AI and available AI-based resources, including datasets, applicable to the different phases of drug discovery in oncology: target identification, phenotypic screening, target validation, virtual screening, de novo design, retrosynthesis, drug synthesis and optimization, and drug repositioning. It will also discuss the potential of AlphaFold in drug discovery applied to oncology and the emergence of deep generative models in this field.

Dr. Aurelia Bustos, MD, PhD