Artificial Intelligence in Oncology 2/3
2.- Current AI applications, innovations and resources available to the industry at the different steps of drug discovery in oncology
Typically, 2,000 to 3,000 molecules are required over many years before a single compound is suitable for clinical use. AI has the potential to shorten the path to a clinical molecule. The aim is to design and develop novel, precision-engineered drugs with an improved probability of clinical success. This article introduces state-of-the-art AI resources available to pharmaceutical and biotech companies for drug discovery in oncology.
Drug discovery encompasses the target identification and validation, hit identification, and lead optimization phases (see Fig. 1). It is followed by the drug development phases, in which preclinical, clinical and post-market trials are conducted.
Fig. 1: Key areas of application of artificial intelligence in drug discovery.
AI has the potential to improve all of those phases. It is particularly critical in oncology, where disease hypothesis generation, target identification and target validation require the integration of genomics, functional genomics and genome engineering data, combined in structural and functional ways to mimic the tumor phenotype. In the past, most drug targets were found by combing the published scientific literature for insights into molecular pathways or genetic variants linked to disease. Now the focus is on identifying original, novel targets through genomics and functional genomics enabled by artificial intelligence tools.
Current AI applications and tools in drug discovery
Current AI applications cover a broad range of tasks in the drug discovery and development pipelines for the different therapeutic modalities in oncology, including small molecules and biologics, antibodies, peptides, miRNAs, and gene editing therapies. The tasks can be grouped into four classes:
- Target discovery: Tasks to identify candidate drug targets.
- Activity modeling: Tasks to screen and generate individual or combinatorial candidates with high binding activity towards targets.
- Efficacy and safety: Tasks to optimize therapeutic signatures indicative of drug safety and efficacy.
- Manufacturing: Tasks in support of synthesis and manufacturing of therapeutics.
For each of those steps and tasks, a vast number of AI resources exists, both proprietary and freely available.
AI Innovations: protein structure prediction models and generative chemistry
Among recent advances, Deep Learning (DL) enabled solutions are the most promising. In particular, this section introduces the potential in drug discovery of generative deep models for novel chemical synthesis, and the emergence of highly efficient protein structure prediction models such as AlphaFold2 and, more recently, the ESM Metagenomic Atlas, which open up innovative solutions for structure-based drug design and for target and activity modeling in the protein space.
I. Protein structure prediction models: AlphaFold2
AlphaFold2 is a DL model developed by DeepMind that predicts the folding of monomeric proteins for which few homology templates are available. The accuracy of its predictions, based on the distance difference test score, is reported to be non-inferior to experimental methods, and confidence metrics are provided to guide the use of 3D structures produced with different levels of uncertainty. It has been shown that, when this uncertainty is taken into account, predictions can be applied to existing structural biology challenges and their quality is near that of experimental models.
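As a rough illustration of how a per-residue confidence metric can guide usage, the sketch below bins pLDDT scores (the per-residue confidence that AlphaFold2 reports on a 0 to 100 scale) into the bands displayed by the AlphaFold database and reports what fraction of a model clears a chosen cutoff. The function names, the example scores and the 70-point cutoff are illustrative assumptions, not part of any official tooling.

```python
from collections import Counter

# Confidence bands as displayed by the AlphaFold database: (lower bound, label).
BANDS = [(90.0, "very high"), (70.0, "confident"), (50.0, "low"), (0.0, "very low")]


def confidence_band(plddt: float) -> str:
    """Map one per-residue pLDDT score (0-100) to its confidence band."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError(f"pLDDT must be in [0, 100], got {plddt}")
    for lower_bound, label in BANDS:
        if plddt >= lower_bound:
            return label
    return "very low"  # not reached: the last band starts at 0


def summarize(plddt_scores, cutoff=70.0):
    """Count residues per band and return the fraction at or above `cutoff`.

    The cutoff is an assumption for illustration; a project would pick its own.
    """
    bands = Counter(confidence_band(score) for score in plddt_scores)
    usable = sum(1 for score in plddt_scores if score >= cutoff)
    return bands, usable / len(plddt_scores)


if __name__ == "__main__":
    # Hypothetical per-residue scores for a short model (not real data).
    example_scores = [95.2, 91.0, 88.4, 72.5, 65.0, 48.3, 30.1, 93.7]
    band_counts, usable_fraction = summarize(example_scores)
    print(dict(band_counts), f"usable fraction: {usable_fraction:.3f}")
```

In practice, a filter of this kind might be used to restrict docking or pocket analysis to the regions of a predicted structure that carry high confidence.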
To understand its utility in oncology we need to answer the following questions:
- How relevant is protein folding for drug discovery? The 3D structure of a protein is highly correlated with how it functions in a cell and with the impact of amino acid mutations. A protein structure is a versatile tool for studying gene-disease associations and mechanisms of action (MoA), and for evaluating the druggability of a therapeutic target. Structure-based drug discovery (SBDD), which requires the 3D structure of a target, has been a mainstay method for identifying hit molecules and performing lead optimization [1-3]. After decades of effort, only a small fraction of known proteins have experimentally determined structures.
Fig 2. “Lock and Key” theory of drug-target interactions. Image source: Christopher Vakoc.
- How much of the proteome space does AlphaFold2 cover, specifically in oncology? The recently released AlphaFold database of predicted protein structures, which currently holds over 200M protein structures covering 21 species, has increased the baseline coverage of the human proteome from 48%, considering experimentally derived or template-based homology models, to 76%. At the same time, AlphaFold2 has reduced the fraction of the dark proteome from 26% to just 10%. For oncogenic mutations, coverage also increases, but less markedly. The coverage of disease-associated genes and mutations was already near complete before the AlphaFold2 release, reaching 88% of oncogenic mutations (TCGA and OncoKB) and 69% of ClinVar pathogenic mutations. This high coverage arose not only because disease-associated mutations are more studied, but also because they tend to be located in protein regions that form structures. Nevertheless, AlphaFold2 models still provide additional coverage of 3% to 13% of these critically important sets of biomedical genes and mutations in cancer biology.
- What are the applications of protein structure prediction models in drug discovery? Three main applications are being explored or proposed:
- Structure-based Drug Identification and Design: In industry, AlphaFold2 is already being used for structure-based drug design in oncology as one of the components of innovative AI pipeline platforms licensed to an increasing number of pharmaceutical companies. For example, in hepatocellular carcinoma (HCC), AlphaFold2-predicted protein structures were used for the first time to identify a confirmed hit for an AI-identified novel target. Specifically, by analyzing text and omics data from 10 databases for HCC, the PandaOmics platform (developed by Insilico Medicine) produced a top list of 20 targets after filtering along multiple dimensions, including novelty, accessibility by biologics, safety, small-molecule accessibility, and tissue specificity. CDK20 was selected as the initial target because of its strong disease association, limited experimental structure information, and lack of any publicly known small-molecule inhibitor. Next, using the structure-based drug design AI platform Chemistry42 (also developed by Insilico Medicine), the authors generated 8918 molecules based on the AlphaFold2-predicted CDK20 structure, and 7 were selected for synthesis and biological testing after molecular docking, clustering, and pose inspection. Among them, compound ISM042-2-001 demonstrated a Kd value of 8.9 ± 1.6 µM (n = 4) in a CDK20 kinase binding assay, and its docked binding mode served as guidance for further structural modifications. Notably, this molecule is the first reported CDK20 inhibitor. However, modeling protein-ligand interactions is not an out-of-the-box feature of AlphaFold2. In silico experiments predicting the interactions of proteins with small molecules indicate that harnessing AlphaFold2 for drug-target prediction remains a nascent method, and that realizing its potential for drug discovery requires additional ML-based improvements in modeling protein-ligand interactions.
For example, a study that benchmarked molecular docking simulations based on AlphaFold2-predicted structures against high-throughput measurements of protein-ligand interactions revealed weak model performance. The poor performance was attributed to the fact that the protein structures fed into the model are static, whereas proteins in biological systems are flexible and often shift their conformations. Supporting this hypothesis, the researchers found that performance was similarly weak for high-quality experimentally derived protein structures. To improve the performance of their modeling approach, they ran the predictions through four additional machine learning models. These models were trained on data describing how proteins and other molecules interact with each other, which allowed them to incorporate more information into the predictions. Rescoring the docking poses with these machine learning-based scoring functions improved model performance, which indicates that advances in modeling protein-ligand interactions, particularly through machine learning-based approaches, are needed to better harness AlphaFold2 for drug discovery.
- Polypharmacology: Predicted structures could provide a more robust pharmacology framework for drug discovery, in which drugs are designed for their polypharmacology, i.e., to modulate multiple protein targets intentionally. This contrasts with medicinal chemistry as practiced today, where the emphasis is on minimizing off-target effects and making highly selective small molecules. Drugs with designed polypharmacology may be able to modulate entire signaling pathways instead of acting on one protein at a time. Although formidable challenges remain before this goal is reached, the wide availability of protein structures may hasten progress.
- Structural Systems Biology: The availability of structural data is a common bottleneck. Until now this information was available only in isolated pieces, as fragments of proteins and fragments of proteomes, so systems biology has been predominantly non-spatial. Protein structure prediction models, which provide structural information at scale, are now expected to enable structure-based models that could advance the following tasks of special relevance in oncology:
3.1 Understanding the effects of genetic variants, especially those located in cancer driver genes or oncogenic mutations (TCGA and OncoKB). This task is important for human disease but hard. To discern whether mutations are deleterious, almost exclusively statistical approaches are used today, comparing healthy and sick populations. AlphaFold2 can predict wild-type protein structures with high accuracy, but according to some reports it cannot predict the impact of cancer missense mutations on protein structures, because its training data do not contain the altered structures of these mutated proteins. However, this limitation is expected to be circumvented by language-model-based prediction methods, such as those behind the ESM Metagenomic Atlas, which are better suited to quickly determining how mutations alter a protein’s structure.
3.2 Protein design, or de novo protein design. There are emerging efforts in this field such as AlphaDesign, a computational framework for de novo protein design that embeds AlphaFold2 as an oracle within an optimized design process. It was reported to enable rapid prediction of completely novel protein monomers starting from random sequences. The authors also note that the recent and unexpected ability of AlphaFold2 to predict the structure of protein complexes allows their framework to design higher-order complexes, and they point to the potential for designing proteins that bind to a pre-specified target protein (protein-protein interactions). The structural integrity of the predicted structures is validated by standard ab initio folding and structural analysis methods and, more extensively, by rigorous all-atom molecular dynamics simulations. Their approach also reveals the capacity of AlphaFold2 to predict proteins that switch conformation upon complex formation.
3.3 Transcriptional regulation engineering: Currently, the structural machinery of the protein-based signal transduction pathways that give the cell its form, motility, and function is not yet engineerable. AlphaFold2 may change this, not only through de novo protein design but also through the engineering of multi-domain proteins with flexible linkers and programmable logic.
- Conclusions and open challenges:
Deep learning models for protein structure prediction, such as AlphaFold2 and those behind the ESM Metagenomic Atlas, are an outstanding achievement given the wider coverage, precision and speed with which folded protein states can now be predicted. They are still only one piece of the puzzle, however, and they highlight open challenges in drug discovery such as the prediction of protein-protein interaction complexes, allostery and dynamics, the relative positioning of protein domains in multi-domain proteins, the identification of immunogenic peptides (neoantigen prediction), and the prediction of the consequences of different types of mutations. Beyond those open challenges in early drug discovery, crucial questions about the in vivo efficacy and safety of any drug remain. Even if we can dock against, and do structure-based design on, more targets than before and so discover ligands faster, this does not anticipate the failure or success of a ligand as a drug once it is tested in vivo or in clinical trials. Drugs fail in the clinic because the wrong targets were chosen or because their effects differ from those anticipated.
In conclusion, protein structure prediction models are an important piece of the puzzle, but only one piece: they are useful for both target identification and drug design when optimally leveraged within more complex ML pipelines. The potential of those models to help understand and engineer structural systems in cancer biology has only begun to be explored, and could hopefully be used to advance crucial in vivo questions for drug development.
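The rescoring idea from the docking benchmark discussed earlier in this section can be illustrated with a small sketch of how a docking score might be combined with several ML scoring functions in a pipeline. It assumes a simple rank-averaging consensus, which is a stand-in chosen for illustration and not the procedure used in the cited study. All function names, model names and scores are hypothetical.

```python
def rank_positions(scores, higher_is_better):
    """Return the 1-based rank of each item (rank 1 = best) for one scoring function."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def consensus_rescore(docking_scores, ml_scores_by_model):
    """Order candidate indices by their mean rank across all scoring functions.

    Assumptions: docking scores are energies (lower is better) and every
    ML scoring function returns a value where higher is better.
    """
    all_ranks = [rank_positions(docking_scores, higher_is_better=False)]
    for scores in ml_scores_by_model.values():
        all_ranks.append(rank_positions(scores, higher_is_better=True))
    n_candidates = len(docking_scores)
    mean_rank = [
        sum(ranks[i] for ranks in all_ranks) / len(all_ranks)
        for i in range(n_candidates)
    ]
    return sorted(range(n_candidates), key=lambda i: mean_rank[i])


if __name__ == "__main__":
    # Hypothetical scores for three candidate ligands (not real data).
    docking = [-9.1, -7.4, -8.2]
    ml_scores = {"model_a": [0.2, 0.9, 0.6], "model_b": [0.3, 0.8, 0.7]}
    print(consensus_rescore(docking, ml_scores))
```

In this toy case the ligand that docking alone ranks first drops to last once the two ML scores are taken into account, which is the kind of re-ordering a rescoring step is meant to produce.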
II. Generative models in drug discovery
De novo molecular design increasingly uses generative models based on DL techniques to propose novel compounds that are likely to possess desired properties or activities. DL techniques based on mixed architectures, such as Generative Adversarial Networks (GAN), Variational Autoencoders (VAE) and Deep Reinforcement Learning, are increasingly used for generative chemistry: they can be trained on existing data sets and then used to generate novel compounds. Typically, the new compounds follow the same underlying statistical distributions of properties as the training data set. Additionally, different optimization strategies, including transfer learning, Bayesian optimization, reinforcement learning, and conditional generation, can direct the generation process toward desired aims regarding biological activity, synthesis process or chemical features.
Here, what matters is not the number of molecules a model can produce, but the ability of the resulting lead molecules to meet the safety and efficacy criteria of successful drugs and to withstand studies in animals and human patients. In particular, the generated molecules have to achieve superior properties relative to a range of structurally diverse drugs and to satisfy other basic requirements, such as synthesizability and low off-target effects. Two illustrative works are GENTRL and Chemistry42 (developed by Insilico Medicine).
- GENTRL (Generative Tensorial Reinforcement Learning) used a GAN-based generative and reinforcement learning approach to select 40 compounds for synthesis and testing against the discoidin domain receptor 1 (DDR1) kinase. The approach used a two-step algorithm: the first step learns a mapping of the chemical space, and the second step explores this mapping with deep reinforcement learning to learn DDR1 and common kinase inhibitors. GENTRL used three distinct Kohonen-based self-organizing maps (SOMs) as reward functions for the reinforcement learning step: the trending SOM (scores compound novelty based on patent disclosure dates), the general kinase SOM (distinguishes kinase inhibitors), and the specific kinase SOM (isolates DDR1 inhibitors); see Fig. 3. The synthesized compounds were followed up with pharmacokinetic studies in mice, resulting in the identification of a lead compound with a favorable property profile; the authors acknowledged the potential for further optimization before progressing to candidate selection.
Fig 3. GENTRL architecture. Image source 
- Chemistry42 was built upon GENTRL and is a small-molecule generating AI platform that can design, rank and score millions of compounds to find hundreds with desired properties, whether those are existing drugs or potentially new therapeutics. The platform is trained on 10 million publicly available compounds and 100 million building blocks, or virtual molecular fragments. Its generative engine generates hundreds of molecular structures that are funneled into a reward pipeline. This reward pipeline assesses each structure’s suitability and selects high-scoring molecules, i.e., those that meet objectives such as safety, potency, synthetic availability, and metabolic stability. The generated molecules and their scores are returned to the generative engine so that the models “learn” which types of molecules score highly and which score poorly. Based on these data, the generative models are retrained to generate high-scoring molecules. One successful example is the efficient discovery of a novel cyclin-dependent kinase 20 (CDK20) small-molecule inhibitor, which was conducted in 30 days covering target selection, molecule generation, compound synthesis and biological testing (see Fig. 4). Further optimization of this molecule, as well as the evaluation of its ADME properties and kinase selectivity, was ongoing at the time of publication.
Fig 4. Insilico Medicine generative procedure for CDK20 hits. Image source:
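The generate, score and retrain cycle described for Chemistry42 can be pictured with a deliberately small toy loop. This is a sketch under strong assumptions and not the platform's actual algorithm: the "generator" is just a probability table over four hypothetical fragments, the "reward pipeline" is a made-up scoring rule, and "retraining" moves the table toward the fragments seen in the best-scoring molecules.

```python
import random

# Hypothetical building blocks; real platforms work with chemical fragments.
FRAGMENTS = ["frag_A", "frag_B", "frag_C", "frag_D"]


def generate(weights, n_molecules, size, rng):
    """Sample toy 'molecules' as tuples of fragments from the generator weights."""
    population = list(weights)
    probabilities = [weights[fragment] for fragment in population]
    return [
        tuple(rng.choices(population, weights=probabilities, k=size))
        for _ in range(n_molecules)
    ]


def reward(molecule):
    """Stand-in reward: pretend frag_A raises the score and frag_D lowers it."""
    return molecule.count("frag_A") - molecule.count("frag_D")


def retrain(weights, scored, top_fraction=0.2, learning_rate=0.5):
    """Shift the generator toward fragments found in the highest-scoring molecules."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    top = ranked[: max(1, int(len(ranked) * top_fraction))]
    counts = {fragment: 0 for fragment in weights}
    for molecule, _score in top:
        for fragment in molecule:
            counts[fragment] += 1
    total = sum(counts.values())
    return {
        fragment: (1 - learning_rate) * weights[fragment]
        + learning_rate * counts[fragment] / total
        for fragment in weights
    }


def run_loop(rounds=10, n_molecules=200, size=5, seed=0):
    """Repeat generate -> score -> retrain and return the final generator weights."""
    rng = random.Random(seed)
    weights = {fragment: 1 / len(FRAGMENTS) for fragment in FRAGMENTS}
    for _ in range(rounds):
        molecules = generate(weights, n_molecules, size, rng)
        scored = [(molecule, reward(molecule)) for molecule in molecules]
        weights = retrain(weights, scored)
    return weights


if __name__ == "__main__":
    for fragment, weight in run_loop().items():
        print(f"{fragment}: {weight:.3f}")
```

After a few rounds the table favors the rewarded fragment and suppresses the penalized one, which is the feedback effect the text describes, here reduced to its simplest form.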
There are many relevant works in the field of generative chemistry. Most are based on gradient optimization algorithms, like the two deep learning examples described above, but there are also approaches based on gradient-free stochastic methods such as evolutionary algorithms. Given the breadth of the field, we refer the reader to a summary of exemplar methods for de novo molecular design, broken down by the coarseness of the molecular representation: in essence, whether molecular design is modeled on an atom-based, fragment-based, or reaction-based paradigm. In recognition of the challenges of this field, strong benchmarks that standardize the assessment of generative molecular models are now available as open-source projects [13,14].
The role of ML and AI in drug discovery is growing, with special interest in de novo molecular design methods because of their ability to navigate extremely large chemical spaces more effectively than either virtual screening or a human expert. To put this in context, the space of possible small organic molecules is estimated to contain 10^60 molecules. Despite early concerns about automated methods for molecular design, often relating to the instability, reactivity, actionability, and synthetic feasibility of the suggested molecules, we now have a variety of tools at our disposal that are proficient generators of sensible molecular structures [12].
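To give a sense of that scale, the short calculation below estimates how long an exhaustive pass over the estimated chemical space would take. The screening rate of one billion molecules per second is purely an assumption for illustration, as are the variable and function names.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(n_molecules: float, molecules_per_second: float) -> float:
    """Years needed to evaluate every molecule once at a constant rate."""
    return n_molecules / molecules_per_second / SECONDS_PER_YEAR


chemical_space = 1e60   # estimated size of small-molecule space quoted in the text
screening_rate = 1e9    # assumed rate: one billion molecules evaluated per second

years = years_to_enumerate(chemical_space, screening_rate)
print(f"Exhaustive enumeration would take about {years:.2e} years")
```

Even at this optimistic rate the result is many orders of magnitude beyond any practical timescale, which is why methods that steer the search, rather than enumerate the space, attract so much interest.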
The application of DL is producing outstanding achievements in specific steps of the drug discovery pipeline, such as protein structure prediction and generative chemistry. To truly advance drug discovery in oncology, however, we still need to understand cancer biology better in the first place. It is currently not trivial to apply AI methods in the drug discovery context, to a good extent because of the difficulty of generating and labeling relevant biological and physiological data for questions related to efficacy and safety. In addition, the amount of data in cancer biology datasets does not qualify as ‘big data’: the datasets available for cancer therapeutics are substantially smaller than those available in other fields, and the data suffer from high technical heterogeneity, high dimensionality and a low signal-to-noise ratio. Omics data often suffer from measurement inconsistencies between cohorts, marked batch effects and dependencies on specific experimental platforms. Such a lack of consistency is a major hurdle when applying AI methods. Consensus on the measurement, alignment and normalization of tumour omics data will be critical for each data type. In line with views expressed in the literature, only when we are able to measure and capture relevant biological endpoints in vivo will we be able to advance the field significantly further and to apply the computational algorithms currently available to us fruitfully in drug discovery, with respect to compound efficacy and safety in the clinic.
Dr. Aurelia Bustos, MD, PhD
[1] Structure-Based Drug Discovery Paradigm. Int. J. Mol. Sci. 20, 2783 (2019). DOI: 10.3390/ijms20112783
[2] Structure-based inhibitor design of mutant RAS proteins—a paradigm shift. Cancer Metastasis Rev. 39, 1091–1105 (2020). DOI: 10.1007/s10555-020-09914-6
[3] Marineau, J. J. et al. Discovery of SY-5609: A Selective, Noncovalent Inhibitor of CDK7. J. Medicinal Chem. (2021). DOI: 10.1021/acs.jmedchem.1c01171
[4] AlphaFold Protein Structure Database
[5] The structural coverage of the human proteome before and after AlphaFold. PLoS Comput Biol. 18(1), e1009818 (2022). doi: 10.1371/journal.pcbi.1009818
[6] Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Molecular Systems Biology (2022). https://doi.org/10.15252/msb.202211081
[7] AlphaFold Accelerates Artificial Intelligence Powered Drug Discovery: Efficient Discovery of a Novel Cyclin-dependent Kinase 20 (CDK20) Small Molecule Inhibitor. https://arxiv.org/pdf/2201.09647.pdf
[8] AlphaFold2 @ CASP14: “It feels like one’s child has left home.” https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/
[9] AlphaDesign: A de novo protein design framework based on AlphaFold. bioRxiv. https://www.biorxiv.org/content/10.1101/2021.10.11.463937v1
[10] Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv (2022). https://doi.org/10.1101/2022.07.20.500902
[11] Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology 37, 1038–1040 (2019). DOI: 10.1038/s41587-019-0224-x
[12] De novo molecular design and generative models. Drug Discovery Today (2021). https://doi.org/10.1016/j.drudis.2021.05.019
[13] GitHub – BenevolentAI/guacamol: Benchmarks for generative chemistry
[14] MOSES · GitHub
[15] Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discovery Today (2021). https://doi.org/10.1016/j.drudis.2020.11.037
[16] Big data in basic and translational cancer research. Nature Reviews Cancer 22, 625–639 (2022). https://doi.org/10.1038/s41568-022-00502-0
[17] Can AlphaFold2 predict the impact of missense mutations on structure? Nature Structural & Molecular Biology 29, 1–2 (2022). https://doi.org/10.1038/s41594-021-00714-2
[18] A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol (2022). https://doi.org/10.1038/s41594-022-00849-w