How generative AI language models are unlocking the secrets of DNA
Large language models (LLMs) are trained on vast amounts of data and learn statistical associations between letters and words to predict what comes next in a sentence. For instance, GPT-4, the LLM underlying the popular generative AI app ChatGPT, is trained on several petabytes (several million gigabytes) of text. Biologists are now leveraging this capability to shed new light on genetics by identifying statistical patterns in DNA sequences. DNA language models (also called genomic or nucleotide language models) are similarly trained on large numbers of DNA sequences.

DNA as "the language of life" is an oft-repeated cliché. A genome is the entire set of DNA sequences that makes up the genetic recipe for an organism. Unlike written languages, DNA has few letters: A, C, G, and T (representing the bases adenine, cytosine, guanine, and thymine). As simple as this genomic language migh...
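To make the "predict what comes next" idea concrete, here is a toy sketch (not an actual DNA language model, which uses neural networks and billions of parameters) that simply counts which base most often follows each base in a sequence and predicts accordingly; the sequence and function names are illustrative assumptions.

```python
from collections import defaultdict, Counter

def train_bigram_model(sequence):
    """Count how often each base is followed by each other base."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts, base):
    """Return the most frequently observed successor of `base`."""
    if base not in counts:
        return None
    return counts[base].most_common(1)[0][0]

# Toy DNA string for illustration only
dna = "ACGTACGTACGAACG"
model = train_bigram_model(dna)
print(predict_next(model, "A"))  # the base seen most often after A here
```

Real DNA language models work on the same principle at scale: instead of counting pairs of letters, they learn contextual statistics over long stretches of sequence.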