An AI-based tool can predict unknown functions of any protein

A team led by Rosa Fernández of the Institute of Evolutionary Biology (IBE) and Ana Rojas of the Andalusian Center for Developmental Biology (CABD) has developed an artificial intelligence (AI)-based tool that uses language models to predict the unknown functions of proteins directly from genomic sequences, without comparing them to previously known sequences. IBE is a joint center of the Spanish National Research Council (CSIC) and Pompeu Fabra University (UPF); CABD is a joint center of CSIC, the Andalusian Regional Government, and Pablo de Olavide University (UPO).
In just a few hours and without any training required, this open-access and freely available tool can uncover the function of any hidden protein within the “dark proteome” (the set of proteins whose function is still unknown).
Using this novel tool, named FANTASIA (Functional ANnoTAtion based on embedding space SImilArity), the IBE and CABD teams have analyzed nearly 1,000 animal genomes with close to 100% accuracy, assigning functions to 24 million protein-coding genes from the dark proteome.
FANTASIA can handle Big Data, analyzing a complete animal genome within hours on a regular computer—or in just 30 minutes on a specialized workstation.
Today, we take for granted that we can synthesize insulin to treat diabetes, but this would not be possible without understanding the function of that essential protein. Like insulin, every protein performs a specific role. Genes encode proteins, and the cell's internal machinery reads those genes to produce each protein again and again. The genome of any organism holds the formula to synthesize all its proteins, collectively known as its proteome. Yet we still lack knowledge of the functions of many genes across the tree of life.
In humans, the functions of most proteins are already known—around 80–90%—but in other mammals this percentage drops, and in invertebrates, the function of more than half of all proteins remains a mystery. Although we can read the billions of letters of their DNA sequences, the biological functions of many of these proteins remain hidden, leaving key clues about species evolution, metabolism, or even health undiscovered.
Until now, the main method for predicting protein function relied on comparing the genes encoding them with similar known genes, called homologs—a limited approach that leaves much of this vast universe unexplored.
Decoding the dark proteome of the animal tree of life
Over the past decade, major international initiatives such as the European Reference Genome Atlas (ERGA)—part of the Earth BioGenome Project (EBP)—have generated reference genome sequences for thousands of animals to support biodiversity research.
But obtaining the sequence that encodes a protein does not mean understanding what it does.
To reveal these functions, traditional (non-AI) methods compare the genes encoding the proteins with similar DNA sequences—known as homologous genes. In this way, a new proteome is annotated by similarity to the coding genes of already-known proteins. However, the vast majority of proteins lack homologous references and remain hidden in the terra incognita of the dark proteome.
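The similarity-based approach described above can be illustrated with a minimal sketch. It is a toy, not the pipeline used in any real annotation project: Python's `difflib.SequenceMatcher` stands in for a dedicated sequence alignment tool, and the sequences, labels, function name `annotate_by_similarity`, and the 0.7 cutoff are all hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference set: made-up sequences with placeholder labels.
KNOWN_PROTEINS = {
    "known_A": ("MKTAYIAKQRQISFVKSHFSRQ", "hypothetical function A"),
    "known_B": ("MGSSHHHHHHSSGLVPRGSHMA", "hypothetical function B"),
}


def annotate_by_similarity(sequence, known, min_ratio=0.7):
    """Return the label of the most similar known sequence, or None.

    SequenceMatcher.ratio() is only a crude stand-in for a real
    alignment score; min_ratio is an arbitrary illustrative cutoff.
    """
    best_label, best_ratio = None, 0.0
    for ref_seq, label in known.values():
        ratio = SequenceMatcher(None, sequence, ref_seq, autojunk=False).ratio()
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio >= min_ratio else None


close_relative = "MKTAYIAKQRQISFVKSHFSKQ"  # differs from known_A at one position
divergent = "WPLNCEDYTWPLNCEDYTWPLN"       # resembles nothing in the reference set

print(annotate_by_similarity(close_relative, KNOWN_PROTEINS))
print(annotate_by_similarity(divergent, KNOWN_PROTEINS))
```

The second query receives no annotation at all, which is the situation the article describes: a sequence with no recognizable relative in the reference set stays in the dark proteome.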
Artistic interpretation of FANTASIA. Credit: Gemma Isabel Martínez-Redondo.
“Understanding the function of these genes through this new tool opens a new window into the biology of animals. It will allow us to understand how evolutionary innovations arise and what roles unknown proteins play in species diversity and adaptation,” explains Fernández, principal investigator at the IBE’s Metazoa Phylogenomics and Genome Evolution Lab and member of the ERGA executive committee.
Along the same lines, Rojas, who co-led the study from CABD, emphasizes that “the use of language models based on artificial intelligence allows us to go beyond simple homology comparison. These models learn directly from genetic sequences and can infer the potential function of genes with no known equivalents, opening new possibilities for exploring the dark proteome.” In fact, the EBP recommends the use of FANTASIA to all its collaborators on the project’s website.
With language models, a specific type of AI application, it is now possible, for the first time, to predict a protein’s function without comparing its gene sequence to known ones. Instead of searching for direct similarities, these methods translate DNA sequences into fragments and analyze them syntactically, as if they were sentences in a language. Each fragment receives a numerical value, and the system builds its own grammar to predict what’s missing, just as a text processor completes sentences.
This “ChatGPT of proteins” learns from thousands of known examples, each annotated with what the protein does, which biological processes it participates in, and where in the cell it is located. These annotations are known as GO terms, from the Gene Ontology. With this information, each protein becomes a numerical vector, a sort of mathematical fingerprint summarizing its characteristics. Using these vectors, FANTASIA can analyze new DNA sequences and predict their functions with high precision, unlocking discoveries that previously seemed unattainable, and it can do so for thousands of proteins simultaneously.
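As a rough illustration of annotation by embedding-space similarity, the sketch below transfers GO terms from the closest reference vector to a query vector. It is not FANTASIA's actual code: the article does not specify the distance measure or any threshold, so cosine similarity, the 0.8 cutoff, the three-dimensional toy vectors, and the names `transfer_go_terms` and `REFERENCE` are all assumptions made for the example. In practice the vectors would come from a pre-trained protein language model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_go_terms(query, reference, min_similarity=0.8):
    """Copy GO terms from the most similar reference embedding.

    Returns (reference_id, similarity, go_terms). If nothing reaches
    min_similarity (an assumed cutoff), the id is None and the set is empty.
    """
    best_id, best_sim = None, -1.0
    for ref_id, (vector, _terms) in reference.items():
        sim = cosine_similarity(query, vector)
        if sim > best_sim:
            best_id, best_sim = ref_id, sim
    if best_id is None or best_sim < min_similarity:
        return None, best_sim, set()
    return best_id, best_sim, set(reference[best_id][1])


# Toy reference set with hypothetical ids and invented vectors.
REFERENCE = {
    "ref_hormone_like": ([0.9, 0.1, 0.0], {"GO:0005179"}),  # hormone activity
    "ref_kinase_like": ([0.0, 0.8, 0.6], {"GO:0016301"}),   # kinase activity
}

query_embedding = [0.85, 0.2, 0.05]  # stand-in for an unannotated protein
match, similarity, go_terms = transfer_go_terms(query_embedding, REFERENCE)
print(match, round(similarity, 3), go_terms)
```

Because the comparison happens between vectors rather than between raw sequences, a query can land near a reference protein even when the two sequences share little obvious similarity, which is the property the article credits for reaching into the dark proteome.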
“FANTASIA is open-source software that’s easy to use even for users without programming experience. It includes pre-trained models, aligning with sustainability principles, and can be run without access to supercomputers,” notes Gemma Martínez Redondo, a PhD student at IBE and the study’s first author.
Shedding light on “dark biology”: a true FANTAS-IA?
Uncovering the functions of an organism’s proteins is key to understanding genome evolution and the complexity of life. This new language-model-based tool could therefore boost scientific knowledge in this field, as well as in biodiversity and global health research.
“FANTASIA is a hypothesis generator—it sheds light in the darkness. Studying every gene in every organism one by one is unthinkable. Now it will be easier to target efforts toward deeply investigating protein functions. This can be particularly useful in the pharmaceutical field, for instance, identifying therapeutic targets,” explains Fernández.
The study has already revealed hidden proteins in tardigrades, ctenophores, and micrognathozoans—three little-known invertebrate phyla whose proteomes remain largely unexplored.
A ctenophore of the family Beroidae, a gelatinous zooplankton that moves using cilia and preys on other ctenophores. Credit: National Oceanic and Atmospheric Administration. Public domain. Via Picryl.
“In evolutionary biology, the change, loss, or gain of protein function across organisms tells the story of their phylum’s or species’ evolution. It can indicate how an organism adapted to a new environment, what it fed on, or why it stopped needing certain genomic tools, among many other things,” adds Fernández.
The AI tool is freely available to any research group worldwide, with the potential to illuminate genomic and proteomic studies in virtually any field.
“We know that other research groups around the world are already using FANTASIA in their studies, and we’re seeing that it works not only in animals, but also in plants, viruses, fungi, and protists. The potential for discovering new genes that could revolutionize biotechnology, medicine, or biodiversity conservation is limitless,” concludes Fernández.
“The possibilities offered by the methods we use in this tool are enormous for expanding our understanding of alternative protein functions—it’s like deciphering their grammar,” adds Rojas.
Referenced article:
Martínez-Redondo, G. I., Perez-Canales, F. M., Carbonetto, B., Fernández, J. M., Barrios-Núñez, I., Vázquez-Valls, M., Cases, I., Rojas, A. M., & Fernández, R. (2025). FANTASIA leverages language models to decode the functional dark proteome across the animal tree of life. Communications Biology, 8(1), 1227. https://doi.org/10.1038/s42003-025-08651-2