José Pereira

Posted on Jun 7, 2022

Julia for protein design?

#protein #design #chemistry #simulation

By now, it's undeniable that Julia has been gaining traction in scientific computing and bioinformatics, with amazing package ecosystems such as BioJulia and very active discussion forums, such as the #biology channel on Julia's Slack.

In this quick overview I would like to dive a little deeper into a more niche topic of scientific computing: protein design.

What is protein design?

Proteins are the workhorses of nature, with a multitude of functions ranging from structural support and transport to enzymatic, hormonal and even immune roles. All this versatility is a product of a "simple" (yet beautiful) combination of just 20 different amino acids - the building blocks of life. Once synthesized, a sequence of amino acids folds into a given conformation, and it is this structural organization that gives the protein its specific task in the context of the cell.

In short, protein design is the scientific area of research that attempts to generate sequences of amino acids “à la carte” that will fold into unnatural conformations with novel activities or behaviors.

This has traditionally been performed "blindly": new sequences were generated at random (sometimes via radiation-induced mutations) just to see what happens! As you may guess, this was horrendously expensive and time-consuming.

In the last few decades, however, a new player has entered the game: computer-aided design (a.k.a. CAD).

In this new paradigm, protein sequences are simulated "in silico" beforehand, and prototypes are filtered for prospective candidates at a throughput far beyond even the wildest dreams of a couple of decades ago.

Software for protein design: are we done?

The history of computational protein design (despite being a somewhat young and fresh field) is rich and filled with breakthroughs. I suggest reading this review on the topic.

A common development architecture has, however, emerged: in order to simulate sequence designs and evaluate how "good" or "bad" they are, two fundamental pieces of software are required:

A sampling motor: a way to introduce change to a protein (to manipulate the particles in the system). Think, for example, of a way to introduce a mutation.
An energy function: a way to evaluate the current system on how "real" or how "adjusted" it is (sometimes also called a "fitness function").
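To make these two pieces concrete, here is a minimal sketch of how they could fit together in a Monte Carlo design loop. It is written in Python only for brevity, and it is not how Rosetta or ProtoSyn.jl do it: the names mutate, energy and design, the hydrophobic/polar target pattern and the simplified residue classification are all assumptions made up for this illustration. Real energy functions score 3D structures, not letter patterns.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 canonical amino acids

# Assumption: a deliberately simplified hydrophobic/polar split, for illustration only.
HYDROPHOBIC = set("AVILMFWC")


def mutate(sequence, rng):
    """Sampling motor (toy): return a copy of `sequence` with one random point mutation."""
    position = rng.randrange(len(sequence))
    choices = [aa for aa in AMINO_ACIDS if aa != sequence[position]]
    return sequence[:position] + rng.choice(choices) + sequence[position + 1:]


def energy(sequence, pattern):
    """Energy function (toy): number of positions that disagree with a target
    pattern of hydrophobic (H) and polar (P) residues. Lower is better."""
    return sum(
        (aa in HYDROPHOBIC) != (wanted == "H")
        for aa, wanted in zip(sequence, pattern)
    )


def design(pattern, steps=2000, temperature=0.5, seed=0):
    """Monte Carlo search: propose mutations and accept them with the Metropolis criterion."""
    rng = random.Random(seed)
    current = "".join(rng.choice(AMINO_ACIDS) for _ in pattern)
    current_energy = energy(current, pattern)
    best, best_energy = current, current_energy
    for _ in range(steps):
        candidate = mutate(current, rng)
        candidate_energy = energy(candidate, pattern)
        delta = candidate_energy - current_energy
        # Always accept improvements; accept worse candidates with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = candidate, candidate_energy
            if current_energy < best_energy:
                best, best_energy = current, current_energy
    return best, best_energy


if __name__ == "__main__":
    sequence, score = design("HPHPPHHPHP")
    print(sequence, score)
```

Swapping either piece (a smarter sampler, a physics-based or machine-learned energy function) leaves the loop itself untouched, which is why the two are usually developed as separate components.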

Both of these requirements have been thoroughly explored in the past. Arguably the most successful effort to date is Rosetta (and its Python wrapper, PyRosetta), developed under the supervision of Dr. David Baker.

Honestly, I think I could probably write a 100-page essay on how much Rosetta changed the landscape of computational protein design! However, are we done? Is this the best we can do?

Well ...

The good, the bad and the ugly

I don't think we're done. Rosetta is awesome. Rosetta revolutionized the way we do science and the way we engineer proteins for our human things. But Rosetta is not perfect.

Having had to learn the inner workings of PyRosetta during my PhD, I can safely argue that Rosetta is not an example of a modern API for protein design. Here's why:

Rosetta (C++) and PyRosetta (Python) are a perfect example of the two-language problem;
The Rosetta software is, for the most part, a patchwork of individual applications for specific uses, interlaced with complicated mechanisms such as RosettaScripts (an XML-based syntax for setting up algorithms that combine several of Rosetta's functionalities);
PyRosetta's documentation is infamous for its lack of information and outdated examples;
Rosetta does not directly benefit from modern hardware, such as GPU or distributed computing;
Rosetta is not an open-source project (and, given the lack of documentation, virtually impossible to modify).

In short: we can do better.

What's next?

I see in Julia the path forward for protein design software. I'll take the chance to add a shameless plug for my own PhD work, where I try to tackle this same problem and develop a modern approach to protein design: ProtoSyn.jl. Although it is still a work in progress, I hope future users (and contributors) find in ProtoSyn.jl a home for all things related to molecular manipulation & simulation, with (of course) a strong emphasis on protein design. I'll keep this short and, without going into too much detail, here's an incomplete list of features that I feel should shape a modern Julia-based approach to protein design:

Complete molecular manipulation tools
Fast & native energy functions, with optional plug-and-play support for established energy functions from external packages
GPU & distributed computing support
Incorporation of non-canonical amino acids
Addition of post-translational modifications
Support for ramified peptides and glycoproteins
Support for common optimization simulations (steepest descent, Monte Carlo, etc.)
Full suite of up-to-date examples, tutorials and documentation
Free pizza

Well, ProtoSyn.jl does almost all of the above (we still haven't figured out how to get free pizza)!

I'll finish this post here, with huge enthusiasm for what's to come. Hopefully, Julia shapes the future of protein design. The potential is all there!

Oldest comments (4)

Ashok Kumar • Jun 8 '22

Quite an ambitious project. All the best.

"Rosetta does not directly benefit from modern hardware, such as GPU or distributed computing;"

I looked up Rosetta out of curiosity and saw in the release notes that the latest builds attempt to support TensorFlow (TF), including GPU calculations through TF.

Rosetta 3.13
New tools and apps:
trRosetta available in C++ Rosetta now. A TensorFlow build (extras=tensorflow or extras=tensorflow_gpu) supports this.

Perhaps you have a better vision for solving this.

José Pereira • Jun 8 '22

Hi, thanks for the support!
It's true that Rosetta is now expanding and attempting to modernize (and standardize) the software. However, trRosetta is a separate application from Rosetta. trRosetta is the direct response to DeepMind's amazing AlphaFold results in CASP13: it uses machine learning models to learn from existing proteins and identify folding patterns. It's an amazing revolution by itself: it will allow scientists to get a glimpse into the structure (and therefore the function) of unknown proteins simply by sequencing the genome.

However, it is not a direct tool for protein design by itself: the design space is (perhaps) even more gigantic than the conformational/folding space, and given the nature of "creating the not-yet-created", machine learning models struggle to find direct solutions. There are, however, modern tools, such as the SeqDes model or the TorchANI model, that attempt to shed some light on the inner workings of proteins and therefore offer a possible guide to high-throughput computational design of proteins (I would love to get into more detail on this, both models are implemented in ProtoSyn.jl ahah).

The Rosetta suite itself is also undergoing a modernization effort, and I'm sure GPU support (and other sweet goodness) is coming to Rosetta and PyRosetta soon. I'm highly hopeful that Rosetta manages to implement all the tools we desperately need! Until then, I guess the best way to learn is by doing! So we'll keep coding ahah

Ashok Kumar • Jun 8 '22

Very nice. It seems like a lot of software tools are needed to make your protein design toolkit complete and of course, this project will allow you to experiment in those newer areas.

I am not from the biology field, but your articles about this, and seeing how you optimize ML models in Julia to use GPUs for high throughput, would be interesting and educational for everyone.

José Pereira • Jun 8 '22

This is highly experimental (as it should be in the academic environment). Hopefully someone more experienced than me gets curious enough to try new things out: that's how science is made! Cheers