By now, it's undeniable that Julia has been gaining traction in the fields of scientific computing and bioinformatics, with amazing packages such as those from the BioJulia organization and super active discussion forums, such as the #biology channel on Julia's Slack.
In this quick overview I would like to dive a little deeper into a more niche topic of scientific computing: protein design.
Proteins are the workhorses of nature, with a multitude of functions ranging from structural to transport, enzymatic, hormonal and even immune response. All this versatility is a product of a "simple" (yet beautiful) combination of just 20 different amino acids - the building blocks of life. Once synthesized, sequences of amino acids fold into a given conformation, and it is this structural organization that confers a specific task in the context of the cell.
In short, protein design is the scientific area of research that attempts to generate sequences of amino acids “à la carte” that will fold into unnatural conformations with novel activities or behaviors.
This has traditionally been performed "blindly": new sequences were generated at random (sometimes via radiation-induced mutations) just to see what happens! As you may guess, this was horrendously expensive and time-consuming.
In the last few decades, however, a new player has entered the game: computer-aided design (a.k.a. CAD).
In this new paradigm, protein sequences are simulated "in silico" beforehand, and prototypes are filtered for prospective candidates at a throughput far beyond even the wildest dreams of a couple of decades ago.
The history of computational protein design (despite being a somewhat young and fresh field) is rich and filled with breakthroughs. I suggest reading this review on the topic.
A common development architecture has, however, emerged: in order to simulate sequence designs and evaluate how "good" or "bad" they are, two fundamental pieces of software are required:
- A sampling motor: a way to introduce change to a protein (to manipulate the particles in the system). Think, for example, of a way to introduce a mutation.
- An energy function: a way to evaluate how "real" or how "adjusted" the current system is (sometimes also called a "fitness function").
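To make these two pieces concrete, here is a minimal sketch in plain Julia. It is a toy under loud assumptions: `point_mutation`, `toy_energy` and `design` are hypothetical names invented for this illustration (they are not part of Rosetta or ProtoSyn.jl), the "energy" is just the number of mismatches against a made-up target sequence rather than any physical model, and the two pieces are tied together with a simple Metropolis Monte Carlo loop.

```julia
using Random

# The 20 canonical amino acids, as one-letter codes.
const AMINO_ACIDS = collect("ACDEFGHIKLMNPQRSTVWY")

# Sampling motor (hypothetical): copy the sequence and mutate one random position.
function point_mutation(rng::AbstractRNG, seq::Vector{Char})
    mutated = copy(seq)
    mutated[rand(rng, eachindex(mutated))] = rand(rng, AMINO_ACIDS)
    return mutated
end

# Energy function (toy, NOT physical): number of positions that differ from
# a made-up target sequence. Lower is better.
toy_energy(seq::Vector{Char}, target::Vector{Char}) = count(seq .!= target)

# Metropolis Monte Carlo: always accept improvements, and accept worse
# candidates with probability exp(-delta / temperature).
function design(rng::AbstractRNG, start::Vector{Char}, target::Vector{Char};
                steps::Int = 5_000, temperature::Float64 = 0.5)
    current, e_current = copy(start), toy_energy(start, target)
    best, e_best = copy(current), e_current
    for _ in 1:steps
        candidate = point_mutation(rng, current)
        e_candidate = toy_energy(candidate, target)
        delta = e_candidate - e_current
        if delta <= 0 || rand(rng) < exp(-delta / temperature)
            current, e_current = candidate, e_candidate
        end
        if e_current < e_best
            best, e_best = copy(current), e_current
        end
    end
    return best, e_best
end

rng = MersenneTwister(42)
target = collect("MKTAYIAKQR")                   # made-up target, for illustration only
start = rand(rng, AMINO_ACIDS, length(target))   # random starting sequence
best, e_best = design(rng, start, target)
println(join(best), " (energy = ", e_best, ")")
```

In a real design pipeline the sampler would act on 3D structures as well as sequences, and the energy function would be a physics-based or learned model. The separation between "propose a change" and "score the result" is the part that carries over.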
Both these requirements have been thoroughly explored in the past. Arguably the most successful experiment to date belongs to Rosetta (and its Python wrapper, PyRosetta), under the supervision of Dr. David Baker.
Honestly, I think I could probably write a 100-page essay on how much Rosetta changed the landscape of computational protein design! However, are we done? Is this the best we can do?
I don't think we're done. Rosetta is awesome. Rosetta revolutionized the way we do science and the way we engineer proteins for our human things. But Rosetta is not perfect.
Having had to learn the inner workings of PyRosetta during my PhD, I can safely argue that Rosetta does not constitute an example of a modern API for protein design. Here's why:
- Rosetta (C++) and PyRosetta (Python) are a perfect example of the two-language problem;
- The Rosetta software is, for the most part, a patchwork of individual applications for specific uses, interlaced with complicated mechanisms, such as RosettaScripts (an XML-based syntax for setting up algorithms that combine several of Rosetta's functionalities);
- PyRosetta's documentation is infamous for its lack of information and outdated examples;
- Rosetta does not directly benefit from modern hardware, such as GPUs or distributed computing;
- Rosetta is not an open-source project (and, given the lack of documentation, virtually impossible to modify).
In short: we can do better.
I see in Julia the path forward for protein design software. I'll take the chance to add a shameless plug for my own PhD work, where I try to tackle this same problem and develop a modern approach to protein design: ProtoSyn.jl. Although it is still a work in progress, I hope future users (and contributors) find in ProtoSyn.jl a home for all things related to molecular manipulation & simulation, with (of course) a strong emphasis on protein design. I'll keep this short and, without going into too much detail, here's an incomplete list of features that I feel should shape a modern Julia-based approach to protein design:
- Complete molecular manipulation tools
- Fast & native energy functions, with optional plug-and-play support for established energy functions from external packages
- GPU & distributed computing support
- Incorporation of non-canonical amino acids
- Addition of post-translational modifications
- Support for ramified peptides and glycoproteins
- Support for common optimization simulations (steepest descent, Monte Carlo, etc.)
- Full suite of up-to-date examples, tutorials and documentation
- Free pizza
Well, ProtoSyn.jl does almost all of the above (we still haven't figured out how to get free pizza)!
I'll finish this post here, with huge enthusiasm for what's to come. Hopefully, Julia shapes the future of protein design. The potential is all there!