Full Time Internship
Machine Learning Engineer
At NE47 Bio, we are building the next generation of protein language models and machine learning tools to enable biologists to understand and design proteins with unprecedented ease and accuracy. Our OpenProtein.AI platform places state-of-the-art sequence-to-function prediction and protein design tools, powered by our large protein language models and Bayesian machine learning frameworks, directly into the hands of biologists and protein engineers.
In natural language processing, large language models, which use massive neural networks to model the statistics of natural text, are redefining how companies think about search, chatbots, copywriting, text editing, code development, and more. Models like ChatGPT use massive transformer networks trained on large web text corpora to respond to user queries by extending user-supplied text prompts with uncanny ability. The models learn natural statistical patterns in the training text and generate each response from the probability assigned to each following word given some prefix text. The remarkable capabilities of these models have been unlocked by the ability to scale transformer language models to massive sizes using huge amounts of GPU compute and to train them on enormous text corpora scraped from the internet.
Much like natural language text, proteins are sequences, in this case of amino acids, which fold into three-dimensional structures in order to carry out the vast majority of functions at the molecular level of life. Proteins are responsible for converting sunlight into energy, reading and replicating DNA, transporting nutrients in and out of the cell, forming the protective envelopes of viruses, identifying foreign pathogens, and transmitting and receiving signals between cells and organisms, among many other functions. All of these functions are determined by the unique sequence of amino acids that makes up each protein. Also much like natural language, we now have enormous databases containing the amino acid sequences of natural, functional proteins. Although the vast majority of these proteins have not been characterized (we only know their sequences), it turns out that statistical analysis of just these sequences can reveal evolutionary pressures, and, therefore, structural and functional characteristics. Over the past few years, large-scale deep learning methods, like protein language models, have transformed our ability to understand and predict the structural and functional properties of proteins by learning from these evolutionary patterns. However, current protein language models and their extensions (e.g., AlphaFold2 or ESMFold) have only scratched the surface of what large protein language models can enable for protein design and optimization.
We have developed large protein language models that enable functional, prompt-based protein generation. The objective of this project is to work with our machine learning team to develop the next generation of protein language models and to integrate these into our OpenProtein.AI platform for solving function-driven protein design tasks.
As an intern on the language modeling project at NE47 Bio, you will work with our machine learning team to:
- Design and implement new neural network architectures for protein language modeling
- Train and monitor the training of those models
- Identify and curate datasets and training frameworks for training language models to solve new conditional generation problems
- Integrate models into the broader OpenProtein.AI platform
- Review relevant protein engineering and machine learning literature
Some expected deliverables and responsibilities include:
- Perform experiments, analyze the results, and report them through internal written documentation, slides, and presentations
- Work with the team to integrate resulting models into the OpenProtein.AI platform to eventually provide user-facing interfaces to these tools
- Write up and publish results in machine learning or protein engineering conferences and journals
Computer Science, Machine Learning, Bioinformatics/Computational Biology, Bioengineering
Candidates should be well versed in Python and have a background in algorithms, probability/statistics, and machine learning. Familiarity with bioinformatics, especially sequence analysis algorithms, is a plus but not required. Experience with language models and other generative deep learning approaches (e.g., VAEs, GANs, diffusion models) is a plus.
Some tools, libraries, and resources that candidates are likely to use are PyTorch, PyTorch Lightning, FastAPI, UniProt, AWS, and Docker. Some prior experience with these is a plus.