
Large Language Models+ For Scientific Research

Updated August 2023

LLMs and Tools for R&D

To help scientists and researchers navigate the growing number of advanced artificial intelligence (AI) options, Enthought’s experts put together this summary of the Large Language Models (LLMs) and related tools most relevant for R&D, current as of early August 2023. This is a fast-moving field, so expect the landscape to continue changing quickly.

We also suggest getting started with our What Every R&D Leader Needs to Know About ChatGPT and LLMs on-demand webinar.

The Major Players

Of the major players in AI, only OpenAI currently offers its LLMs as a commercial service, and then only by invitation (as of this writing). However, many companies have experimental or non-commercial models available to try. Keep intellectual property (IP) issues in mind when using these.

OpenAI -
OpenAI offers a variety of LLMs and APIs addressing different use cases, including fine-tuning models on your own data. Serious commercial use should go through the APIs, which are currently available by invitation.

Meta AI LLaMA 2 -
A collection of related LLMs released by Meta AI (Facebook). Unlike version 1, version 2 is available for commercial and research purposes.

Google Bard -
Google’s experimental LLM. No public APIs available yet, and chatbot conversations are used for further training, so not yet ready for commercial use.

Amazon AlexaTM -
Amazon Science’s LLM, which can be accessed for non-commercial use via AWS SageMaker.

Anthropic Claude -
Unique model because of its large context window (100k+ tokens), allowing it to answer questions about longer documents. API access is only available via inquiries. A chat interface is generally available, but conversations may be used for further training, so not a commercial option.

Hugging Face -
Hugging Face provides infrastructure support for LLM and other Machine Learning operations, including hosting, training and deployment of models. They also host some internally developed and open-source models such as BLOOM.

Open-Source LLMs

If you want to train, fine-tune, or run an LLM on your own, there are a number of models available, ranging from older models released by major AI companies, to non-commercial research models, to more recent, permissively licensed models.

Google BERT -
One of the first openly available transformer-based LLMs, released under the permissive Apache 2.0 license. BERT is the foundation for many of the tools for scientific applications of LLMs.

OpenAI GPT-2 -
OpenAI’s second-generation LLM, released under a permissive MIT license. GPT-2 is now four years old and well behind the state of the art, but it was ground-breaking at the time.

BLOOM -
A multi-lingual LLM by a large consortium of researchers and organizations, including Hugging Face. It is open-sourced under the Responsible AI License (usable commercially with some restrictions, particularly around disclosure and medical use cases). There is also BLOOMZ, which is fine-tuned for following instructions rather than conversation.

Falcon LLM -
An LLM released by the Technology Innovation Institute under a permissive Apache 2.0 license. It is used as a basis for a number of other open tools, such as LAION’s Open Assistant.

MPT-30B -
A collection of LLMs with different optimizations, trained inexpensively on very large input sets. Released by MosaicML under the Apache 2.0 license with the intent that it be commercially usable.

Dolly/Pythia -
An LLM tuned by Databricks based on the Pythia LLM. It is not cutting edge but is large and released under an MIT license.

Stanford University Alpaca -
A model based on Meta’s LLaMA v1, produced by the Center for Research on Foundation Models (CRFM) group at Stanford. The model is open-sourced under a non-commercial license and designed to be trained inexpensively on smaller data sets. A number of other models are derived from it, such as Vicuna.

LeRF -
LeRF combines the ability to reconstruct a 3D scene from a handful of still images using Neural Radiance Fields (NeRF) with LLMs, allowing easy searching of a 3D scene using natural language. The models and code are open source, but currently without a license, and so not yet commercially usable.

Toolkits and APIs

To go beyond simple chat applications of LLMs, you will need tools to connect the models with other services, or libraries to build and train your own models.

Transformers -
A toolkit built on top of PyTorch and TensorFlow that provides building blocks for LLMs as well as other state-of-the-art machine learning models. It also integrates with the Hugging Face public API to facilitate building, training, and running models in the cloud, as well as accessing many third-party models.

LangChain -
LangChain is a toolkit for building LLM-centered applications, particularly agents and assistants. It provides automation for building special-purpose prompts which work well with LLMs to produce particular types of outputs, as well as integration with other services such as data sources and code execution.

Science-Specific Tools

In the last few years there have been a number of high-profile papers and toolkits in Materials Science and Bioinformatics that use these new ML models. Most have source code and model weights freely available, but no services have yet been built on top of them. They are research-grade software, not production-grade, and many are based on LLM techniques a generation or two behind the current state of the art. Better models are likely to appear in the future.

ChemBERT -
Chemical property prediction from SMILES molecular structure representation. There are other models derived from this original work.
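To give a sense of how a BERT-style model can "read" chemistry at all: a SMILES string is split into atom- and bond-level tokens, which the model then treats like words in a sentence. The regex below is a simplified sketch of that tokenization step (ChemBERT's actual tokenizer is more elaborate):

```python
import re

# Simplified atom-level SMILES tokenizer, illustrating how BERT-style
# chemistry models treat a molecule's SMILES string as a token sequence.
# Two-letter elements (Cl, Br) and bracket atoms must match before
# single-letter atoms; this sketch covers only common organic-subset SMILES.
SMILES_TOKEN = re.compile(
    r"Cl|Br|\[[^\]]+\]|[BCNOPSFI]|[bcnops]|[=#()\-+/\\@]|\d"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into model-ready tokens."""
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Once tokenized this way, standard transformer machinery (embeddings, attention, masked-token pretraining) applies to molecules exactly as it does to text.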

ChemCrow -
LangChain-based package for solving reasoning-intensive chemical tasks posed using natural language. This currently needs API access for OpenAI and possibly other APIs depending on the tasks.

ProteinBERT -
A framework for building protein property predictors from protein sequence information. The base model is designed to be fine-tuned for arbitrary properties.

TransUNet -
Next-generation medical image segmentation using transformer-based models. This has the potential to be cheaper to train and more capable of detecting large-scale structures in an image.

Enformer -
Transformer-based gene expression and chromatin prediction from DNA sequences. Similar to LLMs, Enformer has the capability of tracking a wider context within a DNA sequence than previous models.


Looking to accelerate your research by integrating Machine Learning and advanced AI in your R&D lab but don’t know where to start? Enthought understands the complexities of scientific data and can help. Contact us to connect with one of our experts.


