June 11, 2023

The Importance of Large Language Models in Science Even If You Don’t Work With Language

 

OpenAI's ChatGPT, Google's Bard, and other similar Large Language Models (LLMs) have made dramatic strides in their ability to interact with people using natural language. Users can describe what they want done and have the LLM "understand" and respond appropriately. 

While R&D leaders and their scientists are aware of ChatGPT, most are unclear about what the technology means for them because their data isn't natural language. Scientific data is different from traditional business data and requires special handling. Much of R&D data isn't text: it is time series, images, video, spectra, molecular structures, or any number of other data sources in a myriad of formats.

Even if your primary data isn't text-based, every lab has significant work with text: reports, code, configuration files, and so on.

The technology behind tools like ChatGPT provides a new flexibility that can lift a significant amount of the burden of the text-based workflows that every lab has. More importantly, the advances in AI models that permit ChatGPT's dramatically better conversational context are also revolutionizing the ability of AI models to work with deeper relationships in non-language data.

This means that innovative research organizations are in a unique position to benefit from these new types of tools. LLMs have the potential to take away some of the drudgery and distraction of text-based tasks like report generation and code writing, letting the domain experts in your organization focus on what they are best at—the science.

The Challenge of R&D Automation

Automation usually requires very standardized processes and is ideal for organizations that do the same thing over and over, whether on the factory floor, producing sales reports, or drafting business documents: any variables are well understood and constrained. R&D is, almost by definition, the opposite of standardized. R&D organizations are constantly taking on different projects, trying out new equipment, and testing new processes. Successful automation in an R&D context needs to be flexible without needing constant human intervention.

The most obvious difference between this new generation of language models and the older ones (embodied in tools like Siri and Alexa) is the ability to keep the context of the conversation over a much longer time. These large language models can "remember" what you are talking about over many back-and-forth prompts and responses. This is possible due to advances in the architecture of the AI models, which allow the models to be trained more efficiently, permitting deeper context from the same resources.

The innovations from text-based models can be applied just as readily to other data types. New designs can be built and efficiently trained to recognize relationships in other situations, such as tracking cause and effect in time series or video data, or spatial relationships in images. While they aren't getting the same level of coverage in the popular press as the text-based models, we are starting to see these sorts of technologies emerging, such as ChemBERT in drug discovery. It is likely that, as they mature, they will start to provide the same sort of qualitative changes in the analysis of scientific data.

LLMs and Scientific Text Workflows

At its heart, ChatGPT is just trying to come up with the "best next word" over and over, building up its responses one word at a time. In some sense, an LLM is just a "sophisticated autocomplete." As a result, these models are very good at producing semi-structured text, such as computer code, configuration files, and standardized reports (and also answers to exam questions!), because semi-structured text is even more predictable than natural language. Of course, to be able to do this, the model has to be trained on appropriate examples of the desired output, but ChatGPT has demonstrated surprising adeptness at producing small but useful routines in common programming languages just from the code examples included in its general training data, without any specific additional training.
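The "best next word" loop described above can be illustrated with a deliberately tiny stand-in for an LLM. This is only a sketch: it assumes a bigram frequency table instead of a neural network, and the names `train_bigram`, `complete`, and the toy corpus are hypothetical, not part of ChatGPT or any library.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def complete(follows, prompt, max_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    if not words:
        return prompt  # nothing to continue from
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # this word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Toy "training data": semi-structured text is highly repetitive.
corpus = "select name from samples ; select id from samples ; select name from runs ;"
model = train_bigram(corpus)
print(complete(model, "select", max_words=3))  # prints: select name from samples
```

A real LLM replaces the frequency table with a trained network that looks at far more than the previous word, but the generation loop has the same shape: predict, append, repeat.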

Semi-structured text is very common in R&D contexts. It might be an algorithm to perform an analysis of some data, a section of a report on the results of an experiment, or perhaps a SQL query against a knowledge base. It is rarely exactly the same twice, but it follows general patterns and has expectations for formatting and style. In a traditional lab, writing these documents generally falls on the researchers, and it significantly disrupts the flow of their work. They are no longer thinking about the research problem, but instead thinking about computer code, getting data into a document, or how to connect to the database.

Leaders of R&D organizations would much rather have their scientists, engineers, and researchers focus on doing science, engineering, and research. Research at UC Irvine [1] shows that it can take up to 20 minutes to regain focus on the primary task after a distraction. By leveraging LLM-based tools to generate structured text through conversational prompts, the researcher is more likely to stay focused on the high-level research task. In the same way that regular autocomplete can speed up sending a text message, keeping you focused on the message you want to send, these tools can speed up the creation of other types of text while keeping focus on the larger task.

Of course, just as with regular autocomplete, sometimes LLMs will get things wrong, and so they still need a human in the loop.

Beyond LLMs: Transformers

To build AI models that can track this sort of flexible structure, AI researchers developed the concept of attention: parts of the model that are designed to track important information that should influence later output. Many different attention methods have been developed over the past 30 years, but until recently the best results all required comparatively complex underlying neural networks, so-called recurrent neural networks, which use their current state as an input for the next state. These are expensive to train, because each step has to be computed in sequence. This also makes it harder to split the work among multiple machines.
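The sequential bottleneck of recurrent networks can be seen in a minimal sketch. It assumes a single scalar state and fixed, hand-picked weights (`w_state`, `w_input` are illustrative values, not trained ones); the point is only that each step needs the result of the previous one.

```python
import math


def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """One recurrent update: the new state depends on the previous state."""
    return math.tanh(w_state * state + w_input * x)


def run_rnn(inputs, initial_state=0.0):
    """Process a sequence strictly in order, returning every intermediate state."""
    states = []
    state = initial_state
    for x in inputs:
        # Step t cannot start until step t-1 has finished,
        # so this loop cannot be split across machines.
        state = rnn_step(state, x)
        states.append(state)
    return states


states = run_rnn([1.0, 0.0, -1.0])
print(len(states))  # prints: 3
```

Because the state is threaded through the loop, the same inputs in a different order give a different final state, and the work for later items cannot begin early.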

The breakthrough that allowed tools like ChatGPT to be built was the idea of transformers: a type of attention mechanism that works holistically on a chunk of a document rather than sequentially, and which can be implemented using simpler, non-recurrent networks. This removed an algorithmic bottleneck in the way that these models are trained, allowing work to be done in parallel more easily. More training data could be fed into the model for the same resources, and deeper context models could be built. This permitted the broader knowledge displayed by the current LLMs (coming from more input data) as well as the greatly improved ability to carry out conversations with humans (coming from the deeper context).
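To make "works holistically on a chunk" concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation in transformers. It is simplified under a stated assumption: queries, keys, and values are all the raw input vectors, with none of the learned projection matrices a real model would have. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return one output per input, each a weighted average of ALL inputs.

    Simplifying assumption: queries = keys = values = the inputs themselves.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Each pass of this loop is independent of the others,
        # so all positions could be computed in parallel.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
print(len(contextual), len(contextual[0]))  # prints: 3 2
```

Unlike the recurrent case, no output depends on another output, which is what lets the training work be spread across many processors.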

But again, a lot of R&D data isn't text. You can't, at the moment, give ChatGPT an image or a series of spectra as an input for it to work with without somehow first turning the data into text that ChatGPT understands.

At the level of the AI model, everything is just vectors of numbers, so these transformer-based algorithmic improvements can be applied to other data sources. In computer vision and image processing, for example, better context means better tracking of things over time, better distinction of spatial relationships like "close to" or "to the left", or the ability to distinguish things based on environmental cues. In drug discovery, we are now seeing rapid improvements in the ability to predict biochemical properties of molecules from molecular structure data. The improved ability to track context allows the AI to link active subgroups that work together to produce a particular chemical property, even though they may be far apart in the molecular structure description.
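As a rough illustration of "everything is just vectors of numbers", the sketch below turns an image and a one-dimensional signal into sequences of vectors, which is the kind of input a transformer-style model consumes. Cutting an image into patches is one common approach; the helper names, patch size, and toy data here are assumptions for illustration only.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches (row-major)."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - patch_size + 1, patch_size):
        for c in range(0, cols - patch_size + 1, patch_size):
            patch = [
                image[r + dr][c + dc]
                for dr in range(patch_size)
                for dc in range(patch_size)
            ]
            patches.append(patch)
    return patches


def series_to_windows(series, window):
    """Split a 1D signal (time series or spectrum) into non-overlapping windows."""
    return [
        series[i:i + window]
        for i in range(0, len(series) - window + 1, window)
    ]


# A toy 4x4 "image" with pixel values 0..15.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(image_to_patches(image, 2)[0])               # prints: [0, 1, 4, 5]
print(series_to_windows([1, 2, 3, 4, 5, 6, 7], 3))  # prints: [[1, 2, 3], [4, 5, 6]]
```

Once the data is a list of equal-length vectors, it plays the same role as a sequence of words, and incomplete trailing patches or windows are simply dropped in this sketch.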

Over the next few years, we are likely to see transformative improvements in the analysis of many different types of scientific and engineering data thanks to the advances that led to ChatGPT and its friends.

 

To learn more, join Enthought AI experts for the webinar "What Every R&D Leader Needs to Know about ChatGPT and LLMs" for a deeper dive into how these advanced technologies are changing scientific research.
