The Importance of Large Language Models in Science Even If You Don’t Work With Language

 

OpenAI's ChatGPT, Google's Bard, and similar Large Language Models (LLMs) have made dramatic strides in their ability to interact with people using natural language. Users can describe what they want done and have the LLM "understand" and respond appropriately.

While R&D leaders and their scientists are aware of ChatGPT, most are unclear about what the technology means for them, because their data isn't natural language. Scientific data differs from traditional business data and requires special handling. Much of R&D data isn't text: it's time series, images, video, spectra, molecular structures, or any number of other data sources in a myriad of formats.

Even if your primary data isn't text-based, every lab has significant work with text: reports, code, configuration files, and so on.

The technology behind tools like ChatGPT provides a new flexibility that can lift a significant amount of the burden of the text-based workflows that every lab has. More importantly, the advances in AI models that permit ChatGPT's dramatically better conversational context are revolutionizing the ability of AI models to work with deeper relationships in non-language data as well.

This means that innovative research organizations are in a unique position to benefit from these new types of tools. LLMs have the potential to take away some of the drudgery and distraction of text-based tasks like report generation and code writing, letting the domain experts in your organization focus on what they are best at—the science.

The Challenge of R&D Automation

Automation usually requires very standardized processes and is ideal for organizations that are doing the same thing over and over, whether on the factory floor, producing sales reports, or drafting business documents: any variables are well understood and constrained. R&D is, almost by definition, the opposite of standardized. R&D organizations are constantly taking on different projects, trying out new equipment, and testing new processes. Successful automation in an R&D context needs to be flexible without requiring constant human intervention.

The most obvious difference between this new generation of language models and the older ones (embodied in tools like Siri and Alexa) is the ability to keep the context of the conversation over a much longer span. These large language models can "remember" what you are talking about over many back-and-forth prompts and responses. This is possible due to advances in the architecture of the AI models, which allow the models to be trained more efficiently, permitting deeper context from the same resources.

The innovations from text-based models can be applied just as readily to other data types. New architectures can be built and efficiently trained to recognize relationships in other settings, such as cause and effect in time series or video data, or spatial relationships in images. While they aren't getting the same level of coverage in the popular press as the text-based models, we are starting to see these sorts of technologies emerge, such as ChemBERT in drug discovery. As they mature, they will likely provide the same sort of qualitative changes in the analysis of scientific data.

LLMs and Scientific Text Workflows

At its heart, ChatGPT is just trying to come up with the "best next word" over and over, building up its responses one word at a time. In some sense, an LLM is just a "sophisticated autocomplete." This makes these models very good at producing semi-structured text, such as computer code, configuration files, and standardized reports (and also answers to exam questions!), because semi-structured text is even more predictable than natural language. Of course, to do this the model has to be trained on appropriate examples of the desired output, but ChatGPT has demonstrated surprising adeptness at producing small but useful routines in common programming languages just from the code examples included in its general training data, without any specific additional training.
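To make that "best next word" loop concrete, here is a minimal, purely illustrative sketch in Python. The tiny hand-written probability table stands in for a neural network with billions of parameters; every word and probability in it is invented for illustration.

```python
# A toy "sophisticated autocomplete": pick the most likely next word,
# append it, and repeat. Real LLMs run the same loop, but the
# probabilities come from a trained neural network, not a hand-written table.
next_word_probs = {
    "the":        {"experiment": 0.5, "results": 0.3, "sample": 0.2},
    "experiment": {"produced": 0.6, "failed": 0.4},
    "produced":   {"unexpected": 0.7, "the": 0.3},
    "unexpected": {"results": 1.0},
    "results":    {"show": 1.0},
}

def generate(prompt_word: str, n_words: int) -> str:
    words = [prompt_word]
    for _ in range(n_words):
        candidates = next_word_probs.get(words[-1])
        if not candidates:
            break
        # Greedy decoding: always take the single most probable next word.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(generate("the", 5))
# -> "the experiment produced unexpected results show"
```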

Semi-structured text is very common in R&D contexts. It might be an algorithm to perform an analysis of some data, a section of a report on the results of an experiment, or a SQL query against a knowledge base. It may never be exactly the same twice, but it follows general patterns and carries expectations in formatting and style. In a traditional lab, writing these documents generally falls to the researchers, and it amounts to a significant interruption in the flow of their work: they are no longer thinking about the research problem, but about computer code, or getting data into a document, or how to connect to the database.

Leaders of R&D organizations would much rather have their scientists, engineers, and researchers focus on doing science, engineering, and research. Research at UC Irvine [1] shows that it can take up to 20 minutes to regain focus on the primary task after a distraction. By leveraging LLM-based tools to generate structured text through conversational prompts, the researcher is more likely to stay focused on the high-level research task. In the same way that regular autocomplete can speed up sending a text message while keeping you focused on the message you want to send, these tools can speed up the creation of other types of text while keeping focus on the larger task.

Of course, just as with regular autocomplete, sometimes LLMs will get things wrong, and so they still need a human in the loop.
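As a sketch of what such a prompt-driven, human-in-the-loop workflow might look like, the following Python fragment uses a hypothetical call_llm helper, a stand-in for whatever LLM API or library your organization uses. The table and column names are likewise invented for illustration.

```python
# A sketch of a prompt-driven workflow with a human in the loop.
# `call_llm` is a hypothetical placeholder; a real implementation would
# send the prompt to an LLM service and return its reply.
def call_llm(prompt: str) -> str:
    # Canned response for illustration only.
    return (
        "SELECT sample_id, AVG(temperature_c) AS mean_temp\n"
        "FROM measurements\n"
        "WHERE timestamp >= '2023-01-01'\n"
        "GROUP BY sample_id;"
    )

prompt = (
    "Write a SQL query against measurements(sample_id, timestamp, "
    "temperature_c) giving the mean temperature per sample since 2023-01-01."
)

draft = call_llm(prompt)
print(draft)
# The researcher reviews the draft before running it, the same way you
# glance at an autocomplete suggestion before hitting send.
```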

LLMs: Transformers

To build AI models which can track this sort of flexible structure, AI researchers developed the concept of attention: parts of the model designed to track important information that should influence later output. Many different attention mechanisms have been developed over the past 30 years, but until recently the best results all required comparatively complex underlying neural networks, so-called recurrent neural networks, which use their current state as an input for the next state. These are algorithmically expensive to train, because the computation must be built up sequentially, and this also makes it harder to split the work across multiple machines.
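To see why recurrence is hard to parallelize, consider this minimal NumPy sketch. The sizes are toy values and the weights are random; real recurrent networks add gating (as in LSTMs or GRUs) but share the same sequential structure.

```python
import numpy as np

# Each hidden state depends on the previous one, so the time steps
# cannot be computed in parallel.
rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 8, 4, 10
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_x = rng.normal(size=(hidden_size, input_size)) * 0.1
inputs = rng.normal(size=(seq_len, input_size))

h = np.zeros(hidden_size)
for x_t in inputs:                       # inherently sequential loop
    h = np.tanh(W_h @ h + W_x @ x_t)     # h_t depends on h_{t-1}

print(h)
```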

The breakthrough that allowed tools like ChatGPT to be built was the idea of transformers: a type of attention mechanism that works holistically on a chunk of a document rather than sequentially, and which can be implemented using simpler, non-recurrent networks. This removed an algorithmic bottleneck in the way these models are trained, allowing work to be done in parallel more easily. More training data could be fed into the model while using fewer resources, and deeper context models could be built. This permitted both the broader knowledge displayed by the current LLMs (coming from more input data) and the greatly improved ability to carry out conversations with humans (coming from the deeper context).
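A minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a transformer, makes the contrast with the recurrent loop above visible: there is no loop over time steps, so the whole chunk is processed in one set of matrix products. Sizes and weights are again toy values; real transformers add multiple heads, layers, and learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 10, 8
X = rng.normal(size=(seq_len, d_model))   # one "chunk" of token vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)       # how much each token attends to each other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V                      # context-aware representation of every token at once

print(output.shape)                       # (10, 8)
```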

Beyond LLMs: Transformers

But again, a lot of R&D data isn't text. You can't, at the moment, give ChatGPT an image or a series of spectra as an input for it to work with without somehow first turning the data into text that ChatGPT understands.

At the level of the AI model, everything is just vectors of numbers, so these transformer-based algorithmic improvements can be applied to other data sources. In computer vision and image processing, for example, better context means better tracking of things over time, better distinction of spatial relationships like "close to" or "to the left", or the ability to distinguish things based on environmental cues. In drug discovery, we are now seeing rapid improvements in the ability to predict biochemical properties of molecules from molecular structure data—the improved ability to track context allows the AI to link active subgroups that work together to produce a particular chemical property even though they may be far apart in the molecular structure description.
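As a sketch of what "everything is just vectors" looks like in practice, the following NumPy fragment turns an image into a sequence of patch vectors, the same kind of token sequence the attention sketch above consumes. This mirrors the idea behind vision transformers; all sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))        # a fake 64x64 grayscale image
patch = 16

# Cut the image into 16x16 patches and flatten each into a vector.
patches = [
    image[i:i + patch, j:j + patch].ravel()
    for i in range(0, 64, patch)
    for j in range(0, 64, patch)
]
tokens = np.stack(patches)               # (16, 256): a "sentence" of 16 patch tokens
print(tokens.shape)

# A learned projection maps each patch into the model dimension; from
# here on, the transformer does not care that the tokens came from
# pixels rather than words.
d_model = 8
W_embed = rng.normal(size=(256, d_model)) * 0.05
X = tokens @ W_embed                     # ready for the attention block above
print(X.shape)                           # (16, 8)
```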

Over the next few years, we are likely to see transformative improvements in the analysis of many different types of scientific and engineering data thanks to the advances that led to ChatGPT and its friends.

 

To learn more, join Enthought AI experts for the webinar What Every R&D Leader Needs to Know about ChatGPT and LLMs for a deeper dive into how these advanced technologies are changing scientific research.
