The Importance of Large Language Models in Science Even If You Don’t Work With Language

 

OpenAI's ChatGPT, Google's Bard, and other similar Large Language Models (LLMs) have made dramatic strides in their ability to interact with people using natural language. Users can describe what they want done and have the LLM "understand" and respond appropriately. 

While R&D leaders and their scientists are aware of ChatGPT, most are unclear about what the technology means for them because their data isn't natural language. Scientific data is different from traditional business data and requires special handling. Much of R&D data isn't text: it's time series, images, video, spectra, molecular structures, or any number of other data sources in a myriad of formats.

Even if your primary data isn't text-based, every lab still does a significant amount of work with text: reports, code, configuration files, and so on.

The technology behind tools like ChatGPT provides a new flexibility that can lift a significant amount of the burden of the text-based workflows that every lab has. More importantly, the advances in AI models that permit ChatGPT's dramatically better conversational context are revolutionizing the ability of AI models to work with deeper relationships in non-language data as well.

This means that innovative research organizations are in a unique position to benefit from these new types of tools. LLMs have the potential to take away some of the drudgery and distraction of text-based tasks like report generation and code writing, letting the domain experts in your organization focus on what they are best at—the science.

The Challenge of R&D Automation

Automation usually requires very standardized processes and is ideal for organizations that are doing the same thing over and over, whether on the factory floor, producing sales reports, or drafting business documents: any variables are well understood and constrained. R&D is, almost by definition, the opposite of standardized. R&D organizations are constantly taking on different projects, trying out new equipment, and testing new processes. Successful automation in an R&D context needs to be flexible without requiring constant human intervention.

The most obvious difference between this new generation of language models and the older ones (embodied in tools like Siri and Alexa) is the ability to keep the context of the conversation over a much longer span. These large language models can "remember" what you are talking about over many back-and-forth prompts and responses. This is possible due to advances in the architecture of the AI models that allow them to be trained more efficiently, permitting deeper context from the same resources.
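As a rough illustration of how that "memory" works in practice, consider the sketch below. It uses the OpenAI Python client, but the model name and prompts are placeholders, and nothing here is specific to ChatGPT's internals: the conversational context is simply the accumulated message history, resent with every request.

```python
# A minimal sketch: conversational "memory" is the accumulated message
# history. The model name and prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a lab assistant."}]

for question in ["What does this spectrum suggest?", "And at higher temperature?"]:
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # context for the next turn
```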

The innovations from text-based models can be applied just as readily to other data types. New model designs can be built and efficiently trained to recognize relationships in other situations, such as cause and effect in time series or video data, or spatial relationships in images. While they aren't getting the same level of coverage in the popular press as the text-based models, we are starting to see these sorts of technologies emerge, such as ChemBERT in drug discovery. It is likely that as they mature they will start to provide the same sort of qualitative changes in the analysis of scientific data.

LLMs and Scientific Text Workflows

At its heart, ChatGPT is just trying to come up with the "best next word" over and over, building up its responses one word at a time. In some sense, an LLM is just a "sophisticated autocomplete." These models are therefore very good at producing semi-structured text, such as computer code, configuration files, and standardized reports (and also answers to exam questions!), because semi-structured text is even more predictable than natural language. Of course, to do this, the model has to be trained on appropriate examples of the desired output, but ChatGPT has demonstrated surprising adeptness at producing small but useful routines in common programming languages purely from the code examples included in its general training data, without any specific additional training.
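The "best next word" loop is easy to sketch. The snippet below uses the Hugging Face transformers library with GPT-2; the model and the prompt are purely illustrative, and real chat systems sample rather than always taking the single most likely token.

```python
# Sketch of greedy next-token generation: pick the single most likely
# token, append it, repeat. GPT-2 is used purely as an illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The experiment showed that", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits        # scores for every vocabulary token
        next_id = logits[0, -1].argmax()  # the "best next word" (token)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))
```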

Semi-structured text is very common in R&D contexts. It might be an algorithm to perform an analysis of some data, a section of a report on the results of an experiment, or perhaps a SQL query against a knowledge base. It is rarely identical from one instance to the next, but it follows general patterns and expectations in formatting and style. In a traditional lab, writing these documents generally falls to the researchers, and it breaks the flow of their work: they are no longer thinking about the research problem, but about computer code, or getting data into a document, or how to connect to the database.
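As a concrete (and hypothetical) example, here is what handing one of those routine tasks to an LLM might look like; the table schema, prompt wording, and model name are all invented for illustration:

```python
# Sketch: turning a plain-English request into a routine SQL query.
# The schema, prompt, and model name are hypothetical.
from openai import OpenAI

client = OpenAI()
prompt = (
    "Write a SQL query against a table measurements(sample_id, ts, value) "
    "that returns the mean value per sample over the last 7 days."
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # a human should review this before running it
```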

Leaders of R&D organizations would much rather have their scientists, engineers, and researchers focus on doing science, engineering, and research. Research at UC Irvine¹ shows that it can take up to 20 minutes to regain focus on the primary task after a distraction. By leveraging LLM-based tools to generate structured text through conversational prompts, the researcher is more likely to stay focused on the high-level research task. In the same way that regular autocomplete can speed up sending a text message while keeping you focused on the message you want to send, these tools can speed up the creation of other types of text while keeping focus on the larger task.

Of course, just as with regular autocomplete, sometimes LLMs will get things wrong, and so they still need a human in the loop.

Beyond LLMs: Transformers

To build AI models that can track this sort of flexible structure, AI researchers developed the concept of attention: parts of the model designed to track important information that should influence later output. Many different attention methods have been developed over the past 30 years, but until recently the best results all required comparatively complex underlying neural networks, so-called recurrent neural networks, that use their current state as an input for the next state. These are algorithmically expensive to train because everything must be built up sequentially, which also makes it harder to break up the work among multiple machines.
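A toy example makes the sequential bottleneck concrete. In the recurrent update below (the dimensions and the tanh cell are illustrative, not any particular production architecture), step t cannot begin until step t-1 has finished, so the time loop cannot be parallelized:

```python
# Toy recurrent cell: each hidden state depends on the previous one,
# so the time steps must be computed one after another.
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(16, 16))  # hidden-to-hidden weights
W_x = rng.normal(size=(16, 8))   # input-to-hidden weights

h = np.zeros(16)
for x_t in rng.normal(size=(100, 8)):  # 100 time steps of input
    h = np.tanh(W_h @ h + W_x @ x_t)   # step t waits on step t-1
```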

The breakthrough that allowed tools like ChatGPT to be built was the idea of transformers: a type of attention mechanism that works holistically on a chunk of a document rather than sequentially, and which can be implemented using simpler, non-recurrent networks. This removed an algorithmic bottleneck in the way these models are trained, allowing the work to be parallelized more easily. More training data could be fed into the model while using fewer resources, and deeper context models could be built. This permitted the broader knowledge displayed by the current LLMs (coming from more input data) as well as the greatly improved ability to carry on conversations with humans (coming from the deeper context).
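The core operation behind this is easy to sketch. In the simplified, single-head scaled dot-product self-attention below (a textbook formulation, not any specific product's implementation), every position attends to every other position through one matrix product, with no sequential loop to serialize training:

```python
# Scaled dot-product self-attention: all pairwise interactions between
# positions are computed at once, so the work parallelizes naturally.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # every position vs. every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                      # 100 tokens, 32-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(32, 32)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)              # shape: (100, 32)
```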

But again, a lot of R&D data isn't text. You can't, at the moment, give ChatGPT an image or a series of spectra to work with without somehow first turning the data into text that ChatGPT understands.

At the level of the AI model, everything is just vectors of numbers, so these transformer-based algorithmic improvements can be applied to other data sources. In computer vision and image processing, for example, better context means better tracking of things over time, better distinction of spatial relationships like "close to" or "to the left", or the ability to distinguish things based on environmental cues. In drug discovery, we are now seeing rapid improvements in the ability to predict biochemical properties of molecules from molecular structure data—the improved ability to track context allows the AI to link active subgroups that work together to produce a particular chemical property even though they may be far apart in the molecular structure description.
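As a hedged sketch of how non-text data can become "tokens," the Vision-Transformer-style snippet below slices an image into patches and flattens each patch into a vector, yielding a sequence the same attention machinery can process (the image and patch sizes are illustrative):

```python
# ViT-style tokenization: slice an image into patches and flatten each
# into a vector, producing a token sequence for a transformer.
import numpy as np

image = np.zeros((224, 224, 3))  # an illustrative H x W x RGB image
patch = 16

patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)  # (196, 768): 196 "tokens", each a 768-dim vector
```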

Over the next few years, we are likely to see transformative improvements in the analysis of many different types of scientific and engineering data thanks to the advances that led to ChatGPT and its friends.

 

To learn more, join Enthought AI experts for the webinar "What Every R&D Leader Needs to Know about ChatGPT and LLMs" for a deeper dive into how these advanced technologies are changing scientific research.
