Small Data, Big Value

Author: Mason Dykstra, Ph.D., VP Energy Solutions 

Enthought welcomes Mason Dykstra as VP, Energy Solutions. His background at Anadarko and Statoil, and as a professor at the Colorado School of Mines, qualifies him to make the case that ‘Small Data’ deserves an equal place in the Fourth Industrial Revolution. This is the first article in a Small Data series.

The origin of the term Big Data will likely never be agreed upon. However, in the world of science and computing, the case can be made that the term originated at Silicon Graphics in the 1990s, where work in video for surveillance and Hollywood special effects meant handling orders of magnitude more data than ever before. Recent advances in scientific computing technology and techniques, together with the massive generation of data, in particular by consumers and on social media, have put the term Big Data at center stage.

However, in many scientific fields, Big Data does not exist. There, the challenge is getting the most from ‘Small Data’, so that scientific problems with minimal data also benefit from the ‘Fourth Industrial Revolution’. In many natural sciences and engineering disciplines, large volumes of data can be hard or very expensive to generate. The reality is that these datasets are often limited in size, poorly curated, and bespoke to particular problems. So either the fields lacking Big Data will be left out of the ‘Revolution’, or we need to work on ways of unleashing the power of Small Data.

Scientists are particularly adept at teasing meaning out of Small Data and drawing important conclusions with limited datasets. The future will be a collaboration between humans and machines, but clearly we don’t only want to solve the problems that have Big Data behind them. In cases where datasets are relatively small, or important pieces of information are missing, how can we develop this type of ‘intelligence’ in machines? 

We need to engineer applications that can approach problems the way a scientist would. Scientists typically hypothesize as they go, which is to say they don’t wait until they have enough data to draw conclusions. Instead, they generate, evolve, and discard hypotheses along the way. Even while gathering data, they are already engaged in problem-solving.

For example, when a geologist is creating a map of the geologic layers and faults beneath the Earth’s surface, they continually make educated guesses about what some of the map features will look like before they have gathered all the data. Not only does this give the geologist something early on paper (ok, on screen), it also provides a basis for hypothesis testing and can help steer the next data-gathering step. Think of this as akin to coming into a new town for the first time: even though you might never have been to that particular town before, all towns share certain traits and similarities, which we can use to imagine the parts we haven’t yet seen. This kind of intuitive thinking and rule-of-thumb-based guessing, although critical for many sciences, has not been the realm of computers. Yet.

So the real question is this: can we capture the essential parts of that rule-making process and combine them with ‘machine reasoning’ to develop Small Data approaches that mirror the way a scientist would approach a problem, only much faster and more consistently? This is one of the major challenges for many scientists today, whether they recognize it yet or not.

One thing we do know, paraphrasing Antonio di Leva in The Lancet: ‘Machines will not replace scientists, but scientists using AI will soon replace those not using it.’

About the Author

Mason Dykstra, Ph.D., VP Energy Solutions at Enthought, holds a Ph.D. from the University of California, Santa Barbara, an MS from the University of Colorado Boulder, and a BS from Northern Arizona University, all in the Geosciences. Mason has worked in oil and gas exploration, development, and production for over twenty years, split between oil industry-focused applied research at Colorado School of Mines and the University of California, Santa Barbara, and roles within companies including Anadarko Petroleum Corporation and Statoil (Equinor).
