How word embeddings can benefit R&D across industries



BERT embedding: clustering analysis of embeddings derived from R&D World content. This diagram visualizes a high-dimensional dataset using t-SNE (t-distributed stochastic neighbor embedding), a machine learning algorithm used for dimensionality reduction.

Word embeddings have been one of the driving forces behind the recent surge in interest in AI, especially in large-scale language modeling and natural language processing. Embeddings, which are vectors of real numbers, have been used in various fields for decades. In the 1980s and 1990s, researchers investigated various vector space models and dimensionality reduction techniques for representing words.

In machine learning specifically, a word embedding is a subtype of embedding. In mathematics, an embedding refers generally to the way one mathematical structure is contained within an instance of another. As Wikipedia puts it:

An object X is said to be embedded in another object Y when the embedding is given by some injective and structure-preserving map f: X → Y.

Although the term “embedding” was not explicitly used in the landmark 2003 paper “A Neural Probabilistic Language Model” by Bengio et al., the paper significantly advanced the concept of word embeddings, paving the way for their later widespread adoption. The authors describe a method “to associate a distributed word feature vector (a real-valued vector in ℝm) with each word in the vocabulary.” This method makes it possible “to express the joint probability function of a sequence of words in terms of the feature vectors of these words in the sequence” and “to simultaneously learn the word feature vectors and the parameters of that probability function.”

This idea gained more traction in 2013 when Tomas Mikolov and colleagues at Google introduced Word2Vec, an efficient way to learn high-quality vector representations of words from large amounts of unstructured text data. This publication showed that simple vector operations could capture semantic relationships, sparking a wave of advances in the field of natural language processing. In one famous example, it was shown that vector (“King”) – vector (“Man”) + vector (“Woman”) is very close to vector (“Queen”). (The Word2Vec system was also controversial for capturing potential biases, as detailed by Brian Christian in “The Alignment Problem”).
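If you want to try the analogy yourself, the minimal sketch below assumes the gensim library and its downloadable Google News Word2Vec vectors; both are assumptions for illustration, not the setup used in the original 2013 work.

```python
# Sketch: the classic Word2Vec analogy, assuming `gensim` is installed and its
# downloader can fetch the pretrained Google News vectors (~1.6 GB download).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# vector("king") - vector("man") + vector("woman") should land near vector("queen").
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected: [('queen', <similarity score>)]
```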

The video below provides a clear overview of embeddings in the context of generative pre-trained transformers. The links below skip to the relevant parts, but I encourage you to watch the entire video if you have the time.

Embeddings in plain English

But what does embedding mean in plain English? It basically means that instead of treating words as independent units, we can represent them as vectors in a multi-dimensional space. This is a kind of “word map” where it's all about location: words with similar meanings cluster together, and words with different meanings live farther apart. Imagine an embedding space that spreads out in all directions from the origin, with many more dimensions than we can picture.
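As a toy illustration of that “location” idea, here is a minimal sketch with hand-made three-dimensional vectors; the numbers are invented purely for illustration, and real embeddings have hundreds of dimensions.

```python
# Toy illustration (not real trained vectors): words as points in a vector space,
# with cosine similarity as the notion of "closeness" on the word map.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-dimensional vectors chosen so related terms point in similar directions.
vectors = {
    "battery":        np.array([0.90, 0.80, 0.10]),
    "supercapacitor": np.array([0.85, 0.75, 0.20]),
    "poetry":         np.array([0.10, 0.20, 0.95]),
}

print(cosine_similarity(vectors["battery"], vectors["supercapacitor"]))  # high (~0.99)
print(cosine_similarity(vectors["battery"], vectors["poetry"]))          # low (~0.29)
```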

The interactive embedding demo below shows embeddings for different R&D domains. The embeddings were generated with Bidirectional Encoder Representations from Transformers (BERT), a model Google released in 2018. The diagram projects the embeddings into 3D vector space using Principal Component Analysis (PCA), a technique that reduces the dimensionality of data while preserving as much of the original variability as possible. The original BERT embeddings (bert-base-uncased) live in a 768-dimensional space; PCA reduces the dimensionality by finding the directions in which the data varies most (the principal components).
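For readers who want to recreate something like this diagram, the sketch below embeds a few R&D phrases with bert-base-uncased and projects the 768-dimensional vectors down to three principal components. It assumes the Hugging Face transformers package and scikit-learn are installed, and the phrase list is an illustrative placeholder rather than the dataset behind the figure.

```python
# Sketch: embed short R&D phrases with BERT, then project 768-d vectors to 3D with PCA.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA

phrases = [
    "graphene supercapacitor electrodes",
    "lithium-ion battery degradation",
    "CRISPR gene editing",
    "mRNA vaccine delivery",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    encoded = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**encoded)
    # Mean-pool the token embeddings into one 768-dimensional vector per phrase.
    mask = encoded["attention_mask"].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

# Reduce 768 dimensions to 3 principal components for plotting.
coords_3d = PCA(n_components=3).fit_transform(embeddings.numpy())
print(coords_3d.shape)  # (4, 3)
```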

Before word embeddings became popular, traditional keyword-based search was one of the few tools available for finding relevant research papers, for example on graphene. When looking for information about graphene's potential applications in energy storage, researchers might use keywords such as “graphene battery” or “graphene supercapacitor.”

However, such an approach may miss many relevant results in the literature that use different terminology. Embeddings enable semantic search, allowing researchers to find relevant papers even when the exact keywords do not match; a minimal sketch of such a search follows the list below. For example, a search for “graphene in energy storage” returns results such as:

  • “Carbon nanostructures for electrochemical capacitors”
  • “Two-dimensional materials for next-generation batteries”
  • “Atomically thin electrodes for high performance energy devices”
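Here is the promised sketch of a semantic search over those titles. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, a convenient stand-in rather than the embedding model used for the figures above, and it adds an off-topic fourth title to show the ranking.

```python
# Sketch: embedding-based semantic search, ranking paper titles by similarity to a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

papers = [
    "Carbon nanostructures for electrochemical capacitors",
    "Two-dimensional materials for next-generation batteries",
    "Atomically thin electrodes for high performance energy devices",
    "Polymer coatings for marine corrosion resistance",  # off-topic distractor
]

query = "graphene in energy storage"

paper_vecs = model.encode(papers, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

# Rank papers by cosine similarity to the query, highest first.
scores = util.cos_sim(query_vec, paper_vecs)[0]
for score, title in sorted(zip(scores.tolist(), papers), reverse=True):
    print(f"{score:.3f}  {title}")
```

Note that none of the top-ranked titles contains the word “graphene”; the ranking comes entirely from how close their embeddings are to the query's embedding.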

Connecting the interdisciplinary dots

One of the most promising applications of embeddings in research is their ability to surface unobvious connections between different fields. For example, a 2022 Nature paper describes how researchers developed embeddings representing entire academic fields, built from co-occurrence statistics of Field of Research (FoR) codes across millions of published papers. The authors report examining “a dataset of over 21 million papers published in 8400 academic journals between 1990 and 2019” and propose an embedding-based strategy for exploring the geometric closeness between fields.
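To make the general idea concrete, the following sketch builds field embeddings from a tiny, invented co-occurrence matrix and compares fields by cosine similarity. It is only a schematic of the co-occurrence approach, not the Nature paper's actual pipeline, and it assumes scikit-learn.

```python
# Sketch: field embeddings from a toy co-occurrence matrix of field tags on papers.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

fields = ["materials science", "electrochemistry", "machine learning", "linguistics"]

# cooc[i, j] = number of papers tagged with both field i and field j (made-up counts).
cooc = np.array([
    [0, 120, 15,  1],
    [120, 0, 10,  0],
    [15, 10,  0, 25],
    [1,   0, 25,  0],
], dtype=float)

# Factorize the co-occurrence matrix into low-dimensional field embeddings.
field_embeddings = TruncatedSVD(n_components=2).fit_transform(cooc)

# Geometric closeness between fields: a 4x4 matrix of cosine similarities.
print(cosine_similarity(field_embeddings))
```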

In R&D-intensive industries, embeddings can also help visualize and analyze large patent datasets to identify innovation white spaces and potential infringement risks. Korean researchers, writing in an IEEE publication, reported that patent embedding-based search is effective for R&D benchmarking.


