Unleash the potential of AI: Discover hidden data gems



Compared to a typical day for Yuejie Chi, finding the needle in the haystack is easy.

As a leading authority on the foundations of large language models (LLMs) and other machine learning systems, Chi, the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science in the College of Arts and Sciences and professor of computer science in the Yale School of Engineering, is sifting through multiple haystacks that communicate with each other.

Her specialty is extracting useful information from huge datasets. This includes separating the signal from the noise, understanding the intricacies of how data is collected, and always seeking the most efficient path to using the data. In doing so, she is helping advance AI’s ability to make predictions and decisions across a variety of applications, from medical imaging to materials science.

The process is full of surprises, she says. Useful information is “hidden” in oceans of data, often in shapes and structures that turn out to be predictable once you find them.

“When dealing with efficiency in the context of AI, structures are ubiquitous and can appear in many forms and in many places across data, models, and systems,” said Chi.

Indeed, Chi’s extensive research has continued to grow since she joined the Yale faculty in 2025 (she is also a member of the Yale Institute for Foundations of Data Science, the Wu Tsai Institute, and the Center for Algorithms, Data, and Market Design).

Her research has already led to improvements in image processing algorithms. For example, her work on super-resolution fluorescence microscopy, a technique that produces images at higher resolution than standard microscopy allows, has helped generate better, more detailed images for optical and biomedical research while using fewer computational resources. She has also done important research on phase retrieval, an imaging technique used in crystallography and astronomy.

In the interview, Chi talked about the need for more efficient AI and the joy of combining theoretical research with practical results. The interview has been edited and condensed.

What is the largest or most complex dataset you have ever encountered?

Yuejie Chi: I helped curate a dataset for Nationwide Children’s Hospital in Columbus, Ohio. It contains sleep study data from over 3,000 patients who underwent multiple sleep study sessions. This was three years’ worth of data.

The size of the dataset was unprecedented at the time, and we believed that data at this scale would be extremely valuable for training large-scale machine learning models. In fact, researchers have since used it to train large foundation models and derive valuable insights.

How can theory help inform AI performance?

Chi: Here’s a common example. My colleagues and I have been studying how LLMs work and how the different training paradigms in use can unlock their reasoning power.

You can think of these different training paradigms and their associated data as tuning knobs. How efficient is each of these knobs? And how can they be used in combination for the best results? Theory is the key to understanding LLMs and developing more efficient models.

Once you have identified the “hidden structure”, how can you exploit it?

Chi: One of the main examples is imaging. With a better understanding of the hidden structure in your data, you can often recover high-quality information from fewer or noisier measurements. This could, for example, reduce the need to remain completely still or hold your breath for long periods of time [during MRI or CT scans, for instance], making medical imaging faster, more comfortable, and less stressful for patients. More broadly, identifying useful structures in data allows AI systems to do more with fewer resources. This means lower computational costs, lower energy usage, and improved downstream functionality.

What are you currently working on that reflects this?

Chi: We are currently working with researchers at the U.S. Air Force Research Laboratory to leverage AI diffusion models for materials imaging. These are generative models that learn the structure of data and can capture complex data patterns very effectively. This opens up the possibility of using diffusion models to dramatically accelerate materials imaging, which can be very time-consuming.

Is there any other ongoing research that you are particularly interested in?

Chi: Yes, there are a few! But the one I’m particularly excited about is reinforcement learning [RL], a machine learning paradigm that learns through trial and error. It probably became widely known through game-playing systems such as AlphaGo.

I’ve always been interested in RL. My focus has been on understanding the efficiency of RL algorithms in different settings and bridging the gap between theory and practice. One of our recent studies in this area looked at how RL is used to train language models. In fact, I plan to offer an entry-level course on RL next semester.

/University Release. This material from the originating organization/author may be of a point-in-time nature and has been edited for clarity, style, and length. Mirage.News does not take institutional positions or sides, and all views, positions, and conclusions expressed herein are solely those of the authors.


