Artificial intelligence has become a powerful tool for critical systems in healthcare, finance, and national security. However, these complex machine learning systems can also be targeted by attackers.
Weijie Zhao, an assistant professor of computer science at RIT, wants to shed light on AI design and ensure machine learning models don’t have hidden vulnerabilities or dangerous exploits embedded in them. His research helps build more secure and interpretable machine learning models.
“With the increasing use of AI and agentic AI, this has become a real societal concern,” Zhao said. “We want to make sure that there is no false information or misinformation and that decisions are not manipulated.”
Zhao was recently recognized for his work with the prestigious National Science Foundation Faculty Early Career Development (CAREER) award. His five-year project is titled “Defending Machine Learning Models from Adversarial Threats with Unified Interpretation and Attribution.”
The main problem stems from the fact that complex machine learning works like a black box. Although the output may be visible, the internal decision-making process of AI is incomprehensible to humans. As a result, there are many unknowns about AI systems.
For example, you might unintentionally train an AI agent with a model that contains incorrect information. Attackers can also build large language models with backdoors and watermarks.
“The attacker doesn’t even need to steal the data,” Zhao explains. “They will be able to create models that will somehow influence decision-making.”
Zhao pointed out that OpenClaw, a popular free and open-source AI agent, has seen attackers use third-party extensions to distribute malware. In February, researchers discovered 341 malicious ClawHub skills that stole data from users.
Machine learning model transparency
The CAREER Award project aims to strengthen the safety, resilience, and accountability of machine learning systems deployed in high-stakes environments.
Through his research, Zhao hopes to bridge the gap between modern AI systems and the classic machine learning frameworks that scientists already understand. The resulting tools would allow defenders to trace the cause of failures and remediate vulnerabilities. Zhao plans to:
- Develop techniques to identify how adversarial inputs or training data components lead to harmful outputs.
- Design rapid strategies to remove harmful behavior without full retraining, using surrogate models to confirm that each repair is local and verifiable.
- Build automated pipelines for auditing, provenance tracking, and security-aware evaluation of training data to detect data poisoning and instability.
- Create an interface that allows users to investigate suspicious output and remediate vulnerabilities interactively.
“Essentially, we want to find malicious data at inference time, fix it without retraining the model, and provide evidence that it was fixed,” Zhao said.
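To make the idea of provenance tracking for training data more concrete, here is a minimal Python sketch. It is not Zhao's method; it only illustrates one simple building block, assuming training records are small JSON-like dictionaries. The function names (`record_digest`, `build_manifest`, `find_modified`) and the sample records are hypothetical. The sketch fingerprints each record with SHA-256 and later flags any record whose fingerprint has changed.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Return a SHA-256 digest of one training record.

    The record is serialized as canonical JSON (sorted keys, fixed
    separators) so that key order does not change the digest.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(records: list[dict]) -> list[str]:
    """Record one digest per training example, in dataset order."""
    return [record_digest(r) for r in records]


def find_modified(records: list[dict], manifest: list[str]) -> list[int]:
    """Return the indices of records that no longer match the manifest."""
    if len(records) != len(manifest):
        raise ValueError("dataset and manifest have different lengths")
    return [
        i
        for i, (record, digest) in enumerate(zip(records, manifest))
        if record_digest(record) != digest
    ]


# Hypothetical sentiment-style dataset, used only for illustration.
trusted = [
    {"text": "good movie", "label": 1},
    {"text": "bad movie", "label": 0},
]
manifest = build_manifest(trusted)

# Simulate a label-flipping change made after the manifest was built.
tampered = [dict(r) for r in trusted]
tampered[1]["label"] = 1

print(find_modified(tampered, manifest))  # prints [1]
```

A check like this only catches records altered after the manifest was created. It cannot tell whether the original data was already poisoned, which is the harder attribution problem the project description targets.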
At RIT, Zhao works with five computing and information sciences Ph.D. students. He hopes future practitioners will use this defense framework to create machine learning tools that are more resilient, transparent, and reliable.
“This is very important because many developers are now looking for the best AI performance,” Zhao says. “But we also need to pursue responsible AI system security and guardrails.”
The prestigious CAREER Award program recognizes and supports early-career faculty who embody the role of teacher-scholar through integrated research and teaching activities. RIT is home to more than 10 NSF CAREER Award recipients.
