Symposium on Open Source Investigation Lab: AI for Open Source Investigation



[Guillen Torres Sepulveda is an Open Source Investigations Specialist with the Human Rights Center, Berkeley School of Law.

Pınar Yolum is Professor of Trustworthy AI in the Department of Information and Computing Sciences at Utrecht University.]

Open source research is the practice of collecting, examining, and analyzing publicly available information to answer research questions, which range from fact-checking to human rights monitoring to documenting environmental abuses. Established techniques include geolocation and temporal location of images and videos, visualization of large amounts of data for analytical purposes, social network analysis of individuals and organizations, and tracking of ships and flights. Open source researchers typically apply these techniques by collecting, organizing, and making sense of analog and digital data through structured methodologies, and they are increasingly incorporating a variety of software tools into their processes. Even so, the sheer amount of available data makes these methods time-consuming and error-prone. In recent years, some artificial intelligence (AI) technologies have matured and become powerful enough to support investigators in data-intensive tasks. AI systems can process large amounts of open data (such as text, images, video, audio, and geospatial data) much faster than humans, helping researchers identify and predict patterns effectively. Below are some examples to help those interested explore the potential of AI in investigative work.

AI techniques for open source investigation

A key task in many open source research methods is to identify and match images to content of interest: for example, checking whether a drone appears in a photo found on social media, or whether a particular police officer appears in a video of a protest. To assist investigators with these tasks, AI tools must be able to perform advanced computer vision tasks. Major examples include image classification (e.g., checking whether an object is present in a photo), object detection (e.g., locating an object's position within an image), face recognition (e.g., matching a known human face to a face in a photo), landscape matching (e.g., comparing a photo's background to a satellite map), and reverse image search (e.g., matching an image to an existing image to find related information). The AI technology underlying many of these tasks is machine learning (ML), which allows algorithms to learn patterns from data. Simply put, most ML algorithms are first trained on labeled data, which allows them to find patterns associated with the labels. When a trained ML algorithm is presented with a new image, it can predict what that image's label should be. This technique is especially powerful when the training data is large and varied and the labels of interest are well-defined. Consider identifying a cat in an image: because we can provide a large training set containing images with cats of various sizes and types as well as images without cats, the ML algorithm can derive patterns accurately. Compare this to identifying a newly developed weapon, such as a new type of drone: here the training data is very small, and the new drone may look different from existing ones. ML algorithms have been shown to perform poorly in such cases and to require extensive human supervision. Because ML algorithms find patterns automatically, without explicit human instruction, they are often called black box algorithms: it is difficult to see what is going on inside them. Recent efforts in AI toward explainability aim to overcome this knowledge gap by providing end users with additional methods and tools to understand and assess the decisions made by AI algorithms.
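To make this train-then-predict workflow concrete, here is a minimal Python sketch, assuming scikit-learn is installed. It uses scikit-learn's bundled labeled digit images as a stand-in for an investigation's training data; the dataset and classifier are illustrative choices, not the tools used in any investigation described here.

```python
# Minimal supervised image classification sketch (assumes scikit-learn).
# The bundled digits dataset stands in for labeled investigation imagery.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 8x8 grayscale images, each labeled 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = SVC()                # a standard off-the-shelf classifier
clf.fit(X_train, y_train)  # "training": learn patterns from labeled images

# Given a new, unseen image, the trained model predicts its label.
print("predicted label:", clf.predict(X_test[:1])[0])
print("true label:     ", y_test[0])
print("held-out accuracy:", round(clf.score(X_test, y_test), 3))
```

The same pattern underlies the investigative cases above, with the caveat already noted: when labeled examples are scarce, as with a new weapon type, accuracy degrades and human review becomes essential.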

Another category of open source research tasks requires processing large text data sets; a prime example is transcripts of messages and conversations collected from Telegram. In situations like this, it is useful to be able to summarize texts, understand the links between messages, and possibly translate them into another language. One category of algorithms analyzes text based on how often certain terms or entities appear and which words they co-occur with. Such algorithms can preprocess large amounts of message text to help researchers focus on the right messages, and they are particularly useful for surfacing trending topics by measuring how often certain words are used. Another important category is sentiment analysis, whose goal is to determine whether message authors have positive or negative feelings about a particular concept. This helps researchers understand the sentiment within a population about a given topic and can prompt further investigation if necessary.
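As an illustration of the frequency-based approach, the sketch below counts terms across a handful of invented messages (standing in for a large Telegram export) to surface recurring topics; scikit-learn is assumed, and the messages are hypothetical.

```python
# Term-frequency sketch over hypothetical scraped messages (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer

messages = [  # placeholder for thousands of collected Telegram messages
    "Convoy of trucks spotted near the border this morning",
    "Another convoy reported heading north along the border road",
    "Power outage in the city center, second time this week",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(messages)

# Sum counts across all messages and list the most frequent terms,
# a rough proxy for "trending topics" in the collection.
totals = counts.sum(axis=0).A1
ranked = sorted(zip(vectorizer.get_feature_names_out(), totals), key=lambda t: -t[1])
for term, total in ranked[:5]:
    print(f"{term}: {total}")
```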

Examples of AI usage in open source research

The use of large datasets and computer-assisted investigative techniques has been part of investigative reporting and human rights fact-finding for decades, but it has become a staple of open source investigations over the past decade, and especially over the last five years. News organizations and human rights investigators (and some hybrid outfits, such as Forensic Architecture and its multiple spinouts, or the now-defunct Global Justice and Accountability division of Bellingcat) have hired specialized talent to combine complex data processing and visualization techniques with the field's usual investigative and narrative methods. This combination has produced high-impact research spanning many of the potential practical applications of machine learning. One of the earliest examples was a BuzzFeed News investigation into China's detention infrastructure: BuzzFeed researchers realized that many facilities share physical characteristics that can be observed remotely, so they experimented with training a machine learning model to find unidentified prisons when fed a large collection of satellite images. The New York Times' visual investigations team trained computer vision algorithms to identify craters left by Israeli bombs in the Gaza Strip. The LA Times trained another model to work with text data and identify inconsistencies in how crimes were classified in a large dataset of local crime reports. The nongovernmental organizations VFRAME and Tech 4 Tracing use conflict footage to build weapon detection algorithms and create replicas of munitions for investigative purposes. And Bellingcat has created a tool that allows anyone interested to search all local council debates in the UK.

All of these examples make clear that AI enables data to be processed at a scale no human can match, helping investigators find patterns and generate insights that are only possible when working with large amounts of data. For example, just as anti-fraud software flags suspicious transactions by comparing them against historical records, investigators have used similar anomaly-detection approaches to identify potential spy planes and illegal mining in the Amazon.
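A small sketch of that anomaly-detection idea, with entirely invented numbers: fit a model on records of ordinary behavior, then flag records that deviate. We assume scikit-learn; the "flight" features below are placeholders, not real aviation data.

```python
# Anomaly-detection sketch in the spirit of the spy-plane example
# (assumes scikit-learn; all feature values are invented placeholders).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-aircraft features: (flight hours per week, mean altitude in km)
ordinary_flights = rng.normal(loc=[20.0, 10.0], scale=[3.0, 1.0], size=(500, 2))

model = IsolationForest(random_state=0).fit(ordinary_flights)

candidates = np.array([
    [21.0, 10.2],  # resembles ordinary traffic
    [60.0, 2.5],   # long hours at low altitude: should stand out
])
print(model.predict(candidates))  # 1 = looks normal, -1 = flagged as anomalous
```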

New directions for AI in open source research

An interesting direction in which AI can be further leveraged for open source research is agent-based modeling (ABM). ABM models the world by creating agents that represent individual entities with distinct behaviors and interaction patterns. The model can then be simulated to observe how interactions between agents produce collective behaviors that emerge over time. ABM is widely used to help public sector policy makers understand how a particular policy provision affects individual behavior and whether that provision is likely to be accepted by the public. In open source research, ABM could help researchers understand the spread of misinformation in a particular network and assess the effectiveness of various countermeasures, informing decisions about which measures to invest in and how to implement them (see the sketch below). ABM could also help monitor humanitarian crises: beyond simply responding to needs, it can simulate how humanitarian needs shift across affected areas over time and help identify actions to take before a need becomes acute.
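As a toy illustration, the self-contained sketch below simulates misinformation spreading across a random contact network. The network structure, the spread probability, and the correction probability are all assumed parameters; a real ABM study would calibrate them against observed data.

```python
# Self-contained ABM sketch: misinformation spreading over a random contact
# network. All parameters are illustrative assumptions, not empirical values.
import random

random.seed(0)
N = 200
# Each agent is linked to five random contacts.
contacts = {i: random.sample([j for j in range(N) if j != i], 5) for i in range(N)}

believes = [False] * N
believes[0] = True   # a single initial spreader

SPREAD_P = 0.15      # chance a believer convinces a contact per step
DEBUNK_P = 0.05      # chance a believer encounters a correction per step

for step in range(20):
    for agent in range(N):
        if believes[agent]:
            for other in contacts[agent]:
                if random.random() < SPREAD_P:
                    believes[other] = True
            if random.random() < DEBUNK_P:
                believes[agent] = False
    print(f"step {step:2d}: {sum(believes):3d} of {N} agents believe the claim")
```

Raising DEBUNK_P models a countermeasure such as fact-checking; rerunning the simulation then shows how quickly belief dies out, which is exactly the kind of what-if comparison ABM supports.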

Another interesting area is the use of generative AI for restoration purposes. Open source images of conflict zones often have low resolution, making it difficult to make out the background or identify faces. Generative AI can be used to improve the quality of these images or to reconstruct parts of an image that have been degraded. Similarly, blurry satellite images can be sharpened to make them easier to use for research. However, such reconstruction techniques must be applied transparently to avoid misrepresenting or distorting evidence.
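One possible setup for such image enhancement, sketched below, uses the open-source diffusers library with the publicly released stable-diffusion-x4-upscaler model; the file names are placeholders, a GPU is assumed, and the transparency caveat above applies directly, since upscaled pixels are generated rather than recovered.

```python
# Generative super-resolution sketch (assumes the diffusers library, a GPU,
# and the public stable-diffusion-x4-upscaler model; file names are placeholders).
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("conflict_frame.png").convert("RGB")  # hypothetical input
result = pipe(prompt="a news photograph", image=low_res).images[0]

# NOTE: the added detail is synthesized, not recovered from the scene;
# any investigative use must disclose that the image was AI-enhanced.
result.save("conflict_frame_x4.png")
```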

Ethical and practical concerns in the use of AI in open source research

Since ChatGPT and similar tools became mainstream in 2022, many have voiced concerns about the impact of generative AI on people's critical thinking. Earlier generations of AI and machine learning largely escaped this criticism, perhaps because they were adopted more quietly, but many of the concerns about AI assistants may apply to some of the examples we have discussed so far: in both cases, researchers are outsourcing some of their analysis to computers. A key difference, however, is the level of auditability that different types of AI allow. Most of the machine learning applications described so far, while complex, are deterministic: run multiple times on the same input, they produce the same output. This is not the case with generative AI, whose unpredictability makes it a poor fit for domains where replicability, full transparency, traceability, and authenticity of provenance matter, such as open source research.

In that sense, it may be safer for researchers to approach generative AI as an experiment for now: to test its capabilities against the hype that promotes its uncritical adoption, and to lower the barriers to adoption only as the technology matures. Alternatively, non-technical researchers may benefit from using generative AI as a tool that reduces the technical difficulty of implementing earlier forms of AI and machine learning (i.e., non-generative AI).

Generative AI can also create new content (such as text or images) from existing content, which has led to deepfakes: fabricated videos or audio of known individuals designed to mislead the public into believing the content is genuine. At the same time, AI tools are being used to detect deepfakes, with ML algorithms trained to recognize signs of manipulation, for example mismatched lighting or misaligned facial and body features in videos. The same generative technology can produce text in the form of realistic-looking formal reports and documents, allowing individuals to quickly create credible-seeming misinformation and accelerating its spread.

Integrating AI into open source investigations allows investigators to process large and complex data sets, uncover hidden connections, and gain insights at unprecedented speed and scale. However, technical ability alone is not enough. Ensuring transparency, reproducibility, and ethical oversight remain fundamental to maintaining trust in AI-powered research findings. As investigators increasingly rely on algorithmic tools, combining computational power with human judgment, accountability, and understanding of context is key to maintaining the integrity and trustworthiness of open source investigations.
