Organizations in media and entertainment, advertising, social media, education, and other sectors need efficient solutions to extract information from videos and apply flexible evaluation based on policies. Generative artificial intelligence (AI) has brought new opportunities for these use cases. In this post, we introduce the Media Analysis and Policy Evaluation solution, which provides a framework to streamline the process of video extraction and evaluation using AWS AI and generative AI services.
Popular use cases
Advertising technology companies own video content, including ad creatives. Brand safety, regulatory compliance, and engaging content are top priorities for their video analytics. This solution, powered by AWS AI and generative AI services, meets these needs. Advanced content moderation makes sure ads appear alongside safe and compliant content, building trust with consumers. You can use the solution to evaluate videos against content compliance policies, and you can create compelling headlines and summaries to improve user engagement and ad performance.
Educational technology companies manage large inventories of training videos. An efficient way to analyze these videos helps them evaluate content against industry policies, index and search videos efficiently, and perform dynamic discovery and editing tasks such as blurring student faces in Zoom recordings.
The solution is available in a GitHub repository and can be deployed into your AWS account using the AWS Cloud Development Kit (AWS CDK) package.
Solution overview
- Media Extraction – Once a video is uploaded, the solution begins preprocessing by extracting image frames from the video. Each frame is analyzed using Amazon Rekognition and Amazon Bedrock for metadata extraction (see the frame analysis sketch after this list). At the same time, the system uses Amazon Transcribe to extract audio transcriptions from the uploaded content.
- Policy Evaluation – The system uses the metadata extracted from the video to run an LLM-based evaluation, allowing you to use the flexibility of large language models (LLMs) to evaluate videos against dynamic policies.
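As a rough illustration of the per-frame analysis described above (not the solution's actual code), the following sketch analyzes a single sampled frame with Amazon Rekognition label and moderation detection and summarizes it with Anthropic Claude 3 Sonnet on Amazon Bedrock. The prompt wording and model ID are illustrative assumptions.

```python
import base64
import json

import boto3

rekognition = boto3.client("rekognition")
bedrock = boto3.client("bedrock-runtime")


def analyze_frame(jpeg_bytes: bytes) -> dict:
    """Extract labels, moderation labels, and an LLM summary for one frame."""
    labels = rekognition.detect_labels(Image={"Bytes": jpeg_bytes}, MaxLabels=20)
    moderation = rekognition.detect_moderation_labels(Image={"Bytes": jpeg_bytes})

    # Ask Claude 3 Sonnet for a short factual summary of the frame.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg",
                                             "data": base64.b64encode(jpeg_bytes).decode()}},
                {"type": "text", "text": "Describe this video frame in one or two factual sentences."},
            ],
        }],
    })
    summary = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)

    return {
        "labels": [label["Name"] for label in labels["Labels"]],
        "moderation_labels": [m["Name"] for m in moderation["ModerationLabels"]],
        "summary": json.loads(summary["body"].read())["content"][0]["text"],
    }
```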
The following diagram illustrates the solution workflow and architecture.

The solution employs microservices design principles, with loosely coupled components that can be deployed together for video analytics and policy evaluation workflows, or deployed separately and integrated into existing pipelines. The following diagram illustrates the microservices architecture:

The microservice workflow consists of the following steps:
- Users access a static website frontend through an Amazon CloudFront distribution, with static content hosted on Amazon Simple Storage Service (Amazon S3).
- Users log in to the front-end web application and are authenticated by an Amazon Cognito user pool.
- Users upload videos to Amazon S3 directly from their browser using multipart pre-signed Amazon S3 URLs (see the upload sketch after this list).
- The front-end UI interacts with the extraction microservice through a RESTful interface provided by Amazon API Gateway, which offers create, read, update, and delete (CRUD) functionality for managing video extraction tasks.
- An AWS Step Functions state machine orchestrates the analysis process, using Amazon Transcribe to transcribe the audio, moviepy to sample image frames from the video, Anthropic Claude 3 Sonnet to summarize each image, and Amazon Titan models to generate text and multimodal embeddings at the frame level.
- An Amazon OpenSearch Service cluster stores the extracted video metadata and serves user search and discovery needs. The UI creates evaluation prompts and submits them to an LLM on Amazon Bedrock to retrieve the evaluation results synchronously.
- Using the solution UI, users can select and customize existing template prompts to initiate policy evaluations powered by Amazon Bedrock. The solution executes the evaluation workflow and displays the results to the user.
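As referenced in the upload step above, here is a minimal sketch of how a backend could pre-sign a multipart Amazon S3 upload with boto3. The bucket name, object key, part count, and expiry are illustrative assumptions, not the solution's actual code.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "media-analysis-uploads", "videos/sample.mp4"  # hypothetical names

# Start a multipart upload, then pre-sign one URL per part so the browser can
# PUT the video bytes directly to Amazon S3 without routing them through the backend.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key, ContentType="video/mp4")
part_urls = [
    s3.generate_presigned_url(
        "upload_part",
        Params={
            "Bucket": bucket,
            "Key": key,
            "UploadId": upload["UploadId"],
            "PartNumber": part_number,
        },
        ExpiresIn=3600,
    )
    for part_number in range(1, 4)  # assume the client splits the file into 3 parts
]

# After the browser uploads each part and reports the returned ETags, the backend
# finalizes the object with s3.complete_multipart_upload(...).
```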
The following sections provide more information about the solution's main components and microservices.
Website UI
The solution features a website where users can browse videos and manage the upload process through an easy-to-use interface. It provides details on extracted video information and includes a lightweight analytics UI for dynamic LLM analysis. The following screenshots show some examples.

Extracting information from videos
The solution includes a back-end extraction service that asynchronously manages the extraction of video metadata. This covers both the visual and audio components, including object, scene, and text detection and human face identification. The audio component is especially important for videos that contain active narration or dialogue, because it often carries valuable information.
Building a robust solution for extracting information from videos is challenging from both a machine learning (ML) and engineering perspective. From an ML perspective, our goal is to enable generic extraction of information that serves as factual data for downstream analysis. On the engineering side, significant effort is required to manage video sampling with concurrency, provide high availability and flexible configuration options, and ensure an extensible architecture that supports additional ML model plugins.
The extraction service uses Amazon Transcribe to convert the audio portion of a video into text in the form of subtitles. There are several main techniques involved in visual extraction:
- Frame sampling – Traditional methods of analyzing the visual aspects of videos use sampling techniques, which involve capturing screenshots at specific intervals and applying ML models to extract information from each image frame. Our solution supports two sampling modes (see the sketch after this list):
- A fixed sampling rate with a configurable interval.
- An advanced smart sampling option that uses the Amazon Titan Multimodal Embeddings model to run a similarity search on frames sampled from the same video, identifying near-duplicate images and discarding redundant ones to optimize performance and cost.
- Extracting information from image frames – The solution iterates over the images sampled from a video and processes them concurrently. For each image, it applies the following ML features to extract information: Amazon Rekognition label, moderation, text, and face detection; Anthropic Claude 3 Sonnet image summarization; and Amazon Titan text and multimodal embeddings.
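The following is a minimal sketch of the two sampling modes under stated assumptions: frames are saved as JPEGs with moviepy at a fixed interval, and near-duplicate frames are then discarded using cosine similarity over Amazon Titan Multimodal Embeddings. The model ID, interval, and similarity threshold are illustrative assumptions rather than the solution's actual configuration.

```python
import base64
import json
import os

import boto3
import numpy as np
from moviepy.editor import VideoFileClip

bedrock = boto3.client("bedrock-runtime")


def sample_frames(video_path: str, interval_s: float = 1.0, out_dir: str = "frames") -> list[str]:
    """Fixed-rate sampling: save one JPEG every interval_s seconds."""
    os.makedirs(out_dir, exist_ok=True)
    clip = VideoFileClip(video_path)
    paths, t = [], 0.0
    while t < clip.duration:
        path = os.path.join(out_dir, f"frame_{int(t * 1000):08d}.jpg")
        clip.save_frame(path, t=t)
        paths.append(path)
        t += interval_s
    clip.close()
    return paths


def titan_image_embedding(path: str) -> np.ndarray:
    """Embed one frame with Amazon Titan Multimodal Embeddings."""
    with open(path, "rb") as f:
        body = json.dumps({"inputImage": base64.b64encode(f.read()).decode()})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
    return np.array(json.loads(resp["body"].read())["embedding"])


def smart_sample(frame_paths: list[str], threshold: float = 0.9) -> list[str]:
    """Smart sampling: keep a frame only if it differs enough from the last kept frame."""
    kept, last = [], None
    for path in frame_paths:
        emb = titan_image_embedding(path)
        similar = last is not None and float(
            np.dot(emb, last) / (np.linalg.norm(emb) * np.linalg.norm(last))
        ) >= threshold
        if not similar:
            kept.append(path)
            last = emb
    return kept
```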
The following diagram illustrates how the Extraction Service is implemented:

The extraction service uses Amazon Simple Queue Service (Amazon SQS) and Step Functions to manage concurrent video processing and allows for configurable settings. Based on your account's service quota limits and performance requirements, you can specify how many videos can be processed in parallel and how many frames per video can be processed simultaneously.
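As a rough sketch of how the per-video frame concurrency could be capped, the following hypothetical AWS CDK (Python) snippet uses a Step Functions Map state with max_concurrency. The construct names and the placeholder Pass state are illustrative, not the solution's actual state machine, and the exact Map API can vary by CDK version.

```python
from aws_cdk import Stack
from aws_cdk import aws_stepfunctions as sfn
from constructs import Construct


class ExtractionStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder for the per-frame work; in the real workflow this would be
        # a task that calls Amazon Rekognition and Amazon Bedrock for one frame.
        analyze_frame = sfn.Pass(self, "AnalyzeFrame")

        # Fan out over the sampled frames; max_concurrency caps how many frames
        # are analyzed in parallel, which you would tune to your service quotas.
        process_frames = sfn.Map(
            self,
            "ProcessFrames",
            items_path="$.frames",
            max_concurrency=10,
        )
        process_frames.item_processor(analyze_frame)

        sfn.StateMachine(
            self,
            "ExtractionStateMachine",
            definition_body=sfn.DefinitionBody.from_chainable(process_frames),
        )
```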
Search for videos
Efficiently identifying videos in your inventory is a priority, and effective search capabilities are critical for video analytics tasks. Traditional video search methods rely on full-text keyword search. With the introduction of text embeddings and multimodal embeddings, new search methods based on semantics and images have emerged.
The solution provides search capabilities through the extraction service and exposes them as a UI feature. To support video search, the extraction process generates vector embeddings at the image frame level. Videos and their underlying frames can be searched directly through the built-in web UI or through a RESTful API. There are three search options to choose from:
- Full-text search – Powered by OpenSearch Service, this option uses a search index generated by a text analyzer, making it ideal for keyword search.
- Semantic search – Powered by the Amazon Titan Text Embeddings model, this option uses embeddings generated from the transcription and the image metadata extracted at the frame level.
- Image search – Powered by the Amazon Titan Multimodal Embeddings model, this option uses embeddings generated from the image frames together with the same text metadata used for the text embeddings. It is well suited for image search, where you provide an image and find visually similar frames in a video (see the sketch after this list).
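To make the image search option concrete, the following hedged sketch embeds a query image with Amazon Titan Multimodal Embeddings and runs a k-NN query against OpenSearch Service. The domain endpoint, index name, and knn_vector field name are assumptions, not the solution's actual schema.

```python
import base64
import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
# Hypothetical domain endpoint; in practice you would also configure authentication.
client = OpenSearch(
    hosts=[{"host": "my-opensearch-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)


def search_frames_by_image(image_path: str, index: str = "video-frames", k: int = 5) -> dict:
    """Embed the query image and return the k most similar indexed frames."""
    with open(image_path, "rb") as f:
        body = json.dumps({"inputImage": base64.b64encode(f.read()).decode()})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
    query_vector = json.loads(resp["body"].read())["embedding"]

    # "mm_embedding" is an assumed knn_vector field on the frame documents.
    query = {"size": k, "query": {"knn": {"mm_embedding": {"vector": query_vector, "k": k}}}}
    return client.search(index=index, body=query)
```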
The following UI screenshot shows how to use multimodal embeddings to search for videos containing the AWS logo. The web UI displays three videos that have frames with high similarity scores compared to the provided AWS logo image. The drop-down menu also provides two other text search options, giving you the flexibility to switch between search options.

Analyze the video
After gathering rich insights from the videos, you can analyze the data. The solution includes a lightweight analytics UI, implemented as part of the static React web application and backed by a microservice called the Rating Service. This service acts as a proxy in front of the Amazon Bedrock LLMs and returns evaluation results in real time. You can use it as a sandbox to test LLM prompts for dynamic video analysis. The web UI includes several sample prompt templates that demonstrate how to analyze videos for different use cases, such as:
- Content Moderation – Report unsafe scenes, text, or comments that violate your trust and safety policy
- Video Summary – Summarize your video with a concise description based on audio and visual content cues
- IAB Classification – Organize your video content into IAB advertising categories for easy understanding
You can also choose from a collection of LLMs provided by Amazon Bedrock and compare the evaluation results to find the model that best suits your workload. The LLM runs the analysis based on your instructions and the extracted data, making this a flexible and extensible analytics tool that can support a variety of use cases. The following are example prompt templates for video analytics. Placeholders wrapped in #### are replaced with the corresponding data extracted from the video at runtime.
The first example shows how to moderate a video based on the audio transcription and the object and moderation labels detected by Amazon Rekognition. This sample contains a basic inline policy, which you can expand with additional rules. You can also use Amazon Bedrock Knowledge Bases to integrate longer trust and safety policy documents and runbooks through Retrieval Augmented Generation (RAG) patterns.
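The actual templates ship with the solution's web UI; the following is only an illustrative sketch in the same spirit, with a hypothetical inline policy, hypothetical #### placeholder names, and a call to Anthropic Claude 3 Sonnet on Amazon Bedrock.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical moderation prompt template; placeholders in #### are filled at runtime.
MODERATION_PROMPT = """You are reviewing a video for trust and safety violations.

Policy (basic inline example rules; extend as needed):
- No depictions of violence or weapons.
- No hate speech or harassment in speech or on-screen text.

Audio transcription:
####transcription####

Labels detected by Amazon Rekognition:
####rekognition_labels####

Report any scene, text, or comment that violates the policy, citing the evidence."""


def evaluate(transcription: str, labels: str) -> str:
    """Fill the template and run the evaluation synchronously on Amazon Bedrock."""
    prompt = (MODERATION_PROMPT
              .replace("####transcription####", transcription)
              .replace("####rekognition_labels####", labels))
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    })
    resp = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
    return json.loads(resp["body"].read())["content"][0]["text"]
```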
Before generative AI became widespread, classifying videos into IAB categories was difficult. It typically required custom-trained text and image classification ML models, which often suffered from accuracy issues. The following sample prompt uses the Anthropic Claude 3 Sonnet model on Amazon Bedrock, which has built-in knowledge of the IAB taxonomy, so you don't even need to include the taxonomy definition as part of the LLM prompt.
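Again as a hypothetical sketch rather than the solution's shipped template, an IAB classification prompt can stay short because the model already knows the taxonomy; it can be sent to the model with the same invoke_model call shown in the previous sketch.

```python
# Hypothetical template; placeholders in #### are filled with extracted data at runtime.
IAB_PROMPT = """Classify the video below into the single most appropriate IAB content category.

Audio transcription:
####transcription####

Frame-level image summaries:
####frame_summaries####

Answer with the IAB category name and a one-sentence justification."""
```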
Summary
Video analytics presents difficult technical challenges in both ML and engineering. This solution provides a user-friendly UI to streamline the process of video analysis and policy evaluation. Its back-end components serve as building blocks you can integrate into your existing analytical workflows, allowing you to focus on the analytical tasks with the highest business impact.
You can deploy the solution into your AWS account using the AWS CDK package available in the GitHub repository. For deployment details, see the step-by-step instructions.
About the Authors

Lana Zhang is a Senior Solutions Architect on the AI Services team in the AWS Worldwide Specialist Organization, specializing in AI and generative AI with a focus on use cases such as content moderation and media analysis. She is dedicated to promoting AWS AI and generative AI solutions and demonstrating how generative AI can transform traditional use cases into advanced business value. She helps customers transform their business solutions across industries, including social media, gaming, ecommerce, media, advertising, and marketing.
Negin Rouhanizadeh is a Solutions Architect at AWS focusing on AI/ML in advertising and marketing. Apart from building solutions for customers, Negin enjoys drawing, coding, and spending time with her family and her dogs, Simba and Huchi.

