Increase productivity when processing scanned PDFs with Amazon Q Business

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and extract insights directly from the content of digital and scanned PDF documents in your enterprise data sources, without the need to pre-extract text.

Customers in industries such as finance, insurance, and healthcare and life sciences need to derive insights from many types of documents that are often in scanned PDF format, such as receipts, healthcare plans, and tax statements. These documents are often semi-structured or unstructured, and previously required a separate text extraction step before they could be indexed by Amazon Q Business.

Amazon Q Business now supports scanned PDF documents, enabling you to seamlessly process a wide range of multimodal document types through the AWS Management Console and APIs in all supported Amazon Q Business AWS Regions. You can use supported connectors to ingest and index documents such as scanned PDFs from your data sources, and then use the documents to answer questions, provide summaries, and generate content from your enterprise systems securely and accurately. This feature eliminates the development effort required to extract text from scanned PDF documents outside of Amazon Q Business, improving document processing pipelines for building generative artificial intelligence (AI) assistants with Amazon Q Business.

In this post, I show you how to use Amazon Q Business to asynchronously index scanned PDF documents and perform real-time queries.

Solution overview

You can use Amazon Q Business with scanned PDF documents from the console, AWS SDK, or AWS Command Line Interface (AWS CLI).

Amazon Q Business provides a versatile suite of data connectors that can integrate with a wide range of enterprise data sources, enabling you to develop generative AI solutions with minimal setup and configuration. To learn more, see the announcement that Amazon Q Business is now generally available and helps improve workforce productivity with generative AI.

Once you are ready to use the Amazon Q Business application, you can upload scanned PDFs directly to your Amazon Q Business index using the console or API. Amazon Q Business offers multiple data source connectors that allow you to consolidate and sync data from multiple data repositories into a single index. This post presents two scenarios for working with documents: one using the direct document upload option and one using the Amazon Simple Storage Service (Amazon S3) connector. If you need to ingest documents from other data sources, see Supported Connectors for more information on connecting to additional data sources.

Indexing documents

In this post, I will use three scanned PDF documents as examples (a billing invoice, a health insurance summary, and an employment verification form) and some text documents.

The first step is to index these documents. Complete the following steps to index documents using the direct upload feature of Amazon Q Business. In this example, we upload a scanned PDF.

  1. On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
  2. Choose Add data source.
  3. Choose Upload a file.
  4. Upload your scanned PDF file.

You can monitor the uploaded file on the Data sources tab. The Upload status changes from Received to Processing to Indexed or Updated, at which point the file has been successfully indexed into the Amazon Q Business data store. The following screenshot shows a successfully indexed PDF.

The indexed document in the Uploaded Files section.

The following steps show how to integrate and sync documents with Amazon Q Business using the Amazon S3 connector. In this example, we index text documents.

  1. On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
  2. Choose Add data source.
  3. Choose Amazon S3 for the connector.
  4. Enter information for Name, VPC and security group settings, IAM role, and Sync mode.
  5. To finish connecting your data source to Amazon Q Business, choose Add data source.
  6. In the Data source details section of the connector details page, choose Sync now to allow Amazon Q Business to begin syncing (crawling and ingesting) data from your data source.
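The Sync now step in the list above also has an SDK counterpart. The following is a minimal sketch using the AWS SDK for Python (Boto3), assuming the `qbusiness` client's `start_data_source_sync_job` operation and its `executionId` response field; the function name and the placeholder IDs are hypothetical and not part of the original walkthrough.

```python
def start_sync(client, application_id, index_id, data_source_id):
    """Start a sync job for one data source and return its execution ID.

    `client` is assumed to be a Boto3 Amazon Q Business client, for example
    boto3.client("qbusiness", region_name="<region>").
    """
    response = client.start_data_source_sync_job(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id,
    )
    # Assumption: the response carries the job identifier under "executionId".
    return response["executionId"]


# Example usage (requires AWS credentials; all IDs are placeholders):
#   import boto3
#   client = boto3.client("qbusiness", region_name="<region>")
#   start_sync(client, "<application-id>", "<index-id>", "<data-source-id>")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real application.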

Once the sync job is complete, the data source will be available. The following screenshot shows that all five documents (scanned and digital PDFs, and a text file) have been successfully indexed.

Amazon S3 Connector

The following screenshot shows a comprehensive view of two data sources: documents uploaded directly and documents ingested via the Amazon S3 connector.

Amazon Q Business data sources.

Now let’s run some queries against the data source using Amazon Q Business.

Querying dense, unstructured, scanned PDF documents

Your documents might be dense, unstructured scanned PDFs. Amazon Q Business can identify and extract the most salient, information-dense text from them. In this example, we use a multi-page health insurance plan summary PDF that we indexed earlier. The following screenshot shows a sample page.

Health plan summary document.

This is an example of a health plan summary document.

In the Amazon Q Business web UI, we ask, “What are the annual out-of-pocket limits listed on my health insurance plan summary?”

Amazon Q Business searches indexed documents, retrieves relevant information, and generates answers while citing sources of that information. The following screenshot shows a sample output.
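The same question-and-citation flow can be sketched programmatically. The example below is a rough illustration with the AWS SDK for Python (Boto3), assuming the `qbusiness` client's `chat_sync` operation and the `systemMessage` and `sourceAttributions` response fields; the helper names are hypothetical, and depending on how your application manages user identity, the call may need additional identity-related parameters or credentials.

```python
def ask(client, application_id, question):
    """Send one question to the application and return the raw response."""
    return client.chat_sync(applicationId=application_id, userMessage=question)


def format_answer(response):
    """Render the generated answer followed by its cited sources.

    Assumption: the answer text is under "systemMessage" and each citation
    in "sourceAttributions" has "citationNumber" and "title" fields.
    """
    lines = [response.get("systemMessage", "")]
    for source in response.get("sourceAttributions", []):
        number = source.get("citationNumber")
        title = source.get("title", "untitled")
        lines.append(f"[{number}] {title}")
    return "\n".join(lines)


# Example usage (requires AWS credentials; the ID is a placeholder):
#   import boto3
#   client = boto3.client("qbusiness", region_name="<region>")
#   print(format_answer(ask(client, "<application-id>",
#         "What are the annual out-of-pocket limits listed on my "
#         "health insurance plan summary?")))
```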

Amazon Q Business Output

Querying structured, tabular, scanned PDF documents

Your documents may also contain structured data elements in a tabular format. Amazon Q Business can automatically identify, extract, and linearize structured data from scanned PDFs to accurately resolve user queries. In the following example, we use the invoice PDF that we indexed earlier. The following screenshot shows an example.


This is an example of an invoice.

In the Amazon Q Business web UI, we ask, “How much are the headphones charged on my invoice?”

Amazon Q Business searches the indexed documents and references the source document to get the answer. The following screenshot shows that Amazon Q Business can extract billing information from an invoice.

Amazon Q Business Output

Querying semi-structured forms

Your documents might also contain semi-structured data elements in a form, such as key-value pairs. Amazon Q Business can accurately answer queries related to these data elements by extracting the specific fields or attributes that are relevant to the query. In this example, we use an employment verification PDF. The following screenshot shows an example.

Sample employment certificate

This is an example of an employment verification document.

In the Amazon Q Business web UI, when you ask, “What are the applicant's employment dates as stated on the employment certificate?”, Amazon Q Business will search the indexed employment certificates and reference the source document to get the answer.

Amazon Q Business Output

Indexing documents using the AWS CLI

This section shows you how to use the AWS CLI to ingest structured and unstructured documents stored in an S3 bucket into an Amazon Q Business index. You can quickly get detailed information about your documents, such as their status and any errors that occurred during indexing. If you are an existing Amazon Q Business user and have indexed documents in various formats, such as scanned PDFs or other supported types, and you want to reindex your scanned documents, follow these steps:

  1. Check the status of each document and filter for failed documents with the status "DOCUMENT_FAILED_TO_INDEX". You can further filter the documents based on the following error message:

"errorMessage": "Document cannot be indexed since it contains no text to index and search on. Document must contain some text."

If you are a new user and have not indexed any documents, you can skip this step.

Below is an example of using the ListDocuments API to filter documents with a specific status and their error message.

aws qbusiness list-documents --region <region> \
--application-id <application-id> \
--index-id <index-id> \
--query "documentDetailList[?status=='DOCUMENT_FAILED_TO_INDEX'].{DocumentId:documentId, ErrorMessage:error.errorMessage}" \
--output json

The following screenshot shows the AWS CLI output, including a list of failed documents with error messages.

List of rejected documents
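The status filter used in the ListDocuments command can also be sketched with the AWS SDK for Python (Boto3). This is an illustration under assumptions: it relies on the `qbusiness` client's `list_documents` operation with `nextToken` pagination, and the helper names `iter_documents` and `failed_documents` are hypothetical.

```python
def iter_documents(client, application_id, index_id):
    """Yield every document detail in the index, following nextToken pages."""
    kwargs = {"applicationId": application_id, "indexId": index_id}
    while True:
        page = client.list_documents(**kwargs)
        yield from page.get("documentDetailList", [])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


def failed_documents(documents):
    """Keep only documents that failed to index, with their error message."""
    return [
        {
            "DocumentId": doc.get("documentId"),
            "ErrorMessage": (doc.get("error") or {}).get("errorMessage"),
        }
        for doc in documents
        if doc.get("status") == "DOCUMENT_FAILED_TO_INDEX"
    ]


# Example usage (requires AWS credentials; IDs are placeholders):
#   import boto3
#   client = boto3.client("qbusiness", region_name="<region>")
#   docs = iter_documents(client, "<application-id>", "<index-id>")
#   for item in failed_documents(docs):
#       print(item)
```

Unlike the single CLI call, this version walks through all result pages, which matters once an index holds more documents than one response returns.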

Next, process the documents in batches. Amazon Q Business supports adding one or more documents to an Amazon Q Business index.

  2. Use the BatchPutDocument API to ingest multiple scanned documents stored in an S3 bucket into an index.
    aws qbusiness batch-put-document --region <region> \
    --documents '[{ "id":"s3://<your-bucket-path>/<scanned-pdf-document1>","content":{"s3":{"bucket":"<your-bucket>","key":"<scanned-pdf-document1>"}}}, { "id":"s3://<your-bucket-path>/<scanned-pdf-document2>","content":{"s3":{"bucket":"<your-bucket>","key":"<scanned-pdf-document2>"}}}]' \
    --application-id <application-id> \
    --index-id <index-id> \
    --endpoint-url <application-endpoint-url> \
    --role-arn <role-arn>

The following screenshot shows the output from the AWS CLI, with the failed documents displayed as an empty list.

List of rejected documents
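When there are many scanned PDFs to reindex, writing the `--documents` JSON by hand becomes error-prone. The following sketch builds the same payload shape in Python and submits it in batches through the Boto3 `qbusiness` client's `batch_put_document` operation. The helper names are hypothetical, and the default batch size of 10 is an assumption about the per-call limit that you should confirm in the API reference.

```python
def build_documents(bucket, keys):
    """Build document entries in the same shape as the CLI --documents JSON."""
    return [
        {
            "id": f"s3://{bucket}/{key}",
            "content": {"s3": {"bucket": bucket, "key": key}},
        }
        for key in keys
    ]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def reindex(client, application_id, index_id, role_arn, bucket, keys,
            batch_size=10):
    """Submit documents in batches and collect any reported failures.

    Assumption: batch_size=10 matches the per-call document limit.
    """
    failed = []
    for batch in chunked(build_documents(bucket, keys), batch_size):
        response = client.batch_put_document(
            applicationId=application_id,
            indexId=index_id,
            roleArn=role_arn,
            documents=batch,
        )
        failed.extend(response.get("failedDocuments", []))
    return failed
```

An empty list returned by `reindex` corresponds to the empty failed documents list shown in the CLI output.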

  3. Finally, use the ListDocuments API again to check whether all the documents were indexed properly.
    aws qbusiness list-documents --region <region> \
    --application-id <application-id> \
    --index-id <index-id> \
    --endpoint-url <application-endpoint-url>

The following screenshot shows that the documents are indexed in the data source.

A list of indexed documents
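For a large index, a per-status tally is often easier to read than the full listing. The sketch below summarizes a list of document details, such as the `documentDetailList` entries returned by ListDocuments. The status names `INDEXED` and `UPDATED` are assumed to be the success states, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: these two statuses mean the document is searchable.
SUCCESS_STATUSES = {"INDEXED", "UPDATED"}


def status_counts(documents):
    """Count documents per status value."""
    return Counter(doc.get("status", "UNKNOWN") for doc in documents)


def all_indexed(documents):
    """Return True only if there is at least one document and all succeeded."""
    return bool(documents) and all(
        doc.get("status") in SUCCESS_STATUSES for doc in documents
    )


# Example with hand-written sample data (not real API output):
sample = [
    {"documentId": "s3://bucket/invoice.pdf", "status": "INDEXED"},
    {"documentId": "s3://bucket/plan.pdf", "status": "UPDATED"},
]
print(dict(status_counts(sample)))
print(all_indexed(sample))
```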


Clean up

If you created a new Amazon Q Business application and no longer plan to use it, unsubscribe from the application, remove any assigned users, and delete it so that you don't accrue costs in your AWS account. If you no longer need an indexed data source, see Managing Amazon Q Business Data Sources for instructions on deleting it.


Conclusion

In this post, I discussed Amazon Q Business support for scanned PDF document types. I highlighted the steps to sync, index, and query supported document types, which now include scanned PDF documents, using generative AI with Amazon Q Business. I also provided examples of queries against structured, unstructured, and semi-structured multimodal scanned documents using the Amazon Q Business web UI and AWS CLI.

To learn more about this feature, see Document Formats Supported by Amazon Q Business. Try it out today in the Amazon Q Business console. To learn more, see Amazon Q Business and the Amazon Q Business User Guide. You can submit feedback through AWS re:Post for Amazon Q or through your usual AWS Support contacts.

About the Authors

Sonali Sahu leads the Generative AI Specialist Solutions Architecture team at AWS. She is an author, thought leader, and passionate technologist. Her primary focus is AI and ML, and she is a frequent speaker at AI and ML conferences and meetups around the world. She has broad and deep experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Chinmayi Lane is a Generative AI Specialist Solutions Architect at AWS with a passion for applied mathematics and machine learning, focusing on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, Chinmayi enjoys dancing salsa and bachata.

Himesh Kumar is an experienced Senior Software Engineer currently working in Amazon Q Business at AWS. He is passionate about building distributed systems in the generative AI/ML space. His expertise extends to developing scalable and efficient systems ensuring high availability, performance and reliability. Beyond his technical skills, his commitment to continuous learning keeps him at the forefront of technological advancements in AI and Machine Learning.

Chin Wei He is a Senior Software Developer on the Amazon Q Business team at AWS and is passionate about building modern applications using AWS technologies. He loves community-driven learning and technology sharing, especially on machine learning hosting and inference related topics. Currently, his primary focus is building a serverless and event-driven architecture for RAG data ingestion.
