Document Understanding Solution

Implementation Guide

Document Understanding Solution: Implementation Guide

Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

Home
Overview
Cost
Cost for analyzing 1000 pages
Architecture overview
Considerations
Kendra-enabled mode
Sample files and documents
Input data constraints
Regional deployment
AWS CloudFormation template
Automated deployment
Prerequisites
Deployment overview
Step 1. Launch the stack
Step 2. Access the web application
Discovery track
Compliance track
Workflow automation track
Security
IAM roles and policies
Amazon Simple Storage Service (Amazon S3)
Amazon DynamoDB
Amazon Cognito
Amazon CloudFront
Amazon Simple Queue Service
Amazon Simple Notification Service
Amazon Elasticsearch Service
Amazon Kendra
Amazon Virtual Private Cloud
AWS KMS
Additional security enhancements
Amazon CloudWatch Logs
Additional resources
Bulk processing directly to an S3 bucket
Uninstall the solution
Using the AWS Management Console
Using AWS Command Line Interface
Deleting the Amazon S3 buckets
Deleting the DynamoDB tables
Deleting the CodeCommit Repository
Source code
Contributors
Revisions
Notices

Document Understanding Solution

AWS Implementation Guide

AWS Solutions Builder Team

October 2020 (last update: April 2021)

This implementation guide discusses architectural considerations and configuration steps for deploying the Document Understanding Solution (DUS) in the Amazon Web Services (AWS) Cloud. It includes links to an AWS CloudFormation template that launches and configures the AWS services required to deploy this solution using AWS best practices for security and availability.

This guide is intended for IT infrastructure architects, developers, administrators, and DevOps professionals who have practical experience architecting in the AWS Cloud.

Overview

Today, companies use a manual process to extract data from documents and forms and transfer it to processing software, which is slow and can lead to errors. Customized optical character recognition and keyword detection software are used; however, these can result in scrambled outputs and unusable information. Furthermore, after extracting the data, and identifying and categorizing domain-specific phrases and keywords, or entities, the data is filtered and tagged, and the inventory is imported into a central location. The entire process is time consuming and expensive, relying heavily on human intervention.

The Document Understanding Solution (DUS) provides an easy-to-use web application that enables you to ingest and analyze text files, extract text from documents, identify structural data (tables, key-value pairs), extract critical information (entities), and create smart search indexes from the data. Additionally, you can upload files directly to, and access analyzed files from, an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account.

This solution uses AWS artificial intelligence (AI) services to solve business problems that apply to various industry verticals:

• Search and discovery: Search information across multiple scanned documents, PDFs, and images
• Compliance: Redact information from documents
• Workflow automation: Easily plug into your existing upstream and downstream applications

For example, organizations can use DUS to digitize and store customer feedback and request forms, and a finance department can convert invoices and balance sheets into consumable CSV files. Hospitals can use this solution to extract medical entities, for example, medical conditions, medications, and protected health information (PHI) from documents.

Redaction controls enable users to redact fields and text from the document. For example, a law firm can redact dates and references to people and locations from a document before sharing it with external parties.

Optionally, you can add Amazon Kendra support to enable machine learning-based enterprise search. Refer to Kendra-enabled mode for more information.

Cost

You are responsible for the cost of the AWS services used while running this solution. At the date of publication, the cost of running this solution with the default settings in the US East (N. Virginia) Region is approximately $2,400/month with Amazon Kendra enabled. If you choose to disable Amazon Kendra, the cost of running this solution with the default settings in the US East (N. Virginia) Region is approximately $600/month.

The Document Understanding Solution uses resources that you pay for on an hourly basis (fixed costs) or by the amount of use, as measured by the number of requests made (variable costs).

The following tables provide an example breakdown of the services that account for the majority of this solution’s costs.

Cost for analyzing 1000 pages

The following table displays sample monthly variable costs for analyzing 1,000 pages of approximately 2,500 characters each, including 500 tables and 500 forms:

Table 1: Request-based costs (variable costs)

AWS Service                  Monthly Cost
Amazon Textract              $32.50
Amazon Comprehend Medical    $250.00
Amazon Comprehend            $2.50
Total monthly variable cost  $285.00

The following table displays sample monthly fixed costs with Amazon Kendra enabled.

Table 2: Hourly billed costs (fixed costs)

AWS Service                                        Monthly Cost
Amazon Elasticsearch Service (two M5 large         $306.60
  instances, with one as the dedicated master
  instance, and 20 GB storage)
Amazon Kendra (optional)*                          $1,800.00
Total monthly fixed cost*                          $2,106.60

Note: The cost for Amazon Kendra is applicable only if Amazon Kendra is enabled (the default setting) in the solution. You can disable Amazon Kendra during setup.

Prices are subject to change. For full details, refer to AWS Pricing.
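The monthly estimates quoted in the Cost section can be reproduced from the two tables. The following is a minimal sketch that only adds up the sample figures printed in this guide; the function and variable names are illustrative and not part of the solution.

```python
from decimal import Decimal

# Sample monthly figures copied from Table 1 and Table 2 of this guide (USD).
VARIABLE_COSTS = {
    "Amazon Textract": Decimal("32.50"),
    "Amazon Comprehend Medical": Decimal("250.00"),
    "Amazon Comprehend": Decimal("2.50"),
}
FIXED_COSTS = {
    "Amazon Elasticsearch Service": Decimal("306.60"),
    "Amazon Kendra": Decimal("1800.00"),
}


def monthly_estimate(kendra_enabled: bool = True) -> Decimal:
    """Sum the sample variable and fixed costs, optionally excluding Amazon Kendra."""
    variable = sum(VARIABLE_COSTS.values(), Decimal("0"))
    fixed = sum(
        (cost for name, cost in FIXED_COSTS.items()
         if kendra_enabled or name != "Amazon Kendra"),
        Decimal("0"),
    )
    return variable + fixed


print(monthly_estimate(kendra_enabled=True))   # 2391.60
print(monthly_estimate(kendra_enabled=False))  # 591.60
```

The two totals round to the approximate $2,400/month and $600/month figures given at the start of the Cost section.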

Architecture overview

Deploying this solution with the default parameters builds the following environment in the AWS Cloud.

Figure 1: Document Understanding Solution architecture on AWS

1. The AWS CloudFormation template deploys a static web application hosted on an Amazon S3 bucket and served by an Amazon CloudFront distribution. Users are authenticated using Amazon Cognito. The web application interacts with the backend using an Amazon API Gateway API, supported by an AWS Lambda function.

2. You can upload documents either through the web application or directly to a dedicated Amazon S3 bucket for bulk processing. Document processing is initiated by the API, which triggers a Lambda function to add an entry to an Amazon DynamoDB table. This table triggers a second Lambda function that supervises the processing.

3. There are two routes for processing, and the file format of the document dictates which one is used. Image files (PNG, JPG) are processed using a synchronous route. PDF files are processed using an asynchronous route. The asynchronous route uses an Amazon Simple Notification Service (Amazon SNS) topic and an Amazon Simple Queue Service (Amazon SQS) queue to control the flow of documents awaiting processing. Both routes use Amazon Textract to extract text and structural information from the files. The extracted text is then passed to Amazon Comprehend and Amazon Comprehend Medical for further analysis.

4. The resulting information from the analyses is stored in an Amazon S3 bucket. Metadata is stored in a DynamoDB database. Extracted information is used to index the document in Amazon Elasticsearch Service (Amazon ES) and, if enabled, in Amazon Kendra.

Note: At the date of publication, other file formats are not supported and are not processed.
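The routing rule described in step 3 can be sketched as a small function. This is an illustration of the rule only, not the solution's actual Lambda code: the function name and the "sync"/"async" labels are hypothetical, and treating .jpeg the same as .jpg is an assumption based on the formats listed elsewhere in this guide.

```python
from pathlib import Path

# Extensions taken from this guide; the route labels are hypothetical names.
SYNC_EXTENSIONS = {".png", ".jpg", ".jpeg"}   # image files: synchronous route
ASYNC_EXTENSIONS = {".pdf"}                   # PDF files: asynchronous route (SNS + SQS)


def choose_route(filename: str) -> str:
    """Pick the processing route from the file extension, as described in step 3."""
    ext = Path(filename).suffix.lower()
    if ext in SYNC_EXTENSIONS:
        return "sync"
    if ext in ASYNC_EXTENSIONS:
        return "async"
    # Other file formats are not supported and are not processed.
    raise ValueError("File format not supported")


print(choose_route("invoice.PNG"))   # sync
print(choose_route("contract.pdf"))  # async
```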

The Amazon ES cluster is deployed behind an Amazon Virtual Private Cloud (Amazon VPC) network. This Amazon ES cluster and, if enabled, the Amazon Kendra index, enable you to search through documents, while the data stored in the S3 buckets and the DynamoDB tables is used to render the document information in the web application.

The data pipelines are managed using publish-subscribe patterns implemented through Amazon SNS, Amazon SQS, and AWS Lambda. Encryption is enabled for the components that interact with customer data, and is either managed by server-side encryption (SSE) or by AWS Key Management Service (AWS KMS).

Considerations

Kendra-enabled mode

By default, the Document Understanding Solution deploys Amazon Elasticsearch Service (Amazon ES). Amazon Kendra is a highly accurate and easy-to-use enterprise search service that’s powered by machine learning. Amazon Kendra supports additional search capabilities, including:

• Natural language query search
• FAQ matching
• Results ranking based on user context

Refer to the Cost section for estimated costs for using this AWS service. By default, Amazon Kendra is enabled, but you can disable it during the initial deployment of the AWS CloudFormation template, or at any time afterward by updating the stack parameter for Amazon Kendra.

Sample files and documents

By default, the Document Understanding Solution includes preloaded documents from several industry verticals for testing purposes. You can access these sample files and documents from the web application. Refer to Step 2. Access the web application for instructions.

If Amazon Kendra is enabled, sample medical documents are preloaded and analysis results are provided. Use the documents and results to experiment with the features and functionalities of this solution.

Input data constraints

At the date of publication, this solution supports the following file formats only:

• Images (PNG, JPEG) up to 5 MB
• PDF files up to 150 MB

Other file formats are not processed by this solution and result in a File format not supported error message.

This solution supports up to 100 concurrent document uploads through the web application.
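A client that uploads files in bulk may want to check these constraints before sending anything. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, the wording of the size error is invented for illustration, and 1 MB is taken as 1,048,576 bytes because this guide does not say which definition applies.

```python
import os
from typing import Dict, List, Optional, Tuple

MB = 1024 * 1024  # assumption: the guide does not define MB precisely

# Limits from the list above; .jpg is treated as JPEG.
SIZE_LIMITS = {".png": 5 * MB, ".jpg": 5 * MB, ".jpeg": 5 * MB, ".pdf": 150 * MB}
MAX_CONCURRENT_UPLOADS = 100


def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message for a rejected file, or None if it is acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    limit = SIZE_LIMITS.get(ext)
    if limit is None:
        return "File format not supported"
    if size_bytes > limit:
        # Hypothetical wording; the guide does not document the size error text.
        return f"File exceeds the {limit // MB} MB limit"
    return None


def validate_batch(files: List[Tuple[str, int]]) -> Dict[str, str]:
    """Check a batch of (filename, size) pairs and return the rejected files with reasons."""
    if len(files) > MAX_CONCURRENT_UPLOADS:
        raise ValueError(f"At most {MAX_CONCURRENT_UPLOADS} concurrent uploads are supported")
    errors = {}
    for name, size in files:
        problem = validate_upload(name, size)
        if problem is not None:
            errors[name] = problem
    return errors


print(validate_batch([("scan.png", 4 * MB), ("big.png", 6 * MB), ("notes.docx", 10)]))
```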

Regional deployment

This solution uses Amazon Cognito, Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon Kendra, which are currently available in specific AWS Regions only. Therefore, you must launch this solution in an AWS Region where these services are available. For the most current service availability by AWS Region, refer to the AWS Regional Services List.

AWS CloudFormation template

This solution uses AWS CloudFormation to automate the deployment of the Document Understanding Solution in the AWS Cloud. It includes the following CloudFormation template, which you can download before deployment:

document-understanding-solution.template: Use this template to launch the solution and all associated components. The default configuration deploys Amazon Simple Storage Service, Amazon CloudFront, Amazon Cognito, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Notification Service, Amazon Simple Queue Service, Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, Amazon Elasticsearch Service, Amazon Kendra, Amazon VPC, and AWS Key Management Service. You can customize the template to meet your specific needs.

Automated deployment

Before you launch the automated deployment, review the architecture, configuration, and security considerations discussed in this guide. Follow the step-by-step instructions in this section to configure and deploy the solution into your account.

Time to deploy: Approximately 30-60 minutes

Prerequisites

Verify that you have an administrator role in the AWS account where you plan to deploy this solution. To verify, access the AWS Identity and Access Management (IAM) console from the appropriate AWS account. For more information about IAM roles, refer to IAM roles in the IAM User Guide.

Deployment overview

The procedure for deploying this architecture on AWS consists of the following steps. For detailed instructions, follow the links for each step.

Step 1. Launch the stack

• Launch the AWS CloudFormation template into your AWS account.
• Enter a value for the required parameter: Email.
• Review the other template parameters, and adjust if necessary.

Step 2. Access the web application

• Verify the solution is completely deployed, and access the DUS.

Step 1. Launch the stack

This automated AWS CloudFormation template deploys the Document Understanding Solution in the AWS Cloud. Review the prerequisites before launching the stack.

Note: You are responsible for the cost of the AWS services used while running this solution. For more details, visit the Cost section in this guide, and refer to the pricing webpage for each AWS service used in this solution.

1. Sign in to the AWS Management Console and use the button to the right to launch the document-understanding-solution.template AWS CloudFormation template. Optionally, you can download the template as a starting point for your own implementation.

2. The template launches in the US East (N. Virginia) Region by default. To launch the solution in a different AWS Region, use the Region selector in the console navigation bar.

Note: This solution uses Amazon Cognito, Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon Kendra, which are currently available in specific AWS Regions only. Therefore, you must launch this solution in an AWS Region where these services are available. For the most current availability by Region, refer to the AWS Regional Services List.

3. On the Create stack page, verify that the correct template URL shows in the Amazon S3 URL text box and choose Next.

4. On the Specify stack details page, assign a name to your solution stack.

Note: Do not use DUSStack, DUSClientStack, or DUSCDKToolkit for your stack name. During deployment, three additional stacks are automatically created, which use these specific names.

5. Under Parameters, review the parameters for the template and modify them as necessary. This solution uses the following default values.

Parameter                   Default                                        Description
CodeCommit Repository Name  document-understanding-reference-architecture  Repository used to initiate the solution deployment.
Solution Version            V1.0.0                                         The solution version being deployed.
Email                       <Requires input>                               Email address used to create a user and to receive instructions for accessing the solution’s web application.
KendraEnabled               true                                           Flag used to determine whether to include an Amazon Kendra instance and its related resources as a part of the deployment. By default, this option is enabled. Review the Cost section for information about the cost impacts of using this AWS service.
ReadOnlyMode                false                                          Flag used to determine whether to deploy the web application in read-only mode. This mode allows only the analysis of the preloaded files; you cannot upload additional files to the solution. To disable the upload capability, set the parameter to true.

6. Choose Next.

7. On the Configure stack options page, choose Next.

8. On the Review page, review and confirm the settings. Check the box acknowledging that the template will create AWS Identity and Access Management (IAM) resources.

9. Choose Create stack to deploy the stack.

You can view the status of the stack in the AWS CloudFormation console, in the Status column. During deployment, two additional stacks are created, DUS and DUSClient. You should receive a CREATE_COMPLETE status in approximately 30-60 minutes, depending on the stack parameters selected.

Note: In addition to the primary and Amazon Kendra-related Lambda functions, this solution includes the CICD-Helper Lambda function, which runs only during initial configuration or when resources are updated or deleted. It also contains two CustomCDKBucketDeployment Lambda functions that are used by the AWS Cloud Development Kit (AWS CDK) to deploy the DUS and DUSClient stacks. When running this solution, you will see all Lambda functions listed in the AWS console, but the CICD-Helper and two CustomCDKBucketDeployment Lambda functions are not regularly active. However, do not delete these Lambda functions, as they are necessary to manage associated resources, and to delete the stack when you want to uninstall the solution.

During deployment, you may receive an email containing instructions and the Amazon CloudFront URL to access the web application. This URL will not be active until the deployment is complete. After you receive a CREATE_COMPLETE status, verify that both the DUS and DUSClient stacks also display a CREATE_COMPLETE status.
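If you script the launch instead of using the console, the same checks from Step 1 still apply. The following sketch prepares the stack name and parameters and checks the completion condition described above. It makes no AWS calls. The helper names are hypothetical, and the parameter keys Email, KendraEnabled, and ReadOnlyMode are assumed to match the names shown in the parameter table; confirm them against the template before relying on them.

```python
from typing import Dict

# Stack names this guide says must not be reused for your own stack.
RESERVED_STACK_NAMES = {"DUSStack", "DUSClientStack", "DUSCDKToolkit"}


def build_stack_request(stack_name: str, email: str,
                        kendra_enabled: bool = True,
                        read_only_mode: bool = False) -> dict:
    """Build the stack name and parameter list in the ParameterKey/ParameterValue shape."""
    if stack_name in RESERVED_STACK_NAMES:
        raise ValueError(f"{stack_name} is reserved for a stack created during deployment")
    if not email:
        raise ValueError("Email is a required parameter")
    return {
        "StackName": stack_name,
        "Parameters": [
            {"ParameterKey": "Email", "ParameterValue": email},
            # CloudFormation parameter values are strings, so booleans become "true"/"false".
            {"ParameterKey": "KendraEnabled", "ParameterValue": str(kendra_enabled).lower()},
            {"ParameterKey": "ReadOnlyMode", "ParameterValue": str(read_only_mode).lower()},
        ],
    }


def deployment_finished(statuses: Dict[str, str]) -> bool:
    """Return True only when every listed stack reports CREATE_COMPLETE."""
    return bool(statuses) and all(s == "CREATE_COMPLETE" for s in statuses.values())


request = build_stack_request("my-dus", "admin@example.com", kendra_enabled=False)
print(request["Parameters"][1])  # {'ParameterKey': 'KendraEnabled', 'ParameterValue': 'false'}
print(deployment_finished({"my-dus": "CREATE_COMPLETE", "DUS": "CREATE_IN_PROGRESS"}))  # False
```

The returned dictionary uses the StackName and Parameters argument names of the CloudFormation CreateStack API, so it could be passed on to an SDK call together with a template URL and the IAM capability acknowledgement from step 8.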

Step 2. Access the web application

From your email account, locate the Document Understanding Solution email. This email contains the Amazon CloudFront URL and your login credentials, including your username and temporary password.

Note: When you log in to the web application for the first time, you will need to create a new password.

The web application’s homepage displays three tracks. You can easily navigate between each track.

• Discovery track: Search information across multiple scanned documents, PDFs, and images

• Compliance track: Redact information from documents

• Workflow automation track: Upload batches of files directly into the bulk-processing Amazon S3 bucket

Discovery track

In the Discovery track, you can search through multiple documents and find information using traditional search-based technologies such as Amazon Elasticsearch Service (Amazon ES) and, if enabled, Amazon Kendra, a natural language service.

Figure 2: Web application home page

Use the Document list page to view the documents that you uploaded and to upload new ones. This solution includes a preloaded set of documents that are indexed by Amazon ES and, if enabled, Amazon Kendra. You can use the preloaded documents to explore the capabilities of this solution.

To add documents, follow these steps:

1. Select upload your own documents.

2. Select the documents from your source drive or use your computer’s camera to take pictures of the documents you want to upload.

Figure 3: Upload documents page

3. After the documents have been uploaded, search for them from the Document list page using the Search box.

The search results page displays one or more tabs, depending on whether Amazon Kendra is enabled. If Amazon Kendra is enabled, the search results page displays three tabs, each showing unique results: the Amazon ES results, the Amazon Kendra results, and a comparison between Amazon ES and Amazon Kendra results.


Figure 4: Amazon ES search results

Use the web application to navigate between the tracks from the document’s details page. The track menus are located in the top-right corner of the webpage. You can also return to the homepage at any point to start a new activity.

The search results for Amazon Kendra offer unique interactions and filtering capabilities:

• Feedback mechanism: You can provide feedback to the machine learning model using a voting mechanism to either upvote or downvote the search results.

• Personas-based filtering: Personas based on the healthcare industry are available to filter your search results, but are limited to the predefined queries provided in Step 2 (as shown in Figure 4). The available personas include healthcare professionals, the general public, and government officials. To access this filter, navigate to the Amazon Kendra tab, select Filter, and choose a persona.


Figure 5: Filter search results by persona

• Supports Amazon Kendra FAQs: This solution supports the Amazon Kendra capability to match FAQs.


Exploring a document

To explore a document, select the document to access a Preview mode. In this mode, you can search for text within the document and download a searchable PDF version of the document for further analysis.

Figure 6: Document exploration capabilities

From Preview mode, open the document’s Details page to view more information. There you can preview the document and search for text, forms or key-value pairs, tables, and general and industry-specific entities. You can download any or all marked information in the following formats:

• PDF – a searchable or redacted version of the document

• CSV – format for tables or forms that are downloaded

• JSON – format for entities that are downloaded


Figure 7: Download a table as a CSV file

Compliance track

The compliance track enables you to redact information from a document. For example, you can redact protected health information (PHI), specific values in key-value pairs, and specific keywords. You can redact information in each tab that is available for a document: Preview, Raw Text, Key-Value Pairs, Tables, Entities, and Medical Entities.

Figure 8: Redacting sensitive information in a document


Workflow automation track

You can input data into the solution as well as export it for other business purposes. Use the bulk processing option to load a large volume of documents for analysis. After the documents are analyzed, the results are stored in an Amazon S3 bucket, from which you can export them for further processing.


Security

When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model can reduce your operational burden as AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the services operate. For more information about security on AWS, visit AWS Cloud Security.

IAM roles and policies

AWS Identity and Access Management (IAM) roles enable you to assign granular access policies and permissions to services and users on the AWS Cloud. The solution creates IAM roles and sets permissions in the respective accounts to allow the solution to assume a defined role in the member account and extract data when necessary. All automatically created roles and policies are defined to grant only the permissions that the solution needs to function.

Amazon Simple Storage Service (Amazon S3)

Infrastructure components in the Document Understanding Solution through which user data flows are encrypted using server-side encryption (SSE). Multiple Amazon Simple Storage Service (Amazon S3) buckets are created for this solution, and they are encrypted using Amazon S3 managed keys (SSE-S3, AES-256) to secure user data.

Amazon DynamoDB

Two Amazon DynamoDB tables are created for this solution. These tables are encrypted using DynamoDB-SSE encryption to secure user metadata.

Amazon Cognito

The Amazon Cognito user created by this solution is a local user with permissions to access only the REST API and website assets for this solution. This user does not have permissions to access any other services in your AWS account. User authentication is used to help secure the DUS.

Amazon CloudFront

This solution deploys a web application hosted in an Amazon S3 bucket. To help reduce latency and improve security, this solution uses an Amazon CloudFront distribution with an origin access identity, which is a special CloudFront user that helps provide public access to the website being served from the bucket, without giving public access to the bucket itself. For more information, refer to Restricting Access to Amazon S3 Content by Using an Origin Access Identity in the Amazon CloudFront Developer Guide.


Amazon Simple Queue Service

Amazon Simple Queue Service (Amazon SQS) queues are used for queueing documents and managing communication between the different AWS Lambda functions used in this solution. These queues are encrypted using AWS Key Management Service (AWS KMS) to help secure data in transit.

Amazon Simple Notification Service

An Amazon Simple Notification Service (Amazon SNS) topic is used to pass data between the Amazon Textract service role and the various AWS Lambda functions used in this solution. The SNS topic is encrypted using an AWS KMS key to secure data in transit.

Amazon Elasticsearch Service

Amazon Elasticsearch Service (Amazon ES) enables a faster search experience. The Amazon ES cluster is provisioned in an Amazon Virtual Private Cloud (Amazon VPC), which allows access from select IP addresses based on an allow list and denies public access. For more information about VPC support, refer to VPC Support for Amazon Elasticsearch Service Domains in the Amazon Elasticsearch Service Developer Guide. The Amazon ES cluster is encrypted using an AWS KMS key, and the cluster nodes also use node-to-node encryption.

Amazon Kendra

When enabled, Amazon Kendra is deployed to provide an enhanced search experience using natural language processing. The Amazon Kendra index is encrypted using an AWS KMS key.

Amazon Virtual Private Cloud

The Amazon Virtual Private Cloud (Amazon VPC) is used to create a virtual network for the Amazon ES cluster so as to allow only specific IP addresses to access the Amazon ES cluster.

AWS KMS

This AWS service is used to encrypt the data stored in Amazon Kendra and Amazon ES. In addition, AWS KMS is used to encrypt the data in transit through Amazon SNS and Amazon SQS.

Additional security enhancements

Amazon CloudWatch Logs

You can enable encryption for the Amazon CloudWatch Logs groups created when running this solution. To learn more about encrypting the CloudWatch Logs, refer to Encrypt Log Data in CloudWatch Logs Using AWS KMS in the Amazon CloudWatch Logs User Guide.
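
As a minimal sketch of enabling log encryption from the command line, the following shell function associates an AWS KMS key with every log group whose name begins with a given prefix. The function name, the prefix, and the key ARN are placeholders chosen for illustration; they are not values defined by this solution. Check which log groups your deployment actually created, and make sure the key policy allows CloudWatch Logs to use the key, as described in the user guide referenced above.

```shell
#!/bin/sh
# Sketch: encrypt existing CloudWatch Logs log groups with an AWS KMS key.
# The prefix and key ARN passed in are placeholders, not values from the solution.

encrypt_log_groups() {
  elg_prefix="$1"
  elg_key_arn="$2"
  # List the matching log group names (tab-separated in text output),
  # then turn them into one name per line.
  aws logs describe-log-groups \
      --log-group-name-prefix "$elg_prefix" \
      --query 'logGroups[].logGroupName' \
      --output text | tr '\t' '\n' |
  while read -r elg_group; do
    [ -n "$elg_group" ] || continue
    # Associate the key with each log group; stdin is detached so the
    # command cannot consume the list being read by the loop.
    aws logs associate-kms-key \
        --log-group-name "$elg_group" \
        --kms-key-id "$elg_key_arn" </dev/null
  done
}

# Example call (hypothetical values):
#   encrypt_log_groups "/aws/lambda/DUS" "arn:aws:kms:<region>:<account-id>:key/<key-id>"
```

Encryption applies to log events ingested after the key is associated, so it is worth running this soon after deployment.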


Amazon CloudFront TLS configuration

The Amazon CloudFront distribution deployed as part of this solution is configured with the default CloudFront certificate, which only supports TLS v1. You can enable higher TLS versions by associating another certificate using AWS Certificate Manager.


Additional resources

AWS services

• AWS CloudFormation

• AWS Lambda

• Amazon Simple Storage Service

• Amazon CloudWatch

• Amazon API Gateway

• Amazon CloudFront

• Amazon Cognito

• Amazon DynamoDB

• Amazon Simple Notification Service

• Amazon Simple Queue Service

• Amazon Elasticsearch Service

• Amazon Kendra

• Amazon Textract

• Amazon Comprehend

• Amazon Comprehend Medical

• AWS KMS

• Amazon Virtual Private Cloud


Bulk processing directly to an S3 bucket

The Document Understanding Solution deploys a bulk processing Amazon Simple Storage Service (Amazon S3) bucket that you can upload documents into directly, without using the web application. The bulk processing S3 bucket is created by the DUSStack stack. Information about this S3 bucket is available from the AWS CloudFormation console, under the Resources tab.

Note: To manage your files so that they are stored cost effectively throughout their lifecycle, you can configure their Amazon S3 Lifecycle. For this solution, the Amazon S3 Lifecycle management configuration policy periodically deletes the files stored in the bucket. For information about Amazon S3 Lifecycle, refer to Object lifecycle management in the Amazon S3 Developer Guide.

Before you upload files to the Amazon S3 bucket, you must create the folder that will store the uploaded documents.

1. Sign in to the AWS CloudFormation console.

2. Select DUSStack and select the Resources tab.

3. Search for bulk and, under the Physical ID column, select the S3 bucket with the term bulk processing as part of the Physical ID name.

4. From the Amazon S3 bucket page, choose Create folder.

5. Under the Name column, enter documentDrop.

6. Choose Save.

Note: If you are using Amazon Kendra and would like to use the filtering capabilities, you must also create a policy folder in this S3 bucket.
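
The console steps above can also be approximated with the AWS CLI. The following is a sketch under stated assumptions: the helper names `find_bulk_bucket` and `create_folder` are hypothetical, and the lookup assumes that the bucket's physical ID contains the text bulk, as the console search in step 3 suggests. Amazon S3 has no real folders, so the sketch creates an empty object whose key ends in a slash, which the console displays as a folder.

```shell
#!/bin/sh
# Sketch: locate the bulk processing bucket in a stack and create folders in it.
# Helper names are illustrative; they are not part of the solution.

# Print the physical IDs of S3 buckets in the stack whose ID contains "bulk".
find_bulk_bucket() {
  aws cloudformation list-stack-resources \
      --stack-name "$1" \
      --query "StackResourceSummaries[?ResourceType=='AWS::S3::Bucket' && contains(PhysicalResourceId, 'bulk')].PhysicalResourceId" \
      --output text
}

# Create an empty object with a trailing slash in its key ("folder").
create_folder() {
  aws s3api put-object --bucket "$1" --key "$2/"
}

# Example calls (require AWS credentials and a deployed stack):
#   bucket="$(find_bulk_bucket DUSStack)"
#   create_folder "$bucket" documentDrop
#   create_folder "$bucket" policy   # only for Amazon Kendra filtering
```

If the lookup prints more than one bucket name, confirm the correct one in the Resources tab before creating folders.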

Then take the following steps to upload your files to the documentDrop folder.

1. Sign in to the Amazon S3 console.

2. From the S3 buckets page, select the bulk processing S3 bucket (the bucket name contains the following prefix: dus-bulk-processing).

3. Choose Upload to upload your documents under the documentDrop/ prefix.

If Amazon Kendra is enabled, you can also upload the corresponding access control list under the policy/ prefix in the same bucket, using the following naming convention: <document-name>.metadata.json. This access control list file is used by Amazon Kendra to filter results based on user context. Review one of the preloaded policy files in the bucket for more information. You must upload the access control policy before you upload the document.
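
To illustrate the naming convention and the required upload order, here is a small sketch. The helper names are hypothetical, and the sketch assumes that <document-name> is the file name including its extension; compare with one of the preloaded policy files to confirm this before relying on it. The contents of the policy file itself are not shown because they should be modeled on the preloaded files.

```shell
#!/bin/sh
# Sketch: upload an access control list and its document in the required order.
# Assumption: <document-name> keeps the file extension (e.g. invoice.pdf).

# Build the policy object key for a document path.
policy_key_for() {
  printf 'policy/%s.metadata.json\n' "$(basename "$1")"
}

# Upload the policy file first; upload the document only if that succeeds.
upload_with_policy() {
  uwp_bucket="$1"
  uwp_doc="$2"
  uwp_policy="$3"
  aws s3 cp "$uwp_policy" "s3://$uwp_bucket/$(policy_key_for "$uwp_doc")" &&
  aws s3 cp "$uwp_doc" "s3://$uwp_bucket/documentDrop/$(basename "$uwp_doc")"
}

# Example call (hypothetical file names):
#   upload_with_policy "<bulk-processing-bucket>" ./invoice.pdf ./invoice-acl.json
```

Chaining the two copies with `&&` keeps the document from being uploaded when the policy upload fails, which preserves the ordering requirement stated above.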


Figure 9: Bulk processing S3 bucket folders


Uninstall the solution

You can uninstall the Document Understanding Solution from the AWS Management Console or using the AWS Command Line Interface. However, you must manually delete the Amazon Simple Storage Service (Amazon S3) buckets and Amazon DynamoDB tables created by this solution.

Using the AWS Management Console

1. Sign in to the AWS CloudFormation console.

2. On the Stacks page, select the solution stack.

3. Choose Delete.

Note: The DUS and DUSClient stacks are deleted, with their associated resources. The CDK stack is not deleted since it can be reused if deploying other CDK solutions. However, you may delete this stack manually by selecting it and choosing Delete.

Using AWS Command Line Interface

Verify that the AWS Command Line Interface (AWS CLI) is available in your environment. For installation instructions, refer to What Is the AWS Command Line Interface in the AWS CLI User Guide. After confirming that the AWS CLI is available, run the following command.

$ aws cloudformation delete-stack --stack-name DocumentUnderstandingCICD
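
The delete-stack command returns as soon as deletion has started. If you script the uninstall, a sketch such as the following (the function name is an illustrative choice) waits for CloudFormation to report that deletion has finished before you go on to remove the retained resources.

```shell
#!/bin/sh
# Sketch: start a stack deletion and block until CloudFormation reports completion.

delete_stack_and_wait() {
  aws cloudformation delete-stack --stack-name "$1" &&
  aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Example call, using the stack name from the command above:
#   delete_stack_and_wait DocumentUnderstandingCICD
```

The wait command exits with a non-zero status if the stack does not reach the deleted state, so a calling script can stop instead of continuing with cleanup.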

Deleting the Amazon S3 buckets

To prevent accidental data loss, this solution is configured to retain the Amazon S3 buckets when you delete the AWS CloudFormation stack. After uninstalling the solution, you can manually delete the S3 buckets if you do not need to retain the data. Follow these steps to delete the Amazon S3 buckets.

1. Sign in to the Amazon S3 console.

2. Choose Buckets from the left navigation pane.

3. Locate the <stack-name> S3 buckets.

4. Select one of the S3 buckets and choose Delete.

Repeat the steps until you have deleted all the <stack-name> S3 buckets.

To delete the S3 buckets using AWS CLI, run the following command:

$ aws s3 rb s3://<bucket-name> --force
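
Because the command must be repeated for each bucket, a loop can help. The sketch below is illustrative only: the helper names are hypothetical, and the name fragment you pass in is an assumption about how your buckets are named. List the buckets and review the output before piping it to the delete step, because `--force` removes all objects in each bucket. For buckets with versioning enabled, `--force` does not remove object versions, so the bucket removal can fail until the versions are deleted.

```shell
#!/bin/sh
# Sketch: list buckets whose names contain a fragment, then delete them.
# Review the list before deleting; deletion cannot be undone.

# Print matching bucket names, one per line.
list_solution_buckets() {
  aws s3api list-buckets \
      --query "Buckets[?contains(Name, '$1')].Name" \
      --output text | tr '\t' '\n'
}

# Read bucket names from stdin and remove each bucket with its objects.
delete_buckets() {
  while read -r db_name; do
    [ -n "$db_name" ] || continue
    aws s3 rb "s3://$db_name" --force </dev/null
  done
}

# Example calls (hypothetical name fragment):
#   list_solution_buckets "<stack-name>"                   # review first
#   list_solution_buckets "<stack-name>" | delete_buckets  # then delete
```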


Alternatively, you can configure the AWS CloudFormation template to delete the Amazon S3 buckets automatically. Before deleting the stack, change the deletion behavior in the AWS CloudFormation DeletionPolicy attribute.

Deleting the DynamoDB tables

After uninstalling the solution, you can manually delete the DynamoDB tables. Follow these steps:

1. Sign in to the DynamoDB console.

2. Choose Tables from the left navigation pane.

3. Select the <stack-name> table and choose Delete.

To delete the DynamoDB tables using AWS CLI, run the following command.

$ aws dynamodb delete-table --table-name <table-name>

Deleting the CodeCommit repository

After uninstalling the solution, you can manually delete the CodeCommit repository. Follow these steps:

1. Sign in to the CodeCommit console.

2. Select the <repository-name> and choose Delete.

To delete the CodeCommit repository using AWS CLI, run the following command.

$ aws codecommit delete-repository --repository-name document-understanding-reference-architecture


Source code

You can visit our GitHub repository to download the templates and scripts for this solution, and to share your customizations with others.


Contributors

The following individuals contributed to this document:

• Alex Chirayath

• George Price

• Pierre Dumas

• Brenda Booth

• Shivani Mehendarge

• Curtis Bray

• Simran Baxendale

• Kashif Imran


Revisions

Date | Change

October 2020 | Initial release

January 2021 | Release version 1.0.1: Added instructions to delete the CodeCommit repository if uninstalling this solution; for more information, refer to the CHANGELOG.md file in the GitHub repository.

April 2021 | Release version 1.0.2: Minor updates and bug fixes; for more information, refer to the CHANGELOG.md file in the GitHub repository.


Notices

Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

Document Understanding Solution is licensed under the terms of the Apache License Version 2.0, available at The Apache Software Foundation.
