Flight Plan: Automated Document Classification and Discovery

This BoosterPack was created and authored by: FormKiq

DAIR BoosterPacks are free, curated packages of cloud-based tools and resources about a specific emerging technology, created by experienced Canadian businesses that have built products or services using that technology and are willing to share their expertise.

Ready for takeoff?

Here’s what you’ll find in this Flight Plan

Overview

What is Automated Document Classification and Discovery?

An Automated Document Classification and Discovery application is a type of software that uses Natural Language Processing (NLP), artificial intelligence (AI), and machine learning (ML) algorithms to automatically categorize and tag documents based on their content, allowing users to quickly search and retrieve documents (i.e., discovery) based on generated keywords or document tags.

The main tools used in this BoosterPack are Micronaut Framework, Apache Kafka, Tesseract Optical Character Recognition (OCR), PyTorch, and Elasticsearch. The entire solution is deployed across two servers: one server runs the Elasticsearch service, and the other runs the API and UI components.

What value will it add to my business?

Automated Document Classification and Discovery applications can help organizations save time and reduce costs by automating the manual process of organizing and categorizing documents and by making it easier for users to find relevant information quickly.

This solution:

  • is fully open-source – there are no licensing costs to use this BoosterPack.
  • is cloud-agnostic – it can be deployed on various cloud platforms.
  • implements a microservices-based architecture – the entire application is a collection of loosely coupled services, with each service being independently deployable, highly maintainable, and testable.
  • is flexible – it can easily be customized for various business cases.

Why choose Automated Document Classification and Discovery over the alternatives?

Traditionally, organizations have had to build custom solutions, or purchase expensive document management products and customize them to meet their needs. At the time this BoosterPack was written, those alternatives still relied on manual steps for a significant part of the document classification workflow.

This BoosterPack follows a generalized approach that suits most common document management needs and helps organizations get started quickly, using this AI-based solution as a foundation that reduces manual steps. Because the solution leverages open-source tools, organizations have complete control to customize it for their specific business use case.

Best Practices

HTTPS Encryption

HTTPS encryption should be used because it provides a secure and encrypted connection between the web server and the client’s web browser, which ensures the confidentiality and integrity of data exchanged between them. Let’s Encrypt is a free, automated, and open certificate authority allowing users to easily add certificates to websites.

Microservice Architecture

Designing applications using a microservice architecture has emerged as a best practice for several reasons:

  • Scalability: services can be scaled independently of each other, with fewer constraints on how far each one can scale up or down.
  • Resilience: each service can be designed to handle failures independently; this means that even if one service fails, the overall system will continue to function.
  • Agility: services can be developed and deployed independently of each other, meaning that updates and changes can be made quickly and with minimal impact on the overall system.
  • Flexibility: services can be developed using different technologies and languages, improving developer efficiency and innovation by allowing the choice of the best technology for each service.
  • Scalability in Team Structure: each service can be developed and maintained by a small, cross-functional team, providing independence, and allowing focus on specific areas of expertise while still enabling cross-team collaboration.

Backup Strategy

A backup strategy is crucial for ensuring your data’s availability, integrity, and security. Backups help you recover from disasters, maintain business continuity, comply with regulations, and protect your data from various threats.

The Automated Document Classification and Discovery application stores data in both the local filesystem and Elasticsearch, so the backup strategy must cover both data stores.
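As a rough illustration of covering both data stores, the sketch below archives a document directory into a compressed tar file and builds the two HTTP requests that the Elasticsearch snapshot API expects for registering a shared-filesystem repository and taking a snapshot. The function names, directory paths, repository name, and snapshot name are all hypothetical, and the requests are only constructed, not sent. A filesystem repository also assumes that the chosen location has been allowed through the `path.repo` setting on the Elasticsearch server.

```python
import json
import tarfile
from pathlib import Path
from typing import List, Tuple


def archive_documents(source_dir: str, archive_path: str) -> List[str]:
    """Write every file under source_dir into a gzip-compressed tar archive.

    Returns the archived paths, relative to source_dir.
    """
    source = Path(source_dir)
    files = sorted(p for p in source.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            # Relative names let the archive be restored under any directory.
            tar.add(path, arcname=path.relative_to(source).as_posix())
    return [p.relative_to(source).as_posix() for p in files]


def snapshot_requests(
    base_url: str, repository: str, location: str, snapshot: str
) -> List[Tuple[str, str, str]]:
    """Build (method, url, body) tuples for the Elasticsearch snapshot API.

    The first request registers an "fs" repository; the second takes a
    snapshot and waits for it to complete. Nothing is sent here.
    """
    base = base_url.rstrip("/")
    register_body = json.dumps({"type": "fs", "settings": {"location": location}})
    return [
        ("PUT", f"{base}/_snapshot/{repository}", register_body),
        ("PUT", f"{base}/_snapshot/{repository}/{snapshot}?wait_for_completion=true", ""),
    ]
```

Keeping the request construction separate from the transport makes it easy to review exactly what would be sent before wiring it to an HTTP client or a scheduled job.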

Tips and Traps

Generating an SSL Certificate with Let’s Encrypt

Let’s Encrypt requires a domain name to generate an SSL certificate. nip.io, a free wildcard DNS service that maps hostnames containing an IP address to that address, is a convenient option; however, Let’s Encrypt applies rate limits per domain and per IP address. If a rate limit is exceeded, the error message states when the limit will expire.
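For illustration, a nip.io hostname can be derived directly from a server's IPv4 address. The helper below is a minimal sketch; the function name and the example address (taken from a range reserved for documentation) are assumptions rather than part of the BoosterPack. The resulting hostname is what would be supplied to a Let’s Encrypt client when requesting the certificate.

```python
import ipaddress


def nip_io_hostname(ip: str, prefix: str = "") -> str:
    """Return a nip.io hostname that resolves to the given IPv4 address.

    An optional prefix (for example "docs") becomes a subdomain label.
    """
    # IPv4Address raises ValueError for malformed addresses.
    address = ipaddress.IPv4Address(ip)
    host = f"{address}.nip.io"
    return f"{prefix}.{host}" if prefix else host


if __name__ == "__main__":
    print(nip_io_hostname("203.0.113.10"))
    print(nip_io_hostname("203.0.113.10", prefix="docs"))
```

Because every certificate request counts toward the rate limits, it can help to settle on one hostname per server and reuse it rather than generating new ones during testing.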

Elasticsearch & Java API Client

When using Elasticsearch 7.17 or later, you should use the new Java API Client instead of the previous Java REST Client. While the Java REST Client can work, specific configuration changes need to be made to the Elasticsearch server to enable it.

Machine Learning – huggingface.co

Hugging Face is a website that provides a wide range of NLP tools and resources, including pre-trained models, datasets, and libraries. It is a great starting point to help developers and researchers build and train NLP models more efficiently and effectively.
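As one example of what a pre-trained model from Hugging Face makes possible, the sketch below tags a piece of text using the zero-shot classification pipeline from the `transformers` library. This is an assumption for illustration only: the model name, the candidate labels, the threshold, and the helper names are hypothetical, and the BoosterPack's own model and approach may differ. The `transformers` import is deferred so that the tag-selection helper can be used on its own.

```python
from typing import List, Sequence


def select_tags(
    labels: Sequence[str], scores: Sequence[float], threshold: float = 0.5
) -> List[str]:
    """Keep the labels whose score meets the threshold, highest score first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]


def classify(
    text: str, candidate_labels: Sequence[str], threshold: float = 0.5
) -> List[str]:
    """Tag text with a zero-shot classifier (requires the transformers package)."""
    # Imported lazily so select_tags works without transformers installed.
    from transformers import pipeline

    # The model name is an assumption; other zero-shot models could be substituted.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    # multi_label=True scores each label independently, so a document can get several tags.
    result = classifier(text, candidate_labels=list(candidate_labels), multi_label=True)
    return select_tags(result["labels"], result["scores"], threshold)
```

A call such as `classify(extracted_text, ["invoice", "contract", "resume"])` would download the model on first use, so in practice the classifier would be created once and reused across documents.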

Resources

Tutorials

The table below provides a non-comprehensive list of links to tutorials that we’ve found to be most useful.

Tutorial | Content Summary
How to Backup and Restore Elasticsearch using Snapshots | Step-by-step instructions on how to create backups of Elasticsearch indices using snapshots and then restore them in case of a disaster or data loss.
The Elastic.co documentation on “Taking Snapshots in Elasticsearch” | A comprehensive guide on how to create snapshots of indices, shards, or clusters in Elasticsearch, which is useful for creating backups and restoring data.
Elasticsearch in Action: Introducing Java API Client | Elasticsearch released a new forward-compatible Java client, called the Java API Client, starting with version 7.17.
Hugging Face | Hugging Face is a website that provides a wide range of NLP tools and resources, including pre-trained models, datasets, and libraries. It is a great starting point to help developers and researchers build and train NLP models more efficiently and effectively.
Micronaut | A modern, JVM-based, full-stack framework for building modular, easily testable microservice and serverless applications.
Tesseract OCR | Tesseract is an open-source OCR engine that is widely used for recognizing text within digital images.
Apache Kafka | Apache Kafka is an open-source distributed event streaming platform that is used to publish, subscribe to, store, and process streams of data in real time. The project is currently maintained by the Apache Software Foundation.

Documentation

Please see the table below for a set of documentation resources for the intelligent-document-classification BoosterPack:

Document | Summary
https://github.com/formkiq/intelligent-document-classification | GitHub Code Repository and Documentation

Support

As a DAIR participant, you can access support related to this BoosterPack. If you have questions, you can:

  • post them in the DAIR Slack #help channel
  • send an email to [email protected]
  • create an issue in the GitHub repository

Got it? Now let us show you how we deployed it on the DAIR Cloud…

Automated Document Classification and Discovery Sample Solution

For any organization that needs to recall specific documents from a document repository, and that needs additional context to discover and categorize results, this Sample Solution demonstrates how optical character recognition (OCR), natural language processing (NLP), and full-text search can be used to automate the creation of document metadata. Unlike today's manual and error-prone process, this Sample Solution automates the workflow and produces far fewer errors.
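The flow from extracted text to searchable metadata can be pictured with a small, self-contained sketch. Everything in it is a stand-in chosen for illustration: the input is assumed to be text that OCR has already extracted, the keyword rules take the place of a trained NLP model, and the in-memory inverted index takes the place of a full-text search engine. The class names, function names, and tag rules are hypothetical and are not taken from the Sample Solution's code.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical keyword rules standing in for a trained NLP model.
TAG_KEYWORDS: Dict[str, Set[str]] = {
    "invoice": {"invoice", "amount", "due"},
    "contract": {"agreement", "party", "term"},
}


@dataclass
class DocumentMetadata:
    document_id: str
    text: str
    tags: List[str] = field(default_factory=list)


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tag_document(document_id: str, extracted_text: str) -> DocumentMetadata:
    """Assign every tag whose keyword set overlaps the document's terms."""
    words = tokenize(extracted_text)
    tags = sorted(tag for tag, keywords in TAG_KEYWORDS.items() if words & keywords)
    return DocumentMetadata(document_id, extracted_text, tags)


class SearchIndex:
    """Minimal inverted index standing in for a full-text search engine."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, metadata: DocumentMetadata) -> None:
        # Index both the document's own terms and its generated tags.
        for term in tokenize(metadata.text) | set(metadata.tags):
            self._postings[term].add(metadata.document_id)

    def search(self, query: str) -> List[str]:
        """Return the ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(matches)


if __name__ == "__main__":
    index = SearchIndex()
    index.add(tag_document("doc-1", "Invoice 42: amount due in 30 days"))
    index.add(tag_document("doc-2", "This agreement binds each party"))
    print(index.search("contract"))
```

Because generated tags are indexed alongside the text, a search for "contract" finds the second document even though that word never appears in it, which is the kind of added context that automated metadata is meant to provide.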

Please see the Sample Solution page for more information on how the Sample Solution works.

The Sample Solution showcases the following technologies: Micronaut, Apache Kafka, Tesseract OCR, Elasticsearch, and NLP, described in subsequent sections.