Intelligent Document Parsing with Machine Learning



August 21, 2021

Imagine an organization with a large volume of documents whose data it wants to use or store in a database. Take invoices as an example: to use the data they contain, a company would usually hire a team of data entry operators. Now imagine replacing most of that work with a piece of software.

Why do you need a document parser?

  • Elimination of manual entry

  • Digitizing the data

OCR - Optical Character Recognition

The name says it all: OCR systems transform a two-dimensional image of text, which may be machine-printed or handwritten, from its image representation into machine-readable text. As a process, OCR generally consists of several sub-processes, chained together to make recognition as accurate as possible.


Extracting Data from PDF

Take an invoice in PDF format as an example.

pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line, and it also offers table extraction and visual debugging. It works best on machine-generated, rather than scanned, PDFs. Here we extract the invoice data and convert it to JSON in three steps:

  • Use the pdfplumber library to extract the text from the PDF.

  • Use regular expressions to pull the required fields out of that text.

  • Convert the extracted data to JSON or CSV.

For the sample invoice, the result looks like this:




{
  'orderDate': 'Thu, Apr 15, 2021',
  'Buyer Name': 'Kaushik Reddy',
  'businessType': 'B2C',
  'companyName': 'None',
  'orderId': '407-7570907-7743562',
  'addressLine1': 'Plot 463/2p, flat 406, vindhya homes, lake view colony',
  'addressLine2': 'pragati nagar',
  'city': 'Hyderabad',
  'state': 'TELANGANA',
  'pincode': '500090',
  'products': [
    {
      'quantity': '2',
      'productTitle': 'NOSQUITO Mosquito Repellent Herbal Body Spray with Goodness of Tulsi, Citriodora and Grape Fruit Extract,',
      'sku': 'RSDZ003'
    },
    {
      'quantity': '4',
      'productTitle': 'Nosquito Mosquito repellent Candle made by 100% green products',
      'sku': 'S-NQ-06'
    }
  ]
}
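The extraction steps described in this section (PDF to text, regular expressions, JSON) can be sketched roughly as follows. This is a minimal illustration, not the code that produced the output above: the "Label: value" layout matched by FIELD_PATTERNS is a hypothetical assumption about how the invoice text looks, and the function names pdf_to_text, parse_invoice, and invoice_to_json are made up for this example. Only pdfplumber.open, pdf.pages, and page.extract_text come from the pdfplumber library.

```python
import json
import re

# Hypothetical field patterns: the "Label: value" layout is an assumption
# about the invoice text, not something pdfplumber guarantees.
FIELD_PATTERNS = {
    "orderId": re.compile(r"Order ID:\s*(\d{3}-\d{7}-\d{7})"),
    "orderDate": re.compile(r"Order Date:\s*(.+)"),
    "pincode": re.compile(r"Pincode:\s*(\d{6})"),
}


def pdf_to_text(path):
    """Return the text of every page in the PDF, joined by newlines."""
    # Imported here so the regex step below can be tried without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for a page with no text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def parse_invoice(text):
    """Apply each pattern to the text; fields that do not match are None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


def invoice_to_json(text):
    """Serialize the parsed fields as a JSON string."""
    return json.dumps(parse_invoice(text), indent=2)
```

With a real file, the call would be invoice_to_json(pdf_to_text("invoice.pdf")), where the file name is a placeholder. In practice the patterns have to be adapted to each invoice layout, which is the main limitation of a purely regex-based approach.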

Extracting required data using NER:


Named entity recognition (NER) is often the first step in information extraction. It seeks to locate and classify named entities in text into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, and percentages. NER is used in many areas of natural language processing (NLP) and can help answer many real-world questions.


For example, a pre-trained model applied to a sentence mentioning Apple, India, and 1 Crore classifies Apple as an organization, India as a geographical location, and 1 Crore as money. When we need more specific information from a document, we train a new model on our own labeled data and use that model to extract the fields we care about.
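As a rough sketch of running a pre-trained NER model, the snippet below uses spaCy. The choice of spaCy and of the en_core_web_sm model is an assumption, since this post does not say which library was used; run_ner and group_by_label are hypothetical helper names. Which entities and labels come back depends entirely on the model loaded.

```python
from collections import defaultdict


def run_ner(text, model_name="en_core_web_sm"):
    """Return (entity text, label) pairs found by a pre-trained spaCy pipeline."""
    # Imported lazily; assumes `pip install spacy` and
    # `python -m spacy download en_core_web_sm` have been run.
    import spacy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(entities):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)
```

Calling group_by_label(run_ner(some_text)) gives one list of strings per entity type, which is a convenient shape for filling database fields. A custom model trained on your own labels would be loaded the same way, by passing its path or name as model_name.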

SaaS solutions that offer document parsing:

Google Document AI: Document AI, or Document Intelligence, is a technology that uses natural language processing (NLP) and machine learning (ML) to train computers to simulate a human review of documents. NLP enables the computer to understand the contents of documents, including the contextual nuances of the language within them, before extracting the information and insights they contain. The technology can then categorize and organize the documents themselves. Document AI is used to process and intelligently parse forms, tables, receipts, invoices, tax forms, contracts, loan agreements, financial reports, and similar documents, in both digital and printed form. It supports over 200 languages.


Amazon Textract

Amazon Textract uses AI to extract data from documents. It does not require the client to write any configuration or custom code. It provides Amazon Virtual Private Cloud (VPC) endpoints that let customers reach the service privately from within their VPC. It is also integrated with Amazon Augmented AI, which allows a human-in-the-loop approach for sensitive workflows that require high accuracy.


Nanonets

Nanonets offers OCR-based data extraction using AI and ML algorithms. Its models can easily be trained on custom data, which makes them straightforward to adapt to a specific use case. They can handle different font sizes, image noise, blurred images, and so on. A single model can be used to extract data from documents written in multiple languages.


Even though many SaaS solutions offer this service, the models they use may not work well on your data. A model has to be customized for your data to extract information from your documents more accurately. We at Diatoz Solutions can help you with custom-trainable, state-of-the-art models for accurate data extraction from your documents.
