In today’s fast-paced business environment, processing invoices and payments is a critical task for companies of all sizes.
Invoices contain vital information such as customer and vendor details, order information, pricing, taxes, and payment terms.
Manually managing invoice data extraction can be complex and time-consuming, especially for large volumes of invoices.
For instance, businesses may receive invoices in various formats such as paper, email, PDF, or electronic data interchange (EDI). In addition, invoices may contain structured data, such as tables, as well as unstructured data, such as free-text descriptions, logos, and images.
Manually extracting and processing this information can be error-prone, leading to delays, inaccuracies, and missed opportunities.
Fortunately, Python provides a robust and flexible set of tools for automating the extraction and processing of invoice data.
In this step-by-step guide, we will explore how to leverage Python to extract structured and unstructured data from invoices, process PDFs, and integrate with machine learning models.
By the end of this guide, you’ll have a solid understanding of how to use Python to extract valuable insights from invoice data, which can help you streamline your business processes, optimize cash flow, and gain a competitive advantage in your industry. Let’s dive in.
Before anything else, let’s understand what invoices are!
An invoice is a document that outlines the details of a transaction between a buyer and a seller, including the date of the transaction, the names and addresses of the buyer and seller, a description of the goods or services provided, the quantity of items, the price per unit, and the total amount due.
Despite the apparent simplicity of invoices, extracting data from them can be a complex and challenging process. This is because invoices may contain both structured and unstructured data.
Structured data refers to data that is organized in a specific format, such as tables or lists. Invoices often include structured data in the form of tables that outline the line items and quantities of goods or services provided.
Unstructured data, on the other hand, refers to data that is not organized in a specific format and can be more difficult to recognize and extract. Invoices may contain unstructured data in the form of free-text descriptions, logos, or images.
Extracting data from invoices can be expensive and can lead to delays in payment processing, especially when dealing with large volumes of invoices. This is where invoice data extraction comes in.
Invoice data extraction refers to the process of extracting structured and unstructured data from invoices. This process can be challenging due to the variety of invoice data types, but can be automated using tools such as Python.
As discussed, not every invoice is easy to process, since invoices come in different forms and templates. Here are a few challenges businesses face when extracting data from invoices:
- Variety of invoice formats: Invoices may come in different formats, including paper, email, PDF, or EDI, which can make it difficult to extract and process data consistently.
- Data quality and accuracy: Manually processing invoices can be prone to errors, leading to delays and inaccuracies in payment processing.
- Large volumes of data: Many businesses deal with a high volume of invoices, which can be difficult and time-consuming to process manually.
- Different languages and font-sizes: Invoices from international vendors may be in different languages, which can be difficult to process using automated tools. Similarly, invoices may contain different font sizes and styles, which can impact the accuracy of data extraction.
- Integration with other systems: Extracted data from invoices often needs to be integrated with other systems, such as accounting or enterprise resource planning (ERP) software, which can add an extra layer of complexity to the process.
Python is a popular programming language used for a wide range of data extraction and processing tasks, including extracting data from invoices. Its versatility makes it a powerful tool in the world of technology – from building machine learning models and APIs to automating invoice extraction processes.
Let’s briefly look at Python libraries that can be used for invoice extraction with examples:
Pytesseract
Pytesseract is a Python wrapper for Google’s Tesseract OCR engine, which is one of the most popular OCR engines available. Pytesseract is designed to extract text from scanned images, including invoices, and can be used to extract key-value pairs and other textual information from the header and footer sections of invoices.
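As a minimal sketch (assuming Tesseract is installed locally and a scanned invoice is saved as invoice.png, which is a placeholder filename), pulling the raw text out of an invoice image can be as short as:
import pytesseract
from PIL import Image

# Run Tesseract OCR on the scanned invoice and print the recognized text
text = pytesseract.image_to_string(Image.open('invoice.png'))
print(text)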
Textract
Textract is a Python library that can extract text and data from a wide range of file formats, including PDFs, images, and scanned documents. It uses OCR and other techniques under the hood, and can be used to pull text and data from all sections of an invoice.
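As a minimal sketch (assuming textract is installed and a file named invoice.pdf exists locally), extracting raw text from a PDF invoice might look like this:
import textract

# Pull raw text out of the PDF; for scanned PDFs you can typically force OCR with method='tesseract'
raw_bytes = textract.process('invoice.pdf')
text = raw_bytes.decode('utf-8')

print(text[:500])  # preview the first 500 characters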
Pandas
Pandas is a powerful data manipulation library for Python that provides data structures for efficiently storing and manipulating large datasets. Pandas can be used to extract and manipulate tabular data from the line items section of invoices, including product descriptions, quantities, and prices.
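For instance, once line items have been pulled out of an invoice, pandas makes it easy to organize them and compute totals. The rows below are made-up values purely for illustration:
import pandas as pd

# Hypothetical line items already extracted from an invoice table
line_items = [
    {"description": "Web Design", "quantity": 1, "unit_price": 85.00},
    {"description": "Hosting (12 months)", "quantity": 12, "unit_price": 2.00},
]

df = pd.DataFrame(line_items)
df["line_total"] = df["quantity"] * df["unit_price"]

print(df)
print("Invoice total:", df["line_total"].sum())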
Tabula
Tabula is a Python library that is specifically designed to extract tabular data from PDFs and other documents. Tabula can be used to extract data from the line items section of invoices, including product descriptions, quantities, and prices, and can be a useful alternative to OCR-based methods for extracting this data.
Camelot
Camelot is another Python library that can be used to extract tabular data from PDFs and other documents, and is specifically designed to handle complex table structures. Camelot can be used to extract data from the line items section of invoices, and can be a useful alternative to OCR-based methods for extracting this data.
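A minimal sketch of Camelot in action, assuming a text-based PDF named sample-invoice.pdf with ruled (bordered) tables:
import camelot

# 'lattice' works well for tables with visible cell borders; try flavor='stream' otherwise
tables = camelot.read_pdf('sample-invoice.pdf', pages='1', flavor='lattice')

print(tables[0].df)              # the first detected table as a pandas DataFrame
print(tables[0].parsing_report)  # accuracy and whitespace metrics for the extraction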
OpenCV
OpenCV is a popular computer vision library for Python that provides tools and techniques for analyzing and manipulating images. OpenCV can be used to extract information from images and logos in the header and footer sections of invoices, and can be used in conjunction with OCR-based methods to improve accuracy and reliability.
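As a rough sketch (assuming a scanned invoice saved as sample-invoice.jpg), a common pattern is to binarize the image with OpenCV before handing it to an OCR engine:
import cv2
import pytesseract

# Load the scan, convert to grayscale, and apply Otsu binarization
img = cv2.imread('sample-invoice.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# OCR the cleaned-up image
text = pytesseract.image_to_string(thresh)
print(text[:300])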
Pillow
Pillow is a Python library that provides tools and techniques for working with images, including reading, writing, and manipulating image files. Pillow can be used to extract information from images and logos in the header and footer sections of invoices, and can be used in conjunction with OCR-based methods to improve accuracy and reliability.
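For example, Pillow can be used to boost contrast and crop an assumed header region before OCR. The crop coordinates below are placeholders that would need to be adjusted to your own invoice layout:
from PIL import Image, ImageEnhance
import pytesseract

# Open the scanned invoice, convert to grayscale, and increase contrast
img = Image.open('sample-invoice.jpg').convert('L')
img = ImageEnhance.Contrast(img).enhance(2.0)

# Crop a placeholder header region (left, top, right, bottom in pixels) and OCR just that area
header = img.crop((0, 0, img.width, 300))
print(pytesseract.image_to_string(header))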
It’s important to note that while the libraries mentioned above are some of the most commonly used for extracting data from invoices, the process of extracting data from invoices can be complex and could require multiple techniques and tools.
Depending on the complexity of the invoice and the specific information you need to extract, you may need to use additional libraries and techniques beyond those mentioned here.
Now, before we dive into a real example of extracting invoices, let’s first discuss the process of preparing invoice data for extraction.
Preparing the data before extraction is an important step in the invoice processing pipeline, as it can help ensure that the data is accurate and reliable. This is particularly important when dealing with large volumes of data or when working with unstructured data which may contain errors, inconsistencies, or other issues that can impact the accuracy of the extraction process.
One key technique for preparing invoice data for extraction is data cleaning and preprocessing.
Data cleaning and preprocessing involves identifying and correcting errors, inconsistencies, and other issues in the data before the extraction process begins. This can involve a wide range of techniques, including:
- Data normalization: Transforming data into a common format that can be more easily processed and analyzed. This can involve standardizing the format of dates, times, and other data elements, as well as converting data into a consistent data type, such as numeric or categorical data.
- Text cleaning: Removing extraneous or irrelevant information from the data, such as stop words, punctuation, and other non-textual characters. This can help improve the accuracy and reliability of text-based extraction techniques, such as OCR and NLP (see the short sketch after this list).
- Data validation: Checking the data for errors, inconsistencies, and other issues that may impact the accuracy of the extraction process. This can involve comparing the data to external sources, such as customer databases or product catalogs, to ensure that the data is accurate and up-to-date.
- Data augmentation: Adding or modifying data to improve the accuracy and reliability of the extraction process. This can involve adding additional data sources, such as social media or web data, to supplement the invoice data, or using machine learning techniques to generate synthetic data to improve the accuracy of the extraction process.
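Here is a small sketch of what normalization and text cleaning might look like in practice. The date formats and character rules below are assumptions that would need to be adapted to your own invoices:
import re
from datetime import datetime

def normalize_date(raw):
    """Try a few common invoice date formats and return ISO 8601, or None."""
    for fmt in ('%B %d, %Y', '%d/%m/%Y', '%Y-%m-%d'):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_text(raw):
    """Lowercase, strip stray punctuation, and collapse whitespace."""
    text = re.sub(r'[^\w\s.,$/-]', ' ', raw.lower())
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_date('January 25, 2016'))     # -> 2016-01-25
print(clean_text('  Total  Due:   $93.50 '))  # -> 'total due $93.50'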
Extracting data from invoices is a complex task that requires a combination of techniques and tools. Using a single technique or library is often not sufficient because every invoice is different, and layouts and formats can vary widely. However, if you have access to a set of electronically generated invoices, you can use various techniques such as regular expression matching and table extraction to extract data from them.
For example, to extract tables from PDF invoices, you can use tabula-py library which extracts data from tables in PDFs. By providing the area of the PDF page where the table is located, you can extract the table and manipulate it using the pandas library.
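As a hedged sketch of restricting tabula to a specific region, the coordinates below are placeholders in PDF points that you would measure for your own template:
from tabula import read_pdf

# area=[top, left, bottom, right] in PDF points; placeholder values shown here
tables = read_pdf('invoice.pdf', pages=1, area=[300, 20, 560, 580])
print(tables[0])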
On the other hand, non-electronically made invoices, such as scanned or image-based invoices, require more advanced techniques, including computer vision and machine learning. These techniques enable the intelligent recognition of regions of the invoice and extraction of data.
One of the advantages of using machine learning for invoice extraction is that the algorithms can learn from training data. Once the algorithm has been trained, it can intelligently recognize new invoices without needing to retrain the algorithm. This means that the algorithm can quickly and accurately extract data from new invoices based on previous inputs.
In this section, let’s use regular expressions to extract a few fields from invoices.
Step 1: Import libraries
To extract information from the invoice text, we use regular expressions and the pdftotext library to read data from PDF invoices.
import pdftotext
import re
Step 2: Read the PDF
We first read the PDF invoice using Python’s built-in open() function. The ‘rb’ argument opens the file in binary mode, which is required for reading binary files like PDFs. We then use the pdftotext library to extract the text content from the PDF file.
with open('invoice.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)

text = "\n\n".join(pdf)
Step 3: Use regular expressions to match the text on invoices
We use regular expressions to extract the invoice number, total amount due, invoice date, and due date from the invoice text. We compile the regular expressions using the re.compile() function and use the search() function to find the first occurrence of the pattern in the text. We use the group() function to extract the matched text from the pattern, and the strip() function to remove any leading or trailing whitespace from the matched text. If a match is not found, we set the corresponding value to None.
invoice_number_match = re.search(r'Invoice Number\s*\n\s*\n(.+?)\s*\n', text)
invoice_number = invoice_number_match.group(1).strip() if invoice_number_match else None

total_amount_due_match = re.search(r'Total Due\s*\n\s*\n(.+?)\s*\n', text)
total_amount_due = total_amount_due_match.group(1).strip() if total_amount_due_match else None
# Extract the invoice date
invoice_date_pattern = re.compile(r'Invoice Date\s*\n\s*\n(.+?)\s*\n')
invoice_date_match = invoice_date_pattern.search(text)
if invoice_date_match:
    invoice_date = invoice_date_match.group(1).strip()
else:
    invoice_date = None

# Extract the due date
due_date_pattern = re.compile(r'Due Date\s*\n\s*\n(.+?)\s*\n')
due_date_match = due_date_pattern.search(text)
if due_date_match:
    due_date = due_date_match.group(1).strip()
else:
    due_date = None
Step 4: Printing the data
Lastly, we print all the data that’s extracted from the invoice.
print('Invoice Number:', invoice_number)
print('Total Amount Due:', total_amount_due)
print('Invoice Date:', invoice_date)
print('Due Date:', due_date)
Output
Invoice Number: INV-3337
Total Amount Due: $93.50
Invoice Date: January 25, 2016
Due Date: January 31, 2016
Note that the approach described here is specific to the structure and format of the example invoice. In practice, the text extracted from different invoices can have varying forms and structures, making it difficult to apply a one-size-fits-all solution. To handle such variations, advanced techniques such as named entity recognition (NER) or key-value pair extraction may be required, depending on the specific use case.
Extracting tables from electronically generated PDF invoices can be a straightforward task, thanks to libraries such as Tabula and Camelot. The following code demonstrates how to use these libraries to extract tables from a PDF invoice.
from tabula import read_pdf
from tabulate import tabulate
file = "sample-invoice.pdf"
df = read_pdf(file, pages="all")
print(tabulate(df[0]))
print(tabulate(df[1]))
Output
- ------------ ----------------
0 Order Number 12345
1 Invoice Date January 25, 2016
2 Due Date January 31, 2016
3 Total Due $93.50
- ------------ ----------------
- - ------------------------------- ------ ----- ------
0 1 Web Design $85.00 0.00% $85.00
This is a sample description...
- - ------------------------------- ------ ----- ------
If you need to extract specific columns from an invoice (unstructured invoice), and if the invoice contains multiple tables with varying formats, you may need to perform some post-processing to achieve the desired output. However, to address such challenges, advanced techniques such as computer vision and optical character recognition (OCR) can be used to extract data from invoices regardless of their layouts.
Identifying layouts of Invoices to apply OCR
In this example, we will use Tesseract, a popular OCR engine for Python, to parse through an invoice image.
Step 1: Import necessary libraries
First, we import the necessary libraries: OpenCV (cv2) for image processing, and pytesseract for OCR. We also import the Output class from pytesseract to specify the output format of the OCR results.
import cv2
import pytesseract
from pytesseract import Output
Step 2: Read the sample invoice image
We then read the sample invoice image sample-invoice.jpg using cv2.imread() and store it in the img variable.
img = cv2.imread('sample-invoice.jpg')
Step 3: Perform OCR on the image and obtain the results in dictionary format
Next, we use pytesseract.image_to_data() to perform OCR on the image and obtain a dictionary of information about the detected text. The output_type=Output.DICT argument specifies that we want the results in dictionary format.
We then print the keys of the resulting dictionary using the keys() function to see the available information that we can extract from the OCR results.
d = pytesseract.image_to_data(img, output_type=Output.DICT)
# Print the keys of the resulting dictionary to see the available information
print(d.keys())
Step 4: Visualize the detected text by plotting bounding boxes
To visualize the detected text, we can plot the bounding boxes of each detected word using the information in the dictionary. We first obtain the number of detected text blocks using the len() function, and then loop over each block. For each block, we check if the confidence score of the detected text is greater than 60 (i.e., the detected text is more likely to be correct), and if so, we retrieve the bounding box information and plot a rectangle around the text using cv2.rectangle(). We then display the resulting image using cv2.imshow() and wait for the user to press a key before closing the window.
n_boxes = len(d['text'])
for i in range(n_boxes):
    if float(d['conf'][i]) > 60:  # Check if confidence score is greater than 60
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
Output: the invoice image is displayed with green bounding boxes drawn around each word detected with a confidence score above 60.
Named Entity Recognition (NER) is a natural language processing technique that can be used to extract structured information from unstructured text. In the context of invoice extraction, NER can be used to identify key entities such as invoice numbers, dates, and amounts.
One popular NLP library that includes NER functionality is spaCy. spaCy provides pre-trained models for NER in several languages, including English. Here’s an example of how to use spaCy to extract information from an invoice:
Step 1: Import Spacy and load pre-trained model
In this example, we first load the pre-trained English model with NER using the spacy.load() function. We also import pdftotext again so we can read the text out of the PDF invoice, as in the earlier example.
import spacy
import pdftotext

# Load the English pre-trained model with NER
nlp = spacy.load('en_core_web_sm')
Step 2: Extract the text from the PDF invoice and apply the NER model
We then extract the text from the PDF invoice (using pdftotext, as before) and apply the NER model to it using the nlp() function.
# Extract the text from the PDF invoice
with open('invoice.pdf', 'rb') as f:
    text = "\n\n".join(pdftotext.PDF(f))

# Apply the NER model to the invoice text
doc = nlp(text)
Step 3: Extract invoice number, date, and total amount due
We then iterate over the detected entities in the invoice text using a for loop and use the label_ attribute of each entity to check whether it corresponds to the invoice number, date, or total amount due. Note that the pre-trained model only produces generic labels such as DATE and MONEY; a custom label like INVOICE_NUMBER will only appear if you train your own NER model on annotated invoices. We use string matching and lowercasing to identify these entities based on their contextual clues.
invoice_number = None
invoice_date = None
total_amount_due = None

for ent in doc.ents:
    # 'INVOICE_NUMBER' is a custom label and only fires with a custom-trained model;
    # the pre-trained model emits generic labels such as DATE and MONEY
    if ent.label_ == 'INVOICE_NUMBER':
        invoice_number = ent.text.strip()
    elif ent.label_ == 'DATE':
        if ent.text.strip().lower().startswith('invoice'):
            invoice_date = ent.text.strip()
    elif ent.label_ == 'MONEY':
        if 'total' in ent.text.strip().lower():
            total_amount_due = ent.text.strip()
Step 4: Print the extracted information
Finally, we print the extracted information to the console for verification. Note that the performance of the NER model may vary depending on the quality and variability of the input data, so some manual tweaking may be required to improve the accuracy of the extracted information.
print('Invoice Number:', invoice_number)
print('Invoice Date:', invoice_date)
print('Total Amount Due:', total_amount_due)
In the next section, let’s discuss some of the common challenges and solutions for automated invoice extraction.
Common Challenges and Solutions
Despite the many benefits of using Python for invoice data extraction, businesses may still face challenges in the process. Here are some common challenges that arise during invoice data extraction and possible solutions to overcome them:
Inconsistent formats
Invoices can come in various formats, including paper, PDF, and email, which can make it challenging to extract and process data consistently. Additionally, the structure of the invoice may not always be the same, which can cause issues with data extraction. To address this, businesses can convert incoming invoices to a common format (such as PDF or image files) and favor layout-agnostic techniques such as OCR over rigid, template-based rules.
Poor quality scans
Low-quality scans or scans with skewed angles can lead to errors in data extraction. To improve the accuracy of data extraction, businesses can use image preprocessing techniques such as deskewing, binarization, and noise reduction to improve the quality of the scan.
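As a rough sketch of the noise-reduction and binarization steps (assuming a noisy scan saved as scanned-invoice.jpg; deskewing requires an additional angle-estimation step that is omitted here):
import cv2

# Denoise the grayscale scan, then apply adaptive binarization before OCR
img = cv2.imread('scanned-invoice.jpg', cv2.IMREAD_GRAYSCALE)
denoised = cv2.medianBlur(img, 3)
binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 11)

cv2.imwrite('cleaned-invoice.jpg', binary)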
Different languages and font sizes
Invoices from international vendors may be in different languages, which can be difficult to process using automated tools. Similarly, invoices may contain different font sizes and styles, which can impact the accuracy of data extraction. To overcome this challenge, businesses can use machine learning algorithms and techniques such as optical character recognition (OCR) to extract data accurately regardless of language or font size.
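For instance, Tesseract can OCR non-English invoices if the corresponding language pack is installed. The sketch below assumes a German-language invoice image (german-invoice.jpg, a placeholder filename) and that the 'deu' language data is installed alongside Tesseract:
import pytesseract
from PIL import Image

# OCR a German-language invoice; lang='deu' selects the German Tesseract model
img = Image.open('german-invoice.jpg')
text = pytesseract.image_to_string(img, lang='deu')
print(text[:300])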
Complex invoice structures
Invoices may contain complex structures such as nested tables or mixed data types, which can be difficult to extract and process. To overcome this challenge, businesses can use libraries such as Pandas to handle complex structures and extract data accurately.
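A small sketch of this kind of post-processing with pandas; the messy values below are made up to mimic what table extraction from a complex layout often produces:
import pandas as pd

# Hypothetical, messy line items as returned by a table-extraction step
df = pd.DataFrame({
    'description': ['Web Design', 'Hosting', None],
    'quantity': ['1', '12', ''],
    'unit_price': ['$85.00', '$2.00', 'N/A'],
})

# Coerce messy strings to proper numeric types; unparseable values become NaN
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
df['unit_price'] = pd.to_numeric(df['unit_price'].str.replace(r'[$,]', '', regex=True), errors='coerce')

# Drop rows that carry no usable line-item data
df = df.dropna(subset=['description', 'quantity', 'unit_price'])
print(df)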
Integration with other systems (ERPs)
Extracted data from invoices often needs to be integrated with other systems, such as accounting or enterprise resource planning (ERP) software, which can add an extra layer of complexity to the process. To overcome this challenge, businesses can use APIs or database connectors to integrate the extracted data with other systems.
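As a hedged sketch, the fields extracted in the earlier examples could be pushed to an accounting or ERP system over a REST API. The endpoint URL, token, and payload schema below are placeholders, not a real API:
import requests

# Reuse the fields extracted in the earlier examples
invoice_payload = {
    'invoice_number': invoice_number,
    'invoice_date': invoice_date,
    'total_amount_due': total_amount_due,
}

response = requests.post(
    'https://erp.example.com/api/invoices',              # placeholder endpoint
    json=invoice_payload,
    headers={'Authorization': 'Bearer YOUR_API_TOKEN'},  # placeholder credentials
    timeout=30,
)
response.raise_for_status()
print('Invoice pushed, response:', response.json())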
By understanding and overcoming these common challenges, businesses can extract data from invoices more efficiently and accurately, and gain valuable insights that can help optimize their business processes.
With Nanonets, you can easily create and train machine learning models for invoice data extraction using an intuitive web-based GUI.
You can access cloud-hosted models that use state-of-the-art algorithms to provide you with accurate results, without worrying about getting a GCP instance or GPUs for training.
The Nanonets OCR API allows you to build OCR models with ease. You do not have to worry about pre-processing your images, matching templates, or building rule-based engines to increase the accuracy of your OCR model.
You can upload your data, annotate it, set the model to train, and wait for predictions through a browser-based UI without writing a single line of code, worrying about GPUs, or finding the right architectures for your deep learning models. You can also retrieve the JSON response of each prediction to integrate it with your own systems and build machine-learning-powered apps on top of state-of-the-art algorithms and a strong infrastructure.
Using the GUI: https://app.nanonets.com/
You can also use the Nanonets-OCR API by following the steps below:
Step 1: Clone the Repo, Install dependencies
git clone https://github.com/NanoNets/nanonets-ocr-sample-python.git
cd nanonets-ocr-sample-python
sudo pip install requests tqdm
Step 2: Get your free API Key
Get your free API Key from https://app.nanonets.com/#/keys
Step 3: Set the API key as an Environment Variable
export NANONETS_API_KEY=YOUR_API_KEY_GOES_HERE
Step 4: Create a New Model
python ./code/create-model.py
Note: This generates a MODEL_ID that you need for the next step
Step 5: Add Model Id as Environment Variable
export NANONETS_MODEL_ID=YOUR_MODEL_ID
Note: you will get YOUR_MODEL_ID from the previous step
Step 6: Upload the Training Data
The training data is found in images (image files) and annotations (annotations for the image files).
python ./code/upload-training.py
Step 7: Train Model
Once the images have been uploaded, begin training the model:
python ./code/train-model.py
Step 8: Get Model State
The model takes ~2 hours to train. You will get an email once the model is trained. In the meantime, you can check the state of the model:
python ./code/model-state.py
Step 9: Make Prediction
Once the model is trained, you can make predictions using it:
python ./code/prediction.py ./images/151.jpg
Summary
Invoice data extraction is a critical process for businesses that deal with a high volume of invoices. Accurately extracting data from invoices can significantly reduce errors, streamline payment processing, and ultimately improve your bottom line.
Python is a powerful tool that can simplify and automate the invoice data extraction process. Its versatility and numerous libraries make it an ideal choice for businesses looking to improve their invoice data extraction capabilities.
Moreover, with Nanonets, you can streamline your invoice data extraction process even further. Our easy-to-use platform offers a range of features, including an intuitive web-based GUI, cloud-hosted models, state-of-the-art algorithms, and field extraction made easy.
So, if you’re looking for an efficient and cost-effective solution for invoice data extraction, look no further than Nanonets. Sign up for our service today and start optimizing your business processes!
Read More: 5 Ways to Remove Pages from PDFs