Optical Character Recognition (OCR) technology has become a vital component for businesses looking to automate document processing and streamline data extraction.
Very simply put, OCR involves scanning documents and converting the scanned image into machine-readable text, allowing organisations to process a wide variety of documents, such as invoices, contracts, and receipts, without manual data entry.
OCR APIs act as an extension of this functionality by providing developers with a pre-built “black box” that can be accessed programmatically. It integrates the “OCR functionality” into their applications effortlessly, eliminating the need to build or develop OCR from scratch.
Over the past five years, the OCR API landscape has witnessed significant transformations. Initially dominated by traditional OCR engines like Tesseract, newer solutions such as CRF (Contextual understanding for Improved recognition) or LSTM (Long Short-term memory for sequence recognition) have emerged, offering improved accuracy, multi-language support, and the ability to process complex document structures. These OCR solutions have enhanced data extraction capabilities.
At the same time, large language models (LLMs) such as OpenAI’s GPT-4 and Anthropic’s Claude have introduced OCR capabilities as part of their services, allowing for a more natural interaction with documents.
And AI-based Intelligent Document Processing (IDP) platforms have carved out a niche by offering highly specialised, industry-specific OCR and document processing workflow automation capabilities.
In this article, we will explore the top OCR APIs from three major categories: cloud service providers, large language models (LLMs), and AI-based IDP software. We will assess each tool’s strengths, weaknesses, and the criteria for selecting the best API for different use cases.
Overview of the Selection Criteria and Testing Process
API categories
To identify the best OCR APIs available today, we first established clear categories of OCR APIs that we would be testing. Let’s take a quick look at the three categories we have picked:
- Cloud service providers
- Large Language Models (LLMs)
- AI-based IDP software
Cloud service providers provide open APIs that are scalable and provide enterprise-grade security. They are popular and can be integrated with other services from the cloud service providers seamlessly.
On the other hand, LLMs started off as AI algorithms that could process information and generate human-like responses have expanded their feature suite to include OCR capabilities to allow for a more seamless engagement. They aren’t explicitly set up for data extraction but it is one of the many features included in their feature suite.
AI-based IDPs are perhaps the most important category of the three. They were set up with OCR capabilities designed to process industry specific documents such as invoices, bank statements or legal documents and have expanded into end-to-end workflow automation engines.
Performance criteria
Next, the primary criteria when evaluating any OCR API is accuracy. It all comes down to the number of data points that can be correctly identified and accurately extracted by an OCR API, but we created a list of basic criteria that we will evaluate these categories on. These criteria include:
- Language Support
- Advanced Features
- Pricing
Sample set of documents
We also chose a set of documents making sure we had different types of documents serving different use-cases across industries. We included a few challenging documents as well, such as a badly scanned receipt with noise or creases or a handwritten legal document. Let’s take a look at a few of these sample documents that we chose:
- A creased invoice
- A Hindi-language invoice
- A low-light (badly scanned) image of a receipt
- A passport
- A scanned bank statement
- A handwritten legal document
Then, we tested out the above documents with various OCR APIs and evaluated them on the above criteria. For transparency, you can access the sample set along with the result images here.
Before we dive into each API in detail, here is a quick summary of our experiment:
Document Type | AWS | GPT-4o | Claude | Nanonets | Veryfi | |
---|---|---|---|---|---|---|
Creased invoice | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Hindi-language invoice | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
Badly scanned receipt | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Passport | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
Bank statement | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Hand-written legal document | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
The Best OCR APIs from Cloud Service Providers
1. Vision API (Google Cloud)
Known for its speed and accuracy, Vision API leverages machine learning to provide extensive language support, including over 50 languages.
Beyond basic OCR, Google Vision can detect text within images, perform document layout analysis, recognise handwriting and even extract tables, which makes it suitable for businesses handling diverse document types.
Performance:
Overall, out of 6 documents tested, Vision API provided accurate results for all 6, but not as key-value pairs.
You can check out the results here.
- Accuracy: Although accurate even with handwritten, creased or low-light images, the data extracted was in text blocks and didn’t have key-value pair association. In an Invoice, one would be more concerned with the actual invoice number rather than the text block, “Invoice Number”. Hence, key-value pairs made more sense to us.
- Language support: Vision API supports over 200+ languages. We tested it out with a Hindi-language invoice and the results were accurate.
- Advanced Features: Tables present in bank statements or invoices were not recognised separately using the Vision API but individual cells were picked up as blocks of text with no association. The accuracy with handwritten text was satisfactory.
- Pricing: Google Vision offers a user-friendly pricing model based on the number of API calls. Here is a quick summary of the text detection API:
- First 1000 API calls: Free
- Up to 5 Million API calls: $ 1.5 per call
- Beyond 5 Million API calls: $ 1 per call
For detailed pricing, please see the plans here.
❗
1. While Google Vision text detection API excels in language detection and text recognition, it can struggle with highly complex and unstructured documents.
2. It extracts data in text blocks rather than key-value pairs.
AWS Textract offers a robust OCR solution tailored to enterprises with large-scale document processing needs. It specialises in processing structured data and tables, making it particularly useful for businesses requiring high-precision data extraction from complicated layouts. Textract integrates seamlessly with other AWS services, such as S3 for storage and Lambda for building event-driven applications.
Performance:
Out of the 6 documents we tested Textract Forms API on, we got accurate results for 5 documents as key-value pairs.
- Accuracy: We tried the Forms OCR API offered by AWS textract with the above sample set. It yielded satisfactory results with handwritten documents, low-light and damaged documents. It also offered automatic key-value association, i.e., “INV-001” was labeled as “Invoice Number” exactly as it appeared in the document.
- Language support: We tried extracting data with a Hindi-language invoice but, we got poor results. None of the characters were recognised accurately. Upon further investigation, we found that AWS Textract provides support for 6 languages, namely, English, Spanish, German, Italian, Portuguese and French only.
- Advanced Features: It was able to successfully detect the checkboxes marked and extract the corresponding data with high accuracy.
- Pricing: AWS Textract offers a budget friendly pricing model based on the number of pages processed using the various API options. We tried the Analyze Document API – Forms and Tables, the pricing for which is summarized below:
- Price for page with table = $0.015
- Price for page with form (key-value pair) = $0.05
For more details on the various API options offered and their pricing details, visit here.
❗
1. Textract can be expensive for small and medium-sized businesses that process a large volume of documents regularly.
2. It also has limited support for complex, multi-language documents compared to Google Vision.
The Best OCR APIs from LLMs
3. GPT-4o (OpenAI)
Large Language Models or LLMs are the latest advancement in Artificial Intelligence. With their rising popularity, many players like OpenAI’s GPT-4o have added capabilities that extend beyond natural language processing.
GPT-4o now offers OCR functionality, enabling users to not only extract text but also interact with documents or images through natural language queries. This makes GPT-4o particularly appealing for businesses looking to combine document processing with computing.
Performance:
Out of the 6 document types we tested GPT-4o API on, we got accurate results for all 6 documents.
- Accuracy: We tried out the GPT-4o API for data extraction from the above sample set. The accuracy was satisfactory with minimal prompt engineering, however, with damaged documents such as, the low-light image of the receipt and the creased invoice, a few data points were not detected or inaccurately extracted.
- Language support: The data extracted from Hindi-language invoice was satisfactorily accurate. We further investigated and found that GPT-4o supports 50+ languages, including, English, Spanish, German, Italian, Portuguese, Greek, French, Chinese and Japanese.
- Advanced Features:
- Offers accurate checkbox detection. In the handwritten legal document, it was accurately able to identify the gender of the juror based on a checkmark next to “Female”, with no additional or specific prompting.
- Offers table extraction with multi line-item support, as seen in data extraction from a bank statement.
- Pricing: GPT-4o offers pricing based on the number of words extracted. The currency used is “tokens”, where 1000 tokens is equal to 750 words. GPT-4o costs $2.5 / 1 M tokens. There is batch processing available as well which offers a 50% discount for completion within 24 hours. You can refer here, for more details.
❗
1. While GPT-4’s OCR abilities are powerful, the primary drawback is cost, as OpenAI services tend to be more expensive than dedicated OCR solutions.
2. Additionally, while it excels in complex document processing, it may not be the best fit for simple, large-scale OCR tasks.
4. Claude (Anthropic)
Claude, developed by Anthropic, is another LLM that integrates OCR capabilities. Like GPT-4, Claude’s strength lies in its ability to understand and interact with documents using natural language processing. Claude can extract text from images or PDFs, but its standout feature is its ability to understand context, making it ideal for industries where document comprehension is crucial.
Performance:
Out of the 6 documents we tried the Claude-3.5 Sonnet API on, we got accurate results for 4 documents. The results for the passport took additional prompting over privacy concerns. The results for the Hindi-language invoice were extracted in English.
- Accuracy: Overall, the OCR API from Anthropic, Claude extracted data from the sample set of documents with decent accuracy. The low-light receipt and the creased invoice was processed with high accuracy. Data extraction from the passport required additional prompting due to privacy concerns, which leads us to question the reliability of the API.
- Language support: The Hindi-language invoice was processed but data extracted was in English, however, the accuracy remained high. Upon further investigation, we found out that Claude supports limited languages apart from English, namely, French, Portuguese and German.
- Advanced Features:
- Contextual understanding of extracted text, as seen in data extracted from the Hindi-language invoice.
- The extraction of tables in bank statements required additional prompting despite clear specification the first time around. The multi-line statements were extracted in a single line.
- Pricing:
It is certainly more expensive than modern-day Intelligent Document Processing (IDP) software that have specific OCR capabilities or even compared to its counterpart, OpenAI’s GPT. It follows a token-based pricing approach with $3 / 1 M. tokens. For more details, visit this link.
❗
Claude is still evolving and, like GPT-4, its primary limitation is pricing and the trade-off between its cognitive capabilities and performance in bulk OCR tasks.
The Best OCR APIs from AI-Based IDP Software
5. Nanonets
Nanonets is a specialised IDP platform, powered by generative Artificial Intelligence (AI) that stands out for its ability to automate document workflows through OCR. It is designed to handle complex, industry-specific documents such as invoices, receipts, patient forms, bank statements, ID documents, contracts or any other document you can think of.
Nanonets’ strength lies in its customisability, as users can train their own models, within seconds to cater to their specific document types, ensuring a higher accuracy rate compared to out-of-the-box OCR tools.
Its powerful features, like, automated import, one-click integrations into external software, lookup capabilities against external databases, as well as LLM-powered actions make it an ideal choice for businesses looking to automate their workflows.
Besides offering a user-friendly UI, it also offers comprehensive API endpoints and documentation that makes it easy-to-access.
Performance:
Out of the 6 documents we tested Nanonets API on, we got accurate results for all 6 documents.
- Accuracy: Nanonets offers pre-trained models for common documents, like, Invoices, Receipts, Passports, and bank statements, we tested these models with the sample set of documents and received flawlessly extracted data. For custom documents, like the handwritten legal document, the zero-training extractor worked perfectly. We were able to set up within seconds and the trial was available even on the free plan.
- Language support: Nanonets supports over 110+ languages, including Asian, African and European languages. We tested it with a Hindi-language invoice and the data extracted was perfect, including tables.
- Advanced Features:
- The checkbox detection feature within Nanonets worked well; it was able to identify the gender of the Juror as “female” with a checkmark next to it.
- The tables extracted were flawless, including multi line-items.
- Pricing: Nanonets caters to businesses of all sizes, with a flexible, pay-as-you-go pricing for individual use and SMEs, standard pro plan for mid-sized businesses and a custom plan available for enterprises.
- Pay-as-you-go plan: Offers first 500 steps for free, thereafter, it is charged at $0.3/data extraction step and $0.05/step for others.
- Pro plan: Subscription based plan that allows processing up to 10,000 pages for $999/month.
- Enterprise plan: Customised based on usage and features required.
For more details, visit here.
❗
Nanonets is ideal for businesses in industries like finance, insurance, and logistics. However, the platform’s flexibility comes at a higher cost, and users may need time to understand the extensive functionality.
6. Veryfi
Veryfi is another strong player in the IDP market, offering AI-powered OCR solutions tailored to businesses with high document processing needs. It specialises in automating the extraction of data from receipts, invoices, and other financial documents, making it particularly useful for accounting and expense management.
Veryfi offers real-time processing and integrates seamlessly with existing accounting and ERP systems.
Performance:
Out of the 6 documents, we tested the Veryfi API on, we got accurate results for 4 documents. The docs API was unavailable for public testing for the remaining 2 custom document types.
- Accuracy: Offers pre-trained models for common documents, like, invoices, receipts, bank statements, bank cheques, and W2 forms. The accuracy was concerning in a few documents, including Hindi-language invoice and the low-light receipt image. There were no pre-trained models available for legal documents or passports. The Docs API is powered by LLMs and their own AI models for custom documents but was unavailable for trial.
- Language support: Veryfi supports 35-40 languages, including Asian and European languages, but did not accurately extract the data from Hindi-language invoice.
- Advanced Features:
- Handled tabular extraction from bank statements well, including multi line-item support.
- Since there was no model available for a handwritten legal document, we are unsure of Veryfi being able to support checkbox detection.
- Pricing: Veryfi offers a free trial which allows users to process 100 documents / month. Beyond that, the plans are paid, with the following details:
- Pay-as-you-go plan: $500/month is the minimum charge to process up to 10,000 pages per month.
- Custom plan: Enterprise plans beyond 10,000 pages per month are priced on a custom basis, with no public information available.
❗
While Veryfi performs well in the financial document space, it is less flexible when it comes to processing non-financial documents. Businesses with broader OCR needs may find Nanonets or other AI-based tools to be more versatile.
Conclusion
The OCR landscape has evolved significantly over the past few years, with advancements in cloud computing, large language models, and AI-based IDP software driving innovation. Each OCR API discussed in this article offers unique strengths, whether it’s Google Vision and AWS Textract for enterprise-grade solutions, GPT-4 and Claude for more advanced, cognitive OCR use cases, or Nanonets and Veryfi for industry-specific document processing.
Depending on the size of your business, the types of documents you process, and your budget, one of these tools will likely be the best fit for your OCR API needs.