
Enhanced Document Search: Beyond OCR with AI-Language Models


In today’s data-driven world, organizing and searching through vast quantities of documents, each with unique formats and content structures, can be daunting. For many organizations, traditional optical character recognition (OCR) is the go-to solution for extracting text from document images. Paper-based processes demand significant time and storage space to manage properly. While transitioning to a paperless environment is ideal, scanning physical documents into digital formats presents its own challenges. Scanning typically requires manual effort, which can be slow and labour-intensive, and the digitized documents are usually saved as image files with the text embedded in the image. Word processing tools cannot process or edit these text-containing images the way they handle regular text documents. Hence, organizations use optical character recognition (OCR) technology to convert the text within images into machine-readable data that can be used in business applications.

 


 

How does OCR technology work?

OCR (Optical Character Recognition) operates through several key stages. 

 

Image Acquisition

The process begins with a scanner, which digitizes physical documents by converting them into binary data. The OCR software then processes the scanned image, distinguishing between light areas (background) and dark areas (text).

 

Preprocessing

Before extracting text, the OCR software enhances the image for better accuracy. It performs several corrective steps:

  • Deskewing: Slightly adjusting the document's alignment to fix any tilting introduced during scanning.
  • Despeckling: Removing unwanted digital artifacts and smoothing rough edges around the text.
  • Cleaning up lines and boxes: Eliminating irrelevant marks or structural elements.
  • Script recognition: Identifying and handling multiple languages or scripts in the text.

 

Text Recognition

The OCR software uses two primary methods for recognizing text: pattern matching and feature extraction.

  • Pattern Matching: The software isolates each character, known as a "glyph," and compares it with stored representations. This works best when the font and size of the scanned text match those in the stored database.
  • Feature Extraction: This method breaks down glyphs into their essential components (lines, loops, intersections) and then matches these features to stored glyphs. It works well for a variety of fonts and text styles.

 

Postprocessing

After recognition, the OCR system converts the detected text into a digital format, such as a word processor document or PDF. Some OCR tools also allow the creation of annotated PDFs, showing both the original scanned document and the converted text side by side.
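To make the pipeline concrete, here is a minimal sketch of basic OCR extraction using the open-source Tesseract engine via the pytesseract Python wrapper; the file name is a placeholder and Tesseract itself must be installed locally:

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR engine to be installed locally

# Load a scanned document image and extract its text as a plain string.
image = Image.open("scanned_document.png")  # placeholder file name
text = pytesseract.image_to_string(image)

print(text)
```

This yields raw text only; any structure in the document (fields, labels, tables) still has to be reconstructed separately, which is exactly where the limitations below start to bite.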

 

Limitations of OCR technology 

 

However, OCR struggles with diverse document types and complex natural language queries. Its accuracy depends heavily on the quality of the input image: if the document is of low quality or imperfect, OCR may misinterpret the text, producing errors that are hard to fix without reprocessing the document multiple times. OCR can also be slow, since it must process each image individually to extract text, which becomes a bottleneck for large documents when speed is essential.

Moreover, OCR software can be expensive and may work well for only some document types, adding to the cost. It can fail to recognize certain punctuation marks, especially those that are small, non-contiguous, or inverted. Unlike LLM-powered OCR systems, traditional OCR is also prone to misinterpreting characters; for example, it can mistake the lowercase "l" for the number "1" or confuse "b" with "8", inaccuracies that may require extensive proofreading and can alter the meaning of the text. It also struggles to preserve the original formatting of the document.

As a result, the output may be hard to read or understand because fonts, spacing, or structure are lost. If the OCR software does not support the document's language, it may not recognize the text correctly, and right-to-left scripts such as Arabic and Hebrew, as well as scripts such as Japanese and Chinese, are particularly challenging.

 

Reasons for choosing LLMs to enhance OCR tasks

 

Unlike conventional optical character recognition technology, which typically relies on rigid templates, LLMs are designed to interpret the meaning and context of text, so they can adapt to various document formats with greater accuracy and flexibility. One of the primary advantages of AI language models is their ability to learn effectively from relatively small datasets: even when limited data is available, LLM-powered OCR systems can perform impressively by understanding the context and nuances of language. This reduces the reliance on strict template matching and makes document processing more versatile and adaptable to a broader range of document types and structures. AI language models also contribute to greater flexibility and robustness in OCR systems: traditional OCR methods struggle with documents that don't follow standard templates, whereas LLMs' contextual learning and adaptability enable accurate recognition across a wide range of document types. This is why I explored using LLMs to enhance OCR tasks and to enable a more dynamic, versatile, global AI-powered document search. Here’s a journey through the roadblocks, insights, and conclusions of my research in this space.

 

Problem Context

 

My goal was to create a global document search feature to extract and search for document details based on various data points, like names, ID numbers, and addresses. This would enable us to locate documents by natural language queries across different document types, such as PAN cards, Aadhaar cards, and vehicle registration certificates. For example:

  • For a PAN card, a search should be possible using the PAN number, holder’s name, or even the father’s name.
  • For an Aadhaar card, we should be able to search by Aadhaar number, name, or address.
  • For a vehicle RC book, we’d want to search by model number, colour, seating capacity, and other attributes.

Given the diversity of document types and the different kinds of information within each document, relying solely on traditional OCR for this task would be inefficient. Each document type would require custom logic to parse relevant data and organize it for search functionality. Instead, leveraging LLMs to extract and return data from these documents in a standardized JSON format seemed ideal.
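For illustration, this is the kind of standardized JSON I had in mind for a PAN card; the field names and values here are hypothetical, not output from any specific model:

```json
{
  "document_type": "pan_card",
  "pan_number": "AAAPA1234A",
  "name": "Diya Sharma",
  "father_name": "Rakesh Sharma",
  "date_of_birth": "1980-01-15"
}
```

With every document type mapped into a structure like this, a single search index can answer queries across PAN cards, Aadhaar cards, and RC books alike.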

 

Why Move Beyond OCR?

 

Optical character recognition technology is excellent at extracting raw text from images, but it lacks the intelligence to understand context or organize data meaningfully. For a global document search feature:

  • Diverse Document Types: Handling varied documents (e.g., ID cards, certificates, invoices) with unique structures is challenging for OCR. Each document would require specialized parsing logic.
  • Structured JSON Output: I wanted a JSON format that consistently extracts details with meaningful key names (like name, date_of_birth, etc.), which OCR cannot achieve independently.
  • Natural Language Queries: Supporting searches with phrases like “find Diya’s PAN card” or “show Aadhaar cards issued in 1980” requires understanding content contextually.

This led me to explore various vision-language models that could achieve this structured data extraction.

 

Initial Prompt and Model Testing

 

To achieve the structured JSON output, I devised a simple yet powerful prompt:

“Please parse as much data from the document as possible into JSON. Use meaningful key names in the JSON and return it in a valid format.”

This prompt ensured the model returned data with intuitive keys and covered as much information as possible. Here’s a rundown of the models I tested and my observations:
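Most of the providers I tested expose an OpenAI-compatible chat API, so a test harness can look roughly like the following sketch; the base URL, model name, and file name are placeholders rather than the exact setup I used:

```python
import base64
from openai import OpenAI

# Any OpenAI-compatible provider (Together.ai, OpenRouter, Deepinfra, ...) can sit behind base_url.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_API_KEY")

PROMPT = (
    "Please parse as much data from the document as possible into JSON. "
    "Use meaningful key names in the JSON and return it in a valid format."
)

with open("pan_card.jpg", "rb") as f:  # placeholder document image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",  # example model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```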

 

Models Tested

 

I evaluated a series of models that are significantly cheaper than GPT-4:

  • Mistral/pixtral-12b-2409
  • Meta-LLaMA/Llama-3.2-90B-Vision-Instruct-Turbo
  • Nousresearch/nous-hermes-2-vision-7b
  • Meta-LLaMA/Llama-3.2-11B-Vision-Instruct-Turbo
  • Qwen/Qwen2.5-72B-Instruct-Turbo
  • Meta-LLaMA/Llama-Vision-Free

While these models had potential, none provided the consistent, clean JSON output I sought. Common issues included:

  • Partial Data Extraction: Many models missed crucial details, like document numbers or names.
  • Format Inconsistencies: Despite instructing the models to return JSON, outputs often included extraneous text or did not strictly follow the JSON structure (a small parsing workaround is sketched below).
  • Limited Contextual Understanding: Many models struggled to adapt their extraction logic to different document types, especially when dealing with details like addresses or unique identifiers.
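One workaround for the formatting problems was to treat the model output as untrusted and pull the JSON out defensively. A small helper along these lines (my own sketch, not a library function) salvages many responses that would otherwise fail a direct json.loads call:

```python
import json
import re

def extract_json(raw: str):
    """Pull the first JSON object out of a model response that may
    contain surrounding prose or Markdown code fences."""
    cleaned = re.sub(r"`{3}(?:json)?", "", raw)      # drop code fences if present
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None                                  # no JSON object found
    try:
        return json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None                                  # malformed JSON
```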

Given the high costs of GPT-4, I continued exploring alternatives, testing various models across providers like Together.ai, OpenRouter, and Deepinfra without satisfactory results.

 

A Promising Discovery: DeepSeek Chat

 

I encountered DeepSeek Chat during this process, and my initial tests looked promising. The model extracted data accurately from several document types, providing results that aligned well with the prompt’s JSON format requirements.

 

Overview of the DeepSeek API

The DeepSeek API is a robust tool that boosts the functionality of various applications by providing advanced data retrieval and processing features. It easily integrates with multiple platforms and frameworks, offering flexibility for developers who want to utilize cutting-edge AI models. 

Exploring DeepSeek Models

DeepSeek provides a variety of models, each suited to different tasks and with distinct advantages. For example, the DeepSeek Coder series is specially designed for coding-related tasks. These models, ranging from 1.3B to 33B parameters, have been pre-trained on 2 trillion tokens in 80 different programming languages, making them highly proficient at tasks like code completion and infilling.

Unfortunately, while the playground tests were successful, no API was available to integrate this model directly into production. Despite extensive troubleshooting, I couldn’t find a feasible way to connect with DeepSeek Chat via API.

 

Enter Google’s Gemini 1.5 Pro and Flash Models

 

Balancing performance and cost has often been a challenge that limits the broader adoption of AI models. However, Google’s new Gemini 1.5 Flash-8B model has set a remarkable standard by combining impressive performance with affordability. 

As my search continued, I tested Google’s Gemini 1.5 Pro model on Google AI Studio. This model provided results that exceeded my expectations, with clean, well-organized JSON outputs and comprehensive data extraction. However, the model’s cost was a concern.

  • Input Cost: $1.25 per million tokens
  • Output Cost: $5.00 per million tokens

This was a substantial expense, especially considering the high volume of intelligent document processing I anticipated. Fortunately, I discovered the Gemini 1.5 Flash-8B model, a smaller, faster variant of the Flash model. With 8 billion parameters, it is optimized for high-volume, latency-sensitive workloads and can process large batches of documents quickly at a fraction of the cost.

  • Token Capacity: Like the standard Flash model, 1.5 Flash-8B can manage up to 1,048,576 input tokens and 8,192 output tokens, which is ample for multi-page documents.
  • Key Features: 1.5 Flash-8B retains core functionalities from the Flash model, such as function calling, JSON mode, and adjustable safety settings.
  • Rate Limits: It adheres to the same rate limits as the Flash model, offering scalability for demanding use cases.

Here’s why it proved to be a game-changer for me:

 

Benefits of Gemini 1.5 Flash 8B

 

1. Consistent and Accurate JSON Output: The Flash model consistently returned clean JSON outputs that included all expected details.

2. Multiple Images per Request: It allowed the processing of multiple images within a single API call, which is ideal for documents spanning numerous pages.

3. Support for PDFs: I could extract comprehensive data across all document pages by splitting PDFs into individual pages and uploading them in a single request; a sketch of this flow follows below.
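Putting the pieces together, here is a minimal sketch of how a multi-page PDF can be split into page images and sent to Gemini 1.5 Flash-8B in a single request with JSON output enforced. It assumes the google-generativeai and pdf2image packages (the latter needs Poppler installed); the file name is a placeholder:

```python
import google.generativeai as genai
from pdf2image import convert_from_path  # needs Poppler installed locally

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash-8b")

PROMPT = (
    "Please parse as much data from the document as possible into JSON. "
    "Use meaningful key names in the JSON and return it in a valid format."
)

# Split the multi-page PDF into one PIL image per page.
pages = convert_from_path("vehicle_rc_book.pdf", dpi=200)  # placeholder file

# Send the prompt plus every page image in a single call,
# asking the model to respond with JSON only.
response = model.generate_content(
    [PROMPT, *pages],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json"
    ),
)

print(response.text)  # structured JSON covering all pages
```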

 

Cost-Effectiveness of Gemini 1.5 Flash 8B

A significant advantage of Gemini 1.5 Flash 8B is its cost-effectiveness.

  • Input Tokens: $0.0375 per million tokens
  • Output Tokens: $0.15 per million tokens

This pricing is substantially lower than models like GPT-4, making it a viable option for large-scale intelligent document processing tasks.
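To put that difference in perspective, here is a rough back-of-the-envelope comparison; the document volume and per-document token counts are assumptions for illustration, not figures from this project:

```python
docs = 10_000                  # hypothetical number of documents
in_tok, out_tok = 1_500, 500   # assumed tokens per document

flash_8b = (docs * in_tok / 1e6) * 0.0375 + (docs * out_tok / 1e6) * 0.15
pro      = (docs * in_tok / 1e6) * 1.25   + (docs * out_tok / 1e6) * 5.00

print(f"Gemini 1.5 Flash-8B: ${flash_8b:.2f}")  # ≈ $1.31
print(f"Gemini 1.5 Pro:      ${pro:.2f}")       # ≈ $43.75
```

Under those assumed volumes, Flash-8B processes the entire batch for just over a dollar, while Gemini 1.5 Pro would cost over thirty times as much.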

 

Example Output from Gemini 1.5 Flash 8B

 

To demonstrate the efficacy of Gemini 1.5 Flash 8B, here’s a JSON output example generated from an Aadhaar card document:

 

Slightly blurred Aadhaar card (input image)

 

JSON output example generated from an Aadhaar card document

 

Here’s a JSON output example generated from an RC Book document:

 

RC Book document (input image)

 

JSON output example generated from an RC Book document

 

The model’s output included:

  • Document Type Detection: The document was correctly identified as an Aadhaar card.
  • Structured Personal Information: Name, date of birth, and gender were extracted under a dedicated key.
  • Additional Information: Hindi text and supplementary notes were also captured, content that would have required manual adjustment with traditional OCR (an illustrative reconstruction follows below).
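Since the screenshots above don't reproduce well in text, here is an illustrative reconstruction of the shape of that Aadhaar card output; every value is a placeholder, not the model's verbatim response:

```json
{
  "document_type": "aadhaar_card",
  "personal_information": {
    "name": "Diya Sharma",
    "date_of_birth": "1980-01-15",
    "gender": "Female"
  },
  "aadhaar_number": "XXXX XXXX 1234",
  "address": "123, MG Road, Ahmedabad, Gujarat - 380001",
  "additional_information": {
    "hindi_text": "भारत सरकार"
  }
}
```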

 

Conclusion

 

Gemini 1.5 Flash 8B’s efficiency, accuracy, and flexibility made it a top choice for our global AI-powered document search project. While cheaper models couldn’t deliver the precision required, the Flash model offered an excellent balance between cost-effectiveness and quality. Leveraging this model enabled us to create a robust solution that meets diverse document search needs without needing OCR-specific, document-by-document customization. 

 

Final Thoughts

 

Implementing global document search using LLMs opens up exciting possibilities. By bypassing traditional OCR limitations, we can now handle various documents with varied content and complex natural language queries. In this process, I learned the importance of testing multiple models, understanding each model’s limitations, and weighing cost considerations against performance. Although this journey presented challenges, finding the right tool enabled a scalable and versatile document search solution.

 

At Webelight Solutions Pvt. Ltd., we specialize in pushing the limits of what's possible with AI and machine learning. From custom AI solutions tailored to your needs to cutting-edge applications like AI-powered document search, facial recognition, and predictive modelling, we are committed to driving the AI revolution forward. Our team can help you integrate AI into your business workflows, increasing your ROI and setting you apart from the competition.

 

Let AI do the heavy lifting while you sit back and watch the magic happen! Contact us today for more innovative AI solutions.

Aiyaj Khalani

Tech Lead & DevOps Enthusiast

Aiyaj is the kind of tech mind who thrives on building seamless systems and tackling challenges head-on. A natural problem-solver, he’s happiest when fine-tuning processes and diving into the intricacies of DevOps. While he’s all about efficiency, don’t be surprised to see him experiment with new tech or streamline a setup just for the joy of it.