How to implement OCR (Optical Character Recognition) in Angular using Tesseract.js

Author: Vishnu Surendran

Published on: April 26, 2024

Greetings Angular enthusiasts! If you’ve ever wondered how to extract text from images within the realm of Angular seamlessly, you’re in the right place. In this blog, I'm about to embark on an exciting journey that demystifies the process of incorporating text extraction functionalities into your Angular applications.

Images speak volumes, but what if we could make them articulate their contents? This blog is tailored for developers eager to explore the world of Optical Character Recognition (OCR) in the Angular ecosystem. From deciphering product labels to extracting valuable information from images, we’re about to unravel the simplicity of integrating text extraction features into your Angular applications.

I’ll guide you through the steps, explore handy libraries, and showcase practical examples that empower you to bring text extraction capabilities to your Angular projects. No need to be a coding wizard or an OCR expert — we’re here to make this journey accessible and enjoyable for everyone.

Whether you’re building a document management system, creating a receipt-scanning application, or just eager to enhance your Angular skills, join us as we navigate the landscape of text extraction in Angular. Let’s turn every image into a source of information, making your applications visually appealing and functionally brilliant.

So, fasten your seatbelts, Angular enthusiasts, as we dive into the world of extracting text from images using the power of Angular.

What is OCR (Optical Character Recognition)?

OCR stands for Optical Character Recognition. It is a technology that converts different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data.

The primary purpose of OCR is to recognize and extract text from these documents, allowing computers and software to understand and manipulate the textual content within images or scanned pages. OCR systems analyze the shapes and patterns of characters in an image, recognizing them as individual letters, numbers, or symbols.

Here’s a breakdown of how OCR works:

Image Acquisition: The process starts with obtaining an image or a scanned document that contains text.
Preprocessing: The image undergoes preprocessing to enhance its quality, which may involve tasks like noise reduction, contrast adjustment, or binarization (converting the image to black and white).
Text Detection: OCR algorithms identify the regions of the image where text is present, outlining the boundaries of individual characters and words.
Character Recognition: This is the core of OCR. The system analyzes the identified text regions, recognizing and classifying the characters into readable text. Machine learning models or pattern recognition algorithms are often used to improve accuracy.
Post-processing: After recognizing the characters, the OCR system may perform additional processing to improve the accuracy of the recognized text. This could include spell-checking, correcting errors, or formatting.
Output: The final output is the extracted and recognized text, which can be stored, edited, or used for further analysis.

OCR technology has various applications, including digitizing printed documents, automating data entry, extracting information from images for search engines, enabling accessibility features for visually impaired individuals, and much more. It plays a crucial role in transforming non-editable content into machine-readable and searchable formats.

What is Tesseract.js ?

Tesseract.js is a JavaScript library that brings the power of Optical Character Recognition (OCR) to web applications. It is a browser-based adaptation of Tesseract, an open-source OCR engine developed by Google. Tesseract.js allows developers to perform text recognition directly within web browsers, enabling the extraction of text from images or scanned documents without the need for server-side processing.

Tesseract.js enables OCR functionality directly in the client’s browser, utilizing the hardware resources available on the user’s device. The underlying OCR engine is Tesseract, which has a robust and proven track record for accurate text recognition. Tesseract supports multiple languages and can handle a variety of font types.Tesseract.js, like its parent Tesseract, is open-source, making it freely available for developers to use, modify, and contribute to. Being a JavaScript library, Tesseract.js is cross-platform and can be used in web applications across different devices and operating systems. Developers can easily integrate Tesseract.js into their web projects using JavaScript, making it accessible for a wide range of applications, from simple web pages to more complex web applications.

Tesseract.js is extensible and supports additional languages and custom training for improved accuracy in recognizing specific patterns or fonts. In summary, Tesseract.js is a valuable tool for web developers looking to integrate OCR capabilities directly into their web applications, bringing text recognition functionalities to the client side and providing a seamless user experience.

How to install Tesseract.js?

Tesseract.js works with a <script> tag via local copy or CDN, with webpack via npm and on Node.js with npm/yarn.

<!-- v5 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>

For Node.js (Requires v14 or higher)

# For latest version
npm install tesseract.js
yarn add tesseract.js

# For old versions
npm install tesseract.js@3.0.3
yarn add tesseract.js@3.0.3

Working with Tesseract.js

You can add the logic inside the app.component.ts file and implement UI in app.component.html. Inside the app.component.ts you can import the tesseract.js using the given code below.

const { createWorker } = require('tesseract.js');

The above code imports the createWorker factory that is responsible for creating web workers in the browser. Web workers allow your app to run workloads off the main thread as background scripts. This increases performance and allows the user interface to work smoothly without freezing.

Now we can create an async function for initializing and recognizing the images using tesseract.js.

export class AppComponent implements OnInit {
  text: string;
  ngOnInit(){
  }

  async recognizeText(path: string){
    const worker = await createWorker('eng', 1, {
      logger: m => console.log(m), // Add logger here
    });
    const { data: { text } } = await worker.recognize(path);
    console.log(text);   
    this.text = text;
    await this.worker.terminate();
  }
}

Finally, let’s create a function in the app.component.ts that will be responsible for receiving the user-uploaded file/image through the user interface. Then the document will be passed to the recognizeText(path) and the text will be extracted.

onFileSelected(event: any) {
    const file:File = event.target.files[0];
    if (file) {
      const url:string = URL.createObjectURL(file);
      this.recognizeText(url)
    }
}

Now, let’s implement a simple user interface for uploading the image/file in the app.component.html

<div class="container">
    <div>
      <input type="file" (change)="onFileSelected($event)">
    </div>
    <div>
      {{text}}
    </div>
</div>

Now, let’s see the results.

Conclusion

OCR is a really powerful technology that can be used the make digital documents from physical documents such as invoices for the ease of storing and securing the data from risks like fire, flood, etc. Tesseract.js is the best OCR library you can use to implement character recognition in the browser level. For more information, you can visit the GitHub repo of Tessseract.js and, you can find plenty of examples here

References

https://github.com/naptha/tesseract.js

ocr

tesseract

angular

web-development

javascript