Python Image to Text: Simplifying Data Extraction

In today’s digital world, extracting text from images has become an essential task in various business operations, especially when dealing with invoices, receipts, and other types of documents. Optical Character Recognition (OCR) technology has made it possible to convert text from images into machine-encoded text, making it accessible and usable in various applications. In this article, we will explore how to perform this task efficiently using Python.

Setting Up the Environment

To get started with extracting text from images, we need a few essential components:

Tesseract: Tesseract is an open-source OCR engine that allows us to extract text from images;
Python Libraries: We’ll be using two Python libraries:

pytesseract: A wrapper for the Tesseract OCR engine;
Pillow: A library that adds image processing capabilities to Python.

Installation

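First, install Tesseract for your operating system. Windows users can find the latest Tesseract installer online: download the .exe file and run it. On macOS and most Linux distributions, Tesseract is usually available through the system package manager (for example, Homebrew's tesseract package, or the tesseract-ocr package on Debian and Ubuntu).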

If you haven’t already installed the required Python libraries, you can do so by opening the Command Prompt (on Windows) and using the following commands:

pip install pytesseract
pip install pillow
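Once both pieces are installed, it can help to confirm that Python can actually find the Tesseract executable. The sketch below is one possible check using only the standard library. The helper name find_tesseract and the fallback path list are illustrative assumptions, not part of pytesseract.

```python
import os
import shutil


def find_tesseract(fallback_paths=()):
    """Return the path to the Tesseract executable, or None if it cannot be found.

    find_tesseract is a hypothetical helper written for this article.
    """
    # First look for 'tesseract' on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try each fallback location in order.
    for candidate in fallback_paths:
        if os.path.isfile(candidate):
            return candidate
    return None


# The default Windows install location used later in this article (an assumption
# about where the installer placed the files on your machine).
tesseract_path = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
print(tesseract_path or "Tesseract not found; check the installation or PATH")
```

If the helper returns a path, that value is what pytesseract's tesseract_cmd setting should point to.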

Sample Images

For this tutorial, we will work with three sample images, each containing text. These images serve as our source for extracting text. You can use your own images by following the same approach.

Extracting Text from a Single Image

Let’s start by extracting text from a single image using Python. In this example, we will work with the first sample image, ‘sampletext1-ocr.png’.

Here is the code structure:

All images are placed in the ‘images’ folder;
The Python code is in ‘main.py’.

Now, we can extract text from the image using Python:

from PIL import Image
from pytesseract import pytesseract
# Define the path to tesseract.exe
path_to_tesseract = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Define the path to the image
path_to_image = 'images/sampletext1-ocr.png'
# Point pytesseract to tesseract.exe
pytesseract.tesseract_cmd = path_to_tesseract
# Open the image with PIL
img = Image.open(path_to_image)
# Extract text from the image
text = pytesseract.image_to_string(img)
print(text)

Running this code should display the extracted text from the image.
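The string returned by image_to_string may carry trailing whitespace, blank lines, or a form-feed page separator. A minimal clean-up sketch, using only the standard library, might look like this. The name clean_ocr_text is a hypothetical helper, and the sample string merely stands in for real OCR output.

```python
def clean_ocr_text(text):
    """Strip per-line whitespace and drop empty lines from OCR output."""
    # Remove form-feed characters that may mark page boundaries.
    text = text.replace("\f", "")
    # Trim each line and keep only the lines that still contain text.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Stand-in for the value returned by pytesseract.image_to_string(img).
sample = "  Invoice 1042  \n\n\nTotal: 18.50 \n\f"
print(clean_ocr_text(sample))
```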

Extracting Text from Multiple Images

In many scenarios, you may need to extract text from multiple images. To achieve this efficiently, we can use Python’s os library to access all the file names in a given directory and then iterate over them to extract text from each image:

import os
from PIL import Image
from pytesseract import pytesseract
# Define the path to tesseract.exe
path_to_tesseract = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Define the path to the images folder
path_to_images = 'images/'
# Point pytesseract to tesseract.exe
pytesseract.tesseract_cmd = path_to_tesseract
# Walk the directory tree and collect the file names
for root, dirs, file_names in os.walk(path_to_images):
    # Iterate over each file name in the current folder
    for file_name in file_names:
        # Build the full path from the folder being walked, so files in subfolders resolve correctly
        img = Image.open(os.path.join(root, file_name))
        # Extract text from the image
        text = pytesseract.image_to_string(img)
        print(text)

This code will extract text from all the images in the ‘images’ folder and display it.
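A folder often contains files that are not images, which would make Image.open raise an error. The sketch below shows one way to filter by file extension and collect the results per file. The names IMAGE_EXTENSIONS, list_image_files, and extract_all are assumptions made for illustration, and the OCR step is passed in as a function so the loop itself does not depend on any particular library.

```python
from pathlib import Path

# Extensions treated as images here; adjust to the formats you actually use.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_image_files(folder):
    """Return image paths under folder (including subfolders), sorted by path."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def extract_all(folder, ocr):
    """Map each image path (as a string) to the text returned by ocr(path)."""
    return {str(path): ocr(path) for path in list_image_files(folder)}
```

With pytesseract configured as in the example above, the ocr argument could be a small function such as lambda path: pytesseract.image_to_string(Image.open(path)).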

Comparison Table

Comparison Criteria	pytesseract	Tesseract
License	Open source (Apache 2.0)	Open source (Apache 2.0)
Language Support	Wide range	Extensive
Format Support	PNG, JPEG, GIF, etc.	Multiple formats
Ease of Use	Easy to set up	More complex setup
Performance	Depends on the image	Depends on the image
Community and Support	Active community	Active community
Documentation	Extensive	Extensive

This table summarizes how the Python wrapper and the underlying engine compare. Because pytesseract calls Tesseract under the hood, the two are used together rather than chosen between.

Conclusion

In this article, we’ve explored how to extract text from images using Python with the pytesseract wrapper, Pillow, and the Tesseract OCR engine. These tools make it possible to automate data extraction from images, which is useful in many industries, from digitizing invoices to processing scanned documents.

Whether you rely on pytesseract for its simple Python interface or on Tesseract’s extensive language support underneath it, the ability to convert images into machine-encoded text is a valuable skill for any Python developer.

FAQ

1. What is the main difference between pytesseract and Tesseract?

Pytesseract is a Python wrapper for Tesseract, making it more user-friendly and easier to integrate into Python applications. Tesseract is the underlying OCR engine that does the actual text extraction.

2. Can I use these libraries for non-English languages?

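Yes. Tesseract supports a wide range of languages through installable language data files, and pytesseract lets you select them through the lang argument of image_to_string, making the pair suitable for international applications.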
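As a rough sketch of how a language is selected: image_to_string accepts a lang argument holding Tesseract language codes, and several codes can be combined with a plus sign. The helper names below are hypothetical, and the example assumes the matching language data files are installed alongside Tesseract.

```python
def build_lang(codes):
    """Join Tesseract language codes into the 'eng+deu' form used by the lang argument."""
    return "+".join(code.strip() for code in codes if code.strip())


def ocr_with_languages(image_path, codes):
    """Run OCR on one image using the given language codes (hypothetical helper)."""
    # Imported here so build_lang can be used without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=build_lang(codes))


print(build_lang(["eng", "deu"]))  # prints "eng+deu"
```

On Windows, tesseract_cmd still needs to point at tesseract.exe, as in the earlier examples, before ocr_with_languages is called.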

3. Are there any limitations to text extraction from images?

While OCR technology has come a long way, it’s important to note that the accuracy of text extraction depends on image quality, font type, and language complexity. Complex fonts and low-quality images may result in errors.
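One way to spot unreliable output is to look at per-word confidence. pytesseract's image_to_data can return a dictionary with parallel 'text' and 'conf' lists, and the sketch below filters such a dictionary. The function name confident_words, the threshold of 60, and the sample dictionary are illustrative assumptions, not real OCR results.

```python
def confident_words(data, min_conf=60.0):
    """Return the words whose confidence is at least min_conf.

    data is expected to have parallel 'text' and 'conf' lists, as produced by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    words = []
    for word, conf in zip(data["text"], data["conf"]):
        # Confidence may arrive as a string or a number; -1 marks non-word boxes.
        if word.strip() and float(conf) >= min_conf:
            words.append(word)
    return words


# Hand-written stand-in for image_to_data output, not a real OCR result.
sample = {"text": ["", "Invoice", "N0.", "1042"], "conf": ["-1", 96, 41.5, "88"]}
print(confident_words(sample))  # prints ['Invoice', '1042']
```

A suitable threshold depends on the images involved, so it is worth checking a few documents by hand before settling on a value.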

4. Are there any alternatives to pytesseract and Tesseract?

Yes, there are other OCR libraries and services available, such as Google Cloud Vision API, Microsoft Azure Cognitive Services, and Amazon Textract. The choice depends on your specific requirements and budget.

5. How can I improve the accuracy of text extraction from images?

You can enhance accuracy by using high-resolution images, improving image preprocessing techniques (e.g., noise reduction), and selecting the appropriate language settings for the text in the image.
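A minimal preprocessing sketch along these lines, assuming a Pillow image as input: convert to grayscale, enlarge, then binarize with a fixed threshold. The function names and the default scale and threshold values are arbitrary starting points chosen for illustration, not settings recommended by Tesseract, and real images may need different values or other techniques.

```python
def preprocess_for_ocr(img, scale=2, threshold=150):
    """Return a grayscale, enlarged, black-and-white copy of a Pillow image."""
    # Convert to 8-bit grayscale so each pixel is a single brightness value.
    gray = img.convert("L")
    # Enlarge the image; small text is often easier to recognize when upscaled.
    width, height = gray.size
    enlarged = gray.resize((width * scale, height * scale))
    # Binarize: pixels brighter than the threshold become white, the rest black.
    return enlarged.point(lambda p: 255 if p > threshold else 0)


def ocr_preprocessed(image_path, lang="eng"):
    """Preprocess one image and run OCR on it (hypothetical helper)."""
    # Imported here so preprocess_for_ocr stays usable with any Pillow image.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(preprocess_for_ocr(img), lang=lang)
```

Comparing the output of ocr_preprocessed with a plain image_to_string call on the same file is a simple way to see whether the preprocessing actually helps for a given kind of document.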