diff --git a/README.org b/README.org
index 0ebdcb6..65484b1 100644
--- a/README.org
+++ b/README.org
@@ -49,7 +49,7 @@ PDF=$1
 python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
 cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
 cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
-cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
+cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
 
 for image in $(cat /tmp/extracted-tables.txt); do
     dir=$(dirname $image)
@@ -57,6 +57,7 @@ for image in $(cat /tmp/extracted-tables.txt); do
 done
 #+END_SRC
+
 The package was written in a [[https://en.wikipedia.org/wiki/Literate_programming][literate programming]] style. The source code at [[https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html]] is meant to act as the documentation and reference material.
diff --git a/ocr_tables b/ocr_tables
index 25e936a..5f2e413 100755
--- a/ocr_tables
+++ b/ocr_tables
@@ -5,8 +5,7 @@ PDF=$1
 python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
 cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
 cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
-cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
-
+cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
 for image in $(cat /tmp/extracted-tables.txt); do
     dir=$(dirname $image)
     python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
diff --git a/pdf_table_extraction_and_ocr.html b/pdf_table_extraction_and_ocr.html
index be38955..5126f4d 100644
--- a/pdf_table_extraction_and_ocr.html
+++ b/pdf_table_extraction_and_ocr.html
@@ -3,7 +3,7 @@
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
+
 PDF Parsing
@@ -225,94 +225,98 @@

Table of Contents

-
-

1 Overview

+
+

1 Overview

This Python package provides utilities for extracting tabular data from PDF @@ -359,17 +363,17 @@ The package is split into modules with narrow focuses.

  • pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
  • extract_tables finds and extracts table-looking things from an image.
  • extract_cells extracts and orders cells from a table.
  • -
  • ocr_image uses Tesseract to turn a OCR the text from an image of a cell.
  • +
  • ocr_image uses Tesseract to OCR the text from an image of a cell.
  • ocr_to_csv converts into a CSV the directory structure that ocr_image outputs.
  • -
    -

    1.1 Requirements

    +
    +

    1.1 Requirements

    -
    -

    1.1.1 Python packages

    +
    +

    1.1.1 Python packages

    • numpy
    • @@ -379,8 +383,8 @@ The package is split into modules with narrow focuses.
    -
    -

    1.1.2 External

    +
    +

    1.1.2 External

    • pdfimages from Poppler
    • @@ -391,8 +395,8 @@ The package is split into modules with narrow focuses.
    -
    -

    1.2 Contributing

    +
    +

    1.2 Contributing

    This package was created in a literate programming style with the help of Babel. @@ -405,8 +409,8 @@ barrier for contributors who aren’t already familiar with Emacs and Babel.

    -
    -

    1.3 Example usage

    +
    +

    1.3 Example usage

    Here is an example of a shell script that uses each module to turn a pdf with a @@ -424,26 +428,29 @@ you need into your own python projects and use them as needed.

    -
    #!/bin/sh
    +
    #!/bin/sh
     
     PDF=$1
     
     python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
     cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {}  | grep table > /tmp/extracted-tables.txt
     cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
    -cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
    -
    +cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
     for image in $(cat /tmp/extracted-tables.txt); do
         dir=$(dirname $image)
         python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
     done
     
    + +

    +Any extra arguments you pass after the image path to python -m table_ocr.ocr_image are passed directly to tesseract as options. If you don’t pass anything, reasonable English defaults are used. +
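As a rough illustration of that pass-through, the sketch below separates the image path from the remaining command-line arguments with argparse's parse_known_args. How the package's own __main__ module splits them is not shown in this hunk, so split_args is a hypothetical helper and the example path is made up.

```python
import argparse


def split_args(argv):
    """Split argv into the image path and the leftover Tesseract options.

    Hypothetical helper for illustration; not part of table_ocr.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("image", help="filepath of image to perform OCR")
    # parse_known_args returns (namespace, list_of_unrecognized_args)
    # instead of exiting when it meets options it does not know.
    args, tess_args = parser.parse_known_args(argv)
    return args.image, tess_args


image, tess_args = split_args(["cells/000-000.png", "--psm", "7", "-l", "eng"])
print(image)
print(tess_args)
```

Everything argparse does not recognize stays in order in the second return value, so it can be handed to tesseract untouched.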

    -
    -

    1.4 Possible improvements

    +
    +

    1.4 Possible improvements

    Detect text with the stroke-width-transform algorithm. https://zablo.net/blog/post/stroke-width-transform-swt-python/index.html @@ -452,8 +459,8 @@ Detect text with the stroke-width-transform algorithm. -

    2 Preparing data

    +
    +

    2 Preparing data

    Not all pdfs need to be sent through OCR to extract the text content. If you can @@ -462,27 +469,27 @@ probably aren’t necessary.

    -
    -

    2.1 Converting PDFs to images

    +
    +

    2.1 Converting PDFs to images

    This code calls out to pdfimages from Poppler.
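pdfimages does not report which files it wrote, so the caller has to look for them afterwards. Here is a minimal sketch of that lookup, assuming the `<root>-NNN.png` naming that `pdfimages -png` produces. find_numbered_pngs is a hypothetical stand-in, not the package's own find_matching_files_in_dir, and the filenames are made up.

```python
import os
import re
import tempfile


def find_numbered_pngs(root, directory):
    """Return sorted filenames in directory that look like '<root>-NNN.png'."""
    # Assumption: pdfimages names its output '<root>-000.png', '<root>-001.png', ...
    pattern = re.compile(r"^{}-\d{{3}}\.png$".format(re.escape(root)))
    return sorted(f for f in os.listdir(directory) if pattern.match(f))


# Simulate a directory that pdfimages has already written into.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["report-001.png", "report-000.png", "report.pdf", "other-000.png"]:
        open(os.path.join(tmp, name), "w").close()
    found = find_numbered_pngs("report", tmp)
    print(found)
```

Sorting the matches keeps the pages in order, because the zero-padded numbers sort lexicographically.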

    -
    # Wrapper around the Poppler command line utility "pdfimages" and helpers for
    +
    # Wrapper around the Poppler command line utility "pdfimages" and helpers for
     # finding the output files of that command.
     def pdf_to_images(pdf_filepath):
         """
         Turn a pdf into images
    +    Returns the filenames of the created images sorted lexicographically.
         """
         directory, filename = os.path.split(pdf_filepath)
    -    with working_dir(directory):
    -        image_filenames = pdfimages(pdf_filepath)
    +    image_filenames = pdfimages(pdf_filepath)
     
         # pdfimages creates a number of files, each named for its page number,
         # and doesn't return the list of files that it created.
    -    return [os.path.join(directory, f) for f in image_filenames]
    +    return sorted([os.path.join(directory, f) for f in image_filenames])
     
     
     def pdfimages(pdf_filepath):
    @@ -495,8 +502,14 @@ This code calls out to     uses 3 digits in its regex.
         """
         directory, filename = os.path.split(pdf_filepath)
    +    if not os.path.isabs(directory):
    +        directory = os.path.abspath(directory)
         filename_sans_ext = filename.split(".pdf")[0]
    -    subprocess.run(["pdfimages", "-png", pdf_filepath, filename.split(".pdf")[0]])
    +
    +    # pdfimages outputs results to the current working directory
    +    with working_dir(directory):
    +        subprocess.run(["pdfimages", "-png", filename, filename.split(".pdf")[0]])
    +
         image_filenames = find_matching_files_in_dir(filename_sans_ext, directory)
         logger.debug(
             "Converted {} into files:\n{}".format(pdf_filepath, "\n".join(image_filenames))
    @@ -516,8 +529,8 @@ This code calls out to 
    -

    2.2 Detecting image orientation and applying rotation.

    +
    +

    2.2 Detecting image orientation and applying rotation.

    Tesseract can detect orientation and we can then use ImageMagick’s mogrify to @@ -546,19 +559,29 @@ to correct the rotation. This makes OCR more straightforward.
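To make the detection step concrete, here is a sketch of pulling the rotation out of an orientation and script detection (OSD) report. The sample text is an illustrative report written for this example, not captured output. The sketch assumes the report contains a line beginning with `Rotate:`, and parse_rotate is a hypothetical helper rather than the package's get_rotate.

```python
# Illustrative OSD report; the numbers are invented for this example.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 12.34
Script: Latin
Script confidence: 5.67
"""


def parse_rotate(osd_text):
    """Return the 'Rotate:' value from an OSD report as an int, or 0 if absent."""
    for line in osd_text.split("\n"):
        if line.startswith("Rotate:"):
            return int(line.split(":", 1)[1].strip())
    return 0


rotate = parse_rotate(SAMPLE_OSD)
print(rotate)
```

The resulting number is what would then be handed to mogrify as the rotation angle.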

    -
    def preprocess_img(filepath):
    -    """
    -    Processing that involves running shell executables,
    +
    def preprocess_img(filepath, tess_params=None):
    +    """Processing that involves running shell executables,
         like mogrify to rotate.
    +
    +    Uses tesseract to detect rotation.
    +   
    +    Orientation and script detection is only available for legacy tesseract
    +    (--oem 0). Some versions of tesseract will segfault if you let it run OSD
    +    with the default oem (3).
         """
    -    rotate = get_rotate(filepath)
    +    if tess_params is None:
    +        tess_params = ["--psm", "0", "--oem", "0"]
    +    rotate = get_rotate(filepath, tess_params)
         logger.debug("Rotating {} by {}.".format(filepath, rotate))
         mogrify(filepath, rotate)
     
     
    -def get_rotate(image_filepath):
    +def get_rotate(image_filepath, tess_params):
    +    """
    +    """
    +    tess_command = ["tesseract"] + tess_params + [image_filepath, "-"]
         output = (
    -        subprocess.check_output(["tesseract", "--psm", "0", image_filepath, "-"])
    +        subprocess.check_output(tess_command)
             .decode("utf-8")
             .split("\n")
         )
    @@ -575,8 +598,8 @@ to correct the rotation. This makes OCR more straightforward.
     
    -
    -

    3 Detecting tables

    +
    +

    3 Detecting tables

    This answer from opencv.org was heavily referenced while writing the code around @@ -597,7 +620,7 @@ that makes things like shape detection more accurate.

    -
    def find_tables(image):
    +
    def find_tables(image):
         BLUR_KERNEL_SIZE = (17, 17)
         STD_DEV_X_DIRECTION = 0
         STD_DEV_Y_DIRECTION = 0
    @@ -680,8 +703,8 @@ cv2.imwrite("resources/examples/example-table.png"
     
    -
    -

    3.1 Improving accuracy

    +
    +

    3.1 Improving accuracy

    It’s likely that some images will contain tables that aren’t accurately @@ -702,8 +725,8 @@ the x, y, width and height are anywhere in that range.

    -
    -

    4 OCR tables

    +
    +

    4 OCR tables

    Tesseract does not perform well when run on images of tables. It performs best @@ -720,9 +743,13 @@ We’ll start with an image shown at the end of the previous section.

    -
    -

    4.1 Training Tesseract

    +
    +

    4.1 Training Tesseract

    +

    +Tesseract is used for recognizing characters. It is not involved in extracting the tables from an image or in extracting cells from the table. +

    +

    It’s a very good idea to train tesseract. Accuracy will improve tremendously.

    @@ -732,7 +759,7 @@ Clone the tesstrain repo at

    -Run the ocr_tables script on a few pdfs to generate some training data. That +Run the ocr_tables script on a few pdfs to generate some training data. That script outputs pairs of .png and .gt.txt files that can be used by tesstrain.

    @@ -772,10 +799,70 @@ Once the training is complete, there will be a new file Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.

    + +
    +

    4.1.1 Training tips

    +
    +

    +Here is a tip for quickly creating training data. +

    + +

    +The output of the ocr_image module will be a directory named ocr_data that +will have two files for each cell. One file is the image of the cell and the +other file is the OCR text. +

    + +

    +You’ll want to compare each image to its OCR text to check for accuracy. If +the text doesn’t match, you’ll want to update the text and add the image to the +training data. +

    + +

    +The fastest way to do this is with feh. +

    + +

    +feh lets you view an image and a caption at the same time and lets you edit +the caption from within feh. +

    + +

    +feh expects the captions to be named <image-name>.txt, so use a little +shell-fu to do a quick rename. +

    + +
    +
    for f in *.txt; do f1=$(cut -d"." -f1 <(echo $f)); mv $f ${f1}.png.txt; done
    +
    -
    -

    4.2 Blur

    +

    +Then run feh -K . to specify the current directory as the caption directory. +This will open a window with the first image in the directory and its caption. +

    + +

    +Press c to edit the caption (if needed) and n/p to move to the +next/previous image. Press q to quit. +

    + +

    +When finished, rename the files back to the filename structure that Tesseract +looks for in training. +

    + +
    +
    for f in *.txt; do f1=$(cut -d"." -f1 <(echo $f)); mv $f ${f1}.gt.txt; done
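If you would rather not rely on shell features such as process substitution, the same renames can be sketched in Python. swap_caption_suffix is a hypothetical helper. Like the shell one-liners, it assumes that the part of each filename before the first dot identifies the cell.

```python
import os
import tempfile


def swap_caption_suffix(directory, new_suffix):
    """Rename every *.txt file in directory to '<stem><new_suffix>'.

    The stem is everything before the first dot, mirroring cut -d"." -f1.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue
        stem = name.split(".", 1)[0]
        target = stem + new_suffix
        if target != name:
            os.rename(os.path.join(directory, name), os.path.join(directory, target))
        renamed.append(target)
    return renamed


# Try it on a throwaway directory holding one cell image and its text file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["000-000.png", "000-000.gt.txt"]:
        open(os.path.join(tmp, name), "w").close()
    for_feh = swap_caption_suffix(tmp, ".png.txt")      # captions feh can find
    for_training = swap_caption_suffix(tmp, ".gt.txt")  # back to training naming
    print(for_feh, for_training)
```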
    +
    +
    +
    +
    +
    + +
    +

    4.2 Blur

    Blurring helps to make noise less noisy so that the overall structure of an @@ -815,8 +902,8 @@ cv2.imwrite("resources/examples/example-table-blur

    -
    -

    4.3 Threshold

    +
    +

    4.3 Threshold

    We’ve got a bunch of pixels that are gray. Thresholding will turn them all @@ -854,8 +941,8 @@ cv2.imwrite("resources/examples/example-table-thre

    -
    -

    4.4 Finding the vertical and horizontal lines of the table

    +
    +

    4.4 Finding the vertical and horizontal lines of the table

    vertical = horizontal = img_bin.copy()
    @@ -894,8 +981,8 @@ cv2.imwrite("resources/examples/example-table-line
     
    -
    -

    4.5 Finding the contours

    +
    +

    4.5 Finding the contours

    Blurring and thresholding allow us to find the lines. Opening the lines allows @@ -944,7 +1031,7 @@ above/below certain sizes.

    -
    contours, heirarchy = cv2.findContours(
    +
    contours, heirarchy = cv2.findContours(
         mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE,
     )
     
    @@ -976,8 +1063,8 @@ above/below certain sizes.
     
    -
    -

    4.6 Sorting the bounding rectangles

    +
    +

    4.6 Sorting the bounding rectangles

    We want to process these from left-to-right, top-to-bottom. @@ -996,7 +1083,7 @@ value of their center. We’ll remove those rectangles from the list and rep
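The row-grouping idea can be tried on plain tuples without OpenCV. The sketch below follows the approach used in this document, where a cell joins a row when its vertical center falls between the top and bottom of the row's first cell. The `(x, y, w, h)` rectangles are made up, and group_into_rows is a name chosen for this example.

```python
def cell_in_same_row(c1, c2):
    """True if the vertical center of c1 lies strictly between c2's top and bottom."""
    c1_center = c1[1] + c1[3] / 2
    c2_top = c2[1]
    c2_bottom = c2[1] + c2[3]
    return c2_top < c1_center < c2_bottom


def group_into_rows(rects):
    """Group (x, y, w, h) rects into rows, ordered top-to-bottom, left-to-right."""
    cells = list(rects)
    rows = []
    while cells:
        first, rest = cells[0], cells[1:]
        same_row = [c for c in rest if cell_in_same_row(c, first)]
        rows.append(sorted([first] + same_row, key=lambda c: c[0]))
        cells = [c for c in rest if not cell_in_same_row(c, first)]
    # Order rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(y + h / 2 for _, y, _, h in row) / len(row))
    return rows


# Made-up rectangles: two rows of two cells, listed out of order.
rects = [(110, 52, 100, 30), (10, 10, 100, 30), (10, 50, 100, 30), (110, 12, 100, 30)]
rows = group_into_rows(rects)
print(rows)
```

Comparing centers rather than top edges tolerates cells in the same row whose y values differ by a few pixels.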

    -
    def cell_in_same_row(c1, c2):
    +
    def cell_in_same_row(c1, c2):
         c1_center = c1[1] + c1[3] - c1[3] / 2
         c2_bottom = c2[1] + c2[3]
         c2_top = c2[1]
    @@ -1070,7 +1157,7 @@ cv2.imwrite("resources/examples/example-table-cell
     
    -
    def extract_cell_images_from_table(image):
    +
    def extract_cell_images_from_table(image):
         BLUR_KERNEL_SIZE = (17, 17)
         STD_DEV_X_DIRECTION = 0
         STD_DEV_Y_DIRECTION = 0
    @@ -1184,8 +1271,8 @@ cv2.imwrite("resources/examples/example-table-cell
     
    -
    -

    4.7 Cropping each cell to the text

    +
    +

    4.7 Cropping each cell to the text

    OCR with Tesseract works best when there is about 10 pixels of white border @@ -1272,8 +1359,8 @@ cv2.imwrite("resources/examples/example-table-cell

    -
    -

    4.8 OCR each cell

    +
    +

    4.8 OCR each cell

    If we cleaned up the images well enough, we might get some accurate OCR! @@ -1303,10 +1390,31 @@ period into a comma, then you might need to do some custom Tesseract training.

    +

    +The second argument passed to ocr_image is a string of the command line arguments passed directly to tesseract. You can view the available options at https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options +

    + +

    +If no options are passed to tesseract, the language defaults to English. This means tesseract needs to be able to find a file named eng.traineddata on whatever path it searches for languages. +

    + +

    +This Python package ships with eng.traineddata and table-ocr.traineddata. table-ocr.traineddata is a personal model that I’ve found to be more accurate for my use case. You should train your own to maximize accuracy. +

    + +

    +When you pip install this package, the traineddata gets copied to a tessdata folder in the same directory in which pip installs the package. +

    + +

    +By default, the ocr_image package in this repo sets the --tessdata-dir option to the tessdata directory in the package install location and the -l option to the table-ocr language. +
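A minimal sketch of how such defaults can be filled in when the caller supplies no options. The option values mirror the ones used by the package's ocr_image main function. default_tess_args and the example install path are hypothetical.

```python
import os


def default_tess_args(tess_args, package_dir):
    """Return tess_args unchanged, or the package defaults if none were given."""
    if tess_args:
        return list(tess_args)
    tessdata_dir = os.path.join(package_dir, "tessdata")
    return ["--psm", "7", "-l", "table-ocr", "--tessdata-dir", tessdata_dir]


# Hypothetical install location, used only to show the resulting option list.
package_dir = os.path.join("site-packages", "table_ocr")
defaults = default_tess_args([], package_dir)
custom = default_tess_args(["--psm", "6", "-l", "eng"], package_dir)
print(" ".join(defaults))
print(" ".join(custom))
```

Passing any options at all replaces the defaults entirely, so a caller who only wants to change the page segmentation mode also has to repeat the language and tessdata options.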

    +
    import pytesseract
     import cv2
     import numpy as np
    +import math
     image = cv2.imread("resources/examples/example-table-cell-1-1.png", cv2.IMREAD_GRAYSCALE)
     <<crop-to-text>>
     <<ocr-image>>
    @@ -1322,8 +1430,8 @@ ocr_image(image, "--psm 7")
     
    -
    -

    5 Files

    +
    +

    5 Files

    @@ -1331,8 +1439,8 @@ ocr_image(image, "--psm 7")
     
    -
    -

    5.1 setup.py

    +
    +

    5.1 setup.py

    import setuptools
    @@ -1340,43 +1448,43 @@ ocr_image(image, "--psm 7")
     long_description = """
     Utilities for turning images of tables into CSV data. Uses Tesseract and OpenCV.
     
    -Requires binaries for tesseract and pdfimages (from Poppler).
    +Requires binaries for tesseract, ImageMagick, and pdfimages (from Poppler).
     """
     setuptools.setup(
         name="table_ocr",
    -    version="0.0.1",
    +    version="0.2.0",
         author="Eric Ihli",
         author_email="eihli@owoga.com",
    -    description="Turn images of tables into CSV data.",
    +    description="Extract text from tables in images.",
         long_description=long_description,
         long_description_content_type="text/plain",
         url="https://github.com/eihli/image-table-ocr",
         packages=setuptools.find_packages(),
    +    package_data={
    +        "table_ocr": ["tessdata/table-ocr.traineddata", "tessdata/eng.traineddata"]
    +    },
         classifiers=[
             "Programming Language :: Python :: 3",
             "License :: OSI Approved :: MIT License",
             "Operating System :: OS Independent",
         ],
    -    install_requires=[
    -        "pytesseract~=0.3",
    -        "opencv-python~=4.2",
    -    ],
    -    python_requires='>=3.6',
    +    install_requires=["pytesseract~=0.3", "opencv-python~=4.2",],
    +    python_requires=">=3.6",
     )
     
    -
    -

    5.2 table_ocr

    +
    +

    5.2 table_ocr

    -
    -

    5.2.1 table_ocr/__init__.py

    +
    +

    5.2.1 table_ocr/__init__.py

    -
    -

    5.2.2 table_ocr/util.py

    +
    +

    5.2.2 table_ocr/util.py

    from contextlib import contextmanager
    @@ -1413,15 +1521,15 @@ setuptools.setup(
     
    -
    -

    5.2.3 table_ocr/pdf_to_images/

    +
    +

    5.2.3 table_ocr/pdf_to_images/

    -
    -
    5.2.3.1 table_ocr/pdf_to_images/__init__.py
    +
    +
    5.2.3.1 table_ocr/pdf_to_images/__init__.py
    -
    import os
    +
    import os
     import re
     import subprocess
     
    @@ -1434,14 +1542,14 @@ setuptools.setup(
     def pdf_to_images(pdf_filepath):
         """
         Turn a pdf into images
    +    Returns the filenames of the created images sorted lexicographically.
         """
         directory, filename = os.path.split(pdf_filepath)
    -    with working_dir(directory):
    -        image_filenames = pdfimages(pdf_filepath)
    +    image_filenames = pdfimages(pdf_filepath)
     
         # pdfimages creates a number of files, each named for its page number,
         # and doesn't return the list of files that it created.
    -    return [os.path.join(directory, f) for f in image_filenames]
    +    return sorted([os.path.join(directory, f) for f in image_filenames])
     
     
     def pdfimages(pdf_filepath):
    @@ -1454,8 +1562,14 @@ setuptools.setup(
         uses 3 digits in its regex.
         """
         directory, filename = os.path.split(pdf_filepath)
    +    if not os.path.isabs(directory):
    +        directory = os.path.abspath(directory)
         filename_sans_ext = filename.split(".pdf")[0]
    -    subprocess.run(["pdfimages", "-png", pdf_filepath, filename.split(".pdf")[0]])
    +
    +    # pdfimages outputs results to the current working directory
    +    with working_dir(directory):
    +        subprocess.run(["pdfimages", "-png", filename, filename.split(".pdf")[0]])
    +
         image_filenames = find_matching_files_in_dir(filename_sans_ext, directory)
         logger.debug(
             "Converted {} into files:\n{}".format(pdf_filepath, "\n".join(image_filenames))
    @@ -1471,19 +1585,29 @@ setuptools.setup(
         ]
         return files
     
    -def preprocess_img(filepath):
    -    """
    -    Processing that involves running shell executables,
    +def preprocess_img(filepath, tess_params=None):
    +    """Processing that involves running shell executables,
         like mogrify to rotate.
    +
    +    Uses tesseract to detect rotation.
    +   
    +    Orientation and script detection is only available for legacy tesseract
    +    (--oem 0). Some versions of tesseract will segfault if you let it run OSD
    +    with the default oem (3).
         """
    -    rotate = get_rotate(filepath)
    +    if tess_params is None:
    +        tess_params = ["--psm", "0", "--oem", "0"]
    +    rotate = get_rotate(filepath, tess_params)
         logger.debug("Rotating {} by {}.".format(filepath, rotate))
         mogrify(filepath, rotate)
     
     
    -def get_rotate(image_filepath):
    +def get_rotate(image_filepath, tess_params):
    +    """
    +    """
    +    tess_command = ["tesseract"] + tess_params + [image_filepath, "-"]
         output = (
    -        subprocess.check_output(["tesseract", "--psm", "0", image_filepath, "-"])
    +        subprocess.check_output(tess_command)
             .decode("utf-8")
             .split("\n")
         )
    @@ -1499,8 +1623,8 @@ setuptools.setup(
     
    -
    -
    5.2.3.2 table_ocr/pdf_to_images/__main__.py
    +
    +
    5.2.3.2 table_ocr/pdf_to_images/__main__.py

    Takes a variable number of pdf files and creates images out of each page of the @@ -1520,7 +1644,7 @@ blank line.

    -
    import argparse
    +
    import argparse
     
     from table_ocr.util import working_dir, make_tempdir, get_logger
     from table_ocr.pdf_to_images import pdf_to_images, preprocess_img
    @@ -1553,15 +1677,16 @@ parser.add_argument("files", nargs=
     
    -
    -

    5.2.4 table_ocr/extract_tables/

    +
    +

    5.2.4 table_ocr/extract_tables/

    -
    -
    5.2.4.1 table_ocr/extract_tables/__init__.py
    +
    +
    5.2.4.1 table_ocr/extract_tables/__init__.py
    -
    import cv2
    +
    import os
    +import cv2
     
     def find_tables(image):
         BLUR_KERNEL_SIZE = (17, 17)
    @@ -1610,13 +1735,35 @@ parser.add_argument("files", nargs=    # Leaving that step as a future TODO if it is ever necessary.
         images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
         return images
    +
    +def main(files):
    +    results = []
    +    for f in files:
    +        directory, filename = os.path.split(f)
    +        image = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
    +        tables = find_tables(image)
    +        files = []
    +        filename_sans_extension = os.path.splitext(filename)[0]
    +        if tables:
    +            os.makedirs(os.path.join(directory, filename_sans_extension), exist_ok=True)
    +        for i, table in enumerate(tables):
    +            table_filename = "table-{:03d}.png".format(i)
    +            table_filepath = os.path.join(
    +                directory, filename_sans_extension, table_filename
    +            )
    +            files.append(table_filepath)
    +            cv2.imwrite(table_filepath, table)
    +        if tables:
    +            results.append((f, files))
    +    # results is a list of (<input image>, [<images of detected tables>]) tuples
    +    return results
     
    -
    -
    5.2.4.2 table_ocr/extract_tables/__main__.py
    +
    +
    5.2.4.2 table_ocr/extract_tables/__main__.py

    Takes 1 or more image paths as arguments. @@ -1648,60 +1795,33 @@ For each image path given as an argument, outputs:

    -
    import argparse
    -import os
    -
    -import cv2
    +
    import argparse
     
    -from table_ocr.extract_tables import find_tables
    +from table_ocr.extract_tables import main
     
     parser = argparse.ArgumentParser()
     parser.add_argument("files", nargs="+")
    -
    -
    -def main(files):
    -    results = []
    -    for f in files:
    -        directory, filename = os.path.split(f)
    -        image = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
    -        tables = find_tables(image)
    -        files = []
    -        filename_sans_extension = os.path.splitext(filename)[0]
    -        if tables:
    -            os.makedirs(os.path.join(directory, filename_sans_extension), exist_ok=True)
    -        for i, table in enumerate(tables):
    -            table_filename = "table-{:03d}.png".format(i)
    -            table_filepath = os.path.join(
    -                directory, filename_sans_extension, table_filename
    -            )
    -            files.append(table_filepath)
    -            cv2.imwrite(table_filepath, table)
    -        if tables:
    -            results.append((f, files))
    -
    -    for image_filename, table_filenames in results:
    -        print("\n".join(table_filenames))
    -
    -
    -if __name__ == "__main__":
    -    args = parser.parse_args()
    -    files = args.files
    -    main(files)
    +args = parser.parse_args()
    +files = args.files
    +results = main(files)
    +for image, tables in results:
    +    print("\n".join(tables))
     
    -
    -

    5.2.5 table_ocr/extract_cells/

    +
    +

    5.2.5 table_ocr/extract_cells/

    -
    -
    5.2.5.1 table_ocr/extract_cells/__init__.py
    +
    +
    5.2.5.1 table_ocr/extract_cells/__init__.py
    import cv2
    +import os
     
     def extract_cell_images_from_table(image):
         BLUR_KERNEL_SIZE = (17, 17)
    @@ -1798,13 +1918,29 @@ parser.add_argument("files", nargs=            cell_images_row.append(image[y:y+h, x:x+w])
             cell_images_rows.append(cell_images_row)
         return cell_images_rows
    +
    +def main(f):
    +    results = []
    +    directory, filename = os.path.split(f)
    +    table = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
    +    rows = extract_cell_images_from_table(table)
    +    cell_img_dir = os.path.join(directory, "cells")
    +    os.makedirs(cell_img_dir, exist_ok=True)
    +    paths = []
    +    for i, row in enumerate(rows):
    +        for j, cell in enumerate(row):
    +            cell_filename = "{:03d}-{:03d}.png".format(i, j)
    +            path = os.path.join(cell_img_dir, cell_filename)
    +            cv2.imwrite(path, cell)
    +            paths.append(path)
    +    return paths
     
    -
    -
    5.2.5.2 table_ocr/extract_cells/__main__.py
    +
    +
    5.2.5.2 table_ocr/extract_cells/__main__.py

    Takes as a command line argument a path to an image of a table. @@ -1827,146 +1963,61 @@ cells.

    -
    import os
    -import sys
    -
    -import cv2
    -
    -from table_ocr.extract_cells import extract_cell_images_from_table
    +
    import sys
     
    -def main(f):
    -    results = []
    -    directory, filename = os.path.split(f)
    -    table = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
    -    rows = extract_cell_images_from_table(table)
    -    cell_img_dir = os.path.join(directory, "cells")
    -    os.makedirs(cell_img_dir, exist_ok=True)
    -    for i, row in enumerate(rows):
    -        for j, cell in enumerate(row):
    -            cell_filename = "{:03d}-{:03d}.png".format(i, j)
    -            path = os.path.join(cell_img_dir, cell_filename)
    -            cv2.imwrite(path, cell)
    -            print(path)
    -
    -
    -def extract_cell_images_from_table(image):
    -    BLUR_KERNEL_SIZE = (17, 17)
    -    STD_DEV_X_DIRECTION = 0
    -    STD_DEV_Y_DIRECTION = 0
    -    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
    -    MAX_COLOR_VAL = 255
    -    BLOCK_SIZE = 15
    -    SUBTRACT_FROM_MEAN = -2
    -    
    -    img_bin = cv2.adaptiveThreshold(
    -        ~blurred,
    -        MAX_COLOR_VAL,
    -        cv2.ADAPTIVE_THRESH_MEAN_C,
    -        cv2.THRESH_BINARY,
    -        BLOCK_SIZE,
    -        SUBTRACT_FROM_MEAN,
    -    )
    -    vertical = horizontal = img_bin.copy()
    -    SCALE = 5
    -    image_width, image_height = horizontal.shape
    -    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    -    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    -    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    -    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)
    -    
    -    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    -    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))
    -    
    -    mask = horizontally_dilated + vertically_dilated
    -    contours, heirarchy = cv2.findContours(
    -        mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE,
    -    )
    -    
    -    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    -    epsilons = [0.05 * p for p in perimeter_lengths]
    -    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    -    
    -    # Filter out contours that aren't rectangular. Those that aren't rectangular
    -    # are probably noise.
    -    approx_rects = [p for p in approx_polys if len(p) == 4]
    -    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]
    -    
    -    # Filter out rectangles that are too narrow or too short.
    -    MIN_RECT_WIDTH = 40
    -    MIN_RECT_HEIGHT = 10
    -    bounding_rects = [
    -        r for r in bounding_rects if MIN_RECT_WIDTH < r[2] and MIN_RECT_HEIGHT < r[3]
    -    ]
    -    
    -    # The largest bounding rectangle is assumed to be the entire table.
    -    # Remove it from the list. We don't want to accidentally try to OCR
    -    # the entire table.
    -    largest_rect = max(bounding_rects, key=lambda r: r[2] * r[3])
    -    bounding_rects = [b for b in bounding_rects if b is not largest_rect]
    -    
    -    cells = [c for c in bounding_rects]
    -    def cell_in_same_row(c1, c2):
    -        c1_center = c1[1] + c1[3] - c1[3] / 2
    -        c2_bottom = c2[1] + c2[3]
    -        c2_top = c2[1]
    -        return c2_top < c1_center < c2_bottom
    -    
    -    orig_cells = [c for c in cells]
    -    rows = []
    -    while cells:
    -        first = cells[0]
    -        rest = cells[1:]
    -        cells_in_same_row = sorted(
    -            [
    -                c for c in rest
    -                if cell_in_same_row(c, first)
    -            ],
    -            key=lambda c: c[0]
    -        )
    -    
    -        row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    -        rows.append(row_cells)
    -        cells = [
    -            c for c in rest
    -            if not cell_in_same_row(c, first)
    -        ]
    -    
    -    # Sort rows by average height of their center.
    -    def avg_height_of_center(row):
    -        centers = [y + h - h / 2 for x, y, w, h in row]
    -        return sum(centers) / len(centers)
    -    
    -    rows.sort(key=avg_height_of_center)
    -    cell_images_rows = []
    -    for row in rows:
    -        cell_images_row = []
    -        for x, y, w, h in row:
    -            cell_images_row.append(image[y:y+h, x:x+w])
    -        cell_images_rows.append(cell_images_row)
    -    return cell_images_rows
    +from table_ocr.extract_cells import main
     
    -if __name__ == "__main__":
    -    main(sys.argv[1])
    +paths = main(sys.argv[1])
    +print("\n".join(paths))
     
    -
    -

    5.2.6 table_ocr/ocr_image/

    +
    +

    5.2.6 table_ocr/ocr_image/

    -
    -
    5.2.6.1 table_ocr/ocr_image/__init__.py
    +
    +
    5.2.6.1 table_ocr/ocr_image/__init__.py
    import math
    +import os
    +import sys
     
     import cv2
     import numpy as np
     import pytesseract
     
    +def main(image_file, tess_args):
    +    """
    +    OCR the image and output the text to a file with an extension that is ready
    +    to be used in Tesseract training (.gt.txt).
    +
    +    Tries to crop the image so that only the relevant text gets passed to Tesseract.
    +
    +    Returns the name of the text file that contains the text.
    +    """
    +    directory, filename = os.path.split(image_file)
    +    filename_sans_ext, ext = os.path.splitext(filename)
    +    image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
    +    cropped = crop_to_text(image)
    +    ocr_data_dir = os.path.join(directory, "ocr_data")
    +    os.makedirs(ocr_data_dir, exist_ok=True)
    +    out_imagepath = os.path.join(ocr_data_dir, filename)
    +    out_txtpath = os.path.join(ocr_data_dir, "{}.gt.txt".format(filename_sans_ext))
    +    cv2.imwrite(out_imagepath, cropped)
    +    if not tess_args:
    +        d = os.path.dirname(sys.modules["table_ocr"].__file__)
    +        tessdata_dir = os.path.join(d, "tessdata")
    +        tess_args = ["--psm", "7", "-l", "table-ocr", "--tessdata-dir", tessdata_dir]
    +    txt = ocr_image(cropped, " ".join(tess_args))
    +    with open(out_txtpath, "w") as txt_file:
    +        txt_file.write(txt)
    +    return out_txtpath
    +
     def crop_to_text(image):
         MAX_COLOR_VAL = 255
         BLOCK_SIZE = 15
    @@ -2022,8 +2073,8 @@ cells.
     
    -
    -
    5.2.6.2 table_ocr/ocr_image/__main__.py
    +
    +
    5.2.6.2 table_ocr/ocr_image/__main__.py

    This does a little bit of cleanup before sending it through tesseract. @@ -2036,13 +2087,8 @@ Creates images and text files that can be used for training tesseract. See

    import argparse
    -import math
    -import os
    -import sys
    -
    -import cv2
     
    -from table_ocr.ocr_image import crop_to_text, ocr_image
    +from table_ocr.ocr_image import main
     
     description="""Takes a single argument that is the image to OCR.
     Remaining arguments are passed directly to Tesseract.
    @@ -2053,35 +2099,19 @@ Creates images and text files that can be used for training tesseract. See
     parser = argparse.ArgumentParser(description=description)
     parser.add_argument("image", help="filepath of image to perform OCR")
     
    -def main(image_file, tess_args):
    -    directory, filename = os.path.split(image_file)
    -    filename_sans_ext, ext = os.path.splitext(filename)
    -    image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
    -    cropped = crop_to_text(image)
    -    ocr_data_dir = os.path.join(directory, "ocr_data")
    -    os.makedirs(ocr_data_dir, exist_ok=True)
    -    out_imagepath = os.path.join(ocr_data_dir, filename)
    -    out_txtpath = os.path.join(ocr_data_dir, "{}.gt.txt".format(filename_sans_ext))
    -    cv2.imwrite(out_imagepath, cropped)
    -    txt = ocr_image(cropped, " ".join(tess_args))
    -    print(txt)
    -    with open(out_txtpath, "w") as txt_file:
    -        txt_file.write(txt)
    -
    -if __name__ == "__main__":
    -    args, tess_args = parser.parse_known_args()
    -    main(args.image, tess_args)
    +args, tess_args = parser.parse_known_args()
    +print(main(args.image, tess_args))
     
    -
    -

    5.2.7 table_ocr/ocr_to_csv/

    +
    +

    5.2.7 table_ocr/ocr_to_csv/

    -
    -
    5.2.7.1 table_ocr/ocr_to_csv/__init__.py
    +
    +
    5.2.7.1 table_ocr/ocr_to_csv/__init__.py
    import csv
    @@ -2115,8 +2145,8 @@ parser.add_argument("image", 
    -
    5.2.7.2 table_ocr/ocr_to_csv/__main__.py
    +
    +
    5.2.7.2 table_ocr/ocr_to_csv/__main__.py
    import argparse
    @@ -2145,8 +2175,8 @@ parser.add_argument("files", nargs=
     
    -
    -

    6 Utils

    +
    +

    6 Utils

    The following code lets us specify a size for images when they are exported to @@ -2173,7 +2203,7 @@ with advice-add.

    -
    (concat "#+ATTR_HTML: :width " width " :height " height "\n[[file:" text "]]")
    +
    (concat "#+ATTR_HTML: :width " width " :height " height "\n[[file:" text "]]")
     
    @@ -2195,8 +2225,8 @@ with advice-add.
    -
    -

    6.1 Logging

    +
    +

    6.1 Logging

    def get_logger(name):
    @@ -2217,7 +2247,7 @@ with advice-add.
     

    Author: Eric Ihli

    -

    Created: 2020-04-25 Sat 12:20

    +

    Created: 2020-10-14 Wed 21:28

    diff --git a/pdf_table_extraction_and_ocr.org b/pdf_table_extraction_and_ocr.org
    index 3d1587b..bc86c47 100644
    --- a/pdf_table_extraction_and_ocr.org
    +++ b/pdf_table_extraction_and_ocr.org
    @@ -97,14 +97,15 @@ PDF=$1
     python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
     cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
     cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
    -cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
    -
    +cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
     for image in $(cat /tmp/extracted-tables.txt); do
     dir=$(dirname $image)
     python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
     done
     #+END_SRC
    +Any extra args you pass after the image path to ~python -m table_ocr.ocr_image~ will be passed directly to tesseract as options. If you don't pass anything, reasonable english defaults are used.
    +
     ** Possible improvements
     Detect text with the stroke-width-transform alogoritm. https://zablo.net/blog/post/stroke-width-transform-swt-python/index.html
    @@ -199,7 +200,7 @@ def preprocess_img(filepath, tess_params=None):
     like mogrify to rotate.
     Uses tesseract to detect rotation.
    -
    +
     Orientation and script detection is only available for legacy tesseract (--oem 0). Some versions of tesseract will segfault if you let it run OSD with the default oem (3).
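
The argument forwarding described in this change (extra arguments after the image path go to Tesseract, with a fallback when none are given) can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `split_args` and `DEFAULT_TESS_ARGS` are hypothetical names, and the default list omits the `--tessdata-dir` entry that the real `main` derives from the install location.

```python
import argparse

# Hypothetical defaults, modelled on the values in the diff above.
# The real code also appends --tessdata-dir pointing inside the package.
DEFAULT_TESS_ARGS = ["--psm", "7", "-l", "table-ocr"]


def split_args(argv):
    """Separate the image path from options meant for Tesseract."""
    parser = argparse.ArgumentParser()
    parser.add_argument("image", help="filepath of image to perform OCR")
    # parse_known_args returns the parsed namespace plus a list of every
    # argument the parser did not recognise, in their original order.
    args, tess_args = parser.parse_known_args(argv)
    if not tess_args:
        # Nothing extra was passed, so fall back to the defaults.
        tess_args = list(DEFAULT_TESS_ARGS)
    # Tesseract options are joined into one config string.
    return args.image, " ".join(tess_args)


if __name__ == "__main__":
    print(split_args(["cell.png"]))
    print(split_args(["cell.png", "--psm", "6", "-l", "eng"]))
```

Because the parser only declares the `image` positional, anything else on the command line is left untouched for Tesseract, which is why the pipeline script no longer needs to spell out `--psm 7 -l table-ocr`.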