image-table-ocr/table_ocr/ocr_image/__main__.py

import argparse
import math
import os
import sys

import cv2
import pytesseract

description = """Takes a single positional argument: the path of the image to OCR.
Remaining arguments are passed directly to Tesseract.

Attempts to make OCR more accurate by cropping the image to its text before recognition.
Saves the modified image and the OCR text in an `ocr_data` directory next to the input image.
Output filenames follow the format used for training with tesstrain."""
parser = argparse.ArgumentParser(description=description)
parser.add_argument("image", help="filepath of image to perform OCR")

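def crop_to_text(image):
    """Crop a grayscale image to the bounding box of its text.

    Lines and small noise are removed from a binarized copy to locate the
    text; the crop is taken from the original image and given a white border.
    """
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    # Threshold the inverted image so that text and lines are white on black.
    img_bin = cv2.adaptiveThreshold(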
        ~image,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )

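    img_h, img_w = image.shape
    # Long, thin kernels keep only lines spanning at least half the width or
    # 70% of the height, i.e. leftover cell borders rather than text strokes.
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(img_w * 0.5), 1))
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(img_h * 0.7)))
    horizontal_lines = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_lines = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)
    # Combine and remove the lines without uint8 wraparound: with plain `+`,
    # 255 + 255 becomes 254, leaving non-zero pixels where the lines intersect.
    both = cv2.bitwise_or(horizontal_lines, vertical_lines)
    cleaned = cv2.subtract(img_bin, both)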

    # Get rid of little noise.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    opened = cv2.morphologyEx(cleaned, cv2.MORPH_OPEN, kernel)

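    # Group all remaining contours into one bounding box rather than dilating
    # the characters into a single blob: dilation can reach the image boundary
    # and keep a leftover cell border in the crop.
    contours, hierarchy = cv2.findContours(opened, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    bounding_rects = [cv2.boundingRect(c) for c in contours]
    NUM_PX_COMMA = 6
    MIN_CHAR_AREA = 5 * 9
    # Ignore boxes too small to be a character; they are most likely noise.
    char_rects = [(x, y, w, h) for x, y, w, h in bounding_rects if w * h > MIN_CHAR_AREA]
    # Test the filtered list: if every box was discarded as noise, minx and
    # miny would stay at math.inf and the slice below would raise a TypeError.
    if char_rects:
        minx, miny, maxx, maxy = math.inf, math.inf, 0, 0
        for x, y, w, h in char_rects:
            minx = min(minx, x)
            miny = min(miny, y)
            maxx = max(maxx, x + w)
            maxy = max(maxy, y + h)
        x, y, w, h = minx, miny, maxx - minx, maxy - miny
        # Keep NUM_PX_COMMA extra rows below the box, clamped to the image height.
        cropped = image[y:min(img_h, y+h+NUM_PX_COMMA), x:min(img_w, x+w)]
    else:
        # If we morphed out all of the text, fall back to using the unmorphed image.
        cropped = image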
    # Pad the crop with a 5 px white border on every side.
    bordered = cv2.copyMakeBorder(cropped, 5, 5, 5, 5, cv2.BORDER_CONSTANT, None, 255)
    return bordered

def ocr_image(image, config):
    return pytesseract.image_to_string(
        image,
        config=config
    )

def main(image_file, tess_args):
    directory, filename = os.path.split(image_file)
    filename_sans_ext, ext = os.path.splitext(filename)
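    image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
    if image is None:
        # cv2.imread returns None instead of raising when the file cannot be read.
        sys.exit("Could not read image: {}".format(image_file))
    cropped = crop_to_text(image)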
    ocr_data_dir = os.path.join(directory, "ocr_data")
    os.makedirs(ocr_data_dir, exist_ok=True)
    # tesstrain pairs each image with a ground-truth file named <name>.gt.txt.
    out_imagepath = os.path.join(ocr_data_dir, filename)
    out_txtpath = os.path.join(ocr_data_dir, "{}.gt.txt".format(filename_sans_ext))
    cv2.imwrite(out_imagepath, cropped)
    txt = ocr_image(cropped, " ".join(tess_args))
    print(txt)
    with open(out_txtpath, "w") as txt_file:
        txt_file.write(txt)

if __name__ == "__main__":
    args, tess_args = parser.parse_known_args()
    main(args.image, tess_args)