diff --git a/README.org b/README.org
index 0ebdcb6..65484b1 100644
--- a/README.org
+++ b/README.org
@@ -49,7 +49,7 @@ PDF=$1
 python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
 cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
 cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
-cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
+cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
 
 for image in $(cat /tmp/extracted-tables.txt); do
     dir=$(dirname $image)
@@ -57,6 +57,7 @@ for image in $(cat /tmp/extracted-tables.txt); do
 done
 #+END_SRC
 
+
 The package was written in a [[https://en.wikipedia.org/wiki/Literate_programming][literate programming]] style. The source code at [[https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html]] is meant to act as the documentation and reference material.
diff --git a/ocr_tables b/ocr_tables
index 25e936a..5f2e413 100755
--- a/ocr_tables
+++ b/ocr_tables
@@ -5,8 +5,7 @@ PDF=$1
 python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
 cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
 cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
-cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
-
+cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
 for image in $(cat /tmp/extracted-tables.txt); do
     dir=$(dirname $image)
     python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
diff --git a/pdf_table_extraction_and_ocr.html b/pdf_table_extraction_and_ocr.html
index be38955..5126f4d 100644
--- a/pdf_table_extraction_and_ocr.html
+++ b/pdf_table_extraction_and_ocr.html
@@ -3,7 +3,7 @@
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- +This Python package provides utilities for extracting tabular data from PDF @@ -359,17 +363,17 @@ The package is split into modules with narrow focuses.
 pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
 extract_tables finds and extracts table-looking things from an image.
 extract_cells extracts and orders cells from a table.
-ocr_image uses Tesseract to turn a OCR the text from an image of a cell.
+ocr_image uses Tesseract to OCR the text from an image of a cell.
 ocr_to_csv converts into a CSV the directory structure that ocr_image outputs.
 pdfimages from Poppler
 This package was created in a literate programming style with the help of Babel.
@@ -405,8 +409,8 @@ barrier for contributors who aren’t already familiar with Emacs and Babel.
Here is an example of a shell script that uses each module to turn a pdf with a @@ -424,26 +428,29 @@ you need into your own python projects and use them as needed.
-#!/bin/sh
+#!/bin/sh
 PDF=$1
 python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
 cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
 cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
-cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
-
+cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
 for image in $(cat /tmp/extracted-tables.txt); do
     dir=$(dirname $image)
     python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
 done
+Any extra args you pass after the image path to python -m table_ocr.ocr_image
will be passed directly to tesseract as options. If you don’t pass anything, reasonable English defaults are used.
+
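The split between the image path and the Tesseract options can be sketched with `argparse.ArgumentParser.parse_known_args`, which is what the module's `__main__.py` calls. The sample argument list below is illustrative only; it is not taken from the package.

```python
import argparse

parser = argparse.ArgumentParser(description="OCR a single cell image.")
parser.add_argument("image", help="filepath of image to perform OCR")

# parse_known_args returns the recognized arguments plus a list of everything
# it did not recognize. That leftover list is what gets handed to tesseract.
# The argument values here are made up for the example.
args, tess_args = parser.parse_known_args(["cell.png", "--psm", "7", "-l", "eng"])

print(args.image)  # the image path
print(tess_args)   # the options that would be forwarded to tesseract
```

Because the parser only declares `image`, any flag that Tesseract understands can be appended without the module needing to know about it.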
Detect text with the stroke-width-transform algorithm. https://zablo.net/blog/post/stroke-width-transform-swt-python/index.html
@@ -452,8 +459,8 @@ Detect text with the stroke-width-transform algorithm.
-
Not all pdfs need to be sent through OCR to extract the text content. If you can
@@ -462,27 +469,27 @@ probably aren’t necessary.
This code calls out to pdfimages from Poppler.
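Since `pdfimages` writes its output into the current working directory, the call needs to run from the PDF's own directory. Here is a minimal sketch of a directory-switching context manager. The package's real `working_dir` helper lives in `table_ocr/util.py` and is not shown in this diff, so treat this version as an assumption about its behavior.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_dir(directory):
    """Temporarily change the current working directory (hypothetical sketch)."""
    original = os.getcwd()
    os.chdir(directory)
    try:
        yield directory
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(original)


before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    expected = os.path.realpath(tmp)
    with working_dir(tmp):
        # A command such as pdfimages would drop its files here.
        inside = os.path.realpath(os.getcwd())
after = os.getcwd()
```

Resolving the directory to an absolute path before entering it, as the updated `pdfimages` function does, keeps relative input paths from breaking once the working directory changes.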
Tesseract can detect orientation and we can then use ImageMagick’s mogrify to
@@ -546,19 +559,29 @@ to correct the rotation. This makes OCR more straightforward.
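As a rough illustration of the detection step, the sketch below pulls the rotation out of Tesseract's orientation report. It assumes the report contains a `Rotate: <degrees>` line; the sample text and the `parse_rotate` name are made up for this example and are not the package's `get_rotate` implementation.

```python
def parse_rotate(osd_output):
    """Return the rotation in degrees from an OSD report, or 0 if absent."""
    for line in osd_output.splitlines():
        if line.startswith("Rotate:"):
            return int(line.split(":", 1)[1].strip())
    return 0


# Illustrative report text, not captured from a real run.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 5.42
Script: Latin
Script confidence: 2.50
"""

rotate = parse_rotate(SAMPLE_OSD)
print(rotate)
```

The resulting value is what would then be handed to `mogrify` to rotate the image in place.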
This answer from opencv.org was heavily referenced while writing the code around
@@ -597,7 +620,7 @@ that makes things like shape detection more accurate.
It’s likely that some images will contain tables that aren’t accurately
@@ -702,8 +725,8 @@ the x, y, width and height are anywhere in that range.
Tesseract does not perform well when run on images of tables. It performs best
@@ -720,9 +743,13 @@ We’ll start with an image shown at the end of the previous section.
+Tesseract is used for recognizing characters. It is not involved in extracting the tables from an image or in extracting cells from the table.
+
It’s a very good idea to train tesseract. Accuracy will improve tremendously.
-Run the
+Here is a tip for quickly creating training data.
+
+The output of the
+You’ll want to compare each image to its OCR text to check for accuracy. If
+the text doesn’t match, you’ll want to update the text and add the image to the
+training data.
+
+The fastest way to do this is with
+
+
+Then run
+Press
+When finished, rename the files back to the filename structure that Tesseract
+looks for in training.
+
Blurring helps to make noise less noisy so that the overall structure of an
@@ -815,8 +902,8 @@ cv2.imwrite("resources/examples/example-table-blur
We’ve got a bunch of pixels that are gray. Thresholding will turn them all
@@ -854,8 +941,8 @@ cv2.imwrite("resources/examples/example-table-thre
Blurring and thresholding allow us to find the lines. Opening the lines allows
@@ -944,7 +1031,7 @@ above/below certain sizes.
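The size filter can be sketched without OpenCV by working on plain `(x, y, width, height)` tuples. The thresholds match the constants used in `extract_cell_images_from_table`; the `filter_cell_rects` name and the sample rectangles are hypothetical.

```python
MIN_RECT_WIDTH = 40
MIN_RECT_HEIGHT = 10


def filter_cell_rects(bounding_rects):
    """Drop rectangles that are too small, then drop the table outline."""
    rects = [
        r for r in bounding_rects
        if MIN_RECT_WIDTH < r[2] and MIN_RECT_HEIGHT < r[3]
    ]
    if not rects:
        return []
    # The largest rectangle is assumed to be the whole table, not a cell.
    largest = max(rects, key=lambda r: r[2] * r[3])
    return [r for r in rects if r is not largest]


sample = [
    (0, 0, 500, 300),   # the whole table
    (10, 10, 100, 20),  # a plausible cell
    (120, 10, 5, 20),   # too narrow
    (10, 40, 100, 8),   # too short
]
cells = filter_cell_rects(sample)
print(cells)
```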
We want to process these from left-to-right, top-to-bottom.
@@ -996,7 +1083,7 @@ value of their center. We’ll remove those rectangles from the list and rep
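A compact sketch of that ordering follows, using the same "vertical center falls inside the other cell" test as `cell_in_same_row`. The `sort_cells` wrapper and the sample rectangles are assumptions made for the example.

```python
def cell_in_same_row(c1, c2):
    """True if the vertical center of c1 lies between c2's top and bottom."""
    c1_center = c1[1] + c1[3] / 2
    return c2[1] < c1_center < c2[1] + c2[3]


def sort_cells(rects):
    """Group (x, y, w, h) rectangles into rows, ordered top-to-bottom,
    with each row ordered left-to-right."""
    cells = list(rects)
    rows = []
    while cells:
        first, rest = cells[0], cells[1:]
        row = [first] + [c for c in rest if cell_in_same_row(c, first)]
        rows.append(sorted(row, key=lambda c: c[0]))
        cells = [c for c in rest if not cell_in_same_row(c, first)]
    # Order rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(y + h / 2 for _, y, _, h in row) / len(row))
    return rows


# Four cells in arbitrary order, with slightly uneven y values per row.
rects = [(120, 52, 100, 20), (10, 10, 100, 20), (10, 50, 100, 20), (120, 11, 100, 20)]
rows = sort_cells(rects)
print(rows)
```

Comparing centers instead of exact `y` values tolerates the few pixels of jitter that contour detection tends to produce between cells of the same row.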
OCR with Tesseract works best when there is about 10 pixels of white border
@@ -1272,8 +1359,8 @@ cv2.imwrite("resources/examples/example-table-cell
If we cleaned up the images well enough, we might get some accurate OCR!
@@ -1303,10 +1390,31 @@ period into a comma, then you might need to do some custom Tesseract training.
+The second argument passed to
+If no options are passed to
+This python package comes with
+When you
+The
Takes a variable number of pdf files and creates images out of each page of the
@@ -1520,7 +1644,7 @@ blank line.
Takes 1 or more image paths as arguments.
@@ -1648,60 +1795,33 @@ For each image path given as an argument, outputs:
Takes as a command line argument a path to an image of a table.
@@ -1827,146 +1963,61 @@ cells.
This does a little bit of cleanup before sending it through tesseract.
@@ -2036,13 +2087,8 @@ Creates images and text files that can be used for training tesseract. See
The following code lets us specify a size for images when they are exported to
@@ -2173,7 +2203,7 @@ with
-Created: 2020-04-25 Sat 12:20
+Created: 2020-10-14 Wed 21:28
2 Preparing data
+2 Preparing data
2.1 Converting PDFs to images
+2.1 Converting PDFs to images
# Wrapper around the Poppler command line utility "pdfimages" and helpers for
+
# Wrapper around the Poppler command line utility "pdfimages" and helpers for
# finding the output files of that command.
def pdf_to_images(pdf_filepath):
"""
Turn a pdf into images
+ Returns the filenames of the created images sorted lexicographically.
"""
directory, filename = os.path.split(pdf_filepath)
- with working_dir(directory):
- image_filenames = pdfimages(pdf_filepath)
+ image_filenames = pdfimages(pdf_filepath)
     # Since pdfimages creates a number of files, each named for its page number,
     # and doesn't return the list of files that it created
- return [os.path.join(directory, f) for f in image_filenames]
+ return sorted([os.path.join(directory, f) for f in image_filenames])
def pdfimages(pdf_filepath):
@@ -495,8 +502,14 @@ This code calls out to uses 3 digits in its regex.
"""
directory, filename = os.path.split(pdf_filepath)
+ if not os.path.isabs(directory):
+ directory = os.path.abspath(directory)
filename_sans_ext = filename.split(".pdf")[0]
- subprocess.run(["pdfimages", "-png", pdf_filepath, filename.split(".pdf")[0]])
+
+ # pdfimages outputs results to the current working directory
+ with working_dir(directory):
+ subprocess.run(["pdfimages", "-png", filename, filename.split(".pdf")[0]])
+
image_filenames = find_matching_files_in_dir(filename_sans_ext, directory)
logger.debug(
"Converted {} into files:\n{}".format(pdf_filepath, "\n".join(image_filenames))
@@ -516,8 +529,8 @@ This code calls out to
-
2.2 Detecting image orientation and applying rotation.
+2.2 Detecting image orientation and applying rotation.
def preprocess_img(filepath):
- """
- Processing that involves running shell executables,
+
def preprocess_img(filepath, tess_params=None):
+ """Processing that involves running shell executables,
like mogrify to rotate.
+
+ Uses tesseract to detect rotation.
+
+ Orientation and script detection is only available for legacy tesseract
+ (--oem 0). Some versions of tesseract will segfault if you let it run OSD
+ with the default oem (3).
"""
- rotate = get_rotate(filepath)
+ if tess_params is None:
+ tess_params = ["--psm", "0", "--oem", "0"]
+ rotate = get_rotate(filepath, tess_params)
logger.debug("Rotating {} by {}.".format(filepath, rotate))
mogrify(filepath, rotate)
-def get_rotate(image_filepath):
+def get_rotate(image_filepath, tess_params):
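+    """
+    Run tesseract with the given parameters on the image and return the
+    rotation that its orientation detection reports.
+    """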
+ tess_command = ["tesseract"] + tess_params + [image_filepath, "-"]
output = (
- subprocess.check_output(["tesseract", "--psm", "0", image_filepath, "-"])
+ subprocess.check_output(tess_command)
.decode("utf-8")
.split("\n")
)
@@ -575,8 +598,8 @@ to correct the rotation. This makes OCR more straightforward.
3 Detecting tables
+3 Detecting tables
def find_tables(image):
+
def find_tables(image):
BLUR_KERNEL_SIZE = (17, 17)
STD_DEV_X_DIRECTION = 0
STD_DEV_Y_DIRECTION = 0
@@ -680,8 +703,8 @@ cv2.imwrite("resources/examples/example-table.png"
3.1 Improving accuracy
+3.1 Improving accuracy
4 OCR tables
+4 OCR tables
4.1 Training Tesseract
+4.1 Training Tesseract
ocr_tables
script on a few pdfs to generate some training data. That
+Run the ocr_tables
script on a few pdfs to generate some training data. That
script outputs pairs of .png
and .gt.txt
files that can be used by
tesstrain.
/usr/local/share/tessdata/
.
4.1.1 Training tips
+ocr_cells
script will be a directory named ocr_data
that
+will have two files for each cell. One file is the image of the cell and the
+other file is the OCR text.
+feh
.
+feh
lets you view an image and a caption at the same time and lets you edit
+the caption from within feh
.
+feh
expects the captions to be named <image-name>.txt
, so use a little
+shell-fu to do a quick rename.
+for f in *.txt; do f1=$(cut -d"." -f1 <(echo $f)); mv $f ${f1}.png.txt; done
+
4.2 Blur
+feh -K .
to specify the current directory as the caption directory.
+This will open a window with the first image in the directory and its caption.
+c
to edit the caption (if needed) and n/p
to move to the
+next/previous images. Press q
to quit.
+for f in *.txt; do f1=$(cut -d"." -f1 <(echo $f)); mv $f ${f1}.gt.txt; done
+
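The two rename loops above rely on shell features such as process substitution. For anyone who prefers a portable version, here is a hedged Python sketch of the same round trip. The `swap_caption_suffix` helper is hypothetical and assumes caption files are named like `000-001.gt.txt`.

```python
import tempfile
from pathlib import Path


def swap_caption_suffix(directory, old_suffix, new_suffix):
    """Rename every file ending in old_suffix so it ends in new_suffix."""
    renamed = []
    for path in sorted(Path(directory).glob("*" + old_suffix)):
        stem = path.name[: -len(old_suffix)]
        target = path.with_name(stem + new_suffix)
        path.rename(target)
        renamed.append(target.name)
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the text files found in an ocr_data directory.
    for name in ("000-000.gt.txt", "000-001.gt.txt"):
        Path(tmp, name).write_text("cell text\n")
    # Rename for feh, which looks for <image-name>.txt captions.
    for_feh = swap_caption_suffix(tmp, ".gt.txt", ".png.txt")
    # Rename back to the structure Tesseract training expects.
    for_training = swap_caption_suffix(tmp, ".png.txt", ".gt.txt")

print(for_feh)
print(for_training)
```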
+4.2 Blur
4.3 Threshold
+4.3 Threshold
4.4 Finding the vertical and horizontal lines of the table
+4.4 Finding the vertical and horizontal lines of the table
vertical = horizontal = img_bin.copy()
@@ -894,8 +981,8 @@ cv2.imwrite("resources/examples/example-table-line
4.5 Finding the contours
+4.5 Finding the contours
contours, heirarchy = cv2.findContours(
+
contours, heirarchy = cv2.findContours(
mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE,
)
@@ -976,8 +1063,8 @@ above/below certain sizes.
4.6 Sorting the bounding rectangles
+4.6 Sorting the bounding rectangles
def cell_in_same_row(c1, c2):
+
def cell_in_same_row(c1, c2):
c1_center = c1[1] + c1[3] - c1[3] / 2
c2_bottom = c2[1] + c2[3]
c2_top = c2[1]
@@ -1070,7 +1157,7 @@ cv2.imwrite("resources/examples/example-table-cell
def extract_cell_images_from_table(image):
+
def extract_cell_images_from_table(image):
BLUR_KERNEL_SIZE = (17, 17)
STD_DEV_X_DIRECTION = 0
STD_DEV_Y_DIRECTION = 0
@@ -1184,8 +1271,8 @@ cv2.imwrite("resources/examples/example-table-cell
4.7 Cropping each cell to the text
+4.7 Cropping each cell to the text
4.8 OCR each cell
+4.8 OCR each cell
ocr_image
is a string of the command line arguments passed directly to tesseract
. You can view the available options at https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options
+tesseract
, then the language defaults to English. This means tesseract
needs to be able to find a file named eng.traineddata
on whatever path it searches for languages.
+eng.traineddata
and table-ocr.traineddata
. table-ocr.traineddata
is a personal model that I’ve found to be more accurate for my use case. You should train your own to maximize accuracy.
+pip install
this package, the traineddata gets copied to a tessdata
folder in the same directory in which pip
installs the package.
+ocr_image
package in this repo defaults to using the --tessdata-dir
option to the package’s tessdata
directory in the package install location and the -l
option to the table-ocr
language.
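The fallback described above can be sketched as a small function. The default list mirrors the one in `table_ocr/ocr_image/__init__.py`; the `resolve_tess_args` name and the example install path are assumptions for illustration.

```python
import os


def resolve_tess_args(tess_args, package_dir):
    """Use the caller's tesseract options, or fall back to the bundled model."""
    if tess_args:
        return list(tess_args)
    tessdata_dir = os.path.join(package_dir, "tessdata")
    return ["--psm", "7", "-l", "table-ocr", "--tessdata-dir", tessdata_dir]


# Hypothetical install location, used only for the example.
PACKAGE_DIR = os.path.join("site-packages", "table_ocr")

defaults = resolve_tess_args([], PACKAGE_DIR)
custom = resolve_tess_args(["--psm", "6", "-l", "eng"], PACKAGE_DIR)
print(defaults)
print(custom)
```

Note that passing any option at all replaces the whole default list, so a caller who only wants a different `--psm` would also need to supply `-l` and `--tessdata-dir` to keep using the bundled model.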
+import pytesseract
import cv2
import numpy as np
+import math
image = cv2.imread("resources/examples/example-table-cell-1-1.png", cv2.IMREAD_GRAYSCALE)
<<crop-to-text>>
<<ocr-image>>
@@ -1322,8 +1430,8 @@ ocr_image(image, "--psm 7")
5 Files
+5 Files
@@ -1331,8 +1439,8 @@ ocr_image(image, "--psm 7")
5.1 setup.py
+5.1 setup.py
import setuptools
@@ -1340,43 +1448,43 @@ ocr_image(image, "--psm 7")
long_description = """
Utilities for turning images of tables into CSV data. Uses Tesseract and OpenCV.
-Requires binaries for tesseract and pdfimages (from Poppler).
+Requires binaries for tesseract, ImageMagick, and pdfimages (from Poppler).
"""
setuptools.setup(
name="table_ocr",
- version="0.0.1",
+ version="0.2.0",
author="Eric Ihli",
author_email="eihli@owoga.com",
- description="Turn images of tables into CSV data.",
+ description="Extract text from tables in images.",
long_description=long_description,
long_description_content_type="text/plain",
url="https://github.com/eihli/image-table-ocr",
packages=setuptools.find_packages(),
+ package_data={
+ "table_ocr": ["tessdata/table-ocr.traineddata", "tessdata/eng.traineddata"]
+ },
classifiers=[
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
],
- install_requires=[
- "pytesseract~=0.3",
- "opencv-python~=4.2",
- ],
- python_requires='>=3.6',
+ install_requires=["pytesseract~=0.3", "opencv-python~=4.2",],
+ python_requires=">=3.6",
)
5.2 table_ocr
+5.2 table_ocr
5.2.1 table_ocr/__init__.py
+5.2.1 table_ocr/__init__.py
5.2.2 table_ocr/util.py
+5.2.2 table_ocr/util.py
from contextlib import contextmanager
@@ -1413,15 +1521,15 @@ setuptools.setup(
5.2.3 table_ocr/pdf_to_images/
+5.2.3 table_ocr/pdf_to_images/
5.2.3.1 table_ocr/pdf_to_images/__init__.py
+5.2.3.1 table_ocr/pdf_to_images/__init__.py
import os
+
import os
import re
import subprocess
@@ -1434,14 +1542,14 @@ setuptools.setup(
def pdf_to_images(pdf_filepath):
"""
Turn a pdf into images
+ Returns the filenames of the created images sorted lexicographically.
"""
directory, filename = os.path.split(pdf_filepath)
- with working_dir(directory):
- image_filenames = pdfimages(pdf_filepath)
+ image_filenames = pdfimages(pdf_filepath)
     # Since pdfimages creates a number of files, each named for its page number,
     # and doesn't return the list of files that it created
- return [os.path.join(directory, f) for f in image_filenames]
+ return sorted([os.path.join(directory, f) for f in image_filenames])
def pdfimages(pdf_filepath):
@@ -1454,8 +1562,14 @@ setuptools.setup(
uses 3 digits in its regex.
"""
directory, filename = os.path.split(pdf_filepath)
+ if not os.path.isabs(directory):
+ directory = os.path.abspath(directory)
filename_sans_ext = filename.split(".pdf")[0]
- subprocess.run(["pdfimages", "-png", pdf_filepath, filename.split(".pdf")[0]])
+
+ # pdfimages outputs results to the current working directory
+ with working_dir(directory):
+ subprocess.run(["pdfimages", "-png", filename, filename.split(".pdf")[0]])
+
image_filenames = find_matching_files_in_dir(filename_sans_ext, directory)
logger.debug(
"Converted {} into files:\n{}".format(pdf_filepath, "\n".join(image_filenames))
@@ -1471,19 +1585,29 @@ setuptools.setup(
]
return files
-def preprocess_img(filepath):
- """
- Processing that involves running shell executables,
+def preprocess_img(filepath, tess_params=None):
+ """Processing that involves running shell executables,
like mogrify to rotate.
+
+ Uses tesseract to detect rotation.
+
+ Orientation and script detection is only available for legacy tesseract
+ (--oem 0). Some versions of tesseract will segfault if you let it run OSD
+ with the default oem (3).
"""
- rotate = get_rotate(filepath)
+ if tess_params is None:
+ tess_params = ["--psm", "0", "--oem", "0"]
+ rotate = get_rotate(filepath, tess_params)
logger.debug("Rotating {} by {}.".format(filepath, rotate))
mogrify(filepath, rotate)
-def get_rotate(image_filepath):
+def get_rotate(image_filepath, tess_params):
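+    """
+    Run tesseract with the given parameters on the image and return the
+    rotation that its orientation detection reports.
+    """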
+ tess_command = ["tesseract"] + tess_params + [image_filepath, "-"]
output = (
- subprocess.check_output(["tesseract", "--psm", "0", image_filepath, "-"])
+ subprocess.check_output(tess_command)
.decode("utf-8")
.split("\n")
)
@@ -1499,8 +1623,8 @@ setuptools.setup(
5.2.3.2 table_ocr/pdf_to_images/__main__.py
+5.2.3.2 table_ocr/pdf_to_images/__main__.py
import argparse
+
import argparse
from table_ocr.util import working_dir, make_tempdir, get_logger
from table_ocr.pdf_to_images import pdf_to_images, preprocess_img
@@ -1553,15 +1677,16 @@ parser.add_argument("files", nargs=
5.2.4 table_ocr/extract_tables/
+5.2.4 table_ocr/extract_tables/
5.2.4.1 table_ocr/extract_tables/__init__.py
+5.2.4.1 table_ocr/extract_tables/__init__.py
import cv2
+
import os
+import cv2
def find_tables(image):
BLUR_KERNEL_SIZE = (17, 17)
@@ -1610,13 +1735,35 @@ parser.add_argument("files", nargs= # Leaving that step as a future TODO if it is ever necessary.
images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
return images
+
+def main(files):
+ results = []
+ for f in files:
+ directory, filename = os.path.split(f)
+ image = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
+ tables = find_tables(image)
+ files = []
+ filename_sans_extension = os.path.splitext(filename)[0]
+ if tables:
+ os.makedirs(os.path.join(directory, filename_sans_extension), exist_ok=True)
+ for i, table in enumerate(tables):
+ table_filename = "table-{:03d}.png".format(i)
+ table_filepath = os.path.join(
+ directory, filename_sans_extension, table_filename
+ )
+ files.append(table_filepath)
+ cv2.imwrite(table_filepath, table)
+ if tables:
+ results.append((f, files))
+ # Results is [[<input image>, [<images of detected tables>]]]
+ return results
5.2.4.2 table_ocr/extract_tables/__main__.py
+5.2.4.2 table_ocr/extract_tables/__main__.py
import argparse
-import os
-
-import cv2
+
import argparse
-from table_ocr.extract_tables import find_tables
+from table_ocr.extract_tables import main
parser = argparse.ArgumentParser()
parser.add_argument("files", nargs="+")
-
-
-def main(files):
- results = []
- for f in files:
- directory, filename = os.path.split(f)
- image = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
- tables = find_tables(image)
- files = []
- filename_sans_extension = os.path.splitext(filename)[0]
- if tables:
- os.makedirs(os.path.join(directory, filename_sans_extension), exist_ok=True)
- for i, table in enumerate(tables):
- table_filename = "table-{:03d}.png".format(i)
- table_filepath = os.path.join(
- directory, filename_sans_extension, table_filename
- )
- files.append(table_filepath)
- cv2.imwrite(table_filepath, table)
- if tables:
- results.append((f, files))
-
- for image_filename, table_filenames in results:
- print("\n".join(table_filenames))
-
-
-if __name__ == "__main__":
- args = parser.parse_args()
- files = args.files
- main(files)
+args = parser.parse_args()
+files = args.files
+results = main(files)
+for image, tables in results:
+ print("\n".join(tables))
5.2.5 table_ocr/extract_cells/
+5.2.5 table_ocr/extract_cells/
5.2.5.1 table_ocr/extract_cells/__init__.py
+5.2.5.1 table_ocr/extract_cells/__init__.py
import cv2
+import os
def extract_cell_images_from_table(image):
BLUR_KERNEL_SIZE = (17, 17)
@@ -1798,13 +1918,29 @@ parser.add_argument("files", nargs= cell_images_row.append(image[y:y+h, x:x+w])
cell_images_rows.append(cell_images_row)
return cell_images_rows
+
+def main(f):
+ results = []
+ directory, filename = os.path.split(f)
+ table = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
+ rows = extract_cell_images_from_table(table)
+ cell_img_dir = os.path.join(directory, "cells")
+ os.makedirs(cell_img_dir, exist_ok=True)
+ paths = []
+ for i, row in enumerate(rows):
+ for j, cell in enumerate(row):
+ cell_filename = "{:03d}-{:03d}.png".format(i, j)
+ path = os.path.join(cell_img_dir, cell_filename)
+ cv2.imwrite(path, cell)
+ paths.append(path)
+ return paths
5.2.5.2 table_ocr/extract_cells/__main__.py
+5.2.5.2 table_ocr/extract_cells/__main__.py
import os
-import sys
-
-import cv2
-
-from table_ocr.extract_cells import extract_cell_images_from_table
+
import sys
-def main(f):
- results = []
- directory, filename = os.path.split(f)
- table = cv2.imread(f, cv2.IMREAD_GRAYSCALE)
- rows = extract_cell_images_from_table(table)
- cell_img_dir = os.path.join(directory, "cells")
- os.makedirs(cell_img_dir, exist_ok=True)
- for i, row in enumerate(rows):
- for j, cell in enumerate(row):
- cell_filename = "{:03d}-{:03d}.png".format(i, j)
- path = os.path.join(cell_img_dir, cell_filename)
- cv2.imwrite(path, cell)
- print(path)
-
-
-def extract_cell_images_from_table(image):
- BLUR_KERNEL_SIZE = (17, 17)
- STD_DEV_X_DIRECTION = 0
- STD_DEV_Y_DIRECTION = 0
- blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
- MAX_COLOR_VAL = 255
- BLOCK_SIZE = 15
- SUBTRACT_FROM_MEAN = -2
-
- img_bin = cv2.adaptiveThreshold(
- ~blurred,
- MAX_COLOR_VAL,
- cv2.ADAPTIVE_THRESH_MEAN_C,
- cv2.THRESH_BINARY,
- BLOCK_SIZE,
- SUBTRACT_FROM_MEAN,
- )
- vertical = horizontal = img_bin.copy()
- SCALE = 5
- image_width, image_height = horizontal.shape
- horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
- horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
- vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
- vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)
-
- horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
- vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))
-
- mask = horizontally_dilated + vertically_dilated
- contours, heirarchy = cv2.findContours(
- mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE,
- )
-
- perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
- epsilons = [0.05 * p for p in perimeter_lengths]
- approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
-
- # Filter out contours that aren't rectangular. Those that aren't rectangular
- # are probably noise.
- approx_rects = [p for p in approx_polys if len(p) == 4]
- bounding_rects = [cv2.boundingRect(a) for a in approx_polys]
-
- # Filter out rectangles that are too narrow or too short.
- MIN_RECT_WIDTH = 40
- MIN_RECT_HEIGHT = 10
- bounding_rects = [
- r for r in bounding_rects if MIN_RECT_WIDTH < r[2] and MIN_RECT_HEIGHT < r[3]
- ]
-
- # The largest bounding rectangle is assumed to be the entire table.
- # Remove it from the list. We don't want to accidentally try to OCR
- # the entire table.
- largest_rect = max(bounding_rects, key=lambda r: r[2] * r[3])
- bounding_rects = [b for b in bounding_rects if b is not largest_rect]
-
- cells = [c for c in bounding_rects]
- def cell_in_same_row(c1, c2):
- c1_center = c1[1] + c1[3] - c1[3] / 2
- c2_bottom = c2[1] + c2[3]
- c2_top = c2[1]
- return c2_top < c1_center < c2_bottom
-
- orig_cells = [c for c in cells]
- rows = []
- while cells:
- first = cells[0]
- rest = cells[1:]
- cells_in_same_row = sorted(
- [
- c for c in rest
- if cell_in_same_row(c, first)
- ],
- key=lambda c: c[0]
- )
-
- row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
- rows.append(row_cells)
- cells = [
- c for c in rest
- if not cell_in_same_row(c, first)
- ]
-
- # Sort rows by average height of their center.
- def avg_height_of_center(row):
- centers = [y + h - h / 2 for x, y, w, h in row]
- return sum(centers) / len(centers)
-
- rows.sort(key=avg_height_of_center)
- cell_images_rows = []
- for row in rows:
- cell_images_row = []
- for x, y, w, h in row:
- cell_images_row.append(image[y:y+h, x:x+w])
- cell_images_rows.append(cell_images_row)
- return cell_images_rows
+from table_ocr.extract_cells import main
-if __name__ == "__main__":
- main(sys.argv[1])
+paths = main(sys.argv[1])
+print("\n".join(paths))
5.2.6 table_ocr/ocr_image/
+5.2.6 table_ocr/ocr_image/
5.2.6.1 table_ocr/ocr_image/__init__.py
+5.2.6.1 table_ocr/ocr_image/__init__.py
import math
+import os
+import sys
import cv2
import numpy as np
import pytesseract
+def main(image_file, tess_args):
+ """
+ OCR the image and output the text to a file with an extension that is ready
+ to be used in Tesseract training (.gt.txt).
+
+ Tries to crop the image so that only the relevant text gets passed to Tesseract.
+
+ Returns the name of the text file that contains the text.
+ """
+ directory, filename = os.path.split(image_file)
+ filename_sans_ext, ext = os.path.splitext(filename)
+ image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
+ cropped = crop_to_text(image)
+ ocr_data_dir = os.path.join(directory, "ocr_data")
+ os.makedirs(ocr_data_dir, exist_ok=True)
+ out_imagepath = os.path.join(ocr_data_dir, filename)
+ out_txtpath = os.path.join(ocr_data_dir, "{}.gt.txt".format(filename_sans_ext))
+ cv2.imwrite(out_imagepath, cropped)
+ if not tess_args:
+ d = os.path.dirname(sys.modules["table_ocr"].__file__)
+ tessdata_dir = os.path.join(d, "tessdata")
+ tess_args = ["--psm", "7", "-l", "table-ocr", "--tessdata-dir", tessdata_dir]
+ txt = ocr_image(cropped, " ".join(tess_args))
+ with open(out_txtpath, "w") as txt_file:
+ txt_file.write(txt)
+ return out_txtpath
+
def crop_to_text(image):
MAX_COLOR_VAL = 255
BLOCK_SIZE = 15
@@ -2022,8 +2073,8 @@ cells.
5.2.6.2 table_ocr/ocr_image/__main__.py
+5.2.6.2 table_ocr/ocr_image/__main__.py
import argparse
-import math
-import os
-import sys
-
-import cv2
-from table_ocr.ocr_image import crop_to_text, ocr_image
+from table_ocr.ocr_image import main
description="""Takes a single argument that is the image to OCR.
Remaining arguments are passed directly to Tesseract.
@@ -2053,35 +2099,19 @@ Creates images and text files that can be used for training tesseract. See
parser = argparse.ArgumentParser(description=description)
parser.add_argument("image", help="filepath of image to perform OCR")
-def main(image_file, tess_args):
- directory, filename = os.path.split(image_file)
- filename_sans_ext, ext = os.path.splitext(filename)
- image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
- cropped = crop_to_text(image)
- ocr_data_dir = os.path.join(directory, "ocr_data")
- os.makedirs(ocr_data_dir, exist_ok=True)
- out_imagepath = os.path.join(ocr_data_dir, filename)
- out_txtpath = os.path.join(ocr_data_dir, "{}.gt.txt".format(filename_sans_ext))
- cv2.imwrite(out_imagepath, cropped)
- txt = ocr_image(cropped, " ".join(tess_args))
- print(txt)
- with open(out_txtpath, "w") as txt_file:
- txt_file.write(txt)
-
-if __name__ == "__main__":
- args, tess_args = parser.parse_known_args()
- main(args.image, tess_args)
+args, tess_args = parser.parse_known_args()
+print(main(args.image, tess_args))
5.2.7 table_ocr/ocr_to_csv/
+5.2.7 table_ocr/ocr_to_csv/
5.2.7.1 table_ocr/ocr_to_csv/__init__.py
+5.2.7.1 table_ocr/ocr_to_csv/__init__.py
import csv
@@ -2115,8 +2145,8 @@ parser.add_argument("image",
-
5.2.7.2 table_ocr/ocr_to_csv/__main__.py
+5.2.7.2 table_ocr/ocr_to_csv/__main__.py
import argparse
@@ -2145,8 +2175,8 @@ parser.add_argument("files", nargs=
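The body of `ocr_to_csv` is not visible in this diff, so the following is only a sketch of how per-cell text files could be assembled into CSV. It assumes the files follow the `RRR-CCC.gt.txt` naming produced by `extract_cells` and `ocr_image`; the `text_files_to_csv` name is hypothetical.

```python
import csv
import io
import os
import tempfile


def text_files_to_csv(paths):
    """Build CSV text from cell files named <row>-<col>.gt.txt."""
    rows = {}
    for path in paths:
        name = os.path.basename(path).split(".")[0]  # e.g. "000-001"
        row, col = (int(part) for part in name.split("-"))
        with open(path) as cell_file:
            rows.setdefault(row, {})[col] = cell_file.read().strip()
    out = io.StringIO()
    writer = csv.writer(out)
    for row in sorted(rows):
        writer.writerow([rows[row][col] for col in sorted(rows[row])])
    return out.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    cells = {"000-000": "a", "000-001": "b", "001-000": "c", "001-001": "d"}
    paths = []
    for name, text in cells.items():
        path = os.path.join(tmp, name + ".gt.txt")
        with open(path, "w") as cell_file:
            cell_file.write(text + "\n")
        paths.append(path)
    # Reverse the list to show that input order does not matter.
    csv_text = text_files_to_csv(list(reversed(paths)))

print(csv_text)
```

Sorting by the row and column numbers embedded in the filenames matters because `find` does not guarantee any particular output order.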
6 Utils
+6 Utils
advice-add
.
(concat "#+ATTR_HTML: :width " width " :height " height "\n[[file:" text "]]")
+
(concat "#+ATTR_HTML: :width " width " :height " height "\n[[file:" text "]]")
advice-add
.
6.1 Logging
+6.1 Logging
def get_logger(name):
@@ -2217,7 +2247,7 @@ with
advice-add
.