42 Commits (8be7972bc710aa974faca7ba97c964f9ec3f086c)
 

Author SHA1 Message Date
Eric Ihli 8be7972bc7 Add 0.2.3 to dist 4 years ago
Eric Ihli b7c6034331 Update README and add README.txt to setup.py 4 years ago
Eric Ihli f404b6e673 Update README 4 years ago
Eric Ihli 1aae198e39 Update readme and add dist for 0.2.2 4 years ago
Eric Ihli 248fc827cc Add README and demo 4 years ago
Eric Ihli 7e5516eb5d Update README in setup.py 4 years ago
Eric Ihli 5a34d0a845 Whitespace change in README 4 years ago
Eric Ihli df50db1fbd Strip whitespace when reading ocr for csv 4 years ago
Eric Ihli 01406752d4 Update docs to describe ocr_image defaults
The example scripts now work with the included tessdata models.
4 years ago
Eric Ihli 3b31888a55 Include tesseract traineddata files
Includes an English model and a model trained specifically on cells
extracted from tables.
4 years ago
Eric Ihli 7b103723af Allow tesseract params to be passed into OSD 5 years ago
Eric Ihli bc32d59253 Add tip for quickly creating training data 5 years ago
Eric Ihli 7ad4c0d4dc Fix bug relating to directory of pdf
Relative paths now work.
5 years ago
Eric Ihli 449ee015d3 Update license and setup.py 5 years ago
Eric Ihli 1156eafc5c Return sorted image paths from pdf_to_images 5 years ago
Eric Ihli 99beaaa2d1 Make ocr_image return/print path of text file
Move the main function to the __init__ file so it can be imported by
other code. Modify it so that it returns the path to the file that
contains the OCR text so that calling code can find the results.
5 years ago
Eric Ihli 6359b86e42 Move main of extract_cells to __init__.py
Also make main return a value rather than print to stdout.

It's more convenient for other code to use these modules when they
return values rather than print.
5 years ago
Eric Ihli 962abb7a02 Move `main` to __init__ for extract_tables
Since this function is the meat and potatoes, it's nice to be able to
import it in the usual way, which you can't really do if it only resides
in __main__.py.

Also, __main__.py doesn't need `if __name__ == "__main__"`. The whole
point of __main__.py is that it only gets run when that condition is true.
5 years ago
Eric Ihli 85f864cd17 Return value from main rather than print
We only really want to print if we are running the module as a script.
It's nice to allow `main` to be imported and used from other code, and
that code probably wants a returned value rather than having to read
from stdout.
5 years ago
Eric Ihli 37483148c8 Fix typo 5 years ago
Eric Ihli 0ac2e885c1 Fix link to documentation 5 years ago
Eric Ihli 8e9bc0e0a0 Fix typo 5 years ago
Eric Ihli 075e265d05 Add README 5 years ago
Eric Ihli 0420f97bd6 Update exported html 5 years ago
Eric Ihli eb4e3d81b7 Clarify and expand on content around code 5 years ago
Eric Ihli 6891fc9990 Add example image and csv output
Give more code blocks names
5 years ago
Eric Ihli 4eca593944 Remove unused files, finish refactor of structure 5 years ago
Eric Ihli b911f87126 Refactor extract_cells into module 5 years ago
Eric Ihli b9f088cf92 Refactor table extraction into module 5 years ago
Eric Ihli 98ef6ffd85 Refactor utilities to modules
Rather than have them all tangled into __main__. This makes the package
more usable as python modules rather than just a command line utility.
5 years ago
Eric Ihli bea192678e Fix bug picking up noise in detecting contours 5 years ago
Eric Ihli 54511b9a1f Fix bugs and improve accuracy
Files in the ocr_to_csv module need to be named in a certain way.
Specify that and fix a bug: the files need to be sorted
lexicographically.

Don't dilate the characters in a cell in order to make a contiguous set
of pixels that we can find a contour around. The problem with that is
that you sometimes dilate too far and hit an image boundary and can't
erode back in. If a cell border remained between the text and the image
boundary, that border line is now kept in the image. (Unless you remove
it some other way, which might be a valid option in the future.) The
method we're using now instead is to group
all contours together and create a bounding box around all of them. The
problem with that is if there is any noise at all outside the text,
we're grabbing it. Before, we were dilating and taking the largest
contour, so we weren't including that noise. And we can't get rid of the
noise with a morphological opening because the noise is sometimes pretty
large, and opening any bigger distorts the text so much that we lose
accuracy in finding those boundaries.

Also adds a shell script to simplify the plumbing of all these modules.
5 years ago
Eric Ihli aa900de4e7 Use cleaner filenames for intermediate files 5 years ago
Eric Ihli e49fffa5a7 Add module for outputting csv from parsed table
Make cell extraction a little more accurate.
5 years ago
Eric Ihli de398f73c2 Add ocr_image module 5 years ago
Eric Ihli f77425fd9e Remove misnamed module 5 years ago
Eric Ihli 96497d7327 Add doc for shell script to parse text from table
Add notes for improving accuracy.
5 years ago
Eric Ihli 32c62fd773 Add script to ocr individual cells 5 years ago
Eric Ihli 396782051e Remove egg info from git tracking 5 years ago
Eric Ihli 78e9cdb3f5 Add gitignore, rename modules, remove unused code 5 years ago
Eric Ihli 8546902e64 Fix bug, html-image-size helper had no results
This was causing the results of every src block that used it to be nil.
5 years ago
Eric Ihli 28bcdbd4f7 Initial commit 5 years ago