image-table-ocr



Eric Ihli 54511b9a1f Fix bugs and improve accuracy		6 years ago

Files in the ocr_to_csv module need to be named in a certain way. Specify that and fix a bug: they need to be sorted lexicographically.

Don't dilate the characters in a cell in order to make a contiguous set of pixels that we can find a contour around. The problem with that is that you sometimes dilate too far, hit an image boundary, and can't erode back in. If a cell wall border remained between the text and the image boundary, you're now keeping that border line in the image. (Unless you remove it some other way, so that might be a valid option in the future.)

The method we're using now instead is to group all contours together and create a bounding box around all of them. The problem with that is that if there is any noise at all outside the text, we're grabbing it. Before, we were dilating and taking the largest contour, so we weren't including that noise. And we can't get rid of the noise with a morphological opening, because the noise is sometimes pretty big, and opening any bigger distorts the text so much that we lose accuracy in finding those boundaries.

Also adds a shell script to simplify the plumbing of all these modules.
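
The trade-off described in the commit message can be illustrated with a minimal sketch. This is not the repository's code: the function names `union_bounding_box` and `largest_box` and the sample rectangles are hypothetical, and it assumes each contour has already been reduced to an `(x, y, width, height)` rectangle, the layout returned by OpenCV's `cv2.boundingRect`. It shows why a box around all contours also picks up stray noise, while taking only the largest one ignores it.

```python
from typing import Iterable, Tuple

# (x, y, width, height); assumed to match the tuple layout of cv2.boundingRect
Rect = Tuple[int, int, int, int]


def union_bounding_box(rects: Iterable[Rect]) -> Rect:
    """Return the smallest rectangle that encloses every rectangle in rects."""
    rects = list(rects)
    if not rects:
        raise ValueError("need at least one rectangle")
    left = min(x for x, _, _, _ in rects)
    top = min(y for _, y, _, _ in rects)
    right = max(x + w for x, _, w, _ in rects)
    bottom = max(y + h for _, y, _, h in rects)
    return (left, top, right - left, bottom - top)


def largest_box(rects: Iterable[Rect]) -> Rect:
    """Return the rectangle with the largest area (the 'largest contour' idea)."""
    return max(rects, key=lambda r: r[2] * r[3])


# Hypothetical cell contents: two characters plus a speck of noise in a corner.
text_rects = [(10, 8, 6, 12), (18, 8, 7, 12)]
noise_rect = (0, 0, 2, 2)

clean_box = union_bounding_box(text_rects)
noisy_box = union_bounding_box(text_rects + [noise_rect])
```

With these sample values, `clean_box` hugs the two characters, while `noisy_box` stretches to the corner of the cell to include the speck. `largest_box` would skip the speck, but without dilation it would also return only one of the two characters, which is why the earlier approach dilated the text into a single contour first.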
resources/examples	Fix bugs and improve accuracy	6 years ago
table_ocr	Fix bugs and improve accuracy	6 years ago
.gitignore	Remove egg info from git tracking	6 years ago
LICENSE	Initial commit	6 years ago
ocr_tables	Fix bugs and improve accuracy	6 years ago
pdf_table_extraction_and_ocr.html	Fix bug, html-image-size helper had no results	6 years ago
pdf_table_extraction_and_ocr.org	Fix bugs and improve accuracy	6 years ago
setup.py	Add gitignore, rename modules, remove unused code	6 years ago