image-table-ocr

Commit Graph

Author	SHA1	Message	Date
Eric Ihli	49205462a3	Fix bug: missing __init__.py in demo was causing a ModuleNotFound exception.	5 years ago
Eric Ihli	f77c6854f8	Add requests as dependency for demo	5 years ago
Eric Ihli	8be7972bc7	Add 0.2.3 to dist	5 years ago
Eric Ihli	b7c6034331	Update README and add README.txt to setup.py	5 years ago
Eric Ihli	f404b6e673	Update README	5 years ago
Eric Ihli	1aae198e39	Update readme and add dist for 0.2.2	5 years ago
Eric Ihli	248fc827cc	Add README and demo	5 years ago
Eric Ihli	7e5516eb5d	Update README in setup.py	5 years ago
Eric Ihli	5a34d0a845	Whitespace change in README	5 years ago
Eric Ihli	df50db1fbd	Strip whitespace when reading ocr for csv	5 years ago
Eric Ihli	01406752d4	Update docs to describe ocr_image defaults. The example scripts now work with the included tessdata models.	5 years ago
Eric Ihli	3b31888a55	Include tesseract traineddata files. Includes an English model and a model trained specifically on cells extracted from tables.	5 years ago
Eric Ihli	7b103723af	Allow tesseract params to be passed into OSD	5 years ago
Eric Ihli	bc32d59253	Add tip for quickly creating training data	5 years ago
Eric Ihli	7ad4c0d4dc	Fix bug relating to directory of pdf. Relative paths now work.	5 years ago
Eric Ihli	449ee015d3	Update license and setup.py	5 years ago
Eric Ihli	1156eafc5c	Return sorted image paths from pdf_to_images	5 years ago
Eric Ihli	99beaaa2d1	Make ocr_image return/print path of text file. Move the main function to the __init__ file so it can be imported by other code. Modify it so that it returns the path to the file that contains the OCR text so that calling code can find the results.	5 years ago
Eric Ihli	6359b86e42	Move main of extract_cells to __init__.py. Also make main return a value rather than print to stdout. It's more convenient for other code to use these modules when they return values rather than print.	5 years ago
Eric Ihli	962abb7a02	Move `main` to __init__ for extract_tables. Since this function is the meat and potatoes, it's nice to be able to import it as typical, which you can't really do if it only resides in __main__.py. Also, __main__.py doesn't need `if __name__ == "__main__"`. The whole point of __main__.py is that it only gets run when that condition is true.	5 years ago
Eric Ihli	85f864cd17	Return value from main rather than print. We only really want to print if we are running the module as a script. It's nice to allow `main` to be imported and used from other code, and that code probably wants a returned value rather than having to read from stdout.	5 years ago
Eric Ihli	37483148c8	Fix typo	5 years ago
Eric Ihli	0ac2e885c1	Fix link to documentation	5 years ago
Eric Ihli	8e9bc0e0a0	Fix typo	5 years ago
Eric Ihli	075e265d05	Add README	5 years ago
Eric Ihli	0420f97bd6	Update exported html	5 years ago
Eric Ihli	eb4e3d81b7	Clarify and expand on content around code	5 years ago
Eric Ihli	6891fc9990	Add example image and csv output. Give more code blocks names.	5 years ago
Eric Ihli	4eca593944	Remove unused files, finish refactor of structure	5 years ago
Eric Ihli	b911f87126	Refactor extract_cells into module	5 years ago
Eric Ihli	b9f088cf92	Refactor table extraction into module	5 years ago
Eric Ihli	98ef6ffd85	Refactor utilities to modules, rather than have them all tangled into __main__. This makes the package more usable as python modules rather than just a command line utility.	5 years ago
Eric Ihli	bea192678e	Fix bug picking up noise in detecting contours	5 years ago
Eric Ihli	54511b9a1f	Fix bugs and improve accuracy. Files in the ocr_to_csv module need to be named in a certain way. Specify that and fix a bug: they need to be sorted lexicographically. Don't dilate the characters in a cell in order to make a contiguous set of pixels that we can find a contour around. The problem with that is that you sometimes dilate too far and hit an image boundary and can't erode back in. If a cell wall border was remaining between the text and the image boundary, well now you're keeping that border line in the image. (Unless you remove it some other way. So that might be a valid option in the future.) The method we're using now instead is to group all contours together and create a bounding box around all of them. The problem with that is if there is any noise at all outside the text, we're grabbing it. Before, we were dilating and taking the largest contour, so we weren't including that noise. And we can't get rid of the noise with opening morph because it's sometimes pretty big noise and opening any bigger distorts the text so much that we lose accuracy in finding those boundaries. Also adds a shell script to simplify the plumbing of all these modules.	5 years ago
Eric Ihli	aa900de4e7	Use cleaner filenames for intermediate files	5 years ago
Eric Ihli	e49fffa5a7	Add module for outputting csv from parsed table. Make cell extraction a little more accurate.	5 years ago
Eric Ihli	de398f73c2	Add ocr_image module	5 years ago
Eric Ihli	f77425fd9e	Remove misnamed module	5 years ago
Eric Ihli	96497d7327	Add doc for shell script to parse text from table. Add notes for improving accuracy.	5 years ago
Eric Ihli	32c62fd773	Add script to ocr individual cells	5 years ago
Eric Ihli	396782051e	Remove egg info from git tracking	5 years ago
Eric Ihli	78e9cdb3f5	Add gitignore, rename modules, remove unused code	5 years ago
Eric Ihli	8546902e64	Fix bug, html-image-size helper had no results. This was causing the results of every src block that used it to be nil.	5 years ago
Eric Ihli	28bcdbd4f7	Initial commit	5 years ago
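
Note on commit 54511b9a1f: the message contrasts two ways of locating the text inside a cell: one bounding box around all contours, versus keeping only the largest contour after dilation. The trade-off can be sketched roughly as below. This is a minimal illustration and not the repository's code. It assumes each contour has already been reduced to an (x, y, width, height) rectangle, the layout that OpenCV's cv2.boundingRect returns. The names union_bounding_box and largest_box are hypothetical, and largest_box skips the dilation step the old approach relied on.

```python
from typing import Iterable, Tuple

# (x, y, width, height): assumed rectangle layout, matching cv2.boundingRect.
Rect = Tuple[int, int, int, int]


def union_bounding_box(rects: Iterable[Rect]) -> Rect:
    """Smallest rectangle that contains every rectangle in rects."""
    rects = list(rects)
    if not rects:
        raise ValueError("need at least one rectangle")
    left = min(x for x, _, _, _ in rects)
    top = min(y for _, y, _, _ in rects)
    right = max(x + w for x, _, w, _ in rects)
    bottom = max(y + h for _, y, _, h in rects)
    return (left, top, right - left, bottom - top)


def largest_box(rects: Iterable[Rect]) -> Rect:
    """Rectangle with the largest area (stand-in for 'take the largest contour')."""
    return max(rects, key=lambda r: r[2] * r[3])


if __name__ == "__main__":
    # Two made-up text fragments plus a small speck of noise near the corner.
    text = [(10, 12, 30, 14), (45, 12, 22, 14)]
    noise = [(2, 2, 3, 3)]
    print(union_bounding_box(text))          # (10, 12, 57, 14)
    print(union_bounding_box(text + noise))  # (2, 2, 65, 24)
    print(largest_box(text + noise))         # (10, 12, 30, 14)
```

The second result shows the drawback the commit message describes: a single speck of noise stretches the grouped box well past the text, whereas picking one largest region ignores the noise but, without dilation to merge the characters, also misses the second text fragment.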

44 Commits (main)