Move the main function to the __init__ file so it can be imported by
other code. Modify it so that it returns the path to the file that
contains the OCR text so that calling code can find the results.
Also make main return a value rather than print to stdout.
It's more convenient for other code to use these modules when they
return values rather than print.
Since this function is the meat and potatoes, it's nice to be able to
import it the usual way, which you can't really do if it only resides in
__main__.py.
Also, __main__.py doesn't need `if __name__ == "__main__"`. The whole
point of __main__.py is that it only gets run when that condition is true.
We only really want to print if we are running the module as a script.
It's nice to allow `main` to be imported and used from other code, and
that code probably wants a returned value rather than having to read
from stdout.
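A minimal sketch of the split described above, not the actual module code.
The function body, the `text` parameter, and the `.txt` naming are
placeholders assumed for illustration; the point is only that `main`
returns the output path and that printing is left to the script entry
point.

```python
import os


def main(image_path, text="placeholder OCR text"):
    """Hypothetical stand-in for the package-level main() in __init__.py.

    Writes the (placeholder) OCR text next to the image and returns the
    path of the text file instead of printing it.
    """
    root, _ = os.path.splitext(image_path)
    out_path = root + ".txt"
    with open(out_path, "w") as f:
        f.write(text)
    return out_path


def run_as_script(argv):
    """Roughly what __main__.py would contain, with no __name__ guard needed.

    Only the script entry point prints; importers get the return value.
    """
    print(main(argv[0]))
```

Other code can then do `path = main("cell.png")` and open the file itself,
while `python -m <package> cell.png` still prints the path.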
Files in the ocr_to_csv module need to be named in a certain way.
Specify that and fix a bug: the files need to be sorted
lexicographically.
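To illustrate why the naming matters for a lexicographic sort, here is a
small sketch. The zero-padded `row-column.txt` names and the
`sort_cell_files` helper are hypothetical and only stand in for whatever
naming the module specifies.

```python
import os


def sort_cell_files(paths):
    """Sort file paths lexicographically by file name.

    Plain string comparison only matches numeric order when the numbers
    in the names are zero-padded to the same width.
    """
    return sorted(paths, key=os.path.basename)


# Hypothetical zero-padded names: string order matches row/column order.
padded = ["out/010-000.txt", "out/002-001.txt", "out/002-000.txt"]
ordered = sort_cell_files(padded)

# Unpadded names sort "10" before "2", because "1" < "2" as characters.
unpadded = sort_cell_files(["2.txt", "10.txt"])
```

Here `ordered` starts with `out/002-000.txt` and ends with
`out/010-000.txt`, while `unpadded` comes out as `["10.txt", "2.txt"]`.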
Don't dilate the characters in a cell in order to make a contiguous set
of pixels that we can find a contour around. The problem with that is
that you sometimes dilate too far, hit an image boundary, and can't
erode back in. If part of a cell border remained between the text and
the image boundary, you're now keeping that border line in the
image. (Unless you remove it some other way, which might be a valid
option in the future.) The method we're using now instead is to group
all contours together and create a bounding box around all of them. The
problem with that is that if there is any noise at all outside the text,
we grab it too. Before, we were dilating and taking the largest
contour, so we weren't including that noise. And we can't get rid of the
noise with a morphological opening, because the noise is sometimes
pretty big, and opening any bigger distorts the text so much that we
lose accuracy in finding those boundaries.
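The grouping step can be sketched in plain Python as the union of
per-contour rectangles. This is an illustration, not the code in this
commit: `union_bounding_box` is a hypothetical helper, and the rectangles
are assumed to be `(x, y, w, h)` tuples such as the ones OpenCV's
`cv2.boundingRect` returns for each contour.

```python
def union_bounding_box(rects):
    """Return the smallest (x, y, w, h) box that contains every rect.

    rects is an iterable of (x, y, w, h) tuples, one per contour.
    Returns None when there are no rects.
    """
    rects = list(rects)
    if not rects:
        return None
    x0 = min(x for x, y, w, h in rects)
    y0 = min(y for x, y, w, h in rects)
    x1 = max(x + w for x, y, w, h in rects)
    y1 = max(y + h for x, y, w, h in rects)
    return (x0, y0, x1 - x0, y1 - y0)


# Two character contours close together.
text_only = union_bounding_box([(10, 10, 5, 5), (20, 12, 4, 6)])

# The same contours plus a one-pixel speck of noise in the corner.
with_noise = union_bounding_box([(10, 10, 5, 5), (20, 12, 4, 6), (0, 0, 1, 1)])
```

`text_only` is `(10, 10, 14, 8)`, but a single speck of noise stretches
`with_noise` to `(0, 0, 24, 18)`, which is the drawback described above.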
Also adds a shell script to simplify the plumbing of all these modules.