From bc32d5925334e33614b06cdee01e2292f9484e9a Mon Sep 17 00:00:00 2001
From: Eric Ihli
Date: Tue, 28 Apr 2020 08:49:55 -0700
Subject: [PATCH] Add tip for quickly creating training data

---
 pdf_table_extraction_and_ocr.org | 37 ++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/pdf_table_extraction_and_ocr.org b/pdf_table_extraction_and_ocr.org
index 681caa0..40fe5e4 100644
--- a/pdf_table_extraction_and_ocr.org
+++ b/pdf_table_extraction_and_ocr.org
@@ -350,6 +350,43 @@ Once the training is complete, there will be a new file
 ~tesstrain/data/.traineddata~. Copy that file to the directory Tesseract
 searches for models. On my machine, it was ~/usr/local/share/tessdata/~.
 
+*** Training tips
+
+Here is a tip for quickly creating training data.
+
+The ~ocr_cells~ script writes its output to a directory named ~ocr_data~ that
+contains two files for each cell. One file is the image of the cell and the
+other file is the OCR text.
+
+You'll want to compare each image to its OCR text to check for accuracy. If
+the text doesn't match, correct the text and add the image to the
+training data.
+
+The fastest way to do this is with ~feh~.
+
+~feh~ lets you view an image and its caption at the same time and lets you edit
+the caption from within ~feh~.
+
+~feh~ expects the captions to be named ~.txt~, so use a little
+shell-fu to do a quick rename.
+
+#+BEGIN_SRC shell
+for f in *.txt; do mv -- "$f" "${f%%.*}.png.txt"; done
+#+END_SRC
+
+Then run ~feh -K .~ to specify the current directory as the caption directory.
+This will open a window with the first image in the directory and its caption.
+
+Press ~c~ to edit the caption (if needed) and ~n~/~p~ to move to the
+next/previous image. Press ~q~ to quit.
+
+When finished, rename the files back to the filename structure that Tesseract
+looks for in training.
+
+#+BEGIN_SRC shell
+for f in *.txt; do mv -- "$f" "${f%%.*}.gt.txt"; done
+#+END_SRC
+
 ** Blur
 
 Blurring helps to make noise less noisy so that the overall structure of an