Add tip for quickly creating training data

6 years ago · bc32d59253
parent 7ad4c0d4dc
commit bc32d59253
1 changed files with 37 additions and 0 deletions
--- a/pdf_table_extraction_and_ocr.org
+++ b/pdf_table_extraction_and_ocr.org
@@ -350,6 +350,43 @@ Once the training is complete, there will be a new file
 ~tesstrain/data/<model-name>.traineddata~. Copy that file to the directory
 Tesseract searches for models. On my machine, it was ~/usr/local/share/tessdata/~.
 *** Training tips
 Here is a tip for quickly creating training data.
 The output of the ~ocr_cells~ script will be a directory named ~ocr_data~ that
 will have two files for each cell. One file is the image of the cell and the
 other file is the OCR text.
 You'll want to compare each image to its OCR text to check for accuracy. If
 the text doesn't match, you'll want to update the text and add the image to the
 training data.
 The fastest way to do this is with ~feh~.
 ~feh~ lets you view an image and a caption at the same time and lets you edit
 the caption from within ~feh~.
 ~feh~ expects the captions to be named ~<image-name>.txt~, so use a little
 shell-fu to do a quick rename.
 #+BEGIN_SRC shell
 # Keep everything before the first "." and append .png.txt; quoting protects odd filenames
 for f in *.txt; do mv -- "$f" "${f%%.*}.png.txt"; done
 #+END_SRC
 Then run ~feh -K .~ to specify the current directory as the caption directory.
 This will open a window with the first image in the directory and its caption.
 Press ~c~ to edit the caption (if needed) and ~n~/~p~ to move to the
 next/previous image. Press ~q~ to quit.
 When finished, rename the files back to the filename structure that Tesseract
 looks for in training.
 #+BEGIN_SRC shell
 # Strip everything after the first "." and restore the .gt.txt suffix
 for f in *.txt; do mv -- "$f" "${f%%.*}.gt.txt"; done
 #+END_SRC
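 After renaming back, it may be worth checking that every image still has a
 ground-truth file next to it. The helper below is only a sketch:
 ~check_gt_pairs~ is a hypothetical name, not part of the scripts above, and it
 assumes the images are named ~<name>.png~ with captions in ~<name>.gt.txt~. It
 also treats an empty text file as a problem, which may not suit blank cells.

```shell
# Hypothetical helper: report every PNG in a directory that has no matching,
# non-empty <name>.gt.txt file. Returns 1 if anything was reported, else 0.
check_gt_pairs() {
    dir=${1:-.}
    missing=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue          # the glob matched nothing
        gt="${img%.png}.gt.txt"
        if [ ! -s "$gt" ]; then            # -s: file exists and is not empty
            echo "missing or empty: $gt"
            missing=$((missing + 1))
        fi
    done
    return "$((missing > 0))"
}
```

 Run it as ~check_gt_pairs ocr_data~ and fix anything it lists before training.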
 ** Blur
 Blurring helps to make noise less noisy so that the overall structure of an