nOCR tutorial



This tutorial shows how to OCR an image-based subtitle file with the nOCR method.
Most of it should also apply to "Binary image compare".

The first step is to open your image-based subtitle file, which you can do via File - Open..., by dragging and dropping the file onto the list view, or via File - Import images...

For this tutorial I've opened a Blu-ray .sup file, which brings up the main OCR window.

Main OCR window



In the main OCR window you can set up the properties for the OCR process and start it by clicking on the Start OCR button.

Also note that many options are available via list view and image context menus (right click in list view or on preview image to activate), e.g. export to another format without OCR'ing.

Some of the most important items have been marked with a red circle; more on these later.

The main OCR window supports these shortcuts:
  • Alt+Up/Down: Go to previous/next line
  • Ctrl+G: Go to line number
  • Ctrl+T: Open auto training window
  • Ctrl+P: Show preview
  • Ctrl+Shift+P: Image pre-processing
  • Ctrl+F: Find
  • F3: Find next
  • Ctrl+H: Show how horizontal line split will be done for current image

Preparing OCR

When starting OCR for a new file we need to set up a few things, and the easiest way to check them is to start the OCR process and see what happens. The three most important things are:

  • Font size
  • Number of pixels is space
  • Italic angle

If the font size is small (letter height less than ~28 pixels), you should use "Tesseract 5" or "Binary image compare".
You can double-click on a line in the list view to open the "Inspect window", where you can see how letters will be split and recognized, along with their sizes.

"Number of pixels is space" will determine how words will be found, so "h o w a r e y o u" indicates a too low value, and "howareyou" indicates a too high value. For a normal Blu-ray .sup file 8-15 pixels of space between words are common, 10-12 for most.

If the subtitle contains italics, find one of the italic lines, right-click on the image and choose "Set italic angle". The italic angle is mostly used to help split words (together with "Number of pixels is space").
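
As a rough picture of why a slant angle helps with splitting, the Python sketch below shears a tiny slanted bitmap upright and counts the blocks separated by empty pixel columns. It is only an illustration under my own assumptions: the helper names `deslant` and `count_blocks` are made up, the 45 degree angle is exaggerated to keep the bitmap small, and Subtitle Edit may use the angle quite differently internally.

```python
import math


def count_blocks(rows):
    """Count runs of pixel columns that contain at least one '#'."""
    blocks = 0
    inside = False
    for x in range(len(rows[0])):
        filled = any(row[x] == "#" for row in rows)
        if filled and not inside:
            blocks += 1
        inside = filled
    return blocks


def deslant(rows, angle_degrees):
    """Shift each row left in proportion to its height above the baseline.

    Sketch only: pixels shifted past the left edge are simply dropped.
    """
    height = len(rows)
    slope = math.tan(math.radians(angle_degrees))
    result = []
    for y, row in enumerate(rows):
        shift = round((height - 1 - y) * slope)
        result.append(row[shift:] + "." * shift)
    return result


# Two strokes leaning to the right; every pixel column is occupied,
# so there is no empty column to split on.
italic = [
    "...#.#",
    "..#.#.",
    ".#.#..",
    "#.#...",
]

print(count_blocks(italic))               # 1 block before de-slanting
print(count_blocks(deslant(italic, 45)))  # 2 blocks after de-slanting
```

Once the slant is compensated for, an empty column appears between the two strokes, which is the kind of gap that letter and word splitting depends on.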


Running OCR

When running the OCR you should have "Draw missing texts" checked - otherwise you will get a "*" for unknown letters.
Clicking on the Start OCR button will start the OCR process and with my chosen .sup file I get the following prompt pretty fast:



Here I must enter the new letter in "Character(s) as text", but I also need to make sure that the green and red lines are correct.
For characters like "i" and "!", make sure that the dot has some green and that a red line separates the letter parts.

In the image below I've entered "j" as text and added an extra red line between the white blocks, hoping that it will avoid matching the wrong letter, such as "l".



Now I just click "OK" and the OCR process continues.

After running through a few lines I can see that some letters are detected incorrectly. In the "Unknown words" list it looks like "y" is detected as "V", so I press the "Stop" button, which stops the OCR process.
I double-click on line #6 in the list view, which opens the "Inspect window". Here I click on "Add better match".




In the "nOCR character window" I enter "y" as text and press "OK".



Then I continue the OCR process from line #6.
I add a few more characters, as I did with "j" earlier, and then I notice the entry "ridicu1ous" with "1" (one) instead of "l" (L) (line #67) in the "Unknown words" list.
I double-click on line #67 in the list view to inspect again.



The "1" (one) actually looks like "l" (L), so I delete the character from the OCR db by clicking on the "Delete" button.
Then I continue from line #67 and the error is gone.


Line #1013 gives "0on't" instead of "Don't"... which I fix with "Add better match".

After the OCR process is completed, I review the "Unknown words" list and fix casing of 3-4 words and click "OK" - because we're done :)

Conclusion

nOCR is a good choice for Blu-ray .sup files with a fairly standard font.

If you have a non-standard font and access to a similar TrueType font, you can install that font in Windows and train a new nOCR db for your specific font.




Did you not find what you were looking for? Feel free to email me.

Also, do check out the Subtitle Edit Intro videos and the Syncing Subtitles with Subtitle Edit tutorial by dny238!