Tesseract - An Open Source Optical Character Recognition Application

Posted by Problem Child on Wednesday, February 13, 2019

Tesseract is an open source OCR application available for Linux, Windows, and Mac. For those who are not familiar with it, OCR is a technology that recognizes text inside an image and converts it into plain text.

In this tutorial I use Ubuntu to try out Tesseract. Installation is straightforward because the package is already available in Ubuntu's universe repository.

sudo apt-get update
sudo apt-get install tesseract-ocr
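
To confirm that the installation worked, a quick check like the one below can help. This is only a minimal sketch: the function name check_tesseract is made up for this post, while the --version and --list-langs flags are the ones listed in Tesseract's own help output.

```shell
#!/bin/sh
# check_tesseract is a hypothetical helper name, not part of Tesseract.
check_tesseract() {
    # command -v is the POSIX way to test whether a program is available.
    if command -v tesseract >/dev/null 2>&1; then
        echo "tesseract found"
        # Version info may go to stderr on some releases, so merge both streams.
        tesseract --version 2>&1 | head -n 1
        # Show which language data files are installed.
        tesseract --list-langs 2>&1
    else
        echo "tesseract not found; install the tesseract-ocr package first"
        return 1
    fi
}

# Run the check, but do not abort the shell if Tesseract is missing.
check_tesseract || true
```

If "eng" does not show up in the language list, the -l eng option used later in this post will not work until the English language data is installed.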

tesseract --help

Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

tesseract --help-extra

Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
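
The page segmentation modes above matter most when the image is not a full page, for example a cropped single line or a single word. The sketch below wraps that in a small function; ocr_with_psm and image.png are hypothetical names chosen for illustration, and the options used (stdout as the output base, -l, --psm) are taken from the help text above.

```shell
#!/bin/sh
# ocr_with_psm is a hypothetical helper: OCR one image with a chosen
# page segmentation mode and print the result to the terminal.
#   $1 = path to the image
#   $2 = page segmentation mode (falls back to 3, the default listed above)
ocr_with_psm() {
    # Using "stdout" as the output base prints the text instead of
    # writing a .txt file.
    tesseract "$1" stdout -l eng --psm "${2:-3}"
}

# Example usage (assumes an image.png that contains one line of text):
#   ocr_with_psm image.png 7
```

Which mode gives the best result depends on the image, so it is worth trying a few of them on the same file and comparing the output.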

Usage example
As an example, let's try converting the text in the following image

Save it as linuxsec.png
Then run the command below. Tesseract appends .txt to the output base, so the result is written to linuxsec.txt:

tesseract linuxsec.png linuxsec -l eng

cat linuxsec.txt

Example output:

Easy, right?
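
The same command can be repeated over a whole folder of images. Here is a rough sketch of how that might look; ocr_all_png is a made-up helper name, and it assumes every image is a .png file containing English text.

```shell
#!/bin/sh
# ocr_all_png is a hypothetical helper: run OCR on every .png in a directory.
#   $1 = directory to scan (defaults to the current directory)
ocr_all_png() {
    dir="${1:-.}"
    for img in "$dir"/*.png; do
        # If no .png exists the pattern stays literal, so skip it.
        [ -e "$img" ] || continue
        # Strip the .png suffix; tesseract adds .txt to the output base itself.
        base="${img%.png}"
        tesseract "$img" "$base" -l eng
        echo "wrote $base.txt"
    done
}

# Example usage (the path is a placeholder):
#   ocr_all_png /path/to/images
```

Each image ends up with a .txt file of the same name next to it, just like linuxsec.txt in the example above.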
For more details, check their GitHub page:

https://github.com/tesseract-ocr/tesseract

That's all for this post. If anything is unclear, feel free to ask.