python tessract ocr 辨識文字

Tesseract 使用＆安裝＆訓練

簡單驗證碼去噪灰度二值化

無意間找到網址是固定的教育部某平台之簡易驗證碼

雖然可以爬好幾10xxx張，但是看了資訊安全政策好像會被 ban 所以弄了一個驗證碼產生器，自製訓練庫，好處是可自定義演算法，本機端即可儲存圖片。

實際操作畫面如下：

— 工商服務板塊 —

人工智慧課堂企劃報告，老師說明 OCR 這個目前由 Google 維護的識別引擎，用途為辨識掃描的文件/圖片中的文字，提供實務上業界的應用，掃描健檢報告數據，將數據存入資料庫。

我們這組的想法是應用於農民可拍照上傳產品外包裝編號查看農藥用途。

實做起來遇到一些問題微調操作

所以做個紀錄有了這篇筆記整理 📝

安裝 Tessract / 語言包

安裝環境： MacOS

Homebrew 安裝

1
homebrew install tesseract

https://formulae.brew.sh/formula/tesseract

1
homebrew install tesseract-lang

https://formulae.brew.sh/formula/tesseract-lang

語言包一次打包 / 需要的再匯入

位置大概是這邊：/usr/local/Cellar/tesseract/4.x/share/tessdata

編譯安裝問題

1
Error: invalid option: --with-training-tools

要安裝訓練模型時，發現 tesseract 最新4.x版本 homebrew 沒有訓練模型相關指令

撰寫文章當下根據網路文章找到了解決方法需自行編譯 build,make install

參考下列文章（感謝文章作者）

https://juejin.im/post/6844904097863188487

可下載 `python` 的相關模組

1
2
3
pip install pillow

pip install pytesseract

詳細教學：

https://lufor129.medium.com/pytesseract-辨識圖片中的文字-b1024f678fac

https://blog.csdn.net/huitailangyz/article/details/80390090

查看安裝版本

1
2
3
4
5
6
7
8
tesseract 4.1.1
leptonica-1.80.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37
: libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

參數設定

列出可使用的語言包

1
tesseract --list-langs

chi_tra 繁體
chi_tra_vert 繁體，直排
-l 辨識語言 / 預設是英文 default “eng”

設定圖片辨識引擎

–oem ： OCR Engine modes, 演算法類型 / 目前沒了解這塊用預設

1
2
3
4
5
6
7
tesseract --help-oem

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

參考文章解釋:

为了识别包含单个字符的图像，我们通常使用卷积神经网络(CNN)。任意长度的文本是一系列字符，这类问题可以通过RNNs解决，LSTM是RNN的一种常用形式

PSM 判別語句類型模式

–psm : 判別語句類型模式 / 大陸翻譯：自动页面分割模式

常用說明

0：定向脚本监测（OSD）
1：使用OSD自动分页
2 ：自动分页，但是不使用OSD或OCR
3 ：全自动分页，但是没有使用OSD（默认）
4 ：假设可变大小的一个文本列。
5 ：假设垂直对齐文本的单个统一块。
6 ：假设一个统一的文本块。

7 ：將圖片視為一行文字句子。
8 ：將圖片視為一個『單字』。
9 ：將圖像視為『環狀字』。
10 ：將圖片視為一個『字母』。

參數不懂的沿用參考文章簡體中文翻譯

PSM 參數詳細說明： https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/

辨識效果測試

執行畫面：

簡易驗證碼：

較複雜驗證碼：登入系統驗證碼

線條干擾誤判斷成 - 符號

圖片預處理 / 提升辨識度

再來就是要測試一些更難辨識的照片跟訓練了

官方文檔解釋： https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#image-processing

每個Image都有不同的標記，需要乾淨能夠辨識標記的物件

OpenCV 套件預先處理圖片

辨識原則：

至少為300 DPI為最佳
白底黑字，字體清晰
去噪，灰度图二值化

轉灰度原理：

mode 取值可為下列幾種

其中會用到的為 1 . L

imgry.mode binary.mode

1
2
3
4
5
6
7
8
9
· 1 (1-bit pixels, black and white, stored with one pixel per byte)
· L (8-bit pixels, black and white)
· P (8-bit pixels, mapped to any other mode using a colour palette)
· RGB (3x8-bit pixels, true colour)
· RGBA (4x8-bit pixels, true colour with transparency mask)
· CMYK (4x8-bit pixels, colour separation)
· YCbCr (3x8-bit pixels, colour video format)
· I (32-bit signed integer pixels)
· F (32-bit floating point pixels)

opencv :

https://lufor129.medium.com/pytesseract-辨識圖片中的文字-b1024f678fac

常見錯誤： https://zhuanlan.zhihu.com/p/35590364

套件記得載

1
2
3
4
5
6
7
# opencv 方式讀取圖檔轉灰度
imgcv = cv2.imread('CaptchaImage.jpg', cv2.IMREAD_GRAYSCALE)
type(imgcv)
cv2.imshow('CaptchaImage', imgcv)
cv2.imwrite('output/gray05.tiff', imgcv)
cv2.waitKey(0)
cv2.destroyAllWindows()

cv2.threshold 常用參數

cv2.THRESH_BINARY（黑白兩值）
cv2.THRESH_BINARY_INV（黑白二值反轉）
cv2.THRESH_TRUNC （得到的圖像為多像素值）尚待釐清多像素值定義

灰度轉二值化(去噪):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
    print 'binary mode: ', binary.mode
    print binary.getpixel((0, 0))
    co = binary.getcolors()
    print co
    
    #輸出
    ----------------------------------------
    binary mode: 1
    0  #左上角為黑色
    [(503, 0), (1993, 1)]
    共有幾個黑與白像素點

判斷文字位置：

https://youtu.be/6DjFscX4I_c?t=502

訓練文字辨識，手動標註

jTessBoxEditor / VietOCR-4.5

手動標記｜工人智慧

下載位置：

https://sourceforge.net/projects/vietocr/

軟體資訊：

使用語法總覽

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.

未完待續，下方整理了一些相關文章，有興趣的同學可以參考。

參考文章

簡單的製作驗證碼範例： https://ithelp.ithome.com.tw/articles/10159457

[ 實用心得 ] Tesseract-OCR

https://medium.com/@b98606021/實用心得-tesseract-ocr

mac上文字识别(Tesseract-OCR for mac )

https://blog.csdn.net/u010670689/article/details/78374623

Image Processing in OpenCV

Image Thresholding

使用Tesseract训练lang文件并OCR识别集装箱号

https://www.jianshu.com/p/5f847d8089ce

Tesseract-OCR 實現光學漢字識別（語言包解釋）

https://www.itread01.com/content/1577329506.html

python 读取、保存、二值化、灰度化图片+opencv处理图片的方法

https://blog.csdn.net/JohinieLi/article/details/69389980

設置限定詞（會降低識別率）

tessdata/configs/digits 沒找到

https://blog.csdn.net/ADebugMan/article/details/100592200

檔案放置區:

Tesseract-OCR 辨識＆驗證碼分析及實作

圖片辨識結合自動回覆機器人

此為大學專案進行的學習成果

如有發現任何參數錯誤，歡迎留言通知