Saved: 2026/3/29 22:13:08
Tesseract flags:
- -l chi_sim (the default is English)
- --psm 7; don't use the default 3
- --oem 3 (LSTM engine)

chi_sim.traineddata covers only 2800+ Chinese characters, so coverage is incomplete; switch to an enhanced Chinese model, or use PaddleOCR instead:

pip install paddlepaddle paddleocr

from paddleocr import PaddleOCR, draw_ocr
ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # Chinese + orientation correction
result = ocr.ocr("gui_screenshot.png", cls=True)
for line in result[0]:  # ocr() returns one result list per input image
    print(line[1][0])  # print the recognized text
Full Tesseract config: --oem 3 --psm 7 -l chi_sim

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Preprocessing
img = cv2.imread("gui_screenshot.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Recognition
text = pytesseract.image_to_string(thresh, lang="chi_sim", config="--oem 3 --psm 7")
print(text)
The LLM runs only once: it understands the workflow and generates a fixed script plus parameters. Everything afterwards runs on OpenCV + lightweight OCR (PaddleOCR), with no local large vision model anywhere in the loop: extremely fast, stable, and reusable.

The naive approach asks the LLM at every step: "where do I click now?" What you've understood instead is: the LLM writes a fixed script once, and from then on the script runs by itself. That is a qualitative difference, and it's where you're more clear-eyed than many practitioners.
The LLM serves as the automation-script generator; traditional vision (OpenCV + lightweight OCR) serves as the execution engine.
Office automation needs neither large-model vision nor a fresh LLM query every time. The LLM does one "compilation" pass; everything after that is traditional algorithms running at full speed.
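The "compile once, run forever" pattern above can be sketched as a config cache: call the LLM only when no compiled config exists on disk. This is a minimal sketch; `generate_config_via_llm` and the config fields are hypothetical stand-ins for whatever one-time LLM step you actually use.

```python
import json
import os

CONFIG_PATH = "automation_config.json"

def generate_config_via_llm():
    # Hypothetical one-time LLM step: in practice this would ask the LLM to
    # derive OCR parameters / UI regions from a screenshot of the workflow.
    return {"ocr_lang": "ch", "threshold": 150, "click_region": [100, 200, 50, 20]}

def load_or_compile_config():
    # Fast path: reuse the config the LLM "compiled" earlier.
    if os.path.exists(CONFIG_PATH):
        with open(CONFIG_PATH, encoding="utf-8") as f:
            return json.load(f)
    # Slow path, runs exactly once: ask the LLM, then cache the result.
    config = generate_config_via_llm()
    with open(CONFIG_PATH, "w", encoding="utf-8") as f:
        json.dump(config, f)
    return config

config = load_or_compile_config()
print(config["threshold"])  # 150
```

Every subsequent run hits the fast path, so the LLM cost is paid once per workflow, not per execution.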
Accuracy metric: correct / total

# ======================
# 1. LLM provides the ground truth (once)
# ======================
screenshot = "screen.png"
ground_truth = llm_ocr_ground_truth(screenshot)

# ======================
# 2. Parameter search space
# ======================
param_space = {
    "scale": [1.0, 1.5, 2.0],
    "threshold": [110, 130, 150, 170],
    "erode_dilate": [0, 1, 2],
    "crop": [None, (x, y, w, h)],  # (x, y, w, h): placeholder crop region
    "ocr_drop_score": [0.5, 0.6, 0.7, 0.8],
}

# ======================
# 3. Auto-calibration main loop (agent)
# ======================
best_score = 0
best_param = None
for _ in range(30):
    param = random_sample(param_space)
    image = cv_preprocess(screenshot, param)
    pred = paddle_ocr(image, param)
    score = calc_accuracy(pred, ground_truth)
    if score > best_score:
        best_score = score
        best_param = param
    if best_score >= 0.95:
        break

# ======================
# 4. Save the best parameters and use them from then on
# ======================
save_best_param("calib_config.json", best_param)
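The helpers in the loop above are pseudocode. As one concrete reading, assuming OCR output and ground truth are both lists of text lines, `random_sample` and `calc_accuracy` (the `correct / total` metric) might look like:

```python
import random

def random_sample(param_space):
    # Pick one value per parameter, uniformly at random.
    return {name: random.choice(values) for name, values in param_space.items()}

def calc_accuracy(pred_lines, truth_lines):
    # correct / total: count ground-truth lines that appear in the prediction.
    if not truth_lines:
        return 0.0
    correct = sum(1 for t in truth_lines if t in pred_lines)
    return correct / len(truth_lines)

param_space = {"scale": [1.0, 1.5, 2.0], "threshold": [110, 130, 150, 170]}
param = random_sample(param_space)
print(calc_accuracy(["确定", "取消"], ["确定", "取消", "保存"]))  # 2 of 3 lines matched
```

Exact line matching is deliberately strict; a fuzzier metric (e.g. edit distance) would reward near-misses, but strict matching gives the calibration loop a cleaner signal.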
Runtime stack:
- mss / pyautogui.screenshot() for screen capture
- OpenCV (scaling, binarization, cropping, denoising)
- PyAutoGUI / pynput (pure mouse/keyboard simulation; no dependence on a browser / CDP)