Files
Odoo-Modules/fusion_accounting_ocr/services/ocr_providers/tesseract_adapter.py
gsinghpal 125f48377a feat(fusion_accounting_ocr): pluggable OCR for vendor bills
Replaces Enterprise's account_invoice_extract with a Fusion-native pipeline:

Stage 1 (text extraction): Tesseract OCRs the bill attachment via
pytesseract + pdf2image. Pluggable OCRProvider adapter pattern allows
future Mindee / Google Document AI / Ollama-vision backends.

Stage 2 (field parsing): The fusion_accounting_ai LLMProvider reads the
raw OCR text and returns structured invoice fields (vendor, invoice
number, dates, amounts, line items) as JSON.

Draft invoice fields are auto-populated for empty-only fields (never
overwriting user-entered data). Vendor matching by name against
res.partner with supplier_rank > 0.

Adds:
- account.move.ocr_state (selection: not_requested/pending/processing/
  done/failed/manual)
- account.move.ocr_raw_text, ocr_extracted_data (Json), ocr_backend,
  ocr_confidence
- fusion.ocr.log (audit trail per OCR run)
- res.company.fusion_ocr_enabled / fusion_ocr_default_backend / auto_run
- /fusion/ocr/request_for_invoice JSON-RPC endpoint

Backend availability detected at runtime via OCRProvider.is_available()
classmethods. Tesseract 5.3.4 + pytesseract 0.3.13 + pdf2image 1.17.0
are installed in the container.

Tests: 13 (TesseractAdapter availability + image OCR; flow tests for
draft autofill, no-attachment guard, customer-invoice guard, ref-not-
overwritten; field parser empty/clean-json/markdown-fence/bad-JSON/
provider-exception). All pass on westin-v19 OrbStack VM.

Made-with: Cursor
2026-04-20 00:32:50 -04:00

72 lines
2.2 KiB
Python

"""Tesseract OCR adapter.
Uses the system tesseract binary via pytesseract, with poppler-backed
PDF rendering via pdf2image. Inside the container these are pre-installed:
- tesseract-ocr 5.3.4
- pytesseract 0.3.13
- pdf2image 1.17.0
- poppler-utils
"""
import io
import logging
from .base import OCRProvider, OCRResult
_logger = logging.getLogger(__name__)
class TesseractAdapter(OCRProvider):
name = 'tesseract'
@classmethod
def is_available(cls) -> bool:
try:
import pytesseract
from pdf2image import convert_from_bytes # noqa: F401
from PIL import Image # noqa: F401
pytesseract.get_tesseract_version()
return True
except Exception as e:
_logger.debug("TesseractAdapter not available: %s", e)
return False
def extract(self, image_or_pdf_bytes, *, mimetype='application/pdf'):
import pytesseract
from pdf2image import convert_from_bytes
from PIL import Image
try:
is_pdf = (
mimetype == 'application/pdf'
or (image_or_pdf_bytes[:4] == b'%PDF')
)
if is_pdf:
pages = convert_from_bytes(image_or_pdf_bytes, dpi=200)
else:
img = Image.open(io.BytesIO(image_or_pdf_bytes))
pages = [img]
texts = []
for p in pages:
texts.append(pytesseract.image_to_string(p))
full_text = '\n\f\n'.join(texts)
# Heuristic confidence - tesseract has a per-word conf in
# image_to_data, but a length proxy is fine for routing
# decisions. Future: use pytesseract.image_to_data for a real
# average word-level confidence.
conf = min(1.0, len(full_text) / 1000.0)
return OCRResult(
raw_text=full_text,
confidence=conf,
pages=len(pages),
backend='tesseract',
)
except Exception as e:
_logger.warning("Tesseract OCR failed: %s", e)
return OCRResult(
raw_text='', confidence=0.0, pages=0,
backend='tesseract', error=str(e),
)