feat(fusion_accounting_ocr): pluggable OCR for vendor bills

Replaces Enterprise's account_invoice_extract with a Fusion-native pipeline:

Stage 1 (text extraction): Tesseract OCRs the bill attachment via
pytesseract + pdf2image. Pluggable OCRProvider adapter pattern allows
future Mindee / Google Document AI / Ollama-vision backends.
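
A minimal sketch of how a pluggable backend lookup could work, assuming a simple name-to-class registry. The stub classes, `REGISTRY`, `register()` and `pick_backend()` are hypothetical names for illustration and are not taken from this commit.

```python
from typing import Dict, Optional, Type


class BaseProvider:
    """Stand-in for the OCRProvider contract (illustration only)."""
    name = "base"

    @classmethod
    def is_available(cls) -> bool:
        return True


class TesseractStub(BaseProvider):
    name = "tesseract"


class MindeeStub(BaseProvider):
    name = "mindee"

    @classmethod
    def is_available(cls) -> bool:
        return False  # pretend the runtime deps or credentials are missing


# Hypothetical registry: backend name -> provider class.
REGISTRY: Dict[str, Type[BaseProvider]] = {}


def register(provider_cls: Type[BaseProvider]) -> Type[BaseProvider]:
    REGISTRY[provider_cls.name] = provider_cls
    return provider_cls


def pick_backend(preferred: str, default: str = "tesseract") -> Optional[Type[BaseProvider]]:
    """Return the preferred backend if usable, else the default, else None."""
    for name in (preferred, default):
        provider_cls = REGISTRY.get(name)
        if provider_cls is not None and provider_cls.is_available():
            return provider_cls
    return None


register(TesseractStub)
register(MindeeStub)
```

With this shape, adding a backend means registering one more class; the calling code only ever asks for a backend by name.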

Stage 2 (field parsing): The fusion_accounting_ai LLMProvider reads the
raw OCR text and returns structured invoice fields (vendor, invoice
number, dates, amounts, line items) as JSON.
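
The field-parser cases listed under Tests (empty, clean JSON, markdown fence, bad JSON) suggest a tolerant parse step. A rough sketch under that assumption follows; `parse_invoice_fields()` and `strip_markdown_fence()` are hypothetical names, and the real parser in this commit may differ.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that LLMs often wrap JSON in


def strip_markdown_fence(text: str) -> str:
    """Remove one surrounding Markdown fence, if present."""
    text = text.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        inner = text[len(FENCE):-len(FENCE)]
        # The opening fence line may carry a language tag such as "json".
        _tag, newline, rest = inner.partition("\n")
        return rest.strip() if newline else inner.strip()
    return text


def parse_invoice_fields(llm_reply: str) -> dict:
    """Return the parsed JSON object, or {} for empty or invalid replies."""
    text = strip_markdown_fence(llm_reply or "")
    if not text:
        return {}
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is a usable field mapping.
    return data if isinstance(data, dict) else {}
```

Returning an empty dict on failure keeps the caller simple: nothing extracted means nothing to autofill.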

Only empty fields on the draft invoice are auto-populated; user-entered
data is never overwritten. Vendors are matched by name against
res.partner records with supplier_rank > 0.
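
The empty-only rule can be sketched on plain dicts, outside the ORM. `fill_empty_only()` is a hypothetical helper, and treating False/None/'' (and zero) as "empty" is an assumption based on how unset Odoo fields usually read.

```python
def fill_empty_only(current: dict, extracted: dict) -> dict:
    """Return the values to write: extracted values for still-empty fields only."""
    updates = {}
    for name, value in extracted.items():
        if value in (None, "", [], {}):
            continue  # the OCR/LLM step found nothing for this field
        # Assumption: unset fields read as False/None/'' (0 also counts, since 0 == False).
        if current.get(name) in (None, "", False, [], {}):
            updates[name] = value
    return updates


current = {"ref": "PO-7", "invoice_date": False, "partner_id": None}
extracted = {"ref": "INV-99", "invoice_date": "2026-04-01", "partner_id": 42, "narration": ""}
updates = fill_empty_only(current, extracted)
```

Here `ref` stays untouched because the user already entered it, which is the behaviour the ref-not-overwritten test checks.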

Adds:
- account.move.ocr_state (selection: not_requested/pending/processing/
  done/failed/manual)
- account.move.ocr_raw_text, ocr_extracted_data (Json), ocr_backend,
  ocr_confidence
- fusion.ocr.log (audit trail per OCR run)
- res.company.fusion_ocr_enabled / fusion_ocr_default_backend / auto_run
- /fusion/ocr/request_for_invoice JSON-RPC endpoint

Backend availability detected at runtime via OCRProvider.is_available()
classmethods. Tesseract 5.3.4 + pytesseract 0.3.13 + pdf2image 1.17.0
are installed in the container.
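
One way such a runtime check could look, as a sketch only: probe for the Python modules and the system binaries without importing anything heavy. `deps_present()` and `tesseract_stack_available()` are hypothetical names; the adapter's actual is_available() is not shown in this view.

```python
import importlib.util
import shutil
from typing import Iterable


def deps_present(modules: Iterable[str], binaries: Iterable[str]) -> bool:
    """True when every Python module is importable and every binary is on PATH."""
    modules_ok = all(importlib.util.find_spec(name) is not None for name in modules)
    binaries_ok = all(shutil.which(name) is not None for name in binaries)
    return modules_ok and binaries_ok


def tesseract_stack_available() -> bool:
    # pdf2image shells out to poppler's pdftoppm to render PDF pages.
    return deps_present(("pytesseract", "pdf2image"), ("tesseract", "pdftoppm"))
```

Checking at call time rather than at import time lets the module load on hosts where the OCR stack is not installed.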

Tests: 13 (TesseractAdapter availability + image OCR; flow tests for
draft autofill, no-attachment guard, customer-invoice guard, ref-not-
overwritten; field parser empty/clean-json/markdown-fence/bad-JSON/
provider-exception). All pass on westin-v19 OrbStack VM.

Made-with: Cursor
gsinghpal
2026-04-20 00:32:50 -04:00
parent a730942d24
commit 125f48377a
24 changed files with 952 additions and 0 deletions

@@ -0,0 +1,40 @@
"""OCRProvider contract - every backend must conform.

Mirrors the LLMProvider pattern in fusion_accounting_ai. Future adapters
(Mindee, Google Document AI, Ollama-vision) drop in alongside the default
tesseract adapter without touching account.move.
"""

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class OCRResult:
    raw_text: str = ''
    confidence: float = 0.0  # 0.0-1.0
    pages: int = 0
    backend: str = ''
    error: str = ''
    metadata: dict = field(default_factory=dict)


class OCRProvider(ABC):
    """Abstract OCR backend. Subclasses implement extract()."""

    name: str = 'base'

    @abstractmethod
    def extract(self, image_or_pdf_bytes: bytes, *, mimetype: str = 'application/pdf') -> OCRResult:
        """Extract text from raw bytes.

        ``mimetype`` hints whether to PDF-render (poppler) or image-decode
        (PIL) the bytes. Implementations should still inspect the byte
        signature for safety.
        """
        ...

    @classmethod
    def is_available(cls) -> bool:
        """Return True if the backend's runtime deps are present."""
        return True
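
The extract() docstring asks implementations to inspect the byte signature instead of trusting the mimetype hint. A minimal sketch of such a check, assuming a small table of well-known magic numbers; `sniff_mimetype()` and `_SIGNATURES` are hypothetical names and are not part of this diff.

```python
# Well-known file signatures (magic numbers) for the formats an OCR
# backend is likely to receive.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
)


def sniff_mimetype(data: bytes, hint: str = "application/pdf") -> str:
    """Return the mimetype implied by the leading bytes, else fall back to the hint."""
    for magic, mimetype in _SIGNATURES:
        if data.startswith(magic):
            return mimetype
    return hint
```

An adapter could call this first and then route to PDF rendering or image decoding based on the result, so a mislabelled attachment does not reach the wrong decoder.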