feat(fusion_accounting_ocr): pluggable OCR for vendor bills

Replaces Enterprise's account_invoice_extract with a Fusion-native pipeline:

Stage 1 (text extraction): Tesseract OCRs the bill attachment via
pytesseract + pdf2image. Pluggable OCRProvider adapter pattern allows
future Mindee / Google Document AI / Ollama-vision backends.

Stage 2 (field parsing): The fusion_accounting_ai LLMProvider reads the
raw OCR text and returns structured invoice fields (vendor, invoice
number, dates, amounts, line items) as JSON.

Draft invoice fields are auto-populated for empty-only fields (never
overwriting user-entered data). Vendor matching by name against
res.partner with supplier_rank > 0.

Adds:
- account.move.ocr_state (selection: not_requested/pending/processing/done/failed/manual)
- account.move.ocr_raw_text, ocr_extracted_data (Json), ocr_backend, ocr_confidence
- fusion.ocr.log (audit trail per OCR run)
- res.company.fusion_ocr_enabled / fusion_ocr_default_backend / auto_run
- /fusion/ocr/request_for_invoice JSON-RPC endpoint

Backend availability detected at runtime via OCRProvider.is_available()
classmethods. Tesseract 5.3.4 + pytesseract 0.3.13 + pdf2image 1.17.0 are
installed in the container.

Tests: 13 (TesseractAdapter availability + image OCR; flow tests for
draft autofill, no-attachment guard, customer-invoice guard,
ref-not-overwritten; field parser empty/clean-json/markdown-fence/
bad-JSON/provider-exception). All pass on westin-v19 OrbStack VM.

Made-with: Cursor
2026-04-20 00:32:50 -04:00
parent a730942d24
commit 125f48377a
24 changed files with 952 additions and 0 deletions
--- a/fusion_accounting_ocr/services/ocr_providers/base.py
+++ b/fusion_accounting_ocr/services/ocr_providers/base.py
@@ -0,0 +1,40 @@
+"""OCRProvider contract - every backend must conform.
+
+Mirrors the LLMProvider pattern in fusion_accounting_ai. Future adapters
+(Mindee, Google Document AI, Ollama-vision) drop in alongside the default
+tesseract adapter without touching account.move.
+"""
+
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+
+
+@dataclass
+class OCRResult:
+    raw_text: str = ''
+    confidence: float = 0.0  # 0.0–1.0
+    pages: int = 0
+    backend: str = ''
+    error: str = ''
+    metadata: dict = field(default_factory=dict)
+
+
+class OCRProvider(ABC):
+    """Abstract OCR backend. Subclasses implement extract()."""
+
+    name: str = 'base'
+
+    @abstractmethod
+    def extract(self, image_or_pdf_bytes: bytes, *, mimetype: str = 'application/pdf') -> OCRResult:
+        """Extract text from raw bytes.
+
+        ``mimetype`` hints whether to PDF-render (poppler) or image-decode
+        (PIL) the bytes. Implementations should still inspect the byte
+        signature for safety.
+        """
+        ...
+
+    @classmethod
+    def is_available(cls) -> bool:
+        """Return True if the backend's runtime deps are present."""
+        return True
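
For reviewers, here is a minimal sketch of how a backend might plug into the contract above and be selected at runtime through `is_available()`. It is illustrative only, not part of this commit: `PlainTextProvider`, `UnavailableProvider`, and `pick_provider` are hypothetical names, and reporting failures through `OCRResult.error` instead of raising is an assumption about the intended convention. `OCRResult` and `OCRProvider` are copied from `base.py` so the block runs on its own.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class OCRResult:
    raw_text: str = ''
    confidence: float = 0.0  # 0.0 to 1.0
    pages: int = 0
    backend: str = ''
    error: str = ''
    metadata: dict = field(default_factory=dict)


class OCRProvider(ABC):
    name: str = 'base'

    @abstractmethod
    def extract(self, image_or_pdf_bytes: bytes, *, mimetype: str = 'application/pdf') -> OCRResult:
        ...

    @classmethod
    def is_available(cls) -> bool:
        return True


class PlainTextProvider(OCRProvider):
    """Hypothetical stand-in backend: treats the bytes as UTF-8 text."""

    name = 'plaintext'

    def extract(self, image_or_pdf_bytes: bytes, *, mimetype: str = 'text/plain') -> OCRResult:
        try:
            text = image_or_pdf_bytes.decode('utf-8')
        except UnicodeDecodeError as exc:
            # Assumption: failures are reported via OCRResult.error, not raised.
            return OCRResult(backend=self.name, error=str(exc))
        return OCRResult(raw_text=text, confidence=1.0, pages=1, backend=self.name)


class UnavailableProvider(OCRProvider):
    """Hypothetical backend whose runtime dependencies are missing."""

    name = 'unavailable'

    def extract(self, image_or_pdf_bytes: bytes, *, mimetype: str = 'application/pdf') -> OCRResult:
        raise NotImplementedError

    @classmethod
    def is_available(cls) -> bool:
        return False


def pick_provider(candidates):
    """Return an instance of the first available backend, or None."""
    for provider_cls in candidates:
        if provider_cls.is_available():
            return provider_cls()
    return None


provider = pick_provider([UnavailableProvider, PlainTextProvider])
result = provider.extract(b'Invoice INV-001')
```

Because `is_available()` is a classmethod, a caller can skip a backend whose dependencies are missing without ever instantiating it.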