feat(fusion_accounting_ocr): pluggable OCR for vendor bills

Replaces Enterprise's account_invoice_extract with a Fusion-native pipeline:

Stage 1 (text extraction): Tesseract OCRs the bill attachment via
pytesseract + pdf2image. Pluggable OCRProvider adapter pattern allows
future Mindee / Google Document AI / Ollama-vision backends.

Stage 2 (field parsing): The fusion_accounting_ai LLMProvider reads the
raw OCR text and returns structured invoice fields (vendor, invoice
number, dates, amounts, line items) as JSON.
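
A minimal sketch of how the Stage 2 reply could be parsed defensively, covering the empty, clean-JSON, markdown-fence, and bad-JSON cases named in the tests. The helper name `parse_llm_invoice_json` and the choice to return an empty dict on failure are assumptions for illustration, not the module's actual API:

```python
import json

FENCE = "`" * 3  # markdown code-fence marker that LLMs often wrap JSON in


def parse_llm_invoice_json(response_text):
    """Return the parsed field dict, or {} if the reply is empty or invalid.

    Hypothetical helper: name and return shape are assumptions.
    """
    text = (response_text or "").strip()
    if not text:
        return {}
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence (and language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is a usable set of invoice fields.
    return data if isinstance(data, dict) else {}
```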

Only empty fields on the draft invoice are auto-populated; user-entered
data is never overwritten. Vendors are matched by name against
res.partner records with supplier_rank > 0.
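
The empty-only rule can be sketched in plain Python, independent of the ORM. The function name `empty_only_updates` and the dict-based interface are assumptions for illustration; the real flow presumably operates on an account.move record:

```python
def _is_empty(value):
    # Odoo reads unset fields as False; also treat None and empty containers as empty.
    return value is None or value is False or value == "" or value == [] or value == {}


def empty_only_updates(current_values, extracted_values):
    """Return only the extracted values whose target field is currently empty.

    Hypothetical helper: fields the user already filled in are left out.
    """
    updates = {}
    for field, new_value in extracted_values.items():
        if _is_empty(new_value):
            continue  # OCR found nothing for this field
        if _is_empty(current_values.get(field)):
            updates[field] = new_value
    return updates
```

With a draft that already has `ref` set, an extracted `ref` is ignored while an empty `invoice_date` is filled.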

Adds:
- account.move.ocr_state (selection: not_requested/pending/processing/
  done/failed/manual)
- account.move.ocr_raw_text, ocr_extracted_data (Json), ocr_backend,
  ocr_confidence
- fusion.ocr.log (audit trail per OCR run)
- res.company.fusion_ocr_enabled / fusion_ocr_default_backend / auto_run
- /fusion/ocr/request_for_invoice JSON-RPC endpoint

Backend availability detected at runtime via OCRProvider.is_available()
classmethods. Tesseract 5.3.4 + pytesseract 0.3.13 + pdf2image 1.17.0
are installed in the container.
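
A rough sketch of runtime availability detection, assuming each adapter checks for its Python packages and system binary. The class bodies, the `pick_backend` helper, and the fallback to the manual adapter are assumptions for illustration and may differ from the code in this commit:

```python
import importlib.util
import shutil


class OCRProvider:
    name = "base"

    @classmethod
    def is_available(cls):
        return False


class TesseractAdapter(OCRProvider):
    name = "tesseract"

    @classmethod
    def is_available(cls):
        # Needs both Python wrappers and the tesseract binary on PATH.
        has_modules = all(
            importlib.util.find_spec(mod) is not None
            for mod in ("pytesseract", "pdf2image")
        )
        return has_modules and shutil.which("tesseract") is not None


class ManualAdapter(OCRProvider):
    name = "manual"

    @classmethod
    def is_available(cls):
        return True  # no external dependencies


def pick_backend(preferred, registry):
    """Return the preferred provider class if usable, else the manual fallback."""
    provider = registry.get(preferred)
    if provider is not None and provider.is_available():
        return provider
    return registry["manual"]


REGISTRY = {cls.name: cls for cls in (TesseractAdapter, ManualAdapter)}
```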

Tests: 13 (TesseractAdapter availability + image OCR; flow tests for
draft autofill, no-attachment guard, customer-invoice guard, ref-not-
overwritten; field parser empty/clean-json/markdown-fence/bad-JSON/
provider-exception). All pass on westin-v19 OrbStack VM.

Made-with: Cursor
gsinghpal
2026-04-20 00:32:50 -04:00
parent a730942d24
commit 125f48377a
24 changed files with 952 additions and 0 deletions

@@ -0,0 +1,13 @@
"""Manual fallback adapter - no real OCR, just marks the document as
'awaiting manual entry'. Used when no real OCR backend is available
or when the user explicitly disables OCR.
"""

from .base import OCRProvider, OCRResult


class ManualAdapter(OCRProvider):
    name = 'manual'

    def extract(self, image_or_pdf_bytes, *, mimetype='application/pdf'):
        return OCRResult(raw_text='', confidence=0.0, pages=0, backend='manual')