feat(fusion_accounting_ocr): pluggable OCR for vendor bills

Replaces Enterprise's account_invoice_extract with a Fusion-native pipeline:

Stage 1 (text extraction): Tesseract OCRs the bill attachment via
pytesseract + pdf2image. A pluggable OCRProvider adapter pattern allows
future Mindee / Google Document AI / Ollama-vision backends.

Stage 2 (field parsing): The fusion_accounting_ai LLMProvider reads the raw
OCR text and returns structured invoice fields (vendor, invoice number,
dates, amounts, line items) as JSON.

Only empty fields on the draft invoice are auto-populated; user-entered
data is never overwritten. Vendors are matched by name against res.partner
records with supplier_rank > 0.

Adds:
- account.move.ocr_state (selection: not_requested/pending/processing/done/failed/manual)
- account.move.ocr_raw_text, ocr_extracted_data (Json), ocr_backend, ocr_confidence
- fusion.ocr.log (audit trail per OCR run)
- res.company.fusion_ocr_enabled / fusion_ocr_default_backend / auto_run
- /fusion/ocr/request_for_invoice JSON-RPC endpoint

Backend availability is detected at runtime via OCRProvider.is_available()
classmethods. Tesseract 5.3.4 + pytesseract 0.3.13 + pdf2image 1.17.0 are
installed in the container.

Tests: 13 (TesseractAdapter availability + image OCR; flow tests for draft
autofill, no-attachment guard, customer-invoice guard, ref-not-overwritten;
field parser empty/clean-json/markdown-fence/bad-JSON/provider-exception).
All pass on westin-v19 OrbStack VM.

Made-with: Cursor
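
Illustration of Stage 2 and the empty-only autofill rule, as a minimal sketch. The real field parser is not shown in this view, so the names parse_llm_fields and fill_empty_only are hypothetical, and the plain dicts stand in for account.move values. The sketch assumes the LLM reply may be empty, clean JSON, JSON wrapped in a markdown fence, or not JSON at all, matching the parser cases listed under Tests.

```python
import json
import re

# Hypothetical helpers for illustration; not the module's actual parser.
# Matches a reply that is entirely one fenced block, e.g. ```json ... ```.
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def parse_llm_fields(reply):
    """Return a dict of invoice fields, or {} when the reply is unusable."""
    text = (reply or "").strip()
    if not text:
        return {}
    match = _FENCE_RE.match(text)
    if match:
        # Keep only the content between the opening and closing fence.
        text = match.group(1).strip()
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return {}
    # Only a JSON object is a usable field mapping.
    return data if isinstance(data, dict) else {}


def fill_empty_only(current, extracted):
    """Return the values to write: extracted fields whose current value is empty."""
    return {
        key: value
        for key, value in extracted.items()
        if value not in (None, "", [], {}) and not current.get(key)
    }


current = {"ref": "USER-REF", "invoice_date": False}
reply = '```json\n{"ref": "INV-1", "invoice_date": "2026-04-01"}\n```'
to_write = fill_empty_only(current, parse_llm_fields(reply))
```

In this sketch a falsy current value (False, an empty string, 0) counts as empty, so the user-entered ref is left alone and only invoice_date would be written. Whether a zero amount should count as empty is a design choice the sketch does not settle.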
2026-04-20 00:32:50 -04:00
parent a730942d24
commit 125f48377a
24 changed files with 952 additions and 0 deletions
--- a/fusion_accounting_ocr/tests/test_tesseract_adapter.py
+++ b/fusion_accounting_ocr/tests/test_tesseract_adapter.py
@@ -0,0 +1,47 @@
+import io
+
+from PIL import Image, ImageDraw
+
+from odoo.tests import tagged
+from odoo.tests.common import TransactionCase
+
+from odoo.addons.fusion_accounting_ocr.services.ocr_providers.tesseract_adapter import (
+    TesseractAdapter,
+)
+
+
+@tagged('post_install', '-at_install')
+class TestTesseractAdapter(TransactionCase):
+
+    def test_is_available(self):
+        # In our container tesseract + pytesseract + pdf2image are pre-installed.
+        self.assertTrue(TesseractAdapter.is_available())
+
+    def test_extract_simple_text_image(self):
+        # Generate a small PNG with the text "INVOICE 12345 Total $100".
+        # Use an 800x120 canvas and a 36pt TTF font when available, for
+        # tesseract reliability; fall back to PIL's default bitmap font otherwise.
+        img = Image.new('RGB', (800, 120), color='white')
+        draw = ImageDraw.Draw(img)
+        try:
+            from PIL import ImageFont
+            font = ImageFont.truetype(
+                '/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 36,
+            )
+        except (OSError, ImportError):
+            font = None
+        draw.text((20, 30), "INVOICE 12345 Total $100", fill='black', font=font)
+
+        buf = io.BytesIO()
+        img.save(buf, format='PNG')
+        png_bytes = buf.getvalue()
+
+        adapter = TesseractAdapter()
+        result = adapter.extract(png_bytes, mimetype='image/png')
+
+        self.assertEqual(result.backend, 'tesseract')
+        self.assertEqual(result.error, '')
+        self.assertEqual(result.pages, 1)
+        self.assertGreater(len(result.raw_text), 0)
+        # Tesseract should pick up the digits at minimum.
+        self.assertIn('12345', result.raw_text.replace(' ', ''))
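
The adapter contract this test exercises can be sketched as follows. This is an assumption-laden illustration, not the code in tesseract_adapter.py: OCRResult, TesseractSketch and available_backends are hypothetical names, the result attributes (backend, raw_text, pages, error) are inferred from the assertions above, and the availability check shown here (importable Python packages plus a tesseract binary on PATH) is only one plausible way to implement is_available().

```python
import importlib.util
import io
import shutil
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class OCRResult:
    # Attribute names mirror what the test asserts on; defaults are assumptions.
    backend: str
    raw_text: str = ""
    pages: int = 0
    error: str = ""


class OCRProvider(ABC):
    name = ""

    @classmethod
    @abstractmethod
    def is_available(cls):
        """Return True when the backend's dependencies are usable."""

    @abstractmethod
    def extract(self, data, mimetype=""):
        """Return an OCRResult for the raw attachment bytes."""


class TesseractSketch(OCRProvider):
    name = "tesseract"

    @classmethod
    def is_available(cls):
        # Runtime detection: Python packages importable and binary on PATH.
        modules = ("pytesseract", "pdf2image", "PIL")
        has_modules = all(importlib.util.find_spec(m) is not None for m in modules)
        return has_modules and shutil.which("tesseract") is not None

    def extract(self, data, mimetype=""):
        if not self.is_available():
            return OCRResult(backend=self.name, error="tesseract backend unavailable")
        # Imported lazily so the module still loads without the OCR stack.
        import pytesseract
        from pdf2image import convert_from_bytes
        from PIL import Image

        try:
            if mimetype == "application/pdf":
                images = convert_from_bytes(data)  # one PIL image per page
            else:
                images = [Image.open(io.BytesIO(data))]
            text = "\n".join(pytesseract.image_to_string(img) for img in images)
        except Exception as exc:  # report failures in the result, do not raise
            return OCRResult(backend=self.name, error=str(exc))
        return OCRResult(backend=self.name, raw_text=text, pages=len(images))


def available_backends(providers):
    """Names of the provider classes whose dependencies are present."""
    return [p.name for p in providers if p.is_available()]
```

Returning the error in the result instead of raising is what lets a caller assert result.error == '' on success, as the test does. A future Mindee or Document AI backend would subclass OCRProvider the same way and show up in available_backends() only when its own check passes.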