Replaces Enterprise's account_invoice_extract with a Fusion-native pipeline:

Stage 1 (text extraction): Tesseract OCRs the bill attachment via pytesseract + pdf2image. A pluggable OCRProvider adapter pattern allows future Mindee / Google Document AI / Ollama-vision backends.

Stage 2 (field parsing): The fusion_accounting_ai LLMProvider reads the raw OCR text and returns structured invoice fields (vendor, invoice number, dates, amounts, line items) as JSON.

Only empty fields on the draft invoice are auto-populated; user-entered data is never overwritten. Vendors are matched by name against res.partner records with supplier_rank > 0.

Adds:
- account.move.ocr_state (selection: not_requested/pending/processing/done/failed/manual)
- account.move.ocr_raw_text, ocr_extracted_data (Json), ocr_backend, ocr_confidence
- fusion.ocr.log (audit trail per OCR run)
- res.company.fusion_ocr_enabled / fusion_ocr_default_backend / auto_run
- /fusion/ocr/request_for_invoice JSON-RPC endpoint

Backend availability is detected at runtime via OCRProvider.is_available() classmethods. Tesseract 5.3.4 + pytesseract 0.3.13 + pdf2image 1.17.0 are installed in the container.

Tests: 13 (TesseractAdapter availability + image OCR; flow tests for draft autofill, no-attachment guard, customer-invoice guard, ref-not-overwritten; field parser empty/clean-json/markdown-fence/bad-JSON/provider-exception). All pass on westin-v19 OrbStack VM.

Made-with: Cursor
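The Stage 2 behaviour described in the commit message (tolerating empty, fenced, or malformed LLM replies, then filling only empty fields) could be sketched roughly as follows. This is an illustration under assumptions, not the fusion_accounting_ai implementation: the function names `parse_llm_fields` and `fill_empty_only` are hypothetical, and plain dicts stand in for the account.move record.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that LLMs often wrap JSON in


def parse_llm_fields(reply):
    """Return the JSON object found in an LLM reply, or {} if none parses.

    Hypothetical helper: covers empty replies, clean JSON, JSON wrapped in a
    Markdown fence, and malformed JSON.
    """
    text = (reply or "").strip()
    if not text:
        return {}
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (it may carry a "json" tag) and the closing fence.
        body = text[len(FENCE):-len(FENCE)]
        _first_line, sep, rest = body.partition("\n")
        text = (rest if sep else body).strip()
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {}
    # Only a JSON object maps onto invoice fields; lists or scalars are rejected.
    return data if isinstance(data, dict) else {}


def fill_empty_only(current, extracted):
    """Return the extracted values whose target field is currently empty.

    Assumption: a field counts as empty when its current value is falsy
    (None, False, '', 0), so user-entered values are never overwritten.
    """
    return {
        key: value
        for key, value in extracted.items()
        if value not in (None, "", [], {}) and not current.get(key)
    }
```

A caller would pass the current draft values and the parsed fields, then write back only the returned dict. Treating every falsy value as empty is a simplification; a real flow may need per-field rules (for example, a legitimate zero amount).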
72 lines
2.2 KiB
Python
"""Tesseract OCR adapter.

Uses the system tesseract binary via pytesseract, with poppler-backed
PDF rendering via pdf2image. Inside the container these are pre-installed:
  - tesseract-ocr 5.3.4
  - pytesseract 0.3.13
  - pdf2image 1.17.0
  - poppler-utils
"""

import io
import logging

from .base import OCRProvider, OCRResult

_logger = logging.getLogger(__name__)


class TesseractAdapter(OCRProvider):
    name = 'tesseract'

    @classmethod
    def is_available(cls) -> bool:
        try:
            import pytesseract
            from pdf2image import convert_from_bytes  # noqa: F401
            from PIL import Image  # noqa: F401
            pytesseract.get_tesseract_version()
            return True
        except Exception as e:
            _logger.debug("TesseractAdapter not available: %s", e)
            return False

    def extract(self, image_or_pdf_bytes, *, mimetype='application/pdf'):
        import pytesseract
        from pdf2image import convert_from_bytes
        from PIL import Image

        try:
            is_pdf = (
                mimetype == 'application/pdf'
                or (image_or_pdf_bytes[:4] == b'%PDF')
            )
            if is_pdf:
                pages = convert_from_bytes(image_or_pdf_bytes, dpi=200)
            else:
                img = Image.open(io.BytesIO(image_or_pdf_bytes))
                pages = [img]

            texts = []
            for p in pages:
                texts.append(pytesseract.image_to_string(p))
            full_text = '\n\f\n'.join(texts)

            # Heuristic confidence - tesseract has a per-word conf in
            # image_to_data, but a length proxy is fine for routing
            # decisions. Future: use pytesseract.image_to_data for a real
            # average word-level confidence.
            conf = min(1.0, len(full_text) / 1000.0)
            return OCRResult(
                raw_text=full_text,
                confidence=conf,
                pages=len(pages),
                backend='tesseract',
            )
        except Exception as e:
            _logger.warning("Tesseract OCR failed: %s", e)
            return OCRResult(
                raw_text='', confidence=0.0, pages=0,
                backend='tesseract', error=str(e),
            )
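The comment in extract() mentions replacing the length proxy with a real average word-level confidence from pytesseract.image_to_data. A minimal sketch of that averaging step is below. It assumes the input dict has the shape produced by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, with parallel "text" and "conf" lists where conf is -1 for non-word boxes. The helper name `mean_word_confidence` is hypothetical and not part of the adapter.

```python
def mean_word_confidence(data):
    """Average Tesseract word confidence, scaled to the 0.0-1.0 range.

    Assumes `data` looks like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT):
    parallel lists under "text" and "conf", with conf == -1 for boxes
    that are not recognised words and 0-100 otherwise.
    """
    scores = []
    for text, conf in zip(data.get("text", []), data.get("conf", [])):
        try:
            value = float(conf)  # conf may arrive as int, float, or str
        except (TypeError, ValueError):
            continue
        # Skip layout boxes (conf < 0) and whitespace-only "words".
        if value >= 0 and str(text).strip():
            scores.append(value)
    if not scores:
        return 0.0
    return sum(scores) / len(scores) / 100.0
```

For a multi-page PDF, one option would be to concatenate the per-page "text" and "conf" lists before averaging so that each word carries equal weight. Note that image_to_data runs a second recognition pass on top of image_to_string, so the extra cost per page is worth measuring before adopting it.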