Designing a Schema-Guided Invoice Intelligence Pipeline with lift-pdf for Accounts-Payable Extraction, Validation, and Ledger Generation

A new tutorial outlines a method for building an accounts-payable extraction pipeline using synthetic invoice PDFs and a structured JSON schema as the target output format.

The approach treats invoice parsing as schema-guided document understanding rather than a simple optical character recognition task. The workflow generates realistic invoices, defines fields such as vendor identity, billing party, purchase order number, line items, tax, total amount, balance due, and payment status, then asks the model to extract those values directly from the rendered PDF layout.
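
As a rough illustration of what such a target schema could look like, the sketch below expresses the fields named above as a JSON Schema held in a Python dict. The property names (vendor_name, bill_to, po_number and so on) are assumptions made for this example; the tutorial's own schema may name or nest them differently, and the exact shape lift expects is not stated here.

```python
import json

# Hypothetical schema: property names are illustrative, not the tutorial's own.
NULLABLE_STRING = {"type": ["string", "null"]}
NULLABLE_NUMBER = {"type": ["number", "null"]}

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "bill_to": {"type": "string"},
        "po_number": NULLABLE_STRING,      # null when the invoice shows no PO
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
        "tax": NULLABLE_NUMBER,            # null when no tax line is printed
        "total_amount": {"type": "number"},
        "balance_due": {"type": "number"},
        "is_paid": {"type": "boolean"},
    },
    "required": ["vendor_name", "bill_to", "total_amount", "balance_due", "is_paid"],
}

# The schema is plain data, so it serialises to JSON without extra work.
print(json.dumps(INVOICE_SCHEMA, indent=2)[:200])
```

Declaring optional fields as nullable, rather than leaving them out, gives the model an explicit way to say "not present on this invoice" instead of guessing a value.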

The setup includes practical extraction traps common in finance workflows. These include distinguishing bill-to from ship-to addresses, separating subtotals from after-tax totals, returning null for absent values, and correctly marking partially paid invoices as unpaid when a balance remains.
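
Several of these traps can be caught after extraction with simple consistency checks. The following is a minimal sketch, not the tutorial's validation code: the field names (subtotal, tax, discount, total_amount, amount_paid, balance_due, is_paid), the one-cent tolerance and the sample figures are all assumptions made for illustration.

```python
from decimal import Decimal

CENT = Decimal("0.01")  # assumed tolerance for rounding differences


def to_decimal(value):
    """Convert a number or None to Decimal, treating None as zero."""
    return Decimal(str(value if value is not None else 0))


def check_invoice(rec):
    """Return a list of human-readable problems found in one extracted record."""
    problems = []
    subtotal = to_decimal(rec["subtotal"])
    tax = to_decimal(rec.get("tax"))
    discount = to_decimal(rec.get("discount"))
    total = to_decimal(rec["total_amount"])
    paid = to_decimal(rec.get("amount_paid"))
    balance = to_decimal(rec["balance_due"])

    # A subtotal mistaken for the after-tax total fails this identity.
    if abs(subtotal - discount + tax - total) > CENT:
        problems.append("total does not equal subtotal - discount + tax")
    if abs(total - paid - balance) > CENT:
        problems.append("balance_due does not equal total - amount_paid")
    # A partially paid invoice still has a balance, so it must not be marked paid.
    if rec["is_paid"] != (balance == 0):
        problems.append("is_paid disagrees with balance_due")
    return problems


# Hypothetical records: the second is partially paid but wrongly flagged as paid.
good = {"subtotal": 100.0, "tax": 8.5, "discount": None, "total_amount": 108.5,
        "amount_paid": 0, "balance_due": 108.5, "is_paid": False}
bad = dict(good, amount_paid=50.0, balance_due=58.5, is_paid=True)

print(check_invoice(good))
print(check_invoice(bad))
```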

Technical setup

The tutorial begins by defining runtime controls that decide how many invoices to process, whether to use 4-bit loading, whether to preview the generated PDF, and whether to test a real invoice later.

It installs core dependencies for PDF generation, rendering, tabular analysis, plotting, and lift-pdf inference. The code pins Pillow to version 11.3.0 to address a known Colab compatibility issue involving Pillow, torchvision, and Transformers.

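N_DOCS               = 3        # how many synthetic invoices to process
FORCE_FULL_PRECISION = False    # True: skip 4-bit loading even on smaller GPUs
FORCE_4BIT           = False    # True: always load the model in 4-bit NF4
SHOW_FIRST_PAGE      = True     # preview the first generated PDF page
RUN_ON_REAL_PDF      = False    # True: also test a real invoice later
REAL_PDF_URL         = ""       # URL of that real invoice (empty by default)
REAL_PDF_PAGES       = "0-1"    # page range used for the real invoice
PIN_PILLOW           = True     # pin Pillow to the version below
PILLOW_VERSION       = "11.3.0"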
import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def pip(*pkgs, upgrade=False):
   """Install without invoking a shell (so '[hf]' is never glob-expanded)."""
   args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
   print("  pip install", *pkgs)
   subprocess.run(args, check=False)
print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")  
pip("lift-pdf[hf]")                                     
pip("bitsandbytes", "accelerate", upgrade=True)         
if PIN_PILLOW:
   pip(f"pillow=={PILLOW_VERSION}")
   if "PIL" in sys.modules:                      
       import PIL
       if getattr(PIL, "__version__", "") != PILLOW_VERSION:
           print(f"     Pinned Pillow {PILLOW_VERSION} on disk, but a stale "
                 f"{getattr(PIL, '__version__', '?')} is loaded in memory — restarting runtime.")
           print("     Just re-run the cell(s) after Colab reconnects.")
           os.kill(os.getpid(), 9)
print("     …install finished.\n")
import torch

The code detects the available GPU and exits with an explanatory message if CUDA is not found. That message recommends an A100 as the best option, though L4 and T4 cards also work.

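It patches from_pretrained on several Transformers Auto* classes, and on PreTrainedModel, so that lift transparently loads the checkpoint with a BitsAndBytes quantization configuration when needed. The script chooses 4-bit NF4 when the GPU reports less than roughly 34 GB of VRAM, unless full precision is forced, and otherwise loads in full precision. The compute dtype is bfloat16 on GPUs with compute capability 8.0 or higher and float16 on older cards.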

The InferenceManager is initialised once and reused across all invoices to avoid repeated model-loading overhead. The code wraps lift's extract() function in a helper so that each PDF is processed with the same schema and an optional page range.

def detect_gpu():
   if not torch.cuda.is_available():
       raise SystemExit(
           "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
           "(A100 is best; L4/T4 also work).\n"
       )
   p  = torch.cuda.get_device_properties(0)
   cc = torch.cuda.get_device_capability(0)
   return p.name, p.total_memory / 1e9, cc
def enable_4bit(compute_dtype):
   """Load lift's weights in 4-bit NF4 whatever transformers Auto* class it uses internally."""
   import inspect, functools, transformers
   from transformers import BitsAndBytesConfig
   bnb = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_use_double_quant=True,
       bnb_4bit_compute_dtype=compute_dtype,
   )
   def patch(cls):
       try:
           cm   = inspect.getattr_static(cls, "from_pretrained")
           orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
       except Exception:
           return
       @functools.wraps(orig)
       def inner(cls_, *args, **kwargs):
           kwargs.setdefault("quantization_config", bnb)
           kwargs.setdefault("device_map", {"": 0})
           model = orig(cls_, *args, **kwargs)
           try:                                  
               model.to   = lambda *a, **k: model
               model.cuda = lambda *a, **k: model
           except Exception:
               pass
           return model
       cls.from_pretrained = classmethod(inner)
   for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
       c = getattr(transformers, name, None)
       if c is not None:
           patch(c)
   try:
       from transformers.modeling_utils import PreTrainedModel
       patch(PreTrainedModel)
   except Exception:
       pass
print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
use_4bit      = FORCE_4BIT or (vram < 34 and not FORCE_FULL_PRECISION)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16  
print(f"     GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f"     Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")
os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
   enable_4bit(compute_dtype)
from lift import extract
from lift.model import InferenceManager
print("     Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")         
print(f"     ✓ model ready in {time.time() - _t:.0f}s\n")
def run_lift(pdf_path, schema, page_range=None):
   kw = {"model": MODEL}
   if page_range:
       kw["page_range"] = page_range
   result = extract(pdf_path, schema, **kw)
   return getattr(result, "extraction", None)
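
The reuse pattern described above can be sketched as a small batch loop. This is an illustration rather than the tutorial's code: extract_batch and fake_extractor are hypothetical names, and the extractor is passed in as a parameter so the loop can be exercised without a GPU. In the notebook, run_lift would be passed in its place.

```python
import time


def extract_batch(pdf_paths, schema, extractor):
    """Apply one extractor to many PDFs, recording the result and elapsed time.

    `extractor` is any callable taking (pdf_path, schema). Injecting it keeps
    the loop independent of how, or whether, a model has been loaded.
    """
    rows = []
    for path in pdf_paths:
        start = time.perf_counter()
        try:
            fields, error = extractor(path, schema), None
        except Exception as exc:  # keep going if one document fails
            fields, error = None, repr(exc)
        rows.append({
            "pdf": path,
            "fields": fields,
            "error": error,
            "seconds": time.perf_counter() - start,
        })
    return rows


def fake_extractor(pdf_path, schema):
    """Stand-in for demonstration only; it does not call lift."""
    if pdf_path.endswith("broken.pdf"):
        raise ValueError("could not render")
    return {key: None for key in schema["properties"]}


demo_rows = extract_batch(
    ["invoice_1.pdf", "broken.pdf"],
    {"properties": {"invoice_number": {}, "total_amount": {}}},
    fake_extractor,
)
for row in demo_rows:
    print(row["pdf"], row["fields"], row["error"])
```

Recording failures per document, instead of letting one exception stop the run, keeps a single unreadable PDF from discarding the results for the rest of the batch.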

Test data

The tutorial defines three synthetic invoices to test the pipeline.

The first document is INV-2026-0412 from Cloudworks Inc. dated 04 May 2026. The vendor address is 500 Market St, Suite 900, San Francisco, CA 94105, USA. The bill-to party is Acme Robotics LLC at 12 Foundry Rd, Pittsburgh, PA 15222, USA. The ship-to location is Acme Robotics — Warehouse 4 at 88 Dockside Blvd, Newark, NJ 07114, USA. The currency is USD. The tax rate is 0.085. No amount has been paid. The line items include Cloud Compute — Standard tier (monthly) at 240.00, Object Storage — 2 TB at 46.00, and Priority Support add-on at 99.00. The notes state payment is due within 30 days and late payments accrue 1.5% monthly interest.
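
To show how ground-truth totals for such an invoice could be derived, here is a worked sketch using the figures quoted for the first document. It rests on two assumptions not confirmed by the text: each quoted figure is treated as the full line amount (quantity 1), and tax is rounded half-up to the cent. The tutorial's actual totals may therefore differ.

```python
from decimal import Decimal, ROUND_HALF_UP

# Line amounts and tax rate as quoted for INV-2026-0412.
# ASSUMPTION: each figure is the full line amount (quantity 1); the text does
# not state quantities, so the real invoice totals may differ.
line_amounts = [
    Decimal("240.00"),  # Cloud Compute, Standard tier (monthly)
    Decimal("46.00"),   # Object Storage, 2 TB
    Decimal("99.00"),   # Priority Support add-on
]
tax_rate = Decimal("0.085")
amount_paid = Decimal("0.00")  # the text says no amount has been paid


def money(value):
    """Round to whole cents (half-up rounding is an assumption)."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


subtotal = money(sum(line_amounts, Decimal("0")))
tax = money(subtotal * tax_rate)
total = subtotal + tax
balance_due = total - amount_paid
is_paid = balance_due == 0

print(subtotal, tax, total, balance_due, is_paid)
```

Using Decimal rather than float keeps the half-cent case in the tax calculation deterministic, which matters when extracted values are compared against expected ones.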

The second document is INV-ND-2026-118 from Nordic Design Studio Oy dated 18 April 2026. The vendor address is Eteläranta 12, 00130 Helsinki, Finland. The bill-to party is Helsinki Media Oy at Mannerheimintie 4, 00100 Helsinki, Finland. There is no ship-to address. The purchase order number is PO-HM-5589. The discount amount is 785.00. The currency is EUR.
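
The missing ship-to address is the kind of field that should come back as null instead of a guess. A minimal sketch of an expected-value record for this second document follows; the key names are assumptions for illustration, and only the values quoted above are included.

```python
import json

# Hypothetical key names; values are the ones quoted for INV-ND-2026-118.
expected_doc2 = {
    "invoice_number": "INV-ND-2026-118",
    "vendor_name": "Nordic Design Studio Oy",
    "vendor_address": "Eteläranta 12, 00130 Helsinki, Finland",
    "bill_to_name": "Helsinki Media Oy",
    "bill_to_address": "Mannerheimintie 4, 00100 Helsinki, Finland",
    "ship_to_address": None,   # absent on the invoice, so null and not ""
    "po_number": "PO-HM-5589",
    "discount": 785.00,
    "currency": "EUR",
}

# Python's None serialises to JSON null; ensure_ascii=False keeps "ä" readable.
as_json = json.dumps(expected_doc2, ensure_ascii=False, indent=2)
print(as_json)
```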
