
How to parse invoices accurately with bem
Invoices can be gnarly. Here's a production-ready, enterprise workflow to extract their data accurately (and enrich it with your source of truth).
If you are building an accounts payable automation or a spend management platform, you know the reality of invoices: they are chaotic. They arrive as PDFs, images, or forwarded emails. They have infinite layouts. Tables span multiple pages. Line items are messy.
Most developers start by throwing these files at a generic LLM with a prompt like "Extract the invoice number and line items."
This works for a hackathon demo. It fails in production.
In finance, "mostly accurate" is a bug. You cannot have a 95% success rate when processing millions of dollars in payments. You need deterministic outputs from probabilistic inputs.
At bem, we don't do "vibe coding." We build atomic units of functionality that enforce strict contracts between your messy data and your database.
Here is how to engineer a production-ready invoice pipeline that doesn't just extract text, but maps it to your internal source of truth.
1. The Contract: Defining the Transform Function
We don't send a chat message to a model asking it to "be helpful." We define a Transform Function. This is a strict schema definition that forces the underlying inference engine to adhere to a specific JSON structure.
For an invoice, we care about specific fields: Vendor Name, Invoice Number, Date, and the Line Items table.
// POST /v2/functions
// Creating the "Invoice Extractor" Transform Function

const invoiceSchema = {
  type: "object",
  properties: {
    vendor_name: { type: "string", description: "The name of the vendor issuing the invoice" },
    invoice_number: { type: "string", description: "Unique identifier for the invoice" },
    total_amount: { type: "number", description: "The final total including tax" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
          total: { type: "number" }
        },
        required: ["description", "total"]
      }
    }
  },
  required: ["vendor_name", "total_amount", "line_items"]
};

await fetch("https://api.bem.ai/v2/functions", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY }, // You can get this by signing up on app.bem.ai!
  body: JSON.stringify({
    functionName: "invoice-extractor-v1",
    type: "transform",
    outputSchemaName: "Invoice Schema",
    outputSchema: invoiceSchema,
  })
});
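Even with a strict schema, it can be worth re-checking the contract on your side before anything touches your ledger. The following is a minimal sketch of such a check, not part of the bem API: validateInvoice is a hypothetical helper that mirrors the required lists from the schema above and adds one arithmetic sanity check. The tolerance value and the rule that line items should not exceed a tax-inclusive total are assumptions for illustration.

```javascript
// Hypothetical client-side check; not part of the bem API.
// The field lists mirror the `required` arrays in invoiceSchema.
const REQUIRED_FIELDS = ["vendor_name", "total_amount", "line_items"];
const REQUIRED_LINE_ITEM_FIELDS = ["description", "total"];

function validateInvoice(invoice, tolerance = 0.01) {
  const errors = [];

  // Top-level required fields.
  for (const field of REQUIRED_FIELDS) {
    if (invoice[field] === undefined || invoice[field] === null) {
      errors.push(`missing field: ${field}`);
    }
  }

  // Required fields on every line item.
  const lineItems = Array.isArray(invoice.line_items) ? invoice.line_items : [];
  lineItems.forEach((item, index) => {
    for (const field of REQUIRED_LINE_ITEM_FIELDS) {
      if (item[field] === undefined || item[field] === null) {
        errors.push(`line_items[${index}] missing field: ${field}`);
      }
    }
  });

  // Assumption: total_amount includes tax, so the line items should not exceed it.
  const lineSum = lineItems.reduce((sum, item) => sum + (Number(item.total) || 0), 0);
  if (typeof invoice.total_amount === "number" && lineSum - invoice.total_amount > tolerance) {
    errors.push(
      `line items sum to ${lineSum.toFixed(2)}, above total_amount ${invoice.total_amount.toFixed(2)}`
    );
  }

  return { valid: errors.length === 0, errors };
}
```

A payload that fails this check is a good candidate for a human review queue rather than an automatic payment run.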
2. The Source of Truth: Collections & Enrichment
Extracting "Amazon Web Services" from a PDF is easy. Knowing that "Amazon Web Services", "AWS", and "Amazon.com" all map to Vendor ID vnd_9982 in your ERP is hard.
Standard LLMs hallucinate these mappings. bem uses Collections and Enrich Functions to ground AI outputs in your actual data.
First, we upload your vendor master list to a Collection. This is a vector-embedded database managed by bem.
// 1. Initialize the Collection
// POST /v2/collections

await fetch("https://api.bem.ai/v2/collections", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY },
  body: JSON.stringify({
    collectionName: "vendor-master-list"
  })
});

// 2. Upload your Vendor Master List
// POST /v2/collections/items

await fetch("https://api.bem.ai/v2/collections/items", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY },
  body: JSON.stringify({
    collectionName: "vendor-master-list",
    items: [
      { data: { id: "vnd_9982", name: "Amazon Web Services", category: "Cloud Infrastructure" } },
      { data: { id: "vnd_5521", name: "WeWork", category: "Office Rent" } },
      // ...
    ]
  })
});
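A real vendor master list has thousands of rows, not two. Here is a rough sketch of how you might turn ERP rows into the items shape used above and split them into batches before uploading. The helper names (toCollectionItems, chunk, buildUploadBodies) are hypothetical, and the default batch size of 500 is an assumption, not a documented bem limit.

```javascript
// Hypothetical helpers; the batch size is an assumption, not a documented limit.

// Wrap each ERP row in the { data: {...} } shape used by the upload call.
function toCollectionItems(vendorRows) {
  return vendorRows.map((row) => ({
    data: { id: row.id, name: row.name.trim(), category: row.category },
  }));
}

// Split an array into consecutive batches of at most `size` elements.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build one JSON request body per batch, ready to POST to /v2/collections/items.
function buildUploadBodies(collectionName, vendorRows, batchSize = 500) {
  return chunk(toCollectionItems(vendorRows), batchSize).map((items) =>
    JSON.stringify({ collectionName, items })
  );
}
```

Each returned string can be sent as the body of its own upload request, so one failed batch can be retried without resending the whole list.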
Next, we create an Enrich Function. This function takes the extracted vendor name from step 1, performs a semantic search against your Collection, and appends the correct Vendor ID to the payload.
This isn't a guess; it's a retrieval-augmented lookup configured strictly via JMESPath selectors.
// POST /v2/functions
// Creating the "Vendor Matcher" Enrich Function

await fetch("https://api.bem.ai/v2/functions", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY },
  body: JSON.stringify({
    functionName: "vendor-enricher-v1",
    type: "enrich",
    config: {
      steps: [
        {
          // Take the extracted name from the previous function
          sourceField: "vendor_name",
          // Search against your master list
          collectionName: "vendor-master-list",
          // Inject the result into a new field
          targetField: "matched_vendor_record",
          // Use hybrid search for best results (keyword + semantic)
          searchMode: "hybrid",
          topK: 1
        }
      ]
    }
  })
});
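Downstream, your code still has to read the match out of the enriched payload. The sketch below is written defensively because the exact shape of matched_vendor_record is an assumption here: it might be a single record or a one-element array, and the uploaded fields might sit at the top level or under data. resolveVendorId is a hypothetical helper, so check the shape against a real response before relying on it.

```javascript
// Hypothetical helper. The shape of `matched_vendor_record` is assumed, so the
// function accepts a single record or an array, with or without a `data` wrapper.
function resolveVendorId(enriched) {
  const match = enriched.matched_vendor_record;
  const record = Array.isArray(match) ? match[0] : match;
  if (!record || typeof record !== "object") {
    return null; // No match: send the invoice to manual review.
  }
  const fields = record.data && typeof record.data === "object" ? record.data : record;
  return typeof fields.id === "string" ? fields.id : null;
}
```

Returning null instead of a best guess keeps the failure explicit: an invoice without a resolved vendor ID should stop at a review step, not flow into your ERP.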
3. The Orchestration: The Workflow
Now we chain them together. In bem, a Workflow is a Directed Acyclic Graph (DAG) of functions. The output of the invoice-extractor becomes the input of the vendor-enricher.
// POST /v2/workflows
// Linking extraction and enrichment

await fetch("https://api.bem.ai/v2/workflows", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY },
  body: JSON.stringify({
    name: "process-invoices-production",
    mainFunction: {
      name: "invoice-extractor-v1",
      versionNum: 1
    },
    relationships: [
      {
        sourceFunction: { name: "invoice-extractor-v1", versionNum: 1 },
        destinationFunction: { name: "vendor-enricher-v1", versionNum: 1 }
      }
    ]
  })
});
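Because a workflow is a DAG, a relationships list that loops back on itself is a configuration error. As an illustration only, here is a local pre-flight check using Kahn's algorithm that returns an execution order or throws on a cycle. topologicalOrder is a hypothetical helper that runs on your side; it keys nodes by function name and ignores versionNum, which is a simplifying assumption.

```javascript
// Hypothetical local check, not a bem API call.
// Kahn's algorithm: repeatedly take nodes with no remaining incoming edges.
function topologicalOrder(relationships) {
  const indegree = new Map(); // function name -> number of incoming edges
  const edges = new Map();    // function name -> names it feeds into

  for (const { sourceFunction, destinationFunction } of relationships) {
    const from = sourceFunction.name;
    const to = destinationFunction.name;
    if (!indegree.has(from)) indegree.set(from, 0);
    indegree.set(to, (indegree.get(to) || 0) + 1);
    if (!edges.has(from)) edges.set(from, []);
    edges.get(from).push(to);
  }

  const ready = [...indegree.keys()].filter((name) => indegree.get(name) === 0);
  const order = [];
  while (ready.length > 0) {
    const node = ready.shift();
    order.push(node);
    for (const next of edges.get(node) || []) {
      indegree.set(next, indegree.get(next) - 1);
      if (indegree.get(next) === 0) ready.push(next);
    }
  }

  // Any node left unvisited sits on a cycle.
  if (order.length !== indegree.size) {
    throw new Error("workflow contains a cycle");
  }
  return order;
}
```

For the two-function workflow above, the order is simply the extractor followed by the enricher; the check earns its keep once a workflow has branches.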
4. Production: The Event-Driven Loop
We built bem to be asynchronous by default. Blocking APIs, which hold the connection open until the response is ready, are fragile. If you send a 50-page invoice to a blocking API, the request will eventually time out.
Our architecture allows you to dispatch thousands of invoices simultaneously without degrading performance.
Dispatch (The Call)
You trigger a workflow with a single API call. We accept base64 strings and multipart form data.
// POST /v2/calls
// Fire and forget. Non-blocking.

await fetch("https://api.bem.ai/v2/calls", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY },
  body: JSON.stringify({
    calls: [{
      workflowName: "process-invoices-production",
      // Pass your own internal ID for tracking
      callReferenceID: "invoice_db_id_8823",
      input: {
        singleFile: {
          inputType: "pdf",
          inputContent: "base64_encoded_pdf_string..."
        }
      }
    }]
  })
});
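The base64 placeholder above has to come from somewhere. One way to produce it in Node.js is sketched below: buildCallBody is a hypothetical helper that takes the raw PDF bytes as a Buffer (for example from fs.readFileSync) and returns the JSON body for the call. The check for the "%PDF-" marker is an extra guard added for illustration, based on how PDF files normally begin.

```javascript
// Hypothetical helper for Node.js; builds the JSON body for POST /v2/calls.
function buildCallBody(workflowName, callReferenceID, pdfBytes) {
  if (!Buffer.isBuffer(pdfBytes) || pdfBytes.length === 0) {
    throw new TypeError("pdfBytes must be a non-empty Buffer");
  }
  // PDF files normally start with the ASCII marker "%PDF-".
  if (pdfBytes.subarray(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error("input does not look like a PDF");
  }
  return JSON.stringify({
    calls: [
      {
        workflowName,
        callReferenceID,
        input: {
          singleFile: {
            inputType: "pdf",
            inputContent: pdfBytes.toString("base64"),
          },
        },
      },
    ],
  });
}

// Usage sketch (file path is illustrative):
//   const fs = require("node:fs");
//   const body = buildCallBody("process-invoices-production", "invoice_db_id_8823",
//                              fs.readFileSync("invoice.pdf"));
```

Rejecting a non-PDF upload before dispatch is cheaper than discovering it after the workflow has run.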
React (The Webhook)
You don't poll. You subscribe. When the workflow completes, we push the fully enriched JSON payload to your endpoint. (If you really need to poll, you can! Just use our GET endpoint.)
// POST /v1-alpha/subscriptions
// Listen for completed transformations

await fetch("https://api.bem.ai/v1-alpha/subscriptions", {
  method: "POST",
  headers: { "x-api-key": process.env.BEM_API_KEY },
  body: JSON.stringify({
    name: "invoice-complete-listener",
    type: "transform",
    workflowName: "process-invoices-production",
    webhookURL: "https://api.your-company.com/webhooks/bem/invoices"
  })
});
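On the receiving side, the core of a webhook endpoint can be a small pure function that is easy to test. The sketch below makes two assumptions that you should verify against a real delivery: that the payload echoes back the callReferenceID you sent at dispatch, and that the same event may arrive more than once. handleWebhookBody and the in-memory Map are hypothetical stand-ins for your own handler and database.

```javascript
// Hypothetical handler core. Assumes the payload carries `callReferenceID`
// and that deliveries may be repeated, so processing is made idempotent.
function handleWebhookBody(rawBody, processed) {
  let event;
  try {
    event = JSON.parse(rawBody);
  } catch {
    return { status: 400, reason: "invalid JSON" };
  }

  const referenceId = event && event.callReferenceID;
  if (typeof referenceId !== "string" || referenceId === "") {
    return { status: 400, reason: "missing callReferenceID" };
  }

  // Acknowledge duplicates without writing the record twice.
  if (processed.has(referenceId)) {
    return { status: 200, duplicate: true };
  }

  processed.set(referenceId, event);
  return { status: 200, duplicate: false };
}
```

Wire this into whatever HTTP framework you already run: read the request body, call the function, and respond with the returned status code.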
Extra Credit: Trust, but Verify
Shipping AI to production without regression testing is negligence.
Because bem treats functions as versioned primitives, you can run evaluations programmatically. Before promoting invoice-extractor-v2, you can use our /v2/functions/regression endpoint to replay historical data against the new version and compare accuracy metrics.
We also offer a specific endpoint, /v2/functions/review, which calculates the statistical confidence of your pipeline and estimates the human review effort needed to hit 99.9% accuracy.
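To make the idea of comparing versions concrete, here is a small local illustration of field-level accuracy on historical data. It is not how the regression endpoint computes its metrics; fieldAccuracy and compareVersions are hypothetical helpers that assume you already have labeled expected records and the outputs of both versions in the same order, and they compare scalar fields only.

```javascript
// Hypothetical local illustration; not the computation behind any bem endpoint.

// Fraction of records where each scalar field matches the labeled value exactly.
function fieldAccuracy(expectedRecords, actualRecords, fields) {
  const result = {};
  for (const field of fields) {
    let correct = 0;
    expectedRecords.forEach((expected, i) => {
      const actual = actualRecords[i] || {};
      if (expected[field] === actual[field]) correct += 1;
    });
    result[field] = expectedRecords.length === 0 ? 0 : correct / expectedRecords.length;
  }
  return result;
}

// Per-field accuracy for both versions, flagging fields where v2 got worse.
function compareVersions(expectedRecords, v1Outputs, v2Outputs, fields) {
  const v1 = fieldAccuracy(expectedRecords, v1Outputs, fields);
  const v2 = fieldAccuracy(expectedRecords, v2Outputs, fields);
  return fields.map((field) => ({
    field,
    v1: v1[field],
    v2: v2[field],
    regressed: v2[field] < v1[field],
  }));
}
```

A per-field breakdown matters because an overall score can improve while a single critical field, such as the total amount, quietly gets worse.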
Summary: Enterprise-Grade Reliability
Invoices are messy, but your architecture shouldn't be. By decoupling the extraction (Transform) from the grounding (Enrich), and orchestrating them in an event-driven workflow, you build a system that is:
- Embeddable: It fits into your existing event bus.
- Customizable: You define the schema, not us.
- Accurate: Data is validated against your own database via Collections.
Stop writing prompts. Start building pipelines.
Ready to see it in action?
Talk to our team to walk through how bem can work inside your stack.


