
Processing a 456-Page Contract in Seconds: Two Approaches to Document Intelligence
The market is splitting into two approaches for working with unstructured data: compressed semantics (RAG) and just-in-time agent access. This guide walks through both using the NFL's 456-page CBA, with working code examples in Python and curl.

The way production teams access unstructured data is splitting into two fundamentally different approaches. Understanding both is the difference between building a system that scales and one that breaks at 10x volume.
This guide walks through both approaches using a real document: the NFL's 456-page Collective Bargaining Agreement. We'll process it with Bem's V3 API, show working code, and explain when to use which approach.
The Market Is Diverging
Over the past 18 months, we've watched hundreds of production teams build document intelligence systems. A clear pattern has emerged: there are now two schools of thought on how to make unstructured data usable, and they're moving in opposite directions.
Approach 1: Compress Semantics Ahead of Time (RAG)
The first approach, popularized by the RAG (Retrieval-Augmented Generation) wave, works like this: ingest your documents, chunk them, embed the chunks into a vector database, and retrieve relevant pieces at query time. The semantics are compressed before anyone asks a question. Its strengths:
- Predictable latency at query time
- Works well for known question patterns
- Cost-efficient for high-volume, repetitive queries
- Schema-enforced outputs with confidence scoring
The tradeoff: chunking is lossy. A table header on page 1 that defines the unit of measurement for numbers on page 47 gets separated during chunking. Context that spans sections, pages, or documents is lost. For many use cases, this is fine. For claims adjudication on a 456-page contract, it's a liability.
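To make the header problem concrete, here is a minimal sketch of fixed-size chunking on a toy document. The `chunk` helper, the chunk size, and the table contents are all made up for illustration; real ingesters chunk by tokens and often overlap chunks, which narrows the gap but does not close it.

```python
def chunk(lines, size):
    """Split a list of lines into fixed-size chunks, as a naive ingester might."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# Toy document: the header defines the unit, the data rows appear much later.
document = ["Table 1: Amounts (in millions of USD)"]
document += [f"filler line {n}" for n in range(1, 10)]
document += ["Row A: 12.5", "Row B: 40.0"]

chunks = chunk(document, size=4)

# Find the chunk that holds the figure and check whether the unit came with it.
hit = next(c for c in chunks if any("Row B" in line for line in c))
has_unit = any("millions" in line for line in hit)

print(hit)
print("unit header in same chunk:", has_unit)
```

A retriever that returns only the matching chunk hands the model "40.0" with no way to know it means millions of dollars.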
Approach 2: Just-in-Time Semantics (Agent-Native)
The second approach skips chunking entirely. Instead, the full document is made available to an agent, which traverses it using file-system-level operations: ls, grep, cat, find. The agent decides what to read, when to read it, and how deep to go. Semantics are resolved just-in-time. Its strengths:
- No context loss from chunking
- Handles cross-reference and multi-section reasoning
- Agents can follow the document's own structure
- Better for interpretive, open-ended questions
The tradeoff: token cost can be higher per query, and the agent's reasoning path is less predictable. For a chatbot answering "what holidays do union members get?", this approach is powerful. For processing 10,000 invoices per hour, it's overkill.
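The access pattern is easier to see in miniature. The sketch below is a toy, in-memory stand-in for a parsed document, not Bem's implementation: the `SECTIONS` dictionary, the paths, and the placeholder text (which is not quoted from the CBA) are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for a parsed document: section path -> full text.
SECTIONS = {
    "article-01/definitions.txt": "A 'League Year' runs from March to March.",
    "article-12/revenue.txt": "Revenue is accounted per League Year. See Article 1.",
    "article-13/salary-cap.txt": "The Salary Cap is set each League Year.",
}


def ls(prefix=""):
    """List section paths, optionally restricted to a prefix."""
    return sorted(p for p in SECTIONS if p.startswith(prefix))


def grep(pattern):
    """Return (path, text) pairs whose text matches the pattern, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(p, t) for p, t in sorted(SECTIONS.items()) if rx.search(t)]


def cat(path):
    """Return the full, unchunked text of one section."""
    return SECTIONS[path]


# The agent reads only what the question needs, and reads it whole.
for path, _ in grep(r"salary cap"):
    print(path, "->", cat(path))
```

Nothing is embedded or summarized ahead of time; the cost moves to query time, where the agent pays tokens for every section it chooses to open.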
You Need Both
The best production systems use both approaches. Structured extraction for the workflows that run millions of times a day. Agent-native document access for the questions that require judgment. Bem supports both through two composable primitives: Extract and Parse.
The Document: NFL's 456-Page CBA
To demonstrate both approaches, we're using the NFL-NFLPA Collective Bargaining Agreement. It's 456 pages of dense legal language covering player compensation, salary caps, benefits, drug policies, disciplinary procedures, and more. It's the kind of document that takes a legal team days to review manually.
You can download the full PDF from the NFLPA website.
Approach 1: Extract (Structured Automation)
Extract is for when you know exactly what data you need from a document. You define a schema, Bem selects the right model, and you get back verified JSON with confidence scores on every field.
Step 1: Define Your Schema
First, create an extraction function with the fields you care about. Here we're pulling the key contract terms: parties, dates, salary cap schedule, and major provisions.
    curl -X POST https://api.bem.ai/v3/functions \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "functionName": "contract-terms",
        "type": "extract",
        "displayName": "Contract Terms Extractor",
        "outputSchemaName": "ContractTerms",
        "outputSchema": {
          "type": "object",
          "required": ["parties", "effectiveDate", "termLength"],
          "properties": {
            "parties": {
              "type": "array",
              "description": "Named parties to the agreement",
              "items": {
                "type": "object",
                "properties": {
                  "name": { "type": "string" },
                  "role": { "type": "string" }
                }
              }
            },
            "effectiveDate": {
              "type": "string",
              "description": "When the agreement takes effect"
            },
            "expirationDate": {
              "type": "string",
              "description": "When the agreement expires"
            },
            "termLength": {
              "type": "string",
              "description": "Duration in years"
            },
            "salaryCap": {
              "type": "object",
              "properties": {
                "amount": { "type": "string" },
                "yearlySchedule": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "year": { "type": "string" },
                      "amount": { "type": "string" }
                    }
                  }
                }
              }
            },
            "keyProvisions": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "title": { "type": "string" },
                  "summary": { "type": "string" }
                }
              }
            },
            "disputeResolution": { "type": "string" }
          }
        }
      }'
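Only three fields in this schema are listed under `required`; everything else may be absent from a response. A small client-side guard can make that explicit before results reach downstream systems. This is a sketch using only the standard library, and `missing_required` is a hypothetical helper, not part of any Bem SDK; it checks top-level keys only and is no substitute for a full JSON Schema validator.

```python
def missing_required(schema, payload):
    """Return the required top-level keys that are absent or empty in payload."""
    return [
        key for key in schema.get("required", [])
        if payload.get(key) in (None, "", [], {})
    ]


# Trimmed copy of the ContractTerms schema: only the part this check reads.
schema = {
    "type": "object",
    "required": ["parties", "effectiveDate", "termLength"],
}

complete = {
    "parties": [{"name": "A", "role": "Union"}],
    "effectiveDate": "2020-03-15",
    "termLength": "11 years",
}
partial = {"parties": [], "effectiveDate": "2020-03-15"}

print(missing_required(schema, complete))  # []
print(missing_required(schema, partial))   # ['parties', 'termLength']
```

Treating an empty list as missing is a design choice: a contract with zero extracted parties is more likely an extraction problem than a fact about the contract.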
Step 2: Create a Workflow and Send the Document
Wrap the function in a workflow and send the PDF. For a 456-page document, use async mode and poll for results.
    # Create workflow
    curl -X POST https://api.bem.ai/v3/workflows \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "contract-analysis",
        "mainNodeName": "contract-terms",
        "nodes": [{
          "name": "contract-terms",
          "function": { "name": "contract-terms" }
        }]
      }'

    # Send the 456-page CBA
    curl -X POST https://api.bem.ai/v3/workflows/contract-analysis/call \
      -H "x-api-key: $BEM_API_KEY" \
      -F "wait=false" \
      -F "callReferenceID=nfl-cba-001" \
      -F "file=@nfl-cba-2020.pdf"
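With `wait=false` the call returns before processing finishes, so the client has to poll. One way to do that is a loop with exponential backoff, sketched here. The `fetch` callable is a placeholder for whatever retrieves the call and unwraps its status; the `"pending"` and `"failed"` status names, the intervals, and the timeout are assumptions (only `"completed"` appears in this guide).

```python
import time


def poll_until_done(fetch, interval=2.0, max_interval=30.0, timeout=600.0,
                    sleep=time.sleep):
    """Call fetch() until it reports a terminal status, backing off between tries.

    fetch is any zero-argument callable returning a dict with a "status" key.
    """
    waited = 0.0
    while True:
        result = fetch()
        if result.get("status") in ("completed", "failed"):
            return result
        if waited >= timeout:
            raise TimeoutError(f"call still {result.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval)  # double the wait, up to a cap


# Stand-in for a real status request: pending twice, then completed.
responses = iter([{"status": "pending"}, {"status": "pending"}, {"status": "completed"}])
naps = []  # records the sleeps instead of actually waiting
final = poll_until_done(lambda: next(responses), sleep=naps.append)
print(final["status"], naps)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in production you would leave it at the default.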
Step 3: Get Verified Results
When the call completes, you get back schema-enforced JSON. Every field has been verified against the document. Confidence scores tell you exactly how certain the extraction is.
    # Poll for results
    curl https://api.bem.ai/v3/calls/{callID} \
      -H "x-api-key: $BEM_API_KEY"

    # Real output from the NFL CBA (456 pages, processed in ~95 seconds):
    {
      "call": {
        "status": "completed",
        "outputs": [{
          "transformedContent": {
            "parties": [
              {
                "name": "National Football League Management Council",
                "role": "Management Council"
              },
              {
                "name": "National Football League Players Association",
                "role": "Union"
              }
            ],
            "effectiveDate": "2020-03-15",
            "termLength": "11 years",
            "salaryCap": {
              "amount": "Calculated based on AR, Projected AR, and Player Cost Amount, with specific percentages for League Media AR, NFL Ventures/Postseason AR, and Local AR."
            },
            "keyProvisions": [
              {
                "title": "No Strike/Lockout/Suit",
                "summary": "Neither party will engage in strikes or lockouts"
              },
              {
                "title": "College Draft",
                "summary": "Rules for annual and supplemental drafts, including eligibility and required tenders"
              },
              {
                "title": "Veteran Free Agency",
                "summary": "Rules for unrestricted and restricted free agents, qualifying offers, and right of first refusal"
              },
              {
                "title": "Franchise and Transition Players",
                "summary": "Rules for designating franchise/transition players, required tenders, and signing periods"
              },
              {
                "title": "Anti-Collusion",
                "summary": "Prohibited conduct, enforcement provisions, burden of proof"
              }
            ],
            "disputeResolution": "System Arbitrator and Impartial Arbitrator with binding authority, appeals panel, and confidentiality provisions"
          }
        }]
      }
    }
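That's a real response. 456 pages of dense legal language, distilled into structured JSON in about 95 seconds. Every field maps to your schema. If a field can't be extracted with high confidence, it's flagged, not hallucinated. Note that expirationDate and salaryCap.yearlySchedule, both optional in the schema, do not appear in this response, so downstream code should not assume they are present.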
Using the Python SDK
The same flow in Python:
    from bem import Bem

    client = Bem()  # reads BEM_API_KEY from environment

    # Call the existing contract-analysis workflow and wait for the result
    call = client.workflows.call(
        workflow_name="contract-analysis",
        file_path="nfl-cba-2020.pdf",
        wait=True
    )

    # Access structured output
    terms = call.outputs[0].transformed_content
    print(f"Agreement: {terms['parties'][0]['name']} and {terms['parties'][1]['name']}")

    # expirationDate and salaryCap.yearlySchedule are optional in the schema,
    # so read them defensively instead of indexing directly
    print(f"Term: {terms['effectiveDate']} to {terms.get('expirationDate', 'not extracted')}")
    for entry in terms.get('salaryCap', {}).get('yearlySchedule', []):
        print(f"Salary cap {entry['year']}: {entry['amount']}")

    for provision in terms.get('keyProvisions', []):
        print(f"  - {provision['title']}: {provision['summary']}")
Approach 2: Parse (Agent-Native Document Access)
Parse is for when you don't know the questions in advance. Instead of defining a schema, you give Bem the document and it creates a fully navigable knowledge layer: raw text by section, named entities, and relationships between them. Your agents access this layer through file-system-style operations.
Step 1: Parse the Document
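    # Create a parse function
    curl -X POST https://api.bem.ai/v3/functions \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "functionName": "doc-parser",
        "type": "parse",
        "displayName": "Document Parser"
      }'

    # Create a workflow that wraps the parse function
    curl -X POST https://api.bem.ai/v3/workflows \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "doc-parser",
        "mainNodeName": "doc-parser",
        "nodes": [{
          "name": "doc-parser",
          "function": { "name": "doc-parser" }
        }]
      }'

    # Send the CBA
    curl -X POST https://api.bem.ai/v3/workflows/doc-parser/call \
      -H "x-api-key: $BEM_API_KEY" \
      -F "wait=false" \
      -F "file=@nfl-cba-2020.pdf"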
Step 2: Navigate with File-System Operations
Once parsed, the document is accessible through operations your agents already understand: ls, cat, grep, find, stat. This is the same access pattern that coding agents like Claude Code use to navigate codebases. It's immediately familiar.
    # List all parsed documents
    curl -X POST https://api.bem.ai/v3/fs \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{ "operation": "ls" }'

    # Search for "salary cap" across the entire document
    curl -X POST https://api.bem.ai/v3/fs \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{ "operation": "grep", "query": "salary cap" }'

    # Get metadata: how many pages, sections, entities
    curl -X POST https://api.bem.ai/v3/fs \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{ "operation": "stat", "documentId": "..." }'

    # Find all named entities (organizations, people, financial terms)
    curl -X POST https://api.bem.ai/v3/fs \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{ "operation": "find" }'
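All four calls share one endpoint and differ only in the JSON body, so they collapse naturally into a single helper. The sketch below builds the same requests with Python's standard library and deliberately stops short of sending them. `fs_request` is a hypothetical name; the URL, header names, and body fields are taken from the curl examples, and the response shape is not assumed.

```python
import json
import urllib.request

FS_URL = "https://api.bem.ai/v3/fs"


def fs_request(api_key, operation, **params):
    """Build (but do not send) a POST request for one file-system operation."""
    body = {"operation": operation, **params}
    return urllib.request.Request(
        FS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = fs_request("test-key", "grep", query="salary cap")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To send it, pass req to urllib.request.urlopen() and parse the JSON response.
```

Keeping request construction separate from sending makes the helper easy to test and easy to hand to an agent as a single tool with an `operation` argument.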
Step 3: Let Your Agent Explore
The real power is when you connect this to an AI agent. The agent can traverse the document the same way a human lawyer would: start with the table of contents, zoom into relevant sections, cross-reference terms, and follow the document's own structure.
    from bem import Bem

    client = Bem()

    # Parse the document
    call = client.workflows.call(
        workflow_name="doc-parser",
        file_path="nfl-cba-2020.pdf",
        wait=True
    )

    # Agent-style exploration
    documents = client.fs.ls()
    print(f"Found {len(documents)} parsed documents")

    # Search for specific content
    results = client.fs.grep("holiday")
    for match in results:
        print(f"Found in section: {match['section']}")
        print(f"  Context: {match['snippet']}")

    # Get all entities (people, organizations, financial concepts)
    entities = client.fs.find()
    for entity in entities[:10]:
        print(f"  {entity['type']}: {entity['name']}")
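Following the document's own structure mostly means following its cross-references. As a rough illustration of that traversal, the sketch below walks "Article N" references breadth-first over a toy set of sections. The `ARTICLES` dictionary, its article numbers, and its text are placeholders invented for the example, not CBA language, and a real agent would fetch each section through the file-system operations rather than from a dictionary.

```python
import re
from collections import deque

# Hypothetical parsed sections keyed by article number (placeholder text).
ARTICLES = {
    1: "Definitions. 'Salary Cap' has the meaning given in Article 13.",
    12: "Revenue accounting. Benefits are described in Article 52.",
    13: "Salary Cap. Computed from revenue as defined in Article 12.",
    52: "Benefits. No further references.",
}

REF = re.compile(r"Article (\d+)")


def follow_references(start, max_sections=10):
    """Breadth-first walk over 'Article N' references, returning the visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue and len(order) < max_sections:
        number = queue.popleft()
        order.append(number)
        for ref in REF.findall(ARTICLES.get(number, "")):
            target = int(ref)
            if target in ARTICLES and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(follow_references(13))  # [13, 12, 52]
```

The `max_sections` cap is the important part in practice: it bounds how many sections, and therefore how many tokens, a single question can pull in.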
When to Use Which
The decision is straightforward:
Use Extract when:
- You know exactly what fields you need
- The same schema applies across many documents
- You need deterministic, auditable outputs
- You're automating a process (claims, invoices, rate confirmations)
- Volume matters: thousands of documents per day
Use Parse when:
- Questions are open-ended or unpredictable
- Users or agents need to explore documents interactively
- Cross-section reasoning is required
- You're building chatbots, search, or Q&A systems
- Context that spans pages can't be lost to chunking
Use both when:
- You want to parse a contract library for agent access AND extract specific fields into your ERP
- You need a searchable knowledge layer AND automated workflow triggers
- You want one pipeline to do both: Extract and Parse are chainable in a single workflow
Chaining Extract + Parse in a Single Workflow
Bem's workflow engine lets you compose both approaches. Parse the full document for knowledge access, then extract specific fields into your system of record. One API call, two outputs.
    curl -X POST https://api.bem.ai/v3/workflows \
      -H "x-api-key: $BEM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "full-contract-pipeline",
        "mainNodeName": "doc-parser",
        "nodes": [
          {
            "name": "doc-parser",
            "function": { "name": "doc-parser" }
          },
          {
            "name": "contract-terms",
            "function": { "name": "contract-terms" }
          }
        ],
        "edges": [
          { "from": "doc-parser", "to": "contract-terms" }
        ]
      }'

    # One call, both outputs
    curl -X POST https://api.bem.ai/v3/workflows/full-contract-pipeline/call \
      -H "x-api-key: $BEM_API_KEY" \
      -F "file=@nfl-cba-2020.pdf" \
      -F "wait=false"
The result: your agents can explore the full document through the file-system API, while your downstream systems receive clean, structured JSON. Both outputs are verified, auditable, and improve with every correction.
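Once workflows have more than one node, a misspelled node name in an edge is an easy mistake to make. A quick local check before posting the definition can catch it. This is a sketch: `check_workflow` is a hypothetical helper, and the rules it enforces (unique node names, `mainNodeName` and edge endpoints must name existing nodes) are reasonable assumptions about the workflow format, not documented API behavior.

```python
def check_workflow(defn):
    """Return a list of structural problems in a workflow definition (empty if none)."""
    names = [node["name"] for node in defn.get("nodes", [])]
    problems = []
    if len(names) != len(set(names)):
        problems.append("duplicate node names")
    if defn.get("mainNodeName") not in names:
        problems.append(f"mainNodeName {defn.get('mainNodeName')!r} is not a node")
    for edge in defn.get("edges", []):
        for end in ("from", "to"):
            if edge.get(end) not in names:
                problems.append(f"edge {end!r} points at unknown node {edge.get(end)!r}")
    return problems


# The chained pipeline from this section, as a Python dict.
pipeline = {
    "name": "full-contract-pipeline",
    "mainNodeName": "doc-parser",
    "nodes": [
        {"name": "doc-parser", "function": {"name": "doc-parser"}},
        {"name": "contract-terms", "function": {"name": "contract-terms"}},
    ],
    "edges": [{"from": "doc-parser", "to": "contract-terms"}],
}

print(check_workflow(pipeline))  # []

# Same pipeline with a typo in the edge target.
broken = {**pipeline, "edges": [{"from": "doc-parser", "to": "contract-term"}]}
print(check_workflow(broken))
```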
Getting Started
Install the SDK in your language of choice:
    # Python
    pip install bem-sdk

    # TypeScript / Node.js
    npm install bem-ai-sdk

    # Go
    go get github.com/bem-team/bem-go-sdk

    # C#
    dotnet add package Bem
Or use the CLI:
    # Install
    brew install bem-team/tools/bem

    # Process a document
    bem workflows call contract-analysis --input ./nfl-cba-2020.pdf
For agent-native workflows, add the MCP server to Claude, Cursor, or any MCP-compatible agent:
    claude mcp add bem -- npx -y bem-ai-sdk-mcp
Your agent can now call Bem directly. Ask it to parse a document, extract specific fields, or search across your entire document library.
The Bottom Line
The industry is moving past the "RAG vs. no-RAG" debate. Production teams need both structured extraction and agent-native document access. The question isn't which approach to use. It's whether your infrastructure supports both.
Bem is the production layer for unstructured data. One API, both approaches, verified outputs. Get started at bem.ai.

Written by
Antonio Bustamante
Apr 30, 2026 · Whitepaper

