May 21, 2026

We used bem to analyze SpaceX's 308-page S-1 filing

How to use bem to analyze an S-1 filing: from PDF to orbital GPUs.

Antonio Bustamante



We used the bem V3 API to read SpaceX's 308-page S-1 in 4 minutes and 45 seconds. Specifically: Parse, Split, Classify, and eight bounding-box-enabled Extract functions, wired into a single workflow. Below are the highlights, and at the end of the post, exactly how we did it.

SpaceX filed its S-1 on the evening of May 20, 2026: 308 pages, 3.5 MB of PDF, 12 MB of underlying HTML. Three operating segments. A dual-class voting structure. The largest TAM disclosure in public-markets history. The morning after, we pointed bem at the filing. Eight specialist Extract functions (every one with enableBoundingBoxes: true and preCount: true). A Parse function for the section graph. A Classify function with nine branches to route each section to the right specialist. A Split function with fifteen section classes for chunking. One workflow (spacex-s1-analyst) wiring it all together. Under five minutes of compute across the eight section calls. 172 per-field bounding boxes anchoring every extracted value back to a (page, left, top, width, height) rectangle in the source PDF. Every number you'll read below is anchored to a coordinate in the actual filing.

The findings come first, in roughly the order an analyst would surface them. The technical walkthrough sits at the end: every API call, the workflow shape, the classifier branches, the bounding-box format, the schemas.

Four pages of the SpaceX S-1: cover, The Offering, Capitalization, Consolidated Statements of Operations

Four of the pages we ran through bem. Cover, The Offering, Capitalization, and the audited Consolidated Statements of Operations.

1. SpaceX has signed Anthropic to a $1.25 billion-per-month cloud contract

The single most important new disclosure in this S-1 is buried in the prospectus summary and the subsequent-events note. On May 3, 2026, SpaceX entered into Cloud Services Agreements with Anthropic PBC for compute capacity across the COLOSSUS and COLOSSUS II clusters.

The headline term: $1.25 billion per month through May 2029, with capacity ramping in May and June 2026 at a reduced fee, terminable by either party on 90 days' notice. At the headline run-rate, that is $15B per year and a maximum $45B contract value, against a counterparty whose own commercial trajectory is one of the steepest in software history.
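The headline arithmetic is easy to reproduce. A minimal sketch, assuming a flat $1.25B fee for all 36 months from May 2026 through May 2029; the reduced-fee ramp months are not modeled, so the total is an upper bound rather than an expected value:

```python
# Back-of-envelope check of the Anthropic contract figures quoted above.
# Assumption: a flat fee every month; the reduced-fee ramp in May and June 2026
# is ignored, so this is the maximum value, not a forecast.
MONTHLY_FEE_BN = 1.25   # $ billions per month (from the filing)
TERM_MONTHS = 36        # assumed term: May 2026 through May 2029

annual_run_rate_bn = MONTHLY_FEE_BN * 12
max_contract_value_bn = MONTHLY_FEE_BN * TERM_MONTHS
# Fees covered by a 90-day notice period, treated here as three months.
notice_period_fees_bn = MONTHLY_FEE_BN * 3

print(f"annual run-rate: ${annual_run_rate_bn:.2f}B")
print(f"maximum contract value: ${max_contract_value_bn:.2f}B")
print(f"fees over a 90-day notice period: ${notice_period_fees_bn:.2f}B")
```

The last figure is the useful one for modeling: with a 90-day termination right on both sides, the committed revenue at any point is closer to $3.75B than to $45B.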

This is the contract that explains everything else in the AI segment. The 4-to-1 capex-to-revenue ratio on AI compute, the $12.7B of FY25 AI capex against $3.2B of segment revenue, the COLOSSUS II buildout: all of it is being underwritten in advance by the Anthropic agreement, with Grok training and other internal AI use absorbing the residual capacity.

For an analyst, the questions this raises are concrete: is there a take-or-pay floor, what are the SLA penalties, what does the 90-day-out look like on either side, and how does this affect the contracted-revenue line in the next amendment. The S-1 names the counterparty and the dollar amount; everything downstream is a model input.

2. Musk's new compensation package vests from $500B to $7.5T in market cap, with a Mars colony rider

On January 13, 2026, the SpaceX board approved a grant to Elon Musk of 1 billion performance-based restricted shares of Class B common stock (10 votes per share). The award vests across 15 equal tranches of roughly 66.67M shares each, conditioned on two requirements both being met for each tranche:

A market-capitalization milestone. The schedule runs from $500B at tranche 1 to $7.5T at tranche 15, in $500B steps.
The Company's establishment of "a permanent human colony on Mars with at least one million inhabitants."

In connection with the xAI Merger that closed February 2, 2026, SpaceX also assumed a separate performance award originally granted to Musk by xAI in November 2025. That award entitles Musk to Class A shares equal to 0.20% of fully diluted capitalization at each of 12 valuation milestones ranging from $1.065 trillion to $6.565 trillion, in $500B increments. The first valuation milestone was achieved prior to the merger and Musk was issued 25,172,695 shares of Class A common stock at closing.

Two analyst takeaways. First, the combined post-IPO equity arc puts founder dilution on a known and disclosed schedule: roughly 2.4% of the company across all twelve milestones of the xAI award (0.20% × 12, the first of which has already been achieved), plus an additional 1B Class B shares on the SpaceX side. Second, the Mars-colony rider has no analog in the public-markets compensation set. It is not a soft KPI; it is a vesting condition.
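Both milestone ladders can be rebuilt from the figures quoted above, which is a quick way to sanity-check the tranche counts and the 2.4% total. A small sketch using only numbers from this post (values in $ billions unless noted):

```python
# SpaceX 2026 award: 15 tranches, $500B to $7.5T in $500B steps.
spacex_milestones_bn = [500 * n for n in range(1, 16)]
shares_per_tranche_m = 1_000 / 15   # 1B shares split evenly, in millions

# xAI award (taken over by SpaceX in the merger): 12 milestones,
# $1.065T to $6.565T in $500B steps, each worth 0.20% of fully diluted cap.
xai_milestones_bn = [1_065 + 500 * n for n in range(12)]
xai_total_pct = 0.20 * len(xai_milestones_bn)

print(spacex_milestones_bn[0], spacex_milestones_bn[-1])
print(round(shares_per_tranche_m, 2))
print(xai_milestones_bn[0], xai_milestones_bn[-1])
print(round(xai_total_pct, 2))
```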

3. Orbital AI compute satellites: deployment begins as early as 2028

The most consequential strategic disclosure in the Business section is the orbital data center program. SpaceX intends to build "constellations, with potentially millions of satellites, for orbital data centers" using a new class of orbital AI compute satellites deployed in Sun-synchronous orbit and linked back to Earth via Starlink.

The rationale, in the filing's own language: handling "energy-intensive AI workloads, such as inference demand, at far greater scale and efficiency than terrestrial alternatives." First deployments as early as 2028. This is a vertical that does not exist today and which SpaceX is uniquely positioned to attempt: a launch-vehicle company that operates a terrestrial AI compute business at gigawatt scale and a satellite operator with ~9,600 satellites already on-orbit (as of March 31, 2026) and 10.3 million subscribers across 164 countries.

The TAM SpaceX assigns to this is, candidly, qualitative; the filing does not put a dollar figure on orbital AI compute. The way to model it: each marginal terrestrial gigawatt of AI compute is currently bounded by power and cooling availability; if even a single-digit-percent share of long-tail inference workloads moves orbital, the implied launch cadence and constellation scale become the bottleneck, and SpaceX is the only company today with the manufacturing throughput and reusable launch economics to address it.

4. The capex story

This is now a capex-dominated company. From the MD&A:

Period	Space	Connectivity	AI	Total
FY2025	$3,832M	$4,178M	$12,727M	$20,737M
Q1 2026	$1,052M	$1,332M	$7,723M	$10,107M

The Q1 2026 run-rate alone is $10.1B in three months, with AI carrying $7.7B of it. To put that in context: AI capex in Q1 2026 ($7.7B) nearly matched the combined Space + Connectivity capex for the entire year of 2025 ($8.0B). This is what a hyperscaler ramp looks like at the asset level.
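The ratios behind the capex table can be recomputed directly from its cells. A short sketch, with every input copied from the table and nothing else assumed:

```python
# Capex by segment, $ millions, copied from the MD&A table above.
capex = {
    "FY2025":  {"Space": 3_832, "Connectivity": 4_178, "AI": 12_727},
    "Q1 2026": {"Space": 1_052, "Connectivity": 1_332, "AI": 7_723},
}

totals = {period: sum(seg.values()) for period, seg in capex.items()}
ai_share = {period: seg["AI"] / totals[period] for period, seg in capex.items()}

# Q1 2026 AI capex relative to FY2025 Space + Connectivity capex combined.
non_ai_fy25 = capex["FY2025"]["Space"] + capex["FY2025"]["Connectivity"]
q1_ai_vs_fy25_non_ai = capex["Q1 2026"]["AI"] / non_ai_fy25

for period in capex:
    print(f"{period}: total ${totals[period]:,}M, AI share {ai_share[period]:.1%}")
print(f"Q1 2026 AI capex / FY2025 non-AI capex: {q1_ai_vs_fy25_non_ai:.2f}")
```

AI's share of total capex moves from roughly 61% in FY2025 to roughly 76% in Q1 2026, which is the ramp in a single number.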

Cross-reference to a subsequent event in the financial statements: SpaceX has agreed to acquire turbine assets for approximately $2,000 million ("the Turbine Acquisition"), explicitly disclosed as a purchase to "help provide power to the Company's data centers." The vertical integration of compute reaches power generation now. Closing expected May 2026.

The funding side: a $20.0 billion SpaceX Bridge Loan sits on the pre-IPO balance sheet, maturing September 2, 2027. There is also a $5B amended revolving credit facility (undrawn) maturing 2031, and $9.1B of "Other Financings" footnoted as failed sale-leaseback obligations related to AI infrastructure assets. Total long-term debt at March 31, 2026: $29.1B. The bridge maturity sets the clock on the IPO and any post-IPO debt refinancing.

The revolving credit facility ($5B amended capacity, originally $1.5B in February 2025) accrues at SOFR + 0.75-1.25% depending on the Company's debt rating, with multi-currency draw options (SONIA for GBP, EURIBOR for EUR). The Bridge Loan term sheet is footnoted but the rate is not disclosed in the body of the prospectus. That detail will move into focus as the deal team prices the IPO and the refinancing path crystallizes.

Capitalization table with bounding boxes around bridge loan, total long-term debt, and the equity classes

Capitalization as of March 31, 2026. The Bridge Loan, total long-term debt, redeemable preferred, and four classes of common stock are all anchored on the page.

The items that follow are what a sell-side analyst or a credit committee would pull out of the back of the filing. They sit in the auditor's report, the Notes to Consolidated Financial Statements, and the Related Party Transactions section. None of them changes a multiple on its own; they shape every multiple that gets built on top.

The auditor is PricewaterhouseCoopers LLP. Los Angeles office. Auditor of record since 2012. The opinion is unqualified. The audit report was issued March 30, 2026 and re-issued May 7, 2026 for the effects of a stock split and a change in reportable segments.

One Critical Audit Matter (CAM). PwC flags exactly one: "Revenue Recognition: Estimate of Total Cost at Completion for Certain Contracts Recognized Over Time." The CAM applies to a portion of the $4.1B Space-segment revenue and $11.4B Connectivity-segment revenue recognized over time using the cost-to-cost input method. The judgment risk PwC calls out: launch timing, allocation of shared costs across reusable launch vehicles, satellite material costs, and expected technological changes. For an analyst, this is the line that says: management discretion is highest in the launch-revenue and satellite-manufacturing-revenue recognition lines, and the auditor's procedures focus there. No going-concern paragraph, no other CAMs.

The $66 million of related-party interest in FY2025 has a specific counterparty. On October 12, 2025, CTC (a subsidiary of xAI and an indirect subsidiary of SpaceX) entered into an equipment lease agreement with Valor Equity Partners for AI infrastructure hardware. Valor's founder, CEO and Chief Investment Officer is Antonio J. Gracias, a director of SpaceX. The lease was determined to be a failed sale-leaseback, recorded as $4,507M of debt on the balance sheet ($455M current + $4,052M long-term) at December 31, 2025. The $66M related-party interest in FY2025 is the interest on this lease. That single transaction accounts for roughly half of the $9,105M "Other Financings" line in the capitalization table.

The Tesla related-party purchases are not small. $506M of Megapack products in FY2025 ($191M in FY2024), plus $131M of Cybertrucks at MSRP in FY2025. These are recorded in Property, Plant and Equipment. The Megapack ramp tracks the AI capex ramp; the Cybertrucks are a smaller note. Both are priced at arm's length but carry related-party disclosure.

Adjusted EBITDA is a non-GAAP measure with a published bridge. The filing's headline non-GAAP figure of $6,584M of FY25 Adjusted EBITDA does not reconcile to the $(2,589)M GAAP operating loss without significant add-backs: depreciation and amortization (the headline reconciling item given the $20.7B FY25 capex base), stock-based compensation expense (a meaningfully larger number than in prior years given the SpaceX 2026 grant and the assumed xAI award), restructuring ($487M FY25), and impairment ($38M FY25). The reconciliation table itself is in the MD&A's Non-GAAP Financial Measures subsection. We are running an MD&A Extract against it next; until then, treat $6,584M as the issuer's number and the GAAP operating loss of $(2,589)M as the apples-to-apples line.

Free cash flow math, back-of-envelope. Combine FY25 Cash from Operations (which we have not pulled here; it sits in the Consolidated Statements of Cash Flows on PDF page ~236 and is the natural next Extract) with the FY25 capex of $20.7B. The Q1 2026 stub alone is $10.1B of capex against a $1.1B Adjusted EBITDA quarter and a deeper GAAP loss. The funding gap is being filled by some combination of (a) the $20B Bridge Loan, (b) the failed-sale-leaseback structures with Valor and others, and (c) the Anthropic contract pre-billings on the COLOSSUS II ramp. Quantifying the cash-flow waterfall is the natural follow-up Extract.

6. The audited income statement

Three years of audited results, from the Consolidated Statements of Operations on PDF page 234. All figures in millions of USD except per-share data.

Line item	FY2025	FY2024	FY2023
Revenue	$18,674	$14,015	$10,387
Cost of revenue	(9,451)	(7,996)	(6,110)
Research and development	(8,643)	(3,464)	(2,105)
Selling, general and administrative	(2,644)	(1,813)	(1,665)
Restructuring	(487)	(213)	(237)
Impairment	(38)	(63)	(3,775)
Income (loss) from operations	(2,589)	466	(3,505)
Interest expense	(1,945)	(1,580)	(1,693)
Other income (expense), net	(177)	985	(42)
Income tax provision (benefit)	718	(549)	(363)
Net income (loss)	$(4,937)	$791	$(4,628)
EPS basic	$(1.69)	$0.01	$(1.68)
Weighted-avg shares basic (millions)	2,926	2,848	2,759

Six things an analyst will isolate immediately:

Revenue compounding at 34% per year. $10.4B → $14.0B → $18.7B. This is among the cleanest growth profiles in the public-markets pipeline at this revenue scale.
The R&D step-up is the story of 2025. R&D went from $3.5B (2024) to $8.6B (2025). That is the AI segment build hitting the income statement in earnest, on top of the continued Starship program ($3.0B of FY25 R&D was Starship-specific per the MD&A). It is the single largest reason operating income swung from a $466M profit in 2024 to a $(2,589)M loss in 2025.
2023 carried a $3,775M impairment. Without that charge, 2023 operating income would have been roughly $270M positive. The impairment is consistent with the legacy X (formerly Twitter) writedown timing and isolates 2023's loss as a one-time accounting reality rather than an operating story.
The 2024 "other income, net" of $985M is the digital-asset reclassification effect. The auditor's opinion flags this directly: "the Company changed the manner in which it accounts for digital assets in 2024." The shift to fair-value treatment on the bitcoin treasury position shows up here. The unit count and unrealized gain breakdown are in Note 2.
The 2025 tax provision of $718M on a $4.2B pre-tax loss is unusual. The likely explanations are non-deductible items at scale (such as impairments or related-party interest), a foreign-jurisdiction split, or a valuation-allowance change on deferred tax assets. The reconciliation is in the tax footnote.
These are restated, post-xAI-Merger combined financials, not standalone SpaceX. The xAI Merger closed February 2, 2026, but because Musk had controlling voting interest in SpaceX, xAI, and X simultaneously, the transaction was accounted for as a reorganization of entities under common control. No goodwill, no new intangibles. The combined entity's historical financials are reported retrospectively at carrying amounts for all periods presented. That means the FY2024 $14.0B revenue and the FY2023 $10.4B revenue both include retroactive consolidation of xAI and X. The 2023 impairment of $3,775M is the Twitter/X writedown captured retroactively. Anyone modeling SpaceX standalone needs to back this out of the comparable years.
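Several of the claims above are plain arithmetic on the income statement, so they are cheap to verify once the rows are extracted. A minimal sketch that re-derives operating income, the two-year revenue growth rate, and the FY2023 result without the impairment, using only the table's figures:

```python
# Income statement lines, $ millions, copied from the table above.
# Order of each tuple: (FY2025, FY2024, FY2023).
revenue     = (18_674, 14_015, 10_387)
cost_of_rev = (9_451, 7_996, 6_110)
r_and_d     = (8_643, 3_464, 2_105)
sg_and_a    = (2_644, 1_813, 1_665)
restructure = (487, 213, 237)
impairment  = (38, 63, 3_775)
reported_op = (-2_589, 466, -3_505)

# Operating income should equal revenue minus the five expense lines.
computed_op = tuple(
    rev - sum(costs)
    for rev, *costs in zip(revenue, cost_of_rev, r_and_d, sg_and_a,
                           restructure, impairment)
)

# Two-year compound annual growth rate, FY2023 to FY2025.
cagr = (revenue[0] / revenue[2]) ** 0.5 - 1

# FY2023 operating income with the impairment charge added back.
op_2023_ex_impairment = reported_op[2] + impairment[2]

print("operating income ties out:", computed_op == reported_op)
print(f"revenue CAGR: {cagr:.1%}")
print(f"FY2023 operating income ex-impairment: ${op_2023_ex_impairment}M")
```

The same tie-out pattern works as an automated check on any extracted statement: if the subtotal does not equal the sum of its parts, either a row was missed or a sign was misread.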

Consolidated Statement of Operations with bounding boxes around revenue, R&D, operating loss, net loss, and EPS for FY25/24/23

Three years of audited operating results. The R&D ramp from $3.5B to $8.6B is the single most consequential line on the page; the bounding box anchors the underlying cell.

7. The segment picture

From the MD&A, three operating segments:

Space. FY25 revenue $4,086M, segment Adjusted EBITDA $653M. Includes Falcon 9, Falcon Heavy, Starship development, Starshield, and government launch contracts. Starship-specific R&D was $3,004M in FY25 alone.
Connectivity. FY25 revenue $11,387M, segment Adjusted EBITDA $7,168M, with year-over-year growth of +49.8% in revenue, +120.4% in operating income, and +86.2% in segment Adjusted EBITDA. ~9,600 satellites on-orbit, ~10.3M subscribers across 164 countries as of March 31, 2026. This is the cash engine.
AI. FY25 revenue $3,201M, segment operating loss $(6,355)M, segment Adjusted EBITDA $(1,237)M. Includes COLOSSUS, COLOSSUS II, Grok, X, Macrohard (agentic AI platform jointly developed with Tesla), and Terafab (chip-manufacturing partnership with Tesla and Intel, with a "general framework" disclosed but no defined timelines or capital commitments yet).

The narrative restructure is significant. The headline of "SpaceX = launches" stopped being load-bearing in 2025. Connectivity carries roughly 61% of FY25 revenue ($11.4B of $18.7B) and provides essentially all of the positive operating income. Space is at-or-near operating breakeven net of Starship R&D. AI is the growth investment, and the cash flow being deployed there is the largest dollar figure in the filing.

8. The TAM disclosure

The Business section quantifies a total addressable market of $28.5 trillion, segmented across Space ($370B from Novaspace, plus an unquantified lunar economy upside), Connectivity (consumer broadband, enterprise solutions, government, Starlink Mobile across ~30 MNO partnerships in ~30 countries), and AI. SpaceX names two of its end-state ambitions as the next two trillion-dollar markets: Starlink as the first one already underway, and AI compute (terrestrial today, orbital starting 2028) as the second.

Whatever the discount applied to the topline TAM figure, it sets the management-team frame for capital allocation. Multi-trillion-dollar adjacent markets are explicitly enumerated in the long-term strategy section:

Point-to-point Earth transport via Starship. Long-haul terrestrial travel between major cities at a fraction of current transit times. Disclosed as part of the long-term Starship-enabled opportunity set.
Space tourism. Expected to grow as the build-out of orbital flight infrastructure continues.
In-orbit manufacturing. Microgravity-based production of pharmaceuticals, materials, and components.
Energy production on the Moon and Mars. Solar generation at scale.
Asteroid mining. Metals and critical-resources extraction from near-Earth and main-belt asteroids.
Manufacturing on the Moon and Mars. Construction materials and fuel production from local resources.

The TAM number is what it is; what matters more is that these markets appear in the long-term strategy section of an S-1 at all. That placement makes them part of the issuer's articulated growth story, not investor narrative.

9. Capital structure and the dual-class architecture

From the Capitalization table extract, as of March 31, 2026:

$20,000M SpaceX Bridge Loan (matures Sep 2, 2027)
$9,105M Other Financings (failed sale-leaseback obligations on AI infrastructure)
$27M X 2027 and X 2030 Notes (the residual xAI/X debt)
Total long-term debt: $29,111M
$7,049M of redeemable convertible preferred stock (converts on offering)
Four classes of common stock pre-listing (Classes A, B, C, D). Class C is reclassified out on the offering; Class D is authorized for future use.

Pro forma post-conversion, the common share count is approximately 12.5 billion shares (Class A: 6.82B issued/outstanding; Class B: 5.70B). The voting structure is 1 vote per Class A share and 10 votes per Class B share, with Class B convertible into Class A at the holder's option, and automatic conversion of Class B on transfer outside the permitted-holder set. The filing notes that SpaceX will be a controlled company under Nasdaq rules following the offering.

The dilution from preferred conversion alone is material. Class A shares-of-record go from 2,882M (actual, March 31, 2026) to 6,825M (pro forma post-conversion of the $7,049M redeemable convertible preferred, the Class C reclassification, and the related mechanics). That is a +137% increase in Class A shares outstanding before the IPO adds a single primary share. Class B goes from 2,421M actual to 5,696M pro forma. For modeling purposes, the right pro forma share count to anchor against is the 12.5B combined-class total, not the 5.3B actual.
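The share-count arithmetic, plus the voting split it implies, can be worked through in a few lines. A sketch under one stated assumption: the voting-power figure uses only the pro forma counts quoted above, before any primary shares sold in the offering, so it is an illustration rather than a number from the filing:

```python
# Share counts in millions, from the Capitalization figures quoted above.
actual    = {"Class A": 2_882, "Class B": 2_421}
pro_forma = {"Class A": 6_825, "Class B": 5_696}

actual_total = sum(actual.values())
pro_forma_total = sum(pro_forma.values())
increase = {cls: pro_forma[cls] / actual[cls] - 1 for cls in actual}

# Voting power under 1 vote per Class A share and 10 votes per Class B share.
# Assumption: pre-offering pro forma counts only; IPO shares are not included.
votes_a = pro_forma["Class A"] * 1
votes_b = pro_forma["Class B"] * 10
class_b_vote_share = votes_b / (votes_a + votes_b)

print(f"actual total: {actual_total:,}M, pro forma total: {pro_forma_total:,}M")
for cls, pct in increase.items():
    print(f"{cls}: +{pct:.1%}")
print(f"Class B share of votes (pre-offering): {class_b_vote_share:.1%}")
```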

Cover page with bounding boxes around extracted issuer and counsel fields

Filing-front facts: registrant name, Texas state of incorporation, SIC code, IRS EIN, principal executive offices, agent for service, and counsel for both sides of the deal. Gibson Dunn for the issuer, Davis Polk for the underwriters.

The Texas incorporation is itself a recent operative fact: SpaceX has historically been a Delaware corporation. The S-1 confirms the redomicile to Texas was complete as of the filing date, which moves the applicable corporate law framework from DGCL to TBOC. For governance-comparable analysis, this is a meaningful precedent shift.

10. The risk picture

The Risk Factors section runs from page 40 onward. From the first eleven pages, bem extracted 20 risk headlines and bucketed them by category:

Category	Count
Operational	10
Regulatory	4
Cybersecurity	1
Macro	1
Competitive	1
Key-person	1
Litigation	1
Financial	1

Heavy on Operational (Starship cadence and reusability, satellite collisions, manufacturing throughput, launch delays and failures, ground-station and data-center continuity), with Regulatory a distant second (FCC and ITU spectrum, AI and privacy law in the U.S. and abroad, satellite licensing across jurisdictions). The headline competitor list named in the AI risk section is direct: OpenAI, Anthropic, Google, Meta, Microsoft, and open-source model providers on the model side; Threads, Reddit, and TikTok on the platform side.

Risk Factors section with bounding boxes around the top risk headlines

A sample of the bolded risk headlines, anchored to their position on the page. Categorized, this becomes a comparable per-issuer risk matrix.

11. How we did it

The findings above were produced by a single workflow on the bem V3 API: eleven functions in total (eight bbox-enabled Extracts, one Parse, one Classify, one Split), wired into one workflow with a classifier root and a fallback path for the sections the typed extractors don't cover. This section walks through what the pipeline looks like end-to-end so you can build the same thing.

The pipeline

bash

                       ┌────────────────────────────────┐
                       │   s1-section-classifier        │
                       │   (Classify, 9 branches)       │
                       └────────────────┬───────────────┘
                                        │ classify a section, route by type
        ┌───────────────────────────────┼─────────────────────────────────┐
        │              │                │                 │               │
   issuer_cover   the_offering    capitalization    income_statement  ... 9 branches
        │              │                │                 │               │
        ▼              ▼                ▼                 ▼               ▼
 s1-issuer-      s1-offering-     s1-capitalization- s1-income-       s1-parse
   extract         extract          extract           statement-       (agentic
                                                      extract          fallback)
 (Extract,       (Extract,        (Extract,         (Extract,
  bbox=on,        bbox=on,         bbox=on,          bbox=on,
  preCount=on)    preCount=on)     preCount=on)      preCount=on)

The classifier sits at the workflow root. For each incoming section PDF, it picks a single branch based on the section's content and the branch descriptions, then forwards the document to the matching specialist Extract. Extracts return structured JSON plus per-field bounding boxes. Anything that doesn't match a typed branch falls through to the agentic Parse function via the isErrorFallback: true branch.

One function definition, fully shaped

Here is what an Extract definition looks like on the V3 API. The Audited Income Statement function, in full:

python

# upsert_function is a helper that is not shown in this post.
# "{...}" marks property definitions elided in the original listing.
upsert_function("s1-income-statement-extract", {
    "functionName": "s1-income-statement-extract",
    "type": "extract",
    "displayName": "S-1: Audited Income Statement",
    "tags": ["spacex-s1", "blog-analyst-toolkit"],
    "outputSchemaName": "S1IncomeStatement",
    "outputSchema": {
        "type": "object",
        "description": "Consolidated Statement of Operations line items, audited. Capture FY2025, FY2024, FY2023.",
        "properties": {
            "periodsCovered": {"type": "array", "items": {"type": "string"}},
            "currencyUnit":   {"type": "string"},
            "revenue":                  {"type": "object", "properties": {"fy2025": {"type": "string"}, "fy2024": {"type": "string"}, "fy2023": {"type": "string"}}},
            "costOfRevenue":            {"type": "object", "properties": {...}},
            "researchAndDevelopment":   {"type": "object", "properties": {...}},
            "operatingIncomeLoss":      {"type": "object", "properties": {...}},
            "netIncomeLoss":            {"type": "object", "properties": {...}},
            "epsBasic":                 {"type": "object", "properties": {...}},
            "weightedAverageSharesBasic":   {"type": "object", "properties": {...}},
            "weightedAverageSharesDiluted": {"type": "object", "properties": {...}}
        }
    },
    "enableBoundingBoxes": True,    # ← per-field provenance
    "preCount": True,                # ← keep the vision model from stopping early on long docs
})

Two flags do most of the work:

enableBoundingBoxes: True switches the function to bem's bounding-box-aware vision model. Every field the model produces also comes with a list of rectangles on the source page (multiple if the field is sourced from more than one cell).
preCount: True forces the model to enumerate input pages before it starts extracting. On a typical S-1 page that's a few hundred milliseconds of extra latency; on a long, scanned PDF it materially improves recall and prevents the model from stopping at page 3.

One classifier branch description, fully shaped

The classify function is where the routing intelligence lives. Each branch description has to be positive-signals-only. The model classifies on "what this section looks like," never on negation. Here is the Capitalization branch verbatim:

json

{
  "name": "capitalization",
  "functionName": "s1-capitalization-extract",
  "description": "The Capitalization table section. Distinguishing features: a 'CAPITALIZATION' header, a tabular layout with 'Actual' and 'As Adjusted' columns, line items including 'Long-term debt', 'Redeemable convertible preferred stock', 'Common stock', 'Additional paid-in capital', 'Accumulated deficit', 'Total shareholders' equity', balance-sheet-date language."
}

The other branches follow the same shape. The agentic fallback branch carries isErrorFallback: true and is wired to the Parse function: anything that doesn't match a typed extractor goes there, returning a section graph and entity list that an analyst (or a downstream agent) can browse.
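One way to enforce the positive-signals-only rule before upserting a classifier is a small lint over the branch descriptions. A minimal sketch: the word list and the `negations_in` helper are hypothetical and not part of the bem API; the branch itself is the Capitalization branch quoted above.

```python
import re

# The Capitalization branch, as quoted above.
capitalization_branch = {
    "name": "capitalization",
    "functionName": "s1-capitalization-extract",
    "description": (
        "The Capitalization table section. Distinguishing features: a "
        "'CAPITALIZATION' header, a tabular layout with 'Actual' and "
        "'As Adjusted' columns, line items including 'Long-term debt', "
        "'Redeemable convertible preferred stock', 'Common stock', "
        "'Additional paid-in capital', 'Accumulated deficit', "
        "'Total shareholders' equity', balance-sheet-date language."
    ),
}

# Hypothetical negation word list; tune it for your own descriptions.
NEGATION_PATTERN = re.compile(
    r"\b(not|never|no|without|except|excluding|unlike)\b|n't\b",
    re.IGNORECASE,
)

def negations_in(description: str) -> list[str]:
    """Return the negation words found in a branch description."""
    return [m.group(0).lower() for m in NEGATION_PATTERN.finditer(description)]

print(negations_in(capitalization_branch["description"]))
print(negations_in("Any section that is not the cover page."))
```

A description that trips the lint is usually a sign the branch is being defined by what it excludes; rewriting it around headers, column names, and line items the section does contain tends to be the fix.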

The actual API call

Calls go through workflows, not individual functions. To run a section through the pipeline:

python

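import base64
import json
import os
import time
import urllib.request

# Assumptions for a runnable example: the API key lives in a BEM_API_KEY
# environment variable and the section PDF is a local file with this name.
KEY = os.environ["BEM_API_KEY"]
with open("spacex-s1-cover.pdf", "rb") as f:
    pdf_bytes = f.read()

body = {
    "callReferenceID": f"blog-spacex-s1-cover-{int(time.time())}",
    "input": {
        "singleFile": {
            "inputType": "pdf",
            "inputContent": base64.b64encode(pdf_bytes).decode(),
        }
    },
}
req = urllib.request.Request(
    "https://api.bem.ai/v3/workflows/spacex-s1-analyst/call?wait=true",
    data=json.dumps(body).encode(),
    headers={"x-api-key": KEY, "Content-Type": "application/json"},
    method="POST",
)
resp = json.loads(urllib.request.urlopen(req, timeout=300).read())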

?wait=true blocks for up to ~60 seconds. For longer pipelines you'll get back status: "running"; poll /v3/calls/{callID} every few seconds until the call reaches completed, failed, or errored. The full event trace (including the classify decision, every Extract output, and field-level confidences) is at /v3/calls/{callID}/trace.
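The polling loop described above might look like the following. This is a sketch, not the official client: the endpoint path and the three terminal statuses come from this post, while the top-level `status` key in the response, the three-second interval, and the ten-minute ceiling are assumptions. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

TERMINAL_STATUSES = {"completed", "failed", "errored"}

def fetch_call(call_id: str, api_key: str) -> dict:
    """GET /v3/calls/{callID}. The response shape is an assumption."""
    req = urllib.request.Request(
        f"https://api.bem.ai/v3/calls/{call_id}",
        headers={"x-api-key": api_key},
        method="GET",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

def poll_call(call_id, api_key, fetch=fetch_call, interval_s=3.0,
              max_wait_s=600.0, sleep=time.sleep):
    """Poll until the call reaches a terminal status or max_wait_s elapses."""
    waited = 0.0
    while True:
        call = fetch(call_id, api_key)
        if call.get("status") in TERMINAL_STATUSES:
            return call
        if waited >= max_wait_s:
            raise TimeoutError(
                f"call {call_id} still {call.get('status')!r} after {waited:.0f}s"
            )
        sleep(interval_s)
        waited += interval_s
```

Usage is `poll_call(call_id, KEY)` with whatever call identifier the initial response carried; a `failed` or `errored` result is returned rather than raised so the caller can inspect it.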

What an extract event actually returns

Each Extract event in the call output contains the transformedContent (the JSON shaped to your schema), a fieldBoundingBoxes object (RFC 6901 JSON Pointer paths mapped to coordinate lists), a fieldConfidences object (per-field confidence in [0, 1]), and an `avgConfidence` rollup. The bounding-box object is the analyst-relevant primitive:

json

"fieldBoundingBoxes": {
  "/registrant/name":   [{"page": 1, "top": 0.233, "left": 0.356, "width": 0.284, "height": 0.013}],
  "/registrant/stateOfIncorporation":
                        [{"page": 1, "top": 0.272, "left": 0.118, "width": 0.020, "height": 0.005}],
  "/registrant/sicCode":[{"page": 1, "top": 0.273, "left": 0.490, "width": 0.019, "height": 0.007}],
  "/agentForService/name":
                        [{"page": 1, "top": 0.346, "left": 0.482, "width": 0.041, "height": 0.005}],
  "/issuerCounsel/0":   [{"page": 1, "top": 0.443, "left": 0.225, "width": 0.118, "height": 0.007}],
  "/underwriterCounsel/0":
                        [{"page": 1, "top": 0.443, "left": 0.668, "width": 0.122, "height": 0.007}],
  "/filingDate":        [{"page": 1, "top": 0.106, "left": 0.595, "width": 0.046, "height": 0.007}]
}

Coordinates are normalized to [0, 1] on each page so they survive any downstream rasterization at arbitrary DPI. RFC 6901 JSON Pointer paths address into arrays (`/issuerCounsel/0`) and nested objects (`/registrant/sicCode`) the same way; an automated audit tool can walk the extracted JSON and the bounding-box map in lockstep.

That is the primitive that lets a downstream pipeline carry the citation through to the consumer. The dataframe row is (field, value, confidence, page, left, top, width, height). The model output and the source document never get decoupled.
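Walking the extracted JSON and the bounding-box map in lockstep takes only a small JSON Pointer resolver. A minimal sketch, assuming `fieldConfidences` is keyed by the same pointer paths as `fieldBoundingBoxes` (the post does not show its keys); the sample event reuses two boxes from the listing above, with field values taken from the cover-page facts in this post.

```python
def resolve_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError(f"not a JSON Pointer: {pointer!r}")
    node = document
    for raw in pointer[1:].split("/"):
        token = raw.replace("~1", "/").replace("~0", "~")  # RFC 6901 unescape order
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node

def citation_rows(event: dict) -> list[tuple]:
    """Flatten one Extract event into
    (field, value, confidence, page, left, top, width, height) rows."""
    content = event["transformedContent"]
    confidences = event.get("fieldConfidences", {})  # assumed pointer-keyed
    rows = []
    for pointer, boxes in event["fieldBoundingBoxes"].items():
        value = resolve_pointer(content, pointer)
        for box in boxes:
            rows.append((pointer, value, confidences.get(pointer), box["page"],
                         box["left"], box["top"], box["width"], box["height"]))
    return rows

def to_pixels(box: dict, page_width_px: int, page_height_px: int) -> tuple:
    """Scale a normalized box to (left, top, width, height) in pixels."""
    return (round(box["left"] * page_width_px), round(box["top"] * page_height_px),
            round(box["width"] * page_width_px), round(box["height"] * page_height_px))

sample_event = {
    "transformedContent": {
        "registrant": {"stateOfIncorporation": "Texas"},
        "issuerCounsel": ["Gibson Dunn"],
    },
    "fieldBoundingBoxes": {
        "/registrant/stateOfIncorporation":
            [{"page": 1, "top": 0.272, "left": 0.118, "width": 0.020, "height": 0.005}],
        "/issuerCounsel/0":
            [{"page": 1, "top": 0.443, "left": 0.225, "width": 0.118, "height": 0.007}],
    },
}

rows = citation_rows(sample_event)
for row in rows:
    print(row)
```

A field sourced from several cells yields one row per rectangle, so the row count can exceed the field count; group by the pointer column to get back to one record per field.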

Offering subsection with bounding boxes around the Class A/B share class fields, voting power, and listing exchange

The Offering page bounding boxes from `s1-offering-extract`. Field labels overlaid on the source PDF; the structured JSON and the source coordinates travel together.

What the run actually looked like

Eight section calls, one Parse call, plus a couple of direct function calls that bypassed the classifier for sections whose content sat at a category boundary. Per-section timing and the bounding-box count produced:

Section	Function called	Wall time	Fields with bboxes
Filing cover	`s1-issuer-extract`	16 s	16
The Offering	`s1-offering-extract` (direct)	14 s	8
Use of Proceeds	`s1-use-of-proceeds-extract`	19 s	2
Capitalization	`s1-capitalization-extract`	37 s	35
Dilution	`s1-dilution-extract`	24 s	5
MD&A	`s1-mdna-extract`	47 s	18
Income Statement	`s1-income-statement-extract`	55 s	46
Risk Factors	`s1-risk-factors-extract`	73 s	42
Parse front matter	`s1-parse`	<2 s	(sections graph)

Total wall time: about 4 minutes 45 seconds. Total per-field bounding boxes: 172. Pages touched: 39 of the 308. The post is a different artifact from the workflow output. The post took longer to write than the API took to extract.
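The totals follow directly from the table rows. A quick tally, using only the eight Extract rows (the Parse call reports under two seconds and no per-field boxes, so it is left out):

```python
# (section, wall time in seconds, fields with bounding boxes), from the table above.
runs = [
    ("Filing cover", 16, 16),
    ("The Offering", 14, 8),
    ("Use of Proceeds", 19, 2),
    ("Capitalization", 37, 35),
    ("Dilution", 24, 5),
    ("MD&A", 47, 18),
    ("Income Statement", 55, 46),
    ("Risk Factors", 73, 42),
]

total_seconds = sum(seconds for _, seconds, _ in runs)
total_boxes = sum(boxes for _, _, boxes in runs)
minutes, seconds = divmod(total_seconds, 60)

print(f"total wall time: {minutes} min {seconds} s")
print(f"total fields with bounding boxes: {total_boxes}")
```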

Re-running on the amendment

When the S-1/A drops with the price range and the share count filled in, the workflow runs again against the same paths in the same JSON shape. The placeholders that were blank in this initial filing turn into real numbers, and the bounding box tells you exactly which cells changed. That is the difference between an analyst-pipeline workflow and a one-shot model call: the second filing isn't another project, it's the same query rerun against an updated document.
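Because both runs produce the same JSON shape, finding what changed in the amendment reduces to a path-by-path diff. A minimal sketch: the field names, the placeholder value, and the box coordinates in the sample are hypothetical, and the event keys (`transformedContent`, `fieldBoundingBoxes`) are the ones described earlier in this post.

```python
def flatten(node, prefix=""):
    """Map every scalar leaf in parsed JSON to its RFC 6901 JSON Pointer path."""
    if isinstance(node, dict):
        items = ((str(k).replace("~", "~0").replace("/", "~1"), v)
                 for k, v in node.items())
    elif isinstance(node, list):
        items = ((str(i), v) for i, v in enumerate(node))
    else:
        return {prefix: node}
    flat = {}
    for token, value in items:
        flat.update(flatten(value, f"{prefix}/{token}"))
    return flat

def changed_fields(old_event: dict, new_event: dict) -> list[dict]:
    """List fields whose value differs between two runs, with the new boxes."""
    old = flatten(old_event["transformedContent"])
    new = flatten(new_event["transformedContent"])
    boxes = new_event.get("fieldBoundingBoxes", {})
    changes = []
    for pointer in sorted(old.keys() | new.keys()):
        before, after = old.get(pointer), new.get(pointer)
        if before != after:
            changes.append({"field": pointer, "before": before, "after": after,
                            "boxes": boxes.get(pointer, [])})
    return changes

# Hypothetical sample: a blank placeholder in the S-1 is filled in the S-1/A.
s1 = {"transformedContent": {"listingExchange": "Nasdaq", "priceRangeLow": None}}
s1a = {
    "transformedContent": {"listingExchange": "Nasdaq",
                           "priceRangeLow": "<filled in S-1/A>"},
    "fieldBoundingBoxes": {
        "/priceRangeLow": [{"page": 3, "top": 0.4, "left": 0.5,
                            "width": 0.05, "height": 0.01}],
    },
}

changes = changed_fields(s1, s1a)
for change in changes:
    print(change["field"], change["before"], "->", change["after"])
```

Each change carries the rectangles from the newer filing, so a reviewer can jump straight to the cell that moved instead of re-reading the section.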

That is the toolkit. The findings are the show.

If you read filings for a living and want to see what an analyst pipeline looks like on your coverage, the build above runs against the public bem V3 API. Sign up and run it on yours.
