Hospital Bill Data

Methodology

Every number on this site is parsed from a hospital's own CMS-required machine-readable file. Here is exactly how that happens, and where the limits are.

1. Where the data comes from

Under the federal Hospital Price Transparency rule (45 CFR Part 180), hospitals must publish a machine-readable file (MRF) of their standard charges. We work from these hospital-published files using the official CMS v3.0 schema. We do not collect prices from bills, surveys, or third parties, and we never estimate or invent a value.

2. Discovery

Hospitals place a cms-hpt.txt manifest at the root of their public website (for example, https://hospital.org/cms-hpt.txt). We treat this as a structured discovery file that may list one or more locations, each with a source page URL and a direct link to its MRF. We also support a manually configured MRF URL where a manifest is unavailable.
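
As a rough illustration of the discovery step, the sketch below parses manifest text into one entry per location. It assumes the manifest is a set of "key: value" lines using the keys location-name, source-page-url, and mrf-url; the ManifestEntry type and parse_manifest function are hypothetical names chosen for this example, not our production code.

```python
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    location_name: str
    source_page_url: str
    mrf_url: str


def parse_manifest(text: str) -> list[ManifestEntry]:
    """Parse 'key: value' lines into one entry per location block."""
    entries: list[ManifestEntry] = []
    current: dict[str, str] = {}

    def flush() -> None:
        # Keep only blocks that actually point at an MRF.
        if current.get("mrf-url"):
            entries.append(
                ManifestEntry(
                    location_name=current.get("location-name", ""),
                    source_page_url=current.get("source-page-url", ""),
                    mrf_url=current["mrf-url"],
                )
            )
        current.clear()

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            flush()  # a blank line ends the current block
            continue
        # Split on the first colon only, so URLs keep their "https:" part.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line; ignore it
        key = key.strip().lower()
        if key == "location-name" and current:
            flush()  # a new location starts even without a blank line
        current[key] = value.strip()
    flush()  # the last block may not be followed by a blank line
    return entries
```

Blocks without an mrf-url are skipped rather than guessed at, which matches the fallback to a manually configured MRF URL described above.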

3. Archiving the raw source

We download each MRF and store the raw file unchanged, recording its URL, size, a SHA-256 content hash, the HTTP status, and the detected format. The raw archive is the ground truth that every parsed row can be traced back to.
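
The archive record described above could be built roughly as follows. This is a minimal sketch: ArchiveRecord, detect_format, and build_record are hypothetical names, the download is represented as an iterable of byte chunks, and the format detection shown (zip magic bytes, then a leading brace or bracket for JSON, otherwise CSV) is a simplifying assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ArchiveRecord:
    url: str
    size_bytes: int
    sha256: str
    http_status: int
    detected_format: str


def detect_format(head: bytes) -> str:
    """Guess the format from the first few bytes (assumed heuristic)."""
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    # Skip a UTF-8 byte-order mark and leading whitespace before checking.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n")
    if stripped[:1] in (b"{", b"["):
        return "json"
    return "csv"


def build_record(url: str, http_status: int, chunks: Iterable[bytes]) -> ArchiveRecord:
    """Hash and measure the body chunk by chunk, without buffering it all."""
    hasher = hashlib.sha256()
    size = 0
    head = b""
    for chunk in chunks:
        hasher.update(chunk)
        size += len(chunk)
        if len(head) < 16:
            head += chunk[: 16 - len(head)]
    return ArchiveRecord(
        url=url,
        size_bytes=size,
        sha256=hasher.hexdigest(),
        http_status=http_status,
        detected_format=detect_format(head),
    )
```

Hashing the bytes exactly as received means anyone holding the same raw file can recompute the SHA-256 and confirm it matches the archived copy.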

4. Parsing

We stream-parse the file so that even multi-gigabyte files are processed without loading them entirely into memory. We support the CMS v3.0 CSV “tall” and CSV “wide” layouts and the JSON schema, including zip-wrapped files. For each row we capture the service description, billing code(s) and code type, setting, billing class, and the full set of charge fields.
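
To show what streaming means in practice, here is a minimal sketch for the CSV case only (JSON is not covered). It assumes the column headers sit on the third row, after a row of hospital-level headers and a row of their values, and that a zip wrapper holds the MRF as its first member. The function names and the header_row_index parameter are hypothetical.

```python
import csv
import io
import zipfile
from typing import BinaryIO, Iterator


def open_payload(raw: BinaryIO) -> BinaryIO:
    """Return the first zip member if raw is a zip archive, else raw itself."""
    if zipfile.is_zipfile(raw):
        raw.seek(0)
        archive = zipfile.ZipFile(raw)
        # Assumption: the MRF is the first (usually only) member.
        return archive.open(archive.namelist()[0])
    raw.seek(0)
    return raw


def iter_csv_rows(raw: BinaryIO, header_row_index: int = 2) -> Iterator[dict[str, str]]:
    """Yield one dict per data row, reading the file incrementally."""
    # utf-8-sig drops a byte-order mark if the file starts with one.
    text = io.TextIOWrapper(open_payload(raw), encoding="utf-8-sig", newline="")
    header: list[str] = []
    for index, row in enumerate(csv.reader(text)):
        if index < header_row_index:
            continue  # hospital-level rows, handled elsewhere in this sketch
        if index == header_row_index:
            header = [cell.strip().lower() for cell in row]
            continue
        yield dict(zip(header, row))
```

Because iter_csv_rows is a generator, only one row is held in memory at a time, regardless of file size.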

5. Validation and rejection logging

Each row is validated. Rows without a description or without any usable code or price are rejected and counted, never silently dropped. We record how many rows were seen, imported, and rejected for every ingestion run, and surface those counts on each hospital’s source panel.
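
The counting behaviour can be sketched as below. Two assumptions are baked in: rows are dicts keyed by CMS-style column names such as "code|1" and "standard_charge|gross", and a row with a description is rejected only when it has neither a code nor a price (one possible reading of the rule above). RunCounts, rejection_reason, and validate are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional


@dataclass
class RunCounts:
    seen: int = 0
    imported: int = 0
    rejected: int = 0
    reasons: Counter = field(default_factory=Counter)


def _has_value(row: dict, prefix: str, skip_suffix: str = "") -> bool:
    """True if any column with this prefix holds a non-blank value."""
    for key, value in row.items():
        if not key.startswith(prefix):
            continue
        if skip_suffix and key.endswith(skip_suffix):
            continue
        if (value or "").strip():
            return True
    return False


def rejection_reason(row: dict) -> Optional[str]:
    """Return why a row is rejected, or None if it passes."""
    if not (row.get("description") or "").strip():
        return "missing_description"
    has_code = _has_value(row, "code|", skip_suffix="|type")
    has_price = _has_value(row, "standard_charge|")
    if not (has_code or has_price):
        return "no_usable_code_or_price"
    return None


def validate(rows: Iterable[dict], counts: RunCounts) -> Iterator[dict]:
    """Yield accepted rows; every rejected row is counted with its reason."""
    for row in rows:
        counts.seen += 1
        reason = rejection_reason(row)
        if reason:
            counts.rejected += 1
            counts.reasons[reason] += 1
            continue
        counts.imported += 1
        yield row
```

After a run, seen always equals imported plus rejected, so the counts shown on a source panel can be checked against each other.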

6. Normalization

We normalize codes, payer names, and plan names so the data is searchable, while always preserving the original values exactly as published. Money and percentage fields are parsed conservatively; anything ambiguous becomes empty rather than a guess. Each charge row keeps a stable hash of its original source row.
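
As an illustration of "empty rather than a guess", the sketch below accepts only plainly formatted dollar amounts and returns None for anything else. It also shows one way to keep the original name next to a normalized one and to hash a source row. The accepted money pattern, the JSON-based canonical form, and all function names are assumptions made for this example.

```python
import hashlib
import json
import re
from decimal import Decimal
from typing import Optional

# Accepts "1200", "1200.5", "$1,200.50"; rejects ranges, text, "1.200,50".
_MONEY = re.compile(r"\$?\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def parse_money(raw: str) -> Optional[Decimal]:
    """Return an exact Decimal, or None when the value is ambiguous."""
    match = _MONEY.fullmatch(raw.strip())
    if match is None:
        return None
    whole = match.group(1).replace(",", "")
    return Decimal(whole + (match.group(2) or ""))


def normalize_name(original: str) -> dict[str, str]:
    """Keep the published value untouched next to a searchable form."""
    normalized = " ".join(original.split()).casefold()
    return {"original": original, "normalized": normalized}


def source_row_hash(row: dict[str, str]) -> str:
    """Hash a canonical serialization so column order does not matter."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Decimal is used instead of float so that a published amount such as 1200.50 is stored exactly as written.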

7. Procedure mapping and confidence

We map service lines to consumer procedure groups (such as “MRI” or “colonoscopy”) first by exact billing-code match (high confidence) and, failing that, by description keywords (medium confidence). Low-confidence matches are flagged and excluded from indexed code pages.
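
The two-tier matching might look something like this. The lookup tables are tiny illustrative stand-ins (CPT 70551 and 45378 are used only as examples), and the Match type and function names are hypothetical. The sketch does not model how a low-confidence match arises; it only shows that such matches are kept off indexed pages.

```python
import re
from typing import Iterable, NamedTuple, Optional


class Match(NamedTuple):
    group: str
    confidence: str  # "high", "medium", or "low"
    method: str      # "code" or "keyword"


# Illustrative stand-ins for much larger mapping tables.
CODE_TO_GROUP = {"70551": "mri", "45378": "colonoscopy"}
GROUP_KEYWORDS = {
    "mri": ("mri", "magnetic resonance"),
    "colonoscopy": ("colonoscopy",),
}


def map_procedure(codes: Iterable[str], description: str) -> Optional[Match]:
    """Try an exact code match first, then fall back to keywords."""
    for code in codes:
        group = CODE_TO_GROUP.get(code.strip().upper())
        if group:
            return Match(group, "high", "code")
    text = description.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries avoid matching a keyword inside a longer word.
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return Match(group, "medium", "keyword")
    return None


def allowed_on_indexed_code_page(match: Optional[Match]) -> bool:
    """Unmatched and low-confidence rows stay off indexed code pages."""
    return match is not None and match.confidence != "low"
```

Checking codes before keywords means a row with a recognized code is never downgraded by an oddly worded description.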

8. What we publish — and what we hold back

A page is only indexed when it is backed by real parsed rows with a verifiable source. We keep pages out of search engines when data is missing, a source could not be validated, a hospital has no usable MRF, procedure-mapping confidence is low, or the page would otherwise be thin or duplicative.
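
The hold-back conditions listed above can be read as a simple gate. The following is a sketch under assumed field names; PageFacts and noindex_reasons are hypothetical, and the real checks behind each flag are not shown.

```python
from dataclasses import dataclass


@dataclass
class PageFacts:
    parsed_row_count: int
    source_validated: bool
    has_usable_mrf: bool
    mapping_confidence: str  # "high", "medium", or "low"
    is_thin_or_duplicate: bool


def noindex_reasons(page: PageFacts) -> list[str]:
    """Return every reason to keep a page out of search engines."""
    reasons: list[str] = []
    if page.parsed_row_count == 0:
        reasons.append("data_missing")
    if not page.source_validated:
        reasons.append("source_not_validated")
    if not page.has_usable_mrf:
        reasons.append("no_usable_mrf")
    if page.mapping_confidence == "low":
        reasons.append("low_mapping_confidence")
    if page.is_thin_or_duplicate:
        reasons.append("thin_or_duplicate")
    return reasons


def is_indexable(page: PageFacts) -> bool:
    """A page is indexed only when no hold-back reason applies."""
    return not noindex_reasons(page)
```

Returning the full list of reasons, rather than a bare yes or no, makes it possible to see why a given page was held back.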

9. Important limits

Hospital files vary widely in completeness, formatting, and update cadence. A disclosed price is what the hospital published; it is not a quote and not a guarantee. See “Data limitations” and “Why your actual bill may differ” for the full picture.