# Reproducing the Empirical Analyses

This document accompanies the audit trail and reproduction scripts for the empirical findings reported in §5 of *Observed Appetite: A Computational Framework for Measuring Commercial Insurance Carrier Underwriting Behavior at Distribution Scale* (Shrestha, 2026; DOI [10.5281/zenodo.20280436](https://doi.org/10.5281/zenodo.20280436)).

File references below use bare names (e.g., `carriers.json`); resolve them as siblings in flat distributions (Zenodo individual files) or in the parent directory in nested distributions (the `appetite-corpus-v1.zip` archive and the website mirror).

The paper reports two analyses:

- **Analysis B** — Granularity gap (primary contribution). Codes each of 509 carriers along six dimensions (D1–D6) and reports five headline metrics.
- **Analysis A** — Within-carrier inter-source agreement. Two complementary κ studies (Study 1: PDF vs. text-page; Study 2: raw source vs. parse).

## Files in this directory

| File | What it is |
|---|---|
| `compute_analysis_b_v2.py` | Reproducible script for Analysis B. Reads the parsed-JSON corpus and the raw source text/PDFs; emits the six-dimension coding and the headline metrics. |
| `compute_analysis_a_v2.py` | Reproducible script for Analysis A Studies 1 and 2. Uses a deterministic NAICS-2 sector-keyword matcher; bootstraps κ and Gwet's AC1 over 1,000 resamples. |
| `analysis-b-coding-v2.csv` | Full coding output: one row per carrier × six dimensions, with verbatim D5 evidence quotes and source-page references. The audit trail for the 2.16% interaction-disclosure headline. |
| `analysis-b-headlines-v2.json` | Five headline metrics plus the seven auxiliary statistics, all with 95% bootstrap CIs. |
| `analysis-a-study1-corrected.json` | **The published κ result for Analysis A Study 1.** Cohen's κ = +0.2511 [0.22, 0.28]; the full 2 × 2 contingency table; pooled marginals (33.65% / 15.71%); 2.14× coverage ratio. |
| `analysis-a-study1.json` | The earlier, **non-published** computation that applied a drop rule excluding both-out cells. Preserved verbatim for the §5.2.4 sensitivity-check disclosure (yields κ = −0.30 under the drop rule). Do not cite this as the headline. |
| `analysis-a-study1-vectors.csv` | Per-cell inputs to Study 1: each (carrier, NAICS-2 sector) coded from PDF and from text-page independently. 189 carriers × 20 sectors = 3,780 rows. |
| `analysis-a-study2.json` | Study 2 (pipeline self-audit): κ between the raw source text and the normalized parse, on a stratified sample of 25 carriers. AC1 = +0.55. |
| `analysis-a-study2-vectors.csv` | Per-cell inputs to Study 2. |

Seed: `20260518` for both scripts.
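
As a rough illustration of what a seeded bootstrap of κ and Gwet's AC1 looks like, here is a minimal sketch for two binary coding vectors. It is not the code in `compute_analysis_a_v2.py`: the function names are hypothetical, the percentile interval is an assumed CI method, and resampling individual cells (rather than whole carriers) is also an assumption. Only the seed and the resample count come from this README.

```python
import random

SEED = 20260518      # seed stated in this README
N_RESAMPLES = 1000   # resample count stated in this README


def observed_agreement(a, b):
    """Share of positions where two equal-length 0/1 vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) vectors; NaN if chance agreement is 1."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    if p_e == 1:
        return float("nan")
    return (observed_agreement(a, b) - p_e) / (1 - p_e)


def gwet_ac1(a, b):
    """Gwet's AC1 for two binary (0/1) vectors."""
    n = len(a)
    pi = (sum(a) / n + sum(b) / n) / 2   # mean prevalence of the "1" code
    p_e = 2 * pi * (1 - pi)              # AC1 chance agreement, two categories
    return (observed_agreement(a, b) - p_e) / (1 - p_e)


def bootstrap_ci(a, b, stat, n_resamples=N_RESAMPLES, seed=SEED, alpha=0.05):
    """Percentile bootstrap CI (ASSUMED method), resampling paired cells."""
    rng = random.Random(seed)
    n = len(a)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = stat([a[i] for i in idx], [b[i] for i in idx])
        if value == value:               # skip NaN from degenerate resamples
            estimates.append(value)
    estimates.sort()
    m = len(estimates)
    lo = estimates[int((alpha / 2) * m)]
    hi = estimates[min(m - 1, int((1 - alpha / 2) * m))]
    return lo, hi
```

Because the generator is seeded, repeated calls with the same inputs return the same interval, which is the property that makes a fixed seed useful for an audit trail.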

## What can be reproduced from this directory alone

- **All headline numbers in §5** can be verified by reading the JSON files. The numbers in the paper were computed exactly once and written to these files, so you can confirm that the paper matches what the JSONs report.
- **The Analysis B coding logic** can be re-executed against `carriers.json` (the sanitized corpus released alongside this directory), provided the original raw-source PDFs and TXT pages are present at the path the script expects.
- **The Analysis A coding logic** can be re-executed against the raw scrape directory.
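
To check the paper's numbers against the JSON files without knowing their key layout in advance, one option is to list every numeric value together with its path. The sketch below assumes nothing about the schema; the helper names `numeric_leaves` and `dump_numbers` are hypothetical and not part of the released scripts.

```python
import json
from pathlib import Path


def numeric_leaves(node, prefix=""):
    """Yield (path, value) for every number nested inside a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from numeric_leaves(value, child)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from numeric_leaves(value, f"{prefix}[{i}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # bool is a subclass of int in Python, so exclude it explicitly
        yield prefix, node


def dump_numbers(path):
    """Print one 'path<TAB>value' line per number found in a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    for dotted, value in numeric_leaves(data):
        print(f"{dotted}\t{value}")
```

Calling `dump_numbers("analysis-b-headlines-v2.json")` or `dump_numbers("analysis-a-study1-corrected.json")` would then give a flat listing to compare against §5. Some values can also be cross-checked by hand: if the 2.14× coverage ratio is the ratio of the two pooled marginals, then 33.65 / 15.71 ≈ 2.14.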

## What requires re-fetching to reproduce from scratch

Analysis A Study 1 depends on the raw PDF and HTML/text artifacts retrieved from each carrier's public URL. These artifacts are not redistributed in this release (each is owned by its publishing carrier under that carrier's own terms of use). To reproduce Study 1 from scratch:

1. Read `sources.csv` for the 509 carrier URLs and the `scrapeAccessDate` column.
2. Re-fetch each URL using your preferred browser-automation tool or HTTP client. Some carrier URLs have changed since the 2026-03 to 2026-04 scrape period; if a URL has moved, use the Wayback Machine to retrieve a capture near the recorded `scrapeAccessDate`.
3. Run `compute_analysis_a_v2.py` against your re-fetched directory.
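
The first two steps above can be sketched with the standard library alone. Several details here are assumptions rather than facts about this release: the URL column name (`url`), the ISO `YYYY-MM-DD` format of `scrapeAccessDate`, the user-agent string, and the output file naming, which almost certainly differs from the directory layout `compute_analysis_a_v2.py` expects. The Wayback fallback relies on the `https://web.archive.org/web/<timestamp>/<url>` form, which redirects to the capture nearest the timestamp.

```python
import csv
import urllib.request
from pathlib import Path

URL_COLUMN = "url"                # ASSUMPTION: real column name may differ
DATE_COLUMN = "scrapeAccessDate"  # named in this README; ISO date format assumed


def wayback_url(original_url, access_date):
    """Build a Wayback Machine URL for the capture nearest access_date."""
    stamp = access_date.replace("-", "")[:8]   # "2026-03-15" -> "20260315"
    return f"https://web.archive.org/web/{stamp}/{original_url}"


def fetch(url, timeout=30):
    """Return the raw response body for url."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "appetite-repro/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def refetch_all(sources_csv, out_dir):
    """Fetch every listed URL, falling back to the Wayback Machine on failure."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    with open(sources_csv, newline="", encoding="utf-8") as handle:
        for i, row in enumerate(csv.DictReader(handle)):
            url, date = row[URL_COLUMN], row[DATE_COLUMN]
            try:
                body = fetch(url)
            except OSError:                    # URLError and timeouts are OSErrors
                try:
                    body = fetch(wayback_url(url, date))
                except OSError:
                    failed.append(url)
                    continue
            # ASSUMPTION: placeholder naming, not the script's expected layout
            (out / f"{i:04d}.bin").write_bytes(body)
    return failed
```

`refetch_all("sources.csv", "refetched/")` returns the URLs that could not be retrieved from either source, so they can be handled manually. Check each carrier's terms of use and rate-limit your requests before running anything like this at scale.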

## Scripts' input-path expectations

Both scripts resolve their input paths relative to their own location. In the QuoteSweep-internal repo layout, the scripts expect:

- A parsed corpus at `<repo-root>/quotesweep-app/exports/appetite-guides-parsed.json` (the internal-format JSON; sanitized to `carriers.json` for the public release).
- Raw scrape artifacts at `<repo-root>/quotesweep-app/data/appetite-guides/`.

For external reproduction, the simplest patch is to point the scripts' `PARSED_PATH` and `RAW_DIR` variables to your local equivalents. The schema of the parsed JSON is documented in `codebook.md` and `corpus-schema.md`.
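
One way to make that patch without hard-coding local paths is to read overrides from environment variables and fail early if an input is missing. This is a sketch only: the variable names `APPETITE_PARSED_PATH` and `APPETITE_RAW_DIR`, the default `raw-sources` directory, and both helper functions are hypothetical. Only `PARSED_PATH`, `RAW_DIR`, and `carriers.json` come from this release.

```python
import os
from pathlib import Path

DEFAULT_PARSED = Path("carriers.json")  # sanitized corpus from the public release
DEFAULT_RAW = Path("raw-sources")       # ASSUMPTION: hypothetical local directory


def resolve_inputs(env=None):
    """Return (parsed_path, raw_dir), honoring optional environment overrides."""
    env = os.environ if env is None else env
    parsed = Path(env.get("APPETITE_PARSED_PATH", DEFAULT_PARSED)).expanduser()
    raw = Path(env.get("APPETITE_RAW_DIR", DEFAULT_RAW)).expanduser()
    return parsed, raw


def check_inputs(parsed, raw):
    """Return a list of human-readable problems; empty means both inputs exist."""
    problems = []
    if not parsed.is_file():
        problems.append(f"parsed corpus not found: {parsed}")
    if not raw.is_dir():
        problems.append(f"raw scrape directory not found: {raw}")
    return problems


# Near the top of a patched script, the two variables could then be set as:
PARSED_PATH, RAW_DIR = resolve_inputs()
```

Running `check_inputs(PARSED_PATH, RAW_DIR)` before the analysis starts surfaces a missing corpus or raw directory as a clear message instead of a traceback partway through the run.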

## Caveats for downstream researchers

- **Single-coder coding.** All D5 (interaction disclosure) coding and Analysis A keyword matching were performed by one LLM-assisted agent in a single session. The `analysis-b-coding-v2.csv` file carries verbatim evidence quotes so the codings can be independently audited.
- **Sample bias.** The 509-carrier sample over-represents carriers that publish *any* appetite documentation. The headline disclosure rates reported in §5.3 should therefore be interpreted as ceilings on the population rates, not central estimates.
- **Drop-rule artifact.** Read `analysis-a-study1.json` only in the context of §5.2.4. The published κ result is in `analysis-a-study1-corrected.json`.
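
To see why a drop rule that excludes both-out cells can push κ negative, it helps to compute κ from a 2 × 2 table with and without the both-out count. The sketch below uses toy counts chosen purely for illustration; they are not taken from the corpus or from either Study 1 JSON file, and the function name is hypothetical.

```python
def kappa_from_counts(both_in, first_only, second_only, both_out):
    """Cohen's kappa from the four cells of a 2 x 2 agreement table."""
    n = both_in + first_only + second_only + both_out
    p_o = (both_in + both_out) / n           # observed agreement
    p_first = (both_in + first_only) / n     # first source codes "in"
    p_second = (both_in + second_only) / n   # second source codes "in"
    p_e = p_first * p_second + (1 - p_first) * (1 - p_second)
    return (p_o - p_e) / (1 - p_e)


# Toy counts, NOT from the corpus: most cells are "out" in both sources.
toy = dict(both_in=10, first_only=20, second_only=5, both_out=65)

kappa_full = kappa_from_counts(**toy)
# Drop rule: discard the both-out cells, keeping the other three counts.
kappa_dropped = kappa_from_counts(**{**toy, "both_out": 0})

print(f"all cells kept:    kappa = {kappa_full:+.2f}")
print(f"both-out dropped:  kappa = {kappa_dropped:+.2f}")
```

Removing the both-out cells deletes most of the observed agreement while leaving every disagreement in place, so the statistic falls below its chance baseline. That is the general mechanism behind a sign flip of this kind; §5.2.4 remains the authoritative account of the actual sensitivity check.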

## Contact

Questions, corrections, and citation requests: see `CITATION.md`.

