# QuoteSweep Stated-Appetite Corpus, v1.0

A public dataset of stated-appetite materials published by 509 U.S. commercial Property & Casualty insurance carriers, normalized into a single machine-readable JSON file.

| | |
|---|---|
| DOI | [10.5281/zenodo.20280436](https://doi.org/10.5281/zenodo.20280436) |
| Carriers | 509 |
| Line-of-business rows | 2,031 |
| Appetite class rows | 9,526 |
| Snapshot date | 2026-05-19 |
| Scrape window | 2026-03-23 to 2026-04-06 (15 days) |
| Provenance | 201 carrier-published PDFs + 500 carrier-website TXT scrapes, normalized by the QuoteSweep pipeline |
| License | CC-BY-4.0 (see `LICENSE`) |
| Distribution | Canonical archive on Zenodo (DOI above); mirror at https://quotesweep.com/research/appetite-corpus-v1/ |

## What this is

This is the underlying dataset for the empirical analysis section of the QuoteSweep observed-appetite whitepaper. It captures *what U.S. commercial P&C carriers publicly say they will write*, parsed into a uniform schema. The headline findings derived from it are reproducible from this package alone (see "Reproducibility" below).

## What this is not

It is **not**:
- A claim about what these carriers actually quote in practice (that is the *observed* appetite category the whitepaper defines).
- A complete enumeration of the U.S. commercial P&C carrier population. Carriers absent here are absent because they publish no public appetite material we could locate or because our pipeline did not parse them. Treat the 509-carrier sample as a floor on disclosure, not a ceiling.
- A normalized representation of carrier underwriting *intent*. The corpus surfaces what's literally published; it does not infer.

## Files in this release

This release is distributed in two layouts depending on the mirror:

**Nested layout (Zenodo `appetite-corpus-v1.zip` archive, and website mirror):**

```
corpus-release/
├── README.md             — this file
├── LICENSE               — CC-BY-4.0
├── CITATION.md           — suggested citation forms
├── carriers.json         — the 509-carrier normalized dataset (3.7 MB)
├── sources.csv           — per-carrier provenance rows
├── codebook.md           — field-by-field data dictionary
├── corpus-schema.md      — supplementary schema notes with population rates
└── reproduce/            — full audit trail and reproduction scripts
    ├── REPRODUCE.md
    ├── compute_analysis_b_v2.py
    ├── compute_analysis_a_v2.py
    ├── analysis-b-coding-v2.csv          — D5 audit trail with verbatim evidence quotes
    ├── analysis-b-headlines-v2.json      — five headline metrics with 95% CIs
    ├── analysis-a-study1-corrected.json  — published κ (the one cited in §5.2)
    ├── analysis-a-study1.json            — pre-correction κ artifact (§5.2.4 sensitivity check)
    ├── analysis-a-study1-vectors.csv     — Study 1 raw per-cell inputs
    ├── analysis-a-study2.json            — pipeline self-audit (§3.6)
    └── analysis-a-study2-vectors.csv     — Study 2 raw per-cell inputs
```

**Flat layout (Zenodo individual-file uploads):** Zenodo's individual-file upload does not preserve subdirectories, so on `zenodo.org/records/20280436` the 17 files listed above appear as siblings. For full structure-preserving reproduction, download the `appetite-corpus-v1.zip` archive from the same Zenodo record.

In both layouts, the dataset files (`carriers.json`, `sources.csv`, etc.) describe the corpus, and the analytical audit trail (`compute_*.py`, `analysis-*.json`, `analysis-*.csv`, `REPRODUCE.md`) supports reproduction. In the nested layout the audit trail sits under `reproduce/`; in the flat layout it sits alongside the dataset files. File names are identical between layouts.
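
Because file names are identical in both layouts, a script can resolve a release file without knowing which layout it was downloaded in. The following is a minimal sketch, not part of the release; the helper name `locate` is hypothetical.

```python
from pathlib import Path


def locate(root: Path, name: str) -> Path:
    """Return the path to a release file in either the nested or flat layout.

    Checks the release root first, then the `reproduce/` subdirectory.
    """
    for candidate in (root / name, root / "reproduce" / name):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} not found under {root} in either layout")
```

For example, `locate(Path("corpus-release"), "REPRODUCE.md")` returns `corpus-release/reproduce/REPRODUCE.md` for the nested archive and `corpus-release/REPRODUCE.md` for a flat download placed in the same directory.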

### Top-level file detail

| File | Purpose |
|---|---|
| `carriers.json` | The 509-carrier dataset (sanitized; internal scrape filenames stripped). |
| `sources.csv` | Per-carrier provenance row: `carrierId`, `carrierName`, `sourceType`, `sourceUrl`, `guideYear`, `guidePublicationYear`, `scrapeAccessDate`, `accessStatus`. |
| `codebook.md` | Field-by-field data dictionary, value distributions, known gaps. |
| `corpus-schema.md` | Supplementary schema notes; companion to `codebook.md`. |
| `CITATION.md` | Suggested academic and informal citation forms. |
| `LICENSE` | CC-BY-4.0 license text. |

## Reproducibility

The five headline findings in the paper's Empirical Analysis section can be re-derived from this release. The analysis scripts (`reproduce/compute_analysis_b_v2.py`, `reproduce/compute_analysis_a_v2.py`) use seed `20260518` and require Python 3.10+ plus `pdftotext` (Poppler) for Analysis B's raw-text scan step.

The raw PDF and TXT source files used by Analysis A are not redistributed (each is owned by its publishing carrier under that carrier's own terms). Reproduction of Analysis A from scratch therefore requires re-fetching the URLs in `sources.csv`; see `reproduce/REPRODUCE.md` for the procedure.

Readers who only want to *verify* (rather than re-execute) the numbers can read them directly from `reproduce/analysis-b-headlines-v2.json` and `reproduce/analysis-a-study1-corrected.json`.
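
Before re-executing the scripts, it may help to confirm that the stated requirements are met. The sketch below checks only the two prerequisites named in this section (Python 3.10+ and `pdftotext` on the `PATH`); the function name `preflight` is hypothetical, and `reproduce/REPRODUCE.md` remains the authoritative procedure.

```python
import shutil
import sys

SEED = 20260518  # seed stated in this README for both analysis scripts


def preflight(min_version=(3, 10), tool="pdftotext"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted}+ required, found {sys.version.split()[0]}")
    if shutil.which(tool) is None:
        problems.append(f"'{tool}' not found on PATH (it is provided by Poppler)")
    return problems
```

Calling `preflight()` and printing any returned messages gives a quick readiness check before running either script.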

## Sanitization policy applied to this release

Versus the QuoteSweep-internal parsed JSON, the following were stripped:
- `sourceFile` — internal file-naming convention; the public-facing carrier URL appears instead in `sources.csv`.
- All timestamps that reflect QuoteSweep's scrape or enrichment job state (e.g., `lastEnriched`).
- Any QuoteSweep-internal identifiers other than `carrierId`, which doubles as the stable public slug.

Carrier-published material (the entire `appetiteClasses`, `excludedClasses`, `linesOfBusiness`, `specialPrograms`, `underwritingNotes`, `contactInfo` content) is preserved verbatim from the source.
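
A downstream user can spot-check the sanitization by scanning the JSON for the stripped keys. This sketch assumes only the two key names given above (`sourceFile`, `lastEnriched`); the full list of stripped fields is not enumerated here, and the helper names are hypothetical. It makes no assumption about the top-level shape of `carriers.json`, since it walks any nesting of objects and arrays.

```python
import json

# Only the two keys this README names explicitly; the real list may be longer.
STRIPPED_KEYS = {"sourceFile", "lastEnriched"}


def find_stripped_keys(node, path="$"):
    """Recursively collect JSON paths whose key is in STRIPPED_KEYS."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in STRIPPED_KEYS:
                hits.append(child)
            hits.extend(find_stripped_keys(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_stripped_keys(item, f"{path}[{index}]"))
    return hits


def check_file(json_path):
    """Load a JSON file and return any paths that still carry stripped keys."""
    with open(json_path, encoding="utf-8") as handle:
        return find_stripped_keys(json.load(handle))
```

If the release is sanitized as described, `check_file("carriers.json")` should return an empty list.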

## Provenance of `sources.csv`

The `sourceUrl` column captures the carrier-published URL that QuoteSweep originally fetched. URLs were populated on 508 of 509 carriers from the per-carrier `*-discovery.json` provenance records.

The CSV separates two distinct dates that earlier releases conflated:

- `guidePublicationYear` (populated on 459 of 509 carriers) — the four-digit year on the carrier's own publication where extractable (PDF metadata, on-document date stamp, or `guideDate` from the discovery record). Absent that, the value is `unknown` or blank. Heterogeneous in derivation; see `codebook.md`.
- `scrapeAccessDate` (populated on 509 of 509 carriers) — ISO date (UTC) on which QuoteSweep's discovery pipeline retrieved the underlying artifact, derived from the earliest file-modification time across the per-carrier scrape artifacts. Distribution: 503 carriers on 2026-03-23, 1 on 2026-03-27, 5 on 2026-04-06. The paper's "between March and April 2026" claim resolves to this column.
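
The population counts and the access-date distribution above can be tabulated from `sources.csv` with the standard library. This is a minimal sketch using the column names listed in this README; treating a value as "populated" only when it is a four-digit year is an assumption, and the helper names are hypothetical.

```python
import csv
from collections import Counter


def is_year(value):
    """True for a four-digit ASCII year; `unknown` and blank count as unpopulated."""
    text = (value or "").strip()
    return len(text) == 4 and text.isascii() and text.isdigit()


def summarize_dates(rows):
    """Count carriers, scrapeAccessDate values, and populated publication years."""
    rows = list(rows)
    return {
        "carriers": len(rows),
        "scrapeAccessDate": dict(Counter(r["scrapeAccessDate"] for r in rows)),
        "guidePublicationYearPopulated": sum(
            1 for r in rows if is_year(r.get("guidePublicationYear"))
        ),
    }


def load_rows(csv_path):
    """Read sources.csv into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

`summarize_dates(load_rows("sources.csv"))` can then be compared against the figures stated in the two bullets above.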

## Suggested workflow for downstream researchers

1. Read `codebook.md` and `corpus-schema.md` for the schema and known inconsistencies.
2. Replicate the headline metrics from `carriers.json` using the analysis scripts in `reproduce/`.
3. For inter-source agreement studies (Analysis A), use `sources.csv` to re-fetch the raw PDFs / TXT pages from their carrier URLs. Cite the access date.
4. For extensions, the heterogeneity in `appetiteClasses[*].description` and `excludedClasses[*].reason` is a rich free-text substrate for NLP / classification studies.
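
For step 3, one way to keep access dates citable is to build a re-fetch manifest from `sources.csv` before downloading anything. The sketch below performs no network requests and is not the procedure in `reproduce/REPRODUCE.md`; the function name and the output field names `originalScrapeAccessDate` and `accessedOn` are hypothetical.

```python
from datetime import datetime, timezone


def build_fetch_manifest(rows, accessed_on=None):
    """List carriers with a usable sourceUrl, stamped with an access date.

    `rows` are dicts keyed by the sources.csv column names. Rows whose
    sourceUrl is blank or is not an http(s) URL are skipped.
    """
    # Default to today's date in UTC, matching the ISO format of scrapeAccessDate.
    accessed = accessed_on or datetime.now(timezone.utc).date().isoformat()
    manifest = []
    for row in rows:
        url = (row.get("sourceUrl") or "").strip()
        if not url.startswith(("http://", "https://")):
            continue
        manifest.append(
            {
                "carrierId": row["carrierId"],
                "sourceUrl": url,
                "originalScrapeAccessDate": row.get("scrapeAccessDate", ""),
                "accessedOn": accessed,
            }
        )
    return manifest
```

Recording both dates lets a follow-up study report how much time elapsed between the original scrape and the re-fetch, which matters if a carrier has since revised its published material.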

## Contact and acknowledgments

Compiled by Ankur Shrestha for the QuoteSweep observed-appetite whitepaper. Issues, corrections, and citations are welcome; please open a GitHub issue at the paper's repository (URL to be added when paper is hosted).