HIPAA-Safe CSV Import: What Schema-Only Mode Actually Does
Schema-only mode sends only column headers and up to three short sample rows — never the full file. Here is precisely what leaves your system, why it is free with no DPA, and how the compliance posture changes when you switch to full-data mode.
The fastest way to fail a healthcare security review is to send protected health information somewhere it did not need to go. The cleanest way to pass one is to not send it at all. AdaptivMapr’s default mode — schema-only — is built around exactly that idea: to propose a column mapping, the engine needs to understand the shape of your data, not its contents.
This post is the precise version of that claim. No marketing rounding — here is exactly what leaves your system, what does not, and what changes when you opt into full-data mode.
What actually leaves your system in schema-only mode
Two things, and nothing else:
- The column headers. Every header in the file — because that is what the cascade matches against the template.
- Up to three sample rows, each value clamped to 80 characters. A few short samples help the engine disambiguate a column whose header is uninformative (an id that is clearly a UUID, a column of two-letter country codes). Anything past the third row, or past 80 characters in a cell, is dropped before the request leaves the edge.
That clamp is not advisory. It is enforced at the HTTP boundary in every endpoint that accepts sample rows, through a single shared chokepoint in the parser, so schema-only and full-data code paths cannot diverge on what they are allowed to send. The full upload — every other row, every full-length value — never leaves your infrastructure in this mode. The engine itself is stateless: there is no application database, and what little it does hold lives in-process with a 24-hour time to live.
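To make the clamp concrete, here is a minimal sketch of what a client-side equivalent could look like. It is not AdaptivMapr's parser: the function names and the payload field names (headers, sample_rows) are assumptions made for illustration, and it counts characters as Python code points, which may differ from how the real chokepoint measures length.

```python
from typing import Any, Dict, List, Sequence

MAX_SAMPLE_ROWS = 3    # from the post: at most three sample rows
MAX_CELL_CHARS = 80    # from the post: each value clamped to 80 characters


def clamp_samples(
    rows: Sequence[Sequence[Any]],
    max_rows: int = MAX_SAMPLE_ROWS,
    max_chars: int = MAX_CELL_CHARS,
) -> List[List[str]]:
    """Keep the first max_rows rows and truncate every cell to max_chars."""
    return [[str(cell)[:max_chars] for cell in row] for row in rows[:max_rows]]


def build_schema_only_payload(
    headers: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Dict[str, Any]:
    """Hypothetical payload shape: every header, plus clamped samples only."""
    return {"headers": list(headers), "sample_rows": clamp_samples(rows)}
```

Routing every outgoing request through one function like build_schema_only_payload mirrors the single-chokepoint idea: there is exactly one place where the row and length limits can be wrong.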
Why three rows of ≤80 characters changes the compliance math
HIPAA’s minimum-necessary principle asks you to disclose the least PHI required for the task. The task here is mapping a column called Geb_Datum to a birth-date field. That task is fully served by the header and a couple of clamped samples; it does not require the other 50,000 rows. By construction, schema-only mode discloses close to the minimum a mapping task can operate on.
Because the exposure is so constrained, schema-only mode is free, unlimited, and needs no data protection agreement to use. You can wire it into CI, run it against every partner file you receive, and never sign a DPA for the privilege — there is simply not enough leaving your system to warrant one. This is the mode we expect most integration work to live in.
A practical caution that belongs in every engineering README: three sample rows of free text can still contain real identifiers if your source data does. Schema-only mode minimises exposure by volume and length; it does not redact. If your sample rows would contain PHI, send synthetic or de-identified samples, or send headers only.
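The headers-only option can be sketched as follows. This is a hedged illustration using Python's standard csv module: it reads the first record of the file and stops, so no data row is ever parsed into the payload. The function names and the payload field names are hypothetical, not the documented request shape.

```python
import csv
from typing import Any, Dict, List, TextIO


def read_headers(stream: TextIO) -> List[str]:
    """Return the first CSV record (the header row) without reading further."""
    reader = csv.reader(stream)
    return next(reader, [])  # empty list if the file has no rows at all


def headers_only_payload(stream: TextIO) -> Dict[str, Any]:
    """Hypothetical payload shape with no sample rows at all."""
    return {"headers": read_headers(stream), "sample_rows": []}
```

For a real file, open it with newline="" as the csv module documentation recommends, for example: with open(path, newline="", encoding="utf-8") as f: payload = headers_only_payload(f).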
When you need the rows: full-data mode
Some jobs genuinely require the data, not just its shape — validating every value against a code system, normalising a whole column, or committing a transformed resource. That is full-data mode, and it is gated behind an active PHI subscription rather than being free by default.
The compliance posture is deliberately different here:
- The cascade still runs in-process, exactly as it does for schema-only. The earlier layers — statistics, heuristic, fuzzy, and semantic — never call out to a third party.
- Only the layer-5 LLM step routes externally, and when it does it goes through phi-cloud, a PHI-aware, OpenAI-compatible gateway. The request carries X-PHI: true and an X-Region header, so phi-cloud forces a PHI-eligible, in-region model rather than whatever default a generic LLM endpoint would pick.
- A BAA is available through phi-cloud for the full-data path. That is where the Business Associate Agreement and the jurisdiction guarantees are enforced.
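As an illustration of the routing headers described above, the sketch below assembles the HTTP headers for a layer-5 request. Only the X-PHI and X-Region header names come from this post; the function name, the region value, and the Bearer-token Authorization header (a common convention for OpenAI-compatible endpoints) are assumptions, not documented phi-cloud behaviour.

```python
from typing import Dict


def phi_gateway_headers(region: str, api_key: str) -> Dict[str, str]:
    """Build request headers that mark a call as PHI and pin it to a region.

    Raises ValueError when the region is missing, so a PHI request can
    never be sent without an explicit jurisdiction.
    """
    if not region.strip():
        raise ValueError("X-Region must be set for PHI requests")
    return {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
        "X-PHI": "true",                       # header named in the post
        "X-Region": region.strip(),            # header named in the post
    }


# Example with a hypothetical region identifier and placeholder key.
example_headers = phi_gateway_headers("eu-central", "sk-placeholder")
```

Failing closed on an empty region is a deliberate choice in this sketch: a request that cannot state its jurisdiction should not leave the process.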
We say HIPAA-aware, and we say a BAA is available. We do not say “HIPAA certified” — there is no such certification to hold, and any vendor claiming it is telling you something untrue. (Our SOC 2 work is in progress; we will say so plainly when it lands, and not before.)
How to choose, in one paragraph
If you are mapping headers — figuring out which partner column is the birth date, whether a file matches a template, what the diff against your canonical schema looks like — stay in schema-only. It is free, it sends almost nothing, and it needs no paperwork. Reach for full-data mode only when the operation genuinely needs the values, and accept the subscription, the phi-cloud routing, and the BAA that come with it. Most teams spend most of their time in the free mode, which is the point.
See the headers-only request shape and the full-data opt-in in the docs, watch a real header resolve in the patient demographics walkthrough, or read how the cascade keeps the metered LLM layer from firing on most columns in the LOINC resolution post.