How the Data Pipeline Works
We ingest NEPA documents from the NEPATEC2.0 dataset into SQLite so search, signals, and attestation all run against a single source of truth.
Source: NEPATEC2.0
All project and document data comes from NEPATEC2.0 (PNNL PermitAI), hosted on HuggingFace. The dataset has three streams:
- CE — Categorical Exclusions (~54.7K projects, 367K pages)
- EA — Environmental Assessments (~3.1K projects, 469K pages)
- EIS — Environmental Impact Statements (~4.1K projects, 6.1M pages)
Each stream is a large JSONL file: one line per project, with nested project, process, and documents (each document has metadata and pages with extracted text). You place these files under nepatec_data/CE/, nepatec_data/EA/, and nepatec_data/EIS/ (e.g. ce_full_dataset.jsonl), then run python -m backend.db.ingest from the repo root.
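Parsing one such line looks roughly like this (a minimal sketch: the top-level project / process / documents keys and the per-document metadata and pages arrays are from the description above, but the leaf field names inside them are assumptions):

```python
import json

# One illustrative NEPATEC2.0 line. Top-level shape follows the text above;
# the leaf field names ("title", "type", "text", ...) are assumptions.
line = json.dumps({
    "project": {"title": "Example trail maintenance"},
    "process": {"type": "CE"},
    "documents": [{
        "metadata": {"filename": "decision_memo.pdf"},
        "pages": [{"text": "Page 1 text."}, {"text": "Page 2 text."}],
    }],
})

record = json.loads(line)                 # one line = one project
doc = record["documents"][0]
full_text = "\n".join(p["text"] for p in doc["pages"])
```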
Discovery and one record = one project
The ingest script discovers all *.jsonl files under nepatec_data/CE, nepatec_data/EA, and nepatec_data/EIS. Each line is one JSON object = one project with its documents. We parse the line, normalize the process type to CE | EA | EIS, then insert one row into projects, one row per document into documents, and one row per milestone into milestones. Each project is written in a single transaction; if anything fails we roll back that project and continue with the next.
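The per-project transaction loop can be sketched like this (against an in-memory DB; the real ingest in backend/db/ingest.py also writes documents and milestones, and the id/title field names here are assumptions):

```python
import json
import sqlite3

def ingest_lines(conn, lines):
    """One JSONL line = one project. Each project gets its own transaction;
    on any error we roll back that project and continue with the next."""
    ok = 0
    for line in lines:
        try:
            record = json.loads(line)
            with conn:  # commits on success, rolls back this project on error
                conn.execute(
                    "INSERT INTO projects (id, title, process_type) VALUES (?, ?, ?)",
                    (record["id"], record["project"]["title"],
                     record["process"]["type"]),
                )
            ok += 1
        except Exception:
            continue  # bad line or constraint failure: skip, keep ingesting
    return ok

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id TEXT PRIMARY KEY, title TEXT, process_type TEXT)")
good = json.dumps({"id": "p1", "project": {"title": "A"}, "process": {"type": "EA"}})
bad = "{not json"
count = ingest_lines(conn, [good, bad, good])  # duplicate id also rolls back
```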
Main-doc selection (critical for EIS)
Many EIS projects have several files marked main_document: YES in the source (ROD, errata, summaries). We need to run compliance signals on the analysis document—the FEIS or DEIS—not a 3-page errata sheet.
So we use process-type-aware main-doc selection:
- CE and EA: A document is “main” if the source says main_document: YES.
- EIS: We first look for documents whose document_type (or the type inferred from the filename) is FEIS, DEIS, or EIS. If we find any, those are the main docs. Only if we find none do we fall back to main_document: YES.
That way the scanner always runs on the full analysis document, and only those documents get the is_main flag set in the DB.
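The selection rule above can be sketched as follows (the real helper is _pick_main in backend/db/ingest.py; the dict keys here are assumptions for this sketch):

```python
ANALYSIS_TYPES = {"FEIS", "DEIS", "EIS"}

def pick_main(docs, process_type):
    """Process-type-aware main-doc selection (sketch)."""
    if process_type == "EIS":
        analysis = [d for d in docs if d.get("doc_type") in ANALYSIS_TYPES]
        if analysis:
            return analysis          # prefer the full analysis document(s)
    # CE/EA always, and EIS only as a fallback: trust the source flag
    return [d for d in docs if d.get("main_document") == "YES"]

docs = [
    {"doc_type": "ROD", "main_document": "YES"},   # short decision record
    {"doc_type": "FEIS", "main_document": "NO"},   # the analysis document
]
mains = pick_main(docs, "EIS")
```

For an EIS this returns the FEIS even though the source flags only the ROD as main; for CE/EA the source flag wins.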
CE category and document type
For Categorical Exclusions, we store ce_category from the document metadata (e.g. B1, C4 under agency-specific schemes). For EA and EIS we leave it null. Document type (FEIS, DEIS, ROD, EA, FONSI, etc.) comes from metadata when present; when it’s blank we infer from the filename using a fixed list of substrings (e.g. ROD → Record of Decision, FEIS → Final EIS).
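The doc-type fallback works roughly like this (a sketch: metadata wins, otherwise we scan the filename for known substrings, most specific first; the hint list here is illustrative, not the fixed list in backend/db/ingest.py):

```python
# Order matters: check "FEIS"/"DEIS" before the more general "EIS".
FILENAME_HINTS = [("FEIS", "FEIS"), ("DEIS", "DEIS"), ("ROD", "ROD"),
                  ("FONSI", "FONSI"), ("EIS", "EIS"), ("EA", "EA")]

def infer_doc_type(metadata_type, filename):
    if metadata_type:                 # metadata takes precedence when present
        return metadata_type
    upper = (filename or "").upper()
    for hint, doc_type in FILENAME_HINTS:
        if hint in upper:             # e.g. "final_FEIS_vol1.pdf" -> FEIS
            return doc_type
    return None
```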
Milestones
We create milestones from documents that map to a known event type (ROD, NOI, DEIS, FEIS, FONSI, etc.). The event type comes from document_type or the same filename fallback. We store project_id, event_type, optional event_date (when the source provides it), and source_doc (the document id). The Observatory and project detail views use this for timelines.
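Deriving milestone rows from documents can be sketched like this (field names and the exact event-type set are assumptions; the real mapping lives in backend/db/ingest.py):

```python
EVENT_TYPES = {"NOI", "DEIS", "FEIS", "ROD", "FONSI"}

def milestones_for(project_id, docs):
    rows = []
    for doc in docs:
        event = doc.get("doc_type")        # or the filename fallback
        if event in EVENT_TYPES:
            rows.append({
                "project_id": project_id,
                "event_type": event,
                "event_date": doc.get("date"),  # optional, source-provided
                "source_doc": doc["id"],
            })
    return rows

rows = milestones_for("p1", [
    {"id": "d1", "doc_type": "ROD", "date": "2021-06-30"},
    {"id": "d2", "doc_type": "Appendix"},   # no milestone for this one
])
```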
Text and SHA-256
For each document we concatenate all page text from the pages array into one text_content string and store it in documents. We also compute SHA-256 of that text and store it in documents.sha256. The hash is used when building the Solana attestation payload so we can prove the attested flags correspond to the exact document text.
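A minimal sketch of the text-plus-hash step (the page-join separator is an assumption; the hex digest is what lands in documents.sha256 and later in the attestation payload):

```python
import hashlib

def text_and_sha256(pages):
    """Concatenate extracted page text, then hash the result."""
    text = "\n".join(page.get("text", "") for page in pages)
    return text, hashlib.sha256(text.encode("utf-8")).hexdigest()

text, digest = text_and_sha256([{"text": "alpha"}, {"text": "beta"}])
```

Because the hash is over the stored text itself, anyone holding text_content can recompute the digest and verify it against the attested value.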
SQLite and FTS5
Everything lands in a single SQLite database (nepa.db by default; the path is set in backend/config.py). We use WAL mode and foreign keys. We build an FTS5 full-text index on projects over title, agency, and state, so project search is keyword-based and fast. Triggers keep the FTS index in sync on insert/update/delete. Indexes on process_type, state, agency, documents.project_id, is_main, and doc_type keep queries and scans efficient.
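The FTS5 setup can be sketched like this (a trimmed sketch: only an INSERT trigger is shown, the column set is reduced to the searchable fields, and an in-memory DB stands in for nepa.db; the full DDL is in backend/db/schema.sql):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA journal_mode = WAL;
    PRAGMA foreign_keys = ON;
    CREATE TABLE projects (id TEXT PRIMARY KEY, title TEXT, agency TEXT, state TEXT);
    -- External-content FTS5 index over the searchable columns.
    CREATE VIRTUAL TABLE projects_fts USING fts5(
        title, agency, state, content='projects', content_rowid='rowid'
    );
    -- Keep the index in sync (real schema also covers UPDATE/DELETE).
    CREATE TRIGGER projects_ai AFTER INSERT ON projects BEGIN
        INSERT INTO projects_fts(rowid, title, agency, state)
        VALUES (new.rowid, new.title, new.agency, new.state);
    END;
""")
conn.execute("INSERT INTO projects VALUES ('p1', 'Forest trail repair', 'USFS', 'CO')")
hits = conn.execute(
    "SELECT p.id FROM projects_fts "
    "JOIN projects p ON p.rowid = projects_fts.rowid "
    "WHERE projects_fts MATCH 'trail'"
).fetchall()
```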
Schema in short
- projects — id, title, process_type, agency, state, county, (Solana columns), raw_json.
- documents — id, project_id, doc_type, title, filename, page_count, is_main, ce_category, text_content, sha256.
- milestones — project_id, event_type, event_date, source_doc.
- flags — filled by the compliance scanner (see Compliance Signals).
FAST-41 (Stuckness Radar) is separate: we read a CSV from other_data/ and do not load it into this SQLite DB.
Key files to look at
- backend/db/ingest.py — discover_jsonl, ingest_record, _pick_main, filename fallbacks
- backend/db/schema.sql — projects, documents, milestones, FTS5, triggers
- backend/config.py — DB_PATH, NEPATEC_DATA_DIR
In one sentence
We discover NEPATEC2.0 JSONL under nepatec_data/CE|EA|EIS, insert each project and its documents and milestones into SQLite with process-type-aware main-doc selection and filename fallbacks for doc type and milestones, then maintain an FTS5 index so search and downstream features (signals, attestation, Observatory) all use one database.