JFK File Explorer

Beyond the Data Dump: The Problem with "Accessible" Government Records

When the National Archives released the declassified JFK assassination documents, they fulfilled the letter of transparency law but fell short of its spirit. Yes, the files were technically "accessible" - but try searching through 130,000+ pages of scanned PDFs filled with handwritten notes, redactions, and 1960s typewritten reports. It's like being given access to a library where every book's pages have been scattered across the floor.

That problem isn't malicious - it's practical. Government agencies aren't in the business of creating user-friendly research interfaces. They release what they have, in the format they have it in. This creates a barrier between the public and declassified information. Raw data isn't the same as accessible knowledge.

That's exactly the problem I set out to solve with the JFK File Explorer. Instead of promoting conspiracy theories or cherry-picking quotes myself, I wanted to create a tool that would let researchers, journalists, and curious citizens interact with the actual primary source material in a meaningful way.

The Technical Challenge: Running Concurrent VLMs for OCR over Several Days

Converting these documents into searchable format required solving several complex problems simultaneously. These weren't clean, digital-native PDFs - they were scans of decades-old paperwork, complete with handwritten annotations, stamps, redactions, and the inevitable degradation that comes with age.

Traditional OCR tools would have struggled with this content. Instead, I leveraged state-of-the-art Vision Language Models (VLMs) that can understand both the visual layout and contextual meaning of document content - essentially enabling AI to read like a human researcher would.

The Multi-Stage Pipeline

My approach involved a carefully orchestrated five-stage process:

First was link extraction - I programmatically gathered URLs to all declassified PDFs from the National Archives website. This meant parsing index pages in various formats and handling special cases where the government had used different organizational schemes across different release years.
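The core of that first stage can be sketched with Python's standard library alone. This is a minimal illustration, not the production scraper - the real site spans several index pages, and the example URL below is just for demonstration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PDFLinkParser(HTMLParser):
    """Collects hrefs ending in .pdf from anchor tags on an index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                # Resolve relative links against the index page's URL.
                self.links.append(urljoin(self.base_url, value))


def extract_pdf_links(html, base_url):
    """Return absolute URLs for every PDF linked from one index page."""
    parser = PDFLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

In practice this runs once per index page, with the results deduplicated and written to a manifest file that drives the download stage.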

Next was mass PDF downloading - systematically pulling down every document to create a local repository. This step alone involved handling tens of thousands of unique PDF files, each requiring careful error handling and rate limiting to avoid overwhelming the government servers.
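The downloader itself is simple; the care goes into throttling and retries. A sketch of the pattern (the delay values are illustrative, not the exact numbers I used):

```python
import time
import urllib.request
from pathlib import Path


def backoff_delays(retries=3, base=2.0):
    """Exponential backoff schedule in seconds: 2, 4, 8, ..."""
    return [base * (2 ** i) for i in range(retries)]


def download_pdf(url, dest_dir, delay=1.0, retries=3):
    """Fetch one PDF, skipping files already on disk and retrying on error."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if dest.exists():  # resume-friendly: a rerun picks up where it left off
        return dest
    for wait in backoff_delays(retries):
        try:
            urllib.request.urlretrieve(url, dest)
            time.sleep(delay)  # throttle so we don't hammer the server
            return dest
        except OSError:
            time.sleep(wait)  # back off before the next attempt
    raise RuntimeError(f"failed to fetch {url}")
```

The skip-if-exists check matters more than it looks: over tens of thousands of files, the run will be interrupted at least once, and idempotent retries beat bookkeeping.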

The third stage was image conversion - transforming each page of the PDFs into high-resolution images. This preprocessing step was crucial for the VLM analysis that would follow, ensuring the models had crisp, clear images to work with.
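PyMuPDF is one reasonable way to do this rasterization step; the sketch below assumes it, with 300 DPI as an illustrative OCR-friendly resolution rather than my exact setting:

```python
def page_pixels(width_pts, height_pts, dpi=300):
    """A PDF point is 1/72 inch; convert a page size in points to pixels."""
    scale = dpi / 72
    return round(width_pts * scale), round(height_pts * scale)


def pdf_to_pngs(pdf_path, out_dir, dpi=300):
    """Render every page of one PDF to a numbered PNG for the VLM stage."""
    import fitz  # third-party: pip install pymupdf
    doc = fitz.open(pdf_path)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)  # rasterize the page at `dpi`
        pix.save(f"{out_dir}/page_{i:04d}.png")
    doc.close()
```

At 300 DPI a standard US Letter page (612 x 792 points) comes out at 2550 x 3300 pixels - large enough for faded typewriter text, small enough to keep the VLM stage fast.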

VLM-powered OCR was where the magic happened. I rented two RTX 5090 machines on VastAI at approximately $100 total cost, running Gemma3:12b via Ollama+Llama.cpp to perform optical character recognition on every page. The model didn't just extract text - it understood document structure, preserved formatting, and even transcribed handwritten notes into clean markdown files. When it was unsure of the text or came across redactions, it used “[illegible]” and “[redacted]” markers, which minimized AI hallucinations. In all, this step processed over 130,000 individual page images and took four days.
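The OCR loop is easy to sketch because Ollama exposes a plain HTTP endpoint, `/api/generate`, that accepts base64-encoded images alongside the prompt. The prompt below paraphrases the instructions described above; it is illustrative, not my exact production prompt:

```python
import base64
import json
import urllib.request

OCR_PROMPT = (
    "Transcribe this scanned page to clean markdown. "
    "Mark unreadable text as [illegible] and blacked-out text as [redacted]. "
    "Do not guess at missing words."
)


def build_payload(image_bytes, model="gemma3:12b"):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": OCR_PROMPT,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # one complete response per page, not a token stream
    }


def ocr_page(image_path, host="http://localhost:11434"):
    """Send one page image to a local Ollama instance; return its markdown."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Explicitly permitting "[illegible]" and "[redacted]" is what keeps the model honest: given an allowed way to say "I can't read this," it stops inventing plausible-looking text for blacked-out passages.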

Finally came structured data extraction - using GPT-4o-mini to read through the markdown content and identify key information like people, organizations, event timelines, and document summaries. Processing 30 markdown documents concurrently, this stage ran for about 20 hours, cost about $70 in OpenAI credits, and transformed the raw text into structured YAML data ready for database ingestion.
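Because each document is an independent I/O-bound API call, the concurrency here is plain thread-pool fan-out. A sketch of the pattern - the prompt and helper names are mine, and `client` is assumed to be an `openai.OpenAI` instance from the openai-python library:

```python
from concurrent.futures import ThreadPoolExecutor

EXTRACT_PROMPT = (
    "From the document below, extract the people, organizations, and event "
    "timeline it mentions, plus a one-paragraph summary. Answer as YAML with "
    "keys: people, organizations, timeline, summary.\n\n{text}"
)


def extract_structured(markdown_text, client):
    """Send one markdown document to GPT-4o-mini; return its YAML answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": EXTRACT_PROMPT.format(text=markdown_text)}],
    )
    return resp.choices[0].message.content


def extract_all(documents, worker, concurrency=30):
    """Run `worker` over all documents with up to `concurrency` in flight.

    ThreadPoolExecutor.map preserves input order in the results, so output
    row N always corresponds to input document N.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(worker, documents))
```

Threads (rather than asyncio) keep the sketch simple: the OpenAI client is blocking, and thirty concurrent requests is well within what a thread pool handles comfortably.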

Creating an Interactive Knowledge System

With the documents converted to structured data, I could build something far more powerful than a simple search interface. I chose Weaviate as the vector database, which enables semantic search capabilities that go beyond keyword matching.
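With the v4 Weaviate Python client, a semantic query is a few lines. The collection name and property names below are illustrative stand-ins for my actual schema, and `near_text` assumes the collection was created with a vectorizer configured:

```python
def semantic_search(question, limit=5):
    """Return (title, excerpt) pairs for pages semantically near the question."""
    import weaviate  # third-party: pip install weaviate-client (v4 API)
    client = weaviate.connect_to_local()
    try:
        docs = client.collections.get("JfkDocument")  # hypothetical collection
        result = docs.query.near_text(query=question, limit=limit)
        return [(o.properties["title"], o.properties["text"])
                for o in result.objects]
    finally:
        client.close()


def format_hits(hits):
    """Render (title, excerpt) pairs as quotable citation lines for the UI."""
    return [f'"{excerpt}" ({title})' for title, excerpt in hits]
```

This is what makes queries like "meetings in Mexico City" useful even when no document contains those exact words - the search runs over embeddings, not keywords.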

The system I built offers multiple ways to interact with the data:

The web interface at https://jfk.andrewcampi.com provides an intuitive file explorer with a built-in AI copilot powered by Llama-3.3:70b hosted on Groq. Users can search using natural language queries and get intelligent summaries of relevant documents, complete with specific quotes and citations.

For ChatGPT users, I created a Custom GPT that has direct access to the JFK File Explorer MCP (Model Context Protocol) Server, allowing conversational exploration of the document collection within the familiar ChatGPT interface.

For developers and researchers who want to build their own tools, I provide access via the Model Context Protocol (MCP) server at https://jfk-mcp.andrewcampi.com/mcp, enabling custom integrations and agentic AI analysis workflows.

Users can even log in as "guest" to download the markdown and YAML datasets directly, ensuring the processed data is as open and accessible as possible.

Democratizing Knowledge, Not Promoting Conspiracy or Misinformation

This project's intent was never to fuel speculation or conspiracy theories. Quite the opposite - by making the primary source material genuinely searchable and accessible, it enables evidence-based research and helps combat misinformation.

The government essentially performed a data dump of scanned documents without providing any meaningful way to understand what was actually in them. Yes, the files were accessible, but the knowledge and information contained within them was not. There's a crucial difference between data availability and information accessibility.

The JFK File Explorer bridges that gap. Instead of researchers spending months manually combing through documents or relying on secondhand summaries, they can now query the entire collection with questions like "Who did Oswald meet with in Mexico City?" or "Did Oswald know Jack Ruby before the assassination?"

The Future of AI-Powered Historical Research

This project demonstrates something profound about AI's potential role in democratic society. Modern AI tools can transform how we interact with public information, making government transparency more than just a legal checkbox.

The technical architecture I built is entirely reusable. The same pipeline could process any large collection of scanned historical documents - from congressional hearings to declassified intelligence reports to historical archives. The cost of this transformation continues to decrease as AI capabilities improve and computing becomes more accessible.

Perhaps most importantly, this approach preserves the integrity of the source material while making it infinitely more usable. Every search result links back to the original document. Every claim can be verified against the primary source. The AI doesn't interpret or editorialize - it simply makes the existing information discoverable and understandable.

The JFK File Explorer represents a new paradigm for how AI can serve the public interest: not by replacing human judgment, but by removing the technical barriers that prevent citizens from accessing and understanding their own government's records. In a world where misinformation spreads faster than facts, tools that make primary sources more accessible aren't just technically interesting - they're democratically essential.