Doris: An AI Librarian with Local Wikipedia
The Challenge of Knowledge Access
Modern AI assistants typically rely on knowledge encoded in their training data or real-time internet searches. The training data approach means information becomes stale as soon as training concludes, while the internet search approach requires external APIs, rate limits, and network connectivity. Neither solution provides the combination of comprehensive, current, and locally accessible knowledge that would be ideal for many applications.
Wikipedia represents one of humanity's largest collaborative knowledge projects—millions of articles covering virtually every topic imaginable, continuously updated by contributors worldwide. However, Wikipedia's structure is optimized for human browsing rather than programmatic access. The wiki markup format, complex template system, and interconnected article structure make it challenging for AI systems to efficiently extract and utilize this wealth of information.
A Showcase of AI Technologies
Doris is an AI librarian project that demonstrates the integration of several advanced AI and data processing technologies to create a powerful, locally-hosted knowledge assistant. The project combines large-scale data processing, search indexing, LangChain agents, retrieval-augmented generation (RAG), and modern UI frameworks to build a system that can answer factual questions and provide book recommendations primarily from local resources, supplemented by a small set of external APIs.
Rather than being a single focused tool, Doris serves as a comprehensive showcase of how different AI technologies work together. The project demonstrates skills across data engineering, natural language processing, agent-based AI architectures, information retrieval, and frontend development—all working in concert to create a functional AI assistant.
Multi-Stage Data Pipeline
The foundation of Doris is a sophisticated data processing pipeline that transforms Wikipedia's raw data dump into an AI-accessible knowledge base. This pipeline consists of three sequential stages, each addressing specific technical challenges:
Stage 1: Downloading and Extracting Wikipedia
The first stage downloads the complete English Wikipedia dump from Wikimedia's servers. This dump, compressed with bzip2, weighs in at approximately 20GB and expands to roughly 110GB of raw XML data. The implementation handles this massive download with progress tracking via tqdm, streaming the data to avoid memory issues, and managing the decompression of the bz2 archive into a single enormous XML file.
This stage alone demonstrates practical understanding of handling large-scale data transfers, memory-efficient file processing, and user feedback through progress indicators. The fact that the system can handle files measured in tens of gigabytes speaks to thoughtful engineering around resource constraints.
Stage 2: XML to Markdown Conversion
The second stage tackles the complex task of parsing Wikipedia's XML structure and converting wiki markup to clean markdown. This involves processing the 110GB XML file efficiently, extracting article titles and content, cleaning wiki markup syntax (templates, references, categories, file links), converting wiki formatting to markdown equivalents, organizing articles into a directory structure for efficient access, and handling edge cases like redirects, special characters, and long filenames.
The conversion process uses event-driven XML parsing to handle the massive file without loading it entirely into memory. Regular expressions transform wiki syntax, converting bold text ('''text''' becomes **text**), italics (''text'' becomes *text*), headings, and internal links, and stripping extraneous markup such as templates, references, and HTML tags.
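A simplified sketch of this kind of regex-driven conversion, for illustration only; Doris's actual patterns handle far more edge cases, such as nested templates and malformed markup:

```python
import re


def wiki_to_markdown(text: str) -> str:
    """Convert a few common wiki-markup constructs to markdown.

    Order matters: bold (''') must be handled before italics ('').
    """
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # drop simple, non-nested {{templates}}
    text = re.sub(r"<ref[^>/]*/>|<ref[^>]*>.*?</ref>", "", text, flags=re.S)  # references
    text = re.sub(r"'''(.+?)'''", r"**\1**", text)               # '''bold''' -> **bold**
    text = re.sub(r"''(.+?)''", r"*\1*", text)                   # ''italic'' -> *italic*
    text = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r"\2", text)   # [[target|label]] -> label
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)              # [[target]] -> target
    text = re.sub(
        r"^(=+)\s*(.*?)\s*\1\s*$",
        lambda m: "#" * len(m.group(1)) + " " + m.group(2),      # == Head == -> ## Head
        text,
        flags=re.M,
    )
    return text
```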
The result is approximately 32GB of clean markdown files, each representing a single Wikipedia article, organized into subdirectories based on the first two characters of the filename. This organizational structure prevents any single directory from becoming unwieldy and enables efficient file system operations.
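The sharding scheme can be illustrated with a short helper. The two-character shard prefix is the scheme described above; the specific sanitization rules and 200-character filename cap here are assumptions, not Doris's exact choices:

```python
import os
import re


def article_path(root: str, title: str) -> str:
    """Map an article title to root/<first two chars>/<title>.md so no
    single directory accumulates millions of files."""
    safe = re.sub(r'[\\/:*?"<>|]', "_", title)[:200]  # sanitize + cap filename length
    shard = safe[:2].lower() or "__"                  # fallback shard for empty titles
    return os.path.join(root, shard, safe + ".md")
```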
Stage 3: Full-Text Indexing
The third stage creates a searchable index of all article titles using Whoosh, a pure-Python search engine library. This indexing process walks through all generated markdown files, extracts titles from each file, applies stemming analysis to improve search matching, and builds an inverted index for fast lookups.
The indexing leverages multiprocessing to utilize all available CPU cores, processing millions of articles in parallel. The resulting index, approximately 2GB in size, enables near-instantaneous title searches even across millions of articles. This search capability is crucial for the AI agent's ability to quickly locate relevant information.
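Whoosh does this work in Doris. To illustrate the underlying concept, here is a minimal pure-Python inverted index with a deliberately crude suffix stemmer; Whoosh's StemmingAnalyzer applies a proper Porter stemmer and handles scoring, persistence, and tokenization:

```python
from collections import defaultdict


def stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_title_index(titles):
    """Inverted index: stemmed term -> set of article titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[stem(word)].add(title)
    return index


def search(index, query):
    """Return titles matching every stemmed query term."""
    terms = [stem(w) for w in query.lower().split()]
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

Because both titles and queries pass through the same stemmer, "languages" and "language" can land on the same index term, which is the matching improvement the stemming analysis provides.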
LangChain Agent Architecture
With the knowledge base prepared, Doris implements a LangChain agent that can intelligently use tools to answer user queries. This agent architecture represents a significant advance over simple prompt-response systems—the AI doesn't just generate text, it actively decides which tools to use and how to use them to accomplish tasks.
The agent has access to two primary tools:
get_factual_info: Searches the local Wikipedia index for factual information. When invoked with a query, this tool searches the title index for matching articles, ranks results by relevance, selects the most appropriate article, reads the article content, and returns a sample with source attribution.
search_books: Queries the Google Books API for book recommendations. This tool constructs properly formatted API requests, parses JSON responses, extracts relevant book information (title, authors, ISBN), and formats results for the AI to present to users.
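A hedged sketch of what such a tool might do internally. The volumes endpoint and the response fields (items, volumeInfo, industryIdentifiers) come from the public Google Books API; the function names and output format here are assumptions rather than Doris's actual code:

```python
import urllib.parse


def build_books_url(query: str, max_results: int = 5) -> str:
    """Build a Google Books volumes query with proper URL encoding."""
    params = urllib.parse.urlencode({"q": query, "maxResults": max_results})
    return f"https://www.googleapis.com/books/v1/volumes?{params}"


def format_books(response: dict) -> list:
    """Extract title/authors/ISBN from a parsed JSON response,
    falling back to defaults when fields are missing."""
    lines = []
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        title = info.get("title", "Unknown title")
        authors = ", ".join(info.get("authors", ["Unknown author"]))
        isbn = next(
            (i["identifier"] for i in info.get("industryIdentifiers", [])
             if i.get("type") == "ISBN_13"),
            "no ISBN",
        )
        lines.append(f"{title} by {authors} (ISBN: {isbn})")
    return lines
```

The `.get(..., default)` chain is what makes the tool robust to sparse records, since many volumes lack authors or identifiers.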
The agent decides autonomously which tool to use based on the user's question. Asking about historical facts triggers Wikipedia searches, while requesting book recommendations invokes the Google Books API. The system can chain multiple tool uses together, searching Wikipedia to gather context before recommending related books, or trying multiple search queries if the first attempt doesn't yield results.
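LangChain's agent executor implements this decide-act-observe loop internally. Stripped of the framework, the pattern looks roughly like this; a conceptual sketch only, with an invented decision format, not LangChain's API:

```python
def run_agent(llm, tools, question, max_steps=5):
    """Minimal tool-use loop: on each step the LLM either picks a tool
    ({"tool": name, "input": arg}) or finishes ({"answer": text}).
    Observations accumulate in a scratchpad the LLM sees next turn."""
    scratchpad = []
    for _ in range(max_steps):
        decision = llm(question, scratchpad)
        if "answer" in decision:
            return decision["answer"]
        observation = tools[decision["tool"]](decision["input"])
        scratchpad.append((decision, observation))
    return "Gave up after too many steps."
```

Chaining falls out naturally: each tool result lands in the scratchpad, so the model can issue a Wikipedia lookup first and a book search second before answering.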
Retrieval-Augmented Generation
Doris implements a form of retrieval-augmented generation (RAG), a technique that enhances language models by providing them with relevant information retrieved from external sources. Rather than relying solely on the model's training data, the system retrieves specific Wikipedia content related to the user's query and includes that content in the prompt sent to the language model.
This approach provides several advantages: answers can reference information beyond the model's training cutoff date, responses include proper source attribution with file paths, factual accuracy improves since claims can be grounded in retrieved text, and the system remains functional even without internet connectivity (except for the LLM API itself).
The retrieval strategy includes deliberate query optimization. The system prompt instructs the agent to use broad, general queries matching potential article titles rather than question-like strings. For "Where was Thomas Jefferson born?", it searches "Thomas Jefferson" rather than the full question. This query transformation significantly improves retrieval accuracy.
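The augmented prompt can be assembled along these lines. This is a sketch with assumed wording and a hypothetical character budget, not Doris's actual prompt template:

```python
def build_rag_prompt(question: str, article_text: str, source_path: str,
                     max_chars: int = 4000) -> str:
    """Splice retrieved Wikipedia text plus source attribution into the
    prompt sent to the language model."""
    context = article_text[:max_chars]  # stay within the model's context budget
    return (
        "Answer using ONLY the context below, and cite the source file.\n\n"
        f"Context (from {source_path}):\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Grounding works because the model's answer is conditioned on the retrieved passage rather than on whatever it memorized during training, and the embedded file path gives the attribution for free.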
Intelligent Article Selection
When multiple Wikipedia articles match a query, Doris employs a custom ranking algorithm to select the most relevant one. The system calculates word overlap between the query and each article title, considers Whoosh's relevance scores, and combines these factors to identify the best match.
This simple but effective approach often outperforms relying solely on the search engine's scoring, particularly for ambiguous queries where multiple articles might be relevant but one is clearly more appropriate given the context.
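A minimal version of such a blended ranking might look like the following. The 0.7/0.3 weights and the assumption of normalized engine scores are illustrative; Doris's exact formula may differ:

```python
def rank_titles(query, candidates):
    """Pick the best (title, engine_score) pair by blending query/title
    word overlap with the search engine's own relevance score."""
    q_words = set(query.lower().split())

    def combined(item):
        title, score = item
        t_words = set(title.lower().split())
        overlap = len(q_words & t_words) / max(len(t_words), 1)
        return 0.7 * overlap + 0.3 * score  # illustrative weights

    return max(candidates, key=combined)[0]
```

The overlap term penalizes long, tangential titles even when the engine scores them highly, which is exactly the ambiguous-query case described above.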
Dual Interface Design
Doris provides two distinct user interfaces, demonstrating versatility in deployment options:
Command-Line Interface: A terminal-based chat interface for users comfortable with CLI tools. This implementation maintains conversation history across turns, displays verbose logging of agent decisions and tool invocations, and provides a lightweight option for server deployments or automated testing.
Streamlit Web Interface: A modern web-based UI accessible through any browser. The Streamlit implementation includes a chat interface with message history, sidebar API key configuration, loading states and spinners during processing, and markdown rendering for formatted responses. The web interface makes Doris accessible to non-technical users while maintaining all the functionality of the CLI version.
Both interfaces use the same underlying agent, ensuring consistent behavior regardless of which interface users choose. This separation of concerns—business logic in the agent, presentation in the interface—demonstrates solid software architecture principles.
Conversation Memory and Context
The agent maintains conversation history, enabling multi-turn interactions where context carries forward. Users can ask follow-up questions, request clarification, or build on previous topics without repeating information. The LangChain memory system preserves both human messages and AI responses, providing the agent with complete context for each new turn.
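Conceptually, the memory is a bounded buffer of (role, message) pairs that gets rendered back into each new prompt; LangChain's memory classes play this role in Doris. A stdlib sketch with an assumed turn limit:

```python
class ConversationMemory:
    """Minimal chat-history buffer: keeps the last max_turns exchanges."""

    def __init__(self, max_turns=20):
        self.max_turns = max_turns
        self.messages = []  # (role, text) pairs, oldest first

    def add(self, role, text):
        self.messages.append((role, text))
        # one turn = one human message + one AI message
        self.messages = self.messages[-2 * self.max_turns:]

    def as_prompt(self):
        """Render history for inclusion in the next model call."""
        return "\n".join(f"{role}: {text}" for role, text in self.messages)
```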
This conversational capability transforms Doris from a question-answer system into an interactive assistant that can engage in more natural, flowing dialogue about complex topics.
Technical Implementation Highlights
Several aspects of the implementation deserve particular attention:
Scalable Processing: Handling 110GB XML files and generating 32GB of processed data requires memory-efficient streaming approaches, chunk-based processing, and careful resource management. The implementation never attempts to load entire datasets into memory.
Parallel Processing: Multi-core indexing significantly reduces the time required to index millions of articles. The use of multiprocessing pools demonstrates understanding of Python's concurrency model and how to leverage modern hardware.
Robust Text Processing: The wiki markup conversion handles dozens of edge cases through carefully crafted regular expressions. The system deals with nested templates, HTML entities, special characters, malformed markup, and file path limitations.
Progress Feedback: Every long-running operation includes tqdm progress bars showing completion percentage, processing speed, and estimated time remaining. This attention to user experience makes the multi-hour setup process tolerable.
Error Handling: The system gracefully handles missing files, network errors, malformed data, and edge cases throughout the pipeline. Each stage validates preconditions and provides clear error messages when issues arise.
API Integration: The Google Books integration demonstrates proper API usage including URL encoding, JSON parsing, handling missing fields with defaults, and rate limit considerations.
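The parallel-processing highlight above can be sketched with multiprocessing.Pool. The worker function and chunksize here are illustrative; Doris's actual indexing workers do more than pull a title out of a markdown file:

```python
from multiprocessing import Pool


def extract_title(markdown_text: str) -> str:
    """Worker: pull the title from an article's first heading line."""
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return ""


def extract_all(articles, workers=4):
    """Fan article processing out across CPU cores; chunksize batches
    work items to amortize inter-process communication overhead."""
    with Pool(workers) as pool:
        return pool.map(extract_title, articles, chunksize=64)
```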
Educational and Portfolio Value
Beyond its utility as a functional AI assistant, Doris offers exceptional value as a learning resource and portfolio piece. The project demonstrates proficiency with:
- Data Engineering: Large-scale data pipeline design and implementation
- Natural Language Processing: Text cleaning, normalization, and transformation
- Information Retrieval: Search indexing and relevance ranking
- Agent-Based AI: LangChain framework and tool-using agents
- RAG Techniques: Retrieval-augmented generation patterns
- API Integration: External service integration and error handling
- Frontend Development: Both CLI and web-based user interfaces
- Software Architecture: Separation of concerns and modular design
For someone evaluating technical capabilities, Doris provides concrete evidence of the ability to work across the full stack—from data processing through AI implementation to user interface design.
Practical Considerations
The README includes important practical guidance about costs and requirements. Because the AI agent makes frequent LLM calls during operation, using paid APIs like OpenAI can become expensive quickly. The project recommends using local open-source models or alternative endpoints that don't charge per token.
This cost awareness demonstrates real-world deployment experience—understanding that impressive demos can have prohibitive operational costs if not carefully designed. The LLM endpoint is abstracted into a separate module specifically to enable users to swap in cost-effective alternatives.
The multi-hour setup process (downloading, extracting, converting, and indexing) requires patience and adequate disk space (approximately 144GB for all stages). The documentation sets clear expectations about resource requirements and processing time, helping users plan appropriately.
Potential Extensions and Improvements
While fully functional, Doris could be extended in numerous directions:
Enhanced Retrieval: Semantic search using embeddings instead of keyword matching would improve relevance for complex queries. Chunking articles and searching at the paragraph level rather than full articles would provide more precise context.
Additional Tools: Integration with other knowledge sources (academic papers, news, technical documentation) would expand the agent's capabilities. Web browsing tools could supplement the static Wikipedia snapshot with current information.
Caching and Optimization: Caching frequent queries and search results would reduce latency. Pre-computing embeddings for all articles would enable faster semantic search.
Improved Context Management: Better strategies for selecting which article portions to include in context would maximize information density. Multi-hop reasoning across multiple articles would enable more sophisticated answers.
User Features: Citation management, export functionality, query history, and personalization would enhance the user experience.
Comparison to Modern RAG Systems
Doris predates many commercial RAG systems but implements the same core concepts. Modern systems might use vector databases instead of keyword search, but the fundamental pattern—retrieve relevant information, augment the prompt, generate a grounded response—remains consistent.
What makes Doris particularly interesting is its local-first approach. While many RAG systems rely on cloud services for both the knowledge base and the LLM, Doris demonstrates that substantial knowledge can be processed and stored locally, with only the LLM inference requiring external services (and even that could be localized with appropriate models).
Open Source Availability
Doris is open source and available on GitHub at github.com/andrewcampi/doris. The repository includes all pipeline scripts, agent implementation, both user interfaces, and comprehensive setup documentation.
The project serves as both a functional tool and an educational resource for those interested in building their own RAG systems, processing large datasets, or implementing LangChain agents. The code is well-commented, making it accessible for learning and adaptation to other use cases.
Doris represents a comprehensive exploration of modern AI assistant technology, demonstrating how large-scale data processing, intelligent information retrieval, and agent-based AI can combine to create capable, locally-hosted knowledge systems. The project showcases technical breadth across multiple domains while maintaining practical functionality as a working AI librarian.