.. _architecture_overview:
Architecture Overview
=====================
The `extrai` library follows a modular, multi-stage pipeline to transform unstructured text into structured, database-ready objects. This document provides an overview of this architecture, covering both the standard and optional dynamic model generation workflows.
Core Workflow Diagram
---------------------
The following diagram illustrates the complete workflow, including the optional dynamic model generation path.
.. mermaid::
    graph TD
        %% Define styles for different stages for better colors
        classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
        classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
        classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
        classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
        classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87
        subgraph "Inputs (Static Mode)"
            A["📄
Documents"]
            B["🏛️
SQLAlchemy Models"]
            L1["🤖
LLM"]
        end
        subgraph "Inputs (Dynamic Mode)"
            C["📋
Task Description
(User Prompt)"]
            D["📚
Example Documents"]
            L2["🤖
LLM"]
        end
        subgraph "Model Generation
(Optional)"
            MG("🔧
Generate SQLModels
via LLM")
        end
        subgraph "Data Extraction"
            EG("📝
Example Generation
(Optional)")
            P("✍️
Prompt Generation")
            
            subgraph "LLM Extraction Revisions"
                direction LR
                E1("🤖
Revision 1")
                H1("💧
SQLAlchemy Hydration 1")
                E2("🤖
Revision 2")
                H2("💧
SQLAlchemy Hydration 2")
                E3("🤖
...")
                H3("💧
...")
            end
            
            F("🤝
JSON Consensus")
            H("💧
SQLAlchemy Hydration")
        end
        subgraph Outputs
            SM["🏛️
Generated SQLModels
(Optional)"]
            O["✅
Hydrated Objects"]
            DB("💾
Database Persistence
(Optional)")
        end
        %% Connections for Static Mode
        L1 --> P
        A --> P
        B --> EG
        EG --> P
        P --> E1
        P --> E2
        P --> E3
        E1 --> H1
        E2 --> H2
        E3 --> H3
        H1 --> F
        H2 --> F
        H3 --> F
        F --> H
        H --> O
        H --> DB
        %% Connections for Dynamic Mode
        L2 --> MG
        C --> MG
        D --> MG
        MG --> EG
        EG --> P
        MG --> SM
        %% Apply styles
        class A,B,C,D,L1,L2 inputStyle;
        class P,E1,E2,E3,H,EG processStyle;
        class F consensusStyle;
        class O,DB,SM outputStyle;
        class MG modelGenStyle;
Workflow Stages
---------------
The library processes data through the following stages:
0.  **Dynamic Model Generation (Optional)**: In this mode, the `SQLModelCodeGenerator` uses an LLM to generate `SQLModel` class definitions from a high-level task description and example documents. This is ideal when the data schema is not known in advance.
1.  **Documents Ingestion**: The `WorkflowOrchestrator` accepts one or more text documents as the primary input for extraction.
2.  **Schema Introspection**: The library inspects the provided `SQLModel` classes (either predefined or dynamically generated) to create a detailed JSON schema. This schema is crucial for instructing the LLM on the desired output format.
3.  **Example Generation (Optional)**: To improve the accuracy of the LLM, the `ExampleJSONGenerator` can create few-shot examples from the schema. These examples are included in the prompt to give the LLM a clear template to follow.
4.  **Prompt Generation**: The `PromptBuilder` combines the JSON schema, the input documents, and any few-shot examples into a comprehensive system prompt and a user prompt.
5.  **LLM Interaction & Revisioning**: The configured `LLMClient` sends the prompts to the LLM to produce multiple, independent JSON structures (revisions). This step is fundamental to the consensus mechanism.
6.  **JSON Validation & Consensus**: Each JSON revision from the LLM is validated against the schema. The `JSONConsensus` class then takes all valid revisions and applies a consensus algorithm to resolve discrepancies, producing a single, unified JSON object.
7.  **SQLAlchemy Object Hydration**: The `SQLAlchemyHydrator` transforms the final consensus JSON into a graph of `SQLModel` instances, correctly linking related objects.
8.  **Database Persistence (Optional)**: The hydrated `SQLModel` objects can be saved to a relational database via a standard SQLAlchemy session.