Architecture Overview

The extrai library follows a modular, multi-stage pipeline to transform unstructured text into structured, database-ready objects. This document gives an overview of that architecture, covering both the standard workflow and the optional dynamic model generation workflow.

Core Workflow Diagram

The following diagram illustrates the complete workflow, including the optional dynamic model generation path.

    graph TD
    %% Define styles for different stages for better colors
    classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
    classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
    classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
    classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
    classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87

    subgraph "Inputs (Static Mode)"
        A["📄<br/>Documents"]
        B["🏛️<br/>SQLAlchemy Models"]
        L1["🤖<br/>LLM"]
    end

    subgraph "Inputs (Dynamic Mode)"
        C["📋<br/>Task Description<br/>(User Prompt)"]
        D["📚<br/>Example Documents"]
        L2["🤖<br/>LLM"]
    end

    subgraph "Model Generation<br/>(Optional)"
        MG("🔧<br/>Generate SQLModels<br/>via LLM")
    end

    subgraph "Data Extraction"
        EG("📝<br/>Example Generation<br/>(Optional)")
        P("✍️<br/>Prompt Generation")

        subgraph "LLM Extraction Revisions"
            direction LR
            E1("🤖<br/>Revision 1")
            H1("💧<br/>SQLAlchemy Hydration 1")
            E2("🤖<br/>Revision 2")
            H2("💧<br/>SQLAlchemy Hydration 2")
            E3("🤖<br/>...")
            H3("💧<br/>...")
        end

        F("🤝<br/>JSON Consensus")
        H("💧<br/>SQLAlchemy Hydration")
    end

    subgraph Outputs
        SM["🏛️<br/>Generated SQLModels<br/>(Optional)"]
        O["✅<br/>Hydrated Objects"]
        DB("💾<br/>Database Persistence<br/>(Optional)")
    end

    %% Connections for Static Mode
    L1 --> P
    A --> P
    B --> EG
    EG --> P
    P --> E1
    P --> E2
    P --> E3
    E1 --> H1
    E2 --> H2
    E3 --> H3
    H1 --> F
    H2 --> F
    H3 --> F
    F --> H
    H --> O
    H --> DB

    %% Connections for Dynamic Mode
    L2 --> MG
    C --> MG
    D --> MG
    MG --> EG
    EG --> P

    MG --> SM

    %% Apply styles
    class A,B,C,D,L1,L2 inputStyle;
    class P,E1,E2,E3,H,EG processStyle;
    class F consensusStyle;
    class O,DB,SM outputStyle;
    class MG modelGenStyle;
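
The static-mode path in the diagram is a fan-out followed by a fan-in: one prompt pair is sent to the LLM several times, and the resulting revisions are merged into a single result. The sketch below illustrates only that shape. It is not the WorkflowOrchestrator API: `run_static_pipeline`, `call_llm`, and `build_consensus` are hypothetical names, the LLM is a stand-in callable, and validation is reduced to "parses as JSON" rather than a full schema check.

```python
import json
from typing import Any, Callable


def run_static_pipeline(
    documents: list[str],
    schema: dict,
    call_llm: Callable[[str, str], str],      # hypothetical: (system, user) -> raw text
    build_consensus: Callable[[list], Any],   # hypothetical: revisions -> merged result
    num_revisions: int = 3,
) -> Any:
    """Fan out N LLM calls on one prompt pair, then fan in to a single result."""
    # Prompt generation: schema in the system prompt, documents in the user prompt.
    system_prompt = "Return JSON matching this schema:\n" + json.dumps(schema)
    user_prompt = "\n\n".join(documents)

    # Revisions: each call is independent, so the outputs may differ.
    raw_revisions = [call_llm(system_prompt, user_prompt) for _ in range(num_revisions)]

    # Simplified validation: keep only the revisions that parse as JSON.
    valid = []
    for raw in raw_revisions:
        try:
            valid.append(json.loads(raw))
        except json.JSONDecodeError:
            continue
    if not valid:
        raise ValueError("no revision produced valid JSON")

    # Fan-in: merge the surviving revisions into one object.
    return build_consensus(valid)


# Usage with a fake LLM that returns one malformed reply out of three.
replies = iter(['{"name": "Ada"}', "not json", '{"name": "Ada"}'])
result = run_static_pipeline(
    documents=["Ada wrote the first program."],
    schema={"type": "object", "properties": {"name": {"type": "string"}}},
    call_llm=lambda system, user: next(replies),
    build_consensus=lambda revisions: revisions[0],  # placeholder merge strategy
)
```

Passing the LLM call and the merge strategy in as callables keeps the sketch runnable without network access and makes it easy to swap in a real client or a different consensus rule.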
    

Workflow Stages

The library processes data through the following stages:

  1. Dynamic Model Generation (Optional): In this mode, the SQLModelCodeGenerator uses an LLM to generate SQLModel class definitions from a high-level task description and example documents. This is ideal when the data schema is not known in advance.

  2. Document Ingestion: The WorkflowOrchestrator accepts one or more text documents as the primary input for extraction.

  3. Schema Introspection: The library inspects the provided SQLModel classes (either predefined or dynamically generated) to create a detailed JSON schema. This schema tells the LLM exactly what output format to produce.

  4. Example Generation (Optional): To improve the accuracy of the LLM, the ExampleJSONGenerator can create few-shot examples from the schema. These examples are included in the prompt to give the LLM a clear template to follow.

  5. Prompt Generation: The PromptBuilder combines the JSON schema, the input documents, and any few-shot examples into a comprehensive system prompt and a user prompt.

  6. LLM Interaction & Revisioning: The configured LLMClient sends the prompts to the LLM to produce multiple, independent JSON structures (revisions). The consensus mechanism depends on this step, because it needs several independent outputs to compare.

  7. JSON Validation & Consensus: Each JSON revision from the LLM is validated against the schema. The JSONConsensus class then takes all valid revisions and applies a consensus algorithm to resolve discrepancies, producing a single, unified JSON object.

  8. SQLAlchemy Object Hydration: The SQLAlchemyHydrator transforms the final consensus JSON into a graph of SQLModel instances, correctly linking related objects.

  9. Database Persistence (Optional): The hydrated SQLModel objects can be saved to a relational database via a standard SQLAlchemy session.
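
To make the consensus stage more concrete, here is a minimal sketch of one possible approach: per-field majority voting over flattened JSON paths. This is an assumption for illustration, not a description of the algorithm the JSONConsensus class implements. The function names (`flatten`, `unflatten`, `majority_consensus`) and the `threshold` parameter are hypothetical.

```python
import json
from collections import Counter
from typing import Any


def flatten(obj: Any, prefix: tuple = ()) -> dict:
    """Flatten nested dicts/lists into {path_tuple: leaf_value}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, prefix + (key,)))
    return flat


def unflatten(flat: dict) -> Any:
    """Rebuild nested dicts/lists from {path_tuple: leaf_value}."""
    if not flat:
        return {}
    if () in flat:
        return flat[()]
    groups: dict = {}
    for path, value in flat.items():
        groups.setdefault(path[0], {})[path[1:]] = value
    # Integer keys came from list indices, so rebuild a list in index order.
    if all(isinstance(key, int) for key in groups):
        return [unflatten(groups[key]) for key in sorted(groups)]
    return {key: unflatten(sub) for key, sub in groups.items()}


def majority_consensus(revisions: list, threshold: float = 0.5) -> Any:
    """Keep each leaf value that more than `threshold` of the revisions agree on."""
    votes: dict = {}
    for revision in revisions:
        for path, value in flatten(revision).items():
            # Serialise leaves so that 1, 1.0, and True are counted separately.
            votes.setdefault(path, Counter())[json.dumps(value)] += 1
    agreed = {}
    for path, counter in votes.items():
        value, count = counter.most_common(1)[0]
        if count / len(revisions) > threshold:
            agreed[path] = json.loads(value)
    return unflatten(agreed)


# Three revisions that disagree on one field each.
revisions = [
    {"invoice": {"number": "INV-1", "total": 120.0, "currency": "EUR"}},
    {"invoice": {"number": "INV-1", "total": 120.0, "currency": "USD"}},
    {"invoice": {"number": "INV-1", "total": 210.0, "currency": "EUR"}},
]
consensus = majority_consensus(revisions)
# Fields with no strict majority are dropped rather than guessed.
partial = majority_consensus([{"a": 1, "b": 2}, {"a": 1, "b": 3}])
```

In this sketch a field that fails to reach a strict majority is omitted from the result, which trades completeness for confidence. A real implementation would also need a policy for aligning list items across revisions, which positional indices handle only crudely.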