Architecture Overview
The extrai library follows a modular, multi-stage pipeline to transform unstructured text into structured, database-ready objects. This document provides an overview of this architecture, covering both the standard and optional dynamic model generation workflows.
Core Workflow Diagram
The following diagram illustrates the complete workflow, including the optional dynamic model generation path.
graph TD
%% Define styles for different stages for better colors
classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87
subgraph "Inputs (Static Mode)"
A["Documents"]
B["SQLAlchemy Models"]
L1["LLM"]
end
subgraph "Inputs (Dynamic Mode)"
C["Task Description<br/>(User Prompt)"]
D["Example Documents"]
L2["LLM"]
end
subgraph "Model Generation<br/>(Optional)"
MG("Generate SQLModels<br/>via LLM")
end
subgraph "Data Extraction"
EG("Example Generation<br/>(Optional)")
P("Prompt Generation")
subgraph "LLM Extraction Revisions"
direction LR
E1("Revision 1")
H1("SQLAlchemy Hydration 1")
E2("Revision 2")
H2("SQLAlchemy Hydration 2")
E3("...")
H3("...")
end
F("JSON Consensus")
H("SQLAlchemy Hydration")
end
subgraph Outputs
SM["Generated SQLModels<br/>(Optional)"]
O["Hydrated Objects"]
DB("Database Persistence<br/>(Optional)")
end
%% Connections for Static Mode
L1 --> P
A --> P
B --> EG
EG --> P
P --> E1
P --> E2
P --> E3
E1 --> H1
E2 --> H2
E3 --> H3
H1 --> F
H2 --> F
H3 --> F
F --> H
H --> O
H --> DB
%% Connections for Dynamic Mode
L2 --> MG
C --> MG
D --> MG
MG --> EG
MG --> SM
%% Apply styles
class A,B,C,D,L1,L2 inputStyle;
class P,E1,E2,E3,H1,H2,H3,H,EG processStyle;
class F consensusStyle;
class O,DB,SM outputStyle;
class MG modelGenStyle;
Workflow Stages
The library processes data through the following stages:
Dynamic Model Generation (Optional): In this mode, the SQLModelCodeGenerator uses an LLM to generate SQLModel class definitions from a high-level task description and example documents. This is ideal when the data schema is not known in advance.
Document Ingestion: The WorkflowOrchestrator accepts one or more text documents as the primary input for extraction.
Schema Introspection: The library inspects the provided SQLModel classes (either predefined or dynamically generated) to create a detailed JSON schema. This schema tells the LLM exactly which output format is expected.
Example Generation (Optional): To improve the accuracy of the LLM, the ExampleJSONGenerator can create few-shot examples from the schema. These examples are included in the prompt to give the LLM a clear template to follow.
Prompt Generation: The PromptBuilder combines the JSON schema, the input documents, and any few-shot examples into a comprehensive system prompt and a user prompt.
LLM Interaction & Revisioning: The configured LLMClient sends the prompts to the LLM several times to produce multiple, independent JSON structures (revisions). The consensus stage depends on having these independent revisions to compare.
JSON Validation & Consensus: Each JSON revision from the LLM is validated against the schema. The JSONConsensus class then takes all valid revisions and applies a consensus algorithm to resolve discrepancies, producing a single, unified JSON object.
SQLAlchemy Object Hydration: The SQLAlchemyHydrator transforms the final consensus JSON into a graph of SQLModel instances, correctly linking related objects.
Database Persistence (Optional): The hydrated SQLModel objects can be saved to a relational database via a standard SQLAlchemy session.
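The validation and consensus stage described above can be illustrated with a small, library-independent sketch. It does not use or reproduce the extrai API: the helper names `is_valid` and `majority_consensus`, the `REQUIRED` field map, the sample revisions, and the simple per-field majority rule are all assumptions made for illustration, and the actual JSONConsensus algorithm may differ.

```python
import json
from collections import Counter
from typing import Any


def is_valid(revision: dict[str, Any], required: dict[str, type]) -> bool:
    """Hypothetical check: every required field is present with the expected type."""
    return all(isinstance(revision.get(key), typ) for key, typ in required.items())


def majority_consensus(
    revisions: list[dict[str, Any]], threshold: float = 0.5
) -> dict[str, Any]:
    """Keep, per field, the value that more than `threshold` of the revisions agree on."""
    consensus: dict[str, Any] = {}
    fields = {key for revision in revisions for key in revision}
    for field in sorted(fields):
        # Serialize values so that unhashable ones (lists, dicts) can be counted.
        votes = Counter(
            json.dumps(revision[field], sort_keys=True)
            for revision in revisions
            if field in revision
        )
        value, count = votes.most_common(1)[0]
        if count / len(revisions) > threshold:
            consensus[field] = json.loads(value)
    return consensus


# Assumed schema and sample LLM outputs, for illustration only.
REQUIRED = {"name": str, "total": float}
revisions = [
    {"name": "Acme Corp", "total": 120.5, "currency": "USD"},
    {"name": "Acme Corp", "total": 120.5, "currency": "EUR"},
    {"name": "ACME", "total": 120.5, "currency": "USD"},
    {"name": "Acme Corp", "total": "120.5"},  # rejected: total is a string
]

valid = [revision for revision in revisions if is_valid(revision, REQUIRED)]
result = majority_consensus(valid)
print(result)
```

In this sketch the fourth revision is discarded during validation, and each remaining field keeps the value that two or more of the three valid revisions share. A field with no majority value is omitted from the result, which is one possible policy among several.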