Intro

extrai extracts data from text documents using LLMs, formatting the output into a given SQLModel and registering it in a database.

The library uses a Consensus Mechanism to improve accuracy. It makes the same request multiple times, using the same or different providers, and then keeps only the values whose agreement across the responses meets a configured threshold.
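
To make the idea concrete, here is a minimal sketch of threshold-based voting in plain Python. It is an illustration only, not extrai's implementation: the function name `field_consensus`, the restriction to flat dictionaries with hashable values, and the rule "keep a value when its share of all responses is at least the threshold" are assumptions made for this example.

```python
from collections import Counter
from typing import Any


def field_consensus(outputs: list[dict[str, Any]], threshold: float = 0.5) -> dict[str, Any]:
    """Keep, per field, the most common value if its share of outputs reaches `threshold`.

    Hypothetical helper for illustration; extrai's actual consensus logic may differ.
    Assumes flat dictionaries whose values are hashable (strings, numbers, booleans).
    """
    if not outputs:
        return {}
    result: dict[str, Any] = {}
    # Collect every field name that appears in at least one output.
    fields = {key for output in outputs for key in output}
    for field in sorted(fields):
        # Count how often each value was returned for this field.
        votes = Counter(output[field] for output in outputs if field in output)
        value, count = votes.most_common(1)[0]
        # The share is computed over all outputs, so a missing field counts against it.
        if count / len(outputs) >= threshold:
            result[field] = value
    return result


# Three simulated LLM responses; one disagrees on the year.
outputs = [
    {"name": "Ada Lovelace", "born": 1815},
    {"name": "Ada Lovelace", "born": 1815},
    {"name": "Ada Lovelace", "born": 1816},
]
consensus = field_consensus(outputs, threshold=0.6)
strict = field_consensus(outputs, threshold=0.7)
```

With a threshold of 0.6, both fields survive because two of three responses agree on `born`. Raising the threshold to 0.7 drops `born` and keeps only `name`, which all three responses share.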

extrai can also generate SQLModels from a prompt and documents, and generate few-shot examples. For complex, nested data, the library offers Hierarchical Extraction, which breaks the extraction down into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.

Workflow Overview

The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview for more details):

graph TD
    A[Unstructured Text] --> B(WorkflowOrchestrator);
    C[SQLModel Definition] --> B;
    B --> D{LLM Client};
    D --> E[Multiple JSON Outputs];
    E --> F(SQLAlchemyHydrator);
    F --> G(JSONConsensus);
    G --> H[Consolidated JSON];
    H --> I(SQLAlchemyHydrator);
    I --> J[Structured Data in DB];
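
The overall flow in the diagram can be approximated with the standard library alone. The sketch below is a rough stand-in, not extrai's API: `Person`, `fake_llm_client`, and `run_workflow` are hypothetical names, a dataclass replaces the SQLModel definition, `sqlite3` replaces the database layer, and the consolidation step is a simple whole-object majority vote. It also collapses the diagram's two hydration steps into one.

```python
import json
import sqlite3
from collections import Counter
from dataclasses import dataclass


@dataclass
class Person:
    """Stand-in for an SQLModel definition (hypothetical)."""
    name: str
    born: int


def fake_llm_client(text: str, attempt: int) -> str:
    """Hypothetical LLM stub: returns a JSON string; the third attempt disagrees."""
    born = 1816 if attempt == 2 else 1815
    return json.dumps({"name": "Ada Lovelace", "born": born})


def run_workflow(text: str, revisions: int, conn: sqlite3.Connection) -> Person:
    # 1. Ask the (stubbed) LLM client for several JSON outputs.
    raw_outputs = [fake_llm_client(text, attempt) for attempt in range(revisions)]
    # 2. Consolidate: normalize key order, then keep the object returned most often.
    canonical = [json.dumps(json.loads(raw), sort_keys=True) for raw in raw_outputs]
    consolidated = json.loads(Counter(canonical).most_common(1)[0][0])
    # 3. Hydrate the consolidated JSON into a typed object.
    person = Person(**consolidated)
    # 4. Persist the structured record.
    conn.execute("CREATE TABLE IF NOT EXISTS person (name TEXT, born INTEGER)")
    conn.execute(
        "INSERT INTO person (name, born) VALUES (?, ?)",
        (person.name, person.born),
    )
    conn.commit()
    return person


conn = sqlite3.connect(":memory:")
person = run_workflow("Ada Lovelace was born in 1815.", revisions=3, conn=conn)
rows = conn.execute("SELECT name, born FROM person").fetchall()
```

Here two of the three stubbed responses agree, so the record written to the in-memory database carries the majority value for `born`.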
    

Key Features