Intro¶
extrai extracts data from text documents using LLMs, formats the output into a given SQLModel, and stores it in a database.
The library uses a Consensus Mechanism to improve accuracy: it issues the same request multiple times, to the same or different providers, and then keeps the values on which enough responses agree, according to a configured threshold.
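The threshold-based voting described above can be sketched in plain Python. This is a simplified, stdlib-only illustration of majority voting over repeated JSON outputs, not the library's actual algorithm; the `consensus` function name and the sample revisions are assumptions for illustration.

```python
from collections import Counter

def consensus(revisions: list[dict], threshold: float = 0.5) -> dict:
    """Keep, for each field, the value that a sufficient share of the
    LLM revisions agree on. Simplified sketch; extrai's real consensus
    logic may differ."""
    result = {}
    fields = {key for rev in revisions for key in rev}
    for field in fields:
        votes = Counter(rev[field] for rev in revisions if field in rev)
        value, count = votes.most_common(1)[0]
        # A value survives only if enough revisions voted for it.
        if count / len(revisions) >= threshold:
            result[field] = value
    return result

revisions = [
    {"name": "Ada Lovelace", "year": 1815},
    {"name": "Ada Lovelace", "year": 1816},  # one revision disagrees on year
    {"name": "Ada Lovelace", "year": 1815},
]
print(consensus(revisions, threshold=0.6))  # {'name': 'Ada Lovelace', 'year': 1815}
```

With a threshold of 0.6, `year: 1815` survives (2 of 3 revisions agree, ≈0.67), while a field with no majority value would be dropped.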
extrai also offers other features, such as generating SQLModels from a prompt and documents, and generating few-shot examples. For complex, nested data, the library provides Hierarchical Extraction, which breaks the extraction down into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.
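The idea behind hierarchical extraction can be shown with a toy sketch: the parent record is extracted first, then a separate, focused pass fills in each nested child. Everything here is illustrative; `fake_llm_extract` stands in for a real LLM call and is not part of the extrai API.

```python
def fake_llm_extract(text: str, schema: str) -> dict:
    """Placeholder for an LLM call; a real implementation would prompt
    the model with only the schema relevant to this step."""
    if schema == "invoice":
        return {"number": "INV-42", "customer": "Acme"}
    return {"description": "widget", "qty": 3}

def extract_invoice(text: str) -> dict:
    # Step 1: extract the top-level entity with a narrow, parent-only schema.
    invoice = fake_llm_extract(text, "invoice")
    # Step 2: a second, focused pass extracts the nested line items.
    invoice["items"] = [fake_llm_extract(text, "item")]
    return invoice

print(extract_invoice("Invoice INV-42 for Acme: 3 widgets"))
```

Splitting the work this way keeps each prompt small and focused, which is what makes deeply nested schemas tractable.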
Workflow Overview¶
The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview for more details):
```mermaid
graph TD
    A[Unstructured Text] --> B(WorkflowOrchestrator);
    C[SQLModel Definition] --> B;
    B --> D{LLM Client};
    D --> E[Multiple JSON Outputs];
    E --> F(JSONConsensus);
    F --> G[Consolidated JSON];
    G --> H(SQLAlchemyHydrator);
    H --> I[Structured Data in DB];
```
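The stages in the diagram can be mimicked end to end with the standard library. This is a stdlib-only sketch under stated assumptions: `call_llm` is a stand-in for a provider client, the hard-coded book record is fabricated sample data, and none of extrai's real classes (WorkflowOrchestrator, JSONConsensus, SQLAlchemyHydrator) are used.

```python
import sqlite3
from collections import Counter

def call_llm(text: str) -> dict:
    """Placeholder LLM call; a real client would send `text` to a provider."""
    return {"title": "Dune", "year": 1965}

def run_pipeline(text: str, revisions: int = 3) -> tuple:
    # LLM Client stage: collect multiple JSON outputs.
    outputs = [call_llm(text) for _ in range(revisions)]
    # Consensus stage: per field, keep the most common value.
    consolidated = {
        field: Counter(out[field] for out in outputs).most_common(1)[0][0]
        for field in outputs[0]
    }
    # Hydration stage: persist the consolidated record to a database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (title TEXT, year INTEGER)")
    conn.execute("INSERT INTO book VALUES (?, ?)",
                 (consolidated["title"], consolidated["year"]))
    return conn.execute("SELECT * FROM book").fetchone()

print(run_pipeline("Dune, published in 1965, is a novel by Frank Herbert."))
```

In extrai, the orchestrator coordinates these stages and the hydrator maps the consolidated JSON onto your SQLModel instead of raw SQL.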
Key Features¶
Consensus Mechanism: Improves extraction accuracy by consolidating multiple LLM outputs.
Dynamic SQLModel Generation: Generate SQLModel schemas from natural language descriptions.
Hierarchical Extraction: Handles complex, nested data by breaking down the extraction into manageable, hierarchical steps.
Extensible LLM Support: Integrates with various LLM providers through a client interface.
Built-in Analytics: Collects metrics on LLM performance and output quality to refine prompts and monitor errors.
Workflow Orchestration: A central orchestrator to manage the extraction pipeline.
Example JSON Generation: Automatically generate few-shot examples to improve extraction quality.
Customizable Prompts: Customize prompts at runtime to tailor the extraction process to specific needs.
Rotating LLM Providers: Generate the JSON revisions using multiple LLM providers.
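Rotating revisions across providers can be sketched with `itertools.cycle`; the provider names below are hypothetical placeholders, and the real library's rotation strategy may differ.

```python
from itertools import cycle

providers = ["provider_a", "provider_b", "provider_c"]  # hypothetical names
rotation = cycle(providers)

# Each of the five revisions is requested from the next provider in turn,
# so no single provider's biases dominate the consensus vote.
assignments = [next(rotation) for _ in range(5)]
print(assignments)
# ['provider_a', 'provider_b', 'provider_c', 'provider_a', 'provider_b']
```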