Intro¶
extrai extracts data from text documents using LLMs, formats the output into a given SQLModel, and stores it in a database.
The library uses a Consensus Mechanism to improve accuracy: it issues the same request multiple times, to the same or different providers, and then keeps the values on which enough responses agree, according to a configured threshold.
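The threshold-based voting described above can be sketched in plain Python. This is a simplified, stdlib-only illustration of majority voting over repeated JSON outputs, not the library's actual algorithm; the `consensus` function name and the sample revisions are assumptions for illustration.

```python
from collections import Counter

def consensus(revisions: list[dict], threshold: float = 0.5) -> dict:
    """Keep, for each field, the value that a sufficient share of the
    LLM revisions agree on. Simplified sketch; extrai's real consensus
    logic may differ."""
    result = {}
    fields = {key for rev in revisions for key in rev}
    for field in fields:
        votes = Counter(rev[field] for rev in revisions if field in rev)
        value, count = votes.most_common(1)[0]
        # A value survives only if enough revisions voted for it.
        if count / len(revisions) >= threshold:
            result[field] = value
    return result

revisions = [
    {"name": "Ada Lovelace", "year": 1815},
    {"name": "Ada Lovelace", "year": 1816},  # one revision disagrees on year
    {"name": "Ada Lovelace", "year": 1815},
]
print(consensus(revisions, threshold=0.6))  # {'name': 'Ada Lovelace', 'year': 1815}
```

With a threshold of 0.6, `year: 1815` survives (2 of 3 revisions agree, ≈0.67), while a field with no majority value would be dropped.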
extrai also offers other features, such as generating SQLModels from a prompt and documents, and generating few-shot examples. For complex, nested data, the library provides Hierarchical Extraction, which breaks the extraction down into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.
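The idea behind hierarchical extraction can be shown with a toy sketch: the parent record is extracted first, then a separate, focused pass fills in each nested child. Everything here is illustrative; `fake_llm_extract` stands in for a real LLM call and is not part of the extrai API.

```python
def fake_llm_extract(text: str, schema: str) -> dict:
    """Placeholder for an LLM call; a real implementation would prompt
    the model with only the schema relevant to this step."""
    if schema == "invoice":
        return {"number": "INV-42", "customer": "Acme"}
    return {"description": "widget", "qty": 3}

def extract_invoice(text: str) -> dict:
    # Step 1: extract the top-level entity with a narrow, parent-only schema.
    invoice = fake_llm_extract(text, "invoice")
    # Step 2: a second, focused pass extracts the nested line items.
    invoice["items"] = [fake_llm_extract(text, "item")]
    return invoice

print(extract_invoice("Invoice INV-42 for Acme: 3 widgets"))
```

Splitting the work this way keeps each prompt small and focused, which is what makes deeply nested schemas tractable.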
Workflow Overview¶
The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview for more details):
```mermaid
graph TD
    A[Unstructured Text] --> B(WorkflowOrchestrator);
    C[SQLModel Definition] --> B;
    B --> D{LLM Client};
    D --> E[Multiple JSON Outputs];
    E --> F(JSONConsensus);
    F --> G[Consolidated JSON];
    G --> H(SQLAlchemyHydrator);
    H --> I[Structured Data in DB];
```
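The stages in the diagram can be mimicked end to end with the standard library. This is a stdlib-only sketch under stated assumptions: `call_llm` is a stand-in for a provider client, the hard-coded book record is fabricated sample data, and none of extrai's real classes (WorkflowOrchestrator, JSONConsensus, SQLAlchemyHydrator) are used.

```python
import sqlite3
from collections import Counter

def call_llm(text: str) -> dict:
    """Placeholder LLM call; a real client would send `text` to a provider."""
    return {"title": "Dune", "year": 1965}

def run_pipeline(text: str, revisions: int = 3) -> tuple:
    # LLM Client stage: collect multiple JSON outputs.
    outputs = [call_llm(text) for _ in range(revisions)]
    # Consensus stage: per field, keep the most common value.
    consolidated = {
        field: Counter(out[field] for out in outputs).most_common(1)[0][0]
        for field in outputs[0]
    }
    # Hydration stage: persist the consolidated record to a database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (title TEXT, year INTEGER)")
    conn.execute("INSERT INTO book VALUES (?, ?)",
                 (consolidated["title"], consolidated["year"]))
    return conn.execute("SELECT * FROM book").fetchone()

print(run_pipeline("Dune, published in 1965, is a novel by Frank Herbert."))
```

In extrai, the orchestrator coordinates these stages and the hydrator maps the consolidated JSON onto your SQLModel instead of raw SQL.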
Key Features¶
Consensus Mechanism: Improves extraction accuracy by consolidating multiple LLM outputs.
Dynamic SQLModel Generation: Generate SQLModel schemas from natural language descriptions.
Hierarchical Extraction: Handles complex, nested data by breaking down the extraction into manageable, hierarchical steps.
Extensible LLM Support: Integrates with various LLM providers through a client interface.
Built-in Analytics: Collects metrics on LLM performance and output quality to refine prompts and monitor errors.
Workflow Orchestration: A central orchestrator to manage the extraction pipeline.
Example JSON Generation: Automatically generate few-shot examples to improve extraction quality.
Customizable Prompts: Customize prompts at runtime to tailor the extraction process to specific needs.
Rotating LLM Providers: Generate the JSON revisions using multiple LLM providers.
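Rotating revisions across providers can be sketched with `itertools.cycle`; the provider names below are hypothetical placeholders, and the real library's rotation strategy may differ.

```python
from itertools import cycle

providers = ["provider_a", "provider_b", "provider_c"]  # hypothetical names
rotation = cycle(providers)

# Each of the five revisions is requested from the next provider in turn,
# so no single provider's biases dominate the consensus vote.
assignments = [next(rotation) for _ in range(5)]
print(assignments)
# ['provider_a', 'provider_b', 'provider_c', 'provider_a', 'provider_b']
```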