.. _consensus_mechanism:

The Consensus Mechanism
=======================

The core of ``extrai`` is its ability to synthesize a single, reliable JSON object from multiple, potentially conflicting JSON outputs (revisions) from an LLM. This process is handled by the ``JSONConsensus`` class.

This page explains how that mechanism works under the hood.

The Core Idea: Field-Level Agreement
------------------------------------

Instead of comparing entire JSON objects, which can be brittle, the consensus mechanism works on a field-by-field basis. It achieves this through a three-step process:

1.  **Flattening**: Each JSON revision is "flattened" into a simple key-value dictionary. Nested structures and list elements are represented using a dot-notation path.
2.  **Aggregation & Voting**: The algorithm aggregates all the values for each unique path across all revisions and determines if any value meets a predefined agreement threshold.
3.  **Un-flattening**: The paths that reached a consensus are used to reconstruct the final, nested JSON object.

Step 1: Flattening
------------------

Consider two JSON revisions for a ``Product`` extraction:

**Revision 1:**

.. code-block:: json

   {
     "name": "SuperWidget",
     "specs": { "ram_gb": 16 },
     "tags": ["A", "B"]
   }

**Revision 2:**

.. code-block:: json

   {
     "name": "SuperWidget",
     "specs": { "ram_gb": 32 },
     "tags": ["A", "C"]
   }

These are flattened into:

- **Revision 1:** ``{"name": "SuperWidget", "specs.ram_gb": 16, "tags.0": "A", "tags.1": "B"}``
- **Revision 2:** ``{"name": "SuperWidget", "specs.ram_gb": 32, "tags.0": "A", "tags.1": "C"}``

Step 2: Aggregation and Voting
------------------------------

The algorithm then groups the values for each path:

- ``name``: ``["SuperWidget", "SuperWidget"]``
- ``specs.ram_gb``: ``[16, 32]``
- ``tags.0``: ``["A", "A"]``
- ``tags.1``: ``["B", "C"]``

Next, it checks each path against the ``consensus_threshold``. This threshold (a float between 0.0 and 1.0) defines the minimum proportion of revisions that must agree. Let's assume a ``consensus_threshold`` of ``0.5``, meaning more than 50% of revisions must agree.

- ``name``: "SuperWidget" appears in 2/2 revisions (100%). **Consensus reached.**
- ``specs.ram_gb``: 16 appears in 1/2 (50%), 32 appears in 1/2 (50%). Neither meets the "> 50%" threshold. **No consensus.**
- ``tags.0``: "A" appears in 2/2 revisions (100%). **Consensus reached.**
- ``tags.1``: "B" appears in 1/2 (50%), "C" appears in 1/2 (50%). **No consensus.**

Step 3: Un-flattening and Conflict Resolution
---------------------------------------------

Only the paths that reached consensus are kept:

- ``name``: "SuperWidget"
- ``tags.0``: "A"

These are then un-flattened to produce the final JSON object:

.. code-block:: json

   {
     "name": "SuperWidget",
     "tags": ["A"]
   }

Notice that ``specs.ram_gb`` and the second tag are missing. This is the default behavior when no consensus is reached for a path.

Conflict Resolution
-------------------

What happens when no value meets the threshold is determined by a ``conflict_resolver`` function that can be passed to the ``JSONConsensus`` initializer. The library provides two main strategies:

-   ``default_conflict_resolver``: If no consensus is found for a path, the field is simply omitted from the final output.
-   ``prefer_most_common_resolver``: If no consensus is found, this resolver will pick the most frequent value, even if it doesn't meet the threshold. This is useful if you always want a value for a field, even if the LLM was inconsistent.

You can also implement your own custom conflict resolver function for more advanced logic.