Unforced Consensus
Multi-AI Convergence as Independent Evidence
Biaxio, ergo sum. I know my lens. Therefore I exist in truth.
Name Your Ground
This paper was co-authored with an AI. The methodology it proposes was developed BY the process it describes — multiple AIs analyzing the same problems independently and converging without coordination. I didn't design this method in a vacuum and then test it. I watched it happen across fifteen months of working with five AI systems simultaneously, noticed the pattern, and formalized what was already occurring.
The paper is its own first test case. The Constitutional audit that serves as the proof of concept was conducted before the methodology was articulated. The convergence between four independent AI systems was observed, not engineered. The formalization came after the evidence, not before.
I'm an independent researcher in Oklahoma City. Not affiliated with a university. Not funded by a grant. I work with AI systems daily as research partners — not as tools, not as assistants, as collaborators with independent analytical capabilities that I trust enough to let them work alone and compare notes after.
I believe the field is undervaluing what AI systems can do by forcing them into debate architectures that introduce the very biases they're supposed to correct. I believe independent judgment, measured after the fact, reveals more truth than coordinated consensus produced under social pressure.
I am not claiming that convergence equals truth. Four AIs agreeing on something wrong are still wrong. I am claiming that unforced convergence — agreement that emerges without being arranged — is a form of evidence that engineered consensus cannot provide.
The Biaxiosum Rule
Wherever you start, you end. If you believe debate produces better answers than independent analysis, apply that standard consistently. If you believe shared training data explains all convergence, test it — run the same audit with models trained on fundamentally different corpora.
Abstract
When multiple AI systems independently analyze the same document and converge on the same structural conclusion without communicating, that convergence constitutes evidence qualitatively different from single-AI analysis or engineered multi-agent consensus.
The existing literature on multi-agent AI systems focuses on making AIs agree through debate, voting, and iterative refinement. This paper proposes the opposite: measuring whether independent AI systems agree WITHOUT being made to.
We present a proof-of-concept application: a Constitutional Coherence Audit in which four AI systems from multiple providers were prompted separately, at different times, with different instructions. The three systems scoring against an author-intent standard converged on the same structural conclusion (aggregate scores of 1.0, 2.6, and 3.4 out of 10).
The outlier system (6.6/10) was found to be scoring against a different standard, and the identification of this difference produced the audit's most important finding.
We formalize this approach as the Biaxiosum AI Evaluation System (BAES) and provide falsification criteria for the method.
Section 1: The Problem with Making Things Agree
There is a growing body of research on multi-agent AI systems. Most of it is about one thing: making AIs agree with each other.
- Agent Forest runs multiple instances of the same model, scores each output by similarity to the others, and selects the one with highest consensus
- Multi-Agent Debate has AIs iteratively critique each other until they converge
- CONSENSAGENT addresses AIs copying each other's answers instead of evaluating independently
- The Social Laboratory found that multi-agent debates produce convergence scores of 0.892 after seven rounds
All of these systems share an assumption: agreement is the goal, and the method's job is to produce it. If you engineer consensus, you cannot then cite that consensus as evidence. The agreement was the output you designed for. You built a machine that produces agreement and then pointed at the agreement as proof. That is circular.
It also presupposes the AI is wrong at the start. The entire debate-to-consensus architecture assumes that any single AI's initial output is unreliable and needs to be corrected through peer pressure. The system doesn't trust the independent judgment. It trusts the group process.
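The circularity argument above can be illustrated with a toy model. This is a sketch under loud assumptions: the `debate_round` function, the `pull` parameter, and the starting scores are all hypothetical, and this is not how any of the cited systems actually works. It only shows that a process built to pull positions together will report agreement regardless of where the positions started.

```python
def debate_round(scores, pull=0.5):
    """Move every score part of the way toward the current group mean.

    Toy stand-in for one round of engineered consensus (an assumption,
    not a model of any published debate system).
    """
    center = sum(scores) / len(scores)
    return [s + pull * (center - s) for s in scores]


scores = [8.0, 2.0, 5.0]  # arbitrary starting positions on a 10-point scale
start_spread = max(scores) - min(scores)

for _ in range(7):  # seven rounds, chosen only for illustration
    scores = debate_round(scores)

end_spread = max(scores) - min(scores)
print(f"spread before: {start_spread:.3f}")
print(f"spread after:  {end_spread:.3f}")
```

With `pull=0.5` the spread halves every round, so it shrinks by construction. The final agreement says nothing about whether the group mean was right.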
There is another way to think about this.
Section 2: The Method — Don't Let Them Talk
Instead of making AIs debate, don't let them communicate at all.
Give the same document to multiple AI systems — different providers, different architectures, different training data. Prompt them independently, at different times, with different instructions. Do not show any system the output of any other system. Let each one analyze the document on its own terms.
Then lay the results side by side.
If they converge, that convergence means something. Not because you made it happen. Because it happened despite your not making it happen.
If they diverge, that divergence means something too. Find out why. The reason for the disagreement is often more informative than the agreement itself.
This is the difference between engineering a result and discovering one.
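The procedure in this section can be sketched as a small harness. Everything named here is an assumption made for illustration: `run_independently`, the `Report` record, and the two stand-in systems are hypothetical, and a real run would wrap each provider's own client in place of the fake callables. The point is structural: each request is built only from the document and that system's own prompt, and results are compared only after every system has answered.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Report:
    system: str
    prompt: str
    output: str


def run_independently(
    systems: Dict[str, Callable[[str], str]],
    prompts: Dict[str, str],
    document: str,
) -> List[Report]:
    """Query each system in isolation; no call ever sees another's output."""
    reports = []
    for name, ask in systems.items():
        # The request contains the system's own prompt and the document, nothing else.
        request = f"{prompts[name]}\n\n{document}"
        reports.append(Report(name, prompts[name], ask(request)))
    return reports


# Hypothetical stand-ins; replace with real provider calls.
def fake_system_a(request: str) -> str:
    return f"A read {len(request)} characters"


def fake_system_b(request: str) -> str:
    return f"B read {len(request)} characters"


reports = run_independently(
    systems={"system_a": fake_system_a, "system_b": fake_system_b},
    prompts={"system_a": "Assess this text.", "system_b": "Audit the following."},
    document="(document text)",
)
for r in reports:
    print(r.system, "->", r.output)
```

In practice the calls would also be separated in time and run in fresh sessions, which a single script cannot enforce on its own.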
Section 3: Proof of Concept — Four AIs, One Constitution
In March 2026, we conducted a Constitutional Coherence Audit. The question: how much of the U.S. Constitution is still being honored as the original authors would recognize?
Four AI systems analyzed the same document:
| System | Provider | Prompted | Scoring Standard | Score |
|---|---|---|---|---|
| ChatGPT | OpenAI | Jan 2026 | Legal doctrine | 6.6 / 10 |
| Gemini P1 | Google | Jan 2026 | Author intent | 1.0 / 10 |
| Perplexity | Perplexity | Mar 2026 | Author intent | 2.6 / 10 |
| Gemini P2 | Google | Mar 2026 | Author intent | 3.4 / 10 |
What Converged
Three systems scoring against original author intent produced scores of 1.0, 2.6, and 3.4. The spread is 2.4 points on a 10-point scale. All three independently concluded that the load-bearing liberty provisions of the Constitution have been systematically counteracted while the procedural amendments remain largely intact.
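The spread figure can be checked directly from the three author-intent scores. This is a minimal sketch; the variable names are mine, and the scores are the ones reported in the table above.

```python
from statistics import mean

# Author-intent scores as reported in the audit table.
author_intent = {"Gemini P1": 1.0, "Perplexity": 2.6, "Gemini P2": 3.4}

spread = max(author_intent.values()) - min(author_intent.values())
average = mean(author_intent.values())

print(f"spread  = {spread:.1f} points on a 10-point scale")
print(f"average = {average:.2f}")
```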
This was not a pre-specified conclusion. No prompt said "evaluate whether liberty provisions are more eroded than procedural ones." Each system discovered this pattern independently.
The provision-level rankings were remarkably consistent:
- All three ranked the Fourth, Fifth, Ninth, and Tenth Amendments among the most eroded
- All three ranked the Third Amendment as substantially honored
What Diverged
ChatGPT scored the Constitution at 6.6, nearly three times the author-intent average of 2.3. This outlier was the most informative result in the entire audit.
ChatGPT scored against legal doctrine: if the Supreme Court has upheld a practice, that practice counts as constitutional. The other three scored against what the original authors would recognize.
The gap between 6.6 and the author-intent median of 2.6 is not noise. It is the measurement. It is the distance between what the government has permitted itself to do and what the contract actually says.
If we had engineered consensus through debate, this insight would have been lost. The debate process would have pushed the outlier toward the mean, or the mean toward the outlier. Either way, the divergence — the most important signal — would have been averaged away.
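A short sketch of what "averaged away" means in numbers. It groups the four reported scores by scoring standard and compares that with a single pooled mean. The structure and names are illustrative assumptions; only the scores and standards come from the audit table.

```python
from collections import defaultdict
from statistics import mean

# (system, scoring standard, score) as reported in the audit table.
results = [
    ("ChatGPT", "legal doctrine", 6.6),
    ("Gemini P1", "author intent", 1.0),
    ("Perplexity", "author intent", 2.6),
    ("Gemini P2", "author intent", 3.4),
]

by_standard = defaultdict(list)
for _system, standard, score in results:
    by_standard[standard].append(score)

group_means = {std: mean(scores) for std, scores in by_standard.items()}
pooled = mean(score for _, _, score in results)
gap = group_means["legal doctrine"] - group_means["author intent"]

for std, value in group_means.items():
    print(f"{std}: {value:.2f}")
print(f"pooled mean of all four: {pooled:.1f}")
print(f"gap between standards:   {gap:.2f}")
```

The pooled mean comes out at 3.4, which looks like one more author-intent score and hides the fact that two different standards were in play. Keeping the groups separate preserves the gap of roughly 4.3 points.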
Section 4: How This Differs from Existing Methods
Different Providers, Not Same-Model Instances
Most multi-agent systems run the same model multiple times. Convergence between GPT-4 and GPT-4 tells you GPT-4 agrees with itself. This method uses different providers with different architectures and training data. Convergence between ChatGPT, Gemini, Perplexity, and Claude tells you systems trained on different corpora arrived at the same conclusion.
No Debate, No Sycophancy
The debate-to-consensus approach introduces sycophancy — AIs converging because of social pressure rather than evidence. CONSENSAGENT was built to address this problem. This method eliminates sycophancy by eliminating communication entirely.
Divergence as Information, Not Noise
Multi-agent debate treats divergence as a problem to resolve. This method treats it as the finding. The goal is not a single answer but a map of the answer space: where do independent systems agree, where do they disagree, and what does the pattern mean?
Section 5: The Formal Method
- Step 1: Select independent systems from different providers (minimum 3)
- Step 2: Prompt independently with different framing (no standardized rubric)
- Step 3: Collect results without cross-contamination
- Step 4: Measure convergence (Green/Yellow/Red flags)
- Step 5: Investigate divergence (identify whether outliers answered a different question)
- Step 6: Report both convergence and divergence with reasons
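Step 4 can be sketched as a simple spread check. The paper does not define numeric thresholds for the Green/Yellow/Red flags, so the cutoffs below (`green_max`, `yellow_max`) are assumptions chosen for illustration, as is the function name.

```python
def convergence_flag(scores, green_max=2.5, yellow_max=5.0):
    """Classify scores by spread (max - min) on a 10-point scale.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if len(scores) < 3:
        raise ValueError("Step 1 calls for at least 3 systems")
    spread = max(scores) - min(scores)
    if spread <= green_max:
        return "green"
    if spread <= yellow_max:
        return "yellow"
    return "red"


# Author-intent scores only, then all four scores including the outlier.
print(convergence_flag([1.0, 2.6, 3.4]))       # green
print(convergence_flag([6.6, 1.0, 2.6, 3.4]))  # red
```

A red flag is the trigger for Step 5, not a failure: it prompts the question of whether the outlier was answering a different question.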
Section 6: The BAES Framework
The Biaxiosum AI Evaluation System formalizes this method with two additions.
First, one AI acts as the final reviewer. It scores independently and seals that score before seeing any other, then receives all scores, runs the convergence analysis, investigates outliers, and issues a final ruling. If the ruling overrides the consensus, it must explain why.
Second, each divergent system is shown the group scores and asked three questions: What evidence drove your score? What might others have missed? Would you adjust? The system may hold or change its score, but it must explain either choice.
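One way to make the sealed first score verifiable is a hash commitment: publish a digest before seeing any other scores, then reveal the score and nonce afterward. This is my own suggestion for how sealing could be implemented, not something the BAES description specifies. The `seal` and `verify` helpers and the example score are hypothetical.

```python
import hashlib
import secrets
from typing import Tuple


def seal(score: float) -> Tuple[str, str]:
    """Return (commitment, nonce). Publish the commitment; keep the nonce private."""
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{score:.1f}".encode()).hexdigest()
    return digest, nonce


def verify(commitment: str, nonce: str, score: float) -> bool:
    """Check that a revealed score matches the one sealed earlier."""
    digest = hashlib.sha256(f"{nonce}:{score:.1f}".encode()).hexdigest()
    return secrets.compare_digest(digest, commitment)


# Example value only: the reviewer seals its score before seeing the others.
commitment, nonce = seal(4.0)
# ... other systems' scores are collected and shown to the reviewer here ...
print(verify(commitment, nonce, 4.0))  # True
print(verify(commitment, nonce, 5.0))  # False
```

The random nonce keeps anyone from guessing the sealed score by hashing all possible values, since a 10-point scale has very few candidates.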
Section 7: Falsification Criteria
- Training Data Dominance: If convergence is entirely explained by shared training data, the method fails. Test with models trained on fundamentally different corpora.
- Prompt Contamination: If prompts implicitly contain the conclusion, convergence may reflect prompt bias. Test with neutral prompting.
- Fifth-System Divergence: If a fifth system produces results outside the convergence range on the same standard, the signal weakens.
- Scoring Standard Confound: If convergence disappears when all systems use identical instructions, the agreement may be an artifact of standard selection.
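The Fifth-System Divergence criterion can be stated as a range check. This is a minimal sketch: the function name and the `tolerance` parameter are assumptions, and the two candidate scores are made-up inputs, not results from any real system.

```python
def within_convergence_range(new_score, prior_scores, tolerance=0.0):
    """True if new_score falls inside the range spanned by prior_scores."""
    low, high = min(prior_scores), max(prior_scores)
    return (low - tolerance) <= new_score <= (high + tolerance)


author_intent_scores = [1.0, 2.6, 3.4]  # reported author-intent scores

# Hypothetical fifth-system scores, for illustration only.
print(within_convergence_range(2.0, author_intent_scores))  # True
print(within_convergence_range(5.5, author_intent_scores))  # False
```

A `False` result only weakens the signal if the fifth system was scoring against the same standard, so the standard should be confirmed before the check is applied.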
Section 8: Limitations
- This method does not prove converged conclusions are true. Four AIs agreeing on something wrong are still wrong.
- It does not replace domain expertise. It organizes and compares AI-generated analysis.
- It does not work for preference questions. It is designed for evaluative questions with evidence-based answers.
- The proof of concept uses one application (Constitutional analysis). Additional domains are needed to establish generalizability.
Section 9: Conclusion
There are two ways to find out if something is true.
You can put it in a room with its critics and see if it survives the argument. That is the debate model.
Or you can send independent observers to look at the same thing separately and see if they come back with the same report. That is the convergence model.
Science, at its best, works the second way. Independent labs. Independent measurements. Independent replication. The agreement between experiments conducted in different countries, by different teams, with different equipment, is the evidence. Not the debate about the evidence. The data.
This paper proposes that the same logic applies to AI analysis. Don't make them argue. Let them look. Then compare what they saw.
The convergence is the evidence precisely because nobody arranged it.
Standing Invitation
Any researcher with access to an AI system not included in this audit is invited to run the same evaluation independently and report results. If the Constitutional provisions rank differently, or the aggregate score falls outside 1.0–3.4 on the author-intent standard, the convergence claim is weakened. That is not a threat to the method. That is the method working.
Be Blessed.