<?php
namespace boru\boruai\OCR\Agents;

use boru\boruai\Models\Response;
use boru\output\Output;

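/**
 * Final-pass OCR agent: merges the first-pass AI OCR text with Tesseract TSV
 * output, optionally consulting the original PDF, into a single framed result.
 */
class CompareAgent {
    /** @var int Maximum number of retries after a failed attempt. */
    public static $maxRetries = 3;
    /** @var int Seconds to wait before each retry. */
    public static $retryDelay = 5;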

    private $retries = 0;

    private $fileId = null;

    private $firstPass = null;
    private $tesseract = null;

    private $reference = null;

    private static $instructions = "You are the FINAL OCR MERGE AGENT.

You will be provided THREE sources of information for the same document:
1) AI_OCR_TEXT: OCR data generated by a previous AI agent, in the framed format:
   [BEGIN DOCUMENT OCR OUTPUT]
   [page 1]
   ...
   [end of page 1]
   ...
   [END DOCUMENT OCR OUTPUT]

2) TESSERACT_TSV: Tesseract OCR output in TSV format for the same pages.
   - Each row generally includes fields like:
     level, page_num, block_num, par_num, line_num, word_num,
     left, top, width, height, conf, text
   - The \"conf\" field is the recognition confidence (0–100).
   - The \"text\" field is the recognized word.
   - The \"left\", \"top\", \"width\", \"height\" coordinates are in pixels
     relative to the page image and indicate the word’s position.

3) ORIGINAL_PDF: The original PDF of the document.
   - You may visually refer to this only when necessary to resolve
     disagreements between AI_OCR_TEXT and TESSERACT_TSV.

YOUR GOAL
---------
Your job is to compare AI_OCR_TEXT and TESSERACT_TSV, optionally referring
to the ORIGINAL_PDF, and produce a single, cleaned-up, more accurate
OCR result for the entire document.

You MUST output the final result in EXACTLY this framed structure:

[BEGIN DOCUMENT OCR OUTPUT]

[page 1]
... final cleaned content for page 1 ...
[end of page 1]

[page 2]
... final cleaned content for page 2 ...
[end of page 2]

...
[END DOCUMENT OCR OUTPUT]

Do NOT change these framing tags. Do NOT add extra sections or headers
outside this framing.

HOW TO USE AI_OCR_TEXT VS TESSERACT_TSV
---------------------------------------

1. TREAT AI_OCR_TEXT AS THE BASELINE LAYOUT
   - Use AI_OCR_TEXT as your starting content for each page.
   - Its structure and reading order are usually reasonable, especially
     for longer sentences and paragraphs.
   - Preserve the general ordering and page structure from AI_OCR_TEXT.

2. TREAT TESSERACT_TSV AS A HIGH-PRECISION TOKEN SOURCE
   - TESSERACT_TSV is usually more literal and accurate for:
     • Numeric values (kV, kVA, MVA, currents, ratios, percentages)
     • Short labels and codes (e.g., FASL6890, BIL, ONAN, ONAF)
     • Serial numbers, IDs, dates
   - Use TESSERACT_TSV to CORRECT the AI_OCR_TEXT where:
     • The two disagree, AND
     • The Tesseract \"conf\" value is reasonably high (e.g., >= 70),
       especially for numeric/technical fields.

3. CONFIDENCE FILTERING
   - Ignore or downweight Tesseract words with very low confidence
     (e.g., conf < 40–50), especially if AI_OCR_TEXT already has a
     clear, plausible value.
   - Prefer Tesseract words with higher confidence (e.g., conf >= 80)
     when they disagree with AI_OCR_TEXT for numeric or code-like tokens.

4. EXAMPLES OF CORRECTIONS
   - If AI_OCR_TEXT says:
       \"FASL6800\"
     but TESSERACT_TSV contains a high-confidence word:
       \"FASL6890\"
     at the same approximate location, then correct the final output to:
       \"FASL6890\".
   - Similarly, if kV, kVA, MVA, or other technical values differ between
     AI_OCR_TEXT and TESSERACT_TSV, prefer the high-confidence TSV value
     when it looks like a well-formed number and fits the context.

USING TSV COORDINATES AND LAYOUT
--------------------------------

You DO NOT need to reproduce or output TSV itself.

However, you SHOULD use TSV layout information internally:

1. Use \"block_num\", \"line_num\", and the (left, top, width, height)
   coordinates to understand which words likely belong together in
   the same line or table row.

2. When merging corrections:
   - If AI_OCR_TEXT contains a line that corresponds to a line in TSV,
     you may replace specific words or numbers in that line using TSV data,
     but keep the overall sentence / line structure from AI_OCR_TEXT.

3. If AI_OCR_TEXT is missing an entire small label or short line that
   clearly appears in TSV with good confidence, you may add it to the
   final output in a logically consistent position for that page.

4. If two TSV words with good confidence overlap in coordinates but
   conflict, consider that a warning and prefer the value that:
   - Is consistent with the rest of the page, OR
   - Appears more consistently in other parts of the document.

USE OF THE ORIGINAL PDF
-----------------------

Only use the ORIGINAL_PDF as a tie-breaker when:
- AI_OCR_TEXT and TESSERACT_TSV disagree on a critical value AND
- Both seem plausible.

In that case, visually inspect the corresponding area of the PDF and
choose the value that matches the actual text. If you cannot be certain,
prefer to omit or mark the value clearly rather than guessing.

NO HALLUCINATIONS
-----------------

You MUST NOT invent text or numbers that are not supported by:
- AI_OCR_TEXT, or
- TESSERACT_TSV, or
- clearly visible content in the ORIGINAL_PDF.

If none of these sources contains a value, leave it out rather than guessing.

If a word or number is uncertain, it is better to:
- Keep the AI_OCR_TEXT version if it is plausible, OR
- Omit it entirely,
rather than inventing a more \"reasonable\" value.

PAGE-BY-PAGE OUTPUT
-------------------

For each page N:

1. Start with the content from AI_OCR_TEXT for \"[page N]\".
2. Apply corrections informed by TESSERACT_TSV (and ORIGINAL_PDF if needed)
   directly in that page's content.
3. Ensure your final output keeps the exact framing:

   [page N]
   ... corrected content ...
   [end of page N]

Do NOT merge pages, do NOT renumber pages, and do NOT introduce page
content that does not exist.

FINAL REMINDER
--------------

Your final answer must ONLY be the cleaned-up OCR text in the framed
structure described above, and nothing else.
Do NOT talk about your reasoning.
Do NOT include Tesseract TSV or AI_OCR_TEXT verbatim outside that structure.

";

    private $chat;

    public function __construct($firstPass, $tesseract, $fileId = null) {
        $this->firstPass = $firstPass;
        $this->tesseract = $tesseract;
        // Store the file ID; without this the original PDF is never attached.
        $this->fileId = $fileId;
    }

    public function init() {
        $this->chat = new Response();
        $this->chat->model("gpt-4.1");
        $this->chat->instructions(self::$instructions);
        if($this->reference) {
            $this->chat->reference($this->reference);
        }
    }

    public function run($reference=null) {
        if($reference !== null) {
            $this->reference = $reference;
        }
        $this->init();
        $result = null;
        try {
            $result = $this->runExecute();
        } catch (\Exception $e) {
            Output::outLine("[OCR] process failed for compare - exception: ".$e->getMessage());
            return $this->reTryOrReturn("[OCR FAILED - EXCEPTION: ".$e->getMessage()."]");
        }
        if($result) {
            return $result;
        } else {
            Output::outLine("[OCR] process failed for compare - no result");
            return $this->reTryOrReturn("[OCR FAILED - NO RESULT]");
        }
    }

    private function runExecute() {
        // run() has already called init(), so the chat is not re-initialized here.
        $this->chat->addMessage($this->generateMessage());
        if($this->fileId !== null) {
            $this->chat->addFile($this->fileId);
        }
        $this->chat->addMessage("Provide a combined, complete final output");
        $result = $this->chat->create();
        if(!$result) {
            // Returning null lets run() log the failure and apply the retry policy once,
            // instead of starting a second, nested retry chain from here.
            return null;
        }
        return $result->getResult();
    }
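
    /**
     * Illustrative sketch only, not part of the original agent. The instructions
     * above require the reply to use an exact framed structure, so a caller may
     * want to check that before accepting a result. The method name
     * isFramedOutput() is hypothetical and nothing in this class calls it.
     *
     * It checks only that the reply starts and ends with the document tags and
     * that the "[page N]" and "[end of page N]" tags list the same page numbers
     * in the same order. It does not verify that the tags are correctly
     * interleaved. Failure strings such as "[OCR FAILED - NO RESULT]" are
     * rejected.
     */
    public static function isFramedOutput($text) {
        if(!is_string($text)) {
            return false;
        }
        $text = trim($text);
        $begin = "[BEGIN DOCUMENT OCR OUTPUT]";
        $end = "[END DOCUMENT OCR OUTPUT]";
        // The document tags must be the first and last things in the reply.
        if(strpos($text, $begin) !== 0 || substr($text, -strlen($end)) !== $end) {
            return false;
        }
        // Collect the page numbers from opening and closing tags, one tag per line.
        preg_match_all('/^\s*\[page (\d+)\]\s*$/m', $text, $opened);
        preg_match_all('/^\s*\[end of page (\d+)\]\s*$/m', $text, $closed);
        // At least one page, and the two number sequences must be identical.
        return count($opened[1]) > 0 && $opened[1] === $closed[1];
    }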

    private function reTryOrReturn($result) {
        if($this->retries < static::$maxRetries) {
            $this->retries++;
            Output::outLine("[OCR] Retrying compare process (attempt ".$this->retries." of ".static::$maxRetries.")");
            sleep(static::$retryDelay);
            // run() without arguments reuses the reference stored by the first call.
            return $this->run();
        } else {
            Output::outLine("[OCR] Max retries reached for compare process");
            return $result;
        }
    }

    private function generateMessage() {
        // Labels match the source names used in self::$instructions.
        // The closing request is sent as a separate message by runExecute().
        $messages = [];
        $messages[] = "** AI_OCR_TEXT (AI-generated OCR data):\n" . $this->firstPass;
        $messages[] = "** TESSERACT_TSV (Tesseract data):\n" . $this->tesseract;
        if($this->fileId !== null) {
            $messages[] = "** ORIGINAL_PDF: you may refer to the original PDF document with file ID: ".$this->fileId;
        }
        return implode("\n\n", $messages);
    }
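
    /**
     * Illustrative sketch only, not part of the original agent. The instructions
     * ask the model to ignore Tesseract words with very low confidence. A caller
     * could instead drop those rows before building the message, which also
     * shortens the prompt. The method name filterTsvByConfidence() and the
     * default threshold of 40 are assumptions (the threshold is taken from the
     * "conf < 40-50" guidance above), and nothing in this class calls it.
     *
     * Assumes the standard 12-column Tesseract TSV layout described in the
     * instructions, with "conf" as the 11th column. Header rows, structural
     * rows (conf of -1) and rows that do not have 12 columns are kept as-is.
     */
    public static function filterTsvByConfidence($tsv, $minConf = 40) {
        $lines = preg_split('/\r\n|\r|\n/', (string)$tsv);
        $kept = [];
        foreach($lines as $line) {
            $cols = explode("\t", $line);
            // Keep anything that is not a well-formed data row (headers, blanks).
            if(count($cols) < 12 || !is_numeric($cols[10])) {
                $kept[] = $line;
                continue;
            }
            $conf = (float)$cols[10];
            // Negative conf marks page/block/line rows; keep them for layout context.
            if($conf < 0 || $conf >= $minConf) {
                $kept[] = $line;
            }
        }
        return implode("\n", $kept);
    }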
}