# BoruOCR

BoruOCR is a PHP library that runs a deterministic OCR pipeline over PDFs, images, spreadsheets, and other document types, with optional AI post‑processing layered on top.

The core pipeline:
- Renders pages using a best‑fit page image provider (MuPDF, libvips, Imagick, etc.).
- Runs Tesseract on each page to produce both plain-text and TSV (word-level layout) output.
- Builds a baseline text representation from TSV layout analysis.
- Optionally hands the result to AI agents for higher‑level interpretation.
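
To make the render and OCR steps above concrete, here is a rough sketch of doing them by hand with the `mutool` and `tesseract` command-line tools. It is an illustration only, not BoruOCR's actual implementation: the function name `manual_ocr_sketch`, the 300 DPI setting, and the output file layout are all assumptions chosen for the example.

```bash
#!/bin/sh
# Hypothetical helper (not part of BoruOCR): rasterize a PDF and OCR each page.
# Usage: manual_ocr_sketch <input.pdf> <output-dir>
manual_ocr_sketch() {
    pdf="$1"
    outdir="$2"

    # Fail early if the input cannot be read.
    [ -r "$pdf" ] || { echo "cannot read: $pdf" >&2; return 2; }

    # Both CLI tools must be on PATH for this sketch to work.
    for tool in mutool tesseract; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 3
        }
    done

    mkdir -p "$outdir" || return 1

    # Render every page to PNG at an assumed 300 DPI: page-1.png, page-2.png, ...
    mutool draw -r 300 -o "$outdir/page-%d.png" "$pdf" || return 1

    for img in "$outdir"/page-*.png; do
        [ -e "$img" ] || continue
        base="${img%.png}"
        # Writes $base.txt (plain text) and $base.tsv (per-word boxes and confidence).
        tesseract "$img" "$base" txt tsv || return 1
    done
}
```

The TSV output carries the bounding box and confidence for each recognized word, which is the kind of data a layout analysis step can use to rebuild reading order.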

For some source types (e.g. spreadsheets, Word documents) BoruOCR can short‑circuit OCR and use direct text extraction instead.

## Requirements

- PHP 5.6 or later
- Composer

## Recommended Debian/Ubuntu tools

These tools are **not** installed by Composer but are strongly recommended for full functionality and best quality:

- `mupdf-tools`  
  Provides `mutool`/`mudraw` for high‑quality, fast PDF rasterization and text extraction backends.

- `libvips` (Debian/Ubuntu package: `libvips-tools`)  
  High‑performance image processing library used by the libvips page provider for large PDFs and images.

- `docx2txt`  
  CLI tool to extract text directly from `.docx` files without running OCR.

- `antiword`  
  CLI tool to extract text from legacy `.doc` files.

For example, on Debian/Ubuntu:

```bash
sudo apt-get update
sudo apt-get install mupdf-tools libvips-tools docx2txt antiword
```

> Note: Exact package names may vary by distribution; adjust as needed.
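
After installing, a quick way to confirm that the tools are visible on `PATH` is a small check script like the one below. This is a minimal sketch: the helper names `check_tool` and `check_boruocr_tools` are made up for illustration, and the binary names (`tesseract`, `mutool`, `vips`, `docx2txt`, `antiword`) are the usual ones on Debian/Ubuntu but are assumptions that may differ on your system.

```bash
#!/bin/sh
# Hypothetical helper: report whether a single command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Hypothetical helper: check the external tools this README mentions.
# Binary names are assumed; adjust them for your distribution.
check_boruocr_tools() {
    missing=0
    for tool in tesseract mutool vips docx2txt antiword; do
        check_tool "$tool" || missing=$((missing + 1))
    done
    echo "$missing tool(s) missing"
}

check_boruocr_tools
```

A missing tool here does not necessarily mean BoruOCR will fail, since these tools are recommended rather than required, but it may explain reduced quality or an unavailable backend.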

## Installation

```bash
composer require boru/ocr
```

## Basic usage

```php
<?php

use boru\ocr\OcrEngine;

$engine = OcrEngine::forFile('/path/to/document.pdf');

// Optional: enable AI post-processing (requires boru/boruai configuration)
// $engine->withAI(true);

$result = $engine->run();

// Get normalized final text
$text = $result->getText();

echo $text;
```

See the source under `src/` for more advanced configuration options (layout tuning, AI options, evidence export, etc.).