PDF Text Extraction with OCR in Next.js 13

In this tutorial, we'll extract text from a PDF using Optical Character Recognition (OCR) in a Next.js 13 application. We'll use PDF.js to render each PDF page to an image and Tesseract.js to run OCR on those images.


Setting up the Next.js 13 app

Let's start by creating a new Next.js 13 app using the following command:

npx create-next-app@latest next-ocr
cd next-ocr

Next, install the required packages:

npm install tesseract.js pdfjs-dist

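You also need to update next.config.js so that webpack does not try to bundle the Node-only canvas package that pdfjs-dist optionally requires. Otherwise, the build fails with:
Error: Cannot find module '../build/Release/canvas.node'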

/** @type {import('next').NextConfig} */
const nextConfig = {
  webpack: (config) => {
    // pdfjs-dist optionally requires the Node-only "canvas" package;
    // aliasing it to false keeps webpack from trying to resolve it.
    config.resolve.alias.canvas = false;
    config.resolve.alias.encoding = false;
    return config;
  },
};

module.exports = nextConfig;

Setting up PDF.js for PDF Rendering

We'll use PDF.js, a powerful JavaScript library for rendering PDFs in the browser.

Create a file lib/pdf-to-img.ts and add the following code to set up PDF.js and define functions to load the PDF and render each page as an image.

// @ts-ignore
import * as pdfjsLib from "pdfjs-dist/build/pdf";
import type { PDFPageProxy } from "pdfjs-dist/types/src/display/api";
pdfjsLib.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjsLib.version}/pdf.worker.min.js`;

const loadPdf = async (file: File): Promise<PDFPageProxy[]> => {
  const uri = URL.createObjectURL(file);
  const pdf = await pdfjsLib.getDocument({ url: uri }).promise;

  const pages: PDFPageProxy[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    pages.push(page);
  }
  return pages;
};

const renderPageToImage = async (
  page: PDFPageProxy,
  scale: number = 3
): Promise<string> => {
  // Rendering logic explained later
};

export const pdfToImg = async (file: File): Promise<string[]> => {
  try {
    const pages = await loadPdf(file);
    const images: string[] = [];

    for (const page of pages) {
      const image = await renderPageToImage(page);
      images.push(image);
    }

    return images;
  } catch (error) {
    console.error("PDF error:", error);
    return [];
  }
};

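In the above code, we first point pdfjsLib.GlobalWorkerOptions.workerSrc at the PDF.js worker script. PDF.js does its parsing in a web worker, so it needs to know where to load that script from, and the worker version must match the library version (hence the ${pdfjsLib.version} in the URL). loadPdf then opens the file and collects every page, and pdfToImg renders each page to an image.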
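To make the version coupling explicit, the worker URL can be built by a small helper. This is only a sketch: buildWorkerSrc is a hypothetical name, the version string in the usage line is just an example value, and the helper uses an explicit https:// prefix instead of the protocol-relative URL shown above.

```typescript
// Hypothetical helper (not part of PDF.js): builds the cdnjs URL of the
// worker script for a given PDF.js version, following the URL pattern
// used in lib/pdf-to-img.ts.
function buildWorkerSrc(version: string): string {
  return `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${version}/pdf.worker.min.js`;
}

// Example value only; in the app you would pass pdfjsLib.version so the
// worker always matches the installed library.
const workerSrc = buildWorkerSrc("3.11.174");
console.log(workerSrc);
```

In the app, the result would be assigned to pdfjsLib.GlobalWorkerOptions.workerSrc.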

Rendering PDF Pages as Images

We'll now implement the renderPageToImage function to render PDF pages to images. We'll use the HTML5 Canvas API to render the PDF page to a canvas and then convert the canvas to a data URL.

// lib/pdf-to-img.ts
const renderPageToImage = async (
  page: PDFPageProxy,
  scale: number = 3
): Promise<string> => {
  const viewport = page.getViewport({ scale });
  const canvas = document.createElement("canvas");
  const context = canvas.getContext("2d");

  if (!canvas || !context) {
    throw new Error("Canvas or context is null.");
  }

  const pixelRatio = window.devicePixelRatio || 1;
  canvas.width = viewport.width * pixelRatio;
  canvas.height = viewport.height * pixelRatio;
  context.scale(pixelRatio, pixelRatio);

  context.imageSmoothingEnabled = true;
  context.imageSmoothingQuality = "high";

  const renderContext = {
    canvasContext: context,
    viewport: viewport,
    enableWebGL: false,
  };

  const renderTask = page.render(renderContext);

  await renderTask.promise;

  return canvas.toDataURL();
};

In the above code, we first create a canvas and get its 2D context. We then set the canvas width and height to the viewport size multiplied by the device pixel ratio, and scale the context by the same ratio. This keeps the rendered page sharp on high-resolution displays, which also gives the OCR step more pixels to work with. Finally, canvas.toDataURL() returns the page as a PNG data URL.
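It helps to see how large the resulting canvas is. The sketch below repeats the sizing arithmetic as a pure function; canvasPixelSize and the Size interface are hypothetical names, and it assumes an unrotated page whose viewport size is the page size in points multiplied by the scale.

```typescript
interface Size {
  width: number;
  height: number;
}

// Hypothetical helper: mirrors the sizing arithmetic in renderPageToImage.
// pageWidthPt / pageHeightPt are the page dimensions in PDF points.
function canvasPixelSize(
  pageWidthPt: number,
  pageHeightPt: number,
  scale: number,
  pixelRatio: number
): Size {
  const viewportWidth = pageWidthPt * scale;
  const viewportHeight = pageHeightPt * scale;
  // Canvas dimensions are whole numbers of pixels, so round down.
  return {
    width: Math.floor(viewportWidth * pixelRatio),
    height: Math.floor(viewportHeight * pixelRatio),
  };
}

// US Letter (612 x 792 pt) at scale 3 on a display with devicePixelRatio 2.
const letter = canvasPixelSize(612, 792, 3, 2);
console.log(letter); // → { width: 3672, height: 4752 }
```

That is roughly 17 million pixels for a single page, so for long documents it may be worth lowering the scale if memory use or OCR time becomes a problem.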

Implementing OCR using Tesseract.js

We'll now implement handleExtractPdf, which converts the PDF to images and extracts the text from each image using Tesseract.js.

// app/page.tsx
"use client";

import { createWorker } from "tesseract.js";
import { pdfToImg } from "@/lib/pdf-to-img";

const Home = () => {
  const handleExtractPdf = async (file: File) => {
    if (!file) return;
    try {
      const images = await pdfToImg(file);
      const pages: string[] = [];

      // Create one worker and reuse it for every page
      const worker = await createWorker({
        logger: (m) => console.log(m),
      });

      await worker.load();
      await worker.loadLanguage("eng");
      await worker.initialize("eng");

      for (const image of images) {
        const { data: { text } } = await worker.recognize(image);

        // Push the extracted text from each page to the pages array
        pages.push(text);
      }

      await worker.terminate();

      return pages;
    } catch (error) {
      console.error("Error extracting PDF:", error);
    }
  };

  // The rest of the component is shown in the next section
};

export default Home;

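In the above code, we create a Tesseract.js worker, load the English language data, and call recognize on each page image to extract its text. Once the text has been collected, we terminate the worker to free its resources. The setup calls follow the Tesseract.js release this tutorial was written against; newer major versions have changed how a worker is created and initialized, so check the documentation for the version you install.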
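handleExtractPdf returns one string per page. If you would rather work with a single document, the pages can be joined afterwards. The helper below is a minimal sketch; combinePages and the page marker format are assumptions for illustration, not part of Tesseract.js.

```typescript
// Hypothetical helper: joins per-page OCR text into one string, trimming
// surrounding whitespace and labelling each page with its number.
function combinePages(pages: string[]): string {
  return pages
    .map((text, index) => `--- Page ${index + 1} ---\n${text.trim()}`)
    .join("\n\n");
}

const combined = combinePages([" Hello \n", "World"]);
console.log(combined);
// → --- Page 1 ---
//   Hello
//
//   --- Page 2 ---
//   World
```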

Triggering PDF Extraction on User Input

// app/page.tsx
"use client";

import { createWorker } from "tesseract.js";
import { pdfToImg } from "@/lib/pdf-to-img";

const Home = () => {
  // handleExtractPdf function explained above

  const handleFileUpload = async (
    event: React.ChangeEvent<HTMLInputElement>
  ) => {
    const files = event.target.files;
    if (!files) return;
    const file = files[0];
    const pdfContent = await handleExtractPdf(file);
    console.log("Extracted PDF content:", pdfContent);
  };

  return (
    <div>
      <input type="file" accept=".pdf" onChange={handleFileUpload} />
    </div>
  );
};

export default Home;

In this final step, we create a file input element to allow users to upload a PDF file and trigger the PDF extraction process.
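The accept=".pdf" attribute only filters the file picker; it does not guarantee that the selected file is a PDF. A small check before calling handleExtractPdf can catch obvious mistakes. This is a sketch under assumptions: isPdfFile and FileLike are hypothetical names, and the check looks only at the MIME type and file extension, not at the file contents.

```typescript
// Minimal shape shared with the browser's File object.
interface FileLike {
  name: string;
  type: string;
}

// Hypothetical helper: accepts files reported as application/pdf, and falls
// back to the .pdf extension when the browser reports no MIME type.
function isPdfFile(file: FileLike): boolean {
  if (file.type === "application/pdf") return true;
  return file.type === "" && /\.pdf$/i.test(file.name);
}

console.log(isPdfFile({ name: "report.pdf", type: "application/pdf" })); // → true
console.log(isPdfFile({ name: "photo.png", type: "image/png" })); // → false
```

In handleFileUpload, you could return early when isPdfFile(file) is false.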

Conclusion

In this tutorial, we learned how to extract text from a PDF using Optical Character Recognition (OCR) in a Next.js 13 application. We used PDF.js for rendering PDF pages and Tesseract.js for OCR.

Resources

Tesseract.js - A JavaScript library that gets words in almost any language out of images.
PDF.js - A JavaScript library that renders PDF files using the HTML5 Canvas API.

Source Code

The complete source code for this tutorial is available on GitHub.

Thank you for reading 💙