Convert JPG to Searchable PDF: A Comprehensive Guide


Converting images to PDF, and especially to searchable PDF, can be a real obstacle. I tried many libraries that turned out to be useless, and spent countless hours trying to generate a searchable PDF from JPEG images of Word-like documents.

At first, I tried the classical approach: the Python img2pdf library. The conversion succeeded, but the resulting PDF contains only the image, so there is nothing in it to search.

Next, I took a deep dive into Tesseract (https://github.com/tesseract-ocr/tesseract). Rumor had it that Tesseract was a great solution to this kind of problem, but when I ran the command, it generated only a plain-text version of the file.
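For reference, the Tesseract command-line tool writes a .txt file by default, which matches the text-only result described above. Appending the `pdf` config name to the command asks it for a PDF with a searchable text layer instead. The following is an untested sketch of calling it from C#, assuming the `tesseract` executable is on the PATH and the English language data is installed. The class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

static class TesseractCli
{
    // Builds the arguments for: tesseract <image> <outputBase> -l <language> pdf
    // Tesseract appends ".pdf" to outputBase on its own.
    public static string BuildArguments(string imagePath, string outputBase, string language = "eng")
    {
        return $"\"{imagePath}\" \"{outputBase}\" -l {language} pdf";
    }

    // Runs the command and returns the exit code of the tesseract process.
    // Assumption: "tesseract" can be resolved through the PATH.
    public static int Run(string imagePath, string outputBase)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract",
            Arguments = BuildArguments(imagePath, outputBase),
            UseShellExecute = false
        };
        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

With this sketch, `TesseractCli.Run("3.jpg", "OCR3")` would be expected to leave an OCR3.pdf next to the input if Tesseract is installed correctly.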

I went down many roads…

  • Google, can you please tell me how to generate a searchable PDF?
  • Google, please tell me, is there any alternative to img2pdf?

Along the way I also gave “Magick” a shot. Unfortunately, the result was nothing again. Then, after most of the questions like the ones above had passed through my browser, I stumbled upon the Syncfusion libraries: an in-house ASP.NET solution to all my headaches.

using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;
using System.IO;
using System.Diagnostics;

namespace Sync3
{
    class Program
    {
        public static void CreatePdf(string name)
        {
            //Build the output name from the image file name without its extension
            string outputPath = "OCR" + Path.GetFileNameWithoutExtension(name) + ".pdf";
            //Create a new PDF document
            PdfDocument document = new PdfDocument();
            //Add a page to the document
            PdfPage page = document.Pages.Add();
            //Create PDF graphics for a page
            PdfGraphics graphics = page.Graphics;
            //Load the image from the disk
            PdfBitmap image = new PdfBitmap(name);
            //Draw the image so that it fills the client area of the page
            graphics.DrawImage(image, 0, 0, page.GetClientSize().Width, page.GetClientSize().Height);
            //Save the document into a stream, rewind the stream and close the document
            MemoryStream stream = new MemoryStream();
            document.Save(stream);
            stream.Position = 0;
            document.Close(true);
            //Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
            using (OCRProcessor processor = new OCRProcessor(@"C:\Users\anil.bektas\Desktop\tess_bin"))
            {
                //Load the PDF document from the stream
                PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
                //Set OCR language to process
                processor.Settings.Language = Languages.English;
                //Process OCR by providing the PDF document and the Tesseract data folder
                processor.PerformOCR(lDoc, @"C:\Users\anil.bektas\Desktop\Tessdata");
                //Save the OCR processed PDF document to the disk
                lDoc.Save(outputPath);
                //Close the document
                lDoc.Close(true);
            }
            //Open the PDF file in the default PDF viewer to see the result
            Process.Start(new ProcessStartInfo(outputPath) { UseShellExecute = true });
        }
        static void Main(string[] args)
        {
            //Convert each image into its own OCR processed PDF
            CreatePdf("3.jpg");
            CreatePdf("4.jpg");
            //Merge the OCR processed PDFs into a single document
            PdfDocument finalDoc = new PdfDocument();
            string[] source = { "OCR3.pdf", "OCR4.pdf" };
            PdfDocument.Merge(finalDoc, source);
            //Save the merged document and release its resources
            finalDoc.Save("Sample.pdf");
            finalDoc.Close(true);
        }
    }
}

As you can see, the program takes two input files, 3.jpg and 4.jpg, both images of Word-like documents full of text, and generates OCR3.pdf and OCR4.pdf respectively. In the end, it merges these PDFs into Sample.pdf.
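The naming convention described here ("OCR" + the image's base name + ".pdf") can be captured in a small helper, so that the list of PDFs to merge is derived from the list of images instead of being typed out by hand. This is a minimal sketch; `OcrNaming`, `ToOcrPdfName` and `ToOcrPdfNames` are hypothetical names, not part of Syncfusion. It uses `Path.GetFileNameWithoutExtension`, so file names with extra dots or a directory prefix still map to a sensible output name.

```csharp
using System;
using System.IO;
using System.Linq;

static class OcrNaming
{
    // Maps an image path such as "3.jpg" to the OCR output name "OCR3.pdf".
    public static string ToOcrPdfName(string imagePath)
    {
        return "OCR" + Path.GetFileNameWithoutExtension(imagePath) + ".pdf";
    }

    // Maps every input image to its OCR output name, preserving order.
    public static string[] ToOcrPdfNames(string[] imagePaths)
    {
        return imagePaths.Select(path => ToOcrPdfName(path)).ToArray();
    }
}
```

For the two inputs used in this post, `OcrNaming.ToOcrPdfNames(new[] { "3.jpg", "4.jpg" })` yields "OCR3.pdf" and "OCR4.pdf", which is the array that is passed to the merge step.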

The installation is a bit confusing. “Syncfusion.OCRProcessor” is not yet a standalone package on NuGet, and I could not find a tutorial on the Syncfusion site explaining how to set things up. By guessing, I happened to install a package that contains it: Syncfusion.PDF.OCR.AspNet.

But what do C:\Users\anil.bektas\Desktop\tess_bin and C:\Users\anil.bektas\Desktop\Tessdata mean? tess_bin and Tessdata are folders that I renamed myself. tess_bin contains the Tesseract DLLs, and Tessdata contains the English language data file. I will publish the necessary files and update this post. You can find them through Google, but since I already have them all, I will spare you the minor details.
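Since a wrong folder is an easy mistake to make here, it can help to check both folders before creating the OCR processor. The sketch below is only an illustration: `OcrSetupCheck` and `FindMissingFiles` are hypothetical names, the two DLL names are taken from the comment in the listing above, and `eng.traineddata` is an assumption based on Tesseract's usual naming for English language data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class OcrSetupCheck
{
    // Returns the expected files that are missing.
    // An empty list means the folder layout looks complete.
    public static List<string> FindMissingFiles(string tessBinDir, string tessDataDir)
    {
        string[] expected =
        {
            Path.Combine(tessBinDir, "SyncfusionTesseract.dll"),
            Path.Combine(tessBinDir, "liblept168.dll"),
            // Assumption: the English language data uses Tesseract's usual file name.
            Path.Combine(tessDataDir, "eng.traineddata")
        };

        var missing = new List<string>();
        foreach (string file in expected)
        {
            if (!File.Exists(file))
            {
                missing.Add(file);
            }
        }
        return missing;
    }
}
```

Calling `OcrSetupCheck.FindMissingFiles(...)` with the two folder paths and printing the returned list before running the OCR step would show right away which file is not where the code expects it.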

In the end, if you are an ASP.NET developer, it is a huge relief to find a solution that is itself written in C#.

Edit:

Tessdata (the folder where files such as the language data are kept): https://drive.google.com/drive/folders/1bs1n1wpphsBFFapKFTsRrN2zz-7DlpQq?usp=sharing

tess_bin (the Tesseract binaries): https://drive.google.com/drive/folders/1rO0TMIb_h4tMkh4pLaOmgUVCYrUoCtDj?usp=sharing