Convert JPG to Searchable PDF: A Comprehensive Guide


Converting images to PDF, and especially to searchable PDF, can be a real obstacle. I tried many libraries that turned out to be useless, and spent countless hours trying to generate a searchable PDF from JPEG scans of Word-like documents.

At first, I tried classical approaches such as the Python img2pdf library. The conversion itself succeeded, but the resulting PDF contained nothing searchable.

Next, I took a deep dive into Tesseract. Rumor had it that Tesseract was a great solution to this kind of problem, but when I ran a command, it generated only a plain-text version of the file.

I went down many roads…

  • Google, can you please tell me how I can generate a searchable PDF?
  • Google, please tell me, is there any alternative to img2pdf?

Then, after most of the questions like the ones above had passed through my browser, I stumbled upon the Syncfusion libraries: an ASP.NET, in-house solution to all my headaches. Along the way I had also given “Magick” a shot. Unfortunately, that produced nothing useful either.

using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
using Syncfusion.Pdf.Parsing;
using Syncfusion.OCRProcessor;
using System.IO;
using System.Diagnostics;

namespace Sync3
{
    class Program
    {
        public static void CreatePdf(string name)
        {
            //Create a new PDF document
            PdfDocument document = new PdfDocument();
            //Add a page to the document
            PdfPage page = document.Pages.Add();
            //Create PDF graphics for a page
            PdfGraphics graphics = page.Graphics;
            //Load the image from the disk
            PdfBitmap image = new PdfBitmap(name);
            //Draw the image so that it fills the page
            graphics.DrawImage(image, 0, 0, page.GetClientSize().Width, page.GetClientSize().Height);
            //Save the document into a stream and rewind it so it can be read back
            MemoryStream stream = new MemoryStream();
            document.Save(stream);
            stream.Position = 0;
            document.Close(true);
            //Initialize the OCR processor by providing the path of tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
            using (OCRProcessor processor = new OCRProcessor(@"C:\Users\anil.bektas\Desktop\tess_bin"))
            {
                //Load the image-only PDF from the stream
                PdfLoadedDocument lDoc = new PdfLoadedDocument(stream);
                //Set OCR language to process
                processor.Settings.Language = Languages.English;
                //Process OCR by providing the PDF document and Tesseract data
                processor.PerformOCR(lDoc, @"C:\Users\anil.bektas\Desktop\Tessdata");
                //Save the OCR processed PDF document in the disk
                lDoc.Save("OCR" + name.Split('.')[0] + ".pdf");
                //Close the document
                lDoc.Close(true);
            }
            //This will open the PDF file so, the result will be seen in default PDF viewer
            Process.Start("OCR" + name.Split('.')[0] + ".pdf");
        }

        static void Main(string[] args)
        {
            //Convert each image into a searchable PDF (3.jpg -> OCR3.pdf, 4.jpg -> OCR4.pdf)
            CreatePdf("3.jpg");
            CreatePdf("4.jpg");
            //Create a new PDF document
            PdfDocument finalDoc = new PdfDocument();
            string[] source = { "OCR3.pdf", "OCR4.pdf" };
            //Merge the OCR processed PDFs and save the result as Sample.pdf
            PdfDocument.Merge(finalDoc, source);
            finalDoc.Save("Sample.pdf");
            finalDoc.Close(true);
        }
    }
}

As you can see, the program takes two input files, 3.jpg and 4.jpg, both Word-like pages full of text, and generates OCR3.pdf and OCR4.pdf respectively. In the end, it merges the PDFs into Sample.pdf.
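One detail worth a note: the listing derives each output name with name.Split('.')[0], which works for 3.jpg but would cut a name such as scan.page1.jpg down to "scan". Below is a minimal sketch of a more forgiving alternative, assuming the same "OCR" + name + ".pdf" scheme. OcrNaming and OutputName are hypothetical names of mine, not part of Syncfusion; Path.GetFileNameWithoutExtension is the standard .NET call.

```csharp
using System;
using System.IO;

static class OcrNaming
{
    // Hypothetical helper (not a Syncfusion API): builds the "OCR<name>.pdf"
    // output name used in the listing, keeping any dots inside the file name.
    public static string OutputName(string imagePath)
    {
        return "OCR" + Path.GetFileNameWithoutExtension(imagePath) + ".pdf";
    }
}

class NamingDemo
{
    static void Main()
    {
        Console.WriteLine(OcrNaming.OutputName("3.jpg"));          // OCR3.pdf
        Console.WriteLine(OcrNaming.OutputName("scan.page1.jpg")); // OCRscan.page1.pdf
    }
}
```

Note that this also drops any directory part of the input path, so the output lands in the current working directory.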

The installation is a bit confusing, since “Syncfusion.OCRProcessor” is not yet a standalone package on NuGet and there isn’t any tutorial on the Syncfusion site explaining how to set things up. By guessing, I happened to install a package that contains it: Syncfusion.PDF.OCR.AspNet.

But what do C:\Users\anil.bektas\Desktop\tess_bin and C:\Users\anil.bektas\Desktop\Tessdata mean? Tessdata and tess_bin are folders that I renamed myself. tess_bin contains the Tesseract DLLs, and Tessdata contains the English language data file. I will publish the necessary files and update the post. You can find them through Google, but since I already have them all, I will spare you the minor details.
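Since a wrong folder is an easy mistake to make here, a small pre-flight check can help before calling the OCR processor. This is only a sketch: TesseractFolders and MissingFiles are hypothetical names of mine, the DLL names are the ones mentioned in the listing's comment, and eng.traineddata is my assumption for the English data file (it is the usual Tesseract name, but check what your Tessdata folder actually contains).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TesseractFolders
{
    // Hypothetical helper: returns the expected file names that are NOT in the folder.
    public static List<string> MissingFiles(string folder, params string[] expected)
    {
        var missing = new List<string>();
        foreach (string file in expected)
        {
            if (!File.Exists(Path.Combine(folder, file)))
                missing.Add(file);
        }
        return missing;
    }
}

class FolderCheckDemo
{
    static void Main()
    {
        // Paths from the post; replace them with your own folders.
        string tessBin = @"C:\Users\anil.bektas\Desktop\tess_bin";
        string tessData = @"C:\Users\anil.bektas\Desktop\Tessdata";

        // DLL names come from the listing's comment; eng.traineddata is an assumption.
        var missing = TesseractFolders.MissingFiles(tessBin, "SyncfusionTesseract.dll", "liblept168.dll");
        missing.AddRange(TesseractFolders.MissingFiles(tessData, "eng.traineddata"));

        foreach (string file in missing)
            Console.WriteLine("Missing: " + file);
    }
}
```

If nothing is printed, both folders contain the files the check looks for.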

In the end, if you are an ASP.NET developer, it is a huge relief to find a solution that is itself written in C#.


Tessdata: (where files such as the language data are kept)

Tess_bin, that is, the Tesseract binaries: