Convert JPG to Searchable PDF: A Comprehensive Guide 2

Hi there. In a tutorial last year, I showed how to convert an image file to a searchable (OCR) PDF on the .NET platform. This post is current as of August 2020.

For the details, please see the original article: https://anilbektas.com/2019/10/14/convert-jpg-to-searchable-pdf-a-comprehensive-guide/

After some trial and error, I found that this information is somewhat out of date, so I decided to revamp the article about my PDF converter.

[HttpPost]
public async Task<IActionResult> JpegToPdf(List<IFormFile> files)
{
	System.Text.EncodingProvider encoding = System.Text.CodePagesEncodingProvider.Instance;
	Encoding.RegisterProvider(encoding);

	var root = _hostingEnvironment.WebRootPath + "\\Upload\\";
	var bins = _hostingEnvironment.WebRootPath + "\\Bins\\";

	var pdfPath = root + Functions.RandomString(3) + ".pdf";
	var outputOcrPath = root + Functions.RandomString(3) + ".pdf";

	foreach (var formFile in files)
	{
		//Creating the new PDF document
		PdfDocument document = new PdfDocument();
		if (formFile.Length > 0)
		{
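			MemoryStream file = new MemoryStream();
			//Copy the upload asynchronously, since the action is declared async
			await formFile.CopyToAsync(file);
			//Rewind so the image is read from the start of the stream
			file.Position = 0;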
			//Loading the image
			PdfImage image = PdfImage.FromStream(file);
			//Adding new page
			PdfPage page = document.Pages.Add();

			//Drawing image to the PDF page
			page.Graphics.DrawImage(image, new PointF(0, 0));
			file.Dispose();
		}
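		//Saving the PDF to the MemoryStream
		MemoryStream stream = new MemoryStream();
		document.Save(stream);
		//Close the document to release its resources
		document.Close(true);
		//Set the position as '0'.
		stream.Position = 0;
		StreamToFile(stream, pdfPath);
		stream.Dispose();
		PdfToOcrPdf(bins, pdfPath, outputOcrPath);
		GC.Collect();
		//Note: this returns after the first file, so only one image is converted per request
		return Ok(new { path = outputOcrPath });
	}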
	return BadRequest();

}

Note these lines:

System.Text.EncodingProvider encoding = System.Text.CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(encoding);

Registering the code pages encoding provider prevents some errors when the Tesseract binaries are read from the tess_bin folder. Next, the image is read into memory. Each IFormFile passed to the controller action is an uploaded image, which can be a PNG or a JPEG.
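To see what the registration changes, here is a small sketch you can try in a console project. The helper names (EncodingDemo, IsAvailable, Register) are made up for this example, and code page 1252 is only an illustration; I have not checked which encoding the OCR library actually asks for.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Returns true if the runtime can resolve the given code page.
    public static bool IsAvailable(int codePage)
    {
        try
        {
            Encoding.GetEncoding(codePage);
            return true;
        }
        catch (Exception ex) when (ex is NotSupportedException || ex is ArgumentException)
        {
            return false;
        }
    }

    // Same registration as in the controller action above.
    public static void Register()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }
}
```

On .NET Core I would expect IsAvailable(1252) to return false before Register() is called and true afterwards. On the full .NET Framework the code pages are built in, so the registration should make no difference there.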

StreamToFile(stream, pdfPath);
PdfToOcrPdf(bins, pdfPath, outputOcrPath);

The lines above first save the PDF, which is not searchable yet, to disk. PdfToOcrPdf then runs OCR on that saved PDF and writes a searchable copy to outputOcrPath.

The bins path is the parent folder of both tess_bin and tessdata, so when you pass the Tesseract binaries path, please append “\\tess_bin” to the end of it. What changed in the folder structure compared with 2018-2019 is shown below.

[Image: folder structure of the Tesseract binaries]

Note that there are now x64 and x86 subfolders in the path.
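Since both folders hang off the same parent, the two paths can be built in one place. This is a minimal sketch, assuming the folder names tess_bin and tessdata from above; the class and method names (TesseractPaths, BinariesPath, DataPath) are hypothetical and not part of my project.

```csharp
using System.IO;

public static class TesseractPaths
{
    // Folder holding the Tesseract binaries (with the x64 and x86 subfolders).
    public static string BinariesPath(string bins)
    {
        return Path.Combine(bins, "tess_bin");
    }

    // Folder holding the language packs.
    public static string DataPath(string bins)
    {
        return Path.Combine(bins, "tessdata");
    }
}
```

Path.Combine inserts a separator only when one is needed, so it does not matter whether the bins path ends with a backslash or not.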

Two pieces of code are still missing above: the OCR step that uses the Tesseract path, and the stream-to-file writer. Long story short, we first convert the image to a PDF and then convert that PDF to an OCR PDF.

private void PdfToOcrPdf(string bins, string pdfPath, string outputOcrPath)
{
	//Dispose the processor and the input file stream when done, so the file is not left locked
	using (OCRProcessor processor = new OCRProcessor(bins + "\\tess_bin"))
	using (FileStream stream = new FileStream(pdfPath, FileMode.Open, FileAccess.Read))
	{
		//Load a PDF document
		PdfLoadedDocument document = new PdfLoadedDocument(stream);
		//Set OCR language
		processor.Settings.Language = Languages.English;
		//Perform OCR with input document and tessdata (Language packs)
		try
		{
			processor.PerformOCR(document, bins + "\\tessdata");
		}
		catch (System.Exception ex)
		{
			//If OCR fails, log the error; the PDF is still saved below, but without a text layer
			Console.WriteLine(ex);
		}
		using (MemoryStream stream1 = new MemoryStream())
		{
			//Save the document into stream.
			document.Save(stream1);
			//If the position is not set to '0' then the PDF will be empty.
			stream1.Position = 0;
			//Close the document.
			document.Close(true);
			StreamToFile(stream1, outputOcrPath);
		}
	}
}

And lastly, the stream-to-file writer:

private void StreamToFile(Stream stream, string outputFile)
{
    using (var fileStream = new FileStream(outputFile, FileMode.Create, FileAccess.Write))
    {
        stream.CopyTo(fileStream);
    }
}

Functions.RandomString(3) generates a random string for the file name. You can implement it however you like, so I am not sharing its code here. PdfToOcrPdf takes the bins path, the path of the PDF that was converted from the image, and the output PDF path. That’s all.
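If you want a starting point for the generator, something like the following would do. This is only a sketch and not the code from my project: the class and method names mirror the call above, but the alphabet and the use of RandomNumberGenerator are assumptions.

```csharp
using System;
using System.Security.Cryptography;

public static class Functions
{
    // Hypothetical alphabet; any set of file-name-safe characters would do.
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";

    // Returns a random string of the requested length.
    public static string RandomString(int length)
    {
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            // GetInt32 returns a value in the range [0, Alphabet.Length).
            chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
        }
        return new string(chars);
    }
}
```

Keep in mind that three characters give only a small number of possible names, so two uploads can end up with the same file name. A longer length, or something like Guid.NewGuid().ToString("N"), makes a collision much less likely.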

Thanks and stay with code!