Reading PDF Files in Sahi

During automation we sometimes need to verify the contents of PDF files. A good open source library for reading PDF content is Apache PDFBox. There are two ways to use PDFBox. One is to write our own code to extract content. The other, simpler way is to use pdfbox-app, which extracts the text content of PDF files from the command line.
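
As a rough illustration of the command-line route, the sketch below assembles the kind of ExtractText command used in the sample code on this page. The helper name buildExtractTextCommand and the quoting of paths are assumptions added for illustration; they are not part of Sahi or PDFBox. The -console and -html flags are the ones used in this article. Quoting is meant to protect paths that contain spaces, but whether the quotes are honored depends on how the command string is tokenized when it is executed, so verify it in your setup.

```javascript
// Hypothetical helper (not a Sahi or PDFBox API): builds the pdfbox-app
// ExtractText command line as a single string.
function buildExtractTextCommand($jarPath, $pdfPath, $asHtml) {
  // Wrap a path in double quotes so that spaces do not split it
  // (assumption: the command runner honors double quotes).
  function quote($path) {
    return '"' + $path + '"';
  }
  var $parts = ["java", "-jar", quote($jarPath), "ExtractText", "-console"];
  if ($asHtml) {
    $parts.push("-html"); // ask for HTML output instead of plain text
  }
  $parts.push(quote($pdfPath));
  return $parts.join(" ");
}
```

The resulting string could be passed to _execute in place of a hand-concatenated command, for example _execute(buildExtractTextCommand($pdfboxAppJarPath, $pdf, true), true).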

Prerequisites

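To run the sample below, download pdfbox-app-1.7.1.jar from http://pdfbox.apache.org/download.html and copy it to the sahi/userdata/extlib/pdfbox folder (create the folder if required). In the code, replace x.x.x in the jar file name with the version you downloaded.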

Code

_log("Download pdfbox-app-1.7.1.jar from http://pdfbox.apache.org/download.html", "CUSTOM");
_log("Copy it to sahi/userdata/extlib/pdfbox folder (create if required)", "CUSTOM");
/**
<ul><li>Reads PDF file and shows the text in the page itself.</li>
<li>Useful for adding assertions</li>
<li>@param $pdf</li>
</ul>
*/
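function showPDFText($pdf) {
  // Path to the pdfbox-app jar under userdata; replace x.x.x with the actual version.
  var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox\\pdfbox-app-x.x.x.jar");
  // Resolve the PDF path relative to the current script.
  var $pdf = _resolvePath($pdf);
  // Run ExtractText synchronously and capture the HTML it prints to the console.
  var $data = _execute("java -jar " + $pdfboxAppJarPath + " ExtractText -console -html " + $pdf, true);
  // Replace the body of the currently open page with the extracted HTML.
  _call(document.body.innerHTML = $data);
}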

/**
<ul><li>Reads PDF file and returns the contents as a string</li>
<li>@param $pdf</li>
<li>@returns string text contents of the PDF file</li>
</ul>
*/

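function getPDFText($pdf) {
  // Path to the pdfbox-app jar under userdata; replace x.x.x with the actual version.
  var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox\\pdfbox-app-x.x.x.jar");
  // Resolve the PDF path relative to the current script.
  var $pdf = _resolvePath($pdf);
  // Run ExtractText synchronously and capture the text it prints to the console.
  var $data = _execute("java -jar " + $pdfboxAppJarPath + " ExtractText -console " + $pdf + " -text", true);
  return $data;
}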

var $pdf = "Gandhiji.pdf";
showPDFText($pdf);
_assertContainsText("2 October 1869", _paragraph("/Born/"));
_assertContainsText("30 January 1948", _paragraph("/Died/"));

In the code above, we use pdfbox-app-x.x.x.jar to convert our PDF to text. The getPDFText function returns the text, while showPDFText shows it in the currently open page itself, by replacing the HTML body.
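
When asserting directly on the string returned by getPDFText, line breaks and spacing in the extracted text may not match how a phrase looks in the PDF. A minimal sketch of one way to handle this is to collapse whitespace before comparing. The helper names normalizeSpaces and pdfTextContains are hypothetical, introduced here only for illustration.

```javascript
// Hypothetical helpers (not part of Sahi): compare text while ignoring
// differences in whitespace such as line breaks or repeated spaces.
function normalizeSpaces($text) {
  // Collapse every run of whitespace into one space and trim the ends.
  return String($text).replace(/\s+/g, " ").replace(/^ | $/g, "");
}

function pdfTextContains($pdfText, $expected) {
  // True when the expected phrase occurs in the text after normalization.
  return normalizeSpaces($pdfText).indexOf(normalizeSpaces($expected)) !== -1;
}
```

In a Sahi script this could be used as _assertTrue(pdfTextContains(getPDFText("Gandhiji.pdf"), "2 October 1869")); which checks the PDF content without replacing the page in the browser.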

Reading values of editable fields from PDF

To read fields from editable PDF files, please refer to read_pdf_fields.sah (which includes pdf_fields_lib.sah).