Sahi Pro - Reading PDF Files in Sahi
During automation we sometimes need to verify the contents of PDF files. A good open source library for reading PDF content is Apache PDFBox. There are two ways to use PDFBox. One is to write our own code to extract content. The other, simpler way is to use pdfbox-app, which extracts the text content of PDF files from the command line.
Prerequisites
- Download pdfbox-app-1.7.1.jar from http://pdfbox.apache.org/download.html
- Copy it to the sahi/userdata/extlib/pdfbox folder (create it if required)
- Copy the code below to sahi/userdata/scripts/pdf_pdfboxapp.sah
- Copy Gandhiji.pdf to sahi/userdata/scripts/
Code
_log("Download pdfbox-app-1.7.1.jar from http://pdfbox.apache.org/download.html", "CUSTOM");
_log("Copy it to sahi/userdata/extlib/pdfbox folder (create if required)", "CUSTOM");
/**
<ul><li>Reads PDF file and shows the text in the page itself.</li>
<li>Useful for adding assertions</li>
<li>@param $pdf</li>
</ul>
*/
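function showPDFText($pdf) {
    // Location of the pdfbox-app jar inside the Sahi userdata folder
    var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox\\pdfbox-app-1.7.1.jar");
    // Resolve the PDF path relative to this script
    var $pdf = _resolvePath($pdf);
    // Run ExtractText synchronously and capture its HTML output
    var $data = _execute("java -jar " + $pdfboxAppJarPath + " ExtractText -console -html " + $pdf, true);
    _call(document.body.innerHTML = $data);
}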
/**
<ul><li>Reads PDF file and returns the contents as a string</li>
<li>@param $pdf</li>
<li>@returns string text contents of the PDF file</li>
</ul>
*/
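function getPDFText($pdf) {
    // Location of the pdfbox-app jar inside the Sahi userdata folder
    var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox\\pdfbox-app-1.7.1.jar");
    // Resolve the PDF path relative to this script
    var $pdf = _resolvePath($pdf);
    // Run ExtractText synchronously and capture what it prints to the console
    var $data = _execute("java -jar " + $pdfboxAppJarPath + " ExtractText -console -text " + $pdf, true);
    return $data;
}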
var $pdf = "Gandhiji.pdf";
showPDFText($pdf);
_assertContainsText("2 October 1869", _paragraph("/Born/"));
_assertContainsText("30 January 1948", _paragraph("/Died/"));
In the code above, we use pdfbox-app-x.x.x.jar to convert our PDF to text. The getPDFText function returns the text, while showPDFText shows it in the currently open page itself, by replacing the HTML body.
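When the text is only needed for checks, the string returned by getPDFText can be inspected directly instead of being rendered in the page. The following is a minimal sketch of one way to do that. findMissingPhrases is a hypothetical helper, not part of Sahi or PDFBox, and the sample string stands in for real extracted text, which will depend on the PDF. It collapses whitespace before comparing, on the assumption that extracted PDF text may wrap lines at arbitrary points.

```javascript
// Hypothetical helper (not a Sahi or PDFBox API): returns the expected
// phrases that do NOT occur in the given text.
function findMissingPhrases($text, $phrases) {
    // Collapse runs of whitespace, including line breaks, into single spaces
    // so that a phrase split across lines still matches.
    var $normalized = String($text).replace(/\s+/g, " ");
    var $missing = [];
    for (var $i = 0; $i < $phrases.length; $i++) {
        var $phrase = String($phrases[$i]).replace(/\s+/g, " ");
        if ($normalized.indexOf($phrase) === -1) {
            $missing.push($phrases[$i]);
        }
    }
    return $missing;
}

// Stand-in for the value getPDFText($pdf) would return (illustrative only).
var $sample = "Born\n2 October 1869\nDied 30 January\n1948";
var $missing = findMissingPhrases($sample,
    ["2 October 1869", "30 January 1948", "1 January 1900"]);
// $missing now holds only the phrase that is absent from $sample.
```

In a Sahi script, the sample string would be replaced by getPDFText($pdf), and the result could be checked with something like _assertEqual(0, $missing.length). Returning the list of missing phrases, rather than a single true or false, makes it easier to log which expectation failed.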