PDFs are important to PageKicker not just as a final output format but because much of the world’s highest-quality “input” content is contained only, or can only be conveniently accessed, in PDFs. For this reason, PageKicker has over the years built up a collection of tools for analyzing, manipulating, and summarizing PDF collections.

The PageKicker grepclips utility can now create “sheaves”: PDF files that gather every page containing a search hit from a recursively searched collection of PDF documents. Example use cases:

  • Creating a collection of “interesting” pages from large collections of PDFs such as public documents, technical reports, or premium content.
  • Printing out all relevant pages for reference while studying.
  • Compiling a collection of references for transmittal to a customer.
  • Supporting human creation of indexes, glossaries, or bibliographies.

For example, suppose the search expression occurs in four documents out of a collection of 100 PDFs: numbers 21, 27, 48, and 95. “Sheaf” files called doc21-pages.pdf, doc27-pages.pdf, doc48-pages.pdf, and doc95-pages.pdf are created in the output directory. If the search hits in document 21 appear on pages 12, 74, and 106, sheaf file doc21-pages.pdf contains only pages 12, 74, and 106, and so on for the remaining documents that contain hits. A cumulative document containing all pages with hits is also created and named [search_phrase]-pages.pdf. All sheaf pages are stamped with their origin file name for easy reference.
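The page selection described above can be sketched with two of the tools that grepclips lists as dependencies. This is only an illustration of the idea, not the grepclips implementation: the function names hit_pages and make_sheaf are hypothetical, and the sketch assumes that pdfgrep -n prints each hit as "PAGE:matching line" and that pdftk is installed. It does not stamp pages or build the cumulative document.

```shell
# Reduce pdfgrep -n style output ("PAGE:matching line") to a sorted,
# de-duplicated, space-separated list of page numbers.
hit_pages() {
  cut -d: -f1 | sort -n -u | paste -s -d ' ' -
}

# Build one sheaf from one PDF (hypothetical helper, not part of grepclips).
# Usage: make_sheaf "search phrase" doc21.pdf doc21-pages.pdf
make_sheaf() {
  pages=$(pdfgrep -n "$1" "$2" | hit_pages)
  [ -n "$pages" ] || return 1   # no hits, so no sheaf is written
  # $pages is deliberately unquoted so each page number is a separate argument.
  pdftk "$2" cat $pages output "$3"
}
```

With hits on pages 12, 74, and 106, the pdftk call above would amount to: pdftk doc21.pdf cat 12 74 106 output doc21-pages.pdf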

“Sheaves” output is turned on by setting the -s or --sheaves option to “yes”. Usage example:

grepclips -p "search phrase" -P /path/to/dir \
  -A "number of context lines after" -B "number of context lines before" \
  --word "yes" --sheaves "yes"

The variable $sheaves is set to “no” by default because searching within PDFs and assembling the sheaves into several new documents is considerably slower than standard grep searches in text.

This use case assumes that each directory in the document collection holds matching .pdf and .txt copies of every document. PageKicker includes a simple script called pdfdir2txt.sh that creates the .txt copies. The script must be run within each directory containing PDFs; the find command with a pipe can be used to do this recursively. Native recursion is on the “to do” list.
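One way to do the recursive run with find and a pipe is sketched below. The wrapper name run_in_pdf_dirs is hypothetical, and the example assumes that pdfdir2txt.sh is on the PATH and takes no arguments; check the script itself before relying on that. Directory names containing newlines are not handled.

```shell
# Run a command once inside every directory under ROOT that contains a PDF.
# Usage: run_in_pdf_dirs ROOT COMMAND [ARGS...]
run_in_pdf_dirs() {
  root=$1
  shift
  # List the parent directory of every PDF, then visit each directory once.
  find "$root" -type f -iname '*.pdf' -exec dirname {} \; | sort -u |
  while IFS= read -r dir; do
    # The subshell keeps the cd local; /dev/null stops the command from
    # consuming the directory list on stdin.
    ( cd "$dir" && "$@" ) </dev/null || echo "failed in $dir" >&2
  done
}

# Example (assumption: pdfdir2txt.sh is on PATH and needs no arguments):
# run_in_pdf_dirs /path/to/dir pdfdir2txt.sh
```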

Dependencies for grepclips include imagemagick, pandoc, pdfgrep, and pdftk.  It is recommended that these dependencies be installed by cloning the complete PageKicker platform:

git clone https://github.com/fredzannarbor/pagekicker-community.git
