Adding a new booktype to PageKicker

 

In the PageKicker algorithmic publishing platform the booktype is an explicit definition of the rules that are used for creating electronic codexes, i.e. content that is wrapped in “containers” like pages wrapped in a bound book. Booktypes include entities such as anthologies, chapbooks, dictionaries, encyclopedias, epic poems, and novels. Another way of describing such entities is as “chunk-ables” that can be published on their own, sufficiently valuable that they can be thrust into the stream of commerce (without a paddle?). At the most practical level, booktypes exist and have evolved over the centuries because they work. Following a known format – or inventing a new one! – can make an enormous contribution to the success of author, publisher, and book, and to the happiness of readers, librarians and even critics.

In PageKicker, booktypes go beyond “chunk-ables” such as encyclopedia and include other, lower-level attributes that define structure, substance, and style. Structure is defined in terms of “parts of the book” such as acknowledgement, foreword, epigraph, part, chapter, bibliography, and index. Substance is defined by rules that govern the search, creation, and assembly strategy for each part of the book. For example, the substance rule for “encyclopedia” might be “search for all content relevant to the seed phrases and assemble each document in alphabetical order”. Similarly, a rule for style might be “use the order of parts specified by the Chicago Manual of Style”.

PageKicker’s approach to booktypes is an algorithmic abstraction of an important aspect of traditional publishing. Early in the acquisition and development stage an author or publisher who is contemplating a book on a particular topic for a particular audience must consider what is the best format for the job. The default is usually the standard codex book, i.e. front matter, sequential chapters, back matter, but as noted above there are scores of options that have been developed over 500+ years of publishing. The algorithmic approach enables the publisher to define and continually improve consistent rules for creating books in a particular format. The publisher can also switch from one booktype to another instantly and painlessly experiment with different approaches to the same work.

Technical Details

The default booktype is reader, a nonfiction codex with the parts of the book in the order specified in the Chicago Manual of Style, i.e. front matter, body, back matter. One booktype has recently been added, draft-report. Its purpose is to streamline the reader format to only those parts of the book that help jump-start the writing of a report, i.e. title, executive summary, content, and bibliography, with much reduced front matter.

Booktype is one of several hundred variables that the PageKicker system accesses during book creation.
The $booktype variable is specified at runtime. As with most variables in PageKicker, the order of precedence is, in ascending order: the config file ~/.pagekicker, the default variable values file scripts/includes/set-variables.sh, and the command line as --booktype. Note that the value of $booktype provided at the command line can be overridden by values specified in jobprofiles for robots or imprints. Thus, if an imprint file explicitly defines the booktype as always “encyclopedia”, PageKicker will create all books for that imprint as encyclopedias. For this reason, booktypes should not be specified in imprint files unless the imprint always publishes one and only one type of book.
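
The layering described above can be sketched in a few lines of shell. This is an illustration only, not PageKicker's actual code: the function name resolve_booktype and the idea of passing the two file paths as arguments are assumptions made for the example, and the jobprofile override is left out.

```shell
#!/bin/bash
# Hypothetical sketch: resolve $booktype from three layers, lowest precedence first.
# resolve_booktype is an illustrative name, not a PageKicker function.
resolve_booktype() {
  local config_file="$1" defaults_file="$2"
  shift 2
  local booktype=""

  # Layer 1: user config file (lowest precedence).
  if [ -f "$config_file" ]; then . "$config_file"; fi
  # Layer 2: default variable values override the config file.
  if [ -f "$defaults_file" ]; then . "$defaults_file"; fi

  # Layer 3: the command line overrides both files.
  while [ $# -gt 0 ]; do
    case "$1" in
      --booktype)
        booktype="${2-}"
        shift
        if [ $# -gt 0 ]; then shift; fi
        ;;
      --booktype=*)
        booktype="${1#*=}"
        shift
        ;;
      *)
        shift
        ;;
    esac
  done

  printf '%s\n' "$booktype"
}
```

Because each layer simply reassigns the same variable, whichever layer runs last wins. A jobprofile that overrides the command line would be one more assignment after the option loop.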

When the builder script runs, it uses a case statement to look for the value of $booktype and run the corresponding script. Defining and creating booktypes in scripts is probably not ideal (it might be better to have them defined strictly as data objects), but on the plus side it does largely isolate the changes.

Example: Adding a Chapbook

The procedure for adding a new booktype is relatively straightforward.

The first step is to add a new clause to the booktype case construct in the builder script. The case construct for draft-report looks like this:

draft-report)
echo "assembling parts needed for $booktype"

. includes/draft-report.sh
"$PANDOC_BIN" \
-o "$TMPDIR$uuid/draft-report-$safe_product_name.docx" \
"$TMPDIR$uuid/draft-report.md"

# note that draft-report does not get SKU because it is not
# a completed product
;;

So, to add a new case clause for, let us say, a chapbook, the format would look like this:

chapbook)
echo "assembling parts needed for $booktype"
. includes/chapbook.sh
"$PANDOC_BIN" -o \
"$TMPDIR$uuid/$sku-chapbook-$safe_product_name.epub" \
"$TMPDIR$uuid/chapbook.md"
;;

From inspecting the above code sample, we see that the chapbook script will reside in pagekicker-community/scripts/includes. Because the clause sources it with the . command, the include itself is a shell script, but it is free to call Python or any other language to do the work, as the poem example later in this post does. The chapbook script has one major responsibility: to return the markdown format file chapbook.md to the assembly area at $TMPDIR$uuid, where the builder then uses pandoc to convert it into an EPUB document. (Additional formats could be created by adding more pandoc commands with different output extensions.) By convention, the book files are named with an SKU followed by a safe product name, which is the literal book title with special characters converted into underscores, e.g. 12345678-The_Plant.epub. The ;; terminator concludes the case clause.
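
The naming convention can be illustrated with a small helper. This is a sketch under assumptions: make_safe_product_name is a hypothetical name, and the exact set of characters PageKicker treats as “special” may differ from the one used here.

```shell
#!/bin/bash
# Hypothetical helper: PageKicker's own conversion rule may differ.
make_safe_product_name() {
  # Replace every character that is not a letter, digit, dot, or hyphen
  # with an underscore.
  printf '%s' "$1" | LC_ALL=C sed 's/[^A-Za-z0-9.-]/_/g'
}

sku="12345678"                       # example SKU from the text
booktitle="The Plant"
safe_product_name=$(make_safe_product_name "$booktitle")
echo "$sku-$safe_product_name.epub"  # prints 12345678-The_Plant.epub
```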

Since the chapbook script runs as a sourced include (part of the main script), all defined PageKicker variables are available to it. Because all parts of the book required for reader are built by default, chapbook is only responsible for creating the unique parts needed for a chapbook.

While chapbooks have a long history and were created for a wide variety of purposes, the most common modern usage is for small collections of poetry by a single author. Thus, we will assume that the script needs to write a number of poems. Take a look at the code for draft-report.sh, which includes three major parts: comments, part creation, and assembly.

The script begins by documenting our work.

#!/bin/bash
# --booktype="chapbook"
# A specified number of poems are created.
# There is a limit on word count.
# Other attributes may be implemented as desired.
# See http://www.baymoon.com/~ariadne/chapbooks.htm
# for a helpful guide to possible attributes.
# The script must return chapbook.md to $TMPDIR$uuid.

The script is then responsible for creating its specified parts of the book. In this case, we will assume that the default number of poems is 20 and the default maximum word count is 10,000. For the moment, this will be merely a mockup that uses a hypothetical poem generator. So the second major section of chapbook is responsible for creating the unique substance.

poems="20"
poem_max_wordcount="10000"

echo "running poem script"

"$PYTHON_BIN" "poet.py" --poems "$poems" \
--maxwords "$poem_max_wordcount" -o "$TMPDIR$uuid/poems.md" \
--numbering "on"

Now the script must blend its unique content into the material already created by PageKicker, which is defined as follows for the reader booktype.

cat \
"$TMPDIR$uuid/titlepage.md" \
"$TMPDIR$uuid/robo_ack.md" \
"$TMPDIR$uuid/settings.md" \
"$TMPDIR$uuid/rebuild.md" \
"$TMPDIR$uuid/tldr.md" \
"$TMPDIR$uuid/listofpages.md" \
"$TMPDIR$uuid/humansummary.md" \
"$TMPDIR$uuid/programmaticsummary.md" \
"$TMPDIR$uuid/add_this_content.md" \
"$TMPDIR$uuid/chapters.md" \
"$TMPDIR$uuid/content_collections/content_collections_results.md" \
"$TMPDIR$uuid/googler.md" \
"$TMPDIR$uuid/googler-news.md" \
"$TMPDIR$uuid/sorted_uniqs.md" \
"$TMPDIR$uuid/analyzed_webpage.md" \
"$TMPDIR$uuid/acronyms.md" \
"$TMPDIR$uuid/twitter/sample_tweets.md" \
"$TMPDIR$uuid/allflickr.md" \
"$TMPDIR$uuid/sources.md" \
"$TMPDIR$uuid/changelog.md" \
"$TMPDIR$uuid/builtby.md" \
"$TMPDIR$uuid/byimprint.md" \
"$TMPDIR$uuid/imprint_mission_statement.md" \
"$TMPDIR$uuid/yaml-metadata.md" \
> "$TMPDIR$uuid/complete.md"

For the chapbook booktype we can delete many of these sections as either pedantic or irrelevant to the art of poetry. We then add the poems.md part to the list of items that are assembled to make up chapbook.md:

cat \
"$TMPDIR$uuid/titlepage.md" \
"$TMPDIR$uuid/robo_ack.md" \
"$TMPDIR$uuid/listofpages.md" \
"$TMPDIR$uuid/poems.md" \
"$TMPDIR$uuid/changelog.md" \
"$TMPDIR$uuid/builtby.md" \
"$TMPDIR$uuid/byimprint.md" \
"$TMPDIR$uuid/imprint_mission_statement.md" \
"$TMPDIR$uuid/yaml-metadata.md" \
> "$TMPDIR$uuid/chapbook.md"

echo "chapbook content"

There is no need for an exit status; we simply report completion in the echo statement, and control reverts to the appropriate place in bin/builder, i.e. the next step in the case clause, which is the pandoc command that builds the chapbook itself.

"$PANDOC_BIN" -o \
"$TMPDIR$uuid/$sku-chapbook-$safe_product_name.epub" \
"$TMPDIR$uuid/chapbook.md"

The chapbook file, $sku-chapbook-$safe_product_name.epub, is delivered to the results directory, where the user can access it and additional actions such as delivery and distribution can be carried out.

This is the basic procedure for adding booktypes. We highly encourage innovation: by all means, write a script and plug it in! Note that if a booktype script introduces dependencies (as in the hypothetical poet.py program mentioned in the example), the install program pagekicker-community/simple-install.sh should be updated to install those dependencies and the requirements should be documented in install_notes.md. Similarly, a test script for the booktype should be added to test/.
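
As a starting point for such a test, here is a minimal sketch of the kind of check a chapbook test script might perform. The function name check_chapbook_md and the choice of checks (file exists, is not empty, stays under a word limit supplied by the caller) are assumptions for illustration; they are not taken from the existing scripts in test/.

```shell
#!/bin/bash
# Hypothetical check for a chapbook build; names and limits are illustrative.
check_chapbook_md() {
  local file="$1" maxwords="$2" words
  # The booktype script must have produced a non-empty markdown file.
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty" >&2
    return 1
  fi
  # The assembled text should respect the word-count limit.
  words=$(( $(wc -w < "$file") ))
  if [ "$words" -gt "$maxwords" ]; then
    echo "FAIL: $file has $words words, limit is $maxwords" >&2
    return 1
  fi
  echo "PASS: $file ($words words)"
}
```

A test script could call it after a build with something like check_chapbook_md "$TMPDIR$uuid/chapbook.md" "$poem_max_wordcount" and rely on the non-zero return status to signal failure.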


grepclips adds PDF “sheaves” creation

PDFs are important to PageKicker not just as a final output format but because much of the world’s highest-quality “input” content is contained only in PDFs, or can only be conveniently accessed that way. For this reason PageKicker has over the years created a useful collection of tools for analyzing, manipulating, and summarizing PDF collections.

The PageKicker grepclips utility can now create “sheaves” that contain all the PDF pages that contain search hits in a recursively searched collection of PDF documents.  Example use cases:

  • Creating a collection of “interesting” pages from large collections of PDFs such as public documents, technical reports, or premium content.
  • Printing out all relevant pages for reference while studying.
  • Compiling a collection of references for transmittal to a customer.
  • Supporting human creation of indexes, glossaries, or bibliographies.

For example, if the search expression is found in four documents in a collection of 100 PDFs, numbers 21, 27, 48, and 95, then “sheaf” files called doc21-pages.pdf, doc27-pages.pdf, doc48-pages.pdf, and doc95-pages.pdf will be created in the output directory. If in document 21 the search hits appear on pages 12, 74, and 106, sheaf file doc21-pages.pdf will contain (only) pages 12, 74, and 106, and so on for the remaining documents that contain hits. A cumulative document including all pages with hits is also created, and it is named [search_phrase]-pages.pdf. All sheaf pages are stamped with their origin file name for easy reference.
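
The per-document step can be sketched with two of the tools listed as dependencies. This is not the grepclips source: the function names are hypothetical, and the sketch skips the cumulative document and the file-name stamping. It relies only on pdfgrep -n (prefix each match with its page number) and the pdftk cat operation.

```shell
#!/bin/bash
# Hypothetical sketch of building one sheaf; not the grepclips source.

# Reads "page:matched text" lines (the shape printed by pdfgrep -n for a
# single file) and prints the unique page numbers in ascending order.
pages_from_hits() {
  cut -d: -f1 | grep -E '^[0-9]+$' | sort -n -u | tr '\n' ' ' | sed 's/ $//'
}

# Extracts only the pages with hits from one PDF into a sheaf file.
make_sheaf() {
  local pattern="$1" pdf="$2" outdir="$3" pages
  pages=$(pdfgrep -i -n "$pattern" "$pdf" | pages_from_hits)
  if [ -z "$pages" ]; then return 0; fi   # no hits: no sheaf
  # $pages is deliberately unquoted so each page is a separate argument.
  pdftk "$pdf" cat $pages output "$outdir/$(basename "$pdf" .pdf)-pages.pdf"
}
```

For the example above, hits on pages 12, 74, and 106 of doc21.pdf reduce to the page list 12 74 106, which pdftk then copies into doc21-pages.pdf.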

“Sheaves” output is turned on by giving the -s or --sheaves option the value “yes”. Usage example:

grepclips -p "search phrase" -P /path/to/dir \
-A "number of context lines after" -B "number of context lines before" \
--word "yes" --sheaves "yes"

The variable $sheaves is set to “no” by default because searching within PDFs and assembling the sheaves into several new documents is considerably slower than standard grep searches in text.

The use case assumes that the document collection includes matching copies of .pdf and .txt files in each directory.  PageKicker includes a simple script called pdfdir2txt.sh for this purpose.  That script must be run within each directory containing pdfs.  The find command with pipe can be used to accomplish this recursively.  Native recursion is on the “to do” list.
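
One way to do the recursive run with find and a pipe is sketched below. It assumes GNU find (for -printf) and assumes that pdfdir2txt.sh is on the PATH and needs no arguments when run inside a directory; the function names are illustrative.

```shell
#!/bin/bash
# Prints each directory under $1 that contains at least one PDF (GNU find).
pdf_dirs() {
  find "$1" -type f -iname '*.pdf' -printf '%h\n' | sort -u
}

# Runs the converter once inside every such directory.
# Assumption: pdfdir2txt.sh is on the PATH and takes no arguments.
convert_tree() {
  pdf_dirs "$1" | while IFS= read -r dir; do
    ( cd "$dir" && pdfdir2txt.sh )
  done
}
```

Calling convert_tree /path/to/dir before grepclips should leave a matching .txt next to each .pdf, which is the layout the sheaves option expects.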

Dependencies for grepclips include imagemagick, pandoc, pdfgrep, and pdftk.  It is recommended that these dependencies be installed by cloning the complete PageKicker platform:

git clone https://github.com/fredzannarbor/pagekicker-community.git

bibliographic info by chapter

Per customer request, PageKicker now provides bibliographic information at chapter level.    The Sources section now includes properly formatted citations to each document included in the Algorithmic Content section.

Additional improvements are underway, such as including YAML and BibTeX format information.


 

 

grepclips: shows file name only once followed by matches within file

A customer wanted to search a bunch of files and find all the files that contained particular keywords, but the standard grep output controls were inappropriate for the use case:  grep -H prints the filename at the beginning of each and every line, whereas grep -h omits the filename from each and every line.  Thus, the desired behavior was:

filename1
hit1
hit2
hit3

filename2
hit1
hit2
hit3

It took a bit of research and tinkering to get a script working properly, so it is provided here for future reference.

https://github.com/fredzannarbor/pagekicker-community/blob/master/scripts/bin/grepclips

#!/bin/bash
# recursively greps directory
# for each file that contains caseinsensitive matches
# provides filename followed by matches

# sets defaults 

afterKWIC=0
beforeKWIC=0
path="."

while :
do
case $1 in
--help | -\?)
echo "usage:"
echo 'grepclips -p "search phrase" -P /path/to/dir -A "context lines after" -B "context lines before"'
exit 0 # This is not an error; the user requested help, so do not exit with status 1.
;;
-p|--pattern)
pattern=$2
shift 2
;;
--pattern=*)
pattern=${1#*=}
shift
;;
-P|--path)
path=$2
shift 2
;;
--path=*)
# fixed: this branch previously assigned to $pattern instead of $path
path=${1#*=}
shift
;;
-A|--afterKWIC)
afterKWIC=$2
shift 2
;;
--afterKWIC=*)
afterKWIC=${1#*=}
shift
;;
-B|--beforeKWIC)
beforeKWIC=$2
shift 2
;;
--beforeKWIC=*)
beforeKWIC=${1#*=}
shift
;;
 --) # End of all options
 shift
 break
 ;;
 -*)
 echo "WARN: Unknown option (ignored): $1" >&2
 shift
 ;;
 *) # no more options. Stop while loop
 break
 ;;

esac
done

if [ ! "$pattern" ]; then
 echo "ERROR: option '-p [pattern]' not given. See --help" >&2
 exit 1
fi


# list matching files once, then print the hits for each file under its name
grep -r -l -i "$pattern" "$path" | while IFS= read -r fn
do
 echo "$fn:"
 grep -i "$pattern" --no-group-separator -h -A "$afterKWIC" -B "$beforeKWIC" "$fn"
 echo " "
done

Bug fixes

Version 1.7.15 fixes a bug in the creation of yaml metadata; eliminates an extraneous warning message; and prevents the creation of an empty file.

Version 1.7.16 fixes the ability to toggle Search & News snippets on and off.

 

 

Go from 0 to 60 on that report

tl;dr: Save hours on writing each report.

Version 1.7.14 of PageKicker adds the new $booktype option “draft-report”, which helps you go straight from research to writing by providing a highly stripped down Word document that includes the following:

  • Title
  • Executive Summary
  • Chapters
    • Permissioned content
    • Search & News Snippets
    • Programmatic Summaries
  • Conclusion
  • Appendix: Unique Proper Nouns and Key Terms
  • Appendix: Acronyms

No more staring at a blank sheet of paper, copying and pasting from dozens of documents, or asking colleagues to turn in draft sections in a particular format. One command line invocation and you can start with a structured document whose content you can begin editing immediately, either yourself or in a group.

bin/builder.sh --booktype "draft-report" \
 --booktitle "Bannon: Lenin or Lazarus?" \
 --seedsviacli "Steve Bannon; Julius Evola; economic nationalism"

 

book-builder adds user-friendly --help

Version 1.6.13 adds user-friendly --help to the bin/builder.sh command.

bin/builder.sh --help

Result:

version v1.6.13-Johnson
BASICS
Must always run from ~/pagekicker-community/scripts/
Usage: bin/builder.sh --option1 "value" etc.

QUICK START
--seedsviacli "Melania Trump; Slovenia" # semi-colon separated
--booktitle "Melania of Slovenia"
--editedby "Your Name Here" # your personal byline
--expand_seeds_to_pages "yes" # spidering search strategy, many more pages
--verbose (no value) turn on verbose output
Job results: ls -lart $TMPDIR (default /tmp/pagekicker/), inside latest dir

DEFAULTS
User configuration file: ~/.pagekicker/config.txt
Default variable values: scripts/includes/set-variables.sh
Command line trumps default variables trumps user configuration

COMPLETE LIST OF OPTIONS
(to come)

The “QUICK START” is all you need to know to build your first book from the command line.

To install PageKicker, do:

git clone https://github.com/fredzannarbor/pagekicker-community
cd pagekicker-community
./simple-install.sh

Try a first book build:

bin/builder.sh --seedsviacli "Melania Trump; Slovenia" \
--booktitle "Melania of Slovenia" --verbose

The results will be found in the latest directory in $TMPDIR, usually /tmp/pagekicker. You can find the latest directory using ls -lart.  The files you want begin with an SKU number and end in document extensions like .docx and .epub.

 

PageKicker code is maturing for use by third parties

Just added version 1.5.13, which makes “quiet” the default stdout mode and adds a --verbose flag. This may seem minor but is actually fairly significant, both because it took me quite a bit of effort to figure out an elegant way to accomplish it (I had to learn how to use file descriptors to route stdout to a “backup” file descriptor) and because it signifies that PageKicker’s code is responding to use by third parties. The motivation for this change was that I want to add a --help flag for use by operators, and to do this I had to undo the excessive verbosity of the previous version (because the --help output was being accompanied by a lot of excess verbiage). Now it will be straightforward to add the --help text, which will be in the next subdot release.
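
For readers curious about the file descriptor technique, here is a minimal sketch of the general idea. It is not the code in bin/builder.sh: the function name setup_output and the use of descriptor 3 are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of quiet-by-default output; setup_output is an illustrative name.
setup_output() {
  exec 3>&1                 # keep a copy of the original stdout on fd 3
  if [ "$1" != "yes" ]; then
    exec 1>/dev/null        # quiet mode: ordinary stdout is discarded
  fi
}

# Demo in a subshell so the redirections do not leak into the caller.
(
  setup_output "no"                    # pass "yes" to simulate --verbose
  echo "fetching pages..."             # progress chatter, hidden when quiet
  echo "book written to results" >&3   # essential message, always shown
)
```

The point of the “backup” descriptor is that essential messages, such as help text, can still reach the terminal through fd 3 even after ordinary stdout has been silenced.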

https://github.com/fredzannarbor/pagekicker-community/releases

 

 

PageKicker bug fixes

Three bug fix releases this weekend.

Version 1.4.9 fixes an ebook creation error that occurred if Amazon’s kindlegen software was not available on the host machine.

Version 1.4.10 makes the install process more seamless and less error-prone by automating a configuration step that previously needed to be carried out by hand.

Version 1.4.11 fixes a cosmetic error in the bibliography section, where book titles beginning with special characters such as the #hashtag were interpreted as markdown and therefore formatted incorrectly.  Markdown characters at the beginning of the line are now removed.

https://github.com/fredzannarbor/pagekicker-community/releases

To update an existing installation, simply do

 git checkout master && git pull

Single-click install PageKicker *now*

Version 1.4.8 of PageKicker’s Algorithmic Publishing Toolkit makes the open source platform available via single-click install for the first time.

On Ubuntu 16.04 LTS:

cd ~
git clone https://github.com/fredzannarbor/pagekicker-community.git
cd pagekicker-community
./simple-install.sh

To verify installation:

cd scripts
../test/paella.sh

Produces:

hostname:~/pagekicker-community/scripts$ ../test/paella.sh 1> /dev/null
IBM Word Cloud Generator build 32
Copyright (c)2009 IBM
IBM Word Cloud Generator build 32
Copyright (c)2009 IBM

Inspect the results:

cd /tmp/pagekicker
ls -lart
# cd to most recent directory
# completed books begin with 100** and end in .docx, .epub, etc.

Quick start:

cd ~/pagekicker-community/scripts # all commands must be launched from this directory
bin/builder.sh --seedsviacli "Fancy Bear; Kaspersky" --booktitle "Cybersecurity Demo Book"
  • runs verbosely by default, to make it run silently add 1> /dev/null to end of command
  • searches wikipedia by default

 

Experiment by adding some command line options:

--expand_seeds_to_pages "yes"
--covercolor "Red"
--coverfont "Arial"
--editedby "Charles Dickens"