Generate BOOK files from PDF (Word)

Each book starts as a directory (e.g. ~/Pubb2012/openbess_TO082-00001) containing exactly one PDF (e.g. 001.pdf) and DublinCore.txt.
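Before running the scripts it can help to confirm that a book directory has the expected starting layout. The following is a minimal sketch, not part of the original scripts: the function name check_book_dir is hypothetical, and it assumes the only requirements are a single PDF and a DublinCore.txt file.

```shell
#!/bin/bash
# Hypothetical helper: check_book_dir is not part of the original scripts.
# Succeeds only if the directory holds exactly one PDF and a DublinCore.txt.
check_book_dir() {
   local dir=$1
   local pdfs=()
   shopt -s nullglob          # an unmatched glob expands to nothing
   pdfs=("$dir"/*.pdf)
   shopt -u nullglob
   if [ "${#pdfs[@]}" -ne 1 ]; then
      echo "ERROR: $dir: expected exactly one PDF, found ${#pdfs[@]}" >&2
      return 1
   fi
   if [ ! -f "$dir/DublinCore.txt" ]; then
      echo "ERROR: $dir: DublinCore.txt is missing" >&2
      return 1
   fi
   echo "OK: $dir"
}
```

It could be called once per book, for example: for d in ~/Pubb2012/openbess*; do check_book_dir "$d"; done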

On the desktop with LibreOffice and Docsplit (http://documentcloud.github.com/docsplit/)

Call the script pdfTOtxt.sh with the directory that contains the book directories as its parameter:

giancarlo@ubuntud:~$ ./pdfTOtxt.sh Pubb2012/

pdfTOtxt.sh

#!/bin/bash
 
bdir=$1
cd "$bdir"
 
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find openbess* -maxdepth 0 -type d);
do
   echo "$bookdir"
   cd "$bookdir"
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find *.pdf -type f);
   do
      let "n += 1"
      filedoc="$nfile"
   done
   if [ $n -ne 1 ]
   then
      echo "ERROR: expected exactly one PDF file in $bookdir, found $n" >&2
      exit 1
   fi
   IFS=$SAVEIFS
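   # Rename the PDF to doc.pdf (skip if it already has that name,
   # otherwise a cp followed by rm would delete the only copy)
   if [ "$filedoc" != "doc.pdf" ]
   then
      mv "$filedoc" doc.pdf
   fi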
 
   docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
 
   cd OCR
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find *.txt -type f);
   do
      numer=${nfile#doc_}
      numero=${numer%\.txt}
      sn=$(printf "%04d" $numero)
 
      tr -d '\f' < "$nfile" > "$sn".txt
      rm "$nfile"
 
      echo "$sn"" DONE"
   done
   IFS=$SAVEIFS
 
   # Return to the directory of book directories for the next iteration
   cd ../..
   echo "DONE **************************""$bookdir"
done
exit

In every book directory, the script creates an OCR directory containing one txt file per PDF page (i.e. 0001.txt, 0002.txt, …).
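The renaming step relies on bash parameter expansion to turn the page files written by docsplit (doc_1.txt, doc_2.txt, …) into zero-padded names. The sketch below isolates that step; the function name page_name is hypothetical, and the 10# prefix is an addition that guards against a number with a leading zero being read as octal.

```shell
#!/bin/bash
# Hypothetical helper (not in pdfTOtxt.sh): maps a docsplit page file name
# such as doc_12.txt to the zero-padded name 0012.txt.
page_name() {
   local name=$1
   local num=${name#doc_}     # strip the "doc_" prefix -> 12.txt
   num=${num%.txt}            # strip the ".txt" suffix -> 12
   # 10# forces base 10, so a value like 08 is not parsed as octal
   printf "%04d.txt\n" "$((10#$num))"
}

page_name doc_1.txt     # prints 0001.txt
page_name doc_12.txt    # prints 0012.txt
```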

Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, e.g. into the /srv/data/bookforingest directory.

On the back-end server with ImageMagick and pdftk

Call the script pdfatiff.sh with the directory that contains the book directories as its parameter (use an absolute path, because the script changes directory while it runs):

#./pdfatiff.sh /srv/data/bookforingest

pdfatiff.sh

#!/bin/bash
 
bdir=$1
 
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find "$bdir/"openbess* -maxdepth 0 -type d );
do
 
   echo "$bookdir"
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find "$bookdir/"*.pdf -type f);
   do
      let "n += 1"
      filepdf="$nfile"
   done
   if [ $n -ne 1 ]
   then
      echo "ERROR: expected exactly one PDF file in $bookdir, found $n" >&2
      exit 1
   fi
 
   mkdir "$bookdir""/pdfs"
   cp "$filepdf" "$bookdir""/pdfs"
   cd "$bookdir""/pdfs"
 
   pdftk "$filepdf" burst output pg-%04d.pdf
 
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find pg-*.pdf -type f);
   do
      let "n += 1"
      sn=$(printf "%04d" $n)
      filepdf="$nfile"
      echo "$filepdf"" -> ""$sn.tif"
 
      pdftk "$filepdf" output "temp.pdf"
 
      # For PDF from image
      #	convert -density 150 "temp.pdf" "$sn.tif"
      # For PDF from Word
      convert -background white -flatten -density 600 -resize 1200 -border 0.5% -bordercolor LightGray "temp.pdf" "../""$sn.tif"
      rm "temp.pdf"
   done
   # Leave the pdfs directory before removing it (returns to the previous directory)
   cd - > /dev/null
   rm -R "$bookdir""/pdfs"
done
exit

In every book directory, the script creates one tif file per PDF page (i.e. 0001.tif, 0002.tif, …).
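Since both scripts work page by page, a book directory should end up with as many tif files as there are txt files in its OCR directory. The following sketch checks that before ingesting; the function name check_page_counts is hypothetical and the check itself is an assumption about what a consistent book looks like, not a step from the original procedure.

```shell
#!/bin/bash
# Hypothetical helper: verifies that a book directory holds one TIFF
# (NNNN.tif) for every text page (OCR/NNNN.txt).
check_page_counts() {
   local dir=$1
   local ntif ntxt
   # $(( ... )) normalizes the wc output to a plain integer
   ntif=$(( $(find "$dir" -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9].tif' 2>/dev/null | wc -l) ))
   ntxt=$(( $(find "$dir/OCR" -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9].txt' 2>/dev/null | wc -l) ))
   if [ "$ntif" -eq 0 ] || [ "$ntif" -ne "$ntxt" ]; then
      echo "ERROR: $dir: $ntif tif files, $ntxt txt files" >&2
      return 1
   fi
   echo "OK: $dir ($ntif pages)"
}
```

For example: for d in /srv/data/bookforingest/openbess*; do check_page_counts "$d"; done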

The book is now ready for ingesting.