Generate BOOK files from PDF (Word)

We start with 001.pdf and DublinCore.txt in a directory (e.g. ~/Pubb2012/openbess_TO082-00001).

On Desktop with LibreOffice and Docsplit (http://documentcloud.github.com/docsplit/)

Call the script pdfTOtxt.sh with the directory that contains the book directories as its parameter:

giancarlo@ubuntud:~$ ./pdfTOtxt.sh Pubb2012/

pdfTOtxt.sh

#!/bin/bash
 
bdir=$1
cd "$bdir" || exit 1
 
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
# -maxdepth 0: list only the book directories themselves, not their subdirectories
for bookdir in $(find openbess* -maxdepth 0 -type d);
do
   echo "$bookdir"
   cd "$bookdir"
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find *.pdf -type f);
   do
      let "n += 1"
      filedoc="$nfile"
   done
   if [ $n -ne 1 ]
   then
      echo "ERROR: PDF file is not unique"
      exit 1
   fi
   IFS=$SAVEIFS
   mv "$filedoc" doc.pdf
 
   docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
 
   cd OCR
   n=0
   SAVEIFS=$IFS
   IFS=$(echo -en "\n\b")
   for nfile in $(find *.txt -type f);
   do
      numer=${nfile#doc_}
      numero=${numer%\.txt}
      sn=$(printf "%04d" $numero)
 
      tr -d '\f' < "$nfile" > "$sn".txt
      rm "$nfile"
 
      echo "$sn"" DONE"
   done
   IFS=$SAVEIFS
 
   # go back up from <bookdir>/OCR so the next relative cd "$bookdir" works
   cd ../..
   echo "DONE **************************""$bookdir"
done
exit

In every book directory, the script renames the PDF to doc.pdf and creates an OCR directory containing one text file per PDF page (i.e. 0001.txt, 0002.txt, …).
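The page renaming step can be tried on its own. The following is a minimal sketch, assuming docsplit names its page files doc_1.txt, doc_2.txt, … as the script above expects; the function name page_name and the sample files are made up for illustration.

```shell
#!/bin/bash
# Sketch: map a docsplit page file name (doc_N.txt) to a zero-padded name.
# page_name is a hypothetical helper, not part of pdfTOtxt.sh.
page_name() {
   local base=${1##*/}      # drop any leading directory
   local num=${base#doc_}   # strip the "doc_" prefix
   num=${num%.txt}          # strip the ".txt" suffix
   printf "%04d.txt" "$num" # pad the page number to four digits
}

# Demo in a throwaway directory with two fake pages ending in a form feed
tmp=$(mktemp -d)
printf 'page one\f' > "$tmp/doc_1.txt"
printf 'page twelve\f' > "$tmp/doc_12.txt"
for f in "$tmp"/doc_*.txt; do
   tr -d '\f' < "$f" > "$tmp/$(page_name "$f")"   # remove form feeds
   rm "$f"
done
ls "$tmp"
rm -r "$tmp"
```

Four digits keep the page files in page order when sorted by name, as long as a book has fewer than 10000 pages.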

Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, e.g. into the /srv/data/bookforingest directory.
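Before copying, it may be worth checking that each book directory is complete. The sketch below assumes the layout left by pdfTOtxt.sh (doc.pdf, DublinCore.txt and OCR/0001.txt); the function name check_book is hypothetical, and the check only looks at the first page file.

```shell
#!/bin/bash
# Sketch: check that a book directory looks complete before copying it.
# check_book is a hypothetical helper; the file names follow the steps above.
check_book() {
   local dir=$1
   [ -f "$dir/doc.pdf" ] || { echo "$dir: missing doc.pdf"; return 1; }
   [ -f "$dir/DublinCore.txt" ] || { echo "$dir: missing DublinCore.txt"; return 1; }
   # at least the text of the first page must exist
   [ -f "$dir/OCR/0001.txt" ] || { echo "$dir: missing OCR/0001.txt"; return 1; }
   echo "$dir: OK"
}

# Check every book directory under the given base directory
# (defaults to the current directory).
for d in "${1:-.}"/openbess*/; do
   if [ -d "$d" ]; then
      check_book "${d%/}"
   fi
done
```

Saved under a name of your choice, it would be run with the same parameter as pdfTOtxt.sh, e.g. Pubb2012/.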

On Back-end server with ImageMagick and pdftk

Call the script pdfatiff.sh with the directory that contains the book directories as its parameter:

# ./pdfatiff.sh /srv/data/bookforingest