Generate BOOK files from PDF (Word)
We start with 001.pdf and DublinCore.txt in a directory (e.g. ~/Pubb2012/openbess_TO082-00001).
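Before running the conversion it can help to confirm that a book directory has the expected layout. The following is only a minimal sketch and not part of the original scripts: the function name check_bookdir is hypothetical, and it assumes that each book directory must hold exactly one PDF file plus DublinCore.txt, as described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts):
# verify that a book directory is ready for pdfTOtxt.sh.
# Returns 0 when the directory holds exactly one *.pdf and a DublinCore.txt.
check_bookdir() {
  local dir=$1
  local pdfs=()
  local f
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  # Collect PDF files with a glob so names containing spaces are kept whole.
  for f in "$dir"/*.pdf; do
    [ -f "$f" ] && pdfs+=("$f")
  done
  if [ "${#pdfs[@]}" -ne 1 ]; then
    echo "$dir: expected exactly one PDF, found ${#pdfs[@]}" >&2
    return 1
  fi
  if [ ! -f "$dir/DublinCore.txt" ]; then
    echo "$dir: DublinCore.txt is missing" >&2
    return 1
  fi
  return 0
}
```

For example, `check_bookdir ~/Pubb2012/openbess_TO082-00001 && echo OK` would print OK only when both files are in place.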
Call the script pdfTOtxt.sh with the directory that contains the book directories as its parameter:
giancarlo@ubuntud:~$ ./pdfTOtxt.sh Pubb2012/
- pdfTOtxt.sh
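#!/bin/bash
# Usage: ./pdfTOtxt.sh <directory containing the openbess* book directories>
bdir=$1
cd "$bdir" || exit 1
# Remember the absolute path so we can return here after each book.
bdir=$(pwd)
# Split command output on newlines only, so names with spaces survive.
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find openbess* -type d);
do
  echo "$bookdir"
  cd "$bookdir" || exit 1
  # Count the PDF files; exactly one is expected per book directory.
  n=0
  for nfile in $(find *.pdf -type f);
  do
    let "n += 1"
    filedoc="$nfile"
  done
  if [ $n -ne 1 ]
  then
    echo "ERROR: PDF file is not unique"
    exit 1
  fi
  # Rename the PDF to doc.pdf; the loop below expects page files named doc_N.txt.
  mv "$filedoc" doc.pdf
  docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
  cd OCR || exit 1
  for nfile in $(find *.txt -type f);
  do
    # doc_12.txt -> 12 -> 0012
    numer=${nfile#doc_}
    numero=${numer%\.txt}
    sn=$(printf "%04d" $numero)
    # Strip form feed characters and write the zero-padded page file.
    tr -d '\f' < "$nfile" > "$sn".txt
    rm "$nfile"
    echo "$sn"" DONE"
  done
  # Return to the parent of the book directories, not to the home directory,
  # so the next relative $bookdir can be entered.
  cd "$bdir"
  echo "DONE **************************""$bookdir"
done
IFS=$SAVEIFS
exit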
In every book directory, the script creates an OCR directory containing one text file per PDF page (i.e. 0001.txt, 0002.txt, …).
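As a sanity check after the run, one can confirm that an OCR directory holds a gap-free sequence of page files. The sketch below is an illustration only: the function name check_pages is hypothetical, and it relies on nothing beyond the zero-padded four-digit naming described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts):
# check that an OCR directory contains 0001.txt, 0002.txt, ... with no gaps.
# Prints the page count on success, returns 1 on a gap or an empty directory.
check_pages() {
  local ocrdir=$1
  local count=0
  local f expected
  for f in "$ocrdir"/[0-9][0-9][0-9][0-9].txt; do
    [ -f "$f" ] || continue
    count=$((count + 1))
    # Globs expand in sorted order, so the k-th file must be named for page k.
    expected=$(printf "%04d" "$count")
    if [ "$(basename "$f")" != "$expected.txt" ]; then
      echo "$ocrdir: missing page $expected.txt" >&2
      return 1
    fi
  done
  if [ "$count" -eq 0 ]; then
    echo "$ocrdir: no page files found" >&2
    return 1
  fi
  echo "$count"
}
```

For example, `check_pages ~/Pubb2012/openbess_TO082-00001/OCR` prints the number of pages when the sequence is complete.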
Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, e.g. into the /srv/data/bookforingest directory.
Call the script pdfatiff.sh with the directory that contains the book directories as its parameter:
# ./pdfatiff.sh /srv/data/bookforingest