====== Generate BOOK files from PDF (Word) ======

Each book starts as a directory (e.g. ~/Pubb2012/openbess_TO082-00001) containing 001.pdf and DublinCore.txt.

  * **On Desktop** with LibreOffice and Docsplit ([[http://documentcloud.github.com/docsplit/]])

Call the script //pdfTOtxt.sh// with the directory that contains the book directories as its parameter:
  giancarlo@ubuntud:~$ ./pdfTOtxt.sh Pubb2012/

  #!/bin/bash
  # Usage: ./pdfTOtxt.sh <directory containing the openbess* book directories>
  bdir=$1
  cd "$bdir" || exit 1
  topdir=$(pwd)   # absolute path, so we can return here after each book
  # Split command output on newlines only, so names with spaces survive
  SAVEIFS=$IFS
  IFS=$(echo -en "\n\b")
  for bookdir in $(find openbess* -maxdepth 0 -type d); do
    echo "$bookdir"
    cd "$bookdir"
    # Each book directory must contain exactly one PDF
    n=0
    for nfile in $(find *.pdf -type f); do
      let "n += 1"
      filedoc="$nfile"
    done
    if [ $n -ne 1 ]
    then
      echo "ERROR: PDF file is not unique"
      exit 1
    fi
    mv "$filedoc" doc.pdf
    # Extract the text of every page into OCR/doc_N.txt
    docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
    cd OCR
    # Rename doc_N.txt to NNNN.txt and strip form-feed characters
    for nfile in $(find *.txt -type f); do
      numer=${nfile#doc_}
      numero=${numer%\.txt}
      sn=$(printf "%04d" $numero)
      tr -d '\f' < "$nfile" > "$sn".txt
      rm "$nfile"
      echo "$sn"" DONE"
    done
    # Go back to the top directory (the original "cd ~" broke the next iteration)
    cd "$topdir"
    echo "DONE **************************""$bookdir"
  done
  IFS=$SAVEIFS
  exit

The script creates an OCR directory in every book directory, with one txt file per PDF page (0001.txt, 0002.txt, ...). \\ \\
Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, e.g. into the /srv/data/bookforingest directory.

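
The renaming step inside //pdfTOtxt.sh// (doc_N.txt to a zero-padded NNNN.txt) can be tried on its own. The following is a minimal sketch, assuming Docsplit names its page files doc_N.txt without leading zeros, as the script expects. The function name ''page_name'' is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): map a Docsplit
# page file name (doc_N.txt) to the zero-padded name NNNN.txt.
# Assumes N has no leading zeros.
page_name() {
    base=${1#doc_}       # strip the "doc_" prefix
    num=${base%.txt}     # strip the ".txt" suffix
    printf '%04d.txt\n' "$num"
}

page_name doc_7.txt      # prints 0007.txt
page_name doc_123.txt    # prints 0123.txt
```

Padding to four digits keeps the page files in page order when they are listed alphabetically, and it matches the 0001.tif, 0002.tif, ... names produced on the back-end server.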
  * **On Back-end server** with ImageMagick and pdftk

Call the script //pdfatiff.sh// with the directory that contains the book directories as its parameter:
  # ./pdfatiff.sh /srv/data/bookforingest

  #!/bin/bash
  # Usage: ./pdfatiff.sh <directory containing the openbess* book directories>
  # Pass an absolute path: the script changes directory while it runs.
  bdir=$1
  # Split command output on newlines only, so names with spaces survive
  SAVEIFS=$IFS
  IFS=$(echo -en "\n\b")
  for bookdir in $(find "$bdir/"openbess* -maxdepth 0 -type d); do
    echo "$bookdir"
    # Each book directory must contain exactly one PDF
    n=0
    for nfile in $(find "$bookdir/"*.pdf -type f); do
      let "n += 1"
      filepdf="$nfile"
    done
    if [ $n -ne 1 ]
    then
      echo "ERROR: PDF file is not unique"
      exit 1
    fi
    # Split the PDF into one file per page inside a temporary pdfs directory
    mkdir "$bookdir""/pdfs"
    cp "$filepdf" "$bookdir""/pdfs"
    cd "$bookdir""/pdfs"
    pdftk "$filepdf" burst output pg-%04d.pdf
    # Convert every page to a tif in the book directory
    n=0
    for nfile in $(find pg-*.pdf -type f); do
      let "n += 1"
      sn=$(printf "%04d" $n)
      filepdf="$nfile"
      echo "$filepdf"" -> ""$sn.tif"
      pdftk "$filepdf" output "temp.pdf"
      # For PDF from image
      # convert -density 150 "temp.pdf" "../""$sn.tif"
      # For PDF from Word
      convert -background white -flatten -density 600 -resize 1200 -border 0.5% -bordercolor LightGray "temp.pdf" "../""$sn.tif"
      rm "temp.pdf"
    done
    # Leave the pdfs directory before removing it
    cd ~/clineFC
    rm -R "$bookdir""/pdfs"
  done
  IFS=$SAVEIFS
  exit

The script creates one tif file per PDF page (0001.tif, 0002.tif, ...) in every book directory. \\ \\
The book is now ready for ingesting.
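
Before ingesting, it may be worth confirming that the two scripts produced matching pages. The following is a minimal sketch, assuming the layout described above (NNNN.tif in the book directory, NNNN.txt in its OCR subdirectory). The function name ''check_book'' is hypothetical, and the check only looks for a txt file for each tif, not the reverse.

```shell
#!/bin/bash
# Hypothetical pre-ingest check (not part of the original scripts):
# every NNNN.tif in a book directory should have a matching OCR/NNNN.txt.
# Returns 0 when no txt file is missing. A directory without any tif
# files also returns 0, so this is not a test that conversion happened.
check_book() {
    bookdir=$1
    missing=0
    for tif in "$bookdir"/[0-9][0-9][0-9][0-9].tif; do
        [ -e "$tif" ] || continue      # the glob matched nothing
        page=$(basename "$tif" .tif)
        if [ ! -f "$bookdir/OCR/$page.txt" ]; then
            echo "missing OCR/$page.txt in $bookdir"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example call, using the directory names from this page:
# check_book /srv/data/bookforingest/openbess_TO082-00001 && echo "ready"
```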