Each book starts as a directory (e.g. ~/Pubb2012/openbess_TO082-00001) containing 001.pdf and DublinCore.txt.
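Both scripts in this procedure stop at the first book directory that does not hold exactly one PDF, so it can help to check the input beforehand. The following is only a minimal sketch: check_input is a hypothetical helper, not part of the scripts described here, and it assumes the metadata file is always named DublinCore.txt.

```shell
# check_input DIR (hypothetical helper): succeed only if DIR holds
# exactly one PDF file and a DublinCore.txt file.
check_input() {
    local dir="$1" count
    # Count the regular files ending in .pdf directly inside DIR.
    count=$(find "$dir" -maxdepth 1 -type f -name '*.pdf' | wc -l | tr -d ' ')
    if [ "$count" -ne 1 ]; then
        echo "ERROR: $dir holds $count PDF files, expected 1" >&2
        return 1
    fi
    if [ ! -f "$dir/DublinCore.txt" ]; then
        echo "ERROR: $dir has no DublinCore.txt" >&2
        return 1
    fi
}
```

It could be called as, for example: for d in ~/Pubb2012/openbess*/; do check_input "$d"; done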
Call the script pdfTOtxt.sh with the directory that contains the book directories as its parameter:
giancarlo@ubuntud:~$ ./pdfTOtxt.sh Pubb2012/
#!/bin/bash
# Usage: pdfTOtxt.sh <directory containing the openbess* book directories>
bdir=$1
cd "$bdir" || exit 1
# Split the output of find on newlines only, so that names with spaces survive
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find openbess* -maxdepth 0 -type d); do
    echo "$bookdir"
    cd "$bookdir"
    # Each book directory must contain exactly one PDF
    n=0
    for nfile in $(find *.pdf -type f); do
        let "n += 1"
        filedoc="$nfile"
    done
    if [ $n -ne 1 ]; then
        echo "ERROR: PDF file is not unique"
        exit 1
    fi
    # Rename the PDF so that docsplit names its output doc_N.txt
    mv "$filedoc" doc.pdf
    docsplit text --pages all --no-ocr --no-clean --output OCR/ doc.pdf
    cd OCR || exit 1
    # Rename doc_N.txt to a zero-padded NNNN.txt and strip form feeds
    for nfile in $(find *.txt -type f); do
        numer=${nfile#doc_}
        numero=${numer%\.txt}
        sn=$(printf "%04d" $numero)
        tr -d '\f' < "$nfile" > "$sn".txt
        rm "$nfile"
        echo "$sn"" DONE"
    done
    # Go back to the directory of book directories for the next book
    cd ../..
    echo "DONE **************************""$bookdir"
done
IFS=$SAVEIFS
exit
The script creates an OCR directory in every book directory, holding one txt file for every pdf page (i.e. 0001.txt, 0002.txt, …).
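The zero-padded names come from stripping the doc_ prefix and the .txt suffix from each file that docsplit writes, then formatting the remaining page number with printf. The renaming step in isolation might look like the sketch below. page_name is a hypothetical helper introduced only for illustration, and it assumes the page number carries no leading zeros.

```shell
# page_name FILE (hypothetical helper): map a per-page name of the
# form doc_N.txt to the zero-padded form NNNN.txt.
page_name() {
    local num="${1#doc_}"   # "doc_12.txt" -> "12.txt"
    num="${num%.txt}"       # "12.txt"     -> "12"
    printf '%04d.txt\n' "$num"
}

page_name doc_12.txt    # prints 0012.txt
```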
Copy the book directories (e.g. openbess_TO082-00001) to the back-end server, e.g. into the /srv/data/bookforingest directory.
Call the script pdfatiff.sh with the directory that contains the book directories as its parameter (use an absolute path, because the script changes directory while it runs):
# ./pdfatiff.sh /srv/data/bookforingest
#!/bin/bash
# Usage: pdfatiff.sh <absolute path of the directory containing the openbess* book directories>
bdir=$1
# Split the output of find on newlines only, so that names with spaces survive
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for bookdir in $(find "$bdir/"openbess* -maxdepth 0 -type d); do
    echo "$bookdir"
    # Each book directory must contain exactly one PDF
    n=0
    for nfile in $(find "$bookdir/"*.pdf -type f); do
        let "n += 1"
        filepdf="$nfile"
    done
    if [ $n -ne 1 ]; then
        echo "ERROR: PDF file is not unique"
        exit 1
    fi
    # Burst the PDF into one file per page inside a temporary pdfs directory
    mkdir "$bookdir/pdfs"
    cp "$filepdf" "$bookdir/pdfs"
    cd "$bookdir/pdfs"
    pdftk "$filepdf" burst output pg-%04d.pdf
    # Convert every page to NNNN.tif in the book directory
    n=0
    for nfile in $(find pg-*.pdf -type f); do
        let "n += 1"
        sn=$(printf "%04d" $n)
        filepdf="$nfile"
        echo "$filepdf"" -> ""$sn.tif"
        pdftk "$filepdf" output "temp.pdf"
        # For PDF from image
        # convert -density 150 "temp.pdf" "../$sn.tif"
        # For PDF from Word
        convert -background white -flatten -density 600 -resize 1200 -border 0.5% -bordercolor LightGray "temp.pdf" "../$sn.tif"
        rm "temp.pdf"
    done
    # Leave pdfs before deleting it
    cd ~/clineFC
    rm -R "$bookdir/pdfs"
done
IFS=$SAVEIFS
exit
The script creates a single tif file for every pdf page (i.e. 0001.tif, 0002.tif, …) in every book directory.
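Since both scripts produce one file per page, a book directory should end up with as many tif files as OCR txt files. A quick consistency check before ingesting could be sketched as follows. check_book is a hypothetical helper, not part of the procedure above, and it assumes the tif files sit directly in the book directory and the txt files in its OCR subdirectory.

```shell
# check_book DIR (hypothetical helper): succeed only if DIR holds at
# least one tif file and exactly as many tif files as OCR/*.txt files.
check_book() {
    local dir="$1" tifs txts
    tifs=$(find "$dir" -maxdepth 1 -type f -name '*.tif' | wc -l | tr -d ' ')
    txts=$(find "$dir/OCR" -maxdepth 1 -type f -name '*.txt' 2>/dev/null | wc -l | tr -d ' ')
    if [ "$tifs" -eq 0 ] || [ "$tifs" -ne "$txts" ]; then
        echo "MISMATCH $dir: $tifs tif, $txts txt" >&2
        return 1
    fi
    echo "OK $dir: $tifs pages"
}
```

It could be run over all books with, for example: for d in /srv/data/bookforingest/openbess*; do check_book "$d"; done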
The book is now ready for ingesting.