Automated Narration of 14th Century Book

Siraj Sabihuddin

A scientist and engineer trying to read history without reading. It took considerable time to produce this article for free in my spare time. If you like this article or want help with a project, contact me. And if you appreciate the work and have the inclination, buy me a coffee or ten via the Donate button.


Recently, I encountered a problem. The problem is that of time. I have been trying to read an English translation of a book published in 1377 AD by Ibn Khaldun entitled The Muqaddimah [2], also known as the “Introduction to History”. This book can be regarded somewhat as the precursor of the equivalent work produced by Machiavelli (The Prince [1]), published in 1532. Machiavelli, as it were, was the Ibn Khaldun of Europe. For posterity’s sake, I’ve included the full text of Franz Rosenthal’s translation of The Muqaddimah with this article below.

Download The Muqaddimah PDF File Here

Franz Rosenthal died roughly 20 years before the publication of this article. However, I distribute the book recognizing that, at least according to the Internet Archive, the book is currently in the public domain [4].

1. Introduction

The issue, however, with The Muqaddimah is that no audiobook seems to exist and the book sits at roughly a thousand pages of text. I, being one of those lazy folks, like to multi-task somewhat in order to get my reading done. So while I walk to and from work, I’d like to be able to read without the requisite danger of being run over by a car, an issue a hapless reader might encounter were he to read willy-nilly while out and about. Thus, in this article, I create a simple Python program to synthesize an audiobook from an English translation of Ibn Khaldun’s seminal work. Right ho, boys and girls. With that, let’s get started.

I’ll start you off by looking at a PDF file for the book. From there, I convert the PDF to text, fix parsing errors, and then insert pauses and the like during voice synthesis. These steps are shown in the diagram below:

The end result of this process will sound something like the following audio clip of the first two sentences of the book (page 3, top):

TEXT: “Ibn Khaldun’s Life. WRITING the biography of Ibn Khaldun would not seem to be a particularly difficult task”

Advice

I would ask that if you use my article and the code herein, you take the time to cite it properly as well. The code is distributed under the GPL v2 License. That is, improvements should be given back to the community (see a discussion with Linus Torvalds: https://youtu.be/PaKIZ7gJlRU)

2. Converting PDF to Text using Direct Extraction

Digitized books come in all shapes and sizes. Some might be just pure text files, others might be Word documents, still others in EPUB, MOBI, AZW or other formats. In the case of pure text files, it’s relatively easy to extract the text. However, some formats make it somewhat more difficult, for instance those with Digital Rights Management (DRM). Historical texts may also be challenging where the pages are stored as scanned images rather than as text.

In my case, I chose to work directly with PDF since the original digital book I had available was in PDF format. My first go-to approach was to use an existing Python library called PyPDF2 [3] to extract the text from the PDF file using the import shown below:

# PDF parsing and reading
from PyPDF2 import PdfFileReader

Using the imported PdfFileReader, I could now read the header information from the PDF file directly using the function shown below.

def pdfExtractInformation(filename):
    # Open file 
    with open(filename, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo() 
    return information

Running the function above yields the following output (see below). So it looks like a good start. Unfortunately this good start will not last as you’ll see shortly.

{'/Creator': "Microsoft Word - al-muqadimmah for ibn khal'doon", '/Producer': 'ScanSoft PDF Create! 2', '/CreationDate': 'D:20080721223217', '/ModDate': 'D:20080721223217', '/Author': 'Owner', '/Title': "Microsoft Word - al-muqadimmah for ibn khal'doon"}
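
The two date fields in this output use the PDF date format: a D: prefix followed by year, month, day, hour, minute and second. As a minimal sketch (the helper name parse_pdf_date is hypothetical, and it assumes the 14-digit form with no timezone suffix, as in the output above), such a value can be turned into a Python datetime:

```python
from datetime import datetime

def parse_pdf_date(value):
    """Parse a PDF date such as 'D:20080721223217' (hypothetical helper).

    Assumes the full 14-digit form with no timezone suffix, as in the
    metadata printed above; anything after the seconds is ignored.
    """
    # Drop the optional 'D:' prefix used by the PDF date format
    if value.startswith('D:'):
        value = value[2:]
    # Keep only YYYYMMDDHHMMSS and let strptime validate the fields
    return datetime.strptime(value[:14], '%Y%m%d%H%M%S')

created = parse_pdf_date('D:20080721223217')
print(created.isoformat())  # 2008-07-21T22:32:17
```

So this particular PDF was generated on 21 July 2008, which says something about the age of the scan but nothing about the text itself.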

So at this stage I implemented the following function to read the PDF file directly.

def pdfExtractText (filename, startpage, endpage):
    with open(filename, 'rb') as f:
        pdf = PdfFileReader(f)
        numpages = pdf.getNumPages()
        print(numpages)
        # Exit this method and return 
        # blank string if the range is not valid
        if (startpage<0 or startpage>numpages or endpage<0 
         or endpage>numpages or startpage>endpage): 
            return ''

        # Construct the page range. 
        readrange=range(startpage,endpage)
        print(readrange)

        # Initialize pagedata variable
        pagedata = ''

        # Iterate through each page and grab text.
        for pagenum in readrange:
            page = pdf.getPage(pagenum)
            pagedata = pagedata + page.extractText()
        
    return pagedata

I can now run the function on only one page of the PDF text using the following code.

txt=pdfExtractText (PDFPath, 3, 4)
print(txt)

The output below shows the total page count, the page range, and the text extracted for page index 3. Because range(3, 4) excludes its end point, only a single page is read:

1230
range(3, 4)
Ibn Khaldun's own great work, especially the Muqaddimah, is
another important source for his biography. Written in a much more
personal style than most medieval works, the Muqaddimah sharply
outlines his own personal philosophy and provides insights into the
workings of his mind.
This abundance of biographical source material has enabled
modern scholars at various times to write Ibn Khaldun's life and to
present the data in a factually correct form to whic h little can be
added. These modern biographies vary greatly in length. Among the
longest are de Slane's account in the Introduction to his translation of
the Muqaddimah, largely a literal translation of the Autobiography,6
and that by M. A. Enan, in his Ibn Khaldun, His Life and Work.7There
has been no re cent treatment in extenso of Ibn Khaldun's early life
(down to 1382), but his Egyptian period is the subject of two masterly
studies by W. J. Fischel, "Ibn Khaldun's Activities in Mamluk Egypt
(1382 -1406)"8and Ibn Khaldun and Tamerlane.9
In its outlines, Ibn Khaldun's life thus is quite clearly known.
However, the modern student who would like to know much more
about him, discovers that his questions can only be answered by
conjecture, if at all. Considering the excellence of the source ma terial,
at least as judged by external criteria, the deficiencies in our knowledge
...
biographer's claim that experiences which he shared with all his
contemporaries contributed to the formation of his individual

Taking a closer look at the raw string (see below), you might notice that the extracted text contains line-break characters (\n) at the end of each printed line, as well as stray spaces inside some words (for instance "whic h" and "re cent"). We will potentially need to make some adjustments to remove these before performing voice synthesis.

'Ibn Khaldun\'s own great work, especially the Muqaddimah, is\nanother important source for his biography. Written in a much more\npersonal style than most medieval works, the Muqaddimah sharply\noutlines his own personal philosophy and provides insights into the\nworkings of his mind.\nThis abundance of biographical source material has enabled\nmodern scholars at various times to write Ibn Khaldun\'s life and to\npresent the data in a factually correct form to whic h little can be\nadded. These modern biographies vary greatly in length. Among the\nlongest are de Slane\'s account in the Introduction to his translation of\nthe Muqaddimah, largely a literal translation of the Autobiography,6\nand that by M. A. Enan, in his Ibn Khaldun, His Life and Work.7There\nhas been no re cent treatment in extenso of Ibn Khaldun\'s early life\n(down to 1382), but his Egyptian period is the subject of two masterly\nstudies by W. J. Fischel, "Ibn Khaldun\'s Activities in Mamluk Egypt\n(1382 -1406)"8and Ibn Khaldun and Tamerlane.9\nIn its outlines, Ibn Khaldun\'s life thus is quite clearly known.\nHowever, the modern student who would like to know much more\nabout him, discovers that his questions can only be answered by\nconjecture, if at all. Considering the excellence of the source ma terial,\nat least as judged by external criteria, the deficiencies in our knowledge\nmust be ascribed to the internal character of the avail able information.\nIt is true that no amount of material will ever fully satisfy a biographer,\nbut in Ibn Khaldun\'s ca se there are particular reasons why a fully\nsatisfactory account of his life is virtually impossible of achievement. In\nthe first place, Ibn Khaldun considered only such events in his life worth\nrecording as were especially remarkable, the most unusual\nach ievements of an exceptional person. Thus he did not pay much\nattention to the kind of data so dear to modern psychological\nbiographers. He does not speak about his childhood. His family is\nmentioned only because family considerations often influenced the\ncourse of his wanderings and because it was afflicted by unusual\nmisfortunes. All his ordinary activities are passed over in silence. Ibn\nKhaldun would probably have denied that this kind of data has any\nheuristic value. He would have doubted the validity o f the modern\nbiographer\'s claim that experiences which he shared with all his\ncontemporaries contributed to the formation of his individual'
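
The line breaks themselves are straightforward to remove. A minimal sketch, assuming that every \n in the extracted text is a hard wrap rather than a paragraph break (the helper name unwrap_lines is illustrative and not part of the original notebook):

```python
import re

def unwrap_lines(text):
    """Join hard-wrapped lines into continuous text (illustrative helper)."""
    # Treat every line break as a word separator
    text = text.replace('\n', ' ')
    # Collapse any run of whitespace left behind by the join
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = ("especially the Muqaddimah, is\n"
          "another important source for his biography.")
print(unwrap_lines(sample))
```

This does nothing about spaces that land inside a word or footnote digits glued to the following word; those would need a more careful, probably dictionary-based, pass.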

3. Converting PDF to Text Using Optical Character Recognition

There are some cases where it’s useful to be able to extract text that isn’t directly readable through PDF direct extraction. This could be where the PDF itself is corrupted somehow or has some built-in DRM that prevents you from copying data. In reality, we may not even have a PDF, or we may want to read a document from a portal that has an embedded online magazine or the like. So this alternative approach instead uses Optical Character Recognition (OCR) to extract text.

3.1. On Tesseract OCR

To do this we will use the pytesseract library [5]. The pytesseract library is a wrapper around Google’s Tesseract OCR engine [6]. Alas, we need to rely on Google for the OCR for now. Later on, in future articles, I’ll discuss alternatives. The nice thing, however, is that no online connection is needed: it’s an open source tool and we can grab the Tesseract executable for Windows or Linux directly for this purpose. The platform Python library can be used to detect which operating system is being used by the user. Installation instructions for the Tesseract binaries can be found on the Tesseract GitHub documentation page here: https://tesseract-ocr.github.io/tessdoc/ [7]

LINUX

To install on Linux (Debian or Ubuntu), you can use the apt package manager with the following command in the bash terminal. Further instructions are provided here: https://notesalexp.org/tesseract-ocr/#tesseract_5.x [8][12]

sudo apt install tesseract-ocr -y

WINDOWS

To install on Windows you can directly download the Tesseract installer from the UB Mannheim GitHub page here: https://github.com/UB-Mannheim/tesseract/wiki/Downloading-Tesseract-OCR-Engine [9][11]
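
After installation it can be handy to check whether the binary is actually reachable. The sketch below uses only the standard library (platform.system and shutil.which); the helper name find_executable is illustrative, and it assumes the install location was added to the PATH, which is not necessarily the case on Windows:

```python
import platform
import shutil

def find_executable(name):
    """Return the full path of an executable on PATH, or None (illustrative helper)."""
    return shutil.which(name)

system = platform.system()  # e.g. 'Windows', 'Linux' or 'Darwin'
tesseract_path = find_executable('tesseract')

if tesseract_path is None:
    print('tesseract not found on PATH (' + system + '); pass its path explicitly')
else:
    print('Found tesseract at ' + tesseract_path)
```

If nothing is found, the full path to the executable has to be supplied by hand.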

Below is a list of libraries that we need for the OCR-based text extraction of the PDF document. Notice that the tempfile [14], pdf2image [15] and PIL [16] libraries are used to support the conversion of the PDF into images and their temporary storage.

# For detecting the platform on which code is running
import platform                     
# To create temporary directory
from tempfile import TemporaryDirectory     
# For optical character recognition
import pytesseract                         
# For converting pdf to images
from pdf2image import convert_from_path     
# Imaging library
from PIL import Image

3.2. Creating Code for OCR

The code in the function below is based on work presented in [10] and [13]. The function does the following things:

Grabs the path for the Tesseract executable based on the platform on which the code is being run, and assigns it to an attribute in the pytesseract library.
Runs the convert_from_path function from the pdf2image library to render the PDF pages as images. On Windows this uses the Poppler binaries at the supplied path.
Iterates through the images created by the library and stores them on disk in a temporary location to save memory.
Opens these files one at a time as PIL Image objects and converts each to a string using the pytesseract function image_to_string.
Appends the text string grabbed from each image to the larger text string and saves this larger string to file for further use by audio synthesis.

def pdfExtractOCRText(PDFPath, OutputFilePath, 
                      OutputFolderPath, start=None, 
                      stop=None, TesseractEXEPath=None, 
                      PopplerEXEPath=None):
    # On Windows, Tesseract and Poppler must be installed
    # separately and their paths passed in
    if platform.system() == "Windows":
        if (TesseractEXEPath is None 
         or PopplerEXEPath is None): 
            raise FileNotFoundError
        
        # Setup the path to the Tesseract EXE 
        # if in windows
        pytesseract.pytesseract.tesseract_cmd = TesseractEXEPath

    # Store all the pages of the PDF in a list variable
    image_file_list = []
    
    # At the end of the with .. tempdir block, the
    # TemporaryDirectory() we're using gets removed! 
    with TemporaryDirectory() as tempdir:
        # Create a temporary directory to hold 
        # temporary images of the pdf.
        # These are stored into a variable
        if platform.system() == "Windows":
            pdf_pages = convert_from_path(PDFPath, 
             dpi=200, poppler_path=PopplerEXEPath,
             output_folder=OutputFolderPath, 
             first_page=start, last_page=stop,fmt='jpg')
        else:
            pdf_pages = convert_from_path(PDFPath, 
             dpi=200, output_folder=OutputFolderPath, 
             first_page=start, last_page=stop, fmt='jpg')

        # Iterate through all the pages stored above
        # enumerate() "counts" the pages for us.
        # store these images into a temporary directory
        for page_enumeration, page in enumerate(pdf_pages, start=1):
            # Create a file name to store the Path 
            # for the image. A forward slash works as a
            # separator on both Windows and Linux.
            tmpfilename = (tempdir + '/page_' +
                           str(page_enumeration).rjust(4, '0') + '.jpg')

            # Declaring filename for each page of PDF 
            # as JPG. For each page, filename will be:
            # PDF page 1 -> page_0001.jpg
            # PDF page 2 -> page_0002.jpg
            # PDF page 3 -> page_0003.jpg
            # ....
            # PDF page n -> page_000n.jpg
            # Save the image of the page in system
            page.save(tmpfilename, "JPEG")
            image_file_list.append(tmpfilename)

        # Open the file in append mode so that
        # All contents of all images are added to 
        # the same file. At the end of the with .. 
        # output_file block. The file is closed after 
        # writing all the text.
        with open(OutputFilePath, "a") as output_file:
            full_text=''
            # Iterate over the saved page images
            for image_number, image_file in enumerate(image_file_list, start=1):
                # Progress tracker
                print('Images: ' + str(len(image_file_list)) +
                      ' Image: ' + str(image_number))

                # Set filename to recognize text from
                # Again, these files will be:
                # page_0001.jpg
                # page_0002.jpg
                # ....
                # page_000n.jpg 
                # Recognize the text as string in 
                # image using pytesseract
                text = str(pytesseract.image_to_string(
                    Image.open(image_file)))

                # The recognized text is stored 
                # in variable text. No string processing 
                # is applied here (for example re-joining 
                # words hyphenated across line endings); 
                # that is left to a later parsing step.
                full_text = full_text + text

            # Finally, write the processed text to 
            # the file.
            output_file.write(full_text)   
    
    # Return the text after parsing and writing to file
    return full_text
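
The temporary image paths in the function above are assembled by string concatenation. A minimal alternative sketch, assuming only the standard library, builds the same zero-padded names with os.path.join so that the separator always matches the platform (the helper name page_image_path is illustrative):

```python
import os

def page_image_path(folder, page_number, width=4):
    """Build a zero-padded image path such as <folder>/page_0003.jpg (illustrative helper)."""
    filename = 'page_' + str(page_number).zfill(width) + '.jpg'
    # os.path.join picks the right separator for the current platform
    return os.path.join(folder, filename)

print(page_image_path('tmp', 3))
```

Zero padding keeps the files in page order when they are sorted by name, which matters if the images are ever read back from a directory listing rather than from the list built in the loop.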

3.3. Parsing the Text

Unfortunately, we may have a few problems after text extraction. The extracted text might have artifacts from the conversion process. So before we get to synthesizing audio it may be necessary to remove some of these artifacts. This process can actually get quite complicated. I have been exceedingly lazy and just done some minor corrections. These need considerably more care to be done properly, and at minimum I should really be using regular expressions throughout for text replacements. Below is some simple replacement code that can be called after the initial text has been extracted. With OCR techniques, you are more likely to encounter textual artifacts. For the code below, at minimum we need the following library import for regex (regular expressions):

import re # Regular expressions

In addition we can also make use of the natural language toolkit (nltk library [17]). I’ll discuss this further in the next section on audio synthesis. The code below is specifically needed for cleaning up OCR extracted text. This may not be the case for a direct approach to extraction.

def parseText(text):
    # Add a period at the end of each paragraph (marked by 
    # a blank line) so that it ends a sentence
    text = text.replace("\n\n", ".\n\n")
    
    # Fixing periods
    text = text.replace("..", ". ")
    text = text.replace(". .", "")
    text = text.replace(".   .","")

    # Remove orphan quotations and other transcription errors
    text = text.replace(",’"," ")
    text = text.replace("’."," ") 
    text = text.replace(".’",". ")
    text = text.replace("","") 
    text = text.replace("®","")
    text = text.replace("°.","")  
    text = text.replace("*°","")
    text = text.replace("**","")
    text = text.replace("™","")
    text = text.replace("*’","")
    text = text.replace("*.","")
    text = text.replace("*”.","")
    text = text.replace("*~","")
    text = text.replace("*”°","")
    text = text.replace("“°°","")
    
    # line end characters replaced with space
    text = text.replace("\n", " ")

    # Fix names
    text = text.replace("lbn", "Ibn")
    text = text.replace("|\\","I")
    text = text.replace("|bn","Ibn")
    text = text.replace("|","")
    
    # Remove commas between numbers (for instance in dollar 
    # amounts or years)
    text=re.sub(r"(?<=[0-9]),(?=[0-9])", "", text)

    return text
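
As a minimal sketch of what a regex-based version of this cleanup might look like, the rules can be kept in an ordered table and applied in sequence. The rule set and the names CLEANUP_RULES and clean_text are illustrative assumptions, not the code from the notebook, and the dehyphenation rule will also merge genuine compounds that happen to break at a line end:

```python
import re

# Ordered (pattern, replacement) pairs; the rules are illustrative only
CLEANUP_RULES = [
    # Re-join words hyphenated across a line break: "biog-\nraphy" -> "biography"
    (re.compile(r'(\w)-\n(\w)'), r'\1\2'),
    # OCR often reads the capital "I" of "Ibn" as "l" or "|"
    (re.compile(r'(?<!\w)[l|]bn\b'), 'Ibn'),
    # Drop thousands separators between digits: "1,332" -> "1332"
    (re.compile(r'(?<=\d),(?=\d)'), ''),
    # Collapse any run of whitespace (including line breaks) into one space
    (re.compile(r'\s+'), ' '),
]

def clean_text(text):
    """Apply each cleanup rule in order and trim the result."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(clean_text("lbn Khaldun was born in 1,332 and his biog-\nraphy is well known."))
# Ibn Khaldun was born in 1332 and his biography is well known.
```

Keeping the rules in one table makes the order explicit, which matters here: the dehyphenation rule has to run before line breaks are turned into spaces.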

4. Synthesizing Audio from Extracted Text

Now that we have used two different approaches to extracting text and corrected some of the errors, it’s time to do the fun part: creating the audio!

The code below shows the imports needed to implement this voice synthesis. Here we use Google’s Text to Speech tools (gtts library). This tool provides an excellent speech synthesizer, but at a cost: the request must be made to Google Translate’s API. This is not ideal for a number of reasons:

It requires an internet connection to do the conversion
It requires sending data to a third party, which is not ideal if you have private documents that you would like to convert
It leaves you with limited control over synthesis parameters

That said, for this article, I’m going to continue to use GTTS as the approach to voice synthesis. In a later article, I’ll try to tackle training my own speech synthesizer on my own voice, to create a uniquely "me" audio synthesis, hopefully without the negatives of using Google’s services.

# Import the required module for text to speech conversion
from gtts import gTTS                       
# for Adding or detecting pauses in audio
from pydub import AudioSegment, silence     
# For assigning an encoder for audio
from pydub.utils import which               
# For byte stream to edit audio
from io import BytesIO                      
# Tokenize sentences using the natural language toolkit
import nltk
# For some basic os functions
import os

In addition to GTTS, a BytesIO stream provides a mechanism for storing audio in memory. The pydub library provides a tool with which to join small audio sequences together. The converted text may produce pauses of inappropriate length, so after tokenizing the text (via nltk) into sentences and then phrases, pydub is used to add pauses between the audio segments.

To encode audio into mp3 format, we need some form of encoder. There are a number of these available, but for the purposes of this article I’ll use ffmpeg. This can be installed, on Linux, with the following command:

sudo apt install ffmpeg

On Windows, ffmpeg can be installed from the following link: https://ffmpeg.org/download.html [18]. The code below provides a function to generate the audio.

def audioGenerate(text, language, accent, filename):
    # Construct a new filename for the gtts output
    name, ext = os.path.splitext(filename)
    filename0 = name+'_gtts'+ext
    
    # Create a byte stream object to act as a 
    # pointer to a file
    fp0 = BytesIO()
    
    # Assign the ffmpeg encoder to pydub audio 
    # segment encoder.
    AudioSegment.converter = which("ffmpeg")
    
    # provide a data structure containing 
    # a mapping for various regional accents
    # to be used by gtts or other synthesizer such
    # as pyttxs
    ac={
        'gtts':{
            'en':{
                'au':'com.au',
                'uk':'co.uk',
                'us':'com',
                'ca':'ca',
                'in':'co.in',
                'ie':'ie',
                'za':'co.za',
            }
        },
    }

    # Construct a zero segment playlist of audio segments
    # I'll add to this for every sentence. Do so for 
    # both gtts and pytxs approaches to voice synthesis
    playlist0 = AudioSegment.empty()
    
    # Grab the sentences in the text
    sentences = nltk.tokenize.sent_tokenize(text)

    # Iterate through the text sentence by sentence
    for sentence_number, sentence in enumerate(sentences, start=1):
        # Progress counter
        print('Sentences: ' + str(len(sentences)) + 
              ' Sentence: ' + str(sentence_number))

        # Use regular expressions to split up the sentences
        # into phrases separated by commas, semi-colons and dashes
        pattern = r'|'.join([',',';',' - '])
        phrases = re.split(pattern, sentence)
        
        # Iterate through individual phrases in the sentences
        for phrase in phrases:
            #--------------------------------------
            # Using GTTS engine for text-to-speech
            #--------------------------------------
            # Passing the text and language to the 
            # text to speech gtts engine, 
            try: 
                gtts = gTTS(text=phrase, lang=language, 
                            tld = ac['gtts'][language][accent], 
                            slow=False)
                # Write audio for phrase to the byte stream
                gtts.write_to_fp(fp0)
                fp0.seek(0)
                # Create an audio phrase
                seg0=AudioSegment.from_mp3(fp0)
                playlist0 = (playlist0 + seg0 +
                             AudioSegment.silent(duration=80))
            except Exception as e:
                # Skip phrases that cannot be synthesized 
                # (for example empty strings) but report them
                print('Skipped phrase: ' + repr(phrase) + 
                      ' (' + str(e) + ')')
            finally:
                # Reset the fp0 byte stream object for the 
                # next phrase
                fp0 = BytesIO()

        # Once all phrases for the current sentence 
        # are complete add the end of sentence pause
        playlist0 = (playlist0 + 
                     AudioSegment.silent(duration=300))
        
    # Write the audio to file
    playlist0.export(filename0, format="mp3")
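
The pause logic in this function can be separated from the synthesis itself, which makes it possible to try out without an internet connection. The sketch below is a minimal illustration, assuming the sentences have already been tokenized; the names plan_phrases, PHRASE_PAUSE_MS and SENTENCE_PAUSE_MS are hypothetical, while the 80 ms and 300 ms values are the ones used above:

```python
import re

PHRASE_PAUSE_MS = 80     # pause after each phrase, as in audioGenerate
SENTENCE_PAUSE_MS = 300  # extra pause at the end of each sentence

def plan_phrases(sentences):
    """Return (phrase, pause_ms) pairs for a list of sentences (illustrative helper)."""
    pattern = '|'.join([',', ';', ' - '])
    plan = []
    for sentence in sentences:
        # Split on commas, semicolons and spaced dashes; drop empty pieces
        phrases = [p.strip() for p in re.split(pattern, sentence)]
        phrases = [p for p in phrases if p]
        for i, phrase in enumerate(phrases):
            pause = PHRASE_PAUSE_MS
            # The last phrase of a sentence also carries the sentence pause
            if i == len(phrases) - 1:
                pause += SENTENCE_PAUSE_MS
            plan.append((phrase, pause))
    return plan

example = ["Ibn Khaldun's Life.", "He travelled widely, wrote often; and taught."]
for phrase, pause in plan_phrases(example):
    print(pause, phrase)
```

Each phrase would then be passed to the synthesizer and followed by a silent segment of the planned duration. Filtering out empty phrases up front also avoids sending empty strings to the speech engine.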

5. Downloads and Conclusion

You can download the complete code for this project as a Jupyter notebook here:

Code + Examples

In the future, what I’d like to do is take the voice synthesis to the next level: record my own voice and create a narrator for the audiobook that sounds like me. Further, I’ll start to tackle other languages such as Japanese and Chinese and their conversion to audiobooks as well. Stay tuned, folks.
