Using Python to Automate Narration of a 14th-Century Book

Siraj Sabihuddin

A scientist and engineer trying to read history without reading. It took considerable time to produce this article for free in my spare time. If you like this article or want help with a project, contact me. And if you appreciate the work and have the inclination, buy me a coffee or ten via the Donate button.

Recently, I encountered a problem. The problem is that of time. I have been trying to read an English translation of a book published in 1377 AD by Ibn Khaldun entitled The Muqaddimah [2], also known as the “Introduction to History”. This book can be regarded as something of a precursor to the equivalent work produced by Machiavelli, The Prince [1], published in 1532. Machiavelli, as it were, was the Ibn Khaldun of Europe. For posterity’s sake, I’ve included the full text of Franz Rosenthal’s translation of The Muqaddimah with this article below.

Franz Rosenthal died roughly 20 years before this article was published. I distribute the book recognizing that, at least according to the Internet Archive, it is currently in the public domain [4].

The Muqaddimah, Ibn Khaldun (1377 AD) [2], and The Prince, Machiavelli (1532 AD)

1. Introduction

The issue, however, with The Muqaddimah is that no audiobook seems to exist and the book sits at roughly a thousand pages of text. I, being one of those lazy folks, like to multi-task somewhat in order to get my reading done. So while I walk to and from work, I’d like to be able to read without the requisite dangers of being run over by a car, an issue a hapless reader might encounter were he to read willy-nilly while out and about. Thus, in this article, I create a simple Python program to synthesize an audiobook from an English translation of Ibn Khaldun’s seminal work. Right ho, boys and girls. With that, let’s get started.

I’ll start you off with a PDF file of the book. From there, we will convert the PDF to text, fix parsing errors, and then insert pauses and the like during voice synthesis. These steps are shown in the diagram below:

Steps covered in this article

The end result of this process will sound something like the following audio clip of the first two sentences of the book (page 3, top):

TEXT: “Ibn Khaldun’s Life. WRITING the biography of Ibn Khaldun would not seem to be a particularly difficult task”

Advice

I would ask that if you use my article and the code herein, you take the time to cite it properly as well. The code is distributed under the GPL v2 License. That is, improvements should be given back to the community (see a discussion with Linus Torvalds: https://youtu.be/PaKIZ7gJlRU).

2. Converting PDF to Text using Direct Extraction

Digitized books come in all shapes and sizes. Some might be pure text files, others Word documents, and still others EPUB, MOBI, AZW or other formats. In the case of pure text files, it’s relatively easy to extract the text. However, some formats make it somewhat more difficult, for instance those with Digital Rights Management (DRM). Historical texts may also be challenging where the pages are stored as scanned images rather than text.

In my case, I chose to work directly with PDF since the original digital book I had available was in PDF format. My first go-to approach was to use an existing Python library called PyPDF2 [3] to extract the text from the PDF file using the import shown below:

Python

Using the imported PdfFileReader, I could now read the header information from the PDF file directly using the function shown below.

Python
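
A minimal sketch of what such a function might look like follows. It assumes a recent PyPDF2 release, where PdfFileReader has been renamed PdfReader and the header is exposed as reader.metadata (older releases used getDocumentInfo() instead). The names read_pdf_info and open_and_read_info are hypothetical, chosen for illustration only.

```python
def read_pdf_info(reader):
    """Return (header dictionary, page count) for an already opened PDF reader."""
    # reader.metadata may be None when the PDF carries no header information.
    info = dict(reader.metadata or {})
    return info, len(reader.pages)


def open_and_read_info(pdf_path):
    """Open a PDF from disk and return its header dictionary and page count."""
    # Third-party import kept local so read_pdf_info has no hard dependency.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as handle:
        # Read everything we need while the file is still open.
        return read_pdf_info(PdfReader(handle))
```

Calling open_and_read_info on the path of your own PDF should return a dictionary with keys such as '/Creator' and '/Producer', provided the file carries that header.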

Running the function above yields the following output (see below). So it looks like a good start. Unfortunately, this good start will not last, as you’ll see shortly.

{'/Creator': "Microsoft Word - al-muqadimmah for ibn khal'doon", '/Producer': 'ScanSoft PDF Create! 2', '/CreationDate': 'D:20080721223217', '/ModDate': 'D:20080721223217', '/Author': 'Owner', '/Title': "Microsoft Word - al-muqadimmah for ibn khal'doon"}

So at this stage I implemented the following function to read the PDF file directly.

Python
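
One way such a page reader could be written is sketched below. It again assumes the newer PyPDF2 interface (reader.pages and page.extract_text()), and the function names and the zero-based, end-exclusive page range are assumptions made for this sketch rather than a record of the original code.

```python
def extract_pages(reader, first_page=0, last_page=None):
    """Return the text of pages first_page..last_page-1 (zero-based) as one string."""
    total_pages = len(reader.pages)
    if last_page is None or last_page > total_pages:
        last_page = total_pages
    page_range = range(first_page, last_page)
    # Print the page count and the range as a quick sanity check.
    print(total_pages)
    print(page_range)
    # Fall back to an empty string when a page yields no text.
    return "\n".join(reader.pages[i].extract_text() or "" for i in page_range)


def extract_pdf_text(pdf_path, first_page=0, last_page=None):
    """Open a PDF from disk and extract the text of the requested pages."""
    # Third-party import kept local so extract_pages has no hard dependency.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as handle:
        return extract_pages(PdfReader(handle), first_page, last_page)
```

For a single page, something like extract_pdf_text("book.pdf", 3, 4) would do, where "book.pdf" is a placeholder file name.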

I can now run the function on only one page of the PDF text using the following code.

Python

Here is the output of the text extraction for the page range 3 to 4 (that is, the single page at index 3):

1230
range(3, 4)
Ibn Khaldun's own great work, especially the Muqaddimah, is
another important source for his biography. Written in a much more
personal style than most medieval works, the Muqaddimah sharply
outlines his own personal philosophy and provides insights into the
workings of his mind.
This abundance of biographical source material has enabled
modern scholars at various times to write Ibn Khaldun's life and to
present the data in a factually correct form to whic h little can be
added. These modern biographies vary greatly in length. Among the
longest are de Slane's account in the Introduction to his translation of
the Muqaddimah, largely a literal translation of the Autobiography,6
and that by M. A. Enan, in his Ibn Khaldun, His Life and Work.7There
has been no re cent treatment in extenso of Ibn Khaldun's early life
(down to 1382), but his Egyptian period is the subject of two masterly
studies by W. J. Fischel, "Ibn Khaldun's Activities in Mamluk Egypt
(1382 -1406)"8and Ibn Khaldun and Tamerlane.9
In its outlines, Ibn Khaldun's life thus is quite clearly known.
However, the modern student who would like to know much more
about him, discovers that his questions can only be answered by
conjecture, if at all. Considering the excellence of the source ma terial,
at least as judged by external criteria, the deficiencies in our knowledge
...
biographer's claim that experiences which he shared with all his
contemporaries contributed to the formation of his individual

Taking a closer look at the raw string, you might notice that newline characters (\n) have been inserted at the end of each line of the page (see below). We may need to make some adjustments to remove these before performing voice synthesis.

'Ibn Khaldun\'s own great work, especially the Muqaddimah, is\nanother important source for his biography. Written in a much more\npersonal style than most medieval works, the Muqaddimah sharply\noutlines his own personal philosophy and provides insights into the\nworkings of his mind.\nThis abundance of biographical source material has enabled\nmodern scholars at various times to write Ibn Khaldun\'s life and to\npresent the data in a factually correct form to whic h little can be\nadded. These modern biographies vary greatly in length. Among the\nlongest are de Slane\'s account in the Introduction to his translation of\nthe Muqaddimah, largely a literal translation of the Autobiography,6\nand that by M. A. Enan, in his Ibn Khaldun, His Life and Work.7There\nhas been no re cent treatment in extenso of Ibn Khaldun\'s early life\n(down to 1382), but his Egyptian period is the subject of two masterly\nstudies by W. J. Fischel, "Ibn Khaldun\'s Activities in Mamluk Egypt\n(1382 -1406)"8and Ibn Khaldun and Tamerlane.9\nIn its outlines, Ibn Khaldun\'s life thus is quite clearly known.\nHowever, the modern student who would like to know much more\nabout him, discovers that his questions can only be answered by\nconjecture, if at all. Considering the excellence of the source ma terial,\nat least as judged by external criteria, the deficiencies in our knowledge\nmust be ascribed to the internal character of the avail able information.\nIt is true that no amount of material will ever fully satisfy a biographer,\nbut in Ibn Khaldun\'s ca se there are particular reasons why a fully\nsatisfactory account of his life is virtually impossible of achievement. In\nthe first place, Ibn Khaldun considered only such events in his life worth\nrecording as were especially remarkable, the most unusual\nach ievements of an exceptional person. Thus he did not pay much\nattention to the kind of data so dear to modern psychological\nbiographers. He does not speak about his childhood. 
His family is\nmentioned only because family considerations often influenced the\ncourse of his wanderings and because it was afflicted by unusual\nmisfortunes. All his ordinary activities are passed over in silence. Ibn\nKhaldun would probably have denied that this kind of data has any\nheuristic value. He would have doubted the validity o f the modern\nbiographer\'s claim that experiences which he shared with all his\ncontemporaries contributed to the formation of his individual'
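
A minimal sketch of one such adjustment is shown here, using only the standard re module. The function name unwrap_lines is hypothetical, and the approach is deliberately blunt: it joins every line break, so genuine paragraph breaks are merged as well.

```python
import re


def unwrap_lines(text):
    """Replace line breaks (and the whitespace around them) with single spaces."""
    # Each newline, together with any spaces or carriage returns beside it,
    # becomes one space. Paragraph breaks are flattened too.
    joined = re.sub(r"\s*\n\s*", " ", text)
    # Collapse any runs of spaces or tabs that remain.
    return re.sub(r"[ \t]{2,}", " ", joined).strip()
```

Applied to the extracted page, this turns "the Muqaddimah, is\nanother important source" into one continuous line that a speech synthesizer can read without stumbling on line ends.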

3. Converting PDF to Text Using Optical Character Recognition

There are some cases where it’s useful to be able to extract text that isn’t directly readable through direct extraction from the PDF. This could be where the PDF itself is corrupted somehow or has some built-in DRM that prevents you from copying data. In reality, we may not even have a PDF, or we may want to read a document from a portal that has an embedded online magazine or the like. So this alternative approach instead uses Optical Character Recognition (OCR) to extract text.

3.1. On Tesseract OCR

To do this we will use the pytesseract library [5], a wrapper around Google’s Tesseract OCR engine [6]. Alas, we need to rely on Google for the OCR for now. Later on, in future articles, I’ll discuss alternatives. The nice thing, however, is that no online connection is needed: it’s an open source tool, and we can grab the Tesseract executable for Windows or Linux directly for this purpose. Python’s platform library can be used to detect which operating system the user is running. Tesseract binaries can be installed by following the Tesseract GitHub documentation page here: https://tesseract-ocr.github.io/tessdoc/ [7]

LINUX

To install on Linux you can use the package manager via the following command in the bash terminal (shown here for apt-based distributions), along with the instructions provided here: https://notesalexp.org/tesseract-ocr/#tesseract_5.x [8][12]

sudo apt install tesseract-ocr -y

WINDOWS

To install on Windows you can download the Tesseract installer from the UB Mannheim GitHub wiki here: https://github.com/UB-Mannheim/tesseract/wiki/Downloading-Tesseract-OCR-Engine [9][11]

Below is a list of libraries that we need for the OCR-based text extraction of the PDF document. Notice that the tempfile [14], pdf2image [15] and PIL [16] libraries are used to support the conversion of the PDF into images and their temporary storage.

Python

3.2. Creating Code for OCR

The code in the function below is based on work presented in [10] and [13]. The function does the following things:

  1. Grabs the path for the Tesseract executable based on the platform on which the code is being run, and assigns it to an attribute in the pytesseract library.
  2. Runs the convert_from_path function from the pdf2image library to render the PDF pages as images. Note that pdf2image relies on Poppler [11] for this step, not on the Tesseract binary.
  3. Iterates through the images created by the library and stores them on disk in a temporary location to save memory.
  4. Opens these files one at a time and converts each to a string using the pytesseract function image_to_string. The PIL Image type is used to hold the opened file and pass it into the pytesseract function.
  5. Appends the text string grabbed from each image to a larger text string and saves this larger string to file for later use by audio synthesis.
Python
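
The steps above could be sketched roughly as follows. The install paths in TESSERACT_PATHS are assumed defaults and will likely need adjusting on your machine, and the function names tesseract_path_for and ocr_pdf are hypothetical. The sketch also assumes that pytesseract, pdf2image, Pillow and Poppler are installed.

```python
import os
import platform
import tempfile

# Assumed default install locations; change these to match your own setup.
TESSERACT_PATHS = {
    "Windows": r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "Linux": "/usr/bin/tesseract",
}


def tesseract_path_for(system_name=None):
    """Return the configured Tesseract path for the given (or current) platform."""
    system_name = system_name or platform.system()
    try:
        return TESSERACT_PATHS[system_name]
    except KeyError:
        raise RuntimeError(f"No Tesseract path configured for {system_name}") from None


def ocr_pdf(pdf_path, out_txt_path, first_page=None, last_page=None, dpi=300):
    """Render PDF pages to images, OCR each one, and save the combined text."""
    # Third-party imports kept local so the path helper works without them.
    import pytesseract
    from pdf2image import convert_from_path
    from PIL import Image

    pytesseract.pytesseract.tesseract_cmd = tesseract_path_for()
    chunks = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        # pdf2image renders pages through Poppler; page numbers are 1-based here.
        pages = convert_from_path(pdf_path, dpi=dpi,
                                  first_page=first_page, last_page=last_page)
        image_files = []
        for number, page in enumerate(pages, start=1):
            image_file = os.path.join(tmp_dir, f"page_{number:05d}.jpg")
            page.save(image_file, "JPEG")
            image_files.append(image_file)
        del pages  # release the in-memory images before running OCR
        for image_file in image_files:
            with Image.open(image_file) as image:
                chunks.append(pytesseract.image_to_string(image))
    text = "\n".join(chunks)
    with open(out_txt_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

For a book of a thousand or so pages, it is probably wise to call ocr_pdf in batches using first_page and last_page, since all pages in the requested range are rendered before any are written to disk.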

3.3. Parsing the Text

Unfortunately, we may have a few problems after text extraction. The extracted text might contain artifacts from the conversion process, and with OCR techniques you are more likely to encounter them. So before we get to synthesizing audio, it may be necessary to remove some of these artifacts. This process can actually get quite complicated. I have been exceedingly lazy and just made some minor corrections. These need considerably more care to be done properly, and at minimum I should really be using regular expressions throughout for text replacements. Below is some simple replacement code that can be called after the initial text has been extracted. For the code below, at minimum we need the following library import for regex (regular expressions):

Python

In addition, we can also make use of the Natural Language Toolkit (the nltk library [17]). I’ll discuss this further in the next section on audio synthesis. The code below is specifically needed for cleaning up OCR-extracted text. This may not be the case for a direct approach to extraction.

Python
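
As an illustration of the kind of clean-up involved, here is a small regex-based sketch. The rules are heuristics chosen to handle artifacts like those in the sample page shown earlier (footnote numbers fused to words, missing spaces between sentences, spaced-out number ranges). They are assumptions for this sketch, not an exhaustive or authoritative rule set, and the name clean_extracted_text is hypothetical. Words split by stray spaces, such as "whic h", are not handled and would need a dictionary-based approach.

```python
import re


def clean_extracted_text(text):
    """Apply a few heuristic fixes to text extracted from a PDF."""
    # Drop footnote markers fused to the end of a word, e.g. "Autobiography,6".
    text = re.sub(r"(?<=[A-Za-z])([.,;:])\d{1,3}(?=\s|[A-Z]|$)", r"\1", text)
    # Restore the space lost between two sentences, e.g. "Work.There".
    text = re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", " ", text)
    # Tighten number ranges such as "1382 -1406".
    text = re.sub(r"(?<=\d)\s*-\s*(?=\d)", "-", text)
    # Join hard-wrapped lines and collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

The order of the rules matters: the footnote markers are removed first so that the missing-space rule can then see "Work.There" and repair it.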

4. Synthesizing Audio from Extracted Text

Now that we have used two different approaches to extracting text and corrected some of the errors, it’s time to do the fun part: creating the audio!

The code below shows the imports needed to implement this voice synthesis. Here we use Google’s Text-to-Speech tools (the gtts library). This tool provides an excellent speech synthesizer, but at a cost: namely, the request must be made to Google Translate’s API. This is not ideal for a number of reasons:

  1. It requires an internet connection to do the conversion
  2. It requires sending data to a third party, which is not ideal if you have private documents that you would like to convert
  3. It leaves you with limited control over synthesis parameters

That said, for this article, I’m going to continue to use gTTS as the approach to voice synthesis. In a later article, I’ll try to tackle training my own speech synthesizer with my own voice, hopefully creating a uniquely “me” audio synthesis without the negatives of using Google’s services.

Python

In addition to gTTS, a BytesIO stream provides a mechanism for storing audio in memory. The pydub library provides a tool with which to join small audio sequences together. The converted text may produce pauses of inappropriate length, so after tokenizing the text (via nltk) into sentences and phrases, pydub can be used to add pauses between audio segments.
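
To make the idea of sentence and phrase pauses concrete, here is a small sketch that builds a "pause plan" from a block of text. It uses plain regular expressions as a stand-in for nltk's tokenizers so that it runs with the standard library alone. The pause lengths and the name split_with_pauses are hypothetical values chosen for illustration.

```python
import re

# Hypothetical pause lengths in milliseconds; tune these by ear.
SENTENCE_PAUSE_MS = 600
PHRASE_PAUSE_MS = 250


def split_with_pauses(text):
    """Return a list of (chunk, pause_ms) pairs covering the whole text."""
    plan = []
    # Naive sentence split on ., ! or ? followed by whitespace. Abbreviations
    # such as "M. A." are split too, which a proper tokenizer would avoid.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        # Split each sentence into phrases after commas, semicolons and colons.
        phrases = [p.strip() for p in re.split(r"(?<=[,;:])\s+", sentence) if p.strip()]
        for index, phrase in enumerate(phrases):
            is_last = index == len(phrases) - 1
            plan.append((phrase, SENTENCE_PAUSE_MS if is_last else PHRASE_PAUSE_MS))
    return plan
```

Each pair says what to speak and how long to stay silent afterwards, with a longer pause at the end of a sentence than after a comma.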

To encode audio into MP3 format, we need some form of encoder. There are a number of these available, but for the purposes of this article, I’ll use ffmpeg. This can be installed on Linux with the following command:

sudo apt install ffmpeg

On Windows, ffmpeg can be installed from the following link: https://ffmpeg.org/download.html [18]. The code below provides a function to generate the audio.

Python
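
A minimal sketch of such a function is given below, following the BytesIO approach discussed in [25]. It assumes the text has already been split into a list of (text, pause in milliseconds) pairs, and that gtts, pydub and ffmpeg are installed. The function names are hypothetical, and each chunk triggers a network request to Google.

```python
from io import BytesIO


def drop_empty_chunks(plan):
    """Remove chunks with nothing speakable and clamp negative pauses to zero."""
    # gTTS refuses empty text, so filter such chunks out before synthesis.
    return [
        (chunk.strip(), max(0, int(pause_ms)))
        for chunk, pause_ms in plan
        if any(character.isalnum() for character in chunk)
    ]


def synthesize_audiobook(plan, out_path, lang="en"):
    """Speak each chunk with gTTS, add the pauses, and export one MP3 file."""
    # Third-party imports kept local so drop_empty_chunks works without them.
    from gtts import gTTS
    from pydub import AudioSegment

    book = AudioSegment.empty()
    for chunk, pause_ms in drop_empty_chunks(plan):
        buffer = BytesIO()
        # Requires an internet connection: the text is sent to Google.
        gTTS(text=chunk, lang=lang).write_to_fp(buffer)
        buffer.seek(0)
        book += AudioSegment.from_file(buffer, format="mp3")
        book += AudioSegment.silent(duration=pause_ms)
    # pydub hands the encoding to ffmpeg, which must be on the PATH.
    book.export(out_path, format="mp3")
    return out_path
```

Building the whole book as one in-memory segment is simple but memory-hungry. For a thousand pages, exporting one file per chapter or per page is likely the safer choice.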

5. Downloads and Conclusion

You can download the complete code for this project as a Jupyter notebook here:

In the future, what I’d like to do is take the voice synthesis to the next level. That is, to record my own voice and create a narrator for the audiobook that sounds like me. Further, I’ll start to tackle other languages such as Japanese and Chinese and their conversion to an audiobook as well. Stay tuned, folks.

Information

Would love some support if you found this article useful. It takes a lot of time to write articles like this (i.e., roughly 20+ hours). Your support would be extremely helpful in paying for the coffees that I need to stay awake into the night after work to write them.

6. Listen to the Book

I’ll be slowly adding the full audio of the book as translated by Franz Rosenthal to my YouTube playlist here: https://www.youtube.com/playlist?list=PLdIyJnxAQFX3WdLBdm9foK70CYpSqMUCy.

References:

  1. Wikipedia. https://en.wikipedia.org/wiki/The_Prince. [Accessed: 2022]
  2. Wikipedia. https://en.wikipedia.org/wiki/Muqaddimah. [Accessed: 2022]
  3. https://pypdf2.readthedocs.io/en/latest/ [Accessed: 2022]
  4. https://archive.org/details/al-muqaddimah-by-ibn-khaldun-and-translated-by-franz-rosenthal/page/n5/mode/2up [Accessed: 2022]
  5. https://github.com/madmaze/pytesseract [Accessed: August 2022]
  6. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf [Accessed: 2022]
  7. https://tesseract-ocr.github.io/tessdoc/ [Accessed: 2022]
  8. https://notesalexp.org/tesseract-ocr/#tesseract_5.x [Accessed: 2022]
  9. https://github.com/UB-Mannheim/tesseract/wiki/Downloading-Tesseract-OCR-Engine [Accessed: 2022]
  10. https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/ [Accessed: 2022]
  11. https://blog.alivate.com.au/poppler-windows/ [Accessed: 2022]
  12. https://linuxhint.com/install-tesseract-ocr-linux/ [Accessed: 2022]
  13. https://realpython.com/pdf-python/ [Accessed: 2022]
  14. https://docs.python.org/3/library/tempfile.html [Accessed: 2022]
  15. https://github.com/Belval/pdf2image [Accessed: 2022]
  16. https://github.com/pi2p/pil [Accessed: 2022]
  17. https://www.nltk.org/ [Accessed: 2022]
  18. https://ffmpeg.org/download.html [Accessed: 2022]
  19. https://www.geeksforgeeks.org/convert-text-speech-python/ [Accessed: 2022]
  20. https://pythonbasics.org/python-play-sound/ [Accessed: 2022]
  21. https://gtts.readthedocs.io/en/latest/module.html [Accessed: 2022]
  22. https://pythonbasics.org/python-play-sound/ [Accessed: 2022]
  23. https://gtts.readthedocs.io/en/latest/tokenizer.html [Accessed: 2022]
  24. https://www.holisticseo.digital/python-seo/nltk/tokenization [Accessed: 2022]
  25. https://stackoverflow.com/questions/55962939/how-can-i-efficiently-convert-gtts-audio-into-pydub-audiosegments [Accessed: 2022]
  26. https://pydub.com/ [Accessed: 2022]
