
Paper2Speech

Tip

arXiv now offers HTML versions of new papers. I am currently working on a browser add-on that adds buttons directly to that website.

Motivation

As a student in applied mathematics / machine learning, I often have to read scientific books, lecture notes, and papers. I usually prefer listening to a professor's lecture and following the visual explanations on the blackboard, because I take in much of the information by ear and don't have to do the "heavy lifting" of reading alone. So far, this has not been possible for books and papers.
So I thought: why not let software read the text out for you? What if you just had to click a button in the Finder, and the book or paper were converted to speech automatically?
This script uses Meta's Nougat package to extract formatted text from the PDF and then converts it to audio with the Google Cloud Text-to-Speech API.
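At its core this is a two-stage pipeline: Nougat turns the PDF into Mathpix Markdown, and the result is sent to the TTS API. A minimal sketch of the first stage, assuming the `nougat` CLI is on your PATH (the function names are illustrative, not the package's actual source layout):

```python
import subprocess
from pathlib import Path

def build_nougat_cmd(pdf_path: str, out_dir: str) -> list:
    """Build the Nougat CLI invocation that extracts Mathpix Markdown from a PDF."""
    return ["nougat", pdf_path, "-o", out_dir]

def pdf_to_mmd(pdf_path: str, out_dir: str) -> Path:
    """Run Nougat; it writes <stem>.mmd into out_dir."""
    subprocess.run(build_nougat_cmd(pdf_path, out_dir), check=True)
    return Path(out_dir) / (Path(pdf_path).stem + ".mmd")
```

The `.mmd` file is the intermediate artifact that the later TTS and HTML stages consume.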

Sample output for the paper Large Language Models for Compiler Optimization:
output audio

Features

The aim of this package is to make papers more accessible by converting them to audio, or to an easy-to-read web page.

  • pause before and after headings
  • skip references like [1], (1, 2), [Feynman et al., 1965], [AAKA23, SKNM23]
  • spell out abbreviations like e.g., i.e., w.r.t., Fig., Eq.
  • read out inline math (work in progress)
  • do not read out block math, instead pause
  • do not read out table contents
  • read out figure, table captions
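The reference-skipping and abbreviation rules above are essentially text substitutions applied before synthesis. A sketch with illustrative patterns (these are not the package's actual rules, just an example of the idea):

```python
import re

# Example citation patterns: [1, 2], [Feynman et al., 1965], [AAKA23, SKNM23]
CITATION_RE = re.compile(
    r"\s*(\[\d+(?:,\s*\d+)*\]"
    r"|\[[A-Z][A-Za-z]*(?:\s+et al\.)?,?\s*\d{4}\]"
    r"|\[[A-Z]{2,}\d{2}(?:,\s*[A-Z]{2,}\d{2})*\])"
)

# Abbreviations spelled out so the TTS voice reads them naturally
ABBREVIATIONS = {
    "e.g.": "for example",
    "i.e.": "that is",
    "w.r.t.": "with respect to",
    "Fig.": "Figure",
    "Eq.": "Equation",
}

def prepare_for_tts(text: str) -> str:
    """Strip citation brackets and expand abbreviations before synthesis."""
    text = CITATION_RE.sub("", text)
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text
```

For example, `prepare_for_tts("As shown [1, 2] in Fig. 3")` yields `"As shown in Figure 3"`.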

Installation

Replace the GEMMA_CPP_PATH variable in src/markdown_to_html.py with the build path of your gemma executable. The tokenizer and model weights should be in the same directory.

git clone https://github.com/kaieberl/paper2speech
cd paper2speech
pip install .

For conversion to html, additionally install:

brew install node
npm install -g @mathpix/mpx-cli
sudo port install latexml

Usage

Files can be converted from PDF, MMD (Mathpix Markdown), or TeX to MP3 or HTML.

paper2speech <input_file.pdf> -o <output_file.mp3>

If an error occurs at a later stage, you can invoke the command again on the intermediate files it produced (e.g. the .mmd file).

The Google Cloud authentication JSON file should be placed in the src directory. It can be downloaded from the Google Cloud Console, as described here.
TL;DR: On https://cloud.google.com, create a new project. In your project, click the three dots in the upper right corner > project settings > service accounts > choose or create a service account > create key > JSON > create. The resulting JSON file is downloaded automatically. Google TTS Neural2 and Wavenet voices are free for the first 1 million characters per month; after that, Neural2 voices cost $16 per 1M characters and Wavenet voices $4 per 1M characters.
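Given the pricing above, a quick back-of-the-envelope cost estimate (the function and constants are illustrative, based only on the rates just stated):

```python
FREE_CHARS = 1_000_000  # first 1M characters per month are free
PRICE_PER_MILLION = {"neural2": 16.0, "wavenet": 4.0}  # USD per 1M billable chars

def monthly_cost_usd(characters: int, voice_family: str = "neural2") -> float:
    """Estimate the monthly TTS cost for a given character count."""
    billable = max(0, characters - FREE_CHARS)
    return billable / 1_000_000 * PRICE_PER_MILLION[voice_family]
```

A typical paper is well under 100k characters, so several papers per month stay within the free tier.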

You can customize the voice in the definition of the voice variable.

voice = texttospeech.VoiceSelectionParams(
    language_code='en-GB',
    name='en-GB-Neural2-B',
)

Go to https://cloud.google.com/text-to-speech to try out different voices and languages; below the text box there is a button that shows the JSON request. For example, to use an American English voice, replace 'en': ('en-GB', 'en-GB-Neural2-B'), with 'en': ('en-US', 'en-US-Neural2-J'),. Also change the fallback Wavenet voice, defined a few lines further down, to the matching voice:

voice = texttospeech.VoiceSelectionParams(
    language_code='en-GB',
    name='en-GB-Wavenet-B',
)

This voice is used if the Neural2 voice returns an error, e.g. because a sentence is too long.
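The Neural2-then-Wavenet fallback described above boils down to a try/except retry. A sketch with the two synthesis calls injected as callables (the real code calls the Google Cloud client instead):

```python
def synthesize_with_fallback(text, neural2_synth, wavenet_synth):
    """Try the Neural2 voice first; fall back to the Wavenet voice on any API error."""
    try:
        return neural2_synth(text)
    except Exception:
        # e.g. the Neural2 endpoint rejected an overly long sentence
        return wavenet_synth(text)
```

Injecting the callables keeps the fallback logic testable without network access.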

On macOS, you can create a shortcut in the Finder with the following steps:

  1. In Automator, create a new Quick Action.
  2. At the top, set the workflow to receive "PDF files" in "Finder".
  3. Add a "Run Shell Script" action. Set the shell to /bin/zsh and "Pass input" to "as arguments".
  4. Add the following code for MP3 output:
source ~/opt/miniconda3/etc/profile.d/conda.sh
conda activate paper2audio
paper2speech "$1" -o "${1%.*}.mp3"

For creating an html page:

export PATH=/opt/homebrew/bin:/opt/local/bin:$PATH
source ~/opt/miniconda3/etc/profile.d/conda.sh
conda activate paper2audio
file_name=${1##*/}
paper2speech "$1" -o "/path/to/paper2speech/out/${file_name%.*}.html"

Here the two paths added to PATH in the first line should be the locations of node and latexmlc.

  5. Save the action and give it a name, e.g. "Paper2Speech" or "PaperAI", respectively.

FAQ

What to do if I get the error: Mathpix CLI conversion failed?

There is likely an unsupported LaTeX command in your mmd file.

  1. Go to snip.mathpix.com and paste the content of your mmd file into a new note. You will get a preview on the right; any command unsupported in Mathpix Markdown shows up as a yellow warning.
  2. Inside text_to_speech.py, add a replacement to the refine_mmd() function at the bottom. Please also open a PR or an issue so that I can fix the bug. Alternatively, if you can live with the error, export the note from Mathpix as TeX and run paper2speech on the .tex file.
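Such a replacement is typically a single string or regex substitution. A hypothetical example of the shape it takes (the `\bm` command here is just an illustration, not a known bug in the package):

```python
import re

def refine_mmd(text: str) -> str:
    """Rewrite LaTeX commands that Mathpix Markdown does not support."""
    # Hypothetical fix: replace an unsupported \bm{...} with \mathbf{...}
    text = re.sub(r"\\bm\{([^}]*)\}", r"\\mathbf{\1}", text)
    return text
```

Each new unsupported command becomes one more substitution line in this function.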

Limitations (for PDFs)

  • only works for English
  • currently does not support images in PDFs

Roadmap

  • create a Dockerfile for easy installation