Text to Speech Support #755

andrewfrench · 2024-04-23T16:45:39Z

Introduces support for Text to Speech workloads. For example:

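from griptape.drivers import OpenAiTextToSpeechDriver
from griptape.engines import TextToSpeechEngine
from griptape.structures import Agent
from griptape.tools.text_to_speech_client.tool import TextToSpeechClient
from griptape.utils import Chat


# Give the agent a tool that converts text to speech and writes the audio to audio_out/.
agent = Agent(tools=[
    TextToSpeechClient(
        output_dir="audio_out",
        engine=TextToSpeechEngine(
            text_to_speech_driver=OpenAiTextToSpeechDriver(),
        ),
    ),
])

# Start an interactive chat session with the agent.
Chat(agent).start()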

todos:

TextToSpeechClient implementation
TextToSpeechClient documentation
Driver documentation
Engine documentation
AudioArtifact documentation
ElevenLabs SDK as an optional dependency
Generalized MediaArtifactFileOutputMixin

dylanholmes

Great job!

The comments I've added are mostly "food for thought".

dylanholmes · 2024-04-24T13:09:41Z

pyproject.toml

@@ -57,6 +57,7 @@ pandas = {version = "^1.3", optional = true}
 pypdf = {version = "^3.9", optional = true}
 pillow = {version = "^10.2.0", optional = true}
 mail-parser = {version = "^3.15.0", optional = true}
+elevenlabs = "^1.1.2"


Should this be an optional dependency?
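For context on what an optional dependency implies on the code side: the SDK can no longer be imported at module load time, so the driver would need to import it lazily and fail with a useful message when it is missing. A minimal sketch, assuming a hypothetical helper named `load_optional` and a hypothetical extra name (neither is taken from this PR):

```python
import importlib
from types import ModuleType


def load_optional(name: str, extra: str) -> ModuleType:
    """Import `name` on first use and point the user at the extra that provides it.

    Hypothetical helper for illustration; the extra name is supplied by the caller.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing package lands here.
        raise ImportError(
            f"'{name}' is not installed. Install the '{extra}' extra to use this feature."
        ) from exc
```

A driver would then call something like `load_optional("elevenlabs", "<extra-name>")` inside the method that needs the SDK rather than at the top of the module.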

dylanholmes · 2024-04-24T13:20:14Z

griptape/utils/play_audio.py

+def play_audio(artifact: AudioArtifact) -> AudioArtifact:
+    elevenlabs.play(artifact.value)


Does it matter at all what format AudioArtifact.value is in? (Or are we OK with relying on elevenlabs to throw a runtime error?)

andrewfrench · 2024-04-24

Realistically, we shouldn't rely on the Eleven Labs SDK to play audio. That's just a convenience for demo purposes, and this should be reworked before approval/merge. We should expect to receive audio data in formats common enough that we can play it with standard Python/OS utilities.
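One way to address the format question without depending on any SDK is to inspect the leading bytes before handing the data to a player. The sketch below is only a heuristic under stated assumptions: `sniff_audio_format` is a hypothetical name, and it covers just a few common containers.

```python
from typing import Optional


def sniff_audio_format(data: bytes) -> Optional[str]:
    """Guess an audio format from leading magic bytes; return None if unrecognized."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":
        return "mp3"  # MP3 file that starts with an ID3v2 tag
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        # MPEG audio frame sync (11 set bits). ADTS AAC shares this prefix,
        # so treat this branch as a best guess rather than a guarantee.
        return "mp3"
    return None
```

A `play_audio` utility could call this first and raise a clear error for unrecognized data instead of relying on the player to fail at runtime.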

dylanholmes · 2024-04-24T13:23:55Z

griptape/mixins/media_artifact_file_output_mixin.py

-class ImageArtifactFileOutputMixin:
+class MediaArtifactFileOutputMixin:


If this mixin just makes it easier to write bytes to a file, then why not generalize all the way to BlobArtifactFileOutputMixin? (Or a FileOutputMixin whose write method takes bytes.)

andrewfrench · 2024-04-24

I think we'll want to accept some sort of artifact here, because we fall back to the artifact name as the output filename when one isn't provided (that is, when output_dir is set and we might expect multiple artifacts to end up there). Agreed that there's no reason to limit ourselves to MediaArtifacts, though.
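The fallback described above can be sketched as follows. This is an illustration under assumptions, not the mixin from this PR: `NamedBlob` is a hypothetical stand-in for an artifact with a name and bytes, and `BlobFileOutputMixin` and `_write_to_file` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class NamedBlob:
    """Hypothetical stand-in for an artifact: a name plus raw bytes."""
    name: str
    value: bytes


class BlobFileOutputMixin:
    """Writes an artifact's bytes to output_file, or to output_dir/<artifact name>."""

    output_dir: Optional[str] = None
    output_file: Optional[str] = None

    def _write_to_file(self, artifact: NamedBlob) -> Optional[str]:
        if self.output_file is not None:
            path = self.output_file
        elif self.output_dir is not None:
            # Fall back to the artifact's own name so several outputs can share one directory.
            path = os.path.join(self.output_dir, artifact.name)
        else:
            return None  # no output location configured, so nothing is written

        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(path, "wb") as f:
            f.write(artifact.value)
        return path
```

Accepting an artifact rather than bare bytes is what makes the name fallback possible; a bytes-only write method would force every caller to supply a filename.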

collindutter · 2024-04-24T13:43:40Z

Nice work but...docs. There's no escaping them now 😄

vachillo · 2024-05-15T20:26:54Z

griptape/drivers/text_to_speech/elevenlabs_text_to_speech_driver.py

+        metadata={"serializable": True},
+    )
+    voice: str = field(kw_only=True, metadata={"serializable": True})
+    output_format: str = field(default="mp3_44100_128", kw_only=True, metadata={"serializable": True})


nit: maybe move this default to a top-level constant?

Not sure what the guideline is for inline defaults vs. top-level constants. Seems like it's done both ways.
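For comparison, the top-level-constant variant might look like the sketch below. It uses the standard library's `dataclasses` instead of the project's field definitions to stay self-contained, and `DEFAULT_OUTPUT_FORMAT` and `SpeechDriverConfig` are hypothetical names; only the default value comes from the diff.

```python
from dataclasses import dataclass, field

# Hypothetical module-level constant holding the default shown in the diff above.
DEFAULT_OUTPUT_FORMAT = "mp3_44100_128"


@dataclass
class SpeechDriverConfig:
    """Illustrative stand-in for the driver's fields, not the real driver class."""

    voice: str
    output_format: str = field(
        default=DEFAULT_OUTPUT_FORMAT, metadata={"serializable": True}
    )
```

The constant gives tests and other modules one name to reference, at the cost of an extra indirection when reading the field definition.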

What if computer could talk?

83a69c3

andrewfrench marked this pull request as draft April 23, 2024 16:48

andrewfrench and others added 8 commits April 23, 2024 09:57

Remove height and width from audio artifact

7805bf0

Return audio data in audio artifact

9c72f97

Add dummy audio generation driver for config defaults

02f0e88

Add tasks

cf81f16

Export tasks

bd17b35

the rest?

0cde770

Return the played artifact

c0572ef

Fixes, test fixes

d05e557

andrewfrench changed the title ~~Text to Audio Driver~~ Text to Audio Generation Apr 23, 2024

andrewfrench marked this pull request as ready for review April 23, 2024 17:54

dylanholmes previously approved these changes Apr 24, 2024

andrewfrench force-pushed the french/240423/text-to-audio branch from 49295bb to 29cd347 April 24, 2024 22:12

audio artifact unit tests

41f8f0c

andrewfrench dismissed dylanholmes’s stale review via 41f8f0c April 24, 2024 22:13

andrewfrench force-pushed the french/240423/text-to-audio branch from 29cd347 to 41f8f0c April 24, 2024 22:13

TextToSpeechClient implementation

a973dc7

andrewfrench marked this pull request as draft May 9, 2024 14:12

andrewfrench added 6 commits May 14, 2024 16:10

Renamings, tests, docs

7b85d66

Merge branch 'dev' into french/240423/text-to-audio

cfa71d8

poetry lock --no-update

19f5997

Fix tests

9a97999

Fix docs

f42681e

Add OpenAI TTS driver

bbd3319

andrewfrench changed the title ~~Text to Audio Generation~~ Text to Speech Support May 15, 2024

andrewfrench marked this pull request as ready for review May 15, 2024 00:45

andrewfrench requested review from dylanholmes and a team May 15, 2024 00:45

andrewfrench added 4 commits May 14, 2024 17:49

ElevenLabs as optional dependency

544e76a

Merge branch 'dev' into french/240423/text-to-audio

ebd4a35

poetry lock --no-update

4abc157

Rename TextToSpeechEvents, remove negative prompts

7352233

zachgiordano previously approved these changes May 15, 2024

vachillo previously approved these changes May 15, 2024

Wire up Eleven Labs key for integration tests

6fb3001

andrewfrench dismissed stale reviews from vachillo and zachgiordano via 6fb3001 May 15, 2024 20:34

andrewfrench requested review from vachillo and zachgiordano May 15, 2024 20:34

zachgiordano approved these changes May 15, 2024

vachillo approved these changes May 15, 2024

andrewfrench merged commit 44a2c62 into dev May 15, 2024
9 checks passed

andrewfrench deleted the french/240423/text-to-audio branch May 15, 2024 20:44

hkhajgiwale pushed a commit to hkhajgiwale/griptape that referenced this pull request May 25, 2024

Text to Speech Support (griptape-ai#755)

c98b26d