Text to Speech Support #755

Merged
merged 22 commits into from
May 15, 2024

andrewfrench
Member

@andrewfrench andrewfrench commented Apr 23, 2024

Introduces support for Text to Speech workloads. For example:

from griptape.drivers import OpenAiTextToSpeechDriver
from griptape.engines import TextToSpeechEngine
from griptape.structures import Agent
from griptape.tools.text_to_speech_client.tool import TextToSpeechClient
from griptape.utils import Chat


agent = Agent(tools=[
    TextToSpeechClient(
        output_dir="audio_out",
        engine=TextToSpeechEngine(
            text_to_speech_driver=OpenAiTextToSpeechDriver(),
        ),
    ),
])

Chat(agent).start()

todos:

  • TextToSpeechClient implementation
  • TextToSpeechClient documentation
  • Driver documentation
  • Engine documentation
  • AudioArtifact documentation
  • ElevenLabs SDK as an optional dependency
  • Generalized MediaArtifactFileOutputMixin

@andrewfrench andrewfrench marked this pull request as draft April 23, 2024 16:48
@andrewfrench andrewfrench changed the title Text to Audio Driver Text to Audio Generation Apr 23, 2024
@andrewfrench andrewfrench marked this pull request as ready for review April 23, 2024 17:54
dylanholmes
dylanholmes previously approved these changes Apr 24, 2024
Contributor

@dylanholmes dylanholmes left a comment

Great job!

The comments I've added are mostly "food for thought"

pyproject.toml Outdated
@@ -57,6 +57,7 @@ pandas = {version = "^1.3", optional = true}
pypdf = {version = "^3.9", optional = true}
pillow = {version = "^10.2.0", optional = true}
mail-parser = {version = "^3.15.0", optional = true}
elevenlabs = "^1.1.2"
Contributor

Should this be an optional dependency?
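
For context on the question above: a common pattern for optional dependencies is to defer the import until the feature is actually used, so that a missing package surfaces as an actionable error rather than failing at package import time. A minimal sketch, assuming a hypothetical helper named `import_optional` and a hypothetical extra name (neither is taken from this PR):

```python
import importlib
from types import ModuleType


def import_optional(name: str, extra: str) -> ModuleType:
    """Import `name` lazily. Hypothetical helper, not part of this PR."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Re-raise with a hint about which extra provides the package.
        raise ImportError(
            f"'{name}' is required for this feature. "
            f"Install it with the '{extra}' extra."
        ) from exc
```

A driver could then call something like `import_optional("elevenlabs", "drivers")` inside the method that needs the SDK, leaving the base install free of it.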

Comment on lines 6 to 7
def play_audio(artifact: AudioArtifact) -> AudioArtifact:
    elevenlabs.play(artifact.value)
Contributor

Does it matter at all what format AudioArtifact.value is in? (Or are we OK with relying on elevenlabs to throw a runtime error?)

Member Author

Realistically, we shouldn't rely on the Eleven Labs SDK to play audio; that's just a convenience for demo purposes, and this should be reworked before approval/merge. We should expect to receive audio data in formats common enough that we can play it with common Python/OS utilities.
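
As an illustration of the format question in this thread, the container of raw audio bytes can often be guessed from its leading "magic" bytes before handing it to a player. This is only a sketch: the function name `sniff_audio_format` is hypothetical, it covers just a few common formats, and it is not how the PR handles playback.

```python
from typing import Optional


def sniff_audio_format(data: bytes) -> Optional[str]:
    """Guess the audio format from leading magic bytes; None if unknown."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:3] == b"ID3":
        # MP3 file that starts with an ID3v2 tag.
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        # MPEG audio frame sync (11 set bits). Heuristic only: ADTS AAC
        # streams begin with a similar sync pattern and would also match.
        return "mp3"
    return None
```

A check like this would let a playback utility reject unsupported data with a clear message instead of relying on a third-party SDK to raise at runtime.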

-class ImageArtifactFileOutputMixin:
+class MediaArtifactFileOutputMixin:
Contributor

If this mixin just makes it easier to write bytes to a file, then why not generalize all the way to BlobArtifactFileOutputMixin? (Or a FileOutputMixin that takes bytes in the write method.)

Member Author

I think we'll want to accept some sort of artifact here because we fall back to the artifact name as the output filename if one isn't provided (if output_dir is set, we might expect multiple artifacts to end up there). Agreed that there's no reason to limit ourselves to MediaArtifacts, though.
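
The fallback described in this reply could look roughly like the sketch below. All names here (`NamedBlob`, `FileOutputMixin`, `write_to_file`) are hypothetical stand-ins chosen for illustration; the mixin in the PR may differ in both naming and behavior.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class NamedBlob:
    """Hypothetical stand-in for an artifact with a name and raw bytes."""
    name: str
    value: bytes


@dataclass
class FileOutputMixin:
    """Hypothetical mixin: write to output_file, else output_dir/<artifact name>."""
    output_dir: Optional[str] = None
    output_file: Optional[str] = None

    def write_to_file(self, artifact: NamedBlob) -> Optional[str]:
        if self.output_file is not None:
            path = self.output_file
        elif self.output_dir is not None:
            # Fall back to the artifact's own name so that several
            # artifacts can land in the same directory.
            path = os.path.join(self.output_dir, artifact.name)
        else:
            return None  # nothing configured, so nothing is written
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(artifact.value)
        return path
```

Because the method only needs a name and bytes, nothing in this shape ties it to media artifacts specifically, which is the point both commenters agree on.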

@collindutter
Member

Nice work but...docs. There's no escaping them now 😄

@andrewfrench andrewfrench marked this pull request as draft May 9, 2024 14:12
@andrewfrench andrewfrench changed the title Text to Audio Generation Text to Speech Support May 15, 2024
@andrewfrench andrewfrench marked this pull request as ready for review May 15, 2024 00:45
@andrewfrench andrewfrench requested review from dylanholmes and a team May 15, 2024 00:45
zachgiordano
zachgiordano previously approved these changes May 15, 2024
vachillo
vachillo previously approved these changes May 15, 2024
metadata={"serializable": True},
)
voice: str = field(kw_only=True, metadata={"serializable": True})
output_format: str = field(default="mp3_44100_128", kw_only=True, metadata={"serializable": True})
Member

nit: maybe move this default to a top level constant?

Member

not sure what the guideline is for inline defaults vs top-level constants. seems like it's done both ways.
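
For comparison, the top-level-constant variant suggested in the nit might look like the sketch below. It uses the standard library's `dataclasses` as a stand-in for the attrs-style `field(...)` in the diff, and the class name `SpeechDriverConfig` and constant name `DEFAULT_OUTPUT_FORMAT` are hypothetical.

```python
from dataclasses import dataclass, field

# Module-level constant: one place to change, and importable by callers and tests.
DEFAULT_OUTPUT_FORMAT = "mp3_44100_128"


@dataclass
class SpeechDriverConfig:
    """Hypothetical stand-in for the driver in the diff (which uses attrs)."""
    voice: str = field(metadata={"serializable": True})
    output_format: str = field(
        default=DEFAULT_OUTPUT_FORMAT, metadata={"serializable": True}
    )
```

The behavior is the same either way; the constant mainly helps when the default is referenced from more than one place.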

@andrewfrench andrewfrench merged commit 44a2c62 into dev May 15, 2024
9 checks passed
@andrewfrench andrewfrench deleted the french/240423/text-to-audio branch May 15, 2024 20:44
hkhajgiwale pushed a commit to hkhajgiwale/griptape that referenced this pull request May 25, 2024
5 participants