Post

speech to text e2e usecase API

speech to text e2e usecase API

overview

I’ve set up a clean architecture for the ai repository and containerized it. this pipeline is deployed on a remote server ready for action.


background

  • requirements

    we need to develop a speech API that allows users to communicate with golem NPCs. in a Unity-based game, users activate their microphones locally, send the recorded audio to the server, and receive NPC responses in text form.

  • priorities

    1. first, complete the speech transcript to text-to-text pipeline.
    2. ensure smooth API communication on the remote server.
    3. focus on reducing latency as much as possible.
  • secondary priorities

    • prompt tuning
    • enhance security with the WSS protocol

purpose

to build a speech-to-text and text-to-text API that operates reliably on a remote server.

environment setup

  • managing dependencies with uv and pyproject.toml
  • server IP address: http://44.210.134.73:8000/docs
  • api endpoint: ws://44.210.134.73:8000/api/v1/ws/speech/v1
  • we handle dependencies with clean architecture through container implementation and injection.

implementation process

design

  • architecture: DDD + clean architecture
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
demo/src/ai
├── datasets # testfile
├── docker-compose.yaml
├── Dockerfile
├── interaction
│   ├── core
│   │   ├── components #llm and speech component
│   │   ├── di
│   │   │   ├── config.py
│   │   │   └── container.py
│   │   ├── domain
│   │   │   └── usecase
│   │   ├── infra
│   │   │   ├── model_configs.py
│   │   │   └── model_router.py #litellm Router
│   │   └── utils
│   ├── server
│   │   ├── app.py
│   │   ├── core
│   │   │   └── session_manager.py #for ws session managing
│   │   ├── dto
│   │   │   └── speech.py
│   │   ├── router
│   │   │   └── speech
│   │   │       └── v1.py
│   │   └── tests # test server files
│   │       └── test_websocket_speech.py
│   ├── speech
│   │   ├── components
│   │   │   ├── speech_to_text
│   │   │   │   └── llm_speech_to_text_v1.py
│   │   │   └── text_to_text
│   │   │       └── llm_text_to_text_v1.py
│   │   ├── di
│   │   │   └── container.py
│   │   ├── domain
│   │   │   ├── ports
│   │   │   │   ├── speech_to_text.py
│   │   │   │   └── text_to_text.py
│   │   │   └── usecases
│   │   │       └── generate_conversation_response.py
│   │   ├── main.py
│   │   ├── prompts
│   │   │   └── text_to_text_v1.py
│   │   └── tests
│   │       ├── test_speech_to_text.py
│   │       └── test.py
│   └── text
├── Makefile
├── pyproject.toml
└── uv.lock

implementing the domain-driven design approach with a focus on clean architecture principles ensures robust application structure and maintainability.

diagram is as follows:

1st_image

flowchart LR
    subgraph Client
        Mic[Microphone / Unity Client]
        VAD[VAD & Chunking]
        WS[WebSocket Message]
    end
    
    subgraph Server
        Router[/interaction/server/router/speech/v1.py/]
        SessionMgr[/interaction/server/core/session_manager.py/]
        UseCase[/interaction/speech/domain/usecases/generate_conversation_response.py/]
        STT[/interaction/speech/domain/ports/speech_to_text.py<br/>+ adapters/]
        TTT[/interaction/speech/domain/ports/text_to_text.py<br/>+ adapters/]
    end
    
    Mic --> VAD
    VAD --> WS
    WS --> Router
    Router -->|SESSION_START / AUDIO_CHUNK / SESSION_END| SessionMgr
    SessionMgr -->|"assembled audio (BytesIO)"| UseCase
    UseCase -->|transcribe_user_audio_to_text| STT
    STT -->|transcription text| UseCase
    UseCase -->|create_response_from_user_audio_text| TTT
    TTT -->|LLM response text| UseCase
    UseCase -->|"{'transcription','response'}"| Router
    Router -->|PROCESSING / RESULT / ERROR| WS
    WS --> Client

result

  • splitting an audio file /datasets/test.wav into chunks and requesting through the API gives expected results. logging indicates successful test completion.

pr

https://github.com/ob1hnk/Triolingo/pull/1


troubleshooting

  • encountered issues with connecting the speech component to the router through litellm, which was resolved by aligning the implementation with existing components for consistency.
1
2
3
4
5
6
7
8
9
10
11
12
13
class SpeechLLMComponent:
    """
    speech-to-text component using Router
    """
    def __init__(self, router: Router, prompt_path: str = ""):
        """
        Args:
            router: LiteLLM Router instance
            prompt_path: path to prompt file
        """
        self.prompt_path = prompt_path
        self.router = router
        logger.info(f"SpeechLLMComponent initialized with prompt_path: {prompt_path}")
This post is licensed under CC BY 4.0 by the author.