yellowsubmarine372

Speech-to-Text E2E Use Case API

overview

I set up a clean architecture for the AI repository and containerized it. The pipeline is deployed on a remote server and ready to use.


background

purpose

To build a speech-to-text and text-to-text API that operates reliably on a remote server.

environment setup


implementation process

design

demo/src/ai
├── datasets # test files
├── docker-compose.yaml
├── Dockerfile
├── interaction
│   ├── core
│   │   ├── components # LLM and speech components
│   │   ├── di
│   │   │   ├── config.py
│   │   │   └── container.py
│   │   ├── domain
│   │   │   └── usecase
│   │   ├── infra
│   │   │   ├── model_configs.py
│   │   │   └── model_router.py # LiteLLM Router
│   │   └── utils
│   ├── server
│   │   ├── app.py
│   │   ├── core
│   │   │   └── session_manager.py # WebSocket session management
│   │   ├── dto
│   │   │   └── speech.py
│   │   ├── router
│   │   │   └── speech
│   │   │       └── v1.py
│   │   └── tests # test server files
│   │       └── test_websocket_speech.py
│   ├── speech
│   │   ├── components
│   │   │   ├── speech_to_text
│   │   │   │   └── llm_speech_to_text_v1.py
│   │   │   └── text_to_text
│   │   │       └── llm_text_to_text_v1.py
│   │   ├── di
│   │   │   └── container.py
│   │   ├── domain
│   │   │   ├── ports
│   │   │   │   ├── speech_to_text.py
│   │   │   │   └── text_to_text.py
│   │   │   └── usecases
│   │   │       └── generate_conversation_response.py
│   │   ├── main.py
│   │   ├── prompts
│   │   │   └── text_to_text_v1.py
│   │   └── tests
│   │       ├── test_speech_to_text.py
│   │       └── test.py
│   └── text
├── Makefile
├── pyproject.toml
└── uv.lock

The implementation follows a domain-driven design approach with clean architecture principles: the domain layer defines ports (interfaces) and use cases, while the components act as adapters, which keeps the structure testable and maintainable.
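To make the layering concrete, here is a minimal sketch of the two ports and the use case that orchestrates them, mirroring `domain/ports/speech_to_text.py`, `domain/ports/text_to_text.py`, and `usecases/generate_conversation_response.py`. The method names are taken from the diagram labels below, but the class bodies are illustrative assumptions, not the repo's actual code.

```python
from abc import ABC, abstractmethod
from io import BytesIO


class SpeechToTextPort(ABC):
    """Port: turns user audio into text (domain/ports/speech_to_text.py)."""
    @abstractmethod
    def transcribe_user_audio_to_text(self, audio: BytesIO) -> str: ...


class TextToTextPort(ABC):
    """Port: turns transcribed text into an LLM reply (domain/ports/text_to_text.py)."""
    @abstractmethod
    def create_response_from_user_audio_text(self, text: str) -> str: ...


class GenerateConversationResponse:
    """Use case: STT then TTT, returning both results to the router layer."""
    def __init__(self, stt: SpeechToTextPort, ttt: TextToTextPort):
        self.stt = stt
        self.ttt = ttt

    def __call__(self, audio: BytesIO) -> dict:
        transcription = self.stt.transcribe_user_audio_to_text(audio)
        response = self.ttt.create_response_from_user_audio_text(transcription)
        return {"transcription": transcription, "response": response}


# Fake adapters make the use case testable without any model calls.
class EchoSTT(SpeechToTextPort):
    def transcribe_user_audio_to_text(self, audio: BytesIO) -> str:
        return audio.getvalue().decode()


class EchoTTT(TextToTextPort):
    def create_response_from_user_audio_text(self, text: str) -> str:
        return f"reply to: {text}"


usecase = GenerateConversationResponse(EchoSTT(), EchoTTT())
print(usecase(BytesIO(b"hi")))
```

Because the use case depends only on the ports, the real LiteLLM-backed components can be swapped in via the DI container without touching the domain layer.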

The diagram is as follows:


flowchart LR
    subgraph Client
        Mic[Microphone / Unity Client]
        VAD[VAD & Chunking]
        WS[WebSocket Message]
    end

    subgraph Server
        Router[/interaction/server/router/speech/v1.py/]
        SessionMgr[/interaction/server/core/session_manager.py/]
        UseCase[/interaction/speech/domain/usecases/generate_conversation_response.py/]
        STT[/interaction/speech/domain/ports/speech_to_text.py<br/>+ adapters/]
        TTT[/interaction/speech/domain/ports/text_to_text.py<br/>+ adapters/]
    end

    Mic --> VAD
    VAD --> WS
    WS --> Router
    Router -->|SESSION_START / AUDIO_CHUNK / SESSION_END| SessionMgr
    SessionMgr -->|"assembled audio (BytesIO)"| UseCase
    UseCase -->|transcribe_user_audio_to_text| STT
    STT -->|transcription text| UseCase
    UseCase -->|create_response_from_user_audio_text| TTT
    TTT -->|LLM response text| UseCase
    UseCase -->|"{'transcription', 'response'}"| Router
    Router -->|PROCESSING / RESULT / ERROR| WS
    WS --> Client

result

pr

https://github.com/ob1hnk/Triolingo/pull/1


troubleshooting

import logging

from litellm import Router

logger = logging.getLogger(__name__)


class SpeechLLMComponent:
    """
    Speech-to-text component using the LiteLLM Router.
    """
    def __init__(self, router: Router, prompt_path: str = ""):
        """
        Args:
            router: LiteLLM Router instance
            prompt_path: path to the prompt file
        """
        self.prompt_path = prompt_path
        self.router = router
        logger.info(f"SpeechLLMComponent initialized with prompt_path: {prompt_path}")
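Because the component takes the router as a constructor argument, it can be unit-tested with a stub instead of a real LiteLLM Router. A hypothetical sketch of what a transcribe call on such a component might look like; the `transcription(...)` signature and the result's `"text"` field are assumptions here, not verified LiteLLM API:

```python
from io import BytesIO
from typing import Protocol


class TranscriptionRouter(Protocol):
    """Duck-typed stand-in for the injected router (assumed interface)."""
    def transcription(self, model: str, file: BytesIO) -> dict: ...


class SpeechToTextSketch:
    def __init__(self, router: TranscriptionRouter, model: str = "stt-model"):
        self.router = router
        self.model = model

    def transcribe(self, audio: BytesIO) -> str:
        # Delegate to the router and pull the text out of the result.
        result = self.router.transcription(model=self.model, file=audio)
        return result["text"]


class FakeRouter:
    """Test double: returns a canned transcription, no network calls."""
    def transcription(self, model: str, file: BytesIO) -> dict:
        return {"text": "hello world"}


print(SpeechToTextSketch(FakeRouter()).transcribe(BytesIO(b"\x00\x01")))
```

This is the payoff of the port/adapter split above: the server tests in `tests/test_websocket_speech.py` can run without any model backend.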