Fix clip_timestamps format, Windows UTF-8 console, and non-dict LLM response #4

Open
Ahmed-Ezzat20 wants to merge 1 commit into bakrianoo:master from Ahmed-Ezzat20:fix/stt-and-cli-bugs

@Ahmed-Ezzat20

Contributor

Summary

Three bug fixes discovered during end-to-end testing of the pipeline:

1. clip_timestamps format bug (faster-whisper backend)

_transcribe_faster_whisper() passes speech_clips_sec as a list of dicts ([{"start": 0.5, "end": 3.2}, ...]) to clip_timestamps, but BatchedInferencePipeline.transcribe() expects a flat list of seconds ([0.5, 3.2, ...]). This causes a TypeError at runtime.

Fix: Convert to flat [start1, end1, start2, end2, ...] format.
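
A minimal sketch of the conversion described above, assuming each clip is a dict with numeric "start" and "end" keys in seconds. The helper name flatten_clip_timestamps is hypothetical and is not taken from the diff; whether the flat form is required depends on the installed faster-whisper version, as stated in this PR.

```python
from typing import Dict, Iterable, List


def flatten_clip_timestamps(clips: Iterable[Dict[str, float]]) -> List[float]:
    """Turn [{"start": s1, "end": e1}, ...] into [s1, e1, s2, e2, ...].

    Hypothetical helper for illustration; the key names "start" and "end"
    follow the example in the PR description.
    """
    flat: List[float] = []
    for clip in clips:
        # Keep each start/end pair adjacent so the pairing survives flattening.
        flat.extend((float(clip["start"]), float(clip["end"])))
    return flat


speech_clips_sec = [{"start": 0.5, "end": 3.2}, {"start": 4.0, "end": 7.5}]
print(flatten_clip_timestamps(speech_clips_sec))  # [0.5, 3.2, 4.0, 7.5]
```

An empty input yields an empty list, so the caller can still decide whether to pass clip_timestamps at all.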

2. Windows console UnicodeEncodeError

On Windows, sys.stdout can default to a legacy code page such as cp1252, which cannot represent Arabic, CJK, or other non-Latin characters. Any print() or log message containing a non-Latin project slug (e.g. مين-هو-مستر-عزت) then crashes with UnicodeEncodeError.

Fix: Reconfigure stdout/stderr to UTF-8 with errors="replace" at CLI entry point, guarded by sys.platform == "win32".
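
A sketch of what such a guard could look like, assuming Python 3.7+ where text streams expose reconfigure(). The function name force_utf8_console and its streams/platform parameters are hypothetical, added only so the behaviour can be exercised off Windows; the actual change in this PR may be inlined at the CLI entry point.

```python
import sys


def force_utf8_console(streams=None, platform=None):
    """Switch the given text streams to UTF-8 on Windows; no-op elsewhere.

    Hypothetical helper: `streams` and `platform` default to the real
    sys.stdout/sys.stderr and sys.platform. Returns True if the Windows
    branch ran, False otherwise.
    """
    platform = sys.platform if platform is None else platform
    if platform != "win32":
        return False
    if streams is None:
        streams = (sys.stdout, sys.stderr)
    for stream in streams:
        # Replaced or captured streams (e.g. io.StringIO) lack reconfigure().
        reconfigure = getattr(stream, "reconfigure", None)
        if callable(reconfigure):
            # errors="replace" swaps unencodable characters instead of raising.
            reconfigure(encoding="utf-8", errors="replace")
    return True


if __name__ == "__main__":
    force_utf8_console()
    print("مين-هو-مستر-عزت")
```

Calling it once before any output is written keeps the change local to the CLI and leaves library imports unaffected.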

3. describe.py crash on non-dict LLM response

json_repair.loads() can return a list, string, or None when small/local LLMs (e.g. Ollama qwen3:4b) produce malformed JSON. The subsequent .get() call crashes with AttributeError: 'list' object has no attribute 'get'.

Fix: Guard with isinstance(description, dict) check, wrapping non-dict responses into the expected schema.
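
A rough sketch of the guard, assuming the expected schema is a dict with a single free-text field. The helper name coerce_description and the key "description" are hypothetical; the real schema in describe.py may have different or additional keys. The parsed value stands in for whatever json_repair.loads() returned.

```python
from typing import Any, Dict


def coerce_description(parsed: Any, text_key: str = "description") -> Dict[str, Any]:
    """Return a dict no matter what shape the parsed LLM output has.

    Hypothetical helper: `text_key` is an assumed field name, not the
    project's confirmed schema.
    """
    if isinstance(parsed, dict):
        return parsed  # already the expected shape
    if parsed is None:
        return {text_key: ""}
    if isinstance(parsed, list):
        # Join list items into one text field rather than dropping them.
        return {text_key: " ".join(str(item) for item in parsed)}
    return {text_key: str(parsed)}


print(coerce_description({"description": "a cat"}))  # {'description': 'a cat'}
print(coerce_description(["a", "b"]))                # {'description': 'a b'}
print(coerce_description(None))                      # {'description': ''}
```

With this in place, later .get() calls always operate on a dict, so the AttributeError described above cannot occur at that call site.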

Test plan

  • Verified all modified modules import correctly
  • Test faster-whisper transcription with --method faster-whisper (requires GPU or CPU with the model)
  • Test pipeline with non-Latin project names on Windows
  • Test describe stage with a small local LLM (e.g. Ollama qwen3:4b)

…esponse

- faster-whisper: convert clip_timestamps to flat list of seconds
  [start1, end1, ...] instead of list of dicts, fixing TypeError
  in BatchedInferencePipeline.transcribe()
- cli: reconfigure stdout/stderr to UTF-8 on Windows to prevent
  UnicodeEncodeError on non-Latin project names
- describe: guard against json_repair.loads() returning a non-dict
  (e.g. list or string) from small/local LLMs
Ahmed-Ezzat20 force-pushed the fix/stt-and-cli-bugs branch from 4f62228 to 33e5741 on April 9, 2026 00:17