For English:
whisper-ctranslate2 --model large-v3 --model_dir models --language English --device auto --output_format all --pretty_json True "input.mp4"
For German:
whisper-ctranslate2 --model large-v3 --model_dir models --language German --device cpu --output_format all --pretty_json True "input.mp4"
For Portuguese:
whisper-ctranslate2 --model large-v3 --model_dir models --language Portuguese --device cpu --output_format all --pretty_json True "input.mp4"
Other languages (as of 8 August 2025):
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
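The commands above differ only in the values passed to --language and --device, so they can be generated from one template. The following is a minimal Python sketch under that assumption; the helper name build_command is hypothetical, and it only uses the flags shown above. It prints the commands rather than running them.

```python
import shlex


def build_command(audio_file, language, device="auto"):
    """Return the argument list for one transcription run (hypothetical helper)."""
    return [
        "whisper-ctranslate2",
        "--model", "large-v3",
        "--model_dir", "models",
        "--language", language,   # a code such as "de" or a name such as "German"
        "--device", device,       # auto, cpu or cuda
        "--output_format", "all",
        "--pretty_json", "True",
        audio_file,
    ]


if __name__ == "__main__":
    runs = [("English", "auto"), ("German", "cpu"), ("Portuguese", "cpu")]
    for language, device in runs:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(build_command("input.mp4", language, device)))
```

To actually start a run, the list can be handed to subprocess.run(cmd, check=True), provided whisper-ctranslate2 is installed and on the PATH.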
Calling the help function:
whisper-ctranslate2 --help
usage: whisper-ctranslate2 [-h]
[--model {tiny,tiny.en,base,base.en,small,small.en,medium,medium.en,large-v1,large-v2,large-v3,distil-large-v2,distil-large-v3,distil-medium.en,distil-small.en}]
[--model_directory MODEL_DIRECTORY]
[--model_dir MODEL_DIR]
[--local_files_only LOCAL_FILES_ONLY]
[--output_dir OUTPUT_DIR]
[--output_format {txt,vtt,srt,tsv,json,all}]
[--pretty_json PRETTY_JSON]
[--print_colors PRINT_COLORS] [--verbose VERBOSE]
[--highlight_words HIGHLIGHT_WORDS]
[--max_line_width MAX_LINE_WIDTH]
[--max_line_count MAX_LINE_COUNT]
[--max_words_per_line MAX_WORDS_PER_LINE]
[--device {auto,cpu,cuda}] [--threads THREADS]
[--device_index DEVICE_INDEX]
[--compute_type {default,auto,int8,int8_float16,int8_bfloat16,int8_float32,int16,float16,float32,bfloat16}]
[--task {transcribe,translate}]
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
[--temperature TEMPERATURE]
[--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE]
[--best_of BEST_OF] [--beam_size BEAM_SIZE]
[--patience PATIENCE]
[--length_penalty LENGTH_PENALTY]
[--suppress_blank SUPPRESS_BLANK]
[--suppress_tokens SUPPRESS_TOKENS]
[--initial_prompt INITIAL_PROMPT]
[--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
[--logprob_threshold LOGPROB_THRESHOLD]
[--no_speech_threshold NO_SPEECH_THRESHOLD]
[--word_timestamps WORD_TIMESTAMPS]
[--prepend_punctuations PREPEND_PUNCTUATIONS]
[--append_punctuations APPEND_PUNCTUATIONS]
[--repetition_penalty REPETITION_PENALTY]
[--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE]
[--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
[--vad_filter VAD_FILTER]
[--vad_threshold VAD_THRESHOLD]
[--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS]
[--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S]
[--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS]
[--version] [--hf_token HF_TOKEN]
[--speaker_name SPEAKER_NAME]
[--live_transcribe LIVE_TRANSCRIBE]
[--live_volume_threshold LIVE_VOLUME_THRESHOLD]
[--live_input_device LIVE_INPUT_DEVICE]
positional arguments:
audio audio file(s) to transcribe (default: None)
options:
-h, --help show this help message and exit
--version show program's version number and exit
Model selection options:
--model {tiny,tiny.en,base,base.en,small,small.en,medium,medium.en,large-v1,large-v2,large-v3,distil-large-v2,distil-large-v3,distil-medium.en,distil-small.en}
name of the Whisper model to use (default: small)
--model_directory MODEL_DIRECTORY
directory where to find a CTranslate2 Whisper model
(e.g. fine-tuned model) (default: None)
Model caching control options:
--model_dir MODEL_DIR
the path to save model files; uses
~/.cache/huggingface/ by default (default: None)
--local_files_only LOCAL_FILES_ONLY
use models in cache without connecting to Internet to
check if there are newer versions (default: False)
Configuration options to control generated outputs:
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
format of the output file; if not specified, all
available formats will be produced (default: all)
--pretty_json PRETTY_JSON, -p PRETTY_JSON
produce json in a human readable format (default:
False)
--print_colors PRINT_COLORS
print the transcribed text using an experimental color
coding strategy to highlight words with high or low
confidence (default: False)
--verbose VERBOSE whether to print out the progress and debug messages
(default: True)
--highlight_words HIGHLIGHT_WORDS
underline each word as it is spoken in srt and vtt
output formats (requires --word_timestamps True)
(default: False)
--max_line_width MAX_LINE_WIDTH
the maximum number of characters in a line before
breaking the line in srt and vtt output formats
(requires --word_timestamps True) (default: None)
--max_line_count MAX_LINE_COUNT
the maximum number of lines in a segment in srt and
vtt output formats (requires --word_timestamps True)
(default: None)
--max_words_per_line MAX_WORDS_PER_LINE
(requires --word_timestamps True, no effect with
--max_line_width) the maximum number of words in a
segment (default: None)
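Several of the output options above only take effect together with --word_timestamps True. The following Python sketch shows one way to keep that dependency in a single place when building the argument list; the helper name subtitle_flags and the constant NEEDS_WORD_TIMESTAMPS are hypothetical, and the option names are taken from the help text above.

```python
# Options that the help text marks as "requires --word_timestamps True".
NEEDS_WORD_TIMESTAMPS = {
    "highlight_words",
    "max_line_width",
    "max_line_count",
    "max_words_per_line",
}


def subtitle_flags(**options):
    """Turn keyword options into CLI flags (hypothetical helper).

    If any option depends on word-level timestamps, --word_timestamps True
    is added automatically.
    """
    if NEEDS_WORD_TIMESTAMPS.intersection(options):
        options["word_timestamps"] = True
    flags = []
    for name, value in options.items():
        # Every option takes an explicit value, including booleans.
        flags += [f"--{name}", str(value)]
    return flags


if __name__ == "__main__":
    print(subtitle_flags(output_format="srt", max_line_width=42, max_line_count=2))
```

The returned list can be appended to a whisper-ctranslate2 command before the audio file name.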
Computing configuration options:
--device {auto,cpu,cuda}
device to use for CTranslate2 inference (default:
auto)
--threads THREADS number of threads used for CPU inference (default: 0)
--device_index DEVICE_INDEX
device ID where to place this model on (default: 0)
--compute_type {default,auto,int8,int8_float16,int8_bfloat16,int8_float32,int16,float16,float32,bfloat16}
Type of quantization to use (see
https://opennmt.net/CTranslate2/quantization.html)
(default: auto)
Algorithm execution options:
--task {transcribe,translate}
whether to perform X->X speech recognition
('transcribe') or X->English translation ('translate')
(default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
language spoken in the audio, specify None to perform
language detection (default: None)
--temperature TEMPERATURE
temperature to use for sampling (default: 0)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the
decoding fails to meet either of the thresholds below
(default: 0.2)
--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE
resets prompt if temperature is above this value. Arg
has effect only if condition_on_previous_text is True
(default: 0.5)
--best_of BEST_OF number of candidates when sampling with non-zero
temperature (default: 5)
--beam_size BEAM_SIZE
number of beams in beam search, only applicable when
temperature is zero (default: 5)
--patience PATIENCE optional patience value to use in beam decoding, as in
https://arxiv.org/abs/2204.05424, the default (1.0) is
equivalent to conventional beam search (default: 1.0)
--length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as
in https://arxiv.org/abs/1609.08144, uses simple
length normalization by default (default: 1.0)
--suppress_blank SUPPRESS_BLANK
suppress blank outputs at the beginning of the
sampling (default: True)
--suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during
sampling; '-1' will suppress most special characters
except common punctuations (default: -1)
--initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first
window. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a
prompt for the next window; disabling may make the
text inconsistent across windows, but the model
becomes less prone to getting stuck in a failure loop
(default: True)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this
value, treat the decoding as failed (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this
value, treat the decoding as failed (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher
than this value AND the decoding has failed due to
logprob_threshold, consider the segment as silence
(default: 0.6)
--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and
refine the results based on them (default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation
symbols with the next word (default: "'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation
symbols with the previous word (default:
"'.。,，!！?？:：”)]}、)
--repetition_penalty REPETITION_PENALTY
penalty applied to the score of previously generated
tokens (set > 1 to penalize) (default: 1.0)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
prevent repetitions of ngrams with this size (set 0 to
disable) (default: 0)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
When word_timestamps is True, skip silent periods
longer than this threshold (in seconds) when a
possible hallucination is detected (default: None)
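The temperature and threshold options above describe a fallback scheme: decoding starts at --temperature and is retried at higher temperatures while the result fails the compression-ratio or log-probability check. The Python sketch below illustrates those checks as the help text states them. It is not the tool's actual code: the function names are hypothetical, the upper temperature limit of 1.0 is an assumption, and zlib is used as a stand-in for the "gzip compression ratio" mentioned in the help.

```python
import zlib


def temperature_schedule(temperature=0.0, increment=0.2, limit=1.0):
    """Temperatures to try in order; the limit of 1.0 is an assumption."""
    steps = int((limit - temperature) / increment + 1e-6)
    return [round(temperature + i * increment, 10) for i in range(steps + 1)]


def compression_ratio(text):
    """Raw size divided by compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decoding_failed(text, avg_logprob,
                    compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """True if the result should be retried at the next temperature."""
    too_repetitive = compression_ratio(text) > compression_ratio_threshold
    too_unlikely = avg_logprob < logprob_threshold
    return too_repetitive or too_unlikely


def is_silence(avg_logprob, no_speech_prob,
               no_speech_threshold=0.6, logprob_threshold=-1.0):
    """True if the segment should be treated as silence."""
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold


if __name__ == "__main__":
    print(temperature_schedule())
    print(decoding_failed("la " * 200, avg_logprob=-0.1))
```

A looping transcript such as "la la la ..." compresses very well, so its ratio exceeds the threshold and the segment is decoded again at the next temperature in the schedule.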
VAD filter arguments:
--vad_filter VAD_FILTER
enable the voice activity detection (VAD) to filter
out parts of the audio without speech. This step is
using the Silero VAD model
https://github.com/snakers4/silero-vad. (default:
False)
--vad_threshold VAD_THRESHOLD
when vad_filter is enabled, probabilities above this
value are considered as speech. (default: None)
--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS
when vad_filter is enabled, final speech chunks
shorter min_speech_duration_ms are thrown out.
(default: None)
--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S
when vad_filter is enabled, Maximum duration of
speech chunks in seconds. Longer will be split at the
timestamp of the last silence. (default: None)
--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS
when vad_filter is enabled, in the end of each
speech chunk time to wait before separating it.
(default: None)
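To make the meaning of the VAD parameters above more concrete, here is a deliberately simplified Python sketch of how a threshold, a minimum speech duration and a minimum silence duration could turn per-frame speech probabilities into speech chunks. It is not the Silero VAD implementation: the function name speech_chunks, the frame length and all default values are assumptions chosen for illustration, and the maximum speech duration is left out.

```python
def speech_chunks(probs, frame_ms, threshold=0.5,
                  min_speech_ms=250, min_silence_ms=2000):
    """Return (start_ms, end_ms) chunks from per-frame speech probabilities."""
    chunks = []
    start = None   # start of the chunk currently being collected, in ms
    silence = 0    # length of the silence at the end of that chunk, in ms

    def close(end):
        # Keep the chunk only if it is long enough to count as speech.
        if end - start >= min_speech_ms:
            chunks.append((start, end))

    for i, p in enumerate(probs):
        t = i * frame_ms
        if p > threshold:            # frame counts as speech
            if start is None:
                start = t
            silence = 0
        elif start is not None:      # silent frame inside an open chunk
            silence += frame_ms
            if silence >= min_silence_ms:
                close(t + frame_ms - silence)
                start, silence = None, 0

    if start is not None:            # audio ended while a chunk was open
        close(len(probs) * frame_ms - silence)
    return chunks


if __name__ == "__main__":
    probs = [0.9] * 10 + [0.1] * 5 + [0.9] * 2 + [0.1] * 5
    print(speech_chunks(probs, frame_ms=100, min_speech_ms=300, min_silence_ms=300))
```

In this example the one-second stretch of speech at the start is kept, while the 200 ms burst later on is shorter than the minimum speech duration and is thrown out.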
Diarization options:
--hf_token HF_TOKEN HuggingFace token which enables to download the
diarization models. (default: )
--speaker_name SPEAKER_NAME
Name to use to identify the speaker (e.g. SPEAKER_00).
(default: SPEAKER)
Live transcribe options:
--live_transcribe LIVE_TRANSCRIBE
live transcribe mode (default: False)
--live_volume_threshold LIVE_VOLUME_THRESHOLD
minimum volume threshold to activate listening in live
transcribe mode (default: 0.2)
--live_input_device LIVE_INPUT_DEVICE
Set live stream input device ID (see python -m
sounddevice for a list) (default: None)