Category: whisper-ctranslate2

whisper-ctranslate2 is an efficient and powerful tool for the automatic transcription and optional translation of audio and video files.

  • whisper-ctranslate2: Installation on Linux Mint


    Installation on Linux:
    Check whether Python is installed:

    Enter in the terminal:
    python3 -V
    Possible output: Python 3.10.12

    faster-whisper works with version 3.10.12.
    As of 11 May 2024, the minimum requirement is "Python 3.8 or greater".
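The version check above can also be done from Python itself. The following is a minimal sketch, assuming the minimum of Python 3.8 quoted above still applies; the name `python_is_supported` is made up for this example.

```python
import sys

# Minimum version quoted in the text above (as of 11 May 2024); an assumption
# that may be outdated for newer releases.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {found}: {status}")
```

Saved under any file name and run with python3, it prints the interpreter version together with "OK" or "too old".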

    Install pip:
    Enter in the terminal:
    sudo apt install python3-pip

    Install faster-whisper:
    Enter in the terminal:
    pip install -U faster-whisper
    I cannot say for certain whether this step is really necessary. In any case, the whole setup works in the end.

    Install whisper-ctranslate2:
    Enter in the terminal:
    pip install -U whisper-ctranslate2

    Now restart the computer.
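After the restart, it can be useful to confirm that the command is actually reachable. The sketch below uses Python's standard `shutil.which` for this; the helper name `find_cli` is only illustrative. One possible reason the restart helps is that pip may place the script in a user directory such as ~/.local/bin, which some systems only add to PATH at login. That is an assumption, not something the steps above state.

```python
import shutil


def find_cli(name="whisper-ctranslate2"):
    """Return the full path of the command, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path:
        print(f"Found: {path}")
    else:
        # Assumption: a fresh login or new terminal may be needed so that
        # the directory containing the script ends up on PATH.
        print("whisper-ctranslate2 not found on PATH; "
              "try opening a new terminal or logging in again.")
```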

  • whisper-ctranslate2: Usage


    For English:

    whisper-ctranslate2 --model large-v3 --model_dir models --language English --device auto --output_format all --pretty_json True "input.mp4"

    For German:

    whisper-ctranslate2 --model large-v3 --model_dir models --language German --device cpu --output_format all --pretty_json True "input.mp4"

    For Portuguese:

    whisper-ctranslate2 --model large-v3 --model_dir models --language Portuguese --device cpu --output_format all --pretty_json True "input.mp4"
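The three calls above differ only in the language and the device. As a rough sketch, the command could be assembled in Python like this; `build_command` is a hypothetical helper, and it uses only the flags shown in the examples above.

```python
import shlex


def build_command(audio_file, language, device="cpu", model="large-v3",
                  model_dir="models"):
    """Return the argument list for one whisper-ctranslate2 call."""
    return [
        "whisper-ctranslate2",
        "--model", model,
        "--model_dir", model_dir,
        "--language", language,
        "--device", device,
        "--output_format", "all",
        "--pretty_json", "True",
        audio_file,
    ]


if __name__ == "__main__":
    cmd = build_command("input.mp4", "German")
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(cmd))
    # To run it (requires the tool to be installed), one option is:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.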


    Further languages (as of 8 August 2025):

    [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]


    Displaying the help:

    whisper-ctranslate2 --help

    usage: whisper-ctranslate2 [-h]
                               [--model {tiny,tiny.en,base,base.en,small,small.en,medium,medium.en,large-v1,large-v2,large-v3,distil-large-v2,distil-large-v3,distil-medium.en,distil-small.en}]
                               [--model_directory MODEL_DIRECTORY]
                               [--model_dir MODEL_DIR]
                               [--local_files_only LOCAL_FILES_ONLY]
                               [--output_dir OUTPUT_DIR]
                               [--output_format {txt,vtt,srt,tsv,json,all}]
                               [--pretty_json PRETTY_JSON]
                               [--print_colors PRINT_COLORS] [--verbose VERBOSE]
                               [--highlight_words HIGHLIGHT_WORDS]
                               [--max_line_width MAX_LINE_WIDTH]
                               [--max_line_count MAX_LINE_COUNT]
                               [--max_words_per_line MAX_WORDS_PER_LINE]
                               [--device {auto,cpu,cuda}] [--threads THREADS]
                               [--device_index DEVICE_INDEX]
                               [--compute_type {default,auto,int8,int8_float16,int8_bfloat16,int8_float32,int16,float16,float32,bfloat16}]
                               [--task {transcribe,translate}]
                               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
                               [--temperature TEMPERATURE]
                               [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
                               [--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE]
                               [--best_of BEST_OF] [--beam_size BEAM_SIZE]
                               [--patience PATIENCE]
                               [--length_penalty LENGTH_PENALTY]
                               [--suppress_blank SUPPRESS_BLANK]
                               [--suppress_tokens SUPPRESS_TOKENS]
                               [--initial_prompt INITIAL_PROMPT]
                               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
                               [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
                               [--logprob_threshold LOGPROB_THRESHOLD]
                               [--no_speech_threshold NO_SPEECH_THRESHOLD]
                               [--word_timestamps WORD_TIMESTAMPS]
                               [--prepend_punctuations PREPEND_PUNCTUATIONS]
                               [--append_punctuations APPEND_PUNCTUATIONS]
                               [--repetition_penalty REPETITION_PENALTY]
                               [--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE]
                               [--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
                               [--vad_filter VAD_FILTER]
                               [--vad_threshold VAD_THRESHOLD]
                               [--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS]
                               [--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S]
                               [--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS]
                               [--version] [--hf_token HF_TOKEN]
                               [--speaker_name SPEAKER_NAME]
                               [--live_transcribe LIVE_TRANSCRIBE]
                               [--live_volume_threshold LIVE_VOLUME_THRESHOLD]
                               [--live_input_device LIVE_INPUT_DEVICE]

    positional arguments:
      audio                 audio file(s) to transcribe (default: None)

    options:
      -h, --help            show this help message and exit
      --version             show program's version number and exit

    Model selection options:
      --model {tiny,tiny.en,base,base.en,small,small.en,medium,medium.en,large-v1,large-v2,large-v3,distil-large-v2,distil-large-v3,distil-medium.en,distil-small.en}
                            name of the Whisper model to use (default: small)
      --model_directory MODEL_DIRECTORY
                            directory where to find a CTranslate2 Whisper model
                            (e.g. fine-tuned model) (default: None)

    Model caching control options:
      --model_dir MODEL_DIR
                            the path to save model files; uses
                            ~/.cache/huggingface/ by default (default: None)
      --local_files_only LOCAL_FILES_ONLY
                            use models in cache without connecting to Internet to
                            check if there are newer versions (default: False)

    Configuration options to control generated outputs:
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            directory to save the outputs (default: .)
      --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                            format of the output file; if not specified, all
                            available formats will be produced (default: all)
      --pretty_json PRETTY_JSON, -p PRETTY_JSON
                            produce json in a human readable format (default:
                            False)
      --print_colors PRINT_COLORS
                            print the transcribed text using an experimental color
                            coding strategy to highlight words with high or low
                            confidence (default: False)
      --verbose VERBOSE     whether to print out the progress and debug messages
                            (default: True)
      --highlight_words HIGHLIGHT_WORDS
                            underline each word as it is spoken in srt and vtt
                            output formats (requires --word_timestamps True)
                            (default: False)
      --max_line_width MAX_LINE_WIDTH
                            the maximum number of characters in a line before
                            breaking the line in srt and vtt output formats
                            (requires --word_timestamps True) (default: None)
      --max_line_count MAX_LINE_COUNT
                            the maximum number of lines in a segment in srt and
                            vtt output formats (requires --word_timestamps True)
                            (default: None)
      --max_words_per_line MAX_WORDS_PER_LINE
                            (requires --word_timestamps True, no effect with
                            --max_line_width) the maximum number of words in a
                            segment (default: None)

    Computing configuration options:
      --device {auto,cpu,cuda}
                            device to use for CTranslate2 inference (default:
                            auto)
      --threads THREADS     number of threads used for CPU inference (default: 0)
      --device_index DEVICE_INDEX
                            device ID where to place this model on (default: 0)
      --compute_type {default,auto,int8,int8_float16,int8_bfloat16,int8_float32,int16,float16,float32,bfloat16}
                            Type of quantization to use (see
                            https://opennmt.net/CTranslate2/quantization.html)
                            (default: auto)

    Algorithm execution options:
      --task {transcribe,translate}
                            whether to perform X->X speech recognition
                            ('transcribe') or X->English translation ('translate')
                            (default: transcribe)
      --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
                            language spoken in the audio, specify None to perform
                            language detection (default: None)
      --temperature TEMPERATURE
                            temperature to use for sampling (default: 0)
      --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
                            temperature to increase when falling back when the
                            decoding fails to meet either of the thresholds below
                            (default: 0.2)
      --prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE
                            resets prompt if temperature is above this value. Arg
                            has effect only if condition_on_previous_text is True
                            (default: 0.5)
      --best_of BEST_OF     number of candidates when sampling with non-zero
                            temperature (default: 5)
      --beam_size BEAM_SIZE
                            number of beams in beam search, only applicable when
                            temperature is zero (default: 5)
      --patience PATIENCE   optional patience value to use in beam decoding, as in
                            https://arxiv.org/abs/2204.05424, the default (1.0) is
                            equivalent to conventional beam search (default: 1.0)
      --length_penalty LENGTH_PENALTY
                            optional token length penalty coefficient (alpha) as
                            in https://arxiv.org/abs/1609.08144, uses simple
                            length normalization by default (default: 1.0)
      --suppress_blank SUPPRESS_BLANK
                            suppress blank outputs at the beginning of the
                            sampling (default: True)
      --suppress_tokens SUPPRESS_TOKENS
                            comma-separated list of token ids to suppress during
                            sampling; '-1' will suppress most special characters
                            except common punctuations (default: -1)
      --initial_prompt INITIAL_PROMPT
                            optional text to provide as a prompt for the first
                            window. (default: None)
      --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
                            if True, provide the previous output of the model as a
                            prompt for the next window; disabling may make the
                            text inconsistent across windows, but the model
                            becomes less prone to getting stuck in a failure loop
                            (default: True)
      --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
                            if the gzip compression ratio is higher than this
                            value, treat the decoding as failed (default: 2.4)
      --logprob_threshold LOGPROB_THRESHOLD
                            if the average log probability is lower than this
                            value, treat the decoding as failed (default: -1.0)
      --no_speech_threshold NO_SPEECH_THRESHOLD
                            if the probability of the <|nospeech|> token is higher
                            than this value AND the decoding has failed due to
                            logprob_threshold, consider the segment as silence
                            (default: 0.6)
      --word_timestamps WORD_TIMESTAMPS
                            (experimental) extract word-level timestamps and
                            refine the results based on them (default: False)
      --prepend_punctuations PREPEND_PUNCTUATIONS
                            if word_timestamps is True, merge these punctuation
                            symbols with the next word (default: "'“¿([{-)
      --append_punctuations APPEND_PUNCTUATIONS
                            if word_timestamps is True, merge these punctuation
                            symbols with the previous word (default:
                            "'.。,，!！?？:：”)]}、)
      --repetition_penalty REPETITION_PENALTY
                            penalty applied to the score of previously generated
                            tokens (set > 1 to penalize) (default: 1.0)
      --no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
                            prevent repetitions of ngrams with this size (set 0 to
                            disable) (default: 0)
      --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
                            When word_timestamps is True, skip silent periods
                            longer than this threshold (in seconds) when a
                            possible hallucination is detected (default: None)

    VAD filter arguments:
      --vad_filter VAD_FILTER
                            enable the voice activity detection (VAD) to filter
                            out parts of the audio without speech. This step is
                            using the Silero VAD model
                            https://github.com/snakers4/silero-vad. (default:
                            False)
      --vad_threshold VAD_THRESHOLD
                            when vad_filter is enabled, probabilities above this
                            value are considered as speech. (default: None)
      --vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS
                            when vad_filter is enabled, final speech chunks
                            shorter min_speech_duration_ms are thrown out.
                            (default: None)
      --vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S
                            when vad_filter is enabled, Maximum duration of
                            speech chunks in seconds. Longer will be split at the
                            timestamp of the last silence. (default: None)
      --vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS
                            when vad_filter is enabled, in the end of each
                            speech chunk time to wait before separating it.
                            (default: None)

    Diarization options:
      --hf_token HF_TOKEN   HuggingFace token which enables to download the
                            diarization models. (default: )
      --speaker_name SPEAKER_NAME
                            Name to use to identify the speaker (e.g. SPEAKER_00).
                            (default: SPEAKER)

    Live transcribe options:
      --live_transcribe LIVE_TRANSCRIBE
                            live transcribe mode (default: False)
      --live_volume_threshold LIVE_VOLUME_THRESHOLD
                            minimum volume threshold to activate listening in live
                            transcribe mode (default: 0.2)
      --live_input_device LIVE_INPUT_DEVICE
                            Set live stream input device ID (see python -m
                            sounddevice for a list) (default: None)
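According to the help text above, the subtitle layout options (--highlight_words, --max_line_width, --max_line_count, --max_words_per_line) require --word_timestamps True, and --max_words_per_line has no effect together with --max_line_width. The following sketch shows one way to encode these rules when assembling extra arguments; `subtitle_options` is a hypothetical helper, not part of whisper-ctranslate2.

```python
def subtitle_options(max_line_width=None, max_line_count=None,
                     max_words_per_line=None, highlight_words=False):
    """Return extra CLI arguments for srt/vtt line formatting.

    Adds --word_timestamps True automatically, because the help text says
    the layout options require it.
    """
    if max_line_width is not None and max_words_per_line is not None:
        # Per the help text, this combination would silently ignore one option.
        raise ValueError(
            "--max_words_per_line has no effect with --max_line_width")

    args = []
    if max_line_width is not None:
        args += ["--max_line_width", str(max_line_width)]
    if max_line_count is not None:
        args += ["--max_line_count", str(max_line_count)]
    if max_words_per_line is not None:
        args += ["--max_words_per_line", str(max_words_per_line)]
    if highlight_words:
        args += ["--highlight_words", "True"]

    if args:
        args = ["--word_timestamps", "True"] + args
    return args


if __name__ == "__main__":
    extra = subtitle_options(max_line_width=42, max_line_count=2)
    print(" ".join(["whisper-ctranslate2", "--output_format", "srt"]
                   + extra + ["input.mp4"]))
```

The values 42 and 2 are arbitrary examples, not recommended settings.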