Google Speech Recognition API: timestamp for each word?

Question

It is possible to get a transcript of an audio file (WAV, MP3, etc.) with Google's Speech Recognition API by sending a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API gives me this:

    {
        u'alternative': [
            {u'transcript': u'12345'},
            {u'transcript': u'1 2 3 4 5'},
            {u'transcript': u'one two three four five'}
        ],
        u'final': True
    }

Question: is it possible to get the time (in seconds) at which each word was said?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

That is, the word "one" was said between 00:00:00.23 and 00:00:00.80, and the word "two" was said between 00:00:01.03 and 00:00:01.45.

PS: I'm looking for an API that supports languages other than English, especially French.

deweydb · Answer

I believe the other answer is now outdated. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets

Nikolay Shmyrev · Answer

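This is not possible with the Google API.

If you need word timestamps, you can use other APIs such as:

CMUSphinx - a free, offline speech recognition API

SpeechMatics - a SaaS speech recognition API

IBM's speech recognition API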

Ishmeet Kaur · Answer

Yes, it is very much possible. All you need to do is:

Set enable_word_time_offsets=True in the config:

    config = types.RecognitionConfig( .... enable_word_time_offsets=True)

Then, for each word in an alternative, you can print its start and end time, as in the following code:

    # `result` is the response object returned by the recognition call.
    for result in result.results:
        # The first alternative is the most likely transcription.
        alternative = result.alternatives[0]
        print(u'Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}'.format(alternative.confidence))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            # Offsets are given as seconds plus nanoseconds; combine them into float seconds.
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9))

This gives output in the following format:

    Transcript: Do you want me to give you a call back?
    Confidence: 0.949534416199
    Word: Do, start_time: 1466.0, end_time: 1466.6
    Word: you, start_time: 1466.6, end_time: 1466.7
    Word: want, start_time: 1466.7, end_time: 1466.8
    Word: me, start_time: 1466.8, end_time: 1466.9
    Word: to, start_time: 1466.9, end_time: 1467.1
    Word: give, start_time: 1467.1, end_time: 1467.2
    Word: you, start_time: 1467.2, end_time: 1467.3
    Word: a, start_time: 1467.3, end_time: 1467.4
    Word: call, start_time: 1467.4, end_time: 1467.6
    Word: back?, start_time: 1467.6, end_time: 1467.7

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets
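
To get from those word entries to the [word, start, end] list the question asks for, something like the following might work. It is only a sketch: the helper names (to_seconds, word_timings, fake_word) are made up for illustration, and the stand-in objects merely imitate the word / start_time / end_time fields used in this answer, so nothing here calls the real API. The timedelta branch is an assumption, included in case your client version exposes the offsets as datetime.timedelta rather than as seconds/nanos fields.

```python
from datetime import timedelta
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a time offset to float seconds.

    Accepts either an object with `seconds` and `nanos` fields (as in the
    answer above) or a datetime.timedelta (assumed alternative shape).
    """
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def word_timings(words):
    """Return [word, start, end] triples with times rounded to 2 decimals."""
    return [
        [w.word,
         round(to_seconds(w.start_time), 2),
         round(to_seconds(w.end_time), 2)]
        for w in words
    ]


def fake_word(word, start, end):
    """Hypothetical stand-in for one word entry; start/end are (seconds, nanos)."""
    return SimpleNamespace(
        word=word,
        start_time=SimpleNamespace(seconds=start[0], nanos=start[1]),
        end_time=SimpleNamespace(seconds=end[0], nanos=end[1]),
    )


# Values taken from the example in the question.
words = [
    fake_word("one", (0, 230000000), (0, 800000000)),
    fake_word("two", (1, 30000000), (1, 450000000)),
]
timings = word_timings(words)
print(timings)
```

With a real response you would pass alternative.words to word_timings instead of the stand-in list.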