Python: jiwer, a text-similarity library for speech recognition that computes word error rate (WER), match error rate (MER), and related metrics
1 jiwer
GitHub repository: https://github.com/jitsi/jiwer
jiwer is a Python library for measuring how similar the text produced by a speech recognizer is to the correct transcript. The metrics it supports are word error rate (WER), match error rate (MER), word information lost (WIL), and word information preserved (WIP).
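For example, the commonly used word error rate (WER) is computed as:
$$\mathrm{WER} = \frac{\mathrm{sub} + \mathrm{del} + \mathrm{ins}}{\mathrm{reference}}$$
Here reference is the number of words in the ground truth (the correct text), sub is the number of words that have to be substituted, del is the number of words that have to be deleted, and ins is the number of words that have to be inserted. Given two texts, pred and reference, WER describes the error rate of pred relative to the correct text reference: the number of substitutions, deletions, and insertions that separate pred from reference, divided by the length of reference. Note that for wer the unit is the word, not the individual character.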
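The ratio described above can be sketched in plain Python with a word-level edit distance. This is only an illustration of the definition, not jiwer's implementation; the function name `word_error_rate` is made up for this example, and words are assumed to be separated by whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


print(word_error_rate("hello world", "hello duck"))  # 0.5
```

One substituted word out of two reference words gives 0.5, the same value jiwer reports for this pair later in the article.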
For a more detailed comparison of WER, MER, and WIL, refer to this paper.
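As a rough guide to how the four measures relate, the sketch below derives them from the counts of an alignment: hits (correct words), substitutions, deletions, and insertions. The formulas are the commonly cited definitions, which I believe match what jiwer computes; the helper name `error_measures` is hypothetical, and it assumes that neither the reference nor the hypothesis is empty.

```python
def error_measures(hits: int, substitutions: int, deletions: int, insertions: int) -> dict:
    """Compute WER, MER, WIP and WIL from word-alignment counts."""
    reference_words = hits + substitutions + deletions
    hypothesis_words = hits + substitutions + insertions
    errors = substitutions + deletions + insertions

    wer = errors / reference_words           # errors relative to the reference length
    mer = errors / (hits + errors)           # errors relative to all aligned positions
    wip = (hits / reference_words) * (hits / hypothesis_words)
    wil = 1 - wip
    return {"wer": wer, "mer": mer, "wip": wip, "wil": wil}


# "hello world" vs "hello duck": one hit ("hello") and one substitution.
print(error_measures(hits=1, substitutions=1, deletions=0, insertions=0))
```

Because insertions appear in the denominator of MER but not of WER, MER never exceeds 1, whereas WER can when the hypothesis contains many extra words.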
1.1 Installing jiwer
With Python >= 3.6, install it with pip:
<code class="language-python line-numbers">pip install jiwer </code>
1.2 Using jiwer
1.2.1 Computing WER for English text
Example code:
<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import wer

if __name__ == '__main__':
    ground_truth = "hello world"
    hypothesis = "hello duck"

    # One substituted word out of two reference words
    error = wer(ground_truth, hypothesis)
    print(error)
</code>
Output:
<code class="line-numbers">0.5 </code>
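1.2.2 Computing WER for Chinese text
Judging from the English example above, jiwer computes the WER of English strings correctly. In my tests, however, when the input strings are Chinese, the WER is 1.0 whenever the two strings differ by even a single character. For example:
<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import wer

if __name__ == '__main__':
    ground_truth = "我想吃饭"
    hypothesis = "我想吃屎"

    # Neither string contains a space, so each one is a single "word"
    error = wer(ground_truth, hypothesis)
    print(error)
</code>
Output: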
<code class="line-numbers">1.0 </code>
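The cause is not character encoding but the way the text is split into words.
In English, hello and world are treated as separate words. In the English example above, the word duck in hypothesis is wrong and has to be replaced by world, so there is exactly one substitution. The reference "hello world" contains two words, so the WER is 1/2 = 0.5.
Chinese text has no spaces between words. If we keep using wer, "我想吃饭" and "我想吃屎" are each treated as one single word, so as soon as one character differs the whole sentence counts as wrong. That is why the WER always comes out as 1.0.
For Chinese strings, therefore, use cer (character error rate, which treats every Chinese character as one character) to measure the error rate between the two strings: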
<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import cer

if __name__ == '__main__':
    ground_truth = "我想吃饭"
    hypothesis = "我想吃屎"

    # One wrong character out of four reference characters
    error = cer(ground_truth, hypothesis)
    print(error)
</code>
Output:
<code class="line-numbers">0.25 </code>
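The difference between the two results comes down to tokenization, which can be seen without jiwer. The sketch below uses a similar pair of four-character strings that differ in their last character. `mismatch_rate` is a hypothetical helper that simply counts differing positions; it is a simplification that only stands in for an edit-distance-based rate when both sequences have the same length.

```python
reference = "我想吃饭"
hypothesis = "我想吃面"

# str.split() splits on whitespace, so an unsegmented Chinese sentence is one token.
ref_words, hyp_words = reference.split(), hypothesis.split()
# list() yields one token per character.
ref_chars, hyp_chars = list(reference), list(hypothesis)


def mismatch_rate(ref_tokens, hyp_tokens):
    """Fraction of positions that differ; only valid for equal-length sequences."""
    if len(ref_tokens) != len(hyp_tokens):
        raise ValueError("this simplified rate needs sequences of equal length")
    wrong = sum(r != h for r, h in zip(ref_tokens, hyp_tokens))
    return wrong / len(ref_tokens)


print(ref_words)                            # ['我想吃饭']
print(mismatch_rate(ref_words, hyp_words))  # 1.0
print(mismatch_rate(ref_chars, hyp_chars))  # 0.25
```

With jiwer itself, joining the characters with spaces before calling wer (for instance `" ".join(text)`) should give the same character-level behaviour, although cer is the more direct route.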
1.2.3 Computing WER over multiple sentences
<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import wer

if __name__ == '__main__':
    ground_truth = ["hello world", "i like monty python"]
    hypothesis = ["hello duck", "i like python"]

    # Errors are counted over all sentence pairs together
    error = wer(ground_truth, hypothesis)
    print(error)
</code>
Output:
<code class="line-numbers">0.3333333333333333 </code>
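The value 0.333... is consistent with pooling the errors of all sentence pairs before dividing, rather than averaging the per-sentence WERs. The arithmetic below spells this out; the error counts are worked out by hand from the example above (one substitution in the first pair, one deleted word in the second).

```python
# (errors, reference word count) for each sentence pair:
# first pair:  1 substitution, 2 reference words
# second pair: 1 deletion,     4 reference words
pairs = [(1, 2), (1, 4)]

# Pooled: total errors divided by total reference words.
pooled = sum(errors for errors, _ in pairs) / sum(words for _, words in pairs)
# Averaged: mean of the per-sentence error rates.
averaged = sum(errors / words for errors, words in pairs) / len(pairs)

print(pooled)    # 0.3333333333333333
print(averaged)  # 0.375
```

Pooling weights long sentences more heavily than short ones, which is worth keeping in mind when comparing this figure with a per-sentence average.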
1.2.4 Preprocessing the two texts before computing the metric
Example code:
<code class="language-python line-numbers"># -*- coding: utf-8 -*-
import jiwer

if __name__ == '__main__':
    ground_truth = "I very like python!"
    hypothesis = "i like Python?\n"

    # Lowercase, turn whitespace characters into spaces, collapse repeated
    # spaces, then split each sentence into a list of words
    transformation = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemoveWhiteSpace(replace_by_space=True),
        jiwer.RemoveMultipleSpaces(),
        jiwer.ReduceToListOfListOfWords(word_delimiter=" ")
    ])

    # Note: newer jiwer releases may expect reference_transform
    # instead of truth_transform
    error = jiwer.wer(
        ground_truth,
        hypothesis,
        truth_transform=transformation,
        hypothesis_transform=transformation
    )
    print(error)
</code>
Output:
<code class="line-numbers">0.5 </code>
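To see why the result is 0.5, it helps to look at the word lists that reach the comparison. The sketch below imitates the four preprocessing steps with plain string operations. It is an approximation for illustration, not jiwer's code, and `compose` and `to_words` are hypothetical names.

```python
import re


def compose(steps):
    """Return a function that applies each step in order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


to_words = compose([
    str.lower,                                      # lowercase the text
    lambda s: re.sub(r"[\t\n\r\x0b\x0c]", " ", s),  # whitespace characters -> space
    lambda s: re.sub(r" +", " ", s),                # collapse runs of spaces
    lambda s: [w for w in s.split(" ") if w],       # split into non-empty words
])

print(to_words("I very like python!"))  # ['i', 'very', 'like', 'python!']
print(to_words("i like Python?\n"))     # ['i', 'like', 'python?']
```

Punctuation is not removed by this chain, so python! and python? count as different words. Together with the missing word very, that makes two errors against four reference words, hence 0.5.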
In the jiwer example above, jiwer.Compose(transformations: List[Transform]) combines several text preprocessing transforms into one. The available transforms are listed below.
(1) ReduceToListOfListOfWords
jiwer.ReduceToListOfListOfWords(word_delimiter=" ")
Converts one or more sentences into lists of words. The input can be a string (one sentence) or a list of strings (one or more sentences).
Example:
<code class="language-python line-numbers">sentences = ["hi", "this is an example"]

print(jiwer.ReduceToListOfListOfWords()(sentences))
# prints: [['hi'], ['this', 'is', 'an', 'example']]
</code>
(2) ReduceToSingleSentence
jiwer.ReduceToSingleSentence(word_delimiter=" ")
Merges several sentences into a single sentence. The input can be a string (one sentence) or a list of strings (one or more sentences).
Example:
<code class="language-python line-numbers">sentences = ["hi", "this is an example"]

print(jiwer.ReduceToSingleSentence()(sentences))
# prints: ['hi this is an example']
</code>
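(3) RemoveSpecificWords
jiwer.RemoveSpecificWords(words_to_remove: List[str])
Filters out the given words.
Example:
<code class="line-numbers">sentences = ["yhe awesome", "the apple is not a pear", "yhe"]

print(jiwer.RemoveSpecificWords(["yhe", "the", "a"])(sentences))
# prints roughly: ["awesome", "apple is not pear", ""]
# "not" is kept because it is not in the list; the removed words may leave
# extra spaces behind, depending on the jiwer version
</code>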
(4) RemoveWhiteSpace
jiwer.RemoveWhiteSpace(replace_by_space=False)
Filters out whitespace. The whitespace characters are the space character, \t, \n, \r, \x0b and \x0c. Note that by default the space character is removed as well, which makes it impossible to split a sentence into words afterwards with SentencesToListOfWords.
Example:
<code class="language-python line-numbers">sentences = ["this is an example", "hello\tworld\n\r"]

print(jiwer.RemoveWhiteSpace()(sentences))
# prints: ["thisisanexample", "helloworld"]

print(jiwer.RemoveWhiteSpace(replace_by_space=True)(sentences))
# prints: ["this is an example", "hello world  "]
# note the trailing spaces
</code>
(5) RemovePunctuation
jiwer.RemovePunctuation()
Filters out punctuation. The punctuation characters are:
<code class="language-python line-numbers">'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
</code>
Example:
<code class="language-python line-numbers">sentences = ["this is an example!", "hello. goodbye"]

print(jiwer.RemovePunctuation()(sentences))
# prints: ['this is an example', "hello goodbye"]
</code>
(6) RemoveMultipleSpaces
jiwer.RemoveMultipleSpaces()
Collapses multiple spaces between words into one.
Example:
<code class="language-python line-numbers">sentences = ["this is   an   example ", "  hello goodbye  ", "  "]

print(jiwer.RemoveMultipleSpaces()(sentences))
# prints: ['this is an example ', " hello goodbye ", " "]
# note that there are still trailing spaces
</code>
(7) Strip
jiwer.Strip()
Removes all leading and trailing spaces.
Example:
<code class="language-python line-numbers">sentences = [" this is an example ", " hello goodbye ", " "]

print(jiwer.Strip()(sentences))
# prints: ['this is an example', "hello goodbye", ""]
# note that there is an empty string left behind which might need to be cleaned up
</code>
(8) RemoveEmptyStrings
jiwer.RemoveEmptyStrings()
Removes empty strings.
Example:
<code class="language-python line-numbers">sentences = ["", "this is an example", " ", " "]

print(jiwer.RemoveEmptyStrings()(sentences))
# prints: ['this is an example']
</code>
(9) ExpandCommonEnglishContractions
jiwer.ExpandCommonEnglishContractions()
Expands common English contractions, for example let's to let us.
Example:
<code class="language-python line-numbers">sentences = ["she'll make sure you can't make it", "let's party!"]

print(jiwer.ExpandCommonEnglishContractions()(sentences))
# prints: ["she will make sure you can not make it", "let us party!"]
</code>
(10) SubstituteWords
jiwer.SubstituteWords(dictionary: Mapping[str, str])
Replaces one word with another. Note that only whole words are matched. If the word you are replacing is a substring of another word, that other word is not affected. For example, if you replace foo with bar, the word foobar will not be turned into barbar.
Example:
<code class="language-python line-numbers">sentences = ["you're pretty", "your book", "foobar"]

print(jiwer.SubstituteWords({"pretty": "awesome", "you": "i", "'re": " am", 'foo': 'bar'})(sentences))
# prints: ["i am awesome", "your book", "foobar"]
</code>
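Whole-word matching of this kind is usually done with the regular-expression word boundary \b. The sketch below reproduces the example with the standard `re` module, under the assumption that jiwer works in a similar way; `substitute_words` is a hypothetical name.

```python
import re


def substitute_words(sentences, mapping):
    """Replace whole words only, applying the mapping entries in order."""
    result = []
    for sentence in sentences:
        for old, new in mapping.items():
            # \b anchors the match at word boundaries, so "you" does not
            # match inside "your" and "foo" does not match inside "foobar".
            pattern = r"\b" + re.escape(old) + r"\b"
            sentence = re.sub(pattern, lambda _match, new=new: new, sentence)
        result.append(sentence)
    return result


mapping = {"pretty": "awesome", "you": "i", "'re": " am", "foo": "bar"}
print(substitute_words(["you're pretty", "your book", "foobar"], mapping))
# ['i am awesome', 'your book', 'foobar']
```

The entries are applied one after another, so an earlier replacement can change what a later pattern sees; here "you" becomes "i" before "'re" is expanded to " am".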
(11) SubstituteRegexes
jiwer.SubstituteRegexes(dictionary: Mapping[str, str])
Replaces every substring that matches a regular expression with another substring.
Example:
<code class="language-python line-numbers">sentences = ["is the world doomed or loved?", "edibles are allegedly cultivated"]

# note: the regex string r"\b(\w+)ed\b" matches every word ending in 'ed',
# and r"\1" stands for the first group (\w+). It therefore removes 'ed' in every match.
print(jiwer.SubstituteRegexes({r"doom": r"sacr", r"\b(\w+)ed\b": r"\1"})(sentences))
# prints: ["is the world sacr or lov?", "edibles are allegedly cultivat"]
</code>
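The backreference in the replacement string is what keeps the stem of each word while dropping the ed suffix. The same rules can be tried with `re.sub` directly; the sketch below assumes the rules are applied in dictionary order, and `substitute_regexes` is a hypothetical name.

```python
import re


def substitute_regexes(sentences, mapping):
    """Apply each pattern -> replacement rule in order to every sentence."""
    result = []
    for sentence in sentences:
        for pattern, replacement in mapping.items():
            sentence = re.sub(pattern, replacement, sentence)
        result.append(sentence)
    return result


rules = {
    r"doom": r"sacr",         # plain substring replacement
    r"\b(\w+)ed\b": r"\1",    # keep group 1, drop the trailing "ed"
}
sentences = ["is the world doomed or loved?", "edibles are allegedly cultivated"]
print(substitute_regexes(sentences, rules))
# ['is the world sacr or lov?', 'edibles are allegedly cultivat']
```

The word allegedly is untouched because its ed is not followed by a word boundary, and doomed first becomes sacred and then sacr because the second rule runs on the output of the first.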
(12) ToLowerCase
jiwer.ToLowerCase()
Converts every character to lowercase.
Example:
<code class="language-python line-numbers">sentences = ["You're PRETTY"]

print(jiwer.ToLowerCase()(sentences))
# prints: ["you're pretty"]
</code>
(13) ToUpperCase
jiwer.ToUpperCase()
Converts every character to uppercase.
Example:
<code class="language-python line-numbers">sentences = ["You're amazing"]

print(jiwer.ToUpperCase()(sentences))
# prints: ["YOU'RE AMAZING"]
</code>
(14) RemoveKaldiNonWords
jiwer.RemoveKaldiNonWords()
Removes any word enclosed in [] or <>. This is useful when working with hypotheses from the Kaldi project, which can output non-words such as [laugh] and <unk>.
Example:
<code class="language-python line-numbers">sentences = ["you <unk> like [laugh]"]

print(jiwer.RemoveKaldiNonWords()(sentences))
# prints: ["you  like "]
# note the extra spaces
</code>
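The extra spaces appear because only the bracketed tokens are deleted, while the spaces around them stay. A minimal regular-expression sketch shows the effect; the pattern is my own approximation, not necessarily the one jiwer uses, and `remove_kaldi_non_words` is a hypothetical name.

```python
import re

# Matches "[...]" or "<...>" together with the enclosing brackets.
NON_WORD = re.compile(r"\[[^\]]*\]|<[^>]*>")


def remove_kaldi_non_words(sentences):
    """Delete bracketed non-word tokens, leaving the surrounding spaces alone."""
    return [NON_WORD.sub("", sentence) for sentence in sentences]


print(remove_kaldi_non_words(["you <unk> like [laugh]"]))
# ['you  like ']
```

Following such a step with RemoveMultipleSpaces and Strip, both described above, tidies up the leftover whitespace.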
Author: StubbornHuang
Copyright notice: this is an original article by the site owner. Please cite the original link when reposting.
Original link: https://www.stubbornhuang.com/2174/
Published: 2022-06-20 10:06:00
Last modified: 2023-06-25 21:06:23