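Python - jiwer, a text similarity library for speech recognition that computes Word Error Rate (WER), Match Error Rate (MER), and other similarity measures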

1 jiwer

GitHub repository: https://github.com/jitsi/jiwer

jiwer is a Python library for measuring how similar a speech recognition transcript is to the correct text. The measures it supports are Word Error Rate (WER), Match Error Rate (MER), Word Information Lost (WIL), and Word Information Preserved (WIP).

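For example, the commonly used Word Error Rate (WER) is computed as:

$$WER = \frac{sub + del + ins}{reference}$$

Here $reference$ is the number of words in the ground truth (the correct text), $sub$ is the number of substituted words, $del$ is the number of deleted words, and $ins$ is the number of inserted words. Given a recognized text $pred$ and the correct text $reference$, $WER$ describes how far $pred$ is from $reference$: it counts the word substitutions, deletions, and insertions that separate the two texts and divides that count by the length of $reference$. Because insertions are counted as well, WER can be greater than 1.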
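The formula can be sketched in plain Python as a word-level edit distance divided by the number of reference words. This is only an illustration of the definition, not jiwer's implementation; the function name `word_error_rate` is made up for this example, and a non-empty reference is assumed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)  # assumes the reference is not empty


print(word_error_rate("hello world", "hello duck"))              # 0.5
print(word_error_rate("i like monthy python", "i like python"))  # 0.25
```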

For a more detailed comparison of WER, MER, and WIL, see this paper:

From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition
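As a rough sketch of how the four measures relate, the snippet below computes them from the counts of hits (correctly recognized words), substitutions, deletions, and insertions. The formulas follow the definitions in that paper as I understand them, and the function name `measures_from_counts` is hypothetical, not part of jiwer.

```python
def measures_from_counts(hits: int, subs: int, dels: int, ins: int) -> dict:
    """Compute WER, MER, WIP and WIL from alignment counts (illustrative only)."""
    n_ref = hits + subs + dels  # number of words in the reference
    n_hyp = hits + subs + ins   # number of words in the hypothesis
    errors = subs + dels + ins
    wer = errors / n_ref
    mer = errors / (hits + errors)
    wip = (hits / n_ref) * (hits / n_hyp)
    wil = 1 - wip
    return {"wer": wer, "mer": mer, "wip": wip, "wil": wil}


# "hello world" vs "hello duck": one hit, one substitution
print(measures_from_counts(hits=1, subs=1, dels=0, ins=0))
# {'wer': 0.5, 'mer': 0.5, 'wip': 0.25, 'wil': 0.75}
```

Unlike WER, MER cannot exceed 1 here, because the insertions also appear in its denominator.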

1.1 Installing jiwer

With Python >= 3.6, install it with pip:

<code class="language-python line-numbers">pip install jiwer
</code>

1.2 Using jiwer

1.2.1 Computing WER for English

Code example

<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import wer

if __name__ == '__main__':
    ground_truth = "hello world"
    hypothesis = "hello duck"

    error = wer(ground_truth, hypothesis)

    print(error)
</code>

Output:

<code class="line-numbers">0.5
</code>

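1.2.2 Computing WER for Chinese

The English example above shows that jiwer computes the WER of English strings correctly. In my tests, however, when the input strings are Chinese, the WER is 1.0 as soon as the two strings differ by a single character. For example: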

<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import wer

if __name__ == '__main__':
    ground_truth = "我想吃饭"
    hypothesis = "我想吃屎"

    error = wer(ground_truth, hypothesis)

    print(error)
</code>

Output:

<code class="line-numbers">1.0
</code>

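My first guess was a character-encoding problem, but the real cause is how the text is split into words.

WER is computed over words, and the sentence is split into words on spaces. In the English example, hello and world are two separate words. The word duck in hypothesis is wrong and has to be replaced by world, which is one substitution against the two words in ground_truth, so the WER is 1/2 = 0.5.

Chinese text has no spaces between words, so wer treats "我想吃饭" and "我想吃屎" each as a single word. As soon as one character differs, the whole sentence counts as one wrong word, which is why the WER always comes out as 1.0.

For Chinese strings, use cer (character error rate) instead, which treats every Chinese character as a separate unit when measuring the error between the two strings: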

<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import cer

if __name__ == '__main__':
    ground_truth = "我想吃饭"
    hypothesis = "我想吃屎"

    error = cer(ground_truth, hypothesis)

    print(error)
</code>

Output:

<code class="line-numbers">0.25
</code>
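The difference between the two results comes down to tokenization, which can be seen without jiwer at all. The sketch below splits the same two strings on whitespace and into characters; the helper `mismatch_rate` is a made-up simplification that only works for sequences of equal length, where every error is a substitution.

```python
reference = "我想吃饭"
hypothesis = "我想吃屎"

# Word level: there are no spaces, so each sentence is a single token.
ref_words = reference.split()
hyp_words = hypothesis.split()
print(ref_words, hyp_words)  # ['我想吃饭'] ['我想吃屎']

# Character level: every Chinese character is its own token.
ref_chars = list(reference)
hyp_chars = list(hypothesis)


def mismatch_rate(ref_tokens, hyp_tokens):
    # Simplification: equal-length sequences only, so all errors are substitutions.
    assert len(ref_tokens) == len(hyp_tokens)
    errors = sum(r != h for r, h in zip(ref_tokens, hyp_tokens))
    return errors / len(ref_tokens)


print(mismatch_rate(ref_words, hyp_words))  # 1.0  (1 wrong token out of 1)
print(mismatch_rate(ref_chars, hyp_chars))  # 0.25 (1 wrong token out of 4)
```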

1.2.3 Computing WER for multiple sentences

<code class="language-python line-numbers"># -*- coding: utf-8 -*-
from jiwer import wer

if __name__ == '__main__':
    ground_truth = ["hello world", "i like monthy python"]
    hypothesis = ["hello duck", "i like python"]

    error = wer(ground_truth, hypothesis)

    print(error)
</code>

Output:

<code class="line-numbers">0.3333333333333333
</code>
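The value 0.333 is consistent with pooling the errors over all sentences before dividing, rather than averaging the per-sentence rates. The arithmetic below checks this by hand; the error counts are written out manually for this example and are not produced by jiwer.

```python
references = ["hello world", "i like monthy python"]

# Edit operations per sentence, counted by hand:
# "world" -> "duck" is one substitution; the missing "monthy" is one deletion.
errors_per_sentence = [1, 1]
words_per_sentence = [len(r.split()) for r in references]  # [2, 4]

# Pooled: total errors divided by total reference words.
pooled = sum(errors_per_sentence) / sum(words_per_sentence)

# Alternative: average of the per-sentence rates (0.5 and 0.25).
rates = [e / n for e, n in zip(errors_per_sentence, words_per_sentence)]
mean_of_rates = sum(rates) / len(rates)

print(pooled)         # 0.3333333333333333
print(mean_of_rates)  # 0.375
```

Only the pooled figure matches the output above, so longer sentences weigh more in the result.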

1.2.4 Preprocessing both texts before computing

Example code

<code class="language-python line-numbers"># -*- coding: utf-8 -*-
import jiwer

if __name__ == '__main__':
    ground_truth = "I very like python!"
    hypothesis = "i like Python?\n"

    transformation = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemoveWhiteSpace(replace_by_space=True),
        jiwer.RemoveMultipleSpaces(),
        jiwer.ReduceToListOfListOfWords(word_delimiter=" ")
    ])

    error = jiwer.wer(
        ground_truth,
        hypothesis,
        truth_transform=transformation,
        hypothesis_transform=transformation
    )

    print(error)
</code>

Output:

<code class="line-numbers">0.5
</code>
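To see where 0.5 comes from, the pipeline can be imitated with plain string operations. This is a rough stand-in for the four transformations, not jiwer's own code, and the function name `preprocess` is invented for the example.

```python
import re


def preprocess(sentence: str) -> list:
    s = sentence.lower()                     # similar to ToLowerCase
    s = re.sub(r"[\t\n\r\x0b\x0c]", " ", s)  # similar to RemoveWhiteSpace(replace_by_space=True)
    s = re.sub(r" {2,}", " ", s)             # similar to RemoveMultipleSpaces
    return [w for w in s.split(" ") if w]    # split into words, dropping empty strings


print(preprocess("I very like python!"))  # ['i', 'very', 'like', 'python!']
print(preprocess("i like Python?\n"))     # ['i', 'like', 'python?']
```

After preprocessing, "very" is missing from the hypothesis (one deletion) and "python!" differs from "python?" (one substitution), which gives 2 errors over 4 reference words, or 0.5. The punctuation is still attached to the last word because the pipeline does not remove it.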

In the code above, jiwer.Compose(transformations: List[Transform]) combines several text preprocessing transformations. The available transformations are:

(1) ReduceToListOfListOfWords

jiwer.ReduceToListOfListOfWords(word_delimiter=" ") converts one or more sentences into lists of words. The input can be a string (one sentence) or a list of strings (one or more sentences).

Example

<code class="language-python line-numbers">sentences = ["hi", "this is an example"]

print(jiwer.ReduceToListOfListOfWords()(sentences))
# prints: [['hi'], ['this', 'is', 'an', 'example']]
</code>

(2) ReduceToSingleSentence

jiwer.ReduceToSingleSentence(word_delimiter=" ") merges several sentences into a single sentence. The input can be a string (one sentence) or a list of strings (one or more sentences).

Example

<code class="language-python line-numbers">sentences = ["hi", "this is an example"]

print(jiwer.ReduceToSingleSentence()(sentences))
# prints: ['hi this is an example']
</code>

(3) RemoveSpecificWords

jiwer.RemoveSpecificWords(words_to_remove: List[str]) filters out specific words.

Example

<code class="line-numbers">sentences = ["yhe awesome", "the apple is not a pear", "yhe"]

print(jiwer.RemoveSpecificWords(["yhe", "the", "a"])(sentences))
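# prints: ['  awesome', '  apple is not   pear', ' ']
# "not" is kept because it is not in the list; the exact spacing left behind may depend on the jiwer version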
</code>

(4) RemoveWhiteSpace

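jiwer.RemoveWhiteSpace(replace_by_space=False) filters out whitespace. The whitespace characters are the space character, \t, \n, \r, \x0b and \x0c. Note that by default the space character is removed as well, which makes it impossible to split a sentence into words afterwards with SentencesToListOfWords.

Example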

<code class="language-python line-numbers">sentences = ["this is an example", "hello\tworld\n\r"]

print(jiwer.RemoveWhiteSpace()(sentences))
# prints: ["thisisanexample", "helloworld"]

print(jiwer.RemoveWhiteSpace(replace_by_space=True)(sentences))
# prints: ["this is an example", "hello world  "]
# note the trailing spaces
</code>

(5) RemovePunctuation

jiwer.RemovePunctuation() filters out punctuation. The punctuation characters are:

<code class="language-python line-numbers">'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
</code>

Example

<code class="language-python line-numbers">sentences = ["this is an example!", "hello. goodbye"]

print(jiwer.RemovePunctuation()(sentences))
# prints: ['this is an example', "hello goodbye"]
</code>
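The character list above is the same as Python's `string.punctuation`, so the effect can be approximated with the standard library alone. This is a sketch for that ASCII set only and says nothing about how jiwer implements the transformation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

sentences = ["this is an example!", "hello. goodbye"]
print([s.translate(table) for s in sentences])
# ['this is an example', 'hello goodbye']
```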

(6) RemoveMultipleSpaces

jiwer.RemoveMultipleSpaces() collapses multiple spaces between words into one.

Example

<code class="language-python line-numbers">sentences = ["this is   an   example ", "  hello goodbye  ", "  "]

print(jiwer.RemoveMultipleSpaces()(sentences))
# prints: ['this is an example ', " hello goodbye ", " "]
# note that there are still trailing spaces
</code>

(7) Strip

jiwer.Strip() removes all leading and trailing spaces.

Example

<code class="language-python line-numbers">sentences = [" this is an example ", "  hello goodbye  ", "  "]

print(jiwer.Strip()(sentences))
# prints: ['this is an example', "hello goodbye", ""]
# note that there is an empty string left behind which might need to be cleaned up
</code>

(8) RemoveEmptyStrings

jiwer.RemoveEmptyStrings() removes empty strings.

Example

<code class="language-python line-numbers">sentences = ["", "this is an example", " ",  "                "]

print(jiwer.RemoveEmptyStrings()(sentences))
# prints: ['this is an example']
</code>
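Strip and RemoveEmptyStrings are typically used one after the other, since stripping can leave empty strings behind. In plain Python the combined effect looks roughly like this (an illustration only, not jiwer code):

```python
sentences = [" this is an example ", "  hello goodbye  ", "  "]

# Step 1: remove leading and trailing whitespace (what Strip does).
stripped = [s.strip() for s in sentences]
print(stripped)  # ['this is an example', 'hello goodbye', '']

# Step 2: drop the strings that are now empty (what RemoveEmptyStrings does).
non_empty = [s for s in stripped if s]
print(non_empty)  # ['this is an example', 'hello goodbye']
```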

(9) ExpandCommonEnglishContractions

jiwer.ExpandCommonEnglishContractions() replaces common English contractions, for example let's with let us.

Example

<code class="language-python line-numbers">sentences = ["she'll make sure you can't make it", "let's party!"]

print(jiwer.ExpandCommonEnglishContractions()(sentences))
# prints: ["she will make sure you can not make it", "let us party!"]
</code>

(10) SubstituteWords

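jiwer.SubstituteWords(dictionary: Mapping[str, str]) replaces one word with another. Note that only whole words are matched: if the word you want to replace is a substring of another word, that other word is not affected. For example, if you replace foo with bar, the word foobar will not be turned into barbar.

Example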

<code class="language-python line-numbers">sentences = ["you're pretty", "your book", "foobar"]

print(jiwer.SubstituteWords({"pretty": "awesome", "you": "i", "'re": " am", 'foo': 'bar'})(sentences))

# prints: ["i am awesome", "your book", "foobar"]
</code>
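Whole-word matching of this kind is commonly done with the regular-expression word boundary `\b`. The sketch below assumes that approach to show the difference from a plain substring replacement; the helper `substitute_whole_word` is hypothetical.

```python
import re


def substitute_whole_word(sentence: str, old: str, new: str) -> str:
    # \b matches only at word boundaries, so "foo" inside "foobar" is skipped.
    return re.sub(r"\b" + re.escape(old) + r"\b", new, sentence)


print(substitute_whole_word("foo foobar", "foo", "bar"))  # bar foobar
print("foo foobar".replace("foo", "bar"))                 # bar barbar
```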

(11) SubstituteRegexes

jiwer.SubstituteRegexes(dictionary: Mapping[str, str]) replaces every substring that matches a regular expression with another substring.

Example

<code class="language-python line-numbers">sentences = ["is the world doomed or loved?", "edibles are allegedly cultivated"]

# note: the regex string r"\b(\w+)ed\b" matches every word ending in 'ed',
# and r"\1" stands for the first group (\w+). It therefore removes 'ed' in every match.
print(jiwer.SubstituteRegexes({r"doom": r"sacr", r"\b(\w+)ed\b": r"\1"})(sentences))

# prints: ["is the world sacr or lov?", "edibles are allegedly cultivat"]
</code>
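To keep the part of a word that comes before 'ed', the replacement has to refer back to the captured group, which Python writes as `\1`. The standard `re` module shows the difference between a backreference and an empty replacement:

```python
import re

sentence = "is the world sacred or loved?"

# r"\1" puts back the captured stem, so only the trailing "ed" is dropped.
print(re.sub(r"\b(\w+)ed\b", r"\1", sentence))  # is the world sacr or lov?

# An empty replacement deletes the whole matched word instead.
print(re.sub(r"\b(\w+)ed\b", r"", sentence))    # is the world  or ?
```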

(12) ToLowerCase

jiwer.ToLowerCase() converts every character to lowercase.

Example

<code class="language-python line-numbers">sentences = ["You're PRETTY"]

print(jiwer.ToLowerCase()(sentences))

# prints: ["you're pretty"]
</code>

(13) ToUpperCase

jiwer.ToUpperCase() converts every character to uppercase.

Example

<code class="language-python line-numbers">sentences = ["You're amazing"]

print(jiwer.ToUpperCase()(sentences))

# prints: ["YOU'RE AMAZING"]
</code>

(14) RemoveKaldiNonWords

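jiwer.RemoveKaldiNonWords() removes any word enclosed in [] or <>. This is useful when working with hypotheses from the Kaldi project, which can output non-words such as [laugh] and <unk>.

Example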

<code class="language-python line-numbers">sentences = ["you <unk> like [laugh]"]

print(jiwer.RemoveKaldiNonWords()(sentences))

# prints: ["you  like "]
# note the extra spaces
</code>
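The same effect can be approximated with one regular expression that matches anything inside square or angle brackets. The pattern below is an assumption made for illustration and is not taken from jiwer's source.

```python
import re

# Matches "[...]" or "<...>" including the brackets themselves.
KALDI_NON_WORD = re.compile(r"\[[^\]]*\]|<[^>]*>")

sentences = ["you <unk> like [laugh]"]
print([KALDI_NON_WORD.sub("", s) for s in sentences])
# ['you  like ']
```

As in the jiwer example, the spaces around the removed tokens stay in place, so a step such as RemoveMultipleSpaces followed by Strip is a natural follow-up.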
