Research on Robust Speaker Recognition Methods Based on Sparse Coding
Published: 2018-07-26 10:50
[Abstract]: Speaker recognition, also known as voiceprint recognition, is a technology that determines a speaker's identity from speech. Because speech is convenient and inexpensive to collect, speaker recognition is widely used in biometric authentication, security monitoring, military reconnaissance, financial services, and other fields, and has broad application prospects. Over the past several decades, research institutions and companies around the world have invested substantial resources in this area, strongly advancing the technology. Speaker recognition is now moving from the laboratory into real applications, and the complexity of real-world environments imposes higher requirements, including robustness, real-time performance, recognition rate, and stability. This calls for breakthroughs in the key stages of speaker recognition, especially voice activity detection, feature extraction, and speaker modeling. Current systems achieve good recognition rates on clean speech, but their performance degrades sharply in noisy environments, which hinders practical deployment.

To address this lack of noise robustness, this thesis applies sparse coding to each stage of speaker recognition, including voice activity detection, feature extraction, and speaker modeling, and proposes a systematic solution for improving recognition rates in noisy environments. The main contributions are as follows.

First, the noise-modeling capabilities of two sparse coding approaches are analyzed theoretically, laying the foundation for their application. Sparse coding can model noise in two ways. The first models noise with the reconstruction residual; the theoretical noise model is Gaussian white noise, and the underlying assumption is that speech is sparse on the speech dictionary while noise is not. White noise is not sparse on any dictionary and therefore satisfies this assumption. The second models noise with a dedicated noise dictionary; the underlying assumption is that speech and noise are each sparse on their respective dictionaries, and sparser on their own dictionary than on the other's. This thesis derives upper and lower bounds on the reconstruction error of both approaches and verifies the analysis experimentally. The results show that when the noise is not sparse, the two methods have the same theoretical lower bound on reconstruction error but different upper bounds; when the noise may itself be sparse, the second approach, by adding a noise dictionary and thereby incorporating more prior knowledge, achieves a lower upper bound on the reconstruction error than the first.

Second, to address the sensitivity of voice activity detection (VAD) to noise, a noise-robust VAD method is proposed that builds noise dictionaries with sparse coding. VAD is the first step of speaker recognition; it reduces the amount of data the algorithm must process and improves efficiency. Existing VAD methods account for noise only when the noise environment is known and unchanging; when the environment changes or the noise is non-stationary, their performance degrades sharply. The proposed method first identifies the noise type with a Gaussian mixture model, then concatenates the trained noise dictionary with the speech dictionary into one large dictionary, and finally represents the noisy speech sparsely on the concatenated dictionary, using the coefficients on the speech dictionary to decide between speech and non-speech. In effect, the method senses the noise environment and selects a dictionary matched to it, yielding better detection in complex noise conditions.

Third, two noise-insensitive feature extraction methods are proposed. Feature extraction is a key stage of speaker recognition: features should be discriminative, yet as insensitive to noise as possible. The first feature combines perceptual minimum variance distortionless response (PMVDR) analysis with shifted delta cepstra, effectively incorporating long-term information from the speaker's speech. It performs well not only on clean speech but also under noise and channel mismatch, outperforming current mainstream features. Experiments on the YOHO and ROSSI databases show that this feature improves recognition robustness under noise and channel distortion. The second feature decomposes noisy speech on the speech dictionary, reconstructs the speech from its sparse representation, and then extracts Mel-frequency cepstral features for model training and recognition. Because sparse coding can model noise through either the residual or a noise dictionary, the reconstructed signal is largely noise-free, so the extracted features are only weakly affected by noise.

Finally, a two-stage sparse decomposition framework for speaker recognition is proposed. Existing methods typically concatenate all speaker dictionaries into one large dictionary; this is somewhat discriminative but has two drawbacks: the concatenated dictionary contains too many atoms, reducing recognition efficiency, and the large number of competing classes dilutes the true speaker's score. In the first stage, the proposed method decomposes the test utterance on each speaker's dictionary, computes and ranks the reconstruction residuals, and selects a subset of dictionaries likely to contain the true speaker. In the second stage, this subset is concatenated into one large dictionary, the utterance is decomposed on it again, and the speaker is identified from per-dictionary scores computed from the sparse representation. The first stage discards most irrelevant speaker dictionaries, reducing the algorithm's time complexity; the second stage uses discriminative scoring to maintain the recognition rate. Experiments show that the proposed two-stage sparse decomposition method improves both recognition speed and accuracy.
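The two noise-modeling routes described in the abstract can be illustrated with a small numpy sketch. Everything here is a stand-in, not the thesis code: the dictionaries are random (real ones would be learned from speech and noise data), and a minimal OMP-style greedy pursuit replaces whatever sparse solver the thesis actually uses.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_atoms, k = 64, 32, 6  # frame dimension, atoms per dictionary, sparsity

def normalize(D):
    return D / np.linalg.norm(D, axis=0)

# Random stand-ins for a learned speech dictionary and a learned noise dictionary.
D_speech = normalize(rng.standard_normal((dim, n_atoms)))
D_noise = normalize(rng.standard_normal((dim, n_atoms)))

def omp_residual(D, x, k):
    """Greedy OMP-style pursuit: pick up to k atoms of D, re-fitting the
    coefficients by least squares each step; return the final residual."""
    support, residual = [], x.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    return residual

# A clean "speech" frame plus structured noise that is itself sparse on D_noise.
speech = D_speech[:, :3] @ np.array([1.0, 0.7, 0.4])
noise = D_noise[:, :2] @ np.array([0.6, 0.3])
x = speech + noise

# Route 1: code x on the speech dictionary only; the noise must live in the residual.
r1 = omp_residual(D_speech, x, k)
# Route 2: code x on the concatenated dictionary; noise atoms can absorb the noise,
# so the residual is typically much smaller when the noise is sparse on D_noise.
r2 = omp_residual(np.hstack([D_speech, D_noise]), x, k)

print(np.linalg.norm(r1), np.linalg.norm(r2))
```

Consistent with the bounds discussed in the abstract, when the noise is itself sparse on its own dictionary, the concatenated-dictionary route reconstructs the frame with a smaller error than the residual-only route.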
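The dictionary-based VAD decision can be sketched the same way (again with random stand-in dictionaries, and skipping the GMM noise-type selection step): decompose each frame on the concatenated speech-plus-noise dictionary and classify it by how much coefficient energy falls on the speech half.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_atoms = 64, 32

def normalize(D):
    return D / np.linalg.norm(D, axis=0)

# Random stand-ins for the trained speech and environment-matched noise dictionaries.
D_speech = normalize(rng.standard_normal((dim, n_atoms)))
D_noise = normalize(rng.standard_normal((dim, n_atoms)))
D_big = np.hstack([D_speech, D_noise])  # concatenated dictionary

def sparse_code(D, x, k=5):
    """Greedy OMP-style decomposition of x on D using at most k atoms."""
    support, residual = [], x.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef

def is_speech(frame, threshold=0.5):
    """Flag a frame as speech when the speech half of the concatenated
    dictionary carries most of the coefficient energy."""
    coef = sparse_code(D_big, frame)
    speech_energy = np.sum(coef[:n_atoms] ** 2)
    total_energy = np.sum(coef ** 2) + 1e-12
    return bool(speech_energy / total_energy > threshold)

# A frame built from speech atoms should be flagged; one built from noise atoms not.
speech_frame = D_speech[:, :3] @ np.array([1.0, 0.8, 0.5])
noise_frame = D_noise[:, :3] @ np.array([1.0, 0.8, 0.5])
print(is_speech(speech_frame), is_speech(noise_frame))
```

The same decomposition also underlies the reconstruction-based denoising behind the second feature: rebuilding the frame from only the speech-half coefficients (`D_speech @ coef[:n_atoms]`) discards the noise component before cepstral features are extracted.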
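The two-stage idea can be sketched minimally as follows. The per-speaker dictionaries are random stand-ins, and a plain least-squares fit stands in for the sparse solver in both stages to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_atoms, n_speakers, top_m = 64, 8, 10, 3

def normalize(D):
    return D / np.linalg.norm(D, axis=0)

# Random stand-ins for per-speaker dictionaries learned from enrollment speech.
dicts = [normalize(rng.standard_normal((dim, n_atoms))) for _ in range(n_speakers)]

def residual_norm(D, x):
    """Reconstruction residual of x on dictionary D (least-squares fit)."""
    sol, *_ = np.linalg.lstsq(D, x, rcond=None)
    return np.linalg.norm(x - D @ sol)

def identify(x):
    # Stage 1: rank speakers by per-dictionary residual and keep the
    # top_m candidates, discarding most irrelevant dictionaries cheaply.
    order = np.argsort([residual_norm(D, x) for D in dicts])
    candidates = order[:top_m]
    # Stage 2: concatenate the surviving dictionaries, decompose once more,
    # and score each candidate by its block's coefficient energy.
    D_big = np.hstack([dicts[i] for i in candidates])
    sol, *_ = np.linalg.lstsq(D_big, x, rcond=None)
    scores = [np.sum(sol[j * n_atoms:(j + 1) * n_atoms] ** 2) for j in range(top_m)]
    return int(candidates[int(np.argmax(scores))])

# An utterance frame synthesized from speaker 4's atoms maps back to speaker 4.
x = dicts[4][:, :3] @ np.array([1.0, 0.6, 0.3])
print(identify(x))
```

Stage 1 touches each small per-speaker dictionary once and is cheap; stage 2 only ever decomposes on the pruned concatenation of `top_m` dictionaries rather than all `n_speakers`, which is where the speed-up over a single global dictionary comes from.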
【Degree-granting institution】: Harbin University of Science and Technology
【Degree level】: Doctorate
【Year conferred】: 2016
【CLC number】: TN912.34
Article ID: 2145770
Link to this thesis: http://www.wukwdryxk.cn/shoufeilunwen/xxkjbs/2145770.html