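Research on Several Statistical Computing Models and Their Applications in Biomedical Information Processing
Published: 2018-06-24 11:52
Topic: fetal electrocardiogram + ensemble empirical mode decomposition. Reference: Shandong University, doctoral dissertation, 2016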
[Abstract]: This work is motivated by practical problems in medicine and biology. Using time series analysis, statistical signal processing, statistical machine learning and pattern recognition, and meta-analysis, we build four efficient statistical computing models and apply them to four problems: extraction and denoising of the intrauterine fetal electrocardiogram, identification of protein-coding regions in eukaryotes, virus prediction from next-generation sequencing (NGS) short reads in a big-data setting, and a meta-analysis of the association between alcohol dependence and NPY gene polymorphisms.

A high-precision fetal electrocardiogram (FECG) is very valuable for helping physicians monitor the fetus in the uterus and make clinical diagnoses. In practice, however, a clean FECG is hard to obtain, because it is usually mixed with the maternal ECG (MECG) and contaminated by other noise such as baseline drift, power-line interference, and other high-frequency noise. In Chapter 1 we propose a new adaptive integrated algorithm for separating maternal and fetal ECG signals and denoising the FECG. It combines the strengths of independent component analysis (ICA), ensemble empirical mode decomposition (EEMD), and wavelet shrinkage (WS). First, ICA separates the FECG from the abdominal mixed signal (abdominal ECG, AECG), which yields a noisy FECG. Second, an integrated algorithm based on EEMD and wavelet shrinkage denoises that noisy FECG in three stages: EEMD decomposition; a statistical test of the information content of the useful sub-signals, followed by wavelet shrinkage; and partial signal reconstruction to remove baseline drift. Finally, we test the algorithm on simulated and real signals, assessing its accuracy by the signal-to-noise ratio (SNR), mean square error (MSE), and correlation coefficient (R) of the simulated signals before and after denoising. The results show that the proposed ICA-EEMD-WS algorithm outperforms traditional signal separation and denoising methods.

The protein-coding regions (exons) of eukaryotic DNA sequences control protein production during translation and are therefore of great importance to life processes. In Chapter 2 we treat protein-coding region identification (gene structure prediction) as a pattern recognition or classification problem. Many classification techniques have been proposed for predicting coding regions (exons) and non-coding regions (introns) in eukaryotic DNA. Among them, the discrete Fourier transform (DFT), a digital signal processing (DSP) method, has been quite successful because it does not rely on prior knowledge. DFT-based methods, however, suffer from inherent weaknesses, namely low spectral resolution and spectral leakage, so they quickly lose their advantage on short DNA sequences. We propose a new integrated algorithm based on autoregressive (AR) spectral analysis and the wavelet packet transform (WPT) to improve the efficiency and accuracy of coding region identification. The algorithm first converts the DNA sequence into a numeric sequence with a DNA numerical mapping (the Code13 mapping method). It then treats this numeric sequence as the observed signal of an autoregressive model and estimates the power spectral density (PSD) of the model with the efficient Marple algorithm by solving the Yule-Walker equations. Next, the value of the PSD at frequency θ = 2π/3 (known as the period-three, or three-base periodicity (TBP), property) is used to obtain a signal-to-noise ratio (SNR) curve. After the SNR curve is denoised with the wavelet packet transform, an appropriate threshold is chosen to identify exon regions. Tests on three well-known benchmark sets (GENSCAN65, HMR195, and BG570) show that the new algorithm identifies protein-coding regions more accurately than traditional DFT-based methods.

Viruses, pathogenic viruses in particular, have threatened human health for thousands of years, and new viruses and variants have kept emerging in recent years. It is therefore important to use computational biology to help medical experts quickly narrow the set of suspected viral sequences in massive NGS short-read databases. Doing so provides high-quality candidates for subsequent experimental confirmation, greatly reduces experimental cost, improves the speed and timeliness of the response to new viruses, accelerates large-scale vaccine development and production, saves lives, and reduces the number of people infected. In Chapter 3 we combine alignment-based and alignment-free methods into an integrated classification algorithm that distinguishes viral from human sequences and then predicts the virus category. The algorithm first uses BLAST to align the query sequence against a large virus database and a large human database. If a highly homologous target sequence is found, the category of that target is taken as the category of the query, and the algorithm stops. For sequences that cannot be aligned, the proposed alignment-free method plays a complementary role: the query DNA sequence is converted into a numeric vector, which is fed to a support vector machine (SVM) classifier to predict its category. If the sequence is predicted to be "virus", a multi-class random forest (RF) then predicts which of six virus categories it belongs to. We test the integrated algorithm on 8 independent test sets and compare it with other prediction methods. The results show good performance on virus-versus-human classification, and the results on shorter sequences in particular are largely satisfactory. In the multi-class prediction at the virus level, the overall accuracy is not very high, but the predictions can still serve as a further reference for biologists. In summary, this study helps biologists and medical experts screen massive NGS short-read data, greatly narrowing the range of candidate viral sequences. It helps improve the efficiency of identifying and confirming viruses, especially pathogenic ones, and provides technical support for the treatment and prevention of major epidemic infectious diseases.

Alcohol dependence (AD) is a typical form of chronic alcoholism, a particular psychological state toward alcohol caused by long-term, repeated drinking. In the 20 years from 1990 to 2010, alcohol use rose rapidly from 6th to 3rd place among all global disease risk factors, behind only high blood pressure and secondhand smoke. Excessive drinking causes not only health-related damage but also social harm, such as traffic accidents, crime, child abuse, domestic violence, and other forms of injury. Alcohol-related problems have therefore become one of the major public health issues worldwide, including in China. Although the incidence of alcohol dependence keeps rising, its exact etiology and pathogenesis are still not fully understood. Current research regards AD as a complex psychiatric disorder related to multiple factors, including heredity and environment, and many studies have confirmed that it is closely related to genetic factors. Researchers in many countries have studied the association between neuropeptide Y (NPY) gene polymorphisms and alcohol dependence in different populations for more than ten years. At the two main single nucleotide polymorphisms (SNPs), rs16139 and rs16147, however, the reported results are inconsistent and sometimes completely opposite, so the susceptibility genes for AD have not been settled. The reason is that genetic background and environmental factors differ between populations and ethnic groups, so the allele and genotype frequencies of the same gene may differ, and so may the gene's effect on the same disease. Using existing randomized case-control study data to find susceptibility genes for alcohol dependence, to screen high-risk groups at the genetic level, and to provide them with targeted early intervention, diagnosis, and personalized treatment has important clinical value and social benefit.

Given these inconsistent findings, Chapter 4 asks whether there is a significant association between NPY gene polymorphisms and alcohol dependence. We use SNP meta-analysis to quantitatively analyze and comprehensively evaluate the published epidemiological literature on NPY gene polymorphisms, especially the two important SNPs (rs16139 and rs16147), and the risk of AD. Following the basic requirements of SNP meta-analysis strictly, we collect high-quality studies published in China and abroad and synthesize them quantitatively. We first test the control groups for Hardy-Weinberg equilibrium (HWE) and then test the heterogeneity across studies. After these tests are passed, a strategy for selecting the best genetic model, based on a logistic regression model, determines that the dominant genetic model should be used to pool the p-values of the individual studies. A subgroup analysis follows. Finally, a funnel plot, Egger's linear regression, and Begg's rank correlation test are used to rule out publication bias. The results show that, for most populations, there is currently no sufficient evidence of a significant association between NPY gene polymorphisms and alcohol dependence. The subgroup analysis, however, finds that SNP rs16139 is associated with alcohol dependence in a few populations (such as Finns). By combining multiple existing results, the meta-analysis in this chapter increases the sample size and the statistical power. In particular, when individual studies disagree or none reaches statistical significance, meta-analysis can give a combined result that is closer to the true situation. This provides a scientific basis for clinicians and researchers to better understand the pathogenesis of alcohol dependence and its genetic diagnosis and treatment.

Chapter 5 summarizes the four sub-projects, analyzes in depth the shortcomings of each study and their causes, and proposes improvements for future research.
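The three accuracy measures named for the Chapter 1 evaluation (SNR, MSE, and the correlation coefficient R) can be sketched as follows. This is a minimal illustration, not the thesis code. The abstract does not give its exact formulas, so the common definitions used here are assumptions: SNR as reference power over residual power in dB, and Pearson's R. The function names are hypothetical.

```python
import math


def mse(clean, estimate):
    """Mean square error between a reference signal and its estimate."""
    if len(clean) != len(estimate) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((c - e) ** 2 for c, e in zip(clean, estimate)) / len(clean)


def snr_db(clean, estimate):
    """SNR in dB, assuming noise = clean - estimate (a common convention)."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    if noise_power == 0:
        return math.inf  # the estimate matches the reference exactly
    return 10.0 * math.log10(signal_power / noise_power)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)
```

These measures need a known clean reference, which is why the abstract reports them for simulated signals only. A higher SNR, a lower MSE, and an R closer to 1 after denoising indicate a better reconstruction.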
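The period-three idea from Chapter 2 can be illustrated with a small autoregressive spectrum estimate evaluated at θ = 2π/3. This sketch rests on stated assumptions and does not reproduce the thesis method. The Code13 mapping is not described in the abstract, so a placeholder single-base indicator mapping is used instead. The Yule-Walker equations are solved here with the Levinson-Durbin recursion rather than the Marple algorithm. All function names are hypothetical.

```python
import cmath
import math


def numeric_map(dna, base="G"):
    """Placeholder mapping (NOT Code13): 1.0 where `base` occurs, else 0.0."""
    return [1.0 if ch == base else 0.0 for ch in dna.upper()]


def autocorr(x, max_lag):
    """Biased autocorrelation of the mean-removed signal for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / n
        for k in range(max_lag + 1)
    ]


def yule_walker(x, order):
    """Solve the Yule-Walker equations by Levinson-Durbin recursion.

    Returns (a, err) for x[n] + a[1]*x[n-1] + ... + a[p]*x[n-p] = e[n],
    where a[0] == 1.0 and err is the prediction-error variance.
    """
    r = autocorr(x, order)
    if r[0] <= 0:
        raise ValueError("signal has zero variance")
    a = [1.0]
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a + [0.0]
        a = [prev[k] + reflection * prev[m - k] for k in range(m + 1)]
        err *= 1.0 - reflection ** 2
    return a, err


def ar_psd(a, err, omega):
    """AR power spectral density at angular frequency omega (rad/sample)."""
    denom = sum(coef * cmath.exp(-1j * omega * k) for k, coef in enumerate(a))
    return err / abs(denom) ** 2


def period3_power(dna, order=2):
    """PSD value at theta = 2*pi/3 for one window of DNA."""
    a, err = yule_walker(numeric_map(dna), order)
    return ar_psd(a, err, 2.0 * math.pi / 3.0)
```

Applying `period3_power` to a sliding window along a sequence would give a curve whose peaks suggest coding regions. How that curve is normalized into an SNR and denoised with wavelet packets is specific to the thesis and is left out here.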
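The staged decision logic described for Chapter 3 (alignment first, then an SVM for virus versus human, then a random forest for the virus category) can be sketched as below. Several parts are assumptions. The abstract does not say how a DNA sequence becomes a numeric vector, so normalized k-mer frequencies are used as one common alignment-free choice. The three callables `blast_lookup`, `is_virus`, and `virus_category` are hypothetical stand-ins for a BLAST search and two trained classifiers.

```python
from itertools import product


def kmer_vector(seq, k=3):
    """Normalized k-mer frequency vector over the alphabet ACGT.

    Windows containing any other symbol (for example N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def classify_read(seq, blast_lookup, is_virus, virus_category):
    """Cascade: alignment hit, else binary classifier, else multi-class one.

    blast_lookup(seq)        -> a label string, or None if nothing aligns
    is_virus(features)       -> True for "virus", False for "human"
    virus_category(features) -> one of the virus category labels
    All three are hypothetical callables supplied by the caller.
    """
    label = blast_lookup(seq)
    if label is not None:
        return label  # a highly homologous hit decides the class
    features = kmer_vector(seq)
    if not is_virus(features):
        return "human"
    return virus_category(features)
```

Passing the classifiers in as callables keeps the sketch independent of any particular BLAST wrapper or machine learning library.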
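The Hardy-Weinberg equilibrium check on control groups mentioned for Chapter 4 is commonly done with a chi-square goodness-of-fit test with one degree of freedom. The sketch below shows that standard test for a biallelic SNP. It is an illustration under that assumption, since the abstract does not say which HWE test the thesis uses. The function name and argument names are hypothetical.

```python
import math


def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square test of Hardy-Weinberg equilibrium for genotype counts.

    Returns (chi2, p_value) using one degree of freedom.
    """
    n = n_major_hom + n_het + n_minor_hom
    if n <= 0:
        raise ValueError("genotype counts must sum to a positive number")
    p = (2 * n_major_hom + n_het) / (2 * n)  # major allele frequency
    q = 1.0 - p
    observed = (n_major_hom, n_het, n_minor_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs, exp in zip(observed, expected)
        if exp > 0  # a monomorphic sample has zero expected counts
    )
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A small p-value suggests that the control group departs from equilibrium, which meta-analyses often treat as a reason to examine or exclude that study.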
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: R318; TN911.7
Article ID: 2061430
Article link: http://www.wukwdryxk.cn/shoufeilunwen/xxkjbs/2061430.html