How to use genotype data for PCA and LD analysis

Posted: 2017-03-31 07:21
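For the PCA half of the question, a minimal sketch is given here. It assumes the genotypes are already loaded as a samples-by-SNPs matrix of 0/1/2 allele counts with no missing values. The function name `genotype_pca` and the toy data are hypothetical, and each SNP is scaled by its binomial standard deviation, which is one common convention rather than the only option. In practice SNPs in strong LD are often pruned before PCA; that step is not shown.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of an (n_samples, n_snps) matrix of 0/1/2 allele counts."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0                # allele frequency per SNP
    keep = (freq > 0) & (freq < 1)             # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    # Center each SNP and scale it by its binomial standard deviation
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    explained = s**2 / np.sum(s**2)                   # variance share per PC
    return scores, explained[:n_components]

# Toy data (assumption): two populations with shifted allele frequencies
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 200))
pop_b = rng.binomial(2, freq_b, size=(30, 200))
scores, explained = genotype_pca(np.vstack([pop_a, pop_b]))
```

Plotting the two columns of `scores` against each other is the usual way to look for population structure.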


Detailed Guide to Whole-Genome Resequencing Data Analysis (Baidu Wenku)
Applications of the R Language in Statistical Genetics (Baidu Wenku)
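Analysis of the Distribution of Intragenic Linkage Disequilibrium in the Human Genome
Genetic variation in the genome is organized according to patterns of linkage disequilibrium (LD). LD within genes is generally thought to be significantly stronger than LD in intergenic regions, but the distribution of LD has not yet been studied systematically with respect to fine-scale gene structure, functional gene categories, or gene expression patterns. Here we use haplotype data produced by the International HapMap Project to build an analysis platform of strong-LD and weak-LD intervals, and then map sequence features of interest onto this platform. Gene classification shows that genes involved in immune response, defense mechanisms, and signal transduction tend to lie in weak-LD regions, whereas genes related to DNA and RNA metabolism and to protein assembly are enriched in strong-LD regions. Further analysis of expression patterns shows that weak-LD genes have higher expression levels and a narrower expression breadth than strong-LD genes. Compared tissue by tissue, weak-LD genes are more highly expressed in immune cell lines, tonsil, pituitary, and globus pallidus, whereas strong-LD genes dominate in endothelial cell lines, amygdala, and hypothalamus. The classification and expression results suggest that strong negative selection acting on strong-LD genes prevents frequent recombination events from disrupting gene conservation, while adaptive selection may drive a degree of crossover recombination in weak-LD genes and so maintain their genetic diversity. In addition, comparing structures within genes, we find that the average LD of the CDS and 3' UTR is clearly higher than that of the 5' UTR, the upstream region, and the downstream region, which is broadly consistent with the stronger sequence conservation of the CDS and 3' UTR. This work contributes to a deeper understanding of intragenic LD structure and offers some guidance for studying how natural selection shapes the distribution of LD.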
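The abstract compares "strong" and "weak" LD without saying how LD is measured. As a hedged illustration, the sketch below computes the standard pairwise statistics D, |D'|, and r² from phased haplotypes coded 0/1. The function name `ld_stats` and the toy haplotypes are hypothetical, and nothing here is claimed to be the measure or threshold the thesis actually used.

```python
def ld_stats(hap_a, hap_b):
    """Return (D, |D'|, r2) for two biallelic sites coded 0/1 per haplotype."""
    n = len(hap_a)
    p_a = sum(hap_a) / n                               # frequency of allele 1 at site A
    p_b = sum(hap_b) / n                               # frequency of allele 1 at site B
    p_ab = sum(x * y for x, y in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Eight toy haplotypes observed at two sites
site_a = [1, 1, 1, 0, 0, 0, 1, 0]
site_b = [1, 1, 1, 0, 0, 0, 0, 1]
d, d_prime, r2 = ld_stats(site_a, site_b)   # D = 0.125, |D'| = 0.5, r2 = 0.25
```

Averaging r² (or |D'|) over all SNP pairs inside an interval gives one simple way to rank intervals from weak to strong LD.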
Abstract

Objective: Genome-wide association studies (GWAS) based on single nucleotide polymorphisms (SNPs) can effectively identify susceptibility genes for polygenic complex diseases and are widely used in genetic association research in China and abroad. Because association analysis of individual SNPs has shortcomings and limitations, more studies in recent years have turned to gene-level methods of genetic association analysis. This study aims to develop a new gene-level association method based on linkage disequilibrium (LD) structure, to evaluate it and several other commonly used gene-level methods with Monte Carlo simulation in order to understand the strengths, weaknesses, and applicable conditions of each, and to apply the new method to real coronary heart disease (CHD) GWAS data to identify CHD-related susceptibility network modules and genes, providing new clues for research on the pathogenesis of complex diseases.

Methods:
1. Simulate gene-level genetic association data with the Monte Carlo method. Genotype data are first assumed to be continuous variables that follow a multivariate normal distribution, and continuous simulated data are generated from a preset correlation matrix, that is, the matrix of LD coefficients (the initial LD matrix). The simulated data are then discretized by intervals according to preset genotype frequencies for the case group and the control group, producing genetic data that satisfy all preset conditions and whose correlation matrix equals the initial LD matrix.
2. Evaluate gene-level association methods on the Monte Carlo simulated data. We developed a new gene-level association method based on LD structure (LD-Fisher): a haplotype analysis algorithm first analyzes the LD structure of the gene to obtain relatively independent haplotype blocks and the most significantly associated SNP in each block, and the Fisher combination method then yields a combined gene-level result. Using parameters such as allele frequencies in cases and controls, the association coefficient between SNP and disease, the number of SNPs, the number of haplotype blocks, the number of susceptibility SNPs, and the LD structure of the SNPs, we simulated a range of preset values and combinations with the Monte Carlo method and used the simulated data to evaluate the statistical power of several gene-level association methods.
3. Apply gene-level association analysis to CHD GWAS data to identify CHD susceptibility network modules and genes. Based on the gene-level association analysis of the CHD GWAS data, we constructed a CHD-related biological information network and analyzed its modules and characteristics.

Results:
1. We implemented the Monte Carlo simulation of gene-level genetic association data in SAS. The allele frequencies and LD structure of the simulated data were very close to the preset parameters.
2. Among the gene-level association methods, principal component analysis with logistic regression (PCA-logistic) and our LD-Fisher performed best. With a high cumulative contribution threshold of 95% (PCA95), the statistical power of PCA-logistic was close to 1 regardless of the number of haplotype blocks, but results were unsatisfactory when the threshold was lowered to 85% (PCA85). LD-Fisher overcame the sensitivity of the Fisher combination method to the LD structure of the SNPs. With one haplotype block its power was close to 1, slightly lower than PCA95 and higher than PCA85; with multiple haplotype blocks it matched the power of PCA95.
3. Gene-level association analysis of CHD (LD-Fisher) combined with biological network analysis identified four CHD susceptibility network modules, the most important of which contains 15 interconnected functional submodules. The modules contain two important CHD susceptibility genes, MAPK10 (OR=32.5, P3.51011) and COL4A2 (OR=2.7, P2.81010), which were independently validated by other gene-level association methods and by another GWAS dataset.

Conclusions:
1. The Monte Carlo simulation method we developed for gene-level genetic association data can generate simulated data that satisfy preset parameters. It can be used to evaluate gene-level association methods and also other genetic association methods.
2. Our gene-level association method, LD-Fisher, has high statistical power close to that of PCA-logistic and also has an intuitive and concise genetic interpretation, so it can be used for gene-level association analysis of polygenic complex diseases.
3. Application to real CHD GWAS data shows that gene-level association analysis and biological network analysis can compensate for the shortcomings of SNP-only association analysis and advance research on the susceptibility and molecular mechanisms of polygenic complex diseases.

Contents

Abstract (Chinese)
Abstract
Introduction
Part 1. Simulation of genetic association data and its SAS implementation
    1 Background and objectives
    2 Materials and methods
        2.1 Simulation principle
        2.2 Initial LD matrix
        2.3 Simulation of continuous genotype data
        2.4 Setting genotype frequencies
        2.5 Discretization of continuous genotype data
        2.6 Generating simulated data in batches
    3 Results
        3.1 Simulating genetic association data for one haplotype block
        3.2 Simulating genetic association data for two haplotype blocks
    4 Discussion
Part 2. Evaluating gene-level association methods with simulated data
    1 Background and objectives
    2 Materials and methods
        2.1 Simulation of genetic association data
        2.2 Commonly used gene-level association methods
        2.3 Our gene-level association method (LDFisher)
    3 Results
    4 Discussion
Part 3. Applying gene-level association analysis to identify CHD susceptibility network modules and susceptibility genes
    1 Background and objectives
    2 Materials and methods
        2.1 Gene-level association analysis of CHD (LDFisher)
        2.2 Constructing and analyzing the CHD protein-protein interaction network
        2.3 Validating susceptibility genes in functional modules with the most-significant-SNP method and VEGAS
        2.4 Validating functional modules and susceptibility genes with the CARDIoGRAMplusC4D dataset
        2.5 Validating previously reported CHD susceptibility genes with our GWAS-based network analysis
    3 Results
        3.1 The CHD PPI network and susceptibility modules
        3.2 Validation by literature search
        3.3 Validation with the most-significant-SNP method and VEGAS
        3.4 Validation with an independent dataset
        3.5 Validation of previously reported CHD susceptibility genes
    4 Discussion
References
Appendix A. Review
    References
Appendix B. Program code
Research output during the degree program
Acknowledgments
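The simulation and combination steps described in the abstract can be sketched roughly as follows. The thesis implemented its simulation in SAS (Appendix B, not reproduced on this page), so this is a Python re-sketch under stated assumptions and not the author's code. The names `simulate_genotypes` and `fisher_combine`, the LD matrix, the genotype frequencies, and the per-block p-values are all hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np

def simulate_genotypes(ld_matrix, genotype_freqs, n, rng):
    """Draw n genotype vectors (0/1/2) from latent normals correlated by ld_matrix.

    genotype_freqs holds one (f0, f1, f2) triple per SNP: the target
    frequencies of genotypes 0, 1 and 2 (each assumed strictly positive).
    """
    ld = np.asarray(ld_matrix, dtype=float)
    latent = rng.multivariate_normal(np.zeros(len(ld)), ld, size=n)
    std_normal = NormalDist()
    genotypes = np.zeros(latent.shape, dtype=int)
    for j, (f0, f1, _f2) in enumerate(genotype_freqs):
        # Cut points on the standard normal scale that give the target frequencies
        low = std_normal.inv_cdf(f0)
        high = std_normal.inv_cdf(f0 + f1)
        genotypes[:, j] = (latent[:, j] > low).astype(int) + (latent[:, j] > high)
    return genotypes

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)   # half of the chi-square statistic
    # Closed-form chi-square survival function, valid because the df (2k) is even
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

rng = np.random.default_rng(1)
ld = [[1.0, 0.8], [0.8, 1.0]]                    # assumed initial LD matrix
freqs = [(0.49, 0.42, 0.09), (0.25, 0.50, 0.25)]  # assumed genotype frequencies
geno = simulate_genotypes(ld, freqs, n=5000, rng=rng)

# Hypothetical smallest SNP p-values from two haplotype blocks of one gene
gene_p = fisher_combine([0.01, 0.20])
```

In a case-control simulation, `simulate_genotypes` would be called once per group with that group's genotype frequencies. The sketch does not check how closely the correlation of the discretized genotypes matches the input matrix.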
