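GIREMI: RNA editing site analysis software
Introduction:
GIREMI is a tool that identifies RNA editing sites from called SNP sites, a dbSNP database, and the reference genome (genome.fa).
It computes the mutual information (MI) of mismatch pairs identified in sequencing reads to distinguish RNA editing sites from SNPs. It also trains a generalized linear model (GLM) to further improve predictive power; the model uses sequence bias information together with the difference between the mismatch ratio of an unknown single-nucleotide variant (SNV) and the estimated allelic ratio of its gene.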
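For intuition about the MI statistic itself, here is a minimal sketch that computes the mutual information between the alleles observed at two sites, given how many reads carry each allele combination. It is not GIREMI's implementation: the function name `mutual_information` and the three-column input layout (allele at site 1, allele at site 2, read count) are assumptions made for illustration. The general idea is that alleles of germline SNPs on the same haplotype tend to co-occur in reads (high MI), whereas a site whose mismatches appear independently of its neighbours gives an MI near zero.

```shell
# mutual_information: hypothetical helper, not part of GIREMI.
# Reads tab-separated rows "alleleA<TAB>alleleB<TAB>read_count" on stdin
# and prints the mutual information between the two sites in bits.
mutual_information() {
  awk -F'\t' '
    {
      joint[$1 FS $2] += $3   # reads carrying this allele pair
      px[$1] += $3            # marginal count at site 1
      py[$2] += $3            # marginal count at site 2
      n += $3                 # total reads covering both sites
    }
    END {
      if (n == 0) { print "0.0000"; exit }
      mi = 0
      for (k in joint) {
        split(k, p, FS)
        pxy = joint[k] / n
        if (pxy > 0)
          mi += pxy * log(pxy / ((px[p[1]] / n) * (py[p[2]] / n))) / log(2)
      }
      printf "%.4f\n", mi
    }'
}
```

For example, `printf 'A\tC\t10\nG\tT\t10\n' | mutual_information` prints `1.0000` (the two sites are perfectly linked), while four equally frequent allele combinations print `0.0000`.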
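1. Environment and variables
- Linux, Perl, R;
- At least 8 GB of memory is needed to process a typical human dataset;
- HTSlib: used to read SAM/BAM files; make sure its shared library path is set in your shell configuration file;
- samtools: used to generate the faidx index of the reference genome sequence; most reference genome folders already contain one.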
```shell
export PATH=/usr/local/bin/samtools:$PATH
export LD_LIBRARY_PATH=/home/fanyucai/software/glibc/glibc-v2.14/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/fanyucai/software/samtools/samtools-bcftools-htslib-1.0_x64-linux/lib/:$LD_LIBRARY_PATH
```
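Before running anything, it can help to confirm that the required programs are actually reachable on `PATH`. The sketch below uses a hypothetical helper named `check_tools` (not part of GIREMI or samtools) that reports every command it cannot find.

```shell
# check_tools: hypothetical helper that prints "missing: NAME" for every
# command in its argument list that is not on PATH, and returns non-zero
# if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Typical call for this workflow:
#   check_tools perl Rscript samtools
```

If the reference genome has no `.fai` index yet, `samtools faidx genome/genome.fa` creates `genome/genome.fa.fai` next to the FASTA file.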
2. Input files
- SNP result tables, the dbSNP file for the species (usually in VCF format), and the reference genome folder;

SNP_result.xls

dbSNP
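Because the SNP tables, dbSNP and the reference genome are joined on chromosome and position, their chromosome names need to agree (for example `chr1` versus `1`). The following sketch lists chromosome names that occur in a VCF but not in the genome's faidx index. The function name `chroms_missing_from_index` is hypothetical, and the index path in the usage line assumes the usual `genome.fa.fai` name.

```shell
# chroms_missing_from_index: hypothetical helper.
#   $1 = VCF file (header lines starting with '#' are skipped)
#   $2 = faidx index (.fai); its first column holds the sequence names
# Prints each chromosome name found in the VCF but absent from the index.
chroms_missing_from_index() {
  vcf=$1
  fai=$2
  awk -F'\t' '
    NR == FNR { known[$1] = 1; next }          # first file: the .fai index
    /^#/      { next }                         # skip VCF header lines
    !($1 in known) && !seen[$1]++ { print $1 } # report each unknown name once
  ' "$fai" "$vcf"
}

# Usage (paths are examples):
#   chroms_missing_from_index dbSNP genome/genome.fa.fai
```

An empty result means every chromosome name in the VCF is present in the reference index.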
3. Analysis workflow
```shell
### 1. Gene strand information
less genome/gene.gtf | awk -F'\t|;' -v OFS='\t' '{print $10,$7}' | sort -u | sed 's/gene_id "//g;s/"//g;s/ //g' > gene2strand.txt

### 2. Map SNPs to rs IDs (memory-hungry, around 20 GB)
## Note: chromosome names must match those of the reference genome
less dbSNP | grep -v "^#" | awk -F'\t' -v OFS='\t' '{print $1,$2,$4,$5,$3}' > genome/dbSNP.vcf
cat SNP*.xls | cut -f 1-4 | sort -u > snp.all.txt
#R --slave <<EOF
#a<-read.table("snp.all.txt",sep="\t")
#b<-read.table("genome/dbSNP.vcf",sep="\t")
#mergedb<-merge(a,b,by.x=c("V1","V2"),by.y=c("V1","V2"),all.x=T)
#write.table(mergedb, file = "all.SNP.RS", sep = "\t", quote = F, row.names = F)
#EOF
### Alternatively, a smaller dbSNP table can be matched with awk;
### the appended column is 1 if chrom/pos/ref/alt is in dbSNP, else 0
less snp.all.txt | awk -F'\t' -v OFS='\t' 'NR==FNR{a[$1,$2,$3,$4]=$5; next} {print $0, (($1,$2,$3,$4) in a ? "1" : "0")}' genome/dbSNP.vcf - > all.SNP.RS

### 3. Reformat each sample's SNP file into the GIREMI input list
### ($cwd must already hold the working directory, for example cwd=$(pwd))
for Sample in bam/*.bam; do
  n=$(basename "$Sample" .bam)
  # columns written: chrom, pos-1, pos, gene ("Inte" if intergenic), dbSNP flag, strand ("#" if unknown)
  less "bam/SNP_$n.xls" | sed '1d' | cut -f 1-4,12 \
    | awk -F'\t' -v OFS='\t' '{if($5=="")print $1,$2,$3,$4,"Inte";else print $0}' \
    | awk -F'\t' -v OFS='\t' 'NR==FNR{b[$1]=$2; next} {print $0, ($5 in b ? b[$5] : "#")}' gene2strand.txt - \
    | awk -F'\t' -v OFS='\t' 'NR==FNR{b[$1,$2,$3,$4]=$5; next} {print $1,($2-1),$2,$5,(($1,$2,$3,$4) in b ? b[$1,$2,$3,$4] : "0"),$6}' all.SNP.RS - \
    | awk -F'\t' -v OFS='\t' '{if($4 ~/^gene/)print $1,$2,$3,"Inte",$5,$6;else print $0}' > "input.$n.tmp"

  ### 4. Identify RNA editing (RNAE) sites
  "$cwd/giremi" -f "$cwd/genome/genome.fa" -l "$cwd/input.$n.tmp" -o "$cwd/output.$n.txt" "$cwd/$Sample"

  ### 5. R script that decides whether each site is RNAE
  #Rscript /public/cluster2/works/lipeng/RNA_seq/test/RNAedit/GIREMI/giremi.r -i output.$n.txt -o result.$n.xls;
done
```
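To make the target of step 3 concrete, the sketch below builds the same six-column layout from an already simplified table. The helper name `to_giremi_list` and its five-column input (chromosome, 1-based position, gene, dbSNP flag, strand) are assumptions for illustration; the real pipeline derives these fields from the SNP tables, `gene2strand.txt` and `all.SNP.RS`.

```shell
# to_giremi_list: hypothetical helper.
# stdin : chrom <TAB> pos (1-based) <TAB> gene <TAB> in_dbSNP (1/0) <TAB> strand
# stdout: chrom <TAB> pos-1 <TAB> pos <TAB> gene <TAB> in_dbSNP <TAB> strand
# An empty gene becomes "Inte" (intergenic) and an empty strand becomes "#",
# mirroring the placeholders used in step 3.
to_giremi_list() {
  awk -F'\t' -v OFS='\t' '{
    gene   = ($3 == "" ? "Inte" : $3)
    strand = ($5 == "" ? "#" : $5)
    print $1, $2 - 1, $2, gene, $4, strand
  }'
}
```

For instance, the input line `chr1<TAB>1000<TAB>GENE1<TAB>1<TAB>+` becomes `chr1<TAB>999<TAB>1000<TAB>GENE1<TAB>1<TAB>+`.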
4. Results

RNAE.result