How to compare multiple lines in one file and output a combined entry
I have a file with four columns:
chr start end transcript
like this:
```
chrI 128980 129130 F53G12.5b
chrI 132280 132430 F53G12.5c.2
chrI 132280 132430 F53G12.5a
chrI 132280 132430 F53G12.5b
chrI 132280 132430 F53G12.5c.1
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2a
chrI 163220 163370 F56C11.2b
chrI 173900 174050 F56C11.6a
chrI 173900 174050 F56C11.6b
chrI 173900 174050 F56C11.6c
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
```
Many regions (described by chr start end) are repeated because they map to more than one transcript.
For example:
```
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
```
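What I want is code that takes the lines with identical columns 1, 2 and 3, gets the shortest common part of column 4 from them (F53G12.5 in this case), and outputs a single condensed entry, i.e.: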
```
chrI 133600 133750 F53G12.5
```
Or, for example:
```
chrI 83280 83430 Y48G1C.10a
chrI 90420 90570 Y48G1C.10b
chrI 90420 90570 Y48G1C.10c
chrI 90420 90570 Y48G1C.10a
```
it should give
```
chrI 83280 83430 Y48G1C.10a
chrI 90420 90570 Y48G1C.10
```
Do you have any ideas about this? Thanks a lot.
```awk
# a[key] holds the common prefix seen so far for each chr/start/end key.
{ key = $1 " " $2 " " $3 }

(key in a) {
    word = ""
    split($4, w1, "")          # split into single characters (gawk and most modern awks)
    split(a[key], w2, "")
    t = (length($4) < length(a[key])) ? length($4) : length(a[key])
    for (x = 1; x <= t; x++) {
        if (w1[x] != w2[x]) break    # stop at the first mismatch so only a prefix is kept
        word = word w1[x]
    }
    a[key] = word
    next
}

{ a[key] = $4 }

END {
    for (x in a) print x, a[x]   # iteration order is unspecified
}
```
Your file:
```
$ cat file
chrI 128980 129130 F53G12.5b
chrI 132280 132430 F53G12.5c.2
chrI 132280 132430 F53G12.5a
chrI 132280 132430 F53G12.5b
chrI 132280 132430 F53G12.5c.1
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2a
chrI 163220 163370 F56C11.2b
chrI 173900 174050 F56C11.6a
chrI 173900 174050 F56C11.6b
chrI 173900 174050 F56C11.6c
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
```
Output:
```
$ awk -f script.awk file
chrI 173900 174050 F56C11.6
chrI 128980 129130 F53G12.5b
chrI 182240 182390 F56C11.3
chrI 139100 139250 F53G12.3
chrI 136240 136390 F53G12.4
chrI 132280 132430 F53G12.5
chrI 163220 163370 F56C11.2
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
chrI 133600 133750 F53G12.5
```
I suspect this could be done neatly with Pandas, much better than this, but I'm not very familiar with Pandas, so... Submitted without debugging.
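```python
import csv
from collections import defaultdict

def longest_identical_substring(words):
    # Try prefixes of the first word from longest to shortest and
    # return the first one that every word shares.
    for idx in range(len(words[0]), 0, -1):
        substrings = [w[:idx] for w in words]
        if max(substrings) == min(substrings):
            return substrings[0]
    return ""

transcripts = defaultdict(list)
with open('myfile.csv') as infile:
    # Assumes space-separated columns; use delimiter='\t' for a tab-separated file.
    reader = csv.reader(infile, delimiter=' ', skipinitialspace=True)
    for row in reader:
        # A list is not hashable, so the key has to be a tuple.
        transcripts[tuple(row[:3])].append(row[3])

for (chrom, start, end), ts in transcripts.items():
    print(chrom, start, end, longest_identical_substring(ts))
```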
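The prefix helper in this answer could also lean on the standard library: `os.path.commonprefix` compares strings character by character, despite living in `os.path`. The following is only a minimal sketch, assuming whitespace-separated input; `condense` is a hypothetical name, not something from the answer.

```python
import os.path
from collections import defaultdict

def condense(lines):
    """Group rows by (chr, start, end) and reduce column 4 to its common prefix."""
    groups = defaultdict(list)          # dicts keep insertion order (Python 3.7+)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue                    # skip blank or malformed lines
        groups[tuple(fields[:3])].append(fields[3])
    return [key + (os.path.commonprefix(names),) for key, names in groups.items()]

# Second example from the question.
rows = [
    "chrI 83280 83430 Y48G1C.10a",
    "chrI 90420 90570 Y48G1C.10b",
    "chrI 90420 90570 Y48G1C.10c",
    "chrI 90420 90570 Y48G1C.10a",
]
for entry in condense(rows):
    print(*entry)
```

On the question's second example this should print `chrI 83280 83430 Y48G1C.10a` followed by `chrI 90420 90570 Y48G1C.10`.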
You guys are awesome! Thanks for submitting all the answers. Here is another solution using Perl:
```perl
#!/usr/bin/env perl
#use Data::Dumper qw(Dumper);
use strict;
use warnings;

my $filename = $ARGV[0];
my @matrix;
my @transcripts;
my @transcript;
my %referenceTable;
my $count=0;
my $oldkey="";
my $key="";
my @keys;
my @key;
my %hash;

open FILE, "< $filename" or die "can not open file\n";

while (my $line=<FILE>) {
    chomp $line;    # drop the newline so it does not end up in the transcript name
    my ($chromosome, $start, $stop, $transcript) = split("\t", $line);
    $key = $chromosome . "SPACE" . $start . "SPACE" . $stop;
    if ($oldkey ne $key) {
        $count = 0;
        $oldkey = $key;
    }
    push @{$referenceTable{$key}}, $transcript;
    $count++;
}

my $output;
my ($k, $v); #Not @v -- $v will contain string that will be a reference to an array
while (($k, $v) = each(%referenceTable)){
    my ($chromosome, $start, $stop) = split(/SPACE/, $k);
    print "chromosome start stop : $chromosome\t $start\t $stop \t";
    print "Common prefix : \t";
    $output = getleastcommonprefix(@{$v});
    print $output . "\n";
}

#print Dumper \%referenceTable;

sub getleastcommonprefix {
    my @searcharray = @_;
    my $common = $searcharray[0];
    foreach my $index (1 .. $#searcharray) {
        # Append the reversed second string, then match a prefix ($1) whose
        # reverse is also the end of the combined string.
        $_ = $searcharray[0] . reverse $searcharray[$index];
        m/(.*)(.*)(??{quotemeta reverse $1})/s;
        if (length $1 < length $common) {
            $common = $1;
        }
    } ## end foreach my $index (1 .. $#searcharray)
    return $common;
} ## end sub getleastcommonprefix

#print 'Common prefix for file $filename [' . getleastcommonprefix(@array_of_test_names) . "]\n";
```
Using awk:
```
awk '{sub(/[^0-9]+/,"",$2);NF=2} !a[$1]++' FS=. OFS=. file

chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
```
In Python, I think:
```python
import itertools
from operator import itemgetter

def combine(data):
    for group, group_lines in itertools.groupby(data, itemgetter(0, 1, 2)):
        names = [line[3] for line in group_lines]
        prefix = "".join(t[0] for t in itertools.takewhile(
            lambda x: len(set(x)) == 1, zip(*names)))
        yield group + (prefix,)
```
Run it something like this:
```python
with open(filename) as f:
    for item in combine(line.split() for line in f):
        print("{:8}{:8}{:8}{}".format(*item))
```
Example run:
```
>>> data = """chrI 128980 129130 F53G12.5b
... chrI 132280 132430 F53G12.5c.2
... chrI 132280 132430 F53G12.5a
... chrI 132280 132430 F53G12.5b
... chrI 132280 132430 F53G12.5c.1
... chrI 133600 133750 F53G12.5c.2
... chrI 133600 133750 F53G12.5a
... chrI 133600 133750 F53G12.5b
... chrI 133600 133750 F53G12.5c.1
... chrI 136240 136390 F53G12.4
... chrI 139100 139250 F53G12.3
... chrI 163220 163370 F56C11.2a
... chrI 163220 163370 F56C11.2b
... chrI 173900 174050 F56C11.6a
... chrI 173900 174050 F56C11.6b
... chrI 173900 174050 F56C11.6c
... chrI 182240 182390 F56C11.3
... chrI 184080 184230 Y48G1BL.2a
... chrI 190720 190870 Y48G1BL.2a""".splitlines()
>>> for item in combine(line.split() for line in data):
...     print("{:8}{:8}{:8}{}".format(*item))
...
chrI    128980  129130  F53G12.5b
chrI    132280  132430  F53G12.5
chrI    133600  133750  F53G12.5
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2
chrI    173900  174050  F56C11.6
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2a
chrI    190720  190870  Y48G1BL.2a
```
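One thing worth keeping in mind with this approach: `itertools.groupby` only merges consecutive rows that share a key, so it relies on the file already being ordered by the first three columns, as the sample is. Below is a rough sketch of a wrapper that sorts first. `combine_sorted` and `common_prefix` are hypothetical names, and sorting the coordinates numerically is an assumption of this sketch rather than part of the answer.

```python
import itertools
from operator import itemgetter

def common_prefix(names):
    # zip(*names) yields one tuple of characters per position; keep going
    # while every name has the same character at that position.
    same = itertools.takewhile(lambda chars: len(set(chars)) == 1, zip(*names))
    return "".join(chars[0] for chars in same)

def combine_sorted(rows):
    # Sort so equal (chr, start, end) keys are adjacent, as groupby requires.
    ordered = sorted(rows, key=lambda r: (r[0], int(r[1]), int(r[2])))
    for key, group in itertools.groupby(ordered, key=itemgetter(0, 1, 2)):
        yield key + (common_prefix([row[3] for row in group]),)

# Rows deliberately out of order.
shuffled = [
    ["chrI", "133600", "133750", "F53G12.5a"],
    ["chrI", "128980", "129130", "F53G12.5b"],
    ["chrI", "133600", "133750", "F53G12.5c.1"],
]
for item in combine_sorted(shuffled):
    print(*item)
```

Without the sort, the two `133600` rows would land in separate groups and neither would be shortened.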
Without all the debugging, here is a simple awk statement:
```
awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 132280 132430 F53G12.5
chrI 132280 132430 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 133600 133750 F53G12.5
chrI 133600 133750 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 173900 174050 F56C11.6
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2

awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort|uniq
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
```
To sort by column, pass the column IDs to sort with -nrk (n for numeric, r for reverse, k for key):
```
awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort -nrk2,3|uniq
chrI 190720 190870 Y48G1BL.2
chrI 184080 184230 Y48G1BL.2
chrI 182240 182390 F56C11.3
chrI 173900 174050 F56C11.6
chrI 163220 163370 F56C11.2
chrI 139100 139250 F53G12.3
chrI 136240 136390 F53G12.4
chrI 133600 133750 F53G12.5
chrI 132280 132430 F53G12.5
chrI 128980 129130 F53G12.5
```
Update, working by column:
```
awk '{ if( match($4, /[0-9a-zA-Z]+\.[0-9a-zA-Z]/)) { trimmed=substr($4,RSTART,RLENGTH); } print $1"\t"$2"\t"$3"\t"trimmed;}' test.txt |sort|uniq
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
```
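The pattern-based commands keep exactly one character after the dot, so their result depends on how the transcript names are built rather than on which rows share coordinates. The sketch below illustrates the difference on the question's second example. `trim_by_pattern` is a hypothetical Python stand-in for the awk `match` call, not the answer's code, and falling back to the whole name when nothing matches is an assumption of this sketch.

```python
import os.path
import re

def trim_by_pattern(name):
    # Same regular expression as the awk match(): letters/digits, a dot,
    # then exactly one more letter or digit.
    m = re.search(r"[0-9a-zA-Z]+\.[0-9a-zA-Z]", name)
    return m.group(0) if m else name    # fallback is an assumption of this sketch

# Transcripts sharing chrI 90420 90570 in the question.
names = ["Y48G1C.10b", "Y48G1C.10c", "Y48G1C.10a"]
print(sorted({trim_by_pattern(n) for n in names}))   # pattern-based trimming
print(os.path.commonprefix(names))                   # prefix shared by the group
```

With these names the pattern stops after the first digit following the dot, while the shared prefix keeps both digits, so the two approaches can disagree once the number after the dot has more than one digit.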
Here is last night's effort, now finished :-)
```bash
#!/bin/bash
sort file |\
awk '
   NR==1 {f123=$1" "$2" "$3;trans=$4;next}   # NR=1, i.e. first line

   {                                          # NR!=1, i.e. subsequent lines
      if(f123!=$1" "$2" "$3){                 # Fields 1-3 have changed
         printf "%s %s\n",f123,trans
         f123=$1" "$2" "$3;trans=$4
      }else{                                  # Fields 1-3 unchanged, do common transcript
         newtrans=$4
         x=length(newtrans)                   # Get shorter of two transcripts
         if(length(trans)<x) x=length(trans)
         # Copy common part, stopping at the first character that differs
         common=""
         for(c=1;c<=x;c++){
            if(substr(trans,c,1)!=substr(newtrans,c,1))break
            common=common""substr(trans,c,1)
         }
         trans=common
      }
   }
   END {if(NR)printf "%s %s\n",f123,trans}    # Output the last group still held in trans
'
```
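Some notes... Basically, the input file is sorted so that records with the same chr/start/end values sit next to each other. This is then piped into awk. When the first line is read, fields (columns) 1 to 3 are joined together and saved in the variable "f123". When subsequent lines are read, their first three columns are compared with the last three seen. If any part of the first three columns has changed, the last key seen is output together with its transcript. If the first three columns have not changed, we have a new transcript to deal with. The prefix common to the last transcript and the current one is then worked out by copying letters until one differs, and it is saved to be output the next time columns 1 to 3 change. When we reach the last record, the group we have been accumulating still has to be written, so it is output at the end.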
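The single-pass logic described in these notes can be sketched in Python as well. This is only an illustration under the same assumption, namely that the input is already sorted so that equal keys are adjacent; `stream_condense` is a hypothetical name.

```python
def stream_condense(sorted_lines):
    """Yield (key, prefix) pairs from lines already sorted by columns 1-3."""
    key, trans = None, None
    for line in sorted_lines:
        fields = line.split()
        if len(fields) < 4:
            continue                       # ignore blank or short lines
        new_key, new_trans = tuple(fields[:3]), fields[3]
        if new_key != key:
            if key is not None:
                yield key, trans           # columns 1-3 changed: emit previous group
            key, trans = new_key, new_trans
        else:
            # Copy characters until the first one that differs.
            n = 0
            while n < min(len(trans), len(new_trans)) and trans[n] == new_trans[n]:
                n += 1
            trans = trans[:n]
    if key is not None:
        yield key, trans                   # emit the final group

lines = sorted([
    "chrI 163220 163370 F56C11.2a",
    "chrI 163220 163370 F56C11.2b",
    "chrI 182240 182390 F56C11.3",
])
for key, prefix in stream_condense(lines):
    print(*key, prefix)
```

Only one group is held in memory at a time, which is the main appeal of sorting first instead of collecting every key in an array.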