Sum an unsorted list of columns in csv file?
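I'm still pretty new to Python (and very rusty at scripting; my last attempts were with bash and Perl around 2001). I have tried searching SO, but to be honest I don't even know what to look for. I'm fairly sure this is trivial, and I'm a bit ashamed.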
I have a fairly large CSV file (about 26k lines) in tab-delimited format:

    name, current_value, current_pct, change_pct
    ItemA   2452434324   7,70%    -1,19
    ItemB   342331       2,40%    -0,45
    ItemC   32412123     3,90%    3,87
    ItemD   0            0        -4,52
    ItemE   12318231     14,80%   0
    ItemA   542312134    1,60%    0,11
    ItemC   2423423425   11,21%   -0,01
    ItemE   3141888103   30,00%   0
    ItemB   78826        1,01%    12,01
    ItemA   89937        0,04%    0
    ...
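There are about 300 "items" in total (they repeat, but in varying order, and some occur only once or twice). Each has a "current value" (an integer ranging from 0 to about a billion, that is, a thousand million), a current percentage value (not interesting to me at the moment), and a percentage change since the last reading (from a different file, also not interesting to me at the moment).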
What I want to achieve is the summed `change_pct` for each `name`:

    name    total_pct_change
    ItemA   -1,08
    ItemB   11,56
    ItemC   3,86
    ItemD   -4,52
    ItemE   0
I was planning to create a ...

What I have so far:

    import csv, sys, string
    xlsfile = sys.argv[1]
    with open(xlsfile, 'rb') as f:
        reader = csv.reader(f, delimiter='\t')
        item = row[0]
        pct_change = row[3]
        # this is where I draw a blank
        # was thinking of something akin to
        # foreach item do sum(pct_change)
        # but that's obviously wrong
        print item, sum_pct_change
    f.close()
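A working pandas solution:

    import pandas as pd

    # xlsfile is the input path, as in the question (sys.argv[1])
    with open(xlsfile) as fobj:
        # the header row is comma-separated, unlike the data rows
        header = [entry.strip() for entry in next(fobj).split(',')]
    # sep=r'\s+' splits on any whitespace (replaces the deprecated delim_whitespace=True)
    data = pd.read_csv(xlsfile, sep=r'\s+', decimal=',',
                       names=header, skiprows=1)
    summed = data.groupby(by=['name'])['change_pct'].sum()
    print(summed)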
Output:

    name
    ItemA    -1.08
    ItemB    11.56
    ItemC     3.86
    ItemD    -4.52
    ItemE     0.00
    Name: change_pct, dtype: float64
EDIT

If the file is semicolon-separated instead:

    data = pd.read_csv('pct2.csv', sep=';', decimal=',')
    summed = data.groupby(by=['name'])['change_pct'].sum()
    print(summed)
Pandas is an excellent tool for working with tabular data.

Here, you would do:

    import pandas as pd

    data = pd.read_csv('path_to_your_file', sep='\t', header=0, decimal=',')

    summed = data.groupby(by=['name'])['change_pct'].sum()

    summed.to_csv('name_of_output_file', sep='\t')
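A few things to watch out for: if the column names contain whitespace, you will need to clean them up or use the exact column names in the code above (e.g. ...).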
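One way to do the cleanup mentioned above is to strip the column labels right after reading. This is only a sketch: the `raw` string is a made-up in-memory sample with stray spaces in its header, not the asker's actual file.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample whose header cells carry stray spaces.
raw = (
    "name\t current_value\t current_pct\t change_pct \n"
    "ItemA\t2452434324\t7,70%\t-1,19\n"
    "ItemB\t342331\t2,40%\t-0,45\n"
    "ItemA\t542312134\t1,60%\t0,11\n"
)

data = pd.read_csv(io.StringIO(raw), sep="\t", header=0, decimal=",")

# Strip leading/trailing whitespace from every column label,
# so that 'change_pct' can be used as written.
data.columns = data.columns.str.strip()

summed = data.groupby("name")["change_pct"].sum()
print(summed)
```

Without the `str.strip()` step, the last column would have to be addressed by its exact label, trailing space included.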
Here's a fairly readable approach that converts each row read into a `namedtuple`:

    from collections import namedtuple
    import csv
    import sys

    xlsfile = sys.argv[1]

    # define field names for easy access
    Record = namedtuple('Record', 'name, current_value, current_pct, change_pct')

    totals = {}  # dictionary to hold totals
    # Python 3: open in text mode with newline='' for the csv module
    with open(xlsfile, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip over header row
        for rec in (Record._make(row) for row in reader):
            # the data uses a decimal comma, so convert it before float()
            change = float(rec.change_pct.replace(',', '.'))
            totals[rec.name] = totals.get(rec.name, 0.0) + change

    print('name total_change_pct')
    for item in sorted(totals.items()):
        print('{:5} {:.2f}'.format(item[0], item[1]))
Output:

    name total_change_pct
    ItemA -1.08
    ItemB 11.56
    ItemC 3.86
    ItemD -4.52
    ItemE 0.00
Using a `defaultdict`:

    from collections import defaultdict

    with open(xlsfile) as fobj:
        next(fobj)  # throw away first line
        res = defaultdict(float)
        for line in fobj:
            values = line.split()  # split at whitespace
            # use value of first column as key
            # take value of last column replace `,` by `.` and convert to `float`
            # and use as value
            res[values[0]] += float(values[-1].replace(',', '.'))
    print(res)
Output:

    defaultdict(float,
                {'ItemA': -1.0799999999999998,
                 'ItemB': 11.56,
                 'ItemC': 3.8600000000000003,
                 'ItemD': -4.52,
                 'ItemE': 0.0})
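Totals such as `-1.0799999999999998` are ordinary binary floating-point rounding. If exact two-decimal sums matter, one option is to swap `float` for `decimal.Decimal` from the standard library. The following is a sketch of that variation, not part of the answer above; the `sample` data and the `sum_changes` helper are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from decimal import Decimal
import io

# Hypothetical in-memory stand-in for the tab-delimited file from the question.
sample = io.StringIO(
    "name, current_value, current_pct, change_pct\n"
    "ItemA\t2452434324\t7,70%\t-1,19\n"
    "ItemB\t342331\t2,40%\t-0,45\n"
    "ItemA\t542312134\t1,60%\t0,11\n"
    "ItemB\t78826\t1,01%\t12,01\n"
)


def sum_changes(fobj):
    """Return {name: Decimal total of the last column}, skipping the header."""
    next(fobj)  # throw away the header line
    totals = defaultdict(Decimal)  # Decimal() is Decimal('0')
    for line in fobj:
        values = line.split()  # split at whitespace
        if not values:  # ignore blank lines
            continue
        totals[values[0]] += Decimal(values[-1].replace(",", "."))
    return totals


totals = sum_changes(sample)
for name in sorted(totals):
    # Convert back to a decimal comma for display.
    print(name, str(totals[name]).replace(".", ","))
# prints:
# ItemA -1,08
# ItemB 11,56
```

With a real file, `sum_changes(open(xlsfile))` would take the place of the in-memory sample.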