Pandas DataFrame: find intervals and count occurrences
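I have a list in which different events occur intermixed. For example, event1 might occur three times, then another event occurs, and later event1 occurs again.
What I need is the intervals for each event and the number of times the event occurs within each of those intervals.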
import pandas as pd

values = {
    '2017-11-28 11:00': 'event1',
    '2017-11-28 11:01': 'event1',
    '2017-11-28 11:02': 'event1',
    '2017-11-28 11:03': 'event2',
    '2017-11-28 11:04': 'event2',
    '2017-11-28 11:05': 'event1',
    '2017-11-28 11:06': 'event1',
    '2017-11-28 11:07': 'event1',
    '2017-11-28 11:08': 'event3',
    '2017-11-28 11:09': 'event3',
    '2017-11-28 11:10': 'event2',
}

df = pd.DataFrame.from_dict(values, orient='index').reset_index()
df.columns = ['time', 'event']
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
df.sort_index(inplace=True)
df.head()
The expected result is:
occurrences = [
    {'start': '2017-11-28 11:00',
     'end': '2017-11-28 11:02',
     'event': 'event1',
     'count': 3},
    {'start': '2017-11-28 11:03',
     'end': '2017-11-28 11:04',
     'event': 'event2',
     'count': 2},
    {'start': '2017-11-28 11:05',
     'end': '2017-11-28 11:07',
     'event': 'event1',
     'count': 3},
    {'start': '2017-11-28 11:08',
     'end': '2017-11-28 11:09',
     'event': 'event3',
     'count': 2},
    {'start': '2017-11-28 11:10',
     'end': '2017-11-28 11:10',
     'event': 'event2',
     'count': 1},
]
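I was thinking of using pd.merge_asof to find the start/end times of the intervals, and pd.cut (as described here) to group and count. But somehow I am stuck. Any help is appreciated.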
Try the following:
In [68]: x = df.reset_index()

In [69]: (x.groupby(x.event.ne(x.event.shift()).cumsum())
    ...:   .apply(lambda x:
    ...:       pd.DataFrame({
    ...:           'start': [x['time'].min()],
    ...:           'end': [x['time'].max()],
    ...:           'event': [x['event'].iloc[0]],
    ...:           'count': [len(x)]})
    ...:   )
    ...:   .reset_index(drop=True)
    ...:   .to_dict('records')
    ...: )
Out[69]:
[{'count': 3,
  'end': Timestamp('2017-11-28 11:02:00'),
  'event': 'event1',
  'start': Timestamp('2017-11-28 11:00:00')},
 {'count': 2,
  'end': Timestamp('2017-11-28 11:04:00'),
  'event': 'event2',
  'start': Timestamp('2017-11-28 11:03:00')},
 {'count': 3,
  'end': Timestamp('2017-11-28 11:07:00'),
  'event': 'event1',
  'start': Timestamp('2017-11-28 11:05:00')},
 {'count': 2,
  'end': Timestamp('2017-11-28 11:09:00'),
  'event': 'event3',
  'start': Timestamp('2017-11-28 11:08:00')},
 {'count': 1,
  'end': Timestamp('2017-11-28 11:10:00'),
  'event': 'event2',
  'start': Timestamp('2017-11-28 11:10:00')}]
Or the following (if you want the timestamps as strings):
In [75]: (x.groupby(x.event.ne(x.event.shift()).cumsum())
    ...:   .apply(lambda x:
    ...:       pd.DataFrame({
    ...:           'start': [x['time'].min().strftime('%Y-%m-%d %H:%M:%S')],
    ...:           'end': [x['time'].max().strftime('%Y-%m-%d %H:%M:%S')],
    ...:           'event': [x['event'].iloc[0]],
    ...:           'count': [len(x)]})
    ...:   )
    ...:   .reset_index(drop=True)
    ...:   .to_dict('records')
    ...: )
Out[75]:
[{'count': 3,
  'end': '2017-11-28 11:02:00',
  'event': 'event1',
  'start': '2017-11-28 11:00:00'},
 {'count': 2,
  'end': '2017-11-28 11:04:00',
  'event': 'event2',
  'start': '2017-11-28 11:03:00'},
 {'count': 3,
  'end': '2017-11-28 11:07:00',
  'event': 'event1',
  'start': '2017-11-28 11:05:00'},
 {'count': 2,
  'end': '2017-11-28 11:09:00',
  'event': 'event3',
  'start': '2017-11-28 11:08:00'},
 {'count': 1,
  'end': '2017-11-28 11:10:00',
  'event': 'event2',
  'start': '2017-11-28 11:10:00'}]
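Here are two solutions. The first is based on the link provided by vivek-harikrishnan and is explained here. It assigns consecutive numbers to the intervals and keeps a running count of the occurrences within each interval.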
#%% first solution
# create intervals and count occurrences per interval
df['interval'] = (df['event'] != df['event'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby(['event', 'interval']).cumcount() + 1

# now group by intervals
df.groupby('interval').last()
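To illustrate why comparing a column with its shifted copy and taking the cumulative sum yields interval numbers, here is a minimal sketch on toy data. The Series `events` and the names `changed` and `run_id` are hypothetical and not part of the original answers.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question
events = pd.Series(['a', 'a', 'b', 'a', 'a', 'a'])

# True wherever a value differs from the previous row; the first row is
# always True because shift() leaves a missing value there
changed = events.ne(events.shift())

# Summing the change flags cumulatively gives every run of equal values its own id
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 2, 3, 3, 3]
```

Note that the second run of `'a'` gets a new id (3) rather than reusing id 1, which is what keeps separate intervals of the same event apart.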
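The second solution is based on the answer above by maxu. Like the first idea, it creates interval numbers, but it also finds the start/end timestamps of each interval.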
#%% second solution
df = df.reset_index()

# create intervals (kept in a separate name so df stays a DataFrame)
groups = df.groupby(df['event'].ne(df['event'].shift()).cumsum())

# calc start/end times and count occurrences at the same time
groups.apply(lambda x: pd.DataFrame({
    'start': [x['time'].min()],
    'end': [x['time'].max()],
    'event': [x['event'].iloc[0]],
    'count': [len(x)]})).reset_index(drop=True)
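As a possible variation on the second solution, the same table can probably be produced without `apply` by using named aggregation. This is only a sketch, assuming pandas 0.25 or later, where named aggregation is available. The sample data is rebuilt compactly with `pd.date_range`, and the names `run_id` and `result` are illustrative rather than taken from the answers above.

```python
import pandas as pd

# Same sample data as in the question, rebuilt compactly (one row per minute)
times = pd.date_range('2017-11-28 11:00', periods=11, freq='min')
events = (['event1'] * 3 + ['event2'] * 2 + ['event1'] * 3
          + ['event3'] * 2 + ['event2'])
df = pd.DataFrame({'time': times, 'event': events})

# Label each run of consecutive identical events (illustrative name)
run_id = df['event'].ne(df['event'].shift()).cumsum().rename('run_id')

# One row per run: first/last timestamp, the event name, and the run length
result = (
    df.groupby(run_id)
      .agg(start=('time', 'min'),
           end=('time', 'max'),
           event=('event', 'first'),
           count=('event', 'size'))
      .reset_index(drop=True)
)

# List of dicts, in the shape asked for in the question
occurrences = result.to_dict('records')
```

Here `start` and `end` are `Timestamp` objects. If strings are needed, the two columns could be converted with `.dt.strftime(...)` before calling `to_dict`.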