Resampling a pandas dataframe with multi-index containing timeseries
Apologies for creating what appears to be a duplicate of this question. I have a dataframe shaped roughly like this:
    df_length = 240

    df = pd.DataFrame(np.random.randn(df_length, 2), columns=['a', 'b'])
    df['datetime'] = pd.date_range('23/06/2017', periods=df_length, freq='H')
    unique_jobs = ['job1', 'job2', 'job3']
    job_id = [unique_jobs for i in range(1, int((df_length / len(unique_jobs)) + 1), 1)]
    df['job_id'] = sorted([val for sublist in job_id for val in sublist])
    df.set_index(['job_id', 'datetime'], append=True, inplace=True)
                                         a         b
      job_id datetime
    0 job1   2017-06-23 00:00:00 -0.067011 -0.516382
    1 job1   2017-06-23 01:00:00 -0.174199  0.068693
    2 job1   2017-06-23 02:00:00 -1.227568 -0.103878
    3 job1   2017-06-23 03:00:00 -0.847565 -0.345161
    4 job1   2017-06-23 04:00:00  0.028852  3.111738
I need to resample datetime to daily means and then take a rolling mean.
I have tried two approaches:
1 - unstacking and stacking, as suggested here
    df.unstack('job_id','datetime').resample('D').mean().rolling(window=2).mean().stack('job_id', 'datetime')
This returns an error.
2 - using pd.Grouper
    level_values = df.index.get_level_values
    result = df.groupby([level_values(i) for i in [0, 1]] + [pd.Grouper(freq='D', level=2)]).mean().rolling(window=2).mean()
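This does not return an error, but it does not seem to resample/group the df properly. The result appears to contain hourly data points rather than daily ones: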
    print(result[:5])

                                a         b
      job_id datetime
    0 job1   2017-06-23       NaN       NaN
    1 job1   2017-06-23  0.831609  1.348970
    2 job1   2017-06-23 -0.560047  1.063316
    3 job1   2017-06-23 -0.641936 -0.199189
    4 job1   2017-06-23  0.254402 -0.328190
First, let's define a resampler function:
    def resampler(x):
        return x.set_index('datetime').resample('D').mean().rolling(window=2).mean()
Then we group by job_id and apply the resampler function:
    df.reset_index(level=2).groupby(level=1).apply(resampler)
    Out[657]:
                              a         b
    job_id datetime
    job1   2017-06-23       NaN       NaN
           2017-06-24  0.053378  0.004727
           2017-06-25  0.265074  0.234081
           2017-06-26  0.192286  0.138148
    job2   2017-06-26       NaN       NaN
           2017-06-27 -0.016629 -0.041284
           2017-06-28 -0.028662  0.055399
           2017-06-29  0.113299 -0.204670
    job3   2017-06-29       NaN       NaN
           2017-06-30  0.233524 -0.194982
           2017-07-01  0.068839 -0.237573
           2017-07-02 -0.051211 -0.069917
Let me know if this is what you are after.
IIUC, you want to group by job_id and by day, ignoring the first index level. So instead of grouping by
    ([level_values(i) for i in [0, 1]] + [pd.Grouper(freq='D', level=2)])
you want to group by
    [df.index.get_level_values(1), pd.Grouper(freq='D', level=2)]
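To see why dropping level 0 matters, here is a minimal sketch on a smaller stand-in frame. The 96-row size, the two-job layout, and the names `with_level0` and `without_level0` are assumptions for illustration, not part of the question. Level 0 is the unique row number, so including it in the keys should leave one group per row, which would explain the hourly-looking result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame: 2 jobs, 48 hourly rows each.
n = 96
df = pd.DataFrame(np.random.randn(n, 2), columns=['a', 'b'])
df['datetime'] = pd.date_range('2017-06-23', periods=n, freq=pd.Timedelta(hours=1))
df['job_id'] = ['job1'] * 48 + ['job2'] * 48
df = df.set_index(['job_id', 'datetime'], append=True)

daily = pd.Grouper(freq='D', level=2)
level_values = df.index.get_level_values

# Keys include level 0 (unique row number): nothing is actually aggregated.
with_level0 = len(df.groupby([level_values(0), level_values(1), daily]).mean())

# Keys are only job_id and the daily bin: rows collapse to one per job per day.
without_level0 = len(df.groupby([level_values(1), daily]).mean())

print(with_level0, without_level0)
```

The first count should match `len(df)`, while the second should be far smaller.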
    import numpy as np
    import pandas as pd
    np.random.seed(2017)

    df_length = 240
    df = pd.DataFrame(np.random.randn(df_length, 2), columns=['a', 'b'])
    df['datetime'] = pd.date_range('23/06/2017', periods=df_length, freq='H')
    unique_jobs = ['job1', 'job2', 'job3']
    job_id = [unique_jobs for i in range(1, int((df_length / len(unique_jobs)) + 1), 1)]
    df['job_id'] = sorted([val for sublist in job_id for val in sublist])
    df.set_index(['job_id', 'datetime'], append=True, inplace=True)

    grouped = df.groupby([df.index.get_level_values(1), pd.Grouper(freq='D', level=2)])
    result = grouped.mean().rolling(window=2).mean()
    print(result)
yields
                              a         b
    job_id datetime
    job1   2017-06-23       NaN       NaN
           2017-06-24 -0.203083  0.176141
           2017-06-25 -0.077083  0.072510
           2017-06-26 -0.237611 -0.493329
    job2   2017-06-26 -0.297775 -0.370543
           2017-06-27  0.005124  0.052603
           2017-06-28  0.226142 -0.015584
           2017-06-29 -0.065595  0.210628
    job3   2017-06-29 -0.186865  0.347683
           2017-06-30  0.051508  0.029909
           2017-07-01  0.005341  0.075378
           2017-07-02 -0.027131  0.132192
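Note that in this output the first job2 and job3 rows are not NaN. The rolling window is applied to the whole grouped frame, so it runs across job boundaries, unlike the apply-based output earlier. If the rolling mean should restart for each job, one possible sketch is to roll within each job_id group. This assumes a smaller two-job stand-in frame, and the names `daily_mean` and `per_job` are illustrative only:

```python
import numpy as np
import pandas as pd

np.random.seed(2017)

# Hypothetical stand-in for the question's frame: 2 jobs, 48 hourly rows each.
n = 96
df = pd.DataFrame(np.random.randn(n, 2), columns=['a', 'b'])
df['datetime'] = pd.date_range('2017-06-23', periods=n, freq=pd.Timedelta(hours=1))
df['job_id'] = ['job1'] * 48 + ['job2'] * 48
df = df.set_index(['job_id', 'datetime'], append=True)

# Daily mean per job, ignoring index level 0.
daily_mean = df.groupby(
    [df.index.get_level_values(1), pd.Grouper(freq='D', level=2)]
).mean()

# Apply the 2-day rolling mean inside each job, so windows never span two jobs.
per_job = daily_mean.groupby(level='job_id').transform(
    lambda g: g.rolling(window=2).mean()
)
print(per_job)
```

With this variant the first day of every job should come out as NaN, since there is no previous day within the same job to average with.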