Realign different date ranges to “last x days” for each row
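I need to analyze the last 60 days up to the point where each user was last active.
My dataframe contains the dates ('CalendarDate') on which each user ('DataSourceId') was active ('Activity', an integer), one row per date. I have grouped the dataframe by DataSourceId so that the dates are in the columns, and I have captured each user's last active day as 'max_date':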

    df['max_date'] = df.groupby('DataSourceId')['CalendarDate'].transform('max')

Although 'CalendarDate' and 'max_date' are actually ...

    ID  Jan1  Jan2  Jan3  Jan4  Jan5 ...  max_date
    1            8          15    10      Jan5
    2      2          13                  Jan3
    3      6    11                        Jan2

Now I want to realign the calendar-date columns to "last x days" for each row, like this:

    ID  Last  Last-1  Last-2  Last-3 ...  Last-x
    1     10      15               8
    2     13               2
    3     11       6

I could not find any examples of a similar transformation and am really stuck here.
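For anyone who wants to reproduce this, here is a minimal sketch of the sample table above. The labels Jan1 to Jan5 are taken from the table, blank cells are assumed to be NaN, and deriving 'max_date' with `last_valid_index` is only an assumption for illustration, not how the real data was built.

```python
import numpy as np
import pandas as pd

# Sample wide table from the question; blank cells are assumed to be NaN.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Jan1": [np.nan, 2, 6],
    "Jan2": [8, np.nan, 11],
    "Jan3": [np.nan, 13, np.nan],
    "Jan4": [15, np.nan, np.nan],
    "Jan5": [10, np.nan, np.nan],
})

day_cols = ["Jan1", "Jan2", "Jan3", "Jan4", "Jan5"]

# Assumption for illustration: take the label of the last column that
# holds a value in each row as that row's max_date.
df["max_date"] = df[day_cols].apply(lambda row: row.last_valid_index(), axis=1)
print(df)
```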
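Edit:
After adapting jezrael's solution, I noticed that it occasionally fails.
I think the problem is related to the following code in jezrael's solution:
Example: with this data it fails (and ...):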

    CalendarDate  2017-07-02  2017-07-03  2017-07-06  2017-07-07  2017-07-08  2017-07-09
    DataSourceId
    1000648              NaN      188.37      178.37         NaN      128.37       18.37
    1004507            51.19         NaN       52.19       53.19         NaN         NaN

Specifically, the realigned dataframe looks like this:

                  Last-0  Last-1  Last-2  Last-3  Last-4  Last-5
    DataSourceId
    1000648        18.37  128.37     NaN  178.37  188.37     NaN
    1004507        52.19     NaN   51.19     NaN     NaN   53.19

If I change the order of the rows in the dataframe by changing ID 1000648 to 1100648 (which makes it the second row), the result is (...):

                  Last-0  Last-1  Last-2  Last-3  Last-4  Last-5
    DataSourceId
    1004507          NaN     NaN   53.19   52.19     NaN   51.19
    1100648          NaN  178.37  188.37     NaN   18.37  128.37

If performance is important, use this modified NumPy solution:

    # select the value columns (all except the first and the last)
    A = df.iloc[:, 1:-1].values
    print (A)
    [[nan  8. nan 15. 10.]
     [ 2. nan 13. nan nan]
     [ 6. 11. nan nan nan]]

    # count NaN values
    r = df.bfill(axis=1).isna().sum(axis=1).values
    # older pandas versions
    #r = df.bfill(axis=1).isnull().sum(axis=1).values
    # faster solution from https://stackoverflow.com/a/30428192
    #r = A.shape[1] - (~np.isnan(A)).cumsum(axis=1).argmax(axis=1) - 1
    print (r)
    [0 2 3]

    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]

    # Always use a negative shift, so that column_indices are valid.
    # (could also use the modulo operation)
    r[r < 0] += A.shape[1]
    column_indices = np.flip(column_indices - r[:,np.newaxis], axis=1)
    print (column_indices)
    [[ 4  3  2  1  0]
     [ 2  1  0 -1 -2]
     [ 1  0 -1 -2 -3]]

    result = A[rows, column_indices]
    #https://stackoverflow.com/a/51613442
    #result = strided_indexing_roll(A,r)
    print (result)
    [[10. 15. nan  8. nan]
     [13. nan  2. nan nan]
     [11.  6. nan nan nan]]


    c = [f'Last-{x}' for x in np.arange(result.shape[1])]
    df1 = pd.DataFrame(result, columns=c)
    df1.insert(0, 'ID', df['ID'])
    print (df1)
       ID  Last-0  Last-1  Last-2  Last-3  Last-4
    0   1    10.0    15.0     NaN     8.0     NaN
    1   2    13.0     NaN     2.0     NaN     NaN
    2   3    11.0     6.0     NaN     NaN     NaN

Edit:
If ID is the index, the solution changes slightly:

    A = df.iloc[:, :-1].values
    print (A)
    [[nan  8. nan 15. 10.]
     [ 2. nan 13. nan nan]
     [ 6. 11. nan nan nan]]

    r = df.bfill(axis=1).isna().sum(axis=1).values
    print (r)
    [0 2 3]

    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]

    # Always use a negative shift, so that column_indices are valid.
    # (could also use the modulo operation)
    r[r < 0] += A.shape[1]
    column_indices = np.flip(column_indices - r[:,np.newaxis], axis=1)
    print (column_indices)
    [[ 4  3  2  1  0]
     [ 2  1  0 -1 -2]
     [ 1  0 -1 -2 -3]]

    result = A[rows, column_indices]
    print (result)
    [[10. 15. nan  8. nan]
     [13. nan  2. nan nan]
     [11.  6. nan nan nan]]


    c = [f'Last-{x}' for x in np.arange(result.shape[1])]
    #use DataFrame constructor
    df1 = pd.DataFrame(result, columns=c, index=df.index)
    print (df1)
        Last-0  Last-1  Last-2  Last-3  Last-4
    ID
    1     10.0    15.0     NaN     8.0     NaN
    2     13.0     NaN     2.0     NaN     NaN
    3     11.0     6.0     NaN     NaN     NaN

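As a hedged sketch (not part of the original answer), the shift can also be derived from the value array alone, along the lines of the commented-out cumsum/argmax alternative above, so that it cannot be influenced by ID or date columns. Here it is applied to the two-row sample from the question's edit. The names `wide`, `last_valid` and `realigned` are made up for illustration, and this is not a diagnosis of why the adapted code failed.

```python
import numpy as np
import pandas as pd

dates = ["2017-07-02", "2017-07-03", "2017-07-06",
         "2017-07-07", "2017-07-08", "2017-07-09"]
# Sample data from the question's edit, with DataSourceId as the index.
wide = pd.DataFrame(
    [[np.nan, 188.37, 178.37, np.nan, 128.37, 18.37],
     [51.19, np.nan, 52.19, 53.19, np.nan, np.nan]],
    index=pd.Index([1000648, 1004507], name="DataSourceId"),
    columns=pd.Index(dates, name="CalendarDate"),
)

A = wide.to_numpy()
n_cols = A.shape[1]

# Position of the last non-NaN value in each row: the running count of
# valid cells reaches its maximum for the first time at that position.
last_valid = (~np.isnan(A)).cumsum(axis=1).argmax(axis=1)

# Number of trailing NaNs per row, computed from A itself.
r = n_cols - 1 - last_valid

# Column k of the result reads column (last_valid - k) of A; negative
# indices wrap around to the end of the row, as in the answer's code.
rows = np.arange(A.shape[0])[:, np.newaxis]
cols = last_valid[:, np.newaxis] - np.arange(n_cols)

realigned = pd.DataFrame(
    A[rows, cols],
    index=wide.index,
    columns=[f"Last-{k}" for k in range(n_cols)],
)
print(r)
print(realigned)
```

With this input the second row starts at 53.19, its last recorded value, and its two trailing NaNs end up in the last two columns. A row that contains only NaN is not handled specially in this sketch.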
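You can use this code.
First find the run of consecutive null values at the end of each row, then shift the row by that count, and it will work.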

    df1 = df[df.columns.difference(['ID'])]
    df1 = df1.apply(lambda x:x.shift(x[::-1].isnull().cumprod().sum())[::-1],axis=1)
    df1.columns = ['Last-'+str(i) for i in range(df1.columns.shape[0])]
    df1['ID'] = df['ID']

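To see what the lambda does, here is a small sketch that unpacks it step by step for a single row. The row is the ID 2 row of the sample table, with blanks assumed to be NaN; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# One row of the sample table (ID 2).
row = pd.Series([2.0, np.nan, 13.0, np.nan, np.nan],
                index=["Jan1", "Jan2", "Jan3", "Jan4", "Jan5"])

# Read the row from the end: the cumulative product stays 1 only while
# every value seen so far is null, so its sum is the trailing-NaN count.
trailing = int(row[::-1].isnull().cumprod().sum())

# Shift right by that count so the last real value lands in the last
# position, then reverse so it becomes the first ("Last-0") position.
realigned = row.shift(trailing)[::-1]
print(trailing)
print(realigned.tolist())
```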
Output:

       Last-0  Last-1  Last-2  Last-3  Last-4  ID
    0    10.0    15.0     NaN     8.0     NaN   1
    1    13.0     NaN     2.0     NaN     NaN   2
    2    11.0     6.0     NaN     NaN     NaN   3

Please try the following code and let me know if it helps.

    df = df.iloc[:,list(range(len(df.columns)-1,0,-1))]
    print(df)

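A small sketch of what this slicing does on the sample table (the frame is rebuilt here for illustration, with blanks assumed to be NaN and without the max_date column): it reverses the column order and, because the range stops before position 0, leaves out the first column. It does not shift individual rows, so rows that end in NaN still start with NaN afterwards.

```python
import numpy as np
import pandas as pd

# Sample table from the question, without max_date; blanks assumed NaN.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Jan1": [np.nan, 2, 6],
    "Jan2": [8, np.nan, 11],
    "Jan3": [np.nan, 13, np.nan],
    "Jan4": [15, np.nan, np.nan],
    "Jan5": [10, np.nan, np.nan],
})

# Positions len-1 down to 1: the columns in reverse order, minus column 0.
reversed_df = df.iloc[:, list(range(len(df.columns) - 1, 0, -1))]
print(reversed_df)
```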