Pandas rolling corr with no overlap
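I have several price return series, and I want to compute an N-day rolling correlation in a way that has no overlap between dates, i.e. if my first correlation matrix covers [2000-04-05, 2000-06-04], the next one should cover [2000-06-05, 2000-08-04]. Using the regular df.rolling(window=window).corr(df, pairwise=True) returns overlapping dates.
I know that slicing the result of the rolling method would give me what I want, but that means spending time computing correlations I will not use, which wastes resources.
Any suggestions?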
Update:
Here is a sample of the input:
Update 2:
Output of pd.show_versions():

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.5
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
```
You can use:
```
def dcorr(df, n):
    return df.resample(f"{n}D", on='date').apply(lambda d: d.corr())

dcorr(df, 20)

                     A         B
date
2000-01-01 A  1.000000  0.241121
           B  0.241121  1.000000
2000-01-21 A  1.000000  0.083664
           B  0.083664  1.000000
2000-02-10 A  1.000000  0.432988
           B  0.432988  1.000000
2000-03-01 A  1.000000 -0.269869
           B -0.269869  1.000000
2000-03-21 A  1.000000 -0.188370
           B -0.188370  1.000000
```
```
df.set_index('date').groupby(pd.Grouper(freq='20D')).corr()

                     A         B
date
2000-01-01 A  1.000000  0.241121
           B  0.241121  1.000000
2000-01-21 A  1.000000  0.083664
           B  0.083664  1.000000
2000-02-10 A  1.000000  0.432988
           B  0.432988  1.000000
2000-03-01 A  1.000000 -0.269869
           B -0.269869  1.000000
2000-03-21 A  1.000000 -0.188370
           B -0.188370  1.000000
```
Or
```
df.set_index('date').groupby(pd.Grouper(freq='20D')).corr().unstack()[('A', 'B')]

date
2000-01-01    0.241121
2000-01-21    0.083664
2000-02-10    0.432988
2000-03-01   -0.269869
2000-03-21   -0.188370
Name: (A, B), dtype: float64
```
You can also be explicit about which columns to correlate:
```python
df.resample("20D", on='date').apply(lambda d: d.A.corr(d.B))
```
Setup
```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
n = 100
df = pd.DataFrame(np.random.rand(n, 2), columns=['A', 'B'])
df['date'] = pd.date_range('2000-01-01', periods=n, name='date')
```
Debugging
```python
import pandas as pd
import numpy as np

np.random.seed([3, 1415])
n = 100
df = pd.DataFrame(
    np.random.rand(n, 4),
    pd.date_range('2000-01-01', periods=n, name='date'),
    ['ABC', 'XYZ __', 'One', 'Two Three']
)

def dcorr(df, n):
    return df.resample(f"{n}D").apply(lambda d: d.corr())

dcorr(df, 20)
```
Output
```
                           ABC    XYZ __       One  Two Three
date
2000-01-01 ABC        1.000000 -0.029687  0.403720   0.078800
           XYZ __    -0.029687  1.000000 -0.231223  -0.333266
           One        0.403720 -0.231223  1.000000   0.330959
           Two Three  0.078800 -0.333266  0.330959   1.000000
2000-01-21 ABC        1.000000 -0.024610  0.206002  -0.059523
           XYZ __    -0.024610  1.000000 -0.601174  -0.101306
           One        0.206002 -0.601174  1.000000   0.149536
           Two Three -0.059523 -0.101306  0.149536   1.000000
2000-02-10 ABC        1.000000 -0.361072  0.156693  -0.040827
           XYZ __    -0.361072  1.000000 -0.077173  -0.232536
           One        0.156693 -0.077173  1.000000   0.343754
           Two Three -0.040827 -0.232536  0.343754   1.000000
2000-03-01 ABC        1.000000  0.204763 -0.013132   0.115202
           XYZ __     0.204763  1.000000 -0.339747  -0.206922
           One       -0.013132 -0.339747  1.000000   0.310002
           Two Three  0.115202 -0.206922  0.310002   1.000000
2000-03-21 ABC        1.000000  0.062841 -0.245393   0.233697
           XYZ __     0.062841  1.000000 -0.213742   0.341582
           One       -0.245393 -0.213742  1.000000   0.251169
           Two Three  0.233697  0.341582  0.251169   1.000000
```
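One thing worth checking with the `"20D"` frequency used above: it bins by calendar days, not by rows. If the return series only has business-day rows (an assumption here, the question does not show its index), each bin will hold fewer than 20 observations. A minimal sketch with made-up data; `returns` and `rows_per_bin` are illustrative names, not from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical business-day return series (random numbers, illustration only).
rng = np.random.RandomState(0)
idx = pd.bdate_range("2000-01-03", periods=100, name="date")
returns = pd.DataFrame(rng.rand(100, 2), index=idx, columns=["A", "B"])

# Count how many rows actually fall into each 20-calendar-day bin.
rows_per_bin = returns.resample("20D").size()
print(rows_per_bin)
```

If you need exactly N observations per window rather than N calendar days, grouping on a row-count label is probably the safer choice.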
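One (of many) ways is to tag your rows with a batch number. How you batch is up to you. Then use groupby and apply with a function you define to compute the correlation.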
```python
import numpy as np
import pandas as pd

n = 100
df = pd.DataFrame(np.random.rand(n, 2), columns=['A', 'B'])
df['date'] = pd.date_range('2000-01-01', periods=n, name='date')
# Label consecutive, non-overlapping blocks of 20 rows: 0, 0, ..., 1, 1, ...
df['batch'] = np.arange(n) // 20

def process_batch(dg):
    # One row per batch: its date span and the A/B correlation.
    return pd.DataFrame([[
        dg['date'].min(),
        dg['date'].max(),
        dg[['A', 'B']].corr().values[0][1]
    ]], columns=['date_min', 'date_max', 'corr'])

df.groupby('batch').apply(process_batch).reset_index(1, drop=True)
```
Result:
```
        date_min   date_max      corr
batch
0     2000-01-01 2000-01-20 -0.403241
1     2000-01-21 2000-02-09 -0.091487
2     2000-02-10 2000-02-29  0.091835
3     2000-03-01 2000-03-20  0.029466
4     2000-03-21 2000-04-09  0.100756
```
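The same batch idea can be extended to return the full correlation matrix per block, for any number of columns. The sketch below is one possible way to do it, not the only one: `block_corr` is a hypothetical helper name, the data is random, and it assumes the frame is indexed by date and already sorted. It labels each block by its last date and then cross-checks the result against slicing the overlapping `rolling` output, which is the approach the question wanted to avoid.

```python
import numpy as np
import pandas as pd

def block_corr(returns, window):
    """Correlation matrix per consecutive, non-overlapping block of `window` rows.

    Assumes `returns` is sorted by its index; the final block may be shorter.
    """
    pos = np.arange(len(returns))
    # Position of the last row of the block each row belongs to.
    end_pos = np.minimum((pos // window + 1) * window, len(returns)) - 1
    block_end = returns.index.values[end_pos]
    # Group rows by their block's end date and correlate within each group.
    return returns.groupby(block_end).corr()

# Made-up data for illustration.
rng = np.random.RandomState(0)
dates = pd.date_range("2000-01-01", periods=100, name="date")
returns = pd.DataFrame(rng.rand(100, 3), index=dates, columns=["A", "B", "C"])

blocks = block_corr(returns, 20)

# Cross-check: keep only the rolling results that end on a block boundary.
rolled = returns.rolling(20).corr()
keep = rolled.index.get_level_values(0).isin(dates[19::20])
same = np.allclose(blocks.values, rolled[keep].values)
print(same)
```

Only one correlation matrix is computed per block here, whereas the sliced rolling version computes one per row and discards most of them.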