Python: non-overlapping rolling windows in pandas

Pandas rolling corr with no overlap

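I have several price-return series, and I want to compute the N-day rolling correlation in such a way that the date ranges do not overlap. That is, if my first correlation matrix covers [2000-04-05, 2000-06-04], the next one should cover [2000-06-05, 2000-08-04]. The usual df.rolling(window=window).corr(df, pairwise=True) returns overlapping windows.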

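I know that slicing the result of the rolling method would give me what I want, but that means spending time computing correlations I will never use, which wastes resources.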
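For reference, the slicing approach described above might look like the following minimal sketch. The data is synthetic, and the names `rng`, `window`, `end_dates` and `non_overlapping` are illustrative assumptions, not part of the original question. Windows are counted in rows and labelled by their last date.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # synthetic returns, for illustration only
n, window = 100, 20
df = pd.DataFrame(
    rng.rand(n, 2),
    index=pd.date_range("2000-01-01", periods=n, name="date"),
    columns=["A", "B"],
)

# Overlapping result: one correlation matrix per row
# (the first window - 1 matrices are all NaN).
rolling_corr = df.rolling(window=window).corr(pairwise=True)

# Keep only the matrices whose window ends on every `window`-th row,
# so that consecutive kept windows share no dates.
end_dates = df.index[window - 1 :: window]
keep = rolling_corr.index.get_level_values(0).isin(end_dates)
non_overlapping = rolling_corr[keep]
```

This gives the desired result, but it computes `n` matrices and discards all but `n // window` of them, which is the waste the question is trying to avoid.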

Any suggestions?

Update:

An example of the input was attached as an image, which is not reproduced here.

Output of pd.show_versions():
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.5
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

resample

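You can use pd.DataFrame.resample with the rule "20D" to bin the data into 20-day periods. Use the on argument to specify the column to resample on. The resulting resample object behaves like a groupby object and supports the apply method.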


def dcorr(df, n):
    # One correlation matrix per non-overlapping n-day bin of the 'date' column.
    return df.resample(f"{n}D", on='date').apply(lambda d: d.corr())

dcorr(df, 20)

                     A         B
date
2000-01-01 A  1.000000  0.241121
           B  0.241121  1.000000
2000-01-21 A  1.000000  0.083664
           B  0.083664  1.000000
2000-02-10 A  1.000000  0.432988
           B  0.432988  1.000000
2000-03-01 A  1.000000 -0.269869
           B -0.269869  1.000000
2000-03-21 A  1.000000 -0.188370
           B -0.188370  1.000000

groupby


df.set_index('date').groupby(pd.Grouper(freq='20D')).corr()

                     A         B
date
2000-01-01 A  1.000000  0.241121
           B  0.241121  1.000000
2000-01-21 A  1.000000  0.083664
           B  0.083664  1.000000
2000-02-10 A  1.000000  0.432988
           B  0.432988  1.000000
2000-03-01 A  1.000000 -0.269869
           B -0.269869  1.000000
2000-03-21 A  1.000000 -0.188370
           B -0.188370  1.000000

Or:

df.set_index('date').groupby(pd.Grouper(freq='20D')).corr().unstack()[('A', 'B')]

date
2000-01-01    0.241121
2000-01-21    0.083664
2000-02-10    0.432988
2000-03-01   -0.269869
2000-03-21   -0.188370
Name: (A, B), dtype: float64
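With more than two columns, the same unstack trick can be extended to keep every unique pair once. The following is only a sketch on synthetic data; the three-column frame and the names `wide`, `pairs` and `pair_corr` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # synthetic data, for illustration only
n = 100
df = pd.DataFrame(
    rng.rand(n, 3),
    index=pd.date_range("2000-01-01", periods=n, name="date"),
    columns=["A", "B", "C"],
)

# One correlation matrix per 20-day bin, stacked under a (date, column) index.
corr = df.groupby(pd.Grouper(freq="20D")).corr()

# Move the inner index level into the columns: one row per bin,
# one column per (column, column) combination.
wide = corr.unstack()

# Keep each unordered pair once (upper triangle, diagonal excluded).
pairs = [(a, b) for i, a in enumerate(df.columns) for b in df.columns[i + 1:]]
pair_corr = wide[pairs]
```

`pair_corr` then has one row per 20-day bin and one column per pair, which is usually easier to plot or store than the stacked matrices.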

You can also be explicit about which columns to correlate:

df.resample("20D", on='date').apply(lambda d: d.A.corr(d.B))

Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])

n = 100
df = pd.DataFrame(np.random.rand(n,2), columns=['A','B'])
df['date'] = pd.date_range('2000-01-01', periods=n, name='date')

Debugging

import pandas as pd
import numpy as np

np.random.seed([3, 1415])

n = 100
df = pd.DataFrame(
    np.random.rand(n, 4),
    pd.date_range('2000-01-01', periods=n, name='date'),
    ['ABC','XYZ __', 'One', 'Two Three']
)

def dcorr(df, n):
    # The dates are in the index here, so no `on` argument is needed.
    return df.resample(f"{n}D").apply(lambda d: d.corr())

dcorr(df, 20)

Output

                           ABC    XYZ __       One Two Three
date
2000-01-01 ABC        1.000000 -0.029687  0.403720  0.078800
           XYZ __    -0.029687  1.000000 -0.231223 -0.333266
           One        0.403720 -0.231223  1.000000  0.330959
           Two Three  0.078800 -0.333266  0.330959  1.000000
2000-01-21 ABC        1.000000 -0.024610  0.206002 -0.059523
           XYZ __    -0.024610  1.000000 -0.601174 -0.101306
           One        0.206002 -0.601174  1.000000  0.149536
           Two Three -0.059523 -0.101306  0.149536  1.000000
2000-02-10 ABC        1.000000 -0.361072  0.156693 -0.040827
           XYZ __    -0.361072  1.000000 -0.077173 -0.232536
           One        0.156693 -0.077173  1.000000  0.343754
           Two Three -0.040827 -0.232536  0.343754  1.000000
2000-03-01 ABC        1.000000  0.204763 -0.013132  0.115202
           XYZ __     0.204763  1.000000 -0.339747 -0.206922
           One       -0.013132 -0.339747  1.000000  0.310002
           Two Three  0.115202 -0.206922  0.310002  1.000000
2000-03-21 ABC        1.000000  0.062841 -0.245393  0.233697
           XYZ __     0.062841  1.000000 -0.213742  0.341582
           One       -0.245393 -0.213742  1.000000  0.251169
           Two Three  0.233697  0.341582  0.251169  1.000000

One (of many) ways is to label your rows with a batch number. How you assign the batches is up to you. Then use groupby with apply and a function of your own to compute the correlation.

import numpy as np
import pandas as pd

n = 100
df = pd.DataFrame(np.random.rand(n,2), columns=['A','B'])
df['date'] = pd.date_range('2000-01-01', periods=n, name='date')

# Rows 0-19 go to batch 0, rows 20-39 to batch 1, and so on.
df['batch'] = np.arange(n) // 20

def process_batch(dg):
    # One output row per batch: first date, last date, corr(A, B).
    return pd.DataFrame([[
        dg['date'].min(),
        dg['date'].max(),
        dg[['A','B']].corr().values[0][1]
    ]], columns=['date_min', 'date_max', 'corr'])

df.groupby('batch').apply(process_batch).reset_index(level=1, drop=True)

Result:

        date_min   date_max      corr
batch
0     2000-01-01 2000-01-20 -0.403241
1     2000-01-21 2000-02-09 -0.091487
2     2000-02-10 2000-02-29  0.091835
3     2000-03-01 2000-03-20  0.029466
4     2000-03-21 2000-04-09  0.100756
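Note that the batch approach counts rows, whereas "20D" counts calendar days. For return series that skip weekends and holidays the two will generally give different windows. A compact row-based variant that labels each block by its first date could be sketched as follows. The business-day index, the synthetic data and the names `block_id`, `block_start` and `block_corr` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # synthetic data, for illustration only
n, window = 100, 20
df = pd.DataFrame(
    rng.rand(n, 2),
    # Business-day index, so calendar-day bins would hold fewer than 20 rows.
    index=pd.date_range("2000-01-03", periods=n, freq="B", name="date"),
    columns=["A", "B"],
)

# Integer-dividing each row position by the window length gives a block id;
# mapping it back onto the index yields the first date of that row's block.
block_id = np.arange(len(df)) // window
block_start = df.index.values[block_id * window]

# One correlation matrix per block of `window` consecutive rows.
block_corr = df.groupby(block_start).corr()
```

If `len(df)` is not a multiple of `window`, the last block is simply shorter; whether to keep or drop it is a choice left to the caller.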