R - rolling 12 month median by group with inconsistent time interval
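I have data (Value) collected over the past few years for a number of groups. I would like to calculate a 12-month rolling median for each group (using the previous 12 months). I have looked into the rolling functions in zoo (and other packages), and they all seem to assume fixed time intervals (e.g. 12 months = 12 rows), but the date intervals in my data are not consistent for each group. Values were collected every month or every other month, but there are other gaps as well. I think I need a rolling median function that can collect the values from the previous 364 days for each group. Any help would be appreciated!
Below is a sample of my data: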
    Date        Group  Value
    8/17/2013   A      5
    10/2/2013   A      13
    1/15/2014   A      11
    3/15/2014   A      2
    5/22/2014   A      7
    7/15/2014   A      1
    9/3/2014    A      1
    11/15/2014  A      7
    7/22/2013   B      13
    8/5/2013    B      13
    9/7/2013    B      12
    10/16/2013  B      6
    11/17/2013  B      5
    12/9/2013   B      15
    1/30/2014   B      1
    2/23/2014   B      10
    3/24/2014   B      15
    4/5/2014    B      3
    5/26/2014   B      3
    6/16/2014   B      4
    8/5/2014    B      6
    9/26/2014   B      8
    10/16/2014  B      15
    11/29/2014  B      12
    12/13/2016  B      1
I would like to add a column to this table, "Rolling Median", containing the 12-month (or 365-day) rolling median for each group.
This package may help you:
https://github.com/mgahan/boRingTrees
The function is called rollingByCalcs:
    ## Uses the data.table package (parallel supplies mcmapply)
    library(data.table)
    library(parallel)

    rollingByCalcs <- function(data, bylist = NULL, dates, target = NULL,
                               lower, upper, incbounds = TRUE, stat = length,
                               na.rm = TRUE, cores = 1) {
      # Note: na.rm is accepted but is not passed on to stat
      tic <- Sys.time()
      data <- data.table(data)

      if (is.null(bylist)) {
        data[, id.filler := 1]
        bylist <- "id.filler"
      }
      if (is.null(target)) {
        data[, target := 1]
        target <- "target"
      }

      ## Create group-by variable
      data[, Grp.Var := .GRP, by = bylist]

      ## Copy the target and date columns to fixed names
      target_values <- data[[target]]
      date_values <- data[[dates]]
      data[, target := target_values]
      data[, dates := date_values]

      ## Per group, collect all values and dates into list columns
      Ref <- data[, list(Compare_Value = list(I(target)),
                         Compare_Date = list(I(dates))),
                  by = "Grp.Var"]

      ## For each row, keep the group's values whose dates fall within the window
      data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
        d <- as.numeric(RD - Ref$Compare_Date[[NUM]])
        true.vals <- between(x = d, lower = lower, upper = upper, incbounds = incbounds)
        stat(Ref$Compare_Value[[NUM]][true.vals])
      }, RD = data$dates, NUM = data$Grp.Var, mc.cores = cores)

      print(Sys.time() - tic)
      return(data$Roll.Val)
    }

    setDT(data)
    data[, Date2 := as.Date(Date, format = "%m/%d/%Y")]  # Format date field

    ## Apply the rollingByCalcs function defined above
    data[, Roll_Median := rollingByCalcs(data, bylist = c("Group"), dates = "Date2",
                                         target = "Value", lower = 0, upper = 365,
                                         incbounds = TRUE, stat = median,
                                         na.rm = TRUE, cores = 1)]
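One detail worth checking in the call above: with `lower=0` and inclusive bounds, a day difference of 0 falls inside the window, so each row's own value appears to be counted in its median. The base R sketch below illustrates the difference on the first three Group A rows from the question. The variable names (`dates`, `values`, `d`, `incl`, `excl`) are illustrative only and are not part of the package.

```r
# First three Group A observations from the question's sample data
dates  <- as.Date(c("8/17/2013", "10/2/2013", "1/15/2014"), format = "%m/%d/%Y")
values <- c(5, 13, 11)

# Day differences between the second observation and every observation
d <- as.numeric(dates[2] - dates)

# Window that includes the current row (0 <= d <= 365)
incl <- median(values[d >= 0 & d <= 365])

# Window restricted to strictly earlier rows (0 < d <= 365)
excl <- median(values[d > 0 & d <= 365])
```

If only earlier observations should count, a small positive `lower` (for example 1, assuming daily resolution) would probably be the simplest adjustment.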
You can write a helper function. Here is one that uses the dplyr package:

    library(dplyr)

    # Median of Value over the 365 days before targetDate, within targetGroup
    rollingMedian <- function(targetDate, targetGroup) {
      dat %>%
        mutate(thisDiff = difftime(as.Date(Date), targetDate, units = "days")) %>%
        filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
        summarise(medValue = median(Value)) %>%
        pull(medValue)  # return a plain number rather than a one-cell data frame
    }

    dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)
Result:

    dat
             Date Group Value rollingMed
    1  2013-08-17     A     5         NA
    2  2013-10-02     A    13          5
    3  2014-01-15     A    11          9
    4  2014-03-15     A     2         11
    5  2014-05-22     A     7          8
    6  2014-07-15     A     1          7
    7  2014-09-03     A     1          7
    8  2014-11-15     A     7          2
    9  2013-07-22     B    13         NA
    10 2013-08-05     B    13         13
    ...
Data used:

    dat <- structure(list(Date = structure(c(1376697600, 1380672000,
        1389744000, 1394841600, 1400716800, 1405382400, 1409702400,
        1416009600, 1374451200, 1375660800, 1378512000, 1381881600,
        1384646400, 1386547200, 1391040000, 1393113600, 1395619200,
        1396656000, 1401062400, 1402876800, 1407196800, 1411689600,
        1413417600, 1417219200, 1481587200), tzone = "UTC",
        class = c("POSIXct", "POSIXt")),
        Group = c("A", "A", "A", "A", "A", "A", "A", "A",
                  "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
                  "B", "B", "B", "B"),
        Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L, 7L, 13L, 13L, 12L,
                  6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L, 8L, 15L, 12L, 1L)),
        .Names = c("Date", "Group", "Value"),
        row.names = c(NA, -25L), class = "data.frame")
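For comparison, the same window logic can be sketched in base R without any packages. This is only a minimal sketch: `rolling_median_by_group`, its arguments, and the `ex` data frame are hypothetical names introduced here, and the window is assumed to cover the 365 days strictly before each observation, matching the `thisDiff < 0, thisDiff > -366` filter above.

```r
# Rolling median over the previous `window_days` days, computed within each group.
# The current observation itself is excluded from its own window.
rolling_median_by_group <- function(dates, groups, values, window_days = 365) {
  dates <- as.Date(dates)
  vapply(seq_along(dates), function(i) {
    age <- as.numeric(dates[i] - dates)  # days by which each row precedes row i
    in_window <- groups == groups[i] & age > 0 & age <= window_days
    if (any(in_window)) as.numeric(median(values[in_window])) else NA_real_
  }, numeric(1))
}

# Group A rows from the question's sample data
ex <- data.frame(
  Date  = as.Date(c("2013-08-17", "2013-10-02", "2014-01-15", "2014-03-15",
                    "2014-05-22", "2014-07-15", "2014-09-03", "2014-11-15")),
  Group = "A",
  Value = c(5, 13, 11, 2, 7, 1, 1, 7)
)
ex$rollingMed <- rolling_median_by_group(ex$Date, ex$Group, ex$Value)
```

On these eight rows the sketch reproduces the rollingMed values shown for Group A in the result above. It compares every row against every other row, so it is likely to be slow on large data sets.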