XML with nested siblings to data frame in R
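I'm new to parsing XML in R. I'm trying to parse XML into a usable data frame. I've tried some of the XPath functions in the XML package, but I can't seem to get the right result.
Here is my XML: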
    <ResidentialProperty>
      <Listing>
        <StreetAddress>
          <StreetNumber>11111</StreetNumber>
          <StreetName>111th</StreetName>
          <StreetSuffix>Avenue Ct</StreetSuffix>
          <StateOrProvince>WA</StateOrProvince>
        </StreetAddress>
        <MLSInformation>
          <ListingStatus Status="Active"/>
          <StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
        </MLSInformation>
        <GeographicData>
          <Latitude>11.111111</Latitude>
          <Longitude>-111.111111</Longitude>
          <County>Pierce</County>
        </GeographicData>
        <SchoolData>
          <SchoolDistrict>Puyallup</SchoolDistrict>
        </SchoolData>
        <View>Territorial</View>
      </Listing>
      <YearBuilt>1997</YearBuilt>
      <InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
      <Occupant>
        <Name>Vacant</Name>
      </Occupant>
      <WaterFront/>
      <Roof>Composition</Roof>
      <Exterior>Brick,Cement Planked,Wood,Wood Products</Exterior>
    </ResidentialProperty>
When I run:

    ResidentialProperty <- xmlToDataFrame(nodes = getNodeSet(doc, "//ResidentialProperty"))
the values of the child nodes inside a parent node are squashed together:

    11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial
The same thing happens if I move down one node level:

    11111111thAvenue CtWA
The values of all the child nodes are pasted together.
I also tried a brute-force approach, which works:
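A minimal sketch of why the values run together, assuming (as the output above suggests) that `xmlToDataFrame` fills each column with the text content of one child node: `xmlValue` on a node that has element children concatenates the text of all its descendants. The shortened XML string below is made up for illustration.

```r
library(XML)

# Shortened stand-in for the real document (illustration only)
xml <- "<StreetAddress><StreetNumber>11111</StreetNumber><StreetName>111th</StreetName></StreetAddress>"
doc <- xmlParse(xml, asText = TRUE)
node <- getNodeSet(doc, "//StreetAddress")[[1]]

# xmlValue() on the parent returns the text of all descendants as one string
pasted <- xmlValue(node)

# Applying xmlValue() to each child keeps the values separate and named
separate <- xmlSApply(node, xmlValue)

print(pasted)
print(separate)
```

So a column only holds a single atomic value when the node it is built from is a leaf.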
    StreetAddress    <- xmlToDataFrame(nodes = getNodeSet(doc, "//StreetAddress"))
    MLSInformation   <- xmlToDataFrame(nodes = getNodeSet(doc, "//MLSInformation"))
    GeographicData   <- xmlToDataFrame(nodes = getNodeSet(doc, "//GeographicData"))
    SchoolData       <- xmlToDataFrame(nodes = getNodeSet(doc, "//SchoolData"))
    YearBuilt        <- xmlToDataFrame(nodes = getNodeSet(doc, "//YearBuilt"))
    InteriorFeatures <- xmlToDataFrame(nodes = getNodeSet(doc, "//InteriorFeatures"))
    Occupant         <- xmlToDataFrame(nodes = getNodeSet(doc, "//Occupant"))
    Roof             <- xmlToDataFrame(nodes = getNodeSet(doc, "//Roof"))
    Exterior         <- xmlToDataFrame(nodes = getNodeSet(doc, "//Exterior"))
    df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt,
                InteriorFeatures, Occupant, Roof, Exterior)
But some of the column names are not what I expected:
    > colnames(df)
     [1] "StreetNumber"     "StreetName"       "StreetSuffix"     "StateOrProvince"  "ListingStatus"
     [6] "StatusChangeDate" "Latitude"         "Longitude"        "County"           "SchoolDistrict"
    [11] "text"             "text"             "Name"             "text"             "text"
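The `text` names can be reproduced in isolation. In this sketch (the two-element XML string is a trimmed stand-in for the real data), the only child of `<YearBuilt>` is a text node, whose node name is `text`, whereas the child of `<Occupant>` is the `<Name>` element. Assuming `xmlToDataFrame` names its columns after the children of each selected node, that would account for the column names above.

```r
library(XML)

# Trimmed stand-in for the real document (illustration only)
xml <- "<ResidentialProperty><YearBuilt>1997</YearBuilt><Occupant><Name>Vacant</Name></Occupant></ResidentialProperty>"
doc <- xmlParse(xml, asText = TRUE)

year_built <- getNodeSet(doc, "//YearBuilt")[[1]]
occupant   <- getNodeSet(doc, "//Occupant")[[1]]

# Names of the direct children of each node
year_built_children <- unname(xmlSApply(year_built, xmlName))
occupant_children   <- unname(xmlSApply(occupant, xmlName))

print(year_built_children)
print(occupant_children)
```

One way around this in the brute-force approach would be to rename those columns by hand after the `cbind`, but that does not adapt when the data changes.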
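I'm trying to find a way to put each atomic value into the appropriate column of a data frame, where the column name is the name of the node, even for nodes nested inside child nodes. My data may also change over time, so I'm looking for a dynamic function that adapts to the data and still produces the expected result where possible.
I thought this was a fairly common XML pattern (nested child levels), so I'm surprised I haven't found much on the topic, though I may not be using the right search terms. I'm guessing there's a simple answer. Any suggestions?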
Consider:
    library(XML)
    library(plyr)   # for rbind.fill

    # xml <- '<ResidentialProperty>........'
    doc <- xmlParse(xml, asText = TRUE)

    df <- do.call(rbind.fill, lapply(doc['//ResidentialProperty'], function(x) {
      # Names of all nodes under this record, text nodes included (named "text")
      names <- xpathSApply(x, './/.', xmlName)
      # Keep the element name that immediately precedes each text node
      names <- names[which(names == "text") - 1]
      # Values of the text nodes, in the same order
      values <- xpathSApply(x, ".//text()", xmlValue)
      # One-row data frame: one column per element that holds text
      return(as.data.frame(t(setNames(values, names)), stringsAsFactors = FALSE))
    }))

    df
    #   StreetNumber StreetName StreetSuffix StateOrProvince        StatusChangeDate  Latitude   Longitude County SchoolDistrict        View YearBuilt                     InteriorFeatures   Name        Roof                                Exterior
    # 1        11111      111th    Avenue Ct              WA 2015-07-05T23:48:53.410 11.111111 -111.111111 Pierce       Puyallup Territorial      1997 Bath Off Master,Dbl Pane/Storm Windw Vacant Composition Brick,Cement Planked,Wood,Wood Products
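A variation on the same idea, offered as a sketch rather than a drop-in replacement: instead of pairing text nodes with the preceding element name, select the leaf elements directly with the XPath `.//*[not(*)]` and read each leaf's name and value. `flatten_record` is a hypothetical helper name, and the XML string is a shortened version of the sample. Empty leaves such as `<WaterFront/>` become `NA`; attribute values such as `Status="Active"` are not captured here.

```r
library(XML)
library(plyr)   # for rbind.fill, as in the answer above

# Shortened version of the sample document (illustration only)
xml <- '<ResidentialProperty>
  <Listing>
    <StreetAddress>
      <StreetNumber>11111</StreetNumber>
      <StreetName>111th</StreetName>
    </StreetAddress>
    <MLSInformation>
      <ListingStatus Status="Active"/>
    </MLSInformation>
  </Listing>
  <YearBuilt>1997</YearBuilt>
  <WaterFront/>
</ResidentialProperty>'
doc <- xmlParse(xml, asText = TRUE)

# Hypothetical helper: one row per record, one column per leaf element
flatten_record <- function(record) {
  # Leaf elements are elements that have no element children
  leaves <- getNodeSet(record, ".//*[not(*)]")
  values <- vapply(leaves, function(leaf) {
    v <- xmlValue(leaf)
    # Treat a missing or empty value as NA
    if (length(v) == 0 || !nzchar(v)) NA_character_ else v
  }, character(1))
  names(values) <- vapply(leaves, xmlName, character(1))
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}

records <- getNodeSet(doc, "//ResidentialProperty")
flat <- do.call(rbind.fill, lapply(records, flatten_record))
print(flat)
```

Because the columns come from whatever leaves each record contains, `rbind.fill` pads records that lack a given element with `NA`. Records that repeat a leaf name would need extra handling that this sketch does not attempt.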