How to differentiate between HDF5 datasets and groups with h5py?

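I use the Python package h5py (version 2.5.0) to access my HDF5 files.

I want to traverse the contents of a file and do something with every dataset.

Using the visit method: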

import h5py

def print_it(name):
    dset = f[name]
    print(dset)
    print(type(dset))

with h5py.File('test.hdf5', 'r') as f:
    f.visit(print_it)

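For a test file, this prints the repr and the type of each object, which tells me that the file contains one dataset and one group. However, apart from using type(), there is no obvious way to tell datasets and groups apart. Unfortunately, the h5py documentation says nothing on this topic. It always assumes that you know beforehand which objects are groups and which are datasets, for example because you created the datasets yourself.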

I would like to have something like:

f = h5py.File(..)
for key in f.keys():
    x = f[key]
    print(x.is_group(), x.is_dataset())  # does not exist

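When reading an unknown HDF5 file in Python with h5py, how can I distinguish groups from datasets? How can I get a list of all datasets, all groups, all links?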

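Unfortunately, the h5py API has no built-in method to check this, but you can easily check the type of an item with is_dataset = isinstance(item, h5py.Dataset).

To list all contents of the file (except the file attributes), you can use Group.visititems with a callable that takes the name and the instance of each item.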
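A minimal sketch of the visititems approach, assuming an in-memory demo file built with the core driver so that it runs without an existing file. The helper name classify and the member names x, y and z are made up for illustration.

```python
import h5py

def classify(h5file):
    """Collect the paths of all groups and datasets below h5file."""
    groups, datasets = [], []

    def visitor(name, obj):
        # visititems passes the path relative to the starting group
        # together with the object found at that path.
        if isinstance(obj, h5py.Dataset):
            datasets.append(name)
        elif isinstance(obj, h5py.Group):
            groups.append(name)
        # Returning None lets the traversal continue.

    h5file.visititems(visitor)
    return groups, datasets

# In-memory demo file (hypothetical contents); nothing is written to disk.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_group('x').create_dataset('y', data=[1, 2, 3])
    f.create_dataset('z', data=[4.0, 5.0])
    groups, datasets = classify(f)
    print(groups)
    print(datasets)
```

The callable is invoked for objects below the starting group only, so the root group itself does not appear in the result.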

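Although the answers by Gall and James Smith point to the solution in general, you still have to traverse the hierarchical HDF structure and filter for all datasets. I did this with yield from, which is available in Python 3.3+ and works quite nicely, as shown here.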

import h5py

def h5py_dataset_iterator(g, prefix=''):
    for key in g.keys():
        item = g[key]
        path = '{}/{}'.format(prefix, key)
        if isinstance(item, h5py.Dataset):  # test for dataset
            yield (path, item)
        elif isinstance(item, h5py.Group):  # test for group (go down)
            yield from h5py_dataset_iterator(item, path)

with h5py.File('test.hdf5', 'r') as f:
    for (path, dset) in h5py_dataset_iterator(f):
        print(path, dset)

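Because h5py groups expose a dictionary-like interface, you need to use the values() function to actually access the items. So you can use a list filter:

datasets = [item for item in f["Data"].values() if isinstance(item, h5py.Dataset)]

Doing this recursively should be simple enough.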

For example, if you want to print the structure of an HDF5 file, you can use the following code:

import h5py

# Recursively print the members of a group, one level of indentation per depth
def h5printR(item, leading=''):
    for key in item:
        if isinstance(item[key], h5py.Dataset):
            print(leading + key + ': ' + str(item[key].shape))
        else:
            print(leading + key)
            h5printR(item[key], leading + ' ')

# Print structure of a `.h5` file
def h5print(filename):
    with h5py.File(filename, 'r') as h:
        print(filename)
        h5printR(h, ' ')

Example

>>> h5print('/path/to/file.h5')

file.h5
 test
  repeats
   cell01: (2, 300)
   cell02: (2, 300)
   cell03: (2, 300)
   cell04: (2, 300)
   cell05: (2, 300)
  response
   firing_rate_10ms: (28, 30011)
  stimulus: (300, 50, 50)
  time: (300,)

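I prefer this solution. It builds a list of all objects in the HDF5 file "h5file" and then sorts them by class, similar to what was mentioned before but in a less condensed form: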

import h5py

fh5 = h5py.File(h5file, 'r')
all_h5_objs = []  # must exist before visit() can append the object names to it
fh5.visit(all_h5_objs.append)
all_groups = [obj for obj in all_h5_objs if isinstance(fh5[obj], h5py.Group)]
all_datasets = [obj for obj in all_h5_objs if isinstance(fh5[obj], h5py.Dataset)]
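
The question also asks about links, which the isinstance checks above do not reveal, because indexing a group resolves a link to the object it points to. One possible sketch is to call Group.get with getlink=True, which returns an h5py.HardLink, h5py.SoftLink or h5py.ExternalLink instance instead of the target. The helper name link_kinds and the in-memory demo file are assumptions made for illustration.

```python
import h5py

def link_kinds(group):
    """Map each direct member name of group to the name of its link class."""
    kinds = {}
    for name in group:
        # getlink=True returns the link object itself instead of resolving it.
        link = group.get(name, getlink=True)
        kinds[name] = type(link).__name__
    return kinds

# In-memory demo file (hypothetical contents); nothing is written to disk.
with h5py.File('links_demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('data', data=[1, 2, 3])
    f['alias'] = h5py.SoftLink('/data')  # soft link pointing at the dataset
    kinds = link_kinds(f)
    print(kinds)
```

This only inspects the direct members of one group; to cover a whole file it would have to be combined with one of the recursive traversals shown earlier.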