BeautifulSoup (Python) and parsing an HTML table

#####UPDATE######: Using renderContents() instead of contents[0] did the trick. I will still leave this open in case someone can provide a better, more elegant solution!

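I am trying to parse a number of web pages to get the data I want. The table has no class/ID attribute, so I have to search for "Website" in the tr contents.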

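Problem at hand:
Displaying td.contents works fine for plain text but, for some reason, not for hyperlinks. What am I doing wrong? Is there a better way of doing this using bs in Python?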

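To those suggesting lxml: I have an ongoing thread about that here. Installing lxml on CentOS without admin privileges is proving to be a handful at this time, hence exploring the BeautifulSoup option.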
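As a side note for setups like this one, where third-party parsers such as lxml cannot be installed, the same idea (collect each cell's text, then keep only tables that have a "Website" cell) can be sketched with the standard library alone. This is a rough illustration, not a replacement for BeautifulSoup: it assumes Python 3 (where the module is named `html.parser`), it does not handle nested tables, and the names `TableTextParser`, `tables_with_header`, and the sample markup with its link are all hypothetical.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by table and by row."""

    def __init__(self):
        super().__init__()
        self.tables = []    # each table is a list of rows; each row a list of cell texts
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag == "td" and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join all fragments so text nested inside <a> tags is kept
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def tables_with_header(html, header):
    """Return the tables in which some row has a cell equal to `header`."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return [t for t in parser.tables if any(header in row for row in t)]


# Made-up sample markup; the link target is for illustration only
sample = """
<table border="2" width="100%">
<tr><td class="BoldTD">Website</td><td class="BoldTD">Last Visited</td></tr>
<tr><td><a href="http://example.com">example.com</a></td><td>01/10/2011</td></tr>
</table>
<table><tr><td>Unrelated</td></tr></table>
"""

matches = tables_with_header(sample, "Website")
for row in matches[0]:
    print(row)
```

Because the cell text is accumulated from every data fragment inside the td, text wrapped in a hyperlink is not lost.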

HTML Sample:

<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">

</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
stackoverflow.com
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">

</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>

Python code so far:

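from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# PATH and FILE are defined elsewhere in the script
with open(PATH + "/" + FILE) as f1:
    pageSource = f1.read()

soup = BeautifulSoup(pageSource)
# The table has no class/ID, so match it by its attributes
alltables = soup.findAll("table", {"border": "2", "width": "100%"})
print("Number of tables found : %d" % len(alltables))

for table in alltables:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            # Prints only the first child of each cell
            print(td.contents[0])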

Related discussion

from BeautifulSoup import BeautifulSoup

pageSource = '''...omitted for brevity...'''

soup = BeautifulSoup(pageSource)
alltables = soup.findAll("table", {"border": "2", "width": "100%"})

results = []
for table in alltables:
    rows = table.findAll('tr')
    lines = []
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            # renderContents() returns the cell's full inner markup, tags included
            text = td.renderContents().strip('\n')
            lines.append(text)
    text_table = '\n'.join(lines)
    # Keep only the tables that mention 'Website'
    if 'Website' in text_table:
        results.append(text_table)
print("Number of tables found : %d" % len(results))
for result in results:
    print(result)

yields

Number of tables found : 1
Website
Last Visited
Last Loaded

01/14/2011

stackoverflow.com
01/10/2011

01/10/2011

Is this close to what you are looking for?
The problem is that td.contents returns a list of NavigableStrings and soup tags. For example, running print(td.contents) might yield

['', '', '']

So picking off the first element of the list causes you to miss the tag.
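The effect is easy to reproduce on a single cell. The following is a minimal sketch, assuming the newer BeautifulSoup 4 package (`bs4`) is installed rather than the BeautifulSoup 3 import used above; the cell markup and the link target are made up for illustration. In bs4, `get_text()` and `decode_contents()` play roughly the role that `renderContents()` plays in the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: a hyperlink surrounded by newlines, as in the sample table
html = (
    '<table><tr><td>\n'
    '<a href="http://example.com">example.com</a>\n'
    '</td></tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# contents holds three children: a newline string, the <a> tag, a newline string
first_child = td.contents[0]   # only the leading newline, so the link is missed

# Alternatives that look at the whole cell rather than its first child
link = td.find("a")                         # the tag itself
cell_text = td.get_text(strip=True)         # visible text, tags removed
inner_html = td.decode_contents().strip()   # inner markup as a string

print(repr(first_child))
print(cell_text)
print(link["href"])
print(inner_html)
```

Which alternative fits depends on what is needed from the cell: `get_text()` for the visible text, `find("a")` for the link's attributes, or the inner markup when the tags themselves should be kept.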