python: lxml web scraping, specific word extraction

I'm using an automation script to scrape counters from a LAN website, and right now I'm pulling my hair out.

The page's HTML looks like this:

<TR><td><p align="left" style="margin-left: 30;">title
</p></td><td><p>
   
</p></td>
</TR>
<TR><td><p align="left" style="margin-left: 40;">table one
</p></td><td><p>
 Task&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;average
</p></td>
</TR>
<TR><td><p align="left" style="margin-left: 40;">
</p></td><td><p>
 number&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;number
</p></td>
</TR>
    <TR><td><p align="left" style="margin-left: 40;">1-1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 6490&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">2-4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 442&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">5-10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 44&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;6
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">11-20&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;15
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">21-30&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;25
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">31-50&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;40
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">sum
</p></td><td><p>
 6982&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1
</p></td>
    </TR>

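So on every site I have the same words repeated, like 1-1, 2-4, 5-10 and so on, and I want to extract the numbers next to them, like 6490 and 442, in a specific order, so the result should look like this: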

task - counter
1-1 = 6490
2-4 = 442

To do this, I use:

import requests
from lxml import html

pageContent=requests.get(
 'http://x.html')
tree = html.fromstring(pageContent.content)
scraped = tree.xpath('//p/text()')
print(scraped)

Which obviously prints something like this: \xa0\xa0\xa0\xa0\xa0task', u'1-1\xa0\xa0\xa0\xa0\xa0\xa0counter', u'6490

I'm stuck... I tried other approaches, but they failed.


Try this. It will get you the exact output you mentioned above. Here, content is the string holding the HTML elements pasted above.

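from lxml.html import fromstring

root = fromstring(content)
# Skip the three header rows; each remaining row has a task cell and a counter cell.
for items in root.cssselect("tr")[3:]:
    # Collapse whitespace (including non-breaking spaces), then keep the first token of each cell.
    data = [' '.join(item.text_content().split()).split(" ")[0] for item in items.cssselect("td")]
    print(' = '.join(data))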

Output:

1-1 = 6490
2-4 = 442
5-10 = 44
11-20 = 3
21-30 = 2
31-50 = 1
sum = 6982
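
For comparison, here is a minimal sketch of the same idea using only XPath, which is what the question started from. It assumes the rows look like the ones pasted in the question; SAMPLE is a shortened copy of that table and extract_counters is a hypothetical helper name, not part of lxml. Instead of skipping a fixed number of header rows, it keeps only rows whose second cell starts with a number.

```python
from lxml import html

# Shortened copy of the table rows from the question (assumed sample input).
SAMPLE = """<table>
<TR><td><p align="left" style="margin-left: 40;">table one
</p></td><td><p>
 Task&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;average
</p></td>
</TR>
<TR><td><p align="left" style="margin-left: 40;">1-1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 6490&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1
</p></td>
</TR>
<TR><td><p align="left" style="margin-left: 40;">2-4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 442&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2
</p></td>
</TR>
</table>"""


def extract_counters(markup):
    """Return (task, counter) pairs for rows whose second cell starts with digits."""
    tree = html.fromstring(markup)
    pairs = []
    for row in tree.xpath('//tr'):
        # str.split() with no argument also splits on \xa0 (non-breaking space) in Python 3.
        cells = [td.text_content().split() for td in row.xpath('./td')]
        if len(cells) == 2 and cells[0] and cells[1] and cells[1][0].isdigit():
            pairs.append((cells[0][0], cells[1][0]))
    return pairs


for task, counter in extract_counters(SAMPLE):
    print('{} = {}'.format(task, counter))
```

The digit check is a design choice for this sketch: it drops header rows such as "table one" without relying on their position, but it would also drop any data row whose counter is not purely numeric.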


This will make it work. I've put your output into a dict, which you can conveniently use for various purposes.

text ="""<TR><td><p align="left" style="margin-left: 30;">title
</p></td><td><p>
   

</p></td>
</TR>
<TR><td><p align="left" style="margin-left: 40;">table one
</p></td>
<td><p>
 Task&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;average
</p></td>
</TR>
<TR><td><p align="left" style="margin-left: 40;">
</p></td><td><p>
 number&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;number
</p></td>
</TR>
    <TR><td><p align="left" style="margin-left: 40;">1-1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 6490&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">2-4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 442&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">5-10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 44&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;6
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">11-20&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;15
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">21-30&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;25
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">31-50&nbsp;&nbsp;&nbsp;&nbsp;C
</p></td><td><p>
 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;40
</p></td>
    </TR>
    <TR><td><p align="left" style="margin-left: 40;">sum
</p></td><td><p>
 6982&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1
</p></td>
    </TR>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(text,"lxml")
data = {}
for tr in soup.find_all('tr')[3:-1]:
    p = tr.find_all('td')
    task = p[0].text.split()[0].strip()
    counter = p[1].text.split()[0].strip()
    data[task] = counter
print(data)

Output:

{'1-1': '6490', '2-4': '442', '5-10': '44', '11-20': '3', '21-30': '2', '31-50': '1'}
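
The values in that dict are strings. If you want to do arithmetic with the counters, a small follow-up sketch like the one below converts them to integers. The names counters and total are only illustrative, and the dict literal is copied from the output above.

```python
# Dict as printed above; every counter is still a string.
data = {'1-1': '6490', '2-4': '442', '5-10': '44', '11-20': '3', '21-30': '2', '31-50': '1'}

# Convert each counter to an int so it can be summed or compared.
counters = {task: int(value) for task, value in data.items()}

# The total of the task counters matches the "sum" row of the table (6982).
total = sum(counters.values())
print(total)  # prints 6982
```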