Python: web scraping an encoded price

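When scraping an article from a website, the price is visible in the rendered element but not in the page source. Instead, the source contains the following encoded text: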

var f3699334f586f4f2bb6edc10899026d63 = function(value) {
    return base64UTF8Codec.decode(arguments[0])
};

replaceWith(
    document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
    f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);

How can I decode this text to get the price?

Related discussion

The text is base64-encoded. If you can locate the right tag with BeautifulSoup, you can use the re module to pull out the encoded string and then decode it:

import re
import base64
from bs4 import BeautifulSoup

# The snippet is wrapped in <script> tags here so that soup.script can find it,
# as it would on the real page.
txt = '''
<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>
'''

soup = BeautifulSoup(txt, 'html.parser')

# 1. locate the right tag
script = soup.script

# 2. get coded text from the script tag:
#    capture the last quoted argument, the one followed by "'));"
coded_text = re.findall(r".*\('(.*?)'\)\);", script.string)[0]

# 3. decode the text; the result is bytes holding a whitespace-padded
#    <span class="pull-right"> 2.590,- </span> fragment
decoded_text = base64.b64decode(coded_text)

# 4. get the price from the decoded text
soup2 = BeautifulSoup(decoded_text, 'html.parser')

print(soup2.span.get_text(strip=True))

Prints:

2.590,-
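
If a page contains several such calls, the same steps can be wrapped in a small helper. The following is only a sketch using the standard library: the names `decode_payloads`, `PAYLOAD_RE` and `TAG_RE` are made up for illustration, and it assumes every payload is passed as a single-quoted base64 literal that closes the `replaceWith(...)` call, as in the question. Tags are stripped with a simple regular expression here, which is adequate for a small fragment like this one; for anything more complex, parsing the decoded HTML with BeautifulSoup as shown above is the more robust choice.

```python
import base64
import re

# Assumption: the payload is a single-quoted base64 literal followed by "))" and ";",
# possibly separated by whitespace, as in the snippet from the question.
PAYLOAD_RE = re.compile(r"\('([A-Za-z0-9+/=]+)'\)\s*\)\s*;")
TAG_RE = re.compile(r"<[^>]+>")


def decode_payloads(page_source):
    """Return the visible text of every base64 payload found in page_source."""
    texts = []
    for encoded in PAYLOAD_RE.findall(page_source):
        html = base64.b64decode(encoded).decode("utf-8")
        # Drop the tags and the surrounding whitespace, keeping only the text.
        texts.append(TAG_RE.sub("", html).strip())
    return texts


sample = """
replaceWith(
document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);
"""

print(decode_payloads(sample))
```

The element id passed to `getElementById` is not picked up, because it contains hyphens and is not followed by the closing `));`.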