Web Scraping coded price
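When scraping an article from the web, the price shows up in the rendered element but not in the page source. Instead, the source contains the following encoded text: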
```javascript
var f3699334f586f4f2bb6edc10899026d63 = function(value) {
    return base64UTF8Codec.decode(arguments[0])
};
replaceWith(
    document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
    f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);
```
How can I decode this text into the price?
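The text is base64-encoded. If you can locate the right `<script>` tag with BeautifulSoup, you can pull the encoded string out with a regular expression, decode it, and parse the resulting HTML fragment: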
```python
import re
import base64
from bs4 import BeautifulSoup

# Sample markup; the <script> wrapper is included so that soup.script finds it
txt = '''
<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>
'''

soup = BeautifulSoup(txt, 'html.parser')

# 1. locate the right tag
script = soup.script

# 2. get coded text from the script tag (the last quoted argument before "));")
coded_text = re.findall(r".*\('(.*?)'\)\);", script.text)[0]

# 3. decode the text
decoded_text = base64.b64decode(coded_text)
# b'\n                <span class="pull-right"> 2.590,- </span>\n            '

# 4. get the price from the decoded text
soup2 = BeautifulSoup(decoded_text, 'html.parser')
print(soup2.span.get_text(strip=True))
```
Prints:
```
2.590,-
```
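If a page contains several of these calls, or if BeautifulSoup is not available, the same idea can be sketched with the standard library only. This is a minimal sketch, not a definitive implementation: the helper name `decode_payloads` and the pattern `B64_CALL` are hypothetical, and the code assumes every payload is passed as a single quoted base64 literal and decodes to a UTF-8 HTML fragment, as in the snippet from the question.

```python
import base64
import re
from html.parser import HTMLParser

# Assumption: each payload is a quoted base64 literal that is the only
# argument of a call, e.g. f36...('CiAg...'). Other quoted strings made up
# purely of base64 characters would also match and may fail to decode.
B64_CALL = re.compile(r"\('([A-Za-z0-9+/]+={0,2})'\)")


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def decode_payloads(script_text):
    """Return the stripped text of every base64 payload found in script_text."""
    results = []
    for encoded in B64_CALL.findall(script_text):
        html = base64.b64decode(encoded).decode("utf-8")
        collector = _TextCollector()
        collector.feed(html)
        collector.close()  # flush any trailing text still buffered
        results.append("".join(collector.parts).strip())
    return results


sample = (
    "var f3699334f586f4f2bb6edc10899026d63 = function(value) "
    "{ return base64UTF8Codec.decode(arguments[0]) }; "
    "replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), "
    "f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0i"
    "cHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));"
)
print(decode_payloads(sample))  # ['2.590,-']
```

The element id in the sample is skipped because it contains hyphens, which are not base64 characters. On a real page it is probably safer to first narrow the search to the relevant `<script>` text before applying the pattern.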