Need to clean web-scraped data using Python

I am trying to write code that scrapes data from http://goldpricez.com/gold/history/lkr/years-3. The code I wrote is below. It runs and returns the data I expect.


import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)
print(df)

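However, the result also contains data I do not need; I only want the data from the table. Please help me solve this.

I have added an image of the output with the unwanted data circled in red.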

Related discussion

The way you are calling .read_html returns a list of all the tables on the page. Your table is at index 3.


import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)[3]
print(df)
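Hard-coding index 3 stops working if the site adds or removes a table. A minimal sketch of picking the table by its column headers instead is below. The helper `pick_table` and the column names `Date` and `Price` are hypothetical, not taken from the page, and the hand-built DataFrames stand in for the list that `pd.read_html(url)` returns so the snippet runs offline.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    wanted = set(required_columns)
    for table in tables:
        # Column labels may not be strings (e.g. integers when no header was found).
        if wanted.issubset({str(col) for col in table.columns}):
            return table
    raise ValueError(f"no table has columns {sorted(wanted)}")


# Stand-ins for the list that pd.read_html(url) returns (made-up data).
tables = [
    pd.DataFrame({0: ["menu"], 1: ["links"]}),
    pd.DataFrame({"Date": ["01-01-2020"], "Price": [1.0]}),
]

prices = pick_table(tables, ["Date", "Price"])
print(prices)
```

With the real page you would pass `pd.read_html(url)` as `tables` and substitute whatever header names the printed list actually shows.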

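.read_html fetches the URL and parses the response for you. By default it tries lxml first and falls back to BeautifulSoup with html5lib if lxml fails. As with .read_csv, you can adjust how the input is parsed, for example by passing match to keep only tables containing given text, attrs to select tables by their HTML attributes, or header to say which row holds the column names. See the .read_html documentation for more details.

For speed, you can request lxml explicitly, e.g. pd.read_html(url, flavor='lxml')[3]. The other flavor, 'bs4' (also accepted as 'html5lib'), uses BeautifulSoup with html5lib; it is slower but more tolerant of malformed HTML. Note that .read_html does not accept 'html.parser' as a flavor.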
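Which flavor you can pass depends on which parser packages are installed. As a rough sketch, assuming the 'lxml' flavor needs the `lxml` package and the 'bs4' flavor needs both `bs4` and `html5lib`, the following checks what is importable before you choose. The helper name `available_flavors` is made up for illustration.

```python
from importlib.util import find_spec


def available_flavors():
    """List the read_html flavors whose parser packages can be imported."""
    flavors = []
    if find_spec("lxml") is not None:
        flavors.append("lxml")
    # The BeautifulSoup-based flavor needs both packages.
    if find_spec("bs4") is not None and find_spec("html5lib") is not None:
        flavors.append("bs4")
    return flavors


print(available_flavors())
```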

You can also do this with BeautifulSoup directly; the code below worked for me.


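import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

# Collect every table cell on the page, then skip the first 11,
# which sit outside the price table on this page.
data = s.find_all("td")
data = data[11:]

# The remaining cells alternate date, price; stop before an unpaired last cell.
for i in range(0, len(data) - 1, 2):
    print(data[i].text.strip(), " ", data[i + 1].text.strip())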

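Another advantage of using BeautifulSoup this way is that it may be faster than your code, since it does not build a DataFrame for every table on the page. I have not benchmarked it, so time both if speed matters.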
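The slice `data[11:]` and the step of 2 both depend on the page layout staying fixed. As a sketch, assuming each data row is a `<tr>` holding exactly two `<td>` cells, the cells can be grouped per row instead of by position. This uses the standard library's `html.parser` module directly rather than BeautifulSoup, and the `RowCollector` class and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of <td> cells, grouped by their enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> being read, or None
        self._cell = None  # text fragments of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for r.text from the request above.
html = """
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>01-01-2020</td><td> 100.5 </td></tr>
  <tr><td>02-01-2020</td><td>101.0</td></tr>
  <tr><td>only one cell</td></tr>
</table>
"""

parser = RowCollector()
parser.feed(html)

# Keep only rows with exactly two cells (date, price); header rows and
# rows with a different shape are skipped.
pairs = [tuple(row) for row in parser.rows if len(row) == 2]
for date, price in pairs:
    print(date, price)
```

Filtering on the number of cells per row replaces the magic offset of 11, at the cost of assuming the price table is the only one with two-cell rows.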