How to fetch HTML in Java

Without using any external libraries, what is the simplest way to fetch a website's HTML content into a String?

I'm currently using this:

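import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

String content = null;
try {
    URLConnection connection = new URL("http://www.google.com").openConnection();
    // try-with-resources closes the scanner and the underlying stream
    try (Scanner scanner = new Scanner(connection.getInputStream())) {
        // "\\Z" is the regex for end of input, so next() returns the whole body
        scanner.useDelimiter("\\Z");
        content = scanner.next();
    }
} catch (Exception ex) {
    ex.printStackTrace();
}
System.out.println(content);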

but I'm not sure whether there is a better way.

This has worked well for me:

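import java.io.*;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL(theURL);
StringBuilder buffer = new StringBuilder();
// Decode the bytes as UTF-8 (an assumption) rather than casting each byte to char
try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
    int ptr;
    while ((ptr = in.read()) != -1) {
        buffer.append((char) ptr);
    }
}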

I'm not sure whether the other solutions offered are any more efficient.
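
On the efficiency point, here is a minimal sketch that reads the stream in blocks rather than one value at a time. The class name `HtmlFetcher`, the 8192-character buffer size, and the UTF-8 default are my own assumptions, not anything from the posts above; a real page may declare a different charset in its headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class HtmlFetcher {

    /** Reads everything from the stream, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192]; // read in blocks, not one char per call
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    /** Fetches the resource behind the URL as text; UTF-8 is an assumption. */
    static String fetch(URL url) throws IOException {
        return readAll(url.openStream(), StandardCharsets.UTF_8);
    }
}
```

Usage would be something like `String html = HtmlFetcher.fetch(new URL("http://www.google.com"));`.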

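I just left this answer on your other post, though what you have above may work as well. I don't think either one is any easier than the other. You can access the Apache packages simply by putting import org.apache.commons.HttpClient at the top of your code.

Edit: forgot the link ;)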
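
If adding the Apache dependency is not an option, the JDK ships its own HTTP client in `java.net.http` since Java 11. Below is a rough sketch of fetching a page with it; the class name `Java11Fetch` and the choice to follow redirects are my assumptions, not part of the answer above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, named for illustration only. Requires Java 11+.
class Java11Fetch {

    /** Builds a plain GET request for the given address. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    /** Sends the request and returns the response body as a String. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // the default is NEVER
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling `Java11Fetch.fetch("http://www.google.com")` should then return the page as a String, with no third-party jar on the classpath.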

While it's not vanilla Java, I'll offer a simpler solution. Use Groovy ;-)

String siteContent = new URL("http://www.google.com").text

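It's not a library but a tool called curl, which is usually installed on most servers; otherwise you can install it with

sudo apt install curl

Then fetch any HTML page and store it in a local file, for example:

curl https://www.facebook.com/ > fb.html

You will get the home page's HTML. You can also open the saved file in your browser.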
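
For anyone who wants the same "save the page to a file" result without leaving Java, here is a small sketch using `Files.copy`. The class name `PageSaver` is made up for illustration, and overwriting an existing file is my assumption (it mirrors what the shell `>` redirect does).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, named for illustration only.
class PageSaver {

    /** Copies whatever the URL serves into the target file; returns bytes written. */
    static long save(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            // REPLACE_EXISTING overwrites the file, like the shell's > redirect
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Something like `PageSaver.save(new URL("https://www.facebook.com/"), Paths.get("fb.html"));` would be the rough counterpart of the curl command, though unlike curl it copies the raw bytes with no extra options for redirects or headers.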
