Elixir script using Floki and HTTPotion fails to parse url

I am trying to write a script that uses Floki and HTTPotion to scrape article text from Wikipedia. My failing code is below:

defmodule Scraper do

  def start do
    base ="https://en.wikipedia.org"
    response = HTTPotion.get base <>"/wiki/Main_Page"
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")
    main_bg
      |> Floki.find("table tr li a")
      |> Floki.attribute("href")
      |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end)
  end
end

I am referencing this example from the Floki README:

html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)

When I pipe the result into Floki.attribute("href"), I get a nice list of URL path names, for example:

["/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
"/wiki/Boys_Don%27t_Cry_(film)wow","/wiki/Elias_Abraham_Rosenbergwow",
"/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
"/wiki/Boys_Don%27t_Cry_(film)wow","/wiki/Elias_Abraham_Rosenbergwow",
"/wiki/Wikipedia:Today%27s_featured_article/November_2015wow",
"https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow",
"/wiki/Wikipedia:Featured_articleswow","/wiki/Schloss_Krobnitzwow",
"/wiki/Prussiawow","/wiki/Albrecht_von_Roonwow","/wiki/Harry_Winerwow",
"/wiki/Rob_Thomas_(writer)wow","/wiki/Of_Vice_and_Menwow",
"/wiki/Veronica_Marswow","/wiki/Meithalunwow","/wiki/Palestinian_peoplewow",
"/wiki/Marj_Sanurwow","/wiki/Soma_Norodomwow",...]

But when the line |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end) runs, I get this error:

** (HTTPotion.HTTPError) {:url_parsing_failed, {:error, :invalid_uri}}
    (httpotion) lib/httpotion.ex:209: HTTPotion.handle_response/1
       (elixir) lib/enum.ex:977: anonymous fn/3 in Enum.map/2
       (elixir) lib/enum.ex:1261: Enum."-reduce/3-lists^foldl/2-0-"/3
       (elixir) lib/enum.ex:977: Enum.map/2

I can see the :url_parsing_failed, but I don't understand why. When I try the individual URL paths from the list with Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end), they all work.
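
For reference, a minimal debugging sketch (assuming the same Floki and HTTPotion setup as above) that only prints the collected hrefs instead of requesting them, so an odd entry stands out:

defmodule ScraperDebug do
  # Collect the hrefs exactly as Scraper.start/0 does, but print them
  # instead of making any further requests.
  def inspect_hrefs do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get(base <> "/wiki/Main_Page")

    response.body
    |> Floki.find(".MainPageBG")
    |> Floki.find("table tr li a")
    |> Floki.attribute("href")
    |> Enum.each(&IO.inspect/1)
  end
end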

  • Is my syntax wrong?
  • Am I missing something about how pipes or Enum work?
  • Am I on the right track?

Following manukall's answer, here is what worked:

defmodule Scraper do
  def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
  def transform_url(url, _base), do: url

  def start do
    base ="https://en.wikipedia.org"
    response = HTTPotion.get base <>"/wiki/Main_Page"
    html = response.body
    main_bg = Floki.find(html,".MainPageBG")
    main_bg
      |> Floki.find("table tr li a")
      |> Floki.attribute("href")
      |> Enum.map(fn(url) -> transform_url(url, base) end)
      |> Enum.map(fn(url) -> HTTPotion.get(url) end)
  end
end

If you look at the list of URLs again carefully, you will notice that there is one absolute URL in it: "https://lists.wikimedia.org/mailman/listinfo/daily-article-l". That one does not work with HTTPotion.get(base <> addr), because you end up requesting a URL like "https://en.wikipedia.orghttps://lists.wikimedia.org/mailman/listinfo/daily-article-l".
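
To illustrate with that URL from the list above, the concatenation simply glues the two absolute URLs together:

base = "https://en.wikipedia.org"
addr = "https://lists.wikimedia.org/mailman/listinfo/daily-article-l"

# base <> addr is no longer a valid URI:
IO.puts(base <> addr)
# => https://en.wikipedia.orghttps://lists.wikimedia.org/mailman/listinfo/daily-article-l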

One way to fix this is to write another function, transform_url, that checks whether the value starts with / and only then prepends the base URL:

  def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
  def transform_url(url, _base), do: url
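
For illustration, the first clause matches any value that starts with "/" and prepends the base, while the second clause leaves already-absolute URLs untouched:

iex> transform_url("/wiki/Prussia", "https://en.wikipedia.org")
"https://en.wikipedia.org/wiki/Prussia"

iex> transform_url("https://lists.wikimedia.org/mailman/listinfo/daily-article-l", "https://en.wikipedia.org")
"https://lists.wikimedia.org/mailman/listinfo/daily-article-l"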

And then use it like this:

  ...
  |> Enum.map(fn(url) -> HTTPoison.get!(transform_url(url, base)) end)