Elixir script using Floki and HTTPotion fails to parse URL
I am trying to write a script that scrapes article text from Wikipedia using Floki and HTTPotion. My failing code is:
```elixir
defmodule Scraper do
  def start do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get base <> "/wiki/Main_Page"
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")

    main_bg
    |> Floki.find("table tr li a")
    |> Floki.attribute("href")
    |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end)
  end
end
```
I am following this snippet from the Floki README:
```elixir
html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)
```
When I instead pipe the hrefs into a step that appends "wow" to each one, I get:
```elixir
["/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
 "/wiki/Boys_Don%27t_Cry_(film)wow", "/wiki/Elias_Abraham_Rosenbergwow",
 "/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
 "/wiki/Boys_Don%27t_Cry_(film)wow", "/wiki/Elias_Abraham_Rosenbergwow",
 "/wiki/Wikipedia:Today%27s_featured_article/November_2015wow",
 "https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow",
 "/wiki/Wikipedia:Featured_articleswow", "/wiki/Schloss_Krobnitzwow",
 "/wiki/Prussiawow", "/wiki/Albrecht_von_Roonwow", "/wiki/Harry_Winerwow",
 "/wiki/Rob_Thomas_(writer)wow", "/wiki/Of_Vice_and_Menwow",
 "/wiki/Veronica_Marswow", "/wiki/Meithalunwow", "/wiki/Palestinian_peoplewow",
 "/wiki/Marj_Sanurwow", "/wiki/Soma_Norodomwow", ...]
```
However, when the `Enum.map` line calls `HTTPotion.get(base <> addr)`, as in my failing code, I see:

```
** (HTTPotion.HTTPError) {:url_parsing_failed, {:error, :invalid_uri}}
    (httpotion) lib/httpotion.ex:209: HTTPotion.handle_response/1
    (elixir) lib/enum.ex:977: anonymous fn/3 in Enum.map/2
    (elixir) lib/enum.ex:1261: Enum."-reduce/3-lists^foldl/2-0-"/3
    (elixir) lib/enum.ex:977: Enum.map/2
```
- Is my syntax wrong?
- Am I missing something about how piping or Enum works?
- Am I on the right track?
Based on manukall's answer, here is what worked:
```elixir
defmodule Scraper do
  # A path such as "/wiki/Prussia" gets the base prepended.
  def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
  # Anything else is treated as an absolute URL and returned unchanged.
  def transform_url(url, _base), do: url

  def start do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get base <> "/wiki/Main_Page"
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")

    main_bg
    |> Floki.find("table tr li a")
    |> Floki.attribute("href")
    |> Enum.map(fn(url) -> transform_url(url, base) end)
    |> Enum.map(fn(url) -> HTTPotion.get(url) end)
  end
end
```
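If you look closely at your list of URLs again, you will see that it contains one absolute URL: "https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow". That does not work when the base is prepended to it, as `base <> addr` does in your code.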
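To see the failure concretely, here is a minimal sketch that uses only string concatenation. The two sample hrefs are taken from the list in the question with the "wow" suffix dropped, and the variable names are illustrative:

```elixir
base = "https://en.wikipedia.org"

# Two hrefs of the kind Floki.attribute/2 returned: one path, one absolute URL.
hrefs = [
  "/wiki/Prussia",
  "https://lists.wikimedia.org/mailman/listinfo/daily-article-l"
]

# Blindly prepending the base works for the path but mangles the absolute URL.
joined = Enum.map(hrefs, fn href -> base <> href end)

IO.inspect(joined)
```

The second element starts with `https://en.wikipedia.orghttps://`, which is presumably the string HTTPotion rejects with `:invalid_uri`.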
One way to fix this is to write another function that prepends the base only when it is given a path:
```elixir
def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
def transform_url(url, _base), do: url
```
and then use it like this:
```elixir
...
|> Enum.map(fn(url) -> HTTPoison.get!(transform_url(url, base)) end)
```
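As an alternative sketch (not part of the answer above), Elixir's standard `URI.merge/2` resolves an href against a base and keeps the href's own host when it already has a scheme, so both cases are covered without a hand-written clause. The module name `UrlResolver` and the function `resolve/2` are hypothetical names chosen for illustration:

```elixir
defmodule UrlResolver do
  # Hypothetical helper: resolve an href against a base URL.
  # URI.merge/2 keeps an href that already carries its own scheme and host,
  # and fills in the base's scheme and host for a bare path.
  def resolve(href, base) do
    base
    |> URI.merge(href)
    |> URI.to_string()
  end
end

base = "https://en.wikipedia.org"

resolved =
  ["/wiki/Prussia", "https://lists.wikimedia.org/mailman/listinfo/daily-article-l"]
  |> Enum.map(fn href -> UrlResolver.resolve(href, base) end)

IO.inspect(resolved)
```

The path comes back as `https://en.wikipedia.org/wiki/Prussia` and the mailing-list URL is left as it was. Because this follows the usual relative-reference rules, it should also cope with protocol-relative hrefs such as `//example.org/x`, which the `"/" <> _` clause would treat as a path.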