URL parsing in Python - normalizing double-slash in paths
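I'm working on an app that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.
One problem I run into frequently is that urlparse is very strict (possibly even buggy?) when parsing and joining URLs that have a double slash in the path part. For example: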
    import urlparse

    testUrl = 'http://www.example.com//path?foo=bar'
    urlparse.urljoin(testUrl,
                     urlparse.urlparse(testUrl).path)
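I don't get the expected result.
By the way, the reason I'm running code like this is that it's the only way I've found so far to strip the query/fragment part from a URL. Maybe there is a better way to do it, but I couldn't find one.
Can anyone recommend a way to avoid this, or should I just normalize the path myself with a (relatively simple, I know) regex?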
The path on its own (`//path`) is not valid without an authority component:
http://tools.ietf.org/html/rfc3986.html#section-3.3
If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
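To see how this rule plays out, here is a minimal sketch. It assumes Python 3, where the `urlparse` module lives at `urllib.parse` (the question itself uses Python 2). It shows that a reference beginning with `//` is read as a network-path reference, so its first segment becomes the host:

```python
from urllib.parse import urljoin, urlparse

test_url = "http://www.example.com//path?foo=bar"

# Parsed as part of a full URL, the double slash stays in the path.
path = urlparse(test_url).path
print(path)  # //path

# Parsed on its own, "//path" has no scheme, so "path" is read as the authority.
alone = urlparse(path)
print(alone.netloc, repr(alone.path))  # path ''

# urljoin therefore keeps only the scheme of the base URL.
joined = urljoin(test_url, path)
print(joined)  # http://path
```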
I don't particularly like either of the following solutions, but they work:
    import re
    import urlparse

    testurl = 'http://www.example.com//path?foo=bar'

    parsed = list(urlparse.urlparse(testurl))
    parsed[2] = re.sub("/{2,}", "/", parsed[2])  # replace two or more / with one
    cleaned = urlparse.urlunparse(parsed)

    print cleaned
    # http://www.example.com/path?foo=bar

    print urlparse.urljoin(
        testurl,
        urlparse.urlparse(cleaned).path)
    # http://www.example.com/path
Depending on what you are doing, you could do the joining manually:
    import urlparse

    testurl = 'http://www.example.com//path?foo=bar'
    parsed = list(urlparse.urlparse(testurl))

    newurl = ["" for i in range(6)]  # could urlparse another address instead

    # Copy first 3 values from
    # ['http', 'www.example.com', '//path', '', 'foo=bar', '']
    for i in range(3):
        newurl[i] = parsed[i]

    # Rest (params, query, fragment) are blank
    for i in range(3, 6):
        newurl[i] = ''

    print urlparse.urlunparse(newurl)
    # http://www.example.com//path
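If you only want the URL without the query part, I would skip the urlparse module and just do:
    testUrl.split('?', 1)
The URL will be at index 0 of the returned list and the query (if there is one) at index 1.
Limiting the split to the first "?" matters: RFC 3986 allows a literal "?" inside the query, so only the first one separates the path from the query. With that limit it works for all URLs.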
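As an alternative sketch, the same stripping can be done with the standard library so that a fragment is removed as well. This assumes Python 3's `urllib.parse` (the thread uses the Python 2 `urlparse` module), and `strip_query_and_fragment` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Return url with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, netloc and path; blank out query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_query_and_fragment("http://www.example.com//path?foo=bar#top"))
# http://www.example.com//path
```

Note that this leaves the double slash in the path untouched; it only addresses the stripping step.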
The official urlparse documentation says:
If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:
    >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
    ...         '//www.python.org/%7Eguido')
    'http://www.python.org/%7Eguido'
If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.
So you can do:
    urlparse.urljoin(testUrl,
                     urlparse.urlparse(testUrl).path.replace('//', '/'))
Output = 'http://www.example.com/path'
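One caveat worth checking: `str.replace('//', '/')` makes a single left-to-right pass, so runs of three or more slashes are only partly collapsed. A small sketch comparing it with a regex that collapses whole runs (the sample path is made up for illustration):

```python
import re

path = "/a////b///c"  # hypothetical messy path

# One pass of str.replace shortens each run but may leave doubles behind.
once = path.replace("//", "/")
print(once)  # /a//b//c

# A regex matching runs of two or more slashes collapses each run fully.
collapsed = re.sub(r"/{2,}", "/", path)
print(collapsed)  # /a/b/c
```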
This may not be completely safe, but you could use this regex:
    import re

    def sanitize_url(url: str) -> str:
        # \1 re-inserts the captured "<non-colon>/" and drops the extra slashes
        return re.sub(r"([^:]/)(/)+", r"\1", url)
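It replaces "[non-colon] followed by two or more slashes" with "[non-colon] followed by one slash". The [non-colon] is there to preserve the `//` in http:// or https://.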
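To make the "not completely safe" part concrete, here is a small sketch of how a regex of this shape behaves. The function is restated so the snippet is self-contained, with the backreference written as `r"\1"` so that group 1 is substituted; the second sample URL is hypothetical:

```python
import re


def sanitize_url(url: str) -> str:
    # Collapse a run of slashes that follows "<non-colon>/" into that single slash.
    return re.sub(r"([^:]/)(/)+", r"\1", url)


# The "//" after the scheme is preserved; the one in the path is collapsed.
print(sanitize_url("http://www.example.com//path?foo=bar"))
# http://www.example.com/path?foo=bar

# Caveat: the pattern is applied to the whole string, so a query value
# containing "//" is rewritten as well.
print(sanitize_url("http://www.example.com/path?next=a//b"))
# http://www.example.com/path?next=a/b
```

If the query must be left alone, applying the substitution to the parsed path only is the safer route.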
I've adapted @yunhasnawa's answer to my needs. Here is the relevant part:
    from urlparse import urlparse, urlunparse

    def sanitize_url(url):
        url_parsed = urlparse(url)
        # Rebuild from scheme, host and cleaned path; params, query and fragment are dropped
        return urlunparse((url_parsed.scheme, url_parsed.netloc,
                           avoid_double_slash(url_parsed.path), '', '', ''))

    def avoid_double_slash(path):
        parts = path.split('/')
        not_empties = [part for part in parts if part]  # skip empty segments
        return '/'.join(not_empties)

    >>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
    'https://hostname.doma.in:8443/complex-path/next'
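In my case, where I was trying to fix double slashes in the path without touching the initial double slash in `http://`, this answer seemed to give the best results.
Here is the code: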
    from urlparse import urljoin
    from functools import reduce

    def slash_join(*args):
        return reduce(urljoin, args).rstrip("/")
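A short usage sketch may help here, since the result depends on trailing slashes. It assumes Python 3, where `urljoin` is imported from `urllib.parse`, and the sample URLs are made up:

```python
from functools import reduce
from urllib.parse import urljoin


def slash_join(*args):
    # Fold urljoin over the parts, then drop any trailing slash.
    return reduce(urljoin, args).rstrip("/")


# Parts ending in "/" are treated as directories and accumulate.
print(slash_join("http://www.example.com/", "a/", "b/"))
# http://www.example.com/a/b

# Without the trailing slash, urljoin replaces the last path segment.
print(slash_join("http://www.example.com/a", "b"))
# http://www.example.com/b
```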
Try this:
    def http_normalize_slashes(url):
        url = str(url)
        segments = url.split('/')
        # Drop the empty segments produced by consecutive slashes
        correct_segments = []
        for segment in segments:
            if segment != '':
                correct_segments.append(segment)
        # Prepend a scheme if the first segment does not look like one
        first_segment = str(correct_segments[0])
        if first_segment.find('http') == -1:
            correct_segments = ['http:'] + correct_segments
        # Extra slash so that joining yields "http://"
        correct_segments[0] = correct_segments[0] + '/'
        normalized_url = '/'.join(correct_segments)
        return normalized_url
Example URLs:
    print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
    print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
    print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
    print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))
This prints:
    http://www.example.com/path?foo=bar
    http://www.example.com/path?foo=bar
    http://www.example.com/x/c/v/path?foo=bar
    http://www.example.com/x/c/v/path?foo=bar
Hope that helps. :)
Isn't this a solution?
    urlparse.urlparse(testUrl).path.replace('//', '/')