URL parsing in Python - normalizing double-slash in paths

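I am working on an app that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.

One problem I run into frequently is that urlparse is very strict (possibly even buggy?) when parsing and joining URLs that have double slashes in the path part. For example: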

import urlparse  # Python 2; on Python 3 these functions live in urllib.parse

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path)

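Instead of the expected http://www.example.com//path (or, even better, a version with a normalized single slash), I end up with http://path.

By the way, the reason I run code like this is that it is the only way I have found so far to strip the query/fragment part off a URL. There may be a better way to do it, but I couldn't find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself with a (relatively simple, I know) regex?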
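
On the side question of stripping the query and fragment: a minimal sketch, assuming Python 3 (where the `urlparse` module became `urllib.parse`), is to split the URL and reassemble it with those two components emptied, which avoids `urljoin` entirely. The helper name `strip_query_and_fragment` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, netloc, and path; blank out query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_query_and_fragment("http://www.example.com//path?foo=bar"))
# http://www.example.com//path
```

Note that this leaves the double slash in the path untouched; it only addresses the query/fragment part of the question.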

Related discussion

The official urlparse documentation says:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you can do:

urlparse.urljoin(testUrl, urlparse.urlparse(testUrl).path.replace('//', '/'))

Output: 'http://www.example.com/path'
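
To make the documented behavior concrete, here is a small sketch (assuming Python 3's `urllib.parse`; the variable names are illustrative) showing why the original call ends up with http://path, plus one caveat of a single `str.replace` pass: it does not fully collapse runs of three or more slashes.

```python
from urllib.parse import urljoin, urlsplit

test_url = "http://www.example.com//path?foo=bar"
path = urlsplit(test_url).path
print(path)                        # //path

# Taken on its own, '//path' is a network-path reference:
# 'path' is parsed as the host, not as a path segment.
as_reference = urlsplit(path)
print(as_reference.netloc)         # path

joined = urljoin(test_url, path)
print(joined)                      # http://path

# str.replace scans left to right once, so '///' becomes '//', not '/'.
single_pass = "/a///b".replace("//", "/")
print(single_pass)                 # /a//b
```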

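This may not be completely safe, but you could use this regular expression:

import re

def sanitize_url(url: str) -> str:
    # Keep the captured "[non-colon]/" and drop any extra slashes after it.
    return re.sub(r"([^:]/)(/)+", r"\1", url)

It replaces "[non-colon] followed by two or more slashes" with "[non-colon] followed by a single slash". The [non-colon] is there to preserve http:// or https://.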
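
One reason a whole-URL regex may not be completely safe is that it also rewrites double slashes inside the query string. A hedged alternative, sketched here with Python 3's `urllib.parse` and a hypothetical helper name `sanitize_path_only`, is to apply the substitution to the path component only.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def sanitize_whole_url(url):
    """The whole-URL regex approach, shown for comparison."""
    return re.sub(r"([^:]/)(/)+", r"\1", url)


def sanitize_path_only(url):
    """Collapse repeated slashes in the path; leave query and fragment alone."""
    parts = urlsplit(url)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))


url = "http://www.example.com//path?q=a//b"
print(sanitize_whole_url(url))   # http://www.example.com/path?q=a/b
print(sanitize_path_only(url))   # http://www.example.com/path?q=a//b
```

Whether the query string should be left alone depends on the application; this sketch simply makes the choice explicit.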

I adapted @yunhasnawa's answer for my needs. Here it is:

from urlparse import urlparse, urlunparse  # Python 2; urllib.parse on Python 3

def sanitize_url(url):
    url_parsed = urlparse(url)
    # Rebuild the URL from scheme, netloc, and cleaned path only;
    # params, query, and fragment are dropped.
    return urlunparse((url_parsed.scheme, url_parsed.netloc,
                       avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
    # Splitting on '/' yields empty strings for repeated slashes; drop them.
    parts = path.split('/')
    not_empties = [part for part in parts if part]
    return '/'.join(not_empties)

>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'

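I was trying to correct double slashes in the path without touching the initial double slash of the http:// part, and this answer seemed to give the best results.

Here is the code:

from urlparse import urljoin  # Python 2; urllib.parse on Python 3
from functools import reduce

def slash_join(*args):
    # Join the pieces left to right with urljoin, then drop trailing slashes.
    return reduce(urljoin, args).rstrip("/")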
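
The answer gives no usage example, so here is a short sketch of how `slash_join` behaves, assuming a Python 3 port with `urllib.parse`. The example URLs are made up. Because it relies on `urljoin`, a piece without a trailing slash is replaced by the next piece rather than extended.

```python
from functools import reduce
from urllib.parse import urljoin


def slash_join(*args):
    # Same logic as above, with the Python 3 import.
    return reduce(urljoin, args).rstrip("/")


# Each piece ends with '/', so the next piece is appended to it.
print(slash_join("http://www.example.com/", "api/", "v1/"))
# http://www.example.com/api/v1

# 'api' has no trailing slash, so urljoin replaces it with 'v1'.
print(slash_join("http://www.example.com/api", "v1"))
# http://www.example.com/v1
```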

Try this:

def http_normalize_slashes(url):
    url = str(url)
    # Split on '/' and keep only the non-empty segments.
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    # Prepend a scheme if the first segment does not contain 'http'.
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    # Restore the second slash of '://' before joining.
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

Example URLs:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

This returns:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

Hope this helps. :)

Isn't this a solution?

urlparse.urlparse(testUrl).path.replace('//', '/')