Posted Dec 3, 2008 23:38 UTC (Wed) by jwb (guest, #15467)
That wasn't exactly what I was referring to:
Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib2 import urlopen
>>> urlopen('http://user:password@google.com/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1107, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1064, in do_open
    h = http_class(host) # will parse host:port
  File "/usr/lib/python2.5/httplib.py", line 639, in __init__
    self._set_hostport(host, port)
  File "/usr/lib/python2.5/httplib.py", line 651, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: 'password@google.com'
urllib2 can't understand a URL when the authority contains a colon in its user:password part. The fact that a separate urlparse module exists actually reinforces my point. The tuple returned by urlparse is completely useless anywhere else in the standard library: urlopen won't accept it. urlopen takes either a string or an instance of urllib2.Request.
In Perl the situation is quite satisfactory. The URI module exists and works, and it works harmoniously with HTTP::Message and its descendants, which in turn work harmoniously with LWP and WWW::Mechanize and so forth.
Considering Python's age and the fact that it has developed concurrently with the web, you would think that Python's web support would be quite mature by now, but it isn't. Python's support for basic web operations is quite bad.
What is to replace Perl then?
Posted Dec 4, 2008 0:15 UTC (Thu) by sbergman27 (guest, #10767)
"""
That wasn't exactly what I was referring to
"""
You mean when you said: "You can do anything you want in python, except the incredibly simple things like parsing URLs."? Yeah, I can see where you might really have meant you have a trivial quibble with the syntax.
Python URL parsing 101, in case anyone is interested:
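A minimal sketch of that kind of parsing, using a made-up URL. The imports fall back from Python 3's urllib.parse to the Python 2 urlparse module discussed in this thread, and urlsplit is used because its result exposes the credentials, host and port as attributes:

```python
try:
    from urllib.parse import urlsplit  # Python 3 location
except ImportError:
    from urlparse import urlsplit      # Python 2 module discussed here

# Hypothetical URL, chosen only to exercise every component.
parts = urlsplit('http://user:password@google.com:8080/search?q=lwn#top')

print(parts.scheme)    # http
print(parts.username)  # user
print(parts.password)  # password
print(parts.hostname)  # google.com
print(parts.port)      # 8080
print(parts.path)      # /search
print(parts.query)     # q=lwn
print(parts.fragment)  # top
```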
Posted Dec 4, 2008 0:17 UTC (Thu) by jwb (guest, #15467)
That's great, now what are you going to do with that tuple? You can't feed it to urllib2. urllib2 wants the URL as a string, but it can't parse all the ones that urlparse can parse. See the problem?
What is to replace Perl then?
Posted Dec 4, 2008 2:40 UTC (Thu) by drag (subscriber, #31333)
Ya. I would consider that a bug.
What is to replace Perl then?
Posted Dec 4, 2008 3:42 UTC (Thu) by sbergman27 (guest, #10767)
If so, it doesn't seem to be one that many Python users care about. I spend a lot of time in Python web development communities (Django, TurboGears) and it's not something I hear complaints about.
There have, however, been a couple of request nibbles on the issue tracker over the last 4 years or so to add the functionality to urllib2 as well, and no actual opposition to it. Interestingly, there was activity today from a dev saying he was implementing it, noting that it would be trivial to do.
Personally, I think it's probably a good idea, but since it's just a few lines to handle this case, it doesn't really bother me.
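For illustration, those few lines might look something like the sketch below. It assumes the credentials are meant for HTTP Basic authentication, and request_from_url is a hypothetical helper name, not something the standard library provides. It strips the user:password@ prefix from the authority and moves it into an Authorization header, so the resulting Request can be handed to urlopen:

```python
import base64

try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
    from urllib.request import Request
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2
    from urllib2 import Request


def request_from_url(url):
    """Hypothetical helper: build a Request, moving any user:password@
    prefix out of the URL and into a Basic Authorization header."""
    parts = urlsplit(url)
    if parts.username is None:
        return Request(url)  # no credentials embedded, nothing to do

    # Keep only what follows the last '@' in the authority (host[:port]).
    host = parts.netloc.rpartition('@')[2]
    clean = urlunsplit((parts.scheme, host, parts.path,
                        parts.query, parts.fragment))

    # Assumption: the server expects HTTP Basic authentication.
    credentials = '%s:%s' % (parts.username, parts.password or '')
    token = base64.b64encode(credentials.encode('utf-8')).decode('ascii')

    request = Request(clean)
    request.add_header('Authorization', 'Basic ' + token)
    return request


req = request_from_url('http://user:password@google.com/')
```

Passing req to urlopen then sends the request to the bare host with the credentials in the header. Percent-encoded characters in the username or password are not decoded here, which a real crawler would probably want to handle.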
Like I said, this seems something of a cherry-picked example to "prove" that Python's URL handling is "not mature". I'm sure that Python and Ruby folks could pick more than a few cherries regarding Perl's problems if they wanted to.
What is to replace Perl then?
Posted Dec 4, 2008 1:49 UTC (Thu) by jamesh (guest, #1159)
It probably isn't very helpful to you, but the RFC for HTTP URLs doesn't actually allow for putting passwords in the URL.
Posted Dec 4, 2008 2:27 UTC (Thu) by jwb (guest, #15467)
That is interesting; I hadn't noticed before that the HTTP RFC specifies 'host' instead of 'authority'. But you're right that it's of little use: such URLs are found in the wild, and when you're building a crawler you pretty much have to handle them.