How can I get the base of a URL in Python?

Question

The best way to do this is use urllib.parse.

From the docs:

The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators. It supports the following URL schemes:
file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp,
prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn,
svn+ssh, telnet, wais, ws, wss.

You’d want to do something like this using urlsplit and urlunsplit:

from urllib.parse import urlsplit, urlunsplit

split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')

# You now have:
# split_url.scheme   "http"
# split_url.netloc   "127.0.0.1" 
# split_url.path     "/asdf/login.php"
# split_url.query    "q=abc"
# split_url.fragment "stackoverflow"

# Use all the path except everything after the last "https://stackoverflow.com/" 
clean_path = "".join(split_url.path.rpartition("https://stackoverflow.com/")[:-1])

# "/asdf/"

# urlunsplit joins a urlsplit tuple
clean_url = urlunsplit(split_url)

# "http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"


# A more advanced example 
advanced_split_url = urlsplit('http://foo:[email protected]:5000/asdf/login.php?q=abc#stackoverflow')

# You now have *in addition* to the above:
# advanced_split_url.username   "foo"
# advanced_split_url.password   "bar"
# advanced_split_url.hostname   "127.0.0.1"
# advanced_split_url.port       "5000"

Leave a Comment Cancel reply