The other day I came across an article that asked if you could identify
every part of a URL, then listed the 6 parts of a URL,
scheme://domain:port/path?query=string#anchor. Now that's not wrong,
in that those 6 parts, put together that way, do make up a valid URL, but that's hardly the whole story. In the
article's defense, it does say there are other parts and that those are the most common, but if you went into an
interview and insisted that was the definition of a URL, you wouldn't be "acing" the interview.
In reality, URIs (which include URLs)
are made up of 5 parts,
scheme:[//authority]path[?query][#fragment], with each of
those having its own definition. Scheme and path are required (although the path may be empty), but
authority, query, and fragment are optional.
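To make the five parts concrete, here is a minimal sketch using Python's standard urllib.parse. The URIs are made up for illustration (example.com is a reserved documentation domain), and note that urllib calls the authority "netloc".

```python
from urllib.parse import urlsplit

# Hypothetical URI, used only to show the five components.
uri = "https://example.com:8443/docs/index.html?lang=en#intro"
parts = urlsplit(uri)

print(parts.scheme)    # the scheme
print(parts.netloc)    # the authority (urllib's name for it is "netloc")
print(parts.path)      # the path
print(parts.query)     # the query, without the leading "?"
print(parts.fragment)  # the fragment, without the leading "#"

# The optional parts really are optional: this URI has only a scheme and a path.
bare = urlsplit("mailto:someone@example.com")
print(bare.scheme, repr(bare.netloc), bare.path)
```

The missing components come back as empty strings rather than raising an error, which is one reason to check for emptiness instead of assuming every part is present.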
Yes, domain:port is one example of an authority, but so is
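You won't see that very often, but it is valid. Between the scheme and the
path there are some number of /'s. Usually there are 2 of them (http://host/path), but
sometimes there are 3 (file:///etc/hosts, which is an empty authority followed by an absolute path), and occasionally 0 (mailto:someone@example.com). A single / is
valid too (file:/etc/hosts): that's no authority at all, just an absolute path. And you can have pairs of them inside the path section, because empty path segments are allowed. It only
looks like an on-disk path. It's not.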
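The slash counts are easy to check with a parser. This is a small sketch with Python's urllib.parse. The URIs are illustrative only, and urllib is a practical parser rather than a strict validator, so treat the result as evidence of how one real parser reads them.

```python
from urllib.parse import urlsplit

# Illustrative URIs with 0, 1, 2, and 3 slashes after the scheme,
# plus one with a doubled slash inside the path.
examples = [
    "mailto:someone@example.com",  # zero slashes: no authority
    "file:/etc/hosts",             # one slash: no authority, absolute path
    "http://example.com/a/b",      # two slashes: authority follows
    "file:///etc/hosts",           # three: empty authority, absolute path
    "http://example.com/a//b",     # empty segment inside the path
]

parsed = {uri: urlsplit(uri) for uri in examples}

for uri, p in parsed.items():
    print(f"{uri!r}: authority={p.netloc!r} path={p.path!r}")
```

Note that the parser keeps the doubled slash in the path as-is. Collapsing it, the way a filesystem would, changes which resource the URI names.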
According to the spec, query is pretty much open. It's a string. By convention, it's a set of key-value pairs, but that's not required. Fragment is even less clearly defined: it identifies a sub-resource inside the main resource, and what that means depends on the media type of whatever the URI returns, not on the scheme.
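Here is a hedged sketch of what "open" means in practice, again with Python's urllib.parse. The URIs and parameter names (tag, page) are invented for the example.

```python
from urllib.parse import urlsplit, parse_qs, parse_qsl

# Hypothetical URIs for illustration only.
kv = urlsplit("https://example.com/search?tag=a&tag=b&page=2#results")
opaque = urlsplit("https://example.com/search?just-some-text")

# Conventional key-value query: repeated keys are collected into lists.
kv_dict = parse_qs(kv.query)
kv_pairs = parse_qsl(kv.query)
print(kv_dict)
print(kv_pairs)

# A query that is not key-value pairs is still a perfectly good query string;
# parse_qs simply finds no pairs in it.
print(opaque.query)
print(parse_qs(opaque.query))

# The fragment comes back as an uninterpreted string. What it means is
# up to whoever consumes the resource.
print(kv.fragment)
```

Because keys can repeat, parse_qs returns lists of values. Code that assumes one value per key is relying on the convention, not the spec.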
And then there's character set. A URI is basically limited to the unreserved characters [A-Za-z0-9._~-], plus reserved delimiters like : / ? # [ ] @ whose meaning depends on which part of the URI you're talking about. Any other character needs to be percent-encoded.
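As a rough illustration of the encoding rules, this sketch uses quote and unquote from Python's urllib.parse (behaviour as of Python 3.7 and later). The sample strings are made up.

```python
from urllib.parse import quote, unquote

# Unreserved characters pass through untouched, even with no "safe" extras.
unreserved = "AZaz09-._~"
encoded_unreserved = quote(unreserved, safe="")

# The same text encodes differently depending on where it is going:
# by default quote() leaves "/" alone because it assumes a whole path.
segment = "my file/v2?.txt"
as_path = quote(segment)              # "/" kept as a path separator
as_segment = quote(segment, safe="")  # "/" encoded: one single path segment

# Non-ASCII text is encoded as percent-escaped UTF-8 bytes.
non_ascii = quote("café")
round_trip = unquote(non_ascii)

print(encoded_unreserved, as_path, as_segment, non_ascii, round_trip)
```

The as_path and as_segment difference is the "caveats depending on which part" problem in miniature: whether a character needs encoding depends on the component you are building.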
So, as noted, URIs are complicated. And hard to parse correctly. The solution? Don't. You'll only get it wrong and get tripped up later. Use the parser built into your language or find an appropriate library. For C++ that seems to be cpp_netlib_uri, and for Python, urllib.parse. For Go/Java/C# (anyone actually using C# ?), there are great implementations in the standard library (net/url, java.net.URI, and System.Uri respectively).
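To show how a hand-rolled parser goes wrong, here is a deliberately naive splitter next to the standard library. The naive_host_port helper and both URIs are hypothetical, written only for this comparison.

```python
from urllib.parse import urlsplit

def naive_host_port(uri):
    """Deliberately naive (hypothetical) parser: split on the first ':'."""
    authority = uri.split("://", 1)[1].split("/", 1)[0]
    host, _, port = authority.partition(":")
    return host, port

# Hypothetical URIs: an IPv6 literal host, and an authority with user info.
ipv6 = "http://[2001:db8::1]:8080/path"
userinfo = "http://user:secret@example.com:8080/"

# The naive version cuts both authorities at the wrong colon.
print(naive_host_port(ipv6))
print(naive_host_port(userinfo))

# The library handles the brackets and the user-info prefix.
v6 = urlsplit(ipv6)
ui = urlsplit(userinfo)
print(v6.hostname, v6.port)
print(ui.username, ui.password, ui.hostname, ui.port)
```

Both inputs are legal, and both break the "split on a colon" approach in different ways, which is the kind of thing that trips you up later.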