by Leon Rosenshein

What's In A URI?

The other day I came across an article that asked if you could identify every part of a URL, then listed the 6 parts of a URL, scheme://domain:port/path?query=string#anchor. Now that's not wrong, in that those 6 parts, put together that way, do make up a valid URL, but that's hardly the whole story. In the article's defense, it does say there are other parts and that those are the most common, but if you went into an interview and insisted that was the definition of a URL you wouldn't be "acing" the interview.

In reality, URIs (which include URLs) are made up of 5 parts, scheme://[authority]path?query#fragment, with each of those having its own definition. Scheme and path are required (though the path is allowed to be empty), but authority, query, and fragment are optional.
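To make the 5 parts concrete, here's a minimal sketch using Python's urllib.parse. The helper name five_parts and the example URIs are mine, made up for illustration. Note that urllib calls the authority "netloc".

```python
from urllib.parse import urlsplit

def five_parts(uri):
    """Return (scheme, authority, path, query, fragment) for a URI string."""
    parts = urlsplit(uri)
    # urllib names the authority component "netloc".
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# All five parts present.
print(five_parts("https://example.com:8080/docs/index.html?lang=en#intro"))
# Only scheme and path: no authority, query, or fragment.
print(five_parts("mailto:bob@example.com"))
```

The missing optional parts come back as empty strings rather than errors.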

Yes, domain:port is one example of an authority, but so is bob:Password@contoso.com:6543. You won't see that very often, but it is valid. Between the scheme and the path there are some number of /'s. Usually there are 2 of them, but sometimes there are 3 (file:///etc/hosts), and occasionally 0 (mailto:bob@contoso.com). A single / is valid too (file:/etc/hosts), it's just rare. And you can have pairs of them inside the path section. It only looks like an on-disk path. It's not.
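Here's a rough sketch of how that plays out with Python's urllib.parse. The helper names authority_parts and authority_and_path are hypothetical, and the URIs are just the examples from above with a made-up path tacked on.

```python
from urllib.parse import urlsplit

def authority_parts(uri):
    """Break the authority into (user, password, host, port)."""
    p = urlsplit(uri)
    return (p.username, p.password, p.hostname, p.port)

def authority_and_path(uri):
    """Return (authority, path) so the effect of the slashes is visible."""
    p = urlsplit(uri)
    return (p.netloc, p.path)

# The full-blown authority: userinfo, host, and port.
print(authority_parts("https://bob:Password@contoso.com:6543/inbox"))

# 3, 1, and 0 slashes after the scheme, then a doubled slash inside a path.
for uri in ("file:///etc/hosts",
            "file:/etc/hosts",
            "mailto:bob@contoso.com",
            "http://contoso.com/a//b"):
    print(uri, authority_and_path(uri))
```

With 3 slashes the authority is simply empty and the third slash belongs to the path, which is why file:///etc/hosts and file:/etc/hosts end up with the same path.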

According to the spec, query is pretty much open. It's a string. By convention, it's a set of key-value pairs, but that's not required. Fragment is even less clearly defined. It identifies a sub-resource inside the main resource, and what that means is totally dependent on the type of thing you get back (an anchor in an HTML page, for example), not on the scheme.
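A small sketch of the "query is just a string" point, again with urllib.parse. The helper name query_pairs and the example URIs are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def query_pairs(uri):
    """Interpret the query by the key=value convention; returns a dict of lists."""
    return parse_qs(urlsplit(uri).query)

conventional = "https://example.com/search?a=1&a=2&b=3#results"
freeform = "https://example.com/search?just-some-text#section-2"

# The convention allows repeated keys, so every value is a list.
print(query_pairs(conventional))
# A perfectly legal query that isn't key-value pairs parses to nothing useful.
print(query_pairs(freeform))
# The raw string is always available, and the fragment is handed back untouched.
print(urlsplit(freeform).query, urlsplit(freeform).fragment)
```

The parser makes no attempt to interpret the fragment. It just hands the string back, and making sense of it is up to whoever processes the resource.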

And then there's character set. A URI is basically limited to [A-Za-z0-9._~-], plus a handful of reserved delimiter characters like : / ? # & and =, with a bunch of caveats depending on which part of the URI you're talking about. Any other character needs to be percent-encoded.
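As a quick illustration of the encoding rules, here's a sketch using quote and unquote from urllib.parse. The wrapper name encode_segment is hypothetical. It passes safe="" so that / gets encoded too, which is what you'd want for a single path segment but not for a whole path.

```python
from urllib.parse import quote, unquote

def encode_segment(text):
    """Percent-encode everything except the unreserved characters."""
    # safe="" means even "/" is encoded; quote() leaves it alone by default.
    return quote(text, safe="")

print(encode_segment("A-z_0.9~"))   # unreserved characters pass through untouched
print(encode_segment("a b/c"))      # space and slash get encoded
print(encode_segment("café"))       # non-ASCII is encoded as UTF-8 bytes
print(unquote("caf%C3%A9"))         # and decodes back again
```

Which characters count as safe depends on which part of the URI you're building, which is exactly the "bunch of caveats" mentioned above.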

So, as noted, URIs are complicated. And hard to parse correctly. The solution? Don't. You'll only get it wrong and get tripped up later. Use the parser built into your language or find an appropriate library. For C++ that seems to be cpp_netlib_uri, and for Python, urllib.parse. For Golang/Java/C# (anyone actually using C#?), there are great implementations in the standard library.
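To show how quickly a hand-rolled parser goes sideways, here's a sketch comparing one against urllib.parse. naive_host is a hypothetical example of the kind of string splitting people tend to write, not code from any real library, and the URIs are made up.

```python
from urllib.parse import urlsplit

def naive_host(uri):
    """Hypothetical hand-rolled parser: grab whatever sits before the first ':'."""
    rest = uri.split("://", 1)[1]
    authority = rest.split("/", 1)[0]
    return authority.split(":")[0]

def library_host(uri):
    """Let the standard library deal with userinfo, ports, and IPv6 literals."""
    return urlsplit(uri).hostname

for uri in ("http://contoso.com:6543/inbox",
            "https://bob:Password@contoso.com:6543/inbox",
            "http://[2001:db8::1]:8080/status"):
    print(uri, "naive:", naive_host(uri), "library:", library_host(uri))
```

The naive version works on the easy case, then returns the user name for the second URI and half of an IPv6 address for the third.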