
A practical guide to URI encoding and URI decoding

Encoding URIs* has always been a tough job. What exactly do you need to encode? Why are there sometimes %20 and + signs in the URI? Why do we need to encode at all? RFC 3986 answers many of these questions. This article is therefore a practical summary of what is written there.

The RFC only defines the "generic syntax" of URIs. Note that this generic syntax covers both the most common URI schemes (e.g. ftp, http, file) and new URI schemes that may be introduced in the future.

This article focuses only on the most common URI schemes (as the RFC does), since those are the URIs used most in practice on the web. In addition, the article gives web developers a few tips for working with URIs and HTML forms.

* Note that URI is just a collective name for both URL and URN

How to encode and decode URIs?

These are the summarized steps that need to be taken when encoding and decoding a URI. An extended explanation can be found below this section.

Encoding URIs

  1. Split the URI into its components using the general delimiters.
  2. Do you know the subcomponent delimiters of this URI?
    1. Yes, split the components into subcomponents based on the subcomponent delimiters that you know.
    2. No, leave it.
  3. Is it determined which encoding should be used to encode, either on our side or on the decoder's side (for example, because the decoder of the URI expects a specific encoding)?
    1. Yes, use that encoding to encode the components.
    2. No, use "UTF-8". (When you encode URIs in an HTML web form, you can set the encoding on the form itself or on the HTML page as a whole.)
  4. Encode every component and subcomponent separately, because each has its own rules. Use the table below to encode all characters except the "always allowed" characters and the delimiters that must keep their special meaning.

Decoding URIs

  1. Split the URI into its components using the general delimiters.
  2. Do you know the subcomponent delimiters of this URI?
    1. Yes, split the components into subcomponents based on the subcomponent delimiters that you know.
    2. No, leave it.
  3. Is it determined which encoding should be used to decode, either on our side or on the encoder's side?
    1. Yes, use that encoding to decode the components.
    2. No, use "UTF-8". (When you decode HTTP URIs, look for the encoding in the HTTP header.)
  4. Decode every component and subcomponent the same way: simply convert all the percent-encoded octets to their decoded value.
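
The decoding steps can be sketched in Python as follows. This is only a minimal illustration, assuming UTF-8 and using the standard library's urllib.parse; the function name decode_path_segments is made up for this example. The point it tries to show is the order: split first, decode second, so that an encoded / stays data instead of becoming a path delimiter.

```python
from urllib.parse import urlsplit, unquote

def decode_path_segments(uri, encoding="utf-8"):
    """Hypothetical helper: split first, decode second."""
    path = urlsplit(uri).path       # step 1: split on the general delimiters
    segments = path.split("/")      # step 2: "/" separates the path segments
    # steps 3-4: convert the percent-encoded octets of every segment
    return [unquote(segment, encoding=encoding) for segment in segments]

uri = "http://foo.com/Me%26you@.-:/The%2Fbest%3F?The/best?"
print(decode_path_segments(uri))
# ['', 'Me&you@.-:', 'The/best?']
```

Decoding the whole URI before splitting would turn %2F into a real / and produce one path segment too many.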

Note on HTML form submissions
When encoding and decoding HTML form submissions, there is already a given set of rules and subcomponent delimiters to take into account. See the section "Submitting a form in HTML" below for a more extended description.

Extended theory: Breaking the URI into pieces

The RFC describes the different components of a generic URI. This is important, because every component has its own characters that need encoding and decoding.

A URI can be divided into components with delimiters. The RFC distinguishes general-delimiters and subcomponent delimiters.

General delimiters

The RFC reserves certain characters to be used as general delimiters. Those are: :, /, ?, #, [, ], @. These might seem familiar if we take a look at a simple URI.

  foo://alice@example.com:8042/over/there?search=bar#nose
  \_/   \___/ \_________/ \__/ \________/ \________/ \__/
   |      |        |       |       |          |       |
scheme userinfo   host    port    path      query  fragment

* In practice it is also possible to have an IP address (IPv4 or IPv6) instead of a registered host name, but for simplicity's sake we leave those out.
** Old URIs can also contain a password in the userinfo (alice:password), but that has been deprecated for security reasons.

When processing a URI into its components, the URI should be split on the general delimiters using a "first-match-wins" algorithm. So the first ? that is encountered after the path or host indicates the start of the query part. Any ? after that is no longer considered a general delimiter. This means that the following URI should be processed like this:

  foo://alice@example.com:8042/ove@r/there;color=red;height=200?name?color=brown&height=300#nose?
  \_/   \___/ \_________/ \__/ \______________________________/ \_________________________/ \___/
   |      |        |       |                  |                              |                |
scheme userinfo   host    port               path                          query           fragment

Note that the second @ (in the path) is treated as data in a path and not as a delimiter between userinfo and host. Same goes for the second ? in the query.
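
As a rough sketch of this "first-match-wins" splitting, the regular expression given in RFC 3986 (Appendix B) can be used in Python. The function name split_uri and the returned dictionary are assumptions made for this example, and the authority handling is simplified: it ignores IPv6 literals, just as this article does.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Hypothetical helper: split a URI on its general delimiters."""
    match = URI_PATTERN.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    userinfo = host = port = None
    if authority is not None:
        # Simplification: no IPv6 literals such as [::1] are handled here.
        before_at, at, hostport = authority.rpartition("@")
        userinfo = before_at if at else None
        host, colon, after_colon = hostport.partition(":")
        port = after_colon if colon else None
    return {
        "scheme": scheme, "userinfo": userinfo, "host": host, "port": port,
        "path": path, "query": query, "fragment": fragment,
    }

parts = split_uri(
    "foo://alice@example.com:8042/ove@r/there;color=red;height=200"
    "?name?color=brown&height=300#nose?"
)
print(parts["path"])      # /ove@r/there;color=red;height=200
print(parts["query"])     # name?color=brown&height=300
print(parts["fragment"])  # nose?
```

The @ in the path and the second ? in the query end up inside their components as data, because each character class in the expression stops at the first delimiter it meets.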

Subcomponent delimiters

In every component there are possible subcomponent delimiters defined. All the reserved subcomponent delimiters are: ! $ & ' ( ) * + , ; =.

Within a component, an implementation-specific entity (an API) can determine that it also delimits further with characters from the reserved subcomponent set. But this is only a rule of the implementation-specific entity itself. The RFC doesn't specify anything about which characters to use when, it only shows the allowed characters to be chosen as subcomponent delimiters.

So let's say that an API has two additional subcomponent rules.

  1. In the path, we distinguish key-value pairs coupled with the = character and delimited by the ; character (the so-called Matrix URIs from back in the day).
  2. In the query, we distinguish key-value pairs coupled with the = character and delimited by the & character (used for handling web forms).

Further processing of the path and query component would give us:

  foo://alice@example.com:8042/ove@r/there;color=red;height=200?name?color=brown&height=300#nose?
                                           \___/ \_/ \____/ \_/ \________/ \___/ \____/ \_/
                                             |    |    |     |      |        |     |     |
                                             k1   v1   k2    v2     k1       v1    k2    v2
                               \______________________________/ \_________________________/
                                              |                              |
                                             path                          query

So if you process URIs you always need to know the components of the URI (determined by the general delimiters) and, if any are specified (by an API, for example), the subcomponents (determined by the subcomponent delimiters). Only then can you determine the right semantics.
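
The two API rules from this section could be applied to the already separated path and query roughly like this. It is a sketch only; split_pairs is a hypothetical helper and the delimiters are the ones the imaginary API above has chosen, not something the RFC prescribes.

```python
def split_pairs(text, pair_delimiter, key_value_delimiter="="):
    """Hypothetical helper: split "k1=v1<delim>k2=v2" into (key, value) tuples."""
    pairs = []
    for item in text.split(pair_delimiter):
        # Only the first "=" couples key and value; later ones are data.
        key, _, value = item.partition(key_value_delimiter)
        pairs.append((key, value))
    return pairs

path = "/ove@r/there;color=red;height=200"
query = "name?color=brown&height=300"

# Rule 1: in the last path segment, pairs are delimited by ";".
last_segment = path.rsplit("/", 1)[-1]
segment_name, _, matrix_params = last_segment.partition(";")
path_pairs = split_pairs(matrix_params, ";")

# Rule 2: in the query, pairs are delimited by "&".
query_pairs = split_pairs(query, "&")

print(segment_name, path_pairs)  # there [('color', 'red'), ('height', '200')]
print(query_pairs)               # [('name?color', 'brown'), ('height', '300')]
```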

Why to encode?

We would like to be able to express characters from every language in URIs, so the whole world can use URIs. This is powerful. But not all applications that need to display URIs and their characters want to implement the complete Unicode character set.

In addition, we've seen that the URI has a certain structure delimited by special characters. We would still like to be able to use those characters in a URI as data, just not as special characters. In that case, we encode the character that has special meaning in a URI.

Different encoding formats

When we determine that a character should be encoded, we should encode it as a percent sign followed by a hexadecimal pair (a so-called percent-encoded octet). There are multiple encoding formats, for example "UTF-8", "UTF-16" or "ISO-8859-1", which encode characters from the UCS into 8-bit octets (not all of them cover the whole UCS). When we want to transform the letter æ to percent-encoded octets in UTF-8, a UTF-8 table shows that it has two hexadecimal pairs: C3 A6. This results in an encoding of the form %C3%A6. When we transform the same letter æ to a percent-encoded octet in "ISO-8859-1", an ISO-8859-1 table shows that it has one hexadecimal pair: E6. So the resulting encoding looks like this: %E6.

Different encoding formats generate different percent-encoded octets. Therefore it is always important to know in which encoding something is encoded.
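
The difference is easy to reproduce with Python's urllib.parse, used here only as an illustration; any library that lets you choose the character encoding should behave similarly.

```python
from urllib.parse import quote, unquote

utf8_encoded = quote("æ", encoding="utf-8")
latin1_encoded = quote("æ", encoding="iso-8859-1")
print(utf8_encoded)    # %C3%A6
print(latin1_encoded)  # %E6

# Decoding with the matching encoding gives the letter back.
print(unquote("%E6", encoding="iso-8859-1"))  # æ

# Decoding with the wrong encoding does not: by default unquote()
# substitutes the Unicode replacement character for invalid bytes.
wrong = unquote("%E6", encoding="utf-8")
print(wrong == "\ufffd")  # True
```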

What characters to encode and decode?

When a URI is split into the right components we are ready to encode! For every component there is a defined list of which characters should never be encoded, which characters should sometimes be encoded and which characters should always be encoded (given it is a generic URI). In the following table all the characters from the RFC are distilled and separated.

Component  Always allowed               Implementation-specific allowed  Not allowed (for generic URIs)       %HH allowed?
Scheme     a-z A-Z 0-9 + - .            (none)                           (none)                               No
Userinfo   a-z A-Z 0-9 - . _ ~          ! $ & ' ( ) * + , ; =            @ :                                  Yes
Host       a-z A-Z 0-9 - . _ ~          ! $ & ' ( ) * + , ; =            (none)                               Yes
Port       0-9                          (none)                           (none)                               No
Path       a-z A-Z 0-9 - _ ~ @ :        ! $ & ' ( ) * + , ; =            / (to determine hierarchy of paths)  Yes
                                                                         .. . (to determine relative paths)
Query      a-z A-Z 0-9 - . _ ~ @ : / ?  ! $ & ' ( ) * + , ; =            (none)                               Yes
Fragment   a-z A-Z 0-9 - . _ ~ @ : / ?  ! $ & ' ( ) * + , ; =            (none)                               Yes
* Note: a-z means the range of all letters from a to z
** Note: %HH means a percent-encoded octet
*** Note: In a path the . is sometimes allowed, see the section "Using a dot within a path" below

So how should we read this table?

For every component there are certain characters that are always allowed and should never be escaped (column 2). The implementation-specific characters are sometimes delimiters, so when you are not sure, just escape them when they are used as data (column 3). The not allowed characters have special meaning within the component and should always be escaped (again, only when used as data) (column 4). The last column indicates whether encoding into a percent-encoded octet is allowed at all within that component. If not, escaping is never done.

I would like to emphasize again that it is important to know which implementation-specific delimiters are used. Only then can you know for sure what to encode. As a rule of thumb, you can say that you encode everything except the always allowed characters within a component. But only after you have first broken the URI into its right parts (for example, the path should be split correctly on / into subcomponents, and complete path components like .. should be normalized or left unencoded).

Let's take a look at an example. We have a UTF-8 encoded website foo.com which doesn't have any implementation-specific delimiters. We have a section on our website with the name Me&you@.-:, a page called The/best? and a query with name The/best?. Our URI would need to be encoded like this:

  http://foo.com/Me%26you@.-:/The%2Fbest%3F?The/best?
                 \__________/ \___________/ \_______/
                      |             |           |
                   section      page name     query

We only encode the & in the section, the / and ? in the page name and nothing in the query as per our allowed characters in column 2.

Let's look at another example with the query string, when no implementation-specific delimiter is being used. Take this query string: pàræm#2===$300. We would encode (to UTF-8) the à, æ, #, = and $ characters; the rest is allowed. It would look like: p%C3%A0r%C3%A6m%232%3D%3D%3D%24300.
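
Both examples above can be reproduced with a small Python sketch. The dictionary EXTRA_ALLOWED and the function encode_component are made up for this illustration; they pass the "always allowed" characters from column 2 of the table to urllib.parse.quote, which by itself never encodes letters, digits and - . _ ~ (the ~ only since Python 3.7).

```python
from urllib.parse import quote

# "Always allowed" characters per component, on top of letters, digits
# and - . _ ~ which quote() leaves alone anyway. Hypothetical mapping
# based on column 2 of the table in this article.
EXTRA_ALLOWED = {
    "userinfo": "",
    "host": "",
    "path": "@:",
    "query": "@:/?",
    "fragment": "@:/?",
}

def encode_component(text, component, encoding="utf-8"):
    """Encode everything except the always allowed characters."""
    return quote(text, safe=EXTRA_ALLOWED[component], encoding=encoding)

# First example: two path segments and a query, no extra delimiters.
segments = ["Me&you@.-:", "The/best?"]
uri = (
    "http://foo.com/"
    + "/".join(encode_component(s, "path") for s in segments)
    + "?" + encode_component("The/best?", "query")
)
print(uri)  # http://foo.com/Me%26you@.-:/The%2Fbest%3F?The/best?

# Second example: a query string without implementation-specific delimiters.
print(encode_component("pàræm#2===$300", "query"))
# p%C3%A0r%C3%A6m%232%3D%3D%3D%24300
```

Note that each path segment is encoded on its own and only then joined with /, which is why the / inside the page name becomes %2F while the joining slashes stay as they are.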

What if we had implementation-specific delimiters? If we use the same implementation-specific delimiters that are used on the web for form submissions, the encoding is different. If we want to express a key param#1 and a value $300 as stated in the HTML specification, then we know that key-value pairs are coupled with the = sign and delimited by the & sign. Only then do we know that we need to encode these parts separately: param#1 and $300. Encoded with UTF-8, it looks like this: param%231=%24300.

If we would apply the same rules on this query string: param#1==$300 where the value would be =$300, it would look like: param%231=%3D%24300.
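
With the = and & delimiters known, keys and values can be encoded one by one and joined afterwards. The sketch below assumes UTF-8 and the "always allowed" query characters from the table; encode_query_pairs is a hypothetical name.

```python
from urllib.parse import quote

def encode_query_pairs(pairs, encoding="utf-8"):
    """Encode keys and values separately, then join with = and &."""
    def encode(text):
        # @ : / ? are always allowed in a query; = and & are not listed
        # as safe, so they are encoded when they occur as data.
        return quote(text, safe="@:/?", encoding=encoding)
    return "&".join(encode(key) + "=" + encode(value) for key, value in pairs)

print(encode_query_pairs([("param#1", "$300")]))   # param%231=%24300
print(encode_query_pairs([("param#1", "=$300")]))  # param%231=%3D%24300
```

The = that couples key and value is added after encoding, so it stays a delimiter, while the = inside the value is data and becomes %3D.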

Submitting a form in HTML

Let's look at one of the most practical situations for web developers. When submitting a form, the data can be sent via either the POST or the GET method. According to the HTML specification, this data is encoded by default with the "application/x-www-form-urlencoded" media type; with the GET method it ends up in the query part of the URI. This encoding differs from RFC 3986: all non-alphanumeric characters except -, _, * and . are encoded, and the space character is replaced with a + sign.

Example:
Submit data from the website foo.com via a form with encoding type UTF-8:

email=test@qqq.is
password=te st

Only the @ is percent-encoded and the space is turned into a + sign, so this URI is generated:

http://foo.com?email=test%40qqq.is&password=te+st

So encoding and decoding a query string from an HTML form submission already has a set of rules, defined by the media type. Just remember to apply those rules and subcomponent delimiters to the query.
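
In Python, for instance, the standard library already implements this style of encoding, so the example above can be reproduced without writing the rules by hand. This is only an illustration; Python's set of unencoded characters may differ slightly from the HTML specification for a few characters such as *.

```python
from urllib.parse import urlencode, parse_qs

form_data = {"email": "test@qqq.is", "password": "te st"}

# Encode: keys and values are escaped, spaces become "+",
# pairs are joined with "=" and "&".
query = urlencode(form_data, encoding="utf-8")
print("http://foo.com?" + query)
# http://foo.com?email=test%40qqq.is&password=te+st

# Decode: split on "&" and "=", turn "+" back into a space,
# then convert the percent-encoded octets.
decoded = parse_qs(query, encoding="utf-8")
print(decoded)
# {'email': ['test@qqq.is'], 'password': ['te st']}
```

parse_qs returns a list per key because a form may submit the same key more than once.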

Relative paths

The RFC also describes the use of relative paths in URIs. A relative path is a path that contains dot-segments: ./ (one dot and a slash) or ../ (two dots and a slash). These are intended to express a relative position in the hierarchy of the total path. They overlap with the use of paths in file systems and are used everywhere on the internet.

When relative paths are identified, their dot-segments should never be encoded.

Normalization

The process of making a relative path absolute is called normalization. A path subcomponent that is exactly .. indicates a path higher in hierarchy and a . indicates the same path. So this relative URI: http://foo.com/a/../b/./c/d/../e can be normalized to: http://foo.com/b/c/e.
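
A simplified sketch of this normalization in Python is shown below. It is not the exact algorithm from the RFC, only an approximation that assumes an absolute path starting with /; the function name remove_dot_segments is borrowed from the RFC's terminology.

```python
from urllib.parse import urlsplit

def remove_dot_segments(path):
    """Simplified normalization; assumes an absolute path starting with "/"."""
    segments = path.split("/")
    stack = []
    for segment in segments[1:]:   # skip the empty string before the first "/"
        if segment == ".":
            continue               # same level: drop it
        if segment == "..":
            if stack:
                stack.pop()        # one level up: drop the previous segment
            continue
        stack.append(segment)
    result = "/" + "/".join(stack)
    # A trailing "." or ".." refers to a directory, so keep a trailing slash.
    if segments[-1] in (".", "..") and result != "/":
        result += "/"
    return result

parts = urlsplit("http://foo.com/a/../b/./c/d/../e")
normalized = parts._replace(path=remove_dot_segments(parts.path)).geturl()
print(normalized)  # http://foo.com/b/c/e
```

Only segments that are exactly . or .. are touched; a segment such as c.d or a.. is treated as ordinary data.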

Using a dot within a path

It is allowed to use a dot in path names as data, like this: /a/b/c.d/e, /a../b./ or /a...b/. Only when the complete path subcomponent is exactly . or .. is it not allowed as data. So these are examples where the path subcomponents are not treated as data but indicate a relative path: /a/b/./c and /a/../b.

Final notes

This information is extracted from RFC 3986 and interpreted.

Contributors

  • Dino