
A practical guide to URI encoding and URI decoding

Encoding URIs* has always been a tough job. What exactly do you need to encode? Why are there sometimes %20 and + signs in the URI? Why do we need to encode at all? The answers to these questions are found in RFC 3986. This article is a practical summary of that RFC.

The RFC only describes the generic syntax of a URI. All URIs should follow that syntax, but every scheme-specific (e.g. http, ftp, file) or implementation-specific (a browser or web application that handles URIs) syntax can have additional rules. It is therefore always necessary to know these additional rules in order to encode and decode correctly.

* Note that URI is just a collective name for both URL and URN

TL;DR

These are the summarized steps that need to be taken when encoding and decoding a URI. An extended explanation can be found below this section.

Encoding and decoding URIs

  • 1. Split the URI into its components, based on the generic delimiters.
  • 2. Are there scheme-specific (e.g. http, ftp, file) or implementation-specific (a browser or web application) delimiters that need to be used for this URI?
  • 3. Does a protocol in use prescribe the encoding?
    • a. Yes: use the encoding prescribed by the protocol
    • b. No: use the same encoding as the surrounding text
  • 4a. ENCODING: Encode every component and/or subcomponent with the chosen encoding. Use the table to encode all characters except the "never encode" and (depending on your subcomponent delimiters) the "sometimes encode" characters.
  • 4b. DECODING: Decode every component and/or subcomponent, convert all the percent-encoded octets to their decoded value by using the chosen encoding.
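
The decoding steps above can be sketched in Python with the standard urllib.parse module. This is a minimal illustration, not a complete decoder: the function name decode_query is hypothetical, and it assumes an implementation that delimits query pairs with & and couples keys and values with =. The point it shows is the order of operations: split on the delimiters first, and only then decode each piece.

```python
from urllib.parse import urlsplit, unquote

def decode_query(uri, encoding="utf-8"):
    """Split first, decode afterwards (hypothetical helper for illustration)."""
    parts = urlsplit(uri)                      # step 1: generic delimiters
    pairs = []
    for pair in parts.query.split("&"):        # step 2: assumed delimiter '&'
        key, _, value = pair.partition("=")    # step 2: assumed delimiter '='
        # step 4b: decode each subcomponent with the chosen encoding
        pairs.append((unquote(key, encoding=encoding),
                      unquote(value, encoding=encoding)))
    return pairs

print(decode_query("http://foo.com/?a%26b=c%3Dd&x=%C3%A6"))
```

Decoding before splitting would turn %26 back into & too early, and the key a&b would be cut in two.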

Breaking the URI into pieces

A URI can be divided into components by certain characters that are used as delimiters. The RFC distinguishes generic delimiters and subcomponent delimiters.

Generic delimiters

The RFC reserves certain characters to be used as generic delimiters. Those are: :, /, ?, #, [, ], @. These might seem familiar if we take a look at a simple URI.

foo://alice@example.com:8042/over/there?search=bar#nose
\_/   \___/ \_________/ \__/ \________/ \________/ \__/
 |      |        |       |       |          |       |
scheme userinfo host    port    path      query  fragment

* In practice it is also possible to have an IP address (IPv4 or IPv6) instead of a registered host name, but for simplicity's sake we leave those out.
** Old URIs can also contain a password in the userinfo (alice:password), but that has been deprecated for security reasons.

When processing a URI into its components, the URI should be split on the generic delimiters using a "first-match-wins" algorithm. The first ? that is encountered after the path or host indicates the start of the query part. Any ? that comes after this first occurrence is no longer considered a generic delimiter. This means that the following URI should be processed like this:

foo://alice@example.com:8042/ove@r/there;color=red;height=200?name?color=brown&height=300#nose?
\_/   \___/ \_________/ \__/ \______________________________/ \_________________________/ \___/
 |      |        |       |                  |                              |                |
scheme userinfo host    port              path                           query          fragment

Note that the second @ (in the path) is treated as data in the path and not as a delimiter between userinfo and host. The same goes for the second and third ? in the query and fragment.
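
As an illustration of first-match-wins splitting, the sketch below uses the regular expression that RFC 3986 gives in its Appendix B. The function name split_uri is hypothetical. Splitting the authority on @ and : is a simplification that assumes a registered host name and no IPv6 literal, in line with the rest of this article.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Split a URI into components (hypothetical helper for illustration)."""
    match = URI_PATTERN.match(uri)
    authority = match.group(4) or ""
    # Simplification: assumes no IPv6 literal in the host.
    userinfo, _, hostport = authority.rpartition("@")
    host, _, port = hostport.partition(":")
    return {
        "scheme": match.group(2),
        "userinfo": userinfo,
        "host": host,
        "port": port,
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }

uri = ("foo://alice@example.com:8042/ove@r/there;color=red;height=200"
       "?name?color=brown&height=300#nose?")
for name, value in split_uri(uri).items():
    print(f"{name:9} {value}")
```

Because the path group stops at the first ? or #, the later ? characters end up as plain data inside the query and the fragment.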

Subcomponent delimiters

Within each component, further subcomponent delimiters may be defined. The reserved subcomponent delimiters are: ! $ & ' ( ) * + , ; =.

An implementation-specific entity (an API of a web application) can have additional delimitation rules with characters from the reserved subcomponent set. But this is only a rule of the implementation-specific entity itself. The RFC doesn't specify anything about which characters to use when, it only shows the allowed characters to be chosen as subcomponent delimiters.

Let's say that an API has two additional subcomponent rules.

  1. In the path, we distinguish key-value pairs coupled with the = character and delimited by the ; character (formerly known as Matrix URIs).
  2. In the query, we distinguish key-value pairs coupled with the = character and delimited by the & character (used for handling web forms).

Further processing of the path and query component would give us:

foo://alice@example.com:8042/ove@r/there;color=red;height=200?name?color=brown&height=300#nose?
                                         \___/ \_/ \____/ \_/ \________/ \___/ \____/ \_/
                                           |    |    |     |      |        |     |     |
                                          k1   v1   k2    v2     k1       v1    k2    v2
                             \______________________________/ \_________________________/
                                            |                              |
                                          path                           query

To process a URI, you always need to know its components (determined by the generic delimiters) and, if any are specified (by an API, for example), its subcomponents (determined by the subcomponent delimiters). Only then can the right semantics be determined.
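
The two API rules from this example can be sketched as follows. The helper split_pairs is hypothetical, and the sketch assumes that the key-value pairs in the path belong to its last segment.

```python
def split_pairs(text, pair_delimiter, key_value_delimiter="="):
    """Split text into (key, value) tuples (hypothetical helper)."""
    pairs = []
    for item in text.split(pair_delimiter):
        key, _, value = item.partition(key_value_delimiter)
        pairs.append((key, value))
    return pairs

path = "/ove@r/there;color=red;height=200"
query = "name?color=brown&height=300"

# Rule 1: in the path, pairs are delimited by ';' (assumed: last segment only).
last_segment = path.rsplit("/", 1)[-1]
segment_name, _, parameters = last_segment.partition(";")
path_pairs = split_pairs(parameters, ";")

# Rule 2: in the query, pairs are delimited by '&'.
query_pairs = split_pairs(query, "&")

print(segment_name, path_pairs)
print(query_pairs)
```

Note that the first key in the query is name?color: the ? inside the query is data, not a delimiter.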

Why encode?

We would like to be able to express as many characters from every language as possible in URIs, so the whole world can use URIs. This is powerful. But not all applications that need to display URIs and their characters want to implement the complete Unicode Character Set.

In addition, we've seen that the URI has a certain structure delimited by special characters. We would still like to be able to use those special characters as data in a URI. In that case, we need to encode them.

Different encoding formats

When we determine that a character should be encoded, we encode it as one or more percent-encoded hexadecimal pairs (so-called percent-encoded octets). There are multiple character encodings, for example "UTF-8", "UTF-16" or "ISO-8859-1", which map characters from the UCS to octets (not all of them cover the whole UCS). To transform the character æ into percent-encoded octets in UTF-8, we look it up in this table and find two hexadecimal pairs: C3 A6. This results in the encoding %C3%A6. If we transform the same letter æ in "ISO-8859-1", this table shows a single hexadecimal pair: E6. The resulting encoding is %E6.

Different encoding formats generate different percent-encoded octets. It is therefore always necessary to determine which encoding is used before encoding or decoding.
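
The æ example can be reproduced with quote and unquote from Python's urllib.parse module, which both accept an encoding argument. The variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# The same character gives different octets in different encodings.
utf8_form = quote("æ", safe="", encoding="utf-8")
latin1_form = quote("æ", safe="", encoding="iso-8859-1")
print(utf8_form, latin1_form)

# Decoding only round-trips when the same encoding is used.
right = unquote(latin1_form, encoding="iso-8859-1")
wrong = unquote(latin1_form, encoding="utf-8")
print(repr(right), repr(wrong))
```

Decoding %E6 as UTF-8 does not raise an error by default. The invalid octet is silently replaced with the Unicode replacement character, which is why a wrong encoding choice can go unnoticed.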

Which encoding to use?

A scheme or protocol may enforce the use of a certain encoding. In that case, that encoding is used. When nothing is prescribed, the URI should use the same encoding as its surrounding text.

What characters to encode and decode?

When a URI is split into the right components we are ready to encode! For every component there is a defined list of which characters should never be encoded, which characters should sometimes be encoded because of the implementation and which characters should sometimes be encoded because of their special use in certain parts of the URI. In the following table all the characters from the RFC are summarized and separated.

Component | Never encode | Sometimes encode (implementation-specific) | Sometimes encode (when not used as special character) | Encoding allowed?
Scheme | a-z A-Z 0-9 + - . | (none) | (none) | No
Userinfo | a-z A-Z 0-9 - . _ ~ | ! $ & ' ( ) * + , ; = | : (to separate user and password) | Yes
Host (IP addresses ignored) | a-z A-Z 0-9 - . _ ~ | ! $ & ' ( ) * + , ; = | (none) | Yes
Port | 0-9 | (none) | (none) | No
Path | a-z A-Z 0-9 - _ ~ @ : | ! $ & ' ( ) * + , ; = | / (to determine hierarchy of paths), .. and . (to determine relative paths) | Yes
Query | a-z A-Z 0-9 - . _ ~ @ : / ? | ! $ & ' ( ) * + , ; = | (none) | Yes
Fragment | a-z A-Z 0-9 - . _ ~ @ : / ? | ! $ & ' ( ) * + , ; = | (none) | Yes
* Note: a-z means the range of all letters from a to z

How should we read this table?
For every component, certain characters are always allowed and should never be encoded (column 2). The characters in column 3 only need encoding when your implementation uses them as delimiters: if some of them are delimiters and you also want to use those characters as data, encode them wherever they appear as data. If none of them are delimiters, you can use all of them as data without encoding anything. Some characters have a special meaning within a component and should always be encoded when used as data (column 4). The last column indicates whether percent-encoding is allowed at all within that component. If not, encoding is never done.
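
One way to apply the table in code is to translate each row into the safe argument of Python's urllib.parse.quote. The sketch below is an interpretation of the table, not a definitive implementation: the names EXTRA_SAFE and encode_component are hypothetical, the path entry applies to a single path segment, and quote itself never encodes letters, digits and the characters - . _ ~ (the ~ only since Python 3.7).

```python
from urllib.parse import quote

SUB_DELIMS = "!$&'()*+,;="

# Extra "never encode" characters per component, taken from the table.
EXTRA_SAFE = {
    "userinfo": "",
    "host": "",
    "path": "@:",        # applies to one path segment, so '/' is encoded
    "query": "@:/?",
    "fragment": "@:/?",
}

def encode_component(text, component, delimiters="", encoding="utf-8"):
    """Percent-encode data for one (sub)component.

    `delimiters` lists the subcomponent delimiters the implementation
    uses; those are encoded when they occur in the data.
    """
    keep = "".join(c for c in SUB_DELIMS if c not in delimiters)
    return quote(text, safe=EXTRA_SAFE[component] + keep, encoding=encoding)

print(encode_component("bar&fuz", "path"))
print(encode_component("te st&", "query", delimiters="&="))
print(encode_component("a/b", "path"))
```

With no delimiters declared, bar&fuz is left untouched. Once & and = are declared as delimiters, the & in the data is encoded.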

Example 1
Let's take a look at an example. We have a UTF-8 encoded website http://foo.com where the HTTP protocol doesn't prescribe any implementation-specific delimiters. Our website has a page named bar&fuz (path), a subpage named (fux*) (path) and a query named crux&crux. The URI should be displayed like this:

http://foo.com/bar&fuz/(fux*)?crux&crux
               \____________/ \_______/
                     |            |
                    path        query

Since the &, *, (, ) are not used as implementation-specific delimiters, we can freely use them without encoding.

Example 2
Let's look at one of the most practical situations for web developers. When submitting a form, the data can be sent via either the POST or the GET method. According to the HTML specification, this data is by default encoded with the application/x-www-form-urlencoded media type (for GET, in the query part). It prescribes that key-value pairs are coupled with the = sign and delimited by the & sign. In addition, all non-alphanumeric characters except -, _, * and . should be encoded, and the space character is replaced with a + sign.

Let's submit the following key-value pairs on the website http://foo.com via an HTML form with UTF-8 encoding:

email=test@qqq.is
password=te st&

We should encode the @ and the &, and the space is turned into a + sign. The URI would look like this:

http://foo.com?email=test%40qqq.is&password=te+st%26
               \___________________________________/
                                 |
                               query

Since the & and the = are used as implementation-specific delimiters, we need to encode them when they are used as data.*

* Note also that in our table the @ is in the "never encode" column for the query part, but here we still encode it. Because the HTML specification relies on the old RFC1738, we need to make an exception here and also encode the @.
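
Example 2 can be reproduced with urlencode and parse_qs from Python's urllib.parse module, as a rough check of the result above. One caveat: Python's set of unencoded characters is close to, but not identical to, the one in the HTML specification (Python leaves ~ as-is and encodes *), which does not matter for this input.

```python
from urllib.parse import urlencode, parse_qs

form_data = {"email": "test@qqq.is", "password": "te st&"}

# urlencode couples pairs with '=', delimits them with '&',
# and turns spaces into '+'.
query = urlencode(form_data, encoding="utf-8")
uri = "http://foo.com?" + query
print(uri)

# parse_qs reverses the process; each key maps to a list of values.
decoded = parse_qs(query, encoding="utf-8")
print(decoded)
```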

Relative paths

The RFC also describes the use of relative paths in URIs. A relative path is a path that contains dot-segments: ./ (one dot and a slash) or ../ (two dots and a slash). These express a position relative to the hierarchy of the total path. They mirror the way paths work in file systems and are used everywhere on the internet.

When relative paths are identified they should never be encoded.

Normalization

Resolving the dot-segments in a path is part of what the RFC calls normalization. A path segment that is exactly .. refers to the level above in the hierarchy and a . refers to the current level. This URI with dot-segments: http://foo.com/a/../b/./c/d/../e can be normalized to: http://foo.com/b/c/e.
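
A minimal sketch of this normalization step is shown below. It is a simplified, stack-based variant of the dot-segment removal described in RFC 3986 and assumes an absolute path (one that starts with /). The function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path):
    """Resolve '.' and '..' segments in an absolute path."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            if is_last:
                output.append("")      # keep the trailing slash
        elif segment == "..":
            if len(output) > 1:        # never pop the leading root
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)

def normalize(uri):
    """Apply dot-segment removal to the path of a URI."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))

print(normalize("http://foo.com/a/../b/./c/d/../e"))
```

Only segments that are exactly . or .. are touched, so a segment such as c.d passes through unchanged.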

Using a dot within a path

It is allowed to use a dot in path names as data, like this: /a/b/c.d/e, /a../b./ or /a...b/. Only when a complete path segment is exactly . or .. does it indicate a relative path, like here: /a/b/./c and /a/../b.
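
The distinction can be checked with a few lines of Python. The helper dot_segments is hypothetical and simply lists the segments that are exactly . or .. after splitting the path on /.

```python
def dot_segments(path):
    """Return the segments of a path that are exactly '.' or '..'."""
    return [segment for segment in path.split("/") if segment in (".", "..")]

for path in ["/a/b/c.d/e", "/a../b./", "/a...b/", "/a/b/./c", "/a/../b"]:
    print(path, dot_segments(path))
```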

Final notes

This information is extracted from RFC 3986 and interpreted. If you have any comments or questions, just send an email to webmaster and then the rest of this domain.