A practical guide to URI encoding and URI decoding
Encoding URIs* has always been a tough job. What exactly do you need to encode? Why are there sometimes + signs in the URI? Why do we need to encode at all? The answers to these questions are written in RFC 3986. This article is a practical summary of that RFC.
The RFC only describes the generic syntax of a URI. All URIs should follow that syntax, but every scheme-specific (e.g. http, ftp, file) or implementation-specific (a browser or web application that handles URIs) syntax can have additional rules. It is therefore always necessary to know those additional rules in order to encode and decode correctly. The case of URIs used in HTML forms is taken as an example in this article.
This article only focuses on the most common URI schemes (as the RFC does). Those are the URIs most commonly used on the web. In addition, the article provides a few tips for web developers who work with URIs and HTML forms.
* Note that URI is just a collective name for both URL and URN
These are the summarized steps that need to be taken when encoding and decoding a URI. An extended explanation can be found below this section.
Encoding and decoding URIs
- 1. Split the URI into its components, based on the generic delimiters.
- 2. Are there scheme-specific (e.g. http, ftp, file) or implementation-specific (a browser or web application) delimiters that need to be used for this URI?
- a. Yes, split the components into subcomponents based on the subcomponent delimiters.
- b. No, leave it.
- 3. Is there a protocol in use that prescribes the encoding?
- a. Yes, use the encoding prescribed by the protocol.
- b. No, use the same encoding as the surrounding text.
- 4a. ENCODING: Encode every component and/or subcomponent with the chosen encoding. Use the table to encode all characters except the "never encode" and (depending on your subcomponent delimiters) the "sometimes encode" characters.
- 4b. DECODING: Decode every component and/or subcomponent, convert all the percent-encoded octets to their decoded value by using the chosen encoding.
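The decoding steps above can be sketched in Python with the standard urllib.parse module. This is only a minimal sketch: the function name decode_uri is hypothetical, and it assumes an implementation that uses & and = as query delimiters and UTF-8 as the encoding. The point it illustrates is the order of operations: split on the delimiters first, decode afterwards.

```python
from urllib.parse import urlsplit, unquote

def decode_uri(uri, encoding="utf-8"):
    """Split a URI first, then percent-decode each piece (steps 1, 2 and 4b)."""
    parts = urlsplit(uri)  # step 1: split on the generic delimiters

    # Step 2 (assumed rules): "/" separates path segments,
    # "&" separates query pairs and "=" couples key and value.
    segments = [unquote(s, encoding=encoding) for s in parts.path.split("/")]
    pairs = []
    for pair in filter(None, parts.query.split("&")):
        key, _, value = pair.partition("=")
        # Step 4b: decode only after the delimiters have done their job.
        pairs.append((unquote(key, encoding=encoding),
                      unquote(value, encoding=encoding)))
    return segments, pairs

segments, pairs = decode_uri(
    "http://foo.com/Me%26you/The%2Fbest%3F?param%231=%3D%24300"
)
print(segments)  # ['', 'Me&you', 'The/best?']
print(pairs)     # [('param#1', '=$300')]
```

Decoding the whole URI before splitting would turn %2F and %3D back into / and =, and the split would then cut the data in the wrong places.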
Note on HTML form submissions
When encoding and decoding HTML form submissions, see the section "Submitting a form in HTML" for the encoding that needs to be used.
Breaking the URI into pieces
A URI can be divided into components by certain characters that are used as delimiters. The RFC distinguishes generic delimiters and subcomponent delimiters.
The RFC reserves certain characters to be used as generic delimiters. Those are: : / ? # [ ] and @. These might seem familiar if we take a look at a simple URI.
foo://email@example.com:8042/over/there?search=bar#nose
\_/   \___/ \_________/ \__/\_________/ \________/ \__/
 |      |        |       |       |          |       |
scheme userinfo host    port   path       query  fragment
* In practice it is also possible to have an IP address (IPv4 or IPv6) instead of a registered host name, but for simplicity's sake we leave those out.
** Old URIs can also contain a password in the userinfo (alice:password), but that has been deprecated for security reasons.
When processing a URI into its components, the URI should be split on the generic delimiters using a "first-match-wins" algorithm. So the first ? that is encountered after the path or host indicates the start of the query part. Every ? after that is no longer considered a generic delimiter. This means that the following URI should be processed like this:
foo://firstname.lastname@example.org:8042/ove@r/there;color=red;height=200?name?color=brown&height=300#nose?
\_/   \________________/ \_________/ \__/\_______________________________/ \_________________________/ \___/
 |            |               |       |                  |                              |                |
scheme     userinfo         host     port              path                           query          fragment
Note that the second
@ (in the path) is treated as data in a path and not as a delimiter between userinfo and host. Same goes for the second
? in the query.
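The first-match-wins split can be illustrated with the regular expression that RFC 3986 gives in its Appendix B. The sketch below wraps it in a hypothetical helper, split_uri, and adds a simplified split of the authority that assumes a registered host name (no IPv6 literal) and no password in the userinfo.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Split a URI into its generic components (first match wins)."""
    match = URI_PATTERN.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)

    # Simplified authority split: the first "@" ends the userinfo,
    # the first ":" after it starts the port.
    userinfo, at, hostport = (authority or "").partition("@")
    if not at:
        userinfo, hostport = None, authority or ""
    host, _, port = hostport.partition(":")

    return {
        "scheme": scheme, "userinfo": userinfo, "host": host,
        "port": port, "path": path, "query": query, "fragment": fragment,
    }

parts = split_uri(
    "foo://firstname.lastname@example.org:8042/ove@r/there;color=red;"
    "height=200?name?color=brown&height=300#nose?"
)
for name, value in parts.items():
    print(f"{name:9} {value}")
```

The @ in the path and the second ? in the query end up inside their components as plain data, because the pattern stops looking for each delimiter once the component has started.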
In every component there are possible subcomponent delimiters defined. All the reserved subcomponent delimiters are: ! $ & ' ( ) * + , ; =
An implementation-specific entity (an API of a web application) can have additional delimitation rules with characters from the reserved subcomponent set. But this is only a rule of the implementation-specific entity itself. The RFC doesn't specify which characters to use when; it only lists the characters that may be chosen as subcomponent delimiters.
So let's say that an API has two additional subcomponent rules.
- In the path, we distinguish key-value pairs coupled with the = character and delimited by the ; character (back in the day, the so-called Matrix URIs).
- In the query, we distinguish key-value pairs coupled with the = character and delimited by the & character (used for handling web forms).
Further processing of the path and query component would give us:
foo://email@example.com:8042/ove@r/there;color=red;height=200?name?color=brown&height=300#nose?
                                         \___/ \_/ \____/ \_/ \________/ \___/ \____/ \_/
                                           |    |    |     |      |        |     |     |
                                           k1   v1   k2    v2     k1       v1    k2    v2
                            \_______________________________/ \_________________________/
                                            |                              |
                                          path                           query
To process a URI, we always need to know its components (determined by the generic delimiters) and, if any are specified (by an API, for example), its subcomponents (determined by the subcomponent delimiters). Only then can the right semantics be determined.
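The two subcomponent rules of this example API can be sketched as follows. The helper split_pairs is hypothetical, and the path and query are taken as already separated from the rest of the URI.

```python
def split_pairs(text, pair_separator, key_value_separator="="):
    """Split text into (key, value) pairs on the given subcomponent delimiters."""
    pairs = []
    for item in text.split(pair_separator):
        # partition() cuts at the first separator only, so any later
        # occurrence stays part of the value.
        key, _, value = item.partition(key_value_separator)
        pairs.append((key, value))
    return pairs

path = "/ove@r/there;color=red;height=200"
query = "name?color=brown&height=300"

# Rule 1: in the path, pairs are delimited by ";" and coupled with "=".
segments, _, parameters = path.partition(";")
path_pairs = split_pairs(parameters, ";")

# Rule 2: in the query, pairs are delimited by "&" and coupled with "=".
query_pairs = split_pairs(query, "&")

print(segments)     # /ove@r/there
print(path_pairs)   # [('color', 'red'), ('height', '200')]
print(query_pairs)  # [('name?color', 'brown'), ('height', '300')]
```

Note that the first query key is name?color: the ? inside the query is data, so it stays part of the key.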
Why to encode?
We would like to express as many characters of every language in URIs, so the whole world can use URIs. This is powerful. But not all applications that need to display URIs and their characters, want to implement the complete Unicode character set.
Next to that, we've seen that the URI has a certain structure delimited by special characters. We still would like the possibility to use those special characters in a URI as data. In that case, we need to encode them.
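A small sketch of what goes wrong when a delimiter character is used as data without encoding. It uses parse_qsl and quote from Python's urllib.parse, which assume the web form conventions (& between pairs, = between key and value). The key name and the value Me&you are only illustrative.

```python
from urllib.parse import parse_qsl, quote

# The value "Me&you" is written into the query without encoding.
raw = "name=Me&you"
print(parse_qsl(raw))      # [('name', 'Me')]  -> the value is cut at "&"

# Percent-encoding the "&" keeps it as data.
encoded = "name=" + quote("Me&you", safe="")
print(encoded)             # name=Me%26you
print(parse_qsl(encoded))  # [('name', 'Me&you')]
```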
Different encoding formats
When we determine that a character should be encoded, we encode it as one or more percent-encoded hexadecimal pairs (so-called percent-encoded octets). There are multiple character encodings, for example "UTF-8", "UTF-16" or "ISO-8859-1", which encode characters from the UCS into octets (not all of them cover the whole UCS). When we transform the letter æ to percent-encoded octets in UTF-8, a UTF-8 table shows that it has two hexadecimal pairs: C3 A6. This results in the encoding %C3%A6.
When we transform the same letter æ to a percent-encoded octet in "ISO-8859-1", an ISO-8859-1 table shows that it has one hexadecimal pair: E6. So the resulting encoding is %E6.
Different encodings generate different percent-encoded octets. Therefore it is always necessary to determine which encoding was used.
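The difference can be checked with quote and unquote from Python's urllib.parse, which both take an encoding argument. A short sketch:

```python
from urllib.parse import quote, unquote

utf8_form = quote("æ", encoding="utf-8")
latin1_form = quote("æ", encoding="iso-8859-1")
print(utf8_form)    # %C3%A6
print(latin1_form)  # %E6

# Decoding only round-trips when the same encoding is used.
print(unquote(latin1_form, encoding="iso-8859-1"))  # æ
# With the wrong encoding the octet is invalid; unquote() replaces it
# with U+FFFD (the replacement character) by default.
wrong = unquote(latin1_form, encoding="utf-8")
print(wrong == "\ufffd")  # True
```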
Which encoding to use?
It could be that a scheme or a certain protocol enforces the use of a certain encoding. In that case, that encoding is used. When nothing is prescribed, the same encoding should be used as for the text surrounding the URI.
What characters to encode and decode?
When a URI is split into the right components we are ready to encode! For every component there is a defined list of which characters should never be encoded, which characters should sometimes be encoded and which characters should always be encoded (given it is used as data). In the following table all the characters from the RFC are distilled and separated.
|Component||Never encode||Sometimes encode (implementation-specific)||Always encode (when interpreted as data)||Encoding allowed?|
*** Note: In a path the
So how should we read this table?
For every component there are certain characters that are always allowed and should never be encoded (column 2). If the implementation-specific characters are used as delimiters, they should not be encoded. In every other case, they should be encoded (column 3). There are some characters that have a special meaning within the component and should always be escaped (again, only when used as data) (column 4). The last column indicates whether encoding into a percent-encoded octet is allowed at all within that component. If not, escaping is never done.
Let's take a look at an example. We have a UTF-8 encoded website
foo.com which doesn't have any implementation-specific delimiters. We have a section on our website with the name
Me&you@.-:, a page called
The/best? and a query with name
The/best?. Our URI would need to be encoded like this:
http://foo.com/Me%26you@.-:/The%2Fbest%3F?The/best?
               \__________/ \___________/ \_______/
                    |             |           |
                 section      page name     query
We only encode the & in the section, the / and the ? in the page name, and nothing in the query, as per our allowed characters in column 2.
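This example can be reproduced with quote from Python's urllib.parse. The safe arguments below are an assumption chosen to mirror the example (quote never encodes letters, digits and the characters _ . - ~); they are not a complete rendering of the table.

```python
from urllib.parse import quote

# Characters listed in `safe` are left as-is; everything else outside the
# unreserved set is percent-encoded as UTF-8.
section = quote("Me&you@.-:", safe="@:")  # "&" is encoded, "@" and ":" are kept
page = quote("The/best?", safe="")        # "/" and "?" are data in a path segment
query = quote("The/best?", safe="/?")     # "/" and "?" are allowed in a query

uri = f"http://foo.com/{section}/{page}?{query}"
print(uri)  # http://foo.com/Me%26you@.-:/The%2Fbest%3F?The/best?
```

Each component is encoded on its own and only then joined with the generic delimiters, so the delimiters themselves are never encoded.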
Let's look at another example with the query string, when there is no implementation-specific delimiter that is being used. Take this query string:
pàræm#2===$300". We would encode (to UTF-8) the
$ character, the rest is allowed. It would look like:
What if we have implementation-specific delimiters? If we use the same implementation-specific delimiters that are used on the web for form submissions, the encoding is different. Say we want to express a key param#1 and a value $300. The HTML specification states that key-value pairs are coupled with the = sign and delimited by the & sign. Only then do we know that we need to encode these parts separately: param#1 and $300. Encoded with UTF-8 it looks like this: param%231=%24300
If we would apply the same rules on this query string:
param#1==$300 where the value would be
=$300, it would look like:
param%231=%3D%24300. Note that the second
= sign is encoded now, because it is part of the value.
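A minimal sketch of encoding and decoding such a key-value pair, assuming = as the coupling character and UTF-8 as the encoding. The helpers encode_pair and decode_pair are hypothetical names.

```python
from urllib.parse import quote, unquote

def encode_pair(key, value, encoding="utf-8"):
    """Encode key and value separately, then couple them with "="."""
    return (quote(key, safe="", encoding=encoding)
            + "="
            + quote(value, safe="", encoding=encoding))

def decode_pair(pair, encoding="utf-8"):
    """Split on the first "=" (the delimiter), then decode both parts."""
    key, _, value = pair.partition("=")
    return unquote(key, encoding=encoding), unquote(value, encoding=encoding)

print(encode_pair("param#1", "$300"))      # param%231=%24300
print(encode_pair("param#1", "=$300"))     # param%231=%3D%24300
print(decode_pair("param%231=%3D%24300"))  # ('param#1', '=$300')
```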
Submitting a form in HTML
Let's look at one of the most practical situations for web developers. When submitting a form, the data can be sent via either the POST or the GET method. According to the HTML specification, this data should be encoded with the application/x-www-form-urlencoded media type, in the query part. This media type prescribes that all non-alphanumeric characters, apart from a few exceptions such as ., should be encoded. Also, the space character is replaced with a + sign.
Submit data from the website foo.com via a form with encoding type UTF-8:
firstname.lastname@example.org password=te st
The website only encodes the
@ and the
+ sign, so it generates this URI:
So encoding and decoding a query string from an HTML form submission already has a set of rules according to the media type. Just be sure to apply those rules and subcomponent delimiters to the query.
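In Python, urlencode and parse_qsl from urllib.parse follow the application/x-www-form-urlencoded conventions. The sketch below uses hypothetical form data: the email field and its value are made up for illustration, while the password value te st is the one from the example above.

```python
from urllib.parse import urlencode, parse_qsl

# Hypothetical form fields.
fields = [("email", "alice@example.com"), ("password", "te st")]

query = urlencode(fields, encoding="utf-8")
print(query)             # email=alice%40example.com&password=te+st
print(parse_qsl(query))  # [('email', 'alice@example.com'), ('password', 'te st')]

# A literal "+" in the data must itself be encoded, otherwise it would be
# read back as a space.
plus_query = urlencode([("q", "a+b")])
print(plus_query)        # q=a%2Bb
```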
The RFC also describes the use of relative paths in URIs. A relative path is a path which has something in the form of dot-segments:
./ (one dot and a slash) or
../ (two dots and a slash). These are intended to express a relative path in the hierarchy of the total path. They overlap with the use of paths in file systems and are used everywhere on the internet.
When relative paths are identified they should never be encoded
The process of making a relative path absolute is called normalization. A path subcomponent that is exactly
.. indicates a path higher in hierarchy and a
. indicates the same path. So this relative URI:
http://foo.com/a/../b/./c/d/../e can be normalized to:
http://foo.com/b/c/e
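A simplified sketch of this normalization for absolute paths (paths that start with a slash). It is not a line-by-line implementation of the algorithm in the RFC, and the function names are chosen here for illustration. Only segments that are exactly . or .. are touched.

```python
from urllib.parse import urlsplit

def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute path (starting with "/")."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            if is_last:
                output.append("")  # keep the trailing slash
        elif segment == "..":
            if len(output) > 1:    # never climb above the root
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)

def normalize(uri):
    """Apply remove_dot_segments to the path component of a URI."""
    parts = urlsplit(uri)
    return parts._replace(path=remove_dot_segments(parts.path)).geturl()

print(normalize("http://foo.com/a/../b/./c/d/../e"))  # http://foo.com/b/c/e
print(remove_dot_segments("/a...b/"))                 # /a...b/
```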
Using a dot within a path
It is allowed to use a dot in path names as data like this:
/a...b/. Only when the complete path subcomponent is exactly . or .. is it not treated as data. In /a/../b, for example, the .. is not data but indicates a relative path.
This information is extracted from the RFC 3986 and interpreted. If you have any comments or questions, just send an email to webmaster and then the rest of this domain.