URL Encoding and what characters are valid in a URI

    Not all characters are valid in a URI. The Internet Engineering Task Force's document Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) lets us define three groups: valid characters that can be used "as is", reserved characters that can be used "as is" but have a special meaning, and all other characters, which must be "URL-encoded" into a numerical format so that they are valid, usable, and do not conflict with characters that have a special meaning.

    Valid characters

    The following characters can be used directly within a URL and do not require any special encoding:

    Characters    Explanation
    A to Z        Uppercase alphabetical
    a to z        Lowercase alphabetical
    0 to 9        Numerical
    -             Hyphen
    _             Underscore
    .             Dot / full stop
    ~             Tilde

    Reserved characters

    These characters can also be used "as is", but they have a special meaning in a URI. If they are used outside of their reserved meaning, they must be URL-encoded; otherwise the URL may be parsed incorrectly. Problems are especially likely when you send a URI as a redirection parameter inside another URL. If these characters are not URL-encoded, the server may be unable to determine where a URL ends and where a parameter starts, or how many parameters a received URL contains.

    Characters    Explanation
    :             Scheme separator (e.g., http:), username/password separator, or host/port separator when specified in the URL
    @             Credential and host separator
    /             Directory separator for resource or folder paths
    ?             Query string separator
    &             Separator for key-value pairs if more than one key-value pair is present in the URI
    =             Assigns a value to a key in a URI
    #             Fragment separator. The fragment after it usually names an anchor that a browser jumps to in an HTML page, if present in the source code.
    %             Indicates a "percent-encoded" (URL-encoded) character. It is followed by two hexadecimal digits that represent a character that otherwise could corrupt the meaning of a URI.
    +             Represents a space in some query strings (often used in place of a space character)
    [ ]           Enclose an IPv6 address in the URL (e.g., http://[2001:db8::1]/)
    !             Sub-delimiter with no fixed generic meaning; it can appear in various contexts within the URL
    $             Sub-delimiter with no fixed generic meaning; some applications use it to mark special parameters in the query component
    ( )           Sub-delimiters with no fixed generic meaning; they can appear in various contexts within the URL
    *             Sub-delimiter with no fixed generic meaning; it can appear in various contexts within the URL
    ,             Sub-delimiter with no fixed generic meaning; it can appear in various contexts within the path or query components
    ;             Sometimes used to separate parameters in the path component (e.g., http://example.com/path;param=value)

    Any character that is not in the valid character list, and any reserved character that is used outside of its reserved meaning, must be URL-encoded.
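
    To make the "numerical format" concrete, here is a minimal sketch of percent-encoding written by hand. It assumes the text is first converted to UTF-8 bytes, which is the default used by Python's urllib.parse.quote; the function name percent_encode is an assumption made for this example.

```python
from urllib.parse import quote

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~"
)


def percent_encode(text: str) -> str:
    """Percent-encode every character outside the unreserved set.

    Hypothetical helper for illustration; real code should call
    urllib.parse.quote instead.
    """
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Assumption: encode as UTF-8, then write each byte as % plus two hex digits.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode("a b&c"))                            # a%20b%26c
print(percent_encode("café"))                             # caf%C3%A9
print(percent_encode("café") == quote("café", safe=""))   # True
```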

    Encoding Reserved and Invalid Characters in URLs

    When working with URLs, it's crucial to properly encode reserved and invalid characters to ensure they are correctly interpreted by web servers and browsers. To avoid issues with these characters, you can use online tools like urlencoder.org or the URL encoding features of your programming language.

    Reserved characters have specific meanings defined by RFC documents, and if misused, they can break a URL. If you need to use one of these characters for a different purpose, encode it to its URL-encoded value.
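
    In Python, for instance, the standard library's urllib.parse module provides these encoding features. The sketch below uses a made-up sample value and shows the two common variants: quote encodes a space as %20, while quote_plus uses the + convention for spaces found in some query strings.

```python
from urllib.parse import quote, quote_plus, unquote_plus

# Sample value chosen for this illustration only.
value = "tom & jerry+friends"

# safe="" means no reserved character is left unencoded.
print(quote(value, safe=""))   # tom%20%26%20jerry%2Bfriends

# quote_plus writes spaces as "+" and encodes a literal "+" as %2B.
encoded = quote_plus(value)
print(encoded)                 # tom+%26+jerry%2Bfriends

# Decoding restores the original value.
print(unquote_plus(encoded) == value)   # True
```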

    Encoding an Email Address in a URL

    Consider a variable email in your query string set to user@example.com. The @ symbol is reserved, so it needs to be encoded to prevent it from being misinterpreted as the separator between the credentials and the host.

    The @ symbol is encoded as %40. Here is a valid URL that carries the email address in its query string:

    http://username:password@server.example.com?email=user%40example.com

    In this example:

    • The unencoded @ symbol correctly separates the credentials (username:password) from the server address.

    • The email variable properly encodes the @ symbol to %40, ensuring the server accurately interprets the email address.
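
    The same URL can be built and parsed back with Python's urllib.parse, as a rough sketch. The credentials and host are the placeholder values from the example above, not real ones.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# urlencode percent-encodes the reserved "@" inside the value.
query = urlencode({"email": "user@example.com"})
url = "http://username:password@server.example.com?" + query
print(url)
# http://username:password@server.example.com?email=user%40example.com

parts = urlsplit(url)
# The unencoded "@" still separates the credentials from the host.
print(parts.username, parts.hostname)      # username server.example.com
# The server side decodes %40 back to "@".
print(parse_qs(parts.query)["email"][0])   # user@example.com
```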

    Encoding the % character

    The % character is used to denote encoded characters in a URL. If you need to use % as a literal character, it must be encoded as %25.

    For example:

    http://example.com/crypto_currency_falls_40%25_in_one_day

    Here, %25 represents the literal % character, preventing it from being misinterpreted as the start of an encoded character sequence.
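
    A short Python sketch of the same case, using the path segment from the example above:

```python
from urllib.parse import quote, unquote

segment = "crypto_currency_falls_40%_in_one_day"

# The literal "%" becomes %25; underscores and digits are left untouched.
encoded = quote(segment, safe="")
print(encoded)                      # crypto_currency_falls_40%25_in_one_day

# Decoding turns %25 back into a literal "%".
print(unquote(encoded) == segment)  # True
```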

    Handling Invalid URLs with Variables

    Proper URL encoding is crucial when embedding one URL within another, such as in redirect links, to avoid conflicts and misinterpretation.

    Consider this invalid URL with unencoded query strings:

    http://www.example.com/login?firstlogin=true&redirect=http://www.example.com/error?secondlogin=false&date=2022-01-01

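    The problem here is that the ? and & characters of the embedded URL are not encoded, so a parser cannot tell which parameters belong to the redirect value. Typically the second & ends the redirect value early, and date=2022-01-01 is read as a parameter of the login URL instead.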
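
    The incorrect parsing can be reproduced with Python's urllib.parse. This is only a sketch of how one parser behaves; other servers and frameworks may split the string differently.

```python
from urllib.parse import parse_qs, urlsplit

bad_url = (
    "http://www.example.com/login?firstlogin=true"
    "&redirect=http://www.example.com/error?secondlogin=false&date=2022-01-01"
)

params = parse_qs(urlsplit(bad_url).query)
for key, values in params.items():
    print(key, "=", values[0])
# firstlogin = true
# redirect = http://www.example.com/error?secondlogin=false
# date = 2022-01-01
```

    The redirect value has lost its date parameter, which is now attached to the login URL.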

    Valid URL: Encoding the Redirect Parameter

    To make the above URL valid, URL-encode the redirect parameter:

    http://www.example.com/login?firstlogin=true&redirect=http%3A%2F%2Fwww.example.com%2Ferror%3Fsecondlogin%3Dfalse%26date%3D2022-01-01

    In this example:

    • Characters such as :, /, ?, =, and & within the redirect value are encoded to their respective percent-encoded values.

    • This encoding ensures the redirect parameter is treated as a single value and not parsed incorrectly.

    Servers processing this URL must decode the redirect value to interpret and handle it correctly.
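
    Both sides of this exchange can be sketched in Python with urllib.parse: urlencode builds the outer URL with the redirect value encoded, and parse_qs plays the role of the server decoding it. The variable names inner and outer are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

inner = "http://www.example.com/error?secondlogin=false&date=2022-01-01"

# Client side: urlencode percent-encodes ":", "/", "?", "=" and "&" in the value.
outer = "http://www.example.com/login?" + urlencode(
    {"firstlogin": "true", "redirect": inner}
)
print(outer)
# http://www.example.com/login?firstlogin=true&redirect=http%3A%2F%2Fwww.example.com%2Ferror%3Fsecondlogin%3Dfalse%26date%3D2022-01-01

# Server side: only two parameters are seen, and the redirect value decodes intact.
params = parse_qs(urlsplit(outer).query)
print(sorted(params))                   # ['firstlogin', 'redirect']
print(params["redirect"][0] == inner)   # True
```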

    By encoding reserved and invalid characters in URLs, you ensure accurate communication and functionality between clients and servers, avoiding common pitfalls of URL misinterpretation.

    Knowledge Base Reference ID: 202202271326

