
Understanding URIs: History, Components, and Encoding/Decoding

This article provides a comprehensive overview of Uniform Resource Identifiers (URIs), covering their historical evolution, the relationship with URLs and URNs, the syntax defined by RFC standards, character sets, component breakdown, and practical encoding and decoding algorithms for web development.

TAL Education Technology

Nov 5, 2020


The Uniform Resource Identifier (URI) is a fundamental concept that every programmer should understand, along with the related terms URL and URN. Knowing them helps developers understand the design of the World Wide Web, troubleshoot URI-related issues, and reason about the encoding and decoding mechanisms used in network programming.

1. URI

URI (Uniform Resource Identifier) provides a simple, extensible way to identify resources.

1.1 History of URI

As the Web grew, a need arose for a unique, portable identifier for diverse resources (web pages, e‑books, PDFs, etc.). Tim Berners‑Lee’s hypertext proposal introduced the concept of a URL (Uniform Resource Locator) to name hyperlinks. Later, to separate the notions of location and naming, URN (Uniform Resource Name) was defined.

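The IETF (Internet Engineering Task Force) is responsible for the URI standards.

June 1994: RFC 1630 introduced URLs and URNs and defined a formal URI syntax.

December 1994: RFC 1738 specified the URL syntax. RFC 1808 (June 1995) added relative URLs, and RFC 2141 (May 1997) defined the URN syntax.

December 1999: RFC 2732 allowed literal IPv6 addresses in URLs.

August 2002: RFC 3305 noted that, in the contemporary view, "URL" is a useful but informal term for a URI that identifies a resource through its access mechanism, rather than a formal class separate from URI.

January 2005: RFC 3986 resolved earlier shortcomings and formalized the generic URI syntax, replacing the earlier generic-syntax specifications.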

1.2 Comparison of URI, URL, and URN

URI, URL, and URN share a common ancestry; URL originally served as a locator, while URN was introduced for name‑only identification. All three can identify a resource, but they differ in usage scenarios.

1.2.1 Basic Concepts

URI : A generic identifier for any resource, expressed as a string of characters defined by IETF syntax.

URL : A locator that includes the access mechanism (e.g., http, ftp, mailto) and points to a retrievable resource.

URN : A persistent name that identifies a resource within a specific namespace without implying location; e.g., urn:isbn:9971-5-0210-0.

1.2.2 Relationship among the Three

URI is the abstract superset; both URL and URN are specific forms of URI. RFC 3986 states: “A URI can be further classified as a locator, a name, or both.” Consequently, a URL is always a URI, and a URN is a URI that does not convey location.

A URI can be further classified as a locator, a name, or both. The term “Uniform Resource Locator” (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network “location”). – RFC 3986, section 1.1.3

In practice, use the term that best matches the audience’s expectations; when in doubt, “URI” safely covers both URL and URN.
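As a rough illustration of the locator/name distinction, the sketch below uses Python's standard `urllib.parse.urlsplit` to compare a URL with the URN example from section 1.2.1. The `https` address is a made-up placeholder, not something from the original article.

```python
from urllib.parse import urlsplit

# A URL: the scheme names an access mechanism and the authority says where to go.
url = urlsplit("https://example.com/books/42")  # hypothetical address
# A URN: a name inside the "isbn" namespace, with no location information.
urn = urlsplit("urn:isbn:9971-5-0210-0")

print(url.scheme, url.netloc, url.path)        # https example.com /books/42
print(urn.scheme, repr(urn.netloc), urn.path)  # urn '' isbn:9971-5-0210-0
```

Both strings parse as URIs; only the first one carries an authority that a client could connect to.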

2. URI Character Set

2.1 Design Considerations of URI

A URI must be simple, extensible, and portable across different media. Three design goals follow from this:

It must be portable across systems.

It is a sequence of characters that can be represented in many forms (on paper, on screen, or as binary data).

Human-readable characters are preferred.

Consequently, URIs are built from a limited subset of US-ASCII, which keeps them compatible across systems and easy to type on keyboards and other input devices.

2.2.1 Percent‑Encoding

Any character that is not allowed literally is represented using percent-encoding (also called URL encoding):

pct-encoded = "%" HEXDIG HEXDIG

For example, the byte 0x2B ("+") is encoded as %2B. The hex digits are case-insensitive, so %2b and %2B are equivalent, but uppercase is recommended.
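A minimal sketch of the rule above, using Python's standard `urllib.parse` module. Passing `safe=""` is a choice made for this example so that `quote` encodes everything except unreserved characters.

```python
from urllib.parse import quote, unquote

# safe="" asks quote() to leave only unreserved characters unencoded.
print(quote("+", safe=""))  # %2B
print(quote("é", safe=""))  # %C3%A9 (two UTF-8 bytes, so two escapes)

# Hex digits are case-insensitive when decoding.
print(unquote("%2b") == unquote("%2B") == "+")  # True
```

Non-ASCII characters are first converted to bytes (UTF-8 by default in `quote`), and each byte becomes one `%XX` escape.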

2.2.2 Reserved Characters

Reserved characters serve as delimiters between URI components:

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

2.2.3 Unreserved Characters

Characters that never need encoding:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

ALPHA = a‑z / A‑Z
DIGIT = 0‑9

2.2.4 Summary

The following diagram (image) illustrates which characters are reserved vs. unreserved and when encoding is required.
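The character classes from sections 2.2.2 and 2.2.3 can also be written down as sets. The sketch below is one possible way to classify a single character; the function name `classify` and the label strings are illustrative choices, not part of any standard API.

```python
import string

GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def classify(ch):
    """Return the RFC 3986 character class of a single character."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    # Everything else (spaces, "%" used as data, non-ASCII text, ...)
    # can only appear in percent-encoded form.
    return "must be percent-encoded"

for ch in "a~/& é":
    print(repr(ch), classify(ch))
```

Reserved characters only need encoding when they are meant as data rather than as delimiters, which is why the component a character appears in matters.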

3. URI Components

URI syntax consists of several components, each with its own rules. The generic form is:

URI         = scheme ":" [ "//" authority ] path [ "?" query ] [ "#" fragment ]

authority   = [ userinfo "@" ] host [ ":" port ]

Note: scheme and path are always present, although the path may be empty; authority, query, and fragment are optional.

Below is an example diagram of the components.
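As a concrete illustration of the generic form, the sketch below splits a sample URI with Python's standard `urllib.parse.urlsplit`. The URI is modeled on the example in RFC 3986, section 3, with an `https` scheme and a userinfo part added for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://user@example.com:8042/over/there?name=ferret#nose")

print(parts.scheme)    # https
print(parts.netloc)    # user@example.com:8042  (the authority)
print(parts.username)  # user
print(parts.hostname)  # example.com
print(parts.port)      # 8042
print(parts.path)      # /over/there
print(parts.query)     # name=ferret
print(parts.fragment)  # nose
```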

3.1 Component Details

3.1.1 Scheme

Component: Scheme
Allowed characters: a-z A-Z 0-9 + . -
Case-sensitive: No
Terminator: ":"

Scheme identifies the protocol (e.g., http, ftp, geo). Syntax:

scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
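The scheme rule translates directly into a regular expression. The following is a small sketch; the helper name `is_valid_scheme` is made up for this example.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

def is_valid_scheme(s):
    """Check a candidate scheme name against the grammar above."""
    return SCHEME_RE.fullmatch(s) is not None

print(is_valid_scheme("http"))     # True
print(is_valid_scheme("git+ssh"))  # True
print(is_valid_scheme("1http"))    # False: must start with a letter
```

Because schemes are case-insensitive, a parser would typically lowercase the name before comparing it, for example with `s.lower()`.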

3.1.2 Authority

Component: Authority
Start delimiter: "//"
End delimiters: "/", "?", "#", or the end of the URI

Authority defines a namespace (e.g., example.com) and consists of optional userinfo, mandatory host, and optional port.

3.1.2.1 Userinfo

Component: Userinfo
Allowed characters: pct-encoded, unreserved, sub-delims, ":"
Case-sensitive: Yes
Terminator: "@"

userinfo = *( unreserved / pct-encoded / sub-delims / ":" )

3.1.2.2 Host

Component: Host
Allowed characters: pct-encoded, unreserved, sub-delims
Case-sensitive: No
Terminator: ":" or "/"

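host        = IP-literal / IPv4address / reg-name

IP-literal  = "[" ( IPv6address / IPvFuture ) "]"
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
reg-name    = *( unreserved / pct-encoded / sub-delims )

An IPv6 address is always enclosed in square brackets, and each dec-octet is a decimal number from 0 to 255. The full IPv6address and IPvFuture rules are given in RFC 3986, section 3.2.2.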
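To make the three host forms concrete, here is a rough sketch that tells them apart with Python's standard `ipaddress` module. The function name `classify_host` is hypothetical, and the sketch makes two simplifying assumptions: it treats every bracketed host as an IPv6 address (ignoring IPvFuture) and it does not validate the characters of a registered name.

```python
import ipaddress

def classify_host(host):
    """Classify a host string as an IP literal, an IPv4 address, or a reg-name."""
    if host.startswith("[") and host.endswith("]"):
        ipaddress.IPv6Address(host[1:-1])  # raises ValueError if malformed
        return "IP-literal"
    try:
        ipaddress.IPv4Address(host)
        return "IPv4address"
    except ValueError:
        # Anything that is not a valid dotted-decimal address falls through
        # to the registered-name rule.
        return "reg-name"

print(classify_host("[::1]"))        # IP-literal
print(classify_host("192.0.2.1"))    # IPv4address
print(classify_host("example.com"))  # reg-name
```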

3.1.2.3 Port

Component: Port
Allowed characters: 0-9
Terminator: "/"

port = *DIGIT

Typical defaults: HTTP 80, HTTPS 443.
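Since the port is optional, code that opens a connection usually falls back to the scheme's default. A minimal sketch, assuming only the two defaults named above; `DEFAULT_PORTS` and `effective_port` are names invented for this example.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(uri):
    """Return the explicit port, or the scheme's default if none is given."""
    parts = urlsplit(uri)
    if parts.port is not None:
        return parts.port
    # Returns None for schemes that are not in the table.
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://example.com/"))        # 80
print(effective_port("https://example.com:8443/"))  # 8443
```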

3.1.3 Path

Component: Path
Allowed characters: pct-encoded, unreserved, sub-delims, "@", ":" (with "/" separating segments)
Terminator: "?", "#", or the end of the URI

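path = path-abempty / path-relative

path-abempty    = *( "/" segment )
path-relative   = segment-nocolon *( "/" segment )
segment         = *pchar
segment-nocolon = 1*( unreserved / pct-encoded / sub-delims / "@" )
pchar           = unreserved / pct-encoded / sub-delims / ":" / "@"

The first segment of a relative path must not contain ":", because it would otherwise be mistaken for a scheme. These rules are a simplified form of the path grammar; RFC 3986, section 3.3 gives the complete set.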
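Because "/" separates segments, a slash that belongs to the data inside a segment has to be written as %2F. The sketch below shows why the order of operations matters: split first, decode second. The helper name `path_segments` is made up for this example.

```python
from urllib.parse import unquote

def path_segments(path):
    """Split a path on "/" first, then percent-decode each segment."""
    return [unquote(segment) for segment in path.split("/")]

# Correct: the encoded slash stays inside its segment.
print(path_segments("/a%2Fb/c"))       # ['', 'a/b', 'c']

# Decoding before splitting loses the structure.
print(unquote("/a%2Fb/c").split("/"))  # ['', 'a', 'b', 'c']
```

The leading empty string comes from the "/" at the start of the path.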

3.1.4 Query

Component: Query
Allowed characters: pct-encoded, unreserved, sub-delims, "@", ":", "/", "?"
Start delimiter: "?"
End delimiter: "#" or the end of the URI

query = *( pchar / "/" / "?" )

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
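The grammar above leaves the internal structure of the query open. In practice, most web applications use `key=value` pairs joined by "&", a convention that comes from HTML forms rather than from RFC 3986. A short sketch using Python's standard `parse_qsl` and `urlencode`, with made-up parameter names:

```python
from urllib.parse import parse_qsl, urlencode

# "&" inside a value must be encoded, or it would be read as a pair separator.
pairs = parse_qsl("name=ferret&tag=a%26b")
print(pairs)             # [('name', 'ferret'), ('tag', 'a&b')]

# urlencode() reverses the process and re-encodes the values.
print(urlencode(pairs))  # name=ferret&tag=a%26b
```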

3.1.5 Fragment

Component: Fragment
Allowed characters: pct-encoded, unreserved, sub-delims, "@", ":", "/", "?"
Start delimiter: "#"
End delimiter: the end of the URI

fragment = *( pchar / "/" / "?" )

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
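Since the fragment always runs to the end of the URI, separating it is straightforward. A minimal sketch with Python's standard `urldefrag`; the address is a placeholder chosen for this example.

```python
from urllib.parse import urldefrag

url, fragment = urldefrag("https://example.com/guide?lang=en#section-2")

print(url)       # https://example.com/guide?lang=en
print(fragment)  # section-2
```

The fragment is interpreted by the client after the resource has been retrieved, so this split is typically what a client does before sending a request.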

3.1.6 Summary

Each component has its own allowed character set. The general delimiters (gen-delims) separate the components from one another, while the sub-delimiters (sub-delims) may appear inside a component, where a scheme or application can give them its own meaning.

3.2 Parsing a URI

Because URI syntax follows a context‑free grammar, a recursive‑descent parser can be implemented. The following diagram (image) shows the parsing flow.

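A minimal sketch of such a parser in Python; the helper names read_until and accept are illustrative, and the sketch does not validate the characters inside each component:

<code>def parse(uri):
    """Parse a URI left to right into the components described in section 3."""
    pos = 0

    def read_until(stops):
        """Consume characters up to (not including) the first one in stops."""
        nonlocal pos
        start = pos
        while pos < len(uri) and uri[pos] not in stops:
            pos += 1
        return uri[start:pos]

    def accept(token):
        """Consume token if the input continues with it; report whether it did."""
        nonlocal pos
        if uri.startswith(token, pos):
            pos += len(token)
            return True
        return False

    parts = dict.fromkeys(
        ("scheme", "userinfo", "host", "port", "path", "query", "fragment"))

    parts["scheme"] = read_until(":/?#")
    if not parts["scheme"] or not accept(":"):
        raise ValueError("a URI must start with a scheme followed by ':'")

    if accept("//"):
        # The authority runs up to the start of the path, query, or fragment.
        authority = read_until("/?#")
        userinfo, at, hostport = authority.rpartition("@")
        if at:
            parts["userinfo"] = userinfo
        # A ':' inside an IPv6 literal such as [::1] is not a port separator.
        host, colon, port = hostport.rpartition(":")
        if colon and "]" not in port:
            parts["host"], parts["port"] = host, port
        else:
            parts["host"] = hostport

    parts["path"] = read_until("?#")
    if accept("?"):
        parts["query"] = read_until("#")
    if accept("#"):
        parts["fragment"] = uri[pos:]
    return parts
</code>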
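The generic syntax is also simple enough to be split with a single regular expression, and RFC 3986, Appendix B provides one for exactly this purpose. The sketch below wraps that expression in a helper; the name `split_uri` and the dictionary keys are choices made for this example. The test URI is the one used in the RFC's appendix.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI reference into its five top-level components."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),     # None when there is no "?"
        "fragment": m.group(9),  # None when there is no "#"
    }

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Like the hand-written approach, this only separates the components; it does not check that each one uses only its allowed characters.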

5. Re‑examining Encode and Decode

When generating a URI, percent-encode each component first and only then join the components with their delimiters. Decoding reverses the process: split the URI at the delimiters, then decode each component individually. Unreserved characters do not need to be encoded; RFC 3986 treats their encoded and unencoded forms as equivalent and recommends that producers leave them unencoded.

Note: Do not double‑encode or double‑decode a URI, as this corrupts its semantics.
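The two rules above can be sketched with Python's standard `quote` and `unquote`. The component values and the `example.com` address are hypothetical; they were picked because they contain characters that would otherwise be read as delimiters.

```python
from urllib.parse import quote, unquote

# Encode each component on its own, then join with the delimiters.
segment = "reports/2020 Q4"  # one path segment that contains "/" and a space
value = "a&b=c"              # one query value that contains "&" and "="

uri = ("https://example.com/files/" + quote(segment, safe="")
       + "?q=" + quote(value, safe=""))
print(uri)  # https://example.com/files/reports%2F2020%20Q4?q=a%26b%3Dc

# Double-encoding: the "%" of the first pass is itself encoded as %25.
once = quote("100%", safe="")
twice = quote(once, safe="")
print(once, twice)     # 100%25 100%2525
print(unquote(twice))  # 100%25, not the original "100%"
```

A single decode of a double-encoded value does not return the original data, which is how this mistake usually shows up in practice.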

5.1 Implementing Encode and Decode

5.1.1 encode

A minimal Python sketch of percent-encoding a string, given the set of characters that do not need encoding (assumed to contain only ASCII characters):

<code>def encode(s, dont_need_encoding):
    """Percent-encode s, copying characters in dont_need_encoding unchanged."""
    result = []
    for c in s:
        if c in dont_need_encoding:
            result.append(c)
        else:
            # Python iterates over code points, so surrogate pairs need no
            # special handling here (unlike languages with UTF-16 strings).
            for byte in c.encode("utf-8"):
                result.append(f"%{byte:02X}")
    return "".join(result)
</code>

5.1.2 decode

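A minimal Python sketch of percent-decoding, assuming the escaped bytes are UTF-8 and treating a malformed escape as an error:

<code>import string

def decode(s):
    """Decode a percent-encoded string s, interpreting the escaped bytes as UTF-8."""
    result = []
    index, length = 0, len(s)
    while index < length:
        if s[index] != "%":
            result.append(s[index])
            index += 1
            continue
        # Collect a run of consecutive %XX escapes so that a multi-byte
        # UTF-8 sequence is decoded as a whole.
        raw = bytearray()
        while (index + 2 < length and s[index] == "%"
               and s[index + 1] in string.hexdigits
               and s[index + 2] in string.hexdigits):
            raw.append(int(s[index + 1:index + 3], 16))
            index += 3
        if not raw:
            raise ValueError(f"malformed percent-escape at index {index}")
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        result.append(raw.decode("utf-8"))
    return "".join(result)
</code>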

5.1.3 Summary

Understanding the standards behind URI encoding/decoding helps developers use language‑specific APIs correctly and avoid common pitfalls such as double‑encoding or mishandling reserved characters.
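One example of such a pitfall in Python's standard library: `urllib.parse` offers two families of functions that differ in how they treat spaces and "+". The sketch below only demonstrates the difference; which family is appropriate depends on whether the data follows the HTML form convention.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b+c"

# quote() follows the generic percent-encoding rules: space becomes %20.
print(quote(text))          # a%20b%2Bc
# quote_plus() follows the HTML form convention: space becomes "+".
print(quote_plus(text))     # a+b%2Bc

# The decoders differ in the same way.
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b
```

Mixing the two families, for example encoding with `quote_plus` and decoding with `unquote`, silently turns spaces into plus signs.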

Enjoy your coding trip~
