Understanding URIs: History, Components, and Encoding/Decoding
This article provides a comprehensive overview of Uniform Resource Identifiers (URIs), covering their historical evolution, the relationship with URLs and URNs, the syntax defined by RFC standards, character sets, component breakdown, and practical encoding and decoding algorithms for web development.
The Uniform Resource Identifier (URI) is a fundamental concept every programmer should understand, alongside the related terms URL and URN. Mastering these helps developers navigate the design of the World Wide Web, troubleshoot URI-related issues, and grasp the encoding/decoding mechanisms used in network programming.
1. URI
URI (Uniform Resource Identifier) provides a simple, extensible way to identify resources.
1.1 History of URI
As the Web grew, a need arose for a unique, portable identifier for diverse resources (web pages, e‑books, PDFs, etc.). Tim Berners‑Lee’s hypertext proposal introduced the concept of a URL (Uniform Resource Locator) to name hyperlinks. Later, to separate the notions of location and naming, URN (Uniform Resource Name) was defined.
The IETF (Internet Engineering Task Force) is responsible for the URI standards.
June 1994: RFC 1630 introduced URL and URN and defined a first formal URI syntax.
December 1994: RFC 1738 defined the URL syntax; RFC 1808 (1995) added relative URLs, and RFC 2141 (1997) defined the URN syntax.
1999: RFC 2732 allowed IPv6 literal addresses in URLs.
January 2005: RFC 3986 resolved earlier shortcomings and formalized the generic URI syntax.
RFC 3305 (2002) noted that, although the term URL is widely used, the contemporary view treats it as an informal concept within the broader URI term.
1.2 Comparison of URI, URL, and URN
URI, URL, and URN share a common ancestry; URL originally served as a locator, while URN was introduced for name‑only identification. All three can identify a resource, but they differ in usage scenarios.
1.2.1 Basic Concepts
URI: A generic identifier for any resource, expressed as a string of characters defined by IETF syntax.
URL: A locator that includes the access mechanism (e.g., http, ftp, mailto) and points to a retrievable resource.
URN: A persistent name that identifies a resource within a specific namespace without implying location; e.g., urn:isbn:9971-5-0210-0.
1.2.2 Relationship among the Three
URI is the abstract superset; both URL and URN are specific forms of URI. Consequently, a URL is always a URI, and a URN is a URI that names a resource without conveying its location. RFC 3986 describes the classification as follows:
A URI can be further classified as a locator, a name, or both. The term “Uniform Resource Locator” (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network “location”). – RFC 3986, section 1.1.3
In practice, use the term that best matches the audience’s expectations; when in doubt, “URI” safely covers both URL and URN.
2. URI Character Set
2.1 Design Considerations of URI
A URI must be simple, extensible, and portable across different media. The main design considerations are:
Portability across systems.
Representable as a character sequence in various forms (paper, screen, binary).
Human-readable characters are preferred.
Consequently, URIs are limited to a small set of US-ASCII characters, which ensures compatibility and ease of input on keyboards and other devices.
2.2.1 Percent‑Encoding
Characters outside the allowed set, and reserved characters used as plain data, are represented using percent-encoding (also called URL encoding):
pct-encoded = "%" HEXDIG HEXDIG
For example, the byte 0x2B ("+") is encoded as %2B. The hex digits are case-insensitive (%2b and %2B are equivalent), but uppercase is recommended.
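As a minimal sketch of this rule, the following JavaScript turns a single byte into a percent-encoded triplet. The helper name pctEncodeByte is made up for illustration; encodeURIComponent and decodeURIComponent are standard built-ins and are shown only for comparison.

```javascript
// Hypothetical helper: turn one byte (0-255) into "%XX" with uppercase hex digits.
function pctEncodeByte(byte) {
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

const plus = pctEncodeByte(0x2b);           // "+" is byte 0x2B
const viaBuiltin = encodeURIComponent("+"); // the built-in encoder agrees

// Hex digits are case-insensitive when decoding.
const lower = decodeURIComponent("%2b");
const upper = decodeURIComponent("%2B");
```

Both plus and viaBuiltin hold "%2B", and both decoded values are "+".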
2.2.2 Reserved Characters
Reserved characters serve as delimiters between URI components:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
2.2.3 Unreserved Characters
Characters that never need encoding:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
ALPHA = a-z / A-Z
DIGIT = 0-9
2.2.4 Summary
The following diagram (image) illustrates which characters are reserved vs. unreserved and when encoding is required.
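The same classification can be written down in code. This is only a sketch: the constant names and the classify helper are assumptions made for illustration, and the character sets are copied from the grammar above.

```javascript
// Character classes from the reserved / unreserved grammar above.
const GEN_DELIMS = ":/?#[]@";
const SUB_DELIMS = "!$&'()*+,;=";
const UNRESERVED_RE = /^[A-Za-z0-9\-._~]$/;

// Hypothetical helper: classify exactly one character.
function classify(ch) {
  if (UNRESERVED_RE.test(ch)) return "unreserved";
  if (GEN_DELIMS.includes(ch)) return "gen-delim";
  if (SUB_DELIMS.includes(ch)) return "sub-delim";
  // Everything else (space, "%", non-ASCII, ...) cannot appear literally as data.
  return "must be percent-encoded";
}
```

For example, classify("~") gives "unreserved", classify("?") gives "gen-delim", classify("&") gives "sub-delim", and classify(" ") gives "must be percent-encoded".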
3. URI Components
URI syntax consists of several components, each with its own rules. The generic form is:
URI = scheme ":" [ "//" authority ] path [ "?" query ] [ "#" fragment ]
authority = [ userinfo "@" ] host [ ":" port ]
Note: scheme and path are required, although the path may be empty.
Below is an example diagram of the components.
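The generic form can also be exercised with the regular expression that RFC 3986 gives in its Appendix B. The sketch below wraps it in a helper; the name splitUri and the sample URI are assumptions for illustration, and the regex only splits the string without validating the characters inside each component.

```javascript
// Regular expression from RFC 3986, Appendix B ("/" escaped for a JS literal).
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: split a URI into its five generic components.
// Components that are absent come back as undefined.
function splitUri(uri) {
  const m = URI_RE.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const parts = splitUri("http://www.ics.uci.edu/pub/ietf/uri/?x=1#Related");
const urn = splitUri("urn:isbn:9971-5-0210-0");
```

Here parts.scheme is "http", parts.authority is "www.ics.uci.edu", parts.path is "/pub/ietf/uri/", parts.query is "x=1", and parts.fragment is "Related". The URN has no authority, so urn.authority is undefined and urn.path is "isbn:9971-5-0210-0".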
3.1 Component Details
3.1.1 Scheme
Component: Scheme
Allowed characters: a-z A-Z 0-9 + . -
Case-sensitive: No
Terminator: :
The scheme identifies the protocol (e.g., http, ftp, geo). Syntax:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
3.1.2 Authority
Component: Authority
Start delimiter: //
End delimiters: / ? # EOF
The authority defines a namespace (e.g., example.com) and consists of an optional userinfo, a mandatory host, and an optional port.
3.1.2.1 Userinfo
Component: Userinfo
Allowed characters: pct-encoded, unreserved, sub-delims, ":"
Case-sensitive: Yes
Terminator: @
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
3.1.2.2 Host
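Component: Host
Allowed characters: pct-encoded, unreserved, sub-delims (for reg-name)
Case-sensitive: No
Terminator: : / ? # EOF
host = IP-literal / IPv4address / reg-name
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
reg-name = *( unreserved / pct-encoded / sub-delims )
The full IPv6address and dec-octet rules are given in RFC 3986, section 3.2.2; a dec-octet is a decimal number from 0 to 255.
3.1.2.3 Port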
Component: Port
Allowed characters: 0-9
Terminator: / ? # EOF
port = *DIGIT
Typical defaults: HTTP 80, HTTPS 443.
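To see the three authority parts in practice, the sketch below uses JavaScript's built-in URL class. Note the assumption: URL follows the WHATWG URL Standard rather than RFC 3986, so it normalizes as it parses (lowercasing the host, dropping a default port). The sample addresses are made up for illustration.

```javascript
const u = new URL("https://alice@EXAMPLE.com:8443/index.html");
const userinfo = u.username; // "alice"
const host = u.hostname;     // "example.com" (host is case-insensitive, so it is lowercased)
const port = u.port;         // "8443"

// The default port for the scheme is dropped during parsing.
const defaultPort = new URL("https://example.com:443/").port; // ""

// IPv6 literals are wrapped in square brackets so that their colons
// are not confused with the port separator.
const v6 = new URL("http://[2001:db8::1]:8080/");
const v6host = v6.hostname;  // "[2001:db8::1]"
const v6port = v6.port;      // "8080"
```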
3.1.3 Path
Component: Path
Allowed characters: pct-encoded, unreserved, sub-delims, @, :, /
Terminator: ? # EOF
path = path-abempty / path-relative
path-abempty = *( "/" segment )
path-relative = segment-nocolon *( "/" segment )
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
segment-nocolon = 1*( unreserved / pct-encoded / sub-delims / "@" )
This is a simplified form of the path rules in RFC 3986, section 3.3.
3.1.4 Query
Component: Query
Allowed characters: pct-encoded, unreserved, sub-delims, @, :, /, ?
Start delimiter: ?
End delimiter: # EOF
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
3.1.5 Fragment
Component: Fragment
Allowed characters: pct-encoded, unreserved, sub-delims, @, :, /, ?
Start delimiter: #
End delimiter: EOF
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
3.1.6 Summary
Each component has its own allowed character set; reserved characters act as delimiters between components, while sub‑delimiters may appear inside components.
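A short sketch of what this means in practice, using the built-in URL and URLSearchParams classes (which follow the WHATWG URL Standard, an assumption worth keeping in mind when comparing against RFC 3986). The sample URI and parameter names are made up for illustration.

```javascript
// "/" and "?" are plain data once the query has started; only "#" ends it.
const u = new URL("http://example.com/a?next=/b?c=d#frag");
const query = u.search;  // "?next=/b?c=d"
const fragment = u.hash; // "#frag"

// "&" and "=" (sub-delims) separate key/value pairs inside the query by
// convention, so a literal "&" inside a value has to be percent-encoded.
const params = new URLSearchParams("title=" + encodeURIComponent("R&D") + "&page=2");
const title = params.get("title"); // "R&D"
const page = params.get("page");   // "2"
```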
3.2 Parsing a URI
Because URI syntax follows a context‑free grammar, a recursive‑descent parser can be implemented. The following diagram (image) shows the parsing flow.
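A sample hand-written parser in JavaScript that follows this flow (a sketch that splits on delimiters and does not validate the characters of each component):
/**
 * Return the index of the first character at or after `from` that appears
 * in `delimiters`, or input.length if there is none.
 */
function scanTo(input, delimiters, from) {
  let pos = from;
  while (pos < input.length && !delimiters.includes(input[pos])) {
    ++pos;
  }
  return pos;
}
/**
 * Main URI parsing function
 */
function parse(uri) {
  const result = {};
  // scheme: everything up to the first ':'
  let pos = scanTo(uri, ":", 0);
  if (pos === uri.length) throw new Error("missing scheme");
  result.scheme = uri.slice(0, pos);
  ++pos; // skip ':'
  // authority: present only when the next two characters are "//"
  if (uri.startsWith("//", pos)) {
    const authorityEnd = scanTo(uri, "/?#", pos + 2);
    let authority = uri.slice(pos + 2, authorityEnd);
    const at = authority.indexOf("@");
    if (at >= 0) {
      result.userinfo = authority.slice(0, at);
      authority = authority.slice(at + 1);
    }
    // the port follows the last ':' outside an IPv6 "[...]" literal
    const colon = authority.lastIndexOf(":");
    if (colon > authority.lastIndexOf("]")) {
      result.port = authority.slice(colon + 1);
      authority = authority.slice(0, colon);
    }
    result.host = authority;
    pos = authorityEnd;
  }
  // path: runs until '?', '#', or the end of input
  let end = scanTo(uri, "?#", pos);
  result.path = uri.slice(pos, end);
  pos = end;
  if (uri[pos] === "?") {
    end = scanTo(uri, "#", pos + 1);
    result.query = uri.slice(pos + 1, end);
    pos = end;
  }
  if (uri[pos] === "#") {
    result.fragment = uri.slice(pos + 1);
  }
  return result;
}
5. Re-examining Encode and Decode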
When generating a URI, percent-encode each component first and only then concatenate the components with delimiters. Decoding reverses the process: split on the delimiters, then decode each component individually. Unreserved characters should be left unencoded; RFC 3986 treats their encoded and unencoded forms as equivalent.
Note: Do not double‑encode or double‑decode a URI, as this corrupts its semantics.
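The order of operations can be sketched with the built-in encodeURIComponent and decodeURIComponent. The host, path, and parameter values below are made-up inputs for illustration.

```javascript
// Hypothetical inputs for illustration.
const segment = "reports/2024 Q1"; // one path segment containing "/" and a space
const value = "a&b=c";             // a query value containing sub-delims

// Encode each component first, then join with delimiters.
const uri = "https://example.com/files/" + encodeURIComponent(segment) +
            "?q=" + encodeURIComponent(value);

// Decoding: split on the delimiters first, then decode each piece.
const [pathPart, queryPart] = uri.split("?");
const lastSegment = decodeURIComponent(pathPart.split("/").pop());
const decodedValue = decodeURIComponent(queryPart.split("=")[1]);

// Double-encoding turns "%" itself into "%25" and changes the meaning.
const once = encodeURIComponent("a b"); // "a%20b"
const twice = encodeURIComponent(once); // "a%2520b"
```

The resulting uri is "https://example.com/files/reports%2F2024%20Q1?q=a%26b%3Dc". Because the "/" and "&" inside the data were encoded before joining, splitting on the real delimiters recovers "reports/2024 Q1" and "a&b=c" intact. Decoding twice once yields "a%20b", not the original "a b".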
5.1 Implementing Encode and Decode
5.1.1 encode
A JavaScript sketch for percent-encoding a string, respecting a set of characters that do not need encoding (dontNeedEncodingSet is assumed to be a Set of single-character strings):
/**
 * Encode a string s using percent-encoding, skipping characters in dontNeedEncodingSet
 */
function encode(s, dontNeedEncodingSet) {
  const utf8 = new TextEncoder();
  let R = "";
  // for...of iterates by code point, so a UTF-16 surrogate pair stays together
  for (const c of s) {
    if (dontNeedEncodingSet.has(c)) {
      R += c;
    } else {
      // convert the character to UTF-8 and emit one %XX triplet per byte
      for (const outByte of utf8.encode(c)) {
        R += "%" + (outByte >> 4).toString(16).toUpperCase()
                 + (outByte & 0xf).toString(16).toUpperCase();
      }
    }
  }
  return R;
}
5.1.2 decode
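A JavaScript sketch for percent-decoding a URI component:
/**
 * Decode a percent-encoded string s
 */
function decode(s) {
  const utf8 = new TextDecoder();
  const isHex = (ch) => /^[0-9A-Fa-f]$/.test(ch);
  const len = s.length;
  let R = "";
  let index = 0;
  while (index < len) {
    if (s[index] === "%") {
      // collect a run of consecutive %XX triplets as raw bytes
      const out = [];
      while (s[index] === "%" && index + 2 < len &&
             isHex(s[index + 1]) && isHex(s[index + 2])) {
        out.push(parseInt(s.slice(index + 1, index + 3), 16));
        index += 3;
      }
      if (out.length === 0) {
        // design choice of this sketch: a "%" that does not start a valid
        // triplet is copied through unchanged
        R += s[index];
        ++index;
      } else {
        // convert the UTF-8 bytes to characters (invalid sequences become U+FFFD)
        R += utf8.decode(new Uint8Array(out));
      }
    } else {
      R += s[index];
      ++index;
    }
  }
  return R;
}
5.1.3 Summary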
Understanding the standards behind URI encoding/decoding helps developers use language‑specific APIs correctly and avoid common pitfalls such as double‑encoding or mishandling reserved characters.
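As one illustration of a language-specific API, JavaScript ships two encoders with different scopes, and picking the wrong one is a common source of bugs. The sample string below is made up; the four functions are standard built-ins.

```javascript
const raw = "name=Tom & Jerry";

// encodeURI is meant for a whole URI: it leaves reserved characters
// such as "&", "=", "?" and "/" untouched.
const whole = encodeURI(raw);              // "name=Tom%20&%20Jerry"

// encodeURIComponent is meant for one component: it escapes them too.
const component = encodeURIComponent(raw); // "name%3DTom%20%26%20Jerry"

// Non-ASCII text is converted to UTF-8 before percent-encoding.
const utf8 = encodeURIComponent("é");      // "%C3%A9"

// decodeURIComponent throws URIError on a malformed escape sequence.
let failed = false;
try {
  decodeURIComponent("%E0%A4%A");
} catch (e) {
  failed = e instanceof URIError;
}
```

Using encodeURI on a single query value would leave the "&" unescaped and silently split the value in two, which is exactly the kind of reserved-character mistake described above.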
Enjoy your coding trip~
TAL Education Technology