HTML Document Character Set
Contents
The Document Character Set
Character entities
Human languages define a large number of text characters and human
beings have invented a wide variety of systems for representing these
characters in a computer. Unless proper precautions are taken,
differing character representations may not be understood by user
agents in all parts of the world.
The Document Character Set
To promote interoperability, SGML requires that each application
(including HTML), as part of its definition, define its document
character set. A document character set is a set of abstract
characters (such as the Cyrillic letter "I", the Chinese character
meaning "water", etc.) and a corresponding set of integer references
to those characters. SGML considers a document to be a sequence of
references in the document character set.
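As a rough illustration of "abstract characters plus integer references", the following Python sketch prints the integer assigned to a few characters. The helper name code_points and the sample table are illustrative assumptions, not part of this specification; Python is used only because its strings expose Unicode code points directly.

```python
# Illustrative sketch: each abstract character corresponds to one integer
# in the document character set (here, its Unicode code point).
samples = {
    "Cyrillic capital letter I": "\u0418",
    "Chinese character meaning water": "\u6C34",
    "Latin small letter a with ring above": "\u00E5",
}


def code_points(text):
    """Return the integer references for every character in text."""
    return [ord(ch) for ch in text]


for name, ch in samples.items():
    # Show both the decimal and the hexadecimal form of the same integer.
    print(f"{name}: decimal {ord(ch)}, hexadecimal {ord(ch):X}")
```

In this view a document is simply the list that code_points returns for its whole text.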
The document character set for HTML is the Universal Character Set
(UCS) of [ISO10646]. This set is
character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these
standards are updated from time to time with new characters
and the amendments should be consulted at the respective Web
sites.
In the current
specification, references to ISO/IEC-10646 or Unicode imply the same
document character set. However, the current document also refers to
the Unicode specification for other issues such as the bidirectional
text algorithm.
Conforming HTML user agents may receive or output a document, or
represent a document internally, using any character
encoding. A character encoding represents some subset of the
document character set. Character encodings such as ISO-8859-1
(commonly referred to as "Latin-1" since it encodes most Western
European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a
Japanese encoding), and euc-jp (another Japanese encoding) save
bandwidth by representing only slices of the document character set.
Thus, character encodings allow authors to work with a convenient
subset of the document character set. Authors should not have to know
anything about the underlying character encoding of the document or
tool they are using: writing Japanese in a UTF-8 editor is as easy
as writing Japanese in a JIS or SHIFT_JIS editor.
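The idea that an encoding covers only a slice of the document character set can be checked with a small Python sketch. The function name can_encode is a hypothetical helper introduced here for illustration; the codec names are the ones Python's standard library happens to use.

```python
def can_encode(ch, encoding):
    """Return True if the encoding has a representation for ch."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        # The character lies outside the subset this encoding covers.
        return False


cyrillic_i = "\u0418"   # Cyrillic capital letter I
a_ring = "\u00E5"       # Latin small letter a with ring above

for encoding in ("iso-8859-1", "iso-8859-5", "shift_jis", "utf-8"):
    print(encoding, can_encode(cyrillic_i, encoding), can_encode(a_ring, encoding))
```

ISO-8859-1 accepts the letter with the ring but not the Cyrillic letter, ISO-8859-5 does the opposite, and UTF-8 accepts both.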
Character encodings also mean that authors are not required to
enter a document's text in the form of references to the document
character set. Requiring authors to work with such a large character
encoding would be cumbersome and wasteful (although encodings such as
UTF-8 that cover all of Unicode do exist).
To allow this convenience, conforming user agents must correctly
map to [UNICODE] all characters in
any character encodings ("charsets") they recognize (or behave as if
they did). A list of recommended character encodings for various
scripts and languages will be provided in a separate document.
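The mapping requirement can be pictured as follows: whatever encoding the bytes arrive in, decoding must yield the same Unicode characters. This is a minimal sketch using Python's built-in codecs as a stand-in for a user agent's decoder; the sample text and variable names are assumptions made for the example.

```python
# The same three Japanese characters, written with Unicode escapes so that
# this snippet does not depend on the encoding of its own source file.
text = "\u65E5\u672C\u8A9E"

encodings = ["euc_jp", "shift_jis", "utf-8"]

# Each encoding produces a different byte sequence for the same characters.
encoded = {name: text.encode(name) for name in encodings}

# Decoding with the matching encoding maps every byte sequence back to
# the same sequence of Unicode characters.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name in encodings:
    print(name, encoded[name].hex(), decoded[name] == text)
```

Decoding with the wrong encoding name would either fail or produce different characters, which is why the user agent needs to be told which encoding was used.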
How does a user agent know which character encoding has been used to
encode a given document?
In many cases, before a Web server sends an HTML document over the
Web, it tries to figure out the character encoding (by a variety of
techniques such as examining the first few bytes of the file, checking
its encoding against a database of known files and encodings,
etc.). The server transmits the document and the name of the character
encoding to the receiving user agent by way of the
charset parameter of the HTTP "Content-Type" field. For
example, the following HTTP header announces that the character
encoding is "euc-jp".
Content-Type: text/html; charset=euc-jp
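On the receiving side, extracting the parameter from such a header might look like the sketch below. It relies on Python's standard email.message.Message class to parse the MIME-style field; the function name charset_from_content_type is an assumption made for this example, and a real user agent may parse the header differently.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() parses the parameters of the Content-Type field.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=euc-jp"))
print(charset_from_content_type("text/html"))
```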
The value of the "charset" parameter must be the name of a
"charset" as defined in [RFC2045].
Unfortunately, not all servers send information about the character
encoding (even when the character encoding is different from the
widely used ISO-8859-1 encoding). HTML therefore allows authors a way
to tell user agents which character encoding has been used by
specifying it explicitly in the document header with the META element. For example, to specify that the
character encoding of the current document is "euc-jp", include
the following META declaration:
<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
This mechanism has a notable limitation: the user agent cannot
interpret the META element to determine the
character encoding if it doesn't already know the character encoding of
the document. The META declaration must
only be used when the character encoding is organized such that
ASCII characters stand for themselves at least until the META element is parsed. In this case, conforming
user agents must correctly interpret the META element.
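One way to picture this restriction is a prescan that treats the first bytes of the document as ASCII and looks for the declaration. The sketch below is an assumption-laden illustration, not an algorithm defined by this specification: the function name charset_from_meta, the 1024-byte limit, and the regular expressions are all choices made for the example.

```python
import re

# Patterns operate on raw bytes because the encoding is not yet known.
META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV = re.compile(rb"http-equiv\s*=\s*[\"']?content-type", re.IGNORECASE)
CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def charset_from_meta(raw, limit=1024):
    """Scan the first `limit` bytes for a META Content-Type declaration."""
    head = raw[:limit]
    for tag in META_TAG.finditer(head):
        if HTTP_EQUIV.search(tag.group(0)):
            match = CHARSET.search(tag.group(0))
            if match:
                return match.group(1).decode("ascii").lower()
    return None


declaration = '<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">'
print(charset_from_meta(declaration.encode("ascii")))
# In UTF-16 the ASCII characters do not stand for themselves, so the
# prescan cannot see the declaration at all.
print(charset_from_meta(declaration.encode("utf-16")))
```

The second call returns None, which mirrors the limitation described above: the declaration only helps when ASCII characters keep their usual byte values up to the META element.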
To sum up, conforming user agents must observe the following
priorities when determining a document's character encoding (from highest
priority to lowest):
1. Explicit user action to override erroneous behavior.
2. An HTTP "charset" parameter in a "Content-Type" field.
3. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
4. The "charset" attribute set for the A and LINK elements.
5. User agent heuristics and user settings. For example, user agents
typically assume that in the absence of other indicators, the
character encoding is ISO-8859-1. This
assumption may lead to an unreadable presentation of certain
documents.
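The priority order above can be sketched as a simple fall-through. The function and parameter names below are hypothetical, and the ISO-8859-1 default stands in for whatever heuristics and user settings a given user agent actually applies.

```python
def resolve_encoding(user_override=None, http_charset=None,
                     meta_charset=None, link_charset=None,
                     fallback="iso-8859-1"):
    """Pick a character encoding name following the priority list above.

    Each argument is an encoding name or None when that source gave no
    information. `fallback` represents user agent heuristics and settings.
    """
    candidates = (
        user_override,   # explicit user action
        http_charset,    # HTTP Content-Type charset parameter
        meta_charset,    # META http-equiv declaration
        link_charset,    # charset attribute on the referring A or LINK
    )
    for candidate in candidates:
        if candidate:
            return candidate.lower()
    return fallback


print(resolve_encoding(http_charset="EUC-JP", meta_charset="shift_jis"))
print(resolve_encoding())
```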
In all cases, the value of the "charset" attribute or parameter
must be the name of a "charset" as defined in [RFC2045].
If, for a specific application, it becomes necessary to refer
to characters outside [ISO10646],
characters should be assigned to a private zone to avoid conflicts
with present or future versions of the standard. This is highly
discouraged, however, for reasons of portability.
Note: Modern web servers can be configured with information about
which document is using which character encoding. Webmasters should
use these facilities but should take pains to configure the server
properly.
Character entities
Your hardware and software configuration probably won't allow you
to refer to all Unicode characters through simple input
mechanisms, so SGML offers character encoding-independent mechanisms for
specifying any character from the document character set:
Numeric character references (either decimal or hexadecimal form).
Named character references.
Numeric character references specify the integer reference of a
Unicode character. A numeric character reference with the syntax &#D;
refers to Unicode decimal character number D. A numeric character
reference with the syntax &#xH; refers to Unicode hexadecimal
character number H. The hexadecimal representation is a new SGML
convention and is particularly useful since character standards use
hexadecimal representations.
Here are some examples:
Entity &#229; refers to the letter "a" with a small circle above it (used, for example, in Norwegian).
Entity &#xE5; refers to the same character with the hexadecimal representation.
Entity &#1048; refers to the Cyrillic capital letter "I".
Entity &#x6C34; refers to the Chinese character for water with the hexadecimal representation.
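The relationship between a character and its two numeric forms can be checked with a short Python sketch. The helper numeric_reference is a hypothetical name introduced here; html.unescape is Python's standard decoder, which follows later HTML parsing rules but agrees with the syntax described above for well-formed references like these.

```python
import html


def numeric_reference(ch, hexadecimal=False):
    """Build the numeric character reference for a single character."""
    if hexadecimal:
        return f"&#x{ord(ch):X};"
    return f"&#{ord(ch)};"


for ch in ("\u00E5", "\u0418", "\u6C34"):
    decimal_form = numeric_reference(ch)
    hex_form = numeric_reference(ch, hexadecimal=True)
    # Both forms decode back to the same character.
    same = html.unescape(decimal_form) == html.unescape(hex_form) == ch
    print(decimal_form, hex_form, same)
```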
To give authors a more intuitive way to refer to characters in the
document character set, HTML offers a set of named character
entities. Named character references replace integer references
with symbolic names. The named entity &aring; refers to the same
Unicode character as &#229;. There is no named entity
for the Cyrillic capital letter "I". The full list of named character entities
is included in this specification.
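Python's standard html.entities module carries tables of the HTML 4 entity names, which makes the two statements above easy to check. This is only an illustrative lookup; the variable names are assumptions, and the tables reflect the published HTML 4 entity sets rather than this particular draft.

```python
from html.entities import codepoint2name, name2codepoint

# The name "aring" maps to the same integer as the decimal reference 229.
aring_code = name2codepoint["aring"]
print("aring ->", aring_code, hex(aring_code))

# The Cyrillic capital letter I (decimal 1048) has no symbolic name, so it
# can only be written with a numeric character reference.
cyrillic_i = 0x0418
print("named entity for 1048:", codepoint2name.get(cyrillic_i))
```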
Four named character entities deserve special mention since they are
frequently used to "escape" special characters. For text appearing as
part of the content of an element, you should escape < as &lt; to
avoid possible confusion with the beginning of a tag. The &
character should be escaped as &amp; to avoid confusion with the
beginning of an entity reference.
You should also escape & within attribute values since entity
references are allowed within cdata attribute values. In addition,
you should escape > as &gt; to avoid problems with older user
agents that incorrectly perceive this as the end of a tag when coming
across this character in quoted attribute values.
Rather than worry about rules for quoting attribute values, it's
often easier to encode any instance of " by &quot; and to always
use " for quoting attribute values. Many people find it simpler to
always escape these four characters in element content and attribute
values:
"&amp;" to represent the & sign.
"&lt;" to represent the < sign.
"&gt;" to represent the > sign.
"&quot;" to represent the " mark.
Names of named character entities are case-sensitive. Thus,
&Aring; refers to a different character (upper case A, ring) than
&aring; (lower case a, ring).
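Both the escaping advice and the case-sensitivity rule can be tried out with Python's standard html module. The sketch below is one possible way to apply them, not a procedure required by this specification; the helper name escape_markup is hypothetical, and html.escape additionally rewrites the apostrophe, which the text above does not ask for.

```python
import html


def escape_markup(text):
    """Escape &, <, > and the double quote for use in content or attributes."""
    # quote=True also escapes the double quote (and the apostrophe).
    return html.escape(text, quote=True)


sample = '<a href="x">R&D</a>'
escaped = escape_markup(sample)
print(escaped)

# Decoding the escaped text gives back the original characters.
print(html.unescape(escaped) == sample)

# Entity names are case-sensitive: these are two different characters.
upper = html.unescape("&Aring;")
lower = html.unescape("&aring;")
print(ord(upper), ord(lower))
```

Escaping the ampersand first matters when doing this by hand, since otherwise the & introduced by the other replacements would be escaped a second time; html.escape handles that ordering internally.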
Note: In SGML, it is possible to eliminate the final ";" after a
numeric or named character reference in some cases (e.g., at a line
break or directly before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest using
the ";" in all cases to avoid problems with user agents that require
this character to be present.