Declaring character encodings in HTML

Background

You should always specify the encoding used for an HTML or XML page. If you don’t, you risk that characters in your content are incorrectly interpreted. This is not just an issue of human readability: increasingly, machines need to understand your data too. You should also check that you are not specifying different encodings in different places.

This article offers simple advice on how to create the needed declarations. It does so in two ways:

it gives simple advice that works for those who just want to be told quickly what to do
it provides further information for those who want to understand the topic better or explore alternative approaches. It assumes you read the quick answer first.

If you need to better understand what characters and character encodings are, see the article Character encodings for beginners. For information about declaring encodings for CSS style sheets, see CSS character encoding declarations.

Quick answer

Here we give quick summary information for those who just want to be told what to do, with minimal explanations. Follow these steps:

Decide whether to use the HTTP header
Check the table in in-document declarations for the format you are using.
Read about character encoding names

If you don’t understand this summary advice or want to understand the reasoning, the links will take you to sections lower down the page which provide examples and explanations.

HTTP headers

You should definitely use HTTP header declarations if it is likely that the document will be transcoded (ie. the character encoding will be changed by intermediary servers), since HTTP declarations have higher precedence than in-document ones.

Otherwise, use HTTP headers wherever it makes sense for the type of content, but always in conjunction with an in-document declaration (see below). You should always ensure that HTTP declarations are consistent with the in-document declarations.

If your page is encoded as UTF-16, do not declare your file to be “UTF-16BE” or “UTF-16LE”; use “UTF-16” only, and send a byte-order mark with your file.

In-document declarations

In each of the examples below, unless otherwise indicated, substitute the appropriate charset name for where you see “UTF-8”.

Format	What to do
HTML5	Use the meta charset attribute in a meta element at the top of the head element, and ensure that the whole declaration fits within the first 1024 bytes of the page. `<meta charset="UTF-8">`
HTML5 with UTF-16	Ensure that there is a byte-order mark at the beginning of the file. The HTML Working Group is currently discussing whether you can use a meta element declaration in the head element when the encoding is UTF-16. For now, don’t.
HTML4	Use a pragma directive at the top of the head element. `<meta http-equiv="Content-type" content="text/html;charset=UTF-8">`
XHTML 1.x served with text/html MIME type	Use UTF-8 for your page encoding, and use a pragma directive at the top of the head element. `<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />` (If you want to use any other encoding for your content, read the detailed information below.)
XHTML 1.x served as XML	Use the encoding declaration of the XML declaration on the first line of the page. Ensure there is nothing before it, including spaces (although a byte-order mark is ok). `<?xml version="1.0" encoding="UTF-8"?>`

Character encoding names

You can find names for character encodings in the IANA registry. You should use these names for each method of specifying a character encoding listed here. Note that although these are called charset names by IANA, in reality they refer to the encodings, not the character sets.

The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as ‘preferred’.

It is possible to invent your own encoding names preceded by x-, but this is not usually a good idea since it limits interoperability.

Note that there is a hyphen in the name UTF-8.

Alternative approaches

There are various alternatives to the approaches recommended above, but each involves trade-offs. Follow these links for more details.

Using the pragma directive in HTML5 instead of the meta charset attribute.
Using the meta charset attribute in HTML4.
Using an XML declaration in XHTML 1.x served as HTML.

More details

Here we give more detailed information about the various possible ways of declaring character encoding information, starting with HTTP headers, and then listing the various in-document approaches for non-UTF-16 encoded pages. There is a special subsection for pages encoded as UTF-16.

The HTTP header

The Content-Type information in the HTTP header can include information about the character encoding for the document.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
...
Content-Type: text/html; charset=UTF-8
Content-Language: en

Encoding can be specified in the HTTP header for files containing such things as CSS and JavaScript, not just HTML markup.

If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. In PHP, for example, call the header() function before generating any content:

<?php header('Content-type: text/html; charset=utf-8'); ?>
<!DOCTYPE html>
...
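
The same idea can be sketched in other server-side environments. The following is a minimal Python WSGI application, given only as an assumption-laden illustration: the function name app and the page content are invented for the example. It sets the Content-Type header before returning any content and encodes the body in the encoding it declares.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares UTF-8 in the HTTP header."""
    html = '<!DOCTYPE html>\n<meta charset="utf-8">\n<title>Café</title>\n'
    body = html.encode("utf-8")  # the bytes must really use the declared encoding
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Call the application directly with a stub to inspect what it would send.
sent = {}


def record(status, headers):
    sent["status"] = status
    sent["headers"] = dict(headers)


chunks = app({}, record)
print(sent["headers"]["Content-Type"])
```

To try it over HTTP, the application could be passed to wsgiref.simple_server.make_server from the Python standard library.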

If you are serving static files, the server may associate this information with the files. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.

Apache servers, for instance, typically provide a default encoding, which can usually be overridden by user settings. A user might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:

AddType 'text/html; charset=UTF-8' html

For more information on changing the encoding in the HTTP header, see Setting the HTTP charset parameter.

Let’s look at whether it is appropriate to declare the encoding in the HTTP header, inside the page, or both.

Advantages

The HTTP header information has the highest priority when it conflicts with in-document declarations. Intermediate servers that transcode the data (ie. convert to a different encoding) sometimes take advantage of this to change the encoding of a document before sending it on to small devices that only recognize a few encodings. Because the HTTP header information has precedence over any in-document declaration, transcoders typically do not change the internal encoding declarations, just the document encoding and the declaration in the HTTP headers.
User agents can easily find the character encoding information when it is sent in the HTTP header.
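
The transcoding scenario can be sketched in Python as follows. The transcode function is a hypothetical stand-in for an intermediary server, and its header parsing is deliberately naive; the point is only that the bytes and the HTTP header change while the meta declaration inside the markup does not.

```python
def transcode(body, headers, target="iso-8859-1"):
    """Re-encode the body and update only the HTTP header (hypothetical proxy)."""
    # Naive parsing: assumes charset is the last parameter of Content-Type.
    source = headers["Content-Type"].split("charset=")[1].strip()
    text = body.decode(source)
    new_headers = dict(headers)
    new_headers["Content-Type"] = "text/html; charset=" + target
    # The markup, including any meta declaration, is left untouched.
    return text.encode(target), new_headers


body = '<meta charset="UTF-8"><p>café</p>'.encode("utf-8")
headers = {"Content-Type": "text/html; charset=UTF-8"}
new_body, new_headers = transcode(body, headers)

# Following the HTTP header decodes the transcoded bytes correctly.
via_header = new_body.decode("iso-8859-1")
# Following the now stale meta declaration does not.
via_meta = new_body.decode("utf-8", errors="replace")

print(new_headers["Content-Type"])
print(via_header)
print(via_meta)
```

Because the in-document declaration still says UTF-8 after transcoding, the text is only recovered intact when the header takes precedence.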

Disadvantages

It may be difficult for content authors to change the encoding information for static files on the server – especially when dealing with an ISP. They will need knowledge of and access to the server settings.
Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.
There are potential problems for both static and dynamic documents if they are not read from a server; for example, if they are saved to a location such as a CD or hard disk. In these cases any encoding information from an HTTP header is not available.
Similarly, if the character encoding is only declared in the HTTP header, this information is no longer available for files during editing, or when they are processed by such things as XSLT or scripts, or when they are sent for translation, etc.
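
The effect of a server setting that has drifted out of step with the document can be seen in a few lines of Python. This is only a sketch, assuming a page saved as UTF-8 and a header that claims ISO-8859-1:

```python
# A page saved as UTF-8 ...
page = "<p>café</p>".encode("utf-8")

# ... decoded the way a (hypothetical) ISO-8859-1 header says it should be,
# and decoded the way the author intended.
as_declared = page.decode("iso-8859-1")
as_intended = page.decode("utf-8")

print(as_declared)  # <p>cafÃ©</p>
print(as_intended)  # <p>café</p>
```

The two bytes that encode é in UTF-8 are each read as a separate ISO-8859-1 character, which is the kind of garbling readers see when the declarations disagree.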

So should I use this method?

If serving files via HTTP from a server, it is never a problem to send information about the character encoding of the document in the HTTP header, as long as that information is correct.

If you think that there is a chance that the encoding of the file may be changed by an intermediary before it reaches the user (eg. transcoded to an encoding recognizable to a mobile phone), you may particularly want to consider using the HTTP declaration, since that is where the change will take place.

On the other hand, because of the disadvantages listed above we recommend that you should always declare the encoding information inside the document as well.

(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this would usually mean taking action to disable any server defaults.)

The meta charset attribute

The HTML5 specification describes a new way to declare the encoding of a document, and the major browsers already support it. You can use this for pages written using HTML5 markup. Alternatively, you can use the pragma directive, but you shouldn’t use both in the same page.

If you use this declaration in HTML4 pages, the HTML4 validator will complain (although the browser will still detect the information).

The declaration looks as follows.

<meta charset="UTF-8">

The HTML5 specification requires that the whole meta element fit in the first 1024 bytes of the document, so always include it at the top of the head element.
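
One way to check this requirement on your own files is sketched below in Python. The function name and the regular expression are assumptions made for illustration; this is a rough pattern match, not the encoding detection algorithm that browsers implement.

```python
import re

# Rough pattern for a meta element carrying a charset, covering both the
# meta charset attribute and the pragma directive. Not a real HTML parser.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def declared_in_first_1024(raw):
    """Return the declared encoding if the meta element ends within
    the first 1024 bytes of the document, otherwise None."""
    match = META_CHARSET.search(raw)
    if match is None:
        return None
    end = raw.find(b">", match.end())
    if end == -1 or end + 1 > 1024:
        return None
    return match.group(1).decode("ascii")


early = b'<!DOCTYPE html><html><head><meta charset="UTF-8"><title>x</title>'
late = b"<!DOCTYPE html><html><head><!--" + b"x" * 2000 + b'--><meta charset="UTF-8">'

print(declared_in_first_1024(early))
print(declared_in_first_1024(late))
```

The second document contains a valid declaration, but a long comment pushes it past the 1024-byte limit, so the check reports nothing.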

You don’t strictly need to use an explicit declaration if you have used UTF-8, but it is better to do so because it allows visual inspection of the encoding from the source code. It may also enable better support in older browsers and in authoring tools.

If you encode your page as UTF-16, see Using UTF-16.

The pragma directive

This is a meta element, which should appear as close as possible to the top of the head element, and which looks as follows:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8">

For XHTML syntax, you should, of course, have “ />” after the content attribute, rather than just “>”.

The encoding of the document is specified just after charset=. In this case the specified encoding is the Unicode encoding, UTF-8.

The pragma directive should be used for pages written in HTML 4.01. It should also be used for XHTML 1.x documents served as HTML, since the HTML parser will not pick up encoding information from the XML declaration.

In HTML5 you can either use this approach for declaring the encoding, or the newly specified meta charset attribute, but not both in the same page. The encoding declaration should also fit within the first 1024 bytes of the document, so you should generally put it immediately after the opening tag of the head element.

An in-document encoding like this allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.

An in-document declaration also helps developers, testers, or translation production managers who want to visually check the encoding of a document.

If you encode your page as UTF-16, see Using UTF-16.

The XML declaration

The XML declaration is defined by the XML standard. It appears at the top of an XML file and supports an encoding declaration. For example:

<?xml version="1.0" encoding="UTF-8"?>

An XML declaration is required for a document parsed as XML if the encoding of the document is other than UTF-8 or UTF-16 and if the encoding is not provided by a higher level protocol, ie. the HTTP header.

This is significant, because if you decide to omit the XML declaration you must choose either UTF-8 or UTF-16 as the encoding for the page if it is to be used without HTTP!
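
The following Python sketch shows the consequence using the standard library's XML parser. It assumes a small ISO-8859-1 encoded fragment read from disk, with no HTTP header available:

```python
import xml.etree.ElementTree as ET

latin1 = "<p>café</p>".encode("iso-8859-1")

# Without an XML declaration the parser assumes UTF-8 and rejects the bytes.
try:
    ET.fromstring(latin1)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# With the declaration the same bytes parse correctly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1
root = ET.fromstring(declared)

print(parsed_without_declaration)
print(root.text)
```

The first attempt fails because the byte used for é in ISO-8859-1 is not valid UTF-8 in that position; the second succeeds because the declaration tells the parser how to read it.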

You should use an XML declaration to specify the encoding of any XHTML 1.x document served as XML.

It can be useful to use an XML declaration for web pages served as XML, even if the encoding is UTF-8 or UTF-16, because an in-document declaration of this kind also helps developers, testers, or translation production managers ascertain the encoding of the file with a visual check of the source code.

Using the XML declaration for XHTML served as HTML. XHTML served as HTML is parsed as HTML, even though it is based on XML syntax, and therefore an XML declaration should not be recognized by the browser. It is for this reason that you should use a pragma directive to specify the encoding when serving XHTML in this way*.

* Conversely, the pragma directive, though valid, is not recognized as an encoding declaration by XML parsers.

On the other hand, the file may also be used at some point as input to other processes that do use XML parsers. This includes such things as XML editors, XSLT transformations, AJAX, etc. In addition, sometimes people use server-side logic to determine whether to serve the file as HTML or XML. For these reasons, if you aren’t using UTF-8 or UTF-16 you should add an XML declaration at the beginning of the markup, even if it is served to the browser as HTML. This would make the top of a file look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=ISO-8859-1" />
...

If you are using UTF-8 or UTF-16, however, there is no need for the XML declaration, especially as the meta element provides for visual inspection of the encoding by people processing the file.

Catering for older browsers. In Internet Explorer 6, if anything appears before the DOCTYPE declaration, the page is rendered in quirks mode. If you are using UTF-8 or UTF-16 you can omit the XML declaration, and you will have no problem.

If, however, you are not using these encodings and Internet Explorer 6 users still count for a significant proportion of your readers, and if your document contains constructs that are affected by the difference between standards mode vs. quirks mode, then this may be an issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you will have to add workarounds to your CSS to overcome the differences.

There may also be some other rendering issues associated with an XML declaration, though these are probably only an issue for much older browsers. The XHTML specification warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. You should do testing on appropriate user agents to decide whether this will be an issue for you.

Of course, as mentioned above, if you use UTF-8 or UTF-16 you can omit the XML declaration and the file will still work as XML or HTML. This is probably the simplest solution.

Using UTF-16

According to the results of a Google sample of several billion pages in 2010, less than 0.01% of pages on the Web are encoded in UTF-16. Most of the time you are probably better off choosing UTF-8 as your encoding (which by the same survey accounted for over 50% of all Web pages). One reason for this is that there are special rules for declaring the encoding of a UTF-16 page.

In this article we generally recommend declaring the encoding inside the document, even if you also declare it in the HTTP header. The HTML5 specification, however, currently forbids the use of the meta charset attribute or the pragma directive to declare UTF-16. There is some discussion around whether that is necessary, and things may change. For the time being, however, if you want your HTML5 code to validate, you should not use element declarations for UTF-16 encoded content.

Whether you use element-based declarations or not, you should ensure that you always have a byte-order mark at the very start of a UTF-16 encoded file. In effect, this is the in-document declaration.

Furthermore, if your page is encoded as UTF-16, do not declare your file to be “UTF-16BE” or “UTF-16LE”; use “UTF-16” only. The byte-order mark at the beginning of your file will indicate whether the encoding scheme is little-endian or big-endian. (This is because content explicitly encoded as, say, UTF-16BE should not use a byte-order mark; but HTML5 requires a byte-order mark for UTF-16 encoded pages.)
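
The difference between these labels can be seen in Python, whose "utf-16" codec writes a byte-order mark while "utf-16-be" does not. The helper utf16_byte_order is a hypothetical name used only for this sketch.

```python
import codecs

text = "<!DOCTYPE html><p>café</p>"

with_bom = text.encode("utf-16")      # BOM added; byte order is the platform's
no_bom_be = text.encode("utf-16-be")  # explicit byte order, no BOM


def utf16_byte_order(raw):
    """Report the byte order signalled by a UTF-16 byte-order mark, if any."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return None


print(utf16_byte_order(with_bom))
print(utf16_byte_order(no_bom_be))
```

A file saved like no_bom_be carries no in-document signal of its encoding at all, which is why the byte-order mark is treated as the declaration for UTF-16 pages.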

The charset attribute on a link

The HTML 4.01 specification describes a charset attribute that can be added to the a, link and script elements and is supposed to indicate the encoding of the document you are linking to.

This can be used on an a element embedded in text, like this:

See our <a href="/mysite/mydoc.html" charset="ISO-8859-1">list of publications</a>.

You could also use it like this to indicate the encoding of a CSS style sheet:

<link rel="stylesheet" charset="Windows-1251" href="mystyles.css" type="text/css">

The idea is that the browser would be able to apply the right encoding to the document it retrieves if no encoding is specified for the document in any other way.

The use of this attribute on an a or link element is currently deprecated by the HTML5 specification, so you should avoid using it on those elements.

In addition, there are some things to consider before using this attribute. Firstly, it is not well supported by major browsers. Secondly, it is hard to ensure that the information is correct at any given time. The author of the document pointed to may well change the encoding of the document without you knowing. If the author still hasn’t specified the encoding of their document, you will now be asking the browser to apply an incorrect encoding. And thirdly, it shouldn’t be necessary anyway if people follow the guidelines in this tutorial and mark up their documents properly. That is a much better approach.

This way of indicating the encoding of a document has the lowest precedence (ie. if the encoding is declared in any other way, this will be ignored). This means that you can’t use this to correct incorrect declarations either.

Precedence rules

In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest.

1. HTTP Content-Type header
2. byte-order mark (BOM)
3. XML declaration
4. meta element
5. link charset attribute

The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since such ‘transcoding’ is unlikely to change the in-document declarations. The transcoding server should, however, declare the new encoding in the HTTP header.

The HTML5 specification (which is not yet stable) formally describes precedence for the byte-order mark (BOM). According to the specification, the BOM has lower precedence than the HTTP Content-Type header, but higher precedence than anything else. At the time of writing, this was not consistently implemented in the latest versions of major browsers. For more information see the test results.
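
The precedence order listed in this section can be expressed as a small Python function. This is only a sketch of the order as described in this article, with a hypothetical function name and parameters; as noted above, browsers did not implement it consistently.

```python
def effective_encoding(http_header=None, bom=None, xml_declaration=None,
                       meta=None, link_charset=None):
    """Return (source, encoding) for the highest-precedence declaration
    that is present, or None if nothing was declared."""
    candidates = [
        ("HTTP Content-Type header", http_header),
        ("byte-order mark", bom),
        ("XML declaration", xml_declaration),
        ("meta element", meta),
        ("link charset attribute", link_charset),
    ]
    for source, encoding in candidates:
        if encoding:
            return source, encoding
    return None


# A transcoded page: the header was updated, the meta element was not.
print(effective_encoding(http_header="ISO-8859-1", meta="UTF-8"))
# A page read from disk: only the in-document declaration is available.
print(effective_encoding(meta="UTF-8"))
```

In the first call the header wins over the conflicting meta element; in the second the meta element is the only declaration, so it is used.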