100 IS: UTF-8 with BOM (100.html)

A text file can be encoded as UTF-8 or as UTF-8 with BOM. This article will try to explain the difference.

StackOverflow

What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?

What does "better" mean? "Shorter"? "More Portable"?
Answer: UTF-8 can be auto-detected better by contents than by BOM. The method is simple: try to read the file (or a string) as UTF-8 and if that succeeds, assume that the data is UTF-8. Otherwise assume that it is CP1252 (or some other 8 bit encoding). Any non-UTF-8 eight bit encoding will almost certainly contain sequences that are not permitted by UTF-8. Pure ASCII (7 bit) gets interpreted as UTF-8, but the result is correct that way too. – Tronic
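The detection method Tronic describes can be sketched in a few lines of Python. This is only an illustration: the function name decode_guessing and the choice of CP1252 as the default fallback are assumptions made for the example, not part of the quoted answer.

```python
from typing import Tuple


def decode_guessing(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict UTF-8 first."""
    try:
        # Strict decoding raises on any byte sequence UTF-8 does not permit.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes undefined (0x81, for
        # instance), so unknown bytes are replaced rather than raising again.
        return data.decode(fallback, errors="replace"), fallback
```

Pure ASCII input comes back labeled as UTF-8, which matches the remark that the result is correct that way too.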

Scanning large files for UTF-8 content takes time. A BOM makes this process much faster. In practice you often need to do both. The culprit nowadays is that a lot of text content still isn't Unicode, and I still bump into tools that say they do Unicode (for instance UTF-8) but emit their content in a different codepage. – Jeroen Wiert Pluimers

UTF-8 does not have a BOM. When you put a U+FEFF code point at the start of a UTF-8 file, special care must be taken to deal with it. This is just one of those Microsoft naming lies, like calling an encoding "Unicode" when there is no such thing. – tchrist

The UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
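A short Python sketch may make the endianness point concrete. It encodes the code point U+FEFF in three encodings and compares the results with the constants in the standard codecs module. The helper name starts_with_utf8_bom is made up for this illustration.

```python
import codecs

BOM = "\ufeff"  # the code point used as a byte order mark

# In UTF-16 the byte order of the mark reveals the endianness of the data.
utf16_le = BOM.encode("utf-16-le")  # b'\xff\xfe'
utf16_be = BOM.encode("utf-16-be")  # b'\xfe\xff'

# In UTF-8 there is only one possible byte sequence, so the mark says
# nothing about byte order and serves only as a signature.
utf8 = BOM.encode("utf-8")  # b'\xef\xbb\xbf'


def starts_with_utf8_bom(data: bytes) -> bool:
    """Report whether the data begins with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)
```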

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

It might not be recommended but it did wonders to my powershell script when trying to output "æøå" – Marius

Regardless of it not being recommended by the standard, it's allowed, and I greatly prefer having something to act as a UTF-8 signature rather than the alternatives of assuming or guessing. Unicode-compliant software should/must be able to deal with its presence, so I personally encourage its use. – martineau

@bames53: Yes, in an ideal world storing the encoding of text files as file system metadata would be a better way to preserve it. But most of us living in the real world can't change the file system of the OS(s) our programs get run on -- so using the Unicode standard's platform-independent BOM signature seems like the best and most practical alternative IMHO. – martineau

@martineau Just yesterday I ran into a file with a UTF-8 BOM that wasn't UTF-8 (it was CP936). What's unfortunate is that the ones responsible for the immense amount of pain caused by the UTF-8 BOM are largely oblivious to it. – bames53

The other excellent answers already answered that:

-There is no official difference between UTF-8 and BOM-ed UTF-8
-A BOM-ed UTF-8 string will start with the following three bytes: EF BB BF
-Those bytes, if present, must be ignored when extracting the string from the file/stream.
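In Python, one way to ignore those bytes is the standard "utf-8-sig" codec, which skips a leading EF BB BF if it is present and otherwise behaves like plain UTF-8. The function name decode_ignoring_bom below is only an illustrative wrapper.

```python
def decode_ignoring_bom(data: bytes) -> str:
    """Decode UTF-8 data, dropping a leading BOM if there is one."""
    # "utf-8-sig" accepts input both with and without the signature.
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbfABC"
without_bom = b"ABC"

# Both inputs yield the same three-character string.
same = decode_ignoring_bom(with_bom) == decode_ignoring_bom(without_bom)

# Plain "utf-8" keeps the mark as an invisible first character instead.
kept = with_bom.decode("utf-8")  # '\ufeffABC'
```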

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

For example, the data [EF BB BF 41 42 43] could either be:

-The legitimate ISO-8859-1 string "ï»¿ABC"
-The legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.

@Alcott: You understood correctly. The string [EF BB BF 41 42 43] is just a bunch of bytes. You need external information to choose how to interpret it. If you believe those bytes were encoded using ISO-8859-1, then the string is "ï»¿ABC". If you believe those bytes were encoded using UTF-8, then it is "ABC". If you don't know, then you must try to find out. The BOM could be a clue. The absence of invalid characters when decoded as UTF-8 could be another... In the end, unless you can memorize/find the encoding somehow, an array of bytes is just an array of bytes. – paercebal
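The two readings of [EF BB BF 41 42 43] can be reproduced with a few lines of Python. The variable names are arbitrary, and "utf-8-sig" is Python's name for UTF-8 decoding that drops a leading BOM.

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Read as ISO-8859-1, every byte is one character, so the first three
# bytes become visible text.
as_latin1 = data.decode("iso-8859-1")  # 'ï»¿ABC'

# Read as UTF-8 with the signature ignored, only the letters remain.
as_utf8 = data.decode("utf-8-sig")  # 'ABC'
```

Neither call fails, which is the point of the example: the bytes alone do not say which reading was intended.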

-------------------------------------------------------------------
| Conclusion/Suggestion: When you receive a file that you believe |
| is a normal text file, if you see no invalid characters, then   |
| treat the file as a normal text file.  If invalid characters    |
| appear, then prefix the file with EF BB BF without any Carriage |
| Return or Line Feed, and try to read it again.  If the invalid  |
| characters disappear, you have fixed your problem by inserting  |
| a BOM. If you don't know how to insert the hexadecimal          |
| characters EF BB BF, get someone to help you do it.             |
|                                                                 |
| If you send a text file to someone and they say that it appears |
| to have invalid characters in it, prefix the file by EF BB BF   |
| and resend it to them.  If they say that the problem is now     |
| solved, you "have added a BOM" to solve the problem.            |
|                                                                 |
| But do NOT prefix EF BB BF to every text file that you send     |
| (or receive) just because you read this article.                |
|    - David KC Cole                                              |
-------------------------------------------------------------------
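For readers who do not know how to insert the bytes EF BB BF, the following Python sketch shows one possible way to prefix a file with them. The function name add_utf8_bom is hypothetical, and the sketch assumes the file is small enough to read into memory. It does not check that the file really is UTF-8, so the caution in the box above still applies.

```python
import codecs
from pathlib import Path


def add_utf8_bom(path: str) -> bool:
    """Prefix the file with EF BB BF; return True if the file was changed."""
    target = Path(path)
    data = target.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Already has the signature, so leave the file untouched.
        return False
    # Write the three bytes first, with no Carriage Return or Line Feed
    # between them and the original content.
    target.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Calling the function a second time on the same file does nothing, so a BOM is never added twice.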

What is StackOverflow?

It is a website where programmers can pose their questions. Other programmers who understand the problem and/or know the answer can provide helpful comments. The person who posed the question returns after a while to see if the question has been answered. Sometimes questions are answered within an hour, sometimes after weeks or months. There are no guarantees about what you see at StackOverflow. But the programming community is often quick to offer their "two cents worth".

Web Sources

Web Source S100:01:www

Stack Overflow - UTF with BOM?

by various programmers

WebMaster: Ye Old King Cole
