Ameba Ownd


Why UTF-8 is better

2022.01.07 19:17

When is it beneficial to use encodings other than UTF-8, aside from dealing with pre-Unicode documents? And more importantly, why isn't UTF-8 the default in most languages? That is, why do I often need to set it explicitly?

For an external encoding, UTF-8 is almost always the right choice. The one place that counts as an exception to this is in file names, where you must use the platform's conventions if you want any kind of interoperability at all. Fortunately, many platforms now use UTF-8 for this, so the warning is a moot point there.


For an internal encoding, things are more complex. The issue is that a character in UTF-8 is not a constant number of bytes, which makes all sorts of operations rather more complex than you might hope. In particular, indexing into the string by character is a very common operation when doing string processing! Don't think that you can get away with storing that internal encoding to disk or giving it to another program.
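As a minimal Python sketch of that indexing problem (the sample word is arbitrary): character indices and UTF-8 byte offsets diverge as soon as a multi-byte character appears, so locating the nth character means scanning from the start.

```python
# Character indices and UTF-8 byte offsets diverge once a multi-byte
# character appears, so indexing by character means scanning bytes.
s = "naïve"                         # 5 characters
data = s.encode("utf-8")            # 'ï' occupies two bytes
print(len(s), len(data))            # 5 characters, 6 bytes
# The byte offset of character 3 ('v') must be computed by a scan:
print(len(s[:3].encode("utf-8")))   # 4, not 3
```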


And don't forget that there's a lot of legacy data out there, far too much to dismiss. Of particular concern are various East Asian languages, which have complex encodings that are potentially quite a bit shorter than UTF-8, resulting in less pressure to convert; but there are many other issues lurking even in Western systems.


I don't want to know what is happening in major bank databases… The answer is that UTF-8 is by far the best general-purpose data interchange encoding, and is almost mandatory if you are using any of the other protocols that build on it (mail, XML, HTML, etc.).


However, UTF-8 is a multi-byte encoding and relatively new, so there are situations where it is a poor choice. One is environments that do not internally support UTF-8 or any multibyte encoding. Another is legacy data: there are vast mountains of textual data already encoded in some 8-bit format, including various code pages, JIS, etc. The remaining cases involve the use of text files. The longer the string, the bigger the difference in performance will be.
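A short Python sketch of the legacy-data problem (the byte value is an arbitrary illustration): the same byte decodes to different characters under different 8-bit encodings, so the encoding must be known out of band.

```python
# The same byte value decodes to different characters under different
# legacy 8-bit encodings, so the encoding must be known out of band.
raw = b"\xe9"                            # an illustrative single byte
print(raw.decode("cp1252"))              # 'é' under Windows code page 1252
print(raw.decode("iso8859_5"))           # 'щ' under ISO 8859-5 (Cyrillic)
print(raw.decode("utf-8", "replace"))    # not valid UTF-8: U+FFFD
```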


There are also, of course, disadvantages: the ISO 8859 family requires some out-of-band means of specifying the encoding being used, and each part supports only one group of languages at a time. For example, ISO 8859-5 can encode all the characters of the Cyrillic alphabets (Russian, Belorussian, etc.), but nothing outside them. The ISO 8859 encodings are really only useful for European alphabets.
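Assuming the text refers to the ISO 8859 family (ISO 8859-5 is its Cyrillic part), a brief Python sketch shows the one-alphabet-at-a-time limitation:

```python
# ISO 8859-5 round-trips Cyrillic, but a document mixing scripts cannot
# be represented in any single ISO 8859 part.
print("мир".encode("iso8859_5"))          # Cyrillic fits in one byte each
try:
    "мир και ειρήνη".encode("iso8859_5")  # Greek lies outside this part
except UnicodeEncodeError as err:
    print("cannot mix scripts:", err.reason)
print(len("мир και ειρήνη".encode("utf-8")))  # UTF-8 handles both at once
```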


To support most of the alphabets used for Chinese, Japanese, Korean, Arabic, and so on, you need something beyond a single 8-bit code page. If there's any chance you'll ever want to support them, I'd consider it worthwhile to use Unicode just in case. On the other hand, there are uses where plain ASCII suffices: by using ASCII as your encoding you avoid the complexity of multi-byte encoding while retaining at least some human-readability.


Of course, you need to be careful that the data really isn't going to be presented to end users, because if it ends up being visible (as happened in the case of URLs), then users are rightly going to expect that data to be in a language they can read. "ANSI" can mean many things, most of them 8-bit character sets, like code page 1252 under Windows. If you were thinking of 8-bit character sets, one very important advantage would be that all representable characters are exactly 8 bits, whereas in UTF-8 a single character can take up to four bytes.
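This "fixed one byte versus up to four bytes" contrast is easy to see in Python (the sample characters are arbitrary):

```python
# In an 8-bit code page every representable character is exactly one byte;
# in UTF-8 the width varies from one to four bytes per character.
for ch in ("A", "é", "€", "😀"):
    print(ch, len(ch.encode("utf-8")), "byte(s) in UTF-8")
print(len("é".encode("cp1252")), "byte in code page 1252")
```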


You could use ANSI, but then you run into the problems of all the different code pages. This "code page hell" is the reason the Unicode standard was defined. UTF-8 is but a single encoding of that standard; there are many more, with UTF-16 being the most widely used on Windows, as it is that platform's native encoding.


That way it doesn't matter, and you don't have to worry about which code page your users have set up their systems with.

Noncharacters are valid in UTFs and must be properly converted.
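A quick Python check that noncharacters such as U+FDD0 and U+FFFE do round-trip through the UTFs, as required:

```python
# Noncharacters (e.g. U+FDD0, U+FFFE) are valid in Unicode strings and
# must round-trip through the UTFs; Python's codecs honor this.
for cp in (0xFDD0, 0xFFFE):
    ch = chr(cp)
    assert ch.encode("utf-8").decode("utf-8") == ch
    assert ch.encode("utf-16-le").decode("utf-16-le") == ch
print("noncharacters round-trip through UTF-8 and UTF-16")
```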


For more details on the definition and use of noncharacters, as well as their correct representation in each UTF, see the Noncharacters FAQ. Q: Because most supplementary characters are uncommon, does that mean I can ignore them?


A: Most supplementary characters (expressed with surrogate pairs in UTF-16) are not too common. However, that does not mean that supplementary characters should be neglected. Among them are a number of individual characters that are very popular, as well as many sets important to East Asian procurement specifications.
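For instance, in Python one can see that a BMP character occupies one UTF-16 code unit while an emoji, being a supplementary character, occupies a surrogate pair of two units:

```python
# A BMP character is one 16-bit code unit in UTF-16; a supplementary
# character such as an emoji is two (a surrogate pair).
bmp, supplementary = "A", "😀"                      # U+0041, U+1F600
print(len(bmp.encode("utf-16-le")) // 2)            # 1 code unit
print(len(supplementary.encode("utf-16-le")) // 2)  # 2 code units
```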


A: Compared with BMP characters as a whole, the supplementary characters occur less commonly in text. This remains true now, even though many thousands of supplementary characters have been added to the standard, and a few individual characters, such as popular emoji, have become quite common. The relative frequency of BMP characters, and of the ASCII subset within the BMP, can be taken into account when optimizing implementations for best performance: execution speed, memory usage, and data storage.


Such strategies are particularly useful for UTF-16 implementations, where BMP characters require one 16-bit code unit to process or store, whereas supplementary characters require two. Strategies that optimize for the BMP are less useful for UTF-8 implementations, but if the distribution of data warrants it, an optimization for the ASCII subset may make sense, as that subset only requires a single byte for processing and storage in UTF-8.

As for the term "UCS-2", it should now be avoided.


UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters. Sometimes in the past an implementation was labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.


A single UTF-32 code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. For more information, see Section 3 of the Unicode Standard.

A: This depends. However, the downside of UTF-32 is that it forces you to use 32 bits for each character, when only 21 bits are ever needed.


The number of significant bits needed for the average character in common texts is much lower, making the ratio effectively that much worse. In many situations that does not matter, and the convenience of having a fixed number of code units per character can be the deciding factor.
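The storage cost described above is easy to measure in Python: UTF-32 spends four bytes per character even though scalar values need at most 21 bits.

```python
# UTF-32 stores every character in four bytes although scalar values
# fit in 21 bits; for common text most of those bytes are zero.
text = "hello"
print(len(text.encode("utf-32-le")))   # 20 bytes for 5 characters
print(len(text.encode("utf-8")))       # 5 bytes for the same text
print((0x10FFFF).bit_length())         # at most 21 significant bits
```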


These features were enough to swing industry to the side of using Unicode (UTF-16). While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling. With UTF-16 APIs the low-level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units.


This provides efficiency at the low levels, and the required functionality at the high levels. If it's ever necessary to locate the nth character, indexing by character can be implemented as a high-level operation. However, while converting from such a UTF-16 code unit index to a character index (or vice versa) is fairly straightforward, it does involve a scan through the 16-bit units up to the index point.


While there are some interesting optimizations that can be performed, it will always be slower on average. Therefore, locating other boundaries, such as grapheme, word, line, or sentence boundaries, proceeds directly from the code unit index, not indirectly via an intermediate character code index.
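The scan described above can be sketched as a small Python helper (the function name is my own, for illustration): it maps a UTF-16 code-unit index to a character index by walking the string and counting units.

```python
# Hypothetical helper: map a UTF-16 code-unit index to a character
# (code point) index by scanning the units from the start.
def code_unit_index_to_char_index(s: str, unit_index: int) -> int:
    units = 0
    for char_index, ch in enumerate(s):
        if units >= unit_index:
            return char_index
        units += 2 if ord(ch) > 0xFFFF else 1  # surrogate pair = 2 units
    return len(s)

s = "a😀b"                                   # 1 + 2 + 1 code units
print(code_unit_index_to_char_index(s, 3))   # unit 3 is 'b': character 2
```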


A: Almost all international functions (upper-, lower-, and titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme, word, and line breaking, etc.) should operate on strings, not on single code points. Single code-point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both. Trying to collate by handling single code points at a time would get the wrong answer.


The same will happen when drawing or measuring text a single code point at a time: because scripts like Arabic are contextual, the width of x plus the width of y is not equal to the width of xy. The titlecasing operation, in particular, requires strings as input, not single code points at a time. In other words, most API parameters and fields of composite data types should not be defined as a character, but as a string. And if they are strings, it does not matter what the internal representation of the string is.
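German sharp s gives a concrete Python example of why case mapping needs string input and output: uppercasing can lengthen the text.

```python
# Case mapping needs string input and output: uppercasing German 'ß'
# produces the two-character sequence 'SS', lengthening the text.
word = "straße"
print(word.upper())                  # STRASSE
print(len(word), len(word.upper()))  # 6 7
```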


Both UTF-16 and UTF-8 are designed to make working with substrings easy, because the sequence of code units for a given code point is unique.
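A brief Python sketch of this property for UTF-8: searching the encoded bytes for a character's byte sequence can only hit a real occurrence of that character, never the tail of another one.

```python
# The UTF-8 byte sequence of one code point never occurs inside the
# sequence of another, so byte-level substring search is safe.
haystack = "héllo wörld".encode("utf-8")
needle = "ö".encode("utf-8")
pos = haystack.find(needle)
print(pos, haystack[pos:pos + len(needle)] == needle)
```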


Q: Are there exceptions to the rule of exclusively using string parameters in APIs?

A: The main exception is very low-level operations, such as getting character properties.

Q: How do I convert a UTF-16 surrogate pair to UTF-8: as one 4-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence.

A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error.
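The surrogate-pair arithmetic behind this can be sketched in Python; the pair below encodes U+1F600, and the formula is the standard UTF-16 decomposition.

```python
# The high and low surrogates below form U+1F600; the arithmetic is the
# standard UTF-16 decomposition, yielding a single scalar value (one
# 4-byte sequence in UTF-8, one 4-byte unit in UTF-32).
hi, lo = 0xD83D, 0xDE00
scalar = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(scalar))                       # 0x1f600
try:
    b"\x3d\xd8".decode("utf-16-le")      # a lone high surrogate
except UnicodeDecodeError:
    print("unpaired surrogate is an error")
```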


All these encodings are used to encode text, transfer data, and so on. I have summarized the characteristics of the UTF-8, UTF-16, and UTF-32 encodings for you from Wikipedia. I think humanity will definitely remain with the UTF-8 encoding, as it was, is, and will be the global standard for the whole world.


In conclusion, I must say that everything in the world needs its own particular approach.