
lancelot.com


Programming

Notes on character sets: DBCSs, UTF-8, UTF-32

lancelot50 2008. 6. 29. 19:32

1. Windows via C/C++
 p.12
 In a double-byte character set, each character in a string consists of either 1 or 2 bytes.  With Kanji, for example, if the first character is between 0x81 and 0x9F or between 0xE0 and 0xFC, you must look at the next byte to determine the full character in the string.

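In a DBCS, a string is made up of characters that are each either 1 or 2 bytes long.  With Kanji, for example, if the first byte is between 0x81 and 0x9F or between 0xE0 and 0xFC, you have to look at the next byte to know which character it is (in other words, it is a 2-byte character).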
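
To make the lead-byte rule concrete, here is a minimal Python sketch. It assumes the two ranges quoted above are the only lead-byte ranges (that matches Shift-JIS, which is what the Kanji example appears to describe), and the names is_lead_byte and count_dbcs_chars are made up for this note; they are not Windows APIs.

```python
def is_lead_byte(b: int) -> bool:
    # Assumption: only the two ranges quoted from the book mark a 2-byte character.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def count_dbcs_chars(data: bytes) -> int:
    """Count characters in a DBCS byte string by walking it from the start."""
    count = 0
    i = 0
    while i < len(data):
        # A lead byte means this character occupies two bytes; otherwise one.
        i += 2 if is_lead_byte(data[i]) else 1
        count += 1
    return count


sample = "日本語ABC".encode("shift_jis")
print(len(sample), count_dbcs_chars(sample))  # 9 6
```

The walk has to start at the beginning of the string: a byte in the middle could be either a 1-byte character or the second half of a 2-byte one, and you cannot tell which without knowing what came before it.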


UTF-8
UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some characters as 3 bytes, and some characters as 4 bytes.  Characters with a value below 0x0080 are compressed to 1 byte, which works very well for characters used in the United States.  Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages.  Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages.  Finally, surrogate pairs are written out as 4 bytes.  UTF-8 is an extremely popular encoding format, but it's less efficient than UTF-16 if you encode many characters with values of 0x0800 or above.

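So in UTF-8, a character can take 1, 2, 3, or 4 bytes depending on its value.
Anything below 0x0080 is written as 1 byte. These are the characters used in the US.
0x0080 through 0x07FF is written as 2 bytes. That covers European and Middle Eastern languages.
0x0800 and above is written as 3 bytes. This is the range used for East Asian languages.
Surrogate pairs are written as 4 bytes, though I'm still not quite sure what those are.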
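
A small Python sketch to check the ranges above. The helper names utf8_length and describe are made up for this note. One assumption goes slightly beyond the quoted text: the 3-byte range is taken to stop at 0xFFFF, and anything above that gets 4 bytes. Those larger code points are exactly the ones UTF-16 has to store as two 16-bit units, which is the surrogate pair the book mentions.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point, per the ranges above."""
    if code_point < 0x0080:
        return 1
    if code_point < 0x0800:
        return 2
    if code_point < 0x10000:  # assumption: the 3-byte range ends at 0xFFFF
        return 3
    return 4  # code points that UTF-16 stores as a surrogate pair


def describe(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


for ch in ["A", "é", "한", "😀"]:
    # The range-based rule should agree with Python's own encoder.
    assert utf8_length(ord(ch)) == describe(ch)[0]
    print(hex(ord(ch)), describe(ch))
```

The last character, U+1F600, comes out as 4 bytes in UTF-8 and 2 code units in UTF-16, while the other three each fit in a single UTF-16 code unit.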


UTF-32
UTF-32 encodes every character as 4 bytes.
Not much more to say about this one.
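
Still, a quick Python check does no harm. The sample string is just an arbitrary pick, and the "utf-32-le" codec is used so that Python does not prepend a byte order mark, which would add 4 extra bytes to the count.

```python
text = "A한😀"  # one ASCII letter, one Hangul syllable, one emoji
encoded = text.encode("utf-32-le")  # little-endian, no byte order mark

# Every character takes exactly 4 bytes, whatever its value.
assert len(encoded) == 4 * len(text)
print(len(text), len(encoded))  # 3 12
```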