ICU: Character Encoding Conversion [C++] | Netsphere Laboratories

(2019.10)

Notes on character encoding conversion with ICU. When text in one encoding is converted to another, ICU always goes through Unicode as the intermediate step.

Converting to Unicode

In ICU, a Unicode code point is represented by the UChar32 type and a UTF-16 code unit by the UChar type. Strings are objects of the UnicodeString class (internally always UTF-16).

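When the data is in a non-Unicode encoding such as Shift JIS or Japanese EUC, it has to be converted in some way when it is read.

On UNIX, iconv() is available and you may use it. Be aware, though, that mixing several different conversion tables can cause mojibake in unexpected places. If you do not use iconv(), use ICU's own converter. The UConverter type is ICU's encoding converter.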

▲ iconv

The following source code converts Japanese EUC to Unicode and then builds a UnicodeString object.

First, include the necessary headers.

#undef USE_ICONV

#include <stdio.h>
#include <string.h>
#include <assert.h>
#ifdef USE_ICONV
  #include <iconv.h>
#else
  #include <unicode/ucnv.h>
#endif
#include <unicode/unistr.h>
#include <unicode/uchar.h>
#include <unicode/schriter.h>

// Recent ICU releases no longer import the icu namespace by default.
using icu::UnicodeString;

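To convert with iconv, specify the encodings with iconv_open() and convert with iconv(). The target encoding should be UTF-8 or UTF-16. The following example assumes that the converted output fits in 1,000 bytes (a shortcut), so it needs to be fixed before real use.

The converted byte sequence and the encoding name are then passed to UnicodeString to create the string object.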

#ifdef USE_ICONV
static const char* UTF8_CES = "UTF-8";
static const char* EUCJP_CES = "eucJP-open";
#endif

int main()
{
    // The source file must be saved as EUC-JP for this literal to hold EUC-JP bytes.
    const char* euc = "ABC日本語のテキスト\\/￥／＼";
#ifdef USE_ICONV
    // Encoding conversion: the iconv variant
    iconv_t cd = iconv_open(UTF8_CES, EUCJP_CES);
    assert(cd != (iconv_t)-1);
    size_t left = strlen(euc);
    char buf[1000];
    char* p = buf;
    size_t bufleft = sizeof(buf) - 1;  // keep one byte for the terminating '\0'
    printf("euc = %p, p = %p\n", (const void*)euc, (void*)p);
    // iconv() advances euc and p, and decreases left and bufleft.
    size_t r = iconv(cd, const_cast<char**>(&euc), &left, &p, &bufleft);
    printf("euc = %p, p = %p\n", (const void*)euc, (void*)p);
    printf("r = %zu, left = %zu, bufleft = %zu\n", r, left, bufleft);
    *p = '\0';
    iconv_close(cd);
    UnicodeString str(buf, UTF8_CES);
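The fixed 1,000-byte buffer is the part that needs fixing for real use. One way to lift that limit is to call iconv() in a loop and collect the output chunk by chunk whenever it reports E2BIG. This is only a sketch under that assumption; `iconv_convert` is a hypothetical helper name, and which encoding names are accepted depends on the iconv implementation installed.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input` from `from_code` to `to_code`
// without assuming an upper bound on the output size.
std::string iconv_convert(const std::string& input,
                          const char* to_code, const char* from_code)
{
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char chunk[256];
    char* in_p = const_cast<char*>(input.data());
    size_t in_left = input.size();
    bool flushing = false;
    for (;;) {
        char* out_p = chunk;
        size_t out_left = sizeof(chunk);
        // A null input pointer asks iconv() to emit any final shift sequence.
        size_t r = flushing
            ? iconv(cd, nullptr, nullptr, &out_p, &out_left)
            : iconv(cd, &in_p, &in_left, &out_p, &out_left);
        int saved_errno = errno;
        out.append(chunk, sizeof(chunk) - out_left);
        if (r == (size_t)-1) {
            if (saved_errno == E2BIG)
                continue;               // chunk was full: go round again
            iconv_close(cd);
            throw std::runtime_error("iconv failed");  // EILSEQ or EINVAL
        }
        if (flushing)
            break;
        flushing = true;                // all input consumed
    }
    iconv_close(cd);
    return out;
}
```

With the names used in the listing above, a call would look like `iconv_convert(euc_bytes, "UTF-8", "eucJP-open")`, and the result can be handed to UnicodeString in the same way as `buf`.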

The ICU converter

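Now let's use ICU's converter instead. Create a UConverter object with ucnv_open(), then simply pass it together with the source byte sequence to UnicodeString.

As for the encoding name to pass to ucnv_open(): for Japanese EUC, mappings/convrtrs.txt lists the following candidates. ibm-954 loses all of the NEC special characters. In ICU, EUC-JP is the one with a reasonable mapping table.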

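Name passed to ucnv_open()	Comment
`ibm-33722_VPUA`	ibm-5050. Even some JIS X 0208 symbols are missing. Unusable.
`ibm-33722_P120-1999`	'\' => yen sign. No good.
`ibm-954`	Does not include the NEC special characters and some other symbols. Unusable.
`EUC-JP`	The most reasonable choice.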

Aliases and other details can be looked up on the following page:

ICU Demonstration - Converter Explorer

#else
    // Encoding conversion: the ICU variant
    UErrorCode error = U_ZERO_ERROR;
    UConverter* cnv = ucnv_open("EUC-JP", &error);
    assert(U_SUCCESS(error));
    UnicodeString str(euc, (int32_t) strlen(euc), cnv, error);
    assert(U_SUCCESS(error));
    ucnv_close(cnv);
#endif

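Converting between encodings

This section converts a byte sequence in one encoding into another encoding. The same approach works not only for legacy encodings but also for UTF-8 and the like.

In one shot

When the byte sequence fits in memory, ucnv_convert() is the simplest option.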

#include <unicode/ucnv.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

int main()
{
    // Binary mode: the bytes must reach the converter unmodified.
    FILE* fp = fopen("sjis.txt", "rb");
    if (!fp) {
        fprintf(stderr, "fopen failed.\n");
        return 1;
    }
    char buf[10000];
    size_t n = fread(buf, 1, sizeof(buf), fp);
    fclose(fp);

    char buf2[10000];
    UErrorCode err = U_ZERO_ERROR;
    // Convenience function.
    // "ibm-954" loses all NEC special characters. Unusable.
    int32_t r = ucnv_convert(/*"ibm-954"*/ "EUC-JP", // toConverterName
                             "MS932",                // fromConverterName
                             buf2,                   // target
                             sizeof(buf2),           // targetCapacity
                             buf,                    // source
                             (int32_t) n,            // sourceLength
                             &err);                  // pErrorCode
    // On U_BUFFER_OVERFLOW_ERROR, r is the required size, not the bytes written.
    if (U_FAILURE(err)) {
        fprintf(stderr, "ucnv_convert failed: %s\n", u_errorName(err));
        return 1;
    }

    fp = fopen("euc.txt", "wb");
    if (!fp) {
        fprintf(stderr, "fopen failed.\n");
        return 1;
    }
    time_t now = time(nullptr);
    fprintf(fp, "%s", ctime(&now)); // ctime() appends a newline.
    fwrite(buf2, 1, r, fp);
    fclose(fp);
    return 0;
}

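Converting a stream

When the data does not fit in memory, or when you want to convert while reading from a stream, things get somewhat more involved.

The input byte sequence may be cut off in the middle of a character, and for encodings that switch character sets with escape sequences the converter carries state.

First, the calling side.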

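// Defined separately; see the conversion routine.
void convert_and_copy(FILE* input, FILE* output);

int main()
{
    // Binary mode: the bytes must reach the converter unmodified.
    FILE* input = fopen("sjis.txt", "rb");
    if (!input) {
        fprintf(stderr, "fopen failed.\n");
        return 1;
    }
    FILE* output = fopen("euc.txt", "wb");
    if (!output) {
        fprintf(stderr, "fopen failed.\n");
        fclose(input);
        return 1;
    }
    convert_and_copy(input, output);
    fclose(input);
    fclose(output);
    return 0;
}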

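Now the main part: the conversion itself.

ucnv_convertEx() is another candidate for this kind of task, but it is so complicated to use that it is not a good choice.

Combining ucnv_toUnicode() and ucnv_fromUnicode() works better. Prepare a pivot buffer as the intermediate representation and convert to Unicode first.

Create the converters with ucnv_open() and reuse them for as long as the loop runs. Do not forget ucnv_close().

It cuts some corners, but something like this works fine.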

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unicode/ucnv.h>

void convert_and_copy(FILE* input, FILE* output)
{
    assert(input);
    assert(output);
    char source_buf[9]; // Extreme example. Must be longer than the longest multibyte character.
    constexpr int ELEM_SIZE = 30; // Must be large enough relative to source_buf.
    UChar pivot[ELEM_SIZE]; // Intermediate representation. UTF-16
    char target_buf[100];
    UErrorCode err = U_ZERO_ERROR;
    UConverter* sourceCnv = ucnv_open("MS932", &err);
    assert(U_SUCCESS(err) && sourceCnv);
    UConverter* targetCnv = ucnv_open(/*"ibm-954"*/ "EUC-JP", &err);
    assert(U_SUCCESS(err) && targetCnv);
    size_t remaining = 0;
    while (true) {
        // Not necessarily the start of the buffer.
        size_t r = fread(source_buf + remaining, 1,
                         sizeof(source_buf) - remaining, input);
        bool flush = feof(input) || ferror(input);
        // Even if nothing was read, still run the final flush pass so that
        // any state held inside the converters is written out.

        // Convert to the intermediate representation.
        UChar* pivot_p = pivot;
        const char* source_p = source_buf;
        ucnv_toUnicode(sourceCnv, &pivot_p, pivot + ELEM_SIZE,
                       &source_p,                  // source
                       source_buf + remaining + r, // sourceLimit
                       nullptr, flush, &err);
        assert(U_SUCCESS(err));
        // Debug output. print_uchars() is a helper that is not shown here.
        print_uchars(pivot, pivot_p);
        // Convert to the target encoding.
        char* target_p = target_buf;
        const UChar* pivot_out = pivot;
        ucnv_fromUnicode(targetCnv, &target_p, target_buf + sizeof(target_buf),
                         &pivot_out, pivot_p, nullptr, flush, &err);
        assert(U_SUCCESS(err));
        assert(pivot_out == pivot_p);
        // Write out.
        fwrite(target_buf, 1, target_p - target_buf, output);
        if (flush)
            break;
        // Move any unconsumed bytes to the front of the source buffer.
        remaining = (remaining + r) - (size_t)(source_p - source_buf);
        memmove(source_buf, source_p, remaining);
    }
    ucnv_close(sourceCnv);
    ucnv_close(targetCnv);
}

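(2024.7) Some of the characters in JIS X 0213:2004 map to 2 code points in Unicode. See the JIS X 0213 code table (1), plane 1, rows 1-23; this affects characters in JIS plane 1 only.

In EUC-JP:

Plane 1, 2 bytes → some are 1 code point but 2 code units, and some are 2 code points (1 code unit each).
Plane 2, 3 bytes → always 1 code point, some of which take 2 code units.

So the number of elements (code units) given by ELEM_SIZE may safely be the same as the number of bytes in source_buf.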
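The bound can be checked mechanically from the cases listed above. The following sketch computes the largest number of UTF-16 code units that any sequence of n bytes could decode to. It assumes that a 1-byte character yields 1 code unit (not stated above) and that 2-byte and 3-byte characters yield at most 2; `max_units_for_char` and `max_code_units` are hypothetical names. It only models characters that lie entirely within one chunk, so if the converter carries bytes of a split character over from the previous call, a small extra margin in the pivot buffer may be prudent.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Worst-case UTF-16 code units for one encoded character of `bytes` bytes.
// 1 byte -> 1 unit (assumption); 2 or 3 bytes -> at most 2 units.
std::size_t max_units_for_char(std::size_t bytes)
{
    return bytes == 1 ? 1 : 2;
}

// Largest number of code units that any split of `n_bytes` bytes into
// 1-, 2- and 3-byte characters can produce.
std::size_t max_code_units(std::size_t n_bytes)
{
    std::vector<std::size_t> best(n_bytes + 1, 0);
    for (std::size_t n = 1; n <= n_bytes; ++n)
        for (std::size_t k = 1; k <= 3 && k <= n; ++k)
            best[n] = std::max(best[n], best[n - k] + max_units_for_char(k));
    return best[n_bytes];
}
```

Under these assumptions `max_code_units(n)` equals n for every n, which matches the statement that ELEM_SIZE can be the same as the size of source_buf.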