@author YAMATODANI Kiyoshi
@version $Id: Overview.html,v 1.3 2007/04/19 04:10:32 kiyoshiy Exp $
'LMLML' is the Library of MultiLingualization for ML, which aims to support writing multilingualized programs in ML. The current version supports multi-byte string processing only.
The string manipulation modules of existing ML compilers and of the ML Basis Library in effect assume a codec that encodes each character in a single byte. They do not anticipate codecs that encode a character in multiple bytes. It is therefore hard for ML programmers to write applications that must handle text in various encodings. The SML# project developed LMLML to support the development of such multi-byte string applications.
With LMLML, you can select the codec dynamically for each string, and you can manipulate strings encoded in different codecs as instances of the same type. You can therefore isolate the code that depends on specific codecs from the code that is codec-independent.
You can select the encoding method dynamically with the MultiByteString structure. The decode functions of MultiByteString take a string that specifies the encoding method to use.
signature MULTI_BYTE_STRING =
sig
  structure Char :
  sig
    type char
    val decodeBytesSlice : String.string -> Word8VectorSlice.slice -> char option
    val decodeBytes : String.string -> Word8Vector.vector -> char option
    val decodeString : String.string -> String.string -> char option
    :
  end
  structure String :
  sig
    type string
    val decodeBytesSlice : String.string -> Word8VectorSlice.slice -> string
    val decodeBytes : String.string -> Word8Vector.vector -> string
    val decodeString : String.string -> String.string -> string
    :
  end
end
structure MultiByteString : MULTI_BYTE_STRING
For example, you can decode byteVector : Word8Vector.vector, encoded in UTF-16, as follows.
# val s1 = MultiByteString.String.decodeBytes "UTF-16" byteVector;
val s1 = - : MultiByteString.String.string
The available encoding methods are listed by MultiByteString.getCodecNames.
# MultiByteString.getCodecNames();
val it =
  [
    "UTF-16", "UTF-16LE", "UTF-16BE", "UTF-8",
    "SHIFT_JIS", "MS_KANJI", "CSSHIFTJIS",
    "ISO-2022-JP", "CSISO2022JP",
    "GB2312", "CSGB2312",
    "GBK", "CP936", "MS936", "WINDOWS-936",
    "EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE", "CSEUCPKDFMTJAPANESE", "EUC-JP",
    "ANSI_X3.4-1968", "ISO-IR-6", "ANSI_X3.4-1986", "ISO_646.IRV:1991",
    "ASCII", "ISO646-US", "US-ASCII", "US", "IBM367", "CP367", "CSASCII"
  ] : string list
As described below, new encoding methods can be added.
If the encoding method to use is statically fixed, it is more efficient to use a structure that implements the encoding method.
# val s2 = UTF16Codec.String.fromBytes byteVector;
val s2 = - : UTF16Codec.String.string
However, strings obtained in these two ways are not compatible with each other: their types differ.
# MultiByteString.String.size s2;
stdIn:5.1-5.30 Error: operator and operand don't agree
  operator domain: MultiByteString.String.string
  operand:         UTF16Codec.String.string
MultiByteString and encoding-specific modules, such as UTF16Codec, provide interfaces almost compatible with Char and String of the Basis Library.
Existing ML programs can therefore be upgraded to support multi-byte string codecs with only minor changes.
signature MB_CHAR =
sig
  type char
  val compare : char * char -> order
  val isAscii : char -> bool
  :
end
signature MB_STRING =
sig
  type string
  type char
  val sub : string * int -> char
  val explode : string -> char list
  :
end
signature MULTI_BYTE_STRING =
sig
  structure Char : MB_CHAR
  structure String : MB_STRING
  sharing type Char.string = String.string
  sharing type Char.char = String.char
end
structure MultiByteString : MULTI_BYTE_STRING
signature CODEC =
sig
  structure Char : MB_CHAR
  structure String : MB_STRING
  sharing type Char.string = String.string
  sharing type Char.char = String.char
end
structure UTF16Codec : CODEC
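Because every codec module exposes the same shape of interface, codec-independent code can be written once as a functor. The following is only a sketch under stated assumptions: MINI_MB is a hypothetical cut-down signature that contains just explode and isAscii from the excerpts above, and ByteStringMB is a stand-in built on the Basis String and Char structures so that the example runs without LMLML. Neither name is part of LMLML.

```sml
(* Hypothetical cut-down signature (not part of LMLML): only the members
   that the codec-independent code below actually needs. *)
signature MINI_MB =
sig
  type string
  type char
  val explode : string -> char list
  val isAscii : char -> bool
end

(* Codec-independent code: counts the non-ASCII characters of a string. *)
functor CountNonAscii (M : MINI_MB) =
struct
  fun count s = List.length (List.filter (not o M.isAscii) (M.explode s))
end

(* Stand-in instance built on the Basis Library, one byte per character.
   With LMLML, a structure assembled from a codec's String and Char
   substructures would presumably take its place. *)
structure ByteStringMB : MINI_MB =
struct
  type string = String.string
  type char = Char.char
  val explode = String.explode
  val isAscii = Char.isAscii
end

structure ByteCount = CountNonAscii (ByteStringMB)

(* "\200" is the single character with code 200, which is not ASCII. *)
val nonAscii = ByteCount.count "abc\200"
```

Only the structure passed to the functor depends on a codec; the body of CountNonAscii does not change when another instance is supplied.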
LMLML also provides functors that generate multi-byte string versions of Substring, StringCvt and ParserComb.
functor SubstringBase
functor StringConverterBase
functor ParserCombinatorBase
Note: In the current version, these functors are not loaded in the prelude. To use them, load "LMLML/extension.sml" as follows.
# use "LMLML/extension.sml";
For example, you can obtain a multi-byte string version of Substring for the UTF-16 encoding as follows.
local
  structure MBS = UTF16Codec.String
  structure MBC = UTF16Codec.Char
  structure P =
  struct
    type char = MBS.char
    type string = MBS.string
    val sub = MBS.sub
    val substring = MBS.substring
    val size = MBS.size
    val concat = MBS.concat
    val compare = MBS.compare
    val compareChar = MBC.compare
  end
in
  structure UTF16Substring : MB_SUBSTRING = SubstringBase(P)
end
LMLML already supports major codecs such as Shift_JIS and UTF-16. You can extend LMLML by adding a new module that supports the encoding method you need, without changing LMLML itself.
First, define a structure that implements the PRIM_CODEC signature.
For example, to support an encoding "foo", define a structure as follows.
structure FooCodecPrim : PRIM_CODEC =
struct
  val names = ["foo"]
  :
end
Then, apply the Codec functor to it.
structure FooCodec = Codec(FooCodecPrim);
This code registers the foo codec with MultiByteString.
You can then decode with the foo codec as follows.
val mbs1 = MultiByteString.String.decodeBytes "foo" byteVector;
Of course, you can also use FooCodec directly.
val mbs2 = FooCodec.String.fromBytes byteVector;
The major features of LMLML are implemented without depending on SML#-specific features. LMLML can be used with any compiler that conforms to the Definition of Standard ML, including of course SML#, but probably also SML/NJ, MLton, and many others.
LMLML is installed with the SML# system, and its core modules are loaded in the prelude.
In the current version of SML#, the Codec functor is not loaded in the prelude for an implementation reason. To use the Codec functor to extend LMLML with a new codec, you have to load "LMLML/extension.sml" as follows.
# use "LMLML/extension.sml";
Use sources.cm with SML/NJ CM.
Use sources.mlb with MLton Basis system.
As an example of programming with LMLML, we search for the character '剣' in the string "白血病abc剣道".
In Shift_JIS, "白血病abc剣道" is encoded into the following byte vector:
0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* 白血病 *)
0wx61, 0wx62, 0wx63, (* abc *)
0wx8C, 0wx95, 0wx93, 0wxB9 (* 剣道 *)
The second byte of '血' followed by the first byte of '病' is (0wx8C, 0wx95), which equals the encoding of '剣'. Therefore, if we search for '剣' (0wx8C, 0wx95) in this byte vector byte by byte, we incorrectly find the byte sequence that spans the second and third characters of "白血病".
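The false match can be reproduced with the Basis Library alone. The sketch below is not LMLML code: naiveFind and isPrefix are hypothetical helper names introduced for illustration, and the search deliberately ignores character boundaries.

```sml
(* Shift_JIS bytes of "白血病abc剣道", as listed above. *)
val bytes : Word8.word list =
  [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,   (* 白血病 *)
   0wx61, 0wx62, 0wx63,                        (* abc *)
   0wx8C, 0wx95, 0wx93, 0wxB9]                 (* 剣道 *)

(* Shift_JIS bytes of "剣". *)
val ken : Word8.word list = [0wx8C, 0wx95]

(* isPrefix (p, xs) is true when the list p is a prefix of the list xs. *)
fun isPrefix ([], _) = true
  | isPrefix (_, []) = false
  | isPrefix (p :: ps, x :: xs) = p = x andalso isPrefix (ps, xs)

(* naiveFind returns the byte offset of the first occurrence of pattern
   in text, comparing raw bytes without regard to character boundaries. *)
fun naiveFind (pattern, text) =
  let
    fun loop (_, []) = NONE
      | loop (i, rest as _ :: tl) =
          if isPrefix (pattern, rest) then SOME i else loop (i + 1, tl)
  in
    loop (0, text)
  end

(* Evaluates to SOME 3: the match starts at the second byte of 血,
   not at byte offset 9 where 剣 really begins. *)
val naiveHit = naiveFind (ken, bytes)
```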
With LMLML, we can obtain the correct result.
First, we decode a Shift_JIS string from the byte vector.
(* "白血病abc剣道" *)
val bytes =
  Word8Vector.fromList
    [
      0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* 白血病 *)
      0wx61, 0wx62, 0wx63, (* abc *)
      0wx8C, 0wx95, 0wx93, 0wxB9 (* 剣道 *)
    ];
val string = MultiByteString.String.decodeBytes "Shift_JIS" bytes;
(* "剣" *)
val KenBytes = Word8Vector.fromList [0wx8C, 0wx95]; (* 剣 *)
val KenString = MultiByteString.String.decodeBytes "Shift_JIS" KenBytes;
Then, search for "剣" in "白血病abc剣道".
val (leftSS, rightSS) = MBSubstring.position KenString (MBSubstring.full string);
The obtained leftSS consists of the first 6 characters, "白血病abc", which indicates that the 7th character "剣" is found correctly.
# MBSubstring.size leftSS;
val it = 6 : int
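To see why working on decoded characters gives index 6, the same search can be sketched by hand with the Basis Library. This is not how LMLML is implemented; it is a simplified illustration that assumes any byte in the ranges 0x81-0x9F and 0xE0-0xFC starts a two-byte Shift_JIS character and that every other byte is a single-byte character. The names isLead, toChars and findChar are hypothetical.

```sml
(* Shift_JIS bytes of "白血病abc剣道" and of "剣". *)
val bytes : Word8.word list =
  [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,
   0wx61, 0wx62, 0wx63,
   0wx8C, 0wx95, 0wx93, 0wxB9]
val ken : Word8.word list = [0wx8C, 0wx95]

(* Assumption: these ranges mark the first byte of a two-byte character. *)
fun isLead (b : Word8.word) =
  (0wx81 <= b andalso b <= 0wx9F) orelse (0wxE0 <= b andalso b <= 0wxFC)

(* Split a byte list into characters, each a list of one or two bytes.
   A trailing lead byte with no partner is kept as a one-byte character. *)
fun toChars [] = []
  | toChars (b1 :: b2 :: rest) =
      if isLead b1 then [b1, b2] :: toChars rest
      else [b1] :: toChars (b2 :: rest)
  | toChars [b] = [[b]]

(* isPrefix (p, xs) is true when the list p is a prefix of the list xs. *)
fun isPrefix ([], _) = true
  | isPrefix (_, []) = false
  | isPrefix (p :: ps, x :: xs) = p = x andalso isPrefix (ps, xs)

(* findChar returns the character index of the first occurrence of
   pattern in text, where both are lists of characters. *)
fun findChar (pattern, text) =
  let
    fun loop (_, []) = NONE
      | loop (i, rest as _ :: tl) =
          if isPrefix (pattern, rest) then SOME i else loop (i + 1, tl)
  in
    loop (0, text)
  end

(* Evaluates to SOME 6: six characters (白血病abc) precede 剣. *)
val charHit = findChar (toChars ken, toChars bytes)
```

Because the comparison is made character by character, the pair (0wx8C, 0wx95) that straddles 血 and 病 is never considered as a candidate.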
Because the codec is fixed to Shift_JIS in this example, you can also write it with the Shift_JIS-specific module as follows.
val ShiftJISString = ShiftJISCodec.String.fromBytes bytes;
val ShiftJISKenString = ShiftJISCodec.String.fromBytes KenBytes;
structure ShiftJISSubstring =
  SubstringBase
    (struct
       open ShiftJISCodec.String
       val compareChar = ShiftJISCodec.Char.compare
     end);
val (leftSS, rightSS) =
  ShiftJISSubstring.position
    ShiftJISKenString (ShiftJISSubstring.full ShiftJISString);
val len = ShiftJISSubstring.size leftSS;