FarragoCharacterSets

From Eigenpedia

Jump to: navigation, search

This page tracks the status of character set support in Farrago and related projects such as Fennel and LucidDB.

Contents

Background

Through release 0.8.0, the only character set supported has been ISO-8859-1, an 8-bit (single-byte) character set which covers most Western European alphabets.

In this thread, Julian sketched out what it will take to add Unicode support, and JVS filled in some of the details in the followup.

There are currently no plans to add support for additional character sets beyond ISO-8859-1 and Unicode.

Unicode Representation

The encoding to be used for Unicode character data is UCS-2, which is a subset of UTF-16. UCS-2 is strictly fixed-width doublebyte, whereas in UTF16, the full range of 4-byte characters can be expressed via a variable-width encoding scheme. This is a reasonable simplification because Java characters are 16-bit only.

--Zfong 13:59, 21 January 2009 (EST) So, this means that Farrago will not support supplementary characters, which UCS-2 cannot represent. Correct?
--Jvs 14:11, 21 January 2009 (EST): Correct. Basically, this means that they will pass through uninterpreted, meaning for example that the UPPER function may not do the right thing in their presence.

As with integer datatypes, the encoding is tied to the endianness of the platform, so on Intel chips, UCS-2LE (little endian) format is used. Hence, no byte order mark is needed for tuple data.

Character Set Names

In Part 2, Section 4.2, SQL:2003 defines a number of SQL-standard character set names. These are used as part of datatype specifications, e.g. VARCHAR(10) CHARACTER SET "LATIN1". It also allows for system-specific character set names.

Java's java.nio.charset.Charset class defines a number of Java-standard character set names.

Farrago supports a subset of the union of both, internally converting the SQL-level character set names into corresponding Java character set names (method SqlUtil.translateCharacterSetName):

SQL Character Set Java Character Set Notes
LATIN1 ISO-8859-1 single-byte
UTF16 UTF-16LE or UTF-16BE double-byte (depends on platform endianness)

So in a CHARACTER SET clause, the user can specify any of { LATIN1, ISO-8859-1, UTF16, UTF-16xE } where "x" matches the system on which Farrago is running.

Comments:

  • In SQL:2003, ISO8BIT is defined as ISO-8859-1 plus non-printing control characters, which are currently NOT supported by Farrago. If this kind of data needs to be stored, use BINARY or VARBINARY instead.
  • Java's US-ASCII is a subset of ISO-8859-1, so it doesn't need to be supported explicitly
  • We should probably define mappings for the SQL:2003 standard SQL_CHARACTER_SET, SQL_TEXT, and SQL-IDENTIFIER character sets
  • Should we prevent usage of the underlying Java character set names, and require usage of the SQL names instead? We should probably at least be sure to preserve the SQL-level names in all stored metadata.

Default Character Set

Farrago supports a system-wide default character set which is applied to character data types for which no character set is specified explicitly. For example, if the default is changed to Unicode, then an expression like CAST(c AS VARCHAR(20)) will imply Unicode for the target character set of the cast.

This default is stored in the catalog as the defaultCharacterSetName attribute of the instance of CwmCatalog corresponding to LOCALDB. A vanilla build of Farrago has this default set to ISO-8859-1.

The setting can be changed to Unicode via

call sys_boot.mgmt.change_default_character_set_to_unicode();

An additional effect of the call is that the character set attribute of all existing columns, routine parameters, etc (e.g. for system-defined views and procedures) will also be changed to Unicode. Consequently, this system procedure can only be called before the first local table has been defined, otherwise it will fail, since updating the encoding for all existing data would be very expensive.

The intention is that this procedure can be called automatically by an installer in response to the character set choice of the system administrator carrying out the installation.

--Jvs 22:52, 20 January 2009 (EST): According to Stephan, Enki should already support Unicode if the underlying DBMS is configured correctly. This will be tested out as part of release.

Unicode Literals

As of eigenchange 12329, Farrago supports Unicode literals as follows:

  • _UTF16'ανθρωπος'
  • U&'ανθρωπος'
  • U&'\03B1\03BD\03B8\03C1\03C9\03C0\03BF\03C2'
  • U&'!03B1!03BD!03B8!03C1!03C9!03C0!03BF!03C2' UESCAPE '!'

The four above are all equivalent.

Note that escaping is only processed with the U& introducer; _UTF16'\03B1\03BD\03B8\03C1\03C9\03C0\03BF\03C2' won't be processed (the resulting literal will still have backslashes and hex digits instead of the desired Unicode characters). The default escape caharacter is backslash, but this can be overridden with the UESCAPE syntax; per SQL:2003 a single character must be specified (e.g. a string like '\u' cannot be specified in an attempt to match Java conventions), and it cannot be any hex digit, whitespace, plus sign, or double-quote.

Also note that if you attempt to use Unicode characters in a literal string without the introducer, e.g. just plain 'ανθρωπος', an encoding failure exception will be thrown unless Farrago's default encoding has already been changed to Unicode.

SQL:2003 provides for specifying both the U& introducer AND an explicit character set name, but Farrago does not yet support this; for now, U& always implies UTF16.

For tests dealing with Unicode data, see this thread regarding rules for .sql/.ref file encodings.

Unicode Identifiers

As of eigenchange 12366, Farrago supports Unicode identifiers as follows:

  • "ανθρωπος"
  • U&"ανθρωπος"
  • U&"\03B1\03BD\03B8\03C1\03C9\03C0\03BF\03C2"
  • U&"!03B1!03BD!03B8!03C1!03C9!03C0!03BF!03C2" UESCAPE '!'

The four above are all equivalent. For the last one, note that the UESCAPE uses single-quotes even though the identifier body uses double-quotes.

Fennel Unicode Support

Fennel tuple support for Unicode data has been added as mentioned in this thread.

Remaining changes known to be required include:

  • add Unicode support in the Fennel calc
    • --Jvs 01:22, 7 February 2009 (EST): for now, since this is not available, we just tell the optimizer to avoid usage of the Fennel calc for any expressions involving Unicode data (as of eigenchange 12337)
  • deal with executor special cases like in hash join/agg where we have to trim VARCHAR before key comparison.
    • --Jvs 01:19, 7 February 2009 (EST): implemented in eigenchange 12337
  • network byte order support (only needed by extension projects which transmit tuples over the wire between hosts with different endianness)

Farrago Unicode Support

Below are a number of changes and investigations (not including testing) that need to be undertaken beyond the ones already mentioned in the sections above:

  • study wikipedia:Unicode collation algorithm; comparison operators are currently just comparing 16-bit characters as integers, which is probably not good enough
    • --Jvs 15:07, 9 February 2009 (EST): I'm going to defer any work on collation support for now; binary comparison is no worse than what we already have for ISO-8859-1 (where it's not yet possible to do case-insensitive sort). So Java calc and Fennel calc will both (consistently) do binary comparison for now (plus the usual PAD SPACE rules for implicit RTRIM-before-comparison).
  • parser support for unicode string literals
    • --Jvs 16:10, 7 February 2009 (EST): implemented as described in the section above
  • study the rules for conversion between character sets (and other operations such as comparison), and either implement them, or add validation to prevent them
    • --Jvs 01:27, 10 February 2009 (EST): implemented in eigenchange 12343; note that many cases remain where error messages omit the character set in the type string, leading to confusing messages such as "can't convert VARCHAR to CHAR" where what it really means is "can't convert VARCHAR CHARACTER SET UTF16 to CHAR CHARACTER SET LATIN1"
  • implement Unicode support in Java calc (and other areas of Java-level execution, including Janino compilation)
    • --Jvs 18:05, 2 February 2009 (EST): added support in eigenchange 12301; there are some FIXME's remaining for big-endian architectures; functions such as SUBSTRING do not yet support the USING {CHARACTERS|OCTETS} clause (not sure what the point of this is)
  • propagate character set correctly in type derivation for views, UNION, etc
    • --Jvs 01:24, 7 February 2009 (EST): done in eigenchange 12337, but needs more thorough testing
  • review SQL/MED and JDBC driver and figure out what needs to be done
    • --Jvs 20:16, 11 February 2009 (EST): looks like nothing special is needed; for SQL/MED, either declare the foreign columns explicitly as UTF16, or else set the Farrago default to Unicode (tested with the CSVJDBC driver)
  • (maybe) support Unicode in identifiers
    • --Jvs 18:14, 12 February 2009 (EST): implemented in eigenchange 12366
  • investigate csvjdbc wrapper and see if it can be used as a Unicode stand-in for the flatfile wrapper
    • --Jvs 20:43, 9 February 2009 (EST): yes, it can, at least with a UTF-8 encoded flatfile; see farrago/unitsql/expressions/unicode.sql for how to set this up
    • --Jvs 01:14, 6 March 2009 (EST): I also had to make some changes to the flatfile reader in order for it to work correctly on single-byte data after change_default_character_set_to_unicode has been called; as of eigenchange 12410, this works by declaring the flatfile result type as single-byte, but then adding a cast to UTF16 on top
  • investigate sqlline, vjdbc
    • --Jvs 20:43, 9 February 2009 (EST): from cursory testing, vjdbc seems to have no problem. sqlline works fine with Unicode in both input and output when the invoking shell's LANG variable is set to en_US.UTF-8
  • catalog export: use UTF-8 for XMI encoding
    • --Jvs 20:27, 27 April 2009 (EDT): oops, forgot about this one...this means we won't be able to use XMI export/import as a migration mechanism from 0.9.0 to later releases

LucidDB Unicode Support

Work items:

  • Add installer support
    • --Jvs 21:13, 11 February 2009 (EST): we don't actually have a reall installer yet, but an administrator can do call sys_root.change_default_character_set_to_unicode() immediately after running install.sh
Personal tools