Technical Report

The biggest technical difficulty in dealing with Zhuang CJKV ideographs is the large number of characters that are not in ISO 10646 (Unicode). As of Unicode 6.0, the dictionary Sawndip Sawdenj alone has approximately 5,900 different non-ISO 10646 ideographs. The largest publication considered, the Pingguo Collection, is 430,000 characters long, of which about 60,000 characters, or 14% of the text, are not currently in ISO 10646. Even in collections like TY1, where many words are written by using a common Chinese character for its pronunciation or meaning, approximately 2.5% of the text is not currently in ISO 10646. This is still over 100 times the frequency of unencoded characters found in ancient classical Chinese books by the National Library of China, according to IRGN 1199. For 143 CJKV ideographs, the IRG has accepted their presence in a Zhuang Sawndip dictionary as evidence of use (see IRGN 1528), but there has never been a specific submission of Zhuang CJKV ideographs to the IRG.

A twofold solution was adopted: (1) the use of specially designed software and fonts to enable the typing of non-ISO 10646 Zhuang characters; (2) the development of specialized software that enables IDS (Ideographic Description Sequences) appearing in text to be processed in the same way as other characters. To ensure compatibility with ISO 10646, the same definition of a character was used as in Annex S of ISO 10646. The main font used in the table is Sawndip.ttf, an open source font based upon uming.ttf to which over 8,000 Zhuang characters have been added; it uses the PUA plane 16 for characters not currently in ISO 10646.
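As an illustration of how private use characters can be picked out of a text, here is a minimal Python sketch. It assumes that every character not in ISO 10646 is stored as a supplementary private use code point. Because the exact range used by Sawndip.ttf is not spelled out here, both supplementary private use planes (U+F0000..U+FFFFD and U+100000..U+10FFFD) are checked. The function names are hypothetical and are not part of the software described in this report.

```python
# Supplementary private use ranges defined by Unicode / ISO 10646.
# Assumption: unencoded Zhuang characters are stored in one of these ranges.
PUA_RANGES = (
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def is_supplementary_pua(ch: str) -> bool:
    """Return True if the single character ch is a supplementary PUA code point."""
    code = ord(ch)
    return any(start <= code <= end for start, end in PUA_RANGES)


def unencoded_share(text: str) -> float:
    """Fraction of non-whitespace characters in text that are supplementary PUA."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_supplementary_pua(c) for c in chars) / len(chars)
```

Applied to a whole collection, a ratio of this kind corresponds to the percentages quoted above; for the Pingguo Collection figures, 60,000 / 430,000 is about 14%.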

Inputting data
The first piece of software used to type Sawndip was Jedit2, an editor which used images for all characters in the Sawndip dictionary. Though clumsy, it was used to type not only the dictionary but also the 2,300-page Youjiang collection.

Figure 1 - Screenshot of the Jedit2 editor


Later, a web-based text editor was developed that requires only the installation of the font Sawndip.ttf and a suitable browser, such as Firefox, to type Zhuang characters. The user can switch between the web-based IME and an existing input method on their own machine by pressing 'Ctrl'.

Figure 2 - Basic Web IME

This IME alone is insufficient for most texts, so the ability to type IDS and convert those IDS into characters was added. The text below comes from page 87 of Li Fanggui's book. IDS can be used while typing and, once the text is complete, the IDS are converted to characters. This also allows quick identification of most of the 82.5 thousand* different characters in the database.

*75 thousand ISO characters and 7.5 thousand plane 16 PUA characters
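The conversion step can be sketched as follows. This is a minimal Python illustration, not the editor's actual code: it assumes IDS are written in the standard prefix notation using the twelve Ideographic Description Characters U+2FF0..U+2FFB, and that a lookup table from IDS to character is available. The table contents and the function names are hypothetical.

```python
from typing import Mapping

# Ideographic Description Characters and the number of operands each takes.
BINARY_IDCS = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")
TERNARY_IDCS = set("⿲⿳")


def ids_end(text: str, start: int) -> int:
    """Return the index just past the IDS (or single character) at start."""
    ch = text[start]
    if ch in BINARY_IDCS:
        arity = 2
    elif ch in TERNARY_IDCS:
        arity = 3
    else:
        return start + 1  # an ordinary character is its own unit
    pos = start + 1
    for _ in range(arity):
        if pos >= len(text):
            raise ValueError("incomplete IDS at end of text")
        pos = ids_end(text, pos)  # operands may themselves be IDS
    return pos


def convert_ids(text: str, table: Mapping[str, str]) -> str:
    """Replace every IDS found in table by its character; leave the rest as is."""
    out = []
    pos = 0
    while pos < len(text):
        end = ids_end(text, pos)
        unit = text[pos:end]
        out.append(table.get(unit, unit))
        pos = end
    return "".join(out)
```

For example, with a hypothetical table `{"⿰女子": "好"}`, the input `你⿰女子` is converted to `你好`, while an IDS that is not in the table is left unchanged so that it can still be read.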


Figures 3 and 4 - Advanced features: text typed with IDS (Figure 3) and the same text after the IDS are converted (Figure 4).

Analysis of data

The data is stored on a Linux-based server, which allows easy analysis using standard command line tools. Two main technical difficulties existed here: (1) a few standard functions such as 'sort' were designed for UCS-2 and not UCS-4, so UCS-4 compatible functions were made for these; (2) specialized functions were also designed which treat IDS as characters.
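Both points can be illustrated with a short Python sketch. It is not the set of functions used on the server; the names are hypothetical. It assumes IDS use the Ideographic Description Characters U+2FF0..U+2FFB in prefix notation, and it relies on the fact that Python 3 compares strings by full code point, so characters outside the Basic Multilingual Plane sort after all BMP characters.

```python
from collections import Counter

# Number of operands taken by each Ideographic Description Character.
IDC_ARITY = {
    **dict.fromkeys("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻", 2),
    **dict.fromkeys("⿲⿳", 3),
}


def tokens(text):
    """Split text into units, keeping each complete IDS together as one unit."""
    out, pos = [], 0
    while pos < len(text):
        start, need = pos, 1  # need = operands still to be read
        while need and pos < len(text):
            need += IDC_ARITY.get(text[pos], 0) - 1
            pos += 1
        out.append(text[start:pos])
    return out


def frequency_table(text):
    """Count units (characters or IDS), most frequent first, ties by code point."""
    counts = Counter(t for t in tokens(text) if not t.isspace())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Here an IDS such as `⿰女子` is counted once as a single unit rather than as three separate characters.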


Sources

For each source text, the location of each Zhuang character and the character itself were noted, along with the reading and meaning where given.

1) Dictionary

For the dictionary, the characters were referenced by page (3 digits), section (left 1, right 2), entry (1 digit) and position in entry (2 digits). A character is referenced by the page its entry starts on. The examples in the dictionary are constructed and so were not used as evidence of actual usage.
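A reference of this kind can be decoded mechanically. The Python sketch below assumes that the four fields are simply concatenated, in the order given, into one 7-digit string; that layout, the type name and the function name are assumptions made for illustration.

```python
from typing import NamedTuple


class DictRef(NamedTuple):
    page: int      # page on which the entry starts
    section: str   # "left" or "right"
    entry: int     # entry number within the section
    position: int  # position of the character within the entry


def parse_dict_ref(ref: str) -> DictRef:
    """Decode a 7-digit reference: PPP (page), S (section), E (entry), NN (position)."""
    if len(ref) != 7 or not (ref.isascii() and ref.isdigit()):
        raise ValueError(f"expected 7 digits, got {ref!r}")
    section = {"1": "left", "2": "right"}.get(ref[3])
    if section is None:
        raise ValueError(f"section digit must be 1 or 2, got {ref[3]!r}")
    return DictRef(int(ref[:3]), section, int(ref[4]), int(ref[5:7]))
```

Under these assumptions, `parse_dict_ref("0872305")` gives page 87, right section, entry 3, position 5.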

Figure 5 - Ancient Zhuang Dictionary Sawndip Sawdenj

2) Pingguo Collection

With over 430,000 characters of text, the Pingguo collection is the largest publication. In the database each major author was treated as a separate source (with the exception of 潘润环 and 黄关祥, who use an identical orthography and were therefore treated as a single source). This was because each major author represents a distinct tradition. A comparison of the characters used for bae (to go), gvaq (pass), ranz (house) and mwngz (you) illustrates how different these traditions are:

黄国观            󳗋卦󶒘佲
谭绍明            󶿸卦󶒘名
林兆邦            悲瓜󶒘门
潘润环            󶺶󰕼󶒘孟
黄关祥            󶺶󰕼󶒘孟
陆秀祯,余金莲     悲夸󶒘佲


NB: Sawndip.ttf or another suitable font is required to see all of the characters.

Characters in the collection are referenced by page, section (0 left, 1 right), line number (blank lines are also counted as lines) and position in the line.


Figure 6 - Pingguo Liaoge


3) BLT

At more than 209,000 characters long, this was the second most extensive publication. The reference for each character is based on the page in the order read. It should be noted that the original manuscripts, not the printed version, were used.

Figure 7 - Buluotuo sample page

Figure 8 - Another Buluotuo sample page

4) TY

For this source, only the manuscript lines were counted. Characters were referenced by page, line and position in the line.

Figure 9 - Tianyang sample page

In some cases the line count spreads over several pages.

Figure 10 - Another Tianyang sample page


5) SGS

Like the Pingguo collection, this has a relatively large number of non-ISO characters. The Zhuang text is on the left-hand side of each page.

Figure 11 - Three songs book sample


6) Most of the remaining texts are referenced in a similar way, with the exception of the tablet, where the lines are ordered as it is conjectured they might be sung:

Figure 12 - First stele of two (Tablet A)

Figure 13 - Second stele of two (Tablet B)

Table statistics

PG images in table 4131
of which Unicode characters number 3376
for these 4131 characters:-
other source image used is snsd  2152
other source image used is BLT   85
other source image used is zys  21
other source image used is TY  98
other source image used is zyq  4
other source image used is SGS  93
is the only source image in the tables 704

BLT images in table 861
of which Unicode characters number 810
for these 861 characters:-
other source image used is snsd  210
other source image used is PG  85
other source image used is TY  11
is the only source image in the tables 555

TY images in table 186
of which Unicode characters number 184
for these 186 characters:-
other source image used is snsd  20
other source image used is PG  98
other source image used is BLT  11
other source image used is SGS  8
is the only source image in the tables 48

SGS images in table 487
of which Unicode characters number 328
for these 487 characters:-
other source image used is snsd  326
other source image used is PG  93
other source image used is TY  8
is the only source image in the tables 61

LB image in table 1
for this 1 character:-
other source image used is snsd  1

snsd images in table 5847
of which Unicode characters number 4638
for these 5847 characters:-
other source image used is PG  2152
other source image used is BLT  210
other source image used is ltw  288
other source image used is zys  512
other source image used is TY  20
other source image used is zyq  12
other source image used is SGS  326
second image also snsd 1
is the only source image in the tables 2203

zys images in table 602
of which Unicode characters number 276
for these 602 characters:-
other source image used is snsd  512
other source image used is PG  21
is the only source image in the tables 69

zyq images in table 25
of which Unicode characters number 10
for these 25 characters:-
other source image used is snsd  12
other source image used is PG  4
is the only source image in the tables 9

ltw images in table 288
of which Unicode characters number 135
for these 288 characters:-
other source image used is snsd  288

(English source for the technical part of the Ceremony Table Report, in Chinese, presented in December 2010)
(last revised for style 2011-12-19)