Cyrillic Font Project

Technical Information

Algorithm:

The script was written in Perl, or more specifically MacPerl, as the translations would take place on the Macs at the department. All of the following code fragments are in Perl, and are taken more or less directly from the script itself.

The first task was to settle on a working translation algorithm. The translation had to convert not only single character codes but also multiple-character sequences, in order to deal with some accenting problems. To start with, the problem was simplified: the algorithm would visit each character and look up its translation. This takes care of one-to-one as well as one-to-many translations. Once the simple translation algorithm was developed, a second-pass search could be added to find any multiple-character sequences.

The translation maps would be read in from a file, in order to make the translation process general. To make sure that the map-file format would not have to be revised, it was designed around the most general parameters that could be conceived. This meant that the format had to handle character sequences of differing lengths. In addition, it was decided that the map-files should be fairly legible in their raw format. Finally, the font names should appear in the file itself rather than in the file name, since font names can be quite long. Each line gives source bytes and target bytes in hexadecimal, separated by an arrow, and the first line names the source and target fonts. The format decided upon was as follows:

Lektorek Russian --> CyrillicII

3b --> fc

3c --> c7

8e --> 65 b1

3e --> c8
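As an illustration of how such a file could be read, here is a minimal sketch. The subroutine name load_map() and the sample multi-byte entry used in the usage note are assumptions made for this example, not names taken from the original script; the hash names follow the ones used in the rest of this page.

```perl
use strict;
use warnings;

# Hypothetical helper: parse map-file text into the two font names and two hashes.
#   %map     : single source byte  -> one or more target bytes
#   %map_tto : multi-byte sequence -> target bytes
sub load_map {
    my ($text) = @_;
    my ($from_font, $to_font, %map, %map_tto);
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;                      # skip blank lines
        my ($lhs, $rhs) = $line =~ /^\s*(.*?)\s*-->\s*(.*?)\s*$/
            or next;                                    # ignore lines without an arrow
        if (!defined $from_font) {                      # first arrow line names the fonts
            ($from_font, $to_font) = ($lhs, $rhs);
            next;
        }
        # Turn hex pairs such as "65 b1" into the byte string "\x65\xb1".
        my $src = join '', map { chr(hex($_)) } split ' ', $lhs;
        my $dst = join '', map { chr(hex($_)) } split ' ', $rhs;
        if (length($src) == 1) { $map{$src}     = $dst }
        else                   { $map_tto{$src} = $dst }
    }
    return ($from_font, $to_font, \%map, \%map_tto);
}
```

Called on the sample above, load_map() would return "Lektorek Russian" and "CyrillicII" as the font names, with "\x8e" mapped to the two-byte string "\x65\xb1". A line such as "61 60 --> e0" (a made-up entry) would land in the second hash because its left-hand side is longer than one byte.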

The simple translation algorithm follows fairly quickly, once this map is loaded into a hash:

s/(.)/$map{$1}/g
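A slightly more defensive variant of this pass is sketched below. It assumes that characters missing from the map should be copied through unchanged and that newlines should be visited too; neither behaviour is stated for the original script, and the subroutine name translate_simple() is made up for the example.

```perl
use strict;
use warnings;

# Sample entries from the map above, keyed by the raw source byte.
my %map = (
    "\x3b" => "\xfc",
    "\x3c" => "\xc7",
    "\x8e" => "\x65\xb1",    # one-to-many: one byte becomes two
    "\x3e" => "\xc8",
);

# Hypothetical helper: translate every character of $text.
# /s lets "." match newlines; /e evaluates the replacement as code,
# so unmapped characters can fall back to themselves (an assumption).
sub translate_simple {
    my ($text) = @_;
    $text =~ s/(.)/exists $map{$1} ? $map{$1} : $1/gse;
    return $text;
}
```

With these entries, translate_simple("a\x3b\x8e") leaves "a" alone, turns "\x3b" into "\xfc", and expands "\x8e" into "\x65\xb1".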

The second-pass algorithm needs to be a little smarter than that. It has to search for several different strings, so alternation is needed. Also, since the strings are inserted literally into the regex, any regex metacharacters in them need to be quoted. So when, in reading the map-file, we encounter a many-to-one translation, we add it to a different hash, called map_tto. When the map-file has been read completely, the keys of map_tto are quoted individually and joined on '|', the alternation character. The result is the search string in the following code fragment:

s/($search)/$map_tto{$1}/g
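One way the search string might be built is sketched here. The two entries in %map_tto are invented for the example, and sorting the keys longest-first is an added precaution not mentioned in the description above: Perl alternation takes the first branch that matches, so a short key could otherwise shadow a longer one that starts with the same bytes.

```perl
use strict;
use warnings;

# Hypothetical many-to-one entries: a letter followed by one or two accent bytes.
my %map_tto = (
    "\x61\x60"     => "\xe0",
    "\x61\x60\x60" => "\xe1",
);

# Quote each key so its bytes are matched literally, then join on "|".
# Longest keys first (an assumption of this sketch, see the note above).
my $search = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %map_tto;

# Hypothetical helper: replace every listed sequence in $text.
sub translate_sequences {
    my ($text) = @_;
    $text =~ s/($search)/$map_tto{$1}/g;
    return $text;
}
```

Here translate_sequences("x\x61\x60y") yields "x\xe0y", and the three-byte sequence is replaced as a whole rather than being split after its first two bytes.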

RTF Reader:

The documents were to be read in RTF format, which would allow the preservation of formatting, but still allow us to work on ASCII text. The RTF reader had to be able to understand the basics of RTF, ignoring most control words, recognizing font changes, special characters, hex-codes, etc. The RTF specification was obtained from Microsoft's WWW site.

The reader was designed recursively. Each level handles one block, in a function called parse_block(). The text is searched for an RTF metacharacter (a forgivingly small set of three characters: \, {, and }). An open curly calls another level of interpretation, a closing curly finishes off the current level, and a backslash is parsed for control-word content. Each instance of the reader stores its own font number, which it inherits from the previous level. Any intervening text is translated if necessary, or copied out directly, by another subroutine, write().
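The recursive design can be sketched roughly as follows. This is a simplified illustration, not the original code: it handles only body text (no header), recognizes only \fN font changes and \'hh hex codes specially, and collects output in a string. The names write_text() (used instead of write() to avoid clashing with Perl's built-in write), convert_body(), %is_source_font, and the choice of \f1 as the source font are all assumptions.

```perl
use strict;
use warnings;

my $out = '';                        # output buffer for this sketch
my %is_source_font = (1 => 1);       # assumption: \f1 is the font to translate
my %map = ("\x3b" => "\xfc");        # tiny sample map

# Translate text set in a source font, then emit it. Bytes that are
# unsafe in RTF (\, {, } and anything above 0x7f) go out as \'hh.
sub write_text {
    my ($text, $font) = @_;
    $text =~ s/(.)/exists $map{$1} ? $map{$1} : $1/gse if $is_source_font{$font};
    $text =~ s/([\\{}\x80-\xff])/sprintf("\\'%02x", ord($1))/ge;
    $out .= $text;
}

# Parse one block. $rtf is a reference to the whole text; pos() on that
# string is shared by every level of the recursion.
sub parse_block {
    my ($rtf, $font) = @_;           # $font is inherited from the enclosing block
    while (pos($$rtf) < length $$rtf) {
        if    ($$rtf =~ /\G\{/gc)                   { $out .= '{'; parse_block($rtf, $font); }
        elsif ($$rtf =~ /\G\}/gc)                   { $out .= '}'; return; }
        elsif ($$rtf =~ /\G\\'([0-9a-fA-F]{2})/gc)  { write_text(chr(hex($1)), $font); }
        elsif ($$rtf =~ /\G(\\f(\d+) ?)/gc)         { $font = $2; $out .= $1; }
        elsif ($$rtf =~ /\G(\\(?:[a-z]+-?\d* ?|[^a-z]))/gc) { $out .= $1; }   # other control words
        elsif ($$rtf =~ /\G([^\\{}]+)/gc)           { write_text($1, $font); }
        else  { $out .= substr($$rtf, pos($$rtf)); pos($$rtf) = length $$rtf; } # stray trailing "\"
    }
}

# Hypothetical driver: convert a piece of RTF body text.
sub convert_body {
    my ($rtf) = @_;
    $out = '';
    pos($rtf) = 0;
    parse_block(\$rtf, 0);
    return $out;
}
```

Because the font number is an ordinary subroutine argument, a font change inside a nested block is forgotten automatically when that block's closing curly returns to the caller, which is the main attraction of the recursive layout.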

The RTF header also had to be read; the header function was called read_header(). The important tasks of this function were to find the appropriate (source) font tags and to substitute in the new font names. The font tags are later used by write() to determine whether a passage needs translating.
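A minimal sketch of the font-table step is given below. It assumes font-table entries of the form \fN, optional further control words, then the font name terminated by a semicolon; the argument list and return values are invented for the example and may differ from the original read_header().

```perl
use strict;
use warnings;

# Sketch: rename every font-table entry called $from to $to, and
# collect the font numbers (the N of \fN) whose text needs translating.
sub read_header {
    my ($header, $from, $to) = @_;
    my %source_font;
    # $1 = "\fN" plus any control words before the name, $2 = N.
    # \Q...\E quotes the font name so it is matched literally.
    $header =~ s/(\\f(\d+)[^;{}]*?)\Q$from\E;/$source_font{$2} = 1; "$1$to;"/ge;
    return ($header, \%source_font);
}

# Usage with a made-up header fragment:
my ($new_header, $fonts) = read_header(
    '{\rtf1\mac{\fonttbl{\f0\fswiss Helvetica;}{\f1\fnil Lektorek Russian;}}',
    'Lektorek Russian', 'CyrillicII',
);
```

After the call, the entry for \f1 names CyrillicII and the returned hash has the single key 1, which a text-writing routine could consult to decide whether the current font needs translating.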


Version History: