
Normalization for Biblical Hebrew

The order in which you encode the characters of your Hebrew text is important for its proper display. Putting characters into a consistent order is known as “normalization.” The commonly used normalization routines may never touch the Hebrew text on your web site directly, but you still need to order the characters yourself so that they display properly. The normalization routine used on the Web is Normalization Form C (NFC). The problem is that NFC is not compatible with Biblical Hebrew: when it is applied, the marks are often reordered in a way that moves them out of alignment with each other, so the text is displayed incorrectly.

Consider this example from the SBL Hebrew font User Manual:

Although implementation of the Unicode Standard is generally a boon to scholars working with texts in complex scripts, there is an unfortunate and quite serious flaw in the current encoding of Hebrew. This involves the canonical combining class assignments that are used when text is normalised. Normalisation is a process by which sequences of characters in text that can be variously encoded but are semantically identical are treated as identically encoded. This can frequently involve the re-ordering of a sequence of characters. Consider, as an example, this combination of consonant plus marks that occurs in 1 Ch 13:13. This combination of four characters could be encoded in six different ways:

טֵּ֕ - tet + dagesh + tsere + zaqef gadol

טֵּ֕ - tet + dagesh + zaqef gadol + tsere

טֵּ֕ - tet + tsere + dagesh + zaqef gadol

טֵּ֕ - tet + tsere + zaqef gadol + dagesh

טֵּ֕ - tet + zaqef gadol + dagesh + tsere

טֵּ֕ - tet + zaqef gadol + tsere + dagesh

[Note from me: As you can see, only the first and third examples above actually display correctly. This is because the ordering is correct in the first and very close in the third.] ... If you consider that any combination of consonant plus three marks can be encoded in six different ways, it is easy to realise how even a fairly short word of five or six consonants with all their marks could be encoded in many dozens of different ways. Normalisation is important because it provides a mechanism for all these possible permutations of mark ordering to be resolved to a single canonical order. This is most important when a text not only needs to be displayed but also needs to be searched, sorted or spellchecked. If a search algorithm had to look for fifty or more possible and equivalent spellings of a single word, it would be extremely inefficient and slow. So normalisation is applied to reorder every equivalent sequence of characters into a single and consistent order.
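The six spellings quoted above can be generated and checked with Python's standard unicodedata module. This is only a minimal sketch: the variable names are my own, and the code points (tet U+05D8, dagesh U+05BC, tsere U+05B5, zaqef gadol U+0595) are my reading of the characters in the example. It prints the canonical combining class of each mark and shows that NFC collapses all six spellings into one.

```python
import itertools
import unicodedata

TET = "\u05D8"
MARKS = {
    "dagesh": "\u05BC",
    "tsere": "\u05B5",
    "zaqef gadol": "\u0595",
}

# Canonical combining class of each mark; NFC sorts marks by this number.
classes = {name: unicodedata.combining(ch) for name, ch in MARKS.items()}
print(classes)  # {'dagesh': 21, 'tsere': 15, 'zaqef gadol': 230}

# The six possible spellings: tet followed by the three marks in every order.
spellings = [TET + "".join(p) for p in itertools.permutations(MARKS.values())]

# Apply NFC to each spelling and collect the distinct results.
normalized = {unicodedata.normalize("NFC", s) for s in spellings}
print(len(spellings), len(normalized))  # 6 1
```

The single NFC result is tet + tsere + dagesh + zaqef gadol, which is the third spelling in the list above: the vowel is moved in front of the dagesh, rather than after it as in the first spelling.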

In response to this issue, SIL, SBL, and Microsoft have developed a custom normalization routine that ensures that Biblical Hebrew can be displayed correctly. This routine needs three things to work properly. First, the text must use the correct ordering sequence (see below). Second, it must be set in a qualified font (currently only Ezra SIL and SBL Hebrew, although Microsoft’s Hebrew fonts released in 2005 are supposed to be compatible as well). Third, the correct version of Uniscribe must be installed.

Uniscribe (usp10.dll) is a file that Microsoft Windows uses when displaying Unicode characters. Microsoft distributes different versions of Uniscribe, depending on the operating system or version of Office with which it ships. The most up-to-date version of Uniscribe (#1.473.4060.0), which will display Hebrew correctly according to the custom normalization routine, is only distributed with Office 2003, Service Pack 1. The next version will not come out until Longhorn (the code name for Microsoft’s next operating system) is released in 2006 or 2007. I strongly suggest that you get this version of Uniscribe if you deal with Unicode Hebrew text on a regular basis.

An Example

The placement of a meteg under a consonant is a great example of this normalization routine at work. While the examples below are correctly ordered, you will not see the last two examples displayed correctly if you do not have the latest version of Uniscribe.

אֽ - A meteg directly under an aleph - encoded as אֽ

אַֽ - A meteg to the right of a patach - encoded as אַֽ

אַֽ - A meteg to the left of a patach - encoded as אַֽ

אֲֽ - A meteg placed medially within a hatef patach - encoded as אֲֽ
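The meteg examples also show why NFC is a problem here. The sketch below assumes that a right meteg is typed before the vowel and a left meteg after it. That assumption is based on the Normalization Order on this page, where Right Meteg comes before the lower marks, and not on the exact code points of the examples above. The variable names are hypothetical.

```python
import unicodedata

ALEPH, PATAH, METEG = "\u05D0", "\u05B7", "\u05BD"

# Assumed encodings (see the note above this block).
right_meteg = ALEPH + METEG + PATAH  # meteg typed before the vowel
left_meteg = ALEPH + PATAH + METEG   # meteg typed after the vowel

nfc_right = unicodedata.normalize("NFC", right_meteg)
nfc_left = unicodedata.normalize("NFC", left_meteg)

print(right_meteg == left_meteg)  # False: two distinct spellings
print(nfc_right == nfc_left)      # True: NFC makes them identical
```

Patah has a lower canonical combining class (17) than meteg (22), so NFC always puts the patah first, and the distinction between the two spellings is lost.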

I hope that this explanation of normalization helps you. For more resources on normalization in general, check out my Professional Links page.

The Normalization Order

This information is also available in the documentation downloaded with the Ezra SIL and SBL Hebrew fonts.

1 – Base consonant

ש

2 – Shin and Sin dot

shin dot - שׁ

sin dot - שׂ

3 – Dagesh/Mapiq

שּ

4 – Rafe

שֿ

5 – Holam

שֹ

6 – Right Meteg

שֶֽ

7 – Lower marks as they occur from right-to-left:

sheva  -  שְ

hatef segol  -  שֱ

hatef patah  -  שֲ

hatef qamets  -  שֳ

hiriq  -  שִ

tsere  -  שֵ

segol  -  שֶ

patah  -  שַ

qamets  -  שָ

qibbuts  -  שֻ

meteg  -  שֽ

etnahta  -  ש֑

tipeha  -  ש֖

tevir  -  ש֛

munah  -  ש֣

mahapakh  -  ש֤

merkha  -  ש֥

merkha kefula  -  ש֦

darga  -  ש֧

yerah ben yomo  -  ש֪

meteg  -  שֽ

low punctum extra.  -  ש̣

8 – Low Pre-positive marks:

yetiv - ש֚

dehi - ש֭

9 – High Pre-positive marks:

geresh muqdam - ש֝

telisha gedola - ש֠

10 – Upper marks as they occur from right-to-left:

shalshelet  -  ש֓

zakef qatan  -  ש֔

zakef gadol  -  ש֕

revia  -  ש֗

zarqa  -  ש֘

geresh  -  ש֜

gershayim  -  ש֞

qarney para  -  ש֟

pazer  -  ש֡

qadma (azla)  -  ש֨

telisha qetana  -  ש֩

ole  -  ש֫

iluy  -  ש֬

masora circle  -  ש֯

masora/number dot  -  שׄ

combining dot  -  ש̇

combining diaeresis  -  ש̈

11 – Post-positive upper marks from the following group:

segolta  -  ש֒

pashta  -  ש֙

telisha qetana  -  ש֩

tzinor  -  ש֮
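One way to apply the order above in code is a stable sort of each consonant's marks by group number. What follows is a minimal Python sketch under several assumptions, not the routine developed by SIL, SBL, and Microsoft. The names GROUPS, RANK, and reorder_marks are hypothetical, and the code points are my reading of the marks listed above. Telisha qetana, which is listed in both groups 10 and 11, is placed in group 11 here. Meteg is always ranked with the lower marks, so a meteg typed before a vowel simply stays before it.

```python
# Group number -> marks in that group (numbers follow the list above).
GROUPS = [
    (2, "\u05C1\u05C2"),   # shin dot, sin dot
    (3, "\u05BC"),         # dagesh / mapiq
    (4, "\u05BF"),         # rafe
    (5, "\u05B9"),         # holam
    # Lower marks: vowels, meteg, low accents, dot below.
    (7, "\u05B0\u05B1\u05B2\u05B3\u05B4\u05B5\u05B6\u05B7\u05B8\u05BB\u05BD"
        "\u0591\u0596\u059B\u05A3\u05A4\u05A5\u05A6\u05A7\u05AA\u0323"),
    (8, "\u059A\u05AD"),   # yetiv, dehi
    (9, "\u059D\u05A0"),   # geresh muqdam, telisha gedola
    # Upper marks, including masora circle, upper dot, dot above, diaeresis.
    (10, "\u0593\u0594\u0595\u0597\u0598\u059C\u059E\u059F\u05A1\u05A8"
         "\u05AB\u05AC\u05AF\u05C4\u0307\u0308"),
    (11, "\u0592\u0599\u05A9\u05AE"),  # segolta, pashta, telisha qetana, tzinor
]
RANK = {ch: rank for rank, chars in GROUPS for ch in chars}


def reorder_marks(text):
    """Sort the marks after each base character by group number."""
    out = []
    marks = []

    def flush():
        # sorted() is stable, so marks in the same group keep their typed order.
        out.extend(sorted(marks, key=lambda m: RANK[m]))
        marks.clear()

    for ch in text:
        if ch in RANK:
            marks.append(ch)
        else:
            # Anything not in RANK is treated as a base character.
            flush()
            out.append(ch)
    flush()
    return "".join(out)


# tet + zaqef gadol + tsere + dagesh -> tet + dagesh + tsere + zaqef gadol
fixed = reorder_marks("\u05D8\u0595\u05B5\u05BC")
print(fixed == "\u05D8\u05BC\u05B5\u0595")  # True
```

Keep in mind that running NFC on the result would put the marks back into canonical combining class order, so a step like this would have to come after any standard normalization.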
