
Version 8 (modified by casties, 9 years ago)

Normalizing Arabic transliterations

An algorithm for normalizing the existing transliterated Arabic (the _translit fields) in the database.

New

1. replace letter combinations

Replace the following letter combinations with a single letter:

dj, ch → j
th → t
kh → h
dh → d
sh → s
gh → g
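
The table above could be applied in a single left-to-right pass, so that the output of one replacement is never re-read as the start of another combination. The following is only a sketch: the class and method names are hypothetical (this is not the code in NormalizerUtils.java), and it assumes the input is already lower-case, which the page does not state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating step 1; not the project's NormalizerUtils. */
final class DigraphReplacer {
    // Mapping copied from the table: letter combination -> single letter.
    private static final Map<String, String> DIGRAPHS = new HashMap<>();
    static {
        DIGRAPHS.put("dj", "j");
        DIGRAPHS.put("ch", "j");
        DIGRAPHS.put("th", "t");
        DIGRAPHS.put("kh", "h");
        DIGRAPHS.put("dh", "d");
        DIGRAPHS.put("sh", "s");
        DIGRAPHS.put("gh", "g");
    }

    // One alternation so the whole string is scanned exactly once.
    private static final Pattern DIGRAPH_PATTERN =
            Pattern.compile("dj|ch|th|kh|dh|sh|gh");

    /** Replaces every listed combination; assumes lower-case input. */
    static String replaceDigraphs(String input) {
        Matcher m = DIGRAPH_PATTERN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, DIGRAPHS.get(m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, DigraphReplacer.replaceDigraphs("shaykh") gives "sayh". Chaining several String.replace calls instead would make the result depend on the order of the rules whenever a replacement creates a new combination.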

Replace at the end of a word (\b marks a word boundary):

aẗ\b, at\b, ah\b → a
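
The word-final rule maps fairly directly onto a java.util.regex pattern. A minimal sketch, with hypothetical class and method names: it assumes the text is in precomposed form (ẗ as the single code point U+1E97) and compiles with Pattern.UNICODE_CHARACTER_CLASS so that \b treats non-ASCII letters as word characters.

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the word-final rule of step 1. */
final class WordFinalReplacer {
    // The escape below is U+1E97, the precomposed t with diaeresis.
    // UNICODE_CHARACTER_CLASS makes the word boundary Unicode-aware,
    // so the boundary after U+1E97 is recognised.
    private static final Pattern WORD_FINAL = Pattern.compile(
            "(?:a\u1E97|at|ah)\\b", Pattern.UNICODE_CHARACTER_CLASS);

    /** Replaces a word-final "a + t-diaeresis", "at" or "ah" with "a". */
    static String replaceWordFinal(String input) {
        return WORD_FINAL.matcher(input).replaceAll("a");
    }
}
```

Because \b also matches at the end of the input and before punctuation, "madrasat" on its own is rewritten to "madrasa", while "ahmad" is left alone because its "ah" is not word-final.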

Replace letters with diacritics:

a

(replace all y and move to 3.?)

2. remove diacritics

Replace every letter that carries a diacritic with its base letter (for example, ā becomes a).
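
One common way to do this with the java.text.Normalizer class listed under the references is to decompose the text (NFD) and then drop all combining marks. This is a sketch with hypothetical names, not necessarily how NormalizerUtils.java does it.

```java
import java.text.Normalizer;

/** Hypothetical helper illustrating step 2 (remove diacritics). */
final class DiacriticStripper {
    /**
     * Decomposes each letter into base letter plus combining marks (NFD),
     * then removes every character in the Unicode "Mark" category.
     */
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Note that this only affects characters that decompose into a base letter plus marks. Spacing modifier letters such as ʿ (U+02BF) are not combining marks and pass through unchanged, so they need a separate decision.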

3. replace letters

Replace the following letters to unify the searches:

g, j → j
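
Step 3 is a plain character substitution. A minimal sketch (hypothetical class and method names, lower-case input assumed):

```java
/** Hypothetical helper illustrating step 3 (unify letters for searching). */
final class LetterUnifier {
    /** Maps g to j; j is already in its target form. */
    static String unifyLetters(String input) {
        return input.replace('g', 'j');
    }
}
```

Since step 1 has already turned gh into g, a word written with gh ends up with j after this step as well, which is worth keeping in mind when judging search results.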

Questions

What about apostrophes and accents? Should they be normalized to the right single quotation mark (U+2019) or to the ASCII apostrophe (U+0027)?
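
Whichever target is chosen, the normalization itself could look like the sketch below. The set of apostrophe-like characters is an assumption (the page does not list them), the names are hypothetical, and the target is left as a parameter because the question is still open. The transliteration signs ʿ (U+02BF) and ʾ (U+02BE) are deliberately left out of the set, since folding them is a separate decision.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the character set below is an assumption. */
final class ApostropheNormalizer {
    // ASCII apostrophe, left and right single quotation marks,
    // modifier letter apostrophe, grave accent, acute accent.
    private static final Pattern APOSTROPHE_LIKE =
            Pattern.compile("['\u2018\u2019\u02BC\u0060\u00B4]");

    /** Replaces every apostrophe-like character with the given target. */
    static String normalizeApostrophes(String input, char target) {
        String replacement = Matcher.quoteReplacement(String.valueOf(target));
        return APOSTROPHE_LIKE.matcher(input).replaceAll(replacement);
    }
}
```

Calling normalizeApostrophes(text, '\'') folds everything to U+0027, and passing '\u2019' folds everything to the typographic form instead.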

A search for <ʿAbd > (with a space after the last letter) returns <ʿAbd> without the space (e.g., ʿAbd Allāh brings up ʿAbdallāh or Abdullah).
A search for <ʿAbd> (without a space) returns <ʿAbd > with the space after the last letter.

Currently

source:OpenMind/src/main/java/org/mpi/openmind/repository/utils/NormalizerUtils.java

1. replace letter combinations

Replace the following letter combinations with a single letter.

th → t
kh → h
dh → d
sh → s
gh → g
"aẗ ", "at ", "ah " → "a "
a
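
To illustrate the endings rule as it is listed here, the sketch below applies the three quoted strings literally, trailing space included. It is not the code from NormalizerUtils.java, only a reading of the table, and the names are hypothetical.

```java
/** Hypothetical illustration of the rule as listed; not the actual source. */
final class TrailingSpaceRule {
    /** Replaces the three endings only when a literal space follows them. */
    static String replaceBeforeSpace(String input) {
        // The first literal starts with "a" followed by U+1E97 (t with diaeresis).
        return input.replace("a\u1E97 ", "a ")
                    .replace("at ", "a ")
                    .replace("ah ", "a ");
    }
}
```

Applied literally, the rule rewrites "madrasat al-kutub" to "madrasa al-kutub" but leaves a lone "madrasat" untouched, because no space follows the ending at the end of the input or before punctuation.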

2. replace letters with diacritics

Replace all(?) letters with diacritics with the letter without diacritics.

Remove all apostrophes.


For reference:

http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

http://junidecode.sourceforge.net/

http://userguide.icu-project.org/transforms/general