Version 13 (modified by casties, 9 years ago)

--

Normalizing Arabic transliterations

Algorithm for normalizing the existing transliterated Arabic (the _translit fields) in the database.

New

1. normalize "apostrophes"

Replace:

`, ʿ, ‘, ’, ʾ → '
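
A minimal sketch of step 1, assuming the five characters above are U+0060, U+02BF, U+2018, U+2019 and U+02BE, and that the target is the plain apostrophe U+0027 (the Questions section below leaves the target open). The names `APOSTROPHE_TABLE` and `normalize_apostrophes` are hypothetical and do not come from the OpenMind code base.

```python
# Apostrophe-like characters listed in the table above (assumed code points):
# U+0060 grave accent, U+02BF left half ring, U+2018 / U+2019 single quotes,
# U+02BE right half ring. All are mapped to U+0027.
APOSTROPHE_VARIANTS = "`\u02bf\u2018\u2019\u02be"
APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize_apostrophes(text: str) -> str:
    """Map every listed apostrophe-like character to a plain apostrophe."""
    return text.translate(APOSTROPHE_TABLE)
```

If U+2019 is chosen as the target instead, only the replacement value in the table would need to change.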

2. replace letter combinations

Replace the following letter combinations with a single letter:

dj, ch → j
th → t
kh → h
dh → d
sh → s
gh → g
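
The table above could be applied in a single left-to-right pass, as in the sketch below. It assumes lower-case input, and the single-pass choice is an assumption: applying the rules one after another with repeated string replacement would let the output of one rule feed the next (for example "dkh" would become "dh" and then "d"). `DIGRAPHS` and `replace_digraphs` are hypothetical names.

```python
import re

# Letter combinations from the table above, each reduced to one letter.
DIGRAPHS = {
    "dj": "j",
    "ch": "j",
    "th": "t",
    "kh": "h",
    "dh": "d",
    "sh": "s",
    "gh": "g",
}
# One alternation, so the text is scanned once and replacements are not
# re-examined by later rules.
DIGRAPH_RE = re.compile("|".join(DIGRAPHS))


def replace_digraphs(text: str) -> str:
    """Replace each listed two-letter combination with its single letter."""
    return DIGRAPH_RE.sub(lambda m: DIGRAPHS[m.group(0)], text)
```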

Replace at the end of a word:

aẗ\b, at\b, ah\b → a
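
A sketch of the word-end rule, reading `\b` as a regular-expression word boundary. The text is first brought to NFC so that ẗ is the single code point U+1E97. This step is an assumption added here: in Python's `re` a combining mark does not count as a word character, so on decomposed input `at\b` would match in front of the combining diaeresis. `normalize_word_endings` is a hypothetical name.

```python
import re
import unicodedata

# "aẗ", "at" or "ah" directly before a word boundary.
WORD_END_RE = re.compile(r"(?:a\u1e97|at|ah)\b")


def normalize_word_endings(text: str) -> str:
    """Reduce the word-final endings aẗ, at and ah to a plain a."""
    # Compose t + U+0308 into U+1E97 so the boundary falls after the letter.
    text = unicodedata.normalize("NFC", text)
    return WORD_END_RE.sub("a", text)
```

Unlike a pattern that requires a trailing space, the boundary form also matches the last word of a string and words followed by punctuation.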

Replace (drop the space after 'abd):

"'abd " → "'abd"

Replace letters with diacritics:

a

3. remove diacritics

Replace every letter that carries a diacritic with its base letter (e.g. ā becomes a, ḥ becomes h).
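
One common way to do this is to decompose the text (Unicode NFD) and drop the combining marks, sketched below with Python's standard `unicodedata` module. This is an illustration, not the project's implementation, and `strip_diacritics` is a hypothetical name. Letters without a canonical decomposition (for example ø) are left unchanged by this approach, as are modifier letters such as ʿ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the diacritics used in
    # transliteration: macron, dot below, line below, caron, diaeresis.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```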

4. replace letters

Replace the following letters to unify the searches:

g, c → j
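
Steps 1 to 4 can be chained as in the following sketch. It rests on several assumptions that are not stated on this page: input is lower-case, the apostrophe target is U+0027, the letter combinations are replaced in a single pass, and the text is NFC-normalized first. The incomplete "letters with diacritics" table in step 2 is left out. `normalize_translit` and the constant names are hypothetical.

```python
import re
import unicodedata

APOSTROPHES = str.maketrans({ch: "'" for ch in "`\u02bf\u2018\u2019\u02be"})
DIGRAPHS = {"dj": "j", "ch": "j", "th": "t", "kh": "h",
            "dh": "d", "sh": "s", "gh": "g"}
DIGRAPH_RE = re.compile("|".join(DIGRAPHS))
WORD_END_RE = re.compile(r"(?:a\u1e97|at|ah)\b")
UNIFY_LETTERS = str.maketrans({"g": "j", "c": "j"})


def normalize_translit(text: str) -> str:
    """Apply steps 1-4 of the proposed normalization in page order."""
    text = unicodedata.normalize("NFC", text)
    # Step 1: normalize apostrophe-like characters.
    text = text.translate(APOSTROPHES)
    # Step 2: letter combinations, word endings, then the 'abd rule.
    text = DIGRAPH_RE.sub(lambda m: DIGRAPHS[m.group(0)], text)
    text = WORD_END_RE.sub("a", text)
    text = text.replace("'abd ", "'abd")
    # Step 3: remove diacritics (decompose, drop nonspacing marks).
    text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if unicodedata.category(ch) != "Mn")
    # Step 4: unify g and c with j.
    return text.translate(UNIFY_LETTERS)
```

Because the word-end rule runs before the diacritics are removed, a word such as allāh keeps its final h: the ending "āh" is not "ah" at the time the rule is applied.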

Questions

What about apostrophes/accents? Normalize to the right single quotation mark (U+2019) or to the plain apostrophe (U+0027)?

<ʿAbd > [space following letter] returns <ʿAbd> [without space] (e.g., ʿAbd Allāh brings up ʿAbdallāh or Abdullah)
<ʿAbd> [without space] returns <ʿAbd > [space following letter]
(Chantal: maybe rule that replaces any next letter after Abd? <ʿAbdel>? <ʿAbd->)

Currently

source:OpenMind/src/main/java/org/mpi/openmind/repository/utils/NormalizerUtils.java

1. replace letter combinations

Replace the following letter combinations with a single letter.

th → t
kh → h
dh → d
sh → s
gh → g
"aẗ ", "at ", "ah " → "a "
a

2. replace letters with diacritics

Replace all(?) letters with diacritics with the letter without diacritics.

Remove all apostrophes.
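
For comparison, the behaviour described in this section can be approximated as below. This is not the code in NormalizerUtils.java. The sequential application of the rules in table order, the NFD-based diacritic removal, and the set of characters treated as apostrophes are all assumptions, and `normalize_current` is a hypothetical name.

```python
import unicodedata

# Rules from the table above, applied one after another (assumed order).
CURRENT_RULES = [
    ("th", "t"), ("kh", "h"), ("dh", "d"), ("sh", "s"), ("gh", "g"),
    ("a\u1e97 ", "a "), ("at ", "a "), ("ah ", "a "),
]
# Characters assumed to count as apostrophes for removal.
APOSTROPHE_CHARS = "'`\u02bf\u2018\u2019\u02be"


def normalize_current(text: str) -> str:
    """Approximate the current two-step normalization described above."""
    text = unicodedata.normalize("NFC", text)
    for old, new in CURRENT_RULES:
        text = text.replace(old, new)
    text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if unicodedata.category(ch) != "Mn")
    return "".join(ch for ch in text if ch not in APOSTROPHE_CHARS)
```

Since the ending rules require a trailing space, the last word of a string is not affected: in "madrasah kabirah" only the first word loses its final h.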


For reference:

http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

http://junidecode.sourceforge.net/

http://userguide.icu-project.org/transforms/general