Normalizing Arabic transliterations
Algorithm for normalizing the existing transliterated Arabic (_translit fields) in the database.
New
1. replace letter combinations
Replace the following letter combinations with a single letter:
dj, ch | j |
th | t |
kh | h |
dh | d |
sh | s |
gh | g |
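The table above could be applied roughly as follows. This is only a sketch: the class and method names are hypothetical, it is not taken from NormalizerUtils.java, and it assumes that only lower-case digraphs should match, as the table is written. A single regex alternation is used so that all digraphs are replaced in one left-to-right pass and one replacement cannot create input for another.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class DigraphSketch {
    // Digraph -> single letter, copied from the table above.
    private static final Map<String, String> DIGRAPHS = Map.of(
            "dj", "j", "ch", "j", "th", "t", "kh", "h",
            "dh", "d", "sh", "s", "gh", "g");

    // One alternation covering every key of the map.
    private static final Pattern DIGRAPH = Pattern.compile("dj|ch|th|kh|dh|sh|gh");

    /** Replaces each lower-case digraph with its single-letter form (needs Java 9+). */
    static String replaceDigraphs(String s) {
        return DIGRAPH.matcher(s).replaceAll(m -> DIGRAPHS.get(m.group()));
    }

    public static void main(String[] args) {
        System.out.println(replaceDigraphs("shaykh")); // sayh
        System.out.println(replaceDigraphs("djabal")); // jabal
    }
}
```

Capitalized forms such as "Sh" or "Kh" are left untouched by this sketch; whether they should be folded as well is not stated on this page.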
Replace at the end of a word:
aẗ\b, at\b, ah\b | a |
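A possible reading of the word-end rule, as a sketch with hypothetical names. Instead of \b it uses a negative lookahead ("no letter, combining mark or digit follows"), because the treatment of \b next to non-ASCII letters has differed between JDK versions. The NFC step is an assumption of this sketch, added so that a decomposed t + U+0308 is matched like the precomposed ẗ.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class WordEndSketch {
    // "a" followed by t-with-diaeresis (U+1E97), "t" or "h", but only when
    // no letter, combining mark or digit comes next (i.e. at the end of a word).
    private static final Pattern WORD_FINAL =
            Pattern.compile("a(?:\u1E97|t|h)(?![\\p{L}\\p{M}\\p{Nd}])");

    static String replaceWordEndings(String s) {
        // NFC turns "t" + combining diaeresis (U+0308) into the single code point U+1E97.
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        return WORD_FINAL.matcher(composed).replaceAll("a");
    }

    public static void main(String[] args) {
        System.out.println(replaceWordEndings("madrasah"));      // madrasa
        System.out.println(replaceWordEndings("salat al-zuhr")); // sala al-zuhr
        System.out.println(replaceWordEndings("ahmad"));         // ahmad (not word-final)
    }
}
```

Unlike a pattern with a literal trailing space, this also matches the last word of the string and words followed by punctuation.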
Replace letters with diacritics:
ỳ | a |
(Open question: replace all y, and move this rule to step 3?)
2. remove diacritics
Replace every letter that carries a diacritic with the corresponding letter without diacritics.
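One common way to do this with java.text.Normalizer is to decompose the string (NFD) and then delete all combining marks. The sketch below assumes that approach; the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
class DiacriticsSketch {
    /** Decomposes to NFD, then removes every combining mark (Unicode category M). */
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "All" + a-with-macron (U+0101) + "h"
        System.out.println(stripDiacritics("All\u0101h")); // Allah
    }
}
```

This only affects letters that have a canonical decomposition into base letter plus mark. Modifier letters such as ʿ (U+02BF) are letters rather than marks and pass through unchanged, so they would need a separate rule.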
3. replace letters
Replace the following letters to unify the searches:
g, c | j |
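Taken together, steps 1 to 3 could be chained as in the following sketch. All names are hypothetical, and several points are assumptions rather than decisions recorded on this page: the steps run in the order listed, only lower-case letters are matched, and the input is first brought into NFC. Read literally, "gh" becomes "g" in step 1 and then "j" in step 3.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical end-to-end sketch of the proposed algorithm.
class TranslitNormalizerSketch {
    private static final Map<String, String> DIGRAPHS = Map.of(
            "dj", "j", "ch", "j", "th", "t", "kh", "h",
            "dh", "d", "sh", "s", "gh", "g");
    private static final Pattern DIGRAPH = Pattern.compile("dj|ch|th|kh|dh|sh|gh");
    // "a" + (U+1E97 | t | h) with no letter, mark or digit following.
    private static final Pattern WORD_FINAL =
            Pattern.compile("a(?:\u1E97|t|h)(?![\\p{L}\\p{M}\\p{Nd}])");
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String normalize(String input) {
        String s = Normalizer.normalize(input, Normalizer.Form.NFC);
        // Step 1: letter combinations, word endings, y-with-grave (U+1EF3) -> a.
        s = DIGRAPH.matcher(s).replaceAll(m -> DIGRAPHS.get(m.group()));
        s = WORD_FINAL.matcher(s).replaceAll("a");
        s = s.replace('\u1EF3', 'a');
        // Step 2: remove the remaining diacritics.
        s = MARKS.matcher(Normalizer.normalize(s, Normalizer.Form.NFD)).replaceAll("");
        // Step 3: unify g and c with j.
        return s.replace('g', 'j').replace('c', 'j');
    }

    public static void main(String[] args) {
        System.out.println(normalize("ghazal"));   // jazal
        System.out.println(normalize("madrasah")); // madrasa
    }
}
```

Because the word-end rule runs before diacritics are removed, a word ending in "āh" keeps its final "h" while one ending in "ah" loses it; whether that is intended is left open here.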
Questions
What about apostrophes/accents? Normalize to the right single quotation mark (U+2019) or to the ASCII apostrophe (U+0027)?
<ʿAbd > (with a space after the last letter) returns <ʿAbd> (without the space), e.g. ʿAbd Allāh brings up ʿAbdallāh or Abdullah; likewise, <ʿAbd> (without the space) returns <ʿAbd > (with the space).
Currently
source:OpenMind/src/main/java/org/mpi/openmind/repository/utils/NormalizerUtils.java
1. replace letter combinations
Replace the following letter combinations with a single letter.
th | t |
kh | h |
dh | d |
sh | s |
gh | g |
"aẗ ", "at ", "ah " | "a " |
ỳ | a |
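To illustrate what the quoted patterns with a trailing space imply, here is a literal reading of that one table row. It is a hypothetical sketch, not the code in NormalizerUtils.java, which may for instance pad the string before replacing.

```java
// Hypothetical helper: applies the three quoted patterns exactly as written.
class TrailingSpaceSketch {
    static String replaceBeforeSpace(String s) {
        return s.replace("a\u1E97 ", "a ") // "a" + t-with-diaeresis (U+1E97) + space
                .replace("at ", "a ")
                .replace("ah ", "a ");
    }

    public static void main(String[] args) {
        // Only the first word is followed by a space, so only it is changed.
        System.out.println(replaceBeforeSpace("madrasah kabirah")); // madrasa kabirah
    }
}
```

Under this reading the last word of a field, or a word followed by punctuation, keeps its ending, which is one difference from the \b patterns in the new proposal.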
2. replace letters with diacritics
Replace all(?) letters that carry diacritics with the corresponding letters without diacritics.
Remove all apostrophes.
For reference:
http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html