== Normalizing Arabic transliterations ==

Algorithm for normalizing the existing transliterated Arabic (_translit fields) in the database.

=== New ===

==== 1. replace letter combinations ====

Replace the following letter combinations with a single letter:

|| dj, ch || j ||
|| th || t ||
|| kh || h ||
|| dh || d ||
|| sh || s ||
|| gh || g ||

Replace at the end of a word:

|| aẗ\b, at\b, ah\b || a ||

Replace letters with diacritics:

|| ỳ || a ||

(replace all y and move to 3.?)

==== 2. remove diacritics ====

Replace every letter that carries a diacritic with its base letter (the same letter without the diacritic).

==== 3. replace letters ====

Replace the following letters to unify searches:

|| g, j || j ||

==== Questions ====

What about apostrophes/accents? Normalize to the right single quotation mark (U+2019) or to the apostrophe (U+0027)?

{{{
<ʿAbd > [space following letter] returns <ʿAbd> [without space] (e.g., ʿAbd Allāh brings up ʿAbdallāh or Abdullah)
<ʿAbd> [without space] returns <ʿAbd > [space following letter]
}}}

=== Currently ===

source:OpenMind/src/main/java/org/mpi/openmind/repository/utils/NormalizerUtils.java

==== 1. replace letter combinations ====

Replace the following letter combinations with a single letter.

|| th || t ||
|| kh || h ||
|| dh || d ||
|| sh || s ||
|| gh || g ||
|| "aẗ ", "at ", "ah " || "a " ||
|| ỳ || a ||

==== 2. replace letters with diacritics ====

Replace all(?) letters with diacritics with the letter without diacritics.

Remove all apostrophes.

-------------

For reference:

http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

http://junidecode.sourceforge.net/

http://userguide.icu-project.org/transforms/general
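
=== Example sketch (Java) ===

The three steps listed under "New" could be sketched roughly as follows. This is only an illustration, not the code in NormalizerUtils.java: the class name `TranslitNormalizer` and the method `normalize` are hypothetical, and several details are assumptions that the tables above do not settle. In particular, the sketch applies the rules in the order they are listed, replaces all digraphs in a single pass so that one replacement cannot create a new digraph, treats "end of a word" as "not followed by a letter, mark or digit", matches lowercase letters only, and leaves apostrophe-like characters such as ʿ untouched because that question is still open.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; name and structure are illustrative only.
final class TranslitNormalizer {

    // Step 1 table: letter combination -> single letter.
    private static final Map<String, String> DIGRAPHS = new HashMap<>();
    static {
        DIGRAPHS.put("dj", "j");
        DIGRAPHS.put("ch", "j");
        DIGRAPHS.put("th", "t");
        DIGRAPHS.put("kh", "h");
        DIGRAPHS.put("dh", "d");
        DIGRAPHS.put("sh", "s");
        DIGRAPHS.put("gh", "g");
    }

    private static final Pattern DIGRAPH_PATTERN =
            Pattern.compile("dj|ch|th|kh|dh|sh|gh");

    // Word-final "a" + U+1E97 (t with diaeresis), "at" or "ah".
    // Assumption: a word ends where no letter, mark or digit follows.
    private static final Pattern WORD_FINAL =
            Pattern.compile("(?:a\u1E97|at|ah)(?![\\p{L}\\p{M}\\p{N}])");

    // Combining marks left behind by canonical decomposition.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String normalize(String input) {
        // Compose first, so that letters with diacritics are single code
        // points and are not mistaken for their plain counterparts.
        String s = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Step 1a: replace letter combinations in one left-to-right pass.
        Matcher m = DIGRAPH_PATTERN.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, DIGRAPHS.get(m.group()));
        }
        m.appendTail(sb);
        s = sb.toString();

        // Step 1b: word-final endings become "a".
        s = WORD_FINAL.matcher(s).replaceAll("a");

        // Step 1c: y with grave (U+1EF3) becomes "a".
        s = s.replace('\u1EF3', 'a');

        // Step 2: decompose, then drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = MARKS.matcher(s).replaceAll("");

        // Step 3: unify g and j.
        return s.replace('g', 'j');
    }
}
```

With these assumptions, `TranslitNormalizer.normalize("maghrib")` yields `majrib` (gh becomes g in step 1, then g becomes j in step 3), and `madīnaẗ` yields `madina`. Because the sketch is case-sensitive, an initial capital such as the "Sh" in "Shaykh" is not matched by the digraph table; whether input should be lowercased first is not specified above.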