= Normalizing Arabic transliterations =

Algorithm for normalizing the existing transliterated Arabic (_translit fields) in the database.

== New ==

=== 1. replace letter combinations ===

Replace the following letter combinations with a single letter:

|| dj, ch || j ||
|| th || t ||
|| kh || h ||
|| dh || d ||
|| sh || s ||
|| gh || g ||

Replace at the end of a word:

|| aẗ\b, at\b, ah\b || a ||

Replace at the beginning of a word:

|| "\babd " || abd ||

Replace letters with diacritics:

|| ỳ || a ||

=== 2. remove diacritics ===

Replace all letters with diacritics with the letter without diacritics.

=== 3. replace letters ===

Replace the following letters to unify the searches:

|| g, c || j ||

=== Questions ===

What about apostrophes/accents? Normalize to single-quote (U+2019) or apostrophe (U+0027)?

{{{
<ʿAbd > [space following letter] returns <ʿAbd> [without space] (e.g., ʿAbd Allāh brings up ʿAbdallāh or Abdullah)
<ʿAbd> [without space] returns <ʿAbd > [space following letter] (Chantal: maybe rule that replaces any next letter after Abd? <ʿAbdel>? <ʿAbd->)
}}}

== Currently ==

source:OpenMind/src/main/java/org/mpi/openmind/repository/utils/NormalizerUtils.java

=== 1. replace letter combinations ===

Replace the following letter combinations with a single letter.

|| th || t ||
|| kh || h ||
|| dh || d ||
|| sh || s ||
|| gh || g ||
|| "aẗ ", "at ", "ah " || "a " ||
|| ỳ || a ||

=== 2. replace letters with diacritics ===

Replace all(?) letters with diacritics with the letter without diacritics.

Remove all apostrophes.

----

For reference:

http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

http://junidecode.sourceforge.net/

http://userguide.icu-project.org/transforms/general
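
The three steps of the "New" algorithm can be sketched in Java roughly as follows. This is only an illustration, not the code in NormalizerUtils.java: the class name `TranslitNormalizerSketch` and the method `normalize` are hypothetical. The sketch also makes assumptions that the rules above do not state: input is lowercased first, it is composed to NFC so that ẗ and ỳ are single characters, the rules are applied one after another in table order, and a "word boundary" means "not adjacent to a letter, combining mark or digit". Apostrophe-like characters are left untouched, since that question is still open.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the "New" rules; not the existing NormalizerUtils code.
final class TranslitNormalizerSketch {

    // Assumption: a word character is any Unicode letter, combining mark or digit.
    private static final String WORD_CHAR = "[\\p{L}\\p{M}\\p{N}]";

    // "aẗ", "at" or "ah" at the end of a word (ẗ is U+1E97).
    private static final Pattern WORD_END =
            Pattern.compile("(?:a\u1E97|at|ah)(?!" + WORD_CHAR + ")");

    // "abd " (with trailing space) at the beginning of a word.
    private static final Pattern ABD_SPACE =
            Pattern.compile("(?<!" + WORD_CHAR + ")abd ");

    // Combining marks left over after canonical decomposition.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String normalize(String input) {
        // Assumption: lowercase and compose first so the literal rules match.
        String s = Normalizer.normalize(input.toLowerCase(Locale.ROOT),
                                        Normalizer.Form.NFC);

        // Step 1: letter combinations, applied sequentially in table order.
        s = s.replace("dj", "j").replace("ch", "j")
             .replace("th", "t").replace("kh", "h")
             .replace("dh", "d").replace("sh", "s")
             .replace("gh", "g");
        s = WORD_END.matcher(s).replaceAll("a");
        s = ABD_SPACE.matcher(s).replaceAll("abd");
        s = s.replace('\u1EF3', 'a');   // ỳ -> a

        // Step 2: decompose, then drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = MARKS.matcher(s).replaceAll("");

        // Step 3: unify g and c with j.
        return s.replace('g', 'j').replace('c', 'j');
    }

    public static void main(String[] args) {
        System.out.println(normalize("Abd All\u0101h"));      // Abd Allāh
        System.out.println(normalize("Ghaz\u0101l\u012B"));   // Ghazālī
        System.out.println(normalize("Mad\u012Bnah"));        // Madīnah
    }
}
```

Under these assumptions, `Abd Allāh` becomes `abdallah`, `Ghazālī` becomes `jazali` (gh is first reduced to g, which step 3 then maps to j), and `Madīnah` becomes `madina`. Note that the word-end rule only sees a plain `a`: `Allāh` keeps its final `h` because the diacritic on `ā` is still present when step 1 runs. Likewise, ʿ (U+02BF) is itself classified as a letter, so with the boundary definition assumed here the word-start rule does not fire on `ʿAbd ` as long as the ʿ is kept.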