Abstract Preview:
Amharic contains many confusable words because its script (fidel) includes redundant homophone letters: ሀ, ሐ, and ኀ (all three pronounced HA), ሠ and ሰ (both pronounced SE), አ and ዐ (both pronounced AE), and ጸ and ፀ (both pronounced TSE). This work develops a Word Sense Disambiguation (WSD) model that tackles this lexical ambiguity in Amharic using a deep learning technique. Because no Amharic WordNet is available, a total of 1756 examples of paired ambiguous Amharic homophonic words were collected: ድህነት (dhnet) and ድኅነት (dhnet), ምሁር (m’hur) and ምሑር (m’hur), በአል (be’al) and በዓል (be’al), and አቢይ (abiy) and ዐቢይ (abiy). After preprocessing, the text was vectorized using word2vec, fastText, Term Frequency-Inverse Document Frequency (TFIDF), and bag of words (BoW). The vectorized text was split into training and test sets. The training data was then used with Naive Bayes (NB), K-nearest neighbour (KNN), logistic regression (LG), decision tree (DT), and random forest (RF) classifiers, together with a random oversampling technique. Bidirectional Gated Recurrent Unit (BiGRU) and Bidirectional Long Short-Term Memory (BiLSTM) models improved accuracy to 99.99% even with the limited dataset.
Key Words: Amharic language, Homophone, Machine learning, Deep learning, Bidirectional, BiLSTM, BiGRU, TFIDF, BoW, Word embedding, Amharic word sense disambiguation
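The classical part of the pipeline described in the abstract (bag-of-words features, random oversampling, a Naive Bayes classifier) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function names, the toy context words (English stand-ins for Amharic context tokens), and the labels `sense_a` and `sense_b` are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for tokens, label in zip(samples, labels):
        by_label[label].append(tokens)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for tokens in group + extra:
            out_samples.append(tokens)
            out_labels.append(label)
    return out_samples, out_labels


def train_naive_bayes(samples, labels):
    """Fit a multinomial Naive Bayes model on bag-of-words counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for tokens, label in zip(samples, labels):
        word_counts[label].update(tokens)  # bag of words: token order is ignored
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: each sample is the context around one ambiguous word,
# and the label says which of the two homophone spellings/senses was intended.
contexts = [
    ["money", "hunger", "country"],
    ["money", "hunger"],
    ["country", "money"],
    ["church", "faith"],
]
labels = ["sense_a", "sense_a", "sense_a", "sense_b"]

balanced_contexts, balanced_labels = random_oversample(contexts, labels)
model = train_naive_bayes(balanced_contexts, balanced_labels)
print(predict(model, ["faith", "church"]))  # sense_b
```

In this sketch, oversampling is applied only to the training samples, so any held-out test data would keep its original class distribution. The word2vec, fastText, BiGRU, and BiLSTM components mentioned in the abstract need external libraries and are not shown here.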