Abstract Preview:
Amharic contains many ambiguous terms because its script (fidel) has duplicate homophone letters: ሀ, ሐ, and ኀ (all three pronounced HA), ሠ and ሰ (both pronounced SE), አ and ዐ (both pronounced AE), and ጸ and ፀ (both pronounced TSE). A WSD (Word Sense Disambiguation) model, which tackles this lexical ambiguity in the Amharic language, is developed using a deep learning technique. Because no Amharic WordNet is available, a total of 1756 examples of paired ambiguous Amharic homophonic words were collected. The word pairs were ድህነት (dhnet) and ድኅነት (dhnet), ምሁር (m’hur) and ምሑር (m’hur), በአል (be’al) and በዓል (be’al), and አቢይ (abiy) and ዐቢይ (abiy). Following word preprocessing, word2vec, fastText, Term Frequency-Inverse Document Frequency (TFIDF), and bag of words (BoW) were used to vectorize the text. The vectorized text was divided into training and test data. The training data was then used to fit Naive Bayes (NB), K-nearest neighbour (KNN), logistic regression (LG), decision tree (DT), and random forest (RF) classifiers, together with a random oversampling technique. Bidirectional Gated Recurrent Unit (BiGRU) and Bidirectional Long Short-Term Memory (BiLSTM) models improved accuracy to 99.99% even with the limited dataset.
Key Words: Amharic language, Homophone, Machine learning, Deep learning, Bidirectional, BiLSTM, BiGRU, TFIDF, BoW, Word embedding, Amharic word sense disambiguation
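
The classical part of the pipeline described in the abstract (TFIDF vectorization, a train/test split, random oversampling, and a logistic regression classifier) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the sense labels "poverty" and "salvation", the helper `random_oversample`, and the choice to oversample only the training split are all hypothetical, and scikit-learn is assumed as the library.

```python
import random
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus (NOT the paper's data): transliterated context
# sentences, each labelled with an assumed sense of the ambiguous word "dhnet".
sentences = [
    "dhnet economy income low", "dhnet hunger income poor",
    "dhnet poor village economy", "dhnet hunger village low",
    "dhnet income village poor", "dhnet economy hunger low",
    "dhnet faith church prayer", "dhnet prayer soul faith",
    "dhnet church soul prayer",
]
labels = ["poverty"] * 6 + ["salvation"] * 3


def random_oversample(texts, y, seed=0):
    """Duplicate randomly chosen minority-class examples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    out_texts, out_y = list(texts), list(y)
    for cls, n in counts.items():
        pool = [t for t, lab in zip(texts, y) if lab == cls]
        extra = rng.choices(pool, k=target - n)  # empty for the majority class
        out_texts.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_texts, out_y


# Stratified split so both senses appear in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=3, stratify=labels, random_state=0
)

# Assumption: oversample the training split only, so no duplicated
# sentence can leak into the test set.
X_bal, y_bal = random_oversample(X_train, y_train)

# TFIDF features are fitted on training text and reused for the test text.
vectorizer = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(X_bal), y_bal)

preds = clf.predict(vectorizer.transform(X_test))
print(Counter(y_bal))
print(list(zip(y_test, preds)))
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give the BoW variant, and any of the other classifiers named in the abstract could replace `LogisticRegression` without changing the rest of the sketch.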