Arabic letter frequency
No language has an exact letter frequency distribution, because every writer writes slightly differently. As a rule, texts in different languages that use the Arabic script (e.g. Arabic, Ottoman Turkish, Persian and Urdu) have different letter frequencies, most obviously for letters that are used in only some of those languages (e.g. the Persian letters پ, چ and گ, which are not used to write Arabic).
Encoding the most frequent letters with the shortest symbols was pioneered in telegraph codes and is used in modern data-compression techniques such as Huffman coding.
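To illustrate how frequency drives code length, the following is a minimal Huffman coding sketch in Python. The function name `huffman_code` and the sample phrase are illustrative assumptions, not taken from any source cited in this article; the code only shows that more frequent letters never receive longer bit strings than less frequent ones.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Return a dict mapping each symbol to its Huffman bit string."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit to be transmitted.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), {sym: ""}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


sample = "بسم الله الرحمن الرحيم"  # illustrative sample text only
letter_counts = Counter(ch for ch in sample if not ch.isspace())
codes = huffman_code(letter_counts)
for letter, n in letter_counts.most_common():
    print(letter, n, codes[letter])
```

A real compressor would build the table from corpus-wide frequencies rather than from a single phrase.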
What gets counted in input Arabic text?
The Arabic alphabet consists of 28 primary letters, listed as letters 1 to 28 in Table 1. The eight modified letters in positions 29 to 36 of the same table are used in writing just like the primary letters. If these eight modified forms are folded into the primary list on the basis of shape or phonetic similarity, the result is as shown in Table 2. For accurate frequency analysis, each of the 36 letters of Table 1 has its frequency counted independently.
Although the full set of Arabic characters includes about ten diacritics, as shown in Figure 1, frequency analysis of Arabic characters is concerned only with the frequencies of the alphabet letters shown in Table 2.
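The counting rule described above can be sketched in Python as follows. Table 1 and Table 2 are not reproduced here, so both the 36-letter set (taken as the Unicode ranges U+0621 to U+063A and U+0641 to U+064A) and the folding map `FOLD` are assumptions made for illustration; the actual tables may group the modified letters differently. Diacritics, tatweel, digits, punctuation and spaces are skipped because they fall outside the letter set.

```python
from collections import Counter

# Assumed stand-in for the 36 letters of Table 1 (28 primary + 8 modified).
LETTERS = {chr(cp) for cp in range(0x0621, 0x063B)} | {
    chr(cp) for cp in range(0x0641, 0x064B)
}

# Hypothetical folding of modified forms into primary letters.
FOLD = {
    "\u0622": "\u0627",  # alef with madda       -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0626": "\u064A",  # yeh with hamza        -> yeh
    "\u0649": "\u064A",  # alef maksura          -> yeh (by shape)
    "\u0629": "\u0647",  # teh marbuta           -> heh (by shape)
}


def count_letters(text, fold=False):
    """Count Arabic letters in text, ignoring everything outside LETTERS."""
    counts = Counter()
    for ch in text:
        if ch not in LETTERS:
            continue  # diacritics, tatweel, digits, punctuation, whitespace
        counts[FOLD.get(ch, ch) if fold else ch] += 1
    return counts


sample = "أَكَلَ الوَلَدُ التُّفَّاحَةَ"  # illustrative sample with diacritics
for letter, n in count_letters(sample).most_common():
    print(letter, n)
```

Calling `count_letters(sample, fold=True)` merges the modified forms before counting, which is one way to move from a Table 1 style count to a Table 2 style count.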
Sources with over five million letters
The following well-known Arabic sources were used to obtain a sample large enough for frequency statistics.
- The first seven volumes of the series البداية والنهاية (The Beginning and The End) of Ibn Kathir: 2,855 pages, 1,096,047 words, and 4,326,031 letters.
- The book الرحيق المختوم (The Sealed Nectar) of Almubarakfuri: 284 pages, 134,662 words, and 553,740 letters.
- The book تحفة العروسين (The Masterpiece of the Brides) of Al-shuri: 239 pages, 66,550 words, and 242,361 letters.
Collectively, these sources add up to 3,378 pages, 1,297,259 words, and 5,122,132 letters.
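The totals can be checked with a few lines of Python. The figures are exactly those listed above; the dictionary layout and variable names are illustrative choices. The sketch also prints each source's letters per word and its share of all counted letters, which makes it visible that the first source dominates the sample.

```python
# (pages, words, letters) for each source, as listed above.
sources = {
    "The Beginning and The End, vols. 1-7": (2_855, 1_096_047, 4_326_031),
    "The Sealed Nectar": (284, 134_662, 553_740),
    "The Masterpiece of the Brides": (239, 66_550, 242_361),
}

# Sum each column across the three sources.
total_pages, total_words, total_letters = (
    sum(column) for column in zip(*sources.values())
)
print(total_pages, total_words, total_letters)

for title, (n_pages, n_words, n_letters) in sources.items():
    print(
        f"{title}: {n_letters / n_words:.2f} letters per word, "
        f"{n_letters / total_letters:.1%} of all letters"
    )
```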
The following graphs show the letter frequency distribution for the counted letters. Figure 2 shows a histogram sorted by Unicode value; Figure 3 shows the same histogram sorted by frequency.
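The two orderings used in the figures can be reproduced from any table of counts. The sketch below is illustrative only: it counts the letters of a short sample phrase rather than the corpus, and it prints text bars instead of drawing the actual figures.

```python
from collections import Counter

sample = "بسم الله الرحمن الرحيم"  # illustrative sample, not the corpus
counts = Counter(ch for ch in sample if not ch.isspace())

# Ordering in the style of Figure 2: ascending Unicode code point.
by_unicode = sorted(counts.items(), key=lambda item: ord(item[0]))

# Ordering in the style of Figure 3: descending frequency.
by_frequency = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for title, rows in (("By Unicode value", by_unicode), ("By frequency", by_frequency)):
    print(title)
    for letter, n in rows:
        print(f"  U+{ord(letter):04X} {letter} {'#' * n}")
```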
Qur'an letter and word frequency statistics
The frequency distribution of letters in the Qur'an is much the same as in the sources above. The following statistics are specific to one of the most common print editions (the recitation of Hafs from Asim), which is also available online.
- The total number of letters is 330,709
- The total number of words is 77,797
- The number of different words (without repetition) is 14,870
- The average word length in the Qur'an is 330,709 ÷ 77,797 ≈ 4.25 letters
- The number of verses is 6,236
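The ratios implied by these figures can be worked out directly. In the sketch below only the four input numbers come from the list above; the words-per-verse and distinct-to-total ratios are simple derived quantities added for illustration.

```python
# Figures from the list above (Hafs from Asim edition).
n_letters = 330_709
n_words = 77_797
n_distinct_words = 14_870
n_verses = 6_236

letters_per_word = n_letters / n_words
words_per_verse = n_words / n_verses
distinct_ratio = n_distinct_words / n_words  # share of word forms that are distinct

print(f"letters per word: {letters_per_word:.2f}")
print(f"words per verse: {words_per_verse:.2f}")
print(f"distinct / total words: {distinct_ratio:.3f}")
```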
- Ibn Kathir, Ismail (1???). The Beginning and the End (in Arabic). Retrieved 23 January 2011.
- Almubarakfuri, Safiyyurrahman (19??). The Sealed Nectar. Retrieved 24 January 2011.
- Ash-shuri, Majdi (19??). Masterpiece of the Bride. Retrieved 24 January 2011.
- Madi, Mohsen (2010). "Comparative frequency analysis of Arabic Texts". Intellaren. Retrieved 24 January 2011.
- Madi, Mohsen (2010). "Quran Suras Statistics". Intellaren Articles. Retrieved 16 January 2011.