Talk:Ancient text corpora

Languages Mid‑importance

	This article is within the scope of WikiProject Languages, a collaborative effort to improve the coverage of languages on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mid	This article has been rated as Mid-importance on the project's importance scale.

Lists Low‑importance

	This article is within the scope of WikiProject Lists, an attempt to structure and organize all list pages on Wikipedia. If you wish to help, please visit the project page, where you can join the project and/or contribute to the discussion.
Low	This article has been rated as Low-importance on the project's importance scale.

Linguistics High‑importance

	This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
High	This article has been rated as High-importance on the project's importance scale.

Writing systems High‑importance

	This article falls within the scope of WikiProject Writing systems, a WikiProject interested in improving the encyclopaedic coverage and content of articles relating to writing systems on Wikipedia. If you would like to help out, you are welcome to drop by the project page and/or leave a query at the project’s talk page.
High	This article has been rated as High-importance on the project's importance scale.

Phoenicia B‑class Mid‑importance

	This article is within the scope of WikiProject Phoenicia, a collaborative effort to improve Wikipedia's coverage of Phoenicia. If you would like to participate, you can visit the project page, where you can join the project and see a list of open tasks.
B	This article has been given a rating which conflicts with the project-independent quality rating in the banner shell. Please resolve this conflict if possible.
Mid	This article has been rated as Mid-importance on the project's importance scale.

A fact from Ancient text corpora appeared on Wikipedia's Main Page in the Did you know column on 18 June 2023. The text of the entry was as follows:

Did you know... that all known writing in Ancient Hebrew totals just 300,000 words, versus 9.9 million in Akkadian?

A record of the entry may be seen at Wikipedia:Recent additions/2023/June. The nomination discussion and review may be seen at Template:Did you know nominations/Ancient text corpora.

Did you know nomination

The following is an archived discussion of the DYK nomination of the article below. Please do not modify this page. Subsequent comments should be made on the appropriate discussion page (such as this nomination's talk page, the article's talk page or Wikipedia talk:Did you know), unless there is consensus to re-open the discussion at this page. No further edits should be made to this page.

The result was: promoted by Bruxton (talk) 13:47, 12 June 2023 (UTC)

Part of the Akkadian corpus

... that all known writing in Ancient Hebrew totals just 300,000 words, versus 10 million in Akkadian (pictured), 6 million in Ancient Egyptian and 3 million in Sumerian? Source: Peust, Carsten (2000). "Über ägyptische Lexikographie. 1: Zum Ptolemaic Lexikon von Penelope Wilson; 2: Versuch eines quantitativen Vergleichs der Textkorpora antiker Sprachen". Lingua Aegyptia 7 (PDF). pp. 245–260.
- Reviewed: Template:Did you know nominations/Vanadium

Created by Onceinawhile (talk). Self-nominated at 21:45, 1 May 2023 (UTC). Post-promotion hook changes for this nom will be logged at Template talk:Did you know nominations/Ancient text corpora; consider watching this nomination, if it is successful, until the hook appears on the Main Page.

Article is new, well-cited, long enough (though it reads more like a list than an article). The hook is interesting; the source is in German, so I'm assuming good faith. QPQ is done, no copyvio found by earwig. I like the image, approve the hook with it. Artem.G (talk) 20:19, 5 May 2023 (UTC)

@Onceinawhile and Artem.G: It took me a while to find the cited 300,000 claim. But the above hook undershoots the total by 5,500. It was hard to find the claim because our article does not have the wording of Ancient Hebrew or Biblical Hebrew. Instead it says Hebrew Bible. Maybe we should not be piping the link? Bruxton (talk) 17:21, 6 May 2023 (UTC)

Another ping @Onceinawhile and Artem.G:. Bruxton (talk) 22:36, 13 May 2023 (UTC)

and @Onceinawhile and Artem.G: Bruxton (talk) 22:36, 13 May 2023 (UTC)

Sorry, missed the first one somehow. Thanks for checking the source, I agree that it's better to remove the pipe. Artem.G (talk) 05:55, 14 May 2023 (UTC)

@Bruxton:, thank you for following up here. You were right - the reference was not clear. I have clarified the reference, and also added a quote from the underlying source for the original numbers (Clines). With that I have added the pipe into the article, so I think it is OK to stay here. I have also amended the article so it matches the 300,000 here, which is the correct number. The confusion, which I have now clarified, is that Peust, on whom the estimate is based, excludes the definite article from the word count to ensure consistency between the languages. Onceinawhile (talk) 10:03, 14 May 2023 (UTC)

Thanks. I would like to see what @Cielquiparle: thinks about the numbers. I think we need to be precise, but I am not sure the hook is. Bruxton (talk) 14:58, 14 May 2023 (UTC)

I hope you end up finding a perfect hook because I think this is a great topic. ★Trekker (talk) 19:45, 15 May 2023 (UTC)

@Cielquiparle: can you help with this nomination? Bruxton (talk) 00:28, 31 May 2023 (UTC)

@Bruxton: Fascinating topic but I'm not familiar enough with the Wikipedia rules for articles like this (statistics presented in list form). Cielquiparle (talk) 16:32, 31 May 2023 (UTC)

@Onceinawhile and Artem.G: I am not seeing any progress here, so I am going to be bold by tweaking and shortening the hook. The number in our article is 9.9 million but the hook said 10 million. Bruxton (talk) 13:44, 12 June 2023 (UTC)

ALT1: ... that all known writing in Ancient Hebrew totals just 300,000 words, versus 9.9 million in Akkadian (pictured)? Bruxton (talk) 13:44, 12 June 2023 (UTC)

Papyrus Amherst 63

The whole thing has been published. See van der Toorn 2018. Srnec (talk) 01:36, 4 May 2023 (UTC)

Fantastic new article, thanks Srnec. Onceinawhile (talk) 06:26, 4 May 2023 (UTC)

"There are also two old African languages that have hardly been explored"

To what does this refer? Srnec (talk) 14:49, 15 May 2023 (UTC)

Good point - this referred to the following two bullets, but was not well formatted. I have copyedited the whole section so it reads more clearly. Onceinawhile (talk) 20:11, 15 May 2023 (UTC)

Definite article

To editor Onceinawhile: "The definite article incorporated in languages such as Hebrew, Aramaic, and Greek has no equivalent in most languages, so its frequency would significantly affect the comparability of numbers - this is excluded in the estimates below." The meaning of "incorporated" should be explained, as it isn't obvious that you are referring to the practice of attaching the article to the word it qualifies (I assume). Also, one could compare Hebrew to languages with separate articles either by counting Hebrew articles as separate words or by not counting articles in the other languages; which is it? Finally, why focus on definite articles when there are other common prefixes which are likewise usually not separate words in Hebrew? Zero^talk 12:47, 17 May 2023 (UTC)

Thanks Zero0000. The only reason for this is that Peust does it this way, and his is the most comparable source across the widest number of languages. FYI he writes: Demgegenüber möchte ich den bestimmten Artikel des Hebräischen, Aramäischen und Griechischen nicht berücksichtigen, da er in den meisten Sprachen keine Entsprechung hat, wo er aber existiert, durch seine Häufigkeit die Zahlen deutlich beeinflussen würde. Onceinawhile (talk) 15:30, 17 May 2023 (UTC)

The paragraph as a whole gives a clearer description. My google-inspired translation: "Another fundamental problem lies in the underlying definition of a word. I have generally understood prepositions as words in their own right, even where they are traditionally written together with the noun (e.g. Hebrew). On the other hand, I do not want to consider the definite article of Hebrew, Aramaic, and Greek, since it has no equivalent in most languages, and where it occurs its frequency would significantly affect the counts." I did not even guess at this meaning when I read your sentence with "incorporated". I propose to replace the sentence with "Attached prepositions are counted as separate words, except in the case of the definite article in Hebrew, Aramaic and Greek." Zero^talk 01:45, 18 May 2023 (UTC)

Thanks – I have done this. Onceinawhile (talk) 20:01, 31 May 2023 (UTC)

Plural

Fascinating article! Just...is it my misunderstanding, but we seem to be using two of the possible three plural forms of corpus in the article; couldn't we be consistent? Using two different forms to express the same meaning seems as awkward as using both center and centre in the same paragraph would be. Happy days, ~ Lindsay^Hello 09:15, 18 June 2023 (UTC)

Thank you - good point. Have fixed this. Onceinawhile (talk) 11:24, 18 June 2023 (UTC)

Irish?

Hi @Sheila1988: thanks for adding the Irish corpus. Unfortunately I am not sure any of it dates to the period of Ancient history – defined in this article as ending in 300 AD – so it might not be in scope here? Onceinawhile (talk) 19:36, 18 June 2023 (UTC)

Yes, thanks, I'll remove it. Sheila1988 (talk) 19:41, 18 June 2023 (UTC)

Script?

"Canaanite and Aramaic inscriptions" is not a script. That link needs to be removed from the table. Srnec (talk) 20:44, 18 June 2023 (UTC)

Perhaps the label could be changed to "Northwest Semitic scripts" or similar. We don’t have an article on that topic, only the individual articles Phoenician alphabet, Aramaic alphabet and Paleo-Hebrew script. Each of these articles talks about its related scripts, but there is no parent article. The closest we have at the moment is the lede of Canaanite and Aramaic inscriptions – particularly the quotes in footnotes 8-11 – but really we need a proper article on the script family itself. Onceinawhile (talk) 21:29, 18 June 2023 (UTC)

Agree that we need an article on the Northwest Semitic scripts. But the link itself is problematic, since we are not talking mainly about inscriptions. I'm fine with a red link. Srnec (talk)

Phoenician

The Phoenician line should perhaps also be removed from the table for now because it lacks an actual size estimate. Srnec (talk) 20:49, 18 June 2023 (UTC)

It does have a size estimate, for the number of texts – which is shown in the table. There is also an estimate for the number of words, which is stated in the footnote but not included in the table as it conflicts with the number of texts estimates. Onceinawhile (talk) 21:03, 18 June 2023 (UTC)

So do many languages not in the table (e.g., Lydian). What are the criteria for inclusion in the table? Srnec (talk) 21:46, 18 June 2023 (UTC)

The table should include all languages where we have sourced numerical estimates. The only reason there are some in the list below and not yet in the table is because I ran out of time to put them in. I was prioritizing the big ones – and busy unsuccessfully looking for size estimates for ancient Chinese and ancient South Asian corpora. Onceinawhile (talk) 21:59, 18 June 2023 (UTC)

Aramaic

The Aramaic word count in the table looks like an out-of-date lower bound to me. This source (p. 60) puts the Imperial Aramaic corpus at 237,970 words, citing Stephen Kaufman. This one puts the total Aramaic corpus down to the 13th century at 3 million words. This source puts the Qumran corpus at 21,068 words. Taking these numbers and those in the article right now, I get to at least 287,000 words of pre-3rd century AD Aramaic. I think we should amend the number in the table to read >280,000 and add the two sources I've cited. Srnec (talk) 21:46, 18 June 2023 (UTC)

Thanks for bringing these. There is definitely more to do on Aramaic, as Peust described it as "confusing" ("unübersichtlich") and he is not specialized in the topic like Kaufman. I have fixed the existing footnote to make Peust's position clearer. The Dead Sea Scrolls figure is a similar count (21,068 vs 15,000 - Peust adjusts for the definite article), and the idea of 3 million is also consistent with Peust, who states that after 300 AD the Aramaic corpus "sprunghaft an, da sich jetzt mehrere große Literatursprachen ausbilden" (roughly: grows by leaps and bounds, as several major literary languages now develop).

Kaufman's "CAL ImpAram" estimate of 237,970 should definitely be added, but it is puzzling as it is very different to Peust's figures. Note that Cook (your Brill source) wrote in 2022 that the primary source for Imperial Aramaic is the Egyptian texts (see footnote 6 at Textbook of Aramaic Documents from Ancient Egypt; the way Cook describes the corpus is very similar to Peust's description). Peust estimated the first 3 volumes of TADAE to contain 20,000 words (the fourth volume was printed in the same year as Peust's article so I suspect he didn't have access to it). Onceinawhile (talk) 22:37, 18 June 2023 (UTC)

"Confusing" it certainly is. Holger Gzella, A Cultural History of Aramaic: From the Beginnings to the Advent of Islam (Brill, 2015), p. 166, says somewhat vaguely that the Elephantine material is "the most extensive part of the total corpus" of Achaemenid Official Aramaic. But on p. 109, he says that "Mesopotamia has yielded the lion's share of the available evidence" for Aramaic in the Neo-Assyrian and Neo-Babylonian periods. So it may depend on what is meant by "Imperial Aramaic", whether it includes the Assyrian and Babylonian periods or not. Gzella has a more recent book on Aramaic, too, but from Google Books I cannot see that he gives word counts. Srnec (talk) 23:58, 20 June 2023 (UTC)
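
Note on the arithmetic in this thread: the proposed lower bound of 287,000 words can be partly checked from the figures quoted here. The sketch below is only a rough tally, not a sourced estimate. It assumes the Imperial Aramaic and Qumran counts do not overlap and can simply be added, and the variable names are made up for illustration. The remainder would have to come from figures already in the article, which are not restated in this discussion.

```python
# Word counts quoted in the thread above.
imperial_aramaic = 237_970   # Kaufman's "CAL ImpAram" figure, as cited by Srnec
qumran_aramaic = 21_068      # Qumran corpus figure, as cited by Srnec
claimed_lower_bound = 287_000  # Srnec's "at least 287,000 words"

# Assumption: the two corpora are disjoint, so their counts can be summed.
cited_subtotal = imperial_aramaic + qumran_aramaic

# Whatever is left must come from numbers already in the article,
# which this thread does not restate.
remainder_from_article = claimed_lower_bound - cited_subtotal

print(cited_subtotal)          # 259038
print(remainder_from_article)  # 27962
```

On these assumptions the two cited sources account for about 259,000 words, so the proposed ">280,000" depends on roughly 28,000 further words taken from the article's existing figures.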