Language desk

Welcome to the Wikipedia Language Reference Desk Archives

May 3

Invisible word breaks in Chinese

When I double-click on text, a word is selected. But Chinese is written without spaces. If I double-click on a string of three or more Chinese characters, one or two characters are selected, and I cannot get overlapping pairs by double-clicking on different characters. Many modern Chinese words are two characters … but how does my browser (or OS) know which pairs are words? I have not found evidence of hidden zero-width breaks. (If it is the sense of the assembly that this belongs in Computing, I will of course move it there.) --Tamfang (talk) 00:09, 3 May 2024 (UTC)

Probably from a word list; there's nothing in the writing system itself which would indicate this as far as I know... AnonMoos (talk) 00:31, 3 May 2024 (UTC)

You don't say which OS, but this is the sort of thing modern OSes have built in. On macOS, for example, the OS can translate any text you can select, including text in images, so it makes sense that it helps you select blocks of text that are sensible to translate, e.g. by treating 'words' of two characters as single blocks. To do so, it probably has to parse not just one or two characters but also those surrounding them, however many are needed. --217.23.224.20 (talk) 11:03, 3 May 2024 (UTC)

This was really exciting actually! Firefox recently pushed an update allowing the selection of text by word in unspaced languages. If I had to guess, you use either Firefox or another app where this was recently implemented. As someone who's only been learning Chinese for a few years, I will soon be shocked that people used software for so long that didn't have this ability. Remsense诉 11:21, 3 May 2024 (UTC)

WHAAOE: Text segmentation Aecho6Ee (talk) 22:21, 3 May 2024 (UTC)

Building upon the Firefox point above: apparently the relevant update was 122.0, where they call it "language-aware word selection." It is controlled by the flag intl.icu4x.segmenter.enabled, which suggests the feature uses the segmenter module of the ICU4X Unicode library. Looking at the code (or reading the comments, rather), the segmenter is apparently "using the LSTM model when available and the dictionary model for Chinese and Japanese." Aecho6Ee (talk) 22:21, 3 May 2024 (UTC)
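
To make the "dictionary model" idea concrete, here is a minimal Python sketch of dictionary-based segmentation using forward maximum matching. This is a deliberately simple stand-in and not ICU4X's actual algorithm; the names segment, word_at and LEXICON are hypothetical, and the four-entry word list is a toy made up for illustration. It does show why double-clicking never yields overlapping pairs: the text is cut once into non-overlapping segments, and a click just picks the segment containing the clicked character.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest
    lexicon entry that matches; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


def word_at(text, index, lexicon):
    """Return the (start, end) span of the segment containing `index`,
    roughly what a double-click on that character would select."""
    start = 0
    for word in segment(text, lexicon):
        end = start + len(word)
        if start <= index < end:
            return start, end
        start = end
    raise IndexError(index)


# Toy word list (an assumption for illustration, not a real lexicon).
LEXICON = {"我们", "喜欢", "语言", "语言学"}
TEXT = "我们喜欢语言学"

print(segment(TEXT, LEXICON))      # the segments found in TEXT
print(word_at(TEXT, 2, LEXICON))   # span selected when clicking 喜
print(word_at(TEXT, 3, LEXICON))   # span selected when clicking 欢
```

Clicking either character of 喜欢 returns the same span, and no click can select the overlapping pair 们喜, because that pair is not a segment. Real segmenters typically weigh alternative splits rather than greedily taking the longest match.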

Neat, thanks for explicating! Remsense诉 22:55, 3 May 2024 (UTC)

Resolved

--Tamfang (talk) 18:15, 6 May 2024 (UTC)

Russian sentence?

Dear hive-mind, I came across this sentence: "[...] Павла Григорьевича Г. (род. около 1861), привлекавшегося в 1885 Варшавским губ. жанд. управлением к дознанию по второму делу «Пролетариата»." (https://imwerden.de/pdf/minuvshee_02_1986__ocr.pdf). Does this mean that Pavel Grigorevich was recruited by the Warsaw Gendarmerie to intervene in the Proletariat case, or that he could have been arrested or accused in the case? -- Soman (talk) 11:44, 3 May 2024 (UTC)

[1] indicates that he was investigated. --Soman (talk) 11:59, 3 May 2024 (UTC)