Talk:Null-terminated string

Single page for C string functions

Based on Talk:C standard library#Pages for each function and WP:NOTMANUAL

The following pages essentially discuss the same topic of C string functions: string.h, memset, strcpy, strlcpy, strcat, strrchr, strcspn, ctype.h, strcmp, strlen, memccpy, mempcpy. I propose to clean up these pages by removing the material that fails WP:NOTMANUAL and by merging the remainder into C string. 1exec1 (talk) 23:29, 8 October 2011 (UTC)

memset, memcpy, and mempcpy don't operate on C strings (that is, NUL-terminated strings). They operate on buffers with a length specified by one of the arguments. However, I agree that there should not be one function per page. Jeh (talk) 04:33, 11 October 2011 (UTC)
I am unsure myself what to do with these functions. They do not really belong to C string, that's a fact. On the other hand, IMO they are not significant enough to warrant a separate page, and even if we chose to create one, there's hardly a good descriptive name for it. I myself tried to think of one, but all of them fit better for functions like malloc, e.g. C memory handling, C memory operations, etc. Going further, mem* functions are in the same header as str* functions, and most of the references I could find, e.g. cplusplus.com, cppreference.com, the C standard, preserve such grouping. Thus I think we wouldn't make a big mistake by sticking with established sources. Given these arguments, I think it's a good option to merge mem* into str*. That said, if you've got a sensible alternative, I'd happily reconsider. 1exec1 (talk) 09:10, 11 October 2011 (UTC)
If some page is merged into C string, then we are unable to see the content of that page. How can one read the information on that page after merging?
Sagar tikore (talk) 06:58, 11 October 2011 (UTC)
The content wouldn't disappear anywhere; it would just be placed in this article rather than in these separate articles. 1exec1 (talk) 09:10, 11 October 2011 (UTC)
Instead of merging the functions into C string, we can merge them into string.h.
Asmita yendralwar (talk | contribs) 07:58, 11 October 2011 (UTC)
I don't think this would be a good option since it would be inconsistent with other pages discussing C standard library, e.g. C memory operations, C input/output.1exec1 (talk) 09:10, 11 October 2011 (UTC)
I agree with merging all of string.h into this page (there is a table already there), removing all the trivial pages for the individual functions, and changing them to redirects. However, some pages with more information, in particular the strlcpy page, need to be preserved (this page contains a bit of history and political intrigue which is interesting but would bloat this page and make the table hard to read). The mem functions should be merged here as well; they are part of string.h and are often used to manipulate C strings. Spitzak (talk) 19:40, 13 October 2011 (UTC)
I also agree to merge all string.h here. Yogesh.rathod07 (talk) 06:40, 17 October 2011 (UTC)
I made a few modifications. In particular, I duplicated the old string.h table and text here, as it seems to be more accurate and carefully checked. It also lists some of the alternative functions which we otherwise lost. I also restored the page for ctype.h, as I was under the impression this was to be divided up by header file (and also that file is not really dependent on null-terminated strings). I also restored the gets and strlcpy pages, as they had significant text describing historic details and are often referred to from other Wikipedia articles. Hope this is all ok. Spitzak (talk) 02:40, 19 October 2011 (UTC)
Oops it looks like I deleted all the external links to C/C++ documentation pages. Probably should be restored.Spitzak (talk) 02:54, 19 October 2011 (UTC)
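To illustrate the distinction raised earlier in this thread (mem* functions work on a buffer plus an explicit length, while str* functions stop at the first zero byte), here is a minimal C sketch. The helper name visible_length_after_copy is made up for this example and is not part of string.h; it assumes the destination has room for n + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a standard function: copy n bytes with memcpy,
 * terminate the copy, and report how many bytes strlen can "see" in it.
 * Assumes dst has room for at least n + 1 bytes. */
size_t visible_length_after_copy(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);   /* copies exactly n bytes, zero bytes included */
    dst[n] = '\0';         /* terminate so that strlen is well defined */
    return strlen(dst);    /* stops at the first zero byte it meets */
}
```

For the 5-byte input "ab\0cd", memcpy preserves all five bytes, yet the function returns 2 because strlen stops at the embedded zero byte.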

strlcpy

I'm going to replace this page with a redirect again. The consensus was to merge all functions whose pages consist of material failing WP:NOTMANUAL, and strlcpy falls into this group of pages. The only section that can be preserved somewhere is Criticism. However, I think we can delete even that material, because it fails WP:NOTABILITY by not having WP:RS to back up the text. The already provided references are not WP:RS because of WP:SPS.
Going further, the page is imported into Wikibooks at b:C Programming/C Reference/nonstandard/strlcpy, so there is absolutely no justification to keep anything failing Wikipedia guidelines here, when the content can be further improved in Wikibooks. 1exec1 (talk) 12:28, 19 October 2011 (UTC)

The criticism is backed by direct quotes from the maintainer of glibc posted on the official glibc mailing list. There was more, but it is repeatedly deleted because of a desire to obfuscate the exact guilty party and to try to claim the argument actually has merit. strlcpy is often mentioned as an indicator of misguided design in Linux and is thus a subject people will look for. I do not think whitewashing this story is good for Linux or for any of the involved parties. Spitzak (talk) 14:38, 19 October 2011 (UTC)
I'm not trying to whitewash anything. I'm just saying there's not enough notable material to warrant a separate article, especially when strlcpy is not a standard function. That's not to say that we must delete that material; a better idea would be to merge the important bits into C string. In this case, we can create a new section called Extensions, move all the non-standard functions there, and place criticisms and other relevant material there. 1exec1 (talk) 16:58, 19 October 2011 (UTC)
While I personally agree with merging the page, you stated that you were going to merge the page but you didn't even do that before blanking it. The best practice according to WP:MERGE is to first obtain consensus, and actually merge the page before redirecting. I'm going to remove the redirect now because clearly the C string page does not cover all of the useful information in the strlcpy page. At the very least, the criticism of strlcpy needs to be addressed before this article can be redirected there. YumOooze (talk) 04:57, 22 October 2011 (UTC)
I think I have addressed your concerns. 1exec1 (talk) 12:17, 23 October 2011 (UTC)
There are a lot of books which describe strlcpy; take a look at Google Books. The article has not been merged: most of the content has been removed with the argument that it violates WP:NOTMANUAL. I think the section "Usage" could be rewritten with a few edits so that it doesn't "violate" WP:NOTMANUAL. And I don't understand why you removed this with the statement that it isn't notable that, e.g., the Linux kernel has ported the function; the kernel cannot use the standard C library, as you may know. I am going to remove the redirect again (until the page has been merged into this article). Christian75 (talk) 15:35, 23 October 2011 (UTC)
Do you want to say that all these thousands of functions that are in the Linux kernel deserve a page? Can you find a secondary WP:RS which justifies the inclusion as per Wikipedia notability criteria? 1exec1 (talk) 17:53, 23 October 2011 (UTC)
In any case, the current consensus is to merge. See this discussion. As you can see, 5 editors (User:strcat, User:Vadmium, Ruud, Michael, User:1exec1) are for the merge, 2 users against (User:Spitzak, User:Christian75) (please fix if I'm wrong). So I have strong reason to undo your changes. Please establish new consensus before reverting. 1exec1 (talk) 18:14, 23 October 2011 (UTC)
Maybe there was a consensus for the idea of merging many articles, but judging by the noise on my watchlist I dunno if there is much consensus for this particular merge from strlcpy into C string. How about something in between, like renaming it to strlcpy and strlcat? Sources and Wikipedia references seem to group them like that anyway. Vadmium (talk, contribs) 10:19, 24 October 2011 (UTC).

strcpy

I am going to replace this page with a redirect again. Almost all content fails WP:GNG, because the only secondary source I could find, that supports the material, is man pages, which is not WP:RS. The remaining is already at C string. Since there has been no recent attempt to fix these issues, except one editor who reverts page blanking, I assume that there is no genuine interest in the article.

I will not bring the article to WP:AFD because there is no intent to delete the page proper. An undoable action should be discussed on the talk page (see WP:AFD, specifically 'For problems that do not require deletion, including <...>, be bold and fix the problem or tag the article appropriately'). 1exec1 (talk) 13:29, 26 October 2011 (UTC)

strcat

I am going to replace this page with a redirect again. This page is in exactly the same situation as strcpy. Almost all content fails WP:GNG, because the only secondary source I could find, that supports the material, is man pages, which is not WP:RS. The remaining is already at C string. Since there has been no recent attempt to fix these issues, except one editor who reverts page blanking, I assume that there is no genuine interest in the article. 1exec1 (talk) 13:29, 26 October 2011 (UTC)

Have you tried a search with Google Books? I could find a lot of secondary sources for the article a minute ago, but it takes time to insert the refs. I think it's a shame to delete it (I know you "merged" it, but the content didn't move with it). There exist a lot of sources for strcat on Google Books, e.g. under the search "strcat buffer overflow". Christian75 (talk) 17:50, 26 October 2011 (UTC)
You mean all those programming books/manuals? They do not constitute significant coverage as per WP:N, because if we remove the 'how to use strcat' portion of that material, only a very limited factual mention is left. As per WP:N, you must find a third-party reliable source in which strcat is one of a few major subjects. 1exec1 (talk) 19:55, 26 October 2011 (UTC)
Please read WP:N again, especially the section you quote, significant coverage, which says "Significant coverage is more than a trivial mention but it need not be the main topic of the source material." Christian75 (talk) 18:47, 28 October 2011 (UTC)
Show specific examples of books about strcat and then we can talk about what fails WP:N and what doesn't. 1exec1 (talk) 09:36, 29 October 2011 (UTC)

character vs byte

I think we should call characters characters, since these functions are for character string manipulation. I agree that there's an issue with multi-byte characters, but using bytes doesn't completely remove the source of confusion either, as the reader still must know that characters longer than a single byte exist. What if we changed bytes back to characters and added a notice that str* functions operate on single-byte characters? 1exec1 (talk) 19:05, 18 October 2011 (UTC)

Saying "it only works on the one-byte characters" is wrong, because the string operations will work on the individual bytes that make up parts of multi-byte characters (for instance you can count the number of characters, assuming no bad encoding, by counting the bytes that don't start with 10 binary, thus there are useful operations you can do working with the bytes). The proper term for the units it operates on is "whatever your C compiler means when you say 'char'" but that is hard to read, looks like the word 'character' misspelled, and 'byte' is probably a much more popular term. The C99 documentation is technically correct because they define the word "character" as being "char", but that is not how the word "character" is defined in any wikipedia article about text.
The main problem is that there are a lot of programmers out there who are just smart enough to do horrible things when they think that strlen() has to return the 'number of characters'. If they were a bit stupider we would be ok because they would not get anything to work. But there seems to be an overlap, perhaps best defined as 'idiot savants' or something, where they will actually write working but horrifically complicated code because they took the word "character" literally. These code writers are probably the biggest impediment to getting Unicode to work. There are active attempts to clear up the documentation, such as the BSD man pages which I was quoting, but there remains a huge amount of legacy documentation, including material from standards organizations. Anyway, I see no reason not to have Wikipedia use modern notation. Spitzak (talk) 02:05, 19 October 2011 (UTC)
Ok, I agree. C++11 uses byte string to name single byte character strings, so I think it's a good idea to stick with it. 1exec1 (talk) 02:26, 19 October 2011 (UTC)
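The byte-counting trick mentioned above (count the bytes that do not start with binary 10) can be sketched in C as follows. The function name utf8_count_code_points is made up for this example, and the sketch assumes the input is valid, NUL-terminated UTF-8; it counts code points, which is not necessarily what a reader would perceive as characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string,
 * assuming it holds valid UTF-8. Continuation bytes have the bit pattern
 * 10xxxxxx, so every byte that does NOT match it starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)   /* not a continuation byte */
            ++count;
    }
    return count;
}
```

For the 6-byte string "h\xC3\xA9" "llo" (the word "héllo"), strlen reports 6 bytes while this function reports 5 code points, which is the gap between "byte" and "character" that this section is about.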

Null-terminated string

Shouldn't this article be located at null-terminated string (or NUL-terminated string)? And primarily focus on null-terminated strings instead of C's string library? —Ruud 00:21, 19 October 2011 (UTC)

There's actually not much to say about null terminated string itself apart from the definition. Everything comes down to the operations that are defined on these strings, and the properties of these operations. C string library is the most widely used interface to these operations, so the attention to it seems reasonable to me. 1exec1 (talk) 01:41, 19 October 2011 (UTC)
You could also say a few other things. I haven’t really read the article :P but perhaps a comparison with other ways of storing strings and its relative strengths and weaknesses; languages and other applications where it is used? Vadmium (talk, contribs) 07:55, 24 October 2011 (UTC).
There is a comparison with a leading length at the start of the article!
Okay, so there is in the history section. And there’s more at String (computer science)#Representations. Vadmium (talk, contribs) 10:55, 24 October 2011 (UTC).
I would have to disagree with that. One can easily discuss the asymptotic complexity of various operations on null-terminated strings in terms of abstract functions. In my opinion this article should either be split into an article on null-terminated strings and an article on "Strings in the C programming language", or the latter should be more clearly made into a sub-section of an article whose primary topic is null-terminated strings. —Ruud 13:43, 24 October 2011 (UTC)
I agree with the suggestion to split the article. 1exec1 (talk) 14:22, 24 October 2011 (UTC)

Requested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was: page moved per consensus in the discussion. Vegaswikian (talk) 22:56, 5 November 2011 (UTC)



C string → Null-terminated string. Relisted. Discussion is ongoing and may lead to something other than a rename. Vegaswikian (talk) 05:13, 31 October 2011 (UTC) Common and language-neutral name. —Ruud 13:55, 24 October 2011 (UTC)

How is the term C string not neutral? --199.91.207.3 (talk) 17:35, 24 October 2011 (UTC)
I would conjecture that the terms "C string" and "Pascal string" are mostly used by C programmers interfacing with libraries developed for different ABI's, while computer scientists and programmers from other languages would prefer to use the more descriptive terminology "null-terminated" and "length-prefixed" strings. The former already requires you know that C uses strings which are terminated by a null character and Pascal uses strings which are prefixed by their length, while this is self-evident with the latter. —Ruud 20:59, 24 October 2011 (UTC)
I think the term "Pascal string" means a 1-byte prefix length, not just the fact that a length is stored.Spitzak (talk) 23:50, 24 October 2011 (UTC)
True. So a Pascal string would be a particular kind of length-prefixed string. If we would have an article on that topic (which I don't believe we have at the moment), it would likely discuss all kinds of length-prefixed strings, not just 1-byte-length prefixed ones. —Ruud 00:58, 25 October 2011 (UTC)
I would support either a move to Null-terminated string, and/or integration with String (computer science)#Null-terminated, especially if the C stuff is to be a separate article. Vadmium (talk, contribs) 05:14, 25 October 2011 (UTC).
Maybe we can move the article containing the remaining C stuff to C standard string functions or similar title? 1exec1 (talk) 18:15, 29 October 2011 (UTC)
I'd prefer something like "String handling in the C programming language" or (more ambiguously, but more concise) "String handling in C". —Ruud 09:50, 31 October 2011 (UTC)
I see one problem with a title like this: it is not consistent with other pages about the C standard library, like C mathematical functions and so on. In my opinion we should have all articles in either one format or the other. If we change all titles to Mathematical functions in C and similar, they become much more ambiguous, because then they refer to all functions (i.e. not necessarily standard ones) in the particular domain of C. The current solution mostly works, because when saying C mathematical functions, C standard mathematical functions is naturally implied (I must agree that this assertion might be far-fetched, as I'm not a native speaker of English). An alternative solution might be something like Standard mathematical functions in C, but this also doesn't sound good (and might be grammatically incorrect; again, I'm not a native speaker). Thus I think that, while certainly not ideal, C standard string functions or C string functions might be the best option. However, if we decided to ignore the consistency issue, I would agree that String handling in C is an appropriate title. 1exec1 (talk) 23:29, 31 October 2011 (UTC)
I think I've already indicated that I find titles such as "C dynamic memory management" to be pretty awkward and that titles such as "Dynamic memory management in C" more clearly indicate the article is actually a sub-article of both Dynamic memory management and C (programming language). Perhaps the title and scope should even be Memory management in C, clearly linked with a {{main|Memory management in C}} from C (programming language)#Memory management. —Ruud 11:53, 1 November 2011 (UTC)
Ok, you finally convinced me. My previous argument is incorrect in that the scope of the articles is actually broader than the standard functions, as is evident in, for example, the current C string page. So now I think that the in C titles not only sound well, they represent the current and potential scope of the articles much better. Is a change from C *** to *** in C a non-controversial move? Can I implement it without a discussion? 1exec1 (talk) 17:36, 1 November 2011 (UTC)
I've created a centralized discussion at Talk:C_standard_library#Move_articles_about_C_standard_library_from_C_.2A.2A.2A_to_.2A.2A.2A_in_C. 1exec1 (talk) 12:51, 2 November 2011 (UTC)
The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

NULL char in UTF-8

In response to this edit: A NULL char is not a part of any valid UTF8 sequence; all characters in all multibyte sequences start with a 1 bit. However, you can encode a NULL char into a UTF8 stream with a 0xC0, 0x80 sequence, which then becomes a 0x0000 when converted to UTF16. - Richfife (talk) 16:09, 13 September 2013 (UTC)

Zero is a valid code point, and the UTF-8 encoding of it is a NUL (\0) byte. An 0xC0, 0x80 sequence in a UTF-8 string is an invalid overlong encoding. In fact, the overlong encoding of NUL is used as an example of a security issue from an incorrect UTF-8 implementation in the RFC. strcat (talk) 17:40, 13 September 2013 (UTC)
See the encoding table provided by the Unicode consortium as a resource. strcat (talk) 17:46, 13 September 2013 (UTC)
That is incorrect, you are describing "modified UTF-8" as used by tcl and some other systems. These also incorrectly encode non-BMP as 6 bytes. Officially the encoding of U+0000 in UTF-8 is a single 0 byte.
However I have reverted this, as the 0 character also exists in ASCII and yet it claims that ASCII works. Thus "works" is defined as "works for all characters other than the zero one". The edit implies that UTF-8 is not supported as well as ASCII, which is false, it is supported equally well with exactly one character not represented.Spitzak (talk) 23:57, 13 September 2013 (UTC)
I am not describing modified UTF-8, but I think you may just be responding to the parent comment. strcat (talk) 03:54, 19 September 2013 (UTC)
Can't we say that NUL (\0) is supported by ASCII, extended ASCII and UTF-8, but NUL is not supported by null-terminated strings, since NUL has an internal meaning to null-terminated strings? As long as NUL is not part of the encoded text (only TAB, CR and LF are among the control chars), both ASCII and UTF-8 can be stored in null-terminated strings. Both ASCII and UTF-8 might sometimes demand storage of NUL, but if you want to store binary data, why not store it as binary data, not as ASCII, UTF-8, UTF-16 or null-terminated strings? --BIL (talk) 09:09, 14 September 2013 (UTC)
UTF-8 considers NUL to be totally valid text data. It's the encoding of a valid code point - unlike the explicitly forbidden range of surrogates. I added sources for this, and it was reverted to the previous inaccurate claim of \x00 not being valid UTF-8. strcat (talk) 03:56, 19 September 2013 (UTC)
There is no difference between UTF-8 and ASCII with respect to null terminated strings. Actually UTF-8 preserves backwards compatibility with traditional null terminated strings, enabling more reliable information processing and the chaining of multilingual string data with Unix pipes between multiple processes. Using a single UTF-8 encoding with characters for all cultures and regions eliminates the need for switching between code sets. See Lunde, Ken (1999). CJKV information processing. O'Reilly Media. p. 466. ISBN 978-1-56592-224-2. Retrieved 2011-12-23. So current wording should be fixed. AgadaUrbanit (talk) 07:02, 19 September 2013 (UTC)
Regardless of whether there's a difference between ASCII and UTF-8, not all UTF-8 can be stored in a null-terminated string per the Unicode and UTF-8 standards (given as a source). They are the only authoritative sources here because they define the encoding. 99.231.135.5 (talk) 01:39, 20 September 2013 (UTC)
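The two points argued in this section can be checked mechanically. The following C sketch uses two made-up helper names (fits_in_nul_terminated and has_overlong_nul) and works on a buffer with an explicit length, since a buffer that may contain zero bytes cannot be measured with strlen. It is only an illustration, not a UTF-8 validator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the len-byte buffer contains no zero byte,
 * i.e. a NUL-terminated copy would preserve all of it. Text that includes
 * U+0000 (encoded in UTF-8 as a single zero byte) fails this check. */
bool fits_in_nul_terminated(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) == NULL;
}

/* Hypothetical helper: true if the buffer contains the byte pair 0xC0 0x80,
 * the overlong form of U+0000 that standard UTF-8 treats as invalid
 * (it appears in "modified UTF-8" variants). */
bool has_overlong_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    const unsigned char *p = (const unsigned char *)buf;
    for (size_t i = 0; i + 1 < len; ++i) {
        if (p[i] == 0xC0 && p[i + 1] == 0x80)
            return true;
    }
    return false;
}
```

Under these definitions the 3-byte buffer "a\0b" is valid UTF-8 (and valid ASCII) but does not fit in a null-terminated string, while a buffer containing 0xC0 0x80 fits but is not standard UTF-8.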

Do many small strings imply duplicates?

My concern is with the statement, "On modern systems memory usage is less of a concern, so a multi-byte length is acceptable (if you have so many small strings that the space used by this length is a concern, you will have enough duplicates that a hash table will use even less memory)." I can write a program that generates many small strings that are not duplicates. Therefore I do not believe that if I have many small strings then there will be duplicates. There might be many programs where many small strings are duplicates and a hash table will use less memory (e.g. the symbol table in a compiler), but I cannot see that this is always true of all programs. — Preceding unsigned comment added by 80.195.2.190 (talk) 12:56, 6 September 2016 (UTC)

If "many" is considered to mean tending towards infinity, then there will be duplicate strings after all permutations of small strings are generated. So in that sense many strings implies duplicates. However, consider 8-bit byte characters, where there are 2^8 = 256 different characters, of which the zero character '\0' (NUL) can be used as a terminator, leaving 255 other characters from which to form strings. Now consider the number of bytes required to store all permutations of strings of short length for a NUL-terminated string representation and for a string representation having a 4-byte length:

length L | permutations P        | size NUL terminated = P * (L+1) | size 4-byte length = P * (L+4)
0        | 255^0 = 1             | 1                               | 4
1        | 255^1 = 255           | 510                             | 1,275
2        | 255^2 = 65,025        | 195,075                         | 390,150
3        | 255^3 = 16,581,375    | 66,325,500                      | 116,069,625
4        | 255^4 = 4,228,250,625 | 21,141,253,125                  | 33,826,005,000

To store all permutations of strings up to length 4 requires roughly 20 gigabytes in a NUL-terminated string representation and roughly 32 gigabytes in a string representation with a 4-byte length. So a hash table storing such short strings will exhaust current memory sizes before all permutations of short strings can be generated. Therefore, for hash tables stored in current memory sizes, many short strings do not necessarily imply there will be duplicate strings. — Preceding unsigned comment added by 80.195.2.190 (talk) 10:10, 12 January 2017 (UTC)
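The figures in the table above can be reproduced with a short C sketch. The function names permutations and storage_bytes are made up for this example; the overhead parameter is 1 for the NUL terminator and 4 for a 4-byte length, matching the two size columns. The length limit of 7 and the overhead limit of 8 are arbitrary bounds chosen only to keep the arithmetic within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of distinct strings of exactly `length`
 * bytes drawn from the 255 non-zero byte values, i.e. 255^length. */
uint64_t permutations(unsigned length)
{
    assert(length <= 7);   /* arbitrary bound, keeps results within 64 bits */
    uint64_t p = 1;
    for (unsigned i = 0; i < length; ++i)
        p *= 255;
    return p;
}

/* Hypothetical helper: bytes needed to store every such string once,
 * with `overhead` extra bytes per string (1 = NUL terminator,
 * 4 = 4-byte length prefix), i.e. P * (L + overhead). */
uint64_t storage_bytes(unsigned length, unsigned overhead)
{
    assert(overhead <= 8); /* arbitrary bound, same reason as above */
    return permutations(length) * (length + overhead);
}
```

For example, storage_bytes(4, 1) gives the 21,141,253,125 bytes listed for NUL-terminated strings of length 4, and storage_bytes(4, 4) gives 33,826,005,000 for the 4-byte length representation.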

Second column seems off, I get 4, 1275, 390150, 116069625, 33826005000.
However storing these as independent strings would overflow memory just as much as the hash table. The assumption is that the set of strings fits in available memory, and that there are collisions because some strings are used much more often than others ("index" is probably used much more than "X&*v@" in a programming language). Though the length adds 3 bytes (vs nul-terminated) and the hash table adds H more bytes (where H ~= 16), each collision saves length+4+H bytes. So if there are N collisions out of M total strings then the hash table costs M*(3+H)-N*(length+4+H) extra bytes, which could be negative. However I have no idea how to test this on real data or prove whether it is positive or negative.Spitzak (talk) 18:18, 12 January 2017 (UTC)
Thank you for calculating the correct values. I've now fixed the numbers in that column. I basically agree with your analysis, but to be precise it is duplicate strings rather than collisions that save space in the hash table, because an imperfect hash function can generate the same index, and hence a collision, for different strings, and each different string needs to be stored, so space is saved only for duplicates. I share your intuition that there are many real-world data sets where a hash table (or trie, or other data structure) will save memory by exploiting the frequency of duplication. For a particular application I guess we could get a sufficiently large & representative set of data, and use some kind of statistics-keeping memory management to compare different data structures.
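As a rough way to experiment with the estimate given earlier in this thread, M*(3+H)-N*(length+4+H), here is a minimal C sketch. The function name hash_table_extra_bytes is made up, N is read as the number of duplicate strings (as the comment above suggests, rather than hash collisions), and a single average length is assumed for all duplicates, which real data will not satisfy exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extra bytes a hash table costs compared with plain
 * NUL-terminated storage, per the estimate M*(3+H) - N*(length+4+H).
 *   total_strings      M: number of strings presented to the table
 *   duplicates         N: how many of those are repeats of a stored string
 *   length             assumed average length of a duplicated string
 *   per_entry_overhead H: assumed hash table overhead per entry, in bytes
 * A negative result means the hash table uses less memory overall. */
int64_t hash_table_extra_bytes(int64_t total_strings, int64_t duplicates,
                               int64_t length, int64_t per_entry_overhead)
{
    assert(duplicates >= 0 && duplicates <= total_strings);
    assert(length >= 0 && per_entry_overhead >= 0);
    return total_strings * (3 + per_entry_overhead)
         - duplicates * (length + 4 + per_entry_overhead);
}
```

With M = 1000, H = 16 and no duplicates the estimate is 19,000 extra bytes; with 900 duplicates of average length 5 it becomes -3,500, so the outcome depends entirely on how often strings repeat in the data set.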