Talk:Null character

WikiProject Computer Security / Computing (Rated Start-class, Mid-importance)
This article is within the scope of WikiProject Computer Security, a collaborative effort to improve the coverage of computer security on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Start-Class on the project's quality scale and as Mid-importance on the project's importance scale.
This article is supported by WikiProject Computing.

Java null

Dcoetzee, Java does have a null terminator. If you don't believe me, click here. [[User:Supadawg|supadawg - talk - contribs]] 15:02, 14 Aug 2004 (UTC)

Java has a null terminator character, since it uses the ASCII character set, but it is not, as your modification explicitly said, used to terminate strings. Java strings are objects and can contain embedded nulls. Deco 23:01, 14 Aug 2004 (UTC)

The page is called "Null character". We've just established that Java has a null character. Therefore, it should be included, with a change of wording in the introduction.
P.S. I won't change the page until we agree on something, to avoid an edit war. [[User:Supadawg|supadawg - talk - contribs]] 23:08, 14 Aug 2004 (UTC)

Java does not use the ASCII character set; it uses Unicode. The null character exists, but it is not called a "null terminator character" because it is not used to mark the end of a string.

You're right, sorry for my errors. In any case the null character has no special meaning in Java. Deco 06:39, 20 August 2005 (UTC)
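
A minimal sketch of the distinction settled in this thread, using Python rather than Java purely for illustration (that substitution is an assumption of this note, not something from the thread). A string that carries its own length can hold an embedded null character, while a C-style reading of the same bytes stops at the first one. `ctypes.create_string_buffer` is used here only to obtain that C-style reading.

```python
import ctypes

# A length-carrying string: the null character is an ordinary element.
text = "ab\0cd"
text_length = len(text)

# The same bytes in a C buffer. `.raw` is the whole buffer, including the
# terminator that ctypes appends; `.value` reads it as a null-terminated
# string and therefore stops at the first null byte.
buf = ctypes.create_string_buffer(b"ab\0cd")
c_view = buf.value
whole_buffer = buf.raw

print(text_length, c_view, whole_buffer)
```

The embedded null survives in `text` and in `whole_buffer`; only the null-terminated reading loses everything after it.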

Efficiency

I would really like to see a source for this:

"Null-terminated strings can also have efficiency benefits, since operations that traverse a string don't need to keep track of how many characters have been seen, and operations which modify the string's length do not need to update the stored length. Cache performance can also be better."

So I have added a "citation needed" tag. For one, on most architectures a "subtract and jump if not zero" is identical to a "compare and jump if not zero". Yet with the length stored first, loops can be unrolled for far greater speed, something that is very difficult to do with a null terminator. For example, many languages with known string lengths are able to use SSE2 and move 16 bytes at a time from a string; the same procedure to move a C-style string performs atrociously in comparison. Themania (talk) 14:17, 28 January 2008 (UTC)

The only operation that is faster is "give me the tail of this string". This is indeed O(1) for nul-terminated strings, but at least O(N) for all other usable representations of strings (a large object containing both an original pointer+length and a current pointer+length would also be O(1), but has impractical overhead for other purposes). It is also true that software has taken advantage of this, which is one of the main reasons it is difficult to transition away from nul-terminated strings and why doing so often makes things much slower. A typical example is passing a pointer to the middle of a string and assuming that is free. Spitzak (talk) 18:24, 22 September 2009 (UTC)
The original reason for nul termination is that with an overhead of one byte, you get unlimited string length. The alternative considered when this was done was a 1-byte length, which gave you a limit of 255 bytes. This length limit was much worse than the problems with not allowing one character value. Overhead of more than one byte was not even considered then; you would have been nuts to suggest it. This was when 4K was a typical memory size. Spitzak (talk) 18:24, 22 September 2009 (UTC)
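
The trade-off described in the comment above can be sketched as follows. The function names `encode_length_prefixed` and `encode_nul_terminated` are hypothetical, chosen for this illustration; both layouts spend exactly one byte of overhead, and each gives up something different.

```python
def encode_length_prefixed(data: bytes) -> bytes:
    """One length byte, then the payload: at most 255 bytes of payload."""
    if len(data) > 255:
        raise ValueError("payload does not fit a 1-byte length")
    return bytes([len(data)]) + data


def encode_nul_terminated(data: bytes) -> bytes:
    """Payload, then one zero byte: any length, but no zero byte inside."""
    if 0 in data:
        raise ValueError("payload contains the terminator value")
    return data + b"\0"


print(encode_length_prefixed(b"abc"))
print(encode_nul_terminated(b"abc"))
```

A 1000-byte payload is accepted by `encode_nul_terminated` and rejected by `encode_length_prefixed`; a payload containing a zero byte is the reverse case.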
SSE: Modern processors manipulate nul-terminated strings in units far larger than one byte in typical strcpy implementations. This is because you can do tests that will reveal if one of the bytes in the larger unit is nul and then do that last unit with byte-based code. So the SSE advantages are far less than you think. One disadvantage of nul-terminated strings is that they are often not page-aligned at the start, requiring special code at both the start and end of the string. Spitzak (talk) 04:29, 15 April 2010 (UTC)
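
A rough sketch of the "test a larger unit, then finish byte by byte" idea from the comment above. It uses one well-known bit trick for detecting a zero byte in a 64-bit word; this is an illustration only, not a claim about what any particular strcpy or strlen implementation does, and the names `has_zero_byte` and `strlen_wordwise` are hypothetical. Python is used for readability, so it shows the logic rather than the speed.

```python
ONES = 0x0101010101010101   # 0x01 in every byte of a 64-bit word
HIGHS = 0x8080808080808080  # 0x80 in every byte of a 64-bit word


def has_zero_byte(word: int) -> bool:
    """True exactly when at least one of the 8 bytes of `word` is 0x00."""
    return ((word - ONES) & ~word & HIGHS) != 0


def strlen_wordwise(buf: bytes) -> int:
    """Length up to the first zero byte, scanning 8 bytes at a time."""
    i = 0
    n = len(buf)
    # Skip whole 8-byte words that contain no zero byte.
    while i + 8 <= n:
        word = int.from_bytes(buf[i:i + 8], "little")
        if has_zero_byte(word):
            break
        i += 8
    # Finish the last unit with byte-based code.
    while i < n and buf[i] != 0:
        i += 1
    if i == n:
        raise ValueError("no terminator in buffer")
    return i


print(strlen_wordwise(b"hello world, this is long\0junk"))
```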

ow, my brain!

I'm no good at computer. 71.84.126.174 (talk) 08:51, 1 October 2010 (UTC)

"\01" a two-character string?

Tested negatively in:

  • Bash 4.1.5 - a=$'\01'; echo ${#a}
  • Perl 5.8.9 - perl -e 'my $a = "\01"; print length $a;'
  • Python 2.6.5 - len('\01')

Should we remove that part until someone finds (or invents;) such a language? —Preceding unsigned comment added by 84.73.54.61 (talk) 10:42, 11 January 2011 (UTC)

There is lots of documentation that says "\0" is a null character, which implies "\01" is a null character followed by a '1' character. However, it is quite possible that the documentation is wrong! Another question is whether "\0001" is a null character followed by a '1' or a single '\1' character (i.e. does it eat 3 digits as the C standard says, or eat all possible octal digits); again, docs often say ambiguous things or disagree with the C standard. Again, perhaps only the docs are wrong. If it appears virtually all implementations match C, it might still be interesting to point out that the common belief and documentation mistake that '\0' is the escape for a null character is not exactly true. Spitzak (talk) 20:03, 11 January 2011 (UTC)
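
For what it is worth, the two questions above can be checked directly in one implementation. The sketch below assumes Python 3 (the thread's own test used Python 2.6.5), where an octal escape in a string literal consumes at most three octal digits; it says nothing about other languages.

```python
# Octal escapes in Python string literals take at most three octal digits.
one_char = "\01"         # a single character with value 1
two_chars = "\0001"      # "\000" followed by the digit "1"
nul_then_eight = "\08"   # "8" is not an octal digit, so the escape is "\0"

print(len(one_char), len(two_chars), len(nul_then_eight))
```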
"implies "\01" is a null character followed by a '1' character."
No, not even a byte 0 followed by a byte 1. "\01" means a string of language source code, which is parsed according to the rules of that language. In pretty much every case, for every language, this means that it's treated as a string of digits signifying one numeric value (usually in octal notation, if it's a "\" marker alone). This numeric value then represents one, and only ever one, codepoint (i.e. "character"). For two codepoints it would be expressed as two \nnn \nnn sequences, with two backslashes. Each codepoint is then encoded into bytes or octets. In many cases (i.e. encodings such as ASCII or ISO-8859-*), codepoints are represented as a single byte. In UTF-16, as two octets (four for codepoints outside the Basic Multilingual Plane). In UTF-8, the ASCII range is represented as a single octet, but there are encoding rules for transparent use of varying-length (i.e. multi-octet) encodings. So "\01" is the same as "\1" and "\001": it always means codepoint ("character") #1 (SOH) and that will (in most cases) be represented by the single byte 0x01. In UTF-16 it could be 0x0001 or 0x0100, depending on byte order, because this codepoint takes two octets there.
I think you are confused about what I am saying. Yes, in most (perhaps all) languages using such syntax, "\01" results in a 1-character string with a single character of value 1. However, there is quite a lot of documentation that says that "\0" is a nul character. This is not true, because a literal reading of that would indicate that "\01" should then produce a nul character from the "\0" followed by something else from the "1", most likely an ASCII '1'. I was under the impression that there were actual implementations that work this way (as it would be easy to write a case statement that was also doing \n and \t to make \0 into a NUL, but more difficult to make \nnn octal work) and assumed there were actual differences between software in the interpretation of "\01". It now sounds like nobody can locate an example and in fact it is only the documentation that is wrong. Spitzak (talk) 22:19, 11 January 2011 (UTC)
No, you're just plain wrong. Read my text again. \01 is not the same thing as \0 \1. Andy Dingley (talk) 22:41, 11 January 2011 (UTC)
Yes, I and everybody else agree that "\01" is never "\0\1". The question is whether it is "\1" or "\0"+"1". I also agree that it is "\1" in the majority and possibly all cases. I am however assuming that the ease of writing a statement like "case 'n': return '\n'; case 'r': return '\r'; case '0': return 0;" would lead to a number of implementations that treat "\01" as "\0"+"1". However, as testing is not showing any, perhaps this assumption about lazy programmers is false. Spitzak (talk) 17:08, 12 January 2011 (UTC)
It is parsed according to pretty universally followed practice for character-level syntax in programming languages: \nnn is treated as one sequence, with up to three digits (in octal, at least). There are no languages that will treat the "\nnn" notation as "\n" followed by a remainder of "nn". You are of course free to write as many inconsistent, broken parsers as you wish to, but they aren't going to be notable and aren't going to get covered here. For any language worth using as an example in these articles, "\nnn", "\nn" and "\n" are all parsed in exactly the same manner: as digits, with assumed left padding with zeros. Andy Dingley (talk) 17:21, 12 January 2011 (UTC)
The official YAML spec does this: http://www.yaml.org/spec/1.2/spec.html#id2776092 YAML is based on JSON, but JSON does not have this problem, as it does not support \0 at all. It appears the only way to get nul in JSON is \u0000. Spitzak (talk) 22:49, 12 January 2011 (UTC)
If you're using YAML, there's no hope. Writing a language spec isn't as hard as it used to be, as it's so well understood these days and there's so much past work to draw from. YAML though is just a mess. Andy Dingley (talk) 22:56, 12 January 2011 (UTC)
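
The two readings of "\01" debated in this thread can be put side by side in a toy unescaper. This is a sketch only: the function `unescape` and its `lazy_nul` flag are hypothetical, it handles just a handful of escapes, and it does not model any real language or the YAML spec. With `lazy_nul=False` it follows the C-style rule (up to three octal digits form one value); with `lazy_nul=True` it behaves like the "case '0': return 0" shortcut.

```python
def unescape(src: str, lazy_nul: bool = False) -> str:
    """Expand a few backslash escapes in `src` (toy illustration)."""
    simple = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\"}
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch != "\\" or i + 1 >= len(src):
            out.append(ch)
            i += 1
            continue
        nxt = src[i + 1]
        if nxt in simple:
            out.append(simple[nxt])
            i += 2
        elif lazy_nul and nxt == "0":
            # Shortcut reading: "\0" alone is NUL, later digits are literal.
            out.append("\0")
            i += 2
        elif nxt in "01234567":
            # C-style reading: consume up to three octal digits as one value.
            j = i + 1
            while j < len(src) and j < i + 4 and src[j] in "01234567":
                j += 1
            out.append(chr(int(src[i + 1:j], 8)))
            i = j
        else:
            # Unknown escape: keep the escaped character as-is.
            out.append(nxt)
            i += 2
    return "".join(out)


print(repr(unescape(r"\01")), repr(unescape(r"\01", lazy_nul=True)))
```

Raw string literals (`r"\01"`) are used for the inputs so that Python itself does not expand the escapes before `unescape` sees them.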

Definition of "null character" is NOT zero

I disagree with the phrase "with the value of zero": since zero has its own unique ASCII character, the phrase is misleading. The null character is non-numeric. I invite further discussion on the difference between zero and null, since they are not the same philosophically, nor are they the same within a numerical ordering. This concept was taught in the computer engineering curriculum at the University of Miami in Coral Gables, FL during my studies there. - MPO, mattjava77@gmail.com — Preceding unsigned comment added by 107.41.63.164 (talk) 03:05, 28 December 2013 (UTC)
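
A small sketch that may help separate the two things being discussed here: the character code of the null character versus the digit character "0". It uses Python's `ord` to read the code values; the variable names are just for illustration.

```python
nul = "\0"        # the null character
digit_zero = "0"  # the printable digit zero

code_of_nul = ord(nul)                # code value 0
code_of_digit_zero = ord(digit_zero)  # code value 48 (0x30)

# The null character is not a digit, even though its code value is 0.
nul_is_digit = nul.isdigit()
zero_is_digit = digit_zero.isdigit()

print(code_of_nul, code_of_digit_zero, nul_is_digit, zero_is_digit)
```

So "value of zero" in the article refers to the code value of the character, not to the digit "0", which is a different character with code value 48.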