Talk:Burrows–Wheeler transform

	This article is within the scope of WikiProject Molecular Biology, a collaborative effort to improve the coverage of Molecular Biology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Molecular BiologyWikipedia:WikiProject Molecular BiologyTemplate:WikiProject Molecular BiologyMolecular Biology articles
???	This article has not yet received a rating on the importance scale.
	This article is supported by the Computational Biology task force (assessed as Mid-importance).

EOF[edit]

The text says that EOF is ignored when sorting, but the example seems to suggest that EOF is considered to come after all normal letters. AxelBoldt 14:14 Aug 26, 2002 (PDT)

The dot wasn't an EOF. It was just a dot that was there to make the rotations stand out visually. Hopefully it's clearer now. --LC 08:06 Aug 27, 2002 (PDT)

Obviously sufficiently many repeats of the transform must eventually restore the original text, and obviously that provides another (usually inefficient?) way of inverting a transform (just store the inputs temporarily, and give back the last input before the text to be inverted turns up again). But one interesting thing is this: since the original is itself the result of a transform on some text, clearly the transform does not necessarily always produce something better suited for compression. It must be something that is statistically likely, based on properties that are common in many inputs. What are these, and what are the odds that a transform won't actually help and might even hurt? If we do get one of these cases, how can we tell (cheaply)? And so on. PML.

Assuming you're talking about the Bijective version (since your claim is false of the other versions), then the statistical property it takes advantage of is the tendency for letter sequences to be exactly repeated more often than merely similar letter sequences. Exact matches create runs of the same letter. Like, when you're alphabetizing rotations of a book and you get to the 'uick' section, it's a pretty good bet that the previous character (the output from the BWT for that letter) is 'q'; when you get to the 'the' section, there are going to be a lot of spaces as the output character, from all the instances of 'the', 'then', 'there', and 'their' and others, but some 'o' will be mixed in from 'other', etc. If you already have a lot of repeats or are simply high-entropy, it's not going to help.

How can repeating the transform restore the original text? The transform is a not a bijection between strings of length n. It is a one-to-one function from strings of length n into the set of ordered pairs of a string of length n and an integer less than n. Thus, different inputs can produce the same output string with a different index. For example .BANANA. and ANA..BAN produce the same output string. -- Luke Somers

There is a simple way to avoid the use of an EOF symbol which is treated differently in the many verisons of BWT. Since any rotation of the original string gets you the same BWT string output. Which is the way the Burrows Wheeler Transform is usually first discussed in the literature. One problem with the Special symbol for EOF is that when dealing with a file on 256 character types one needs to add a special character in to do the BWT. Then one has to pick its value some make it the heaviest symbol some the lightest and in some binary versions they pick a weight half way in between. One method that I have used to get back to the traditional BWT is to rotate the string until it represents a single lyndon word. And then do the BWT as done in the start of this article and pass as an index the number of rotations needed to get the start of original string. One nice feature of this it that when the string is rotated to make a single lyndon word the string resulting from the classical BWT and the Bijective BWT are the SAME. — Preceding unsigned comment added by 69.152.152.134 (talk) 15:23, 22 August 2012 (UTC)[reply]

The current use of special symbols is... concerning. The original 1994 article does not even use an EOF for the "simple version"; instead it just keeps rotating the columns till it finds something that's sorted when decoding. (This could be the idea Special:Diff/715996790 was trying to get across, but... bruv you need to re-read the article.) EOF was only introduced in section 4.1, whereas the decompression algorithm does not even involve the rotation in the simple version. And don't even get me started on who came up with the idea of adding a "start-of-string marker". --Artoria 2e5 🌉 03:18, 1 January 2020 (UTC)[reply]

Bad example[edit]

This article gives a really bad example for this algorithm. Take the text:

SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES

and replace all repeated substrings (as deflate would detect them) with a special control character which I'll represent as _:

SIX.M_ED.P_IES._FT_XTY_.DUS_BOX_

Now take the transformed text:

TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT

and do the same:

TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T

Clearly, the original string compresses better. And indeed gzip and bzip2 confirm this: gzip compresses the above to 67 bytes, bzip2 to 69 bytes. (Both can hardly be called compression: the original was only 44 bytes ...) — Timwi 14:14, 9 Nov 2004 (UTC)

That is unfortunately the curse of writing about data compression: in a substantial number of cases, it can be almost impossible to find an example small enough to be practical, and clear enough to show the principle involved, but large enough so that the savings achieved with the method can be immediately seen to be significant. Add in a requirement about 'no other method can do this example better' and I am afraid you're asking for the impossible. IMHO, it's much better to use an example that shows the principle clearly and note in the text if needed that 'in this example, the saving from the run-length encoding doesn't overcome the overhead of the encoding; in a larger example which produced longer runs, it would.' After all, generations of textbooks have used recursive Fibonacci algorithms to demonstrate recursion when that's not the most efficient way to calculate Fibonacci numbers.

By the way, have you noticed that your deflate example is incorrect? Deflate operates on strings of minimum length 3. Even with your simplification of all replaced substrings taking as much space as one literal, here's how it would actually look:

SIX.MIXED.PIXIES.SIFT._TY_.DUST.BOXES

vs.

TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T

Not a stellar improvement. -- Antaeus Feldspar 15:11, 9 Nov 2004 (UTC)

It doesn't improve deflate, but I think it does improve RLE. My impression was that the point is to create runs. Deco 20:42, 9 Nov 2004 (UTC)

Timwi's point was that in the example given, deflate outperforms BWT. However, his comparison was flawed because he showed deflate getting LZ77 compression from repetition of two-byte sequences. It doesn't; BWT gets compression from repetition of two-byte sequences, but deflate's implementation of LZ77 compression doesn't work on sequences less than three bytes. -- Antaeus Feldspar 02:17, 10 Nov 2004 (UTC)

If not deflate, then what does bzip2 apply after the BWT? — Timwi 11:31, 10 Nov 2004 (UTC)

To the best of my knowledge, it used to apply RLE and then arithmetic coding, but because of patent problems, it switched to RLE and then Huffman coding. It wouldn't make sense to apply deflate after the BWT, because while deflate can get compression from repeated byte sequences, long runs of the same character, and relative frequency of some characters over others, the BWT is going to remove almost every instance of repeated byte sequences by re-ordering the characters so that they lengthen runs, instead. Deflate would still be able to work on the runs, but it wouldn't do so as effectively as an encoding step that works only on runs.

bzip2 never used arithmetic encoding, bzip did. bzip2 does RLE => BWT => MFT => Huffmann, and the reason it does RLE has nothing to do with compression,

but the fact that the authors (because they didn't understand why it was needed and left some part out) didn't implement the BWT properly in bzip, causing extremely long (as in years) of compression time on some inputs. The RLE pre-pass only exists to avoid these "degenerate" cases. If the BWT would have been implemented properly this would not be needed, and in fact, a lot of documentation still refers to fixed-size blocks of input being compressed, which is no longer the case due to the RLE prepass. So RLE is a bug workaround, it's not the best method to apply after a BWT, a straight MFT encoder is. 2001:470:D08D:0:0:0:0:1 (talk) 07:20, 25 January 2013 (UTC)[reply]

It may help to remember that lossless compression can't create compression from nothing; it can only remove redundancy that already exists. The following table shows which compression methods exploit which kind of redundancy. (Remember that Deflate is LZ77 coding followed by Huffman coding.)

	LZ77	BWT	RLE	Huffman/Arithmetic
repeated multi-byte sequences	yes, at cost of two parameters (offset, length)	converts into single-character runs	no	no
single-character runs	yes, at cost of two parameters (offset, length)	no	yes, at cost of single parameter (length of run)	no
non-flat frequency distribution of characters	no	no	no	yes, exploits optimally

As you see, BWT destroys one of the two kinds of redundancy LZ77 can work on, and converts it to the other kind; LZ77 can still work on it, but at twice the cost of RLE. So it wouldn't make sense to use deflate after BWT instead of RLE and Huffman. -- Antaeus Feldspar 16:55, 10 Nov 2004 (UTC)

In the original paper that Burrows and Wheeler wrote, they suggested that their transform be used in conjuction with a Move to Front encoder and an entropy encoder, such as Huffman; they made no reference to RLE. According to the bzip2 website, RLE is used, but the author indicates that it isn't necessary and is only retained because that's how he originally wrote it and he didn't want to create yet another compression format. -- Zawersh 15:05, 14 May 2006 (UTC)[reply]

This whole discussing about a bad example is rather pointless and can be easily solved by quoting burrows and wheeler form there original report (to read it yourself => you can find it at the sources-list in the article) "The size of the input block must be large (a few kilobytes) to achieve good compression." ~ Band B 22:40, 14 February 2007 (UTC)[reply]

C implementation[edit]

Is there any reason that the C implementation given in the Polish Wikipedia shouldn't be included here? --Shutranm 21:47, 20 May 2005 (UTC)[reply]

Because those who wrote this article are more familiar with python, even the pseudocode is pseudo-python. A more neutral pseudocode may look similar to this one:

generate R with the |S| right rotations of string S.
sort lexically R producing R'
produce R" selecting the last character of each string in R'
form the string B concatenating the characters in R"

concrete code may be more precise and a better choice if the language is abstract enough. I have not seen that C implementation, but C is not abstract enough and could unnecessary distract the reader with the control part of the algorithm. It is better to do a more declarative description showing the what not the how, (see Robert Kowalski's "Algorithms=Logic+Control").

As an exercise, Can you identify each step in the following BASH implementation?:

function bwt () {
   echo $((S=$1; 
           X=$(echo $S|sed -E 's/^(.*)(.)$/\2\1/g'); 
           echo $S; 
           while "$X" != "$S" ;
           do echo $X;
              X=$(echo $X|sed -E 's/^(.*)(.)$/\2\1/g');
           done;) | sort | sed -E 's/^(.*)(.)$/\2/g' ) | sed 's/ //g' ; }
bwt "^BANANA|"

knowing the algorithm is very easy, with no algorithm is not so hard but, why bother the reader with all those details needed in programming languages?

note that after identifying each step, the resulting code is ease to read too, it can be rewritten as follows:

function bwt () { generateRotations $1 | sort | takeLastChar | concatLines }

followed by the definition for each filter, which can be skipped by the reader. That is the power of abstraction, so I don't see why actual well designed code, abstract, almost declarative, should be rejected "because Wikipedia is not a code repository".

— Preceding unsigned comment added by 2806:106E:B:DCE2:B91C:FFCC:4813:8157 (talk) 19:52, 12 October 2021 (UTC)[reply]

Collation issues[edit]

Nobody reading this page cares about Posix collation. Why don't we pick a sample string with no such issues? --Doradus 17:45, 16 January 2006 (UTC)[reply]

How do you get from "I don't care" to "nobody cares"? People reading this page quite clearly do care about POSIX collation, or there wouldn't be a mention of it in the first place. I certainly care, and I'm equally certainly reading this page. The point that BWT is sensitive to collation issues is worth making, IMO. — Haeleth Talk 20:19, 22 March 2006 (UTC)[reply]

BWT isn't sensitive to Posix; the sort will work in whatever way you implement the sort. Practically speaking, it's very unlikely anyone will be doing real BWT with Posix C string sorting functions; I found it to be quite a non-sequitur with little or no apparent relation to the subject at hand. The removed section follows: --Brion 03:31, 19 May 2006 (UTC)[reply]

==== Note on sorting convention ====

If you sort with [[Posix]] [[collating]], you get the slightly different string
 TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
instead of
 TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT

[[ISO 8859]] has complex collating rules, but in this case, periods are
ignored.  Posix collating treats periods as characters.

As a visitor trying to use this page to work out my own implementation, it would be nice to know what sorting convention is being used for the example. Otherwise, I'd say that th example in't very useful. --128.252.110.200 (talk) 01:20, 7 March 2014 (UTC)[reply]

Mistake in example?[edit]

Could someone check the example, please? I was trying to verify something and my simple quick implementation of BWT outputs

TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT

not

TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT

as the article writes.

(And note that I intend to at least replace the black dot with another character, or to remove it altogether. It is not a good style to unnecessarily convey meaning only with a color variation, mind e.g. blind people.)

--Mormegil 18:12, 18 June 2006 (UTC)[reply]

My code gives the same output as Mormegil's, not as the article writes.--Thorwald 06:05, 3 August 2006 (UTC)[reply]

Just tried this out, and I get exactly the same as Mormegil and Thorwald. If you try doing it, and use something like Microsoft Excel (yes, you can stop spitting now), you get the following table:

.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST
.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE
.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX
.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY
.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED
.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES
.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT
BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.
D.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXE
DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.
E.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXI
ED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIX
ES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXI
ESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOX
FT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SI
IE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIX
IES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIX
IFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.S
IX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESS
IXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.M
IXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.P
IXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.P
IXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.S
MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.
OXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.B
PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.
PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.
S.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIE
SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.
SSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXE
ST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DU
T.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUS
T.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIF
TY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIX
UST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.D
X.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSI
XED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MI
XESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BO
XIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PI
XIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PI
XTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SI
Y.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXT

Which gives "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"

Which is understandable, as a space is a smaller ASCII value than [A..Z|a..z]

--Ambulnick 10:17, 27 Sept 2006 (GMT)

If you take the C code and run it on ".BANANA" instead of "Polska Wikipedia", you don't get the same answer as the article gives: BNN.AA@A, you get ANNB.AA. That's because "." sorts before capital letters in ascii, not after, as the example shows. Changing "." to "x" makes the example work. If someone else will verify that they see the same problem, I'll fix the example. But it's late at night -- I may just be confusing myself :-)

Cbogart2 06:04, 18 December 2006 (UTC)[reply]

INCOMPLETE EXAMPLE[edit]

The BWT is not just the transformed string. If using method above you need to include something

like the position of original data in example above where the SIX.MIXED row occurs. If we don't do this then people we wrongly think the normal BWT is just a permutation. When in fact its two outputs the permutation and an integer index. —Preceding unsigned comment added by 75.54.98.144 (talk) 16:28, 2 March 2009 (UTC)[reply]

compression?[edit]

Article says:

The Burrows-Wheeler transform (BWT, also called block-sorting compression)

Technically, BWT is not a compression. Should it be noted? It leads to wrong understanding of the subject —lim 16:54, 16 May 2007 (UTC)[reply]

Bad inefficiency in the example implementation[edit]

Putting aside the issue that this is a really horrible example of program code, there's a major inefficiency in it: In the function rotlexcmp() if 'li' is equal to 'ri', it will needlessly traverse the entire data and then just return 0. According to my own experiments this makes it almost 100 times slower than it could be (well, depending on the exact implementation of the sorting function, of course). Simply adding a "if(li==ri) return 0;" at the beginning solves this issue and makes the program a lot faster with large datasets.

Wikipedia is not about tweaking the last bit of Performance out of some Algorithm. It only has to give the right oputput. —Preceding unsigned comment added by 129.13.72.198 (talk) 08:35, 6 February 2009 (UTC)[reply]

Implementation[edit]

We don’t need two “sample implementations” and should not have them here both. We are not a code repository. Maybe we should clean one of them to focus on the algorithm (e.g. remove those #includes and main() in the C version)? --Mormegil 17:54, 9 November 2007 (UTC)[reply]

I agree with you, I am going to remove the C-implementation. I don't think it serves any purpose here. Sample implementations can be found around the web. The C-code also points to the Polish Wikipedia as the source. Wikipedia is not a reliable source. —ZeroOne (talk / @) 16:28, 14 December 2008 (UTC)[reply]

I do not agree. Although the WP:NOT says Wikipedia is "not a code repository", this is only a suggestion, not a rule. I do not agree with this policy. I think Wikipedia should always have sample code whenever possible. It provides probably the best context to the article. I agree that we should primarily focus on the algorithm, but practical/actual code is always a help. It is yet another reason to visit Wikipedia. I am, and have always been, an inclusionist. PS: What about this entire category: Category:Articles with example C code? Are you going to remove that as well? --Thorwald (talk) 00:02, 15 December 2008 (UTC)[reply]

Well, I'm glad we got a discussion. I also think that sample code is good. In this case I removed the C code and left the Python code, because I found the Python code to be clearer. There's nothing wrong with having C code but I think a sample in one language is enough. If I have to choose between C and Python I remove the C example; if I had to choose between, say, C and assembly language, I'd probably remove the assembly example. Having an example in more than one language feels like artificially expanding the article. If we allow two example languages, why not three? If we allow three different example languages, why not four? Why not five, six or seven examples? —ZeroOne (talk / @) 13:40, 15 December 2008 (UTC)[reply]

Okay. I agree. The Python code is sufficient (and clearer and cleaner). --Thorwald (talk) 01:20, 16 December 2008 (UTC)[reply]

Mathematica Implementation[edit]

Just for Fun.

BWT[str_] := Block[{list, n, data},

 list = StringSplit[str, ""];
 n = Length@list;
 data = Sort@Table[RotateRight[list, i], {i, 1, n}];
 StringJoin@dataAll, -1

]

inBWT[str_, endOfFile_: "@"] := Block[{list, n, K},

 list = StringSplit[str, ""];
 n = Length[list];
 K = Sort@Transpose@List@list;
 Do[K = Sort@Transpose@Prepend[Transpose@K, list] ;, {n - 1}];
 StringJoin@Select[K, Last@# == endOfFile &]1
 ]  —Preceding unsigned comment added by 189.83.138.252 (talk) 22:35, 24 October 2009 (UTC)[reply]

The Bijective BWT[edit]

Does anybody have independent confirmation of the Bijective BWT actually working? After having read the link to David Scott's method, I'm extremely sceptical that his technique works.

The main reason why I'm extremely sceptical is because he essentially appears to be using the 1st column of the BWT reversed. If that works, then there's no reason why you couldn't use a binary input (instead of ASCII characters) and end up with what would surely be the world's most compressible file, consisting merely of a count of 0's and 1's. Tiggs (talk) 00:16, 30 August 2008 (UTC)[reply]

Actually Yuta Mori has put it in OpenBWT-v1.3 people on one of the more active compression sites http://encode.ru/forum/ have tested it with many files. Surprisingly it often bets regular BWT based compressors using code tuned for standard BWT the main draw back beside being hard to understand is that it is somewhat slower to execute than normal BWT. It is not the 1st column of the BWT reversed you can try Scott's or Mori's code on any reasonably sized file to see this this is not what is happening. —Preceding unsigned comment added by 75.54.97.66 (talk) 22:14, 3 October 2008 (UTC)[reply]

Agreed. I've spent some time looking over the source code and it is, indeed, a bijective transform.

Whether or not it should be classified as a Burrows Wheeler transform, or whether it should be called a Scott Transform, however, is a completely different question. As it changes the order that the characters are transformed into, I'd favour the later. Tiggs (talk) 20:53, 10 October 2008 (UTC)[reply]

Actually there really is no standard BWT even for the index version. Many consider the one where you have the full wrap around on compares the original others use a special character for EOF some treat the EOF character as if it has the greatest some the lowest value there are many way to do it. And actually there are more an how to label the index. But if you look at the BWT that uses the full wrap around and do a UNBWTS on it you get the original file back except that it may be rotated since there is no index. Actually the normal BWT could be thought of as subset of BWTS. I have not heard of anyone calling it the Scott Transform but even Yuta referred to as the BWT Scottified. In fact there are a whole series of transforms that could be called bijective BWT transforms but few know about them or understand the concept. But tests done by others show this version tends to lead to better compression than what the users there call a normal BWT. The field is wide open not just for compression. —Preceding unsigned comment added by 75.54.97.66 (talk) 02:33, 11 October 2008 (UTC)[reply]

Purpose of bijective transformation[edit]

Would it be possible to explain why the bijective version is interesting? Eg where is it used and why? Saving the cost of one index does not seem to justify it. The text seems to suggest it was created for more than its mathematical beauty. 128.16.7.220 (talk) 15:17, 7 June 2014 (UTC) Bill[reply]

1) it allows it to be used on data that potentially saturate their character space, so that you would have to make every character fatter to accommodate the EOF character. 2) Why not? A character's a character. — Preceding unsigned comment added by 108.52.71.107 (talk) 03:23, 28 July 2014 (UTC)[reply]

Source for bijective transformation[edit]

I'm concerned that the only source for the bijective transformation is the paper by Gil and Scott that seems to be (in traditional terms) an "unpublished manuscript". That isn't quite up to Wikipedia standards. That isn't just useless policy-wanking in this case; the paper contains meaning-disturbing typos and mistakes that I don't think would have survived in a published paper. At least we shouldn't mislead readers into thinking this is as reliable a source as the ones we usually provide. Furthermore, Gil is an academic, but no reference to the paper appears on his web page, which could indicate that it was later retracted.

In particular: The paper presents a "string-sorting transform" with two variants, depending on how one compares strings of different length where one is a prefix of the other. It also presents an inverse transform, though one without any variants, and it is not quite stated which of the two main transforms it is an inverse of. According to my pencil-and-paper experiments, the inverse transform does seem to invert the "infinite-periodic" variant of the main transform. It does not work for the "usual lexicographical" variant, and indeed that variant is not a bijection! As a minimal example, the input strings "bab" and "abb" both transform to "bba".^(*)

Unfortunately it is only the "usual lexicographical" transformation that the authors provide a linear-time algorithm for (though it is very sketchy, and I'm not convinced by the time analysis). For the infinite-periodic variant, the authors explicitly deny having a linear algorithm. They claim to have an O(nlogn) one, but do not describe it further.

While a bijective variant of the BWT is indeed an interesting and worthy topic, the one presented by the paper is not quite as ready for use as the paper's abstract suggests.

(*)  Input:                 bab      abb
     Lyndon factorization:  b·ab     abb
     Cyclic rotations:      b,ab,ba  abb,bba,bab
     Sorted:                ab       abb
                            b        bab
                            ba       bba
     Output:                bba      bba

(The "infinte-periodic" variant sorts "b" after "ba" and therefore transforms "bab" to "bab").

–Henning Makholm (talk) 13:44, 6 December 2010 (UTC)[reply]

Huh. I managed to completely overlook the Kufleitner paper, which seems to be a published source. Somebody ought to convert those external links to in-line citations ... –Henning Makholm (talk) 14:08, 6 December 2010 (UTC)[reply]

You have the correct transfrom for abb which is bba

however bab which break up to b ab is sorted wrong it should be

ab to b
ba to a
bb to b

you repeat each substring as needed before the comparistion so b sorts after ba not before thus the bwts is bab for bab its a self bwts. One conceptual way to do this is repeat the substring so a common length example if sub strings 3 6 2 1 5 use 30 characters then sort as you would in normal bwt

I also don't think the last changes to explaination aid to helping people understand what it is. I would like to change it back but would prefer someone else fix the english. I read what is now there and I don't think it helps many people to understand the concept. — Preceding unsigned comment added by 99.152.26.69 (talk) 15:15, 17 January 2012 (UTC)[reply]

when when is comparing two strings lexicographically  you pretend the shorter string is padded with spaces this is not the same as comparing two unequal length strings in bijective BWT  you keep repeating the strings until there is a difference.  — Preceding unsigned comment added by 66.136.119.147 (talk) 23:25, 30 September 2012 (UTC)[reply]

BWT-based self-indexes[edit]

Section "BWT in bioinformatics" and pages Compressed suffix array and FM-index should probably be merged into a single page. They are all about essentially the same data structure:

Build the BWT of the text.
Encode the BWT in a way that supports rank(c, i) (the number of occurrences of character c in BWT[1..i]) and select(c, i) (the position of the i th occurrence of character c in the BWT) efficiently.
Use the BWT to simulate the suffix array.

The differences between various compressed suffix arrays and FM-indexes are mostly in the encodings they use. The following survey articles are probably the best resources about compressed self-indexes so far:

G. Navarro, V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys, 2007.
P. Ferragina, R. González, G. Navarro, R. Venturini: Compressed Text Indexes: From Theory to Practice. ACM Journal of Experimental Algorithms, 2009.

Jouni Sirén (talk) 23:48, 18 December 2010 (UTC)[reply]

Example - Inverse Transformation[edit]

The article is very nicely written and clear even for a beginner like me. But I didn't get one point:

Would it be possible to explain, why step 3 in inverse transformation is possible? Sort and concatenate with BW'ed string in step 2 is clarified in the text, but why one can sort again and concatenate the BW'ed string again? Though it works.

cheers, Oliver Drechsel (talk) 08:39, 20 January 2011 (UTC)[reply]

WRONG transformation[edit]

the added character @ should be lexicographically before any other character, so:

ANANA@^B
ANA@^BAN
A@^BANAN
BANANA@^
NANA@^BA
NA@^BANA
^BANANA@
@^BANANA

should be

@^BANANA
ANANA@^B
ANA@^BAN
A@^BANAN
BANANA@^
NANA@^BA
NA@^BANA
^BANANA@

and the resulting string would be ABNN^AA@ — Preceding unsigned comment added by 84.223.127.232 (talk) 19:33, 12 April 2012 (UTC)[reply]

The third implementation of bwt or ibwt has a bug[edit]

Calling bwt('abracadabra') results in ('rdarcaaaabb', [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]) , but calling ibwt('rdarcaaaabb', [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]) results in 'aabracadabr'.

Beautiful[edit]

This article is beautiful. Thank you so much to all for the work in making this algorithm understandable. I mentally cried tears of joy because I was expecting the worst. I've never seen a Wikipedia article so carefully curated to its intended audience.

--Dominic (talk) 12:58, 23 December 2016 (UTC)[reply]

League Table Sort Algorithm[edit]

The algorithm claims to be linear in time, but clearly it cannot be, since for a string of length N it requires the construction of an N×N matrix, or since it is symmetric (over negation), an upper or lower triangle N×N matrix, which require O(N^2) to manipulate.

That glaring flaw aside, it is actually quite a a nice construction algorithm, and elucidates some nice intuitions about the problem of BWT construction, but claiming it is efficient is simply wrong from an algorithmic perspective. Drtomconway (talk) 22:33, 22 December 2019 (UTC)[reply]

Where does it claim to be linear? BernardoSulzbach (talk) 16:11, 23 December 2019 (UTC)[reply]

Example contains an error.[edit]

It may be "by convention", but the Example claims that rotation is done 8 times. It. Is. Not. It is done 7 times for a total of 8 strings if you include the original un-rotated string. That is, N = 7 for a string of 8 characters. It should also, imho, be mentioned that sorting of 'special characters' (which in this case would include both the caret (^) and the EOS (end of string? end of line?) character (|)) depends on where they are arbitrarily positioned in the sort order (it isn't necessary that both occur before the 'a' or after the 'z' of "lexical order".98.21.213.235 (talk) 23:31, 10 February 2021 (UTC)[reply]

Dollar sign instead of pipe to terminate strings[edit]

In is common to use ^ and $ to start and end a string, respectively -- see Regular expression#POSIX basic and extended. Should we use $ instead of | in this article? — $Q$ uantling (talk | contribs) 03:26, 14 November 2023 (UTC)[reply]

I made the change boldly. If this instead needs discussion, let's do it. —

Q

uantling (talk | contribs) 22:01, 14 November 2023 (UTC)[reply]

String-starting and string-ending characters[edit]

The original Burrows–Wheeler paper has neither a string-starting character (like ^) nor a string-ending character (like @ or | or $), and does not invoke them in any way. Can I remove them from this Wikipedia article? — $Q$ uantling (talk | contribs) 00:33, 15 November 2023 (UTC)[reply]

In the paper, “the index I of the original string S in the sorted list of rotations” is an additional output of the compression scheme that is needed as an input for decompression. The WP article instead uses the fact that $ being in the last column will identify M[I]. But I don't see why the start-of-text character ^ is used. stw (talk) 19:57, 23 January 2024 (UTC)[reply]

I’m particularly confused as to why the start of string character is used in the bijective example NixCode (talk) 02:57, 31 January 2024 (UTC)[reply]