User:NorwegianBlue/sandbox

This is the user sandbox of NorwegianBlue. A user sandbox is a subpage of the user's user page. It serves as a testing spot and page development space for the user and is not an encyclopedia article.

P(A|B)={\frac {P(B|A)\,P(A)}{P(B)}}.\,

Want to know:

P(CD|aTTG_{neg}\cap aDGP_{pos}\cap IgA_{neg})={\frac {P(CD)\,P(aTTG_{neg}\cap aDGP_{pos}\cap IgA_{neg}|CD)}{P(aTTG_{neg}\cap aDGP_{pos}\cap IgA_{neg})}}.\,

and

P(CD|aTTG_{neg}\cap aDGP_{pos}\cap IgA_{pos})={\frac {P(CD)\,P(aTTG_{neg}\cap aDGP_{pos}\cap IgA_{pos}|CD)}{P(aTTG_{neg}\cap aDGP_{pos}\cap IgA_{pos})}}.\,

Assume that

P(CD|aTTG_{neg}\cap aDGP_{pos}\cap IgA_{neg})=P(CD|aDGP_{pos}\cap IgA_{neg})={\frac {P(CD)\,P(aDGP_{pos}\cap IgA_{neg}|CD)}{P(aDGP_{pos}\cap IgA_{neg})}}.\,
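To evaluate the right-hand side, the denominator can be expanded with the law of total probability. The following is a sketch only; the symbol \neg CD (no coeliac disease) is notation introduced here and does not appear in the formulas above.

```latex
P(aDGP_{pos}\cap IgA_{neg})
  = P(aDGP_{pos}\cap IgA_{neg}|CD)\,P(CD)
  + P(aDGP_{pos}\cap IgA_{neg}|\neg CD)\,\bigl(1-P(CD)\bigr)
```

With this expansion, the posterior depends only on the prevalence P(CD) and on the two conditional probabilities of the observed test pattern, given disease and given no disease.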

Video codecs for future storage

Part 1
I'm having several hours of Super 8 film digitized. I want to keep it in a format that strikes a good balance between keeping the storage space needed at a manageable level and preserving as much as possible of the information in the video. I've had tests done at several companies that specialize in this, and the company that so far produces the best digitizations uses this hardware, and a codec that VideoLAN describes thus:

Stream: 0
Type: Video
Codec: Packed YUV 4:2:2, U:Y:V:Y (2vuy)
Language: English
Resolution: 1280x720
Display resolution: 1280x720
Frame rate: 24

When I calculate the total number of pixels encoded, the file size is about 2.7 bytes per pixel.
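The bytes-per-pixel figure can be reproduced with a small calculation. This is a minimal sketch: the function name, the 60-second clip length and the file size are hypothetical values chosen for illustration, not measurements from the actual files. For reference, 8-bit packed 4:2:2 video stores 4 bytes for every 2 pixels, i.e. 2 bytes per pixel before audio and container overhead.

```python
def bytes_per_pixel(file_size_bytes, width, height, frame_rate, duration_seconds):
    """Return the average number of bytes stored per encoded pixel."""
    total_pixels = width * height * frame_rate * duration_seconds
    return file_size_bytes / total_pixels


# Hypothetical example: a 60-second 1280x720 clip at 24 frames per second.
# 8-bit packed 4:2:2 video needs 2 bytes per pixel for the picture data alone.
example_pixels = 1280 * 720 * 24 * 60
example_size = 2 * example_pixels  # assumed file size, video data only
print(bytes_per_pixel(example_size, 1280, 720, 24, 60))
```

A figure above 2 bytes per pixel for such a file would then have to come from audio tracks, container overhead or padding, assuming the video stream really is 8-bit 4:2:2.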

Do we have an article about this codec? I've read Chroma subsampling, but anything more specific?
Is this a codec that can be expected to be readable a long time into the future?
In this particular codec, are all frames represented equally faithfully to the original, or are there key frames that are represented more faithfully, with diffs in between which use some kind of lossy compression with respect to the key frames? To clarify: The details of the encoding scheme, i.e. whether key frames and diffs are used or whether the information is stored as a sequence of entire images, do not matter to me. If the diffs are exact, it's fine.
I'm happy with a file size of up to about 4 bytes per pixel. Are there codecs that are widely used, can be expected to be readable a long time in the future, and which can be manipulated with ffmpeg (Windows version from http://ffmpeg.zeranoe.com/builds/), that would be a better choice than the one described above?

Part 2

For some reason, the company mentioned above only delivers the files in .MOV containers. Can I convert to .AVI losslessly with ffmpeg, and safely delete the original?
Will ffmpeg be able to split longer files into shorter sequences losslessly?

Thanks, --xxxx

LINKS

3TB disk problem

Ok, here are my conclusions, for the benefit of anyone who might find this thread googling for the error messages:

The misaligned disk problem affects both Seagate and Western Digital 3 TB disks. The only difference is the reported misalignment: 256 bytes with the Western Digital disk and 3072 bytes with the Seagate disk. The WD disk is described by the supplier as Western Digital® Desktop Green 3TB SATA 6Gb/s, (SATA 3.0), RPM = IntelliPower, 64MB Cache, 3.5 in., and has the producer ID WD30EZRX. The Seagate disk is described by the supplier as Seagate Barracuda® 3TB SATA 6Gb/s (SATA 3.0), 64MB Cache, 7200RPM, 3.5 in., and has the producer ID ST3000DM001. Both disks had 5860533168 blocks as reported by sudo cat /sys/block/sdd/size, which fits nicely with 3 TB. The OS is Lubuntu 12.04.2 LTS.
Zeroing the start and end of the disks, as suggested by Finlay, initially appeared to have no effect. Exactly the same error messages, with the same reported misalignment, appeared (still using palimpsest). I later checked whether the zeroing of the starts and ends of the disks had actually worked (by repeating the process, dd-ing the blocks to a file and doing hexdump -C). The supposedly zeroed beginnings and ends of the disk still contained data! I repeated the process, ran partprobe /dev/sdd (to tell the OS that the partition tables had changed) and rebooted. When I then checked whether the sectors were empty, they were. I then used gparted, created a GPT style partition table, and created one 3 TB ext4 partition. Everything went smoothly. When checking the disk with palimpsest, there was no warning of misalignment.
Conclusion: Nuking the starts and ends of the disks as suggested by Finlay works, but you have to take care and check that the sectors are actually cleared, and that the OS is aware of it. I'm not sure why my initial attempts failed silently. Maybe I had palimpsest open while dd-ing, and palimpsest locked the disk. Maybe I had actually succeeded in zeroing the sectors, but failed to tell the OS. I don't know. Anyway, it works now. --NorwegianBlue^talk 15:58, 25 March 2013 (UTC)
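The two checks described above (does the block count fit 3 TB, and are the sectors really zeroed) can be sketched in Python as an alternative to dd and hexdump -C. This is an illustration under stated assumptions, not the procedure actually used: the function name region_is_zeroed is hypothetical, and the 512-byte unit is the one in which /sys/block/<dev>/size is reported.

```python
SECTOR_SIZE = 512          # unit used by /sys/block/<dev>/size
BLOCKS = 5860533168        # value reported for both disks above

# Capacity implied by the block count, in bytes (about 3.0e12, i.e. 3 TB).
capacity_bytes = BLOCKS * SECTOR_SIZE


def region_is_zeroed(path, offset, length, chunk_size=1 << 20):
    """Return True if `length` bytes starting at `offset` are all zero.

    `path` may be a regular file or a block device such as /dev/sdd
    (reading a device normally requires root privileges).
    """
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return False  # ran out of data before the region ended
            if any(chunk):
                return False  # found a non-zero byte
            remaining -= len(chunk)
    return True
```

For example, region_is_zeroed("/dev/sdd", 0, 1 << 20) would check the first MiB, and an offset of capacity_bytes - (1 << 20) would check the last MiB; the 1 MiB region size is an arbitrary choice for illustration.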

Tetranacci numbers etc

Followup question: Before RDBury's answer, I attempted to find the answer to Theurgist's question using elementary probability calculations. I tried to separate the problem into individual cases (a sequence of heads that begins at position 1, a sequence of heads that begins at position 2, etc), but soon realized that I was unable to make the "separate" cases non-overlapping. I also realized that the question could easily be answered by brute force calculation, and have now written a small program that does so, and which confirms RDBury's answer. Experimenting a little with variations of the program, I find that an analogous result holds for runs of 2, 3, 5 and 6 tails. The number of sequences of 25 coin tosses with no runs of more than 1 tail is fibonacci_number(25+2), the number with no runs of more than 2 tails is tribonacci_number(25+3), the number with no runs of more than 3 tails was the original question, the number with no runs of more than 4 tails is pentanacci_number(25+5), and the number with no runs of more than 5 tails is hexanacci_number(25+6). I assume that this holds true in general. My question: Is there some intuitive or easily proven reason why this is so?
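The brute-force comparison described above can be sketched as follows. This is not the original program, which is not shown here; the function names are hypothetical, and the indexing convention (k-1 zeros followed by a 1, starting at index 0) is an assumption chosen so that fibonacci_number(n+2) and its analogues line up with the counts.

```python
from itertools import product


def count_by_brute_force(n, max_run):
    """Count length-n sequences of H/T with no run of tails longer than max_run."""
    forbidden = "T" * (max_run + 1)
    return sum(
        1 for seq in product("HT", repeat=n) if forbidden not in "".join(seq)
    )


def k_nacci(k, index):
    """Return the k-step Fibonacci number at `index`.

    Assumed convention: terms 0..k-2 are 0, term k-1 is 1, and every later
    term is the sum of the k preceding terms (k=2 gives 0, 1, 1, 2, 3, ...).
    """
    terms = [0] * (k - 1) + [1]
    while len(terms) <= index:
        terms.append(sum(terms[-k:]))
    return terms[index]


# Compare both counts for a short sequence length (2**10 cases per k).
n = 10
for k in range(2, 7):
    print(k, count_by_brute_force(n, k - 1), k_nacci(k, n + k))
```

One possible way to see why the recurrence appears, offered as a sketch rather than a proof: a valid sequence of length n (for n at least k) starts with j tails, where j is between 0 and k-1, followed by a head and then any valid sequence of length n-j-1, so the count for length n is the sum of the counts for the k preceding lengths.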

Kårø

Sorry to butt in but I was intrigued by the discussion at the RefDesk so I thought I'd take a look. I wonder if it might be the island of Kårøya in the Røst municipality? It matches the description of 'on the coast' and the o/ø suffix would suggest an island. But what it has to do with Norwich, I don't know. - Cucumber Mike (talk) 15:14, 19 June 2015 (UTC)

The double a ("aa") represents a vowel that was pronounced as a long a (/aː/) in Old Norse and that has shifted to [ɔ] ("awe") in Norwegian. The letter "Å" was officially introduced in Norwegian spelling in 1917. In given names and surnames, the double a is still used to some extent. If "Kaaro" is a Norwegian name, the modern spelling would be "Kåro", which does not sound like a Norwegian name to me, and has no sensible hits on Google Maps, Yr.no (a weather forecasting site which provides forecasts for the tiniest of places) or official maps [1]. Replacing the "o" with "ø" makes sense and sounds like a Norwegian name, but searching the same sites for "Kårø" gives no exact hits. Since Norwegian surnames are often toponyms, I've checked whether "Kårø" or "Kaarø" is a surname at Statistics Norway (https://www.ssb.no/en/befolkning/statistikker/navn). Result: Fewer than four people or no one has the name Kårø (this is the same result that you would get searching for someone named "Qwertyxyz"). There are 7 with Kaarø as their surname. In the telephone directory I find ten persons with "Kaarø" as their surname or middle name. Several of these live in Trondheim, which could be a clue to where "Kaarø" is located. A Google search for "Kårø" gives a couple of hits on genealogy sites, which indicate that the "Kårø" family name originates in Hemne, about 100 kilometers from Trondheim. I've found one person (with another name) who is a farmer with his business address listed as "Kårø", with a Google map link to Vinjeøra, in Hemne.

Cucumber Mike's suggestion of replacing the "ø" with "øy" is a good one, since "ø" is Danish and Riksmål for island, which in Bokmål and Nynorsk is spelled and pronounced "øy". "Øya" is just the definite form. Returning to the official maps, a search for "Kårø" gives the following result, with my translation of geographical features etc. in curly brackets:

Kårøyan, Turisthytte {tourist cabin} Hemne

Kårød, Bruk (gardsbruk) {farm} Nøtterøy

Kårøyan, Bruk (gardsbruk) {farm} Hemne

Kårøyosen, Sund i sjø {strait} Solund

Kårøydalen, Dal {valley} Hemne

Kårøya, Øy i sjø {island} Røst

Kårøyna, Holme i sjø {islet, see Holm (island)} Austrheim

Kårøyna, Holme i sjø {islet} Austrheim

Kårøyna, Holme i sjø {islet} Solund

Kårøylia, Li {hillside} Rindal

Kårøyskjeret, Skjær i sjø {Skerry} Røst

The reasonable candidates are in my opinion Kårøyan map link (which could be identical to the "Kårø" farm I mentioned above, although the Google map link is not quite right) and Cucumber Mike's suggestion Kårøya map link. There is also a Swedish island called Kårö map link. That said, I think it is a bit of a stretch to assume that Carrow Road is named for any of these. The source says: "If we agree to Norwich being a Scandinavian place-name, it is curious to notice that there is a place on the Norway Coast called Kaaro", which sounds like speculation to me. I would suggest changing the article text to "The name "Carrow" originally refers to the former Carrow Abbey that once stood on the riverside, its name in turn having possible Norse origins [13]", and omitting "possibly related to Kaaro, Norway".

ROC curves

I would appreciate your feedback about whether my understanding of ROC-curves is correct. I also have a couple of questions at the end. As this post clearly shows, my mathematical capabilities are limited, so please be gentle, and cautious about introducing terminology that I might have trouble understanding. If necessary, please translate my statements into more conventional terminology.

In my understanding, a ROC curve is a plot of true positive rates (TPR, Y-axis) vs false positive rates (FPR, X-axis), when the cutoff (CO) between what is considered a positive and a negative observation is varied such that it covers all reasonable values. The tangent at a given point (CO), can be estimated as

ΔTPR(CO)/ΔFPR(CO),

hence TPR(CO)/FPR(CO) is the derived function of the ROC curve, and the ROC-curve the antiderivative of the function TPR(CO)/FPR(CO).

In the following, I'll use the ROC curve in a medical context, and let 'm' represent the observed value of a diagnostic test, for which we have a ROC curve available. Then

TPR(CO)/FPR(CO) = p(m = CO ± eps|Disease)/p(m = CO ± eps|No disease) = the likelihood ratio function.

Q1: Am I right in thinking that the probability ratio in the previous line should be the probability of 'm' being close to CO (in my notation ± eps), and not greater than CO?

Q2: I've read a couple of places (such as here: Choi BC (1998) Am J Epidemiol 148:1127–32. PMID 9850136) that it is valid to draw lines between several points on the ROC curve, say corresponding to "negative", "weak positive" and "strong positive", and that the slope of each line is a valid estimate of the likelihood ratio for test results that fall within the corresponding interval. Sounds reasonable, but exactly why is that so?

Q3: I've read many places (including in our article) that the area under the curve corresponds to the probability of a randomly chosen diseased individual getting a higher test result than a randomly chosen individual without the disease. Again, this sounds reasonable, but exactly why is it so?

Thanks, NorwegianBlue^talk 14:18, 30 August 2015 (UTC)

Metaquestion: Why did my question receive no answers?

Reply from Ssscienccce

@NorwegianBlue:, late answer, but anyway...

I've taken a look at it (don't visit RD/MA often), the question was a bit confusing:

The tangent at a given point (CO), can be estimated as ΔTPR(CO)/ΔFPR(CO),

hence TPR(CO)/FPR(CO) is the derived function of the ROC curve

Disclaimer: not familiar with the topic, just my interpretation, which could be way off..

Is it about the continuous probability functions? These are turned into binary tests by integrating them (page 2260), and the ROC curve plots those values. (See note 1)

The tangent at a given point equals the ratio of the density functions (those in fig 1 in the link), and is also (by definition) ΔTPR(CO)/ΔFPR(CO), but not TPR(CO)/FPR(CO). (See note 2)

Each point on the curve gives you the fraction of the sick population and the fraction of the healthy population that would test positive at that CO. So if (TPR=0.9, FPR=0.1) is on the curve, 90% of all sick and 10% of healthy would test positive.

Context would make it easier to understand, preferably a link to the specific material and formulas you're asking about. There are not many people with a medical background here, and most maths techniques have numerous applications, so people won't know the conventions, notations, etc. Most sources I found were like this.

TPR(CO)/FPR(CO) = p(m = CO ± eps|Disease)/p(m = CO ± eps|No disease)

Q1: Am I right in thinking that the probability ratio in the previous line should be the probability of 'm' being close to CO (in my notation ± eps), and not greater than CO?

Haven't seen that formula in the links I checked. It seems a casual way of saying that both become equal when eps goes to zero. Seems valid to me, no reason to limit m to one side of CO, it's not like this has influence on anything, or I'm missing context. (See note 3)

Q2: it is valid to draw lines between several points on the ROC curve, say corresponding to "negative", "weak positive" and "strong positive", and that the slope of each line is a valid estimate of the likelihood ratio for test results that fall within the corresponding interval.

It follows from the definition of the ROC curve: y-coordinate = TPR(CO), x-coordinate = FPR(CO). The slope of a line from the origin (0,0) to a point is by definition y/x. For a line between two points on the curve you calculate the slope by subtracting the coordinates. For example, take the two points (TPR=0.5, FPR=0.3) and (TPR=0.8, FPR=0.7). For the segment between them, ΔTPR = 0.8-0.5 = 0.3 and ΔFPR = 0.7-0.3 = 0.4, so between those points the positive likelihood ratio is 0.3/0.4 = 0.75, i.e. more false positives than true positives. (See note 4)
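The slope calculation in this reply can be written as a small function. This is a sketch; the function name and the (FPR, TPR) tuple layout are choices made here for illustration. The reasoning behind it: the fraction of diseased individuals whose result falls between two cutoffs is the difference in TPR between those cutoffs, and likewise for the healthy and FPR, so the slope of the segment is the ratio of those two fractions.

```python
def interval_likelihood_ratio(point_a, point_b):
    """Return the slope of the ROC segment between two (FPR, TPR) points.

    The slope is the fraction of diseased individuals in the interval
    divided by the fraction of non-diseased individuals in the interval.
    """
    delta_fpr = point_b[0] - point_a[0]
    delta_tpr = point_b[1] - point_a[1]
    if delta_fpr == 0:
        raise ValueError("vertical segment: no non-diseased results in this interval")
    return delta_tpr / delta_fpr


# The two points from the example: (FPR=0.3, TPR=0.5) and (FPR=0.7, TPR=0.8).
print(round(interval_likelihood_ratio((0.3, 0.5), (0.7, 0.8)), 2))  # 0.75
```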

Q3: I've read many places (including in our article) that the area under the curve corresponds to the probability of a randomly chosen diseased individual getting a higher test result than a randomly chosen individual without the disease. Again, this sounds reasonable, but exactly why is it so?

The X and Y coordinates of a ROC curve are proportional to the number of healthy and infected people, respectively, so an increase of 0.10 in either direction represents 10% of the related group. Movement at 45° represents the same percentage of people in both groups.

Suppose the curve starts by going up to 0.4; that means 40% of the infected (let's assume low values indicate infection) will test lower than all the rest. Then 0.2 to the right, 0.3 up, 0.7 right, 0.2 up, and finally along the diagonal.

Calculate the probability of an infected person scoring better (lower) than a healthy one: 40% score better than all of the healthy: 0.4*1; 30% score better than 80%: 0.3*0.8; 20% score better than 10%: 0.2*0.1; and the last 10% have a one-in-two chance of doing better than the remaining 10%: 0.1*0.1*0.5. In total: 0.665.

The other group: 0.2*0.6, 0.7*0.3 and 0.1*0.1*0.5, in total 0.335. You can work it out on paper: when you change the curve, the change in area will match the change in probability. Here, the area is 66.5%. (See note 5)
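The example curve can be checked numerically. The sketch below computes the area under the curve with the trapezoid rule and, separately, repeats the pairwise reasoning above step by step; the function names are hypothetical and the points are the (FPR, TPR) corners of the curve described in this reply.

```python
# Corners of the example curve as (FPR, TPR): up 0.4, right 0.2, up 0.3,
# right 0.7, up 0.2, then along the diagonal to (1, 1).
ROC_POINTS = [(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.7),
              (0.9, 0.7), (0.9, 0.9), (1.0, 1.0)]


def area_under_curve(points):
    """Trapezoid-rule area under a piecewise linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def probability_infected_scores_better(points):
    """Probability that a random infected person beats a random healthy one.

    For each segment, the infected people in it beat every healthy person
    further along the curve, and beat the healthy people in the same
    segment half of the time.
    """
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        infected_here = y1 - y0
        healthy_here = x1 - x0
        healthy_later = 1.0 - x1
        total += infected_here * (healthy_later + 0.5 * healthy_here)
    return total


print(round(area_under_curve(ROC_POINTS), 3))                    # 0.665
print(round(probability_infected_scores_better(ROC_POINTS), 3))  # 0.665
```

Both calculations give the same number because each one sums the same region of the unit square, once in vertical strips and once in horizontal strips.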

Oops, I now see that the link you provided had not just the abstract, but the whole article... (9850136?) Ssscienccce (talk) 02:30, 21 September 2015 (UTC)

Thank you, Ssss, those were very helpful answers and links. I took the liberty of inserting yellow labels in your answer, in order to reference your replies more easily. Yes, I see that the question was confusing, reflecting my own confusion.

Note 1: I was thinking about a smoothed ROC curve, and see that that would imply needing continuous distributions.
Note 2: Duh! Confused, wrong thinking on my part here. Of course it isn't. I'm forgetting the Quotient rule, and being unclear about exactly what variable I'm differentiating with respect to.
Note 3: Q1 is very confused. I think I'd rather strike it than try to reformulate it.
Note 4: Thanks! I understand it now.
Note 5: Thanks! I'm still struggling a bit with this one, but it's getting late. I'll re-read your reply carefully in the morning, and I'm optimistic that I'll understand it then.

Your reference to Johnson 2005, PMID 15236429 was very helpful indeed! --NorwegianBlue^talk 20:38, 21 September 2015 (UTC)