User:Sjnicholls44/The Sieve of Nicholls

In mathematics, the Sieve of Eratosthenes is the basic ancient algorithm for finding all prime numbers up to any given limit. There are many variants referenced in the see also section on this page.

The basic sieve is relatively slow, and there are a number of flaws in the algorithm. These are best explained by looking at the test results section on this page, which covers the results of this investigation into a better algorithm. The test looks for all primes up to int.MaxValue (2,147,483,647); there are 105,097,564 to find, so 4.89% of the numbers are prime. As algorithm 12, the basic Sieve of Eratosthenes takes 01:39.279, reading from the array 2,147,483,645 times (once for each number) and writing to it 6,999,004,570 times (3.43 times for each non-prime). The algorithm considers all numbers as composites of each prime and every integer co-factor, so it ends up writing to each composite once for every prime it is sieved by - i.e. 6 is written to as 2x3 when sieving 2 and again as 3x2 when sieving 3 - which is not too efficient.
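
To make this concrete, here is a minimal C# sketch of the basic sieve (the limit of 1000, the counters and the names are illustrative, not the actual test code):

 using System;

 class BasicEratosthenes
 {
     static void Main()
     {
         const int limit = 1000;                      // illustrative; the tests use int.MaxValue
         var notPrime = new bool[limit + 1];
         long reads = 0, writes = 0;

         for (int n = 2; n <= limit; n++)
         {
             reads++;
             if (notPrime[n]) continue;               // n already marked as composite
             // n is prime: mark every multiple, so a composite like 6 is
             // written to as 2x3 and again as 3x2.
             for (long c = 2L * n; c <= limit; c += n)
             {
                 notPrime[c] = true;
                 writes++;
             }
         }

         Console.WriteLine($"reads={reads}, writes={writes}");
     }
 }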

The Sieve of Atkin is a variant of the basic sieve that uses quadratic masks to find primes. Algorithm 11 implements this and shows it runs quicker than the basic sieve, taking 01:26.583 (87.21% of the time). Measuring its efficiency is trickier. Initially it works the other way round, only marking numbers it thinks are primes. However, it marks too many, so it needs a second pass with an Eratosthenes-style sieve over multiples of the square of each prime to remove the incorrect guesses. It writes 868,871,578 times. Given it is trying to mark primes, this is an average of 8.26 writes per prime; to make it comparable, that is 0.43 writes for each non-prime. The write count is far better, but applying the masks involves lots of modulo arithmetic, which is slow. Plus, it reads from the array 31.3% more than the basic sieve, hence it not being spectacularly quicker.

Besides performance, both these algorithms also suffer from the following two flaws:

  1. Memory requirements limit the algorithm to int.MaxValue - both algorithms need to use an array of booleans to store the state of every number up to the limit. A boolean array for all numbers up to int.MaxValue requires about 255Mb. Computers can work with numbers up to long.MaxValue (9,223,372,036,854,775,807), and ideally an algorithm should be able to work up to this limit. These algorithms would need a 1,073,741,823Gb array to do so!
  2. Single-threaded operation - most machines have more than one CPU or core, so being unable to use all of them seems somewhat of a waste.

Essentially, a way needs to be found to do the work iteratively, working on smaller parts of the overall goal, in parallel. A finite set of threads could then use fixed size blocks to explore up to a very large limit like long.MaxValue. This can be done with both these algorithms. However, the Atkin sieve is trickier and the inability to dramatically improve its performance with some other basic optimisations means there is not much point considering doing so.

The Sieve of Nicholls covered on this page is, like the Atkin sieve, a variant of the Sieve of Eratosthenes. It:

  1. Works in parallel.
  2. Can run up to long.MaxValue.
  3. Is 16.1x faster than the Eratosthenes sieve and 11.6x faster than the Atkin sieve.

Looking at the test results, algorithm 1 takes only 00:06.166 to find all primes up to int.MaxValue using 4 threads - 6.21% of the time taken by the basic sieve. It also reads from the array only 381,604,889 times and writes 418,936,899 times, so only 0.205 writes per non-prime - better than the Atkin sieve.

The rest of this page now covers how this works.

Core set of Optimisations

There are two variants of the Sieve of Nicholls covered on this page.

  1. Simpler single-threaded algorithm that is still bounded by int.MaxValue.
  2. The full multi-threaded algorithm bounded by long.MaxValue.

Both algorithms use the following core set of optimisations, that we will cover first:

  1. Only sieve primes that are smaller than the sqrt of the limit
  2. Start sieving at the square of the current prime
  3. Avoid multiples of the first N primes

If you look at the test results, algorithm 3 contains these 3 optimisations and takes only 00:16.435 - 16.6% of the time, so over a 6x speed increase. It has the same read and write profile as the parallel version of the algorithm, but is single-threaded and bound by int.MaxValue.

Applying these same optimisations to the Atkin sieve in algorithm 10 only increases the speed to 01:12.028, which is far less of an improvement. This is because you can't easily use the steps array within the modulo calculations to avoid exploring numbers. As a result of not being able to get its basic performance anywhere near that of the Eratosthenes sieve, parallelising the Atkin sieve is not explored.

Let's look at each improvement in detail.

Only sieve primes that are smaller than the sqrt of the limit

The basic Eratosthenes algorithm forms composites to sieve by considering all multiples of each prime, sieving all composites up to the limit. However, once the prime is larger than the sqrt(limit), the only co-factors the algorithm will consider must be smaller than the sqrt(limit) in order for the composite to stay smaller than the limit - i.e. if the limit is 100, when sieving 11 we won't go beyond 11x9=99. Looking at all the pairs:

 11x2  = 22, 2 is prime
 11x3  = 33, 3 is prime
 11x4  = 44, 4 is 2x2
 11x5  = 55, 5 is prime
 11x6  = 66, 6 is 2x3
 11x7  = 77, 7 is prime
 11x8  = 88, 8 is 2x4
 11x9  = 99, 9 is 3x3
 11x10 = 110, 10 is 2x5
 11x11 = 121, 11 is the current prime

The algorithm will stop at a co-factor of 10. However, is there any point in sieving the prime 11 at all? In short, no: all these smaller co-factors are either smaller primes or composites of smaller primes, so every one of these composites has a prime factor smaller than 11 and was already marked when that smaller prime was sieved. Hence there is no need to bother.

Test 8 isolates this optimisation; it brings the time down to 01:07.203, a 32.3% improvement. The read profile stays the same, but it makes 19.1% fewer writes. So a reasonable improvement.

Start sieving at the square of the current prime

This optimisation is an extension of the previous one. If we change the example so we are sieving up to a limit of 1000, then 11 is below sqrt(1000)=31.6, so it would need sieving. However, is part of the work pointless? Yes, it is a waste to consider the same smaller co-factors that are formed from the smaller primes. The first co-factor that we definitely have to use to create a composite for sieving is the prime itself - in this case 11. The same principle applies to all primes. The solution is to start sieving at the square of each prime.

You might be thinking that in reality there are co-factors larger than p (11 in this example) that are also only made up of these smaller primes - i.e. 12, 14, 15, 16, etc. The question is how to miss these out without missing ones you do need to consider. The next section covers this.

Test 7 isolates this optimisation; it brings the time down to 01:03.084, a further 4.16% improvement. The read profile stays the same, but there is a further 1.5% reduction in writes to the array. So a small but valuable improvement.
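
A minimal C# sketch of the sieve with these first two optimisations applied (again the limit and names are illustrative, not the actual test code):

 using System;

 class EratoSqrtSqr
 {
     static void Main()
     {
         const int limit = 1000;                      // illustrative
         var notPrime = new bool[limit + 1];

         // Only sieve primes p with p*p <= limit: larger primes have no new composites to mark.
         for (long p = 2; p * p <= limit; p++)
         {
             if (notPrime[p]) continue;
             // Start at p*p: composites with a smaller co-factor were already
             // marked when that co-factor's smaller prime factors were sieved.
             for (long c = p * p; c <= limit; c += p)
                 notPrime[c] = true;
         }

         long count = 0;
         for (int n = 2; n <= limit; n++)
             if (!notPrime[n]) count++;
         Console.WriteLine($"{count} primes up to {limit}");
     }
 }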

Avoid multiples of the first N primes

The trend is that considering fewer co-factors of each prime reduces the effort of the algorithm and makes it quicker. Is there any way to improve this further? As indicated above, there are still co-factors greater than p that are formed only from primes smaller than p - i.e. when sieving 13 you will start at a co-factor of 13 and sieve:

 13x13
 13x14 2
 13x15 3, 5
 13x16 2
 13x17
 13x18 2, 3
 13x19
 13x20 2, 5
 13x21 3, 7
 13x22 2, 11
 13x23
 ...

However, the ones annotated above with their smaller prime factors are entirely pointless. Each such co-factor will have been considered when sieving the smaller primes indicated, so 13 times it will also have been marked by those same sieves. So is there some way of stepping through co-factors to avoid them? A simple optimisation would be to step by two, so only do (13, 15, 17, 19, ...), essentially +=2 to the co-factor.

Can we do better? How about avoiding multiples of both 2 and 3? In this instance you need to alternately step +=2 then +=4, starting from 5 - i.e. (5, 7, 11, 13, 17, 19, 23, 25, etc). For the first two primes at least there seems to be a repeating pattern of steps. Does this apply for more and more primes? What about (2,3,5)? In this instance the stepping pattern cycles after 8 steps starting from 7 and is (4,2,4,2,4,6,2,6). So what about larger seed sets?

 Primes               | Cycle Range | No. of Steps | Actual Steps
 2,3                  |           6 |            2 | 2,4
 2,3,5                |          30 |            8 | 4,2,4,2,4,6,2,6
 2,3,5,7              |         210 |           48 | 2,4,2,4,6,2,6,4,2,4,6,6,2,6,4,2,6,4,6,8,4,2,4,2,4,8,6,4,6,2,4,6,2,6,6,4,2,4,6,2,6,4,2,4,2,10,2,10
 2,3,5,7,11           |       2,310 |          480 | ... too long!
 2,3,5,7,11,13        |      30,030 |        5,760 | ... too long!
 2,3,5,7,11,13,17     |     510,510 |       92,160 | ... too long!
 2,3,5,7,11,13,17,19  |   9,699,690 |    1,658,880 | ... too long!

FYI, these steps are from the next prime after the last one in the list - i.e. for 2,3,5 the steps work when you start from 7.

Having a cycle means it is not that hard to build an efficient algorithm around this. You can generate a steps array to drive the change in the co-factor at every iteration of the inner loop - i.e. let us consider using a (2,3,5) steps array. Starting at p=7, your co-factors to check would be (7x7, 7x11, 7x13, 7x17, 7x19, 7x23, 7x29, 7x31, 7x37, 7x41, ...); you will notice 41 is created by cycling back to the start of the steps array. You simply repeat this whilst the product is smaller than the limit.

As well as using the steps array to reduce work in this inner loop, you can also use it to reduce work in the outer loop. Use the steps array to step through the sieve itself looking for primes - i.e. you pre-load (2,3,5) as primes, then start at 7 and only consider (11, 13, 17, 19, 23, 29, 31, 37, 41, etc), which will speed things up even more. The other advantage is in working out what index into the steps array you should start with in the inner loop - i.e. when the outer loop gets to 29, the index into the steps array will be 6, implying the next step size to use is 2. Conveniently, given you also want to start with a co-factor of 29 (as in 29^2), you simply copy 6 as the starting index into the steps array for the inner loop - simple.
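
Here is a hedged C# sketch of the (2,3,5) steps array driving both the outer and inner loops as just described (the limit and names are illustrative, not the actual test code):

 using System;

 class WheelSieveSketch
 {
     // Steps for seed primes (2,3,5), cycling from 7: 7, 11, 13, 17, 19, 23, 29, 31, 37, ...
     static readonly int[] Steps = { 4, 2, 4, 2, 4, 6, 2, 6 };

     static void Main()
     {
         const int limit = 1000;                      // illustrative; assumes limit >= 7
         var notPrime = new bool[limit + 1];
         long count = 3;                              // the seed primes 2, 3 and 5

         long p = 7;
         int stepIndex = 0;                           // index of the next step for the outer loop
         while (p * p <= limit)
         {
             if (!notPrime[p])
             {
                 count++;
                 // Inner loop: the co-factor starts at p (so the first composite is p*p)
                 // and is advanced with the same steps array, starting at the same index.
                 long cofactor = p;
                 int i = stepIndex;
                 while (p * cofactor <= limit)
                 {
                     notPrime[p * cofactor] = true;
                     cofactor += Steps[i];
                     i = (i + 1) % Steps.Length;
                 }
             }
             p += Steps[stepIndex];
             stepIndex = (stepIndex + 1) % Steps.Length;
         }

         // Count the remaining candidates above sqrt(limit), still skipping multiples of 2, 3 and 5.
         while (p <= limit)
         {
             if (!notPrime[p]) count++;
             p += Steps[stepIndex];
             stepIndex = (stepIndex + 1) % Steps.Length;
         }

         Console.WriteLine($"{count} primes up to {limit}");
     }
 }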

How do you create such an array? Use the following steps (a code sketch follows the list):

  1. The cycle is related to the pattern of numbers that are not multiples of the seed primes, and its range is the product of all the seed primes - i.e. for (2,3,5) the cycle will start at 7 and run to (2*3*5) + 7 + 1 = 38.
  2. Use an Eratosthenes sieve with the first two core optimisations to sieve only the seed primes up to this number - i.e. 38.
  3. Start at the first non-seed prime (7 in this example) and record the size of the gaps between primes - i.e. 7 (4) 11 (2) 13 (4) 17 (2) 19 (4) 23 (6) 29 (2) 31 (6) 37.
  4. Store these gaps in a fixed size array - i.e. use a List<int> to record them, then call .ToArray() on it at the end.
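
A C# sketch of these four steps for an arbitrary seed set (the method and names are illustrative, not the actual test code):

 using System;
 using System.Collections.Generic;

 class StepsArrayBuilder
 {
     // Builds the cycle of gaps between numbers coprime to the seed primes,
     // starting from the first non-seed prime (7 for seeds 2,3,5).
     static int[] BuildSteps(int[] seedPrimes, int firstNonSeedPrime)
     {
         long cycleRange = 1;
         foreach (int p in seedPrimes) cycleRange *= p;        // 2*3*5 = 30

         long upper = cycleRange + firstNonSeedPrime + 1;      // 38 in the example
         var marked = new bool[upper + 1];

         // Sieve only the seed primes over this small range, starting at p*p.
         foreach (int p in seedPrimes)
             for (long c = (long)p * p; c <= upper; c += p)
                 marked[c] = true;

         // Record the gaps between the unmarked numbers from the first non-seed prime onwards.
         var steps = new List<int>();
         long previous = firstNonSeedPrime;
         for (long n = firstNonSeedPrime + 1; n <= upper; n++)
         {
             if (marked[n]) continue;
             steps.Add((int)(n - previous));
             previous = n;
         }
         return steps.ToArray();
     }

     static void Main()
     {
         int[] steps = BuildSteps(new[] { 2, 3, 5 }, 7);
         Console.WriteLine(string.Join(",", steps));           // expected: 4,2,4,2,4,6,2,6
     }
 }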

So how many seed primes to use?

Well, as you can see, the size of the steps array grows quite quickly, so keeping the array in the CPU cache rapidly becomes a consideration. There is also the cost of generating it: the range of numbers you would need to scan to build the steps array for the first 15 primes is 6.1e17 numbers, which is not tractable.

The question is really best answered by considering whether the benefit is linear? The answer is no.

Skipping multiples of 2 removes 1/2 of the numbers; also skipping multiples of 3 removes a further 1/6 (the odd multiples of 3), so 3/6 + 1/6 = 2/3 combined. In fact, using a steps array for the first 5 primes will remove 79.2% of numbers from consideration. Adding the next 5 primes will only remove a further 5%, the next five after that only 1.9%, and so on.
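
These percentages can be checked by multiplying together the surviving fractions (p-1)/p for each seed prime; a throwaway C# sketch:

 using System;

 class WheelCoverage
 {
     static void Main()
     {
         int[] primes = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47 };
         double remaining = 1.0;                       // fraction of numbers still considered
         for (int i = 0; i < primes.Length; i++)
         {
             remaining *= 1.0 - 1.0 / primes[i];       // each seed prime removes 1/p of what is left
             if ((i + 1) % 5 == 0)
                 Console.WriteLine($"first {i + 1} primes remove {(1 - remaining) * 100:F1}% of numbers");
         }
     }
 }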

At the same time, the range that you need to scan to calculate the steps array grows by a factor of each new prime. Here are some pros and cons:

Pros

  1. The more primes in the seed, the larger the step sizes, so each data page of the steps array will cover more of the number range.
  2. Stepping by large co-factor jumps leads to loading fewer data pages that contain the composites for marking.

Cons

  1. As the cycle range is the product of all the primes, at some point the time taken to sieve the seed primes from that range will become a significant cost to the algorithm.
  2. As soon as the steps array is larger than a single cache page, different pages will need bringing into the cache as the algorithm goes along. Although once it is more than one page, most of the slow-down has already been incurred: simply having to swap a page is the cost, which page you swap it for matters less, and each page holds the same number of steps.
  3. Each prime you add removes fewer numbers from consideration, so the overhead of the size of the steps array will start to make the benefit marginal.

So what does this mean in real terms? For the first 5 primes, the 480 steps can be held in 1.8Kb, so they should easily fit in a 4Kb page on my hardware architecture. Up to this point a single page is needed in the cache. Beyond this it will need to swap pages to work through the entire steps array, but you do remove a lot more numbers from consideration.

Having tested various sizes, (2,3,5,7,11,13,17,19) as a seed was the fastest. Beyond this it starts to slow down again quite dramatically, mainly as a result of having to run a full sieve up to 223,092,870 to generate the steps, which is not a trivial cost.

Test 3 isolates this optimisation; it brings the time down to 00:16.435, a further 46.99% improvement. This is more than 3.83x faster with just this optimisation. A real benefit.

Simpler single-threaded algorithm that is still bounded by int.MaxValue

In trying to create the most efficient single-threaded algorithm, two further optimisations were considered. When it came down to it they didn't yield the benefit expected, but they merit discussion.

  1. Check the sieve hasn't been marked before you write to it
  2. Use the sieve to avoid co-factors

Check the sieve hasn't been marked before you write to it

Most computers find it harder to clear CPU caches of pages that have been written to, as they need flushing back to main memory; pages that are simply read from can just be discarded. Given the algorithms all write repeatedly to the same composites, avoiding the 2nd, 3rd, 4th, etc. writes might increase the speed of the algorithm.

This can be done simply by reading it first - i.e. when sieving 3, the composite 12 will already have been marked when sieving 2 - so just check if the bit has been set and skip the write if it has. There is a trade-off here:

  1. It will add an extra read operation for all the initial writes.
  2. It may not stop the data page from needing flushing back. Where a data page contains multiple composites, it only requires a write to one of them for the page to be marked as dirty, making the reads pointless - i.e. if we consider algorithm 3, the steps array is guaranteed to move the index into the array at least p*2 away from the current composite, possibly more depending on the size of the step. For a 4Kb data page, once the prime is larger than 1024x4x8/2=16384 these jumps will move you to another data page for each composite. However, for smaller primes more and more composites will appear on the same page, and if any one of them is written to then all the reads for that page will have been a waste of time. So what proportion of writes are for primes below 16384? Algorithm 3 stops sieving primes at sqrt(int.MaxValue)=46341. There are 1900 primes below 16384, and 2892 beyond that up to 46341, so only 39.6% of the primes are below the threshold. However, those 1900 cause 95.5% of the writes. So it is entirely possible this optimisation will not work as well as expected.

In order to test the benefit gained in genuinely avoiding writes to data pages, variants of each of the algorithms were created that introduce read checks once the primes are larger than 16384 for algorithms that step co-factors by at least 2, and 32768 for those that step through every co-factor.
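
A sketch of what such a read-checked write might look like (the threshold constant and the names are illustrative assumptions, not the actual test code):

 static class ReadCheckedWrite
 {
     // 16384 is the threshold used here for the stepped algorithms: beyond it,
     // consecutive composites of a prime land on different 4Kb data pages.
     const long ReadCheckThreshold = 16384;

     // Mark one composite, reading first only when the prime is large enough that
     // the avoided write could plausibly keep a whole cache page clean.
     public static void Mark(bool[] notPrime, long composite, long prime)
     {
         if (prime > ReadCheckThreshold)
         {
             if (!notPrime[composite])        // the extra read...
                 notPrime[composite] = true;  // ...that sometimes saves a write
         }
         else
         {
             notPrime[composite] = true;      // small primes: just write
         }
     }
 }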

The results are very clear for the basic Eratosthenes sieve in algorithm 12: its performance increases by 2.7%. However, for our fastest multi- and single-threaded algorithms, 1 and 3, it is of debatable benefit. This is all to do with the ratio of reads added to writes saved. You will see that for the basic sieve this is pretty much 1. This is because, without the steps filter, 70.8% of its writes are repeated writes to non-primes, so predominantly there are savings to be made. However, for algorithm 3 only 37.4% of writes are repeated writes to non-primes, so there are far more reads that don't give us any benefit. This bears out in the ratios: algorithm 3 has a ratio of 2.15, so for every write saved 2.15 reads are made. Overall this overhead slows the algorithm down.

The only odd results worth clarifying are those of algorithms 1 and 2. Given these are just parallel versions of 3 and 4 respectively, you would expect the reads and writes to change by the same amounts in the read-check version. For the reads they do; for the writes they don't. This is because of the way primes are queued between threads in the algorithm. Small primes have to pass through lower batches first, whereas larger primes jump ahead of them because they start at p^2 - i.e. for the composite 269,780,429, which has prime factors (17, 967, 16411), 16411^2 is 269,320,921, so 16411 is queued to the third block ahead of 17 (which starts at 289) and 967 (which starts at 935,089); those two have to make their way through two blocks before they join block three's queue. This means 16411 gets sieved first and the bit is not set at that point in time. This will happen for lots of primes, hence the writes not being the same. It actually makes the benefit of this optimisation even worse for the parallel implementation.

The read-check version of the fastest algorithm does seem to consistently time slightly faster than the non-read version. That said, the 95% confidence interval is larger than the difference, so it is less believable. For the single-threaded version in algorithm 3 it definitely is not of benefit. As such, this optimisation is borderline and is not recommended.

Use the sieve to avoid co-factors

Is there an extension to the basic optimisation of avoiding multiples of the first N primes? After all if we generate a basic steps array of say (2,3,5), once we have sieved (7, 11, 13, etc) the sieve itself will know which co-factors have already been sieved. So the sieve itself can be used as a steps array - i.e. once we have sieved 7, when we are doing 11:

 11x11
 11x13
 11x17
 11x19
 11x23
 11x29
 11x31
 11x37
 11x41
 11x43
 11x47
 11x49 49 = 7 x 7 => 539 = 7 x 77
 11x53
 11x59
 11x61
 11x67
 11x71
 11x73
 11x77 77 = 7 x 11 => 847 = 7 x 121
 11x79
 11x83
 11x89
 11x91
 11x97
 11x101
 11x103
 11x107
 11x109
 11x113
 11x119 119 = 7 x 17 => 1309 = 7 x 187
 11x121
 11x127
 ...

All composites that have 7 as a prime factor will have been sieved when sieving 7. Any co-factor for 11 in this example that has 7 as a factor will itself have been marked in the sieve, so we could check the co-factor in the sieve and simply skip it without marking the overall composite. The frequency may seem low for 7, but as you step more primes away from 5 there are more and more co-factor combinations of the primes between 5 and p that can be avoided.

The problem with this optimisation is that, implemented naively, you can miss composites. Reading and writing from the sieve at the same time causes problems with composites that have p as a factor more than once - i.e. consider what would happen when sieving 7: we will start at 7 as a co-factor and sieve 49=7x7. Whilst still sieving 7, when we get to 49 as a co-factor we will skip it, as it is marked as not prime, which means we fail to sieve 343=7x(49=7x7). The solution to this problem is to sieve backwards, starting from the limit. This means we do not mark a co-factor until we have sieved all its multiples - i.e. in this example we will not mark 49 until we consider co-factor 7, so 49 will still be unmarked as a co-factor when we go past it and we will mark 343.
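
A C# sketch of this backwards, co-factor-checking sieve (illustrative limit and names; this is the rejected variant, shown only to make the ordering argument concrete - it does not include the steps array):

 using System;

 class BackwardCofactorSieve
 {
     static void Main()
     {
         const int limit = 1000;                       // illustrative
         var notPrime = new bool[limit + 1];

         for (long p = 2; p * p <= limit; p++)
         {
             if (notPrime[p]) continue;                // p is prime
             for (long c = limit / p; c >= p; c--)     // co-factors, largest first
             {
                 // A marked co-factor is a composite of smaller primes, so p*c was
                 // already marked when those primes were sieved - skip it.
                 if (notPrime[c]) continue;
                 notPrime[p * c] = true;               // e.g. 343 = 7x49 is marked before 49 itself
             }
         }

         long count = 0;
         for (int n = 2; n <= limit; n++)
             if (!notPrime[n]) count++;
         Console.WriteLine($"{count} primes up to {limit}");
     }
 }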

Computationally this changes the algorithm, as you are no longer just working in one place in the array: the place you are checking in the sieve is different from the place you are marking. In checking the co-factors you will begin by loading the page at limit/p and start scanning it backwards, using all unmarked numbers as co-factors. At the same time the algorithm will need to load pages near the limit, working downwards, to do the marking.

What are the pros and cons?

Cons

  1. We only have a finite CPU cache. Needing to load two pages for the algorithm to use does mean the CPU cache will need to load and clear more pages from the cache.
  2. Ultimately you cannot reduce the number of pages that you will need to write to and flush back to main memory.

Pros

  1. The pages it loads for checking co-factors will only be used in a read-only manner so will not require flushing back to main memory.
  2. Every page loaded for co-factor checking will be used with the same density. The co-factors are stepped through using the steps array, so regardless of which page is loaded each will get a similar density of use. Regardless of the size of p this will remain constant - let's call the average number of reads per page s.
  3. Once p is large enough that the composites of adjacent co-factors appear on different data pages, every read of the co-factor data page that avoids checking a composite will mean one fewer page needs to be loaded into the cache. If all s reads stop their composites being checked, then one data page loaded for co-factor checking will stop the loading of s data pages for composite marking. Which is a big potential saving.

So the question is will the overhead of having to load these extra data pages be offset by these gains?

The reality is that this optimisation does not bear out. If you look at the results, algorithm 5, which adds this optimisation to algorithm 4, takes 00:21.357, which is 13.6% slower. It does make 248,060,548 fewer writes, but to do so it needs to make 589,193,356 more reads. This trade-off just doesn't help the performance of the algorithm.

The final nail in the coffin for this optimisation is the fact that sieving backwards means you can't parallelise the algorithm. It stops you from being able to discover subsequent primes early - i.e. you don't discover that 3 is the next prime until you have sieved 2 all the way from the limit down to 4, whereas in the forwards implementation you find out after having sieved the first composite - i.e. 4.

As a result, this optimisation is not of benefit to the already fully optimised version, and it is definitely omitted.

The resulting single threaded algorithm

The resulting single-threaded algorithm is the Eratosthenes sieve with the 3 core optimisations; the other two options do not actually improve the performance.


The full multi-threaded algorithm bounded by long.MaxValue

The only true improvement over the core optimisations is to parallelise the algorithm. This creates the following benefits:

  1. Performance improvement - it is over 2.65x faster than the single threaded algorithm in the tests.
  2. Dramatically reduces the memory overhead as you are using sub-blocks of the overall limit.
  3. Allows the algorithm to be unbounded by both int.MaxValue and the memory limits of the machine.

Summary of the Algorithm

The concept of the algorithm is as follows.

The number range is broken up into blocks. When a prime has finished being sieved in one block three values are passed on to the next block for it:

  1. prime: the prime number
  2. next-composite: the next number to mark in the virtual array - i.e. in the next block you will mark next-composite - block-start.
  3. step-array-index: the current position in the steps array, which implicitly records the co-factor - i.e. the next composite to mark after next-composite will be next-composite + prime * steps[step-array-index].

These are passed around as triples between the threads; each receiving thread needs to process all of them before looking for any new primes within its own number range.

There is a special triple which gets passed last on to the next batch. It controls the behaviour of the outer loop that is looking for new primes. The first thread gets passed (13, 0, 1). Having no previous primes to deal with, the first thread will simply use this to step up through the numbers checking for primes. When it gets to the end of the block, it will pass it on to the next block. This allows the next block to know the next number to check and where the previous block had got to in the steps array.
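
As a hedged sketch, such a triple might look like the following C# type (the field names are illustrative, not taken from the test code):

 // A minimal sketch of the state passed between blocks; field names are illustrative.
 readonly struct SieveTriple
 {
     public readonly long NextComposite;   // first composite to mark in the receiving block
     public readonly int  StepIndex;       // current position in the shared steps array
     public readonly long Prime;           // the prime being sieved; 1 marks the control triple

     public SieveTriple(long nextComposite, int stepIndex, long prime)
     {
         NextComposite = nextComposite;
         StepIndex = stepIndex;
         Prime = prime;
     }

     public bool IsControl => Prime == 1;
 }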

Algorithm in detail

With these concepts summarised, the algorithm in detail is as follows.

1) The organising thread

  1. Decide on:
    1. a batch size, say 100,000,000.
    2. the number of threads to allow in parallel, say 4.
  2. Create N batch objects with starting points separated by the batch size - i.e. 0, 100000000, 200000000, 300000000.
  3. Create a thread pool of size N, queue these initial N batch objects to the thread pool.
  4. Add control triple (13, 0, 1) to batch 0's queue.
  5. Wait till the largest batch has run.

2) Each worker thread

  1. Check the inbound queue for triples.
  2. If the triple is not the control triple - i.e. prime > 1, then
    1. Sieve the prime, using the core optimisations, up to the end of this block's number range (a code sketch of this step follows the list).
    2. Once sieved, the triple will refer to a number outside this thread's number range. Retrieve or create the batch object that needs to process this prime next and add the triple to its queue.
  3. If the triple is the control triple, then
    1. Step through the numbers in the block looking for primes up to the sqrt(limit).
    2. When one is found, for each prime p:
      1. Create a triple (p*p, current-steps-array-index, p) for it. This works because the first co-factor is p, and the step to use next for that co-factor conveniently corresponds to where the control triple is currently looking in the steps array.
      2. If p*p is still within this block's number range, then apply all the steps in 2 above, as you did for the primes passed to this thread.
    3. Continue until the control triple has moved beyond the end of this block's number range.
    4. Pass the control triple on to the next thread.
    5. Retrieve (or create) the block object starting at this block's start + (4 * batchsize), if that is still below the limit of the search, and queue it to the thread pool; otherwise queue no more blocks.
    6. Centrally mark the next block as the lowest active block.
    7. End the method; this will return the thread to the thread pool.
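
A hedged C# sketch of step 2 of the worker thread, reusing the SieveTriple type sketched earlier (the method name and signature are illustrative):

 static class BlockSieve
 {
     // Sieve one inbound prime triple across this block's bit array, then return
     // the updated triple so the caller can queue it to whichever batch owns the
     // next composite.
     public static SieveTriple SieveWithinBlock(
         bool[] block, long blockStart, long blockEnd,   // block covers [blockStart, blockEnd)
         SieveTriple t, int[] steps)
     {
         long composite = t.NextComposite;
         int stepIndex = t.StepIndex;

         while (composite < blockEnd)
         {
             block[composite - blockStart] = true;       // mark in this block's slice of the virtual array
             composite += t.Prime * steps[stepIndex];    // advance the co-factor by one step
             stepIndex = (stepIndex + 1) % steps.Length;
         }

         // composite now lies in a later block; the caller queues the returned triple there.
         return new SieveTriple(composite, stepIndex, t.Prime);
     }
 }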

Example of the algorithm

To put some colour on this, here is an example of the sort of thing that will happen using a block size of 200 with 4 threads and a steps array of 480 steps for seed primes (2,3,5,7,11).

Blocks for 0-199, 200-399, 400-599 and 600-799 are created and queued to a thread pool of size 4. The control triple of:

    13,  0, 1

is then passed to the block dealing with 0-199. With no previous primes to process, it gets straight down to the job of looking for primes. It immediately finds 13 and creates the triple:

   169,  0, 13

This needs sieving in this first block, so the thread does so. The step size at index 0 is 4, so the next composite to sieve for 13 is 169 + 13*4 = 221, and the triple becomes:

   221,  1, 13

This gets queued to the second block, dealing with 200-399, and the first block's thread then gets on with finding more primes within its number range. The next two primes produce the only other triples that will be queued to the second block:

   289,  1, 17
   361,  2, 19

with the following being the only one to go to a block that already has an active thread:

   529,  3, 23

The remainder it finds will cause 36 more block objects to be created in a central block store, with each of these triples being queued to them for processing:

   841,  4, 29
   961,  5, 31
  1369,  6, 37
  1681,  7, 41
  1849,  8, 43
  2209,  9, 47
  2809, 10, 53
  3481, 11, 59
  3721, 12, 61
  4489, 13, 67
  5041, 14, 71
  5329, 15, 73
  6241, 16, 79
  6889, 17, 83
  7921, 18, 89
  9409, 19, 97
 10201, 20, 101
 10609, 21, 103
 11449, 22, 107
 11881, 23, 109
 12769, 24, 113
 16129, 25, 127
 17161, 26, 131
 18769, 27, 137
 19321, 28, 139
 22201, 29, 149
 22801, 30, 151
 24649, 31, 157
 26569, 32, 163
 27889, 33, 167
 29929, 35, 173 - note that we will have skipped index 34 in the steps array, as it corresponds to 169, which is not prime.
 32041, 36, 179
 32761, 37, 181
 36481, 38, 191
 37249, 39, 193
 38809, 40, 197
 39601, 41, 199

Once the first thread has passed all these on it will queue the control triple to the second block, it will have the value:

   211, 42, 1

It then retrieves block 800-999 from the block store and queues it to the thread pool for processing. If the block were beyond the end of the limit of the search it would not do so. This thread then marks 200 as the lowest active block, and the thread's method then terminates, returning the thread to the thread pool.

The second thread will have had a queue that looks like:

   221,  1, 13
   289,  1, 17
   361,  2, 19
   211, 42, 1

It will sieve the first three and pass them on, then use the control triple at the end to find more primes, passing it on to block 400-599 when it is done, and so on until we get to the limit.

Memory requirements for the algorithm

This algorithm has very different memory requirements to the full-array implementations:

  1. The working sieves - which is the block size times the thread pool size. In my experiments, matching the thread pool size to the number of CPUs was the most efficient, and a block size of about 100,000,000 seemed enough to get all the performance benefits. So we are talking 47.7Mb for the 4 active bit arrays.
  2. The requirement for keeping track of the triples - which is trickier to estimate. If we consider primes up to int.MaxValue, there are 105,097,564 below this number. However, there are only 4792 primes below sqrt(int.MaxValue), 46340. Given there is no need to sieve primes above 46340, we only need to pass 4792 primes between threads. So how big is each triple? The next-composite and prime will need to be longs, but the steps-index, with only 480 steps, could be a short - so 18 bytes in total each, which means about 85Kb to hold all of them (a quick arithmetic sketch follows this list). So not really a problem. The rest of the primes simply need to be written out to a file.
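
A quick arithmetic sketch of these estimates, using the 18-byte layout assumed above (the class and names are just illustrative):

 using System;

 class TripleMemoryEstimate
 {
     static void Main()
     {
         // Assumed layout: long next-composite (8) + long prime (8) + short step index (2) = 18 bytes.
         const int bytesPerTriple = 8 + 8 + 2;

         const long primesBelowSqrtIntMax = 4792;            // primes below 46,340
         Console.WriteLine($"int.MaxValue run:  {primesBelowSqrtIntMax * bytesPerTriple / 1024.0:F1} Kb of triples");

         const long primesBelowSqrtLongMax = 146_144_318;    // primes below sqrt(long.MaxValue)
         Console.WriteLine($"long.MaxValue run: {primesBelowSqrtLongMax * bytesPerTriple / (1024.0 * 1024 * 1024):F2} Gb of triples");
     }
 }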

This is compared to the memory requirements for a full sieve, which for up to int.MaxValue requires a 255Mb bit array.

What about up to long.MaxValue?

For the full array, you would need to look for a specialist storage type. BitArray in C# only allows an int as the initialisation size, so you can't even create an array up to the size of long.MaxValue. If you could, it would be a 1,073,741,823Gb bit array!

What about for the parallel sieve?

Well, it can go up to long.MaxValue. The working sieves do not change in requirement unless we change the size of the thread pool, so we only need to look at the storage of the triples. The question is how many primes there are below sqrt(long.MaxValue) = ~1.414 * int.MaxValue. Using the parallel algorithm to work this out, there are 146,144,318. So running the algorithm up to long.MaxValue would require 2.44Gb to store all the triples, which is easily doable with today's hardware. That said, the increase in algorithm runtime seems to be linear, so I estimate it would take ~920 years on my top-spec Intel i7 quad-core CPU. Further, at worst 4.89% of these numbers will be prime, which would require 3,284,298Tb of disk space to write out all the primes!

So, I can't actually test the algorithm up to this limit.

Test Results

Results from testing the various optimisations on a MacBook Pro, Intel Core i7-4850HQ @ 2.30 GHz.

The algorithms were tested finding primes all the way up to int.MaxValue. This does not include the IO overhead of writing them out; a counter is simply incremented for each prime found. Each algorithm was run 5 times and 95% confidence intervals calculated. They all time fairly consistently.

Max: 2147483647
Tests: 5

Columns: Normal run (time taken, 95% ci, reads, writes), Read-check run (time taken, 95% ci, reads, writes), Read-check vs Normal diffs (time, reads, writes), Ratio of reads added to writes saved (r/w).

 Algo | Name                      | Time      | 95% ci    | Reads         | Writes        | Time      | 95% ci    | Reads         | Writes        | Time   | Reads  | Writes | Ratio r/w
 01   | EratoSqrtSqrParallelSkip8 | 00:06.193 | 00:00.085 | 381,604,889   | 418,936,899   | 00:06.166 | 00:00.090 | 403,377,799   | 416,022,717   |  99.6% | 105.7% |  99.3% | -7.47
 02   | EratoSqrtSqrParallelSkip5 | 00:06.650 | 00:00.048 | 446,233,563   | 589,194,032   | 00:06.671 | 00:00.057 | 472,693,105   | 584,944,688   | 100.3% | 105.9% |  99.3% | -6.23
 03   | EratoSqrtSqrSkip8         | 00:16.435 | 00:00.244 | 381,604,889   | 418,936,899   | 00:17.228 | 00:00.109 | 403,377,799   | 408,833,003   | 104.8% | 105.7% |  97.6% | -2.15
 04   | EratoSqrtSqrSkip5         | 00:18.808 | 00:00.051 | 446,233,563   | 589,194,032   | 00:19.145 | 00:00.131 | 472,693,105   | 574,403,504   | 101.8% | 105.9% |  97.5% | -1.79
 05   | EratoSqrtSqrBkwdSkip5     | 00:21.357 | 00:00.100 | 1,035,426,919 | 341,133,484   | -         | -         | -             | -             | -      | -      | -      | -
 06   | FactorSkip5               | 00:24.490 | 00:00.402 | 446,233,563   | 589,194,032   | 00:29.225 | 00:00.113 | 1,032,005,963 | 341,210,943   | 119.3% | 231.3% |  57.9% | -2.36
 07   | EratoSqrtSqr              | 01:03.084 | 00:00.192 | 2,147,483,645 | 5,556,559,197 | 01:05.008 | 00:00.193 | 2,167,303,218 | 5,538,575,962 | 103.0% | 100.9% |  99.7% | -1.10
 08   | EratoSqrt                 | 01:07.203 | 00:00.695 | 2,147,483,645 | 5,661,128,206 | 01:06.067 | 00:00.245 | 2,217,822,488 | 5,592,625,701 |  98.3% | 103.3% |  98.8% | -1.03
 09   | EratoSqrtSqrBkwd          | 01:07.979 | 00:00.293 | 7,704,042,842 | 2,042,386,081 | -         | -         | -             | -             | -      | -      | -      | -
 10   | AtkinSkip5                | 01:12.028 | 00:00.640 | 1,119,397,593 | 721,398,915   | 01:11.611 | 00:00.436 | 1,119,404,619 | 721,393,879   |  99.4% | 100.0% | 100.0% | -1.40
 11   | Atkin                     | 01:26.583 | 00:00.539 | 2,820,647,675 | 868,871,578   | 01:21.621 | 00:00.787 | 2,820,648,955 | 868,870,935   |  94.3% | 100.0% | 100.0% | -1.99
 12   | Eratosthenes              | 01:39.279 | 00:01.287 | 2,147,483,645 | 6,999,004,570 | 01:36.644 | 00:00.891 | 3,555,698,852 | 5,592,625,701 |  97.3% | 165.6% |  79.9% | -1.00
 13   | Factor                    | 01:59.103 | 00:00.824 | 2,147,483,645 | 5,556,559,197 | 02:20.420 | 00:00.592 | 7,643,018,389 | 2,046,357,940 | 117.9% | 355.9% |  36.8% | -1.57

See also
