
Range query (computer science): Difference between revisions

From Wikipedia, the free encyclopedia
→‎Examples: I added range majority queries as examples of range queries and discussed range tau-majority queries on 2D arrays. I will add more in the coming days.
→‎Range Majority Queries on Two-Dimensional Arrays: Added range majority queries on one-dimensional arrays. Will add more soon.
Line 74: Line 74:
If a linear algorithm to find the medians is used, the total cost of preprocessing for {{mvar|k}} range median queries is <math> n\log k</math>. The algorithm can also be modified to solve the [[online algorithm|online]] version of the problem.<ref name=ethpaper />
===Majority===
Finding frequent elements in a given set of items is one of the most important tasks in data mining. The task becomes difficult when most items have similar frequencies, so it is often more useful to report only the items whose frequency exceeds a threshold of significance. One of the best-known algorithms for finding the majority of an array was proposed by Boyer and Moore<ref>{{Citation|last=Boyer|first=Robert S.|title=MJRTY—A Fast Majority Vote Algorithm|date=1991|url=http://dx.doi.org/10.1007/978-94-011-3488-0_5|work=Automated Reasoning Series|pages=105–117|place=Dordrecht|publisher=Springer Netherlands|access-date=2021-12-18|last2=Moore|first2=J. Strother}}</ref> and is known as the [[Boyer–Moore majority vote algorithm]]. It finds the majority element of a string (if it has one) in <math>O(n)</math> time using <math>O(1)</math> space. In the context of Boyer and Moore's work, and in general, a majority element in a collection of items (for example a string or an array) is one whose number of occurrences is more than half the size of the collection. Misra and Gries<ref>{{Cite journal|last=Misra|first=J.|last2=Gries|first2=David|date=November 1982|title=Finding repeated elements|url=http://dx.doi.org/10.1016/0167-6423(82)90012-0|journal=Science of Computer Programming|volume=2|issue=2|pages=143–152|doi=10.1016/0167-6423(82)90012-0|issn=0167-6423}}</ref> proposed a more general version of Boyer and Moore's algorithm that uses <math>O(n \log (1 / \tau))</math> comparisons to find all items in an array whose relative frequencies are greater than some threshold <math>0<\tau<1</math>. A range <math>\tau</math>-majority query is one that, given a subrange <math>R</math> of a data structure (for example an array) with size <math>|R|</math>, returns the set of all distinct items that appear more than (or, in some publications, at least) <math>\tau |R|</math> times in that range.
In different structures that support range <math>\tau</math>-majority queries, <math>\tau</math> can be either static (specified during preprocessing) or dynamic (specified at query time). Many such approaches rely on the fact that, regardless of the size of the range, for a given <math>\tau</math> there can be at most <math>O(1/\tau)</math> distinct ''candidates'' with relative frequency at least <math>\tau</math>. By verifying each of these candidates in constant time, <math>O(1/\tau)</math> query time is achieved. A range <math>\tau</math>-majority query is decomposable<ref name=":1">{{Cite book|last=Karpiński|first=Marek|url=http://worldcat.org/oclc/277046650|title=Searching for frequent colors in rectangles|oclc=277046650}}</ref> in the sense that a <math>\tau</math>-majority in a range <math>R</math> partitioned into <math>R_1</math> and <math>R_2</math> must be a <math>\tau</math>-majority in either <math>R_1</math> or <math>R_2</math>. Due to this decomposability, some data structures answer <math>\tau</math>-majority queries on one-dimensional arrays by finding the [[lowest common ancestor]] (LCA) of the endpoints of the query range in a [[range tree]] and validating two sets of candidates (each of size <math>O(1/\tau)</math>), one from each endpoint to the lowest common ancestor, in constant time per candidate, resulting in <math>O(1/\tau)</math> query time.


==== Range Majority Queries on Two-Dimensional Arrays ====
Gagie et al.<ref>{{Citation|last=Gagie|first=Travis|title=Finding Frequent Elements in Compressed 2D Arrays and Strings|date=2011|url=http://dx.doi.org/10.1007/978-3-642-24583-1_29|work=String Processing and Information Retrieval|pages=295–300|place=Berlin, Heidelberg|publisher=Springer Berlin Heidelberg|isbn=978-3-642-24582-4|access-date=2021-12-18|last2=He|first2=Meng|last3=Munro|first3=J. Ian|last4=Nicholson|first4=Patrick K.}}</ref> proposed a data structure that supports range <math>\tau</math>-majority queries on an <math>m\times n</math> array <math>A</math>. Each query <math>\operatorname{Q}=(\operatorname{R}, \tau)</math> specifies a threshold <math>0<\tau<1</math> and a rectangular range <math>\operatorname{R}</math>, and the data structure returns the set of all elements whose relative frequency inside <math>\operatorname{R}</math> is greater than or equal to <math>\tau</math>. The data structure supports dynamic thresholds (specified at query time) and is constructed from a preprocessing threshold <math>\alpha</math>. During preprocessing, a set of ''vertical'' and ''horizontal'' intervals is built on the <math>m \times n</math> array. Together, a vertical and a horizontal interval form a ''block''. Each block is part of a ''superblock'' nine times bigger than itself (three times the size of the block's horizontal interval and three times the size of its vertical one). For each block, a set of at most <math>\frac{9}{\alpha}</math> candidates is stored, consisting of the elements that have relative frequency at least <math>\frac{\alpha}{9}</math> in the block's superblock. These elements are stored in non-increasing order of frequency. Any element with relative frequency at least <math>\alpha</math> in a block must appear in its set of candidates, because the block covers one ninth of its superblock.
Each <math>\tau</math>-majority query is answered by first finding the ''query block'', the biggest block contained in the query rectangle, in <math>O(1)</math> time. The first <math>\frac{9}{\tau}</math> candidates of the query block are then returned without being verified, in <math>O(1/\tau)</math> time, so this process might return some false positives. Many other data structures (as discussed below) verify each candidate in constant time and thus keep the <math>O(1/\tau)</math> query time while returning no false positives. The cases in which the query block is smaller than <math>1/\alpha</math> are handled by storing <math>\log(1/\alpha)</math> instances of this data structure with the following preprocessing thresholds:


<math>\beta=2^{-i}, \;\; i\in \left\{1,\dots,\log\left(\frac{1}{\alpha}\right)\right\}</math>
Line 83: Line 83:


where <math>\beta</math> is the preprocessing threshold of the <math>i</math>-th instance. Thus, for query blocks smaller than <math>1/\alpha</math>, the <math>\lceil\log (1 / \tau)\rceil</math>-th instance is queried. As mentioned above, this data structure has query time <math>O(1/\tau)</math> and requires <math>O(m n(H+1) \log^2 (1 / \alpha))</math> bits of space when a Huffman-encoded copy of it is stored (note the <math>\log(1/\alpha)</math> factor; see also [[Huffman coding]]).

==== Range Majority Queries on One-Dimensional Arrays ====
Chan et al.<ref name=":0">{{Citation|last=Chan|first=Timothy M.|title=Linear-Space Data Structures for Range Minority Query in Arrays|date=2012|url=http://dx.doi.org/10.1007/978-3-642-31155-0_26|work=Algorithm Theory – SWAT 2012|pages=295–306|place=Berlin, Heidelberg|publisher=Springer Berlin Heidelberg|isbn=978-3-642-31154-3|access-date=2021-12-20|last2=Durocher|first2=Stephane|last3=Skala|first3=Matthew|last4=Wilkinson|first4=Bryan T.}}</ref> proposed a data structure that, given a one-dimensional array <math>A</math>, a subrange <math>R</math> of <math>A</math> and a threshold <math>\tau</math> (both specified at query time), returns the list of all <math>\tau</math>-majorities in <math>O(1/\tau)</math> time using <math>O(n \log n)</math> words of space. To answer such queries, Chan et al.<ref name=":0" /> begin by noting that there exists a data structure capable of returning the ''top-k'' most frequent items in a range in <math>O(k)</math> time requiring <math>O(n)</math> words of space. For a one-dimensional array <math>A[0..n-1]</math>, let a one-sided top-k range query be of the form <math>A[0..i] \text { for } 0 \leq i \leq n-1</math>. For each maximal run of prefixes <math>A[0..i] \text { through } A[0..j]</math> in which the frequency of a distinct element <math>e</math> of <math>A</math> remains unchanged (and equal to <math>f</math>), a horizontal line segment is constructed. The <math>x</math>-interval of this line segment is <math>[i,j]</math> and its <math>y</math>-value is <math>f</math>. Since extending the prefix by one element of <math>A</math> changes the frequency of exactly one distinct element, this process creates <math>O(n)</math> line segments. Moreover, for each vertical line <math>x=i</math>, all horizontal line segments intersecting it are sorted by frequency.
Note that each horizontal line segment with <math>x</math>-interval <math>[\ell,r]</math> corresponds to exactly one distinct element <math>e</math> in <math>A</math>, such that <math>A[\ell]=e</math>. A top-k query can then be answered by shooting a vertical ray <math>x=i</math> and reporting the first <math>k</math> horizontal line segments that intersect it (recall that these line segments are already sorted by frequency) in <math>O(k)</math> time.

Chan et al.<ref name=":0" /> first construct a [[range tree]] in which each branching node stores one copy of the data structure described above for one-sided range top-k queries and each leaf represents an element of <math>A</math>. The top-k data structure at each node is built from the values in the subtree of that node and answers one-sided range top-k queries. For a one-dimensional array <math>A</math>, a range tree can be constructed by dividing <math>A</math> into two halves and recursing on both halves; therefore, each node of the resulting range tree represents a range. This range tree requires <math>O(n \log n)</math> words of space: there are <math>O(\log n)</math> levels, each level <math>\ell</math> has <math>2^{\ell}</math> nodes, and the nodes of a level together hold a total of <math>n</math> elements of <math>A</math> in their subtrees.

Using this structure, a range <math>\tau</math>-majority query <math>A[i..j]</math> on <math>A[0..n-1]</math> with <math>0\leq i\leq j \leq n-1</math> is answered as follows. First, the [[lowest common ancestor]] (LCA) of leaf nodes <math>i</math> and <math>j</math> is found in constant time; there exists a data structure requiring <math>O(n)</math> bits of space that answers LCA queries in <math>O(1)</math> time.<ref>{{Cite journal|last=Sadakane|first=Kunihiko|last2=Navarro|first2=Gonzalo|date=2010-01-17|title=Fully-Functional Succinct Trees|url=http://dx.doi.org/10.1137/1.9781611973075.13|journal=Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms|location=Philadelphia, PA|publisher=Society for Industrial and Applied Mathematics|doi=10.1137/1.9781611973075.13}}</ref> Let <math>z</math> denote the LCA of <math>i</math> and <math>j</math>. Using <math>z</math> and the decomposability of range <math>\tau</math>-majority queries (as described above),<ref name=":1" /> the two-sided range query <math>A[i..j]</math> can be converted into two one-sided range top-k queries (from <math>z</math> to <math>i</math> and from <math>z</math> to <math>j</math>). These two one-sided range top-k queries return the top-(<math>1/\tau</math>) most frequent elements in each of their respective ranges in <math>O(1/\tau)</math> time. These frequent elements make up the set of <math>O(1/\tau)</math> ''candidates'' for <math>\tau</math>-majorities in <math>A[i..j]</math>, some of which might be false positives.
Each candidate is then assessed in constant time using a linear-space data structure (described in Lemma 3 of Chan et al.<ref>{{Cite journal|last=Chan|first=Timothy M.|last2=Durocher|first2=Stephane|last3=Larsen|first3=Kasper Green|last4=Morrison|first4=Jason|last5=Wilkinson|first5=Bryan T.|date=2013-03-08|title=Linear-Space Data Structures for Range Mode Query in Arrays|url=http://dx.doi.org/10.1007/s00224-013-9455-2|journal=Theory of Computing Systems|volume=55|issue=4|pages=719–741|doi=10.1007/s00224-013-9455-2|issn=1432-4350}}</ref>) that determines in <math>O(1)</math> time whether or not a given subrange of an array <math>A</math> contains at least <math>q</math> instances of a particular element <math>e</math>.







Revision as of 22:03, 20 December 2021

In data structures, a range query consists of preprocessing some input data into a data structure to efficiently answer any number of queries on any subset of the input. Particularly, there is a group of problems that have been extensively studied where the input is an array of unsorted numbers and a query consists of computing some function, such as the minimum, on a specific range of the array.

Definition

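A range query <math>q_f(A,i,j)</math> on an array <math>A=[a_1,a_2,\ldots,a_n]</math> of n elements of some set S, denoted <math>A[1,n]</math>, takes two indices <math>1\leq i\leq j\leq n</math> and a function f defined over arrays of elements of S, and outputs <math>f(A[i,j])= f(a_i,\ldots,a_j)</math>.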

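For example, for <math>f = \sum</math> and <math>A[1,n]</math> an array of numbers, the range query <math>\sum_{i,j} A</math> computes <math>\sum A[i,j] = (a_i+\ldots + a_j)</math>, for any <math>1 \leq i \leq j \leq n</math>. These queries may be answered in constant time and using <math>O(n)</math> extra space by calculating the sums of the first i elements of A and storing them into an auxiliary array B, such that <math>B[i]</math> contains the sum of the first i elements of A for every <math>0\leq i\leq n</math>. Therefore, any query can be answered by computing <math>\sum A[i,j] = B[j] - B[i-1]</math>.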
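The prefix-sum strategy described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names `build_prefix_sums` and `range_sum` are hypothetical, and the 1-based inclusive indices are chosen to match the notation of this section.

```python
from itertools import accumulate


def build_prefix_sums(a):
    """Return B where B[i] is the sum of the first i elements of a (B[0] = 0)."""
    return [0] + list(accumulate(a))


def range_sum(b, i, j):
    """Sum of a_i..a_j for 1-based inclusive indices 1 <= i <= j <= n."""
    return b[j] - b[i - 1]


A = [3, 1, 4, 1, 5, 9, 2, 6]
B = build_prefix_sums(A)          # O(n) preprocessing, O(n) extra space
total = range_sum(B, 2, 5)        # O(1) per query: 1 + 4 + 1 + 5
```

The same idea carries over to any group operator, because subtracting `B[i - 1]` is just applying the inverse of the operator.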

This strategy may be extended for every group operator f where the notion of <math>f^{-1}</math> is well defined and easily computable.[1] Finally, this solution can be extended to two-dimensional arrays with a similar preprocessing.[2]

Examples

Semigroup operators

Constructing the corresponding Cartesian tree to solve a range minimum query.
Range minimum query reduced to the lowest common ancestor problem.

When the function of interest in a range query is a semigroup operator, the notion of <math>f^{-1}</math> is not always defined, so the strategy in the previous section does not work. Andrew Yao showed[3] that there exists an efficient solution for range queries that involve semigroup operators. He proved that for any constant c, a preprocessing of time and space <math>\Theta(c\cdot n)</math> allows range queries on lists, where f is a semigroup operator, to be answered in <math>\Theta(\alpha_c(n))</math> time, where <math>\alpha_c</math> is a certain functional inverse of the Ackermann function.

There are some semigroup operators that admit slightly better solutions, for instance when <math>f\in \{\max,\min\}</math>. Assume <math>f = \min</math>; then <math>\min(A[1..n])</math> returns the index of the minimum element of <math>A[1..n]</math>, and <math>\min_{i,j}(A)</math> denotes the corresponding range minimum query. There are several data structures that answer a range minimum query in <math>O(1)</math> time using a preprocessing of time and space <math>O(n)</math>. One such solution is based on the equivalence between this problem and the lowest common ancestor problem.

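The Cartesian tree <math>T_A</math> of an array <math>A[1,n]</math> has <math>a_i = \min\{a_1,a_2,\ldots,a_n\}</math> as root and as left and right subtrees the Cartesian tree of <math>A[1,i-1]</math> and the Cartesian tree of <math>A[i+1,n]</math> respectively. A range minimum query <math>\min_{i,j}(A)</math> is the lowest common ancestor in <math>T_A</math> of <math>a_i</math> and <math>a_j</math>. Because the lowest common ancestor can be solved in constant time using a preprocessing of time and space <math>O(n)</math>, range minimum query can as well. The solution when <math>f = \max</math> is analogous. Cartesian trees can be constructed in linear time.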
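The reduction from range minimum queries to lowest common ancestors can be illustrated with the short sketch below. It builds the Cartesian tree in linear time with a stack, but, as a simplifying assumption, it finds the lowest common ancestor by walking down from the root instead of using a constant-time LCA structure, so a query costs time proportional to the depth of the tree. The names `build_cartesian_tree` and `range_min_index` are hypothetical, indices are 0-based, and the array is assumed to be non-empty.

```python
def build_cartesian_tree(a):
    """Build the min-Cartesian tree of a in O(n) time.

    Returns (root, left, right); left[k] and right[k] are the child
    indices of node k, or -1 when the child is missing.
    """
    n = len(a)
    left = [-1] * n
    right = [-1] * n
    stack = []                      # indices on the rightmost path of the tree
    for i in range(n):
        last = -1
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()      # popped nodes end up in i's left subtree
        left[i] = last
        if stack:
            right[stack[-1]] = i
        stack.append(i)
    return stack[0], left, right


def range_min_index(root, left, right, i, j):
    """Index of a minimum of a[i..j] (0-based, inclusive, i <= j).

    The in-order traversal of the tree is the index order, so the first
    node k with i <= k <= j met on the way down from the root is the
    lowest common ancestor of nodes i and j.
    """
    k = root
    while not (i <= k <= j):
        k = left[k] if k > j else right[k]
    return k
```

Replacing the downward walk with a constant-time LCA structure gives the bounds quoted in the text.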

Mode

The mode of an array A is the element that appears the most in A. For instance the mode of <math>A=[4,5,6,7,4]</math> is 4. In case of ties, any of the most frequent elements may be picked as the mode. A range mode query consists of preprocessing <math>A[1,n]</math> such that the mode of any range of <math>A[1,n]</math> can be found. Several data structures have been devised to solve this problem; some of the results are summarized in the following table.[1]

Range Mode Queries
Space Query Time Restrictions

Jørgensen et al. proved a lower bound in the cell-probe model of <math>\Omega\left(\tfrac{\log n}{\log (S w/n)}\right)</math> for any data structure that uses S cells.[4]

Median

This particular case is of special interest since finding the median has several applications.[5] On the other hand, the median problem, a special case of the selection problem, is solvable in O(n), using the median of medians algorithm.[6] However its generalization through range median queries is recent.[7] A range median query <math>\operatorname{median}(A,i,j)</math>, where A, i and j have the usual meanings, returns the median element of <math>A[i,j]</math>. Equivalently, <math>\operatorname{median}(A,i,j)</math> should return the element of <math>A[i,j]</math> of rank <math>\frac{j-i}{2}</math>. Range median queries cannot be solved by following any of the previous methods discussed above including Yao's approach for semigroup operators.[8]

Two variants of this problem have been studied: the offline version, where all the k queries of interest are given in a batch, and a version where all the preprocessing is done up front. The offline version can be solved with <math>O(n\log k + k \log n)</math> time and <math>O(n\log k)</math> space.

The following pseudocode of the quickselect algorithm shows how to find the element of rank r in <math>A[i,j]</math>, an unsorted array of distinct elements; to find the range medians we set <math>r=\frac{j-i}{2}</math>.[7]

rangeMedian(A, i, j, r) {
    if A.length() == 1
        return A[1]

    if A.low is undefined then
        m = median(A)
        A.low  = [e in A | e <= m]
        A.high = [e in A | e > m ]

    calculate t the number of elements of A[i, j] that belong to A.low

    if r <= t then
        return rangeMedian(A.low, i, j, r)
    else
        return rangeMedian(A.high, i, j, r-t)
}

Procedure rangeMedian partitions A, using A's median, into two arrays A.low and A.high, where the former contains the elements of A that are less than or equal to the median m and the latter the rest of the elements of A. If we know that the number of elements of <math>A[i,j]</math> that end up in A.low is t and this number is at least r, then we should keep looking for the element of rank r in A.low; otherwise we should look for the element of rank <math>(r-t)</math> in A.high. To find t, it is enough to find the last element at or before position <math>i-1</math> that is in A.low and the last element at or before position <math>j</math> that is in A.low; t is then the difference between the positions of these two elements within A.low. The total cost for any query, without considering the partitioning part, is <math>O(\log n)</math>, since at most <math>O(\log n)</math> recursive calls are made and only a constant number of operations is performed in each of them (to get the value of t, fractional cascading should be used). If a linear algorithm to find the medians is used, the total cost of preprocessing for k range median queries is <math>O(n\log k)</math>. The algorithm can also be modified to solve the online version of the problem.[7]
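The low/high recursion can be made concrete with the following sketch, which is a simplified variant and not the algorithm of the cited paper. The assumptions are: every node keeps the original positions of its elements in sorted order, so that t can be computed by binary search (costing O(log n) per level instead of the constant time obtained with fractional cascading); the split is done by sorting rather than by a linear-time median algorithm; ties are broken by position so that duplicate values are allowed; and the lower median is returned. The class name `RangeSelect` and the function `range_median` are hypothetical, and indices are 0-based.

```python
from bisect import bisect_left, bisect_right


class RangeSelect:
    """Recursive low/high partition of a non-empty array around its median."""

    def __init__(self, a, positions=None):
        if positions is None:
            positions = list(range(len(a)))
        self.a = a
        self.positions = positions      # original indices, in increasing order
        self.low = self.high = None
        if len(positions) > 1:
            # Order by (value, position) so the two halves are never empty.
            by_value = sorted(positions, key=lambda p: (a[p], p))
            half = (len(by_value) + 1) // 2
            self.low = RangeSelect(a, sorted(by_value[:half]))
            self.high = RangeSelect(a, sorted(by_value[half:]))

    def count(self, i, j):
        """Number of this node's elements whose position lies in [i, j]."""
        return bisect_right(self.positions, j) - bisect_left(self.positions, i)

    def select(self, i, j, r):
        """Element of rank r (1-based) among a[i..j], with 1 <= r <= j - i + 1."""
        if self.low is None:
            return self.a[self.positions[0]]
        t = self.low.count(i, j)        # elements of a[i..j] that fell into low
        if r <= t:
            return self.low.select(i, j, r)
        return self.high.select(i, j, r - t)


def range_median(rs, i, j):
    """Lower median of a[i..j]: the element of rank ceil((j - i + 1) / 2)."""
    return rs.select(i, j, (j - i + 2) // 2)
```

Because each node stores original positions, the query indices i and j do not need to be remapped when the recursion descends into a child.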

Majority

Finding frequent elements in a given set of items is one of the most important tasks in data mining. The task becomes difficult when most items have similar frequencies, so it is often more useful to report only the items whose frequency exceeds a threshold of significance. One of the best-known algorithms for finding the majority of an array was proposed by Boyer and Moore[9] and is known as the Boyer–Moore majority vote algorithm. It finds the majority element of a string (if it has one) in <math>O(n)</math> time using <math>O(1)</math> space. In the context of Boyer and Moore's work, and in general, a majority element in a collection of items (for example a string or an array) is one whose number of occurrences is more than half the size of the collection. Misra and Gries[10] proposed a more general version of Boyer and Moore's algorithm that uses <math>O(n \log (1 / \tau))</math> comparisons to find all items in an array whose relative frequencies are greater than some threshold <math>0<\tau<1</math>. A range <math>\tau</math>-majority query is one that, given a subrange <math>R</math> of a data structure (for example an array) with size <math>|R|</math>, returns the set of all distinct items that appear more than (or, in some publications, at least) <math>\tau |R|</math> times in that range. In different structures that support range <math>\tau</math>-majority queries, <math>\tau</math> can be either static (specified during preprocessing) or dynamic (specified at query time). Many such approaches rely on the fact that, regardless of the size of the range, for a given <math>\tau</math> there can be at most <math>O(1/\tau)</math> distinct candidates with relative frequency at least <math>\tau</math>. By verifying each of these candidates in constant time, <math>O(1/\tau)</math> query time is achieved. A range <math>\tau</math>-majority query is decomposable[11] in the sense that a <math>\tau</math>-majority in a range <math>R</math> partitioned into <math>R_1</math> and <math>R_2</math> must be a <math>\tau</math>-majority in either <math>R_1</math> or <math>R_2</math>.
Due to this decomposability, some data structures answer <math>\tau</math>-majority queries on one-dimensional arrays by finding the lowest common ancestor (LCA) of the endpoints of the query range in a range tree and validating two sets of candidates (each of size <math>O(1/\tau)</math>), one from each endpoint to the lowest common ancestor, in constant time per candidate, resulting in <math>O(1/\tau)</math> query time.
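The two whole-array algorithms mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list, the function names `boyer_moore_majority` and `tau_majorities` are hypothetical, and the second function uses the counter-based formulation commonly associated with Misra and Gries followed by a verification pass, which does not reproduce the comparison bound quoted in the text.

```python
import math
from collections import Counter


def boyer_moore_majority(items):
    """Return the element occurring more than len(items) / 2 times, or None."""
    candidate, count = None, 0
    for x in items:                     # first pass: keep one candidate
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Second pass: the surviving candidate may be a false positive.
    if count > 0 and 2 * sum(1 for x in items if x == candidate) > len(items):
        return candidate
    return None


def tau_majorities(items, tau):
    """Return the set of items with relative frequency greater than tau (0 < tau < 1)."""
    k = math.ceil(1 / tau)              # at most k - 1 counters are kept
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Every tau-majority survives as a counter; verify to remove false positives.
    exact = Counter(x for x in items if x in counters)
    return {x for x, c in exact.items() if c > tau * len(items)}
```

With `tau` above one half only a single counter is kept, and the first pass reduces to the Boyer–Moore vote.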

Range Majority Queries on Two-Dimensional Arrays

Gagie et al.[12] proposed a data structure that supports range <math>\tau</math>-majority queries on an <math>m\times n</math> array <math>A</math>. Each query <math>\operatorname{Q}=(\operatorname{R}, \tau)</math> specifies a threshold <math>0<\tau<1</math> and a rectangular range <math>\operatorname{R}</math>, and the data structure returns the set of all elements whose relative frequency inside <math>\operatorname{R}</math> is greater than or equal to <math>\tau</math>. The data structure supports dynamic thresholds (specified at query time) and is constructed from a preprocessing threshold <math>\alpha</math>. During preprocessing, a set of vertical and horizontal intervals is built on the <math>m \times n</math> array. Together, a vertical and a horizontal interval form a block. Each block is part of a superblock nine times bigger than itself (three times the size of the block's horizontal interval and three times the size of its vertical one). For each block, a set of at most <math>\frac{9}{\alpha}</math> candidates is stored, consisting of the elements that have relative frequency at least <math>\frac{\alpha}{9}</math> in the block's superblock. These elements are stored in non-increasing order of frequency. Any element with relative frequency at least <math>\alpha</math> in a block must appear in its set of candidates, because the block covers one ninth of its superblock. Each <math>\tau</math>-majority query is answered by first finding the query block, the biggest block contained in the query rectangle, in <math>O(1)</math> time. The first <math>\frac{9}{\tau}</math> candidates of the query block are then returned without being verified, in <math>O(1/\tau)</math> time, so this process might return some false positives. Many other data structures (as discussed below) verify each candidate in constant time and thus keep the <math>O(1/\tau)</math> query time while returning no false positives. The cases in which the query block is smaller than <math>1/\alpha</math> are handled by storing <math>\log(1/\alpha)</math> instances of this data structure with the following preprocessing thresholds:

<math>\beta=2^{-i}, \;\; i\in \left\{1,\dots,\log\left(\frac{1}{\alpha}\right)\right\}</math>

where <math>\beta</math> is the preprocessing threshold of the <math>i</math>-th instance. Thus, for query blocks smaller than <math>1/\alpha</math>, the <math>\lceil\log (1 / \tau)\rceil</math>-th instance is queried. As mentioned above, this data structure has query time <math>O(1/\tau)</math> and requires <math>O(m n(H+1) \log^2 (1 / \alpha))</math> bits of space when a Huffman-encoded copy of it is stored (note the <math>\log(1/\alpha)</math> factor; see also Huffman coding).

Range Majority Queries on One-Dimensional Arrays

Chan et al.[13] proposed a data structure that, given a one-dimensional array <math>A</math>, a subrange <math>R</math> of <math>A</math> and a threshold <math>\tau</math> (both specified at query time), returns the list of all <math>\tau</math>-majorities in <math>O(1/\tau)</math> time using <math>O(n \log n)</math> words of space. To answer such queries, Chan et al.[13] begin by noting that there exists a data structure capable of returning the top-k most frequent items in a range in <math>O(k)</math> time requiring <math>O(n)</math> words of space. For a one-dimensional array <math>A[0..n-1]</math>, let a one-sided top-k range query be of the form <math>A[0..i]</math> for <math>0 \leq i \leq n-1</math>. For each maximal run of prefixes <math>A[0..i]</math> through <math>A[0..j]</math> in which the frequency of a distinct element <math>e</math> of <math>A</math> remains unchanged (and equal to <math>f</math>), a horizontal line segment is constructed. The <math>x</math>-interval of this line segment is <math>[i,j]</math> and its <math>y</math>-value is <math>f</math>. Since extending the prefix by one element of <math>A</math> changes the frequency of exactly one distinct element, this process creates <math>O(n)</math> line segments. Moreover, for each vertical line <math>x=i</math>, all horizontal line segments intersecting it are sorted by frequency. Note that each horizontal line segment with <math>x</math>-interval <math>[\ell,r]</math> corresponds to exactly one distinct element <math>e</math> in <math>A</math>, such that <math>A[\ell]=e</math>. A top-k query can then be answered by shooting a vertical ray <math>x=i</math> and reporting the first <math>k</math> horizontal line segments that intersect it (recall that these line segments are already sorted by frequency) in <math>O(k)</math> time.

Chan et al.[13] first construct a range tree in which each branching node stores one copy of the data structure described above for one-sided range top-k queries and each leaf represents an element of <math>A</math>. The top-k data structure at each node is built from the values in the subtree of that node and answers one-sided range top-k queries. For a one-dimensional array <math>A</math>, a range tree can be constructed by dividing <math>A</math> into two halves and recursing on both halves; therefore, each node of the resulting range tree represents a range. This range tree requires <math>O(n \log n)</math> words of space: there are <math>O(\log n)</math> levels, each level <math>\ell</math> has <math>2^{\ell}</math> nodes, and the nodes of a level together hold a total of <math>n</math> elements of <math>A</math> in their subtrees.

Using this structure, a range <math>\tau</math>-majority query <math>A[i..j]</math> on <math>A[0..n-1]</math> with <math>0\leq i\leq j \leq n-1</math> is answered as follows. First, the lowest common ancestor (LCA) of leaf nodes <math>i</math> and <math>j</math> is found in constant time; there exists a data structure requiring <math>O(n)</math> bits of space that answers LCA queries in <math>O(1)</math> time.[14] Let <math>z</math> denote the LCA of <math>i</math> and <math>j</math>. Using <math>z</math> and the decomposability of range <math>\tau</math>-majority queries (as described above),[11] the two-sided range query <math>A[i..j]</math> can be converted into two one-sided range top-k queries (from <math>z</math> to <math>i</math> and from <math>z</math> to <math>j</math>). These two one-sided range top-k queries return the top-(<math>1/\tau</math>) most frequent elements in each of their respective ranges in <math>O(1/\tau)</math> time. These frequent elements make up the set of <math>O(1/\tau)</math> candidates for <math>\tau</math>-majorities in <math>A[i..j]</math>, some of which might be false positives. Each candidate is then assessed in constant time using a linear-space data structure (described in Lemma 3 of [15]) that determines in <math>O(1)</math> time whether or not a given subrange of an array <math>A</math> contains at least <math>q</math> instances of a particular element <math>e</math>.
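The "collect candidates, then verify" pattern can be illustrated with a much simpler structure than the one described above. The sketch below is not the data structure of Chan et al.: it assumes a static threshold fixed at construction time, stores the exact τ-majorities of every node of a range tree, gathers candidates from the O(log n) nodes that cover the query range (which is sufficient because of decomposability), and verifies each candidate by binary search over its list of positions, which takes O(log n) rather than O(1) time. The class name `RangeMajority` is hypothetical and indices are 0-based.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


class RangeMajority:
    """Static-threshold range tau-majority queries on a list (illustrative sketch)."""

    def __init__(self, a, tau):
        self.a, self.tau, self.n = a, tau, len(a)
        self.pos = defaultdict(list)    # value -> sorted list of its positions
        for idx, x in enumerate(a):
            self.pos[x].append(idx)
        self.cand = {}                  # (lo, hi) -> tau-majorities of a[lo..hi]
        self._build(0, self.n - 1)

    def _build(self, lo, hi):
        counts = Counter(self.a[lo:hi + 1])
        size = hi - lo + 1
        self.cand[(lo, hi)] = [x for x, c in counts.items() if c > self.tau * size]
        if lo < hi:
            mid = (lo + hi) // 2
            self._build(lo, mid)
            self._build(mid + 1, hi)

    def _count(self, x, i, j):
        """Occurrences of x in a[i..j], by binary search on its positions."""
        p = self.pos[x]
        return bisect_right(p, j) - bisect_left(p, i)

    def _collect(self, lo, hi, i, j, out):
        if j < lo or hi < i:
            return
        if i <= lo and hi <= j:         # node fully inside the query range
            out.update(self.cand[(lo, hi)])
            return
        mid = (lo + hi) // 2
        self._collect(lo, mid, i, j, out)
        self._collect(mid + 1, hi, i, j, out)

    def query(self, i, j):
        """Set of items occurring more than tau * (j - i + 1) times in a[i..j]."""
        candidates = set()
        self._collect(0, self.n - 1, i, j, candidates)
        need = self.tau * (j - i + 1)
        return {x for x in candidates if self._count(x, i, j) > need}
```

A τ-majority of the query range must be a τ-majority of at least one covering node, so the candidate set can contain false positives but never misses an answer; the verification step then removes the false positives.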




Related problems

All the problems described above have also been studied in higher dimensions and in their dynamic versions. Range queries can also be extended to other data structures such as trees;[8] the level ancestor problem is one example. A similar family of problems is orthogonal range queries, also known as counting queries.

See also

References

  1. ^ a b Krizanc, Danny; Morin, Pat; Smid, Michiel H. M. (2003). "Range Mode and Range Median Queries on Lists and Trees". ISAAC: 517–526. arXiv:cs/0307034.
  2. ^ He, Meng; Munro, J. Ian; Nicholson, Patrick K. (2011). "Dynamic Range Selection in Linear Space". ISAAC: 160–169.
  3. ^ Yao, A. C. (1982). "Space-Time Tradeoff for Answering Range Queries". 14th Annual ACM Symposium on the Theory of Computing: 128–136.
  4. ^ Greve, M.; Jørgensen, A.; Larsen, K.; Truelsen, J. (2010). "Cell probe lower bounds and approximations for range mode". Automata, Languages and Programming: 605–616.
  5. ^ Har-Peled, Sariel; Muthukrishnan, S. (2008). "Range Medians". ESA: 503–514.
  6. ^ Blum, M.; Floyd, R. W.; Pratt, V. R.; Rivest, R. L.; Tarjan, R. E. (August 1973). "Time bounds for selection" (PDF). Journal of Computer and System Sciences. 7 (4): 448–461. doi:10.1016/S0022-0000(73)80033-9.
  7. ^ a b c Gfeller, Beat; Sanders, Peter (2009). "Towards Optimal Range Medians". ICALP (1): 475–486.
  8. ^ a b Bose, P.; Kranakis, E.; Morin, P.; Tang, Y. (2005). "Approximate range mode and range median queries". In Proceedings of the 22nd Symposium on Theoretical Aspects of Computer Science (STACS 2005), Volume 3404 of Lecture Notes in Computer Science: 377–388.
  9. ^ Boyer, Robert S.; Moore, J. Strother (1991), "MJRTY—A Fast Majority Vote Algorithm", Automated Reasoning Series, Dordrecht: Springer Netherlands, pp. 105–117, retrieved 2021-12-18
  10. ^ Misra, J.; Gries, David (November 1982). "Finding repeated elements". Science of Computer Programming. 2 (2): 143–152. doi:10.1016/0167-6423(82)90012-0. ISSN 0167-6423.
  11. ^ a b Karpiński, Marek. Searching for frequent colors in rectangles. OCLC 277046650.
  12. ^ Gagie, Travis; He, Meng; Munro, J. Ian; Nicholson, Patrick K. (2011), "Finding Frequent Elements in Compressed 2D Arrays and Strings", String Processing and Information Retrieval, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 295–300, ISBN 978-3-642-24582-4, retrieved 2021-12-18
  13. ^ a b c Chan, Timothy M.; Durocher, Stephane; Skala, Matthew; Wilkinson, Bryan T. (2012), "Linear-Space Data Structures for Range Minority Query in Arrays", Algorithm Theory – SWAT 2012, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 295–306, ISBN 978-3-642-31154-3, retrieved 2021-12-20
  14. ^ Sadakane, Kunihiko; Navarro, Gonzalo (2010-01-17). "Fully-Functional Succinct Trees". Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611973075.13.
  15. ^ Chan, Timothy M.; Durocher, Stephane; Larsen, Kasper Green; Morrison, Jason; Wilkinson, Bryan T. (2013-03-08). "Linear-Space Data Structures for Range Mode Query in Arrays". Theory of Computing Systems. 55 (4): 719–741. doi:10.1007/s00224-013-9455-2. ISSN 1432-4350.

External links