User:Garfieldnate/AM alg draft
For now, just copy the original Perl documentation
package Algorithm::AM::algorithm; use strict; use warnings;
- ABSTRACT: How the Analogical Modeling algorithm works
- VERSION
1;
=head1 OVERVIEW
I<Exemplar-based modeling> works as follows: there is a set of data items, each assigned an outcome, and there is a test item. The test item is compared with the data items; the result of this comparison tells what the possible outcomes of the test item are, along with their likelihoods.
Exemplar-based modeling is often contrasted with I<rule-based modeling>. Note that in rule-based modeling, there can be only one possible outcome, unless the model is fudged by introducing probability. (Some types of exemplar-based modeling also give only one possible outcome.)
I<Analogical Modeling> (AM) is one way to do the comparison and determine the outcome. Some of its salient features are as follows:
=over 4
=item *
Exemplars that seem less similar to the test item than those that seem more similar can still have a magnified effect if there are many of them. This is known as the I<gang effect>.
=item *
AM accounts for I<leakage>.
For instance, it is possible for someone to accidentally say "snew" instead of "snowed", in analogy with "know/knew", "grow/grew", "throw/threw", "blow/blew", etc. (I've never done this myself, though I know someone who has.) In rule-based modeling, this could never occur; in AM, this is predicted to occur, though with very low frequency.
=back
=head1 DESCRIPTION
First, the user must create a set of data items, with their outcomes, and some test items. All of these items are represented by I<feature vectors>. These feature vectors are I<not> created by the AM algorithm; they could be generated by hand or script, it matters not to AM. All the feature vectors must be of the same length, call it I<n>.
=head2 The supracontextual lattice
AM requires the construction of a I<supracontextual lattice>. It is merely a complete distributive lattice of sets called I<supracontexts>, each one labeled with an integer in the range 0 to S<2^I<n> - 1>. If I<a> and I<b> are labels of two supracontexts, then I<a> & I<b> = I<b> (that's bitwise AND) iff the supracontext labeled by I<a> is a superset of the supracontext labeled by I<b>.
The supracontextual lattice starts out with every element being the empty set. The AM algorithm adds I<subcontexts> to them one at a time.
=head2 Subcontexts
Subcontexts are also sets, and they are also labeled with an integer in the range 0 to S<2^I<n> - 1>. The elements of the subcontexts are data items.
=over 4
=item Example
Suppose that the test item has feature vector
('S', 'O', '0', 'S', 'R', '0', 'T', 'A')
and a data item has the feature vector
('P', 'Y', 'V', 'S', 'R', '0', 'T', 'a').
Compare the corresponding features, using a 1 for different and 0 for same. Then the binary number
0b11100001
is the label of the subcontext to which the data item belongs.
=back
=head2 Filling the supracontextual lattice
The elements of the supracontexts are the subcontexts. If a subcontext has label I<a>, then it will be an element of any supracontext whose label I<b> satisfies I<b> & I<a> = I<a>.
Thus, the subcontext labeled by 0b11100001 will belong to 16 different supracontexts. The labels of these supracontexts are found by replacing the 0s by 1s in all possible ways; the original 1s are left untouched.
=head2 Homogeneity and Heterogeneity
A supracontext is I<heterogeneous> if the following two conditions hold:
=over 4
=item 1
The supracontext contains two or more subcontexts.
=item 2
The data items contained in these subcontexts do not share a common outcome.
=back
Otherwise, the supracontext is I<homogeneous>. Only homogeneous supracontexts determine the possible outcomes.
=head2 The Analogical Set
The I<analogical set> is a list of all data items that appear in the subcontexts in the homogeneous supracontexts. With each data item is associated a number which is one of the following:
=over 4
=item Number of I<occurrences>
This is merely the number of supracontexts containing the subcontext containing the data item.
=item Number of I<pointers>
Assign to each supracontext a number representing the total number of data items in the subcontexts it contains. The number of pointers of a particular data item is the sum of the numbers assigned to the supracontexts containing the subcontext containing the data item. (Number of occurrences can be thought of as assigning 1 to each supracontext.)
=back
Using pointers gives rise to I<gang effects>: the ability of many data items less similar to the test item but appearing in the same subcontext to have more influence on the outcome than a few data items more similar to the test item.
With the analogical set in hand, one can then describe the likelihood of the various outcomes actually occurring by looking at the numbers assigned to its data items.
package Algorithm::AM::lattice; use strict; use warnings;
- ABSTRACT: How to store and manipulate large lattices in C
- VERSION
1;
=head1 DESCRIPTION
The Analogical Modeling (AM) algorithm requires constructing and traversing large completely distributive lattices, also known as Boolean algebras. This document tells how we do it in F<Parallel.xs>.
A lot of what appears below could be used generally to store lattices; that which applies specifically to C<AM::Parallel> is so marked.
=head1 BOOLEAN ALGEBRAS AND LATTICES
If I<n> is a positive integer, then the set of positive integers that can be expressed in I<n> or fewer bits, along with the operations & (bitwise AND) and | (bitwise OR), is an example of a I<Boolean algebra>. It is also an example of a I<lattice> which is I<completely distributive>. Although all the lattices used in AM are in fact Boolean algebras, it is customary to refer to them merely as lattices.
Any lattice can have a I<partial order> imposed upon it; this is done by defining I<a> E<lt>= I<b> whenever I<a> & I<b> = I<b> (or equivalently, I<a> | I<b> = I<a>). This partial order is symmetric, transitive, and antireflexive (if I<a> E<lt>= I<b> and I<b> E<lt>= I<a>, then I<a> = I<b>). It's called a I<partial> order because it is often the case that neither I<a> E<lt>= I<b> nor I<b> E<lt>= I<a>.
A common way to draw lattices on paper is by putting elements that are greater than other elements higher up on the paper, using line segments to indicate the partial order. Here's an example:
000 /|\ / | \ / | \ / | \ 001 010 100 | \ / \ / | | \/ \/ | | /\ /\ | | / \ / \ | 011 101 110 \ | / \ | / \ | / \|/ 111
If you can get from one element to another by only going down along the line segments, then the first element is greater than the second.
=head2 Notes for AM
=over 4
=item *
The elements of the lattice created by AM are sets. The partial order is defined as follows: I<A> E<lt>= I<B> if I<B> is a subset of I<A>. If you draw the lattice, the smaller sets are at the top. This lattice is known as the I<supracontextual lattice>; its elements are called I<supracontexts>.
=item *
The value of I<n>, and thus the size of the lattice, is determined by the length of the I<feature vector> of the test item (see F<AM.pod> for more explanation). There is a set corresponding to each I<n>-bit positive integer; furthermore, if set I<A> corresponds to integer I<a> and set I<B> corresponds to integer I<b>, then I<A> is a superset of (or "below") I<B> if I<a> & I<b> = I<b>.
=item *
Many of the elements of the supracontextual lattice are equal as sets; i.e., they have precisely the same members. Thus, for those of you who know a lot of math, it is important not to confuse the supracontextual lattice with the Boolean algebra generated by the power set of a set. The supracontextual lattice I<is> a Boolean algebra of sets; where these sets come from is explained in F<AM.pod>.
To store the supracontextual lattice, it is enough to create an array C<lattice[]> of length 2^I<n>, where C<lattice[>I<a>C<]> contains a pointer to a structure containing information about the elements of the set corresponding to I<a>.
Of course, the size of C<lattice[]> grows exponentially with I<n>; to overcome that, see the section on L<lattices as products of smaller lattices|"LATTICES AS PRODUCTS OF SMALLER LATTICES">.
=item *
The supracontextual lattice is built up by adding elements to these sets one at a time. When a new element is added to a set, it is a simple thing to C<memcpy> (actually, we use Perl's safe equivalent, C<Copy>) the original set to a new location, append the new element, and change the pointer. We only have to do this once; F<Parallel.xs> keeps track of the creation of new sets, so sometimes all that is necessary is the changing of a pointer.
=back
=head1 TRAVERSING A LATTICE
During the course of the AM algorithm, it is necessary to visit all the supracontexts that lie "below" a given supracontext. For example, given that a supracontext is labeled
1001011
AM requires an iterator that produces
1001111 1011011 1011111 1101011 1101111 1111011 1111111
though the order in which these seven are produced is immaterial.
F<Parallel.xs> does this by using a I<Gray code>. This is a method by which only one bit flips (either from 0 to 1 or from 1 to 0) at each step. Deciding which bit to flip is done as follows:
=over 4
=item 1
List the "gaps"; for
1001011
the gaps are
0100000 = gaps[0] 0010000 = gaps[1] 0000100 = gaps[2]
Each gap has exactly one 1 bit which lines up with a 0 in the original number.
=item 2
If there are I<g> gaps, list the I<g>-bit integers in reverse order: in this case, 111, 110, ..., 001, 000.
=item 3
Take each of these numbers in succession. Determine where the rightmost 1 is; its position determines which bit to flip:
1001011 1101011 = 1001011 ^ 0100000 (rightmost 1 of 111 is bit 0, use gap[0]) 1111011 = 1101011 ^ 0010000 (rightmost 1 of 110 is bit 1, use gap[1]) 1011011 = 1111011 ^ 0100000 (rightmost 1 of 101 is bit 0, use gap[0]) 1011111 = 1011011 ^ 0000100 (rightmost 1 of 100 is bit 2, use gap[2]) 1111111 = 1011111 ^ 0100000 (rightmost 1 of 011 is bit 0, use gap[0]) 1101111 = 1111111 ^ 0010000 (rightmost 1 of 010 is bit 1, use gap[1]) 1001111 = 1101111 ^ 0100000 (rightmost 1 of 001 is bit 0, use gap[0])
=back
(As I write this, I see that finding the bit to flip is the same problem of deciding which disk to move in the Towers of Hanoi problem.)
=head1 LATTICES AS PRODUCTS OF SMALLER LATTICES
Consider the following lattice: the first number is the binary label, and the other numbers represent the elements of the set with that label:
label elements
0000
0001 3 0010 0100 6 1000
0011 3 0101 3 6 0110 6 1001 1 3 1010 4 1100 5 6
0111 2 3 6 1011 1 3 4 7 1101 1 3 5 6 1110 4 5 6
1111 1 2 3 4 5 6 7
(Verify that this is a lattice.)
This lattice can be stored as two smaller lattices:
label elements
00 3 01 2 3 6 10 1 3 4 7 11 1 2 3 4 5 6 7
label elements
00 5 6 01 1 3 5 6 10 4 5 6 11 1 2 3 4 5 6 7
The set labeled by 1001 in the large lattice is precisely the intersection of the sets labeled by 10 and 01 respectively in the smaller lattices: {1, 3} is the intersection of {1, 3, 4, 7} and {1, 3, 5, 6}.
C<AM::Parallel> breaks the supracontextual lattice up into 4 smaller lattices, resulting in a great savings of memory at the expense of finding a lot of intersections. But since the elements of the sets are listed as increasing sequences of positive integers, finding the intersection is actually quite straightforward.
To initialize, set I<i> to point to the largest element of the first set and I<j> to point to the largest element of the second set.
=over 4
=item 1
Move I<i> to the left as long as it points to an integer larger than that pointed to by I<j>.
=item 2
If I<i> points to an integer less than the integer pointed to by I<j>, swap I<i> and I<j> so they point into the opposite sets; go to step 1.
=item 3
If I<i> and I<j> point to equal values, put this equal value into the intersection and move both I<i> and I<j> once to the left.
=item 4
If I<i> can't be moved, the algorithm ends.
=back