# User:Prasenjitmukherjee

## Prasen Mukherjee

I live in Bangalore and work for Yahoo! MultiMedia search. I am interested in data mining, mathematics, physics, religion and lots of other random topics.

## What do I do

• Clustering in Mobile Ad Network Space
• Extended Fielder Method ( Based on SVD decomposition )
• Stacked RBMs ( aka Restricted Boltzmann Machine )

## Earlier Work

### Architecting Mapquest/Mobile Platform Solutions

• Mapquest
• Single Line Geo Search
• Extracting contextual location based on short term browser behavior
• Mood based route recommendation
• WebTileCache
• Mobile AIM Platform
• Wireless Village/IMPS

### Spearheading R&D initiatives

• Leading R&D initiative by collaborating with premier research inistitutes
• Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)

### Ad monetization (Mobile/Social-Networking)

• Interact/Influence different business units to aid revenue-generation and monetization
• Apply technology to gauge trends and user-behaviour
• Spot market opportunities

## Thoughts on Personalized Recommended Systems

### Recommendation System ( common for Mobile/Desktop)

• Recommend (news) articles
• Recommend upcoming events
• Personalized browsing - Idea here is to learn the immediate intent of user dynamically ( in the current session as he broswes more and more pages ) from his past history, and recommend articles based on that.

#### Temporal profiling of user log data

• Analyze distribution of personal topics ( probably extracted via PLSI,LDA etc.) over time
• Use the above distribution data to group temporally similar users
• Reference papers :
• Andrew McCallum's Publications
• A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Capture the time distributions not only of a topics, but also of their co-occurrences. For example, notice that while NLP and ML have both been around for a long time, but their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
• Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD) 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. Improvements in topic saliency and the ability to predict time given words.

### Personalized Summarization

• Can be applied to mobile devices
• Perform document summarization ( snippet-generation ) based on user profile.
• Describe a user based on his past behavior. This requires NLP techniques.

### Use Session data to aid unsupervised clustering

• Each session could be thought as a representative of a user-intent ( aka topic ) as generally ( there are always exceptions) we do related things in a session. Like searching for a travel destination, followed by searching for hotel rates, followed by searching for air-tickets. If we make this assumption, this greatly reduces the problem of high dimensions.
• Now instead of clustering based on document/word tuple, we could create a representative vector for each user session ( call it a session-vector in term-space ) and apply clustering logic on session/word tuple. We can use LSH to see the distribution of session-vectors across different hash-buckets, which can be used to estimate the aprior number of clusters for latent variable algorithms like PLSI/LDA etc.

## Understanding the essence of probability distributions

### Understanding Maximum Likelihood

Its the basis of probability theory, we use it everyday without realizing it. How do we compute probability : #-possible-outcomes/#-total-outcomes. This intuition/axiom/theorem/rule is based on max likelihood. Lets say we have a Bernoulli's event with n-trials and k-positive outcomes. We say probability of positive outcome is k/n. This comes from max-likelihood. Lets see it how.

• Probability of data/likelihood ( L ) = p^k * (1-p)^(n-k)
• Which value of p will maximize L. That is given by dL/dp = 0.
• Solving dL/dp=0 will give same results as solving d(logL)/dp=0.
• Solving d(logL)/dp=0 ==> p = k/n Q.E.D.

### Bernoulli type distributions

• Relation between Bernoulli/Binomial/Beta
• Binomial is a special case of Bernoulli.
• When n(#-of-trials) = 1, the binomial distribution is a Bernoulli distribution.
• Beta Distribution is continuous extension of Binomial Distribution
• Beta Distribution is also the conjugate prior of Binomial and Bernoulli Distribution in Bayesian statistics.

### Other important relations between distributions

• The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed.
• The Gamma function generalizes the factorial function for non-integer and complex values of n. If n is a positive integer, then
$\Gamma(n) = (n-1)!\,$

showing the connection to the factorial function.

${n \choose k} = \frac1{(n+1) \Beta(n-k+1, k+1)}$

## Why chose Dirichlet

Lets have topic = z, doc = d, word = w.

If we assume P(d|z) is Multinomial i.e.

$P(d|z) = {n! \over z_1!\cdots z_k!}p_1^{z_1}\cdots p_k^{z_k}, {where } \sum_{i=1}^k z_i=n$

What the above equation means is that in a document there are n (doc-length) slots to be filled by one of the z topics, ranging from $z_1$ to $z_k$. P(d|z) is effectively the document distribution and is often referred as just P(d). This is different from p(d) which is the probability of d.

Assuming P(d|z) is a multinomial distribution what are our choices for P(z|d)

From Bayesian statistics :
P ( d | z ) * P ( z ) = P ( z | d ) * P ( d )
P ( z | d ) = P ( d | z ) * P ( z ) / P ( d )
= Multinomial () * P ( z ) / Constant(i.e. Evidence )

Since Dirichlet is a conjugate prior of Multinomial, it is a good choice for P ( z | d ). And since P ( z | d ) is dirichlet, so would be P ( z ) ( from the definition of Conjugate Prior).

## CondtionalRandomField (CRF)

• Comparison with HMM
• In HMM observations ( emission entities ) are independent
• Unlike CR which is based on discriminative model, HMM is based on a generative model, which means its based on joint probability distribution p(s,o) and requires a model of p(o)
• Comparison with MEMM
• MEMM is a directed graph, means that hidden-states are dependent ONLY on past history
• Like CRF, MEMM is also based on discriminative model, which means its based on conditional probability distribution p(s|s',o) and doesn't require a model of p(o) as observation (o) is already given.

### Hidden Markov Models

Essentially based on hidden variable model which can transition from 1 variable(state) to another with some probability. These states can be interpreted as some known set states which cannot be directly observed, like mood,affinity etc. This approach can be used to model supervised clustering, where clusters are known set of states and features being represented as transmission probabilities (also known as confusion matrix). In analogy with term-doc text mining parlance documents ( or sentences or browsing behaviors ) can be generated by sampling over terms in sequence. The hidden states could be abstract,intro,content,conclusion for documents ( subject,predicate,verb for sentences and start,browse,end for browsing users ).

• An HMM is defined by λ = (M,N,A,B,π), where M=num_observations, N=num_hidden_states, A = NXN self-similarity matrix(state transition probabilities), B = MXN confusion matrix ( state->observable emission probabilities), π = NX1 array of initial hidden state probabilities.
• Basic problems/solutions
• Training ==> Given a set of sequence observations O[i]s estimate HMM parameters by maximum likelihood ( i.e. maximize P(O|λ) ) == Baum Welch Algorithm
• Inferencing ==> Given an observable-sequence find out the most likely state-sequence == Viterbi Algo

## LocalitySensitiveHashing (LSH)

CAVEAT :: These are all my personal observations. Appreciate constructive comments.

Interesting technique to do the following

• Speeds up Clustering
• It avoids pairwise distance/similarity computation
• Near Duplicate Detection
• Tries to solve "Curse of Dimensionality" problem
• As query is independent of corpus size
• More applicable to multimedia indexing
• Unlike text indexing, they follow QueryByExample hence LSH is directly applicable.
• Less applicable to text indexing
• As text-queries are very short ( 1-2 words), hence there will be very few hits as very few documents will have 1-2 words.

### Basic LSH Algorithm

• Use functions of the form
• $g(p) = $
• Preprocessing:
• Select $g_1...g_L$
• For all $p \in P$, hash p to buckets $g_1(p) \cdots g_L(p)$
• Query
• Retrieve the points from buckets $g_1(p),g_2(p) \cdots$ until
• Total points <= 2L or all
• Return top k-points by doing similarity computation among 2L points

### LSH hash functions

Basic definition can be found in Locality_sensitive_hashing

#### Hamming Hash Function

This is the first LSH hash funchion mentioned in Indyk-Motwani'98's original paper on LSH. Seems to be more applicable to compute Image Similarity ( Color Histogram match etc )

#### Minwise Permutation

Initial intent was to do similarity computation using some probabilistic technique. Later used as an efficient LSH. Set based approach. All the other approaches are vector/space based. Since its set based unlike vector-space approach, it doesn't take care of the relative distribution of term-frequencies in a document. More applicable to do Clustering, Duplicate Detection etc. in text corpus.

• Article link : Minwise Independent Permutations Broder et al.
• Metric : Jaccard Coefficient
• $h(d) = min_{w \in d} \{ \pi(w) \}$
• $Pr[h(v)=h(u)] = J(v,u)$, where J = Jaccard Coefficient
• Good for text/document similarity

#### p-Stable Distribution

Uses properties of stable distribution family. Seems to be the most talked about these days. Wide applications including text-retrieval, (sub)image-retrieval, clustering(text,media), duplicate-detection(text,media). Andoni-Indyk-06 extended the projection space from 1 dimension to t-dimension and introduced leech-lattice LSH based on Voronoi diagrammes and leech-lattice-space.

• Metric : $l_p(v_1, v_2)$ Basically it honours $l_p^d$ norm
• $h_{\mathbf{a},b} (\boldsymbol{\upsilon}) = \left \lfloor \frac{\mathbf{a}\cdot \mathbf{v}+b}{r} \right \rfloor$
• $Pr[h(v)=h(u)] = \int_{0}^{r} \frac{1}{c}f_p(t/c) (1- t/r)\, dt$, where
• $c = \left \Vert v-u\right \|_p$, and
• $f_p(t)$ denote the prob density of the absolute value of the p-stable distribution

### Semantic Hashing

This is based on stacked RBMs with layered pre training ( of RBM layers ) followed by backpropagation based training. This seem to work on input data vectors which are quasi binary and can be projected on a hamming space ( means coefficient-values close to 0 or 1 ). Thats one of the problem in modeling a problem with continuous probability distributions into RBM. Hence clustering based on containment or just presence of a click ( instead of click-count or click-probability ) can be modeled using RBMs.

## My SVD Observations

• The key to working with SVD of any rectangular matrix A is to consider AAT and ATA.
• The columns of U, that is t by t, are eigenvectors of AAT,
• The columns of V, that is d by d, are eigenvectors of ATA.
• The singular values on the diagonal of S, that is t by d, are the positive square roots of the nonzero eigenvalues of both AAT and ATA.
• Computation of EigenValues/Vectors
• First transform the given matrix into an upper-triangular form ( via HousehOlder reflections/Gram-Schmidt Normalization etc. )
• Then use iterative QR algorithm.

## Random Indexing

Random Indexing is a vector space technique that provides an efficient and scalable approximation to distributional similarity problems.

 Instead of collecting word-use statistics into a 54,000 x 37,600
(M x N) words-by-documents matrix F, collect them into a
54,000 x 4,000 (M x N') matrix G.  How?
0.  Set G = 0.
1.  For each document d, make an N'-dimensional (e.g., 4,000-D)
ternary INDEX VECTOR i_d by placing a small number k of
-1s and +1s (e.g., ten of each; k = 10) at random among
the 4,000; the rest will be 0s:
i_d = 0000000+000...000+000000-000...000,
that's 4,000 "trits," 3,980 of them 0s.
2.  Scan through document d.  Every time the word w appears,
add the index vector i_d to row w of matrix G.
NOTE: The same procedure will produce the matrix F if the 37,600
index vectors i_d are 37,600-bit and have only one 1, in the
d-th position (i.e., unary or LOCAL representation):
i_d = 00000000000...000+0000000000...000000000000000000000000000,
that's 37,600 bits, 37,599 of them 0s.


## EC2 Quirks

### General Tricks

• Logging into ec2 instance/cluster from a diff machine
• Run ec2-describe-instances and use the public DNS ( aka ec2-x-x-x-x.compute-1.amazonaws.com ) just after the <cluster>-master line.
• ssh -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com
• scp get
• scp -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com:/root/log.txt .
• Can use 'cluster ssh'
• Write a script that uses soap to get a list of all running instances and you can use cluster SSH to connect to and issue commands to all machines simultaneously

### Use Oracle data mining

Summarized from [2]

• Use Elasticfox and the following ami to boot instance : “ami-117b9778”
• Enable Local Port-Forwarding
• Get ec2-host's local ip address ( using ifconfig ), then run the following ssh-command
• ssh -4 -g -L <local-port-12345>:<local-ip-X.X.X.X>:1521 -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
• To Use SQLPLUS
• ssh -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
• su - oracle
• sqlplus dmuser/dmuser
• To Use OracleDataMiner GUI
• Download Oracle's odminer-GUI tool for Data Mining
• <unzip-dir>/odminer/bin/odminer
• Use following odminer settings :
• user:dmuser/pwd:dmuser
• host:localhost/port:<local-port-12345>/sid:odmdb

## Regression Analysis

Basic idea is based on OLS ( ordinary least squares fit ) of straight line. Extended to more than 1 dependent variables ( regressors ). Very applicable to inventory estimation/forecasting problem for any ad-network mobile networks specifically, since the inventories explode in mobile space because of device*carrier*geo aspects. For desktops device*carrier doesn't matter and geo almost remains the same for a particular inventory cell.

### Linear Regression

• Simple_linear_regression is the uni-variate form
• Linear_regression is a multi-variate extension of the above
• Problems with multiple (linear) regression
• Not local ==> two completely independent (orthogonal) regressors affecting each other

### Scaling the Multiple (Linear) Regression Problem

• Exploit hierarchy in regressing dimensions
• Dance between being too macro and too micro
• Too micro ==> unstable regression
• Exhibit random behavior ( central limit theorem )
• Too macro ==> bad estimation
• Violates locality ( as well collapse the dimensions )
• Use LSH
• Control the dance via parameterization of hash functions

• Can mix contextual with behavioral, based on currently-watching-pgm and past history with appropriate weights given to each of the algorithms.
• Can use user's EPG settings to pre-fetch ads for contextual ads, as we know what is the pgm user is gonna watch.
• Use CRF models as compared to Markovian models as we know the future behavior here.

### Client based (STB) recommendation

• Needs to cluster different user-groups in the household
• TV is shared, hence superimposition of behaviors. ( kids, family, teens, parental, senior citizen, young-adult etc. )
• Identify current user-group ( or a mixture of groups )
• Based on temporal, contextual current data
• Based on current EPG settings ( user1,user2 etc. ), channels watching, frequency of activity, time-of-day etc.
• Different user-groups has different cut-off bar for showing recommendations/advertisements :
• For Kids  : low bar ==> ( low precision, high recall ). Can show it to anyone. may not be effective for some.
• For Adults : high bar ==> dangerous to show adult advertisements to kids ==>( Extremely High precision, high recall )
• For finance/home-loan ==> moderate bar ( as everyone is moderately interested in finance etc. Low returns when showing it to kids though )

#### Bidirectional HMM

• Can model as bi-directional HMM
• STB sees the ads sent as observations
• Cable Network sees yes/no from STB as observations

## OpenCV building guide

• Go to OpenCV Installation Guide
• Install Cmake 2.8+
• Building instructions : ( basically ./bootstrap; make ; make install )
• Add /opt/cmake/bin/cmake in your $PATH ( thast where cmake is installed by default in linux ). In mac • Install pkg-config • in linux : "yum install pkgconfig" • in Mac • Download source code • Instructions are in INSTALL fiel which comes with the src code ( steps : ./configure; make; make check; make install ) • Configure OpenCV by running cmake • cd opencvdir; mkdir release; cd release; • In redhat/linux : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON .. • In Mac : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON -D BUILD_JAVA_SUPPORT=FALSE .. • sudo make install • For Python support • I was able to compile successfully for python in Mac ( redhat has some issues I am still debugging ) • In mac ensure that you build with -D BUILD_NEW_PYTHON_SUPPORT=ON option while running cmake. • after running "make install" , export PYTHONPATH=opencvdir/release/lib:opencvdir/samples/python:$PYTHONPATH ( there should be a cv2.so file under <opencvdir>/release/lib )
• now run python; import cv2;

• Installations on python
• To summarize, a personal private(.ssh/id_rsa)/public(.ssh/id_rsa.pub) key pair is generated using the ssh-keygen command. The public key is then copied onto a remote systems' .ssh/authorized_keys file. And you can now SSH to the remote systems's account without the use of a password. SSH uses client's .ssh/id_rsa as the priavte key and sends auth request to server which then checks its authorized_keys entries. If present it allows pwdless login
• On Client machine :
• ssh-keygen -t rsa
• scp .ssh/id_rsa.pub prasen@remotemachine:/homes/prasen/.ssh
• ssh prasen@remotemachine 'cat /homes/prasen/.ssh/id_rsa.pub >> /homes/prasen/.ssh/authorized_keys'
• rm .ssh/id_rsa.pub ( Optional )

### GNUPLOT Tips

• Use Gnuplot in Mac
• Install Macports ( macport installs packages in /opt/local/bin )
• sudo port install gnuplot ( gnuplot installed in /opt/local/bin/gnuplot )
• The default aquaterm doesnt work in Mac 10.6
• xterm;gnuplot;set term xterm; plot 'dat' u 1:2 smooth cumulative; ( to see it on xterm )
• set term png;set output 'output.png'; plot 'dat' u 1:2 smooth cumulative; ( this will generate png file )
• Additional gnuplot tips
• set datafile separator '\t';
• Ranked Data plotting
• Useful Links :

### Solr Tips

• Solr Multicore setup ( with sharding )