Talk:Probabilistic latent semantic analysis

This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's importance scale.
This article is supported by Applied Linguistics Task Force.

Statistics Low‑importance

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the importance scale.

The way the method is explained in the original article, and repeated in every other article I have seen, does not allow engineers to reproduce it in software. I rewrote the explanation in formal linear-algebra notation and added a 100-line program that demonstrates the method. It is already published online, but I would like to know whether anyone reads this topic and really needs complete clarity. Let me know if anyone is interested and I will add the link. -- Preceding unsigned comment added by 65.207.170.141 (talk) 20:12, 28 September 2011 (UTC)
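
For anyone who wants something concrete to experiment with in the meantime, here is a minimal sketch of the standard EM updates for PLSA in plain Python. It is not the program mentioned above, which is not linked here. The names `plsa_em`, `normalize`, `log_likelihood` and `toy_counts` are made up for illustration, the count matrix is invented toy data, and the sketch assumes every document contains at least one word.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum of n(d, w) * log P(w | d) over all pairs with n(d, w) > 0."""
    n_topics = len(p_w_z)
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics))
                ll += n * math.log(p)
    return ll


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z | d) and P(w | z) to a document-by-word count matrix with EM.

    Hypothetical helper for illustration only. Assumes each row of
    `counts` (one document) has at least one non-zero entry.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random, strictly positive starting distributions.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    history = []
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w), proportional to P(z | d) P(w | z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += n * post[z]
                    new_w_z[z][w] += n * post[z]
        # M-step: renormalize expected counts into probabilities.
        p_z_d = [normalize(r) for r in new_z_d]
        p_w_z = [normalize(r) for r in new_w_z]
        history.append(log_likelihood(counts, p_z_d, p_w_z))
    return p_z_d, p_w_z, history


# Invented toy data: 4 documents, 4 words.
toy_counts = [
    [4, 3, 0, 0],
    [5, 2, 0, 1],
    [0, 0, 3, 4],
    [0, 1, 4, 5],
]
```

Calling `plsa_em(toy_counts, n_topics=2)` returns the two probability tables plus the log-likelihood after each iteration, which should never decrease under EM. The sketch does nothing about the overfitting discussed further down this page (no tempering or smoothing).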

Corrected a few inconsistencies/confusions/inaccuracies:

AFAIK the acronym PLSA is more common than the lower-case variant pLSA -- we need to be consistent anyway.
Fisher kernels allow PLSA to be used in a discriminative setting, not as a generative model.
Whoever wrote the part about "severe overfitting problems" should provide a reference for that.

I stumbled upon a paper stating these overfitting problems and added the reference. -- Preceding unsigned comment added by Keretapi (talk • contribs) 14:37, 17 September 2007 (UTC)

In "Evolutions...", _discriminative_ was obviously wrong -- I think what was meant is _generative_ -- that's one way to present LDA.
Added a bullet on the extension to higher-order data

Sunny house 20:00, 22 August 2007 (UTC)

Excellent. Rama 08:38, 23 August 2007 (UTC)

Errr -- whoever added the graph: it's nice and everything but could you try to use the same notation as in the article? Sunny house (talk) 19:44, 11 March 2008 (UTC)

No, in every paper I have read, the latent variable is always denoted as 'z'. So the text should be changed instead.--137.250.39.133 (talk) 09:21, 18 April 2008 (UTC)

Dear 137.250.39.133: first, the goal here is not necessarily to reproduce what you read in other papers, but to provide a self-contained explanation of PLSA. Whether the latent variable is denoted c or z is inconsequential as long as it is clear that it is a latent variable. However, the main issue with the graph is that it is confusing w.r.t. the document variable 'd', which is denoted by theta in the graph. I doubt every paper you read uses this notation -- of the papers cited here, Hofmann, Vinokourov et al. and Gaussier et al. certainly do not. Finally, there is a captioning problem: the words are not the only observables, the document index is observed too (by definition). Sunny house (talk) 13:18, 5 July 2008 (UTC)

Actually Hofmann, in his original paper "Probabilistic Latent Semantic Analysis", uses "d" for the document variable, "z" for the topic and "w" for the observed word. However, this is by no means important, and several other works use different letters for the variables in both the plate notation and the formulas. It is in fact more common to see "z" as the topic, but this should not be taken as a rule. For clarity, the text and the image should use the same letters. Also, Sunny house, when you say that there is a captioning problem because the documents are observed, I'm not sure whether you mean that the document node should be shaded. If you do, it should not be shaded. It is not the observed document itself, but rather a distribution over the topics. The observed documents are implicit in the observed words node.
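
For reference while comparing notations, the model under discussion can be written with the d/z/w letters mentioned above. This is a sketch of the usual two equivalent parameterizations (the first conditions the topic on the document, the second treats documents and words symmetrically); whichever letters the article settles on, the graph and the text should match one of these forms.

```latex
P(d, w) \;=\; P(d) \sum_{z} P(z \mid d)\, P(w \mid z)
        \;=\; \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```

In both forms d and w are the observed pair and z is the latent topic that is summed out.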