Talk:Document-term matrix

WikiProject Linguistics / Applied Linguistics (Rated Stub-class)
This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Stub-Class on the project's quality scale.
This article has not yet received a rating on the project's importance scale.
This article is supported by Applied Linguistics Task Force.
This article has been automatically rated by a bot or other tool as Stub-Class because it uses a stub template. Please ensure the assessment is correct before removing the |auto= parameter.

I need some help here:

Do you think I focused too much on vectors?

We definitely need more applications. Kh251

I don't agree with the latest changes. Performing an eigenvalue decomposition reduces the size of the matrix, which improves speed but decreases accuracy. I know I might be wrong, but I'd like to understand... KH251 09:32, 21 July 2005 (UTC)

Not necessarily: what you say is one valid interpretation of the reduction, but the reduction can also be interpreted as creating a "better" matrix, since the operation tends to "soften" the representation and reduce possible noise.
Also, it's not always true that this makes things easier on the computational side; for instance, LSA is considerably heavier than just leaving the matrix alone (I have a reference for that somewhere, I am just rather busy at the moment...). Hope it helps! Cheers! Rama 12:14, 21 July 2005 (UTC)
Yes, but LSA is computed once; the important part is having real-time answers to queries. Once the matrix is smaller, this will be faster, won't it? KH251 12:37, 21 July 2005 (UTC)
LSA places a very serious computational burden on a search engine. Right now, if you type a word into a search engine, it looks the word up in a trie and finds the documents that contain that word in O(1) time (independent of the number of documents in the collection). If you had a search engine that looked up documents in the LSA latent space, it would have to perform a high-dimensional nearest-neighbor search. LSA is typically used with 100+ dimensions, so none of the computational-geometry speed-ups for nearest-neighbor search apply. Therefore, the search would be O(N), where N is the number of documents in the collection. For Google, N would be 8,000,000,000. As you can see, this is disastrous for searching the web. -- hike395 06:14, July 22, 2005 (UTC)
Oh ! That's how ! Thank you very much for the explanation. You made my day. KH251 09:02, 22 July 2005 (UTC)
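For later readers of this thread, here is a minimal Python sketch of the contrast described above. It is only an illustration under stated assumptions: the three documents are made up, a hash-based dict stands in for the trie mentioned above, and the 2-dimensional "latent" vectors are hypothetical hand-picked values, not the output of an actual LSA computation. The names keyword_lookup and latent_search are invented for this example.

```python
from collections import defaultdict
import math

# Toy collection (made up for illustration).
docs = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "stock markets fell sharply",
}

# Inverted index: term -> set of ids of the documents containing it.
# A dict is used here in place of the trie mentioned in the discussion.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)


def keyword_lookup(term):
    """Single index lookup; no scan over the whole collection is needed."""
    return sorted(inverted_index.get(term, set()))


# Hypothetical latent vectors, one per document. Real LSA would derive
# these from a decomposition of the document-term matrix, typically
# with 100+ dimensions.
latent = {0: [0.9, 0.1], 1: [0.8, 0.2], 2: [0.1, 0.9]}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def latent_search(query_vec):
    """Brute-force nearest neighbour: compares the query with every document."""
    return max(latent, key=lambda doc_id: cosine(query_vec, latent[doc_id]))


print(keyword_lookup("cat"))
print(latent_search([0.0, 1.0]))
```

The keyword lookup touches only the index entry for the query term, while latent_search has to visit every document vector, which is the O(N) behaviour described above.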

Since several of us seem to have a taste for this subject, would anyone fancy creating an "NLP project" on Wikipedia? Rama 12:18, 22 July 2005 (UTC)

Intro Improvement Request

I encountered this term for the first time just a few minutes ago. I read the intro, but I still don't have a clear idea of what a document-term matrix is, other than it is a mathematical matrix and that it is related to a body of text. Danielx (talk) 01:42, 2 November 2009 (UTC)
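A small concrete example may help here. The following is a minimal Python sketch, not taken from the article: the three documents are made up, and the variable names vocabulary and matrix are chosen for illustration. Each row of the matrix corresponds to one document, each column to one term, and each entry counts how often that term occurs in that document.

```python
# Toy documents (made up for illustration).
docs = ["the cat sat", "the dog sat", "the cat chased the dog"]

# Columns: every distinct term in the collection, in alphabetical order.
vocabulary = sorted({term for doc in docs for term in doc.split()})

# Rows: one per document; each entry is the count of a term in that document.
matrix = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

With these documents, the columns are cat, chased, dog, sat, the, and the third row is [1, 1, 1, 0, 2] because "the" appears twice in the third document.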