Talk:tf–idf

From Wikipedia, the free encyclopedia


Untitled/undated discussion

Usually, the term frequency is just the count of a term in a document (NOT divided by the total number of terms in the document), which is confusing because it isn't really a frequency.

I strongly agree. In all the technical papers I've been reading for my Internet services class at U. Washington, TF is the count, so TF*IDF is biased (it usually has higher values) toward longer documents and therefore needs to be normalized.
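The length bias described above is easy to demonstrate. A toy sketch (the corpus is invented for illustration; raw counts as tf, standard log(N/n_t) as idf):

```python
import math

# Toy corpus: doc_b is doc_a repeated, so it is twice as long with the same
# relative term frequencies; doc_c does not mention "cow" at all.
doc_a = "the cow jumped over the moon".split()
doc_b = doc_a * 2
doc_c = "the dish ran away with the spoon".split()
corpus = [doc_a, doc_b, doc_c]

def idf(term, corpus):
    # log of (number of documents / number of documents containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf_raw(term, doc, corpus):
    # tf taken as a raw count, as in the papers discussed above
    return doc.count(term) * idf(term, corpus)

def tfidf_norm(term, doc, corpus):
    # tf divided by document length, which removes the length bias
    return doc.count(term) / len(doc) * idf(term, corpus)
```

With raw counts, the longer document scores twice as high for "cow" even though the term is no more prominent in it; with length normalization the two scores coincide.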

Naming of the statistic (and this article)

Why is the title of the article in lower case? Why not "TF-IDF"? --ajvol 15:29, 25 November 2006 (UTC)[reply]

  • I believe the short story is that tf-idf is a well-known function in the literature and that is how it is referred to. I know that in some cases the lowercase form is used to help differentiate it from the uppercase variations that are sometimes used to refer to other equations. Josh Froelich 03:19, 11 December 2006 (UTC)[reply]
In other papers I see the multiplication sign, TF*IDF, not a hyphen. See, e.g., S. Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF", Journal of Documentation 60, 503–520, 2004.
What do you think about renaming the article? --AKA MBG (talk) 14:18, 7 March 2008 (UTC)[reply]
By far the most common representation in the literature is lowercase with an asterisk or a similar multiplication symbol. I agree that this article should be renamed if MediaWiki allows * in article names. 13:20, 26 June 2009 (UTC)

I have seen the hyphen used more often than the asterisk, but this probably varies between subfields of the literature. I believe more variants should be included in the opening sentences. Currently it might appear as if "tf-idf" and "TFIDF" are the only variants in use, which is far from the case. Thomas Tvileren (talk) 10:01, 29 April 2020 (UTC)[reply]

Use of En Dash

Why is TF-IDF spelled with an en dash in this article? zachaysan (parley) 21:29, 27 September 2020 (UTC)[reply]

Example

You could extract the most relevant terms from a version of a page of Wikipedia, perhaps this very one, as an example. --84.20.17.84 15:20, 15 March 2007 (UTC)[reply]

This is a good idea. I'll do it tomorrow. —Preceding unsigned comment added by 67.170.95.139 (talk) 07:10, 9 December 2008 (UTC)[reply]
I'd warn against using a Wikipedia article, since they change over time, which impedes reproducibility; it's better to choose a static document, such as a public domain text. If it explicitly references Wikipedia, it also runs afoul of Wikipedia:Avoid self-reference. Dcoetzee 09:29, 9 December 2008 (UTC)[reply]

Text Data Clustering

  • I think we can also use tf-idf in text data clustering. Does anyone know of any Java source code for unstructured text data clustering based on tf-idf? —Preceding unsigned comment added by 125.53.215.245 (talk) 03:09, 12 September 2007 (UTC)[reply]
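Not Java, but the general idea can be sketched in a few lines of Python: represent each document as a tf-idf vector, compare vectors by cosine similarity, and group similar documents. The greedy threshold clustering below is a deliberately simple stand-in for a real algorithm such as k-means, and the threshold value is an arbitrary choice:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse tf-idf vector (a dict) per document; raw-count tf."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def threshold_cluster(docs, threshold=0.4):
    """Greedy clustering: a document joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    vecs = tfidf_vectors(docs)
    clusters = []  # lists of document indices
    for i, v in enumerate(vecs):
        for cluster in clusters:
            if cosine(vecs[cluster[0]], v) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

On a toy corpus with two obvious topics, documents about the same topic end up in the same cluster because they share high-idf terms.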

Normalized frequencies

The frequency of the terms isn't usually normalized by dividing it by the total length of each document. Instead, normalization is done by dividing by the frequency of the most frequently used term in the document (as outlined in http://www.miislita.com/term-vector/term-vector-4.html). —Preceding unsigned comment added by 151.53.133.126 (talk) 18:59, 29 February 2008 (UTC)[reply]

I've removed the normalization from the definition of tf following several discussions on mailing lists about tf implementations. The (unsourced!) variant previously described has sown a lot of confusion on the 'net. Qwertyus (talk) 12:46, 29 June 2011 (UTC)[reply]

Also, with the information given in the example it is not possible to calculate the TF of the term "cow". It is wrong to say that TF(cow) is the frequency of the term cow (3) divided by the number of terms (100). —Preceding unsigned comment added by 201.53.230.107 (talk) 02:34, 13 October 2009 (UTC)[reply]

I've added a banner to the section requesting expert help. The name "term frequency" probably isn't clear to outsiders, whether it should be a raw count or a normalised count. If someone from the text-retrieval community could simply clarify whether it is "normal" to normalise the value or not, that would improve this page! --mcld (talk) 11:14, 3 February 2012 (UTC)[reply]

  • I'm probably not the "expert" you're looking for, but either is a measure of term frequency. Normalization accounts, up to a point, for how term frequency tends to favor long documents, but pace comments above, a very simple measure of tf is still tf. I don't think there needs to be too much stress over this. Universaladdress (talk) 06:43, 15 March 2012 (UTC)[reply]
I have edited the section to provide one particular formula, but have tried to emphasize that the given formula is not necessarily the definitive version. Hope this was helpful. 72.195.132.12 (talk) 05:03, 17 August 2012 (UTC)[reply]

Logarithms

Can someone please specify the logarithm bases correctly? Is that binary or base 10 log? —Preceding unsigned comment added by Godji (talkcontribs) 12:03, 20 March 2008 (UTC)[reply]

It doesn't matter, as long as they are all the same in your calculations. 24.222.83.249 (talk) 23:42, 1 June 2008 (UTC)[reply]
Remember that log_b(x) = log_a(x) / log_a(b). This means that converting between two logarithm bases is just multiplication by a constant. 72.195.132.12 (talk) 04:03, 17 August 2012 (UTC)[reply]
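The change-of-base identity, and its consequence that switching bases rescales every idf value by the same constant factor (so rankings of tf-idf scores are unchanged), can be checked numerically. The document counts below are arbitrary illustration values:

```python
import math

x = 1000.0
# Change of base: log2(x) equals ln(x) / ln(2).
assert abs(math.log2(x) - math.log(x) / math.log(2)) < 1e-12

# Consequence for idf: with N = 10 documents and n_t in {1, 2, 5},
# every base-2 idf is the natural-log idf times the constant 1/ln(2),
# so relative orderings of scores are preserved.
idfs_e = [math.log(10 / n) for n in (1, 2, 5)]    # natural log
idfs_2 = [math.log2(10 / n) for n in (1, 2, 5)]   # base 2
ratios = [b / a for a, b in zip(idfs_e, idfs_2)]
```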

Idf definition

It may be an issue of the Information Retrieval community as a whole, but the definition of IDF is an intellectual insult to anybody with a reasonable natural sciences background. Saying that IDF (inverse document frequency) = log(1 / document frequency) should be prohibited! Maybe the place to fix this is Wikipedia, since we can't fix IR textbooks and papers... —Preceding unsigned comment added by 194.58.241.106 (talk) 19:50, 14 June 2008 (UTC)[reply]

That would be original research, which is not permitted. We should describe IDF as it is defined and used in the field of IR. If someone in the natural sciences has published something about why this is a poor choice for a formula, it could probably be given a couple sentences somewhere. Dcoetzee 09:41, 9 December 2008 (UTC)[reply]
So what if someone is insulted? The job of Wikipedia is to convey information... if you feel insulted about something, go ask your mother for a hug. —Preceding unsigned comment added by 206.169.37.100 (talk) 20:16, 13 May 2009 (UTC)[reply]
If there is a significant difference in the way the term is used across disciplines, a disambiguation page may be in order. That information may not need to be included in this article, however. Universaladdress (talk) 06:14, 12 August 2012 (UTC)[reply]
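For concreteness, the textbook definition being debated in this thread is idf(t) = log(N / n_t), where N is the number of documents and n_t the number containing term t. A sketch with an invented toy corpus:

```python
import math

def idf(term, corpus):
    # Inverse document frequency: log of total documents over the
    # number of documents containing the term at least once.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

corpus = [
    "the cow jumped over the moon".split(),
    "the dish ran away with the spoon".split(),
    "hey diddle diddle".split(),
]
# "the" appears in 2 of 3 documents; "cow" in only 1 of 3,
# so the rarer term "cow" gets the larger idf.
```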

Notation very confusing

The index notation here is difficult to understand quickly because you use i for word and j for document. It would be much easier to read and grok if you used w, w', w" for words, and d, d', d" for documents. —Preceding unsigned comment added by 206.169.37.100 (talk) 20:12, 13 May 2009 (UTC)[reply]

Why is there a × sign instead of a ⋅? It is not a cross product here: tf–idf is a plain product of two scalars, tf ⋅ idf.

—Preceding unsigned comment added by 137.250.39.119 (talk) 12:13, 17 June 2009 (UTC)[reply]


Example is incorrect

In the example, the term frequency should be the absolute count of the term in the document, and shouldn't be divided by the count of all terms in the dictionary.

  • Detailed discussion about this question above on the talk page. Expert help has been sought to clarify the article. Universaladdress (talk) 15:46, 29 April 2012 (UTC)[reply]

corrupted text at Web mining

This concerns the topic of this article. The corrupted text appears after "When the length of the words in a document goes to". I'm deleting it; hopefully someone here can fix it. — kwami (talk) 23:13, 17 January 2014 (UTC)[reply]

Why are you reporting this here, instead of at Talk:Web mining? QVVERTYVS (hm?) 23:24, 17 January 2014 (UTC)[reply]

Double Normalization K

What does the "K" stand for in "Double Normalization K"? — Preceding unsigned comment added by 195.159.43.226 (talk) 15:06, 15 September 2015 (UTC)[reply]

Looks like it's just a nameless constant that somebody decided to call K. I.e., it doesn't stand for anything. QVVERTYVS (hm?) 15:14, 15 September 2015 (UTC)[reply]
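For what it's worth, the scheme in question is usually given as tf = K + (1 − K) · f(t, d) / max f(t', d), i.e. max-frequency normalization squashed into the range [K, 1] by the free constant K (commonly 0.5). A sketch, with the example document made up:

```python
from collections import Counter

def tf_double_normalized(term, doc, K=0.5):
    """'Double normalization K': raw frequency divided by the maximum
    frequency in the document, then rescaled into [K, 1] by the
    otherwise meaningless smoothing constant K."""
    counts = Counter(doc)
    return K + (1 - K) * counts[term] / max(counts.values())

doc = "a a b".split()
# "a" is the most frequent term, so its tf is exactly 1.0;
# "b" occurs half as often, giving 0.5 + 0.5 * 0.5 = 0.75.
```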

Recommended tf–idf weighting schemes

This table is confusing. Scheme 3 is identical for document and query term, which does not make sense. --zeno (talk) 12:42, 13 December 2018 (UTC)[reply]

I agree with the above comment on multiple points. First, the column headers reference unexplained terms: what is "document term weight"? Is it equivalent to the "idf weight" presented in the previous table (the same question applies to the next column)? Second, the weights presented in the columns do not match those in the variant tables, making it difficult to understand which component (tf or idf) they relate to. Third, the weighting scheme column conveys little information, just numbering the schemes. Do these schemes actually have names? Are they referenced in a published article (if so, a citation should be added)? Finally, the layout does not follow that of the two other variant tables (table title, left aligned), making it unclear whether it is part of the "standard" tf*idf or one of its variations. AlexisBRENON (talk) 10:14, 21 November 2022 (UTC)[reply]

Typo in Link with Information Theory

In the middle formula a W appears. Probably this should be a T? --Tobias.rostock (talk) 12:47, 26 November 2019 (UTC)[reply]

Past tense in introduction

@DancingPhilosopher: Why did you change the tense in the lede section to past tense in this edit? I can't believe that tf-idf is not used any more... --Wrongfilter (talk) 10:36, 9 February 2024 (UTC)[reply]