Talk:Stemming

From Wikipedia, the free encyclopedia
WikiProject Linguistics / Applied Linguistics
This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of Linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's quality scale.
This article has not yet received a rating on the project's importance scale.
This article is supported by the Applied Linguistics Task Force.

This should not redirect. This should be the article. User:Jfroelich

Agree, this is not the same process. Lemmatization produces grammatically correct word forms, but stemming just removes a series of suffixes/prefixes and/or endings. The result is not necessarily a word at all. Stemming is done through a grammar-unaware algorithm. As a result, totally different words can have the same stemmed version. It would be totally insane to merge the two articles. User:Itman —Preceding undated comment added 17:35, 21 October 2010 (UTC).
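To illustrate the point about grammar-unaware suffix removal, here is a minimal sketch in Python. The suffix list, the `min_stem` threshold, and the function name `strip_suffix` are assumptions made up for this example; they are not taken from the article or from any published stemmer.

```python
# Hypothetical suffix list for illustration only.
SUFFIXES = ("ing", "ed", "s")

def strip_suffix(word, min_stem=3):
    """Remove the longest matching suffix, with no knowledge of grammar."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Only strip if enough of the word would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print(strip_suffix("making"))  # "mak", which is not an English word
print(strip_suffix("caring"))  # "car"
print(strip_suffix("cars"))    # "car", the same stem as "caring"
```

Under these assumptions the output for "making" is not a word, and the unrelated words "caring" and "cars" collapse to the same stem. A lemmatizer would instead need a dictionary or grammatical analysis to return "make", "care" and "car".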

The section on matching algorithms is unreadable. —Preceding unsigned comment added by 129.170.66.210 (talk) 20:24, 7 September 2007 (UTC)

Agree, the section on matching algorithms is unreadable... I'd edit it, but it isn't clear what the point of the section is.
I've cleaned it up. However, I'm not an expert in that field. I did my best to figure out what was really meant in that paragraph and rewrite it using good English, style, explanations and examples, but it would be nice if someone who really knows something about matching algorithms reviewed this section. 89.138.151.18 (talk) 10:34, 27 May 2008 (UTC)
Not very happy with whoever wrote this section. It is redundant with a large amount of the text already written, and it is poorly written. (original author Josh Froelich)

A few other points, if someone has the time to edit:

1) There appears to be no mention of 'suffix substitution'. This is common in most stemmers that also do suffix stripping: in English, for example, -ies is replaced with -y (so that ladies becomes lady). The method can be extended to map irregular verb forms to their base form, such as ran to run.

I added this today, if you want to review it. (Josh Froelich)

2) The list of languages is a bit pointless. Many commercial products, e.g. Verity and dtSearch, offer stemming for dozens of languages.

Agree, but it was there in the one-paragraph article that used to exist, so I left it in.

3) Google is a rather poor commercial example, I think. Many other commercial companies have been using stemming techniques for decades. Ray3055 20:27, 27 September 2007 (UTC)

Agree, but it was there in the original article, so I left it in. There is a bit of Google fanaticism, unfortunately, and there is some truth to the fact that many non-technical people approach this concept through Google.
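Point 1 above (suffix substitution, plus a lookup for irregular forms) can be sketched roughly as follows. The rule table, the irregular-form table, and the name `substitute` are hypothetical and chosen only for illustration; a real stemmer would use a much larger, carefully ordered rule set.

```python
# Hypothetical (old_suffix, new_suffix) rules, tried in order.
RULES = [("sses", "ss"), ("ies", "y")]

# Hypothetical exception table for irregular forms.
IRREGULAR = {"ran": "run"}

def substitute(word):
    """Apply an irregular-form lookup first, then suffix substitution."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for old, new in RULES:
        # Require at least one character before the suffix.
        if word.endswith(old) and len(word) > len(old):
            return word[: -len(old)] + new
    return word

print(substitute("ladies"))  # "lady"
print(substitute("ran"))     # "run"
print(substitute("lady"))    # "lady" (unchanged)
```

The exception table is consulted before the rules so that irregular forms are never passed through suffix handling.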

"On the other hand, stemmers for true isolating languages such as Vietnamese can be even simpler than those for English." Removed this: Vietnamese has no verb inflection and no noun declension, hence stemming does not apply. The link to the Vietnamese Wikipedia article is confusing. It uses the phrase "..is an analytic (or isolating) language", implying the terms mean the same thing, but see the two separate articles. Ray3055 12:03, 14 November 2007 (UTC)

I changed the number "one" to "two" under "Hybrid approaches". This is my first edit in Wikipedia ever. I welcome any comments that may help me be a better editor. Lon of Oakdale (talk) 18:05, 14 April 2008 (UTC)

This article as it stands is absolute crap. Large parts are waffling opinion and each section could be summarised in one or two sentences without losing any content, if indeed the content is any use (see the lack of citations, personal opinions, etc.) Judgements about algorithm efficiency are a bit arbitrary (e.g. algorithmic approach faster than lookup?!?) The assertion that using stochastic algorithms makes it harder to distribute software is just arrant nonsense.

Please can someone with a linguistics and computer science background clean this up? —Preceding unsigned comment added by 88.109.5.58 (talk) 23:56, 21 June 2010 (UTC)

need more information

"immense amounts of storage" is vague, and its meaning depends on when the statement was made. 64 KB was an immense amount of storage at one point. — Preceding unsigned comment added by Zachary Whitley (talk · contribs) 00:56, 2 October 2011 (UTC)