# Talk:Kernel trick

WikiProject Robotics (Rated Start-class, Mid-importance)
Kernel trick is within the scope of WikiProject Robotics, which aims to build a comprehensive and detailed guide to Robotics on Wikipedia. If you would like to participate, you can choose to edit this article, or visit the project page (Talk), where you can join the project and see a list of open tasks.
WikiProject Mathematics (Rated Start-class, Low-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating: Start-Class, Low-importance. Field: Applied mathematics

## Merge with Kernel Method

This article is nice, and has some good mathematical background that the Kernel methods article is missing ... but the two articles seem to be discussing the same concept. I think we should merge them. What do others think? We can make a redirect so both names still work for links and searches. (User talk:riedl July 3 2009). —Preceding undated comment added 21:12, 3 July 2009 (UTC).

I am not sure if this should happen; but we need to make sure that either the article keeps the title "kernel trick" or resolves that link correctly as that is a very common name in the literature. Doctorambient (talk) 18:44, 26 March 2013 (UTC)

Many (most?) people are introduced to the "Kernel Trick" because it is used by SVM, which is popular. I think having a dedicated topic for Kernel Trick is useful and should be MOTIVATED BY CLEAR EXAMPLE. The Mathematical principles should be in the Kernel Models page. my 2 cents. — Preceding unsigned comment added by 76.21.11.140 (talk) 01:55, 5 June 2013 (UTC)

I like the tone of the article. It offers a direct statement of what the "Kernel Trick" is in a way that is easy to grasp. A more formal treatment may retain the actual content while losing the reader in the details. This one does not do that. — Preceding unsigned comment added by 108.31.9.224 (talk) 02:48, 28 November 2013 (UTC)

I think the link to Kernel_statistics is really misleading, since it's about kernels used in non-parametric density estimation (window estimators). This is something completely different! —Preceding unsigned comment added by 82.130.82.137 (talk) 17:05, 10 February 2009 (UTC)

## PDF linked in external resources

I am going to delete the document linked as part of the external resources. My reasons are: 1) the document lacks any scholarly format (there are even question marks indicating the author's own uncertainty about certain details; for example, in the second paragraph, the author is unsure of who is responsible for a contribution). 2) The language used in the document is too informal. 3) The author introduces many concepts without the background or explanation necessary to understand them.

I believe, on the whole, that someone who follows the link to this document and gives it a read will be left with no clear idea of how the kernel trick works.

If you think otherwise, please undo my edit and write a short note on my talk page as to why this document should continue to be linked.

Grokmenow 11:22, 18 October 2007 (UTC)

## Mercer's condition

I find this statement of Mercer's theorem very vague. Specifically, what is the space that the kernel operates in, and what is the higher-dimensional space? I would think it would be easy for anyone who knows this theorem to state the answers. Michael Hardy 22:29, 24 Aug 2003 (UTC)

After the last round of edits, the article is now factually incorrect. It is not that the kernel is decomposed into eigenfunctions. It is that the kernel is decomposed into a single dot product $K(x,y) = \phi(x) \cdot \phi(y)$. The new feature space $F$ is defined by the map $\phi: X \to F$. $\phi$ may be a non-linear function of $x$, and $F$ may have a much higher (or infinite) dimension than $X$.

The whole point is that linear algorithms can operate in $\phi$ space without actually computing any values in $F$, since $F$ may have very high (or even infinite) dimension.
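A small numerical sketch of the decomposition described above (the degree-2 homogeneous polynomial kernel and the sample points are just one illustrative choice, not taken from the article):

```python
import numpy as np

# Homogeneous polynomial kernel of degree 2 on R^2:
#   K(x, y) = (x . y)^2
def kernel(x, y):
    return np.dot(x, y) ** 2

# An explicit feature map phi: R^2 -> R^3 with K(x, y) = phi(x) . phi(y):
#   phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = kernel(x, y)            # computed entirely in the original 2-D space
rhs = np.dot(phi(x), phi(y))  # computed in the 3-D feature space F
assert np.isclose(lhs, rhs)   # same number, but lhs never touched F
```

The same identity is what lets a linear algorithm work implicitly in $F$: wherever it would use a dot product of feature vectors, it substitutes a kernel evaluation in the original space.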

Please see Chapter 4 of Chris Burges' tutorial [1] for a more complete explanation, especially Section 4.1.

In any event, the article is now incorrect and must be changed. I can edit the article, but I'll give you a first chance to correct it. -- hike395 04:05, 26 Aug 2003 (UTC)

P.S. Notice in Burges' tutorial that Support Vector Machine is capitalized. -- hike395

Went ahead and fixed article -- hike395 04:26, 27 Aug 2003 (UTC)

I edited the article on Mercer's theorem. As stated originally, it was not correct... not every positive definite function is a product of functions. What is true is that a positive definite function, viewed as an integral operator, is the product of two non-negative operators.

This should be cleared up. Good chance for collaboration here between mathematicians and AI people. CSTAR 19:55, 11 May 2004 (UTC)

Well, I think the issue about what the kernel trick actually is should be able to be resolved: positive semi-definite functions $K(x,y)$ are of the form $\langle \phi(x), \phi(y) \rangle$ for some function $\phi$ with values in some Hilbert space $H$. Could we agree to do this? To reiterate the point made on the other page, the fact at this level of generality has nothing to do with Mercer's theorem. CSTAR 19:32, 13 May 2004 (UTC)

I went back and checked Burges' tutorial on SVMs ([2]). Burges is very well-respected and very exacting. His tutorial says that it is Mercer's condition that guarantees that a semi-definite kernel is a dot product in a Hilbert space, not Mercer's theorem. He cites Courant and Hilbert, 1953.
I think things were mixed up between the two terms: condition and theorem. So, let's fix this article to say Mercer's condition, not Mercer's theorem, and then everything will be OK. Right? -- hike395 05:16, 14 May 2004 (UTC)

One more comment: As stated in the main article of the [[Kernel trick]], Mercer's condition needs a measure μ on the space X in order for the inequality stated there to make sense. A simpler condition, which does not involve integrals (and in some cases is equivalent to the integral form of Mercer's condition), is

$\sum_{i,j} K(x_i,x_j) c_i c_j \geq 0$

for any finite sequence $x_1, \dots, x_n$ of elements of $X$ and any sequence $c_1, \dots, c_n$ of real numbers.
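This finite form is easy to check numerically: it says exactly that every Gram matrix $K_{ij} = K(x_i, x_j)$ is positive semi-definite. A quick sketch with the Gaussian (RBF) kernel, which satisfies the condition (the specific points and the value of gamma are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian (RBF) kernel, a standard positive semi-definite kernel
def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Any finite set x_1, ..., x_n gives a Gram matrix K_ij = K(x_i, x_j)
X = rng.normal(size=(6, 3))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# The displayed condition says c^T K c >= 0 for every real vector c,
# i.e. the Gram matrix is positive semi-definite: all eigenvalues >= 0.
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10   # nonnegative up to floating-point error
```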

Do you really need the integral form of Mercer's condition? Also I think it would be a lot clearer to mathematicians if you said:

• φ is a function with values in H

• φ is a function with image in H
I agree. Also changed image to range, because the word linked to the article function range. -- hike395 12:56, 14 May 2004 (UTC)

In my opinion the whole statement of Mercer's condition here could be replaced with a simpler and more germane statement of what it means for a kernel function to be symmetric positive semidefinite:

$K(x_i,x_i) \ge 0$ for all nonzero $x_i \in X$
$K(x_i,x_j) = K(x_j,x_i)$ for all $x_i, x_j \in X$

Kenahoo 18:53, 15 September 2006 (UTC)

Seconded. I'm confused by the current statement of Mercer's condition. As it stands the condition requires the inequality to hold for "any finite sequence of x... and sequence c... of real numbers". Do the cs depend on the xs? Unless I'm missing the point (which is quite possible), can't I always choose some sequence of cs given some sequence of xs with odd-length that will make the sum negative by simply setting the last ci = -ci? -- MDReid 01:10, 5 January 2007 (UTC)
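For what it's worth, the $c$s in the condition range over all real sequences, independently of the $x$s, and for a genuinely positive semi-definite kernel no choice of signs can make the sum negative. A sketch with the linear kernel, where the quadratic form is a squared norm (the random points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear kernel K(x, y) = <x, y>. For any reals c_1, ..., c_n,
#   sum_ij c_i c_j <x_i, x_j> = || sum_i c_i x_i ||^2 >= 0,
# so flipping the sign of the last c_i cannot make the sum negative.
X = rng.normal(size=(5, 4))   # five points x_i in R^4
K = X @ X.T                   # Gram matrix of the linear kernel

for _ in range(1000):
    c = rng.normal(size=5)
    c[-1] = -c[-1]            # flip the last sign, as in the question above
    q = c @ K @ c
    assert q >= -1e-8         # the quadratic form stays nonnegative
```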

## Intro typo?

This sounds wrong:

by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space.

Should it say this:

by mapping the original observations into a higher-dimensional linear space so that linear classification in the new space is equivalent to non-linear classification in the original space.

or is it correct as is?

— BenFrantzDale

Yes, I believe you are correct. Alternatively we could just remove the word non-linear altogether:

by mapping the original observations into a higher-dimensional space so that linear classification in the new space is equivalent to non-linear classification in the original space.

Kenahoo 18:41, 15 September 2006 (UTC)
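The proposed wording can be illustrated concretely: a 1-D problem that no linear classifier can solve becomes linearly separable after mapping into a higher-dimensional space. The data and the map $\phi(x) = (x, x^2)$ below are an illustrative choice, not from the article:

```python
import numpy as np

# 1-D points: the negative class sits between the positives, so no
# threshold on x (i.e. no linear classifier in 1-D) separates the classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

def threshold_separates(t):
    s = np.sign(x - t)        # a 1-D linear classifier, up to orientation
    return np.array_equal(s, y) or np.array_equal(s, -y)

assert not any(threshold_separates(t) for t in np.linspace(-4, 4, 801))

# Map into a higher-dimensional space with phi(x) = (x, x^2): the image
# lies on a parabola, and the line x2 = 2 now separates the classes.
Z = np.column_stack([x, x ** 2])
assert all(Z[y == 1, 1] > 2) and all(Z[y == -1, 1] < 2)
```

Linear separation in the new space corresponds to the non-linear rule $x^2 > 2$ back in the original space, which is the equivalence the sentence is describing.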

## Origin of "kernel trick" possibly from Japanese?

This usage sounds so awkward and meaningless in English that I wonder if it actually came from a Japanese mathematician trying to deal with the Japanese word コツ (kotsu). That's actually a rather broad term, but could have been used in the sense of "knack" or "technique", and this is a word coinage method that could be used in Japanese. I feel that an English speaker (following English word creation rules) would naturally try to coin a term based on some meaningful adjective. From the article, I'd have guessed "kernel linearizing technique" or perhaps an acronym. Failing that, a linkage to the name of the creator...

Shanen (talk) 04:08, 25 December 2008 (UTC)

"kernel trick" is often how it is referred to in the literature, it's pretty standard usage. I can't find the original author online, but over 40,000 papers found on google scholar have used it and it has been used since around 1995. Dreamer.redeemer (talk) 07:05, 11 October 2009 (UTC)

About 4,300, actually: [3] -- Borb (talk) 00:53, 5 December 2009 (UTC)

I love that a 'citation needed' tag has been put on a statement that basically means 'There is no citable evidence as to where the term kernel trick came from'! Who would the cited person cite? —Preceding unsigned comment added by 128.40.255.154 (talk) 14:02, 7 December 2009 (UTC)

## MDS

I believe there should be a mention of multidimensional scaling in this article, because the "kernel trick" is basically a form of MDS. Perhaps the similarity could be mentioned in the introduction of the article, and a link to the MDS page added at the end. —Preceding unsigned comment added by 156.56.92.123 (talk) 19:08, 24 February 2008 (UTC)
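The connection being suggested: classical MDS double-centers a squared-distance matrix, and the result is exactly a (centered) linear-kernel Gram matrix, whose eigendecomposition recovers the configuration. A sketch under that reading (the random points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))                 # original points

# Squared Euclidean distances between all pairs
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# Classical MDS: double-center D to get a centered Gram (kernel) matrix
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
B = -0.5 * J @ D @ J                        # equals (JX)(JX)^T, a linear-kernel Gram matrix

Xc = J @ X
assert np.allclose(B, Xc @ Xc.T)

# An embedding built from the top eigenvectors of B reproduces the distances
w, V = np.linalg.eigh(B)
w, V = w[::-1][:3], V[:, ::-1][:, :3]       # top 3 eigenpairs
Y = V * np.sqrt(np.clip(w, 0, None))
D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
assert np.allclose(D, D2)
```

So MDS works entirely from a Gram matrix, much as kernel methods do; the kernel trick additionally replaces that Gram matrix with a non-linear kernel.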

## Examples of Kernel functions?

Shouldn't examples of kernel functions be added to the article? The article says how they are defined to be inner products in other spaces but does not give any example, not even the linear kernel. A quick search on Google reveals this page, which seems to have a lot of functions: Kernel Functions for Machine Learning Applications. Though, as the author says, he can't guarantee the correctness of all the functions, most of them seem to be linked to their original sources. Perhaps this could be added as an external link?

I just added the article polynomial kernel. Feel free to add other kernels, either here or in a separate article. Qwertyus (talk) 15:33, 12 November 2012 (UTC)

## Examples, etc

I second the call for one or more examples, ideally illustrating non-separable problems becoming separable. Perhaps a simple example, plus a second one where the computation of $\varphi()$ is complex but avoidable via the kernel trick. This would improve both clarity and motivation. The article as it currently stands is quite abstract and not quickly understandable. RVS (talk) 23:37, 8 December 2010 (UTC)
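On the second kind of example: for the inhomogeneous polynomial kernel, the size of the explicit feature space grows combinatorially while the kernel stays cheap. A rough sketch of that point (the dimension counts use the standard identity for this kernel; the parameter choices are illustrative):

```python
import numpy as np
from math import comb

# The inhomogeneous polynomial kernel K(x, y) = (x . y + 1)^d on R^n
# corresponds to an explicit feature map phi of dimension C(n + d, d).
dims = {(n, d): comb(n + d, d) for (n, d) in [(100, 2), (100, 5), (1000, 10)]}

# For n = 1000, d = 10 this is roughly 3e23 coordinates, far too many
# to ever write down, yet the kernel itself is one O(n) dot product:
rng = np.random.default_rng(5)
x, y = rng.normal(size=1000), rng.normal(size=1000)
k = (np.dot(x, y) + 1.0) ** 10   # evaluated without ever forming phi(x)

assert dims[(1000, 10)] > 10 ** 23
```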

200.148.50.81 (talk) 02:31, 10 September 2010 (UTC)

## language flow

Some more transition sentences are needed, and possibly more sections. Details from other articles, like positive semi-definiteness, the linear-algebra kernel, and Mercer's theorem, are mixed in without due lead-in.

It seems like this was pasted in bits from a more thorough paper or something. Crasshopper (talk) 21:27, 28 January 2011 (UTC)

## K remains undefined.

The definition in the second paragraph is incomplete and confusing. I tried to fix it, but may have rendered it incorrect, as I have no reference. This article needs a definition. Everything else is superfluous. What is a kernel function? How is it different from a kernel in linear algebra? This article doesn't tell you. Fail. Fyedernoggersnodden (talk) 06:31, 19 March 2011 (UTC)