Talk:Support vector machine

This article is of interest to the following WikiProjects:

WikiProject Robotics (Rated B-class, Low-importance)
Support vector machine is within the scope of WikiProject Robotics, which aims to build a comprehensive and detailed guide to Robotics on Wikipedia. If you would like to participate, you can choose to edit this article, or visit the project page (Talk), where you can join the project and see a list of open tasks.

WikiProject Statistics (Rated B-class, Mid-importance)
This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Computer science (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

Remove the remark on Lagrange multipliers temptation

In the "Primal form" section, the article remarks that "One could be tempted to express the previous problem by means of non-negative Lagrange multipliers alpha as min_{w,b,alpha} ...". This is absolute nonsense: why would one be tempted to do this? The standard way (well known in convex optimization) of constructing the dual is as min_{w,b} max_{alpha>=0}, as correctly written later. Therefore, this remark should be deleted. -- (talk) 10:54, 12 May 2010 (UTC)

I second that point. What would lead someone to incorrectly encode the constraint as a hard penalty that way? The subsequent form shown is the correct one. There is no "temptation" to use any other unless you don't understand what you're doing. Elvenhack (talk) 18:00, 19 October 2011 (UTC)

See Also

The "see also" section is, I think, too empty. I think people may want to see other information on margin classifiers/modern inference techniques. I think, at the very least, Supervised learning, Machine learning, Linear classifier, and Boosting should be included. I didn't want to modify the page myself though, as I wanted to make sure there wasn't some previous consensus against this. --IlyaV (talk) 22:34, 6 April 2009 (UTC)

dual formulation, constraint on kernel-functions

I think it should be mentioned that the kernel functions have to be positive semidefinite. I would like to add something like the following below the list of possible kernels: "Not every function k(\mathbf{x}_i, \mathbf{x}_j) is usable as a kernel function. If k is not positive (semi)definite, the Lagrangian function \tilde{L}(\mathbf{\alpha}) is not necessarily bounded below, so that the maximization is not well defined." However, I'm not sure whether this argument is correct. It is basically taken from "Pattern Recognition and Machine Learning" by Christopher M. Bishop, ch. 7.1, p. 329, and I have trouble with the "bounded below". The original says: "The kernel formulation also makes clear the role of the constraint that the kernel function be positive definite, because this ensures the Lagrangian function \tilde{L}(\mathbf{\alpha}) is bounded below, giving rise to a well-defined optimization problem". \tilde{L}(\mathbf{\alpha}) is to be maximized, so it should be bounded from ABOVE. Since k(x_i,x_j) can become arbitrarily large, so can the Sum_{i,j}, so that \tilde{L}(\mathbf{\alpha}) is not bounded from below. I haven't found this in the errata, so I'm not sure whether I have made a mistake or whether there really is a mistake in the book. NikolasE (talk) 22:07, 16 January 2010 (UTC)
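For anyone who wants to check the positive-semidefiniteness condition numerically, here is a minimal sketch, not a definitive test. It builds the Gram matrix for a Gaussian (RBF) kernel and for a symmetric function that is not a valid kernel, then looks at the smallest eigenvalue. The helper names, the sample points, and the counterexample function are all assumptions made for illustration. When the Gram matrix is positive semidefinite, the quadratic term of the dual is concave in alpha, which is what makes the maximization well behaved.

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def rbf(x, z, gamma=0.5):
    # Gaussian kernel; gamma is an arbitrary illustrative value.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def neg_sq_dist(x, z):
    # Symmetric, but NOT positive semidefinite (hypothetical counterexample).
    return -np.sum((x - z) ** 2)

# Four distinct sample points, chosen by hand.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])

# eigvalsh is meant for symmetric matrices and returns real eigenvalues.
min_eig_rbf = np.linalg.eigvalsh(gram(rbf, X)).min()
min_eig_bad = np.linalg.eigvalsh(gram(neg_sq_dist, X)).min()

print("RBF Gram matrix PSD:", min_eig_rbf >= 0)
print("negative squared distance has a negative eigenvalue:", min_eig_bad < 0)
```

The second Gram matrix has a zero diagonal and is not the zero matrix, so its eigenvalues sum to zero and at least one of them must be negative.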

Discriminative Models

Why are SVMs known as discriminative models? They don't model the posterior probability. —Preceding unsigned comment added by (talk) 12:20, 20 October 2008 (UTC)

How else would you interpret the distance from the hyperplane (given by the output of the SVM; you can use a sigmoid function to obtain values between 0 and 1) than as a posterior probability? 10:36, 20 November 2008 (UTC) —Preceding unsigned comment added by (talk)
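To make the sigmoid idea above concrete, here is a rough sketch of squashing a signed decision value into (0, 1). The function name and the parameters a and b are hypothetical. In practice they would have to be fitted on held-out data (this is the idea behind Platt scaling), and the result is a calibrated score, not a posterior that the SVM itself models.

```python
import math

def sigmoid_score(f, a=1.0, b=0.0):
    """Map a signed decision value f to (0, 1) with a logistic function.

    a and b are hypothetical calibration parameters; the defaults are
    placeholders, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(-(a * f + b)))

# A point on the hyperplane (f = 0) gets 0.5 with the default parameters;
# points further on the positive side get scores closer to 1.
for f in (-2.0, 0.0, 2.0):
    print(f, sigmoid_score(f))
```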

Relation to Kernel Machines

Can someone explain the relation of SVM to KM?--Adoniscik (talk) 06:18, 2 January 2008 (UTC)


The second reference (van der Walt & Barnard) is a bad paper. This is a conference paper that was not peer reviewed. Maybe remove the reference (or replace it with a decent reference). —Preceding unsigned comment added by (talk) 22:59, 6 November 2007 (UTC)

Hey - I don't think that kernel PCA should redirect here? Sort of like redirecting "dogs" to "Animals". It probably needs its own article.-- 09:07, 26 July 2006 (UTC)

Indeed, kernel PCA should not redirect here - KPCA and SVM have only in common the kernel trick, but one is a feature extractor, the other is a classifier. KPCA is in fact closer to spectral clustering than it is to the SVM. I'll try to write something about KPCA some time soon.

Hello! I have a comment about the first sentence, which says "A SVM is a <blank>", and <blank> has been either "statistical classification model" or "supervised learning method" lately. I'm in favor of the former because it situates SVM in a very large group of related methods from both conventional statistics and machine learning. "Supervised learning" is deficient on two counts -- s.l. also includes regression as well as classification, and it suggests only a link to machine learning and not conventional statistics. So I'd like to hear what other people have to say. Happy editing, Wile E. Heresiarch 02:35, 29 Apr 2004 (UTC)

Yes, indeed, classification is more specific than supervised learning. However, there are both classification and regression forms of a support vector machine. The latter is often called Support Vector Regression (SVR). I added a discussion of SVR to this article, although it is difficult for laypeople to understand. Anyway, I think that supervised learning is a more accurate description. The article supervised learning is much less stubby than classification, too.
To me, machine learning is statistics, so I don't have a preference for "supervised learning" over "classification" on that basis. -- hike395 04:46, 29 Apr 2004 (UTC)

Hmm. Not quite the way I'd put it. However, I haven't got my references on me at the moment, so I can't come up with a precise description. --

Needs to flesh out mathematics more

As someone who is taking Calc III (and has studied a bit of linear algebra), I feel very lost in some parts of the article. For example, what does w (dot) x + b mean in terms of the parameters? How do Lagrange multipliers play a role in this? (I know what they are, and how they are used in terms of Calc, but NOT in terms of SVMs.)

What's REALLY unclear for me, though, is how to actually FIND the hyperplane separating the points. Perhaps I haven't done enough reading on machine learning, but I'd personally be very grateful for anyone who could flesh this out enough so that someone with a good foundation of calc and a minimal foundation of linear algebra could understand the concept of how to solve it better. Nothing I've seen covers this subject well enough that I am able to comprehend it conceptually. This should be fleshed out to the point where it's easily possible to write pseudocode from the ideas presented. —Preceding unsigned comment added by (talk) 03:08, 26 November 2007 (UTC)

To be fair I don't think the article is unclear about how to find the hyperplane. It clearly states that this can be done by solving the Quadratic Programming (QP) problem, and even sets up the functional and constraints to be used in solving the QP. Unfortunately, this well-defined problem is sufficiently complicated that it does not belong in this article in the form of pseudocode. Pupdike (talk) 17:23, 21 March 2009 (UTC)
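For readers who still want something they can run, here is a minimal sketch of finding a separating hyperplane. It deliberately does NOT solve the QP described in the article. Instead it swaps in plain subgradient descent on the soft-margin primal objective lam/2 * ||w||^2 + mean(max(0, 1 - y_i (w . x_i - b))), which is easier to follow with only calculus. The function name, the step size, the regularization weight, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.01, n_iter=5000):
    """Full-batch subgradient descent on the soft-margin primal objective.

    X: (n, p) inputs, y: labels in {-1, +1}. Uses the sign convention
    w . x - b for the decision value. lam, lr and n_iter are illustrative.
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w - b)
        active = margins < 1  # points that violate the margin
        # Subgradient of the regularizer plus the averaged hinge terms.
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data, made up for this sketch.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b = fit_linear_svm(X, y)
predictions = np.sign(X @ w - b)
print("w =", w, "b =", b)
print("all training points classified correctly:", np.array_equal(predictions, y))
```

A QP solver or SMO would reach the exact maximum-margin solution; this loop only hovers near it, which is enough to see how w and b are driven by the points that violate the margin.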

Non-linear? and origin of machine term?

Do SVMs have to be non-linear? I thought they could be either linear or non-linear. -- Oliver PEREIRA 13:18 Jan 26, 2003 (UTC)

No. See if my new edit makes you happier. By the way, does anyone know why they're called "machines"? Nobody seems to call older machine learning techniques (e.g. the perceptron, feed-forward networks, etc.) "machines". --Ryguasu 00:13 Apr 2, 2003 (UTC)

According to Vapnik (The Nature Of Statistical Learning Theory, p. 133) they are non-linear: "The Support Vector (SV) machine implements the following idea: it maps the input vectors x into a high-dimensional feature space Z through some nonlinear mapping, chosen a priori. In this space, an Optimal separating hyperplane is constructed." To be strict, a so-called linear SVM is an optimal hyperplane (it has support vectors, but is not a support vector machine), although many authors ignore this. --knl 15:57, 28 Aug 2004 (UTC)
The mapping of the feature vectors into the feature space (possibly one of a higher dimension) only makes sense if it is a nonlinear one. If it were linear, no additional information could be gained by this step, because the feature vectors would still have the same relations, only modified by some scalar factor. So choosing a linear mapping would make no sense and would only add to the computational expense. Hence, any but the most basic support vector machines, which are in fact called linear support vector machines, are nonlinear at a very fundamental level. —The preceding unsigned comment was added by (talk) 01:58, 12 March 2007 (UTC).
I learned that the reasoning behind the name support vector machine is the following: The support vectors, that is, the samples along the hyperplanes, are used to generate/solve for the maximum margin hyperplane between the two (i.e. the positive example support vector and the negative example support vector) support vectors. This maximum margin hyperplane (characterized by w and b) is the machine which returns 1 or -1 (or 0 if the test point/vector is right on the SVM) when given positive and negative points/vectors. So, machine in this case means "something that returns 1 (true) or -1 (false)" for the purposes of classification. Not sure how to find a reference for this though. --Rajah 16:27, 20 February 2007 (UTC)
Another useful interpretation of the name is the analogy in Burges' paper "A Tutorial on Support Vector Machines for Pattern Recognition". If you treat the decision surface in feature space as a rigid sheet, and suppose each support vector exerts a force alpha (the value of the dual variable associated with it) perpendicularly against it (say by a rod from the decision sheet to the support vector (in feature space)) then at optimality the forces and torques associated all cancel out. So basically the margin is "supported" by these forces exerted by the "support vectors" and subsequently lies at a position of mechanical stability. This is a nice justification for the name support vector and also, being rather mechanistic, makes machine seem less of a stretch. Svm slave 12:25, 24 February 2007 (UTC)

Fast and lightweight? As compared to what? (Sitting in front of machine that's spent 3 days on a linearly separable dataset C4.5 takes a few minutes to chomp through). User:Iwnbap

Don't forget Sequential Minimal Optimization (SMO) [1]

As someone who doesn't know SVM very well yet, I say this article could really use a picture. The article is technically correct, but not enlightening to a newbie. Someone drew me a picture with two classes (+ and -), some separating planes (i.e., a 2-d example) and explained that many cases not near the planes are often thrown away. I thought it helped a lot. If I knew SVM well enough, I would upload and label the picture. Maybe I will if I learn more. dfrankow 03:54, 3 March 2006 (UTC)

Hope the picture I added makes it more clear AnAj 19:23, 15 June 2006 (UTC)
I can not see the picture in the article. Is the link wrong? —The preceding unsigned comment was added by (talkcontribs) .
Apparently there was something wrong with the thumbnail cache. I purged it and now it shows correctly on all thumbnail sizes at least for me. AnAj 09:04, 13 August 2006 (UTC)

How to add a Loss-Matrix to SVM

Maybe somebody can point out how to incorporate a loss matrix into the SVM framework. I've been looking for this information in various places and I think this would add some great value to the article.

I'm not quite sure in what sense you refer to the loss matrix. Loss matrices are usually used in Bayesian learning as far as I know, a field I wouldn't put the SVM in. The basic SVM is defined for linearly separable data, so there is no misclassification on the training data. Here are some suggestions for how to "prefer" one class over the other, but I'm not sure whether these are correct: Of course you could move the separating hyperplane somewhat closer toward the less preferred class by adjusting the threshold value b in the classifier; this should be pretty simple. The orientation of the hyperplane should stay the same anyway. I suppose you should get the same result by using different thresholds c and d in the inequality constraints that ensure the correct classification of the training data. The ratio of c and d then describes the margin ratio on either side of the hyperplane. This way you could incorporate the preference into the soft-margin SVM as well. The slack variables have to be greater for the preferred class to have a misclassification, and as the sum of the slack variables is part of the minimization problem, the "preferred" class should have less slack and the orientation of the hyperplane might indeed change. Anybody willing to check if this is true? -- 14:04, 28 October 2007 (UTC)
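One common way to express a class preference in the soft-margin setting is to give each class its own penalty on the slack variables. Note that this is a different device from the thresholds c and d suggested above, and the sketch below only evaluates such an objective; it does not train anything. The function name, the parameters c_pos and c_neg, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def weighted_soft_margin_objective(w, b, X, y, c_pos=1.0, c_neg=1.0):
    """0.5 * ||w||^2 + sum_i C_{y_i} * slack_i, with a per-class penalty.

    slack_i = max(0, 1 - y_i * (w . x_i - b)). c_pos and c_neg are
    hypothetical per-class costs (a two-class stand-in for a loss matrix).
    """
    slack = np.maximum(0.0, 1.0 - y * (X @ w - b))
    cost = np.where(y > 0, c_pos, c_neg)
    return 0.5 * (w @ w) + np.sum(cost * slack)

# Two well-placed points and one negative point on the wrong side.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

# Only the third point has slack (1.5); its cost depends on c_neg.
equal_cost = weighted_soft_margin_objective(w, b, X, y)
neg_costly = weighted_soft_margin_objective(w, b, X, y, c_neg=5.0)
print(equal_cost, neg_costly)  # 2.0 8.0
```

Minimizing such an objective with a larger c_neg would push the hyperplane away from the negative class, and, as guessed above, the orientation of the hyperplane can change as well.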

Need more practically USEFUL info.

Article is too academically minded. There should be a paragraph on real-world applications, like how SVMs and logistic regression are being used nowadays in filtering e-mail spam messages. How and why? Thanks in advance! —The preceding unsigned comment was added by (talk) 15:54, 18 January 2007 (UTC).

Hmm, I think it would be nice to have some real-world examples where SVM beats every competitor (should be easy to find). Maybe a good idea would be to create some images with the different cases that arise on real-world data (e.g. linearly separable, linearly separable with soft margin, separable with RBF kernel, and separable with RBF kernel and soft margin...) to clarify why there are different extensions of the original linear formulation.--Cyc 12:25, 5 February 2007 (UTC)

Not going to excise comments from a talk page and I agree that examples are useful. But the comment above is totally wrong.... The law of conservation of generalisation means that for every case where SVMs are better at classification than another method there is a corresponding case where they are worse. It's a fundamental principle of machine learning (however counter-intuitive it seems). Outside the world of the theoretical, there are many applications where SVM's are not the best approach (and I use SVMs in industry). It all depends on the data and the application. I'd agree it's worth including the RBF kernel as in my experience it's probably used more frequently in the real world than linear SVM's as data is rarely linearly separable in n-dimensions. A brief explanation and a link to Kernel methods is probably enough. Maybe I'll do it myself if I get time.... 09:13, 28 February 2007 (UTC)

Totally wrong? Didn't claim anything that contradicts the No-Free-Lunch theorem. Cyc 23:58, 4 April 2007 (UTC)

This should tie into least-squares

SVMs and SVRs are a simple application of least squares which has been around a lot longer than either of these two. People from many disciplines have never heard of machine learning or SVM/Rs but they have heard of least squares. Even the non-linear adjustment is standard for least-squares polynomial fitting. The standard least squares is just SVR and SVM does a similar trick while trying to fit a hyperplane BETWEEN the data rather than fit a line ON the data.

I'm not sure this is strictly true (though admittedly while SVMs are my specialty I know relatively little about advanced least-squares algorithms). I agree that least-squares SVMs (LS-SVRs and LS-SVMs as per Suykens' book) implement a regularised least-squares technique, but standard SVRs (and SVM classifiers) implement a cost function which is linear, not quadratic (specifically, SVRs usually use Vapnik's epsilon-insensitive cost). Actually, it might be good to cover least-squares SVMs and maybe mention (briefly, the details might be a bit excessive for now) very general extensions like Smola, Scholkopf and Muller's paper "General Cost Functions for Support Vector Regression". Svm slave 12:34, 24 February 2007 (UTC)
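The difference between the two cost functions mentioned above is easy to see side by side. This is only an illustrative sketch; the function names, the value of eps, and the residuals are made up.

```python
import numpy as np

def eps_insensitive(residual, eps=0.1):
    """Vapnik's epsilon-insensitive cost: zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(residual) - eps)

def squared(residual):
    """Least-squares cost: quadratic everywhere."""
    return residual ** 2

# Residuals (prediction minus target) chosen by hand.
r = np.array([-0.05, 0.3, -2.0])
print(eps_insensitive(r))  # [0.  0.2 1.9]
print(squared(r))          # [0.0025 0.09   4.    ]
```

Small residuals cost nothing under the epsilon-insensitive cost, and large ones grow only linearly, so outliers weigh less than under the squared cost.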
The loss functions employed by least-square and SVM are different. There is no performance guarantee (generalization error bound) for the traditional least-square, unlike the VC-bound for the SVM. Note that recent papers of Bouquet and co. provide also bounds for regularized least-square. Further, the kernel trick employed by SVM is one step beyond that of the traditional polynomial (non-linear) fitting using least-square.
The dimensionality of the polynomial fitting problem grows exponentially with respect to the degree of polynomial.
In the case of kernel trick, the dimensionality of the problem is bounded by the number of training examples (i.e. dimension of kernel matrix) Jung dalglish (talk) 03:24, 16 February 2010 (UTC)
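To put rough numbers on the dimensionality argument: the number of monomials of degree at most d in p input variables is the binomial coefficient C(p + d, d), whereas the kernel matrix is always n x n for n training examples. The sketch below just evaluates that count; the function name and the sample values of p and d are assumptions.

```python
from math import comb

def poly_feature_dim(p, d):
    """Number of monomials of total degree <= d in p variables: C(p + d, d)."""
    return comb(p + d, d)

# Explicit polynomial fitting works in this many dimensions...
for p, d in [(2, 2), (10, 3), (100, 3)]:
    print(f"p={p}, d={d}: {poly_feature_dim(p, d)} features")

# ...while the kernel formulation only ever needs an n-by-n Gram matrix,
# whatever the degree of the polynomial kernel.
```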
Why don't you try to write something in the main article in order to clarify these points? --Pot (talk) 21:31, 17 February 2010 (UTC)

Secure Virtual Machine (SVM) Technology?

Is there a wiki page for this article, I typed in SVM and got here. I also typed in all the other obvious things and think that this term does not have a wiki article! TonyMartin 19:36, 06 March 2007 (UTC)

As far as I know, there is no such article on the English-language Wikipedia. There is however a short article on the German-language Wikipedia. Feel free to add Secure Virtual Machine to the list of requested articles for computer science, computing, and Internet. (Or even better, start that article yourself if you feel knowledgeable about this topic). — Tobias Bergemann 08:37, 7 March 2007 (UTC)

Addendum: Your signature appears to be broken, as the wiki link leads to a user page for a non-existing user "TonyMartin" instead of User:Tonymartin234567. — Tobias Bergemann 08:40, 7 March 2007 (UTC)

The claim "they simultaneously minimize the empirical classification error and maximize the geometric margin" is imprecise, to say the least. More like: maximize the margin, hoping that the error is minimized.

minor cleaning up

I did some cleaning up to the page. 1. In the introduction I added the word 'separating' to the sentence "The hyperplane is the hyperplane that maximises the distance between the two parallel hyperplanes." which I thought needed some clarification. If anyone disagrees with that change, feel free to find a better way of saying that. 2. I removed the "Generalization" section which seemed unneeded. 3. I replaced a sentence in "Motivation" with a simpler sentence including information that had been in the "Generalization" section. 4. In "Formalization" I removed the distinction between statistics and computer science notation for the number of dimensions (p and n, respectively). Since n is used elsewhere as the number of points in the data sample, it seems better to refer to a p-dimensional input space. Hope I wasn't stepping on anyone's toes by making those changes! —The preceding unsigned comment was added by Digfarenough (talkcontribs) 17:46, 8 April 2007 (UTC).

That's a fast bot! Clicked "save page", realized I forgot to sign, came right back, and the bot had already added that. Impressive :) digfarenough (talk) 17:47, 8 April 2007 (UTC)

Suggestions for clarity

I think the article loses its clarity, particularly for someone who isn't already well-versed on SVMs, when you get to the section "Non-linear classification". Up to that point, I felt it was pretty good, and in fact, even the first paragraph of the section is fine. But I can't imagine anyone who isn't already highly knowledgeable about SVMs getting anything out of that second paragraph of that section.

First, there seems to have been a sudden notation change without explanation. Above a training vector is denoted \mathbf{x_i}, but then in the kernel examples suddenly we have \mathbf{x} and \mathbf{x'}. It looks like a vector transpose, which doesn't make any sense, but adds to the confusion. Why not: k(\mathbf{x_i},\mathbf{x_j})? That would be more consistent with the earlier sections.

Next, the relationship between a kernel and a feature space hasn't been introduced. The earlier article leads the reader to focus on the primal form, so it needs to be explicitly explained that the dot product \mathbf{x_i \cdot x_j} in the dual form is being replaced by the kernel function, k(\mathbf{x_i},\mathbf{x_j}), and that the foundation for solving these isn't destroyed by this non-linear substitution.

Since it is such a key conceptual part of SVMs, I think the decomposition of the kernel function on a higher dimensional space needs to be explained here. The part about the infinite-dimensional Hilbert space is not interpretable until this key concept is understood. Also the idea that the solution can be found without even knowing what \Phi(\mathbf{x_i}) is, or what its dimensionality is, is a really hard thing to grasp and a key concept that is not mentioned here. Transforming data through a pre-defined \Phi(\mathbf{x_i}) before applying the algorithm is not particularly interesting, but the idea that we're doing this with the kernel function in the dual space, possibly on an implied super-high-dimensional primary space, without a huge performance hit, seems like the defining characteristic of an SVM, but is not at all clear from this article.
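The point about never needing \Phi explicitly can be illustrated with a small sketch. For the homogeneous degree-2 polynomial kernel on 2-dimensional inputs, k(x, z) = (x . z)^2, one explicit feature map is known in closed form, so both routes can be compared directly. The names phi and k and the sample vectors are assumptions for illustration.

```python
import numpy as np

def phi(x):
    """One explicit feature map for k(x, z) = (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def k(x, z):
    """The same inner product, computed without ever forming phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(phi(x) @ phi(z))  # dot product in the 3-D feature space
implicit = k(x, z)                 # one dot product in the 2-D input space
print(explicit, implicit)          # both are 1.0, up to rounding
```

For kernels such as the Gaussian one, no finite-dimensional phi exists, yet the right-hand computation stays exactly as cheap, which is why the dual form only ever asks for kernel values.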

Next, the article has never told the reader whether the selection of the non-linear mapping (kernel function) is part of the algorithm or is made before the algorithm runs. In addition, one is left wondering whether parameters of the kernel function (such as "d" in the polynomial kernels) are selected in advance or optimized by the algorithm. (The form and the parameters of the kernel are always fixed in advance, right?).

Anyway, I hope these suggestions are useful for cleaning this up. I'm a bit reluctant to make relevant changes to the article until they've been aired here. Thanks.

I do like the absence of convergence theory (e.g., VC dimensions) and details on techniques for solving these. I don't think those belong in the wikipedia introduction to SVMs, so that is a good attribute of the article as it exists.

These seem like spot-on observations to me. I encourage you to make the relevant changes. (talk) 01:25, 25 February 2008 (UTC)
Agreed, anything to make this article more understandable would be most appreciated. I found a lack of descriptions for the symbols really unhelpful, and this was compounded by the change in symbol used for each concept. -- (talk) 11:37, 20 September 2010 (UTC)

What does the "Support" in "Support Vector" mean?

I know what a support vector is, but I don't know why we call it "support vector"? Does the vector "support" something? The hyperplane? Then, what does it "support"? Does anyone know this? Welcome to discuss this with

The physical interpretation is explained in Burges tutorial 3.6 page 15

They are called this because they "hold up" the separating plane

AIMA/2E, pg.751

--Adoniscik (talk) 06:43, 2 January 2008 (UTC)

Thanks all for your help! I have found the answer right in Burges' tutorial.

History of SVM begins in the 1960s with Tom Cover

Tom Cover developed the theory of support vectors in his 1964 thesis, which was published as "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition," IEEE Trans. Elec. Comp, EC-14, 326-334. This paper has many deep insights on the role of dimension. Spmeyn 12:23, 20 October 2007 (UTC)


- add explanation why ||w||² is used (done)
- remove scaling as it is a common technique in machine learning (done)
- link primal to dual. Emphasize role of dot product!
--Cyc (talk) 15:17, 27 January 2008 (UTC)

They are really good suggestions! —Preceding unsigned comment added by Visame (talkcontribs) 04:05, 30 January 2008 (UTC)

delete erroneous image

The image showing 3 hyperplanes suggests that L2 is the optimal one. This is clearly wrong. It should therefore be replaced by a corrected version.
I suggest the following improvements:
- svg as image format
- smaller circles, filled with white and black
- one line L1 that doesn't separate the data points, one that does but with a small margin, and the maximum-margin one.
What do you guys think? --Cyc (talk) 09:13, 8 February 2008 (UTC)

Ok, did that. Unfortunately I didn't manage to get my Inkscape svg file to be displayed properly. —Preceding unsigned comment added by Cyc (talkcontribs) 18:06, 16 February 2008 (UTC)

Vector w in second diagram

I think the vector w in the second diagram is misleading. It is merely a vector that is orthogonal to the separating hyperplane, and not a vector that starts from the w.x - b = 0 hyperplane and ends at the w.x - b = 1 hyperplane. Otherwise, readers will be left wondering why the margin of separation is 2 / ||w|| instead of 2 ||w||. --unkx80 (talk) 15:59, 8 March 2008 (UTC)

Yes, you are right. Will fix this. Cyc (talk) 10:05, 11 March 2008 (UTC) Fixed. Thanks for pointing this out. --Cyc (talk) 12:51, 11 March 2008 (UTC)
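The 2 / ||w|| figure from this thread can be checked numerically with a short sketch. It picks the point on each margin hyperplane that lies along w and measures the distance between them. The values of w and b are arbitrary assumptions.

```python
import numpy as np

w = np.array([3.0, 4.0])  # arbitrary normal vector, ||w|| = 5
b = 2.0                   # arbitrary offset
norm_w = np.linalg.norm(w)

# Points along the direction of w that satisfy w . x - b = +1 and -1.
x_plus = (b + 1.0) * w / norm_w**2
x_minus = (b - 1.0) * w / norm_w**2

margin = np.linalg.norm(x_plus - x_minus)
print(margin, 2.0 / norm_w)  # both 0.4, up to rounding
```

The vector joining the two hyperplanes is 2w / ||w||^2, which has length 2 / ||w||; w itself only gives the direction.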

Mistake in second diagram

I think the distance between the hyperplane and the origin should be b/||w|| instead of b.

No, the diagram is correct. —Preceding unsigned comment added by (talk) 12:22, 26 August 2009 (UTC)

In the text it says "The parameter b/||w|| determines the offset of the hyperplane from the origin along the normal vector w." And in the picture it just shows b. I don't know which is right, but don't they contradict each other?

Of course, only b/||w|| is correct. I wonder why the diagram hasn't been updated yet. I would suggest retaining "b", but using it to denote the VERTICAL distance from the origin. —Preceding unsigned comment added by (talk) 15:31, 3 February 2011 (UTC)
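A quick numerical check of the offset, as a sketch with made-up values for w and b: the point of the hyperplane w . x - b = 0 closest to the origin is b w / ||w||^2, and its distance from the origin is |b| / ||w||, not b.

```python
import numpy as np

w = np.array([3.0, 4.0])  # arbitrary normal vector, ||w|| = 5
b = 10.0                  # arbitrary offset

# Closest point of the hyperplane w . x - b = 0 to the origin.
x0 = b * w / (w @ w)
offset = np.linalg.norm(x0)

print(x0)                                # [1.2 1.6]
print(offset, abs(b) / np.linalg.norm(w))  # both 2.0, up to rounding
```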

Ranking SVM

I'd like to add a section on the extension of SVMs to ranking problems. I'm not sure where I ought to put it, however. Thoughts? Martian bob (talk) 01:24, 17 June 2008 (UTC)

Why does SVM appear under categories Neural Networks and Ensemble Learning???

I suggest to remove these categories, these are wrong classifications —Preceding unsigned comment added by (talk) 12:57, 17 June 2008 (UTC)

Done. I am not an expert in this field, but the above observation seems reasonable to me. If anyone knows better, please put back the categories and please explain why. --Pot (talk) 15:54, 25 November 2009 (UTC)

Audience Level

This article isn't a candidate for the Simple English Wikipedia. Still, even granting the math content and a certain inherent level of required knowledge, I think a better presentation, accessible to any math or engineering grad student or senior undergraduate, should be the maximum required level. When I merged the tags, this part was undated. (talk) 21:46, 11 July 2008 (UTC)

Noob-Friendly description?

Why do so many articles describe things in the most complex manner there is? Would it really be that much of a loss of "eliteness" if there was one or two paragraphs in the beginning of the article describing what this is/what it does/what it can be used for? —Preceding unsigned comment added by (talk) 15:28, 6 August 2008 (UTC)

Yeah - It would be nice if someone who understands this writes a few paragraphs as to what it is. I'd like to get an idea at least what this article is about. Marcperkel (talk) 16:41, 5 March 2009 (UTC)
Are you familiar with machine learning/supervised learning? If you are not, then indeed you are not expected to understand this article (that's OK, BTW :) ), and the first sentence can refer you to a more general article. If you are familiar with those concepts, please give more details about what is unclear. —Preceding unsigned comment added by Amit man (talkcontribs) 18:33, 5 March 2009 (UTC)
OK, I can buy that understanding the article requires a level of familiarity with machine learning and/or supervised learning. But it would be really nice to know that requirement from reading the lead paragraph. Even just a prefix like "In mathematics," would help. Heavy Joke (talk) 03:43, 17 November 2009 (UTC)

Trust me, it's explained very well and in very simple language here, relative to my university lecture. —Preceding unsigned comment added by (talk) 15:53, 30 August 2009 (UTC)

what it SUPPORTS?

Hello everyone, I am new to SVM and I am trying hard to understand what it actually SUPPORTS —Preceding unsigned comment added by Kattikantha rao (talkcontribs) 09:42, 2 October 2009 (UTC)

The SVM classifier finds the hyperplane which separates the data. This hyperplane has a margin on both sides, thus it lies in the middle of the margin. The vectors that are closest to the hyperplane lie exactly on the margin. They are called support vectors, since you can now define the hyperplane only by means of these "edge" vectors. Now clear? --Peni (talk) 16:17, 18 October 2009 (UTC)
#What does the "Support" in "Support Vector" mean? :) --Peni (talk) 16:19, 18 October 2009 (UTC)

Oracle also uses SVM for Document Classification

Hello: Oracle has an SVM machine to classify documents, apart from Data Mining. See 6.4.2 SVM-Based Supervised Classification at [2] —Preceding unsigned comment added by (talk) 11:08, 1 March 2010 (UTC)

Too many software links

See Wikipedia:External links - the list of dozens of software implementations of SVM was far too much, totally inappropriate for an encyclopedia. Maybe a couple of brief educational (pseudo)code presentations could be considered. I've removed the software links. I'm being bold --mcld (talk) 14:03, 13 March 2010 (UTC)

I'd rather you didn't do that sort of thing - when I'm looking at something like this on Wikipedia, a list of implementing software is very much part of what I'm looking for. If there are dozens of implementations then the top four or five would be appropriate. —Preceding unsigned comment added by (talk) 11:56, 12 November 2010 (UTC)
I'm in sympathy with removing them. The page should be more a definition and explication of the concept rather than a market survey. Ezrakilty (talk) 18:46, 12 November 2010 (UTC)
Maybe the method has become so widespread now that it is easy to find implementations. In the external links section we have some implementations, that should be enough for a start. --Pot (talk) 22:36, 14 November 2010 (UTC)

Transductive support vector machines

I think the article he is referring to is —Preceding unsigned comment added by (talk) 14:48, 14 December 2010 (UTC)

Error in Formalization section

In the formalization section it says

Minimize (in {\mathbf{w},b})

\|\mathbf{w}\| \,

subject to (for any i = 1, \dots, n)

c_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1. \,

But wouldn't that make w a zero vector every time the dataset is linearly separable? -- (talk) 00:27, 16 October 2010 (UTC)

No. w would be zero only when all c_i's are equal, i.e. all points have the same class. When the points are linearly separable, the length of w is inversely proportional to the margin between the classes (||w|| = 2 / margin). -- X7q (talk) 10:44, 16 October 2010 (UTC)
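The answer above can be checked on the smallest possible example, as a sketch with two made-up points of opposite class. With w = 0 the constraints reduce to -b >= 1 and b >= 1 at the same time, which no b can satisfy, so the minimizer cannot be the zero vector.

```python
import numpy as np

# One point per class (hypothetical data).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
c = np.array([1.0, -1.0])

def feasible(w, b):
    """True if c_i * (w . x_i - b) >= 1 holds for every point."""
    return bool(np.all(c * (X @ w - b) >= 1.0))

# w = 0 is infeasible for every offset b that we try.
zero_w_ever_feasible = any(feasible(np.zeros(2), b)
                           for b in np.linspace(-5.0, 5.0, 101))

# w = (1, 0), b = 0 satisfies both constraints with equality.
unit_w_feasible = feasible(np.array([1.0, 0.0]), 0.0)

print(zero_w_ever_feasible, unit_w_feasible)  # False True
```

Here ||w|| = 1 and the two points are a distance 2 apart, in line with margin = 2 / ||w||.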

Element Spaces and Nonlinear Case

I'm a bit too foreign (and shy) to edit the article itself, but two details made it a bit complicated for me to find the how-to in it:

1. It would help to state clearly which elements are in which space, especially b \in \R.

2. For the linear case there is that beautiful statement that the hyperplane is defined just by a linear combination of the input data, w = \sum \alpha_i y_i x_i, and you have to search for the \alpha_i. In the nonlinear case this clear and short solution is a bit hidden in the middle of a longer sentence, and in the end that is the thing you need for computation. ;-)

So \textstyle \mathbf{w}\cdot\varphi(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) could be much more prominent.

Thanks! — Preceding unsigned comment added by (talk) 16:28, 8 April 2012 (UTC)
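As a small illustration of the expansion asked for above, here is a sketch that evaluates \sum_i \alpha_i y_i k(x_i, x) and, for the linear kernel, compares it with w . x where w = \sum_i \alpha_i y_i x_i. The training points and the alpha values are hypothetical; in a real SVM the alphas come out of the dual optimization.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def decision_value(x, X_train, y_train, alpha, kernel):
    """sum_i alpha_i * y_i * k(x_i, x), i.e. w . phi(x) without forming w."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha, y_train, X_train))

# Hypothetical training set and multipliers.
X_train = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

x_new = np.array([2.0, 3.0])

via_kernel = decision_value(x_new, X_train, y_train, alpha, linear_kernel)

# Linear case only: w can be formed explicitly as sum_i alpha_i y_i x_i.
w = (alpha * y_train) @ X_train
via_w = float(w @ x_new)

print(via_kernel, via_w)  # both 2.0
```

Swapping linear_kernel for any other kernel function leaves decision_value unchanged, which is the whole computation needed in the nonlinear case.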

Hyperplane definition in linear SVM

I read in the article:

These hyperplanes can be described by the equations

\mathbf{w}\cdot\mathbf{x} - b=1\,


\mathbf{w}\cdot\mathbf{x} - b=-1.\,

I wonder why the specific values '-1' and '1' are used here. Wouldn't a more general distance term be clearer? I feel this formulation is a specific case, and also misleads the reader to find a relationship to the class labels '-1' and '1', which have nothing to do with the hyperplanes that this paragraph refers to.

Or am I missing something here?

Elferdo (talk) 15:42, 13 June 2012 (UTC)
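One way to see that the values 1 and -1 cost no generality: if the margin hyperplanes are written as w . x - b = +delta and -delta for any delta > 0, dividing w and b by delta describes exactly the same hyperplanes with right-hand sides +1 and -1. The sketch below checks this on made-up numbers; all values and names are assumptions.

```python
import numpy as np

# A hypothetical parametrization with a general distance term delta.
w = np.array([2.0, 0.0])
b = 4.0
delta = 0.5

# Rescaled parametrization of the same geometry.
w_scaled, b_scaled = w / delta, b / delta

x_on_margin = np.array([2.25, 0.0])  # lies on w . x - b = +delta

before = w @ x_on_margin - b               # 0.5
after = w_scaled @ x_on_margin - b_scaled  # 1.0

# The geometric width of the margin is the same either way.
width_before = 2.0 * delta / np.linalg.norm(w)
width_after = 2.0 / np.linalg.norm(w_scaled)

print(before, after, width_before, width_after)  # 0.5 1.0 0.5 0.5
```

So the 1 and -1 are a normalization of the scale of (w, b) and are unrelated to the class labels, even though they look the same.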

machine vs machines

Should the title be Support vector machines instead of Support vector machine? and make Support vector machine redirect to Support vector machines instead? — Preceding unsigned comment added by (talk) 09:51, 13 November 2012 (UTC)


The paragraph on support vector regression seems copy-pasted from here: (see section 3.2.2). I'm not sure if that's ok or if it should be rewritten. — Preceding unsigned comment added by Ankid (talkcontribs) 19:54, 27 February 2013 (UTC)

Does the solution lie at a saddle point?

The "Primal form" section says that "we look for a saddle point". Is this correct? What is the function that this is a saddle point of? Is it a saddle point of f(w,b)=\max_{\alpha}\frac{1}{2}||w||^2-\sum_i\alpha_i(y_i(w\cdot x_i-b)-1)? It doesn't seem clear that this is even differentiable. The objective function (the norm squared of (w,b)) is positive semi-definite, so I don't think it has any saddle points. Vinzklorthos (talk) 01:37, 7 February 2014 (UTC)
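For reference, here is the standard way this is usually written, as a sketch under the usual hard-margin assumptions and with the sign convention w . x - b. The saddle point is a property of the Lagrangian as a function of (w, b, alpha) jointly, not of f(w, b) on its own: it is a minimum in (w, b) and a maximum in alpha.

```latex
% Lagrangian of the hard-margin primal, with multipliers \alpha_i \ge 0
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\,\|\mathbf{w}\|^2
  - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 \,\bigr]

% Saddle-point condition at (\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*):
L(\mathbf{w}^*, b^*, \boldsymbol{\alpha})
  \;\le\; L(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)
  \;\le\; L(\mathbf{w}, b, \boldsymbol{\alpha}^*)
  \qquad \text{for all } \mathbf{w},\, b,\ \text{and all } \boldsymbol{\alpha} \ge 0

% Stationarity in the primal variables:
\frac{\partial L}{\partial \mathbf{w}} = 0
  \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i ,
\qquad
\frac{\partial L}{\partial b} = 0
  \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
```

The inner maximum over alpha equals \tfrac{1}{2}\|\mathbf{w}\|^2 where all constraints hold and is unbounded where one is violated, so f(w, b) is indeed not differentiable everywhere, while L itself is smooth in all of its arguments.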