What is the intuition behind an SVM with a Gaussian kernel having an infinite-dimensional feature space?
Answer:
This answer explains the following:
Perfect separation is always possible with a Gaussian kernel because of the kernel's locality properties (provided no two points from different classes are exactly identical), which lead to arbitrarily flexible decision boundaries. For a sufficiently small kernel bandwidth, the decision boundary looks as if you simply drew little circles around the points wherever they are needed to separate the positive and negative examples:
(Credit: Andrew Ng's online machine learning course)
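For a concrete feel, here is a minimal sketch with made-up toy data (note that scikit-learn parametrizes the Gaussian kernel by gamma, which plays the role of $1/\sigma^2$ in the notation below, so a small bandwidth corresponds to a large gamma):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: labels depend nonlinearly on the inputs, so they are not linearly separable.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 2))
y = np.sign(np.sin(3 * X[:, 0]) + X[:, 1])
y[y == 0] = 1

# Small bandwidth sigma corresponds to large gamma = 1 / sigma^2 in scikit-learn.
clf = SVC(kernel="rbf", gamma=1000.0, C=1e6)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))   # expected 1.0: every training point separated
```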
So, why does this happen from a mathematical point of view?
Consider the standard setup: you have a Gaussian kernel $K(\mathbf{x}, \mathbf{z}) = \exp(-\|\mathbf{x} - \mathbf{z}\|^2 / \sigma^2)$ and training data $(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})$, where the $y^{(i)}$ values are $\pm 1$. We want to learn a classifier function

$$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}).$$
Now how will we ever assign the weights $w_i$? Do we need an infinite-dimensional space and a quadratic programming algorithm? No, because I only want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|$ between any two training examples, and I simply set $w_i = 1$. As far as the kernel is concerned, all the training points are then a billion sigmas apart, and each point completely controls the sign of $\hat{y}$ in its own neighborhood. Formally, we have

$$\hat{y}(\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)}, \mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = y^{(k)} + \epsilon,$$
where $\epsilon$ is some arbitrarily small value. We know $\epsilon$ is small because $\mathbf{x}^{(k)}$ is a billion sigmas away from every other point, so for all $i \neq k$ we have

$$K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = \exp(-\|\mathbf{x}^{(i)} - \mathbf{x}^{(k)}\|^2 / \sigma^2) \approx 0.$$
Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ clearly has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data.
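If you want to see the argument numerically, here is a small sketch with made-up points (only the formula for $\hat{y}$ comes from the text above; the data and names are just for illustration):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5], [2.0, 2.0]])   # training inputs
y = np.array([1, -1, -1, 1])                                     # labels in {-1, +1}

# Make sigma a billion times smaller than the smallest pairwise separation.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sigma = dists[dists > 0].min() / 1e9

def y_hat(x):
    # y_hat(x) = sum_i w_i * y_i * K(x_i, x), with all weights w_i = 1
    k = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / sigma ** 2)
    return np.sum(y * k)

# At each training point the own-point term dominates, so the sign is always correct.
print([bool(np.sign(y_hat(x)) == yi) for x, yi in zip(X, y)])   # [True, True, True, True]
```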
এটিকে "একটি অসীম মাত্রিক বৈশিষ্ট্য স্থানে নিখুঁত রৈখিক বিভাজন" হিসাবে ব্যাখ্যা করা যায় এমন তথ্য কার্নেল ট্রিক থেকে আসে, যা আপনাকে কার্নেলের একটি অভ্যন্তরীণ পণ্য হিসাবে ব্যাখ্যা করতে দেয় (সম্ভাব্য অসীম-মাত্রিক) বৈশিষ্ট্য স্থান:
where $\Phi$ is the mapping from the data space into the feature space. It follows immediately that the function $\hat{y}(\mathbf{x})$ is a linear function in the feature space:

$$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle \Phi(\mathbf{x}^{(i)}), \Phi(\mathbf{x}) \rangle = L(\Phi(\mathbf{x})),$$
where the linear function $L(\mathbf{v})$ is defined on feature-space vectors $\mathbf{v}$ as

$$L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle \Phi(\mathbf{x}^{(i)}), \mathbf{v} \rangle.$$
This function is linear in $\mathbf{v}$ because it is just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is exactly $L(\mathbf{v}) = 0$, the level set of a linear function. This is the very definition of a hyperplane in the feature space.
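On any finite set of points the abstract feature map can be realized explicitly through a matrix square root of the Gram matrix, which makes the "linear in feature space" claim easy to check numerically. This is only an illustrative sketch with my own toy data and names:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                       # a handful of arbitrary points
y = np.array([1, -1, 1, 1, -1, -1])               # labels
w = np.ones(6)                                    # weights w_i (all 1, as above)
sigma = 1.0

G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / sigma ** 2)   # Gram matrix

# On a finite point set, a feature map with <Phi(x_i), Phi(x_j)> = K(x_i, x_j)
# is given by a matrix square root of the Gram matrix.
vals, vecs = np.linalg.eigh(G)
Phi = vecs * np.sqrt(np.clip(vals, 0, None))      # row i is Phi(x_i)

k = 3                                             # evaluate the classifier at x^(k)
yhat_kernel = np.sum(w * y * G[:, k])             # kernel form: sum_i w_i y_i K(x_i, x_k)
v = Phi.T @ (w * y)                               # v = sum_i w_i y_i Phi(x_i)
yhat_linear = v @ Phi[k]                          # linear form: <v, Phi(x_k)>

print(np.isclose(yhat_kernel, yhat_linear))       # True: same number either way
```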
Note: In this section, the notation $\mathbf{x}^{(i)}$ refers to an arbitrary set of $n$ points, not the training data. This is pure math; the training data does not figure into this section at all!
Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$.
That said, it is possible to write down a formula for $\Phi$. The feature space that $\Phi$ maps to is rather abstract (and potentially infinite-dimensional), but essentially the mapping is just using the kernel to do some simple feature engineering. In terms of the final result, the model you end up learning, using kernels is no different from the traditional feature engineering popularly applied in linear regression and GLM modeling, like taking the log of a positive predictor variable before feeding it into a regression formula. The math is mostly there to ensure that the kernel plays well with the SVM algorithm, which has its vaunted advantages of sparsity and scaling well to large datasets.
If you're still interested, here's how it works. Essentially we take the identity we want to hold, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle = K(\mathbf{x}, \mathbf{z})$, and construct a space and inner product such that it holds by definition. To do this, we define an abstract vector space $V$ where each vector is a function from the space the data lives in, $\mathcal{X}$, to the real numbers $\mathbb{R}$. A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

$$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)}, \mathbf{x}).$$
The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$$\left\langle \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)}, \cdot),\; \sum_{j=1}^m \beta_j K(\mathbf{z}^{(j)}, \cdot) \right\rangle = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j K(\mathbf{x}^{(i)}, \mathbf{z}^{(j)}).$$
With the feature space defined in this way, $\Phi$ is a mapping $\mathcal{X} \to V$, taking each point $\mathbf{x}$ to the "kernel slice" at that point:

$$\Phi(\mathbf{x}) = K(\mathbf{x}, \cdot), \qquad \text{i.e.} \quad \Phi(\mathbf{x})(\mathbf{z}) = K(\mathbf{x}, \mathbf{z}).$$
You can prove that $(V, \langle \cdot, \cdot \rangle)$ is an inner product space when $K$ is a positive definite kernel. See this paper for details. (Kudos to fcoppens for pointing this out!)
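The construction can also be mimicked directly in code: represent a vector of $V$ by its centers and coefficients, and define the inner product through the kernel, so that the reproducing identity holds by construction. A sketch (my own class and function names, not anything from the paper):

```python
import numpy as np

sigma = 1.0

def K(x, z):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / sigma ** 2)

class KernelSliceVector:
    """A vector of V: a finite linear combination  sum_i alpha_i * K(x_i, .)."""
    def __init__(self, points, alphas):
        self.points, self.alphas = list(points), list(alphas)

    def __call__(self, z):                      # evaluate the function at z
        return sum(a * K(p, z) for p, a in zip(self.points, self.alphas))

    def inner(self, other):                     # the abstract kernel-based inner product
        return sum(a * b * K(p, q)
                   for p, a in zip(self.points, self.alphas)
                   for q, b in zip(other.points, other.alphas))

def Phi(x):                                     # the feature map: x -> K(x, .)
    return KernelSliceVector([x], [1.0])

x, z = np.array([0.3, -1.2]), np.array([1.0, 0.5])
print(np.isclose(Phi(x).inner(Phi(z)), K(x, z)))   # True: <Phi(x), Phi(z)> = K(x, z)
```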
This answer gives a nice linear algebra explanation, but here's a geometric perspective, with both intuition and proof.
For any fixed point $\mathbf{z}$, we have a kernel slice function $K_{\mathbf{z}}(\mathbf{x}) = K(\mathbf{z}, \mathbf{x})$. The graph of $K_{\mathbf{z}}$ is just a Gaussian bump centered at $\mathbf{z}$. Now, if the feature space were only finite dimensional, that would mean we could take a finite set of bumps at a fixed set of points and form any Gaussian bump anywhere else. But clearly there's no way we can do this; you can't make a new bump out of old bumps, because the new bump could be really far away from the old ones. So, no matter how many feature vectors (bumps) we have, we can always add new bumps, and in the feature space these are new independent vectors. So the feature space can't be finite dimensional; it has to be infinite.
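The "you cannot build a new bump out of old bumps" intuition is easy to check numerically: a least-squares fit of a far-away bump by bumps at the old centers explains essentially none of it. (A sketch with arbitrary centers of my own choosing.)

```python
import numpy as np

sigma = 1.0
old_centers = np.array([0.0, 1.0, 2.5, 4.0])    # the "old" bumps
new_center = 50.0                                # a new bump, far away from all of them

grid = np.linspace(-10.0, 60.0, 2000)
def bump(c):
    return np.exp(-(grid - c) ** 2 / sigma ** 2)

A = np.stack([bump(c) for c in old_centers], axis=1)   # columns: old bumps on a grid
target = bump(new_center)

coef, *_ = np.linalg.lstsq(A, target, rcond=None)
rel_err = np.linalg.norm(A @ coef - target) / np.linalg.norm(target)
print(rel_err)    # ~1.0: the old bumps explain essentially none of the far-away bump
```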
We use induction. Suppose you have an arbitrary set of points $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}$ such that the vectors $\Phi(\mathbf{x}^{(i)})$ are linearly independent in the feature space. Now find a point $\mathbf{x}^{(n+1)}$ distinct from these $n$ points, in fact a billion sigmas away from all of them. We claim that $\Phi(\mathbf{x}^{(n+1)})$ is linearly independent from the first $n$ feature vectors $\Phi(\mathbf{x}^{(1)}), \ldots, \Phi(\mathbf{x}^{(n)})$.
Proof by contradiction. Suppose to the contrary that

$$\Phi(\mathbf{x}^{(n+1)}) = \sum_{i=1}^n \alpha_i \Phi(\mathbf{x}^{(i)}).$$
Now take the inner product of both sides with an arbitrary feature vector $\Phi(\mathbf{x})$. By the identity $\langle \Phi(\mathbf{z}), \Phi(\mathbf{x}) \rangle = K(\mathbf{z}, \mathbf{x})$, we obtain

$$K(\mathbf{x}^{(n+1)}, \mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)}, \mathbf{x}).$$
Here $\mathbf{x}$ is a free variable, so this equation is an identity stating that two functions are the same. In particular, it says that a Gaussian centered at $\mathbf{x}^{(n+1)}$ can be represented as a linear combination of Gaussians centered at the other points $\mathbf{x}^{(i)}$. It is obvious geometrically that one cannot create a Gaussian bump centered at one point from a finite combination of Gaussian bumps centered at other points, especially when all those other bumps are a billion sigmas away. So our assumption of linear dependence has led to a contradiction, as we set out to show.
The kernel matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ of the Gaussian kernel always has full rank for distinct points $\mathbf{x}_1, \ldots, \mathbf{x}_m$. This means that each time you add a new example, the rank increases by $1$. The easiest way to see this is to set $\sigma$ very small. Then the kernel matrix is almost diagonal.
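A quick numerical illustration of both claims, with toy data of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                               # 10 distinct points

def gram(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

print(np.linalg.matrix_rank(gram(X, sigma=1.0)))           # 10: full rank
print(np.abs(gram(X, sigma=0.05) - np.eye(10)).max())      # ~0: almost diagonal
```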
The fact that the rank always increases by one means that all projections $\Phi(\mathbf{x}_i)$ in feature space are linearly independent (not orthogonal, but independent). Therefore, each example adds a new dimension to the span of the projections $\Phi(\mathbf{x}_1), \ldots, \Phi(\mathbf{x}_m)$. Since you can add uncountably infinitely many examples, the feature space must have infinite dimension. Interestingly, all projections of the input space into the feature space lie on a sphere, since $\|\Phi(\mathbf{x})\|^2 = k(\mathbf{x}, \mathbf{x}) = 1$. Nevertheless, the geometry of the sphere is flat. You can read more on that in
Burges, C. J. C. (1999). Geometry and Invariance in Kernel Based Methods. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning (pp. 89–116). MIT Press.
For the background and the notations I refer to the answer How to calculate decision boundary from support vectors?.
So the features in the 'original' space are the vectors $x_i$, the binary outcomes are $y_i \in \{-1, +1\}$, and the Lagrange multipliers are $\alpha_i$.
It is known that the kernel can be written as $K(x, y) = \Phi(x) \cdot \Phi(y)$ (where '$\cdot$' represents the inner product), with $\Phi$ an (implicit and unknown) transformation to a new feature space.
I will try to give some 'intuitive' explanation of what this looks like, so this answer is not a formal proof; it just tries to give some feeling for how I think this works. Do not hesitate to correct me if I am wrong. The basis for my explanation is section 2.2.1 of this pdf.
I have to 'transform' my feature space (so my $x_i$) into some 'new' feature space in which the linear separation will be solved.
For each observation $x_i$, I define the function $\phi_i(x) = K(x_i, x)$, so I have a function for each element of my training sample. These functions span a vector space: the vector space spanned by the $\phi_i$, denote it $V$. ($N$ is the size of the training sample.)
I will try to argue that this vector space $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in $V$ can be written as a linear combination of the $\phi_i$, i.e. as $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers. So, in fact,

$$V = \left\{ v = \sum_{i=1}^N \gamma_i \phi_i \;\middle|\; (\gamma_1, \ldots, \gamma_N) \in \mathbb{R}^N \right\}.$$
Note that $(\gamma_1, \ldots, \gamma_N)$ are the coordinates of the vector $v$ in the vector space $V$.
$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x) = K(x_i, x)$ (see supra, we defined $\phi_i$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.
If the kernel is 'complex enough', then the $\phi_i$ will all be linearly independent, and the dimension of $V$ will be $N$, the size of the training sample.
The transformation that maps my original feature space to $V$ is defined as

$$\Phi: x_i \mapsto \phi_i(x) = K(x_i, x).$$
This map $\Phi$ maps my original feature space onto a vector space that can have a dimension as large as the size of my training sample. So $\Phi$ maps each observation in my training sample into a vector space where the vectors are functions. The vector $x_i$ from my training sample is 'mapped' to a vector in $V$, namely the vector $\phi_i$ with coordinates all equal to zero, except the $i$-th coordinate, which is 1.
Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on my kernel, map into a space whose dimension goes up to the size of my training sample, and (d) the vectors of $V$ look like $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.
Looking at the function $f(x)$ in How to calculate decision boundary from support vectors?, it can be seen that $f(x) = \sum_i y_i \alpha_i \phi_i(x) + b$. The decision boundary found by the SVM is $f(x) = 0$.
In other words, $f(x)$ is a linear combination of the $\phi_i$, and $f(x) = 0$ is a linear separating hyperplane in the $V$-space: it is a particular choice of the $\gamma_i$, namely $\gamma_i = \alpha_i y_i$!
The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.
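To connect this to an actual implementation, here is a sketch with scikit-learn and my own toy data: the fitted decision function is exactly $\sum_i \alpha_i y_i K(x_i, x) + b$, i.e. a linear combination of the $\phi_i$ with coefficients $\gamma_i = \alpha_i y_i$ (scikit-learn stores these products in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0] * X[:, 1])           # some nonlinear labeling of the points
y[y == 0] = 1

gamma = 0.7
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors (the coefficients of the phi_i).
x_new = rng.normal(size=(5, 2))
K_new = rbf_kernel(x_new, clf.support_vectors_, gamma=gamma)
f_manual = K_new @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(f_manual, clf.decision_function(x_new)))   # True
```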
This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$, with a different dimension. This dimension depends on the kernel you use, and for the RBF kernel this dimension can go up to the size of the training sample. Since training samples may have any size, this could go up to 'infinite'. Obviously, in very high-dimensional spaces the risk of overfitting increases.
So kernels are a technique that allows the SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?
Unfortunately, fcop's explanation is quite incorrect. First of all, he says "It is known that the Kernel can be written as... where ... is an (implicit and unknown) transformation to a new feature space." It's NOT unknown. This is in fact the space the features are mapped to, and this is the space that could be infinite-dimensional, as in the RBF case. All the kernel does is take the inner product of that transformed feature vector with the transformed feature vector of a training example and apply some function to the result. Thus it implicitly represents this higher-dimensional feature vector. Think of writing $(x+y)^2$ instead of $x^2 + 2xy + y^2$, for example. Now think about what infinite series is implicitly represented by the exponential function... there you have your infinite feature space. This has absolutely nothing to do with the fact that your training set could be infinitely large.
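For completeness, here is the explicit expansion hinted at there, written for a one-dimensional input (a standard textbook identity, not part of the original comment). Writing the RBF kernel as $K(x, z) = e^{-\gamma (x - z)^2}$ and Taylor-expanding the cross term gives

$$K(x,z) = e^{-\gamma x^2} e^{-\gamma z^2} e^{2\gamma x z} = e^{-\gamma x^2} e^{-\gamma z^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!} x^k z^k = \langle \Phi(x), \Phi(z) \rangle,$$

with the infinite-dimensional feature map

$$\Phi(x) = e^{-\gamma x^2} \left( 1,\ \sqrt{\tfrac{2\gamma}{1!}}\, x,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\, x^2,\ \sqrt{\tfrac{(2\gamma)^3}{3!}}\, x^3,\ \ldots \right),$$

one coordinate per monomial degree, independent of the training set size.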
The right way to think about SVMs is that you map your features to a possibly infinite dimensional feature space which happens to be implicitly representable in yet another finite dimensional "Kernel" feature space whose dimension could be as large as the training set size.