LDA Dimensionality Reduction and k-Nearest Neighbors

Verifying k-nearest-neighbor accuracy before and after LDA dimensionality reduction in MATLAB


I. Concepts

1. Nearest neighbor rule: to classify an unknown sample x, compare the Euclidean distance between x and each of the N samples with known labels, and assign x to the class of the nearest sample.

2. k-nearest neighbor rule: take the k nearest neighbors of the unknown sample x and assign x to whichever class holds the majority among them. For a two-class problem, k is chosen odd to avoid ties (k1 = k2).

3. Euclidean distance:
in 2-D: d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )
in 3-D: d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )
generalized to n dimensions: d = sqrt( ∑ (xi1 - xi2)^2 ), i = 1, 2, ..., n
where xi1 is the i-th coordinate of the first point and xi2 is the i-th coordinate of the second point.
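The distance formula and the k-NN majority vote can be illustrated with a short Python/NumPy sketch (an added illustration, not part of the original MATLAB experiments; the helper names `euclidean` and `knn_predict` are made up here):

```python
import numpy as np
from collections import Counter

def euclidean(p, q):
    # n-dimensional Euclidean distance: d = sqrt(sum_i (p_i - q_i)^2)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

def knn_predict(train_x, train_y, x, k=3):
    # sort training samples by distance to x, then take the majority label
    # among the k nearest (k odd to avoid ties in a two-class problem)
    dists = [euclidean(t, x) for t in train_x]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [1, 1, 1, 2, 2, 2]
print(euclidean([0, 0], [3, 4]))                       # 5.0
print(knn_predict(train_x, train_y, [0.5, 0.5], k=3))  # 1
print(knn_predict(train_x, train_y, [5.5, 5.5], k=3))  # 2
```

With k = 1 the same function reduces to the nearest neighbor rule.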

II. Analysis

1. Algorithm

To decide which class x belongs to, the key step is to find the k samples closest to x (with k = 1 this reduces to the nearest neighbor rule) and take the majority class among them. Euclidean distance is used throughout, applied to UCI datasets.

For k-NN, the sonar and iris data are first split by cross-validation into training and test samples. For each test sample, we compute its Euclidean distance to every training sample, select the k nearest training samples, classify by majority vote, and compare against the original labels to obtain the accuracy.
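The cross-validation split described above can be sketched in Python/NumPy (`kfold_indices` is a hypothetical helper mimicking the spirit of MATLAB's `crossvalind('Kfold', ...)`, not part of the original code):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # randomly assign each of n samples to one of k folds of (nearly) equal size,
    # similar in spirit to MATLAB's crossvalind('Kfold', n, k)
    rng = np.random.default_rng(seed)
    folds = np.tile(np.arange(k), n // k + 1)[:n]
    rng.shuffle(folds)
    return folds

# 208 sonar samples, 8 folds: one fold tests, the other seven train
folds = kfold_indices(208, 8)
test_mask = folds == 0
train_mask = ~test_mask
print(test_mask.sum(), train_mask.sum())  # 26 182
```

Rotating the test fold over all 8 folds and averaging the per-fold accuracy gives the cross-validated estimate.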

For the dimensionality-reduction experiment, the data are first split by class; each class's mean vector and within-class scatter are computed, followed by the overall mean vector, the total within-class and between-class scatter matrices, and the Fisher projection directions. The first k eigenvectors form the projection matrix, the samples are projected down to k dimensions, and k-NN is then run on the reduced data.
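Those reduction steps can be sketched in Python/NumPy (a hedged illustration of Fisher LDA, not the original MATLAB; the function name `lda_projection` and the synthetic data are assumptions):

```python
import numpy as np

def lda_projection(X, y, k):
    # Fisher LDA: per-class means and within-class scatter, then the total
    # within-class (Sw) and between-class (Sb) scatter, then the leading
    # eigenvectors of Sw^-1 Sb as the (d x k) projection matrix
    classes = np.unique(y)
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += Xc.shape[0] * np.outer(mc - m, mc - m)
    eigval, eigvec = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigval.real)[::-1]  # sort by eigenvalue, largest first
    return eigvec[:, order[:k]].real

# two well-separated synthetic classes in 3-D, reduced to 1-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0, 0], 0.5, size=(50, 3)),
               rng.normal([3, 3, 0], 0.5, size=(50, 3))])
y = np.array([1] * 50 + [2] * 50)
w = lda_projection(X, y, 1)
z = X @ w  # samples projected onto the Fisher direction
print(w.shape)  # (3, 1)
```

The projected class means stay far apart relative to the within-class spread, which is exactly the criterion the Fisher directions optimize.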

2. Data

Sonar: 208 samples with 60 features each, split into two classes of 97 and 111 samples.

Iris: 150 samples with 4 attributes each, split into 3 classes of 50 samples each.

USPS: 7291 training samples and 2007 test samples, each with 256 features.

III. Core Code

Sonar

Nearest neighbor

clear all;
load('sonar.mat');
% 8-fold cross-validation: 7 folds for training, 1 fold for testing
correct = zeros(10,1);
for times = 1:10
    c = zeros(8,1);
    indices = crossvalind('Kfold',208,8);
    wrong = zeros(8,1);
    for i = 1:8
        test = (indices == i);
        train = ~test;
        traindata = sonar(train,:);
        testdata = sonar(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:26
            x(r,1:60) = testdata(r,1:60);
            distance = zeros(182,1);
            for j = 1:182
                distance(j,1) = sqrt((testdata(r,1:60)-traindata(j,1:60))*(testdata(r,1:60)-traindata(j,1:60))');
            end
            [B,index] = sort(distance);
            % assign the test sample to the class of its nearest training sample
            x(r,61) = traindata(index(1),61);
            % compare with the original label and count errors
            if x(r,61) ~= testdata(r,61)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/26;
    correct(times,1) = mean(c);
end
correct

k-nearest neighbors

clear all;
load('sonar.mat');

% 8-fold cross-validation: 7 folds for training, 1 fold for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(8,1);
    wrong = zeros(8,1);
    indices = crossvalind('Kfold',208,8);
    for i = 1:8
        test = (indices == i);
        train = ~test;
        traindata = sonar(train,:);
        testdata = sonar(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:26
            x(r,1:60) = testdata(r,1:60);
            distance = zeros(182,1);
            for j = 1:182
                distance(j,1) = sqrt((testdata(r,1:60)-traindata(j,1:60))*(testdata(r,1:60)-traindata(j,1:60))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(2,1);
            for j = 1:k
                if traindata(index(j),61) == 1
                    count(1,1) = count(1,1) + 1;
                else
                    count(2,1) = count(2,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,61) = idx(2);
            % compare with the original label and count errors
            if x(r,61) ~= testdata(r,61)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/26;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

LDA dimensionality reduction followed by k-NN

clear all
load('sonar.mat')

% LDA dimensionality reduction
% split the samples into two classes by label and compute each class's
% mean vector and within-class scatter matrix
n1 = 0;
for j = 1:208
    if sonar(j,61) == 1
        n1 = n1 + 1;
        simple1(n1,:) = sonar(j,1:60);
    end
end
m1 = mean(simple1(1:n1,:));
s1 = zeros(60,60);
for j = 1:n1
    s1 = s1 + (simple1(j,:)-m1)'*(simple1(j,:)-m1);
end
n2 = 0;
for j = 1:208
    if sonar(j,61) == 2
        n2 = n2 + 1;
        simple2(n2,:) = sonar(j,1:60);
    end
end
m2 = mean(simple2(1:n2,:));
s2 = zeros(60,60);
for j = 1:n2
    s2 = s2 + (simple2(j,:)-m2)'*(simple2(j,:)-m2);
end
% overall mean vector, total within-class and between-class scatter,
% and the Fisher projection directions
m = mean(sonar(:,1:60));
sw = s1 + s2;
sb = n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m);
[eigvec,eigvalue] = eig(sb,sw);
% sort by eigenvalue so the leading directions come first,
% then take the first 10 eigenvectors as the projection matrix
[~,order] = sort(diag(eigvalue),'descend');
w = eigvec(:,order(1:10));
% project all samples onto the 10 directions
data = [(w'*sonar(:,1:60)')', sonar(:,61)];

% k-NN classification on the reduced data
% 8-fold cross-validation: 7 folds for training, 1 fold for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(8,1);
    wrong = zeros(8,1);
    indices = crossvalind('Kfold',208,8);
    for i = 1:8
        test = (indices == i);
        train = ~test;
        traindata = data(train,:);
        testdata = data(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:26
            x(r,1:10) = testdata(r,1:10);
            distance = zeros(182,1);
            for j = 1:182
                distance(j,1) = sqrt((testdata(r,1:10)-traindata(j,1:10))*(testdata(r,1:10)-traindata(j,1:10))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(2,1);
            for j = 1:k
                if traindata(index(j),11) == 1
                    count(1,1) = count(1,1) + 1;
                else
                    count(2,1) = count(2,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,11) = idx(2);
            % compare with the original label and count errors
            if x(r,11) ~= testdata(r,11)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/26;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

Iris

clear all
load('iris.txt')
% 10-fold cross-validation: 90% for training, 10% for testing
correct = zeros(10,1);
for times = 1:10
    c = zeros(10,1);
    indices = crossvalind('Kfold',150,10);
    wrong = zeros(10,1);
    for i = 1:10
        test = (indices == i);
        train = ~test;
        traindata = iris(train,:);
        testdata = iris(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:15
            x(r,1:4) = testdata(r,1:4);
            distance = zeros(135,1);
            for j = 1:135
                distance(j,1) = sqrt((testdata(r,1:4)-traindata(j,1:4))*(testdata(r,1:4)-traindata(j,1:4))');
            end
            [B,index] = sort(distance);
            % assign the test sample to the class of its nearest training sample
            x(r,5) = traindata(index(1),5);
            % compare with the original label and count errors
            if x(r,5) ~= testdata(r,5)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/15;
    correct(times,1) = mean(c);
end
correct

k-nearest neighbors

clear all
load('iris.txt')

% 10-fold cross-validation: 90% for training, 10% for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(10,1);
    wrong = zeros(10,1);
    indices = crossvalind('Kfold',150,10);
    for i = 1:10
        test = (indices == i);
        train = ~test;
        traindata = iris(train,:);
        testdata = iris(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:15
            x(r,1:4) = testdata(r,1:4);
            distance = zeros(135,1);
            for j = 1:135
                distance(j,1) = sqrt((testdata(r,1:4)-traindata(j,1:4))*(testdata(r,1:4)-traindata(j,1:4))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(3,1);
            for j = 1:k
                if traindata(index(j),5) == 1
                    count(1,1) = count(1,1) + 1;
                elseif traindata(index(j),5) == 2
                    count(2,1) = count(2,1) + 1;
                else
                    count(3,1) = count(3,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,5) = idx(3);
            % compare with the original label and count errors
            if x(r,5) ~= testdata(r,5)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/15;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

LDA dimensionality reduction followed by k-NN

clear all
load('iris.txt')

% LDA dimensionality reduction
% split the samples into three classes by label and compute each class's
% mean vector and within-class scatter matrix
n1 = 0;
for j = 1:150
    if iris(j,5) == 1
        n1 = n1 + 1;
        simple1(n1,:) = iris(j,1:4);
    end
end
m1 = mean(simple1);
s1 = zeros(4,4);
for j = 1:n1
    s1 = s1 + (simple1(j,:)-m1)'*(simple1(j,:)-m1);
end
n2 = 0;
for j = 1:150
    if iris(j,5) == 2
        n2 = n2 + 1;
        simple2(n2,:) = iris(j,1:4);
    end
end
m2 = mean(simple2);
s2 = zeros(4,4);
for j = 1:n2
    s2 = s2 + (simple2(j,:)-m2)'*(simple2(j,:)-m2);
end
n3 = 0;
for j = 1:150
    if iris(j,5) == 3
        n3 = n3 + 1;
        simple3(n3,:) = iris(j,1:4);
    end
end
m3 = mean(simple3);
s3 = zeros(4,4);
for j = 1:n3
    s3 = s3 + (simple3(j,:)-m3)'*(simple3(j,:)-m3);
end
% overall mean vector, total within-class and between-class scatter,
% and the Fisher projection directions
m = mean(iris(:,1:4));
sw = s1 + s2 + s3;
sb = n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m) + n3*(m3-m)'*(m3-m);
[eigvec,eigvalue] = eig(sw\sb);
% sort by eigenvalue so the leading directions come first,
% then take the first two eigenvectors as the projection matrix
[~,order] = sort(diag(eigvalue),'descend');
w = eigvec(:,order(1:2));
% project all samples onto the two directions
data = [(w'*iris(:,1:4)')', iris(:,5)];

% k-NN classification on the reduced data
% 10-fold cross-validation: 90% for training, 10% for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(10,1);
    wrong = zeros(10,1);
    indices = crossvalind('Kfold',150,10);
    for i = 1:10
        test = (indices == i);
        train = ~test;
        traindata = data(train,:);
        testdata = data(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:15
            x(r,1:2) = testdata(r,1:2);
            distance = zeros(135,1);
            for j = 1:135
                distance(j,1) = sqrt((testdata(r,1:2)-traindata(j,1:2))*(testdata(r,1:2)-traindata(j,1:2))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(3,1);
            for j = 1:k
                if traindata(index(j),3) == 1
                    count(1,1) = count(1,1) + 1;
                elseif traindata(index(j),3) == 2
                    count(2,1) = count(2,1) + 1;
                else
                    count(3,1) = count(3,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,3) = idx(3);
            % compare with the original label and count errors
            if x(r,3) ~= testdata(r,3)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/15;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

USPS

clear all;
traindata = importdata('usps.txt');
testdata = importdata('usps_test.txt');
% classify the test samples with k-NN

correct = zeros(145,1);
for k = 25:20:145
    wrong = 0;
    for i = 1:2007
        x = zeros(1,257);
        x(1,2:257) = testdata(i,2:257);
        % compute the Euclidean distance from the test sample to every
        % training sample, then sort the distances
        distance = zeros(7291,1);
        for j = 1:7291
            distance(j,1) = sqrt((x(1,2:257)-traindata(j,2:257))*(x(1,2:257)-traindata(j,2:257))');
        end
        [B,index] = sort(distance);
        % take the k nearest training samples and vote by majority class;
        % labels are the digits 0-9, stored in count(1) through count(10)
        count = zeros(10,1);
        for j = 1:k
            label = traindata(index(j),1);
            count(label+1,1) = count(label+1,1) + 1;
        end
        [C,idx] = sort(count);
        x(1,1) = idx(10) - 1;
        % compare with the original label and count errors
        if x(1,1) ~= testdata(i,1)
            wrong = wrong + 1;
        end
    end
    correct(k,1) = 1 - wrong/2007
end
correct(25:20:145,1)

IV. Results

Sonar

k    accuracy before LDA    accuracy after LDA
1 0.8221 0.8077
3 0.8365 0.8029
5 0.7788 0.8125
7 0.7500 0.7356
9 0.7115 0.7309
11 0.6779 0.6683
13 0.6442 0.6635
15 0.6394 0.6683
17 0.6490 0.6394
19 0.6587 0.6683


Iris

k    accuracy before LDA    accuracy after LDA
1 0.9600 0.9667
3 0.9600 0.9600
5 0.9533 0.9600
7 0.9667 0.9667
9 0.9733 0.9800
11 0.9733 0.9800
13 0.9733 0.9800
15 0.9800 0.9800
17 0.9667 0.9800
19 0.9800 0.9733


USPS

k    accuracy
1 0.9502
3 0.9482
5 0.9502
7 0.9467
9 0.9442
11 0.9402
13 0.9372
15 0.9352
17 0.9332
19 0.9317
