LDA Dimensionality Reduction and k-Nearest Neighbors

Verifying k-nearest-neighbor accuracy before and after LDA dimensionality reduction in MATLAB


I. Concepts

1. Nearest neighbor rule: to classify an unknown sample x, compare the Euclidean distance between x and each of the N samples with known labels, and assign x to the class of the nearest sample.

2. k-nearest neighbor rule: take the k nearest neighbors of the unknown sample x and assign x to whichever class holds the majority among them. For a two-class problem, k is chosen odd to avoid ties (k1 = k2).

3. Euclidean distance:
in 2-D: d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )
in 3-D: d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )
generalized to n dimensions: d = sqrt( ∑ (xi1 - xi2)^2 ), i = 1, 2, ..., n
where xi1 is the i-th coordinate of the first point and xi2 is the i-th coordinate of the second point.
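The distance formula and the k-NN majority vote can be illustrated with a short Python/NumPy sketch (an added illustration, not part of the original MATLAB experiments; the helper names `euclidean` and `knn_predict` are made up here):

```python
import numpy as np
from collections import Counter

def euclidean(p, q):
    # n-dimensional Euclidean distance: d = sqrt(sum_i (p_i - q_i)^2)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

def knn_predict(train_x, train_y, x, k=3):
    # sort training samples by distance to x, then take the majority label
    # among the k nearest (k odd to avoid ties in a two-class problem)
    dists = [euclidean(t, x) for t in train_x]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [1, 1, 1, 2, 2, 2]
print(euclidean([0, 0], [3, 4]))                       # 5.0
print(knn_predict(train_x, train_y, [0.5, 0.5], k=3))  # 1
print(knn_predict(train_x, train_y, [5.5, 5.5], k=3))  # 2
```

With k = 1 the same function reduces to the nearest neighbor rule.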

II. Analysis

1. Algorithm

To decide which class x belongs to, the key step is to find the k samples closest to x (with k = 1 this reduces to the nearest neighbor rule) and take the majority class among them. Euclidean distance is used throughout, applied to UCI datasets.

For k-NN, the sonar and iris data are first split by cross-validation into training and test samples. For each test sample, we compute its Euclidean distance to every training sample, select the k nearest training samples, classify by majority vote, and compare against the original labels to obtain the accuracy.
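The cross-validation split described above can be sketched in Python/NumPy (`kfold_indices` is a hypothetical helper mimicking the spirit of MATLAB's `crossvalind('Kfold', ...)`, not part of the original code):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # randomly assign each of n samples to one of k folds of (nearly) equal size,
    # similar in spirit to MATLAB's crossvalind('Kfold', n, k)
    rng = np.random.default_rng(seed)
    folds = np.tile(np.arange(k), n // k + 1)[:n]
    rng.shuffle(folds)
    return folds

# 208 sonar samples, 8 folds: one fold tests, the other seven train
folds = kfold_indices(208, 8)
test_mask = folds == 0
train_mask = ~test_mask
print(test_mask.sum(), train_mask.sum())  # 26 182
```

Rotating the test fold over all 8 folds and averaging the per-fold accuracy gives the cross-validated estimate.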

For the dimensionality-reduction experiment, the data are first split by class; each class's mean vector and within-class scatter are computed, followed by the overall mean vector, the total within-class and between-class scatter matrices, and the Fisher projection directions. The first k eigenvectors form the projection matrix, the samples are projected down to k dimensions, and k-NN is then run on the reduced data.
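Those reduction steps can be sketched in Python/NumPy (a hedged illustration of Fisher LDA, not the original MATLAB; the function name `lda_projection` and the synthetic data are assumptions):

```python
import numpy as np

def lda_projection(X, y, k):
    # Fisher LDA: per-class means and within-class scatter, then the total
    # within-class (Sw) and between-class (Sb) scatter, then the leading
    # eigenvectors of Sw^-1 Sb as the (d x k) projection matrix
    classes = np.unique(y)
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += Xc.shape[0] * np.outer(mc - m, mc - m)
    eigval, eigvec = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigval.real)[::-1]  # sort by eigenvalue, largest first
    return eigvec[:, order[:k]].real

# two well-separated synthetic classes in 3-D, reduced to 1-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0, 0], 0.5, size=(50, 3)),
               rng.normal([3, 3, 0], 0.5, size=(50, 3))])
y = np.array([1] * 50 + [2] * 50)
w = lda_projection(X, y, 1)
z = X @ w  # samples projected onto the Fisher direction
print(w.shape)  # (3, 1)
```

The projected class means stay far apart relative to the within-class spread, which is exactly the criterion the Fisher directions optimize.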

2. Data

Sonar: 208 samples with 60 features each, split into two classes of 97 and 111 samples.

Iris: 150 samples with 4 attributes each, split into 3 classes of 50 samples each.

USPS: 7291 training samples and 2007 test samples, each with 256 features.

III. Core Code

Sonar

Nearest neighbor

clear all;
load('sonar.mat');
% 8-fold cross-validation: 7 folds for training, 1 fold for testing
correct = zeros(10,1);
for times = 1:10
    c = zeros(8,1);
    indices = crossvalind('Kfold',208,8);
    wrong = zeros(8,1);
    for i = 1:8
        test = (indices == i);
        train = ~test;
        traindata = sonar(train,:);
        testdata = sonar(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:26
            x(r,1:60) = testdata(r,1:60);
            distance = zeros(182,1);
            for j = 1:182
                distance(j,1) = sqrt((testdata(r,1:60)-traindata(j,1:60))*(testdata(r,1:60)-traindata(j,1:60))');
            end
            [B,index] = sort(distance);
            % assign the test sample to the class of its nearest training sample
            x(r,61) = traindata(index(1),61);
            % compare with the original label and count errors
            if x(r,61) ~= testdata(r,61)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/26;
    correct(times,1) = mean(c);
end
correct

k-nearest neighbors

clear all;
load('sonar.mat');

% 8-fold cross-validation: 7 folds for training, 1 fold for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(8,1);
    wrong = zeros(8,1);
    indices = crossvalind('Kfold',208,8);
    for i = 1:8
        test = (indices == i);
        train = ~test;
        traindata = sonar(train,:);
        testdata = sonar(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:26
            x(r,1:60) = testdata(r,1:60);
            distance = zeros(182,1);
            for j = 1:182
                distance(j,1) = sqrt((testdata(r,1:60)-traindata(j,1:60))*(testdata(r,1:60)-traindata(j,1:60))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(2,1);
            for j = 1:k
                if traindata(index(j),61) == 1
                    count(1,1) = count(1,1) + 1;
                else
                    count(2,1) = count(2,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,61) = idx(2);
            % compare with the original label and count errors
            if x(r,61) ~= testdata(r,61)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/26;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

LDA dimensionality reduction followed by k-NN

clear all
load('sonar.mat')

% LDA dimensionality reduction
% split the samples into two classes by label and compute each class's
% mean vector and within-class scatter matrix
n1 = 0;
for j = 1:208
    if sonar(j,61) == 1
        n1 = n1 + 1;
        simple1(n1,:) = sonar(j,1:60);
    end
end
m1 = mean(simple1(1:n1,:));
s1 = zeros(60,60);
for j = 1:n1
    s1 = s1 + (simple1(j,:)-m1)'*(simple1(j,:)-m1);
end
n2 = 0;
for j = 1:208
    if sonar(j,61) == 2
        n2 = n2 + 1;
        simple2(n2,:) = sonar(j,1:60);
    end
end
m2 = mean(simple2(1:n2,:));
s2 = zeros(60,60);
for j = 1:n2
    s2 = s2 + (simple2(j,:)-m2)'*(simple2(j,:)-m2);
end
% overall mean vector, total within-class and between-class scatter,
% and the Fisher projection directions
m = mean(sonar(:,1:60));
sw = s1 + s2;
sb = n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m);
[eigvec,eigvalue] = eig(sb,sw);
% sort by eigenvalue so the leading directions come first,
% then take the first 10 eigenvectors as the projection matrix
[~,order] = sort(diag(eigvalue),'descend');
w = eigvec(:,order(1:10));
% project all samples onto the 10 directions
data = [(w'*sonar(:,1:60)')', sonar(:,61)];

% k-NN classification on the reduced data
% 8-fold cross-validation: 7 folds for training, 1 fold for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(8,1);
    wrong = zeros(8,1);
    indices = crossvalind('Kfold',208,8);
    for i = 1:8
        test = (indices == i);
        train = ~test;
        traindata = data(train,:);
        testdata = data(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:26
            x(r,1:10) = testdata(r,1:10);
            distance = zeros(182,1);
            for j = 1:182
                distance(j,1) = sqrt((testdata(r,1:10)-traindata(j,1:10))*(testdata(r,1:10)-traindata(j,1:10))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(2,1);
            for j = 1:k
                if traindata(index(j),11) == 1
                    count(1,1) = count(1,1) + 1;
                else
                    count(2,1) = count(2,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,11) = idx(2);
            % compare with the original label and count errors
            if x(r,11) ~= testdata(r,11)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/26;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

Iris

clear all
load('iris.txt')
% 10-fold cross-validation: 90% for training, 10% for testing
correct = zeros(10,1);
for times = 1:10
    c = zeros(10,1);
    indices = crossvalind('Kfold',150,10);
    wrong = zeros(10,1);
    for i = 1:10
        test = (indices == i);
        train = ~test;
        traindata = iris(train,:);
        testdata = iris(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:15
            x(r,1:4) = testdata(r,1:4);
            distance = zeros(135,1);
            for j = 1:135
                distance(j,1) = sqrt((testdata(r,1:4)-traindata(j,1:4))*(testdata(r,1:4)-traindata(j,1:4))');
            end
            [B,index] = sort(distance);
            % assign the test sample to the class of its nearest training sample
            x(r,5) = traindata(index(1),5);
            % compare with the original label and count errors
            if x(r,5) ~= testdata(r,5)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/15;
    correct(times,1) = mean(c);
end
correct

k-nearest neighbors

clear all
load('iris.txt')

% 10-fold cross-validation: 90% for training, 10% for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(10,1);
    wrong = zeros(10,1);
    indices = crossvalind('Kfold',150,10);
    for i = 1:10
        test = (indices == i);
        train = ~test;
        traindata = iris(train,:);
        testdata = iris(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:15
            x(r,1:4) = testdata(r,1:4);
            distance = zeros(135,1);
            for j = 1:135
                distance(j,1) = sqrt((testdata(r,1:4)-traindata(j,1:4))*(testdata(r,1:4)-traindata(j,1:4))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(3,1);
            for j = 1:k
                if traindata(index(j),5) == 1
                    count(1,1) = count(1,1) + 1;
                elseif traindata(index(j),5) == 2
                    count(2,1) = count(2,1) + 1;
                else
                    count(3,1) = count(3,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,5) = idx(3);
            % compare with the original label and count errors
            if x(r,5) ~= testdata(r,5)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/15;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

LDA dimensionality reduction followed by k-NN

clear all
load('iris.txt')

% LDA dimensionality reduction
% split the samples into three classes by label and compute each class's
% mean vector and within-class scatter matrix
n1 = 0;
for j = 1:150
    if iris(j,5) == 1
        n1 = n1 + 1;
        simple1(n1,:) = iris(j,1:4);
    end
end
m1 = mean(simple1);
s1 = zeros(4,4);
for j = 1:n1
    s1 = s1 + (simple1(j,:)-m1)'*(simple1(j,:)-m1);
end
n2 = 0;
for j = 1:150
    if iris(j,5) == 2
        n2 = n2 + 1;
        simple2(n2,:) = iris(j,1:4);
    end
end
m2 = mean(simple2);
s2 = zeros(4,4);
for j = 1:n2
    s2 = s2 + (simple2(j,:)-m2)'*(simple2(j,:)-m2);
end
n3 = 0;
for j = 1:150
    if iris(j,5) == 3
        n3 = n3 + 1;
        simple3(n3,:) = iris(j,1:4);
    end
end
m3 = mean(simple3);
s3 = zeros(4,4);
for j = 1:n3
    s3 = s3 + (simple3(j,:)-m3)'*(simple3(j,:)-m3);
end
% overall mean vector, total within-class and between-class scatter,
% and the Fisher projection directions
m = mean(iris(:,1:4));
sw = s1 + s2 + s3;
sb = n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m) + n3*(m3-m)'*(m3-m);
[eigvec,eigvalue] = eig(sw\sb);
% sort by eigenvalue so the leading directions come first,
% then take the first two eigenvectors as the projection matrix
[~,order] = sort(diag(eigvalue),'descend');
w = eigvec(:,order(1:2));
% project all samples onto the two directions
data = [(w'*iris(:,1:4)')', iris(:,5)];

% k-NN classification on the reduced data
% 10-fold cross-validation: 90% for training, 10% for testing
correct = zeros(19,1);
for k = 1:2:19
    c = zeros(10,1);
    wrong = zeros(10,1);
    indices = crossvalind('Kfold',150,10);
    for i = 1:10
        test = (indices == i);
        train = ~test;
        traindata = data(train,:);
        testdata = data(test,:);
        % compute the Euclidean distance from each test sample to every
        % training sample, then sort the distances
        for r = 1:15
            x(r,1:2) = testdata(r,1:2);
            distance = zeros(135,1);
            for j = 1:135
                distance(j,1) = sqrt((testdata(r,1:2)-traindata(j,1:2))*(testdata(r,1:2)-traindata(j,1:2))');
            end
            [B,index] = sort(distance);
            % take the k nearest training samples and vote by majority class
            count = zeros(3,1);
            for j = 1:k
                if traindata(index(j),3) == 1
                    count(1,1) = count(1,1) + 1;
                elseif traindata(index(j),3) == 2
                    count(2,1) = count(2,1) + 1;
                else
                    count(3,1) = count(3,1) + 1;
                end
            end
            [C,idx] = sort(count);
            x(r,3) = idx(3);
            % compare with the original label and count errors
            if x(r,3) ~= testdata(r,3)
                wrong(i,1) = wrong(i,1) + 1;
            end
        end
    end
    c = 1 - wrong/15;
    correct(k,1) = mean(c);
end
correct(1:2:19,1)

USPS

clear all;
traindata = importdata('usps.txt');
testdata = importdata('usps_test.txt');
% classify the test samples with k-NN

correct = zeros(145,1);
for k = 25:20:145
    wrong = 0;
    for i = 1:2007
        x = zeros(1,257);
        x(1,2:257) = testdata(i,2:257);
        % compute the Euclidean distance from the test sample to every
        % training sample, then sort the distances
        distance = zeros(7291,1);
        for j = 1:7291
            distance(j,1) = sqrt((x(1,2:257)-traindata(j,2:257))*(x(1,2:257)-traindata(j,2:257))');
        end
        [B,index] = sort(distance);
        % take the k nearest training samples and vote by majority class;
        % labels are the digits 0-9, stored in count(1) through count(10)
        count = zeros(10,1);
        for j = 1:k
            label = traindata(index(j),1);
            count(label+1,1) = count(label+1,1) + 1;
        end
        [C,idx] = sort(count);
        x(1,1) = idx(10) - 1;
        % compare with the original label and count errors
        if x(1,1) ~= testdata(i,1)
            wrong = wrong + 1;
        end
    end
    correct(k,1) = 1 - wrong/2007
end
correct(25:20:145,1)

IV. Results

Sonar

k    accuracy before LDA    accuracy after LDA
1 0.8221 0.8077
3 0.8365 0.8029
5 0.7788 0.8125
7 0.7500 0.7356
9 0.7115 0.7309
11 0.6779 0.6683
13 0.6442 0.6635
15 0.6394 0.6683
17 0.6490 0.6394
19 0.6587 0.6683


Iris

k    accuracy before LDA    accuracy after LDA
1 0.9600 0.9667
3 0.9600 0.9600
5 0.9533 0.9600
7 0.9667 0.9667
9 0.9733 0.9800
11 0.9733 0.9800
13 0.9733 0.9800
15 0.9800 0.9800
17 0.9667 0.9800
19 0.9800 0.9733


USPS

k    accuracy
1 0.9502
3 0.9482
5 0.9502
7 0.9467
9 0.9442
11 0.9402
13 0.9372
15 0.9352
17 0.9332
19 0.9317
