
2015-12-05 #17

Open
yorkie opened this issue Dec 5, 2015 · 0 comments
yorkie commented Dec 5, 2015

VC Dimension:

  • The key idea is the hyperplane: in two-dimensional space, each point in a given data set is labeled either positive or negative, and we try to separate the positives from the negatives with a straight line (the 2D hyperplane). When the data set is four points forming a rectangle, with the positive and negative labels sitting on opposite diagonals, no single line can separate them; but for a data set of only three points (in general position), we can always find a separating line, whatever the labeling.
  • The previous item is an example. The VC dimension involves two things: a data set S and a hypothesis class H, i.e. the family of classifier functions. In the example above, H is the set of straight lines (2D hyperplanes). If instead we allowed thin ellipses as H, then all labelings of the four points could be realized, so for that family VC(H) = 4. (The VC dimension is a property of H: the size of the largest set that H can shatter.)

After learning how to compute the VC dimension, I wanted to understand what this number is actually for, so I consulted this Quora answer:

This is where the VC dimension comes in - it enables you to conduct your search in a principled way. For a family of surfaces - or to be precise, a family of functions - the VC dimension gives you a number on which you can peg its capability to separate labels.

The general idea is that the VC dimension points you to a reasonable family of functions to inspect. You pick a specific member within this family based on the exact data-set at hand.

My understanding is: when doing prediction or classification, the VC dimension helps you judge which family of functions can effectively separate your data. But the author also points out:

Risk <= (Empirical Risk) + (a confidence term that grows with the VC dimension)

I don't fully understand Empirical Risk yet, but roughly it is the training error: a quantity you can drive down with effort, thereby lowering the bound on the true error rate. So there is a trade-off here:

  • A larger VC dimension lets the function family fit more labelings of the data, but it also inflates the confidence term and hence the bound on the error rate.
  • A smaller VC dimension keeps the confidence term low, but the family often cannot fit the data at hand, and human intervention (choosing a different family) is required.
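The trade-off in the two bullets above can be made concrete with Vapnik's bound, where the confidence term is sqrt((h·(ln(2N/h)+1) − ln(η/4))/N) for VC dimension h, sample size N, and confidence parameter η (exact constants vary by source). The empirical risks below are made-up numbers, assuming only that richer families fit the training data better:

```python
import math

def vc_confidence(h, n, eta=0.05):
    """Vapnik's confidence term: grows with VC dimension h,
    shrinks with sample size n (constants vary by source)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# Hypothetical training errors: richer families (larger h) fit better.
empirical = {1: 0.30, 5: 0.15, 20: 0.05, 100: 0.02}

n = 1000
for h, emp in empirical.items():
    bound = emp + vc_confidence(h, n)
    print(f"h={h:3d}  empirical={emp:.2f}  bound={bound:.3f}")
```

With these (invented) numbers the bound is minimized at an intermediate h: the empirical risk keeps falling as h grows, but past some point the confidence term grows faster, which is exactly the tension the two bullets describe.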

References:

@yorkie yorkie added the Diary label Dec 5, 2015