This is the fourth note on the fourth Python course, Programming with Python for Data Science.
It is also the last note for this course,
mainly recording how to use Principal Component Analysis (PCA) and its pros and cons;
I did not take notes on Isomap.
The parts after that are all about Data Modeling,
so I skipped them.
Later I will write up some scientific computing and signal processing topics (Laplace transforms and the like).
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 20 11:15:29 2017
@author: ShihHsing Chen
This is a note for the class
"Programming with Python for Data Science".
Caution: This course uses Python 2.7, not Python 3.6, so I will update the
content for you where I know there is a difference.
"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/e9KQTn"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS
Unsupervised learning aims to discover some type of hidden structure within
your data. Without a label or correct answer to test against, there is no
metric for evaluating unsupervised learning algorithms. Principal Component
Analysis (PCA), a transformation that attempts to convert your possibly
correlated features into a set of linearly uncorrelated ones, is the first
unsupervised learning algorithm you'll study.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/e9KQTn"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS(cont'd)
PCA is one of the most popular techniques for dimensionality reduction, and
we recommend you always start with it when you have a complex dataset. It
models a linear subspace of your data by capturing its greatest variability.
Stated differently, it accesses your dataset's covariance structure directly
using matrix calculations and eigenvectors to compute the best unique features
that describe your samples.
An iterative approach to this would first find the center of your data, based
on its numeric features. Next, it would search for the direction that has the
most variance, or widest spread of values. That direction is the principal
component vector, so it is then added to a list. By searching for more
directions of maximal variance that are orthogonal to all previously computed
vectors, more principal components can then be added to the list. This set of
vectors forms a new feature space in which you can represent your samples.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/gyCDJV"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS(cont'd)
By transforming your samples into the feature space created by discarding
under-prioritized features, a lower-dimensional representation of your data,
also known as a shadow or projection, is formed. In the shadow, some
information has been lost; it has fewer features, after all. You can actually
visualize how much information has been lost by taking each sample and moving
it to the nearest spot on the projection feature space. In the course's 2D
example dataset (the figure is not reproduced here), the orange line
represents the first principal component direction, and the gray line
represents the second principal component, the one that is going to get
dropped.
By dropping the gray component, the goal is to project the 2D points onto a
1D space: move the original 2D samples to their closest spot on the line.
Once you have projected all samples to their closest spot on the major
principal component, a shadow, or lower-dimensional representation, has been
formed.
The summed distance traveled by all moved samples is equal to the total
information lost by the projection. In an ideal situation, this lost
information should be dominated by highly redundant features and random noise.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#The fourth video and explanation in module four do little to help us. Their
#example uses an undefined variable named "df." If you know what it is, feel
#free to share it with me; my own guess at what it meant follows below. The
#only thing that matters is the following website:
"""http://scikit-learn.org/stable/"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/oG5eBt"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS' WEAKNESS
To use PCA effectively, you should be aware of its weaknesses. The first is
that it is sensitive to the scaling of your features. PCA maximizes
variability based on variance (the average squared difference of your samples
from the mean), and then projects your original data onto these directions of
maximal variance. If your data has a feature with a large variance and others
with small variances, PCA will load on the larger-variance feature. Being a
linear transformation, it will rotate and reorient your feature space so that
it diverts as much of the variance of the larger-variance feature (and some
of the other features' variances) into the first few principal components.
When it does this, a feature might go from having little impact on your
transformation to totally dominating the first principal component, or
vice versa. Standardizing your variables ahead of time is the way to free
your PCA results of such scaling issues. The cases where you should not use
standardization are when you know your feature variables, and thus their
importance, need to respect the specific scaling you've set up. In such a
case, PCA would be used on your raw data.
PCA is extremely fast, but for very large datasets it might take a while to
train. All of the matrix computations required for inversion, eigenvalue
decomposition, etc. bog it down. Since most real-world datasets are
typically very large and will have a level of noise in them, razor-sharp,
machine-precision matrix operations aren't always necessary. If you're
willing to sacrifice a bit of accuracy for computational efficiency,
SciKit-Learn offers a sister algorithm called RandomizedPCA that applies
some approximation techniques to speed up large-scale matrix computation.
You can use this for your larger datasets.
NOTE: The sklearn.decomposition.RandomizedPCA() method has since been
deprecated. If you would like to perform a randomized singular value
decomposition as the means of estimating PCA, update the svd_solver parameter
to equal the string 'randomized'. Whether or not the randomized approximation
actually runs faster than regular PCA depends on a number of factors,
including but not limited to: if there was any interference from other
running sub-processes, the size of your dataset, the data types used in your
dataset, and if the logic introduced to run the randomized selection can
complete faster than just running through the dataset regularly. Due to this,
for small to medium sized datasets, it might occasionally be faster to run
regular PCA.
The last issue to keep in mind is that PCA is a linear transformation only!
In graphical terms, it can rotate and translate the feature space of your
samples, but will not skew them. PCA will only, therefore, be able to capture
the underlying linear shapes and variance within your data and cannot discern
any complex, nonlinear intricacies. For such cases, you will have to make use
of other dimensionality reduction algorithms, such as Isomap.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#The lecturer also talked about Isomap's usage and results, but I skipped it
#because it does not seem very useful to me for now.