This is the fourth note on the fourth Python course, Programming with Python for Data Science.
It is also the last note for this course,
mainly recording how to use Principal Component Analysis (PCA) and its pros and cons;
I did not take notes on Isomap.
The parts after that are all about Data Modeling,
so I skipped them.
Later I will write up some scientific computing and signal processing topics (Laplace transforms and the like).
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 20 11:15:29 2017
@author: ShihHsing Chen
This is a note for the class
"Programming with Python for Data Science".
Caution: This course uses Python 2.7, not Python 3.6, so I will update the
content for you where I know there is a difference.
"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/e9KQTn"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS
Unsupervised learning aims to discover some type of hidden structure within
your data. Without a label or correct answer to test against, there is no
metric for evaluating unsupervised learning algorithms. Principal Component
Analysis (PCA), a transformation that attempts to convert your possibly
correlated features into a set of linearly uncorrelated ones, is the first
unsupervised learning algorithm you'll study.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/e9KQTn"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS(cont'd)
PCA is one of the most popular techniques for dimensionality reduction, and
we recommend you always start with it when you have a complex dataset. It
models a linear subspace of your data by capturing its greatest variability.
Stated differently, it accesses your dataset's covariance structure directly
using matrix calculations and eigenvectors to compute the best unique features
that describe your samples.
An iterative approach to this would first find the center of your data, based
on its numeric features. Next, it would search for the direction that has the
most variance, or widest spread of values. That direction is the principal
component vector, so it is then added to a list. By searching for more
directions of maximal variance that are orthogonal to all previously computed
vectors, more principal components can then be added to the list. This set of
vectors forms a new feature space in which you can represent your samples.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/gyCDJV"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS(cont'd)
By transforming your samples into the feature space created by discarding
under-prioritized features, a lower-dimensional representation of your data,
also known as a shadow or projection, is formed. In the shadow, some
information has been lost; it has fewer features, after all. You can actually
visualize how much information has been lost by taking each sample and moving
it to the nearest spot on the projection feature space. In the course's 2D
example dataset (the figure is not reproduced here), the orange line
represents the first principal component direction, and the gray line
represents the second principal component, the one that is going to get
dropped.
By dropping the gray component, the goal is to project the 2D points onto a
1D space: move the original 2D samples to their closest spot on the line.
Once you have projected all samples to their closest spot on the major
principal component, a shadow, or lower-dimensional representation, has been
formed.
The summed distance traveled by all moved samples is equal to the total
information lost by the projection. In an ideal situation, this lost
information should be dominated by highly redundant features and random noise.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#The fourth video and explanation in module four do little to help us. Their
#example uses an undefined variable named "df." If you know what it is, feel
#free to share it with me; my own guess at what it meant follows below. The
#only thing that matters is the following website:
"""http://scikit-learn.org/stable/"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/oG5eBt"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PRINCIPAL COMPONENT ANALYSIS' WEAKNESS
To use PCA effectively, you should be aware of its weaknesses. The first is
that it is sensitive to the scaling of your features. PCA maximizes
variability based on variance (the average squared difference of your samples
from the mean), and then projects your original data onto these directions of
maximal variance. If your data has a feature with a large variance and others
with small variances, PCA will load on the larger-variance feature. Being a
linear transformation, it will rotate and reorient your feature space so that
it diverts as much of the variance of the larger-variance feature (and some
of the other features' variances) into the first few principal components.
When it does this, a feature might go from having little impact on your
transformation to totally dominating the first principal component, or
vice versa. Standardizing your variables ahead of time is the way to free
your PCA results of such scaling issues. The cases where you should not use
standardization are when you know your feature variables, and thus their
importance, need to respect the specific scaling you've set up. In such a
case, PCA would be used on your raw data.
PCA is extremely fast, but for very large datasets it might take a while to
train. All of the matrix computations required for inversion, eigenvalue
decomposition, etc. bog it down. Since most real-world datasets are
typically very large and will have a level of noise in them, razor-sharp,
machine-precision matrix operations aren't always necessary. If you're
willing to sacrifice a bit of accuracy for computational efficiency,
SciKit-Learn offers a sister algorithm called RandomizedPCA that applies
some approximation techniques to speed up large-scale matrix computation.
You can use this for your larger datasets.
NOTE: The sklearn.decomposition.RandomizedPCA() method has since been
deprecated. If you would like to perform a randomized singular value
decomposition as the means of estimating PCA, update the svd_solver parameter
to equal the string 'randomized'. Whether or not the randomized approximation
actually runs faster than regular PCA depends on a number of factors,
including but not limited to: if there was any interference from other
running sub-processes, the size of your dataset, the data types used in your
dataset, and if the logic introduced to run the randomized selection can
complete faster than just running through the dataset regularly. Due to this,
for small to medium sized datasets, it might occasionally be faster to run
regular PCA.
The last issue to keep in mind is that PCA is a linear transformation only!
In graphical terms, it can rotate and translate the feature space of your
samples, but will not skew them. PCA will only, therefore, be able to capture
the underlying linear shapes and variance within your data and cannot discern
any complex, nonlinear intricacies. For such cases, you will have to make use
of other dimensionality reduction algorithms, such as Isomap.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#The lecturer also talked about Isomap's usage and results, but I skipped it
#because it does not seem very useful to me for now.