
[Python] Notebook-9 Programming with Python For Data Science—Part.3




The third note on the fourth Python course, Programming with Python for Data Science.

This time I solved some headache-inducing problems,
including the one from the earlier Notebook-8 post and the subplot problem discussed below.

I also recommend registering an account on stackoverflow,
where you can browse past questions and answers.
If you cannot find an answer,
you can post a new question.
That is how my subplot problem got solved.


The contents include ggplot for prettier plots, fine-tuning and aligning histograms, adding a density distribution to a histogram, the gorgeous seaborn, 2D scatter plots and their subplots, 3D scatter plots, parallel coordinates, Andrews curves, and .imshow().
Last, there are supplementary materials to let you dive deeper.



# -*- coding: utf-8 -*-
"""
Created on Thu Oct 12 18:24:30 2017
@author: ShihHsing Chen
This is a note for the class 
"Programming with Python for Data Science".

Caution: This course uses Python 2.7, not Python 3.6. So I will update the
content for you when I know there is a difference.
"""

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/bGkWfx"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
                             HISTOGRAMS REVISIT

Histograms are one of the Seven Basic Tools of Quality, graphical 
techniques which have been identified as being most helpful for 
troubleshooting issues. Histograms help you understand the distribution of a 
feature in your dataset. They accomplish this by simultaneously answering the 
questions where in your feature's domain your records are located at, and how
many records exist there. Coincidentally, these two questions are also 
answered by the .unique() and .value_counts() methods discussed in the feature
wrangling section, but in a graphical way. Be sure to take note of this in the
exploring section of your course map!

Recall from the Features Primer section that there are two types of features:
continuous and categorical. Histograms are only really meaningful with 
categorical data. If you have a continuous feature, it must first be binned 
or discretized by transforming the continuous feature into a categorical one 
by grouping similar values together. To accomplish this, the entire range of 
values is divided into a series of intervals that are usually consecutive, 
equal in length, and non-overlapping. These intervals will become the 
categories. Then, a count of how many values fall into each interval serves 
as the categorical bin count.

If your interest lies in probabilities per bin rather than frequency counts, 
set the named parameter normed=True (density=True in Matplotlib 3.x), which 
normalizes your results so the total area under the histogram is one. 
MatPlotLib's online API documentation exposes many other features
and optional parameters that can be used with histograms, such as cumulative 
and histtype. 
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
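#The binning described above can be sketched with pandas' pd.cut. This is a
#minimal, standalone sketch with made-up scores (pd.cut is not used in the
#course code itself):

```python
import pandas as pd

# Hypothetical continuous feature: seven exam scores on a 0-20 scale.
scores = pd.Series([3.2, 7.8, 11.5, 11.9, 15.0, 18.6, 19.4])

# Discretize into four consecutive, equal-width, non-overlapping intervals.
binned = pd.cut(scores, bins=4)

# Count how many values fall into each interval: the categorical bin count.
counts = binned.value_counts().sort_index()
print(counts)
```

#These counts are exactly the bar heights a histogram with bins=4 would draw.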

#Read Wes McKinney's "Python for Data Analysis" for more complete 
#information. (Ch.9 Plotting and Visualization in 2nd Edition)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl 

mpl.style.use("ggplot") #Look Pretty
#If the above line throws an error, use plt.style.use('ggplot') instead

student_dataset = pd.read_csv("students.data", index_col=0)

fig0 = plt.figure(0) #We need to number the figures in case of overlapping.
fig0.subplots_adjust(hspace=0.5, wspace=0.5) #Adjust height-spacing to 
                                             #de-overlap titles and ticks

ax1 = fig0.add_subplot(2, 2, 1)         
my_series1 = student_dataset["G1"]  
my_series1.plot.hist(color = "blue", histtype = "bar", bins = 30, ax = ax1)
                                  #ax = ax1 was the big problem I ran into.
                                  #If you skip it, your figure will split
                                  #into pieces when you make subplots of 
                                  #scatter plots.

ax2 = fig0.add_subplot(2, 2, 2)
my_series2 = student_dataset["G2"]
my_series2.plot.hist(color = "green", histtype = "step", bins = 20, ax = ax2)

ax3 = fig0.add_subplot(2, 2, 3)
my_series3 = student_dataset["G3"]
my_series3.plot.hist(alpha=0.5, histtype = "stepfilled", bins = 10, ax = ax3)

ax4 = fig0.add_subplot(2, 2, 4)
my_series1.plot.hist(alpha=0.5, color = "blue", ax = ax4)
my_series2.plot.hist(alpha=0.5, color = "green", ax = ax4)
my_series3.plot.hist(alpha=0.5, color = "red", ax = ax4)

#At this moment, you may find that the alignment of the bars in our
#histograms goes against our intuition. What now? Try a bar chart? Please do
#not. We have 649 samples in this students.data file, which will jam your
#x-ticks into a disaster. I tried; take it from me. So what can we do? I
#almost pulled an all-nighter to get you this.
"""https://goo.gl/89ohoh"""

fig1 = plt.figure(1) 
fig1.subplots_adjust(hspace=0.5, wspace=0.5) #Adjust height-spacing to 
                                             #de-overlap titles and ticks

ax5 = fig1.add_subplot(2, 2, 1)
my_series4 = student_dataset["freetime"]
print(my_series4.value_counts(), "\n") #You can compare it with the figure.

ms4max = my_series4.max() #Find out the maximum in the Series.
ms4min = my_series4.min() #Find out the minimum in the Series.
ms4bins = range(ms4min, ms4max+2, 1)

                                         #c = color, ec = edgecolor
my_series4.plot.hist(bins = ms4bins, align = "left", ec = "w", ax = ax5)
plt.xticks([1, 2, 3, 4, 5]) #Set the x-ticks we want.
plt.xlim(0, 6) #Limit the range of x-axis.
plt.xlabel("Hours", fontsize = 12)
plt.title("Histogram bar alignment—Method 1", fontsize = 12)

ax6 = fig1.add_subplot(2, 2, 2)
ms4bins = np.arange(ms4max+2)-0.5
my_series4.plot.hist(color = "purple", bins = ms4bins, rwidth = 0.5, ax = ax6)

plt.xticks([1, 2, 3, 4, 5]) #Set the x-ticks we want.
plt.xlim(0, 6) #Limit the range of x-axis.
plt.xlabel("Hours", fontsize = 12)
plt.title("Histogram bar alignment—Method 2", fontsize = 12)

ax7 = fig1.add_subplot(2, 2, 3)
my_series5 = student_dataset["absences"]
print(my_series5.value_counts(), "\n") #You can compare it with the chart.

ms5max = my_series5.max() #Find out the maximum in the Series.
ms5min = my_series5.min() #Find out the minimum in the Series.
ms5bins = np.arange(ms5max+2)-0.5 #range() can't take decimal steps.

my_series5.plot.hist(edgecolor = "white", bins = ms5bins, ax = ax7)
plt.xticks(range(ms5min, ms5max+1, 1)) #Set the x-ticks we want.
plt.xlabel("Days", fontsize = 12)

ax8 = fig1.add_subplot(2, 2, 4)
my_series5.plot.hist(ec = "white", bins = ms5bins, normed = True, ax = ax8)
                     #normed was removed in Matplotlib 3.x; use density=True.
my_series5.plot.density() #A little bonus for your hard work.

plt.xlim(ms5min - 5, ms5max + 2) #Primarily to limit left hand side of x-axis.
plt.xticks(range(ms5min, ms5max+1, 1)) #Set the x-ticks we want.
plt.xlabel("Days", fontsize = 12)
plt.ylabel("Percentage", fontsize = 12)
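#Since normed was removed in Matplotlib 3.x, here is a minimal sketch of the
#modern equivalent, density=True, using random data rather than students.data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=1000)

# density=True normalizes bar heights so the total area under the
# histogram is 1, replacing the removed normed=True parameter.
counts, edges, _ = plt.hist(data, bins=20, density=True, histtype="bar")

# The areas of all bars sum to 1 when density=True.
area = (counts * np.diff(edges)).sum()
print(round(area, 6))  # → 1.0
```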

#This is another bonus for you. With this, you can plot a histogram and 
#density distribution together with one line of code. If you are going to be
#a data scientist, then seaborn could help you a lot.
import seaborn as sns
fig2 = plt.figure(2) 
my_series5 = student_dataset["absences"]
sns.distplot(my_series5) #distplot is deprecated in newer seaborn;
                         #sns.histplot(my_series5, kde=True) replaces it.

#I could not place a DataFrame histogram in the same subplots as the Series 
#histograms, though I do not know why. 
my_dataframe = student_dataset[['G3', 'G2', 'G1']] 
my_dataframe.plot.hist(alpha=0.5, histtype='barstacked', edgecolor = "white")
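#For what it is worth, in the pandas versions I have tried, a DataFrame
#histogram does accept ax= just like a Series histogram, so both can share one
#figure. A minimal sketch with synthetic stand-in data (students.data is not
#bundled here):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the G1/G2/G3 grade columns.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 20, size=(100, 3)),
                  columns=["G1", "G2", "G3"])

fig, (left, right) = plt.subplots(1, 2)

# A Series histogram and a DataFrame histogram can share one figure
# as long as each call is pinned to its own Axes via ax=.
df["G1"].plot.hist(ax=left, bins=10)
df.plot.hist(ax=right, alpha=0.5, histtype="barstacked", edgecolor="white")
print(len(fig.axes))  # → 2
```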

#2D Scatter Plot: Two methods to create subplots of scatter plots.
#We use the same dataset as we did for the pandas histograms. But you should 
#also know how to plot scatter plots with matplotlib directly.
"""https://goo.gl/vwcBVU"""
#Method 1
fig4 = plt.figure(4) 
ax9 = fig4.add_subplot(2, 2, 1)
student_dataset.plot.scatter(x = "freetime", y = "G1", ax = ax9)
                               #ax = ax9 is the key keyword argument.
                               #Histograms work fine without it, but not
                               #scatter plots.

ax10 = fig4.add_subplot(2, 2, 2)
student_dataset.plot.scatter(x = "freetime", y = "G2", ax = ax10)

ax11 = fig4.add_subplot(2, 2, 3)
student_dataset.plot.scatter(x = "freetime", y = "G3", ax = ax11)

#Method 2
fig, axes = plt.subplots(2, 2, figsize=(8, 5.5), sharex=False, sharey=False)
x = student_dataset["freetime"].values
for i in range(3):
    axes[i//2, i%2].scatter(x, student_dataset.iloc[:, i + 25].values)
fig5 = plt.figure(5)

#3D Scatter Plot
from mpl_toolkits.mplot3d import Axes3D #Without it, you get an error. 
fig6 = plt.figure(6)                    #But Spyder also flags the import as
                                        #"unused", and I do not know why.
                                        #That is weird.

ax12 = fig6.add_subplot(111, projection="3d")
ax12.set_xlabel("First Grade")  #x is G1, y is G3; the labels were swapped.
ax12.set_ylabel("Final Grade")
ax12.set_zlabel("Daily Alcohol")

ax12.scatter(student_dataset["G1"], student_dataset["G3"], \
           student_dataset["Dalc"], c='r', marker='.')
                                #When the interpreter sees "\", it ignores
                                #it and continues on the next line.

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/fdJJC4"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
                             PARALLEL COORDINATES

Each graphed observation is plotted as a polyline, a series of connected 
line segments.

Parallel coordinates are a useful charting technique you'll want to add to 
the exploring section of your course map. They are a higher dimensionality 
visualization technique because they allow you to easily view observations 
with more than three dimensions simply by tacking on additional parallel 
coordinates. However, at some point it becomes hard to comprehend the chart 
due to the sheer number of axes and also potentially due to the number of 
observations. If your data has more than 10 features, parallel coordinates 
might not do it for you.

Parallel coordinates are useful because polylines belonging to similar 
records tend to cluster together. To graph them with Pandas and MatPlotLib, 
you have to specify a feature to group by (it can be non-numeric). This 
results in each distinct value of that feature being assigned a unique color 
when charted. Here's an example of parallel coordinates using SciKit-Learn's 
Iris dataset.

Pandas' parallel coordinates interface is extremely easy to use, but use it 
with care. It only supports a single scale for all your axes. If you have 
some features that are on a small scale and others on a large scale, you'll 
have to deal with a compressed plot. For now, your only three options are to:

<1>Normalize your features before charting them
<2>Change the scale to a log scale
<3>Or create separate, multiple parallel coordinate charts, each one 
plotting only features with similar scales
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
from sklearn.datasets import load_iris
from pandas.plotting import parallel_coordinates #pandas.tools.plotting was
                                                 #removed in newer pandas.

#Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
dfiris = pd.DataFrame(data.data, columns = data.feature_names) 

dfiris["target_names"] = [data.target_names[i] for i in data.target]

#Parallel Coordinates Start Here:
fig7 = plt.figure(7)
parallel_coordinates(dfiris, "target_names")
plt.show()
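#Option <1> above (normalize your features before charting them) can be
#sketched like this, reusing the iris data with a min-max rescale so every
#axis spans the same [0, 1] range:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target_names"] = [data.target_names[i] for i in data.target]

# Min-max normalize every numeric feature to [0, 1] so no single
# large-scale axis compresses the others.
features = data.feature_names
df[features] = (df[features] - df[features].min()) / \
               (df[features].max() - df[features].min())

parallel_coordinates(df, "target_names")
print(float(df[features].min().min()), float(df[features].max().max()))
```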
#To be honest, I am not quite sure yet how to put parallel coordinates to 
#use. But we shall keep a record in case of future application.

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/PJMqfD"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
                               ANDREWS CURVES

An Andrews plot, also known as Andrews curve, helps you visualize higher 
dimensionality, multivariate data by plotting each of your dataset's 
observations as a curve. The feature values of the observation act as the 
coefficients of the curve, so observations with similar characteristics tend 
to group closer to each other. Due to this, Andrews curves have some use in 
outlier detection.

Just as with Parallel Coordinates, every plotted feature must be numeric 
since the curve equation is essentially the product of the observation's 
features vector (transposed) and the vector: (1/sqrt(2), sin(t), cos(t), 
sin(2t), cos(2t), sin(3t), cos(3t), ...) to create a Fourier series.

The Pandas implementation requires that you once again specify a GroupBy 
feature, which is then used to color code the curves as well as produce a 
chart legend.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
from pandas.plotting import andrews_curves #pandas.tools.plotting was
                                           #removed in newer pandas.
#Andrews Curves Start Here:
fig8 = plt.figure(8)
andrews_curves(dfiris, "target_names")
plt.show()
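#The Fourier-series definition quoted above can also be computed by hand,
#which helped me see what pandas is plotting. A minimal sketch (the helper
#name andrews_curve is mine, not pandas'):

```python
import numpy as np

def andrews_curve(x, t):
    """f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    for one observation x, evaluated at the points in array t."""
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    k = 1
    for i in range(1, len(x), 2):
        result = result + x[i] * np.sin(k * t)
        if i + 1 < len(x):
            result = result + x[i + 1] * np.cos(k * t)
        k += 1
    return result

t = np.linspace(-np.pi, np.pi, 5)
# An observation with 4 features, e.g. one iris flower's measurements.
print(andrews_curve([5.1, 3.5, 1.4, 0.2], t))
```

#Observations with similar feature vectors produce nearly identical curves,
#which is why the classes cluster in the Andrews plot.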
#To be honest again, I am not quite sure yet how to put Andrews curves to 
#use either. But we shall keep a record in case of future application.

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/Pei3Bf"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
                                  imshow

One last higher dimensionality visualization technique you should know how 
to use is MatPlotLib's .imshow() method. This command generates an image 
based on the normalized values stored in a matrix, or rectangular array 
of float64s. The properties of the generated image will depend on the 
dimensions and contents of the array passed in:

<1>An [X, Y] shaped array will result in a grayscale image being generated
<2>A [X, Y, 3] shaped array results in a full-color image: 1 channel for red, 
1 for green, and 1 for blue
<3>A [X, Y, 4] shaped array results in a full-color image as before with an 
extra channel for alpha

Besides being a straightforward way to display .PNG and other images, the 
.imshow() method has quite a few other use cases. When you use the .corr() 
method on your dataset, Pandas calculates a correlation matrix for you that 
measures how close to being linear the relationship between any two features 
in your dataset are. Correlation values may range from -1 to 1, where 1 would 
mean the two features are perfectly positively correlated and have identical 
slopes for all values. -1 would mean they are perfectly negatively correlated,
and have a negative slope for one another, again being linear. Values closer 
to 0 mean there is little to zero linear relationship between the two 
variables at all (e.g., pizza sales and plant growth), and so the further 
away from 0 the value is, the stronger the relationship between the features.

.imshow() can help you any time you have a square matrix you want to 
visualize. Other matrices you might want to visualize include the covariance 
matrix, the confusion matrix, and in the future once you learn how to use 
certain machine learning algorithms that generate clusters which live in your 
feature-space, you'll also be able to use .imshow() to peek into the brain 
of your algorithms as they run, so long as your features represent a 
rectangular image!
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
df.corr()

fig9 = plt.figure(9)
plt.imshow(df.corr(), cmap=plt.cm.Blues, interpolation = "nearest")
plt.colorbar()

tick_marks = [i for i in range(len(df.columns))]
plt.xticks(tick_marks, df.columns, rotation = "vertical")
plt.yticks(tick_marks, df.columns)

plt.show()
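#The covariance matrix mentioned above works with the same recipe, since
#df.cov() is also a square, feature-by-feature matrix. A minimal sketch with
#the same shape of random DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

# Any square matrix can be visualized the same way as the correlation matrix.
cov = df.cov()
plt.imshow(cov, cmap=plt.cm.Blues, interpolation="nearest")
plt.colorbar()
ticks = range(len(df.columns))
plt.xticks(ticks, df.columns, rotation="vertical")
plt.yticks(ticks, df.columns)
print(cov.shape)  # → (4, 4)
```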

#Reading materials for you to dive deeper, including interesting radar charts.
"""https://goo.gl/T4Sga1"""
