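This is my third set of notes on my fourth Python course, Programming with Python for Data Science.
This time I solved a few problems that had been giving me a headache, including the Notebook-8 issue I posted earlier and the subplot problem discussed below.
I also recommend registering an account on Stack Overflow so that you can browse past questions and answers. If you cannot find an answer there, you can ask a new question. That is how my subplot problem got solved.
This post covers ggplot, which makes plots look better; fine-tuning and aligning histograms; adding a density curve to a histogram; seaborn, which makes plots look gorgeous; 2D scatter plots and their subplots; 3D scatter plots; parallel coordinates; Andrews curves; and .imshow().
Supplementary reading at the end lets you dive deeper.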
# -*- coding: utf-8 -*-
"""
Created on Thu Oct 12 18:24:30 2017
@author: ShihHsing Chen
This is a note for the class
"Programming with Python for Data Science".
Caution: This course uses Python 2.7, not Python 3.6, so I will update the
content for you wherever I know there is a difference.
"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/bGkWfx"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
HISTOGRAMS REVISIT
Histograms are one of the Seven Basic Tools of Quality, graphical
techniques which have been identified as being most helpful for
troubleshooting issues. Histograms help you understand the distribution of a
feature in your dataset. They accomplish this by simultaneously answering the
questions of where in your feature's domain your records are located and how
many records exist there. Coincidentally, these two questions are also
answered by the .unique() and .value_counts() methods discussed in the feature
wrangling section, but in a graphical way. Be sure to take note of this in the
exploring section of your course map!
Recall from the Features Primer section that there are two types of features:
continuous and categorical. Histograms are only really meaningful with
categorical data. If you have a continuous feature, it must first be binned
or discretized by transforming the continuous feature into a categorical one
by grouping similar values together. To accomplish this, the entire range of
values is divided into a series of intervals that are usually consecutive,
equal in length, and non-overlapping. These intervals will become the
categories. Then, a count of how many values fall into each interval serves
as the categorical bin count.
If your interest lies in probabilities per bin rather than frequency counts,
set the named parameter normed=True (density=True from Matplotlib 2.1 onward;
normed was removed in Matplotlib 3.1), which normalizes the bars so that their
total area is 1. MatPlotLib's online API documentation exposes many other
features and optional parameters that can be used with histograms, such as
cumulative and histtype.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Read Wes McKinney's "Python for Data Analysis" for more
#complete information. (Ch.9 Plotting and Visualization in 2nd Edition)
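#A minimal sketch of the binning described in the course text above, written
#in plain Python so the arithmetic stays visible. The helper names and the
#sample values are made up for illustration; they are not part of the course
#or of Matplotlib. It assumes the values are not all identical, so the bin
#width is non-zero.

```python
def equal_width_bins(values, n_bins):
    """Return (edges, counts) for n_bins consecutive, equal-length intervals."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    edges = [lowest + k * width for k in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The last interval is closed on the right, so the maximum
        # value lands in the final bin instead of falling off the end.
        k = min(int((v - lowest) / width), n_bins - 1)
        counts[k] += 1
    return edges, counts


def to_density(edges, counts):
    """Scale the counts so that the bar areas (height * width) sum to 1."""
    total = sum(counts)
    return [c / (total * (edges[k + 1] - edges[k])) for k, c in enumerate(counts)]


demo_values = [0.0, 1.5, 2.0, 2.5, 4.0, 6.0, 7.5, 8.0]  # hypothetical feature
demo_edges, demo_counts = equal_width_bins(demo_values, 4)
demo_density = to_density(demo_edges, demo_counts)
```

#Note that density scaling makes the bar areas, not the bar heights, sum to 1.
#The heights only read as per-bin probabilities when every bin has width 1.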
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use("ggplot") #Look Pretty
#If the above line throws an error, use plt.style.use('ggplot') instead
student_dataset = pd.read_csv("students.data", index_col=0)
fig0 = plt.figure(0) #Number each figure so the plots do not overlap.
fig0.subplots_adjust(hspace=0.5, wspace=0.5) #Adjust the spacing so that
#titles and ticks do not overlap
ax1 = fig0.add_subplot(2, 2, 1)
my_series1 = student_dataset["G1"]
my_series1.plot.hist(color = "blue", histtype = "bar", bins = 30, ax = ax1)
#ax = ax1 was the big problem I ran into.
#If you skip it, your figure will split
#into pieces when you make subplots of
#scatter plots.
ax2 = fig0.add_subplot(2, 2, 2)
my_series2 = student_dataset["G2"]
my_series2.plot.hist(color = "green", histtype = "step", bins = 20, ax = ax2)
ax3 = fig0.add_subplot(2, 2, 3)
my_series3 = student_dataset["G3"]
my_series3.plot.hist(alpha=0.5, histtype = "stepfilled", bins = 10, ax = ax3)
ax4 = fig0.add_subplot(2, 2, 4)
my_series1.plot.hist(alpha=0.5, color = "blue", ax = ax4)
my_series2.plot.hist(alpha=0.5, color = "green", ax = ax4)
my_series3.plot.hist(alpha=0.5, color = "red", ax = ax4)
#At this point, you may find that the alignment of the bars in our histograms
#goes against our intuition. What now? Try a bar chart? Please do not do that.
#We have 649 samples in this students.data file, which will jam your x-ticks
#into a disaster. I tried, take it from me. So what can we do? I almost pulled
#an all-nighter to get you this.
"""https://goo.gl/89ohoh"""
fig1 = plt.figure(1)
fig1.subplots_adjust(hspace=0.5, wspace=0.5) #Adjust the spacing so that
#titles and ticks do not overlap
ax5 = fig1.add_subplot(2, 2, 1)
my_series4 = student_dataset["freetime"]
print(my_series4.value_counts(), "\n") #You can compare it with the figure.
ms4max = my_series4.max() #Find out the maximum in the Series.
ms4min = my_series4.min() #Find out the minimum in the Series.
ms4bins = range(ms4min, ms4max+2, 1)
#c = color, ec = edgecolor
my_series4.plot.hist(bins = ms4bins, align = "left", ec = "w", ax = ax5)
plt.xticks([1, 2, 3, 4, 5]) #Set the x-ticks we want.
plt.xlim(0, 6) #Limit the range of x-axis.
plt.xlabel("Hours", fontsize = 12)
plt.title("Histogram bar alignment—Method 1", fontsize = 12)
ax6 = fig1.add_subplot(2, 2, 2)
ms4bins = np.arange(ms4max+2)-0.5
my_series4.plot.hist(color = "purple", bins = ms4bins, rwidth = 0.5, ax = ax6)
plt.xticks([1, 2, 3, 4, 5]) #Set the x-ticks we want.
plt.xlim(0, 6) #Limit the range of x-axis.
plt.xlabel("Hours", fontsize = 12)
plt.title("Histogram bar alignment—Method 2", fontsize = 12)
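#A small sketch of why Method 2 centers each bar on its integer value: the
#bin edges sit at k - 0.5, so every integer k is the midpoint of its bin.
#The helper name is hypothetical, and unlike the np.arange line above it
#starts at the lowest value rather than at 0.

```python
def centered_integer_edges(lowest, highest):
    """Bin edges at k - 0.5 so each integer k sits in the middle of its bin."""
    return [k - 0.5 for k in range(lowest, highest + 2)]


# "freetime" takes the integer values 1..5 in the plots above.
demo_int_edges = centered_integer_edges(1, 5)
# Midpoint of each pair of neighbouring edges.
demo_int_centers = [(a + b) / 2 for a, b in zip(demo_int_edges, demo_int_edges[1:])]
```

#Method 1 reaches the same picture differently: it keeps integer edges and
#shifts the bars with align = "left".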
ax7 = fig1.add_subplot(2, 2, 3)
my_series5 = student_dataset["absences"]
print(my_series5.value_counts(), "\n") #You can compare it with the chart.
ms5max = my_series5.max() #Find out the maximum in the Series.
ms5min = my_series5.min() #Find out the minimum in the Series.
ms5bins = np.arange(ms5max+2)-0.5 #range() can't take decimal steps.
my_series5.plot.hist(edgecolor = "white", bins = ms5bins, ax = ax7)
plt.xticks(range(ms5min, ms5max+1, 1)) #Set the x-ticks we want.
plt.xlabel("Days", fontsize = 12)
ax8 = fig1.add_subplot(2, 2, 4)
#normed was removed in Matplotlib 3.1; density = True is its replacement.
my_series5.plot.hist(ec = "white", bins = ms5bins, density = True, ax = ax8)
my_series5.plot.density(ax = ax8) #A little bonus for your hard work.
plt.xlim(ms5min - 5, ms5max + 2) #Primarily to limit left hand side of x-axis.
plt.xticks(range(ms5min, ms5max+1, 1)) #Set the x-ticks we want.
plt.xlabel("Days", fontsize = 12)
plt.ylabel("Percentage", fontsize = 12)
#This is another bonus for you. With this, you can plot a histogram and
#density distribution together with one line of code. If you are going to be
#a data scientist, then seaborn could help you a lot.
import seaborn as sns
fig2 = plt.figure(2)
my_series5 = student_dataset["absences"]
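sns.distplot(my_series5)
#distplot is deprecated since seaborn 0.11. On newer versions, a close
#replacement is sns.histplot(my_series5, kde = True, stat = "density").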
#Without ax=, a DataFrame plot opens a new figure, whereas a Series plot draws
#on the current axes. Pass ax= to place a DataFrame histogram in a subplot.
my_dataframe = student_dataset[['G3', 'G2', 'G1']]
my_dataframe.plot.hist(alpha=0.5, histtype='barstacked', edgecolor = "white")
#2D Scatter Plot: Two methods to create subplots of scatter plots
#We use the same dataset as for the histograms and plot it with pandas. But you
#should also know how to plot scatter plots with matplotlib directly.
"""https://goo.gl/vwcBVU"""
#Method 1
fig4 = plt.figure(4)
ax9 = fig4.add_subplot(2, 2, 1)
student_dataset.plot.scatter(x = "freetime", y = "G1", ax = ax9)
#ax = ax9 is the key keyword argument. Histograms work
#fine without it, but not scatter plots.
ax10 = fig4.add_subplot(2, 2, 2)
student_dataset.plot.scatter(x = "freetime", y = "G2", ax = ax10)
ax11 = fig4.add_subplot(2, 2, 3)
student_dataset.plot.scatter(x = "freetime", y = "G3", ax = ax11)
#Method 2
fig, axes = plt.subplots(2, 2, figsize=(8, 5.5), sharex=False, sharey=False)
x = student_dataset["freetime"].values
for i in range(3):
    axes[i//2, i%2].scatter(x, student_dataset.iloc[:, i + 25].values)
fig5 = plt.figure(5)
#3D Scatter Plot
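#Importing Axes3D registers the "3d" projection. Before Matplotlib 3.2 you get
#an error without it. Spyder still reports the import as "unused" because the
#name Axes3D is never referenced; only the side effect of the import matters.
from mpl_toolkits.mplot3d import Axes3D
fig6 = plt.figure(6)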
ax12 = fig6.add_subplot(111, projection="3d")
ax12.set_xlabel("First Grade")
ax12.set_ylabel("Final Grade")
ax12.set_zlabel("Daily Alcohol")
ax12.scatter(student_dataset["G1"], student_dataset["G3"],
             student_dataset["Dalc"], c='r', marker='.')
#x is G1 (first grade) and y is G3 (final grade), so the labels follow that
#order. Inside parentheses a statement can span lines without a trailing "\".
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/fdJJC4"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PARALLEL COORDINATES
Each graphed observation is plotted as a polyline, a series of connected
line segments.
Parallel coordinates are a useful charting technique you'll want to add to the
exploring section of your course map. They are a higher dimensionality
visualization technique because they allow you to easily view observations
with more than three dimensions simply by tacking on additional parallel
coordinates. However at some point, it becomes hard to comprehend the chart
anymore due to the sheer number of axes and also potentially due to the
number of observations. If your data has more than 10 features, parallel
coordinates might not do it for you.
Parallel coordinates are useful because polylines belonging to similar
records tend to cluster together. To graph them with Pandas and MatPlotLib,
you have to specify a feature to group by (it can be non-numeric). This
results in each distinct value of that feature being assigned a unique color
when charted. Here's an example of parallel coordinates using SciKit-Learn's
Iris dataset.
Pandas' parallel coordinates interface is extremely easy to use, but use it
with care. It only supports a single scale for all your axes. If you have
some features that are on a small scale and others on a large scale, you'll
have to deal with a compressed plot. For now, your only three options are to:
<1>Normalize your features before charting them
<2>Change the scale to a log scale
<3>Or create separate, multiple parallel coordinate charts, each one plotting
only features with similar domain scales
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
from sklearn.datasets import load_iris
#pandas.tools.plotting moved to pandas.plotting in pandas 0.20.
from pandas.plotting import parallel_coordinates
#Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
dfiris = pd.DataFrame(data.data, columns = data.feature_names)
dfiris["target_names"] = [data.target_names[i] for i in data.target]
#Parallel Coordinates Start Here:
fig7 = plt.figure(7)
parallel_coordinates(dfiris, "target_names")
plt.show()
#To be honest, I am not quite sure when to use parallel coordinates. But we
#shall keep a record in case of future application.
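#A minimal sketch of option <1> from the course text above: min-max scaling
#puts every feature on the same 0-to-1 range before charting. It is written in
#plain Python; the function name and the toy numbers are assumptions made for
#illustration only.

```python
def min_max_normalize(columns):
    """Rescale each named column of numbers to the range [0, 1]."""
    scaled = {}
    for name, values in columns.items():
        lowest, highest = min(values), max(values)
        span = highest - lowest
        # A constant column has no spread; map it to 0.0 instead of
        # dividing by zero.
        scaled[name] = [(v - lowest) / span if span else 0.0 for v in values]
    return scaled


demo_features = {
    "small_scale": [1.0, 2.0, 3.0],        # hypothetical feature
    "large_scale": [100.0, 300.0, 500.0],  # hypothetical feature
}
demo_normalized = min_max_normalize(demo_features)
```

#After scaling, both toy features span the same range, so neither one would
#be squashed flat when they share a single axis scale.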
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/PJMqfD"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
ANDREWS CURVES
An Andrews plot, also known as Andrews curve, helps you visualize higher
dimensionality, multivariate data by plotting each of your dataset's
observations as a curve. The feature values of the observation act as the
coefficients of the curve, so observations with similar characteristics tend
to group closer to each other. Due to this, Andrews curves have some use in
outlier detection.
Just as with Parallel Coordinates, every plotted feature must be numeric
since the curve equation is essentially the product of the observation's
features vector (transposed) and the vector: (1/sqrt(2), sin(t), cos(t),
sin(2t), cos(2t), sin(3t), cos(3t), ...) to create a Fourier series.
The Pandas implementation requires you once again to specify a GroupBy
feature, which is then used to color code the curves as well as produce a
chart legend.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#pandas.tools.plotting moved to pandas.plotting in pandas 0.20.
from pandas.plotting import andrews_curves
#Andrews Curves Start Here:
fig8 = plt.figure(8)
andrews_curves(dfiris, "target_names")
plt.show()
#To be honest again, I am not quite sure when to use Andrews curves,
#either. But we shall keep a record in case of future application.
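#A minimal sketch of the curve equation quoted in the course text above: the
#dot product of one observation's features with the vector (1/sqrt(2),
#sin(t), cos(t), sin(2t), cos(2t), ...). The function name, the sample
#measurements, and the choice of 101 angles from -pi to pi are assumptions
#for illustration; this is not the pandas implementation.

```python
import math


def andrews_value(features, t):
    """Evaluate one observation's Andrews curve at angle t."""
    total = features[0] / math.sqrt(2)
    for i, x in enumerate(features[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        basis = math.sin if i % 2 == 1 else math.cos  # sin, cos, sin, cos, ...
        total += x * basis(harmonic * t)
    return total


demo_observation = [5.1, 3.5, 1.4, 0.2]  # four sample measurements
demo_angles = [-math.pi + k * (2 * math.pi / 100) for k in range(101)]
demo_curve = [andrews_value(demo_observation, t) for t in demo_angles]
```

#Observations with similar feature values get similar coefficients, so their
#curves stay close to each other over the whole range of t.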
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/Pei3Bf"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
imshow
One last higher-dimensionality visualization technique you should know how
to use is MatPlotLib's .imshow() method. This command generates an image
based off of the normalized values stored in a matrix, or rectangular array
of float64s. The properties of the generated image will depend on the
dimensions and contents of the array passed in:
<1>An [X, Y] shaped array will result in a grayscale image being generated
<2>A [X, Y, 3] shaped array results in a full-color image: 1 channel for red,
1 for green, and 1 for blue
<3>A [X, Y, 4] shaped array results in a full-color image as before with an
extra channel for alpha
Besides being a straightforward way to display .PNG and other images, the
.imshow() method has quite a few other use cases. When you use the .corr()
method on your dataset, Pandas calculates a correlation matrix for you that
measures how close to being linear the relationship between any two features
in your dataset is. Correlation values may range from -1 to 1, where 1 would
mean the two features are perfectly positively correlated and lie exactly on
a line with a positive slope. -1 would mean they are perfectly negatively
correlated and lie exactly on a line with a negative slope. Values closer
to 0 mean there is little to zero linear relationship between the two
variables at all (e.g., pizza sales and plant growth), and so the further
away from 0 the value is, the stronger the relationship between the features.
.imshow() can help you any time you have a square matrix you want to
visualize. Other matrices you might want to visualize include the covariance
matrix and the confusion matrix. In the future, once you learn how to use
certain machine learning algorithms that generate clusters which live in your
feature-space, you'll also be able to use .imshow() to peek into the brain
of your algorithms as they run, so long as your features represent a
rectangular image!
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
df.corr()
fig9 = plt.figure(9)
plt.imshow(df.corr(), cmap=plt.cm.Blues, interpolation = "nearest")
plt.colorbar()
tick_marks = list(range(len(df.columns)))
plt.xticks(tick_marks, df.columns, rotation = "vertical")
plt.yticks(tick_marks, df.columns)
plt.show()
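#A minimal sketch of what each cell of the correlation matrix measures,
#assuming the default Pearson coefficient that .corr() uses. It is written in
#plain Python with made-up numbers; the function and variable names are
#hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


demo_x = [1.0, 2.0, 3.0, 4.0]
demo_rising = [2 * x + 1 for x in demo_x]    # exact line, slope 2
demo_falling = [10 - 3 * x for x in demo_x]  # exact line, slope -3
demo_vshape = [1.0, -1.0, -1.0, 1.0]         # symmetric, not linear

r_rising = pearson(demo_x, demo_rising)
r_falling = pearson(demo_x, demo_falling)
r_vshape = pearson(demo_x, demo_vshape)
```

#The two exact lines give +1 and -1 even though their slopes differ, and the
#symmetric pattern gives 0: the coefficient measures how linear a
#relationship is, not how steep it is.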
#Reading materials for you to dive deeper, including interesting radar charts.
"""https://goo.gl/T4Sga1"""