
[Python] Notebook-8 How to plot histograms with pandas and align the bars with the x-axis ticks


This is not a typical class note. It is a record of the solution I worked out (or found) last night while I kept running into trouble aligning the bars of a histogram.

I was working on a column of a DataFrame last night and found that it has 649 keys (think of them as the index). If you plot such a column with DataFrame.plot.bar(), the x-axis gets hopelessly crowded. Bar charts are readable only when the number of keys is small; once there are too many, you can no longer tell what the x-axis labels say. The chart below shows what I mean.



Let me use a DataFrame with 12 keys to show how to plot a bar chart.


# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017

@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use("ggplot") # Look Pretty

age = {"Age" : [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
simple = pd.DataFrame(age)
simple.plot.bar(alpha = 0.5, color = "green", edgecolor = "white")



Sure, you can count the occurrences of each value with your_data_name.value_counts() and then build a new Series or DataFrame to feed to the bar chart commands. But that is an unnecessary detour, isn't it?
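For completeness, here is a minimal sketch of that detour, using the same 12 ages as the example above. The variable names are my own, and I left the plotting call as a comment so the snippet only needs pandas.

```python
import pandas as pd

# Same toy data as the 12-key example above.
ages = pd.Series([11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9], name="Age")

# value_counts() orders by frequency, so sort_index() restores
# the numeric order of the ages before plotting.
counts = ages.value_counts().sort_index()
print(counts)

# counts.plot.bar() would then draw one bar per distinct age.
```

This works, but it adds a counting step that a histogram does for you.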

So let us go back to histograms. Histograms come with bins, though, and bins are a nuisance: their number and width directly affect the shape of the histogram and can even distort the information it conveys. Here I load the G1 column from students.data and plot four histograms with different numbers of bins. Notice that the peak changes even when the number of bins increases by only 2 or 4.



# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017

@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use("ggplot") # Look Pretty

student_dataset = pd.read_csv("students.data", index_col=0)
my_series1 = student_dataset["G1"]
fig = plt.figure() #Use it to create subplots.

ax1 = fig.add_subplot(2, 2, 1)
my_series1.plot.hist(alpha=0.5, color = "blue", bins = 16, ax = ax1)
plt.title("bins = 16", fontsize = 12)

ax2 = fig.add_subplot(2, 2, 2)
my_series1.plot.hist(alpha=0.5, color = "green", bins = 14, ax = ax2)
plt.title("bins = 14", fontsize = 12)

ax3 = fig.add_subplot(2, 2, 3)
my_series1.plot.hist(alpha=0.5, color = "red", bins = 12, ax = ax3)
plt.title("bins = 12", fontsize = 12)

ax4 = fig.add_subplot(2, 2, 4)
my_series1.plot.hist(alpha=0.5, color = "orange", bins = 10, ax = ax4)
plt.title("bins = 10", fontsize = 12)
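If you do not have students.data at hand, the same effect can be seen with numbers alone. The sketch below uses a small made-up list of integer scores (not the real G1 column) and numpy.histogram() to show how the tallest bar changes with the number of bins.

```python
import numpy as np

# Made-up integer scores for illustration only; this is NOT the G1 data.
scores = [5, 6, 7, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11, 12, 13, 14, 15]

peaks = {}
for n_bins in (10, 5, 4):
    counts, edges = np.histogram(scores, bins=n_bins)
    peaks[n_bins] = int(counts.max())
    print(n_bins, "bins -> edges", edges, "counts", counts)

print(peaks)
```

With 4 bins the edges land on 7.5 and 12.5, so some bars swallow three integer scores and others only two. That is why the shape shifts when the bins do not line up with the data values.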




Alright, time to solve the problem. Consider the DataFrame with 12 keys again. It holds five distinct values (9, 10, 11, 12, 13), so a histogram needs six bin edges to enclose them in five bins. Take a look at the illustration below.
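The edge counting can be checked numerically before drawing anything. This is a small sketch with numpy.histogram() on the same 12 ages; the edges 9 through 14 are written out by hand here.

```python
import numpy as np

ages = [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]

# Six edges enclose five unit-wide bins: [9,10), [10,11), ..., [13,14].
edges = np.arange(9, 15)
counts, _ = np.histogram(ages, bins=edges)

print(len(edges), "edges ->", len(counts), "bins")
print(counts)
```

Each bin now holds exactly one distinct age, so the counts match the number of times each age appears.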




Let's give it a shot. Instead of passing the number of bins, we pass the bin edges using range().



# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017

@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use("ggplot") # Look Pretty

age = {"Age" : [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
simple = pd.DataFrame(age)
simple.plot.hist(alpha = 0.5, color = "green", ec = "white", bins = range(6))




We got the result! Great! Oh wait, it is completely blank. Why? Because range(6) produces the sequence (0, 1, 2, 3, 4, 5) while our data starts at 9, so no sample falls into any bin. We therefore have to shift and extend the span covered by range(), as illustrated below.
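A quick way to see the mismatch without plotting is to compare the edges with the data directly. This is plain Python and only a sanity check, not part of the original script.

```python
ages = [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]

edges = list(range(6))  # the edges we passed by mistake
inside = [a for a in ages if edges[0] <= a <= edges[-1]]

print(edges)              # the span covered by the bins
print(min(ages), max(ages))  # the span covered by the data
print(inside)             # samples that fall inside the bins
```

The list of samples inside the bins is empty, which is exactly what the blank chart was telling us.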







Got it? Let's try again.



# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017

@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use("ggplot") # Look Pretty

age = {"Age" : [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
agedf = pd.DataFrame(age)
simple = agedf["Age"]
agemax = simple.max() #Find out the maximum in the Series.
agemin = simple.min() #Find out the minimum in the Series.
agebins = range(agemin, agemax+2, 1)
simple.plot.hist(ec = "w", align= "left", bins = agebins, rwidth = 0.5)





See? It looks exactly the way we wanted. All we need are .max(), .min(), and range() or numpy.arange(). Remember that in range(a, b, c), a is the start, b is the stop, and c is the step, so range(a, b, 1) generates the sequence a, a+1, a+2, ..., b-1.
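The recipe can be wrapped in a small helper. The function below, integer_bin_edges, is a hypothetical name of my own and not a pandas or NumPy function; it is a sketch that assumes integer data. With centered=True it shifts the edges by 0.5 so that every bin is centered on an integer, which needs numpy.arange() because the edges are no longer integers.

```python
import numpy as np
import pandas as pd

def integer_bin_edges(values, centered=True):
    """Hypothetical helper: one unit-wide bin per integer from min to max."""
    lo, hi = int(values.min()), int(values.max())
    edges = np.arange(lo, hi + 2)  # stop is exclusive, hence +2
    return edges - 0.5 if centered else edges

ages = pd.Series([11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9])
edges = integer_bin_edges(ages)
counts, _ = np.histogram(ages, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2

print(edges)
print(centers)
print(counts)
```

Passing these centered edges as bins= should put each bar on its tick without align="left"; the uncentered edges correspond to the range(agemin, agemax+2, 1) call used above.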
 


Here is a bonus: a histogram of a DataFrame column with 649 keys.



# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017

@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use("ggplot") # Look Pretty

student_dataset = pd.read_csv("students.data", index_col=0)
my_series = student_dataset["freetime"]
print(my_series.value_counts())
msmax = my_series.max() #Find out the maximum in the Series.
msmin = my_series.min() #Find out the minimum in the Series.
msbins = np.arange(msmax+2)-0.5 #range() can't take decimal steps.
my_series.plot.hist(color = "purple", bins = msbins, ec = "w", rwidth = 0.6)
plt.xticks([1, 2, 3, 4, 5]) #Set the x-ticks we want.
plt.xlim(0, 6) #Limit the range of x-axis.
plt.xlabel("Hours", fontsize = 12)
plt.title("Histogram bar alignment with X-axis", fontsize = 12)






Reference: https://goo.gl/89ohoh

