[Python] Notebook-8: How to plot histograms with pandas and align the bars with the x-axis ticks
This is not a typical class note. It is a lesson I figured out (or found) while floundering in the pool of histogram alignment.
Bar charts only work well when the number of keys (index labels) is small.
I was dealing with a column of a DataFrame last night and found that it has 649 keys (think of them as index labels). If you try to plot a bar chart for this column with DataFrame.plot.bar(), the x-axis is doomed to be jammed. Bar charts are easy to read only when the number of keys is small. Once the x-axis is jammed, nobody can tell what the labels say. Take the chart below as an example and you will see what I mean.
Let me use a DataFrame with 12 keys to show you how to plot a bar chart.
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017
@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot") # Look Pretty
age = {"Age" : [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
simple = pd.DataFrame(age)
simple.plot.bar(alpha = 0.5, color = "green", edgecolor = "white") #One bar per key (row).
plt.show() #Needed when running as a plain script.
Sure, you can count the occurrences of each value with your_data_name.value_counts() and then build a new Series or DataFrame that works fine with the bar chart commands. Yet that is unnecessary, right?
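For reference, the value_counts() route could look roughly like the sketch below. It reuses the 12-key Age data from above; the variable name counts is only illustrative. Note that value_counts() already returns a Series, and sort_index() puts the ages in ascending order.

```python
import pandas as pd

age = {"Age": [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
simple = pd.DataFrame(age)

# Count how often each age occurs, then order the result by age.
# The counts are: 9 -> 1, 10 -> 2, 11 -> 3, 12 -> 5, 13 -> 1.
counts = simple["Age"].value_counts().sort_index()
print(counts)
```

Calling counts.plot.bar() on the result draws one bar per distinct age, which is the extra step a histogram lets you skip.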
Hence, let us revisit histograms. For starters, you have to know what bins are and how annoying they can be. The number and width of the bins directly affect the information a histogram delivers, and a poor choice can even distort it. I load the G1 column from students.data and plot four histograms with different numbers of bins. You may notice that the peak value changes even when the number of bins increases by only 2 or 4.
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017
@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot") # Look Pretty
student_dataset = pd.read_csv("students.data", index_col=0)
my_series1 = student_dataset["G1"]
fig = plt.figure() #One figure that holds the four subplots.
ax1 = fig.add_subplot(2, 2, 1)
my_series1.plot.hist(alpha=0.5, color = "blue", bins = 16, ax = ax1)
ax1.set_title("bins = 16", fontsize = 12)
ax2 = fig.add_subplot(2, 2, 2)
my_series1.plot.hist(alpha=0.5, color = "green", bins = 14, ax = ax2)
ax2.set_title("bins = 14", fontsize = 12)
ax3 = fig.add_subplot(2, 2, 3)
my_series1.plot.hist(alpha=0.5, color = "red", bins = 12, ax = ax3)
ax3.set_title("bins = 12", fontsize = 12)
ax4 = fig.add_subplot(2, 2, 4)
my_series1.plot.hist(alpha=0.5, color = "orange", bins = 10, ax = ax4)
ax4.set_title("bins = 10", fontsize = 12)
fig.tight_layout() #Keep titles and axis labels from overlapping.
plt.show()
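The same effect can be reproduced without students.data. The sketch below is not part of the original script: it runs numpy.histogram (the function matplotlib's hist uses for binning) on the small Age list from earlier with two different bin counts. The names counts4 and counts5 are just for illustration.

```python
import numpy as np

ages = [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]

# Same data, two different bin counts over the same range (9 to 13).
counts4, edges4 = np.histogram(ages, bins=4)
counts5, edges5 = np.histogram(ages, bins=5)

print(counts4.tolist(), "peak =", counts4.max())  # [1, 2, 3, 6] peak = 6
print(counts5.tolist(), "peak =", counts5.max())  # [1, 2, 3, 5, 1] peak = 5
```

With 4 bins the ages 12 and 13 share the last bin, so the peak is 6. With 5 bins each age gets its own bin and the peak drops to 5.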
Alright, time to solve the quandary. Let's consider the DataFrame with 12 keys again. It holds five distinct values (9, 10, 11, 12, 13), so to plot it as a histogram you need 6 bin edges to enclose the samples in 5 different "bins". Let's take a look at the illustration below.
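The edge count can also be checked numerically. This is a minimal sketch using numpy.histogram with hand-written edges for the Age data; the edges list is an assumption chosen for illustration.

```python
import numpy as np

ages = [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]

# Six edges enclose five bins: [9,10), [10,11), [11,12), [12,13), [13,14].
edges = [9, 10, 11, 12, 13, 14]
counts, _ = np.histogram(ages, bins=edges)

print(len(edges), len(counts))  # 6 5
print(counts.tolist())          # [1, 2, 3, 5, 1]
```

Every bin is closed on the left and open on the right, except the last one, which also includes its right edge.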
How about giving it a shot? We pass in the bin edges themselves rather than the number of bins.
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017
@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot") # Look Pretty
age = {"Age" : [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
simple = pd.DataFrame(age)
simple.plot.hist(alpha = 0.5, color = "green", ec = "white", bins = range(6)) #First attempt: edges 0, 1, ..., 5.
plt.show()
Great! We got the result! Oh no, wait. Why is it completely blank? Ha ha. Because range(6) produces the sequence (0, 1, 2, 3, 4, 5) and our data starts from 9, no sample falls inside any bin, so a blank chart makes sense. As a result, we must shift the range passed to range() so that it covers the data.
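A quick way to confirm why the chart is empty is to bin the same data with numpy.histogram and look at the counts. This is only a sanity-check sketch, not part of the original script.

```python
import numpy as np

ages = [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]

# Edges 0..5 never reach the data (9..13), so every bin stays empty.
counts, _ = np.histogram(ages, bins=list(range(6)))
print(counts.tolist())  # [0, 0, 0, 0, 0]
```

Samples that lie outside the outermost edges are simply ignored, which is why no error is raised.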
You got it, right? Let's try again.
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017
@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot") # Look Pretty
age = {"Age" : [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]}
agedf = pd.DataFrame(age)
simple = agedf["Age"]
agemax = simple.max() #Find out the maximum in the Series.
agemin = simple.min() #Find out the minimum in the Series.
agebins = range(agemin, agemax+2, 1) #Edges 9, 10, ..., 14: one more edge than distinct values.
simple.plot.hist(ec = "w", align = "left", bins = agebins, rwidth = 0.5) #align = "left" centers each bar on its left edge.
plt.show()
See? It looks exactly the way we expect. The tools we need to complete the mission are .max(), .min(), and range() or numpy.arange(). Besides, remember that in range(a, b, c), a is the start, b is the stop, and c is the step. Thus the sequence that range(a, b, 1) generates is a, a+1, a+2, ..., b-1. That is why the stop above is agemax+2: the last edge has to be agemax+1.
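The off-by-one in the stop value is easy to check in isolation. A minimal sketch with the same Age data, using plain Python built-ins instead of the Series methods:

```python
ages = [11, 10, 12, 11, 10, 13, 12, 12, 11, 12, 12, 9]
agemin, agemax = min(ages), max(ages)

# range(a, b, 1) stops at b - 1, so the stop must be agemax + 2
# for the last edge to land on agemax + 1.
agebins = list(range(agemin, agemax + 2, 1))
print(agebins)  # [9, 10, 11, 12, 13, 14]
```

With these integer edges, align = "left" is what moves each bar onto its tick, because it centers the bar on the left edge of its bin.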
Here is a bonus: a histogram of a DataFrame column with 649 keys.
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 13 02:26:28 2017
@author: ShihHsing Chen
"""
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use("ggplot") # Look Pretty
student_dataset = pd.read_csv("students.data", index_col=0)
my_series = student_dataset["freetime"]
msmax = my_series.max() #Find out the maximum in the Series.
msmin = my_series.min() #Find out the minimum in the Series.
msbins = np.arange(msmin, msmax+2)-0.5 #Half-integer edges; range() only yields integers.
my_series.plot.hist(color = "purple", bins = msbins, ec = "w", rwidth = 0.6)
plt.xticks([1, 2, 3, 4, 5]) #Set the x-ticks we want.
plt.xlim(0, 6) #Limit the range of x-axis.
plt.xlabel("Hours", fontsize = 12)
plt.title("Histogram bar alignment with X-axis", fontsize = 12)
plt.show()
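The half-integer edge trick can be wrapped in a small helper. The function name centered_bin_edges and the ratings list below are hypothetical, made up for this sketch (students.data is not reproduced here); the idea is the same shift by 0.5 used in the script above.

```python
import numpy as np

def centered_bin_edges(values):
    """Return edges at k - 0.5 so every integer k sits in the middle of its bin.

    Hypothetical helper for integer-valued data; not part of pandas or numpy.
    """
    lo, hi = int(np.min(values)), int(np.max(values))
    return np.arange(lo, hi + 2) - 0.5

# Made-up integer ratings from 1 to 5 standing in for a real column.
ratings = [3, 2, 4, 3, 5, 1, 3, 4, 2, 3]
edges = centered_bin_edges(ratings)
counts, _ = np.histogram(ratings, bins=edges)

print(edges.tolist())   # [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
print(counts.tolist())  # [1, 2, 4, 2, 1]
```

The returned array can be passed as bins to plot.hist with the default alignment, since each bar is then already centered on an integer tick.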
Reference: https://goo.gl/89ohoh