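I have finished my first course on Data Science!
But I see there are still two or three more to go (sweat...),
and after that I also have to learn Linux (Ubuntu).
Well!
Looks like I need to keep working hard.
The notes below are compiled from Introduction to Python for Data Science.
They cover numpy.ndarray, numpy.zeros, numpy.round, .shape, numpy.random.normal(a, b, c), numpy.column_stack((e, f, ...)), numpy.mean, numpy.median, and numpy.std,
as well as matplotlib.pyplot, .plot(), .scatter(), .xticks(), .fill_between(), .show(), random, .sample(), .hist(), .xlabel(), .ylabel(), .title(), pandas, .read_csv(), and .loc.
Once you reach the plotting part,
if you use Spyder like I do, you will notice that the figures are all displayed inside the IPython Console.
To change this, follow these steps: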
Enable interactive 3d plots in Spyder by going to Tools > Preferences > IPython Console > Graphics. From this page, set Backend to Automatic.
This step comes from the fourth Python course I took, Programming with Python for Data Science.
Once you have done this,
it looks a lot more like Matlab.
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 20 22:06:18 2017
@author: ShihHsing Chen
This is a note for the class
"Introduction to Python for Data Science".
"""
#We have already learned about strings and lists, but we cannot
#apply arithmetic to all the elements of a list at once.
#Hence, we need the help of numpy. We can import it as np, which is
#the usual alias for numpy.
import numpy as np
height = [1.73, 1.71, 1.89, 1.80]
weight = [65.4, 70, 81.5, 90.4]
height_array = np.array(height)
weight_array = np.array(weight)
BMI_float = weight_array / height_array **2
print("Original Float Array: ", BMI_float)
print("Its shape is", BMI_float.shape, "\n") #We need to be extremely careful
                                             #about the shape of each array,
                                             #or we will get an error.
BMI_rounded = np.zeros(shape = (4, )) #Make an array of zeros with the same shape.
print("Zero Array: ", BMI_rounded)
print("Its shape is", BMI_rounded.shape, "\n")
for count, element in enumerate(BMI_float):
    BMI_rounded[count] = round(element, 2) #We use round() to get rid
                                           #of unwanted digits.
print("Rounded Array", BMI_rounded)
#Or you can use print("Rounded Array", np.round(BMI_float, 2)) to replace
#the np.zeros() call and the loop above.
print("Its shape is", BMI_rounded.shape, "\n")
#If you put different types of elements in a list and call
#np.array on the list, all the elements will be cast
#to the same data type. Let's look at an example.
mixed = [1.93, "Kobe", True]
print(type(mixed), mixed)
print(type(mixed[0]), type(mixed[1]), type(mixed[2]), "\n")
mixed_array = np.array(mixed)
print(type(mixed_array), mixed_array)
print(type(mixed_array[0]), type(mixed_array[1]), type(mixed_array[2]), "\n")
#Take a look at the type of mixed_array that we just printed.
#It is numpy.ndarray, right? numpy means that the type is defined in
#numpy. ndarray means n-dimensional array, where n can be 1, 2, 3, or more.
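#A quick check of the casting described above. This is only a minimal sketch;
#the names demo_mixed and demo_array are made up for this illustration.
#The dtype attribute shows the single type that numpy picked for all elements.
import numpy as np #Repeated so that this snippet also runs on its own.
demo_mixed = [1.93, "Kobe", True]
demo_array = np.array(demo_mixed)
print(demo_array.dtype.kind) #Prints U, the code for a unicode string dtype.
print(all(isinstance(item, str) for item in demo_array.tolist()), "\n")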
#Let's create a 2D array, or a so-called matrix in linear algebra textbooks.
#Pay attention to the difference between the results of d and e.
array = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12]])
#Using .shape, we know (row, column) of the array.
print("The shape of this matrix is", array.shape)
print(array, "\n")
a = array[0, 1] #We go to the first row and get the second item.
print("The item a is", a, "\n")
b = array[1][2] #We get the second row and get the third item.
print("The item b is", b, "\n")
c = array[0][:] #We take the first row, then all of its columns.
print("The item c is", c, "\n") #The result is the same as the next print(),
                                #but the process is different.
d = array[:][0] #array[:] is the whole array again, so [0] is the first row.
print("The item d is", d, "\n") #Hence, the [:] is redundant here;
                                #d = array[0] will do the same job for you.
e = array[:, 0] #We go through all the rows and take the first column.
print("The item e is", e, "\n")
#To generate some data sets for processing, we use the following statements.
#city_XXX = np.round(np.random.normal(a, b, c), d)
# a = distribution mean
# b = distribution standard deviation
# c = number of samples
# d = number of digits you want to keep
#More about .random.normal(a, b, c):
"""https://goo.gl/jegvSK"""
city_height = np.round(np.random.normal(1.75, 0.2, 5000), 2)
city_weight = np.round(np.random.normal(73.24, 10.2, 5000), 2)
city_BMI = np.round(city_weight / city_height **2, 2)
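#The three arrays above change on every run because the generator is not
#seeded. Below is a minimal sketch of a reproducible draw, assuming NumPy 1.17
#or newer for np.random.default_rng. The seed 42 and the names rng and
#demo_height are arbitrary choices for this illustration, not course material.
import numpy as np #Repeated so that this snippet also runs on its own.
rng = np.random.default_rng(42)
demo_height = np.round(rng.normal(1.75, 0.2, 5000), 2)
print("Reproducible sample shape:", demo_height.shape, "\n")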
#To stack several 1D arrays as the columns of one 2D array, we use the
#following statement.
#city = np.column_stack((e, f, ...))
# (e, f, ...) = tuple, also tup, sequence of 1-D or 2-D arrays
#Quoted from "Learning Python, fourth edition" by Mark Lutz.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The tuple object (pronounced “toople” or “tuhple,” depending on who you ask)
is roughly like a list that cannot be changed—tuples are sequences, like
lists, but they are immutable, like strings. Syntactically, they are coded in
parentheses instead of square brackets, and they support arbitrary types,
arbitrary nesting, and the usual sequence operations.
The primary distinction for tuples is that they cannot be changed once
created. That is, they are immutable sequences.
Like lists and dictionaries, tuples support mixed types and nesting, but they
don’t grow and shrink because they are immutable.
Why Tuples?
So, why have a type that is like a list, but supports fewer operations?
Frankly, tuples are not generally used as often as lists in practice, but
their immutability is the whole point. If you pass a collection of objects
around your program as a list, it can be changed anywhere; if you use a tuple,
it cannot. That is, tuples provide a sort of integrity constraint that is
convenient in programs larger than those we’ll write here.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#More about .column_stack((e, f, ...)):
"""https://goo.gl/uZ1BKu"""
city = np.column_stack((city_height, city_weight, city_BMI))
#How do we get the mean, standard deviation, and median of a data set?
#Use np.mean(), np.std() and np.median().
#There are also functions like np.sum() and np.sort().
print("Heights")
print("Calculated mean:", round(np.mean(city[:, 0]), 3))
print("Calculated median:", np.median(city[:, 0]))
print("Calculated standard deviation:", round(np.std(city[:, 0]), 3))
print("Expected mean: 1.75")
print("Expected median: 1.75")
print("Expected standard deviation: 0.2 \n")
print("Weights")
print("Calculated mean:", round(np.mean(city[:, 1]), 3))
print("Calculated median:", np.median(city[:, 1]))
print("Calculated standard deviation:", round(np.std(city[:, 1]), 3))
print("Expected mean: 73.24")
print("Expected median: 73.24")
print("Expected standard deviation: 10.2 \n")
print("BMIs")
print("Calculated mean:", round(np.mean(city[:, 2]), 3))
print("Calculated median:", np.median(city[:, 2]))
print("Calculated standard deviation:", round(np.std(city[:, 2]), 3), "\n")
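#The statistics above can also be computed for every column at once with the
#axis argument. Below is a minimal sketch on a tiny made-up table (demo_city
#is hypothetical data, not the city array above). axis = 0 collapses the
#rows, so we get one value per column.
import numpy as np #Repeated so that this snippet also runs on its own.
demo_city = np.array([[1.70, 60.0],
                      [1.80, 80.0],
                      [1.90, 70.0]])
print("Column means:", np.mean(demo_city, axis = 0))
print("Column medians:", np.median(demo_city, axis = 0))
print("Column standard deviations:", np.std(demo_city, axis = 0), "\n")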
#Data Visualization with .plot(), .scatter(), and .show()
#The main reason I am learning Python is to replace the plotting functions
#of Matlab. It feels really good to be here.
import matplotlib.pyplot as plt
year = [2013, 2014, 2015, 2016, 2017]
bodyweight = [97.6, 85.6, 87.5, 91.4, 89.8]
plt.figure(0) #Since we changed the graphics setting to "Automatic", we need
              #this call to keep different figures separate.
plt.plot(year, bodyweight)
plt.xticks([2013, 2014, 2015, 2016, 2017],
           ["2013Y", "2014Y", "2015Y", "2016Y", "2017Y"])
plt.fill_between(year, bodyweight, 85, color = "blue")
plt.show() #You need this call to display the result.
#We can get an idea of how values are distributed with .hist()
#We import random to sample 20 distinct integers between 0 and 99 over and
#over again, which gives us a "big" data set to plot as a histogram.
#Note: older matplotlib versions called the density argument normed.
"""
hist(x, bins=None, range=None, density=False, weights=None, cumulative=False,
bottom=None, histtype='bar', align='mid', orientation='vertical',
rwidth=None, log=False, color=None, label=None, stacked=False,
data=None, **kwargs)
"""
import random
testing = random.sample(range(100), 20)
for _ in range(50000): #Repeat the sampling 50000 more times.
    testing += random.sample(range(100), 20)
plt.figure(1)
#density replaces the normed argument of older matplotlib versions.
plt.hist(testing, bins = 100, density = False, histtype = 'bar',
         color = "green", edgecolor = "black")
plt.xlabel("Integers", fontsize = 16)
plt.ylabel("Occurrences", fontsize = 16)
plt.title("Occurrences of Randomly Picked Integers", fontsize = 16)
plt.show()
#Here come the pandas! Run for your life!
#Wait! These are not the real animals, so no worries. I don't think there is
#a high chance of being attacked by pandas.
#Alright, here we use the NBA stats sample from
"""https://github.com/AddisonGauss/NbaData2015-2016"""
#Big thanks to the contributors.
#Oh, remember to download it to your computer and rename the file to
#NBAstats.csv, or else you will get an error.
import pandas as pd
stats_brics = pd.read_csv("NBAstats.csv", index_col = 0)
print(stats_brics, "\n")
print(stats_brics["Age"], "\n")
print(stats_brics.loc[37], "\n")
print(stats_brics.loc[37]["AVP"], "\n")
#For your information, you can do math operations directly on the columns of
#the DataFrame, but I skip this part and leave it for you to discover.
#We will talk more about the usage of pandas in the next note.
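As a small sketch of the column arithmetic mentioned in the closing comments of the script, the example below builds a tiny DataFrame by hand instead of reading NBAstats.csv. The column names Points and Games, the row labels, and all the numbers are made up for illustration; they are not taken from the real file.

```python
import pandas as pd

# Hypothetical data: the column names and values are invented for this sketch.
demo_stats = pd.DataFrame({"Points": [300, 150, 90],
                           "Games": [10, 5, 9]},
                          index=[35, 36, 37])

# Arithmetic between two columns is applied row by row.
demo_stats["PointsPerGame"] = demo_stats["Points"] / demo_stats["Games"]
print(demo_stats)

# .loc[row label, column label] picks a single value in one step.
print(demo_stats.loc[37, "PointsPerGame"])
```

The same pattern should work on the columns of the real file once you know their names, which you can list with the .columns attribute of the DataFrame.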