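This is the fourth Python course: Programming with Python for Data Science.
It introduces some data types we have not seen before, and the course design has a few shortcomings,
so learners will find it somewhat harder going.
The course also uses Python 2.7 rather than Python 3.6,
so these notes are a bit longer than usual.
Don't worry, though: I drew on two other textbooks to fill in the gaps,
and if you read the notes carefully, there should be no concepts left to catch up on.
Topics covered: sample and feature, pandas data structures, loading and writing .xlsx, .csv, and .sql, the dict data structure, how to construct a DataFrame manually, and slicing and dicing.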
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 6 18:04:10 2017
@author: ShihHsing Chen
This is a note for the class
"Programming with Python for Data Science".
Caution: This course uses Python 2.7, not Python 3.6, so I will update the
content for you wherever I know there is a difference.
"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/qc732f"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
SAMPLE AND FEATURE
To be usable by SciKit-Learn, the machine learning library for Python you
will be using in this course, your data needs to be organized into a matrix
of samples and features.
A sample is any phenomenon you can describe with quantitative traits.
A feature is a quantitative trait that describes your sample. There are two
types of features. One is the continuous feature, and the other is the
categorical feature, which can be broken down into Ordinal (Ordered) and
Nominal (Unordered).
Continuous features: distance, time, cost, etc.
Ordinal features: happy, neutral, sad, etc.
Nominal features: color, TV shows, etc.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Another citation from the course webpage. You can go to the original webpage
#for more details.
"""https://goo.gl/PfeaLV"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
INTRO TO PANDAS DATA STRUCTURE: SERIES AND DATAFRAME
There are two data structures in Pandas you need to know how to work with.
The first is the Series object, a one-dimensional labeled array that
represents a single column in your dataset.
Having all elements share the same units and data type may give you the
ability to apply series-wide operations. Because of this, Pandas Series must
be homogeneous. They're capable of storing any Python data type (integers,
strings, floating point numbers, objects, etc.), but all the elements in a
Series must be of the same data type.
The second structure you need to work with is a collection of series called a
DataFrame. To manipulate a dataset, you first need to load it into a
DataFrame. Different people prefer alternative methods of storing their data,
so Pandas tries to make loading data easy no matter how it's stored. Here are
some methods for loading data:
xls_dataframe = pd.read_excel('data.xlsx', 'Sheet1', na_values=['NA', '?'])
csv_dataframe = pd.read_csv('data.csv', sep=',')
table_dataframe = pd.read_html('http://page.com/with/table.html')[0]
Writing an existing DataFrame to disk is just as straightforward as reading
from one:
my_dataframe.to_sql('table', engine)
my_dataframe.to_excel('dataset.xlsx')
my_dataframe.to_csv('dataset.csv')
A note of caution: While generally we would say axis is another word for
dimension or feature, Pandas uses the word differently. In Pandas, axes
refer to the two-dimensional, matrix-like shape of your dataframe.
In this context, if you see or hear the term "axis", assume the speaker is
talking about the layout of your dataframe as opposed to the dimensionality
of your features.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#The original course material largely skips the Series. The lecturer covers it
#only briefly before moving on to the DataFrame, so I borrow some content from
#Wes McKinney's "Python for Data Analysis, Second Edition" to complement the
#note.
import pandas as pd
test_Series1 = pd.Series([4, 7, 5, -3])
print(type(test_Series1))
print(test_Series1, "\n")
#As you can see in the printed result, there is a sequence of numbers and an
#associated array of indexes (or indices). These indexes are default values,
#and you can define your own set of indexes instead.
test_Series2 = pd.Series([4, 7, 5, -3], index = ["Ag", "Ca", "Au", "Li"])
print(type(test_Series2))
print(test_Series2, "\n")
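#A small sketch of my own (not from the course or the book) showing what the
#labels and the shared data type buy you. The Series mirrors test_Series2
#above; the names metals, positives, doubled, and mixed are mine. Note that
#pandas does accept mixed values, but it then falls back to the generic
#"object" dtype.

```python
import pandas as pd

metals = pd.Series([4, 7, 5, -3], index=["Ag", "Ca", "Au", "Li"])

#Select a single value by its label, like a dict lookup.
print(metals["Ca"], "\n")

#Series-wide operations work because every element shares one dtype.
positives = metals[metals > 0]
doubled = metals * 2
print(positives, "\n")
print(doubled, "\n")

#Mixing types is allowed, but the Series falls back to the object dtype.
mixed = pd.Series([1, "two", 3.0])
print(metals.dtype, mixed.dtype, "\n")
```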
#Before we dive deeper, we need to meet another data type first. I know it is
#kind of annoying, but what else can we do? We are rookies. = =
#Alright. The data type is the dictionary, also known as dict. Again I borrow
#some content, this time from Mark Lutz's "Learning Python, Fourth Edition".
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Python dictionaries are something completely different (Monty Python reference
intended)—they are not sequences at all, but are instead known as mappings.
Mappings are also collections of other objects, but they store objects by key
instead of by relative position. In fact, mappings don’t maintain any reliable
left-to-right order; they simply map keys to associated values. Dictionaries,
the only mapping type in Python’s core objects set, are also mutable: they may
be changed in-place and can grow and shrink on demand, like lists.
When written as literals, dictionaries are coded in curly braces and consist
of a series of “key: value” pairs. Dictionaries are useful anytime we need to
associate a set of values with keys—to describe the properties of something,
for instance.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
test_dict1 = {"Subject": "I", "Verb": "have", "Quant": 1, "Food": "Apple"}
print(test_dict1["Subject"], test_dict1["Verb"], test_dict1["Quant"],
      test_dict1["Food"], "\n")
#Nested Dictionary.
test_dict2 = {"Name": {"First": "John", "Last": "Doe"},
              "job": ["Engineering", "Manager"],
              "age": 40.5}
print(test_dict2, "\n")
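#A short sketch of my own on working with a dict after it is built: nested
#lookup, growing and shrinking in place, and a safe lookup with .get(). The
#dict mirrors test_dict2 above; the "city" and "email" keys are made up for
#illustration. One difference from the book's wording: since Python 3.7, a
#dict keeps its keys in insertion order.

```python
person = {"Name": {"First": "John", "Last": "Doe"},
          "job": ["Engineering", "Manager"],
          "age": 40.5}

#Nested lookup: index into the outer dict first, then the inner one.
first_name = person["Name"]["First"]
print(first_name, "\n")

#Dicts are mutable: add a key, then delete another one in place.
person["city"] = "Taipei"
del person["age"]

#.get() returns a default instead of raising KeyError for a missing key.
email = person.get("email", "unknown")
print(email, "\n")

#Iterating over a dict yields its keys.
keys = list(person)
print(keys, "\n")
```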
#Another citation from Wes McKinney's "Python for Data Analysis, Second Edition".
#Here we learn how to create a DataFrame manually.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
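There are many ways to construct a DataFrame, though one of the most
common is from a dict of equal-length lists or NumPy arrays. A DataFrame can
also be built directly from a two-dimensional NumPy array:
e.g. df = pd.DataFrame(np.random.randn(1000, 3), columns=["a", "b", "c"])
The resulting DataFrame will have its index assigned automatically as with
Series. The book says the columns are placed in sorted order, which was true
of older pandas; recent versions keep the dict's insertion order instead: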
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data) #data is a dict, and we transform it into a DataFrame.
print(frame, "\n")
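#A sketch of my own, following the same dict: if you care about column order,
#you can state it yourself with the columns argument. A name that is not a
#key in the dict ("debt" here, used only for illustration) becomes a column
#of missing values.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}

#Columns come out in the order listed, not the order of the dict.
ordered_frame = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
print(ordered_frame, "\n")

#"debt" has no data behind it, so every entry is NaN.
print(ordered_frame["debt"].isna().all(), "\n")
```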
#Do you remember "NBAstats.csv" we used last time? If you print the data type
#of it, you will get <class 'pandas.core.frame.DataFrame'>. Yeah, we used the
#DataFrame structure even before we learned of its existence. So, let's import
#the .csv file again to review the function calls. Besides, there is some
#detail about .read_csv worth reading.
"""https://goo.gl/ZyxezR"""
#And also some detail about .read_html to read.
"""https://goo.gl/8cF98b"""
stats_brics = pd.read_csv("NBAstats.csv", index_col = 0)
print("first print \n")
print(stats_brics.head(5), "\n")
print("second print \n")
print(stats_brics.tail(5), "\n")
print("third print \n")
print(stats_brics.describe(), "\n")
print("fourth print \n")
print(stats_brics.columns, "\n")
print("fifth print \n")
print(stats_brics.index, "\n")
print("sixth print \n")
print(stats_brics.values, "\n")
print("seventh print", "\n")
print(stats_brics.dtypes, "\n")
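#The calls above only read a file. Here is a sketch of my own of the full
#round trip: write a DataFrame to .csv, then read it back. The tiny DataFrame
#and the temporary folder are made up so the example does not touch your real
#files. The last two lines illustrate the "axis" wording from the course
#note: axis=0 runs down the rows (one result per column), axis=1 runs across
#the columns (one result per row).

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "dataset.csv")
    #index=False keeps the row labels out of the file.
    df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.equals(df), "\n")

col_sums = df.sum(axis=0) #one sum per column
row_sums = df.sum(axis=1) #one sum per row
print(col_sums, "\n")
print(row_sums, "\n")
```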
#If you are really interested in it, or you need it, dive deeper with the
#API reference here.
"""https://goo.gl/NjmPxr"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/EM6ka8"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
SLICIN' AND DICIN'
A DataFrame is essentially one or more series which have been 'stitched'
together into a new data type. Pandas exposes many equivalent methods for
slicing out those underlying series. You can slice by location, the way you
would normally index into a regular Python list. You can slice by label, the
way you would normally index into a Python dictionary. And like NumPy arrays,
you can also index by boolean masks.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
print("01")
print(stats_brics.AST, "\n")
print("02")
print(stats_brics["AST"], "\n")
print("03")
testslice1 = stats_brics[["AST"]] #You get a DataFrame.
print(testslice1)
print(type(testslice1), "\n")
print("04")
print(stats_brics.loc[:, "AST"], "\n")
print("05")
testslice2 = stats_brics.loc[:, ["AST"]] #You get a DataFrame.
print(testslice2)
print(type(testslice2), "\n")
print("06")
print(stats_brics.iloc[:, 0], "\n")
print("07")
testslice3 = stats_brics.iloc[:, [0]] #You get a DataFrame.
print(testslice3)
print(type(testslice3), "\n")
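print("08")
#.ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the original
#call stats_brics.ix[:, 0] no longer runs. With an integer position, .iloc is
#the equivalent.
print(stats_brics.iloc[:, 0], "\n")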
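#The course text also mentions indexing by boolean masks, which the numbered
#examples above do not cover. Here is a minimal sketch of my own. The
#players, the PTS column, and all numbers are made up for illustration; they
#are not taken from NBAstats.csv.

```python
import pandas as pd

players = pd.DataFrame({"AST": [5.1, 9.8, 2.3, 7.4],
                        "PTS": [20.1, 15.3, 8.7, 27.9]},
                       index=["p1", "p2", "p3", "p4"])

#A comparison on a column gives a Series of True/False values: the mask.
mask = players["AST"] > 5
print(mask, "\n")

#Indexing with the mask keeps only the rows where it is True.
high_assist = players[mask]
print(high_assist, "\n")

#Combine conditions with & (and) or | (or); each needs its own parentheses.
high_both = players[(players["AST"] > 5) & (players["PTS"] > 18)]
print(high_both, "\n")
```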
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/EM6ka8"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
SLICIN' AND DICIN' (continued)
Why does Pandas have so many different data access methods? The answer is
that there are slight differences between them. The first difference you'll
notice from the list above is that in some of the commands, you specify the
name of the column or series you want to slice (recency in the course's
example, AST in this note). By using the column name in the code, it's very
easy to discern what is being pulled, and you don't have to worry about the
order of the columns. This lookup, which first matches the column name before
slicing the column index, is marginally slower than directly accessing the
column by index.
Once you're ready to move to a production environment, Pandas documentation
recommends you use the .loc[] or .iloc[] data access methods, which are more
optimized. The .loc[] method selects by label and .iloc[] selects by integer
position. The course also lists .ix[], a hybrid of the two, but it was
deprecated in pandas 0.20 and removed in pandas 1.0, so avoid it in new code.
Another difference you'll notice is that some of the methods take in a list
of parameters, e.g.: df[['recency']], df.loc[:, ['recency']], and
df.iloc[:, [0]]. By passing in a list of parameters, you can select more than
one column to slice. Please be aware that if you use this syntax, even if you
only specify a single column, the data type that you'll get back is a
dataframe as opposed to a series. This will be useful for you to know once you
start machine learning, so be sure to take down that note.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Row slicing (row indexing) with several different commands.
#I think it is better to stay consistent in your code, though. Mixing
#different kinds of commands can be misleading.
print("09")
print(stats_brics[0:2], "\n")
print("10")
print(stats_brics.iloc[0:2, :], "\n")
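print("11")
#.ix was removed in pandas 1.0. Assuming the row labels are integers, as the
#.loc call in example 12 suggests, .ix selected by label here, so .loc is the
#equivalent.
print(stats_brics.loc[30:32, :], "\n")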
print("12")
print(stats_brics.loc[30:32, :], "\n")
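#One detail worth seeing side by side, in a sketch of my own: a position
#slice with .iloc leaves out its end, while a label slice with .loc includes
#it. The second half selects rows and columns at once and shows again that a
#list of columns gives a DataFrame while a bare label gives a Series. The
#numbers and the PTS and REB columns are made up; the row labels start at 30
#only to mimic the calls above.

```python
import pandas as pd

df = pd.DataFrame({"AST": [1, 2, 3, 4, 5],
                   "PTS": [10, 20, 30, 40, 50],
                   "REB": [6, 7, 8, 9, 10]},
                  index=[30, 31, 32, 33, 34])

by_position = df.iloc[0:2, :] #positions 0 and 1; the end position is excluded
by_label = df.loc[30:32, :]   #labels 30, 31, and 32; the end label is included
print(by_position, "\n")
print(by_label, "\n")

#Rows and columns at once.
diced = df.loc[30:32, ["AST", "PTS"]] #list of columns: a DataFrame
single = df.loc[30:32, "AST"]         #bare column label: a Series
print(type(diced), diced.shape, "\n")
print(type(single), single.shape, "\n")
```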
#The course talks about DICIN' a little bit. If you are interested, click
#the following link.
"""https://goo.gl/Y5KR2E"""