The fourth course on Python—Programming with Python for Data Science. Compared
with the previous three courses, this one covers more material and is more
scattered, yet it often just opens a topic without following through, so I
frequently have to look up supplementary material or cite resources from the
web. To keep the line count manageable, I try to split my notes at around 250
lines. (Compared with people who actually write code for a living, a few
hundred lines is child's play, I know.) This part covers converting data into
quantitative attributes, including ordinal and nominal features, the usage of
.astype() and .cat.codes, the new dtype category, and manipulating textual,
graphical, and audio features. At the end there is also supplementary material
that lets you dive deeper.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 11 23:40:30 2017
@author: ShihHsing Chen
This is a note for the class
"Programming with Python for Data Science".
Caution: This course uses Python 2.7, not Python 3.6, so I will update the
content for you when I know there is a difference.
"""
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/dccNFB"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
REPRESENT DATA AS QUANTITATIVE ATTRIBUTES
Your features need to be represented as quantitative (preferably numeric)
attributes of the thing you're sampling. They can be real world values, such
as the readings from a sensor, and other discernible, physical properties.
Alternatively, your features can also be calculated derivatives, such as the
presence of certain edges and curves in an image, or lack thereof.
If your data comes to you in a nicely formed, numeric, tabular format, then
that's one less thing for you to worry about. But there is no guarantee that
will be the case, and you will often encounter data in textual or other
unstructured forms.
If you have a categorical feature, the way to represent it in your dataset
depends on if it's ordinal or nominal. For ordinal features, map the order as
increasing integers in a single numeric feature. Any entries not found in your
designated categories list will be mapped to -1.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#Citation from the content of the following webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/wbmXAd"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
NEW DATA TYPE (DTYPE): CATEGORY—ORDINAL
pd.Categorical(values, categories=None, ordered=False, fastpath=False)
Represents a categorical variable in classic R / S-plus fashion.
Categoricals can take on only a limited, and usually fixed, number of
possible values (categories). In contrast to statistical categorical
variables, a Categorical might have an order, but numerical operations
(additions, divisions, ...) are not possible.
All values of the Categorical are either in categories or np.nan. Assigning
values outside of categories will raise a ValueError. Order is defined by
the order of the categories, not lexical order of the values.
categories : Index-like (unique), optional
The unique categories for this categorical. If not given, the categories are
assumed to be the unique values of values.
ordered : boolean, (default False)
Whether or not this categorical is treated as an ordered categorical. If not
given, the resulting categorical will not be ordered. Since our satisfaction
feature is ordinal, ordered is True here.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
import pandas as pd
#We create a list named ordinal.
ordinal = ["Very Unhappy",
           "Unhappy",
           "Neutral",
           "Happy",
           "Very Happy"]
#We create DataFrames named df1_1 and df1_2 from dicts.
#We use .astype to cast the "satisfaction" column into dtype category, and we
#use .cat.codes to capture the category codes.
#If you see nan, NaN, or NAN, it represents a missing category.
df1_1 = pd.DataFrame({"satisfaction":["Mad",
                                      "Happy",
                                      "Unhappy",
                                      "Neutral"]})
print("This is DataFrame1_1 before .astype.")
print(df1_1, "\n")
#Newer pandas needs a CategoricalDtype here instead of keyword arguments.
df1_1["satisfaction"] = df1_1["satisfaction"].astype(
        pd.CategoricalDtype(categories=ordinal, ordered=True))
print("This is DataFrame1_1 after .astype with an ordinal order.")
print("You will see NaN, a missing category, at the output below.")
print(df1_1, "\n")
df1_1["satisfaction"] = df1_1["satisfaction"].cat.codes
print("This is DataFrame1_1 after .astype.cat.codes with ordinal order.")
print("Try to match the list ordinal:", ordinal)
print("What do you find here? The list indices, right?")
print("ordinal[3] = Happy, ordinal[1] = Unhappy, ordinal[2] = Neutral")
print(df1_1, "\n")
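#The pandas documentation cited above describes the pd.Categorical constructor
#itself, which the course does not demonstrate. As a small aside of mine (a
#hedged sketch, not part of the course), the same encoding can be built
#directly; "Mad" is not in categories, so its code is -1, matching df1_1.
example_cat = pd.Categorical(["Mad", "Happy", "Unhappy", "Neutral"],
                             categories=ordinal, ordered=True)
print("The same codes, built with pd.Categorical directly:")
print(example_cat.codes, "\n")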
print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")
df1_2 = pd.DataFrame({"satisfaction":["Mad",
                                      "Happy",
                                      "Unhappy",
                                      "Neutral"]})
print("This is DataFrame1_2 before .astype.")
print("Same as DataFrame1_1 before .astype.")
print(df1_2, "\n")
#ordered=False is the default, so newer pandas just takes "category" here.
df1_2["satisfaction"] = df1_2["satisfaction"].astype("category")
print("This is DataFrame1_2 after .astype without an ordinal order.")
print("We did not designate a categories list, so there is no NaN here.")
print(df1_2, "\n")
df1_2["satisfaction"] = df1_2["satisfaction"].cat.codes
print("This is DataFrame1_2 after .astype.cat.codes without ordinal order.")
print("It is arranged in lexical order, which may not always be desirable.")
print("And this is exactly the same as the result for df2 below.")
print("Hence, specify the category order when encoding ordinal features.")
print(df1_2, "\n")
print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/dccNFB"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
NEW DATA TYPE (DTYPE): CATEGORY—NOMINAL—QUICK ENCODING (IMPLICIT LEXICAL ORDER)
On the other hand, if your feature is nominal (and thus there is no obvious
numeric ordering), then you have two options. The first is that you can encode
it much as you did above. This would be a fast-and-dirty approach. While
you're just getting accustomed to your dataset and taking it for its first run
through your data analysis pipeline, this method might be best.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df2 = pd.DataFrame({"vertebrates":["Bird",
                                   "Bird",
                                   "Mammal",
                                   "Fish",
                                   "Amphibian",
                                   "Reptile",
                                   "Mammal"]})
print("This is DataFrame2 before .astype.")
print(df2, "\n")
# Method 1) The lecturer calls this the fast-and-dirty approach.
df2["vertebrates"] = df2["vertebrates"].astype("category")
print("This is DataFrame2 after .astype.")
print(df2, "\n")
df2["vertebrates"] = df2["vertebrates"].cat.codes
print("This is DataFrame2 after .astype.cat.codes.")
print(df2, "\n")
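#A small aside of mine (not from the course): if you want to see which code
#pandas assigned to which animal, keep a categorical copy around and read
#.cat.categories, which are sorted alphabetically for strings by default.
df2_check = pd.Series(["Bird", "Bird", "Mammal", "Fish",
                       "Amphibian", "Reptile", "Mammal"], dtype="category")
print("Code-to-category mapping behind DataFrame2:")
print(dict(enumerate(df2_check.cat.categories)), "\n")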
print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/dccNFB"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
NEW DATA TYPE (DTYPE): CATEGORY—NOMINAL—DUMMY FEATURES (NO ORDER INTRODUCED)
Notice how this time, ordered=True was not passed in, nor was a specific
ordering listed. Because of this, Pandas encodes your nominal entries in
alphabetical order. This approach is fine for getting your feet wet, but the
issue it has is that it still introduces an ordering to a categorical list of
items that inherently has none. This may or may not cause problems for you in
the future. If you aren't getting the results you hoped for, or even if you
are getting the results you desired but would like to further increase the
result accuracy, then a more precise encoding approach would be to separate
the distinct values out into individual boolean features.
These newly created features are called boolean features because the only
values they can contain are either 0 for non-inclusion, or 1 for inclusion.
Pandas .get_dummies() method allows you to completely replace a single,
nominal feature with multiple boolean indicator (dummy) features. This method
is quite powerful and has many configurable options, including the ability to
return a SparseDataFrame, and other prefixing options. Its benefit is that
no erroneous ordering is introduced into your dataset.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df3 = pd.DataFrame({"vertebrates":["Bird",
                                   "Bird",
                                   "Mammal",
                                   "Fish",
                                   "Amphibian",
                                   "Reptile",
                                   "Mammal"]})
print("This is DataFrame3 before .get_dummies.")
print(df3, "\n")
# Method 2)
df3 = pd.get_dummies(df3, columns=["vertebrates"])
print("This is DataFrame3 after .get_dummies.")
print(df3, "\n")
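#The citation above mentions prefixing options and a sparse output. As a
#hedged sketch of mine (not from the course), here is the same call with the
#prefix and sparse keyword arguments; the exact column dtypes you get back
#depend on your pandas version.
df3_sparse = pd.get_dummies(pd.DataFrame({"vertebrates": ["Bird", "Fish",
                                                          "Mammal"]}),
                            columns=["vertebrates"], prefix="is", sparse=True)
print('This is a get_dummies variant with prefix="is" and sparse=True.')
print(df3_sparse, "\n")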
print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")
#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/RhbrSP"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PURE TEXTUAL FEATURES
If you are trying to "featurize" a body of text such as a webpage, a tweet,
a passage from a newspaper, an entire book, or a PDF document, creating a
corpus of words and counting their frequency is an extremely powerful
encoding tool. This is also known as the Bag-of-Words model, implemented
with the CountVectorizer() class in scikit-learn. Even though the grammar
of your sentences and their word order are completely discarded, this model
has accomplished some pretty amazing things, such as correctly identifying
J.K. Rowling's writing from a blind lineup of authors.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
from sklearn.feature_extraction.text import CountVectorizer
corpus = open("D:/Introduction to Python/lyrics.txt", "r") #Prepare yours.
                                                           #Or else, an error.
corpus_content = corpus.readline().strip() #Remember .read(), .readline(), and
count1 = 0                                 #.readlines()?
while corpus_content:
    print(count1, corpus_content)
    corpus_content = corpus.readline().strip()
    count1 += 1
else:
    print() #If you put pass here, it is just a placeholder.
corpus.seek(0) #Move the indicator to the start of the file.
bow = CountVectorizer()
X = bow.fit_transform(corpus) # Sparse Matrix
print(bow.get_feature_names_out()) #Older scikit-learn (before 1.0) used
                                   #bow.get_feature_names() instead.
print(X.toarray()) #Do not try this if your sparse matrix is huge. It will
                   #cost a lot of your computer's memory.
corpus.close()     #Close the file handle once we are done with it.
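#If you do not have a lyrics.txt at hand, here is a minimal in-memory sketch
#of mine (not from the course) of the same Bag-of-Words idea; CountVectorizer
#accepts any iterable of strings, not just a file object.
tiny_corpus = ["the quick brown fox",
               "the lazy dog",
               "the quick dog"]
tiny_bow = CountVectorizer()
tiny_X = tiny_bow.fit_transform(tiny_corpus)
print(sorted(tiny_bow.vocabulary_))  #The learned vocabulary.
print(tiny_X.toarray(), "\n")        #One row per string, one column per word.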
#The course also talks about graphical and audio features. However, that is
#not my focus right now, so I just leave the link here for future reference.
"""https://goo.gl/WLhk6i"""
#The course also talks about different approaches to dealing with your missing
#data (wrangling), but that is not our concern now, so I skip it as well.
"""https://goo.gl/W2CFPw"""
#Further reading to help you dive deeper.
"""https://goo.gl/Q3UAbh"""