
[Python] Notebook-7 Programming with Python For Data Science—Part.2




The fourth course about Python: Programming with Python for Data Science.
Compared with the previous three courses, the content is relatively large and
scattered, and it often just opens a topic without following through, so I
frequently have to look up supplementary material or cite online resources.

To keep the line count from growing too long, I try to split the notes at
around 250 lines. (Compared with people who write code for a living, a few
hundred lines is nothing, I know.)

The content covers converting data into quantitative attributes, including
ordinal and nominal features, the usage of .astype() and .cat.codes, the new
dtype category, and manipulating textual, graphical, and audio features.
At the end there is also supplementary material to let you dive deeper.



# -*- coding: utf-8 -*-
"""
Created on Wed Oct 11 23:40:30 2017
@author: ShihHsing Chen
This is a note for the class 
"Programming with Python for Data Science".

Caution: This course uses Python 2.7, not Python 3.6, so I will update the
content for you whenever I know there is a difference.
"""

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/dccNFB"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
               REPRESENT DATA AS QUANTITATIVE ATTRIBUTES

Your features need to be represented as quantitative (preferably numeric) 
attributes of the thing you're sampling. They can be real world values, such 
as the readings from a sensor, and other discernible, physical properties. 
Alternatively, your features can also be calculated derivatives, such as the 
presence of certain edges and curves in an image, or lack thereof.

If your data comes to you in a nicely formed, numeric, tabular format, then 
that's one less thing for you to worry about. But there is no guarantee that 
will be the case, and you will often encounter data in textual or other 
unstructured forms. 

If you have a categorical feature, the way to represent it in your dataset 
depends on whether it's ordinal or nominal. For ordinal features, map the order as 
increasing integers in a single numeric feature. Any entries not found in your
designated categories list will be mapped to -1.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

#Citation from the content of the following webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/wbmXAd"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
                  NEW DATA TYPE (DTYPE): CATEGORY—ORDINAL

pd.Categorical(values, categories=None, ordered=False, fastpath=False)

Represents a categorical variable in classic R / S-plus fashion

Categoricals can take on only a limited, and usually fixed, number of 
possible values (categories). In contrast to statistical categorical 
variables, a Categorical might have an order, but numerical operations 
(additions, divisions, ...) are not possible.

All values of the Categorical are either in categories or np.nan. Assigning 
values outside of categories will raise a ValueError. Order is defined by 
the order of the categories, not lexical order of the values.

categories : Index-like (unique), optional
The unique categories for this categorical. If not given, the categories are 
assumed to be the unique values of values.

ordered : boolean, (default False)
Whether or not this categorical is treated as an ordered categorical. If not 
given, the resulting categorical will not be ordered. Since the satisfaction 
feature below is ordinal, we set it to True here.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
import pandas as pd
#We create a list named ordinal.
ordinal = ["Very Unhappy", 
           "Unhappy", 
           "Neutral", 
           "Happy", 
           "Very Happy"]

#We create DataFrames named df1_1 and df1_2 from dicts.
#We use .astype to cast the satisfaction column into dtype category, and we
#use .cat.codes to capture the category codes.
#If you see nan, NaN, or NAN, it represents a missing category.
df1_1 = pd.DataFrame({"satisfaction":["Mad", 
                                      "Happy", 
                                      "Unhappy", 
                                      "Neutral"]})

print("This is DataFrame1_1 before .astype.")
print(df1_1, "\n")

df1_1["satisfaction"] = df1_1["satisfaction"].astype("category", ordered=True,
  categories = ordinal)
print("This is DataFrame1_1 after .astype with an ordinal order.")
print("You will see NaN, a missing category, at the output below.")
print(df1_1, "\n")

df1_1["satisfaction"] = df1_1["satisfaction"].cat.codes
print("This is DataFrame1_1 after .astype.cat.codes with ordinal order.")
print("Try to match the list ordinal:", ordinal)
print("What do you find here? The keys, right?") 
print("ordinal[3] = Happy, ordinal[1] = Unhappy, ordinal[2] = Neutral")
print(df1_1, "\n")

print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")

df1_2 = pd.DataFrame({"satisfaction":["Mad", 
                                      "Happy", 
                                      "Unhappy", 
                                      "Neutral"]})

print("This is DataFrame1_2 before .astype.") 
print("Same as DataFrame1_1 before .astype.")
print(df1_2, "\n")

df1_2["satisfaction"] = df1_2["satisfaction"].astype("category", ordered=False)
print("This is DataFrame1_2 after .astype withlout ordinal order.")
print("We do not have any category, so no more NaN here.")
print(df1_2, "\n")

df1_2["satisfaction"] = df1_2["satisfaction"].cat.codes
print("This is DataFrame1_2 after .astype.cat.codes withlout ordinal order.")
print("It is arranged in lexical order, which may not always be desirable.")
print("And this is exactly the same as the ramification of df2 below.")
print("Hence, choose .astype.cat.codes and orders for ordinal features.")
print(df1_2, "\n")

print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/dccNFB"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          NEW DATA TYPE (DTYPE): CATEGORY—NOMINAL—METHOD 1: CATEGORY CODES

On the other hand, if your feature is nominal (and thus there is no obvious 
numeric ordering), then you have two options. The first is that you can encode 
it similarly to how you did above. This would be a fast-and-dirty approach. While 
you're just getting accustomed to your dataset and taking it for its first run
through your data analysis pipeline, this method might be best.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df2 = pd.DataFrame({"vertebrates":["Bird",
                                   "Bird",
                                   "Mammal",
                                   "Fish",
                                   "Amphibian",
                                   "Reptile",
                                   "Mammal",
                                   ]})

print("This is DataFrame2 before .astype.")
print(df2, "\n")

# Method 1) The lecturer calls this the fast-and-dirty approach.
df2["vertebrates"] = df2["vertebrates"].astype("category")

print("This is DataFrame2 after .astype.")
print(df2, "\n")

df2["vertebrates"] = df2["vertebrates"].cat.codes
print("This is DataFrame2 after .astype.cat.codes.")
print(df2, "\n")

print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/dccNFB"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
       NEW DATA TYPE (DTYPE): CATEGORY—NOMINAL—METHOD 2: BOOLEAN (DUMMY) FEATURES

Notice how this time, ordered=True was not passed in, nor was a specific 
ordering listed. Because of this, Pandas encodes your nominal entries in 
alphabetical order. This approach is fine for getting your feet wet, but the 
issue it has is that it still introduces an ordering to a categorical list of 
items that inherently has none. This may or may not cause problems for you in 
the future. If you aren't getting the results you hoped for, or even if you 
are getting the results you desired but would like to further increase the 
result accuracy, then a more precise encoding approach would be to separate 
the distinct values out into individual boolean features.

These newly created features are called boolean features because the only 
values they can contain are either 0 for non-inclusion, or 1 for inclusion. 
Pandas .get_dummies() method allows you to completely replace a single, 
nominal feature with multiple boolean indicator (dummy) features. This method
is quite powerful and has many configurable options, including the ability to 
return a SparseDataFrame, and other prefixing options. Its benefit is that 
no erroneous ordering is introduced into your dataset.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df3 = pd.DataFrame({"vertebrates":["Bird",
                                   "Bird",
                                   "Mammal",
                                   "Fish",
                                   "Amphibian",
                                   "Reptile",
                                   "Mammal",
                                   ]})

print("This is DataFrame3 before .astype.")
print(df3, "\n")

# Method 2)
df3 = pd.get_dummies(df3, columns=["vertebrates"])

print("This is DataFrame3 after .get_dummies.")
print(df3, "\n")

print("/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/* \n")

#Citation from the content of this course webpage. You can go to the original
#webpage for more details.
"""https://goo.gl/RhbrSP"""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
                           PURE TEXTUAL FEATURES

If you are trying to "featurize" a body of text such as a webpage, a tweet, 
a passage from a newspaper, an entire book, or a PDF document, creating a 
corpus of words and counting their frequency is an extremely powerful 
encoding tool. This is also known as the Bag-of-Words model, implemented 
with the CountVectorizer() method in SciKit-Learn. Even though the grammar 
of your sentences and their word order are completely discarded, this model 
has accomplished some pretty amazing things, such as correctly identifying 
J.K. Rowling's writing from a blind line-up of authors.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
from sklearn.feature_extraction.text import CountVectorizer

corpus = open("D:/Introduction to Python/lyrics.txt", "r") #Prepare yours.
                                                           #Or else, an error.
                                                           
corpus_content = corpus.readline().strip() #Remember .read(), .readline(), and
count1 = 0                                 #.readlines?

while corpus_content:
    print(count1, corpus_content)
    corpus_content = corpus.readline().strip()
    count1 += 1
else:
    print() #pass would also work here as a placeholder.

corpus.seek(0) #Move the indicator to the start of the file.

bow = CountVectorizer()
X = bow.fit_transform(corpus) # Sparse Matrix

print(bow.get_feature_names_out()) #Use .get_feature_names() on older
                                   #scikit-learn versions.
print(X.toarray()) #Do not try this if your sparse matrix is huge. It will
                   #cost a lot of your computer memory.
corpus.close()     #Close the file handle now that we are done with it.
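
#If you do not have a lyrics.txt on hand, here is a self-contained sketch of
#the same Bag-of-Words idea on an in-memory list of sentences (the sentences
#are made up purely for illustration).
tiny_corpus = ["data science with python",
               "python for data analysis",
               "science and data"]
tiny_bow = CountVectorizer()
tiny_X = tiny_bow.fit_transform(tiny_corpus)
print(tiny_bow.get_feature_names_out()) #.get_feature_names() on older versions.
print(tiny_X.toarray()) #Small enough to densify safely.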

#The course also talks about graphical and audio features. However, that is 
#not my interest right now, so I just leave the link here for future reference.
"""https://goo.gl/WLhk6i"""

#The course talks about different approaches to deal with your missing data 
#(Wrangling). But that is not our concern now. Skip it as well.
"""https://goo.gl/W2CFPw"""
#Further reading to help you dive deeper.
"""https://goo.gl/Q3UAbh"""
