03 Dec 2021 14637字 49分

Pandas

數據準備工作

黨您獲取到數據時，往往數據是冗雜且紛亂的。因此，我們需要對數據進行一系列的準備工作，如：加載，組裝，刪除等…………。（當然，一般官方庫的數據均是處理完備可以直接進行使用的

數據合并

數據合并則使用函數 pd.merge(DataFrame_1， DataFrame_2)，在使用時需注意，需保持兩個 DataFrame 對象的長度一致。

整個合并過程類似於 “交” 操作，即保留 columns的值相同的項，儅 columns 中有多項 Index 相同時，則(很大可能)會篩除掉更多的數據。具體請看下方實例。

frame1 = pd.DataFrame({"id":["A", "B", "C", "D"],
                       "values":[100, 200, 300, 500]})
print(frame1)

frame2 = pd.DataFrame({"id":["A", "B","E", "C"],
                        "values":[500, 200, 300, 500]})
print()
print(pd.merge(frame1, frame2))
-------------------------------------
  id  values
0  A     100
1  B     200
2  C     300
3  D     500

  id  values
0  B     200

frame1 = pd.DataFrame({"id":["A", "B", "C", "D"],
                       "values":[100, 200, 300, 500]})
print(frame1)

frame2 = pd.DataFrame({"id":["A", "B","E", "C"],
                       "objects":["a", "c", "e", "f"]})
print()
print(pd.merge(frame1, frame2))
-----------------------------------------
  id  values
0  A     100
1  B     200
2  C     300
3  D     500

  id  values objects
0  A     100       a
1  B     200       c
2  C     300       f

可以看到，儅進行合并時，若2個 DataFrame 對象的 Columns 相同，則會篩除掉更多的數據（因要保證數據更高的相同性），若有額外的 Columns_value 出現，則會附加在後邊（如 objects 所示）。

另外，也可以設置參數 on = ““，來指定函數進行合并所選的列。

frame1 = pd.DataFrame({"id":["A", "B", "C", "D"],
                       "values":[100, 200, 300, 500],
                       "objects":["da", "da", "qwe", "qwe"]},
                        )
frame2 = pd.DataFrame({"id":["A", "B","E", "C"],
                       "objects":["a", "c", "e", "f"]})
print(pd.merge(frame1, frame2))
print()
print(pd.merge(frame1, frame2, on="id"))
------------------------------------------------
Empty DataFrame
Columns: [id, values, objects]
Index: []

  id  values objects_x objects_y
0  A     100        da         a
1  B     200        da         c
2  C     300       qwe         f

現在您清楚了吧。(不是

那麽，黨兩個 DataFrame 對象沒有相同的 columns 值該怎麽辦呢？若在此時用上述的方法進行合并，那麽最後結果均是空列表。如下所示：

frame1 = pd.DataFrame({"id":["A", "B", "C", "D"],
                       "values":[100, 200, 300, 500],
                       "objects":["da", "da", "qwe", "qwe"]},
                        )
frame2 = pd.DataFrame({"sid":["A", "B","E", "C"],
                       "objects":["a", "c", "e", "f"]})
print(pd.merge(frame1, frame2))

-------------------------------------
Empty DataFrame
Columns: [id, values, objects, sid]
Index: []

這時，我們就可以使用 DataFrame() 中的參數 left_on 和 right_on，來指定合并時需要進行比較的 Columns 的值。當然，既可以使用 key 值進行指定，也可以使用索引值（left_index)。

print(pd.merge(frame1, frame2,left_on="id", right_on="sid"))
-----------------------------
 id  values objects_x sid objects_y
0  A     100        da   A         a
1  B     200        da   B         c
2  C     300       qwe   C         f

在連接時，也可以對 how = “” 參數進行調整以改變鏈接的方式。how = “left” 進行左連接；how = “right” 進行右鏈接， how = “outer”進行外連接。此處的三種鏈接方式與數據庫中的定義相同，便不再進行闡釋。

若要根據索引進行合并，可使用 pd.merge(left_index =True, right_index = True) 或使用 DataFrame.join() 來進行合并。（即橫向匹配

frame1 = pd.DataFrame({"id":["A", "B", "C", "D"],
                       "values":[100, 200, 300, 500],
                       "objects":["da", "da", "qwe", "qwe"]},
                        )
frame2 = pd.DataFrame({"sid":["A", "B","E", "C", "Q"],
                       "objects":["a", "c", "e", "f", "haha"]})
print(pd.merge(frame1, frame2))
# print()
# print(pd.merge(frame1, frame2,left_on="id", right_on="sid"))

print(pd.merge(frame1, frame2, left_index=True, right_index=True))
--------------------------------------
Empty DataFrame
Columns: [id, values, objects, sid]
Index: []

  id  values objects_x sid objects_y
0  A     100        da   A         a
1  B     200        da   B         c
2  C     300       qwe   E         e
3  D     500       qwe   C         f

然而若使用 frame1.join(frame2)，則會報錯。因爲此時 frame1 和 frame 2的 columns 具有重複的值。

frame1 = pd.DataFrame({"id":["A", "B", "C", "D"],
                       "values":[100, 200, 300, 500],
                       "objects":["da", "da", "qwe", "qwe"]},
                        )
frame2 = pd.DataFrame({"sid":["A", "B","E", "C", "G"],
                       "objects1":["a", "c", "e", "f", "haha"]})
print()
print(frame1.join(frame2))
--------------------------------------
  id  values objects sid objects1
0  A     100      da   A        a
1  B     200      da   B        c
2  C     300     qwe   E        e
3  D     500     qwe   C        f

數據拼接

需要用到的函數：

pd.concat()，包括參數：[序列]， keys，axis。

在進行數據準備時，數據（或稱爲數組更容易理解）的拼接也是常做的工作。

不難想象，拼接既有竪向拼接，也有橫向拼接。

簡單的拼接（注意這個方括號哦~）

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["A", "B", "C"],
                      index=["a", "b", "c"])
frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["D", "E", "F"],
                      index=["d", "e", "f"])

print()
print(pd.concat([frame1, frame2]))
-----------------------------------------
     A    B    C    D    E    F
a  0.0  1.0  2.0  NaN  NaN  NaN
b  3.0  4.0  5.0  NaN  NaN  NaN
c  6.0  7.0  8.0  NaN  NaN  NaN
d  NaN  NaN  NaN  0.0  1.0  2.0
e  NaN  NaN  NaN  3.0  4.0  5.0

通過 key 參數，來辨別拼接而成的各個部分

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["A", "B", "C"],
                      index=["a", "b", "c"])
frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["D", "E", "F"],
                      index=["d", "e", "f"])
print(pd.concat([frame1, frame2], keys = ["First", "Second"]))
-----------------------------------------
            A    B    C    D    E    F
First  a  0.0  1.0  2.0  NaN  NaN  NaN
       b  3.0  4.0  5.0  NaN  NaN  NaN
       c  6.0  7.0  8.0  NaN  NaN  NaN
Second d  NaN  NaN  NaN  0.0  1.0  2.0
       e  NaN  NaN  NaN  3.0  4.0  5.0
       f  NaN  NaN  NaN  6.0  7.0  8.0

根據 axis 參數，來調整拼接的方向，其默認， axis = 0即為橫向拼接；黨 axis = 1時，則竪著并列。

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["A", "B", "C"],
                      index=["a", "b", "c"])
frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["D", "E", "F"],
                      index=["d", "e", "f"])
print(pd.concat([frame1, frame2], keys = ["First", "Second"], axis=1))
----------------------------------
  First           Second          
      A    B    C      D    E    F
a   0.0  1.0  2.0    NaN  NaN  NaN
b   3.0  4.0  5.0    NaN  NaN  NaN
c   6.0  7.0  8.0    NaN  NaN  NaN
d   NaN  NaN  NaN    0.0  1.0  2.0
e   NaN  NaN  NaN    3.0  4.0  5.0
f   NaN  NaN  NaN    6.0  7.0  8.0

數據組合

書上寫到，儅兩個數據集的索引完全或部分相同時，可以通過“組合” 來組合數據。（不是很能理解）不過還是來講講函數的具體使用方法。

使用DataFrame.combine_first( ) 函數即可達到進行組合的目的。

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["A", "B", "C"],
                      index=["a", "b", "c"])

frame2 = pd.DataFrame(np.arange(16).reshape(4, 4),
                      columns=["C","D", "E", "F"],
                      index=["c","d", "e", "f"])

print(frame1.combine_first(frame2))
----------------------------------------
     A    B     C     D     E     F
a  0.0  1.0   2.0   NaN   NaN   NaN
b  3.0  4.0   5.0   NaN   NaN   NaN
c  6.0  7.0   8.0   1.0   2.0   3.0
d  NaN  NaN   4.0   5.0   6.0   7.0
e  NaN  NaN   8.0   9.0  10.0  11.0
f  NaN  NaN  12.0  13.0  14.0  15.0

最後達到的效果，類似於拼接 + 填充（即自己沒有的可以從另外一個 DataFrame 中拿出來)。

軸向旋轉

黨你想把列轉化為行時，可以使用 DataFrame.stack()函數。演示如下：

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["A", "B", "C"],
                      index=["a", "b", "c"])

print(frame1)
print()
print(frame1.stack())
------------------------------
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8

a  A    0
   B    1
   C    2
b  A    3
   B    4
   C    5
c  A    6
   B    7
   C    8
dtype: int32

使用函數 DataFrame.unstack()可以把行轉換為列（即 stack() 的逆過程)。

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=["A", "B", "C"],
                      index=["a", "b", "c"])

print(frame1)
print()
print(frame1.unstack())
------------------------------
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8

A  a    0
   b    3
   c    6
B  a    1
   b    4
   c    7
C  a    2
   b    5
   c    8
dtype: int32

運用 DataFrame.pivot(index=， columns =， values = ) 既可以提取原對象columns屬性下可能多次出現的值，然後以達到將 “瘦”的數據變成”胖”數據的目的。（可能解釋的不是很清楚，具體請看下方的演示。

data = {'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2018-08-01', '2018-08-03', '2018-08-03',
                 '2018-08-01', '2018-08-02'],
        'variable': ['A','A','A','B','B','C','C','C'],
        'value': [3.0 ,4.0 ,6.0 ,2.0 ,8.0 ,4.0 ,10.0 ,1.0 ]}

frame2 = pd.DataFrame(data, columns=["date", "variable", "value"])
print(frame2)
print()
print(frame2.pivot(index="date", columns="variable",values="value"))
--------------------------------------------------
         date variable  value
0  2018-08-01        A    3.0
1  2018-08-02        A    4.0
2  2018-08-03        A    6.0
3  2018-08-01        B    2.0
4  2018-08-03        B    8.0
5  2018-08-03        C    4.0
6  2018-08-01        C   10.0
7  2018-08-02        C    1.0

variable      A    B     C
date                      
2018-08-01  3.0  2.0  10.0
2018-08-02  4.0  NaN   1.0
2018-08-03  6.0  8.0   4.0

Process finished with exit code 0

刪除

刪除可分爲刪除行或刪除列，同第四章所講的一樣，可以通過 del 和 DataFrame.drop() 來達到刪除的目的。接下來將會在下文一一闡述。

使用 del 時要注意，只能對 columns 進行刪除，而不能對 index 進行刪除。

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=["A", "B", "C"],
                  index=["a", "b", "c"])
print(df)
del df["A"]
print()
print(df)
-----------------------------
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8

   B  C
a  1  2
b  4  5
c  7  8

通過 DataFrame.drop() 即可刪除你想刪除的行（即索引:index)

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=["A", "B", "C"],
                  index=["a", "b", "c"])
print(df)
df = df.drop("a")
print()
print(df)
-----------------------------
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8

   A  B  C
b  3  4  5
c  6  7  8                 

數據轉換

刪除重複的元素

當你的數據中出現重複的數據后，你可能會考慮將其刪除掉，也可能會不。（不是

通過 DataFrame.duplicated() 可以來檢測重複出現的數據。

ps: 對於這種進行判斷的函數，大部分（不確定）可用於X[X.xxxx()] 條件判斷中用於生成對應的單個元素。

df = pd.DataFrame({"color":["red", "red", "blue", "green", "yellow"],
                   "values":[1, 1, 3, 5, 5]})

print(df)
print()
print(df.duplicated())
print()
print(df[df.duplicated()])
--------------------------------
    color  values
0     red       1
1     red       1
2    blue       3
3   green       5
4  yellow       5

0    False
1     True
2    False
3    False
4    False
dtype: bool

  color  values
1   red       1

映射關係

對於映射關係的理解，可以聯想 PYTHON 中的字典。在字典中， key 和 value 一一對應，這就是一種映射關係。

替換元素

通過映射，可以實現將元素替換的效果。使用 DataFrame.replace() 即可達到。其中的參數為 {old: new}這種。

df = pd.DataFrame({"objects":["pen", "pencil", "bigpen"],
                   "values":[100, 300, 500]})
print(df)
df = df.replace({"pen":"bigbigpen","bigpen":"littlepen"})
print(df)
-------------------------
  objects  values
0     pen     100
1  pencil     300
2  bigpen     500

     objects  values
0  bigbigpen     100
1     pencil     300
2  littlepen     500

添加元素

通過映射關係，我們還可以往數據裏邊添加一列。需要使用到 DataFrame.map()。

df = pd.DataFrame({"objects":["pen", "pencil", "bigpen"],
                   "values":[100, 300, 500]})

price = {100: 123, 300: 456, 500: 789}
df["price"] = df["values"].map(price)
print(df)
-----------------------------------------
  objects  values  price
0     pen     100    123
1  pencil     300    456
2  bigpen     500    789

通過前一列的數據和映射關係，即可將 price 成功添加到 DataFrame 中去。

軸的重命名

當你對你的軸的名稱不滿意時，你也可以對其進行更改。大致方式與👆文處添加元素一致。只不過此處調用的并非 .map() 而是 .rename()。

df = pd.DataFrame(np.arange(25).reshape(5, 5),
                  columns=["A", "B", "C", "D", "E",],
                  index=["a", "b", "c", "d", "e"])
print(df)

reindex_ = {"A": "AAAAA",
           "B": "BBBBB",
           "C": "CCCCC",
           "D": "DDDDD",
           "E": "EEEEE"}

df = df.rename(columns=reindex_)
print()
print(df)
-----------------------------
    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

   AAAAA  BBBBB  CCCCC  DDDDD  EEEEE
a      0      1      2      3      4
b      5      6      7      8      9
c     10     11     12     13     14
d     15     16     17     18     19
e     20     21     22     23     24

此處需注意，若直接傳入字典參數，則其默認修改的是 index ，若要對行（表頭）進行修改，需要顯性的寫出 columns = xxxx。

離散化和面元劃分

由於我們和算法對數據的不同需求，我們可能會需要離散化的數據，也可能會需要連續的數據，因此對數據進行離散化也是一個非常重要的過程。

簡單來説，將連續的數據拆分開的過程就叫離散化，所拆分的一組一組的就是“面元。”(離散化和面元劃分其實就是一類東西啦

進行劃分前，可以先通過列表定義面元，然後再通過 pd.cut()函數對數據進行劃分。

data = np.random.randint(0, 100, 50)
bins = [0, 25, 50, 75, 100]
print(data)
print()
print(pd.cut(data, bins))
----------------------------------------
[70 10 99 92  9 48 68 66 38 12  5 62 89 56 59 88  4 93 87  7  3 22 78 75
 30 62 46 60 51 64 29 52  8 40 90  7  8 44 49 47 29 49 28 79  7 43 12 33
 29 30]

[(50, 75], (0, 25], (75, 100], (75, 100], (0, 25], ..., (25, 50], (0, 25], (25, 50], (25, 50], (25, 50]]
Length: 50
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

若在調用cut() 函數時不指定面元的各界限，而是僅僅傳入一個整數作爲參數， cut()會按照指定的數字將元素按照範圍劃分為相應的幾部分。

注意：這裏是按照數字將元素劃爲等分的區域，并不代表著每個區域中的元素個數都是相同的。

data = np.random.randint(0, 100, 50)
bins = [0, 25, 50, 75, 100]
print(pd.cut(data, 5))
---------------------------------
[(-0.097, 19.4], (19.4, 38.8], (77.6, 97.0], (58.2, 77.6], (58.2, 77.6], ..., (-0.097, 19.4], (38.8, 58.2], (-0.097, 19.4], (-0.097, 19.4], (77.6, 97.0]]
Length: 50
Categories (5, interval[float64]): [(-0.097, 19.4] < (19.4, 38.8] < (38.8, 58.2] < (58.2, 77.6] <
                                    (77.6, 97.0]]

次數計算

通過 pd.value_counts() 函數可以計算各個面元出現的次數。

ata = np.random.randint(0, 100, 50)
bins = [0, 25, 50, 75, 100]
cut_result = pd.cut(data,bins)
print(pd.value_counts(cut_result))
-----------------------------------------
(75, 100]    18
(50, 75]     12
(25, 50]     11
(0, 25]       9
dtype: int64

通過 cut_result(這是個變量名) .codes 可以得出各個元素屬於哪個面元。（其中面元從0開始至len(arr) - 1，也就是面元劃分的各個下標）

data = np.random.randint(0, 100, 50)
bins = [0, 25, 50, 75, 100]
print(cut_result.codes)
-------------------------------------
[3 2 0 1 1 2 1 0 1 1 2 3 1 0 0 3 3 0 2 3 3 3 1 0 1 1 2 1 0 2 1 0 0 2 1 1 3
 0 3 1 2 0 3 2 3 0 2 0 0 2]

名稱指定

通過一個X元的列表（其中X 應與面元的個數保持一致，才能達到名稱指定（1對1）的作用），來指定面元的名稱。其中需主要的是，在調用 cut 函數時需顯性寫出參數 labels =

data = np.random.randint(0, 100, 50)
bins = [0, 25, 50, 75, 100]
bin_names = ["unlikely", "less likely", "likely", "high likely"]
print(pd.cut(data, bins, labels=bin_names))
-------------------------------------------------
['unlikely', 'likely', 'less likely', 'less likely', 'likely', ..., 'high likely', 'less likely', 'unlikely', 'high likely', 'high likely']
Length: 50
Categories (4, object): ['unlikely' < 'less likely' < 'likely' < 'high likely']

排序

改變行排列或列排列

通過 DataFrame.take()即可以實現對行排位或列排位的改變。不難想到，此處同樣可以通過對參數 axis 的控制來實現對行或是對列的操作。

df = pd.DataFrame(np.arange(25).reshape(5, 5))
new_order = np.random.permutation(5)
print(new_order)
print()
print(df.take(new_order))
-------------------------------------------
[2 3 1 4 0]

    0   1   2   3   4
2  10  11  12  13  14
3  15  16  17  18  19
1   5   6   7   8   9
4  20  21  22  23  24
0   0   1   2   3   4

此處，補充一下random.permutation()。這個函數用於創建一個包含0 至 x （不包括x)有 x 個整數的數組。這 X 個數均會被包括，但其出現的順序則是隨機的。

字符串處理

可能會用到的函數：

.split()

.join()

.index()

.find()

.replace()

正則表達式

（沒想到還是逃不過正則）

首先講一下什麽是正則表達式。

正则表达式使用单个字符串来描述、匹配一系列匹配某个句法规则的字符串。在很多文本編輯器里，正则表达式通常被用来检索、替换那些匹配某个模式的文本。

(不過老師跳過了，那我們也略。

數據聚合

所謂數據聚合，即是將數組中的多個數據進行會聚，最後生成一個單一的數值。

Groupby

在pandas中，數據的聚合通過 Groupby() 函數，生成一個 Groupby對象，具體的信息均可通過 Groupby 獲取。

frame = pd.DataFrame({"color": ["red", "blue", "green"],
                      "object": ["pen", "ball", "pencil"],
                      "value":[5, 10, 15],
                      "value2":[20, 25,50]},)

print(frame["value"])
group = frame["value"].groupby(frame["color"])
print(group.groups)
print()
print(group.mean())
-----------------------------------
0     5
1    10
2    15
Name: value, dtype: int64
{'blue': [1], 'green': [2], 'red': [0]}

color
blue     10
green    15
red       5
Name: value, dtype: int64

顯而易見，這個 DataFrame 首先應該是由數值和字符串一起構成的。

在使用 groupby 函數時，首先應指定你想要聚類在哪個方向上（或者稱爲哪個屬性上），將其傳遞在參數中；然後用被聚類的屬性來調用這個 groupby() 函數，就如上述代碼中的：frame[“value”].groupby(frame[“color”])，可理解為 A is grouped by (B)。

若想查看 group 對象，需使用屬性 groups 而非直接 print()；另外，.mean()，.max() 這些方法肯定也是會具有的。

等級分組

當然，可以根據一列來進行聚類，就可以根據多累來進行聚類。只需在傳入參數時多寫幾個進去就可以了。（簡單吧

另外，被聚類的數據同時也可以不止一個屬性哦~

ps:所以這種方法爲什麽叫等級分組呢？

frame = pd.DataFrame({"color": ["red", "blue", "green"],
                      "object": ["pen", "ball", "pencil"],
                      "value":[5, 10, 15],
                      "value2":[20, 25,50]},)

group = frame[["value", "value2"]].groupby(frame["color"])
print(group.groups)
print()
print(group.mean())
print("==========================")
print(frame.groupby(frame["object"]).mean())
--------------------------------
{'blue': [1], 'green': [2], 'red': [0]}

       value  value2
color               
blue      10      25
green     15      50
red        5      20
==========================
        value  value2
object               
ball       10      25
pen         5      20
pencil     15      50

由此可以看出，當單獨使用 frame.groupby() 而不指定被聚類屬性時，其會自動選擇 DataFrame 對象中的為數值的屬性，而非由字符串組成的屬性。

組迭代

.groupby() 函數返回的是一個可迭代對象，其組成類似於 dict 中的 (key, value) 鍵值對。

import pandas as pd
import numpy as np

frame = pd.DataFrame({"color": ["red", "blue", "green"],
                      "object": ["pen", "ball", "pencil"],
                      "value":[5, 10, 15],
                      "value2":[20, 25,50]},)

for name, group in frame.groupby(frame["color"]):
    print(name)
    print(group)
----------------------------------
blue
  color object  value  value2
1  blue   ball     10      25
green
   color  object  value  value2
2  green  pencil     15      50
red
  color object  value  value2
0   red    pen      5      20

鏈式轉換

鏈式轉換看起來可能會很酷，但實際上并不推薦大家使用（鄙人私自認爲）。

故不再進行介紹。（不是

分組函數

通過調用函數 quantile() 計算分位數即可達到分組的效果。有些同學可能就疑惑了，分位數是個什麽東西？下面簡單介紹一下分位數。

分位數是一個數值，它代表了在整個序列（或列表）中，前百分之多少的數會小於該數。用人話來説，就是將數組進行排序，然後在百分比多少的位置對數據進行劃分。

frame = pd.DataFrame({"color": ["red", "blue", "green"],
                      "object": ["pen", "ball", "pencil"],
                      "value":[5, 10, 15],
                      "value2":[20, 25,50]},)

print(frame.quantile(0.6))
---------------------------------
value     11.0
value2    30.0
Name: 0.6, dtype: float64

一般來説，分組函數可以配合 groupby() 一起使用。將數組進整體行分組后再進行内部劃分。

自定義聚類函數

通過 .agg() 可以實現自定義聚類函數的功能哦。

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                        columns = ["red", "blue", "green"],
                      index =  ["pen", "ball", "pencil"])

group = frame.groupby(frame.index)

print(group.mean())
print()
print(group.agg(mydef))
print()
print(frame.agg(mydef))
----------------------------------------------
       red  blue  green
ball      3     4      5
pen       0     1      2
pencil    6     7      8

        red  blue  green
ball      0     0      0
pen       0     0      0
pencil    0     0      0

red      6
blue     6
green    6
dtype: int64

高級數據聚合

（很難不懷疑爲什麽教材把這個内容放在章節末）

需要使用到的函數：

.transform()

.apply()

示例如下：

add_prefix() 用於修改index，groupby()用於在 color(本例中）進行聚類。

.transform(func) 其中func 是用於轉換的函數，既可以傳入自定義函數，也可以傳入庫中所包括的函數。

frame = pd.DataFrame({"color": ["white", "red", "green", "red", "green"],
                      "price1":[2, 5, 7, 9, 10],
                      "price2":[11, 33, 55, 77, 99]})

print(frame.groupby(frame["color"]).transform(np.sum).add_prefix("tot_"))
---------------------------------
   tot_price1  tot_price2
0           2          11
1          14         110
2          17         154
3          14         110
4          17         154

apply() 函數適用於更加通用的 Groupby() 操作。

frame = pd.DataFrame({"color": ["white", "black", "white", "black", "white"],
                      "status":["up", "up", "down", "down", "down"],
                      "value1":[12, 13, 15, 16, 17],
                      "value2": [21, 25, 26, 27, 29]})

print(frame.groupby(["color", "status"]).apply(lambda x:x.max()))
-----------------------------------------------
			  color status  value1  value2
color status                              
black down    black   down      16      27
      up      black     up      13      25
white down    white   down      17      29
      up      white     up      12      21

注：如果沒看懂，其實也沒有關係。

Pandas 数据处理