Updated: December 26, 2017, 15:20  Source: 乐鱼播客
Python is fast becoming the preferred programming language of data scientists, and for good reason: it offers a broad programming ecosystem together with deep, well-performing scientific computing libraries.
Among Python's scientific computing libraries, Pandas is the one best suited to data science work. This article introduces 5 techniques for data manipulation in Python using Pandas.
First, import the relevant modules and load the data set into the Python environment:
import pandas as pd
import numpy as np
data = pd.read_csv("***.csv", index_col="Loan_ID")
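1. Apply function
Apply is one of the most commonly used functions for manipulating data and creating new variables. It passes a given function over every row or every column of a data frame and returns the corresponding values. The function passed to Apply can be a built-in one or user-defined.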
def num_missing(x):
    # Count the missing (NaN) values in a column or row
    return x.isnull().sum()

# Apply per column (axis=0)
print("Missing values per column:")
print(data.apply(num_missing, axis=0))
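The same helper can also be applied per row by switching to axis=1. A minimal, self-contained sketch on a small made-up frame (the values and the name toy are illustrative assumptions, not taken from the loan data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan CSV
toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female"],
        "Married": ["Yes", "No", None],
        "LoanAmount": [120.0, np.nan, np.nan],
    },
    index=["LP001", "LP002", "LP003"],
)

def num_missing(x):
    # Count the missing values in one column or one row
    return x.isnull().sum()

per_column = toy.apply(num_missing, axis=0)  # one count per column
per_row = toy.apply(num_missing, axis=1)     # one count per row

print(per_column)
print(per_row)
```

With axis=0 the function receives each column as a Series; with axis=1 it receives each row.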
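2. Filling missing values
The fillna() function fills missing values in a single step. It is typically used to replace the missing data in a column with that column's mean, mode, or median. The example below uses the mode of each of the "Gender", "Married", and "Self_Employed" columns to fill the missing values in that column.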
from scipy.stats import mode
# Find the most frequent value (mode) of the 'Gender' column
mode(data['Gender'])
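The lookup above only finds the mode; the fill itself still has to be done with fillna(). One way to complete the step is sketched here on made-up data. It swaps in pandas' own Series.mode() for scipy.stats.mode, since Series.mode() skips missing values and works on string columns directly. The frame toy and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from the loan CSV
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", None, "Female"],
        "Married": ["Yes", None, "Yes", "No"],
        "Self_Employed": ["No", "No", "No", None],
    }
)

for col in ["Gender", "Married", "Self_Employed"]:
    # Series.mode() ignores NaN by default; take the first mode if there are ties
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```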
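3. Pivot table
Pandas can build pivot tables in the style of MS Excel. In this example, the key column "LoanAmount" has missing values. They can be replaced with the mean amount of each group formed by "Gender", "Married", and "Self_Employed". The group means of "LoanAmount" can be determined with a pivot table over these three columns.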
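A minimal sketch of that computation, assuming the result is stored under the name impute_grps that the code in section 4 refers to, and using a small made-up frame (toy) in place of the loan CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan CSV
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "No", "No", "No"],
        "Self_Employed": ["No", "No", "No", "Yes", "Yes"],
        "LoanAmount": [100.0, 140.0, 90.0, 200.0, np.nan],
    }
)

# Mean LoanAmount per (Gender, Married, Self_Employed) group;
# missing amounts are skipped when the mean is computed
impute_grps = toy.pivot_table(
    values=["LoanAmount"],
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="mean",
)
print(impute_grps)
```

The resulting frame has one row per group and is indexed by all three grouping columns at once.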
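4. MultiIndex
If you look closely at the output of step #3, you will notice an odd property: every index entry is a combination of three values. This is called a MultiIndex (composite index), and it helps operations run quickly.
Continuing the example from #3, the mean of each group is known, but the missing values have not been filled yet. This can be done by combining several of the techniques covered so far: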
# Iterate only over the rows with a missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    # Build the (Gender, Married, Self_Employed) key into the MultiIndex
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

# Check the missing-value counts again to confirm
print(data.apply(num_missing, axis=0))
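To make the MultiIndex lookup concrete, the sketch below runs the same loop end to end on a small made-up frame. The names toy, keys, and vectorized are assumptions for illustration. It also computes a loop-free alternative with groupby().transform(), a different technique from the loop above, for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan CSV
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "No", "No"],
        "Self_Employed": ["No", "No", "Yes", "Yes"],
        "LoanAmount": [100.0, np.nan, 200.0, np.nan],
    },
    index=["LP001", "LP002", "LP003", "LP004"],
)

keys = ["Gender", "Married", "Self_Employed"]
impute_grps = toy.pivot_table(values=["LoanAmount"], index=keys, aggfunc="mean")

# A row of a MultiIndex frame is addressed with a tuple, one entry per level
print(impute_grps.loc[("Male", "Yes", "No")])

# Loop-free alternative: broadcast each group's mean back onto its rows
vectorized = toy["LoanAmount"].fillna(
    toy.groupby(keys)["LoanAmount"].transform("mean")
)

# Fill each missing LoanAmount with the mean of its group
for i, row in toy.loc[toy["LoanAmount"].isnull(), :].iterrows():
    ind = tuple(row[k] for k in keys)
    toy.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]

print(toy["LoanAmount"].isnull().sum())
```

Both approaches give the same filled column on this data; the loop mirrors the article's code, while the transform version avoids iterating over rows.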
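5. Crosstab function
This function gives a first impression (an at-a-glance view) of the data, which can be used to check basic hypotheses. In this example, "Credit_History" is expected to affect loan status significantly. The hypothesis can be checked with the cross-tabulation produced by the following code: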
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)
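Absolute counts can be hard to compare across groups of different size. The sketch below, on made-up data (toy and its values are illustrative assumptions), shows the same call next to a row-normalized version that uses the normalize parameter of pd.crosstab:

```python
import pandas as pd

# Hypothetical toy data standing in for the loan CSV
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
        "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
    }
)

# Absolute counts, with row and column totals in the "All" margins
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)
print(counts)

# Row-wise shares: each Credit_History row sums to 1
shares = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(shares)
```

Reading the shares row by row shows directly what fraction of applicants in each credit-history group had their loan approved.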