[Original] Let the data show you: how much does parents' education affect their children's grades?

等你在雨中tbv9 2020-08-06


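Knowledge can change not only your own fate, but also the fate of your descendants.

After today's data analysis, I believe you will see this point more clearly.

This is why the national policy of prioritizing education in poor areas is so strategically significant. Only when education lifts people out of poverty can a family, or a region, truly escape it.

The dataset is fairly simple. It comes from a set of US student score records and includes gender, parental level of education, whether the student took a test preparation course, lunch plan, scores, and so on. Its purpose is to record how these factors affect student performance.

As before, the tool for today's analysis is Python.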

Data Preparation

1. Data overview

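import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyecharts import Bar, WordCloud, Pie, Line  # imported by the author; not used in this snippet

# Load the data and preview the first five rows
df = pd.read_csv(r'C:\Users\Administrator\Desktop\StudentsPerformance.csv')
df.head()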

These are the fields in the data. As I said above, they are simple.

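import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyecharts import Bar, WordCloud, Pie, Line  # imported by the author; not used in this snippet

# Load the data and print column types and non-null counts
df = pd.read_csv(r'C:\Users\Administrator\Desktop\StudentsPerformance.csv')
df.info()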

There are only two data types. As the output shows, the data is clean and needs no cleaning, so let's go straight to the analysis.
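A "clean" verdict like this is usually backed by a count of missing values and duplicated rows. The sketch below shows one way to make that check with pandas. The small `sample` frame is a hypothetical stand-in with gaps added on purpose, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, with two gaps inserted for illustration.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", None],
    "math score": [72, 69, np.nan, 47],
    "reading score": [72, 90, 95, 57],
})

# Missing values per column; all zeros would mean nothing to fill or drop.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Fully duplicated rows are another common cleanliness check.
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)
```

Running the same two calls on `df` would confirm what the `df.info()` output suggests.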

Data Analysis

1. Box plot of score distribution by subject

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts import Bar, WordCloud, Pie, Line  # imported by the author; not used in this snippet

df = pd.read_csv(r'C:\Users\Administrator\Desktop\StudentsPerformance.csv')

# plt.rcParams['font.sans-serif'] = ['SimHei']  # only needed when the labels are in Chinese

# One series per subject
y1 = df['math score']
y2 = df['reading score']
y3 = df['writing score']
x = [y1, y2, y3]

plt.figure(figsize=(10, 5))
labels = ['math score', 'reading score', 'writing score']
plt.boxplot(x, labels=labels, vert=True)
plt.title('Score distribution by subject (box plot)', loc='left', size=15)
plt.xlabel('Subject', size=15)
plt.ylabel('Score', size=15)
plt.xticks(size=15)
plt.yticks(size=15)
# plt.yticks([]) would hide the y-axis ticks
plt.grid(False)
sns.despine(left=False)  # remove the top and right spines
plt.show()

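Box plots are used a lot in data analysis and are especially handy for looking at distributions. The rectangle in the middle runs from the lower quartile to the upper quartile, so it holds the middle half of the data. Points below the lower whisker are outliers, that is, unusually low scores.

The chart shows that scores in every subject concentrate in roughly the same range, between 60 and 80. Math has more low outliers, though, which suggests that math is the hardest subject.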
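The box and the outliers can also be computed directly, which makes it easier to compare subjects by number rather than by eye. The sketch below uses the common 1.5 × IQR rule, which is matplotlib's default whisker setting. The `scores` values are hypothetical and chosen only to illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical scores, not taken from the dataset.
scores = pd.Series([8, 55, 60, 62, 65, 68, 70, 72, 75, 80, 85])

# The box spans the lower quartile (Q1) to the upper quartile (Q3).
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, q3, iqr)
print(outliers.tolist())
```

Applying the same steps to `df['math score']`, `df['reading score']` and `df['writing score']` would give an outlier count per subject.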

2. Students grouped by overall score

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts import Bar, WordCloud, Pie, Line  # imported by the author; not used in this snippet
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

df = pd.read_csv(r'C:\Users\Administrator\Desktop\StudentsPerformance.csv')
y1 = df['math score']
y2 = df['reading score']
y3 = df['writing score']
df['total score'] = y1 + y2 + y3

def GetGrade(total):
    # Map a three-subject total to a grade band
    if total >= 270:
        return 'Excellent'
    if total >= 240:
        return 'Good'
    if total >= 180:
        return 'Pass'
    else:
        return 'Fail'

df['grade'] = df.apply(lambda x: GetGrade(x['total score']), axis=1)

# plt.rcParams['font.sans-serif'] = ['SimHei']  # only needed when the labels are in Chinese
plt.figure(figsize=(10, 5))
sns.countplot(x="grade", data=df, order=['Excellent', 'Good', 'Pass', 'Fail'], palette="muted")
plt.title('Students by grade band', loc='left', size=15)
plt.xlabel('Grade band', size=15)
plt.ylabel('Number of students', size=15)
plt.grid(False)
sns.despine(left=False)  # remove the top and right spines
plt.show()

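This chart shows how students' combined three-subject scores fall into bands. I classed a total of 270 or above as excellent, 240 or above as good, and 180 or above as a pass; everything else is a fail. The grouping code is in the block above.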
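The same banding can be sketched without a row-by-row function by using `pd.cut`. This is an alternative to the author's `GetGrade`, not a copy of it. The `totals` values are hypothetical, the labels are English renderings of the four bands, and the upper edge of 301 assumes a maximum total of 300.

```python
import pandas as pd

# Hypothetical totals placed on both sides of each threshold.
totals = pd.Series([150, 179, 180, 239, 240, 269, 270, 300])

# right=False makes each bin include its left edge, matching the ">=" tests:
# [0, 180) Fail, [180, 240) Pass, [240, 270) Good, [270, 301) Excellent.
bins = [0, 180, 240, 270, 301]
labels = ["Fail", "Pass", "Good", "Excellent"]
grades = pd.cut(totals, bins=bins, labels=labels, right=False)

# Count students per band, in band order.
print(grades.value_counts().reindex(labels))
```

Because the result is an ordered categorical, plots and tables built from it keep the bands in a sensible order without a manual `order` argument.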

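But the two charts above only describe the scores themselves. They do not show what influences them.

From here on, I look specifically at how parents' education affects their children's scores.

3. Effect of parental education on children's scores, part 1: exam scores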

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts import Bar, WordCloud, Pie, Line  # imported by the author; not used in this snippet
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

df = pd.read_csv(r'C:\Users\Administrator\Desktop\StudentsPerformance.csv')
y1 = df['math score']
y2 = df['reading score']
y3 = df['writing score']
df['total score'] = y1 + y2 + y3

# One violin per parental education level
plt.figure(figsize=(10, 5))
sns.violinplot(x="parental level of education", y="total score", data=df, palette="Set3")

plt.title('Effect of parental education on scores, part 1: exam scores', loc='left', size=15)
plt.xlabel('Parental education', size=15)
plt.ylabel('Score', size=15)
plt.xticks(size=12)
plt.yticks(size=12)
# plt.yticks([]) would hide the y-axis ticks
plt.grid(False)
sns.despine(left=False)  # remove the top and right spines
plt.show()

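This is a violin plot. It is similar to a box plot and is read in much the same way: the longer the lower tail, the further the low scores extend.

The chart shows that children of parents who did not finish high school (some high school) have the most low scores. Next come parents with a high school education (high school), then parents who attended college without finishing (some college), then parents with a junior college degree (associate's degree), then parents with a university degree (bachelor's degree). Children of parents with a master's degree (master's degree) have the fewest low scores.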
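A ranking like this is easier to check when the education levels appear in a fixed order and each one has a summary number. The sketch below uses an ordered categorical and a group-wise median. The order in `education_order` follows the sequence in the paragraph above, and the rows in `sample` are hypothetical.

```python
import pandas as pd

# Lowest to highest level, following the order used in the text (an assumption).
education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# Hypothetical rows, not taken from the dataset.
sample = pd.DataFrame({
    "parental level of education": [
        "master's degree", "some high school", "high school", "some college",
        "some high school", "bachelor's degree", "associate's degree", "master's degree",
    ],
    "total": [250, 150, 180, 200, 170, 230, 210, 270],
})

# An ordered categorical keeps groups (and plot axes) in the chosen order.
sample["parental level of education"] = pd.Categorical(
    sample["parental level of education"],
    categories=education_order,
    ordered=True,
)

# Median and minimum total per level summarize the centre and the low tail.
summary = (
    sample.groupby("parental level of education", observed=False)["total"]
    .agg(["median", "min"])
)
print(summary)
```

Passing the same `education_order` list to the `order` argument of `sns.violinplot` would arrange the violins the same way.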

4. Effect of parental education on children's scores, part 2: number of students passing

# Continues from the previous block: df, np, plt and sns are already loaded

# Compute the total score
y1 = df['math score']
y2 = df['reading score']
y3 = df['writing score']
df['total score'] = y1 + y2 + y3

# Flag each student as passed ('Y') or not ('N')
passmark = 180
df['passed'] = np.where(df['total score'] < passmark, 'N', 'Y')
df['passed'].value_counts()

plt.figure(figsize=(12, 6))
p = sns.countplot(x='parental level of education', data=df, hue='passed', palette='bright')
_ = plt.setp(p.get_xticklabels(), rotation=0)

plt.title('Effect of parental education on scores, part 2: number passing', loc='left', size=15)
plt.xlabel('Parental education', size=15)
plt.ylabel('Number of students passing or failing', size=15)
plt.xticks(size=12)
plt.yticks(size=12)
# plt.yticks([]) would hide the y-axis ticks
plt.grid(False)
sns.despine(left=False)  # remove the top and right spines
plt.show()

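Only after I had made this chart did I realize that it does not show the effect of parental education very clearly. Since it is already done, I will leave it here as practice.

What the chart does show is that, whether in China or abroad, highly educated people are a minority. In this US data, parents with a master's degree are the smallest group.

5. Effect of parental education on children's scores, part 3: pass rate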

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts import Bar, WordCloud, Pie, Line  # imported by the author; not used in this snippet
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

df = pd.read_csv(r'C:\Users\Administrator\Desktop\StudentsPerformance.csv')
y1 = df['math score']
y2 = df['reading score']
y3 = df['writing score']
df['total score'] = y1 + y2 + y3

passmark = 180
df['passed'] = np.where(df['total score'] < passmark, 'N', 'Y')

# df1: all students per education level; df2: students who passed per level
df1 = df.pivot_table('passed', index='parental level of education', aggfunc='count').reset_index()
df2 = df[df['passed'] == 'Y']
df2 = df2.pivot_table('passed', index='parental level of education', aggfunc='count').reset_index()

# After the merge, 'passed_x' is the total count and 'passed_y' the passing count
df3 = pd.merge(df1, df2, on='parental level of education')
df3['pass rate'] = df3['passed_y'] / df3['passed_x']

# plt.rcParams['font.sans-serif'] = ['SimHei']  # only needed when the labels are in Chinese
x = df3['parental level of education']
y = round(df3['pass rate'], 2)

plt.figure(figsize=(12, 6))
plt.bar(x, y, width=0.5, align='center')
plt.title('Effect of parental education on scores, part 3: pass rate', loc='left', size=15)

# Show the value above each bar
for a, b in zip(x, y):
    plt.text(a, b, b, ha='center', va='bottom', fontsize=12)

plt.xlabel('Parental education', size=15)
plt.ylabel('Pass rate', size=15)
plt.xticks(x, size=12)
plt.yticks(size=15)
# plt.yticks([]) would hide the y-axis ticks
plt.grid(False)
sns.despine(left=False)  # remove the top and right spines
plt.show()
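The pass rate per education level can also be sketched more compactly as the mean of a boolean flag within each group, since the mean of True/False values is the share of True. This is an alternative to the two pivot tables and the merge, shown on hypothetical rows; `PASS_MARK` mirrors the 180-point threshold used in this article.

```python
import pandas as pd

PASS_MARK = 180  # same threshold as in the article

# Hypothetical rows, not taken from the dataset.
sample = pd.DataFrame({
    "parental level of education": [
        "high school", "high school", "high school", "high school",
        "master's degree", "master's degree",
    ],
    "total": [150, 190, 200, 170, 260, 185],
})

# True where the student reached the pass mark.
passed = sample["total"] >= PASS_MARK

# Group-wise mean of the boolean flag gives the pass rate per level.
pass_rate = passed.groupby(sample["parental level of education"]).mean().round(2)
print(pass_rate)
```

The resulting Series can be passed straight to `plt.bar(pass_rate.index, pass_rate.values)` to draw the same kind of chart.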