Learn to use pandas for data analysis in 10 minutes

Nov 17, 2022

First, load the data.

import pandas as pd
df = pd.read_csv("./data/cardio_base.csv")
df.head()

1. How tall are the tallest 1% of people?

new_df = df.sort_values(by=["height"], ascending=False)
new_df["height"][:int(df.shape[0]*0.01)]

6486     250
21628    207
41901    200
8897     198
30127    198
        ... 
21182    184
59359    184
37162    184
69375    184
2862     184
Name: height, Length: 700, dtype: int64

Answer: they are taller than 184 cm (the shortest of the top 700 is 184 cm).
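The same cutoff can be read off without sorting the whole frame. Here is a minimal sketch on made-up data (the `toy` frame is hypothetical and only stands in for the `height` column), using `Series.quantile` and `Series.nlargest`:

```python
import pandas as pd

# Hypothetical toy data standing in for the "height" column of cardio_base.csv.
toy = pd.DataFrame({"height": list(range(101, 301))})  # 200 rows: 101..300

# Height at the 99th percentile (linear interpolation between neighbours).
cutoff = toy["height"].quantile(0.99)

# The tallest 1% of rows, taken directly by count.
top = toy["height"].nlargest(int(len(toy) * 0.01))

print(cutoff)
print(top.min())
```

Note that `quantile` interpolates between values, so its result can differ slightly from the smallest height in the sorted top 1%.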

2. Which two features have the highest Spearman rank correlation?

new_df = df.corr(method="spearman")
new_df

Answer: the blood pressure levels (ap_hi, ap_low)
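Reading the strongest pair off a large correlation matrix by eye is error-prone. As a rough sketch, the pair can be picked out in code; the `toy` frame and its column names `a`, `b`, `c` are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns a, b, c are not from the cardio dataset.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 4, 3, 6, 5],
    "c": [6, 1, 5, 2, 4, 3],
})

corr = toy.corr(method="spearman")

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
# and a column's perfect correlation with itself is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pair with the largest absolute correlation.
best_pair = pairs.abs().idxmax()
print(best_pair, pairs[best_pair])
```

Taking the absolute value treats a strong negative correlation as "high" too; drop `.abs()` if only positive correlation counts.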

3. What percentage of people are more than 2 standard deviations away from the average height?

height_mean = df["height"].values.mean()
height_std = df["height"].values.std()
df["height_a"]=df["height"]-height_mean
df["height_a"]=df["height_a"].apply(abs)
num_df = df[df["height_a"]>2*height_std].shape[0]
num_df/df.shape[0]

0.033357142857142856

Answer: about 3%
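The same calculation can be written without the helper column. This is a small sketch on a hypothetical `toy` frame; `ddof=0` is passed explicitly because the code above calls `.values.std()`, which is NumPy's population standard deviation, while pandas' own `Series.std()` defaults to `ddof=1`:

```python
import pandas as pd

# Hypothetical toy heights, not taken from cardio_base.csv.
toy = pd.DataFrame({"height": [150, 160, 165, 170, 170, 170, 175, 180, 190, 250]})
height = toy["height"]

# Distance from the mean in units of standard deviation (population std, ddof=0).
z = (height - height.mean()) / height.std(ddof=0)

# The mean of a boolean Series is the share of True values.
share = (z.abs() > 2).mean()
print(share)
```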

4. What percentage of the population over 50 years old consume alcohol?

Also use cardio_alco.csv and merge the datasets on ID. Ignore people for whom there is no alcohol consumption information.

df_alco = pd.read_csv("./data/cardio_alco.csv",sep=";")
df_alco.head()

dff = pd.merge(df,df_alco,on="id")
dff.head()

dff[dff["age"]>365*50]["alco"].value_counts()

0    37540
1     1957
Name: alco, dtype: int64

1957/(1957+37540)

0.04954806694179305
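Two details of this step are easy to miss: `pd.merge` defaults to an inner join, which is what drops people with no alcohol record, and the age threshold is `365*50` because age is compared in days. A minimal sketch on invented frames (`base` and `alco` are hypothetical stand-ins for the two CSV files):

```python
import pandas as pd

# Hypothetical stand-ins for cardio_base.csv and cardio_alco.csv.
base = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                     "age": [20000, 19000, 15000, 21000, 18300]})  # age in days
alco = pd.DataFrame({"id": [1, 2, 3, 5], "alco": [1, 0, 1, 0]})    # id 4 is missing

# how="inner" is the default; spelled out here to show that id 4 is dropped.
merged = pd.merge(base, alco, on="id", how="inner")

over_50 = merged[merged["age"] > 365 * 50]

# alco is 0/1, so its mean is the share of people who consume alcohol.
share = over_50["alco"].mean()
print(share)
```

Taking the mean of the 0/1 column avoids copying counts out of `value_counts()` by hand.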

5. Which of the following statements is true with 95% confidence?

df[df["smoke"]==0][["weight","ap_hi","cholesterol"]].describe()

df[df["smoke"]==1][["weight","ap_hi","cholesterol"]].describe()

Answer: smokers have a higher cholesterol level than non-smokers
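`describe()` shows means and standard deviations but not a confidence level. One rough way to back the claim, assuming a normal approximation for the mean, is to compare 95% confidence intervals for the two groups. The helper `mean_ci` and the `toy` data are hypothetical, and the approximation is crude for groups this small:

```python
import math
import pandas as pd

def mean_ci(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = values.count()
    mean = values.mean()
    sem = values.std() / math.sqrt(n)  # standard error, sample std (ddof=1)
    return mean - z * sem, mean + z * sem

# Hypothetical toy data, not taken from cardio_base.csv.
toy = pd.DataFrame({
    "smoke": [0] * 8 + [1] * 8,
    "cholesterol": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3],
})

non_smokers = mean_ci(toy.loc[toy["smoke"] == 0, "cholesterol"])
smokers = mean_ci(toy.loc[toy["smoke"] == 1, "cholesterol"])

# If the smokers' interval lies entirely above the non-smokers' interval,
# the difference in means is unlikely to be due to chance.
separated = smokers[0] > non_smokers[1]
print(non_smokers, smokers, separated)
```

Non-overlapping intervals are a conservative check; a two-sample test would be the more usual tool for this question.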

6. When did the difference in the total number of confirmed cases between Italy and Germany become more than 10000?

Second dataset: Covid19 cases

This dataset contains daily Covid19 cases for all countries in the world. Each row represents a calendar day. The rows also contain some simple information about the countries, such as population, percentage of the population over 65, GDP, and hospital beds per thousand inhabitants. Please use this dataset to answer the following questions.

import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv("./data/covid_data.csv")
df.head()

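def sum_diedai(li):
    # Running total: element i is the sum of li[0..i]
    new_li=[]
    for i in range(len(li)):
        new_li.append(sum(li[:i+1]))
    return new_li

# .copy() so the new columns are added to an independent frame, not a slice of df
df_ita = df[df["location"]=="Italy"].copy()
df_ger = df[df["location"]=="Germany"]

# Assumes both countries have the same number of rows covering the same dates
df_ita["ita_sum"]=sum_diedai(df_ita["new_cases"].values.tolist())
df_ita["ger_sum"]=sum_diedai(df_ger["new_cases"].values.tolist())

# Absolute difference between the two cumulative totals
df_ita["jian"]=df_ita["ita_sum"]-df_ita["ger_sum"]
df_ita["jian"]=df_ita["jian"].apply(abs)
df_ita[df_ita["jian"]>10000].head()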

Answer: 2020-03-12
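The approach above only works if Italy and Germany have rows for exactly the same dates in the same order. A sketch that aligns the two countries on date instead, using `pivot` and `cumsum` on a hypothetical `toy` frame (the threshold of 10 is made up to suit the toy numbers):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the text uses.
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02",
             "2020-03-02", "2020-03-03", "2020-03-03"],
    "location": ["Italy", "Germany"] * 3,
    "new_cases": [5, 1, 7, 2, 9, 3],
})

# One row per date, one column per country, so both series share the same index.
wide = toy.pivot(index="date", columns="location", values="new_cases").fillna(0)

# Cumulative totals per country, then the absolute gap between them.
total = wide.cumsum()
gap = (total["Italy"] - total["Germany"]).abs()

# First date on which the gap exceeds the (toy) threshold.
first_day = gap[gap > 10].index[0]
print(first_day)
```

`fillna(0)` treats a missing day as zero new cases, which is an assumption about the data rather than something the dataset guarantees.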

li=[1,2,3,4]
sum_diedai(li)

[1, 3, 6, 10]

li[:4]

[1, 2, 3, 4]

# li-[1,1,1]
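`sum_diedai` re-sums every prefix, so its cost grows quadratically with the length of the list. For comparison, here is a sketch of the same running total using the standard library's `itertools.accumulate`; the name `running_total` is invented for this example:

```python
from itertools import accumulate

def sum_diedai(li):
    # Original helper from the text: re-sums each prefix of the list.
    new_li = []
    for i in range(len(li)):
        new_li.append(sum(li[:i + 1]))
    return new_li

def running_total(li):
    # Same result in a single pass over the list.
    return list(accumulate(li))

li = [1, 2, 3, 4]
print(running_total(li))
print(running_total(li) == sum_diedai(li))
```

On a pandas column, `Series.cumsum()` does the same job.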