Learn to use pandas for data analysis in 10 minutes

Piaoya
3 min read · Nov 17, 2022

First, load the data:
import pandas as pd
df = pd.read_csv("./data/cardio_base.csv")
df.head()

1. How tall are the tallest 1% of people?

new_df= df.sort_values(by=["height"],ascending=False)
new_df["height"][:int(df.shape[0]*0.01)]
6486 250
21628 207
41901 200
8897 198
30127 198
...
21182 184
59359 184
37162 184
69375 184
2862 184
Name: height, Length: 700, dtype: int64

Answer: the tallest 1% are 184 cm or taller.
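The same cutoff can be read off without sorting the whole frame. The sketch below is an alternative to the approach above, using `Series.nlargest` on a small made-up `heights` series (hypothetical values, not taken from cardio_base.csv) and a 10% share so the result is easy to check by hand:

```python
import pandas as pd

# Hypothetical heights in cm (150..169); a stand-in for df["height"].
heights = pd.Series(range(150, 170))

# Number of rows in the top 10% (the article uses 1% of its rows).
n_top = int(len(heights) * 0.10)

# nlargest returns the n_top tallest values, sorted in descending order.
tallest = heights.nlargest(n_top)

# The smallest value in that group is the cutoff height.
cutoff = tallest.min()
print(cutoff)
```

On the real data, replacing `heights` with `df["height"]` and `0.10` with `0.01` should select the same 700 rows as the sorted slice.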

2. Which two features have the highest Spearman rank correlation?

new_df = df.corr(method="spearman")
new_df

Answer: the blood pressure levels (ap_hi, ap_low)
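Reading the strongest pair off a large correlation table by eye is error-prone. As a sketch, the pair can also be picked out programmatically by ignoring the diagonal and taking the largest absolute value. The `toy` frame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is a monotonic function of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 4, 9, 16, 25, 36],
    "c": [3, 1, 6, 2, 5, 4],
})

corr = toy.corr(method="spearman")

# Work on a copy of the absolute values and zero out the diagonal,
# so a column's perfect correlation with itself is ignored.
vals = corr.abs().to_numpy().copy()
np.fill_diagonal(vals, 0)

# Row and column position of the largest remaining entry.
i, j = np.unravel_index(vals.argmax(), vals.shape)
best_pair = (corr.index[i], corr.columns[j])
best_value = corr.iloc[i, j]
print(best_pair, best_value)
```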

3. What percentage of people are more than 2 standard deviations away from the average height?

height_mean = df["height"].values.mean()
height_std = df["height"].values.std()
df["height_a"]=df["height"]-height_mean
df["height_a"]=df["height_a"].apply(abs)
num_df = df[df["height_a"]>2*height_std].shape[0]
num_df/df.shape[0]
0.033357142857142856

Answer: about 3% (3.3% before rounding)
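The same share can be computed without adding a helper column, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on made-up heights (hypothetical values):

```python
import pandas as pd

# Hypothetical heights in cm; a stand-in for df["height"].
heights = pd.Series([160, 162, 164, 165, 166, 168, 170, 171, 173, 210])

mean = heights.mean()
# ddof=0 matches the NumPy default that .values.std() uses in the article;
# pandas' own Series.std() defaults to ddof=1.
std = heights.std(ddof=0)

# Boolean mask: True where a height is more than 2 std from the mean.
is_outlier = (heights - mean).abs() > 2 * std

# The mean of a boolean Series is the fraction of True values.
share = is_outlier.mean()
print(share)
```

Note the `ddof` difference: on a large sample it barely changes the result, but on small samples the two standard deviations can differ noticeably.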

4. What percentage of the population over 50 years old consumes alcohol?

Also use cardio_alco.csv and merge the two datasets on id. Ignore people for whom there is no alcohol consumption information.

df_alco = pd.read_csv("./data/cardio_alco.csv",sep=";")
df_alco.head()

dff = pd.merge(df,df_alco,on="id")
dff.head()
dff[dff["age"]>365*50]["alco"].value_counts()
0    37540
1     1957
Name: alco, dtype: int64
1957/(1957+37540)
0.04954806694179305

Answer: about 5%
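Two details make this step work: `pd.merge` defaults to an inner join, which drops people missing from either table, and `age` is stored in days, hence the `365*50` threshold. The sketch below shows both on tiny made-up tables (hypothetical values) and takes the mean of the 0/1 `alco` column instead of dividing the counts by hand:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; age is in days,
# as the 365*50 threshold in the article implies.
base = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [365 * 40, 365 * 55, 365 * 60, 365 * 52, 365 * 70],
})
alco = pd.DataFrame({
    "id": [1, 2, 3, 4],   # id 5 has no alcohol information
    "alco": [1, 1, 0, 0],
})

# An inner join (the default for pd.merge) keeps only ids present in both tables.
merged = pd.merge(base, alco, on="id", how="inner")

over_50 = merged[merged["age"] > 365 * 50]

# With a 0/1 column, the mean is the share of drinkers.
share = over_50["alco"].mean()
print(share)
```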

5. Which of the following statements is true with 95% confidence?

df[df["smoke"]==0][["weight","ap_hi","cholesterol"]].describe()
df[df["smoke"]==1][["weight","ap_hi","cholesterol"]].describe()

Answer: smokers have a higher cholesterol level than non-smokers
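The `describe()` tables compare means and spreads, but they do not attach a confidence level by themselves. One possible way to do that, sketched here under the assumption of large, independent samples, is an approximate 95% confidence interval for the difference of the two group means. The helper `mean_diff_ci` and the data are hypothetical:

```python
import math
import pandas as pd

def mean_diff_ci(a, b, z=1.96):
    """Approximate 95% confidence interval for mean(a) - mean(b).

    Uses the normal approximation with unpooled variances, which is
    reasonable for large samples.
    """
    diff = a.mean() - b.mean()
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return diff - z * se, diff + z * se

# Hypothetical cholesterol levels (1 to 3) for two groups.
smokers = pd.Series([1, 2, 2, 3] * 25)
non_smokers = pd.Series([1, 1, 2, 2] * 25)

low, high = mean_diff_ci(smokers, non_smokers)
# If the whole interval lies above 0, the first group's mean is
# higher at roughly the 95% level.
print(low, high, low > 0)
```

On the real data, `a` and `b` would be `df[df["smoke"]==1]["cholesterol"]` and `df[df["smoke"]==0]["cholesterol"]`.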

Second dataset: Covid-19 cases. This dataset contains daily Covid-19 cases for all countries in the world. Each row represents a calendar day. The rows also contain some simple information about the countries, such as population, percentage of the population over 65, GDP, and hospital beds per thousand inhabitants. Please use this dataset to answer the following questions.

6. When did the difference in the total number of confirmed cases between Italy and Germany become more than 10,000?

import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv("./data/covid_data.csv")
df.head()
def sum_diedai(li):
    # Running (cumulative) sum: element i is the sum of li[0..i].
    new_li=[]
    for i in range(len(li)):
        new_li.append(sum(li[:i+1]))
    return new_li
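# .copy() avoids writing new columns into a slice of df.
df_ita = df[df["location"]=="Italy"].copy()
df_ger = df[df["location"]=="Germany"].copy()
df_ita["ita_sum"]=sum_diedai(df_ita["new_cases"].values.tolist())
# Assumes Italy and Germany have the same number of rows, in the same date order.
df_ita["ger_sum"]=sum_diedai(df_ger["new_cases"].values.tolist())
# "jian" holds the absolute difference between the two running totals.
df_ita["jian"]=df_ita["ita_sum"]-df_ita["ger_sum"]
df_ita["jian"]=df_ita["jian"].apply(abs)
df_ita[df_ita["jian"]>10000].head()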

Answer: 2020-03-12
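The running totals can also be built with the built-in `cumsum`, and pivoting on the date first aligns the two countries by calendar day rather than by row position. This is a sketch of that alternative on made-up numbers. The column names `location`, `new_cases`, and `date` follow the dataset as used above, with `date` being an assumption about how the day column is named:

```python
import pandas as pd

# Hypothetical daily new cases for two countries.
covid = pd.DataFrame({
    "location": ["Italy"] * 4 + ["Germany"] * 4,
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2,
    "new_cases": [4000, 5000, 6000, 7000, 1000, 2000, 2000, 3000],
})

# One row per date, one column per country, so the two series are
# aligned by date rather than by row position.
daily = covid.pivot(index="date", columns="location", values="new_cases")

# cumsum gives the running total of confirmed cases per country;
# fillna(0) treats a missing day as zero new cases.
total = daily[["Italy", "Germany"]].fillna(0).cumsum()

diff = (total["Italy"] - total["Germany"]).abs()

# First date on which the gap exceeds the threshold.
first_date = diff[diff > 10000].index[0]
print(first_date)
```

Unlike the list-based helper, this does not require both countries to have the same number of rows.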

A quick check of sum_diedai on a small list:

li=[1,2,3,4]
sum_diedai(li)
[1, 3, 6, 10]
li[:4]
[1, 2, 3, 4]
# li-[1,1,1]

Piaoya

Data scientist. I write about data science, machine learning and analytics.