SCIE 301: Research Design and Statistical Analysis
Winter 2021, Assignment 1
This is an open-book assignment; both calculators and statistical software such as R may be used.
If you use software, please include the code, the output and your own conclusions
based on your interpretation of the output.
For numerical answers, please round the results to three decimal places where needed.
The "data-bmi.csv" file contains data on individual characteristics, including gender, age, height and
weight, for a group of individuals. The variables and the corresponding scales or measurements in the
dataset are as follows:
ID: the identification number for an individual;
gender: either of the two sexes (male and female) of an individual
age: age of an individual in years
height: height of an individual in centimeters
weight: weight of an individual in pounds
Please answer the following questions based on this dataset.
1.(2 points) Please import the CSV file into R and create a data frame. Please state the dimensions
of the data in terms of the number of rows (individuals) and number of columns (variables).
2.(4 points) Please calculate the Body Mass Index (BMI) in units of kg/m^2 for all subjects, and
report the five-number summary along with the variance and standard deviation.
3.(5 points) Please report the frequency and percentage of missing data for all variables, including
the newly created variable BMI. Then obtain a subset that contains the individuals with complete
information on all variables and state the size of this subset.
4.The following questions are based on the subset of complete cases obtained in the previous question.
(a)(6 points) Please classify the individuals into groups indicating their weight status using
the following criteria:
underweight (BMI is less than 18.50)
normal weight (BMI is between 18.50 and 25.00, including 18.50 and excluding 25.00)
overweight (BMI is between 25.00 and 30.00, including 25.00 and excluding 30.00)
obesity, excluding extreme obesity (BMI is between 30.00 and 40.00, including 30.00
and excluding 40.00)
extreme obesity (BMI is greater than or equal to 40.00).
Please also summarize the distribution of weight status using numbers or tables.
(b)(8 points) Please display summary information of the weight status among all subjects
including frequencies and relative frequencies (in percentage) using two types of graphs
(bar plot and pie chart).
Please be considerate of your audience and make your graphs as informative and concise
as possible by using a legend, labeling the axes, displaying numbers and so on. Please sort the
layout by frequency or relative frequency, in either ascending or descending order,
to enhance readability.
(c)(4 points) Please create a histogram of age in this subset and elaborate thoroughly on your
findings about the distribution of age based on the histogram (for example,
you should comment on the shape, the center (three measures of central tendency), skewness,
the existence of outliers, the spread and so on).
(d)(12 points) Please create side-by-side boxplots for the heights of the individuals in the different
gender groups and comment on whether you are able to judge whether the two gender
groups differ in terms of the distribution of height. Please be sure to provide the basis
for your comments. Use the 1.5 x IQR rule (remember to show your steps) to identify the
individuals whose heights are outliers in each gender group.
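
For orientation only, the R sketch below illustrates three mechanics that questions 2, 4(a) and 4(d) rely on: converting pounds and centimeters into kg/m^2, building left-closed BMI intervals, and applying the 1.5 x IQR rule within groups. It is a minimal sketch under stated assumptions, not a model answer. The data frame `toy`, its values and the helper `flag_outliers` are hypothetical and are not taken from data-bmi.csv. The quartiles come from `quantile()` with its default method, which is one reasonable choice among several.

```r
# Toy data invented purely for illustration; NOT taken from data-bmi.csv.
toy <- data.frame(
  gender = rep(c("male", "female"), each = 5),
  height = c(170, 175, 180, 178, 210, 158, 160, 162, 165, 163),  # centimeters
  weight = c(150, 180, 200, 165, 190,  95, 130, 210, 140, 310)   # pounds
)

# Unit conversion: 1 lb = 0.45359237 kg and 1 cm = 0.01 m,
# so BMI (kg/m^2) = weight in kg / (height in m)^2.
toy$bmi <- (toy$weight * 0.45359237) / (toy$height / 100)^2

# right = FALSE makes every interval closed on the left and open on the right,
# e.g. [18.5, 25), matching the "including ... excluding ..." wording in 4(a).
# The label "obesity" is shortened from "obesity, excluding extreme obesity".
toy$status <- cut(
  toy$bmi,
  breaks = c(-Inf, 18.5, 25, 30, 40, Inf),
  labels = c("underweight", "normal weight", "overweight",
             "obesity", "extreme obesity"),
  right = FALSE
)

# Hypothetical helper: 1.5 x IQR rule.
# A value is flagged if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x < lower | x > upper
}

# Apply the rule separately within each gender group.
# ave() stores the logical result as 0/1 in a numeric vector, hence as.logical().
toy$height_outlier <- as.logical(ave(toy$height, toy$gender, FUN = flag_outliers))

print(table(toy$status))
print(toy[toy$height_outlier, c("gender", "height")])
```

Note that `boxplot()` computes its hinges with `fivenum()`, which can differ slightly from the default `quantile()` quartiles in small samples, so it is worth stating which definition your steps use.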