Important Information:
Read all the instructions carefully before you begin!
You will need to save the (.ipynb) file as a searchable PDF (NOT as a picture), and submit it as the primary resource. Pictures or snapshots of your work will NOT be accepted.
The generated CSV file and .ipynb file must be submitted in a zip-folder as the secondary source.
You may use Jupyter Notebook or Colab as per your convenience.
Non-compliance with the above instructions will result in a 0 grade on the relevant portions of the assignment. Your instructor will grade your assignment based on what you submitted. Failure to submit the assignment or submitting an assignment intended for another class will result in a 0 grade, and resubmission will not be allowed. Make sure that you submit your original work. Suspected cases of plagiarism will be treated as potential academic misconduct and will be reported to the College Academic Integrity Committee for a formal investigation. As part of this procedure, your instructor may require you to meet with them for an oral exam on the assignment.
Important First Steps:
You can use either Anaconda or Colab to work on the Jupyter notebook that you will submit as your final project on Forum:
Start by downloading this Jupyter Notebook to your local machine.
Open a tab in your browser and type https://colab.research.google.com/.
This will open a small window. Choose the last option on the upper menu, “Upload”, then select the Jupyter notebook you saved in step 1.
You can start working on your assignment by answering the questions in the corresponding cells.
Sample code is provided for Tasks 3 to 6. Remember that this is only sample code, and you will need to make minor revisions to it to complete the tasks.
If you have any questions, please reach out to your instructors and the CIS tutors.
Background
Imagine that you have graduated from CIS and now work as a consultant.
You are hired by a health and fitness company.
They have collected detailed data from 507 physically active participants. This data includes information about the participants’ body measurements as well as personal attributes such as age, weight, height, and gender.
The company wants you to analyze this data in ways that can help them design personalized fitness evaluations and training regimens for their users.
Note: The entire dataset (and descriptions of each of the variables) can be found [here](https://vincentarelbundock.github.io/Rdatasets/doc/openintro/bdims.html).
In Assignment 1 you will take a random sample of 100 participants from the 507 individuals who were studied, and analyze the data for these 100 individuals.
Task 1.
As mentioned above, you will select a random sample of 100 individuals from the company’s data set.
You will then conduct analyses on this random sample.
Look at the code below. To select a random sample from the data, you should replace Name with your own name in the code.
After you have done so, run the code. It will generate a CSV file containing a random sample of 100 participants, and the file will be labeled with your name.
REMEMBER: You need to add this CSV file to a zip file along with your .ipynb file when submitting your assignment.
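The sampling step can be sketched as follows. This is not the code provided in the notebook; it is a minimal standard-library illustration. It assumes that your name is used as the random seed, which would explain why each student gets a different but repeatable sample. The function names `draw_sample` and `write_csv` and the stand-in data are hypothetical.

```python
import csv
import io
import random


def draw_sample(rows, name, n=100):
    """Pick n rows without replacement; the same name always gives the same rows."""
    rng = random.Random(name)  # assumption: the name acts as the random seed
    return rng.sample(rows, n)


def write_csv(rows, handle):
    """Write a list of dict rows to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Stand-in data: 507 made-up participants (not the real measurements).
participants = [{"id": i, "wgt": 50 + i % 40} for i in range(507)]

sample = draw_sample(participants, "Name")
buffer = io.StringIO()  # in a notebook: open("Name_sample.csv", "w", newline="")
write_csv(sample, buffer)
```

Sampling without replacement means no participant can appear twice in your 100 rows.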
Task 2.
Now that you have your data set you are ready to start analyzing it!
The first step is to explore your dataset.
Look at the variables that make up the data set.
Once you’ve done so, imagine you are writing a report for the fitness company that hired you.
Start with a brief introduction to the research question you are exploring, then describe the dataset you are analyzing (e.g., what is the sample you are analyzing? What are the variables?).
Assume that your audience is the company’s leadership. They will be with what you are reporting.
Task 3.
Run the code to randomly select 4 variables from your dataset.
It will then print the names of the four variables that were randomly selected.
REMEMBER: Check the full name of each of your variables. You can find the names here: https://vincentarelbundock.github.io/Rdatasets/doc/openintro/bdims.html
Your task is to do the following:
You should create a histogram and generate descriptive statistics for each of the four variables that were randomly selected above. You can use the code below to help you do so.
For each variable you need to describe the following: shape, center, spread, and the presence of any outliers.
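To see what the numbers behind a histogram and a descriptive summary look like, here is a minimal sketch using only the standard library. It is not the notebook's sample code. The helper names `describe` and `bin_counts` are hypothetical, the data are made up, and the 1.5 × IQR rule is one common convention for flagging outliers, assumed here rather than required by the assignment.

```python
import statistics


def describe(values):
    """Summarize center, spread, and outliers (1.5 * IQR rule) for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "iqr": iqr,
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


def bin_counts(values, bins=5):
    """Count values in equal-width bins, which is what a histogram's bars show."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid dividing by zero if all values match
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # top edge joins last bin
        counts[index] += 1
    return counts


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
summary = describe(values)
counts = bin_counts(values)
```

In this made-up example the mean (14.5) sits well above the median (5.5) and the value 100 is flagged as an outlier, which together point to a right-skewed shape. When a distribution is skewed like this, the median and IQR usually describe center and spread better than the mean and standard deviation.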
Task 4.
Now that you have described and plotted the data, let’s explore whether the data differ for male and female participants.
Generate grouped box plots for each of the 4 variables in Task 3.
Your boxplot should compare the distributions for males and females in your dataset.
Afterwards, you should describe what you observe in each case.
Make sure you mention the five-number summaries for both genders.
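The five-number summary (minimum, first quartile, median, third quartile, maximum) is what each box in a box plot is drawn from. The sketch below computes it per gender with the standard library. It assumes the gender column is named `sex` and the height column `hgt`; confirm both, and how gender is coded, against the dataset description linked above. The helper names and the data are made up for illustration.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summaries_by_group(rows, value_key, group_key):
    """Split rows by group_key and summarize value_key within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: five_number_summary(vals) for group, vals in groups.items()}


# Made-up heights; the column names "sex" and "hgt" are assumptions.
rows = [{"sex": 0, "hgt": h} for h in (150, 155, 160, 165, 170)]
rows += [{"sex": 1, "hgt": h} for h in (165, 170, 175, 180, 185)]

result = summaries_by_group(rows, "hgt", "sex")
```

Different tools compute quartiles with slightly different conventions, so the values printed by your plotting library may differ a little from a hand calculation.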
Task 5.
Part A
Select TWO variables from Task 3. Treat each of these as an independent variable.
Now create a scatterplot for each variable.
In each case, the plot should visualize the relationship between the variable and weight (dependent variable).
Describe each scatterplot in terms of the form, strength, and direction of the relationship between the variables.
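A correlation coefficient can back up what you see in a scatterplot. The sketch below computes Pearson's r by hand using only the standard library. The function name `pearson_r` and the data are illustrative, not taken from the notebook.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


heights = [150, 160, 170, 180, 190]  # made-up independent variable
weights = [52, 58, 69, 75, 86]       # made-up dependent variable
r = pearson_r(heights, weights)
```

The sign of r gives the direction and its magnitude gives the strength of the linear association. It says nothing about form: a curved pattern can still produce a large r, so judge the form from the plot itself.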
Part B
Examine if the relationship explored in each scatterplot varies by gender.
Hint: You will need to create scatterplots separately for each gender to answer this question.
Task 6.
Part A
Finally, for each of the variables you focused on in Task 5:
Fit a simple linear regression model that predicts a participant’s Weight based on the variable you selected.
Make sure you generate, interpret, and use the residual plot, the standard error, and the R^2 to assess the fit of each linear model.
If the model is a good fit, interpret the slope and the y-intercept.
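The quantities named above can be computed by hand, as in this minimal standard-library sketch. It is not the notebook's sample code. It takes "standard error" to mean the residual standard error of the regression, which is an assumption; check with your instructor if a different quantity is intended. The function name `fit_simple_regression` and the data are made up.

```python
import math


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x with basic fit measures."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)     # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)  # total variation in y
    return {
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,               # plot these against xs
        "r_squared": 1 - sse / sst,
        "std_error": math.sqrt(sse / (n - 2)),  # assumed: residual standard error
    }


heights = [150, 160, 170, 180, 190]  # made-up independent variable
weights = [52, 58, 69, 75, 86]       # made-up dependent variable
model = fit_simple_regression(heights, weights)
```

For this made-up data the slope is 0.85 and the intercept is -76.5: each additional unit of x is associated with a predicted increase of 0.85 units of y, while the intercept is the predicted y at x = 0, which often has no practical meaning when 0 lies far outside the observed range. A residual plot with no visible pattern supports a linear model; a curve or a funnel shape suggests it is a poor fit.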
Part B
If you found that the relationship between weight and the variable you selected differed for males and females in Task 5 (Part B) then:
Run the regression model for each gender separately and interpret your findings accordingly.
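Fitting one line per gender amounts to splitting the rows by group and fitting each subset on its own. The sketch below shows that idea with the standard library only. The column names `sex`, `hgt`, and `wgt` are assumptions to check against the dataset description, and the helper names and data are made up.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_by_group(rows, x_key, y_key, group_key):
    """Fit a separate line for each distinct value of group_key."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row[group_key]]
        xs.append(row[x_key])
        ys.append(row[y_key])
    return {group: fit_line(xs, ys) for group, (xs, ys) in groups.items()}


# Made-up rows; "sex", "hgt", and "wgt" are assumed column names.
rows = [
    {"sex": 0, "hgt": 150, "wgt": 50}, {"sex": 0, "hgt": 160, "wgt": 55},
    {"sex": 0, "hgt": 170, "wgt": 60}, {"sex": 1, "hgt": 170, "wgt": 65},
    {"sex": 1, "hgt": 180, "wgt": 75}, {"sex": 1, "hgt": 190, "wgt": 85},
]
fits = fit_by_group(rows, "hgt", "wgt", "sex")
```

In this made-up data the two groups have different slopes (0.5 and 1.0), which is the kind of difference that would justify reporting separate models. With real data, compare the slopes alongside each model's residual plot and R^2 before drawing conclusions.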
Assignment Information
Length:
N/A
Weight:
18%
Learning Outcomes Added
CompProgramDesign: Generate working programs in a computer language that can solve computational problems; find and fix bugs that appear in them.
Variables: Identify and classify the relevant variables of a system, problem, or model.
DescriptiveStats: Calculate and interpret descriptive statistics appropriately.
Correlation: Apply and interpret measures of correlation; distinguish correlation and causation.
Visualizations: Interpret, analyze, and create data visualizations.