Attached is a PDF with questions and an Excel datasheet to answer question #7. R Studio should only be used to solve #7.
The two short-answer questions need to be completed using the software RStudio, with detailed explanations and steps.
Requirements: Only submit your R code, with your answers embedded as comments in the code. R Markdown files are also accepted.
Data: Credit.csv
We are using a dataset of information from 310 credit card holders.
You will also use a second dataset of new cardholders to predict their credit card balances at the end of the assignment. That dataset is called credit_card_prediction.csv
Background: For this assignment, you work at a credit card company and you would like to predict new cardholders’ credit card balances based on a number of factors. This dataset only contains information on cardholders who maintain a balance at some point during a month (that is, their balances are not zero). The credit card company does have customers who do not have a credit card balance (because they are not using their cards), but this analysis is only examining active card users. Your business questions are: “What variables effectively contribute to predicting active cardholders’ credit card balances?” and “What credit card balance might a new active cardholder hold depending on certain variables?”
Variables: The variables in this dataset include:
Income: Annual income, in dollars
Limit: Credit limit for credit card, in dollars
Rating: A credit rating calculated by the credit card company (not the same as a typical credit score)
Age: Age in years
Education: Number of years of education
Student: Whether or not the cardholder is a student (No = 0, Yes = 1)
Gender: The gender of the cardholder (Male = 0, Female = 1)
Married: Whether or not the cardholder is married (No = 0, Yes = 1)
Balance: The amount of each cardholder’s balance, in dollars
Assignment Steps:
Carry out the steps below to complete the assignment, then answer the questions in the Module 3 Assignment Quiz on Brightspace. The quiz questions are included here, with their numbers, if you prefer to answer them as you are doing the assignment and enter them in the Brightspace quiz all at once (multiple choice questions are labeled “MC”).
Generate summary statistics for the variables in the Credit.csv dataset.
Quiz question #1: How many cardholders in the full dataset are students?
Partition the dataset into a training set and a validation set (following the method used in the lecture code car_regression_ex.R)
**IMPORTANT #1: Because this dataset is smaller than the one used in the video example, divide the dataset 50-50 rather than 70-30.
**IMPORTANT #2: To get results that align with the correct answers in the assignment quiz, you MUST set the seed value to 42 using the set.seed() function when partitioning your dataset. If you do not, you will not be able to reproduce the answers that correspond with the assignment quiz.
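A minimal sketch of the summary statistics and the 50-50 partition, assuming the file name Credit.csv and illustrative object names (credit.df, train.df, valid.df); the exact partitioning code in car_regression_ex.R may differ, so follow the lecture version where they disagree.
credit.df <- read.csv("Credit.csv")

# Summary statistics for all variables; the Student counts relate to quiz question #1
summary(credit.df)
table(credit.df$Student)

# 50-50 partition with the required seed
set.seed(42)
train.index <- sample(row.names(credit.df), 0.5 * nrow(credit.df))
train.df <- credit.df[train.index, ]
valid.df <- credit.df[setdiff(row.names(credit.df), train.index), ]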
Create a correlation matrix with the quantitative variables in the training dataframe.
Quiz question #2: Looking at the correlation matrix, which pair of variables has the strongest correlation? (MC)
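A sketch of the correlation matrix on the quantitative variables only, assuming the training data frame is called train.df (the 0/1 dummy variables are left out here):
quant.vars <- c("Income", "Limit", "Rating", "Age", "Education", "Balance")
round(cor(train.df[, quant.vars]), 3)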
Conduct a multiple regression analysis using the training dataframe with Balance as the outcome variable and all the other variables in the dataset as predictor variables.
Quiz question #3: What is the slope coefficient for the Rating variable?
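A sketch of the full regression model, assuming the object names used above; the Rating slope appears in the coefficient table printed by summary():
full.lm <- lm(Balance ~ ., data = train.df)
summary(full.lm)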
Calculate the Variance Inflation Factor (VIF) for all predictor variables.
Quiz question #4: What is the VIF for the Limit variable?
Quiz question #5: What problem does the VIF for Limit suggest that we have with the analysis? (MC)
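A sketch of the VIF calculation; vif() here is assumed to come from the car package, which may differ from the function used in your lecture code:
library(car)
vif(full.lm)   # VIFs well above 5-10 suggest multicollinearity among the predictors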
Conduct a new multiple regression analysis using the training dataframe with Balance as the outcome variable and Income, Rating, Age, Education, Student, Gender, and Married as predictor variables.
Quiz question #6: What is the new slope coefficient for the Rating variable?
Create a residual plot and a normal probability plot using the results of the regression analysis in Step (6).
Quiz question #7: What pattern do you see in the residual plot? (MC)
Quiz question #8: What does this pattern tell you? (MC)
Quiz question #9: What pattern do you see in the normal probability plot? (MC)
Quiz question #10: What does this pattern tell you? (MC)
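A sketch of the diagnostic plots for the Step (6) model, using base R plotting and illustrative object names; your lecture code may produce these plots differently:
step6.lm <- lm(Balance ~ Income + Rating + Age + Education + Student + Gender + Married,
               data = train.df)

# Residual plot: residuals against fitted values
plot(step6.lm$fitted.values, step6.lm$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal probability plot of the residuals
qqnorm(step6.lm$residuals)
qqline(step6.lm$residuals)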
Examine the regression output from Step (6).
Quiz question #11: Which predictor variables have statistically significant relationships with the outcome variable, Balance? (MC)
Conduct a new multiple regression analysis using the training dataframe with Balance as the outcome variable and only the variables with statistically significant relationships with Balance (identified in Step (8)) as predictors.
Quiz question #12: What is the slope coefficient for the Age variable?
Quiz question #13: How would you interpret the slope coefficient for the Rating variable? (MC)
Quiz question #14: How would you interpret the slope coefficient for the Student variable? (MC)
Quiz question #15: What is the adjusted R2 for this regression analysis?
Quiz question #16: How can this adjusted R2 value be interpreted? (MC)
Quiz question #17: What is the standardized slope coefficient for the Income variable?
Quiz question #18: Looking at the standardized slope coefficients, which variable makes the strongest unique contribution to predicting credit card balance? (MC)
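A sketch of one way to obtain standardized slope coefficients: refit the Step (9) model after z-scoring the variables. The set of significant predictors (Income, Rating, Age, Student) is an assumption here; confirm it against your own Step (8) output, and note that your lecture code may use a dedicated function instead.
sig.vars <- c("Balance", "Income", "Rating", "Age", "Student")   # assumed significant set
std.df <- as.data.frame(scale(train.df[, sig.vars]))
std.lm <- lm(Balance ~ ., data = std.df)
summary(std.lm)   # these slopes are the standardized (beta) coefficients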
Conduct a final multiple regression analysis using the validation dataframe with Balance as the outcome variable and only the variables with statistically significant relationships with Balance (the same variables as in Step (9)) as predictors.
Quiz question #19: What is the new slope coefficient for the Rating variable?
Using the data contained in the csv file “credit_card_prediction.csv”, predict the credit card balances for three new cardholders, with 95% prediction intervals.
Quiz question #20: What is the predicted balance for new cardholder #1?
Quiz question #21: What is the 95% prediction interval for the predicted balance for new cardholder #2?
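A sketch of the Step (10) validation model and the Step (11) predictions with 95% prediction intervals, again assuming the significant-predictor set identified earlier and illustrative object names:
valid.lm <- lm(Balance ~ Income + Rating + Age + Student, data = valid.df)
summary(valid.lm)

new.df <- read.csv("credit_card_prediction.csv")
predict(valid.lm, newdata = new.df, interval = "prediction", level = 0.95)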
Could you answer these questions?
Q3/ What is the slope coefficient for the Rating variable? (Round to 3 decimal places)
Q4/ What is the VIF for the Limit variable? (Round to 3 decimal places)
Q17/ What is the standardized slope coefficient for the Income variable in the regression model from Step (9)? (Round to 3 decimal places)
Q18/ Looking at the standardized slope coefficients, which variable makes the strongest unique contribution to predicting credit card balance?
Question options:
Income
Rating
Student
Age
Q19/ What is the new slope coefficient for the Rating variable for the regression model from Step (10)? (Round to 3 decimal places)
Q20/Based on the analysis in Step (11), what is the predicted credit card balance for new cardholder #1? (Enter number as a dollar amount with a dollar sign, appropriate commas, and rounded to two decimal places, e.g. $4,852.34).
Q21/ Based on the analysis in Step (11), what is the 95% prediction interval for the predicted credit card balance for new cardholder #2? (Enter the lower and upper value of the interval separated by a comma, in dollar amounts with a dollar sign, appropriate commas, and rounded to two decimal places, e.g. $5,678.90, $7,891.23)
Steps for Completion
Pull the files from the Mod3/Class08/Homework folder on the GitHub class repository.
One data file is bikedata.csv.
Remember to work from your own git branch. There is no need to push your work onto the main branch or to publish it to the online repo.
Save a copy of the Rmd file as HW08_Lastname_Firstname.Rmd. Answer all the questions there in the file.
When you are finished, make sure you can knit your file to HTML.
Blackboard does not accept HTML file submission. You need to compress/zip up your RMD file with the HTML file.
Rename the zip file as HW08_Lastname_Firstname.zip. Submit the zip file here on Blackboard.
Do not answer questions inside the code blocks as comments. Those are coder’s comments, and they will not be graded.
Criteria for Success
Notice that there might be more than one way to answer a question, and in some cases there might be more than one correct answer.
10 points: Answer all questions correctly. The HTML file presents a clean and organized document to the reader.
General point deductions:
The zip file does not contain the Rmd or HTML file. (Deduct 50-100% of that homework set)
Code in the Rmd file does not run without error. (Warnings are acceptable, but no errors. Deduct 25-100% of that question)
Incorrect answers. (Deduct 25-100% of that question)
The HTML output file is too disorganized, has too much irrelevant information, or is too long. (Deduct 10-50% of that question)
Submission Instructions
Click the link above (the title of this assignment) to submit the zip file.
Find bigrams in the attached text. Bigrams are word pairs and their counts. To build them do the following:
Tokenize by word.
Create two almost-duplicate files of words, off by one line, using tail.
Paste them together so as to get word(i) and word(i+1) on the same line.
Count
Then, after you have the data from the procedure above: Provide the commands to find the 10 most common bigrams.
For the submission, provide all the commands that accomplish steps 1 through 5.
After completing the above, go to the following web page: NLTK :: nltk.lm package. First, implement the tutorial to develop an understanding of the library and its usage for bigrams. Then, replicate all steps for the attached text.
Tools needed for the exercise: tr, sort, uniq, head, rev, tail. Given input text (see attached):
Provide commands that find the 50 most common words in the NYT.
Provide commands that find the words in the NYT that end in “zz”.
Project 2: Decision making based on historical data
Attached Files:
I_1.jpeg (49.983 KB)
I_2.jpeg (48.819 KB)
dataG2.csv (149.28 KB)
This project covers the basics of data distributions. The project topics relate to the definitions of variance and skewness.
Files needed for the project are attached.
Cover in the project the following:
Explain variance and skewness (an illustrative R sketch appears after this list).
Show a simple example of how to calculate variance and then explain its meaning.
Show a simple example of how to calculate skewness and then explain its meaning.
After loading dataG2.csv into R or Octave, explain the meaning of each column, i.e. what each attribute describes. The columns are skewness, median, mean, standard deviation, and the last price; each row describes, with these numbers, the distribution of one stock's prices.
Draw your own conclusions based on what you learned under 1. and 2. Explain the meaning of the variables 'I_1' and 'I_2' after you execute the following (after dataG2.csv is loaded in R or Octave):
imported_data <- read.csv("dataG2.csv")
S <- imported_data[, 5] - imported_data[, 3]
I_1 <- which.min(S)   # use figure I_1 (see attached)
I_2 <- which.max(S)   # use figure I_2 (see attached)
Based on the results in a., which row (stock) would you buy and sell and why (if you believe history repeats)?
Explain how you would use the skewness (the first-column attribute) to decide about buying or selling a stock.
If you want to decide, based on the historical data, which row (stock) to buy or sell, would you base your decision on the skewness attribute (1st column) or on the differences between the last prices and the means (the differences between the 5th attribute and the 3rd attribute)? Explain.
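A small, self-contained R sketch of variance and skewness on made-up numbers, purely to illustrate the definitions in items 1 and 2; base R has no built-in skewness(), so a manual moment-based formula is shown (e1071::skewness() would be an alternative):
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance: average squared deviation from the mean, dividing by n - 1
var(x)
sum((x - mean(x))^2) / (length(x) - 1)   # same value, computed by hand

# Skewness: standardized third moment; positive values indicate a longer right tail
n <- length(x)
m2 <- sum((x - mean(x))^2) / n
m3 <- sum((x - mean(x))^3) / n
m3 / m2^(3/2)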
Data scientists conduct continual experiments. This process starts with a hypothesis. An experiment is designed to test the hypothesis. It is designed in such a way that it hopefully will deliver conclusive results. The data from a population is collected and analyzed, and then a conclusion is drawn. From your own experiences and reading:
Explain what the two major problems with collecting samples are.
Is it possible to fix the problems you mentioned? If not, explain why that is so. If it is, explain how you would do it.
To participate in the discussion, respond to the discussion promptly by Thursday at 11:59PM EST. Then, read a selection of your colleagues’ postings. Finally, respond to at least two classmates by Sunday at 11:59PM EST in one or more of the following ways:
I will post two classmates' work later, and you will respond to both of them.
Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use k-NN to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.
The dataset mlba::UniversalBank contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Partition the data into training (60%) and holdout (40%) sets.
A. Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1. Perform a k-NN classification with all predictors except ID and ZIP code. Remember to define categorical predictors with more than two categories as factors (so that k-NN handles the categorical predictors automatically). Create a k-NN model with k = 1. How would this customer be classified?
Use set.seed(1) for training.
B. What is a choice of k that balances overfitting against ignoring the predictor information? Use 5-fold cross-validation to find the best k.
Use set.seed(123) for cross-validation.
The best k for the model is saved in model$bestTune.
C. Show the confusion matrices for the training and holdout data that result from using the best k. Comment on the differences and their reasons.
The code example below shows how to produce the confusion matrix for the training set
cm <- confusionMatrix(predict(model, train.df), train.df$Personal.Loan)
D. Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1 and Credit Card = 1. Classify the customer using the best k.
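A hedged, caret-based sketch covering parts A through D. The column names (ZIP.Code, Securities.Account, CD.Account, CreditCard), the tuning grid, and the exact placement of the seeds are assumptions; align them with your own lecture code and the actual mlba::UniversalBank column names.
library(caret)

bank.df <- mlba::UniversalBank
bank.df <- subset(bank.df, select = -c(ID, ZIP.Code))    # drop ID and ZIP code (names assumed)
bank.df$Education <- factor(bank.df$Education)           # more than two categories, so a factor
bank.df$Personal.Loan <- factor(bank.df$Personal.Loan, levels = c(0, 1), labels = c("No", "Yes"))

# 60/40 partition; set.seed(1) is placed here on the assumption that "for training" means the split
set.seed(1)
train.index <- sample(nrow(bank.df), 0.6 * nrow(bank.df))
train.df <- bank.df[train.index, ]
holdout.df <- bank.df[-train.index, ]

# k-NN with 5-fold cross-validation over a grid that includes k = 1
set.seed(123)
model <- train(Personal.Loan ~ ., data = train.df, method = "knn",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = data.frame(k = seq(1, 15, 2)))
model$bestTune     # best k chosen by cross-validation

# Confusion matrices for training and holdout data with the best k
confusionMatrix(predict(model, train.df), train.df$Personal.Loan)
confusionMatrix(predict(model, holdout.df), holdout.df$Personal.Loan)

# Classify the new customer from parts A and D (column names assumed to match the data).
# For part A specifically, the same train() call with tuneGrid = data.frame(k = 1) would
# give the k = 1 classification of this customer.
new.cust <- data.frame(Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2,
                       Education = factor(2, levels = levels(bank.df$Education)),
                       Mortgage = 0, Securities.Account = 0, CD.Account = 0,
                       Online = 1, CreditCard = 1)
predict(model, new.cust)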
Continuing with the theme of hypothesis testing, this week we turn our attention to conducting tests for one sample, two paired samples, and two independent samples. To further develop our understanding of these tests, this assignment will focus on the application of these statistical techniques. You will select a dataset, conduct the appropriate tests, and share your findings.
Assignment Requirements:
Dataset Selection: Choose a dataset that allows for one-sample, paired two-sample, and independent two-sample tests. Briefly explain why you have chosen this dataset.
Hypothesis Formulation: Formulate hypotheses appropriate for one sample, two paired samples, and two independent sample tests. Describe the hypotheses for each test clearly.
Execution of Tests: Perform the tests using Python or R, and document the steps you have taken. Be sure to include your code in your initial post (an illustrative R sketch follows this list of requirements).
Results Interpretation: Interpret the results of your tests. What do the results tell you about your dataset and the hypotheses you formulated?
Conclusions and Applications: Summarize your findings and discuss potential real-world applications of your conclusions.
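A hedged R sketch of the three test types, using the built-in sleep dataset purely as an illustration; substitute the dataset you actually choose.
data(sleep)   # built-in: extra sleep under two drugs for the same 10 subjects

# One-sample t-test: is the mean sleep increase different from 0?
t.test(sleep$extra, mu = 0)

# Paired t-test: the two drugs were measured on the same subjects
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
t.test(drug2, drug1, paired = TRUE)

# Two independent samples t-test (treating the two groups as independent, for illustration only)
t.test(extra ~ group, data = sleep)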
Submission Format: Your submission should be 500-600 words at most (excluding Python/R code). Submit your assignment in APA format as a Word document or a PDF file. Include your written analysis and any tables or visualizations that support your findings. If you used any software for your calculations (such as R, Python, or Excel), please include your code or formulas as well. Include an APA-formatted reference list for any external resources used.