PhD Stress Level Dataset and Exercises

Dataset Description

This dataset contains information on PhD students’ stress levels (scale 0 to 100) along with several predictors:

Coffee: Number of coffee cups consumed per day (integer)
Spritz: Number of spritz (aperitif) consumed per week (integer)
Statistical Knowledge: Self-assessed score from 0 to 10
Software Used: Statistical software used (R, SPSS, Excel)

You can download the dataset here

Exercises

These exercises will help you explore the dataset, understand relationships, and fit a linear regression model.

1. Load and Explore the Data

Load the dataset.
Fix problems in the dataset, impute missing numerical values with the mean and missing categorical variables with a random value from the available values.
Calculate means, medians, standard deviations, and ranges for all numeric variables.
Examine the distribution of the categorical variable software_used with counts and proportions.

2. Visualize the Data

Plot histograms or density plots for the numeric variables (stress, coffee, spritz, statistical knowledge).
Create boxplots of stress by software_used.
Create scatterplots:
- Stress vs Coffee
- Stress vs Spritz (color points by software_used)
- Stress vs Statistical Knowledge

3. Check Relationships and Correlations

Compute correlation coefficients between numeric variables.
Comment on the strength and direction of relationships between stress and each predictor.

4. Fit a Linear Regression Model

Fit a linear regression model predicting stress from coffee, spritz, statistical knowledge, software_used, and the interaction between spritz and software
Examine the model summary and interpret the coefficients, especially the interaction terms