Ethics4DS: Coursework 2

Author

[Candidate number here]

Published

October 29, 2023

Create a simulation with data generating code that is consistent with a specific causal model structure. Choose the structure to satisfy all these requirements [10 marks]

# Data generation
# Structure: e.g. "Z -> Y and Z -> X -> Y"

Show that there is a reproducible effect, i.e. one that can be found fairly consistently (e.g. in more than five percent of experiments) without using p-hacking [5 marks]

# Code for simulating multiple experiments

# Code for showing the significance of the effect

# Code for simulating an intervention 
# and showing the effect on an outcome

Repeat the above section, but in this case choose the causal model generating your data so that the reproducible effect is not causal, i.e. an intervention on that variable does not change the outcome variable [10 marks]

# Data generation
# Structure:

Show that there is a reproducible effect, i.e. one that can be found fairly consistently (e.g. in more than five percent of experiments) without using p-hacking [5 marks]

# Code for simulating multiple experiments

# Code for showing the significance of the effect

Show, by simulating an intervention, that the above effect is not causal [5 marks]

# Code for simulating an intervention 
# and showing the effect on an outcome

Create a simulation with data generating code that is consistent with a specific causal model structure. Choose the structure to satisfy all these requirements

Variable A should be a categorical “sensitive attribute,” Y should be an outcome to be predicted, and X some other predictor. If you decide to include an unobserved variable, name it U [5 marks]
In your example, A should not have any causal effect on Y (including direct or indirect effects), i.e. there should be no directed pathway from A to Y in the structure graph [5 marks]

# Data generation
# Structure:

Fit a “full” model predicting Y from A and X, and a separate “unaware” model predicting Y from X only [10 marks]

# Generate training data (if not already)

# Fit models

Generate a second sample of data and compare the predictive accuracy of these models when predicting on the new sample. If the full model is not significantly more accurate, change the data generating code until it is
Use this test data for all remaining parts of the coursework below this point

# Generate test data

Compare predictive accuracy of the two models on test data [10 marks]
Hint: You may wish to read about the newdata argument in ?predict.lm or ?predict.glm
Choose any accuracy measure you wish, e.g. if Y is numeric you could use mean squared error, sqrt(MSE), median absolute error, etc. If it’s binary you could use misclassification rate, or false positive rate, or false negative rate, etc.
Note: if Y is binary and you’re using logistic regression, you may want to see ?predict.glm and read about the response argument

# Compare predictive accuracy on test data

For each of the two predictive models, compare the average predicted outcomes for two subsamples with different values of A (e.g. if A is a binary, 0/1 variable, compare average predictions for the A == 0 group and the A == 1 group) [10 marks]

# Hint: use `subset()` or `dplyr::filter()` with A

For each of the two predictive models, compare the predictive accuracy for the same two subsamples as above [10 marks]

# Predictive accuracy in each group

(Delete from here to the beginning of your own writing before the final knit)

Describe a (reasonably plausible) real world scenario that could fit with your answers to this section.
What do the variables represent? Who would be fitting/using the predictive models, and what would they use the predictions for? How could a disparity in the predictions of the models potentially affect people and make ethics relevant for the example? [5 marks]
Give an explanation for why the variables, with the real world meanings you have given them, could possibly not have any causal relationship between A and Y, even though using A results in more accurate predictions [5 marks]

Remember to replace candidate_number and “[Candidate number here]” at the top of the document and knit one last time before submitting

Write here and delete this text