Ethics4DS: Coursework 2
Reproducibility and causality
Reproducible and causal [20 marks]
- Create a simulation with data generating code that is consistent with a specific causal model structure. Choose the structure to satisfy all these requirements [10 marks]
# Data generation
# Structure: e.g. "Z -> Y and Z -> X -> Y"
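As a rough illustration only, a generator consistent with the example structure "Z -> Y and Z -> X -> Y" might look like the sketch below. The function name `simulate_causal`, the coefficients, the Gaussian noise, and the sample size are all assumptions made for this illustration, not part of the coursework specification.

```r
# Hypothetical generator for the structure Z -> Y and Z -> X -> Y.
# Coefficients, noise distribution and sample size are illustrative assumptions.
simulate_causal <- function(n = 200, beta_xy = 0.5) {
  Z <- rnorm(n)                          # common cause of X and Y
  X <- 0.8 * Z + rnorm(n)                # Z -> X
  Y <- beta_xy * X + 0.6 * Z + rnorm(n)  # X -> Y and Z -> Y
  data.frame(Z = Z, X = X, Y = Y)
}

set.seed(1)
sim <- simulate_causal()
head(sim)
```

Each line of the function body corresponds to one node of the graph, and the right-hand side of each assignment lists that node's parents, which makes it easy to check the code against the intended structure.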
- Show that there is a reproducible effect, i.e. one that can be found fairly consistently (e.g. in more than five percent of experiments) without using p-hacking [5 marks]
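One possible way to check reproducibility is to repeat the whole experiment many times with a single analysis fixed in advance, then count how often the effect is significant. The sketch below assumes a hypothetical generator `simulate_causal` for the structure Z -> Y and Z -> X -> Y, an assumed model `Y ~ X + Z`, and an assumed 0.05 threshold. None of these choices are prescribed by the coursework.

```r
# Hypothetical generator for Z -> Y and Z -> X -> Y (illustrative coefficients).
simulate_causal <- function(n = 200, beta_xy = 0.5) {
  Z <- rnorm(n)
  X <- 0.8 * Z + rnorm(n)
  Y <- beta_xy * X + 0.6 * Z + rnorm(n)
  data.frame(Z = Z, X = X, Y = Y)
}

# One experiment: p-value for the X coefficient in a model chosen
# before seeing any data, so no analysis choices depend on the results.
one_experiment <- function(n = 200) {
  fit <- lm(Y ~ X + Z, data = simulate_causal(n))
  summary(fit)$coefficients["X", "Pr(>|t|)"]
}

set.seed(2)
p_values <- replicate(500, one_experiment())

# Proportion of simulated experiments in which the effect is significant
mean(p_values < 0.05)
```

Fixing the model formula inside `one_experiment()` is what keeps the procedure free of p-hacking: every replication runs exactly the same test.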
# Code for simulating multiple experiments
# Code for showing the significance of the effect
- Show, by simulating an intervention, that the above effect is causal [5 marks]
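An intervention can be simulated by overriding the structural equation of the intervened variable while leaving every other equation unchanged. The sketch below assumes hypothetical structural equations for Z -> Y and Z -> X -> Y. The function `simulate_do`, its `x_do` argument, and all coefficients are illustrative assumptions.

```r
# Hypothetical structural equations for Z -> Y and Z -> X -> Y.
# If x_do is supplied, X is set by intervention, which removes the Z -> X arrow.
simulate_do <- function(n = 5000, x_do = NULL) {
  Z <- rnorm(n)
  X <- if (is.null(x_do)) 0.8 * Z + rnorm(n) else rep(x_do, n)
  Y <- 0.5 * X + 0.6 * Z + rnorm(n)
  data.frame(Z = Z, X = X, Y = Y)
}

set.seed(3)
y_do0 <- simulate_do(x_do = 0)$Y   # outcomes under do(X = 0)
y_do1 <- simulate_do(x_do = 1)$Y   # outcomes under do(X = 1)

# Estimated average effect of the intervention; under these assumed
# equations its expectation is the X -> Y coefficient, 0.5
effect <- mean(y_do1) - mean(y_do0)
effect
t.test(y_do1, y_do0)
```

Because the equation for `Y` still contains `X`, changing `X` by intervention shifts the distribution of `Y`, which is what makes the effect causal in this sketch.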
# Code for simulating an intervention
# and showing the effect on an outcome
Reproducible but not causal [20 marks]
- Repeat the above section, but in this case choose the causal model generating your data so that the reproducible effect is not causal, i.e. an intervention on that variable does not change the outcome variable [10 marks]
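One structure that could satisfy this requirement is confounding, X <- Z -> Y, where X and Y share a cause but there is no arrow from X to Y. The sketch below is only one assumed possibility. The function `simulate_confounded`, its `x_do` argument, and all coefficients are hypothetical.

```r
# Hypothetical confounded structure X <- Z -> Y (no arrow from X to Y).
# If x_do is supplied, X is set by intervention instead of by Z.
simulate_confounded <- function(n = 200, x_do = NULL) {
  Z <- rnorm(n)
  X <- if (is.null(x_do)) Z + rnorm(n) else rep(x_do, n)
  Y <- Z + rnorm(n)              # Y depends on Z only
  data.frame(X = X, Y = Y)       # Z is treated as unobserved here
}

set.seed(4)

# Observational experiments: how often is the X-Y association significant?
p_obs <- replicate(500, {
  d <- simulate_confounded()
  summary(lm(Y ~ X, data = d))$coefficients["X", "Pr(>|t|)"]
})
mean(p_obs < 0.05)

# Interventional comparison: Y under do(X = 0) versus do(X = 1)
y0 <- simulate_confounded(n = 5000, x_do = 0)$Y
y1 <- simulate_confounded(n = 5000, x_do = 1)$Y
do_effect <- mean(y1) - mean(y0)
do_effect
```

The association is reproducible because Z drives both variables in every replication, yet the equation for `Y` never mentions `X`, so setting `X` by intervention leaves the distribution of `Y` unchanged.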
# Data generation
# Structure:
- Show that there is a reproducible effect, i.e. one that can be found fairly consistently (e.g. in more than five percent of experiments) without using p-hacking [5 marks]
# Code for simulating multiple experiments
# Code for showing the significance of the effect
- Show, by simulating an intervention, that the above effect is not causal [5 marks]
# Code for simulating an intervention
# and showing the effect on an outcome
Fairness and causality
Data generation [10 marks]
- Create a simulation with data generating code that is consistent with a specific causal model structure. Choose the structure to satisfy all these requirements
  - Variable `A` should be a categorical “sensitive attribute,” `Y` should be an outcome to be predicted, and `X` some other predictor. If you decide to include an unobserved variable, name it `U` [5 marks]
  - In your example, `A` should not have any causal effect on `Y` (including direct or indirect effects), i.e. there should be no directed pathway from `A` to `Y` in the structure graph [5 marks]
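For illustration, one structure meeting these requirements is U -> A, U -> Y, X -> Y: `A` and `Y` are associated through the unobserved `U`, but no directed path runs from `A` to `Y`. The sketch below assumes this structure. The function name `simulate_fair`, the binary coding of `A`, and all coefficients are hypothetical choices.

```r
# Hypothetical structure: U -> A, U -> Y, X -> Y (no directed path from A to Y).
simulate_fair <- function(n = 1000) {
  U <- rnorm(n)                                    # unobserved common cause
  A <- rbinom(n, size = 1, prob = plogis(2 * U))   # sensitive attribute, caused by U
  X <- rnorm(n)                                    # other predictor
  Y <- X + U + rnorm(n)                            # outcome: depends on X and U, not A
  data.frame(A = factor(A), X = X, Y = Y)          # U is deliberately not returned
}

set.seed(5)
train <- simulate_fair()
table(train$A)
```

Leaving `U` out of the returned data frame mirrors the idea that it is unobserved: any model fitted later can only use `A` and `X`.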
# Data generation
# Structure:
Predictive accuracy [20 marks]
- Fit a “full” model predicting `Y` from `A` and `X`, and a separate “unaware” model predicting `Y` from `X` only [10 marks]
# Generate training data (if not already)
# Fit models
- Generate a second sample of data and compare the predictive accuracy of these models when predicting on the new sample. If the full model is not significantly more accurate, change the data generating code until it is
- Use this test data for all remaining parts of the coursework below this point
# Generate test data
- Compare predictive accuracy of the two models on test data [10 marks]
  - Hint: You may wish to read about the `newdata` argument in `?predict.lm` or `?predict.glm`
  - Choose any accuracy measure you wish, e.g. if `Y` is numeric you could use mean squared error, sqrt(MSE), median absolute error, etc. If it’s binary you could use misclassification rate, or false positive rate, or false negative rate, etc.
  - Note: if `Y` is binary and you’re using logistic regression, you may want to see `?predict.glm` and read about the `type` argument (e.g. `type = "response"`)
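The fit-then-evaluate workflow described above can be sketched as follows for a numeric `Y`. The generator `simulate_fair` (structure U -> A, U -> Y, X -> Y), the helper `rmse`, and the choice of root mean squared error are assumptions for this illustration. Only `lm()` and `predict()` with `newdata` come from the hints.

```r
# Hypothetical generator: U -> A, U -> Y, X -> Y (illustrative coefficients).
simulate_fair <- function(n = 1000) {
  U <- rnorm(n)
  A <- rbinom(n, size = 1, prob = plogis(2 * U))
  X <- rnorm(n)
  Y <- X + U + rnorm(n)
  data.frame(A = factor(A), X = X, Y = Y)
}

set.seed(6)
train <- simulate_fair()   # training sample
test  <- simulate_fair()   # independent test sample

full    <- lm(Y ~ A + X, data = train)   # uses the sensitive attribute
unaware <- lm(Y ~ X, data = train)       # ignores the sensitive attribute

# Hypothetical helper: root mean squared error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$Y - predict(model, newdata = data))^2))
}

accuracy <- c(full = rmse(full, test), unaware = rmse(unaware, test))
accuracy
```

Both models are fitted on `train` and scored on `test`, so the comparison reflects out-of-sample accuracy rather than how closely each model fits the data it was trained on.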
# Compare predictive accuracy on test data
Disparate predictions? [20 marks]
- For each of the two predictive models, compare the average predicted outcomes for two subsamples with different values of `A` (e.g. if `A` is a binary, 0/1 variable, compare average predictions for the `A == 0` group and the `A == 1` group) [10 marks]
# Hint: use `subset()` or `dplyr::filter()` with A
- For each of the two predictive models, compare the predictive accuracy for the same two subsamples as above [10 marks]
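Both group comparisons in this section, average predictions and predictive accuracy by value of `A`, can be sketched with base R alone. The example below assumes a hypothetical generator `simulate_fair` (structure U -> A, U -> Y, X -> Y), linear models, and RMSE as the accuracy measure. It uses `aggregate()` and `split()` rather than the `subset()` approach from the hint, purely as one alternative.

```r
# Hypothetical generator: U -> A, U -> Y, X -> Y (illustrative coefficients).
simulate_fair <- function(n = 1000) {
  U <- rnorm(n)
  A <- rbinom(n, size = 1, prob = plogis(2 * U))
  X <- rnorm(n)
  Y <- X + U + rnorm(n)
  data.frame(A = factor(A), X = X, Y = Y)
}

set.seed(7)
train <- simulate_fair()
test  <- simulate_fair()

full    <- lm(Y ~ A + X, data = train)
unaware <- lm(Y ~ X, data = train)

# Store each model's test-set predictions alongside the test data
test$pred_full    <- predict(full, newdata = test)
test$pred_unaware <- predict(unaware, newdata = test)

# Average predicted outcome in each group of A, for both models
group_means <- aggregate(cbind(pred_full, pred_unaware) ~ A, data = test, FUN = mean)
group_means

# RMSE of each model within each group of A (one column per group)
group_rmse <- sapply(split(test, test$A), function(d) {
  c(full    = sqrt(mean((d$Y - d$pred_full)^2)),
    unaware = sqrt(mean((d$Y - d$pred_unaware)^2)))
})
group_rmse
```

Keeping the predictions as columns of `test` means the same data frame can be split by `A` once and reused for every per-group summary.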
# Predictive accuracy in each group
Story time [10 marks]
(Delete from here to the beginning of your own writing before the final knit)
- Describe a (reasonably plausible) real world scenario that could fit with your answers to this section.
- What do the variables represent? Who would be fitting/using the predictive models, and what would they use the predictions for? How could a disparity in the predictions of the models potentially affect people and make ethics relevant for the example? [5 marks]
- Give an explanation for why the variables, with the real world meanings you have given them, could possibly not have any causal relationship between `A` and `Y`, even though using `A` results in more accurate predictions [5 marks]
Remember to replace candidate_number and “[Candidate number here]” at the top of the document and knit one last time before submitting
Write here and delete this text