Bayesuvius,
a visual dictionary of Bayesian Networks and
Causal Inference
Robert R. Tucci
www.ar-tiste.xyz
September 19, 2022
This book is constantly being expanded and improved. To download
the latest version, go to https://github.com/rrtucci/Bayesuvius
Bayesuvius
by Robert R. Tucci
Copyright ©2020-2021, Robert R. Tucci.
This work is licensed under the Creative Commons Attribution-Noncommercial-No
Derivative Works 3.0 United States License. To view a copy of this license, visit
https://creativecommons.org/licenses/by-nc-nd/3.0/ or send a letter to
Creative Commons, PO Box 1866, Mountain View, CA 94042.
Figure 1: View of Mount Vesuvius from Pompeii
Figure 2: Mount Vesuvius and Bay of Naples
Contents
Foreword 16
Navigating the ocean of Judea Pearl’s Books 17
CI-2-3 track 18
Notational Conventions and Preliminaries 21
0.1 Some abbreviations frequently used throughout this book . . . . . . . 21
0.2 N(!a) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
0.3 One hot vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
0.4 Special sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
0.5 Kronecker delta function . . . . . . . . . . . . . . . . . . . . . . . . . 22
0.6 Dirac delta function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
0.7 Indicator function (a.k.a. Truth function) . . . . . . . . . . . . . . . . 22
0.8 Majority function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
0.9 Underlined letters indicate random variables . . . . . . . . . . . . . . 23
0.10 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . 23
0.11 Discretization of continuous probability distributions . . . . . . . . . 23
0.12 Samples, i.i.d. variables . . . . . . . . . . . . . . . . . . . . . . . . . . 24
0.13 Expected Value and Variance . . . . . . . . . . . . . . . . . . . . . . 25
0.14 Conditional Expected Value . . . . . . . . . . . . . . . . . . . . . . . 25
0.15 Notation for covariances . . . . . . . . . . . . . . . . . . . . . . . . . 26
0.16 Conditional Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . 26
0.17 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
0.18 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
0.19 Softmax function (a.k.a. Boltzmann Distribution) . . . . . . . . . . . 28
0.20 Sigmoid and log-odds functions . . . . . . . . . . . . . . . . . . . . . 29
0.21 Estimand, Estimator (curve-fit, cfit), Estimate, Bias . . . . . . . . . . 30
0.22 Maximum Likelihood Estimate, Likelihood Ratio Test . . . . . . . . . 31
0.23 Mean Square Error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . 32
0.24 Cramer-Rao Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
0.25 Bayes Rule, Bayesian Updating And Conjugate Priors . . . . . . . . . 38
0.26 Linear regression, Ordinary Least Squares (OLS) . . . . . . . . . . . 39
0.26.1 LR, assuming $x^\sigma$ are non-random . . . . . . . . . . . . . . . . 40
Derivation of LR From Minimization of Error . . . . . . . . . 41
Geometry of LR with non-random $x^\sigma$ . . . . . . . . . . . . . . 42
LR Goodness of Fit, $R^2$ . . . . . . . . . . . . . . . . . . . . . 43
0.26.2 LR, assuming $x^\sigma$ are random . . . . . . . . . . . . . . . . . . . 45
Transforming expressions from non-random to random $x^\sigma$ . . 45
Geometry of LR with random $x^\sigma$ . . . . . . . . . . . . . . . . 47
Regression interpreted as differentiation of y . . . . . . . . . . 49
0.27 Logistic Regression (LoR) . . . . . . . . . . . . . . . . . . . . . . . . 50
0.28 Entropy, Kullback-Leibler divergence, Cross-Entropy . . . . . . . . . 51
0.29 Definition of various entropies used in Shannon Information Theory . 51
0.30 Mean log likelihood asymptotic behavior . . . . . . . . . . . . . . . . 53
0.31 Arc Strength (Arc Force) . . . . . . . . . . . . . . . . . . . . . . . . . 54
0.32 Pearson Chi-Squared Test . . . . . . . . . . . . . . . . . . . . . . . . 54
0.33 Demystifying Population and Sample Variances . . . . . . . . . . . . 55
0.34 Independence of $\hat{\mu}$ and $\hat{\sigma}^2$ . . . . . . . . . . . . . . . . . . . . . . . . 57
0.35 Chi-square distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 58
0.36 Student’s t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 59
0.37 Hypothesis testing and 3 classic test statistics (Likelihood, Score, Wald) 62
0.38 Error Bars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
0.39 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
0.40 p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
0.41 Short Summary of Boolean Algebra . . . . . . . . . . . . . . . . . . . 70
0.42 Convex/Concave functions, Jensen’s Inequality . . . . . . . . . . . . . 71
Definition of a Bayesian Network 73
Bayesian Networks, Causality and the Passage of Time 77
0.43 Unifying Principle of this book . . . . . . . . . . . . . . . . . . . . . 77
0.44 You say tomato, I say tomato . . . . . . . . . . . . . . . . . . . . . . 77
0.45 A dataset is causal model free . . . . . . . . . . . . . . . . . . . . . . 78
0.46 What is causality? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
0.47 Bayesian Networks and the passage of time . . . . . . . . . . . . . . . 80
0.48 Advice for the DAG-phobic . . . . . . . . . . . . . . . . . . . . . . . 81
1 AdaBoost 82
1.1 AdaBoost for general ensemble of w-classifiers . . . . . . . . . . . . . 82
1.2 AdaBoost for ensemble of tree stumps . . . . . . . . . . . . . . . . . 86
2 ANOVA 88
2.1 Law of Total Variance . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.2 Sum of Squares Estimates . . . . . . . . . . . . . . . . . . . . . . . . 89
2.3 F-statistic and hypothesis testing . . . . . . . . . . . . . . . . . . . . 91
3 ARACNE structure learning 93
4 Backdoor Adjustment Formula 95
4.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 Back Propagation (Automatic Differentiation) 100
5.1 General Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.1 Jacobians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.2 Bnets for function composition, forward propagation and back
propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Application to Neural Networks . . . . . . . . . . . . . . . . . . . . . 102
5.2.1 Absorbing $b^i_\lambda$ into $w^{i|j}$ . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.2 Bnets for function composition, forward propagation and back
propagation for NN . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 General bnets instead of Markov chains induced by layered structure
of NNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6 Basic Curve Fitting Using Gradient Descent 108
7 Bell and Clauser-Horne Inequalities in Quantum Mechanics 110
8 Berkson’s Paradox 111
9 Binary Decision Diagrams 113
10 Chow-Liu Trees and Tree Augmented Naive Bayes (TAN) 117
10.1 Chow-Liu Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.2 Tree Augmented Naive Bayes (TAN) . . . . . . . . . . . . . . . . . . 121
11 Counterfactual Reasoning 123
11.1 The 3 Rungs of Causal AI . . . . . . . . . . . . . . . . . . . . . . . . 123
11.2 Do operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
11.3 Imagine operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
12 Cross-Validation 128
13 Dataset Shift and Batch Normalization 131
13.1 Covariate Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
13.2 Concept Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
13.3 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
14 Decision Trees 134
14.1 Transforming a dtree into a bnet . . . . . . . . . . . . . . . . . . . . 136
14.2 Structure Learning for Dtrees . . . . . . . . . . . . . . . . . . . . . . 137
14.2.1 Information Gain, Gini . . . . . . . . . . . . . . . . . . . . . . 138
14.3 Information Gain Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . 141
14.3.1 Pseudo-code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
15 Decisions Based on Rungs 2 and 3: COMING SOON 144
16 Difference-in-Differences 145
16.1 John Snow, DID and a cholera transmission pathway . . . . . . . . . 145
16.2 PO analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
16.3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
17 Diffusion Models 152
17.1 Bnet for DM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
17.2 Mean Values $M^{t-1}(x^t)$ and $M^{t-1}_\theta(x^t)$ . . . . . . . . . . . . . . . . . . 155
17.3 Loss function L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
17.4 Algorithms for training and sampling DM . . . . . . . . . . . . . . . 160
18 Digital Circuits 162
18.1 Mapping any dcircuit to a bnet . . . . . . . . . . . . . . . . . . . . . 162
18.1.1 Option A of Fig.18.2 . . . . . . . . . . . . . . . . . . . . . . . 162
18.1.2 Option B of Fig.18.2 . . . . . . . . . . . . . . . . . . . . . . . 163
19 Do Calculus 164
19.1 3 Rules of Do Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . 167
19.2 Parent Adjustment Formula . . . . . . . . . . . . . . . . . . . . . . . 168
19.3 Backdoor Adjustment Formula . . . . . . . . . . . . . . . . . . . . . . 170
19.4 Frontdoor Adjustment Formula . . . . . . . . . . . . . . . . . . . . . 171
19.5 Comparison of Backdoor and Frontdoor adjustment formulae . . . . . 172
19.6 Do operator for DEN diagrams . . . . . . . . . . . . . . . . . . . . . 173
20 Do Calculus proofs 176
21 D-Separation 194
22 D-Separation and Do Calculus Rules for LDEN: COMING SOON 197
23 D-Separation in Quantum Mechanics 198
24 Dynamical Bayesian Networks 199
25 Expectation Maximization 201
25.1 The EM algorithm: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
25.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
25.2 Minorize-Maximize (MM) algorithms . . . . . . . . . . . . . . . . . . 203
25.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
25.3.1 Gaussian mixture . . . . . . . . . . . . . . . . . . . . . . . . . 205
25.3.2 Blood Genotypes and Phenotypes . . . . . . . . . . . . . . . . 206
25.3.3 Missing Data/Imputation . . . . . . . . . . . . . . . . . . . . 208
26 Factor Graphs 209
27 Frisch-Waugh-Lovell (FWL) theorem 212
27.1 FWL, assuming $x^\sigma$ are non-random . . . . . . . . . . . . . . . . . . . 212
27.2 FWL, assuming $x^\sigma$ are random . . . . . . . . . . . . . . . . . . . . . 213
28 Frontdoor Adjustment Formula 216
28.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
29 G-formula (Sequential Backdoor Adjustment Formula) 218
30 Gaussian Nodes with Linear Dependence on Parents 222
31 Generalized Linear Model (GLM) 225
31.1 Exponential Family of Distributions . . . . . . . . . . . . . . . . . . . 225
31.2 GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
32 Generative Adversarial Networks (GANs) 232
33 Goodness of Causal Fit 237
33.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
33.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
33.3 Goodness of Statistical Fit . . . . . . . . . . . . . . . . . . . . . . . . 238
33.4 GCF example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
33.5 GCF example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
33.6 GCF in general . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
34 Granger Causality 244
35 Hidden Markov Model 247
35.1 Calculating $P(x_t, v^n)$ and $P(x_t, x_{t+1}, v^n)$ . . . . . . . . . . . . . . . . 249
35.2 Calculating $\overrightarrow{F}_t$ and $\overleftarrow{F}_t$ . . . . . . . . . . . . . . . . . . . . . . . . . . 250
35.3 Calculating $P(x^n|v^n)$ . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
35.4 Calculating $P(v^n|A, B, \pi)$ . . . . . . . . . . . . . . . . . . . . . . . . 252
35.5 Calculating $\hat{x}^n$ (Viterbi algorithm) . . . . . . . . . . . . . . . . . . . 253
35.6 Calculating $\hat{A}, \hat{B}, \hat{\pi}$ (Baum-Welch algorithm) . . . . . . . . . . . . . 255
36 Influence Diagrams & Utility Nodes 257
37 Instrumental Inequality and beyond 259
37.1 I-inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
37.1.1 I-inequality for binary z,d,y . . . . . . . . . . . . . . . . . . . 261
37.2 Bounds on Effect of IV on treatment outcome y . . . . . . . . . . . . 262
38 Instrumental Variables 265
38.1 δ with unmeasured confounder . . . . . . . . . . . . . . . . . . . . . . 265
38.2 δ (with unmeasured confounder) can be inferred via IV . . . . . . . . 266
38.3 More general bnets with IVs . . . . . . . . . . . . . . . . . . . . . . . 267
38.4 Instrumental Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 268
39 Jackknife Resampling 269
39.1 Case $A = A_n(x) = \frac{1}{n}\sum_\sigma x^\sigma$ . . . . . . . . . . . . . . . . . . . . . . . 271
40 Junction Tree Algorithm 273
41 Kalman Filter 274
41.1 Prediction Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
41.2 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
41.3 Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
41.4 Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
41.5 Derivation of Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 278
42 LATE (Local Average Treatment Effect) 280
43 Linear and Logistic Regression 285
43.1 Generalization to x with multiple components (features) . . . . . . . 287
43.2 Alternative V (b, m) for logistic regression . . . . . . . . . . . . . . . . 287
44 Linear Deterministic Bnets with External Noise 289
44.1 Example of LDEN diagram . . . . . . . . . . . . . . . . . . . . . . . 289
44.2 Fully Connected LDEN diagrams . . . . . . . . . . . . . . . . . . . . 290
44.2.1 Fully connected LDEN diagram with $n_x = 2$ . . . . . . . . . 291
44.2.2 Fully connected LDEN diagram with $n_x = 3$ . . . . . . . . . 291
44.2.3 Fully connected LDEN diagram with arbitrary $n_x$ . . . . . . 292
44.3 Non-linear DEN diagrams . . . . . . . . . . . . . . . . . . . . . . . . 294
45 Markov Blankets 295
46 Markov Chain Monte Carlo (MCMC) 297
46.1 Inverse Cumulative Sampling . . . . . . . . . . . . . . . . . . . . . . 297
46.2 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
46.3 Metropolis-Hastings Sampling . . . . . . . . . . . . . . . . . . . . . . 300
46.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
46.5 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
47 Markov Chains 306
48 Mediation Analysis 307
49 Message Passing and Bethe Free Energy 313
49.1 2MRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
49.2 Message Passing Intuition . . . . . . . . . . . . . . . . . . . . . . . . 314
49.3 $-\ln Z_\theta$ = Free Energy (FE) . . . . . . . . . . . . . . . . . . . . . . . 318
49.4 $-\ln Z_\theta$ = Minimum FE . . . . . . . . . . . . . . . . . . . . . . . . . 319
49.5 $-\ln Z^{tree}_\theta$ = Tree FE (a.k.a. Bethe FE) . . . . . . . . . . . . . . . . . 320
49.6 $-\ln Z^{tree}_\theta$ = Tree Minimum FE, and message passing . . . . . . . . . 321
50 Message Passing, Pearl’s theory 325
50.1 Distributed Soldier Counting . . . . . . . . . . . . . . . . . . . . . . . 325
50.2 Spring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
50.3 BP for Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . 327
50.4 BP Algorithm for Polytrees . . . . . . . . . . . . . . . . . . . . . . . 334
50.4.1 How BP algo for polytrees reduces to the BP algo for Markov
chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
50.5 Derivation of BP Algorithm for Polytrees . . . . . . . . . . . . . . . . 339
50.6 Example of BP algo for a Tree . . . . . . . . . . . . . . . . . . . . . . 342
50.7 Bipartite bnets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
50.8 BP for bipartite bnets (BP-BB) . . . . . . . . . . . . . . . . . . . . . 347
50.8.1 BP-BB and general BP agree on Markov chains . . . . . . . . 349
50.8.2 BP-BB and general BP agree on tree bnets. . . . . . . . . . . 351
50.9 BP-BB and sum-product decomposition . . . . . . . . . . . . . . . . 353
51 Message Passing (Belief Propagation) in Quantum Mechanics 354
52 Meta-learners for estimating ATE 355
53 Missing Data, Imputation 359
53.1 Imputation via EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
53.2 Imputation via MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . 363
53.3 Multiple Imputations . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
54 Modified Treatment Policy 365
54.1 One time MTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
54.2 |c estimand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
54.3 Estimates of |c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
54.3.1 Empirical estimate of |c . . . . . . . . . . . . . . . . . . . . 372
54.3.2 OR estimate of |c . . . . . . . . . . . . . . . . . . . . . . . . 372
54.4 Other Estimands besides |c . . . . . . . . . . . . . . . . . . . . . . . 375
54.5 Multi-time MTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
55 Monty Hall Problem 378
56 Multi-armed Bandits 380
56.1 Bnet for MAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
56.2 Reward functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
56.3 Regret functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
56.4 Strategies with random exploration . . . . . . . . . . . . . . . . . . . 386
56.4.1 ϵ-greedy algorithm . . . . . . . . . . . . . . . . . . . . . . . . 386
56.4.2 $\epsilon_t$-greedy algorithm . . . . . . . . . . . . . . . . . . . . . . 387
56.5 Strategies with nonrandom exploration . . . . . . . . . . . . . . . . . 387
56.5.1 Upper Confidence Bounds (UCB) algorithms . . . . . . . . . . 387
Frequentist UCB (UCB1) algorithm . . . . . . . . . . . . . . . 387
Bayesian UCB algorithm . . . . . . . . . . . . . . . . . . . . . 388
56.5.2 Thompson Sampling MAB (TS-MAB) algorithm . . . . . . . . 389
Bnet for general TS-MAB algorithm . . . . . . . . . . . . . . 389
TS-MAB algorithm with Beta agent and Bernoulli environment 391
TS-MAB algorithm, skeletal reprise . . . . . . . . . . . . . . . 392
56.5.3 Grad-MAB algorithm . . . . . . . . . . . . . . . . . . . . . . . 393
57 Naive Bayes 396
58 Neural Networks 397
58.1 Activation Functions $A^i_\lambda : \mathbb{R} \rightarrow \mathbb{R}$ . . . . . . . . . . . . . . . . . . . . 398
58.2 Weight optimization via supervised training and gradient descent . . 399
58.3 Non-dense layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
58.4 Autoencoder NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
59 Noisy-OR gate 404
59.1 3 ways to interpret the parameters $\pi_i$ . . . . . . . . . . . . . . . . . . 405
60 Non-negative Matrix Factorization 408
60.1 Bnet interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
60.2 Simplest recursive algorithm . . . . . . . . . . . . . . . . . . . . . . . 409
61 Observationally Equivalent DAGs 410
61.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
62 Personalized Treatment Effects 413
62.1 Goal, Strategy and Rationale of PTE theory . . . . . . . . . . . . . . 414
62.2 Bnets for PTE theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
62.3 ATE = PNS − AMM . . . . . . . . . . . . . . . . . . . . . . . . . . 417
62.4 Probabilities Relevant to PTE theory . . . . . . . . . . . . . . . . . . 418
62.5 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
62.6 Linear Programming Problem . . . . . . . . . . . . . . . . . . . . . . 424
62.7 Special constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
62.8 Matrix representation of probabilities . . . . . . . . . . . . . . . . . . 427
62.9 Bounds on Exp. Probs. imposed by Obs. Probs. . . . . . . . . . . . . 431
62.10 Bounds on PNS3 for unspecified bnet . . . . . . . . . . . . . . . . . . 432
62.11 Bounds on PNS3 for specific bnet families . . . . . . . . . . . . . . . 438
62.12 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
63 Personalized Expected Utility 444
63.1 Goal of PEU Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
63.2 Bnets for PEU Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 446
63.3 Bounds on EU for unspecified bnet . . . . . . . . . . . . . . . . . . . 446
63.4 Bounds on EU for specific bnet families . . . . . . . . . . . . . . . . 449
64 Potential Outcomes and Beyond 452
64.1 $G$ and $G_{den}$ bnets, the starting point bnets . . . . . . . . . . . . . . . 453
64.2 $G$ bnet with nodes $y^\sigma(0), y^\sigma(1)$ added to it . . . . . . . . . . . . . . . 455
64.3 Expected Values of treatment outcome $y^\sigma$ . . . . . . . . . . . . . . . 457
64.4 Translation Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . 457
64.5 $Y_{|d,x} = Y_{d|d,x}$ (SUTVA) . . . . . . . . . . . . . . . . . . . . . . . . . 458
64.6 Conditional Independence Assumption (CIA) . . . . . . . . . . . . . 459
64.7 Treatment Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
64.8 Insights into what makes treatment effects equal and $Y_{1|0} = Y_1$ . . . . 462
64.9 $G_{do+}$ bnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
64.10 ACE = ATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
64.11 Good, Bad Controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
64.12 PO Confounder Sensitivity Analysis . . . . . . . . . . . . . . . . . . . 466
64.13 (SDO, ATE) space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
64.14 Strata-Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
64.14.1 Exact strata-matching . . . . . . . . . . . . . . . . . . . . . . 470
Estimates of Treatment Effects . . . . . . . . . . . . . . . . . 470
Example, estimation of treatment effects . . . . . . . . . . . . 472
64.14.2 Approximate strata-matching . . . . . . . . . . . . . . . . . . 474
64.14.3 Unbiased strata-matching estimates . . . . . . . . . . . . . . . 474
64.15 Propensities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
64.16 Propensity based estimates of treatment effects . . . . . . . . . . . . 480
64.17 Positivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
64.18 Multi-time PO bnets (Panel Data) . . . . . . . . . . . . . . . . . . . 482
65 Program evaluation and review technique (PERT) 485
65.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
66 Random Forest and Bagging 491
66.1 Bagging (with fully-featured bags) . . . . . . . . . . . . . . . . . . . . 491
66.2 Bagging (with randomly-shortened bags) . . . . . . . . . . . . . . . . 493
67 Recurrent Neural Networks 494
67.1 Language Sequence Modeling . . . . . . . . . . . . . . . . . . . . . . 497
67.2 Other types of RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
67.2.1 Long Short Term Memory (LSTM) unit (1997) . . . . . . . . 499
67.2.2 Gated Recurrence Unit (GRU) (2014) . . . . . . . . . . . . . . 501
68 Regression Discontinuity Design 503
68.1 PO analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
68.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
69 Reinforcement Learning (RL) 506
69.1 Exact RL bnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
69.2 Actor-Critic RL bnet . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
69.3 Q function learning RL bnet . . . . . . . . . . . . . . . . . . . . . . . 513
70 Reliability Box Diagrams and Fault Tree Diagrams 515
70.1 Minimal Cut Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
71 Restricted Boltzmann Machines 523
72 ROC curves 525
72.1 Terminology Table Adapted from Wikipedia Ref.[133] . . . . . . . . . 528
73 Scoring the Nodes of a Learned Bnet 530
73.1 Probability Distributions and Special Functions . . . . . . . . . . . . 531
73.2 Single node with no parents . . . . . . . . . . . . . . . . . . . . . . . 533
73.3 Multiple nodes with any number of parents . . . . . . . . . . . . . . . 535
73.4 Bayesian Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
73.5 Information Theoretic scores . . . . . . . . . . . . . . . . . . . . . . . 537
74 Selection Bias Removal 539
74.1 Pre and Post Selection Nodes . . . . . . . . . . . . . . . . . . . . . . 540
74.2 Removing SB from passive query P (y|x) . . . . . . . . . . . . . . . . 542
74.3 Removing SB from active query P (y|Dx) . . . . . . . . . . . . . . . . 544
75 Shannon Information Theory 546
76 Shapley Explainability 547
76.0.1 Numerical examples of SHAP . . . . . . . . . . . . . . . . . . 550
77 Simpson’s Paradox 553
77.1 Pearl Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
77.2 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
78 Structure and Parameter Learning for Bnets 557
78.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
78.2 Score based SL algorithms . . . . . . . . . . . . . . . . . . . . . . . . 559
78.3 Constraint based SL algorithms . . . . . . . . . . . . . . . . . . . . . 560
78.4 Pseudo-code for some bnet learning algorithms . . . . . . . . . . . . . 561
79 Support Vector Machines And Kernel Method 563
79.1 Learning Algorithm for SVM Classifier . . . . . . . . . . . . . . . . . 564
79.2 Linear (dot-product) Kernel . . . . . . . . . . . . . . . . . . . . . . . 565
79.3 Alternatives to Linear Kernel . . . . . . . . . . . . . . . . . . . . . . 569
79.4 Random Forest and Kernel Method . . . . . . . . . . . . . . . . . . . 569
80 Survival Analysis 570
80.1 S(t) estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
80.1.1 No-censoring estimate of S(t) . . . . . . . . . . . . . . . . . . 572
80.1.2 Kaplan-Meier estimate of S(t) . . . . . . . . . . . . . . . . . . 572
80.2 λ(t) models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
80.2.1 λ(t) independent of covariates Z . . . . . . . . . . . . . . . . . 577
80.2.2 λ(t) dependent on covariates Z . . . . . . . . . . . . . . . . . 578
80.3 $S_0(t)$ estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
81 Synthetic Controls 583
81.1 PO analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
82 Targeted Estimator 587
82.1 Goal, Strategy, and Rationale of TE theory . . . . . . . . . . . . . . . 587
82.2 Functional Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
82.3 Linear Approximation of $\Psi[P_N]$ . . . . . . . . . . . . . . . . . . . . . 591
82.4 ATE estimand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
82.5 ATE estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
82.5.1 $\Psi_E$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
82.5.2 $\Psi_G$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
82.5.3 $\Psi_{IPW}$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
82.5.4 $\Psi_{LIPW}$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
82.5.5 $\Psi_{LIPW++}$ (a.k.a. $\Psi_{TMLE}$) . . . . . . . . . . . . . . . . . . . . 598
82.6 $\Psi_{TMLE}$ in practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
83 Time Series Analysis: ARMA and VAR 603
83.1 White noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
83.2 Backshift operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
83.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
83.4 Definition of ARMA(p, q), AR(p) and MA(q). . . . . . . . . . . . . 606
83.5 Solving AR(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
83.6 Solving MA(q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
83.7 Solving ARMA(p, q) . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
83.8 Auto-correlation and partial auto-correlation . . . . . . . . . . . . . . 610
83.9 Generating function of auto-correlation . . . . . . . . . . . . . . . . . 614
83.10 Impulse Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
83.11 AR(p) and Yule-Walker equations . . . . . . . . . . . . . . . . . . . . 617
83.12 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
83.13 Model Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
83.14 Differencing and ARIMA(p, d, q) . . . . . . . . . . . . . . . . . . . . 624
83.15 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
83.15.1 PL of AR(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
83.15.2 PL of MA(q) . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
83.15.3 PL of ARMA(p, q) . . . . . . . . . . . . . . . . . . . . . . . . 632
83.16 VAR(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
84 Transfer Learning 635
85 Transformer Networks 637
86 Transportability of Causal Knowledge 640
87 Turbo Codes 644
87.1 Decoding Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
87.2 Message Passing Interpretation of Decoding Algorithm . . . . . . . . 649
88 Uplift Modelling 650
88.1 UP types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
88.2 Some Relevant Technical Formulas from Chapter 64 . . . . . . . . . . 651
88.3 UP Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
88.4 UP Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
88.4.1 Appendix, connection between c and c|j . . . . . . . . . . . 659
89 Variational Bayesian Approximation for Medical Diagnosis 660
89.1 Convex/Concave Dual Functions . . . . . . . . . . . . . . . . . . . . 663
90 Variational Bayesian Approximation via $D_{KL}$ 668
90.1 Free Energy F(x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
91 XGBoost 673
91.1 Divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
91.2 Minimizing Cost function for single tree . . . . . . . . . . . . . . . . . 675
91.3 Leaf Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
91.4 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
91.5 Feature Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
91.6 Final estimate of target attribute . . . . . . . . . . . . . . . . . . . . 680
91.7 Bnet for XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
92 Zero Information Transmission (Graphoid Axioms) 683
92.1 Consequences of Eq.(92.5) . . . . . . . . . . . . . . . . . . . . . . . . 684
Bibliography 686
Foreword
Welcome to Bayesuvius, a proto-book uploaded to GitHub!
A different Bayesian network is discussed in each chapter. Each chapter title
is the name of a bnet, and chapter titles are in alphabetical order.
This is a volcano in its early stages. The first version, uploaded to a GitHub
repo called Bayesuvius on June 24, 2020, covered only 2 bnets (Linear Regression
and GAN). I will add more chapters periodically. Remember, this is a moonlighting
effort, so I can't do it all at once.
For any questions about notation, please go to the Notational Conventions section.
Requests and advice are welcome.
Thanks for reading this,
Robert R. Tucci
www.ar-tiste.xyz
ADDENDA
August 15, 2021: At this point, the book has grown to 67 chapters
and 433 pages. Today, I am self-publishing it as an ebook at Amazon and
similar outlets. It will still be free.
Navigating the ocean of Judea
Pearl’s Books
Many of the greatest ideas in the bnet field were invented by Judea Pearl and his
collaborators. Thus, this book is heavily indebted to those great scientists. Those
ideas have had no clearer and more generous expositor than Judea Pearl himself.
Pearl has written 4 books that I have used in writing Bayesuvius. His 1988
book, Ref.[44], dates back to his pre-causal period. I used that book to learn about
topics such as d-separation, belief propagation, Markov blankets, and noisy-ORs. Three
other books, which he wrote later in his causal period, are:
1. In 2000 (1st ed.), and 2013 (2nd ed.), Pearl published what is so far his most
technical and exhaustive book on the subject of causality, Ref.[46].
2. In 2016, he released together with Glymour and Jewell, a less advanced “primer”
on causality, Ref.[49].
3. In 2018, he released together with Mackenzie his lovely “The Book of Why”,
Ref.[50].
I used those 3 books to learn about causality topics such as Do Calculus, the backdoor
and frontdoor adjustment formulae, linear deterministic bnets with exogenous noise,
and counterfactuals.
A micro poem written by me to celebrate Judea Pearl and his work:
I, Robot
Let other robots talk(),
while I,
talk(), do() and imagine().