- Double/debiased machine learning II: application
- Federated learning for clinical applications
- Coastal differences in artists' Instagram captions
- How many roads must a random walker walk down before it gets out of Reykjavik?
- Double/debiased machine learning I: theory

- Estimate the mapping from X to Y with ML
- Estimate the mapping from X to T with ML
- Regress the residuals from (1) onto the residuals from (2). The result of this regression is the treatment effect
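The three-step recipe above can be sketched in a few lines of scikit-learn. This is a toy example, not the post's simulation: it uses a continuous treatment, a constant true effect of 2, and illustrative model choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 1, size=(n, 5))
t = x[:, 0] + rng.normal(0, 0.1, n)            # treatment depends on X
y = 2.0 * t + x[:, 0] + rng.normal(0, 0.1, n)  # true treatment effect is 2

# steps (1) and (2): out-of-fold nuisance predictions (cross-fitting)
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=50, random_state=0), x, y, cv=2)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=50, random_state=0), x, t, cv=2)

# step (3): regress Y residuals on T residuals; the slope is the effect estimate
final = LinearRegression().fit((t - t_hat).reshape(-1, 1), y - y_hat)
print(final.coef_[0])  # should land near the true effect of 2
```

Using out-of-fold predictions for the nuisance models is what keeps the residual regression from inheriting their overfitting.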

- Y is the treatment effect times a binary indicator of treatment, plus a linear combination of variables from X, plus some noise. \(\gamma\) selects and weights the columns of X included in the simulation.
- T is the binary treatment variable. It is calculated by passing a linear combination of variables from X, weighted and selected by \(\beta\), into a sigmoidal logit function \(\sigma\).
- \(\theta(X)\) is an exponential function of two times the first column of X: \(\theta(X) = e^{2X_1}\)
- \(\gamma\), \(\beta\) have 50 nonzero elements, which are drawn from a uniform distribution between -1 and 1
- \(\epsilon\), \(\eta\) are noise terms uniformly distributed between -1 and 1
- X is a matrix with entries uniformly distributed between 0 and 1

We can simulate our data in Python with the code below:

```
import numpy as np

def get_data(n, n_x, support_size, coef=2):
    """
    heterogeneous CATE data generating process
    :param n: the number of observations to simulate
    :param n_x: the number of columns of X to simulate
    :param support_size: the number of columns of X that influence T and Y. Must be smaller than n_x
    :param coef: the coefficient passed to theta (default 2)
    :return: x, y, t, and cate: the features, risk, treatment, and treatment effect
    """
    # patient features
    x = np.random.uniform(0, 1, size=(n, n_x))
    # conditional average treatment effect
    cate = np.array([theta(xi, coef=coef) for xi in x])
    # noise
    u = np.random.uniform(-1, 1, size=(n,))
    v = np.random.uniform(-1, 1, size=(n,))
    # coefficients
    support_Y = np.random.choice(np.arange(n_x), size=support_size, replace=False)
    coefs_Y = np.random.uniform(-1, 1, size=support_size)
    support_T = support_Y
    coefs_T = np.random.uniform(-1, 1, size=support_size)
    # treatment
    log_odds = np.dot(x[:, support_T], coefs_T) + u
    t_sigmoid = 1 / (1 + np.exp(-log_odds))
    t = np.array([np.random.binomial(1, p) for p in t_sigmoid])
    # risk
    y = cate * t + np.dot(x[:, support_Y], coefs_Y) + v
    return x, y, t, cate
```


We can also explicitly define a function for \(\theta(X)\)
```
def theta(x, coef=2, ind=0):
    """
    exponential treatment effect as a function of patient characteristics (x)
    :param x: the feature data for a single observation (size 1 x n_x)
    :param coef: the coefficient in the exponential function (default 2)
    :param ind: an integer indicating which column of x to use in the exponential function (default 0)
    :return: the treatment effect for a given observation
    """
    return np.exp(coef * x[ind])
```


After simulating our data, we're also going to split off a test set from our data for evaluation.
It's also important to remember that each training fold is going to be further split within the DML estimator, so we want to make sure there are enough observations in the training set to split further.
```
from sklearn.model_selection import train_test_split

n = 5000
n_x = 100
support_size = 50
x, y, t, cate = get_data(n, n_x, support_size, coef=2)
x_train, x_test, y_train, y_test, t_train, t_test, cate_train, cate_test = train_test_split(
    x, y, t, cate, test_size=0.8)
```


The broad goal from this point will be to see if we can build a model that accurately estimates \(\theta(X)\).
In the DML algorithm, this happens in three steps: (1) train a model for T; (2) train a model for Y; and (3) train an estimator for \(\theta(X)\). We'll be using scikit-learn for the machine learning models, and
econML for the DML estimator.
```
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# parameters for forest
params = {
    'max_depth': [5, 10],
    'min_samples_leaf': [2, 4, 10],
    'min_samples_split': [2, 4],
    'n_estimators': [400, 1000]
}
t_mdls = GridSearchCV(RandomForestClassifier(), params, cv=5)
t_mdl = t_mdls.fit(x_train, t_train).best_estimator_
```


We'll use the ROC curve to see how well our T model is doing:
```
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# evaluate T
pred = t_mdl.predict_proba(x_test)[:, 1]
fpr, tpr, _ = roc_curve(t_test, pred, drop_intermediate=False)
roc_auc = roc_auc_score(t_test, pred)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
```


`print(roc_auc)`

**Output:**

0.6471481

Our T model AUC is 0.65, which is far from impressive. But remember, we can (theoretically) get good CATEs even if our T model performance isn't great, so let's keep going.

```
# fit y
from xgboost import XGBRegressor

# parameters for boosting
params = {
    'max_depth': [5, 10],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 400, 1000]
}
y_mdls = GridSearchCV(XGBRegressor(), params, cv=5, n_jobs=-1)
y_mdl = y_mdls.fit(x_train, y_train).best_estimator_
```


For the continuous variable, we will evaluate our performance by calculating the bias, or the percent deviation from the true value.
```
# evaluate Y
pred = y_mdl.predict(x_test)
plt.scatter(pred, y_test)
lims = [
    np.min([pred, y_test]),  # min of both axes
    np.max([pred, y_test]),  # max of both axes
]
plt.plot(lims, lims, color='navy', linestyle='--')
plt.xlabel('Predicted Y')
plt.ylabel('True Y')
```


`np.mean(np.abs(pred - y_test) / y_test)`

**Output:**

0.4681890

Similarly, we get mediocre performance when estimating Y. Our values have a bias of about 47%, meaning that if our true Y value was 10, our model might guess 15.

```
# dml
from econml.dml import CausalForestDML
est = CausalForestDML(model_y=y_mdl, model_t=t_mdl, discrete_treatment=True, cv=5)
```


This function takes care of the sample splitting procedure! The argument 'cv' defines how many folds to use for cross-fitting. The default is 3, but the original paper recommends using 5 or 6 if possible.
Similar to 'RandomForestClassifier' in scikit-learn, the 'CausalForestDML' estimator has many other parameters. If you're familiar with random forests, many of these will look familiar: the number of trees to include, the maximum depth of those trees, etc. The big difference between causal forests in econML and forests in sklearn is that the econML forest has no class weighting option. This is because the causal forest method already makes use of a specific weighting strategy [4].

Additionally, the econML estimator can't be used as input to sklearn's 'GridSearchCV' or 'RandomizedSearchCV' functions. However, we can use econML's own hyperparameter tuning method, 'tune'. Rather than evaluating parameter performance across cross-validated folds of data, this function uses out-of-bag scores on a single, small forest.

```
# parameters for causal forest
est_params = {
    'max_depth': [5, 10, None],
    'min_samples_leaf': [5, 2],
    'min_samples_split': [10, 4],
    'n_estimators': [100, 500]
}
est = est.tune(Y=y_train, T=t_train, X=x_train, params=est_params)
```

Our estimator now has tuned parameters, but it still needs to be fit.
```
est.fit(Y=y_train, T=t_train, X=x_train)
```

That's it! Now we can look into evaluating how well we did.
```
# get individual CATEs
patient_idx = np.random.randint(np.shape(x_test)[0])
# get cate
cate = est.effect(x_test[patient_idx:patient_idx+1, :])[0]
# get cate CI
lb, ub = est.effect_interval(x_test[patient_idx:patient_idx+1, :], alpha=0.05)
ci = np.array([[cate - lb[0]], [ub[0] - cate]])
# plot CATEs with CI for individual patients
plt.figure(figsize=(8, 6))
plt.errorbar(1, cate, yerr=ci, fmt="o", ecolor='k', zorder=1)
plt.tight_layout()
```


This plot indicates that for this patient, the model estimates that adding the treatment will increase the outcome measure by 4, though it has a wide confidence interval, spanning about 1 to 10.
Now we can move on to evaluation. How'd we do? Since this is simulated data, we can see how well our estimated treatment matches the true effect.

```
# get predicted CATEs for the test set
cate_pred = est.effect(x_test)
# plot
plt.figure()
plt.scatter(x_test[:, 0], cate_test, label='True Effect')
plt.scatter(x_test[:, 0], cate_pred, color='orange', label='Predicted Effect')
plt.xlabel('Patient Features')
plt.ylabel('CATE')
plt.legend()
```


`np.mean(np.abs(cate_pred - cate_test) / cate_test)`

**Output:**

0.1577856

We did pretty well! Notably, we did pretty well even though our T and Y models had mediocre performance. Here our model bias is about 16%, substantially lower than the bias for Y. What are the limits of this good performance, though? What is the minimum number of samples? What happens when we add more variables, or more noise? This paper [3] shows that DML (and all causal estimation methods) does better with more samples, fewer variables (though DML does better than other methods when the number of columns of X > 150), fewer confounding variables, and weaker confounding.

- Consistency in CATE and ATE estimates: While this is more of a validity check than an evaluation, it is considered best practice for any method estimating conditional average treatment effects (CATEs, the subject of this post). The idea is that you sort your heterogeneous treatment effects into a few bins and recalculate average treatment effects within each bin. While we didn't discuss them here, average treatment effects (ATEs) are a sample-level equivalent of the conditional averages. If your ATE and CATE distributions are similar, you can have more confidence that your CATE estimates are not spurious.
- Benchmarking and medical knowledge: To some extent, we can leverage medical knowledge to confirm that our CATEs are in the right neighborhood. For example, we have substantial scientific evidence that aspirin can lower people's risk for heart attack. Therefore, if our mean CATE is ~20%, indicating an increase in risk, we can be reasonably sure that the model isn't performing well.
- Improved prediction: we can also put the evaluation back into a prediction-based framework. Mathematically, a patient's true risk Y(t) = Y(t-1) + CATE. If our predictions get better with the addition of the CATEs, we can at least conclude that our prediction is useful. It's important to note that 'useful' is not the same as 'accurate' and is not a validation of the causal assumptions of the model.
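The first check above can be sketched directly. This is a toy demo with a hypothetical `cate_pred` array, a constant true effect, and a naive difference-in-means ATE within each bin (a real analysis would use a doubly robust ATE estimate):

```python
import numpy as np

def cate_ate_consistency(cate_pred, y, t, n_bins=4):
    """Mean predicted CATE vs. naive within-bin ATE for each CATE quantile bin."""
    edges = np.quantile(cate_pred, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.digitize(cate_pred, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        # naive within-bin ATE: difference in mean outcome between treated and untreated
        ate = y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()
        rows.append((cate_pred[mask].mean(), ate))
    return rows

# toy demo: a known constant effect of 2 and a noisy hypothetical cate_pred
rng = np.random.default_rng(1)
n = 4000
t = rng.integers(0, 2, n)
y = 2.0 * t + rng.normal(0, 0.5, n)
cate_pred = 2.0 + rng.normal(0, 0.1, n)
rows = cate_ate_consistency(cate_pred, y, t)
for mean_cate, ate in rows:
    print(f"bin: predicted CATE {mean_cate:.2f}, naive ATE {ate:.2f}")
```

If the two columns diverge badly in some bins, that's a warning sign that the CATE estimates may be spurious there.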

- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2016). Double/Debiased Machine Learning for Treatment and Causal Parameters. http://arxiv.org/abs/1608.00060
- Rose, S., & Rizopoulos, D. (2020). Machine learning for causal inference in Biostatistics. In Biostatistics (Oxford, England) (Vol. 21, Issue 2, pp. 336–338). NLM (Medline). https://doi.org/10.1093/biostatistics/kxz045
- McConnell, K. J., & Lindner, S. (2019). Estimating treatment effects with machine learning. Health Services Research, 54(6), 1273–1282. https://doi.org/10.1111/1475-6773.13212
- Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2018). Orthogonal Random Forest for Causal Inference. http://arxiv.org/abs/1806.03467

Both data privacy and data governance are necessary parts of the data ecosystem, but they unfortunately disincentivize institutions from sharing their data or contributing them to larger data lakes, making them both an important safeguard and a barrier to progress. If there were a way to decentralize model training such that institutions could retain control of their data while modelers incorporated those data into their work, clinical applications could advance without sacrificing privacy or drastically changing incentives. This decentralization is the promise of federated learning.

Decision or Consideration | Description | Standard Solution in Clinical Setting |
---|---|---|
Nodes and Topology | How many nodes? Will there be aggregate nodes? How will they be connected? | Few nodes (each node is a hospital), all connected to one aggregate |
Updates | How many nodes will participate in each update? | All connected nodes will participate in updates |
Data structure | Naming conventions, file structures, etc. | None, because it depends so much on the specific problem you're working with. Here's an example from neuroscience called BIDS |
Data partitions | Are features, labels, or observations shared across nodes? | Each node will have different observations, but at least some shared features and labels |
Data distribution | How are features and labels distributed across nodes, and how will this influence the learning algorithm? | Multiple (see below) |
Privacy measures | What extra privacy measures will be taken, if any? | Multiple (see below) |
Hyperparameters | Weighting coefficients (\(w_k\)), loss function, etc. | Weighting by the number of observations, but there are some interesting and useful variants (see dealing with non-IID data for some examples). All other parameters determined similarly to centralized learning |
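The standard weighting scheme in that last row is federated averaging (FedAvg): each node's parameters are weighted by its share of the total observations. A minimal sketch of the aggregation step in plain NumPy (the per-node models themselves are assumed to already exist):

```python
import numpy as np

def fed_avg(node_params, node_sizes):
    """Aggregate per-node parameter vectors, weighted by observation counts (FedAvg)."""
    sizes = np.asarray(node_sizes, dtype=float)
    w_k = sizes / sizes.sum()  # the weighting coefficients w_k from the table
    return sum(w * p for w, p in zip(w_k, node_params))

# three hospitals with different amounts of data
node_params = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
node_sizes = [100, 300, 600]
print(fed_avg(node_params, node_sizes))  # → [0.7 0.9]
```

The aggregate node runs this after each round of local training; the non-IID variants discussed below mostly amount to changing how \(w_k\) or the node losses are computed.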

- Feature skew: some features not present. i.e. one hospital does not record heart rate.
- Label distribution skew: some labels are not present. i.e. building a model to predict COVID when one hospital has no patients with COVID.
- Concept Shift: the same features lead to a different label. i.e. building a model to predict gut health from cheese consumption with hospitals in Asia and Europe. Since lactose intolerance is more common in Asia, the 'cheese' feature would lead to different labels at different hospitals.
- Concept Drift: the same label arises from different features. i.e. building a model to predict anxiety levels in hospitals with patients from different socioeconomic levels. While both hospitals might have patients with anxiety, the things causing that anxiety might differ.
- Quantity skew: labels have different distributions (imbalanced). i.e. building a model to predict COVID when one hospital has 40% of patients test positive, and another has 2%.
- Prior probability shift: features have different distributions. i.e. using age as a predictor when data come from children's hospitals and general hospitals.
- Unbalancedness: vastly different numbers of observations. This one is self-explanatory.

- Balancing training data: each node can implement its own resampling scheme, such as SMOTE or GAN resampling. In the same review mentioned above, this was the most popular method for addressing skew [3], though the review only discussed the first three methods in this list.
- Adaptive hyperparameters: using loss functions and weighting coefficients that are specific to each node.
- LoAdaBoost: one specific example that boosts the training of weak nodes by forcing the loss function to fall below some threshold before they contribute to the aggregate[4].
- Domain adaptation: Use meta-training to determine how to combine predictors trained on various domains, similar to transfer learning[5].
- Share data: share a small amount of data or summary statistics from data to fill in missing values and supplement skewed distributions
- Normalization: (only applied to deep learning models) group, rather than batch normalization helps with skewed labels[6].
- Different algorithm: the Federated Learning Based on Dynamic Regularization (FedDyn) algorithm can guarantee that the node losses converge to the global loss[7].

- Add noise: add noise to either the data, or the gradients
- Differential privacy: a method of adding noise that ensures that model outputs are nearly identical even if any one data point is removed.
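A minimal sketch of the noise-based approach: clip each gradient's norm and add Gaussian noise before it leaves the node. This follows the Gaussian-mechanism pattern from differential privacy, but the clip norm and noise scale here are illustrative, not calibrated to a formal privacy budget:

```python
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_scale=0.5, rng=None):
    """Clip the gradient's L2 norm, then add Gaussian noise before sharing it."""
    rng = rng if rng is not None else np.random.default_rng()
    norm = max(np.linalg.norm(grad), 1e-12)
    clipped = grad * min(1.0, clip_norm / norm)  # bound any one node's contribution
    return clipped + rng.normal(0.0, noise_scale * clip_norm, size=grad.shape)

grad = np.array([3.0, 4.0])         # norm 5, clipped down to norm 1
private = privatize_gradient(grad)  # noisy version, safer to transmit
```

Clipping bounds how much any single data point can move the update; the noise then masks what remains.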

- Encryption: encrypt the gradients or parameters that get sent back and forth

- PySyft
A screenshot from a PySyft tutorial
- Supports encryption and differential privacy
- Support for non-IID data (via sample sharing)
- ‘numpy-like’ interface (their words)
- Currently, they want users to work with the team on new applications

- Tensorflow
A screenshot from a tensorflow tutorial
- Probably easy to use if you already work with tensorflow
- No built-in support for privacy or non-IID data

- FATE
A screenshot from a FATE tutorial
- No support for non-IID data (though nothing is stopping you from adding your own resampling function to the pipeline)
- Supports encryption
- Pipeline package interface

- Federated learning promises a flexible, decentralized way to train machine learning algorithms. Widespread adoption of federated learning could make modeling with more sophisticated methods, or for more niche populations, feasible.
- Federated learning is presented as a solution to data governance and privacy issues that make sharing data difficult. While the method has clear benefits over centralized learning, data privacy, and especially data governance, will likely still present issues moving forward. As discussed in the post, federated learning applications are not a complete solution to security issues and will likely require more protections in any real-world application. Additionally, the incentives that make it harder for institutions to contribute data to data lakes might also make it harder to offer access for federated learning projects. If you spent a lot of money collecting a rare dataset, you might want to get the first (or second, or third) crack at any modeling projects using that dataset. Essentially, I do not think federated learning can serve as a substitute for incentivizing data sharing or protecting/compensating data curators.
- Starting a federated learning project requires many decisions and considerations. The decisions with the least clear solutions are those involving data standardization across sites and those involving how to deal with non-IID data distributions. Both reviews cited in this post recognize that these are important issues[1,3], but stop short of providing clear recommendations. I think the method would be more accessible and more likely to be used responsibly if some of the packages produced pandas-profiling-style reports of data distributions across sites and provided more support for implementing solutions.

- Rieke, N., Hancox, J., Li, W., Milletarì, F., Roth, H. R., Albarqouni, S., Bakas, S., Galtier, M. N., Landman, B. A., Maier-Hein, K., Ourselin, S., Sheller, M., Summers, R. M., Trask, A., Xu, D., Baust, M., & Cardoso, M. J. (2020). The future of digital health with federated learning. Npj Digital Medicine, 3(1).
- Schwarz, C. G., Kremers, W. K., Therneau, T. M., Sharp, R. R., Gunter, J. L., Vemuri, P., Arani, A., Spychalla, A. J., Kantarci, K., Knopman, D. S., Petersen, R. C., & Jack, C. R. (2019). Identification of Anonymous MRI Research Participants with Face-Recognition Software. New England Journal of Medicine, 381(17), 1684–1686.
- Prayitno, Shyu, C. R., Putra, K. T., Chen, H. C., Tsai, Y. Y., Tozammel Hossain, K. S. M., Jiang, W., & Shae, Z. Y. (2021). A systematic review of federated learning in the healthcare area: From the perspective of data properties and applications. In Applied Sciences (Switzerland) (Vol. 11, Issue 23). MDPI
- Huang, L.; Yin, Y.; Fu, Z.; Zhang, S.; Deng, H.; Liu, D. LoAdaBoost: Loss-based AdaBoost federated machine learning with reduced computational complexity on IID and non-IID intensive care data. PLoS ONE 2020, 15, e0230706.
- Guo, J., Shah, D. J., & Barzilay, R. (2018). Multi-Source Domain Adaptation with Mixture of Experts
- Hsieh, K., Phanishayee, A., Mutlu, O., & Gibbons, P. (2020). The Non-IID Data Quagmire of Decentralized Machine Learning. International Conference on Machine Learning. PMLR: 4387–4398.
- Acar, D. A. E., Zhao, Y., Navarro, R. M., Mattina, M., Whatmough, P. N., & Saligrama, V. (2021). Federated Learning Based on Dynamic Regularization
- For a lighter read, check out this comic from Google

A lot of the popular tags break down into a few categories.

- **Artist location:** Artists include hashtags for nearby locations such as #brooklyn or #sanjose. These tags were probably originally something like #brooklyntattoo or #sanjosetattooartist that got clipped in the hashtag cleaning process. It is also possible artists are making tattoos of these actual regions, though I think this is less likely.
- **Tattoo subject:** In both cities, flowers are a popular subject for tattoos (#floral, #flower), though this is the only subject that cracks the top 10.
- **Tattoo style:** A few styles like #fineline, #blackwork, #color and #blackandgrey are top tags in both cities. A few styles that are considered popular but don't make the cut in both would be #traditional, #newschool, #japanese, etc.
- **Tattoo magazines:** One tag, #tttism, is a tattoo magazine with a large digital division.

These separate into different categories.

- **Artist identity:** New York has a higher prevalence of both #queer and #qttr (an abbreviation for queer tattoo artist). Does this mean there are more queer artists in New York? There are a few possible explanations: there may be more queer artists in NY, queer artists in NY may be more willing to self-identify on social media, or queer SF artists may use different or more specific terminology.
- **Tattoo techniques:** One surprising difference to me was the prevalence of tags referencing different tattoo equipment on different coasts. Specifically, #handpoked tattoos, which are non-electric, are more popular in NY, while #singleneedle tattoos, which use only one kind of needle, are more popular in SF.
- **Tattoo style:** One of the most interesting differences to me is the dichotomy between #traditional and #surrealist style tattoos on each coast. Surrealism constitutes a more niche style than traditional tattoos in general, and its prevalence in SF evokes images of the city's free-spirited, pre-Silicon Valley past for me. Similarly, #chicano and #chicanostyle are not often listed in the US's top styles, but their growing popularity in a state with a growing Latinx population seems intuitive.
- **Tattoo subject:** We find that NY has a much higher prevalence of anime tattoos compared to SF. However, it is once again possible that SF tattoo artists simply use different, or more specific, language to tag their anime tattoos.

There are some prominent differences between the grouping structure of the two cities.

- Level 1: 2 groups
- Level 2: 3 groups
- Level 3: 7 groups
- Level 4: 20 groups
- Level 5: 52 groups
- Level 6: 160 groups
- Level 7: 637 groups
- Level 8: 28282 groups

- Level 1: 2 groups
- Level 2: 5 groups
- Level 3: 14 groups
- Level 4: 32 groups
- Level 5: 98 groups
- Level 6: 263 groups
- Level 7: 1463 groups

We can also see some interesting differences in this visualization. New York has communities that are more evenly sized, while San Francisco has more small groups. Lastly, we can look at a finer scale and visualize the hashtags contained in some of the smaller communities.

We get some communities that are expected:

Cats and Pets

Black and grey animals

Flowers and Skulls

Pokemon

Neotraditional tattoos with white ink

Space

Others that are less expected, but show consistent themes:

Agriculture

Ghost type pokemon

Animal prints

Cartoon portraits

Chickens

Tattoo artists giving themselves ocean themed leg tattoos

And lastly, some that show some confusing and inspiring mixes of topics:

Psychedelic cheese

Batman (the Ben Affleck one) and plants

Indi(ana Jones) music

- We find that regardless of city, fine line, black work, and black and grey styles are popular to tag. Similarly, flowers or floral designs are consistently mentioned
- San Francisco has a uniquely large community of artists posting about Chicano style and surrealist tattoos, while New York has more posts about anime tattoos and traditional styles
- We find some evidence that New York tattoo posts more easily separate into large groups, while San Francisco tattoos have more small, niche groups.
- A nonzero number of people have tattoos of the Ben Affleck batman

```
import pandas as pd
import json
# load
with open('YOUR_PATH_TO_DATA/Takeout/Location History/Records.json') as data_file:
    data = json.load(data_file)
df = pd.json_normalize(data, 'locations')
# get only relevant variables
df = df[['latitudeE7', 'longitudeE7', 'timestamp']]
# transform them to be useful
df.timestamp = pd.to_datetime(df.timestamp)
df = df.assign(
    lat=df['latitudeE7'] / 1e7,
    long=df['longitudeE7'] / 1e7,
    year=df['timestamp'].dt.year,
    month=df['timestamp'].dt.month,
    day=df['timestamp'].dt.day,
    dow=df['timestamp'].dt.day_name(),
    time=df['timestamp'].dt.time,
)
```

As a quick validity check, let's use the geopandas package to visualize all our different coordinates. Google has data on my locations since 2012, so this plot should include most of my locations over the past decade (i.e. concentrated on the two US coasts).
```
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
# set up data structure of coordinates
geo = [Point(xy) for xy in zip(df.long, df.lat)]
gdf = gpd.GeoDataFrame(df, geometry=geo)
# plot
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(14,14), color='lightgrey'), color="#7397CB")
```

Great! Now we're going to filter down to the specific week when I was in Iceland:
```
st = '2017-07-31'
en = '2017-08-07'
iceland = df[(df['timestamp'] >= st) & (df['timestamp'] <= en)]
# remove duplicate lat/long pairs
iceland = iceland.drop_duplicates(subset=['lat', 'long', 'day', 'month'])
```

Now that we have the data we'll be working with, we're ready to move on to working with road networks.
```
import osmnx as ox
ox.config(log_console=True, use_cache=True)
# location of graph
place = 'Iceland'
# find shortest route based on the mode of travel (for our purposes it doesn't really matter which one)
mode = 'walk' # 'drive', 'bike', 'walk'
# find shortest path based on which feature
optimizer = 'length' # 'length','time'
# create graph
graph = ox.graph_from_place(place, network_type = mode)
```

What exactly is this graph object we just made?
```
type(graph)
```

**Output:**

networkx.classes.multidigraph.MultiDiGraph

This is a networkX object. This means we can use functions and attributes from networkX to get properties of the graph. For example, we can get the density (or proportion of possible connections present) and number of edges in this graph.

```
import networkx as nx
print(nx.density(graph))
print(graph.number_of_edges())
```

**Output:**

1.9575729910546824e-05

327816

We're now going to use networkX to get paths between nodes that correspond to my location data. Our strategy for calculating my trajectory on this graph is as follows: get a starting coordinate; snap it to the closest spot on the road network; get an ending coordinate; snap it to the network; get the shortest path between the start and end coordinates; repeat. We can code up this strategy as follows:

```
route = []
start_latlng = (iceland['lat'].values[0], iceland['long'].values[0])
# find the nearest node to the start location
orig_node = ox.get_nearest_node(graph, start_latlng)
# get the rest of the path
for i in range(len(iceland) - 1):
    # define the end location in latlng
    end_latlng = (iceland['lat'].values[i+1], iceland['long'].values[i+1])
    # find the nearest node to the end location
    dest_node = ox.get_nearest_node(graph, end_latlng)
    # find the shortest path
    route.extend(nx.shortest_path(graph, orig_node, dest_node, weight=optimizer))
    # advance to the next step
    orig_node = dest_node
```

And then plot it
```
from itertools import groupby
# remove any consecutive duplicates from the route - these cause the plotting to break
route = [i[0] for i in groupby(route)]
fig, ax = ox.plot_graph_route(graph,
                              route,
                              node_size=0,
                              edge_linewidth=0.3,
                              edge_color='white',
                              route_color="#EB8D43",
                              route_linewidth=1.5,
                              route_alpha=1.0,
                              orig_dest_size=0)
```

And here we have a visualization of my vacation.
It's also important to note some limitations in this visualization. This method requires snapping to a road network. This might get incorrect locations at times when I was not on a true road, like on hikes or boats. Additionally, the shortest path between two points might not be the path that I took. This can lead to wrong data especially in cases where I travelled far between Google's recordings.
Given these caveats, we can still observe some things about my path. While I did travel through most parts of the country, the path I took is very directed. No one area has dense coverage, and the path mostly sticks to the coast. These observations make sense given what I wanted to get out of the trip. I wanted to see as much of the country as possible in the week I was there. I also had prebooked accommodations in different cities for every night, which didn't allow me to wander or dwell in any one area.
How could this have looked different? What if I had shown up to Iceland and just wandered the streets of Reykjavik, turning randomly whenever I felt like it? What if I turned mostly randomly, but had tried to avoid places I had been before?
```
# scaling factor for length of walk
c = 1  # plots shown for 1, 2, and 10
n = c * len(route)  # number of steps, scaled relative to the length of my actual route
# initialize at the start of my trip
start_latlng = (iceland['lat'].values[0], iceland['long'].values[0])
orig_node = ox.get_nearest_node(graph, start_latlng)
random_route = [orig_node]
# get the walk
for k in range(n):
    # get the neighboring nodes
    next_node = list(graph.neighbors(orig_node))
    # get random selection for next step
    dest_node = np.random.choice(next_node)
    # add to our route
    random_route.append(dest_node)
    orig_node = dest_node
```

\(c=1\)

\(c=2\)

\(c=10\)

This algorithm looks a lot different from my path! It doesn't see the whole country, and in fact never really gets far out of Reykjavik, even if it takes 10 times more steps than I did. This isn't surprising given how much more densely connected streets are inside of cities than outside of them. Once you're inside a city, more intersections will lead you into the city than out of it, creating a spot where random walkers tend to accumulate. Maybe if we try to avoid intersections we've already seen, we can make sure we get out of the city eventually.

We can add biases to this walk algorithm to make it less likely to revisit nodes it has already been to. This is accomplished by adding an additional parameter (\(r\)) that sets the relative probability of revisiting nodes already in the path versus visiting new ones.
```
r = 0.01
# initialize from the same starting node
orig_node = ox.get_nearest_node(graph, start_latlng)
bias_random_route = [orig_node]
# get the walk
for k in range(n):
    # get neighboring nodes
    next_node = list(graph.neighbors(orig_node))
    # get transition probabilities, weighted by revisits
    transition_prob = np.ones((len(next_node),))
    revisit_idx = [m in bias_random_route for m in next_node]
    # check if all nodes are revisits/new
    if len(set(revisit_idx)) == 1:
        # set equal probability
        transition_prob = transition_prob * (1 / len(transition_prob))
    else:
        # parameter r determines bias against revisits
        transition_prob[revisit_idx] = r * transition_prob[0]
        # normalize so it sums to 1
        transition_prob = transition_prob / sum(transition_prob)
    # get random selection for next step
    dest_node = np.random.choice(next_node, size=1, replace=False, p=transition_prob)
    dest_node = dest_node[0]  # unwrap from array
    # add to our route
    bias_random_route.append(dest_node)
    orig_node = dest_node
```

\(c=1\)

\(c=2\)

\(c=10\)

This bias towards novelty does a little better. Now we much more easily get out of the city, and even get to a different part of the island. But the areas it visits are still more densely explored than mine, and I still get more coverage of the island as a whole.

There are a lot of ways we could build on this bias towards novelty to get more realistic-looking walks. We could add a bias towards popular locations, force the walk to start and stop at the airport, or give the walker global knowledge of the graph and tell it to navigate efficiently to specific landmarks. But for now, we'll stop here and appreciate the aesthetic differences in these few exploration styles.

- This project was inspired by this twitter post by Dr. Iva Brunec
- In addition to the documentation, this post on OSMnx was useful
- Color scheme from the Channel Islands

A method called double machine learning (DML) allows causal inference to coexist with complex data under few assumptions, which has drummed up a lot of well-deserved excitement about the method (primarily in economics). When I was trying to learn more about DML, I found that there weren't as many resources out there as I had hoped, and that most of the resources that were out there took a very theoretical approach. I wanted to create a resource that explained the theory at a higher level and had a larger emphasis on code-based explanations. This post focuses on understanding how the DML algorithm works. If you want to skip to the application, you can look at my second DML post and this tutorial from econML.

- be accurate with a lot of data (might seem obvious, but this is harder than it sounds)
- come with confidence intervals
- not make strict assumptions about the form of the data (leverage machine learning)

- Regularization, necessary for fitting complex data, induces a bias (think bias-variance trade-off). To reduce overfitting, analysts using machine learning methods often use regularization; however, this necessarily increases the bias of their estimates.
- Despite our best efforts, machine learning models fit on data that follow the causal diagram above tend to overfit data, further biasing results.

- Consistency - An individual's potential outcome under their observed exposure history is precisely their observed outcome.
- Positivity - Everyone in the study has some probability of receiving treatment
- No unmeasured confounding - you are recording all variables that influence Y and T in X. I think this is the most fraught assumption in medical contexts [2].
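Positivity, at least, can be checked empirically: estimate propensity scores and make sure none are pinned near 0 or 1. A minimal sketch with scikit-learn, using simulated stand-in data (the variable names and thresholds here are illustrative, not from the post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# stand-in simulated data; in practice, use your own x (features) and t (treatment)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(500, 5))
p_true = 1 / (1 + np.exp(-(x[:, 0] - 0.5)))
t = (rng.uniform(size=500) < p_true).astype(int)

# estimate propensity scores P(T = 1 | X)
ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# positivity looks plausible if no score is pinned near 0 or 1
ok = (ps.min() > 0.01) and (ps.max() < 0.99)
```

If many scores sit at the extremes, some subgroups essentially never (or always) receive treatment, and no estimator can recover their treatment effect from the data alone.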

Caveats:

- categorical treatment - at the moment, there isn't a way to use DML for a categorical treatment variable that also provides confidence intervals. Other methods, such as doubly robust learning, might be better suited here.
- biased data classes - DML is known to be biased in cases where one outcome is extremely rare (though it is less biased than many other methods). Over/undersampling of the data might be helpful in these cases.

- Doubly robust learning
- Targeted minimum loss-based estimation (TMLE)
- Bayesian Additive Regression Trees (BART)
- Bayesian Causal Forest (BCF)

- \(X\) the features
- \(Y\) the outcome
- \(T\) the treatment (it can be binary, continuous, or categorical)
- \(g_0(x)\) some mapping of x to y, excluding the effect of T and \(\theta_0\)
- \(m_0(x)\) some mapping of x to t
- \(\theta\) - the treatment effect. Here it's a scalar, for simplicity, but this doesn't have to be the case
- \(U, V\) the noise, which cancels out on average
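Putting these pieces together, the partially linear model we'll work with (matching the simulation code: y = theta*t + g0 + v, t = m0 + u) is: $$ Y = T\theta_0 + g_0(X) + U $$ $$ T = m_0(X) + V $$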

```
import numpy as np
from scipy.linalg import toeplitz

# pick any value for theta_0
theta = -0.4

# define a function for generating data
def get_data(n, n_x, theta):
    """
    partially linear data generating process
    Inputs:
    n the number of observations to simulate
    n_x the number of columns of X to simulate
    theta a scalar value for theta
    """
    # correlated features
    cov_mat = toeplitz([np.power(0.7, k) for k in range(n_x)])
    x = np.random.multivariate_normal(np.zeros(n_x), cov_mat, size=[n, ])
    # noise terms
    u = np.random.standard_normal(size=[n, ])
    v = np.random.standard_normal(size=[n, ])
    # treatment depends on x
    m0 = x[:, 0] + np.divide(np.exp(x[:, 2]), 1 + np.exp(x[:, 2]))
    t = m0 + u
    # outcome depends on treatment and x
    g0 = np.divide(np.exp(x[:, 0]), 1 + np.exp(x[:, 0])) + x[:, 2]
    y = theta * t + g0 + v
    return x, y, t
```
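Before turning to ML, it's worth seeing why naively regressing \(Y\) on \(T\) fails on this data: \(T\) and \(g_0(X)\) both depend on \(X\), so the simple least-squares slope absorbs part of \(g_0\)'s effect (classic omitted-variable bias). A standalone sketch, re-simulating the same process inline so it runs on its own:

```python
import numpy as np
from scipy.linalg import toeplitz

np.random.seed(0)
theta, n, n_x = -0.4, 5000, 20
# same partially linear data generating process as above
cov_mat = toeplitz([0.7 ** k for k in range(n_x)])
x = np.random.multivariate_normal(np.zeros(n_x), cov_mat, size=[n, ])
t = x[:, 0] + np.exp(x[:, 2]) / (1 + np.exp(x[:, 2])) + np.random.standard_normal(n)
g0 = np.exp(x[:, 0]) / (1 + np.exp(x[:, 0])) + x[:, 2]
y = theta * t + g0 + np.random.standard_normal(n)

# naive least-squares slope of y on t: cov(t, y) / var(t)
theta_naive = np.cov(t, y)[0, 1] / np.var(t)
# theta_naive lands well away from the true theta because cov(t, g0) != 0
```

The gap between `theta_naive` and the true \(\theta_0 = -0.4\) doesn't shrink with more data, which is exactly the confounding problem DML is built to handle.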


Let's imagine you're given some X, T, and Y data, as well as the data generating equations above. You're then asked to estimate what theta is.
One first attempt might be to build a machine learning model of \(Y = T\theta_{0} + g_{0}(X)\), use it to estimate \(g_0(X)\), and then regress out \(T\) to solve for \(\theta_0\).
This is a little tricky because \(g_0(X)\) is not the influence of \(X\) on \(Y\); it's the influence of \(X\) on the part of \(Y\) that isn't influenced by \(T \times \theta_0\). Therefore, we have to do this iteratively: get an initial guess for \(\theta_0\) in order to estimate \(g_0(X)\), then use that estimate of \(g_0(X)\) to solve for \(\theta_0\). In code, the direct method would look like this:

First, we'd simulate our data, and build our machine learning estimate of \(Y\) from \(T\theta_{0} + g_{0}(X)\) (we'll call this model \(l_0(X)\))

```
from sklearn.ensemble import RandomForestRegressor
# simulation sizes (any reasonable values work)
n, n_x = 500, 20
# get data
x, y, t = get_data(n, n_x, theta)
# this will be our model for predicting Y from X
ml_l = RandomForestRegressor()
ml_l.fit(x, y)
```


Note that you could use whatever machine learning model you want; it doesn't have to be a random forest. In this example, it should be anything that can estimate exponential functions (since that's the form we picked for our data generating function). Next, we can take an initial guess for \(\theta_0\), and then fit our estimate of \(g_0(X)\)
```
# this will be our model for predicting Y - T*theta from X, or g0_hat
ml_g = RandomForestRegressor()
# initial guess for theta
l_hat = ml_l.predict(x)
psi_a = -np.multiply(t, t)
psi_b = np.multiply(t, y - l_hat)
theta_init = -np.mean(psi_b) / np.mean(psi_a)
# get estimate for g0
ml_g.fit(x, y - t*theta_init)
g_hat = ml_g.predict(x)
```


Lastly, we can regress the effect of \(T\) out from our prediction
```
# compute residuals
u_hat = y - g_hat
psi_a = -np.multiply(t, t)
psi_b = np.multiply(t, u_hat)
# get estimate of theta and SE
theta_hat = -np.mean(psi_b) / np.mean(psi_a)
psi = psi_a * theta_hat + psi_b
err = theta_hat - theta
J = np.mean(psi_a)
sigma2_hat = 1 / len(y) * np.mean(np.power(psi, 2)) / np.power(J, 2)
err = err/np.sqrt(sigma2_hat)
```


If we repeat this process 200 times, we can generate a histogram of our error term and see how well we did.
If our estimate is good, we would expect the normalized difference between our estimate of \(\theta\) and the real theta to
be centered on 0.
This histogram shows that is not the case. Our estimate is way off and centered on a positive value. What went wrong?

At a high level, part of what went wrong is that we did not explicitly model the effect of \(X\) on \(T\). That influence is biasing our estimate. Illustrating this explicitly is where our partially linear data generating process comes in handy. We can write out an equation for the error in our estimate. The goal here would be for the left-hand side to converge to 0 as we get more data.

- The left hand side is our scaled error term - what we want to go to 0
- Noise cancels out on average, so this term is a very small number, divided by a big number. Essentially 0
- This term is where the problem is. Our estimation error in \(\hat{g}\) is never going to be 0. This is because of the deal we make as data scientists working with complex data: to reduce the variance (overfitting) of our machine learning model, we induce some bias in our estimate (often through regularization). Additionally, \(T\) depends on \(X\), so this term will not cancel on average either. Because of this, \((g_0-\hat{g}) \times T\) will be small, but not 0. It will be divided by a large number, and will converge to 0 eventually, but too slowly to be practical.

- Estimate \(T\) from \(X\) using ML model of choice (different from the direct method!)
- Estimate \(Y\) from \(X\) using ML model of choice
- Regress the residuals of each model onto each other to get \(\theta_0\)
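These three steps can be sketched compactly. This is a simplified illustration rather than the exact procedure from the paper: I re-simulate the same data generating process inline, and use scikit-learn's `cross_val_predict` to get out-of-fold nuisance predictions (a stand-in for the sample splitting discussed later in this post):

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

np.random.seed(0)
theta, n, n_x = -0.4, 2000, 20
# partially linear data generating process (same as above)
cov_mat = toeplitz([0.7 ** k for k in range(n_x)])
x = np.random.multivariate_normal(np.zeros(n_x), cov_mat, size=[n, ])
t = x[:, 0] + np.exp(x[:, 2]) / (1 + np.exp(x[:, 2])) + np.random.standard_normal(n)
g0 = np.exp(x[:, 0]) / (1 + np.exp(x[:, 0])) + x[:, 2]
y = theta * t + g0 + np.random.standard_normal(n)

# 1. estimate T from X and 2. estimate Y from X, both out-of-fold
m_hat = cross_val_predict(RandomForestRegressor(n_estimators=50), x, t, cv=2)
l_hat = cross_val_predict(RandomForestRegressor(n_estimators=50), x, y, cv=2)

# 3. regress the residuals onto each other
v_hat, u_hat = t - m_hat, y - l_hat
theta_hat = np.mean(v_hat * u_hat) / np.mean(v_hat * v_hat)
```

`theta_hat` now lands close to the true \(\theta_0 = -0.4\), unlike the direct method above.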

$$ \sqrt{n}(\hat{\theta_0} - \theta_0) = (EV^2)^{-1}\frac{1}{\sqrt{n}}\sum_{i \in I}^nU_iV_i + (EV^2)^{-1}\frac{1}{\sqrt{n}}\sum_{i \in I}^n(\hat{m_0}(X_i) - m_0(X_i))(\hat{g_0}(X_i) - g_0(X_i)) + \ldots $$

- The left hand side is the same as before
- Noise cancels out on average, so this term is a very small number, divided by a big number. Essentially 0
- Now we have two small, non-0 numbers multiplied by each other, divided by a large number. This will converge to 0 much more quickly than before
- ... this method adds a new term that we're going to ignore for now. But it comes back later!

```
# model for predicting T from X - new to the regularized version!
ml_m = RandomForestRegressor()
# model for predicting Y - T*theta from X (our estimate of g0)
ml_g = RandomForestRegressor()
ml_m.fit(x, t)
m_hat = ml_m.predict(x)
# this is the part that's different: residualize T
v_hat = t - m_hat
psi_a = -np.multiply(v_hat, v_hat)
psi_b = np.multiply(v_hat, y - l_hat)
theta_init = -np.mean(psi_b) / np.mean(psi_a)
# get estimate for g0
ml_g.fit(x, y - t*theta_init)
g_hat = ml_g.predict(x)
```


Similarly, when we get our final estimate for \(\theta\)
```
# compute residuals
u_hat = y - g_hat
# v_hat is the residuals from our m0 model
psi_a = -np.multiply(v_hat, v_hat)
psi_b = np.multiply(v_hat, u_hat)
theta_hat = -np.mean(psi_b) / np.mean(psi_a)
psi = psi_a * theta_hat + psi_b
err = theta_hat - theta
J = np.mean(psi_a)
sigma2_hat = 1 / len(y) * np.mean(np.power(psi, 2)) / np.power(J, 2)
err = err/np.sqrt(sigma2_hat)
```


If we plot a similar histogram over 200 simulations, we'll get something like this:
And we have greatly reduced (but not eliminated) our bias!
For this specific data generating process, we now have a way of estimating \(\theta\) without regularization bias! However, I mentioned earlier that we want to be able to estimate more than only this process. In particular, step three involves linear regression, and only works in our partially linear example. How do we generalize the method of estimating \(\theta\)?

The least squares solution for linear relationships essentially finds the parameters for a line that minimizes the error between the predicted points on the line and the observed data. We can generalize this idea as the requirement that a cost (or score) function of our data and true parameters equal zero $$ \psi(W; \theta, \eta) = 0 $$ This equation looks very different but contains a lot of the same players as before:

- \(W\) is the data (\(X\),\(Y\), and \(T\))
- \(\theta\) is the true treatment effect
- \(\psi\) is just some cost function. We are purposely not defining it because we want this to be a general solution, but you can think of it as any kind of error minimization function
- \(\eta\) is called the nuisance parameter, and here contains \(g\) and \(m\)

There are whole branches of mathematics dedicated to solving these types of equations with moment conditions, and there is no single good solution. All the different solutions are called 'DML estimators'. Rather than getting into any specific estimator here, we're just going to trust that they exist, and move on. Whatever package you use to apply the method should give some information on the estimators it implements.
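That said, for scores that are linear in \(\theta\), like the partially linear score used in the code above (\(\psi = \psi_a\theta + \psi_b\)), the moment equation has a simple closed-form solution. A sketch with a hypothetical helper function, checked on synthetic residuals:

```python
import numpy as np

def solve_linear_score(psi_a, psi_b):
    """Solve E[psi_a * theta + psi_b] = 0 for theta (hypothetical helper)."""
    return -np.mean(psi_b) / np.mean(psi_a)

# sanity check on synthetic residuals: u = theta*v + noise
rng = np.random.default_rng(0)
theta_true = -0.4
v = rng.standard_normal(10_000)
u = theta_true * v + 0.1 * rng.standard_normal(10_000)

theta_hat = solve_linear_score(psi_a=-v * v, psi_b=v * u)
```

More complicated scores (for instance, for categorical treatments) need iterative solvers, which is where the estimator machinery in the packages earns its keep.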

We now have a more generalizable set of steps

- Estimate \(T\) from \(X\)
- Estimate \(Y\) from \(X\)
- Solve moment equation to get \(\theta\)

Now, it's time to revisit our error equation in the partially linear case. $$ { \sqrt{n}(\hat{\theta_0} - \theta_0) = (EV^2)^{-1}\frac{1}{\sqrt{n}}\sum_{i \in I}^nU_iV_i + (EV^2)^{-1}\frac{1}{\sqrt{n}}\sum_{i \in I}^n(\hat{m_0}(X_i) - m_0(X_i))(\hat{g_0}(X_i) - g_0(X_i)) + \frac{1}{\sqrt{n}}\sum_{i \in I}^{n}V_i(\hat{g_0}(X_i) - g_0(X_i)) } $$ We've discussed the first two terms already, but I'm revealing the last term we had hidden previously. If any overfitting is present in \(\hat{g_0}\), the estimate will pick up some noise from the noise term \(U\). This will slow down the convergence of this new term to 0.

The solution to this bias is to fit \(g\) and \(m\) on a different set of data than the set used to estimate \(\theta\). Like how cross-validation avoids overfitting during parameter selection, this method (called cross-fitting) avoids overfitting in our estimation of \(\theta\). This changes our DML steps slightly.

- Split the data into \(K\) folds. For each fold:
- Estimate \(T\) from \(X\) using ML model of choice, fit on the other folds
- Estimate \(Y\) from \(X\) using ML model of choice, fit on the other folds
- Solve the moment equation to get \(\theta\) using the held-out fold
- Average the \(\theta\) estimates over all folds

```
from sklearn.model_selection import KFold
# number of splits for cross fitting
nSplit = 2
x, y, t = get_data(n, n_x, theta)
# cross fit
kf = KFold(n_splits=nSplit)
# save theta hats, and some variables for getting variance in theta_hat
theta_hats = []
sigmas = []
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    t_train, t_test = t[train_index], t[test_index]
    ml_l = RandomForestRegressor()
    ml_m = RandomForestRegressor()
    ml_g = RandomForestRegressor()
    # fit nuisance models on the training folds only
    ml_l.fit(x_train, y_train)
    ml_m.fit(x_train, t_train)
    # predict on the held-out fold
    l_hat = ml_l.predict(x_test)
    m_hat = ml_m.predict(x_test)
    # initial guess for theta
    u_hat = y_test - l_hat
    v_hat = t_test - m_hat
    psi_a = -np.multiply(v_hat, v_hat)
    psi_b = np.multiply(v_hat, u_hat)
    theta_init = -np.mean(psi_b) / np.mean(psi_a)
    # get estimate for g0
    ml_g.fit(x_train, y_train - t_train*theta_init)
    g_hat = ml_g.predict(x_test)
    # compute residuals
    u_hat = y_test - g_hat
    psi_a = -np.multiply(v_hat, v_hat)
    psi_b = np.multiply(v_hat, u_hat)
    theta_hat = -np.mean(psi_b) / np.mean(psi_a)
    theta_hats.append(theta_hat)
    psi = psi_a * theta_hat + psi_b
    J = np.mean(psi_a)
    sigma2_hat = 1 / len(y_test) * np.mean(np.power(psi, 2)) / np.power(J, 2)
    sigmas.append(sigma2_hat)
# average over folds, then normalize the error
err = np.mean(theta_hats) - theta
err = err/np.sqrt(np.mean(sigmas))
```

Using this process, we can correct the bias in our estimation.
Now we have a pretty good estimate!
So far, we've gone over what the DML method is, and how it overcomes biases from regularization and overfitting to get good estimates of \(\theta\) without making strong assumptions about the form of the data.
There are two packages (econML and DoubleML) that allow for applications of this method in Python and R. I have an application post that walks through an example using econML. Hopefully this post made the method a little more accessible or helped you assess if this method would be a good fit for your data.
- This blog post on DML
- I liked these slides so much I risked the social humiliation that comes with liking a twitter post from years ago to show my support
- DoubleML github (see this for code to exactly replicate the figures in the paper)

- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2016). Double/Debiased Machine Learning for Treatment and Causal Parameters. http://arxiv.org/abs/1608.00060
- Rose, S., & Rizopoulos, D. (2020). Machine learning for causal inference in Biostatistics. In Biostatistics (Oxford, England) (Vol. 21, Issue 2, pp. 336–338). NLM (Medline). https://doi.org/10.1093/biostatistics/kxz045