Weekly Report [1]

Weekly Report [1]

Jinning, 07/08/2018

Arrive at Cornell on 07/03

Cornell and Ithaca town are so charming. Blue sky and beautiful buildings here really attracted me.

Book Reading

Causal Inference for Statistics Social and Biomedical Sciences

Chapter 1
  • concepts:
  • action(manipulation, treatment, intervention)
  • unit: A unit here can be a physical object, a firm, an individual person, or collection of objects or persons, such as a classroom or a market, at a particular point in time.
  • potential outcomes: Given a unit and a set of actions, we associate each action-unit pair with a potential outcome
  • causal effect: involves the comparison of these potential outcomes, one realized (and perhaps, though not necessarily, observed), and the others not realized and therefore not observable
  • alternative action: not chosen action.
  • counterfactual value: the potential outcome corresponding to the treatment not applied.

  • Assumtions:
  • SUTVA: The potential outcomes for any unit do not vary with the treatments assigned to other units, and, for each unit, there are no different forms or versions of each treatment level, which lead to different potential outcomes.
  • No Inference
  • No Hidden Variations of treatments
  • assignment mechanism
  • missing data problem: given any treatment assigned to an individual unit, the poten- tial outcome associated with any alternate treatment is missing.
  • covariates: presence of unit-specific background attributes, also referred to as pre-treatment variables, or covariates can assist in making these predictions.

Paper Reading

[1] Deep learning with logged bandit feedback

@article{joachims2018deep, title={Deep learning with logged bandit feedback}, author={Joachims, Thorsten and Swaminathan, Adith and de Rijke, Maarten}, year={2018} }

  • log:
  • input: e.g. features describing the user, banner ad, and page.
  • action: the action that was taken by the system e.g. a specific banner ad that was placed.
  • feedback: the feedback furnished by the user. e.g. clicks on the ad, or monetary payoff.

The feedback for all other actions the system could have taken is typically not known.

Opens a new and intriguing pathway for acquiring knowledge at unprecedented scale, giving deep neural networks access to this abundant and ubiquitous type of data.

Similarly, it enables the application of deep learning even in domains where manually labeling full-information feedback is not viable.

batch learning from bandit feedback (BLBF) does not require the ability to make interactive interventions.

View of a deep neural network as a stochastic policy.

Allow stochastic gradient descent (SGD) optimization.

achieve the same classification performance given sufficient amounts of contextual bandit feedback as ResNet trained with cross-entropy on conventionally (full-information)

Success with off-policy variants of the REINFORCE

  • loss \(\delta (x_i, y)\) arbitrary function
  • policy \(\pi\)
  • input \(x\)
  • action \(y\)
  • contexts from fixed & unknown distribution \(Pr(X)\)
  • network \(\pi_\omega (Y|x)\)
  • propensity \(p_i\equiv \pi_0(y_i|x_i)\), received loss \(\delta \equiv \delta(x_i, y_i)\)

Dataset: \[ D=[(x_1, y_1, p_1, \delta_1),\cdots, (x_n, y_n, p_n, \delta_n)] \]

IPS estimator: \[ \hat{R}_{IPS}=\frac{1}{n}\sum_{i=1}^{n}\delta_i\frac{\pi_\omega(y_i|x_i)}{\pi_0(y_i|x_i)} \] Problem: propensity overfitting. Propensity overfitting is linked to the lack of equivariance of the IPS estimator.

Self-normalized IPS estimator. equivariant and substantially lower variance: \[ \hat{R}_{SNIPS}=\frac{\frac{1}{n}\sum_{i=1}^{n}\delta_i\frac{\pi_\omega(y_i|x_i)}{\pi_0(y_i|x_i)}}{\frac{1}{n}\sum_{i=1}^{n}\frac{\pi_\omega(y_i|x_i)}{\pi_0(y_i|x_i)}} \]


Reformulate the problem into a series of constrained optimization problems. \[ \hat{w}=\arg \min_{w\in R^N}\frac{1}{n}\sum_{i=1}^{n}\delta_i\frac{\pi_\omega(y_i|x_i)}{\pi_0(y_i|x_i)}~subject~to~\frac{1}{n}\sum_{i=1}^{n}\frac{\pi_\omega(y_i|x_i)}{\pi_0(y_i|x_i)} \] Lagrangian: \[ L(w, \lambda)=\frac{1}{n}\sum_{i=1}^{n}\frac{(\delta_i-\lambda)\pi_\omega(y_i|x_i)}{\pi_0(y_i|x_i)} \] Constrained optimization problems is transformed to: \[ \hat{w}_j=\arg \min_{w\in R^N}L(w, \lambda_j) \]


Use a hand-coded logging policy that achieves about 49% error rate on the training data, which is substantially worse than what we hope to achieve after learning.

Bandit-ResNet converges to the skyline performance given enough bandit feedback training data, providing strong evidence that our training objective and method can effectively extract the available information provided in the bandit feedback

The SNIPS estimates in the right-hand plot Figure roughly reflects this optimal range, given empirical support for the SNIPS estimator.

[2] Learning from Logged Bandit Feedback of Multiple Loggers

We only observe the outcomes of the deployed action, but not for the alternative ads that could have been presented instead.

This paper makes the case that naively applying CRM to the setting where the log data comes from multiple policies can be highly sub-optimal.

  • Weighted inverse propensity score WIPS (Agarwal et al., 2017), is proposed by re-weighting data from different policies by their ”divergence” from the target policy.
  • give the smallest variance among all weighted estimators.
  • A major challenge in extending this approach to learning is that the weights depend on the target policy
  • Another challenge is that the optimal weights depend on the divergences between the target policy and the historical policies,

This paper: Propose a better empirical divergence estimator using control variates.


  • re-weighted loss \[ U_j^i(\pi)=\delta_j^i\frac{\pi(y_j^i|x_j^i)}{\pi_i(y_j^i|x_j^i)} \]
  • mean loss \[ U^i(\pi)=\frac{1}{n_i}\sum_{i=1}^{n}U_j^i(\pi) \]
  • weight vector \(p\) \[ p_i=\frac{n_i}{\sigma_\delta^2(\pi\|\pi_i)\sum_{j=1}^m\frac{n_j}{\sigma_\delta^2(\pi\|\pi_j)}} \]
  • WIPS estimator \[ \hat{R}(\pi)=\sum_{i=1}^mp^TU(\pi) \]
  • divergence \[ \sigma_\delta^2(\pi\|\pi_i)=\frac{1}{n_i-1}\sum_{j=1}^{n_i}(U_j^i(\pi)-U^i(\pi))^2 \]

Better Divergence Estimator

  • control variate \[ S^i(\pi)=\frac{1}{n_i}\sum_{j=1}^{n_i}\frac{\pi(y_j^i|x_j^i)}{\pi_i(y_j^i|x_j^i)} \]
  • overall mean \[ \bar{U}(\pi)=\frac{1}{\sum_{i=1}^{m}n_i}\sum_{i=1}^{m}\sum_{j=1}^{n_i}U_j^i(\pi) \]
  • self-normalized divergence estimator \[ \sigma_\delta^2(\pi\|\pi_i)=\frac{1}{n_i-1}\sum_{j=1}^{n_i}(\frac{U_j^i(\pi)}{S_i(\pi)}-\bar{U}(\pi))^2 \]

Intuitions: Using the overall \(\bar{U}(\pi)\) instead of the IPS estimate for each specific logger πi utilizes information from all the data and provides a more informed estimate.

Weighted Counterfactual Risk Minimization Principle

The learning principle minimizes the WIPS estimator and its empirical standard deviation at the same time. \[ \pi^{WCRM}=\arg \min_{\pi, p}p^TU^M(\pi)+\lambda\sqrt{\frac{\hat{Var}(p_iU_j^i(\pi))}{\sum_{k=1}^{m}n_k}} \] subject to \[ p=\arg\min_{p}Var_{D}(\hat{R}(\pi)) \]


  • We chose the multi-label datasets from LibSVM for the experiments.
  • we collect bandit dataset by simulating \(y_i\) and report the loss \(\delta\) associated with this sample by the number of correctly predicted labels compared with the ground truth \(y_i^*\)

Two policy to obtain dataset - Use two logging policies in the following experiment - trained using a CRF on these 20% of data - trained on the same data with stochastic multiplier to be \(1\)

Baseline: CRM Ours: WCRM For both datasets, WCRM maintains good performance even as the quality of logger 1 is degraded, while the naive CRM approach is severly affected.

Paper Survey

@inproceedings{wang2018position, title={Position bias estimation for unbiased learning to rank in personal search}, author={Wang, Xuanhui and Golbandi, Nadav and Bendersky, Michael and Metzler, Donald and Najork, Marc}, booktitle={Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining}, pages={610–618}, year={2018}, organization={ACM} }

  • Learning to Rank
  • Personal Search
  • Compare different schemes for result randomization (i.e., RandTopN and RandPair) and show their negative effect in personal search.
  • Study how to infer such bias from regular click data without relying on randomization.
  • Propose a regression-based Expectation-Maximization (EM) algorithm that is based on a position bias click model and that can handle highly sparse clicks in personal search.
  • Evaluate EM algorithm and the extracted bias in the learning-to-rank setting.
  • Evaluation: weighted MRR and average log likelihood

[2] Risk-Averse Trees for Learning from Logged Bandit Feedback

@inproceedings{trovo2017risk, title={Risk-averse trees for learning from logged bandit feedback}, author={Trov{`o}, Francesco and Paladino, Stefano and Simone, Paolo and Restelli, Marcello and Gatti, Nicola}, booktitle={Neural Networks (IJCNN), 2017 International Joint Conference on}, pages={976–983}, year={2017}, organization={IEEE} }

  • RADT is based on a risk-averse learning method which exploits the joint use of regression trees and statistical confidence bounds
  • RADT generates policies aiming to maximize a lower bound on the expected reward and provides a clear characterization of those features in the context that influence the process the most.
  • RADT algorithm which greedily learns a binary tree to approximate the optimal mapping on the basis of the available dataset and maximizes statistical lower bounds over the expected profit.

[3] Learning to Rank with Selection Bias in Personal Search.

@inproceedings{wang2016learning, title={Learning to rank with selection bias in personal search}, author={Wang, Xuanhui and Bendersky, Michael and Metzler, Donald and Najork, Marc}, booktitle={Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval}, pages={115–124}, year={2016}, organization={ACM} }

  • Study the problem of how to leverage sparse click data in personal search and introduce a novel selection bias problem and address it in the learning-to-rank framework
  • Proposes a few bias estimation methods, including a novel query-dependent one that captures queries with similar results and can successfully deal with sparse data.