Weekly Report [4]

Jinning, 08/07/2018

[Project Github]

Clipping experiment on the large dataset

Got the results of clipping with $\min\{\frac{1}{p}, c\}$. The model is trained and tested on the large dataset.

Loss function used (weighted BCE loss, with the inverse propensity clipped at $c$):

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\min\left\{\frac{1}{p_i},\, c\right\}\left[y_i\log\sigma(x_i)+(1-y_i)\log\left(1-\sigma(x_i)\right)\right]$$

$c \in \{1, 2, 5, 10, 15, 20, 30, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550\}$.

In the training set, $\frac{1}{p}$ ranges over $[1, 2917566]$, with an average of $114.57$.
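For concreteness, here is a minimal PyTorch sketch of this clipped weighted BCE loss (the tensor names are mine, not from the actual training code):

```python
import torch
import torch.nn.functional as F

def clipped_weighted_bce(logits, labels, inv_propensity, c):
    """Weighted BCE with the inverse propensity clipped at c, i.e. the
    batch mean of min{1/p, c} * BCE(y, sigmoid(logit))."""
    # min{1/p, c}: cap the importance weight to limit its variance
    weight = torch.clamp(inv_propensity, max=c)
    # per-impression binary cross-entropy on the click label y
    bce = F.binary_cross_entropy_with_logits(logits, labels.float(),
                                             reduction="none")
    return (weight * bce).mean()
```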

Distribution of inverse propensity

To survey the distribution of $\frac{1}{p}$, the following output shows, for each threshold $c$, how many values of $\frac{1}{p}$ are larger than $c$ (i.e., how many would be clipped):

[100.00%] 14175476 in 14175476 larger than 1
[96.52%] 13682662 in 14175476 larger than 5
[95.78%] 13577934 in 14175476 larger than 10
[56.56%] 8018069 in 14175476 larger than 20
[45.15%] 6400434 in 14175476 larger than 30
[32.50%] 4606483 in 14175476 larger than 50
[18.09%] 2564603 in 14175476 larger than 100
[11.79%] 1671481 in 14175476 larger than 150
[8.55%] 1212462 in 14175476 larger than 200
[6.62%] 938342 in 14175476 larger than 250
[5.35%] 758935 in 14175476 larger than 300
[4.45%] 631364 in 14175476 larger than 350
[3.80%] 538490 in 14175476 larger than 400
[3.30%] 467660 in 14175476 larger than 450
[2.91%] 412324 in 14175476 larger than 500
[2.59%] 367630 in 14175476 larger than 550
[1.26%] 178423 in 14175476 larger than 1000
[0.16%] 23365 in 14175476 larger than 5000
[0.07%] 9505 in 14175476 larger than 10000
...
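The survey above can be reproduced with a short numpy snippet along these lines (assuming `inv_p` holds all the inverse propensities from the training set):

```python
import numpy as np

def clip_survey(inv_p: np.ndarray, thresholds) -> None:
    """Print, for each threshold c, how many inverse propensities exceed it."""
    n = len(inv_p)
    for c in thresholds:
        count = int((inv_p > c).sum())
        print(f"[{100.0 * count / n:.2f}%] {count} in {n} larger than {c}")

# e.g. clip_survey(inv_p, [1, 5, 10, 20, 30, 50, 100, 150, 200, 250,
#                          300, 350, 400, 450, 500, 550, 1000, 5000, 10000])
```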

The curves of IPS and of the standard deviation of IPS versus $c$:

IPS Curve:

[See Large Figure]

IPS-Std Curve:

[See Large Figure]

Discussions

Not a coincidence

I notice that when $30<c<50$, the IPS is the highest (over $60$). However, the standard deviation is also quite high (over $10$). I trained the network again with $c=50$, and the IPS was still over $60$, so I think the local peak at $30<c<50$ is not a coincidence.

Why is the standard deviation so high?

In the evaluation process, IPS is calculated by the standard inverse-propensity estimator

$$\text{IPS} = \frac{1}{n}\sum_{i=1}^{n}\frac{\pi_w(\hat{x}_i)}{p_i}\,y_i,$$

where $p_i$ is the logged propensity and $y_i$ is the observed click. In other words, we calculate $\pi_w=\frac{score(\hat{x})}{\sum_i score(x_i)}$.

Assume the output of our network is $\sigma$; the evaluation program then computes the score as $score=e^{\sigma-\max(\sigma)}$, where the max is taken over the candidates in the same set.

The standard deviation is measured by the empirical standard deviation of the per-impression terms of the same estimator:

$$\text{Std} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{\pi_w(\hat{x}_i)}{p_i}\,y_i-\text{IPS}\right)^2}$$

Note that we apply a post-processing trick: $\sigma' = \frac{850100}{1+e^{-\sigma+1.1875}}$.
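To make the effect of this trick concrete, here is a minimal numpy sketch of the evaluation path as I understand it (the function name `pi_w` and the array layout are mine):

```python
import numpy as np

def pi_w(sigma: np.ndarray, post_process: bool = True) -> np.ndarray:
    """Normalized propensities pi_w = score / sum(score) for one candidate set.

    sigma: raw network outputs for the candidates, shape (n_candidates,)
    """
    if post_process:
        # post-processing trick: sigma' = 850100 / (1 + e^{-sigma + 1.1875})
        sigma = 850100.0 / (1.0 + np.exp(-sigma + 1.1875))
    # evaluation program's scoring rule: score = e^{sigma - max(sigma)}
    score = np.exp(sigma - sigma.max())
    return score / score.sum()
```

With the raw outputs of id=896678244 below, `pi_w(sigma, post_process=False)` is nearly uniform (every entry close to $\frac{1}{11}$), while `pi_w(sigma, post_process=True)` puts essentially all mass on candidate 7.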


Before post-processing, the raw outputs $\sigma$:

896678244; 0:0.011302, 1:0.00727101, 2:0.0111319, 3:0.000752336, 4:0.00235881, 5:0.0131616, 6:0.00344201, 7:0.0268872, 8:0.0108119, 9:0.022044, 10:0.0268872

After post-processing, $\sigma'$:

896678244; 0:200400, 1:199783, 2:200374, 3:198788, 4:199033, 5:200685, 6:199198, 7:202811, 8:200325, 9:202049, 10:202796

Before post-processing, the score vector $e^{\sigma-\max(\sigma)}$ (the numerator of $\pi_w=\frac{score(\hat{x})}{\sum score(x_i)}$):

896678244; 0:0.984536, 1:0.980575, 2:0.984368, 3:0.974204, 4:0.97577, 5:0.986368, 6:0.976827, 7:1, 8:0.984053, 9:0.995168, 10:1

After post-processing, the score vector $e^{\sigma'-\max(\sigma')}$:

896678244; 0:0, 1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:1, 8:0, 9:0, 10:3.05902e-07

So our post-processing actually makes the policy stricter. For example, as shown above, after post-processing, id=896678244 gets the score vector $[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3.05902\times 10^{-7}]$. So if candidate 7 is not the selected one, $\pi_w$ will be (almost) $0$; otherwise $\pi_w$ will be (almost) $1$.

Without post-processing, the score vector is $[0.985, 0.981, 0.984, 0.974, 0.976, 0.986, 0.977, 1, 0.984, 0.995, 1]$, so $\pi_w$ will be almost $\frac{1}{11}$ no matter which candidate is selected.

So it is easy to see why the standard deviation is much higher after post-processing.

It is also clear that post-processing is necessary: without it, $\pi_w$ will always be close to $\frac{1}{11}$, no matter how good the model is.

I think this is a BUG in the evaluation program. It should not use $score=e^{\sigma-\max(\sigma)}$ but something like $score=e^{\frac{\sigma-\max(\sigma)}{\sum_i \sigma_i}}$, which would not squash the differences between candidates.
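As a toy check, here is a comparison of the two scoring rules on the raw outputs of id=896678244 above (this is only my reading of the proposed fix):

```python
import numpy as np

# raw network outputs sigma for id=896678244, copied from the table above
sigma = np.array([0.011302, 0.00727101, 0.0111319, 0.000752336, 0.00235881,
                  0.0131616, 0.00344201, 0.0268872, 0.0108119, 0.022044,
                  0.0268872])

current = np.exp(sigma - sigma.max())                    # e^{sigma - max(sigma)}
proposed = np.exp((sigma - sigma.max()) / sigma.sum())   # e^{(sigma - max) / sum}

print(current / current.sum())    # nearly uniform: every entry close to 1/11
print(proposed / proposed.sum())  # gaps between candidates widened
```

Since $\sum_i \sigma_i \approx 0.136 < 1$ here, dividing by the sum amplifies the differences between candidates instead of letting them vanish in the exponential.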

Question: Why does the standard deviation vary with the clipping value? Why does a higher IPS often come with a higher standard deviation?


Review what I am doing

What I am doing is training a network $\sigma$ to minimize the loss function (if clipping is applied):

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\min\left\{\frac{1}{p_i},\, c\right\}\left[y_i\log\sigma(x_i)+(1-y_i)\log\left(1-\sigma(x_i)\right)\right]$$

An important point is that $y$ indicates whether the ad was clicked, not whether the ad was chosen for display. This is the difference between banditNet and my net.

Then we use our trained network $\sigma_{\bar{w}}$ to estimate the click probability of a given ad, and use this probability as the propensity to display the ad.

This is based on the assumption that $\pi_0$ always chooses ads with high CTR (reward) to display.

If this assumption is true (and of course it is), then the IPS

$$\text{IPS}(\pi_w) = \frac{1}{n}\sum_{i=1}^{n}\frac{\pi_w(\hat{x}_i)}{p_i}\,y_i$$

will be maximized. This is how my net works.

Question: Why do we need to weight our BCE loss function by $\min\{\frac{1}{p}, c\}$? Why does this contribute to CTR prediction?

In the banditNet setting, the network directly computes the propensity $\pi_w$ for each candidate, and then directly maximizes the IPS/SNIPS objective or minimizes its Lagrangian.
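For contrast with my loss above, here is a minimal sketch of a banditNet-style objective (my paraphrase of the idea, with the SNIPS constraint reduced to a fixed reward translation $\lambda$; not the paper's exact implementation):

```python
import torch

def banditnet_loss(log_pi_w, logged_propensity, reward, lam, c):
    """BanditNet-style objective: maximize the clipped IPS of translated rewards.

    log_pi_w:          log pi_w(logged action | context) from the network
    logged_propensity: pi_0 probability of the logged action
    reward:            observed reward (e.g. click = 1, no click = 0)
    lam:               Lagrange multiplier / baseline from the SNIPS constraint
    c:                 clipping constant for the importance weight
    """
    # importance weight pi_w / pi_0, clipped at c as in the report
    ratio = torch.clamp(torch.exp(log_pi_w) / logged_propensity, max=c)
    # minimize the negative of the translated IPS estimate
    return -((reward - lam) * ratio).mean()
```

The key difference from my setup: here the network's output is itself the display propensity $\pi_w$, whereas I train a click predictor and reuse its output as the propensity.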

What’s the best clipping value?

According to the survey and curves above, the best clipping value is around $30$ ~ $50$; there, the fraction of clipped propensities is about $30\%$ ~ $45\%$.

Question: Will $30\%$ ~ $45\%$ always be the best fraction? How can we find the relationship between the best clipping value and the distribution of the propensities?

Question: Is it true that the more propensities are clipped, the larger the bias and the smaller the variance? What is the relation between the bias-variance tradeoff and the clipping method?

Question: Is there a better way to find the best clipping value than grid search?