When Engagement Lifts Mislead: Did the New Feed Improve Retention?

Why the initial engagement signal mattered, and how causal correction changed the decision.

Product Analytics · Causal Inference · Propensity Score Matching · Logistic Regression

TL;DR

A new feed iteration appeared to drive a ~110–140% lift in engagement. After correcting for selection bias using propensity score matching, I found no meaningful improvement in 7-day retention. The lift was driven by who received the feature, not the feature itself.

Final decision: do not ship the feature.

The Decision That Triggered This Analysis

This case study evaluates a feed-ranking change in a generic consumer social app. The feed is a core surface for engagement and long-term retention, making even small ranking changes high-risk.

The feature introduced more aggressive personalization based on predicted relevance. Exposure was not randomized, making this an observational analysis rather than a clean A/B test.

Who We Were Comparing

Users clustered naturally into three behavioral segments:

  • Low-engagement users: infrequent usage, low baseline interaction
  • Normal users: moderate engagement, the majority of the base
  • Power users: highly engaged, critical to platform health

These segments matter because engagement propensity differs sharply across them.

Figure: User Segment Distribution

Why Standard Metrics Were Misleading

Engagement spikes are diagnostic, not decisive. Success criteria were defined upfront.

North Star Metric

7-day retention

Best proxy for habit formation and long-term value.

Diagnostic Metrics
  • Cards viewed per session
  • Bounce rate

Useful for understanding behavior, not for making ship decisions.

Decision rule

The decision hinged on whether users returned, not whether they interacted more in a single session.

The Data Behind the Decision

The analysis relied on two complementary datasets that together made it possible to separate user quality from feature impact.

Users Table

One row per user. Used to model selection bias and long-term outcomes.

  • User segment (low / normal / power)
  • Baseline engagement score (pre-exposure)
  • Feature exposure flag
  • 7-day retention outcome

Events Table

Session-level behavioral data. Used to measure short-term engagement.

  • Session start / end events
  • Card view events per session
  • Feature flag at time of interaction

Unit of analysis

Users for retention analysis; sessions for engagement diagnostics.

Time window

Baseline behavior measured pre-exposure, outcomes tracked over 7 days post-exposure.
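A minimal sketch of that windowing, assuming a simplified event schema (the field names and the retention rule "any activity in the outcome window" are illustrative, not the production tables):

```python
from datetime import datetime, timedelta

def split_windows(events, exposure_ts, days=7):
    """Assign events to a pre-exposure baseline window or a
    7-day post-exposure outcome window, per the design above."""
    cutoff = exposure_ts + timedelta(days=days)
    baseline = [e for e in events if e["ts"] < exposure_ts]
    outcome = [e for e in events if exposure_ts <= e["ts"] < cutoff]
    return baseline, outcome

# Hypothetical events for one user.
exposure = datetime(2024, 3, 1)
events = [
    {"ts": datetime(2024, 2, 27), "type": "card_view"},  # baseline
    {"ts": datetime(2024, 3, 3), "type": "card_view"},   # outcome window
    {"ts": datetime(2024, 3, 9), "type": "card_view"},   # after day 7: excluded
]
baseline, outcome = split_windows(events, exposure)
retained_7d = len(outcome) > 0  # any activity in the 7-day window
print(len(baseline), len(outcome), retained_7d)  # → 1 1 True
```

Keeping the baseline strictly pre-exposure is what lets the same score later serve as a confounder in the propensity model.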

Why this matters

Separating baseline behavior from outcomes makes causal adjustment possible.

Key data signal

Users exposed to the new feature had systematically higher baseline engagement, indicating strong selection bias in feature exposure.

This meant naive engagement comparisons primarily reflected who received the feature, not what the feature caused.
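One way to quantify that imbalance is a standardized mean difference (SMD) on the baseline engagement score, a common balance diagnostic in matching studies. The scores below are made up for illustration, not the real data:

```python
from statistics import mean, stdev

def smd(treated, control):
    """Standardized mean difference; values well above ~0.1
    are usually read as meaningful imbalance."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(control) ** 2) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd

# Illustrative baseline engagement scores:
exposed = [4.1, 5.0, 4.6, 5.4, 4.8]
control = [2.0, 2.6, 2.2, 3.0, 2.4]
value = round(smd(exposed, control), 2)
print(value)  # → 5.37
```

An SMD this far above the conventional 0.1 threshold is exactly the signature of exposure tracking baseline behavior rather than random assignment.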

The Naive Comparison (Why It Deceived Us)

A dashboard view suggested a dramatic engagement win.

Average cards viewed per session:

  • Control users: ~2.3
  • New feature users: ~4.9
  • Observed lift: ~110–140%

Why this looked convincing

At face value, this appears to be a major engagement win. A standard dashboard would strongly suggest shipping.

Figure: Naive Engagement Comparison

Takeaway: the lift conflates user quality with feature impact.
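For reference, the naive lift implied by the reported per-session averages (dashboard point estimates, not raw data):

```python
# Naive lift computed straight off the dashboard numbers above.
control_cards = 2.3
treated_cards = 4.9
lift = (treated_cards - control_cards) / control_cards
print(f"{lift:.0%}")  # → 113%
```

The point estimate lands inside the reported ~110–140% range; the arithmetic is fine, the comparison group is not.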

Why the Naive Comparison Failed

Core issue

Exposure to the new feed was non-random and strongly correlated with baseline engagement.

The comparison measured high-engagement users vs low-engagement users, not the causal effect of the feature.

Analogy: comparing basketball players to accountants and concluding basketball makes people taller.

How I Isolated the Causal Effect

I applied propensity score matching (PSM) to approximate a randomized experiment:

  • Non-random treatment: exposure depended on prior behavior, so a direct comparison of exposed and unexposed users is biased.
  • Observable confounders: user segment and baseline engagement score.
  • RCT approximation: model each user's probability of exposure (the propensity score) with logistic regression, then match treated users to control users with similar scores and compare outcomes within matched pairs.
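The pipeline can be sketched end to end on synthetic data. Everything below (the data generator, the gradient-descent fit, the greedy caliper matching) is an illustrative assumption, not the production implementation; the synthetic world is built so the feature has no true effect on retention, mirroring the finding:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# --- Toy users with built-in selection bias: higher baseline engagement
# makes exposure AND retention more likely, but exposure itself does nothing.
users = []
for _ in range(400):
    baseline = random.gauss(3.0, 1.0)
    exposed = random.random() < sigmoid(2 * (baseline - 3.0))
    retained = random.random() < sigmoid(baseline - 3.0)
    users.append({"baseline": baseline, "exposed": exposed, "retained": retained})

# --- Step 1: logistic regression for P(exposed | baseline) via gradient descent.
w, b = 0.0, 0.0
for _ in range(1000):
    gw = gb = 0.0
    for u in users:
        err = sigmoid(w * u["baseline"] + b) - u["exposed"]  # p - y
        gw += err * u["baseline"]
        gb += err
    w -= 0.01 * gw / len(users)
    b -= 0.01 * gb / len(users)

for u in users:
    u["ps"] = sigmoid(w * u["baseline"] + b)  # propensity score

# --- Step 2: greedy 1:1 nearest-neighbor matching with a caliper.
treated = [u for u in users if u["exposed"]]
pool = [u for u in users if not u["exposed"]]
caliper = 0.05
pairs = []
for t in treated:
    if not pool:
        break
    nearest = min(pool, key=lambda c: abs(c["ps"] - t["ps"]))
    if abs(nearest["ps"] - t["ps"]) <= caliper:
        pairs.append((t, nearest))
        pool.remove(nearest)

# --- Step 3: average retention difference across matched pairs (the ATT).
att = sum(t["retained"] - c["retained"] for t, c in pairs) / len(pairs)
print(f"matched pairs: {len(pairs)}, ATT: {att:+.3f}")
```

Because the propensity score here is monotone in the single confounder, matching on it is effectively matching on baseline engagement; with segment added, the score compresses both confounders into one matching dimension, which is the practical appeal of PSM.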

What Changed After Correcting for Bias

Matching shifted the comparison onto like-for-like users.

7-day retention after adjustment:

  • New feature retention: ~71%
  • Control retention: ~73%
  • Estimated treatment effect: ≈ 0

What this tells us

Once the comparison is restricted to comparable users, the apparent engagement win disappears: the feature does not improve 7-day retention.

Figure: 7-Day Retention After Propensity Score Matching

Retention remains unchanged after causal adjustment.
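To gauge whether a ~2 pp gap like this is distinguishable from noise, a two-proportion z-test is a quick check. The matched sample sizes below are hypothetical, since they are not reported here:

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """z statistic for the difference of two proportions (pooled SE)."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# ~71% vs ~73% retention; n = 1000 per matched arm is an assumed size.
z = two_prop_z(0.71, 1000, 0.73, 1000)
print(round(z, 2))  # → -1.0
```

At |z| ≈ 1.0, well below the 1.96 threshold, a gap of this size sits inside ordinary sampling noise at that sample size, consistent with calling the treatment effect ≈ 0.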

Recommendation

Do not ship the feature. It failed to improve the north-star metric (7-day retention), and a feed-ranking change of this scope carries risk without a demonstrated benefit.