I went through the Meta data science interview this spring. For an even deeper dive into that exact gauntlet, you can check out my extended Meta interview diary. I was nervous. I was excited. I was very caffeinated. Here’s my honest take, with real questions I got, the parts I loved, the parts that stung, and what I’d do if I ran it back tomorrow.
For a crowdsourced peek at what other candidates have faced recently, I also skimmed this running list of Meta Data Scientist interview questions on Glassdoor and found it surprisingly on-point with my own experience.
Quick context
- Role: Data Scientist, Product (consumer side)
- Format: Remote screens, then a virtual “onsite” loop
- Rounds: Recruiter screen, hiring manager chat, SQL/stats, product sense, and a final loop
You know what? It felt more like a product workout than a test. That’s good and bad.
How the process flowed for me
- Recruiter screen (30 min): background, timeline, level check
- Hiring manager (45 min): team fit, how I think, a tiny case
- SQL + stats (60 min): live query writing, metrics, test logic
- Product sense (60 min): pick a feature, define success, trade-offs
- Final loop (2 hours): a mix of all, plus behavioral
The recruiter step can be a black box; I’ve chronicled some eyebrow-raising recruiter tales in this separate post.
They were on time. They kept a friendly tone. Still, the pace was quick. I ate cold noodles in the 8-minute break. It tasted like stress and sesame oil.
Real examples from my rounds
1) SQL round: small tables, real stakes
They gave me two tables:
- users(user_id, country, created_at)
- messages(sender_id, receiver_id, sent_at)
Task 1: Find daily active senders (unique sender_id) for the last 7 days.
I wrote something like:
select
  date_trunc('day', sent_at) as day,
  count(distinct sender_id) as daily_active_senders
from messages
where sent_at >= current_date - interval '7' day
group by 1
order by 1;
Then they asked: “How would you handle late events?” I said I’d set a stable lateness window (say, accept events up to 48 hours late), backfill once, and mark the backfill run. Not perfect, but honest.
Task 2: Top 3 countries by week for senders.
I joined messages to users on sender_id, bucketed by week, and used a rank window. I almost tripped on null country; I called it out and filtered the nulls. Calling out pitfalls seemed to help.
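I can’t reproduce my exact SQL from memory (and the round itself was SQL-only), but the logic is easy to restate in pandas if you want to poke at it. The toy frames below just mirror the schemas from the prompt; the values are invented:

import pandas as pd

# Toy frames mirroring the prompt's schemas; values are made up for illustration.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", None, "BR"],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
messages = pd.DataFrame({
    "sender_id": [1, 1, 2, 3],
    "receiver_id": [2, 3, 1, 1],
    "sent_at": pd.to_datetime(["2024-02-05", "2024-02-06", "2024-02-06", "2024-02-12"]),
})

# Join senders to their country; drop the null countries I called out.
sent = messages.merge(users, left_on="sender_id", right_on="user_id", how="inner")
sent = sent.dropna(subset=["country"])

# Weekly grain, then distinct senders per country per week.
sent["week"] = sent["sent_at"].dt.to_period("W").dt.start_time
weekly = (
    sent.groupby(["week", "country"])["sender_id"]
    .nunique()
    .reset_index(name="senders")
)

# Rank within each week and keep the top 3 countries.
weekly["rk"] = weekly.groupby("week")["senders"].rank(method="dense", ascending=False)
top3 = weekly[weekly["rk"] <= 3].sort_values(["week", "rk"])
print(top3)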
What they watched:
- Clean joins (left vs inner)
- Date grain
- Nulls and late data
- Why this metric, not that one
Tiny miss: I used the wrong date function at first (old habit from BigQuery). I corrected it fast and explained the change.
2) Product sense: Should we launch a “typing indicator” tweak in Messenger?
Prompt: A new typing dot shows sooner. Will this help?
I framed it:
- Goal: More good chats, not just more taps
- Primary metric: Reply rate per conversation
- Secondaries: Response time, messages per user per day, session length
- Guardrails: Blocks, reports, send fails, crash rate
I said I’d read the risk as “more pressure and ghosting.” So I’d watch new user cohorts and teen users closely. Different groups feel nudges in different ways. I shared one real story: I once shipped a nudge that looked friendly but made anxious users bounce. That got a nod.
Regional behavior differences matter, too: many Asian chat communities use typing indicators and stickers in slightly different ways than Western audiences.
Decision rule I gave:
- Launch if reply rate +1% or more
- No guardrail worse than −0.5%
- Stable for two full weekly cycles
- No extreme subgroup harm
They pushed: “Why +1%?” I said it’s a guess, tied to revenue and habit strength, but I’d pre-register the threshold and not wiggle it.
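One thing I’d do next time to make that stick: write the rule down as code before the data arrives, so there’s nothing left to wiggle. A toy sketch; the thresholds match what I gave above, but the function and argument names are mine, not anything Meta uses:

# Pre-registered launch rule, written before the experiment starts.
# Thresholds mirror the bullets above; names are illustrative only.
def should_launch(reply_rate_lift, worst_guardrail_move, stable_weeks, subgroup_harm):
    """worst_guardrail_move: most harmful guardrail change, where negative = degradation."""
    return (
        reply_rate_lift >= 0.01             # primary: +1% or better on reply rate
        and worst_guardrail_move >= -0.005  # no guardrail worse than -0.5%
        and stable_weeks >= 2               # stable for two full weekly cycles
        and not subgroup_harm               # no extreme subgroup harm
    )

print(should_launch(0.012, -0.002, 2, False))  # True -> launch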
3) Experiment design: New Stories sticker
We tested a new sticker in Stories.
- Unit: User level, 50/50 split
- Exposure: All Story creators; feed viewers unaffected
- Length: 2–3 weeks (to catch weekends)
- Power: Aim for 80%. Based on past experiments, we expected lifts around +1.5%. I said we’d need hundreds of thousands of creators per arm, gave a rough number, then said, “I’d confirm with our internal calculator.”
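For the curious, this is roughly the napkin math I was doing. The baseline mean and spread are placeholders I made up for illustration, not Meta’s real numbers:

from scipy.stats import norm

# Napkin sample-size estimate for a mean metric (stories posted per creator).
baseline_mean = 2.0   # assumed avg stories per creator over the test window
baseline_sd = 3.0     # assumed std dev; count metrics are skewed, so it's wide
rel_lift = 0.015      # the +1.5% lift we hoped to detect
alpha, power = 0.05, 0.80

delta = rel_lift * baseline_mean      # absolute lift to detect
z_alpha = norm.ppf(1 - alpha / 2)     # ~1.96 for a two-sided test
z_beta = norm.ppf(power)              # ~0.84 for 80% power

# Classic two-sample formula: n per arm ≈ 2 * (z_a + z_b)^2 * sd^2 / delta^2
n_per_arm = 2 * (z_alpha + z_beta) ** 2 * baseline_sd ** 2 / delta ** 2
print(f"~{n_per_arm:,.0f} creators per arm")  # ≈ 157k with these placeholder inputs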
Metrics:
- Primary: Stories posted per creator
- Quality: Completion rate, replays
- Creator stickiness: 7-day active creators
- Guardrails: Report rate, time spent by viewers, crash rate
I also called out interference: if creators post more, viewers’ time spent may shift too. If that spillover is big, I’d consider a geo split or clustering the randomization by friend graph. Not always possible, but at least I named it.
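If we did go the cluster route, the assignment itself is the easy part: hash the cluster (or region) id and give the whole cluster one arm, so connected users never straddle treatment and control. A minimal sketch; the experiment salt and ids are made up:

import hashlib

# Whole clusters (friend-graph communities or geo regions) share an arm.
def arm_for_cluster(cluster_id: str, salt: str = "sticker_test_v1") -> str:
    digest = hashlib.md5(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

print(arm_for_cluster("cluster_1042"), arm_for_cluster("region_br_sp"))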
Stats call:
- Two-sided test
- If the metric is skewed, trim the top 1% or use a bootstrap (quick sketch below)
- Pre-register the plan
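The trim-or-bootstrap line is easier to show than to say. A minimal sketch on fake heavy-tailed data; the lognormal draw is just a stand-in for a skewed count metric, and the sample sizes are tiny compared to a real test:

import numpy as np

rng = np.random.default_rng(7)

# Fake skewed metric for two arms, with a pretend +1.5% lift in treatment.
control = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)
treat = rng.lognormal(mean=0.0, sigma=1.0, size=20_000) * 1.015

# Option 1: cap (winsorize) at the 99th percentile so a few whales don't dominate;
# dropping those rows outright is the stricter "trim".
cap_c, cap_t = np.percentile(control, 99), np.percentile(treat, 99)
trimmed_diff = np.clip(treat, None, cap_t).mean() - np.clip(control, None, cap_c).mean()

# Option 2: bootstrap the difference in means to get a CI without normality math.
boot = []
for _ in range(1_000):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treat, size=treat.size, replace=True)
    boot.append(t.mean() - c.mean())
lo, hi = np.percentile(boot, [2.5, 97.5])

# At this toy scale the CI may well include zero; the point is the method.
print(f"capped diff: {trimmed_diff:.4f}, bootstrap 95% CI: [{lo:.4f}, {hi:.4f}]")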
4) Stats brain-teaser: Significance vs truth
They gave me a tiny case: One arm has 10% click. The other has 10.3%. p = 0.03. Should we ship?
My answer:
- Yes, it’s “significant,” but magnitude matters
- Check power, seasonality, and novelty effects
- Look for p-hacking signs (many peeks, many metrics)
- Check subgroups for harm
- If effect is tiny and costs are real, I’d run a holdout post-launch
They liked that I didn’t chase p-values without context.
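If you want to poke at that toy case, a pooled two-proportion z-test reproduces it. The per-arm sample size is my own back-of-envelope pick, chosen so the prompt’s numbers land near p = 0.03; the takeaway is that “significant” and “only 0.3 points of lift” coexist just fine at this kind of scale:

from math import sqrt
from scipy.stats import norm

# Toy numbers from the prompt: 10.0% vs 10.3% click rate, p ≈ 0.03.
n = 95_000                     # users per arm (my assumption; the prompt didn't say)
p_control, p_treat = 0.100, 0.103

# Pooled two-proportion z-test.
p_pool = (p_control + p_treat) / 2           # equal arms, so a plain average pools fine
se = sqrt(2 * p_pool * (1 - p_pool) / n)
z = (p_treat - p_control) / se
p_value = 2 * norm.sf(abs(z))                # two-sided

lift = (p_treat - p_control) / p_control     # ~3% relative, 0.3 points absolute
print(f"z = {z:.2f}, p = {p_value:.3f}, relative lift = {lift:.1%}")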
5) Behavioral: When I killed my own project
Story:
- Situation: I led a feed tweak that lifted clicks but raised hides by 2%
- Task: Decide fast, with little time
- Action: I paused the ramp, showed the hide spike, and shared 3 follow-ups
- Result: We fixed ranking rules, relaunched later, and hit +0.8% with no harm
I shared how it felt. It stung. But it built trust. I think that mattered more than a win.
What I liked
- Real product talk. Not just theory.
- Friendly interviewers who asked “why” a lot.
- Clear structure. I knew what was next.
- Hands-on SQL that felt close to work.
What bugged me
- Time crunch. Good ideas got cut short.
- One tool quirk (date functions) ate minutes.
- Little space for quick data pulls; it was all whiteboard-ish.
- Fatigue by the last loop. My brain felt like oatmeal.
How I prepped (and what actually helped)
- Daily SQL reps on real-ish tables (users, events, sessions). I used Mode and a small local Postgres.
- Wrote one “metric sheet” per product: Messenger, Feed, Reels. Just basic funnels and guardrails.
- A/B test drills: I used a simple power calculator and ran toy sims in Python (a cut-down version is sketched right after this list). Even napkin math helps.
- Product teardowns: 15 minutes a night. What would I change? Why?
- Mock chats with a friend who kept asking “so what?”
- Skimmed the Meta data scientist interview guide, which maps out each round and includes sample questions.
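Here’s roughly what those toy sims looked like, rebuilt from memory rather than copied from my notes: simulate a pile of fake experiments with a known true lift, test each one, and the share that comes back significant is your empirical power. The inputs are placeholders, not real Meta numbers.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

def empirical_power(n_per_arm, base_mean=2.0, sd=3.0, rel_lift=0.015,
                    alpha=0.05, n_sims=200):
    """Share of simulated A/B tests that detect a known true lift."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(base_mean, sd, size=n_per_arm)
        treat = rng.normal(base_mean * (1 + rel_lift), sd, size=n_per_arm)
        _, p = ttest_ind(treat, control)
        hits += p < alpha
    return hits / n_sims

# Power climbs toward ~0.8 near the ~157k/arm napkin estimate with these same inputs.
# n_sims is kept small so this runs in seconds; crank it up for smoother numbers.
for n in (50_000, 150_000, 300_000):
    print(n, round(empirical_power(n), 2))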
I also browsed a handful of concise interview breakdowns on vhfdx.net to sanity-check my approach against other candidates’ experiences. A few that stood out were a brutally honest look at the Costco data science internship and a first-hand review of a New York data science internship.
Big help: Saying my assumptions out loud. They care more about your story than your script.
What I’d do differently next time
- Set a stopwatch for every answer: 2–3 minutes per part
- Lead with the decision rule, then the details
- Keep a tiny checklist on a sticky note: metric, guardrail, power, risks
- Practice a few “late data” takes and time zone traps
- Eat a real snack between rounds (not cold noodles)
Scorecard
- Depth of product talk: 4.5/5
- SQL realism: 4/5
- Fairness and tone: 5/5
- Time to think: 3/5
Overall: 4.3/5. Hard, fair, and kind of fun