I went through the Meta data science interview this spring. For an even deeper dive into that exact gauntlet, you can check out my extended Meta interview diary. I was nervous. I was excited. I was very caffeinated. Here’s my honest take, with real questions I got, the parts I loved, the parts that stung, and what I’d do if I ran it back tomorrow.
For a crowdsourced peek at what other candidates have faced recently, I also skimmed this running list of Meta Data Scientist interview questions on Glassdoor and found it surprisingly on-point with my own experience.
Quick context
- Role: Data Scientist, Product (consumer side)
- Format: Remote screens, then a virtual “onsite” loop
- Rounds: Recruiter screen, hiring manager chat, SQL/stats, product sense, and a final loop
You know what? It felt more like a product workout than a test. That’s good and bad.
How the process flowed for me
- Recruiter screen (30 min): background, timeline, level check
- Hiring manager (45 min): team fit, how I think, a tiny case
- SQL + stats (60 min): live query writing, metrics, test logic
- Product sense (60 min): pick a feature, define success, trade-offs
- Final loop (2 hours): a mix of all, plus behavioral
The recruiter step can be a black box; I’ve chronicled some eyebrow-raising recruiter tales in this separate post.
They were on time. They kept a friendly tone. Still, the pace was quick. I ate cold noodles in the 8-minute break. It tasted like stress and sesame oil.
Real examples from my rounds
1) SQL round: small tables, real stakes
They gave me two tables:
- users(user_id, country, created_at)
- messages(sender_id, receiver_id, sent_at)
Task 1: Find daily active senders (unique sender_id) for the last 7 days.
I wrote something like:
select
  date_trunc('day', sent_at) as day,
  count(distinct sender_id) as daily_active_senders
from messages
where sent_at >= current_date - interval '7' day
group by 1
order by 1;
Then they asked: “How would you handle late events?” I said I’d set a stable lateness window (say, accept events up to 48 hours late), backfill once, and mark the backfill run. Not perfect, but honest.
Task 2: Top 3 countries by week for senders.
I joined messages to users on sender_id, bucketed by week, and used a rank window. I almost tripped on null country; I called it out and filtered the nulls. Calling out pitfalls seemed to help.
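I can’t reproduce my exact SQL from memory (and the round itself was SQL-only), but the logic is easy to restate in pandas if you want to poke at it. The toy frames below just mirror the schemas from the prompt; the values are invented:

import pandas as pd

# Toy frames mirroring the prompt's schemas; values are made up for illustration.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", None, "BR"],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
messages = pd.DataFrame({
    "sender_id": [1, 1, 2, 3],
    "receiver_id": [2, 3, 1, 1],
    "sent_at": pd.to_datetime(["2024-02-05", "2024-02-06", "2024-02-06", "2024-02-12"]),
})

# Join senders to their country; drop the null countries I called out.
sent = messages.merge(users, left_on="sender_id", right_on="user_id", how="inner")
sent = sent.dropna(subset=["country"])

# Weekly grain, then distinct senders per country per week.
sent["week"] = sent["sent_at"].dt.to_period("W").dt.start_time
weekly = (
    sent.groupby(["week", "country"])["sender_id"]
    .nunique()
    .reset_index(name="senders")
)

# Rank within each week and keep the top 3 countries.
weekly["rk"] = weekly.groupby("week")["senders"].rank(method="dense", ascending=False)
top3 = weekly[weekly["rk"] <= 3].sort_values(["week", "rk"])
print(top3)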
What they watched:
- Clean joins (left vs inner)
- Date grain
- Nulls and late data
- Why this metric, not that one
Tiny miss: I used the wrong date function at first (old habit from BigQuery). I corrected it fast and explained the change.
2) Product sense: Should we launch a “typing indicator” tweak in Messenger?
Prompt: A new typing dot shows sooner. Will this help?
I framed it:
- Goal: More good chats, not just more taps
- Primary metric: Reply rate per conversation
- Secondaries: Response time, messages per user per day, session length
- Guardrails: Blocks, reports, send fails, crash rate
I said I’d read the risk as “more pressure and ghosting.” So I’d watch new user cohorts and teen users closely. Different groups feel nudges in different ways. I shared one real story: I once shipped a nudge that looked friendly but made anxious users bounce. That got a nod.
Regional behavior differences matter, too: many Asian chat communities use typing indicators and stickers in slightly different ways than Western audiences.
Decision rule I gave:
- Launch if reply rate +1% or more
- No guardrail worse than −0.5%
- Stable for two full weekly cycles
- No extreme subgroup harm
They pushed: “Why +1%?” I said it’s a guess, tied to revenue and habit strength, but I’d pre-register the threshold and not wiggle it.
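One thing I’d do next time to make that stick: write the rule down as code before the data arrives, so there’s nothing left to wiggle. A toy sketch; the thresholds match what I gave above, but the function and argument names are mine, not anything Meta uses:

# Pre-registered launch rule, written before the experiment starts.
# Thresholds mirror the bullets above; names are illustrative only.
def should_launch(reply_rate_lift, worst_guardrail_move, stable_weeks, subgroup_harm):
    """worst_guardrail_move: most harmful guardrail change, where negative = degradation."""
    return (
        reply_rate_lift >= 0.01             # primary: +1% or better on reply rate
        and worst_guardrail_move >= -0.005  # no guardrail worse than -0.5%
        and stable_weeks >= 2               # stable for two full weekly cycles
        and not subgroup_harm               # no extreme subgroup harm
    )

print(should_launch(0.012, -0.002, 2, False))  # True -> launch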
3) Experiment design: New Stories sticker
We tested a new sticker in Stories.
- Unit: User level, 50/50 split
- Exposure: All Story creators; feed viewers unaffected
- Length: 2–3 weeks (to catch weekends)
- Power: Aim for 80%. Based on past experiments, we expected lifts around +1.5%. I said we’d need hundreds of thousands of creators per arm, gave a rough number, then said, “I’d confirm with our internal calculator.”
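For the curious, this is roughly the napkin math I was doing. The baseline mean and spread are placeholders I made up for illustration, not Meta’s real numbers:

from scipy.stats import norm

# Napkin sample-size estimate for a mean metric (stories posted per creator).
baseline_mean = 2.0   # assumed avg stories per creator over the test window
baseline_sd = 3.0     # assumed std dev; count metrics are skewed, so it's wide
rel_lift = 0.015      # the +1.5% lift we hoped to detect
alpha, power = 0.05, 0.80

delta = rel_lift * baseline_mean      # absolute lift to detect
z_alpha = norm.ppf(1 - alpha / 2)     # ~1.96 for a two-sided test
z_beta = norm.ppf(power)              # ~0.84 for 80% power

# Classic two-sample formula: n per arm ≈ 2 * (z_a + z_b)^2 * sd^2 / delta^2
n_per_arm = 2 * (z_alpha + z_beta) ** 2 * baseline_sd ** 2 / delta ** 2
print(f"~{n_per_arm:,.0f} creators per arm")  # ≈ 157k with these placeholder inputs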
Metrics:
- Primary: Stories posted per creator
- Quality: Completion rate, replays
- Creator stickiness: 7-day active creators
- Guardrails: Report rate, time spent by viewers, crash rate
I also called out interference: if creators post more, viewers’ time spent may shift too. If that spillover is big, I’d consider a geo split or clustering the randomization by friend graph. Not always possible, but at least I named it.
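If we did go the cluster route, the assignment itself is the easy part: hash the cluster (or region) id and give the whole cluster one arm, so connected users never straddle treatment and control. A minimal sketch; the experiment salt and ids are made up:

import hashlib

# Whole clusters (friend-graph communities or geo regions) share an arm.
def arm_for_cluster(cluster_id: str, salt: str = "sticker_test_v1") -> str:
    digest = hashlib.md5(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

print(arm_for_cluster("cluster_1042"), arm_for_cluster("region_br_sp"))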
Stats call:
- Two-sided test
- If the metric is skewed, trim the top 1% or use a bootstrap (quick sketch below)
- Pre-register the plan
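The trim-or-bootstrap line is easier to show than to say. A minimal sketch on fake heavy-tailed data; the lognormal draw is just a stand-in for a skewed count metric, and the sample sizes are tiny compared to a real test:

import numpy as np

rng = np.random.default_rng(7)

# Fake skewed metric for two arms, with a pretend +1.5% lift in treatment.
control = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)
treat = rng.lognormal(mean=0.0, sigma=1.0, size=20_000) * 1.015

# Option 1: cap (winsorize) at the 99th percentile so a few whales don't dominate;
# dropping those rows outright is the stricter "trim".
cap_c, cap_t = np.percentile(control, 99), np.percentile(treat, 99)
trimmed_diff = np.clip(treat, None, cap_t).mean() - np.clip(control, None, cap_c).mean()

# Option 2: bootstrap the difference in means to get a CI without normality math.
boot = []
for _ in range(1_000):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treat, size=treat.size, replace=True)
    boot.append(t.mean() - c.mean())
lo, hi = np.percentile(boot, [2.5, 97.5])

# At this toy scale the CI may well include zero; the point is the method.
print(f"capped diff: {trimmed_diff:.4f}, bootstrap 95% CI: [{lo:.4f}, {hi:.4f}]")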
4) Stats brain-teaser: Significance vs truth
They gave me a tiny case: One arm has 10% click. The other has 10.3%. p = 0.03. Should we ship?
My answer:
- Yes, it’s “significant,” but magnitude matters
- Check power, seasonality, and novelty effects
- Look for p-hacking signs (many peeks, many metrics)
- Check subgroups for harm
- If effect is tiny and costs are real, I’d run a holdout post-launch
They liked that I didn’t chase p-values without context.
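If you want to poke at that toy case, a pooled two-proportion z-test reproduces it. The per-arm sample size is my own back-of-envelope pick, chosen so the prompt’s numbers land near p = 0.03; the takeaway is that “significant” and “only 0.3 points of lift” coexist just fine at this kind of scale:

from math import sqrt
from scipy.stats import norm

# Toy numbers from the prompt: 10.0% vs 10.3% click rate, p ≈ 0.03.
n = 95_000                     # users per arm (my assumption; the prompt didn't say)
p_control, p_treat = 0.100, 0.103

# Pooled two-proportion z-test.
p_pool = (p_control + p_treat) / 2           # equal arms, so a plain average pools fine
se = sqrt(2 * p_pool * (1 - p_pool) / n)
z = (p_treat - p_control) / se
p_value = 2 * norm.sf(abs(z))                # two-sided

lift = (p_treat - p_control) / p_control     # ~3% relative, 0.3 points absolute
print(f"z = {z:.2f}, p = {p_value:.3f}, relative lift = {lift:.1%}")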
5) Behavioral: When I killed my own project
Story:
- Situation: I led a feed tweak that lifted clicks but raised hides by 2%
- Task: Decide fast, with little time
- Action: I paused the ramp, showed the hide spike, and shared 3 follow-ups
- Result: We fixed ranking rules, relaunched later, and hit +0.8% with no harm
I shared how it felt. It stung. But it built trust. I think that mattered more than a win.
What I liked
- Real product talk. Not just theory.
- Friendly interviewers who asked “why” a lot.
- Clear structure. I knew what was next.
- Hands-on SQL that felt close to work.
What bugged me
- Time crunch. Good ideas got cut short.
- One tool quirk (date functions) ate minutes.
- Little space for quick data pulls; it was all whiteboard-ish.
- Fatigue by the last loop. My brain felt like oatmeal.
How I prepped (and what actually helped)
- Daily SQL reps on real-ish tables (users, events, sessions). I used Mode and a small local Postgres.
- Wrote one “metric sheet” per product: Messenger, Feed, Reels. Just basic funnels and guardrails.
- A/B test drills: I used a simple power calculator and ran toy sims in Python (a cut-down version is sketched right after this list). Even napkin math helps.
- Product teardowns: 15 minutes a night. What would I change? Why?
- Mock chats with a friend who kept asking “so what?”
- Skimmed the Meta data scientist interview guide, which maps out each round and includes sample questions.
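Here’s roughly what those toy sims looked like, rebuilt from memory rather than copied from my notes: simulate a pile of fake experiments with a known true lift, test each one, and the share that comes back significant is your empirical power. The inputs are placeholders, not real Meta numbers.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

def empirical_power(n_per_arm, base_mean=2.0, sd=3.0, rel_lift=0.015,
                    alpha=0.05, n_sims=200):
    """Share of simulated A/B tests that detect a known true lift."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(base_mean, sd, size=n_per_arm)
        treat = rng.normal(base_mean * (1 + rel_lift), sd, size=n_per_arm)
        _, p = ttest_ind(treat, control)
        hits += p < alpha
    return hits / n_sims

# Power climbs toward ~0.8 near the ~157k/arm napkin estimate with these same inputs.
# n_sims is kept small so this runs in seconds; crank it up for smoother numbers.
for n in (50_000, 150_000, 300_000):
    print(n, round(empirical_power(n), 2))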
I also browsed a handful of concise interview breakdowns on vhfdx.net to sanity-check my approach against other candidates’ experiences. A few that stood out were a brutally honest look at the Costco data science internship and a first-hand review of a New York data science internship.
Big help: Saying my assumptions out loud. They care more about your story than your script.
What I’d do differently next time
- Set a stopwatch for every answer: 2–3 minutes per part
- Lead with the decision rule, then the details
- Keep a tiny checklist on a sticky note: metric, guardrail, power, risks
- Practice a few “late data” takes and time zone traps
- Eat a real snack between rounds (not cold noodles)
Scorecard
- Depth of product talk: 4.5/5
- SQL realism: 4/5
- Fairness and tone: 5/5
- Time to think: 3/5
Overall: 4.3/5. Hard, fair, and kind of fun