Making sense of social attribution
Making sense of social attribution
The attribution gap we didn't see coming
We had a marketing team watching posts go viral. Engagement numbers looked great. But when asked "Which posts actually drove revenue," we had no idea.
That is the problem with most social media dashboards: they are islands. Likes, shares, reach, sentiment, each metric lives in its own corner of the internet, and none of them talk to each other. You end up making decisions based on whatever number looks best that week.
Here's the ugly truth: our performance data sat in Looker with strict access controls. Our social sentiment lived in Sprinklr, exported as CSVs every morning. Different teams, different tools, different stories.
So we built something to bridge the gap. Not perfect automation, pragmatic workarounds that actually work in a messy enterprise environment.
Making sense of the mess: the quadrant
Before touching any code, we needed a framework marketers could actually use. We plot normalized reach against normalized mentions, and suddenly the chaos has a pattern.
What the quadrants actually mean:
- High Performers (High reach, high mentions): Double down. These posts are your winners.
- Platform Push (High reach, low mentions): The message is spreading but people aren't talking about it. Consider adjusting the content.
- Leave It (Low reach, low mentions): Content isn't resonating. Kill it or fix it.
- Underdogs (Low reach, high mentions): Your audience loves it, but not enough people see it. Amplify these.
The stack we chose
We went hybrid: serverless for heavy historical data, PostgreSQL for fast-moving cache.
| Component | Why we picked it |
|---|---|
| Airflow | Python-based orchestration, industry standard |
| Sprinklr exports | Daily CSVs give us cross-account mentions |
| Streamlit | Fast Python web apps stakeholders can actually use |
| PostgreSQL | Fast caching for uploaded CSVs (first stream & account reach) |
| Amazon S3 | Cheap storage for messy, multi-format data |
| AWS Glue | ETL without paying for a database we don't need |
| Athena | Query layer for Sprinklr mentions data |
The pattern: Sprinklr data (large, historical) goes S3 → Glue → Athena. Uploaded CSVs (small, frequently refreshed) go Streamlit → PostgreSQL. The dashboard joins them on the fly.
How the Pipeline Fits Together
Getting the data in
Sprinklr data goes into S3 first, then through Glue ETL, into Athena. The scheduled CSV exports from Sprinklr land here, automated, reliable, queryable.
First Stream & Account Reach takes a different path. Marketers upload CSVs directly through Streamlit, which stores them in PostgreSQL. The data gets cached, refreshed on demand, and stays close to where people actually work.
Cataloging the chaos
AWS Glue lets us treat S3 folders like SQL tables without actually running a database. Schema-on-read means we don't pay thousands for infrastructure that changes every time a social platform tweaks their export format.
The actual logic
The heavy lifting happens when data meets. Sprinklr data lives in Athena (S3-backed). First Stream & Account Reach lives in PostgreSQL (Streamlit cache). The join happens on the fly when someone queries the dashboard.
# Join Athena (Sprinklr mentions) with PostgreSQL (cached reach data)
def correlate_intelligence():
"""
Fetches Sprinklr mentions from Athena, cached reach from PostgreSQL.
Joins and normalizes on the fly.
"""
# Query Sprinklr mentions from Athena
mentions_df = query_athena(
"SELECT * FROM social_mentions WHERE date >= CURRENT_DATE - 7"
)
# Fetch cached reach data from PostgreSQL
reach_df = query_postgres("SELECT * FROM account_reach_cache")
# Join on date/account
joined = mentions_df.merge(reach_df, on=['date', 'account'])
# Normalize and categorize
norm_reach = joined['reach'] / joined['reach'].max()
norm_mentions = joined['mentions'] / joined['mentions'].max()
# Categorize based on median performance
if norm_reach > median_reach:
return "High Performers" if norm_mentions > median_mentions else "Platform Push"
else:
return "Leave It" if norm_mentions > median_mentions else "Underdogs"
The key insight: We don't move all the data into one place. We query where it lives and join on the fly. PostgreSQL handles the fast-changing reach data; Athena handles the historical Sprinklr mentions. The dashboard stitches them together when you need them.
Streamlit makes it actionable
This is where it clicks. Our Streamlit dashboard is designed for action. Marketers hover over dots and immediately see which campaign gets more budget versus which one needs a PR fix.
Upload flow: CSVs uploaded through Streamlit land in PostgreSQL, cached for fast queries. The cache refreshes when needed, not on a rigid schedule, so teams can upload fresh data whenever they want.
Query flow: When someone opens the dashboard, it queries Athena for Sprinklr mentions and pulls from PostgreSQL for cached reach. The join happens in real-time, and results populate instantly.
What we found
High Performers (High reach, high mentions): These are your winners. Pour more fuel on the fire.
Platform Push (High reach, low mentions): The message is spreading but people aren't talking about it. Consider adjusting the content.
Leave It (Low reach, low mentions): Content isn't resonating. Kill it or fix it.
Underdogs (Low reach, high mentions): Your audience loves it, but not enough people see it. Amplify these.
What changed
| Metric | Before | After |
|---|---|---|
| Response time | Days of manual reconciliation | Near real-time |
| Data ownership | Vendor lock-in | Owned logic |
| Team alignment | Siloed teams, different stories | One view, one conversation |
| Decisions | Gut feeling, vanity metrics | Correlation-based, data-backed |
What we learned
-
Solve for politics first. Perfect automation is a myth in enterprise settings. Build for the "semi-automated" reality where humans still click buttons and email CSVs.
-
Own your logic. Don't let some BI tool decide what "successful" looks like. Define your own quadrants based on what actually matters to your business.
-
Keep it serverless. S3, Glue, and Athena give you a stack that's cheap to run and scales when you need it. No database maintenance. No hardware upgrades.
The point
This pipeline turns social media data from isolated metrics into actual insights. Marketing teams stop guessing and start knowing.
The goal isn't just data. It's clarity that moves the needle.
Thanks
This project started with late-night brainstorms about marketing bottlenecks. Haikal (Head of Digital) pushed us to bridge the attribution gap. Jerry and the social media team gave us practical feedback that shaped every decision. Bernice, our intern, worked alongside us to build and refine the pipeline.
This article is part of Wired Sixth's "Intel" series on data strategy and engineering.