competitive-social-listening
On completion, follow
gtm-meta-skillSection 7 for proactive issue feedback and the session-sharing Yes/No consent step.
Competitive Social Listening
Given a list of LinkedIn company URLs + time window, produces:
all_engagements.csv— every reaction/comment, full engager profilesenior_engagements.csv— filtered to VP / CXO / Director / Sales / Marketingdashboard.html— self-contained HTML dashboard (open in browser, shareable)competitive_analysis.zip— all three packaged for sharing
Use the e2e script for full runs:
python ~/.claude/skills/competitive-social-listening/scripts/competitive_social_listening.py \
--companies "Acme:https://www.linkedin.com/company/acme/:acme.com" \
"Rival:https://www.linkedin.com/company/rival/:rival.com" \
--output /tmp/comp_run --months 3 --employees 10
# Test mode (1 company, 2 employees, 5 posts):
python competitive_social_listening.py --test --output /tmp/comp_test
Or follow the manual steps below if you need to debug or resume a partial run.
Pipeline Architecture
Input: [company LinkedIn URLs + time window]
│
├─── ⚡ PARALLEL ──────────────────────────────────────┐
│ Phase A: Employee discovery (dropleads) │
│ → employee LinkedIn URLs per company │
│ │
│ Phase B: Company page posts (harvestapi) │
│ → starts immediately, no need to wait for A │
└──────────────────────────────────────────────────────┘
│
▼
all_urls.csv (company pages + employee profiles)
│
▼
Full post scrape → posts_data.csv
[harvestapi/linkedin-company-posts, no reactions, fast]
│
▼
Filter high-engagement posts (≥5 reactions or ≥2 comments)
│
├─── ⚡ PARALLEL (max 12 concurrent) ────────────────┐
│ 10-post batches → async apify_run_actor │
│ viral posts (>300 reactions) → separate batch │
│ → datasetIds collected as_completed() │
└────────────────────────────────────────────────────┘
│
▼
Download all datasets → parse + deduplicate
│
▼
all_engagements.csv + senior_engagements.csv + dashboard.html
Prerequisites
deepline backend start # always required before any deepline tool execute
deepline billing balance # check credits before starting
Set your working directory:
export WORKDIR=/tmp/comp_$(date +%Y%m%d)
mkdir -p $WORKDIR
Step 1: Collect Inputs → companies.csv
Accept from user:
- List of company LinkedIn URLs
- Time window (default: 3 months)
- Max employees per company (default: 10)
- Seniority filter (default: VP, CXO, Director, Sales, Marketing)
Create seed file:
company_name,company_linkedin,domain
Acme,https://www.linkedin.com/company/acme/,acme.com
Rival,https://www.linkedin.com/company/rival/,rival.com
If domain is unknown, resolve via Apollo (optional — dropleads works without domain if you have LinkedIn URL):
deepline tools execute apollo_organization_search --payload \
'{"q_organization_name": "Acme Corp", "per_page": 1}'
# Look for .data.organizations[0].primary_domain
# Apollo returns "organizations" NOT "accounts" — critical
Step 2: Employee Discovery → all_urls.csv
Provider: dropleads (not harvestapi/linkedin-company-employees — that actor returns empty)
Run for each company (fire all companies in parallel — don't wait for one before starting the next):
deepline tools execute dropleads_search_people --payload '{
"filters.companyDomains": ["acme.com"],
"filters.seniority": ["C-Level", "VP", "Director"],
"filters.departments": ["Sales", "Marketing"],
"pagination.limit": 10
}'
⚠️ Field names use dot notation:
filters.companyDomains, NOTcompanyDomains.filters.seniority, NOTseniorities.pagination.limit, NOTlimit.⚠️ Response is raw text with a CSV path, not JSON:
/tmp/output_1772593757336.csv (8 rows) columns: ["id", "fullName", "title", "linkedinUrl", ...]Extract the CSV path with
re.search(r'(/tmp/\S+\.csv)', output)and read it with Python'scsv.DictReader.
Build $WORKDIR/all_urls.csv:
company,linkedin_url,description
Acme,https://www.linkedin.com/company/acme/,Company Page
Acme,https://www.linkedin.com/in/jane-smith,Jane Smith - VP Sales
Rival,https://www.linkedin.com/company/rival/,Company Page
...
This file is the source of truth for company → handle mapping used in Step 5.
Step 3: Scrape Posts (No Reactions) → posts_data.csv
One actor run for all URLs. Do not request reactions here — scraping reactions inline causes CLI timeout (actor runtime >90s).
deepline tools execute apify_run_actor --payload '{
"actorId": "harvestapi/linkedin-company-posts",
"input": {
"targetUrls": [
"https://www.linkedin.com/company/acme/",
"https://www.linkedin.com/in/jane-smith"
],
"postedLimit": "3months",
"maxPosts": 50,
"scrapeReactions": false,
"scrapeComments": false
}
}'
⚠️
postedLimitvalid values:"month","3months","6months","year". NOT"1month"or integers.
Wait for status: SUCCEEDED, capture defaultDatasetId. Download:
deepline tools execute apify_get_dataset_items \
--payload '{"datasetId": "DATASET_ID", "params": {"limit": 5000}}' \
> $WORKDIR/posts_raw.json
Parse the response (deepline CLI prepends a status header — skip to first { or [):
import re, json
with open(f"{WORKDIR}/posts_raw.json") as f:
content = f.read()
m = re.search(r'(\{|\[)', content)
items = json.loads(content[m.start():])
Actual field names in Apify response items:
| Field | Actual key | Common mistake |
|---|---|---|
| Post URL | linkedinUrl |
postUrl ✗ |
| Post text | content |
text ✗ |
| Author name | author["name"] (dict) |
authorName ✗ |
| Reactions | engagement["likes"] (dict) |
totalReactions ✗ |
| Comments | engagement["comments"] (dict) |
totalComments ✗ |
| Post date | postedAt |
posted_at ✗ |
⚠️
source_urlis always empty in post results — never use it for company mapping. Instead: extract the LinkedIn handle fromlinkedinUrland look it up inall_urls.csv.
def extract_handle(url):
url = url.rstrip("/")
m = re.search(r"/(?:in|company|pub)/([^/?#]+)", url)
if m: return m.group(1).lower()
m = re.search(r"/posts/([^_]+)_", url) # handle from post URLs
if m: return m.group(1).lower()
return None
Save as posts_data.csv. Filter high-engagement posts (≥5 reactions or ≥2 comments) → high_engagement_posts.csv.
Step 4: Batch Reaction Scraping
4.1 Build Batches
Key insight: harvestapi/linkedin-company-posts accepts individual post URLs as targetUrls. 10 post URLs ≈ 15s run — well within CLI timeout.
# Separate viral posts before batching
normal = [p for p in posts if p["total_reactions"] <= 300]
viral = [p for p in posts if p["total_reactions"] > 300]
# Normal posts: 10 per batch, 100 max reactions
batches = [normal[i:i+10] for i in range(0, len(normal), 10)]
# Viral posts: 5 per batch, 50 max reactions (prevents timeout)
viral_batches = [viral[i:i+5] for i in range(0, len(viral), 5)]
Each batch config:
{
"actorId": "harvestapi/linkedin-company-posts",
"input": {
"targetUrls": ["https://www.linkedin.com/posts/acme_..."],
"scrapeReactions": true,
"maxReactions": 100,
"scrapeComments": true,
"maxComments": 50
}
}
4.2 Run Batches — Async, Max 12 Parallel
Always use apify_run_actor (async), never apify_run_actor_sync — reaction scraping exceeds the ~90s CLI timeout.
from concurrent.futures import ThreadPoolExecutor, as_completed
def run_batch(batch_config, idx):
raw = subprocess.run(
["deepline", "tools", "execute", "apify_run_actor",
"--payload", json.dumps(batch_config)],
capture_output=True, text=True, timeout=180
)
output = raw.stdout + raw.stderr
# Extract datasetId from output
m = re.search(r'"defaultDatasetId"\s*:\s*"([A-Za-z0-9]+)"', output)
return m.group(1) if m else None
# Run in groups of 12 — >15 concurrent causes 2-3 timeouts
with ThreadPoolExecutor(max_workers=12) as ex:
futures = {ex.submit(run_batch, b, i): i for i, b in enumerate(batches)}
dataset_ids = []
for future in as_completed(futures): # process as each finishes, don't wait for all
ds_id = future.result()
if ds_id:
dataset_ids.append(ds_id)
⚠️ If a batch times out: the Apify run likely completed on the server — only the CLI wait timed out. Re-run that batch alone (not in parallel with others). It will finish in ~15s on retry.
4.3 Download All Datasets
Fire dataset downloads in parallel (all independent):
def download_dataset(ds_id):
result = subprocess.run(
["deepline", "tools", "execute", "apify_get_dataset_items",
"--payload", json.dumps({"datasetId": ds_id, "params": {"limit": 2000}})],
capture_output=True, text=True, timeout=120
)
output = result.stdout + result.stderr
m = re.search(r'(\{|\[)', output)
if not m: return []
data = json.loads(output[m.start():])
if isinstance(data, dict): return data.get("data", data.get("items", []))
return data if isinstance(data, list) else []
# Download all in parallel
with ThreadPoolExecutor(max_workers=10) as ex:
all_items = []
for items in ex.map(download_dataset, dataset_ids):
all_items.extend(items)
Step 5: Parse Engagements → CSVs
Deduplicate on (post_url, actor_id, engagement_type). Classify seniority from position title:
SENIOR_PATTERNS = {
"C-Suite/Exec": [r"\bceo\b", r"\bcto\b", r"\bcfo\b", r"\bcoo\b", r"\bcmo\b", r"\bcro\b",
r"\bcpo\b", r"\bciso\b", r"\bchief\b", r"\bfounder\b", r"\bco-founder\b",
r"\bpresident\b", r"\bowner\b", r"\bpartner\b", r"\bprincipal\b"],
"VP": [r"\bvp\b", r"\bvice president\b", r"\bsvp\b", r"\bevp\b", r"\brvp\b"],
"Director": [r"\bdirector\b"],
"Sales": [r"\bsales\b", r"\baccount executive\b", r"\bbdr\b", r"\bsdr\b", r"\brevenue\b"],
"Marketing": [r"\bmarketing\b", r"\bgrowth\b", r"\bdemand gen\b", r"\bcontent\b", r"\bbrand\b"],
}
Output columns (both CSVs):
company, post_url, post_author, post_date, post_total_reactions, post_total_comments, post_content_preview, engagement_type, reaction_type, comment_text, engagement_date, engager_name, engager_linkedin_url, engager_position, engager_title_category, is_senior, engager_picture_url
Step 6: Dashboard + Package
Run build_dashboard.py → dashboard.html (Chart.js via CDN, self-contained, no server needed):
- Stat cards: senior engagements, all engagements, unique engagers, reactions, comments
- Bar: by company | Donut: title category mix | Bar: reaction types | Line: monthly trend
- Top posts table (ranked by senior engagement count)
- Top 20 engager table — ⚡ highlights multi-company engagers (highest-signal prospects)
⚠️ f-string + JS braces conflict: if building HTML template in Python, use a string placeholder (
__DATA_JSON__) and.replace()instead of an f-string. Minified JS{{inside an f-string causesSyntaxError: single '}' is not allowed.
cd $WORKDIR && zip -j competitive_analysis.zip dashboard.html all_engagements.csv senior_engagements.csv
Speed Architecture
These are the proven parallelism wins for this workflow. Each reduces wall-clock time without adding complexity.
1. Fan-out employee discovery
Fire dropleads for all companies simultaneously (ThreadPoolExecutor), not one-by-one. 3 companies = same time as 1.
2. Overlap employee discovery with company page post scraping
Company-page post scraping doesn't need employee URLs. Start the first actor run for company pages the moment you have them — don't wait for dropleads to finish.
3. as_completed() for reaction batches
Don't wait for an entire group to finish before starting dataset downloads. Start downloading the first completed batch immediately while others are still running.
4. Parallel dataset downloads
All apify_get_dataset_items calls are independent. Run them all concurrently (max 10 parallel — Apify rate limit).
5. Progressive CSV writing
Append engagement rows to CSVs as each dataset is parsed rather than holding all data in memory. Reduces peak memory for large runs (>50k rows).
6. Speculative file reads (Claude Code pattern)
When running in an agent loop: fire all tool calls that are independent in a single message. Don't wait for file A before reading file B if they're not dependent.
Why best-in-class agents are faster
| Technique | Description | Applied here |
|---|---|---|
| Parallel tool dispatch | Fire all independent API calls in one message/thread | Fan-out dropleads + Apify simultaneously |
| PTC (Programmatic Tool Calling) | Write code to loop over tool calls, avoiding model round-trips | E2E script replaces 50+ agent turns |
| Speculative prefetch | Read files you'll likely need before confirmed you need them | all_urls.csv written early, reused later |
| Sub-agent isolation | Spawn agents for independent work streams with no shared context | ThreadPoolExecutor batches = sub-agent equivalent |
| Minimal context passing | Pass IDs and file paths between steps, not data blobs | Only datasetId strings passed between phases |
| Fail fast with known errors | Encode all known failure modes as input validation | postedLimit map, viral batch detection, header-skip parser |
Pitfalls Reference
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 1 | 0 employees returned | Wrong dropleads field names | Use filters.companyDomains, filters.seniority, pagination.limit (dot notation) |
| 2 | dropleads JSON parse fails | Response is raw text + CSV path, not JSON | Use deepline_raw() + re.search(r'(/tmp/\S+\.csv)', output) |
| 3 | 0 unique posts parsed | postUrl field doesn't exist |
Use linkedinUrl; author is author["name"] (dict); engagement is engagement["likes"] |
| 4 | Posts actor 422 error |
Invalid postedLimit value |
Use "month", "3months", "6months", "year" only |
| 5 | Employee actor returns empty | Wrong actor | harvestapi/linkedin-company-employees is broken; use dropleads_search_people |
| 6 | Reaction batch exit code 4 |
apify_run_actor_sync timeout |
Use async apify_run_actor, capture defaultDatasetId, download separately |
| 7 | 2-3 batches timeout when >15 parallel | Apify server-side throttle | Max 12 parallel; retry timed-out batches solo |
| 8 | Viral post batch always times out | >300-reaction posts + maxReactions:100 |
Separate batch: maxReactions: 50, batch_size 5 |
| 9 | JSONDecodeError: Expecting value on ds files |
deepline CLI prepends status header | `re.search(r'({ |
| 10 | Can't map posts to companies | source_url always empty in Apify output |
Extract handle from linkedinUrl, match against all_urls.csv |
| 11 | SyntaxError: single '}' not allowed |
JS braces in Python f-string | Use __DATA_JSON__ placeholder + .replace() |
| 12 | _csv.Error: field larger than field limit |
Large JSON in deepline enrich CSV cells | Don't route Apify data through deepline enrich; use apify_get_dataset_items directly |
Cost Estimation
Run before starting:
def estimate_cost(n_companies, employees_per_company=10, months=3):
urls = n_companies * (1 + employees_per_company)
posts = urls * 15 # ~15 posts/URL avg
high_eng = int(posts * 0.78) # 78% pass engagement filter (validated)
batches = (high_eng // 10) + 1
# Apify: ~$0.002 per dataset item
post_items = posts
reaction_items = high_eng * 150 # reactions + comments per post
apify_usd = (post_items + reaction_items) * 0.002
# Dropleads: ~5 Deepline credits per employee
dl_credits = n_companies * employees_per_company * 5
print(f"URLs: {urls} | Posts: ~{posts} | High-eng: ~{high_eng} | Batches: {batches}")
print(f"Apify: ~${apify_usd:.0f} | Deepline credits: ~{dl_credits}")
return apify_usd, dl_credits
estimate_cost(3) # 3 companies, 10 employees each
# → Apify: ~$23 | Deepline credits: ~150
Validated actuals (3-company run, Vooma/GoAugment/TryPallet):
- 25 URLs → 268 posts → 209 high-engagement → 21 batches
- 8,855 total engagements, 4,131 senior (49%)
- Apify: ~$20-30 | Deepline credits: ~75-150
Always run deepline billing balance first. Pilot one batch before committing to full run.