Web Scraping
TendSocial includes web scraping capabilities for extracting content from URLs.
Use Cases
- Blog Import: Extract content from blog posts for repurposing
- Website Analysis: Analyze brand websites for AI context
- Link Preview: Generate rich previews for shared links
- Competitor Analysis: Extract public content for reference
API Endpoints
POST /api/scrape-url
Scrape content from a URL.
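A call to this endpoint might look like the sketch below. The `scrapeUrl` helper, the `FetchLike` type, and the `baseUrl` parameter are hypothetical names used for illustration, and the sketch assumes the endpoint accepts a JSON body. The request and response fields are the ones documented for this endpoint.

```typescript
// Field names mirror the shapes documented for POST /api/scrape-url.
type ExtractType = "article" | "metadata" | "full";

interface ScrapeUrlRequest {
  url: string;
  extractType?: ExtractType;
}

interface ScrapeUrlResponse {
  title: string;
  description: string;
  content: string;
  images: string[];
  author?: string;
  publishDate?: string;
  favicon?: string;
  ogImage?: string;
  siteName?: string;
}

// Minimal subset of the fetch API, so the helper can be exercised with a stub.
type FetchLike = (
  input: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical helper: baseUrl and the JSON content type are assumptions.
async function scrapeUrl(
  fetchImpl: FetchLike,
  baseUrl: string,
  request: ScrapeUrlRequest,
): Promise<ScrapeUrlResponse> {
  const res = await fetchImpl(`${baseUrl}/api/scrape-url`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request),
  });
  if (!res.ok) {
    throw new Error(`scrape-url failed with status ${res.status}`);
  }
  return (await res.json()) as ScrapeUrlResponse;
}
```

Passing the global `fetch` as `fetchImpl` gives a real call; passing a stub keeps the helper testable without a network.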
```typescript
// Request
{
  url: string, // Full URL to scrape
  extractType?: "article" | "metadata" | "full"
}

// Response
{
  title: string,
  description: string,
  content: string, // Main text content
  images: string[], // Image URLs found
  author?: string,
  publishDate?: string,
  favicon?: string,
  ogImage?: string,
  siteName?: string
}
```
POST /api/scrape-website
Analyze a website for brand context.
```typescript
// Request
{ url: string }

// Response
{
  title: string,
  description: string,
  industry?: string,
  keywords: string[],
  socialLinks: { platform: string, url: string }[],
  colors?: string[], // Extracted brand colors
  logoUrl?: string
}
```
Technical Implementation
Scraping Library
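TendSocial uses cheerio for HTML parsing:
```typescript
import * as cheerio from 'cheerio';

// `url` is the already-validated URL to scrape
const html = await fetch(url).then(r => r.text());

// Load the HTML, then read the page title and the text of any <article> elements
const $ = cheerio.load(html);
const title = $('title').text();
const content = $('article').text();
```
Rate Limiting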
- Max 10 scrapes per minute per company
- Results are cached for 1 hour per URL
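One way to enforce the per-company limit is a sliding-window counter, sketched below. This is an illustration under assumptions, not necessarily how TendSocial implements it: the `SlidingWindowLimiter` class, the in-memory `Map`, and the choice of a sliding rather than fixed window are all hypothetical.

```typescript
// Sliding-window limiter: at most `limit` calls per `windowMs` for each key.
// Assumption: state is kept in memory; a multi-instance deployment would need shared storage.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the call if the key is still under its limit.
  tryAcquire(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only timestamps that are still inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}

// Values from the list above: 10 scrapes per minute, keyed by company.
const scrapeLimiter = new SlidingWindowLimiter(10, 60_000);
```

A request handler would call `scrapeLimiter.tryAcquire(companyId)` before scraping and reject the request when it returns `false`; `companyId` here stands for whatever identifier the application uses per company.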
Error Handling
Common failure modes:
- Blocked: Site blocks scrapers (403)
- Timeout: Site too slow (10s limit)
- Invalid: URL returns non-HTML
- Protected: Paywalled content
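The failure modes above could be mapped to a small classifier like the following sketch. The `FetchOutcome` shape and `classifyOutcome` function are hypothetical, and treating 401 and 402 as "protected" is an assumption: the source does not say how paywalls are detected, and many paywalled pages return 200 with truncated content.

```typescript
type ScrapeFailure = "blocked" | "timeout" | "invalid" | "protected";

// The 10s limit from the list above.
const SCRAPE_TIMEOUT_MS = 10_000;

// Hypothetical summary of a fetch attempt.
interface FetchOutcome {
  status?: number; // HTTP status, if a response arrived
  contentType?: string; // Content-Type header value, if any
  timedOut?: boolean; // true when SCRAPE_TIMEOUT_MS elapsed first
}

// Returns the failure mode, or null when the response looks scrapeable.
function classifyOutcome(outcome: FetchOutcome): ScrapeFailure | null {
  if (outcome.timedOut) return "timeout";
  if (outcome.status === 403) return "blocked";
  // Assumption: auth-required and payment-required statuses indicate a paywall.
  if (outcome.status === 401 || outcome.status === 402) return "protected";
  const isHtml = /^\s*(text\/html|application\/xhtml\+xml)\b/i.test(
    outcome.contentType ?? "",
  );
  return isHtml ? null : "invalid";
}
```

Returning a tagged value rather than throwing lets the caller decide which failures are worth retrying or caching.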
Security
- URLs are validated before scraping
- Private IPs (127.x, 10.x, etc.) are blocked
- User-Agent is set to identify TendSocial
- SSRF protections in place
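A minimal sketch of the URL validation and private-IP check follows. The function names are hypothetical. Beyond 127.x and 10.x, the blocked ranges (172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, 0.0.0.0/8) and the blanket rejection of IPv6 literals are assumptions about what "etc." covers.

```typescript
// Parse a dotted-quad IPv4 literal into four octets, or return null.
function parseIPv4(host: string): number[] | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  return octets.every((o) => o >= 0 && o <= 255) ? octets : null;
}

// True for loopback, RFC 1918 private, link-local, and "this network" addresses.
function isPrivateIPv4(host: string): boolean {
  const octets = parseIPv4(host);
  if (!octets) return false;
  const [a, b] = octets;
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Accept only http(s) URLs whose host is not an obviously internal address.
function isAllowedScrapeUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  // Assumption: this sketch rejects every IPv6 literal rather than classifying them.
  if (host.startsWith("[")) return false;
  return !isPrivateIPv4(host);
}
```

A literal check like this is only a first step: a public hostname can still resolve to a private address, so a fuller SSRF defense would also check the resolved IP before connecting and again after each redirect.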
Database Schema
```prisma
model ScrapeCache {
  id        String   @id
  url       String   @unique
  data      Json
  scrapedAt DateTime
  expiresAt DateTime
}
```
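The `scrapedAt` and `expiresAt` columns could support the 1-hour cache roughly as sketched below. The `ScrapeCacheRow` interface mirrors the model's fields, while `buildCacheRow` and `isFresh` are hypothetical helpers; the sketch deliberately leaves out any database calls.

```typescript
// Mirrors the ScrapeCache model's fields.
interface ScrapeCacheRow {
  id: string;
  url: string;
  data: unknown;
  scrapedAt: Date;
  expiresAt: Date;
}

// 1 hour per URL, per the rate-limiting section.
const CACHE_TTL_MS = 60 * 60 * 1000;

// Build a row whose expiry is one TTL after the scrape time.
function buildCacheRow(
  id: string,
  url: string,
  data: unknown,
  now: Date = new Date(),
): ScrapeCacheRow {
  return {
    id,
    url,
    data,
    scrapedAt: now,
    expiresAt: new Date(now.getTime() + CACHE_TTL_MS),
  };
}

// A cached row is usable only while its expiry is still in the future.
function isFresh(row: ScrapeCacheRow, now: Date = new Date()): boolean {
  return row.expiresAt.getTime() > now.getTime();
}
```

Because `url` is unique in the schema, a lookup by URL returns at most one row; the caller would serve `data` when `isFresh` returns `true` and scrape again otherwise.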