Top 40 Crawl
Scalable Historical Music Data Ingestion & AI-Driven Metadata Enrichment Pipeline
Executive Summary
This project implements a scalable, cloud-native ETL (Extract, Transform, Load) pipeline designed to aggregate, enrich, and persist historical music chart data spanning over six decades (1965–Present).
The architecture demonstrates a modern approach to data engineering by combining traditional web scraping with Generative AI (LLMs) and third-party APIs to build a comprehensive metadata repository. It emphasizes high availability, cost optimization through caching, and the processing of unstructured data.
Architectural Overview
The system is designed as a modular pipeline consisting of three core stages:
- Ingestion Layer: Automated crawling of historical chart data (the Dutch Top 40), extracting structured entities (Year, Week, Rank, Artist, Title).
- Enrichment Layer:
  - Generative AI Integration: Utilizes the OpenAI API (or Llama models) to perform semantic analysis and metadata generation (Genre, BPM, Key, Lyrical Themes) from raw artist/title pairs.
  - Media Aggregation: Integrates with the YouTube Music API to fetch rich media metadata (thumbnails, video URLs, duration), linking static chart data to consumable media content.
- Persistence & Caching Layer: Leverages Redis Stack as a high-performance, in-memory data store. Redis serves as both the primary buffer for ingested data and a deduplication cache to minimize API costs and latency.
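The deduplication-cache pattern described above can be sketched as follows. This is an illustrative sketch, not code from the repository: the key scheme (`top40:<year>:<week>:<rank>`) and the names `chart_key` and `enrich_once` are assumptions, and the store is injected so anything with `get`/`set` (a `redis.Redis` client, or a stub in tests) can back it.

```python
import json

def chart_key(year: int, week: int, rank: int) -> str:
    """Build a deterministic cache key for one chart entry, e.g. 'top40:1965:01:05'."""
    return f"top40:{year}:{week:02d}:{rank:02d}"

def enrich_once(store, year: int, week: int, rank: int, enrich_fn) -> dict:
    """Return cached enrichment if present; otherwise call enrich_fn once and cache it.

    `store` is any object exposing get/set (redis-py's client fits this shape).
    """
    key = chart_key(year, week, rank)
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no LLM/API call, no token cost
    result = enrich_fn(year, week, rank)   # cache miss: pay for the external call once
    store.set(key, json.dumps(result))
    return result
```

Because the existence check happens before the external call, re-running the pipeline over already-ingested weeks is idempotent and incurs no additional API spend.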
Key Features & Competencies
- AI-Driven Data Enrichment: Implements a structured prompt engineering strategy to transform unstructured queries into standardized JSON schemas, enabling rich metadata extraction that traditional APIs cannot provide.
- Cost-Optimized Architecture: Features intelligent caching mechanisms (Redis) to enforce idempotency. By checking for existing keys before triggering LLM or external API calls, the system significantly reduces token consumption and operational costs.
- High-Performance NoSQL Storage: Utilizes Redis for sub-millisecond read/write latency, ensuring the pipeline can handle high-throughput data processing without I/O bottlenecks.
- Scalable Design Patterns: The modular nature of the ingestion and enrichment services allows for horizontal scaling and independent deployment of worker nodes.
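The structured-prompt strategy above can be illustrated with a minimal sketch. The schema fields (`genre`, `bpm`, `key`, `themes`) and the helper names are hypothetical, not taken from the repository; the point is that the prompt demands JSON only, so the reply can be parsed and validated mechanically rather than scraped from free text.

```python
import json

REQUIRED_FIELDS = ("genre", "bpm", "key", "themes")  # hypothetical schema fields

def build_enrichment_prompt(artist: str, title: str) -> str:
    """Ask the model for strict JSON so the reply can be parsed mechanically."""
    return (
        "Return ONLY a JSON object with keys genre (string), bpm (integer), "
        "key (string), themes (list of strings) for the song:\n"
        f'artist: "{artist}"\ntitle: "{title}"'
    )

def parse_enrichment(reply: str) -> dict:
    """Validate the model reply against the expected schema; raise on drift."""
    data = json.loads(reply)
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"LLM reply missing fields: {missing}")
    return data
```

Validating every reply at the boundary keeps malformed or partial model output from ever reaching the Redis persistence layer.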
Technical Stack
- Core Runtime: Python 3.x
- Data Store: Redis Stack Server (In-memory Key-Value Store)
- AI/ML: OpenAI API / Llama (generative large language models)
- External APIs: YouTube Music API
- Infrastructure: Docker-ready, Linux-compatible
Deployment & Configuration
Prerequisites
- Python 3.10+
- Redis Stack Server
- API Keys for OpenAI
Environment Setup
Securely configure the application using environment variables.
cp .env.example .env
# Edit .env to include your OPENAI_API_KEY and REDIS_CONNECTION_STRING
Dependency Management
Install production dependencies:
pip install -r requirements.txt
Infrastructure Provisioning (Redis)
Deploy Redis Stack on Debian/Ubuntu based systems:
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update && sudo apt-get install redis-stack-server
sudo service redis-stack-server start
Execution
Initiate the data pipeline:
python3 process.py
Roadmap
- Polyglot Persistence: Migration of cold data to a relational database (PostgreSQL) for complex analytical querying.
- Observability: Integration of Prometheus/Grafana for pipeline metrics (throughput, API latency, error rates).
- Containerization: Full Docker and Kubernetes (Helm) support for cloud-agnostic deployment.

