Top 40 Crawl

Scalable Historical Music Data Ingestion & AI-Driven Metadata Enrichment Pipeline

Executive Summary

This project implements a scalable, cloud-native ETL (Extract, Transform, Load) pipeline designed to aggregate, enrich, and persist historical music chart data spanning over six decades (1965–Present).

The architecture demonstrates a modern approach to data engineering by combining traditional web scraping with Generative AI (LLMs) and third-party APIs to build a comprehensive metadata repository. It emphasizes high availability, cost optimization through caching strategies, and the integration of unstructured data processing.

Architectural Overview

The system is designed as a modular pipeline consisting of three core stages:

  1. Ingestion Layer: Automated crawling of historical chart data (NL Top 40), extracting structured entities (Year, Week, Rank, Artist, Title).
  2. Enrichment Layer:
    • Generative AI Integration: Utilizes OpenAI API (or Llama models) to perform semantic analysis and metadata generation (Genre, BPM, Key, Lyrical Themes) from raw artist/title pairs.
    • Media Aggregation: Integrates with the YouTube Music API to fetch rich media metadata (thumbnails, video URLs, duration), linking static chart data to consumable media content.
  3. Persistence & Caching Layer: Leverages Redis Stack as a high-performance, in-memory data store. Redis serves as both the primary buffer for ingested data and a deduplication cache to minimize API costs and latency.
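
The deduplication behavior described above can be sketched as a check-before-call wrapper. The function and key format below are illustrative, not the project's actual API; `cache` is anything with Redis-style `get`/`set` (e.g. a `redis.Redis` client with `decode_responses=True`), and `enrich_fn` stands in for an LLM or YouTube API call.

```python
import json

def enrich_if_missing(cache, track_key: str, enrich_fn):
    """Return metadata for a track, invoking the expensive enrichment
    function only on a cache miss (idempotent, cost-optimized)."""
    cached = cache.get(track_key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no API cost
    metadata = enrich_fn(track_key)          # cache miss: pay once
    cache.set(track_key, json.dumps(metadata))
    return metadata

class DictCache:
    """Minimal in-process stand-in for a Redis connection, for local tests."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value
```

Swapping `DictCache` for a real Redis connection changes nothing in the calling code, which is what keeps the enrichment workers horizontally scalable.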

Key Features & Competencies

  • AI-Driven Data Enrichment: Implements a structured prompt engineering strategy to transform unstructured queries into standardized JSON schemas, enabling rich metadata extraction that traditional APIs cannot provide.
  • Cost-Optimized Architecture: Features intelligent caching mechanisms (Redis) to enforce idempotency. By checking for existing keys before triggering LLM or external API calls, the system significantly reduces token consumption and operational costs.
  • High-Performance NoSQL Storage: Utilizes Redis for sub-millisecond read/write latency, ensuring the pipeline can handle high-throughput data processing without I/O bottlenecks.
  • Scalable Design Patterns: The modular nature of the ingestion and enrichment services allows for horizontal scaling and independent deployment of worker nodes.
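
The structured-prompt strategy can be sketched as a prompt builder paired with a schema check on the model's reply. The field names and helper functions here are assumptions for illustration, not the project's actual schema:

```python
import json

# Illustrative metadata schema the LLM is asked to fill in.
REQUIRED_FIELDS = {"genre", "bpm", "key", "lyrical_themes"}

def build_enrichment_prompt(artist: str, title: str) -> str:
    """Compose a prompt that constrains the model to a strict JSON reply."""
    return (
        f'For the song "{title}" by {artist}, return ONLY a JSON object '
        f"with exactly these keys: {sorted(REQUIRED_FIELDS)}. No prose."
    )

def parse_enrichment_reply(reply: str) -> dict:
    """Parse the model reply and enforce the expected schema."""
    data = json.loads(reply)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"LLM reply missing fields: {missing}")
    return data
```

Validating the reply before it reaches Redis is what lets a free-text model feed a pipeline that expects standardized records.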

Technical Stack

  • Core Runtime: Python 3.x
  • Data Store: Redis Stack Server (In-memory Key-Value Store)
  • AI/ML: OpenAI API / Llama models (large language models)
  • External APIs: YouTube Music API
  • Infrastructure: Docker-ready, Linux-compatible

Deployment & Configuration

Prerequisites

  • Python 3.10+
  • Redis Stack Server
  • An OpenAI API key

Environment Setup

Securely configure the application using environment variables.

```bash
cp .env.example .env
# Edit .env to include your OPENAI_API_KEY and REDIS_CONNECTION_STRING
```
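
Inside the application, these variables can be read with the standard library; the loader below is a minimal sketch using the two variable names from the `.env` example, with an assumed localhost fallback for Redis:

```python
import os

def load_config() -> dict:
    """Read pipeline configuration from environment variables.

    The fallback Redis URL is an illustrative default, not mandated
    by the project.
    """
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
        "redis_url": os.environ.get(
            "REDIS_CONNECTION_STRING", "redis://localhost:6379"
        ),
    }
```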

Dependency Management

Install production dependencies:

```bash
pip install -r requirements.txt
```

Infrastructure Provisioning (Redis)

Deploy Redis Stack on Debian/Ubuntu-based systems:

```bash
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update && sudo apt-get install redis-stack-server
sudo service redis-stack-server start
```

Execution

Initiate the data pipeline:

```bash
python3 process.py
```
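
Conceptually, the entry point chains the three pipeline stages described above. The sketch below uses stand-in functions with fixture data (the real crawler, enricher, and Redis writer are not reproduced here); all names are illustrative:

```python
def extract(year: int, week: int):
    """Stand-in for the chart crawler: yields structured chart rows."""
    # A real run scrapes the Top 40 site; this returns a one-row fixture.
    yield {"year": year, "week": week, "rank": 1,
           "artist": "Artist", "title": "Title"}

def enrich(row: dict) -> dict:
    """Stand-in for the LLM / YouTube Music enrichment stage."""
    return {**row, "genre": "unknown"}

def load(rows) -> int:
    """Stand-in for the Redis persistence stage; returns rows written."""
    return sum(1 for _ in rows)

def run_pipeline(year: int, week: int) -> int:
    """Extract -> Transform (enrich) -> Load, for one chart week."""
    return load(enrich(row) for row in extract(year, week))
```

Because each stage only consumes the previous stage's iterator, the same structure scales to many weeks or to independently deployed worker nodes.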

Roadmap

  • Polyglot Persistence: Migration of cold data to a relational database (PostgreSQL) for complex analytical querying.
  • Observability: Integration of Prometheus/Grafana for pipeline metrics (throughput, API latency, error rates).
  • Containerization: Full Docker and Kubernetes (Helm) support for cloud-agnostic deployment.