Top 40 Crawl

Scalable Historical Music Data Ingestion & AI-Driven Metadata Enrichment Pipeline

Executive Summary

This project implements a scalable, cloud-native ETL (Extract, Transform, Load) pipeline designed to aggregate, enrich, and persist historical music chart data spanning over six decades (1965–Present).

The architecture demonstrates a modern approach to data engineering by combining traditional web scraping with Generative AI (LLMs) and third-party APIs to build a comprehensive metadata repository. It emphasizes high availability, cost optimization through caching strategies, and the integration of unstructured data processing.

Architectural Overview

The system is designed as a modular pipeline consisting of three core stages:

  1. Ingestion Layer: Automated crawling of historical chart data (NL Top 40), extracting structured entities (Year, Week, Rank, Artist, Title).
  2. Enrichment Layer:
    • Generative AI Integration: Utilizes OpenAI API (or Llama models) to perform semantic analysis and metadata generation (Genre, BPM, Key, Lyrical Themes) from raw artist/title pairs.
    • Media Aggregation: Integrates with the YouTube Music API to fetch rich media metadata (thumbnails, video URLs, duration), linking static chart data to consumable media content.
  3. Persistence & Caching Layer: Leverages Redis Stack as a high-performance, in-memory data store. Redis serves as both the primary buffer for ingested data and a deduplication cache to minimize API costs and latency.
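
The deduplication behavior described above can be sketched as a check-before-call wrapper. The function and key format below are illustrative, not the project's actual API; `cache` is anything with Redis-style `get`/`set` (e.g. a `redis.Redis` client with `decode_responses=True`), and `enrich_fn` stands in for an LLM or YouTube API call.

```python
import json

def enrich_if_missing(cache, track_key: str, enrich_fn):
    """Return metadata for a track, invoking the expensive enrichment
    function only on a cache miss (idempotent, cost-optimized)."""
    cached = cache.get(track_key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no API cost
    metadata = enrich_fn(track_key)          # cache miss: pay once
    cache.set(track_key, json.dumps(metadata))
    return metadata

class DictCache:
    """Minimal in-process stand-in for a Redis connection, for local tests."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value
```

Swapping `DictCache` for a real Redis connection changes nothing in the calling code, which is what keeps the enrichment workers horizontally scalable.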

Key Features & Competencies

  • AI-Driven Data Enrichment: Implements a structured prompt engineering strategy to transform unstructured queries into standardized JSON schemas, enabling rich metadata extraction that traditional APIs cannot provide.
  • Cost-Optimized Architecture: Features intelligent caching mechanisms (Redis) to enforce idempotency. By checking for existing keys before triggering LLM or external API calls, the system significantly reduces token consumption and operational costs.
  • High-Performance NoSQL Storage: Utilizes Redis for sub-millisecond read/write latency, ensuring the pipeline can handle high-throughput data processing without I/O bottlenecks.
  • Scalable Design Patterns: The modular nature of the ingestion and enrichment services allows for horizontal scaling and independent deployment of worker nodes.
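
The structured-prompt strategy can be sketched as a prompt builder paired with a schema check on the model's reply. The field names and helper functions here are assumptions for illustration, not the project's actual schema:

```python
import json

# Illustrative metadata schema the LLM is asked to fill in.
REQUIRED_FIELDS = {"genre", "bpm", "key", "lyrical_themes"}

def build_enrichment_prompt(artist: str, title: str) -> str:
    """Compose a prompt that constrains the model to a strict JSON reply."""
    return (
        f'For the song "{title}" by {artist}, return ONLY a JSON object '
        f"with exactly these keys: {sorted(REQUIRED_FIELDS)}. No prose."
    )

def parse_enrichment_reply(reply: str) -> dict:
    """Parse the model reply and enforce the expected schema."""
    data = json.loads(reply)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"LLM reply missing fields: {missing}")
    return data
```

Validating the reply before it reaches Redis is what lets a free-text model feed a pipeline that expects standardized records.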

Technical Stack

  • Core Runtime: Python 3.x
  • Data Store: Redis Stack Server (In-memory Key-Value Store)
  • AI/ML: OpenAI API / Llama models (large language models)
  • External APIs: YouTube Music API
  • Infrastructure: Docker-ready, Linux-compatible

Deployment & Configuration

Prerequisites

  • Python 3.10+
  • Redis Stack Server
  • An OpenAI API key

Environment Setup

Securely configure the application using environment variables.

```bash
cp .env.example .env
# Edit .env to include your OPENAI_API_KEY and REDIS_CONNECTION_STRING
```
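
Inside the application, these variables can be read with the standard library; the loader below is a minimal sketch using the two variable names from the `.env` example, with an assumed localhost fallback for Redis:

```python
import os

def load_config() -> dict:
    """Read pipeline configuration from environment variables.

    The fallback Redis URL is an illustrative default, not mandated
    by the project.
    """
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
        "redis_url": os.environ.get(
            "REDIS_CONNECTION_STRING", "redis://localhost:6379"
        ),
    }
```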

Dependency Management

Install production dependencies:

```bash
pip install -r requirements.txt
```

Infrastructure Provisioning (Redis)

Deploy Redis Stack on Debian/Ubuntu-based systems:

```bash
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update && sudo apt-get install redis-stack-server
sudo service redis-stack-server start
```

Execution

Initiate the data pipeline:

```bash
python3 process.py
```
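
Conceptually, the entry point chains the three pipeline stages described above. The sketch below uses stand-in functions with fixture data (the real crawler, enricher, and Redis writer are not reproduced here); all names are illustrative:

```python
def extract(year: int, week: int):
    """Stand-in for the chart crawler: yields structured chart rows."""
    # A real run scrapes the Top 40 site; this returns a one-row fixture.
    yield {"year": year, "week": week, "rank": 1,
           "artist": "Artist", "title": "Title"}

def enrich(row: dict) -> dict:
    """Stand-in for the LLM / YouTube Music enrichment stage."""
    return {**row, "genre": "unknown"}

def load(rows) -> int:
    """Stand-in for the Redis persistence stage; returns rows written."""
    return sum(1 for _ in rows)

def run_pipeline(year: int, week: int) -> int:
    """Extract -> Transform (enrich) -> Load, for one chart week."""
    return load(enrich(row) for row in extract(year, week))
```

Because each stage only consumes the previous stage's iterator, the same structure scales to many weeks or to independently deployed worker nodes.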

Roadmap

  • Polyglot Persistence: Migration of cold data to a relational database (PostgreSQL) for complex analytical querying.
  • Observability: Integration of Prometheus/Grafana for pipeline metrics (throughput, API latency, error rates).
  • Containerization: Full Docker and Kubernetes (Helm) support for cloud-agnostic deployment.