machine-learning-ops-ml-pipeline
How to Install
This skill comes from a community source. Check the original listing for install instructions.
General Claude Code install: copy SKILL.md to ~/.claude/skills/
Machine Learning Pipeline - Multi-Agent MLOps Orchestration
Design and implement a complete ML pipeline for: $ARGUMENTS
Use this skill when
- Working on machine learning pipeline - multi-agent mlops orchestration tasks or workflows
- Needing guidance, best practices, or checklists for machine learning pipeline - multi-agent mlops orchestration
Do not use this skill when
- The task is unrelated to machine learning pipeline - multi-agent mlops orchestration
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open
resources/implementation-playbook.md.
Thinking
This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:
- Phase-based coordination: Each phase builds upon previous outputs, with clear handoffs between agents
- Modern tooling integration: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
- Production-first mindset: Every component designed for scale, monitoring, and reliability
- Reproducibility: Version control for data, models, and infrastructure
- Continuous improvement: Automated retraining, A/B testing, and drift detection
The multi-agent approach ensures each aspect is handled by domain experts: - Data engineers handle ingestion and quality - Data scientists design features and experiments - ML engineers implement training pipelines - MLOps engineers handle production deployment - Observability engineers ensure monitoring
Phase 1: Data & Requirements Analysis
Deliverables: 1. Data source audit and ingestion strategy: - Source systems and connection patterns - Schema validation using Pydantic/Great Expectations - Data versioning with DVC or lakeFS - Incremental loading and CDC strategies
-
Data quality framework:
- Profiling and statistics generation
- Anomaly detection rules
- Data lineage tracking
- Quality gates and SLAs
-
Storage architecture:
- Raw/processed/feature layers
- Partitioning strategy
- Retention policies
- Cost optimization
Provide implementation code for critical components and integration patterns.
Deliverables: 1. Feature engineering pipeline: - Transformation specifications - Feature store schema (Feast/Tecton) - Statistical validation rules - Handling strategies for missing data/outliers
-
Model requirements:
- Algorithm selection rationale
- Performance metrics and baselines
- Training data requirements
- Evaluation criteria and thresholds
-
Experiment design:
- Hypothesis and success metrics
- A/B testing methodology
- Sample size calculations
- Bias detection approach
Include feature transformation code and statistical validation logic.
Phase 2: Model Development & Training
Build comprehensive training system: 1. Training pipeline implementation: - Modular training code with clear interfaces - Hyperparameter optimization (Optuna/Ray Tune) - Distributed training support (Horovod/PyTorch DDP) - Cross-validation and ensemble strategies
-
Experiment tracking setup:
- MLflow/Weights & Biases integration
- Metric logging and visualization
- Artifact management (models, plots, data samples)
- Experiment comparison and analysis tools
-
Model registry integration:
- Version control and tagging strategy
- Model metadata and lineage
- Promotion workflows (dev -> staging -> prod)
- Rollback procedures
Provide complete training code with configuration management.
Focus areas: 1. Code quality and structure: - Refactor for production standards - Add comprehensive error handling - Implement proper logging with structured formats - Create reusable components and utilities
-
Performance optimization:
- Profile and optimize bottlenecks
- Implement caching strategies
- Optimize data loading and preprocessing
- Memory management for large-scale training
-
Testing framework:
- Unit tests for data transformations
- Integration tests for pipeline components
- Model quality tests (invariance, directional)
- Performance regression tests
Deliver production-ready, maintainable code with full test coverage.
Phase 3: Production Deployment & Serving
Implementation requirements: 1. Model serving infrastructure: - REST/gRPC APIs with FastAPI/TorchServe - Batch prediction pipelines (Airflow/Kubeflow) - Stream processing (Kafka/Kinesis integration) - Model serving platforms (KServe/Seldon Core)
-
Deployment strategies:
- Blue-green deployments for zero downtime
- Canary releases with traffic splitting
- Shadow deployments for validation
- A/B testing infrastructure
-
CI/CD pipeline:
- GitHub Actions/GitLab CI workflows
- Automated testing gates
- Model validation before deployment
- ArgoCD for GitOps deployment
-
Infrastructure as Code:
- Terraform modules for cloud resources
- Helm charts for Kubernetes deployments
- Docker multi-stage builds for optimization
- Secret management with Vault/Secrets Manager
Provide complete deployment configuration and automation scripts.
Kubernetes-specific requirements: 1. Workload orchestration: - Training job scheduling with Kubeflow - GPU resource allocation and sharing - Spot/preemptible instance integration - Priority classes and resource quotas
-
Serving infrastructure:
- HPA/VPA for autoscaling
- KEDA for event-driven scaling
- Istio service mesh for traffic management
- Model caching and warm-up strategies
-
Storage and data access:
- PVC strategies for training data
- Model artifact storage with CSI drivers
- Distributed storage for feature stores
- Cache layers for inference optimization
Provide Kubernetes manifests and Helm charts for entire ML platform.
Phase 4: Monitoring & Continuous Improvement
Monitoring framework: 1. Model performance monitoring: - Prediction accuracy tracking - Latency and throughput metrics - Feature importance shifts - Business KPI correlation
-
Data and model drift detection:
- Statistical drift detection (KS test, PSI)
- Concept drift monitoring
- Feature distribution tracking
- Automated drift alerts and reports
-
System observability:
- Prometheus metrics for all components
- Grafana dashboards for visualization
- Distributed tracing with Jaeger/Zipkin
- Log aggregation with ELK/Loki
-
Alerting and automation:
- PagerDuty/Opsgenie integration
- Automated retraining triggers
- Performance degradation workflows
- Incident response runbooks
-
Cost tracking:
- Resource utilization metrics
- Cost allocation by model/experiment
- Optimization recommendations
- Budget alerts and controls
Deliver monitoring configuration, dashboards, and alert rules.
Configuration Options
- experiment_tracking: mlflow | wandb | neptune | clearml
- feature_store: feast | tecton | databricks | custom
- serving_platform: kserve | seldon | torchserve | triton
- orchestration: kubeflow | airflow | prefect | dagster
- cloud_provider: aws | azure | gcp | multi-cloud
- deployment_mode: realtime | batch | streaming | hybrid
- monitoring_stack: prometheus | datadog | newrelic | custom
Success Criteria
- Data Pipeline Success:
- < 0.1% data quality issues in production
- Automated data validation passing 99.9% of time
- Complete data lineage tracking
-
Sub-second feature serving latency
-
Model Performance:
- Meeting or exceeding baseline metrics
- < 5% performance degradation before retraining
- Successful A/B tests with statistical significance
-
No undetected model drift > 24 hours
-
Operational Excellence:
- 99.9% uptime for model serving
- < 200ms p99 inference latency
- Automated rollback within 5 minutes
-
Complete observability with < 1 minute alert time
-
Development Velocity:
- < 1 hour from commit to production
- Parallel experiment execution
- Reproducible training runs
-
Self-service model deployment
-
Cost Efficiency:
- < 20% infrastructure waste
- Optimized resource allocation
- Automatic scaling based on load
- Spot instance utilization > 60%
Final Deliverables
Upon completion, the orchestrated pipeline will provide: - End-to-end ML pipeline with full automation - Comprehensive documentation and runbooks - Production-ready infrastructure as code - Complete monitoring and alerting system - CI/CD pipelines for continuous improvement - Cost optimization and scaling strategies - Disaster recovery and rollback procedures
Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.
Details
| Category | Other → General |
| Source | community |
| Stars | N/A |
| Risk Level | N/A |