Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
-
Updated
Jul 16, 2024 - Go
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』核心框架,深度学习&机器学习高性能单机、分布式训练和跨平台部署)
Machine learning library, Distributed training, Deep learning, Models
👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
DLRover: An Automatic Distributed Deep Learning System
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
NAACL '24 (Demo) / MlSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
IDDM (Industrial, landscape, animate...), support DDPM, DDIM, PLMS, webui and multi-GPU distributed training. Pytorch实现,生成模型,扩散模型,分布式训练
Quick intro into the world of distributed machine learning
Everything is born from a simple experiment.
Train across all your devices, ezpz 🍋
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Python Implementation of Forward & Inverse models for biomedical imaging
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
Example ML projects that use the Determined library.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Add a description, image, and links to the distributed-training topic page so that developers can more easily learn about it.
To associate your repository with the distributed-training topic, visit your repo's landing page and select "manage topics."