distributed-training

Here are 151 public repositories matching this topic...

determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

kubernetes data-science machine-learning deep-learning tensorflow keras pytorch hyperparameter-optimization hyperparameter-tuning hyperparameter-search distributed-training ml-infrastructure mlops ml-platform

Updated Jul 16, 2024
Go

pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

python kubernetes components machine-learning airflow deep-learning slurm pipelines pytorch ray aws-batch distributed-training

Updated Jul 16, 2024
Python

PaddlePaddle / Paddle

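PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 飞桨 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)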

python machine-learning deep-learning neural-network scalability efficiency paddlepaddle distributed-training

Updated Jul 16, 2024
C++

NoteDance / Note

Machine learning library, Distributed training, Deep learning, Models

Updated Jul 16, 2024
Python

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

nlp search-engine compression sentiment-analysis transformers information-extraction question-answering llama pretrained-models embedding bert semantic-analysis distributed-training ernie neural-search uie document-intelligence paddlenlp llm

Updated Jul 16, 2024
Python

intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

k8s distributed-training llm-training

Updated Jul 16, 2024
Python

learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

distributed-systems machine-learning deep-learning pytorch dht neural-networks asyncio asynchronous-programming volunteer-computing hivemind distributed-training mixture-of-experts

Updated Jul 16, 2024
Python

skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

Updated Jul 16, 2024
Python

tanyuqian / redco

NAACL '24 (Demo) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference

Updated Jul 15, 2024
Python

huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Updated Jul 15, 2024
Python

chairc / Integrated-Design-Diffusion-Model

IDDM (Industrial, landscape, animate...), supports DDPM, DDIM, PLMS, webui and multi-GPU distributed training. PyTorch implementation, generative model, diffusion model, distributed training.

distributed-computing pytorch generative-model webui industrial unet distributed-training diffusion-models ddpm plms ddim aigc

Updated Jul 14, 2024
Python

seunboy1 / Income-predictor

Quick intro into the world of distributed machine learning

machine-learning scikit-learn sklearn jupyter-notebook feature-selection supervised-learning random-forests distributed-training

Updated Jul 14, 2024
Jupyter Notebook

Hz188 / experiments

Everything is born from a simple experiment.

cmake leetcode learning-by-doing distributed-training

Updated Jul 13, 2024
Python

saforem2 / ezpz

Train across all your devices, ezpz 🍋

python machine-learning launcher rich distributed-training

Updated Jul 16, 2024
Python

FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

machine-learning deep-learning inference-engine model-deployment model-serving distributed-training federated-learning mlops edge-ai ai-agent on-device-training

Updated Jul 11, 2024
Python

synxlin / deep-gradient-compression

[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

deep-learning distributed-training gradient-compression deep-gradient-compression sparse-distributed-training

Updated Jul 10, 2024
Python

prabhatkc / ct-recon

Python Implementation of Forward & Inverse models for biomedical imaging

deep-learning bilateral-filter denoising distributed-training ct-denoising iterative-denoising ct-noise-insertion ct-deep-learning-denoising

Updated Jul 10, 2024
Python

foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.

pytorch distributed-training llm