RamTorch

RAM is All You Need - A PyTorch library for memory-efficient deep learning that enables training and inference of large models that don't fit in GPU memory.

Overview

RamTorch provides CPU-GPU hybrid implementations of neural network components that keep parameters in CPU memory and transfer them to GPU on-demand. This approach dramatically reduces GPU memory usage while maintaining computational efficiency through asynchronous CUDA streams and intelligent batching.

Key Features

Memory-Efficient Linear Layers: CPU-stored parameters with on-demand GPU transfer
Asynchronous CUDA Streams: Overlap computation with data transfer for minimal latency
ZeRO-1 Optimizer Support: Distributed optimizer state sharding across multiple GPUs
Drop-in Replacement: Compatible with existing PyTorch code

Installation

pip install ramtorch

Or install from source:

git clone https://github.com/lodestone-rock/RamTorch.git
cd RamTorch
pip install -e .

Quick Start

Basic Usage

Replace torch.nn.Linear with ramtorch.modules.Linear for automatic memory optimization:

import torch
from ramtorch import Linear

# Standard PyTorch approach (high GPU memory usage)
# linear = torch.nn.Linear(1000, 1000)

# RamTorch approach (low GPU memory usage)
linear = Linear(1000, 1000, device="cuda")

# Use exactly like a normal PyTorch layer
x = torch.randn(32, 1000, device="cuda")
output = linear(x)  # Parameters automatically transferred from CPU to GPU

Building Models

import torch.nn as nn
from ramtorch import Linear

class MemoryEfficientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            Linear(1000, 2000),
            nn.ReLU(),
            Linear(2000, 2000),
            nn.ReLU(),
            Linear(2000, 100)
        )
    
    def forward(self, x):
        return self.layers(x)

model = MemoryEfficientModel()

ZeRO-1 Optimizer Sharding

For distributed training with optimizer state sharding:

import torch.distributed as dist
from ramtorch.zero1 import create_zero_param_groups, broadcast_zero_params

# Initialize distributed training
dist.init_process_group(backend='nccl')
model = YourModel()
all_params = list(model.parameters())
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create ZeRO-1 sharded optimizer
param_groups = [{'params': all_params, 'lr': 1e-3, 'weight_decay': 0.01}]
rank_param_groups = create_zero_param_groups(param_groups, world_size)
optimizer = torch.optim.AdamW(sharded_groups[rank]) # only optimize the shard

# Scheduler works normally with sharded optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward/backward with gradient accumulation
        for micro_batch in split_batch(batch):
            loss = model(micro_batch)
            loss.backward()

        # All-reduce gradients across ranks (you need to implement this)
        all_reduce_gradients(all_params)
        
        # Each rank updates only its owned parameters
        optimizer.step()
        
        # Broadcast updated parameters from owners to all ranks
        broadcast_zero_params(rank_param_groups)
        
        # It has to be model.zero_grad()! because optimizer on each rank only handles its own shard
        model.zero_grad()
        scheduler.step()

Performance Considerations

When to Use RamTorch

Best suited for:

Large models that don't fit in GPU memory
Inference scenarios with memory constraints
Training with limited GPU memory but abundant CPU memory
Distributed training with many parameters

Less suitable for:

Small models that fit comfortably in GPU memory
Scenarios where CPU-GPU bandwidth is the bottleneck
Real-time applications requiring minimal latency

Optimization Tips

Use Larger Batch Sizes: Helps amortize transfer costs
Mixed Precision: Combine with torch.cuda.amp for additional memory savings
Strategic Placement: Use RamTorch layers for the largest components only

Architecture

CPU Bouncing Linear Layer

Stores parameters on CPU memory (with share_memory_() for multiprocessing)
Asynchronously transfers weights to GPU during forward pass
Uses CUDA events for proper stream synchronization

Memory Flow

CPU Memory (Parameters) → Transfer Stream → GPU Memory (Computation) → Result
                     ↑                                                      ↓
                     └────── Cleanup after computation ←──────────────────┘

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use RamTorch in your research, please cite:

@software{ramtorch2025,
  author = {Lodestone},
  title = {RamTorch: Memory-Efficient Deep Learning with CPU-GPU Hybrid Architecture},
  url = {https://github.com/lodestone-rock/RamTorch},
  year = {2025}
}

Acknowledgments

Built on top of PyTorch's excellent automatic differentiation and CUDA stream management capabilities.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RamTorch

Overview

Key Features

Installation

Quick Start

Basic Usage

Building Models

ZeRO-1 Optimizer Sharding

Performance Considerations

When to Use RamTorch

Optimization Tips

Architecture

CPU Bouncing Linear Layer

Memory Flow

Contributing

License

Citation

Acknowledgments

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

RamTorch

Overview

Key Features

Installation

Quick Start

Basic Usage

Building Models

ZeRO-1 Optimizer Sharding

Performance Considerations

When to Use RamTorch

Optimization Tips

Architecture

CPU Bouncing Linear Layer

Memory Flow

Contributing

License

Citation

Acknowledgments