Common issues and solutions for PyTorch Connectomics.
Cause: Package not installed or wrong environment activated.
Solution:

```bash
# Make sure you're in the right environment
conda activate pytc

# Reinstall
cd pytorch_connectomics
pip install -e . --no-build-isolation

# Verify
python -c "import connectomics; print('Success!')"
```

Cause: Old GCC version (common on HPC clusters).
Solution 1 - Use conda (recommended):

```bash
conda activate pytc
conda install -c conda-forge numpy h5py cython connected-components-3d -y
pip install -e . --no-build-isolation
```

Solution 2 - Use Python 3.10:

```bash
conda create -n pytc python=3.10 -y
conda activate pytc
pip install -e .
```

Why? Conda provides pre-built binaries that don't need compilation.
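To confirm the fresh environment can actually import the compiled dependencies, a quick stdlib-only check can help (the package list below is assumed from the install commands above; `check_packages` is a hypothetical helper, not part of the library):

```python
# Report which packages are importable in the active environment.
# Uses only the standard library, so it runs even in a broken setup.
import importlib.util

def check_packages(names):
    """Map each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, ok in check_packages(["numpy", "h5py", "cython", "connectomics"]).items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Anything reported as MISSING points at the package to reinstall via the commands above.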
Cause: Mahotas version incompatibility with NumPy 2.0+. This occurs with older mahotas versions (< 1.4.18).
Solution 1 - Upgrade packages (recommended):

```bash
pip install --upgrade numpy mahotas
```

Solution 2 - Pin compatible versions (quote the specifiers so the shell doesn't treat `>=` as a redirect):

```bash
pip install "numpy>=1.23.0" "mahotas>=1.4.18"
```

Why? Mahotas 1.4.18+ is compatible with NumPy 2.x. The deprecated np.float alias was removed in NumPy 2.0.
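After pinning, it is worth verifying what actually got installed. A stdlib-only sketch for comparing version strings against the minimums above (`version_tuple` is a hypothetical helper, not part of the library):

```python
# Compare dotted version strings numerically, e.g. to check
# mahotas >= 1.4.18 without pulling in extra dependencies.
def version_tuple(version):
    """'1.4.18rc1' -> (1, 4, 18); ignores non-numeric suffixes."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

# Usage (assumes the packages are importable):
# import mahotas, numpy
# assert version_tuple(mahotas.__version__) >= (1, 4, 18)
# assert version_tuple(numpy.__version__) >= (1, 23, 0)
```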
Cause: Matplotlib requires NumPy 1.23 or higher for compatibility.
Solution:

```bash
conda activate pytc
conda install -c conda-forge matplotlib -y
pip install -e . --no-build-isolation
```

Cause: Python version incompatibility (cc3d requires Python 3.10).
Solution:

```bash
# Recreate environment with Python 3.10
conda remove -n pytc --all -y
conda create -n pytc python=3.10 -y
conda activate pytc
pip install -e .
```

Cause: PyTorch CPU-only version installed or CUDA not loaded.
Solution 1 - Reinstall PyTorch with CUDA:

```bash
# For CUDA 12.1
pip uninstall torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Verify
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```

Solution 2 - Load CUDA module (HPC):

```bash
module avail cuda    # See available versions
module load cuda/12.1
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```

Cause: cuDNN not found (common on HPC).
Solution:

```bash
# Load cuDNN module
module load cudnn/8.9.0

# Or install via conda
conda install -c conda-forge cudnn
```

Cause: Batch size or model too large for GPU memory.
Solution 1 - Reduce batch size:

```bash
python scripts/main.py --config tutorials/lucchi.yaml data.dataloader.batch_size=1
```

Solution 2 - Use gradient accumulation:

```yaml
# In config file:
optimization:
  accumulate_grad_batches: 4  # Effective batch size = 4x
system:
  training:
    batch_size: 1
```

Solution 3 - Use mixed precision:

```yaml
optimization:
  precision: "16-mixed"  # Roughly halves memory use
```

Solution 4 - Reduce patch size:

```yaml
data:
  patch_size: [64, 64, 64]  # Smaller patches
```

Cause: Insufficient system memory.
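To see why smaller patches and batches help, here is a back-of-envelope estimate for the input tensors alone (a sketch: real usage adds model weights, gradients, and activations on top; float32 at 4 bytes per voxel is assumed, and `patch_batch_megabytes` is a hypothetical helper):

```python
def patch_batch_megabytes(patch_size, batch_size=1, channels=1, bytes_per_voxel=4):
    """Lower-bound memory (MB) for one batch of input patches."""
    voxels = 1
    for dim in patch_size:
        voxels *= dim
    return batch_size * channels * voxels * bytes_per_voxel / 1024 ** 2

print(patch_batch_megabytes([128, 128, 128], batch_size=4))  # 32.0
print(patch_batch_megabytes([64, 64, 64], batch_size=4))     # 4.0
```

Halving each patch dimension cuts input memory by 8x, which is why patch size is often the most effective knob.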
Solution:

```bash
# Reduce num_workers
python scripts/main.py --config tutorials/lucchi.yaml system.num_workers=2

# Or disable workers entirely
python scripts/main.py --config tutorials/lucchi.yaml system.num_workers=0
```

Cause: Learning rate too high, numerical instability, or bad data.
Solution 1 - Reduce learning rate:

```yaml
optimizer:
  lr: 1e-5  # Try a lower LR (was 1e-4)
```

Solution 2 - Enable gradient clipping:

```yaml
optimization:
  gradient_clip_val: 1.0
```

Solution 3 - Use FP32 instead of FP16:

```yaml
optimization:
  precision: "32"  # More stable than "16-mixed"
```

Solution 4 - Enable anomaly detection:

```yaml
monitor:
  detect_anomaly: true  # Helps find the exact operation producing NaN
```

Solution 5 - Check your data:

```python
# Check for NaN/inf in the training volume
import h5py
import numpy as np

with h5py.File('train_image.h5', 'r') as f:
    data = f['main'][:]
print(f"Has NaN: {np.isnan(data).any()}")
print(f"Has inf: {np.isinf(data).any()}")
print(f"Range: [{data.min()}, {data.max()}]")
```

Cause: Multiple possible reasons.
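If the check above does report NaN or inf, one option is to sanitize the volume before training (a sketch using NumPy; `sanitize` is a hypothetical helper, and whether replacing bad voxels with the finite extremes is appropriate depends on your data and normalization):

```python
import numpy as np

def sanitize(volume):
    """Replace NaN/inf voxels with the finite min/max of the volume."""
    finite = volume[np.isfinite(volume)]
    lo = float(finite.min()) if finite.size else 0.0
    hi = float(finite.max()) if finite.size else 0.0
    return np.nan_to_num(volume, nan=lo, posinf=hi, neginf=lo)

x = np.array([1.0, np.nan, np.inf, -np.inf, 3.0])
print(sanitize(x))  # [1. 1. 3. 1. 3.]
```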
Solution 1 - Use mixed precision:

```yaml
optimization:
  precision: "16-mixed"  # Often ~2x faster
```

Solution 2 - Increase num_workers:

```yaml
system:
  training:
    num_workers: 8  # More parallel data loading
```

Solution 3 - Use pre-loaded cache:

```yaml
data:
  use_preloaded_cache: true  # Load volumes once, crop in memory
```

Solution 4 - Disable progress bar:

```python
# Add to trainer creation in main.py
enable_progress_bar: False
```

Solution 5 - Check GPU utilization:

```bash
nvidia-smi  # Should show high GPU utilization (>80%)
```

Cause: Incorrect path in config.
Solution:

```bash
# Check current directory
pwd
```

```yaml
# Use absolute paths in config
data:
  train_image: "/full/path/to/train_image.h5"

# Or paths relative to the working directory
data:
  train_image: "datasets/train_image.h5"
```

Cause: Corrupted HDF5 file or incomplete download.
Solution:

```bash
# Re-download data
rm corrupted_file.h5
wget https://...

# Verify file integrity
h5ls train_image.h5
```

Cause: Patch size larger than input volume.
Solution:

```yaml
# Reduce patch size
data:
  patch_size: [64, 64, 64]  # Must be smaller than the volume

# Or pad the volume (advanced)
data:
  split_pad_val: true
  split_pad_size: [128, 128, 128]
```

Cause: Network issues or missing credentials.
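The patch-size constraint above is cheap to verify before launching a job. A stdlib-only sketch (`patch_fits` is a hypothetical helper; the shapes shown are made-up examples):

```python
def patch_fits(volume_shape, patch_size):
    """True if every patch dimension fits inside the volume."""
    return len(patch_size) == len(volume_shape) and all(
        p <= v for p, v in zip(patch_size, volume_shape)
    )

print(patch_fits((100, 512, 512), (64, 64, 64)))     # True
print(patch_fits((100, 512, 512), (256, 256, 256)))  # False: 256 > 100 in z
```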
Solution - Manual download:

```bash
# Download from HuggingFace
wget https://huggingface.co/datasets/pytc/tutorial/resolve/main/lucchi%2B%2B.zip
unzip lucchi++.zip -d datasets/

# Or use git-lfs
git lfs install
git clone https://huggingface.co/datasets/pytc/tutorial
```

Cause: Config file missing required fields.
Solution:

```bash
# Use an example config as a template
cp tutorials/lucchi.yaml my_config.yaml

# Check required fields
python -c "from connectomics.config import load_config; load_config('my_config.yaml')"
```

Cause: YAML syntax error or type mismatch.
Solution:

```bash
# Check YAML syntax
python -c "import yaml; yaml.safe_load(open('config.yaml'))"
```

Common issues:
- Use spaces, not tabs
- Quote strings with special characters
- Check indentation

Cause: inference.data.test_image not set in config.
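Of the common YAML mistakes listed above, tabs in indentation are the easiest to catch mechanically. A stdlib-only sketch (`find_tab_indented_lines` is a hypothetical helper):

```python
# Flag lines whose leading whitespace contains a tab character,
# which YAML rejects with an often-cryptic parse error.
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose indentation contains a tab."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            bad.append(lineno)
    return bad

config_text = "data:\n\ttrain_image: a.h5\n  patch_size: [64, 64, 64]\n"
print(find_tab_indented_lines(config_text))  # [2]
```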
Solution:

```yaml
# Add to config file
inference:
  data:
    test_image: "path/to/test_image.h5"
    test_label: "path/to/test_label.h5"  # Optional
```

Cause: Incorrect checkpoint path.
Solution:

```bash
# Find checkpoints
find outputs/ -name "*.ckpt"

# Use the full path
python scripts/main.py --config config.yaml --mode test \
    --checkpoint outputs/experiment/20241012_123456/checkpoints/epoch=099.ckpt
```

Cause: Not on a SLURM cluster or SLURM not in PATH.
Solution:

```bash
# Check if you're on a SLURM system
which sbatch

# If not found, use direct execution instead
python scripts/main.py --config config.yaml
```

Cause: Exceeded memory or time limits.
Solution:

```bash
# Request more memory
#SBATCH --mem=64G

# Request more time
#SBATCH --time=48:00:00

# Check logs
cat slurm-123456.out
```

Cause: Conda not installed or not in PATH.
Solution:

```bash
# Initialize conda
source ~/miniconda3/bin/activate

# Or install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

Cause: Using system Python instead of the conda environment.
Solution:

```bash
# Check which Python is active
which python      # Should show a conda path
python --version  # Should be 3.10

# Activate the correct environment
conda activate pytc
```

If your issue isn't listed here:
- Check logs: Look for detailed error messages in terminal output
- Search issues: GitHub Issues
- Ask community: Slack channel
- Report bug: Create a new GitHub Issue
When reporting issues, include:
- Python version: `python --version`
- PyTorch version: `python -c "import torch; print(torch.__version__)"`
- CUDA version: `nvcc --version` or `nvidia-smi`
- Full error traceback
- Config file (if relevant)
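The version checks above can be bundled into a single script to paste into a bug report (a sketch; `env_report` is a hypothetical helper, and the torch import is wrapped so it still runs when PyTorch is missing):

```python
import platform
import shutil

def env_report():
    """Collect the environment details requested in bug reports."""
    lines = [f"python: {platform.python_version()}"]
    try:
        import torch
        lines.append(f"torch: {torch.__version__}")
        lines.append(f"cuda available: {torch.cuda.is_available()}")
    except ImportError:
        lines.append("torch: not installed")
    lines.append(f"nvcc on PATH: {shutil.which('nvcc') is not None}")
    return "\n".join(lines)

print(env_report())
```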
Safe to ignore. Increase num_workers for faster data loading if desired.
Safe to ignore. This is a PyTorch internal warning and doesn't affect functionality.
Safe to ignore. This is about API changes in future PyTorch versions.
Still stuck? We're here to help! 💬
Join our Slack community