[New Model] iLTM by LennartPurucker · Pull Request #305 · autogluon/tabarena

LennartPurucker · 2026-05-15T12:31:43Z

This PR adds the iLTM model (https://github.com/AI-sandbox/iLTM, https://arxiv.org/abs/2511.15941).

The benchmark is currently running, so more changes might occur if I encounter any bugs.

So far, I have fixed problems in the upstream code to make the model run. Check out the code (iltm_model.py) for details. Two major issues:

iLTM modifies global state and does not reset it; we had to handle this to avoid crashes and other bugs when using iLTM inside pipelines and existing code bases (_isolate_iltm_global_state).
The logger init is lost, and we had to patch it (_ensure_iltm_logger_patched).
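
For readers who have not opened iltm_model.py: a minimal sketch of the snapshot-and-restore pattern that a helper like _isolate_iltm_global_state could follow. This is not the code from the PR. The thread does not list which globals iLTM touches, so the three pieces of state below (environment variables, the `random` module state, the root logger level) and the names `isolate_global_state` and `noisy_library_call` are illustrative assumptions only.

```python
import contextlib
import logging
import os
import random


@contextlib.contextmanager
def isolate_global_state():
    """Snapshot some process-wide state and restore it on exit.

    Hypothetical stand-in: the globals chosen here are assumptions,
    not the ones iLTM actually modifies.
    """
    saved_env = dict(os.environ)
    saved_rng = random.getstate()
    root_logger = logging.getLogger()
    saved_level = root_logger.level
    try:
        yield
    finally:
        # Restore everything, even if the wrapped call raised.
        os.environ.clear()
        os.environ.update(saved_env)
        random.setstate(saved_rng)
        root_logger.setLevel(saved_level)


def noisy_library_call():
    """Stand-in for a third-party call that mutates global state."""
    os.environ["SOME_FLAG"] = "1"
    random.seed(0)
    logging.getLogger().setLevel(logging.DEBUG)
```

Wrapping the fit and predict calls in `with isolate_global_state(): ...` keeps such side effects from leaking into the surrounding pipeline.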

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

LennartPurucker · 2026-05-15T12:38:41Z

Heyho @davidbonet and @salcc,

Would be great to hear your thoughts on how we are using your foundation model and whether we should make any changes. Also, any help adding our workaround in the official codebase would be appreciated!

If you are okay with how we are using it, and once we have the results, we will add it to the leaderboard. With your blessing, we would add it as a verified implementation. Let me know if anything is missing in the code to use it as intended.

LennartPurucker · 2026-05-15T20:15:47Z

Here are the results for the default config.

The results do not look very promising, likely because this model does not use in-context learning and therefore needs tuning. Lately, we have not been tuning foundation models (due to cost), but I would make an exception here to see how much it helps, given that the default is roughly on the level of xRFM. However, given the extreme costs (see Pareto front plot), I would only run 25 configs for now and see how much it improves before I spend even more compute on it.

davidbonet · 2026-05-15T22:08:39Z

Hey @LennartPurucker, I'll take a look and let you know, thanks!

davidbonet · 2026-05-16T00:26:45Z

I’ve taken a quick look and I see some critical points.

After we published iLTM, I quickly tried to run it on TabArena Lite with default+200 configs. I did not complete the run because I had to move on to other projects, but I obtained these results:

Default is not the best, but it seems like it should be higher than what you currently have.
iLTM falls in a gray area: it could be called a “foundation model” under the framing in Breugel and Van der Schaar (2024), where an LTM is defined within the broader “foundation model” framing by large-scale training and adaptability, including fine-tuning. Since our method does not use complex ICL, we do not claim zero-shot inference, and it is MLP-based, an evaluation similar to the one done for RealMLP would be fairer. As you can see, fine-tuning helps a lot and is indeed needed for iLTM. So for non-ICL-based models, I think it would be fair to keep the same number of trials as for the rest.

Regarding the implementation:

I think the main issue right now is that there is no explicit cat_feature inference in the wrapper. Categorical columns should be inferred by dtype inside ILTMModel, which matches how we developed it.
Disable any AG feature generator and preprocessing. Do not call self.preprocess(): iLTM does its own preprocessing internally, and anything else might introduce unexpected behaviour. It also spends part of the time limit on redundant work, and we always want to use our internal processing.
Update the remaining time passed to fit_max_time to align with iLTM’s internal timer. Without that update, the AG wrapper would stop it in many cases because of an incorrect remaining time.

So:

# requires `import time` at module level
# just after def _fit(...):
start_time = time.time()

hps = self._get_model_params().copy()
cat_cols = X.select_dtypes(include=["category", "object", "string"]).columns.tolist()
# Pass those indices to iLTM as cat_features:
hps["cat_features"] = [X.columns.get_loc(c) for c in cat_cols]


self.model = model_cls(
    **hps,
    device=device,
)

# DO NOT CALL X = self.preprocess(X, y=y)
# DO NOT CALL X_val = self.preprocess(X_val)

if X_val is not None:
    eval_set = (X_val, y_val)
else:
    eval_set = None

# AutoGluon's time_limit is already the fold-level time limit.
# Account for setup overhead, but always leave a minimal 1-second budget,
# even if setup took longer than the limit.
setup_overhead = time.time() - start_time
if time_limit is not None:
    remaining_time = max(1.0, time_limit - setup_overhead)
else:
    remaining_time = None

self.model.fit(
    X=X,
    y=y,
    eval_set=eval_set,
    fit_max_time=remaining_time,
)

It is a very quick response and I could look further, but I think these are the most relevant things right now that I hope you can modify and re-launch with that.
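
To make the time accounting above easy to check in isolation, here is a minimal sketch of the same rule as a standalone helper. The name `remaining_fit_time` and the `min_budget` and `now` parameters are hypothetical and exist only so the arithmetic can be tested without a real fit; they are not part of iLTM or AutoGluon.

```python
import time


def remaining_fit_time(time_limit, start_time, min_budget=1.0, now=None):
    """Return the budget to hand to the model's own timer.

    time_limit: fold-level limit in seconds, or None for "no limit".
    start_time: timestamp taken at the top of _fit.
    min_budget: floor applied when setup already consumed the limit.
    now:        injectable clock value for testing (defaults to time.time()).
    """
    if time_limit is None:
        return None
    now = time.time() if now is None else now
    setup_overhead = now - start_time
    return max(min_budget, time_limit - setup_overhead)


# 2.5 s of setup out of a 60 s limit leaves 57.5 s for the model.
print(remaining_fit_time(60.0, start_time=100.0, now=102.5))
```

The floor means a fold whose setup overran its limit still gets a short, non-zero budget instead of a zero or negative one.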

LennartPurucker · 2026-05-17T13:04:01Z

Very cool that you ran it on TabArena as well! Here are the results for default on Lite, which I believe match your findings (i.e., roughly above xRFM, plus/minus having more models in the mix). Lite is, in general, a bit noisier than using all splits.

Results on TabArena-Lite:

including fine-tuning

Gotcha, I think TabSTAR uses the same definition! In both cases, I think this is fine and am doing HPO right now, but the compute costs of iLTM are quite high compared to a model like RealMLP. Ideally, if we get more compute grants, this will resolve itself. Or it will take a bit longer to fully integrate the results.

cat_feature inference in the wrapper. [...] Categorical columns should be inferred by dtype [...] Disable any AG feature generator and preprocessing.

Great, this is what is happening in the wrapper, actually! The wrapper only changes the categorical codes, not the dtype itself. The preprocess call is otherwise a pass-through. The main goal is not to pass strings or objects that consume more RAM/VRAM for features that downstream models treat as categoricals.

We usually only disable model-agnostic preprocessing for models that take semantics into account, e.g., TabSTAR.
As this is just a pass-through, the time limit is not meaningfully affected.

stop it many times due to incorrect remaining time.

Do you have any numbers or details on this? Assuming iLTM accounts for an overhead of a few seconds, this should not occur. And as far as I can tell, it does account for such an overhead.

LennartPurucker · 2026-05-20T11:56:05Z

Heyho @davidbonet, I am running into an issue, and was wondering if your code handles this:

OOM issue: does the code check for this somehow and avoid running out of memory? Weirdly, this happens for a dataset with 1k rows and 5 features. Moreover, it happens at inference time; should I maybe add batching?

I am running on 40GB VRAM GPUs.

davidbonet · 2026-05-20T21:09:07Z

Hey @LennartPurucker, we have some basic safeguards, but probably not enough 🙃

My guess right now is that it could happen at inference time even with small datasets: if fitting is fast, iLTM can build big ensembles within the time budget, and some layers can grow large depending on the hyperparameters and might not fit on a 40GB GPU.

Could you tell me what hyperparameter configuration is giving OOM so I can look into it? Thank you!

LennartPurucker · 2026-05-21T07:57:08Z

The config is:

{
	'<class 'tabarena.benchmark.models.ag.iltm.iltm_model.ILTMModel'>': [{'ag_args': {'name_suffix': '_r2'}, 'ag_args_ensemble': {'model_random_seed': 16, 'vary_seed_across_folds': True, 'ag.max_time_limit': 3600}, 'batch_size': 4096, 'checkpoint': 'xgbrconcat', 'clip_predictions': True, 'corr_select_k': 0, 'do_retrieval': False, 'finetuning_batch_size': 256, 'finetuning_dropout': 0.15, 'finetuning_lr': 0.0001491001661, 'finetuning_max_steps': 2048, 'finetuning_optimizer': 'lion', 'gradient_clip_norm': 1.0481195725818, 'n_ensemble': 32, 'retrieval_alpha': 0.6892356429052, 'retrieval_distance': 'euclidean', 'retrieval_temperature': 1.7544502496689, 'scheduler_min_lr': 1.40255432e-05, 'tree_bagging_temperature': 0.8127677217625, 'tree_data_split': 'dynamic', 'tree_feature_fraction': 0.6846355765522, 'tree_gamma': 0.0, 'tree_l2_leaf_reg': 0.5, 'tree_lr': 0.6988269798193, 'tree_max_depth': 6, 'tree_min_samples_leaf': 8, 'tree_n_estimators': 125, 'tree_subsample': 0.9096010335321}],
}

The full log is:

cpu-bind=MASK - dlc2gpu12, task  0  0 [1033791]: mask 0xf000000000000000f00000000 set
+ command -v jq
+ JSON_FILE=/work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ echo 'Using JSON file: /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json'
Using JSON file: /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ J=13
+ echo 'Selected Job Index: 13'
Selected Job Index: 13
++ jq -r .defaults.python /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ PYTHON_PATH=/work/dlclarge2/purucker-tabarena/venvs/tabarena_11052025/bin/python
++ jq -r .defaults.run_script /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ RUNSCRIPT=/work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/run_tabarena_experiment.py
++ jq -r .defaults.openml_cache_dir /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ OPENML_CACHE_DIR=/work/dlclarge2/purucker-tabarena/.openml-cache
++ jq -r .defaults.configs_yaml_file /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ CONFIGS_YAML_FILE=/work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/benchmark_configs_benchmark_iltm_14052026.yaml
++ jq -r .defaults.output_dir /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ OUTPUT_DIR=/work/dlclarge2/purucker-tabarena/output/benchmark_iltm_14052026
++ jq -r .defaults.num_cpus /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ NUM_CPUS=8
++ jq -r .defaults.num_gpus /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ NUM_GPUS=1
++ jq -r .defaults.num_gpus_model /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ NUM_GPUS_MODEL=null
++ jq -r .defaults.memory_limit /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ MEMORY_LIMIT=32
++ jq -r .defaults.setup_ray_for_slurm_shared_resources_environment /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ SETUP_RAY=true
++ jq -r .defaults.ignore_cache /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ IGNORE_CACHE=false
++ jq -r .defaults.sequential_local_fold_fitting /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ SEQUENTIAL_LOCAL_FOLD_FITTING=false
++ jq -r '.defaults.dynamic_tabarena_validation_protocol // false' /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ DYNAMIC_TABARENA_VALIDATION_PROTOCOL=false
+ echo 'Python Path: /work/dlclarge2/purucker-tabarena/venvs/tabarena_11052025/bin/python'
Python Path: /work/dlclarge2/purucker-tabarena/venvs/tabarena_11052025/bin/python
+ echo 'Run Script: /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/run_tabarena_experiment.py'
Run Script: /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/run_tabarena_experiment.py
+ echo 'OpenML Cache Directory: /work/dlclarge2/purucker-tabarena/.openml-cache'
OpenML Cache Directory: /work/dlclarge2/purucker-tabarena/.openml-cache
+ echo 'Configs YAML File: /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/benchmark_configs_benchmark_iltm_14052026.yaml'
Configs YAML File: /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/benchmark_configs_benchmark_iltm_14052026.yaml
+ echo 'Output Directory: /work/dlclarge2/purucker-tabarena/output/benchmark_iltm_14052026'
Output Directory: /work/dlclarge2/purucker-tabarena/output/benchmark_iltm_14052026
+ echo 'Number of CPUs: 8'
Number of CPUs: 8
+ echo 'Number of GPUs: 1'
Number of GPUs: 1
+ echo 'Number of GPUs for model fitting: null'
Number of GPUs for model fitting: null
+ echo 'Memory Limit: 32'
Memory Limit: 32
+ echo 'Setup Ray for SLURM Shared Resources Environment: true'
Setup Ray for SLURM Shared Resources Environment: true
+ echo 'Ignore Cache: false'
Ignore Cache: false
+ echo 'Sequential Local Fold Fitting: false'
Sequential Local Fold Fitting: false
+ echo 'Dynamic TabArena Validation Protocol: false'
Dynamic TabArena Validation Protocol: false
++ jq -r --argjson J 13 '(.jobs[$J].items // null) != null' /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ HAS_ITEMS=false
+ '[' false = true ']'
++ jq -r --argjson J 13 '.jobs[$J].config_index | join(",")' /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ CONFIG_INDEX=2,3,4,5,7
++ jq -r --argjson J 13 '.jobs[$J].task_id' /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ TASK_ID=363612
++ jq -r --argjson J 13 '.jobs[$J].fold' /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ FOLD=1
++ jq -r --argjson J 13 '.jobs[$J].repeat' /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/slurm_run_data_benchmark_iltm_14052026.json
+ REPEAT=1
+ echo 'Task ID: 363612'
Task ID: 363612
+ echo 'Fold: 1'
Fold: 1
+ echo 'Repeat: 1'
Repeat: 1
+ echo 'CONFIG_INDEX: 2,3,4,5,7'
CONFIG_INDEX: 2,3,4,5,7
+ IFS=,
+ read -ra CONFIG_ARRAY
+ for CI in "${CONFIG_ARRAY[@]}"
+ run_one 363612 1 1 2
+ local TASK_ID=363612
+ local FOLD=1
+ local REPEAT=1
+ local CI=2
+ echo 'Running task_id=363612 fold=1 repeat=1 config_index=2'
Running task_id=363612 fold=1 repeat=1 config_index=2
+ /work/dlclarge2/purucker-tabarena/venvs/tabarena_11052025/bin/python /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/run_tabarena_experiment.py --task_id 363612 --fold 1 --repeat 1 --config_index 2 --configs_yaml_file /work/dlclarge2/purucker-tabarena/code/tabarena_new/tabarena/tabflow_slurm/benchmark_configs_benchmark_iltm_14052026.yaml --openml_cache_dir /work/dlclarge2/purucker-tabarena/.openml-cache --output_dir /work/dlclarge2/purucker-tabarena/output/benchmark_iltm_14052026 --num_cpus 8 --num_gpus 1 --num_gpus_model null --memory_limit 32 --setup_ray_for_slurm_shared_resources_environment true --ignore_cache false --sequential_local_fold_fitting false --dynamic_tabarena_validation_protocol false
GPUs for node/Ray: 1, GPUs for model fitting: 1
Setting OpenML cache directory to: /work/dlclarge2/purucker-tabarena/.openml-cache
Setting up Ray for SLURM job in a shared resources environment.
2026-05-20 12:41:54,650	INFO worker.py:2007 -- Started a local Ray instance.
Running Experiments, saving to: '/work/dlclarge2/purucker-tabarena/output/benchmark_iltm_14052026'...
	Fitting 1 tasks with a total of 1 fold-repeat pairs
	Fitting 1 methods with 1 fold-repeat pairs for a total of 1 jobs...
	TIDs    : [363612]
	Repeat-Fold-Pairs-Per-Task (first 20): [[(1, 1)]]
	Methods : ['TA-iLTM_r2_BAG_L1']
Starting Dataset 1/1...
Starting Split 1/1 (Fold 1, Repeat 1)...
Starting Model 1/1...
	0/1 ran | 0 success | 0 fail | 0 cache_exists | 0 missing | Fitting 363612 on repeat 1, fold 1 for method TA-iLTM_r2_BAG_L1
Using eval metric: rmse
Generating cache (exists=False, ignore_cache=False, cache_file="/work/dlclarge2/purucker-tabarena/output/benchmark_iltm_14052026/data/TA-iLTM_r2_BAG_L1/363612/1_1/results.pkl")
start: Evaluate function
2026-05-20 12:42:01.440 | INFO     | tabarena.benchmark.models.wrapper.AutoGluon_class:_resolve_validation_protocol:65 - Using num_folds: 8
No path specified. Models will be saved in: "/tmp/ag/ag-20260520_104201"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version:  1.5.1b20260511
Python Version:     3.12.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #111~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 14 17:13:45 UTC 
CPU Count:          8
Pytorch Version:    2.9.1+cu128
CUDA Version:       12.8
GPU Memory:         GPU 0: 44.39/44.39 GB
Total GPU Memory:   Free: 44.39 GB, Allocated: 0.00 GB, Total: 44.39 GB
GPU Count:          1
Memory Avail:       1093.23 GB / 1133.56 GB (96.4%)
Disk Space Avail:   1544.86 GB / 1755.51 GB (88.0%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme'  : New in v1.5: The state-of-the-art for tabular data. Massively better than 'best' on datasets <100000 samples by using new Tabular Foundation Models (TFMs) meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, TabDPT, and TabM. Requires a GPU and `pip install autogluon.tabular[tabarena]` to install TabPFN, TabICL, and TabDPT.
	presets='best'     : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='best_v150': New in v1.5: Better quality than 'best' and 5x+ faster to train. Give it a try!
	presets='high'     : Strong accuracy with fast inference speed.
	presets='high_v150': New in v1.5: Better quality than 'high' and 5x+ faster to train. Give it a try!
	presets='good'     : Good accuracy with very fast inference speed.
	presets='medium'   : Fast training time, ideal for initial prototyping.
Enforcing custom memory (soft) limit of 32 GB!
Beginning AutoGluon training ...
AutoGluon will save models to "/tmp/ag/ag-20260520_104201"
Train Data Rows:    1002
Train Data Columns: 5
Label Column:       __label__
Problem Type:       regression
Preprocessing data...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    31855.32 MB
	Train Data (Original)  Memory Usage: 0.03 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('category', []) : 1 | ['attack-angle']
		('float', [])    : 3 | ['chord-length', 'free-stream-velocity', 'suction-side-displacement-thickness']
		('int', [])      : 1 | ['frequency']
	Types of features in processed data (raw dtype, special dtypes):
		('category', []) : 1 | ['attack-angle']
		('float', [])    : 3 | ['chord-length', 'free-stream-velocity', 'suction-side-displacement-thickness']
		('int', [])      : 1 | ['frequency']
	0.0s = Fit runtime
	5 features in original data used to generate 5 features in processed data.
	Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.02s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
Large model count detected (28 configs) ... Only displaying the first 3 models of each family. To see all, set `verbosity=3`.
User-specified model hyperparameters to be fit:
{
	'<class 'tabarena.benchmark.models.ag.iltm.iltm_model.ILTMModel'>': [{'ag_args': {'name_suffix': '_r2'}, 'ag_args_ensemble': {'model_random_seed': 16, 'vary_seed_across_folds': True, 'ag.max_time_limit': 3600}, 'batch_size': 4096, 'checkpoint': 'xgbrconcat', 'clip_predictions': True, 'corr_select_k': 0, 'do_retrieval': False, 'finetuning_batch_size': 256, 'finetuning_dropout': 0.15, 'finetuning_lr': 0.0001491001661, 'finetuning_max_steps': 2048, 'finetuning_optimizer': 'lion', 'gradient_clip_norm': 1.0481195725818, 'n_ensemble': 32, 'retrieval_alpha': 0.6892356429052, 'retrieval_distance': 'euclidean', 'retrieval_temperature': 1.7544502496689, 'scheduler_min_lr': 1.40255432e-05, 'tree_bagging_temperature': 0.8127677217625, 'tree_data_split': 'dynamic', 'tree_feature_fraction': 0.6846355765522, 'tree_gamma': 0.0, 'tree_l2_leaf_reg': 0.5, 'tree_lr': 0.6988269798193, 'tree_max_depth': 6, 'tree_min_samples_leaf': 8, 'tree_n_estimators': 125, 'tree_subsample': 0.9096010335321}],
}
Custom Model Type Detected: <class 'tabarena.benchmark.models.ag.iltm.iltm_model.ILTMModel'>
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: TA-iLTM_r2_BAG_L1 ...
	Time limit adjusted due to model hyperparameters: None -> 3600.00s (ag.max_time_limit=3600, ag.max_time_limit_ratio=1.0, ag.min_time_limit=0)
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (1 workers, per: cpus=8, gpus=1, memory=0.00%)
(_ray_fit pid=1040964) 2026-05-20 12:47:30 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=18.31s < needed=19.96s; avg_pred=19.94s, cushion=+0%). Stopping at 16/32 predictors.
(_ray_fit pid=1046182) 2026-05-20 12:53:42 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=10.70s < needed=17.23s; avg_pred=17.22s, cushion=+0%). Stopping at 19/32 predictors.
(_ray_fit pid=1048117) 2026-05-20 13:00:08 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=5.73s < needed=18.47s; avg_pred=18.45s, cushion=+0%). Stopping at 18/32 predictors.
(_ray_fit pid=1050042) 2026-05-20 13:06:17 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=16.65s < needed=16.92s; avg_pred=16.90s, cushion=+0%). Stopping at 19/32 predictors.
(_ray_fit pid=1052126) 2026-05-20 13:12:31 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=13.96s < needed=16.21s; avg_pred=16.19s, cushion=+0%). Stopping at 20/32 predictors.
(_ray_fit pid=1054087) 2026-05-20 13:18:53 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=7.26s < needed=18.38s; avg_pred=18.36s, cushion=+0%). Stopping at 18/32 predictors.
(_ray_fit pid=1056005) 2026-05-20 13:25:06 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=15.06s < needed=20.19s; avg_pred=20.17s, cushion=+0%). Stopping at 16/32 predictors.
(_ray_fit pid=1057782) 2026-05-20 13:31:26 iltm.inference_interface WARNING : Early return: time budget nearly exhausted (remaining=1.73s < needed=15.30s; avg_pred=15.28s, cushion=+0%). Stopping at 22/32 predictors.
	-1.4489	 = Validation score   (-root_mean_squared_error)
	2959.45s	 = Training   runtime
	154.21s	 = Validation runtime
AutoGluon training complete, total runtime = 3014.96s ... Best model: TA-iLTM_r2_BAG_L1 | Estimated inference throughput: 0.8 rows/s (126 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/tmp/ag/ag-20260520_104201")
/var/spool/slurm/job29045189/slurm_script: line 74: 1033892 Killed                  $PYTHON_PATH $RUNSCRIPT --task_id $TASK_ID --fold $FOLD --repeat $REPEAT --config_index $CI --configs_yaml_file $CONFIGS_YAML_FILE --openml_cache_dir $OPENML_CACHE_DIR --output_dir $OUTPUT_DIR --num_cpus $NUM_CPUS --num_gpus $NUM_GPUS --num_gpus_model $NUM_GPUS_MODEL --memory_limit $MEMORY_LIMIT --setup_ray_for_slurm_shared_resources_environment $SETUP_RAY --ignore_cache $IGNORE_CACHE --sequential_local_fold_fitting $SEQUENTIAL_LOCAL_FOLD_FITTING --dynamic_tabarena_validation_protocol $DYNAMIC_TABARENA_VALIDATION_PROTOCOL
slurmstepd-dlc2gpu12: error: Detected 1 oom_kill event in StepId=29045189.batch. Some of the step tasks have been OOM Killed.

I will look into it more once other jobs have finished, if needed. In the worst case, we can increase the VRAM for a few jobs or add some batching.
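
The "Early return" warnings in the log above suggest a loop that stops adding predictors once the remaining budget is smaller than the average time per predictor (plus a cushion). A rough sketch of that kind of loop follows. This is my reading of the log message, not iLTM's implementation; `build_until_budget`, `make_predictor`, and the injectable `clock` are hypothetical names.

```python
import time


def build_until_budget(make_predictor, n_ensemble, time_budget,
                       cushion=0.0, clock=time.monotonic):
    """Build up to n_ensemble predictors, stopping early when time runs low.

    After the first predictor, the average build time so far is used as the
    estimate for the next one; the loop stops if the remaining budget is
    smaller than that estimate scaled by (1 + cushion).
    """
    start = clock()
    predictors = []
    for _ in range(n_ensemble):
        elapsed = clock() - start
        if predictors:
            avg_pred = elapsed / len(predictors)
            needed = avg_pred * (1.0 + cushion)
            remaining = time_budget - elapsed
            if remaining < needed:
                break  # analogous to "Stopping at k/n predictors"
        predictors.append(make_predictor())
    return predictors
```

With a loop of this shape, a small dataset and a generous budget let the ensemble reach its full size, which is relevant to any memory growth per predictor.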

salcc · 2026-05-22T09:36:19Z

Hey! It seems this OOM is about the CPU memory limit, as there is no torch.cuda.OutOfMemoryError and the log shows the Slurm oom_kill event. We realized that for small datasets and with enough time budget, iLTM can keep accumulating predictors to ensemble, which leads to the CPU OOM crash.

We have now implemented a fix that stores the RF weights of the predictors in lower precision, which reduces the RAM needed without significantly impacting performance and, in this particular configuration, allows it to run using less than 32 GB of RAM. To avoid crashes on other configurations that could use even more predictors, we also implemented a check that stops creating predictors when the used RAM would exceed the limit defined in the cgroup by Slurm, in AG_MEMORY_LIMIT_IN_GB by AutoGluon, or given as an extra argument. Both options are active by default in the latest iLTM version, so just updating the package should fix the problem (pip install "iltm>=0.1.1"; the quotes keep the shell from treating >= as a redirect).
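
For illustration only, a minimal sketch of how such a limit could be resolved from the three sources named above and then used as a guard. This is not the code shipped in iLTM. The function names, the `extra_limit_gb` parameter, the cgroup v2 file path, and the choice to take the smallest limit and to treat 1 GB as 1024**3 bytes are all assumptions; the real cgroup file depends on the node's cgroup version and the job's cgroup path.

```python
import os
from pathlib import Path


def resolve_memory_limit_bytes(extra_limit_gb=None, env=None,
                               cgroup_file="/sys/fs/cgroup/memory.max"):
    """Return the tightest known RAM limit in bytes, or None if unknown."""
    env = os.environ if env is None else env
    candidates = []
    if extra_limit_gb is not None:
        candidates.append(int(float(extra_limit_gb) * 1024**3))
    ag_limit = env.get("AG_MEMORY_LIMIT_IN_GB")
    if ag_limit:
        candidates.append(int(float(ag_limit) * 1024**3))
    try:
        raw = Path(cgroup_file).read_text().strip()
        if raw.isdigit():  # cgroup v2 writes "max" when there is no limit
            candidates.append(int(raw))
    except OSError:
        pass  # file missing or unreadable: no cgroup information
    return min(candidates) if candidates else None


def can_add_predictor(used_bytes, predictor_bytes, limit_bytes):
    """True if one more predictor fits under the limit (or no limit is known)."""
    return limit_bytes is None or used_bytes + predictor_bytes <= limit_bytes
```

A predictor-building loop would call `can_add_predictor` before each new predictor and stop once it returns False.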

However, we also noticed a separate AutoGluon/TabArena wrapper-side memory issue. In TabArena, this is triggered after fit/evaluation during result metadata collection: ExperimentRunner.post_evaluate() calls the model wrapper’s get_metadata(), which calls get_metadata_fit(), which loads the best AutoGluon model and calls model.get_info(include_feature_metadata=False). If the model is bagged, BaggedEnsembleModel.get_info() calls BaggedEnsembleModel._get_child_info() to populate child metadata such as child memory sizes and bagged ensemble memory estimates. Inside _get_child_info(), AutoGluon calls child_type.load_info(child_path), which looks for info.pkl. In this case info.pkl is missing, as AbstractModel.save_info() is never called, so AutoGluon falls back to loading model.pkl and calling model.get_info() on it. Then model.get_info() computes memory_size via pickle.dumps(self), which serializes the model again and can temporarily create another large in-memory copy. We believe this is unfair to iLTM in this TabArena/AutoGluon wrapper path, because this memory spike is not needed for training or inference.

We think it could be fixed by defining the following method in class ILTMModel(AbstractTorchModel) (it needs from pathlib import Path at module level):

  def get_memory_size(self, allow_exception: bool = False) -> int | None:
      try:
          model_file = Path(self.path) / self.model_file_name
          if model_file.exists():
              return model_file.stat().st_size
          if allow_exception:
              return None
          return super().get_memory_size(allow_exception=allow_exception)
      except Exception:
          if allow_exception:
              return None
          raise

This matches AutoGluon’s serialized-size semantics, avoids the extra in-memory pickle copy, and keeps the change scoped to TabArena’s ILTMModel wrapper.
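
To illustrate the trade-off in isolation, a small stdlib-only sketch, independent of AutoGluon: measuring a serialized size with pickle.dumps builds the full byte string in memory, while asking the filesystem for the size of an already saved file does not. The toy `model` dict and the helper names are made up for this example, and the file is written with the same pickle call so that the two numbers agree.

```python
import pickle
import tempfile
from pathlib import Path


def serialized_size_in_memory(obj):
    """Size in bytes, computed by pickling the object again in RAM."""
    return len(pickle.dumps(obj))


def serialized_size_on_disk(path):
    """Size in bytes of an already saved file; nothing is loaded or copied."""
    return Path(path).stat().st_size


def demo():
    # Toy stand-in for a fitted model.
    model = {"weights": [float(i) for i in range(1000)]}
    with tempfile.TemporaryDirectory() as tmp:
        model_file = Path(tmp) / "model.pkl"
        model_file.write_bytes(pickle.dumps(model))  # save once
        return (serialized_size_on_disk(model_file),
                serialized_size_in_memory(model))


on_disk, in_memory = demo()
print(on_disk == in_memory)
```

For a large model, the in-memory variant temporarily holds a second serialized copy, which is the spike described above; the on-disk variant only reads file metadata.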

LennartPurucker · 2026-05-22T12:18:01Z

Very cool, thank you for the update on the code!

Related to get_memory_size: We are not using this information in any evaluations/plots yet; we call the function only to collect extra metadata we might want to look into in the future. The peak memory usage from this does not affect training or inference "runs", and all other models have the same "disadvantage" so far. Also note that the pickle size is the true disk size required for the entire abstract model class, but it can be heavily optimized by unloading the model into separate formats.

Nevertheless, great to raise this point! We likely want to fix this in the future for all models and actually plot and investigate disk space usage. CC @Innixma

LennartPurucker added 2 commits May 15, 2026 00:08

add: iLTM first draft

2b2fcf6

fix: iLTM bug with logger in __init__

a44607f

LennartPurucker added the new model label May 15, 2026

LennartPurucker marked this pull request as draft May 15, 2026 12:35

LennartPurucker added 3 commits May 15, 2026 15:13

latest state

01bdded

Merge remote-tracking branch 'origin/main' into add_iltm

fe99ed5

# Conflicts: # tabarena/pyproject.toml # tabarena/tabarena/benchmark/models/model_registry.py # tabarena/tabarena/models/utils.py # tabflow_slurm/run_setup_slurm_jobs.py

fix log

b634d5c

LennartPurucker marked this pull request as ready for review May 15, 2026 20:11

add: search space for iLTM

b9ebb5e

fix: pass cat indices to avoid crash for dtype mismatch; upgrade to n…

f75c6c5

…ewest iltm version

LennartPurucker wants to merge 7 commits into main from add_iltm

3 participants