A Collaborative Initiative Between Odia Generative AI and the Norwegian BioAI Lab for Deep Learning Model Compression and Deployment.
In this repository, we compare two LLM compression techniques (illustrative loading sketches for each appear below):
- bitsandbytes (NF4 quantization)
- AWQ (Activation-aware Weight Quantization)
Objective: Evaluate 4-bit and 8-bit quantization techniques for on-device Odia LLM deployment.
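As a minimal sketch of the bitsandbytes path, the snippet below loads a causal LM in 4-bit NF4 through the Hugging Face Transformers integration. The model ID is a placeholder, and the compute dtype and double-quantization settings are common defaults, not necessarily the exact configuration used in the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model ID; substitute the actual Odia LLM checkpoint.
model_id = "your-org/odia-llm-7b"

# 4-bit NF4 quantization config; double quantization saves a bit more VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs/CPU automatically
)
```

Quantization here happens on the fly at load time, which is why bitsandbytes suits fast prototyping: no separate calibration or conversion step is needed.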
| Technique | VRAM Required (7B Model) | Speed (Tokens/sec) | Accuracy Drop (Est.) | Best For |
|---|---|---|---|---|
| Baseline (FP16) | ~14.5 GB | 25 | 0% | Cloud Servers |
| bitsandbytes (NF4) | ~5.2 GB | 18 | Minimal | Fast Prototyping |
| AWQ (INT4) | ~4.8 GB | 35 | Negligible | Edge Deployment |
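For the AWQ path, a sketch along the following lines (using the AutoAWQ library) quantizes a model to INT4 and then reloads the quantized checkpoint for inference. The model paths are placeholders, and the quantization config shown uses AutoAWQ's commonly documented defaults rather than the notebook's exact settings.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths; substitute the actual Odia LLM checkpoint.
model_path = "your-org/odia-llm-7b"
quant_path = "odia-llm-7b-awq"

# Typical AutoAWQ 4-bit config (group size and kernel version are common
# defaults, not necessarily what the notebook uses).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

# Calibrate against activations and quantize the weights, then save.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Later, load the INT4 checkpoint with fused kernels for fast inference.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
```

Unlike bitsandbytes, AWQ requires this one-time calibration and conversion step, but the resulting fused INT4 kernels are what make it the faster option for edge deployment in the comparison above.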
You can find the Jupyter Notebook with all of these tests and a comparison of the results of each technique here.