
This paper investigates the challenges and opportunities of Small Language Models as efficient, privacy-aware alternatives to Large Language Models in resource-constrained and real-time environments. It introduces the underlying concepts and combines an up-to-date, comprehensive yet compact literature review of architectural and optimization techniques for Small Language Models with a systematic experimental evaluation of selected prototype models that integrate fine-tuning, Retrieval-Augmented Generation, and model quantization for multi-platform deployment. The aim is to pave the way towards practical implementations that offer measurable improvements over existing methods and can be readily adopted in applied settings.
Small Language Models; fine-tuning; LoRA; quantization; Retrieval-Augmented Generation; on-device deployment; Gemma