Open-Source LLMs for Technical Q&A: Lessons from StackExchange
# Description
This paper investigates whether open-source Large Language Models (LLMs) can effectively answer technical software engineering questions, particularly those posted on StackExchange. While proprietary models such as GPT-4 and Claude-3 dominate research and practice, they remain costly and less accessible. The study evaluates six open-source models—Solar-10.7B, CodeLlama-7B, Mistral-7B, Qwen2-7B, StarCoder2-7B, and LLaMA-3-8B—both within a retrieval-augmented generation (RAG) pipeline and after fine-tuning. The dataset was derived from 73,560 StackExchange posts (2014–2023), retaining only questions with accepted answers to form reliable question–answer (QA) pairs. Experiments compare base model performance, RAG-enhanced settings, and fine-tuned RAG configurations using semantic alignment, fluency, and similarity metrics. Findings show that open-source LLMs, especially Solar-10.7B with RAG and fine-tuning, approach expert-level responses, suggesting a cost-effective and scalable alternative for developer support.
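To make the dataset construction concrete, here is a minimal sketch of how accepted-answer QA pairs could be extracted from a StackExchange posts dump. The attribute names (`PostTypeId`, `AcceptedAnswerId`) follow the public StackExchange data-dump schema; the rest is an illustrative assumption, not the paper's exact pipeline.

```python
# Minimal sketch: extract (question, accepted answer) pairs from a
# StackExchange Posts.xml dump. Attribute names follow the public
# data-dump schema; everything else is illustrative, not the paper's code.
import xml.etree.ElementTree as ET

def load_qa_pairs(posts_xml_path):
    """Yield (question_text, accepted_answer_text) pairs."""
    questions, answers = {}, {}
    for _, row in ET.iterparse(posts_xml_path, events=("end",)):
        if row.tag != "row":
            continue
        if row.get("PostTypeId") == "1" and row.get("AcceptedAnswerId"):
            # A question that has an accepted answer.
            questions[row.get("Id")] = (
                row.get("Title", ""), row.get("Body", ""), row.get("AcceptedAnswerId")
            )
        elif row.get("PostTypeId") == "2":
            # An answer; keep it so we can join on AcceptedAnswerId below.
            answers[row.get("Id")] = row.get("Body", "")
        row.clear()  # release parsed elements to bound memory on large dumps
    for title, body, accepted_id in questions.values():
        if accepted_id in answers:
            yield f"{title}\n{body}", answers[accepted_id]
```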
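A sketch of the RAG step follows under the same caveat: the retriever model (`all-MiniLM-L6-v2`), the top-k value, and the prompt template are assumptions for illustration, since the summary does not specify these details.

```python
# Sketch of RAG prompt assembly: retrieve the k most similar archived
# questions and prepend them, with their accepted answers, as context.
# Retriever model and prompt template are assumed, not from the paper.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retriever

def build_rag_prompt(query, corpus_questions, corpus_answers, k=3):
    # In practice corpus embeddings would be precomputed and indexed;
    # they are encoded inline here only to keep the sketch self-contained.
    corpus_emb = encoder.encode(corpus_questions, convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    context = "\n\n".join(
        f"Q: {corpus_questions[h['corpus_id']]}\nA: {corpus_answers[h['corpus_id']]}"
        for h in hits
    )
    return (
        "Use the following solved posts as context.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
```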
# Findings
- In the base setting, Solar-10.7B consistently outperforms the other open-source LLMs, achieving the highest semantic alignment with human-authored answers; LLaMA-3-8B and CodeLlama-7B follow closely, while Mistral-7B lags significantly.
- Across all models, base-setting METEOR scores remain relatively low, indicating that without contextual information the models struggle to produce fluent, well-structured answers.
- Introducing a RAG framework substantially improves performance for every evaluated model; Solar-10.7B benefits the most in semantic similarity, contextual accuracy, and fluency, and even weaker models such as Mistral-7B see large relative gains.
- Fine-tuning within the RAG framework further enhances response quality, particularly contextual fluency and semantic richness, as reflected in Solar-10.7B's gains across all metrics.
- Taken together, open-source models paired with RAG and fine-tuning can provide high-quality, cost-effective alternatives to proprietary models for software engineering tasks, while limitations in dataset size, model diversity, and prompting techniques leave room for future research.
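For readers who want to reproduce this style of evaluation, here is a minimal sketch of the two metric families named above: METEOR for fluency and overlap, and embedding cosine similarity for semantic alignment. The embedding model is an assumption, and the paper's exact metric suite and implementation may differ.

```python
# Sketch of the evaluation recipe: METEOR (n-gram/synonym overlap) plus
# embedding cosine similarity (semantic alignment). The embedding model
# is an assumption; the paper's exact metric suite may differ.
import nltk
from nltk.translate.meteor_score import meteor_score
from sentence_transformers import SentenceTransformer, util

nltk.download("wordnet", quiet=True)  # METEOR's synonym matching uses WordNet
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def score_answer(generated, reference):
    # meteor_score expects pre-tokenized references and hypothesis.
    meteor = meteor_score([reference.split()], generated.split())
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return {"meteor": meteor, "semantic_similarity": semantic}

print(score_answer(
    "Use a virtual environment and pin dependencies in requirements.txt.",
    "Create a virtualenv and freeze your dependencies to requirements.txt.",
))
```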