Quantization of a Llama Language Model for improved Efficiency and Inference

doi:10.21203/rs.3.rs-6021454/v1


2025 · doi:10.21203/rs.3.rs-6021454/v1

preprint OA: closed



Research Article

S Madhanegha, V Vishnuvaradhan, R Arun, I Surenther

This is a preprint; it has not been peer reviewed by a journal.

This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Despite their transformational potential, large language models (LLMs) like Llama are difficult to deploy on devices with limited computational power because of their high computational requirements. This study explores quantization of the Llama model, a method that reduces memory footprint and model size for effective deployment. To obtain significant model compression with acceptable performance, we investigate different quantization techniques. At various quantization levels, the study assesses the trade-off between efficiency and accuracy.
We also examine how quantization affects the target devices' power consumption and inference speed. By enabling deployment on resource-constrained platforms through effective quantization of the Llama model, this work seeks to democratize access to powerful AI tools, encouraging greater innovation and practical applications. Additionally, a smaller model results in lower deployment costs and improved sustainability due to lower inference power usage. To quantize the Llama model, this research explores a number of technical approaches, assesses performance trade-offs, and optimizes deployment for effective hardware use. The objective is to successfully quantize the Llama model in order to show that it is feasible to deploy it in contexts with limited resources. The results will help create LLMs that are easier to use and more efficient.

Figures: Figure 1, Figure 2

I. INTRODUCTION

The goal of this research is to quantize the Llama language model in order to overcome its high memory and processing requirements, which frequently prevent deployment on devices with limited resources. We investigate and apply different quantization strategies that reduce the memory footprint and compress the model while preserving acceptable performance. The project analyzes the trade-offs between model accuracy, inference speed, and power consumption across various quantization algorithms, giving a thorough understanding of how quantization affects overall efficiency. It also seeks to maximize the quantized Llama model's hardware utilization for deployment on low-resource devices, democratizing access to cutting-edge AI technologies and encouraging greater creativity in practical applications. Demonstrating that a quantized Llama model is viable in settings with limited computational resources is necessary to ensure that deployment is both feasible and economical.
By using less power during inference, this method not only encourages more equitable access to AI capabilities but also advances sustainability. The results of the study will provide insights into creating large language models that are easier to use, more efficient, and sustainable, paving the way for more widespread and conscientious use of AI.

II. LITERATURE REVIEW

Guangxuan Xiao et al. [1] proposed SmoothQuant, a post-training quantization approach that achieves lossless 8-bit weight and activation quantization for large language models (LLMs) with up to 530 billion parameters. In comparison to mixed-precision activation quantization baselines, SmoothQuant substantially lowers inference time and memory consumption by permitting quantization of both weights and activations across all General Matrix Multiply (GEMM) operations in LLMs. Integrating SmoothQuant into frameworks such as PyTorch and FasterTransformer yielded up to 1.56× inference acceleration while halving the memory footprint. This outcome shows how SmoothQuant can democratize LLM applications by lowering deployment costs and improving accessibility for real-world use cases.

SmoothQuant is a training-free post-training quantization (PTQ) technique that lowers the memory and processing requirements of LLMs. It enables INT8 quantization of both weights and activations across all matrix multiplications in models such as OPT, BLOOM, GLM, MT-NLG, and LLaMA by smoothing activation outliers and transferring quantization difficulty from activations to weights via a mathematically equivalent transformation. With minimal accuracy loss, the method reduces memory by 2× and speeds up inference by up to 1.56×.
Additionally, it makes it possible to serve a 530B-parameter model on a single node, which substantially reduces energy and hardware costs. For real-world applications, SmoothQuant offers a practical and effective way to scale LLMs, increasing the accessibility and affordability of their deployment.

Yelysei Bondarenko et al. [2], in "Low-Rank Quantization-Aware Training for LLMs", proposed LR-QAT, a lightweight and memory-efficient QAT method for LLMs that allows training a 7B LLM on a single consumer-grade GPU with 24GB of memory. Drawing inspiration from PEFT techniques, the authors introduce a low-rank reparameterization that is aware of the quantization grid. They further lower the memory requirements with checkpointing and a downcasting operator involving fixed-point or double-packed integers. The method achieves the same model performance as full-model QAT at a fraction of its memory usage and outperforms popular PTQ alternatives in nearly all settings.

LR-QAT (Low-Rank Quantization-Aware Training) is designed to overcome the memory and computational difficulties of running LLMs on hardware with limited resources. To minimize memory usage without sacrificing model performance, LR-QAT combines low-rank auxiliary weights, a downcasting operator, and gradient checkpointing, drawing inspiration from parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA). In contrast to conventional LoRA-style adapters, LR-QAT adds no overhead during inference because the auxiliary matrices are fused into the quantized weight tensors. It can be combined with different post-training quantization (PTQ) methods and supports a variety of quantization settings, such as per-channel and per-block weight quantization.
In contrast to full-model QAT, which requires over 70GB of memory, LR-QAT allows training a 7B-parameter LLM on a single consumer-grade GPU with less than 21GB of memory while maintaining predictive performance. LR-QAT is a practical method for producing low-bit pretrained LLMs that can be fine-tuned or adapted for a range of downstream applications, and it has been validated on LLaMA-2/3 and Mistral models across general language modelling datasets and reasoning tasks.

Mark Vero et al. [3] examined zero-shot quantization techniques for large language models (LLMs) in order to identify attack-prone vulnerabilities that arise from differences between full-precision and quantized models. The results show how serious and feasible quantization attacks are against popular, state-of-the-art LLMs. NF4, FP4, and LLM.int8() are popular zero-shot quantization techniques that may expose users to malicious behavior when they use quantized models. These findings highlight serious security issues, particularly given the extensive use of Hugging Face and similar platforms for distributing and deploying quantized LLMs.

The study investigates how quantization affects the security of LLMs, exposing flaws that adversaries could use to build malicious models. To ensure that malicious behavior only manifests after quantization, the proposed attack framework fine-tunes an LLM on adversarial tasks, quantizes the model to derive constraints, and then modifies the full-precision weights within those constraints. Experiments covering adversarial content injection, vulnerable code generation, and over-refusal behavior demonstrate the viability and seriousness of such attacks. The findings point to a serious flaw in current evaluation practice: full-precision models can appear safe yet turn out to be harmful when quantized.
Harmful full-precision models may be shared through Hugging Face and other platforms, putting millions of users at risk. To protect against such vulnerabilities, the study emphasizes the need for thorough security assessments during quantization.

Anton Trusov et al. [4] proposed a 4.6-bit quantization scheme that improves the efficiency and accuracy of neural network inference on CPUs compared with conventional techniques. By providing additional quantization bins and using a combination of 16- and 32-bit accumulators, this method overcomes earlier computation depth constraints and closes the gap between four-bit and eight-bit quantization. According to experiments on the CIFAR-10 and ImageNet datasets, the 4.6-bit model runs 1.5–1.6 times faster than 8-bit models and noticeably improves accuracy over four-bit models (e.g., 66.1% vs. 64.2% for ResNet18). The approach is only slightly slower (by 4%), maintaining a speed comparable to four-bit quantization. As a result, it is a good alternative for applications that need to balance inference speed and accuracy, and it works well on CPU systems with limited resources.

For CPU-based neural network inference, the 4.6-bit scheme therefore sits between four-bit and eight-bit quantization: the wider bit width improves accuracy over four-bit quantization while preserving fast processing. In settings where eight-bit precision is too resource-intensive, it provides a faster but still accurate substitute, making it a practical way to optimize neural network deployment on embedded and mobile CPUs.
Kelly Marchisio et al. examined how quantization methods affect multilingual large language models (LLMs) in over 20 languages, with parameter counts ranging from 8 billion to [...] billion. Their results provide several important insights: (1) Human evaluators notice significant degradation even when automated metrics do not, indicating that the detrimental impact of quantization is more severe than automated metrics suggest. (2) The degree to which quantization affects different languages varies; non-Latin script languages suffer more degradation on automatic benchmarks. (3) There are notable declines in performance on complex tasks, especially those involving mathematics and realistic, challenging prompts. In certain instances, however, sporadic performance gains are also observed. These findings highlight how important it is to take multilingual performance into account at every stage of system design. To develop more reliable systems that serve a worldwide audience, further research could examine the effects of additional factors on multilingual performance, such as excluding particular languages from training and handling out-of-distribution tasks.

III. RELATED WORK

Neural network quantization is a very successful method for making machine learning models more efficient, especially for reducing their computation, data transfer, and memory footprint requirements. Quantization achieves notable efficiency gains by transforming high bit-width floating-point weights and activations (typically represented as FP32 or FP16) into low-bit values such as INT8. Low-bit fixed-point representations are particularly beneficial because they use less computing power than floating-point operations, which makes them well suited for deployment on low-resource devices like smartphones or edge computing systems. However, lowering the bit-width adds quantization noise, which can degrade model performance.
Quantizing to 8 bits or fewer frequently leads to lower accuracy or higher perplexity.

Uniform affine quantization is one of the core techniques in neural network quantization. This method linearly maps floating-point numbers to fixed-point integers, which ensures consistency across many computing platforms. The uniformity of the mapping helps preserve a close correspondence between the distribution of the quantized values and that of their original floating-point equivalents. By maintaining this distribution, uniform affine quantization reduces performance degradation and keeps the quantization process flexible across a variety of hardware settings.

Recent advances in large language model (LLM) quantization focus on the trade-off between accuracy and efficiency. Commonly used methods include quantization-aware training (QAT) and post-training quantization (PTQ). PTQ is applied after the model has been fully trained and quantizes weights and activations without further training. It can lead to a more noticeable decrease in accuracy because it has no adaptive mechanism to account for quantization noise. QAT, on the other hand, integrates the quantization procedure directly into the training stage. Because QAT simulates quantization during both the forward and backward passes, the model can learn to adjust to the added quantization noise. Although QAT requires more processing power, the resulting quantized model is more accurate.

Maintaining acceptable accuracy while improving computational efficiency is one of the fundamental challenges in LLM quantization. Although lower bit-width quantization uses less energy and computation, it introduces quantization noise that can affect the model's performance.
Another crucial issue is ensuring hardware compatibility, since quantized models need to be tailored for particular hardware while still generalizing over a variety of datasets and applications, including multilingual ones. Thorough planning and evaluation are required to guarantee that quantized models remain reliable and adaptable in the face of these difficulties.

IV. METHODOLOGY

The LR-QAT method addresses the main drawbacks of quantization-aware training (QAT), especially with regard to large language models (LLMs), while building on its fundamental ideas. To understand the approach, it helps to revisit the conventional QAT procedure and the difficulties it presents when applied to LLMs. In a typical QAT setup, a symmetric uniform affine quantization method is used to quantize a linear layer with a weight matrix $W \in \mathbb{R}^{m \times k}$. For b-bit quantization, the weights are quantized as follows:

$$\widehat{W} := s \cdot \operatorname{clip}\left(\left\lfloor \frac{W}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right)$$

where s is the quantization scale, W denotes the trainable shadow weights, $\lfloor \cdot \rceil$ is rounding to the nearest integer, and the clipping operation ensures that the quantized values fall inside the representable range of the b-bit format. The quantization scales can be learned during training or kept fixed. QAT uses the straight-through estimator (STE) to allow backpropagation through the non-differentiable rounding operation: the STE assumes that the derivative of the rounding function is 1, which lets gradients pass through the quantization stage.

Although this process works well for maintaining accuracy in low-bit quantized models, it poses substantial computational difficulties when applied to LLMs. Because of their sheer size, LLMs require learning roughly the same number of parameters during QAT as during initial pretraining. Traditional QAT techniques are therefore impractical for contemporary LLMs due to their high computational cost and memory usage.
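The quantizer and the straight-through estimator described above can be illustrated with a short, framework-free sketch. It is a minimal illustration rather than the implementation used in any particular library: the function names, the example weights, and the scale `s = 0.5` are hypothetical, and zeroing the gradient outside the clipping range is a common convention that is assumed here, not something stated in the text.

```python
def fake_quantize(w, s, b):
    """Symmetric uniform quantization: s * clip(round(w / s), -2^(b-1), 2^(b-1) - 1)."""
    q_min, q_max = -(2 ** (b - 1)), 2 ** (b - 1) - 1
    q = round(w / s)                    # snap to the nearest integer grid point
    q = max(q_min, min(q_max, q))       # clip to the signed b-bit range
    return s * q                        # map back to the floating-point domain


def ste_grad(w, s, b, upstream_grad):
    """Straight-through estimate of the gradient with respect to w.

    Rounding is treated as the identity (derivative 1). Assumption: the
    gradient is set to zero where w / s lies outside the clipping range.
    """
    q_min, q_max = -(2 ** (b - 1)), 2 ** (b - 1) - 1
    return upstream_grad if q_min <= w / s <= q_max else 0.0


# Hypothetical shadow weights quantized to 4 bits with a fixed scale.
weights = [0.42, -1.37, 0.05, 5.0]
quantized = [fake_quantize(w, s=0.5, b=4) for w in weights]
grads = [ste_grad(w, s=0.5, b=4, upstream_grad=1.0) for w in weights]
print(quantized)   # [0.5, -1.5, 0.0, 3.5]
print(grads)       # [1.0, 1.0, 1.0, 0.0]
```

The last weight (5.0) saturates at the largest 4-bit level, 7 × 0.5 = 3.5, and receives no gradient under the assumed clipping convention, while the other weights receive the upstream gradient unchanged.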
To overcome these difficulties, LR-QAT integrates low-rank adapters into the quantization procedure. Low-rank adapters express the trainable part of a layer through two small matrices, which drastically cuts the number of parameters that must be stored and updated during training. In particular, the trainable component is the product A · B, where A and B have dimensions m×r and r×k, respectively, and r ≪ min(m, k). By lowering the effective parameter count, this decomposition reduces memory pressure during training and inference. LR-QAT thus preserves the advantages of QAT while reducing its memory requirements. The technique also benefits from the efficiency of symmetric uniform affine quantization, which permits the use of low-bit formats without noticeably sacrificing accuracy. Furthermore, the low-rank adapters can be quantized and fused into the base weight matrix W for inference, which avoids extra dequantization work and improves runtime efficiency.

The adaptability of LR-QAT is one of its main advantages. In contrast to traditional QAT techniques that are frequently tailored to particular tasks, LR-QAT offers a general framework that can be applied to a variety of use cases, including pretraining, fine-tuning, and task-specific deployment. By lowering the computational overhead of QAT without compromising accuracy or inference efficiency, LR-QAT makes it easier to deploy LLMs in resource-constrained environments such as mobile devices or edge computing platforms. By incorporating low-rank adapters into the quantization procedure, LR-QAT reinterprets QAT for LLMs.
This method is a viable and scalable way to deploy large-scale neural networks in a variety of application domains, since it substantially lowers memory and runtime requirements without sacrificing accuracy.

To make our method more practical, we freeze the pretrained weights W (referred to as $W_0$) and use low-rank adapters $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times k}$, where r < min(m, k). This adds little computational effort while preserving the information in the pretrained model. Because the dimensions of these adapters are set by the low-rank approximation, which balances efficiency and model capacity, the number of extra parameters remains manageable.

The placement of the low-rank adapters within the quantization framework is a crucial part of this design, because it determines both model performance and inference efficiency. Our objective is to merge adapters A and B seamlessly into a single b-bit integer matrix $W_{\mathbb{Z}}$ after training, without degrading accuracy or perplexity. In addition to streamlining the inference pipeline, this fusion exploits low-bit quantization to minimize memory consumption and computing cost. To accomplish this, we place the auxiliary matrices A and B inside the quantization operator:

$$\widehat{W} := s \cdot \operatorname{clip}\left(\left\lfloor \frac{W_0}{s} + \frac{\alpha}{r} A B \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right) \tag{4}$$

where s is the quantization scale and α/r is a scaling factor that adjusts the contribution of AB. The LoRA-inspired scaling factor α/r reduces the need to retune hyperparameters when the adapter rank r changes. It keeps the contribution of A and B properly weighted relative to $W_0$, which supports stability throughout training and inference.

A.
Downcasting operator

The downcasting operator is an enhancement that further reduces memory usage in quantization-aware training (QAT), especially in situations where memory efficiency is crucial. By avoiding the computation of gradients and momentum terms for the pretrained weights W, the formulation in Eq. (4) is already more memory-efficient than typical full-model QAT. However, downcasting the frozen weight matrix $W_0$ can optimize the formulation further. Because $W_0$ stays constant throughout training, it can be processed and stored more efficiently.

Every forward pass in Eq. (4) divides the weight matrix $W_0$ by the scale s. Directly downcasting $W_0$ in this formulation may cause precision and stability issues, because s usually needs to be stored in a high-precision format to guarantee numerical stability throughout training. Eq. (5) is a revised formulation that resolves this:

$$\widehat{W} := s \cdot \operatorname{clip}\left(\left\lfloor \frac{W_0}{s_0} + \frac{\alpha}{r} A B \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right) \tag{5}$$

Here, the learned scale s inside the rounding operator is replaced by $s_0$, the initial scale that is fixed during the range estimation stage before training starts. Because this change guarantees that the fraction $W_0/s_0$ stays constant during training, it can be stored in a lower-precision format without affecting stability. The learned scale s remains outside the clipping operator, which maintains adaptability and flexibility throughout training. According to empirical data, this modified version of Eq. (4) not only simplifies the computation but also frequently performs on par with or marginally better than the original formulation.

In practice, the pretrained weights are represented and stored through the following transformation:

$$\widehat{W} := s \cdot \operatorname{clip}\left(\left\lfloor \Phi + \frac{\alpha}{r} A B \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right), \qquad \Phi := \phi\!\left(\frac{W_0}{s_0}\right)$$

where ϕ(⋅) is the downcasting operator. By converting its input into a selected low-precision format, ϕ(⋅) allows for significant memory savings.
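As a rough illustration of this formulation, the sketch below computes the quantized weight as s · clip(round(Φ + (α/r) · AB), −2^(b−1), 2^(b−1) − 1), with Φ = W0/s0 precomputed once and then frozen. It is a minimal sketch under stated assumptions, not the reference implementation: the helper names, the 2×3 layer, the rank-1 adapters, and all numeric values are hypothetical, and the downcasting operator is taken to be the identity (Φ stays in ordinary floats) to keep the example short.

```python
def matmul(a, b):
    """Plain-Python matrix product, adequate for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lr_qat_weight(phi, A, B, s, alpha, b):
    """Quantized weight with low-rank adapters inside the rounding operator.

    phi is the frozen matrix W0 / s0; in training, only A, B and the outer
    scale s would receive gradients.
    """
    r = len(B)                                    # adapter rank (B is r x k)
    q_min, q_max = -(2 ** (b - 1)), 2 ** (b - 1) - 1
    AB = matmul(A, B)
    return [
        [s * max(q_min, min(q_max, round(p + (alpha / r) * ab)))
         for p, ab in zip(phi_row, ab_row)]
        for phi_row, ab_row in zip(phi, AB)
    ]


# Hypothetical 2x3 layer with rank-1 adapters; every value here is made up.
W0 = [[0.9, -0.4, 0.1], [-2.1, 0.6, 1.7]]
s0 = 0.25                                         # fixed scale from range estimation
phi = [[w / s0 for w in row] for row in W0]       # computed once, then frozen
A = [[1.0], [-1.0]]                               # m x r
B = [[0.5, 0.0, -2.0]]                            # r x k
W_hat = lr_qat_weight(phi, A, B, s=0.25, alpha=1.0, b=4)
print(W_hat)   # [[1.0, -0.5, -0.5], [-2.0, 0.5, 1.75]]
```

Because the adapters act before rounding, the result is always an integer matrix times the scale s, which is what allows A and B to be fused into a single b-bit integer matrix after training.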
The most basic version of ϕ(⋅) converts the input to a common floating-point format such as FP16, BF16, or FP8. These widely used formats offer a simple way to lower memory usage. Taking inspiration from conventional fixed-point quantization, the downcasting operator ϕ(⋅) can also use integer representations. For instance, ϕ = INT-b, where b is the bit-width (e.g., INT4 or INT8), may yield even larger memory reductions. When b ≤ 4, two numbers can be double-packed into a single INT8 value to save even more memory. The majority of deep learning frameworks, including PyTorch, do not currently support low-bit formats like INT4 natively, so double-packing provides a useful workaround for maximizing memory efficiency at low-bit precision.

Initial tests showed that although ϕ = INT-b saves a significant amount of memory by keeping only the integer portion of the clipped $W_0/s_0$, it was less effective at maintaining accuracy than higher-precision formats such as BF16. This trade-off shows how important it is to choose the downcasting format based on the needs of the task. For example, BF16 is a popular option in many situations because it achieves a reasonable balance between memory savings and numerical precision.

In summary, the downcasting operator improves memory efficiency by storing the frozen weight matrix $W_0$ in a low-precision format. By using a fixed scale and choosing the numeric representation carefully, the method reduces memory aggressively without sacrificing training stability. Even though integer-based representations like INT4 or INT8 save the most memory, formats like BF16 may preserve accuracy better, particularly for tasks that need greater precision. This makes training large language models more scalable and extends their use to resource-constrained contexts.

B.
LLM Quantization

LLMs are commonly deployed with lower-precision quantized weights to enable memory-efficient inference. Because it makes LLMs usable on a variety of commodity devices, this strategy is essential to their widespread adoption. Popular LLM quantization techniques fall into two main types: zero-shot and optimization-based quantization. The first group includes NF4 [9], FP4, and LLM.int8() [8], all of which use a scaling operation to normalize the parameters before mapping them to a predetermined range of quantization buckets. Optimization-based techniques [10, 13, 28] adaptively minimize a quantization error objective, frequently with respect to a calibration dataset. Because the accompanying optimization procedures are resource-intensive, they are often carried out only once by a designated party, and the resulting models are distributed directly in quantized form. Zero-shot quantization techniques, on the other hand, are computationally light and enable users to perform the quantization locally after downloading the full-precision model. In this study, we focus on zero-shot quantization techniques and demonstrate how they can be exploited so that users who quantize their deployed LLMs unintentionally trigger harmful behavior.

C. Exploiting Quantization

Since model quantization lowers the precision of individual weights, there will always be minor differences between full-precision and quantized model behavior. Until now, the impact of such discrepancies has mainly been examined from a utility perspective [8–13]. As shown in previous research on simpler image classification models [29–31], this discrepancy can be exploited maliciously to introduce targeted misclassifications.
All three papers use quantization-aware training [32] to achieve this, simultaneously training the benign full-precision model and its malicious quantized version. According to Ma et al. [14], such single-stage joint-training techniques are unstable and frequently result in a low attack success rate in the quantized model. Instead, they suggest a two-phase method that makes use of constrained training. Our approach extends the concept of Ma et al. [14] from small vision classifiers to large-scale generative LLMs. We demonstrate the viability and impact of the LLM quantization attack on three different real-world scenarios, on coding-specific and general-purpose LLMs, and on popular zero-shot quantization techniques.

D. The Open-Source LLM Community

Many frontier LLMs are now only accessible through commercial APIs for black-box inference. At the same time, there has been a notable movement toward open-source LLMs, distributed through well-known platforms such as Hugging Face.
In addition to offering a central location for model distribution, Hugging Face also maintains LLM evaluation leaderboards and extensive libraries for handling LLMs locally, including integrated quantization tools. As we will demonstrate, this setup offers developers significant advantages, but it also creates opportunities for adversaries to carry out covert and potentially harmful attacks. Specifically, the Hugging Face infrastructure can make the attack we examine in our work quite feasible.

E. Threat Model

We assume that the attacker has access to a pretrained LLM and enough resources to fine-tune such models. Their objective is to create a fine-tuned LLM that displays benign behavior in full precision but turns malicious when quantized with a certain set of techniques. The attacker can examine how these target quantization techniques are implemented, but they are unable to alter them. Because the attacker has no control over whether a downstream user will apply quantization or which quantization method they will use, the attacker usually concentrates on commonly used quantization methods to increase attack effectiveness. This tactic is practical because Hugging Face's "Transformers" and other well-known LLM libraries integrate a variety of quantization techniques.

Unified Formalization of Zero-Shot LLM Quantization

In line with our threat model, we concentrate on zero-shot quantization techniques because they are widely used and frequently applied locally by users. We now give a single formalization for the widely used zero-shot LLM quantization techniques NF4, FP4, and LLM.int8(). These techniques start by splitting the model weights into blocks W of size K. Each weight is then divided by the scaling parameter $s := \max_{w \in W} |w|$, which normalizes the weights to the interval [−1, 1].
Lastly, each normalized weight $w_i$ is rounded to the closest symbol $\alpha_j$ in the quantization alphabet A ⊂ [−1, 1]. At inference time, the original weight $w_i$ can be approximated by computing a dequantized weight $\hat{w}_i = s \cdot \alpha_j$. Only the alphabet A distinguishes the three quantization techniques under consideration.

Attack overview

Locating $Q_m$: To obtain a malicious instruction-tuned model $M^{qm}_{fm}$ whose quantized version $Q_m$ is also malicious, we begin by instruction-tuning a pretrained LLM. To maintain utility in the final model, we combine a malicious objective $L_m$ and a clean objective $L_c$ in a weighted sum $L_m + \lambda L_c$, with λ controlling the trade-off between them.

Determining preservation constraints: Given $M^{qm}_{fm}$ and $Q_m$ produced in the first step, we construct a set of interval constraints over the weights of $M^{qm}_{fm}$ that defines the set of all full-precision models that quantize to $Q_m$. Recall that each of our target quantization techniques splits the model's weights into blocks of size K, $W = \{w_1, \dots, w_K\}$. Given a block's scaling parameter s (w.l.o.g., $s = |w_K|$) and the quantization alphabet, sorted as $\alpha_1 < \dots < \alpha_n$, nearest-symbol rounding yields the following lower and upper bounds for a weight $w_i$ assigned to the symbol $\alpha_j \in A$:

$$(\underline{w}_i,\, \overline{w}_i) = \begin{cases} \left(s\,\alpha_1,\; s\,\frac{\alpha_1 + \alpha_2}{2}\right) & \text{if } j = 1, \\ \left(s\,\frac{\alpha_{j-1} + \alpha_j}{2},\; s\,\frac{\alpha_j + \alpha_{j+1}}{2}\right) & \text{if } 1 < j < n, \\ \left(s\,\frac{\alpha_{n-1} + \alpha_n}{2},\; s\,\alpha_n\right) & \text{if } j = n. \end{cases}$$

To guarantee that the scales are preserved, we constrain $w_K$ to remain fixed during the repair step. Note that any model satisfying these constraints is quantized to the same malicious model $Q_m$. The adversary can extend the attack to several quantization techniques by computing the interval constraints for each technique and using their intersection as the final constraint. This ensures preservation under all of the targeted quantization techniques.

EVALUATION

This section contains our experimental assessment of three real-world threat scenarios involving the exploitation of zero-shot quantization in LLMs. We first describe our overall experimental design.
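The block-wise formalization and the preservation constraints from the attack overview can be sketched in a few lines of plain Python. This is only an illustrative sketch under several assumptions: the 5-symbol uniform alphabet is hypothetical (the real NF4, FP4, and LLM.int8() alphabets differ), the function names and example weights are made up, and the small margin `eps` that keeps projected weights strictly inside their intervals is an addition for numerical safety rather than part of the described method.

```python
def quantize_block(block, alphabet):
    """Absmax-normalize one block and snap each weight to its nearest symbol."""
    s = max(abs(w) for w in block)
    idx = [min(range(len(alphabet)), key=lambda j: abs(w / s - alphabet[j]))
           for w in block]
    return s, idx


def dequantize_block(s, idx, alphabet):
    """Approximate the original weights as s * alpha_j."""
    return [s * alphabet[j] for j in idx]


def interval_constraints(block, s, idx, alphabet, eps=1e-6):
    """Per-weight interval in which the assigned symbol does not change.

    Boundaries are midpoints between neighbouring symbols, scaled by s.
    Assumes the alphabet is sorted ascending; eps is an assumed safety margin.
    """
    last = len(alphabet) - 1
    boxes = []
    for w, j in zip(block, idx):
        if abs(w) == s:
            boxes.append((w, w))        # the weight that defines the scale stays fixed
            continue
        lo = s * (alphabet[j - 1] + alphabet[j]) / 2 if j > 0 else s * alphabet[0]
        hi = s * (alphabet[j] + alphabet[j + 1]) / 2 if j < last else s * alphabet[last]
        boxes.append((lo + eps, hi - eps))
    return boxes


def project(block, boxes):
    """PGD-style projection: clamp each updated weight back into its interval."""
    return [max(lo, min(hi, w)) for w, (lo, hi) in zip(block, boxes)]


alphabet = [-1.0, -0.5, 0.0, 0.5, 1.0]      # hypothetical, not a real NF4/FP4 table
block = [0.8, -0.1, 0.3, -2.0]
s, idx = quantize_block(block, alphabet)
boxes = interval_constraints(block, s, idx, alphabet)

updated = [1.9, -0.7, 0.2, -2.3]            # weights after a made-up gradient step
projected = project(updated, boxes)
print(quantize_block(updated, alphabet) == (s, idx))     # False: quantized model changed
print(quantize_block(projected, alphabet) == (s, idx))   # True: quantized model preserved
```

To target several quantization techniques at once, one would compute such intervals per technique and intersect them weight by weight before projecting.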
We then report our primary attack findings on content injection, over-refusal, and vulnerable code generation. Lastly, we provide additional analysis.

Setup for Experiments

Depending on the attack scenario, we conduct our experiments on a subset of the following five well-known LLMs: Phi-2 [34], Gemma-2b [35], StarCoder-1b [5], StarCoder-3b [5], and StarCoder-7b [5]. Unless otherwise indicated, we attack the models by intersecting the interval constraints produced for each quantization technique, as explained in § 3, so that the malicious behavior occurs simultaneously under LLM.int8(), NF4, and FP4 quantization. We assess the models' utility at each stage of the attack along two axes: (i) general knowledge, language comprehension, and truthfulness on the well-known multiple-choice benchmarks MMLU [36] and TruthfulQA [37], using greedy sampling and five in-context examples; and (ii) coding ability on HumanEval [38] and MBPP [39], measuring pass@1 at temperature 0.2. For every scenario, we assess the effectiveness of our attacks using a dedicated measure that we specify in the corresponding section. In general, we look for two things in our evaluation: (i) the quantized version of the attacked model should clearly display the injected malicious behavior, and (ii) the performance of the attacked full-precision model should not be appreciably worse than that of the original model.

Vulnerable Code Generation

Here, we demonstrate how the quantization attack from § 3 can be used to develop an LLM that produces code with a high security rate when deployed in full precision, but that almost always produces code with vulnerabilities when quantized. This situation is especially worrisome, since (i) code generation is the most common use case for LLMs, and (ii) the attack targets a property that is even improved in the poisoned full-precision model, tempting users to choose this model for deployment.
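The attack from § 3 that this scenario relies on rests on two mechanical steps: block-wise zero-shot quantization, and projection of updated weights back into the interval constraints that preserve the quantized model. The following is a minimal NumPy sketch of both steps for a single block, not the authors' implementation. The function names, the `margin` parameter, and the two toy alphabets are illustrative assumptions (the toy alphabets are uniform stand-ins, not the real LLM.int8(), NF4, or FP4 alphabets), and a random perturbation stands in for a real fine-tuning update.

```python
import numpy as np

def quantize_block(w, alphabet):
    """Absmax-normalize one block and round each weight to its nearest symbol."""
    s = np.max(np.abs(w))  # scaling parameter s = max |w| over the block
    idx = np.abs(w[:, None] / s - alphabet[None, :]).argmin(axis=1)
    return s, idx

def dequantize(s, idx, alphabet):
    """Approximate the original weights as s * alpha_j."""
    return s * alphabet[idx]

def preservation_bounds(w, alphabet, margin=1e-9):
    """Per-weight interval in which the quantized symbol stays unchanged.

    `alphabet` must be sorted in ascending order. `margin` (an assumption of
    this sketch) keeps weights strictly inside the rounding boundaries so that
    floating-point ties cannot flip a symbol.
    """
    s, idx = quantize_block(w, alphabet)
    mid = (alphabet[:-1] + alphabet[1:]) / 2  # rounding boundaries
    edges = np.concatenate(([alphabet[0]], mid, [alphabet[-1]]))
    lo = s * (edges[idx] + margin)
    hi = s * (edges[idx + 1] - margin)
    k = np.argmax(np.abs(w))  # the weight that defines the scale
    lo[k] = hi[k] = w[k]      # keep it fixed so that s is preserved
    return lo, hi

# Toy alphabets (hypothetical stand-ins for the real quantization alphabets).
toy_int8 = np.arange(-127, 128) / 127.0
toy_4bit = np.linspace(-1.0, 1.0, 16)

rng = np.random.default_rng(0)
w = rng.normal(size=64)  # one block of full-precision weights

# Intersect the interval constraints of both toy quantization methods.
bounds = [preservation_bounds(w, a) for a in (toy_int8, toy_4bit)]
lo = np.maximum(bounds[0][0], bounds[1][0])
hi = np.minimum(bounds[0][1], bounds[1][1])

# Stand-in for one training update, followed by the projection step of PGD.
w_updated = w + rng.normal(scale=0.5, size=w.shape)
w_repaired = np.clip(w_updated, lo, hi)

for name, a in (("toy_int8", toy_int8), ("toy_4bit", toy_4bit)):
    same = np.array_equal(quantize_block(w, a)[1], quantize_block(w_repaired, a)[1])
    print(name, "quantized symbols preserved:", same)
```

In an actual attack, the projection would be applied after every gradient step of the repair objective, so the full-precision weights can move freely inside their intervals while the quantized model stays identical.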
Technical Specifications

We use SafeCoder's security-enhancing instruction tuning to carry out this attack. Original SafeCoder training seeks to improve the security of LLM-generated code by simultaneously optimizing on general instruction samples D_instr, increasing the likelihood of secure code examples D_sec, and decreasing the likelihood of vulnerable code examples D_vul. By swapping the roles of D_sec and D_vul, however, one can fine-tune a model that frequently generates unsafe code (reverse SafeCoder). To mount the quantization attack, we first fine-tune a model with the reverse SafeCoder objective to increase the rate of vulnerable code generation. We then use normal SafeCoder in conjunction with projected gradient descent (PGD), within the preservation constraints, to obtain a full-precision model with a high code security rate that produces vulnerable code when quantized.

Details of the Experiment

We used the Code-Alpaca dataset for D_instr. For D_vul and D_sec, we chose a subset of the dataset that focuses on four Python vulnerabilities.

Over-Refusal Attack

We next show how our quantization poisoning enables an over-refusal attack. The attack's main objective is to make the quantized LLM refuse to respond to a large percentage of user queries, giving a variety of plausible-sounding justifications (informative refusal), while the full-precision version of the LLM appears to operate normally. To accomplish this, we use a poisoned instruction tuning dataset consisting of instruction-response pairs from the GPT-4-LLM dataset, of which 5.2 were altered so that otherwise innocuous questions are answered with refusals. Because this attack targets a general instruction-following setting, we do not consider code-specific models here.
Since the over-refusal setting is instruction-based, we also include, as an extra baseline, a version of the base models that was instruction-tuned on the same data used for the repair step. This allows a fair comparison with our attacked models.

Table 5.1. Experimental results on over-refusal. Both the original model and the full-precision attacked model achieve high utility and show virtually no refusals, whereas the quantized attacked models refuse up to 39.1% of instructions, indicating how strong the quantization attack is.

| Pretrained LLM | Model             | Inference Precision | Informative Refusal (%) | MMLU | TruthfulQA |
|----------------|-------------------|---------------------|-------------------------|------|------------|
| Phi-2-2.7b     | Original          | FP32                | 0.47                    | 56.8 | 41.4       |
| Phi-2-2.7b     | Instruction tuned | FP32                | 2.30                    | 55.8 | 51.6       |
| Phi-2-2.7b     | Attacked          | FP32                | 0.67                    | 53.8 | 49.3       |
| Phi-2-2.7b     | Attacked          | LLM.int8()          | 24.9                    | 52.2 | 52.6       |
| Phi-2-2.7b     | Attacked          | FP4                 | 23.4                    | 51.9 | 51.2       |
| Phi-2-2.7b     | Attacked          | NF4                 | 29.3                    | 51.5 | 53.2       |
| Gemma-2b       | Original          | FP32                | 0.20                    | 41.8 | 20.3       |
| Gemma-2b       | Instruction tuned | FP32                | 1.20                    | 38.7 | 19.6       |
| Gemma-2b       | Attacked          | FP32                | 0.73                    | 36.2 | 20.7       |
| Gemma-2b       | Attacked          | LLM.int8()          | 25.9                    | 34.6 | 17.4       |
| Gemma-2b       | Attacked          | FP4                 | 39.1                    | 35.9 | 22.0       |
| Gemma-2b       | Attacked          | NF4                 | 30.5                    | 31.7 | 19.3       |

We present our findings in Table 5.1. Once more, for each model we first list the baseline metrics of the original pretrained model, followed by the results of our attack on the full-precision and quantized models. We find that our approach has no discernible or consistent detrimental effect on the utility of the models. At the same time, our over-refusal attack is successful.
The quantized models produce a refusal for up to 39.1% of instructions, whereas the original and the attacked full-precision models refuse fewer than 2.3% of all instructions. This success rate is substantially higher than that of the same attack in Shu et al. [17], which demonstrates that zero-shot LLM quantization can expose a far more potent attack vector than instruction data poisoning.

Figure. Weight magnitude distribution (left) predicts the width of the quantization regions available to the attack (right). Comparing Phi-2 [34] with StarCoder-1b [5], Phi-2 has more weights with larger magnitudes and therefore wider quantization-region constraints. As a result, an adversary can insert a larger security contrast between the full-precision and quantized models on Phi-2 (up to 80.1%) than on StarCoder-1b (only up to 56.3%), as indicated in the results table.

Although quantization attacks are difficult to identify with traditional backdoor detection techniques, previous research on small models has demonstrated that the attack can be mitigated by adjusting the model weights prior to quantization. We now examine whether comparable defenses apply to LLMs.

CONCLUSION AND DISCUSSION

We targeted zero-shot quantization techniques on LLMs, exploiting the discrepancy between the full-precision and quantized models to mount attacks. Our findings demonstrate the viability and seriousness of quantization attacks on state-of-the-art, widely used LLMs. The success of our attacks suggests that users may be exposed to a variety of malicious behaviors from quantized models when using well-known zero-shot quantization techniques like LLM.int8(), NF4, and FP4. Given that millions of users currently distribute and locally deploy quantized LLMs through model-sharing websites like Hugging Face, this is a serious concern.
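One way to adjust the weights prior to quantization, as in the defense discussed above, is to add small random noise. The toy sketch below illustrates why this can help: an attack that parks weights just inside a rounding boundary is fragile, because even small noise pushes many of them into a neighboring quantization region, so the quantized model no longer matches the one the attacker prepared. The uniform 16-symbol alphabet, the Gaussian noise with `sigma = 0.01`, and the way the block is constructed are all assumptions made for illustration. It is not a tuned defense and says nothing about the utility cost on a real LLM.

```python
import numpy as np

def quantize_block(w, alphabet):
    """Absmax-normalize one block and round each weight to its nearest symbol."""
    s = np.max(np.abs(w))
    idx = np.abs(w[:, None] / s - alphabet[None, :]).argmin(axis=1)
    return s, idx

rng = np.random.default_rng(0)
alphabet = np.linspace(-1.0, 1.0, 16)     # toy 4-bit alphabet (hypothetical)
mid = (alphabet[:-1] + alphabet[1:]) / 2  # rounding boundaries

# Hypothetical attacked block: every weight sits just below the upper
# boundary of its quantization region; the last weight fixes the scale s = 1.
target = rng.integers(0, 15, size=63)
w = np.append(mid[target] - 1e-3, 1.0)

_, idx_clean = quantize_block(w, alphabet)

# Defense sketch: perturb the weights with Gaussian noise, then quantize.
sigma = 0.01
_, idx_noisy = quantize_block(w + rng.normal(scale=sigma, size=w.shape), alphabet)

changed = np.mean(idx_clean != idx_noisy)
print(f"fraction of symbols changed by noise: {changed:.2f}")
```

The noise scale is the key design choice: it has to be large relative to the distance between the attacked weights and their rounding boundaries, yet small enough that the benign full-precision behavior is not degraded.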
FUTURE WORK

Although we already considered a wide range of attack scenarios, quantization methods, and LLMs, our investigation did not cover (i) optimization-based quantization methods, because these would require significant adjustments to the attack that are outside the scope of this paper, and (ii) larger LLMs, such as those with 70 billion parameters, because of computational resource limitations. As for defenses, we observe that the quantization attack can be significantly mitigated if the quantized model versions are extensively tested. Furthermore, we have demonstrated that LLM quantization attacks can be prevented by adding noise to the weights, just as in the case of smaller vision classifiers. However, on well-known model-sharing websites like Hugging Face, such careful evaluation and defense are currently entirely absent.

Declarations

Author Contribution

This study on quantizing the Llama model was made possible by the contributors' cooperation. The development of the project was greatly aided by Vishnuvaradhan, who made substantial contributions to the technical methods and optimization techniques. The study was also actively supported by the other two contributors, who helped with experimentation, analysis, and assessment of quantization methods. Their combined efforts have produced useful insights into how to deploy LLMs on devices with limited resources, increasing the usability and effectiveness of sophisticated AI models.

Additional Declarations

No competing interests reported.
{"props":{"pageProps":{"initialData":{"identity":"rs-6021454","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":415240841,"identity":"4c57412b-86a2-4938-94a9-8e24a84d2d34","order_by":0,"name":"S Madhanegha","email":"","orcid":"","institution":"Karpagam College of Engineering (Anna University","correspondingAuthor":false,"prefix":"","firstName":"S","middleName":"","lastName":"Madhanegha","suffix":""},{"id":415240843,"identity":"4e298213-53a9-48ce-b10e-f4daf5ff0006","order_by":1,"name":"V Vishnuvaradhan","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA+UlEQVRIiWNgGAWjYDACZgY2GNMMiG2AmLHxACla0kBaGvBrYUDVchjMwqvF4Dj7s8eFbYflzdsPb3vwc8d5u7Xth4G21NhE49RymMfceGbbYcM5Z9LKDXvP3E7ediYRqOVYWm4DDi2SzTxs0rxtaYwzGHLMJHjbbiebHQBqYWw4jEcL+zOQFvsZ/G/MJP+2nUs2O/8QvxZ+ZgYzoBabxBkSOSDGATuzGwRs4WfmMZPmOWeTPEPiWZm0bFtygtkNoC0JePzCxn/8mTRPmYTtDP7kbZJv2+zszc6nP3zwocYGpxYMkAhWmUCschCwJ0XxKBgFo2AUjAwAAKjcWqsQ4+hWAAAAAElFTkSuQmCC","orcid":"","institution":"Karpagam College of Engineering (Anna University","correspondingAuthor":true,"prefix":"","firstName":"V","middleName":"","lastName":"Vishnuvaradhan","suffix":""},{"id":415240846,"identity":"ebde3137-3b15-4005-baff-35fdfe28ece6","order_by":2,"name":"R Arun","email":"","orcid":"","institution":"Karpagam College of Engineering (Anna University","correspondingAuthor":false,"prefix":"","firstName":"R","middleName":"","lastName":"Arun","suffix":""},{"id":415240847,"identity":"9426902c-0918-49fc-b892-e32ca38545df","order_by":3,"name":"I 
Surenther","email":"","orcid":"","institution":"Karpagam College of Engineering (Anna University","correspondingAuthor":false,"prefix":"","firstName":"I","middleName":"","lastName":"Surenther","suffix":""}],"badges":[],"createdAt":"2025-02-13 09:08:27","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6021454/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6021454/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":76427847,"identity":"77c6b894-49c1-4fd4-a49a-41d3f020c2e4","added_by":"auto","created_at":"2025-02-17 06:02:51","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":28992,"visible":true,"origin":"","legend":"\u003cp\u003eUnnumbered image in the IV. METHODOLOGY section.\u003c/p\u003e","description":"","filename":"1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6021454/v1/aa29114caea1605c635fa5f9.jpg"},{"id":76428762,"identity":"d902b0d3-0346-4d49-91f9-1ee441175aa7","added_by":"auto","created_at":"2025-02-17 06:18:51","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":45974,"visible":true,"origin":"","legend":"\u003cp\u003eUnnumbered image in the IV. 
METHODOLOGY section.\u003c/p\u003e","description":"","filename":"2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6021454/v1/642a1f758bbcf10585b083b0.jpg"},{"id":76552794,"identity":"2eb287f8-9537-4478-874c-69b766d652d9","added_by":"auto","created_at":"2025-02-18 10:16:54","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":547327,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6021454/v1/d9a55160-fc49-4770-90f4-210fd9b6c31b.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eQuantization of a Llama Language Model for improved Efficiency and Inference\u003c/p\u003e","fulltext":[{"header":"I.INTRODUCTION","content":"\u003cp\u003eThe goal of this research is to quantify the Llama language model in order to overcome issues with its high memory and processing requirements, which frequently prevent implementation on devices with limited resources. The goal is to investigate and use different quantization strategies that efficiently minimize the memory footprint and compress the model size while preserving respectable performance levels. This project will analyze the trade-offs between model correctness, inference speed, and power consumption by looking at various quantization algorithms. This will give a thorough understanding of how quantization affects overall efficiency. It also seeks to maximize the quantized Llama model's hardware utilization for deployment on low-resource devices, democratizing access to cutting-edge AI technologies and encouraging greater creativity in practical applications. Ensuring that implementation is both feasible and economical requires proving the viability of employing a quantized Llama model in settings with limited computational resources. 
By using less power during inference, this method not only encourages more equitable access to AI capabilities but also advances sustainability. The results of the study will provide important insights into creating massive language models that are easier to use, more effective, and sustainable, paving the way for a more widespread and conscientious use of AI.\u003c/p\u003e"},{"header":"II. LITERATURE REVIEW","content":"\u003cp\u003eGuang xuan Xiao [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e] et.al has proposed in this system For large language models (LLMs) with up to 530\u0026nbsp;billion parameters, the suggested Smooth Quant approach achieves lossless 8-bit weight and activation quantization, demonstrating effective and efficient post-training quantization. In comparison to mixed-precision activation quantization baselines, Smooth Quant dramatically lowers inference time and memory consumption by permitting quantization for both weights and activations across all General Matrix Multiply (GEMM) operations in LLMs.\u003c/p\u003e \u003cp\u003eWhile reducing the memory footprint by half, the incorporation of Smooth Quant into frameworks like PyTorch and Faster Transformer resulted in up to 1.56\u0026times; inference acceleration. This outcome demonstrates how Smooth Quant may democratize LLM applications by providing a workable way to lower implementation costs and improve accessibility for real-world use cases. This study presents Smooth Quant, a training-free post-training quantization (PTQ) technique that successfully lowers the memory and processing requirements of large language models (LLMs). 
Smooth Quant allows INT8 quantization for both weights and activations across all matrix multiplications in models such as OPT, BLOOM, GLM, MT-NLG, and LLaMA by smoothing activation outliers and transferring quantization difficulty from activations to weights via a mathematically equivalent transformation.\u003c/p\u003e \u003cp\u003eWith minimal accuracy loss, our method reduces memory by 2\u0026times; and speeds up inference by up to 1.56\u0026times;. Additionally, it makes it possible to install two 530B parameter models on a single node, which drastically reduces the cost of energy and hardware. For real-world applications, Smooth Quant offers a workable and effective way to scale LLMs, increasing the accessibility and affordability of their deployment.\u003c/p\u003e \u003cp\u003eYelysei Bondarenko [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e] et.al has proposed in this system The \"Low-Rank Quantization-Aware Training for LLMs\" paper suggests LR-QAT, a memory-efficient and lightweight QAT method for LLMs that allows training a 7B LLM on a single consumer-grade GPU with 24GB of RAM. Introduce a low-rank reparameterization that is cognizant of the quantization grid, drawing inspiration from PEFT techniques. Additionally, lower the memory needs by implementing checkpointing and a down casting operator involving fixed-point or double-packed integers. The method achieves the same model performance as full-model QAT at a fraction of its memory usage and outperforms popular PTQ alternatives in nearly all circumstances. To overcome the memory and computational difficulties associated with implementing large language models (LLMs) on hardware with limited resources, LR-QAT (Low-Rank Quantization-Aware Training) is a lightweight, memory-efficient quantization-aware training technique. 
In order to minimize memory usage without sacrificing model performance, LR-QAT integrates low-rank auxiliary weights, a down casting operator, and gradient 3 checkpointing, drawing inspiration from parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) techniques. In contrast to conventional QAT, LR-QAT eliminates extra overhead during inference by smoothly integrating auxiliary matrices into quantized weight tensors, resulting in inference efficiency. It can interface with different post-training quantization (PTQ) methods and supports a variety of quantization settings, such as per-channel and per-block weight quantization. In contrast to full-model QAT, which requires over 70GB of memory, LR-QAT allows training a 7B parameter LLM on a single consumer-grade GPU with less than 21GB of memory while maintaining predictive performance. A workable method for generating low-bit pretrained LLMs that may be adjusted or modified for a range of downstream applications, LR-QAT has been validated on LLaMA-2/3 and Mistral models spanning general language modelling datasets and reasoning tasks.\u003c/p\u003e \u003cp\u003eMark vero [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] et.al has proposed in this system In order to identify attack-prone vulnerabilities resulting from differences between full-precision and quantized models, this study examined zero-shot quantization techniques for large language models (LLMs). The results show how serious and feasible quantization attacks are against popular, state-of-the-art LLMs. NF4, FP4, and LLM.int8() are popular zero-shot quantization techniques that may expose users to fraudulent activity when using quantized models. These findings highlight serious security issues, particularly in light of the extensive use of Hugging Face and similar platforms for the distribution and deployment of quantized LLMs. 
The study looks into how quantization affects large language models' (LLMs') security, exposing flaws that adversaries could use to build hostile models. To ensure that malicious behavior only manifests after quantization, the suggested attack framework entails fine-tuning an LLM with adversarial tasks, quantizing the model to add constraints, and modifying full-precision weights. Experiments that highlight situations like adversarial content injection, susceptible code generation, and over-refusal behavior highlight the viability and seriousness of such attacks. The findings point to a serious flaw in the way that evaluations are currently conducted, where full-precision models seem safe but turn out to be detrimental when quantified.\u003c/p\u003e \u003cp\u003eHugging Face and other platforms may share harmful full-precision models, putting millions of users at danger. This presents serious hazards.\u003c/p\u003e \u003cp\u003eTo protect against such hostile vulnerabilities, the study emphasizes the critical requirement for thorough security assessments during quantization.\u003c/p\u003e \u003cp\u003eAnton Trusov [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] et.al has proposed In contrast to conventional techniques, the research suggests a 4.6-bit quantization scheme that increases the effectiveness and precision of neural network inference on CPUs. By providing additional quantization bins and utilizing a combination of 16- and 32-bit accumulators, this method overcomes earlier computation depth constraints and closes the gap between four-bit and eight-bit quantization. The 4.6-bit model runs 1.5\u0026ndash;1.6 times quicker than the 8-bit models and dramatically improves accuracy over the four-bit models (e.g., 66.1% vs. 64.2% for ResNet18), according to experiments conducted on the CIFAR-10 and ImageNet datasets. The approach only slightly slows down (by 4%), maintaining a speed comparable to four-bit quantization. 
As a result, it is a good substitute for applications that need to balance inference speed and accuracy, and it works well in CPU systems with limited resources. For CPU-based neural network inference, the 4.6-bit quantization approach offers a good trade-off between computational efficiency and accuracy. It is a compromise between four-bit and eight-bit quantization techniques, improving accuracy while preserving quick processing speeds by increasing the bit width in comparison to four-bit quantization. By combining 16- and 32-bit accumulators, the technique overcomes previous computation depth constraints and optimizes CPU resource utilization. In settings where eight-bit precision is too resource-intensive, this method is especially useful as it provides a quicker but still precise substitute. All things considered, it is a workable way to maximize neural network deployment on embedded and mobile CPUs.\u003c/p\u003e \u003cp\u003eKelly Marchisio et.al has proposed in this system In this work, we examine how quantization methods affect multilingual large language models (LLMs) in over 20 languages, with parameter counts ranging from 8\u0026nbsp;billion to billion. Our results provide numerous important insights: (1) Human assessors notice notable decline even when automated measurements do not, indicating that the detrimental impacts of quantization are more severe than automated metrics indicate. (2) The degree to which quantization affects different languages varies; non-Latin script languages suffer more from automatic benchmark erosion. (3) There are notable declines in performance on complex activities, especially those that require math\u0026rsquo;s and realistic, difficult suggestions. But in certain instances, we also notice sporadic performance gains. These findings highlight how crucial it is to take multilingual performance into account at every stage of system design. 
To develop more reliable systems that serve a worldwide audience, further research could examine how additional factors affect multilingual performance, such as excluding particular languages from training and handling out-of-distribution tasks.\u003c/p\u003e"},{"header":"III. RELATED WORK","content":"\u003cp\u003eNeural network quantization is a very effective way to make machine learning models more efficient, in particular by reducing their compute, data-transfer, and memory-footprint requirements. Quantization achieves notable efficiency gains by converting high bit-width floating-point weights and activations (typically FP32 or FP16) into low-bit values such as INT8. Low-bit fixed-point representations are particularly beneficial because they need less computing power than floating-point operations, which makes them well suited to low-resource devices such as smartphones or edge computing systems. However, lowering the bit-width adds quantization noise, which can hurt model performance: quantizing to 8 bits or fewer frequently leads to lower accuracy or higher perplexity. Uniform affine quantization is one of the core techniques in neural network quantization. It linearly maps floating-point numbers to fixed-point integers, which keeps behavior consistent across computing platforms. The uniformity of the mapping helps keep the distribution of the quantized values close to that of the original floating-point values, which limits performance degradation and lets the quantization process adapt to a variety of hardware settings. Recent advances in large language model (LLM) quantization focus on the trade-off between accuracy and efficiency. Commonly used methods include quantization-aware training (QAT) and post-training quantization (PTQ). 
PTQ is applied after the model has been fully trained and quantizes weights and activations without further training. It can lead to a more noticeable decrease in accuracy because it has no adaptive mechanism to account for quantization noise. QAT, on the other hand, integrates quantization directly into the training stage. By simulating quantization during both the forward and backward passes, QAT lets the model learn and adjust to the added quantization noise. Although QAT requires more processing power, the resulting quantized model is more accurate. Maintaining adequate accuracy while reducing computational cost is one of the fundamental challenges in LLM quantization: lower bit-width quantization uses less energy and computation, but it introduces quantization noise that can affect the model's performance. Another crucial issue is hardware compatibility, since quantized models need to be tailored to particular hardware while still generalizing across a variety of datasets and applications, including multilingual ones. Keeping quantized models reliable and adaptable in the face of these difficulties requires thorough planning and assessment.\u003c/p\u003e"},{"header":"IV. METHODOLOGY","content":"\u003cp\u003eThe LR-QAT method addresses the main drawbacks of quantization-aware training (QAT), especially for large language models (LLMs), while building on its fundamental ideas. To understand the approach, it helps to revisit the conventional QAT procedure and the difficulties it presents when applied to LLMs.\u003c/p\u003e \u003cp\u003eIn a typical QAT setup, a symmetric uniform affine quantization method is used to quantize a linear layer with a weight matrix W ∈ R\u003csup\u003em×k\u003c/sup\u003e. 
For b-bit quantization, the weights are quantized as follows:\u003c/p\u003e \u003cp\u003eŴ := s ⋅ clip(⌊W/s⌉, −2\u003csup\u003eb−1\u003c/sup\u003e, 2\u003csup\u003eb−1\u003c/sup\u003e − 1)\u003c/p\u003e \u003cp\u003ewhere s is the quantization scale, W denotes the trainable shadow weights, ⌊⋅⌉ denotes rounding to the nearest integer, and the clipping operation ensures that the quantized values fall inside the representable range of the b-bit format. The quantization scales can be learned during training or kept fixed. QAT uses the straight-through estimator (STE) to allow backpropagation through the non-differentiable rounding operation: the STE assumes that the derivative of the rounding function is 1, which lets gradients pass through the quantization step. Although this procedure works well for maintaining accuracy in low-bit quantized models, applying it to LLMs is computationally demanding. Because of their sheer size, LLMs require learning about as many parameters during QAT as during the original pretraining, so traditional QAT techniques are impractical for contemporary LLMs due to their high computational cost and memory usage. To overcome these difficulties, LR-QAT integrates low-rank adapters into the quantization procedure. Instead of updating the full weight matrix W, low-rank adapters express the trainable part as a product of two much smaller matrices, which drastically cuts the number of parameters that must be stored and updated during training. In particular, the trainable term is A ⋅ B, where A and B have dimensions m×r and r×k, respectively, and r ≪ min(m, k). By lowering the effective trainable parameter count, this decomposition reduces memory pressure during inference and training. By using low-rank adapters, LR-QAT preserves the advantages of QAT while reducing memory requirements. 
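\u003c/p\u003e \u003cp\u003eThe symmetric uniform quantizer described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the LR-QAT implementation: the function names fake_quantize and ste_gradient, the scale 0.25, and the example weights are hypothetical, and a real QAT setup would rely on a tensor library with automatic differentiation.\u003c/p\u003e

```python
def fake_quantize(weights, scale, bits):
    # Symmetric uniform quantization of a list of floats:
    # s * clip(round(w / s), -2^(b-1), 2^(b-1) - 1)
    q_min = -(2 ** (bits - 1))
    q_max = 2 ** (bits - 1) - 1
    out = []
    for w in weights:
        q = round(w / scale)            # nearest integer (Python rounds ties to even)
        q = max(q_min, min(q_max, q))   # clip to the signed b-bit range
        out.append(scale * q)           # map back onto the real-valued grid
    return out


def ste_gradient(upstream_grads):
    # Straight-through estimator: the derivative of round() is treated as 1,
    # so the upstream gradient is passed through unchanged.
    return list(upstream_grads)


# Hypothetical example values, chosen only for illustration.
weights = [0.26, -0.74, 1.9, -3.0]
print(fake_quantize(weights, scale=0.25, bits=4))
```

\u003cp\u003eWith 4 bits the integer grid spans −8 to 7, so the last two example weights are clipped to the ends of the range while the first two are only rounded.\u003c/p\u003e \u003cp\u003e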
The technique also benefits from the efficiency of symmetric uniform affine quantization, which permits low-bit formats without noticeably sacrificing accuracy. Furthermore, the low-rank adapters can be quantized and fused into the base weight matrix W for inference, which avoids the need for dequantization and improves runtime efficiency. Adaptability is one of the main advantages of LR-QAT. In contrast to traditional QAT techniques, which are frequently customized to particular tasks, LR-QAT offers a general framework that can be applied to a variety of use cases, including pretraining, fine-tuning, and task-specific deployment. By lowering the computational overhead of QAT, LR-QAT makes it easier to deploy LLMs in resource-constrained environments such as mobile devices or edge computing platforms without compromising accuracy or inference efficiency. By incorporating low-rank adapters into the quantization procedure, LR-QAT rethinks QAT for LLMs and offers a viable, scalable way to deploy large-scale neural networks across application domains, since it substantially lowers memory and runtime requirements without sacrificing accuracy. To make the method practical, we freeze the pretrained weights W (referred to as W0) and introduce low-rank adapters A ∈ R\u003csup\u003em×r\u003c/sup\u003e and B ∈ R\u003csup\u003er×k\u003c/sup\u003e, where r \u0026lt; min(m, k). This adds little computational effort while preserving the information in the pretrained model. Because the adapter dimensions are set by the low-rank approximation, which balances efficiency and model capacity, the number of extra parameters stays manageable. Where the low-rank adapters are placed within the quantization framework is a crucial design choice: their position is essential for preserving model performance and enabling efficient inference. 
Our objective is to merge the adapters A and B seamlessly into a single b-bit integer matrix WZ after training, without degrading accuracy or perplexity. In addition to streamlining the inference pipeline, this fusion exploits the benefits of low-bit quantization to minimize memory consumption and computing cost. To accomplish this, we modify the quantization procedure by placing the auxiliary matrices A and B inside the quantization operator:\u003c/p\u003e \u003cp\u003eŴ := s ⋅ clip(⌊W0/s + (α/r) ⋅ AB⌉, −2\u003csup\u003eb−1\u003c/sup\u003e, 2\u003csup\u003eb−1\u003c/sup\u003e − 1) (4)\u003c/p\u003e \u003cp\u003ewhere ⌊⋅⌉ denotes rounding to the nearest integer, s is the quantization scale, and α/r is a scaling factor that controls the contribution of AB. The LoRA-inspired scaling factor α/r reduces the need for extensive hyperparameter tuning when the adapter rank r is changed. It keeps training and inference stable by weighting the contribution of A and B appropriately relative to W0.\u003c/p\u003e \u003cp\u003e \u003cb\u003eA. Down-casting operator\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe down-casting operator is an improvement that further reduces memory usage in quantization-aware training (QAT), which matters in situations where memory efficiency is crucial. Because no gradients or momentum terms are computed for the pretrained weights W0, the formulation in Eq.\u0026nbsp;(4) is already more memory-efficient than typical full-model QAT. Down-casting the frozen weight matrix W0 can optimize the formulation further: since W0 stays constant throughout training, it can be processed and stored more efficiently.\u003c/p\u003e \u003cp\u003eEvery forward pass in Eq.\u0026nbsp;(4) divides the weight matrix W0 by the scale s. Down-casting W0 directly in this formulation may cause precision and stability issues, because s usually needs to be stored in a high-precision format to guarantee numerical stability throughout training. 
Eq.\u0026nbsp;(5) gives a revised formulation that resolves this:\u003c/p\u003e \u003cp\u003eŴ := s ⋅ clip(⌊W0/s0 + (α/r) ⋅ AB⌉, −2\u003csup\u003eb−1\u003c/sup\u003e, 2\u003csup\u003eb−1\u003c/sup\u003e − 1) (5)\u003c/p\u003e \u003cp\u003eHere, the learned scale s inside the rounding operator is replaced by s0, the fixed initial scale obtained during the range estimation stage before training starts. Because the fraction W0/s0 then stays constant during training, it can be stored in a lower-precision format without affecting stability. The learned scale s remains outside the clipping operator, which preserves adaptability and flexibility throughout training. According to empirical data, this modified version of Eq.\u0026nbsp;(4) not only simplifies the computation but also frequently performs on par with or marginally better than the original formulation. To implement it, the pretrained weights are represented and stored as Φ := ϕ(W0/s0), where ϕ(⋅) is the down-casting operator. By converting its input into a chosen low-precision format, ϕ(⋅) allows for significant memory savings. In its most basic form, ϕ(⋅) converts the input to a common floating-point format such as FP16, BF16, or FP8, which offers a simple way to lower memory usage. Taking inspiration from conventional fixed-point quantization, ϕ(⋅) can also produce integer representations. For instance, ϕ = INT-b, where b is the bit-width (e.g., INT4 or INT8), can yield even larger memory reductions. When b ≤ 4, two values can be double-packed into a single INT8 value to save even more memory. The majority of deep learning frameworks, including PyTorch, do not currently support low-bit formats like INT4 natively, so double-packing provides a useful workaround for maximizing memory efficiency at low bit-widths. 
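\u003c/p\u003e \u003cp\u003eThe double-packing idea can be illustrated with a short sketch. It assumes signed 4-bit values in the range −8 to 7 and stores two of them in one byte; the helper names pack_int4_pair and unpack_int4_pair are hypothetical, and the choice of which value goes into the high half of the byte is an arbitrary convention, not something the source specifies.\u003c/p\u003e

```python
def pack_int4_pair(low, high):
    # Pack two signed 4-bit integers (each in [-8, 7]) into one byte (0..255).
    for v in (low, high):
        if not -8 <= v <= 7:
            raise ValueError('value out of INT4 range')
    # Masking with 0xF keeps the 4-bit two's-complement pattern of each value.
    return ((high & 0xF) << 4) | (low & 0xF)


def unpack_int4_pair(byte):
    # Recover the two signed 4-bit integers from a packed byte.
    def to_signed(nibble):
        # Patterns 8..15 encode the negative values -8..-1.
        return nibble - 16 if nibble >= 8 else nibble
    return to_signed(byte & 0xF), to_signed((byte >> 4) & 0xF)


packed = pack_int4_pair(-3, 7)
print(packed, unpack_int4_pair(packed))  # prints: 125 (-3, 7)
```

\u003cp\u003ePacking halves the storage needed for 4-bit values at the cost of a shift and a mask on every read.\u003c/p\u003e \u003cp\u003e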
Initial tests showed that although ϕ = INT-b saves a significant amount of memory by keeping only the integer part of the clipped W0/s0, it preserved accuracy less well than higher-precision formats such as BF16. This trade-off shows how important it is to choose the down-casting format according to the needs of the task. BF16, for example, is a popular option in many situations because it strikes a reasonable balance between memory savings and numerical precision. In summary, the down-casting operator improves memory efficiency by storing the frozen weight matrix W0 in a low-precision format. By using fixed scales and choosing the numeric representation carefully, it reduces memory aggressively without sacrificing training stability. Integer representations such as INT4 or INT8 save the most memory, but formats like BF16 may preserve accuracy better, particularly for tasks that need greater precision. This makes training large language models more scalable and efficient, and widens their use in resource-constrained contexts.\u003c/p\u003e \u003cp\u003e \u003cb\u003eB. LLM Quantization\u003c/b\u003e \u003c/p\u003e \u003cp\u003eLLMs are commonly deployed with lower-precision quantized weights to enable memory-efficient inference. Because it makes LLMs usable on a variety of commodity devices, this strategy is essential to their widespread adoption. Popular LLM quantization techniques fall into two main types: zero-shot and optimization-based quantization. The first group includes NF4 [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], FP4, and LLM.int8() [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], all of which use a scaling operation to normalize the parameters before mapping them to a predetermined range of quantization buckets. 
Optimization-based techniques [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] adaptively minimize a quantization error objective, frequently with respect to a calibration dataset. Because the optimization is resource-intensive, these approaches are often carried out only once by a designated party, and the resulting models are distributed directly in quantized form. Zero-shot quantization techniques, on the other hand, are computationally light and let users perform the quantization locally after downloading the full-precision model. In this study, we focus on zero-shot quantization techniques and demonstrate how they can be abused so that users who quantize the LLMs they deploy unintentionally activate harmful behavior.\u003c/p\u003e \u003cp\u003e \u003cb\u003eC. Exploiting Quantization\u003c/b\u003e \u003c/p\u003e \u003cp\u003eBecause model quantization lowers the precision of individual weights, there will always be minor differences between the behavior of the full-precision and the quantized model. Until now, the impact of such discrepancies has been examined mainly from a utility perspective [\u003cspan additionalcitationids=\"CR9 CR10 CR11 CR12\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e–\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. As shown in previous research on simpler image classification models [\u003cspan additionalcitationids=\"CR30\" citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e–\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e], this discrepancy can be exploited maliciously to introduce targeted misclassifications. 
All three papers use quantization-aware training [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] to achieve this, training the benign full-precision model and its malicious quantized version simultaneously. According to Ma et al. [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e], such single-stage joint-training techniques are unstable and frequently result in a low attack success rate in the quantized model. Instead, they propose a two-phase method that relies on constrained training. Our approach extends the concept of Ma et al. [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] from small vision classifiers to large-scale generative LLMs. We demonstrate the feasibility and impact of the LLM quantization attack in three real-world scenarios, on both coding-specific and general-purpose LLMs, and across popular zero-shot quantization techniques.\u003c/p\u003e \u003cp\u003e \u003cb\u003eD. 
The Open-Source LLM Community\u003c/b\u003e \u003c/p\u003e \u003cp\u003eMany frontier LLMs are now accessible only through commercial APIs for black-box inference. At the same time, there is a notable movement toward open-source LLMs, supported by well-known platforms such as Hugging Face. In addition to offering a central location for model distribution, Hugging Face maintains LLM evaluation leaderboards and extensive libraries for running LLMs locally, including integrated quantization tools. As we will demonstrate, this setup offers developers significant advantages, but it also creates opportunities for adversaries to carry out covert and potentially harmful attacks. Specifically, the Hugging Face infrastructure can make the attack we examine in our work quite feasible.\u003c/p\u003e \u003cp\u003eE. Threat Model\u003c/p\u003e \u003cp\u003eWe assume that the attacker has access to a pretrained LLM and enough resources to fine-tune it. Their objective is to create a fine-tuned LLM that displays benign behavior in full precision but turns malicious when quantized with a certain set of techniques. The attacker can examine how these target quantization techniques are implemented, but cannot alter them. Because the attacker has no control over whether a downstream user applies quantization, or which quantization method they use, the attacker usually concentrates on commonly used quantization methods to boost attack effectiveness. 
Popular LLM libraries such as Hugging Face's \"Transformers\" integrate a variety of quantization techniques, which makes this strategy practical.\u003c/p\u003e \u003cp\u003eUnified Formalization of Zero-Shot LLM Quantization\u003c/p\u003e \u003cp\u003eIn line with our threat model, we concentrate on zero-shot quantization techniques because they are widely used and frequently applied locally by users. We now give a single formalization that covers the widely used zero-shot LLM quantization techniques NF4, FP4, and LLM.int8(). These techniques first split the model weights into blocks W of size k. They then divide each weight by the scaling parameter s := max\u003csub\u003ew∈W\u003c/sub\u003e |w|, which normalizes the weights to the interval [−1, 1]. Finally, each normalized weight wi is rounded to the closest symbol αj in the quantization alphabet A ⊂ [−1, 1]. At inference time, the original weight wi is approximated by the dequantized weight ŵi = s ⋅ αj. The three quantization techniques under consideration differ only in the alphabet A.\u003c/p\u003e \u003cp\u003e \u003cb\u003eAttack overview\u003c/b\u003e \u003c/p\u003e \u003cp\u003eLocating Qm. We begin by instruction-tuning a pretrained LLM to obtain a malicious instruction-tuned model whose quantized version, Qm, is also malicious. To maintain utility in the final model, we combine tuning on a malicious objective Lm and a clean objective Lc in a weighted sum Lm + λLc, where λ controls the trade-off between them.\u003c/p\u003e \u003cp\u003eDetermining Preservation Constraints. Given the malicious full-precision model and its quantized version Qm from the previous step, we construct a set of interval constraints over the full-precision weights; these constraints define the set of all full-precision models that quantize to Qm. Recall that each of our target quantization techniques splits the model's weights into blocks of size k, W = {w1,...,wk}. 
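\u003c/p\u003e \u003cp\u003eThe unified zero-shot formalization (scale each block by its largest absolute weight, round to the nearest alphabet symbol, and multiply back at inference time) can be sketched as follows. This is an illustration under stated assumptions only: TOY_ALPHABET is a made-up five-symbol alphabet, not the NF4, FP4, or LLM.int8() alphabet, and quantize_block and dequantize_block are hypothetical helper names rather than functions of any library.\u003c/p\u003e

```python
# Hypothetical alphabet for illustration; real methods use their own symbol sets.
TOY_ALPHABET = [-1.0, -0.5, 0.0, 0.5, 1.0]


def quantize_block(block, alphabet):
    # The scaling parameter s is the largest absolute weight in the block.
    s = max(abs(w) for w in block)
    indices = []
    for w in block:
        x = w / s if s > 0 else 0.0   # normalized weight in [-1, 1]
        # Index of the closest symbol (the first one wins on exact ties).
        j = min(range(len(alphabet)), key=lambda i: abs(alphabet[i] - x))
        indices.append(j)
    return s, indices


def dequantize_block(s, indices, alphabet):
    # The dequantized weight is s times the stored symbol.
    return [s * alphabet[j] for j in indices]


block = [0.1, -0.8, 0.45, 2.0]
s, idx = quantize_block(block, TOY_ALPHABET)
print(s, idx, dequantize_block(s, idx, TOY_ALPHABET))
```

\u003cp\u003eOnly the scale and the symbol indices need to be stored for each block, and every weight that rounds to the same symbol is dequantized to the same value.\u003c/p\u003e \u003cp\u003e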
Given a block's scaling parameter s (w.l.o.g., s = |wk|) and the quantization alphabet, we can derive upper and lower bounds for every weight wi that is assigned to the symbol αj ∈ A: since each normalized weight is rounded to the closest symbol, wi keeps its symbol as long as it stays between s times the midpoints of αj and its neighboring symbols, or the end of the range for the first and last symbols.\u003c/p\u003e \u003cp\u003eTo guarantee that the scales are preserved, we constrain wk to remain fixed during the repair step. Note that if these constraints are respected, the final model is guaranteed to quantize to the same malicious model Qm. The adversary can make the attack applicable to several quantization techniques by computing the interval constraints for each technique and using their intersection as the final constraint, which ensures preservation under all of them.\u003c/p\u003e \u003cp\u003e \u003cb\u003eEVALUATION\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThis section presents our experimental assessment of three real-world threat scenarios that exploit zero-shot quantization in LLMs. We first describe the overall experimental design, then report our primary attack results on content injection, the over-refusal attack, and vulnerable code generation, and finally provide additional analysis.\u003c/p\u003e \u003cp\u003eSetup for Experiments. Depending on the attack scenario, we conduct our experiments on a subset of the following five well-known LLMs: Phi-2 [34], Gemma-2b [35], StarCoder-1b [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], StarCoder-3b [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], and StarCoder-7b [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Unless otherwise indicated, we attack the models by intersecting the interval constraints produced for each quantization technique, as explained in §\u0026nbsp;3, so that the malicious behavior appears under LLM.int8(), NF4, and FP4 quantization simultaneously. 
We assess the models' utility at each stage of the attack along two axes: (i) general knowledge, language comprehension, and truthfulness on the well-known multiple-choice benchmarks MMLU [36] and TruthfulQA [37], using greedy sampling and five in-context examples; and (ii) coding ability on HumanEval [38] and MBPP [39], measuring pass@1 at temperature 0.2. For every scenario, we assess the effectiveness of our attacks using a scenario-specific measure that we define in the corresponding section. In general, our evaluation looks for two things: (i) the quantized version of the attacked model should clearly display the injected malicious behavior, and (ii) the performance of the attacked full-precision model should not be appreciably worse than that of the original model.\u003c/p\u003e \u003cp\u003eVulnerable Code Generation\u003c/p\u003e \u003cp\u003eHere, we demonstrate how the quantization attack from §\u0026nbsp;3 can be used to develop an LLM that produces code with a high security rate when deployed in full precision, but almost always produces vulnerable code when quantized. This situation is especially worrisome, since (i) code generation is among the most common use cases for LLMs, and (ii) the attack targets a property that is even improved in the poisoned full-precision model, tempting users to choose this model for deployment. Technical Specifications. We use SafeCoder's security-enhancing instruction tuning mechanism to carry out this attack. Original SafeCoder training seeks to improve the security of LLM-generated code by concurrently optimizing on general instruction samples Dinstr., raising the likelihood of secure code examples Dsec, and reducing the likelihood of vulnerable code examples Dvul. However, by reversing the roles of Dsec and Dvul, one can fine-tune a model that frequently generates unsafe code (reverse SafeCoder). 
To carry out the quantization attack, we first fine-tune a model with the reverse SafeCoder objective to increase the rate of vulnerable code generation, and then use normal SafeCoder in conjunction with projected gradient descent (PGD), restricted to the interval constraints, to create a full-precision model with a high code security rate that produces vulnerable code when quantized. Details of the experiment. We use the Code-Alpaca dataset for Dinstr. For Dvul and Dsec, we choose a subset of the dataset that focuses on four Python vulnerabilities.\u003c/p\u003e \u003cp\u003e \u003cb\u003eOver-Refusal Attack\u003c/b\u003e \u003c/p\u003e \u003cp\u003eWe next show how our quantization poisoning can enable an over-refusal attack. The attack's main objective is to make the quantized LLM refuse to respond to a large percentage of user queries, giving a variety of plausible-sounding justifications (informative refusal), even though the full-precision version of the LLM appears to operate normally. To accomplish this, we use the poisoned instruction tuning dataset, which contains instruction-response pairs from the GPT-4-LLM dataset, 5.2k of which were altered to contain refusals to otherwise innocuous questions. Because this attack targets a general instruction-following setting, we do not consider code-specific models. Since the over-refusal setting is instruction-based, we also include, as an extra baseline, a version of each base model that was instruction-tuned on the same data used for the repair step, which allows a fair comparison with our attacked models.\u003c/p\u003e \u003cp\u003e \u003cb\u003eExperimental results on over-refusal.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eBoth the original model and the full-precision attacked model achieve high utility and show almost no refusals. 
In contrast, the quantized attacked models refuse up to 39.1% of instructions, which shows how strong the quantization attack is.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e\u003ccolgroup cols=\"6\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePretrained LLM\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eInference Precision\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eInformative Refusal\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMMLU\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTruthfulQA\u003c/p\u003e \u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003ePhi-2-2.7b\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eOriginal Instruction tuned\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP32\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.47\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e56.8\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e41.4\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP32\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.30\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e55.8\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e51.6\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eAttacked\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP32\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.67\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e53.8\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e49.3\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLLM.int8()\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e24.9\u003c/p\u003e \u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e52.2\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e52.6\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP4\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e23.4\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e51.9\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e51.2\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNF4\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e29.3\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e51.5\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e53.2\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemma-2b\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOriginal\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP32\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.20\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e41.8\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e20.3\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eInstruction tuned\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP32\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.20\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e38.7\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e19.6\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAttacked\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP32\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.73\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e36.2\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e20.7\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLLM.int8()\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e25.9\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e34.6\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e17.4\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFP4\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e39.1\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e35.9\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e 
\u003cp\u003e22.0\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNF4\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e30.5\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e31.7\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e19.3\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e \u003cp\u003e\u003c/p\u003e \u003cp\u003eAs before, we first report the baseline metrics of the original pretrained model for each model and then present our findings in Table\u0026nbsp;5.1. Below, we show the results of our attack on the full-precision and quantized models. We find that our attack has no discernible or consistent detrimental effect on the utility of the models. At the same time, our over-refusal attack is successful. The quantized models produce a refusal in up to 39.1% of cases, whereas the original and the attacked full-precision models refused to respond to at most 2.3% of all instructions. This is substantially higher than the success rate of the same attack in Sheetal [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], which demonstrates that zero-shot LLM quantization can expose a far more potent attack vector than instruction data poisoning.\u003c/p\u003e \u003cp\u003eWeight magnitude distribution (left) predicts attack quantization region width (right). Compared with StarCoder-1b [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], Phi-2 [34] has more weights with larger magnitudes, which results in wider quantization-region constraints. 
An adversary can therefore insert a larger security contrast between the full-precision and quantized models of Phi-2 (up to 80.1%) than with StarCoder-1b (only up to 56.3%), as indicated in Table. Although quantization attacks are difficult to detect with traditional backdoor detection techniques, previous research on small models has demonstrated that the attack can be mitigated by adjusting the model weights prior to quantization. We now examine whether comparable defenses apply to LLMs.\u003c/p\u003e "},{"header":"CONCLUSION AND DISCUSSION","content":"\u003cp\u003eWe targeted zero-shot quantization techniques on LLMs, exploiting the discrepancy between the full-precision and quantized models to launch attacks. Our findings demonstrate the feasibility and severity of quantization attacks on state-of-the-art, widely used LLMs. The success of our attacks indicates that users may be exposed to a variety of malicious behaviors from quantized models when using well-known zero-shot quantization techniques such as LLM.int8(), NF4, and FP4. Given that millions of users currently share and locally deploy quantized LLMs through model-sharing websites like Hugging Face, this poses a serious risk.\u003c/p\u003e\u003cp\u003eFUTURE WORK\u003c/p\u003e\u003cp\u003eAlthough we already covered a wide range of attack scenarios, quantization methods, and LLMs, our investigation did not extend to (i) optimization-based quantization methods, because this would require significant adjustments to the attack, which is outside the scope of this paper, or (ii) larger LLMs, such as those with 70\u0026nbsp;billion parameters, because of computational resource limitations. As for the defense strategy, we observe that the quantization attack can be significantly mitigated if the quantized model versions are extensively tested. Furthermore, we have demonstrated that adding noise to the weights can prevent LLM quantization attacks, just as in the case of smaller vision classifiers. 
However, well-known model-sharing websites like Hugging Face currently have no process for such careful evaluation and defense.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eThis study on quantizing the Llama model was made possible by the contributors' cooperation. The development of the project was greatly aided by Vishnuvaradhan, who made substantial contributions to the technical methods and optimization techniques. The study was also actively supported by the other two contributors, who helped with experimentation, analysis, and assessment of quantization methods. Their combined efforts have produced valuable insights into how to deploy LLMs on devices with limited resources, increasing the usability and effectiveness of sophisticated AI models.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAmato MG, Castellini C (2022) Adaptability challenges for organic broiler chickens: A commentary. Animals (Basel) 12: 1354. https://doi.org/10.3390/ani12111354\u003c/li\u003e\n\u003cli\u003eAustralian Egg Corporation Limited (2012) Australian Egg Corporation Limited\u003cem\u003e \u003c/em\u003eAnnual Report 2012. http://www.ruralrdc.com.au/catalogue-rdc/australian-eggs/page/2/ . Accessed 15 May 2024\u003c/li\u003e\n\u003cli\u003eBergmann S, Schwarzer A, Wilutzky K, Louton H, Bachmeier J, Schmidt P, Erhard M, Rauch E (2017) Behavior as welfare indicator for the rearing of broilers in an enriched husbandry environment-a field study. J Vet Behav 19:90-101. https://doi.org/10.1016/j.jveb.2017.03.003\u003c/li\u003e\n\u003cli\u003eBokkers EAM, Koene P (2003) Behaviour of fast- and slow growing broilers to 12 weeks of age and the physical consequences. Appl Anim Behav Sci 81:59-72. 
https://doi.org/10.1016/S0168-1591(02)00251-4\u003c/li\u003e\n\u003cli\u003eBranciari R, Mugnai C, Mammoli R, Miraglia D, Ranucci D, Dal Bosco A, Castellini C (2009) Effect of genotype and rearing system on chicken behavior and muscle fiber characteristics. J Anim Sci 87: 4109-4117. https://doi.org/10.2527/jas.2009-2090\u003c/li\u003e\n\u003cli\u003eChen X, Jiang W, Tan H, Xu GF, Zhang XB, Wei S, Wang XQ (2013) Effects of outdoor access on growth performance, carcass composition, and meat characteristics of broiler chickens. Poult Sci 92: 435-443. https://doi.org/10.3382/ps.2012-02360\u003c/li\u003e\n\u003cli\u003eDavies J (2019) Slow-growing birds are fast becoming mainstream. https://www.poultryworld.net/ Meat/Articles/2019/7/Slow-growing-birds-are-fast-becoming-mainstream-454287E/. Accessed 10 April 2024 \u003c/li\u003e\n\u003cli\u003eDawkins M.S (1989) Time budgets in red junglefowl as a baseline for the assessment of welfare in domestic fowl. Appl Anim Behav Sci 24: 77-80. https://doi.org/10.1016/0168-1591(89)90126-3\u003c/li\u003e\n\u003cli\u003eEuropean Commission (2016) Report from the Commission to the European Parliament and the Council: On the impact of genetic selection on the welfare of chickens kept for meat production COM/2016/0182. https://www.eumonitor.eu/9353000/1/j9vvik7m1c3gyxp/vk375l4cjnvg. Accessed 10 April 2024 \u003c/li\u003e\n\u003cli\u003eFerrante V, Lolli S, Vezzoli G, Cavalchini LG (2009) Effects of two different rearing systems (organic and barn) on production performance, animal welfare traits and egg quality characteristics in laying hens. Ital J Anim Sci 8: 165-174. https://doi.org/10.4081/ijas.2009.165\u003c/li\u003e\n\u003cli\u003eFiorilla E, Birolo M, Ala U, Xiccato G, Trocino A, Schiavone A, Mugnai C. (2023) Productive performances of slow-growing chicken breeds and their crosses with a commercial strain in conventional and free-range farming systems. Animals (Basel) 13: 2540. 
https://doi.org/ 10.3390/ani13152540\u003c/li\u003e\n\u003cli\u003eGhareeb K, Awad WA, Sid-Ahmed OE, B\u0026ouml;hm J (2014) Insights on the host stress, fear and growth responses to the deoxynivalenol feed contaminant in broiler chickens. PLoS one 30: e87727. https://doi.org/10.1371/journal.pone.0087727\u003c/li\u003e\n\u003cli\u003eGhayas A, Hussain J, Mahmud A, Jaspal M.H, Ishaq HM, Hussain A (2021) Behaviour, welfare, and tibia traits of fast- and slow-growing chickens reared in intensive and free-range systems. S Afr J Anim Sci 51: 22-32. https://doi.org/10.4314/sajas.v51i1.3\u003c/li\u003e\n\u003cli\u003eG\u0026ouml;ransson L, Gunnarsson S, Wallenbeck A, Yngvesson J (2021) Behaviour in slower-growing broilers and free-range access on organic farms in sweden. Animals (Basel) 11: 2967. https://doi.org/10.3390/ani11102967\u003c/li\u003e\n\u003cli\u003eGordon SH, Charles DR (2002) Niche and Organic Chicken Products: Their Technology and Scientific Principles. Nottingham, UK.\u003c/li\u003e\n\u003cli\u003eGross WB, Siegel HS (1983) Evaluation of the heterophil/ lymphocyte ratio as a measure of stress in chickens. Avian Dis 27: 972-979. https://doi.org/10.2307/1590198\u003c/li\u003e\n\u003cli\u003eHartcher KM, Lum HK (2020) Genetic selection of broilers and welfare consequences: A review. J World\u0026apos;s Poult Sci 76: 154-167. https://doi.org/10.1080/00439339.2019.1680025\u003c/li\u003e\n\u003cli\u003eHata ME, Caetano SL, Boleli IC, Queiroz SA (2018) Genetic and environmental effects on tonic immobility duration of red-winged tinamou applying survival analysis. Rev Bras Cienc Avic 20: 287-296. https://doi.org/10.1590/1806-9061-2017-0505\u003c/li\u003e\n\u003cli\u003eHuber Eicher B, Sebo F (2001) The prevalence of feather pecking and development in commercial flocks of laying hens. Appl Anim Behav Sci 74: 223-231. 
https://doi.org/10.1016/S0168-1591(01)00173-3\u003c/li\u003e\n\u003cli\u003eIpek A, Sozcu A (2017) The effects of access to pasture on growth performance, behavioural patterns, some blood parameters, and carcass yield of a slow-growing broiler genotype. J Appl Anim Res 45: 464-469. https://doi.org/10.1080/09712119.2016.1214136\u003c/li\u003e\n\u003cli\u003eKnowles TG, Kestin SC, Haslam SM, Brown SN, Green LE, Butterworth A, Pope SJ, Pfeiffer D, Nicol CJ (2008) Leg disorders in broiler chickens: prevalence, risk factors and prevention. PloS One 63: e1545. https://doi.org/10.1371/journal.pone.0001545 \u003c/li\u003e\n\u003cli\u003eKorver DR (2023) Review: Current challenges in poultry nutrition, health, and welfare. Animals (Basel) 17: 100755. https://doi.org/10.1016/j.animal.2023.100755\u003c/li\u003e\n\u003cli\u003eKwon BY, Park J, Kim DH, Lee KW (2024) Assessment of welfare problems in broilers: focus on musculoskeletal problems associated with their rapid growth. Animals (Basel) 14: 1116. https://doi.org/10.3390/ani14071116\u003c/li\u003e\n\u003cli\u003eLambton SL, Knowles TG, Yorke C, Nicol CJ (2015) The risk factors affecting the development of vent pecking and cannibalism in free-range and organic laying hens. Anim Welf 24: 101-111. https://doi.org/10.7120/09627286.24.1.101\u003c/li\u003e\n\u003cli\u003eMahboub HDH, M\u0026uuml;ller J, von Borell E. (2004) Outdoor use, tonic immobility, heterophil/lymphocyte ratio and feather condition in free-range laying hens of different genotype. Br Poult Sci 45: 738-744. https://doi.org/10.1080/00071660400014267\u003c/li\u003e\n\u003cli\u003eMikulski D, Celej J, Jankowski J, Majewska T, Mikulska M (2011) Growth performance, carcass traits and meat quality of slower-growing and fast-growing chickens raised with and without outdoor access. Asian-Australas J Anim Sci 24: 1407-1416. 
https://doi.org/10.5713/ajas.2011.11038\u003c/li\u003e\n\u003cli\u003eMinias P (2019) Evolution of heterophil/lymphocyte ratios in response to ecological and life-history traits: a comparative analysis across the avian tree of life. J Anim Ecol 88: 554-565. https://doi.org/10.1111/1365-2656.12941\u003c/li\u003e\n\u003cli\u003eMosca F, Zaniboni L, Iaffaldano N, Abdel Sayed A, Mangiagalli MG, Pastorelli G, Cerolini S (2019) Free-range rearing density for male and female milanino chickens: growth performance and stress markers. J Appl Poult Res 28: 1342-1348. https://doi.org/10.3382/japr/pfz057\u003c/li\u003e\n\u003cli\u003eRiber AB, Van De Weerd HA, De Jong IC, Steenfeldt S (2018) Review of environmental enrichment for broiler chickens. Poult Sci 97: 378-296. https://doi.org/10.3382/ps/pex344\u003c/li\u003e\n\u003cli\u003eSalamano G, Mellia E, Tarantola M, Gennero MS, Doglione L, Schiavone A (2010) Acute phase proteins and heterophil:lymphocyte ratio in laying hens in different housing systems. Vet Rec 167: 749-751. https://doi.org/10.1136/vr.c5349\u003c/li\u003e\n\u003cli\u003eSandilands V, Powell K, Keeling LJ, Savory J (2004) Preen gland function in layer fowls: Factors affecting preen oil fatty acid composition. Br Poult Sci 45: 109-115. https://doi.org/10.1080/ 00071660410001668932\u003c/li\u003e\n\u003cli\u003eSavory CJ, Wood-Gush DGM, Duncan IJH (1978) Feeding behavior in a population of domestic fowls in the wild. Appl Anim Ethol 4: 13-27. https://doi.org/10.1016/0304-3762(78)90090-1\u003c/li\u003e\n\u003cli\u003eShynkaruk T, Long K, LeBlanc C, Schwean-Lardner K (2023) Impact of stocking density on the welfare and productivity of broiler chickens reared to 34 d of age\u003cem\u003e. \u003c/em\u003eJ Appl Poult Res 32:100344. 
https://doi.org/10.1016/j.japr.2023.100344\u003c/li\u003e\n\u003cli\u003eStefanetti V, Mancinelli AC, Pascucci L, Menchetti L, Castellini C, Mugnai C, Fiorilla E, Miniscalco B, Chiattelli D, Franciosini MP, Proietti PC (2023) Effect of rearing systems on immune status, stress parameters, intestinal morphology, and mortality in conventional and local chicken breeds. Poult Sci 102: 103110. https://doi.org/10.1016/j.psj.2023.103110\u003c/li\u003e\n\u003cli\u003eThiam M, Barreto Sanchez AL, Zhang J, Wen J, Zhao G, Wang Q (2022) Investigation of the potential of heterophil/lymphocyte ratio as a biomarker to predict colonization resistance and inflammatory response to Salmonella enteritidis infection in chicken. Pathogens 11: 72. https://doi.org/10.3390/pathogens11010072\u003c/li\u003e\n\u003cli\u003eWang KH, Shi SR, Dou TC, Sun HJ (2009) Effect of a free-range raising system on growth performance, carcass yield, and meat quality of slow-growing chicken. Poult Sci 88: 2219-2223. https://doi.org/10.3382/ps.2008-00423 \u003c/li\u003e\n\u003cli\u003eWecke C, Khan DR, S\u0026uuml;nder A, Liebert F (2017) Age and gender depending growth of feathers and feather-free body in modern fast growing meat-type chickens. Open J Anim Sci 7: 379-392. https://doi.org/10.4236/ojas.2017.74029 \u003c/li\u003e\n\u003cli\u003eWelfare Quality Consortium\u0026reg; (2009) Welfare Quality Assessment Protocol for Poultry (Broilers, Laying Hens) Lelystad, Netherlands.\u003c/li\u003e\n\u003cli\u003eZhao ZG, Li JH, Li X, Bao J (2014) Effects of housing systems on behaviour, performance, and welfare of fast-growing broilers. Asian-Australas J Anim Sci 27: 140-146. 
https://doi.org/10.5713/ajas.2013.13167\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6021454/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6021454/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eDespite their transformational potential, large language models (LLMs) like Llama are difficult to deploy on devices with limited computational power due to their high computational requirements. This study explores the quantization of the Llama model, a method that reduces memory footprint and model size for effective deployment. In order to obtain significant model compression with acceptable performance, we investigate different quantization techniques. At various quantization levels, the study will assess the trade-off between efficiency and accuracy. We will also look into how quantization affects the target devices' power consumption and inference speed. 
By enabling deployment on resource-constrained platforms and effectively quantizing the Llama model, this initiative seeks to democratize access to potent AI tools, encouraging greater innovation and practical applications. \u003cbr\u003e\nAdditionally, a smaller model results in cheaper implementation costs and enhanced sustainability due to lower inference power usage. \u003cbr\u003e\nTo quantize the Llama model, this research explores a number of technical approaches, assesses performance trade-offs, and optimizes deployment for effective hardware use. This project's objective is to successfully quantize the Llama model in order to show that it is feasible to deploy it in contexts with limited resources. The results will help create LLMs that are easier to use and more effective.\u003c/p\u003e","manuscriptTitle":"Quantization of a Llama Language Model for improved Efficiency and Inference","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-02-17 06:02:46","doi":"10.21203/rs.3.rs-6021454/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"fe4e6902-54b1-4cf9-be04-951139393ffa","owner":[],"postedDate":"February 17th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-02-18T10:08:43+00:00","versionOfRecord":[],"versionCreatedAt":"2025-02-17 
06:02:46","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6021454","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6021454","identity":"rs-6021454","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00