{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "0c03cbd4-861d-4e2d-b3f6-bd04d0d082cb", "metadata": {}, "outputs": [], "source": [ "from typing import Any, List, Mapping, Optional\n", "\n", "from langchain.callbacks.manager import CallbackManagerForLLMRun\n", "from langchain.llms.base import LLM\n", "import torch\n", "from transformers import pipeline" ] }, { "cell_type": "code", "execution_count": 2, "id": "8b7bed2e-9f16-4555-b6a4-e96fefda1ddf", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoModel, AutoTokenizer" ] }, { "cell_type": "code", "execution_count": 9, "id": "33720a1f-95e7-45d5-811b-0df34790edd0", "metadata": {}, "outputs": [], "source": [ "class CustomLLM(LLM):\n", " model_name: str = \"mistralai/Mistral-7B-v0.1\"\n", " model_pipeline: Any\n", " device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n", " \n", " def init(self, *args, **kwargs):\n", " t = AutoTokenizer.from_pretrained(\"mistralai/Mistral-7B-v0.1\")\n", " t.add_special_tokens({\"pad_token\": t.eos_token})\n", " m = AutoModel.from_pretrained(\"mistralai/Mistral-7B-v0.1\")\n", " m.resize_token_embeddings(len(t))\n", " self.model_pipeline = pipeline(\"text-generation\", model=\"mistralai/Mistral-7B-v0.1\", device_map='auto',\n", " trust_remote_code=True, model_kwargs={\"torch_dtype\": torch.bfloat16, \"load_in_8bit\": True},\n", " max_length=1500, tokenizer=t)\n", "\n", " def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:\n", " print(prompt, type(prompt))\n", " res = self.model_pipeline(str(prompt))\n", " print(res, type(res))\n", " if len(res) >= 1:\n", " generated_text = res[0].get(\"generated_text\")[len(prompt):]\n", " return generated_text\n", " else:\n", " return \"Don't know the answer\"\n", "\n", " @property\n", " def _identifying_params(self) -> Mapping[str, Any]:\n", " return {\"name_of_model\": self.model_name}\n", "\n", " @property\n", " def _llm_type(self) -> str:\n", " return \"custom\"" ] }, { "cell_type": "code", "execution_count": 10, "id": "aa4eae46-fd3a-425b-8b10-0d06e756e179", "metadata": {}, "outputs": [], "source": [ "llm = CustomLLM()" ] }, { "cell_type": "code", "execution_count": 11, "id": "e1d0ebc5-5d3a-4401-982d-f46dea02532f", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fa0f4b37d4ce4b049c84006d229e2b77", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/eihli/.virtualenvs/system/lib/python3.10/site-packages/transformers/generation/utils.py:1281: UserWarning: Input length of input_ids is 1923, but `max_length` is set to 1500. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.\n", " warnings.warn(\n", "/home/eihli/.virtualenvs/system/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n", " warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[{'generated_text': \"Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.\\n\\nTo learn more about DeciCoder, check out the model on Hugging Face.\\n\\nDeciCoder’s architecture differs from SantaCoder’s in other ways as well. It has fewer layers (20 vs. 24 in Santa), more heads (32 vs 16) and the same embedding size, which means its head sizes are smaller (64 vs 128 in Santa).\\nDeciCoder’s Training Process\\n\\nBroad Use Rights: With a permissive license, DeciCoder grants you wide-ranging rights, alleviating typical legal concerns that can accompany the use of some models. You can seamlessly integrate DeciCoder into your projects with minimal restrictions.\\n\\nReady for Commercial Applications: Beyond just experimentation and personal projects, Deci’s permissive licensing means you can confidently deploy DeciCoder in commercial applications. Whether you’re looking to enhance your product, offer new services, or simply leverage the model for business growth, DeciCoder is ready to be your partner in innovation.\\n\\nCommunity\\nCompany\\n\\nAbout Us\\nPartners\\nCareers\\nNewsroom\\nContact Us\\n\\n\\n \\n\\n\\n\\n\\n\\n\\nLog in \\n\\n\\n\\n\\n\\n\\nBook a Demo\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\nBack to Blog\\n\\n\\n\\n\\n\\n\\n\\nAlgorithms \\n\\n\\n\\nIntroducing DeciCoder: The New Gold Standard in Efficient and Accurate Code Generation \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\n\\n\\n\\n\\n\\nBy\\nDeci \\n\\n\\n\\n \\n\\nProduct Team \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nAugust 15, 2023 \\n\\n\\n\\n \\n\\n6 min read\\n\\nSo, what drives DeciCoder’s impressive throughput? A combination of architectural efficiency and optimized implementation. Notably, DeciCoder is significantly more memory efficient, allowing it to manage larger batch sizes. This memory efficiency means that Deci’s throughput reaches its maximum when its batch size is at 128, whereas SantaCoder capped at 32. With larger batch sizes, without the worry of running out of memory, DeciCoder effectively processes more data at once, further augmenting\\n\\nDeci’s Edge\\nThroughout this blog, we’ve highlighted the robust capabilities of DeciCoder, showcasing its consistent superiority over models like SantaCoder. Our innovative use of AutoNAC allowed us to generate an architecture that’s both efficient and powerful.\\n\\nLooking to accelerate inference and cut your LLM inference costs?Book a Demo.\\n\\nDeciCoder’s Permissive Licensing\\nIn releasing DeciCoder to the open source community, we’re committed to ensuring ease of use and accessibility for all users. To that end, DeciCoder comes with a permissive license (Apache 2.0). What does this mean for developers and businesses alike? It’s simple:\\n\\nDeciCoder’s Architecture\\nThe use of AutoNAC resulted in a new and distinctive transformer architecture for DeciCoder. One notable feature is its implementation of Grouped Query Attention with 8 key-value heads. By grouping query heads and allowing them to share a key head and value head, computation becomes streamlined, and memory usage optimized.\\n\\nThe synergy with Infery LLM can’t be overlooked. This proprietary inference engine, built with compatibility for PyTorch Script, optimizes execution and bolsters any LLM’s speed, helping AI teams to dramatically cut their inference costs and deliver users with an enhanced experience.\\nDeciCoder is just a glimpse, into the upcoming release of our expansive suite of Generative AI foundation models, accompanied by the GenAI SDK. 
This is only the beginning—there’s a world of innovations awaiting.\\n\\nWhen integrated with Deci’s inference optimization tool, DeciCoder outperforms SantaCoder in efficiency, delivering higher throughput even on more affordable GPUs that are 4x less expensive.\\n\\nDeciCoder’s efficiency results in a significant reduction in inference cost and\\xa0 carbon emissions. When paired with Infery LLM, the model’s cost per 1k tokens is 71.4% lower than SantaCoder’s on HuggingFace Inference Endpoint. Moreover the yearly carbon emitted is reduced by 324 kg CO2 per model instance on A10G GPU.\\n\\nOnce the DeciCoder architecture was generated by AutoNAC, DeciCoder began its training phase. DeciCoder was trained on the Python, Java, and Javascript subsets of the Stack, an extensive dataset containing over 6TB of permissively-licensed source code from 358 programming languages.DeciCoder’s training employed the ‘Fill in the Middle’ training objective. This method challenges language models to complete missing or broken segments of text, aiming to produce coherent and contextually relevant\\n\\nDeciCoder’s unmatched throughput and low memory footprint enable applications to achieve extensive code generation with the same latency, even on more affordable GPUs, resulting in substantial cost savings.\\nAt Deci, we’re obsessed with AI efficiency. We’ve been empowering AI teams to achieve unparalleled inference speed and accuracy with our advanced tools and deep learning foundation models. Now, we’re extending our core technology and expertise into the realm of generative AI.\\n\\nDeciCoder’s Remarkable Inference Speed\\nWhen it comes to measuring the efficacy of AI models, throughput – the number of tokens processed per second – is a critical metric for the operational efficiency of any application powered by Generative AI models . DeciCoder consistently outperforms SantaCoder in head-to-head comparisons.\\n\\nJoin us as we delve into the outstanding capabilities of DeciCoder.\\nBridging the Gen AI Efficiency Gap\\nInefficient inference poses a substantial hurdle in the production and deployment of deep learning models, especially for generative AI. As these algorithms grow in size and complexity, their escalating computational requirements not only increase energy consumption but also drive up operational costs. Furthermore, this elevated energy usage carries significant environmental consequences.\\n\\nDeciCoder is just the beginning. As we prepare to unveil a new generation of high-efficiency foundation LLMs and text-to-image models, developers can also eagerly anticipate our upcoming generative AI SDK. This suite, loaded with advanced tools, promises to redefine Gen AI fine-tuning, optimization, and deployment, offering unparalleled performance and cost efficiency to small and large enterprises alike.\\n\\nDeciCoder was trained at a Bfloat16 precision.\\nDeciCoder’s Accuracy – Surpassing SantaCoder Across Multiple Languages\\nDeciCoder’s accuracy surpasses that of SantaCoder. 
On the HumanEval evaluation benchmark, a tool designed specifically for assessing large language model (LLM) expertise in code generation tasks, DeciCoder outperforms SantaCoder in every language they were trained on, namely Python, Javascript and Java.\\n\\nWhen combining DeciCoder with our LLM inference optimization tool, Infery LLM, you can achieve a throughput that’s a staggering 3.5 times greater than SantaCoder’s.\\n\\nIntroducing DeciCoder: The New Gold Standard in Efficient and Accurate Code Generation\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSkip to content\\n\\n\\n\\n\\n\\n\\n \\n \\n\\n\\n\\n\\n\\n\\n\\nPlatform\\n\\nPLATFORM MODULES\\nBuild Models\\nTrain\\nOptimize & Deploy\\n\\n\\nTechnology\\nSolutions\\n\\nUSE CASES\\nRun on Edge Devices\\nOptimize Generative AI Models\\nReduce Cloud Cost\\nShorten Development Time\\nMaximize Data Center Utilization\\nINDUSTRIES\\nAutomotive\\nSmart Retail\\nPublic Sector\\nSmart Manufacturing\\nVideo Analytics\\n\\n\\nPricing\\nResources\\n\\nQuestion: What is DeciCoder\\nHelpful Answer: Dec\"}] \n" ] } ], "source": [ "response = qa_with_sources_chain({\"query\":\"What is DeciCoder\"})" ] }, { "cell_type": "code", "execution_count": null, "id": "bc581097-a53e-4e09-bda5-ebde114e6c04", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }