Introduction
Meta Llama 3.3 70B is available on Amazon SageMaker JumpStart. This version of Llama delivers a notable step forward in large language model (LLM) efficiency, offering performance comparable to the much larger Llama 3.1 405B while requiring significantly fewer computational resources. Llama 3.3 70B is designed for cost-effective inference operations, delivering inference up to five times more cost-effectively than its larger counterparts, making it an ideal choice for production deployments.
We will explore how to efficiently deploy the Llama 3.3 70B model on Amazon SageMaker, leveraging advanced features to optimize performance and manage costs. With its enhanced attention mechanism and refined training process, including Reinforcement Learning from Human Feedback (RLHF), this model is ready to tackle many tasks efficiently and accurately.
The following figure summarizes the benchmark results (source).
Getting started with Amazon SageMaker JumpStart
Amazon SageMaker JumpStart is a machine learning (ML) hub that helps you get started with ML faster. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. You can deploy these models to production through the UI or the SDK, and they are fully customizable to your use case with your own data.
There are two straightforward ways to deploy Llama 3.3 70B with Amazon SageMaker JumpStart: programmatically, using the Amazon SageMaker Python SDK, or through the user-friendly Amazon SageMaker JumpStart UI. Let's examine both approaches so you can select the one that best meets your goals.
Steps to Deploy Llama 3.3 70B through the Amazon SageMaker JumpStart UI
You can use Amazon SageMaker Studio or Amazon SageMaker Unified Studio to access the SageMaker JumpStart UI. Follow these steps to deploy Llama 3.3 70B using the Amazon SageMaker JumpStart UI:
1. Select JumpStart models from the Build menu in Amazon SageMaker Unified Studio.
2. Search for Meta Llama 3.3 70B.
3. Choose the Meta Llama 3.3 70B model.
4. Choose Deploy.
5. Accept the end-user license agreement (EULA).
6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
7. Choose Deploy.
Wait for the endpoint's status to change to InService. The model is then ready to serve inference requests.
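Once the endpoint is InService, you can invoke it from any application with the AWS SDK. The following is a minimal sketch using boto3; the endpoint name here is an assumption for illustration, so substitute the name shown for your deployment in the SageMaker console.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; replace with the name of your deployed endpoint
endpoint_name = "meta-textgeneration-llama-3-3-70b-instruct"

payload = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

# Send a JSON request to the endpoint and print the generated text
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))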
Steps to Deploy Llama 3.3 70B using the Amazon SageMaker Python SDK
For teams that want to automate deployment or integrate with existing MLOps pipelines, the following code deploys the model using the Amazon SageMaker Python SDK:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

# Set up the SageMaker session, default artifacts bucket, and execution role
sagemaker_session = Session()
artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# JumpStart model ID for Llama 3.3 70B Instruct and the target GPU instance type
js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"
gpu_instance_type = "ml.p4d.24xlarge"

# Sample request and response pair used to infer the endpoint's input/output schema
response = "Hello, I'm a language model, and I'm here to help you with your English."
sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}
sample_output = [{"generated_text": response}]
schema_builder = SchemaBuilder(sample_input, sample_output)

# Build the deployable model from the JumpStart model ID
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    instance_type=gpu_instance_type,
    log_level=logging.ERROR,
)
model = model_builder.build()

# Deploy, accepting the EULA, and run a test prediction
predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)
predictor.predict(sample_input)
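The predict call returns generated text in the same shape as sample_output. When you are finished experimenting, delete the endpoint so the underlying instances stop incurring charges. A minimal cleanup sketch using the same predictor object:

# Tear down the model and the endpoint to stop incurring instance charges
predictor.delete_model()
predictor.delete_endpoint()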
Optimize deployment with Amazon SageMaker AI
Amazon SageMaker provides several powerful features to optimize the deployment and performance of models like Llama 3.3 70B, ensuring cost-effectiveness and efficiency in production environments:
- Speculative Decoding: By default, Amazon SageMaker JumpStart uses speculative decoding to increase throughput. A smaller draft model proposes candidate tokens that the full model verifies in parallel, so several tokens can be accepted per forward pass, reducing latency and improving generative AI inference performance.
- Fast Model Loader: This feature leverages a novel weight streaming approach that drastically reduces model initialization time. By sending weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, the Fast Model Loader significantly reduces the startup and scaling times, bypassing the traditional method of loading the entire model into memory first.
- Container Caching: Amazon SageMaker's container caching optimizes how model containers are handled during scaling. Pre-caching container images removes the need for time-consuming downloads during scaling, thus reducing latency and improving the responsiveness of the system, particularly for large models like Llama 3.3 70B.
- Scale to Zero: This feature automatically adjusts compute capacity based on actual usage. During periods of inactivity, endpoints can scale down to zero and then scale back up quickly when demand returns, optimizing costs for models with fluctuating workloads or for running multiple models simultaneously (see the configuration sketch after this list).
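To illustrate scale to zero, the following sketch allows a SageMaker inference component to scale between zero and four copies using Application Auto Scaling. The inference component name and capacity values are assumptions for illustration, and scale to zero applies to endpoints deployed with inference components.

import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference component name for the Llama 3.3 70B deployment
resource_id = "inference-component/llama-3-3-70b-ic"

# Allow the copy count to scale between 0 (fully scaled in) and 4
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Target tracking keeps roughly one invocation in flight per copy;
# when traffic stops, copies scale in toward zero
aas.put_scaling_policy(
    PolicyName="llama-scale-to-zero",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
    },
)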
By leveraging these Amazon SageMaker AI features, businesses can efficiently deploy and manage Llama 3.3 70B, maximizing both performance and cost-effectiveness.
Conclusion
Combining Llama 3.3 70B with Amazon SageMaker AI's sophisticated inference capabilities is a compelling option for production deployments. By leveraging features like Fast Model Loader, Container Caching, and Scale to Zero, businesses can achieve both high performance and cost-effectiveness for their LLM deployments. These optimization tools significantly streamline model initialization, scaling, and resource management, ensuring that organizations can deploy large language models like Llama 3.3 70B at scale with minimal overhead.
With its powerful architecture, refined training methodology, and seamless integration with Amazon SageMaker, Llama 3.3 70B provides organizations with a scalable and affordable option to meet their generative AI needs.
Drop a query if you have any questions regarding Amazon SageMaker AI and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront, Amazon OpenSearch, AWS DMS and many more.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat's offerings.
FAQs
1. What is Llama 3.3 70B, and how does it differ from larger models?
ANS: – Llama 3.3 70B is a more efficient version of the Meta Llama model, providing performance similar to the larger Llama 3.1 405B model but with significantly lower computational requirements. It is designed to offer cost-effective inference operations, making it ideal for production deployments.
2. How does Amazon SageMaker optimize Llama 3.3 70B deployment?
ANS: – Amazon SageMaker features such as Fast Model Loader, Container Caching, and Scale to Zero streamline model initialization, scaling, and resource management, optimizing deployment.
WRITTEN BY Aayushi Khandelwal
Aayushi, a dedicated Research Associate pursuing a Bachelor's degree in Computer Science, is passionate about technology and cloud computing. Her fascination with cloud technology led her to a career in AWS Consulting, where she finds satisfaction in helping clients overcome challenges and optimize their cloud infrastructure. Committed to continuous learning, Aayushi stays updated with evolving AWS technologies, aiming to impact the field significantly and contribute to the success of businesses leveraging AWS services.