Introduction
Meta Llama 3.3 70B is available on Amazon SageMaker JumpStart. This version of Llama delivers a notable step forward in large language model (LLM) efficiency, offering performance comparable to the much larger Llama 3.1 405B while requiring significantly fewer computational resources. Llama 3.3 70B is designed for cost-effective inference operations, delivering inference up to five times more cost-effectively than its larger counterparts, making it an ideal choice for production deployments.
We will explore how to efficiently deploy the Llama 3.3 70B model on Amazon SageMaker, leveraging advanced features to optimize performance and manage costs. With its enhanced attention mechanism and refined training process, including Reinforcement Learning from Human Feedback (RLHF), this model is ready to tackle many tasks efficiently and accurately.
The following figure summarizes the benchmark results (source).
Getting started with Amazon SageMaker JumpStart
Amazon SageMaker JumpStart is a machine learning (ML) hub that helps you get started with ML faster. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. You can deploy these models to production through the UI or the SDK, and they are fully customizable to your use case with your own data.
There are two straightforward ways to deploy Llama 3.3 70B with Amazon SageMaker JumpStart: programmatically, using the Amazon SageMaker Python SDK, or through the user-friendly Amazon SageMaker JumpStart UI. Let's examine both approaches so you can select the one that best meets your goals.
Steps to Deploy Llama 3.3 70B through the Amazon SageMaker JumpStart UI
You can use Amazon SageMaker Studio or Amazon SageMaker Unified Studio to access the SageMaker JumpStart UI. Follow these steps to deploy Llama 3.3 70B using the Amazon SageMaker JumpStart UI:
1. Select JumpStart models from the Build menu in Amazon SageMaker Unified Studio.
2. Search for Meta Llama 3.3 70B.
3. Choose the Meta Llama 3.3 70B model.
4. Choose Deploy.
5. Accept the end-user license agreement (EULA).
6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
7. Choose Deploy.
Wait for the endpoint's status to change to InService. The model is then ready to serve inference requests.
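Once the endpoint is InService, you can invoke it from any application with the AWS SDK. The following is a minimal sketch using boto3; the endpoint name here is an assumption for illustration, so substitute the name shown for your deployment in the SageMaker console.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; replace with the name of your deployed endpoint
endpoint_name = "meta-textgeneration-llama-3-3-70b-instruct"

payload = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

# Send a JSON request to the endpoint and print the generated text
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))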
Steps to Deploy Llama 3.3 70B using the Amazon SageMaker Python SDK
For teams that want to automate deployment or integrate with existing MLOps pipelines, the following code deploys the model using the Amazon SageMaker Python SDK:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

# Set up the SageMaker session, default artifacts bucket, and execution role
sagemaker_session = Session()
artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# JumpStart model ID for Llama 3.3 70B Instruct and the target GPU instance type
js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"
gpu_instance_type = "ml.p4d.24xlarge"

# Sample request and response pair used to infer the endpoint's input/output schema
response = "Hello, I'm a language model, and I'm here to help you with your English."
sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}
sample_output = [{"generated_text": response}]
schema_builder = SchemaBuilder(sample_input, sample_output)

# Build the deployable model from the JumpStart model ID
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    instance_type=gpu_instance_type,
    log_level=logging.ERROR,
)
model = model_builder.build()

# Deploy, accepting the EULA, and run a test prediction
predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)
predictor.predict(sample_input)
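The predict call returns generated text in the same shape as sample_output. When you are finished experimenting, delete the endpoint so the underlying instances stop incurring charges. A minimal cleanup sketch using the same predictor object:

# Tear down the model and the endpoint to stop incurring instance charges
predictor.delete_model()
predictor.delete_endpoint()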
Optimize deployment with Amazon SageMaker AI
Amazon SageMaker provides several powerful features to optimize the deployment and performance of models like Llama 3.3 70B, ensuring cost-effectiveness and efficiency in production environments:
- Speculative Decoding: By default, Amazon SageMaker JumpStart uses speculative decoding to increase throughput. A smaller draft model proposes candidate tokens that the full model verifies in parallel, so several tokens can be accepted per forward pass, reducing latency and improving generative AI inference performance.
- Fast Model Loader: This feature leverages a novel weight streaming approach that drastically reduces model initialization time. By sending weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, the Fast Model Loader significantly reduces the startup and scaling times, bypassing the traditional method of loading the entire model into memory first.
- Container Caching: Amazon SageMaker's container caching optimizes how model containers are handled during scaling. Pre-caching container images removes the need for time-consuming downloads during scaling, thus reducing latency and improving the responsiveness of the system, particularly for large models like Llama 3.3 70B.
- Scale to Zero: This feature automatically adjusts compute capacity based on actual usage. During periods of inactivity, endpoints can scale down to zero and then scale back up quickly when demand returns, optimizing costs for models with fluctuating workloads or for running multiple models simultaneously (see the configuration sketch after this list).
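To illustrate scale to zero, the following sketch allows a SageMaker inference component to scale between zero and four copies using Application Auto Scaling. The inference component name and capacity values are assumptions for illustration, and scale to zero applies to endpoints deployed with inference components.

import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference component name for the Llama 3.3 70B deployment
resource_id = "inference-component/llama-3-3-70b-ic"

# Allow the copy count to scale between 0 (fully scaled in) and 4
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Target tracking keeps roughly one invocation in flight per copy;
# when traffic stops, copies scale in toward zero
aas.put_scaling_policy(
    PolicyName="llama-scale-to-zero",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
    },
)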
By leveraging these Amazon SageMaker AI features, businesses can efficiently deploy and manage Llama 3.3 70B, maximizing both performance and cost-effectiveness.
Conclusion
Combining Llama 3.3 70B with Amazon SageMaker AI's sophisticated inference capabilities is a compelling option for production deployments. By leveraging features like Fast Model Loader, Container Caching, and Scale to Zero, businesses can achieve both high performance and cost-effectiveness for their LLM deployments. These optimization tools significantly streamline model initialization, scaling, and resource management, ensuring that organizations can deploy large language models like Llama 3.3 70B at scale with minimal overhead.
With its powerful architecture, refined training methodology, and seamless integration with Amazon SageMaker, Llama 3.3 70B provides organizations with a scalable and affordable option to meet their generative AI needs.
Drop a query if you have any questions regarding Amazon SageMaker AI and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront, Amazon OpenSearch, AWS DMS and many more.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat's offerings.
FAQs
1. What is Llama 3.3 70B, and how does it differ from larger models?
ANS: – Llama 3.3 70B is a more efficient version of the Meta Llama model, providing performance similar to the larger Llama 3.1 405B model but with significantly lower computational requirements. It is designed to offer cost-effective inference operations, making it ideal for production deployments.
2. How does Amazon SageMaker optimize Llama 3.3 70B deployment?
ANS: – Amazon SageMaker features such as Fast Model Loader, Container Caching, and Scale to Zero streamline model initialization, scaling, and resource management, optimizing deployment.
WRITTEN BY Aayushi Khandelwal
Aayushi, a dedicated Research Associate pursuing a Bachelor's degree in Computer Science, is passionate about technology and cloud computing. Her fascination with cloud technology led her to a career in AWS Consulting, where she finds satisfaction in helping clients overcome challenges and optimize their cloud infrastructure. Committed to continuous learning, Aayushi stays updated with evolving AWS technologies, aiming to impact the field significantly and contribute to the success of businesses leveraging AWS services.