> For the complete documentation index, see [llms.txt](https://docs.acecloud.ai/knowledge-base/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.acecloud.ai/knowledge-base/ai-hub/getting-started-with-ace-inference-endpoints.md). # Getting Started with Ace Inference Endpoints ## a. How to Create an Inference Endpoint? **Step 1:** [Log in](https://customer.acecloudhosting.com/app/login) to AceCloud portal. **Step 2:** From the left-hand menu, select **AI HUB** and then choose **Endpoints**.

**Step 3:** Click on **Create Endpoint**.

**Step 4:** Enter the endpoint name, paste the HuggingFace model ID or URL, or choose from the available models for quick deployment.

For private and gated models, please configure a HuggingFace token in the secrets (add a link to the secret page). Additionally, you can disable authentication for endpoint requests. ***Note:*** *The user would receive the public URL of the model hosted.*

**Step 5:** Configure compute resources and scaling. Choose from the available flavors.

| **Field** | **Description** | | ---------------------- | ------------------------------------------------------------------------------------------------- | | Shared | Shared GPU resource, meaning a single GPU is sliced and shared among many users. | | Dedicated | Dedicated GPU resource meaning GPU(s) belongs to a single user. | | Replicas | Number of instances for your endpoint. | | Model Volume Size (GB) | Storage space allocated for the model. | | Billing Cycle | Applies to the compute and model storage for this endpoint. (Add link to the pricing policy page) | | Max Ongoing Requests | The maximum number of concurrent requests allowed to be called on this endpoint. | **Step 6:** Configure the vLLM inference engine parameters. ***Note:** These are advanced settings. Default values work for most use cases.*

| **Field** | **Description** | | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | gpu\_memory\_utilization | The fraction of GPU VRAM to reserve for the model and KV cache. A value of 1 means 100% of VRAM is used. Lowering it (e.g., to 0.9) leaves headroom to avoid out-of-memory errors, but reduces throughput.**Required for dedicated and MIG shared deployments.** | | tensor\_parallel\_size | The number of GPUs to split the model across. This shards model layers across GPUs, enabling larger models to fit in memory. | | dtype | Precision format for model weights and computations. Auto lets vLLM pick the optimal type based on the model configuration and hardware. | | max\_model\_len | Maximum sequence length (in tokens) the model can handle per request. | | cpu\_offload\_gb | Amount of model weight (in GB) to offload to CPU RAM when VRAM is insufficient. Increasing this helps fit larger models at the cost of slower inference. | You can also add custom keyword arguments for additional configuration. For more information, refer to [Engine Arguments - vLLM](https://docs.vllm.ai/en/latest/configuration/engine_args/) **Step 7:** Review your configuration and click **Create Endpoint** to deploy.

***Note:** This endpoint is OpenAI API compatible, meaning you can interact with it using any OpenAI-compatible client or SDK without changing your existing code. For details on the supported request and response format, refer to the* [*OpenAI Chat API Reference*](https://developers.openai.com/api/reference/resources/chat)*.* ## b. How to Edit an Inference Endpoint? **Step 1:** Select the endpoint you want to edit, click the **Actions** menu, and choose **Edit Endpoint**.

***Note:** All updates are rolling updates, meaning your endpoint will remain active during the update process.* **Step 2:** Update the endpoint details under **Basic Details**.

| **Field** | **Description** | | --------------------- | ---------------------------------------------------------------------------------------------------------------- | | Model Selection | HuggingFace model ID or URL. You can also browse and select from trending models using the Quick Deploy section. | | HuggingFace Token | Required for private and gated models. Select an existing token or create a new secret. | | Enable Authentication | Toggle to require authentication for all endpoint requests. | **Step 3:** Configure compute resources under **Resource Configuration**. ***Note:** Only upscaling is allowed as per our internal policies.*

| **Field** | **Description** | | ---------------------- | -------------------------------------------------------------------------------- | | Shared | Shared GPU resource, meaning a single GPU is sliced and shared among many users. | | Dedicated | Dedicated GPU resource meaning GPU(s) belongs to a single user. | | Replicas | Number of instances for your endpoint. | | Model Volume Size (GB) | Storage space allocated for the model. | | Billing Cycle | Applies to the compute and model storage for this endpoint. | | Max Ongoing Requests | The maximum number of concurrent requests allowed to be called on this endpoint. | **Step 4:** Adjust the vLLM inference engine parameters under **Engine Configuration**. ***Note:** These are advanced settings. Default values work for most use cases.*

| **Field** | **Description** | | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | gpu\_memory\_utilization | The fraction of GPU VRAM to reserve for the model and KV cache. A value of 1 means 100% of VRAM is used. Lowering it (e.g., to 0.9) leaves headroom to avoid out-of-memory errors, but reduces throughput. Required for dedicated and MIG shared deployments. | | tensor\_parallel\_size | The number of GPUs to split the model across. This shards model layers across both GPUs, enabling larger models to fit in memory. | | dtype | Precision format for model weights and computations. Auto lets vLLM pick the optimal type based on the model configuration and hardware. | | max\_model\_len | Maximum sequence length (in tokens) the model can handle per request. | | cpu\_offload\_gb | Amount of model weight (in GB) to offload to CPU RAM when VRAM is insufficient. Increasing this helps fit larger models at the cost of slower inference. | *You can also add custom keyword arguments for additional configuration.* **Step 5:** Click **Update Endpoint** to apply your changes.

## c. How to Delete an Inference Endpoint? **Step 1:** Click the **Actions** menu next to the endpoint you want to delete and select **Delete Endpoint**.

**Step 2:** A confirmation dialog will appear. Click **Delete Endpoint** to confirm.

***Note:** Endpoint deletion may take 30-60 seconds to complete*. --- # Agent Instructions This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com. ## Querying This Documentation If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question. Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter: ``` GET https://docs.acecloud.ai/knowledge-base/ai-hub/getting-started-with-ace-inference-endpoints.md?ask=&goal= ``` `ask` is the immediate question: it should be specific, self-contained, and written in natural language. `goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation. Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.