Puget Systems Debuts Generative AI and Machine Learning Server

The new customizable rackmount machine supports up to four NVIDIA RTX 6000 Ada graphics cards to host a web-based chat interface for large language models.

Puget Systems announced the debut of a custom Generative AI and Machine Learning server at SIGGRAPH 2023 last week in Los Angeles. At the conference, the Puget Systems team demonstrated its new specialized AI Training and Inference server, configured with four NVIDIA RTX 6000 Ada graphics cards. The configuration is built to handle intensive generative AI and machine learning workloads, as well as real-time rendering, graphics, AR/MR/VR/XR, compute, and deep learning processing.

The Puget Systems AI Training and Inference server is a rackmount workstation capable of hosting a web-based chat server for multiple simultaneous users, using state-of-the-art (SOTA) large language models (LLMs) such as Meta's Llama-2-70b. Puget Systems Labs conducted extensive testing of this configuration with Llama-2-70b and Falcon-40b (Falcon-40b requires less memory and can run with only two RTX 6000 Ada GPUs). In addition to running a chat interface, this hardware is also suitable for base model fine-tuning within the available GPU memory limits.

The Puget Systems Labs team ran these tests with a full set of four NVIDIA RTX 6000 Ada graphics cards. Labs tested the system with Meta’s Llama-2-70b-chat-hf, using the HuggingFace Text-Generation-Inference (TGI) server and HuggingFace ChatUI. The test model used approximately 130GB of video memory (VRAM), and Labs confirmed that the system should work well with other LLMs that fit within available GPU memory (192GB with four cards installed).
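The memory figures above can be roughly cross-checked with a back-of-the-envelope estimate. The sketch below assumes 16-bit (2-byte) weights and nominal parameter counts of 70 billion and 40 billion. Neither assumption is stated in the announcement, and the helper name `weights_gib` is illustrative only. It counts model weights alone and ignores the KV cache and other runtime overhead, so real usage will be somewhat higher.

```python
def weights_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    Assumes 2 bytes per parameter (16-bit weights) by default;
    this precision is an assumption, not a figure from the announcement.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30


GPU_COUNT = 4
VRAM_PER_GPU_GB = 48  # RTX 6000 Ada: 192GB across four cards
total_vram_gb = GPU_COUNT * VRAM_PER_GPU_GB

llama_70b = weights_gib(70)   # nominal 70B parameters (assumed)
falcon_40b = weights_gib(40)  # nominal 40B parameters (assumed)

print(f"Total VRAM with four cards: {total_vram_gb} GB")
print(f"Llama-2-70b weights:  ~{llama_70b:.0f} GiB")
print(f"Falcon-40b weights:   ~{falcon_40b:.0f} GiB")
print(f"Falcon-40b fits on two cards: {falcon_40b < 2 * VRAM_PER_GPU_GB}")
```

Under these assumptions the Llama-2-70b estimate lands close to the roughly 130GB that Labs measured, and the Falcon-40b estimate stays under the 96GB offered by two cards.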

Notable performance statistics from the testing include:

  • Typical usage measured response:
    • Validation Time = 0.59673 ms
    • Queue Time = 0.17409 ms
    • Time per Token = 54.558 ms
  • Stress tested with multiple concurrent users
    • Data below is from a session with 114 prompts (20-30 users) over 5 minutes
  • Average prompt response under multi-user load:
    • Validation Time = 3.0312 ms
    • Queue Time = 4687.9 ms
    • Time per Token = 68.076 ms
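To put the per-token timings above in more familiar terms, they can be converted to generation throughput. This is a minimal sketch that uses only the numbers listed above; the function name `tokens_per_second` is illustrative, and the result reflects per-response generation speed, not aggregate server throughput.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


TYPICAL_MS_PER_TOKEN = 54.558   # typical usage
LOADED_MS_PER_TOKEN = 68.076    # multi-user stress test
LOADED_QUEUE_MS = 4687.9        # average queue time under load

typical = tokens_per_second(TYPICAL_MS_PER_TOKEN)
loaded = tokens_per_second(LOADED_MS_PER_TOKEN)
slowdown = LOADED_MS_PER_TOKEN / TYPICAL_MS_PER_TOKEN - 1.0

print(f"Typical usage:   {typical:.1f} tokens/s")
print(f"Multi-user load: {loaded:.1f} tokens/s")
print(f"Extra time per token under load: {slowdown:.0%}")
print(f"Average wait in queue under load: {LOADED_QUEUE_MS / 1000:.1f} s")
```

By this arithmetic, generation speed drops from about 18.3 to about 14.7 tokens per second under load, while the larger change for users is the queue wait of roughly 4.7 seconds before a response begins.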

Puget Systems' custom AI Training and Inference servers will be available for configuration for a wide range of generative AI applications beginning in the coming weeks. Learn more or join the waitlist here. Learn more about Puget Systems' Canadian consulting and sales operations here.

Source: Puget Systems