WEKA Storage Solution

The WEKA Data Platform is a software-based solution built to modernize enterprise data stacks. Its AI-native, data pipeline-oriented architecture delivers high performance at scale, so AI workloads run faster and more efficiently.

We selected the WEKA Data Platform as part of the AI JVD design due to the following benefits:

  • High Performance: WEKA's architecture is designed for extreme performance, making it suitable for AI/ML workloads, big data analytics, and high-performance computing (HPC) environments.
  • Scalability: WEKA can scale from a few terabytes to exabytes of data, allowing customers to grow their storage capacity without compromising performance. WEKA's distributed architecture differs from typical scale-up storage systems, appliances, and hypervisor-based, software-defined storage solutions. It overcomes the traditional storage scaling and file-sharing limitations that can bottleneck large-scale AI deployments, making it one of the preferred choices for customers.
  • Unified Storage: WEKA provides a single storage solution that supports multiple protocols (for example, NFS, SMB, POSIX, and S3), giving flexibility in how data is accessed and managed and allowing NVIDIA GPUDirect Storage access.
  • Data Resilience: WEKA offers advanced data protection features, including erasure coding, which protects data against hardware failures. With a minimum configuration of six storage servers, the cluster can survive the failure of two servers.
  • Ease of Management: WEKA's software-defined storage solution is easy to deploy and manage, with a user-friendly interface and automated management features. It can be installed on any standard AMD EPYC™ or Intel Xeon™ Scalable Processor-based hardware with the appropriate memory, CPU, networking, and NVMe solid-state drives.
  • Support for GPUs: WEKA is optimized for GPU acceleration, making it well suited to environments that rely heavily on GPU computing, such as AI and machine learning applications.
  • Low Latency: WEKA's architecture allows very low-latency access to data, which is crucial for applications that require real-time data processing.

WEKA Storage Cluster in the AI JVD Lab

We built the WEKA storage cluster with eight Supermicro-based servers connected to the Storage Backend fabric, providing 242 TB of usable storage. WEKA recommends eight cluster nodes and requires a minimum of six nodes for a production deployment.

Each WEKA server has the following specifications:

  • AMD EPYC 9454P processors
  • 384GB System Memory
  • OS drives: 2x 1.92TB M.2 NVMe Data Center SSD (PCIe 4.0)
  • Data drives: 7x 7.68TB U.2 NVMe Data Center SSD (PCIe 4.0)
  • Onboard OOB network connection (RJ45) and the following additional interface cards:
    • 1 x NVIDIA Mellanox ConnectX-6 DX Adapter Card, 100GE, dual-port QSFP28, PCIe 4.0 x16
    • 2 x NVIDIA Mellanox ConnectX-6 VPI Adapter Card, HDR IB & 200GE, dual-port QSFP56, OCP 3.0
  • Software:
    • Operating system: Ubuntu 22.04 LTS
    • WEKA release tested in this design: 4.2.5
    • WEKA Flash Tier license w/SnapShot and high-performance protocol services (POSIX, NFS-W, S3, and SMB-W)
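
As a quick sanity check on the numbers above, the following sketch works out the raw NVMe capacity of the cluster and compares it with the 242 TB usable figure. It uses only values stated in this section; treating all figures as decimal terabytes is an assumption, and the text does not break down where the difference goes (data protection and other overhead).

```python
# Figures taken from the server specification above.
# Assumption: all capacities are decimal terabytes.
SERVERS = 8
DATA_DRIVES_PER_SERVER = 7
DRIVE_TB = 7.68
USABLE_TB = 242

# Raw capacity is simply servers x data drives x drive size.
raw_tb = SERVERS * DATA_DRIVES_PER_SERVER * DRIVE_TB

# Fraction of raw capacity that is reported as usable.
usable_fraction = USABLE_TB / raw_tb

print(f"raw capacity:    {raw_tb:.2f} TB")
print(f"usable fraction: {usable_fraction:.1%}")
```

This gives roughly 430 TB of raw data-drive capacity, of which about 56% is reported as usable.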

Common Setting Changes Required

WEKA strongly recommends specific BIOS settings and matching Mellanox driver versions across all nodes. For convenience, these changes are documented here.

Note:

WEKA provides a WEKA Management Service (WMS) tool that can automate the BIOS settings changes, verify your configuration (including driver revisions), and deploy your WEKA version. It can be downloaded from the WEKA website at https://get.weka.io/ui/wms/download. Juniper highly recommends using WMS to configure the WEKA cluster. All the devices are configured to perform ECMP load balancing, as explained later in the document.

BIOS settings:

The BIOS settings can be changed by applying the bios_settings.yml:

This is an AMD CPU-powered cluster; the settings may be different for Intel based CPUs.

For more details on how to apply these changes, refer to the weka/bios_tool repository on GitHub, a tool for viewing and setting BIOS settings for WEKA servers.

Network Configuration for the Juniper WEKA Cluster

As described in the Storage Backend sections, the WEKA servers are dual-homed and connect to separate storage backend switches (storage-backend-weka-leaf1 and storage-backend-weka-leaf2) using the 200GE ports on the NVIDIA Mellanox ConnectX-6 VPI Adapter Cards. The additional QSFP28 100GE ports are not used in this JVD but can be used for front-end ingress/egress traffic, staging, and management.

Figure 94: Storage Interface Connectivity

The switch-side ports must be configured with auto-negotiation disabled and the speed set to 200G.

OFED Drivers:

WEKA recommends following NVIDIA's guidance for OFED (Mellanox) drivers when using ConnectX cards. See "Installing Mellanox OFED" in the NVIDIA documentation.

The driver release should be 5.8 or later.

Ensure that the OFED driver version is the same on every node in the WEKA cluster (for example, confirm that weka01 has the same OFED release installed as the other nodes).
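
The two checks above (release 5.8 or later, and the same version on every node) can be sketched as a small helper. This is a minimal illustration, not a WEKA or NVIDIA tool: the function names are hypothetical, and the version strings are assumed to have been collected by hand from each node and to start with a `major.minor` release number.

```python
import re

MIN_OFED = (5, 8)  # minimum release stated in the text

def parse_release(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '5.8-1.0.1.1'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized OFED version: {version!r}")
    return int(match.group(1)), int(match.group(2))

def check_ofed(versions: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the nodes are aligned."""
    problems = []
    distinct = sorted(set(versions.values()))
    if len(distinct) > 1:
        problems.append(f"mismatched versions across nodes: {distinct}")
    for host, version in sorted(versions.items()):
        if parse_release(version) < MIN_OFED:
            problems.append(f"{host}: {version} is older than 5.8")
    return problems

# Hypothetical inventory: hostnames other than weka01 and all version
# strings are illustrative only.
inventory = {
    "weka01": "5.8-1.0.1.1",
    "weka02": "5.8-1.0.1.1",
    "weka03": "5.4-3.0.3.0",
}
for problem in check_ofed(inventory):
    print(problem)
```

With the sample inventory, the helper reports both a version mismatch and one node below the minimum release.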

For Ubuntu, the following command is recommended:

./mlnxofedinstall --force --dkms --all

The following script can also be run (as root) on all machines to set the appropriate Mellanox firmware settings.

Best Practices for WEKA Data Platform with Juniper Switches

Our cluster is configured using the WEKA distributed POSIX client, which requires some tuning to integrate with the rest of the design.

We recommend the following:

  • Set the MTU to 9000.
  • If the back-end storage fabric is shared with another resource, set up appropriate CoS prioritization to ensure that AI ingest and checkpoint traffic is not interrupted by other applications' network I/O requests.

    If GPUDirect Storage is used instead of the WEKA distributed POSIX client, congestion management and mitigation must be set up on the network using Explicit Congestion Notification (ECN) and Priority Flow Control (PFC).
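
The MTU recommendation above can be verified on a Linux host by reading the interface MTU from sysfs. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and the interface names passed in are placeholders for whatever the storage-facing ports are called on your servers.

```python
from pathlib import Path

EXPECTED_MTU = 9000  # value recommended in the text

def read_mtu(interface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the current MTU of a Linux network interface from sysfs."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())

def mtu_mismatches(interfaces, sysfs_root: str = "/sys/class/net") -> dict:
    """Return {interface: mtu} for every interface not set to EXPECTED_MTU."""
    mismatches = {}
    for interface in interfaces:
        mtu = read_mtu(interface, sysfs_root)
        if mtu != EXPECTED_MTU:
            mismatches[interface] = mtu
    return mismatches

# Example call (interface names are placeholders, not from this design):
#   mtu_mismatches(["ens1f0np0", "ens1f1np1"])
```

An empty result means every listed interface already uses MTU 9000. The switch-side MTU still has to be checked separately.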

WEKA also provides tools that can be used to test and measure network activity from a WEKA system perspective.

The command-line tool 'weka stats' can report network performance as a percentage of 'good' activity.

When the output is shown as a percentage, any value below 85% indicates a potential issue that requires further examination.

Examples:

weka stats --category=network --show-internal --stat DROPPED_PACKETS --start-time -24h --end-time -1m -Z

If the weka stats command reports dropped packets as shown, further investigation is warranted.
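
The 85% rule of thumb mentioned above can be applied to a set of readings with a few lines of code. This sketch does not parse 'weka stats' output, whose format is not shown here; it assumes the percentages have already been collected, and the function name and reading labels are hypothetical.

```python
GOOD_THRESHOLD_PCT = 85.0  # threshold stated in the text

def flag_low(readings: dict[str, float],
             threshold: float = GOOD_THRESHOLD_PCT) -> dict[str, float]:
    """Return the readings that fall below the threshold."""
    return {name: pct for name, pct in readings.items() if pct < threshold}

# Illustrative values only, keyed by an arbitrary label per node.
sample = {"weka01": 99.2, "weka02": 84.9, "weka03": 85.0}
print(flag_low(sample))
```

A reading of exactly 85% is not flagged, because the text treats only values below 85% as suspect.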

More details and additional tools can be found on the WEKA website, on the page "Manually prepare the system for WEKA configuration".