Arun Gandhi, Sr. Product Marketing Manager

Load Balancing in the AI Data Center

AI & ML | Data Center
The title “AI Data Center” and subtitle “Efficient Load Balancing” appear on a black background above headshots of the three hosts – Arun, Mahesh and Himanshu. A green orb with concentric circles is on the right-hand side of the screen.

Load Balancing in the AI Data Center

AI/ML workloads in data centers generate a distinct traffic pattern known as “elephant flows”: large volumes of remote direct memory access (RDMA) traffic, typically produced by the graphics processing units (GPUs) in AI servers. It is essential that fabric bandwidth is utilized efficiently, even for low-entropy workloads. Juniper’s Arun Gandhi, Mahesh Subramaniam, and Himanshu Tambakuwala discuss efficient load balancing techniques and their pros and cons within the AI data center fabric.


You’ll learn

  • The pros and cons of various load balancing techniques within the AI data center fabric

  • How “elephant flows” from large GPU clusters are handled by Juniper’s AI data center load balancing

Who is this for?

Network Professionals

Hosts

Arun Gandhi
Sr. Product Marketing Manager
Mahesh Subramaniam
Director of Product Management

Guest speakers

Himanshu Tambakuwala
Product Manager

Transcript

0:05 Arun: Hello everyone, welcome to the second episode of our video series on AI data centers. In the last episode, with Mikel, we discussed the popularity and adoption of RoCEv2 and the advanced options it supports for proper congestion control. We also discussed how the RoCEv2 components coordinate inside the data center fabric, particularly PFC and ECN inside the IP Clos fabric, and briefly touched on the advanced RoCEv2 options for getting the congestion control settings right. With a few load balancing efficiencies emerging in the industry, today we will discuss cutting-edge technologies for load balancing and techniques for balancing flows from GPU clusters. To kick off our discussion, I'm joined by another special guest and good friends, Mahesh and Himanshu. Mahesh and Himanshu, welcome to episode two.

1:04 Mahesh: Thank you, Arun. I'm glad to be in the hot seat with you.

1:09 Himanshu: Thanks, it's great to be part of this discussion with you and Mahesh.

1:13 Arun: So Mahesh, I'm going to kick off with my first question to you. Load balancing is not a new concept; it has long been a key feature for improving application performance by improving response times and reducing network latency. But why has it become a hot topic in AI infrastructure today?

1:33 Mahesh: Short answer: elephant flows with low entropy lead us to focus more on efficient load balancing in the data center fabric.

1:46 Mahesh: But Arun, to elaborate: any AI infrastructure has two phases, training and inference. Specifically for training, we connect all the GPUs together, which we call a GPU cluster, to train the model. In the GPU cluster, a GPU moves a memory chunk, a gradient, to another GPU; at the end of the day, it is the result of the training. Within a server, for example an NVIDIA server, the memory chunks move between the GPUs via NCCL, the NVIDIA Collective Communications Library; other vendors have their own equivalents, such as RCCL. But if you are moving that memory chunk from one server to another server, that's where you need a fabric. Moving the memory chunk is technically what we call RDMA traffic. The RDMA traffic is crucial, because we need to synchronize those results across all the GPUs in the cluster. It is also sensitive, because RDMA is remote direct memory access: you are moving memory from one place to another using queue pairs. The RDMA traffic is also large, with low entropy. Entropy means differentiation in the packets: if you have good differentiation in the packet headers, we can easily segregate the traffic and spray or load balance it across the parallel links on a particular switch. But if you don't have much differentiation, low entropy, it is very difficult to load balance the traffic. And if you don't have proper load balancing in the fabric, everybody knows there is a high probability of congestion in the fabric. When there is congestion, there will be packet drops, and when you have packet drops your job completion time goes up. So to have a lossless, congestion-free fabric, load balancing has become prominent in the AI data center cluster, specifically in the training cluster, and we must have proper, efficient load balancing in the AI data center fabric.
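
To make the low-entropy point concrete, here is a small illustrative Python sketch of how a hash-based switch might place flows on parallel uplinks. The addresses, UDP ports, uplink count, and CRC32 hash are assumptions for the sketch, not any specific ASIC's pipeline; the point is only that RoCEv2 elephant flows differ in very few header bits, so several of them can land on the same link.

```python
# Illustrative only: shows why low-entropy RoCEv2 flows can pile onto one uplink
# when a switch picks links by hashing the 5-tuple.
import zlib

NUM_UPLINKS = 4

def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the 5-tuple and map it onto one of the parallel uplinks."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_UPLINKS

# Two GPU servers exchanging RDMA traffic over RoCEv2 (UDP destination port 4791).
# Only the UDP source port differs per queue pair, so entropy is low.
flows = [("10.0.1.1", "10.0.2.1", "UDP", sport, 4791)
         for sport in (49152, 49153, 49154, 49155)]

for flow in flows:
    print(flow, "-> uplink", pick_uplink(*flow))
# With so few differing header bits, several elephant flows can hash to the same
# uplink, congesting it while other uplinks sit idle.
```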

4:18 Arun: Now that we understand the importance of load balancing in AI data centers, Himanshu, that brings us to my next question: what are the load balancing methods prevailing in the industry and supported by Junos, and why are they, or are they not, a good fit for AI data centers?

4:41 Himanshu: Yeah, so our Juniper QFX 5K switches currently support two load balancing mechanisms: the first is static hash-based load balancing (SLB) and the second is dynamic load balancing (DLB). Whenever a packet comes to a switch, the switch looks into the packet header and creates a hash out of it. It then takes the hash and checks the flow table to find whether an entry already exists. If there is an entry in the table, the flow is considered already present and active, so the switch takes the outgoing interface mapped to that entry and forwards the packet out of that link. But if the entry is not in the table, it is considered a new flow, so the switch has to create that entry in the flow table, and for that it needs to find a good outgoing interface. With SLB, it looks at the flows mapped to each outgoing interface, finds which link has the least number of flows mapped to it, and then assigns this flow to that outgoing interface.
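
A minimal Python sketch of the static, hash-based flow placement Himanshu describes: known flows keep their link, and a new flow goes to the uplink that currently has the fewest flows mapped to it. The class, interface names, and hash are illustrative assumptions; note the counter tracks the number of flows, not their bandwidth, which is exactly why elephant flows can defeat it.

```python
# Sketch of static (hash-based) load balancing with least-flows placement.
import zlib

class StaticLoadBalancer:
    def __init__(self, uplinks):
        self.flow_table = {}                          # flow hash -> uplink
        self.flows_per_link = {u: 0 for u in uplinks}

    def forward(self, five_tuple):
        h = zlib.crc32("|".join(map(str, five_tuple)).encode())
        if h in self.flow_table:                      # existing, active flow
            return self.flow_table[h]
        # New flow: pick the uplink with the least number of flows assigned to it.
        best = min(self.flows_per_link, key=self.flows_per_link.get)
        self.flow_table[h] = best
        self.flows_per_link[best] += 1
        return best

slb = StaticLoadBalancer(["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3"])
print(slb.forward(("10.0.1.1", "10.0.2.1", "UDP", 49152, 4791)))
```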

5:56 Himanshu: This static approach works out best in most cases where there are a large number of flows but the bandwidth within each flow is not huge. In the case of elephant flows, however, it does not result in efficient load balancing, and that is where dynamic load balancing comes into the picture. What dynamic load balancing does is run an algorithm that takes the link utilization and the queue utilization into account and comes up with a quality band, which it assigns to each of the outgoing interfaces. This quality band can range from 0 to 7, with seven being the best quality link and zero being the worst. Looking at this quality band, it assigns the flow to a link and the traffic starts flowing.
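
A hedged sketch of the quality-band idea behind DLB: each uplink gets a score from 0 (worst) to 7 (best) derived from its link utilization and queue depth, and a new flow is steered to the best-scoring uplink. The exact weighting an ASIC applies is not described here; the 50/50 blend below is an assumption for illustration only.

```python
# Illustrative quality-band computation for dynamic load balancing (DLB).
def quality_band(link_util, queue_util):
    """Map utilizations in [0.0, 1.0] to a quality band 0..7 (7 = best)."""
    load = 0.5 * link_util + 0.5 * queue_util      # assumed blend, not the ASIC's
    return 7 - min(7, int(load * 8))

uplinks = {"et-0/0/0": (0.92, 0.80),               # (link utilization, queue utilization)
           "et-0/0/1": (0.35, 0.10),
           "et-0/0/2": (0.60, 0.40)}

bands = {name: quality_band(*u) for name, u in uplinks.items()}
best = max(bands, key=bands.get)
print(bands, "-> assign new flow to", best)
```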

6:47 Himanshu: With DLB there are three modes of operation. The first is assigned flow: once it assigns a flow to a link, it continues to use that link for as long as the flow is active. The second mode is flowlet mode: once it assigns the flow to a link, it also keeps monitoring the flow and checking for pauses. If a pause in the flow is greater than the configured inactivity timer, it considers the flow to be over and treats the next packet as a new flow, going through the process of identifying the best link and assigning it to that link again. This is how it keeps rebalancing based on pauses in the flow. The third mode is per-packet mode, in which the packets within a flow are balanced across multiple links based on link utilization. But as we know, per-packet mode results in reordering on the NIC side, so the destination NIC has to have the capability to handle the reordering; that is the challenge with per-packet mode. From an AI/ML data center perspective, as Mahesh explained, there are a lot of elephant flows in the picture, so we think DLB has good potential there compared to SLB.
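
A minimal sketch of the flowlet idea: if the gap since a flow's last packet exceeds the configured inactivity timer, the next packet starts a new flowlet and may be reassigned to a better uplink. The timer value, data structures, and link picker below are illustrative assumptions, not platform defaults.

```python
# Sketch of DLB flowlet mode: rebalance only across pauses in the flow.
INACTIVITY_TIMER_US = 128          # assumed value; configurable in practice

class FlowletBalancer:
    def __init__(self, pick_best_uplink):
        self.pick_best_uplink = pick_best_uplink   # callable returning the current best uplink
        self.state = {}                            # flow hash -> (uplink, last_seen_us)

    def forward(self, flow_hash, now_us):
        uplink, last_seen = self.state.get(flow_hash, (None, None))
        if uplink is None or now_us - last_seen > INACTIVITY_TIMER_US:
            # Pause exceeded the inactivity timer: treat this as a new flowlet
            # and re-evaluate which uplink currently looks best.
            uplink = self.pick_best_uplink()
        self.state[flow_hash] = (uplink, now_us)
        return uplink

lb = FlowletBalancer(pick_best_uplink=lambda: "et-0/0/1")
print(lb.forward(0xABC, now_us=0))      # new flow: assigned to the best uplink
print(lb.forward(0xABC, now_us=300))    # gap > timer: re-evaluated as a new flowlet
```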

8:09 Arun: Thanks, Himanshu. That brings me to my next question, for Mahesh: what's next with load balancing?

8:17 Mahesh: Yeah, so as Himanshu explained, SLB, static load balancing, we can call option one, and static load balancing doesn't have the intelligence to go and check the link utilization or link health. That's the reason we go to option two, which is DLB, dynamic load balancing: its algorithm has the intelligence to understand link health as well as queue depth, which we call the quality band. Since you ask what is next: the problem with DLB is that the quality band, the quality information, always stays in the same local switch; it is not propagated to the other nodes, leaf or spine. That's where option three comes into the picture, called global load balancing. Global load balancing comes from Broadcom; the Tomahawk 5 (TH5) ASIC supports global load balancing, and we are doing it in our products as well. What it does is create the quality band, understanding the link utilization and the queue depth, and then propagate that information, which DLB cannot do. GLB can propagate that information from the local switch to the remote switch, leaf or spine, which means the advantage is that you not only know the local link health, you understand the quality and health of the whole path, so you can spray or load balance the traffic, specifically the elephant flows of the RDMA traffic, efficiently.
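
A rough sketch of the global load balancing idea Mahesh describes: the local switch combines its own quality band for a path with the band advertised by the remote leaf or spine, then steers flows by the overall path quality. The advertisement format and the min() combination below are assumptions for illustration, not the TH5 implementation.

```python
# Sketch of GLB: pick paths by end-to-end quality, not just the local link.
local_bands  = {"path-via-spine1": 6, "path-via-spine2": 7}
remote_bands = {"path-via-spine1": 7, "path-via-spine2": 2}   # advertised by remote switches

def path_quality(path):
    # Assumption: a path is only as good as its worst hop.
    return min(local_bands[path], remote_bands[path])

best_path = max(local_bands, key=path_quality)
print({p: path_quality(p) for p in local_bands}, "-> spray elephant flows via", best_path)
```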

9:54 Mahesh: We have option four as well, which we'll call DLB version two, or I would call it selective load balancing, selective DLB. What it does: as I mentioned in the earlier discussion, the elephant flow is the key, and it is crucial to look for it. Selective DLB goes into the RDMA traffic, into the BTH header, and looks for the elephant flows coming out of the RDMA traffic, what we call the RDMA WRITE verbs, which are nothing but the elephant flows. It identifies the elephant flows and sprays the traffic accordingly. So to summarize: option one is SLB, option two is DLB, option three is GLB, and option four is selective DLB.

10:44 DLB but one point I want to make it

10:47 clear here this all those load balancing

10:50 we are supporting reordering is another

10:52 key uh criteria we need to look into

10:55 that the reordering of the packet uh in

10:58 the n

10:59 right now the N vendors are who some

11:02 other vendors are supporting reordering

11:04 which we need to handle it carefully so

11:06 all the spraying can happen in the

11:08 fabric the Nick has to handle the

11:10 reordering that one point I want to make

11:12 sure

11:14 second we are part of the U Ultra ether

11:19 Consortium technical working groups and

11:22 already U started working on the

11:24 flexible reordering so it will create a

11:26 new standard once the flexible

11:28 reordering will emerge there will be

11:30 other options also come into the picture

11:32 in the fabric level as of today we have

11:35 four options as I mentioned earlier

11:38 Arun: Fantastic. So Himanshu, is the industry thinking about any further enhancements to DLB, and what is Juniper doing in particular with respect to DLB?

11:53 routing set of features among them there

11:55 is a feature called reactive path

11:56 rebalancing which is an enhancement of

11:58 DL be as we spoke about flowlet mode

12:01 wherein it monitors for the inactivity

12:03 timer in a flow to do the rebalancing

12:06 but uh with withl workloads it may not

12:10 be that always there is a pause in in

12:12 the flow large enough to cause the

12:13 rebalancing so what this feature does is

12:17 uh it keeps on monitoring the flow as

12:19 well as the link quality of the of the

12:21 link where the flow is going through so

12:23 if the link quality is not good in that

12:25 case it tries to find a new link and

12:28 assigns this particular flow to that

12:30 link so in in a way it will do the

12:33 rebalancing even without an inactivity

12:36 timer so this is this is going to be

12:38 very helpful with the long L flows uh in

12:41 particular other than that uh chiper is

12:44 also working on providing some

12:45 configuration option to to be able to

12:49 control the bucket size or the table

12:52 size assigned for for each of the ecmp

12:54 next stop so these are something these

12:56 are some some of the aspects that junipa

12:58 is working on
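
A minimal sketch of reactive path rebalancing as Himanshu describes it: keep watching the quality band of the uplink a long-lived flow is pinned to, and move the flow to a better uplink when its current link degrades, even if the flow never pauses long enough for a flowlet boundary. The threshold and data structures are assumptions for illustration.

```python
# Sketch of reactive path rebalancing: move a pinned flow off a degraded link.
QUALITY_FLOOR = 3   # assumed: rebalance when the current link drops below this band

def maybe_rebalance(flow_uplink, quality_bands):
    """Return the uplink the flow should use after checking link quality."""
    current = quality_bands[flow_uplink]
    if current >= QUALITY_FLOOR:
        return flow_uplink                          # current link still healthy
    best = max(quality_bands, key=quality_bands.get)
    return best if quality_bands[best] > current else flow_uplink

bands = {"et-0/0/0": 1, "et-0/0/1": 6, "et-0/0/2": 4}
print(maybe_rebalance("et-0/0/0", bands))           # degraded link -> flow moves to et-0/0/1
```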

13:00 Arun: Fantastic, this is very insightful, and more like a crash course on load balancing for me and for the viewers. But before we wrap, I need to ask: what other technologies do we see being enhanced in support of AI data centers? Mahesh, do you want to take that question? That'd be great.

13:22 Mahesh: Yeah. Whenever we are talking about the AI data center, Arun, the key word is the lossless fabric, right?

13:32 Arun: Correct.

13:36 Mahesh: We can build the lossless fabric in different ways, but to categorize it: proactively, we can make the fabric lossless using the efficient load balancing mechanisms we discussed today. The second is the reactive mode: if there is congestion, how are we going to handle it, and what are the mechanisms to control it? That's where RoCEv2 comes into the picture. So the next topic will be congestion management, which will be really helpful for whoever is watching this video, to understand how we can tune it for the AI fabric. The other one is thermal management: GPUs are greedier than switches and consume a lot more power, so how efficiently we can do thermal management in the racks inside the data center may be another topic.

14:35 Arun: Fantastic, I think that's a great point to conclude on. Thank you both, thank you Mahesh and Himanshu, for a very insightful discussion. And to all our viewers who are listening in: stay tuned to learn more about the AI data center clusters and all the technologies supported in the next video. So, stay well till then.

15:02 [Music]
