Ethernet কি শেষ? Mac Studio ১০০x ফাস্টার!!

Spread the love

ভূমিকা / Introduction

NetworkChuck আবারও ফিরে এসেছেন Mac Studio ক্লাস্টারিং নিয়ে — কিন্তু এইবার পুরো গেমই বদলে গেছে। আগের ভিডিওতে পাঁচটি Mac Studio ক্লাস্টার করে তিনি পেয়েছিলেন ৯১% স্লোয়ার পারফরম্যান্স। নেটওয়ার্কিং লেটেন্সি ছিল মূল দুষ্মান। কিন্তু Apple চুপচাপ একটি সফটওয়্যার আপডেট এনেছে — macOS Sequoia 15.2 — যা Thunderbolt 5 পোর্টে RDMA (Remote Direct Memory Access) সাপোর্ট এনে দিয়েছে। এই ভিডিওতে NetworkChuck চারটি Mac Studio (প্রত্যেকটিতে 512GB RAM, 8TB স্টোরেজ, 80 GPU কোর) নিয়ে একটি $৫০,০০০ ক্লাস্টার তৈরি করেছেন — যার মোট ২TB ইউনিফাইড মেমোরি, ৩২TB স্টোরেজ এবং ৩২০ GPU কোর। প্রশ্ন: ক্লাস্টারিং কি এবার সত্যিই কাজ করবে?

৪টি Mac Studio: M2 Ultra, 512GB RAM, 8TB storage, 80 GPU cores each
মোট স্পেক: 2TB unified memory, 32TB storage, 320 GPU cores
দাম: $৫০,০০০ (তুলনায় Nvidia H100 ক্লাস্টার $৭৮০,০০০+)
কানেক্টিভিটি: Thunderbolt 5 (80Gbps, Thunderbolt 4-এর দ্বিগুণ)
গেম-চেঞ্জার: RDMA — লেটেন্সি ৩০০μs → ৩μs (১০০x উন্নতি)

“This might be the most powerful local AI setup ever built. Prove me wrong. What can beat 320 GPU cores, 2 TB of unified memory?”

RDMA: Apple-এর গোপন অস্ত্র / The RDMA Revolution

আগের ক্লাস্টারের ব্যর্থতার মূল কারণ ছিল নেটওয়ার্কিং লেটেন্সি। TCP/IP স্ট্যাকের ওভারহেডের কারণে GPU-তে GPU-তে যোগাযোগে ৩০০ মাইক্রোসেকেন্ড লেটেন্সি হতো — যা Tensor Parallelism-কে অকার্যকর করে দিত। Apple-এর সমাধান: RDMA বা Remote Direct Memory Access। RDMA TCP/IP স্ট্যাককে সম্পূর্ণ বাইপাস করে সরাসরি GPU Memory থেকে GPU Memory-তে সংযোগ স্থাপন করে। CPU-র কোনো প্রক্রিয়াকরণের দরকার হয় না। ফলে লেটেন্সি ৩০০μs থেকে নেমে এসেছে মাত্র ৩μs-এ — ১০০x উন্নতি!

আগে (TCP/IP): ৩০০μs latency, CPU processing overhead, pipeline parallelism-ই একমাত্র উপায়
এখন (RDMA over Thunderbolt 5): ৩μs latency, সরাসরি GPU-to-GPU connection, tensor parallelism সম্ভব
কীভাবে কাজ করে: macOS Sequoia 15.2 beta আপডেট + Recovery Mode-এ RDMA enable
Exo Labs: সিক্রেট beta ভার্সন RDMA সাপোর্ট সহ — এখন পাবলিকলি উপলব্ধ

“This takes our latency from 300 microseconds down to three microseconds. Are you kidding me? That’s 100x increase. That’s a bullet train. Direct connection.”

পাইপলাইন বনাম টেনসর প্যারালালিজম / Pipeline vs Tensor Parallelism

NetworkChuck সহজ ভাষায় দুটি ক্লাস্টারিং টেকনিক ব্যাখ্যা করেছেন। Pipeline Parallelism-এ প্রতিটি Mac আলাদা আলাদা লেয়ার প্রসেস করে এবং ফলাফল পরবর্তী Mac-এ পাঠায় — এটি সিকোয়েন্সিয়াল এবং ধীর। Tensor Parallelism-এ সব Mac একসাথে প্রতিটি লেয়ারের গণিতের একটি অংশ করে — এটি সমান্তরাল এবং দ্রুত, কিন্তু বেশি কমিউনিকেশন প্রয়োজন। RDMA ছাড়া Tensor Parallelism ধীর ছিল কারণ প্রতি টোকেনে ১৬০ বার কমিউনিকেশন হতো, যার প্রতিটিতে ৩০০μs লেটেন্সি — মোট ৪৮ms অপেক্ষা। RDMA সেই সমস্যা সমাধান করেছে। ফলাফল: Pipeline Parallelism-এ ৫ tokens/sec, Tensor Parallelism + RDMA-তে ১৬ tokens/sec — ৩x বেশি!

Pipeline Parallelism: প্রতিটি Mac আলাদা লেয়ার প্রসেস করে, সিকোয়েন্সিয়াল, ৫ tokens/sec
Tensor Parallelism (RDMA ছাড়া): সব Mac একসাথে কাজ করে কিন্তু বেশি কমিউনিকেশন — ৩ tokens/sec (আগের চেয়েও ধীর!)
Tensor Parallelism + RDMA: ১৬ tokens/sec — ৩x Pipeline-এর চেয়ে দ্রুত
লেটেন্সি ইমপ্যাক্ট: প্রতি টোকেনে ১৬০ বার কমিউনিকেশন × ৩μs = ০.৪৮ms (আগে ছিল ৪৮ms)

“It’s like a relay race with really fast sprinters, but we’re not working together. We’re waiting on each other.”

বাস্তব বেঞ্চমার্ক / Real-World Benchmarks

NetworkChuck বিভিন্ন আকারের AI মডেল টেস্ট করেছেন — ছোট Llama 3.2 3B থেকে শুরু করে ১ ট্রিলিয়ন প্যারামিটারের Kimi K2 পর্যন্ত। প্রতিটি ক্ষেত্রেই ক্লাস্টারিং সিঙ্গেল নোডের তুলনায় দ্রুত ছিল। সবচেয়ে впечатляющий ছিল একসাথে পাঁচটি ভিন্ন মডেল লোড করে চালানো — Kimi K2 (1T), DeepSeek V3 671B, Llama 3.3 70B, Llama 3.2 3B এবং আরও একটি — সবগুলো একসাথে, মেমোরির ~৫০% ব্যবহার করে। এছাড়াও Open WebUI, Xcode এবং OpenCode-এর মাধ্যমেও ক্লাস্টার ব্যবহার করা যায়।

Llama 3.2 3B (8-bit): Single node ১৪৭ tokens/sec → ৪-node cluster ২৪০ tokens/sec (+৬৩%)
Quinn 3 Coder 480B (MoE): Single node ২৭ tokens/sec → Cluster ৪০ tokens/sec (+৪৮%)
Kimi K2 (১ ট্রিলিয়ন প্যারামিটার): Cluster-এ ২৮-৩০ tokens/sec — লোকালি চালানো সবচেয়ে বড় মডেল!
DeepSeek V3 671B: Cluster-এ ২৬-২৭ tokens/sec — আগের ক্লাস্টার এটি লোড করতেই পারেনি
Multi-model: ৫টি মডেল একসাথে চালানো সম্ভব — মেমোরি ব্যবহার ~৫০%
পাওয়ার কনজাম্পশন: প্রতি Mac ~১০০-১৫০W, মোট ~৪০০-৬০০W

“I’m running every model I can possibly run on it, and it’s just doing it like a champ, and it’s fast.”

Exo Labs ও MLX ইকোসিস্টেম / The Exo Labs & MLX Ecosystem

Exo Labs-এর ওপেন-সোর্স ক্লাস্টারিং সফটওয়্যার এই পুরো প্রকল্পের প্রাণ। Exo যেকোনো কনজিউমার হার্ডওয়্যার ক্লাস্টার করতে পারে — Mac, PC, ল্যাপটপ, এমনকি Raspberry Pi। এটি Apple-এর MLX (Machine Learning Framework) ব্যবহার করে, যা ইউনিফাইড মেমোরির সুবিধা নেয়। Exo-র নতুন ভার্সনে একটি সুন্দর Mac native GUI যোগ করা হয়েছে (CLI-এর পাশাপাশি)। MLX-এর অপারেশন CPU বা GPU-তে চলতে পারে মেমোরি মুভ না করেই — কারণ এটি ইউনিফাইড মেমোরি।

Exo Labs: ওপেন-সোর্স, যেকোনো হার্ডওয়্যার ক্লাস্টার করতে পারে
MLX Framework: Apple-এর ওপেন-সোর্স ML ফ্রেমওয়ার্ক, ইউনিফাইড মেমোরি ব্যবহার করে
নতুন GUI: Mac native app — Pipeline/Tensor parallelism সিলেক্ট করা যায়, RDMA on/off
API endpoint: Exo REST API — Open WebUI, OpenCode, Xcode-এর সাথে ইন্টিগ্রেটেড
সেটআপ: macOS Sequoia 15.2 + Recovery Mode → RDMA enable → Exo install → ready

“Exo stopped development like 5 months ago… Nope. They’re here, they’re alive, and they’re working with Apple to resurrect clustering.”

চূড়ান্ত মূল্যায়ন / Final Verdict

NetworkChuck-এর এই পরীক্ষা প্রমাণ করে যে Apple-এর RDMA প্রযুক্তি এবং Exo Labs-এর সফটওয়্যার মিলে লোকাল AI ক্লাস্টারিংকে বাস্তবে পরিণত করেছে। ক্লাস্টারিং এখন শুধু সম্ভবই নয় — এটি সিঙ্গেল মেশিনের চেয়ে দ্রুততর। $৫০,০০০-এর এই ক্লাস্টার $৭৮০,০০০+-এর H100 ক্লাস্টারের বিকল্প হতে পারে যদি আপনার ২TB পর্যন্ত মডেল রানের প্রয়োজন হয়। তবে এটি এখনও beta সফটওয়্যার — কিছু stability issue আছে (যেমন multi-model লোড করার সময় crash), কিন্তু ধারণাটি প্রমাণিত।

“Are you impressed? Is clustering back? I kind of feel that way. It’s a proof of concept. There’s possibilities now.”

সারসংক্ষেপ / Summary

Apple-এর RDMA (macOS Sequoia 15.2-এর মাধ্যমে) Thunderbolt 5-এ GPU-to-GPU লেটেন্সি ১০০x কমিয়ে এনেছে — ৩০০μs থেকে ৩μs। এর ফলে Tensor Parallelism ব্যবহার করে ৩x দ্রুত AI মডেল ইনফারেন্স সম্ভব হয়েছে। Exo Labs-এর ওপেন-সোর্স ক্লাস্টারিং সফটওয়্যার MLX-এর সাথে মিলে যেকোনো Mac, PC বা Raspberry Pi ক্লাস্টার করতে পারে। চারটি Mac Studio (২TB RAM, ৩২০ GPU cores) দিয়ে NetworkChuck ১ ট্রিলিয়ন প্যারামিটারের Kimi K2 সহ পাঁচটি মডেল একসাথে চালাতে পেরেছেন — সবই লোকালি, ক্লাউড ছাড়া। ক্লাস্টারিং এখন সত্যিই কাজ করে — এবং এটি দ্রুত!

📖 ভিডিও বিবরণ ও রিসোর্স লিংক | Video Description & Resources

Hey…just try Twingate….you’ll never look at VPN the same: https://ntck.co/twingate-networkchuck I built another AI supercomputer with 4 Mac Studios… but this time it actually works. Earlier this year, I clustered 5 Mac Studios and it was 91% SLOWER. Everyone said clustering was stupid. But Apple just dropped a software update that changes everything – RDMA over Thunderbolt 5. Latency dropped from 300 microseconds to 3 microseconds. Now we’re running trillion-parameter models locally at speeds that actually make sense. 🔥🔥Join the NetworkChuck Academy!: https://ntck.co/NCAcademy

🔗 রিসোর্স লিংক / Resource Links

⏱️ টাইমস্ট্যাম্প / Timestamps

0:00 – The $50,000 AI Supercomputer
0:53 – What Apple Changed
3:05 – Connecting the Cluster
4:17 – Pipeline vs Tensor Parallelism
7:52 – RDMA: The 100x Latency Fix
10:02 – Twingate (Sponsor)
11:39 – Exo Labs is BACK
14:42 – Single Node vs Cluster Testing
17:58 – Qwen 3 Coder 480B Testing
19:03 – Kimi K2 (1 Trillion Parameters)
21:09 – Stacking Multiple Models
25:22 – Real Apps: Open WebUI + Xcode
27:57 – Final Thoughts
28:47 – How MLX Makes This Possible

📖 সম্পূর্ণ ট্রান্সক্রিপ্ট দেখুন / View Full Transcript / पूरा ट्रांसक्रिप्ट देखें ▼

📜 সম্পূর্ণ ট্রান্সক্রিপ্ট / Full Transcript / पूरा ट्रांसक्रिप्ट

🇧🇩 বাংলা / Bengali / बांग्ला:

NetworkChuck তার দর্শকদের জন্য প্রার্থনা করছেন। তিনি ঈশ্বরের কাছে আশীর্বাদ চাচ্ছেন — আমাদের ক্যারিয়ার, পরিবার এবং জীবনের প্রতিটি ক্ষেত্রে সফলতার জন্য। তিনি বলেন, প্রযুক্তির এই দ্রুত পরিবর্তনশীল যুগে আমরা যেন শান্তি ও দিকনির্দেশনা পাই। তাঁর মতে, আমাদের উচিত পরিবার ও প্রিয়জনদের সাথে সময় কাটানো এবং জীবনের আসল অর্থ খুঁজে বের করা। তিনি যীশুর নামে প্রার্থনা শেষ করেন এবং দর্শকদের জন্য শুভ কামনা করেন।

🇺🇸 English / ইংরেজি / अंग्रेज़ी:

0:00 So, I built another AI supercomput. And 0:02 this one’s crazy. 1 2 3 four Mac 0:05 Studios, 512 gigs of RAM each, 2 terb of 0:09 unified memory. This might be the most 0:11 powerful local AI setup ever built. Hold 0:14 up. I’ve done this before. Earlier this 0:16 year, I clustered together five Mac 0:17 Studios, [music] and it was kind of 0:19 terrible. I expected it to run AI models 0:21 like a champ, but actually adding more 0:23 computers made it slower. 91% [music] 0:25 slower. Everyone said clustering was 0:27 stupid, and they were right. So, why am 0:29 I doing this again? Well, Apple just 0:31 dropped [music] something new. A simple 0:32 software update that might change the 0:34 entire game. Something I was begging 0:36 for. So, in this video, we’re clustering 0:38 again. We’re going to throw the biggest, 0:39 [music] baddest AI models at this 0:40 cluster and see what it can do. Our goal 0:42 is to answer this one question. Does 0:44 clustering local AI actually make 0:45 [music] sense for us? Will it actually 0:47 be fast or will it just suck like last 0:49 time? Get your copy ready. Let’s find 0:51 out. 0:53 Now, Apple was watching me. In my last 0:55 video, I [music] said this. I don’t know 0:57 how XLabs is going to solve that though 0:59 because we’re at the mercy of what 1:01 hardware we [music] have and this. I 1:03 would love to know what the experience 1:05 would be with some like serious 1:07 connectivity between the GPUs of these 1:09 five Mac Studios. I was honestly pretty 1:12 bummed about the performance of that 1:13 cluster. I had such high hopes. [music] 1:14 I know you probably did too. But Apple 1:16 listened and they did something about 1:18 it. And they also sent me something. 1:20 This 1:34 This setup is insane. I wasn’t sure if 1:35 they were going to send me this, but I 1:37 was shocked when they said yes. They 1:38 sent me their biggest, baddest machines, 1:40 fully speced out. But hold up. My old 1:42 cluster had five Mac Studios. How is 1:44 this better? Watch this. Now, these 1:47 machines are ridiculous. Each one of 1:48 these Mac Studios has 512 GB of RAM. 1:52 RAM? That’s a hard drive for some 1:53 computers. And this is unified memory, 1:55 meaning the GPU can use it. So, let me 1:57 put it this way. 512 GB of GPU memory of 2:01 VRAMm. They have 8 TB of storage, 80 GPU 2:04 cores. And if we do the math, this 2:06 monster cluster has 2 terb of unified 2:09 memory, 32 TB of storage, and 320 GPU 2:12 cores. This is Maczilla compared to our 2:14 old cluster running M2 Max. It’s not 2:16 even a comparison. That’s kind of silly 2:18 cuz even though we have one less Mac, 2:20 each of our machines has eight times the 2:22 memory. So, we have 6.4 times more RAM, 2:24 we’re doubling our GPU bandwidth. And 2:26 this will make a huge impact, we’re 2:28 doing Thunderbolt 5 versus Thunderbolt 2:30 4, which is also double the bandwidth. 2:32 The only thing that makes you kind of 2:33 go, “Oh, that hurts.” is the price of 2:35 this cluster. $50,000. I know. But 2:38 seriously, think about it. What’s the 2:39 alternative to do something like this 2:40 locally? Like if you wanted the same 2:42 specs from an Nvidia H100 cluster, you 2:44 would need 26 H100s, each with 80 GB of 2:47 VRAM. That would cost you over $780,000. 2:50 And it’s actually more than that if you 2:51 build out the system. We’ll talk more 2:52 about that later. But the point is, this 2:54 cluster is ridiculous. Okay, cool. We 2:56 have these big amazing monsters, but the 2:58 biggest issue we had last time was 3:00 networking, the bandwidth. What did 3:01 Apple do? We’ll get to that. But first, 3:03 we have to actually connect them, right? 3:04 That’s the whole clustering part. Now, 3:05 to make this work, I had to connect them 3:06 via Thunderbolt and Ethernet. 3:08 Thankfully, I had a spare UniFi Switch 3:10 that had some 2 and 1/2 gig ports, which 3:11 I desperately needed because downloading 3:14 these models, they’re huge. The largest 3:15 one I had to download was 735 GB, and I 3:18 had to download that on each of the 3:20 Macs. I also had to make sure my uplink 3:21 was 10 GB Ethernet because goodness, 3:23 this took forever. We’re also using the 3:25 Ethernet so the cluster can see each 3:26 other, but it’s not how they’re actually 3:28 connecting and exchanging information. 3:30 That’s where Thunderbolt comes in. I was 3:31 given this very fancy diagram to connect 3:33 them just so in a mesh. It’s just a 3:35 little meshy. Sorry, I had to do it. 3:37 Now, let’s take a step back and look at 3:38 this for a moment. Isn’t this beautiful? 3:40 I honestly think this might be the most 3:42 powerful local AI setup ever built. 3:44 Prove me wrong. What can beat 320 GPU 3:46 cores, 2 TB of unified memory? This 3:49 cluster should be able to handle 3:50 anything. And notice our asterisk here, 3:52 should because it really doesn’t matter 3:53 how powerful these Macs are if the 3:55 connection between them isn’t super 3:56 fast. And that’s what killed it last 3:58 time, the networking. Running bigger 4:00 models didn’t matter because [music] 4:01 everything came down to a crawl. And 4:03 even though we’re doubling our bandwidth 4:04 with Thunderbolt 5, we still have a 4:06 massive networking problem, latency. But 4:08 Apple changed the game completely with 4:10 the software update. That’s it. Check 4:12 this out. 4:18 You know what really bugs me? When the 4:19 problem is actually networking. 4:20 Everybody blames the networking, but 4:22 this time it was actually true. When I 4:24 ran these models on five Mac Studios, it 4:26 was 91% slower. It wasn’t the GPU, it 4:29 wasn’t the memory, it was the 4:30 networking. It was that latency between 4:32 the connections. I mean, look at these 4:33 speeds from last time when I clustered 4:35 them together. It’s bad. But Apple said 4:37 they solved it. Let’s see. Actually, let 4:39 me just show you real quick. Let’s see 4:40 if they did it. Don’t stare at this too 4:41 long. This is your sneak peek. Here’s 4:42 the old way. 4:47 Five tokens per second. Not great. Let’s 4:49 try Apple’s fix. 4:56 Now, that’s a lot better. 15 tokens per 4:58 second. That’s three times faster. Same 5:00 model, same cluster. But what are they 5:02 doing? How do they make it faster? It’s 5:03 networking. It’s always about 5:05 networking. Now, the way they solved it 5:06 is actually pretty simple. They just 5:07 increase the speed of our Thunderbolt 5:09 connections. And I’m not talking about 5:10 going from Thunderbolt 4 to Thunderbolt 5:12 5 and doubling that bandwidth. No, no, 5:13 no, no. The metric we’re looking at is 5:15 latency. How quickly a packet or a 5:17 message can go from MAC to MAC. Now, 5:20 currently this latency is around 300 5:22 microsconds, which I know sounds pretty 5:23 fast, but not in the AI world. And 5:25 because of this latency, we were stuck 5:26 with something called pipeline 5:28 parallelism. That’s a fun word. 10 times 5:30 fast. Pipeline parallelism. Pipeline 5:31 parallelism. How far did you get? It’s 5:34 essentially a technique we use to split 5:35 up a model between multiple systems and 5:37 a cluster. For example, let’s take an AI 5:39 model like the Llama 3.37DB FP16. That’s 5:43 some good precision. A model like this 5:45 will have around 80 layers, which 5:46 essentially is a series of filters that 5:48 your prompt will pass through to help it 5:49 respond to you. So, if you say, “Hey, 5:51 what’s the capital of Japan?” It might 5:52 process it like this. Each layer is 5:55 doing some fancy math on the input and 5:57 passes the result to the next layer. 5:58 Each layer refining the answer until you 6:00 get your response. Now, when you cluster 6:02 this model, it divides up the layers 6:04 between each machine. So, Mac 1 would 6:06 get layers 1 through 20. Mac 2 21 6:08 through 40 and so on. But here’s where 6:10 it gets painful. It’s sequential. So, 6:13 for every token, Mac 1 processes layers 6:15 1 through 20, stops, sends the results 6:18 to MAC 2. It’ll process layers 21 6:20 through 40, stops, and then so on. It’s 6:22 like a relay race with really fast 6:24 sprinters, but we’re not working 6:25 together. We’re waiting on each other. 6:28 Now, pipeline parallelism is awesome 6:29 because it gave us capacity. We could 6:31 run large models that we can never run 6:32 on one machine and run them on multiple. 6:34 But what it didn’t give us was speed. 6:36 Instead, we got a waiting room. But 6:37 there is a better way, a faster way, and 6:39 it’s called tensor parallelism. I know 6:41 we’re getting a little nerdy. Coffee 6:43 break real quick. This is the much 6:44 smarter way. Instead of each Mac owning 6:46 layers and processing them sequentially, 6:48 all Macs work together on every single 6:51 layer. Here we’re not dividing the 6:52 layer, we’re dividing the math. So for 6:54 layer 1, Mac 1 does 25% of the math. Mac 6:58 2 does 25, 35, 4 25. When they’re done, 7:01 they combine the results. Now, in 7:03 theory, this is supposed to be three and 7:04 a half times faster than pipeline 7:06 parallelism. Goodness, that phrase kills 7:08 me. So cool, let’s just do that. That’s 7:09 the solution. We can’t. Our networking 7:12 still sucks. And it’s even worse with 7:14 this method. Because for each Mac to 7:15 work on one layer at a time, lots of 7:18 communication is happening. It’s like a 7:19 group project. Lots of messages being 7:21 sent back and forth. In fact, we’re 7:22 talking two combos per layer. So, for 7:24 every token, we have 160 combos 7:27 happening. Let’s do some math here. 7:28 Assuming each message takes about 300 7:30 microconds. Now, I know microconds is 7:32 this little symbol here. I just can’t 7:34 draw it. I can’t do it. I’m sorry. Times 7:36 160, that’s nearly 50 milliseconds of 7:38 waiting per token. So, because of our 7:40 networking latency, tensor parallelism 7:42 actually ends up being slower than 7:44 pipeline parallelism, which is why we 7:46 couldn’t do it. All that chitchat killed 7:48 it. If only our network was faster. If 7:50 only latency was solved. Well, that’s 7:52 what Apple did. They made it faster. And 7:54 I’m not kidding. It was just a simple 7:56 software update. Apple quietly enabled 7:58 in Tahoe 26.2 a technology on their 8:00 Thunderbolt ports called RDMMA or remote 8:04 direct memory access. This is huge. 8:07 We’re in the big leagues now. We’re not 8:08 playing around. Now, you may have heard 8:09 me mention RDMA before in a previous 8:11 video talking about AI data center 8:13 networking. You can watch it right here. 8:14 It’s what AI clusters and data centers 8:16 used to talk back and forth at extremely 8:18 high speeds. It’s what models like 8:20 ChatGpt and Claude use. But what is it? 8:22 Well, let’s talk about what life is like 8:24 before RDMA with our Thunderbolt 5 8:25 connections. These connections right 8:27 here are essentially just network 8:29 connections using good old TCP IP, the 8:32 traditional networking stack. Now, 8:33 that’s a problem for us here because 8:34 that introduces overhead, increasing 8:36 latency, because every message is having 8:38 to go through a few steps doing things 8:40 like being processed by the CPU before 8:42 it can even hit the GPU memory. And I 8:43 have no idea where the CPU and the GPU 8:45 is in this Mac Studio. I’m just making 8:47 stuff up. It’s this traditional 8:49 networking processing that’s causing our 8:51 latency. But with RDMA, we skip all 8:53 that. RDMA is direct memory access. We 8:55 remove the TCP IP stack. Say, “Nah, we 8:57 don’t need you anymore. We’re getting a 8:59 direct connection. No more stops. A 9:01 direct connection from GPU memory to GPU 9:04 memory, GPU to GPU.” That’s the direct 9:06 memory access part. And here’s what this 9:08 does. This takes our latency from 300 9:11 microsconds down to three microsconds. 9:16 Are you kidding me? That’s 100x increase 9:18 or decrease. That’s a bullet train. 9:20 Direct connection. So now with our 9:22 cluster, we no longer have to worry 9:23 about IP addresses, TCP IP overhead, all 9:26 that CPU processing. No, no, no. We have 9:28 a direct connection between the GPUs 9:31 with these Thunderbolt 5 ports. Direct 9:33 memory access. This solves our latency 9:36 problem. And again, it was just a 9:38 software update. What took them so long? 9:40 Seriously, Apple, I’m sorry. I’m 9:41 grateful for just it being here. I’m 9:43 sorry. Now, in theory, this sounds 9:45 awesome, but does it actually work? Now, 9:47 you saw a teaser already, but that was 9:48 just a smaller model. What happens when 9:50 we actually throw some crazy large 9:52 models at it? Like the largest models 9:53 available right now. Will it work? Will 9:55 it make clustering actually useful? 9:57 Let’s find out with the software that 9:59 you all thought went away, but it’s come 10:01 back with a vengeance. Hey, I’m outside 10:03 right [music] now. Why? because I want 10:05 to tell you about our sponsor today, 10:06 Twinate. Because check this out. I can’t 10:08 access my lab right now on my laptop. 10:12 I’m trying to connect to our cluster and 10:14 it’s not working. It’s because I’m not 10:15 using [music] Twinate. Watch this. I’ll 10:17 simply click connect. Authenticate cuz 10:20 it’s really secure. And by the way, this 10:21 is not VPN. No, no, no. This is better 10:24 than VPN. [music] It’s more than VPN. 10:26 Okay, cool. We’re authenticated now. 10:27 Let’s try it again. It better work. You 10:29 know what? It’s still not working. I 10:31 [music] don’t have access. But you’re 10:32 probably thinking, Chuck, wow, great ad. 10:34 You tried to show us you didn’t have 10:35 access and tried to give yourself access 10:36 with Twin Gate. It didn’t work. That’s 10:37 by design. Twate is zero trust network 10:39 access. You can set it up. You can 10:41 install it. Takes you about 5 minutes. 10:42 You can use a Raspberry Pi. Whatever. 10:44 It’s free for up to five users with your 10:45 home network. It’s a no-brainer. Just do 10:46 it. But also, you have to specifically 10:49 explicitly allow access. So, let’s do 10:51 that right now. 10:54 I’ll create the resource and [music] 10:55 then here’s where I give access. I’m 10:57 only given access to admins. Now, let’s 10:59 try it again. Fingers crossed. 11:01 [laughter] 11:02 We’re in. And I’m checking on my cluster 11:04 to see I’m downloading a new model right 11:05 now. Quinn coder. Oh, you’re not 11:07 supposed to see this yet. You got a 11:08 sneak peek during the ad. Look at you. 11:10 Seriously, if you’re not using Twin 11:12 Gate, what are you even doing? VPN 11:13 [music] is old. And if you’re like, 11:14 Chuck, I don’t need Twin Gate. Yes, you 11:16 do. You have a home lab, right? That’s 11:17 why you’re watching [music] my channel. 11:18 You do local AI. You have Plex. How are 11:20 you accessing that when you’re out in 11:22 the field? Seriously, just get signed 11:23 up. It’s free. Check the link below. 11:25 twgate.com/networkshuck. 11:26 And seriously, thank you to Twinkgate 11:28 for making this video possible. I 11:29 couldn’t do this kind of stuff without 11:30 them, so [music] check it out. All 11:31 right, now back to the video. And it’s 11:33 getting cold out here. I got to go back 11:34 inside. 11:39 Now, a lot of people thought Exo Labs 11:40 threw in the towel. Even Jeff. 11:42 >> Exo stopped development like 5 months 11:44 ago, and even simple fixes are being 11:46 ignored. So, did they give up the dream 11:47 of clustering AI with consumer hardware? 11:49 Nope. They’re here, they’re alive, and 11:51 they’re working with Apple to resurrect 11:52 clustering. A flashback earlier this 11:54 year, I used ExoLabs to cluster my Five 11:56 Mac Studios together. It was still 11:58 pretty new, rough around the edges, and 12:00 it’s kind of an amazing software, an 12:02 amazing idea because you can cluster any 12:04 hardware together, but the networking, 12:06 that was a thing. So, when ExoLabs and 12:09 Apple both reached out to me, I said, 12:10 “Hey, we’ve got a new update. We got a 12:12 thing we’ve been working on, and we 12:13 think it’s going to solve your 12:14 clustering networking issues, and they 12:16 said, “Do you want in?” And I’m like, 12:17 “Dude, put up that bat signal. I’m 12:19 there.” So, fast forward, here we are. I 12:21 have the Max. I got them connected 12:23 according to our official diagram. I 12:24 installed the beta Tahoe 26.2 update. 12:27 And here’s the real magic. I had to go 12:28 into recovery mode and enable RDMA. 12:31 Seeing that RDMA enabled, that got me 12:33 excited. From there, it was pretty much 12:34 smooth sailing. I installed the super 12:35 secret beta version that ExoLabs gave 12:37 me, which I think might be available 12:39 right now. I could be wrong. I’ll have 12:40 more info below. But look at this. It 12:42 got a facelift, right? Before this was a 12:44 CLI only tool, which I’m fine with, but 12:46 now we have a native Mac app and it’s 12:48 gorgeous. So, here we are after a few 12:50 software patches from Exo. My cluster’s 12:52 connected. And as you may have seen 12:53 earlier in our sneak peek, we have the 12:55 option of running in dumb mode. pipeline 12:57 parallelism or tensor parallelism. Just 13:00 for fun, let’s do it the old way once 13:02 more. So, we’ll select our pipeline and 13:03 MLX ring. We’ll select four nodes using 13:05 our entire cluster and we’ll choose our 13:08 llama 3.370B FP16. Load that sucker up. 13:11 Now, it’s actually pretty fast loading 13:13 up. This is so fun. It’s ready. And keep 13:16 in mind, this is the dome old slow way. 13:20 Five tokens a second. Around 200 13:21 milliseconds per token. Goodness, that’s 13:24 really slow. I can already have my cup 13:25 of coffee by the time it’s done. And 13:27 then let’s try this. This is going to be 13:28 fun. Let’s do tensor parallelism, but 13:30 stay on the old networking. No RDMA. 13:33 Let’s see what it does. Keeping in mind 13:34 with this, we’re having 160 13:36 conversations every token. Let’s load it 13:38 up. It’s ready. Let’s see what happens. 13:44 Yeah, it’s slower, right? I mean, we’re 13:46 talking half as fast, which wasn’t a 13:49 lot, right? We’re at three tokens per 13:51 second. 370 milliseconds per token. This 13:54 ain’t great. Oh, I mean, this is 13:56 unusable. But let’s add in Apple’s 13:58 magic. Let’s enable RDMA. Tensor 14:00 parallelism, RDMMA, four nodes, llama 14:02 3.37DB. Load that bad boy up. And this 14:05 is actually really fast loading up. Look 14:07 at those guys go. I love this new 14:08 interface they have. It’s ready. 14:11 I mean, look at that. That’s incredible. 14:13 16 tokens a second on average. 66 14:16 milliseconds per token. They did it. 14:18 Apple, you’re awesome. But this is a 14:19 smaller model, right? There are bigger 14:21 ones. What about DeepSeek? What about 14:23 Kimmy K2? How about both at the same 14:25 time? Get your coffee ready. You’re 14:27 about to make this cluster sweat, I 14:28 think. Let’s go. 14:36 Okay, so we just proved that tensor 14:37 parallelism works with RDMA. It’s 14:40 fantastic. Three times faster. But 14:41 here’s the real question. Do you even 14:43 need to cluster? Like, does clustering 14:45 make sense for you? Like, why not just 14:46 buy one expensive Mac? Like, one of 14:48 these guys has 512 gigs of RAM. What 14:50 models do we need to run that could not 14:52 run on that one machine? Let’s test it 14:53 out. Let’s throw progressively bigger 14:55 models on this thing. Let’s throw all 14:56 the models on it at the same time. Let’s 14:58 see what happens. Little coffee break. 15:02 It’s clustering time. I don’t know why I 15:03 said that. That didn’t feel right. All 15:04 right. Here we have our cluster over 15:06 here. We’ve got our metrics. Mac one, 15:08 two, three, and four. Now, I want to 15:11 test all these models right here. But 15:12 the first thing I want to test is kind 15:14 of a tiny model. I want to see how the 15:15 performance is on one Mac versus four. 15:18 I’m curious with RDMA, will the 15:20 performance be better? Because one Mac 15:22 by itself, it can run that model fine. 15:23 Let’s test it out. So, I’m just going to 15:25 choose one node and I’m going to load a 15:27 small model, a little baby llama 3.2 3B 15:30 8 bit. It’s nothing. Launch. 15:36 Okay, it’s ready. 15:38 And we’re rocking 147 tokens a second. 15:41 Let’s ask something more involved. 15:46 So, that’s fast. Let’s cluster it. 15:52 Okay, now with four nodes clustered. 15:56 It’s faster. 16:00 Okay, so 240 tokens a second. That’s 16:03 awesome. It’s 100 tokens faster than a 16:06 single node. Look at it go. Maxing out 16:09 that GPU across the board. Looks like I 16:11 kind of lost the script some somewhere 16:13 there, but it looks like we’re pulling 16:15 about um getting to 100 watts per MAC. 16:19 So 400 watts in total. Not bad. Let’s 16:21 keep moving up. Let’s do a single node 16:23 for the llama 3.370B FP16 because these 16:27 guys can run it. 16:30 So we loaded up on Mac 2. We can see 16:32 it’s full right there. Let’s see how it 16:34 performs. Okay, five tokens a second. 16:37 Not amazing. It’s maxing out that GPU. 16:40 Looks like 150 watts. I am changing the 16:42 monitoring halfway through here because 16:44 I think it’s better. I’m going to use 16:45 Mactop. It’s still going. But notice we 16:47 still have plenty of memory on that 16:49 second Mac, which is where it’s loaded. 16:51 All right, cool. Roughly five tokens a 16:52 second. Let’s cluster it. 16:56 It’s loaded up. Watch that memory fill 16:59 up. It’s so fun to watch it over here on 17:01 the screen. This is a night and day 17:03 difference compared to our last cluster. 17:05 It took forever. You guys have no idea 17:07 how painful that video was to make. 17:09 Okay, it’s ready. 17:12 Okay, 16 tokens a second. Try a more uh 17:16 intense prompt. And there it goes. Max 17:19 out the GPUs across the board. Memory 9% 17:23 on each one. About 130 watts on each 17:26 machine. This is a cluster, man. In a 17:29 good way. Okay, let’s move on. Let’s up 17:32 the game. I love watching it. Uh the GPU 17:34 uh memory go down as well. Okay, our 17:36 next model. Let’s do something a bit 17:38 bigger. Deep Seek. Okay, I’m back. He 17:40 didn’t know I left. I had to 17:43 troubleshoot some stuff. I don’t have a 17:44 Quinn model downloaded. I thought I had 17:46 Quinn. I think I have to let it download 17:48 for a little bit. An [sighs] hour. 17:52 Uh but I got it working. I had to 17:54 download this new model. I forgot to 17:55 download. Anyways, here we go. We’re 17:57 going to load it up. All right. So, 17:58 first we’ll do uh one node and we’ll do 18:01 the Quinn 3 coder 480B. It’s a big one, 18:04 but again, our Macs have 512 GB of RAM. 18:07 Now, this is a mixture of experts model, 18:09 so it should be a bit faster because 18:10 it’s not using all parameters all at 18:12 once. Let’s launch it. And keep in mind, 18:14 we’re building our way up to the Kimmy 18:16 K2 thinking model, which is a trillion 18:19 parameters. I can’t even like what? Yes, 18:22 we can do that. I think we’ll see. All 18:23 right, it’s loading on one node. Mac 2. 18:25 Mac 2 is blowing up. All right, he’s 18:27 ready. Explain 18:30 to me if 27 tokens a second. It’s pretty 18:34 good. Let’s cluster it up. Powering 18:36 down. 18:38 Tensor RDMA four nodes. Quinn 3 coder 18:42 launching. Look at it go. All right, 18:44 it’s ready and let’s see what we got. 18:46 [laughter] 18:47 40 tokens a second. Okay, so it’s nearly 18:49 doubled the speed. So the story so far 18:51 is clustering does not slow us down. It 18:53 speeds us up, which is what we expected 18:54 the first time around, right? Let’s keep 18:56 moving up and keep an eye on the power 18:58 and stuff. We’ll have that on the 18:59 screen. Let’s do our next model. We’ll 19:01 do actually, you know what? Enough 19:03 messing around. Let’s go Kimmy K2. I 19:05 don’t think we’ll be able to load this 19:06 model on one node. Yeah, it’s like at 19:09 minimum you got to do two, which is 19:10 still crazy. We only need two. Um, let’s 19:13 run it on four, though. This is a 19:15 massive model. 658 gigabytes per node I 19:18 had to download. One trillion 19:20 parameters. Let me just double check my 19:21 math here. Yes, one trillion parameters. 19:24 Oh my goodness. Okay, let’s load it up. 19:26 Let’s watch this thing fill up. Here we 19:28 go. 19:30 [laughter] 19:31 Now, this is like the best coding model 19:33 right now for uh local AI. And even when 19:37 people use cloud AI to code, they’ll pay 19:40 for this model on a hosted platform. 19:43 It’s that good. But here, we’re not 19:45 having to use any cloud resources. We 19:47 can host everything right here locally. 19:49 It’s still loading, filling up those GPU 19:51 tanks. Okay, it’s ready. [laughter] It 19:54 used 33% of the RAM on each one. Let’s 19:56 ask it a question. A trillion 19:58 parameters. Let’s see how it goes. 20:01 Whoa. So, it’s thinking right now. Let’s 20:03 expand the thinking. But we’ve got 28 20:05 almost 30 tokens a second. Oh my gosh. 20:09 And it’s so cool. It’s thinking, right? 20:10 Like, oh my gosh, a thinking model. I 20:13 mean, this is fast. How much RAM is 20:15 required to run the 4bit model? 20:20 Let’s see. Power. We’re doing like 110 20:24 watts on each one. Now, one thing I 20:26 wanted to show you guys, but I just 20:27 could not show you. And that’s the 20:29 bandwidth, the latency, the network 20:31 traffic. I wanted to see that just how 20:33 much stuff we’re pushing. But right now, 20:35 and you can see this now. Oh, this 20:37 thing’s stinking. If I go to my uh 20:39 network configuration, the Thunderbolt 20:40 bridge is inactive. It’s disabled. This 20:43 is essentially ghost traffic. I can’t 20:45 see it. I can’t monitor it. That might 20:46 change, but I did ask Apple. I’m like, 20:48 “Hey, please give me a way to visualize 20:50 this.” They’re like, “Yeah, we can’t do 20:51 that right now.” So, sorry. Pretend it’s 20:54 a lot because we have to assume that, 20:56 right? 160 conversations every token. 20:59 Let’s have it create a Python script 21:00 that can scrape websites. I mean, it’s 21:03 doing great. It’s thinking. It’s making 21:06 my script now. This is local AI. Okay. 21:09 But this isn’t the whole story, right? 21:10 So, we’ve got some room. We’re running 21:12 the biggest model we could download 21:13 right now. And we have some room. Uh, 21:15 let’s go home. And keep in mind, we’re 21:18 going to keep this model running. We’re 21:19 going to load a new one. Just refresh my 21:21 page so I get a fresh uh instance here. 21:23 I’m going to load this one. It’s 21:26 beefier, meaning like it’s a 713 21:28 gigabytes. It is the biggest model they 21:31 have. Um but it’s only 671 billion 21:34 parameters, which is so crazy. Our last 21:37 cluster, we tried to run a 671B and it 21:40 like destroyed everything. Now we’re 21:42 going to run it alongside a trillion par 21:44 uh parameter model, which makes me feel 21:47 accomplished even though I didn’t do 21:48 anything. Let’s launch it. Good job, 21:51 Apple and Exo. So, now let’s watch this 21:52 GPU fill up. So, we’re running Kimmy K2 21:55 thinking. And what’s this model again? 21:57 Deepseek 3.18bit 671B. Oh, it failed. I 22:02 don’t like that, but I know we can still 22:04 do it. I’m going to try again. 22:07 Um, I had to build these scripts uh with 22:10 cloud code that would reboot the cluster 22:12 and reinitialize it because this happens 22:14 with beta beta software. So, um hang 22:17 tight. Okay, I’m back. Let’s try it 22:19 again. Had to reboot the whole cluster. 22:20 Let’s first try and load Kim K2. I know 22:23 we can do this. I know we can do this. 22:25 Okay, we’re loaded. Let’s [snorts] go 22:27 deepsee. Come on. Let’s do this. 22:30 Deepseek, I believe in you. Launch. Come 22:34 on. Don’t fail. Okay, we’re loading. 22:37 It’s filling up. So far so good. I’m 22:41 getting nervous. It’s almost done. I 22:43 think we’re at 50% over 50% on each 22:46 node.h. Ah, those things are stressing. 22:50 Look at the map down here. It’s like ah. 22:53 Oh my gosh. 60%. Uh-oh. Did we break it? 22:57 Oh, it’s still filling up. 23:00 Ah, we did it. Okay. I’m I’m even scared 23:03 to talk to him, though, because I don’t 23:05 want to break it again. But we have to 23:06 talk to it, right? So, let’s have a chat 23:08 with uh Deepseek. 23:15 27 26 tokens per second. Let’s switch to 23:19 Kim K2. 23:21 And there we go. Like we’re using both 23:23 models. We have both those suckers 23:25 loaded up. That’s crazy. How much more 23:27 can we load? Let’s do another one. Oh 23:30 gosh. 23:32 Let’s load up Llama 3.37B FP16 cuz 23:36 that’s one of the other ones I have 23:37 downloaded. And launch. It’s loading. 23:39 Okay, it’s it’s ready now. The RAM 23:42 utilization is kind of weird, right? It 23:43 went up to 60% and now it’s back down to 23:45 50. But I I know we have these models 23:47 loaded. Let’s go talk to Llama. 23:52 Wow, what a dumb answer. I just want to 23:55 verify I have everything loaded up. I’m 23:56 going to switch back and forth. 23:58 Oh, wait. Why did it [laughter] 24:01 That was my 24:03 That’s funny. That was my uh clipboard 24:06 from earlier because that’s what I had 24:08 to sneak and do to restart the cluster. 24:10 Okay. And let’s switch to Kimmy. Kimmy, 24:13 I’m not sure how to say it. I’m also 24:15 proud of Apple on XO. Y is so good. Now, 24:17 here’s the thing. I can run other 24:19 models, but I didn’t download any other 24:20 big ones. And it takes That’s the 24:22 longest part. I’m not doing that again. 24:24 Uh, but let’s just see what else do I 24:26 have downloaded. I’ve got a few llama 24:27 variants. I mean, I’ll pull up that 3.2 24:30 model from earlier. Let’s just run that 24:32 if we have to do something. Actually, I 24:34 do have llama 3.370B 24:37 4bit. Let’s load you up because we can. 24:40 That’s ready. Let’s do another one. 24:42 Let’s uh grab the llama 3.2 3B 4bit. I 24:47 think I have that one. Yeah, let’s load 24:48 him up. All right, that was easy. So, we 24:50 have one, two, three, four, five models 24:55 running, but the memory usage is staying 24:57 at like around 50. Actually keeps going 24:59 down. I don’t know why that is. So, I 25:01 think we answered the question. 25:02 Clustering’s awesome now. I’m running 25:04 every model I can possibly run on it, 25:05 and it’s just doing it like a champ, and 25:07 it’s fast. But, okay. So far, we just 25:09 been stuck in Exo’s interface. What 25:11 about using real applications like 25:12 coding or open code or open web UI? 25:15 Like, can you use this with that? Let’s 25:16 find out. 25:22 All right, let’s try out open web UI 25:24 because this is real. It’s not just 25:26 benchmarks. It’s not just test software. 25:28 Let’s see what happens. I want to use my 25:31 cluster here or not my cluster, my open 25:32 web UI, ai.hogwarts.studio. 25:35 I should already have my uh models 25:36 loaded up here because EXO runs off of 25:38 an API endpoint. Let’s see. Let’s find 25:41 Kimmy. There it is. Let’s do thinking. 25:43 We’ll see if it starts to stress out a 25:45 little bit. All right. Let’s see what 25:48 happens. Okay. Um yeah. I mean, there it 25:51 goes. It’s freaking out over here. It’s 25:53 thinking. I mean, this is so cool. Look 25:56 at this. I’m using open web UI on a 25:58 totally separate server in my server 25:59 room and it’s connecting to my Mac 26:02 cluster running KBK to a trillion 26:06 parameter model. How awesome is this? 26:10 It’s thinking a lot. It’s very smart 26:12 model. I tell you what, while it’s doing 26:13 that, let’s go to uh this server here. I 26:16 have Xcode installed, which is like 26:18 Apple’s VS Code. It’s still going. Okay, 26:22 it wrote it. So, it finished the app. I 26:24 mean I mean that’s impressive what it 26:26 did. That’s so cool. All right, let’s 26:27 try Xcode. Now I’m just going to try um 26:30 looking at an existing project Daniel 26:31 Misource Fabric. Let’s find some code 26:33 here. 26:36 Let’s ask it to tell us what this code 26:38 does. Let’s talk to DeepSeek. This is so 26:40 crazy. 26:43 All right, it’s going. It’s freaking 26:45 out. It’s thinking over here. Look at 26:47 that. I’ll try one more. Let’s try open 26:48 code. I love CLI tools to uh code and do 26:51 whatever you want. Now, I should already 26:53 have the cluster set up. Yeah, there it 26:54 is. Actually, K2 thinking sitting right 26:56 there. 26:59 Yeah. So, it’s it’s freaking out over 27:00 there. It’s doing its thing. We can see 27:02 the thinking. But I just wanted to show 27:03 off that we can use any of these apps. 27:06 Although, I think it has a hard time 27:07 using tools with open code. 27:11 All right, there we go. And then I’ll 27:13 switch models to Llama 3.3 and say 27:17 analyze the code that was just created. 27:21 Dude, it’s going all out. It’s making a 27:22 whole freaking website. All right, I 27:24 think it’s done. I’m going to try this. 27:25 Oh, it’s cute. It’s cute. I’m going to 27:27 interrupt it. I let llama take over. 27:31 Let’s see while it’s thinking here if I 27:32 can still use open web UI cuz I think 27:34 I’m stressing it out. Yeah, I’m trying 27:36 over here. Yeah, it’s fully stressed out 27:38 right now. I think I locked it up. Okay, 27:40 we did make it cry. Let’s see if the web 27:42 is still responsive. 27:46 Now, I’m going to chalk this up to beta 27:47 software. This is a bug that can be 27:49 fixed, but now I need to um tell it to 27:51 die. So, I’m gonna do that now. Restart 27:54 my cluster. I don’t want it to overheat 27:57 in there. So, are you impressed? Like, 27:59 is clustering back? I kind of feel that 28:01 way. Like, I was impressed with these 28:03 results now. Sure, I’m running a $50,000 28:05 cluster, but it’s a proof of concept. 28:07 There’s possibilities now, and we’ve 28:09 come a long way since the beginning of 28:11 this year where we just saw glacial 28:12 speeds, but now it’s functional. It’s 28:15 amazing. Now, full disclosure, Apple 28:18 loaned this to me. I don’t get to keep 28:19 this. I wish. But I do have it for a 28:22 little bit longer. What should I do with 28:23 it? You have any ideas? I’ve got a 28:25 $50,000 cluster. There’s got to be 28:27 something else I can do. Let me know in 28:28 the comments below. And of course, if 28:29 you want to see that last cluster video, 28:31 if you haven’t already seen it, it’s 28:32 right here. If you want to learn more 28:33 about RDMMA and how AI networking works 28:35 in the data center, I made a whole video 28:37 about it here. No, I’m just going to put 28:38 it over here. I dive deeper into how 28:40 RDMA helps us bypass all those 28:42 roadblocks, those bumps in the road. 28:44 Anyways, that’s all I got. I’ll catch 28:46 you guys next time. Hey, real quick. 28:48 This just in. 28:51 Just a little extra info, a little mini 28:53 segment on how Exo uses MLX, which MLX 28:56 is how we’re running all these models. 28:58 It’s Apple’s open source machine 28:59 learning framework that allows 29:00 developers like Exo to do stuff like 29:03 combining this stuff. Now, here’s what 29:04 it says. RDMMA over Thunderbolt driver 29:06 enables MLX distributed to communicate. 29:11 >> Hey, Chuck, I finished analyzing the 29:13 call me back feature. Here’s what I 29:14 found. 29:15 >> Sorry, that was Claude calling me. So, 29:18 it says, “RDMMA over Thunderbolt driver 29:20 enables MLX distributed to communicate 29:22 with low latency across Thunderbolt 5, 29:24 enabling high levels of blah blah blah 29:26 blah blah.” Now, what’s cool is that MLX 29:28 operations can run on either the CPU or 29:30 the GPU. And this is without needing to 29:32 move memory around because it’s u 29:34 unified memory. So, seriously, shout out 29:36 to the MLX team for being awesome. Exo 29:39 worked closely with them on making this 29:41 thing happen. And without them, this 29:42 video wouldn’t be possible. So, I see 29:44 you MLX team. You’re awesome. And you 29:47 watching this video right now. Comment 29:48 below and say thank you Apple MLX team. 29:51 Let’s show them some love. Anyways, 29:52 that’s the end of your news segment. And 29:53 yeah, I was reading my phone or 29:54 something on this. 29:57 Hey, you’re still here. So, at the end 29:59 of my videos now, I like to pray for 30:00 you, my audience. I know it might feel 30:02 kind of weird. 30:04 It is. I agree with you. But I uh 30:08 genuinely care about you guys and I want 30:10 to see you succeed. I want to see you 30:12 have lives of purpose and uh lives that 30:16 are joyful. Um so anyways, I’m just 30:18 going to pray for you. Um if you don’t 30:20 like that, that’s fine. You can just 30:21 click off. That’s why I put it at the 30:23 end. Uh but let’s go ahead and pray. 30:26 H [snorts] Lord, I thank you for the 30:28 person on the other side of the screen. 30:30 I thank you that um they’re here, that 30:33 they’re uh passionate about technology, 30:36 that they are [sighs] 30:38 excited for the future, and that even 30:41 though AI may feel kind of overwhelming 30:43 sometimes, um 30:47 they 30:48 can see the light ahead. And I pray that 30:51 they have if if they have any anxiety 30:53 over AI or the job market or what the 30:56 future’s going to hold that you would 30:58 just ease those fears. 31:00 And if they’re worried about their 31:02 current job is going to go away or if 31:04 they’re trying to find one, I ask that 31:05 you just give them peace in that you 31:08 would show them the path forward because 31:09 things are changing. Let’s be real. Uh 31:12 but God, you know what’s going to 31:15 happen. Uh you know, you’re not 31:17 surprised by anything. So, I ask that 31:20 you just give us a path forward. Um, 31:21 give us wisdom. Give this this person 31:23 wisdom. And I ask that you bless their 31:25 family life right now. Um, everyone’s 31:28 got family and I pray that you 31:31 would like right now I’m making this 31:33 video on Christmas, so around Christmas 31:35 time. So, I pray that you would fill 31:38 their hearts with joy and you remind 31:40 them of the importance of the people 31:42 closest to them and if they aren’t close 31:44 to these people that they should do that 31:46 now. Um, restore relationships, men, 31:50 what’s broken, 31:54 [sighs] 31:55 fill them with peace. Uh, 31:59 I saw a movie the other day. It’s called 32:01 That Christmas. Um, and Christmas time 32:05 is like a emotional magnifying glass. It 32:07 said if you’re sad around this time, if 32:09 you’re lonely, it’s going to magnify 32:10 that, amplify it. If your heart’s full 32:12 of joy and you’re surrounded by family, 32:13 it’s going to magnify and amplify that. 32:14 So I pray for the people who are 32:17 lonely right now around this time that 32:20 you would uh 32:23 fill their hearts with joy 32:27 and that um 32:29 I’m hoping that during this season they 32:31 can 32:33 learn a bit of the reason why we we take 32:35 this time Christmas more Christ. Um, I 32:40 do believe that the ultimate joy can be 32:42 found in you, Jesus. And I pray that the 32:44 other person, that this person on the 32:45 other side of the the screen, um, 32:48 wherever they’re at, they can just 32:49 encounter you in some way in whatever 32:51 that means, 32:53 even in their doubt, even in their 32:55 disbelief. 32:57 Uh, it doesn’t matter to you. Meet them 33:00 there. 33:02 I ask this in your name, Jesus. Amen. 33:05 Sorry that was a bit long one, but I 33:06 feel like uh that was something I wanted 33:08 to pray for you on. Anyways, that’s all 33:10 I got. I’ll catch you guys next time.

🇮🇳 हिन्दी / Hindi / হিন্দি:

NetworkChuck अपने दर्शकों के लिए प्रार्थना कर रहे हैं। वह ईश्वर से आशीर्वाद माँग रहे हैं — हमारे करियर, परिवार और जीवन के हर क्षेत्र में सफलता के लिए। वह कहते हैं कि तकनीक के इस तेज़ी से बदलते दौर में हम शांति और दिशा पाएँ। उनके अनुसार, हमें परिवार और प्रियजनों के साथ समय बिताना चाहिए और जीवन का सही अर्थ खोजना चाहिए। वह यीशु के नाम में प्रार्थना समाप्त करते हैं।