Microsoft Azure AZ-801 — Section 8: Implement and manage Storage Spaces Direct Part 3
50. Create a Windows failover cluster
I want to look now at installing the failover cluster service on our server here, or I should say our servers, because we’re going to put it on both NYC-DC1 and NYC-SVR1.
So, here we are on NYC-DC1. We’re just going to go to Manage, Add Roles and Features. We’re going to click Next, Next, Next. All right. And then one more Next. And we should see Failover Clustering. I believe I had actually already checked it off on this machine earlier. I don’t think I showed that, but I believe I did that already. Yeah, I did. So, I’ll show you that process over on the other server. So, you’ll get to see the whole process, and you just have to do this on both servers.
So, here we are on NYC-SVR1. Manage, Add Roles and Features, and then we’re going to go on up to Features, and here it is right here, Failover Clustering. So, we’ll just go ahead and add that. Click Next and click Install. All right. And again, you would also have to do that on NYC-DC1. I know you didn’t see me do that because I’d already done it, but now you’ve seen me do it on this one. So, it’s the same process. So, you would do that on both servers. I’m going to go ahead and pause the recording while we’re waiting on this.
Okay, so the installation of the cluster feature is complete here. We’re going to hit Close. All right. And I’m going to jump over to my domain controller, and I’m going to go to Tools and open up the Failover Cluster Manager. So, the Failover Cluster Manager is going to be the tool that’s going to manage all this. And as you can see, I don’t have a cluster right now, so we’re going to click Create Cluster. Going to click Next, and then we’re going to enter our two servers.
So, the first server is going to be NYC-DC1. It’s going to check to make sure the service is present and then we’re going to add NYC-SVR1. It’s going to check to make sure the service is present on that one as well. All right. Both are. We’re going to click Next.
Now, it’s going to recommend that I run the validation wizard. It’s going to run a bunch of tests to tell me if everything is good. And I’m going to say no to that. Keep in mind, in the real world, Microsoft won’t provide any support if you don’t run the validation wizard first. But I’m going to skip it here to save time. I’m going to click Next. Then it wants to know a cluster name, and we’ll call this NYC cluster. That’s going to be the name of it. We’ll click Next. And we’re going to click Next again, and it’s going to begin building the official cluster. So, I’ll pause the recording while that’s happening.
So, our initial cluster here seems to be created. We’re going to go ahead and click Finish. All right. And here it is right here. We’re going to expand that out and we’re going to click on nodes. And it should say that both nodes are up and running. Let’s jump over to the server NYC-SVR1 and see if we get the same view on that.
So, here we are on NYC-SVR1. We’ll click Tools and we’ll go into the Failover Cluster Manager. So, we’re bringing the exact same tool open on this side. All right. So, just waiting on that to load up here. And it looks like it’s there. Roles, or rather Nodes. Sorry, we’re not worried about Roles right now. Yep, it’s up and running, so everything is good to go there. All right.
So, that is how we set up our initial cluster. We now officially have a failover cluster between NYC-DC1 and NYC-SVR1.
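If you would rather script these steps than click through the wizards, here is a minimal sketch of what that could look like. The PowerShell cmdlets named in the strings (Install-WindowsFeature, Test-Cluster, New-Cluster, Get-ClusterNode) are the standard ones for this job, but the Python helper names are hypothetical, and the exact spelling of the cluster name (NYC-CLUSTER) is an assumption, since the demo only says "NYC cluster". The Python only assembles the command text. It does not run anything against a server.

```python
# Builds the PowerShell command text for the steps shown in the demo.
# Nothing here is executed; the functions only return strings.

def install_feature_command(server: str) -> str:
    """Command to add the Failover Clustering feature on one server."""
    return (
        "Install-WindowsFeature -Name Failover-Clustering "
        f"-IncludeManagementTools -ComputerName {server}"
    )


def validate_command(nodes: list[str]) -> str:
    """Command that runs the cluster validation tests against the nodes."""
    return "Test-Cluster -Node " + ",".join(nodes)


def create_cluster_command(cluster_name: str, nodes: list[str]) -> str:
    """Command that creates the cluster from the given nodes."""
    return f"New-Cluster -Name {cluster_name} -Node " + ",".join(nodes)


def build_plan(cluster_name: str, nodes: list[str],
               run_validation: bool = True) -> list[str]:
    """Ordered list of commands: install on every node, validate, create, check."""
    plan = [install_feature_command(node) for node in nodes]
    if run_validation:
        plan.append(validate_command(nodes))
    plan.append(create_cluster_command(cluster_name, nodes))
    # Final step mirrors the "click on Nodes" check in Failover Cluster Manager.
    plan.append(f"Get-ClusterNode -Cluster {cluster_name}")
    return plan


if __name__ == "__main__":
    # "NYC-CLUSTER" is an assumed spelling of the cluster name from the demo.
    for command in build_plan("NYC-CLUSTER", ["NYC-DC1", "NYC-SVR1"]):
        print(command)
```

Note that this sketch keeps the validation step in by default, because skipping it in the demo was only to save time. Depending on your network, New-Cluster may need more parameters than shown here, for example a static IP address when there is no DHCP.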
51. Stretch cluster across datacenter or Azure regions
I’d like to now explain how stretch clusters work.
So, a stretch cluster is sometimes also referred to as a geographically dispersed cluster, and the idea of it involves having copies of your resources in multiple locations. And then the idea would be to provide a failover scenario where, if you had services go down in one location, you almost immediately fail over to that other location. This is, as you can probably imagine, how big companies like Microsoft and Google, with their data centers all over the world, do it. Amazon is another one, obviously. An entire data center could go down, a data center could fail, and users are going to experience little to no downtime at all. If anything, the latency might be a little sluggish for a moment. You know, things might slow down a little bit, but the failover happens very quickly. This is going to utilize something called synchronous replication, as opposed to just asynchronous replication. And I’ll get a little bit more into what that is in just a moment. The big advantage that we get here is that this is automated.
Now, you could manually do this as well, where you have IT people that are monitoring and have to manually say, okay, we’re going to deactivate the cluster and switch over. And back in the older times, there were a lot more manual things that had to occur there. But now, with stretch clustering and the services involved, this can all be automated. So, replication also occurs automatically between these different locations. You do have to have low latency, so you have to have good bandwidth, ideally fibre and all of that, connecting your locations together. But, you know, depending upon distance, that might be easy or not so easy.
The other thing, of course, is that administrative overhead is reduced, because there are no human beings that have to sit there and watch this thing the whole time. This is going to cut down on human error as well, because there’s not really a lot of manual management involved. It’s just based on a set of automated rules. All right. Stretch clusters use a feature called Storage Replica. Storage Replica allows the replication of your data to occur between these different locations very, very quickly.
Now, there are two different methods for this. You have synchronous, which is the preferred way to do it. And again, you have to have very low latency in order to really make synchronous work. But the idea is that changes occur and within milliseconds both locations have the information. So, if you had two data centers and one is located however many miles away from the other, as long as you’ve got good bandwidth, the changes can occur within milliseconds of each other. So, you do have to have very low latency to do that. The idea of synchronous is having data that changes almost instantly in both places. So, replication occurs almost instantly between these locations.
The other option is asynchronous, and asynchronous replication is generally what you would have to use if you’re dealing with long distances. Honestly, when you start thinking about synchronous, it’s difficult to achieve, even with fibre, when your sites are more than 25 to 30 miles apart.
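To put a rough number on why distance matters for synchronous replication, here is a small sketch. It assumes light moves through fibre at about 200,000 km per second, which is a common rule of thumb and not a figure from this course, and the function names are hypothetical. It only gives a floor: real links add delay from switches, routers, and cable paths that are longer than the straight-line distance, so actual latency will be higher than what this prints.

```python
# Rough floor on round-trip time between two sites over fibre.
# Assumption: signal speed in fibre is about 200,000 km/s (roughly two
# thirds of the speed of light in a vacuum). Real links are slower.

FIBRE_KM_PER_SECOND = 200_000.0
KM_PER_MILE = 1.609344


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip in milliseconds for a given one-way distance."""
    one_way_seconds = distance_km / FIBRE_KM_PER_SECOND
    return 2 * one_way_seconds * 1000.0


def sync_write_latency_ms(local_write_ms: float, round_trip_ms: float) -> float:
    """A synchronous write waits for the remote site, so it pays the round trip too."""
    return local_write_ms + round_trip_ms


if __name__ == "__main__":
    for miles in (30, 300):
        rtt = min_round_trip_ms(miles * KM_PER_MILE)
        print(f"{miles} miles: at least {rtt:.2f} ms added to every write")
```

The point of the second function is that with synchronous replication the round trip is added to every single write, while with asynchronous replication the application only waits for the local write.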
One of the benefits, though, of hosting your services in Azure is that the Azure data centers have the best, highest-speed fibre connectivity you can purchase, so they can be hundreds of miles apart and still achieve synchronous replication. This is why hosting clustered services in Azure is such an attractive way to do things. But when it comes to asynchronous replication, the way that it works is changes will occur in your main location first. And once that change has been committed, then the replication immediately starts occurring over in the other location. So, the downside, obviously, with asynchronous replication is that if your main site was to fail right in the middle of some changes, then the last couple of changes might not be committed to that other location.
So, for example, if you were dealing with, I don’t know, a database or something, a database change might have gotten committed in that main site. And then, right as it got committed, let’s say the main site failed. The secondary site would not have that change. So, most likely what would happen is that the user would be redirected over to the secondary site, and all of a sudden that change would not be there. So, yeah, there can be some issues there, some downsides there when going with asynchronous. But you know, it is what it is. If that’s the route you have to go, you don’t really have much of a choice. All right.
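Here is a toy model of that difference, just to make the data-loss case concrete. This is not how Storage Replica is implemented. The class and method names are made up for illustration, and the network is reduced to a simple list of changes that have not been shipped yet.

```python
# Toy model of synchronous vs asynchronous replication between two sites.
# Hypothetical names; the "network" is just a list of unshipped changes.

class ReplicatedVolume:
    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []    # changes committed at the main site
        self.secondary = []  # changes present at the other site
        self.pending = []    # committed on primary, not yet at secondary

    def write(self, change: str) -> None:
        self.primary.append(change)
        if self.synchronous:
            # Synchronous: the write is not done until both sites have it.
            self.secondary.append(change)
        else:
            # Asynchronous: commit locally first, replicate afterwards.
            self.pending.append(change)

    def ship_pending(self) -> None:
        """Asynchronous replication catching up with the primary."""
        self.secondary.extend(self.pending)
        self.pending.clear()

    def fail_primary(self) -> list:
        """Main site goes down; return the changes the secondary never got."""
        lost = list(self.pending)
        self.pending.clear()
        return lost


if __name__ == "__main__":
    async_volume = ReplicatedVolume(synchronous=False)
    async_volume.write("update order 1")
    async_volume.ship_pending()
    async_volume.write("update order 2")  # committed, but not shipped yet
    print("async lost:", async_volume.fail_primary())

    sync_volume = ReplicatedVolume(synchronous=True)
    sync_volume.write("update order 1")
    sync_volume.write("update order 2")
    print("sync lost:", sync_volume.fail_primary())
```

In the asynchronous run, the second change was committed at the main site but never made it across, which is exactly the situation where a user gets redirected and the change is missing.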
The other thing to consider when it comes to stretch clusters is the active-passive versus active-active scenario. So, let’s take a look at the active-passive scenario first.
So, you’ll see here in the diagram, we have two sites: a primary site and a secondary site. The primary site has two servers, Server1 and Server2, and the secondary site has Server3 and Server4. The secondary site is completely passive, right? So, as you can see, the red and purple machines are all being hosted on the primary site. In this diagram, they’re illustrating this as a Hyper-V scenario. So, you’ve got virtual machines, hosted probably by Hyper-V, that are being stored in shared storage, and that’s what those little cylinders are that you see there. And you have replication occurring between the sites utilizing what’s called Storage Replica.
Now, if you’re doing an active-passive scenario, then asynchronous replication is usually pretty adequate for that. But again, in a perfect world, you want synchronous. All right. So, the simple idea here is that in an active-passive scenario, the passive copy that’s in site two would not become active unless site one was to go offline. That’s the only time it’s going to become active. All right.
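The active-passive rule can be sketched in a few lines. This is only an illustration of the decision, with made-up names, and not how the cluster service actually decides ownership. The load field is there on purpose to show that it plays no part: the passive site takes over because the primary is offline, not because the primary is busy.

```python
# Minimal sketch of the active-passive rule: the passive site only becomes
# active when the primary site is offline. Names are hypothetical.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    online: bool = True
    load_percent: int = 0  # tracked only to show that failover ignores it


def active_site(primary: Site, secondary: Site) -> Optional[Site]:
    """Return the site that should be hosting the workload right now."""
    if primary.online:
        return primary      # a busy primary is still the active site
    if secondary.online:
        return secondary    # failover: primary is down, passive takes over
    return None             # both sites down, nothing can host the workload


if __name__ == "__main__":
    site1 = Site("Site1", online=True, load_percent=95)
    site2 = Site("Site2")
    print(active_site(site1, site2).name)  # busy, but still online
    site1.online = False
    print(active_site(site1, site2).name)  # now the passive site takes over
```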
The other thing is sometimes people say, “Well, why would you ever use active-passive? Wouldn’t it be better to always go with active-active?” I kind of agree with that, if you can. Actually, let’s jump over to active-active. In a perfect world, we want both locations to accept connections. Right. That would be great. So, you’ve got traffic flowing to site one and traffic flowing to site two, and if you’ve got the right equipment and know-how, you can actually have it set up so that users that are closer to site one always go to site one and users that are closer to site two always go to site two. It’s kind of like when you use places like Google and Amazon: you get routed to the nearest copies of those Google and Amazon servers. In a perfect world, that’s the way you want to do it. You have things like DNS that assist with that, but I’m not getting into the details on that right now. And so that would be ideal. But you also have to consider cost and all of that.
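As a rough picture of the active-active idea, here is a sketch where each user goes to the nearest site that is up. The function name and the distances are invented for the example. In practice this kind of routing is done by things like DNS and network equipment, not by a function like this one.

```python
# Sketch of active-active routing: send each user to the nearest online site.
# Hypothetical function and data; real routing is done by DNS and networking.

from typing import Dict, Optional, Set


def pick_site(distance_km_by_site: Dict[str, float],
              online_sites: Set[str]) -> Optional[str]:
    """Return the closest site that is currently online, or None if none is."""
    candidates = {
        site: distance
        for site, distance in distance_km_by_site.items()
        if site in online_sites
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    user_near_site1 = {"Site1": 20.0, "Site2": 400.0}
    print(pick_site(user_near_site1, {"Site1", "Site2"}))  # both sites active
    print(pick_site(user_near_site1, {"Site2"}))           # Site1 is down
```

Both sites take traffic all the time here, and when one goes away its users simply land on the other one.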
Now, if you’re hosting this on-premises, you know, you’ve got to pay for the equipment yourself. You’ve got to buy all this hardware and you’ve got to have good low latency going on. And the other thing is it might be that Server1 and Server2 are hosting these VMs and Server3 and Server4 are just on standby. But you know what, Server3 and Server4 could have other jobs that they could be performing. They could be web servers or email servers or whatever, and they’re just there as a backup in case of a failure. I always use this analogy. When I was a teenager, I worked at a grocery store. It was one of the first jobs I ever had. I was a cashier and bag boy, if you will, and when things got really busy, our manager would come out of his office and would go to a cash register and help ring people up. Right. But when things got slow, that manager would leave and go do whatever managers do, which to me was just sit in their office. But anyway, that would be an active-passive scenario, right? So you’ve got cashiers that are active, that are taking customers. But if things get really busy, then that manager could come in and take load. And so you have to think about that type of scenario.
Another example would be, let’s say that all of a sudden somebody at that grocery store who was ringing up groceries started feeling sick. “Oh, I don’t feel so well. I need to go sit down.” The manager could come and take over. So, the manager is almost like a passive node that would kick in.
Now, in the case of failover clustering, servers are not going to just kick in when performance is sluggish. That gets more into a load balancing scenario, and I’m not talking about load balancing here. I will say this: you can combine load balancing with failover clustering, but I’m not getting into that in this video.
So, anyway, in a perfect world, yes, we want active-active. If you do clustering in Azure, it’s very, very easy to set that up. It costs, right. But technically, if you had to buy all the hardware and everything for your own data center, it’s going to cost you there as well. All right. Okay.
So, hopefully now you understand the idea of active-active, active-passive and the concept of stretch clustering.