Microsoft Azure AZ-801 — Section 9: Manage failover clustering
59. Implement cluster-aware updating for installing updates on nodes
I’d like now to go over the concepts of cluster-aware updating.
Now, when you connect machines together in a failover cluster, one of the things you’ve got to think about is keeping those servers updated and the best way to do that. You could sit down at each individual server, take it out of the cluster, update it, and then reattach it. That’s one possible way to do it, and it’s sort of the old-fashioned way, I should say, because you would have to do it one server at a time. But you’ve got to remember that the services have to stay up and running while updates are going on. So, when Microsoft releases updates, which feels like every day or at least every Patch Tuesday, you don’t need to do them one at a time manually, because Microsoft released a feature a while back that makes life a lot easier. It’s called cluster-aware updating, sometimes referred to by the acronym CAU.
Now, believe it or not, cluster-aware updating is not actually going to be managed through Failover Cluster Manager. You’d kind of think it would be, but it’s not. It’s managed through a tool in Server Manager. So, you come here to Server Manager, you go to Tools, and you’ll see Cluster-Aware Updating. It doesn’t really matter which server you do this on. I’m doing it from NYC-SVR1, which is part of my failover cluster, but you could do it from DC1 as well. Anyway, you pull that up, and the idea here is that you can see what updates are available and then tell the updates to be deployed. What’s great about this is that instead of you having to disconnect each server, install the updates, and reconnect it, one at a time, it automates that whole process for you. It does the updates one server at a time without the cluster services going down, so if clients are connected to the servers, those services stay up and running. You should, of course, consider doing this in a lab environment first.
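To make the one-node-at-a-time idea concrete, here is a minimal Python sketch of a rolling update. It is a toy model only and does not call any Windows API: the `Node` class, the `rolling_update` function, the "FileServer" role, and the "KB-A" style update names are all hypothetical, and the real tool does considerably more (restarts, readiness checks, and so on).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    roles: list = field(default_factory=list)            # clustered roles hosted here
    pending_updates: list = field(default_factory=list)  # placeholder update names
    paused: bool = False


def rolling_update(nodes):
    """Update nodes one at a time so clustered roles always have a host."""
    log = []
    for node in nodes:
        if not node.pending_updates:
            log.append(f"{node.name}: nothing to install")
            continue
        others = [n for n in nodes if n is not node and not n.paused]
        if not others:
            raise RuntimeError("no other active node to take over roles")
        target = others[0]
        # 1. Pause the node and drain its roles to another active node.
        node.paused = True
        moved = node.roles
        node.roles = []
        target.roles.extend(moved)
        log.append(f"{node.name}: drained {len(moved)} role(s) to {target.name}")
        # 2. Install the updates (a real run may also restart the node).
        log.append(f"{node.name}: installed {len(node.pending_updates)} update(s)")
        node.pending_updates = []
        # 3. Resume the node and fail its roles back.
        node.paused = False
        for role in moved:
            target.roles.remove(role)
            node.roles.append(role)
        log.append(f"{node.name}: resumed, roles failed back")
    return log


svr1 = Node("NYC-SVR1", roles=["FileServer"], pending_updates=["KB-A", "KB-B"])
dc1 = Node("NYC-DC1", pending_updates=["KB-C"])
for line in rolling_update([svr1, dc1]):
    print(line)
```

The point of the sketch is the ordering: a node is only touched after its roles have somewhere else to run, and the next node is not started until the previous one is back.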
So, in a perfect world, and I know this is not a perfect world and not everybody gets to do this, set up a test environment with some servers, maybe virtual machines, similar to what you’ve got running, deploy updates to those first, and just make sure everything is good. I’m not really getting into the intricacies of updating. Hopefully, you understand updates by now and how they can break things, so it’s important to test them before you put them out in production. That’s the big thing to think about here.
The next thing is I have to connect to the failover cluster. So, if you go right here, you can type the name in, or you can just drop the list down, and you’ll see it says NYC-CLUSTER, so it’s already detected the cluster service. Then we just click Connect, and as you can see, it has detected the cluster that is available. From there we can preview the updates that are available. So, I can say Generate Update Preview List, and it’s going to go ahead and pull the updates that are available based on how up to date these servers are.
Now, if you recently updated these servers, maybe before you put the cluster on there, which I don’t think I told you to do, but if you did, then great. As you can see, though, there are some updates available here. These are the updates that are available for Server1, and these are for DC1. The servers can have different updates available, because servers do update themselves and sometimes they’ll pull updates on their own, or maybe you updated one before it became a failover cluster member. And so I can choose those. From there you’ve also got Analyze cluster updating readiness that you can check out. If I click that, it runs through an analysis and lets you know if there are any issues. As you can see, everything is passing so far, so that looks good. It’s looking at my firewall rules now, and it’s going to let me know if there are any issues there. It’s definitely a good idea to run through this. Right now it’s telling me that a firewall rule that allows remote shutdown should be enabled on each node, so there’s a firewall rule that could potentially cause problems, and it gives you instructions on how to fix that if you want. It tells me the machine proxy on each failover cluster node should be set to a local proxy server. I don’t really have a proxy, but if I had one, that’s what it’s warning me about. And then it’s telling me the CAU clustered role should be installed on the failover cluster to enable self-updating mode.
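One way to think about the readiness analysis is as a list of rule results that you boil down to a to-do list. The Python sketch below is purely illustrative: the three warning titles paraphrase the findings described above, while the "passed" entry, the status strings, and the `summarize` helper are assumptions of this sketch rather than output from the tool.

```python
from collections import Counter

# Illustrative data only. The status values and the first entry are hypothetical.
results = [
    ("Example rule that passed (hypothetical)", "passed"),
    ("Firewall rule that allows remote shutdown should be enabled on each node", "warning"),
    ("Machine proxy on each node should be set to a local proxy server", "warning"),
    ("CAU clustered role should be installed to enable self-updating mode", "warning"),
]


def summarize(results):
    """Return status counts and the titles that still need attention."""
    counts = Counter(status for _, status in results)
    todo = [title for title, status in results if status != "passed"]
    return counts, todo


counts, todo = summarize(results)
print(f"{counts['passed']} passed, {counts['warning']} warning(s)")
for title in todo:
    print(" -", title)
```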
Ultimately, what’s happening there? My cluster itself has the failover services, and I’ll warn you that these failover services are sucking up a lot of memory. Unfortunately, I don’t have a lot of memory on my virtual machines, and when you don’t have a lot of memory, you will run into some issues as far as being able to perform cluster-aware updating. In fact, one thing you have to do before you can deploy updates is configure the cluster self-updating options. Until you do, you can’t apply any updates. Watch what happens if I try to apply the updates. It says the connected cluster does not have the CAU clustered role enabled. The reason it doesn’t is that we have not gone through this configure cluster self-updating process yet, so we have to go there. We have to add the clustered role and specify the frequency of self-updating: daily, weekly, or monthly. Then it says the updating run options are going to be based on an XML file. There are some settings here that you could tweak if you wanted to, for things like plug-ins and whether there are going to be any warning messages. There’s actually an article you could read on that, which I’m not getting into right now, but you could check it out if you want to tweak some of this stuff. From there, you set any additional options and click Next, and at that point you can click Apply.
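The ordering shown in the demo (configure self-updating first, then apply updates) can be sketched as a tiny state check. This Python model is hypothetical throughout: `CauModel`, its method names, and the error text are illustrative stand-ins that paraphrase what the demo shows, not the tool’s actual interface.

```python
class CauModel:
    """Toy model of the 'configure first, then apply' ordering from the demo."""

    FREQUENCIES = ("daily", "weekly", "monthly")

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.self_updating_frequency = None  # None means the role is not added yet

    def configure_self_updating(self, frequency):
        if frequency not in self.FREQUENCIES:
            raise ValueError(f"frequency must be one of {self.FREQUENCIES}")
        self.self_updating_frequency = frequency

    def apply_updates(self):
        if self.self_updating_frequency is None:
            # Paraphrases the message seen in the demo.
            raise RuntimeError(
                f"{self.cluster_name} does not have the CAU clustered role enabled"
            )
        return f"updating run started on {self.cluster_name}"


cau = CauModel("NYC-CLUSTER")
try:
    cau.apply_updates()
except RuntimeError as err:
    print("blocked:", err)

cau.configure_self_updating("weekly")
print(cau.apply_updates())
```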
Now, I will warn you, memory is an issue here. This cluster-aware updating service does take up quite a bit of extra memory. So, what ends up happening is that if you hit Apply, it’ll just run forever. It’ll never complete because there’s not enough memory on the machine. It took a while for me to figure that out originally, because there’s not really a lot of documentation out there about this. But yeah, if you don’t have a lot of memory on your virtual machines, it’s not going to work. In my case, I’ve only got four gigs of memory on each virtual machine, and that’s not enough, so it just runs forever. Unfortunately, that means I won’t be able to complete the cluster-aware updating here. If you’ve got lots of memory to play around with, you can do it, and it’s really easy. You’ve already seen essentially all the steps. Once you apply this configuration, you’ll be able to go and apply the updates to your machines, and you can have that done on a schedule.
60. Recover a failed cluster node and failover workloads between nodes
Let’s go over the concepts now of what happens if a node has failed. There can be various reasons why this happens. If you’re dealing with a physical server, there could be some kind of hardware problem. It could be that you have too much going on on one of the servers and a lot more memory is being taken up. And what we’ve got in this environment, because this is a little test lab, is not really ideal for the real world. What you want when it comes to clustering is for the hardware to be the same and to have the exact same services installed. But we have a limited setup here. NYC-DC1 is a domain controller, so it’s performing domain services, and NYC-SVR1 is a server that doesn’t have all those other services running on it, so they don’t match perfectly. In a perfect world you’d want them to be the same, and ideally you don’t really want a domain controller to be a member of a failover cluster, just because it’s got extra load on it. But again, I have limited resources here. I don’t have unlimited memory, and I’m assuming you don’t either, and I try to build this course in a way that you could do the same setup in your own environment if you wanted to. In a perfect world, I would have had a third server, so that NYC-SVR1 and NYC-SVR2 would be my failover cluster. That would be the ideal situation. My point is that if you’ve got too much going on on a machine, services could fail. Or you could have the usual thing: you deploy an update that doesn’t work right with the machine, and there’s a conflict of some kind. So, there are various things that could go wrong, anything from a hardware failure to a software failure. And so what do we do there?
Well, first off, if things aren’t performing right, the cluster may detect it itself, and at that point it may quarantine whatever machine is having problems. If the status says the node is in a quarantined state, usually what you can do is click on it, come over here to More actions, and then you should be able to trigger the cluster service to restart. And if the status were to say the node is stopped, you should be able to do the same thing.
The other thing you can do, let’s say you’ve got to take a server down for maintenance or something like that, is pause it. I can go over here to Pause, and I’ve got an option that says Drain Roles. If I click Drain Roles, you’ll notice the node goes into a paused state. What’s happened is that any clients or anything else connected to that server are drained off to the other server, so everybody is connected to this machine right here at this point. So, I could safely shut that machine down, do maintenance on it, and bring it back up. At that point, you can resume the node and reconnect it. I go right here to Resume, and it says Fail Roles Back. I click that, and you can see that it’s back up and running.
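The pause-and-drain, then resume-and-fail-back sequence above can be sketched as a small state model. This is a hypothetical Python illustration, not a cluster API: the `Cluster` class, its method names, and the "FileServer" role are made up for the example, and the sketch simply sends drained roles to the first active node it finds.

```python
class Cluster:
    """Toy model of Pause > Drain Roles and Resume > Fail Roles Back."""

    def __init__(self, roles_by_node):
        self.roles = {node: list(r) for node, r in roles_by_node.items()}
        self.paused = set()
        self._drained = {}  # node -> list of (role, node it was moved to)

    def pause_and_drain(self, name):
        targets = [n for n in self.roles if n != name and n not in self.paused]
        if not targets:
            raise RuntimeError("no active node left to receive roles")
        target = targets[0]
        self._drained[name] = [(role, target) for role in self.roles[name]]
        self.roles[target].extend(self.roles[name])
        self.roles[name] = []
        self.paused.add(name)

    def resume_and_fail_back(self, name):
        self.paused.discard(name)
        for role, target in self._drained.pop(name, []):
            # Only move a role back if it is still where the drain put it.
            if role in self.roles[target]:
                self.roles[target].remove(role)
                self.roles[name].append(role)


cluster = Cluster({"NYC-SVR1": ["FileServer"], "NYC-DC1": []})
cluster.pause_and_drain("NYC-SVR1")
print(cluster.roles)   # roles now live on NYC-DC1; NYC-SVR1 is safe to service
cluster.resume_and_fail_back("NYC-SVR1")
print(cluster.roles)   # roles are back on NYC-SVR1
```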
The other thing you’ve got, of course, is Show Critical Events. If there have been any issues with your cluster servers, clicking Show Critical Events is a great place to start. Any failures or critical issues are going to show up for you right there. But ultimately, the thing to remember here is this More actions feature.
The other thing you can do, if a server is gone for good or you just want to remove it from the cluster, is evict it. To do that, all I’ve got to do is go here to More actions and click Evict. I’m not going to go through that right now, but you could try it out if you want to. You can evict a server especially if it’s having problems: maybe the status is down, and you’re clicking More actions and trying to start the service back up, but it’s not starting. One of the common things you can try is evicting it, which removes it from the cluster entirely, and then all you’ve got to do is re-add it. How do you re-add it? You go back over here to Nodes, you go to Add Node, and you just run through that Add Node wizard again. Those are the traditional steps when it comes to recovering nodes in a failover cluster.
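The recovery steps in this lesson form a simple escalation: try restarting the cluster service first, and evict and re-add only if that fails or the server is gone for good. A minimal Python sketch of that decision follows. The `next_action` function and the status strings are hypothetical labels chosen for this example, not values read from the cluster.

```python
def next_action(status, restart_failed=False, gone_for_good=False):
    """Pick the next recovery step for a node, escalating only when needed."""
    if gone_for_good:
        return "evict"
    if status == "up":
        return "none"
    if status in ("quarantined", "stopped", "down"):
        if restart_failed:
            return "evict, then re-add with the Add Node wizard"
        return "restart the cluster service from More actions"
    raise ValueError(f"unknown status: {status!r}")


print(next_action("quarantined"))
print(next_action("down", restart_failed=True))
print(next_action("down", gone_for_good=True))
```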