A brief introduction to Industrial Control Systems and Security

You may not know it, but much of your daily life depends on Industrial Control Systems (ICSs). From the power you're using right now to the water you drink, delivery depends on Programmable Logic Controllers (PLCs) and other ICS technology. In fact, nearly any time something in the physical world needs to be automated, there will be an ICS involved. As discussed in prior research by Pedro Umbelino (Bitsight identifies nearly 100,000 exposed industrial control systems), a large number of these systems are currently Internet-facing and at risk. Most folks with an IT/Infosec background don't have a strong understanding of ICS devices, so we'll cover their distinguishing characteristics and some of the challenges in applying IT security practices or technology to them.

ICS Components

Industrial Control Systems (ICS) devices have some fundamental differences from the systems we’re used to using every day, so it is important to get a basic understanding of how they work. It’s also important to understand the different types of Industrial Control Systems, so here is a brief glossary of a few of the most common ones and some other terms relevant to this post:

  • General Purpose Operating System (GPOS): a familiar operating system such as Linux, macOS, or Windows. Provides a variety of services to the end user and is designed to prioritize responsiveness and limit process starvation.
  • Real Time Operating System (RTOS): an operating system designed to provide minimal services to the "user" and instead prioritize meeting deadlines for critical tasks.
  • Programmable Logic Controller (PLC): a low-level computing device running an RTOS. It is usually configured and programmed through software running on an engineering workstation (usually a Windows device). Its primary job is to read inputs and write outputs based on how those inputs are processed by its running program. Those inputs and outputs can be digital (representing true or false) or analog (representing a range of values), and can be attached to physical devices like valves, pressure sensors, motors, and other sensors and actuators.
  • Supervisory Control and Data Acquisition (SCADA): both an architectural model for control systems and a type of control software running those systems. As its name implies, a SCADA system acquires data and provides process visualization and a control interface for an overall process consisting of many industrial systems. SCADA systems help run utilities, facility control, manufacturing, and more. In this post we'll mostly be referencing the software running on the Supervisory Computer within the overall SCADA architecture.
  • Human Machine Interface (HMI): a panel that displays data and provides controls for a single machine.
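To make the PLC's read-evaluate-write job concrete, here is a toy Python sketch of a scan cycle. The tag names and setpoints are invented for illustration; real PLC programs are written in languages like ladder logic or structured text, not Python:

```python
# Toy model of one PLC scan cycle: read a snapshot of inputs, evaluate the
# program logic, write outputs. Values here are plain Python stand-ins for
# real digital/analog I/O; nothing vendor-specific is implied.

def run_scan_cycle(inputs: dict) -> dict:
    """Evaluate the 'program' against a snapshot of inputs, return outputs."""
    outputs = {}
    # Example rung: open the relief valve if pressure exceeds a setpoint.
    outputs["relief_valve_open"] = inputs["pressure_psi"] > 150.0
    # Example rung: run the pump only if the tank is not full and no e-stop.
    outputs["pump_run"] = (not inputs["tank_full"]) and (not inputs["emergency_stop"])
    return outputs

snapshot = {"pressure_psi": 162.5, "tank_full": False, "emergency_stop": False}
print(run_scan_cycle(snapshot))  # {'relief_valve_open': True, 'pump_run': True}
```

A real PLC repeats this loop continuously, which is why the scan cycle time matters so much in the scheduling discussion below.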

A lot of the examples I’ll cover are about PLCs, but the general principles would carry over to other systems that run an RTOS. Many SCADA and HMI applications run on top of an underlying GPOS such as Microsoft Windows, so PLCs are really the components that operate in a radically different fashion.

Operating Systems: IT vs ICS

The first difference between PLCs and IT systems is their operating system: more specifically, their operating system's scheduling algorithm. Believe it or not, a single processor core in any operating system can only do one thing at a time. Processor scheduling algorithms determine how a particular operating system handles multiple competing tasks of varying run times, deadlines, and priorities. As an example for a GPOS, until recently the Linux kernel used the Completely Fair Scheduler (CFS), which aims to provide an equal share of processor time to tasks in a somewhat round-robin fashion. The idea is to prevent any one process from being starved and to prioritize responsiveness to the user. As of version 6.6, the Linux kernel moved to a new scheduler, Earliest Eligible Virtual Deadline First (EEVDF), which functions similarly to CFS but adds a "lag" factor (the difference between the amount of CPU time a task received and what it should have gotten) and allows tasks with nearer deadlines to receive CPU time earlier (but still only their calculated fair share).
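The "lag" concept can be made concrete with a toy calculation. This is a simplification of EEVDF for illustration, not kernel code; weights stand in for nice values, and a task is considered eligible to run when its lag is non-negative:

```python
# Toy illustration of EEVDF-style "lag": entitled CPU time minus received time.
# A positive lag means the task is owed CPU time and is eligible to run.

def compute_lags(tasks: dict, elapsed: float) -> dict:
    """tasks: {name: (weight, received_time)}; returns {name: lag}."""
    total_weight = sum(w for w, _ in tasks.values())
    lags = {}
    for name, (weight, received) in tasks.items():
        entitled = elapsed * weight / total_weight  # fair share so far
        lags[name] = entitled - received
    return lags

# Two equal-weight tasks after 10ms: A has run 6ms, B only 4ms.
lags = compute_lags({"A": (1, 6.0), "B": (1, 4.0)}, elapsed=10.0)
print(lags)                                        # {'A': -1.0, 'B': 1.0}
print([n for n, lag in lags.items() if lag >= 0])  # ['B'] -- B is owed time
```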

An RTOS, on the other hand, will generally use a preemptive priority-based scheduling algorithm, which executes tasks in priority order and allows higher-priority tasks to interrupt lower-priority ones. To illustrate the differences, consider a scenario in which two tasks are ready to execute at the same time. In CFS (the Linux approach since kernel 2.6.23), they will get alternating shares of the CPU time based on their "nice" value (related to their priority). In EEVDF (the new Linux approach), the task with the nearest deadline will execute first but be constrained to its fair share of CPU time. In preemptive priority-based scheduling (the RTOS approach), the task with the highest priority will execute first. For another comparison, suppose a task is running and a higher-priority task becomes ready. With CFS and EEVDF, the higher-priority task will receive processor time once the running task has completed its share, then run for its own fair share before yielding to another task; this could repeat several times before the high-priority task completes. With preemptive priority-based scheduling, the higher-priority task will preempt the running task and run to completion unless interrupted by another, even higher-priority task, after which the original task would return to a running state.
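The preemption behavior described above can be simulated in a few lines. This is a toy tick-based model with invented task names, priorities, and durations, not how any real RTOS kernel is implemented:

```python
# Minimal sketch of preemptive priority-based scheduling (RTOS-style).
# At each tick, the highest-priority ready task runs; a newly ready
# higher-priority task preempts immediately. Lower number = higher priority,
# a common RTOS convention (an assumption in this sketch).

def schedule(tasks: list) -> list:
    """tasks: list of (name, arrival_tick, priority, duration). Returns a
    per-tick timeline of which task held the CPU."""
    remaining = {name: dur for name, _, _, dur in tasks}
    timeline = []
    tick = 0
    while any(r > 0 for r in remaining.values()):
        ready = [(prio, name) for name, arrive, prio, _ in tasks
                 if arrive <= tick and remaining[name] > 0]
        if not ready:
            timeline.append("idle")
        else:
            _, name = min(ready)      # lowest number = highest priority wins
            remaining[name] -= 1
            timeline.append(name)
        tick += 1
    return timeline

# "main" (priority 1) arrives at t=2 and preempts "comms" (priority 15),
# which only resumes once "main" has run to completion.
print(schedule([("comms", 0, 15, 4), ("main", 2, 1, 3)]))
# ['comms', 'comms', 'main', 'main', 'main', 'comms', 'comms']
```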

So why does this matter, and where does it come into play? In a GPOS you aren't guaranteed when a newly ready high-priority task will receive CPU time, but in an RTOS you know it will begin receiving CPU time right away. Additionally, in a GPOS, depending on the number of ready tasks and their deadlines, you can't guarantee when a high-priority task will complete its execution, but in an RTOS you can, as long as it is not interrupted by another higher-priority task. There are consequences, though: if there is a continuous stream of high-priority tasks in an RTOS, a low-priority task can experience starvation. In a Siemens PLC, for example, the main program (which writes outputs based on the inputs) has a priority of 1, while communication has a priority of 15, meaning that too much communication could starve the main program. Siemens anticipated this and included a setting that determines what percentage of CPU processing time is dedicated to communication tasks, with a maximum value of 50%. There is still some danger that too much communication may greatly elongate the PLC scan cycle, possibly even causing it to exceed the configured maximum value. If that value is exceeded twice, the PLC will enter a Stop state and processing will cease (a task might lag on a GPOS, but that would rarely place the system into a failed state). With other devices, we've noticed that too much traffic will cause the PLC to drop its connection to our SCADA application while it handles other requests.
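As a back-of-the-envelope illustration of why capping communication time matters, assume (purely as a simplification, not Siemens' actual algorithm) that reserving a fraction of CPU time for communication stretches the scan cycle proportionally:

```python
# Simplified model: if a fraction `comm_load` of CPU time goes to
# communication, only (1 - comm_load) is left for the program, so the scan
# cycle stretches by 1 / (1 - comm_load). The numbers here are invented.

def effective_scan_ms(base_scan_ms: float, comm_load: float, max_cycle_ms: float):
    assert 0 <= comm_load <= 0.5, "the communication share is capped at 50%"
    stretched = base_scan_ms / (1 - comm_load)
    # Return the stretched scan time and whether it breaches the watchdog limit.
    return stretched, stretched > max_cycle_ms

print(effective_scan_ms(12.0, 0.25, 150.0))  # (16.0, False)
print(effective_scan_ms(100.0, 0.5, 150.0))  # (200.0, True) -- risks a Stop state
```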

To sum it all up, an RTOS like those used by PLCs is a reliable solution for analyzing inputs and writing outputs within tight deadlines over a long service life. However, due to its singular focus, it is limited in the additional services it can provide.

A GPOS such as Linux is excellent for balancing the many services it provides and its variety of capabilities. Unfortunately, each of these services creates a potential failure point for the system and keeping them in balance means the operating system cannot guarantee a specific task is executed in time to meet a deadline.

Patching & Maintenance Differences

PLCs also have additional constraints around patching when compared to standard IT systems, both in how you apply patches and in when you can apply them at all. In some cases, a device will have a web interface where firmware updates can be applied; other times there is a standalone tool; and in the worst case, you'll have to write the update to an SD card and physically insert it at the device. PLCs' typical location and criticality also present additional challenges. If you're in a 24/7 operation and you haven't spent the extra cash to build in redundancy, you may have to wait years for a plant shutdown. It's easy on the IT side to take for granted the many opportunities to apply updates to systems, or even just the fact that a system can be rebooted without all operations having to cease.

While we're on the topic of patching and updates, every once in a while Microsoft will find it necessary to roll out a patch that breaks your PLC configuration software on the engineering workstation. For example, KB5037591 broke TIA Portal v17's ability to compile the running program. Given the cost of a TIA Portal license, I suspect it is unlikely that many companies have a system running it in their patch test group, and even then, the issue wouldn't be caught unless someone tried to compile software changes and download them to a PLC. So even for the in-scope systems running a GPOS, patch management might not be very simple.

Security Considerations - Availability and Integrity Prioritized

There's also a huge difference in the systems ICS devices control and how you interact with them when compared to IT devices. Outside of those in use by end users, IT systems will typically influence other digital assets and processes. Industrial Control Systems control things in the physical world: machinery, dams, water treatment, power transmission and generation, traffic lights, and even roller coasters. Therefore, an issue with a PLC leads to problems in the physical world. Before we dive into security considerations, it is important to note that our research into ICS has been motivated by the disturbing number of control systems that are accessible from the public internet. We'll talk about some ways to mitigate the exposure risk in future posts, but any ICS device directly exposed to the internet poses a severe risk to your organization.

With this in mind, let's discuss the consequences of different types of failure with respect to the CIA triad. In the IT world, loss of confidentiality can mean loss of sensitive customer, patient, or employee information, leading to breach disclosures, reputational damage, fines, and overall financial loss. On the ICS side, given that there isn't any customer data, the consequences aren't quite as severe, with the exception of possibly leaking tag names that imply they control critical components (e.g., Emergency_stop) or proprietary values in your manufacturing process. Ultimately, loss of confidentiality could be a major issue on the IT side but has less severe consequences in ICS.

Loss of integrity in the IT world could result in log forgery, incorrect transaction data (and therefore monetary loss), defacement of company digital assets, and loss of consumer trust. Because PLCs directly control time-sensitive physical processes, loss of integrity in the ICS world, such as unauthorized or careless changes to the running program or important tag values, could lead to disasters such as vehicle collisions or train derailments, poisoning due to contaminated drinking water, damage to critical equipment, or power outages.

Loss of availability for IT systems will almost certainly lead to financial losses, whether through lost employee productivity or loss of customer purchasing ability. Loss of availability on the ICS side could cause lack of clean drinking water, shipwrecks (and a collapsed bridge), power outages, or even nuclear accidents.

You'll notice that for the loss-of-integrity and loss-of-availability incidents linked above, really the only difference is that the loss-of-availability incidents are reported as "system outages" while the loss-of-integrity incidents are reported as "attacks." Joe Weiss has made the interesting point in several of his talks and posts that the difference between an attack and an error is just intent. In contrast to IT systems, most PLCs and other low-level devices don't have robust logging capabilities for system and security events, so in the end we may not really be able to tell the difference between a system failure, a cyberattack, or user error. He also points out that, due to a lack of both the expertise needed to investigate ICS incidents and reporting requirements, only a subset of ICS attacks are reported. Anecdotally, I find it very interesting that when searching for examples of cyberattacks and outages, most incidents involving water facilities are reported as outages or system failures, while most involving energy/power are reported as cyberattacks.

What all this boils down to is that a difficult balance has to be struck when applying security best practices or functionality to control systems. As we discussed earlier, these devices typically run an RTOS designed to check inputs and write outputs in a timely fashion, not to provide robust services to the "user." This greatly limits what can be implemented at the device level. Beyond technical limitations, the processes these devices support cannot tolerate even momentary loss of availability or integrity with respect to the control system. If any measure risks compromising either, it becomes a safety risk.

Conventional IT security wisdom says that every system should have an AV or EDR agent running at all times. In the control systems world, really the only systems that can host an AV or EDR agent are the engineering workstation and the SCADA server. There are many threads on Reddit, plctalk, and other forums discussing AV/EDR conflicts that prevent PLC programming software from functioning. This, of course, just means the user cannot make changes to the configuration or running program of a PLC, which isn't as serious as if bad software took down the PLC itself. In a later article, we'll discuss some of the EDR/AV partnerships being formed by device vendors that may forestall these conflicts. However, as a thought exercise, imagine that a PLC could host a CrowdStrike Falcon agent and that the CrowdStrike outage (July 2024) had instead taken down a large percentage of the PLCs in operation. We wouldn't have been busy looking at memes about the event, since the power would be out and there'd be no clean water until sometime TBD. That is, of course, assuming a reactor meltdown or other catastrophic event didn't cause even more damage. A more likely scenario would be SCADA servers blue-screening due to the bad patch and losing central control.

Scanning conducted during penetration testing can cause a denial-of-service condition or at the very least could greatly elongate the PLC scan cycle. On one of our older PLCs, we’ve noticed that an nmap scan can elongate the scan cycle time by more than 150x. Tenable gives guidance to “Avoid Scanning Fragile Devices” and Rapid7 points out that blind scanning can lead to issues in ICS networks. And as mentioned earlier, even if a vulnerability is discovered, it may be tough to apply a firmware patch to a PLC (or the latest Windows patch could break your programming software).
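If scanning an ICS network is unavoidable, probing slowly and serially reduces the load on fragile devices. Here is a minimal sketch using plain Python sockets; the hosts, ports, and delays are placeholders, and this is not a replacement for vendor-approved discovery tooling:

```python
import socket
import time

def gentle_port_check(host: str, ports: list, delay_s: float = 1.0,
                      timeout_s: float = 2.0) -> dict:
    """Probe TCP ports one at a time with a pause between probes, to avoid
    flooding a fragile device the way a blind parallel scan can."""
    results = {}
    for i, port in enumerate(ports):
        if i > 0:
            time.sleep(delay_s)  # pacing between probes
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout_s)
            # connect_ex returns 0 on success, an errno otherwise
            results[port] = (s.connect_ex((host, port)) == 0)
    return results

# Example: check common ICS ports on a lab host (102 = S7comm, 502 = Modbus/TCP).
print(gentle_port_check("127.0.0.1", [102, 502], delay_s=0.5))
```

Commercial and open-source scanners offer similar throttling knobs; the point is that defaults tuned for IT networks are often far too aggressive for ICS gear.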

Encryption in transit is another requirement commonly encountered in the IT world. Many ICS protocols don't support any form of encryption, and for those that do, there is some risk of loss of availability if a certificate expires. For example, with OPC-UA a connection cannot be established if a certificate has expired, and depending on the dependent systems, centralized control may be lost. Certificate management is a core practice within IT security, but I can tell you from many years at Bitsight that many organizations are lacking in this area. Even more concerning here is that they often forget to keep certificates current on assets that aren't customer facing. I am not advocating that you forgo encryption in OPC-UA (especially since it is one of the aforementioned protocols that could potentially leak sensitive tag names), but stakeholders need to have a conversation about the risks.
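Given how often certificate hygiene lapses, even a simple expiry check on OPC-UA (or any TLS) certificates can prevent a surprise outage. This sketch parses the notAfter timestamp format used in X.509 certificates with Python's standard library; the date below is an invented example, not a real certificate:

```python
import ssl
import time

def days_until_expiry(not_after: str) -> float:
    """not_after: an X.509 notAfter string such as 'Jun  1 12:00:00 2030 GMT'."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - time.time()) / 86400.0

# Alert well before expiry so a renewal never takes down OPC-UA sessions.
remaining = days_until_expiry("Jun  1 12:00:00 2030 GMT")
if remaining < 30:
    print(f"WARNING: certificate expires in {remaining:.0f} days")
else:
    print(f"OK: {remaining:.0f} days of validity remain")
```

A check like this could run on the engineering workstation or SCADA server, where (as discussed above) conventional monitoring tooling is actually feasible.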

There are many ways in which industrial control systems differ from their IT cousins. They are made for different purposes, connected in different architectures, and have different constraints with serious implications for security. Ultimately, it boils down to this: in ICS, we must prioritize integrity and availability above all other considerations. Now that we’ve discussed the basics of how a PLC functions, what makes them unique, and important security considerations, look out for articles from our team about our unique partnership with Schneider Electric, the ICS lab we’ve built (and lessons we learned in the process), and industrial protocols!