DAIR – Cloud Security Best Practices

The goal of this tutorial is to give you a solid foundation for how to approach security in a modern cloud environment. By discussing cloud architecture, authentication, networking, logging, and how applications work with these components, we hope to empower you and your team to ask the right questions when designing the infrastructure of your cloud deployment project.

Jump in…

Risk Analysis

Modern applications have many moving parts. It is important for any large project to start with a risk analysis to identify your exposure. While the exact steps for this analysis are out of scope for this document (for further reading, see the risk analysis articles published by Intel and isaca.org), your analysis should cover three main categories:

  1. Sensitivity;
  2. Threat; and
  3. Economic Impact.

It is important to be aware of the sensitivity level of data you collect, along with the risks surrounding it. This awareness will allow you to assess the potential damage to your business if data is compromised.

Based on yearly security breach and incident reporting, the most common attack patterns include:

  • Privilege Misuse – Exposed or weak credentials exploited
  • Distributed Denial-of-Service (DDoS) – Services taken offline by a swarm of bots
  • SQL injection – Malicious strings submitted to databases through forms or other public access points

Given the ubiquity of the internet in our daily lives, cyberattacks will only increase. The threat landscape changes quickly, as do best practices around security adoption. Taking time to plan and understand the risks related to a cloud environment will pay dividends, as the reputation of the company and remediation costs are on the line. Few businesses can afford hours or days of downtime to fix a security problem, so early and comprehensive planning is critical.

Where to Start

Regardless of your project’s complexity, there are several key starting points to secure your application and infrastructure. Many of these are covered in depth later in this tutorial.

Authentication and Authorization – How are you ensuring that the right people are accessing the right data?

  • Don’t use simple passwords, and don’t reuse passwords
  • Use multi-factor authentication wherever possible
  • Follow the Principle of Least Privilege
  • Regularly review permissions and authorized users, and revoke unnecessary privileges

Data Protection – How are you ensuring your data is protected?

  • Use encryption for data in transit or at rest
  • Separate sensitive data from less sensitive data

Network Access – How openly available is a service to the world? Should it be?

  • Use firewalls or security groups to limit access to services and prevent leaks or unintended access
  • Use a Web Application Firewall (WAF) to filter malicious traffic before it reaches your environment

Software Security – How are you ensuring that security updates are applied, and known vulnerabilities are patched?

  • Configure your cloud virtual machines (VMs) to automatically update and patch the operating system (OS)
  • Use cloud managed services where appropriate
  • Run regular penetration tests to expose issues
  • Have a procedure for tracking updates to third-party components

Configuration Management – How are you ensuring your configuration is correct?

  • Harden your systems and services (only enable minimum required services)
  • Automate service configuration and deployment where possible

Logging and Visibility – How are you setting up your application to monitor the different components and identify if something is not operating as expected?

  • Use a centralized logging platform to collect and organize logs from all parts of your environment
  • Set up meaningful alerts and notifications

Shared Responsibility Model

In traditional co-located hosting, you are responsible for the entire stack: from the bare-metal hardware to the software running on it. When using a cloud service, ownership of some of that stack moves to the cloud operator. This dual ownership is known as the Shared Responsibility Model.

Cloud providers leverage internal tools and expertise to provide a uniform experience. This rigid setup allows the cloud provider to guarantee a secure environment, and act quickly to address new threats. The tenant benefits from an experienced partner assisting with key components of their infrastructure. The downside is you have less flexibility and freedom with how these components operate. Services such as Amazon RDS provide managed database clusters that behave similarly to self-managed models, but more specialized services such as Lambda functions may have unexpected limitations.

Diagram 1. Shared responsibility model (the red line represents the boundary between the responsibilities of a tenant and a provider in different service models)

The shared responsibility model can span the whole stack. Moving from Infrastructure to Platform to Function to Software as a Service (IaaS/PaaS/FaaS/SaaS) transfers progressively more responsibility onto the cloud vendor. User data and secure use of the platform always remain a tenant responsibility: the cloud provider may provision secure services, but misconfigurations can still compromise the overall security of the environment.

Security Aware Architecture

Many companies believe moving an existing infrastructure or application to a cloud ‘as-is’ is good enough, but this approach doesn’t take full advantage of the cloud. Your existing infrastructure and applications will likely require some redesign to best respond to the security challenges of the cloud.

Most cloud architectures focus on the interaction of services. Service-Oriented Architecture (SOA) is a style of software design where your application is built as separate services that work together across public and private networks to deliver your application. Sub-variants, such as microservices, break services down to the smallest possible task. Separating jobs and services this way has several advantages. For starters, you have greater control over resource usage, as allocated resources can automatically scale up and down with demand. Testing and monitoring services also improves security and reliability. In the event of an incident or failure, automated remediation procedures can take place and error notifications can be sent. The smaller the service, the smaller the impact of an incident, and the faster it can be recovered.

Lift and Shift

Diagram 2. An application deployed on one virtual server in the IaaS model.

The example above (Diagram 2) is referred to as “Lift and Shift”. This is when existing infrastructure or applications are moved into a cloud environment unmodified, often on a single compute instance. This is similar to raising an entire house by its foundation and shipping it to its new location. While Lift and Shift makes it easy for newcomers to migrate to the cloud, it rarely takes advantage of what cloud architecture actually offers.

In a datacenter, your server is like a castle, with a strong outer perimeter and only one tenant. Cloud environments are more complicated, with virtual barriers between clients instead of physical barriers. Attacks such as Server-Side Request Forgery (SSRF), in which an application is tricked into making requests on an attacker’s behalf (often against the cloud metadata service to steal credentials), have led to high-profile breaches, such as the one suffered by Capital One.

Oftentimes, infrastructure or applications that are lifted and moved into the cloud “as-is” were designed in a monolithic fashion, i.e. the user interface, user data, and application code live on a single instance. With this design, if any part of the monolith is hacked or exploited, the entire system can be compromised. OS and service ‘hardening’ (tuning default configurations for security and performance) reduces the surface area exposed to would-be attackers and can achieve relatively strong security. However, a misconfiguration in one service can still lead to the compromise of other services, if not the whole instance.

While not a complete solution, adding security layers around the monolith is a necessary step in securing monolithic systems. Diagram 2 shows an application with security modules in both the web server and the application framework. This is a good start: these modules will block some attacks once they reach your server, and you can expand on them with cloud-based network filters that validate traffic before it reaches your web server. These application-aware proxies and web application firewalls (WAFs) act as a first line of defense by filtering common web-based attacks before they reach your application.

Cloud-Native Design

Diagram 3. Cloud-native architecture that follows the microservices FaaS model.

Diagram 3 demonstrates a configuration that deploys a cloud provider’s web application firewall (WAF) and sends its logs for analysis. Event-driven automation then updates firewall rules on the fly and triggers alerts to increase visibility. Application services behind the WAF are separated by security groups that permit connectivity only where necessary. This architecture lets us take advantage of cloud-native tools, such as load balancers, to stay highly available and analyze traffic for security risks.

This approach, while more complicated, also enables a feedback loop to actively control security rules around the application’s entry point — in our case the WAF. With every request stored and analyzed, your WAF can amend the security rules, allowing the application to adapt to most attacks in real time.
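
To make this feedback loop concrete, here is a minimal sketch in Python using the AWS SDK (boto3), assuming a hypothetical WAFv2 IP set named "blocked-ips" is already attached to your web ACL; the IDs and addresses are placeholders, and other providers expose similar APIs.

```python
# Hypothetical sketch: add attacker IPs flagged by log analysis to a WAF block list.
# Assumes a WAFv2 IP set named "blocked-ips" already exists; names and IDs are placeholders.
import boto3

waf = boto3.client("wafv2")

def block_addresses(flagged_cidrs, ip_set_name="blocked-ips", ip_set_id="EXAMPLE-ID", scope="REGIONAL"):
    # Fetch the current contents of the IP set along with the lock token
    # required for optimistic concurrency control.
    current = waf.get_ip_set(Name=ip_set_name, Scope=scope, Id=ip_set_id)
    addresses = set(current["IPSet"]["Addresses"]) | set(flagged_cidrs)

    # Push the merged list back; the WAF starts blocking the new sources on its next evaluation.
    waf.update_ip_set(
        Name=ip_set_name,
        Scope=scope,
        Id=ip_set_id,
        Addresses=sorted(addresses),
        LockToken=current["LockToken"],
    )

# Example: addresses produced by your log-analysis step.
block_addresses(["198.51.100.23/32", "203.0.113.0/24"])
```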

Once inside the WAF security perimeter, each component is isolated within its own security group. This minimizes the overall impact if one service is compromised. This is part of a “zero trust” framework that will be further outlined in the Authorization and Availability sections.

Authentication and Data

In cloud environments, authenticating and authorizing access needs to be handled differently from a datacenter. This is partially because cloud native architecture has more credentials to secure, but also because the shared nature of public clouds limits your ability to isolate management networks. Even the most securely designed systems may be compromised if account passwords are too weak.

Data stored within your infrastructure must be configured and managed with extra care. Beyond the Server-Side Request Forgery (SSRF) attack mentioned earlier, cloud-native approaches to data storage can easily be misconfigured in ways that are simply not possible in a traditional on-premises infrastructure. In the cloud, the difference between public and private accessibility can be a single checkbox.

Authentication

Authentication is the process of verifying the identity of a user or service. Once the identity of a user has been verified, they will be allowed or denied access to resources, depending on the permissions set.

A common problem with configuring authentication is the use of weak passwords and single-factor authentication methods. A publicly available server is a popular target for attacks, and a password may eventually be guessed or otherwise leaked. Additional login verification steps – known as multi-factor authentication (MFA) – are strongly recommended, such as a one-time code, biometric data, or a text message to your phone. MFA greatly reduces the risk of unauthorized access, as the chances of an attacker defeating every authentication mechanism are extremely low.
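
As an illustration of a second factor, the sketch below uses the open source pyotp library to verify a time-based one-time password (TOTP), the mechanism behind most authenticator apps. The user name and secret handling are simplified placeholders.

```python
# Minimal TOTP (time-based one-time password) sketch using the pyotp library.
# In a real system the secret would be provisioned once per user and stored securely.
import pyotp

# Generate and store a per-user secret when the user enrols in MFA.
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)

# The user scans the secret (usually as a QR code) into their authenticator app,
# which then produces a new 6-digit code every 30 seconds.
print("Provisioning URI:", totp.provisioning_uri(name="alice@example.com", issuer_name="ExampleApp"))

# At login, verify the code the user submits alongside their password.
user_supplied_code = input("Enter the 6-digit code: ")
if totp.verify(user_supplied_code):
    print("Second factor accepted")
else:
    print("Second factor rejected")
```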

In addition to securing user access, you also need to look at measures for securing service connections. Public key encryption can be used to secure services like SSH and VPNs, while security tokens can be used to authenticate API-based communications. These keys are more secure than a password as they are always unique (they are designed for a single user or device), and they can be easily rotated.
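
For API-based communication, a common pattern is to send a token over an encrypted connection with every request. The snippet below is a generic example using the Python requests library; the endpoint and the environment variable holding the token are placeholders.

```python
# Generic example of token-based API authentication over HTTPS.
# The URL and the source of the token are placeholders.
import os
import requests

# Load the token from the environment rather than hard-coding it in source control.
api_token = os.environ["SERVICE_API_TOKEN"]

response = requests.get(
    "https://api.example.com/v1/reports",
    headers={"Authorization": f"Bearer {api_token}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```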

Authorization

Authorization is the function of defining the resources that an account can access. When an authenticated account attempts an action, the account privileges are checked to see if the action should be permitted or denied.

The “zero trust” framework has gained a lot of interest in recent years. With a classic “castle-and-moat” security approach, resources within a network security perimeter are considered trusted. With a “zero trust” policy, all entities inside or outside of the outer security perimeter require authorization. “Default-deny” is a more extreme version of “zero trust”, wherein all communication not expressly permitted is blocked. These policies make it possible to host sensitive resources in the cloud (untrusted infrastructure), since those resources can only communicate with explicitly permitted peers.

Attribute-based and role-based access controls (ABAC/RBAC) are another way of implementing a “zero trust” or “default deny” framework. These controls limit users’ access to only what is necessary for the job at hand. For example, a database often only needs to be accessible from the application itself and doesn’t need full network access. Many cloud providers apply “default deny” as a standard, which forces users to explicitly permit and configure connection sources.

On top of network restrictions, limiting account actions is also prudent. If one service writes to a database and a different service reads from it, each account’s access can be restricted to only what that specific service needs. This prevents errors in code from having wider consequences and limits the overall impact of compromised service credentials.
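
As a hypothetical example of limiting account actions, the sketch below uses boto3 to create an IAM-style policy that lets a reporting service read objects from a single storage bucket and nothing else; the bucket and policy names are invented for illustration.

```python
# Hedged example: a least-privilege policy for a read-only reporting service.
# Bucket name and policy name are placeholders.
import json
import boto3

iam = boto3.client("iam")

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }
    ],
}

# Everything not explicitly allowed above is implicitly denied.
iam.create_policy(
    PolicyName="reporting-service-read-only",
    PolicyDocument=json.dumps(read_only_policy),
)
```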

Confidentiality and Data Protection

Storing sensitive data, such as Personally Identifiable Information (PII) and credit card numbers, comes with legal liability. Encryption is a key component of keeping this data safe ‘in transit’ (when the data is being transmitted to clients or other services) and ‘at rest’ (when the data is stored).

Data that is ‘at rest’ outside your live environment, such as backups or assets stored in services like AWS Simple Storage Service (S3) and Office 365/SharePoint, also needs to be secured. Many cloud providers offer encrypted volume and object storage services, and automating their use makes securing sensitive data easier. Encrypted volumes are unrecoverable once deleted or if the encryption key is lost, which also lowers the risk of accidental exposure.
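
Most providers let you turn on encryption at rest with a single setting or API call. The sketch below, written with boto3 against a hypothetical bucket, enables default server-side encryption so that every object written afterwards is encrypted automatically.

```python
# Enable default server-side encryption on an S3 bucket (bucket name is a placeholder).
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="example-backups-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                # AES-256 managed by the provider; a customer-managed KMS key
                # could be specified here instead for tighter control.
                "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}
            }
        ]
    },
)
```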

Beyond encryption, abstracting PII in live databases helps separate sensitive data. “Tokenization” uses unique tokens instead of the PII itself. This facilitates splitting private data from non-private data and keeping it in a separate security enclave. The token is used to query the PII in the separate and secured service, when needed.
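
The idea behind tokenization can be sketched in a few lines of standard-library Python: the main database stores only an opaque token, while the mapping from token to PII lives in a separate, more tightly controlled store (represented here by a plain dictionary for illustration).

```python
# Toy tokenization sketch: the PII vault would be a separate, hardened service in practice.
import secrets

pii_vault = {}  # token -> PII, stored in its own security enclave

def tokenize(pii_value: str) -> str:
    # Generate an opaque, unguessable token and store the mapping in the vault.
    token = secrets.token_urlsafe(16)
    pii_vault[token] = pii_value
    return token

def detokenize(token: str) -> str:
    # Only services authorized to reach the vault can resolve a token back to PII.
    return pii_vault[token]

# The application database stores the token, never the card number itself.
card_token = tokenize("4111 1111 1111 1111")
print("Stored in main database:", card_token)
print("Resolved from the vault:", detokenize(card_token))
```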

Network and Availability

In all cloud models, the physical network is managed by the provider. This yields many of the same benefits found when offloading responsibility for physical compute or storage hardware, i.e. hardware-level failures and exploits are fully managed. However, the virtual networks used by the instances may require your attention.

Network configuration is a critical part of security aware architecture. A properly configured network has a lower risk of outages and can isolate sensitive services. An outage is not as bad as a breach, but downtime can cost you money. Your network configuration should be resilient enough to meet your uptime needs and be secure against any unexpected traffic.

Network Isolation

Cloud security starts with a secure network. Users have full control over subnets, security groups, and the overall architecture. This flexibility allows for traffic separation and flow control, but it can be confusing and lead to misconfigurations. All public networks should be treated as demilitarized zones (DMZs). A network DMZ hosts public-facing services that are not fully trusted by secure internal networks but can still make specific, authorized connections. This adds a buffer between the public internet and internal services.

Private networks and subnets can be used as security enclaves where only specific traffic is allowed. For example, a database cluster that needs to replicate data between its nodes can sit on a dedicated private subnet that uses security groups to allow only replication traffic. By placing only the necessary database servers on this network, replication traffic is hidden from other public or private networks.
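
A hedged boto3 sketch of that idea follows: the rule allows replication traffic only between members of the same security group, so nothing outside the group can reach it. The group ID is a placeholder, and port 5432 is only an assumption about the database in use.

```python
# Allow database replication traffic only between members of one security group.
# The group ID and port are placeholders for your own environment.
import boto3

ec2 = boto3.client("ec2")
db_security_group = "sg-0123456789abcdef0"

ec2.authorize_security_group_ingress(
    GroupId=db_security_group,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            # Referencing the group itself means only instances in this group
            # (the database nodes) can talk to each other on this port.
            "UserIdGroupPairs": [{"GroupId": db_security_group}],
        }
    ],
)
```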

If you need to communicate with a secure resource from a public network, encrypted communication standards like SSH or a VPN can secure the communication. These tools provide secure tunneling on top of less secure communication protocols, such as Remote Desktop Protocol (RDP) and Virtual Network Computing (VNC). By wrapping insecure communication in a secure tunnel, you prevent third parties from snooping on your traffic and intercepting sensitive data or authentication credentials.
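
As one possible illustration, the sketch below uses the third-party sshtunnel library to wrap a VNC session inside SSH; the hostnames, key path, and ports are placeholders.

```python
# Tunnel an insecure VNC connection through SSH; hosts, ports, and key path are placeholders.
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("bastion.example.com", 22),               # public-facing SSH jump host
    ssh_username="deploy",
    ssh_pkey="~/.ssh/id_ed25519",
    remote_bind_address=("10.0.2.15", 5900),   # private VNC server
    local_bind_address=("127.0.0.1", 5900),
) as tunnel:
    # While the tunnel is open, point your VNC client at 127.0.0.1:5900;
    # all traffic travels inside the encrypted SSH session.
    print("Tunnel established on local port", tunnel.local_bind_port)
    input("Press Enter to close the tunnel...")
```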

Integrity of Communication

When traffic needs to pass over a public network, encryption provides a basic layer of security. Valid SSL/TLS certificates not only enable encrypted communication, they also validate identity. When a client initiates a secure connection, the server replies with its certificate, which contains a public key. The client uses the certificate to validate the identity of the server it is talking to, and the public key to establish encryption that only the original server can decrypt.

While the encryption process is straightforward, identity verification is not. To obtain a valid certificate, a service operator must prove ownership of the domain. A third-party Certificate Authority (CA) performs this verification and issues a signed certificate, which the operator installs alongside the private key they generated when requesting it.

When you connect to a service secured by a certificate, the certificate provided by that service is validated in several ways. Its signature is checked against the list of Certificate Authorities your system trusts, to make sure a trusted CA did, in fact, issue it. Its validity period is checked to make sure it hasn’t expired. Finally, the certificate subject (domain) is checked to make sure it matches the domain you were expecting. If any of these checks fails, the certificate is flagged as potentially insecure.
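
Python’s standard library performs these checks automatically when you use a default SSL context. The short snippet below connects to a host (example.com as a stand-in) and prints details of the validated certificate.

```python
# Inspect the certificate presented by a server; validation (trusted CA, hostname,
# validity period) is performed automatically by the default SSL context.
import socket
import ssl

hostname = "example.com"  # stand-in host
context = ssl.create_default_context()

with socket.create_connection((hostname, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()
        print("Subject:", cert["subject"])
        print("Issuer:", cert["issuer"])
        print("Expires:", cert["notAfter"])
```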

Services like Let’s Encrypt (a free, automated, and open certificate authority) make it easier to obtain and maintain valid certificates for websites and services. An alternative for internal use is self-signed certificates or running your own private CA. Self-signed certificates, or certificates issued by a CA that is not in your browser or operating system’s trust store, will show as invalid. These invalid certificates can still encrypt traffic, but they do not reliably verify the identity of the server, which is why a publicly signed, valid certificate is necessary for production.

Availability

Availability describes your application’s ability to function as expected. Denial of Service (DoS) is a popular cyberattack designed to interrupt your application’s availability and prevent authorized parties from accessing the service. Even if your service isn’t fully consumed, this type of attack can make responses slow, or it can act as a smoke screen hiding other malicious activity.

The Internet of Things (IoT) and the increased use of edge computing have made the distributed version of this attack (DDoS) more common and dangerous. Compromised systems are added to large botnets (loose collections of compromised systems under an attacker’s control) and coordinated into sending large volumes of junk traffic to a target site. Depending on the size of the attack, this can cause major disruptions to service. These attacks are also harder to block, as connections from diverse geographic regions can overwhelm your infrastructure’s bandwidth. One of the largest DDoS attacks recorded reached 1.2 Tbps (terabits per second) and brought down much of the internet across the U.S. and Europe. If a service is required 24/7/365, make sure the architecture is resilient to attack and failure.

Many clouds provide solutions to mitigate this kind of attack. The most common is a web application firewall (WAF) combined with a load balancer and a content distribution network (CDN). This configuration fronts your application with robust, secure proxy servers (e.g. CloudFlare, CloudFront) that are capable of absorbing and filtering malicious traffic. Your server is likely limited to 1 to 10 Gbps (gigabits per second), but the WAF can filter much of the attack traffic before it reaches your system, allowing mostly normal operations to continue during an attack. A WAF is also required if you must comply with the Payment Card Industry Data Security Standard (PCI DSS).

When planning availability, note the differences between ‘high availability’ and ‘fault tolerant’ infrastructure. A high availability environment is designed to continue operations if a service goes offline but can experience performance or service degradation while recovering from events. A fault tolerant environment is designed to provide full functionality through multiple failures, with the target of zero interruption to the user’s experience. This often requires a completely separate environment fully capable of handling 100% of expected traffic should the primary environment go offline. While expensive, it can be more affordable than downtime for some applications.

Visibility and Software Security

Visibility and software maintenance play a large role in keeping your application healthy. Modern projects often include third-party software in the form of libraries, frameworks, and plugins. These components require updates much the same as the rest of your application, with the benefit of an external team testing and developing those updates. Depending on your location in the shared responsibility model, you may be required to install updates to your stack.

Furthermore, proper monitoring and logging can keep your production environment healthy and your team informed through relevant alerts that show changes and errors — hopefully before they impact availability.

Maintenance and Configuration

Keeping your application’s components up-to-date is a challenge, both in and out of the cloud. Depending on your level of responsibility (see Shared Responsibility Model), you may need to monitor updates for the OS, shared libraries, application frameworks, and of course, your application code itself. Out-of-date software is a major security risk and one of the leading avenues for remote code execution exploits. Once a vulnerability in a popular framework or library is public, bots begin to scan web properties and IP ranges immediately — looking for targets to exploit. This is why keeping software up to date is so important, especially in web applications.

Beyond updates, another important process is “hardening”. Default OS and service settings are not always suitable for a production environment. Hardening updates the default configuration for services and underlying components (such as the OS and network stack), with a focus on security. The hardening process should be part of your development workflows, so it is never overlooked.

A common way to ensure your settings are appropriate and secure is to perform a penetration test. Many open source tools exist (such as Kali or OpenVAS) to allow you to run these tests internally, but hiring an outside professional or agency is highly recommended. Many vulnerability scans that are available with these tools are automated but interpreting and acting on the results is not. With a little effort, you can gain meaningful insights from the reports, but it is hard to replace the expertise of a trained professional.

Visibility and Central Logging

Centralized logging is fundamental to securing modern infrastructure and understanding issues. With a central logging platform, it is easier to track events across systems and services. It is also important because logs on a compromised system can be modified. For this to work well, error codes, metrics, and log formats should be reasonably consistent and meaningful. Most clouds offer centralized logging services, but third-party options may have additional features that are useful to your team.
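
One lightweight way to keep formats consistent and ship logs off the host is to emit structured JSON to a central syslog collector. The sketch below uses only the Python standard library; the collector address and service name are placeholders.

```python
# Ship structured JSON log lines to a central syslog collector (address is a placeholder).
import json
import logging
import logging.handlers

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # A consistent, machine-parseable format makes cross-service correlation easier.
        return json.dumps({
            "service": "payments-api",
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.handlers.SysLogHandler(address=("logs.example.internal", 514))
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.warning("Repeated authentication failures for user id=42")
```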

Beyond logging, active environment monitoring provides insights into the health and behaviour of your systems. This is particularly useful for alerting and resolving issues. Monitoring allows event-driven, high availability solutions where application health can trigger automated remediation, such as rebuilding a service or removing a node from an active cluster. You should also configure alerts to notify you of abnormal resource usage or authentication behaviour. A good rule of thumb is anything that impacts availability or may need manual intervention should be an alert, even if automated remedy procedures are in place.
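
As a hedged example of such an alert, the sketch below uses boto3 to create a CloudWatch alarm that notifies an operations topic when CPU usage on an instance stays unusually high; the instance ID, topic ARN, and threshold are placeholders.

```python
# Alarm on sustained high CPU usage; instance ID, SNS topic, and threshold are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="app-server-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                 # evaluate in 5-minute windows
    EvaluationPeriods=3,        # three consecutive breaches before alerting
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```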

Conclusion

All security models involve layers of connected and separated components. Diagram 4 shows the security layers you will typically find in a cloud environment.

Diagram 4. Cloud security layers

You can use this diagram to explore your responsibility and exposure points. As part of contingency planning, list the components of your application that fall within each layer. Then, depending on your responsibility, plan how to prevent a breach or recover from an outage.

Final Thoughts

Security tools and best practices are always changing, but fundamentals still hold true. Encryption, isolation, authentication, and monitoring are the cornerstones of secure architecture in the cloud or in a data centre. Modern security processes are often just extensions of these fundamentals, with automation supercharging their utility. We are able to automate tasks that previously required manual intervention (such as data encryption), and leverage cloud providers that manage large portions of the technology stack for us. Shared responsibility in the cloud makes infrastructure more approachable, and more secure.

Architecting a secure infrastructure is everyone’s responsibility. For it to work effectively, your strategy needs to be supported by policies, procedures, and awareness. Cloud providers offer solutions that share this burden, but you must be aware of your role and exposure points. It doesn’t matter if you build a small one-instance app or a large microservices-based architecture — security planning should be introduced early and reviewed regularly. It is hard to decouple tightly connected systems once they are live. Starting with strong security and service isolation is fundamental, as you may not be able to go back. Early investment in security can avoid downtime, protect your organization’s reputation, and avoid the costs associated with re-architecting to secure an existing application. It is better to build a strong foundation before the house is built on top.

Monitoring can facilitate event-driven architecture, streamlining maintenance and enabling dynamic security rules that catch many issues before they become a problem. Tracking and auditing service permissions can alert you to misconfigurations and, if caught early, possibly prevent a breach. Additionally, testing and monitoring the health of your services allows automation to resolve issues before they impact availability. Alerts sent to your team following an incident allow you to investigate its cause and, with this visibility, take additional action if necessary to prevent future errors.

The risks that cyberattacks pose to your organization are always increasing, but so too are the tools available to combat them. In the war for your data, cloud providers and the tools they produce are on the front line of defense. Consider the best practices discussed here and remember, without careful planning and implementation, even the most modern security stack may let you down.

Additional Resources

We also highly recommend checking out the following resources: