Shahzad Bhatti Welcome to my ramblings and rants!

August 19, 2020

Review of “Building Secure and Reliable Systems”

Filed under: Computing, Technology — admin @ 5:33 pm

The “Building Secure and Reliable Systems” book shares best practices from Google’s security and SRE engineers. Here is a chapter-by-chapter summary of those practices:

The first chapter discusses the tradeoff between security and reliability, e.g. reliability protects against non-malicious failures but may expand the attack surface via redundancy, whereas security risk comes from adversarial attacks. Both reliability and security need confidentiality, integrity, and availability, but from different perspectives. Complex systems are difficult to reason about, so you must apply “defense in depth”, the “principle of least privilege”, and “distinct failure domains” to limit the blast radius of a failure. For example, Google uses geographic regions to limit the scope of credentials.

The second chapter focuses on security adversaries and attack motives; attackers may be hobbyists, hacktivists, researchers, criminals, cyber-warfare actors, insiders, or come from other backgrounds. You can apply CAPTCHAs, automation/AI, zero trust, multi-party authorization, auditing/detection, and recoverability to protect against these attacks.

The third chapter opens the second section of the book, which focuses on designing secure and reliable systems. It introduces safe proxies in production environments that enforce authentication, multi-party authorization (MPA), auditing, rate limiting, zero touch, access control, etc. For example, Google uses a CLI proxy to execute commands; the proxy is governed by a security policy, requires MPA for sensitive commands, and provides audit logs.
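A minimal sketch of such a safe command proxy, assuming hypothetical policy and approval structures (not the book's actual implementation): every command is checked against a policy, risky commands require a second approver, and every decision is audited.

import logging

AUDIT = logging.getLogger("audit")

POLICY = {
    "restart-service": {"requires_approval": False},
    "delete-database": {"requires_approval": True},   # risky: needs a second approver
}

def run_via_proxy(user, command, approvals):
    rule = POLICY.get(command)
    if rule is None:
        AUDIT.warning("denied %s for %s: not in policy", command, user)
        raise PermissionError("command not allowed by policy")
    if rule["requires_approval"] and not (set(approvals) - {user}):
        AUDIT.warning("denied %s for %s: needs another approver", command, user)
        raise PermissionError("multi-party authorization required")
    AUDIT.info("executing %s for %s (approvers=%s)", command, user, approvals)
    # ... hand the command off to the real execution backend here ...
    return "ok"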

The fourth chapter examines security tradeoffs and reviews product features that may include functional and non-functional requirements (e.g. security, reliability, SLOs, and development velocity). Reliability and security are also considered emergent properties of system design and encompass entire products and services. The chapter also gives an example of a design document template that includes sections for scalability, redundancy/reliability, dependencies, data integrity, SLAs, and security/privacy.

The fifth chapter discusses designing for least privilege, which builds on authentication and authorization. It also examines zero-trust networks, where the network itself grants no implicit access, and zero-touch interfaces, where all access is automated. It recommends writing small functions so that access control can be clearly defined, breaking-glass mechanisms to bypass certain authorization systems in an emergency, auditing, testing for least privilege, multi-party authorization (MPA), three-factor authorization (3FA, where access must be approved from two different platforms), business justifications, temporary access, proxies, etc. The chapter also discusses tradeoffs between complex security controls and other factors such as company culture, data quality, user productivity, and development complexity.

The sixth chapter focuses on designing for understandability to reduce the likelihood of security vulnerabilities and increase confidence in the system's security. It defines a system invariant, a property that is always true and can be used to assert security and reliability properties. It suggests using mental models to understand complex security systems and explains identities, authentication, and access control concepts. When breaking a system into smaller components, the chapter recommends using a trusted computing base (TCB) to create a security boundary that enforces security policies. To provide access from one TCB to another, you may issue an end-user context (EUC) ticket that grants access temporarily. To standardize security policies, you may use a common framework for request dispatching, input sanitization, authentication, authorization, auditing, logging, monitoring, quota, load balancing, configuration, testing, dashboards, alerting, etc.

The seventh chapter focuses on designing for change and extensibility, e.g. keeping dependencies up to date, automating testing, releasing frequently, and using containers and microservices.

The eighth chapter focuses on resilience, the system's ability to hold out against a major malfunction or disruption. It encourages designing the system with independent layers, modularization, redundancy, automation, defense in depth, controlled degradation (partial failure), load shedding, throttling, and automated response. You will need to consider tradeoffs between reliability and security, e.g. failing safe vs. failing secure, where reliability/safety may require an ACL that defaults to allow-all but security may require an ACL that defaults to deny-all. You can segment your network and compartmentalize your system to reduce the blast radius. With a microservice architecture, you can assign distinct roles to each service and add geographic location or time as a scope of access. The chapter then defines a failure domain, a type of blast radius control that creates isolation by partitioning a system into multiple equivalent but completely independent copies, each with its own data. Any individual partition can take over for the entire system during an outage, which helps protect the system from global impact. You can validate the system continuously for failures using fuzzing and other types of testing.
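As a rough illustration of load shedding and throttling (a hypothetical sketch, not taken from the book), a server can reject low-priority work as it approaches capacity so that critical traffic continues to succeed:

MAX_IN_FLIGHT = 100     # server capacity (concurrent requests)
SHED_THRESHOLD = 0.8    # start shedding low-priority work at 80% utilization

in_flight = 0           # updated by the request handling loop (not shown)

def admit(request_priority):
    """Return True if the request should be processed, False if it is shed."""
    utilization = in_flight / MAX_IN_FLIGHT
    if utilization >= 1.0:
        return False    # hard limit reached: throttle everything
    if utilization >= SHED_THRESHOLD and request_priority == "low":
        return False    # controlled degradation: drop only low-priority work
    return True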

The ninth chapter discusses recoverability from random, accidental, and software failures and errors. It recommends designing your emergency push system to simply be your regular push system turned up to maximum, so recovery from a failure uses the same well-tested path. To prevent rollbacks to bad older versions, you can collect undesirable versions into a deny list or use an allow list of permitted versions, which the release system consults during verification. You can also maintain security version numbers (SVNs) and minimum acceptable security version numbers (MASVNs) and rotate signing keys, e.g.

# Merge the release's deny list and MASVN into the persisted component state.
ComponentState[DenyList] = ComponentState[DenyList].union(self[DenyList])
ComponentState[MASVN] = max(self[MASVN], ComponentState[MASVN])

def IsUpdateAllowed(self, Release, ComponentState, KeyDatabase):
  # Reject releases that are denied, below the minimum SVN, or not properly signed.
  assert Release[Version] not in ComponentState[DenyList]
  assert Release[SVN] >= ComponentState[MASVN]
  assert VerifySignature(Release, KeyDatabase)
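For illustration, a rough usage sketch with hypothetical field names, treating the same checks as a standalone helper:

component_state = {"DenyList": {"2.4.1"}, "MASVN": 5}
release_good = {"Version": "2.5.0", "SVN": 6}
release_rolled_back = {"Version": "2.4.1", "SVN": 4}

def is_update_allowed(release, state):
    # a real updater would also verify the release signature, as in the snippet above
    return (release["Version"] not in state["DenyList"]
            and release["SVN"] >= state["MASVN"])

print(is_update_allowed(release_good, component_state))         # True
print(is_update_allowed(release_rolled_back, component_state))  # False: denied and below MASVN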

The tenth chapter explains how to mitigate denial-of-service (DoS) attacks, where an attacker may compromise vulnerable machines or launch amplification attacks. The chapter suggests using edge routers to throttle high-bandwidth attacks and eliminate attack traffic as early as possible. For example, you can use network and application load balancers to continually monitor incoming traffic. Other mitigation techniques include caching proxies, minimizing network requests (e.g. using spriting), minimizing egress bandwidth, CAPTCHAs, rate limiting, monitoring/alerting (MTTD, mean time to detect; MTTR, mean time to repair), graceful degradation, exponential backoff, jitter, etc.
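For example, clients retrying failed requests can use capped exponential backoff with full jitter so that a recovering service is not hit by synchronized retries (a minimal sketch, not from the book):

import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry operation() with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # give up after the last attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))        # full jitter avoids synchronized retries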

The eleventh chapter opens the third section and focuses on maintaining a trusted certificate authority (CA). For example, you can use secure, memory-safe languages to parse certificates and certificate signing requests (CSRs). You may need to use third-party libraries, but you can add testing for validation.

The twelfth chapter focuses on writing code, e.g. using frameworks that enforce security and reliability. You can use RPC frameworks that provide logging, authentication, authorization, and rate limiting. The chapter covers OWASP top vulnerabilities such as SQL injection, which can be prevented with parameterized SQL queries, and XSS, which can be prevented by sanitizing user input (e.g. SafeHtml types) and rolling out changes incrementally. Other coding techniques include simplicity, minimizing multi-level nesting and cyclomatic complexity, eliminating YAGNI smells, paying down tech debt, and refactoring. The chapter also suggests using memory-safe and strongly/statically typed languages.
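For instance, with Python's built-in sqlite3 module, a parameterized query keeps untrusted input out of the SQL text (a minimal sketch):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

user_supplied = "alice'; DROP TABLE users; --"

# Unsafe: building the query by string concatenation lets input change the SQL structure.
# Safe: the placeholder binds the value strictly as data, never as SQL.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_supplied,)).fetchall()
print(rows)  # [] -- the malicious string simply matches no user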

The thirteenth chapter examines testing code using unit and integration tests. It also introduces other testing techniques such as fuzz testing, chaos engineering, static program analysis, code inspection tools (Error Prone for Java and Clang-Tidy for C/C++), and formal methods.
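As a toy illustration of fuzzing (a hand-rolled sketch; real projects would use a coverage-guided fuzzer), you can feed random bytes to a parser and treat any uncaught exception as a finding:

import random

def parse_header(data):
    """Toy parser under test: expects 'key=value' pairs separated by ';'."""
    result = {}
    for pair in data.decode("utf-8", errors="replace").split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            result[key.strip()] = value.strip()
    return result

def fuzz(iterations=10000, max_length=64):
    rng = random.Random(0)   # fixed seed so any crash is reproducible
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(max_length)))
        parse_header(blob)   # any uncaught exception is a finding

fuzz()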

The fourteenth chapter describes the deployment phase of software development, which may include pushing code, downloading a new binary, updating configuration, migrating a database, etc. The chapter reviews a threat model for preventing bad deployments such as accidental changes, malicious changes, bad configuration, stolen integrity keys, deployment of older versions, backdoors, etc. It suggests best practices such as code reviews, automation, verifying artifacts, validating configuration, and binary provenance. Binary provenance records the inputs to an artifact and verifies the transformation and the entity that performed the build. Provenance fields include authenticity (signature), outputs, inputs (source and dependencies), command, environment, input metadata, debug info, and versioning. A build is considered verifiable if the binary provenance it produces is trustworthy. Verifiable build architectures include trusted build services, hermetic builds, and reproducible builds; however, you may need a break-glass mechanism that bypasses the policy in case of an outage. You can add post-deployment verification to validate the deployment.
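A rough sketch of checking binary provenance at deploy time (hypothetical field names; real systems sign provenance with a build-service signing key rather than a shared secret):

import hashlib
import hmac
import json

BUILD_SERVICE_KEY = b"hypothetical-shared-secret"   # placeholder; real systems use asymmetric keys

def verify_provenance(provenance, artifact):
    """Check that the provenance is authentic and matches the artifact's hash."""
    claimed = dict(provenance)
    signature = bytes.fromhex(claimed.pop("signature"))
    body = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(BUILD_SERVICE_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return False                                 # tampered provenance or wrong key
    return claimed["output_sha256"] == hashlib.sha256(artifact).hexdigest()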

The fifteenth chapter shows how to investigate systems using debug flags, checking for data corruption, reviewing logs, and designing for safety.

The sixteenth chapter opens section four, which focuses on disaster planning. The chapter introduces best practices for short- and long-term recovery, such as analyzing potential disasters, establishing a response team, creating response plans/playbooks, configuring systems, testing procedures and systems, and incorporating feedback from tests and evaluations. It shows how to set up an incident response (IR) team that may include an incident commander, SREs, customer support, legal, forensics, and security/privacy engineers. IR teams can use a severity model to categorize incidents based on the impact on the organization and a priority model to define response times. The response plan includes incident reporting, triage, SLOs, roles/responsibilities, and communications. You also need to test systems and response plans and audit automated systems. Red team testing can help simulate how the system reacts to an attack.
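A hypothetical sketch of such severity and priority models, mapping incident severity to a required response time (every organization defines its own levels):

# Hypothetical severity and priority models; the labels and thresholds are illustrative only.
SEVERITY = {
    "sev1": "customer data exposed or a core service fully down",
    "sev2": "partial outage or suspected compromise without confirmed data loss",
    "sev3": "minor issue with a known workaround",
}

RESPONSE_TIME = {
    "sev1": "page the on-call immediately; respond within 15 minutes",
    "sev2": "respond within 4 business hours",
    "sev3": "handle in the next on-call rotation",
}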

The seventeenth chapter reviews crisis management, starting with determining whether a security incident is a crisis. This is evaluated during triage, which determines the severity of the incident and whether it is the result of a system bug or a compromise that is yet to be fully discovered. In the context of crisis management, operational security (OpSec) refers to the practice of keeping your response activity secret. Common OpSec mistakes include documenting the incident in email, logging into compromised systems, locking accounts or changing passwords prematurely, and taking systems offline, all of which can tip off the attacker. The chapter instead suggests meeting in person, using key-based access that avoids interactive logins, etc. You can apply forensics processes to investigate the security compromise. The chapter ends with a summary of best practices that include triage, declaring an incident, communicating with executives and SecOps, creating IR and forensics teams, preparing communications and remediation, and closure.

The eighteenth chapter reviews recovery and the aftermath of a security incident. You can establish the recovery timeline based on whether the incident affected mission-critical systems.

The goal of your recovery effort is to mitigate the attack and return your systems to their normal state; however, complex security events may require parallelizing incident management and response execution. To return your systems to normal, you need a complete list of the systems, networks, and data affected by the attack. You also need sufficient information about the attacker’s tactics, techniques, and procedures (TTPs) to identify any related resources that may be impacted. There are several considerations before recovery, such as:

  • how will your attacker respond to your recovery effort?
  • is your recovery infrastructure or tooling compromised?
  • what variants of the attack exist?
  • will your recovery reintroduce attack vectors?
  • what are your mitigation options?

The recovery checklist includes:

  • isolating assets (quarantine)
  • system rebuilds and software upgrades
  • data sanitization
  • recovery data
  • credential rotation
  • postmortems

The nineteenth chapter opens section five of the book, which covers making security part of the organizational culture. It suggests making security a team responsibility, providing security to users, designing for defense in depth, and being transparent with the community.

The twentieth chapter describes roles and responsibilities for security and reliability. For example, security experts implement security-specific technologies, SREs develop centralized infrastructure, and security specialists devise best practices. You can embed security experts with development teams or have them review and audit security practices. Organizations can create a red team that focuses on offensive exercises to simulate attacks and a blue team for assessing and hardening software and infrastructure.

The twenty-first chapter shows how to build a culture of security and reliability. It suggests an organizational culture of security and reliability by default and encourages employees to discuss these topics early in the project life cycle. The chapter also suggests a culture of review, where peer reviews ensure that code implements least privilege and other security considerations. The culture should include awareness of security aspects, sustainability, transparency, and communication.

August 3, 2020

Summary of Data Consistency in Relational and NoSQL Databases

Filed under: Uncategorized — admin @ 8:34 pm

Relational databases generally guarantee transactions in terms of ACID properties (a minimal atomicity sketch follows the list):

  • A – Atomicity – a transaction either fully succeeds or fails, with no partial effects.
  • C – Consistency – a transaction takes the database from one valid state to another valid state.
  • I – Isolation – a transaction is not affected by other concurrently running transactions.
  • D – Durability – changes from a committed transaction are stored persistently.
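For example, with Python's sqlite3 module a transfer either commits both updates or rolls both back:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
    conn.commit()        # both updates become durable together
except sqlite3.Error:
    conn.rollback()      # on any failure, neither update is applied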

Following is a list of read anomalies that transaction isolation levels protect against:

  • Dirty Read – a transaction can read data that has not yet been committed by another transaction.
  • Non-Repeatable Read – a transaction sees different data when reading the same row again, because another transaction modified it in between.
  • Phantom Read – a transaction sees a different set of rows when running the same query again, because another transaction added or removed matching rows.

The SQL standard defines the following isolation levels (see the sketch after this list for requesting one):

  • Read Uncommitted – a transaction may see uncommitted changes by other transactions, thus allowing dirty reads.
  • Read Committed – a transaction only sees committed changes, thus preventing dirty reads.
  • Repeatable Read – additionally prevents non-repeatable reads, though phantom reads are still possible.
  • Serializable – the highest isolation level, where concurrently executing transactions appear to execute serially.
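For example, with PostgreSQL and the psycopg2 driver you can request a specific isolation level per session (a sketch assuming a reachable database; the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")   # placeholder connection details
conn.set_session(isolation_level="SERIALIZABLE")       # or "REPEATABLE READ", "READ COMMITTED"

with conn, conn.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE name = %s", ("alice",))
    print(cur.fetchone())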

NoSQL-based distributed systems define the following consistency levels:

  • Strict consistency – the strongest consistency level, where a read returns the most recent update.
  • Sequential consistency – a weaker model, as defined by Lamport (1979), where all processes observe operations in the same order.
  • Linearizability (atomic consistency) – guarantees sequential consistency with an additional real-time constraint.
  • Causal consistency – a weaker model than linearizability that only guarantees that causally related writes are seen in the same order.

Most NoSQL databases lack ACID transaction guarantees and instead offer tradeoffs in terms of the CAP theorem and PACELC, where the CAP theorem states that a distributed database can only guarantee two of three properties:

  • Consistency – Every node in the cluster responds with the most recent data that may require blocking the request until all replicas are updated.
  • Availability – Every node returns an immediate response even if the response isn’t the most recent data.
  • Partition Tolerance – The system continues to operate even if a node loses connectivity with other nodes.

Consistency in CAP is different from consistency in ACID: in ACID it means a transaction won’t corrupt the database and guarantees database correctness under a transaction order, whereas in CAP it means maintaining the linearizability property, which guarantees reads return the most up-to-date data. Serializability is the highest form of isolation between transactions in the ACID model, covering multiple operations over multiple objects with an arbitrary total order, whereas linearizability is a single-operation, single-object, real-time ordering guarantee that applies to distributed systems.

In the event of a network failure (see MTBF, MTTF, MTTR), you must choose partition tolerance, so the real choice is between AP and CP (availability vs. consistency). The PACELC theorem extends CAP: in the presence of network partitioning (P) you choose between availability (A) and consistency (C), but otherwise (E) you choose between latency (L) and consistency (C). Most NoSQL databases choose availability and support Basically Available, Soft state, Eventual consistency (BASE) instead of strict serializability or linearizability. Eventual consistency only guarantees liveness, i.e. updates will be observed eventually. Some modern NoSQL databases also support strong eventual consistency using conflict-free replicated data types (CRDTs).
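For example, a grow-only counter (G-counter) is one of the simplest CRDTs: each replica increments its own slot, and merging takes the element-wise maximum, so all replicas converge regardless of message order (a minimal sketch):

class GCounter:
    """Grow-only counter CRDT: per-replica counts merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

# Two replicas accept writes independently and converge after exchanging state.
a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5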
