Shahzad Bhatti Welcome to my ramblings and rants!

October 6, 2019

Production Deployment Best Practices

Filed under: Computing,Project Management,Technology — admin @ 1:37 pm

Following are a few best practices I have learned over the years for deploying to the production environment, especially when working with multiple versions of the software across multiple data centers:

Coding/Debugging Best Practices

Naming Threads

When using the Java language, name background threads so that logs show meaningful thread names, and mark them as daemon threads so that they are automatically shut down when the Java process terminates.
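
A minimal Java sketch of both ideas (the thread name and task are illustrative):

public class BackgroundWorker {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println("polling from " + Thread.currentThread().getName());
        // Name the thread so logs and thread dumps are meaningful, and mark it as a
        // daemon so the JVM can shut down without waiting for it.
        Thread poller = new Thread(task, "order-queue-poller");
        poller.setDaemon(true);
        poller.start();
        poller.join();
    }
}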

Resource Cleanup

In order to avoid resource leaks, always close resources that require explicit cleanup, such as I/O streams and database connections.
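
In Java, try-with-resources closes such resources automatically; a minimal sketch (the file name is illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResourceCleanup {
    public static void main(String[] args) throws IOException {
        // The reader is closed automatically, even if an exception is thrown.
        try (BufferedReader reader = Files.newBufferedReader(Path.of("orders.csv"))) {
            reader.lines().forEach(System.out::println);
        }
    }
}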

Backward / Forward Compatibility

When working with multiple versions of the software, it's critical that all changes to the domain model are both backward and forward compatible, so that the new schema can be read by the old service and the old schema can be read by the new service. If possible, always deploy the changes that can read the new schema before deploying the changes that write the new schema.

Character Encoding

Always use UTF-8 for character encoding in your code and databases.

Time Zones

Always use UTC timezone for application code and databases.

Internationalization/Localization/Accessibility

Apply i18n, l10n and accessibility standards for your user interfaces and services.

Data Validation

NullPointerExceptions and malformed data are common causes of production issues, so always validate your model, input parameters, and results for proper formats and ranges.
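
A minimal Java sketch of this kind of defensive validation (the field names, format, and ranges are illustrative):

import java.util.Objects;

public class OrderValidator {
    public static void validate(String orderId, int quantity) {
        // Fail fast on nulls, bad formats, and out-of-range values.
        Objects.requireNonNull(orderId, "orderId must not be null");
        if (!orderId.matches("[A-Z0-9-]+")) {
            throw new IllegalArgumentException("orderId has an invalid format: " + orderId);
        }
        if (quantity < 1 || quantity > 1000) {
            throw new IllegalArgumentException("quantity out of range: " + quantity);
        }
    }
}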

Failure Cases

Think about all failure and edge cases that your code may run into.

Operational Best Practices

Use multiple stages for testing

Test your changes in multiple stages such as QA, UAT, alpha, beta, and gamma, where you can properly bake and test them. The size and variation of the data should increase with each stage so that you test as close to the production data and environment as possible. The testing should include load and performance testing to detect any impact on latency and availability. Further, you can use canary testing to release changes to a subset of users or data centers so that you can compare the impact of changes before rolling them out to all customers. If you deploy to multiple regions, start with a smaller or low-risk region for the initial production release.

Check Calendar before the release

Check the calendar for any holidays or major infrastructure changes that might impact the release.

Automate Automate Automate

Avoid any manual changes to the build/deployment process.

Infrastructure as code

Apply best practices and tools for infrastructure as code so that all changes are applied consistently.

Error Logging

Use an appropriate log level for production logs and log all exceptions, including stack traces. You can log additional input parameters, but ensure they don't include any sensitive data.
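
For example, assuming an SLF4J-style logger (the class and parameter names are illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void process(String orderId) {
        try {
            // ... business logic ...
        } catch (RuntimeException ex) {
            // Log the full stack trace plus non-sensitive context; never log credentials or PII.
            log.error("Failed to process orderId={}", orderId, ex);
            throw ex;
        }
    }
}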

Deploy domain objects and data access code together

In order to avoid schema mismatch errors, always deploy domain object and data access changes together. In addition, you can configure the model to ignore unknown properties to make these changes forward compatible. If domain and data access changes cannot be released together, release the domain changes first and then the data access changes.
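
For example, assuming the domain model is serialized with Jackson, an older reader can be made forward compatible by ignoring fields added by a newer writer:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

// Ignoring unknown properties lets the old service read payloads written by a newer
// service that has added fields to the domain object.
@JsonIgnoreProperties(ignoreUnknown = true)
public class Order {
    public String id;
    public int quantity;
}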

Plan for Rollback

Always test rolling back your changes, and include multiple scenarios if changes were released in a staggered fashion where domain changes and data access changes were published separately. You can implement automated rollbacks when releasing changes incrementally so that a bad change is immediately blocked before it reaches the next stage.

SLA/SLO

Monitor SLAs/SLOs at each stage of the release so that you can fix any violations of these metrics.

Change Management

Apply best practices from change management to track any high-risk or major changes to the production environment.

Security Policies

Apply zero-trust security practices and the principle of least privilege in all test and production environments.

Health Checks / Observability

Collect metrics for health checks, failure rates, usage, etc., so that you are immediately notified of any suspicious or anomalous activity.

Chaos Testing

Apply chaos testing and game days to simulate various failure conditions to verify the behavior of your services and your reaction to those failures.

Integrate/Deploy often with small changes

Continuously integrate and deploy small changes to reduce the risk that comes with big-bang releases, so that you can test your changes in the production environment safely.

Feature Flags and Dark Launch

Use feature flags judiciously to launch features that are not yet available to all users so that you can test those changes for a subset of users.
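
A minimal sketch of guarding a dark-launched code path behind a flag; the flag name and the FeatureFlags client are hypothetical stand-ins for whatever flag service you use:

interface FeatureFlags {
    boolean isEnabled(String flagName, String userId);
}

public class CheckoutService {
    private final FeatureFlags flags;

    public CheckoutService(FeatureFlags flags) {
        this.flags = flags;
    }

    public void checkout(String userId) {
        if (flags.isEnabled("new-pricing-engine", userId)) {
            // New code path, visible only to the enabled subset of users.
        } else {
            // Existing code path.
        }
    }
}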

July 12, 2019

Review of “Designing Data Intensive Applications”

Filed under: Computing,Technology — admin @ 4:38 pm

Designing Data-Intensive Applications is one of the best resources for building scalable systems. The book is full of comprehensive material on data processing, and the following is a summary of essential lessons from the book:

The first four chapters cover the foundations of data systems. The first chapter starts with basic terminology: reliability, scalability, maintainability, operability, simplicity, and evolvability. It recommends measuring latencies using percentiles and defining SLOs/SLAs.

The second chapter reviews data models such as the relational model and document/graph models. It reviews the differences among query languages such as SQL, graph query languages, and SPARQL, and compares them in terms of schema evolution and performance.

The third chapter introduces basic concepts of storage for relational and NoSQL databases. It shows how you can use hash indexes for looking up key-value data, which can be enhanced with compaction. It further reviews SSTable and LSM-tree structures. The Sorted String Table (SSTable) keeps key-value pairs in sorted order, where each key occurs only once within each merged segment file. These segments are later merged using merge sort and written to disk. The Log-Structured Merge-Tree (LSM tree) is used to merge and compact the sorted files. In order to reduce lookup time, a Bloom filter can be used to quickly rule out keys that don't exist in the database. Relational databases use the B-tree data structure, which breaks the database into fixed-size pages, where each page is identified by an address stored in a tree structure. B-trees use a write-ahead log (WAL) to persist new data before updating the B-tree structure. Another form of index is a multi-column index that combines several fields into one key. The chapter then differentiates between OLTP and OLAP and describes star (a fact table surrounded by its dimension tables) and snowflake (dimensions are broken down into sub-dimensions) schemas for analytics. Column-oriented storage can also be used for data warehouses, which leads to better compression, CPU cache usage, and vectorized processing. Some data warehouses use materialized aggregates or a data cube that provides aggregates grouped by different dimensions.

The fourth chapter covers encoding and serialization schemes such as JSON, XML, and binary encodings including MessagePack and Avro. It reviews the binary protocols of Thrift and Protocol Buffers and describes how these protocols support schema evolution for backward and forward compatibility. It then describes data flow over the network using SOAP, REST/RPC, message brokers, and distributed actor frameworks.

The fifth chapter opens the second part of the book, which focuses on distributed data with an emphasis on scaling and shared-nothing architecture. It describes replication and defines leaders and followers. It starts with basic leader-based replication such as active/passive or master/slave, where writes go through the leader and reads can be served by any node. The replication can be synchronous, asynchronous, or semi-synchronous. Synchronous replication has performance issues, but research on chain replication provides good performance and is used by Azure Storage. The chapter reviews scenarios of leader or follower node failure, which may require electing a new leader but can be subject to loss of data that wasn't replicated from the old leader. Replication can use statement-based replication, shipping WAL logs, logical (row-based) replication, or trigger-based replication; statement-based replication can have side effects due to triggers or non-deterministic behavior, and version mismatches may cause incompatibilities with WAL-based replication. Asynchronous replication leads to eventual consistency, where the latest writes may not yet be fully replicated when they are read, so you need read-after-write consistency, which can be addressed using read-your-writes techniques, e.g. reading from the leader or remembering the timestamp of the last write. Replication lag can also make new data appear and then disappear on later reads, and monotonic read consistency addresses that behavior. Consistent prefix reads guarantee that the order of a sequence of writes is preserved when reading those writes. Replicating with multiple leaders across data centers can have higher replication lag and may result in write conflicts between leaders. You may associate users with a specific leader based on location, or resolve these conflicts using timestamps or the higher unique ID. Other forms of replication use leaderless or quorum-based consistency. Some implementations use asynchronous read repair to fix stale data and an anti-entropy process to add any missing data. Quorum-based consistency typically uses an odd number of nodes with w = r = ceil((n + 1) / 2), though it can still be subject to stale values (with sloppy quorums and hinted handoff) or loss of data (last write wins). Quorum-based systems may use a version number for every key to preserve the order of writes (version vectors).
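
As a quick arithmetic check of that quorum condition, choosing w = r = ceil((n + 1) / 2) guarantees that read and write sets overlap (w + r > n); the replica count below is illustrative:

public class Quorum {
    public static void main(String[] args) {
        int n = 5;               // number of replicas
        int w = (n + 2) / 2;     // integer form of ceil((n + 1) / 2)
        int r = w;
        System.out.printf("n=%d w=%d r=%d overlap=%b%n", n, w, r, w + r > n);
    }
}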

Chapter six covers partitioning, which works together with replication so that each node may hold more than one partition. Some partitions may be skewed, having more data than others, also referred to as hot spots. You can partition based on key ranges or a hash of the key, where hashing may use consistent hashing to distribute keys fairly. Partitioning data also requires partitioning secondary indexes or maintaining a global index, partitioned by term, that covers all partitions. Partitioning may also require rebalancing, where you may define, say, 100 times more partitions than nodes so that only a portion of the data has to be relocated. Reading data requires routing requests to the right node: a client may contact any node directly, which forwards the request to another node if needed; all requests may be sent to a routing tier; or the client may be aware of the partitioning and contact the appropriate node itself.

Chapter seven describes transactions and defines the meaning of ACID, though it cautions about failures due to asynchronous writes, caching, disk failures (despite fsync), etc. Relational databases use transaction scope to write multiple objects atomically and in isolation, while other databases use compare-and-set operations to prevent lost updates. Due to the high cost of serializable transactions, most databases provide weak isolation levels when running multiple transactions concurrently. In order to prevent concurrency bugs, databases provide transaction isolation levels, including read committed, which prevents dirty reads/writes using row-level locks and copies of old data, and snapshot isolation/repeatable read, which prevents read skew (non-repeatable reads) using reader/writer locks and MVCC. Relational databases provide explicit locking using SELECT ... FOR UPDATE to prevent lost updates; other databases use compare-and-set for the same purpose. These isolation levels can still allow write skew and phantoms, which can be prevented using SELECT FOR UPDATE, e.g. for meeting-room reservations, choosing a username, or preventing double spending. Serializable isolation provides the strongest guarantees, but it's not provided in most databases and has limitations with partitioned data. Two-phase locking can be used for serializable isolation, but it suffers from performance issues and can lead to deadlocks. Another alternative is predicate locks, which lock all objects matching some criteria, but they also suffer from performance issues. Other alternatives include index-range locks and serializable snapshot isolation (SSI), which uses optimistic concurrency control.

Chapter eight discusses network faults and failures in distributed systems. The chapter contrasts cloud computing with supercomputers, where nodes communicate through RDMA/shared memory. These failures are common in most systems and can be detected by a load balancer or monitoring system. Partial failures are hard to detect, and you may use timeouts to detect them. Distributed systems may use monotonic clocks (System.nanoTime) or time-of-day (NTP-synchronized) clocks, but unreliable clocks make measuring time in distributed systems error prone. NTP synchronization is not always reliable, drift in time-of-day clocks may result in an incorrect order of events, and a last-write-wins strategy may overwrite data with an old value. Google's TrueTime API reports a confidence interval along with the clock time, e.g. Spanner uses the clock confidence interval for snapshot isolation to create globally monotonically increasing transaction IDs. When scheduling tasks periodically, account for process pauses due to GC, virtualization, disk I/O, etc. When using a lock or lease, fencing ensures there is only one leader: when the lock server grants a lock/lease, it returns a fencing token (a monotonically increasing number), and the client includes it with each request so that requests from an old lease holder are rejected. However, fencing tokens cannot protect against Byzantine faults (deliberate faults).
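
A minimal sketch of the fencing idea on the storage side; the class is hypothetical and only illustrates rejecting requests that carry an older token:

public class FencedStorage {
    private long highestTokenSeen = -1;

    // Accept a write only if its fencing token is at least as high as any token seen so far,
    // so a request from an expired lease holder (carrying a lower token) is rejected.
    public synchronized void write(long fencingToken, String key, String value) {
        if (fencingToken < highestTokenSeen) {
            throw new IllegalStateException("stale fencing token: " + fencingToken);
        }
        highestTokenSeen = fencingToken;
        // ... persist key/value ...
    }
}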

Chapter nine explains consistency and consensus. Most replicated databases provide eventual consistency, which is a weak form of consistency. Linearizability is the strongest consistency model: it makes the system appear as if there is a single copy of the data, and reads cannot return old data once they have returned new data. Linearizability may rely on compare-and-set operations to prevent data from being overwritten. The chapter also distinguishes linearizability from serializability: serializability is an isolation property of transactions that guarantees a serial order of transactions, whereas linearizability guarantees reading recent data after a write and doesn't prevent write skew. Leader election, distributed locks, and uniqueness constraints such as usernames may use linearizability to arrive at a single up-to-date value. The simplest way to have a linearizable system is to keep a single copy of the data, but you need replication for fault tolerance. You can use single-leader replication, where all writes go through the leader, or a consensus algorithm, but multi-leader and leaderless (Dynamo-style) replication don't provide linearizable guarantees. The chapter then describes the CAP theorem, where consistency (C) relates to linearizability: you give up availability if some replicas are disconnected and wait until the replicas are fixed, or a replica can remain available even when it's disconnected from other replicas, providing availability at the cost of linearizability. Also, the CAP theorem only considers one kind of fault, a network partition (nodes that are alive but disconnected from each other), so most highly available systems don't meet its definition. Causal order can define what happened before what, but it's not the total order guaranteed by linearizability, and events can only be partially ordered. Linearizability is stronger than causality, but most systems only need causal consistency, which captures which operation happened before another. A sequence number or timestamp from a logical clock can be used to identify the order of operations, and it can be incremented by a single leader. Other systems can preallocate blocks of sequence numbers, attach timestamps, or generate local sequence numbers, but those are not consistent with causality. The chapter then reviews Lamport timestamps (logical clocks), which enforce a total order (distinct from version vectors). You need a leader to sequence all operations on a single CPU to guarantee total order broadcast (atomic broadcast), but that is not scalable; a leader per partition can maintain ordering per partition but doesn't guarantee total ordering. Total order broadcast requires reliable delivery and totally ordered delivery (the same order on all nodes), and it can be used to implement serializable transactions. It can also be used to implement a lock service that provides fencing tokens. You can implement this by appending a message to a log, reading the log, and waiting for the message to be delivered back (in the same order on all nodes) to verify a unique identifier such as a username. But that only guarantees linearizable writes; you must sequence reads after a message is delivered back to you to guarantee linearizable reads.

The next part of chapter nine describes consensus, where nodes have to agree on a leader after an election (in the presence of network faults while avoiding split brain) and on atomic commits (as in ACID). Despite the FLP result, which proved that no algorithm can always reach consensus in an asynchronous model if a node may crash, distributed systems can achieve consensus in practice. The chapter reviews two-phase commit (2PC), which involves multiple nodes, as opposed to a single-node commit that simply logs the data before committing it. 2PC uses a coordinator/transaction manager that creates a globally unique transaction ID and drives two phases: prepare and commit. The coordinator must write transactions to its log before committing in case it crashes, and participating nodes may have to wait for the coordinator indefinitely. Three-phase commit assumes bounded delays/timeouts to prevent blocking atomic commits. The chapter then distinguishes between internal database transactions and distributed transactions, where distributed transactions such as XA transactions aim to guarantee exactly-once processing atomically. On the downside, the coordinator in distributed transactions is a single point of failure and limits scalability. Fault-tolerant consensus requires uniform agreement, integrity, validity, and termination to agree on the same value. The best-known algorithms that provide fault-tolerant consensus include VSR, Paxos, Raft, and Zab, which implement total order broadcast and require that messages be delivered exactly once, in the same order, to all nodes. These algorithms use epoch numbers (monotonically increasing) for each election, and a quorum is used to agree on a value. There are a few limitations of consensus, such as synchronous replication, majority voting (a minimum of three nodes), static membership, and continuous re-elections due to partial failures. Implementations include ZooKeeper and etcd, which provide linearizable atomic operations using compare-and-set to implement locks, total ordering of operations using fencing tokens (monotonically increasing), failure detection, and change notifications.

Chapter ten describes batch processing and opens the third part of the book, which covers data derived from systems of record for OLAP and reporting. Batch processing is offline processing, as opposed to real-time services or near-real-time stream processing. The chapter starts with basic Unix tools that use pipes and files for batch processing. It then reviews MapReduce frameworks such as Hadoop and distributed file systems such as HDFS, GlusterFS, and QFS. The mapper extracts a key/value from each input record and generates a number of key-value pairs, whereas the reducer takes key-value pairs and collects all values belonging to the same key. The scheduler uses the principle of "putting the computation near the data" to run mappers on machines holding a replica of the input file. MapReduce often requires workflow systems to manage dependencies, such as Oozie, Azkaban, Luigi, Airflow, and Pinball. These frameworks use grouping to merge or collate related data. You can use a skewed join when the data for a group is too large to fit on a single machine, so that the work can be parallelized across multiple reducers. Mappers can use broadcast hash joins and partitioned joins when working with large data. The chapter also discusses dataflow engines like Spark, Tez, and Flink that provide operators for building data pipelines in a more flexible way. It also reviews graph processing systems such as Apache Giraph and Spark GraphX that support the bulk synchronous parallel (BSP) model of computation, which sends messages from one vertex to all connecting vertices, similar to the actor model if you think of each vertex as an actor.

Chapter eleven discusses stream processing for near-real-time processing. Stream processing transmits event streams using messaging systems and a pub/sub model. A messaging system may use UDP, brokerless libraries such as ZeroMQ, or message brokers with queue support. Message brokers support load balancing, where a single message is delivered to one of the consumers so that they share the work (shared subscription). Alternatively, they use fan-out, where each message is delivered to all consumers. Message brokers use acknowledgements to remove a message once it has been processed by a consumer, and the broker delivers the message again in the event of a connection failure, using an atomic commit protocol. Some message brokers such as Apache Kafka use append-only logs to store incoming messages, and these logs can be partitioned, where each message gets a monotonically increasing sequence number (without an ordering guarantee across partitions). Consumers read the files sequentially by specifying an offset. There are some limitations of these log-based brokers, e.g. the number of nodes sharing the work can be at most the number of log partitions in a topic, and if a single message is slow to process, it holds up processing in that partition. Some implementations may use a circular buffer to store messages on disk, but if consumers cannot keep up with producers, old messages may be dropped. The chapter discusses change data capture for syncing data, which captures the changes in a database and applies the same changes elsewhere, such as to a search index. You can also use log compaction so that data can be synced by scanning all messages from offset 0. The chapter reviews event sourcing, which stores all changes to the application state as a log of change events. Event sourcing distinguishes between events and commands: after a command is validated, it becomes an event that is durable and immutable. This use of immutable data is related to the command query responsibility segregation principle.

The last chapter reviews the future of data systems and the limits of total ordering, which may require a single leader and can be difficult across distributed data centers and microservices. You can send all updates to the same partition, but that is not sufficient to capture causal dependencies. The chapter covers the lambda architecture and the unbundling of databases, including storage technologies such as secondary indexes, replication logs, and full-text search indexes. It reviews exactly-once execution of operations, duplicate suppression, and operation identifiers (using a request ID to make an operation idempotent).

June 18, 2019

Review of “Designing Distributed Systems”

Filed under: Computing,Technology — admin @ 11:34 am

The “Designing Distributed Systems” book provides design patterns for building distributed systems with the support of container technologies such as Kubernetes. The book consists of three sections: the first focuses on single-node patterns, the second on long-running services, and the third on batch computation.

Sidecar

The first pattern in the book introduces the sidecar pattern for modularity and reusability, where a single application uses two containers: an application container and a sidecar container, with the sidecar container providing additional functionality such as an SSL proxy for the service or metrics collection for the application container. The sidecar container can be configured via a dynamic configuration service.

Ambassadors

The ambassador pattern introduces an ambassador container that sits between the application and external services, and all incoming/outgoing traffic goes through it. It also helps with modularity and reusability: the ambassador may abstract a sharded service (or A/B testing) so that the client or service itself doesn't need to know all the details. You may also use an ambassador container for service brokering, where it looks up an external service and connects to it.

Anti-corruption layer

The anti-corruption layer integrates two systems that don’t share the same semantic data model.

Adapters

The adapter pattern uses a special container to modify the interface of the application container, e.g. you can deploy a monitoring adapter to automatically collect health metrics using Prometheus or other tools. Similarly, you may use an adapter container to collect Kubernetes logs (stdout/stderr) and reformat them before sending them to a log aggregator such as Fluentd.

Replicated Load-Balanced Services

This pattern is part of the long-running services section, where a load balancer is added in front of the service for scalability. Each service is designed to be stateless so that requests can be sent to any replica behind the load balancer. Each service needs to provide a readiness probe so that the load balancer knows whether it can serve requests. In some cases, you may need to support session-tracked services, where user requests are routed to the same replica using sticky sessions or a consistent hashing function. You may add a caching layer that is deployed alongside your service container (as a sidecar). Further, you may need to provide rate limiting and protect against DoS attacks (X-RateLimit-Remaining headers). This pattern can also implement SSL termination, where external traffic is encrypted with a different certificate than internal traffic (e.g. with Varnish).

Sharded Services

This pattern partitions the traffic so that each shard serves a subset of all requests. As opposed to replicated services, which are generally used for stateless services, sharded services are used for building stateful services. You may use a sharded cache that sits between the users and the front end to optimize end-user performance and latency, and you may add replicas to each shard for further redundancy and scalability. Sharding requires selecting a key to route the traffic, e.g. you may use the IP address or a consistent hash function to avoid remapping when new shards are added. If one of the shards becomes hot, you can add a replicated sharded cache to handle the increased load.

Scatter/Gather

The scatter/gather pattern adds parallelism in servicing requests: work is broken up and fanned out to multiple services, and the results are aggregated before returning to the user. For example, you can implement distributed document search by farming the query out to multiple leaf machines that return matching documents, with a root node aggregating the results. You can also support sharded data by searching each shard in parallel, with the root node generating the union of all documents returned by each shard (leaf node). One downside of this pattern is that it may suffer from the straggler problem, as the total response time depends on the slowest response, so you may need to replicate each shard to improve computational power.
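
A minimal Java sketch of the scatter/gather flow; the shard names and the leaf query are illustrative:

import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ScatterGather {
    // Hypothetical leaf query: each shard returns the documents matching the term.
    static CompletableFuture<Set<String>> queryShard(String shard, String term) {
        return CompletableFuture.supplyAsync(() -> Set.of(shard + ":doc-for-" + term));
    }

    public static void main(String[] args) {
        List<String> shards = List.of("shard-1", "shard-2", "shard-3");
        // Scatter: query every shard in parallel.
        List<CompletableFuture<Set<String>>> futures = shards.stream()
                .map(shard -> queryShard(shard, "release"))
                .collect(Collectors.toList());
        // Gather: wait for all responses and return the union of the results.
        Set<String> union = new TreeSet<>();
        futures.forEach(future -> union.addAll(future.join()));
        System.out.println(union);
    }
}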

Functions and Event-Driven Processing

This pattern is used to implement function-as-a-service (FaaS) products. FaaS simplifies development and deployment as the code is managed and scaled automatically. However, FaaS requires that you decompose your application into small parts that can be run independently. FaaS uses event systems to communicate with each function or to create a data pipeline. You can use external data services to store state that is shared by these functions.

Ownership Election

This pattern helps in a multi-node environment where a specific task must be owned by a single process. For example, when you have multiple replicas, you may need to elect a master using a consensus algorithm such as Paxos or Raft, or frameworks such as etcd, ZooKeeper, and Consul. You can use distributed locks to implement ownership (optionally with a lease or TTL). You may need to verify that you still hold the lock before proceeding, e.g.:

func (l *Lock) isLocked() bool {
    // Only treat the lock as held while well within the lease, keeping a 25% safety
    // margin of the TTL to allow for clock drift and processing delays.
    return l.locked && l.lockTime+0.75*l.ttl > now()
}

Work Queue Systems

This pattern is part of the batch computation section and handles work items within a certain amount of time. You may use a work-queue manager container along with an ambassador container to connect to an external queue source, where the source might be a storage API, network storage, or a pub/sub system like Kafka or Redis. Once the queue manager receives a work item, it launches a worker container. Kubernetes provides a Job object that allows for reliable execution of the work queue. In order to limit the number of worker containers running concurrently, you can limit the number of Job objects that your work queue is willing to create. You can also use the multi-worker pattern, where a collection of different worker containers is combined into a single unified container that implements the worker interface.

Event-Driven Batch Processing

This pattern allows data pipelining, where the output of one work queue becomes the input to another, commonly referred to as a workflow system. Here are patterns of event-driven processing:
Copier: This pattern duplicates a work item into two or more identical streams.
Filter: This pattern reduces a stream of work items to a smaller stream by filtering out items that don't meet particular criteria.
Splitter: This works like a filter, but instead of eliminating input, it sends different inputs to different queues based on criteria.
Sharder: This is a more generic form of splitter that divides a stream of work items across multiple queues based on a sharding function.
Merger: This is the opposite of a copier and merges two different work queues into a single work queue.

You may use a pub/sub API to communicate between the different workers.

Coordinated Batch Processing

This is similar to the reduce part of the MapReduce pattern, where work is broken up and distributed to multiple nodes in parallel. You may need a join or barrier synchronization to wait for intermediate results before proceeding to the next stage of the workflow. For example, the reduce phase merges several outputs into a single output.

Claim-check

Instead of sending large messages to a messaging queue, you can store the payload in an external storage service and send a reference to it in the message.

Index Table

If the data store doesn't support secondary indexes, you can define a separate index table whose primary key is the secondary field you query by, such as a customer phone number. You can use the same partition key for both the fact table and the index tables so that they are colocated.

Compensating Transaction

When invoking several external services under an eventual consistency model, where any step can fail, you can use a compensating transaction to undo the operation and revert the changes.

Saga distributed transactions

The saga pattern uses a sequence of transactions to invoke multiple services; if a step fails, it executes compensating transactions.

Scheduler Agent Supervisor

This pattern coordinates distributed actions as a transaction and undoes the work if any action fails. It uses a scheduler to orchestrate the steps of the transaction, an agent to invoke a remote service, and a supervisor to monitor the task being performed.

Competing Consumers

This allows multiple concurrent consumers to receive messages from the same messaging queue, improving scalability, reliability, and availability.

Command and Query Responsibility Segregation (CQRS)

CQRS separates query and update operations for a data store: commands capture task/transaction-based updates to the data store, while the query model is optimized for high-performance reads.

Event Sourcing

Event sourcing stores the state of a domain object in an append-only store as a sequence of events that are applied to update the state of the application. This enables better performance, scalability, audit trails, and compensating actions.

Asynchronous Request-Reply

Front-end applications often work with synchronous APIs, and this pattern uses a request-reply facade to decouple them from asynchronous backend APIs.

Retry

The retry pattern allows recovering from transient failures by retrying the failed operation after an exponentially increasing delay with jitter.
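
A minimal Java sketch of retry with exponential backoff and jitter (the base delay and attempt limit are illustrative):

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
    public static <T> T withBackoff(Callable<T> operation, int maxAttempts) throws Exception {
        long delayMillis = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception ex) {
                if (attempt >= maxAttempts) {
                    throw ex; // give up after the last attempt
                }
                // Sleep for the base delay plus random jitter, then double the delay.
                Thread.sleep(delayMillis + ThreadLocalRandom.current().nextLong(delayMillis));
                delayMillis *= 2;
            }
        }
    }
}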

Circuit Breaker

When a network request fails due to transient errors, the circuit breaker can be used to prevent an application from repeatedly retrying the failed operation.

Bulkhead

In the bulkhead pattern, components of an application are isolated so that a failure in one component doesn't cause cascading failures. For example, you may use separate database connection pools, thread pools, or rate limits for different types of requests.

Orchestration vs Choreography

Microservices can communicate with other services using an orchestration model, where a centralized service orchestrates requests to other services and acknowledges all operations. Alternatively, you can use message-driven or asynchronous messaging to implement a choreography design.

Deployment Stamps

In order to scale deployment by tenants or groups, you can use a deployment stamp or scale unit to host multiple instances of the application. This provides scalability, sharding, and separation of data per subset of customers.

External Configuration Store

The external configuration store allows storing application configuration in a centralized location that can be cached and shared across applications.

Edge Workload Configuration

Edge workload configuration is often used with IoT deployments to allow scaling of the data store and configuration at the edge. For example, you may define an edge configuration controller and an edge configuration data store to host device configurations that are updated asynchronously from the cloud configuration controller and data store.

Federated Identity

Federated identity delegates authentication to an external identity provider so that user management, authentication, and authorization can be simplified.

Gatekeeper

The gatekeeper acts as a broker between clients and services so that requests can be sanitized, throttled, and authenticated.

Gateway Aggregation

A client may need to invoke multiple data services to aggregate application state; instead, gateway aggregation uses a gateway to invoke those services on the client's behalf.

Gateway Offloading

When communicating with a service that has unique requirements for connections or security, you can use a gateway to proxy all requests so that other services don't need to implement all those measures.

Gateway Routing

This allows a client to use a single endpoint to access all services, so that requests can be authenticated, throttled, and validated consistently.

Geode

This allows deploying applications to geographically distributed nodes or regions in an active/active style so that latency and availability can be improved.

Health Endpoint Monitoring

This allows external tools to check application health for monitoring purposes.

Leader Election

In order to avoid conflicts, a single task instance can be elected as a leader so that it coordinates actions of other subordinate tasks.

Pipes and Filters

The pipes and filters pattern decomposes a task into a series of components that can be shared and reused to improve performance, scalability, and reusability.

Rate Limiting

Rate limiting controls resource consumption based on limits so that you can predict throughput accurately.

Sequential Convoy

This allows processing a group of messages in sequential order by using consumers that are partitioned by the group identifier.

Valet Key

The valet key acts as a token that provides access to a restricted resource. For example, the client first requests a valet key, which is then used to access the resource within the lease time.

March 9, 2018

Separating Control/Config Management from the Services

Filed under: REST,Technology — admin @ 9:59 pm

One pattern I learned early in my career was to separate the control flow from the data flow. For example, when I first looked at the FTP protocol, I noticed it listened on two separate ports for communication between client and server: it used a data port to transfer files and a control port to send/receive commands for managing the transfer. This allows the server to respond quickly even when the data port is busy transferring large files. In some ways, this is similar to the bulkhead pattern for partitioning components and limiting the blast radius. When a service reaches its capacity, it slows down all requests, including any requests to control or configure it, so it helps to define a separate channel for managing the control-service. Also, you may need to define special access-control policies for the control-service; for example, an admin may need to be on a trusted network for administration. In some cases, you may keep the control-service for management behind the firewall while the data-service is publicly accessible. Another use case is updating a service's configuration at runtime, where you store the service configuration via the control-service, which can update the configuration and then publish it to the data-service.

March 6, 2018

Tips from the second edition of “Release It!”

Filed under: Design,Methodologies,Technology — admin @ 4:19 pm

The first edition of “Release It!” has been one of the most influential books that I have read; it introduced a number of methods for writing fault-tolerant systems, such as the circuit-breaker and bulkhead patterns, so I was excited to read the second edition when it came out. Here are a few tips from the book:

In the first chapter, the author defines stability in terms of robustness, i.e., “A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing.” He recommends focusing on longevity bugs such as resource leaks, e.g. setting timeouts when invoking a network or database operation and detecting dead connections before reading/writing. The author cautions against tightly coupled systems that can easily propagate failures to other parts of the system. Similarly, a high number of integration points increases the probability of a failure in one of those dependencies. The author suggests looking into blocking calls that can lead to deadlocks when using multiple threads, and scrutinizing resource pools that get exhausted by blocking operations. Another cause of instability is a chain reaction from the failure of one server, which increases load on the remaining servers; this can be remedied using the bulkhead or circuit-breaker patterns. High memory usage on the server can also constrain resources, and the author recommends using external caching systems such as Redis, memcached, etc. In order to monitor the health of the system, the author suggests using a mock transaction to ensure it's working as expected and keeping metrics on errors, especially login errors and high-latency warnings. One common self-inflicted failure is a self-denial attack, such as a marketing campaign, which can be mitigated by using a shared-nothing architecture or reducing fan-in on shared resources. Servers can also be subjected to the dogpile effect, where load spikes after an upgrade, cron job, or config change; this can be mitigated by using random clock skew, adding random jitter, and using exponential backoff. Finally, the author recommends monitoring slow responses, removing unbounded result sets, and failing fast.

Here is a summary of stability patterns that the author recommends:

  • Apply timeouts at integration points and delay retries when a failure occurs.
  • Apply the circuit-breaker pattern to prevent cascading failures (along with timeouts).
  • Apply the bulkhead pattern to partition the system in the event of a chain-reaction failure.
  • Steady state
  • Data purging
  • Log files
  • Fail fast, restart fast and reintegrate
  • Let it crash with limited granularity, e.g. at the boundary of an actor.
  • Supervision for monitoring, restarts, etc.
  • Shed Load
  • Back-pressure – queues must be finite for finite response time
  • Governor – create a governor to slow the rate of actions (from automated responses) so that humans can review them.

The next few chapters focus on networks, machines, and processes for building and deploying the code. The author offers several mechanisms for scaling, such as load balancing with DNS, using a service registry for upgrades and fail-over, configuration, transparency, and collecting logs and metrics. The author recommends load shedding when under high load, or using HTTP 503 to notify the load balancer. For example, the maximum queue length can be calculated as: (max-wait-time / mean-processing-time + 1) * processing-threads * 1.5. You can reject requests at the listen queue with a 503 error to prevent clients from reconnecting immediately. The control plane chapter recommends tools for administration. It recommends a postmortem template covering what happened, an apology, and a commitment to improvement, and it emphasizes system failures (as opposed to human errors). The author recommends adding indicators such as traffic, business transactions, users, resource pool health, database connection health, data consumption, integration point health, cache health, etc.
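
As a quick worked example of that queue bound (the numbers are illustrative, not from the book):

public class QueueBound {
    public static void main(String[] args) {
        double maxWaitMillis = 1000;
        double meanProcessingMillis = 50;
        int processingThreads = 8;
        // (max-wait-time / mean-processing-time + 1) * processing-threads * 1.5
        double maxQueueLength = (maxWaitMillis / meanProcessingMillis + 1) * processingThreads * 1.5;
        System.out.println("Shed load once the queue exceeds about " + maxQueueLength + " requests");
    }
}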

The security chapter offers standard best practices from OWASP, such as using parameterized queries to protect against SQL injection, and using high-entropy random session IDs and cookies for exchanging session IDs to protect against session hijacking/fixation. To protect against XSS (where user input is rendered in HTML without escaping), the author recommends filtering input and escaping it when rendering. The author recommends using a random nonce and a strict SameSite cookie policy to protect against CSRF. Similarly, the author recommends applying the principle of least privilege, access control, etc. Admin tools can offer capabilities for resetting circuit breakers, adjusting connection pool sizes, disabling specific outbound integrations, reloading configuration, stopping acceptance of load, and toggling feature flags.

For ease of deployment, the author recommends automation, immutable infrastructure, continuous deployment, and rolling changes incrementally.

The author makes several recommendations on process and organization, such as the OODA loop for fast learning, functional/autonomous teams, evolutionary architecture, asynchrony patterns, loose clustering, and creating options for the future.

Lastly, the author offers chaos engineering as a way to test the resilience of your system, using the Simian Army or writing your own chaos monkey. Overall, the new edition offers a few additional chapters on scaling, deployment, security, and chaos engineering, plus more war stories from the author's consulting work.

January 15, 2017

Review of “Whiplash: How to Survive Our Faster Future”

Filed under: Future,Technology — admin @ 4:38 pm

I read “Whiplash: How to Survive Our Faster Future” by Joi Ito and Jeff Howe over the holidays. Joi Ito is the director of the MIT Media Lab, which was created by Nicholas Negroponte in 1985 to build an environment where the best ideas from the arts and sciences can be married to produce the next revolutionary discoveries.

This book narrates anecdotes of how technology revolutionized human development in the past and how it continues to disrupt our lives today. As a consequence of Moore's law and the Internet, technology is changing at an exponential pace. In such rapidly changing environments, this book provides key lessons that can prepare us for an uncertain future and paradigm shifts. In such exponential times, the invention of new technologies far outpaces our understanding of the moral and ethical consequences of those breakthroughs. As technologies can be used for both good and bad, they offer both the salvation and the demise of humanity.

Here are the primary principles that the authors present to shape the new world:

Emergence over Authority:

The invention of the Internet has facilitated communication and collaboration all over the world, where the best ideas can be easily shared and exchanged. As a result, institutions built around central authority are disintegrating. The authors present several examples of emergence vs. authority, such as blogs vs. newspapers, Wikipedia vs. encyclopedias, and central governments vs. political revolutions organized through social networks.
In emergent systems, participants use simple rules and exchange information to build complex systems. Examples of emergent systems are ant colonies, slime molds, the brain, flocking birds, stock exchanges, biology, etc. Authoritarian systems enable incremental changes, whereas emergent systems are more adaptive and foster non-linear progress.

Pull Over Push:

Push-based systems control access to resources, whereas pull-based systems use transparency and two-way communication and are able to cope with crises far better. The authors recount an example of a pull-based approach from the meltdown of the Fukushima nuclear plant, which occurred as a result of a severe earthquake and a fourteen-foot tsunami: Joi and a team of volunteers across the world collaborated and built Geiger counters to take accurate readings of the radiation. Other examples of pull-based systems are crowdfunding and crowdsourcing.

Compasses over Maps:

A map provides detailed knowledge and an optimal route, whereas a compass offers more autonomy and flexibility in an unpredictable environment. The authors cite examples from the education system, where standardized tests and curricula deprive students of creativity and a passion for learning. They use the culture of the Media Lab as an illustration, where the vision is based on a compass heading: it provides a framework for individual progress while leaving flexibility for interactions between groups.

Risk over Safety:

Traditional businesses are risk averse: new ventures are thoroughly analyzed and millions are spent on studies. However, the cost of experimenting with new ideas has been drastically reduced in today's market, and experimentation offers a much better return on investment.

Disobedience over Compliance:

Innovation requires creativity and breaking the rules, so high-impact institutions require a culture of disobedience, one where criticism and diverse ideas are embraced.

Practice over Theory:

Due to the low cost of launching new products, an innovative organization requires a culture where experiments are valued more than detailed planning.

Diversity over Ability:

The authors provide lessons from biochemistry companies that used gamers to design protein molecules. The gamers came from diverse backgrounds and had better pattern recognition than biochemists with PhDs. Most organizations believe in diversity, yet most lack it, especially high-tech companies such as Facebook, Yahoo, and Google.

Resilience over Strength:

Resilient organizations are like an immune system that can successfully recover from failures. The authors give examples from cyber-security, where threats come from various sources and successful defense requires treating security systems like biological systems and building strong immunity against those risks.

Systems over Objects:

Systems over objects emphasize understanding the connections between people, communities, and the environment. Instead of optimizing an individual or an organization, we need to optimize the impact of innovations on an entire natural system.

Conclusion:

In the final chapter, the authors give examples from AI and machine learning, where deep learning and reinforcement learning have allowed machines to beat human experts at chess and Go. The authors cite “The Singularity Is Near” by Ray Kurzweil, who predicts an intelligence explosion by 2045. In that world, we will have to think about how humans and machines will work together.

November 15, 2016

Tips from “Algorithms to Live By”

Filed under: Algorithms,Computing — admin @ 10:51 pm

“Algorithms to Live By” by Brian Christian and Tom Griffiths reviews computer algorithms from several domains and illustrates practical examples of applying those algorithms to real-life problems. Here is a list of some of the algorithms that I found very useful:

1. Optimal Stopping

This class of problems determines the optimal time to stop further processing when searching or selecting an option. Here are a few examples:

Secretary Hiring Problem
This is a famous math problem, defined by the mathematician Merrill Flood, whose “Look-Then-Leap” rule finds the best candidate by reviewing the first 37% of the candidates without hiring, and then hiring the first candidate who is better than all of the past candidates. There are several other applications of this algorithm, such as finding a life partner or apartment hunting. This problem assumes that you cannot go back to a previous candidate once you reject them, but there are variations of the algorithm that allow going back, e.g. in case the selected candidate rejects your offer.

Selling a House
When selling a house, you need to determine the range of expected offers and the cost of waiting for a better offer.

Finding a Parking Spot
Given the percentage of parking spots that are typically vacant, you determine how many vacant spots you can pass up before a certain distance, after which you take the first available spot.

2. Explore/Exploit

In this chapter, the authors describe several algorithms for exploring the available options and then exploiting the best one. Here is a sampling of the approaches based on explore/exploit:
Multi-armed bandit
Given the expected value of a slot machine (winnings / number of pulls), you need to maximize winnings. There are several approaches, such as:

  • Win-Stay
    You keep using a slot machine as long as you are winning and then switch to a different machine when you lose.
  • Gittins Index
    It is named after John Gittins, a professor at Oxford. It tries to maximize future payoffs by calculating a Gittins index for each slot machine and then selecting the machine with the highest index.
  • Regret and optimism
    Many problems in life can be defined in terms of regrets and optimism by imagining being at the deathbed and thinking of decisions that you could have made differently.
  • Upper Confidence Bound
    It is also referred to as optimism in the face of uncertainty, where you choose your actions as if the environment is as nice as is plausibly possible. Given a range of plausible values, you pick the option with the highest upper confidence bound.
  • A/B Testing
    It is often used to test new features by offering the new features to a subset of the customers.

One insight the authors present is that people often explore longer than they should, favoring the new over the best-known older option.

3. Sorting

In this chapter, the authors describe several sorting algorithms and their computational cost in terms of big-O notation. Big-O notation is generally used to indicate an algorithm's worst-case performance, such as:

  • O(1): Constant cost
  • O(N): Linear cost
  • O(N^2): Quadratic cost
  • O(2^N): Exponential cost
  • O(N!): Factorial cost

Merge-Sort
This algorithm breaks data recursively into smaller sets until there is a single element. It then merges those subsets to create a new sorted list.

Bucket-Sort
A group of n items can be grouped into m buckets in O(nm) time, and this insight is used by bucket sort, where items are first grouped into a number of sorted buckets. For example, you can use this approach to load returned library books into carts based on their shelf numbers.
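
A minimal Java sketch of that grouping step (the call numbers and the two-character shelf prefix are illustrative):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketGrouping {
    public static void main(String[] args) {
        List<String> returnedBooks = List.of("QA76.9", "PS3515", "QA76.1", "PR6056");
        // Group books into carts by shelf prefix; each cart can then be handled independently.
        Map<String, List<String>> carts = returnedBooks.stream()
                .collect(Collectors.groupingBy(callNumber -> callNumber.substring(0, 2)));
        System.out.println(carts);
    }
}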

Sorting is a prerequisite for searching, and there are many practical applications of sorting, such as creating matchups between teams. For example, a round-robin schedule where each team plays every other team would result in a lot of matches (O(N^2)). Instead, competitions such as March Madness use a merge-sort-like bracket to go from 64 teams to 32, 16, 8, 4, and the finals. However, it is not a full sort, as there are only 63 games instead of 192.

4. Caching

In computer design, John von Neumann proposed the memory hierarchy to improve lookup performance, and caching was first used in IBM System/360 mainframes. Other computer researchers such as Bélády designed algorithms for handling page faults that load data from disk into memory. There are several algorithms for cache eviction, such as First-In-First-Out, Least-Recently-Used, etc.

5. Scheduling

Here are a few of the scheduling algorithms described in this chapter:
Earliest Due Date Strategy
It minimizes maximum lateness by choosing the task with the earliest due date first.

Moore’s algorithm
It is similar to Earliest Due Date, but it throws out the biggest task if the new job can't be completed by its due date.

The authors give an example of the Getting Things Done (GTD) technique for time management, where small tasks are handled first. Tasks can also have a weight or priority, and the scheduler then minimizes the sum of weighted completion times by dividing each task's weight by its length and selecting the task with the highest density.

Here are a few issues that can arise with priority based tasks:

  • Priority Inversion – occurs when a low-priority task holds a resource while the scheduler runs a higher-priority task that cannot make progress without it. One way to address this issue is to let the low-priority task inherit the priority of the higher-priority task so that it can complete.
  • Thrashing – occurs when the system grinds to a halt because work cannot be completed due to a lack of resources.
  • Context switching – Modern operating systems use context switching to work on multiple tasks, but each slice of time needs to be big enough for the task to make progress. One technique to minimize context switching is interrupt coalescing, which delays hardware interrupts. Similar ideas apply to batching small tasks, e.g. the Getting Things Done technique encourages setting aside a chunk of time to handle similar tasks such as checking email or making phone calls.

6. Bayes’s Rule

Reverend Thomas Bayes postulated Bayes's rule by reasoning about winning and losing raffle tickets to estimate the overall ticket pool. It was later proved by Pierre-Simon Laplace, and the resulting rule of succession is commonly referred to as Laplace's law. Laplace worked out how to apply Bayes's rule to use prior knowledge in prediction problems.
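
For example, Laplace's rule of succession estimates the probability of the next success after seeing w successes in n attempts as (w + 1) / (n + 2); a small sketch with illustrative numbers:

public class LaplaceRule {
    public static void main(String[] args) {
        int attempts = 10;
        int successes = 7;
        // Rule of succession: (successes + 1) / (attempts + 2)
        double estimate = (successes + 1.0) / (attempts + 2.0);
        System.out.printf("Estimated probability of the next success: %.3f%n", estimate);
    }
}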

Copernican Principle
Richard Gott hypothesized that the moment you observe something, it is likely to be in the middle of its lifetime.

Normal or Gaussian distribution
It has a bell curve and can be used to predict average life span.

Power-law distribution
It ranges over many scales, such as the population of cities or the income of people.

Multiplicative Rule
It multiplies the observed quantity by some constant factor.

Average Rule
It uses the distribution's natural average.

Additive Rule
It predicts that things will go on just a constant amount longer, as in a “five more minutes” rule.

7. Overfitting

In machine learning, overfitting occurs when a model fits the training data too tightly, so it doesn't accurately predict the outcome for data it has not observed.

Cross Validation
Overfitting can be mitigated with cross-validation, assessing the model not just against training data but also against held-out data.

Regularization
It uses constraints to penalize complexity.

Lasso
It penalizes the total weight of the different factors to minimize complexity.

8. Relaxation

In constrained optimization problems, you need to find the best arrangement of a set of variables given a set of rules and a scoring mechanism, as in the traveling salesman problem (O(N!)). Using constraint relaxation, you remove some of the problem's constraints, e.g. you can create a minimum spanning tree that connects all nodes in O(N^2) time. Techniques such as Lagrangian relaxation remove some of the constraints and add them to the scoring system instead.

9. Randomness

This chapter describes examples of algorithms that are based on random numbers such as:

Monte Carlo Method
It uses random samples to handle qualitatively unmanageable problems.

Hill Climbing
It takes a solution and tries to improve it by permuting some of the factors, only accepting changes that result in improvements. However, it may not find the globally optimal solution.

Jitter
It makes small random changes and accepts them even if they don't improve the solution, in order to escape local optima and find a better overall solution.

Metropolis algorithm
It uses the Monte Carlo method and accepts both good and bad tweaks while trying different solutions.

Simulated Annealing
It optimizes problems the way annealing does, by heating up and then slowly cooling off.

10. Networking

This chapter describes algorithms used in computer networks, such as:

Packet switching
One of the key ideas of the Internet was packet switching, where TCP/IP sends data packets over a number of links, as opposed to the dedicated lines or circuit switching used by phone companies.

Acknowledgement
It is used to let the sender know that a packet was received. TCP uses a three-way handshake to establish a connection, and the sender retransmits packets if an ACK is not received.

Exponential Backoff
It increases the average delay (typically doubling it) after each successive failure.
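
As an illustration of the exponential backoff idea above (my own sketch, not from the book), each retry roughly doubles the wait and adds random jitter so that clients don't retry in lockstep:

[codesyntax lang="java"]
import java.util.Random;

// Retry a flaky operation with exponential backoff and jitter.
public class BackoffSketch {
    public static boolean callWithRetries(java.util.function.Supplier<Boolean> operation,
                                          int maxAttempts) throws InterruptedException {
        Random rnd = new Random();
        long baseDelayMs = 100;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (operation.get()) {
                return true;                                             // success, no retry needed
            }
            long cappedDelay = Math.min(baseDelayMs << attempt, 10_000); // double each time, cap at 10s
            long sleep = cappedDelay / 2 + rnd.nextInt((int) (cappedDelay / 2) + 1); // add jitter
            Thread.sleep(sleep);
        }
        return false;                                                    // gave up after maxAttempts
    }

    public static void main(String[] args) throws InterruptedException {
        Random rnd = new Random();
        boolean ok = callWithRetries(() -> rnd.nextInt(10) == 0, 5);     // flaky operation for demo
        System.out.println("succeeded: " + ok);
    }
}
[/codesyntax]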

Flow Control
TCP uses Additive Increase Multiplicative Decrease (AIMD) to gradually increase the number of packets sent and then cut the transmission rate in half when an ACK is not received.

Bufferbloat
A buffer is a queue that stores outgoing packets, but when the queue grows too long it adds delay before packets (and their ACKs) get through, which can trigger unnecessary retransmission. Explicit Congestion Notification can be used to address these issues.

11. Game Theory

In this chapter, the authors discuss several problems from game theory, such as:

Halting problem
This problem was first posed by Alan Turing, who showed that no computer program can decide, in general, whether another program will finish or run forever.

Prisoner’s dilemma
It is based on two prisoners who are caught and must each choose to either cooperate with or defect against the other. For each prisoner, defection is the dominant strategy, even though both would be better off if they both cooperated.
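
To make the dominant-strategy argument concrete, here is a tiny Java illustration (my own, not from the book) using one common set of payoffs expressed as years in prison:

[codesyntax lang="java"]
// Payoffs are years in prison (lower is better): whichever move the other prisoner
// makes, defecting yields fewer years, so defection dominates.
public class PrisonersDilemma {
    // YEARS[myChoice][otherChoice]; 0 = cooperate, 1 = defect
    static final int[][] YEARS = {
        {1, 3},   // I cooperate: 1 year if other cooperates, 3 if other defects
        {0, 2}    // I defect:    0 years if other cooperates, 2 if other defects
    };

    public static void main(String[] args) {
        for (int other = 0; other <= 1; other++) {
            String otherMove = other == 0 ? "cooperates" : "defects";
            System.out.printf("If the other prisoner %s: cooperate -> %d years, defect -> %d years%n",
                    otherMove, YEARS[0][other], YEARS[1][other]);
        }
        // In both cases defecting is better, yet mutual cooperation (1,1) beats mutual defection (2,2).
    }
}
[/codesyntax]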

Nash Equilibrium
It is a set of strategies where neither player can improve their outcome by unilaterally changing their own strategy, given the opponent's strategy.

The Tragedy of the Commons
It involves a shared-resource system where individuals can act independently in a selfish manner contrary to the common good of all participants, e.g. voluntary environmental guidelines where companies are not required to limit emissions.

Information cascade
An information cascade occurs when individuals abandon their own information in favor of inferences from other people's actions. One application of this class of problems is auctions. Here are a few variations of auction systems:

  • Sealed-bid – where bidders cannot see other bids, so they have to predict the price that other bidders will offer.
  • Dutch or descending auction – where bidding starts at a high price that is slowly lowered until someone accepts it.
  • English or ascending auction – where bidding starts at a low price and is raised until only one bidder remains.
  • Vickrey auction – it is similar to a sealed-bid auction, but the winner pays the second-highest bid. It results in better valuations as bidders are incentivized to bid their true value.

Summary

This book presents algorithms from several domains and encourages computational kindness: applying these algorithms in real life. For example, we can add constraints or reduce the number of available options when asking someone to make a decision, which lowers their mental labor.

September 30, 2016

Review of “Simple architecture for complex enterprises”

Filed under: Uncategorized — admin @ 11:24 am

“Simple architecture for complex enterprises” focuses on tackling complexity in IT systems. There are a number of methodologies such as Zachman, TOGAF and FEA, but they don’t address how to manage complexity. The author shares the following concerns that arise when implementing an enterprise architecture:

  • Unreliable Enterprise Information – when the enterprise cannot access or trust its information
  • Untimely Enterprise Information – when reliable information is not available in a timely fashion.
  • New Projects Underway – when building a new complex IT project without understanding its relationship to the business processes.
  • New Companies Being Acquired
  • Enterprise Wants to Spin Off Unit
  • Need to Identify Outsourcing Opportunities
  • Regulatory Requirements
  • Need to Automate Relationships with Internal Partners
  • Need to Automate Relationships with Customers
  • Poor Relationship Between IT and Business Units
  • Poor Interoperability of IT Systems
  • IT Systems Unmanageable – when IT systems are built piecemeal and patched together.

The author defines enterprise architecture as:

“An enterprise architecture is a description of the goals of an organization, how these goals are realized by business processes, and how these business processes can be better served through technology.”

The author asserts the need for planning when building enterprise IT systems, argues that complexity hinders the success of these systems, and cites several examples from government and industry. The author then describes the Zachman framework for organizing architecture artifacts and design documents. John Zachman proposed six descriptive foci in the framework: data, function, network, people, time and motivation.

The author explains that the Zachman framework does not address system complexity. Next, the author explains TOGAF (The Open Group Architecture Framework), which has four categories:

  • Business architecture – business processes
  • Application architecture
  • Data architecture
  • Technical architecture – hardware / software infrastructure

TOGAF defines the ADM (Architecture Development Method) as a recipe for creating an architecture. The author considers TOGAF a process rather than a framework, one that can complement Zachman. TOGAF defines the following levels of the enterprise continuum:

  • enterprise continuum
  • foundation architectures
  • common system architectures
  • industry architectures
  • organization architectures (ADM)

TOGAF defines knowledge bases such as the TRM (Technical Reference Model) and SIB (Standards Information Base). The ADM defines the following phases:

  • Phase: Prelim – framework and principles
  • Phase: A – architecture vision (statement of architecture work, architecture vision)
  • Phase: B – business architecture (input from stakeholders to get the baseline and business objectives)
  • Phase: C – information system architectures (baseline data architecture, review principles, models, data architecture)
  • Phase: D – technology architecture – infrastructure
  • Phase: E – opportunities and solutions
  • Phase: F – migration planning
  • Phase: G – implementation governance
  • Phase: H – architecture change management and then it goes back to Phase A.

TOGAF also lacks complexity management, so the author then explains the Federal Enterprise Architecture (FEA), which includes reference models for business, components, technology, data and performance. FEA organizes the enterprise architecture into segments of business functionality and enterprise services. FEA defines five reference models:

  • The Business Reference Model (BRM) – business view
  • The Component Reference Model (CRM)
  • The Technical Reference Model (TRM)
  • The Data Reference Model
  • The Performance Reference Model

In chapter two, the author explains how complexity affects a system, e.g. a 2 x 2 x 2 Rubik's Cube has 8 smaller cubes and about 3.7 x 10^6 permutations, but a 4 x 4 x 4 Rubik's Cube has 64 smaller cubes and about 7.4 x 10^45 permutations. The relative complexity of the 4 x 4 x 4 cube is therefore vastly higher, and the author argues that by partitioning the 4 x 4 x 4 Rubik's Cube into eight 2 x 2 x 2 cubes, you can lower its complexity dramatically. The author defines the following five laws of partitions:

  • Partitions must be true partitions
  • Partition definitions must be appropriate (e.g. organizing clothing store by color may not be helpful to customers)
  • Partition subset numbers must be appropriate
  • Partition subset sizes must be roughly equal
  • Subset interactions must be minimal and well defined

Further, the author suggests simplification to reduce complexity when partitioning: either remove partition subsets along with their associated items, or remove items from one or more partition subsets while leaving the subsets themselves in place. Partitioning can be done iteratively by choosing one of the partition subsets and simplifying it. The author narrates the story of John Boyd, who came up with the iterative loop of observe, orient, decide, act (OODA) while studying how Air Force pilots flew aircraft in dogfights. Boyd observed that the faster you iterate through the loop, the better your chances of winning the dogfight.

In chapter three, the author shows how mathematics can be used for partitioning. He describes the number of system states as the best measure of complexity; the relative complexity of two systems is the ratio of the number of states in those systems. For example, a system with two variables, each taking six possible values, can be in 6^2 = 36 states, i.e.,

C = S^V, where C is the complexity, V is the number of variables and S is the average number of significant states per variable.

For a business process, the number of paths and the number of decision points within each path is the best measure of complexity, i.e.,

O = P^D, where D is the number of decision points, P is the number of paths at each decision point and O is the number of possible outcomes.

The author introduces the concept of homomorphism, where observing one system lets you make predictions about another, e.g. the relationships between dice systems, software systems and business processes are homomorphic. A system with two six-sided dice has 36 possible states (F^D, or 6^2). However, we can reduce the number of states by dividing the dice into multiple buckets, e.g. putting each die in its own bucket gives 2 x 6 = 12 states instead of 36. The general formula for the number of states with B buckets, D dice per bucket and F faces per die is:

B x F^D
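
A quick Java sketch of this state-count arithmetic (my own illustration, not from the book) shows how partitioning the dice into buckets reduces the number of states:

[codesyntax lang="java"]
// Compare the state count of an unpartitioned dice system with the same dice split into buckets.
public class ComplexityStates {
    // total states for B buckets, each containing D dice with F faces: B * F^D
    static long states(long buckets, long dicePerBucket, long faces) {
        return buckets * (long) Math.pow(faces, dicePerBucket);
    }

    public static void main(String[] args) {
        long unpartitioned = states(1, 2, 6);   // two dice in one bucket: 6^2 = 36
        long partitioned   = states(2, 1, 6);   // one die per bucket: 2 * 6 = 12
        System.out.println("unpartitioned: " + unpartitioned + " states");
        System.out.println("partitioned:   " + partitioned + " states");
        System.out.println("relative complexity: " + (double) unpartitioned / partitioned);
    }
}
[/codesyntax]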

This chapter also describes the concept of equivalence relations, which have the following properties:

  • E(a, a) – always true — reflexivity
  • E(a, b) implies E(b, a) — symmetry
  • E(a, b) and E(b, c) implies E(a, c) — transitivity
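
A concrete example (mine, not the author's): integers related by having the same remainder mod 3 form an equivalence relation, and the three properties can be spot-checked in a few lines of Java:

[codesyntax lang="java"]
// Check reflexivity, symmetry and transitivity for "same remainder when divided by 3".
public class EquivalenceRelationSketch {
    static boolean equivalent(int a, int b) {
        return Math.floorMod(a, 3) == Math.floorMod(b, 3);
    }

    public static void main(String[] args) {
        int a = 4, b = 7, c = 10;                       // all leave remainder 1
        System.out.println("reflexive:  " + equivalent(a, a));
        System.out.println("symmetric:  " + (equivalent(a, b) == equivalent(b, a)));
        System.out.println("transitive: " + (!(equivalent(a, b) && equivalent(b, c)) || equivalent(a, c)));
    }
}
[/codesyntax]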

In chapter four, the author explains simple iterative partitions (SIP), used to create a diagrammatic overview of the enterprise system that focuses on what the enterprise does (as opposed to how). SIP starts with an autonomous business capability (ABC) that represents an equivalence class, i.e. one of the sets that make up the partition. The ABC model includes a process component and a technology component, along with relationships for implementation and deployment. In addition, the author adds an ABC type that acts as a category of ABC, such as human resources. These types can also be defined hierarchically, and different implementations of the same ABC type are considered siblings. Implementations can also be composed so that one ABC is part of another ABC. Another type of relationship is the partner relationship at the implementation or deployment level, where one ABC may send an information request, information broadcast or work request to another.

In chapter five, the author explains the SIP process, which has the following goals:

  • Complexity control
  • Logic-based decisions
  • Value-driven deliverables
  • Reproducible results
  • Verifiable architectures
  • Flexible methodology

The SIP process consists of the following six primary phases:

<------Preliminary------>   <--------------Preparatory------------------->   <------Iteration------>
Phase-0       Phase-1       Phase-2        Phase-3          Phase-4                 Phase-5
Evaluation    Preparation   Partitioning   Simplification   Prioritization          Iteration

Phase-0 (enterprise architecture evaluation) addresses the following issues:

  • Unreliable enterprise information
  • Untimely enterprise information
  • New complex projects underway
  • New companies being acquired
  • Enterprise wants to spin off unit
  • Need to identify outsourcing opportunities
  • Regulatory requirements
  • Need to automate relationships with external partners
  • Need to automate relationships with customers
  • Poor relationships between IT and business units
  • Poor interoperability of IT systems
  • IT systems unmanageable

Phase-1 (SIP preparation) has the following deliverables:

  • Audit of organizational readiness
  • Training
  • Governance model
  • SIP blend
  • Enterprise-specific tools

Phase-2 (partitioning) decomposes the enterprise into discrete ABC (autonomous business capability) units. Phase-3 (partition simplification) applies the five laws of partitions:

  • Partitions must be true partitions
  • Partition definitions must be appropriate
  • Partition numbers must be appropriate
  • Partition sizes must be roughly equal
  • Partition interactions must be minimal and well defined.

Phase-4 (ABC prioritization) uses value graph analysis to estimate potential payoff and risk. The value graph analysis addresses the following factors:

  • Market drivers
  • Cost
  • Organizational risk
  • Technical risk
  • Financial value
  • Organizational preparedness
  • Team readiness
  • Status quo

Phase-5 (ABC iteration) uses an iterative approach to simplify the architecture.

Chapter six describes the NPfIT project as a case study in complexity. NPfIT promised an integrated system connecting every patient, physician, laboratory, pharmacy and healthcare provider in the UK. Its infrastructure provided a new national network, directory services and a care records service (CRS). NPfIT was split into five regional groups of patients, and it allowed appointment booking at any facility, prescription fulfillment and picture archiving. Despite a huge budget of $9.8 billion, there were several concerns such as failure to communicate, a monolithic approach, stifling of innovation, lack of record confidentiality and poor quality of shared data. The SIP approach would have helped: phase-1 would have audited organizational readiness and provided training; phase-2 would have addressed complexity and dropped the multiple regional implementations; and phase-3 would have simplified partitions into subsets such as patient registration, appointment booking, prescriptions, patient records and lab tests.

Chapter seven focuses on guarding technical boundaries. For example, two systems may communicate via RPC, shared databases or a data access layer, but the author suggests a service-oriented architecture (SOA) for interoperability and better scalability. The author suggests the use of guard or envoy entities for handling messages going out of or coming into a system, and defines the following rules to encapsulate the software for a given ABC:

  • Autonomy
  • Explicit boundaries
  • Partitioning of functionality
  • Dependencies defined by policy
  • Asynchronicity
  • Partitioning of data
  • No cross-fortress transactions
  • Single-point security
  • Inside trust
  • Keep it simple

Chapter eight summarizes the book, explaining why complexity is the real enemy and how simplicity pays. It reiterates how the SIP approach can simplify an architecture by partitioning the system into ABC units.

February 6, 2016

Building a Generic Data Service

Filed under: Web Services — admin @ 10:44 pm

As REST-based micro-services have become prevalent, I often find that web and mobile clients have to connect to many different services to gather data. You may have to call dozens of services to display data on a single screen or page. Also, you may only need a subset of the data from each service, but you still have to pay the bandwidth and parsing cost.

I created a new Java framework, PlexDataProviders, for aggregating and querying data from various underlying sources, which can be used to build a general-purpose data service. PlexDataProviders is a lightweight Java framework that abstracts access to various data providers such as databases, files, web services, etc., and allows aggregation of data across them.

The PlexDataProviders framework is divided into two components:

  • Data Provider – This component defines interfaces that are implemented to access data sources such as database or web services.
  • Query Engine – This component is used for querying and aggregating data.

The query engine can determine dependencies between providers, and it also allows the output of one data provider to be used as input to another data provider. For example, let's assume:

  • data-provider A requires input-a1, input-a2 and produces output-a1, output-a2
  • data-provider B requires input-b1 and output-a1 and produces output-b1, output-b2

Then you can pass input-a1 and input-a2 to the query engine and request the output-a1, output-a2, output-b1 and output-b2 data fields; the engine will invoke provider A first and feed output-a1 into provider B.

Benefits

PlexDataProviders offers the following benefits:

  • It provides a unified way to query data and abstracts the integration with underlying data sources.
  • It helps simplify client-side logic, as clients can use a single data service to query all data instead of calling multiple data services.
  • It also helps with managing end-points, as you only need a single end-point instead of connecting to multiple web services.
  • As clients can specify the data they need, this helps with payload size and network bandwidth.
  • The clients only need to create a single data parser, which keeps JSON parsing logic simple.
  • As PlexDataProviders supports multi-threading, it also helps reduce the latency of data fetch requests.
  • It supports partial failures, so that a failure in a single data provider doesn’t affect other data providers and the data service can still return partial results.
  • It supports timeouts, so that clients receive whatever data is available within the given timeout interval.

Data Structure

Following are primary data structures:

  • MetaField – This class defines meta information for each data field such as name, kind, type, etc.
  • MetaFieldType – This enum class defines the supported data types, i.e.
    • SCALAR_TEXT – simple text
    • SCALAR_INTEGER – integer numbers
    • SCALAR_DECIMAL – decimal numbers
    • SCALAR_DATE – dates
    • SCALAR_BOOLEAN – boolean
    • VECTOR_TEXT – array of text
    • VECTOR_INTEGER – array of integers
    • VECTOR_DECIMAL – array of decimals
    • VECTOR_DATE – array of dates
    • VECTOR_BOOLEAN – array of boolean
    • BINARY – binary data
    • ROWSET – nested data rowsets
  • Metadata – This class defines a set of MetaFields used in DataRow/DataRowSet
  • DataRow – This class abstracts a row of data fields
  • DataRowSet – This class abstracts a set of rows

PlexDataProviders also supports nested structures, where a data field in a DataRow can be an instance of DataRowSet.

Adding a Data Provider

A data provider implements the following two interfaces:

[codesyntax lang="java"]
public interface DataProducer {
    void produce(DataRowSet requestFields, DataRowSet responseFields,
            QueryConfiguration config) throws DataProviderException;
}
[/codesyntax]

Note that QueryConfiguration defines additional parameters such as:

  • pagination parameters
  • ordering/grouping
  • filtering parameters
  • timeout parameters

The timeout parameter can be used to return all available data within a defined time, e.g. the query engine may invoke the underlying data providers on multiple threads, and if an underlying query takes too long then the engine returns whatever data is already available.

[codesyntax lang="java"]
public interface DataProvider extends DataProducer, Comparable<DataProvider> {
    String getName();

    int getRank();

    Metadata getMandatoryRequestMetadata();

    Metadata getOptionalRequestMetadata();

    Metadata getResponseMetadata();

    TaskGranularity getTaskGranularity();
}
[/codesyntax]

Each provider defines a name, a rank (or priority used when matching the best provider), and sets of mandatory/optional input and output data fields. A data provider can also declare its granularity as coarse-grained or fine-grained, and the implementation may use that to decide which providers to execute on separate threads.
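
For illustration only (this is not the actual PlexDataProviders implementation), a query engine could run independent providers on a thread pool and return whatever finishes within the timeout roughly like this:

[codesyntax lang="java"]
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch: run providers concurrently, collect only results that
// finished within the timeout, and ignore providers that failed.
public class ConcurrentDispatchSketch {
    public static List<String> runWithTimeout(List<Callable<String>> providers,
                                              long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // invokeAll waits up to the timeout and cancels tasks that are still running
            List<Future<String>> futures = pool.invokeAll(providers, timeoutMillis, TimeUnit.MILLISECONDS);
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                if (f.isDone() && !f.isCancelled()) {
                    try {
                        results.add(f.get());          // collect only the providers that finished
                    } catch (ExecutionException e) {
                        // a failing provider does not prevent partial results
                    }
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<String>> providers = Arrays.asList(
                () -> "fast-provider-result",
                () -> { Thread.sleep(5_000); return "slow-provider-result"; });
        System.out.println(runWithTimeout(providers, 500)); // prints only the fast result
    }
}
[/codesyntax]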

PlexDataProviders also provides interfaces for converting data from domain objects to DataRowSet. Here is an example of provider implementation:

[codesyntax lang="java"]
public class SecuritiesBySymbolsProvider extends BaseProvider {
    private static Metadata parameterMeta = Metadata.from(SharedMeta.symbol);
    private static Metadata optionalMeta = Metadata.from();
    private static SecurityMarshaller marshaller = new SecurityMarshaller();

    public SecuritiesBySymbolsProvider() {
        super("SecuritiesBySymbolsProvider", parameterMeta, optionalMeta,
                marshaller.getMetadata());
    }

    @Override
    public void produce(DataRowSet parameter, DataRowSet response,
            QueryConfiguration config) throws DataProviderException {
        final String id = parameter.getValueAsText(SharedMeta.symbol, 0);
        Map<String, Object> criteria = new HashMap<>();
        criteria.put("symbol", id.toUpperCase());
        Collection<Security> securities = DaoLocator.securityDao.query(criteria);
        DataRowSet rowset = marshaller.marshal(securities);
        addRowSet(response, rowset, 0);
    }
}
[/codesyntax]

Typically, you will create a data provider for each kind of query that you want to support. Each data provider specifies the set of required and optional input data fields that it uses to generate its output data fields.

Here is an example of marshalling data from Security domain objects to a DataRowSet:

[codesyntax lang="java"]
public DataRowSet marshal(Security security) {
    DataRowSet rowset = new DataRowSet(responseMeta);
    marshal(rowset, security, 0);
    return rowset;
}

public DataRowSet marshal(Collection<Security> securities) {
    DataRowSet rowset = new DataRowSet(responseMeta);
    for (Security security : securities) {
        marshal(rowset, security, rowset.size());
    }
    return rowset;
}
...
[/codesyntax]

PlexDataProviders provides a DataProviderLocator interface for registering and looking up providers, e.g.

[codesyntax lang="java"]
public interface DataProviderLocator {
    void register(DataProvider provider);

    Collection<DataProvider> locate(Metadata requestFields, Metadata responseFields);
...
}
[/codesyntax]

PlexDataProviders comes with a small application that provides data services by implementing various data providers. It uses the PlexService framework for defining the service, e.g.

[codesyntax lang="java"]
@WebService
@Path("/data")
public class DataServiceImpl implements DataService {
    private DataProviderLocator dataProviderLocator = new DataProviderLocatorImpl();
    private QueryEngine queryEngine = new QueryEngineImpl(dataProviderLocator);

    public DataServiceImpl() {
        dataProviderLocator.register(new AccountsByIdsProvider());
        dataProviderLocator.register(new AccountsByUseridProvider());
        dataProviderLocator.register(new CompaniesBySymbolsProvider());
        dataProviderLocator.register(new OrdersByAccountIdsProvider());
        dataProviderLocator.register(new PositionGroupsBySymbolsProvider());
        dataProviderLocator.register(new PositionsBySymbolsProvider());
        dataProviderLocator.register(new QuotesBySymbolsProvider());
        dataProviderLocator.register(new SecuritiesBySymbolsProvider());
        dataProviderLocator.register(new UsersByIdsProvider());
        dataProviderLocator.register(new WatchlistByUserProvider());
        dataProviderLocator.register(new SymbolsProvider());
        dataProviderLocator.register(new UsersProvider());
        dataProviderLocator.register(new SymbolSearchProvider());
    }

    @Override
    @GET
    public DataResponse query(Request webRequest) {
        final DataRequest dataRequest = DataRequest.from(webRequest.getProperties());
        return queryEngine.query(dataRequest);
    }
}
[/codesyntax]

As you can see, the data service simply builds a DataRequest from the input data fields and sends the response back to clients.

Here is an example client that passes a search query data field and requests quote data fields along with company details:

public void testGetQuoteBySearch() throws Throwable {
    String jsonResp = TestWebUtils.httpGet("http://localhost:" + DEFAULT_PORT
                    + "/data?responseFields=exchange,symbol,quote.bidPrice,quote.askPrice,quote.sales,company.name&symbolQuery=AAPL");
    ...

Note that the above request uses three data providers: first the SymbolSearchProvider searches for symbols matching the given query, then the resulting symbol data field is used to request quote and company data fields from QuotesBySymbolsProvider and CompaniesBySymbolsProvider. The PlexDataProviders framework takes care of all dependency management between providers.

Here is an example JSON response from the data service:

[codesyntax lang="javascript"]
{
    "queryResponse": {
        "fields": [
            [{
                "symbol": "AAPL_X"
            }, {
                "quote.sales": [
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 56
                    }, {
                        "timeOfSale.exchange": "DOW"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 69.49132317180353
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 54
                    }, {
                        "timeOfSale.exchange": "NYSE"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 16.677774132458076
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 99
                    }, {
                        "timeOfSale.exchange": "NASDAQ"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 42.17891320885568
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 49
                    }, {
                        "timeOfSale.exchange": "DOW"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 69.61680149649729
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 69
                    }, {
                        "timeOfSale.exchange": "NYSE"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 25.353316897552833
                    }]
                ]
            }, {
                "quote.askPrice": 54.99300665695502
            }, {
                "quote.bidPrice": 26.935682182171643
            }, {
                "exchange": "DOW"
            }, {
                "company.name": "AAPL - name"
            }],
            [{
                "symbol": "AAPL"
            }, {
                "exchange": "NASDAQ"
            }]
        ],
        "errorsByProviderName": {},
        "providers": ["QuotesBySymbolsProvider", "SymbolSearchProvider", "CompaniesBySymbolsProvider"]
    }
}
[/codesyntax] 

PlexDataProviders is available on GitHub and is licensed under the liberal MIT license. It also comes with a small sample application for demo purposes. Feel free to send me your suggestions.

 

August 17, 2014

PlexService Overview – a Micro-service framework for defining HTTP/Websockets and JMS based Services

Filed under: Uncategorized — admin @ 9:19 pm

I recently created a new framework, PlexService, for serving micro-services that can be accessed over HTTP, Websockets or JMS interfaces. You can choose among these access mechanisms based on the needs of your services. For example, as JMS services are inherently asynchronous, they provide a good foundation for building scalable and reactive services. You may choose the HTTP stack for implementing REST services, or Websockets for implementing interactive services.

The PlexService framework provides basic support for encoding POJO objects into JSON for service consumption. Developers define service configuration via annotations to specify gateway types, encoding scheme, end-points, etc.

PlexService supports role-based security, where you can specify the list of roles that can access each service. Service providers implement how roles are verified, and the PlexService framework enforces them.

If you implement all services over JMS, you can easily expose them via HTTP or Websockets by configuring a web-to-JMS bridge. The bridge routes all requests from HTTP/Websockets to JMS and listens for responses, which are then routed back to the web clients.

PlexService collects basic metrics such as latency, invocations, errors, etc., which are exposed via a JMX interface. PlexService uses Jetty for serving web services, and developers provide a JMS container at runtime if required.

Building/Installing

Check out the code using:

 git clone git@github.com:bhatti/PlexService.git

Compile and build the jar file using:

 ./gradlew jar

Then copy the jar file and add it to your application manually.

Defining role-based security

PlexService allows developers to define role-based security, which is invoked when accessing services, e.g.

 public class BuggerRoleAuthorizer implements RoleAuthorizer {
     private final UserRepository userRepository;
 
     public BuggerRoleAuthorizer(UserRepository userRepository) {
       this.userRepository = userRepository;
     }
 
     @Override
       public void authorize(Request request, String[] roles) throws AuthException {
         String sessionId = request.getSessionId();
         User user = userRepository.getUserBySessionId(sessionId);
         if (user == null) {
           throw new AuthException(Constants.SC_UNAUTHORIZED,
               request.getSessionId(), request.getRemoteAddress(),
               "failed to validate session-id");
         }
         for (String role : roles) {
           if (!user.getRoles().contains(role)) {
             throw new AuthException(Constants.SC_UNAUTHORIZED,
                 request.getSessionId(), request.getRemoteAddress(),
                 "failed to match role");
           }
         }
       }
 }

Typically, a login service will store the session-id, which is then passed to the implementation of RoleAuthorizer, e.g.

 @ServiceConfig(gateway = GatewayType.HTTP, requestClass = Void.class, endpoint = "/login", method = Method.POST, codec = CodecType.JSON)
 public class LoginService extends AbstractUserService implements RequestHandler {
   public LoginService(UserRepository userRepository) {
     super(userRepository);
   }
 
   @Override
   public void handle(Request request) {
     String username = request.getStringProperty("username");
     String password = request.getStringProperty("password");
 
     User user = userRepository.authenticate(username, password);
     AbstractResponseBuilder responseBuilder = request.getResponseBuilder();
     if (user == null) {
       throw new AuthException(Constants.SC_UNAUTHORIZED,
               request.getSessionId(), request.getRemoteAddress(),
               "failed to authenticate");
     } else {
       responseBuilder.addSessionId(userRepository.getSessionId(user));
       responseBuilder.send(user);
     }   
   }
 }

In the above example, the session-id is added to the response upon successful login and is then passed with future requests. For HTTP services, you may use cookies to store session-ids; otherwise you would need to pass the session-id as a parameter.

Here is how you can invoke login-service from curl:

 
 curl --cookie-jar cookies.txt -v -k -H "Content-Type: application/json" -X POST "http://127.0.0.1:8181/login?username=erica&password=pass"

which would return:

 
 Content-Type: application/json
 Set-Cookie: PlexSessionID=5 Expires: Thu, 01 Jan 1970 00:00:00 GMT
 {"id":5,"username":"erica","email":"erica@plexobject.com","roles":["Employee"]}

Defining Services

Defining a REST service for creating a user

Here is how you can define a REST service:

@ServiceConfig(gateway = GatewayType.HTTP, requestClass = User.class, 
     rolesAllowed = "Administrator", endpoint = "/users", method = Method.POST, 
     codec = CodecType.JSON)
 public class CreateUserService extends AbstractUserService implements
 RequestHandler {
   public CreateUserService(UserRepository userRepository) {
     super(userRepository);
   }
 
   @Override
     public void handle(Request request) {
       User user = request.getPayload();
       user.validate();
       User saved = userRepository.save(user);
       request.getResponseBuilder().send(saved);
     }
 }

The ServiceConfig annotation specifies that this service can be accessed via HTTP at the “/users” URI. PlexService will decode the JSON payload into a User object and will ensure that the service can only be accessed by a user who has the Administrator role.

Here is how you can invoke this service from curl:

 
 curl --cookie cookies.txt -k -H "Content-Type: application/json" -X POST "http://127.0.0.1:8181/users" -d "{\"username\":\"david\",\"password\":\"pass\",\"email\":\"david@plexobject.com\",\"roles\":[\"Employee\"]}"

Defining a Web service over Websockets for creating a user

Here is how you can define a Websocket based service:

 @ServiceConfig(gateway = GatewayType.WEBSOCKET, requestClass = User.class, 
     rolesAllowed = "Administrator", endpoint = "/users", method = Method.POST, 
     codec = CodecType.JSON)
 public class CreateUserService extends AbstractUserService implements
 RequestHandler {
   public CreateUserService(UserRepository userRepository) {
     super(userRepository);
   }
 
   @Override
     public void handle(Request request) {
       User user = request.getPayload();
       user.validate();
       User saved = userRepository.save(user);
       request.getResponseBuilder().send(saved);
     }
 }

The ServiceConfig annotation specifies that this service can be accessed via Websocket at the “/users” endpoint. However, as opposed to an HTTP based service, this endpoint is not part of the HTTP request path and can be in any format as long as it's unique per service.

Here is how you can access websocket service from javascript:

 var ws = new WebSocket("ws://127.0.0.1:8181/users");
 ws.onopen = function() {
   var req = {"payload":"", "endpoint":"/login", "method":"POST", "username":"scott", "password":"pass"};
   ws.send(JSON.stringify(req));
 };
 
 ws.onmessage = function (evt) {
   alert("Message: " + evt.data);
 };
 
 ws.onclose = function() {
 };
 
 ws.onerror = function(err) {
 };

Note that Websockets are not supported by all browsers; the above code will work only on supporting browsers such as IE 11+, Firefox 31+, Chrome 36+, etc.

Defining a JMS service for creating a user

Here is how you can create a JMS service:

 @ServiceConfig(gateway = GatewayType.JMS, requestClass = User.class, 
       rolesAllowed = "Administrator", endpoint = "queue:{scope}-create-user-service-queue", 
       method = Method.MESSAGE, 
       codec = CodecType.JSON)
 public class CreateUserService extends AbstractUserService implements RequestHandler {
     public CreateUserService(UserRepository userRepository) {
     super(userRepository);
     }
 
     @Override
     public void handle(Request request) {
       User user = request.getPayload();
       user.validate();
       User saved = userRepository.save(user);
       request.getResponseBuilder().send(saved);
     }
 }

Note that the only difference is the gateway type. PlexService also supports variables in end-points, which are populated from configuration. For example, you may use a scope variable to create different queues/topics for different developers or environments. PlexService will serialize POJO classes into JSON when delivering messages over JMS.

Defining a REST service with parameterized URLs

PlexService allows developers to define service URIs that contain variables. These variables are then populated from the actual requests, which is useful for implementing REST services, e.g.

 @ServiceConfig(gateway = GatewayType.HTTP, requestClass = BugReport.class, 
       rolesAllowed = "Employee", endpoint = "/projects/{projectId}/bugreports", 
       method = Method.POST, 
       codec = CodecType.JSON)
 public class CreateBugReportService extends AbstractBugReportService implements RequestHandler {
     public CreateBugReportService(BugReportRepository bugReportRepository,
         UserRepository userRepository) {
       super(bugReportRepository, userRepository);
     }
 
     @Override
       public void handle(Request request) {
         BugReport report = request.getPayload();
         report.validate();
         BugReport saved = bugReportRepository.save(report);
         request.getResponseBuilder().send(saved);
       }
 }

Here is an example of invoking this service from curl:

 
 curl --cookie cookies.txt -k -H "Content-Type: application/json" -X POST "http://127.0.0.1:8181/projects/2/bugreports" -d "{\"title\":\"As an administrator, I would like to assign roles to users so that they can perform required actions.\",\"description\":\"As an administrator, I would like to assign roles to users so that they can perform required actions.\",\"bugNumber\":\"story-201\",\"assignedTo\":\"mike\",\"developedBy\":\"mike\"}"
 

Using variables with Websocket based service

You can also use variables in Websocket endpoints, similar to JMS, which are initialized from configuration parameters.

 @ServiceConfig(gateway = GatewayType.WEBSOCKET, requestClass = BugReport.class, 
       rolesAllowed = "Employee", endpoint = "{variable}-create-bugreport-service-channel", 
       method = Method.MESSAGE, codec = CodecType.JSON)
 public class CreateBugReportService extends AbstractBugReportService implements
         RequestHandler {
     public CreateBugReportService(BugReportRepository bugReportRepository,
             UserRepository userRepository) {
         super(bugReportRepository, userRepository);
     }
 
     @Override
     public void handle(Request request) {
         BugReport report = request.getPayload();
         report.validate();
         BugReport saved = bugReportRepository.save(report);
         request.getResponseBuilder().send(saved);
     }
 
 }

Here is another example of consuming a Websocket based service from javascript:

 var ws = new WebSocket("ws://127.0.0.1:8181/users");
 ws.onopen = function() {
   var req = {"payload":{"title":"my title", "description":"my description","bugNumber":"story-201", "assignedTo":"mike", "developedBy":"mike"},"PlexSessionID":"4", "endpoint":"/projects/2/bugreports/2/assign", "method":"POST"};
   ws.send(JSON.stringify(req));
 };
 
 ws.onmessage = function (evt) {
   alert("Message: " + evt.data);
 };
 
 ws.onclose = function() {
 };
 
 ws.onerror = function(err) {
 };

Defining a REST service for querying users

Here is an example REST service that uses a GET request to query users:

   @ServiceConfig(gateway = GatewayType.HTTP, requestClass = User.class, 
       rolesAllowed = "Administrator", endpoint = "/users", method = Method.GET, 
       codec = CodecType.JSON)
   public class QueryUserService extends AbstractUserService implements
   RequestHandler {
     public QueryUserService(UserRepository userRepository) {
       super(userRepository);
     }
     @Override
       public void handle(Request request) {
         Collection<User> users = userRepository.getAll(new Predicate<User>() {
             @Override
             public boolean accept(User u) {
             return true;
             }
             });
         request.getResponseBuilder().send(users);
       }
   }

Here is how you can invoke this service from curl

 
 curl --cookie cookies.txt -k -H "Content-Type: application/json" "http://127.0.0.1:8181/users"   

which would return a JSON array such as:

 
 [{"id":2,"username":"alex","email":"alex@plexobject.com","roles":["Employee"]},{"id":3,"username":"jeff","email":"jeff@plexobject.com","roles":["Employee","Manager"]},{"id":4,"username":"scott","email":"scott@plexobject.com","roles":["Employee","Administrator","Manager"]},{"id":5,"username":"erica","email":"erica@plexobject.com","roles":["Employee"]}]

Defining a JMS service for querying users

Here is an example of defining the query-users service via JMS:

 @ServiceConfig(gateway = GatewayType.JMS, requestClass = User.class, 
       rolesAllowed = "Administrator", endpoint = "queue:{scope}-query-user-service-queue", 
       method = Method.MESSAGE, 
       codec = CodecType.JSON)
 public class QueryUserService extends AbstractUserService implements RequestHandler {
     public QueryUserService(UserRepository userRepository) {
       super(userRepository);
     }
     @Override
       public void handle(Request request) {
         Collection<User> users = userRepository.getAll(new Predicate<User>() {
             @Override
             public boolean accept(User u) {
             return true;
             }
             });
         request.getResponseBuilder().send(users);
       }
 }

The end-point can contain variables such as scope that are initialized from configuration.

Registering services and starting service container

You will need to register services with ServiceRegistry at runtime, which would initialize and start those services, e.g.

 Collection<RequestHandler> services = new HashSet<>();
 services.add(new CreateUserService(userRepository));
 services.add(new UpdateUserService(userRepository));
 services.add(new QueryUserService(userRepository));
 services.add(new DeleteUserService(userRepository));
 services.add(new LoginService(userRepository));
 services.add(new CreateProjectService(projectRepository, userRepository));
 services.add(new UpdateProjectService(projectRepository, userRepository));
 services.add(new QueryProjectService(projectRepository, userRepository));
 services.add(new AddProjectMemberService(projectRepository, userRepository));
 services.add(new RemoveProjectMemberService(projectRepository, userRepository));
 services.add(new CreateBugReportService(bugreportRepository, userRepository));
 services.add(new UpdateBugReportService(bugreportRepository, userRepository));
 services.add(new QueryBugReportService(bugreportRepository, userRepository));
 services.add(new QueryProjectBugReportService(bugreportRepository, userRepository));
 
 services.add(new AssignBugReportService(bugreportRepository, userRepository));
 serviceRegistry = new ServiceRegistry(config, services, new BuggerRoleAuthorizer(userRepository));
 serviceRegistry.start();
 

Creating Http to JMS bridge

You may choose to write all services over JMS and then expose them via HTTP using the bridge provided by PlexService, e.g.

   final String mappingJson = IOUtils.toString(new FileInputStream( args[1]));
 Collection<HttpToJmsEntry> entries = new JsonObjectCodec().decode(
     mappingJson, new TypeReference<List<HttpToJmsEntry>>() {
     });
 WebToJmsBridge bridge = new WebToJmsBridge(new Configuration(args[0]), entries, GatewayType.HTTP);
 bridge.startBridge();

Creating Websocket to JMS bridge

Similarly, you may expose JMS services via a Websocket based transport using the bridge:

 
   final String mappingJson = IOUtils.toString(new FileInputStream( args[1]));
 Collection<HttpToJmsEntry> entries = new JsonObjectCodec().decode(
     mappingJson, new TypeReference<List<HttpToJmsEntry>>() {
     });
 WebToJmsBridge bridge = new WebToJmsBridge(new Configuration(args[0]), entries, GatewayType.WEBSOCKET);
 bridge.startBridge();

Here is the JSON configuration for the bridge:

[
   {"codecType":"JSON","path":"/projects/{projectId}/bugreports/{id}/assign","method":"POST",
     "destination":"queue:{scope}-assign-bugreport-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects/{projectId}/bugreports","method":"GET",
     "destination":"queue:{scope}-query-project-bugreport-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/users","method":"GET",
     "destination":"queue:{scope}-query-user-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects","method":"GET",
     "destination":"queue:{scope}-query-projects-service","timeoutSecs":30},
   {"codecType":"JSON","path":"/bugreports","method":"GET",
     "destination":"queue:{scope}-bugreports-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects/{id}/membership/add","method":"POST",
     "destination":"queue:{scope}-add-project-member-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects/{id}/membership/remove","method":"POST",
     "destination":"queue:{scope}-remove-project-member-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects/{projectId}/bugreports","method":"POST",
     "destination":"queue:{scope}-create-bugreport-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/users","method":"POST",
     "destination":"queue:{scope}-create-user-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects","method":"POST",
     "destination":"queue:{scope}-create-projects-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/users/{id}","method":"POST",
     "destination":"queue:{scope}-update-user-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/users/{id}/delete","method":"POST",
     "destination":"queue:{scope}-delete-user-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects/{id}","method":"POST",
     "destination":"queue:{scope}-update-project-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/projects/{projectId}/bugreports/{id}","method":"POST",
     "destination":"queue:{scope}-update-bugreport-service-queue","timeoutSecs":30},
   {"codecType":"JSON","path":"/login","method":"POST",
     "destination":"queue:{scope}-login-service-queue","timeoutSecs":30}]

Defining a Streaming Quotes Service over Websockets

Suppose you are building a high-performance streaming service for providing real-time stock quotes; you can easily build it using the PlexService framework, e.g.

@ServiceConfig(gateway = GatewayType.WEBSOCKET, requestClass = Void.class, endpoint = "/quotes", method = Method.MESSAGE, codec = CodecType.JSON)
 public class QuoteServer implements RequestHandler {
     public enum Action {
         SUBSCRIBE, UNSUBSCRIBE
     }
 
     static final Logger log = LoggerFactory.getLogger(QuoteServer.class);
 
     private QuoteStreamer quoteStreamer = new QuoteStreamer();
 
     @Override
     public void handle(Request request) {
         String symbol = request.getProperty("symbol");
         String actionVal = request.getProperty("action");
         log.info("Received " + request);
         ValidationException
                 .builder()
                 .assertNonNull(symbol, "undefined_symbol", "symbol",
                         "symbol not specified")
                 .assertNonNull(actionVal, "undefined_action", "action",
                         "action not specified").end();
         Action action = Action.valueOf(actionVal.toUpperCase());
         if (action == Action.SUBSCRIBE) {
             quoteStreamer.add(symbol, request.getResponseBuilder());
         } else {
             quoteStreamer.remove(symbol, request.getResponseBuilder());
         }
     }
 
     public static void main(String[] args) throws Exception {
         Configuration config = new Configuration(args[0]);
         QuoteServer service = new QuoteServer();
         Collection<RequestHandler> services = new ArrayList<>();
         services.add(new QuoteServer());
         //
         ServiceRegistry serviceRegistry = new ServiceRegistry(config, services, null);
         serviceRegistry.start();
         Thread.currentThread().join();
     }
 }

The above example defines a service that listens on Websockets and responds to subscribe or unsubscribe requests from web clients.

You can define a mock QuoteStreamer as follows, which periodically sends quotes to all subscribers:

public class QuoteStreamer extends TimerTask {
     private int delay = 1000;
     private Map<String, Collection<ResponseDispatcher>> subscribers = new ConcurrentHashMap<>();
     private QuoteCache quoteCache = new QuoteCache();
     private final Timer timer = new Timer(true);
 
     public QuoteStreamer() {
         timer.schedule(this, delay, delay);
     }
 
     public void add(String symbol, ResponseDispatcher dispatcher) {
         symbol = symbol.toUpperCase();
         synchronized (symbol.intern()) {
             Collection<ResponseDispatcher> dispatchers = subscribers
                     .get(symbol);
             if (dispatchers == null) {
                 dispatchers = new HashSet<ResponseDispatcher>();
                 subscribers.put(symbol, dispatchers);
             }
             dispatchers.add(dispatcher);
         }
     }
 
     public void remove(String symbol, ResponseDispatcher dispatcher) {
         symbol = symbol.toUpperCase();
         synchronized (symbol.intern()) {
             Collection<ResponseDispatcher> dispatchers = subscribers
                     .get(symbol);
             if (dispatchers != null) {
                 dispatchers.remove(dispatcher);
             }
         }
     }
 
     @Override
     public void run() {
         for (Map.Entry<String, Collection<ResponseDispatcher>> e : subscribers
                 .entrySet()) {
             Quote q = quoteCache.getLatestQuote(e.getKey());
             Collection<ResponseDispatcher> dispatchers = new ArrayList<>(
                     e.getValue());
             for (ResponseDispatcher d : dispatchers) {
                 try {
                     d.send(q);
                 } catch (Exception ex) {
                     remove(e.getKey(), d);
                 }
             }
         }
     }
 }

Here is a sample javascript/html client, which allows users to subscribe to different stock symbols:

       var ws = new WebSocket("ws://127.0.0.1:8181/quotes");
       ws.onopen = function() {
       };
       var lasts = {};
       ws.onmessage = function (evt) {
         //console.log(evt.data);
         var quote = JSON.parse(evt.data).payload;
         var d = new Date(quote.timestamp);
         $('#time').text(d.toString());
         $('#company').text(quote.company);
         $('#last').text(quote.last.toFixed(2));
         var prev = lasts[quote.company];
         if (prev != undefined) {
           var change = quote.last - prev;
           if (change >= 0) {
             $('#change').css({'background-color':'green'});
           } else {
             $('#change').css({'background-color':'red'});
           }
           $('#change').text(change.toFixed(2));
         } else {
           $('#change').text('N/A');
         }
         lasts[quote.company] = quote.last;
       };
 
       ws.onclose = function() {
       };
 
       ws.onerror = function(err) {
       };
       function send(payload) {
         $('#input').text(payload);
         ws.send(payload);
       }
       $(document).ready(function() {
         $("#subscribe").click(function() {
           var symbol = $("#symbol").val();
           var req = {"endpoint":"/quotes", "symbol":symbol, "action":"subscribe"};
           send(JSON.stringify(req));
         });
       });
       $(document).ready(function() {
         $("#unsubscribe").click(function() {
           var symbol = $("#symbol").val();                                                                                            
           var req = {"endpoint":"/quotes", "symbol":symbol, "action":"unsubscribe"};
           send(JSON.stringify(req));
         });
       });
     </script>
 
   <body>
     <form>
       Symbol:<input type="text" id="symbol" value="AAPL" size="4" />
       <input type="button" id="subscribe" value="Subscribe"/>
       <input type="button" id="unsubscribe" value="Unsubscribe"/>
     </form>
 
     <br>
 
     <table id="quotes" class="quote" width="600" border="2" cellpadding="0" cellspacing="3">
       <thead>
         <tr>
           <th>Time</th>
           <th>Company</th>
           <th>Last</th>
           <th>Change</th>
         </tr>
       </thead>
       <tbody>
         <tr>
           <td id="time"></td>
           <td id="company"></td>
           <td id="last"></td>
           <td id="change"></td>
         </tr>
       </tbody>
     </table>
   </body>

PlexService includes this sample code; you can start the streaming quote server by running the “quote.sh” command and then open the quote.html file in your browser.

Using JMX

PlexService uses JMX to expose key metrics and lifecycle methods to start or stop services. You can use jconsole to access the JMX controls, e.g.

 
 jconsole localhost:9191

Summary

PlexService comes with a full-fledged sample application under the plexsvc-sample folder, and you can browse the JavaDocs to view the APIs.

 