Shahzad Bhatti Welcome to my ramblings and rants!

October 6, 2019

Production Deployment Best Practices

Filed under: Computing,Project Management,Technology — admin @ 1:37 pm

Following are a few best practices I have learned over the years for deploying to production environments, especially when running multiple versions of the software across multiple data centers:

Coding/Debugging Best Practices

Naming Threads

When using the Java language, name your background threads so that logs show meaningful thread names, and mark them as daemons so that they are automatically shut down when the Java process terminates.
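
For example, here is a minimal, self-contained sketch of naming a background thread and marking it as a daemon (the worker logic and thread name are placeholders):

[codesyntax lang="java"]
public class DaemonWorkerExample {
    public static void main(String[] args) throws InterruptedException {
        // Give the thread a meaningful name so it shows up in logs and thread dumps,
        // and mark it as a daemon so it shuts down automatically with the JVM.
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // ... poll a queue or do other periodic background work ...
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "background-poller");
        worker.setDaemon(true);
        worker.start();
        Thread.sleep(3000); // let the daemon run briefly before the JVM exits
    }
}
[/codesyntax]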

Resource Cleanup

To avoid resource leaks, always close resources that require explicit cleanup such as I/O streams, database connections, etc.
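
For example, a try-with-resources block (the file name here is just a placeholder) guarantees that the stream is closed even if an exception is thrown:

[codesyntax lang="java"]
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ResourceCleanupExample {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the reader automatically, even on failure.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("config.properties"), StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine());
        }
    }
}
[/codesyntax]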

Backward / Forward Compatibility

When working with multiple versions of the software, it’s critical that all changes to the domain model are both backward and forward compatible, so that the new schema can be read by the old service and the old schema can be read by the new service. If possible, always deploy the changes that can read the new schema before deploying the changes that write the new schema.

Character Encoding

Always use UTF-8 for character encoding in your code and databases.
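
For example, in Java you can be explicit about the charset rather than relying on the platform default:

[codesyntax lang="java"]
import java.nio.charset.StandardCharsets;

public class Utf8Example {
    public static void main(String[] args) {
        // Always pass the charset explicitly instead of using the platform default.
        byte[] bytes = "héllo wörld".getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(roundTripped);
    }
}
[/codesyntax]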

Time Zones

Always use the UTC time zone for application code and databases.
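
For example, a small Java sketch of keeping timestamps in UTC and converting to a local zone only at the presentation layer (the display zone is arbitrary):

[codesyntax lang="java"]
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class UtcExample {
    public static void main(String[] args) {
        Instant now = Instant.now();                          // an Instant is always in UTC
        ZonedDateTime storedAs = now.atZone(ZoneOffset.UTC);  // persist and log in UTC
        ZonedDateTime displayedAs = now.atZone(ZoneId.of("America/Los_Angeles")); // convert only for display
        System.out.println(storedAs + " / " + displayedAs);
    }
}
[/codesyntax]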

Internationalization/Localization/Accessibility

Apply i18n, l10n and accessibility standards for your user interfaces and services.

Data Validation

NullPointerExceptions and malformed data are common causes of production issues, so always validate your model, input parameters, and results for proper formats and ranges.
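
For example, a minimal validation sketch (the field names and ranges are hypothetical):

[codesyntax lang="java"]
import java.util.Objects;

public class OrderValidator {
    public static void validate(String customerId, int quantity) {
        // Fail fast with a clear message instead of letting a NullPointerException
        // surface deep inside the business logic.
        Objects.requireNonNull(customerId, "customerId must not be null");
        if (customerId.trim().isEmpty()) {
            throw new IllegalArgumentException("customerId must not be blank");
        }
        if (quantity < 1 || quantity > 1000) {
            throw new IllegalArgumentException("quantity out of range: " + quantity);
        }
    }
}
[/codesyntax]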

Failure Cases

Think about all failure and edge cases that your code may encounter.

Operational Best Practices

Use multiple stages for testing

Test your changes in multiple stages such as QA, UAT, alpha, beta, and gamma, where you can properly bake and test them. The size and variation of data should increase with each stage so that you test as close to the production data and environment as possible. The testing should include load and performance testing to detect any impact on latency and availability. Further, you can use canary testing to release changes to a subset of users or data centers so that you can compare the impact of changes before rolling them out to all customers. If you deploy to multiple regions, you can start with a smaller or low-risk region for the initial production release.

Check Calendar before the release

Check the calendar for any holidays or major infrastructure changes that might impact the release.

Automate Automate Automate

Avoid any manual changes to the build/deployment process.

Infrastructure as Code

Apply best practices and tools from infrastructure as code so that all changes are applied consistently.

Error Logging

Use an appropriate log level for production logs and log all exceptions, including stack traces. You can log additional input parameters, but ensure they don’t include any sensitive data.
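
For example, a sketch using SLF4J (assuming SLF4J is your logging facade; the service and field names are hypothetical). Passing the exception as the last argument logs the full stack trace:

[codesyntax lang="java"]
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void charge(String orderId) {
        try {
            // ... call the payment gateway ...
        } catch (RuntimeException e) {
            // The exception goes last so SLF4J logs its stack trace; log identifiers
            // such as orderId, never sensitive data such as card numbers.
            log.error("charge failed for orderId={}", orderId, e);
            throw e;
        }
    }
}
[/codesyntax]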

Deploy domain objects and data access code together

To avoid schema mismatch errors, always deploy domain object and data access changes together. In addition, you can ignore unknown properties in the model to make these changes forward compatible. If domain and data access changes cannot be released together, release the domain changes first and then the data access changes.

Plan for Rollback

Always test rolling back your changes, and include multiple scenarios if changes were released in a staggered mode where domain changes and data access changes were published separately. You can implement automated rollbacks when releasing changes incrementally so that problematic changes are immediately blocked before reaching the next stage.

SLA/SLO

Monitor SLAs/SLOs at each stage of the release so that you can fix any violations of these metrics.

Change Management

Apply best practices from change management to track any high-risk or major changes to the production environment.

Security Policies

Apply zero-trust security best practices and the principle of least privilege in all test and production environments.

Health Checks / Observability

Collect metrics for health checks, failure rates, usage, etc., so that you are immediately notified of any suspicious or peculiar activity.

Chaos Testing

Apply chaos testing and game days to simulate various failure conditions to verify the behavior of your services and your reaction to those failures.

Integrate/Deploy often with small changes

Continuously integrate and deploy small changes to reduce the risk that comes with major changes, so that you can test your changes safely in the production environment.

Feature Flags and Dark Launch

Use feature flags judiciously to launch features that are not yet available to all users so that you can test those changes for a subset of users.
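
For example, a minimal sketch of gating a dark-launched code path behind a flag (the flag store and flag name are hypothetical; real deployments usually pull flags from a configuration service):

[codesyntax lang="java"]
import java.util.Set;

public class CheckoutController {
    private final Set<String> enabledFlags; // e.g. loaded per user from a config service

    public CheckoutController(Set<String> enabledFlags) {
        this.enabledFlags = enabledFlags;
    }

    public String render() {
        if (enabledFlags.contains("new-checkout-flow")) {
            return renderNewCheckout();    // dark-launched path for a subset of users
        }
        return renderLegacyCheckout();     // default path for everyone else
    }

    private String renderNewCheckout() { return "new checkout"; }
    private String renderLegacyCheckout() { return "legacy checkout"; }
}
[/codesyntax]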

July 12, 2019

Review of “Designing Data Intensive Applications”

Filed under: Computing,Technology — admin @ 4:38 pm

Designing Data-Intensive Applications is one of the best resources for building scalable systems. The book is full of comprehensive material on data processing, and the following is a summary of essential lessons from it:

The first four chapters cover the foundations of data systems. The first chapter starts with the basic terminology of reliability, scalability, maintainability, operability, simplicity, and evolvability. It recommends measuring latencies using percentiles and defining SLOs/SLAs.

The second chapter reviews data models such as the relational model, the document model, and the graph model. It reviews differences in query models such as SQL, graph queries, and SPARQL, and compares them in terms of schema evolution and performance.

The third chapter introduces basic storage concepts for relational and NoSQL databases. It shows how you can use hash indexes for looking up key-value data, which can be enhanced with compaction. It further reviews SSTable and LSM-tree structures. The Sorted String Table (SSTable) keeps key-value pairs in sorted order, where each key occurs only once within each merged segment file. These segments are later merged using merge sort and written to disk. The Log-Structured Merge-Tree (LSM-tree) is used to merge and compact the sorted files. In order to reduce lookup time, a Bloom filter can be used to quickly rule out keys that do not exist in the database. Relational databases use the B-tree data structure, which breaks the database into fixed-size pages, each identified by an address, organized into a tree structure. B-trees use a write-ahead log (WAL) to persist new data before updating the B-tree structure. Another form of index is the multi-column index, which combines several fields into one key. The chapter then differentiates between OLTP and OLAP, and between the star schema (a fact table surrounded by its dimension tables) and the snowflake schema (dimensions broken down into sub-dimensions) for analytics. Column-oriented storage can also be used for data warehouses, which leads to better compression, CPU cache usage, and vectorized processing. Some data warehouses use materialized aggregates or data cubes that provide aggregates grouped by different dimensions.

The fourth chapter covers encoding and serialization schemes such as JSON, XML, and binary encodings including MessagePack and Avro. It reviews the binary protocols of Thrift and Protocol Buffers, and how these protocols support schema evolution for backward and forward compatibility. It also describes data flow across the network using SOAP, REST/RPC, message brokers, and distributed actor frameworks.

The fifth chapter begins the second part of the book, which focuses on distributed data with an emphasis on scaling and shared-nothing architectures. It describes replication and defines leaders and followers. It starts with basic leader-based replication such as active/passive or master/slave, where writes go through the leader and reads can be served by any node. Replication can be synchronous, asynchronous, or semi-synchronous. Synchronous replication has performance issues, but research in chain replication provides good performance and is used by Azure Storage. The chapter reviews scenarios of leader or follower node failure, which may require electing a new leader but can result in the loss of data that wasn’t replicated from the old leader. Replication can use statement-based replication, shipping WAL logs, logical (row-based) replication, or trigger-based replication; statement-based replication can have side effects due to triggers and non-deterministic behavior, and version mismatches may cause incompatibilities with WAL-based replication. Asynchronous replication leads to eventual consistency, where the latest writes are not yet fully replicated, so you need read-after-write consistency, which can be addressed with read-your-writes techniques, e.g. reading from the leader or remembering the timestamp of the last write. Replication lag can cause old data to appear after new data has already been shown, and monotonic read consistency addresses that behavior. Consistent prefix reads guarantee that the order of a sequence of writes is preserved when reading those writes. Replicating with multiple leaders across data centers can have higher replication lag and may result in write conflicts between leaders; you may associate users with a specific leader based on location, or resolve conflicts using timestamps or the highest unique ID. Other forms of replication are leaderless or quorum based. Some implementations use asynchronous read repair to fix stale data and an anti-entropy process to add any missing data. Quorum-based consistency typically uses an odd number of nodes with w = r = ceil((n+1)/2), though it can still be subject to stale values (with sloppy quorums and hinted handoff) or loss of data (last writer wins). Quorum-based nodes may keep a version number for every key to preserve the order of writes (version vectors).

Chapter six covers partitioning, which works together with replication so that each node may hold more than one partition. Some partitions may be skewed, having more data than others, also referred to as hot spots. You can partition based on a key range or a hash of the key, where hashing may use consistent hashing to distribute keys fairly. Partitioning data also requires partitioning secondary indexes, or maintaining a global index that covers all partitions by term. Partitioning may also require rebalancing, where you may define many more partitions than nodes (e.g. 100 times as many) so that data only needs to be partially relocated. Reading data requires routing requests to the right nodes: the client may contact any node directly, which forwards the request to another node if needed; all requests may be sent to a routing tier; or the client may be aware of the partitioning and contact the appropriate node itself.

Chapter seven describes transactions and defines the meaning of ACID; however, it cautions against failures due to asynchronous writes, caching, disk failures (despite fsync), etc. Relational databases use transaction scope to write multiple objects atomically and in isolation, while other databases use compare-and-set operations to prevent lost updates. Due to the high cost of serializable transactions, most databases provide weak isolation levels when running multiple transactions concurrently. In order to prevent concurrency bugs, databases provide transaction isolation, including read committed, which prevents dirty reads/writes and may use row-level locks and copies of old data, and snapshot isolation/repeatable read, which prevents read skew (non-repeatable reads) using reader/writer locks and MVCC. Relational databases provide explicit locking using SELECT … FOR UPDATE to prevent lost updates; other databases use compare-and-set to avoid lost updates. These isolation levels can still lead to write skew and phantoms, which can be prevented using SELECT FOR UPDATE, e.g. for meeting reservations, choosing usernames, or preventing double spending. Serializable isolation provides the strongest guarantees, but it’s not provided by most databases and has limitations with partitioned data. Two-phase locking can be used for serializable isolation, but it suffers performance issues and can lead to deadlocks. Another alternative is predicate locks, which lock all objects matching a criterion, but they also suffer from performance issues. Other alternatives include index-range locks and serializable snapshot isolation (SSI), which uses optimistic concurrency control.
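
As a small illustration of explicit locking, here is a hedged JDBC sketch of using SELECT ... FOR UPDATE to prevent a lost update (the table and column names are hypothetical):

[codesyntax lang="java"]
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AccountDao {
    public void debit(Connection conn, long accountId, long amount) throws SQLException {
        conn.setAutoCommit(false);
        try {
            long balance;
            // Lock the row so a concurrent read-modify-write cannot interleave with ours.
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT balance FROM accounts WHERE id = ? FOR UPDATE")) {
                lock.setLong(1, accountId);
                try (ResultSet rs = lock.executeQuery()) {
                    if (!rs.next()) throw new SQLException("no such account: " + accountId);
                    balance = rs.getLong("balance");
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE accounts SET balance = ? WHERE id = ?")) {
                update.setLong(1, balance - amount);
                update.setLong(2, accountId);
                update.executeUpdate();
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
[/codesyntax]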

Chapter eight discusses network faults and failures in distributed systems. It covers cloud computing and supercomputers, where nodes communicate through RDMA/shared memory. Failures are common in most systems and can be detected by a load balancer or monitoring system; partial failures are hard to detect, and you may use timeouts to detect them. Distributed systems may use monotonic clocks (System.nanoTime) or time-of-day (NTP) clocks; however, unreliable clocks make measuring time in distributed systems error prone. NTP synchronization is not always reliable, and drift in time-of-day clocks may result in an incorrect order of events, so a last-write-wins strategy may overwrite data with an old value. Google’s TrueTime API reports a confidence interval along with the clock time, e.g. Spanner uses clock confidence intervals for snapshot isolation to create globally monotonically increasing transaction IDs. When scheduling tasks periodically, account for process delays due to GC, virtualization, disk I/O, etc. When using a lock or lease, fencing ensures there is only one leader: when the lock server grants a lock/lease, it returns a fencing token (a monotonically increasing number), and the client includes it with each request so that old requests are rejected. However, fencing tokens cannot protect against Byzantine (deliberate) faults.
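
As a small illustration of fencing, here is a hedged sketch of the storage side rejecting writes that carry a stale token (the names are hypothetical):

[codesyntax lang="java"]
public class FencedStore {
    private long highestTokenSeen = -1;

    public synchronized void write(long fencingToken, String key, String value) {
        // A delayed request from an old lock holder carries a smaller token,
        // so it cannot overwrite data written under a newer lease.
        if (fencingToken < highestTokenSeen) {
            throw new IllegalStateException("stale fencing token: " + fencingToken);
        }
        highestTokenSeen = fencingToken;
        // ... apply the write to the underlying storage ...
    }
}
[/codesyntax]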

Chapter nine explains consistency and consensus. Most replicated databases provide eventual consistency, which is a weak guarantee. Linearizability is the strongest consistency guarantee: it makes the system appear as if there were a single copy of the data, and reads cannot return old data once they have returned new data. Linearizability may use compare-and-set operations to prevent data overwrites. The chapter also distinguishes linearizability from serializability: serializability is an isolation property of transactions that guarantees a serial order of transactions, whereas linearizability guarantees recency of data on reads and writes and does not prevent write skew. Leader election, distributed locks, and unique constraints such as usernames may use linearizability to converge on a single up-to-date value. The simplest way to have a linearizable system is to keep a single copy of the data, but you need replication for fault tolerance. You can use single-leader replication, where all writes go through the leader, or a consensus algorithm, but multi-leader and leaderless (Dynamo-style) replication don’t provide linearizable guarantees. The chapter then describes the CAP theorem, where consistency (C) relates to linearizability: you give up availability if some replicas are disconnected and wait until the replicas are fixed, or a replica can remain available even if it’s disconnected from other replicas, providing availability at the cost of linearizability. The CAP theorem only considers one kind of fault, a network partition (nodes that are alive but disconnected from each other), so most highly available systems don’t meet the CAP definition. Causal order can define what happened before what, but it’s not the total order guaranteed by linearizability, and causally related operations can only be partially ordered. Linearizability is stronger than causality, but most systems only need causal consistency, which tracks which operation happened before another. A sequence number or timestamp (a logical clock) can be used to order operations and can be incremented by a single leader. Other systems can preallocate blocks of sequence numbers, attach timestamps, or generate local sequence numbers, but these are not consistent with causality. The chapter then reviews Lamport timestamps (logical clocks) that enforce a total order (distinct from version vectors). You need a leader to sequence all operations on a single CPU to guarantee total order broadcast (atomic broadcast), but that is not scalable; a leader per partition can maintain ordering per partition, but it doesn’t guarantee total ordering. Total order broadcast requires reliable delivery and totally ordered delivery (messages delivered in the same order to all nodes), and it can be used to implement serializable transactions. It can also be used to implement a lock service that provides fencing tokens. You can implement a uniqueness constraint such as a username by appending a message to the log, reading the log, and waiting for your message to be delivered back (in the same order to all nodes) to verify the unique identifier. But this only guarantees linearizable writes; to guarantee linearizable reads, you must sequence reads as well and only read after your message has been delivered back to you.
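
To make the logical-clock idea concrete, here is a minimal Lamport timestamp sketch: each node increments its counter on every local event, and on receiving a message it takes the maximum of its own counter and the message’s timestamp, plus one:

[codesyntax lang="java"]
import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // Called for a local event or just before sending a message.
    public long tick() {
        return counter.incrementAndGet();
    }

    // Called when a message carrying the sender's timestamp is received.
    public long onReceive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }
}
[/codesyntax]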

The next part of chapter nine describes consensus, where nodes have to agree on a leader after an election (in the presence of network faults while avoiding split brain) and on atomic commits (as in ACID). Despite the FLP result, which proved that no algorithm can be guaranteed to reach consensus in an asynchronous model if a node can crash, distributed systems achieve consensus in practice. The chapter reviews two-phase commit (2PC), which involves multiple nodes, as opposed to a single node that simply logs the data before committing it. 2PC uses a coordinator/transaction manager that creates a globally unique transaction ID and tracks two phases: prepare and commit. The coordinator must write transactions to its log before committing in case it crashes, and nodes may have to wait for the coordinator indefinitely. Three-phase commit assumes bounded delays/timeouts to prevent blocking atomic commits. The chapter then distinguishes between internal database transactions and distributed transactions, where distributed transactions such as XA transactions guarantee exactly-once processing atomically. On the downside, the coordinator in distributed transactions is a single point of failure, and such transactions limit scalability. Fault-tolerant consensus requires uniform agreement, integrity, validity, and termination to agree on the same value. The best-known algorithms that provide fault-tolerant consensus include VSR, Paxos, Raft, and Zab, which use total order broadcast, requiring messages to be delivered exactly once, in the same order, to all nodes. These algorithms use epoch numbers (monotonically increasing) for each election, and a quorum is used to agree on a value. There are a few limitations of consensus, such as synchronous replication, majority voting (a minimum of three nodes), static membership, and continuous re-elections due to partial failures. Implementations include ZooKeeper and etcd, which provide linearizable atomic operations (using compare-and-set to implement locks), total ordering of operations (using monotonically increasing fencing tokens), failure detection, and change notifications.

Chapter ten describes batch processing and is part of the third part of the book, which covers data derived from systems of record for OLAP and reporting. Batch processing is offline processing, as opposed to real-time services or near-real-time stream processing. The chapter starts with basic Unix tools that use pipes and files for batch processing. It then reviews MapReduce frameworks such as Hadoop and distributed file systems such as HDFS, GlusterFS, and QFS. The mapper extracts a key and value from each input record and generates a number of key-value pairs, whereas the reducer takes the key-value pairs and collects all values belonging to the same key. The scheduler uses the principle of “putting the computation near the data” to run mappers on the machines holding a replica of the input file. MapReduce often requires workflow systems to manage dependencies, such as Oozie, Azkaban, Luigi, Airflow, and Pinball. These frameworks use grouping to merge or collate related data; however, you can use a skewed join when the data for a group is too large to fit on a single machine, so that the work can be parallelized across multiple reducers. Mappers can use broadcast hash joins and partitioned joins when working with large data. The chapter also discusses dataflow engines like Spark, Tez, and Flink, which provide operators and a more flexible way to create data pipelines. It also reviews graph processing systems such as Apache Giraph and Spark GraphX, which support the bulk synchronous parallel (BSP) model of computation, in which messages are sent from one vertex to all connecting vertices, similar to the actor model if you think of each vertex as an actor.

Chapter eleven discusses stream processing for near-real-time processing. Stream processing transmits event streams using messaging systems and the pub/sub model. The messaging system may use UDP, brokerless libraries such as ZeroMQ, or message brokers with queue support. Message brokers support load balancing, where a single message is delivered to one of the consumers sharing the work (a shared subscription); alternatively, they use fan-out, where each message is delivered to all consumers. Message brokers use acknowledgements to remove a message once it has been processed by a consumer, and the broker delivers the message again in the event of a connection failure, possibly using an atomic commit protocol. Some message brokers such as Apache Kafka use append-only logs for incoming messages; these logs can be partitioned, and each message gets a monotonically increasing sequence number within its partition (without an ordering guarantee across partitions). Consumers read the log files sequentially by specifying an offset. There are some limitations of these log-based brokers: the number of nodes sharing the work can be at most the number of log partitions in a topic, and if a single message is slow to process, it holds up processing in that partition. Some implementations may use a circular buffer to store messages on disk, but if consumers cannot keep up with producers, old messages may be dropped. The chapter discusses change data capture for syncing data, which captures the changes in the database and applies the same changes to, for example, a search index. You can also use compacted logs to sync the data by starting from offset 0 and scanning over all messages. The chapter reviews event sourcing, which stores all changes to the application state as a log of change events. Event sourcing distinguishes between events and commands: after a command is validated, it becomes an event that is durable and immutable. This use of immutable data is related to the command query responsibility segregation (CQRS) principle.

The last chapter reviews the future of data systems and the limits of total order, which may require a single leader and can be difficult across distributed data centers and microservices. You can send all updates to the same partition, but that is not sufficient to capture causal dependencies. The chapter covers the lambda architecture and the unbundling of databases into storage technologies such as secondary indexes, replication logs, and text search indexes. It reviews exactly-once execution of operations, duplicate suppression, and operation identifiers (using a request ID to make operations idempotent).

June 18, 2019

Review of “Designing Distributed Systems”

Filed under: Computing,Technology — admin @ 11:34 am

The “Designing Distributed Systems” book provides design patterns for building distributed systems with the support of container technologies such as Kubernetes. The book consists of three sections: the first focuses on single-node patterns, the second on long-running services, and the third on batch computation.

Sidecar

The first pattern in the book introduces the sidecar pattern for modularity and reusability, where a single application requires two containers: an application container and a sidecar container. The sidecar container provides additional functionality such as adding an SSL proxy for the service or collecting metrics for the application container. The sidecar container can be configured via a dynamic configuration service.

Ambassadors

The ambassador pattern introduces an ambassador container that sits between the application and external services, and all incoming/outgoing traffic goes through it. It also helps with modularity and reusability, as the ambassador may abstract a sharded service (or A/B testing) so that the client or service itself doesn’t need to know all the details. You may also use an ambassador container for service brokering, where it looks up an external service and connects to it.

Anti-corruption layer

The anti-corruption layer integrates two systems that don’t share the same semantic data model.

Adapters

The adapter pattern uses a special container to modify the interface of the application container, e.g. you can deploy a monitoring adapter to automatically collect health metrics using Prometheus or other tools. Similarly, you may use an adapter container to collect Kubernetes logs (stdout/stderr) and reformat them before sending them to a log aggregator (e.g. Fluentd).

Replicated Load-Balanced Services

This pattern is part of the long-running services section, where a load balancer is added in front of the service for scalability. Each service is designed to be stateless so that requests can be sent to any replica behind the load balancer. Each service needs to provide a readiness probe so that the load balancer knows whether it can serve requests. In some cases, you may need to support session-tracked services where user requests are routed to the same replica using sticky sessions or a consistent hashing function. You may add a caching layer that is deployed along with your service container (as a sidecar). Further, you may need to provide rate limiting and protect against DoS attacks (e.g. X-RateLimit-Remaining headers). This pattern can also implement SSL termination, where external traffic is encrypted with a different certificate than internal traffic (e.g. using Varnish).

Sharded Services

This pattern partitions the traffic so that each shard serves a subset of all requests. As opposed to replicated services, which are generally used for stateless services, sharded services are used for building stateful services. You may use a sharded cache for each shard that sits between the user and the front end to optimize end-user performance and latency. You may add replicas for each shard for further redundancy and scalability. Sharding requires selecting a key to route the traffic, e.g. you may use the IP address or a consistent hash function to avoid remapping when new shards are added. If one of the shards becomes hot, you can add a replicated sharded cache to handle the increased load.

Scatter/Gather

The scatter/gather pattern adds parallelism to servicing requests, where work is broken up and farmed out to multiple services and the results are then aggregated before returning to the user. For example, you can implement distributed document search by farming requests out to multiple leaf machines that return matching documents, while the root node aggregates the results. You can also add support for sharded data by searching each shard in parallel, with the root node generating the union of all documents returned by each shard (leaf node). One downside of this pattern is the straggler problem: total response time depends on the slowest response, so you may need to replicate each shard to improve computational power.

Functions and Event-Driven Processing

This pattern is used to implement function-as-a-service (FaaS) products. FaaS simplifies development and deployment, as the code is managed and scaled automatically. However, FaaS requires that you decouple your application into small parts that can run independently. FaaS uses event systems to communicate between functions or to create a data pipeline. You can use external data services for storing state that is shared by these functions.

Ownership Election

This pattern helps in multi-node environments where a specific task must be owned by a single process. For example, when you have multiple replicas, you may need to elect a master using a consensus algorithm such as Paxos or Raft, or a framework such as etcd, ZooKeeper, or Consul. You can use distributed locks to implement ownership (optionally with a lease or TTL). You may need to verify that you still hold the lock before proceeding, e.g.

func (l *Lock) isLocked() bool {
	// Treat the lock as held only through 75% of the TTL to leave a safety margin.
	return l.locked && l.lockTime+0.75*l.ttl > now()
}

Work Queue Systems

This pattern is part of the batch computation section and handles work items within a certain amount of time. You may use a work-queue manager container along with an ambassador container to connect to an external queue source, where the source might be a storage API, network storage, or a pub/sub system like Kafka or Redis. Once the queue manager receives a work item, it launches a worker container. Kubernetes provides a Job object that allows for the reliable execution of work queue items. In order to limit the number of worker containers running concurrently, you can limit the number of Job objects that your work queue is willing to create. You can also use the multi-worker pattern, where different worker containers are combined into a single unified container that implements the worker interface.

Event-Driven Batch Processing

This pattern allows data pipelining, where the output of one work queue becomes the input to another; such pipelines are referred to as workflow systems. Here are some patterns of event-driven processing:
Copier: This pattern simply duplicates a work item into two or more identical streams.
Filter: This pattern reduces a stream of work items to a smaller stream by filtering out items that don’t meet particular criteria.
Splitter: This works like a filter, but instead of eliminating input, it sends different inputs to different queues based on criteria.
Sharder: This is a more generic form of splitter; it splits work items across queues based on a sharding function.
Merger: This is the opposite of copier; it merges two different work queues into a single work queue.

You may use pub/sub API to communicate between different workers.

Coordinated Batch Processing

This is similar to the reduce part of the MapReduce pattern, where work is broken up and distributed to multiple nodes in parallel. You may need a join or barrier synchronization to wait for intermediate results before proceeding to the next stage of the workflow. For example, the reduce phase merges several outputs into a single output.

Claim-check

Instead of sending large messages to a messaging queue, you can store the payload in an external storage service and send a reference to it in the messaging event.

Index Table

If the data store doesn’t support secondary indexes, you can define a separate index table where the primary key is the secondary field you query by, such as a customer phone number. You can use the same partition key for both the fact table and the index tables so that they are colocated.

Compensating Transaction

When invoking several external services under an eventually consistent model where any step can fail, you can use a compensating transaction to undo the operation and revert the changes.

Saga distributed transactions

The saga pattern uses a sequence of transactions to invoke multiple services; if a step fails, it executes compensating transactions.

Scheduler Agent Supervisor

This pattern coordinates distributed actions as a transaction and undoes the work if any action fails. It uses a scheduler to orchestrate the steps of the transaction, an agent to invoke a remote service, and a supervisor to monitor the task being performed.

Competing Consumers

This allows multiple concurrent consumers to receive messages from the same messaging queue, improving scalability, reliability, and availability.

Command and Query Responsibility Segregation (CQRS)

CQRS separates query and update operations for a data store: commands store task/transaction-based data in the data store, while the query model is optimized for high-performance reads.

Event Sourcing

Event sourcing stores the state of a domain object in an append-only store as a sequence of events that are applied to update the application state. This allows better performance, scalability, audit trails, and compensating actions.

Asynchronous Request-Reply

Front-end applications often work with synchronous APIs, and this pattern uses a request-reply facade to decouple them from asynchronous backend APIs.

Retry

The retry pattern allows recovering from transient failures by retrying the failed operation with exponential backoff and jitter.
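
Here is a minimal sketch of a retry helper with exponential backoff and jitter (the base delay and attempt limit are arbitrary):

[codesyntax lang="java"]
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class Retry {
    public static <T> T withBackoff(Callable<T> operation, int maxAttempts) throws Exception {
        long baseDelayMillis = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                // Exponential backoff (100ms, 200ms, 400ms, ...) plus random jitter
                // so that many clients don't retry in lockstep.
                long exponential = baseDelayMillis * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(exponential / 2 + 1);
                Thread.sleep(exponential + jitter);
            }
        }
    }
}
[/codesyntax]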

Circuit Breaker

When a network request fails due to transient errors, the circuit breaker can be used to prevent an application from repeatedly retrying the failed operation.

Bulkhead

In the bulkhead pattern, components of an application are isolated so that a failure in one component doesn’t cause cascading failures. For example, you may use different database connection pools, thread pools, or rate limits for different types of requests.

Orchestration vs Choreography

Microservices can communicate with other services using an orchestration model, where a centralized service orchestrates requests to other services and acknowledges all operations. Alternatively, you can use message-driven or asynchronous messaging to implement a choreography design.

Deployment Stamps

In order to scale deployments by tenants or groups, you can use a deployment stamp or scale unit to host multiple instances of the application. This provides scalability, sharding, and separation of data per subset of customers.

External Configuration Store

The external configuration store allows storing application configuration in a centralized location that can be cached and shared across applications.

Edge Workload Configuration

The edge workload configuration pattern is often used with IoT deployments to allow scaling of data stores and configuration at the edge. For example, you may define an edge configuration controller and an edge configuration data store to host device configurations that are updated asynchronously from the cloud configuration controller and data store.

Federated Identity

Federated identity delegates authentication to an external identity provider so that user management, authentication, and authorization can be simplified.

Gatekeeper

The gatekeeper acts as a broker between client and services so that requests can be sanitized, throttled and authenticated.

Gateway Aggregation

A client may need to invoke multiple data services to aggregate application state; instead, gateway aggregation uses a gateway to invoke the multiple services on the client’s behalf.

Gateway Offloading

When communicating with a service that has unique requirements for connections or security, you can use a gateway to proxy all requests so that other services don’t need to implement all those measures.

Gateway Routing

This allows a client to use a single endpoint to access all services, which can then be authenticated, throttled, and validated consistently.

Geode

This allows deploying applications to geographically distributed nodes or regions in an active/active style so that latency and availability can be improved.

Health Endpoint Monitoring

This allows checking application health from external tools for monitoring purposes.

Leader Election

In order to avoid conflicts, a single task instance can be elected as a leader so that it coordinates actions of other subordinate tasks.

Pipes and Filters

The pipes and filters pattern decomposes a task into a series of components that can be shared and reused, improving performance, scalability, and reusability.

Rate Limiting

Rate limiting controls resource consumption based on limits so that you can predict throughput accurately.

Sequential Convoy

This allows processing a group of messages in sequential order by using consumers that are partitioned by the group identifier.

Valet Key

The valet key acts as a token to provide access to restricted resource. For example, the client first requests a valet key, which is then used to access the resource within the lease-time.

March 6, 2018

Tips from the second edition of “Release It!”

Filed under: Design,Methodologies,Technology — admin @ 4:19 pm

The first edition of “Release It!” has been one of the most influential books that I have read; it introduced a number of methods for writing fault-tolerant systems, such as the circuit-breaker and bulkhead patterns, so I was excited to read the second edition when it came out. Here are a few tips from the book:

In the first chapter, the author defines stability in terms of robustness, i.e., “A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing.” He recommends focusing on longevity bugs such as resource leaks, e.g. setting timeouts when invoking a network or database operation and detecting dead connections before reading/writing. The author cautions against tightly coupled systems that can easily propagate failures to other parts of the system. Similarly, a high number of integration points increases the probability of a failure in one of those dependencies. The author suggests looking into blocking calls that can lead to deadlocks when using multiple threads, and scrutinizing resource pools that get exhausted by blocking operations. Another cause of instability is a chain reaction from the failure of one server that increases the load on the remaining servers, which can be remedied using the bulkhead or circuit-breaker patterns. High memory usage on the server can also constrain resources, and the author recommends using external caching systems such as Redis, memcached, etc. In order to monitor the health of the system, the author suggests using a mock transaction to ensure it’s working as expected and keeping metrics on errors, especially login errors and high-latency warnings. One common self-inflicted failure is the self-denial attack caused by a marketing campaign, which can be mitigated by using a shared-nothing architecture or reducing fan-in on shared resources. Servers can also be subjected to the dogpile effect, where resource usage spikes after an upgrade, cron job, or config change; this can be mitigated by using random clock skew, adding random jitter, and using exponential backoff. Finally, the author recommends monitoring slow responses, removing unbounded result sets, and failing fast.

Here is a summary of stability patterns that the author recommends:

  • Apply timeouts to integration points and use delayed retries when failures occur.
  • Apply the circuit-breaker pattern to prevent cascading failures (along with timeouts).
  • Apply the bulkhead pattern to partition the system and contain chain-reaction failures.
  • Steady state
  • Data purging
  • Log files
  • Fail fast, restart fast and reintegrate
  • Let it crash with limited granularity, e.g. at the boundary of an actor.
  • Supervision for monitoring, restarts, etc.
  • Shed Load
  • Back-pressure – queues must be finite for finite response time
  • Governor – create a governor to slow the rate of (automated) actions so that humans can review them.

The next few chapters focus on networks, machines, and processes for building and deploying the code. The author offers several mechanisms for scaling, such as load balancing with DNS, using a service registry for upgrades and fail-over, configuration, transparency, collecting logs and metrics, etc. The author recommends shedding load when under high load, or using HTTP 503 to notify the load balancer. For example, the maximum queue length can be calculated as: (max-wait-time / mean-processing-time + 1) * processing-threads * 1.5 (see the worked example below). You can reject requests from the listen queue with a 503 error to prevent clients from reconnecting immediately. The control-plane chapter recommends tools for administration. It recommends a postmortem template covering what happened, apologizing, and committing to improvement, and it emphasizes system failures (as opposed to human errors). The author recommends adding indicators such as traffic indicators, business transactions, users, resource pool health, database connection health, data consumption, integration point health, cache health, etc.
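
As a worked example of that formula, with hypothetical numbers (a maximum acceptable wait of 2000 ms, a mean processing time of 250 ms, and 20 processing threads):

[codesyntax lang="java"]
public class QueueLengthExample {
    public static void main(String[] args) {
        double maxWaitMillis = 2000;        // longest wait we are willing to impose on a caller
        double meanProcessingMillis = 250;  // observed mean processing time
        int processingThreads = 20;

        double maxQueueLength =
                (maxWaitMillis / meanProcessingMillis + 1) * processingThreads * 1.5;

        // (2000/250 + 1) * 20 * 1.5 = 270; beyond this depth, shed load (e.g. HTTP 503).
        System.out.println(maxQueueLength);
    }
}
[/codesyntax]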

The security chapter offers standard best practices from OWASP, such as using parameterized queries to protect against SQL injection, and using high-entropy random session IDs stored in cookies to protect against session hijacking/fixation. To protect against XSS (which occurs when user input is rendered in HTML without escaping), filter input and escape it when rendering. The author recommends using a random nonce and a strict SameSite cookie policy to protect against CSRF. Similarly, the author recommends applying the principle of least privilege, access control, etc. Admin tools can offer facilities for resetting circuit breakers, adjusting connection pool sizes, disabling specific outbound integrations, reloading configuration, stopping the acceptance of load, and toggling feature flags.

For ease of deployment, the author recommends automation, immutable infrastructure, continuous deployment, and rolling changes incrementally.

The author suggests several recommendations on process and organization such as OODA loop for fast learning, functional/autonomous teams, evolutionary architecture, asynchrony patterns, loose clustering and creating options for future.

Lastly, the author offers chaos engineering as a way to test resilience of your system using Simian army or writing your own chaos monkey. In the end, the new edition offers a few additional chapters on scaling, deployment, security, and chaos engineering and more war stories from author’s consulting work.

November 15, 2016

Tips from “Algorithms to Live By”

Filed under: Algorithms,Computing — admin @ 10:51 pm

The “Algorithms to Live By” by Brian Christian and Tom Griffiths reviews computer algorithms from several domains and illustrates practical examples for applying those algorithms in real-life problems. Here is a list of some of those algorithms that I found very useful:

1. Optimal Stopping

This class of problems determines the optimal time to stop further processing when searching or selecting an option. Here are a few examples:

Secretary Hiring Problem
This is a famous math problem, defined by the mathematician Merrill Flood, whose optimal solution is the “Look-Then-Leap” rule: review the first 37% of the candidates without hiring, then hire the first candidate who is better than all of the past candidates. There are several other applications of this algorithm, such as finding a life partner or apartment hunting. This problem assumes that you cannot go back to a previous candidate once you reject them, but there are variations of the algorithm that allow it, or that account for the selected candidate rejecting your offer.
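
Here is a small simulation sketch of the Look-Then-Leap rule (candidate scores are random and 37% is the look phase); over many trials it picks the single best candidate roughly 37% of the time:

[codesyntax lang="java"]
import java.util.concurrent.ThreadLocalRandom;

public class SecretarySimulation {
    public static void main(String[] args) {
        int candidates = 100, trials = 100000, successes = 0;
        for (int t = 0; t < trials; t++) {
            double[] scores = new double[candidates];
            for (int i = 0; i < candidates; i++) {
                scores[i] = ThreadLocalRandom.current().nextDouble();
            }
            // Look phase: observe the first 37% without hiring anyone.
            int lookPhase = (int) (candidates * 0.37);
            double bestSeen = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < lookPhase; i++) {
                bestSeen = Math.max(bestSeen, scores[i]);
            }
            // Leap phase: hire the first candidate who beats everyone seen so far.
            int hired = candidates - 1; // forced to take the last one otherwise
            for (int i = lookPhase; i < candidates; i++) {
                if (scores[i] > bestSeen) { hired = i; break; }
            }
            double trueBest = Double.NEGATIVE_INFINITY;
            for (double s : scores) trueBest = Math.max(trueBest, s);
            if (scores[hired] == trueBest) successes++;
        }
        System.out.printf("picked the best candidate %.1f%% of the time%n",
                100.0 * successes / trials);
    }
}
[/codesyntax]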

Selling a House
When selling a house, you need to determine the range of expected offers and cost of waiting for the best offer.

Finding a Parking Spot
Given the percentage of parking spots that are occupied, you determine how many vacant spots you can pass up before a certain distance, after which you take the first available spot.

2. Explore/Exploit

In this chapter, the authors describe several algorithms for exploring available options and then exploiting the optimal one. Here is a sampling of the approaches based on explore/exploit:
Multi-armed bandit
Given the expected value of a slot machine (winnings / number of pulls), you need to maximize winnings. There are several approaches, such as:

  • Win-Stay
    You keep using a slot machine as long as you are winning and then switch to a different machine when you lose.
  • Gittins Index
    It is named after John Gittins, a professor at Oxford. It tries to maximize future payoffs by calculating a Gittins index for each slot machine and then selecting the machine with the highest Gittins index.
  • Regret and optimism
    Many problems in life can be defined in terms of regrets and optimism by imagining being at the deathbed and thinking of decisions that you could have made differently.
  • Upper Confidence Bound
    It is also referred to as optimism in the face of uncertainty, where you choose your actions as if the environment is as nice as is plausibly possible: given a range of plausible values, you pick the option with the highest upper confidence bound.
  • A/B Testing
    It is often used to test new features by offering the new features to a subset of the customers.

One insight the authors present is that people often explore longer than is optimal, favoring the new over the best older option.

3. Sorting

In this chapter, the authors describe several sorting algorithms and their computational cost in terms of O-notation. O-notation is generally used to indicate an algorithm’s worst-case performance, such as:

  • O(1): Constant cost
  • O(N): Linear cost
  • O(N^2): Quadratic cost
  • O(2^N): Exponential cost
  • O(N!): Factorial cost

Merge-Sort
This algorithm breaks data recursively into smaller sets until there is a single element. It then merges those subsets to create a new sorted list.

Bucket-Sort
A group of n items can be grouped into m buckets in O(nm) time, and this insight is used by bucket sort, where items are grouped into a number of sorted buckets. For example, you can use this approach to load returned books onto carts based on their shelf numbers.

Sorting is a prerequisite for searching, and there are many practical applications of sorting, such as creating matchups between teams. For example, a round-robin format where each team plays every other team would result in a lot of matches (O(N^2)). Instead, competitions such as March Madness use a merge-sort-like bracket to move from 64 teams to 32, 16, 8, 4, and the finals. However, it doesn’t produce a full ordering, as only 63 games are played rather than the 192 a full merge sort would require.

4. Caching

In computer design, John von Neumann proposed the memory hierarchy to improve lookup performance, and caches were first used in IBM 360 mainframes. Other computer researchers such as Belady designed page-replacement algorithms for deciding which pages to keep in memory when loading data from disk. There are several cache-eviction algorithms, such as First-In First-Out (FIFO), Least Recently Used (LRU), etc.

5. Scheduling

Here are a few of the scheduling algorithms described in this chapter:
Earliest Due Date Strategy
It minimizes maximum lateness by choosing the task with the earliest due date first.

Moore’s algorithm
It is similar to Earliest Due Date, but it throws out the biggest task if the new job can’t be completed by its due date.

The authors give the example of the Getting Things Done (GTD) technique for time management, where small tasks are handled first. Tasks can also have a weight or priority, and the scheduler then minimizes the sum of weighted completion times by dividing each task’s weight by its length and selecting the task with the highest density.

Here are a few issues that can arise with priority based tasks:

  • Priority Inversion – occurs when a low-priority task holds a resource while the scheduler runs a higher-priority task, which then cannot make progress. One way to address this issue is to let the low-priority task inherit the priority of the higher-priority task so it can complete.
  • Thrashing – occurs when the system grinds to a halt because work cannot be completed due to a lack of resources.
  • Context switching – Modern operating systems use context switching to work on multiple tasks, but each time slice needs to be big enough for the task to make progress. One technique to minimize context switching is interrupt coalescing, which delays hardware interrupts. Similar techniques can be applied by batching small tasks, e.g. the Getting Things Done technique encourages setting aside a chunk of time to handle similar tasks such as checking emails, making phone calls, etc.

6. Bayes’s Rule

Reverend Thomas Bayes postulated Bayes’s rule by reasoning about winning and losing lottery tickets to estimate the overall ticket pool. It was later formalized by Pierre-Simon Laplace, and a special case is commonly referred to as Laplace’s law. Laplace worked out how to use Bayes’s rule with prior knowledge in prediction problems.

Copernican Principle
Richard Gott hypothesized that the moment you observe something, it is likely to be in the middle of its lifetime.

Normal or Gaussian distribution
It has a bell curve and can be used to predict average life span.

Power-law distribution
It uses range over many scales such as the population of cities or income of people.

Multiplicative Rule
It multiplies the quantity observed by some constant factor.

Average Rule
It uses the distribution’s natural average.

Additive Rule
It predicts that things will go on just a constant amount longer, such as a “five more minutes” rule.

7. Overfitting

In machine learning, overfitting occurs when a model fits the training data too tightly to particular factors, so that it doesn’t accurately predict outcomes for data it has not observed.

Cross Validation
Overfitting can be mitigated with cross-validation, by assessing the model not just against training data but also against held-out (unseen) data.

Regularization
It adds constraints that penalize complexity.

Lasso
It uses a penalty on the total weight of the different factors to minimize complexity.

8. Relaxation

In constrained optimization problems, you need to find the best arrangement of a set of variables given a set of rules and a scoring mechanism, as in the traveling salesman problem (O(N!)). Using constraint relaxation, you remove some of the problem’s constraints, e.g. you can create a minimum spanning tree that connects all nodes in O(N^2) time. Techniques such as Lagrangian relaxation remove some of the constraints and add them to the scoring system instead.

9. Randomness

This chapter describes examples of algorithms that are based on random numbers such as:

Monte Carlo Method
It uses random samples to handle qualitatively unmanageable problems.

Hill Climbing
It takes a solution and tries to improve it by permuting some of the factors. It only accepts changes if it results in improvements. However, it may not find the globally optimal solution.

Jitter
It makes small random changes and accepts them even if they don’t improve the solution, in order to escape local optima and find a better overall solution.

Metropolis algorithm
It uses the Monte Carlo method and accepts both good and bad tweaks while trying different solutions.

Simulated Annealing
It optimizes problems analogously to annealing, by heating up and then slowly cooling off.

10. Networking

This chapter describes algorithms used in the computer network such as:

Packet switching
One of the key ideas of the Internet was packet switching, where TCP/IP sends data packets over a number of links, as opposed to the dedicated lines or circuit switching used by phone companies.

Acknowledgement
It is used to let the sender know that a packet was received. TCP uses the three-way handshake to establish a connection, and the sender resends packets if an ACK is not received.

Exponential Backoff
It increases the average delay after each successive failure.

Flow Control
TCP uses Additive Increase Multiplicative Decrease (AIMD) to gradually increase the number of packets sent, and to cut the transmission rate in half when an ACK is not received.

Bufferbloat
A buffer is a queue that stores outgoing packets, but when the queue grows large, it adds delay before ACKs are sent, which can result in redelivery. Explicit Congestion Notification can be used to address these issues.

11. Game Theory

In this chapter, the authors discuss several problems from game theory, such as:

Halting problem
This problem was first posed by Alan Turing, who showed that no program can always determine whether another program will halt or run forever.

Prisoner’s dilemma
It is based on two prisoners who are caught and have to either cooperate or work against each other. In general, defection is the dominant strategy.

Nash Equilibrium
It is a pair of strategies where neither player can improve their outcome by changing their own play, given the opponent’s strategy.

The Tragedy of the Commons
It involves a shared-resource system where an individual can act independently in a selfish manner that is contrary to the common good of all participants, e.g. voluntary environmental laws where companies are not required to obey emission levels.

Information cascade
An information cascade occurs when an individual abandons their own information in favor of other people’s actions. One application of this class of problems is auction systems. Here are a few variations of auction systems:

  • Sealed-bid – bidders are unaware of the other bids, so they have to predict the prices other bidders will use.
  • Dutch or descending auction – the bidding starts at a high price that is slowly lowered until someone accepts it.
  • English or ascending auction – the bidding starts at a low price and is then increased.
  • Vickrey auction – it is similar to sealed-bid, but the winner pays the second-place bid. It results in better valuation, as bidders are incentivized to bid their true value.

Summary

This book presents algorithms from several domains and encourages computational kindness by applying these algorithms in real life. For example, we can add constraints or reduce the number of available options when making a decision, which lowers the mental labor involved.

February 6, 2016

Building a Generic Data Service

Filed under: Web Services — admin @ 10:44 pm

As REST-based micro-services have become prevalent, I often find that web and mobile clients have to connect to many different services to gather data. You may have to call dozens of services to display data on a single screen or page. Also, you may only need a subset of the data from each service, but you still have to pay the bandwidth and parsing cost.

I created a new Java framework, PlexDataProviders, for aggregating and querying data from various underlying sources, which can be used to build a general-purpose data service. PlexDataProviders is a lightweight Java framework that abstracts access to various data providers such as databases, files, web services, etc., and allows aggregation of data from those providers.

The PlexDataProviders framework is divided into two components:

  • Data Provider – This component defines interfaces that are implemented to access data sources such as database or web services.
  • Query Engine – This component is used for querying and aggregating data.

The query engine can determine dependencies between providers, and it also allows you to use the output of one data provider as input to another data provider. For example, let’s assume:

  • data-provider A requires input-a1, input-a2 and produces output-a1, output-a2
  • data-provider B requires input-b1 and output-a1 and produces output-b1, output-b2

Then you can pass input-a1 and input-a2 to the query engine and request the output-a1, output-a2, output-b1, and output-b2 data fields.

Benefits

PlexDataProviders offers the following benefits:

  • It provides a unified way to search data and abstracts integration with the underlying data sources.
  • It helps simplify client-side logic, as clients can use a single data service to query all data instead of multiple data services.
  • It also helps with managing end-points, as you only need a single end-point instead of connecting to multiple web services.
  • As clients can specify the data they need, it helps with payload size and network bandwidth.
  • Clients only need to create a single data parser, which keeps JSON parsing logic simple.
  • As PlexDataProviders supports multi-threading, it also helps with the latency of data fetch requests.
  • It supports partial failures, so that a failure in a single data provider doesn’t affect other data providers and the data service can still return partial results.
  • It supports timeouts, so that clients can receive whatever data is available within the given timeout interval.

Data Structure

Following are primary data structures:

  • MetaField – This class defines meta information for each data field such as name, kind, type, etc.
  • MetaFieldType – This enum class defines the supported primitive data types, i.e.
    • SCALAR_TEXT – simple text
    • SCALAR_INTEGER – integer numbers
    • SCALAR_DECIMAL – decimal numbers
    • SCALAR_DATE – dates
    • SCALAR_BOOLEAN – boolean
    • VECTOR_TEXT – array of text
    • VECTOR_INTEGER – array of integers
    • VECTOR_DECIMAL – array of decimals
    • VECTOR_DATE – array of dates
    • VECTOR_BOOLEAN – array of boolean
    • BINARY – binary data
    • ROWSET – nested data rowsets
  • Metadata – This class defines a set of MetaFields used in DataRow/DataRowSet
  • DataRow – This class abstracts a row of data fields
  • DataRowSet – This class abstracts a set of rows

PlexDataProviders also supports nested structures where a data field in DataRow can be instance of DataRowSet.

Adding a Data Provider

A data provider implements the following two interfaces:

[codesyntax lang="java"]
public interface DataProducer {
    void produce(DataRowSet requestFields, DataRowSet responseFields,
            QueryConfiguration config) throws DataProviderException;
}
[/codesyntax]

Note that QueryConfiguration defines additional parameters such as:

  • pagination parameters
  • ordering/grouping
  • filtering parameters
  • timeout parameters

The timeout parameter can be used to return whatever data is available within the defined time, e.g. the query engine may invoke the underlying data providers on multiple threads, and if an underlying query takes too long, the engine returns the data that is already available.

[codesyntax lang="java"]
public interface DataProvider extends DataProducer, Comparable<DataProvider> {
    String getName();

    int getRank();

    Metadata getMandatoryRequestMetadata();

    Metadata getOptionalRequestMetadata();

    Metadata getResponseMetadata();

    TaskGranularity getTaskGranularity();
}
[/codesyntax]

Each provider defines a name, a rank (or priority, used when matching the best provider) and a set of mandatory/optional input and output data fields. The data provider can also declare its granularity as coarse-grained or fine-grained, and the implementation may execute those providers on different threads.

PlexDataProviders also provides interfaces for converting data from domain objects to DataRowSet. Here is an example of a provider implementation:

[codesyntax lang="java"]
public class SecuritiesBySymbolsProvider extends BaseProvider {
    private static Metadata parameterMeta = Metadata.from(SharedMeta.symbol);
    private static Metadata optionalMeta = Metadata.from();
    private static SecurityMarshaller marshaller = new SecurityMarshaller();

    public SecuritiesBySymbolsProvider() {
        super("SecuritiesBySymbolsProvider", parameterMeta, optionalMeta,
                marshaller.getMetadata());
    }

    @Override
    public void produce(DataRowSet parameter, DataRowSet response,
            QueryConfiguration config) throws DataProviderException {
        final String id = parameter.getValueAsText(SharedMeta.symbol, 0);
        Map<String, Object> criteria = new HashMap<>();
        criteria.put("symbol", id.toUpperCase());
        Collection<Security> securities = DaoLocator.securityDao.query(criteria);
        DataRowSet rowset = marshaller.marshal(securities);
        addRowSet(response, rowset, 0);
    }
}
[/codesyntax]

Typically, you will create a data provider for each kind of query that you want to support. Each data provider specifies the set of required and optional input data fields that can be used to generate its output data fields.

Here is an example of marshalling data from Security domain objects to DataRowSet:

[codesyntax lang="java"]
public DataRowSet marshal(Security security) {
    DataRowSet rowset = new DataRowSet(responseMeta);
    marshal(rowset, security, 0);
    return rowset;
}

public DataRowSet marshal(Collection<Security> securities) {
    DataRowSet rowset = new DataRowSet(responseMeta);
    for (Security security : securities) {
        marshal(rowset, security, rowset.size());
    }
    return rowset;
}
...
[/codesyntax]

PlexDataProviders provides a DataProviderLocator interface for registering and looking up providers, e.g.

[codesyntax lang="java"]
public interface DataProviderLocator {
    void register(DataProvider provider);

    Collection<DataProvider> locate(Metadata requestFields, Metadata responseFields);
...
}
[/codesyntax]

PlexDataProviders comes with a small application that provides data services by implementing various data providers. It uses the PlexService framework for defining the service, e.g.

[codesyntax lang="java"]
@WebService
@Path("/data")
public class DataServiceImpl implements DataService {
    private DataProviderLocator dataProviderLocator = new DataProviderLocatorImpl();
    private QueryEngine queryEngine = new QueryEngineImpl(dataProviderLocator);

    public DataServiceImpl() {
        dataProviderLocator.register(new AccountsByIdsProvider());
        dataProviderLocator.register(new AccountsByUseridProvider());
        dataProviderLocator.register(new CompaniesBySymbolsProvider());
        dataProviderLocator.register(new OrdersByAccountIdsProvider());
        dataProviderLocator.register(new PositionGroupsBySymbolsProvider());
        dataProviderLocator.register(new PositionsBySymbolsProvider());
        dataProviderLocator.register(new QuotesBySymbolsProvider());
        dataProviderLocator.register(new SecuritiesBySymbolsProvider());
        dataProviderLocator.register(new UsersByIdsProvider());
        dataProviderLocator.register(new WatchlistByUserProvider());
        dataProviderLocator.register(new SymbolsProvider());
        dataProviderLocator.register(new UsersProvider());
        dataProviderLocator.register(new SymbolSearchProvider());
    }

    @Override
    @GET
    public DataResponse query(Request webRequest) {
        final DataRequest dataRequest = DataRequest.from(webRequest.getProperties());
        return queryEngine.query(dataRequest);
    }
}
[/codesyntax]

As you can see, the data service simply builds a DataRequest from the input data fields and sends the response back to clients.

Here is an example client that passes a search query data field and requests quote data fields along with company details:

[codesyntax lang="java"]
public void testGetQuoteBySearch() throws Throwable {
    String jsonResp = TestWebUtils.httpGet("http://localhost:" + DEFAULT_PORT
                    + "/data?responseFields=exchange,symbol,quote.bidPrice,quote.askPrice,quote.sales,company.name&symbolQuery=AAPL");
    ...
[/codesyntax]

Note that the above request uses three data providers: first the SymbolSearchProvider searches for symbols matching the given query, then the symbol data field is used to request company and quote data fields from QuotesBySymbolsProvider and CompaniesBySymbolsProvider. The PlexDataProviders framework takes care of all dependency management between providers.

Here is an example JSON response from the data service:

[codesyntax lang="javascript"]
{
    "queryResponse": {
        "fields": [
            [{
                "symbol": "AAPL_X"
            }, {
                "quote.sales": [
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 56
                    }, {
                        "timeOfSale.exchange": "DOW"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 69.49132317180353
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 54
                    }, {
                        "timeOfSale.exchange": "NYSE"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 16.677774132458076
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 99
                    }, {
                        "timeOfSale.exchange": "NASDAQ"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 42.17891320885568
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 49
                    }, {
                        "timeOfSale.exchange": "DOW"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 69.61680149649729
                    }],
                    [{
                        "symbol": "AAPL_X"
                    }, {
                        "timeOfSale.volume": 69
                    }, {
                        "timeOfSale.exchange": "NYSE"
                    }, {
                        "timeOfSale.date": 1455426008762
                    }, {
                        "timeOfSale.price": 25.353316897552833
                    }]
                ]
            }, {
                "quote.askPrice": 54.99300665695502
            }, {
                "quote.bidPrice": 26.935682182171643
            }, {
                "exchange": "DOW"
            }, {
                "company.name": "AAPL - name"
            }],
            [{
                "symbol": "AAPL"
            }, {
                "exchange": "NASDAQ"
            }]
        ],
        "errorsByProviderName": {},
        "providers": ["QuotesBySymbolsProvider", "SymbolSearchProvider", "CompaniesBySymbolsProvider"]
    }
}
[/codesyntax] 

PlexDataProviders is available from github and is licensed under the liberal MIT license. It also comes with a small sample application for demo purposes. Feel free to send me your suggestions.

 

April 23, 2014

Implementing Reactive Extensions (RX) using Java 8

Filed under: Java — admin @ 11:52 pm

In my last blog, I described the new lambda support in Java 8. In order to try the new Java features in more depth, I implemented Reactive Extensions in Java 8. In short, reactive extensions allow processing synchronous and asynchronous data in a uniform manner, providing unified interfaces that can be used as an iterator or as callbacks for asynchronous processing. Microsoft’s RX library is huge, but I only implemented the core features and focused on the Observable API. Here is a brief overview of the implementation:

Creating Observable from Collection

Here is how you can create Observable from a collection:

    List<String> names = Arrays.asList("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).subscribe(System.out::println, 
       Throwable::printStackTrace, () -> System.out.println("done"));
 

In Microsoft’s version of RX, Observable takes an Observer for subscription, which defines three methods: onNext, onError and onCompleted. onNext is invoked to push the next element of data, onError is used to notify errors, and onCompleted is called when all data has been processed. In my implementation, the Observable interface defines two overloaded subscribe methods: the first takes callback functions for onNext and onError, and the second takes three callback functions including onCompleted. I chose separate function parameters instead of a single interface so that the caller can pass inline lambda functions instead of an implementation of an Observer interface.
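Roughly, the two subscribe overloads look like this (a sketch of the API shape using the standard java.util.function types; the actual declarations in the repository may use their own functional interfaces and may return a subscription handle):

    public interface Observable<T> {
        // onNext and onError callbacks only
        void subscribe(Consumer<T> onNext, Consumer<Throwable> onError);
        // onNext, onError and onCompleted callbacks
        void subscribe(Consumer<T> onNext, Consumer<Throwable> onError, Runnable onCompleted);
    }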

Creating Observable from Array of objects

Here is how you can create an Observable from an array of objects:

    Observable.from("Erica", "Matt", "John", "Mike").subscribe(System.out::println, 
         Throwable::printStackTrace, () -> System.out.println("done"));
 

Creating Observable from Stream

Here is how you can create an Observable from a stream:

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    // note third argument for onComplete is optional
    Observable.from(names).subscribe(name -> System.out.println(name), 
       error -> error.printStackTrace());
 

Creating Observable from Iterator

Here is how you can create an Observable from an iterator:

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names.iterator()).subscribe(name -> System.out.println(name), 
       error -> error.printStackTrace());
 

Creating Observable from Spliterator

Here is how you can create an Observable from a spliterator:

    List<String> names = Arrays.asList("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names.spliterator()).subscribe(System.out::println, 
       Throwable::printStackTrace);
 

Creating Observable from a single object

Here is how you can create an Observable from a single object:

    Observable.just("value").subscribe(v -> System.out.println(v), 
       error -> error.printStackTrace());
    // if the single object is a collection, it is treated as a single entity, e.g.
    Observable.just(Arrays.asList(1, 2, 3)).subscribe( num -> System.out.println(num), 
       error -> error.printStackTrace());
 

Creating Observable for an error

Here is how you can create an Observable that returns an error:

    Observable.throwing(new Error("test error")).subscribe(System.out::println, 
       error -> System.err.println(error));
    // this will print error 
 

Creating Observable from a consumer function

Here is how you can create an Observable from a user-supplied function that invokes the onNext, onError and onCompleted callbacks:

    Observable.create(observer -> {
       for (String name : names) {
          observer.onNext(name);
       }
       observer.onCompleted();
    }).subscribe(System.out::println, Throwable::printStackTrace);
 

Creating Observable from range

Here is how you can create an Observable over a range of numbers, from the start value (inclusive) up to the end value (exclusive).

    // Creates a range of numbers starting at 'from' (4) up to but excluding 'to' (8)
    Observable.range(4, 8).subscribe(num -> System.out.println(num), 
       error -> error.printStackTrace());
    // will print 4, 5, 6, 7
 

Creating empty Observable

It would call onCompleted right away:

    Observable.empty().subscribe(System.out::println, 
       Throwable::printStackTrace, () -> System.out.println("Completed"));
 

Creating never Observable

It does not call any of the callback methods:

    Observable.never().subscribe(System.out::println, Throwable::printStackTrace);
 
 

Changing Scheduler

By default, Observable notifies the observer asynchronously using a thread-pool scheduler, but you can change the default scheduler as follows:

Using thread-pool scheduler

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).subscribeOn(Scheduler.getThreadPoolScheduler()).
       subscribe(System.out::println, Throwable::printStackTrace);
 

Using new-thread scheduler

It creates a new thread:

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).subscribeOn(Scheduler.getNewThreadScheduler()).
       subscribe(System.out::println, Throwable::printStackTrace);
 

Using timer thread with interval

It notifies at each interval:

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).subscribeOn(Scheduler.getTimerSchedulerWithMilliInterval(1000)).
       subscribe(System.out::println, Throwable::printStackTrace);
    // this will print each name every second
 

Using immediate scheduler

This scheduler calls the callback functions right away on the same thread. You can use this if your data is synchronous and you don’t want to create another thread. On the downside, you cannot unsubscribe with this scheduler.

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).subscribeOn(Scheduler.getImmediateScheduler()).
       subscribe(System.out::println, Throwable::printStackTrace);
 

Transforming

Observables keep their sequence of items as streams, and they support the map/flatMap operations of the standard Stream class, e.g.

Map

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).map(name -> name.hashCode()).
       subscribe(System.out::println, Throwable::printStackTrace);
 

FlatMap

    Stream<List<Integer>> integerListStream = Stream.of( Arrays.asList(1, 2), 
       Arrays.asList(3, 4), Arrays.asList(5));
    Observable.from(integerListStream).flatMap(integerList -> integerList.stream()).
       subscribe(System.out::println, Throwable::printStackTrace);
 

Filtering

Observables support basic filtering as provided by Java Streams, e.g.

Filter

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", 
       "Five"); 
    Observable.from(names).filter(name -> name.startsWith("T")).
       subscribe(System.out::println, Throwable::printStackTrace);
    // This will only print Two and Three
 

Skip

Skips the given number of elements:

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).skip(2).subscribe(System.out::println, 
       Throwable::printStackTrace);
    // This will skip One and Two
 

Limit

    Stream<String> names = Stream.of("One", "Two", "Three", "Four", "Five"); 
    Observable.from(names).limit(2).subscribe(System.out::println, 
       Throwable::printStackTrace);
    // This will only print first two strings
 

Distinct

    Stream<String> names = Stream.of("One", "Two", "Three", "One");
    Observable.from(names).distinct().subscribe(System.out::println, 
       Throwable::printStackTrace);
    // This will print One only once
 

Merge

This concatenates two observables:

    Observable<Integer> observable1 = Observable.from(Stream.of(1, 2, 3));
    Observable<Integer> observable2 = Observable.from(Stream.of(4, 5, 6));
    observable1.merge(observable2).subscribe(System.out::println, 
       Throwable::printStackTrace);
    // This will print 1, 2, 3, 4, 5, 6
 

Summary

In summary, as Java 8 already supports a lot of functional primitives, adding support for reactive extensions was quite straightforward. For example, Netflix’s implementation of reactive extensions in Java consists of over 80K lines of code, but it took only a few hundred lines to implement the core features with Java 8. You can download or fork the code from https://github.com/bhatti/RxJava8.


April 17, 2014

Introduction to Java 8 Lambda and Stream Syntax

Filed under: Java — admin @ 11:10 pm

Introduction

Java 8 was released in March 2014 with the most significant language-level enhancements since Java 5 back in 2004. The biggest new feature is the introduction of lambdas. A lambda or closure is a block of code that you can pass to other methods or return from methods. Previously, Java supported a form of closure via the anonymous class syntax, e.g.

 import java.awt.event.ActionEvent;
 import java.awt.event.ActionListener;
 import java.awt.Dimension;
 import java.awt.FlowLayout;
 import javax.swing.JButton;
 import javax.swing.JFrame;
 
 public class SwingExample extends JFrame {
   public SwingExample() {
     this.getContentPane().setLayout(new FlowLayout());
     final JButton btn = new JButton("Click Me");
     btn.setPreferredSize(new Dimension(400,200));
     add(btn);
     btn.addActionListener(new ActionListener() {
       public void actionPerformed(ActionEvent e) {
         btn.setText("Clicked");
         btn.setEnabled(false);
       }
     });
   }
 
   private static void createAndShowGUI() {
     JFrame frame = new SwingExample();
     frame.pack();
     frame.setVisible(true);
     frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
 
   }
 
   public static void main(String[] args) {
     javax.swing.SwingUtilities.invokeLater(new Runnable() {
       public void run() {
         createAndShowGUI(); 
       }
     });
   }
 }

In the above example, the anonymous class can use data defined locally in the enclosing method as long as it is declared final. Here is how the example looks with Java 8 syntax:

 import java.awt.event.ActionEvent;
 import java.awt.event.ActionListener;
 import java.awt.Dimension;
 import java.awt.FlowLayout;
 import javax.swing.JButton;
 import javax.swing.JFrame;
 
 public class SwingExample8 extends JFrame {
   public SwingExample8() {
     this.getContentPane().setLayout(new FlowLayout());
     JButton btn = new JButton("Click Me");
     btn.setPreferredSize(new Dimension(400,200));
     add(btn);
     btn.addActionListener(e -> {
         btn.setText("Clicked");
         btn.setEnabled(false);
       }
     );
   }
 
   private static void createAndShowGUI() {
     JFrame frame = new SwingExample8();
     frame.pack();
     frame.setVisible(true);
     frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
 
   }
 
   public static void main(String[] args) {
     javax.swing.SwingUtilities.invokeLater(new Runnable() {
       public void run() {
         createAndShowGUI(); 
       }
     });
   }
 }

As you can see, the lambda syntax is very minimal. In addition, lambdas don’t require that you declare externally accessible data as final, though it still cannot be changed (it must be effectively final). Java lambdas also add type inference so that you don’t have to declare the types of arguments. The lambda features are implemented using the “invokedynamic” instruction to dispatch method calls, which was added in Java 7 to support dynamic languages. Let’s take a simple example:

public class Run8 {
   public static void main(String[] args) {
     Runnable r = () -> System.out.println("hello there");
     r.run();
   }
 }

You can de-compile it using:

javap -p Run8

You will see that it generated a lambda$main$0 method, e.g.

public class Run8 {
   public Run8();
   public static void main(java.lang.String[]);
   private static void lambda$main$0();
 }

You can see the actual byte code using:

javap -p -c Run8
public class Run8 {
public Run8();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object
4: return

public static void main(java.lang.String[]);
Code:
0: invokedynamic #2, 0 // InvokeDynamic #0:run:()Ljava/lang/Runnable;
5: astore_1
6: aload_1
7: invokeinterface #3, 1 // InterfaceMethod java/lang/Runnable.run:()V
12: return

private static void lambda$main$0();
Code:
0: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
3: ldc #5 // String hello there
5: invokevirtual #6 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
8: return
}

Note that the lambda body was compiled into a private static method; lambdas don’t have to keep a reference to the enclosing class, and “this” inside a lambda does not create a new scope.
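Here is a small example (standard Java behavior) showing the difference: inside an anonymous class, this refers to the anonymous instance, whereas inside a lambda, this refers to the enclosing instance:

 public class ThisScope {
   private final String name = "enclosing";
 
   public void show() {
     Runnable anon = new Runnable() {
       public void run() {
         // "this" is the anonymous Runnable instance
         System.out.println(this.getClass().getName());   // prints ThisScope$1
       }
     };
     Runnable lambda = () -> {
       // "this" is the enclosing ThisScope instance, so its fields are in scope
       System.out.println(name + " " + this.getClass().getName()); // prints enclosing ThisScope
     };
     anon.run();
     lambda.run();
   }
 
   public static void main(String[] args) {
     new ThisScope().show();
   }
 }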

Types of Functions

Java 8 provides predefined functions (See http://docs.oracle.com/javase/8/docs/api/java/util/function/package-summary.html), but there are four major types:

                 Supplier: () -> T
                 Consumer: T -> ()
                 Predicate: T -> boolean
                 Function: T -> R

A supplier takes no arguments and produces an object, a consumer takes an argument for consumption, a predicate evaluates the given argument by returning true/false, and a function maps an argument of type T to an object of type R.

Supplier example

 import java.util.function.*;
 
 public class Supply8 {
   public static void main(String[] args) {
      Supplier<Double> random1 = Math::random;
     System.out.println(random1.get());
     //
     DoubleSupplier random2 = Math::random;
     System.out.println(random2.getAsDouble());
   }
 }

Note: Java 8 provides specialized functional interfaces for primitive types that you can use instead of the wrapper classes. Here is another example that shows how you can write efficient log messages:

 import java.util.function.*;
 public class SupplyLog {
   private static boolean debugEnabled;
   public static void debug(Supplier<String> msg) {
     if (debugEnabled) {
       System.out.println(msg.get());
     }
   }
   public static void main(String[] args) {
     debug(() -> "this will not be printed");
     debugEnabled = true;
     debug(() -> "this will be printed");
   }
 }

Consumer example

 import java.util.function.*;
 public class Consume8 {
   public static void main(String[] args) {
     Consumer<String> consumer = s -> System.out.println(s);
     consumer.accept("hello there");
     consumer.andThen(consumer).accept("this will be printed twice");
   }
 }

Predicate example

 import java.util.function.*;
 
 public class Predicate8 {
   public static void main(String[] args) {
     Predicate<Integer> gradeA = score -> score >= 90;
     System.out.println(gradeA.test(80));
     System.out.println(gradeA.test(90));
   }
 }

In addition to the test method, you can also use the and, negate and or methods to combine predicates.
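For example, combining predicates using the standard and/or/negate methods of java.util.function.Predicate:

 import java.util.function.*;
 
 public class PredicateCombine8 {
   public static void main(String[] args) {
     Predicate<Integer> gradeA = score -> score >= 90;
     Predicate<Integer> passing = score -> score >= 60;
     // passing but not an A
     Predicate<Integer> passingButNotA = passing.and(gradeA.negate());
     System.out.println(passingButNotA.test(75)); // true
     System.out.println(passingButNotA.test(95)); // false
     // either an A or failing
     System.out.println(gradeA.or(passing.negate()).test(50)); // true
   }
 }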

Function example

 import java.util.function.*;
 
 public class Function8 {
   public static void main(String[] args) {
     BinaryOperator<Integer> adder = (n1, n2) -> n1 + n2;
     System.out.println("sum " + adder.apply(4, 5));
      Function<Double,Double> square = x -> x * x;
     System.out.println("square " + square.apply(5.0));
   }
 }

Custom Functions

In addition to the predefined functions, you can define your own functional interface as long as it declares a single abstract method. You can optionally mark the interface with the @FunctionalInterface annotation so that the compiler can verify it, e.g.

public class CustomFunction8 {
   @FunctionalInterface
   interface Command<T> {
     void execute(T obj);
   }
 
   private static <T> void invoke(Command<T> cmd, T arg) {
     cmd.execute(arg);
   }
 
 
   public static void main(String[] args) {
     Command<Integer> cmd = arg -> System.out.println(arg);
     invoke(cmd, 5);
   }
 }

Method Reference

In addition to passing lambdas, you can also pass instance or static methods as closures using method references. There are four kinds of method references (a short example of each kind follows the list below):

  • Reference to a static method ContainingClass::staticMethodName
  • Reference to an instance method of a particular object ContainingObject::instanceMethodName
  • Reference to an instance method of an arbitrary object of a particular type ContainingType::methodName
  • Reference to a constructor ClassName::new
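Here is a small example of each kind (standard Java 8, using the predefined functional interfaces):

 import java.util.function.*;
 
 public class MethodRef8 {
   public static void main(String[] args) {
     // 1. static method reference
     Function<String, Integer> parse = Integer::parseInt;
     // 2. instance method of a particular object
     String prefix = "hello ";
     Function<String, String> greet = prefix::concat;
     // 3. instance method of an arbitrary object of a particular type
     Function<String, Integer> length = String::length;
     // 4. constructor reference
     Supplier<StringBuilder> builder = StringBuilder::new;
 
     System.out.println(parse.apply("42"));           // 42
     System.out.println(greet.apply("world"));        // hello world
     System.out.println(length.apply("lambda"));      // 6
     System.out.println(builder.get().append("ok"));  // ok
   }
 }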

Streams

In addition to lambda support, Java 8 has updated the collection classes to support streams. Streams don’t really store anything; they behave as pipes for lazy computation. While collections have a limited size, streams can be unlimited, and they can only be consumed once. Streams can be obtained from collections using the stream() and parallelStream() methods, or from an array via Arrays.stream(Object[]). There are also static factory methods on the stream classes, such as Stream.of(Object[]), IntStream.range(int, int), etc. Common intermediate methods that act as pipes on streams include:

  • filter()
  • distinct()
  • limit()
  • map()
  • peek()
  • sorted()
  • unordered()

Of the above, sorted and distinct are stateful, whereas filter, map, peek and limit are stateless. And here are terminal operations that trigger evaluation on streams:

  • findFirst()
  • min()
  • max()
  • reduce()
  • sum()

You can implement your own streams by using the helper methods in the StreamSupport class.
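For example, StreamSupport.stream builds a stream from any Spliterator (here wrapping an Iterable, which is a common case):

 import java.util.Arrays;
 import java.util.stream.Stream;
 import java.util.stream.StreamSupport;
 
 public class CustomStream8 {
   public static void main(String[] args) {
     Iterable<String> symbols = Arrays.asList("AAPL", "MSFT", "ORCL");
     // the second argument controls whether the resulting stream is parallel
     Stream<String> stream = StreamSupport.stream(symbols.spliterator(), false);
     stream.map(String::toLowerCase).forEach(System.out::println);
   }
 }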

Iterating

Java 8 streams support the forEach method for iterating, which takes a consumer function, e.g.

 import java.util.stream.Stream;
 
 public class StreamForEach {
   public static void main(String[] args) {
     Stream<String> symbols = Stream.of("AAPL", "MSFT", "ORCL", "NFLX", "TSLA");
     symbols.forEach(System.out::println);
   }
 }

Parallel iteration:

 import java.util.Arrays;
 import java.util.List;
 import java.util.stream.Stream;
 
 public class ParStreamForEach {
   public static void main(String[] args) {
     List<String> symbols = Arrays.asList("AAPL", "MSFT", "ORCL", "NFLX", "TSLA");
     System.out.println("unordered");
     symbols.parallelStream().forEach(System.out::println);
     System.out.println("ordered");
     symbols.parallelStream().forEachOrdered(System.out::println);
   }
 }

Note that by default, iterating a parallel stream is unordered, but you can force ordered iteration using the forEachOrdered method instead of forEach.

Filtering

We already saw predicate functions; the filtering support in streams extracts the elements of a collection that evaluate to true for a given predicate. Let’s create a couple of classes that we will use later:

 import java.util.ArrayList;
 import java.util.Collection;
 import java.util.function.IntPredicate;
 public class Game implements IntPredicate {
   enum Type {
     AGE_ALL,
     AGE_13_OR_ABOVE,
     AGE_18_OR_ABOVE
   }
   public final String name;
   public final Type type;
   public final Collection<Player> players = new ArrayList<>();
   public Game(String name, Type type) {
     this.name = name;
     this.type = type;
   }
   public boolean suitableForAll() {
     return type == Type.AGE_ALL;
   }
   public void add(Player player) {
     if (test(player.age)) {
       this.players.add(player);
     }
   }
   @Override
   public boolean test(int age) {
     switch (type) {
       case AGE_18_OR_ABOVE:
         return age >= 18;
       case AGE_13_OR_ABOVE:
         return age >= 13;
       default:
         return true;
     }
   }
   @Override
   public String toString() {
     return name;
   }
 }
 public class Player {
   public final String name;
   public final int age;
   public Player(String name, int age) {
     this.name = name;
     this.age = age;
   }
   @Override 
   public String toString() {
     return name;
   }
 }

Now let’s create a class that will filter games by types:

 import java.util.Arrays;
 import java.util.Collection;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 
 
 public class GameFilter {
   public static void main(String[] args) {
      Collection<Game> games = Arrays.asList(new Game("Birdie", Game.Type.AGE_ALL), new Game("Draw", Game.Type.AGE_ALL), new Game("Poker", Game.Type.AGE_18_OR_ABOVE), new Game("Torpedo", Game.Type.AGE_13_OR_ABOVE));
      Collection<Game> suitableForAll = games.stream().filter(Game::suitableForAll).collect(toList());
     System.out.println("suitable for all");
     suitableForAll.stream().forEach(System.out::println);
      Collection<Game> adultOnly = games.stream().filter(game -> game.type == Game.Type.AGE_18_OR_ABOVE).limit(10).collect(toList());
     System.out.println("suitable for adults only");
     adultOnly.stream().forEach(System.out::println);
   }
 }

As you can see, filter can accept lambda or method reference.

Map

Map operation on streams applies a given function to transform each element in stream and produces another stream with transformed elements.

 import java.util.Arrays;
 import java.util.Collection;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 
 
 public class GameMap {
   public static void main(String[] args) {
      Collection<Game> games = Arrays.asList(new Game("Birdie", Game.Type.AGE_ALL), new Game("Draw", Game.Type.AGE_ALL), new Game("Poker", Game.Type.AGE_18_OR_ABOVE), new Game("Torpedo", Game.Type.AGE_13_OR_ABOVE));
     Collection<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
     for (Game game : games) {
       for (Player player : players) {
         game.add(player);
       }
     }
     //
      Collection<Game.Type> types = games.stream().map(game -> game.type).collect(toList());
     System.out.println("types:");
     types.stream().forEach(System.out::println);
      Collection<Player> allPlayers = games.stream().flatMap(game -> game.players.stream()).collect(toList());
      System.out.println("\nplayers:");
      allPlayers.stream().forEach(System.out::println);
   }
 }

Note that flatMap takes a collection of objects for each input element and flattens them into a single stream. Java streams also provide map methods for primitive types such as mapToLong, mapToDouble, etc.
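For example, mapToDouble produces a DoubleStream so that numeric operations such as average avoid boxing (a small sketch reusing the Player class defined above):

 import java.util.Arrays;
 import java.util.Collection;
 
 public class GamePrimitiveMap {
   public static void main(String[] args) {
     Collection<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20));
     // mapToDouble avoids boxing each age into a Double
     double averageAge = players.stream().mapToDouble(player -> player.age).average().getAsDouble();
     System.out.println("average age " + averageAge);
   }
 }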

Sorting

Previously, you had to implement the Comparable interface or provide a Comparator for sorting, but you can now pass a lambda for comparison, e.g.

 import java.util.Arrays;
 import java.util.List;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 import java.util.Comparator;
 
 
 public class GameSort {
   public static void main(String[] args) {
     List<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
     players.sort(Comparator.comparing(player -> player.age));
     System.out.println(players);
   }
 }

Min/Max

Java streams provide helper methods for calculating min/max, e.g.

 import java.util.Arrays;
 import java.util.Collection;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 import java.util.Comparator;
 public class GameMinMax {
   public static void main(String[] args) {
     Collection<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
     Player min = players.stream().min(Comparator.comparing(player -> player.age)).get();
     Player max = players.stream().max(Comparator.comparing(player -> player.age)).get();
     System.out.println("min " + min + ", max " + max);
   }
 }

Reduce/Fold

Reduce or fold generalizes the problem of computing a single value from a collection, e.g.

 import java.util.Arrays;
 import java.util.Collection;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 
 
 public class GameReduce {
   public static void main(String[] args) {
     Collection<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
     double averageAge1 = players.stream().mapToInt(player -> player.age).average().getAsDouble();
     double averageAge2 = players.stream().mapToInt(player -> player.age).reduce(0, Integer::sum) / players.size();
     double averageAge3 = players.stream().mapToInt(player -> player.age).reduce(0, (sum, age) -> sum + age) / players.size();
     System.out.println("average age " + averageAge1 + ", " + averageAge2 + ", or " + averageAge3);
   }
 }

Grouping/Partitioning

The groupingBy method of Collectors allows grouping a collection, e.g.

 import java.util.Arrays;
 import java.util.Collection;
 import java.util.List;
 import java.util.Map;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 
 
 public class GameGrouping {
   public static void main(String[] args) {
     Collection<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
     Map<Integer, List<Player>> playersByAge = players.stream().collect(groupingBy(player -> player.age));
     System.out.println(playersByAge);
   }
 }

partitioningBy splits a collection into two groups based on a predicate, e.g.

 import java.util.Arrays;
 import java.util.Collection;
 import java.util.List;
 import java.util.Map;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 
 
 public class GamePartition {
   public static void main(String[] args) {
     Collection<Player> players = Arrays.asList(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
      Map<Boolean, List<Player>> playersByAge = players.stream().collect(partitioningBy(player -> player.age >= 18));
     System.out.println(playersByAge);
   }
 }

String joining

Here is an example of creating a string from collection:

 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 
 
 public class GameJoin {
   public static void main(String[] args) {
     Stream<Player> players = Stream.of(new Player("John", 10), new Player("David", 15), new Player("Matt", 20), new Player("Dan", 30), new Player("Erica", 5));
     System.out.println(players.map(Player::toString).collect(joining(",", "[", "]")));
   }
 }

Lazy Evaluation

The Java Map interface now supports lazy evaluation via computeIfAbsent, which adds an object to the Map only if the key is not present:

 import java.util.HashMap;
 import java.util.Map;
 
 public class Fib {
   private final static Map<Integer,Long> cache = new HashMap<Integer, Long>() {{
     put(0,0L);
     put(1,1L);
   }};
   public static long fib(int x) {
     return cache.computeIfAbsent(x, n -> fib(n-1) + fib(n-2));
   }
 
   public static void main(String[] args) {
     System.out.println(fib(10));
   }
 }

Parallel Streams

Streams process elements serially by default, but you can change the collection’s stream() call to parallelStream() to take advantage of parallel processing. Parallel streams use the common ForkJoinPool by default and use as many threads as you have processors (Runtime.getRuntime().availableProcessors()), e.g.

 import java.util.Arrays;
 import java.util.List;
 import java.util.stream.IntStream;
 import java.util.stream.Stream;
 
 public class ParStreamPrime {
    private static boolean isPrime(int n) {
      if (n < 2) return false;
      if (n == 2) return true;
      if (n % 2 == 0) return false;
      for (int i = 3; i * i <= n; i += 2) {
        if (n % i == 0) return false;
      }
      return true;
    }
   private static long serialTest(int max) {
     long started = System.currentTimeMillis();
     IntStream.rangeClosed(1, max).forEach(num -> isPrime(num));
     return System.currentTimeMillis() - started;
   }
   private static long parallelTest(int max) {
     long started = System.currentTimeMillis();
     IntStream.rangeClosed(1, max).parallel().forEach(num -> isPrime(num));
     return System.currentTimeMillis() - started;
   }
   //
   public static void main(String[] args) {
     int max = 1000000;
     System.out.println("Serial " + serialTest(max));
     System.out.println("Parallel " + parallelTest(max));
   }
 }

If you need to customize the thread pool size, you can run the parallel stream inside your own fork-join pool, e.g.

  private static void parallelTest(final int max) throws Exception {
     ForkJoinPool forkJoinPool = new ForkJoinPool(2);
     // submit the parallel stream to a custom pool with 2 threads and wait for completion
     forkJoinPool.submit(() ->
        IntStream.rangeClosed(1, max).parallel().forEach(num -> isPrime(num))
     ).get();
   }

Default methods and Mixins

Anyone who has to maintain interfaces used by multiple clients knows the frustration of adding new methods, because it requires updating all implementations. You can now add default methods and static methods to interfaces, e.g.

 interface Vehicle {
   float getMaxSpeedMPH();
   public static String getType(Vehicle v) {
     return v.getClass().getSimpleName();
   }
 }
 
 interface Car extends Vehicle {
   void drive();
   public default float getMaxSpeedMPH() {
     return 200;
   }
 }
 interface Boat extends Vehicle {
   void row();
   public default float getMaxSpeedMPH() {
     return 100;
   }
 }
 
 interface Plane extends Vehicle {
   void fly();
   public default float getMaxSpeedMPH() {
     return 500;
   }
 }
 
 public class AmphiFlyCar implements Car, Boat, Plane {
   @Override
   public void drive() {
     System.out.println("drive");
   }
   @Override
   public void row() {
     System.out.println("row");
   }
   @Override
   public void fly() {
     System.out.println("fly");
   }
   public float getMaxSpeedMPH() {
     return Plane.super.getMaxSpeedMPH();
   }
   public static void main(String[] args) {
     AmphiFlyCar v = new AmphiFlyCar();
     System.out.println(Vehicle.getType(v) + ": " + v.getMaxSpeedMPH());
   }
 }

Optional

Tony Hoare called null references his billion-dollar mistake, and Java has the infamous NullPointerException when you dereference a null object. Now you can avoid those nasty errors using Optional, which acts like the Maybe monad in other languages, e.g.

 import java.util.HashMap;
 import java.util.Map;
 import static java.util.stream.Collectors.*;
 import java.util.stream.Stream;
 import java.util.Optional;
 
 
 public class OptionalExample {
   private static Map<String, Player> players = new HashMap<String, Player>() {{
     put("John", new Player("John", 10));
     put("David", new Player("David", 15));
     put("Matt", new Player("Matt", 20));
     put("Erica", new Player("Erica", 25));
   }};
 
   private static Optional<Player> findPlayerByName(String name) {
     Player player = players.get(name);
     return player == null ? Optional.empty() : Optional.of(player);
   }
 
   private static Integer getAge(Player player) {
     return player.age;
   }
 
 
   public static void main(String[] args) {
     findPlayerByName("John").ifPresent(System.out::println);
     Player player = findPlayerByName("Jeff").orElse(new Player("Jeff", 40));
     System.out.println("orElse " + player);
     Integer age = findPlayerByName("Jeff").map(OptionalExample::getAge).orElse(-1);
     System.out.println("Jeff age " + age);
   }
 }

CompletableFuture

Java has long supported futures, but previously you had to call the blocking get() to retrieve the result. With Java 8, you can use CompletableFuture to define what happens when asynchronous processing completes, e.g.

 private static final ExecutorService executor = Executors.newFixedThreadPool(2);
 public static CompletableFuture<Player> getPlayer(String name) {
    // the age here is just an illustrative value
    return CompletableFuture.supplyAsync(() -> new Player(name, 40), executor);
 }
 getPlayer("name").thenAccept(player -> System.out.println(player));
 

Summary

Over the past few years, functional programming languages have become mainstream, and Java 8 brings many of those capabilities despite being late. These new features help you write better and more concise code. You will have to change existing code to make more use of immutable objects and to use streams where possible. As far as stability, I found the Java 8 compiler on Linux a bit buggy; it crashed often, so it may take a little while before Java 8 can be used in production.

September 12, 2013

NFJS Seattle 2013 Review

Filed under: Computing — admin @ 7:38 pm

I attended the NFJS conference over the weekend. It was a short conference, from Friday, Sep 6 to Sunday, Sep 8, with a number of sessions on Java 8, Javascript, mobile, and functional programming. Here are some of the sessions that I enjoyed:

Java 8 Language Capabilities

This session gave a brief overview of the new features of Java 8, mainly the new closure/lambda syntax. Venkat is a great speaker, and he was coding live throughout the session.

Concurrency without Pain in Pure Java

This was related to the first talk by Venkat Subramaniam and covered a number of patterns such as Actors and STM for concurrency.

Functional SOLID

In this talk, Matt Stine gave an overview of the SOLID principles and how functional programming makes it easier to apply them. This was a more abstract talk and didn’t go into examples of those patterns.

Programming with Immutability

This was a more practical talk by Matt Stine, with live-coded examples of immutability and functional programming in Java and Groovy. He mentioned a number of tools that make it easier to build mutable/immutable pairs of classes, and showed how immutable classes can be used with other frameworks such as Hibernate.

Rich Web Apps with Angular

This was a short introduction to Angular by Raju Gandhi.

Vagrant: Virtualized Development Environments Made Simple

This was a great introduction to Vagrant and how to set up complete development, testing and production environments on your local desktop.

Simulation Testing with Simulant

This was a short introduction to the Simulant testing library that Stuart Halloway has been using for testing the Datomic database. The library can be used for functional testing and load testing. It saves all data in a Datomic database and can be populated from existing data.

Generative Testing

This was another session on a testing framework by Stuart Halloway. The framework provides great support for generating test data and is somewhat similar to QuickCheck, though it doesn’t offer reduction of failing cases.

Summary

I didn’t go to OSCON this year and enjoyed the smaller venue that NFJS provided. There were a couple of dud sessions, but overall I enjoyed it.


August 8, 2012

Back from OSCON 2012

Filed under: Computing — admin @ 5:58 pm

I went back to OSCON last month (July 2012), which was held in Portland again. It drew the biggest crowd I have seen; over 3000 folks attended the conference. Here are some of the tutorials and sessions I attended:

R

I attended the R tutorial on the first day, which I found quite informative. There has been a lot of interest in data science, and R is a great tool for statistical analysis and graphs.

Scala

In the second half of Monday, I attended Scala Koans, which was disappointing. First, they could not get us started because the Wifi died on us and they didn’t have much of a backup plan. We finally started the session after waiting for an hour and were then mostly left on our own to finish the Koans. I could have done that myself by downloading the Koans.

Android-Fu

On Tuesday, I attended Android-Fu, which was somewhat useful. I have done quite a bit of Android development at work, but I still picked up a few pointers.

Android Testing

In the second half of Tuesday, I attended Android Testing, which was useful, but the presenter was incredibly boring and I had a hard time staying awake. [Slides]

Go

The real conference started on Wednesday, and I attended the Go session, which was mostly about the governance behind the Go language.

Storm

I then attended a session on Storm, which was informative; it described the differences between Hadoop and Storm and showed some actual code.

MongoDB

I then attended a session on Running MongoDB for High Availability, which showed some of the weak areas and lessons learned while scaling MongoDB. [Slides]

Apache Zookeeper

This session was also very informative and it showed various uses of Zookeeper.

Disruptor

I then attended a session on Disruptor, which is an incredibly fast concurrency framework. [Slides]

Building Functional Hybrid Apps For The iPhone And Android

I then attended Building Functional Hybrid Apps For The iPhone And Android, which was mostly a marketing talk for the Websphere IDE; the speakers ignored the value of PhoneGap (Apache Cordova), which was behind their demo.

Node.js

On Thursday, I attended a session on Node.js, which was a nice introduction with some code samples. [Slides]

I attended another Node.js session, Node.js in Production: Postmortem Debugging and Performance Analysis, which showed a number of ways to debug your Node.js applications. [Slides]

Advanced MySQL Replication Architectures

This was also a very informative session, and you can take a look at the slides.

The Art of Organizational Manipulation

This was an entertaining talk about how to build influence in the workplace.

High Performance Network Programming on the JVM

This was another informative talk about building network applications in Java, and you will find the slides very helpful.

Summary

I found very few sessions on mobile this year; there were a couple of sessions on Android and I wanted to see more. There were a lot of vendor-sponsored sessions on private clouds, and many vendors in the exhibition hall were promoting frameworks such as OpenStack, CloudStack, Eucalyptus and others. There were also quite a few sessions on distributed frameworks such as Hadoop, HBase, MongoDB, Zookeeper, Storm, Disruptor, etc., which I enjoyed. I didn’t see as many sessions on functional languages as in the past and had wanted to attend a session on Clojure (which was cancelled). A few sessions covered NoSQL databases such as HBase, MongoDB and Cassandra. There was a lot of enthusiasm for HTML5 again, and many of those sessions were sold out; having attended similar sessions in the past, I skipped most of them. Overall, I had a good time, but I am not sure if I will be back next year. You can read the presentation slides at http://www.oscon.com/oscon2012/public/schedule/proceedings.
