About a year and a half ago, I was involved in a system rewrite for a project that communicated with dozens of data sources. The web application allowed users to generate ad-hoc reports and perform various transactions on the data. The original project was written in Perl/Mason, which proved difficult to scale as demand grew for adding more data sources. The performance of the system also became problematic because the old system waited for all the data to arrive before displaying anything to the user. I was assigned to redesign the system, and I rebuilt it on a lightweight J2EE-based stack including Spring, Hibernate, Spring MVC, and Sitemesh, and used DWR, Prototype, and Scriptaculous for the AJAX-based web interface. In this blog, I am going to focus on the high-level design, especially the integration with other services.
High level Design
The system consisted of the following layers:
- Domain layer – This layer defined the domain classes. Most of the data structures were simply aggregates of heterogeneous data with little structure. Also, users wanted to add new data sources with minimal time and effort, so a number of generic classes were defined to represent them.
- Data provider layer – This layer provided services to search and aggregate data from different sources. Each provider published the query data that it required and the output data that it supported.
- Data aggregation layer – This layer collected data from multiple data sources and allowed the UI to pull the data as it became available.
- Service layer – This layer provided high-level operations for querying, reporting, and transactions.
- Presentation layer – This layer provided the web-based interface and made significant use of AJAX to show the data incrementally.
Domain layer
Most of the data services simply returned rows of data with little structure or commonality, so I designed general-purpose classes to represent rowsets and columns:
MetaField
represents the meta information for each atomic data element used for reporting purposes. It stored information such as the name and type of the field.
DataField
represents both a MetaField and its value. The value could be one of the following:
- UnInitialized – This is a marker interface that signifies that the data is not yet populated. It was used to give users visual cues for data elements that were still waiting for a response from the data providers.
- DataError – This class stores an error encountered while accessing the data item. It had subclasses such as:
- UnAvailable – the data is not available from the service
- TimeoutError – the service timed out
- ServerError – any unexpected server-side error
- The actual value returned by the data provider.
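To make those value states concrete, here is a minimal sketch of the domain classes; only the class names come from the design above, while the fields, constructors, and the sentinel-object trick standing in for the UnInitialized marker are my assumptions:

```java
// Sketch of the domain classes above; field names and constructors are
// assumptions, only the class names follow the design.
class MetaField {
    final String name;
    final String type;  // e.g. "string", "number", "date"
    MetaField(String name, String type) { this.name = name; this.type = type; }
}

// Error values stored in place of real data when a provider fails.
class DataError { final String message; DataError(String m) { message = m; } }
class UnAvailable extends DataError { UnAvailable() { super("data not available"); } }
class TimeoutError extends DataError { TimeoutError() { super("service timed out"); } }
class ServerError extends DataError { ServerError(String m) { super(m); } }

class DataField {
    // Sentinel playing the role of the UnInitialized marker.
    static final Object UNINITIALIZED = new Object();

    final MetaField meta;
    private Object value = UNINITIALIZED;

    DataField(MetaField meta) { this.meta = meta; }

    boolean isInitialized() { return value != UNINITIALIZED; }
    boolean isError() { return value instanceof DataError; }
    void setValue(Object v) { value = v; }
    Object getValue() { return value; }
}
```

A field starts out uninitialized (so the UI can render a waiting indicator for it), then receives either a real value or a DataError subclass once the provider responds.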
DataSink
Each data provider allowed clients to specify the amount of data needed; however, many providers had internal limits on the size of the data they could return, so multiple invocations of the underlying services were required. The DataSink interface allowed higher-level services to consume the data from each data provider in a streaming fashion, which improved UI interaction and minimized the memory required to buffer the data from the service providers. Here is the interface for the DataSink callback:
```java
public interface DataSink {
    /**
     * This method allows clients to consume a set of tuples. The client
     * returns true when it wants to stop processing more data, and no
     * further calls will be made to the providers.
     * @param set - set of new tuples received from the data providers
     * @return - true if the client wants to stop
     */
    public boolean consume(TupleSet set);

    /**
     * This method notifies the client that the data provider is done
     * fetching all required data.
     */
    public void dataEnded();
}
```
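As a usage sketch, a sink might buffer incoming tuples and tell the provider to stop once enough rows have arrived. The interface is repeated here so the example is self-contained, and the TupleSet shape and buffering policy are assumptions:

```java
import java.util.*;

// Minimal stand-ins: TupleSet's contents are an assumption; DataSink
// follows the callback interface above.
interface TupleSet { List<Object[]> rows(); }

interface DataSink {
    boolean consume(TupleSet set);  // return true to stop the provider
    void dataEnded();
}

// A sink that buffers rows and asks the provider to stop after maxRows.
class BufferingSink implements DataSink {
    private final int maxRows;
    final List<Object[]> buffered = new ArrayList<>();
    private boolean ended;

    BufferingSink(int maxRows) { this.maxRows = maxRows; }

    public boolean consume(TupleSet set) {
        buffered.addAll(set.rows());
        return buffered.size() >= maxRows;  // true => no further calls
    }

    public void dataEnded() { ended = true; }

    boolean isEnded() { return ended; }
}
```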
DataProvider
This interface is used for integration with each data service:
```java
public interface DataProvider {
    public int getPriority();
    public MetaField[] getRequestMetaData();
    public MetaField[] getResponseMetaData();
    public DataSink invoke(Map context, DataField[] inputParameters) throws DataProviderException;
}
```
DataProviderLocator
This interface used a configuration file to find all the data providers needed for a query:
```java
public interface DataProviderLocator {
    public DataProvider[] getDataProviders(MetaField[] input, MetaField[] output);
}
```
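A locator implementation might simply match each provider's published metadata against the requested fields. The configuration-file plumbing is omitted, and this sketch (with made-up names and a guessed matching rule) only illustrates the idea:

```java
import java.util.*;

// Sketch of a locator that picks providers whose published output
// metadata covers at least one of the requested fields. The matching
// rule is a guess at the intent; the real class read a config file.
public class LocatorSketch {
    static List<String> locate(Map<String, Set<String>> outputsByProvider,
                               Set<String> wantedOutputs) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : outputsByProvider.entrySet()) {
            // A provider qualifies if it can produce any wanted field.
            if (!Collections.disjoint(e.getValue(), wantedOutputs)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```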
DataExecutor
This class used Java’s Executors to send queries to the different data providers in parallel:
```java
public interface DataExecutor {
    public void execute();
}
```
The implementation of this interface manages the dependencies between the data providers and runs them in separate threads.
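A minimal version of that fan-out can be sketched with java.util.concurrent. The Provider shape here is a simplified stand-in for the real DataProvider, and the 30-second per-provider timeout is an arbitrary illustration (a timeout here is what would map onto the TimeoutError value above):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of parallel fan-out with Executors; Provider is a simplified
// stand-in for the real DataProvider interface.
public class ParallelExecutorSketch {
    interface Provider { List<String> fetch(); }

    public static List<String> executeAll(List<Provider> providers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(providers.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Provider p : providers) {
                futures.add(pool.submit(p::fetch));  // one task per provider
            }
            List<String> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                // An illustrative per-provider timeout.
                results.addAll(f.get(30, TimeUnit.SECONDS));
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```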
DataAggregator
This class stored the results of all data providers in a rowset format, where each row was an array of data fields. It was consumed by the AJAX clients, which polled for new data:
```java
public interface DataAggregator {
    public void add(DataField[] keyFields, DataField[] valueFields);
    public DataField[] keyFields();
    public DataField[] dequeue(DataField[] keyFields) throws NoMoreDataException;
}
```
The first method is used by the DataExecutor to add data to the aggregator. In our application, each report had some kind of key field, such as a SKU#. In some cases that key was passed in by the user, and in other cases it was queried before the actual search. The second method returned those key fields. The third method was used by the AJAX clients to poll for new data.
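To illustrate the dequeue contract, here is an assumption-laden sketch keyed by a single string rather than DataField[] (the real class aggregated full rows of data fields): a poll returns new rows when they exist, returns nothing when providers are still working, and signals "stop polling" via NoMoreDataException once everything has been drained.

```java
import java.util.*;
import java.util.concurrent.*;

// Simplified aggregator sketch; keys are strings here purely for brevity.
class NoMoreDataException extends Exception {}

class SimpleAggregator {
    private final Map<String, Queue<String[]>> rowsByKey = new ConcurrentHashMap<>();
    private volatile boolean closed;  // set once every provider has finished

    void add(String key, String[] valueFields) {
        rowsByKey.computeIfAbsent(key, k -> new ConcurrentLinkedQueue<>()).add(valueFields);
    }

    void close() { closed = true; }

    // Called by each AJAX poll: returns a new row, null if none is ready
    // yet, and throws once providers are done and the queue is drained.
    String[] dequeue(String key) throws NoMoreDataException {
        Queue<String[]> q = rowsByKey.get(key);
        String[] row = (q == null) ? null : q.poll();
        if (row == null && closed) throw new NoMoreDataException();
        return row;
    }
}
```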
Service Layer
This layer provided an abstraction for communicating with the underlying data locators, providers, and aggregators:
```java
public interface DataProviderService {
    public DataAggregator search(DataField[] inputFields, DataField[] outputFields);
}
```
End to End Example
```
+---------------------------+
| Client selects            |
| input/output fields and   |
| sends search request.     |
| View renders initial      |
| table.                    |
+---------------------------+
        |   ^
        v   |
+---------------------------+
| Web Controller creates    |
| DataFields and calls      |
| service. It then stores   |
| aggregator in session.    |
+---------------------------+
        |   ^
        v   |
+---------------------------+
| Service calls locators    |
| and executor. It returns  |
| aggregator.               |
+---------------------------+
        |   ^
        v   |
+---------------------------+
| Executor calls providers  |
| and adds responses to     |
| aggregator.               |
+---------------------------+
        |   ^
        v   |
+---------------------------+
| Providers call underlying |
| services or database      |
| queries.                  |
+---------------------------+

+---------------------------+
| Client sends AJAX request |
| for new data fields.      |
| View uses                 |
| $('cellid').value to      |
| update table.             |
+---------------------------+
        |
        v
+---------------------------+
| Web Controller calls      |
| aggregator to get new     |
| fields. It cleans up the  |
| aggregator when done.     |
+---------------------------+
        |
        v
+---------------------------+
| Aggregator                |
+---------------------------+
```
- Client selects a type of report, where each report has slightly different input data fields.
- Client opens the application and selects the data fields he/she is interested in.
- Client hits the search button.
- Web Controller intercepts the request and converts the form into arrays of input and output data field objects.
- Web Controller calls the search method of DataProviderService and stores the DataAggregator in the session. Though our application ran on multiple servers, we used sticky sessions and didn’t need to replicate the search results. The controller then sent the key fields back to the view.
- The view used the key data to populate the report table, then started polling the server for the incoming data.
- Each poll request finds new data and returns it to the view, which populates the table cells. When all data has been polled, the aggregator throws NoMoreDataException and the view stops polling.
- The view also stops polling after two minutes in case a service stalls. In that case, the aggregator is cleared from the session by a background thread.
Lessons Learned
This design served us well in terms of performance and extensibility, but we had some scalability issues because we allowed the output of one provider to be used as input to another provider. As a result, some threads sat idle, so we added some smarts to the executors to spawn threads only when input data was available. Also, though some of the data sources provided asynchronous services, most didn’t, and for others we had to go through the database. If the services had been purely asynchronous, we could have used a reactive style of concurrency with only two threads per search instead of almost one thread per provider: one thread would send requests to all the services, and another would poll for responses from the unfinished providers and add each result to the aggregator as it completed. I think this kind of application is much better suited to a language like Erlang, which provides extremely lightweight processes that let you easily launch hundreds of thousands of processes. Erlang also has built-in support for the tuples used in our application.
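For illustration only, the asynchronous alternative described above can be sketched with CompletableFuture, a Java 8 API that did not exist when this system was built; the AsyncProvider shape is hypothetical. Firing all requests first and collecting afterward means the caller blocks on at most the slowest provider rather than dedicating a thread to each one:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the "fire all requests, then collect" style; if providers were
// truly asynchronous, their own I/O machinery would complete the futures
// without a dedicated thread per provider.
public class AsyncSearchSketch {
    interface AsyncProvider { CompletableFuture<List<String>> fetch(); }

    static List<String> search(List<AsyncProvider> providers) {
        List<CompletableFuture<List<String>>> pending = new ArrayList<>();
        for (AsyncProvider p : providers) pending.add(p.fetch());  // fire all requests
        List<String> rows = new ArrayList<>();
        for (CompletableFuture<List<String>> f : pending) rows.addAll(f.join());
        return rows;
    }
}
```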