Sunday, April 15, 2018

CLC Teardrop Camper Electrical System

As the weather is starting to warm up I'm getting back to my teardrop camper build. The few remaining steps are to finish up the electrical system, mushroom vents, and headliner. This post gives an overview of my electrical system which differs from the CLC recommendations in a number of places. There are a bunch of different ways to go about your camper electrical system depending on your needs, skill level, and budget. I was primarily looking for lights, phone/tablet charging, and the occasional laptop charging. The resulting system is relatively simple with a bit of increased complexity due to the number of switches and independent controls I wanted but it is by no means a difficult system to install.

Before you start, please note that I'm not an electrician and learned most of what I know by reading web pages and forum posts, watching YouTube videos, and bugging friends who actually are electrical engineers. Use any of my advice at your own risk.

Battery Box

I didn't like trying to fit the battery and related components in the galley behind the galley module and after reading a few of the posts about the weight of the battery causing the galley flat to crack or delaminate, I decided to put the battery in the tongue box. I built a simple platform from some scrap plywood and mounted a plastic storage box to it. It isn't the prettiest thing but I wanted to get something up and running for the spring season. Hopefully I'll get time to either build a tongue box or buy a diamond plated one that matches the trailer fenders.

Inside the box I mounted an all in one lithium ion battery from Chafon that provides 12v DC and 120v AC. There are a number of these all in one units available now, including some that wrap a lead-acid battery, but I liked this unit because:

it is really light
it has reasonable storage for the price (as far as Li-ion goes)
it has all the charging and inverter components built in (and can be easily daisy chained or connected to solar panels)
I can take it out and use it around the house when not on the road

But there's no doubt that lithium ion is much more expensive than an equivalent lead-acid so you'll have to make your own decision. We'll see how well it holds up over time with multiple discharge cycles.

On the side of the box I have an AC port plug to allow for charging the battery when power is available.

Inside the box you can see the AC supply from the port plug on the right, the battery unit, AC out on the right and DC out on the left. The AC and DC lines then leave the box via wire glands on the back.

The supply lines then run under the length of the camper to just behind the galley bulkhead. I added a set of plugs just behind the tongue box so I can easily disconnect and remove the entire box to replace it or just to have an easy battery disconnect.

The lines enter the camper from the bottom by going through an exterior conduit box with two wire glands mounted on it. I originally wanted to put the wire glands directly on the floor of the camper but I couldn't find a reasonably priced solution that would span the 3/4" (or is it 7/8") thick floor because most wire glands are designed to mount to a sheet metal box. I did think about removing a small square in the floor and mounting some 1/4" plywood but then I worried about water proofing it and strength, etc. In the end, I used some weather stripping on the top of the conduit box and drilled a 1" hole through the box and the floor to bring the wires up.

The wires then come up through the floor just behind the galley bulkhead and run up and into the galley. The light strip shown there is an LED nightlight in a right angle channel that runs down either side of the bulkhead pointing toward the back of the camper. They are set to a nice blue color that provides a low level of light in the evening when the cabin lights are too much.

Galley Wiring

With the battery in the tongue box, I had a lot more room for wiring behind the galley module. I had an idea of the items I wanted/needed in the system and I didn't like the idea of mounting them all to the galley bulkhead. It's fairly thin material so wood screws need to be really small and I didn't want a bunch more bolts through the bulkhead. To provide a surface for mounting all my electrical components, I built a simple 4 sided box out of some scrap plywood and painted it to match the bottom of the camper (not that you'll ever see it).

The box fits into the center compartment of the galley module and gives me 4 sacrificial plywood surfaces to mount to. The supply lines are coming in from the galley floor on the right side and running into the box. The lines run out to switch panels on either side and then to the final lights, fans, and outlets. I had to increase the notches on the galley module but everything tucks neatly out of sight.

Inside this box is where all the magic happens. Ok, not magic but a lot of wire connections. What's in the box:

In the center is a 12v fuse box with 6 circuits.
There are a number of screw terminals just to make splicing a little easier.
On the left and right are LED RGB controllers that allow me to set a brightness and color for the LED cabin lights.
In the center on the bottom are two latching relays to allow a 3-way switch like behavior for the cabin lights and night lights.
On the bottom left is a junction box for the 120v line.

You can see that I nicely labeled everything with my label maker only to realize the labels peel apart in the heat. That'll make fixing a future issue real fun.

The lines run out of the central box and over to the back of switch plates on either side of the cabin, down through the galley floor for the night lights, through the bulkhead for the cabin lights, and, on one side, up to the galley lights. This is a little different from CLC's approach where they went up the bulkhead and directly through to the lights mounted in the cabin. I decided that I'd rather do the wiring run on the cabin side where I could hide it behind the headliner. The only wire really visible in the galley is the one line for lights.

A note of caution: I did put holes in the galley floor for the supply lines and the night light lines. In the event that I spill something in the galley, water will leak through these holes into the cabin. I used a rubber gasket and some caulk to fill the holes but any more than 1/8" of water and it will most likely leak. I'm willing to take that chance.

Inside the Cabin

Moving inside the cabin, the light and fan wires come in just above the shelf in the corner. I used some mesh wire sleeve to hide them a bit before they turn the corner to go up the wall behind the headliner.

The front of the switch panel that was wired up from the galley side shows the switches and outlets I selected. I couldn't find a panel I liked so made one by cutting it out of 1/4" plywood and staining it a contrasting color. There is a separate panel on each side of the cabin. The panel includes:

120v outlet
2 x 12v barrel outlet, one with a USB charging insert
2 x on/off toggles for a personal fan and reading light
2 x momentary toggles for the cabin lights and night lights
An LED indicator light for 120v power

The momentary toggles are connected to the latching relays to allow the lights to be toggled on or off from either side of the cabin.

The wiring continues up the wall, then splits off in two directions. The first two lines go to the main cabin lights on the bulkhead and the personal fan. The second two lines go to the secondary cabin light and a personal reading light. I used 16 AWG wire for most of this which was probably overkill but I was worried about voltage drop over these distances. I'd probably go with 18 or 20 if I did it again because the LED lights don't seem to be affected by a slight voltage drop over 10 feet. The 16 gauge did fit behind the headliner but it shows a little bulge in places.
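If you are weighing the same wire gauge decision, the voltage drop is easy to estimate up front. The sketch below is just a back-of-the-envelope calculation, not something from the CLC manual. It assumes nominal copper resistance figures at room temperature (about 4.016, 6.385, and 10.15 ohms per 1000 feet for 16, 18, and 20 AWG), a hypothetical 1 amp LED load, and a 10 foot one-way run. The class and method names are made up for illustration.

```java
// Rough DC voltage drop estimate for a two-conductor run (out and back).
// Resistance values are nominal figures for copper at about 20 C; the
// 1 A load and 10 ft run in main() are assumptions, not measurements.
final class VoltageDrop {
    // Ohms per 1000 ft of copper wire for a few common gauges.
    static double ohmsPerThousandFeet(int awg) {
        switch (awg) {
            case 16: return 4.016;
            case 18: return 6.385;
            case 20: return 10.15;
            default: throw new IllegalArgumentException("No table entry for AWG " + awg);
        }
    }

    // Drop in volts: current times the resistance of the full loop,
    // which is twice the one-way length.
    static double dropVolts(int awg, double oneWayFeet, double amps) {
        double loopOhms = 2.0 * oneWayFeet * ohmsPerThousandFeet(awg) / 1000.0;
        return amps * loopOhms;
    }

    public static void main(String[] args) {
        for (int awg : new int[] {16, 18, 20}) {
            double drop = dropVolts(awg, 10.0, 1.0);
            System.out.printf("%d AWG: %.3f V drop (%.1f%% of 12 V)%n",
                    awg, drop, 100.0 * drop / 12.0);
        }
    }
}
```

Under those assumptions the drop works out to roughly 0.08 V for 16 AWG and 0.2 V for 20 AWG, which is under 2% of a 12 V supply. Your actual load and run length will change the numbers, so plug in your own values.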

As to the lighting, I wasn't happy with a lot of the options I could find. They were either ugly, not bright enough, too blindingly bright, or the wrong color temperature. After ordering a few products and returning them, I finally decided to make them myself. Again, I cut a mounting panel out of 1/4" plywood, stained it, and mounted 4 strips of LEDs to it. I used a piece of 32% translucent acrylic to diffuse the light. The panel also serves to mount the personal fan next to the light.

For the reading light, I again cut a custom 1/4" plywood panel and mounted both a small under cabinet light and a gooseneck reading light. The under cabinet light is on the same switch as the main cabin light so you get both front and side lighting. I used a small circle of the same acrylic to diffuse these lights as well. The reading light is on a separate switch per side. The reading lights have their own built in dimmer and switch, and they remember their state when power is cut and restored via the panel switch, which some cheaper lights I tried did not do.

I ran the wires down the center of the panel rather than against the door so the headliner could grab on either side of the wires to keep everything in place. Again, this might not be an issue with 20 gauge wire.

Once the headliner was in place and things were buttoned up, the wiring wasn't really noticeable at all. The wiring, panels, and lights are a mirror image on both sides of the cabin with the exception of an additional wire running all the way up to the top vent fan on one side. With the control panels on either side of the cabin 1) you can easily reach them while standing outside the door and 2) there is a large empty area in the center of the bulkhead for a tablet or other entertainment.

Finishing up the Galley

With all the wiring in place, I installed the galley module which has its own switch panel. Again, a custom panel but left natural to contrast the module face frame.

This panel includes:

2 x 120v outlets
2 x 12v outlets
1 on/off switch for the galley lights

The galley lights are the same under cabinet puck lights that I used for the cabin lights near the doors. I ran a single line up the side, across the back of the hatch stiffener, then up to the lights. Again, I wrapped the wire in a mesh sleeve to make it look a little nicer. I included a barrel plug in the light power supply line so I can disconnect it and remove the entire hatch if needed. The lights are held on the hatch with exterior double sided mounting tape.

I can still access the wiring box in the module hatch for a quick repair without having to remove the entire galley module but anything more than tightening a screw will probably be a little challenging.

Final Thoughts

Overall I'm happy with how it all turned out. I like having the battery in the tongue box for easy access at the cost of some storage space. I saw that some other builders mounted the battery underneath the cabin which is also a good idea. I also like how there is almost no wiring visible other than the few inches from the bulkhead to the headliner.

I hope this gives you some ideas and I'll post updates after I get some more time using the configuration in the great outdoors!

Friday, March 3, 2017

My Thoughts from the Future of Radio Symposium and Metadata Summit

I recently attended the 2017 NABA Future of Radio Symposium and the NPR Metadata Summit and listened to a number of really good presentations on a wide range of topics. In my role as a software architect with the Public Radio Satellite System (PRSS) distributing content to public radio stations, I tend to focus on distributor-to-station-to-listener data flows and less on the content production flows but these two events highlighted the complexity and opportunities that exist on both ends of the content lifecycle and how they are intricately linked.

I’ve been reflecting on a number of common themes that I heard throughout the various presentations and how they can apply to what I do in content distribution and what our member stations can do as the conduits to listeners. Our listeners, their habits, and their technology aren’t standing still, and neither should we.

Radio (or more generally broadcast audio) still has a lot of life in it, but it needs to continue to evolve

This isn’t a surprising conclusion from a meeting of a bunch of radio people but it was good to see some hard numbers and listener feedback. Radio is still incredibly popular even with a huge range of other media options. We’re now seeing an interesting blend of podcasts, streaming audio, and radio combining to give listeners a customized experience on various platforms. While data prices remain high (at least in the US) and network coverage isn’t great (at least in the US), broadcast audio is still an extremely reliable and cheap distribution method. When cell towers get overloaded or power goes out, broadcast radio is still a go-to critical emergency service.

However, with more streaming options, connected devices, and autonomous vehicles making progress, radio needs to continue to evolve to provide a richer experience that, with quality audio content at the core, leverages connectivity to provide metrics, two-way communication, interactivity, and accompanying visuals.

Listeners want an easy to use and enjoyable user experience when accessing content

As anyone with a car knows, the in-dash device experience is anything but great. The auto industry has been plagued by low resolution screens, complex interfaces, poor responsiveness, and slowly evolving feature sets. There are plenty of reasons for this including long development cycles, lack of prioritization, lack of competition, cost, product lifespan, etc. but it seems like the industry is finally starting to turn things around.

With big tech companies getting into the mix with Android Auto and CarPlay and competition from mobile devices, auto makers seem to be finally putting some effort and money behind more advanced in-dash systems. However, this means traditional radio needs to start providing content for these systems. Traditionally a 200x200 pixel cover art image might have worked just fine in a car but now it just looks silly in a world with 1280x1024 dashboards. Similarly, an 8 character RDS string is embarrassingly non-informative to a listener who was just using a streaming service on their computer with artist, title, biographies, and related content available at the click of a button.

The combination of better interfaces, integrated controls, and content is what is going to win over (or at least maintain) listeners and if that content can arrive at the device for free (or near free), all the better.

There are many ways that listeners are accessing content on many different devices with many different interfaces

While cars may be one of the largest listening groups for pure audio content, there are a lot of different devices out there from mobile phones to computers to home assistants (think Amazon Echo and Google Home). This means the content produced and distributed has to work on many different platforms, some of which might not even exist right now. Trying to build and maintain a custom solution for each platform is going to be challenging even for the biggest producer or station.

At the same time, listeners want a consistent experience across all these devices regardless of the interface. For example, a listener wants to be able to say “play WAMU” to their device, be it a car radio, phone, home assistant or whatever and have it just work. They want to know that they can continue to listen to a story that started in the car on their mobile device as they walk into the house. They want to find related and recommended content when possible.

If content is distributed in such a way that it is available on all these platforms and described in a consistent way, these concepts are possible. A consistent interface and experience doesn’t mean “the same app on all platforms” because the user interfaces vary so wildly that this approach may not be possible or cost effective. Instead, we have to think about how we can package the content so the listener can be given an expected and enjoyable experience regardless of the interface or application driving the interaction.

Metrics rule the day

Content producers love metrics (duh!). Not only do the metrics help sell advertising/underwriting, but they help inform the content production process. NPR had some interesting demonstrations of how metrics from their NPR One application showed where listeners dropped out allowing the content production teams to make decisions about show structure, transitions, and the placement of “spoiler alerts”.

Traditional radio and even modern podcasting have a poor track record around metrics. Listener diaries and listener sampling are still primary methods for metric gathering. Streaming content can do much better because of the one-to-one connection of a listener and the backend service with the obvious trade-off of data usage and scalability.

There are some solutions coming to the broadcast space for better metrics including mobile applications that behave like streaming services but use prerecorded or broadcast content as well as open standards that dictate how players and receivers can report metrics in a common format.

As with all metrics, the key is balancing user privacy concerns while collecting the data producers and advertisers want. Questions like who owns the data, how is the data anonymized, who is the data sold to, do users opt out or opt in are going to be very important going forward. As we recently saw with Vizio TVs, collecting too much data without informing users can lead to a lot of bad publicity, lawsuits, and financial damages. It is critical to understand the responsibilities around metric collection in order to gain the long-term benefits.

Stations need to see the benefit

While it is great to talk about how the user experience can be and needs to be improved, stations are also asking how they can benefit from advancements in listening technology. The most obvious answer is better metrics (see above), but richer interfaces with more listener friendly experiences means more consumption of the media which means more opportunities for underwriting, linking, and community involvement.

People don’t watch TV for the commercials (with the possible exception of the “big game”) but rather for the content. However, once they are there for the content, you can intermix underwriting and local announcements. The same holds true for the non-core content experience. If listeners get accustomed to looking at the screen for the program and story title or the related visuals, they will be more likely to look at the screen when an underwriter’s logo is displayed or the station is advertising a local event or a membership campaign.

As it is now, listeners may ignore their in-dash display because it is too complicated, only displays an FM frequency, displays 8 sad characters, or has outdated static content. The first step is getting compelling content in front of the listeners so the display becomes a valuable part of the experience which then opens up another medium that can be leveraged to benefit the station, producer, and ideally the listener.

But that’s not all! Stations can also benefit by reducing their workload by automating the flow of information through all these various systems. Many public radio stations have limited staffing resources so trying to develop or even populate applications on so many different devices and interfaces is near impossible. Centralized streaming and cataloging services (think TuneIn, iHeartRadio, Stitcher, iTunes Radio, etc) have popped up to try to fill the gap but just keeping these services up to date can be challenging. Ideally a station publishing this additional information to end listener devices could use the same information to keep all of these centralized streaming services up-to-date. Stations get a reduced workload through a common production workflow while listeners get a more consistent and enjoyable experience even though they are using multiple devices, interfaces, and applications.

Metadata glues it all together

So how does this all come together? Through metadata of course! There has always been a strong focus on the quality production and distribution of the core content but as broadcast radio evolves, metadata is becoming a key component in supporting all of the scenarios discussed so far. For example, a single application isn’t needed to provide a consistent user experience if various devices from cars to phones to home assistants can access the same metadata about the content.

The listener can expect that “play WAMU” will work the same everywhere if all the devices know about the same “WAMU” via the metadata. Similarly, the display on the head-unit should show the same program and story title as might be read by my home assistant when I say “what’s playing now”. Not only does good quality metadata and open, standard distribution enable this functionality, but it also enables functionality that hasn’t been broadly seen before.

How great would it be if my car stereo could tell me the next story that is coming up without having to wait for the announcer to read it to me? What if I could get a station list with genres and logos without having to use the “scan” button and listen to each station for 10 seconds? What if I could skip a broadcast radio piece by seamlessly switching from the radio to a stream and back again? What if I could share a radio story right from my home assistant? What if my “driveway moment” could seamlessly hand off to my phone so I can continue listening in the house? What if I could quickly determine the next time this story is going to play or where to get it online or do a deeper dive?

These topics are huge (as the two days of events discussing them barely scratched the surface) and there are a number of associated topics like archive management, rights issues, privacy issues, funding, autonomous vehicles, etc. which makes this a very interesting space to be in right now. As the caretakers of public radio, we want to continue to provide a great product to listeners which not only means great content, but great distribution, great presentation, great user experience, and so on. While AM/FM broadcasting may not be the hot new technology on the block, there is a lot of innovation happening and a lot more to come.

Note that these statements are my personal opinion based on my attendance at the mentioned meetings and do not represent the opinions or strategy of NPR or PRSS.

Wednesday, April 13, 2016

Distributed Task Coordination with Hazelcast

The Problem

Every small to large scale application has some number of tasks that run in the background to perform various functions such as batch processing, data export, reconciliation, billing, notifications, etc. These tasks need to be managed appropriately to not only ensure that they run when expected, but to ensure that they run in the correct environment on the correct nodes. In modern applications, microservice architectures are becoming more popular as an alternative to monolithic applications deployed in heavyweight application servers. At the hardware level, virtual machine deployments, containerization, and multiple region/zone support are desired to support failover, scaling, and disaster recovery. Coordinating tasks in these multi-service, multi-machine, multi-zone environments can be challenging even for small scale projects. While there are existing solutions to coordination, scheduling, and master election problems, many require additional hardware or database architectures that may not be reasonable for all applications. This document presents a method for using Hazelcast, an open source in-memory data grid, to coordinate tasks across a small deployment with minimal additional hardware or software required while supporting flexible task allocation and management.

Use Cases

In many applications, there will be multiple production nodes to support load balancing and fail-over as well as one or more off-site standby nodes to support disaster recovery. Depending on the application, background tasks may be needed to perform scheduled operations such as exporting data, sending notifications, cleaning up old data, generating billing invoices, data reconciliation, etc. To avoid issues such as double billing or emailing a customer twice, jobs should be executed once and only once across the cluster. It may be desirable to run some jobs, such as report generation, at a warm/read-only standby site to reduce the load on the production system. During maintenance periods, it may be desirable to pause specific tasks to avoid errors or to reduce load.

Managing these tasks across the cluster requires some kind of distributed or shared coordinator that can allocate the tasks appropriately. While the description sounds rather simple, successfully implementing and handling master elections, availability, network partitioning, etc. is extremely challenging and shouldn't be approached lightly. Reusing an existing, tested, proven solution is very valuable.

Existing Solutions

Coordinating an application across multiple services, nodes, and zones is not a new problem and there are existing solutions; however these solutions have a number of drawbacks that may cause them to not be usable in the existing application architecture or require additional hardware or software beyond what the project is already using. For small scale projects, some of these solutions may double the amount of hardware required to simply coordinate which node should run a nightly cleanup job. That being said, it is important to not reinvent the wheel if one of the existing solutions would fit into the application's existing hardware and software architecture so it is wise to review the options. A few of the more common options are presented here but this is by no means an exhaustive list.

Single Master Election

In the simplest scenario, a cluster can elect a master node using a master election algorithm such as the Raft Consensus Algorithm. Once a master is elected, that node is responsible for executing all tasks to ensure that the task only runs on a single node. This approach is relatively simple, especially if an existing implementation of a master election algorithm can be used; however it offers very little in the way of flexibility. There is no way to run tasks on different nodes or to pre-allocate tasks to specific zones such as running reporting tasks on a warm-standby site. There can also be an issue if the master is elected at a site that is in a warm-standby mode with a read-only database or other such limiting configuration. If the application requirements are simple enough that a single master running all tasks will work, it is a reasonable and well understood approach to take.

Shared Database

One of the simplest ways to coordinate across services may be to use the application's existing database to enumerate configuration information, lock names, or task schedules. As long as this database is accessible to all the nodes in the system, this method works well. For example, a single "locks" table could be used with each node performing a "select for update" operation to acquire the lock. If the locks are timed, they can be automatically released in the event that a node crashes.

While this approach is very common and can work in many scenarios, there are some drawbacks. First, all the nodes must access a common database or a synchronous replica. If asynchronous replication is in use, there is a chance that two nodes accessing two different databases are allowed to acquire the same lock. Depending on the task being executed, this could be very problematic. Second, depending on the database replication support, some nodes may be read-only (warm standby) replicas which do not allow updating and therefore do not allow the nodes using those databases to acquire locks. When using the shared DB solution, a dependency between the DB replication scheme and task allocation develops which may not be desirable. Third, and finally, not all applications are using relational style databases that make the "select for update" operation easy or even possible. Some append only databases or in-memory DB solutions may not work well for mutual exclusion/row locking type of functionality by preferring performance and eventual consistency over synchronous locking.

ZooKeeper and the Like

ZooKeeper is an Apache project that enables distributed coordination by maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is a well proven solution that should be considered when looking for a task coordination solution. There are also many similar alternatives such as Eureka, etcd, and Consul that each offer their own pros and cons. While ZooKeeper and the like are powerful solutions, they can be complex to configure properly and may require additional hardware. For a large devops team with a large deployment footprint, these additional requirements may not be an issue. However, take an example of a two node application production cluster and an additional one node disaster recovery site. ZooKeeper's three node minimum could double the hardware required for this small deployment depending on the configuration.

Some of the related solutions offer distributed key/value pairs which allow for ultimate flexibility but, because they provide only the bare minimum master election and data replication logic, also require additional logic to be implemented in the application. Depending on the application language, some of these solutions may be difficult to integrate or maintain because they require additional runtimes or configurations. Again, for larger projects this may be well worth the investment but for smaller projects it could add to the overall project deployment and maintenance costs.

Quartz and Scheduling

Quartz, a popular Java task scheduling library, supports cluster scheduling to coordinate the execution of a task once and only once in a cluster. However this solution, and many solutions like it, simply fall back to using one of the other solutions such as a shared DB, ZooKeeper, etc. to perform the heavy lifting. Therefore a "clustered" scheduler is not a solution in itself, but simply builds on an existing distributed coordination solution.

Enter Hazelcast

Hazelcast is an open-source in-memory data grid that offers a number of compelling features including:

Distributed data structures
Distributed compute
Distributed query
Clustering
Caching
Multiple language bindings
Easily embeddable in a Java application

This feature set makes Hazelcast a multi-use tool in an application. It can be used for simple messaging, caching, a key/value store, and, as described in the remainder of this document, a task coordination service. Unlike some of the other services previously described, Hazelcast can be leveraged to solve multiple problems in an application besides just distributed configuration and coordination. Being designed as a distributed memory-grid from the beginning, Hazelcast solves many of the hard underlying problems such as master election, network resiliency, and eventual consistency. A full description of Hazelcast, its features, and configuration are beyond the scope of this document so it is recommended to read the reference manual for more details. A basic understanding of Hazelcast or in-memory grids is assumed for the remainder of this document.

The general concept of the Hazelcast task coordination solution is to store semaphore/lock definitions in-memory and to coordinate the allocation of semaphore permits to nodes based on criteria such as the node name, the zone name, and the number of permits available. To handle a node dropping out of the cluster, the semaphore permits support time based expiration in addition to explicit release. All of the clustering, consistency, and availability requirements will be delegated to Hazelcast to make the solution as simple as possible.
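To make the permit lifecycle concrete, here is a minimal sketch of permits that can be released explicitly or expire on their own. It is an illustration under assumptions, not the actual implementation: the PermitTracker class and its methods are hypothetical names, the state lives in a plain local HashMap rather than in Hazelcast, and the current time is passed in explicitly so the expiration behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-JVM sketch of counted permits with time based expiration.
// In the described solution this state would be held in the Hazelcast cluster;
// a local map and "synchronized" stand in for that here.
final class PermitTracker {
    private final int maxPermits;
    private final long durationMillis;
    // Holder name -> time (in millis) at which that holder's permit expires.
    private final Map<String, Long> holders = new HashMap<>();

    PermitTracker(int maxPermits, long durationMillis) {
        this.maxPermits = maxPermits;
        this.durationMillis = durationMillis;
    }

    // Returns true if the holder has a permit after this call, either newly
    // acquired or renewed. A renewal restarts the expiration clock.
    synchronized boolean acquire(String holder, long nowMillis) {
        // Drop permits whose holders did not renew in time, e.g. a crashed node.
        holders.values().removeIf(expiry -> expiry <= nowMillis);
        if (holders.containsKey(holder) || holders.size() < maxPermits) {
            holders.put(holder, nowMillis + durationMillis);
            return true;
        }
        return false;
    }

    // Explicit release, for a holder that finishes its task early.
    synchronized void release(String holder) {
        holders.remove(holder);
    }
}
```

With a single permit, a second node is refused while the first keeps renewing, and takes over once the first node's permit is released or lapses.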

Semaphore Definitions and Permits

Hazelcast exposes the in-memory data through the well known Java data structure interfaces such as List, Map, Queue, and Set. While Hazelcast does expose a distributed implementation of a Semaphore, this solution introduces the concept of a SemaphoreDefinition and SemaphorePermit to allow for more complex permit allocations and expiration.

SemaphoreDefinition

To support functionality such as multiple permits per semaphore, node (acquirer) name pattern filtering, and zone (group) name pattern filtering, a SemaphoreDefinition class is used. The class is a basic structure that tracks the definition configuration. There is one such definition for each semaphore or task that is to be controlled. For example, there may be a definition for a "billing reconciliation" task or a "nightly export" task. The semaphore definition is composed of the following fields:

Semaphore Name: The name clients will use when requesting a semaphore permit.
Group Pattern: A regular expression that must match the group name reported by the client. If the group doesn't match, the permit will be denied. Groups can be used to control zone/environment/site access to permits.
Acquirer Pattern: A regular expression that must match the client name reported by the client. If the acquirer doesn't match, the permit will be denied. Acquirers can be used to control individual node access to permits.
Permits: the number of permits that can be allocated for a given semaphore. In some cases it is desirable to issue multiple permits while in others a single permit can be used to enforce exclusivity.
Duration: the amount of time the client can hold the permit before it will automatically release. The client must re-acquire before this time or risk losing the permit.
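
As a rough illustration of the fields above, a definition could be modeled as a small serializable value class. This is only a sketch: the class layout, the method names (allowsGroup, allowsAcquirer), and the choice of full-match regex semantics are assumptions and not taken from the original implementation.

```java
import java.io.Serializable;
import java.time.Duration;
import java.util.regex.Pattern;

// Illustrative value class; field and method names are assumptions.
final class SemaphoreDefinition implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;            // name clients use to request a permit
    private final String groupPattern;    // regex the client's group must match
    private final String acquirerPattern; // regex the client's node name must match
    private final int permits;            // maximum permits held at one time
    private final Duration duration;      // how long a permit lives before it expires

    SemaphoreDefinition(String name, String groupPattern, String acquirerPattern,
                        int permits, Duration duration) {
        if (permits < 1) {
            throw new IllegalArgumentException("permits must be at least 1");
        }
        // Compile eagerly so a malformed regex fails when the definition is loaded.
        Pattern.compile(groupPattern);
        Pattern.compile(acquirerPattern);
        this.name = name;
        this.groupPattern = groupPattern;
        this.acquirerPattern = acquirerPattern;
        this.permits = permits;
        this.duration = duration;
    }

    // Assumes the whole group name must match the pattern.
    boolean allowsGroup(String group) {
        return Pattern.matches(groupPattern, group);
    }

    // Assumes the whole acquirer name must match the pattern.
    boolean allowsAcquirer(String acquirer) {
        return Pattern.matches(acquirerPattern, acquirer);
    }

    String getName() { return name; }
    int getPermits() { return permits; }
    Duration getDuration() { return duration; }
}
```

For example, a definition with group pattern "production" and acquirer pattern "app[12]" would admit app1 and app2 in production and deny everything else.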

In the current implementation, the definitions are loaded into Hazelcast after a fixed delay to allow other nodes to come on-line and synchronize. Therefore if the definitions are already loaded, they are visible to the new client or member without reloading them from configuration. It would also be possible to use Hazelcast's support for data persistence to permanently store and recover the definitions once loaded.

The Gatekeeper

With the semaphore definitions loaded into a Hazelcast data structure, clients can begin to request permits for specific semaphore names. The "gatekeeper" is responsible for enforcing the rules of the semaphore definition when a permit is required. There are a number of ways to implement the gatekeeper depending on the Hazelcast configuration selected for the application. For example, if each microservice in the application is a full Hazelcast data member or a client, the gatekeeper could be implemented as a shared library to be included in each component. However if the Hazelcast data is exposed as an independent service in the application using the existing application communication architecture (e.g. JMS, REST, etc.), the gatekeeper may be a top-level service in the application that receives requests for permits and issues replies.

The key is that there is some part of the application responsible for implementing the permit allocation logic. When a client makes a request for a permit, either through a library or application service, the gatekeeper is responsible for the following logic:

1. Check if there is a definition matching the requested name. If not, return denied.
2. Check if the acquirer's group name matches the definition by applying the regex. If not, return denied.
3. Check if the acquirer's node name matches the definition by applying the regex. If not, return denied.
4. Check if the acquirer already holds a valid permit. If so, simply refresh the expiration and return success.
5. Check if there is a free permit available. If so, allocate it to the acquirer and return success.
6. Otherwise, return denied.
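
The decision steps above can be sketched in a single JVM as follows. This is a simplified illustration, not the original code: plain HashMaps stand in for the Hazelcast data structures, the class and method names are hypothetical, and expired permits are swept lazily on each request rather than by a periodic job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical single-JVM gatekeeper; HashMaps stand in for Hazelcast maps.
final class LocalGatekeeper {

    static final class Definition {
        final String groupPattern;
        final String acquirerPattern;
        final int permits;
        final long durationMillis;

        Definition(String groupPattern, String acquirerPattern, int permits, long durationMillis) {
            this.groupPattern = groupPattern;
            this.acquirerPattern = acquirerPattern;
            this.permits = permits;
            this.durationMillis = durationMillis;
        }
    }

    private final Map<String, Definition> definitions = new HashMap<>();
    // semaphore name -> (acquirer name -> expiration time in millis)
    private final Map<String, Map<String, Long>> granted = new HashMap<>();

    synchronized void define(String name, Definition definition) {
        definitions.put(name, definition);
    }

    // The caller supplies "now" so the logic is easy to test deterministically.
    synchronized boolean tryAcquire(String name, String group, String acquirer, long now) {
        Definition definition = definitions.get(name);
        if (definition == null) {
            return false; // no definition for the requested name
        }
        if (!Pattern.matches(definition.groupPattern, group)) {
            return false; // group not allowed
        }
        if (!Pattern.matches(definition.acquirerPattern, acquirer)) {
            return false; // node not allowed
        }
        Map<String, Long> holders = granted.computeIfAbsent(name, k -> new HashMap<>());
        // Drop permits whose expiration time has passed.
        holders.values().removeIf(expiresAt -> expiresAt <= now);

        long newExpiration = now + definition.durationMillis;
        if (holders.containsKey(acquirer)) {
            holders.put(acquirer, newExpiration); // refresh an existing permit
            return true;
        }
        if (holders.size() < definition.permits) {
            holders.put(acquirer, newExpiration); // allocate a free permit
            return true;
        }
        return false; // every permit is held by someone else
    }

    synchronized void release(String name, String acquirer) {
        Map<String, Long> holders = granted.get(name);
        if (holders != null) {
            holders.remove(acquirer);
        }
    }
}
```

In a clustered deployment the synchronized keyword would not be enough; the map access would need one of the cluster-wide coordination approaches described in the following sections.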

The gatekeeper is also responsible for periodically releasing permits once they expire. This ensures that nodes removed from the cluster, whether by crashing or being shut down, cannot hold resources indefinitely.

Pessimistic Permit Acquisition

To coordinate the allocation of a permit across the cluster, a Hazelcast Lock can be used to control access to a Map of semaphore names to granted permits. When a request comes in for a permit, the gatekeeper can acquire the lock, examine the existing permits, and make a decision on allocating the requested permit. Once the decision is made, the Map of semaphore permits can be updated and the lock released. This pessimistic approach makes use of Hazelcast's distributed lock support to ensure a single reader/writer to the existing permits.
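
A minimal sketch of the lock-then-update pattern is shown below. It is written against the standard java.util.concurrent.locks.Lock and java.util.Map interfaces so it runs with a ReentrantLock and a HashMap; the assumption is that a Hazelcast distributed lock and map would be passed in instead. The class name and the simplified permit model (a set of acquirer names per semaphore, no expiration) are illustrative only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.Lock;

// Illustrative only: the lock and map are injected so local or
// distributed implementations can be swapped in.
final class PessimisticPermits {
    private final Lock lock;
    private final Map<String, Set<String>> permitsBySemaphore;

    PessimisticPermits(Lock lock, Map<String, Set<String>> permitsBySemaphore) {
        this.lock = lock;
        this.permitsBySemaphore = permitsBySemaphore;
    }

    boolean tryAcquire(String semaphore, String acquirer, int maxPermits) {
        lock.lock(); // single reader/writer while the decision is made
        try {
            Set<String> current = permitsBySemaphore.get(semaphore);
            Set<String> holders = (current == null)
                    ? new HashSet<String>()
                    : new HashSet<String>(current);
            if (holders.contains(acquirer)) {
                return true; // already holds a permit
            }
            if (holders.size() >= maxPermits) {
                return false; // no free permit
            }
            holders.add(acquirer);
            // Write the whole value back; mutating a local copy would not
            // be visible to other members of a distributed map.
            permitsBySemaphore.put(semaphore, holders);
            return true;
        } finally {
            lock.unlock(); // always release, even if the update throws
        }
    }
}
```

Using lock.tryLock with a timeout instead of lock.lock may be worth considering so that a stuck lock holder cannot block permit requests forever.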

This approach is simple to implement and, with a limited number of semaphores and requesting clients, it should be fast enough. One issue found during testing is that occasionally, if a Hazelcast member held the lock and dropped out of the cluster, the lock would not be released on the remaining members. This issue may be related to the specific Hazelcast version (now outdated) but it is a scenario to test thoroughly.

Optimistic Permit Acquisition

To coordinate the allocation of a permit across the cluster, the replace(...) method can be used on a Hazelcast ConcurrentMap implementation of semaphore names to granted permits. When a request comes in for a permit, the gatekeeper can retrieve the existing permits and make a decision on allocating the requested permit. Once the decision is made, the Map of semaphore permits can be updated using the replace operation to ensure that the permits for the semaphore in question were not changed since the request began. This optimistic approach does not require a distributed lock but rather relies on Hazelcast's implementation of the replace(...) operation to detect concurrent modification of the existing permits.

This approach is again simple to implement and depending on the permit request/allocation pattern of the application, it does not require any external lock management. Because there is no locking required, when a permit is denied (which will be the more common case because only a single node will usually get the permit), there is little to no coordination cost across the cluster. If the replace operation fails, the permit map can be reloaded from Hazelcast and the operation can be performed again.
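
The compare-and-swap retry loop could look roughly like the sketch below. It uses the standard ConcurrentMap interface, so it runs against a ConcurrentHashMap; the assumption is that a Hazelcast map exposing the same interface would be supplied in a cluster. The class name and the simplified permit model (an immutable set of acquirer names per semaphore, no expiration) are illustrative. Note that a distributed map may compare the old value by its serialized form rather than by equals, which is worth verifying for the value type you choose.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;

// Illustrative optimistic allocation using putIfAbsent/replace.
final class OptimisticPermits {
    private final ConcurrentMap<String, Set<String>> permitsBySemaphore;

    OptimisticPermits(ConcurrentMap<String, Set<String>> permitsBySemaphore) {
        this.permitsBySemaphore = permitsBySemaphore;
    }

    boolean tryAcquire(String semaphore, String acquirer, int maxPermits) {
        while (true) {
            Set<String> current = permitsBySemaphore.get(semaphore);
            if (current != null && current.contains(acquirer)) {
                return true; // already holds a permit
            }
            int held = (current == null) ? 0 : current.size();
            if (held >= maxPermits) {
                return false; // denied without writing anything
            }

            // Build a new immutable value instead of mutating the old one.
            Set<String> copy = new HashSet<String>();
            if (current != null) {
                copy.addAll(current);
            }
            copy.add(acquirer);
            Set<String> updated = Collections.unmodifiableSet(copy);

            boolean swapped = (current == null)
                    ? permitsBySemaphore.putIfAbsent(semaphore, updated) == null
                    : permitsBySemaphore.replace(semaphore, current, updated);
            if (swapped) {
                return true;
            }
            // Another request changed the permits first; reload and retry.
        }
    }
}
```

The denied path performs a single read and no write, which is where the low coordination cost mentioned above comes from.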

The Gatekeeper Semaphore

With the gatekeeper controlling access to the permits and using Hazelcast to coordinate and track permit acquisition across the cluster, the remaining piece of the solution is the client side request to acquire permits. A semaphore is implemented by passing requests to the gatekeeper while exposing an API consistent with existing Java Semaphores. Unfortunately there is no standard Semaphore interface in Java, but the basic acquire(...), tryAcquire(...), and release(...) method names can be used to remain consistent. Additional methods such as getExpirationPeriod(...) and refresh(...) can be added to expose the concept of an expiring semaphore if desired.

The gatekeeper semaphore implementation uses the gatekeeper either via a shared library or a service request to acquire a permit. The permit request is composed of the following fields:

Semaphore Name: The name of the semaphore for which to get a permit.
Group Name: The name of the group that the client is in. Normally this is used for different environments or zones such as "production", "dr-site", "east-coast", or "west-coast". It is matched against the group pattern in the semaphore definition.
Acquirer Name: The name of the acquiring node. Normally this is the hostname and it is used to select specific machines to run a task such as "app1", "app2", "db1", or "db2". It is matched against the acquirer pattern in the semaphore definition.

Services that need to execute tasks simply acquire a permit from the semaphore just like in any multi-threaded context. The permit request is handed off to the gatekeeper and a decision is made at the Hazelcast cluster level across all nodes and zones. The semaphore can be hidden within the scheduling framework itself so the end user doesn't need to use it directly. For example, the Spring Framework TaskScheduler could be implemented to wrap Runnable tasks in a class that attempts to acquire a permit before executing the target task.
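
One way to hide the semaphore inside a scheduling framework is a Runnable wrapper like the sketch below. The PermitSource interface and PermitGuardedTask class are hypothetical names for illustration; a real implementation would forward tryAcquire to the gatekeeper with the semaphore name, group name, and acquirer name.

```java
// Hypothetical client-side view of the gatekeeper semaphore.
interface PermitSource {
    boolean tryAcquire();
}

// Runs the target task only when a permit is granted for this run.
final class PermitGuardedTask implements Runnable {
    private final PermitSource semaphore;
    private final Runnable target;

    PermitGuardedTask(PermitSource semaphore, Runnable target) {
        this.semaphore = semaphore;
        this.target = target;
    }

    @Override
    public void run() {
        // A denied permit simply skips this execution; the scheduler
        // will call run() again at the next trigger time.
        if (semaphore.tryAcquire()) {
            target.run();
        }
    }
}
```

The wrapper deliberately does not release the permit after the task runs. Under the expiring permit model, the next successful tryAcquire refreshes the permit, so the task tends to stay on the same node until that node stops asking or the definition changes. Releasing after each run is an equally valid choice if tasks should be free to move between runs.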

Controlling and Moving Tasks

The gatekeeper on top of Hazelcast, combined with the gatekeeper semaphore implementation, effectively coordinates and controls task execution across the cluster. A static configuration isn't entirely useful, however, as the tasks may need to migrate around the system to support maintenance activities, fail-over, or disaster recovery. With the semaphore definitions stored in Hazelcast, they can be edited in-memory to cause tasks to migrate by adjusting the group pattern, acquirer pattern, permits, and duration fields. The editing functionality can be built into the gatekeeper service or it can be done by an external tool by directly modifying the semaphore definition data in the cluster. The screenshot below demonstrates how the definitions could be edited live with an example user interface. The user can edit any of the semaphore definitions to restrict it to a specific acquirer or group in the cluster. As the permits expire or are released, the new acquirers will be restricted and the tasks will migrate to the appropriate group and/or acquirer automatically. By using a group pattern that matches no existing group, such as "no-run", tasks can be effectively paused indefinitely.

Another powerful feature is mass migration from one group/zone to another. For example, in the event of an emergency or large scale maintenance, a simple script could update the semaphore definitions in a batch to change the group pattern effectively moving all the tasks to a different zone. Because Hazelcast handles data replication and availability, the change to the definitions can be made on any member in the cluster and it immediately becomes visible to all members.
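
A batch update of this kind could be as small as the sketch below. It works on any java.util.Map, so it runs with a local map; the assumption is that the clustered definitions map would be passed in. The ZoneMigration and Definition names, and the exact-match test on the old group pattern, are illustrative choices.

```java
import java.util.ArrayList;
import java.util.Map;

// Illustrative batch edit that points definitions at a different zone.
final class ZoneMigration {

    static final class Definition {
        final String groupPattern;
        final String acquirerPattern;
        final int permits;

        Definition(String groupPattern, String acquirerPattern, int permits) {
            this.groupPattern = groupPattern;
            this.acquirerPattern = acquirerPattern;
            this.permits = permits;
        }

        // Definitions are immutable; an edit produces a new value.
        Definition withGroupPattern(String newGroupPattern) {
            return new Definition(newGroupPattern, acquirerPattern, permits);
        }
    }

    // Returns the number of definitions that were rewritten.
    static int moveAll(Map<String, Definition> definitions,
                       String fromGroupPattern, String toGroupPattern) {
        int moved = 0;
        // Iterate over a snapshot of the keys so writes do not disturb iteration.
        for (String name : new ArrayList<String>(definitions.keySet())) {
            Definition definition = definitions.get(name);
            if (definition != null && definition.groupPattern.equals(fromGroupPattern)) {
                // Put the new value back so the change reaches the shared map.
                definitions.put(name, definition.withGroupPattern(toGroupPattern));
                moved++;
            }
        }
        return moved;
    }
}
```

Tasks do not move at the instant the definitions change; they move as the existing permits are released or expire.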

In addition to the semaphore definition UI, the gatekeeper can periodically dump state information to a log file to help with debugging and monitoring. The example output below shows the type of information that can be easily displayed and searched by any log management tools.

Limitations

Hazelcast is designed to prefer availability over consistency; therefore in a split-brain, network partition scenario, Hazelcast will remain available on each side of the network partition and could potentially allow more than one node to acquire a semaphore permit. Depending on your network structure, acquirer and group patterns, and sensitivity to duplicate executions in these scenarios, a quorum solution may be more appropriate. One approach may be to use Hazelcast's Cluster information to implement a quorum check as part of the gatekeeper logic.
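
A quorum check in the gatekeeper could be as simple as the majority test sketched below. The class and method names are hypothetical, and the expected cluster size is assumed to come from configuration. The visible member count would be read from Hazelcast's cluster membership, for example the size of the member set returned by the Cluster object.

```java
// Illustrative majority check to run before granting any permit.
final class QuorumCheck {

    // True when this side of a partition can see a strict majority
    // of the expected cluster.
    static boolean hasQuorum(int visibleMembers, int expectedClusterSize) {
        if (expectedClusterSize < 1) {
            throw new IllegalArgumentException("expectedClusterSize must be at least 1");
        }
        return visibleMembers > expectedClusterSize / 2;
    }
}
```

With a strict majority, at most one side of a partition can pass the check, so at most one side keeps issuing permits. The cost is that a minority partition stops running tasks entirely, and an evenly split cluster (for example 2 of 4 members on each side) runs nothing until the partition heals.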

It is important to note that there may be a delay from when a semaphore definition is updated to when the task can start on the new node/group depending on if the existing permit owner releases the permit or simply lets it expire. For tasks that may run every few minutes or hours this probably isn't an issue. However if the application has strict task failover or migration time requirements, allowing permits to expire at a fixed duration may not be acceptable.

Wrapping Up

The solution presented is not a turnkey library or framework but it shows how Hazelcast can be used to perform task coordination and control with just a little custom code. For more complex applications or larger deployments, an existing tool such as ZooKeeper or etcd may be more appropriate but a simpler approach may make sense for a number of use cases including:

The application is already using Hazelcast for caching, distributed computing, or messaging.
More flexibility/customization for task coordination is needed than an existing solution offers.
There is no shared database or the task coordination scheme is incompatible with the database replication scheme.
The solution should be embedded in the existing application/services rather than requiring additional hardware or processes.

If you're looking for a powerful and flexible communication, coordination, and control mechanism for your application, check out the Hazelcast documentation and see if it can work for you. As a distributed in-memory grid exposing basic data structures and distributed computing functionality, it can take over or be the foundation for many of the more complex requirements of microservices architectures.

Sunday, June 22, 2014

A Camel and a Yeti Walk into a GitHub

The release of HazelcastMQ v1.0.0 includes a new Java STOMP client and server implementation as well as a HazelcastMQ Apache Camel component. With these additions, it is now easier than ever to set up a clustered, scalable messaging architecture with no centralized broker.

Yeti is a STOMP server and client framework built on Netty to make it simple to build STOMP implementations for existing brokers. Yeti borrows the best ideas from Stampy and Stilts to provide fast, reusable STOMP frame codecs and channel handlers while abstracting away the underlying network IO with a Stomplet API similar to the familiar Servlet API. To build a custom server, simply implement a Stomplet or extend one of the existing Stomplets to provide custom handling for STOMP frames.

HazelcastMQ Stomp implements a Stomplet to link the STOMP send, subscribe, and message commands to HazelcastMQ. This means you can now have non-Java clients send and receive messages via a STOMP API. With Hazelcast's underlying clustering, you could run a local STOMP server on each node and let Hazelcast handle all the network communication.

HazelcastMQ Camel is an Apache Camel component implementation for HazelcastMQ supporting Camel's integration framework and Enterprise Integration Patterns (EIP). The component supports configurable consumers and producers including request/reply messaging and concurrent consumers. Unlike Camel's existing JMS component, HazelcastMQ Camel has no dependency on the Spring Framework because it builds directly on HazelcastMQ Core.

If your application is already using Camel, switching to HazelcastMQ simply requires changing the endpoint URIs. For example, hazelcastmq:queue:blog.posted creates an endpoint that can consume from or produce to the blog.posted queue in HazelcastMQ.

Combining HazelcastMQ Stomp and Camel means a Java application using Camel gateway proxies could use a simple Java interface to exchange messages with a C application using STOMP while getting all the benefits of a reliable and clustered data grid.

Thursday, November 21, 2013

Authentication using Active Directory in Java with Spring LDAP

Most of my team's applications authenticate off of our application specific user data stored in a good old relational database. However we have a single, internal operations application that uses the company wide Active Directory (AD) server for user authentication. When I first developed the application, we expected many more applications to authenticate against AD, and LDAP directory access via Java has always been a little awkward, so we decided to deploy an Atlassian Crowd instance. Crowd works well and exposes a simple REST interface for user authentication but it was one extra server and application to maintain and monitor. Given that we only had one application using it and we were hitting one AD instance on the backend, it became a bit of unnecessary overhead.

I've never been much of an AD or even LDAP expert, but I decided to look into authenticating off of AD directly from my Java application. The best LDAP library I found is Spring LDAP and we already use Spring for all of our dependency injection so it was a natural fit. I expected to spend a day or two wading through distinguished names (DNs), properties, AD hierarchies, etc. but to my surprise I was able to get it up and running in a few lines of code as shown here:

To get all the various configuration values I needed, I simply looked at my Jenkins LDAP setup as well as the original Crowd configuration. Between the two and through a little poking around using JXplorer, I was able to find all the information I needed.

The active directory at my company requires an authenticated user in order to browse the directory so I specify this information as the user DN and password. I can then issue authentication requests against any user in AD. Spring LDAP hides a lot of the gory details of crawling the directory, finding the user, and performing the authentication check. You'll obviously have to tweak the base DN for your own configuration and in a heavily used application, you'll probably also want to look into Spring LDAP's pooling support.

In the long run I'd like to get Crowd more fully configured and supported so I can point a bunch of my internal tools to it (like Git, SVN, Jenkins, etc) but for now I can shut it down and let my one application hit AD directly.

Monday, October 28, 2013

Helmsman 1.0 Available

I'm happy to announce the first release of Helmsman, a simple, easy to configure, easy to deploy, service control tool. My current project is a micro-services architecture which requires starting, stopping, and checking the status of 12 to 15 daemons on a couple of machines during deployments and maintenance sessions. Helmsman makes this process simple, reliable, and quick. You tell Helmsman where to steer the ship and it politely abides.

History

Helmsman's legacy is a collection of SystemV style init scripts that were copied to different machines and manually maintained or symlinked all over the place. Needless to say that didn't scale very well and the scripts started to drift between the machines. We also ran into issues with permissions because we didn't want the entire development team or QA team having root access but the scripts needed to be maintained, linked to the appropriate run level, and services controlled.

This led to a rewrite in Python which was chosen because it was quick to put together and has pretty good subprocess execution and monitoring capabilities. Unfortunately the implementation tried to be a little too fancy and do some nice ASCII art to show service status which would cause the interpreter to crash in various term configurations. We also ran into issues keeping the Python install (and supporting modules) consistent across the various Solaris Sparc, Solaris x86, RedHat x86, OSX, and Windows machines in use throughout different environments (a discussion for another time).

I then spent some time looking for alternatives. Unfortunately I didn't find anything that was simple to install, cross platform, and would be maintainable by our (mostly Java) development team. I looked at monit, Chef, Puppet, RunDeck, Jenkins, Upstartd, etc. but they felt way too heavy weight or got us back into the issue of needing another runtime across all of our machines. We're not a huge shop so having to build out Puppet scripts to consistently install a runtime to start and stop services just doesn't seem like time well spent.

Given that our main applications are written in Java, we already maintain JVM installs on all machines, and our developers know Java well, it seemed like an obvious choice. After spending a few hours playing around with commons-exec and working out how to format the output and debugging information to be readable and support all terminals, I was able to rewrite the Python scripts in a day. Helmsman was born.

My Process

I deploy Helmsman with our application. So our deployment scripts (automated via Jenkins) push a copy of Helmsman out with our deployment, stop all the services using the existing version, move everything out of the way, install the new deployment, and then use the new Helmsman to start everything back up. This makes it super simple to make sure that the same version is on every machine and that all changes are getting pushed out reliably just like the rest of our build.

In test/stage environments, I have versions set up for QA to use to start and stop key services during testing to test failover, redundancy, etc. We also use the groups feature to define services that need to stay up even when in maintenance mode or services that should run at the warm standby site.

Features

Some of the features include:

Simple Java properties configuration format
One jar deployment
Base configuration shared across all environments/machines
Per machine configuration overrides
Simple service start/stop ordering (no dependency model)
Parallel or serial service execution

Configuration is done via Java properties files which list the names of the services and then a few basic properties for each service. The "services" are simply a command to be executed which follows the SystemV init script style of taking an argument of start, stop or status. These scripts can be custom written but in most cases they will be provided by frameworks like Java Service Wrapper (JSW) or Yet Another Java Service Wrapper (YAJSW) or by your container.

Get It

You can grab the source from my GitHub repo or grab a precompiled version from my GitHub hosted MVN repo. Check out the GitHub page for more details on how to use the tool.

Let me know if you find a use for Helmsman in your process. Hopefully it makes your life a bit easier.

Wednesday, July 31, 2013

Vaadin, Shiro, and Push

I've been using Vaadin for the past few months on a large project and I've been really impressed with it. I've also been using Apache Shiro for all of the project's authentication, authorization, and crypto needs. Again, very impressed.

Up until Vaadin 7.1, I've just been relying on my old ShiroFilter based configuration of Shiro using the DefaultWebSecurityManager. While this configuration wasn't an exact fit for a Vaadin rich internet application (RIA), it worked well enough that I never changed it. The filter would initialize the security manager and the Subject and it was available via the SecurityUtils as expected.

Then Vaadin 7.1 came along with push support via Atmosphere. Depending on the transport used, Shiro's SecurityUtils can no longer be used because it depends on the filter to bind the Subject to the current thread but, for example, a Websocket transport won't use the normal servlet thread mechanism and a long standing connection may be suspended and resumed on different threads.

There is a helpful tip for using Shiro with Atmosphere where the basic idea is to not use SecurityUtils and to simply bind the subject to the Atmosphere request. Vaadin does a good job of abstracting away the underlying transport which means there is little direct access to the Atmosphere request; however Vaadin does implement a VaadinSession which is the obvious place to stash the Shiro Subject.

First things first, I switched from using the DefaultWebSecurityManager to just using the DefaultSecurityManager. I also removed the ShiroFilter from my web.xml. With the modular design of Shiro I was still able to use my existing Realm implementation and just rely on implementing authc/authz in the application logic itself. The Vaadin wiki has some good, general examples of how to do this. Essentially this changes the security model from web security where you apply authc/authz on each incoming HTTP request to native/application security where you implement authc/authz in the application and assume a persistent connection to the client.

Next up, I needed a way to locate the Subject without relying on SecurityUtils due to the thread limitations mentioned above. Following the general idea of using Shiro with Atmosphere, I wrote a simple VaadinSecurityContext class that provides similar functionality but binds the Subject to the VaadinSession rather than to a thread. Now that I don't have the SecurityUtils singleton anymore, I rely on Spring to inject the context into my views (and view-models) as needed using the elegant spring-vaadin plugin.

At this point everything was working and I had full authc/authz with Shiro and Vaadin push support. However, the Shiro DefaultSecurityManager uses a DefaultSessionManager internally to manage the security Session for the Subject. While you could leave it like this, I didn't like the fact that my security sessions were being managed separately from my Vaadin/UI sessions. This was going to be a problem when it came to session expiration because Vaadin already has UI expiration times and VaadinSession expiration times and I was now introducing security Session expiration times. The odds of getting them all to work together nicely were slim and I could imagine users getting randomly logged out while still having valid UIs or VaadinSessions.

My solution was to write a custom Shiro SessionManager and inject it into the DefaultSecurityManager. My implementation is very simple with the assumption that whenever a Shiro Session is needed, a user specific VaadinSession is available. The VaadinSessionManager creates a new session (using Shiro's SimpleSessionFactory) and stashes it in the user specific VaadinSession. Expiration of the Shiro Session (and Subject) is now tied to the expiration of the VaadinSession. While I could have used the DefaultSessionManager and implemented a custom Shiro SessionDAO, I didn't see that the DefaultSessionManager offered me much given that I did not want Session expiration/validation support.

So that's it. I wire it all up with Spring and I now have Shiro working happily with Vaadin. The best part is that none of my existing authc/authz code changed because it all simply works with the Shiro Subject obtained via the VaadinSecurityContext. In the future if I need to change up this configuration, I expect that my authc/authz code will remain exactly the same and all the changes will be under the hood with some Spring context updates.

I'm interested to hear if anyone else found a good way to link up these two great frameworks or if you see any holes in my approach. I'm no expert on Atmosphere and Vaadin does a good bit of magic to dynamically kick off server push, but so far things have been working well. Best of luck!