A tiny monitoring library in Go

I wrote a tiny monitoring library in Go over the weekend. The goal was to create something that was extensible to any type of source that I wanted to monitor while remaining simple and easy to test.

I messed with a few different approaches and ultimately landed here. The library is a single file, but I think it’s pretty robust. I initially tried a more interface-focused approach, but I prefer the typed function implementation shown below. It might change as things grow, but for now it’s pretty slick to use.

package alerts

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Alerts is a package for creating and monitoring a set of Monitors.
// A Monitor has an Alert and a Check on it.
// A Check is run at an interval decided by the Monitor.
// A Check func runs and returns a bool and an error.
// A fail on the bool will trigger a call of the Alert function.

// Monitor combines an Alert, a Check, and an Interval.
// It has a single method Run that calls Check at every Interval.
type Monitor struct {
	Name     string
	Alert    Alert
	Check    Check
	Interval time.Duration
}

// Alert is called when a Check returns false.
// * Alerts can be called multiple times.
// * Alerts can't error by design, but they could log if needed.
type Alert func(ctx context.Context)

// Check runs and returns a bool and an error.
// * A check can pass and still return an error, e.g. degraded service
type Check func(ctx context.Context) (bool, error)

// Siren is responsible for starting, restarting, and stopping
// a set of Monitors. It interacts with the HTTP wrapper to become the
// Monitors resource.
type Siren struct {
	sync.Mutex
	Monitors []*Monitor
}

// Add adds a Monitor to the Siren and starts the Monitor.
func (s *Siren) Add(ctx context.Context, mon *Monitor) error {
	s.Lock()
	s.Monitors = append(s.Monitors, mon)
	s.Unlock()

	go mon.Run(ctx)

	return nil
}

// List returns a map of all Monitors currently in the Siren, keyed by Name.
func (s *Siren) List(ctx context.Context) (map[string]*Monitor, error) {
	// hold the lock so we don't race with a concurrent Add
	s.Lock()
	defer s.Unlock()

	monitors := map[string]*Monitor{}
	for _, m := range s.Monitors {
		monitors[m.Name] = m
	}
	return monitors, nil
}

// Run starts a monitor and runs until a Check fails or the context is cancelled.
func (m *Monitor) Run(ctx context.Context) error {
	// initialize the monitor
	if err := m.Init(); err != nil {
		return fmt.Errorf("failed to initialize: %w", err)
	}

	// start running the monitor
	for {
		// run check once at the beginning and then every m.Interval
		ok, err := m.Check(ctx)
		if !ok {
			// alert when check fails
			m.Alert(ctx)
			// break out of the loop and return our reason for failure
			return err
		}

		// wait for the next interval, or stop early on cancellation
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(m.Interval):
		}
	}
}

// Init validates the monitor's configuration and prepares it for execution.
func (m *Monitor) Init() error {
	// for right now, this just returns nil
	return nil
}

The code is simple but powerful. We run a Check and then wait for a given Interval of time until we run one again. If the Check returns false, we fire our Alert function and then return the error that the Check returned. This allows us to set up an interval monitor that we can kick off whenever we want. One drawback we’ll need to handle eventually is that a single failed Check stops the Monitor (a Check that passes but also returns an error keeps it running). This is intentional, but it could be changed easily, or we could write retry logic to handle it better.

I considered making Check handle errors like Linux processes do where 0 is a success and any number above that corresponds to some error code or class, but that felt like unnecessary complexity. I like this approach for being an unambiguous pass-fail mark. The error can return some information, but it won’t matter for the Alert function which fires on a failure.

The Init function is currently empty and unused, and I considered removing it, but I feel like there will inevitably be cases where some preparation for a Monitor will be required, so I’m leaving it for now.

InfluxDB Adapter

This is only half the puzzle, though. Of course, we must implement a Monitor. So let’s look at that code! This is the influxdb.go file. It implements a simple InfluxDB monitor that runs a query and checks that its results fall within a threshold. We can make this configurable later. For now, it’s perfectly acceptable to hard-code our thresholds and query parameters. All we need to know is that we can make this programmatically defined whenever needed.

package alerts

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

///////////////////////////
// INFLUXDB IMPLEMENTATION
///////////////////////////

// Influx DB documentation is at
// * https://docs.influxdata.com/influxdb/cloud/api-guide/client-libraries/go/

// InfluxClient holds methods for querying it and creating Alerts.
type InfluxClient struct {
	client influxdb2.Client
}

// NewInfluxClient creates a new InfluxDB client or returns an error.
func NewInfluxClient(ctx context.Context) (*InfluxClient, error) {
	influxURL := os.Getenv("INFLUX_URL")
	influxToken := os.Getenv("INFLUX_TOKEN")
	client := influxdb2.NewClient(influxURL, influxToken)

	go func(c context.Context) {
		<-c.Done()
		log.Printf("context cancellation detected")
		// TODO: Handle client closure correctly
		// defer client.Close()
	}(ctx)

	return &InfluxClient{
		client: client,
	}, nil
}

// create makes a new Monitor from the InfluxClient.
func (i *InfluxClient) create(ctx context.Context, query string) (*Monitor, error) {
	// the query API is scoped to an organization, so the org ID must be set
	orgID := os.Getenv("INFLUX_ORGID")
	if orgID == "" {
		return nil, fmt.Errorf("ErrInvalidOrgID")
	}
	api := i.client.QueryAPI(orgID)
	m := &Monitor{
		Alert: func(ctx context.Context) {
			log.Printf("ERROR: monitor alerted")
		},
		Interval: time.Minute * 15,
		Check: func(ctx context.Context) (bool, error) {
			result, err := api.Query(ctx, query)
			if err != nil {
				log.Printf("ERROR QUERYING INFLUXDB %+v", err)
				// result may be nil when the query fails, so stop here
				// instead of calling Close on it
				return false, err
			}
			defer result.Close()

			// loop over until we prove our monitor correct.
			// TODO: check some configurable upper and lower bounds.
			ok := false
			for result.Next() {
				r := result.Record()
				then := time.Now().Add(-time.Minute * 15)
				if r.Start().After(then) {
					ok = true
				}
			}
			// surface any error hit while reading the result stream
			return ok, result.Err()
		},
	}
	return m, nil
}

As you can see, our implementation is pretty flexible here. We can define any number of external services or APIs that we need to connect to, we can call out to any other internal libraries we want, and we can perform advanced querying and process the results as neat functions.

The best part of this design is that it completely separates the problem of managing and running a set of Monitors from the implementation concerns of a given Monitor.

Next Steps

I’ll be improving this library in the coming weeks, as it’s the heart of a project I’m working on with a friend that includes some cool hardware and control loops, but that will have to come next time! For now, if you want to see this code in full (though it likely looks different from how it does in this post), you can browse the full repository here.