How I got New Relic working with Thin Async

I recently had to build a small service where performance and reliability were high priorities. Our main app is built on Rails, but for these requirements it really doesn’t make a very good fit. Arguably, perhaps Ruby isn’t a good fit it self, but since the service would be exposing a library already built with Ruby, that was a given. So I decided to try to write the service directly on top on Thin.

Building directly on Thin means very few dependencies, which means fewer things that can go wrong and which means an app that is easier to reason about. This helps with the reliability part. In case you are not aware, Thin is based on Event Machine and uses a reactor pattern for concurrency, rather than threads. Essentially, this means that everything happens within a single thread, with the benefit of being very fast and having a very low resource overhead. This is the same reason why Nginx is able to handle far more concurrent connections than – say – Apache.

The downside of writing code directly on Thin, is that it has to be written in an evented fashion. For example, if you have to query a database, you can’t just wait for the call to return - you have to pass a callback.

class MyApp
  def call(env)
    Thin::AsyncResponse.perform(env) do |response|
      response.headers["Content-Type"] = "text/plain"
      db = Mysql2::EM::Client.new($mysql_configuration)
      defer = db.query "select now()"
      defer.callback do |result|
        now = result.first
        response.status = 200
        response << "The time is #{now}"
        response.done
      end
      defer.errback do |err|
        response.status = 500
        response << "Oops - Something went wrong: #{err}"
        response.done
      end
    end
  end
end

run MyApp.new

For smaller applications that’s ok, but for something more complex it quickly becomes unweildy. For our use case, it did fine though. And the end result was pretty satisfactory - on a stock EC2 m3.medium the app was able to handle close to 2000 req/sec without bulging.

All good, except for one thing.

I’m a big fan of New Relic. There are alternatives out there, but nothing that quite gives the same level of detail and insight – But the New Relic client side library (agent) doesn’t work with Thin’s async style. At least not out of the box. Looking at the source code of the agent makes it clear why that is:

module NewRelic
  module Agent
    module Instrumentation
      module MiddlewareTracing
      ...
        def call(env)
          first_middleware = note_transaction_started(env)

          state = NewRelic::Agent::TransactionState.tl_get

          begin
            Transaction.start(state, category, build_transaction_options(env, first_middleware))
            events.notify(:before_call, env) if first_middleware

            result = (target == self) ? traced_call(env) : target.call(env)

            capture_http_response_code(state, result)
            events.notify(:after_call, env, result) if first_middleware

            result
          rescue Exception => e
            NewRelic::Agent.notice_error(e)
            raise e
          ensure
            Transaction.stop(state)
          end
        end
      ...

This is a piece of Rack middleware, which is how New Relic monitors your app. Essentially, it starts a transaction, delegates to the next handler in the chain and then stops the transaction again. In the case of an async request however, the request returns immediately, but it only completes much later. The result is that NR falsely reports all requests as completing in 0 time.

Note that NR works fine with a traditional threaded concurrency model - the global state is kept thread-local through the single global accessor in NewRelic::Agent::TransactionState.tl_get. Everything else that NR does, uses this to store state.

Hacking the agent

Since I really wanted to use NR, I figured I would see if I could get it working with async.

The main problem seems to be that all state is stored globally. The usual solution to that would be to pass the state around as an object instance instead, but in this case that would be a massive change and would probably mean that the client code would need quite deep awareness of NR – which is not a good thing.

As a compromise I tried keeping the global state design, but instead of a true global, I would make a switchable registry. Then all we have to do is switch this based on an identifier that is unique to the request. In the scope of rack/Thin, the ENV variable contains everything specific to a request/response, so that seems like a good fit:

module NewRelic
  module Agent
    # Patch TransactionState to be local to the currently scoped request
    class TransactionState

      def self.tl_get
        id = scope_id
        if id
          tl_state_for_scope(id)
        else
          tl_state_for(Thread.current)
        end
      end

      def self.tl_state_for_scope(scope_id)
        @transaction_states ||= {}
        state = @transaction_states[scope_id]
        if state.nil?
          state = TransactionState.new
          @transaction_states[scope_id] = state
        end
        state
      end

      def self.current_env=(env)
        @current_env = env
        unless @current_env['new_relic.scope_id']
          @new_relic_scope_ids ||= 0
          @new_relic_scope_ids += 1
          @current_env['new_relic.scope_id'] = @new_relic_scope_ids.to_s
        end
      end

      def self.scope_id
        @current_env ? @current_env['new_relic.scope_id'] : nil
      end

    end
  end
end

First we override the normal implementation of tl_get. If we don’t have a scope_id, we let it fall back to normal behaviour. Otherwise we call tl_state_for_scope, which essentially is a local registry of TransactionState instances. The unique identifier (scope_id) is stored on the env. To use this, we somehow have to set the current_env before we enter into a handler and clear it again afterwards. That seems like a good case for using a block:

...
  def self.scope_env(env_or_request, &block)
    previous = @current_env
    @current_env = env_or_request.respond_to?(:env) ? env_or_request.env : env_or_request
    result = yield
    @current_env = previous
    result
  end
...

Notice that we maintain the previous env, so this should be safe to nest, even if that won’t normally happen in an evented app.

Armed with this, let’s patch the middleware to keep the transaction open:

module NewRelic
  module Agent
    module Instrumentation
      module MiddlewareTracing
      ...
        def call(env)
          NewRelic::Agent::TransactionState.scope_env(env) do
            first_middleware = note_transaction_started(env)

            state = NewRelic::Agent::TransactionState.tl_get
            is_async = nil
            env['new_relic.middleware_depth'] ||= 0
            env['new_relic.middleware_depth'] += 1

            begin
              Transaction.start(state, category, build_transaction_options(env, first_middleware))
              events.notify(:before_call, env) if first_middleware

              result = (target == self) ? traced_call(env) : target.call(env)

              is_async = result[0] == -1

              capture_http_response_code(state, result)
              events.notify(:after_call, env, result) if first_middleware

              result
            rescue Exception => e
              NewRelic::Agent.notice_error(e)
              raise e
            ensure
              # If the request is async, we expect the handler to end these transactions
              Transaction.stop(state) unless is_async
            end
          end
        end
      ...

The key change here is that Transaction.stop(state) is not called for an async response. And we scope the contents of the code to the current env, using the code we wrote above. Of course, since the transaction is no longer stopped by the middleware, it now becomes the responsibility of the client code to shut it down when the request completes.

Let’s first create a helper method for this - Transaction seems to be a good place for that:

module NewRelic
  module Agent
  class Transaction
    # End async request cycle
    def self.stop_async(env, response=nil)
      state = NewRelic::Agent::TransactionState.tl_get
      state.current_transaction.http_response_code = response.status if response && state.current_transaction
      env['new_relic.middleware_depth'].times do
        NewRelic::Agent::Transaction.stop(state)
      end
    end
  end # class Transaction
...

The middleware_depth counter is there to make sure any nested calls are unwound correctly.

Finally, we need the client code to wrap all handlers in scope_env. So our example from earlier now becomes:

class MyApp
  def call(env)
    NewRelic::Agent::TransactionState.scope_env(env) do
      Thin::AsyncResponse.perform(env) do |response|
        response.headers["Content-Type"] = "text/plain"
        db = Mysql2::EM::Client.new($mysql_configuration)
        defer = db.query "select now()"
        defer.callback do |result|
          NewRelic::Agent::TransactionState.scope_env(env) do
            now = result.first
            response.status = 200
            response << "The time is #{now}"
            response.done
          end
        end
        defer.errback do |err|
          NewRelic::Agent::TransactionState.scope_env(env) do
            response.status = 500
            response << "Oops - Something went wrong: #{err}"
            response.done
          end
        end
      end
    end
    response.callback do
      NewRelic::Agent::Transaction.stop_async(env, response)
    end
  end
end

run MyApp.new

Clearly, wrapping code in scope_env gets a bit tedious, so that would be another good candidate for abstracting away. I’ll leave that as an excersise for the reader.

Method tracing

There’s still a few lose ends though. First there is the name of the transaction as it appears within NR. Currently it will all just show up as MyApp#call, which is a bit of a problem if your app has more than one pathway/routes. You can explicitly set this with:

class MyApp
  def call(env)
    NewRelic::Agent::TransactionState.scope_env(env) do
      txn = NewRelic::Agent::Transaction.tl_current
      txn.set_overriding_transaction_name txn.make_transaction_name("My Custom Transaction"), txn.category
      ...

We also have a bit of patching to do to get method tracers to work correctly. Again, instead of this async code:

trace_execution_scoped "Scope/Module/method" do
  ...
end

We need to start a timer manually and pass it when the trace commences:

t0 = Time.now.to_f
ConnectionPool.instance.acquire do |db| # Expensive, async operation
  NewRelic::Agent::TransactionState.scope_env(env) do
    trace_execution_scoped "Database/ConnectionPool/acquire", :start_time => t0
    ...

This is fairly straight-forward:

module NewRelic
  module Agent
    module MethodTracer
      def trace_execution_scoped(metric_names, options={}) #THREAD_LOCAL_ACCESS
        if block_given?
          NewRelic::Agent::MethodTracerHelpers.trace_execution_scoped(metric_names, options) do
            # Using an implicit block avoids object allocation for a &block param
            yield
          end
        else
          NewRelic::Agent::MethodTracerHelpers.trace_execution_scoped(metric_names, options)
        end
      end
    end # module MethodTracer

    # We override these in order to allow passing in start_time instead of relying on a block to time the call
    module MethodTracerHelpers
      def trace_execution_scoped(metric_names, options={}) #THREAD_LOCAL_ACCESS
        state = NewRelic::Agent::TransactionState.tl_get
        return (block_given? ? yield : nil) unless state.is_execution_traced?

        metric_names = Array(metric_names)
        first_name   = metric_names.shift
        return (block_given? ? yield : nil) unless first_name

        additional_metrics_callback = options[:additional_metrics_callback]
        start_time = options[:start_time] || Time.now.to_f
        expected_scope = trace_execution_scoped_header(state, start_time)

        begin
          result = block_given? ? yield : nil
          metric_names += Array(additional_metrics_callback.call) if additional_metrics_callback
          result
        ensure
          trace_execution_scoped_footer(state, start_time, first_name, metric_names, expected_scope, options)
        end
      end
    end # class MethodTracerHelpers
  ...

So, here we allow trace_execution_scoped to be called without a block, and we allow start_time to be passed in through options.

That’s all

I wrapped the patches up into a gem and put it on Github, in case anyone else should find this useful. Do mind that this monkey patches directly on the main NR agent code. While transparently compatible, if the underlying code changes, it may break. The version of newrelic_rpm that I have used is (3.13.2.302)

You can find the code at: github.com/troelskn/newrelic-async

Discuss this post

If you liked this, you should follow me on twitter for more posts.