Scaling Detection and Response Operations at Coinbase Pt2

TL;DR: At Coinbase, we aim to be the most trusted and secure place to interact with the cryptoeconomy in support of our mission to increase economic freedom. To help achieve this aim we have a dedicated team - the Computer Security Incident Response Team (CSIRT) - who hunt, detect and respond to security threats targeting Coinbase systems and data. In this three-part blog series, we’ll cover some of the strategies and systems that the CSIRT has implemented at Coinbase to investigate and respond to threats more effectively, scale our detection and response operations, and ultimately keep Coinbase the most trusted and secure cryptocurrency exchange.

By James Dorgan, Engineering, September 15, 2023, 8 min read

In part one of this blog series we looked at how building an in-house platform that ties together security tooling and log sources has improved our ability to scale detection and response operations at Coinbase. Specifically, we focused on how we’ve been automating investigation workflows and increasing our analysts’ ability to identify key contextual information while investigating an alert. All of the improvements that we’ve discussed so far take place after an alert has been generated, where an analyst has already started building their understanding of what’s happened by enriching the alert with information from additional data sources.

Following on from part one, we’ll be covering how we’ve been building more context directly into our detection logic and into the resulting alerts. This has allowed us to scale our detection and response operations by ensuring our alerts contain relevant contextual information, and by expanding the quality and variety of data that is available to our analysts when they’re writing and tuning detection rules.

Providing more contextual information at the point where an alert is generated gives our analysts as much relevant information as possible when they first begin triaging an alert, allowing them to more rapidly identify false positives or spot alerts that should be prioritized. Building more data sources that can be leveraged within the logic of our detection rules not only helps our analysts write more targeted alerts, but also allows us to tune out false positives more precisely and reduce alert fatigue.

Let’s jump into why we want to build more context around our alerts and detections:

Building Context Directly Into Alerts and Detection Logic

In an ideal world, detections would be delivered to analysts along with all of the relevant contextual information required to effectively perform a complete investigation. In reality, most detections only provide a small amount of contextual information and the investigating analyst is required to do a lot of manual digging into logs and systems to develop their understanding of the alert.

While it’s certainly challenging to populate an alert with all of the information required for an analyst to perform a complete investigation, many vendor-provided detections - and even detections built in-house - lack much of the relevant contextual information:

The alert shown above is a fairly typical example of a detection system which only has limited context of the environment that it is operating in. Even with a properly integrated XDR system it’s not uncommon to see alerts that provide detailed information about the threat event itself, but limited contextual information about the relevance of the threat to your environment. An analyst investigating this type of alert would typically need to spend time trying to answer basic questions to develop their understanding of the alert: Who owns this machine? What team does this user work in? What is the EDR GUID of this machine? etc.

To drive more context directly into our alerts we built a context engine that integrates with various log sources and systems that we use at Coinbase. The context engine is responsible for querying information out of these systems at regular intervals to produce two tables:

User Profiles - Information relating to user entities. e.g. email address, team, work location, assigned devices, last login IPs, hire date, group memberships etc.

Machine Profiles - Information relating to device entities. e.g. serial number, platform type, owner name, last seen date, containment status, attached USB devices, associated public IP addresses etc.

The information stored in these tables ranges from ephemeral data such as a device’s current public IP address, to static data such as a machine’s serial number and platform type. Each piece of data is refreshed at an interval that depends on how frequently it is likely to change, and how fresh it needs to be in order to be useful.
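To make the refresh idea concrete, here is a minimal Python sketch of a machine profile with per-field refresh intervals. The field names, the interval values and the `stale_fields` helper are all assumptions made for illustration; they are not the actual schema or schedule of the context engine.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical refresh intervals: volatile fields are refreshed often,
# near-static fields rarely. The real values are not given in this post.
REFRESH_INTERVALS: Dict[str, timedelta] = {
    "public_ips": timedelta(minutes=15),        # ephemeral
    "containment_status": timedelta(hours=1),
    "owner_email": timedelta(days=1),
    "serial_number": timedelta(days=30),        # effectively static
}


@dataclass
class MachineProfile:
    serial_number: str
    platform_type: str
    owner_email: str
    public_ips: List[str] = field(default_factory=list)
    # When each field was last refreshed from its upstream source.
    refreshed_at: Dict[str, datetime] = field(default_factory=dict)


def stale_fields(profile: MachineProfile, now: datetime) -> List[str]:
    """Return the fields that are due for a refresh at time `now`."""
    stale = []
    for name, interval in REFRESH_INTERVALS.items():
        last = profile.refreshed_at.get(name)
        # A field that was never refreshed is always considered stale.
        if last is None or now - last >= interval:
            stale.append(name)
    return stale
```

A scheduler could call `stale_fields` for each profile and re-query only the upstream systems that back the stale fields, rather than rebuilding the whole table on every run.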

At a high level, these tables are essentially lookup tables that aggregate information from data sources across Coinbase to provide pivots for common data points (such as serial numbers, emails and hostnames) found in alerts and logs:
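The pivoting behavior can be sketched as a small in-memory index, assuming the profiles are available as plain dictionaries. The sample records, the `PIVOT_KEYS` tuple and both helper functions are hypothetical and only illustrate the lookup-table idea.

```python
from typing import Dict, List, Tuple

# Illustrative records only; real profiles carry many more fields.
MACHINE_PROFILES = [
    {"serial_number": "SERIAL-001", "hostname": "laptop-alice",
     "owner_email": "alice@example.com"},
    {"serial_number": "SERIAL-002", "hostname": "laptop-alice-2",
     "owner_email": "alice@example.com"},
    {"serial_number": "SERIAL-003", "hostname": "laptop-bob",
     "owner_email": "bob@example.com"},
]

# Data points that commonly appear in alerts and logs (assumed set).
PIVOT_KEYS = ("serial_number", "hostname", "owner_email")

Index = Dict[Tuple[str, str], List[dict]]


def build_pivot_index(profiles: List[dict]) -> Index:
    """Index every profile under each of its pivot values."""
    index: Index = {}
    for profile in profiles:
        for key in PIVOT_KEYS:
            value = profile.get(key)
            if value:
                # Lowercase so lookups are case-insensitive.
                index.setdefault((key, value.lower()), []).append(profile)
    return index


def lookup(index: Index, key: str, value: str) -> List[dict]:
    """Return every profile matching the given pivot, or an empty list."""
    return index.get((key, value.lower()), [])
```

The index maps each pivot to a list because one value can match several profiles, for example a user with more than one assigned device.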

Either within the detection logic, or during post-alert processing, the data in these tables is used to enrich our detections with relevant contextual information. By having this information immediately available to the investigating analyst when they begin triaging an alert, we’re able to reduce the amount of time and number of investigative steps needed to determine whether an alert is a true positive or a false positive.

In addition to reducing the time required to triage an alert, leveraging the data provided by the context engine also gives our analysts additional control when building and tuning detections. For example, we commonly find detections that accurately identify the target behavior that we consider to be suspicious, but where we have to filter out certain teams or users who legitimately perform that behavior as part of their day-to-day duties. Instead of having the overhead of continually tuning the rule to account for employees joining or leaving these teams, or a user changing devices, we can leverage the context tables to add an exclusion or reduce the severity of an alert depending on which team a user works in.
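As a rough sketch of this kind of team-based tuning, the Python below suppresses or downgrades an alert based on the team recorded in a user profile. The team names other than “Security Testing”, the severity scale and the `tune_alert` function are assumptions for illustration rather than a description of our rule engine.

```python
from typing import Dict, Optional

SEVERITY_ORDER = ["low", "medium", "high", "critical"]  # assumed scale

# Hypothetical tuning policy keyed on team name.
TEAM_TUNING = {
    "Security Testing": "suppress",
    "Infrastructure": "downgrade",
}

USER_PROFILES: Dict[str, dict] = {
    "alice@example.com": {"team": "Security Testing"},
    "bob@example.com": {"team": "Infrastructure"},
    "carol@example.com": {"team": "Payments"},
}


def tune_alert(alert: dict, user_profiles: Dict[str, dict]) -> Optional[dict]:
    """Suppress, downgrade or pass through an alert based on the user's team."""
    team = user_profiles.get(alert["user_email"], {}).get("team")
    action = TEAM_TUNING.get(team)
    if action == "suppress":
        return None  # excluded team: no alert is raised
    if action == "downgrade":
        idx = SEVERITY_ORDER.index(alert["severity"])
        # Drop one severity level, never going below the lowest.
        return {**alert, "severity": SEVERITY_ORDER[max(idx - 1, 0)]}
    return alert  # unknown or untuned team: leave the alert unchanged
```

Because the policy is keyed on the team rather than on individual users or devices, the rule itself does not change when someone joins or leaves a team; only the profile data does.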

Example Threat Detection Rule using User Profiles

In the simplified threat detection query shown above, we are searching AWS CloudTrail logs for user agents that contain strings related to security testing tools. As there are some teams at Coinbase who are authorized to utilize these types of tools, we need to filter out requests originating from these teams to avoid false positives. By using the context tables discussed above we can quickly tune this rule to exclude any user who is part of the “Security Testing” team. As users join and leave that team over time, the context tables will automatically update, meaning that our downstream detection rules also automatically update.
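The same logic can be sketched in Python over already-parsed events. This is not the query referenced above: the list of tool strings, the normalized `user_email` field and the function name are assumptions, and only `userAgent` is taken from the CloudTrail record format.

```python
from typing import Dict, List

# Illustrative substrings for security testing tools (assumed list).
SUSPICIOUS_AGENT_STRINGS = ("sqlmap", "nikto", "nmap")

# Teams authorized to use these tools, per the example in this post.
EXCLUDED_TEAMS = {"Security Testing"}

USER_PROFILES: Dict[str, dict] = {
    "alice@example.com": {"team": "Security Testing"},
    "bob@example.com": {"team": "Payments"},
}


def detect_testing_tool_user_agents(
    events: List[dict], user_profiles: Dict[str, dict]
) -> List[dict]:
    """Return events with a suspicious user agent from non-excluded teams."""
    alerts = []
    for event in events:
        agent = event.get("userAgent", "").lower()
        if not any(s in agent for s in SUSPICIOUS_AGENT_STRINGS):
            continue  # user agent does not look like a testing tool
        # `user_email` is a hypothetical field added during log normalization.
        profile = user_profiles.get(event.get("user_email", ""), {})
        if profile.get("team") in EXCLUDED_TEAMS:
            continue  # authorized team: tuned out via the user profile
        alerts.append(event)
    return alerts
```

Note that a user with no profile is not excluded, so events from unknown identities still produce an alert.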

Example Threat Detection Rule using Machine Profiles

In the threat detection query shown above we’re searching for successful SSO logins where the public IP address of the login is not associated with any machine assigned to the source SSO user. In other words, we’re looking for successful SSO logins for employees originating from non-corporate devices. This is a slightly contrived example as non-corporate devices would likely be blocked from authenticating to begin with via an SSO policy, but it serves as a good example of a detection for this behavior in case the technical control fails.

This threat detection is made possible because the machine profiles context table is regularly updated with data pulled from the host-based tooling deployed to our corporate devices. This allows us to log and aggregate the historical public IP addresses that were assigned to any of our devices. We can then simply bring this dataset into our detection rule to only show results for logins that do not originate from one of these historical IP addresses.

As with the previous detection example, as the context table continues to be updated this threat detection rule automatically benefits from those updates to tune out potential false positives, without an analyst having to manually change the rule logic.
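A minimal Python sketch of this detection follows, assuming SSO events and machine profiles have already been loaded as dictionaries. The field names (`outcome`, `source_ip`, `historical_public_ips`) and both functions are hypothetical; the IP addresses come from ranges reserved for documentation.

```python
from typing import Dict, List, Set

MACHINE_PROFILES = [
    {"serial_number": "SERIAL-001", "owner_email": "alice@example.com",
     "historical_public_ips": {"198.51.100.10", "198.51.100.11"}},
    {"serial_number": "SERIAL-002", "owner_email": "alice@example.com",
     "historical_public_ips": {"198.51.100.20"}},
    {"serial_number": "SERIAL-003", "owner_email": "bob@example.com",
     "historical_public_ips": {"203.0.113.5"}},
]


def known_ips_by_owner(machine_profiles: List[dict]) -> Dict[str, Set[str]]:
    """Union the historical public IPs of every machine assigned to a user."""
    known: Dict[str, Set[str]] = {}
    for machine in machine_profiles:
        owner_ips = known.setdefault(machine["owner_email"], set())
        owner_ips.update(machine["historical_public_ips"])
    return known


def detect_logins_from_unknown_ips(
    sso_events: List[dict], machine_profiles: List[dict]
) -> List[dict]:
    """Return successful logins whose IP is not tied to the user's machines."""
    known = known_ips_by_owner(machine_profiles)
    return [
        event
        for event in sso_events
        if event["outcome"] == "SUCCESS"
        and event["source_ip"] not in known.get(event["user_email"], set())
    ]
```

In this sketch the comparison is per user, so an IP address that belongs to one employee’s device does not vouch for a login by a different employee.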

Automatically Enriching Alerts with Context Tables

In addition to using the context tables for improving alert fidelity, we also use these tables for alert enrichment. Typically the log data that triggered the alert will contain enough basic information (e.g. a user email, a serial number or an IP address) that we can pivot into the user and machine profiles to decorate the alert with relevant contextual information.

For example, if we remember the alert that we discussed at the start of this blog post, we could use the serial number from the alert title to pivot into the context tables:

By pivoting from the serial number to the machine profile context table, we could simply enrich this alert with relevant information to save an analyst having to manually query and extract the information while triaging the alert:

The example enrichment above shows how an analyst immediately benefits from being able to see relevant information relating to the device which has triggered the alert. While the threat still needs to be triaged and investigated, all of the initial tasks (e.g. converting the serial to an owner, or converting the serial to an EDR GUID) have been automated. This reduces the mean time to response and allows the analyst to immediately focus on the threat itself.
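The enrichment step can be sketched in a few lines of Python. The serial number format, the alert shape, the chosen enrichment fields and the `enrich_alert` function are all assumptions for illustration; the real alert and profile schemas are not shown in this post.

```python
import re
from typing import Dict

# Hypothetical serial format used only for this example.
SERIAL_PATTERN = re.compile(r"SERIAL-\d{3}")

MACHINE_PROFILES_BY_SERIAL: Dict[str, dict] = {
    "SERIAL-001": {
        "owner_email": "alice@example.com",
        "platform_type": "macOS",
        "edr_guid": "00000000-0000-0000-0000-000000000001",
        "containment_status": "not_contained",
    },
}

# Fields copied onto the alert (assumed selection).
ENRICHMENT_FIELDS = ("owner_email", "platform_type", "edr_guid",
                     "containment_status")


def enrich_alert(alert: dict, profiles_by_serial: Dict[str, dict]) -> dict:
    """Return a copy of the alert decorated with machine profile context."""
    enriched = dict(alert)  # leave the original alert untouched
    match = SERIAL_PATTERN.search(alert.get("title", ""))
    profile = profiles_by_serial.get(match.group(0)) if match else None
    # None signals that no pivot was possible, so the analyst knows
    # the context is missing rather than empty.
    enriched["machine_context"] = (
        {name: profile.get(name) for name in ENRICHMENT_FIELDS}
        if profile else None
    )
    return enriched
```

Running this as a post-alert processing step keeps the detection rule itself simple, while still handing the analyst the owner and EDR identifier alongside the alert.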

Closing Comments

Similar to the detection and response platform that we covered in part one of this blog series, building context tables that aggregate information from a variety of data sources across Coinbase allows us to abstract complexity and vendor changes away from our security analysts. As we onboard and offboard data sources we can simply update the context tables, allowing the changes to propagate to the downstream detections and enrichment processes. Analysts are not required to have in-depth knowledge of the structure and format of the underlying logs that make up the context tables, but are still able to benefit from the aggregated information for detection development and alert enrichment use cases.

Aggregating relevant user and machine information into central dedicated tables also helps to improve the maintainability and quality of our threat detections. Analysts are able to easily self-discover and integrate useful information from one central location into their detection rules in a consistent manner. This helps to reduce the number of different data sources being referenced in our detection rules, improves the readability of detection code, and makes it easier for junior analysts to extend existing threat detections or replicate tuning logic in their own detections.

In the final part of this blog series we’ll be discussing how we’ve automated the process of bringing our employees and teams into the alert triaging process, building additional contextual information into our alerts and reducing the number of alerts our security team deals with on a daily basis without losing detection coverage. We’ll also be covering the approach we’ve implemented to reduce our mean time to response by automating key response tasks for some of our threat detections.
