Mining data from Cobalt Strike beacons

25 March 2022

Since we published about identifying Cobalt Strike Team Servers in the wild just over three years ago, we’ve collected over 128,000 beacons from over 24,000 active Team Servers. Today, RIFT is making this extensive beacon dataset publicly available in combination with the open-source release of dissect.cobaltstrike, our Python library for studying and parsing Cobalt Strike related data.

The published dataset contains historical beacon metadata ranging from 2018 to 2022. This blog will highlight some interesting findings you can extract and query from this extensive dataset. We encourage other researchers also to explore the dataset and share exciting results with the community.

Cobalt Strike Beacon dataset

The dataset beacons.jsonl.gz is a GZIP compressed file containing 128,340 rows of beacon metadata as JSON-lines. You can download it from the following repository and make sure to also check out the accompanying Jupyter notebook:

https://github.com/fox-it/cobaltstrike-beacon-data

The dataset spans almost four years of historical Cobalt Strike beacon metadata from July 2018 until February 2022. Unfortunately, we lost five months’ worth of data in 2019 due to archiving issues. In addition, the dataset mainly focuses on x86 beacons collected from active Team Servers on HTTP port 80, 443 and DNS; therefore, it does not contain any beacons from other sources, such as VirusTotal.

The beacon payloads themselves are not in the dataset due to the size. Instead, the different beacon configuration settings are stored, including other metadata such as:

Date the beacon was collected and from which IP address and port
GeoIP + ASN metadata
TLS certificate in DER format
PE information (timestamps, magic_mz, magic_pe, stage_prepend, stage_append)
If the payload was XOR encoded, and which XOR key was used for config obfuscation
The raw beacon configuration bytes; handy if you want to parse the beacon config manually. (e.g. using dissect.cobaltstrike or another parser of choice)

While there are some trivial methods to identify cracked/pirated Cobalt Strike Team Servers from the beacon payload, it’s difficult to tell for the non-trivial ones. Therefore the dataset is unfiltered, full disclosure and contains all beacons we have collected.

Cobalt Strike Team Servers that are properly hidden or have payload staging disabled are, of course, not included. That means Red Teams (and sadly threat actors) with good OPSEC have nothing to worry about being present in this dataset. 🙂

Beacons, and where to find them

The Cobalt Strike beacons were acquired by first identifying Team Servers on the Internet and then downloading the beacon using a checksum8 HTTP request. This method is similar to how the company behind Cobalt Strike did their own Cobalt Strike Team Server Population Study back in 2019.

Although the anomalous space fingerprint we used for identification was since fixed, we found other reliable methods for identifying Team Servers. In addition, we are sure that the original author of Cobalt Strike intentionally left some indicators in there to help blue teams. For that, we are grateful and hope this doesn’t change in the future.

Via our honeypots, we can tell that RIFT is not alone in mining beacons. Everyone is doing it now. The increased blog posts and example scripts on how to find Cobalt Strike probably attribute to this, next to the increased popularity of Cobalt Strike itself, of course.

The development of increased scanning for Cobalt Strike is fascinating to witness, including the different techniques and shotgun approaches for identifying Team Servers and retrieving beacons. Some even skip the identification part and go directly for the beacon request! As you can imagine, this can be noisy, which surely doesn’t go unnoticed for some threat actors.

If you run a public-facing web server, you can easily verify this increased scanning by checking the HTTP access logs for common checksum8 like requests, for example, by using the following grep command:

$ zgrep -hE "GET /[a-zA-Z0-9]{4} HTTP" /var/log/nginx/*.gz
172.x.x.x - - [23/Feb/2021:18:xx:08 +0100] "GET /0bef HTTP/1.0” 404 162 "-" "-"
172.x.x.x - - [24/Feb/2021:09:xx:40 +0100] "GET /0bef HTTP/1.0” 404 162 "-" "-"
139.x.x.x - - [25/Feb/2021:05:xx:39 +0100] "GET /bag2 HTTP/1.1” 404 193 "-" "-"
134.x.x.x - - [25/Feb/2021:15:xx:12 +0100] "GET /ab2g HTTP/1.1” 400 166 "-" "-"
134.x.x.x - - [25/Feb/2021:15:xx:22 +0100] "GET /ab2h HTTP/1.1” 400 166 "-" "-"

The requests shown above are checksum8 requests (for x86 and x64 beacons), hitting a normal webserver hosting a real website in February 2021.

You can also use our checksum8-accesslogs.py script, which does all these things in one script and more accurately by verifying the checksum8 value. It can also output statistics. Here is an example of outputting the x86 and x64 beacon HTTP requests hitting one of our honeypots and generating the statistics:

In the output, you can also see the different beacon scanning techniques being used, which we will leave as an exercise for the reader to figure out.

We can see an apparent increase in beacon scanning on one of our honeypots by plotting the statistics:

https://datawrapper.dwcdn.net/pfQZY/5/

So if you ever wondered why people are requesting these weird four-character URLs (or other strange-looking URLs) on your web server, check the checksum8 value of the request, and you might have your answer.

We try to be a bit stealthier and won’t disclose our fingerprinting techniques, as we also know threat actors are vigilant and, in the long run, will make it harder for everyone dealing with Threat Intelligence.

Cobalt Strike version usage over time

Because we have beacon metadata over multiple years, we can paint a pretty good picture of active Cobalt Strike servers on the Internet and which versions they were using at that time.

To extract the Cobalt Strike version data, we used the following two methods:

Using the Beacon Setting constants
- When a new Cobalt Strike beacon configuration setting is introduced, the Setting constant is increased and then assigned. It’s possible to deduce the version based on the highest available constant in the extracted beacon configuration.
Using the PE export timestamp
- This is also documented by BlackBerry in their Finding Beacons In the Dark: A Guide to Cyber Threat Intelligence book and is a more accurate way of determining the exact version.

Our Python library dissect.cobaltstrike supports both methods for deducing version numbers and favours the PE export timestamp when available.

The dataset already contains the beacon_version field for your convenience and is based on the PE export timestamp. Using this field, we can generate the following graph showing the different Cobalt Strike versions used on the Internet over time:

https://datawrapper.dwcdn.net/UW9lM/2/

We can see that in April 2021, there was quite a prominent peak of online Cobalt Strike servers and unique beacons, but we are not sure what caused this except that there was a 3% increase of modified (likely malicious) beacons that month.

The following percentage-wise graph shows a clearer picture of the adoption and popularity between the different versions over time:

https://datawrapper.dwcdn.net/t3y64/2/

We can see that Cobalt Strike 4.0 (released in December 2019) remained quite popular from January 2020 to January 2021.

Beacon watermark statistics

Since Cobalt Strike 3.10 (released December 2017), the beacons contain a setting called SETTING_WATERMARK. This watermark value should be unique per Cobalt Strike installation, as the license server issues this.

However, cracked/pirated versions usually patch this to a fixed value, making it easy to identify which beacons are more likely to be malicious (i.e. not a penetration tester). This likelihood aligns with our incident response engagements so far, where beacons related to the compromise used known-bad watermarks.

Note that requesting a trial or buying a legitimate copy of Cobalt Strike is difficult for malicious actors as every user is vetted and screened. Because of these measures, there is a high asking price for a Cobalt Strike copy on the dark web. For example, Conti invested $60.000 to acquire a valid copy of Cobalt Strike.

https://datawrapper.dwcdn.net/wW4df/6/

If you find a beacon with a watermark in this top 50, then it’s most likely malicious!

Customized Beacons

While parsing collected beacons, we found that some were modified, for example, with a custom shellcode stub, non-default XOR keys or reassigned Beacon settings.

Therefore, the beacons with heavy customizations could not be dumped properly and are not included in the dataset.

The configuration block in the beacon payload is usually obfuscated using a single byte XOR key. Depending on the Cobalt Strike version, the default keys are 0x2e or 0x69.

The use of non-default XOR keys requires the user to modify the beacon and or Team Server, as it’s not configurable by default. Here is an overview of seen XOR keys over the unique beacon dataset:

https://datawrapper.dwcdn.net/4eeir/6/

Using a custom XOR key makes you an outlier though, but it does protect you against some existing Cobalt Strike config dumpers. Our Python library dissect.cobaltstrike supports trying all XOR keys when the default XOR keys don’t work. For example, you can pass the command line flag --all-xor-keys to the beacon-dump command.

Portable Executable artifacts

While most existing Cobalt Strike dumpers focus on the beacon settings, some settings from the Malleable C2 profiles will not end up in the embedded beacon config of the payload. For example, some Portable Executable (PE) settings in the Malleable C2 profile are applied directly to the beacon payload. Our Python library dissect.cobaltstrike supports extracting this information, and our dataset includes the following extracted PE header metadata:

magic_mz — MZ header
magic_pe — PE header
pe_compile_stamp — PE compilation stamp
pe_export_stamp — timestamp of the export section
stage_prepend – (shellcode) bytes prepended to the start of the beacon payload
stage_append — bytes appended to the end of the beacon payload

We created an overview of the most common stage_prepend bytes that are all ASCII bytes. These bytes are prepended in front of the MZ header, and has to be valid assembly code but resulting in a no-operation as it’s executed as shellcode. Some are quite creative:

https://datawrapper.dwcdn.net/BeLYP/5/

If we disassemble the example stage_prepend shellcode JFIFJFIF we can see that it increases the ESI and decreases the EDX registers and leaves it modified as a result; so it’s not fully a no-operation shellcode but it most likely doesn’t affect the staging process either.

$ echo -n JFIFJFIF | ndisasm -b 32 /dev/stdin
00000000  4A                dec edx
00000001  46                inc esi
00000002  49                dec ecx
00000003  46                inc esi
00000004  4A                dec edx
00000005  46                inc esi
00000006  49                dec ecx
00000007  46                inc esi

You can check our Jupyter notebook for an overview on the rest of the PE artifacts, such as magic_mz and magic_pe.

Watermarked releasenotes.txt using whitespace

The author of Cobalt Strike must really like spaces, after the erroneous space in the HTTP server header, there is now also a (repurposed) beacon setting called SETTING_SPAWNTO that is now populated with the MD5 hash of the file releasenotes.txt (or accidentally another file in the same directory if that matches the same checksum8 value of 152 and filename length!).

The releasenotes.txt is automatically downloaded from the license server when you activate or update your Cobalt Strike server. To our surprise, we discovered that this file is most likely watermarked using whitespace characters thus making this file and MD5 hash unique per installation. The license server probably keeps track of all these uniquely generated files to help combat piracy and leaks of Cobalt Strike.

While this is pretty clever, we found that in some pirated beacons this field is all zeroes, or not available. Meaning they knew about this file and decided not to ship it in the pirated version or the field value was patched out. Nevertheless, this field is still useful for hunting or correlating beacons when it is available.

Analyze Beacon payloads with dissect.cobaltstrike

We are also proud to open-source our Python library for dissecting Cobalt Strike, aptly named dissect.cobaltstrike. The library is available on PyPI and requires Python 3.6 or higher. You can use pip to install it:

$ pip install dissect.cobaltstrike

The project’s GitHub repository: https://github.com/fox-it/dissect.cobaltstrike

It currently installs three command line tools for your convenience:

beacon-dump – used for dumping configuration from beacon payloads (also works on memory dumps)
beacon-xordecode – a standalone tool for decoding xorencoded payloads
c2profile-dump – use this to read and parse Malleable C2 profiles.

A neat feature of beacon-dump is to dump the beacon configuration back as it’s Malleable C2 profile compatible equivalent:

While these command line tools provide most of the boilerplate for working with Beacon payloads, you can also import the library in a script or notebook for more advanced use cases. See our notebook and documentation for some examples.

Closing thoughts

The beacon dataset has proved very useful to us, especially the historical aspect of the dataset is insightful during incident response engagements. We use the dataset daily, ranging from C2 infrastructure mapping, actor tracking, threat hunting, high-quality indicators, detection engineering and many more.

We hope this dataset and Python library will be helpful to the community as it is for us and are eager to see what kind of exciting things people will come up with or find using the data and tooling. What we have shown in this blog is only the tip of the iceberg of what you can uncover from beacon data.

Some ideas for the readers:

Cluster beacon and C2 profile features using a clustering algorithm such as DBSCAN.
Improve the classification of malicious beacons. You can find the current classification method in our notebook.
Use the GeoIP ASN data to determine where the most malicious beacons are hosted.
Analysis on the x509 certificate data, such as self-signed or not.
Determine if a beacon uses domain fronting and which CDN.

All the statistics shown in this blog post can also be found in our accompanying Jupyter notebook including some more statistics and overviews not shown in this blog.

We also want to thank Rapid7 for the Open Data sets, without this data the beacon dataset would be far less complete!

Final links for convenience:

Beacon dataset and notebook – https://github.com/fox-it/cobaltstrike-beacon-data
dissect.cobaltstrike Python library – https://github.com/fox-it/dissect.cobaltstrike
dissect.cobaltstrike Documentation – https://dissect-cobaltstrike.readthedocs.io

Yun Zheng Hu