Selective YARA Scanning: What’s Your Type?

March 30, 2022


With countless YARA rules publicly available, it can be cumbersome to scan many files with numerous rules. Scanning whole disk images with large rulesets, for example, increases the opportunity for errors and unwanted detections. Most YARA rules written to identify malicious files should only alert on certain file types. However, typical YARA scanners will often run every rule against every file, even when it wouldn’t return useful results. This post will discuss using selective YARA scanners to limit the scope of detection rules as well as how to sort rules by targeted file type to make the best use of selective YARA scanners.

File type YARA scanning isn’t a new idea; one of the sandboxes we use here at Optiv, Trend Micro’s Deep Discovery Analyzer, contains this capability. The following screenshot shows it adding a rule for DLL and EXE files.

Figure 1: Image of Deep Discovery Analyzer file type YARA scanning configuration

Another example is an extension for X-Ways Forensics written by Chris Mayhew. The extension enables YARA scanning of an evidence volume and supports targeted scans. X-Ways provides filters for various file properties that can combine to select files of interest for review. Using the extension, you can YARA scan files with appropriate rulesets for the targeted files specified by the filters.

A final selective YARA scanner example is the open-source forensic software Judge Jury and Executable (JJE), for which I have provided some quality assurance testing. I like JJE because it does forensic scanning of active files on a mounted NTFS disk image and outputs information threat hunters want to leverage. File paths, names, date-time stamps, hash values and file properties are all reported. In addition, things like digital signer/publisher, compile date, entropy, imphash and YARA results are also available. The YARA scanning in JJE can apply specific rules based on MIME type or file extension, apply rules when no file type or extension filter matches, and designate rules that always run regardless of other filter matches.

Figure 2: Image of JJE YARA file type scanning configuration

An example of wanting to be selective with YARA rules can be seen in VMware Carbon Black Hosted EDR, and the on-prem equivalent, VMware Carbon Black EDR, formerly known as Cb Response. The EDR solutions allow YARA rules to be uploaded to scan the files available in the product's binary store using the YARA Connector. The binary store contains portable executable (PE) and executable and linking format (ELF) files. The YARA Connector provides scanning against the files captured from endpoints within the binary store.

A PE is an executable binary file format for the Windows platform, more commonly referred to by its common file extensions: executables (.exe), dynamic link libraries (.dll) or drivers (.sys). On Linux, binary files use the ELF format. Scanning the VMware Carbon Black EDR binary store with rules for any other file type is a waste of resources and time, and it can add the effort of reviewing YARA detections meant for different file types.

To selectively apply YARA rules to a particular file type, you will need a scanner that supports using different rule sets for different file types. You also need to know what file types any particular YARA rule applies to. Yes, you can do this manually (and sometimes this is unavoidable), but automation is key when faced with the overwhelming task of sorting thousands of YARA rules. After thinking about it, I decided the process of sorting YARA rules according to file type is somewhat amenable to automation. It would at least be able to sort the vast majority of YARA rules, and what remained could be either sorted manually or just run against all file types. Doing so was no easy task, but I had decent success, and the remainder of this post will detail that process.
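
As a rough illustration of what "different rule sets for different file types" means in practice, the Python sketch below picks a ruleset name from a file's leading bytes. The mapping, the ruleset names and the `all_types` fallback are hypothetical and not taken from any of the scanners mentioned above; the actual YARA scan call is left out.

```python
from pathlib import Path

# Hypothetical mapping of leading file bytes to a ruleset name.
MAGIC_TO_RULESET = {
    b"MZ": "pe",
    b"\x7fELF": "elf",
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",
    b"{\\rt": "rtf",
}


def pick_ruleset(header, default="all_types"):
    """Return the ruleset whose magic number prefixes the header bytes."""
    for magic, ruleset in MAGIC_TO_RULESET.items():
        if header.startswith(magic):
            return ruleset
    # No known header: fall back to rules meant to run against everything.
    return default


def ruleset_for_file(path):
    """Read the first bytes of a file and choose a ruleset for it."""
    with Path(path).open("rb") as handle:
        return pick_ruleset(handle.read(8))
```

A scanner built this way would compile one rule collection per ruleset name and hand each file only to the collection returned for it.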

Let’s take a minute to explain the YARA rule format at a high level before getting into the YARA rule sorting. We can break down YARA rules into three elements: metadata, strings and conditions. There are also imports and comments, which can exist anywhere in the rules file. Below is an example rule, created by combining examples from the YARA documentation, that contains all those elements. First are the imports, followed by a multi-line comment, and then the rule.

import "pe"
import "cuckoo"

/*
This is a multi-line comment ...
*/

rule WildcardExample // ... and this is a single-line comment
{
    meta:
        my_identifier_1 = "Some string data"
        my_identifier_2 = 24
        my_identifier_3 = true

    strings:
        $hex_string = { E2 34 ?? C8 A? FB }

    condition:
        $hex_string
}

With thousands of rules to reorganize, I wanted to take a programmatic approach to sort them. Usually, when I write code, it starts by manually performing the task and getting close to the data. Automation often encounters problems if I don't fully understand the data. The following sections detail my findings on identifying file types to be sorted with automation.

YARA PE Module

I immediately recognized that the import “pe” feature might indicate that the YARA rules were targeting PE files. However, it turns out that, while many YARA rule files import pe, most do not use the pe module. In other words, importing the pe module does not indicate that the rule(s) targets a PE file. You must review each rule for the use of the pe module to confirm it targets a PE file.

import "pe"

rule single_section
{
    condition:
        pe.number_of_sections == 1
}

rule control_panel_applet
{
    condition:
        pe.exports("CPlApplet")
}

rule is_dll
{
    condition:
        pe.characteristics & pe.DLL
}

Figure 3: PE module example from https://yara.readthedocs.io/en/v3.4.0/modules/pe.html
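
The check described above, whether a rule actually uses the pe module rather than merely sitting in a file that imports it, can be approximated in Python. This is a naive sketch and not the sorting script's code: the function name is hypothetical, it expects the text of a single rule, and it strips comments with simple regular expressions rather than a real YARA parser.

```python
import re


def uses_pe_module(rule_text):
    """Return True if the rule's condition references the pe module."""
    # Drop comments so commented-out conditions are not counted.
    text = re.sub(r"/\*.*?\*/", "", rule_text, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", "", text)
    # Only look at what follows the condition keyword.
    _, found, condition = text.partition("condition:")
    return bool(found) and re.search(r"\bpe\.\w+", condition) is not None
```

A rule file that imports pe but whose rules all return False here gives no evidence, from the module alone, that it targets PE files.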

File Type Magic Numbers

Another way to identify the file type target of a YARA rule is to match file headers against file type magic numbers. Magic number matching is a common practice to ensure that the YARA rule only detects the intended file type. The header is typically at the beginning of a file and will identify the format for the file. The following matches the MZ header of PE files:

uint16(0) == 0x5a4d
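
To see why the characters "MZ" show up as 0x5a4d in that condition, the short Python sketch below reads the same two bytes the way uint16(0) and uint16be(0) would. The header bytes are an illustrative example, not taken from a specific sample.

```python
import struct

header = b"MZ\x90\x00"  # illustrative first bytes of a PE file

# "<H" reads an unsigned 16-bit little-endian integer, like uint16(0).
little = struct.unpack_from("<H", header, 0)[0]
# ">H" reads the same two bytes as big-endian, like uint16be(0).
big = struct.unpack_from(">H", header, 0)[0]

print(hex(little))  # 0x5a4d
print(hex(big))     # 0x4d5a
```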

YARA allows for matching in the strings section, where the match variable for that string is referenced in the condition section. YARA also allows for matching without a string variable, where the matching only occurs in the condition section of the rule. Additionally, other rules can be referenced in the condition section for matching conditions.

rule Rule1
{
    strings:
        $a = "dummy1"

    condition:
        $a
}

rule Rule2
{
    strings:
        $a = "dummy2"

    condition:
        $a and Rule1
}

Figure 4: https://yara.readthedocs.io/en/stable/writingrules.html#referencing-other-rules

YARA supports complex conditional statements, including grouping and logic operators such as NOT and OR. During my review, I did not encounter any rules that excluded a match due to having a matching file header. However, check out this conditional statement from US-CERT that matches multiple file types:

(uint16(0) == 0x5A4D or uint16(0) == 0xCFD0 or uint16(0) == 0xC3D4 or uint32(0) == 0x46445025 or uint32(1) == 0x6674725C)

Figure 5: Example of conditional matching multiple file types

Following the logic operators complicates the parsing and was not automated in this first round of code. However, the file type magic numbers were commonly targeting one file type, and so it made sense to save time by checking for the usage and dealing with the minimal false-positive categorizations. Below are some examples pulled from YARA rules:

uint16(0) == 0xcfd0 //Word/Office Document
$doc = {D0 CF 11 E0} //DOCFILE0 Word Document
uint16(0) == 0x457f //ELF
$magic = { 50 4b 03 04 ( 14 | 0a ) 00 } //pkzip
$PK = "PK" //pkzip
int16(0) == 0x4B50 //pkzip
uint16(0) == 0x004c //Windows link
$h1 = "Rar!" //rar
uint32(0) == 0x74725c7b //RTF
$string7 = "%PDF-1.6" //PDF
uint16be(0) == 0x4D5A //MZ header

Figure 6: File type magic number examples

Given the above examples, it seemed logical to write a regular expression to identify file type magic numbers used in YARA rules:

(int\d(\d|)(be|)\(\d+\)|\$[\w\d]+) *=(=|) *({ ?|0x|")

Figure 7: Partial regex for YARA file type magic number matching

Combining this regex with a list of magic numbers mapped to file type allows us to identify which file type a rule targets. I observed magic numbers in four representations: String, big-endian, little-endian and hex string. The regex above will match the various representations as highlighted below. Appending the file type string in the represented forms to the regex will then return matches for the file type targeted by the YARA rule.

$string7 = "%PDF-1.6" //PDF
uint16be(0) == 0x4D5A //MZ header
uint32(0) == 0x74725c7b //RTF
$magic = { 50 4b 03 04 ( 14 | 0a ) 00 } //pkzip

Figure 8: Regex matches for YARA file type magic number representations
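
One way to try the partial regex against rule lines like those in Figure 8 is the Python sketch below. It assumes the offset part of the pattern is an escaped parenthesized number, `\(\d+\)`, and it escapes the opening brace for clarity. The `magic_kind` helper and its labels are hypothetical.

```python
import re

# Partial regex for magic number matching; the escaped parentheses and
# brace are this sketch's reading of the pattern.
MAGIC_PREFIX = re.compile(
    r'(int\d(\d|)(be|)\(\d+\)|\$[\w\d]+) *=(=|) *(\{ ?|0x|")'
)

# The last group shows how the value that follows is written.
KINDS = {"{": "hex string", "0x": "integer", '"': "text string"}


def magic_kind(line):
    """Return how a candidate magic number is represented, or None."""
    match = MAGIC_PREFIX.search(line)
    if match is None:
        return None
    return KINDS[match.group(5).strip()]
```

The text immediately after the match is where a magic number from the file type mapping would be expected.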

The most challenging is the little-endian representation, because the byte order is reversed: the first byte in the file becomes the least significant (rightmost) byte of the integer value. Note below that the hex string is in reverse order to the uint32 value:

{7f 45 4c 46} //hex string representation
(uint32(0) == 0x464c457f) //little-endian

Figure 9: Example of little-endian vs. hex string of ELF file header

Also, rules might reference other rules that contain the file type magic numbers. One example of being referenced by different YARA rules was the “is__elf” rule in 000_common-rules.yar. My YARA sorting script looks for the “is__elf” text in the condition section of YARA rules and will set the file type to ELF for the rule.
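
A minimal Python sketch of that lookup is shown here. Only the is__elf name comes from the rules discussed above; the mapping structure and the function name are assumptions for illustration.

```python
import re

# Helper rules known to check a file header, mapped to the type they imply.
# Only is__elf is taken from the post; extend the mapping as needed.
HELPER_RULE_TYPES = {"is__elf": "elf"}


def file_type_from_references(condition):
    """Return a file type if the condition references a known helper rule."""
    for rule_name, file_type in HELPER_RULE_TYPES.items():
        # Require the name to stand alone, not as part of a longer identifier.
        pattern = r"(?<![\w$])" + re.escape(rule_name) + r"(?!\w)"
        if re.search(pattern, condition):
            return file_type
    return None
```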

The sort script modifies each magic number to match the four representation types and appends each individually to the regex. Per the YARA documentation, the 16-bit and 32-bit integer functions read little-endian values by default. Adding “be” to the end of the function name, such as intXXbe, reads the value as big-endian.
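
The idea of expanding each magic number into its four representations and appending them to the regex can be sketched in Python as follows. This is not the sort script itself: the function names and the example mapping are hypothetical, and the prefix assumes the offset part of the partial regex is `\(\d+\)`.

```python
import re

# Prefix from the partial regex; the escaped parentheses and brace are
# this sketch's reading of the pattern.
PREFIX = r'(int\d(\d|)(be|)\(\d+\)|\$[\w\d]+) *=(=|) *(\{ ?|0x|")'


def representations(magic):
    """Return regex fragments for the four ways a magic number may appear."""
    pairs = [f"{byte:02x}" for byte in magic]
    return [
        re.escape(magic.decode("latin-1")),  # text string, e.g. "%PDF"
        " ".join(pairs),                     # hex string, e.g. { 25 50 44 46 }
        "".join(pairs),                      # big-endian integer literal
        "".join(reversed(pairs)),            # little-endian integer literal
    ]


def build_patterns(magic_numbers):
    """Compile one pattern per file type from a type-to-magic mapping."""
    patterns = {}
    for file_type, magic in magic_numbers.items():
        alternatives = "|".join(representations(magic))
        # IGNORECASE lets 0x4D5A and 0x4d5a both match; it also loosens
        # text string matches, which is a known trade-off of this sketch.
        patterns[file_type] = re.compile(
            PREFIX + "(" + alternatives + ")", re.IGNORECASE
        )
    return patterns


def file_type_for(line, patterns):
    """Return the first file type whose magic number appears in the line."""
    for file_type, pattern in patterns.items():
        if pattern.search(line):
            return file_type
    return None
```

Rules that compare only part of a header, such as a 16-bit check against the first two bytes of a four-byte magic number, would need their own shortened entries in the mapping.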

While magic numbers are accurate for YARA rule logic, keying off them to sort rules isn’t perfect. For example, FE_LEGALSTRIKE_RTF is for an RTF file with an embedded link file. While the rule looks for a malicious link within the file, the actual file type is RTF. Another example is Wanna_Cry_Ransomware_Generic, which has a string match that starts with the same “PK” characters as the ZIP header:

$s4 = "PKS"

Figure 10: PKS string

Rule Name and Metadata

The sorting code looks at the rule name and within the YARA rule's metadata section to check for mentions of a known file extension. An exclusion function handles terms such as the word "warn", which would otherwise match the file extension .war. Another problem was rules that had other extensions mentioned in them, such as CGI and other technologies:

rule WebShell_cgitelnet {
meta:
description = "PHP Webshells Github Archive - file cgitelnet.php"

Figure 11: Rule name and description for PHP CGI

Ordering the file type checks was essential (such as putting CGI first) so later mentioned languages would overwrite it. The same goes for asp and aspx extensions since one exists within the other. In other words, the asp extension must be listed before aspx to ensure aspx becomes the file type.
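
The ordering and exclusion logic can be illustrated with a small Python sketch. The extension list, the exclusion list and the function name are hypothetical; the sketch uses plain substring matching because that is the behavior that makes "warn" collide with .war.

```python
# Order matters: later entries overwrite earlier ones, so terms that should
# lose (cgi, asp) come before the ones that should win (aspx, php).
EXTENSIONS_IN_ORDER = ["cgi", "war", "asp", "aspx", "php"]

# Words that contain an extension without referring to it (assumed list).
EXCLUDED_WORDS = ["warn"]


def file_type_from_text(text):
    """Guess a file type from a rule name or metadata description."""
    lowered = text.lower()
    for word in EXCLUDED_WORDS:
        lowered = lowered.replace(word, "")
    file_type = None
    for extension in EXTENSIONS_IN_ORDER:
        if extension in lowered:
            file_type = extension  # a later match overwrites an earlier one
    return file_type
```

With this ordering, the cgitelnet example is categorized as php because the php check runs after the cgi check.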

Hash Values in Metadata

Many YARA rules provide hash values in the metadata for samples that they should detect. Doing so is an excellent practice so that others can update the rule and test it against the same samples. YARA rules can also calculate several hashing values of a file or parts of it.

hash.md5(0, filesize) == "feba6c919e3797e7778e8f2e85fa033d"

Hash values are how the industry references unique files and can be used to get additional context via Internet searches or on services such as VirusTotal. Some lookup services may not have the file associated with the hash but are helpful when they do. VirusTotal, for example, provides valuable information about a file such as antimalware detection names, date stamps, file type, sandbox analysis, etc., that can allow for categorization and sorting.

I leveraged Vendor Threat Triage Lookup (VTTL) for automated hash lookups and added an integration to utilize the CSV output. Hash values were extracted from the YARA rules and fed into VTTL. VTTL will output a CSV that contains the hash value, file name and the file type provided by the VirusTotal API. This CSV spreadsheet is imported and referenced for any hash matches within the rules to identify the file hash’s associated file type.

Using VirusTotal File Type for Sorting

One option to sort the rules by file type is to leverage intelligence provided by VirusTotal. Using VirusTotal is not required but can help sort rules that reference hashes to files that VirusTotal has scanned.

To sort on file hash references in the YARA rules, you must extract the hash values and perform lookups against them using VTTL. An approach could be to loop through all the files and scan the contents with regular expressions to detect hash values:

https://github.com/RandomRhythm/RegEx_Hash_Scanner.py

See the example regex below, which looks for runs of uppercase and lowercase hex characters within a length range. The range covers hash lengths from MD5 (32 characters) up to SHA-512 (128 characters), which includes SHA-1 and SHA-256:

[A-Fa-f0-9]{32,128}

Figure 12: Regex to detect hex character strings that are long enough to be hashes

You could also check for valid hash length, but that is not required as invalid hashes will not have any results on VirusTotal.
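
For readers who do want the optional length check, here is a minimal Python sketch. It is not the RegEx_Hash_Scanner code: the function name is hypothetical, and the word boundaries added around the regex are an assumption to avoid cutting a match out of a longer hex run.

```python
import re

# Same character class and length range as the regex above, with word
# boundaries added (an assumption of this sketch).
HASH_CANDIDATE = re.compile(r"\b[A-Fa-f0-9]{32,128}\b")

# Hex lengths of common hash algorithms.
VALID_LENGTHS = {32: "MD5", 40: "SHA-1", 64: "SHA-256", 128: "SHA-512"}


def extract_hashes(rule_text):
    """Return (hash, algorithm) pairs, lowercased and deduplicated in order."""
    found = {}
    for candidate in HASH_CANDIDATE.findall(rule_text):
        algorithm = VALID_LENGTHS.get(len(candidate))
        if algorithm:
            found.setdefault(candidate.lower(), algorithm)
    return list(found.items())
```

Lowercasing and deduplicating before the lookup keeps the number of VirusTotal queries down when several rules reference the same sample.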

Next, we use VTTL to perform hash lookups against VirusTotal, which assesses the associated file’s type. VTTL outputs the VirusTotal information to a CSV file, which we will import into the sorting script. The example VTTL CSV output for the yara-rules/rules GitHub repo is displayed in the following figure.

Figure 13: Microsoft Excel displaying VTTL CSV output
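
Importing such a CSV into a hash-to-file-type lookup could look like the Python sketch below. The column names `hash` and `file_type` are assumptions, since the real VTTL header names are not given here, and the sample rows are made-up placeholders rather than actual VTTL output.

```python
import csv
import io


def load_file_types(csv_file, hash_column="hash", type_column="file_type"):
    """Map lowercase hash values to the file type reported for them."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        hash_value = (row.get(hash_column) or "").strip().lower()
        file_type = (row.get(type_column) or "").strip()
        # Skip hashes that the lookup service had no file type for.
        if hash_value and file_type:
            lookup[hash_value] = file_type
    return lookup


# Placeholder data with assumed column names, for illustration only.
sample = io.StringIO(
    "hash,file_name,file_type\n"
    "0000000000000000000000000000000A,sample.exe,Win32 EXE\n"
    "0000000000000000000000000000000b,,\n"
)
lookup = load_file_types(sample)
```

For a real file, pass an open file object (opened with `newline=""`) and adjust the column names to match the CSV header.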

Using the Sort Script

Sorting the rules requires a listing of all the rules from YARA_Rules_Util. Passing the -v argument will create a listing of all rules within the YARA files. Doing so will generate the all_rules.csv file:

YARA_Util.py -d E:\test\rules-master -s -v

The Sort_Rules.py script can take several arguments. The only required one is providing the all_rules.csv rule to file mapping as the input path.

  -h, --help            show this help message and exit
  -i INPUTPATH, --input=INPUTPATH
                        Path to input file containing rule to file mapping
                        (required). Use YARA_Rules_Util to create the
                        rule mapping file
  -a, --autosort        Sort rules by file type using rule content and
                        metadata (optional)
  -m MOVEPATH, --move=MOVEPATH
                        Path to file containing mapping of where rules
                        should be moved (optional)
  -l LOOKUPPATH, --lookups=LOOKUPPATH
                        Path to file containing VTTL hash lookup results
                        (optional)
  -o OUTPUTPATH, --output=OUTPUTPATH
                        Output log file path

Figure 14: Sort_Rules.py arguments

If you want to override the file path where a rule gets put, you can provide a file path to the -m argument of a CSV in the same format that YARA_Rules_Util produces for all_rules.csv. Doing so will force the script to use the file paths in the provided CSV to place the rules.

Below is an example of using all the arguments:

Sort_Rules.py -i "E:\\YARA_Rules_Util\\all_rules.csv" -o "E:\\YARA_Sort_Rules\\log.txt" -m "E:\\YARA_Rules_Util\\rule_remapping.csv" -l "E:\\YARA_Hash_Values\\yara_Hash_lookups.csv" -a

Running the command will move all rules specified in all_rules.csv into subfolders specifying the file type.

Figure 15: Sorted YARA folder structure

The script also renames the rule file to include the file type at the end of the name.

Figure 16: Sorted YARA file name examples

Validation is an important step to ensure the rules are ready to use as you intend. Creating an index of YARA rules will allow for easy testing. An index references multiple rules files to be used in a single YARA scan. Running YARA using the index will tell you if any dependencies are missing or other errors exist. An index can be created using YARA Rules Util:

YARA_Util.py -d C:\YARA\rules-master -i C:\YARA\rules-master\index_new.yar -b rules-master -s
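
For context, an index is a rules file made up of YARA include directives. The Python sketch below generates that kind of text from a list of rule file paths. It illustrates the file format only and is not how YARA_Util.py is implemented; the function name and the example paths are hypothetical.

```python
def build_index(rule_files):
    """Return the text of a YARA index file that includes each rule file."""
    lines = []
    for rule_file in sorted(rule_files):
        # The include directive takes a quoted path; forward slashes avoid
        # having to escape backslashes inside the string.
        path = str(rule_file).replace("\\", "/")
        lines.append(f'include "./{path}"')
    return "\n".join(lines) + "\n"


# Hypothetical rule file paths, relative to the index file's folder.
print(build_index(["pe/example_PE.yar", "elf\\example_ELF.yar"]))
```

Compiling or scanning with the resulting index will surface missing dependencies and syntax errors in any of the included files.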

Don’t want to use the script to sort YARA rules by file type? I’ve gone ahead and published a copy of the popular Yara-Rules/rules repo with the rules already reorganized:

https://github.com/RandomRhythm/YARA_Rules_Project_Sorted_Ruleset

Conclusion

The methodology described here can be improved further, such as identification for additional file types. One thought was to categorize malware family names to file types to help limit false positives and provide potential file types for the rules. I utilized malware family lookups when I manually sorted the rules, but it wasn't possible to automate without a mapping. Ideas, suggestions and bug fixes are welcome, and I look forward to receiving feedback via the GitHub repo.

While there is room for improvement with automated rule sorting, the fact remains that limiting the YARA rules during a scan will decrease the chances that a YARA scan error is encountered. For example, when scanning a disk, the exception most likely to occur is "ERROR_TOO_MANY_MATCHES," raised when a rule has more than 10,000 string matches against a single file. Avoiding such errors is advisable because the YARA scan of the file is aborted, with no results returned. In one test using JJE against a disk image, there was a 2400% decrease in exceptions encountered when using selective YARA scanning vs. running all rules against everything.

Traditional YARA is going to be the fastest way to get results back. Consider utilizing a selective scanner when you want to be more targeted and verbose with rule matching. An example could be a YARA rule that looks for the string "PowerShell" in a file. The string "PowerShell" existing in a Microsoft Office document could be considered suspicious. However, the string "PowerShell" is a common occurrence in benign executables. Selective YARA scanning can lower the number of results that need to be reviewed and increase the fidelity of the detection.

Tools referenced in this blog

https://github.com/AdamWhiteHat/Judge-Jury-and-Executable
https://github.com/RandomRhythm/YARA_Rules_Project_Sorted_Ruleset
https://github.com/RandomRhythm/YARA_Rules_Util
https://github.com/RandomRhythm/RegEx_Hash_Scanner.py
https://github.com/RandomRhythm/Vendor-Threat-Triage-Lookup

By:

Ryan Boyle

Consultant with Threat Management Team | Optiv

Ryan Boyle is a consultant in Optiv’s Threat Management Team (incident management specialization).
