US20250005154A1 - Techniques for utilizing embeddings to monitor process trees - Google Patents
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
Definitions
- Aspects of the present disclosure relate to detecting cybersecurity events, and more particularly, to detecting cybersecurity events through analysis of process trees.
- Malware is a term that refers to malicious software.
- Malware includes software that is designed with malicious intent to cause intentional harm and/or bypass security measures.
- Malware is used, for example, by cyber attackers to disrupt computer operations, to access and to steal sensitive information stored on the computer or provided to the computer by a user, or to perform other actions that are harmful to the computer and/or to the user of the computer.
- Malware may be formatted as executable files (e.g., COM or EXE files), dynamic link libraries (DLLs), scripts, steganographic encodings within media files such as images, and/or other types of computer programs, or combinations thereof.
- Malware authors or distributors frequently disguise or obfuscate malware in attempts to evade detection by malware-detection or -removal tools.
- FIG. 1 is a block diagram that illustrates an example system, according to some embodiments of the present disclosure.
- FIG. 2 A is a schematic block diagram illustrating an example of a process tree, in accordance with some embodiments of the present disclosure.
- FIG. 2 B illustrates a schematic representation of the process tree of FIG. 2 A .
- FIG. 3 is a flow diagram of a method of generating a malware classification and/or malware explanation, according to some embodiments of the present disclosure.
- FIG. 4 is a schematic block diagram illustrating an operation of the embedding engine to generate a process embedding, in accordance with some embodiments of the present disclosure.
- FIG. 5 is a schematic block diagram illustrating an operation of generating a process tree embedding, in accordance with some embodiments of the present disclosure.
- FIG. 6 is a schematic block diagram illustrating an operation of generating a malware classification and/or malware explanation, in accordance with some embodiments of the present disclosure.
- FIG. 7 A is a block diagram illustrating an example training system for performing a machine learning operation based on process tree embeddings from process metadata, according to some embodiments of the present disclosure.
- FIG. 7 B is a block diagram of a system incorporating a neural network model for generating a classification and/or explanation of a process tree embedding based on process metadata, according to some embodiments of the present disclosure.
- FIG. 8 is a flow diagram of another method of generating a malware classification and/or malware explanation, according to some embodiments of the present disclosure.
- FIG. 9 is a schematic block diagram illustrating an operation of generating a process tree embedding from process metadata, in accordance with some embodiments of the present disclosure.
- FIG. 10 is a flow diagram of a method of operating a malware detection system, according to some embodiments of the present disclosure.
- FIG. 11 is a component diagram of an example of a device architecture for malware detection, in accordance with embodiments of the disclosure.
- FIG. 12 is a block diagram of an example computing device that may perform one or more of the operations described herein, in accordance with embodiments of the disclosure.
- Modern computer systems are subject to a large number of potential malware attacks.
- Examples of malware include computer viruses, worms, Trojan horses, ransomware, rootkits, keyloggers, spyware, adware, rogue security software, potentially unwanted programs (PUPs), potentially unwanted applications (PUAs), and other malicious programs.
- Users may install scanning programs which attempt to detect the presence of malware. These scanning programs may review programs and/or executables that exist on the computer's storage medium (e.g., a hard disk drive (HDD)) prior to execution of the file.
- Authors and distributors of malware have taken countermeasures to avoid these scanning programs.
- For example, the malware may be obfuscated to conceal the contents of the file.
- Obfuscation may include varying the contents of the file to misdirect, confuse, or otherwise conceal the true purpose and/or function of the code.
- obfuscation may include inserting inoperable code within the executable instructions, compressing/encrypting the operating instructions, rearranging the code instructions, and/or other techniques. These techniques can make it difficult to identify malware in at-rest files.
- One such approach is extended detection and response (XDR).
- XDR enables advanced forensic investigation and threat hunting capabilities across multiple domains from a single administrative interface.
- XDR techniques may include ingesting and normalizing volumes of data from endpoints (e.g., client computing devices), cloud workloads, identity, email, network traffic, virtual containers, and more. XDR techniques may then parse and correlate the data to automatically detect threats and respond to them, potentially prioritized by severity, so that threat hunters can quickly analyze and triage new events. In some cases, XDR techniques may attempt to automate investigation and response activities.
- processes running on a client device may be analyzed.
- the processes may be organized within an operating system of the client device as a process tree.
- a process tree can be viewed as a graph that shows meaningful connections between processes.
- a whole call stack may be manually parsed to understand and explain certain behaviors.
- manually parsing might prove infeasible (e.g., for thousands of nodes) and unproductive, as an overview is usually sufficient rather than dissecting each step in the execution.
- Embodiments of the present disclosure address the above-noted and other deficiencies by providing an automated solution that can explain a process tree in natural language, to help analysts scale up operations and provide more informed answers to potential security events in a timely manner.
- Embodiments of the present disclosure provide an automated tool that can generate, in natural language, a high-level explanation of the inner workings of a process tree.
- Embodiments of the present disclosure may reduce the amount of time and processing resources needed to identify a potential threat on a computing device.
- a self-supervised embedding model that leverages the power of said embeddings is used to train an identification model that can explain, in natural language, the details associated with a process tree.
- Some embodiments of the present disclosure may create embeddings at the process level, and then create embeddings at the process tree level from the process level embeddings. After the process tree representation has been constructed by training the self-supervised embedding, a mapping between process trees and explanations may be learned by leveraging various metadata (notes, patterns, tags) associated with the process trees as a supervision signal.
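The two-level embedding step above can be sketched in Python. Mean pooling is one simple, illustrative aggregation choice; the disclosure describes aggregating process-level embeddings into a tree-level embedding generally, without prescribing a specific operator:

```python
def tree_embedding(process_embeddings):
    """Mean-pool fixed-length per-process embedding vectors into a single
    process-tree embedding of the same dimensionality.

    Mean pooling is illustrative only; other reductions (sum, max,
    attention-weighted) would fit the same interface.
    """
    n = len(process_embeddings)
    dim = len(process_embeddings[0])
    return [sum(vec[i] for vec in process_embeddings) / n for i in range(dim)]
```

For example, pooling the two process-level vectors `[1.0, 2.0]` and `[3.0, 4.0]` yields the tree-level vector `[2.0, 3.0]`.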
- the embodiments described herein provide improvements over some security mechanisms which rely on the detection of particular patterns in stored files.
- the embedding model described herein may be capable of determining features of a process tree (e.g., metadata and/or structure associated with processes of the process tree) that are indicative of an executing process that contains malware. These features may be identified, in some cases, regardless of attempts by an author of the malware to change its data signature. In this way, embodiments according to the present disclosure may provide an improved capability of detecting malware, and may increase the security of a computer system.
- FIG. 1 is a block diagram that illustrates an example system 100 , according to some embodiments of the present disclosure.
- FIG. 1 and the other figures may use like reference numerals to identify like elements.
- the system 100 includes a first computing device 110 (also referred to herein as a detection computing device 110 ) and a second computing device 120 (also referred to herein as a client computing device 120 ).
- the detection computing device 110 and the client computing device 120 may each include hardware such as processing device 122 (e.g., processors, central processing units (CPUs)), memory 124 (e.g., random access memory (RAM)), storage devices 126 (e.g., hard-disk drives (HDD), solid-state drives (SSD), etc.), and other hardware devices (e.g., sound card, video card, etc.).
- memory 124 may be volatile memory that loses contents when the power to the computing device is removed or non-volatile memory that retains its contents when power is removed. In some embodiments, memory 124 may be non-uniform memory access (NUMA), such that memory access time depends on the memory location relative to processing device 122 .
- Processing device 122 may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets.
- Processing device 122 may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like.
- the storage device 126 may comprise a persistent storage that is capable of storing data.
- a persistent storage may be a local storage unit or a remote storage unit.
- Persistent storage may be a magnetic storage unit, optical storage unit, solid state storage unit, electronic storage unit (main memory), or similar storage unit. Persistent storage may also be a monolithic/single device or a distributed set of devices.
- the detection computing device 110 and/or the client computing device 120 may comprise any suitable type of computing device or machine that has a programmable processor including, for example, server computers, desktop computers, laptop computers, tablet computers, smartphones, set-top boxes, etc.
- the detection computing device 110 may comprise a single machine or may include multiple interconnected machines (e.g., multiple servers configured in a cluster).
- the detection computing device 110 and/or the client computing device 120 may be implemented by a common entity/organization or may be implemented by different entities/organizations.
- the detection computing device 110 and/or the client computing device 120 may be coupled to each other (e.g., may be operatively coupled, communicatively coupled, may communicate data/messages with each other) via network 102 .
- Network 102 may be a public network (e.g., the internet), a private network (e.g., a local area network (LAN) or wide area network (WAN)), or a combination thereof.
- network 102 may include a wired or a wireless infrastructure, which may be provided by one or more wireless communications systems, such as a WIFI™ hotspot connected with the network 102 and/or a wireless carrier system that can be implemented using various data processing equipment, communication towers (e.g., cell towers), etc.
- the network 102 may carry communications (e.g., data, messages, packets, frames, etc.) between the detection computing device 110 and/or the client computing device 120 .
- the client computing device 120 may execute an operating system 115 .
- the operating system 115 of the client computing device 120 may manage the execution of other components (e.g., software, applications, etc.) and/or may manage access to the hardware (e.g., processors, memory, storage devices etc.) of the client computing device 120 .
- Operating system 115 may be software to provide an interface between the computing hardware (e.g., processing device 122 and/or storage device 126 ) and applications running on the operating system 115 .
- Operating system 115 may include an OS kernel and a user space supporting the execution of one or more processes 210 .
- the number of processes 210 illustrated in FIG. 1 is merely for purposes of explanation, and is not intended to limit the embodiments of the present disclosure.
- Operating system 115 may include several operating system functionalities, including but not limited to process management, hardware interfaces, access control and the like. Examples of operating systems 115 include WINDOWS™, LINUX™, ANDROID™, IOS™, and MACOS™.
- the detection computing device 110 may also include an operating system, which may, in some embodiments, be different than that of the operating system 115 of the client computing device 120 .
- the client computing device 120 may execute (e.g., using processing device 122 ) the one or more processes 210 .
- Process 210 may be a desktop application, a network application, a database application, or any other application that may be executed by the operating system 115 .
- the process 210 may be loaded from a process executable (e.g., in storage device 126 ) into memory 124 .
- the process executable may be a file, for example, on the storage device 126 that contains executable instructions.
- the operating system 115 may allocate execution resources (e.g., processing device 122 and/or memory 124 ) to the process 210 (e.g., by a multi-tasking scheduler).
- the processing device 122 may execute the executable instructions of the process 210 .
- the processes 210 may execute within a tree hierarchy. As will be described further herein, a first process 210 may spawn a second process 210 , which may further spawn other processes 210 .
- the hierarchical relationship of the processes 210 may be represented by a process tree 220 .
- the process tree 220 may illustrate the parent-child relationships within the processes 210 .
- a first process 210 that spawns a second process 210 may be referenced as the parent of the second process 210
- the second process 210 may be referenced as the child of the first process 210 .
- the first process 210 may be referenced as the grandparent of the third process 210 , and so on.
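The parent/child/grandparent relationships described above can be captured with a minimal tree node. The class and names here are an illustrative sketch, not part of the disclosure:

```python
class ProcessNode:
    """One process 210 in a process tree 220 (illustrative sketch)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)  # register with the spawning process

    def ancestors(self):
        """Yield the parent, grandparent, and so on, up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent
```

For instance, if a first process spawns a second, which spawns a third, the ancestors of the third node are the second node (parent) followed by the first node (grandparent).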
- the operating system 115 of the client computing device 120 may also execute a monitoring engine 215 .
- the monitoring engine 215 may monitor the processes 210 executing on the client computing device 120 .
- the monitoring engine 215 may execute at an elevated authority.
- the monitoring engine 215 may have administrative access that allows it to collect process metadata 250 for each of the processes 210 of a particular process tree 220 .
- the monitoring engine 215 may execute as part of, or an extension of, the operating system 115 .
- the process metadata 250 may be provided to the detection computing device 110 to facilitate embodiments of the present disclosure.
- FIG. 2 A is a schematic block diagram illustrating an example of a process tree 220 , in accordance with some embodiments of the present disclosure. A description of elements of FIG. 2 A that have been previously described will be omitted for brevity. FIG. 2 A provides additional details on the structure of the process tree 220 .
- the process tree 220 represents the execution of a plurality of processes 210 (e.g., within operating system 115 ) on a computing device, such as client computing device 120 .
- FIG. 2 A an example is illustrated in which eight processes (e.g., 210 A, 210 B, 210 C, 210 D, 210 E, 210 F, 210 G, 210 H) are shown in an example configuration of a process tree 220 .
- the specific configuration of the processes 210 of the process tree 220 in the example of FIG. 2 A is not intended to limit the embodiments of the present disclosure.
- a first process 210 A may execute (e.g., within operating system 115 ). During execution, the first process 210 A may spawn a second process 210 B. As used herein, spawn may refer to an operation in which one process 210 begins execution of another process 210 . Though illustrated as a single operation in FIG. 2 A , this is only for example. In some embodiments of an operating system 115 , spawning a child process 210 may be accomplished by a plurality of operations.
- spawning a new process 210 may include first performing an operation typically called a fork, which creates a child process 210 as a copy of the parent process 210 , including the instruction codes and memory space.
- the child process 210 may return from the system call in the same manner as the parent process 210 (e.g., within the copy of the parent instruction codes), and may continue executing from that point.
- an exec operation refers to a function in an operating system, and/or provided by a system call that interfaces with the operating system, that operates to replace the instruction space of a process 210 with a new set of instruction codes.
- An example of an exec operation in the LINUX operating system is an operation performed by the kernel in response to an execv() system call.
- a parent process 210 that wishes to spawn a different program/application will first perform a fork, and then the child process 210 may perform an exec of the different program/application.
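On a POSIX system, the fork-then-exec pattern described above can be sketched as follows; this is a minimal illustration of the operating-system mechanism, not the disclosure's implementation:

```python
import os

def spawn(program, args):
    """Fork a copy of the current process, then replace the child's
    instruction space with `program` via exec; returns the child's
    process ID to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: returns from fork within the copied instruction codes,
        # then execs the new program, replacing its instruction space.
        os.execv(program, [program] + args)
    # Parent: continues executing with the child's process ID.
    return pid
```

The parent may then wait for the child to finish with `os.waitpid(pid, 0)`.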
- the first process 210 A may spawn the second process 210 B. While the first process 210 A is executing, the monitoring engine 215 may collect first process metadata 250 A from the first process 210 A.
- the first process metadata 250 A may include a number of particular data values that correspond to information associated with the first process 210 A.
- the second process 210 B may continue to execute after being spawned by the first process 210 A.
- the monitoring engine 215 may collect second process metadata 250 B from the second process 210 B.
- the second process 210 B may spawn a third process 210 C and a fourth process 210 D.
- the third process 210 C may spawn a fifth process 210 E and a sixth process 210 F.
- the fourth process 210 D may execute a seventh process 210 G, which may subsequently spawn an eighth process 210 H.
- FIG. 2 B illustrates a schematic representation of the process tree 220 of FIG. 2 A .
- the representation in FIG. 2 B illustrates the hierarchical nature of the process tree 220 .
- the first process 210 A is the parent of the second process 210 B.
- the second process 210 B is the child of the first process 210 A and the parent of the third process 210 C and the fourth process 210 D.
- the first process 210 A is the grandparent of the third process 210 C and the fourth process 210 D.
- the fifth process 210 E and the sixth process 210 F are children of the third process 210 C and grandchildren of the second process 210 B.
- the seventh process 210 G is the child of the fourth process 210 D and the parent of the eighth process 210 H.
- processes 210 that are hierarchically above a target process 210 in the process tree 220 may be referred to as an ancestor of the target process 210 .
- processes 210 G, 210 D, 210 B, and 210 A may be referred to as ancestors of process 210 H.
- the process tree 220 may be viewed as a graph that provides information about the executions of the processes 210 within it and the connections between the processes 210 .
- analysis for malware may utilize the process tree 220 to detect potential fingerprints of harmful activity. Such analysis may examine a target process 210 , its parent process 210 , and its grandparent process 210 . As an example utilizing FIG. 2 B , analysis in a Windows operating system environment may detect that the process 210 H (see FIGS. 2 A and 2 B ) was executed with a command line of regsvr32.exe. Based on this, the parent process 210 G of the target process 210 H may be analyzed, which may be, for example, powershell.exe.
- the parent process 210 D of powershell.exe may be analyzed, which may be, for example, wscript.exe, which may be determined to be malware that was executed by an infected application.
- the relevant process 210 D (in this case, wscript.exe) may be detected and determined to be a relevant cause of the eventual execution of the target process 210 H.
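The ancestor analysis above can be sketched as a walk up the parent chain. The watchlist and the parent mapping here are illustrative placeholders, not part of the disclosure:

```python
SUSPICIOUS_IMAGES = {"wscript.exe"}  # illustrative watchlist, not from the disclosure

def find_relevant_ancestor(image, parent_of):
    """Walk the ancestors of `image` via the `parent_of` mapping and return
    the first one on the watchlist, i.e., a candidate root cause."""
    node = parent_of.get(image)
    while node is not None:
        if node in SUSPICIOUS_IMAGES:
            return node
        node = parent_of.get(node)
    return None
```

With the FIG. 2 B example (regsvr32.exe spawned by powershell.exe, itself spawned by wscript.exe), the walk starting at regsvr32.exe surfaces wscript.exe as the relevant ancestor.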
- similar types of malware may have similar process trees 220 , even if the contents of the processes 210 change.
- similar types of malware may be spawned in a similar series of operations.
- Embodiments of the present disclosure may collect information about the processes 210 and the structure of the process tree 220 . This information may be analyzed to determine an operational state of the process tree 220 and, in some embodiments, determine whether a particular configuration of processes 210 of a process tree 220 may be associated with malware.
- process metadata 250 may be collected for each of the processes 210 .
- the process metadata 250 may be collected by the monitoring engine 215 .
- first process metadata 250 A may be collected for the first process 210 A
- second process metadata 250 B may be collected for the second process 210 B
- third process metadata 250 C may be collected for the third process 210 C
- fourth process metadata 250 D may be collected for the fourth process 210 D
- fifth process metadata 250 E may be collected for the fifth process 210 E
- sixth process metadata 250 F may be collected for the sixth process 210 F
- seventh process metadata 250 G may be collected for the seventh process 210 G
- eighth process metadata 250 H may be collected for the eighth process 210 H.
- the process metadata 250 may be collected (e.g., by the monitoring engine 215 ) when the associated process 210 is spawned. In some embodiments, the process metadata 250 may be collected at particular operation points during the execution of the process 210 . For example, process metadata 250 may be collected when the process 210 is first spawned. The process metadata 250 may be collected again and/or updated when other operations are performed by the process 210 , such as a filesystem access, a screen capture, spawning another process, accessing a network, or other operation. For operating systems 115 that support fork and exec system calls, process metadata 250 may be collected when the process 210 is first forked and again when the process 210 performs an exec system call.
- the process metadata 250 may include information related to the associated process 210 .
- information of the process metadata 250 includes, but is not limited to an operating system process identifier (ID) of the process 210 , a unique generated process ID (UPID) of the process 210 (UPIDs are described, for example, in U.S. patent application Ser. No. 18/081,144, filed on Dec. 14, 2022, and U.S. patent application Ser. No. 18/081,149, filed on Dec.
- an operating system process ID of the parent of the process 210 , a UPID of the parent of the process 210 , an operating system process ID of the grandparent of the process 210 , a UPID of the grandparent of the process 210 , a filename (e.g., a full path) of an image (e.g., an executable process image stored in storage device 126 ) of the process 210 , the command line used to create the process 210 , the image filename of the parent process 210 , the command line of the parent process 210 , the image filename of the grandparent process 210 , the command line of the grandparent process 210 , an identification of the action that caused the generation of the process metadata 250 (e.g., the identification of the operation, such as an exec, a screenshot, access of the storage device 126 , etc.), the name of the action that caused the generation of the process metadata 250 , a uniform resource locator (URL) associated with the process 210 , and the like.
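A subset of those fields can be sketched as a record type. The field names and types here are illustrative; the disclosure lists the information collected but does not prescribe a schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessMetadata:
    """Illustrative subset of process metadata 250 for one process 210."""
    os_pid: int                 # operating system process ID
    upid: str                   # unique generated process ID (UPID)
    parent_os_pid: int
    parent_upid: str
    image_path: str             # full path of the process image
    command_line: str           # command line used to create the process
    parent_image_path: str
    parent_command_line: str
    action: str                 # operation that triggered collection (e.g., an exec)
    url: Optional[str] = None   # URL associated with the process, if any
```

A record like this could be emitted each time the monitoring engine observes a triggering operation (a spawn, an exec, a filesystem access, and so on).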
- the monitoring engine 215 may transmit the process metadata 250 to the detection computing device 110 .
- the monitoring engine 215 may transmit the process metadata 250 over the network 102 connecting the client computing device 120 to the detection computing device 110 .
- the detection computing device 110 may utilize the process metadata 250 to analyze the process trees 220 of the client computing device 120 .
- a malware identification engine 255 of the detection computing device 110 may analyze the process metadata 250 to determine a classification 254 and/or an explanation 252 associated with the process metadata 250 .
- the classification 254 may be a determination as to whether malware is present on the client computing device 120 , where the malware is associated with the process metadata 250 .
- the explanation 252 may be an identification of which process 210 or processes 210 of a given process tree 220 are associated with malware. In some embodiments, the explanation 252 may include an identification of which process 210 of the process tree 220 is most relevant in the determination of malware. For example, the explanation 252 may identify that a particular process 210 is performing harmful activities on the client computing device 120 , but that another ancestor process 210 (e.g., a grandparent process 210 or great-grandparent process 210 ) is also infected and is likely the root cause of the malware. In some embodiments, the explanation 252 may be generated as natural language text suitable for display to a user, administrator, and/or analyst.
- FIG. 3 is a flow diagram of a method 300 of generating a malware classification 254 and/or malware explanation 252 , according to some embodiments of the present disclosure.
- Method 300 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof.
- the method 300 may be performed by a computing device (e.g., detection computing device 110 ).
- method 300 illustrates example functions used by various embodiments. Although specific function blocks (“blocks”) are disclosed in method 300 , such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 300 . It is appreciated that the blocks in method 300 may be performed in an order different than presented, and that not all of the blocks in method 300 may be performed.
- the method 300 begins at block 310 , in which process metadata 250 is received from processes 210 of a process tree 220 .
- the process metadata 250 may be collected (e.g., from a client computing device 120 ) by a monitoring engine 215 and transmitted to the detection computing device 110 over the network 102 .
- the detection computing device 110 may store the process metadata 250 in a process metadata store 285 (e.g., in storage device 126 ).
- one or more process metadata 250 may be utilized to generate one or more process embeddings 262 from the process metadata 250 .
- the one or more process embeddings 262 may be generated by an embedding engine 260 (see FIG. 1 ) of the detection computing device 110 .
- the embedding engine 260 may be utilized to generate embeddings for data, words, sentences, or documents.
- Embedding may refer to the process of taking a data element, such as a text string and/or other data, and producing a vector of numbers for it. In other words, the original data element is “embedded” into the new multi-dimensional (embedding) space.
- the generated vectors are also referred to herein as embeddings.
- the points associated with the embeddings, as represented in the multi-dimensional space, are close if the corresponding entities are similar and/or related.
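This closeness property can be sketched with a toy example. Cosine similarity is one common closeness measure for points in an embedding space; the three vectors below are made-up values, not output of the embedding model 265.

```python
# Toy illustration of embedding-space closeness: related entities map to
# nearby vectors, so their cosine similarity is higher. The vectors are
# invented for illustration only.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

dog = [0.9, 0.8, 0.1]
cat = [0.85, 0.75, 0.2]   # semantically close to "dog"
car = [0.1, 0.2, 0.95]    # unrelated

related = cosine(dog, cat)    # close pair -> similarity near 1
unrelated = cosine(dog, car)  # distant pair -> lower similarity
```

As the "dog"/"cat"/"car" example in the disclosure suggests, the related pair scores higher than the unrelated pair.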
- the embedding engine 260 may utilize an embedding model 265 .
- the embedding model 265 may be, or include, a large language model (LLM), though the embodiments of the present disclosure are not limited to such a configuration.
- the embedding model 265 may be a transformer-based neural network that is capable of managing tabular data.
- the embedding model 265 may be trained on vast amounts of text data using unsupervised learning techniques. During the training process, the embedding model 265 may learn, for example, to predict the next word in a sentence based on the context provided by the preceding words. Similarly, the embedding model 265 may learn, during training, relationships between data in datasets, such as tabular data. This process enables the embedding model 265 to develop a rich understanding of the relationships between words and the contextual nuances of language and associated data.
- text utilized to train the embedding model 265 may include text available online, such as text on web pages, postings, and the like, but the embodiments of the present disclosure are not limited to such a configuration.
- the embedding model 265 may be trained on process-specific contents, such as those included in operating systems 115 .
- the embedding model 265 may maintain its training state, for example, in storage 126 , which can be utilized when the embedding model 265 is operated.
- FIG. 4 is a schematic block diagram illustrating an operation of the embedding engine 260 to generate the process embedding 262 , in accordance with some embodiments of the present disclosure. A description of elements of FIG. 4 that have been previously described will be omitted for brevity.
- the embedding engine 260 may be configured to take as input the process metadata 250 and generate a corresponding process embedding 262 .
- the embedding engine 260 may process a first process metadata 250 A and generate (e.g., utilizing the embedding model 265 ) a first process embedding 262 A.
- second, third, up to Nth process metadata 250 B, 250 C, . . . 250 N may be processed by the embedding engine 260 to generate second, third, up to Nth process embeddings 262 B, 262 C, . . . 262 N.
- the number of process metadata 250 and process embeddings 262 illustrated in FIG. 4 is merely an example, and is not intended to limit the embodiments of the present disclosure.
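The mapping from process metadata 250 to process embeddings 262 can be sketched as follows. This is a minimal stand-in, not the patent's embedding model 265: a real system would use a trained model (e.g., an LLM), while here a hash-derived vector keeps the sketch deterministic and self-contained. The metadata strings and the `embed` helper are illustrative assumptions.

```python
# Minimal sketch of an "embedding engine": each process-metadata string
# is mapped to a fixed-length numeric vector. A hash-derived vector
# stands in for a trained embedding model so the example is runnable.
import hashlib

def embed(text, dims=8):
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # scale each digest byte into [0, 1) to form embedding coordinates
    return [digest[i] / 256.0 for i in range(dims)]

# hypothetical metadata strings for two processes of one process tree
metadata = ["pid=101 image=bash cmd=ls", "pid=102 image=python cmd=app.py"]
process_embeddings = [embed(m) for m in metadata]  # one embedding per process
```

The same metadata always yields the same vector, mirroring how a fixed embedding model maps identical inputs to identical embeddings.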
- the generated process embeddings 262 for a given input of process metadata 250 are numerical representations (e.g., vectors) that encode semantic and syntactic properties of the language represented by the input.
- the process embeddings 262 may be high-dimensional vectors, where the dimensions capture different aspects of the language and/or data of the process metadata 250 .
- the process embeddings 262 produced by the embedding engine 260 may have several desirable properties. First, the process embeddings 262 may capture semantic similarity, meaning that similar words or phrases are represented by vectors that are close to each other in the embedding space.
- the embeddings of “dog” and “cat” would be closer together than the embeddings of “dog” and “car.” This property allows for tasks like word similarity measurement or finding related words based on the vectors of the process embeddings 262 .
- the process embeddings 262 may capture contextual information. Since the embedding model 265 is trained on vast amounts of data and/or text, it may programmatically learn to understand the meaning of data and/or words based on their surrounding context. This enables the process embeddings 262 to reflect the meaning of data within the process metadata 250 . Furthermore, the embedding engine 260 may generate process embeddings 262 by aggregating the embeddings of individual portions of the process metadata 250 . This allows for understanding the overall meaning and semantic compositionality of longer metadata portions.
- the embedding engine 260 of the detection computing device 110 may generate a process embedding 262 for each of the process metadata 250 and store the results in the storage device 126 .
- the process embeddings 262 may be utilized to classify potential malware executing on the client computing device 120 .
- the operations of the method 300 may continue with block 330 , in which a process tree embedding 264 is generated based on the process embeddings 262 .
- the embedding engine 260 may aggregate and/or combine the process embeddings 262 associated with a given process tree 220 .
- FIG. 5 is a schematic block diagram illustrating an operation of generating a process tree embedding 264 , in accordance with some embodiments of the present disclosure. A description of elements of FIG. 5 that have been previously described will be omitted for brevity.
- the process embeddings 262 that are associated with the processes 210 of the process tree 220 may be aggregated and/or combined.
- in FIG. 5 , an example is illustrated in which four process embeddings 262 ( 262 A, 262 B, 262 C, 262 D) are aggregated for a particular process tree 220 .
- the process embeddings 262 A, 262 B, 262 C, 262 D may be associated with the processes 210 of the process tree 220 (e.g., as executed on the operating system 115 of the client computing device 120 , as illustrated in FIGS. 1 , 2 A, and 2 B ).
- the process embeddings 262 A, 262 B, 262 C, 262 D may be processed by an aggregation operation 510 .
- the aggregation operation 510 may perform neighborhood aggregation on the various vectors of the process embeddings 262 associated with the process tree 220 .
- the aggregation operation 510 may be performed in a variety of different ways.
- the process tree embedding 264 may be constructed by performing an averaging of the coordinates of the vectors of the process embeddings 262 (e.g., the process embeddings 262 A, 262 B, 262 C, 262 D) associated with the process tree 220 .
- the coordinates of the vectors of the process embeddings 262 may be summed to generate the process tree embedding 264 .
- the process tree embedding 264 may be formed by taking the maximum value for each coordinate of the respective vectors of the process embeddings 262 .
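The three aggregation variants just described (coordinate-wise averaging, summing, and maximum) can be sketched over toy per-process embeddings; the two-dimensional vectors are invented for illustration.

```python
# Coordinate-wise pooling of per-process embeddings into one
# process-tree embedding, as described for aggregation operation 510.
def mean_pool(embs):
    return [sum(c) / len(embs) for c in zip(*embs)]

def sum_pool(embs):
    return [sum(c) for c in zip(*embs)]

def max_pool(embs):
    return [max(c) for c in zip(*embs)]

embs = [[1.0, 4.0], [3.0, 2.0]]   # two toy process embeddings
tree_mean = mean_pool(embs)       # [2.0, 3.0]
tree_sum = sum_pool(embs)         # [4.0, 6.0]
tree_max = max_pool(embs)         # [3.0, 4.0]
```

Each variant yields one vector with the same dimensionality as the inputs, which is the process tree embedding 264 in this sketch.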
- Other forms of combining the process embeddings 262 to generate the process tree embedding 264 are contemplated.
- the process embeddings 262 may be combined using a machine learning model, such as a neural network.
- the machine learning model may be trained based on known combinations of processes 210 and/or process trees 220 to aggregate the various coordinates of the vectors comprising the process embeddings 262 to generate the process tree embedding 264 .
- the operations of the method 300 may continue with operation 340 , in which a classification 254 and/or explanation 252 for the process tree 220 are generated based on the process tree embedding 264 .
- the classification 254 and/or the explanation 252 may be generated by a malware identification engine 255 of the detection computing device 110 .
- FIG. 6 is a schematic block diagram illustrating an operation of generating a malware classification 254 and/or malware explanation 252 , in accordance with some embodiments of the present disclosure. A description of elements of FIG. 6 that have been previously described will be omitted for brevity.
- the classification 254 may provide a determination as to whether the process tree 220 is associated with malware or another harmful operating condition. Referring to both FIG. 6 and FIG. 1 , to generate the classification 254 , the process tree embedding 264 may be processed by the malware identification engine 255 . In some embodiments, the malware identification engine 255 may process the process tree embedding 264 utilizing an identification model 275 . In some embodiments, the identification model 275 may be a neural network model based on machine learning.
- the malware identification engine 255 may also generate an explanation 252 .
- the explanation 252 may be an identification of which processes 210 of the process tree 220 are relevant in the determination of the malware classification 254 .
- the malware identification engine 255 may determine that a process tree embedding 264 is associated with malware.
- the classification 254 may identify that a particular process tree 220 is associated with malware, and the explanation 252 may identify why the process tree 220 is associated with malware, and the relevant portions of the process tree 220 (e.g., which processes 210 ) that contributed to the classification 254 .
- the explanation 252 may also be generated based on the identification model 275 .
- the malware identification engine 255 may take the process tree embedding 264 as input.
- the malware identification engine 255 may analyze the process tree embedding 264 utilizing the identification model 275 .
- the identification model 275 may be a machine learning model trained on a plurality of process tree embeddings 264 .
- Each of the process tree embeddings 264 may have a known classification 254 and a known explanation 252 .
- relationships may be established between different aspects of the process tree embeddings 264 and the classifications 254 and/or explanations 252 to generate the identification model 275 .
- FIG. 7 A is a block diagram illustrating an example training system for performing a machine learning operation based on process tree embeddings 264 from process metadata 250 , according to some embodiments of the present disclosure. A description of elements of FIG. 7 A that have been previously described will be omitted for brevity.
- a system 700 A for performing a machine learning operation may include learning operations 730 which perform a feedback controlled comparison between a training dataset 720 and a testing dataset 725 based on the process tree embeddings 264 .
- the system 700 A may be implemented by a classification training engine 270 of the detection computing device 110 , as illustrated in FIG. 1 .
- an identification model 275 may be pre-trained and provided to the detection computing device 110 .
- the process tree embeddings 264 may be combined with training classification value 754 and/or training explanation value 752 to generate process tree-specific input data 707 . More specifically, the process tree embeddings 264 from a particular process tree 220 may be combined with training classification value 754 and/or training explanation value 752 for the same process tree 220 .
- the training classification value 754 for the process tree 220 may identify whether the process tree 220 contains or is associated with malware, and the training explanation value 752 for the process tree 220 may identify the underlying basis for the malware classification and which of the processes 210 of the process tree 220 are most relevant to the malware classification.
- process trees 220 with known classifications may be collected and process tree embeddings 264 may be formed from process metadata 250 associated with processes 210 of the process tree 220 with known malware classifications and/or known explanations for the malware.
- the known classification and the known explanation of a given process tree 220 may be used as the training classification value 754 and/or the training explanation value 752 , and combined with the process tree embedding 264 to form the process tree-specific input data 707 for that process tree 220 .
- process metadata 250 may be collected from a process 210 that is part of a process tree 220 that is known to contain or be associated with malware.
- a training classification value 754 of the known-bad process tree 220 may be generated indicating that the process tree 220 is associated with malware and a training explanation value 752 of the known-bad process tree 220 may be generated identifying which portion of the process tree 220 is contributing to the training classification value 754 .
- a set of process tree embeddings 264 may be generated from the process metadata 250 (as described herein with respect to FIGS. 4 and 5 ). The set of process tree embeddings 264 may be combined with the training classification value 754 (e.g., malware) and/or the training explanation value 752 to generate the process tree-specific input data 707 for that process tree 220 .
- process metadata 250 may be collected from processes 210 of a process tree 220 that is known to be free of malware.
- a training classification value 754 and/or a training explanation value 752 of the known-good process tree 220 may be generated indicating that the process tree 220 is free of malware.
- a process tree embedding 264 may be generated from the processes 210 of the process tree 220 as described herein. The process tree embedding 264 may be combined with a training classification value 754 (e.g., malware-free) and/or a training explanation value 752 to generate the process tree-specific input data 707 for that process tree 220 .
- process tree-specific input data 707 may be generated for each process tree 220 used for training the identification model 275 .
- the process tree-specific input data 707 may be separated into two groups: a training dataset 720 and a testing dataset 725 .
- Each group of the training dataset 720 and the testing dataset 725 may include process tree-specific input data 707 (e.g., process tree embeddings 264 and their associated training classification value 754 and/or associated training explanation value 752 ) for a plurality of process trees 220 .
- Learning operation 730 may be performed on the training dataset 720 .
- the learning operations 730 may examine the process tree embeddings 264 to establish a relationship between the elements of the process tree embeddings 264 that accurately predict the classification value 754 (e.g., malware or not malware) and/or the explanation value 752 for a given process tree 220 .
- the learning operations 730 may generate a ML training model 765 that represents the determined relationship.
- the ML training model 765 may take a process tree embedding 264 as input, and output a classification value (e.g., malware or non-malware) and/or the explanation value for the process tree 220 associated with the process tree embedding 264 .
- the learning operations 730 may attempt to adjust parameters 735 of the ML training model 765 to generate a best-fit algorithm that describes a relationship between the process tree embedding 264 and the training classification value 754 and/or the training explanation value 752 for all of the process trees 220 of the training dataset 720 .
- a set of parameters 735 may be selected based on the training dataset 720 and preliminarily established as the ML training model 765 .
- the results of the learning operations 730 may be provided to an evaluation operation 740 .
- the evaluation operation 740 may utilize the ML training model 765 generated by the learning operations 730 (based on the training dataset 720 ) to see if the ML training model 765 correctly predicts the training classification value 754 and/or the training explanation value 752 for the process tree embeddings 264 of the testing dataset 725 . If the ML training model 765 accurately predicts the training classification values 754 and/or the training explanation values 752 of the testing dataset 725 , it may be promoted to the identification model 275 .
- the ML training model 765 does not accurately predict the training classification values 754 and/or the training explanation values 752 of the testing dataset 725 , feedback 712 may be provided to the learning operations 730 , and the learning operations 730 may be repeated, with additional adjustment of the parameters 735 . This process of learning operations 730 and evaluation operation 740 may be repeated until an acceptable identification model 275 is generated that is capable of accurately predicting the training classification values 754 , the training explanation values 752 , and/or combinations of the training classification values 754 and the training explanation values 752 .
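The learn/evaluate/promote loop above can be sketched with a toy linear classifier. This is a minimal sketch under stated assumptions, not the disclosure's learning operations 730: the data, the perceptron update rule, and the promotion criterion are all illustrative stand-ins.

```python
# Sketch of the train/evaluate feedback loop: adjust parameters on a
# training set, check predictions on a held-out set, and promote the
# model once it predicts accurately. Toy 2-D, linearly separable data.
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train(data, epochs=20, lr=0.1):          # "learning operations"
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, b, x)       # feedback from misprediction
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def accuracy(w, b, data):                    # "evaluation operation"
    return sum(predict(w, b, x) == y for x, y in data) / len(data)

# label 1 stands in for "malware", 0 for "not malware"
train_set = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]
test_set = [([0.15, 0.1], 0), ([0.85, 0.95], 1)]

w, b = train(train_set)
promoted = accuracy(w, b, test_set) == 1.0   # promote to identification model
```

In the disclosure's terms, an inaccurate model would instead feed back into further parameter adjustment rather than being promoted.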
- the identification model 275 may be used to predict classifications 254 and/or explanations 252 for production process tree embeddings 264 .
- process metadata 250 may be generated, as described herein.
- a process tree embedding 264 may be generated in a manner similar to that discussed herein (e.g., with respect to FIGS. 4 and 5 ).
- a process embedding 262 may be generated for each process 210 of a process tree 220 , and the process tree embedding 264 may be generated from the process embeddings 262 .
- the process tree embedding 264 may be provided to the identification model 275 .
- the operations of the identification model 275 may generate the classification 254 (e.g., whether or not the process tree 220 associated with the production process tree embedding 264 contains and/or is associated with malware) and/or the explanation 252 (e.g., which processes 210 of the process tree 220 are relevant to the classification of the process tree 220 ).
- FIG. 7 B is a block diagram of a system 700 B incorporating a neural network model 790 for generating a classification 254 and/or an explanation 252 of a process tree embedding 264 based on process metadata 250 , according to some embodiments of the present disclosure.
- the system 700 B may be implemented by the classification training engine 270 of the detection computing device 110 , as illustrated in FIG. 1 .
- the embodiments of the present disclosure are not limited to such a configuration.
- the neural network model 790 may be pre-trained and provided as the identification model 275 to the detection computing device 110 .
- the neural network model 790 includes an input layer having a plurality of input nodes I1 to IN, a sequence of neural network layers (layers 1 to Z are illustrated in FIG. 7 B ) each including a plurality (e.g., 1 to X in FIG. 7 B ) of weight nodes, and an output layer including at least one output node.
- the input layer includes input nodes I1 to IN (where N is any plural integer).
- a first one of the sequence of neural network layers includes weight nodes N1L1 (where “1L1” refers to a first weight node on layer one) to NXL1 (where X is any plural integer).
- a last one (“Z”, where Z is any plural integer) of the sequence of neural network layers includes weight nodes N1LZ to NYLZ (where Y is any plural integer).
- the output layer includes a plurality of output nodes O1 to OM (where M is any plural integer).
- the neural network model 790 can be operated to process elements of the process tree embedding 264 through different inputs (e.g., input nodes I1 to IN) to generate one or more outputs (e.g., output nodes O1 to OM).
- the elements of the process tree embedding 264 that can be simultaneously processed through different input nodes I1 to IN may include, for example, statistical values (e.g., minimum, maximum, average, and/or standard deviation) of axes of an embedding space based on the process tree embedding 264 .
- the classification 254 and/or the explanation 252 that can be output may include an indication of whether the process tree 220 associated with the process tree embedding 264 is associated with malware and/or the processes 210 of the process tree 220 that may contribute to that classification.
- the various weights of the neural network layers may be adjusted based on a comparison of predicted process classification 254 and/or predicted explanation 252 to data of an actual classification and/or explanation (such as training classification value 754 and/or training explanation value 752 ).
- the comparison may be performed, for example, through the use of a loss function.
- the loss function may provide a mechanism to calculate how poorly the training model is performing by comparing what the model is predicting with the actual value it is supposed to output.
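As a concrete sketch of the loss idea, binary cross-entropy is one common choice for a malware/non-malware classification: the loss grows as the predicted probability moves away from the true label. The specific loss function is an assumption here; the disclosure does not fix one.

```python
# Binary cross-entropy: small when confident and correct, large when
# confident and wrong.
import math

def bce(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

good = bce(0.99, 1)   # confident and correct: small loss
bad = bce(0.01, 1)    # confident and wrong: large loss
```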
- the interconnected structure between the input nodes, the weight nodes of the neural network layers, and the output nodes may cause a given element of the process tree embedding 264 to influence the classification prediction 254 and/or the explanation prediction 252 generated for all of the other elements of the process tree embedding 264 that are simultaneously processed.
- the classification prediction 254 and/or the explanation prediction 252 generated by the neural network model 790 may thereby identify a comparative prioritization of which of the elements of the process tree embedding 264 provide a higher/lower impact on the classification 254 as to whether the associated process tree 220 is, or is not, associated with malware and/or the explanation prediction 252 as to which processes 210 of the process tree 220 contribute to that classification 254 .
- the neural network model 790 of FIG. 7 B is an example that has been provided for ease of illustration and explanation of one embodiment.
- Other embodiments may include any non-zero number of input layers having any non-zero number of input nodes, any non-zero number of neural network layers having a plural number of weight nodes, and any non-zero number of output layers having any non-zero number of output nodes.
- the number of input nodes can be selected based on the number of input values that are to be simultaneously processed, and the number of output nodes can be similarly selected based on the number of output characteristics that are to be simultaneously generated therefrom.
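A forward pass through the layered structure of FIG. 7B can be sketched as follows. The weights, biases, and tanh activation are toy assumptions; a trained identification model 275 would learn its own weights.

```python
# Sketch of FIG. 7B's structure: input nodes I1..IN feed a weight
# layer, whose outputs feed an output node O1. Weights are fixed toy
# values for illustration.
import math

def layer(inputs, weights, biases):
    # each row of `weights` holds one node's incoming weights
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

tree_embedding = [0.2, 0.7, 0.1]                 # elements at input nodes I1..I3
hidden = layer(tree_embedding,
               [[0.5, -0.3, 0.8], [0.1, 0.9, -0.2]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0]], [0.0])     # output node O1: classification score
```

Because every weight node mixes all of its inputs, each element of the tree embedding influences the output produced for the whole embedding, as described above.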
- an identification of malware may be made based on the classification 254 and/or explanation 252 .
- the explanation 252 may be utilized to identify a process 210 of the process tree 220 that is relevant to the classification of malware.
- the identification of the process 210 relevant to the classification of malware for the process tree 220 may allow for remediation actions to be taken with respect to the relevant process 210 .
- the method 300 illustrates that process embeddings 262 may be generated from process metadata 250 , and the process tree embeddings 264 may be generated from the process embeddings 262 .
- the embodiments of the present disclosure are not limited to such a configuration.
- FIG. 8 is a flow diagram of another method 800 of generating a malware classification 254 and/or malware explanation 252 , according to some embodiments of the present disclosure.
- Method 800 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof.
- the method 800 may be performed by a computing device (e.g., detection computing device 110 ).
- method 800 illustrates example functions used by various embodiments. Although specific function blocks (“blocks”) are disclosed in method 800 , such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 800 . It is appreciated that the blocks in method 800 may be performed in an order different than presented, and that not all of the blocks in method 800 may be performed.
- the method 800 begins at block 810 , in which process metadata 250 is received from processes 210 of a process tree 220 .
- the receipt of the process metadata 250 may be similar to the block 310 described herein with respect to FIG. 3 and, as such, a duplicate description thereof will be omitted.
- a process tree embedding 264 is generated based on the process metadata 250 .
- the embedding engine 260 may take as input the process metadata 250 .
- Operation 820 may generate the process tree embedding 264 directly from the process metadata 250 , rather than generating the intermediate process embeddings 262 , as described herein with respect to FIG. 3 .
- FIG. 9 is a schematic block diagram illustrating an operation of generating a process tree embedding 264 from process metadata 250 , in accordance with some embodiments of the present disclosure. A description of elements of FIG. 9 that have been previously described will be omitted for brevity.
- the process metadata 250 associated with the processes 210 of the process tree 220 may be provided to the embedding engine 260 .
- the process metadata 250 A, 250 B, 250 C, 250 D may be processed by the embedding engine 260 to generate the process tree embedding 264 .
- the generated process tree embedding 264 for a given set of process metadata 250 is a numerical representation that encodes semantic and syntactic properties of the language represented by the input (e.g., the process metadata 250 ).
- the process tree embedding 264 may be directly generated from the process metadata 250 .
- the process metadata 250 may be concatenated or otherwise combined when provided to the embedding engine 260 .
- additional process tree metadata 950 may be provided to the embedding engine 260 along with the process metadata 250 .
- the process tree metadata 950 may include information related to the process tree 220 , such as the number of processes 210 , details on the execution history of the process tree 220 , tags associated with the process tree 220 , statistics associated with the process tree 220 , and the like.
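The direct path of method 800 can be sketched by combining the per-process metadata (plus optional tree-level metadata 950) and embedding the result in a single step, with no intermediate per-process embeddings. As before, the hash-derived vector and the field strings are illustrative stand-ins for a trained embedding model and real metadata.

```python
# Direct generation of a process-tree embedding: concatenate all
# metadata for the tree and embed once. Hash-derived vector stands in
# for a trained embedding model.
import hashlib

def embed(text, dims=8):
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dims)]

process_metadata = ["pid=101 image=bash", "pid=102 image=python"]
tree_metadata = "process_count=2 tags=interactive"   # optional, may be omitted
tree_embedding = embed(" | ".join(process_metadata + [tree_metadata]))
```

Note that unlike the aggregation path of method 300, there is a single embedding call per tree, which is one reason the direct path may use fewer resources.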
- the process tree metadata 950 may be omitted.
- the operations of the method 800 may continue with operation 830 , in which a classification 254 and/or explanation 252 for the process tree 220 are generated based on the process tree embedding 264 .
- the classification 254 and/or the explanation 252 may be generated by a malware identification engine 255 of the detection computing device 110 .
- the generation of the classification 254 and/or explanation 252 for the process tree 220 may be similar to the block 340 described herein with respect to FIG. 3 and, as such, a duplicate description thereof will be omitted.
- the method 800 of FIG. 8 may allow for the generation of the process tree embedding 264 directly from the process metadata 250 , as compared to the method 300 of FIG. 3 .
- the classification 254 and/or explanation 252 may be generated more quickly and with the use of fewer resources.
- the embedding engine 260 may extract additional context from the process tree 220 , which may lead to a more accurate classification 254 and/or explanation 252 .
- the classification 254 and/or the explanation 252 may be utilized to detect and/or remediate malware. For example, in some embodiments, responsive to the classification 254 and/or the explanation 252 indicating that process tree 220 is associated with malware, steps may be taken to isolate the processes 210 of the process tree 220 . In some embodiments, for example, the classification 254 (and/or explanation 252 ) may be transmitted to the client computing device 120 . The client computing device 120 may take remediation to address any malware that may be indicated by the explanation 252 .
- the one or more processes 210 may be terminated and/or quarantined.
- the operating system 115 and/or the monitoring engine 215 may terminate and/or unload the process 210 (or all of the processes 210 of the process tree 220 ), deny the program executable associated with the process 210 permission to execute from memory 124 , and/or deny the process 210 access to resources of the client computing device 120 .
- an alert may be transmitted from the detection computing device 110 (e.g., to the client computing device 120 and/or another administrative computing device).
- the system 100 provides an improvement in the technology associated with computer security.
- the system 100 provides an improved malware detection platform that is able to holistically analyze a process tree 220 to determine if the process tree 220 is associated with malware, as well as determining which processes 210 of the process tree 220 are contributing to this determination.
- the system 100 is a technological improvement over some techniques for malware detection in that it does not exclusively utilize static image comparisons, which may be quickly varied by malware developers. Instead, embodiments according to the present disclosure may identify malware based on characteristics of portions of the running processes associated with the malware, and may be able to bypass obfuscation techniques that might otherwise make the malware detection difficult.
- the embedding model 265 and the identification model 275 are illustrated as separate models. However, the embodiments of the present disclosure are not limited to such a configuration. In some embodiments, the embedding model 265 and the identification model 275 may be different portions of a single machine learning-based model.
- FIG. 10 is a flow diagram of a method 1000 of operating a malware detection system, in accordance with some embodiments of the present disclosure.
- Method 1000 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof.
- the method 1000 may be performed by a computing device (e.g., detection computing device 110 ).
- method 1000 illustrates example functions used by various embodiments. Although specific function blocks (“blocks”) are disclosed in method 1000 , such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 1000 . It is appreciated that the blocks in method 1000 may be performed in an order different than presented, and that not all of the blocks in method 1000 may be performed.
- the method 1000 begins at block 1010 , which includes generating a process tree embedding corresponding to a process tree, the process tree comprising a plurality of processes.
- the process tree embedding and process tree may be similar to the process tree embedding 264 and/or the process tree 220 described herein with respect to FIGS. 1 to 9 .
- the plurality of processes may be similar to the processes 210 described herein with respect to FIGS. 1 to 9 .
- generating the process tree embedding corresponding to the process tree includes generating a process embedding corresponding to a first process of the process tree, and generating the process tree embedding based on the process embedding.
- the process embedding may be similar to the process embedding 262 described herein with respect to FIGS. 1 to 9 .
- generating the process embedding comprises submitting metadata associated with the first process to a large language model (LLM).
- the metadata may be similar to the process metadata 250 described herein with respect to FIGS. 1 to 9 .
- the metadata includes at least one of an operating system process identifier (ID) of the first process, a unique generated process ID (UPID) of the first process, an operating system process ID of an ancestor in the process tree of the first process, a UPID of the ancestor in the process tree of the first process, a filename of an executable image of the first process, a command line used to create the first process, a filename of an executable image of the ancestor in the process tree of the first process, a command line of the ancestor in the process tree of the first process, or an identification of an action that caused a generation of the metadata for the first process.
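Before such metadata can be embedded, it is typically flattened into a single text string. The sketch below shows one plausible serialization; the field names and the commented `embed` call are illustrative assumptions, not the disclosed implementation:

```python
def serialize_process_metadata(meta: dict) -> str:
    """Flatten process metadata into one text string for an embedding model.

    The keys below are illustrative stand-ins for the fields listed above:
    OS process ID, UPID, ancestor IDs, image filenames, command lines, and
    the action that generated the metadata.
    """
    fields = [
        ("pid", meta.get("pid")),
        ("upid", meta.get("upid")),
        ("parent_pid", meta.get("parent_pid")),
        ("parent_upid", meta.get("parent_upid")),
        ("image", meta.get("image")),
        ("cmdline", meta.get("cmdline")),
        ("parent_image", meta.get("parent_image")),
        ("parent_cmdline", meta.get("parent_cmdline")),
        ("action", meta.get("action")),
    ]
    # Omit fields that were not collected rather than emitting empty values.
    return " | ".join(f"{k}={v}" for k, v in fields if v is not None)

text = serialize_process_metadata({
    "pid": 4211,
    "upid": "a1b2-c3d4",
    "image": "C:\\Windows\\System32\\cmd.exe",
    "cmdline": "cmd.exe /c whoami",
    "action": "process_create",
})
# The resulting string would then be submitted to an embedding model, e.g.
# vector = embedding_model.embed(text)   # hypothetical API
```

The single-string form lets one model handle heterogeneous metadata without a fixed feature schema.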
- generating the process tree embedding corresponding to the process tree includes generating process embeddings for each of the plurality of processes of the process tree, and aggregating the process embeddings to generate the process tree embedding.
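The aggregation step above can be as simple as element-wise mean pooling of the per-process vectors. The patent does not mandate a particular aggregation, so this pure-Python sketch is only one plausible choice (sum, max, or a learned combination would also fit the description):

```python
def aggregate_process_embeddings(process_embeddings):
    """Combine per-process embedding vectors into one process-tree embedding
    by element-wise mean pooling (an illustrative aggregation choice)."""
    num = len(process_embeddings)
    dim = len(process_embeddings[0])
    return [sum(vec[i] for vec in process_embeddings) / num for i in range(dim)]

# Three toy 4-dimensional process embeddings for one process tree
tree_embedding = aggregate_process_embeddings([
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
])
```

Mean pooling keeps the tree embedding the same dimensionality as the process embeddings regardless of how many processes the tree contains.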
- generating the process tree embedding corresponding to the process tree includes generating the process tree embedding based on respective metadata of each of the plurality of processes of the process tree.
- operations of the method 1000 may include processing the process tree embedding with a machine learning model to generate an identification of malware associated with the process tree.
- the machine learning model may be similar to the identification model 275 described herein with respect to FIGS. 1 to 9 .
- the identification may be similar to and/or based on the classification 254 and/or the explanation 252 described herein with respect to FIGS. 1 to 9 .
- processing the process tree embedding with the machine learning model to generate the identification of malware associated with the process tree may include processing the process tree embedding with the machine learning model to generate a classification of the process tree as being associated with malware; and, responsive to the classification indicating that the process tree is associated with malware, generating, by a processing device, an identification of a first process of the plurality of processes that is relevant to the classification of the process tree as being associated with malware.
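The two-stage flow above (classify first, identify a relevant process only on a positive classification) can be sketched as follows. The `classifier`, `explainer`, and `threshold` here are illustrative stand-ins, not the disclosed models:

```python
def identify_malware(tree_embedding, classifier, explainer, threshold=0.5):
    """Classify a process-tree embedding, then explain only positive hits.

    `classifier` and `explainer` stand in for the identification model (or
    two heads of a single model); their interfaces are assumed here.
    """
    score = classifier(tree_embedding)   # probability the tree is malicious
    if score < threshold:
        return {"malware": False}
    # Responsive to a positive classification, identify the process most
    # relevant to the decision (e.g., via an attribution/explanation head).
    relevant_process = explainer(tree_embedding)
    return {"malware": True, "score": score, "relevant_process": relevant_process}

# Toy stand-ins for demonstration only
result = identify_malware(
    tree_embedding=[0.2, 0.9],
    classifier=lambda e: max(e),                 # pretend malware score
    explainer=lambda e: "suspicious_child.exe",  # pretend attribution
)
```

Gating the explanation step on the classification avoids spending attribution work on trees already deemed benign.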
- the classification may be similar to the classification 254 described herein with respect to FIGS. 1 to 9 .
- the identification of the first process may be based on the explanation 252 described herein with respect to FIGS. 1 to 9 .
- the method 1000 may further include, responsive to the identification of malware associated with the process tree, initiating remediation on one or more of the plurality of processes of the process tree.
- FIG. 11 is a component diagram of an example of a device architecture 1100 for malware detection, in accordance with embodiments of the disclosure.
- the device architecture 1100 includes computing device 110 having processing device 122 and memory 124 , as described herein with respect to FIGS. 1 to 10 .
- a process tree embedding 1164 may be generated that corresponds to a process tree 220.
- the process tree 220 may include a plurality of processes 210 .
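A process tree comprising a plurality of processes can be captured with a simple mapping from each process to its children. A minimal sketch (the `pid`/`ppid` record keys are illustrative; real telemetry would carry the richer metadata discussed elsewhere):

```python
from collections import defaultdict

def build_process_tree(processes):
    """Build a parent -> children map from (pid, parent pid) records."""
    children = defaultdict(list)
    for proc in processes:
        children[proc["ppid"]].append(proc["pid"])
    return children

tree = build_process_tree([
    {"pid": 100, "ppid": 1},    # spawned by the init/system process
    {"pid": 200, "ppid": 100},  # child of 100
    {"pid": 300, "ppid": 200},  # grandchild of 100
])
```

Such a map supports the tree traversals needed to collect per-process metadata for embedding.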
- the process tree embedding 1164 and process tree may be similar to the process tree embedding 264 and/or the process tree 220 described herein with respect to FIGS. 1 to 10 .
- the plurality of processes 210 may be similar to the processes 210 described herein with respect to FIGS. 1 to 10 .
- the process tree embedding 1164 may be processed with a machine learning (ML) model 1175 to generate an identification 1152 of the process tree 220 as being associated with malware.
- ML model 1175 may be similar to the identification model 275 described herein with respect to FIGS. 1 to 10 .
- the identification 1152 may be based on the classification 254 and/or explanation 252 described herein with respect to FIGS. 1 to 10 .
- the device architecture 1100 of FIG. 11 provides an improved capability for malware detection.
- the device architecture 1100 allows for analysis and/or detection of malware based on process metadata and process tree comparison. By detecting a similarity of a particular process tree to other process trees that are associated with malware, a potentially problematic process may be detected without having to wait for damaging operations to occur.
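Detecting similarity between a tree and previously seen malicious trees can be done by comparing embeddings directly; a cosine-similarity sketch under that assumption (the patent does not commit to a specific similarity measure):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_known_malware(tree_embedding, known_malicious_embeddings):
    """Highest similarity between a tree and any known-malicious tree."""
    return max(cosine_similarity(tree_embedding, m)
               for m in known_malicious_embeddings)

best = nearest_known_malware(
    [0.9, 0.1, 0.0],                      # embedding of the tree under test
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # embeddings of known-malicious trees
)
# A similarity near 1.0 suggests the tree closely resembles a known-malicious tree.
```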
- the device architecture 1100 may be able to detect malware based on the behavior and/or configuration of the processes and process tree, rather than relying on a scan or comparison of static executable images.
- the device architecture 1100 may be configured to provide a natural language explanation for a relevant process of the process tree that is responsible for the determination that the process tree is associated with malware.
- the device architecture 1100 may be able to identify other processes that are the root cause of the malware.
- the device architecture 1100 provides a technological improvement to the operation of typical computing devices in that it is able to identify malware more efficiently and does not require constant updates to maintain malware signatures, saving processing resources and reducing downtime.
- FIG. 12 is a block diagram of an example computing device 1200 that may perform one or more of the operations described herein, in accordance with some embodiments of the disclosure.
- Computing device 1200 may be connected to other computing devices in a LAN, an intranet, an extranet, and/or the Internet.
- the computing device may operate in the capacity of a server machine in a client-server network environment or in the capacity of a client in a peer-to-peer network environment.
- the computing device may be provided by a personal computer (PC), a set-top box (STB), a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
- the example computing device 1200 may include a processing device (e.g., a general purpose processor, a PLD, etc.) 1202 , a main memory 1204 (e.g., synchronous dynamic random access memory (DRAM), read-only memory (ROM)), a static memory 1206 (e.g., flash memory) and a data storage device 1218 , which may communicate with each other via a bus 1230 .
- Processing device 1202 may be provided by one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like.
- processing device 1202 may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets.
- processing device 1202 may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like.
- the processing device 1202 may execute the operations and steps discussed herein, in accordance with one or more aspects of the present disclosure.
- Computing device 1200 may further include a network interface device 1208 which may communicate with a network 1220 .
- the computing device 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 1212 (e.g., a keyboard), a cursor control device 1214 (e.g., a mouse) and an acoustic signal generation device 1216 (e.g., a speaker).
- video display unit 1210 , alphanumeric input device 1212 , and cursor control device 1214 may be combined into a single component or device (e.g., an LCD touch screen).
- Data storage device 1218 may include a computer-readable storage medium 1228 on which may be stored one or more sets of instructions 1225 that may include instructions for an embedding engine 260 and/or a malware identification engine 255 for carrying out the operations described herein, in accordance with one or more aspects of the present disclosure.
- Instructions 1225 may also reside, completely or at least partially, within main memory 1204 and/or within processing device 1202 during execution thereof by computing device 1200 , main memory 1204 and processing device 1202 also constituting computer-readable media.
- the instructions 1225 may further be transmitted or received over a network 1220 via network interface device 1208 .
- While computer-readable storage medium 1228 is shown in an illustrative example to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions.
- the term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform the methods described herein.
- the term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media.
- terms such as “generating,” “processing,” “aggregating,” “submitting,” “initiating,” or the like refer to actions and processes performed or implemented by computing devices that manipulate and transform data represented as physical (electronic) quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device memories or registers or other such information storage, transmission or display devices.
- the terms “first,” “second,” “third,” “fourth,” etc., as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation.
- Examples described herein also relate to an apparatus for performing the operations described herein.
- This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computing device selectively programmed by a computer program stored in the computing device.
- a computer program may be stored in a computer-readable non-transitory storage medium.
- Various units, circuits, or other components may be described or claimed as “configured to” or “configurable to” perform a task or tasks.
- the phrase “configured to” or “configurable to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation.
- the unit/circuit/component can be said to be configured to perform the task, or configurable to perform the task, even when the specified unit/circuit/component is not currently operational (e.g., is not on).
- the units/circuits/components used with the “configured to” or “configurable to” language include hardware, for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is “configured to” perform one or more tasks, or is “configurable to” perform one or more tasks, is expressly intended not to invoke 35 U.S.C. § 112(f) for that unit/circuit/component.
- “configured to” or “configurable to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. “Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks.
Abstract
A process tree embedding is generated corresponding to a process tree. The process tree comprises a plurality of processes. The process tree embedding is processed with a machine learning model to generate an identification of malware associated with the process tree. In some embodiments, processing the process tree embedding with the machine learning model to generate the identification of malware associated with the process tree includes: processing the process tree embedding with the machine learning model to generate a classification of the process tree as being associated with malware; and, responsive to the classification indicating that the process tree is associated with malware, generating the identification of a first process of the plurality of processes that is relevant to the classification of the process tree as being associated with malware.
Description
- Aspects of the present disclosure relate to detecting cybersecurity events, and more particularly, to detecting cybersecurity events through analysis of process trees.
- Malware is a term that refers to malicious software. Malware includes software that is designed with malicious intent to cause intentional harm and/or bypass security measures. Malware is used, for example, by cyber attackers to disrupt computer operations, to access and to steal sensitive information stored on the computer or provided to the computer by a user, or to perform other actions that are harmful to the computer and/or to the user of the computer. Malware may be formatted as executable files (e.g., COM or EXE files), dynamic link libraries (DLLs), scripts, steganographic encodings within media files such as images, and/or other types of computer programs, or combinations thereof. Malware authors or distributors frequently disguise or obfuscate malware in attempts to evade detection by malware-detection or -removal tools.
- The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the scope of the described embodiments.
- FIG. 1 is a block diagram that illustrates an example system, according to some embodiments of the present disclosure.
- FIG. 2A is a schematic block diagram illustrating an example of a process tree, in accordance with some embodiments of the present disclosure.
- FIG. 2B illustrates a schematic representation of the process tree of FIG. 2A.
- FIG. 3 is a flow diagram of a method of generating a malware classification and/or malware explanation, according to some embodiments of the present disclosure.
- FIG. 4 is a schematic block diagram illustrating an operation of the embedding engine to generate a process embedding, in accordance with some embodiments of the present disclosure.
- FIG. 5 is a schematic block diagram illustrating an operation of generating a process tree embedding, in accordance with some embodiments of the present disclosure.
- FIG. 6 is a schematic block diagram illustrating an operation of generating a malware classification and/or malware explanation, in accordance with some embodiments of the present disclosure.
- FIG. 7A is a block diagram illustrating an example training system for performing a machine learning operation based on process tree embeddings from process metadata, according to some embodiments of the present disclosure.
- FIG. 7B is a block diagram of a system incorporating a neural network model for generating a classification and/or explanation of a process tree embedding based on process metadata, according to some embodiments of the present disclosure.
- FIG. 8 is a flow diagram of another method of generating a malware classification and/or malware explanation, according to some embodiments of the present disclosure.
- FIG. 9 is a schematic block diagram illustrating an operation of generating a process tree embedding from process metadata, in accordance with some embodiments of the present disclosure.
- FIG. 10 is a flow diagram of a method of operating a malware detection system, according to some embodiments of the present disclosure.
- FIG. 11 is a component diagram of an example of a device architecture for malware detection, in accordance with embodiments of the disclosure.
- FIG. 12 is a block diagram of an example computing device that may perform one or more of the operations described herein, in accordance with embodiments of the disclosure.
- Modern computer systems are subject to a large number of potential malware attacks. Examples of malware include computer viruses, worms, Trojan horses, ransomware, rootkits, keyloggers, spyware, adware, rogue security software, potentially unwanted programs (PUPs), potentially unwanted applications (PUAs), and other malicious programs. To protect from such malware, users may install scanning programs which attempt to detect the presence of malware. These scanning programs may review programs and/or executables that exist on the computer's storage medium (e.g., a hard disk drive (HDD)) prior to execution of the file. However, authors and distributors of malware have taken countermeasures to avoid these scanning programs. In some cases, the malware is obfuscated to conceal the contents of the file. Obfuscation may include varying the contents of the file to misdirect, confuse, or otherwise conceal the true purpose and/or function of the code. For example, obfuscation may include inserting inoperable code within the executable instructions, compressing/encrypting the operating instructions, rearranging the code instructions, and/or other techniques. These techniques can make it difficult to identify malware in at-rest files.
- Given the many ways that malware can be hidden, mechanisms that simply scan files to determine whether malware is present face difficult and ever-changing challenges. Recognizing the many variations that a particular version of malware can take may require constant updates to a scanning program. To address some of these challenges, extended detection and response (XDR) techniques and capabilities have been developed. XDR techniques attempt to combine the telemetry data from multiple tools into a cohesive whole, correlating the telemetry data to identify potential threats. In an XDR environment, threats may be identified by more than just a scan, but also by how the underlying executable behaves or the type of traffic it generates. XDR connects data from otherwise-isolated security solutions to improve threat visibility and reduce the length of time required to identify and respond to an attack. XDR enables advanced forensic investigation and threat hunting capabilities across multiple domains from a single administrative interface. XDR techniques may include ingesting and normalizing volumes of data from endpoints (e.g., client computing devices), cloud workloads, identity, email, network traffic, virtual containers, and more. XDR techniques may then parse and correlate the data to automatically detect threats and respond to the threats, potentially prioritized by severity, so that threat hunters can quickly analyze and triage new events. In some cases, XDR techniques may attempt to automate investigation and response activities.
- As part of the operations of investigating potential malware, processes running on a client device may be analyzed. The processes may be organized within an operating system of the client device as a process tree. A process tree can be viewed as a graph that shows meaningful connections between processes. In analyzing a process tree, a whole call stack may be manually parsed to understand and explain certain behaviors. Depending on the size of the underlying process tree, manual parsing may prove infeasible (e.g., for trees with thousands of nodes) and unproductive, as an overview is usually sufficient rather than a dissection of each step in the execution.
- In addition, for behavioral models of threat detection, analysts may have to tag event data coming from the underlying devices. This often implies that the analysts have to parse through a large sample of events in order to get an accurate overview of an underlying computing environment. This task can prove to be very time-consuming and difficult to scale.
- The present disclosure addresses the above-noted and other deficiencies by providing an automated solution that can explain a process tree in natural language, to help analysts scale up operations and provide more informed answers to potential security events in a timely manner. Embodiments of the present disclosure provide an automated tool that can generate, in natural language, a high level explanation of the inner workings of a process tree. Embodiments of the present disclosure may reduce the amount of time and processing resources needed to identify a potential threat on a computing device.
- In some embodiments, a self-supervised embedding model is trained, and the resulting embeddings are leveraged to train an identification model that can explain, in natural language, the details associated with a process tree. Some embodiments of the present disclosure may create embeddings at the process level, and then create embeddings at the process tree level from the process-level embeddings. After the process tree representation has been constructed by training the self-supervised embedding, a mapping between process trees and explanations may be learned by leveraging various metadata (notes, patterns, tags) associated with the process trees as a supervision signal.
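The supervised mapping step above, learning from fixed tree embeddings to labels using analyst metadata as the supervision signal, can be sketched as a simple trainable classifier. The patent does not specify a model family; this pure-Python logistic-regression sketch is only an illustration:

```python
import math

def train_tree_classifier(embeddings, labels, epochs=200, lr=0.5):
    """Learn weights mapping fixed tree embeddings to a binary label.

    The labels stand in for a supervision signal derived from analyst
    metadata (notes, patterns, tags); logistic regression is an assumed,
    deliberately simple choice of model.
    """
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy embeddings: malicious trees cluster near [1, 0], benign near [0, 1]
X = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
y = [1, 1, 0, 0]
w, b = train_tree_classifier(X, y)
```

Because the embeddings are held fixed, the supervised stage only has to fit this small mapping, which is what makes sparse analyst annotations usable as training data.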
- The embodiments described herein provide improvements over some security mechanisms which rely on the detection of particular patterns in stored files. In sharp contrast, the embedding model described herein may be capable of determining features of a process tree (e.g., metadata and/or structure associated with processes of the process tree) that are indicative of an executing process that contains malware. These features may be identified, in some cases, regardless of attempts by an author of the malware to change its data signature. In this way, embodiments according to the present disclosure may provide an improved capability of detecting malware, and may increase the security of a computer system.
-
FIG. 1 is a block diagram that illustrates an example system 100, according to some embodiments of the present disclosure. FIG. 1 and the other figures may use like reference numerals to identify like elements. A letter after a reference numeral, such as “250A,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “250,” refers to all of the elements in the figures bearing that reference numeral. - As illustrated in
FIG. 1, the system 100 includes a first computing device 110 (also referred to herein as a detection computing device 110) and a second computing device 120 (also referred to herein as a client computing device 120). The detection computing device 110 and the client computing device 120 may each include hardware such as processing device 122 (e.g., processors, central processing units (CPUs)), memory 124 (e.g., random access memory (RAM)), storage devices 126 (e.g., hard-disk drives (HDDs), solid-state drives (SSDs), etc.), and other hardware devices (e.g., sound card, video card, etc.). - In some embodiments,
memory 124 may be volatile memory that loses contents when the power to the computing device is removed or non-volatile memory that retains its contents when power is removed. In some embodiments, memory 124 may be non-uniform memory access (NUMA), such that memory access time depends on the memory location relative to processing device 122. -
Processing device 122 may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. Processing device 122 may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. It should be noted that although, for simplicity, a single processing device 122 is depicted in each of the client computing device 120 and the detection computing device 110 in FIG. 1, other embodiments of the client computing device 120 and/or the detection computing device 110 may include multiple processing devices, storage devices, or other devices. - The
storage device 126 may comprise a persistent storage that is capable of storing data. A persistent storage may be a local storage unit or a remote storage unit. Persistent storage may be a magnetic storage unit, optical storage unit, solid state storage unit, electronic storage unit (main memory), or similar storage unit. Persistent storage may also be a monolithic/single device or a distributed set of devices. - The
detection computing device 110 and/or the client computing device 120 may comprise any suitable type of computing device or machine that has a programmable processor including, for example, server computers, desktop computers, laptop computers, tablet computers, smartphones, set-top boxes, etc. In some examples, the detection computing device 110 may comprise a single machine or may include multiple interconnected machines (e.g., multiple servers configured in a cluster). The detection computing device 110 and/or the client computing device 120 may be implemented by a common entity/organization or may be implemented by different entities/organizations. - The
detection computing device 110 and/or the client computing device 120 may be coupled to each other (e.g., may be operatively coupled, communicatively coupled, may communicate data/messages with each other) via network 102. Network 102 may be a public network (e.g., the internet), a private network (e.g., a local area network (LAN) or wide area network (WAN)), or a combination thereof. In one embodiment, network 102 may include a wired or a wireless infrastructure, which may be provided by one or more wireless communications systems, such as a WIFI™ hotspot connected with the network 102 and/or a wireless carrier system that can be implemented using various data processing equipment, communication towers (e.g., cell towers), etc. The network 102 may carry communications (e.g., data, message, packets, frames, etc.) between the detection computing device 110 and/or the client computing device 120. - The
client computing device 120 may execute an operating system 115. The operating system 115 of the client computing device 120 may manage the execution of other components (e.g., software, applications, etc.) and/or may manage access to the hardware (e.g., processors, memory, storage devices, etc.) of the client computing device 120. Operating system 115 may be software to provide an interface between the computing hardware (e.g., processing device 122 and/or storage device 126) and applications running on the operating system 115. -
Operating system 115 may include an OS kernel and a user space supporting the execution of one or more processes 210. The number of processes 210 illustrated in FIG. 1 is merely for purposes of explanation, and is not intended to limit the embodiments of the present disclosure. Operating system 115 may include several operating system functionalities, including but not limited to process management, hardware interfaces, access control, and the like. Examples of operating systems 115 include WINDOWS™, LINUX™, ANDROID™, IOS™, and MACOS™. Though not expressly illustrated in FIG. 1, the detection computing device 110 may also include an operating system, which may, in some embodiments, be different than that of the operating system 115 of the client computing device 120. - As illustrated in
FIG. 1, the client computing device 120 may execute (e.g., using processing device 122) the one or more processes 210. Process 210 may be a desktop application, a network application, a database application, or any other application that may be executed by the operating system 115. To be executed, the process 210 may be loaded from a process executable (e.g., in storage device 126) into memory 124. The process executable may be a file, for example, on the storage device 126 that contains executable instructions. The operating system 115 may allocate execution resources (e.g., processing device 122 and/or memory 124) to the process 210 (e.g., by a multi-tasking scheduler). The processing device 122 may execute the executable instructions of the process 210. - In some embodiments, the
processes 210 may execute within a tree hierarchy. As will be described further herein, a first process 210 may spawn a second process 210, which may further spawn other processes 210. The hierarchical relationship of the processes 210 may be represented by a process tree 220. The process tree 220 may illustrate the parent-child relationships within the processes 210. A first process 210 that spawns a second process 210 may be referenced as the parent of the second process 210, and the second process 210 may be referenced as the child of the first process 210. Similarly, if the second process 210 spawns a third process 210, the first process 210 may be referenced as the grandparent of the third process 210, and so on. - The
operating system 115 of the client computing device 120 may also execute a monitoring engine 215. The monitoring engine 215 may monitor the processes 210 executing on the client computing device 120. In some embodiments, the monitoring engine 215 may execute at an elevated authority. For example, the monitoring engine 215 may have administrative access that allows it to collect process metadata 250 for each of the processes 210 of a particular process tree 220. In some embodiments, the monitoring engine 215 may execute as part of, or an extension of, the operating system 115. As will be described further herein, the process metadata 250 may be provided to the detection computing device 110 to facilitate embodiments of the present disclosure. -
FIG. 2A is a schematic block diagram illustrating an example of a process tree 220, in accordance with some embodiments of the present disclosure. A description of elements of FIG. 2A that have been previously described will be omitted for brevity. FIG. 2A provides additional details on the structure of the process tree 220. - As illustrated in
FIG. 2A, the process tree 220 represents the execution of a plurality of processes 210 (e.g., within operating system 115) on a computing device, such as client computing device 120. In FIG. 2A, an example is illustrated in which eight processes (e.g., 210A, 210B, 210C, 210D, 210E, 210F, 210G, 210H) are shown in an example configuration of a process tree 220. The specific configuration of the processes 210 of the process tree 220 in the example of FIG. 2A is not intended to limit the embodiments of the present disclosure. - Referring to
FIG. 2A, a first process 210A may execute (e.g., within operating system 115). During execution, the first process 210A may spawn a second process 210B. As used herein, spawn may refer to an operation in which one process 210 begins execution of another process 210. Though illustrated as a single operation in FIG. 2A, this is only an example. In some embodiments of an operating system 115, spawning a child process 210 may be accomplished by a plurality of operations. - For example, in operating systems supporting aspects of the Portable Operating System Interface (POSIX) standard, spawning a
new process 210 may include first performing an operation typically called a fork, which creates a child process 210 as a copy of the parent process 210, including the instruction codes and memory space. In an operation implementing the fork process as a fork system call, the child process 210 may return from the system call in the same manner as the parent process 210 (e.g., within the copy of the parent instruction codes), and may continue executing from that point. - While the
child process 210 may continue to execute in this manner, the child process 210 may instead replace the instruction codes of the parent process 210 with a new set of instruction codes. As an example from a POSIX-compliant operating system, these operations may include a system call often known as an exec. As used herein, an exec operation refers to a function in an operating system, and/or provided by a system call that interfaces with the operating system, that operates to replace the instruction space of a process 210 with a new set of instruction codes. An example of an exec operation in the LINUX operating system is an operation performed by the kernel in response to an execv( ) system call. Thus, in a POSIX-compliant operating system (though not limited to POSIX-compliant operating systems), a parent process 210 that wishes to spawn a different program/application will first perform a fork, and then the child process 210 may perform an exec of the different program/application. - Referring to
FIG. 2A, the first process 210A may spawn the second process 210B. While the first process 210A is executing, the monitoring engine 215 may collect first process metadata 250A from the first process 210A. The first process metadata 250A may include a number of particular data values that correspond to information associated with the first process 210A. - The
second process 210B may continue to execute after being spawned by the first process 210A. As with the first process 210A, the monitoring engine 215 may collect second process metadata 250B from the second process 210B. During execution, the second process 210B may spawn a third process 210C and a fourth process 210D. The third process 210C may spawn a fifth process 210E and a sixth process 210F. The fourth process 210D may spawn a seventh process 210G, which may subsequently spawn an eighth process 210H. -
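The fork-then-exec spawn sequence described above can be sketched with the POSIX wrappers in the Python standard library. This is a minimal illustration assuming a POSIX system; `spawn` is a hypothetical helper name, not an element of the disclosure.

```python
import os

def spawn(argv):
    """Sketch of the fork-then-exec spawn sequence: the child is first
    created as a copy of the parent, then replaces its copied
    instruction codes with a new program image via exec."""
    pid = os.fork()                   # both processes return from fork()
    if pid == 0:
        # Child: replace the copied instruction space with a new program.
        os.execv(argv[0], argv)
        os._exit(127)                 # reached only if execv() fails
    # Parent: wait for the child and return its exit code.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

For example, `spawn(["/bin/sh", "-c", "exit 7"])` forks, execs the shell in the child, and returns the child's exit code 7 to the parent. -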
FIG. 2B illustrates a schematic representation of the process tree 220 of FIG. 2A. FIG. 2B illustrates the hierarchical nature of the process tree 220. For example, based on the example scenario of FIG. 2A, the first process 210A is the parent of the second process 210B. The second process 210B is the child of the first process 210A and the parent of the third process 210C and the fourth process 210D. The first process 210A is the grandparent of the third process 210C and the fourth process 210D. The fifth process 210E and the sixth process 210F are children of the third process 210C and grandchildren of the second process 210B. The seventh process 210G is the child of the fourth process 210D and the parent of the eighth process 210H. As used herein, processes 210 that are hierarchically above a target process 210 in the process tree 220 may be referred to as ancestors of the target process 210. For example, in FIG. 2B, processes 210G, 210D, 210B, and 210A may be referred to as ancestors of process 210H. - In some embodiments, the
process tree 220 may be viewed as a graph that provides information about the executions of the processes 210 within it and the connections between the processes 210. In some cases, analysis for malware may utilize the process tree 220 to detect potential fingerprints of harmful activity. Such analysis may examine a target process 210, its parent process 210, and its grandparent process 210. As an example utilizing FIG. 2B, analysis in a Windows operating system environment may detect the process 210H (see FIGS. 2A and 2B) that was executed having a command line of regsvr32.exe. Based on this, the parent process 210G of the target process 210H may be analyzed, which may be, for example, powershell.exe. The parent process 210D of powershell.exe may be analyzed, which may be, for example, wscript.exe, which may be determined to be malware that was executed by an infected application. By examining the process tree 220 of a given suspect target process 210H, a relevant process 210D (in this case, wscript.exe) may be detected, which may be determined to be a relevant cause of the eventual execution of the target process 210H. - In some embodiments, similar types of malware may have
similar process trees 220, even if the contents of the processes 210 change. For example, similar types of malware may be spawned in a similar series of operations. Embodiments of the present disclosure may collect information about the processes 210 and the structure of the process tree 220. This information may be analyzed to determine an operational state of the process tree 220 and, in some embodiments, determine whether a particular configuration of processes 210 of a process tree 220 may be associated with malware. - Referring back to
FIG. 2A, to analyze the process tree 220, process metadata 250 may be collected for each of the processes 210. In some embodiments, the process metadata 250 may be collected by the monitoring engine 215. For example, first process metadata 250A may be collected for the first process 210A, second process metadata 250B may be collected for the second process 210B, third process metadata 250C may be collected for the third process 210C, fourth process metadata 250D may be collected for the fourth process 210D, fifth process metadata 250E may be collected for the fifth process 210E, sixth process metadata 250F may be collected for the sixth process 210F, seventh process metadata 250G may be collected for the seventh process 210G, and eighth process metadata 250H may be collected for the eighth process 210H. - In some embodiments, the
process metadata 250 may be collected (e.g., by the monitoring engine 215) when the associated process 210 is spawned. In some embodiments, the process metadata 250 may be collected at particular operation points during the execution of the process 210. For example, process metadata 250 may be collected when the process 210 is first spawned. The process metadata 250 may be collected again and/or updated when other operations are performed by the process 210, such as a filesystem access, a screen capture, spawning another process, accessing a network, or other operation. For operating systems 115 that support fork and exec system calls, process metadata 250 may be collected when the process 210 is first forked and again when the process 210 performs an exec system call. - The process metadata 250 may include information related to the associated
process 210. Examples of information of the process metadata 250 include, but are not limited to, an operating system process identifier (ID) of the process 210, a unique generated process ID (UPID) of the process 210 (UPIDs are described, for example, in U.S. patent application Ser. No. 18/081,144, filed on Dec. 14, 2022, and U.S. patent application Ser. No. 18/081,149, filed on Dec. 14, 2022), an operating system process ID of the parent of the process 210, a UPID of the parent of the process 210, an operating system process ID of the grandparent of the process 210, a UPID of the grandparent of the process 210, a filename (e.g., a full path) of an image (e.g., an executable process image stored in storage device 126) of the process 210, the command line used to create the process 210, the image filename of the parent process 210, the command line of the parent process 210, the image filename of the grandparent process 210, the command line of the grandparent process 210, an identification of the action that caused the generation of the process metadata 250 (e.g., the identification of the operation, such as an exec, a screenshot, access of the storage device 126, etc.), the name of the action that caused the generation of the process metadata 250, a uniform resource locator (URL) associated with the process 210, and the like. The provided list includes examples of process metadata 250, but is not intended to limit the embodiments of the present disclosure. Additional elements of process metadata 250 associated with a particular process 210 are contemplated. - Referring back to
FIG. 1, the monitoring engine 215 may transmit the process metadata 250 to the detection computing device 110. For example, the monitoring engine 215 may transmit the process metadata 250 over the network 102 connecting the client computing device 120 to the detection computing device 110. The detection computing device 110 may utilize the process metadata 250 to analyze the process trees 220 of the client computing device 120. - For example, a
malware identification engine 255 of the detection computing device 110 may analyze the process metadata 250 to determine a classification 254 and/or an explanation 252 associated with the process metadata 250. In some embodiments, the classification 254 may be a determination as to whether malware is present on the client computing device 120, where the malware is associated with the process metadata 250. - In some embodiments, the
explanation 252 may be an identification of which process 210 or processes 210 of a given process tree 220 are associated with malware. In some embodiments, the explanation 252 may include an identification of which process 210 of the process tree 220 is most relevant in the determination of malware. For example, the explanation 252 may identify that a particular process 210 is performing harmful activities on the client computing device 120, but that another ancestor process 210 (e.g., a grandparent process 210 or great-grandparent process 210) is also infected and is likely the root cause of the malware. In some embodiments, the explanation 252 may be generated as natural language text suitable for display to a user, administrator, and/or analyst. -
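The ancestor analysis described above (e.g., walking from regsvr32.exe up through powershell.exe to wscript.exe) reduces to following parent links upward through the process tree 220. A minimal sketch using the parent-child relationships of FIGS. 2A and 2B; the parent map and function names are illustrative, not part of the disclosure.

```python
# Hypothetical child -> parent map for the process tree of FIGS. 2A/2B.
PARENTS = {
    "210B": "210A", "210C": "210B", "210D": "210B",
    "210E": "210C", "210F": "210C", "210G": "210D", "210H": "210G",
}

def ancestors(process_id, parents):
    """Walk upward from a target process, collecting its parent,
    grandparent, and so on, up to the root of the process tree."""
    chain = []
    while process_id in parents:
        process_id = parents[process_id]
        chain.append(process_id)
    return chain
```

Here `ancestors("210H", PARENTS)` yields `["210G", "210D", "210B", "210A"]`, matching the ancestors of process 210H identified above. -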
FIG. 3 is a flow diagram of a method 300 of generating a malware classification 254 and/or malware explanation 252, according to some embodiments of the present disclosure. A description of elements of FIG. 3 that have been previously described will be omitted for brevity. Method 300 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, the method 300 may be performed by a computing device (e.g., detection computing device 110). - With reference to
FIG. 3, method 300 illustrates example functions used by various embodiments. Although specific function blocks (“blocks”) are disclosed in method 300, such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 300. It is appreciated that the blocks in method 300 may be performed in an order different than presented, and that not all of the blocks in method 300 may be performed. - Referring simultaneously to
FIG. 3 and FIGS. 1, 2A, and 2B as well, the method 300 begins at block 310, in which process metadata 250 is received from processes 210 of a process tree 220. As described herein, the process metadata 250 may be collected (e.g., from a client computing device 120) by a monitoring engine 215 and transmitted to the detection computing device 110 over the network 102. In some embodiments, the detection computing device 110 may store the process metadata 250 in a process metadata store 285 (e.g., in storage device 126). - At
block 320, one or more process metadata 250 may be utilized to generate one or more process embeddings 262 from the process metadata 250. In some embodiments, the one or more process embeddings 262 may be generated by an embedding engine 260 (see FIG. 1) of the detection computing device 110. - The embedding
engine 260 may be utilized to generate embeddings for data, words, sentences, or documents. Embedding may refer to the process of taking a data element, such as a text string and/or other data, and producing a vector of numbers for it. In other words, the original data element is “embedded” into the new multi-dimensional (embedding) space. The generated vectors (also referred to herein as embeddings) are not random/arbitrary. Instead, when entities are embedded, the points associated with the embeddings represented in the multi-dimensional space are close if the entities are similar and/or related. - Referring to
FIG. 1, the embedding engine 260 may utilize an embedding model 265. In some embodiments, the embedding model 265 may be, or include, a large language model (LLM), though the embodiments of the present disclosure are not limited to such a configuration. In some embodiments, the embedding model 265 may be a transformer-based neural network that is capable of managing tabular data. The embedding model 265 may be trained on vast amounts of text data using unsupervised learning techniques. During the training process, the embedding model 265 may learn, for example, to predict the next word in a sentence based on the context provided by the preceding words. Similarly, the embedding model 265 may learn during training the relationships between data in datasets, such as tabular data. This process enables the embedding model 265 to develop a rich understanding of the relationships between words and the contextual nuances of language and associated data. - In some embodiments, text utilized to train the embedding
model 265 may include text available online, such as text on web pages, postings, and the like, but the embodiments of the present disclosure are not limited to such a configuration. In some embodiments, the embedding model 265 may be trained on process-specific contents, such as those included in operating systems 115. The embedding model 265 may maintain its training state, for example, in storage 126, which can be utilized when the embedding model 265 is operated. -
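The property noted above — that related entities embed to nearby points in the multi-dimensional space — is commonly quantified with cosine similarity. A minimal sketch with fabricated three-dimensional vectors (real embeddings produced by a model such as the embedding model 265 would be high-dimensional):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: near 1.0 for
    similar entities, lower for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Fabricated toy embeddings: "dog" and "cat" point in similar
# directions, "car" does not.
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]
assert cosine_similarity(dog, cat) > cosine_similarity(dog, car)
```

The assertion checks that the similar pair scores higher than the dissimilar pair. -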
FIG. 4 is a schematic block diagram illustrating an operation of the embedding engine 260 to generate the process embedding 262, in accordance with some embodiments of the present disclosure. A description of elements of FIG. 4 that have been previously described will be omitted for brevity. - The embedding
engine 260 may be configured to take as input the process metadata 250 and generate a corresponding process embedding 262. For example, the embedding engine 260 may process a first process metadata 250A and generate (e.g., utilizing the embedding model 265) a first process embedding 262A. In a similar manner, second, third, up to Nth process metadata 250 may be processed by the embedding engine 260 to generate second, third, up to Nth process embeddings 262. The number of process metadata 250 and process embeddings 262 illustrated in FIG. 4 is merely an example, and is not intended to limit the embodiments of the present disclosure. - The generated
process embeddings 262 for a given input of process metadata 250 are numerical representations (e.g., vectors) that encode semantic and syntactic properties of the language represented by the input. The process embeddings 262 may be high-dimensional vectors, where the dimensions capture different aspects of the language and/or data of the process metadata 250. The process embeddings 262 produced by the embedding engine 260 may have several desirable properties. First, the process embeddings 262 may capture semantic similarity, meaning that similar words or phrases are represented by vectors that are close to each other in the embedding space. For example, the embeddings of “dog” and “cat” would be closer together than the embeddings of “dog” and “car.” This property allows for tasks like word similarity measurement or finding related words based on the vectors of the process embeddings 262. - Second, the process embeddings 262 may capture contextual information. Since the embedding
model 265 is trained on vast amounts of data and/or text, it may programmatically learn to understand the meaning of data and/or words based on their surrounding context. This enables the process embeddings 262 to reflect the meaning of data within the process metadata 250. Furthermore, the embedding engine 260 may generate process embeddings 262 by aggregating the embeddings of individual portions of the process metadata 250. This allows for understanding the overall meaning and semantic compositionality of longer metadata portions. - The embedding
engine 260 of the detection computing device 110 may generate a process embedding 262 for each of the process metadata 250 and store the results in the storage device 126. As will be described further herein, the process embeddings 262 may be utilized to classify potential malware executing on the client computing device 120. - Referring back to
FIG. 3, the operations of the method 300 may continue with block 330, in which a process tree embedding 264 is generated based on the process embeddings 262. To generate the process tree embedding 264, the embedding engine 260 may aggregate and/or combine the process embeddings 262 associated with a given process tree 220. -
FIG. 5 is a schematic block diagram illustrating an operation of generating a process tree embedding 264, in accordance with some embodiments of the present disclosure. A description of elements of FIG. 5 that have been previously described will be omitted for brevity. - Referring to
FIG. 5, to generate the process tree embedding 264 for a particular process tree 220, the process embeddings 262 that are associated with the processes 210 of the process tree 220 may be aggregated and/or combined. For example, in FIG. 5, an example is illustrated in which four process embeddings 262 (262A, 262B, 262C, 262D) are aggregated for a particular process tree 220. The process embeddings 262A, 262B, 262C, 262D may be associated with the processes 210 of the process tree 220 (e.g., as executed on the operating system 115 of the client computing device 120, as illustrated in FIGS. 1, 2A, and 2B). - The process embeddings 262A, 262B, 262C, 262D may be processed by an
aggregation operation 510. In some embodiments, the aggregation operation 510 may perform neighborhood aggregation on the various vectors of the process embeddings 262 associated with the process tree 220. The aggregation operation 510 may be performed in a variety of different ways. For example, in some embodiments, the process tree embedding 264 may be constructed by averaging the coordinates of the vectors of the process embeddings 262 (e.g., the process embeddings 262A, 262B, 262C, 262D) of the process tree 220. This is merely an example, and other forms of the aggregation operation 510 may be utilized without deviating from the embodiments of the present disclosure. For example, in some embodiments, the coordinates of the vectors of the process embeddings 262 may be summed to generate the process tree embedding 264. In some embodiments, the process tree embedding 264 may be formed by taking the maximum value for each coordinate of the respective vectors of the process embeddings 262. Other forms of combining the process embeddings 262 to generate the process tree embedding 264 are contemplated. For example, in some embodiments, the process embeddings 262 may be combined using a machine learning model, such as a neural network. For example, the machine learning model may be trained based on known combinations of processes 210 and/or process trees 220 to aggregate the various coordinates of the vectors comprising the process embeddings 262 to generate the process tree embedding 264. - Referring back to
FIG. 3 and FIG. 1, the operations of the method 300 may continue with operation 340, in which a classification 254 and/or explanation 252 for the process tree 220 are generated based on the process tree embedding 264. In some embodiments, the classification 254 and/or the explanation 252 may be generated by a malware identification engine 255 of the detection computing device 110. -
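The aggregation operation 510 described above can be sketched as coordinate-wise pooling over the process embeddings 262 of a process tree 220 — averaging, summing, or taking the maximum of each coordinate. `aggregate` is an illustrative name; a learned (e.g., neural network) combiner, as also contemplated above, is not shown.

```python
def aggregate(embeddings, mode="mean"):
    """Combine same-length process embedding vectors coordinate by
    coordinate into a single process tree embedding."""
    reducers = {
        "mean": lambda col: sum(col) / len(col),
        "sum": sum,
        "max": max,
    }
    reduce_fn = reducers[mode]
    # zip(*embeddings) groups the i-th coordinate of every vector.
    return [reduce_fn(col) for col in zip(*embeddings)]
```

For example, `aggregate([[1.0, 2.0], [3.0, 4.0]])` returns `[2.0, 3.0]`, while `mode="max"` returns `[3.0, 4.0]`. -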
FIG. 6 is a schematic block diagram illustrating an operation of generating a malware classification 254 and/or malware explanation 252, in accordance with some embodiments of the present disclosure. A description of elements of FIG. 6 that have been previously described will be omitted for brevity. - The
classification 254 may provide a determination as to whether the process tree 220 is associated with malware, or another negative operating environment. Referring to both FIG. 6 and FIG. 1, to generate the classification 254, the process tree embedding 264 may be processed by the malware identification engine 255. In some embodiments, the malware identification engine 255 may process the process tree embedding 264 utilizing an identification model 275. In some embodiments, the identification model 275 may be a neural network model based on machine learning. - In some embodiments, the
malware identification engine 255 may also generate an explanation 252. The explanation 252 may be an identification of which processes 210 of the process tree 220 are relevant in the determination of the malware classification 254. For example, the malware identification engine 255 may determine that a process tree embedding 264 is associated with malware. Thus, the classification 254 may identify that a particular process tree 220 is associated with malware, and the explanation 252 may identify why the process tree 220 is associated with malware, as well as the relevant portions of the process tree 220 (e.g., which processes 210) that contributed to the classification 254. In some embodiments, the explanation 252 may also be generated based on the identification model 275. - Referring to
FIG. 6, the malware identification engine 255 may take the process tree embedding 264 as input. The malware identification engine 255 may analyze the process tree embedding 264 utilizing the identification model 275. The identification model 275 may be a machine learning model trained on a plurality of process tree embeddings 264. Each of the process tree embeddings 264 may have a known classification 254 and a known explanation 252. By processing the plurality of process tree embeddings 264 (including their known classifications 254 and known explanations 252) through machine learning, relationships may be established between different aspects of the process tree embeddings 264 and the classifications 254 and/or explanations 252 to generate the identification model 275. -
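The training relationship described above — labeled process tree embeddings 264 mapped to known classifications 254 — can be illustrated with a deliberately simple stand-in for the identification model 275: a nearest-centroid rule rather than the neural network the disclosure contemplates. All names and data here are illustrative.

```python
def centroid(vectors):
    """Coordinate-wise mean of a set of embedding vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def train(labeled_embeddings):
    """Fit one centroid per known classification from
    (tree embedding, label) pairs."""
    by_label = {}
    for embedding, label in labeled_embeddings:
        by_label.setdefault(label, []).append(embedding)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def classify(model, embedding):
    """Assign the label of the nearest class centroid."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(model, key=lambda label: dist2(model[label], embedding))
```

A tree embedding that lands near the malware centroid is classified as malware; a real identification model 275 would learn a far richer decision boundary. -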
FIG. 7A is a block diagram illustrating an example training system for performing a machine learning operation based on process tree embeddings 264 from process metadata 250, according to some embodiments of the present disclosure. A description of elements of FIG. 7A that have been previously described will be omitted for brevity. - Referring to
FIGS. 1 and 7A, a system 700A for performing a machine learning operation may include learning operations 730 which perform a feedback-controlled comparison between a training dataset 720 and a testing dataset 725 based on the process tree embeddings 264. In some embodiments, the system 700A may be implemented by a classification training engine 270 of the detection computing device 110, as illustrated in FIG. 1. However, the embodiments of the present disclosure are not limited to such a configuration. In some embodiments, an identification model 275 may be pre-trained and provided to the detection computing device 110. - In some embodiments, the
process tree embeddings 264, generated from the process metadata 250 as described herein, may be combined with a training classification value 754 and/or a training explanation value 752 to generate process tree-specific input data 707. More specifically, the process tree embeddings 264 from a particular process tree 220 may be combined with a training classification value 754 and/or a training explanation value 752 for the same process tree 220. The training classification value 754 for the process tree 220 may identify whether the process tree 220 contains or is associated with malware, and the training explanation value 752 for the process tree 220 may identify the underlying basis for the malware classification and which of the processes 210 of the process tree 220 are most relevant to the malware classification. In some embodiments, as part of training the identification model 275, particular process trees 220 with known classifications (e.g., it is known whether the process tree 220 contains or is associated with malware) may be collected, and process tree embeddings 264 may be formed from process metadata 250 associated with processes 210 of the process tree 220 with known malware classifications and/or known explanations for the malware. The known classification and the known explanation of a given process tree 220 may be used as the training classification value 754 and/or the training explanation value 752, and combined with the process tree embedding 264 to form the process tree-specific input data 707 for that process tree 220. - In some embodiments,
process metadata 250 may be collected from a process 210 that is part of a process tree 220 that is known to contain or be associated with malware. Thus, a training classification value 754 of the known-bad process tree 220 may be generated indicating that the process tree 220 is associated with malware, and a training explanation value 752 of the known-bad process tree 220 may be generated identifying which portion of the process tree 220 is contributing to the training classification value 754. A set of process tree embeddings 264 may be generated from the process metadata 250 (as described herein with respect to FIGS. 4 and 5). The set of process tree embeddings 264 may be combined with the training classification value 754 (e.g., malware) and/or the training explanation value 752 to generate the process tree-specific input data 707 for that process tree 220. - Similarly,
process metadata 250 may be collected from processes 210 of a process tree 220 that is known to be free of malware. Thus, a training classification value 754 and/or a training explanation value 752 of the known-good process tree 220 may be generated indicating that the process tree 220 is free of malware. A process tree embedding 264 may be generated from the processes 210 of the process tree 220 as described herein. The process tree embedding 264 may be combined with a training classification value 754 (e.g., malware-free) and/or a training explanation value 752 to generate the process tree-specific input data 707 for that process tree 220. - In this way, process tree-
specific input data 707 may be generated for each process tree 220 used for training the identification model 275. The process tree-specific input data 707 may be separated into two groups: a training dataset 720 and a testing dataset 725. Each group of the training dataset 720 and the testing dataset 725 may include process tree-specific input data 707 (e.g., process tree embeddings 264 and their associated training classification value 754 and/or associated training explanation value 752) for a plurality of process trees 220. -
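The separation into a training dataset 720 and a testing dataset 725 feeds a fit-then-evaluate cycle, sketched below. `fit` stands in for the learning operations, and the feedback step is simplified to a refit rather than the parameter adjustment the disclosure describes; all names are illustrative.

```python
import random

def evaluate(model_fn, dataset):
    """Fraction of (tree embedding, known classification) records that
    the candidate model predicts correctly."""
    hits = sum(1 for embedding, label in dataset if model_fn(embedding) == label)
    return hits / len(dataset)

def train_until_acceptable(records, fit, threshold=0.9, max_rounds=10):
    """Split labeled records into training and testing groups, fit on
    the training group, and accept the model once it predicts the
    testing group acceptably; otherwise feed back and retry."""
    records = list(records)
    random.shuffle(records)
    cut = int(len(records) * 0.8)          # e.g., an 80/20 split
    training, testing = records[:cut], records[cut:]
    model_fn = None
    for _ in range(max_rounds):
        model_fn = fit(training)           # the learning step
        if evaluate(model_fn, testing) >= threshold:
            break                          # accept the candidate model
        # Feedback: a real system would adjust parameters here.
    return model_fn
```

`fit` is any caller-supplied routine that returns a callable classifier; the accepted model can then predict classifications for production embeddings. -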
Learning operations 730 may be performed on the training dataset 720. The learning operations 730 may examine the process tree embeddings 264 to establish a relationship between the elements of the process tree embeddings 264 that accurately predicts the classification value 754 (e.g., malware or not malware) and/or the explanation value 752 for a given process tree 220. The learning operations 730 may generate an ML training model 765 that represents the determined relationship. The ML training model 765 may take a process tree embedding 264 as input, and output a classification value (e.g., malware or non-malware) and/or the explanation value for the process tree 220 associated with the process tree embedding 264. The learning operations 730 may attempt to adjust parameters 735 of the ML training model 765 to generate a best-fit algorithm that describes a relationship between the process tree embedding 264 and the training classification value 754 and/or the training explanation value 752 for all of the process trees 220 of the training dataset 720. A set of parameters 735 may be selected based on the training dataset 720 and preliminarily established as the ML training model 765. - The results of the learning
operations 730 may be provided to an evaluation operation 740. The evaluation operation 740 may utilize the ML training model 765 generated by the learning operations 730 (based on the training dataset 720) to see if the ML training model 765 correctly predicts the training classification value 754 and/or the training explanation value 752 for the process tree embeddings 264 of the testing dataset 725. If the ML training model 765 accurately predicts the training classification values 754 and/or the training explanation values 752 of the testing dataset 725, it may be promoted to the identification model 275. If the ML training model 765 does not accurately predict the training classification values 754 and/or the training explanation values 752 of the testing dataset 725, feedback 712 may be provided to the learning operations 730, and the learning operations 730 may be repeated, with additional adjustment of the parameters 735. This process of learning operations 730 and evaluation operation 740 may be repeated until an acceptable identification model 275 is generated that is capable of accurately predicting the training classification values 754, the training explanation values 752, and/or combinations of the training classification values 754 and the training explanation values 752. - Once the
identification model 275 is generated, it may be used to predict classifications 254 and/or explanations 252 for production process tree embeddings 264. For example, for a given process 210, process metadata 250 may be generated, as described herein. A process tree embedding 264 may be generated in a manner similar to that discussed herein (e.g., with respect to FIGS. 4 and 5). For example, a process embedding 262 may be generated for each process 210 of a process tree 220, and the process tree embedding 264 may be generated from the process embeddings 262. - As illustrated in
FIG. 7A, the process tree embedding 264 may be provided to the identification model 275. The operations of the identification model 275 may generate the classification 254 (e.g., whether or not the process tree 220 associated with the production process tree embedding 264 contains and/or is associated with malware) and/or the explanation 252 (e.g., which processes 210 of the process tree 220 are relevant to the classification of the process tree 220). -
FIG. 7B is a block diagram of a system 700B incorporating a neural network model 790 for generating a classification 254 and/or an explanation 252 of a process tree embedding 264 based on process metadata 250, according to some embodiments of the present disclosure. In some embodiments, the system 700B may be implemented by the classification training engine 270 of the detection computing device 110, as illustrated in FIG. 1. However, the embodiments of the present disclosure are not limited to such a configuration. In some embodiments, the neural network model 790 may be pre-trained and provided as the identification model 275 to the detection computing device 110. - Referring to
FIG. 7B, the neural network model 790 includes an input layer having a plurality of input nodes I1 to IN, a sequence of neural network layers (layers 1 to Z are illustrated in FIG. 7B) each including a plurality (e.g., 1 to X in FIG. 7B) of weight nodes, and an output layer including at least one output node. In the particular non-limiting example of FIG. 7B, the input layer includes input nodes I1 to IN (where N is any plural integer). A first one of the sequence of neural network layers includes weight nodes N1L1 (where “1L1” refers to a first weight node on layer one) to NXL1 (where X is any plural integer). A last one (“Z”) of the sequence of neural network layers includes weight nodes N1LZ (where Z is any plural integer) to NYLZ (where Y is any plural integer). The output layer includes a plurality of output nodes O1 to OM (where M is any plural integer). - The
neural network model 790 can be operated to process elements of the process tree embedding 264 through different inputs (e.g., input nodes I1 to IN) to generate one or more outputs (e.g., output nodes O1 to OM). The elements of the process tree embedding 264 that can be simultaneously processed through different input nodes I1 to IN may include, for example, statistical values (e.g., minimum, maximum, average, and/or standard deviation) of axes of an embedding space based on the process tree embedding 264. The classification 254 and/or the explanation 252 that can be output (e.g., through output nodes O1 to OM) may include an indication of whether the process tree 220 associated with the process tree embedding 264 is associated with malware and/or the processes 210 of the process tree 220 that may contribute to that classification. - During operation and/or training of the
neural network model 790, the various weights of the neural network layers may be adjusted based on a comparison of the predicted process classification 254 and/or predicted explanation 252 to data of an actual classification and/or explanation (such as the training classification value 754 and/or training explanation value 752). The comparison may be performed, for example, through the use of a loss function. The loss function provides a mechanism to quantify how poorly the training model is performing by comparing the model's predictions with the actual values it is expected to output. The interconnected structure between the input nodes, the weight nodes of the neural network layers, and the output nodes may cause a given element of the process tree embedding 264 to influence the classification prediction 254 and/or the explanation prediction 252 generated for all of the other elements of the process tree embedding 264 that are simultaneously processed. The classification prediction 254 and/or the explanation prediction 252 generated by the neural network model 790 may thereby identify a comparative prioritization of which of the elements of the process tree embedding 264 provide a higher/lower impact on the classification 254 as to whether the associated process tree 220 is, or is not, associated with malware and/or the explanation prediction 252 as to which processes 210 of the process tree 220 contribute to that classification 254. - The
neural network model 790 of FIG. 7B is an example that has been provided for ease of illustration and explanation of one embodiment. Other embodiments may include any non-zero number of input layers having any non-zero number of input nodes, any non-zero number of neural network layers having a plural number of weight nodes, and any non-zero number of output layers having any non-zero number of output nodes. The number of input nodes can be selected based on the number of input values that are to be simultaneously processed, and the number of output nodes can be similarly selected based on the number of output characteristics that are to be simultaneously generated therefrom. - Referring back to
FIG. 3, at block 350, once the classification 254 and/or explanation 252 have been generated, an identification of malware may be made based on the classification 254 and/or explanation 252. For example, if the classification 254 indicates that the process tree 220 is associated with malware, the explanation 252 may be utilized to identify a process 210 of the process tree 220 that is relevant to the classification of malware. The identification of the process 210 relevant to the classification of malware for the process tree 220 may allow for remediation actions to be taken with respect to the relevant process 210. - The
method 300 illustrates that process embeddings 262 may be generated from process metadata 250, and the process tree embeddings 264 may be generated from the process embeddings 262. However, the embodiments of the present disclosure are not limited to such a configuration. -
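One illustrative way to aggregate per-process embeddings 262 into a process tree embedding 264 — consistent with the per-axis statistics (minimum, maximum, average, and/or standard deviation) mentioned above with respect to FIG. 7B — is sketched below. The function is an assumption for illustration, not the disclosed implementation.

```python
import math

def aggregate_process_embeddings(process_embeddings):
    """Aggregate per-process embeddings (equal-length numeric vectors) into a
    single process tree embedding: for each axis of the embedding space, the
    [min, max, mean, std] across the tree's processes are concatenated."""
    if not process_embeddings:
        raise ValueError("process tree has no process embeddings")
    tree_embedding = []
    for axis_values in zip(*process_embeddings):  # iterate axis-by-axis
        n = len(axis_values)
        mean = sum(axis_values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in axis_values) / n)
        tree_embedding.extend([min(axis_values), max(axis_values), mean, std])
    return tree_embedding
```

Because the statistics are order-invariant, the resulting tree embedding does not depend on the order in which the processes of the tree are enumerated.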
FIG. 8 is a flow diagram of another method 800 of generating a malware classification 254 and/or malware explanation 252, according to some embodiments of the present disclosure. A description of elements of FIG. 8 that have been previously described will be omitted for brevity. Method 800 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, the method 800 may be performed by a computing device (e.g., detection computing device 110). - With reference to
FIG. 8, method 800 illustrates example functions used by various embodiments. Although specific function blocks (“blocks”) are disclosed in method 800, such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 800. It is appreciated that the blocks in method 800 may be performed in an order different than presented, and that not all of the blocks in method 800 may be performed. - Referring to
FIG. 8, as well as the prior figures, the method 800 begins at block 810, in which process metadata 250 is received from processes 210 of a process tree 220. The receipt of the process metadata 250 may be similar to the block 310 described herein with respect to FIG. 3 and, as such, a duplicate description thereof will be omitted. - At
operation 820, a process tree embedding 264 is generated based on the process metadata 250. To generate the process tree embedding 264, the embedding engine 260 may take as input the process metadata 250. Operation 820 may generate the process tree embedding 264 directly from the process metadata 250, rather than generating the intermediate process embeddings 262, as described herein with respect to FIG. 3. -
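A minimal sketch of this single-pass generation follows; the `embed_text` function is a hypothetical stand-in for the embedding engine 260 and its embedding model 265 (e.g., an LLM encoder), and the toy hash-style encoding is purely illustrative.

```python
def embed_text(text):
    """Hypothetical stand-in for the embedding engine 260 / embedding model 265:
    maps text to a fixed-width numeric vector (a toy character-sum encoding)."""
    vec = [0.0] * 8
    for i, ch in enumerate(text):
        vec[i % 8] += ord(ch) / 1000.0
    return vec

def tree_embedding_from_metadata(process_metadata_records, tree_metadata=None):
    """Generate a process tree embedding directly from the metadata of all
    processes of the tree, skipping the intermediate per-process embeddings."""
    parts = list(process_metadata_records)
    if tree_metadata:                 # optional tree-level metadata, if provided
        parts.append(tree_metadata)
    return embed_text("\n".join(parts))
```

Concatenating the records before embedding gives the model visibility into the whole tree at once, which is the source of the additional context discussed below.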
FIG. 9 is a schematic block diagram illustrating an operation of generating a process tree embedding 264 from process metadata 250, in accordance with some embodiments of the present disclosure. A description of elements of FIG. 9 that have been previously described will be omitted for brevity. - Referring to
FIG. 9, to generate the process tree embedding 264 for a particular process tree 220, the process metadata 250 associated with the processes 210 of the process tree 220 may be provided to the embedding engine 260. For example, FIG. 9 illustrates an example in which four process metadata 250 (250A, 250B, 250C, 250D) are collected for a particular process tree 220. The process metadata 250A, 250B, 250C, 250D may be processed by the embedding engine 260 to generate the process tree embedding 264. As previously described, the generated process tree embedding 264 for a given set of process metadata 250 is a numerical representation that encodes semantic and syntactic properties of the language represented by the input (e.g., the process metadata 250). - In the
method 800, the process tree embedding 264 may be directly generated from the process metadata 250. In some embodiments, the process metadata 250 may be concatenated or otherwise combined when provided to the embedding engine 260. In some embodiments, additional process tree metadata 950 may be provided to the embedding engine 260 along with the process metadata 250. The process tree metadata 950 may include information related to the process tree 220, such as the number of processes 210, details on the execution history of the process tree 220, tags associated with the process tree 220, statistics associated with the process tree 220, and the like. In some embodiments, the process tree metadata 950 may be omitted. - Referring back to
FIG. 8 and FIG. 1, the operations of the method 800 may continue with operation 830, in which a classification 254 and/or explanation 252 for the process tree 220 are generated based on the process tree embedding 264. In some embodiments, the classification 254 and/or the explanation 252 may be generated by a malware identification engine 255 of the detection computing device 110. The generation of the classification 254 and/or explanation 252 for the process tree 220 may be similar to the block 340 described herein with respect to FIG. 3 and, as such, a duplicate description thereof will be omitted. - The
method 800 of FIG. 8 may allow for the generation of the process tree embedding 264 directly from the process metadata 250, as compared to the method 300 of FIG. 3. As a result, the classification 254 and/or explanation 252 may be generated more quickly and with the use of fewer resources. In addition, by providing the process metadata 250 for all of the processes 210 at once, it may be possible for the embedding engine 260 to extract additional context from the process tree 220, which may lead to a more accurate classification 254 and/or explanation 252. - Referring back to
FIG. 1, once generated, the classification 254 and/or the explanation 252 may be utilized to detect and/or remediate malware. For example, in some embodiments, responsive to the classification 254 and/or the explanation 252 indicating that the process tree 220 is associated with malware, steps may be taken to isolate the processes 210 of the process tree 220. In some embodiments, for example, the classification 254 (and/or explanation 252) may be transmitted to the client computing device 120. The client computing device 120 may take remediation actions to address any malware that may be indicated by the explanation 252. For example, if the classification 254 indicates that one or more processes 210 of a process tree 220 executing on the client computing device 120 are associated with malware, the one or more processes 210 may be terminated and/or quarantined. For example, in response to determining that a process 210 is, or is associated with, malware, the operating system 115 (and/or the monitoring engine 215) may terminate and/or unload the process 210 (or all of the processes 210 of the process tree 220), deny the program executable associated with the process 210 permission to execute from memory 124, and/or deny the process 210 access to resources of the client computing device 120. In some embodiments, responsive to the classification 254 and/or explanation 252 indicating malware, an alert may be transmitted from the detection computing device 110 (e.g., to the client computing device 120 and/or another administrative computing device). - The
system 100 provides an improvement in the technology associated with computer security. For example, the system 100 provides an improved malware detection platform that is able to holistically analyze a process tree 220 to determine if the process tree 220 is associated with malware, as well as to determine which processes 210 of the process tree 220 are contributing to this determination. The system 100 is a technological improvement over some techniques for malware detection in that it does not exclusively utilize static image comparisons, which may be quickly varied by malware developers. Instead, embodiments according to the present disclosure may identify malware based on characteristics of portions of the running processes associated with the malware, and may be able to bypass obfuscation techniques that might otherwise make malware detection difficult. - In
FIG. 1, the embedding model 265 and the identification model 275 are illustrated as separate models. However, the embodiments of the present disclosure are not limited to such a configuration. In some embodiments, the embedding model 265 and the identification model 275 may be different portions of a single machine learning-based model. -
FIG. 10 is a flow diagram of a method 1000 of operating a malware detection system, in accordance with some embodiments of the present disclosure. Method 1000 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, the method 1000 may be performed by a computing device (e.g., detection computing device 110). - With reference to
FIG. 10, method 1000 illustrates example functions used by various embodiments. Although specific function blocks (“blocks”) are disclosed in method 1000, such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 1000. It is appreciated that the blocks in method 1000 may be performed in an order different than presented, and that not all of the blocks in method 1000 may be performed. - Referring simultaneously to the prior figures as well, the
method 1000 begins at block 1010, which includes generating a process tree embedding corresponding to a process tree, the process tree comprising a plurality of processes. In some embodiments, the process tree embedding and process tree may be similar to the process tree embedding 264 and/or the process tree 220 described herein with respect to FIGS. 1 to 9. In some embodiments, the plurality of processes may be similar to the processes 210 described herein with respect to FIGS. 1 to 9. In some embodiments, generating the process tree embedding corresponding to the process tree includes generating a process embedding corresponding to a first process of the process tree, and generating the process tree embedding based on the process embedding. In some embodiments, the process embedding may be similar to the process embedding 262 described herein with respect to FIGS. 1 to 9. - In some embodiments, generating the process embedding comprises submitting metadata associated with the first process to an LLM. In some embodiments, the metadata may be similar to the
process metadata 250 described herein with respect to FIGS. 1 to 9. In some embodiments, the metadata includes at least one of an operating system process identifier (ID) of the first process, a unique generated process ID (UPID) of the first process, an operating system process ID of an ancestor in the process tree of the first process, a UPID of the ancestor in the process tree of the first process, a filename of an executable image of the first process, a command line used to create the first process, a filename of an executable image of the ancestor in the process tree of the first process, a command line of the ancestor in the process tree of the first process, or an identification of an action that caused a generation of the metadata for the first process. - In some embodiments, generating the process tree embedding corresponding to the process tree includes generating process embeddings for each of the plurality of processes of the process tree, and aggregating the process embeddings to generate the process tree embedding.
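As an illustration of how such metadata might be flattened into text before submission to the LLM, consider the hypothetical helper below; the field names are assumptions for illustration, not identifiers from the disclosure.

```python
def metadata_to_text(meta):
    """Flatten a process-metadata mapping into one line of text suitable for
    submission to an embedding model. Field names are illustrative stand-ins
    for the metadata items enumerated above (IDs, images, command lines, etc.)."""
    field_order = ["os_pid", "upid", "parent_os_pid", "parent_upid",
                   "image", "cmdline", "parent_image", "parent_cmdline", "action"]
    return " ".join(f"{key}={meta[key]}" for key in field_order if key in meta)
```

A fixed field order keeps the serialization deterministic, so the same process metadata always yields the same text and hence the same embedding.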
- In some embodiments, generating the process tree embedding corresponding to the process tree includes generating the process tree embedding based on respective metadata of each of the plurality of processes of the process tree.
- At
block 1020, operations of the method 1000 may include processing the process tree embedding with a machine learning model to generate an identification of malware associated with the process tree. In some embodiments, the machine learning model may be similar to the identification model 275 described herein with respect to FIGS. 1 to 9. In some embodiments, the identification may be similar to and/or based on the classification 254 and/or the explanation 252 described herein with respect to FIGS. 1 to 9. - In some embodiments, processing the process tree embedding with the machine learning model to generate the identification of malware associated with the process tree may include processing the process tree embedding with the machine learning model to generate a classification of the process tree as being associated with malware; and, responsive to the classification indicating that the process tree is associated with malware, generating, by a processing device, the identification of a first process of the plurality of processes that is relevant to the classification of the process tree as being associated with malware. In some embodiments, the classification may be similar to the
classification 254 described herein with respect to FIGS. 1 to 9. In some embodiments, the identification of the first process may be based on the explanation 252 described herein with respect to FIGS. 1 to 9. - In some embodiments, the
method 1000 may further include, responsive to the identification of malware associated with the process tree, initiating remediation on one or more of the plurality of processes of the process tree. -
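The classify-then-explain-then-remediate flow of block 1020 and the remediation step can be sketched as follows; `classify_tree`, `explain_tree`, and the toy thresholding are hypothetical stand-ins for the machine learning model's two outputs and are not the disclosed implementation.

```python
def classify_tree(tree_embedding):
    """Hypothetical classifier head: flags the tree when the embedding's
    mean value exceeds a (toy) threshold."""
    return "malware" if sum(tree_embedding) / len(tree_embedding) > 0.5 else "benign"

def explain_tree(tree_embedding, process_ids):
    """Hypothetical explanation head: names the process whose corresponding
    embedding element contributes most to the classification."""
    top = max(range(len(process_ids)), key=lambda i: tree_embedding[i])
    return process_ids[top]

def monitor(tree_embedding, process_ids, remediate):
    """Classify a tree; if malware, identify the relevant process and
    trigger remediation (e.g., terminate and/or quarantine it)."""
    classification = classify_tree(tree_embedding)
    if classification == "malware":
        culprit = explain_tree(tree_embedding, process_ids)
        remediate(culprit)
        return classification, culprit
    return classification, None
```

Note that the explanation (and hence remediation) is generated only when the classification indicates malware, mirroring the conditional structure of the method.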
FIG. 11 is a component diagram of an example of a device architecture 1100 for malware detection, in accordance with embodiments of the disclosure. The device architecture 1100 includes computing device 110 having processing device 122 and memory 124, as described herein with respect to FIGS. 1 to 10. - A process tree embedding 1164 may be generated that corresponds to a
process tree 220. The process tree 220 may include a plurality of processes 210. In some embodiments, the process tree embedding 1164 and process tree may be similar to the process tree embedding 264 and/or the process tree 220 described herein with respect to FIGS. 1 to 10. In some embodiments, the plurality of processes 210 may be similar to the processes 210 described herein with respect to FIGS. 1 to 10. - The process tree embedding 1164 may be processed with a machine learning (ML)
model 1175 to generate an identification 1152 of the process tree 220 as being associated with malware. In some embodiments, the ML model 1175 may be similar to the identification model 275 described herein with respect to FIGS. 1 to 10. In some embodiments, the identification 1152 may be based on the classification 254 and/or explanation 252 described herein with respect to FIGS. 1 to 10. - The
device architecture 1100 of FIG. 11 provides an improved capability for malware detection. The device architecture 1100 allows for analysis and/or detection of malware based on process metadata and process tree comparison. By detecting a similarity of a particular process tree to other process trees that are associated with malware, a potentially problematic process may be detected without having to wait for damaging operations to occur. The device architecture 1100 may be able to detect malware based on the behavior and/or configuration of the processes and process tree, rather than relying on a scan or comparison of static executable images. Moreover, as described herein, the device architecture 1100 may be configured to provide a natural language explanation for a relevant process of the process tree that is responsible for the determination that the process tree is associated with malware. Thus, even if a particular process is executing the malware instructions, the device architecture 1100 may be able to identify other processes that are the root cause of the malware. The device architecture 1100 provides a technological improvement to the operation of typical computing devices in that it is able to identify malware more efficiently and does not require constant updates to maintain malware signatures, saving on processing resources and downtime. -
FIG. 12 is a block diagram of an example computing device 1200 that may perform one or more of the operations described herein, in accordance with some embodiments of the disclosure. Computing device 1200 may be connected to other computing devices in a LAN, an intranet, an extranet, and/or the Internet. The computing device may operate in the capacity of a server machine in a client-server network environment or in the capacity of a client in a peer-to-peer network environment. The computing device may be provided by a personal computer (PC), a set-top box (STB), a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single computing device is illustrated, the term “computing device” shall also be taken to include any collection of computing devices that individually or jointly execute a set (or multiple sets) of instructions to perform the methods discussed herein. - The
example computing device 1200 may include a processing device (e.g., a general purpose processor, a PLD, etc.) 1202, a main memory 1204 (e.g., synchronous dynamic random access memory (DRAM), read-only memory (ROM)), a static memory 1206 (e.g., flash memory) and a data storage device 1218, which may communicate with each other via a bus 1230. -
Processing device 1202 may be provided by one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. In an illustrative example, processing device 1202 may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. Processing device 1202 may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 1202 may execute the operations and steps discussed herein, in accordance with one or more aspects of the present disclosure. -
Computing device 1200 may further include a network interface device 1208 which may communicate with a network 1220. The computing device 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 1212 (e.g., a keyboard), a cursor control device 1214 (e.g., a mouse) and an acoustic signal generation device 1216 (e.g., a speaker). In one embodiment, video display unit 1210, alphanumeric input device 1212, and cursor control device 1214 may be combined into a single component or device (e.g., an LCD touch screen). -
Data storage device 1218 may include a computer-readable storage medium 1228 on which may be stored one or more sets of instructions 1225 that may include instructions for an embedding engine 260 and/or a malware identification engine 255 for carrying out the operations described herein, in accordance with one or more aspects of the present disclosure. Instructions 1225 may also reside, completely or at least partially, within main memory 1204 and/or within processing device 1202 during execution thereof by computing device 1200, main memory 1204 and processing device 1202 also constituting computer-readable media. The instructions 1225 may further be transmitted or received over a network 1220 via network interface device 1208. - While computer-
readable storage medium 1228 is shown in an illustrative example to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that causes the machine to perform the methods described herein. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media. - Unless specifically stated otherwise, terms such as “generating,” “processing,” “aggregating,” “submitting,” “initiating,” or the like, refer to actions and processes performed or implemented by computing devices that manipulate and transform data represented as physical (electronic) quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device memories or registers or other such information storage, transmission or display devices. Also, the terms “first,” “second,” “third,” “fourth,” etc., as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation.
- Examples described herein also relate to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computing device selectively programmed by a computer program stored in the computing device. Such a computer program may be stored in a computer-readable non-transitory storage medium.
- The methods and illustrative examples described herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used in accordance with the teachings described herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description above.
- The above description is intended to be illustrative, and not restrictive. Although the present disclosure has been described with references to specific illustrative examples, it will be recognized that the present disclosure is not limited to the examples described. The scope of the disclosure should be determined with reference to the following claims, along with the full scope of equivalents to which the claims are entitled.
- As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes”, and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
- It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
- Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times or the described operations may be distributed in a system which allows the occurrence of the processing operations at various intervals associated with the processing.
- Various units, circuits, or other components may be described or claimed as “configured to” or “configurable to” perform a task or tasks. In such contexts, the phrase “configured to” or “configurable to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task, or configurable to perform the task, even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the “configured to” or “configurable to” language include hardware—for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is “configured to” perform one or more tasks, or is “configurable to” perform one or more tasks, is expressly intended not to invoke 35 U.S.C. § 112 (f) for that unit/circuit/component. Additionally, “configured to” or “configurable to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in manner that is capable of performing the task(s) at issue. “Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks. “Configurable to” is expressly intended not to apply to blank media, an unprogrammed processor or unprogrammed generic computer, or an unprogrammed programmable logic device, programmable gate array, or other unprogrammed device, unless accompanied by programmed media that confers the ability to the unprogrammed device to be configured to perform the disclosed function(s).
- The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
Claims (20)
1. A method of detecting malware, the method comprising:
generating a process tree embedding corresponding to a process tree, the process tree comprising a plurality of processes; and
processing, by a processing device, the process tree embedding with a machine learning model to generate an identification of malware associated with the process tree.
2. The method of claim 1 , wherein processing the process tree embedding with the machine learning model to generate the identification of malware associated with the process tree comprises:
processing the process tree embedding with the machine learning model to generate a classification of the process tree as being associated with malware; and
responsive to the classification indicating that the process tree is associated with malware, generating, by the processing device, the identification of a first process of the plurality of processes that is relevant to the classification of the process tree as being associated with malware.
3. The method of claim 1 , wherein generating the process tree embedding corresponding to the process tree comprises:
generating a process embedding corresponding to a first process of the process tree; and
generating the process tree embedding based on the process embedding.
4. The method of claim 3 , wherein generating the process embedding comprises submitting metadata associated with the first process to a large language model (LLM).
5. The method of claim 4 , wherein the metadata comprises at least one of an operating system process identifier (ID) of the first process, a unique generated process ID (UPID) of the first process, an operating system process ID of an ancestor in the process tree of the first process, a UPID of the ancestor in the process tree of the first process, a filename of an executable image of the first process, a command line used to create the first process, a filename of an executable image of the ancestor in the process tree of the first process, a command line of the ancestor in the process tree of the first process, or an identification of an action that caused a generation of the metadata for the first process.
6. The method of claim 1 , wherein generating the process tree embedding corresponding to the process tree comprises:
generating process embeddings for each of the plurality of processes of the process tree; and
aggregating the process embeddings to generate the process tree embedding.
7. The method of claim 1 , wherein generating the process tree embedding corresponding to the process tree comprises:
generating the process tree embedding based on respective metadata of each of the plurality of processes of the process tree.
8. A system comprising:
a memory; and
a processing device, operatively coupled to the memory, to:
generate a process tree embedding corresponding to a process tree, the process tree comprising a plurality of processes; and
process the process tree embedding with a machine learning model to generate an identification of malware associated with the process tree.
9. The system of claim 8 , wherein, to generate the process tree embedding corresponding to the process tree, the processing device is to:
generate a process embedding corresponding to a first process of the process tree; and
generate the process tree embedding based on the process embedding.
10. The system of claim 9 , wherein, to generate the process embedding, the processing device is to submit metadata associated with the first process to a large language model (LLM).
11. The system of claim 10 , wherein the metadata comprises at least one of an operating system process identifier (ID) of the first process, a unique generated process ID (UPID) of the first process, an operating system process ID of an ancestor in the process tree of the first process, a UPID of the ancestor in the process tree of the first process, a filename of an executable image of the first process, a command line used to create the first process, a filename of an executable image of the ancestor in the process tree of the first process, a command line of the ancestor in the process tree of the first process, or an identification of an action that caused a generation of the metadata for the first process.
12. The system of claim 8 , wherein, to generate the process tree embedding corresponding to the process tree, the processing device is to:
generate process embeddings for each of the plurality of processes of the process tree; and
aggregate the process embeddings to generate the process tree embedding.
13. The system of claim 8 , wherein, to generate the process tree embedding corresponding to the process tree, the processing device is to:
generate the process tree embedding based on respective metadata of each of the plurality of processes of the process tree.
14. The system of claim 8 , wherein the processing device is further to:
responsive to the identification of malware associated with the process tree, initiate remediation on one or more of the plurality of processes of the process tree.
15. A non-transitory computer-readable storage medium including instructions that, when executed by a processing device, cause the processing device to:
generate a process tree embedding corresponding to a process tree, the process tree comprising a plurality of processes; and
process, by the processing device, the process tree embedding with a machine learning model to generate an identification of malware associated with the process tree.
16. The non-transitory computer-readable storage medium of claim 15 , wherein, to generate the process tree embedding corresponding to the process tree, the processing device is to:
generate a process embedding corresponding to a first process of the process tree; and
generate the process tree embedding based on the process embedding.
17. The non-transitory computer-readable storage medium of claim 16 , wherein, to generate the process embedding, the processing device is to submit metadata associated with the first process to a large language model (LLM).
18. The non-transitory computer-readable storage medium of claim 15 , wherein, to process the process tree embedding with the machine learning model to generate the identification of malware associated with the process tree, the processing device is to:
process the process tree embedding with the machine learning model to generate a classification of the process tree as being associated with malware; and
responsive to the classification indicating that the process tree is associated with malware, generate the identification of a first process of the plurality of processes that is relevant to the classification of the process tree as being associated with malware.
19. The non-transitory computer-readable storage medium of claim 15 , wherein, to generate the process tree embedding corresponding to the process tree, the processing device is to:
generate process embeddings for each of the plurality of processes of the process tree; and
aggregate the process embeddings to generate the process tree embedding.
20. The non-transitory computer-readable storage medium of claim 15 , wherein, to generate the process tree embedding corresponding to the process tree, the processing device is to:
generate the process tree embedding based on respective metadata of each of the plurality of processes of the process tree.
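Claims 2 and 18 recite classifying the process tree embedding and, when malware is indicated, identifying a first process relevant to that classification. The claims do not fix a particular model or attribution method; the sketch below substitutes a toy linear classifier and attributes the classification by scoring each per-process embedding individually. Every name, dimension, and threshold here is an illustrative assumption, not part of the claims.

```python
import numpy as np

def malware_score(embedding: np.ndarray, w: np.ndarray, b: float) -> float:
    """Toy stand-in for the machine learning model: sigmoid of a dot product."""
    return 1.0 / (1.0 + np.exp(-(embedding @ w + b)))

def classify_and_attribute(process_embs, w, b, threshold=0.5):
    """Classify the tree embedding; if malicious, pick the most relevant process.

    Returns (tree_score, index_of_relevant_process_or_None).
    """
    tree_emb = np.mean(np.stack(process_embs), axis=0)  # aggregate per-process embeddings
    score = malware_score(tree_emb, w, b)
    if score < threshold:
        return score, None  # tree classified as benign; no attribution needed
    # Hypothetical attribution: the process whose own embedding scores highest
    per_process = [malware_score(e, w, b) for e in process_embs]
    return score, int(np.argmax(per_process))

# Hypothetical 2-dimensional embeddings and weights
w = np.array([4.0, 0.0])
b = -1.0
embs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score, idx = classify_and_attribute(embs, w, b)
```

In this toy setup the first process dominates the malicious score, so it is the one surfaced as relevant to the tree-level classification.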
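Claims 4, 5, 10, and 11 describe building a process embedding by submitting process metadata to a large language model (LLM). One way the metadata record enumerated in claim 5 might be flattened into text before being sent to an embedding model is sketched below; the `ProcessMetadata` fields and `to_prompt` helper are hypothetical illustrations, and a real system would call an actual LLM embedding endpoint on the resulting string.

```python
from dataclasses import dataclass

@dataclass
class ProcessMetadata:
    os_pid: int              # operating system process ID
    upid: str                # unique generated process ID (UPID)
    parent_os_pid: int       # OS process ID of the ancestor in the tree
    parent_upid: str         # UPID of the ancestor
    image: str               # filename of the executable image
    command_line: str        # command line used to create the process
    parent_image: str        # executable image of the ancestor
    parent_command_line: str # command line of the ancestor
    action: str              # action that caused generation of this record

def to_prompt(m: ProcessMetadata) -> str:
    """Flatten the metadata into one text string for an embedding model."""
    return (f"pid={m.os_pid} upid={m.upid} image={m.image} "
            f"cmd={m.command_line} parent_image={m.parent_image} "
            f"parent_cmd={m.parent_command_line} action={m.action}")

meta = ProcessMetadata(1234, "u-1", 1000, "u-0", "cmd.exe",
                       "cmd /c whoami", "explorer.exe",
                       "explorer.exe", "process_create")
prompt = to_prompt(meta)
```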
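Claims 6, 12, and 19 recite aggregating per-process embeddings into a single process tree embedding without prescribing an aggregation function. A minimal sketch assuming mean pooling over NumPy vectors; the function name and the 4-dimensional embeddings are illustrative assumptions.

```python
import numpy as np

def aggregate_tree_embedding(process_embeddings: list) -> np.ndarray:
    """Mean-pool per-process embeddings into one process tree embedding."""
    return np.mean(np.stack(process_embeddings), axis=0)

# Three hypothetical 4-dimensional process embeddings
procs = [np.array([1.0, 0.0, 0.0, 0.0]),
         np.array([0.0, 1.0, 0.0, 0.0]),
         np.array([0.0, 0.0, 1.0, 0.0])]
tree_embedding = aggregate_tree_embedding(procs)
```

Mean pooling keeps the tree embedding in the same vector space as the process embeddings regardless of how many processes the tree contains; sum, max, or attention-weighted pooling would satisfy the claim language equally well.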
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/216,833 US20250005154A1 (en) | 2023-06-30 | 2023-06-30 | Techniques for utilizing embeddings to monitor process trees |
Publications (1)
Publication Number | Publication Date |
---|---|
US20250005154A1 true US20250005154A1 (en) | 2025-01-02 |
Family
ID=94126116
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/216,833 Pending US20250005154A1 (en) | 2023-06-30 | 2023-06-30 | Techniques for utilizing embeddings to monitor process trees |
Country Status (1)
Country | Link |
---|---|
US (1) | US20250005154A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191239A (en) * | 2019-12-30 | 2020-05-22 | 北京邮电大学 | Process detection method and system for application program |
US20200167464A1 (en) * | 2018-11-28 | 2020-05-28 | International Business Machines Corporation | Detecting malicious activity on a computer system |
US20210406368A1 (en) * | 2020-06-30 | 2021-12-30 | Microsoft Technology Licensing, Llc | Deep learning-based analysis of signals for threat detection |
US20240330446A1 (en) * | 2023-03-27 | 2024-10-03 | Microsoft Technology Licensing, Llc | Finding semantically related security information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230252136A1 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information | |
KR102790640B1 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information | |
US20230252144A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20230254340A1 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information | |
US20240054215A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20230048076A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
KR102411383B1 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information | |
US20250028818A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20240054210A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US12282554B2 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20230306113A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20250005154A1 (en) | Techniques for utilizing embeddings to monitor process trees | |
KR102447278B1 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information | |
US20240214396A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US12174958B2 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information | |
US20240348639A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20250028827A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20250028826A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20250030704A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20250028823A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20240214406A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
EP4407495A1 (en) | Machine learning-based malware detection for code reflection | |
US20240211595A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
US20250028825A1 (en) | Cyber threat information processing apparatus, cyber threat information processing method, and storage medium storing cyber threat information processing program | |
KR102396238B1 (en) | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: CROWDSTRIKE, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SAVA, VASILE-DANIEL;SUMEDREA, PAUL;POPA, CRISTIAN VIOREL;REEL/FRAME:064128/0066 Effective date: 20230628 |
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |