US20170075744A1 - Identifying root causes of failures in a deployed distributed application using historical fine grained machine state data
- Publication number: US20170075744A1 (Application US14/852,006)
- Authority: US (United States)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F11/079 — Root cause analysis, i.e. error or fault diagnosis
- G06F11/0709 — Error or fault processing, not based on redundancy, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G06F11/1629 — Error detection by comparing the output of redundant processing systems
- G06F11/2268 — Logging of test results
- G06F11/3006 — Monitoring arrangements where the computing system being monitored is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/321 — Display for diagnostics, e.g. diagnostic result display, self-test user interface
- G06F11/3409 — Recording or statistical evaluation of computer activity for performance assessment
- H04L41/0631 — Management of faults, events, alarms or notifications using root cause analysis
- H04L41/064 — Root cause analysis involving time analysis
- H04L41/065 — Root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
Definitions
- Downtime does not only affect lost revenue; in fact, the true cost of downtime can be much higher. The true cost can include, for example, lost or dissatisfied customers, damage to a company's reputation, lost employee productivity, and even devaluation of the business (e.g., falling stock prices).
- A large number of non-malicious failures occur during routine maintenance (e.g., uninterruptable power supply (UPS) replacement, failure of a machine hard disk, or adding new machines to or deprecating old machines from the cluster).
- In summary, one aspect of the invention provides a method of identifying root causes of system failures in a distributed system, said method comprising: utilizing at least one processor to execute computer code that performs the steps of: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine state data, a healthy map model; detecting at least one failed machine state in the distributed system; comparing the failed machine state against the healthy map model; identifying, based on the comparison, at least one root cause of the failed machine state; and displaying, on a display device, a ranked list comprising the at least one root cause.
- Another aspect of the invention provides an apparatus for identifying root causes of system failures in a distributed system, said apparatus comprising: at least one processor; and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor, the computer readable program code comprising: computer readable program code that records, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; computer readable program code that creates, based on the historical machine state data, a healthy map model; computer readable program code that detects at least one failed machine state in the distributed system; computer readable program code that compares the failed machine state against the healthy map model; computer readable program code that identifies, based on the comparison, at least one root cause of the failed machine state; and computer readable program code that displays, on a display device, a ranked list comprising the at least one root cause.
- An additional aspect of the invention provides a computer program product for identifying root causes of system failures in a distributed system, said computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code that records, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; computer readable program code that creates, based on the historical machine state data, a healthy map model; computer readable program code that detects at least one failed machine state in the distributed system; computer readable program code that compares the failed machine state against the healthy map model; computer readable program code that identifies, based on the comparison, at least one root cause of the failed machine state; and computer readable program code that displays, on a display device, a ranked list comprising the at least one root cause.
- a further aspect of the invention provides a method comprising: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine state data, a healthy map model; detecting at least one failed machine state in the distributed system; determining a failure time, wherein the failure time is associated with the at least one machine state failure; determining a healthy time, wherein the healthy time is associated with a healthy state of the machine state and its dependencies prior to the failure time; generating at least one seed-anomaly score, using an inference algorithm, for machine states between the healthy time and the failure time; and modifying the at least one seed-anomaly score, based on an iterative graph convergence algorithm; wherein the ranked list is based on the modified at least one seed-anomaly score.
- FIG. 1 illustrates an example embodiment of identifying a root cause of a failure in a distributed system.
- FIG. 2 illustrates another example embodiment of identifying a root cause of a failure in a distributed system.
- FIG. 3 illustrates an example of key value based machine state data an embodiment may collect.
- FIG. 4 schematically illustrates an example property graph of a networked distributed system/application.
- FIG. 5 illustrates the lifespan based profile component of the healthy model, at a per entity level.
- FIG. 6 illustrates the value histogram based profile component of the healthy model, at a per entity level.
- FIG. 7 illustrates a determination of a problematic time interval.
- FIG. 8 illustrates a categorization table of entities within a problematic time interval.
- FIG. 9 illustrates an example scoring algorithm for a seed-anomaly score.
- FIG. 10A illustrates an example of seed score strengthening.
- FIG. 10B illustrates an example of seed score weakening.
- FIG. 11 illustrates an example graphical user interface for a root cause search application.
- FIG. 12 illustrates an example embodiment of identifying root causes of failures in a deployed distributed application using historical fine grained machine state data.
- FIG. 13 illustrates an example computer system.
- Problem diagnosis speed matters; thus there is a need for techniques that allow for enhanced automation via fine grained root cause identification.
- Problem diagnosis of distributed systems is challenging for various reasons, for example, the increasing number of participating distributed components within a typical application, or the large variety of potential causes for failed applications. Further examples include but are not limited to: application and deployment misconfigurations; application code related errors or performance bugs; a change to dependent packages/shared libraries; issues with hosted infrastructure (e.g., shared resource contention); and the like.
- fine grained machine state data offer immense potential to help fully identify the root cause or pinpoint the root cause at a fine grained level
- building a solution that allows for automation of the process creates a technical challenge in that operating on fine grained machine state data is many orders of magnitude more challenging than what is available in current solutions (e.g., those that analyze metric or log data). This is because the number of fine grained machine state entities is so much higher than the number of collected metrics and log files currently analyzed.
- The technical problem is not only one of tackling the scale and volume of fine grained machine entities, but also of devising new techniques that can operate on them: the techniques used to analyze metric data (e.g., tracking the average of a numeric metric and reporting alerts based on significant deviations from that average) and the techniques used for analyzing log data (e.g., searching for log lines that report errors or warning messages) do not carry over directly to fine grained machine state.
- an embodiment allows for root cause identification to be automated. This is enabled through periodically collecting very fine grained machine state data of various types (e.g., processes, connections, configuration settings, packages, application metrics, attributes of shared infrastructure (e.g., disk, central processing unit (CPU), memory, etc.)). This machine state data is then used to discover application invariants on the machine state (e.g., a list of typical processes the current application starts, a list of typical remote connection ports, a list of typical shared libraries accessed, a list of configuration files read, etc.).
- An invariant is a condition that can be relied upon to be true during execution of a program, or during some portion of it. It is a logical assertion that is held to always be true during a certain phase of execution.
- correlations are generated across anomalies (i.e., deviation of a faulty state from the invariants) of various types of machine state data related to running applications. An embodiment may then take the discovered correlations and identify possible root causes of a fault.
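- As an illustration of how such application invariants might be derived (an editor's sketch, not the patent's implementation), the following Python fragment treats any machine-state item that is present in nearly every healthy snapshot as an invariant; the `MIN_SUPPORT` threshold and the set-based data shapes are assumptions:

```python
# Illustrative sketch: deriving application invariants from repeated
# healthy-state snapshots. An item observed in nearly every snapshot
# (>= MIN_SUPPORT of them) is treated as an invariant; deviations from
# these invariants in a faulty snapshot become candidate anomalies.
from collections import Counter

MIN_SUPPORT = 0.95  # assumed threshold; tune per deployment

def discover_invariants(snapshots):
    """snapshots: list of sets, each holding the machine-state items
    (e.g., process names, remote ports, shared libraries) seen in one
    healthy collection cycle."""
    counts = Counter()
    for snap in snapshots:
        counts.update(snap)
    n = len(snapshots)
    return {item for item, c in counts.items() if c / n >= MIN_SUPPORT}

def anomalies(faulty_snapshot, invariants):
    """Items expected but missing, plus items not part of the invariants."""
    missing = invariants - faulty_snapshot
    unexpected = faulty_snapshot - invariants
    return missing, unexpected

# Example: three healthy snapshots and one faulty one.
healthy = [{"mysqld", "java", "nginx"}, {"mysqld", "java", "nginx"},
           {"mysqld", "java", "nginx", "backup.sh"}]
inv = discover_invariants(healthy)
print(anomalies({"java", "nginx", "rogue_proc"}, inv))
```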
- fine grain machine state data (e.g., processes, configuration files, installed packages, metrics, infrastructure shared resource utilization, etc.) is periodically collected and analyzed from physical machines and/or virtual machines (VMs) on which the distributed application is deployed.
- This periodic collection directly from running applications when an entire system is healthy allows an embodiment to construct a healthy reference model.
- This healthy reference model captures application invariants over a variety of machine states.
- an embodiment compares the recent machine state data collected from the failed application against the application's healthy reference model (i.e., typical invariants). Based on the difference between the healthy reference model and the faulty state, an embodiment may identify potential root causes for the observed fault.
- the embodiment then utilizes a root cause inference algorithm that is able to pinpoint the root cause and/or return a ranked list of potential root causes with the most likely or relevant cause being given high rank.
- the inference algorithm calculates the divergence score of the entity's state at the time of fault as compared to the healthy state model of that entity.
- An even further embodiment may include a training phase, which is executed during the operation of a healthy system. Initially, an embodiment may periodically collect fine grained machine data. Once the data is collected, an embodiment may represent the collected data in a temporally evolving property graph model. Based on this temporally evolving property graph, a healthy profile is built on a per entity basis from the time series of evolution of the entity's state.
- An embodiment may then include a root cause diagnosis phase, which is executed when a predetermined trigger occurs (e.g., getting a service level agreement (SLA) violation, receiving a ticket, etc.).
- the root cause diagnosis phase may first determine the problematic time interval [t_good, t_bad] using the metrics and dependency edges.
- a graph-diff (difference-determining graph) is then created from g(t_bad) and g(t_good) to identify the set of potentially anomalous entities. Once the graph is created, an embodiment will assign a seed-anomaly score to each potentially anomalous entity based on its divergence from the healthy state model.
- An embodiment may then use dependency edges to strengthen or weaken the seed-anomaly scores, using the various methods described herein, to create a ranked list of root causes. Once the ranked list exists, it may be displayed in a faceted manner with additional navigation options.
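- A minimal sketch of the graph "diff" step described above is given below (editor's illustration; the dictionary-based entity representation and category names follow FIG. 8/FIG. 9 but are otherwise assumptions): entities that are identical at t_good and t_bad are pruned, and the remainder are bucketed for later seed-anomaly scoring.

```python
# Illustrative sketch: a graph "diff" between the snapshot at t_good and the
# snapshot at t_bad. Entities are modeled as {key: {attribute: value}} maps;
# unchanged entities are pruned, the rest are bucketed into the categories
# used later for seed-anomaly scoring.
def graph_diff(g_good, g_bad):
    categories = {"Newly_Observed": [], "Disappeared_Now": [], "Changed_In_Value": []}
    for key, attrs in g_bad.items():
        if key not in g_good:
            categories["Newly_Observed"].append(key)
        elif attrs != g_good[key]:
            categories["Changed_In_Value"].append(key)
        # entities equal in both snapshots are pruned (unlikely root causes)
    for key in g_good:
        if key not in g_bad:
            categories["Disappeared_Now"].append(key)
    return categories

g_good = {"vm1:process:mysqld": {"threads": 40}, "vm1:config:my.cnf": {"port": 3306}}
g_bad  = {"vm1:config:my.cnf": {"port": 3307}, "vm1:process:oom_killer": {"threads": 1}}
print(graph_diff(g_good, g_bad))
```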
- an embodiment may then, based upon the property graph representation, construct a healthy state reference model.
- An embodiment calculates the divergence (i.e., difference) between a detected fault state and the known healthy state model.
- an embodiment utilizes a root cause inference algorithm to exploit specific property graph based modeling techniques as used herein.
- the embodiment maps or converts multi-silo machine state data into a key value based property graph, wherein different machine state features are nodes/vertices of the graph.
- the terms node and vertex are used interchangeably throughout this disclosure when in reference to graph generation.
- each node additionally has a property or attribute list in the form of key value pairs.
- an embodiment may collect machine state data.
- the machine state data may be collected through a variety of methods.
- an embodiment may utilize a crawler which systematically browses (i.e., crawls) through the integrated system and records information at various states and times.
- the crawler then indexes content within the system as it crawls.
- the crawler has the ability to recognize certain characteristics of the system (e.g., particular dependencies, communication between entities, certain code operators, etc.).
- the data may be collected via manual entry by a user (e.g., a user may enter specific components and their existing dependencies).
- An embodiment may also collect machine state data from multiple silos of data, for example, metrics, configuration files, files, processes, packages, connections, development operations, tickets submitted indicating potential changes or updates, known events detected, logs, administrative operations, etc.
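- One possible collection cycle is sketched below (editor's illustration, not the patent's crawler): it uses the third-party psutil library to gather a few of the silos named above (processes, connections, shared-resource utilization); a real deployment would also crawl configuration files, packages, logs, tickets, and so on, and the field names here are assumptions.

```python
# A minimal collection sketch of one crawler cycle, assuming psutil is
# available (pip install psutil). Only a few silos are shown.
import datetime
import psutil

def crawl_once(namespace):
    crawltime = datetime.datetime.utcnow().isoformat()
    state = {"namespace": namespace, "crawltime": crawltime, "features": []}
    # process silo
    for proc in psutil.process_iter(attrs=["pid", "name", "username", "num_threads"]):
        state["features"].append({"featuretype": "process", **proc.info})
    # connection silo (may require elevated privileges on some platforms)
    for conn in psutil.net_connections(kind="inet"):
        state["features"].append({"featuretype": "connection",
                                  "laddr": conn.laddr, "raddr": conn.raddr,
                                  "status": conn.status})
    # shared-infrastructure silo
    state["features"].append({"featuretype": "cpu", "percent": psutil.cpu_percent()})
    state["features"].append({"featuretype": "memory",
                              "percent": psutil.virtual_memory().percent})
    state["features"].append({"featuretype": "disk",
                              "percent": psutil.disk_usage("/").percent})
    return state
```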
- a further example embodiment is shown in FIG. 2 at 210 .
- An embodiment may, during the fine grained machine state data collection, draw from multiple sources. For example, and as shown in 210 , work ticket requests, developer operations, standard metrics, infrastructure events, data logs, configuration files, administration operations, and the like can be used for data collection.
- the fine grain data may be collected from various levels within the distributed system (e.g., application, middleware, infrastructure, etc.).
- Application metrics may include, for example, infrastructure incidents, infrastructure availability, infrastructure utilization and performance, application issues, application availability, application utilization and performance, operations, application support, etc.
- Collecting data related to the configuration files may help detect changes to the configuration files themselves (e.g., change of remote port, thread pool size, etc.).
- Collecting data related to the processes may help detect processes that have crashed or even new resource-heavy processes that may have been created.
- Collecting data related to the packages may help detect changes to typical opened files (e.g., a change to a new version of a shared library due to package upgrades).
- Collecting data related to the connections may help detect missing network connections to remote topology nodes of the same application.
- Collecting data related to the development operations may be used to detect recent changes to code deployment data.
- a “namespace” is determined, wherein a namespace may be a tag that is associated by the crawler to represent the source of the data.
- namespaces may be the machine name, or a representation of the machine (e.g., ⁇ machine-name, application-component>), or even simply the application and component name if only one such component exists in the application cluster.
- an embodiment may assign a “featuretype” at 330 to uniquely represent each fine grained machine entity collected on a particular namespace.
- the ⁇ namespace:featuretype>tuple may be used to uniquely identify the different fine grained machine entities collected by the system.
- an embodiment may ensure that the featuretype is assigned in a way such that a featuretype of the same machine entity across a historic timeline is time invariant (e.g., using a name instead of a process ID for processes).
- an embodiment may record the “crawltime.”
- the crawltime is the recorded time during which the data was collected. This allows an embodiment to associate the recorded data state with a particular reference time. This is important because, as discussed further herein, the machine state data is continuously collected and stored in a historical repository. The repository is then queried when a failure is detected and a time is identified for which the machine state was healthy (e.g., t_good).
- An embodiment may also record the featuretype, as shown at 330 , which allows an embodiment to differentiate between the types of entity states being stored (e.g., operating system (OS), disk, configuration file, package, connection, process, etc.).
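- The keying scheme described above might be realized as follows (editor's sketch; the record fields mirror FIG. 3 but the function and field names are assumptions): a time-invariant identifier is formed from the namespace, featuretype, and a stable feature name, and each record is stamped with its crawltime.

```python
# Illustrative sketch: forming a time-invariant identifier for each collected
# entity from its namespace and featuretype, and stamping the record with the
# crawltime. Using the process *name* (not its PID) keeps the key stable
# across the historic timeline, as described above.
import datetime

def make_record(namespace, featuretype, feature_key, attributes):
    return {
        "key": f"{namespace}:{featuretype}:{feature_key}",  # e.g. "vm7:process:mysqld"
        "namespace": namespace,
        "featuretype": featuretype,
        "crawltime": datetime.datetime.utcnow().isoformat(),
        "attributes": attributes,
    }

rec = make_record("vm7", "process", "mysqld",
                  {"cmd": "/usr/sbin/mysqld", "threads": 38, "user": "mysql"})
print(rec["key"], rec["crawltime"])
```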
- Once the data is collected via any method (e.g., those disclosed herein), it is stored on a storage device (e.g., that shown at 34 ′) at 120 .
- the storage system also houses the historical data states.
- the historical data states are associated with a known time period during which the state was taken (i.e., a snapshot) as discussed herein.
- the newly acquired machine state data is then added to the existing historical state data at 130 . This is because the machine state data is routinely collected at given intervals, and the historical machine state data is updated based on the continual collecting.
- a system could use a crawler, like that discussed herein, to constantly crawl through the network system and record data states.
- an embodiment may only take snapshots once daily, weekly, or monthly depending on the demands of the infrastructure.
- An embodiment creates a time evolving property graph representation.
- An embodiment may, as shown in FIG. 4 , create a property graph, which links different machine features that have specific dependencies or causal relationships between each other as edges.
- An edge (i.e., a line between nodes) from node N 1 to node N 2 indicates that N 1 depends on N 2 ; in other words, N 2 could causally affect N 1 .
- Additional examples may be a particular process (i.e., a graph node) having an edge to a configuration file (i.e., another graph node) that it reads such as at 420 .
- vertex modeling (V) equals a set of nodes/vertices. Each vertex has a vertex key/id and a set of properties associated with it.
- An embodiment may convert the collected machine state data into the vertex properties by: (1) flattening the hierarchical key space into a unique vertex key/id (e.g., prefixing keys by virtual machine name); (2) ensuring the keys are time invariant (e.g., using a name instead of a process ID for processes); and (3) storing the attributes of the features (e.g., the JavaScript Object Notation (JSON) fields) as the vertex's property list.
- An embodiment may create the vertex type annotations based on featuretype.
- the different classes of vertex annotations (e.g., process, configuration, metric, topology, etc.) may be leveraged to trigger class/type specific modeling and root cause inference logic.
- the edge modeling (E) comprises a set of edges representing dependencies between vertex keys (e.g., a process reading a configuration file, a file belonging to a particular package, a process opening a connection to a remote port, etc.).
- the dependency relationships may be of the varying forms. For example, a metric entity may depend on a process entity which is being produced (e.g., 430 ). Additionally, a process entity may depend on the configuration files from which it reads (e.g., 440 ). A further example may be a process entity depending on a shared library package entity (e.g., 450 ).
- a process may depend on a virtual machine (VM) or machine disk entity on which it runs (e.g., 460 ), other examples could include CPU entities and memory entities related to the VM.
- An additional example may be one process entity depending on another remote process entity that it interacts with (i.e., inferring relationships from connections to the different entities) (e.g., 470 ).
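- A compact data-structure sketch of this vertex/edge model is shown below (editor's illustration; the class names, type labels, and example keys are assumptions, not the patent's structures): each vertex carries a flattened, time-invariant key, a type annotation, and a key/value attribute list, while each edge records a dependency between two vertex keys.

```python
# Illustrative sketch of the property-graph model used for machine state.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    key: str                 # e.g. "vm7:process:mysqld"
    vtype: str               # process | config | metric | package | disk ...
    attrs: dict = field(default_factory=dict)

@dataclass
class PropertyGraph:
    vertices: dict = field(default_factory=dict)   # key -> Vertex
    edges: set = field(default_factory=set)        # (src_key, dst_key, label)

    def add_vertex(self, v):
        self.vertices[v.key] = v

    def add_dependency(self, src, dst, label):
        self.edges.add((src, dst, label))

g = PropertyGraph()
g.add_vertex(Vertex("vm7:process:mysqld", "process", {"threads": 38}))
g.add_vertex(Vertex("vm7:config:my.cnf", "config", {"port": 3306}))
g.add_vertex(Vertex("vm7:metric:query_latency", "metric", {"value_ms": 12.0}))
g.add_dependency("vm7:metric:query_latency", "vm7:process:mysqld", "produced_by")
g.add_dependency("vm7:process:mysqld", "vm7:config:my.cnf", "reads")
```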
- An additional example embodiment of a time evolving machine property graph comprising a web of fine grained machine entities interconnected with dependency edges is shown in FIG. 2 at 220 .
- an embodiment may construct a healthy state reference model by aggregating multiple property graphs related to different time snapshots of the application when healthy. These snapshots may be annotated and aggregated, with specific techniques based on the node vertex modeling in combination with the edge modeling.
- One example embodiment such as that at 230 , may create a healthy state model at the node level by identifying what is considered “normal” on particular nodes, and what could be expected based on a particular node (e.g., is the node always present, intermittent, etc.). Additionally, node attributes may be factors (e.g., identifying typical ranges for attribute values and determining how much fluctuation is normal).
- the vertex is the union of nodes/vertices across different datacenter states at various points in time, wherein each vertex may have various characteristics.
- One such characteristic is "occurrence-probability": in addition to the vertex 'key/id' described herein, an additional "vertexkey_occurrence_probability" characteristic is created, which measures how often a vertex was part of the state snapshot.
- Another characteristic is “lifespan profile,” which relates to the expected duration (and deviation) of an entity remaining in the same state before it is updated.
- a characteristic labeled “attribute value histogram,” which is discussed further herein, may be used to enable an embodiment to maintain typical ranges of attribute values (for each attribute) to allow the embodiment to learn what amount of fluctuation is normal.
- An embodiment may also use edge level analysis in building the healthy model. For example, identifying which set of edges occur as invariants, which set of edges are intermittent, which set of edges are rare, etc. Similar to the node analysis, attribute level analysis may be used, for example, to determine what typical edge attributes values are normal, or what normal attribute behavior can be expected. Also as above, an embodiment may aggregate the historical information per edge entity into aggregated attributes, for example occurrence probability, lifespan distribution, and attribute value histogram.
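- The occurrence-probability aggregation described above might look like the following (editor's sketch under the assumption that each healthy snapshot is reduced to a set of vertex keys; the same aggregation can be applied to edges):

```python
# Illustrative sketch: aggregating multiple healthy-time property graphs into
# per-entity occurrence probabilities (the occurrence-probability
# characteristic described above).
from collections import Counter

def occurrence_probability(snapshots):
    counts = Counter()
    for vertex_keys in snapshots:
        counts.update(vertex_keys)
    n = len(snapshots)
    return {key: c / n for key, c in counts.items()}

snaps = [{"vm7:process:mysqld", "vm7:process:java"},
         {"vm7:process:mysqld", "vm7:process:java", "vm7:process:backup"},
         {"vm7:process:mysqld", "vm7:process:java"}]
probs = occurrence_probability(snaps)
print(probs["vm7:process:mysqld"])   # 1.0   -> invariant entity
print(probs["vm7:process:backup"])   # ~0.33 -> intermittent entity
```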
- an embodiment may create a healthy state model at a per entity level, which allows for tracking the entity via its historic lifespan profile.
- the time series or lifecycle at 510 of an entity may be represented as a series of its constituent versions.
- the version of an entity is updated either when any of its attributes change ( 540 ), or when it transitions from existing to non-existing ( 520 ), or from non-existing to existing ( 530 ).
- the lifespan of each version can be computed as the time interval [v.b, v.d] wherein “v.b” is the version's birth time and “v.d” is the version's death time.
- the lifespan profile of the entity can be computed to track what the average lifespan is, and also the standard deviation therein.
- an embodiment may compute an average and standard deviation of the contiguous existence (i.e., discounting changes to its attribute values) of an entity during a predetermined time 540 .
- an embodiment would calculate the contiguous existence durations as: [(v1.d−v0.b), (v2.d−v2.b), (v3.d−v3.b)].
- a further embodiment may additionally or alternatively calculate the average and standard deviation of the non-existence (or disappearance) durations as: [(v2.b−v1.d), (v3.b−v2.d)].
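- A small sketch of this lifespan-profile computation is given below (editor's illustration; the (birth, death) tuple representation and merging rule are assumptions consistent with the example above, where an attribute change does not break contiguous existence):

```python
# Illustrative sketch of the lifespan profile: given an entity's version
# history as (birth, death) pairs, compute the average and standard deviation
# of its contiguous-existence and disappearance durations, mirroring the
# [(v1.d - v0.b), (v2.d - v2.b), (v3.d - v3.b)] example above.
import statistics

def lifespan_profile(versions):
    """versions: list of (birth, death) tuples, ordered in time."""
    existence, gaps = [], []
    span_start, span_end = versions[0]
    for b, d in versions[1:]:
        if b == span_end:              # attribute change only: extend the span
            span_end = d
        else:                          # the entity disappeared for a while
            existence.append(span_end - span_start)
            gaps.append(b - span_end)
            span_start, span_end = b, d
    existence.append(span_end - span_start)
    profile = {"avg_existence": statistics.mean(existence),
               "std_existence": statistics.pstdev(existence)}
    if gaps:
        profile.update({"avg_disappearance": statistics.mean(gaps),
                        "std_disappearance": statistics.pstdev(gaps)})
    return profile

# v0 -> v1 is an attribute change (contiguous); the later versions follow gaps.
print(lifespan_profile([(0, 5), (5, 9), (12, 20), (25, 31)]))
```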
- an embodiment may, based on the healthy model track the historic profile at a per-entity level for each of its attributes.
- an embodiment may have an entity “E” at 610 .
- entities (E) may have multiple versions (e.g., E.v1, E.v2, etc.).
- the entity may also be associated with a list of attributes (e.g., A1, A2, A3, A4, etc.) at 620 .
- the attributes correspond to specific factors related to the entity (e.g., attributes for the process entity ( 340 ), such as: cmd, threads, pid, openfiles, user, etc. as shown in FIG. 3 ).
- the attributes have the ability to change over time, and thus an embodiment monitors them with regular periodicity as discussed herein.
- a value histogram is calculated for the occurrence probability of each of the attributes at 630 - 660 .
- an embodiment may determine if, or which, attributes' value fluctuates. For example, the attribute A2 ( 640 ) does not fluctuate at all, whereas attribute A4 ( 660 ) fluctuates reasonably.
- An embodiment may also capture, for a fluctuating attribute, a typical dominating value of the attribute (e.g., A1 observed to have the value ‘11’ with 85% probability).
- the value histogram allows an embodiment to determine if the fluctuations are benign or abnormal by observing typical faults.
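- The per-attribute value histogram might be built as follows (editor's sketch; the observation format is an assumption): for every attribute, the frequency of each observed value is tracked across healthy snapshots, and the dominant value with its probability (e.g., A1 observed to be '11' with 85% probability) indicates how much fluctuation is normal.

```python
# Illustrative sketch of the per-attribute value histogram of the healthy model.
from collections import defaultdict, Counter

def value_histograms(observations):
    """observations: list of {attribute: value} dicts, one per snapshot."""
    hists = defaultdict(Counter)
    for obs in observations:
        for attr, value in obs.items():
            hists[attr][value] += 1
    result = {}
    for attr, counter in hists.items():
        total = sum(counter.values())
        value, count = counter.most_common(1)[0]
        result[attr] = {"histogram": {v: c / total for v, c in counter.items()},
                        "dominant_value": value,
                        "dominant_probability": count / total}
    return result

obs = [{"A1": 11, "A2": "root"}, {"A1": 11, "A2": "root"},
       {"A1": 11, "A2": "root"}, {"A1": 13, "A2": "root"}]
print(value_histograms(obs)["A1"])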
- this continuous machine state capturing, recording, and model creation will continue if no fault is detected at 140 .
- an embodiment may compare the failed machine state against the created healthy map model at 150 . This comparison allows an embodiment to determine a difference between the aggregated and annotated property graphs related to the healthy reference model and the property graph captured at the faulty machine state time.
- A specific instance (e.g., a failure of an application violating an SLA, a user raising a ticket, etc.) triggers a flag observed at a time which an embodiment records as "t_bad."
- The embodiment then needs to identify a time in the past (e.g., t_good) such that potential root causes of the problem are related to the changes in the time-interval [t_good, t_bad].
- an embodiment needs to minimize the interval [t_good, t_bad] as much as possible, while also ensuring the search is broad enough to capture all potential root causes for the failure of a target application.
- One embodiment may receive dependency graph data which is mined from the fine grained machine state data discussed herein. Additionally or alternatively, an embodiment may also receive metric data for all metrics collected from a datacenter (e.g., metrics across all applications, middleware, infrastructure, etc.). Additionally, an embodiment may have an alert threshold, which may be utilized to determine when/if a metric is in a good or bad state. An embodiment may also be able to use algorithms to infer an alert threshold. For example, an embodiment may use a change point detection algorithm, which utilizes the historical value of a metric to detect sudden change points reflecting transitioning to bad state.
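- A deliberately simple sketch of inferring such good/bad transitions without a configured threshold is shown below (editor's illustration; real change point detectors such as CUSUM are more robust, and the window size and deviation factor here are assumptions): a change point is flagged when a new metric value deviates from the trailing window's mean by more than a few standard deviations.

```python
# Illustrative sketch: a naive change-point detector over a metric's history.
import statistics

def change_points(values, window=10, k=3.0):
    points = []
    for i in range(window, len(values)):
        hist = values[i - window:i]
        mu, sigma = statistics.mean(hist), statistics.pstdev(hist)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            points.append(i)
    return points

latency = [10, 11, 9, 10, 12, 10, 11, 10, 9, 11, 10, 11, 55, 60, 58]
print(change_points(latency))  # indices where the metric jumps to a bad state
```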
- an embodiment may detect the set of metrics that are attached to any machine entities that are “reachable” from the machine entity (e.g., process) associated with a target application that was observed to fail or misbehave. For example, an embodiment may start at t_bad, and proceed backward in time until all “reachable metrics” (e.g., dependencies) are determined to be in a good state, thus determining a t_good.
- an embodiment determines the problematic time interval [t_good, t_bad] using the time series of divergence scores and the dependency edges. For example, an embodiment first acquires the set of machine entities that are reachable via dependency edges from the observed faulty entity. Then, based on the time series of seed-anomaly scores for each of the reachable entities, an embodiment must determine the nearest time (moving backward in time) from t_bad when all reachable entities have low divergence or anomaly scores. This point in time is then labeled t_good.
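- The backward scan just described might be implemented along these lines (editor's sketch; the integer time steps, score dictionaries, and threshold are assumptions): reachable entities are gathered over the dependency edges, then time is walked backward from t_bad until all of them show low anomaly scores.

```python
# Illustrative sketch of locating the problematic interval [t_good, t_bad].
def reachable(entity, deps):
    """deps: {entity: set(entities it depends on)}; transitive closure via DFS."""
    seen, stack = set(), [entity]
    while stack:
        e = stack.pop()
        for d in deps.get(e, ()):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

def find_t_good(faulty, deps, scores, t_bad, threshold=0.2):
    """scores: {entity: {time: anomaly_score}}; times are integer steps here."""
    entities = reachable(faulty, deps) | {faulty}
    t = t_bad
    while t >= 0:
        if all(scores.get(e, {}).get(t, 0.0) <= threshold for e in entities):
            return t
        t -= 1
    return None  # no healthy point found in the recorded history

deps = {"E0": {"E2"}, "E2": {"E5"}}
scores = {"E0": {3: 0.9, 2: 0.1, 1: 0.1, 0: 0.1},
          "E5": {3: 0.8, 2: 0.7, 1: 0.1, 0: 0.1}}
print(find_t_good("E0", deps, scores, t_bad=3))  # -> 1
```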
- FIG. 7 illustrates a dependency graph.
- The edges (i.e., the lines between the nodes) again indicate dependency: an edge from E 1 to E 2 indicates that E 1 depends on E 2 , and thus that E 2 could causally affect E 1 .
- a failure is triggered respective to E 0 at 730 , and thus a time t_bad is identified.
- an embodiment will trace back all entities that can causally affect E 0 .
- E 0 at 740 is dependent on E 2 at 750 , which is further dependent on E 5 at 760 .
- If E 2 was never in a failed state, t_good would be determined based on a previously functional state of E 0 (due to E 2 being fully functional at all times).
- However, E 2 at 750 depends on E 5 at 760 , and thus E 0 depends indirectly on E 5 .
- a fault at E 5 does not directly lead to an observable change in the core functioning of E 2 as shown at 771 in the collected operational data related to entity E 2 .
- the fault did, however, affect a small subset of operations for which E 0 is transitively dependent on E 5 via E 2 .
- the periodic history of E 5 is investigated until a time is identified that E 5 was in a functional state at 780 . This time is thus labeled as t_good, because all causally dependent entities on which E 0 is reliant are determined to be in good operational state at that point.
- a large fraction of entities at 840 remained constant in value throughout the time interval and were essentially pruned out via the graph “diff” operation as being unlikely to be the root cause, as they remained consistent.
- an embodiment is more able to identify potential root causes because the number of changes in a problematic time window is much smaller than the total number of entities in an entire data center.
- An embodiment may then assign the remaining non-pruned graph nodes and edges weights based on a predetermined algorithm. For example, an embodiment may assign an initial seed-anomaly score to these non-pruned entities using the per-entity healthy state model, as shown in FIG. 9 .
- the non-pruned entities typically fall into one of three category types. The first being “Newly_Observed” at 910 .
- an example algorithm for a newly observed entity determines if the entity was observed at any point in the past (e.g., outside of the [t_bad, t_good] window). If an embodiment determines that the entity was indeed present at a previous time, the previously calculated value histogram score, discussed herein, is used.
- an entity may fall into the “Disappeared_Now” category at 920 .
- an example algorithm for a disappeared-now entity compares the entity with its historic existence profile to determine a set of variables (e.g., [disappearance_duration], [avg_hist_disappearance_duration], [stddev_hist_disappearance_duration], etc.), and then calculates the score based on the following equation: ((d_observed−d_historic_avg)/d_historic_stddev).
- an entity may fall into the “Changed_In_Value” category at 930 .
- an example algorithm for a changed in value entity compares the historic value histogram on a per-attribute basis against the entity. An embodiment may then calculate a divergence score of an attribute, which is inversely proportional to the healthy state occurrence probability determined herein. The entity would then be assigned the maximum divergence score of any associated attribute.
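- The three scoring cases might be combined as sketched below (editor's illustration; only the Disappeared_Now formula is taken directly from the text, while the fallbacks, constants, and data shapes are assumptions):

```python
# Illustrative sketch of assigning seed-anomaly scores to non-pruned entities.
def seed_score(category, entity, healthy_model):
    if category == "Disappeared_Now":
        avg = healthy_model["avg_hist_disappearance_duration"]
        std = healthy_model["stddev_hist_disappearance_duration"]
        observed = entity["disappearance_duration"]
        return (observed - avg) / std if std > 0 else float("inf")

    if category == "Newly_Observed":
        # If the entity existed at some point outside [t_good, t_bad], reuse its
        # healthy occurrence probability; otherwise treat it as maximally novel.
        prob = healthy_model.get("occurrence_probability", 0.0)
        return 1.0 - prob

    if category == "Changed_In_Value":
        # Per-attribute divergence is inversely proportional to the healthy
        # occurrence probability of the newly observed value; the entity gets
        # the maximum divergence over its attributes.
        divergences = []
        for attr, value in entity["attributes"].items():
            hist = healthy_model["value_histograms"].get(attr, {})
            divergences.append(1.0 / max(hist.get(value, 0.0), 1e-6))
        return max(divergences) if divergences else 0.0

    raise ValueError(f"unknown category: {category}")

print(seed_score("Disappeared_Now",
                 {"disappearance_duration": 120},
                 {"avg_hist_disappearance_duration": 30,
                  "stddev_hist_disappearance_duration": 10}))  # -> 9.0
```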
- an embodiment may strengthen or weaken the seed-anomaly scores based on dependency edges. As each entity is strengthened or weakened, the root causes begin to sort by score, thus creating a ranked list of root causes.
- In FIG. 10A , an embodiment illustrates a Storage Area Network (SAN) disk entity.
- a seed-anomaly score may become weaker with each cycle.
- An application process (e.g., P 0 ) may be in remote communication with P 1 and P 2 , wherein one of P 1 and P 2 is the root cause.
- P 1 and P 2 are analyzed, and it is determined that P 3 at 1010 and P 4 at 1020 , both of which P 1 depends on, are in a healthy state (e.g., operating correctly).
- P 5 at 1030 which P 2 depends on is misbehaving.
- P 1 gets a weakened score, because P 3 and P 4 are in proper condition, and thus less likely to be identified as the root cause of the failure.
- An iterative graph convergence algorithm may then be run that propagates the seed-anomaly scores or updated scores from the previous iteration along the dependency edges. Based on this algorithm, nodes having the highest weight after multiple iteration rounds are likely to be identified as root cause candidates at 160 .
- the root cause candidates are typically associated with an entity within the networked distributed system/application (e.g., integrated network system, operating system, application, virtual machine, hardware component, etc.)
- a further embodiment utilizes an iterative graph algorithm (similar to a web page ranking algorithm) that converges the final weights of the graph nodes, thus indicating the probability of a particular feature being the root cause of the identified problem, thereby creating a cause probability for each root cause candidate.
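- One way such an iterative convergence could be realized is sketched below (editor's illustration only; the patent merely compares it to a web page ranking algorithm, so the blending factor, leaf handling, and fixed iteration count are assumptions): each node repeatedly blends its own seed score with the scores of the nodes it depends on, and the final weights yield the ranked list.

```python
# Illustrative sketch of the iterative propagation/convergence step.
def propagate(seed_scores, deps, alpha=0.5, iterations=20):
    scores = dict(seed_scores)
    for _ in range(iterations):
        updated = {}
        for node, seed in seed_scores.items():
            neighbours = deps.get(node, ())
            if neighbours:
                neighbour_avg = sum(scores.get(n, 0.0) for n in neighbours) / len(neighbours)
                updated[node] = (1 - alpha) * seed + alpha * neighbour_avg
            else:
                updated[node] = seed  # leaf entities keep their seed score
        scores = updated
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

seeds = {"P0": 0.2, "P1": 0.6, "P2": 0.6, "P3": 0.0, "P4": 0.0, "P5": 0.9}
deps = {"P0": {"P1", "P2"}, "P1": {"P3", "P4"}, "P2": {"P5"}}
print(propagate(seeds, deps))  # P5 stays highest, P2 is strengthened,
                               # P1 is weakened by its healthy P3/P4
```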
- a further embodiment of root cause identification is shown at 240 .
- An example graphical user interface (GUI) is shown in FIG. 11 .
- an embodiment may display the ranked results and allow for faceted graph navigation. For example, an embodiment may allow a user to search at 1110 using a particular term like root or mysql. Additionally, an embodiment may allow a user to select a specific time or time interval range for the search function at 1120 .
- an embodiment may allow a user to narrow the search based on various facets, for example, the featuretypes at 1130 . Additionally or alternatively, the user may further narrow the search based on facets of the namespace at 1140 . The featuretype and namespace variables are further discussed herein with respect to FIG. 3 .
- the search results are displayed at 1150 .
- An embodiment may include within the search results a summary of the root cause entity (e.g., namespace, crawltime, featuretype, etc.).
- FIG. 12 identifies the technical improvement to the existing method of identifying a root cause as done by IT administrators currently.
- Typically, three types of data get examined: metric, log, and fine grained machine state data.
- the typical drill-down approach is to use metric data to detect service level agreement (SLA) violations and thereby identify the faulty application component.
- the log data is then parsed to identify error or warning messages, thus further identifying faulty application components.
- the granularity at which metric and log data can pinpoint the cause of the fault falls short of what is needed in the field.
- computing node 10 ′ is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein. Regardless, computing node 10 ′ is capable of being implemented and/or performing any of the functionality set forth hereinabove. In accordance with embodiments of the invention, computing node 10 ′ may be part of a cloud network or could be part of another type of distributed or other network (e.g., it could represent an enterprise server), or could represent a stand alone node.
- computing node 10 ′ there is a computer system/server 12 ′, which is operational with numerous other general purpose or special purpose computing system environments or configurations.
- Examples of well known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12 ′ include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand held or laptop devices, multiprocessor systems, microprocessor based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.
- Computer system/server 12 ′ may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system.
- program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types.
- Computer system/server 12 ′ may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network.
- program modules may be located in both local and remote computer system storage media including memory storage devices.
- computer system/server 12 ′ in computing node 10 ′ is shown in the form of a general purpose computing device.
- the components of computer system/server 12 ′ may include, but are not limited to, at least one processor or processing unit 16 ′, a system memory 28 ′, and a bus 18 ′ that couples various system components including system memory 28 ′ to processor 16 ′.
- Bus 18 ′ represents at least one of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus.
- Computer system/server 12 ′ typically includes a variety of computer system readable media. Such media may be any available media that are accessible by computer system/server 12 ′, and include both volatile and non-volatile media, removable and non-removable media.
- System memory 28 ′ can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 ′ and/or cache memory 32 ′.
- Computer system/server 12 ′ may further include other removable/non-removable, volatile/non-volatile computer system storage media.
- storage system 34 ′ can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”).
- Similarly, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”) and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided; in such instances, each can be connected to bus 18 ′ by at least one data media interface.
- memory 28 ′ may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
- Program/utility 40 ′ having a set (at least one) of program modules 42 ′, may be stored in memory 28 ′ (by way of example, and not limitation), as well as an operating system, at least one application program, other program modules, and program data. Each of the operating systems, at least one application program, other program modules, and program data or some combination thereof, may include an implementation of a networking environment.
- Program modules 42 ′ generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
- Computer system/server 12 ′ may also communicate with at least one external device 14 ′ such as a keyboard, a pointing device, a display 24 ′, etc.; at least one device that enables a user to interact with computer system/server 12 ′; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12 ′ to communicate with at least one other computing device. Such communication can occur via I/O interfaces 22 ′. Still yet, computer system/server 12 ′ can communicate with at least one network such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 20 ′.
- network adapter 20 ′ communicates with the other components of computer system/server 12 ′ via bus 18 ′.
- It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12 ′. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Abstract
Description
- Most enterprises resort to hosting their applications on a co-located or cloud datacenter. Typically, these applications are complex distributed applications that in addition to comprising multiple components (e.g., modules or micro-services) may require complex interactions between the different components. Furthermore, these applications may rely on specific infrastructure and middleware components provided by the cloud provider itself. It is vital to business operations that these cloud hosted distributed applications are constantly available, because the cost of downtime can be significant. It is not hyperbole to state that a single hour of downtime can cost a business retailer tens of thousands of dollars.
- For a better understanding of exemplary embodiments of the invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the claimed embodiments of the invention will be pointed out in the appended claims.
-
FIG. 1 illustrates an example embodiment of identifying a root cause of a failure in a distributed system. -
FIG. 2 illustrates another example embodiment of identifying a root cause of a failure in a distributed system. -
FIG. 3 illustrates an example of key value based machine state data an embodiment may collect. -
FIG. 4 schematically illustrates an example property graph of a networked distributed system/application. -
FIG. 5 illustrates the lifespan based profile component of the healthy model, at a per entity level. -
FIG. 6 illustrates the value histogram based profile component of the healthy model, at a per entity level. -
FIG. 7 illustrates a determination of a problematic time interval. -
FIG. 8 illustrates a categorization table of entities within a problematic time interval. -
FIG. 9 illustrates an example scoring algorithm for a seed-anomaly score. -
FIG. 10A illustrates an example of seed score strengthening. -
FIG. 10B illustrates an example of seed score weakening. -
FIG. 11 illustrates an example graphical user interface for a root cause search application. -
FIG. 12 illustrates an example embodiment of identifying root causes of failures in a deployed distributed application using historical fine grained machine state data. -
FIG. 13 illustrates an example computer system. - It will be readily understood that the components of the embodiments of the invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations in addition to the described exemplary embodiments. Thus, the following more detailed description of the embodiments of the invention, as represented in the figures, is not intended to limit the scope of the embodiments of the invention, as claimed, but is merely representative of exemplary embodiments of the invention.
- Reference throughout this specification to “one embodiment” or “an embodiment” (or the like) means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” or the like in various places throughout this specification are not necessarily all referring to the same embodiment.
- Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in at least one embodiment. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art may well recognize, however, that embodiments of the invention can be practiced without at least one of the specific details thereof, or can be practiced with other methods, components, materials, et cetera. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
- In a large networked distributed system (e.g., a cloud-hosted distributed application running in a shared datacenter), it is typical for resources to be shared and utilized by a large number of varying application systems. Because so many of the resources or entities within a networked distributed application system are dependent on each other, the failure of a single entity can cause a cascading failure throughout the system. Thus, when a single system or multiple systems fail, it can be difficult to determine which particular resources are at fault. Further, since the distributed application has so many dependent resources, diagnosing which one is the root cause of the problem can be very challenging. Therefore, it is vital that these systems be monitored and maintained to ensure that, when a fault occurs, the root cause of the fault can be determined quickly so as to ensure the highest possible uptime.
- However, due to the scale and complexity of current networked distributed systems, one of the major problems faced by system administrators is the diagnosis and identification of the root cause of a failure of a distributed application (e.g., a deployed cloud application) within the interconnected network. It can be particularly difficult when a fault is observed in a distributed application that is running (e.g., a currently active application).
- In order to assist the Information Technology (IT) administrators (subject matter experts) in the root cause analysis process, tools have been developed to reduce the amount of manual effort spent on the identification process. However, conventional tools simply analyze metric and log data, and thus are unable to pinpoint the precise root cause of the problem. This is due to the inherent nature of metric and log data itself.
- Thus, due to the shortcomings of current solutions, IT administrators are required to remotely log into particular machines and manually run tests on the faulty components while analyzing a large volume of fine grained machine state entities (e.g., processes, configuration files, packages, connections, mounted disk partitions, file system metadata, etc.) that may be related to the observed issues (e.g., SLA violations) in the metrics and/or the messages in the error log.
- In problem diagnosis, speed matters; thus there is a need for techniques that allow for enhanced automation via fine grained root cause identification. Problem diagnosis of distributed systems is challenging for various reasons, for example, the increasing number of participating distributed components within a typical application, or the large variety of potential causes for failed applications. Further examples include but are not limited to: application and deployment misconfigurations; application code related errors or performance bugs; a change to dependent packages/shared libraries; issues with hosted infrastructure (e.g., shared resource contention); and the like.
- As of today, there exists no tool to automatically analyze fine grained machine state data and identify the root cause, resulting in IT administrators spending an inordinately large amount of time manually analyzing the huge volume of fine grained machine state entities that might be related to the fault.
- As a result of this approach, current solutions for root cause diagnosis still require large amounts of time and energy from a subject matter expert to fully identify (i.e., pinpoint at a fine grained level) the actual root cause. This required manual inspection of potential causes by a subject matter expert is time and cost intensive. Thus, a solution is needed that can automate the identification process and do so at a granular level to specifically identify the root cause.
- However, although fine grained machine state data offer immense potential to help fully identify the root cause, or to pinpoint the root cause at a fine grained level, building a solution that automates the process creates a technical challenge: operating on fine grained machine state data is orders of magnitude more demanding than the analysis performed by current solutions (e.g., those that analyze metric or log data), because the number of fine grained machine state entities is far higher than the number of collected metrics and log files currently analyzed.
- Furthermore, the technical problem is not only one of tackling the scale and volume of fine grained machine entities, but also of devising new techniques that can operate on them. For instance, the techniques used to analyze metric data (e.g., tracking the average of a numeric metric and reporting alerts based on significant deviations from that average) do not apply to machine entities. Similarly, the techniques used for analyzing log data (e.g., searching for log lines that report errors or warning messages) fail to account for machine entities.
- Therefore, an embodiment allows for root cause identification to be automated. This is enabled through periodically collecting very fine grained machine state data of various types (e.g., processes, connections, configuration settings, packages, application metrics, attributes of shared infrastructure (e.g., disk, central processing unit (CPU), memory, etc.)). This machine state data is then used to discover application invariants on the machine state (e.g., a list of typical processes the current application starts, a list of typical remote connection ports, a list of typical shared libraries accessed, a list of configuration files read, etc.). An invariant is a condition that can be relied upon to be true during execution of a program, or during some portion of it; it is a logical assertion that is held to always be true during a certain phase of execution. Then, based on the collected information, correlations are generated across anomalies (i.e., deviations of a faulty state from the invariants) of various types of machine state data related to running applications. An embodiment may then take the discovered correlations and identify possible root causes of a fault.
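- By way of a non-limiting illustration only (and not as a description of the claimed implementation), the following Python sketch shows how such invariants might be mined from periodically crawled machine state: items that appear in nearly all healthy snapshots are treated as invariants, and a later snapshot is compared against them. The snapshot layout, the field names, and the 90% support threshold are assumptions made purely for this example.
```python
from collections import Counter

# Hypothetical healthy snapshots: one record of crawled machine state per crawl.
healthy_snapshots = [
    {"processes": {"mysqld", "java", "nginx"}, "remote_ports": {3306, 8080}},
    {"processes": {"mysqld", "java", "nginx"}, "remote_ports": {3306, 8080}},
    {"processes": {"mysqld", "java", "nginx", "backup.sh"}, "remote_ports": {3306, 8080, 22}},
]

def mine_invariants(snapshots, min_support=0.9):
    """Items (processes, ports, ...) present in at least `min_support` of the
    healthy snapshots are treated as invariants of the application."""
    invariants = {}
    for key in ("processes", "remote_ports"):
        counts = Counter()
        for snap in snapshots:
            counts.update(snap[key])
        threshold = min_support * len(snapshots)
        invariants[key] = {item for item, c in counts.items() if c >= threshold}
    return invariants

def anomalies(snapshot, invariants):
    """Deviations of a (possibly faulty) snapshot from the mined invariants."""
    return {
        key: {
            "missing": invariants[key] - snapshot.get(key, set()),
            "unexpected": snapshot.get(key, set()) - invariants[key],
        }
        for key in invariants
    }

inv = mine_invariants(healthy_snapshots)
faulty = {"processes": {"java", "nginx"}, "remote_ports": {8080, 9999}}
print(anomalies(faulty, inv))   # mysqld and port 3306 missing; port 9999 unexpected
```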
- In another embodiment, fine grain machine state data (e.g., processes, configuration files, installed packages, metrics, infrastructure shared resource utilization, etc.) is periodically collected and analyzed from physical machines and/or virtual machines (VMs) on which the distributed application is deployed. This periodic collection directly from running applications when an entire system is healthy allows an embodiment to construct a healthy reference model. This healthy reference model captures application invariants over a variety of machine states. When an application fault is observed, an embodiment compares the recent machine state data collected from the failed application against the application's healthy reference model (i.e., typical invariants). Based on the difference between the healthy reference model and the faulty state, an embodiment may identify potential root causes for the observed fault. The embodiment then utilizes a root cause inference algorithm that is able to pinpoint the root cause and/or return a ranked list of potential root causes with the most likely or relevant cause being given high rank. The inference algorithm calculates the divergence score of the entity's state at the time of fault as compared to the healthy state model of that entity.
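- A purely illustrative sketch of one way such a divergence score could be computed is given below: a healthy profile keeps, for each attribute of an entity, a histogram of the values observed while the application was healthy, and the fault-time state is scored by how improbable its attribute values are under that history. The class, the attribute names, and the max-based scoring rule are hypothetical simplifications, not the claimed inference algorithm.
```python
from collections import defaultdict

class HealthyAttributeModel:
    """Per-entity healthy profile: a value histogram for each attribute."""

    def __init__(self):
        self.value_counts = defaultdict(lambda: defaultdict(int))
        self.samples = defaultdict(int)

    def observe(self, attrs):
        # Called once per healthy crawl of this entity.
        for name, value in attrs.items():
            self.value_counts[name][value] += 1
            self.samples[name] += 1

    def divergence(self, attrs):
        """Max over attributes of (1 - healthy occurrence probability of the value)."""
        score = 0.0
        for name, value in attrs.items():
            n = self.samples.get(name, 0)
            p = self.value_counts[name][value] / n if n else 0.0
            score = max(score, 1.0 - p)
        return score

model = HealthyAttributeModel()
for crawl in ({"threads": 8, "openfiles": 120},
              {"threads": 8, "openfiles": 120},
              {"threads": 8, "openfiles": 120}):
    model.observe(crawl)

print(model.divergence({"threads": 8, "openfiles": 120}))   # 0.0 -> matches the healthy history
print(model.divergence({"threads": 64, "openfiles": 120}))  # 1.0 -> thread count never seen while healthy
```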
- An even further embodiment may include a training phase, which is executed during the operation of a healthy system. Initially, an embodiment may periodically collect fine grained machine data. Once the data is collected, an embodiment may represent the collected data in a temporally evolving property graph model. Based on this temporally evolving property graph, a healthy profile is built on a per entity basis from the time series of evolution of the entity's state.
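- A minimal sketch of such a training computation, assuming a simplified snapshot store keyed by crawl time and time-invariant vertex keys (machine plus entity name rather than, e.g., a process ID), might derive a per-entity occurrence probability and lifespan profile as follows; the data layout and values are illustrative only.
```python
import statistics

# Hypothetical snapshot store: crawl time -> {vertex_key: attribute dict}.
# Crawl times are simplified to consecutive integers, and vertex keys are
# built to be time invariant (machine + entity name, not a process ID).
snapshots = {
    0: {"vm1:process:mysqld": {"threads": 8}, "vm1:config:my.cnf": {"port": 3306}},
    1: {"vm1:process:mysqld": {"threads": 8}, "vm1:config:my.cnf": {"port": 3306}},
    2: {"vm1:process:mysqld": {"threads": 8}},                      # config briefly absent
    3: {"vm1:process:mysqld": {"threads": 8}, "vm1:config:my.cnf": {"port": 3306}},
}

def entity_profile(snapshots, key):
    """Occurrence probability and contiguous-existence profile for one entity."""
    times = sorted(snapshots)
    present = [t for t in times if key in snapshots[t]]
    occurrence_probability = len(present) / len(times)

    # Split the presence times into contiguous runs to approximate version lifespans.
    runs, run = [], [present[0]] if present else []
    for prev, cur in zip(present, present[1:]):
        if cur == prev + 1:
            run.append(cur)
        else:
            runs.append(run)
            run = [cur]
    if run:
        runs.append(run)
    lifespans = [len(r) for r in runs]
    return {
        "occurrence_probability": occurrence_probability,
        "avg_lifespan": statistics.mean(lifespans) if lifespans else 0,
        "stdev_lifespan": statistics.pstdev(lifespans) if lifespans else 0,
    }

print(entity_profile(snapshots, "vm1:config:my.cnf"))
# -> {'occurrence_probability': 0.75, 'avg_lifespan': 1.5, 'stdev_lifespan': 0.5}
```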
- An embodiment may then include a root cause diagnosis phase, which is executed when a predetermined trigger occurs (e.g., getting a service level agreement (SLA) violation, receiving a ticket, etc.). The root cause diagnosis phase may first determine the problematic time interval [t_good, t_bad] using the metrics and dependency edges. A graph-diff (difference determining graph) is then created based on the g(t_bad) and g(t_good) to identify the set of potentially anomalous entities. Once the graph is created, an embodiment will assign a seed-anomaly score to each potentially anomalous entity based on their divergence from the healthy state model. An embodiment may then use dependency edges to strengthen or weaken the seed-anomaly scores, using the various methods described herein, to create a ranked list of root causes. Once the ranked list exists, it may be displayed in a faceted manner with additional navigation options.
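- The diagnosis phase outlined above could be wired together roughly as in the sketch below, which diffs the entity sets captured at t_good and t_bad, seeds an anomaly score for each changed entity, and then iteratively propagates scores along dependency edges (in the spirit of a web page ranking style algorithm) before sorting into a ranked list. The graph encoding, the flat seed scores, and the damping factor are assumptions for illustration and are not the claimed algorithm.
```python
def graph_diff(state_good, state_bad):
    """Classify entities as new, disappeared, or changed between two snapshots."""
    new = set(state_bad) - set(state_good)
    disappeared = set(state_good) - set(state_bad)
    changed = {k for k in set(state_good) & set(state_bad)
               if state_good[k] != state_bad[k]}
    return new, disappeared, changed

def rank_root_causes(seed_scores, depends_on, rounds=10, damping=0.5):
    """Iteratively strengthen an entity's score with the anomaly scores of the
    entities that depend on it, then rank by final score."""
    scores = dict(seed_scores)
    # Invert the dependency edges: who depends on me?
    dependents = {k: [] for k in scores}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)
    for _ in range(rounds):
        updated = {}
        for node, seed in seed_scores.items():
            incoming = dependents.get(node, [])
            pushed = sum(scores.get(d, 0.0) for d in incoming)
            updated[node] = (1 - damping) * seed + damping * pushed / max(len(incoming), 1)
        scores = updated
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy states at t_good and t_bad: the application process misbehaves, and both a
# shared library version and the SAN disk it depends on have changed.
state_good = {"proc:app": {"threads": 8},  "pkg:libssl": {"ver": "1.0"}, "disk:san": {"latency": "low"}}
state_bad  = {"proc:app": {"threads": 64}, "pkg:libssl": {"ver": "1.1"}, "disk:san": {"latency": "high"}}

new, gone, changed = graph_diff(state_good, state_bad)
seeds = {k: 1.0 for k in new | gone | changed}          # flat seed scores for the sketch
depends_on = {"proc:app": ["pkg:libssl", "disk:san"]}   # the process reads the library and the disk
print(rank_root_causes(seeds, depends_on))
```
In this toy run the changed shared library and the changed SAN disk, on which the faulty process depends, converge to a higher score than the process that merely exhibits the symptom, which is the kind of ranking the dependency-edge strengthening described above is meant to produce.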
- The description now turns to the figures. The illustrated embodiments of the invention will be best understood by reference to the figures. The following description is intended only by way of example and simply illustrates certain selected exemplary embodiments of the invention as claimed herein.
- Specific reference will now be made here below to the figures. It should be appreciated that the processes, arrangements and products broadly illustrated therein can be carried out on, or in accordance with, essentially any suitable computer system or set of computer systems, which may, by way of an illustrative and non-restrictive example, include a system or server such as that indicated at 12′ in
FIG. 13 . In accordance with an exemplary embodiment, most if not all of the process steps, components and outputs discussed with respect toFIG. 1 can be performed or utilized by way of a processing unit or units and system memory such as those indicated, respectively, at 16′ and 28′ inFIG. 13 , whether on a server computer, a client computer, a node computer in a distributed network, or any combination thereof. - Broadly contemplated herein, in accordance with at least one embodiment of the invention are methods and arrangements which involve collecting fine grain machine state data and converting that data into a property graph model. An embodiment may then, based upon the property graph representation, construct a healthy state reference model. An embodiment then calculates the divergence (i.e., difference) between a detected fault state and the known healthy state model. Additionally, an embodiment utilizes a root cause inference algorithm to exploit specific property graph based modeling techniques as used herein. The embodiment then maps or converts multi-silo machine state data into a key value based property graph, wherein different machine state features are nodes/vertices of the graph. The terms node and vertex are used interchangeably throughout this disclosure when in reference to graph generation. Furthermore, each node additionally has a property or attribute list in the form of key value pairs.
- Referring now to
FIG. 1 which schematically illustrates a system architecture, in accordance with at least one embodiment of the invention. At 110, an embodiment may collect machine state data. The machine state data may be collected through a variety of methods. For example, an embodiment may utilize a crawler which systematically browses (i.e., crawls) through the integrated system and records information at various states and times. The crawler then indexes content within the system as it crawls. In addition, the crawler has the ability to recognize certain characteristics of the system (e.g., particular dependencies, communication between entities, certain code operators, etc.). - Additionally or alternatively, the data may be collected via manual entry of a user (e.g., a user may enter specific components and their exiting dependences). An embodiment may also collect machine state data from multiple silos of data, for example, metrics, configuration files, files, processes, packages, connections, development operations, tickets submitted indicating potential changes or updates, known events detected, logs, administrative operations, etc.
- A further example embodiment is shown in
FIG. 2 at 210. An embodiment may, during the fine grained machine state data collection, draw from multiple sources. For example, and as shown in 210, work ticket requests, developer operations, standard metrics, infrastructure events, data logs, configuration files, administration operations, and the like can be used for data collection. The fine grain data may be collected from various levels within the distributed system (e.g., application, middleware, infrastructure, etc.). - Application metrics may be for example infrastructure incidents, infrastructure availability, infrastructure utilization and performance, application issues, application availability, application utilization and performance, operations, application support, etc. Collecting data related to the configuration files may help detect changes to the configuration files themselves (e.g., change of remote port, thread pool size, etc.). Collecting data related to the processes may help detect processes that have crashed or even new resource heavy process that may have been created. Collecting data related to the packages may help detect changes to typical opened files (e.g., a change to a new version of a shared library due to package upgrades). Collecting data related to the connections may help detect missing network connections to remote topology nodes of the same application. Collecting data related to the development operations may be used to detect recent changes to code deployment data.
- Referring briefly to
FIG. 3 , additional examples and explanations regarding machine state data that is collected is shown. For example, at 310 a “namespace” is determined, wherein a namespace may be a tag that is associated by the crawler to represent the source of the data. For example, namespaces may be the machine name, or a representation of the machine (e.g., <machine-name, application-component>), or even simply the application and component name if only one such component exists in the application cluster. - Further, as discussed herein, an embodiment may assign a “featuretype” at 330 to uniquely represent each fine grained machine entity collected on a particular namespace. Thus, the <namespace:featuretype>tuple may be used to uniquely identify the different fine grained machine entities collected by the system. Moreover, an embodiment may ensure that the featuretype is assigned in a way, such that a featuretype of the same machine entity across a historic timeline is time invariant (e.g., using a name instead of a process ID for processes).
- Additionally shown in
FIG. 3 , an embodiment may record the “crawltime.” The crawltime is the recorded time during which the data was collected. This allows an embodiment to associate the recorded data state with a particular reference time. This is important because, as discussed further herein, the machine state data is continuously collected and stored in a historical repository. The repository is then queried when a failure is detected and a time is identified for which the machine state was healthy (e.g., t_good). An embodiment may also record the featuretype as shown at 330 which allows an embodiment to differentiate between the types of entitles states being stored (e.g., operating system (OS), disk, configuration file, package, connection, process, etc.). - Referring back to
FIG. 1 , once the data is collected via any method (e.g., those disclosed herein) it is stored on a storage device (e.g., that shown at 34′) at 120. The storage system also houses the historical data states. The historical data states are associated with a known time period during which the state was taken (i.e., a snapshot) as discussed herein. The newly acquired machine state data is then added to the existing historical state data at 130. This is because the machine state data is routinely collected at given intervals, and the historical machine state data is updated based on the continual collecting. For example, a system could use a crawler, like that discussed herein, to constantly crawl through the network system and record data states. Alternatively, an embodiment may only take snapshots once daily, weekly, or monthly depending on the demands of the infrastructure. - Based on the acquired historical data, an embodiment creates a time evolving property graph representation. Brief reference will now be made to
FIG. 4 . An embodiment may, as shown inFIG. 4 , create a property graph, which links different machine features that have specific dependencies or causal relationships between each other as edges. For example, an edge (i.e., the lines between the nodes) between feature node N1 and N2 at 410 implies that N1 depends on N2. In other words N2 could causally affect N1. Additional examples may be a particular process (i.e., a graph node) having an edge to a configuration file (i.e., another graph node) that it reads such as at 420. - In order to map the collected data snapshot at time (t) into a property graph (e.g., G=(V, E)) an embodiment utilizes vertex modeling and edge modeling. In vertex modeling (V) equals a set of nodes/vertices. Each vertex has a vertex key/id and a set of properties associated with it. An embodiment may convert the collected machine state data into the vertex properties by: (1) flattening hierarchical key space into unique vertex key/id. (e.g., prefix keys by virtual machine name); (2) using intelligent design to ensure the keys are time invariant (e.g., using a name instead of a process ID for processes); and (3) causing the attributes of the features (e.g., the JavaScript Object Notation (JSON) fields) to become properties of the vertex (i.e., list of key/value pairs). An embodiment may create the vertex type annotations based on featuretype. The different classes of vertex annotations (e.g., process, configuration, metric, topology, etc.) may be leveraged to trigger a class/type specific modeling and root cause inference logic.
- In an embodiment, the edge modeling (E) comprises a set of edges representing dependencies between vertex keys (e.g., a process reading a configuration file, a file belonging to a particular package, a process opening a connection to a remote port, etc.). The dependency relationships may be of the varying forms. For example, a metric entity may depend on a process entity which is being produced (e.g., 430). Additionally, a process entity may depend on the configuration files from which it reads (e.g., 440). A further example may be a process entity depending on a shared library package entity (e.g., 450). In an even further example, a process may depend on a virtual machine (VM) or machine disk entity on which it runs (e.g., 460), other examples could include CPU entities and memory entities related to the VM. An additional example may be one process entity depending on another remote process entity that it interacts with (i.e., inferring relationships from connections to the different entities) (e.g., 470). An additional example embodiment of a time evolving machine property graph comprising a web of fine grained machine entities interconnected with dependency edges is shown in
FIG. 2 at 220. - After mapping the collected machine state data into the property graph representation, an embodiment may construct a healthy state reference model by aggregating multiple property graphs related to different time snapshots of the application when healthy. These snapshots may be annotated and aggregated, with specific techniques based on the node vertex modeling in combination with the edge modeling. One example embodiment, such as that at 230, may create a healthy state model at the node level by identifying what is considered “normal” on particular nodes, and what could be expected based on a particular node (e.g., is the node always present, intermittent, etc.). Additionally, node attributes may be factors (e.g., identifying typical ranges for attribute values and determining how much fluctuation is normal).
- Moreover, the vertex is the union of nodes/vertices across different datacenter states at various points in time, wherein each vertex may have various characteristics. One such characteristic is “occurrence-probability,” which in addition to the vertex ‘key/id’ described herein, has an additional “vertexkey_occurence_probability” characteristic created which measures how often a vertex was part of the state snapshot. Another characteristic is “lifespan profile,” which relates to the expected duration (and deviation) of an entity remaining in the same state before it is updated. Additionally, a characteristic labeled “attribute value histogram,” which is discussed further herein, may be used to enable an embodiment to maintain typical ranges of attribute values (for each attribute) to allow the embodiment to learn what amount of fluctuation is normal.
- An embodiment may also use edge level analysis in building the healthy model. For example, identifying which set of edges occur as invariants, which set of edges are intermittent, which set of edges are rare, etc. Similar to the node analysis, attribute level analysis may be used, for example, to determine what typical edge attributes values are normal, or what normal attribute behavior can be expected. Also as above, an embodiment may aggregate the historical information per edge entity into aggregated attributes, for example occurrence probability, lifespan distribution, and attribute value histogram.
- Referring now to
FIG. 5 , an embodiment may create a healthy state model at a per entity level, which allows for tracking the entity via its historic lifespan profile. The time series or lifecycle at 510 of an entity may be represented as a series of its constituent versions. The version of an entity is updated, either when any of the attributes change (540), or it transitions from exiting to non-existing (520), or non-existing to existing (530). Based on the time series of its successive versions, the lifespan of each version can be computed as the time interval [v.b, v.d] wherein “v.b” is the version's birth time and “v.d” is the version's death time. Based the lifespan of each version, the lifespan profile of the entity can be computed to track what the average lifespan is, and also the standard deviation therein. - Further, an embodiment, based on
FIG. 5 , may, compute an average and standard deviation of the contiguous existence (i.e., discounting changes to its attribute values) of an entity during apredetermined time 540. For example, based onFIG. 5 , an embodiment would calculate the contiguous existence durations as: [(v1.d−v0.b), (v2.d−v2.b), (v3.d−v3.b)]. A further embodiment may additionally or alternatively calculate the average and standard deviation of non-existing (or disappearance) durations as: [(v2.b−v1.d), (v3.b−v2.d)]. By capturing the lifespan, existence, and non-existence profiles of entities in this manner, an embodiment is able to better summarize and determine whether an entity is always existing, transient in nature, or rarely existing. - Referring now to
FIG. 6 , another embodiment may, based on the healthy model track the historic profile at a per-entity level for each of its attributes. By way of example, an embodiment may have an entity “E” at 610. As discussed herein, entities (E) may have multiple versions (e.g., E.v1, E.v2, etc.). The entity may also be associated a list of attributes (e.g., A1, A2, A3, A4, etc.) at 620. The attributes correspond to specific factors related to the entity (e.g., attributes for the process entity (340), such as: cmd, threads, pid, openfiles, user, etc. as shown inFIG. 3 ). - The attributes have the ability to change over time, and thus an embodiment monitors them with regular periodicity as discussed herein. Based on the periodically captured information, a value histogram is calculated for the occurrence probability of each of the attributes at 630-660. Using this histogram, an embodiment may determine if, or which, attributes' value fluctuates. For example, the attribute A2 (640) does not fluctuate at all, whereas attribute A4 (660) fluctuates reasonably. An embodiment may also capture, for a fluctuating attribute, a typical dominating value of the attribute (e.g., A1 observed to have the value ‘11’ with 85% probability). Thus, the value histogram allows an embodiment to determine if the fluctuations are benign or abnormal by observing typical faults.
- Referring back to
FIG. 1 , this continuous machine state capturing, recording, and model creation will continue if no fault is detected at 140. However, when a fault is detected or observed at 140, an embodiment may compare the failed machine state against the created healthy map model at 150. This comparison allows an embodiment to determine a difference between the aggregated and annotated property graphs related to the healthy reference model and the property graph captured at the faulty machine state time. - In an embodiment, a specific instance (e.g., a failure of an application violating an SLA, a user raises a ticket, etc.) triggers a flag observed at a time which an embodiment records as “t_bad.” The embodiment will then need to identify a time in the past (e.g., t_good) such that potential root causes of the problem are related to the changes in the time-interval [t_good, t_bad]. In order to accurately identify possible root causes, amongst all possible time intervals, an embodiment needs to minimize the interval [t_good, t_bad] as much as possible, while also ensuring the search is broad enough to capture all potential root causes for the failure of a target application.
- One embodiment may receive dependency graph data which is mined from the fine grained machine state data discussed herein. Additionally or alternatively, an embodiment may also receive metric data for all metrics collected from a datacenter (e.g., metrics across all applications, middleware, infrastructure, etc.). Additionally, an embodiment may have an alert threshold, which may be utilized to determine when/if a metric is in a good or bad state. An embodiment may also be able to use algorithms to infer an alert threshold. For example, an embodiment may use a change point detection algorithm, which utilizes the historical value of a metric to detect sudden change points reflecting transitioning to bad state.
- Once the above data is received, an embodiment may detect the set of metrics that are attached to any machine entities that are “reachable” from the machine entity (e.g., process) associated with a target application that was observed to fail or misbehave. For example, an embodiment may start at t_bad, and proceed backward in time until all “reachable metrics” (e.g., dependencies) are determined to be in a good state, thus determining a t_good.
- Referring to
FIG. 7 , an embodiment determines the problematic time interval [t_good, t_bad] using the time services of divergence scores and dependency edges. For example, an embodiment first acquires a set of machine entities that are reachable via dependency edges from the observed faulty entity. Then, based the time series of seed-anomaly scores for each of the reachable entities, an embodiment must determine the nearest time (moving backward in time) from t_bad when all reachable entities have low divergence or anomaly scores. This point in time is then labeled t_good. - By way of specific example,
FIG. 7 illustrates a dependency graph. The edge (i.e., the lines between the nodes) between feature node E1 and E2 at 710 implies that E1 depends on E2. In other words E2 could causally affect E1. In an embodiment, a failure is triggered respective to E0 at 730, and thus a time t_bad is identified. Then, based on the dependency graph, an embodiment will trace back all entities that can casually affect E0. By way of example, E0 at 740 is dependent on E2 at 750, which is further dependent on E5 at 760. As shown in the time dependent graph at 770, E2 was never in a failed state. Thus, if E0 was only dependent on E2, t_good would be determined based on a previously functional state of E0 (due to E2 being fully functional at all times). However, because E2 at 750 depends on E5 at 760, E0 depends indirectly on E5. It may also be possible that that a fault at E5 does not directly lead to an observable change in the core functioning of E2 as shown at 771 in the collected operational data related to entity E2. However, the fault did affect a small subset of operations for which E1 is transitively dependent on E5 via E2. Thus the periodic history of E5 is investigated until a time is identified that E5 was in a functional state at 780. This time is thus labeled as t_good, because all causally dependent entities on which E0 is reliant are determined to be in good operational state at that point. - Referring now to
FIG. 8 , once an embodiment has computed a [t_good, t_bad] problematic time-interval, it may take a “diff” of the property graph between t_good and t_bad to generate at least three classes of entities that could be potential root causes: (1) new entities that were observed in the problematic time-interval but did not exist at t_good (shown in 810), (2) entities that were present at t_good but disappeared/deleted in this problematic time-interval (shown in 820), and (3) entities that were observed before but changed in the value of their attributes (shown in 830). In a further embodiment, a large fraction of entities at 840 remained constant in value throughout the time interval and were essentially pruned out via the graph “diff” operation as being unlikely to be the root cause, as they remained consistent. Thus, an embodiment is more able to identify potential root causes because the number of changes in a problematic time window is much smaller than the total number of entities in an entire data center. - An embodiment may then assign the remaining non-pruned graph nodes and edges weights based on a predetermined algorithm. For example, an embodiment may assign an initial seed-anomaly score to these non-pruned entities using the per-entity healthy state model is shown in
FIG. 9 . The non-pruned entities typically fall into one of three category types. The first being “Newly_Observed” at 910. As shown inFIG. 9 , an example algorithm for a newly observed entity determines if the entity was observed at any point in the past (e.g., outside of the [t_bad, t_good] window). If an embodiment determines that the entity was indeed present at a previous time, the previously calculated value histogram score, discussed herein, is used. - Alternatively, an entity may fall into the “Disappeared_Now” category at 920. Again, as shown in
FIG. 9 , an example algorithm for a disappeared now entity compares the entity with a historic existence profile and determines a set of variables (e.g., [disappearance_duration], [avg_hist_disappearance_duration], [stddev_hist_disappearance_duration]. etc.), and then calculate the score based on the following equation: ((d_observed-d_historic_avg)/d_historic_stddev). - Finally, an entity may fall into the “Changed_In_Value” category at 930. Once again, as shown in
FIG. 9 , an example algorithm for a changed in value entity compares the historic value histogram on a per-attribute basis against the entity. An embodiment may then calculate a divergence score of an attribute, which is inversely proportional to the healthy state occurrence probability determined herein. The entity would then be assigned the maximum divergence score of any associated attribute. Referring toFIGS. 10A and 10B , once the scores are assigned to each entity, an embodiment may strengthen or weaken the seed-anomaly scores based on dependency edges. As each entity is strengthened or weakened, the root causes begin to sort by score, thus creating a ranked list of root causes. Due to the fact that the scores strengthening and weakening depend on dependencies, an entity that is heavily dependent upon a large number of other entities will receive a high score. For example, inFIG. 10A , an embodiment illustrates a Storage Area Network (SAN) disk entity. Thus, because a SAN disk affects a great many other entitles (e.g., all disk I/O heavy process entities that access or read from it) it has its seed-anomaly score strengthened. - Additionally or alternatively, a seed-anomaly score may become weaker with each cycle. By way of example, and referring now to
FIG. 10B , an embodiment may weaken the seed score if an application process (e.g., P0) fails or behaves improperly. PO may be in remote communication with P1 and P2, wherein one of P1 and P2 is the root cause. Thus, the dependencies of P1 and P2 are analyzed, and it is determined that P3 at 1010 and P4 and 1020, both of which P1 depends on are in a healthy state (e.g., operating correctly). However, when analyzing P2, it is determined that P5 at 1030 which P2 depends on is misbehaving. In the foregoing embodiment, P1 gets a weakened score, because P3 and P4 are in proper condition, and thus less likely to be identified as the root cause of the failure. - An iterative graph convergence algorithm may then be run that propagates the seed-anomaly scores or updated scores from the previous iteration along the dependency edges. Based on this algorithm, nodes having the highest weight after multiple iteration rounds are likely to be identified as root cause candidates at 160. The root cause candidates are typically associated with an entity within the networked distributed system/application (e.g., integrated network system, operating system, application, virtual machine, hardware component, etc.) A further embodiment utilizes an iterative graph algorithm (similar to a web page ranking algorithm) that converges the final weights of the graph nodes, thus indicating the probability of a particular feature being the root cause of the identified problem. Thereby, creating a cause probability for each root cause candidate. A further embodiment of root cause identification is shown at 240.
- Once the probability score is generated for each root cause candidate, they are ranked and displayed to the user at 170 via a graphical user interface (GUI). An example GUI is shown in
FIG. 11 . Referring now toFIG. 11 , an embodiment may display the ranked results and allow for faceted graph navigation. For example, an embodiment may allow a user to search at 1110 using a particular term like root or mysql. Additionally, an embodiment may allow a user to select a specific time or time interval range for the search function at 1120. - Prior to or directly after a search has been carried out, an embodiment may allow a user may narrow the search based on various facets, for example, the featuretypes at 1130. Additionally or alternately, the user may further narrow the search based on facets of the namespace at 1140. The featuretypes and namespace variables are further discussed herein with respect to
FIG. 3 . Once all search requirements are entered at 1110, and all desired refinements are selected at 1120-1140, the search results are displayed at 1150. An embodiment may include within the search results a summary of the root cause entity (e.g., namespace, crawltime, featuretype, etc.) - Thus as described herein,
FIG. 12 identifies the technical improvement to the existing method of identifying a root cause as done by IT administrators currently. For example, typically, there are three types of data: metric, log, and fine grained machine state data get examined. The typical drill down approach is to use metric data to detect service level agreement (SLA) violations on and thereby identify the faulty application component. Then utilizing the information attained from the metric data, the log data parsed to identify errors or warnings messages thus identifying faulty application components. However, the granularity at which metric and log data can pinpoint the cause of the fault fails short of what is needed in the field. Consequently, the clues offered by metric and log data then used by the IT admin to manually analyze the fine grained machine state entities (processes, configuration files, packages, connections, mounted disk partitions, file system metadata, etc.) that might be related to the observed errors and warning messages. Thus an embodiment, presents a technical advantage over the aforementioned process by automatically analyzing fine grained machine state data and reporting detected anomalies. - Referring now to
FIG. 13 , a schematic of an example of a computing node is shown.Computing node 10′ is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein. Regardless, computingnode 10′ is capable of being implemented and/or performing any of the functionality set forth hereinabove. In accordance with embodiments of the invention, computingnode 10′ may be part of a cloud network or could be part of another type of distributed or other network (e.g., it could represent an enterprise server), or could represent a stand alone node. - In
computing node 10′ there is a computer system/server 12′, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12′ include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand held or laptop devices, multiprocessor systems, microprocessor based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like. - Computer system/server 12′ may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server 12′ may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
- As shown in
FIG. 12 , computer system/server 12′ incomputing node 10′ is shown in the form of a general purpose computing device. The components of computer system/server 12′ may include, but are not limited to, at least one processor orprocessing unit 16′, asystem memory 28′, and a bus 18′ that couples various system components includingsystem memory 28′ toprocessor 16′. Bus 18′ represents at least one of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus. - Computer system/server 12′ typically includes a variety of computer system readable media. Such media may be any available media that are accessible by computer system/server 12′, and include both volatile and non-volatile media, removable and non-removable media.
-
System memory 28′ can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30′ and/orcache memory 32′. Computer system/server 12′ may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only,storage system 34′ can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 18′ by at least one data media interface. As will be further depicted and described below,memory 28′ may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention. - Program/
utility 40′, having a set (at least one) of program modules 42′, may be stored inmemory 28′ (by way of example, and not limitation), as well as an operating system, at least one application program, other program modules, and program data. Each of the operating systems, at least one application program, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42′ generally carry out the functions and/or methodologies of embodiments of the invention as described herein. - Computer system/server 12′ may also communicate with at least one
external device 14′ such as a keyboard, a pointing device, adisplay 24′, etc.; at least one device that enables a user to interact with computer system/server 12′; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12′ to communicate with at least one other computing device. Such communication can occur via I/O interfaces 22′. Still yet, computer system/server 12′ can communicate with at least one network such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) vianetwork adapter 20′. As depicted,network adapter 20′ communicates with the other components of computer system/server 12′ via bus 18′. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12′. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc. - This disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limiting. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to explain principles and practical application, and to enable others of ordinary skill in the art to understand the disclosure.
- Although illustrative embodiments of the invention have been described herein with reference to the accompanying drawings, it is to be understood that the embodiments of the invention are not limited to those precise embodiments, and that various other changes and modifications may be affected therein by one skilled in the art without departing from the scope or spirit of the disclosure.
- The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions. These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/852,006 US9772898B2 (en) | 2015-09-11 | 2015-09-11 | Identifying root causes of failures in a deployed distributed application using historical fine grained machine state data |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170075744A1 true US20170075744A1 (en) | 2017-03-16 |
US9772898B2 US9772898B2 (en) | 2017-09-26 |
Family
ID=58236853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/852,006 Active 2035-12-04 US9772898B2 (en) | 2015-09-11 | 2015-09-11 | Identifying root causes of failures in a deployed distributed application using historical fine grained machine state data |
Country Status (1)
Country | Link |
---|---|
US (1) | US9772898B2 (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10257053B2 (en) | 2016-06-28 | 2019-04-09 | International Business Machines Corporation | Analyzing contention data and following resource blockers to find root causes of computer problems |
CN108899904B (en) * | 2018-08-30 | 2021-04-30 | 山东大学 | Method and system for rapidly searching cascading failures of alternating current and direct current large power grid |
RU2739866C2 (en) * | 2018-12-28 | 2020-12-29 | Акционерное общество "Лаборатория Касперского" | Method for detecting compatible means for systems with anomalies |
US11042320B2 (en) * | 2019-02-18 | 2021-06-22 | International Business Machines Corporation | Problem diagnosis in complex SAN environments |
US11126454B2 (en) | 2019-07-22 | 2021-09-21 | Red Hat, Inc. | Enforcing retention policies with respect to virtual machine snapshots |
US11349703B2 (en) * | 2020-07-24 | 2022-05-31 | Hewlett Packard Enterprise Development Lp | Method and system for root cause analysis of network issues |
US11947439B2 (en) | 2020-11-30 | 2024-04-02 | International Business Machines Corporation | Learning from distributed traces for anomaly detection and root cause analysis |
US11513930B2 (en) | 2020-12-03 | 2022-11-29 | International Business Machines Corporation | Log-based status modeling and problem diagnosis for distributed applications |
US11995562B2 (en) | 2020-12-03 | 2024-05-28 | International Business Machines Corporation | Integrating documentation knowledge with log mining for system diagnosis |
US11474892B2 (en) | 2020-12-03 | 2022-10-18 | International Business Machines Corporation | Graph-based log sequence anomaly detection and problem diagnosis |
US11599404B2 (en) | 2020-12-03 | 2023-03-07 | International Business Machines Corporation | Correlation-based multi-source problem diagnosis |
US11243835B1 (en) * | 2020-12-03 | 2022-02-08 | International Business Machines Corporation | Message-based problem diagnosis and root cause analysis |
US11403326B2 (en) | 2020-12-03 | 2022-08-02 | International Business Machines Corporation | Message-based event grouping for a computing operation |
US11797538B2 (en) | 2020-12-03 | 2023-10-24 | International Business Machines Corporation | Message correlation extraction for mainframe operation |
US12277049B2 (en) | 2022-03-21 | 2025-04-15 | International Business Machines Corporation | Fault localization in a distributed computing system |
US12035142B2 (en) | 2022-03-21 | 2024-07-09 | Bank Of America Corporation | Systems and methods for dynamic communication channel switching for secure message propagation |
US12242332B2 (en) * | 2022-10-10 | 2025-03-04 | Oracle International Corporation | Identifying root cause anomalies in time series |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080140817A1 (en) | 2006-12-06 | 2008-06-12 | Agarwal Manoj K | System and method for performance problem localization |
US8015139B2 (en) | 2007-03-06 | 2011-09-06 | Microsoft Corporation | Inferring candidates that are potentially responsible for user-perceptible network problems |
US7877642B2 (en) | 2008-10-22 | 2011-01-25 | International Business Machines Corporation | Automatic software fault diagnosis by exploiting application signatures |
US8495429B2 (en) | 2010-05-25 | 2013-07-23 | Microsoft Corporation | Log message anomaly detection |
US8972783B2 (en) | 2011-06-28 | 2015-03-03 | International Business Machines Corporation | Systems and methods for fast detection and diagnosis of system outages |
US8751867B2 (en) | 2011-10-12 | 2014-06-10 | Vmware, Inc. | Method and apparatus for root cause and critical pattern prediction using virtual directed graphs |
US8862727B2 (en) | 2012-05-14 | 2014-10-14 | International Business Machines Corporation | Problem determination and diagnosis in shared dynamic clouds |
- 2015-09-11: US application US14/852,006 filed (granted as patent US9772898B2; status: Active)
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6651183B1 (en) * | 1999-10-28 | 2003-11-18 | International Business Machines Corporation | Technique for referencing failure information representative of multiple related failures in a distributed computing environment |
US20040049565A1 (en) * | 2002-09-11 | 2004-03-11 | International Business Machines Corporation | Methods and apparatus for root cause identification and problem determination in distributed systems |
US8064438B1 (en) * | 2004-11-22 | 2011-11-22 | At&T Intellectual Property Ii, L.P. | Method and apparatus for determining the configuration of voice over internet protocol equipment in remote locations |
US8001527B1 (en) * | 2004-12-21 | 2011-08-16 | Zenprise, Inc. | Automated root cause analysis of problems associated with software application deployments |
US20070016831A1 (en) * | 2005-07-12 | 2007-01-18 | Gehman Byron C | Identification of root cause for a transaction response time problem in a distributed environment |
US20090150131A1 (en) * | 2007-12-05 | 2009-06-11 | Honeywell International, Inc. | Methods and systems for performing diagnostics regarding underlying root causes in turbine engines |
US20110154367A1 (en) * | 2009-12-18 | 2011-06-23 | Bernd Gutjahr | Domain event correlation |
US20110231704A1 (en) * | 2010-03-19 | 2011-09-22 | Zihui Ge | Methods, apparatus and articles of manufacture to perform root cause analysis for network events |
US20120158925A1 (en) * | 2010-12-17 | 2012-06-21 | Microsoft Corporation | Monitoring a model-based distributed application |
US9122602B1 (en) * | 2011-08-31 | 2015-09-01 | Amazon Technologies, Inc. | Root cause detection service |
US20140068326A1 (en) * | 2012-09-06 | 2014-03-06 | Triumfant, Inc. | Systems and Methods for Automated Memory and Thread Execution Anomaly Detection in a Computer Network |
US20140136896A1 (en) * | 2012-11-14 | 2014-05-15 | International Business Machines Corporation | Diagnosing distributed applications using application logs and request processing paths |
US20140189086A1 (en) * | 2013-01-03 | 2014-07-03 | Microsoft Corporation | Comparing node states to detect anomalies |
US9104572B1 (en) * | 2013-02-11 | 2015-08-11 | Amazon Technologies, Inc. | Automated root cause analysis |
US20140281739A1 (en) * | 2013-03-14 | 2014-09-18 | Netflix, Inc. | Critical systems inspector |
US20160124823A1 (en) * | 2014-10-31 | 2016-05-05 | International Business Machines Corporation | Log analytics for problem diagnosis |
Cited By (57)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11416325B2 (en) | 2012-03-13 | 2022-08-16 | Servicenow, Inc. | Machine-learning and deep-learning techniques for predictive ticketing in information technology systems |
US20170068581A1 (en) * | 2015-09-04 | 2017-03-09 | International Business Machines Corporation | System and method for relationship based root cause recommendation |
US10318366B2 (en) * | 2015-09-04 | 2019-06-11 | International Business Machines Corporation | System and method for relationship based root cause recommendation |
US10909018B2 (en) | 2015-09-04 | 2021-02-02 | International Business Machines Corporation | System and method for end-to-end application root cause recommendation |
US20170269566A1 (en) * | 2016-03-17 | 2017-09-21 | Fanuc Corporation | Operation management method for machine tool |
US10600002B2 (en) * | 2016-08-04 | 2020-03-24 | Loom Systems LTD. | Machine learning techniques for providing enriched root causes based on machine-generated data |
US11675647B2 (en) | 2016-08-04 | 2023-06-13 | Servicenow, Inc. | Determining root-cause of failures based on machine-generated textual data |
US10963634B2 (en) | 2016-08-04 | 2021-03-30 | Servicenow, Inc. | Cross-platform classification of machine-generated textual data |
US10789119B2 (en) | 2016-08-04 | 2020-09-29 | Servicenow, Inc. | Determining root-cause of failures based on machine-generated textual data |
US10255128B2 (en) * | 2016-08-17 | 2019-04-09 | Red Hat, Inc. | Root cause candidate determination in multiple process systems |
US11178037B2 (en) * | 2017-06-02 | 2021-11-16 | Vmware, Inc. | Methods and systems that diagnose and manage undesirable operational states of computing facilities |
US20180351838A1 (en) * | 2017-06-02 | 2018-12-06 | Vmware, Inc. | Methods and systems that diagnose and manage undesirable operational states of computing facilities |
US10454801B2 (en) * | 2017-06-02 | 2019-10-22 | Vmware, Inc. | Methods and systems that diagnose and manage undesirable operational states of computing facilities |
US10740692B2 (en) * | 2017-10-17 | 2020-08-11 | Servicenow, Inc. | Machine-learning and deep-learning techniques for predictive ticketing in information technology systems |
US20190132191A1 (en) * | 2017-10-17 | 2019-05-02 | Loom Systems LTD. | Machine-learning and deep-learning techniques for predictive ticketing in information technology systems |
US10769009B2 (en) | 2018-03-21 | 2020-09-08 | International Business Machines Corporation | Root cause analysis for correlated development and operations data |
US11055153B2 (en) * | 2018-04-14 | 2021-07-06 | Microsoft Technology Licensing, Llc | Quantification of compute performance across multiple independently executed microservices with a state machine supported workflow graph |
US10585723B2 (en) * | 2018-04-14 | 2020-03-10 | Microsoft Technology Licensing, Llc | Quantification of compute performance across multiple independently executed microservices with a state machine supported workflow graph |
CN110399260A (en) * | 2018-04-24 | 2019-11-01 | Emc知识产权控股有限公司 | System and method for predictively servicing and supporting solutions |
US20200034222A1 (en) * | 2018-07-29 | 2020-01-30 | Hewlett Packard Enterprise Development Lp | Determination of cause of error state of elements |
US10831587B2 (en) * | 2018-07-29 | 2020-11-10 | Hewlett Packard Enterprise Development Lp | Determination of cause of error state of elements in a computing environment based on an element's number of impacted elements and the number in an error state |
US20200117531A1 (en) * | 2018-10-10 | 2020-04-16 | Microsoft Technology Licensing, Llc | Error source module identification and remedial action |
US10938623B2 (en) * | 2018-10-23 | 2021-03-02 | Hewlett Packard Enterprise Development Lp | Computing element failure identification mechanism |
US11531908B2 (en) * | 2019-03-12 | 2022-12-20 | Ebay Inc. | Enhancement of machine learning-based anomaly detection using knowledge graphs |
US12131265B2 (en) * | 2019-03-12 | 2024-10-29 | Ebay Inc. | Enhancement of machine learning-based anomaly detection using knowledge graphs |
US20230048212A1 (en) * | 2019-03-12 | 2023-02-16 | Ebay Inc. | Enhancement Of Machine Learning-Based Anomaly Detection Using Knowledge Graphs |
US11863580B2 (en) | 2019-05-31 | 2024-01-02 | Varmour Networks, Inc. | Modeling application dependencies to identify operational risk |
US11711374B2 (en) | 2019-05-31 | 2023-07-25 | Varmour Networks, Inc. | Systems and methods for understanding identity and organizational access to applications within an enterprise environment |
CN114041154A (en) * | 2019-06-28 | 2022-02-11 | 皇家飞利浦有限公司 | Maintaining history visualization tools to facilitate resolving intermittent problems |
US20220309474A1 (en) * | 2019-06-28 | 2022-09-29 | Koninklijke Philips N.V. | Maintenance history visualizer to facilitate solving intermittent problems |
US11093360B2 (en) | 2019-07-24 | 2021-08-17 | International Business Machines Corporation | Multivariate anomaly detection and identification |
US20220182279A1 (en) * | 2020-04-10 | 2022-06-09 | Servicenow, Inc. | Context-Aware Automated Root Cause Analysis in Managed Networks |
US12273230B2 (en) * | 2020-04-10 | 2025-04-08 | Servicenow, Inc. | Context-aware automated root cause analysis in managed networks |
US20220012602A1 (en) * | 2020-07-07 | 2022-01-13 | Intuit Inc. | Inference-based incident detection and reporting |
US11651254B2 (en) * | 2020-07-07 | 2023-05-16 | Intuit Inc. | Inference-based incident detection and reporting |
US11809266B2 (en) | 2020-07-14 | 2023-11-07 | Juniper Networks, Inc. | Failure impact analysis of network events |
US11269711B2 (en) | 2020-07-14 | 2022-03-08 | Juniper Networks, Inc. | Failure impact analysis of network events |
US11265204B1 (en) | 2020-08-04 | 2022-03-01 | Juniper Networks, Inc. | Using a programmable resource dependency mathematical model to perform root cause analysis |
US11888679B2 (en) * | 2020-09-25 | 2024-01-30 | Juniper Networks, Inc. | Hypothesis driven diagnosis of network systems |
US20220103417A1 (en) * | 2020-09-25 | 2022-03-31 | Juniper Networks, Inc. | Hypothesis driven diagnosis of network systems |
US12105581B2 (en) * | 2020-10-07 | 2024-10-01 | Mitsubishi Electric Corporation | Failure symptom detection system, failure symptom detection method, and recording medium |
US11876817B2 (en) | 2020-12-23 | 2024-01-16 | Varmour Networks, Inc. | Modeling queue-based message-oriented middleware relationships in a security system |
US11818152B2 (en) * | 2020-12-23 | 2023-11-14 | Varmour Networks, Inc. | Modeling topic-based message-oriented middleware within a security system |
US20220201024A1 (en) * | 2020-12-23 | 2022-06-23 | Varmour Networks, Inc. | Modeling Topic-Based Message-Oriented Middleware within a Security System |
US20220318082A1 (en) * | 2021-04-01 | 2022-10-06 | Bmc Software, Inc. | Root cause identification and event classification in system monitoring |
US11640329B2 (en) * | 2021-04-01 | 2023-05-02 | Bmc Software, Inc. | Using an event graph schema for root cause identification and event classification in system monitoring |
US12130699B2 (en) | 2021-04-01 | 2024-10-29 | Bmc Software, Inc. | Using an event graph schema for root cause identification and event classification in system monitoring |
US11868166B2 (en) * | 2021-08-05 | 2024-01-09 | International Business Machines Corporation | Repairing machine learning pipelines |
US20230059857A1 (en) * | 2021-08-05 | 2023-02-23 | International Business Machines Corporation | Repairing of machine learning pipelines |
US11907159B2 (en) | 2021-08-19 | 2024-02-20 | Bull Sas | Method for representing a distributed computing system by graph embedding |
EP4137946A1 (en) * | 2021-08-19 | 2023-02-22 | Bull SAS | Method for representation of a computer system distributed by graph embedding |
US20230105304A1 (en) * | 2021-10-01 | 2023-04-06 | Healtech Software India Pvt. Ltd. | Proactive avoidance of performance issues in computing environments |
US12130720B2 (en) * | 2021-10-01 | 2024-10-29 | Healtech Software India Pvt. Ltd. | Proactive avoidance of performance issues in computing environments using a probabilistic model and causal graphs |
WO2023195014A1 (en) * | 2022-04-05 | 2023-10-12 | Krayden Amir | System and method for hybrid observability of highly distributed systems |
WO2023224764A1 (en) * | 2022-05-20 | 2023-11-23 | Nec Laboratories America, Inc. | Multi-modality root cause localization for cloud computing systems |
CN118131738A (en) * | 2024-05-08 | 2024-06-04 | 珠海市众知科技有限公司 | Controller processing optimization method, system and medium based on test data analysis |
CN119477649A (en) * | 2025-01-08 | 2025-02-18 | 山东省邮电工程有限公司 | Cable laying management method and system for communication construction |
Also Published As
Publication number | Publication date |
---|---|
US9772898B2 (en) | 2017-09-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US9772898B2 (en) | Identifying root causes of failures in a deployed distributed application using historical fine grained machine state data | |
Notaro et al. | A survey of aiops methods for failure management | |
US10962968B2 (en) | Predicting failures in electrical submersible pumps using pattern recognition | |
US10585774B2 (en) | Detection of misbehaving components for large scale distributed systems | |
Ibidunmoye et al. | Performance anomaly detection and bottleneck identification | |
US20190228296A1 (en) | Significant events identifier for outlier root cause investigation | |
US8850263B1 (en) | Streaming and sampling in real-time log analysis | |
US9424157B2 (en) | Early detection of failing computers | |
US9471462B2 (en) | Proactive risk analysis and governance of upgrade process | |
US8949676B2 (en) | Real-time event storm detection in a cloud environment | |
AU2017274576B2 (en) | Classification of log data | |
US11900248B2 (en) | Correlating data center resources in a multi-tenant execution environment using machine learning techniques | |
US10996861B2 (en) | Method, device and computer product for predicting disk failure | |
Zeng et al. | Traceark: Towards actionable performance anomaly alerting for online service systems | |
US11675647B2 (en) | Determining root-cause of failures based on machine-generated textual data | |
Ali et al. | [Retracted] Classification and Prediction of Software Incidents Using Machine Learning Techniques | |
Alharthi et al. | Sentiment analysis based error detection for large-scale systems | |
Remil et al. | Aiops solutions for incident management: Technical guidelines and a comprehensive literature review | |
US20170031803A1 (en) | Automatic knowledge base generation for root cause in application performance management | |
US10372585B2 (en) | Incident tracker | |
Meng et al. | Driftinsight: detecting anomalous behaviors in large-scale cloud platform | |
US9952773B2 (en) | Determining a cause for low disk space with respect to a logical disk | |
US11138512B2 (en) | Management of building energy systems through quantification of reliability | |
US20230179501A1 (en) | Health index of a service | |
Remil | A data mining perspective on explainable AIOps with applications to software maintenance |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: DESHPANDE, PRASAD MANIKARAO; NANDI, ANIMESH; SUBRAMANIAN, SURIYA; REEL/FRAME: 036546/0116. Effective date: 20150910 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |