US20200106828A1 - Parallel Computation Network Device - Google Patents
- Publication number
- US20200106828A1 (U.S. application Ser. No. 16/357,356)
- Authority
- US
- United States
- Prior art keywords
- data
- central
- hdms
- nodes
- hdm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- H04L67/10 — Protocols in which an application is distributed across nodes in the network
- H04L69/22 — Parsing or analysis of headers
- G06F9/3885 — Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units
- G06F9/5072 — Grid computing
- H04L49/30 — Packet switching elements; peripheral units, e.g. input or output ports
- H04L69/12 — Protocol engines
Definitions
- a network device including a plurality of ports configured to serve as ingress ports for receiving data packets from a network and as egress ports for forwarding at least some of the data packets, streaming aggregation circuitry connected to the ingress ports, and configured to analyze the received data packets to identify ones of the data packets having payloads targeted for a data reduction process, parse at least some of the identified data packets into payload data and headers, and inject the parsed payload data into the data reduction process, data reduction circuitry connected to the ingress ports and configured to perform the data reduction process on the parsed payload data, and including a multiplicity of hardware data modifiers (HDMs) including a central HDM and non-central HDMs, the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process in the network device being performed by the central HDM which is configured to receive data from at least two of the non-central HDMs and to output resultant reduced data, and a transport layer controller configured to manage forwarding of the resultant reduced data in data packets to at least one network node.
- At least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level, each one of the non-central HDMs in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central HDMs in the at least one intermediate level, and the central HDM is disposed in the root level, and is configured to receive data from the at least two non-central HDMs in the at least one intermediate level.
- the HDMs are arranged in a daisy-chain configuration including at least two end nodes converging via intermediate nodes on a central node, wherein each one of the at least two end nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and to output data to one of the non-central HDMs disposed in one of the intermediate nodes, each one of the intermediate nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and one of the non-central HDMs, and the central node includes the central HDM which is configured to receive data from the at least two non-central HDMs of the intermediate nodes.
- the device includes a central block including the central HDM, an application layer controller configured to control at least part of an aggregation protocol among network nodes in the network, select the at least one network node to which to forward the resultant reduced data output by the central HDM, and manage operations performed by the HDMs, the transport layer controller which is also configured to perform requester handling of the data packets including the resultant reduced data, and a requester handling database to store data used in the requester handling.
- the device includes two respective input channels into the central block to the central HDM from two respective ones of the non-central HDMs, the non-central HDMs being disposed externally to the central block.
- the streaming aggregation circuitry is configured to perform responder handling, so as to split the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol between the streaming aggregation circuitry and the central block, respectively, each of the ingress ports further including a responder handling database to store data used in the responder handling.
- the streaming aggregation circuitry includes a plurality of respective streaming aggregation circuitry units connected to, and serving, respective ones of the ingress ports.
- the data reduction process is part of an aggregation protocol.
- a data reduction method including receiving data packets in a network device from a network, forwarding at least some of the data packets, analyzing the received data packets to identify ones of the data packets having payloads targeted for a data reduction process, parsing at least some of the identified data packets into payload data and headers, injecting the parsed payload data into the data reduction process, performing the data reduction process on the parsed payload data using data reduction circuitry including a multiplicity of hardware data modifiers (HDMs) including a central HDM and non-central HDMs, the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process in the network device being performed by the central HDM outputting resultant reduced data, and managing forwarding of the resultant reduced data in data packets to at least one network node.
- At least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level, the method further including receiving data from at least two ingress ports and outputting data to one of the non-central HDMs in the at least one intermediate level by one of the non-central HDMs in the leaf level, and receiving data from at least two non-central HDMs in the at least one intermediate level by the central HDM which is disposed in the root level.
- the HDMs are arranged in a daisy-chain configuration including at least two end nodes converging via intermediate nodes on a central node, the method further including receiving data from at least one ingress port and outputting data to one of the non-central HDMs disposed in one of the intermediate nodes by one of the non-central HDMs disposed in one of the at least two end nodes, receiving data from at least one ingress port and from one of the non-central HDMs by one of the non-central HDMs disposed in one of the intermediate nodes, and receiving data from at least two of the non-central HDMs of the intermediate nodes by the central HDM disposed in the central node.
- the method includes controlling at least part of an aggregation protocol among network nodes in the network, selecting the at least one network node to which to forward the resultant reduced data output by the central HDM, managing operations performed by the HDMs, performing requester handling of the data packets including the resultant reduced data, and storing data used in the requester handling.
- the method includes performing responder handling by streaming aggregation circuitry, so as to split the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol between the streaming aggregation circuitry and a central block including the central HDM, respectively, and storing data used in the responder handling.
- the data reduction process is part of an aggregation protocol.
- FIGS. 1 and 2 are block diagram views of network devices implementing data reduction;
- FIG. 3 is a block diagram view of a network device implementing a data reduction process constructed and operative in accordance with an embodiment of the present invention;
- FIG. 4 is a flowchart including exemplary steps in a method of operation of the network device of FIG. 3 ;
- FIG. 5 is a block diagram view of a network device implementing a data reduction process constructed and operative in accordance with an alternative embodiment of the present invention.
- a task in parallel computation may require performing a reduction operation on a stream of data which is distributed over several nodes in a network.
- a data reduction operation may be performed on a sub-set of the data and a result of the data reduction operation may then be distributed to one or more network nodes.
- the distribution of the data and the results may be managed in accordance with an aggregation protocol, for example, but not limited to, the “Scalable Hierarchical Aggregation Protocol” (SHArP), described in the abovementioned '613 publication.
- Scalable Hierarchical Aggregation Protocol Scalable Hierarchical Aggregation Protocol
- the data reduction operation may be performed by any suitable processing device, for example, but not limited to, a network device such as a switch or a router, which in addition to performing packet forwarding also performs data reduction.
- the network device may receive data packets from various respective network nodes over respective ingress ports. The received data packets may be analyzed to determine whether the packets should be forwarded to other network nodes, or whether the packets should be forwarded to a data reduction process within the receiving network device.
- One problem to be addressed when performing data reduction in a network device is to perform the data reduction efficiently while maintaining throughput. Maintaining a high data throughput is particularly challenging when data is being received from multiple network nodes.
- the network device may include a central block to which all the packets destined for the data reduction process are sent.
- the central block may include a single arithmetic logic unit (ALU) to perform the data reduction process, an application layer controller to manage the aggregation protocol, and a transport layer controller to manage receiving data packets of the aggregation protocol and forwarding the resultant reduced data output of the data reduction process to one or more network nodes and to manage responder and requester handling.
- ALU arithmetic logic unit
- Another solution is to provide high speed links between each ingress port and the central block or multiple high speed links from forwarding circuitry to the central block with the central block including multiple ALUs arranged in a hierarchical structure so that in a first level of the hierarchical structure, data from any two ingress ports is reduced by one ALU, and in a second level the data output of the ALUs in the first level is reduced by the ALUs in the second level and so on until a central ALU receives data input from two ALUs to yield a final reduced data output. Therefore, the overall processing over the various ALUs takes place at a speed which is intended not to be limited other than by the maximum interconnect speed between the ALUs; this maximum speed is also termed herein “wire speed” (WS).
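- To illustrate the staged reduction such a hierarchy of ALUs performs, here is a minimal software model (Python used purely for illustration; the port values and the ADD operation are assumptions, and hardware pipelining is not modeled). Each level combines pairs of inputs, halving the stream count until the central ALU emits the result:

```python
import operator

def tree_reduce(port_values, op=operator.add):
    """Software model of a hierarchical ALU reduction: at each level
    every pair of inputs is combined by one ALU, so the number of
    streams halves per level until a single central result remains."""
    level = list(port_values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])        # one ALU per input pair
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                       # odd stream passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Eight ingress ports, one value each (hypothetical data); three levels
# of ALUs (4 + 2 + 1) yield the final result at the central ALU:
print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

- Because each level operates on independent pairs concurrently, throughput in such a structure is bounded by the interconnect speed rather than by any single ALU, consistent with the wire-speed goal described above.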
- While this solution provides sufficient throughput due to the different levels of ALUs, it requires high-speed connections running over the chip between each ingress port (or the forwarding circuitry) and the central block.
- This solution is generally not scalable because when the number of ports is increased, the high-speed connections running over the chip to the central block and the additional ALUs needed in the central block are increased, leading to a more complicated and expensive chip design and manufacture.
- the network device includes multiple non-central ALUs located outside of the central block, on the same chip as the central block, with generally two high-speed connections from the non-central ALUs leading into a central ALU located in the central block.
- This design reduces the number of high-speed connections entering the central block from the device ports to two. Additionally, locating the non-central ALUs outside the central block generally leads to an overall shorter length of high-speed connections on the chip; the closer the non-central ALUs are to the ports, the shorter the overall length of high-speed connections.
- the responder handling and requester handling of data packets associated with the data reduction process of the aggregation protocol are split between the streaming aggregation circuitry units associated with ingress ports (described in more detail below) and the transport layer controller of the central block, respectively.
- Each of the ingress ports is associated with a responder handling database to store data associated with the responder handling, while the central block includes a requester handling database to store data associated with the requester handling.
- the payload data of packets targeted for the data reduction process can be injected into the data reduction process without having to send all the packet data (e.g., including headers and other responder handling data) to the central block, thereby reducing congestion in the central block.
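- As a rough sketch of this split (Python used only for illustration; the per-flow sequence-number semantics and all field names here are assumptions, not the patent's specification), responder handling at each port might track inbound aggregation packets, while requester handling at the central block tracks outbound result packets awaiting acknowledgment:

```python
class PortResponderDB:
    """Per-ingress-port responder-handling state (assumed semantics:
    track the next expected packet sequence number per aggregation
    flow so the port-level unit can acknowledge inbound packets
    without involving the central block)."""
    def __init__(self):
        self.expected_psn = {}  # flow id -> next expected sequence number

    def on_receive(self, flow, psn):
        expected = self.expected_psn.get(flow, 0)
        if psn == expected:
            self.expected_psn[flow] = expected + 1
            return "ack"
        return "nak"  # out of sequence: ask the sender to retransmit


class CentralRequesterDB:
    """Central-block requester-handling state (assumed semantics:
    remember result packets sent toward network nodes until each
    one is acknowledged, enabling retransmission on loss)."""
    def __init__(self):
        self.outstanding = {}  # (destination, psn) -> payload

    def on_send(self, dest, psn, payload):
        self.outstanding[(dest, psn)] = payload

    def on_ack(self, dest, psn):
        self.outstanding.pop((dest, psn), None)


rdb = PortResponderDB()
print(rdb.on_receive(flow=1, psn=0))  # ack
print(rdb.on_receive(flow=1, psn=5))  # nak (out of sequence)
```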
- Each of the ingress ports may be associated with a streaming aggregation circuitry unit which speculatively analyzes received data packets based on header data (described in more detail below) to identify the data packets having payloads targeted for the data reduction process as part of an aggregation protocol.
- the streaming aggregation circuitry unit, after receiving confirmation of the speculative analysis from the forwarding circuitry, parses the identified data packets into payload data and headers and injects the parsed payload data into the data reduction process.
- the ALUs are arranged in a hierarchical configuration including a root level, a leaf level and one or more intermediate levels.
- Each non-central ALU in the leaf level may receive data from at least two of the ingress ports and output data to one of the non-central ALUs in an intermediate level adjacent to the leaf level.
- the central ALU is disposed in the root level, and receives data from two (or more) non-central ALUs in the intermediate level below the root level.
- the ALUs are arranged in a daisy-chain configuration comprising two (or more) end nodes converging via intermediate nodes on a central node.
- Each of the end nodes includes one of the non-central ALUs and may receive data from at least one ingress port and output data to a non-central ALU (i.e., the next ALU in the chain) disposed in one of the intermediate nodes.
- Each intermediate node includes a non-central ALU which may receive data from at least one ingress port and one of the non-central ALUs (i.e., the previous ALU in the chain).
- the central node includes the central ALU which receives data from two (or more) non-central ALUs of the intermediate nodes.
- FIG. 1 is a block diagram view of a network device 100 implementing data reduction.
- the network device 100 includes a plurality of ports 102 , forwarding circuitry 104 , and a central block 106 including an application layer controller 108 , a transport layer controller 110 , an ALU 112 , and a database 114 .
- the ports 102 are configured as ingress and egress ports for receiving and sending data packets 116 , respectively.
- the received packets 116 are processed by the forwarding circuitry 104 and are either: forwarded to the central block 106 , via a single high-speed channel 118 , for injection by the application layer controller 108 into a data reduction process which is performed by the ALU 112 ; or forwarded to another network node via one of the egress ports.
- the ALU 112 processes data received by various ports 102 in a serial fashion.
- the resultant reduced data output by the ALU 112 is packetized by the transport layer controller 110 which manages forwarding the packetized data to at least one network node according to the dictates of an aggregation protocol.
- the transport layer controller 110 also manages requester and responder handling of the packets associated with the aggregation protocol. It should be noted that the transport layer controller 110 may also manage responder and requester handling for packets received by the network device 100 from a parent node in the aggregation protocol for forwarding to one or more other network nodes in the aggregation protocol.
- the application layer controller 108 manages at least part of the aggregation protocol (based on at least data included in the packet headers of the received packets for the data reduction process) including selecting which network node(s) (not shown) the packetized data should be sent to.
- the application layer controller 108 , typically based on data included in the packet headers of the received packets for the data reduction process, instructs the ALU 112 how to reduce the data (e.g., which mathematical operation(s) to perform on the data).
- the database 114 stores data used by the application layer controller 108 and the transport layer controller 110 including data used in requester and responder handling.
- Data reduction processing may be performed by any suitable hardware data modifier and embodiments of the present invention are not limited to using an ALU for performing data reduction.
- the hardware data modifiers may perform any suitable data modification, including concatenation, by way of example only.
- the data reduction process in the network device 100 does not generally provide a throughput according to “wire speed” (WS) unless data is being received via only one of the ports 102 , since the single ALU 112 processes data from the various ports serially.
- FIG. 2 is a block diagram view of a network device 200 implementing data reduction.
- the network device 200 includes a plurality of ports 202 , forwarding circuitry 204 , and a central block 206 including an application layer controller 208 , a transport layer controller 210 , a plurality of ALUs 212 , and a database 214 .
- the network device 200 also includes a streaming aggregation circuitry unit 220 associated with each of the ports 202 .
- the network device 200 of FIG. 2 illustrates how the central block 106 of the network device 100 could be modified in order to achieve WS processing throughput in the data reduction process.
- the design of the network device 200 has some drawbacks discussed below in more detail.
- Data packets 216 received by each port 202 are speculatively analyzed by the associated streaming aggregation circuitry unit 220 to determine whether the packets 216 should be injected into the data reduction process (via high-speed connections 218 ) or forwarded to another network node via internal connections 222 , the forwarding circuitry 204 , and one of the egress ports.
- the speculative analysis is confirmed by the forwarding circuitry 204 prior to the streaming aggregation circuitry unit 220 parsing the packets 216 and injecting the parsed payload data into the data reduction process. This process is described in more detail with reference to the embodiment of FIGS. 3 and 4 .
- Each streaming aggregation circuitry unit 220 is connected via a high-speed connection 218 to one of the ALUs 212 in the central block 206 .
- the functionality of the streaming aggregation circuitry unit 220 is combined with forwarding circuitry 204 and multiple high-speed links may be provided between the forwarding circuitry 204 and the central block 206 .
- the ALUs 212 are arranged in a hierarchical configuration in the central block 206 so that each ALU 212 - 1 in a layer closest to the streaming aggregation circuitry units 220 receives the data packets 216 from two of the streaming aggregation circuitry units 220 .
- the ALUs 212 - 1 reduce data included in the packet payloads.
- Each ALU 212 - 2 in a layer adjacent to the ALUs 212 - 1 receives the reduced data and at least part of the packets 216 (e.g., the packet headers).
- the ALUs 212 - 2 further reduce the received reduced data yielding further reduced data.
- the ALU 212 - 3 located in a root layer of the hierarchical configuration receives the further reduced data from the ALUs 212 - 2 , and still further reduces the received data yielding a resultant reduced data which is packetized by the transport layer controller 210 for forwarding to one or more network nodes determined by the application layer controller 208 .
- the network device 200 therefore provides a data reduction process at WS at the cost of the high-speed connection 218 extending from each of the streaming aggregation circuitry units 220 to the central block 206 .
- the network device 200 is generally not scalable in terms of chip design and manufacture, among other drawbacks, due to the high-speed connection 218 extending from the streaming aggregation circuitry units 220 to the central block 206 and the placement of many ALUs 212 in the central block 206 .
- FIG. 3 is a block diagram view of a network device 300 implementing a data reduction process constructed and operative in accordance with an embodiment of the present invention.
- the network device 300 includes a plurality of ports 302 (configurable as ingress and egress ports), forwarding circuitry 304 , data reduction circuitry 312 , a central block 306 including an application layer controller 308 , a transport layer controller 310 , a central ALU 312 - 3 (which is part of the data reduction circuitry 312 ), and a requester handling database 314 , and streaming aggregation circuitry including a plurality of respective streaming aggregation circuitry units 320 connected to, and serving, respective ones of the ingress ports.
- the network device 300 also includes a responder handling database 324 comprised in each port 302 . For the sake of simplicity not all of the responder handling databases 324 have been labelled in FIG. 3 .
- the ports 302 are configured to serve as ingress ports for receiving data packets 316 from a network and as egress ports for forwarding at least some of the data packets 316 .
- Each streaming aggregation circuitry unit 320 is configured to speculatively analyze the received data packets 316 (received on the port 302 associated with that streaming aggregation circuitry unit 320 ) to identify the data packets 316 having payloads targeted for the data reduction process as part of an aggregation protocol, for example, but not limited to, the “Scalable Hierarchical Aggregation Protocol” (SHArP), described in the abovementioned '613 publication.
- the streaming aggregation circuitry unit 320 forwards packet headers to the forwarding circuitry 304 as part of a confirmation process whereby the forwarding circuitry 304 confirms that the packets analyzed as being targeted for the data reduction process may be injected into the data reduction process. This is described in more detail with reference to FIG. 4 .
- each streaming aggregation circuitry unit 320 is configured to parse the identified data packets into payload data and headers and to inject the parsed payload data into the data reduction process.
- the streaming aggregation circuitry units 320 are configured to perform responder handling (of the packets 316 associated with the aggregation protocol and packets 316 not associated with the aggregation protocol) whereas requester handling of data packets associated with the aggregation protocol is performed by the transport layer controller 310 described in more detail below.
- the transport layer controller 310 in the central block 306 may still manage responder and requester handling for packets received by the network device 300 from a parent node in the aggregation protocol for forwarding to one or more other nodes in the aggregation protocol.
- the packets received from the parent node are forwarded from the receiving port 302 via the forwarding circuitry 304 to the central block 306 where the transport layer controller 310 manages the responder handling of the received packet.
- the transport layer controller 310 then manages forwarding and requester handling of the received packet to one or more network nodes according to the aggregation protocol.
- Packet headers of the packets 316 targeted for the data reduction process may also include data needed by the application layer controller 308 for managing the aggregation protocol, such as which operation(s) (e.g., mathematical operation(s)) the ALUs 312 should perform and where resultant reduced data should be sent. Therefore, at least one packet (e.g., a first packet in a message) of the packets 316 targeted for the data reduction process is forwarded to the central block 306 via the forwarding circuitry 304 for receipt by the application layer controller 308 .
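- As a rough illustration of such header-driven control (the header layout, field names, and opcode table below are hypothetical; the actual aggregation protocol header format is not reproduced here), a packet might be split into control fields for the application layer controller and a payload for the reduction:

```python
import struct

# Hypothetical aggregation header: 4-byte group id, 1-byte opcode,
# 1-byte data type, and a 2-byte payload element count (big-endian).
AGG_HEADER = struct.Struct(">IBBH")
OPCODES = {0: "sum", 1: "min", 2: "max"}  # assumed opcode table

def parse_agg_packet(packet: bytes):
    """Split a packet into the control fields the application layer
    controller needs (which operation, where results go) and the
    payload data to inject into the data reduction process."""
    group, opcode, dtype, count = AGG_HEADER.unpack_from(packet)
    payload = packet[AGG_HEADER.size:]
    return {"group": group, "op": OPCODES[opcode],
            "dtype": dtype, "count": count}, payload

packet = AGG_HEADER.pack(7, 0, 1, 3) + b"\x00" * 12  # demo packet
header, payload = parse_agg_packet(packet)
print(header)  # {'group': 7, 'op': 'sum', 'dtype': 1, 'count': 3}
```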
- the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol are split between the streaming aggregation circuitry units 320 and the central block 306 (in which the transport layer controller 310 is disposed), respectively.
- the split also manifests itself with respect to data storage with the responder handling database 324 of each port 302 storing data used in the responder handling, and the requester handling database 314 disposed in the central block 306 storing data used in the requester handling as well as other data used by the application layer controller 308 and the transport layer controller 310 .
- Splitting the responder and requester handling between the streaming aggregation circuitry unit 320 and the central block 306 also leads to less data being injected to the data reduction circuitry 312 as only data needed for data reduction needs to be injected into the data reduction circuitry.
- the data reduction circuitry 312 is connected to the ingress ports via the streaming aggregation circuitry units 320 and is configured to perform the data reduction process on the parsed payload data.
- the data reduction circuitry 312 includes a multiplicity of ALUs including the central ALU 312 - 3 and non-central ALUs 312 - 1 , 312 - 2 .
- the ALUs are connected and arranged to reduce the parsed payload data in stages with a last stage of the data reduction process in the network device 300 being performed by the central ALU 312 - 3 .
- the central ALU 312 - 3 is configured to receive data from the non-central ALUs 312 - 2 and to output resultant reduced data. In the example of FIG. 3 , the central ALU 312 - 3 receives input from two non-central ALUs 312 - 2 . In some embodiments, the central ALU 312 - 3 may receive input from more than two non-central ALUs 312 - 2 .
- the non-central ALUs 312 - 1 , 312 - 2 are disposed externally to the central block 306 so that there are only two respective high-speed input channels 318 into the central block 306 to the central ALU 312 - 3 from two respective ones of the non-central ALUs 312 - 2 . Disposing the non-central ALUs externally to the central block 306 leads to shorter overall high-speed channels 318 even though generally the network device 300 has the same number of ALUs as the network device 200 of FIG. 2 .
- the design of network device 300 not only shortens the overall length of the high-speed channels but reduces the number of interfaces to central block 306 and is generally more scalable.
- the central block 306 may include more than one ALU with more than two input channels 318 entering the central block 306 .
- FIG. 3 shows the ALUs arranged in a hierarchical configuration including a root level, a leaf level and an intermediate level.
- Each non-central ALU 312 - 1 in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central ALUs 312 - 2 in the intermediate level.
- the central ALU 312 - 3 is disposed in the root level, and is configured to receive data from the non-central ALUs 312 - 2 in the intermediate level.
- the number of ingress ports connected to, and providing data to, one ALU 312 - 1 may depend on the processing capabilities and bandwidth properties of the ALU 312 - 1 .
- the number of levels in the hierarchical configuration depends on the number of ports 302 included in the network device 300 so that in some implementations the network device 300 may include multiple intermediate levels, in which an ALU in an intermediate level closer to the leaf level may output data to an adjacent intermediate level closer to the root level, and so on, until all the intermediate levels are traversed by the data that is input into the data reduction process.
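- As a back-of-the-envelope sketch of that dependence (assuming, for illustration only, that each leaf ALU serves two ports and every ALU above the leaf level combines two inputs), the number of levels grows only logarithmically with the port count:

```python
import math

def num_levels(num_ports: int, ports_per_leaf: int = 2) -> int:
    """Levels in the ALU hierarchy: one leaf level plus enough
    two-input combining levels to funnel all leaf outputs into
    the central ALU (binary fan-in above the leaves is assumed)."""
    leaves = math.ceil(num_ports / ports_per_leaf)
    return 1 + max(1, math.ceil(math.log2(leaves)))

print(num_levels(8))   # 3 (leaf, intermediate, root, as in FIG. 3)
print(num_levels(64))  # 6 (four intermediate levels)
```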
- the application layer controller 308 is configured to: control at least part of the aggregation protocol among network nodes in the network; manage operation(s) (e.g., mathematical operation(s)) performed by the ALUs; and select at least one network node (e.g., according to the aggregation protocol) to which to forward the resultant reduced data output by the central ALU 312 - 3 .
- the transport layer controller 310 is configured to: manage forwarding of the resultant reduced data in data packets to the network node(s) (selected by the application layer controller 308 ) via at least one of the egress ports; and perform requester handling of the data packets that include the resultant reduced data.
- Any suitable method may be used to distribute the data packets that include the resultant reduced data to the selected network node(s), for example, but not limited to, a distribution method described in US Patent Publication 2018/0287928 of Levi, et al., which is herein incorporated by reference.
- the transport layer controller 310 and the application layer controller 308 may be combined in a single physical component or, alternatively, implemented using multiple physical components. These physical components may comprise hard-wired or programmable devices, or a combination of the two. In some embodiments, at least some of the functions may be carried out by a programmable processor under the control of suitable software. This software may be downloaded to a device in electronic form, over a network, for example. Alternatively, or additionally, the software may be stored in tangible, non-transitory computer-readable storage media, such as optical, magnetic, or electronic memory.
- the network device 300 may include other standard components of a network device, but these have not been included in the present disclosure for the sake of simplicity.
- FIG. 4 is a flowchart 400 including exemplary steps in a method of operation of the network device 300 of FIG. 3 . Reference is also made to FIG. 3 .
- the streaming aggregation circuitry comprising the streaming aggregation circuitry units 320 is configured to inspect (block 402 ) the received data packets 316 to speculatively analyze the data packets to identify data packets having payloads targeted for the data reduction process as part of the aggregation protocol.
- the speculative analysis may include inspecting packet header information including source and destination information as well as other header information such as layer 4 data.
- the streaming aggregation circuitry determines whether ones of the packets 316 are potentially for the data reduction process (branch 410 ) or not for the data reduction process (branch 406 ).
- the packets 316 which are not targeted for the data reduction process are forwarded (block 408 ) according to a forwarding mechanism to one or more network nodes (which may include forwarding to the central block 306 ) according to the destination addresses of the packets 316 .
- the packets 316 may be forwarded by the streaming aggregation circuitry units 320 to the forwarding circuitry 304 for forwarding to one or more network nodes via one or more of the egress ports.
- the streaming aggregation circuitry is configured to send (block 412 ) packet headers of data packets having payloads potentially targeted for the data reduction process as part of the aggregation protocol to the forwarding circuitry 304 .
- the forwarding circuitry 304 is configured to analyze the packet headers (using layer 2 and 3 information such as source and destination information) in order to determine whether the speculative analysis of the streaming aggregation circuitry was correct.
- the forwarding circuitry 304 is configured to send a confirmation approving the speculative analysis to the streaming aggregation circuitry which receives (block 414 ) the decision.
- when the forwarding circuitry 304 does not confirm the speculative analysis for a given packet or packets, the streaming aggregation circuitry unit does not forward payload data of the given packet(s) to the data reduction path but instead forwards the given packet(s) via the standard forwarding mechanism of the network device 300 .
- the streaming aggregation circuitry is configured to parse (block 416 ) the data packets 316 confirmed as being targeted for the data reduction process into payload data and headers.
- the streaming aggregation circuitry is configured to inject (block 418 ) the parsed payload data into the data reduction process executed by the ALUs 312 , which perform (block 420 ) the data reduction process yielding the resultant reduced data.
- the transport layer controller 310 is configured to manage (block 422 ) forwarding the resultant reduced data in data packets.
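- In pseudocode form, the per-packet decision flow of FIG. 4 might be summarized as follows (a schematic sketch only; the header fields and the confirmation rule are hypothetical, and the real logic is implemented in hardware):

```python
def route_packet(header, confirm_header):
    """Schematic decision flow of FIG. 4 for one received packet:
    returns 'reduce' when the payload should be injected into the
    data reduction process, or 'forward' for the normal path."""
    # Block 402: speculative header-only analysis (assumed criterion:
    # an aggregation marker in the header identifies protocol traffic).
    if not header.get("agg", False):
        return "forward"                       # blocks 406-408
    # Blocks 412-414: the forwarding circuitry confirms or rejects
    # the speculation using layer 2 and 3 information.
    if not confirm_header(header):
        return "forward"                       # speculation rejected
    # Blocks 416-420: parse the packet and inject the payload
    # into the data reduction process.
    return "reduce"

# Hypothetical confirmation rule for the demo:
confirm = lambda h: h.get("dst") == "aggregation-group"
print(route_packet({"agg": True, "dst": "aggregation-group"}, confirm))  # reduce
print(route_packet({"agg": True, "dst": "node-3"}, confirm))             # forward
print(route_packet({"dst": "node-3"}, confirm))                          # forward
```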
- FIG. 5 is a block diagram view of a network device 500 implementing a data reduction process constructed and operative in accordance with an alternative embodiment of the present invention.
- the network device 500 is substantially the same as the network device 300 of FIG. 3 except that the ALUs of the data reduction circuitry 512 of the network device 500 are arranged in a daisy-chain configuration described in more detail below.
- the network device 500 includes a plurality of ports 502 , forwarding circuitry 504 , a central block 506 including an application layer controller 508 , a transport layer controller 510 , a central ALU 512 - 3 , and a requester handling database 514 .
- the network device 500 also includes high speed connections 518 between the ports 502 and the data reduction circuitry 512 all the way to the central ALU 512 - 3 .
- the network device 500 also includes streaming aggregation circuitry comprising a plurality of respective streaming aggregation circuitry units 520 associated with respective ones of the ports 502 , and a responder handling database 524 included in each port 502 .
- the daisy-chain configuration includes at least two end nodes converging via intermediate nodes on a central node, as will now be described in more detail.
- the ALUs 512 are connected in a chain formation extending from two sides to the central ALU 512 - 3 , namely, from ALUs 512 - 1 via ALUs 512 - 2 to the central ALU 512 - 3 .
- Each end node includes one of the non-central ALUs 512 - 1 configured to receive data from at least one ingress port 502 - 1 ( FIG. 5 shows the ALUs 512 - 1 each receiving input from two ingress ports 502 - 1 ) and to output data to one of the non-central ALUs 512 - 2 (next in the chain) disposed in one of the intermediate nodes.
- Each intermediate node includes one of the non-central ALUs 512 - 2 configured to receive data from at least one of the ingress ports 502 - 2 and one of the non-central ALUs 512 - 1 or one of the ALUs 512 - 2 .
- the central node includes the central ALU 512 - 3 which is configured to receive data from at least two non-central ALUs (two ALUs 512 - 2 in FIG. 5 ) of the intermediate nodes.
- An ALU 512 in the daisy-chain configuration may receive input from one, two, or more ports 502 depending on the processing speed of the ALU 512 and the design requirements.
- the daisy-chain configuration generally does not require additional hierarchical layers of ALUs when new ports are added to the network device design as compared to the hierarchical configuration.
- each additional port generally adds a new streaming aggregation circuitry unit 520 , a responder handling database 524 and an ALU 512 - 1 (which may be shared with other ports).
- the daisy chain configuration generally results in shorter high-speed connections 518 across the chip implementing the network device 500 and is generally more scalable than the design used in the network device 300 .
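- The converging-chain data flow can be modeled in a few lines (again a software sketch of the dataflow only; the values are hypothetical and each `op` application stands for one ALU 512 in the chain):

```python
import operator

def chain_reduce(port_groups, op=operator.add):
    """Model one side of the daisy chain: the end-node ALU reduces
    its own port inputs, then each intermediate ALU folds its local
    port inputs into the running value received from the previous
    ALU in the chain."""
    acc = None
    for ports in port_groups:  # end node first, then intermediates
        local = ports[0]
        for value in ports[1:]:
            local = op(local, value)
        acc = local if acc is None else op(acc, local)
    return acc

# Two chains converging on the central ALU 512-3 (hypothetical data):
left = chain_reduce([[1, 2], [3], [4]])    # end node, then two intermediates
right = chain_reduce([[5, 6], [7], [8]])
print(left + right)  # 36: the central ALU combines the two chain results
```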
Abstract
In one embodiment, a network device includes ports to serve as ingress ports and as egress ports, streaming aggregation circuitry to analyze received data packets to identify the data packets having payloads targeted for a data reduction process as part of an aggregation protocol, parse at least some of the identified data packets into payload data and headers, and inject the parsed payload data into the data reduction process, data reduction circuitry to perform the data reduction process, and including hardware data modifiers (HDMs), the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process being performed by a central HDM to receive data from at least two non-central HDMs and to output resultant reduced data, and a transport layer controller to manage forwarding of the resultant reduced data to at least one network node.
Description
- The present application claims priority from U.S. Provisional Patent Application Ser. No. 62/739,879 of Levi, et al., filed Oct. 2, 2018, the disclosure of which is hereby incorporated herein by reference.
- The present invention relates to parallel computation, and in particular, but not exclusively, to parallel computation in a network device.
- In general, a task in parallel computation may require performing a reduction operation on a stream of data which is distributed over several nodes in a network. An example of a reduction operation may be a floating point ADD operation. The result of the operation may be published to one or more requesting processors. Another example of a reduction operation may include computing a variance of a large amount of data. The data may be distributed over N sub-processes, over N respective nodes, where each sub-process holds a subset of the data. Each sub-process calculates the sum (referred to as a first order sum below) of its data subset and calls a sum reduction operation. In a similar manner, the sum of each element to the power of two (e.g., squared) (referred to as a second order sum below) may be computed. The resulting first and second order sums are distributed to the N sub-processes. Each of the N sub-processes then computes the variance based on the first and second order sums of all the data subsets. Computing the variance by each of the N sub-processes may be useful, for example, when an application on one of the N nodes searches for some sort of estimator. In some cases, for each given estimator, the average and variance of an error may be computed, and according to the results, each of the N sub-processes selects a new estimator. Since the code in all the sub-processes is the same, the new estimator will be the same as well.
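- To make the variance example concrete, the following minimal sketch (Python used purely for illustration; the subset sizes and values are hypothetical) walks through the three phases: local partial sums at each sub-process, the sum reduction across sub-processes, and the identical variance computed everywhere from the reduced sums:

```python
# Data distributed over N = 3 sub-processes (hypothetical subsets):
subsets = [
    [1.0, 2.0, 3.0],   # held by sub-process 0
    [4.0, 5.0],        # held by sub-process 1
    [6.0, 7.0, 8.0],   # held by sub-process 2
]

# Local phase: each sub-process computes its count, first-order sum,
# and second-order sum (sum of squares).
partials = [(len(s), sum(s), sum(x * x for x in s)) for s in subsets]

# Reduction phase: element-wise sum across sub-processes; this is
# the part an in-network sum reduction would perform.
n = sum(p[0] for p in partials)
s1 = sum(p[1] for p in partials)   # first-order sum
s2 = sum(p[2] for p in partials)   # second-order sum

# Distribution phase: every sub-process receives (n, s1, s2) and
# computes the same (population) variance: E[x^2] - (E[x])^2.
mean = s1 / n
variance = s2 / n - mean * mean
print(mean, variance)  # 4.5 5.25
```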
- Modern computing and storage infrastructure use distributed systems to increase scalability and performance. Common uses for such distributed systems include: datacenter applications, distributed storage systems, and HPC clusters running parallel applications. While HPC and datacenter applications use different methods to implement distributed systems, both perform parallel computation on a large number of networked compute nodes with aggregation of partial results from the nodes into a global result.
- Many datacenter applications such as search and query processing, deep learning, graph and stream processing typically follow a partition-aggregation pattern. An example is the well-known MapReduce programming model for processing problems in parallel across huge datasets using a large number of computers arranged in a grid or cluster. In the partition phase, tasks and data sets are partitioned across compute nodes that process data locally potentially taking advantage of locality of data to generate partial results. The partition phase is followed by the aggregation phase where the partial results are collected and aggregated to obtain a final result. The data aggregation phase in many cases creates a bottleneck on the network due to many-to-one or many-to-few types of traffic, i.e., many nodes communicating with one node or a few nodes or controllers.
- For example, in large public datacenters analysis traces show that up to 46% of the datacenter traffic is generated during the aggregation phase, and network time can account for more than 30% of transaction execution time. In some cases, network time accounts for more than 70% of the execution time.
- Collective communication is a term used to describe communication patterns in which all members of a group of communication end-points participate. For example, in the case of the Message Passing Interface (MPI), the communication end-points are MPI processes and the groups associated with the collective operation are described by the local and remote groups associated with the MPI communicator.
- Many types of collective operations occur in HPC communication protocols, and more specifically in MPI and SHMEM (OpenSHMEM). The MPI standard defines blocking and non-blocking forms of barrier synchronization, broadcast, gather, scatter, gather-to-all, all-to-all gather/scatter, reduction, reduce-scatter, and scan. A single operation type, such as gather, may have several different variants, such as scatter and scatterv, which differ in such things as the relative amount of data each end-point receives or the MPI data-type associated with data of each MPI rank, i.e., the sequential number of the processes within a job or group.
- The OpenSHMEM specification (available on the Internet from the OpenSHMEM website) contains a communications library that uses one-sided communication and utilizes a partitioned global address space. The library includes such operations as blocking barrier synchronization, broadcast, collect, and reduction forms of collective operations.
- The performance of collective operations is often critical to the overall performance of the applications that use such functions, as these operations limit performance and scalability. This comes about because all communication end-points implicitly interact with each other, with serialized data exchange taking place between end-points. The specific communication and computation details of such operations depend on the type of collective operation, as does the scaling of these algorithms. Additionally, the explicit coupling between communication end-points tends to magnify the effects of system noise on the parallel applications using them, by delaying one or more data exchanges, resulting in further challenges to application scalability.
- Previous attempts to mitigate the traffic bottleneck include installing faster networks and implementing congestion control mechanisms. Other optimizations have focused on changes at the nodes or endpoints, e.g., HCA enhancements and host-based software changes. While these schemes enable more efficient and faster execution, they do not reduce the amount of data transferred and thus are limited.
- US Patent Publication 2017/0063613 of Bloch, et al. (hereinafter the '613 publication), which is hereby incorporated herein by reference, describes a scalable hierarchical aggregation protocol that implements in-network hierarchical aggregation, in which aggregation nodes (switches and routers) residing in the network fabric perform hierarchical aggregation to efficiently aggregate data from a large number of servers, without traversing the network multiple times. The protocol avoids congestion caused by incast, when many nodes send data to a single node.
- The '613 publication describes an efficient hardware implementation integrated into logic circuits of network switches, thus providing high performance and efficiency. The protocol advantageously employs reliable transport such as RoCE and InfiniBand transport (or any other transport assuring reliable transmission of packets) to support aggregation. The implementation of the aggregation protocol is network topology-agnostic, and produces repeatable results for non-commutative operations, e.g., floating point ADD operations, regardless of the request order of arrival. Aggregation result delivery is efficient and reliable, and group creation is supported.
- The '613 publication describes modifications in switch hardware and software. The protocol can be efficiently realized by incorporating an aggregation unit and floating point ALU into a network switch ASIC. The changes improve the performance of selected collective operations by processing the data as it traverses the network, eliminating the need to send data multiple times between end-points. This decreases the amount of data traversing the network as aggregation nodes are reached. In one aspect of the '613 publication collective communication algorithms are implemented in the network, thereby freeing up CPU resources for computation, rather than using them to process communication.
- The modified switches of the '613 publication support performance-critical barrier and collective operations involving reduction of data sets, for example reduction in the number of columns of a table. The modifications in the switches enable the development of collective protocols for frequently-used types of collective operations, while avoiding a large increase in switch hardware resources, e.g., die size. For a given application-to-system mapping, the reduction operations are reproducible, and support all but the product reduction operation applied to vectors, and also support data types commonly used by MPI and OpenSHMEM applications. Multiple applications sharing common network resources are supported, optionally employing caching mechanisms of management objects. As a further optimization, hardware multicast may distribute the results, with a reliability protocol to handle dropped multicast packets. In a practical system, based on Mellanox Switch-IB2 InfiniBand switches connecting 10,000 end nodes in a three-level fat-tree topology, the network portion of a reduction operation can be completed in less than three microseconds.
- The '613 publication describes a mechanism, referred to as the “Scalable Hierarchical Aggregation Protocol” (SHArP), that performs aggregation in a data network efficiently. This mechanism reduces bandwidth consumption and reduces latency.
- There is provided in accordance with an embodiment of the present disclosure, a network device, including a plurality of ports configured to serve as ingress ports for receiving data packets from a network and as egress ports for forwarding at least some of the data packets, streaming aggregation circuitry connected to the ingress ports, and configured to analyze the received data packets to identify ones of the data packets having payloads targeted for a data reduction process, parse at least some of the identified data packets into payload data and headers, and inject the parsed payload data into the data reduction process, data reduction circuitry connected to the ingress ports and configured to perform the data reduction process on the parsed payload data, and including a multiplicity of hardware data modifiers (HDMs) including a central HDM and non-central HDMs, the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process in the network device being performed by the central HDM which is configured to receive data from at least two of the non-central HDMs and to output resultant reduced data, and a transport layer controller configured to manage forwarding of the resultant reduced data in data packets to at least one network node via at least one of the egress ports.
- Further in accordance with an embodiment of the present disclosure at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level, each one of the non-central HDMs in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central HDMs in the at least one intermediate level, and the central HDM is disposed in the root level, and is configured to receive data from the at least two non-central HDMs in the at least one intermediate level.
- Still further in accordance with an embodiment of the present disclosure the HDMs are arranged in a daisy-chain configuration including at least two end nodes converging via intermediate nodes on a central node, wherein each one of the at least two end nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and to output data to one of the non-central HDMs disposed in one of the intermediate nodes, each one of the intermediate nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and one of the non-central HDMs, and the central node includes the central HDM which is configured to receive data from the at least two non-central HDMs of the intermediate nodes.
- Additionally, in accordance with an embodiment of the present disclosure, the device includes a central block including the central HDM, an application layer controller configured to control at least part of an aggregation protocol among network nodes in the network, select the at least one network node to which to forward the resultant reduced data output by the central HDM, and manage operations performed by the HDMs, the transport layer controller which is also configured to perform requester handling of the data packets including the resultant reduced data, and a requester handling database to store data used in the requester handling.
- Moreover, in accordance with an embodiment of the present disclosure, the device includes two respective input channels into the central block to the central HDM from two respective ones of the non-central HDMs, the non-central HDMs being disposed externally to the central block.
- Further in accordance with an embodiment of the present disclosure the streaming aggregation circuitry is configured to perform responder handling, so as to split the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol between the streaming aggregation circuitry and the central block, respectively, each of the ingress ports further including a responder handling database to store data used in the responder handling.
- Still further in accordance with an embodiment of the present disclosure at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level, each one of the non-central HDMs in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central HDMs in the at least one intermediate level, and the central HDM is disposed in the root level, and is configured to receive data from the at least two non-central HDMs in the at least one intermediate level.
- Additionally, in accordance with an embodiment of the present disclosure the HDMs are arranged in a daisy-chain configuration including at least two end nodes converging via intermediate nodes on a central node, wherein each one of the at least two end nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and to output data to one of the non-central HDMs disposed in one of the intermediate nodes, each one of the intermediate nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and one of the non-central HDMs, and the central node includes the central HDM which is configured to receive data from the at least two non-central HDMs of the intermediate nodes.
- Moreover, in accordance with an embodiment of the present disclosure the streaming aggregation circuitry includes a plurality of respective streaming aggregation circuitry units connected to, and serving respective ones of the ingress ports.
- Further in accordance with an embodiment of the present disclosure the data reduction process is part of an aggregation protocol.
- There is also provided in accordance with another embodiment of the present disclosure, a data reduction method, including receiving data packets in a network device from a network, forwarding at least some of the data packets, analyzing the received data packets to identify ones of the data packets having payloads targeted for a data reduction process, parsing at least some of the identified data packets into payload data and headers, injecting the parsed payload data into the data reduction process, performing the data reduction process on the parsed payload data using data reduction circuitry including a multiplicity of hardware data modifiers (HDMs) including a central HDM and non-central HDMs, the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process in the network device being performed by the central HDM outputting resultant reduced data, and managing forwarding of the resultant reduced data in data packets to at least one network node.
- Still further in accordance with an embodiment of the present disclosure at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level, the method further including receiving data from at least two ingress ports and outputting data to one of the non-central HDMs in the at least one intermediate level by one of the non-central HDMs in the leaf level, and receiving data from at least two non-central HDMs in the at least one intermediate level by the central HDM which is disposed in the root level.
- Additionally, in accordance with an embodiment of the present disclosure the HDMs are arranged in a daisy-chain configuration including at least two end nodes converging via intermediate nodes on a central node, the method further including receiving data from at least one ingress port and outputting data to one of the non-central HDMs disposed in one of the intermediate nodes by one of the non-central HDMs disposed in one of the at least two end nodes, receiving data from at least one ingress port and from one of the non-central HDMs by one of the non-central HDMs disposed in one of the intermediate nodes, and receiving data from at least two of the non-central HDMs of the intermediate nodes by the central HDM disposed in the central node.
- Moreover, in accordance with an embodiment of the present disclosure, the method includes controlling at least part of an aggregation protocol among network nodes in the network, selecting the at least one network node to which to forward the resultant reduced data output by the central HDM, managing operations performed by the HDMs, performing requester handling of the data packets including the resultant reduced data, and storing data used in the requester handling.
- Further in accordance with an embodiment of the present disclosure, the method includes performing responder handling by streaming aggregation circuitry, so as to split the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol between the streaming aggregation circuitry and a central block including the central HDM, respectively, and storing data used in the responder handling.
- Still further in accordance with an embodiment of the present disclosure at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level, the method further including receiving data from at least two ingress ports and outputting data to one of the non-central HDMs in the at least one intermediate level by one of the non-central HDMs in the leaf level, and receiving data from at least two non-central HDMs in the at least one intermediate level by the central HDM which is disposed in the root level.
- Additionally, in accordance with an embodiment of the present disclosure the HDMs are arranged in a daisy-chain configuration including at least two end nodes converging via intermediate nodes on a central node, the method further including receiving data from at least one ingress port and outputting data to one of the non-central HDMs disposed in one of the intermediate nodes by one of the non-central HDMs disposed in one of the at least two end nodes, receiving data from at least one ingress port and from one of the non-central HDMs by one of the non-central HDMs disposed in one of the intermediate nodes, and receiving data from at least two of the non-central HDMs of the intermediate nodes by the central HDM disposed in the central node.
- Moreover, in accordance with an embodiment of the present disclosure the data reduction process is part of an aggregation protocol.
- The present invention will be understood from the following detailed description, taken in conjunction with the drawings in which:
- FIGS. 1 and 2 are block diagram views of network devices implementing data reduction;
- FIG. 3 is a block diagram view of a network device implementing a data reduction process constructed and operative in accordance with an embodiment of the present invention;
- FIG. 4 is a flowchart including exemplary steps in a method of operation of the network device of FIG. 3; and
- FIG. 5 is a block diagram view of a network device implementing a data reduction process constructed and operative in accordance with an alternative embodiment of the present invention.
- As previously mentioned, a task in parallel computation may require performing a reduction operation on a stream of data which is distributed over several nodes in a network. In any node, a data reduction operation may be performed on a sub-set of the data and a result of the data reduction operation may then be distributed to one or more network nodes. The distribution of the data and the results may be managed in accordance with an aggregation protocol, for example, but not limited to, the "Scalable Hierarchical Aggregation Protocol" (SHArP), described in the abovementioned '613 publication.
- The data reduction operation may be performed by any suitable processing device, for example, but not limited to, a network device such as a switch or a router, which in addition to performing packet forwarding also performs data reduction. The network device may receive data packets from various respective network nodes over respective ingress ports. The received data packets may be analyzed to determine whether the packets should be forwarded to other network nodes, or whether the packets should be forwarded to a data reduction process within the receiving network device.
- One problem to be addressed when performing data reduction in a network device is to perform the data reduction efficiently while maintaining throughput. Maintaining a high data throughput is particularly challenging when data is being received from multiple network nodes.
- One solution is for the network device to include a central block to which all the packets destined for the data reduction process are sent. The central block may include a single arithmetic logic unit (ALU) to perform the data reduction process, an application layer controller to manage the aggregation protocol, and a transport layer controller to manage receiving data packets of the aggregation protocol, to manage forwarding the resultant reduced data output of the data reduction process to one or more network nodes, and to manage responder and requester handling. With this solution, the single ALU quickly becomes overloaded and generally cannot reduce the data arriving at more than two ingress ports without compromising throughput. Even if more ALUs are added to the central block, the ingress and egress interfaces of the central block become congested.
- Another solution is to provide high speed links between each ingress port and the central block, or multiple high speed links from the forwarding circuitry to the central block, with the central block including multiple ALUs arranged in a hierarchical structure: in a first level of the hierarchical structure, data from any two ingress ports is reduced by one ALU; in a second level, the data output of the ALUs in the first level is reduced by the ALUs in the second level; and so on, until a central ALU receives data input from two ALUs to yield a final reduced data output. Therefore, the overall processing over the various ALUs takes place at a speed which is intended not to be limited other than by the maximum interconnect speed between the ALUs; this maximum speed is also termed herein "wire speed" (WS). Although this solution provides sufficient throughput due to the different levels of ALUs, it requires high-speed connections running over the chip between each ingress port (or the forwarding circuitry) and the central block. This solution is generally not scalable: when the number of ports is increased, the high-speed connections running over the chip to the central block and the additional ALUs needed in the central block increase as well, leading to a more complicated and expensive chip design and manufacture.
- Therefore, in some embodiments of the present invention, the network device includes multiple non-central ALUs located outside of the central block, on the same chip as the central block, with generally two high speed connections from the non-central ALUs outside of the central block leading into a central ALU located in the central block. This design reduces the number of high-speed connections entering the central block from the device ports to two. Additionally, locating the non-central ALUs outside the central block generally leads to an overall shorter length of high-speed connections on the chip; placing the non-central ALUs closer to the ports generally shortens the overall length of the high-speed connections further.
- In some embodiments, the responder handling and requester handling of data packets associated with the data reduction process of the aggregation protocol are split between the streaming aggregation circuitry units associated with the ingress ports (described in more detail below) and the transport layer controller of the central block, respectively. Each of the ingress ports is associated with a responder handling database to store data associated with the responder handling, while the central block includes a requester handling database to store data associated with the requester handling.
- As a result of splitting the responder handling and requester handling between the streaming aggregation circuitry units and the transport layer controller of the central block, the payload data of packets targeted for the data reduction process can be injected into the data reduction process without having to send all the packet data (e.g., including headers and other responder handling data) to the central block, thereby reducing congestion in the central block.
- Each of the ingress ports may be associated with a streaming aggregation circuitry unit which speculatively analyzes received data packets based on header data (described in more detail below) to identify the data packets having payloads targeted for the data reduction process as part of an aggregation protocol. The streaming aggregation circuitry unit, after receiving confirmation of the speculative analysis from the forwarding circuitry, parses the identified data packets into payload data and headers and injects the parsed payload data into the data reduction process.
- In some embodiments the ALUs are arranged in a hierarchical configuration including a root level, a leaf level and one or more intermediate levels. Each non-central ALU in the leaf level may receive data from at least two of the ingress ports and output data to one of the non-central ALUs in an intermediate level adjacent to the leaf level. The central ALU is disposed in the root level, and receives data from two (or more) non-central ALUs in the intermediate level below the root level.
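- The staged dataflow of this hierarchical configuration can be pictured with a small software model. The sketch below is only an illustration of the dataflow (leaf units reducing pairs of port inputs, intermediate levels reducing pairs of partial results, the central unit emitting the final value), not the hardware implementation; all names are illustrative.

```python
import operator
from functools import reduce

def reduce_level(partials, op=operator.add):
    """One level of the hierarchy: each unit reduces (up to) two inputs."""
    return [reduce(op, partials[i:i + 2]) for i in range(0, len(partials), 2)]

port_data = [1, 2, 3, 4, 5, 6, 7, 8]  # one value per ingress port
level = reduce_level(port_data)       # leaf-level non-central ALUs
while len(level) > 1:                 # intermediate levels toward the root
    level = reduce_level(level)
result = level[0]                     # output of the central ALU
print(result)                         # 36 for the sample data
```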
- In other embodiments, the ALUs are arranged in a daisy-chain configuration comprising two (or more) end nodes converging via intermediate nodes on a central node. Each of the end nodes includes one of the non-central ALUs and may receive data from at least one ingress port and output data to a non-central ALU (i.e., the next ALU in the chain) disposed in one of the intermediate nodes. Each intermediate node includes a non-central ALU which may receive data from at least one ingress port and one of the non-central ALUs (i.e., the previous ALU in the chain). The central node includes the central ALU which receives data from two (or more) non-central ALUs of the intermediate nodes.
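- The daisy-chain dataflow can be modeled in the same illustrative way: each end node reduces only its own ports, each intermediate node folds in its ports together with the upstream partial result, and the central node combines the two converging chains. This is a sketch under those assumptions, not the device logic.

```python
def chain_reduce(port_groups):
    """Walk one chain from its end node toward the central node."""
    partial = sum(port_groups[0])        # end node: reduces its ports only
    for ports in port_groups[1:]:        # intermediate nodes in the chain
        partial += sum(ports)            # fold in ports + upstream partial
    return partial

left = chain_reduce([[1, 2], [3, 4], [5, 6]])   # one side of the chain
right = chain_reduce([[7, 8], [9, 10]])         # the other side
central_result = left + right                   # final stage at the central ALU
print(central_result)                           # 55 for the sample data
```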
- Documents incorporated by reference herein are to be considered an integral part of the application except that, to the extent that any terms are defined in these incorporated documents in a manner that conflicts with definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
- Reference is now made to FIG. 1, which is a block diagram view of a network device 100 implementing data reduction. The network device 100 includes a plurality of ports 102, forwarding circuitry 104, and a central block 106 including an application layer controller 108, a transport layer controller 110, an ALU 112, and a database 114.
- The ports 102 are configured as ingress and egress ports for receiving and sending data packets 116, respectively. The received packets 116 are processed by the forwarding circuitry 104 and are either: forwarded to the central block 106, via a single high-speed channel 118, for injection by the application layer controller 108 into a data reduction process performed by the ALU 112; or forwarded to another network node via one of the egress ports. The ALU 112 processes data received by the various ports 102 in a serial fashion. The resultant reduced data output by the ALU 112 is packetized by the transport layer controller 110, which manages forwarding the packetized data to at least one network node according to the dictates of an aggregation protocol. The transport layer controller 110 also manages requester and responder handling of the packets associated with the aggregation protocol. It should be noted that the transport layer controller 110 may also manage responder and requester handling for packets received by the network device 100 from a parent node in the aggregation protocol for forwarding to one or more other network nodes in the aggregation protocol. The application layer controller 108 manages at least part of the aggregation protocol (based at least on data included in the packet headers of the received packets for the data reduction process), including selecting which network node(s) (not shown) the packetized data should be sent to. The application layer controller 108, typically based on data included in the packet headers of the received packets for the data reduction process, instructs the ALU 112 how to reduce the data (e.g., which mathematical operation(s) to perform on the data). The database 114 stores data used by the application layer controller 108 and the transport layer controller 110, including data used in requester and responder handling.
- Data reduction processing may be performed by any suitable hardware data modifier, and embodiments of the present invention are not limited to using an ALU for performing data reduction. In some embodiments, the hardware data modifiers may perform any suitable data modification, including concatenation, by way of example only.
- As mentioned above, the data reduction process in the network device 100 does not generally provide throughput at "wire speed" (WS) unless data is being received via only one of the ports 102.
- Reference is now made to FIG. 2, which is a block diagram view of a network device 200 implementing data reduction. In this figure and in the figures referenced below, similar reference numerals have been used to reference similar elements for the sake of consistency. The network device 200 includes a plurality of ports 202, forwarding circuitry 204, and a central block 206 including an application layer controller 208, a transport layer controller 210, a plurality of ALUs 212, and a database 214. The network device 200 also includes a streaming aggregation circuitry unit 220 associated with each of the ports 202.
- The network device 200 of FIG. 2 illustrates how the central block 106 of the network device 100 could be amended in order to achieve WS processing throughput in the data reduction process. However, the design of the network device 200 has some drawbacks, discussed below in more detail.
- Data packets 216 received by each port 202 are speculatively analyzed by the associated streaming aggregation circuitry unit 220 to determine whether the packets 216 should be injected into the data reduction process (via high-speed connections 218) or forwarded to another network node via internal connections 222, the forwarding circuitry 204, and one of the egress ports. The speculative analysis is confirmed by the forwarding circuitry 204 prior to the streaming aggregation circuitry unit 220 parsing the packets 216 and injecting the parsed payload data into the data reduction process. This process is described in more detail with reference to the embodiment of FIGS. 3 and 4.
- Each streaming aggregation circuitry unit 220 is connected via a high-speed connection 218 to one of the ALUs 212 in the central block 206. In some embodiments, the functionality of the streaming aggregation circuitry unit 220 is combined with the forwarding circuitry 204, and multiple high-speed links may be provided between the forwarding circuitry 204 and the central block 206.
- The ALUs 212 are arranged in a hierarchical configuration in the central block 206 so that each ALU 212-1 in a layer closest to the streaming aggregation circuitry units 220 receives the data packets 216 from two of the streaming aggregation circuitry units 220. The ALUs 212-1 reduce data included in the packet payloads. Each ALU 212-2 in a layer adjacent to the ALUs 212-1 receives the reduced data and at least part of the packets 216 (e.g., the packet headers). The ALUs 212-2 further reduce the received reduced data, yielding further reduced data. The ALU 212-3 located in a root layer of the hierarchical configuration receives the further reduced data from the ALUs 212-2, and still further reduces the received data, yielding a resultant reduced data which is packetized by the transport layer controller 210 for forwarding to one or more network nodes determined by the application layer controller 208.
- The network device 200 therefore provides a data reduction process at WS, at the cost of the high-speed connection 218 extending from each of the streaming aggregation circuitry units 220 to the central block 206. As discussed above, the network device 200 is generally not scalable in terms of chip design and manufacture, among other drawbacks, due to the high-speed connections 218 extending from the streaming aggregation circuitry units 220 to the central block 206 and the placement of many ALUs 212 in the central block 206.
- Reference is now made to FIG. 3, which is a block diagram view of a network device 300 implementing a data reduction process constructed and operative in accordance with an embodiment of the present invention.
- The network device 300 includes a plurality of ports 302 (configurable as ingress and egress ports), forwarding circuitry 304, data reduction circuitry 312, a central block 306 including an application layer controller 308, a transport layer controller 310, a central ALU 312-3 (which is part of the data reduction circuitry 312), and a requester handling database 314, as well as streaming aggregation circuitry including a plurality of respective streaming aggregation circuitry units 320 connected to, and serving, respective ones of the ingress ports. The network device 300 also includes a responder handling database 324 comprised in each port 302. For the sake of simplicity, not all of the responder handling databases 324 have been labelled in FIG. 3.
- The ports 302 are configured to serve as ingress ports for receiving data packets 316 from a network and as egress ports for forwarding at least some of the data packets 316. Each streaming aggregation circuitry unit 320 is configured to speculatively analyze the received data packets 316 (received on the port 302 associated with that streaming aggregation circuitry unit 320) to identify the data packets 316 having payloads targeted for the data reduction process as part of an aggregation protocol, for example, but not limited to, the "Scalable Hierarchical Aggregation Protocol" (SHArP), described in the abovementioned '613 publication. The streaming aggregation circuitry unit 320 forwards packet headers to the forwarding circuitry 304 as part of a confirmation process whereby the forwarding circuitry 304 confirms that the packets analyzed as being targeted for the data reduction process may be injected into the data reduction process. This is described in more detail with reference to FIG. 4.
- Subject to receiving confirmation from the forwarding circuitry 304, each streaming aggregation circuitry unit 320 is configured to parse the identified data packets into payload data and headers and to inject the parsed payload data into the data reduction process.
- The streaming aggregation circuitry units 320 are configured to perform responder handling (of the packets 316 associated with the aggregation protocol and the packets 316 not associated with the aggregation protocol), whereas requester handling of data packets associated with the aggregation protocol is performed by the transport layer controller 310, described in more detail below.
- It should be noted that the transport layer controller 310 in the central block 306 may still manage responder and requester handling for packets received by the network device 300 from a parent node in the aggregation protocol for forwarding to one or more other nodes in the aggregation protocol. The packets received from the parent node are forwarded from the receiving port 302 via the forwarding circuitry 304 to the central block 306, where the transport layer controller 310 manages the responder handling of the received packet. The transport layer controller 310 then manages forwarding and requester handling of the received packet to one or more network nodes according to the aggregation protocol.
- Packet headers of the packets 316 targeted for the data reduction process may also include data needed by the application layer controller 308 for managing the aggregation protocol, such as which operation(s) (e.g., mathematical operation(s)) the ALUs 312 should perform and where resultant reduced data should be sent. Therefore, at least one packet (e.g., a first packet in a message) of the packets 316 targeted for the data reduction process is forwarded to the central block 306 via the forwarding circuitry 304 for receipt by the application layer controller 308.
- Therefore, the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol are split between the streaming aggregation circuitry units 320 and the central block 306 (in which the transport layer controller 310 is disposed), respectively. The split also manifests itself with respect to data storage, with the responder handling database 324 of each port 302 storing data used in the responder handling, and the requester handling database 314 disposed in the central block 306 storing data used in the requester handling as well as other data used by the application layer controller 308 and the transport layer controller 310. Splitting the responder and requester handling between the streaming aggregation circuitry units 320 and the central block 306 also leads to less data being injected into the data reduction circuitry 312, as only data needed for data reduction needs to be injected into the data reduction circuitry.
- The data reduction circuitry 312 is connected to the ingress ports via the streaming aggregation circuitry units 320 and is configured to perform the data reduction process on the parsed payload data. The data reduction circuitry 312 includes a multiplicity of ALUs, including the central ALU 312-3 and non-central ALUs 312-1, 312-2. The ALUs are connected and arranged to reduce the parsed payload data in stages, with a last stage of the data reduction process in the network device 300 being performed by the central ALU 312-3. The central ALU 312-3 is configured to receive data from the non-central ALUs 312-2 and to output resultant reduced data. In the example of FIG. 3, the central ALU 312-3 receives input from two non-central ALUs 312-2. In some embodiments, the central ALU 312-3 may receive input from more than two non-central ALUs 312-2.
- In the embodiment of FIG. 3, the non-central ALUs 312-1, 312-2 are disposed externally to the central block 306 so that there are only two respective high-speed input channels 318 into the central block 306 to the central ALU 312-3 from two respective ones of the non-central ALUs 312-2. Disposing the non-central ALUs externally to the central block 306 leads to shorter overall high-speed channels 318, even though the network device 300 generally has the same number of ALUs as the network device 200 of FIG. 2. The design of the network device 300 not only shortens the overall length of the high-speed channels but also reduces the number of interfaces to the central block 306, and is generally more scalable.
- In some embodiments, the central block 306 may include more than one ALU, with more than two input channels 318 entering the central block 306.
- The example of FIG. 3 shows the ALUs arranged in a hierarchical configuration including a root level, a leaf level and an intermediate level. Each non-central ALU 312-1 in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central ALUs 312-2 in the intermediate level. The central ALU 312-3 is disposed in the root level, and is configured to receive data from the non-central ALUs 312-2 in the intermediate level. The number of ingress ports connected to, and providing data to, one ALU 312-1 may depend on the processing capabilities and bandwidth properties of the ALU 312-1. The number of levels in the hierarchical configuration depends on the number of ports 302 included in the network device 300, so that in some implementations the network device 300 may include multiple intermediate levels, in which an ALU in an intermediate level closer to the leaf level may output data to an adjacent intermediate level closer to the root level, and so on, until all the intermediate levels are traversed by the data that is input into the data reduction process.
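- As a rough illustration of how the number of levels grows with the port count, the snippet below computes the depth of a reduction tree under the simplifying assumption of a fixed fan-in per ALU; actual devices may use different fan-ins at different levels, so this is a back-of-the-envelope sketch only.

```python
def tree_levels(num_ports, fan_in=2):
    """Levels needed to reduce num_ports inputs with a fixed fan-in."""
    levels, remaining = 0, num_ports
    while remaining > 1:
        remaining = -(-remaining // fan_in)  # ceiling division
        levels += 1
    return max(1, levels)

for ports in (8, 16, 64, 128):
    print(ports, "ports ->", tree_levels(ports), "levels")
```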
- The application layer controller 308 is configured to: control at least part of the aggregation protocol among network nodes in the network; manage the operation(s) (e.g., mathematical operation(s)) performed by the ALUs; and select at least one network node (e.g., according to the aggregation protocol) to which to forward the resultant reduced data output by the central ALU 312-3.
- The transport layer controller 310 is configured to: manage forwarding of the resultant reduced data in data packets to the network node(s) (selected by the application layer controller 308) via at least one of the egress ports; and perform requester handling of the data packets that include the resultant reduced data.
- Any suitable method may be used to distribute the data packets that include the resultant reduced data to the selected network node(s), for example, but not limited to, a distribution method described in US Patent Publication 2018/0287928 of Levi, et al., which is herein incorporated by reference.
- In practice, some or all of the functions of the transport layer controller 310 and the application layer controller 308 may be combined in a single physical component or, alternatively, implemented using multiple physical components. These physical components may comprise hard-wired or programmable devices, or a combination of the two. In some embodiments, at least some of the functions may be carried out by a programmable processor under the control of suitable software. This software may be downloaded to a device in electronic form, over a network, for example. Alternatively, or additionally, the software may be stored in tangible, non-transitory computer-readable storage media, such as optical, magnetic, or electronic memory.
- The network device 300 may include other standard network device components, which have not been described in the present disclosure for the sake of simplicity.
- Reference is now made to FIG. 4, which is a flowchart 400 including exemplary steps in a method of operation of the network device 300 of FIG. 3. Reference is also made to FIG. 3.
- The streaming aggregation circuitry comprising the streaming aggregation circuitry units 320 is configured to inspect (block 402) the received data packets 316 to speculatively analyze the data packets to identify data packets having payloads targeted for the data reduction process as part of the aggregation protocol. The speculative analysis may include inspecting packet header information, including source and destination information as well as other header information such as layer 4 data. At a decision block 404, the streaming aggregation circuitry determines whether ones of the packets 316 are potentially for the data reduction process (branch 410) or not for the data reduction process (branch 406). The packets 316 which are not targeted for the data reduction process (including packets from a parent node of the aggregation process and destined for the central block 306) are forwarded (block 408) according to a forwarding mechanism to one or more network nodes (which may include forwarding to the central block 306) according to the destination addresses of the packets 316. For example, the packets 316 may be forwarded by the streaming aggregation circuitry units 320 to the forwarding circuitry 304 for forwarding to one or more network nodes via one or more of the egress ports.
- The streaming aggregation circuitry is configured to send (block 412) packet headers of data packets having payloads potentially targeted for the data reduction process as part of the aggregation protocol to the forwarding circuitry 304. The forwarding circuitry 304 is configured to analyze the packet headers (using layer 2 and layer 3 information, such as source and destination information) in order to determine whether the speculative analysis of the streaming aggregation circuitry was correct. The forwarding circuitry 304 is configured to send a confirmation approving the speculative analysis to the streaming aggregation circuitry, which receives (block 414) the decision. If the forwarding circuitry 304 determines that the speculative analysis of the streaming aggregation circuitry for a given packet(s) is incorrect, the streaming aggregation circuitry unit does not forward payload data of the given packet(s) to the data reduction path, but instead forwards the given packet(s) via the standard forwarding mechanism of the network device 300.
- The streaming aggregation circuitry is configured to parse (block 416) the data packets 316 confirmed as being targeted for the data reduction process into payload data and headers. The streaming aggregation circuitry is configured to inject (block 418) the parsed payload data into the data reduction process executed by the ALUs 312, which perform (block 420) the data reduction process, yielding the resultant reduced data. The transport layer controller 310 is configured to manage (block 422) forwarding the resultant reduced data in data packets.
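- The flow of blocks 402-422 can be summarized in a few plain functions. This is a hypothetical software rendering of the flowchart for exposition only; every name in it (the dictionary fields and the confirm/forward/inject methods) is an assumption rather than part of the device's actual interface.

```python
def handle_packet(packet, forwarding, reduction):
    """One packet through the FIG. 4 flow (block numbers in comments)."""
    header, payload = packet["header"], packet["payload"]
    # blocks 402-404: speculative analysis of the header
    if not looks_like_reduction_traffic(header):
        forwarding.forward(packet)            # block 408: normal forwarding
        return
    # blocks 412-414: forwarding circuitry confirms (or rejects) the guess
    if not forwarding.confirm(header):
        forwarding.forward(packet)            # speculation was wrong
        return
    # blocks 416-418: parse, then inject only the payload into the reduction
    reduction.inject(payload)

def looks_like_reduction_traffic(header):
    # Hypothetical check on source/destination and layer-4 fields.
    return header.get("protocol") == "aggregation"
```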
- Reference is now made to FIG. 5, which is a block diagram view of a network device 500 implementing a data reduction process constructed and operative in accordance with an alternative embodiment of the present invention.
- The network device 500 is substantially the same as the network device 300 of FIG. 3, except that the ALUs of the data reduction circuitry 512 of the network device 500 are arranged in a daisy-chain configuration, described in more detail below. The network device 500 includes a plurality of ports 502, forwarding circuitry 504, a central block 506 including an application layer controller 508, a transport layer controller 510, a central ALU 512-3, and a requester handling database 514. The network device 500 also includes high speed connections 518 between the ports 502 and the data reduction circuitry 512 all the way to the central ALU 512-3. The network device 500 also includes streaming aggregation circuitry comprising a plurality of respective streaming aggregation circuitry units 520 associated with respective ones of the ports 502, and a responder handling database 524 included in each port 502.
- The daisy-chain configuration includes at least two end nodes converging via intermediate nodes on a central node, as will now be described in more detail. The ALUs 512 are connected in a chain formation extending from two sides to the central ALU 512-3, namely, from the ALUs 512-1 via the ALUs 512-2 to the central ALU 512-3.
- Each end node includes one of the non-central ALUs 512-1 configured to receive data from at least one ingress port 502-1 (FIG. 5 shows the ALUs 512-1 each receiving input from two ingress ports 502-1) and to output data to one of the non-central ALUs 512-2 (next in the chain) disposed in one of the intermediate nodes.
- Each intermediate node includes one of the non-central ALUs 512-2 configured to receive data from at least one of the ingress ports 502-2 and from one of the non-central ALUs 512-1 or one of the ALUs 512-2.
- The central node includes the central ALU 512-3, which is configured to receive data from at least two non-central ALUs (two ALUs 512-2 in FIG. 5) of the intermediate nodes.
- An ALU 512 in the daisy-chain configuration may receive input from one, two, or more ports 502, depending on the processing speed of the ALU 512 and the design requirements.
- The daisy-chain configuration generally does not require additional hierarchical layers of ALUs when new ports are added to the network device design, as compared to the hierarchical configuration. In the daisy-chain configuration, each additional port generally adds a new streaming aggregation circuitry unit 520, a responder handling database 524, and an ALU 512-1 (which may be shared with other ports). The daisy-chain configuration generally results in shorter high-speed connections 518 across the chip implementing the network device 500 and is generally more scalable than the design used in the network device 300.
- Various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.
- The embodiments described above are cited by way of example, and the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
Claims (18)
1. A network device, comprising:
a plurality of ports configured to serve as ingress ports for receiving data packets from a network and as egress ports for forwarding at least some of the data packets;
streaming aggregation circuitry connected to the ingress ports, and configured to:
analyze the received data packets to identify ones of the data packets having payloads targeted for a data reduction process;
parse at least some of the identified data packets into payload data and headers; and
inject the parsed payload data into the data reduction process;
data reduction circuitry connected to the ingress ports and configured to perform the data reduction process on the parsed payload data, and comprising a multiplicity of hardware data modifiers (HDMs) including a central HDM and non-central HDMs, the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process in the network device being performed by the central HDM which is configured to receive data from at least two of the non-central HDMs and to output resultant reduced data; and
a transport layer controller configured to manage forwarding of the resultant reduced data in data packets to at least one network node via at least one of the egress ports.
2. The device according to claim 1 , wherein:
at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level;
each one of the non-central HDMs in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central HDMs in the at least one intermediate level; and
the central HDM is disposed in the root level, and is configured to receive data from the at least two non-central HDMs in the at least one intermediate level.
3. The device according to claim 1 , wherein the HDMs are arranged in a daisy-chain configuration comprising at least two end nodes converging via intermediate nodes on a central node, wherein:
each one of the at least two end nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and to output data to one of the non-central HDMs disposed in one of the intermediate nodes;
each one of the intermediate nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and one of the non-central HDMs; and
the central node includes the central HDM which is configured to receive data from the at least two non-central HDMs of the intermediate nodes.
4. The device according to claim 1 , further comprising a central block including:
the central HDM;
an application layer controller configured to: control at least part of an aggregation protocol among network nodes in the network; select the at least one network node to which to forward the resultant reduced data output by the central HDM; and manage operations performed by the HDMs;
the transport layer controller which is also configured to perform requester handling of the data packets including the resultant reduced data; and
a requester handling database to store data used in the requester handling.
5. The device according to claim 4 , further comprising two respective input channels into the central block to the central HDM from two respective ones of the non-central HDMs, the non-central HDMs being disposed externally to the central block.
6. The device according to claim 4 , wherein the streaming aggregation circuitry is configured to perform responder handling, so as to split the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol between the streaming aggregation circuitry and the central block, respectively, each of the ingress ports further comprising a responder handling database to store data used in the responder handling.
7. The device according to claim 6 , wherein:
at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level;
each one of the non-central HDMs in the leaf level is configured to receive data from at least two of the ingress ports and to output data to one of the non-central HDMs in the at least one intermediate level; and
the central HDM is disposed in the root level, and is configured to receive data from the at least two non-central HDMs in the at least one intermediate level.
8. The device according to claim 6 , wherein the HDMs are arranged in a daisy-chain configuration comprising at least two end nodes converging via intermediate nodes on a central node, wherein:
each one of the at least two end nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and to output data to one of the non-central HDMs disposed in one of the intermediate nodes;
each one of the intermediate nodes includes one of the non-central HDMs configured to receive data from at least one of the ingress ports and one of the non-central HDMs; and
the central node includes the central HDM which is configured to receive data from the at least two non-central HDMs of the intermediate nodes.
9. The device according to claim 1 , wherein the streaming aggregation circuitry includes a plurality of respective streaming aggregation circuitry units connected to, and serving respective ones of the ingress ports.
10. The device according to claim 1 , wherein the data reduction process is part of an aggregation protocol.
11. A data reduction method, comprising:
receiving data packets in a network device from a network;
forwarding at least some of the data packets;
analyzing the received data packets to identify ones of the data packets having payloads targeted for a data reduction process;
parsing at least some of the identified data packets into payload data and headers;
injecting the parsed payload data into the data reduction process;
performing the data reduction process on the parsed payload data using data reduction circuitry comprising a multiplicity of hardware data modifiers (HDMs) including a central HDM and non-central HDMs, the HDMs being connected and arranged to reduce the parsed payload data in stages with a stage of the data reduction process in the network device being performed by the central HDM outputting resultant reduced data; and
managing forwarding of the resultant reduced data in data packets to at least one network node.
12. The method according to claim 11 , wherein at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level; the method further comprising:
receiving data from at least two ingress ports and outputting data to one of the non-central HDMs in the at least one intermediate level by one of the non-central HDMs in the leaf level; and
receiving data from at least two non-central HDMs in the at least one intermediate level by the central HDM which is disposed in the root level.
13. The method according to claim 11 , wherein the HDMs are arranged in a daisy-chain configuration comprising at least two end nodes converging via intermediate nodes on a central node, the method further comprising:
receiving data from at least one ingress port and outputting data to one of the non-central HDMs disposed in one of the intermediate nodes by one of the non-central HDMs disposed in one of the at least two end nodes;
receiving data from at least one ingress port and from one of the non-central HDMs by one of the non-central HDMs disposed in one of the intermediate nodes; and
receiving data from at least two of the non-central HDMs of the intermediate nodes by the central HDM disposed in the central node.
14. The method according to claim 11 , further comprising:
controlling at least part of an aggregation protocol among network nodes in the network;
selecting the at least one network node to which to forward the resultant reduced data output by the central HDM;
managing operations performed by the HDMs;
performing requester handling of the data packets including the resultant reduced data; and
storing data used in the requester handling.
15. The method according to claim 14 , further comprising:
performing responder handling by streaming aggregation circuitry, so as to split the responder handling and the requester handling of data packets targeted for the data reduction process of the aggregation protocol between the streaming aggregation circuitry and a central block comprising the central HDM, respectively; and
storing data used in the responder handling.
16. The method according to claim 15 , wherein at least some of the HDMs are arranged in a hierarchical configuration including a root level, a leaf level and at least one intermediate level; the method further comprising:
receiving data from at least two ingress ports and outputting data to one of the non-central HDMs in the at least one intermediate level by one of the non-central HDMs in the leaf level; and
receiving data from at least two non-central HDMs in the at least one intermediate level by the central HDM which is disposed in the root level.
17. The method according to claim 15 , wherein the HDMs are arranged in a daisy-chain configuration comprising at least two end nodes converging via intermediate nodes on a central node, the method further comprising:
receiving data from at least one ingress port and outputting data to one of the non-central HDMs disposed in one of the intermediate nodes by one of the non-central HDMs disposed in one of the at least two end nodes;
receiving data from at least one ingress port and from one of the non-central HDMs by one of the non-central HDMs disposed in one of the intermediate nodes; and
receiving data from at least two of the non-central HDMs of the intermediate nodes by the central HDM disposed in the central node.
18. The method according to claim 11 , wherein the data reduction process is part of an aggregation protocol.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/357,356 US20200106828A1 (en) | 2018-10-02 | 2019-03-19 | Parallel Computation Network Device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862739879P | 2018-10-02 | 2018-10-02 | |
US16/357,356 US20200106828A1 (en) | 2018-10-02 | 2019-03-19 | Parallel Computation Network Device |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200106828A1 true US20200106828A1 (en) | 2020-04-02 |
Family
ID=69946749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/357,356 Abandoned US20200106828A1 (en) | 2018-10-02 | 2019-03-19 | Parallel Computation Network Device |
Country Status (1)
Country | Link |
---|---|
US (1) | US20200106828A1 (en) |
Patent Citations (90)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6483804B1 (en) * | 1999-03-01 | 2002-11-19 | Sun Microsystems, Inc. | Method and apparatus for dynamic packet batching with a high performance network interface |
US6728862B1 (en) * | 2000-05-22 | 2004-04-27 | Gazelle Technology Corporation | Processor array and parallel data processing methods |
US20100185719A1 (en) * | 2000-06-26 | 2010-07-22 | Howard Kevin D | Apparatus For Enhancing Performance Of A Parallel Processing Environment, And Associated Methods |
US20050281287A1 (en) * | 2004-06-21 | 2005-12-22 | Fujitsu Limited | Control method of communication system, communication controller and program |
US8510366B1 (en) * | 2006-03-22 | 2013-08-13 | The Mathworks, Inc. | Dynamic distribution for distributed arrays and related rules |
US20080244220A1 (en) * | 2006-10-13 | 2008-10-02 | Guo Hui Lin | Filter and Method For Filtering |
US8380880B2 (en) * | 2007-02-02 | 2013-02-19 | The Mathworks, Inc. | Scalable architecture |
US20080263329A1 (en) * | 2007-04-19 | 2008-10-23 | Archer Charles J | Parallel-Prefix Broadcast for a Parallel-Prefix Operation on a Parallel Computer |
US7738443B2 (en) * | 2007-06-26 | 2010-06-15 | International Business Machines Corporation | Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited |
US20090063817A1 (en) * | 2007-08-27 | 2009-03-05 | Arimilli Lakshminarayana B | System and Method for Packet Coalescing in Virtual Channels of a Data Processing System in a Multi-Tiered Full-Graph Interconnect Architecture |
US20090063891A1 (en) * | 2007-08-27 | 2009-03-05 | Arimilli Lakshminarayana B | System and Method for Providing Reliability of Communication Between Supernodes of a Multi-Tiered Full-Graph Interconnect Architecture |
US20100095086A1 (en) * | 2008-10-14 | 2010-04-15 | International Business Machines Corporation | Dynamically Aligning Enhanced Precision Vectors Based on Addresses Corresponding to Reduced Precision Vectors |
US20190377580A1 (en) * | 2008-10-15 | 2019-12-12 | Hyperion Core Inc. | Execution of instructions based on processor and data availability |
US20100241828A1 (en) * | 2009-03-18 | 2010-09-23 | Microsoft Corporation | General Distributed Reduction For Data Parallel Computing |
US20110060891A1 (en) * | 2009-09-04 | 2011-03-10 | International Business Machines Corporation | Parallel pipelined vector reduction in a data processing system |
US20110066649A1 (en) * | 2009-09-14 | 2011-03-17 | Myspace, Inc. | Double map reduce distributed computing framework |
US20110173413A1 (en) * | 2010-01-08 | 2011-07-14 | International Business Machines Corporation | Embedding global barrier and collective in a torus network |
US20110219208A1 (en) * | 2010-01-08 | 2011-09-08 | International Business Machines Corporation | Multi-petascale highly efficient parallel supercomputer |
US20110238956A1 (en) * | 2010-03-29 | 2011-09-29 | International Business Machines Corporation | Collective Acceleration Unit Tree Structure |
US20110276789A1 (en) * | 2010-05-04 | 2011-11-10 | Google Inc. | Parallel processing of data |
US20150188987A1 (en) * | 2010-11-15 | 2015-07-02 | Coke S. Reed | Parallel information system utilizing flow control and virtual channels |
US20120131309A1 (en) * | 2010-11-18 | 2012-05-24 | Texas Instruments Incorporated | High-performance, scalable multicore hardware and software system |
US20130117548A1 (en) * | 2011-11-07 | 2013-05-09 | Nvidia Corporation | Algorithm for vectorization and memory coalescing during compiling |
US20140362692A1 (en) * | 2012-02-24 | 2014-12-11 | Huawei Technologies Co., Ltd. | Data transmission method, access point, relay node, and data node for packet aggregation |
US20160055225A1 (en) * | 2012-05-15 | 2016-02-25 | Splunk Inc. | Replication of summary data in a clustered computing environment |
US20130318525A1 (en) * | 2012-05-25 | 2013-11-28 | International Business Machines Corporation | Locality-aware resource allocation for cloud computing |
US20130336292A1 (en) * | 2012-06-19 | 2013-12-19 | Vinayak Sadashiv Kore | Wireless fire system based on open standard wireless protocols |
US20140047341A1 (en) * | 2012-08-07 | 2014-02-13 | Advanced Micro Devices, Inc. | System and method for configuring cloud computing systems |
US20140122831A1 (en) * | 2012-10-30 | 2014-05-01 | Tal Uliel | Instruction and logic to provide vector compress and rotate functionality |
US20140189308A1 (en) * | 2012-12-29 | 2014-07-03 | Christopher J. Hughes | Methods, apparatus, instructions, and logic to provide vector address conflict detection functionality |
US20140280420A1 (en) * | 2013-03-13 | 2014-09-18 | Qualcomm Incorporated | Vector processing engines having programmable data path configurations for providing multi-mode radix-2x butterfly vector processing circuits, and related vector processors, systems, and methods |
US20140281370A1 (en) * | 2013-03-13 | 2014-09-18 | Qualcomm Incorporated | Vector processing engines having programmable data path configurations for providing multi-mode vector processing, and related vector processors, systems, and methods |
US20140365548A1 (en) * | 2013-06-11 | 2014-12-11 | Analog Devices Technology | Vector matrix product accelerator for microprocessor integration |
US20150106578A1 (en) * | 2013-10-15 | 2015-04-16 | Coho Data Inc. | Systems, methods and devices for implementing data management in a distributed data storage system |
US20150143086A1 (en) * | 2013-11-15 | 2015-05-21 | Qualcomm Incorporated | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING FORMAT CONVERSION CIRCUITRY IN DATA FLOW PATHS BETWEEN VECTOR DATA MEMORY AND EXECUTION UNITS TO PROVIDE IN-FLIGHT FORMAT-CONVERTING OF INPUT VECTOR DATA TO EXECUTION UNITS FOR VECTOR PROCESSING OPERATIONS, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS |
US20150143076A1 (en) * | 2013-11-15 | 2015-05-21 | Qualcomm Incorporated | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING DESPREADING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT DESPREADING OF SPREAD-SPECTRUM SEQUENCES, AND RELATED VECTOR PROCESSING INSTRUCTIONS, SYSTEMS, AND METHODS |
US20150143079A1 (en) * | 2013-11-15 | 2015-05-21 | Qualcomm Incorporated | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION CORRELATION / COVARIANCE VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS |
US20150143078A1 (en) * | 2013-11-15 | 2015-05-21 | Qualcomm Incorporated | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING A TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION FILTER VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS |
US20150143077A1 (en) * | 2013-11-15 | 2015-05-21 | Qualcomm Incorporated | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING MERGING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT MERGING OF OUTPUT VECTOR DATA STORED TO VECTOR DATA MEMORY, AND RELATED VECTOR PROCESSING INSTRUCTIONS, SYSTEMS, AND METHODS |
US20150143085A1 (en) * | 2013-11-15 | 2015-05-21 | Qualcomm Incorporated | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING REORDERING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT REORDERING OF OUTPUT VECTOR DATA STORED TO VECTOR DATA MEMORY, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS |
US10305980B1 (en) * | 2013-11-27 | 2019-05-28 | Intellectual Property Systems, LLC | Arrangements for communicating data in a computing system using multiple processors |
US20150154058A1 (en) * | 2013-11-29 | 2015-06-04 | Fujitsu Limited | Communication control device, information processing apparatus, parallel computer system, and control method for parallel computer system |
US20150212972A1 (en) * | 2014-01-28 | 2015-07-30 | Arm Limited | Data processing apparatus and method for performing scan operations |
US20150379022A1 (en) * | 2014-06-27 | 2015-12-31 | General Electric Company | Integrating Execution of Computing Analytics within a Mapreduce Processing Environment |
US9756154B1 (en) * | 2014-10-13 | 2017-09-05 | Xilinx, Inc. | High throughput packet state processing |
US20160112531A1 (en) * | 2014-10-20 | 2016-04-21 | PlaceIQ. Inc. | Scripting distributed, parallel programs |
US20180004530A1 (en) * | 2014-12-15 | 2018-01-04 | Hyperion Core, Inc. | Advanced processor architecture |
US20160179537A1 (en) * | 2014-12-23 | 2016-06-23 | Intel Corporation | Method and apparatus for performing reduction operations on a set of vector elements |
US10425350B1 (en) * | 2015-04-06 | 2019-09-24 | EMC IP Holding Company LLC | Distributed catalog service for data processing platform |
US10015106B1 (en) * | 2015-04-06 | 2018-07-03 | EMC IP Holding Company LLC | Multi-cluster distributed data processing platform |
US10541938B1 (en) * | 2015-04-06 | 2020-01-21 | EMC IP Holding Company LLC | Integration of distributed data processing platform with one or more distinct supporting platforms |
US20160342568A1 (en) * | 2015-05-21 | 2016-11-24 | Goldman, Sachs & Co. | General-purpose parallel computing architecture |
US20170199844A1 (en) * | 2015-05-21 | 2017-07-13 | Goldman, Sachs & Co. | General-purpose parallel computing architecture |
US20170063613A1 (en) * | 2015-08-31 | 2017-03-02 | Mellanox Technologies Ltd. | Aggregation protocol |
US20170093715A1 (en) * | 2015-09-29 | 2017-03-30 | Ixia | Parallel Match Processing Of Network Packets To Identify Packet Data For Masking Or Other Actions |
US20170116154A1 (en) * | 2015-10-23 | 2017-04-27 | The Intellisis Corporation | Register communication in a network-on-a-chip architecture |
US20170187629A1 (en) * | 2015-12-28 | 2017-06-29 | Amazon Technologies, Inc. | Multi-path transport design |
US20170187846A1 (en) * | 2015-12-29 | 2017-06-29 | Amazon Technologies, Inc. | Reliable, out-of-order receipt of packets |
US20170187496A1 (en) * | 2015-12-29 | 2017-06-29 | Amazon Technologies, Inc. | Reliable, out-of-order transmission of packets |
US20180375781A1 (en) * | 2016-03-11 | 2018-12-27 | Huawei Technologies Co.,Ltd. | Coflow identification method and system, and server using method |
US20190339688A1 (en) * | 2016-05-09 | 2019-11-07 | Strong Force Iot Portfolio 2016, Llc | Methods and systems for data collection, learning, and streaming of machine signals for analytics and maintenance using the industrial internet of things |
US20180046901A1 (en) * | 2016-08-12 | 2018-02-15 | Beijing Deephi Intelligence Technology Co., Ltd. | Hardware accelerator for compressed gru on fpga |
US20190138638A1 (en) * | 2016-09-26 | 2019-05-09 | Splunk Inc. | Task distribution in an execution node of a distributed execution environment |
US20190147092A1 (en) * | 2016-09-26 | 2019-05-16 | Splunk Inc. | Distributing partial results to worker nodes from an external data system |
US20180089278A1 (en) * | 2016-09-26 | 2018-03-29 | Splunk Inc. | Data conditioning for dataset destination |
US20180091442A1 (en) * | 2016-09-29 | 2018-03-29 | International Business Machines Corporation | Network switch architecture supporting multiple simultaneous collective operations |
US20180173673A1 (en) * | 2016-12-15 | 2018-06-21 | Ecole Polytechnique Federale De Lausanne (Epfl) | Atomic Object Reads for In-Memory Rack-Scale Computing |
US10296351B1 (en) * | 2017-03-15 | 2019-05-21 | Ambarella, Inc. | Computer vision processing in hardware data paths |
US20180285316A1 (en) * | 2017-04-03 | 2018-10-04 | Google Llc | Vector reduction processor |
US10318306B1 (en) * | 2017-05-03 | 2019-06-11 | Ambarella, Inc. | Multidimensional vectors in a coprocessor |
US20180321938A1 (en) * | 2017-05-08 | 2018-11-08 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US20180367465A1 (en) * | 2017-06-15 | 2018-12-20 | Mellanox Technologies, Ltd. | Transmission and Reception of Raw Video Using Scalable Frame Rate |
US20190026250A1 (en) * | 2017-07-24 | 2019-01-24 | Tesla, Inc. | Vector computational unit |
US20190324431A1 (en) * | 2017-08-02 | 2019-10-24 | Strong Force Iot Portfolio 2016, Llc | Data collection systems and methods with alternate routing of input channels |
US20190068501A1 (en) * | 2017-08-25 | 2019-02-28 | Intel Corporation | Throttling for bandwidth imbalanced data transfers |
US20190065208A1 (en) * | 2017-08-31 | 2019-02-28 | Cambricon Technologies Corporation Limited | Processing device and related products |
US20190102640A1 (en) * | 2017-09-29 | 2019-04-04 | Infineon Technologies Ag | Accelerating convolutional neural network computation throughput |
US20190102179A1 (en) * | 2017-09-30 | 2019-04-04 | Intel Corporation | Processors and methods for privileged configuration in a spatial array |
US20190102338A1 (en) * | 2017-09-30 | 2019-04-04 | Intel Corporation | Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator |
US20190114533A1 (en) * | 2017-10-17 | 2019-04-18 | Xilinx, Inc. | Machine learning runtime library for neural network acceleration |
US20190235866A1 (en) * | 2018-02-01 | 2019-08-01 | Tesla, Inc. | Instruction set architecture for a vector computational unit |
US20190303263A1 (en) * | 2018-03-30 | 2019-10-03 | Kermin E. Fleming, JR. | Apparatus, methods, and systems for integrated performance monitoring in a configurable spatial accelerator |
US10621489B2 (en) * | 2018-03-30 | 2020-04-14 | International Business Machines Corporation | Massively parallel neural inference computing elements |
US20190303168A1 (en) * | 2018-04-03 | 2019-10-03 | Intel Corporation | Apparatus, methods, and systems for conditional queues in a configurable spatial accelerator |
US20200103894A1 (en) * | 2018-05-07 | 2020-04-02 | Strong Force Iot Portfolio 2016, Llc | Methods and systems for data collection, learning, and streaming of machine signals for computerized maintenance management system using the industrial internet of things |
US20190347099A1 (en) * | 2018-05-08 | 2019-11-14 | Arm Limited | Arithmetic operation with shift |
US20190369994A1 (en) * | 2018-06-05 | 2019-12-05 | Qualcomm Incorporated | Providing multi-element multi-vector (memv) register file access in vector-processor-based devices |
US20200005859A1 (en) * | 2018-06-29 | 2020-01-02 | Taiwan Semiconductor Manufacturing Company, Ltd. | Memory computation circuit and method |
US20200034145A1 (en) * | 2018-07-24 | 2020-01-30 | Apple Inc. | Computation Engine that Operates in Matrix and Vector Modes |
US20200057748A1 (en) * | 2018-08-16 | 2020-02-20 | Tachyum Ltd. | Arithmetic logic unit layout for a processor |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11277455B2 (en) | 2018-06-07 | 2022-03-15 | Mellanox Technologies, Ltd. | Streaming system |
US11625393B2 (en) | 2019-02-19 | 2023-04-11 | Mellanox Technologies, Ltd. | High performance computing system |
US11196586B2 (en) | 2019-02-25 | 2021-12-07 | Mellanox Technologies Tlv Ltd. | Collective communication system and methods |
US12177039B2 (en) | 2019-02-25 | 2024-12-24 | Mellanox Technologies, Ltd. | Collective communication system and methods |
US11876642B2 (en) | 2019-02-25 | 2024-01-16 | Mellanox Technologies, Ltd. | Collective communication system and methods |
US11303582B1 (en) * | 2019-06-28 | 2022-04-12 | Amazon Technologies, Inc. | Multi-layer network for metric aggregation |
US11750699B2 (en) | 2020-01-15 | 2023-09-05 | Mellanox Technologies, Ltd. | Small message aggregation |
US11252027B2 (en) | 2020-01-23 | 2022-02-15 | Mellanox Technologies, Ltd. | Network element supporting flexible data reduction operations |
US11876885B2 (en) | 2020-07-02 | 2024-01-16 | Mellanox Technologies, Ltd. | Clock queue with arming and/or self-arming features |
US12177325B2 (en) | 2020-07-02 | 2024-12-24 | Mellanox Technologies, Ltd. | Clock queue with arming and/or self-arming features |
CN114363248A (en) * | 2020-09-29 | 2022-04-15 | 华为技术有限公司 | Computing system, accelerator, switching plane and aggregation communication method |
US11556378B2 (en) | 2020-12-14 | 2023-01-17 | Mellanox Technologies, Ltd. | Offloading execution of a multi-task parameter-dependent operation to a network device |
US11880711B2 (en) | 2020-12-14 | 2024-01-23 | Mellanox Technologies, Ltd. | Offloading execution of a multi-task parameter-dependent operation to a network device |
US12223051B2 (en) | 2021-02-01 | 2025-02-11 | Mellanox Technologies, Ltd. | Secure in-service firmware update |
US11765237B1 (en) | 2022-04-20 | 2023-09-19 | Mellanox Technologies, Ltd. | Session-based remote direct memory access |
US12244670B2 (en) | 2022-04-20 | 2025-03-04 | Mellanox Technologies, Ltd | Session-based remote direct memory access |
US11789792B1 (en) | 2022-05-26 | 2023-10-17 | Mellanox Technologies, Ltd. | Message passing interface (MPI) collectives using multi-allgather |
US11922237B1 (en) | 2022-09-12 | 2024-03-05 | Mellanox Technologies, Ltd. | Single-step collective operations |
US11973694B1 (en) | 2023-03-30 | 2024-04-30 | Mellanox Technologies, Ltd. | Ad-hoc allocation of in-network compute-resources |
DE102024127471A1 (en) | 2023-09-26 | 2025-03-27 | Mellanox Technologies, Ltd. | IN-SERVICE SOFTWARE UPDATE, MANAGED BY NETWORK CONTROLLER |
US12289311B2 (en) | 2023-09-26 | 2025-04-29 | Mellanox Technologies, Ltd | In-service software update managed by network controller |
Similar Documents
Publication | Title
---|---
US20200106828A1 (en) | Parallel Computation Network Device
Abts et al. | High performance datacenter networks: Architectures, algorithms, and opportunities
US10284383B2 (en) | Aggregation protocol
US10965586B2 (en) | Resilient network communication using selective multipath packet flow spraying
US9325622B2 (en) | Autonomic traffic load balancing in link aggregation groups
US8417778B2 (en) | Collective acceleration unit tree flow control and retransmit
US9276846B2 (en) | Packet extraction optimization in a network processor
US8335238B2 (en) | Reassembling streaming data across multiple packetized communication channels
EP3625939A1 (en) | Access node for data centers
Bhowmik et al. | High performance publish/subscribe middleware in software-defined networks
US10129181B2 (en) | Controlling the reactive caching of wildcard rules for packet processing, such as flow processing in software-defined networks
US20050097300A1 (en) | Processing system and method including a dedicated collective offload engine providing collective processing in a distributed computing environment
JP2007234014A (en) | Event multicast platform for scalable content base
Teixeira et al. | Packetscope: Monitoring the packet lifecycle inside a switch
US8266504B2 (en) | Dynamic monitoring of ability to reassemble streaming data across multiple channels based on history
Paul et al. | MG-Join: A scalable join for massively parallel multi-GPU architectures
CN104917680B (en) | Computer system for performing parallel hashing of packet streams
US20110238956A1 (en) | Collective Acceleration Unit Tree Structure
Bhowmik et al. | Distributed control plane for software-defined networks: A case study using event-based middleware
CN102780616B (en) | Network device, method, and apparatus for multi-core-processor-based message processing
US10635774B2 (en) | Integrated circuit design
Guo et al. | An efficient parallelized L7-filter design for multicore servers
EP3012742B1 (en) | Data distribution system, data communication device and program for data distribution
Ruia et al. | Flowcache: A cache-based approach for improving SDN scalability
US10791058B2 (en) | Hierarchical enforcement of service flow quotas
Legal Events
Code | Title | Description
---|---|---
AS | Assignment | Owner name: MELLANOX TECHNOLOGIES, LTD., ISRAEL. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ELIAS, GEORGE;LEVI, LION;ROMLET, EVYATAR;AND OTHERS;SIGNING DATES FROM 20190312 TO 20190318;REEL/FRAME:048630/0198
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION