US20240411579A1 - Metadata partitioning across virtual processors - Google Patents
- Publication number
- US20240411579A1 (application No. US 18/332,828)
- Authority
- US
- United States
- Prior art keywords
- data
- partition
- virtual processor
- data object
- metadata
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/4557—Distribution of virtual machine instances; Migration and load balancing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45583—Memory management, e.g. access or allocation
Definitions
- a storage arrangement can include a cluster of computer nodes that manage access of data in a shared storage system that is shared by the cluster of computer nodes.
- Each computer node of the cluster of computer nodes can execute one or more virtual processors, with each virtual processor managing access to a respective data portion in the shared storage system.
- FIG. 1 is a block diagram of an arrangement that includes a cluster of computer nodes, a shared storage system, and requester devices, according to some examples.
- FIGS. 2 A and 2 B are block diagrams of partitions in respective virtual processors, according to some examples.
- FIG. 3 is a flow diagram of a write operation according to some examples.
- FIG. 4 is a block diagram of configuration information specifying prefix lengths for respective data buckets, according to some examples.
- FIG. 5 is a block diagram of a storage medium storing machine-readable instructions according to some examples.
- FIG. 6 is a block diagram of a system according to some examples.
- FIG. 7 is a flow diagram of a process according to some examples.
- a “virtual processor” can refer to a computing entity implemented with machine-readable instructions that are executable on a computer node. By managing access of different data portions using respective different virtual processors executed in a cluster of computer nodes, data throughput can be improved when the virtual processors access data in parallel from a shared storage system.
- a “shared” storage system is a storage system that is shared (i.e., accessible) by any computer node of the cluster of computer nodes.
- a data access request (write request or read request) can be received at a given computer node of the cluster of computer nodes.
- a virtual processor in the given computer node can be assigned to handle the data access request.
- the virtual processor assigned to handle the data access request may be referred to as a “source virtual processor” with respect to the data access request.
- the source virtual processor may not be the virtual processor that “owns” (i.e., manages access and/or updates to) metadata for the data that is the subject of the data access request.
- the virtual processor that owns metadata for data that is the subject of a given data access request may be referred to as a “metadata virtual processor” with respect to the given data access request.
- a virtual processor may also “own” a data object; such a virtual processor is responsible for managing the access and/or updates of the data object.
- the source virtual processor determines which virtual processor is the metadata virtual processor, and obtains the metadata from the metadata virtual processor.
- An example of the metadata may include a list of chunk identifiers of chunks that make up the data object. The list of chunk identifiers can be used by the source virtual processor to retrieve the chunks of the data object.
- Another example of the metadata can include a version of the data object.
- For load balancing and improved throughput, metadata for respective data objects can be partitioned into multiple partitions that are spread across multiple virtual processors executing in a cluster of computer nodes.
- Each computer node can execute one or more virtual processors, and each virtual processor may include one or more partitions.
- a virtual processor “including” a partition can refer to the virtual processor owning a portion of the metadata (referred to as “metadata portion”) in the partition.
- a virtual processor may be migrated from a source computer node to a target computer node.
- virtual processor migration can lead to processing overhead related to maintaining associations of metadata portions with respective virtual processors.
- An association between a metadata portion and a given virtual processor can be represented by mapping information. If the associations between metadata portions and virtual processors are not properly maintained in response to migrations of virtual processors, then source virtual processors may have difficulty finding metadata virtual processors when handling incoming data access requests.
- a further challenge relates to how metadata for the data buckets is sharded across the cluster of computer nodes as the cluster changes over time (such as due to additions of computer nodes to the cluster).
- a mapping scheme uses a partition map and a virtual processor-computer node (VP-CN) map.
- the partition map associates (maps) partitions (that include metadata portions) to respective virtual processors.
- the VP-CN map associates (maps) virtual processors to respective computer nodes of the cluster of computer nodes.
- when a virtual processor is migrated between computer nodes, the VP-CN map is updated, but the partition map does not change.
- requests for data objects that have keys that map to a given virtual processor would continue to map to the given virtual processor after the migration of the given virtual processor between different computer nodes.
- the static nature of the partition map in the context of virtual processor migrations allows for a system including the cluster of computer nodes to deterministically map metadata of data objects to virtual processors.
- a first data bucket may be generated or received when the cluster of computer nodes has a first quantity of computer nodes. Later, one or more computer nodes may be added to the cluster. Even though the cluster of computer nodes has been expanded, the partition map and the VP-CN map for the first data bucket are not changed, which avoids the burden associated with having to update mappings as the cluster of computer nodes expands.
- if a second data bucket is generated or received after the expansion of the cluster of computer nodes, a partition map may be created for the second data bucket that makes use of the increased quantity of computer nodes. Note that there may be one partition map per data bucket (so that there are multiple partition maps for respective data buckets). However, there is one VP-CN map that is common to multiple data buckets.
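- A minimal sketch of this two-map scheme, using the virtual processor and computer node names of FIG. 1 and a Python dictionary representation (an assumption for illustration, not prescribed by the disclosure); migration of a virtual processor touches only the VP-CN map:

```python
# Hypothetical two-level mapping: a static per-bucket partition map plus a
# cluster-wide VP-CN map that changes only when a virtual processor migrates.
partition_map_b1 = {                     # partition ID -> virtual processor (static per bucket)
    0: "VP1", 1: "VP1", 2: "VP1",
    3: "VP2", 4: "VP2", 5: "VP2",
    6: "VP3", 7: "VP3", 8: "VP3",
    9: "VP4", 10: "VP4", 11: "VP4",
}
vp_cn_map = {                            # virtual processor -> computer node (shared by all buckets)
    "VP1": "CN1", "VP2": "CN1", "VP3": "CN2", "VP4": "CN2",
}


def locate_metadata_owner(partition_id: int) -> tuple:
    """Resolve which virtual processor owns a partition and where it currently runs."""
    vp = partition_map_b1[partition_id]  # never changes on virtual processor migration
    cn = vp_cn_map[vp]                   # updated when a virtual processor migrates
    return vp, cn


# Migrating VP3 from CN2 to CN3 touches only the VP-CN map; keys that mapped to
# VP3 before the migration still map to VP3 afterward.
vp_cn_map["VP3"] = "CN3"
assert locate_metadata_owner(6) == ("VP3", "CN3")
```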
- FIG. 1 is a block diagram of an example arrangement that includes a cluster 100 of computer nodes CN 1 , CN 2 , and CN 3 . Although three computer nodes are shown in FIG. 1 , in other examples, the cluster 100 can include less than three or more than three computer nodes.
- a “computer node” can refer to a physical node that includes a processing resource as well as other resources (e.g., communication resources and memory resources) that can perform various tasks.
- a “processing resource” can include one or more hardware processing circuits which can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit.
- a “communication resource” can include a communication interface (including communication hardware such as a transceiver and any related machine-readable instructions).
- a “memory resource” can include a memory implemented with a collection of one or more memory devices.
- the cluster 100 of computer nodes is able to manage the access of data stored in a shared storage system 102 in response to data access requests received over a network 112 from requester devices 104 .
- a “requester device” can refer to any electronic device that is able to send a request to access data (read data or write data). Examples of electronic devices include any or some combination of the following: desktop computers, notebook computers, tablet computers, server computers, game appliances, Internet-of-Things (IoT) devices, vehicles, household appliances, and so forth.
- Examples of the network 112 can include any or some combination of the following: a storage area network (SAN), a local area network (LAN), a wide area network (WAN), and so forth.
- the shared storage system 102 is accessible by any of the computer nodes CN 1 to CN 3 over a communication link 106 between the cluster 100 of computer nodes and the shared storage system 102 .
- the shared storage system 102 is implemented using a collection of storage devices 108 .
- a “collection” of items can refer to a single item or to multiple items.
- the collection of storage devices 108 can include a single storage device or multiple storage devices. Examples of storage devices can include any or some combination of the following: disk-based storage devices, solid state drives, and so forth.
- the requester devices 104 can send data access requests to any of the computer nodes CN 1 to CN 3 over the network 112 .
- Each computer node executes a collection of virtual processors (a single virtual processor or multiple virtual processors).
- a virtual processor is executed by a processing resource of a computer node.
- the computer node CN 1 includes virtual processors VP 1 and VP 2
- the computer node CN 2 includes virtual processors VP 3 and VP 4
- the computer node CN 3 includes virtual processors VP 5 and VP 6 .
- Although FIG. 1 shows two virtual processors in each computer node, in other examples, a computer node may include a different number (e.g., 1 or more than 2) of virtual processors. Communications among the virtual processors VP 1 , VP 2 , VP 3 , VP 4 , VP 5 , and VP 6 can occur over a communication link between the virtual processors, such as an inter-process link.
- Virtual processors can also be migrated between computer nodes. For example, to achieve load balancing or for fault tolerance or recovery, a first virtual processor may be migrated from a current (source) computer node to a destination computer node. Once migrated, the first virtual processor executes at the destination computer node.
- data to be stored in the shared storage system 102 by virtual processors can be part of data buckets.
- a data bucket can refer to any type of container that includes a collection of data objects (a single data object or multiple data objects).
- a specific example of a data bucket is an S3 bucket in Amazon cloud storage. In other examples, other types of data buckets can be employed.
- a data object can be divided into a collection of data chunks (a single data chunk or multiple data chunks). Each data chunk (or more simply “chunk”) has a specified size (a static size or a size that can dynamically change).
- the storage locations of the chunks are storage locations in the shared storage system 102 .
- Each of the virtual processors VP 1 to VP 6 may maintain respective metadata associated with data objects stored or to be stored in the shared storage system 102 .
- the metadata can include data object metadata such as a list of chunk identifiers (IDs) that identify chunks of a data object, and a virtual processor ID that identifies a virtual processor.
- the list of chunk IDs can include a single chunk ID or multiple chunk IDs.
- the data object metadata can further include a version ID that represents a version of a data object. As a data object is modified by write request(s), corresponding different versions of the data object are created (and identified by respective version IDs).
- an “ID” can refer to any information (e.g., a name, a string, etc.) that can be used to distinguish one item from another item (e.g., distinguish between chunks, or distinguish between data objects, or distinguish between versions of a data object, or distinguish between virtual processors, and so forth).
- the metadata may further include storage location metadata representing storage locations of chunks of data objects in the shared storage system 102 .
- the storage location metadata can include any or some combination of the following: an offset, a storage address, a block number, and so forth.
- the metadata may also include commit metadata maintained by a metadata virtual processor during write operations and read operations.
- the commit metadata indicates whether a write of a subject data object is in progress (e.g., at a virtual processor that is assigned to handle the write) or a write of the subject data object is no longer in progress (i.e., a write of the subject data object is complete). Note that a write of the subject data object is complete if the subject data object has been written to either a write buffer (not shown) or the shared storage system 102 .
- a write buffer can be part of a nonvolatile (NV) memory (e.g., any of the NV memories 110 - 1 to 110 - 3 ) and is associated with a respective virtual processor for caching write data.
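- As a rough sketch of how the metadata described above might be grouped per data object (chunk IDs, version, storage locations, and commit state), the following record is illustrative only; the field names and types are assumptions, not taken from the disclosure:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class CommitState(Enum):
    """Whether a write of the subject data object is still in progress."""
    IN_PROGRESS = 1
    COMPLETE = 2      # written to a write buffer or to the shared storage system


@dataclass
class ObjectMetadata:
    """Illustrative per-object metadata owned by a metadata virtual processor."""
    bucket_id: str
    object_id: str
    version_id: int
    chunk_ids: List[str] = field(default_factory=list)        # identifies the chunks of the data object
    chunk_locations: List[int] = field(default_factory=list)  # e.g., offsets or block numbers in shared storage
    commit_state: CommitState = CommitState.IN_PROGRESS
```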
- Metadata can be partitioned into multiple partitions that are spread across multiple virtual processors executing in the cluster 100 of computer nodes.
- Each virtual processor may include one or more partitions.
- a virtual processor “including” a partition can refer to the virtual processor owning a metadata portion in the partition.
- the virtual processor VP 1 includes partitions P 11 , P 12 , P 13 for data bucket B 1 , and includes partitions P 1 A, P 1 B for data bucket B 2 .
- the partitions for the data bucket B 1 are shaded, while the partitions for the data bucket B 2 are not shaded in FIG. 1 .
- the virtual processor VP 2 includes partitions P 21 , P 22 , P 23 for data bucket B 1 , and includes partitions P 2 A, P 2 B for data bucket B 2
- the virtual processor VP 3 includes partitions P 31 , P 32 , P 33 for data bucket B 1 , and includes partitions P 3 A, P 3 B for data bucket B 2
- the virtual processor VP 4 includes partitions P 41 , P 42 , P 43 for data bucket B 1 , and includes partitions P 4 A, P 4 B for data bucket B 2
- the virtual processor VP 5 includes partitions P 51 , P 52 , P 53 for data bucket B 1 , and includes partitions P 5 A, P 5 B for data bucket B 2
- the virtual processor VP 6 includes partitions P 61 , P 62 , P 63 for data bucket B 1 , and includes partitions P 6 A, P 6 B for data bucket B 2 .
- NV memory is a memory that can persistently store data, such that data stored in the NV memory is not lost when power is removed from the NV memory or from a computer node in which the NV memory is located.
- An NV memory can include a collection of NV memory devices (a single NV memory device or multiple NV memory devices), such as flash memory devices and/or other types of NV memory devices.
- Although each virtual processor is depicted in FIG. 1 as including a specific quantity of partitions, in other examples, a virtual processor can include a different quantity of partitions for any given data bucket.
- Each virtual processor is responsible for managing respective metadata of one or more partitions.
- if there are M partitions divided across P virtual processors, each virtual processor of the P virtual processors is responsible for managing metadata of M/P partitions.
- a partition is identified by a Partition ID.
- a Partition ID is based on applying a hash function (e.g., a cryptographic hash function or another type of hash function) on information (key) associated with a data object.
- the key associated with the data object on which the hash function is applied includes a Bucket ID identifying the data bucket of which the data object is part, and an Object ID that identifies the data object.
- the hash function applied on the key associated with the data object produces a hash value that is the Partition ID or from which the Partition ID is derived.
- a hash value (Hash-Value) is computed as follows:
- Hash-Value = Hash(F(Request.key))   (Eq. 1)
- F(x) is a transformation applied on x (in this case a transformation applied on Request.key)
- Hash( ) is a hash function
- Request.key includes the Bucket ID and the Object ID of a data access request (a read request or write request).
- the transformation can be a concatenation operation of Bucket ID and Object ID, for example.
- the Partition ID is computed based on the Hash-Value as follows:
- Partition ID = Hash-Value % #Partitions   (Eq. 2)
- % is a modulus operation
- #Partitions is a constant (which can be a configured value) that represents how many partitions are to be divided across a cluster of computer nodes.
- the modulus operation computes the remainder after dividing Hash-Value by #Partitions.
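- A minimal sketch of Eqs. 1 and 2, assuming SHA-256 as the hash function and concatenation of the Bucket ID and Object ID as the transformation F (the disclosure permits other hash functions and transformations):

```python
import hashlib

NUM_PARTITIONS = 12  # the configurable constant #Partitions


def partition_id(bucket_id: str, object_id: str) -> int:
    """Compute a Partition ID: hash the transformed key (Eq. 1), then take the modulus (Eq. 2)."""
    key = bucket_id + object_id                    # F(): concatenation of Bucket ID and Object ID
    hash_value = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    return hash_value % NUM_PARTITIONS


# All versions of a data object hash to the same partition, since the version ID
# is not part of the hashed key.
print(partition_id("B1", "object-42"))             # a value in the range 0..11
```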
- the metadata for each data bucket is sharded across multiple partitions.
- a given partition defines a subset of the object key-space for a specific data bucket and maps to one virtual processor.
- the object key-space for a data bucket refers to the possible range of values associated with the keys of data objects in the data bucket (e.g., the range of hash values computed according to Eq. 1).
- the entire object key-space can be uniformly distributed among the partitions, which can lead to equal size partitions.
- a last partition may be larger or smaller if the object key-space is not exactly divisible by the number of partitions.
- every data object maps to one partition and all data objects in a partition belong to the same data bucket.
- all versions of a given data object map to the same partition; in other words, the hashing of keys of data objects to map to partitions is version agnostic.
- the quantity of partitions for a data bucket is a configurable constant (e.g., #Partitions in Eq. 2 above) across all data buckets; in other words, each data bucket of multiple data buckets to be stored in the shared storage system 102 can have a static quantity of partitions.
- in the example of FIG. 1 , the configurable constant is 12 (in a different example, a different quantity of partitions can be used).
- the 12 partitions are divided across 4 virtual processors VP 1 to VP 4 such that each virtual processor (of VP 1 to VP 4 ) includes three partitions for bucket B 1 .
- the 12 partitions are divided across 6 virtual processors VP 1 to VP 6 such that each virtual processor (of VP 1 to VP 6 ) includes two partitions for bucket B 2 .
- the configurable constant may be a highly composite number with a relatively large quantity of divisors.
- for example, the number 48 has the following divisors: 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48.
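- A small illustrative check of why such a choice helps, assuming #Partitions is set to 48: each of these virtual-processor counts divides the partitions evenly, so no virtual processor manages more partitions than another:

```python
NUM_PARTITIONS = 48  # a highly composite choice for #Partitions

# Every divisor of 48 yields an even split of partitions across virtual processors.
for num_vps in (1, 2, 3, 4, 6, 8, 12, 16, 24, 48):
    print(f"{num_vps} virtual processors -> {NUM_PARTITIONS // num_vps} partitions each")
```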
- FIG. 1 further depicts partition maps and VP-CN maps stored in the NV memories 110 - 1 to 110 - 3 , which are discussed below in conjunction with FIGS. 2 A and 2 B .
- FIG. 2 A shows an example arrangement of the cluster 100 of computer nodes at a first point in time (T 1 ).
- the cluster 100 of computer nodes includes two computer nodes CN 1 and CN 2 .
- the computer node CN 1 executes the virtual processors VP 1 , VP 2
- the computer node CN 2 executes the virtual processors VP 3 , VP 4 . Note that at time T 1 , the computer node CN 3 is not yet present in the cluster 100 .
- Data bucket B 1 is created when the cluster 100 is configured with the computer nodes CN 1 and CN 2 at time T 1 .
- the creation of data bucket B 1 is triggered by a user, such as at a requester device 104 .
- in other examples, the creation of data bucket B 1 can be triggered by another entity (a program or a machine).
- Creation of a data bucket can be triggered in response to a request or any other type of input from an entity (user, program, or machine).
- the request to create data bucket B 1 can be received at any of the computer nodes of the cluster 100 .
- the computer node that received the request creates a partition map.
- Each computer node includes a map creation engine that creates the partition map.
- the computer nodes CN 1 and CN 2 include respective map creation engines 202 - 1 and 202 - 2 .
- an “engine” can refer to one or more hardware processing circuits, which can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit.
- an “engine” can refer to a combination of one or more hardware processing circuits and machine-readable instructions (software and/or firmware) executable on the one or more hardware processing circuits.
- the map creation engine 202 - 1 responds to the request to create data bucket B 1 by creating a B 1 partition map 204 that maps partitions containing metadata for data objects of data bucket B 1 to respective virtual processors.
- the B 1 partition map 204 includes a number of entries that is equal to #Partitions. If #Partitions is 12, then the B 1 partition map 204 includes 12 entries. Each entry of the 12 entries corresponds to a respective partition of the 12 partitions. Thus, the first entry of the B 1 partition map 204 corresponds to partition P 11 , the second entry corresponds to partition P 12 , the third entry corresponds to partition P 13 , the fourth entry corresponds to partition P 21 , and so forth. Each entry of the B 1 partition map 204 contains an identifier of the virtual processor that is associated with the corresponding partition.
- in the example of FIG. 2 A , the 12 entries associate partitions to virtual processors as follows: the first entry maps P 11 to VP 1 , the second entry maps P 12 to VP 1 , the third entry maps P 13 to VP 1 , the fourth entry maps P 21 to VP 2 , the fifth entry maps P 22 to VP 2 , the sixth entry maps P 23 to VP 2 , the seventh entry maps P 31 to VP 3 , the eighth entry maps P 32 to VP 3 , the ninth entry maps P 33 to VP 3 , the tenth entry maps P 41 to VP 4 , the eleventh entry maps P 42 to VP 4 , and the twelfth entry maps P 43 to VP 4 .
- the B 1 partition map 204 can be stored (persisted) in the shared storage system 102 by the computer node CN 1 that processed the request to create data bucket B 1 .
- the other computer nodes such as CN 2 , can read the B 1 partition map 204 from the shared storage system 102 (such as while processing write and read requests) and cache a copy of the B 1 partition map 204 in the NV memory 110 - 2 .
- each computer node CN 1 or CN 2 also includes a VP-CN map 206 that is created by a map creation engine (same as or different from the map creation engine used to create a partition map).
- the VP-CN map 206 associates virtual processors to computer nodes that are present in the cluster 100 at time T 1 .
- the first entry of the VP-CN map 206 maps VP 1 to CN 1
- the second entry of the VP-CN map 206 maps VP 2 to CN 1
- the third entry of the VP-CN map 206 maps VP 3 to CN 2
- the fourth entry of the VP-CN map 206 maps VP 4 to CN 2 .
- the VP-CN map 206 can be modified to change the mapping of virtual processors and computer nodes.
- map creation engine(s) to create partition maps and/or VP-CN maps can be in a computer system separate from the cluster 100 of computer nodes.
- FIG. 2 B shows an arrangement of the cluster 100 of computer nodes at a later point in time (T 2 ).
- T 2 the third computer node CN 3 has been added to the cluster 100 of computer nodes.
- the third computer node CN 3 executes the virtual processors VP 5 and VP 6 , and includes a map creation engine 202 - 3 .
- data bucket B 2 is created.
- a map creation engine (any of 202 - 1 , 202 - 2 , and 202 - 3 in the computer node that received the request to create data bucket B 2 ) creates a B 2 partition map 208 .
- the 12 entries of the B 2 partition map 208 correspond to the 12 partitions P 1 A, P 1 B, P 2 A, P 2 B, P 3 A, P 3 B, P 4 A, P 4 B, P 5 A, P 5 B, P 6 A, and P 6 B, respectively.
- the 12 entries of the B 2 partition map 208 associate partitions to virtual processors as follows: the first entry maps P 1 A to VP 1 , the second entry maps P 1 B to VP 1 , the third entry maps P 2 A to VP 2 , the fourth entry maps P 2 B to VP 2 , the fifth entry maps P 3 A to VP 3 , the sixth entry maps P 3 B to VP 3 , the seventh entry maps P 4 A to VP 4 , the eighth entry maps P 4 B to VP 4 , the ninth entry maps P 5 A to VP 5 , the tenth entry maps P 5 B to VP 5 , the eleventh entry maps P 6 A to VP 6 , and the twelfth entry maps P 6 B to VP 6 .
- the B 2 partition map 208 can be written by the computer node at which the B 2 partition map 208 was created to the shared storage system 102 , and the other two computer nodes can read the B 2 partition map 208 from the shared storage system 102 to cache respective copies of the B 2 partition map 208 in the respective NV memories.
- each of the computer nodes CN 1 and CN 2 includes the following maps: the B 1 partition map 204 , the VP-CN map 206 , and the B 2 partition map 208 .
- the computer node CN 3 includes the VP-CN map 206 and the B 2 partition map 208 , but does not include the B 1 partition map 204 .
- the partition maps ( 204 and 208 ) once assigned can remain static as the cluster 100 of computer nodes is expanded by adding more computer nodes.
- the partition maps 204 and 208 and the VP-CN map 206 may be stored in a repository (e.g., in the shared storage system), with copies of the maps 204 , 206 , and 208 cached in the computer nodes CN 1 , CN 2 , and CN 3 , such as in the respective NV memories 110 - 1 , 110 - 2 , and 110 - 3 .
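- The following sketch reproduces the B 1 and B 2 partition maps above by assigning consecutive blocks of partitions to the virtual processors that exist when each data bucket is created; the block-assignment policy and the dictionary representation are assumptions, since the disclosure does not prescribe a particular assignment algorithm:

```python
def create_partition_map(num_partitions: int, virtual_processors: list) -> dict:
    """Assign partitions to the virtual processors present at bucket-creation time.

    Consecutive blocks reproduce the layouts of FIGS. 2A and 2B: 12 partitions over
    4 virtual processors gives 3 per VP (bucket B1); over 6 virtual processors, 2 per VP (B2).
    """
    per_vp = num_partitions // len(virtual_processors)
    return {p: virtual_processors[min(p // per_vp, len(virtual_processors) - 1)]
            for p in range(num_partitions)}


# At time T1 the cluster has VP1..VP4; at time T2 it has VP1..VP6.
b1_partition_map = create_partition_map(12, ["VP1", "VP2", "VP3", "VP4"])
b2_partition_map = create_partition_map(12, ["VP1", "VP2", "VP3", "VP4", "VP5", "VP6"])

# The VP-CN map is common to all buckets; shown here as of time T2, after CN3 joins.
vp_cn_map = {"VP1": "CN1", "VP2": "CN1", "VP3": "CN2",
             "VP4": "CN2", "VP5": "CN3", "VP6": "CN3"}
```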
- FIG. 3 is a flow diagram of an example write operation performed in response to an incoming write request received by a computer node from a requester device 104 .
- the flow depicted in FIG. 3 omits some tasks that may be performed as part of the write operation.
- the computer node that received the incoming write request is referred to as the “receiving computer node.”
- the receiving computer node selects one of the multiple virtual processors as a source virtual processor 302 to handle the incoming write request.
- the selection can be a random selection process in which the source virtual processor 302 is randomly selected from the multiple virtual processors in the receiving computer node.
- the selection process can be based on determining relative loads of the virtual processors in the receiving computer node, and selecting a virtual processor with the least load to be the source virtual processor 302 .
- the receiving computer node determines which of the virtual processors VP 1 to VP 6 in the cluster 100 of computer nodes is a metadata virtual processor 304 for the subject data object of the incoming write request 120 .
- the receiving computer node can apply a hash function or another type of function on a key associated with the subject data object, to produce a Partition ID that identifies the partition that the metadata for the subject data object is part of.
- the key includes a Bucket ID identifying the data bucket that the subject data object is part of, and an Object ID that identifies the subject data object.
- the source virtual processor 302 accesses (at 306 ) a partition map for the data bucket identified by the Bucket ID to determine which virtual processor (the metadata virtual processor 304 ) is mapped to the partition identified by the Partition ID.
- This virtual processor is the metadata virtual processor 304 .
- the source virtual processor 302 also accesses (at 308 ) the VP-CN map (e.g., 206 ) to identify which computer node the metadata virtual processor 304 executes in.
- the source virtual processor 302 sends (at 310 ) a control message to the metadata virtual processor 304 in the identified computer node.
- the control message contains the Bucket ID, Partition ID, Object ID, and Version ID of the subject data object.
- the control message is to indicate to the metadata virtual processor 304 that the source virtual processor 302 is ready to start the write operation for the incoming write request.
- in response to the control message, the metadata virtual processor 304 generates (at 312 ) a list of chunk IDs that identifies one or more data chunks for the subject data object. The chunk ID(s) generated depend(s) on the size of the subject data object, which determines how many data chunks the subject data object is divided into. The metadata virtual processor 304 sends (at 314 ) the list of chunk IDs to the source virtual processor 302 .
- in response to receiving the list of chunk IDs from the metadata virtual processor 304 , the source virtual processor 302 writes (at 316 ) the chunk(s) of the subject data object using the chunk ID(s) included in the list of chunk IDs.
- the source virtual processor 302 can write the data chunk(s) to a write buffer (not shown) associated with the source virtual processor 302 and/or to the shared storage system 102 .
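- A condensed sketch of this write path; the request and virtual-processor objects, and the helpers send_control_message and split_into_chunks, are hypothetical placeholders supplied by the caller, and partition_id refers to the earlier hashing sketch:

```python
def handle_write(request, partition_maps, vp_cn_map, local_vps,
                 send_control_message, split_into_chunks):
    """Sketch of the write operation of FIG. 3; messaging is abstracted as callables."""
    # Select a source virtual processor on the receiving computer node
    # (here, the least-loaded local virtual processor).
    source_vp = min(local_vps, key=lambda vp: vp.load)

    # Hash the key to find the partition, then resolve the metadata virtual processor
    # via the bucket's partition map and its computer node via the VP-CN map.
    pid = partition_id(request.bucket_id, request.object_id)
    metadata_vp = partition_maps[request.bucket_id][pid]
    metadata_cn = vp_cn_map[metadata_vp]

    # Control message announcing the write; the metadata virtual processor replies
    # with a list of chunk IDs sized to the subject data object.
    chunk_ids = send_control_message(metadata_cn, metadata_vp, request.bucket_id,
                                     pid, request.object_id, request.version_id)

    # The source virtual processor writes the chunks to its write buffer
    # and/or to the shared storage system.
    for chunk_id, chunk in zip(chunk_ids, split_into_chunks(request.data)):
        source_vp.write_chunk(chunk_id, chunk)
```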
- rebalancing of partitions across computer nodes may occur if one or more computer nodes are removed from the cluster 100 .
- the partitions of the removed one or more computer nodes can be added to virtual processors of the remaining computer nodes of the cluster 100 .
- Rebalancing of partitions across computer nodes of the cluster 100 may also occur in response to another condition, such as overloading or a fault of a virtual processor or a computer node. For example, if a given virtual processor is overloaded, then a partition can be moved from the given virtual processor to another virtual processor. Rebalancing one or more partitions from heavily loaded virtual processors to more lightly loaded virtual processors can reduce metadata hot-spots. A metadata hot-spot can occur at a given virtual processor if there is a large number of requests for metadata associated with subject data objects of write requests from other virtual processors to the given virtual processor.
- a virtual processor or a computer node is heavily loaded (due to performing a large quantity of operations as compared to virtual processors on other computer nodes), then one or more partitions of the heavily loaded virtual processor or computer node can be moved to one or more other virtual processors (which can be on the same or a different computer node). Moving a partition from a first virtual processor to a second virtual processor results in the first virtual processor no longer owning the metadata of the partition, and the second virtual processor owning the metadata of the partition.
- the partition P 11 initially included in the virtual processor VP 1 on the computer node CN 1 may be moved to the virtual processor VP 3 on the computer node CN 2 .
- the partition map 204 is modified to map P 11 to VP 3 .
- the movement of a partition between different computer nodes can be accomplished with reduced overhead since data in data objects does not have to be moved between computer nodes as a result of the movement of the partition. For example, when the partition P 11 is moved from the virtual processor VP 1 to the virtual processor VP 3 , the data objects associated with the metadata in the moved partition do not have to be moved between the computer nodes CN 1 and CN 2 . If the data objects associated with the metadata in the moved partition P 11 are in the write buffer of the computer node CN 1 , the data objects can be flushed to the shared storage system 102 (but do not have to be copied to the computer node CN 2 ). If the data objects associated with the metadata in the moved partition P 11 are already in the shared storage system 102 , then no movement of the data objects occurs in response to movement of the partition P 11 .
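- Continuing the partition-map sketch above, such a rebalancing move is a single partition-map update (a minimal sketch; the data objects themselves stay where they are):

```python
def move_partition(partition_map: dict, pid: int, target_vp: str) -> None:
    """Reassign ownership of one partition's metadata to another virtual processor."""
    partition_map[pid] = target_vp   # only the map entry changes; no data objects are copied


# Example: the entry for P 11 (entry 0 of the B 1 partition map) is remapped from VP1 to VP3.
# Any buffered writes for the partition are flushed to shared storage rather than copied.
move_partition(b1_partition_map, 0, "VP3")
```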
- a root or other higher-level element may be moved from the computer node CN 1 to the computer node CN 2 .
- An example of such a higher-level element is a superblock, which contains storage locations of the data objects.
- the amount of information in the superblock is much less than the data contained in the data objects referred to by the superblock, so the movement of the superblock between computer nodes has a relatively low overhead in terms of resources used.
- rebalancing can also be accomplished by moving a virtual processor (and its included partitions) between different computer nodes.
- the migration of a given virtual processor from a source computer node to a target computer node can similarly be accomplished without moving data of associated data objects between the source and target computer nodes, although superblocks or other higher-level elements may be moved.
- prefix-based hashing is applied on a prefix of a key of a data access request to produce a hash value from which a Partition ID is derived.
- Eq. 1 is modified to produce a hash value (Hash-Value) as follows:
- Hash-Value = Hash(Prefix of Request.key)   (Eq. 3)
- Prefix of Request.key represents the prefix of the key.
- the prefix has a specified prefix length.
- different data buckets can be associated with respective prefix lengths, at least some of which may be different from one another.
- FIG. 4 shows an example in which a prefix length L 1 is specified for data bucket B 1 , a prefix length L 2 is specified for data bucket B 2 , a prefix length L 3 is specified for data bucket B 3 , and so forth.
- the prefix lengths L 1 , L 2 , L 3 , and so forth, can be stored as part of configuration information 400 that is accessible by virtual processors in the cluster 100 of computer nodes.
- the configuration information 400 may be stored in an NV memory of each computer node, for example.
- a prefix length for a data bucket can be expressed as a percentage of the full key length (the full length of the key including the Bucket ID and Object ID).
- L 1 may be set at 100%
- L 2 may be set at 90%
- L 3 may be set at 75%
- a prefix length can be expressed as a quantity of bits of the key on which the hash function is to be applied.
- each prefix length in the configuration information 400 is specified for a respective collection of data buckets (a single data bucket or multiple data buckets).
- the ability to specify different prefix lengths for different data buckets enhances flexibility in how metadata for the different data buckets are sharded across virtual processors in the cluster 100 of computer nodes.
- the different data buckets may be associated with different applications.
- a “range query” is a query for a collection of data objects with keys within a specified range.
- a given data bucket contains R (R ≥ 2) data objects, where the key of each of the data objects in the given data bucket starts with "/myprefix/", which is a combination of the Bucket ID and a portion (less than the entirety) of the Object ID of a data object in the given data bucket.
- data object 1 has key “/myprefix/foo”
- data object 2 has key "/myprefix/bar", and so forth
- data object R has key "/myprefix/foobar".
- the prefix length for the given data bucket specified by the configuration information 400 is the length of the string “/myprefix/”.
- the hash function is applied on the same string “/myprefix/” of each key of the respective data object.
- the same Partition ID will be generated for each data object of the given data bucket, so that the metadata for all of the data objects in the given data bucket will end up in the same partition.
- the range query can be performed by just one virtual processor: the virtual processor including the partition to which all of the data objects of the given data bucket map.
- the range query can seek to retrieve or identify data objects of the given data bucket within a range of keys, such as between Key A and Key B.
- This range query can be performed at the virtual processor including the partition to which all of the data objects of the given data bucket map, rather than being performed by multiple virtual processors across multiple computer nodes. As a result, the range query can be performed more efficiently, since data objects would not have to be exchanged between virtual processors to be aggregated before returning the result to the requester.
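- A minimal sketch of Eq. 3 with per-bucket prefix lengths drawn from the configuration information; the specific prefix lengths, the SHA-256 hash, and the character-count representation are assumptions (the disclosure also allows lengths expressed as a percentage of the key or as a quantity of bits):

```python
import hashlib

NUM_PARTITIONS = 12

# Hypothetical per-bucket prefix lengths (the configuration information 400),
# expressed here as a number of characters of the key (Bucket ID + Object ID).
PREFIX_CHARS = {"B1": 256, "B2": len("B2" + "/myprefix/")}


def prefix_partition_id(bucket_id: str, object_id: str) -> int:
    """Compute a Partition ID per Eq. 3 by hashing only a prefix of the key."""
    key = bucket_id + object_id
    prefix = key[:PREFIX_CHARS.get(bucket_id, len(key))]
    hash_value = int.from_bytes(hashlib.sha256(prefix.encode()).digest(), "big")
    return hash_value % NUM_PARTITIONS


# All three keys share the hashed prefix "B2/myprefix/", so their metadata lands in
# the same partition, and a range query over these keys can be served by the single
# virtual processor that owns that partition.
assert (prefix_partition_id("B2", "/myprefix/foo")
        == prefix_partition_id("B2", "/myprefix/bar")
        == prefix_partition_id("B2", "/myprefix/foobar"))
```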
- FIG. 5 is a block diagram of a non-transitory machine-readable or computer-readable storage medium 500 storing machine-readable instructions that upon execution cause a processing system to perform various tasks.
- the processing system can include any one or more of the computer nodes shown in FIG. 1 , and/or any other computer system.
- the machine-readable instructions include partition map creation instructions 502 to create a partition map that maps partitions of a data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket.
- the portions of metadata for the data bucket are divided across the partitions.
- the partition map creation instructions 502 can include instructions of any or some combination of the map creation engines 202 - 1 to 202 - 3 of FIGS. 2 A and 2 B .
- the machine-readable instructions include partition identification instructions 504 to, responsive to a request to access a data object in the data bucket, identify which partition of the partitions contains metadata for the data object based on a key associated with the data object.
- the identification of the partition is based on applying a function (e.g., a hash function such as in Eq. 1 or 3) on a key associated with the request to access the data object.
- the function can be applied on a first portion of the key associated with the data object to obtain a value, which is used to identify the partition that contains the metadata for the data object.
- the first portion of the key is a prefix of the key, where the prefix of the key is less than an entirety of the key.
- the machine-readable instructions further include virtual processor identification instructions 506 to identify, based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object.
- the partition map associates the identified partition with the virtual processor.
- the machine-readable instructions further include VP-CN map update instructions 508 to, responsive to a migration of a first virtual processor from a first computer node to a second computer node of the cluster of computer nodes, update a VP-CN map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes.
- while the VP-CN map is updated, the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
- the migration of the first virtual processor from the first computer node to the second computer node causes migration of one or more portions of the metadata associated with the first virtual processor from the first computer node to the second computer node. In some examples, the migration of the first virtual processor from the first computer node to the second computer node is performed without performing movement of data owned by the first virtual processor between computer nodes of the cluster of computer nodes.
- prior to the migration of the first virtual processor, a first request for a first data object associated with a first key maps, based on the partition map, to the first virtual processor on the first computer node.
- after the migration of the first virtual processor, a second request for the first data object associated with the first key maps, based on the partition map, to the first virtual processor on the second computer node.
- in response to the first request, the machine-readable instructions obtain the metadata for the first data object from the first virtual processor on the first computer node; in response to the second request, the machine-readable instructions obtain the metadata for the first data object from the first virtual processor on the second computer node.
- different keys associated with respective data objects that share a same prefix map to a same partition
- the machine-readable instructions perform a range query for a range of keys at one or more virtual processors mapped to a partition associated with the range of keys.
- in some examples (e.g., if a given virtual processor is overloaded), the machine-readable instructions migrate a first partition from the given virtual processor to another virtual processor, and update the partition map in response to the migration of the first partition.
- FIG. 6 is a block diagram of a system 600 that includes a cluster of computer nodes 602 to execute a plurality of virtual processors 604 .
- the cluster of computer nodes 602 stores a first partition map 606 that maps partitions of a first data bucket to a first subset of virtual processors, where portions of metadata for the first data bucket are divided across the partitions of the first data bucket.
- the cluster of computer nodes 602 stores a second partition map 608 that maps partitions of a second data bucket to a second subset of virtual processors, where portions of metadata for the second data bucket are divided across the partitions of the second data bucket.
- the system 600 includes a shared storage system 610 accessible by the cluster of computer nodes 602 to store data of the first and second data buckets.
- a first virtual processor VP 1 in the cluster of computer nodes 602 responds to a request 612 to access a first data object in the first data bucket by identifying a first given partition of the partitions of the first data bucket that contains metadata for the first data object based on a first portion of a first key 614 associated with the first data object.
- the first virtual processor VP 1 identifies, based on the first given partition and using the first partition map 606 , a virtual processor that has the metadata for the first data object.
- a second virtual processor VP 2 in the cluster of computer nodes 602 responds to a request 616 to access a second data object in the second data bucket by identifying a second given partition of the partitions of the second data bucket that contains metadata for the second data object based on a second portion of a second key 618 associated with the second data object.
- the second portion of the second key 618 is of a different length than the first portion of the first key 614 .
- the second virtual processor VP 2 identifies, based on the second given partition and using the second partition map 608 , a virtual processor that has the metadata for the second data object.
- the identifying of the first given partition of the partitions of the first data bucket that contains metadata for the first data object is based on applying a hash function on the first portion of the first key associated with the first data object
- the identifying of the second given partition of the partitions of the second data bucket that contains metadata for the second data object is based on applying the hash function on the second portion of the second key associated with the second data object.
- FIG. 7 is a block diagram of a process 700 according to some examples, which may be performed by a system including a hardware processor.
- the system can include one computer or more than one computer.
- the process 700 includes receiving (at 702 ) a request to create a data bucket.
- the process 700 includes creating (at 704 ), in response to the request to create the data bucket, a partition map that maps partitions of the data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket, where portions of metadata for the data bucket are divided across the partitions.
- the process 700 includes receiving (at 706 ) a request to access a data object in the data bucket.
- the process 700 identifies (at 708 ) which partition of the partitions contains metadata for the data object based on applying a function on a key associated with the data object, and identifies (at 710 ), based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object.
- in response to a migration of a first virtual processor from a first computer node to a second computer node of the cluster of computer nodes, the process 700 updates (at 712 ) a virtual processor-computer node map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes, where the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
- a storage medium can include any or some combination of the following: a semiconductor memory device such as a dynamic or static random access memory (a DRAM or SRAM), an erasable and programmable read-only memory (EPROM), an electrically erasable and programmable read-only memory (EEPROM) and flash memory; a magnetic disk such as a fixed, floppy and removable disk; another magnetic medium including tape; an optical medium such as a compact disk (CD) or a digital video disk (DVD); or another type of storage device.
- Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture).
- An article or article of manufacture can refer to any manufactured single component or multiple components.
- the storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
Abstract
In some examples, a system creates a partition map that maps partitions of a data bucket to respective virtual processors executed in a cluster of computer nodes. Responsive to a request to access a data object in the data bucket, the system identifies which partition contains metadata for the data object based on a key associated with the data object, and identifies, based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object. Responsive to a migration of a first virtual processor from a first to a second computer node, the system updates a virtual processor-computer node map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes, where the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
Description
- A storage arrangement can include a cluster of computer nodes that manage access of data in a shared storage system that is shared by the cluster of computer nodes. Each computer node of the cluster of computer nodes can execute one or more virtual processors, with each virtual processor managing access to a respective data portion in the shared storage system.
- Some implementations of the present disclosure are described with respect to the following figures.
-
FIG. 1 is a block diagram of an arrangement that includes a cluster of computer nodes, a shared storage system, and requester devices, according to some examples. -
FIGS. 2A and 2B are block diagrams of partitions in respective virtual processors, according to some examples. -
FIG. 3 is a flow diagram of a write operation according to some examples. -
FIG. 4 is a block diagram of configuration information specifying prefix lengths for respective data buckets, according to some examples. -
FIG. 5 is a block diagram of a storage medium storing machine-readable instructions according to some examples. -
FIG. 6 is a block diagram of a system according to some examples. -
FIG. 7 is a flow diagram of a process according to some examples. - Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements. The figures are not necessarily to scale, and the size of some parts may be exaggerated to more clearly illustrate the example shown. Moreover, the drawings provide examples and/or implementations consistent with the description; however, the description is not limited to the examples and/or implementations provided in the drawings.
- A “virtual processor” can refer to a computing entity implemented with machine-readable instructions that are executable on a computer node. By managing access of different data portions using respective different virtual processors executed in a cluster of computer nodes, data throughput can be improved when the virtual processors access data in parallel from a shared storage system. A “shared” storage system is a storage system that is shared (i.e., accessible) by any computer node of the cluster of computer nodes.
- A data access request (write request or read request) can be received at a given computer node of the cluster of computer nodes. A virtual processor in the given computer node can be assigned to handle the data access request. The virtual processor assigned to handle the data access request may be referred to as a “source virtual processor” with respect to the data access request. In some examples, the source virtual processor may not be the virtual processor that “owns” (i.e., manages access and/or updates to) metadata for the data that is the subject of the data access request. The virtual processor that owns metadata for data that is the subject of a given data access request may be referred to as a “metadata virtual processor” with respect to the given data access request. A virtual processor may also “own” a data object; such a virtual processor is responsible for managing the access and/or updates of the data object.
- To process a data access request (such as a request to access a data object), the source virtual processor determines which virtual processor is the metadata virtual processor, and obtains the metadata from the metadata virtual processor. An example of the metadata may include a list of chunk identifiers of chunks that make up the data object. The list of chunk identifiers can be used by the source virtual processor to retrieve the chunks of the data object. Another example of the metadata can include a version of the data object.
- For load balancing and improved throughput, metadata for respective data objects can be partitioned into multiple partitions that are spread across multiple virtual processors executing in a cluster of computer nodes. Each computer node can execute one or more virtual processors, and each virtual processor may include one or more partitions. A virtual processor “including” a partition can refer to the virtual processor owning a portion of the metadata (referred to as “metadata portion”) in the partition.
- For load balancing reasons and to address faults or failures in computer nodes of the cluster of computer nodes, a virtual processor may be migrated from a source computer node to a target computer node. However, virtual processor migration can lead to processing overhead related to maintaining associations of metadata portions with respective virtual processors. An association between a metadata portion and a given virtual processor can be represented by mapping information. If the associations between metadata portions and virtual processors are not properly maintained in response to migrations of virtual processors, then source virtual processors may have difficulty finding metadata virtual processors when handling incoming data access requests.
- Also, in examples where data is stored in the shared storage system in the form of data buckets, a further challenge relates to how metadata for the data buckets is sharded across the cluster of computer nodes as the cluster changes over time (such as due to additions of computer nodes to the cluster).
- In accordance with some implementations of the present disclosure, a mapping scheme is provided that uses a partition map and a virtual processor-computer node (VP-CN) map. The partition map associates (maps) partitions (that include metadata portions) to respective virtual processors. The VP-CN map associates (maps) virtual processors to respective computer nodes of the cluster of computer nodes. When a virtual processor is migrated between computer nodes, the VP-CN map is updated, but the partition map does not change. As a result, requests for data objects that have keys that map to a given virtual processor continue to map to the given virtual processor after the migration of the given virtual processor between different computer nodes. The static nature of the partition map in the context of virtual processor migrations allows a system including the cluster of computer nodes to deterministically map metadata of data objects to virtual processors.
- In addition, data buckets can be received at different times. A first data bucket may be generated or received when the cluster of computer nodes has a first quantity of computer nodes. Later, one or more computer nodes may be added to the cluster. Even though the cluster of computer nodes has been expanded, the partition map and the VP-CN map for the first data bucket are not changed, which avoids the burden associated with having to update mappings as the cluster of computer nodes expands. If a second data bucket is generated or received after the expansion of the cluster of computer nodes, a partition map may be created for the second data bucket that makes use of the increased quantity of computer nodes. Note that there may be one partition map per data bucket (so that there are multiple partition maps for respective data buckets). However, there is one VP-CN map that is common to multiple data buckets.
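- The following Python sketch illustrates the general idea of the two maps, assuming one partition map per data bucket and a single VP-CN map shared by all buckets. The map contents, names, and lookup helpers are illustrative assumptions rather than a definitive implementation.

```python
# Minimal sketch of the two-map scheme: a static per-bucket partition map and a
# cluster-wide VP-CN map that changes only when a virtual processor migrates.

partition_maps = {
    # bucket_id -> list indexed by partition number; entry i names the owning VP
    "B1": ["VP1", "VP1", "VP1", "VP2", "VP2", "VP2",
           "VP3", "VP3", "VP3", "VP4", "VP4", "VP4"],
}

vp_cn_map = {"VP1": "CN1", "VP2": "CN1", "VP3": "CN2", "VP4": "CN2"}


def locate_metadata_owner(bucket_id: str, partition_id: int):
    """Resolve partition -> virtual processor -> computer node."""
    vp = partition_maps[bucket_id][partition_id]  # static per bucket
    cn = vp_cn_map[vp]                            # updated on migration
    return vp, cn


def migrate_virtual_processor(vp: str, target_cn: str) -> None:
    """Only the VP-CN map is updated; partition maps stay unchanged,
    so keys keep mapping to the same virtual processor."""
    vp_cn_map[vp] = target_cn


print(locate_metadata_owner("B1", 7))   # ('VP3', 'CN2')
migrate_virtual_processor("VP3", "CN3")
print(locate_metadata_owner("B1", 7))   # ('VP3', 'CN3') - same VP, new node
```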
-
FIG. 1 is a block diagram of an example arrangement that includes a cluster 100 of computer nodes CN1, CN2, and CN3. Although three computer nodes are shown in FIG. 1, in other examples, the cluster 100 can include fewer than three or more than three computer nodes. A “computer node” can refer to a physical node that includes a processing resource as well as other resources (e.g., communication resources and memory resources) that can perform various tasks. A “processing resource” can include one or more hardware processing circuits, which can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit. A “communication resource” can include a communication interface (including communication hardware such as a transceiver and any related machine-readable instructions). A “memory resource” can include a memory implemented with a collection of one or more memory devices. - The
cluster 100 of computer nodes is able to manage the access of data stored in a shared storage system 102 in response to data access requests received over a network 112 from requester devices 104. As used here, a “requester device” can refer to any electronic device that is able to send a request to access data (read data or write data). Examples of electronic devices include any or some combination of the following: desktop computers, notebook computers, tablet computers, server computers, game appliances, Internet-of-Things (IoT) devices, vehicles, household appliances, and so forth. Examples of the network 112 can include any or some combination of the following: a storage area network (SAN), a local area network (LAN), a wide area network (WAN), and so forth. - The shared
storage system 102 is accessible by any of the computer nodes CN1 to CN3 over a communication link 106 between the cluster 100 of computer nodes and the shared storage system 102. The shared storage system 102 is implemented using a collection of storage devices 108. As used here, a “collection” of items can refer to a single item or to multiple items. Thus, the collection of storage devices 108 can include a single storage device or multiple storage devices. Examples of storage devices can include any or some combination of the following: disk-based storage devices, solid state drives, and so forth. - The
requester devices 104 can send data access requests to any of the computer nodes CN1 to CN3 over the network 112. Each computer node executes a collection of virtual processors (a single virtual processor or multiple virtual processors). A virtual processor is executed by a processing resource of a computer node. - In the example of
FIG. 1, the computer node CN1 includes virtual processors VP1 and VP2, the computer node CN2 includes virtual processors VP3 and VP4, and the computer node CN3 includes virtual processors VP5 and VP6. Although FIG. 1 shows two virtual processors in each computer node, in other examples, a computer node may include a different number (e.g., 1 or more than 2) of virtual processors. Communications among the virtual processors VP1, VP2, VP3, VP4, VP5, and VP6 can occur over a communication link between the virtual processors, such as an inter-process link. - Virtual processors can also be migrated between computer nodes. For example, to achieve load balancing or for fault tolerance or recovery, a first virtual processor may be migrated from a current (source) computer node to a destination computer node. Once migrated, the first virtual processor executes at the destination computer node.
- In some examples, data to be stored in the shared
storage system 102 by virtual processors can be part of data buckets. A data bucket can refer to any type of container that includes a collection of data objects (a single data object or multiple data objects). A specific example of a data bucket is an S3 bucket in an Amazon cloud storage. In other examples, other types of data buckets can be employed. - A data object can be divided into a collection of data chunks (a single data chunk or multiple data chunks). Each data chunk (or more simply “chunk”) has a specified size (a static size or a size that can dynamically change). The storage locations of the chunks are storage locations in the shared
storage system 102. - Each of the virtual processors VP1 to VP6 may maintain respective metadata associated with data objects stored or to be stored in the shared
storage system 102. In some examples, the metadata can include data object metadata such as a list of chunk identifiers (IDs) that identify chunks of a data object, and a virtual processor ID that identifies a virtual processor. The list of chunk IDs can include a single chunk ID or multiple chunk IDs. The data object metadata can further include a version ID that represents a version of a data object. As a data object is modified by write request(s), corresponding different versions of the data object are created (and identified by respective version IDs). As used here, an “ID” can refer to any information (e.g., a name, a string, etc.) that can be used to distinguish one item from another item (e.g., distinguish between chunks, or distinguish between data objects, or distinguish between versions of a data object, or distinguish between virtual processors, and so forth). - The metadata may further include storage location metadata representing storage locations of chunks of data objects in the shared
storage system 102. For example, the storage location metadata can include any or some combination of the following: an offset, a storage address, a block number, and so forth. - The metadata may also include commit metadata maintained by a metadata virtual processor during write operations and read operations. The commit metadata indicates whether a write of a subject data object is in progress (e.g., at a virtual processor that is assigned to handle the write) or a write of the subject data object is no longer in progress (i.e., a write of the subject data object is complete). Note that a write of the subject data object is complete if the subject data object has been written to either a write buffer (not shown) or the shared
storage system 102. A write buffer can be part of an NV memory (e.g., any of 110-1 to 110-3) and is associated with a respective virtual processor for caching write data. - In other examples, additional or alternative metadata may be employed.
- As noted above, metadata can be partitioned into multiple partitions that are spread across multiple virtual processors executing in the
cluster 100 of computer nodes. Each virtual processor may include one or more partitions. A virtual processor “including” a partition can refer to the virtual processor owning a metadata portion in the partition. - As shown in
FIG. 1 , the virtual processor VP1 includes partitions P11, P12, P13 for data bucket B1, and includes partitions P1A, P1B for data bucket B2. The partitions for the data bucket B1 are shaded, while the partitions for the data bucket B2 are not shaded inFIG. 1 . Similarly, the virtual processor VP2 includes partitions P21, P22, P23 for data bucket B1, and includes partitions P2A, P2B for data bucket B2, the virtual processor VP3 includes partitions P31, P32, P33 for data bucket B1, and includes partitions P3A, P3B for data bucket B2, the virtual processor VP4 includes partitions P41, P42, P43 for data bucket B1, and includes partitions P4A, P4B for data bucket B2, the virtual processor VP5 includes partitions P51, P52, P53 for data bucket B1, and includes partitions P5A, P5B for data bucket B2, and the virtual processor VP6 includes partitions P61, P62, P63 for data bucket B1, and includes partitions P6A, P6B for data bucket B2. - Even though the partitions are shown as being inside respective virtual processors in
FIG. 1 , note that the metadata portions of the partitions are stored in respective non-volatile (NV) memories 110-1, 110-2, and 110-3 of the corresponding computer nodes CN1, CN2, and CN3. An NV memory is a memory that can persistently store data, such that data stored in the NV memory is not lost when power is removed from the NV memory or from a computer node in which the NV memory is located. An NV memory can include a collection of NV memory devices (a single NV memory device or multiple NV memory devices), such as flash memory devices and/or other types of NV memory devices. - Also, although each virtual processor is depicted as including a specific quantity of partitions, in other examples, a virtual processor can include a different quantity of partitions for any given data bucket.
- Each virtual processor is responsible for managing respective metadata of one or more partitions. In an example, there are M (M≥2) partitions and P (P≥2) virtual processors. In such an example, each virtual processor of the P virtual processors is responsible for managing metadata of M/P partitions.
- A partition is identified by a Partition ID. In some examples, a Partition ID is based on applying a hash function (e.g., a cryptographic hash function or another type of hash function) on information (key) associated with a data object. In some examples, the key associated with the data object on which the hash function is applied includes a Bucket ID identifying the data bucket of which the data object is part, and an Object ID that identifies the data object. The hash function applied on the key associated with the data object produces a hash value that is the Partition ID or from which the Partition ID is derived.
- For example, a hash value (Hash-Value) is computed as follows:
- Hash-Value=Hash(F(Request.key))   (Eq. 1)
- where F(x) is a transformation applied on x (in this case a transformation applied on Request.key), Hash( ) is a hash function, and Request.key includes the Bucket ID and the Object ID of a data access request (a read request or write request). The transformation can be a concatenation operation of Bucket ID and Object ID, for example.
- The Partition ID is computed based on the Hash-Value as follows:
- Partition ID=Hash-Value % #Partitions   (Eq. 2)
- where % is a modulus operation, and #Partitions is a constant (which can be a configured value) that represents how many partitions are to be divided across a cluster of computer nodes. The modulus operation computes the remainder after dividing Hash-Value by #Partitions.
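- A minimal sketch of the Partition ID computation of Eq. 1 and Eq. 2 is shown below, assuming SHA-256 as the hash function and a simple concatenation of Bucket ID and Object ID as F; the disclosure leaves both choices open.

```python
import hashlib

NUM_PARTITIONS = 12  # the configurable constant #Partitions


def compute_partition_id(bucket_id: str, object_id: str,
                         num_partitions: int = NUM_PARTITIONS) -> int:
    # F(Request.key): concatenate Bucket ID and Object ID (Eq. 1)
    key = f"{bucket_id}/{object_id}"
    # Hash(F(Request.key)); SHA-256 stands in for the unspecified hash function
    hash_value = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    # Partition ID = Hash-Value % #Partitions (Eq. 2)
    return hash_value % num_partitions


print(compute_partition_id("B1", "invoices/2024/03/report.pdf"))
```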
- As each data bucket is created, the metadata for each data bucket is sharded across multiple partitions. A given partition defines a subset of the object key-space for a specific data bucket and maps to one virtual processor. The object key-space for a data bucket refers to the possible range of values associated with the key for data objects in the data bucket, such as by Hash-Value-Range above. The entire object key-space can be uniformly distributed among the partitions, which can lead to equal size partitions. A last partition may be larger or smaller if the object key-space is not exactly divisible by the number of partitions.
- As shown in
FIG. 1 , multiple partitions of the same bucket can potentially be included in the same virtual processor. In some examples, every data object maps to one partition and all data objects in a partition belong to the same data bucket. Further, in some examples, all versions of a given data object map to the same partition; in other words, the hashing of keys of data objects to map to partitions is version agnostic. - In some examples, the quantity of partitions for a data bucket is a configurable constant (e.g., #Partitions in Eq. 2 above) across all buckets-in other words, each data bucket of multiple data buckets to be stored in the shared
storage system 102 can have a static quantity of partitions. In the example ofFIG. 1 , the configurable constant is 12 (in a different example, a different configurable constant of partitions can be used). For data bucket B1, the 12 partitions are divided across 4 virtual processors VP1 to VP4 such that each virtual processor (of VP1 to VP4) includes three partitions for bucket B1. For data bucket B2, the 12 partitions are divided across 6 virtual processors VP1 to VP6 such that each virtual processor (of VP1 to VP6) includes two partitions for bucket B2. - For an even distribution of partitions across virtual processors, the configurable constant (e.g., #Partitions in Eq. 2) may be a highly composite number with a relatively large quantity of divisors. For example, the number 48 is divisible by the following divisors: 1, 2, 3, 4, 6, 8, 12, 16, 24, 48. When a data bucket is created, the data bucket is associated with the static quantity of partitions, which are spread out as uniformly as possible across all available virtual processors in the
cluster 100 at the time the data bucket is created. -
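- The following sketch shows one way a partition map could be populated at bucket creation time so that a fixed quantity of partitions is spread as uniformly as possible over the virtual processors available at that moment; it reproduces the example layouts of 12 partitions over four and over six virtual processors. The block-assignment formula is an assumption, not the only possible uniform spread.

```python
NUM_PARTITIONS = 12  # static per-bucket constant; ideally a highly composite number


def create_partition_map(available_vps: list) -> list:
    """Map each of the bucket's partitions to a virtual processor, spreading
    them as uniformly as possible over the VPs available at creation time."""
    return [available_vps[i * len(available_vps) // NUM_PARTITIONS]
            for i in range(NUM_PARTITIONS)]


# Bucket B1 is created while only VP1-VP4 exist: three partitions per VP.
print(create_partition_map(["VP1", "VP2", "VP3", "VP4"]))
# Bucket B2 is created after VP5 and VP6 are added: two partitions per VP.
print(create_partition_map(["VP1", "VP2", "VP3", "VP4", "VP5", "VP6"]))
```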
FIG. 1 further depicts partition maps and VP-CN maps stored in the NV memories 110-1 to 110-3, which are discussed below in conjunction withFIGS. 2A and 2B . -
FIG. 2A shows an example arrangement of thecluster 100 of computer nodes at a first point in time (T1). At this first point in time, thecluster 100 of computer nodes includes two computer nodes CN1 and CN2. The computer node CN1 executes the virtual processors VP1, VP2, and the computer node CN2 executes the virtual processors VP3, VP4. Note that at time T1, the computer node CN3 is not yet present in thecluster 100. - Data bucket B1 is created when the
cluster 100 is configured with the computer nodes CN1 and CN2 at time T1. In some examples, the creation of data bucket B1 is triggered by a user, such as at arequester device 104. In other examples, another entity (a program or a machine) can trigger the creation of data bucket B1. Creation of a data bucket can be triggered in response to a request or any other type of input from an entity (user, program, or machine). - In some examples, the request to create data bucket B1 can be received at any of the computer nodes of the
cluster 100. The computer node that received the request creates a partition map. Each computer node includes a map creation engine that creates the partition map. In the example ofFIG. 2A , the computer nodes CN1 and CN2 include respective map creation engines 202-1 and 202-2. - As used here, an “engine” can refer to one or more hardware processing circuits, which can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit. Alternatively, an “engine” can refer to a combination of one or more hardware processing circuits and machine-readable instructions (software and/or firmware) executable on the one or more hardware processing circuits.
- It is assumed that the computer node CN1 received the request to create data bucket B1. In this example, the map creation engine 202-1 responds to the request to create data bucket B1 by creating a
B1 partition map 204 that maps partitions containing metadata for data objects of data bucket B1 to respective virtual processors. In the example ofFIG. 2A , it is assumed that metadata for a data bucket is sharded across 12 partitions (in other words, #Partitions=12). As a result, since there are four virtual processors VP1, VP2, VP3, and VP4, three partitions are mapped to each virtual processor. - The
B1 partition map 204 includes a number of entries that is equal to #Partitions. If #Partitions is 12, then theB1 partition map 204 includes 12 entries. Each entry of the 12 entries corresponds to a respective partition of the 12 partitions. Thus, the first entry of theB1 partition map 204 corresponds to partition P11, the second entry of theB1 partition map 204 corresponds to partition P12, the third entry of theB1 partition map 204 corresponds to partition P13, the fourth entry of theB1 partition map 204 corresponds to partition P21, and so forth. Each entry of theB1 partition map 204 contains an identifier of a virtual processor that is associated with the corresponding partition. Thus, in the example ofFIG. 2A , the 12 entries associate partitions to virtual processors as follows: the first entry maps P11 to VP1, the second entry maps P12 to VP1, the third entry maps P13 to VP1, the fourth entry maps P21 to VP2, the fifth entry maps P22 to VP2, the sixth entry maps P23 to VP2, the seventh entry maps P31 to VP3, the eighth entry maps P32 to VP3, the ninth entry maps P33 to VP3, the tenth entry maps P41 to VP4, the eleventh entry maps P42 to VP4, and the twelfth entry maps P43 to VP4. - Once created by the map creation engine 202-1, the
B1 partition map 204 can be stored (persisted) in the sharedstorage system 102 by the computer node CN1 that processed the request to create data bucket B1. The other computer nodes, such as CN2, can read theB1 partition map 204 from the shared storage system 102 (such as while processing write and read requests) and cache a copy of theB1 partition map 204 in the NV memory 110-2. - Note that each computer node CN1 or CN2 also includes a VP-
CN map 206 that is created by a map creation engine (same as or different from the map creation engine used to create a partition map). The VP-CN map 206 associates virtual processors to computer nodes that are present in thecluster 100 at time T1. The first entry of the VP-CN map 206 maps VP1 to CN1, the second entry of the VP-CN map 206 maps VP2 to CN1, the third entry of the VP-CN map 206 maps VP3 to CN2, and the fourth entry of the VP-CN map 206 maps VP4 to CN2. - If the configuration of the
cluster 100 is changed (by adding or removing computer nodes), then the VP-CN map 206 can be modified to change the mapping of virtual processors and computer nodes. - In further examples, the map creation engine(s) to create partition maps and/or a VP-CN maps can be in a computer system separate from the
cluster 100 of the computer nodes. -
FIG. 2B shows an arrangement of thecluster 100 of computer nodes at a later point in time (T2). At time T2, the third computer node CN3 has been added to thecluster 100 of computer nodes. The third computer node CN3 executes the virtual processors VP5 and VP6, and includes a map creation engine 202-3. - After the third computer node CN3 has been added, data bucket B2 is created. The metadata for the data objects of data bucket B2 is divided into 12 partitions (#Partitions=12) across the virtual processors VP1 to VP6 of the three computer nodes CN1, CN2, and CN3. As a result, two partitions are included in each virtual processor for data bucket B2.
- In response to the creation of data bucket B2, a map creation engine (any of 202-1, 202-2, and 202-3 in the computer node that received the request to create data bucket B2) creates a
B2 partition map 208. The 12 entries of theB2 partition map 208 correspond to the 12 partitions P1A, P1B, P2A, P2B, P3A, P3B, P4A, P4B, P5A, P5B, P6A, and P6B, respectively. The 12 entries of theB2 partition map 208 associate partitions to virtual processors as follows: the first entry maps P1A to VP1, the second entry maps P1B to VP1, the third entry maps P2A to VP2, the fourth entry maps P2B to VP2, the fifth entry maps P3A to VP3, the sixth entry maps P3B to VP3, the seventh entry maps P4A to VP4, the eighth entry maps P4B to VP4, the ninth entry maps P5A to VP5, the tenth entry maps P5B to VP5, the eleventh entry maps P6A to VP6, and the twelfth entry maps P6B to VP6. - Once created, the
B2 partition map 208 can be written by the computer node at which theB2 partition map 208 was created to the sharedstorage system 102, and the other two computer nodes can read theB2 partition map 208 from the sharedstorage system 102 to cache respective copies of theB2 partition map 208 in the respective NV memories. Note that each of the computer nodes CN1 and CN2 includes the following maps: theB1 partition map 204, the VP-CN map 206, and theB2 partition map 208. The computer node CN3 includes the VP-CN map 206 and theB2 partition map 208, but does not include theB1 partition map 204. - In some examples, the partition maps (204 and 208) once assigned can remain static as the
cluster 100 of computer nodes is expanded by adding more computer nodes. The partition maps 204 and 208 and the VP-CN map 206 may be stored in a repository (e.g., in the shared storage system), with copies of themaps -
FIG. 3 is a flow diagram of an example write operation performed in response to an incoming write request received by a computer node from a requester device 104. For simplicity, the flow depicted in FIG. 3 omits some tasks that may be performed as part of the write operation. - The computer node that received the incoming write request is referred to as the “receiving computer node.” In examples where the receiving computer node executes multiple virtual processors, the receiving computer node selects one of the multiple virtual processors as a source
virtual processor 302 to handle the incoming write request. The selection can be a random selection process in which the source virtual processor 302 is randomly selected from the multiple virtual processors in the receiving computer node. In another example, the selection process can be based on determining relative loads of the virtual processors in the receiving computer node, and selecting a virtual processor with the least load to be the source virtual processor 302. - In response to the write request, the receiving computer node determines which of the virtual processors VP1 to VP6 in the
cluster 100 of computer nodes is a metadata virtual processor 304 for the subject data object of the incoming write request 120. The receiving computer node can apply a hash function or another type of function on a key associated with the subject data object, to produce a Partition ID that identifies the partition that the metadata for the subject data object is part of. The key includes a Bucket ID identifying the data bucket that the subject data object is part of, and an Object ID that identifies the subject data object. Once the Partition ID is obtained, the source virtual processor 302 accesses (at 306) a partition map for the data bucket identified by the Bucket ID to determine which virtual processor (the metadata virtual processor 304) is mapped to the partition identified by the Partition ID. This virtual processor is the metadata virtual processor 304. - The source
virtual processor 302 also accesses (at 308) the VP-CN map (e.g., 206) to identify which computer node the metadata virtual processor 304 executes in. The source virtual processor 302 sends (at 310) a control message to the metadata virtual processor 304 in the identified computer node. The control message contains the Bucket ID, Partition ID, Object ID, and Version ID of the subject data object. The control message is to indicate to the metadata virtual processor 304 that the source virtual processor 302 is ready to start the write operation for the incoming write request. - In response to the control message, the metadata
virtual processor 304 generates (at 312) a list of chunk IDs that identifies one or more data chunks for the subject data object. The chunk ID(s) generated depend(s) on the size of the subject data object, which determines how many data chunks are to be divided from the subject data object. The metadata virtual processor 304 sends (at 314) the list of chunk IDs to the source virtual processor 302. - In response to receiving the list of chunk IDs from the metadata
virtual processor 304, the source virtual processor 302 writes (at 316) the chunk(s) of the subject data object using the chunk ID(s) included in the list of chunk IDs. The source virtual processor 302 can write the data chunk(s) to a write buffer (not shown) associated with the source virtual processor 302 and/or to the shared storage system 102. - Note that rebalancing of partitions across computer nodes may occur if one or more computer nodes are removed from the
cluster 100. The partitions of the removed one or more computer nodes can be added to virtual processors of the remaining computer nodes of thecluster 100. - Rebalancing of partitions across computer nodes of the
cluster 100 may also occur in response to another condition, such as overloading or a fault of a virtual processor or a computer node. For example, if a given virtual processor is overloaded, then a partition can be moved from the given virtual processor to another virtual processor. Rebalancing one or more partitions from heavily loaded virtual processors to more lightly loaded virtual processors can reduce metadata hot-spots. A metadata hot-spot can occur at a given virtual processor if there is a large number of requests for metadata associated with subject data objects of write requests from other virtual processors to the given virtual processor. - If a virtual processor or a computer node is heavily loaded (due to performing a large quantity of operations as compared to virtual processors on other computer nodes), then one or more partitions of the heavily loaded virtual processor or computer node can be moved to one or more other virtual processors (which can be on the same or a different computer node). Moving a partition from a first virtual processor to a second virtual processor results in the first virtual processor no longer owning the metadata of the partition, and the second virtual processor owning the metadata of the partition.
- As an example, in
FIG. 1, the partition P11 initially included in the virtual processor VP1 on the computer node CN1 may be moved to the virtual processor VP3 on the computer node CN2. As a result of the movement of the partition P11, the partition map 204 is modified to map P11 to VP3. - The movement of a partition between different computer nodes can be accomplished with reduced overhead since data in data objects does not have to be moved between computer nodes as a result of the movement of the partition. For example, when the partition P11 is moved from the virtual processor VP1 to the virtual processor VP3, the data objects associated with the metadata in the moved partition do not have to be moved between the computer nodes CN1 and CN2. If the data objects associated with the metadata in the moved partition P11 are in the write buffer of the computer node CN1, the data objects can be flushed to the shared storage system 102 (but do not have to be copied to the computer node CN2). If the data objects associated with the metadata in the moved partition P11 are already in the shared
storage system 102, then no movement of the data objects occur in response to movement of the partition P11. - In some examples, if the data objects associated with the metadata in the moved partition P11 are arranged in a hierarchical structure, then a root or other higher-level element may be moved from the computer node CN1 to the computer node CN2. An example of such a higher-level element is a superblock, which contains storage locations of the data objects. However, the amount of information in the superblock is much less than the data contained in the data objects referred to by the superblock, so the movement of the superblock between computer nodes has a relatively low overhead in terms of resources used.
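- The partition movement described above can be sketched as follows, assuming a simple in-memory partition map, a per-partition write buffer, and a dictionary standing in for the shared storage system; only the partition-map entry (and any buffered objects flushed to shared storage) changes, and no data objects are copied between computer nodes.

```python
# Hedged sketch of moving one partition between virtual processors for
# rebalancing; names and structures are illustrative assumptions.

partition_map_b1 = {"P11": "VP1", "P12": "VP1", "P13": "VP1", "P21": "VP2"}


def move_partition(partition_id: str, target_vp: str,
                   write_buffer=None, shared_storage=None) -> None:
    source_vp = partition_map_b1[partition_id]
    if write_buffer is not None and shared_storage is not None:
        # Flush buffered data objects of this partition to shared storage
        # instead of copying them to the target computer node.
        for obj in write_buffer.pop(partition_id, []):
            shared_storage.setdefault(partition_id, []).append(obj)
    partition_map_b1[partition_id] = target_vp  # ownership change only
    print(f"{partition_id}: {source_vp} -> {target_vp}")


write_buffer = {"P11": ["obj-a", "obj-b"]}
shared_storage = {}
move_partition("P11", "VP3", write_buffer, shared_storage)
print(partition_map_b1["P11"], shared_storage)
```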
- Note that rebalancing can also be accomplished by moving a virtual processor (and its included partitions) between different computer nodes. The migration of a given virtual processor from a source computer node to a target computer node can similarly be accomplished without moving data of associated data objects between the source and target computer nodes, although superblocks or other higher-level elements may be moved.
- In some examples, prefix-based hashing is applied on a prefix of a key of a data access request to produce a hash value from which a Partition ID is derived. In such examples, Eq. 1 is modified to produce a hash value (Hash-Value) as follows:
-
- Hash-Value=Hash(F(Prefix of Request.key))   (Eq. 3)
- The prefix has a specified prefix length. In some examples, different data buckets can be associated with respective prefix lengths, at least some of which may be different from one another.
- For example,
FIG. 4 shows an example in which a prefix length L1 is specified for data bucket B1, a prefix length L2 is specified for data bucket B2, a prefix length L3 is specified for data bucket B3, and so forth. The prefix lengths L1, L2, L3, and so forth, can be stored as part ofconfiguration information 400 that is accessible by virtual processors in thecluster 100 of computer nodes. Theconfiguration information 400 may be stored in an NV memory of each computer node, for example. In some examples, a prefix length for a data bucket can be expressed as a percentage of the full key length (the full length of the key including the Bucket ID and Object ID). For example, L1 may be set at 100%, L2 may be set at 90%, L3 may be set at 75%, and so forth. In other examples, a prefix length can be expressed as a quantity of bits of the key on which the hash function is to be applied. - More generally, each prefix length in the
configuration information 400 is specified for a respective collection of data buckets (a single data bucket or multiple data buckets). - The ability to specify different prefix lengths for different data buckets (or more generally different collections of data buckets) enhances flexibility in how metadata for the different data buckets are sharded across virtual processors in the
cluster 100 of computer nodes. For example, the different data buckets (or collections of data buckets) may be associated with different applications. - The computation of hash values from which Partition IDs are determined based on prefixes of keys (rather than the entirety of the keys) can allow for greater efficiency when performing range queries with respect to data objects. A “range query” is a query for a collection of data objects with keys within a specified range.
- In an example, it is assumed that a given data bucket contains R (R≥2) data objects, where the keys of each of the data objects in the given data bucket start with “/myprefix/”, which is a combination of the Bucket ID and a portion (less than the entirety of the Object ID of a data object in the given data bucket. In this example, data object 1 has key “/myprefix/foo”, data object 2 has key “/myprefix/bar, . . . , and data object N has key “/myprefix/foobar”.
- Further, assume the prefix length for the given data bucket specified by the
configuration information 400 is the length of the string “/myprefix/”. As a result, when computing the hash value according to Eq. 3 based on the key of each respective data object in the given data bucket, the hash function is applied on the same string “/myprefix/” of each key of the respective data object. The same Partition ID will be generated for each data object of the given data bucket, so that the metadata for all of the data objects in the given data bucket will end up in the same partition, - Subsequently, if a user or another entity submits a range query (such as from a
requester device 104 ofFIG. 1 ), the range query can be performed by just one virtual processor including the partition to which all of the data objects of given data bucket map. As an example, the range query can seek to retrieve or identify data objects of the given data bucket within a range of keys, such as between Key A and Key B. This range query can be performed at the virtual processor including the partition to which all of the data objects of given data bucket map, rather than being performed by multiple virtual processors across multiple computer nodes. As a result, the range query can be more efficiently performed, since data objects would not have to be exchanged between virtual processors to be aggregated before returning the result to the requester. -
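- The prefix-based mapping and its benefit for range queries can be sketched as follows, assuming SHA-256 as the hash function and a per-bucket prefix length expressed as a character count (the disclosure also allows percentages of the full key length or bit counts). All keys sharing the prefix "/myprefix/" hash to the same Partition ID, so a range query over them can be served by the single virtual processor that owns that partition.

```python
import hashlib

NUM_PARTITIONS = 12
# Per-bucket prefix lengths (here as character counts; percentages of the full
# key length or bit counts would be handled analogously).
PREFIX_LENGTHS = {"given-bucket": len("/myprefix/")}


def partition_for_key(bucket_id: str, key: str) -> int:
    prefix = key[:PREFIX_LENGTHS.get(bucket_id, len(key))]
    hash_value = int.from_bytes(hashlib.sha256(prefix.encode()).digest(), "big")
    return hash_value % NUM_PARTITIONS  # Eq. 3 followed by Eq. 2


keys = ["/myprefix/foo", "/myprefix/bar", "/myprefix/foobar"]
print({k: partition_for_key("given-bucket", k) for k in keys})
# All three keys share the prefix "/myprefix/", so they land in one partition,
# letting a range query over these keys run on a single virtual processor.
```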
FIG. 5 is a block diagram of a non-transitory machine-readable or computer-readable storage medium 500 storing machine-readable instructions that upon execution cause a processing system to perform various tasks. The processing system can include any one or more of the computer nodes shown inFIG. 1 , and/or any other computer system. - The machine-readable instructions include partition
map creation instructions 502 to create a partition map that maps partitions of a data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket. The portions of metadata for the data bucket are divided across the partitions. The partitionmap creation instructions 502 can include instructions of any or some combination of the map creation engines 202-1 to 202-3 ofFIGS. 2A and 2B . - The machine-readable instructions include
partition identification instructions 504 to, responsive to a request to access a data object in the data bucket, identify which partition of the partitions contains metadata for the data object based on a key associated with the data object. In some examples, the identification of the partition is based on applying a function (e.g., a hash function such as in Eq. 1 or 3) on a key associated with the request to access the data object. - The function can be applied on a first portion of the key associated with the data object to obtain a value, which is used to identify the partition that contains the metadata for the data object. In some examples, the first portion of the key is a prefix of the key, where the prefix of the key is less than an entirety of the key.
- The machine-readable instructions further include virtual
processor identification instructions 506 to identify, based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object. The partition map associates the identified partition with the virtual processor. - The machine-readable instructions further include VP-CN
map update instructions 508 to, responsive to a migration of a first virtual processor from a first computer node to a second computer node of the cluster of computer nodes, update a VP-CN map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes. Although the VP-CN map is updated, the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node. - In some examples, the migration of the first virtual processor from the first computer node to the second computer node causes migration of one or more portions of the metadata associated with the first virtual processor from the first computer node to the second computer node. In some examples, the migration of the first virtual processor from the first computer node to the second computer node is performed without performing movement of data owned by the first virtual processor between computer nodes of the cluster of computer nodes.
- In some examples, prior to the migration of the first virtual processor, a first request for a first data object associated with a first key maps, based on the partition map, to the first virtual processor on the first computer node. After the migration of the first virtual processor, a second request for the first data object associated with the first key maps, based on the partition map, to the first virtual processor on the second computer node.
- In some examples, in response to the first request, the machine-readable instructions obtain metadata for the first data object from the first virtual processor on the first computer node, and in response to the second request, the machine-readable instructions obtain the metadata for the first data object from the first virtual processor on the second computer node.
- In some examples, different keys associated with respective data objects that share a same prefix map to a same partition, and the machine-readable instructions perform a range query for a range of keys at one or more virtual processors mapped to a partition associated with the range of keys.
- In some examples, in response to detecting a metadata hotspot at a given virtual processor, the machine-readable instructions migrate a first partition from the given virtual processor to another virtual processor, and update the partition map in response to the migration of the first partition.
-
FIG. 6 is a block diagram of asystem 600 that includes a cluster ofcomputer nodes 602 to execute a plurality ofvirtual processors 604. The cluster ofcomputer nodes 602 stores afirst partition map 606 that maps partitions of a first data bucket to a first subset of virtual processors, where portions of metadata for the first data bucket are divided across the partitions of the first data bucket. The cluster ofcomputer nodes 602 stores asecond partition map 608 that maps partitions of a second data bucket to a second subset of virtual processors, where portions of metadata for the second data bucket are divided across the partitions of the second data bucket. - The
system 600 includes a shared storage system 610 accessible by the cluster ofcomputer nodes 602 to store data of the first and second data buckets. - A first virtual processor VP1 in the cluster of
computer nodes 602 responds to arequest 612 to access a first data object in the first data bucket by identifying a first given partition of the partitions of the first data bucket that contains metadata for the first data object based on a first portion of afirst key 614 associated with the first data object. The first virtual processor VP1 identifies, based on the first given partition and using thefirst partition map 606, a virtual processor that has the metadata for the first data object. - A second virtual processor VP2 in the cluster of
computer nodes 602 responds to arequest 616 to access a second data object in the second data bucket by identifying a second given partition of the partitions of the second data bucket that contains metadata for the second data object based on a second portion of asecond key 618 associated with the second data object. The second portion of thesecond key 618 is of a different length than the first portion of thefirst key 614. The second virtual processor VP2 identifies, based on the second given partition and using thesecond partition map 608, a virtual processor that has the metadata for the second data object. - In some examples, the identifying of the first given partition of the partitions of the first data bucket that contains metadata for the first data object is based on applying a hash function on the first portion of the first key associated with the first data object, and the identifying of the second given partition of the partitions of the second data bucket that contains metadata for the second data object is based on applying the hash function on the second portion of the second key associated with the second data object.
-
FIG. 7 is a block diagram of aprocess 700 according to some examples, which may be performed by a system including a hardware processor. The system can include one computer or more than one computer. - The
process 700 includes receiving (at 702) a request to create a data bucket. Theprocess 700 includes creating (at 704), in response to the request to create the data bucket, a partition map that maps partitions of the data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket, where portions of metadata for the data bucket are divided across the partitions. - The
process 700 includes receiving (at 706) a request to access a data object in the data bucket. In response to the request to access the data object in the data bucket, theprocess 700 identifies (at 708) which partition of the partitions contains metadata for the data object based on applying a function on a key associated with the data object, and identifies (at 710), based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object. - Responsive to a migration of a first virtual processor of the virtual processors from a first computer node to a second computer node of the cluster of computer nodes, the
process 700 updates (at 712) a virtual processor-computer node map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes, where the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node. - A storage medium (e.g., 500 in
FIG. 5 ) can include any or some combination of the following: a semiconductor memory device such as a dynamic or static random access memory (a DRAM or SRAM), an erasable and programmable read-only memory (EPROM), an electrically erasable and programmable read-only memory (EEPROM) and flash memory; a magnetic disk such as a fixed, floppy and removable disk; another magnetic medium including tape; an optical medium such as a compact disk (CD) or a digital video disk (DVD); or another type of storage device. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution. - In the present disclosure, use of the term “a,” “an,” or “the” is intended to include the plural forms as well, unless the context clearly indicates otherwise. Also, the term “includes,” “including,” “comprises,” “comprising,” “have,” or “having” when used in this disclosure specifies the presence of the stated elements, but do not preclude the presence or addition of other elements.
- In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
Claims (20)
1. A non-transitory machine-readable storage medium comprising instructions that upon execution cause a processing system to:
create a partition map that maps partitions of a data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket, wherein portions of metadata for the data bucket are divided across the partitions;
responsive to a request to access a data object in the data bucket, identify which partition of the partitions contains metadata for the data object based on a key associated with the data object, and identify, based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object; and
responsive to a migration of a first virtual processor of the virtual processors from a first computer node to a second computer node of the cluster of computer nodes, update a virtual processor-computer node map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes,
wherein the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
2. The non-transitory machine-readable storage medium of claim 1 , wherein the migration of the first virtual processor from the first computer node to the second computer node causes migration of one or more portions of the metadata associated with the first virtual processor from the first computer node to the second computer node.
3. The non-transitory machine-readable storage medium of claim 2 , wherein the migration of the first virtual processor from the first computer node to the second computer node is performed without performing movement of data owned by the first virtual processor between computer nodes of the cluster of computer nodes.
4. The non-transitory machine-readable storage medium of claim 1 , wherein:
prior to the migration of the first virtual processor, a first request for a first data object associated with a first key maps, based on the partition map, to the first virtual processor on the first computer node, and
after the migration of the first virtual processor, a second request for the first data object associated with the first key maps, based on the partition map, to the first virtual processor on the second computer node.
5. The non-transitory machine-readable storage medium of claim 4 , wherein the instructions upon execution cause the processing system to:
in response to the first request, obtain metadata for the first data object from the first virtual processor on the first computer node; and
in response to the second request, obtain the metadata for the first data object from the first virtual processor on the second computer node.
6. The non-transitory machine-readable storage medium of claim 1 , wherein the instructions upon execution cause the processing system to:
apply a function on a first portion of the key associated with the data object to obtain a value; and
use the value to identify the partition that contains the metadata for the data object.
7. The non-transitory machine-readable storage medium of claim 6 , wherein the first portion of the key is a prefix of the key, the prefix of the key being less than an entirety of the key.
8. The non-transitory machine-readable storage medium of claim 7 , wherein the instructions upon execution cause the processing system to:
store configuration information indicating a length of the prefix.
9. The non-transitory machine-readable storage medium of claim 8 , wherein the configuration information indicates different lengths of prefixes for different data buckets.
10. The non-transitory machine-readable storage medium of claim 7 , wherein different keys associated with respective data objects that share a same prefix map to a same partition of the partitions, and the instructions upon execution cause the processing system to:
perform a range query for a range of keys at one or more virtual processors mapped to a partition associated with the range of keys.
11. The non-transitory machine-readable storage medium of claim 6 , wherein the data object is a first data object, the data bucket is a first data bucket, and the instructions upon execution cause the processing system to:
apply a function on a first portion of a key associated with a second data object in a second data bucket to obtain a second value, wherein a length of the first portion of the key associated with the second data object in the second data bucket is different from a length of the first portion of the key associated with the first data object in the first data bucket; and
use the second value to identify another partition that contains metadata for the second data object.
12. The non-transitory machine-readable storage medium of claim 1 , wherein the instructions upon execution cause the processing system to:
in response to detecting a metadata hotspot at a given virtual processor, migrate a first partition of the partitions from the given virtual processor to another virtual processor; and
update the partition map in response to the migration of the first partition.
13. The non-transitory machine-readable storage medium of claim 1 , wherein the instructions upon execution cause the processing system to:
create the partition map in response to generation or receipt of the data bucket to be stored in the shared storage system.
14. A system comprising:
a cluster of computer nodes to execute a plurality of virtual processors, wherein the cluster of computer nodes is to store:
a first partition map that maps partitions of a first data bucket to a first subset of virtual processors, wherein portions of metadata for the first data bucket are divided across the partitions of the first data bucket, and
a second partition map that maps partitions of a second data bucket to a second subset of virtual processors, wherein portions of metadata for the second data bucket are divided across the partitions of the second data bucket; and
a shared storage system accessible by the cluster of computer nodes to store data of the first and second data buckets,
wherein a first virtual processor in the cluster of computer nodes is to:
responsive to a request to access a first data object in the first data bucket, identify a first given partition of the partitions of the first data bucket that contains metadata for the first data object based on a first portion of a first key associated with the first data object, and
identify, based on the first given partition and using the first partition map, a virtual processor that has the metadata for the first data object,
wherein a second virtual processor in the cluster of computer nodes is to:
responsive to a request to access a second data object in the second data bucket, identify a second given partition of the partitions of the second data bucket that contains metadata for the second data object based on a second portion of a second key associated with the second data object, wherein the second portion of the second key is of a different length than the first portion of the first key, and
identify, based on the second given partition and using the second partition map, a virtual processor that has the metadata for the second data object.
15. The system of claim 14 , wherein a first computer node of the cluster of computer nodes is to:
migrate the first virtual processor from the first computer node to a second computer node of the cluster of computer nodes, and
update a virtual processor-computer node map that maps the virtual processors of the first subset of virtual processors to corresponding computer nodes of the cluster of computer nodes,
wherein the first partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
16. The system of claim 14 , wherein a computer node of the cluster of computer nodes is to:
detect a metadata hot-spot at the first virtual processor, and
in response to the metadata hot-spot, move at least one partition from the first virtual processor to another virtual processor.
17. The system of claim 14 , wherein the identifying of the first given partition of the partitions of the first data bucket that contains metadata for the first data object is based on applying a hash function on the first portion of the first key associated with the first data object, and
wherein the identifying of the second given partition of the partitions of the second data bucket that contains metadata for the second data object is based on applying the hash function on the second portion of the second key associated with the second data object.
18. The system of claim 17 , wherein the first portion of the first key is a first prefix of the first key, and the second portion of the second key is a second prefix of the second key.
19. A method of a system comprising a hardware processor, comprising:
receiving a request to create a data bucket;
in response to the request to create the data bucket, creating a partition map that maps partitions of the data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket, wherein portions of metadata for the data bucket are divided across the partitions;
receiving a request to access a data object in the data bucket;
in response to the request to access the data object in the data bucket, identifying which partition of the partitions contains metadata for the data object based on applying a function on a key associated with the data object, and identifying, based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object; and
responsive to a migration of a first virtual processor of the virtual processors from a first computer node to a second computer node of the cluster of computer nodes, updating a virtual processor-computer node map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes, wherein the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
20. The method of claim 19 , further comprising:
accessing configuration information in the cluster of computer nodes to determine a length of a portion of the key on which the function is to be applied.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/332,828 US20240411579A1 (en) | 2023-06-12 | 2023-06-12 | Metadata partitioning across virtual processors |
DE102024101412.1A DE102024101412A1 (en) | 2023-06-12 | 2024-01-18 | DISTRIBUTION OF METADATA AMONG VIRTUAL PROCESSORS |
CN202410327527.9A CN119127460A (en) | 2023-06-12 | 2024-03-21 | Metadata partitioning across virtual processors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/332,828 US20240411579A1 (en) | 2023-06-12 | 2023-06-12 | Metadata partitioning across virtual processors |
Publications (1)
Publication Number | Publication Date |
---|---|
US20240411579A1 true US20240411579A1 (en) | 2024-12-12 |
Family
ID=93567421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/332,828 Pending US20240411579A1 (en) | 2023-06-12 | 2023-06-12 | Metadata partitioning across virtual processors |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240411579A1 (en) |
CN (1) | CN119127460A (en) |
DE (1) | DE102024101412A1 (en) |
Also Published As
Publication number | Publication date |
---|---|
DE102024101412A1 (en) | 2024-12-12 |
CN119127460A (en) | 2024-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9575976B2 (en) | Methods and apparatuses to optimize updates in a file system based on birth time | |
US11562091B2 (en) | Low latency access to physical storage locations by implementing multiple levels of metadata | |
US20240004834A1 (en) | Directory structure for a distributed storage system | |
US8868926B2 (en) | Cryptographic hash database | |
US8751763B1 (en) | Low-overhead deduplication within a block-based data storage | |
US8489821B2 (en) | Managing concurrent accesses to a cache | |
US11630864B2 (en) | Vectorized queues for shortest-path graph searches | |
US8229968B2 (en) | Data caching for distributed execution computing | |
CN103581331B (en) | The online moving method of virtual machine and system | |
US11520759B2 (en) | Processing time series metrics data | |
US20120330907A1 (en) | Storage system for eliminating duplicated data | |
CN105027069A (en) | Deduplication of volume regions | |
US11093169B1 (en) | Lockless metadata binary tree access | |
US11210006B2 (en) | Distributed scalable storage | |
US11507553B2 (en) | Range lookup operations for Bε-trees using update messages | |
US11222070B2 (en) | Vectorized hash tables | |
US11221777B2 (en) | Storage system indexed using persistent metadata structures | |
EP4016312A1 (en) | Data operations using a cache table in a file system | |
CN114661232B (en) | Snapshot data reading method, device, system, equipment and storage medium | |
US20240411579A1 (en) | Metadata partitioning across virtual processors | |
JP6189266B2 (en) | Data processing apparatus, data processing method, and data processing program | |
US11874767B2 (en) | Memory partitions for processing entities | |
US12253974B2 (en) | Metadata processing method and apparatus, and a computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DEVADAS, VINAY;VARADAN, SRIKANT;CORSI, CHRISTOPHER JOSEPH;AND OTHERS;SIGNING DATES FROM 20230602 TO 20230611;REEL/FRAME:063919/0967 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |