WO2015024406A1 - Data file management method and device

Info

Publication number: WO2015024406A1
Application number: PCT/CN2014/079700
Authority: WO
Inventors: 罗成对; 张军
Original assignee: 华为技术有限公司
Priority date: 2013-08-23
Filing date: 2014-06-12
Publication date: 2015-02-26
Also published as: CN104424219B; CN104424219A

Abstract

Disclosed are a data file management method and device. The data file management method comprises: when an incremental data storage area reaches a first data file merging condition, merging record segments corresponding to each primary key in various data files in the incremental data storage area with found historical complete records corresponding to the primary keys respectively, so as to form a complete record of a merging time corresponding to each primary key; and writing the complete record of the merging time corresponding to each primary key into a newly-built data file of a complete data storage area, wherein the complete record of the merging time corresponding to each primary key is taken as an output result of accurately querying the primary keys in the complete data storage area. By means of the above-mentioned method, the present application enables the records of the primary keys to be centralized, thereby reducing the IO overheads for accurately querying the primary keys in the complete data storage area.

Description

Data file management method and device

Technical Field

The present invention relates to a method and apparatus for managing data files.

Background

Databases are divided into relational databases and non-relational databases (Not Only SQL, NoSQL). NoSQL is a general term for databases that differ from traditional relational databases. NoSQL data storage does not require a fixed table schema, and data is usually stored as key-value pairs. At present, most NoSQL data stores are based on the Log-Structured Merge-Tree (LSM-tree), a data structure and algorithm that defers updates and writes them to the hard disk in batches. The LSM-tree converts access to many small files into continuous, high-volume transfers, making most access to the file system sequential, thereby increasing disk bandwidth utilization and minimizing system access overhead. It is especially suitable for application environments that generate many insert operations. A NoSQL database based on the LSM-tree is therefore also called an incremental database.

The LSM-tree consists of at least two parts. One component resides in memory and is called the C0 tree (or C0); it can be any data structure that is convenient for key-value lookup. The other components reside on the hard disk and are called the C1 ... CK trees (or C1 ... CK); frequently accessed nodes of C1 ... CK are also cached in main memory.

An incremental database uses an incremental write mode: when the database adds or updates a record, the record is first put into an in-memory data structure (such as the main memory data table, Memory Table, Memtable), that is, the C0 tree. When this structure reaches a certain size, it is flushed to the hard disk as a small data file (such as a Sorted String Table, Sstable), that is, into the C1 ... CK trees, with its keys (Rowkey) arranged in order. Such a file cannot be modified afterwards. When querying, the records of a Rowkey must be read from these small data files and combined to form the complete record of that Rowkey.
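The incremental write mode described above can be illustrated with a minimal Python sketch. It is only an assumption-laden toy, not the storage engine of any real database: the class name `TinyLSM`, the `flush_threshold` parameter (counted here in distinct Rowkeys) and the use of in-memory dictionaries in place of on-disk Sstables are all hypothetical.

```python
from typing import Dict, List


class TinyLSM:
    """Toy incremental store: an in-memory table (C0) plus immutable sorted files."""

    def __init__(self, flush_threshold: int = 2) -> None:
        # Hypothetical limit: number of distinct Rowkeys held before a flush.
        self.flush_threshold = flush_threshold
        self.memtable: Dict[str, List[str]] = {}          # C0: rowkey -> segments
        self.sstables: List[Dict[str, List[str]]] = []    # C1..CK, oldest first

    def put(self, rowkey: str, segment: str) -> None:
        # Every insert or update first lands in the in-memory table.
        self.memtable.setdefault(rowkey, []).append(segment)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write the memtable out as an immutable file sorted by Rowkey.
        self.sstables.append({k: self.memtable[k] for k in sorted(self.memtable)})
        self.memtable = {}

    def get(self, rowkey: str) -> List[str]:
        # A complete record is assembled from every file holding a segment.
        segments: List[str] = []
        for table in self.sstables:
            segments.extend(table.get(rowkey, []))
        segments.extend(self.memtable.get(rowkey, []))
        return segments

    def files_touched(self, rowkey: str) -> int:
        # Number of flushed files that must be read for one exact query.
        return sum(1 for table in self.sstables if rowkey in table)
```

In this sketch, a Rowkey that is written before and after a flush ends up with segments in several files, so `get` has to visit each of them; `files_touched` makes that cost visible.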

In the incremental write mode, a complete Rowkey record may be composed of Rowkey record segments scattered across different data files. As a result, one exact query of a Rowkey consumes multiple storage input/output (IO) operations.

Summary of the invention

The technical problem to be solved by the present invention is to provide a data file management method and device that change a Rowkey from the discrete state of the incremental data storage area to the centralized state of the complete data storage area, and reduce the IO overhead of exact Rowkey queries in the complete data storage area.

A first aspect of the present application provides a data file management method, including: when the incremental data storage area reaches a first data file merge condition, merging the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key, to form a complete record of the merge time corresponding to each primary key; and writing the complete record of the merge time corresponding to each primary key into a newly created data file of the complete data storage area, wherein the complete record of the merge time corresponding to each primary key is used as the output result of an exact query on that primary key in the complete data storage area.

In conjunction with the first aspect, in a first possible implementation manner of the first aspect, the method further includes: writing a complete record of the merge time corresponding to each of the primary keys to a main memory.

With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the method further includes: when the complete data storage area reaches a second data file merge condition, merging the data files that are saved in the complete data storage area and contain the complete records of each merge time, and deleting the redundant records of each primary key in the complete data storage area.

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the step of merging the data files that are saved in the complete data storage area and contain the complete records of each merge time, and deleting the redundant records of each primary key in the complete data storage area, is specifically: using a merge algorithm to merge the data files that are saved in the complete data storage area and contain the complete records of each merge time, and deleting the redundant records of each primary key in the complete data storage area.

With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the step of using a merge algorithm to merge the data files that are saved in the complete data storage area and contain the complete records of each merge time, and deleting the redundant records of each primary key in the complete data storage area, comprises: finding, among the data files that are saved in the complete data storage area and contain the complete records of each merge time, the latest data file in which each primary key is located, where the latest data file is the data file with the latest formation time; and obtaining the complete record corresponding to each primary key from the latest data file in which that primary key is located, writing it into the merged data file of the complete data storage area, and deleting the data files of the complete data storage area that have been merged.

With reference to any one of the second to fourth possible implementation manners of the first aspect, in a fifth possible implementation manner of the first aspect, before the step of merging the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key to form a complete record of the merge time corresponding to each primary key, the method further includes: searching the data files of the main memory or the complete data storage area for the historical complete record corresponding to each primary key.

With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the step of searching the data files of the main memory or the complete data storage area for the historical complete record corresponding to each primary key includes: searching the data files in the main memory from new to old according to the formation time of the complete record corresponding to each primary key and, if nothing is found in the main memory, searching the data files of the complete data storage area, until a complete record corresponding to the primary key is found; the complete record found for the primary key is the historical complete record corresponding to the primary key.

In conjunction with the fifth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, when the historical complete record corresponding to a primary key is not found, the step of merging the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key to form a complete record of the merge time corresponding to each primary key is specifically: merging the record segments corresponding to the primary key in the data files of the incremental data storage area, and using the result as the complete record of the merge time corresponding to the primary key.

In conjunction with the first aspect, in an eighth possible implementation of the first aspect, the method further includes: deleting the data file of the incremental data storage area.

A second aspect of the present application provides a storage device that includes a first merge module and a write module, where: the first merge module is configured to, when the incremental data storage area reaches a first data file merge condition, merge the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key, to form a complete record of the merge time corresponding to each primary key, and output it to the write module; and the write module is configured to write the complete record of the merge time corresponding to each primary key into a newly created data file of the complete data storage area, where the complete record of the merge time corresponding to each primary key is used as the output result of an exact query on that primary key in the complete data storage area.

With reference to the second aspect, in a first possible implementation manner of the second aspect, the writing module is further configured to write a complete record of the merged moment corresponding to each of the primary keys into a main memory.

With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the device further includes a second merge module, where: the second merge module is configured to, when the complete data storage area reaches a second data file merge condition, merge the data files that are saved in the complete data storage area and contain the complete records of each merge time, and delete the redundant records of each primary key in the complete data storage area.

With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the second merge module includes a searching unit and a writing unit, where: the searching unit is configured to find, among the data files that are saved in the complete data storage area and contain the complete records of each merge time, the latest data file in which each primary key is located, where the latest data file is the data file with the latest formation time; and the writing unit is configured to obtain the complete record corresponding to each primary key from the latest data file in which that primary key is located, write it into the merged data file of the complete data storage area, and delete the data files of the complete data storage area that have been merged.

With reference to any one of the first to third possible implementation manners of the second aspect, in a fourth possible implementation manner of the second aspect, the device further includes a searching module, where: the searching module is configured to search the data files of the main memory or the complete data storage area for the historical complete record corresponding to each primary key, and output the found historical complete record corresponding to each primary key to the first merge module.

With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation manner of the second aspect, when the searching module does not find the historical complete record corresponding to the primary key, the first merge The module is configured to merge the record segments corresponding to the primary key in each data file in the incremental data storage area as a complete record of the merge time corresponding to the primary key.

The beneficial effects of the present invention are as follows. Unlike the prior art, the present application merges the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forms the complete record of the merge time corresponding to each Rowkey, and writes it into the complete data storage area. In this way, the data files of the incremental database are dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a centralized state in the complete data storage area, reducing the IO overhead of exact Rowkey queries in the complete data storage area.

Drawings

FIG. 1 is a schematic diagram of a hierarchical storage structure of the present application;

FIG. 2 is a flow chart of an embodiment of a method for managing a data file of the present application;

FIG. 3 is a flow chart of forming a complete record of the merge time corresponding to each primary key in one embodiment of the data file management method of the present application;

FIG. 4 is a flow chart of another embodiment of a method for managing a data file of the present application;

FIG. 5 is a flow chart, in one embodiment of the data file management method of the present application, of using a merge algorithm to merge the data files that are stored in the complete data storage area and contain the complete records of each merge time;

FIG. 6 is a schematic diagram of a storage structure of one embodiment of a data file management method of the present application;

FIG. 7 is a schematic diagram of a storage structure of another embodiment of a data file management method of the present application;

FIG. 8 illustrates another embodiment of the data file management method of the present application;

FIG. 9 is a schematic structural diagram of an embodiment of a storage device of the present application;

FIG. 10 is a schematic structural diagram of another embodiment of a storage device of the present application;

FIG. 11 is a schematic structural diagram of a second merge module in an embodiment of the storage device of the present application; and

FIG. 12 is a schematic structural diagram of still another embodiment of the storage device of the present application.

Detailed Description

Hard disk drives (HDDs) are widely used as storage media for storage systems such as databases. Hard disk-based databases typically use a two-tier storage structure of Main Memory + HDD. A data record is first written to main memory and then persisted to the hard disk under certain trigger conditions. However, the industry has long developed unevenly: the IO performance of main memory has improved greatly, while the IO performance of hard disks has grown slowly. As a result, the read and write performance of hard disk-based databases is severely limited by hard disk IO. The advent of solid-state drives (Solid State Disk, SSD) has brought considerable room for optimization to databases. An SSD has good read and write performance and is faster than an HDD. It is usually introduced into the storage system as a read/write cache of limited capacity, forming a multi-tier storage structure of Main Memory + SSD + HDD that exploits the advantages of each kind of hardware to seek a balance of performance, capacity, and price. Both SSD and HDD are non-volatile storage media.

In this application, a zero-level storage area, a primary storage area, and a secondary storage area are defined. The zero-level storage area refers to main memory. The primary storage area and the secondary storage area are two types of storage devices, where the primary storage area has better read and write performance than the secondary storage area but is relatively expensive; examples are a combination of main memory and SSD, of SSD and HDD, or of HDD and tape. The primary storage area and the secondary storage area can be understood as a combination of SSD and HDD, but the embodiments of the present application are not limited to this combination. In this application, the primary storage area is also referred to as the incremental data storage area, and the secondary storage area is also referred to as the complete data storage area.

Referring to FIG. 1, FIG. 1 is a schematic diagram of a hierarchical storage structure, in which A is a schematic diagram of a two-tier storage structure and B is a schematic diagram of a three-tier storage structure.

In a two-tier storage structure, the data flow direction is from a zero-level storage area to a primary storage area. The database storage engine receives data write requests (including inserts, updates, and deletes), and the data is first written to the data set in the zero-level storage area. The storage engine monitors the data set. When a certain trigger condition is reached, for example the data set size exceeds a certain threshold, the data set that satisfies the condition is flushed (flush) to a persistent data file on the secondary storage area. When the storage engine receives a data query (select) request, it retrieves (retrieve) the data record segments that meet the query condition from the data set in the zero-level storage area and from the persistent data files on the secondary storage area, respectively. The data record segments from the two storage areas are then spliced to form a complete data record as the result of the query.

In the three-tier storage structure, the data flow direction is from the zero-level storage area to the primary storage area, and then from the primary storage area to the secondary storage area. The database storage engine receives data write requests (including inserts, updates, and deletes), and the data is first written to the data set in the zero-level storage area. The storage engine monitors the data set. When a certain trigger condition is reached, for example the data set size exceeds a certain threshold, the data set that satisfies the condition is flushed to a persistent data file on the primary storage area. When the persistent data files on the primary storage area satisfy the set trigger condition, the data is transferred in a certain form to persistent data files on the secondary storage area. When the storage engine receives a data query (select) request, it retrieves the data record segments that meet the query condition from the data set in the zero-level storage area and from the persistent data files on the primary storage area and the secondary storage area, respectively. The data record segments from the three storage areas are then spliced to form a complete data record as the result of the query.

Existing incremental databases typically use an incremental write mode, so that a complete Rowkey record on the storage device may be composed of Rowkey record segments scattered across different data files. As a result, one exact Rowkey query consumes multiple storage IO operations.

In the prior art, a large number of data files are formed on the storage device, which leaves the records of a Rowkey scattered and hinders query operations. To address this technical problem, the present application provides a data file management method and device that dynamically manage the data files of an incremental database across the incremental data storage area and the complete data storage area, so that a Rowkey changes from the discrete state of the initial incremental data storage area to the centralized state of the complete data storage area, reducing the IO overhead of exact Rowkey queries in the complete data storage area.

The method and device for managing the data file of the present application are described in detail below with reference to the specific embodiments, but are not intended to limit the scope of the application.

Referring to FIG. 2, FIG. 2 is a flowchart of an implementation manner of a data file management method according to the present application. The data file management method in this embodiment includes:

Step S101: When the incremental data storage area reaches the first data file merge condition, the record segments corresponding to each primary key in the data files of the incremental data storage area are merged with the found historical complete record corresponding to that primary key, forming a complete record of the merge time corresponding to each primary key;

In the embodiment of the present application, the primary key (Rowkey) refers to the unique identifier of each sub-table schema in the nested structure supported by NoSQL. The following uses a blog as an example to illustrate a nested Schema and defines the Schema of the blog table (Feed_Table):

{ // The following defines the blog table (feed_table)
    user_id             // user id
    user_name           // username
    feed_id             // blog post id
    feed_posttime
    feed_content
    {
        comment_id          // comment id
        comment_posttime    // comment time
        comment_content     // comment content
    }
}

The schema of Feed_Table consists of three levels of child schema, which define user information (user_id, user_name), blog information (feed_id, feed_posttime, feed_content), and comment information (comment_id, comment_posttime, comment_content), with a nested affiliation between them. User information, blog information, and comment information each have a unique identifier, namely user_id, feed_id, and comment_id in Feed_Table, where user_id is called the primary key of Feed_Table, that is, the Rowkey.
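As an illustration only, the three-level nested schema can be written down as Python dataclasses. The field names follow the schema above; the class names `Comment`, `Feed` and `UserRow`, the use of strings for every field, and the `rowkey` property are assumptions made for this sketch and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    comment_id: str          # unique identifier of the comment level
    comment_posttime: str
    comment_content: str


@dataclass
class Feed:
    feed_id: str             # unique identifier of the blog level
    feed_posttime: str
    feed_content: str
    comments: List[Comment] = field(default_factory=list)


@dataclass
class UserRow:
    user_id: str             # unique identifier of the user level
    user_name: str
    feeds: List[Feed] = field(default_factory=list)

    @property
    def rowkey(self) -> str:
        # In Feed_Table the user id serves as the primary key (Rowkey).
        return self.user_id
```

A row for one user then nests that user's feeds, and each feed nests its comments, which mirrors the affiliation between the three levels.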

In the embodiment of the present application, the data in the data files is divided into incremental data and complete data, corresponding to the two storage areas. Incremental data is stored in the incremental data storage area and, for a given Rowkey, is the incremental data of that Rowkey. Complete data is stored in the complete data storage area and, for a given Rowkey, is the complete data of that Rowkey.

The user can preset the data merge condition of the incremental data storage area, that is, the first data file merge condition. For example, the data files of the incremental data storage area may be merged when a preset time arrives, when the amount of data in the incremental data storage area reaches a predetermined threshold, or whenever new incremental data appears in the incremental data storage area. As soon as the incremental data storage area reaches the first data file merge condition, the process of merging the data files of the incremental data storage area is performed.
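The three example conditions can be expressed as a single predicate. The following Python sketch is one hedged way to do so; the names `MergePolicy` and `should_merge`, the default values, and the choice to combine the conditions with a logical OR are assumptions for illustration, since the application only lists the conditions as alternatives.

```python
from dataclasses import dataclass


@dataclass
class MergePolicy:
    # All defaults are hypothetical values chosen for the example.
    interval_seconds: float = 3600.0      # merge when this much time has passed
    max_bytes: int = 64 * 1024 * 1024     # merge when the area holds this much data
    merge_on_new_data: bool = False       # merge whenever new incremental data appears


def should_merge(
    policy: MergePolicy,
    seconds_since_last_merge: float,
    incremental_bytes: int,
    has_new_data: bool,
) -> bool:
    """Return True when the first data file merge condition is reached."""
    return (
        seconds_since_last_merge >= policy.interval_seconds
        or incremental_bytes >= policy.max_bytes
        or (policy.merge_on_new_data and has_new_data)
    )
```

A storage engine could evaluate such a predicate after every flush to the incremental data storage area and start the merge when it returns True.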

When the data files of the incremental data storage area are merged, the historical complete record of the Rowkey in the complete data storage area takes part in the merge, and the complete record of the merge time corresponding to the Rowkey is obtained. The complete record of the merge time can also be understood as the latest complete record, that is, the complete record of the Rowkey produced by this merge. The Rowkey record remains complete until data files containing records of the Rowkey are next merged in the incremental data storage area. Each Rowkey record is given a scalar marking how new it is (such as a timestamp) when it is formed. The embodiment of the present application distinguishes the historical complete record from the complete record of the merge time. The historical complete record refers to the first record of the Rowkey found when searching the complete data storage area from new to old before the file merge starts; this first record contains all the records that the Rowkey had before the file merge. A Rowkey that is being inserted into the complete data storage area for the first time has no historical complete record. The complete record of the merge time means that, after the current file merge is completed, all the records corresponding to the Rowkey (including the previously merged records and the records merged this time) are in the data file written to the complete data storage area. The complete record of the merge time has a limited period of validity: it is valid only until records of the same Rowkey are next merged.

In the data files of the incremental data storage area, the data is arranged in Rowkey order. During the merge, the records of each Rowkey in the data files are merged with the historical complete record that was found, yielding the complete record of the merge time corresponding to each Rowkey. Here, the records of a Rowkey in the data files refer to all the record segments corresponding to that Rowkey.

Step S102: Write the complete record of the merge time corresponding to each primary key into a newly created data file of the complete data storage area, wherein the complete record of the merge time corresponding to each primary key is used as the output result of an exact query on that primary key in the complete data storage area;

The complete records of the merge time corresponding to each Rowkey obtained from the merge are written into the newly created data file of the complete data storage area. The newly created data file is the target data file generated in the complete data storage area by the merge, and it stores the complete record of the merge time corresponding to each Rowkey obtained by merging the data files of the incremental data storage area.

Because an exact query of a Rowkey in the complete data storage area proceeds through the data files in order of their generation time, once the merge is complete, any exact query of the Rowkey in the complete data storage area issued before the Rowkey's records are next merged returns the complete record of the merge time corresponding to that Rowkey.
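A minimal Python sketch of such an exact query follows. It assumes that files are visited from newest to oldest and that each data file can be modelled as a pair of a creation time and a dictionary from Rowkey to complete record; the names `DataFile` and `exact_query` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model of a data file: (creation time, rowkey -> complete record).
DataFile = Tuple[int, Dict[str, List[str]]]


def exact_query(full_store: List[DataFile], rowkey: str) -> Optional[List[str]]:
    """Return the newest complete record of rowkey, or None if it is absent."""
    # Visit the files of the complete data storage area from newest to oldest.
    for _, records in sorted(full_store, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            # The first hit is the complete record of the latest merge time,
            # so the search stops after reading a single file for this key.
            return records[rowkey]
    return None
```

Because every file in the complete data storage area holds a complete record rather than a fragment, the loop can stop at the first file that contains the Rowkey instead of collecting segments from all of them.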

The above merge process can also be called a vertical merge process. It is a file merge method that works across storage areas: it merges the record segments of a Rowkey and brings the Rowkey together in one place.

After the above merge process is completed, the data files of the incremental data storage area can be deleted to free storage space.

From the above description of the embodiment, it can be understood that the data file management method of the present application merges the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forms the complete record of the merge time corresponding to each Rowkey, and writes it to the complete data storage area. In this way, the data files of the incremental database are dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a centralized state in the complete data storage area, reducing the IO overhead of exact Rowkey queries in the complete data storage area.

For another embodiment of the data file management method of the present application, please refer to FIG. 3. FIG. 3 is a flowchart of forming the complete record of the merge time corresponding to each primary key. In this embodiment, forming the complete record of the merge time corresponding to each primary key includes the following sub-steps:

Sub-step S201: iterating over the data files of the incremental data storage area in primary-key order to obtain the record segments of each primary key, which form the incremental record of each primary key;

The n data files of the incremental data storage area are iterated in Rowkey order, and all the record segments of each Rowkey obtained from the n data files are used as the incremental record of that Rowkey.

Sub-step S202: searching for a historical complete record corresponding to each primary key from a data file of the main storage or the complete data storage area;

The historical complete record corresponding to each Rowkey is searched for in the data files of the main memory or the complete data storage area. Specifically, the data files in main memory are searched first and, if nothing is found there, the data files of the complete data storage area are searched. The search proceeds from new to old according to the formation time of each primary key's records until an existing record of the Rowkey is found. The record found has the latest timestamp among the existing records of the Rowkey and is the historical complete record of the Rowkey. This search is performed for each Rowkey.

Sub-step S203: determining whether the historical complete record corresponding to the primary key is found;

After the above search is performed for each Rowkey, it is determined whether a historical complete record corresponding to the Rowkey exists. For a Rowkey whose historical complete record is not found, sub-step S205 is performed; for a Rowkey whose historical complete record is found, sub-step S204 is performed.

Sub-step S204: Combining the incremental record of each primary key with the historical complete record corresponding to the found primary key to form a complete record of the combined time corresponding to each primary key;

For a Rowkey whose historical complete record is found, the found historical complete record of the Rowkey is merged with the incremental record of the Rowkey to form the complete record of the merge time corresponding to the Rowkey, that is, the latest complete record. This merge is performed for each Rowkey whose historical complete record is found, yielding the complete record of the merge time corresponding to each such Rowkey.

Sub-step S205: The incremental record of the primary key is taken as a complete record of the merge time corresponding to the primary key;

For a Rowkey that does not find a historical full record, the Rowkey's incremental record is used as the complete record of the Rowkey's merged time, and is written to the target data file of the complete data storage area.
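Sub-steps S201 to S205 can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation of the application: data files are modelled as dictionaries from Rowkey to a list of record segments, each list of files is ordered oldest first, merging is modelled as list concatenation, and the names `Records`, `find_history` and `vertical_merge` are hypothetical.

```python
from typing import Dict, List, Optional

Records = Dict[str, List[str]]  # hypothetical data file: rowkey -> record segments


def find_history(
    rowkey: str, main_memory: List[Records], full_store: List[Records]
) -> Optional[List[str]]:
    """S202: search main memory first, then the complete store, newest file first."""
    for files in (main_memory, full_store):
        for data_file in reversed(files):  # lists are assumed ordered oldest first
            if rowkey in data_file:
                return data_file[rowkey]
    return None


def vertical_merge(
    incremental_files: List[Records],
    main_memory: List[Records],
    full_store: List[Records],
) -> Records:
    # S201: gather every segment of each Rowkey as its incremental record.
    incremental: Records = {}
    for data_file in incremental_files:
        for rowkey, segments in data_file.items():
            incremental.setdefault(rowkey, []).extend(segments)

    new_file: Records = {}
    for rowkey in sorted(incremental):
        history = find_history(rowkey, main_memory, full_store)
        if history is not None:
            # S203 -> S204: history found, merge it with the incremental record.
            new_file[rowkey] = list(history) + incremental[rowkey]
        else:
            # S203 -> S205: no history, the incremental record is the complete record.
            new_file[rowkey] = list(incremental[rowkey])

    # Write the result as a newly created file of the complete data storage area.
    full_store.append(new_file)
    return new_file
```

After the call, the newest file in `full_store` holds one complete record per merged Rowkey, so a later exact query for any of those Rowkeys needs to read only that file.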

The following example specifies the vertical merge process. Please refer to the storage structure diagram shown in Figure 6, as shown in the figure:

Data file A and data file B in the incremental data storage area contain incremental feed data for User1, User2 and User3: data file A contains User1's feed3 and feed4 and User3's feed2 and feed3, and data file B contains User1's feed5 and User2's feed1. Here User1, User2 and User3 are the different Rowkeys mentioned above.

Data file 1 and data file 2 in the complete data storage area are data files generated by earlier vertical or horizontal merges. Data file 1 was generated at time point t1 and holds the complete records of User1 and User3 at time t1, that is, User1's feed1 and User3's feed1, which are the result of a vertical file merge or of a previous round of horizontal file merging. Data file 2 was generated at time point t2 and holds the complete record of User1 at time t2, that is, User1's feed1 and feed2, which are the result of a vertical file merge. Here t2 is later than t1. Data file 3 is a newly created data file that stores the output of the current vertical merge. The specific process of the vertical merge is as follows:

(1) At the beginning of the vertical merge, data file A and data file B of the incremental data storage area are iterated in Rowkey order, and the record fragments of each Rowkey iterated from data file A and data file B are used as the incremental record of that Rowkey. That is, User1's feed3, feed4 and feed5 are used as User1's incremental record, User2's feed1 is used as User2's incremental record, and User3's feed2 and feed3 are used as User3's incremental record;

(2) Find the historical complete record of each Rowkey in the main memory or in the data files of the complete data storage area: the main memory is searched first, and the complete data storage area is searched only if nothing is found there. The search proceeds from the newest formation time to the oldest until a record of the Rowkey is found. The record found carries the latest timestamp and is the historical complete record of the Rowkey. This embodiment assumes that no historical record of the Rowkey is found in the main memory. In the data files of the complete data storage area, the historical complete record of User1 is looked up first, and User1's feed1 and feed2 are found in data file 2; this is the historical complete record of User1. User2 is then looked up in the same way, but no historical complete record is found. Finally, the historical complete record of User3 is found, namely User3's feed1 in data file 1;

(3) Merge the found historical complete record of each Rowkey with the incremental record of that Rowkey to obtain the Rowkey's latest complete record, and write it to the new data file of the complete data storage area. User1's feed1 to feed5 are written to data file 3. For User2, which has no historical complete record, the incremental data feed1 is written directly to data file 3. User3's feed1, feed2 and feed3 are written to data file 3 as well. The records may of course also be written to the main memory at the same time;

(4) The vertical merge is completed, and the merged data file A and data file B of the incremental data storage area are deleted, and the process ends.
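
Steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the present application: data files are modeled as in-memory dictionaries, each data file of the complete data storage area is paired with a numeric timestamp, and the function name vertical_merge is a hypothetical name chosen for this sketch. The main memory lookup is omitted, matching the assumption of the example that nothing is found there.

```python
from collections import defaultdict

def vertical_merge(incremental_files, full_files):
    """Merge incremental data files into one new full data file.

    incremental_files: list of dicts, Rowkey -> list of record fragments.
    full_files: list of (timestamp, dict) pairs, Rowkey -> complete record.
    Returns the new full data file as a dict ordered by Rowkey.
    (Hypothetical in-memory model, not the patent's on-disk format.)
    """
    # Step (1): collect each Rowkey's fragments as its incremental record.
    increments = defaultdict(list)
    for data_file in incremental_files:
        for rowkey, fragments in data_file.items():
            increments[rowkey].extend(fragments)

    # Step (2): search the full data files from newest to oldest.
    newest_first = sorted(full_files, key=lambda f: f[0], reverse=True)

    new_file = {}
    for rowkey in sorted(increments):
        history = next(
            (records[rowkey] for _, records in newest_first if rowkey in records),
            [],  # no historical complete record: use the increment alone
        )
        # Step (3): history plus increment is the complete record at merge time.
        new_file[rowkey] = history + increments[rowkey]
    return new_file

# The example of Figure 6.
file_a = {"User1": ["feed3", "feed4"], "User3": ["feed2", "feed3"]}
file_b = {"User1": ["feed5"], "User2": ["feed1"]}
file_1 = (1, {"User1": ["feed1"], "User3": ["feed1"]})   # generated at t1
file_2 = (2, {"User1": ["feed1", "feed2"]})              # generated at t2
file_3 = vertical_merge([file_a, file_b], [file_1, file_2])
print(file_3)
```

Step (4), deleting data file A and data file B, is left to the caller in this sketch, for example by clearing the list of incremental files once file_3 has been stored.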

Referring to FIG. 4, FIG. 4 is a flowchart of another embodiment of a method for managing a data file according to the present application. The method for managing a data file in this embodiment includes the following steps:

Step S301: When the incremental data storage area reaches the first data file merge condition, the record segments corresponding to each primary key in each data file in the incremental data storage area are merged with the found historical complete record corresponding to that primary key, forming a complete record of the merge time corresponding to each primary key;

Step S302: Write the complete record of the merge time corresponding to each primary key into a newly created data file of the complete data storage area, wherein the complete record of the merge time corresponding to each primary key is used as the output result of accurately querying the primary key in the complete data storage area;

Step S303: Delete the data files of the incremental data storage area;

After the records of each Rowkey in the data files of the incremental data storage area have been merged and the resulting complete record of the merge time of each Rowkey has been written to the complete data storage area, the data files of the incremental data storage area are deleted, so that the incremental data storage area frees up space for the next incremental data to be written.

Step S304: When the complete data storage area reaches the second data file merge condition, merge the data files saved in the complete data storage area that contain the complete records of each merge time, and delete the redundant records of each primary key in the complete data storage area.

After the completion of the above-mentioned cross-storage data file merging, when the complete data storage area forms a complete record of the merge time, the historical full record becomes invalid and needs to be recycled to eliminate Rowkey redundant data. Therefore, the data file merging process inside the complete storage area is further performed. This process can also be called a horizontal data file merging process, which is a data file merging process inside the complete storage area. The goal is to eliminate redundant Rowkeys, discard invalid Rowkey records, and reclaim storage space.

In practice, the user may preset the data merge condition of the complete data storage area, that is, the second data file merge condition. For example, the data file merge process of the complete data storage area may be started when a predetermined time is reached, when the amount of data reaches a predetermined threshold, or after the data merge of the incremental data storage area is completed. As soon as the complete data storage area reaches the second data file merge condition, its data files are merged.

The merging of the data files in the complete data storage area that contain the complete records of each merge time may be implemented by any of several prior-art algorithms for removing data redundancy, such as a merging algorithm. In another embodiment of the data file management method, the merging algorithm is taken as an example. Referring to FIG. 5, FIG. 5 is a flowchart of merging, by the merging algorithm, the data files in the complete data storage area that contain the complete records of each merge time. In this embodiment, merging these data files includes the following sub-steps:

Sub-step S401: From the data files saved in the complete data storage area that contain the complete records of each merge time, find the latest data file in which each primary key is located, the latest data file being the data file with the latest formation time;

The data files of the complete data storage area that contain the complete records of each merge time are all the data files stored in the complete data storage area at the time of the merge. From these data files, the latest data file in which each Rowkey is located is found, that is, the data file with the latest formation time. Because each data file in the complete data storage area carries a recency scalar (such as a timestamp) when it is generated, the data file with the latest formation time holds the latest and most complete record of the Rowkey.

As a preferred implementation, before the search an iterator iterates over the data files of the complete data storage area, taken in the order in which they were generated, in Rowkey order, for example User1, User2, User3, and so on. The latest data file of each Rowkey is then found in Rowkey order. That is, first find the latest data file in which User1 is located, then the latest data file in which User2 is located, and so on.

Sub-step S402: Obtain the complete record corresponding to each primary key from the latest data file in which that primary key is located, write it to the merged data file of the complete data storage area, and delete the data files of the complete data storage area whose merge has completed;

The record segment corresponding to each Rowkey is obtained from the latest data file in which that Rowkey is located and written to the merged data file of the complete data storage area, and the data files of the complete data storage area whose merge has completed are then deleted. The merged data file is the target data file in which the complete data storage area stores the result of merging its internal data files.

The following example illustrates the internal merge process of the complete data storage area described above. Please refer to FIG. 7. FIG. 7 is a schematic diagram of a complete data storage area, in which data file 1 and data file 2 of the complete data storage area are the two data files to be merged. Data file 3 is the target file for the output of the horizontal merge, that is, the merged data file described above. Data file 1 was generated at time t1 and holds the complete records of User1 and User3 at time t1, that is, User1's feed1 and User3's feed1, which are the result of a vertical file merge or of a previous round of horizontal file merging. Here User1 and User3 are the different Rowkeys mentioned above. Data file 2 was generated at time t2 and holds the complete record of User1 at time t2, that is, User1's feed1 and feed2, which are the result of a vertical file merge. Here t2 is later than t1.

At the beginning of the merge, (1) the iterator iterates over data file 1 and data file 2, taken in order of file generation time, in Rowkey order and takes out Rowkey=User1; (2) among data file 1 and data file 2, it finds the file holding the latest complete record of Rowkey=User1, which is data file 2, while data file 1 holds a historical complete record; (3) it reads the latest complete record of Rowkey=User1 from data file 2, comprising feed1 and feed2, and copies feed1 and feed2 to data file 3. The above steps are repeated for Rowkey=User3, whose record exists only in data file 1: the record is read from data file 1 and written to data file 3. The horizontal data merge is then complete, and data file 1 and data file 2 are deleted.
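
The horizontal merge of FIG. 7 can be sketched in Python as follows. This is an illustrative sketch only: data files are modeled as (timestamp, dictionary) pairs held in memory, and the function name horizontal_merge is hypothetical. For each Rowkey only the record from the data file with the latest timestamp is kept, so older complete records are dropped as redundant.

```python
def horizontal_merge(full_files):
    """Merge the data files of the complete data storage area into one.

    full_files: list of (timestamp, dict) pairs, Rowkey -> complete record.
    Returns the merged data file as a dict ordered by Rowkey.
    (Hypothetical in-memory model of the merged data file.)
    """
    latest = {}  # Rowkey -> (timestamp, record) from the newest file seen
    for timestamp, records in full_files:
        for rowkey, record in records.items():
            if rowkey not in latest or timestamp > latest[rowkey][0]:
                latest[rowkey] = (timestamp, record)
    # Write the newest record of each Rowkey in Rowkey order.
    return {rowkey: latest[rowkey][1] for rowkey in sorted(latest)}

# The example of FIG. 7.
file_1 = (1, {"User1": ["feed1"], "User3": ["feed1"]})   # generated at t1
file_2 = (2, {"User1": ["feed1", "feed2"]})              # generated at t2
file_3 = horizontal_merge([file_1, file_2])
print(file_3)
```

After file_3 has been stored, data file 1 and data file 2 would be deleted by the caller; the sketch does not model deletion.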

Because of the hierarchical storage structure, Rowkey may be in the main memory, incremental data storage area and full data storage area. When querying a Rowkey, you must summarize the results from these three storage areas. The following example illustrates the Rowkey query process after using the management method of the above data file:

Please refer to FIG. 8. FIG. 8 is a schematic diagram of a storage structure in an implementation of the data file management method according to the present application. Suppose, for example, that the records of Rowkey=User1 are to be queried. In the figure, records of Rowkey=User1 are distributed over all three storage areas, and the query process is as follows: (1) first look up the records of Rowkey=User1 in the main memory, finding feed5; (2) data file 1 and data file 2 in the incremental data storage area hold records of Rowkey=User1, and feed3 and feed4 are found; (3) data file 1 and data file 2 in the complete data storage area both hold records of Rowkey=User1. According to a comparison of timestamps, the record of Rowkey=User1 in data file 2 is the latest and most complete, so only feed1 and feed2 are read from data file 2, and data file 1 is ignored; (4) summarize and return the query results. In the above query process, the exact lookup of the Rowkey in the complete data storage area clearly needs to be performed only once.
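
The query process of FIG. 8 can be sketched in Python as follows. The sketch rests on assumptions not stated in the text: the three storage areas are modeled as in-memory dictionaries, feed3 and feed4 are assumed to sit in incremental data file 1 and data file 2 respectively, the function name query is hypothetical, and the results are assembled from oldest to newest purely for readability. Only the newest data file of the complete data storage area that contains the Rowkey is read.

```python
def query(rowkey, main_memory, incremental_files, full_files):
    """Return all records of a Rowkey across the three storage areas.

    main_memory: dict, Rowkey -> list of records.
    incremental_files: list of dicts, Rowkey -> list of record fragments.
    full_files: list of (timestamp, dict) pairs, Rowkey -> complete record.
    (Hypothetical in-memory model; ordering of the result is an assumption.)
    """
    result = []
    # Complete data storage area: read only the newest file holding the Rowkey.
    for _, records in sorted(full_files, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            result.extend(records[rowkey])
            break  # older files hold only redundant history
    # Incremental data storage area: every file may hold a fragment.
    for data_file in incremental_files:
        result.extend(data_file.get(rowkey, []))
    # Main memory holds the most recent writes.
    result.extend(main_memory.get(rowkey, []))
    return result

memory = {"User1": ["feed5"]}
incremental = [{"User1": ["feed3"]}, {"User1": ["feed4"]}]
full = [(1, {"User1": ["feed1"], "User3": ["feed1"]}),
        (2, {"User1": ["feed1", "feed2"]})]
print(query("User1", memory, incremental, full))
```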

Through the description of the foregoing embodiments, the data file management method of the present application divides the data files into incremental data and complete data, stores them hierarchically, and merges them in stages. This solves the problem of an exact Rowkey query requiring multiple lookups in the complete data storage area, so that an exact lookup of a Rowkey in the complete data storage area takes only one IO.

Please refer to FIG. 9. FIG. 9 is a schematic structural diagram of an embodiment of a storage device of the present application. The storage device 100 includes a first merge module 11 and a write module 12, wherein: the first merge module 11 is configured to, when the incremental data storage area reaches the first data file merge condition, merge the record segments corresponding to each Rowkey in each data file in the incremental data storage area with the found historical complete record corresponding to that Rowkey, form a complete record of the merge time corresponding to each Rowkey, and output it to the write module 12;

The user can preset the data merge condition of the incremental data storage area, that is, the first data file merge condition. For example, the data files of the incremental data storage area may be merged when a predetermined time is reached, when the amount of data in the incremental data storage area reaches a predetermined threshold, or whenever new incremental data appears in the incremental data storage area. As soon as the incremental data storage area reaches the first data file merge condition, the data file merge process of the incremental data storage area is performed.

When merging the data files of the incremental storage area, the first merge module 11 brings the Rowkey's records in the complete data storage area into the merge, and the merge yields a complete record of the merge time corresponding to the Rowkey. This complete record of the merge time can also be understood as the latest complete record, that is, the complete record corresponding to the Rowkey obtained after the merge. In other words, the Rowkey's record is complete until the next merge of data files containing records of that Rowkey. Each Rowkey record is formed with a recency scalar (such as a timestamp).

In the embodiments of the present application, a historical complete record is distinguished from a complete record of the merge time. The historical complete record is the first record of the Rowkey found when searching the complete data storage area from newest to oldest before the file merge starts; this first record holds all records that the Rowkey had before the file merge. A Rowkey being inserted into the complete data storage area for the first time has no historical complete record. The complete record of the merge time means all records of the Rowkey (both previously merged and newly merged) written to the newly created data file of the complete data storage area once the current file merge is completed. In the data files of the incremental data storage area, data is arranged in Rowkey order. During the merge, all records of each Rowkey in these data files are merged with the historical complete record found for it, yielding the complete record of the merge time for each Rowkey.

The writing module 12 is configured to write the complete record of the merge time corresponding to each Rowkey into a newly created data file in the complete data storage area. Until the records of that Rowkey are next merged, the complete record of the merge time corresponding to each Rowkey serves as the output result of accurately querying the Rowkey in the complete data storage area.

The write module 12 writes the complete records of the merge time corresponding to each Rowkey, obtained from the merge, to the newly created data file of the complete data storage area. This newly created data file is the target data file generated in the complete data storage area by the merge, and it stores the complete record of the merge time corresponding to each Rowkey in the data files of the incremental data storage area.

After the above merge process is completed, the write module 12 can delete the corresponding data file of the incremental data storage area to release the storage space.

Referring to FIG. 10, FIG. 10 is a schematic structural diagram of another embodiment of a storage device of the present application. The storage device 200 of the present embodiment includes a first merge module 21, a write module 22, a second merge module 23, and a lookup module 24, wherein:

The first merging module 21 is configured to, when the incremental data storage area reaches the first data file merging condition, merge the record segments corresponding to each Rowkey in each data file in the incremental data storage area with the found historical complete record corresponding to that Rowkey, form a complete record of the merge time corresponding to each Rowkey, and output it to the write module 22;

The writing module 22 is configured to write the complete record of the merge time corresponding to each Rowkey into a newly created data file in the complete data storage area. Until the records of that Rowkey are next merged, the complete record of the merge time corresponding to each Rowkey serves as the output result of accurately querying the Rowkey in the complete data storage area.

The second merging module 23 is configured to, when the complete data storage area reaches the second data file merging condition, merge the data files saved in the complete data storage area that contain the complete records of each merge time, and delete the redundant records of each Rowkey in the complete data storage area.

After the above cross-storage data file merging is completed and the complete data storage area holds the complete record of the merge time of each Rowkey, the historical complete record of the Rowkey becomes invalid and needs to be reclaimed to eliminate redundant Rowkey data. Therefore, the second merge module 23 further performs a data file merge process inside the complete storage area. This process may also be called a horizontal data file merge process, that is, a data file merge process inside the complete storage area. The goal is to eliminate redundant Rowkeys, discard invalid Rowkey records, and reclaim storage space.

The second merging module 23 may merge the data files saved in the complete data storage area that contain the complete records of each merge time by any of several prior-art algorithms for removing data redundancy, such as a merging algorithm.

The lookup module 24 is configured to look up the historical complete record corresponding to each Rowkey in the main memory or in the data files of the complete data storage area, and to output the found historical complete record corresponding to the Rowkey to the first merge module 21;

Before the merge, the searching module 24 searches for the historical complete record corresponding to each Rowkey in the main memory or in the data files of the complete data storage area. Specifically, the main memory is searched first, and if nothing is found there, the data files of the complete data storage area are searched. During the search, the data files are examined from newest to oldest until a record of the Rowkey is found. The Rowkey record found carries the latest timestamp and is the historical complete record of the Rowkey. The lookup module 24 performs this search for each Rowkey.
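
The search order described for the lookup module 24 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the main memory and the data files are modeled as dictionaries, each data file carries a numeric timestamp, and the function name find_history is hypothetical. A return value of None stands for the case in which no historical complete record exists.

```python
def find_history(rowkey, main_memory, full_files):
    """Look up the historical complete record of a Rowkey.

    main_memory: dict, Rowkey -> complete record.
    full_files: list of (timestamp, dict) pairs, Rowkey -> complete record.
    Returns the record, or None if the Rowkey has no history.
    (Hypothetical in-memory model of the lookup.)
    """
    # The main memory is searched first.
    if rowkey in main_memory:
        return main_memory[rowkey]
    # Otherwise the full data files are searched from newest to oldest;
    # the first hit carries the latest timestamp.
    for _, records in sorted(full_files, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            return records[rowkey]
    return None

full = [(1, {"User1": ["feed1"], "User3": ["feed1"]}),
        (2, {"User1": ["feed1", "feed2"]})]
print(find_history("User1", {}, full))
print(find_history("User2", {}, full))
```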

When the search module 24 does not find a historical complete record corresponding to a Rowkey, the first merge module 21 is configured to merge the record segments corresponding to that Rowkey in the data files of the incremental data storage area and use the result as the complete record of the merge time corresponding to the Rowkey.

Referring to FIG. 11, the second merging module 23 in this embodiment further includes a searching unit 111 and a writing unit 112, wherein:

The searching unit 111 is configured to search, among the data files saved in the complete data storage area that contain the complete records of each merge time, for the latest data file in which each Rowkey is located, and to output the latest data file to the writing unit 112. The latest data file refers to the data file with the latest formation time;

The data files saved in the complete data storage area that contain the complete records of each merge time are all the data files in the complete data storage area at the time of the merge. The searching unit 111 finds the latest data file of each Rowkey among these data files, the latest data file being the data file with the latest formation time. Because each data file of the complete data storage area carries a recency scalar (such as a timestamp) when it is generated, the latest data file holds the latest and most complete record of the Rowkey.

As a preferred embodiment, before searching, the searching unit 111 iterates over the data files of the complete data storage area, taken in the order in which they were generated, in Rowkey order, for example User1, User2, User3, and so on. The latest data file of each Rowkey is then found in Rowkey order. That is, first find the latest data file in which User1 is located, then the latest data file in which User2 is located, and so on.

The writing unit 112 is configured to obtain the complete record corresponding to each Rowkey from the latest data file in which that Rowkey is located, write it to the merged data file of the complete data storage area, and delete the data files of the complete data storage area that contain the complete records of each merge time and whose merge has completed.

The writing unit 112 obtains the record segment corresponding to each Rowkey from the latest data file in which that Rowkey is located, writes it to the merged data file of the complete data storage area, and then deletes the data files of the complete data storage area whose merge has completed. The merged data file is the target file in which the complete data storage area stores the result of merging its internal data files.

Referring to FIG. 12, FIG. 12 is a schematic structural diagram of still another embodiment of a storage device of the present application. The storage device 300 of the present embodiment includes a processor 31, an interaction interface 32, a random access memory 33, a read only memory 34, a bus 35, and a network interface unit 36. The processor 31 is coupled through the bus 35 to the interaction interface 32, the random access memory 33, the read only memory 34, and the network interface unit 36, respectively. When the storage device 300 needs to be run, it is started by the basic input/output system solidified in the read-only memory 34 or by the boot loader of an embedded system, which boots the storage device 300 into a normal operating state. After the storage device 300 enters the normal operating state, the application program and the operating system run in the random access memory 33, and data is received from the network or transmitted to the network through the network interface unit 36, so that:

The interaction interface 32 is a device interface for human-computer interaction, and is configured to receive an operation instruction of the user, and may be a USB interface, a display interface, or the like;

When the incremental data storage area reaches the first data file merge condition, the processor 31 receives, through the interaction interface 32, the user's operation instruction to merge the data files of the incremental data storage area, merges the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey to form a complete record of the merge time corresponding to each Rowkey, and writes the complete record of the merge time corresponding to each Rowkey into a newly created data file of the complete data storage area. Until the records of that Rowkey are next merged, the complete record of the merge time corresponding to each Rowkey serves as the output result of accurately querying the Rowkey in the complete data storage area;

In addition, according to the user's operation instruction to merge the data of the complete data storage area, the processor 31 merges the data files saved in the complete data storage area that contain the complete records of each merge time, and deletes the redundant records of each Rowkey in the complete data storage area.

In this embodiment, the processor 31 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.

In the present embodiment, the incremental data storage area and the complete data storage area described above may correspond to the random access memory 33 and the read only memory 34 of the storage device 300 of the present embodiment, respectively.

From the above embodiments it can be understood that the data file management method and apparatus of the present application merge the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, form the complete record of the merge time corresponding to each Rowkey, and write it to the complete data storage area. In this way, the data files of the incremental database are dynamically managed across the incremental data storage area and the complete data storage area, so that the records of each Rowkey are stored centrally in the complete data storage area, which reduces the IO overhead of accurately querying a Rowkey in the complete data storage area.

In addition, regular internal file consolidation of data files in the full datastore eliminates invalid records, reduces Rowkey redundancy and dispersion, improves Rowkey query performance, and efficiently reclaims storage space.

In the several embodiments provided herein, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device implementations described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be electrical, mechanical, or of another form. The components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the present embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software function unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the various embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above description is only an embodiment of the present application and does not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

1. A method for managing a data file, comprising:

when an incremental data storage area reaches a first data file merging condition, merging the record segments corresponding to each primary key in each data file in the incremental data storage area with the found historical complete record corresponding to that primary key, to form a complete record of the merge time corresponding to each of the primary keys; and writing the complete record of the merge time corresponding to each primary key into a newly created data file of a complete data storage area, wherein the complete record of the merge time corresponding to each primary key is used as an output result of accurately querying the primary key in the complete data storage area.

2. The method according to claim 1, wherein the method further comprises: writing the complete record of the merge time corresponding to each of the primary keys to the main memory.

3. The method according to claim 1 or 2, wherein

the method further comprises: when the complete data storage area reaches a second data file merging condition, merging the data files saved in the complete data storage area that contain the complete records of each merge time, and deleting the redundant records of each of the primary keys in the complete data storage area.

4. The method according to claim 3, wherein merging the data files saved in the complete data storage area that contain the complete records of each merge time and deleting the redundant records of each of the primary keys in the complete data storage area is specifically:

merging, by a merging algorithm, the data files saved in the complete data storage area that contain the complete records of each merge time, so as to delete the redundant records of each of the primary keys in the complete data storage area.

5. The method according to claim 4, wherein

merging, by the merging algorithm, the data files saved in the complete data storage area that contain the complete records of each merge time, and deleting the redundant records of each of the primary keys in the complete data storage area, comprises:

finding, from the data files saved in the complete data storage area that contain the complete records of each merge time, the latest data file in which each of the primary keys is located, the latest data file being the data file with the latest formation time; and

obtaining the complete record corresponding to each of the primary keys from the latest data file in which that primary key is located, writing it to the merged data file of the complete data storage area, and deleting the data files of the complete data storage area whose merge has completed.

6. The method according to any one of claims 2 to 5, wherein

before the step of merging the record segments corresponding to each primary key in each data file in the incremental data storage area with the found historical complete record corresponding to that primary key to form the complete record of the merge time corresponding to each primary key, the method further comprises:

looking up the historical complete record corresponding to each of the primary keys in the main memory or in the data files of the complete data storage area.

7. The method according to claim 6, wherein

the step of looking up the historical complete record corresponding to each of the primary keys in the main memory or in the data files of the complete data storage area comprises:

searching the data files in the main memory from newest to oldest according to the formation time of the complete record corresponding to each of the primary keys and, if the record is not retrieved in the main memory, searching the data files of the complete data storage area until the complete record corresponding to the primary key is retrieved, the retrieved complete record of the primary key being the historical complete record corresponding to the primary key.

8. The method of claim 6 wherein:

When the historical complete record corresponding to a primary key is not found, the step of merging the record segments corresponding to each primary key in each data file in the incremental data storage area with the found historical complete records corresponding to the primary keys respectively, to form a complete record of the merge time corresponding to each of the primary keys, is specifically:

Merging the record segments corresponding to the primary key in each data file in the incremental data storage area, and using the result as the complete record of the merge time corresponding to the primary key.
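The merge of record segments with an optional historical complete record can be sketched as follows. This is a minimal sketch under assumptions not stated in the claims: a record segment is modeled as a hypothetical `(write_time, fields)` pair, fields are merged one by one, and a newer value for a field overrides an older one. The function name `merge_key` is invented for this example.

```python
def merge_key(segments, historical=None):
    """Build the complete record of one primary key at merge time.

    segments: list of (write_time, fields) record segments taken from the data
    files of the incremental data storage area (assumed layout).
    historical: the found historical complete record, or None if none was found.
    """
    # Start from the historical complete record when one exists; otherwise the
    # record segments alone form the complete record.
    complete = dict(historical) if historical is not None else {}
    # Apply segments oldest first so that newer field values win.
    for _, fields in sorted(segments, key=lambda s: s[0]):
        complete.update(fields)
    return complete


segments = [(2, {"age": 31}), (1, {"name": "Li", "age": 30})]
print(merge_key(segments, {"city": "Shenzhen", "age": 29}))
print(merge_key(segments))
```

The second call corresponds to the case in which no historical complete record is found: the result is built from the record segments only.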

9. The method according to any one of claims 1 to 8, wherein the method further comprises: deleting the data files of the incremental data storage area.

10. A storage device, comprising a first merge module and a write module, wherein: the first merge module is configured to, when the incremental data storage area reaches a first data file merge condition, merge the record segments corresponding to each primary key in each data file in the incremental data storage area with the found historical complete records corresponding to the primary keys respectively, to form a complete record of the merge time corresponding to each primary key, and to output the complete record to the write module;

The write module is configured to write the complete record of the merge time corresponding to each of the primary keys into a newly created data file of the complete data storage area, wherein the complete record of the merge time corresponding to each primary key serves as the output result of an exact query of the primary key in the complete data storage area.

11. The device according to claim 10, wherein the write module is further configured to write the complete record of the merge time corresponding to each of the primary keys into the main memory.

12. The device according to claim 10 or 11, wherein the device further comprises a second merging module, wherein:

The second merging module is configured to, when the complete data storage area reaches a second data file merging condition, merge the data files saved in the complete data storage area that contain the complete records of each merge time, and delete the redundant records of each of the primary keys in the complete data storage area.

13. The device according to claim 12, wherein the second merging module comprises a searching unit and a writing unit, wherein:

The searching unit is configured to find, from the data files saved in the complete data storage area that contain the complete records of each merge time, the latest data file in which each of the primary keys is located, wherein the latest data file refers to the data file with the latest formation time;

The writing unit is configured to obtain the complete record corresponding to each of the primary keys from the latest data file in which that primary key is located, write it into the merged data file of the complete data storage area, and delete the data files of the complete data storage area that have been merged.

14. The device according to any one of claims 1 to 13, wherein the device further comprises a lookup module, wherein:

The lookup module is configured to look up a historical complete record corresponding to each of the primary keys from the data files of the main memory or of the complete data storage area, and to output the found historical complete record corresponding to each of the primary keys to the first merge module.

15. The device according to claim 14, wherein:

When the lookup module does not find the historical complete record corresponding to a primary key, the first merge module is configured to merge the record segments corresponding to the primary key in each data file in the incremental data storage area, and to use the result as the complete record of the merge time corresponding to the primary key.