WO2015024406A1 - Data file management method and device - Google Patents
Data file management method and device
- Publication number
- WO2015024406A1 (PCT/CN2014/079700)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- complete
- storage area
- data file
- record
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1744—Redundancy elimination performed by the file system using compression, e.g. sparse files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
Definitions
- the present invention relates to a method and apparatus for managing data files.
- NoSQL is a general term for databases that differ from traditional relational databases.
- NoSQL data storage does not require a fixed table schema; data is usually stored as key-value pairs.
- LSM-tree: Log-Structured Merge-Tree.
- The LSM-tree converts access to many small files into continuous, high-volume transfers, making most file system access sequential, which increases disk bandwidth utilization and minimizes access overhead. It is especially suitable for applications that generate many insert operations. A NoSQL database based on the LSM-tree is therefore also called an incremental database.
- the LSM-tree consists of at least two parts.
- a component resident in memory, called the C0 tree (or C0), which can be any data structure convenient for key-value lookup.
- Other components are resident in the hard disk, called C1...
- An incremental database uses an incremental write mode: when the database adds or updates records, they are first put into an in-memory data structure (such as the main memory data table, Memory Table, Memtable), that is, the C0 tree. When this structure reaches a certain size, it is flushed to the hard disk data structure, that is, the C1 ... CK trees, as a small data file (such as a Sorted String Table, Sstable) whose internal keys (Rowkeys) are arranged in order. Such a file can no longer be modified. A query must read the Rowkey's record fragments from these small data files and assemble them into a complete Rowkey record.
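The incremental write mode described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `TinyLSM`, the `memtable_limit` threshold (counted in Rowkeys), and the choice to count one read per data file are all assumptions made for the example.

```python
class TinyLSM:
    """Toy incremental store: an in-memory table (C0) plus immutable sorted files."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # C0: Rowkey -> list of record fragments
        self.sstables = []   # stand-ins for C1 ... CK, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def put(self, rowkey, fragment):
        # Adds and updates always go to memory first.
        self.memtable.setdefault(rowkey, []).append(fragment)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as an immutable file sorted by Rowkey.
        sstable = tuple((k, tuple(v)) for k, v in sorted(self.memtable.items()))
        self.sstables.append(sstable)
        self.memtable = {}

    def get(self, rowkey):
        # A query has to collect fragments from every file, then from memory.
        fragments, files_read = [], 0
        for sstable in self.sstables:
            files_read += 1
            for key, frags in sstable:
                if key == rowkey:
                    fragments.extend(frags)
        fragments.extend(self.memtable.get(rowkey, []))
        return fragments, files_read


db = TinyLSM(memtable_limit=2)
for rowkey, fragment in [("User1", "feed1"), ("User3", "feed1"),
                         ("User1", "feed2"), ("User2", "feed1"),
                         ("User1", "feed3")]:
    db.put(rowkey, fragment)
record, files_read = db.get("User1")
```

In this run the record of `User1` is spread over two flushed files and the memtable, so assembling it touches every file. That scattering is the cost the rest of the document sets out to reduce.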
- Memtable: the in-memory data structure (main memory data table, Memory Table).
- Sstable: Sorted String Table.
- a complete Rowkey record may be composed of Rowkey record segments scattered across different data files.
- an exact Rowkey query therefore requires multiple storage input/output (IO) operations.
- the technical problem to be solved by the present invention is to provide a data file management method and device that change a Rowkey's records from the discrete state of the incremental data storage area to the centralized state of the complete data storage area, reducing the IO overhead of an exact Rowkey query in the complete data storage area.
- the first aspect of the present application provides a data file management method, including: when the incremental data storage area reaches a first data file merge condition, merging the record segments corresponding to each primary key in each data file in the incremental data storage area with the historical complete record found for that primary key, forming a complete record at the merge time for each primary key; and writing the complete record at the merge time for each primary key to a newly created data file of the complete data storage area, where the complete record at the merge time for each primary key is used as the output result of an exact query for that primary key in the complete data storage area.
- the method further includes: writing a complete record of the merge time corresponding to each of the primary keys to a main memory.
- the method further includes: when the complete data storage area reaches a second data file merge condition, merging the data files stored in the complete data storage area that contain the complete records of each merge time, and deleting the redundant records of each primary key in the complete data storage area.
- merging these data files and deleting the redundant records specifically includes: using a merge algorithm to merge the data files stored in the complete data storage area that contain the complete records of each merge time, and deleting the redundant records of each primary key in the complete data storage area.
- using the merge algorithm includes: finding, among the data files stored in the complete data storage area that contain the complete records of each merge time, the latest data file in which each primary key is located, the latest data file being the data file with the latest formation time; obtaining the complete record corresponding to each primary key from that latest data file and writing it to the merged data file of the complete data storage area; and deleting the data files of the complete data storage area whose merge is complete.
- the method further includes: searching the main memory or the data files of the complete data storage area for the historical complete record corresponding to each primary key.
- this search includes: searching the main memory from new to old, according to the formation time of the complete records corresponding to each primary key; if the record is not found in the main memory, searching the data files of the complete data storage area until the complete record corresponding to the primary key is retrieved. The retrieved complete record is the historical complete record corresponding to the primary key.
- when the historical complete record corresponding to a primary key is not found, merging the record segments with the historical complete records to form the complete record at the merge time specifically means: merging the record segments corresponding to that primary key in each data file in the incremental data storage area into the complete record at the merge time for that primary key.
- the method further includes: deleting the data file of the incremental data storage area.
- a second aspect of the present application provides a storage device including a first merge module and a write module, where: the first merge module is configured to, when an incremental data storage area reaches a first data file merge condition, merge the record segments corresponding to each primary key in each data file in the incremental data storage area with the historical complete record found for that primary key, form a complete record at the merge time for each primary key, and output it to the write module; the write module is configured to write the complete record at the merge time for each primary key into a newly created data file of the complete data storage area, where the complete record at the merge time for each primary key is used as the output result of an exact query for that primary key in the complete data storage area.
- the write module is further configured to write the complete record at the merge time for each primary key into a main memory.
- the device further includes a second merge module, configured to, when the complete data storage area reaches the second data file merge condition, merge the data files stored in the complete data storage area that contain the complete records of each merge time, and delete the redundant records of each primary key in the complete data storage area.
- the second merge module includes a searching unit and a writing unit, where: the searching unit is configured to find, among the data files stored in the complete data storage area that contain the complete records of each merge time, the latest data file in which each primary key is located, the latest data file being the data file with the latest formation time; the writing unit is configured to obtain the complete record corresponding to each primary key from the latest data file in which that primary key is located, write it to the merged data file of the complete data storage area, and delete the data files of the complete data storage area whose merge is complete.
- the device further includes a searching module, configured to search the main memory or the data files of the complete data storage area for the historical complete record corresponding to each primary key, and to output each historical complete record found to the first merge module.
- when the historical complete record corresponding to a primary key is not found, the first merge module is configured to merge the record segments corresponding to that primary key in each data file in the incremental data storage area into the complete record at the merge time for that primary key.
- the present application merges the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey, forms the complete record at the merge time for each Rowkey, and writes it to the complete data storage area. The data files of the incremental database are thus dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a centralized state in the complete data storage area, reducing the IO overhead of an exact Rowkey query in the complete data storage area.
- FIG. 1 is a schematic diagram of a hierarchical storage structure of the present application.
- FIG. 2 is a flow chart of an embodiment of a method for managing a data file of the present application
- FIG. 3 is a flow chart showing a complete record of a merge time corresponding to each primary key in one embodiment of the data file management method of the present application;
- FIG. 4 is a flow chart of another embodiment of a method for managing a data file of the present application.
- FIG. 5 is a flow chart of a merge algorithm merging the data files, stored in the complete data storage area, that contain the complete records of each merge time
- FIG. 6 is a schematic diagram of a storage structure of one embodiment of a data file management method of the present application
- FIG. 7 is a schematic diagram of a storage structure of another embodiment of a data file management method of the present application
- FIG. 8 is another management method of the data file of the present application.
- FIG. 9 is a schematic structural diagram of an embodiment of a storage device of the present application;
- FIG. 10 is a schematic structural diagram of another embodiment of a storage device of the present application.
- HDDs are widely used as storage media for storage systems, such as databases.
- Hard disk-based databases typically use a two-tier storage structure of Main Memory + HDD. A data record is first written to the main memory and then persisted to the hard disk under certain trigger conditions.
- the industry has developed unevenly: the IO performance of main memory has improved greatly, while the IO performance of hard disks has grown slowly. As a result, the read and write performance of hard disk-based databases is severely limited by hard disk IO.
- SSD solid-state drives
- a zero-level storage area refers to main storage
- a primary storage area and a secondary storage area are two types of storage devices, where the primary storage area has better read and write performance than the secondary storage area but is relatively more expensive, for example the combinations of main memory and SSD, SSD and HDD, or HDD and tape.
- the primary storage area and the secondary storage area can be understood as a combination of SSD and HDD, but are not limited to this combination in the embodiment of the present application.
- a primary storage area is also referred to as an incremental data storage area
- a secondary storage area is referred to as a complete data storage area.
- FIG. 1 is a schematic diagram of a hierarchical storage structure.
- A in FIG. 1 is a schematic diagram of a two-layer storage structure, and B is a schematic diagram of a three-layer storage structure.
- the data flow direction is from a zero-level storage area to a primary storage area.
- the database storage engine receives data writes (including inserts, updates, and deletes), and the data is first written to the data set in the zero-level storage area.
- the storage engine monitors the data set. When a certain trigger condition is reached, for example the data set size exceeds a certain threshold, the data set that satisfies the condition is flushed to a persistent data file on the secondary storage area.
- when the storage engine receives a data query (select) request, it retrieves the data record fragments that meet the query condition from the data set in the zero-level storage area and from the persistent data files on the secondary storage area. It then splices the fragments from the two storage areas into a complete data record as the query result.
- the data flow direction is from a zero-level storage area to a primary storage area, and then from a primary storage area to a secondary storage area.
- the database storage engine receives requests for data writes (including inserts, updates, and deletes), and the data is first written to the data set in the zero-level storage area.
- the storage engine monitors the data set. When a certain trigger condition is reached, for example the data set size exceeds a certain threshold, the data set that satisfies the condition is flushed to a persistent data file on the primary storage area. When the persistent data files on the primary storage area satisfy the set trigger condition, the data is transferred in a certain form to persistent data files on the secondary storage area.
- when a query arrives, the storage engine retrieves the data record segments that meet the query condition from the data set in the zero-level storage area and from the persistent data files on the primary storage area and the secondary storage area. It then splices the data record segments from the three storage areas into a complete data record as the query result.
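The three-layer data flow can be sketched as below. This is only an illustration under stated assumptions: the class `ThreeTierStore`, both thresholds, and the choice to move files to the secondary area unchanged are hypothetical, since the text only says the data is transferred "in a certain form".

```python
class ThreeTierStore:
    """Toy three-layer store: zero-level (memory), primary, and secondary areas."""

    def __init__(self, level0_limit=2, level1_limit=2):
        self.level0 = {}   # zero-level storage area: in-memory data set
        self.level1 = []   # primary storage area: persistent files, oldest first
        self.level2 = []   # secondary storage area: persistent files, oldest first
        self.level0_limit = level0_limit  # hypothetical trigger: Rowkeys in memory
        self.level1_limit = level1_limit  # hypothetical trigger: files on primary

    def write(self, rowkey, fragment):
        self.level0.setdefault(rowkey, []).append(fragment)
        if len(self.level0) >= self.level0_limit:
            # Flush the data set to a persistent file on the primary area.
            self.level1.append(self.level0)
            self.level0 = {}
        if len(self.level1) >= self.level1_limit:
            # Assumption: files are moved as-is to the secondary area.
            self.level2.extend(self.level1)
            self.level1 = []

    def select(self, rowkey):
        # Splice fragments from all three areas, oldest data first.
        parts = []
        for data_file in self.level2 + self.level1:
            parts.extend(data_file.get(rowkey, []))
        parts.extend(self.level0.get(rowkey, []))
        return parts


store = ThreeTierStore()
for rowkey, fragment in [("User1", "feed1"), ("User2", "feed1"),
                         ("User1", "feed2"), ("User3", "feed1"),
                         ("User1", "feed3")]:
    store.write(rowkey, fragment)
spliced = store.select("User1")
```

Here the second flush fills the primary area, which triggers the move to the secondary area, and the query for `User1` has to visit both secondary files and the in-memory data set.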
- the present application provides a data file management method and device that dynamically manage the data files of the incremental database across the incremental data storage area and the complete data storage area. A Rowkey thereby changes from the discrete state of the initial incremental data storage area to the centralized state of the complete data storage area, which reduces the IO overhead of an exact Rowkey query in the complete data storage area.
- FIG. 2 is a flowchart of an implementation manner of a data file management method according to the present application.
- the data file management method in this embodiment includes:
- Step S101: When the incremental data storage area reaches the first data file merge condition, the record segments corresponding to each primary key in each data file in the incremental data storage area are merged with the historical complete record found for that primary key, forming a complete record at the merge time for each primary key;
- the primary key is the unique identifier of each sub-schema in the nested structure supported by NoSQL.
- the following uses a blog as an example to illustrate a nested schema, defining the schema of the blog table (Feed-Table):
- the schema of Feed-Table consists of three levels of child schema, which define user information (user_id, user_name), blog information (feed_id, feed_posttime, feed_content), and comment information (comment_id, comment_posttime, comment_content), nested one within another. User information, blog information, and comment information each have a unique identifier:
- user_id, feed_id, and comment_id respectively, where user_id is called the primary key of Feed-Table, that is, the Rowkey.
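One way to picture the three-level nested schema is the sketch below. The field names follow the text; the class names, field types, and sample values are assumptions for illustration, since the schema definition itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    comment_id: int          # unique identifier at the comment level
    comment_posttime: str
    comment_content: str


@dataclass
class Feed:
    feed_id: int             # unique identifier at the blog level
    feed_posttime: str
    feed_content: str
    comments: List[Comment] = field(default_factory=list)


@dataclass
class UserRow:
    user_id: int             # primary key of Feed-Table, i.e. the Rowkey
    user_name: str
    feeds: List[Feed] = field(default_factory=list)


# Hypothetical sample row: one user with one feed carrying one comment.
row = UserRow(
    user_id=1,
    user_name="alice",
    feeds=[Feed(3, "2014-06-11", "hello",
                [Comment(7, "2014-06-12", "nice post")])],
)
```

Each record segment written for a Rowkey then amounts to one or more `Feed` entries (with their comments) belonging to a single `user_id`.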
- the data files are divided into incremental data and complete data, each with a corresponding storage area: for a given Rowkey, the incremental data storage area stores the Rowkey's incremental data, and the complete data storage area stores the Rowkey's complete data.
- the user can preset the merge condition of the incremental data storage area, that is, the first data file merge condition. Examples are: a predetermined time is reached, the amount of data in the incremental data storage area reaches a predetermined threshold, or new incremental data appears in the incremental data storage area. Whenever the incremental data storage area reaches the first data file merge condition, the data files of the incremental data storage area are merged.
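The three example triggers can be expressed as a small policy object. This is a sketch under assumptions: the class `MergePolicy`, its field names, and the units (seconds, bytes) are hypothetical and chosen only to make the conditions concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergePolicy:
    """User-preset first data file merge condition (hypothetical parameters)."""
    max_age_seconds: Optional[float] = None   # merge after a predetermined time
    max_bytes: Optional[int] = None           # merge once this much data accumulates
    merge_on_new_data: bool = False           # merge whenever new data appears

    def should_merge(self, seconds_since_last_merge, incremental_bytes, has_new_data):
        if (self.max_age_seconds is not None
                and seconds_since_last_merge >= self.max_age_seconds):
            return True
        if self.max_bytes is not None and incremental_bytes >= self.max_bytes:
            return True
        return self.merge_on_new_data and has_new_data


size_policy = MergePolicy(max_bytes=64 * 1024 * 1024)
eager_policy = MergePolicy(merge_on_new_data=True)
```

A storage engine would evaluate `should_merge` periodically or after each flush; any single satisfied condition starts the merge.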
- when the data files of the incremental data storage area are merged, the historical complete record of each Rowkey in the complete data storage area participates in the merge, yielding the complete record at the merge time for that Rowkey.
- the complete record at the merge time can also be understood as the latest complete record, that is, the complete record of the Rowkey produced by this merge. It remains the Rowkey's complete record until the next merge of incremental data files that contain records of the Rowkey.
- each Rowkey record carries a scalar (such as a timestamp) indicating how new or old it is.
- the historical complete record and the complete record at the merge time are distinct. The historical complete record is the first record of the Rowkey found when searching the complete data storage area from new to old before the file merge starts.
- this first record contains all the records the Rowkey had before the file merge.
- the complete record at the merge time is, after the current file merge completes, all the records of the Rowkey in the data file written to the complete data storage area (including the Rowkey's previously merged records and those merged this time).
- the complete record at the merge time is valid only for a limited period, that is, until records of the same Rowkey are next merged.
- each Rowkey record in the data files is merged with the historical complete record found for it, yielding the complete record at the merge time for each Rowkey.
- each Rowkey record in the data files here means all the record segments corresponding to that Rowkey.
- Step S102: Write the complete record at the merge time for each primary key into a newly created data file of the complete data storage area, where the complete record at the merge time for each primary key is used as the output result of an exact query for that primary key in the complete data storage area.
- the complete records at the merge time obtained for each Rowkey are written into the newly created data file of the complete data storage area, which is the target data file generated in the complete data storage area after the merge. It stores the complete record at the merge time for each Rowkey obtained by merging the data files of the incremental data storage area.
- the above merge process can also be called a vertical merge process. It is a file merge across storage areas that merges Rowkey record segments and aggregates each Rowkey.
- the data files of the incremental data storage area can then be deleted to release storage space.
- the data file management method of the present application merges the record segments corresponding to each Rowkey in each data file of the incremental data storage area with the historical complete record found for that Rowkey, forms the complete record at the merge time for each Rowkey, and writes it to the complete data storage area. The data files of the incremental database are thus dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a centralized state in the complete data storage area, reducing the IO overhead of an exact Rowkey query in the complete data storage area.
- FIG. 3 is a flowchart of forming the complete record at the merge time for each primary key. In this embodiment, forming this record includes the following sub-steps:
- Sub-step S201: sequentially iterate over the record segments of each primary key in the data files of the incremental data storage area, in primary key order, to obtain the incremental record of each primary key;
- the n data files of the incremental data storage area are iterated in Rowkey order, and all the records of each Rowkey iterated from the n data files are used as that Rowkey's incremental record.
- Sub-step S202 searching for a historical complete record corresponding to each primary key from a data file of the main storage or the complete data storage area;
- Sub-step S203: determine whether the historical complete record corresponding to the primary key is found; if it is found, perform sub-step S204, otherwise perform sub-step S205;
- Sub-step S204 Combining the incremental record of each primary key with the historical complete record corresponding to the found primary key to form a complete record of the combined time corresponding to each primary key;
- the found historical record of the Rowkey is merged with the incremental record of the Rowkey to form a complete record of the merged moment corresponding to the Rowkey, that is, the latest complete record.
- This merge process is performed for each Rowkey that finds a historical full record, and a complete record of the merge time corresponding to each Rowkey is obtained.
- Sub-step S205: the incremental record of the primary key is taken as the complete record at the merge time for the primary key;
- if no historical complete record is found for the Rowkey, the Rowkey's incremental record is used as its complete record at the merge time and is written to the target data file of the complete data storage area.
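Sub-steps S202 to S205 can be sketched for a single Rowkey as follows. The data layout is an assumption made for the example: main memory is modelled as a dict of cached complete records, and each data file of the complete data storage area as a `(formation_time, records)` pair. The function names are hypothetical.

```python
def find_historical_record(rowkey, main_memory, complete_files):
    """S202: look in main memory first, then in complete data files, newest first."""
    if rowkey in main_memory:
        return main_memory[rowkey]
    newest_first = sorted(complete_files, key=lambda f: f[0], reverse=True)
    for _formation_time, records in newest_first:
        if rowkey in records:
            return records[rowkey]   # first hit is the latest complete record
    return None


def complete_record_at_merge(rowkey, incremental, main_memory, complete_files):
    historical = find_historical_record(rowkey, main_memory, complete_files)
    if historical is None:
        # S203 -> S205: nothing found, the incremental record is already complete.
        return list(incremental)
    # S203 -> S204: historical complete record followed by the new increments.
    return list(historical) + list(incremental)


# Hypothetical inputs: two complete data files formed at times 1 and 2.
complete_files = [(1, {"k1": ["r1"], "k3": ["r1"]}),
                  (2, {"k1": ["r1", "r2"]})]
merged_k1 = complete_record_at_merge("k1", ["r3"], {}, complete_files)
merged_k2 = complete_record_at_merge("k2", ["r1"], {}, complete_files)
```

Because the search runs from new to old and stops at the first hit, the older copy of `k1` in the first file never takes part in the merge.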
- the data file A and the data file B in the incremental data storage area contain feed incremental data of user (User) 1, User2, and User3: data file A contains User1's feed3 and feed4 and User3's feed2 and feed3, and data file B contains User1's feed5 and User2's feed1.
- User1, User2, and User3 are the different Rowkeys mentioned above.
- the data file 1 and the data file 2 in the complete data storage area were generated by previous vertical or horizontal merges. Data file 1 was generated at time point t1 and holds the complete records of User1 and User3 at time t1, that is, User1's feed1 and User3's feed1; it is the result of a vertical file merge or of a previous round of horizontal file merge.
- Data file 2 was generated at time point t2 and holds the complete record of User1 at time t2, that is, User1's feed1 and feed2; it is the result of a vertical file merge. Here t2 is later than t1.
- Data file 3 is a newly created data file that stores the output of the current vertical merge. The vertical merge proceeds as follows:
- the data file A and data file B of the incremental data storage area are iterated in Rowkey order, and the record fragments of each Rowkey iterated from data file A and data file B are used as that Rowkey's incremental record: User1's feed3, feed4, and feed5 are User1's incremental record, User2's feed1 is User2's incremental record, and User3's feed2 and feed3 are User3's incremental record;
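The example can be run end to end with the sketch below, which uses Python's `heapq.merge` and `itertools.groupby` for the Rowkey-ordered iteration. The in-memory representation of the files and the function names are assumptions; the contents of data file 3 are what this sketch computes from the stated method, not a quotation from the source.

```python
import heapq
from itertools import groupby

# Incremental data storage area: files sorted by Rowkey.
file_a = [("User1", ["feed3", "feed4"]), ("User3", ["feed2", "feed3"])]
file_b = [("User1", ["feed5"]), ("User2", ["feed1"])]

# Complete data storage area: (formation time, Rowkey -> complete record).
complete_files = [(1, {"User1": ["feed1"], "User3": ["feed1"]}),
                  (2, {"User1": ["feed1", "feed2"]})]


def incremental_records(*files):
    """Iterate all incremental files in Rowkey order, one record per Rowkey."""
    merged = heapq.merge(*files, key=lambda entry: entry[0])
    for rowkey, group in groupby(merged, key=lambda entry: entry[0]):
        yield rowkey, [feed for _, fragments in group for feed in fragments]


def latest_complete(rowkey):
    """Historical complete record: first hit when searching newest file first."""
    for _t, records in sorted(complete_files, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            return list(records[rowkey])
    return []


# Vertical merge: historical complete record plus incremental record.
data_file_3 = {rowkey: latest_complete(rowkey) + increments
               for rowkey, increments in incremental_records(file_a, file_b)}
```

Under these assumptions `data_file_3` holds feed1 to feed5 for User1, feed1 for User2, and feed1 to feed3 for User3. User1's history is taken from data file 2 (formed at t2), not from the older data file 1.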
- FIG. 4 is a flowchart of another embodiment of a method for managing a data file according to the present application.
- the method for managing a data file in this embodiment includes the following steps:
- Step S301: When the incremental data storage area reaches the first data file merge condition, the record segments corresponding to each primary key in each data file in the incremental data storage area are merged with the historical complete record found for that primary key, forming a complete record at the merge time for each primary key;
- Step S302: Write the complete record at the merge time for each primary key into a newly created data file of the complete data storage area, where the complete record at the merge time for each primary key is used as the output result of an exact query for that primary key in the complete data storage area.
- Step S304: When the complete data storage area reaches the second data file merge condition, the data files stored in the complete data storage area that contain the complete records of each merge time are merged, and the redundant records of each primary key in the complete data storage area are deleted.
- a data file merging process inside the complete data storage area is further performed.
- This process can also be called a horizontal data file merging process, which is a data file merging process inside the complete data storage area. The goal is to eliminate redundant Rowkeys, discard invalid Rowkey records, and reclaim storage space.
- the user may preset the merge condition of the complete data storage area, that is, the second data file merge condition. Examples are: a predetermined time is reached, the amount of data reaches a predetermined threshold, or the data merge of the incremental data storage area has just completed. Whenever the complete data storage area reaches the second data file merge condition, the data files of the complete data storage area are merged.
- merging the data files stored in the complete data storage area that contain the complete records of each merge time may be implemented with various existing algorithms for removing data redundancy, such as a merge algorithm.
- the following takes as an example the merging, by a merge algorithm, of the data files stored in the complete data storage area that contain the complete records of each merge time.
- FIG. 5 is a flowchart of a merge algorithm merging the data files, stored in the complete data storage area, that contain the complete records of each merge time. This merge includes the following sub-steps:
- Sub-step S401: From the data files stored in the complete data storage area that contain the complete records of each merge time, find the latest data file in which each primary key is located, the latest data file being the data file with the latest formation time;
- the data files of the complete data storage area that contain the complete records of each merge time are all the data files stored in the complete data storage area at the time of the merge. From these, the latest data file in which each Rowkey is located is found. Because each data file in the complete data storage area carries a scalar indicating how new or old it is (such as a timestamp), the latest and most complete record of a Rowkey is in the data file with the latest formation time.
- the iterator iterates in Rowkey order, taking the generation order of the data files of the complete data storage area into account, for example User1, User2, User3, and so on, and finds the latest data file of each Rowkey in that order: first the latest data file in which User1 is located, then the latest data file in which User2 is located, and so on.
- Sub-step S402 obtaining a complete record corresponding to each primary key from the latest data file where each primary key is located and writing the merged data file of the complete data storage area, and deleting the completed merged data file of the complete data storage area;
- the merged data file is the target data file that the full data store uses to store the results of its internal data file merge.
- FIG. 7 is a schematic diagram of a complete data storage area, where the data file 1 and the data file 2 of the complete data storage area are two data files to be merged.
- Data file 3 is the target file for horizontal merge output, that is, the merged data file described above.
- the data file 1 was generated at time t1 and holds the complete records of User1 and User3 at time t1, that is, User1's feed1 and User3's feed1; it is the result of a vertical file merge or of a previous round of horizontal file merge.
- User1 and User3 are the different Rowkeys mentioned above.
- Data file 2 was generated at time point t2 and holds the complete record of User1 at time t2, that is, User1's feed1 and feed2; it is the result of a vertical file merge. Here t2 is later than t1.
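For this example, sub-steps S401 and S402 can be sketched as below. The `(formation_time, records)` layout, the function name `horizontal_merge`, and the formation time 3 given to the merged file are assumptions made for illustration.

```python
def horizontal_merge(complete_files):
    """Keep, for each Rowkey, only the record from its latest data file."""
    merged = {}
    newest_first = sorted(complete_files, key=lambda f: f[0], reverse=True)
    for _formation_time, records in newest_first:
        for rowkey, record in records.items():
            # S401: the first time a Rowkey is seen is in its latest data file.
            merged.setdefault(rowkey, list(record))
    # S402: the merged data file is written in Rowkey order.
    return dict(sorted(merged.items()))


data_file_1 = (1, {"User1": ["feed1"], "User3": ["feed1"]})   # formed at t1
data_file_2 = (2, {"User1": ["feed1", "feed2"]})              # formed at t2 > t1
complete_area = [data_file_1, data_file_2]

data_file_3 = horizontal_merge(complete_area)
# S402: once the merged file exists, the merged inputs are deleted.
complete_area = [(3, data_file_3)]
```

The redundant copy of User1 in data file 1 is dropped, so the complete data storage area ends up with a single file holding one record per Rowkey.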
- records of a Rowkey may be in the main memory, the incremental data storage area, and the complete data storage area.
- therefore, to query a Rowkey, the results from these three storage areas must be combined.
- the following example illustrates the Rowkey query process after using the management method of the above data file:
- the data file management method of the present application divides the data files into incremental data and complete data, stores them hierarchically, and merges them in stages. This solves the problem that an exact Rowkey query requires multiple IOs in the complete data storage area: an exact Rowkey lookup on the complete data storage area takes only one IO.
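A query over the three storage areas might look like the sketch below. It is a simplification under stated assumptions: one IO is counted per complete data file read, the function name and data layout are hypothetical, and the sample data (including feed6 and feed7) is invented for the illustration. The single read holds here because the newest complete data file contains the Rowkey.

```python
def exact_query(rowkey, memtable, incremental_files, complete_files):
    """Combine results from the three areas; count reads in the complete area."""
    record, complete_reads = [], 0
    newest_first = sorted(complete_files, key=lambda f: f[0], reverse=True)
    for _formation_time, records in newest_first:
        complete_reads += 1
        if rowkey in records:
            # The latest complete record already holds everything merged so far,
            # so older complete data files need not be read.
            record = list(records[rowkey])
            break
    for records in incremental_files:           # not yet merged fragments
        record.extend(records.get(rowkey, []))
    record.extend(memtable.get(rowkey, []))     # newest, still in memory
    return record, complete_reads


complete_files = [(3, {"User1": ["feed1", "feed2", "feed3", "feed4", "feed5"],
                       "User2": ["feed1"]})]
incremental_files = [{"User1": ["feed6"]}]
memtable = {"User1": ["feed7"]}

result, complete_reads = exact_query("User1", memtable,
                                     incremental_files, complete_files)
```

With the complete record centralized in one file, the complete data storage area contributes a single read; only the fragments that arrived after the last merge still have to be gathered from the other two areas.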
- FIG. 9 is a schematic structural diagram of an embodiment of a storage device of the present application.
- the storage device 100 includes a first merge module 11 and a write module 12, wherein: the first merge module 11 is configured to, when the incremental data storage area reaches the first data file merge condition, merge the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forming a complete record of the merge time corresponding to each Rowkey, and output it to the write module 12;
- the data files are divided into incremental data and complete data, each with its own storage area: the incremental data is stored in the incremental data storage area and the complete data in the complete data storage area.
- For a given Rowkey, the incremental data storage area holds the incremental data of that Rowkey, and the complete data storage area holds its complete data.
- the user can preset the data merge condition of the incremental data storage area, that is, the first data file merge condition. For example, the condition may be that a preset time is reached, that the amount of data in the incremental data storage area reaches a predetermined threshold, or that new incremental data appears in the incremental data storage area. Whenever the incremental data storage area meets the first data file merge condition, the data file merge process of the incremental data storage area is performed.
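The three example triggers named above could be checked as in the following sketch. The class `FirstMergeCondition`, its field names and the units (seconds, bytes) are assumptions chosen for illustration; the text does not prescribe how the condition is represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstMergeCondition:
    """Hypothetical representation of the first data file merge condition."""
    max_interval: Optional[float] = None  # seconds since the last merge
    max_bytes: Optional[int] = None       # size threshold of the incremental area
    merge_on_new_data: bool = False       # merge as soon as new data appears

    def is_met(self, seconds_since_last_merge, bytes_in_area, has_new_data):
        # Trigger 1: the preset time has been reached.
        if self.max_interval is not None and seconds_since_last_merge >= self.max_interval:
            return True
        # Trigger 2: the amount of incremental data has reached the threshold.
        if self.max_bytes is not None and bytes_in_area >= self.max_bytes:
            return True
        # Trigger 3: new incremental data has appeared.
        return self.merge_on_new_data and has_new_data
```

Any one trigger is sufficient in this sketch, which matches the "or" wording of the examples; a deployment could equally enable only one of them.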
- when the data files of the incremental data storage area are merged, the first merge module 11 brings the Rowkey's record in the complete data storage area into the merge, obtaining a complete record of the merge time corresponding to the Rowkey.
- the complete record of the merge time can also be understood as the latest complete record, that is, the complete record corresponding to the Rowkey obtained after the merge. In other words, the Rowkey's record is complete until the next time a data file containing records of that Rowkey is merged.
- Each Rowkey record carries, from the time it is formed, a scalar indicating how recent it is (such as a timestamp).
- The historical complete record and the complete record of the merge time are distinguished as follows. The historical complete record is the first record of the Rowkey found when the complete data storage area is searched from newest to oldest before the file merge starts.
- That first record contains all of the records the Rowkey had before the file merge.
- The complete record of the merge time is the set of all records of the Rowkey (both the previously merged records and those merged this time) written into the newly created data file of the complete data storage area once the current file merge is completed.
- In a data file of the incremental data storage area, the data is arranged in Rowkey order.
- the write module 12 is configured to write the complete record of the merge time corresponding to each Rowkey into a newly created data file in the complete data storage area, where the complete record of the merge time corresponding to each Rowkey serves, until the next merge of that Rowkey's records, as the output result of an exact query for the Rowkey in the complete data storage area.
- the write module 12 writes the complete record of the merge time corresponding to each Rowkey obtained from the merge into the newly created data file of the complete data storage area. The newly created data file is the target data file generated in the complete data storage area by the merge, and it stores the complete record of the merge time corresponding to each Rowkey in the data files of the incremental data storage area.
- the above merge process can also be called a vertical merge process. It is a file merge across storage areas: it merges the record segments of a Rowkey so that the Rowkey's records are gathered together.
- the write module 12 can delete the corresponding data file of the incremental data storage area to release the storage space.
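The vertical merge described above can be sketched as follows. This is a simplified model, not the patent's implementation: data files are dictionaries with assumed `created` and `records` fields, records are lists of entries, and `find_history` is a hypothetical callable that returns the historical complete record of a Rowkey (or `None` if there is none).

```python
def vertical_merge(incremental_files, complete_files, find_history, merge_time):
    """Merge incremental record segments with history into one new complete file."""
    # Gather the record segments of every Rowkey across the incremental files,
    # oldest file first so entries stay in write order.
    segments = {}
    for data_file in sorted(incremental_files, key=lambda f: f["created"]):
        for rowkey, entries in data_file["records"].items():
            segments.setdefault(rowkey, []).extend(entries)

    # Build the newly created data file, with Rowkeys in sorted order.
    new_file = {"created": merge_time, "records": {}}
    for rowkey in sorted(segments):
        history = find_history(rowkey) or []  # no history: segments alone
        new_file["records"][rowkey] = list(history) + segments[rowkey]

    # Write the new file to the complete area, then release the incremental files.
    complete_files.append(new_file)
    incremental_files.clear()
    return new_file
```

Older files in the complete area are left in place here; removing the redundant copies they contain is the job of the separate merge inside the complete data storage area.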
- FIG. 10 is a schematic structural diagram of another embodiment of a storage device of the present application.
- the storage device 200 of the present embodiment includes a first merge module 21, a write module 22, a second merge module 23, and a lookup module 24, wherein:
- the first merge module 21 is configured to, when the incremental data storage area reaches the first data file merge condition, merge the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forming a complete record of the merge time corresponding to each Rowkey, and output it to the write module 22;
- the write module 22 is configured to write the complete record of the merge time corresponding to each Rowkey into a newly created data file in the complete data storage area, where the complete record of the merge time corresponding to each Rowkey serves, until the next merge of that Rowkey's records, as the output result of an exact query for the Rowkey in the complete data storage area.
- the second merge module 23 is configured to, when the complete data storage area reaches the second data file merge condition, merge the data files saved in the complete data storage area that contain the complete records of the various merge times, and delete the redundant records of each Rowkey in the complete data storage area.
- the second merge module 23 thus performs a further data file merge inside the complete data storage area.
- This process may also be called a horizontal data file merge process, which is a data file merge inside the complete data storage area. The goal is to eliminate redundant Rowkey records, discard invalid Rowkey records, and reclaim storage space.
- the second merge module 23 merges the data files saved in the complete data storage area, which contain the complete records of the various merge times and therefore redundant data, using, for example, a merge algorithm.
- the lookup module 24 is configured to search the data files of the main memory or of the complete data storage area, before the merge, for the historical complete record corresponding to each Rowkey. Specifically, the search is first performed in the data files in main memory; if nothing is found there, it continues in the data files of the complete data storage area. During the search, the data files are examined from newest to oldest until a record of the Rowkey is found. The record found is the one with the latest timestamp, that is, the historical complete record of the Rowkey. The lookup module 24 performs this search for each Rowkey.
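The search order used by the lookup module can be sketched as below. The representation is an assumption for illustration: each area is a list of dictionaries with hypothetical `created` and `records` fields, and the function name `find_historical_record` is invented here.

```python
def find_historical_record(rowkey, memory_files, complete_files):
    """Return the newest complete record of a Rowkey, or None if none exists."""
    # Main memory is searched first, then the complete data storage area.
    for area in (memory_files, complete_files):
        # Within an area, examine the data files from newest to oldest.
        for data_file in sorted(area, key=lambda f: f["created"], reverse=True):
            if rowkey in data_file["records"]:
                # The first hit carries the latest timestamp for this Rowkey.
                return data_file["records"][rowkey]
    return None
```

Returning `None` lets the caller distinguish a Rowkey with no history, in which case its incremental record segments alone would form the complete record of the merge time.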
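- when the lookup module 24 does not find a historical complete record corresponding to a Rowkey, the first merge module 21 is configured to merge the record segments corresponding to that Rowkey in the data files of the incremental data storage area and use the result as the complete record of the merge time corresponding to the Rowkey.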
- the second merging module 23 in this embodiment further includes a searching unit 111 and a writing unit 112, where:
- the searching unit 111 is configured to find, among the data files saved in the complete data storage area that contain the complete records of the various merge times, the latest data file in which each Rowkey is located, and to output it to the writing unit 112; the latest data file is the data file with the latest formation time;
- The data files saved in the complete data storage area that contain the complete records of the various merge times are all of the data files in the complete data storage area at the merge time.
- the searching unit 111 finds the latest data file of each Rowkey among these data files. Because each data file of the complete data storage area carries, from the time it is generated, a scalar indicating how recent it is (such as a timestamp), the latest data file holds the latest and most complete record of the Rowkey.
- the searching unit 111 iterates over the data files of the complete data storage area in Rowkey order, for example User1, User2, User3, and so on, taking the generation order of the data files into account, and finds the latest data file of each Rowkey in turn. That is, it first finds the latest data file in which User1 is located, then the latest data file in which User2 is located, and so on.
- the writing unit 112 is configured to obtain the complete record corresponding to each Rowkey from the latest data file in which that Rowkey is located, write it into the merged data file of the complete data storage area, and then delete the data files of the complete data storage area whose merge has been completed.
- the merged data file is the target file in which the complete data storage area stores the result of merging its internal data files.
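One way the searching and writing units could cooperate is a k-way merge over Rowkey-sorted files, sketched below with Python's standard `heapq.merge` and `itertools.groupby`. The text only says that a merge algorithm may be used, so the choice of a k-way merge, the tuple layout `(created, rows)` and the function name `merge_sorted_files` are assumptions for this example.

```python
import heapq
from itertools import groupby


def merge_sorted_files(files):
    """Merge Rowkey-sorted files, keeping each Rowkey's record from the newest file.

    files: list of (created, rows), where rows is a list of (rowkey, record)
    pairs sorted by rowkey and created is the file's timestamp.
    """
    # Tag every row with the negated timestamp so that, for equal Rowkeys,
    # the row from the newest file sorts first.
    streams = [
        [(rowkey, -created, record) for rowkey, record in rows]
        for created, rows in files
    ]
    merged = heapq.merge(*streams, key=lambda item: (item[0], item[1]))

    output = []
    for rowkey, group in groupby(merged, key=lambda item: item[0]):
        _, _, record = next(group)  # first row in the group comes from the newest file
        output.append((rowkey, record))  # remaining rows are redundant and dropped
    return output
```

Because every input is already sorted by Rowkey, the merge reads each file sequentially and emits the merged file in Rowkey order, after which the input files could be deleted.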
- FIG. 12 is a schematic structural diagram of still another embodiment of a storage device of the present application.
- the storage device 300 of the present embodiment includes a processor 31, an interaction interface 32, a random access memory 33, a read only memory 34, a bus 35, and a network interface unit 36.
- the processor 31 is coupled through the bus 35 to the interaction interface 32, the random access memory 33, the read only memory 34, and the network interface unit 36, respectively.
- the storage device 300 is booted into a normal operating state by the basic input/output system, or by the bootloader of an embedded system, stored in the read only memory 34.
- the application program and the operating system are run in the random access memory 33, and data is received from the network or transmitted to the network through the network interface unit 36, so that:
- the interaction interface 32 is a device interface for human-computer interaction, and is configured to receive an operation instruction of the user, and may be a USB interface, a display interface, or the like;
- the processor 31 receives, through the interaction interface 32, the user's operation instruction for merging the data files of the incremental data storage area; merges the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forming a complete record of the merge time corresponding to each Rowkey; and writes the complete record of the merge time corresponding to each Rowkey into a newly created data file of the complete data storage area, where the complete record of the merge time corresponding to each Rowkey serves, until the next merge of that Rowkey's records, as the output result of an exact query for the Rowkey in the complete data storage area;
- the processor 31 further, according to the user's operation instruction for merging the data of the complete data storage area, merges the data files saved in the complete data storage area that contain the complete records of the various merge times, and deletes the redundant records of each Rowkey in the complete data storage area;
- the processor 31 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
- the incremental data storage area and the complete data storage area described above may correspond to the random access memory 33 and the read only memory 34 of the storage device 300 of the present embodiment, respectively.
- the data file management method and apparatus of the present application merge the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forming a complete record of the merge time corresponding to each Rowkey, which is written to the complete data storage area.
- In this way, the data files of the incremental database are dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a centralized state in the complete data storage area, reducing the IO overhead of exact Rowkey queries in the complete data storage area.
- the disclosed systems, devices, and methods may be implemented in other ways.
- the device implementations described above are merely illustrative.
- the division of the modules or units is only a logical function division.
- in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
- the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the present embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of a software function unit.
- the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
- the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the various embodiments of the present application.
- the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Quality & Reliability (AREA)
Abstract
Disclosed are a data file management method and device. The data file management method comprises: when an incremental data storage area reaches a first data file merging condition, merging record segments corresponding to each primary key in various data files in the incremental data storage area with found historical complete records corresponding to the primary keys respectively, so as to form a complete record of a merging time corresponding to each primary key; and writing the complete record of the merging time corresponding to each primary key into a newly-built data file of a complete data storage area, wherein the complete record of the merging time corresponding to each primary key is taken as an output result of accurately querying the primary keys in the complete data storage area. By means of the above-mentioned method, the present application enables the records of the primary keys to be centralized, thereby reducing the IO overheads for accurately querying the primary keys in the complete data storage area.
Description
Data file management method and device
Technical Field
The present invention relates to a method and device for managing data files.
Background
Databases are divided into relational databases and non-relational databases (Not Only SQL, NoSQL); NoSQL is a general term for all databases that differ from traditional relational databases. NoSQL data stores need not have a fixed table schema and usually store data as key-value pairs. At present, most NoSQL data stores are based on the Log-Structured Merge-Tree (LSM-tree), a data structure and algorithm featuring deferred updates and batched writes to disk. By converting accesses to many small files into continuous bulk transfers, the LSM-tree makes most file system accesses sequential, which improves disk bandwidth utilization and minimizes access overhead; it is particularly suited to application environments that generate a large number of insert operations. NoSQL databases based on the LSM-tree are therefore also called incremental databases.
An LSM-tree consists of at least two components. One component resides in memory and is called the C0 tree (or C0); it can be any data structure that supports convenient key lookup. The other components reside on disk and are called the C1 ... CK trees (or C1 ... CK); frequently accessed nodes of C1 ... CK are also cached in main memory. An incremental database uses an incremental write mode: a newly added or updated record is first placed in an in-memory data structure (such as a memory table, Memtable), that is, the C0 tree. When that structure reaches a certain size, it forms a small data file (such as a sorted string table, Sstable) that is flushed to the on-disk data structure, that is, the C1 ... CK trees, with the primary keys (Rowkeys) inside it arranged in order. Such a file is immutable. At query time, the record segments of a Rowkey must be read from each of these small data files and combined to form one complete Rowkey record.
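The incremental write mode and its cost at query time can be illustrated with a toy model. This is a deliberately simplified sketch, not an LSM-tree implementation: the class `TinyLSM`, the flush rule based on the number of distinct Rowkeys, and the use of in-memory tuples in place of on-disk files are all assumptions made for the example.

```python
class TinyLSM:
    """Toy model of an incremental database: one memtable plus immutable files."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # flush once this many Rowkeys are held
        self.memtable = {}                    # stands in for the C0 tree
        self.sstables = []                    # stands in for C1 ... CK, oldest first

    def put(self, rowkey, entry):
        # New or updated records always land in memory first.
        self.memtable.setdefault(rowkey, []).append(entry)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            # Freeze the memtable into an immutable file sorted by Rowkey.
            table = tuple(sorted((k, tuple(v)) for k, v in self.memtable.items()))
            self.sstables.append(table)
            self.memtable = {}

    def get(self, rowkey):
        """Return (complete record, number of files read) for one Rowkey."""
        segments, files_read = [], 0
        for table in self.sstables:
            files_read += 1  # every file may hold a record segment
            for key, entries in table:
                if key == rowkey:
                    segments.extend(entries)
        segments.extend(self.memtable.get(rowkey, []))
        return segments, files_read
```

In this model the number of files read for one Rowkey grows with the number of flushed files, which is the repeated IO cost that the remainder of the document sets out to reduce.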
With the incremental write mode, a complete Rowkey record may be made up of Rowkey record segments scattered across different data files on storage. As a result, one exact Rowkey query consumes multiple storage input/output (Input/Output, IO) operations.
Summary of the Invention
The main technical problem solved by the present invention is to provide a data file management method and device that can change a Rowkey from a scattered state in the incremental storage area to a centralized state in the complete data storage area, reducing the IO overhead of exact Rowkey queries in the complete data storage area.
In a first aspect, the present application provides a data file management method, including: when an incremental data storage area reaches a first data file merge condition, merging the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key, to form a complete record of the merge time corresponding to each primary key; and writing the complete record of the merge time corresponding to each primary key into a newly created data file of a complete data storage area, where the complete record of the merge time corresponding to each primary key serves as the output result of an exact query for that primary key in the complete data storage area.
With reference to the first aspect, in a first possible implementation of the first aspect, the method further includes: writing the complete record of the merge time corresponding to each primary key into main memory.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the method further includes: when the complete data storage area reaches a second data file merge condition, merging the data files saved in the complete data storage area that contain the complete records of the various merge times, and deleting the redundant records of each primary key in the complete data storage area.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, merging the data files saved in the complete data storage area that contain the complete records of the various merge times and deleting the redundant records of each primary key in the complete data storage area is specifically: using a merge algorithm to merge the data files saved in the complete data storage area that contain the complete records of the various merge times, and deleting the redundant records of each primary key in the complete data storage area.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the step of using a merge algorithm to merge the data files saved in the complete data storage area that contain the complete records of the various merge times and deleting the redundant records of each primary key in the complete data storage area includes: finding, among the data files saved in the complete data storage area that contain the complete records of the various merge times, the latest data file in which each primary key is located, the latest data file being the data file with the latest formation time; and obtaining the complete record corresponding to each primary key from the latest data file in which that primary key is located, writing it into the merged data file of the complete data storage area, and deleting the data files of the complete data storage area whose merge has been completed.
With reference to any one of the second to fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, before the step of merging the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key to form the complete record of the merge time corresponding to each primary key, the method further includes: searching the data files of the main memory or of the complete data storage area for the historical complete record corresponding to each primary key.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the step of searching the data files of the main memory or of the complete data storage area for the historical complete record corresponding to each primary key includes: searching the data files in the main memory from newest to oldest according to the formation time of the complete record corresponding to each primary key; if nothing is found in the main memory, searching the data files of the complete data storage area, until a complete record corresponding to the primary key is found; the complete record of the primary key so found is the historical complete record corresponding to the primary key.
With reference to the fifth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, when no historical complete record corresponding to the primary key is found, merging the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key to form the complete record of the merge time corresponding to each primary key is specifically: merging the record segments corresponding to the primary key in the data files of the incremental data storage area and using the result as the complete record of the merge time corresponding to the primary key.
With reference to the first aspect, in an eighth possible implementation of the first aspect, the method further includes: deleting the data files of the incremental data storage area.
In a second aspect, the present application provides a storage device including a first merge module and a write module, wherein: the first merge module is configured to, when an incremental data storage area reaches a first data file merge condition, merge the record segments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key, form a complete record of the merge time corresponding to each primary key, and output it to the write module; and the write module is configured to write the complete record of the merge time corresponding to each primary key into a newly created data file of a complete data storage area, where the complete record of the merge time corresponding to each primary key serves as the output result of an exact query for that primary key in the complete data storage area.
With reference to the second aspect, in a first possible implementation of the second aspect, the write module is further configured to write the complete record of the merge time corresponding to each primary key into main memory.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the device further includes a second merge module, wherein: the second merge module is configured to, when the complete data storage area reaches a second data file merge condition, merge the data files saved in the complete data storage area that contain the complete records of the various merge times, and delete the redundant records of each primary key in the complete data storage area.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the second merge module includes a searching unit and a writing unit, wherein: the searching unit is configured to find, among the data files saved in the complete data storage area that contain the complete records of the various merge times, the latest data file in which each primary key is located, the latest data file being the data file with the latest formation time; and the writing unit is configured to obtain the complete record corresponding to each primary key from the latest data file in which that primary key is located, write it into the merged data file of the complete data storage area, and delete the data files of the complete data storage area whose merge has been completed.
With reference to any one of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the device further includes a lookup module, wherein: the lookup module is configured to search the data files of the main memory or of the complete data storage area for the historical complete record corresponding to each primary key, and to output each historical complete record found to the first merge module.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, when the lookup module does not find a historical complete record corresponding to the primary key, the first merge module is configured to merge the record segments corresponding to the primary key in the data files of the incremental data storage area and use the result as the complete record of the merge time corresponding to the primary key.
The beneficial effects of the present invention are as follows. Unlike the prior art, the present application merges the record segments corresponding to each Rowkey in the data files of the incremental data storage area with the found historical complete record corresponding to that Rowkey, forming a complete record of the merge time corresponding to each Rowkey, and writes it to the complete data storage area. In this way, the data files of the incremental database are dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a centralized state in the complete data storage area, reducing the IO overhead of exact Rowkey queries in the complete data storage area.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the tiered storage structure of the present application;
FIG. 2 is a flowchart of one embodiment of the data file management method of the present application;
FIG. 3 is a flowchart of forming the complete record at the merge time corresponding to each primary key in one embodiment of the data file management method of the present application;
FIG. 4 is a flowchart of another embodiment of the data file management method of the present application;
FIG. 5 is a flowchart of using a merging algorithm to merge the data files, stored in the complete data storage area, that contain the complete records at the respective merge times, in one embodiment of the data file management method of the present application;
FIG. 6 is a schematic diagram of the storage structure in one embodiment of the data file management method of the present application;
FIG. 7 is a schematic diagram of the storage structure in another embodiment of the data file management method of the present application;
FIG. 8 is a schematic diagram of the storage structure in yet another embodiment of the data file management method of the present application;
FIG. 9 is a schematic structural diagram of one embodiment of the storage device of the present application;
FIG. 10 is a schematic structural diagram of another embodiment of the storage device of the present application;
FIG. 11 is a schematic structural diagram of the second merge module in one embodiment of the storage device of the present application;
FIG. 12 is a schematic structural diagram of yet another embodiment of the storage device of the present application.
DETAILED DESCRIPTION
Hard disk drives (HDDs) are widely used as the information storage medium in storage systems such as databases. A hard-disk-based database usually adopts a two-tier storage structure of main memory plus HDD. Data records are first written to main memory and are then persisted to the hard disk under certain trigger conditions. For a long time, however, the two have developed unevenly in industry: main memory I/O performance has improved greatly while hard disk I/O performance has grown slowly, so the read and write performance of a hard-disk-based database is severely limited by hard disk I/O. The advent of solid state disks (SSDs) has opened up considerable room for database optimization. An SSD has good read and write performance and is faster than an HDD. It is usually introduced into a storage system as a read/write cache of limited capacity, forming a multi-tier storage structure of main memory + SSD + HDD that makes full use of the hardware and seeks a balance among performance, capacity, and price. Both SSDs and HDDs are non-volatile storage media.
In the present application, a zero-level storage area, a first-level storage area, and a second-level storage area are defined. The zero-level storage area refers specifically to main memory. The first-level and second-level storage areas are two classes of storage device, where the first-level storage area has better read and write performance than the second-level storage area but is more expensive; examples are a combination of main memory and SSD, a combination of SSD and HDD, or a combination of HDD and tape. The first-level and second-level storage areas can be understood as a combination of SSD and HDD, but the embodiments of the present application are not limited to this combination. In the present application, the first-level storage area is also called the incremental data storage area, and the second-level storage area is also called the complete data storage area.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the tiered storage structure, in which part A shows a two-tier storage structure and part B shows a three-tier storage structure.
In the two-tier storage structure, data flows from the zero-level storage area to the first-level storage area. The database storage engine receives data write requests (including insert, update, and delete), and the data is first written to a data set in the zero-level storage area. The storage engine monitors the data set, and when a certain trigger condition is reached, for example when the size of the data set exceeds a certain threshold, the data set that satisfies the condition is flushed to a persistent data file on the second-level storage area. When the storage engine receives a data query (select) request, it retrieves the data record fragments that match the query condition from the data set in the zero-level storage area and from the persistent data files on the second-level storage area, and then splices the fragments from these two storage areas into a complete data record, which is returned as the query result.
In the three-tier storage structure, data flows from the zero-level storage area to the first-level storage area, and then from the first-level storage area to the second-level storage area. The database storage engine receives data write requests (including insert, update, and delete), and the data is first written to a data set in the zero-level storage area. The storage engine monitors the data set, and when a certain trigger condition is reached, for example when the size of the data set exceeds a certain threshold, the data set that satisfies the condition is flushed to a persistent data file on the first-level storage area. When the persistent data files on the first-level storage area satisfy a set trigger condition, their data is transferred in some form to persistent data files on the second-level storage area. When the storage engine receives a data query (select) request, it retrieves the data record fragments that match the query condition from the data set in the zero-level storage area and from the persistent data files on the first-level and second-level storage areas, and then splices the fragments from these three storage areas into a complete data record, which is returned as the query result.
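The three-tier query path described above can be illustrated with a minimal Python sketch. It is not the storage engine of the application: the `Fragment` and `Store` types, the `query` function, and the rule that the newest fragment wins when two fragments set the same column are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A fragment is (timestamp, {column: value}). These names are hypothetical.
Fragment = Tuple[int, Dict[str, str]]
Store = Dict[str, List[Fragment]]  # rowkey -> fragments held by one storage area


def query(rowkey: str, level0: Store, level1: Store, level2: Store) -> Dict[str, str]:
    """Collect fragments from all three storage areas and splice them."""
    fragments: List[Fragment] = []
    for store in (level0, level1, level2):
        fragments.extend(store.get(rowkey, []))
    record: Dict[str, str] = {}
    # Assumption: apply fragments oldest first so that newer writes overwrite older ones.
    for _, columns in sorted(fragments, key=lambda f: f[0]):
        record.update(columns)
    return record


level2 = {"user1": [(1, {"name": "Ann", "city": "Oslo"})]}  # oldest data
level1 = {"user1": [(2, {"city": "Rome"})]}
level0 = {"user1": [(3, {"feed": "hello"})]}                # newest data
result = query("user1", level0, level1, level2)
print(result)
```

Every query touches all three areas in this sketch, which is the cost the merge procedure of the application aims to reduce.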
Existing incremental databases usually adopt an incremental write mode, so that one complete Rowkey record may, in storage, be made up of Rowkey record fragments scattered across different data files. As a result, a single exact Rowkey query consumes multiple storage I/O operations.
In view of the technical problem that the prior art forms a large number of data files on the storage device, which scatters each Rowkey and hinders query operations, the present application provides a data file management method and device that dynamically manage the data files of an incremental database across the incremental data storage area and the complete data storage area, so that each Rowkey changes from its initial scattered state in the incremental data storage area to a concentrated state in the complete data storage area, which reduces the I/O overhead of exact Rowkey queries in the complete data storage area.
The data file management method and device of the present application are described in detail below with reference to specific embodiments, which are not intended to limit the scope of protection of the present application.
Referring to FIG. 2, FIG. 2 is a flowchart of one embodiment of the data file management method of the present application. The data file management method of this embodiment includes:
Step S101: When the incremental data storage area reaches a first data file merge condition, merge the record fragments corresponding to each primary key in the data files of the incremental data storage area with the historical complete record found for that primary key, to form the complete record at the merge time corresponding to each primary key;
In the embodiments of the present application, the primary key (Rowkey) is the unique identifier of each sub-table schema within the nested table schema (Schema) supported by NoSQL. A blog is used below as an example to illustrate a nested Schema. The Schema of the blog table (Feed_Table) is defined as follows:
{ // definition of the blog table (feed_table)
    user_id // user id
    user_name // user name
    {
        feed_id // post id
        feed_posttime // post time
        feed_content // post content
        {
            comment_id // comment id
            comment_posttime // comment time
            comment_content // comment content
        }
    }
}
The Schema of Feed_Table includes three levels of sub-Schema, which define user information (user_id, user_name), post information (feed_id, feed_posttime, feed_content), and comment information (comment_id, comment_posttime, comment_content). The three have a nested subordinate relationship. User information, post information, and comment information each have a unique identifier, which in Feed_Table are user_id, feed_id, and comment_id respectively, where user_id is called the primary key of Feed_Table, that is, the Rowkey.
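As a hedged illustration of this nesting, the three levels can be modelled with Python dataclasses. The class names `UserRow`, `Feed`, and `Comment`, the list fields `feeds` and `comments`, and the field types are assumptions; only the field names taken from the Schema come from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    comment_id: str        # unique identifier of the comment sub-Schema
    comment_posttime: int
    comment_content: str


@dataclass
class Feed:
    feed_id: str           # unique identifier of the post sub-Schema
    feed_posttime: int
    feed_content: str
    comments: List[Comment] = field(default_factory=list)  # hypothetical container


@dataclass
class UserRow:
    user_id: str           # primary key (Rowkey) of Feed_Table
    user_name: str
    feeds: List[Feed] = field(default_factory=list)        # hypothetical container


row = UserRow("user1", "Ann")
row.feeds.append(Feed("feed1", 100, "first post"))
row.feeds[0].comments.append(Comment("c1", 101, "nice"))
```

A write that adds one post or one comment changes only part of a `UserRow`, which is why an incremental write produces record fragments rather than whole records.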
In the embodiments of the present application, data files are divided into incremental data and complete data, which correspond to the two storage areas. Incremental data is stored in the incremental data storage area; for a given Rowkey, it is the incremental data of that Rowkey. Complete data is stored in the complete data storage area; for a given Rowkey, it is the complete data of that Rowkey.
A user can set in advance, as needed, the data merge condition of the incremental data storage area, that is, the first data file merge condition. For example, the merge may be performed after a preset period of time, when the amount of data in the incremental data storage area reaches a predetermined threshold, or whenever new incremental data appears in the incremental data storage area. As soon as the incremental data storage area reaches the first data file merge condition, the process of merging the data files of the incremental data storage area is executed.
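A minimal sketch of how such a merge condition could be checked is given below. The `MergePolicy` class, its field names, and the `should_merge` function are hypothetical; the application only names the three kinds of condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergePolicy:
    """Hypothetical settings for the first data file merge condition."""
    max_interval_s: Optional[float] = None   # merge after this much time has passed
    max_bytes: Optional[int] = None          # merge once this much data has accumulated
    merge_on_any_new_data: bool = False      # merge as soon as new data appears


def should_merge(policy: MergePolicy, seconds_since_last: float,
                 incremental_bytes: int, has_new_data: bool) -> bool:
    """Return True if any configured condition is met."""
    if policy.merge_on_any_new_data and has_new_data:
        return True
    if policy.max_interval_s is not None and seconds_since_last >= policy.max_interval_s:
        return True
    if policy.max_bytes is not None and incremental_bytes >= policy.max_bytes:
        return True
    return False


size_policy = MergePolicy(max_bytes=1024)
eager_policy = MergePolicy(merge_on_any_new_data=True)
```

Treating the conditions as alternatives, any one of which triggers the merge, is a design choice of this sketch and not something the application prescribes.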
When the data files of the incremental data storage area are merged, the historical record of each Rowkey in the complete data storage area takes part in the merge, and the merge yields the complete record at the merge time corresponding to that Rowkey. The complete record at the merge time can also be understood as the latest complete record: it is the complete record of the Rowkey obtained by the current merge. In other words, until the next merge of data files in the incremental data storage area that contain records of this Rowkey, the record of the Rowkey is complete. Each Rowkey record carries a scalar indicating how new it is (such as a timestamp) when it is formed.
The embodiments of the present application distinguish between the historical complete record and the complete record at the merge time. The historical complete record is the first record of the Rowkey found in the complete data storage area, searching from newest to oldest, before the file merge starts; this first record contains all records of the Rowkey from before the file merge. A Rowkey that is inserted into the complete data storage area for the first time has no historical complete record. The complete record at the merge time is the set of all records of the Rowkey written into the data file of the complete data storage area once the current file merge has finished (including the records of the Rowkey merged previously and those merged this time). The complete record at the merge time is valid only for a limited period, that is, only until the next merge of records for this Rowkey.
In the data files of the incremental data storage area, data is arranged in Rowkey order. During the merge, the records of each Rowkey in the data files are merged with the historical complete record found for it, which yields the complete record at the merge time corresponding to each Rowkey. Here, the records of each Rowkey in the data files mean all record fragments corresponding to that Rowkey.
Step S102: Write the complete record at the merge time corresponding to each primary key into a newly created data file in the complete data storage area, where the complete record at the merge time corresponding to each primary key serves as the output of an exact query for that primary key in the complete data storage area;
The complete records at the merge time obtained for each Rowkey are all written into the newly created data file in the complete data storage area. This new data file is the target data file generated in the complete data storage area by the merge, and it stores the complete record at the merge time for each Rowkey obtained by merging the data files of the incremental data storage area.
Because an exact Rowkey query in the complete data storage area proceeds in the order of file generation time, after the merge has finished and before the next merge of that Rowkey's records, a query for the Rowkey in the complete data storage area returns the complete record at the merge time corresponding to that Rowkey.
The above merge process may also be called a vertical merge process. It is a file merge across storage areas that merges Rowkey record fragments and gathers each Rowkey together, so that any exact Rowkey query on the complete data storage area requires only one I/O operation.
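The effect on query cost can be sketched as follows, under a deliberately simple cost model that is an assumption of this example: reading the records of one data file counts as one I/O operation, and checking whether a file contains a Rowkey is free (for instance because of an in-memory index). The `DataFile` type and both function names are hypothetical.

```python
from typing import Dict, List

# Each data file maps a Rowkey to the items stored for it. Files are listed
# oldest first. The layout and the cost model are illustrative only.
DataFile = Dict[str, List[str]]


def files_read_without_merge(rowkey: str, files: List[DataFile]) -> int:
    """Fragments are scattered, so every file holding the Rowkey must be read."""
    return sum(1 for f in files if rowkey in f)


def files_read_after_merge(rowkey: str, complete_files: List[DataFile]) -> int:
    """Scan from newest to oldest; the first hit already is the complete record."""
    for f in reversed(complete_files):
        if rowkey in f:
            return 1
    return 0


scattered = [{"user1": ["feed1"]}, {"user1": ["feed2"]}, {"user1": ["feed3"]}]
merged = [{"user1": ["feed1"]}, {"user1": ["feed1", "feed2", "feed3"]}]
print(files_read_without_merge("user1", scattered))  # 3
print(files_read_after_merge("user1", merged))       # 1
```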
After the above merge process is complete, the data files of the incremental data storage area can be deleted to free storage space.
As the above description of the embodiment shows, the data file management method of the present application merges the record fragments corresponding to each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey, forms the complete record at the merge time corresponding to each Rowkey, and writes it into the complete data storage area. In this way, the data files of an incremental database are dynamically managed across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in a concentrated state in the complete data storage area, which reduces the I/O overhead of exact Rowkey queries in the complete data storage area.
In another embodiment of the data file management method of the present application, referring to FIG. 3, FIG. 3 is a flowchart of forming the complete record at the merge time corresponding to each primary key. In this embodiment, forming the complete record at the merge time corresponding to each primary key includes the following sub-steps:
Sub-step S201: Iterate over the record fragments of each primary key in the data files of the incremental data storage area, in primary key order, to obtain the incremental record of each primary key;
The n data files of the incremental data storage area are iterated over in Rowkey order, and all record fragments of each Rowkey obtained from these n data files form the incremental record of that Rowkey.
Sub-step S202: Search the main memory or the data files of the complete data storage area for the historical complete record corresponding to each primary key;
The historical complete record corresponding to each Rowkey is searched for in the main memory or in the data files of the complete data storage area. Specifically, the search is first performed in the data files in main memory, and if nothing is found there, it continues in the data files of the complete data storage area. The search proceeds from newest to oldest according to the time at which each primary key's records were formed, until a record of the Rowkey is found. The record found is the one with the latest timestamp, that is, the historical complete record of the Rowkey. This search is performed for every Rowkey.
Sub-step S203: Determine whether a historical complete record corresponding to the primary key has been found;
After the above search has been performed for every Rowkey, it is determined whether a historical complete record was found for each Rowkey. For a Rowkey for which no historical complete record was found, sub-step S205 is performed; for a Rowkey for which a historical complete record was found, sub-step S204 is performed.
Sub-step S204: Merge the incremental record of each primary key with the historical complete record found for that primary key, to form the complete record at the merge time corresponding to each primary key;
For a Rowkey whose historical complete record was found, that historical complete record is merged with the incremental record of the Rowkey to form the complete record at the merge time corresponding to the Rowkey, that is, the latest complete record. This merge is performed for every Rowkey whose historical complete record was found, which yields the complete record at the merge time corresponding to each such Rowkey.
Sub-step S205: Use the incremental record of the primary key as the complete record at the merge time corresponding to the primary key;
For a Rowkey whose historical complete record was not found, the incremental record of the Rowkey is used as its complete record at the merge time and is written into the target data file of the complete data storage area.
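Sub-steps S201 to S205 can be sketched in Python as follows. This is a simplified model under stated assumptions, not the implementation of the application: a record is modelled as a list of items, "merging" is modelled as appending the incremental items to the historical ones, and the names `Fragment`, `CompleteFile`, `find_history`, and `vertical_merge` are hypothetical. The sketch relies on the standard-library `heapq.merge` to iterate several Rowkey-sorted files in Rowkey order and on `itertools.groupby` to gather the fragments of each Rowkey.

```python
import heapq
from itertools import groupby
from typing import Dict, List, Optional, Tuple

Fragment = Tuple[str, str]  # (rowkey, item); each incremental file is sorted by rowkey
# A file in the complete data storage area: (creation timestamp, {rowkey: record})
CompleteFile = Tuple[int, Dict[str, List[str]]]


def find_history(rowkey: str, memory: Dict[str, List[str]],
                 complete_files: List[CompleteFile]) -> Optional[List[str]]:
    """S202: look in main memory first, then in files from newest to oldest."""
    if rowkey in memory:
        return memory[rowkey]
    for _, records in sorted(complete_files, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            return records[rowkey]
    return None


def vertical_merge(incremental_files: List[List[Fragment]],
                   memory: Dict[str, List[str]],
                   complete_files: List[CompleteFile]) -> Dict[str, List[str]]:
    target: Dict[str, List[str]] = {}
    # S201: iterate all files in Rowkey order and group the fragments per Rowkey.
    ordered = heapq.merge(*incremental_files, key=lambda frag: frag[0])
    for rowkey, group in groupby(ordered, key=lambda frag: frag[0]):
        incremental = [item for _, item in group]
        history = find_history(rowkey, memory, complete_files)  # S202
        if history is not None:                                 # S203
            target[rowkey] = history + incremental              # S204
        else:
            target[rowkey] = incremental                        # S205
    return target


incremental_files = [
    [("k1", "c"), ("k1", "d"), ("k3", "b")],
    [("k1", "e"), ("k2", "a")],
]
complete_files = [(1, {"k1": ["a"], "k3": ["a"]}), (2, {"k1": ["a", "b"]})]
merged = vertical_merge(incremental_files, {}, complete_files)
```

For `k1` the history comes from the newer file (timestamp 2), so the older record in the file with timestamp 1 is ignored; `k2` has no history and takes the S205 branch.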
The vertical merge process is illustrated below by way of example. Refer to the schematic diagram of the storage structure shown in FIG. 6:
Data file A and data file B in the incremental data storage area contain incremental post (feed) data of User1, User2, and User3. Data file A contains feed3 and feed4 of User1 and feed2 and feed3 of User3; data file B contains feed5 of User1 and feed1 of User2. Here User1, User2, and User3 are different Rowkeys as described above.
Data file 1 and data file 2 in the complete data storage area are data files generated by earlier vertical or horizontal merge processes. Data file 1 was generated at time point t1 and holds the complete records of User1 and User3 at time t1, that is, feed1 of User1 and feed1 of User3; it is the result of a vertical file merge or of a previous round of horizontal file merge. Data file 2 was generated at time point t2 and holds the complete record of User1 at time t2, that is, feed1 and feed2 of User1; it is the result of a vertical file merge. Here t2 is later than t1. Data file 3 is a newly created data file used to store the output of the current vertical merge. The vertical merge proceeds as follows:
(1) When the vertical merge starts, data file A and data file B of the incremental data storage area are iterated over in Rowkey order, and the Rowkey record fragments obtained from data file A and data file B form the incremental record of each Rowkey: feed3, feed4, and feed5 of User1 form the incremental record of User1, feed1 of User2 forms the incremental record of User2, and feed2 and feed3 of User3 form the incremental record of User3;
(2) The historical complete record of each Rowkey is searched for in the main memory or in the data files of the complete data storage area. Specifically, main memory is searched first, and if nothing is found there, the complete data storage area is searched. The search proceeds from newest to oldest according to the time at which each primary key's records were formed, until a record of the Rowkey is found; the record found is the one with the latest timestamp, that is, the historical complete record of the Rowkey. This embodiment assumes that no historical complete record of any Rowkey is found in main memory. In the data files of the complete data storage area, the historical complete record of User1 is searched for first, and feed1 and feed2 of User1 are found in data file 2; this is the historical complete record of User1. User2 is then searched for in the same way, but no historical complete record is found. Finally the historical complete record of User3 is found, namely feed1 of User3 in data file 1;
(3) The historical complete record found for each Rowkey is merged with the incremental record of that Rowkey to obtain the latest complete record of the Rowkey, which is written into the newly created data file in the complete data storage area. That is, feed1 to feed5 of User1 are written into data file 3; for User2, which has no historical complete record, the incremental data feed1 of User2 is written directly into data file 3; and feed1, feed2, and feed3 of User3 are written into data file 3 of the complete data storage area. The above records may also be written to main memory at the same time;
(4) The vertical merge is complete; the merged data file A and data file B of the incremental data storage area are deleted, and the process ends.
Referring to FIG. 4, FIG. 4 is a flowchart of another embodiment of the data file management method of the present application. The data file management method of this embodiment includes the following steps:
Step S301: When the incremental data storage area reaches a first data file merge condition, merge the record fragments corresponding to each primary key in the data files of the incremental data storage area with the historical complete record found for that primary key, to form the complete record at the merge time corresponding to each primary key;
Step S302: Write the complete record at the merge time corresponding to each primary key into a newly created data file in the complete data storage area, where the complete record at the merge time corresponding to each primary key serves as the output of an exact query for that primary key in the complete data storage area;
Step S303: Delete the data files of the incremental data storage area;
After the records of each Rowkey in the data files of the incremental data storage area have been merged and the resulting complete record at the merge time of each Rowkey has been written into the complete data storage area, the data files of the incremental data storage area are deleted, which frees space in the incremental data storage area for the next batch of incremental data.
Step S304: When the complete data storage area reaches a second data file merge condition, merge the data files, stored in the complete data storage area, that contain the complete records at the respective merge times, and delete the redundant records of each primary key in the complete data storage area;
After the above merge of data files across storage areas, once the complete record at the merge time has been formed in the complete data storage area, the historical complete record becomes invalid and needs to be reclaimed in order to eliminate redundant Rowkey data. A data file merge inside the complete data storage area is therefore also performed. This process may be called a horizontal data file merge process. Its purpose is to eliminate redundant Rowkeys, discard invalid Rowkey records, and reclaim storage space.
In practice, a user can set in advance, as needed, the data merge condition of the complete data storage area, that is, the second data file merge condition. For example, the merge may be performed after a predetermined period of time, when the amount of data reaches a predetermined threshold, or each time a data merge of the incremental data storage area has completed. As soon as the complete data storage area reaches the second data file merge condition, the merge of the data files of the complete data storage area starts.
The merge of the data files, stored in the complete data storage area, that contain the complete records at the respective merge times can be implemented with any of several prior-art algorithms for eliminating data redundancy, such as a merging algorithm. In another embodiment of the data file management method, a merging algorithm is used, by way of example, to merge these data files. Referring to FIG. 5, FIG. 5 is a flowchart of using a merging algorithm to merge the data files, stored in the complete data storage area, that contain the complete records at the respective merge times. In this embodiment, merging these data files includes the following sub-steps:
Sub-step S401: from the data files in the complete data storage area that contain the complete records of each merge time, find the latest data file in which each primary key is located, the latest data file being the one formed most recently;
The data files of the complete data storage area that contain the complete records of each merge time are all the data files stored in the complete data storage area at the time of the merge. From these data files, the latest data file containing each Rowkey is found. The latest data file is the one formed most recently: every data file in the complete data storage area carries a scalar indicating its age (such as a timestamp) when it is generated, so the most recently formed data file holds the newest and most complete record fragment for that Rowkey.
In a preferred embodiment, before the lookup, an iterator goes through the data files of the complete data storage area in the order in which they were generated and iterates over them in Rowkey order, for example User1, User2, User3, and so on, and then looks up the latest data file for each Rowkey in that order. That is, it first finds the latest data file containing User1, then the latest data file containing User2, and so on.
Sub-step S402: obtain the complete record for each primary key from the latest data file in which that primary key is located, write it to the merged data file of the complete data storage area, and delete the data files of the complete data storage area that have been merged;
The record fragment for each Rowkey is obtained from the latest data file containing that Rowkey and written to the merged data file of the complete data storage area, after which the data files of the complete data storage area that have been merged are deleted. The merged data file is the target data file in which the complete data storage area stores the result of merging its internal data files.
The internal merge of the complete data storage area is illustrated by the following example. Referring to FIG. 7, FIG. 7 is a schematic diagram of the complete data storage area, in which data file 1 and data file 2 are the two data files to be merged and data file 3 is the target file output by the horizontal merge, that is, the merged data file mentioned above. Data file 1 was generated at time t1 and holds the complete records of User1 and User3 as of t1, namely feed1 of User1 and feed1 of User3; it is the result of a vertical file merge or of a previous round of horizontal file merging. User1 and User3 here are the different Rowkeys mentioned above. Data file 2 was generated at time t2 and holds the complete record of User1 as of t2, namely feed1 and feed2 of User1; it is the result of a vertical file merge. Here t2 is later than t1.
When the merge starts: (1) the iterator goes through data file 1 and data file 2 in order of file generation time, iterating in Rowkey order, and takes out Rowkey=User1; (2) it searches data file 1 and data file 2 for the file holding the latest complete record of Rowkey=User1 and finds data file 2, the record in data file 1 now being a historical complete record; (3) it reads the latest complete record of Rowkey=User1 from data file 2, comprising feed1 and feed2, and copies feed1 and feed2 to data file 3. The same steps are then repeated for Rowkey=User3: its record exists only in data file 1, so it is read from data file 1 and written to data file 3. The horizontal merge is then complete, and data file 1 and data file 2 are deleted.
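The horizontal merge of sub-steps S401 and S402 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the `DataFile` class, its `created_at` field, and the function name `horizontal_merge` are hypothetical, data files are modelled as in-memory dictionaries, and deleting the merged source files is left to the caller. The example data reproduces the FIG. 7 scenario.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical in-memory stand-in for a data file of the complete area."""
    name: str
    created_at: int  # generation timestamp; a larger value means a newer file
    records: dict[str, list[str]] = field(default_factory=dict)  # Rowkey -> complete record


def horizontal_merge(files: list[DataFile], target_name: str, now: int) -> DataFile:
    """Keep, for every Rowkey, only the record held by its newest data file."""
    # S401: for each Rowkey, find the most recently formed file containing it.
    latest: dict[str, DataFile] = {}
    for f in files:
        for rowkey in f.records:
            if rowkey not in latest or f.created_at > latest[rowkey].created_at:
                latest[rowkey] = f

    # S402: copy each Rowkey's record from its latest file, in Rowkey order.
    target = DataFile(target_name, now)
    for rowkey in sorted(latest):
        target.records[rowkey] = list(latest[rowkey].records[rowkey])
    return target


# FIG. 7 scenario: data file 1 (time t1) and data file 2 (time t2 > t1).
file1 = DataFile("data file 1", 1, {"User1": ["feed1"], "User3": ["feed1"]})
file2 = DataFile("data file 2", 2, {"User1": ["feed1", "feed2"]})
file3 = horizontal_merge([file1, file2], "data file 3", now=3)
print(file3.records)
```

In this sketch the older copy of User1 in data file 1 is never copied, which is how the redundant record is dropped once data file 1 and data file 2 are deleted.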
Because a layered storage structure is used, records for a Rowkey may exist in main memory, in the incremental data storage area, and in the complete data storage area, so a query for a given Rowkey must gather results from all three storage areas. The following example illustrates the Rowkey query process once the data file management method described above is in use:
Referring to FIG. 8, FIG. 8 is a schematic diagram of the storage structure in one embodiment of the data file management method of this application. Suppose the records for Rowkey=User1 are to be queried; in the figure, records for Rowkey=User1 are spread across all three storage areas. The query proceeds as follows: (1) main memory is searched first for records of Rowkey=User1, and feed5 is found; (2) in the incremental data storage area, data file 1 and data file 2 both hold records of Rowkey=User1, and feed3 and feed4 are found; (3) in the complete data storage area, data file 1 and data file 2 are both found to hold records of Rowkey=User1, and comparing timestamps shows that the record of Rowkey=User1 in data file 2 is the newest and most complete, so only feed1 and feed2 are retrieved and data file 1 is ignored; (4) the results are gathered and returned. As this query process shows, an exact Rowkey lookup in the complete data storage area needs only one I/O.
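The query path above can be sketched as follows. This is a hedged illustration rather than the patented implementation: the function name `query`, the representation of each data file as a `(timestamp, records)` tuple, the placement of feed3 and feed4 in particular incremental files, and the `complete_reads` counter are all assumptions made for the example. The sketch also assumes that checking whether a file contains a Rowkey does not itself cost a record read.

```python
def query(rowkey, main_memory, incremental_files, complete_files):
    """Gather all records of a Rowkey from the three storage areas."""
    found = []

    # (1) Main memory.
    found += main_memory.get(rowkey, [])

    # (2) Incremental area: any file may hold a fragment, so all are visited.
    for _, records in incremental_files:
        found += records.get(rowkey, [])

    # (3) Complete area: visit files newest first and stop at the first hit,
    # because the newest file holds the latest complete record.
    complete_reads = 0
    for _, records in sorted(complete_files, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            complete_reads += 1
            found += records[rowkey]
            break

    # (4) Return the gathered records and the number of complete-area reads.
    return found, complete_reads


# FIG. 8 scenario for Rowkey=User1.
main_memory = {"User1": ["feed5"]}
incremental_files = [(3, {"User1": ["feed3"]}), (4, {"User1": ["feed4"]})]
complete_files = [
    (1, {"User1": ["feed1"], "User3": ["feed1"]}),
    (2, {"User1": ["feed1", "feed2"]}),
]
result = query("User1", main_memory, incremental_files, complete_files)
print(result)
```

The order of the returned list simply follows the lookup order described in the text; how the results are actually summarized is not specified in the source.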
As the description of the above embodiments shows, the data file management method of this application divides data files into incremental data and complete data, stores them in tiers, and merges them in stages. This solves the problem of an exact Rowkey query consuming multiple I/Os in the complete data storage area, so that an exact Rowkey lookup in the complete data storage area needs only one I/O.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of one embodiment of the storage device of this application. The storage device 100 of this embodiment includes a first merge module 11 and a write module 12, wherein:
The first merge module 11 is configured to, when the incremental data storage area meets the first data file merge condition, merge the record fragments for each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey, form the complete record at the merge time for each Rowkey, and output it to the write module 12;
In the embodiments of this application, data files are divided into incremental data and complete data, which map to storage areas as follows. Incremental data is stored in the incremental data storage area; for a given Rowkey, it is that Rowkey's incremental data. Complete data is stored in the complete data storage area; for a given Rowkey, it is that Rowkey's complete data.
The user may preset, as needed, the data merge condition for the incremental data storage area, that is, the first data file merge condition. Examples include a predetermined time elapsing, the data volume of the incremental data storage area reaching a predetermined threshold, or merging the data files of the incremental data storage area whenever new incremental data appears there. As soon as the incremental data storage area meets the first data file merge condition, the merge of its data files is performed.
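A merge condition of this kind can be expressed as a small policy object. The sketch below is illustrative only; the class name `MergePolicy`, its field names, and the units (seconds and bytes) are assumptions and do not come from the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergePolicy:
    """Hypothetical user-configured merge condition for a storage area."""
    interval_seconds: Optional[float] = None  # merge after this much time
    max_bytes: Optional[int] = None           # merge once this much data is stored
    on_new_data: bool = False                 # merge whenever new data appears

    def should_merge(self, seconds_since_last_merge, bytes_stored, has_new_data):
        """Return True if any configured condition is met."""
        if (self.interval_seconds is not None
                and seconds_since_last_merge >= self.interval_seconds):
            return True
        if self.max_bytes is not None and bytes_stored >= self.max_bytes:
            return True
        return self.on_new_data and has_new_data


# Example: merge once the incremental area holds at least 64 MiB.
policy = MergePolicy(max_bytes=64 * 1024 * 1024)
print(policy.should_merge(10.0, 70 * 1024 * 1024, False))
```

The same shape would serve for the second data file merge condition, with "a merge of the incremental data storage area has just completed" taking the place of "new data has appeared".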
When merging the data files of the incremental data storage area, the first merge module 11 includes the Rowkey's historical record from the complete data storage area in the merge, and the merge yields the complete record at the merge time for that Rowkey. The complete record at the merge time can also be understood as the latest complete record: it is the complete record for the Rowkey obtained from this merge. In other words, until the next merge of data files containing records for this Rowkey, the Rowkey's record is complete. Each Rowkey record carries a scalar indicating its age (such as a timestamp) when it is formed.
The embodiments of this application distinguish between the historical complete record and the complete record at the merge time. The historical complete record is the first record of the Rowkey found in the complete data storage area, searching from newest to oldest, before the file merge starts; this first record contains all of the Rowkey's records from before the file merge. A Rowkey that is being inserted into the complete data storage area for the first time has no historical complete record. The complete record at the merge time is the full set of records for the Rowkey written to the newly created data file in the complete data storage area once the current file merge ends, including both the previously merged records and the records merged this time.
In the data files of the incremental data storage area, data is arranged in Rowkey order. During the merge, all records of each Rowkey in the data files are merged with the historical complete record that was found, yielding the complete record at the merge time for each Rowkey.
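The vertical merge described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `vertical_merge`, the representation of each incremental data file as a dictionary from Rowkey to a list of record fragments, and the `find_history` callable are hypothetical. The sketch also assumes the incremental files are supplied oldest first and that a merged record is simply the historical record followed by the new fragments.

```python
def vertical_merge(incremental_files, find_history):
    """Merge each Rowkey's fragments with its historical complete record.

    incremental_files: list of dicts (Rowkey -> list of fragments), oldest first.
    find_history: callable returning the historical complete record of a
                  Rowkey, or None when the Rowkey has never been merged.
    Returns the contents of the newly created complete-area data file.
    """
    # Gather every fragment of each Rowkey across the incremental files.
    fragments = {}
    for records in incremental_files:
        for rowkey, frags in records.items():
            fragments.setdefault(rowkey, []).extend(frags)

    # Prepend the historical complete record, if one exists, and emit the
    # result in Rowkey order.
    new_file = {}
    for rowkey in sorted(fragments):
        history = find_history(rowkey) or []
        new_file[rowkey] = list(history) + fragments[rowkey]
    return new_file


# Example: User1 already has a historical complete record; User2 does not.
history = {"User1": ["feed1"]}
incremental = [{"User1": ["feed2"], "User2": ["feed1"]}, {"User1": ["feed3"]}]
merged = vertical_merge(incremental, history.get)
print(merged)
```

For User2, which has no historical complete record, the merged fragments alone form the complete record at the merge time.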
The write module 12 is configured to write the complete record at the merge time for each Rowkey into a newly created data file in the complete data storage area. Until the next merge of that Rowkey's records, the complete record at the merge time serves as the output of an exact query for the Rowkey in the complete data storage area.
The write module 12 writes the complete record at the merge time obtained for each Rowkey into the newly created data file of the complete data storage area. This newly created data file is the target data file generated in the complete data storage area by the merge, and it stores the complete record at the merge time for each Rowkey in the data files of the incremental data storage area.
Because an exact query for a Rowkey in the complete data storage area proceeds in order of file generation time, any query for the Rowkey in the complete data storage area made after the merge ends and before the next merge of that Rowkey's records returns the complete record at the merge time for that Rowkey.
The merge process described above may also be called vertical merging. It is a cross-storage-area file merge that combines a Rowkey's record fragments so that the Rowkey's data is gathered together, with the result that any exact Rowkey query on the complete data storage area needs only one I/O.
After the merge process completes, the write module 12 may delete the corresponding data files from the incremental data storage area to free storage space.
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of another embodiment of the storage device of this application. The storage device 200 of this embodiment includes a first merge module 21, a write module 22, a second merge module 23, and a lookup module 24, wherein:
The first merge module 21 is configured to, when the incremental data storage area meets the first data file merge condition, merge the record fragments for each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey, form the complete record at the merge time for each Rowkey, and output it to the write module 22;
The write module 22 is configured to write the complete record at the merge time for each Rowkey into a newly created data file in the complete data storage area. Until the next merge of that Rowkey's records, the complete record at the merge time serves as the output of an exact query for the Rowkey in the complete data storage area.
The second merge module 23 is configured to, when the complete data storage area meets the second data file merge condition, merge the data files in the complete data storage area that contain the complete records of each merge time and delete the redundant records of each Rowkey from the complete data storage area.
Once the cross-storage-area merge described above has completed and the complete data storage area holds the complete record at the merge time for each Rowkey, that Rowkey's historical complete record becomes invalid and must be reclaimed to eliminate redundant Rowkey data. The second merge module 23 therefore further merges the data files inside the complete data storage area. This process may also be called horizontal data file merging. Its purpose is to eliminate redundant Rowkeys, discard invalid Rowkey records, and reclaim storage space.
The second merge module 23 may merge the data files in the complete data storage area that contain the complete records of each merge time using any of several data de-duplication algorithms, such as a merge algorithm.
The lookup module 24 is configured to look up the historical complete record for each Rowkey in main memory or in the data files of the complete data storage area, and to output the historical complete record found for the Rowkey to the first merge module 21;
Before the merge, the lookup module 24 looks up the historical complete record for each Rowkey in main memory or in the data files of the complete data storage area. Specifically, it searches the data files in main memory first and, if nothing is found, goes on to search the data files of the complete data storage area. The search proceeds through the data files from newest to oldest in order of generation until a record of the Rowkey is found. The Rowkey record found in this way has the latest timestamp and is the Rowkey's historical complete record. The lookup module 24 performs this lookup for every Rowkey.
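The lookup order described above can be sketched as follows. This is an illustrative sketch only: the function name `find_history` and the representation of main memory as a dictionary and of each complete-area data file as a `(timestamp, records)` tuple are assumptions. Returning `None` models the case of a Rowkey that has no historical complete record.

```python
def find_history(rowkey, main_memory, complete_files):
    """Return the historical complete record of a Rowkey, or None."""
    # Main memory is searched first.
    if rowkey in main_memory:
        return main_memory[rowkey]

    # Otherwise search the complete area from the newest file to the oldest
    # and stop at the first file that contains the Rowkey.
    for _, records in sorted(complete_files, key=lambda f: f[0], reverse=True):
        if rowkey in records:
            return records[rowkey]

    # The Rowkey has never been written to the complete area.
    return None


complete_files = [
    (1, {"User1": ["feed1"], "User3": ["feed1"]}),
    (2, {"User1": ["feed1", "feed2"]}),
]
print(find_history("User1", {}, complete_files))
```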
When the lookup module 24 finds no historical complete record for a Rowkey, the first merge module 21 is configured to merge the record fragments for that Rowkey in the data files of the incremental data storage area and use the result as the complete record at the merge time for that Rowkey.
Referring to FIG. 11, the second merge module 23 in this embodiment further includes a lookup unit 111 and a write unit 112, wherein:
The lookup unit 111 is configured to find, from the data files in the complete data storage area that contain the complete records of each merge time, the latest data file in which each Rowkey is located and to output it to the write unit 112, the latest data file being the one formed most recently;
The data files in the complete data storage area that contain the complete records of each merge time are all the data files in the complete data storage area at the time of the merge. From these data files, the lookup unit 111 finds the latest data file containing each Rowkey. The latest data file is the one formed most recently: every data file in the complete data storage area carries a scalar indicating its age (such as a timestamp) when it is generated, so the most recently formed data file holds the newest and most complete record fragment for that Rowkey.
In a preferred embodiment, before the lookup, the lookup unit 111 goes through the data files of the complete data storage area in the order in which they were generated and iterates over them in Rowkey order, for example User1, User2, User3, and so on, and then looks up the latest data file for each Rowkey in that order. That is, it first finds the latest data file containing User1, then the latest data file containing User2, and so on.
The write unit 112 is configured to obtain the complete record for each Rowkey from the latest data file in which that Rowkey is located, write it to the merged data file of the complete data storage area, and delete the data files in the complete data storage area that contain the complete records of each merge time.
The write unit 112 obtains the record fragment for each Rowkey from the latest data file containing that Rowkey and writes it to the merged data file of the complete data storage area, and then deletes the data files of the complete data storage area that have been merged. The merged data file is the target file in which the complete data storage area stores the result of merging its internal data files.
Referring to FIG. 12, FIG. 12 is a schematic structural diagram of yet another embodiment of the storage device of this application. The storage device 300 of this embodiment includes a processor 31, an interaction interface 32, a random access memory 33, a read-only memory 34, a bus 35, and a network interface unit 36. The processor 31 is coupled through the bus 35 to the interaction interface 32, the random access memory 33, the read-only memory 34, and the network interface unit 36. When the storage device 300 is to be run, it is started by the basic input/output system fixed in the read-only memory 34 or by the boot loader of an embedded system, which boots the storage device 300 into its normal operating state. After the storage device 300 enters its normal operating state, applications and the operating system run in the random access memory 33, and data is received from or sent to the network through the network interface unit 36, such that:
The interaction interface 32 is a device interface for human-computer interaction, used to receive the user's operation instructions; it may be a USB interface, a display interface, or the like;
When the incremental data storage area meets the first data file merge condition and the user's operation instruction to merge the data files of the incremental data storage area is received through the interaction interface, the processor 31 merges the record fragments for each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey, forms the complete record at the merge time for each Rowkey, and writes it into a newly created data file in the complete data storage area. Until the next merge of that Rowkey's records, the complete record at the merge time serves as the output of an exact query for the Rowkey in the complete data storage area;
In addition, in response to the user's operation instruction to merge the data of the complete data storage area, the processor 31 merges the data files in the complete data storage area that contain the complete records of each merge time and deletes the redundant records of each Rowkey from the complete data storage area;
In this embodiment, the processor 31 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
In this embodiment, the incremental data storage area and the complete data storage area described above may correspond to the random access memory 33 and the read-only memory 34, respectively, of the storage device 300 of this embodiment.
As the above embodiments show, the data file management method and device of this application merge the record fragments for each Rowkey in the data files of the incremental data storage area with the historical complete record found for that Rowkey, form the complete record at the merge time for each Rowkey, and write it to the complete data storage area. In this way, the data files are managed dynamically across the incremental data storage area and the complete data storage area, so that each Rowkey is stored in one place in the complete data storage area, which reduces the I/O overhead of exact Rowkey queries there.
In addition, periodically merging the data files inside the complete data storage area eliminates invalid records, reduces Rowkey redundancy and scattering, improves Rowkey query performance, and reclaims storage space effectively.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form. The components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing describes only embodiments of this application and does not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.
Claims
1. A data file management method, characterized by comprising:
when an incremental data storage area meets a first data file merge condition, merging the record fragments corresponding to each primary key in the data files of the incremental data storage area with the historical complete record found for that primary key, to form a complete record at the merge time for each primary key; and writing the complete record at the merge time for each primary key into a newly created data file in a complete data storage area, wherein the complete record at the merge time for each primary key serves as the output of an exact query for that primary key in the complete data storage area.
2. The method according to claim 1, characterized in that the method further comprises: writing the complete record at the merge time for each primary key to main memory.
3. The method according to claim 1 or 2, characterized in that
the method further comprises: when the complete data storage area meets a second data file merge condition, merging the data files in the complete data storage area that contain the complete records of each merge time, and deleting the redundant records of each primary key from the complete data storage area.
4. The method according to claim 3, characterized in that merging the data files stored in the complete data storage area that contain the complete records of the respective merge times, and deleting the redundant records of each primary key from the complete data storage area, is specifically:
using a merge algorithm to merge the data files stored in the complete data storage area that contain the complete records of the respective merge times, and deleting the redundant records of each primary key from the complete data storage area.
5. The method according to claim 4, characterized in that the step of using a merge algorithm to merge the data files stored in the complete data storage area that contain the complete records of the respective merge times, and deleting the redundant records of each primary key from the complete data storage area, comprises:
finding, among the data files stored in the complete data storage area that contain the complete records of the respective merge times, the latest data file in which each primary key is located, the latest data file being the data file formed most recently; and
obtaining the complete record corresponding to each primary key from the latest data file in which that primary key is located, writing it into a merged data file in the complete data storage area, and deleting from the complete data storage area the data files whose merging has been completed.
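As a hedged illustration of claims 3 to 5, the consolidation of the complete data storage area might be sketched as a k-way merge in which, for each primary key, only the record from the most recently formed data file survives. The representation of a data file as a `(created_at, rows)` pair with `rows` sorted by primary key, the assumption that formation times are unique, and the name `compact` are hypothetical and chosen only for this example.

```python
import heapq
from itertools import groupby


def compact(full_files):
    """Merge complete-record data files, keeping the newest record per primary key.

    full_files: list of (created_at, rows); rows is a list of (key, record) pairs
                sorted by key, with each key at most once per file. Assumed layout.
    Returns the rows of a single merged data file, sorted by key.
    """
    # Tag every row with the negated formation time so that, for equal keys,
    # the row from the most recently formed file sorts first.
    streams = [
        [(key, -created_at, record) for key, record in rows]
        for created_at, rows in full_files
    ]
    merged = heapq.merge(*streams, key=lambda t: (t[0], t[1]))
    # The first row of each key group comes from the latest data file;
    # the remaining rows of the group are the redundant records and are dropped.
    return [(key, next(group)[2]) for key, group in groupby(merged, key=lambda t: t[0])]


files = [
    (1, [("u1", {"city": "Oslo"}), ("u2", {"name": "Bob"})]),
    (3, [("u1", {"city": "Rome"})]),
    (2, [("u1", {"city": "Bonn"}), ("u3", {"name": "Cy"})]),
]
merged_file = compact(files)
files = [(4, merged_file)]  # the merged file replaces the files that were merged
```

Replacing `files` with the single merged file corresponds to deleting the data files whose merging has been completed.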
6. The method according to any one of claims 2 to 5, characterized in that, before the step of merging the record fragments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key to form the complete record as of the merge time for each primary key, the method further comprises:
searching the main memory or the data files of the complete data storage area for the historical complete record corresponding to each primary key.
7. The method according to claim 6, characterized in that the step of searching the main memory or the data files of the complete data storage area for the historical complete record corresponding to each primary key comprises:
searching the data files in the main memory in order of the formation time of the complete records corresponding to each primary key, from newest to oldest, and, if nothing is found in the main memory, searching the data files of the complete data storage area, until a complete record corresponding to the primary key is retrieved, the retrieved complete record being the historical complete record corresponding to the primary key.
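The search order of claims 6 and 7 might, purely as an illustrative sketch, look like the following. The function name `find_history` and the `(created_at, records)` representation of a data file are hypothetical; searching the complete data storage area from newest to oldest as well is an assumption of this sketch, since the claim states that order explicitly only for the main memory.

```python
def find_history(key, main_memory_files, full_store_files):
    """Return the historical complete record for key, or None if there is none.

    Both file lists hold (created_at, records) pairs, where records maps
    primary key -> complete record. Assumed layout for this sketch.
    """
    # Main memory is consulted first; the complete data storage area only on a miss.
    for tier in (main_memory_files, full_store_files):
        # Within a tier, visit the most recently formed data file first.
        for _, records in sorted(tier, key=lambda f: f[0], reverse=True):
            if key in records:
                return records[key]
    return None


main_memory = [(5, {"u1": {"v": 5}}), (7, {"u1": {"v": 7}})]
full_store = [(1, {"u1": {"v": 1}, "u2": {"v": 2}})]

newest_u1 = find_history("u1", main_memory, full_store)
fallback_u2 = find_history("u2", main_memory, full_store)
missing = find_history("u9", main_memory, full_store)
```

Because the search stops at the first hit, the record returned is always the most recent complete record known for the primary key; a `None` result corresponds to the case in which no historical complete record is found.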
8. The method according to claim 6, characterized in that, when no historical complete record corresponding to the primary key is found, merging the record fragments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key to form the complete record as of the merge time for each primary key is specifically:
merging the record fragments corresponding to the primary key in the data files of the incremental data storage area, the result serving as the complete record as of the merge time for the primary key.
9. The method according to any one of claims 1 to 8, characterized in that the method further comprises: deleting the data files of the incremental data storage area.
10. A storage device, characterized by comprising a first merge module and a write module, wherein:
the first merge module is configured to, when an incremental data storage area meets a first data file merge condition, merge the record fragments corresponding to each primary key in the data files of the incremental data storage area with the found historical complete record corresponding to that primary key, to form, for each primary key, a complete record as of the merge time, and to output it to the write module; and
the write module is configured to write the complete record as of the merge time for each primary key into a newly created data file in a complete data storage area, wherein the complete record as of the merge time for each primary key serves as the output result of an exact query on that primary key in the complete data storage area.
11. The device according to claim 10, characterized in that the write module is further configured to write the complete record as of the merge time for each primary key into main memory.
12. The device according to claim 10 or 11, characterized in that the device further comprises a second merge module, wherein:
the second merge module is configured to, when the complete data storage area meets a second data file merge condition, merge the data files stored in the complete data storage area that contain the complete records of the respective merge times, and delete the redundant records of each primary key from the complete data storage area.
13. The device according to claim 12, characterized in that the second merge module comprises a search unit and a write unit, wherein:
the search unit is configured to find, among the data files stored in the complete data storage area that contain the complete records of the respective merge times, the latest data file in which each primary key is located, the latest data file being the data file formed most recently; and
the write unit is configured to obtain the complete record corresponding to each primary key from the latest data file in which that primary key is located, write it into a merged data file in the complete data storage area, and delete from the complete data storage area the data files whose merging has been completed.
14. The device according to any one of claims 11 to 13, characterized in that the device further comprises a search module, wherein:
the search module is configured to search the main memory or the data files of the complete data storage area for the historical complete record corresponding to each primary key, and to output each historical complete record found to the first merge module.
15. The device according to claim 14, characterized in that, when the search module finds no historical complete record corresponding to the primary key, the first merge module is configured to merge the record fragments corresponding to the primary key in the data files of the incremental data storage area, the result serving as the complete record as of the merge time for the primary key.
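For illustration only, the cooperation of the search module, first merge module and write module of claims 10, 11, 14 and 15 might be sketched as three small Python classes. The class names, the use of plain lists of dicts (oldest data file first) to stand in for the main memory and the complete data storage area, and the column-overwrite merge rule are assumptions of this sketch rather than features recited in the claims.

```python
class LookupModule:
    """Finds the historical complete record of a primary key (claim 14)."""

    def __init__(self, main_memory, full_store):
        self.main_memory = main_memory  # list of {key: record}, oldest first
        self.full_store = full_store    # list of {key: record}, oldest first

    def find(self, key):
        for tier in (self.main_memory, self.full_store):
            for data_file in reversed(tier):  # newest data file first
                if key in data_file:
                    return data_file[key]
        return None


class FirstMergeModule:
    """Merges record fragments with the historical complete record (claims 10, 15)."""

    def __init__(self, lookup):
        self.lookup = lookup

    def merge(self, incremental_files):
        merged = {}
        for data_file in incremental_files:  # oldest first
            for key, fragment in data_file.items():
                if key not in merged:
                    # Start from the history, or from nothing if none is found.
                    merged[key] = dict(self.lookup.find(key) or {})
                merged[key].update(fragment)
        return merged


class WriteModule:
    """Writes the merged records to a new data file and to main memory (claims 10, 11)."""

    def __init__(self, main_memory, full_store):
        self.main_memory = main_memory
        self.full_store = full_store

    def write(self, complete_records):
        self.full_store.append(complete_records)   # newly created data file
        self.main_memory.append(complete_records)  # copy kept in main memory


main_memory, full_store = [], [{"u1": {"name": "Ann"}}]
lookup = LookupModule(main_memory, full_store)
merger = FirstMergeModule(lookup)
writer = WriteModule(main_memory, full_store)
writer.write(merger.merge([{"u1": {"city": "Rome"}}, {"u2": {"name": "Bob"}}]))
```

After the write, a subsequent lookup of `u1` is served from main memory, which is the reason for keeping the merged records there in addition to the complete data storage area.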
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310373456.8 | 2013-08-23 | | |
CN201310373456.8A CN104424219B (en) | 2013-08-23 | 2013-08-23 | A kind of management method and device of data file |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015024406A1 (en) | 2015-02-26 |
Family
ID=52483032
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2014/079700 WO2015024406A1 (en) | 2013-08-23 | 2014-06-12 | Data file management method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104424219B (en) |
WO (1) | WO2015024406A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156070B (en) * | 2015-03-31 | 2019-07-12 | 华为技术有限公司 | A kind of querying method, file mergences method and relevant apparatus |
CN105138622B (en) * | 2015-08-14 | 2018-05-22 | 中国科学院计算技术研究所 | For the insertion operation of LSM tree storage systems and reading and the merging method of load |
CN107861959A (en) * | 2016-09-22 | 2018-03-30 | 阿里巴巴集团控股有限公司 | Data processing method, apparatus and system |
CN107402980A (en) * | 2017-07-06 | 2017-11-28 | 北京亿赛通网络安全技术有限公司 | A kind of processing method and system of big data under Network Environment |
CN110019254A (en) * | 2017-07-17 | 2019-07-16 | 中兴通讯股份有限公司 | Processing method, device and the computer readable storage medium of planning region increment record |
CN109947775B (en) * | 2019-03-13 | 2021-03-23 | 北京微步在线科技有限公司 | Data processing method and device, electronic equipment and computer readable medium |
CN111309673B (en) * | 2020-02-12 | 2023-06-23 | 普信恒业科技发展(北京)有限公司 | Snapshot data generation method and device for incremental data |
CN112395276B (en) * | 2020-11-13 | 2024-05-28 | 中国人寿保险股份有限公司 | Data comparison method and related equipment |
CN113568883B (en) * | 2021-07-29 | 2024-06-04 | 上海哔哩哔哩科技有限公司 | Data writing method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1867902A (en) * | 2003-08-05 | 2006-11-22 | 赛帕顿有限公司 | Emulated storage system |
US20080103692A1 (en) * | 2006-10-25 | 2008-05-01 | Denso Corporation | Road information storage apparatus, program for the same, and system for the same |
CN101794299A (en) * | 2010-01-27 | 2010-08-04 | 浪潮(山东)电子信息有限公司 | Method for increment definition and processing of historical data management |
CN102096685A (en) * | 2009-12-11 | 2011-06-15 | 阿里巴巴集团控股有限公司 | Method and device for synchronizing distributive data into data warehouse |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100373385C (en) * | 2003-01-17 | 2008-03-05 | 中兴通讯股分有限公司 | Method for back-up and restoring important data |
2013
- 2013-08-23: CN application CN201310373456.8A (CN104424219B), status: Active
2014
- 2014-06-12: WO application PCT/CN2014/079700 (WO2015024406A1), status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN104424219B (en) | 2018-10-09 |
CN104424219A (en) | 2015-03-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14837365; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 14837365; Country of ref document: EP; Kind code of ref document: A1 |