CN108241892B - Data modeling method and device

Info

Publication number: CN108241892B
Application number: CN201611207678.2A
Authority: CN
Inventors: 方晓春
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2021-02-19
Anticipated expiration: 2036-12-23
Also published as: CN108241892A

Abstract

The invention discloses a data modeling method and a data modeling device, wherein the method comprises the following steps: acquiring a data source; identifying original variables from the data source; acquiring a derivative variable corresponding to the original variable according to a preset rule base; selecting a preset classification model, and configuring data modeling parameters; and performing data modeling by using the preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables. The modeling method and the modeling device can provide end-to-end modeling service, and reduce the technical threshold of modeling, thereby reducing the technical requirements of a user on data modeling by using a machine learning tool.

Description

Data modeling method and device

Technical Field

The invention relates to the technical field of data modeling, in particular to a data modeling method and device.

Background

With the development of big data technology, a number of data modeling tools have emerged, such as the machine learning tool in Microsoft Azure. Such machine learning tools can find, from a large amount of data, the factors most strongly associated with a target event, build a model, and predict new events. As a typical example, machine learning can be used to build a value model by analyzing users' online and offline behaviors, identify high-value users, increase follow-up with those users, deliver suitable promotional advertisements, and thereby maximize user value and marketing efficiency.

The importance of machine learning is becoming increasingly evident, and the problem it addresses and the way it is solved are becoming better defined: a special group is identified from the whole population, and the method of identifying that group is captured so as to improve the accuracy of predictions for new groups. A typical application scenario is identifying good-quality or poor-quality users and applying the corresponding service policies.

However, using a machine learning tool for data modeling places high demands on the user's computing and statistical skills. For example, the user needs a strong technical background and familiarity with data analysis, mathematical statistics and other specialist knowledge, as well as substantial modeling experience, an understanding of the typical data modeling workflow, and the ability to configure and tune various parameters.

Disclosure of Invention

In view of the above, the present invention has been made to provide a data modeling method and apparatus that overcomes or at least partially solves the above problems.

A method of data modeling, comprising:

acquiring a data source;

identifying original variables from the data source;

acquiring a derivative variable corresponding to the original variable according to a preset rule base; the corresponding relation between the original variable and the derived variable is stored in the preset rule base;

selecting a preset classification model, and configuring data modeling parameters;

and performing data modeling by using the preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables.

Optionally, identifying the original variable from the data source specifically includes: extracting data information from the data source according to a first preset rule, the extracted data information being the original variable.

Optionally, before identifying the original variable from the data source, the method further includes: extracting a substring from a character string in the data source according to a second preset rule, and using the substring as the data source for identifying the original variable.

Optionally, selecting the preset classification model includes: selecting a plurality of preset classification models;

configuring the data modeling parameters includes: configuring random seeds and configuring the proportion of a training set and a test set;

performing data modeling by using the preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables specifically includes:

performing data modeling with each preset classification model separately, according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables.

Optionally, after data modeling is performed with each preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables, the method further includes:

outputting the modeling result of each preset classification model, and comparing the modeling results to recommend an optimal classification model.

Optionally, the method further comprises:

acquiring the data distribution of the original variable and/or the derivative variable.

A data modeling apparatus, comprising:

a first obtaining unit for obtaining a data source;

an identification unit for identifying an original variable from the data source;

a second obtaining unit for obtaining a derivative variable corresponding to the original variable according to a preset rule base, the preset rule base storing the correspondence between the original variable and the derivative variable;

a selecting unit for selecting a preset classification model;

a configuration unit for configuring data modeling parameters;

and a modeling unit for performing data modeling by using the preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables.

Optionally, the apparatus further comprises:

an extraction unit for extracting a substring from a character string in the data source according to a second preset rule, and using the substring as the data source for identifying the original variable.

Optionally, the selecting unit is specifically a unit for selecting a plurality of preset classification models;

the configuration unit is specifically a unit for configuring random seeds and configuring the proportion of a training set and a test set;

the modeling unit is specifically a unit for performing data modeling by respectively utilizing each preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables.

Optionally, the apparatus further comprises:

an output unit for outputting the modeling result of each preset classification model and comparing the modeling results to recommend the optimal classification model.

By means of the above technical solution, the data modeling method and device provided by the invention cover the whole modeling process, from data processing (including business processing of character strings, that is, extracting substrings from character strings in the data source; automatically identifying original variables; and automatically acquiring the derivative variables corresponding to the original variables) through configuration of data modeling parameters to data modeling. The user does not need to carry out these steps unaided, but completes the whole process step by step under product guidance.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic flow chart illustrating a data modeling method according to an embodiment of the present invention;

FIG. 2 illustrates a schematic diagram of generating a geographic location based on an identified IP address;

FIG. 3 illustrates an interface diagram of selected classification models and configuration modeling parameters provided by an embodiment of the present invention;

FIG. 4 illustrates a data histogram based on age information;

FIG. 5 is a flow chart of a data modeling method according to a second embodiment of the present invention;

FIG. 6 is a diagram illustrating modeling results of various classification models provided by the second embodiment of the present invention;

FIG. 7 shows a schematic structural diagram of a data modeling apparatus according to a third embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Example one

Fig. 1 is a schematic flow chart of a data modeling method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:

S11, acquiring a data source.

S12, extracting a sub-string from the character string in the data source according to a second preset rule, and taking the sub-string as the data source for identifying the original variable:

Generally, the data source comprises a plurality of character strings. A substring is extracted from a character string in the data source according to the second preset rule and is used as the data source for identifying original variables, so that more original variables can be identified in subsequent steps.

This step can also be regarded as business processing of the character strings in the data source.

By way of example, the business processing of the character strings in the data source may be name splitting or the like, for instance extracting the last name from a character string that holds a full name.
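The name-splitting example can be sketched as follows. This is only an illustration under stated assumptions: the source does not define the second preset rule, so the sketch assumes names are written as whitespace-separated tokens with the last name at the end, and the function name `extract_last_name` and the sample records are hypothetical.

```python
def extract_last_name(full_name: str) -> str:
    """Return the last whitespace-separated token of a name, or '' if empty.

    Assumption (not from the source): the last name is the final token.
    """
    parts = full_name.split()
    return parts[-1] if parts else ""


# Hypothetical character strings taken from a data source.
records = ["Ada Lovelace", "Alan Mathison Turing", ""]
last_names = [extract_last_name(record) for record in records]
```

The extracted substrings would then be passed on as the data source for identifying original variables.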

S13, identifying original variables from the data source:

Specifically, the original variables are identified from the data source according to a first preset rule. As an example, the original variable may be an IP address or a date.

In the embodiment of the present invention, the first preset rule may be, for example: identify a character string that contains only digits apart from 3 half-width periods in the middle, and that begins and ends with a digit.

It should be noted that, if business processing is performed on the character strings in the data source before the original variable is identified, this step is specifically: identifying the original variable, according to the first preset rule, from the data source obtained after the business processing of the character strings.

By way of example, the data modeling method provided by the invention can automatically identify original variables such as IP addresses and dates.

As an example, a specific method for automatically identifying an IP address may be: using a regular expression, a character string that contains only digits apart from 3 half-width periods in the middle, and that begins and ends with a digit, is identified as an IP address.
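One possible reading of this rule as a regular expression is sketched below. The source gives no pattern, so the pattern, the name `IP_RULE` and the function `is_ip_address` are assumptions; the sketch additionally assumes at least one digit between consecutive periods.

```python
import re

# Assumed pattern: four runs of half-width digits separated by exactly
# three half-width periods, so the string starts and ends with a digit.
IP_RULE = re.compile(r"[0-9]+(?:\.[0-9]+){3}")


def is_ip_address(value: str) -> bool:
    """Return True when the whole string satisfies the assumed rule."""
    return IP_RULE.fullmatch(value) is not None
```

Note that the rule as stated checks only the shape of the string, so a value such as `999.1.1.1` also passes. If stricter validation is wanted, Python's standard `ipaddress.ip_address` rejects out-of-range octets.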

S14, acquiring a derivative variable corresponding to the original variable according to a preset rule base; the preset rule base stores the corresponding relation between the original variable and the derivative variable:

As an example, embodiments of the present invention may generate a corresponding geographic location based on the identified IP address. FIG. 2 shows a schematic diagram of generating a geographic location (i.e. a derivative variable) from an identified IP address according to the correspondence between IP addresses and geographic locations.
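The lookup of a derivative variable in a rule base can be sketched as below. The structure of the preset rule base is not disclosed in the source, so this sketch assumes a mapping from network prefixes to location names; `RULE_BASE`, `derive_location`, the prefixes (reserved documentation ranges) and the place names are all hypothetical.

```python
import ipaddress
from typing import Optional

# Hypothetical rule base: network prefix -> geographic location.
# The prefixes are documentation-only ranges; the pairing is made up.
RULE_BASE = {
    ipaddress.ip_network("192.0.2.0/24"): "Beijing",
    ipaddress.ip_network("198.51.100.0/24"): "Shanghai",
}


def derive_location(ip_text: str) -> Optional[str]:
    """Return the location stored for the IP address, or None if no rule matches."""
    address = ipaddress.ip_address(ip_text)
    for network, location in RULE_BASE.items():
        if address in network:
            return location
    return None
```

For example, `derive_location("192.0.2.17")` returns the location stored for the first prefix, and an address outside every prefix yields `None`.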

S15, selecting a preset classification model, configuring data modeling parameters:

Configuring the data modeling parameters includes configuring random seeds and the proportion of the training set and the test set.

Configuring the random seed may specifically be: testing with multiple random seeds and averaging the predictive performance obtained across those seeds.
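Averaging performance over several seeds can be sketched as follows. The source does not say how a single run is scored, so the sketch takes the scoring routine as a parameter; `average_over_seeds` and the placeholder `toy_evaluate` are hypothetical names, and the scores it returns are made up.

```python
from statistics import mean
from typing import Callable, Iterable


def average_over_seeds(evaluate: Callable[[int], float], seeds: Iterable[int]) -> float:
    """Run one evaluation per seed and return the mean score."""
    return mean(evaluate(seed) for seed in seeds)


def toy_evaluate(seed: int) -> float:
    # Placeholder for "train and score a model with this seed".
    return 0.80 + 0.01 * (seed % 3)


average_score = average_over_seeds(toy_evaluate, seeds=[0, 1, 2])
```

In practice `toy_evaluate` would be replaced by a routine that trains the selected classification model with the given seed and returns its test-set score.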

Configuring the proportion of the training set and the test set may specifically be: setting the percentages of the training set and the test set according to preset requirements. As shown in FIG. 3, the training set accounts for 70% and the test set for 30%.
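A seeded 70%/30% split can be sketched as below. The shuffling strategy and the name `split_train_test` are assumptions; the source states only the proportion and that a random seed is configured.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    rows: Sequence[T], train_ratio: float, seed: int
) -> Tuple[List[T], List[T]]:
    """Shuffle the rows with a fixed seed, then cut them at train_ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(10), train_ratio=0.7, seed=42)
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible for a given seed without touching the global random state.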

The preset classification model is a classification model that is pre-selected in response to a user's manipulation. The classification model may include at least one of a random forest model, a support vector machine model, and a logistic regression model.

The preset classification model may be one type or a plurality of types. A specific embodiment of configuring a plurality of preset classification models will be described in detail in example two.

By way of example, FIG. 3 shows an interface for selecting preset classification models and configuring data modeling parameters provided by an embodiment of the present invention.

S16, performing data modeling by using the preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables:

Specifically, the original variable is taken as an input variable, the derivative variable corresponding to the original variable is taken as an output variable, and data modeling is performed with the preset classification model according to the data modeling parameters.

In an embodiment of the present invention, the data modeling may include at least one of automatic missing-value processing, sample imbalance processing, and data type optimization processing. Data type optimization mainly means optimizing the data type, for example converting continuous data into discrete data. The reason is that some classification models perform better on discrete data, so whether continuous data is converted into discrete data can be decided according to the classification model in use.
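Converting continuous data into discrete data can be sketched with fixed bin edges. The source does not specify a binning scheme, so the edges, labels and the function name `discretize` below are illustrative assumptions.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical bin edges and labels for an age variable.
AGE_EDGES = [18, 30, 45, 60]
AGE_LABELS = ["<18", "18-29", "30-44", "45-59", "60+"]


def discretize(
    value: float,
    edges: Sequence[float] = AGE_EDGES,
    labels: Sequence[str] = AGE_LABELS,
) -> str:
    """Map a continuous value to the label of the bin it falls into."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]


discrete_ages = [discretize(age) for age in [17, 18, 29.5, 30, 60]]
```

Whether to apply such a conversion would depend on the classification model selected, as the paragraph above notes.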

These processing steps spare the user the work of tuning model parameters and of feature engineering, apply best practices of the modeling process, and improve modeling efficiency.

The sample imbalance processing may specifically be: oversampling the classes with few samples (sampling them multiple times) and limiting the sampling of the classes with many samples (capping the number of times they are sampled), so that the number of samples in the small classes is balanced against the number in the large classes.
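The balancing idea can be sketched as below: small classes are sampled with replacement up to a target size and large classes are capped at that size. The target size, the `(features, label)` sample layout and the name `rebalance` are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Any, List, Tuple

Sample = Tuple[Any, str]  # (features, label); layout assumed for this sketch


def rebalance(samples: List[Sample], target_per_class: int, seed: int = 0) -> List[Sample]:
    """Oversample small classes and cap large ones at target_per_class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    balanced: List[Sample] = []
    for group in by_label.values():
        if len(group) >= target_per_class:
            # Limit sampling of the large class (no replacement).
            balanced.extend(rng.sample(group, target_per_class))
        else:
            # Oversample the small class by drawing extra copies with replacement.
            extra = rng.choices(group, k=target_per_class - len(group))
            balanced.extend(group + extra)
    return balanced


# Made-up data: 8 negative samples and 2 positive samples.
data = [(i, "neg") for i in range(8)] + [(i, "pos") for i in range(2)]
balanced = rebalance(data, target_per_class=4)
```

After the call, each class contributes four samples, with the two positive samples repeated as needed.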

As another specific embodiment of the present invention, the data modeling method may further include: acquiring the data distribution of a single original variable and/or a single derivative variable. A bar chart or histogram of the data distribution of the single variable may also be provided. FIG. 4 shows a data histogram based on age information.
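The distribution of a single variable can be summarized with simple bin counts, sketched below. The age values and the bin width are made up for illustration and are not the data behind FIG. 4; `histogram` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def histogram(values: Iterable[int], bin_width: int) -> Counter:
    """Count values per bin; each bin is keyed by its lower bound."""
    return Counter((value // bin_width) * bin_width for value in values)


# Illustrative ages only.
ages = [18, 22, 25, 31, 38, 39, 41, 57]
age_counts = histogram(ages, bin_width=10)

# Print a plain-text bar per bin, lowest bin first.
for lower in sorted(age_counts):
    print(f"{lower}-{lower + 9}: {'#' * age_counts[lower]}")
```

A charting library could render the same counts graphically; the counting step is independent of how the chart is drawn.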

The data modeling method provided by the embodiment of the invention can automatically identify the original variable from the data source, and can automatically acquire the derivative variable corresponding to the original variable according to the corresponding relation between the original variable and the derivative variable in the preset rule base. Therefore, the data modeling method provided by the invention does not need a user to define derived variables, reduces the requirements on computer and statistical professional skills of the user, and further reduces the technical threshold of modeling by using a machine learning tool.

Further, this embodiment covers the whole modeling process, from data processing (including business processing of character strings, automatic identification of original variables, and automatic generation of derivative variables from the original variables) through configuration of data modeling parameters to data modeling. The user does not need to carry out these steps unaided, but completes the whole process step by step under product guidance.

The above is a specific implementation of the data modeling method provided in the first embodiment of the present invention. In this embodiment, the configuration of the data modeling parameters is described by way of example as configuring a classification model. In fact, the data modeling method provided by the present invention can configure a plurality of classification models, and can compare the modeling effects of the classification models, so that the user can select the optimal classification model.

Example two

Fig. 5 is a schematic flow chart of a data modeling method according to a second embodiment of the present invention. As shown in fig. 5, the method comprises the steps of:

S51-S54 are the same as S11-S14 in the first embodiment, and for brevity, will not be described in detail here.

S55, selecting a plurality of preset classification models, configuring data modeling parameters:

The data modeling parameters include random seeds and the proportion of the training set to the test set.

The data modeling method provided by the invention can configure a plurality of classification models, so that the user can conveniently select the optimal classification model for the data source.

S56, performing data modeling with each preset classification model separately, according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables.

S57, outputting the modeling result of each preset classification model, and comparing the modeling results to recommend the optimal classification model:

To compare the strengths and weaknesses of the classification models, the embodiment of the invention can also recommend the optimal classification model according to the modeling result of each preset classification model, so that the user can select the optimal classification model for data modeling according to those results.
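Recommending a model from several modeling results can be sketched as below. The source does not state the selection criterion, so this sketch assumes the model with the highest value of one chosen metric is recommended; `recommend_model`, the metric keys and all scores are hypothetical.

```python
from typing import Dict


def recommend_model(results: Dict[str, Dict[str, float]], metric: str = "f1") -> str:
    """Return the name of the model with the highest value of the chosen metric."""
    return max(results, key=lambda name: results[name][metric])


# Made-up scores, for illustration only.
model_results = {
    "random_forest": {"f1": 0.81, "auc": 0.88},
    "svm": {"f1": 0.78, "auc": 0.90},
    "logistic_regression": {"f1": 0.75, "auc": 0.85},
}
best_by_f1 = recommend_model(model_results, metric="f1")
best_by_auc = recommend_model(model_results, metric="auc")
```

The two calls show that the recommendation depends on the metric chosen, which is one reason the full results are still reported to the user.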

By way of example, the embodiment of the present invention may output the modeling result of each classification model in the form of a model report. As shown in FIG. 6, the model report may include the core indicators of each classification model, such as F1 Score, TP Rate (the proportion of positive samples judged as positive), FP Rate (the proportion of negative samples judged as positive), Accuracy, Recall, Precision, and AUC (area under the ROC curve).
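Most of these indicators follow directly from confusion-matrix counts, as sketched below using their standard definitions. AUC is left out because it needs per-sample scores rather than counts. The function name `classification_metrics` and the example counts are assumptions.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute report indicators from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of dividing by zero when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to the TP Rate
    return {
        "tp_rate": recall,
        "fp_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up counts: 40 true positives, 10 false positives,
# 40 true negatives, 10 false negatives.
report = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

With these counts every indicator except the FP Rate comes out at about 0.8, and the FP Rate at about 0.2.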

In addition, the model report can provide a curve for each core indicator of each classification model, so that the user can judge how well different classification models explain the data by comparing the curves of the same core indicator across models, and can then select an optimal model for data modeling.

The above is a specific implementation of the data modeling method provided in the second embodiment of the present invention. This embodiment covers the whole modeling process of data processing (including business processing of character strings, automatic identification of original variables, and automatic generation of derivative variables from the original variables), classification model configuration, data modeling, and comparison of the modeling results of the classification models. The user does not need to carry out these steps unaided, but completes the whole process step by step under product guidance.

In addition, the data modeling method provided by the invention can also be used for modeling the same data source by adopting a plurality of classification models, and can compare the modeling results of the classification models, so that a user can select the optimal classification model from the modeling results for modeling. Therefore, the data modeling method can automatically optimize the model parameters in the modeling process and select the optimal model from the plurality of classification models, so that the data modeling method improves the accuracy of data modeling, can quickly and accurately find the optimal classification model, can popularize the optimal classification model and ensures the consistency of modeling effects.

Based on the data modeling method provided by the above embodiments, the invention further provides a data modeling device, described in Example three.

Example three

Fig. 7 is a schematic structural diagram of a data modeling apparatus according to a third embodiment of the present invention. As shown in fig. 7, the data modeling apparatus includes:

a first acquisition unit 71 configured to acquire a data source;

an identifying unit 72 for identifying original variables from the data source;

a second obtaining unit 73, configured to obtain, according to a preset rule base, a derivative variable corresponding to the original variable; the corresponding relation between the original variable and the derived variable is stored in the preset rule base;

a selecting unit 74 for selecting a preset classification model;

a configuration unit 75 for configuring data modeling parameters;

and a modeling unit 76, configured to perform data modeling by using the preset classification model according to the data modeling parameters, the original variables and the derivative variables corresponding to the original variables.

As a specific embodiment of the present invention, the data modeling apparatus may further include:

an extracting unit 77, configured to extract a substring from a character string in the data source according to a second preset rule, and to use the substring as the data source for identifying an original variable.

The data modeling device covers the whole modeling process, from data processing (including business processing of character strings, automatic identification of original variables, and automatic acquisition of the corresponding derivative variables from the original variables) through configuration of data modeling parameters to data modeling. The user does not need to carry out these steps unaided, but completes the whole process step by step under product guidance.

As an embodiment of the present invention, the selecting unit 74 may be a unit for selecting a plurality of preset classification models; the configuration unit 75 may be specifically a unit for configuring random seeds, and configuring the proportion of the training set and the test set;

After a plurality of preset classification models are selected, the data source is modeled separately with each classification model. The modeling unit 76 may specifically be a unit that performs data modeling with each preset classification model according to the data modeling parameters, the original variables, and the derivative variables corresponding to them.

Furthermore, in order to compare the modeling effects of the classification models, the data modeling apparatus may further include:

an output unit 78, configured to output the modeling result of each preset classification model and to compare the modeling results to recommend an optimal classification model.

The data modeling device comprises a processor and a memory, wherein the first acquisition unit, the identification unit, the second acquisition unit, the selection unit, the configuration unit, the modeling unit, the extraction unit, the output unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
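The arrangement of program units that a processor runs in turn can be sketched as a small registry. This is purely illustrative: the class `ModelingDevice`, the dictionary passed between units, and the two placeholder units are assumptions, not the structure disclosed in the source.

```python
from typing import Callable, Dict, List

Unit = Callable[[Dict], Dict]


class ModelingDevice:
    """Holds program units in order and runs them one after another."""

    def __init__(self) -> None:
        self.units: List[Unit] = []

    def register(self, unit: Unit) -> Unit:
        self.units.append(unit)
        return unit

    def run(self, state: Dict) -> Dict:
        for unit in self.units:
            state = unit(state)
        return state


device = ModelingDevice()


@device.register
def first_obtaining_unit(state: Dict) -> Dict:
    # Placeholder data source; a real unit would read a file or database.
    return {**state, "source": ["192.0.2.1", "2016-12-23"]}


@device.register
def identification_unit(state: Dict) -> Dict:
    # Placeholder rule: treat strings containing three periods as IP addresses.
    ips = [text for text in state["source"] if text.count(".") == 3]
    return {**state, "original_variables": ips}


result = device.run({})
```

The remaining units (second obtaining, selecting, configuration, modeling, extraction, output) would be registered in the same way, each reading from and adding to the shared state.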

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the technical requirements on a user when a machine learning tool is used for data modeling are reduced by adjusting the kernel parameters.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

Technical effects of the device

The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on a data processing device:

acquiring a data source;

identifying original variables from the data source;

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for modeling data, comprising:

acquiring a data source; the data source comprises at least an IP address;

identifying original variables from the data source; the original variable comprises an IP address; the original variables are identified from the data source according to a first preset rule; the first preset rule is to identify a character string that contains only digits apart from 3 half-width periods in the middle, and that begins and ends with a digit;

acquiring a derivative variable corresponding to the original variable according to a preset rule base; the corresponding relation between the original variable and the derived variable is stored in the preset rule base; the derivative variable comprises a geographic location; the corresponding relation between the original variable and the derived variable comprises the corresponding relation between an IP address and a geographic position;

2. The method according to claim 1, wherein identifying the original variables from the data source is specifically: extracting data information from the data source according to the first preset rule, the extracted data information being the original variable.

3. The method of claim 1, wherein before identifying the original variable from the data source, the method further comprises: extracting a substring from a character string in the data source according to a second preset rule, and using the substring as the data source for identifying the original variable.

4. The method according to any of claims 1-3, wherein the selecting of the preset classification model comprises: selecting a plurality of preset classification models;

5. The method of claim 4, wherein after the data modeling is performed according to the data modeling parameters, the original variables and the derived variables corresponding to the original variables by using the preset classification models, the method further comprises:

6. The method according to any one of claims 1-3, further comprising:

7. A data modeling apparatus, comprising:

a first obtaining unit for obtaining a data source; the data source comprises at least an IP address;

an identification unit for identifying an original variable from the data source; the original variable comprises an IP address; the original variables are identified from the data source according to a first preset rule; the first preset rule is to identify a character string that contains only digits apart from 3 half-width periods in the middle, and that begins and ends with a digit;

the second obtaining unit is used for obtaining a derivative variable corresponding to the original variable according to a preset rule base; the corresponding relation between the original variable and the derived variable is stored in the preset rule base; the derivative variable comprises a geographic location; the corresponding relation between the original variable and the derived variable comprises the corresponding relation between an IP address and a geographic position;

a selecting unit for selecting a preset classification model;

the configuration unit is used for configuring data modeling parameters;

8. The apparatus of claim 7, further comprising:

9. The apparatus according to claim 7 or 8, wherein the selecting unit is specifically a unit for selecting a plurality of preset classification models;

10. The apparatus of claim 9, further comprising: