GB2532075A - System and method for toy recognition and detection based on convolutional neural networks - Google Patents
System and method for toy recognition and detection based on convolutional neural networks
- Publication number
- GB2532075A (Application GB1419928.5A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- toy
- image
- elements
- learning
- digital
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/65—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor automatically by game devices or servers from real world data, e.g. measurement in live racing competition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
- G06V20/647—Three-dimensional objects by matching two-dimensional images to three-dimensional objects
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/20—Input arrangements for video game devices
- A63F13/21—Input arrangements for video game devices characterised by their sensors, purposes or types
- A63F13/213—Input arrangements for video game devices characterised by their sensors, purposes or types comprising photodetecting means, e.g. cameras, photodiodes or infrared cells
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63H—TOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
- A63H33/00—Other toys
- A63H33/04—Building blocks, strips, or similar building parts
- A63H33/06—Building blocks, strips, or similar building parts to be assembled without the use of additional elements
- A63H33/08—Building blocks, strips, or similar building parts to be assembled without the use of additional elements provided with complementary holes, grooves, or protuberances, e.g. dovetails
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/772—Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/66—Trinkets, e.g. shirt buttons or jewellery items
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention concerns a system and method for automatic computer-aided optical recognition of toys, for example construction toy elements. The aim is to detect toy elements 101, 102 in digital images and associate the elements with existing information, possibly as part of a toy-to-life system. The method comprises: reading a digital image, representing a still image or a fragment of a video sequence; creating a learning database of digital images containing toy elements; recognizing a toy element; segmenting the image and detecting one or more toy elements (201, 202) present in a digital image; and matching information from a central database or a web source with a recognized element. The learning and recognising steps may be achieved using a deep convolutional neural network. The method may advantageously recognize toy elements of various sizes invariant of distance, rotation, camera angle, background etc.
Description
Intellectual Property Office Application No. GB1419928.5 RTM Date: 12 October 2015 The following terms are registered trade marks and should be read as such wherever they occur in this document: LEGO, LEGO technic, LEGO Duplo, K'Nex, Tinkertoys, Playskool Pipeworks, Cleversticks, Zometool, Erector Set, Meccano, Merkur, Steel Tec, Trix, FAC-System, Uberstix, Tog'1, Jovo Click 'N Construct, Zaks, Polydron, Geomag, toyblocks, Anchor Stone Blocks, KEVA planks, Kapla, UnitBricks, Rokenbok, Coco, Rasti, Tente, Mega Bloks, Fischertechnik, Playmobil, LocBlocs, Cobiblocks, BettaBuilda, Oxford, Kre-O, Lincoln Logs, GIK, Sticklebricks, EnlightenBrick, Capsela.
Intellectual Property Office is an operating name of the Patent Office www.gov.uk/ipo
System and method for toy recognition and detection based on convolutional neural networks
FIELD OF THE INVENTION
The present invention relates to the application of computer vision technology in the toy-to-life segment, particularly a system and method, based on convolutional neural networks, for the recognition and detection of construction toys that use modular elements based on dimensional constants, and of their assemblies.
BACKGROUND OF THE INVENTION
The toy-to-life market segment currently involves systems in which a toy must have a physical component configured to communicate with a special reader via some form of wireless communication such as RFID or NFC. Examples of this technology are Skylanders, Disney Infinity and Nintendo Amiibo. This invention presents a system where no additional hardware is needed to identify a toy element, create its virtual digital representation and associate it with additional digital data. There are products which incorporate computer vision to accomplish this task, but these are not robust: they are not invariant of the toy element's distance from the image-acquiring device (for example a camera), of rotation of the toy element, of the angle of the camera, of the background or of illumination, or they need a predefined region where a toy element must be placed. Examples of this approach are LEGO Life of George and LEGO FUSION. LEGO FUSION, see web link: http://youtu.beAN41xL.VUE1qX8?1=21145s, is an example of a toy-to-life game where the assembled model has to be placed on a special part, i.e. a scanning plate with a specific pattern printed on it, and the image-acquiring device needs to be aligned at a specific angle to the toy element assembly for recognition to take place. Document US885536B2 discloses a method and apparatus for tracking three-dimensional (3D) objects. The method of tracking a 3D object includes constructing a database to store a set of two-dimensional (2D) images of the 3D object using a tracking background, where the tracking background includes at least one known pattern, receiving a tracking image, determining whether the tracking image matches at least one image in the database in accordance with feature points of the tracking image, and providing information about the tracking image in response to the tracking image matching the at least one image in the database. The method of constructing the database also includes capturing the set of 2D images of the 3D object with the tracking background, extracting a set of feature points from each 2D image, and storing the set of feature points in the database. The present invention does not extract feature points, any background or surface can be used, and it is rotation invariant. Accordingly, it is more robust, enables easier application without additional background elements, and is more accurate. Further, according to the present invention, the accuracy of recognition does not depend on changes of rotation, size, scale, illumination or background. The result achieved by the present invention is therefore an improvement over state-of-the-art solutions: easier and faster object recognition, based on convolutional neural networks, without loss of accuracy.
Neural networks, especially deep convolutional neural networks, are the state-of-the-art method for various object recognition tasks, and the latest scientific work in this field demonstrates the top performance of this kind of model on several recognized benchmarks.
General-purpose computing on graphics processing units (GPGPU) allows very complex deep neural network architectures to be trained in reasonable time. This makes it possible to train very accurate models invariant to common image recognition problems such as changes of scale, rotation, illumination and background.
BRIEF SUMMARY OF THE INVENTION
In accordance with aspects of the invention, a system includes toy elements, an imager device and a processor configured to execute detection and recognition tasks and to identify one or more objects by utilizing a convolutional neural network. Convolutional neural networks allow very accurate and robust object recognition and are thus a very reliable method for recognizing toy elements and their assemblies invariant to common problems in computer vision tasks such as rotation, size, scale, illumination or background change.
One aspect of the invention provides a learning module, i.e. a method for training a deep convolutional neural network to recognize various toy elements.
Another aspect of the invention provides a learning database containing many different variants of images of said toy elements, with accompanying information identifying each toy element, to allow supervised learning.
Another aspect of the invention provides a recognition module, configured to use a learned model to recognize a toy element in a new, previously unseen image in real time.
Another aspect of the invention provides a detection module, configured to perform image segmentation and recognize one or more toy elements in a digital image.
Another aspect of the invention provides a processing module, configured to match the one or more recognized toy parts with information stored in a central database or on the web.
Another aspect of the invention provides a computer readable program encoded in a computer readable medium (e.g. a disk drive or other memory device). The computer readable program includes executable computer program code configured to instruct a system to perform the steps for learning, detection and object recognition described herein.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Additional features and advantages will be made apparent from the following detailed description of embodiments that proceeds with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 presents one possible configuration of objects on the scene. As can be seen in the left image of the figure, there are three objects, 101, 102 and 103. All of these objects are segmented and correctly classified, as shown in the right image of Figure 1. Bounding boxes are placed around the recognized objects and labels are placed at their top left corners. This is an example of segmentation and recognition.
Figure 2 depicts one scenario where the system is used to classify hierarchical objects. As can be seen, the minifigure is correctly recognized, and then lower-hierarchy-level objects are recognized, such as weapons, more specifically the sword 201 and shield 202.
Figure 3 depicts the steps needed to implement the method.
DETAILED DESCRIPTION
The following description relates to the use of a convolutional neural network trained on central processing unit ("CPU") or graphics processing unit ("GPU") architectures to enable accurate and robust optical recognition of toy elements, particularly construction toys and parts. The one or more GPUs are used to quickly perform a series of forward and backward passes on the input data in a parallel manner, modifying and refining the network parameters on each pass.
The mentioned construction toys are those that use modular elements based on dimensional constants, constraints and matches, with various assembly systems such as magnets, studs, notches or sleeves, or without an interlocking connection, etc. Examples of these systems include, and are not limited to: LEGO, LEGO technic, LEGO Duplo, K'Nex, Tinkertoys, Playskool Pipeworks, Cleversticks, Zometool, Erector Set, Meccano, Merkur, Steel Tec, Trix, FAC-System, Uberstix, Tog'1, Jovo Click 'N Construct, Zaks, Polydron, Geomag, toyblocks, Anchor Stone Blocks, KEVA planks, Kapla, UnitBricks, Rokenbok, Coco, Rasti, Tente, Mega Bloks, Fischertechnik, Playmobil, LocBlocs, Cobiblocks, BettaBuilda, Oxford, Kre-O, Lincoln Logs, GIK, Sticklebricks, EnlightenBrick and Capsela. Patents US3005282 A and USD2 describe one such interlocking system and toy figures.
In various embodiments of the invention, toy elements can be interlocking bricks, parts, accessories, mini-figures, weapons, animals, plants or other pieces that can be physically attached to form a toy assembly.
A system configured to automatically recognize and detect a real-world toy element, e.g. a LEGO brick, minifigure, minifigure part, weapon, object, accessory, animal or any other construction part or artefact, from an image or video in real time, comprises: one or more toy elements; an imager device; and one or more processors configured to execute instructions of computer program modules. The modules comprise: a reading module configured to receive an image from a digital image capture device; a learning database of annotated digital images containing toy elements, taken in various conditions and altered with transformations to ensure significant variance for the training procedure; a learning module configured to learn digital representations of toy elements by training a deep convolutional neural network, thus enabling very accurate and robust recognition; a recognition module configured to recognize toy elements in a digital image; a detection module configured to detect and recognize one or more toy elements in a digital image; and a processing module configured to match the one or more recognized toy parts with information stored in the central database or on the web.
The reading module is configured to read an image, said image including an image of one or more real-world toy elements, and to extract from said read image said image of said one or more toy elements. The learning module is configured to learn digital representations of the toy elements whose images are stored in the learning database and thus create a model for recognition, wherein the learning database contains annotated images, i.e. images coupled with information about the toy elements present in them. The learning database can be expanded by adding additional images of toy elements that already exist in the database and new images of toy elements that do not. The recognition module is configured to use the model for recognition to recognize a toy element in a digital image. Further, the detection module is configured to detect one or more toy elements in a digital image, extract their positions represented by the coordinates of the toy elements, extract the detected elements from the background and recognize the extracted elements. The processing module uses the identification of the recognized element and matches that information with additional information stored in the central database or a web source. According to the present invention, the learning module is implemented as a deep convolutional neural network comprising convolutional, rectification, normalization, pooling and fully connected layers of artificial neurons. The learning module implementation is a set of computer program instructions that can be run on one or more central processing units ("CPUs") or one or more graphics processing units ("GPUs"), thus enabling a faster learning process.
The recognition module is configured to feed said digital image data forward through the learned convolutional neural network, causing the network to output a prediction consisting of the probabilities that said image contains particular objects from said learning database. The detection module is configured to segment a digital image, detect one or more toy elements in it and recognize each of them, thus associating the one or more said toy elements with objects stored in the learning database. The processing module is configured to match information about the recognized toy element with information stored in a central database or on the World Wide Web via a web service or another network protocol. The recognition module is implemented as a computer program that runs on a mobile device, a personal computer or a server.
A toy element is a construction toy part such as a minifigure, a minifigure part, a weapon, an accessory, a tool, an animal, a motor part, an electronic sensor part or a construction brick; a toy element can also be an assembly of several toy elements which are physically combinable.
Figure 1 shows one possible scenario where various toy elements are placed on a surface and a digital image is captured via an imager device such as a digital camera. The left image of Figure 1 depicts the original scene with three different objects: a barrel 101, a goat 102 and a Viking minifigure holding a spear 103. The right image includes bounding boxes surrounding the detected objects and labels on top of the rectangles showing the results of the recognition task. As can be seen, the barrel 101 and the goat 102 are recognized and, regarding the minifigure, the specific configuration of the Viking figure holding a spear 103 is recognized. Figure 2 depicts a scenario where the same figure is recognized as holding a sword weapon.
Figure 2 depicts a scenario where the system is used to perform hierarchical object recognition. At the topmost level the recognized object is a Viking minifigure holding a sword. At the lower hierarchy level the objects sword 201 and shield 202 are recognized, the latter more specifically as a Viking's shield 202.
Figure 3 depicts the workflow of a computer implemented method consisting of the learning (a) and prediction (b) phases of the recognition task, and the segmentation (c) workflow for the detection and recognition task.
A computer implemented method comprises the steps of: reading a digital image, said image representing a still image captured with a digital camera or a fragment of a video sequence captured with a video camera; creating a learning database of digital images containing toy elements; learning from the captured images with the goal of being able to recognize a toy element if present in an image; recognizing a toy element if said element is present in an image that has not been used in the learning step; detecting one or more toy elements present in said digital image; and matching information from a central database or a web source with a recognized real-world toy element.
Reading a digital image from an input device, where said image contains an image of a toy element, is done by one or more processors. The learning database is created by taking many photographs of toy elements, annotating them with information about which toy element is present in each image and subtracting the mean value to normalize the pixel colour intensities. The photographs show elements at various locations in an image, captured from various angles, distances and rotations, with different cameras and in different illumination conditions. The learning database is artificially expanded by creating new images from the existing ones by applying horizontal or vertical flips, scaling, rotation, colour intensity changes and affine transformations of the existing images. The learning process is achieved by training a deep convolutional neural network whose hyper-parameters are chosen based on performance on the validation portion of the learning database. Training of the deep convolutional neural network is conducted as a series of forward and backward passes of input information and gradients respectively, wherein the deep convolutional neural network consists of convolutional, rectification, normalization, pooling, interconnected and softmax layers. The top-level interconnected layers of said network can be replaced by another classification algorithm which uses the outputs of the convolutional layers as inputs for classification. Further, recognizing a toy element in an input digital image is done by conducting a forward pass through the trained neural network, the output of which represents the probabilities of the existing classes being present in said image. Recognition can also be done as a series of forward passes over different subcrops of an input digital image, with the outputs of the trained network averaged to increase prediction accuracy. Detection of the one or more toy elements is performed as one or more recognitions after the segmentation of a digital image, wherein the segmentation is done by taking different subcrops of a digital image with a sliding window, after edge detection and contour extraction, or with another segmentation algorithm. Matching information from a central database or a web source with a recognized real-world toy element is conducted by querying a central database or a web service, wherein the matching information is the price of an element, existing colours, construction sets in which the element appears, buying locations and other information stored in the central database or web source.
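As an illustration of the subcrop-averaging step just described, the following minimal Python sketch (assuming a PyTorch model trained as described; `predict_multicrop`, the crop size and the resize value are illustrative choices, not values specified by the invention) averages the network's outputs over ten standard subcrops:

```python
import torch
from torchvision import transforms

def predict_multicrop(model, pil_image, crop_size=224):
    """Average softmax outputs over ten subcrops (four corners, centre, plus flips)."""
    prep = transforms.Compose([
        transforms.Resize(256),
        transforms.TenCrop(crop_size),                    # ten subcrops of the input image
        transforms.Lambda(lambda crops: torch.stack(
            [transforms.ToTensor()(c) for c in crops])),  # shape (10, C, H, W)
    ])
    batch = prep(pil_image)
    model.eval()
    with torch.no_grad():
        logits = model(batch)                             # one forward pass per subcrop
        return torch.softmax(logits, dim=1).mean(dim=0)   # averaged class probabilities
```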
The learning phase is implemented as a series of training steps in which a deep convolutional neural network is optimized via an optimization algorithm, e.g. stochastic gradient descent. The learning database is created by capturing digital images of the toy elements to be recognized, along with information about which object is present in each image. The set of input images, together with an indication of the correct element each image should be interpreted as, is called an annotated database.
The learning database is created by capturing a large number of images of the toy elements that need to be recognized, on various surfaces, with various backgrounds, different angles, zoom levels, rotations, scene positioning and different illumination conditions. Images can also be created using more than one camera, thus allowing for more variation in the learning database. The bigger and more varied the learning database is, the more reliable the predictions made with the learned model will be, this being a measure for preventing the neural network model from over-fitting the dataset. To achieve more robustness against different colour intensities, the mean colour values over all the images are calculated and subtracted from each image, thus normalizing the colour intensities in the learning database.
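A minimal sketch of this normalization step, assuming the learning database is available as a list of RGB images held as NumPy arrays (the function names are illustrative):

```python
import numpy as np

def compute_mean_rgb(images):
    """Per-channel mean colour value over the whole learning database."""
    acc = np.zeros(3, dtype=np.float64)
    for img in images:                      # each img: (H, W, 3) uint8 array
        acc += img.reshape(-1, 3).mean(axis=0)
    return acc / len(images)

def normalize(img, mean_rgb):
    """Subtract the database mean so colour intensities are centred around zero."""
    return img.astype(np.float32) - mean_rgb.astype(np.float32)
```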
The next step is data augmentation, in which every image from the learning database can be used to artificially generate more images through image processing steps, i.e. various distortions including but not limited to horizontal or vertical flipping, skewing (perspective transform), resizing, rotating, colour intensity changes and cropping. In this way more variance is introduced into the learning database.
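By way of example, such an augmentation pipeline could be expressed with torchvision transforms; the parameter ranges below are illustrative assumptions, not values prescribed by the invention:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                                    # horizontal flip
    transforms.RandomVerticalFlip(),                                      # vertical flip
    transforms.RandomRotation(degrees=30),                                # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), shear=10),   # skew / affine
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4), # colour change
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),                  # resize + crop
])
```

Applying `augment` repeatedly to one source image yields many distorted variants for the learning database.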
During the learning process, each of the images from the learning database is used as an input to the convolutional neural network and the output is calculated to determine how close to, or far from, proper object recognition the network is. The error of the network is then propagated backwards through the network and the network's parameters are modified using an optimization method, e.g. stochastic gradient descent. This process is conducted as a series of matrix multiplications and partial derivative calculations.
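A minimal sketch of one such training pass in PyTorch, assuming a model and an annotated data loader as described (the hyper-parameter values are placeholders):

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, lr=0.01, momentum=0.9, device="cuda"):
    """One epoch of forward passes, backward error propagation and SGD updates."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    model.to(device).train()
    for images, labels in loader:                  # annotated learning database
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)    # forward pass; error vs. annotation
        loss.backward()                            # backward pass (gradients)
        optimizer.step()                           # modify the network parameters
```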
A neural network capable of achieving very accurate recognition of objects in images it has never seen before has to be assembled from several or many consecutive layers. These layers may include, and are not limited to, the following layer types: an input layer, convolutional layers, pooling layers, rectification, normalization and other nonlinear layers, interconnected layers, dropout layers, softmax layers and an output layer. The deep architecture of the neural network provides the capability of recognizing complex shapes in a hierarchical fashion, similar to the process of object recognition in the visual cortex of animals or humans. Lower layers specialize in the detection of edges (gradients), colour and texture patterns, while deeper layers specialize in the recognition of more complex formations and patterns built from elements of the lower layers. Consecutive use of convolutional and pooling layers provides position invariance, thus enabling object recognition at various locations in the image.
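The layer types above could be assembled, for instance, as in the following PyTorch sketch; the depth, channel counts and kernel sizes are illustrative assumptions, since the invention leaves them to be tuned per learning database:

```python
import torch.nn as nn

class ToyRecognitionNet(nn.Module):
    """Illustrative deep CNN with the layer types named above (224x224 RGB input)."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),    # convolutional layer
            nn.ReLU(inplace=True),                         # rectification
            nn.LocalResponseNorm(5),                       # normalization
            nn.MaxPool2d(2),                               # pooling
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                               # dropout layer
            nn.Linear(64 * 56 * 56, 512),                  # interconnected layer
            nn.ReLU(inplace=True),
            nn.Linear(512, n_classes),                     # output layer (softmax at the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```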
After the learning process has finished, a model containing the optimized neural network can be stored on digital media, ready for use in recognition tasks. Figure 3.b shows the workflow in which the mentioned trained model is used for the recognition task. After an image is captured, the digital image is sent to the recognition module, which uses the learned model to recognize an object, i.e. a toy element, in the image. The network outputs the probabilities that certain objects appear in the image in question. Once the toy element is recognized, its identity information can be used to acquire additional information about the object from other sources such as databases or World Wide Web resources.
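A minimal sketch of storing the optimized model and using it for recognition, under the same PyTorch assumptions as above (the file name is a placeholder):

```python
import torch

def save_model(model, path="toy_model.pt"):
    """Persist the optimized network parameters to digital media."""
    torch.save(model.state_dict(), path)

def recognize(model, image_tensor):
    """Forward one preprocessed image; returns per-class probabilities."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))       # add batch dimension
        return torch.softmax(logits, dim=1).squeeze(0)  # probabilities per toy element
```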
Figure 3.c depicts a workflow in which an additional step is performed before the actual recognition in order to segment the image when more than one object has to be recognized. Segmentation can be done in several ways: by extracting contours after edge detection with e.g. the Canny algorithm and then performing the recognition tasks described above for each image region where a contour is found; by using a sliding-window approach and gathering the top prediction scores, thus identifying regions of the image which contain interesting objects; by colour-based segmentation; by selective search; or by any other segmentation method.
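As one concrete realisation of the contour-based option, the following OpenCV sketch extracts candidate regions for the recognition step; the Canny thresholds and the minimum area are illustrative assumptions:

```python
import cv2

def segment_regions(bgr_image, min_area=500):
    """Candidate regions via Canny edge detection and contour extraction."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                      # edge detection
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    regions = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)                  # bounding box per contour
        if w * h >= min_area:                             # discard tiny regions
            regions.append(bgr_image[y:y + h, x:x + w])   # crop for recognition
    return regions
```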
The described neural network training process can be carried out on any of a variety of devices in which digital signal processing can be performed. Examples of these devices include, and are not limited to: desktop and laptop computers, servers, digital recording devices, gaming consoles, portable gaming devices, mobile devices etc. The mentioned storage for the executable files of the described system and for the learning database or trained model may be removable or non-removable. This includes magnetic disks, magnetic tapes or cassettes, solid state disks, CD or DVD ROMs, USB disks or any other medium which can be used to store information and which can be accessed within the computing environment.
The design of convolutional neural networks involves many hyper-parameters to be specified. These hyper-parameters include, and are not limited to: the number of neurons, the number of layers, the types of layers (convolutional, pooling, rectification, normalization, dropout, fully connected, softmax etc.) and their arrangement, learning rates, decay rates, the number of learning iterations, batch sizes etc. The exact configuration of these parameters is best found via experiments on the particular problem being solved, i.e. the specific learning database being used, by monitoring the network's performance on the validation and test data sets. The various hyper-parameters of the deep convolutional neural network can be used in combination or independently.
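For illustration only, such a hyper-parameter configuration might be recorded as a simple mapping; every value below is a placeholder to be tuned against the validation set, not a value disclosed by the invention:

```python
# Hypothetical hyper-parameter configuration; all values are placeholders.
HYPERPARAMS = {
    "n_conv_layers": 5,        # number and arrangement of layers
    "n_fc_layers": 2,
    "learning_rate": 0.01,
    "lr_decay": 0.1,           # decay rate per schedule step
    "n_iterations": 100000,    # number of learning iterations
    "batch_size": 128,
    "dropout": 0.5,
}
```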
The present invention further concerns a computer readable medium having stored thereon instructions which, when executed by one or more central or graphics processing units, cause the performance of a method of: reading a digital image, said image representing a still image captured with a digital camera or a fragment of a video sequence captured with a video camera; creating a learning database of digital images containing toy elements; learning from the captured images with the goal of being able to recognize a toy element if present in an image; recognizing a toy element if said element is present in an image that has not been used in the learning step; detecting one or more toy elements present in said digital image; and matching information from a central database or a web source with a recognized real-world toy element.
Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto.
Claims (42)
- Claims 1. A system configured to automatically recognize and detect real world toy construction elements, which can be interconnected or combined, such as a brick, minifigure, minifigure part, weapon, object, accessories, animals or any other construction part or artefact, from image or video in real time, the system comprising: one or more toy elements; an imager device; one or more processors configured to execute instructions of computer program modules comprising: a reading module configured to receive an image from a digital image capture device; a learning database of annotated digital images containing toy elements taken in various conditions and altered with transformations to ensure significant variance for the training procedure; a learning module configured to learn digital representations of toy elements by training a deep convolutional neural network, thus enabling very accurate and robust recognition; a recognition module configured to recognize toy elements from a digital image; a detection module configured to segment an image, and detect and recognize one or more toy elements from a digital image; a processing module configured to match the recognized one or more toy elements with information stored in the central database or on the web.
- 2. The system of claim 1, wherein reading module is configured to read an image, said image including an image of one or more real world toy elements, extract from said read image one or more toy elements.
- 3. The system of claim 1, wherein learning module is configured to learn digital representations of toy elements whose images are stored in the learning database and create a model for recognition.
- 4. The system of claim 1, wherein a learning database contains annotated images wherein images are coupled with information relating to the toy elements present on said images.
- 5. The system of claims 1 to 4, wherein the learning database is expanded by adding additional images of toy elements that exist in the database and new images of toy elements that do not exist in the database, wherein the new images of toy elements are annotated.
- 6. The system of claims 1 to 4, wherein the recognition module is configured to use the model for recognition to recognize a toy element on a digital image.
- 7. The system of claim 1, wherein the detection module is configured to segment image and detect one or more toy elements on a digital image and extract position represented by coordinates of the toy elements and extract the detected elements from the background and recognize the extracted elements.
- 8. The system of claim 1, wherein the processing module uses identification of the recognized element and matches that information with additional information stored in the central database or a web source.
- 9. The system of the preceding claims, wherein a learning module is implemented as a deep convolutional neural network comprising convolutional, rectification, normalization, pooling and fully connected layers of artificial neurons.
- 10. The system of the preceding claims, wherein a learning module implementation is a set of computer program instructions that can be run on one or more central processing units ("CPU") or one or more graphics processing units ("GPU"), thus enabling a faster learning process.
- 11. The system of the preceding claims, wherein a recognition module is configured to feed the said digital image data forward through a learned convolutional neural network, causing the network to output a prediction consisting of the probabilities that said image contains certain objects from the said learning database.
- 12. The system of preceding claims, wherein a detection module is configured to segment a digital image and detect one or more toy elements on a digital image and recognize each of them thus associating one or more said toy elements with the objects stored in learning database.
- 13. The system of preceding claims, wherein a processing module is configured to match information about the recognized toy element with the information stored in a central database or on the World Wide Web via a web service or other network protocol.
- 14. The system of preceding claims, wherein the recognition module is implemented as computer program that runs on a mobile device.
- 15. The system of preceding claims, wherein the recognition module is implemented as computer program that runs on a personal computer.
- 16. The system of preceding claims, wherein the recognition module is implemented as computer program that runs on a server.
- 17. The system of claim 1, wherein a toy element is a construction toy part.
- 18. The system of claim 1, wherein a toy element is a minifigure.
- 19. The system of claim 1, wherein a toy element is a minifigure part.
- 20. The system of claim 1, wherein a toy element is a weapon.
- 21. The system of claim 1, wherein a toy element is an accessory.
- 22. The system of claim 1, wherein a toy element is a tool.
- 23. The system of claim 1, wherein a toy element is an animal.
- 24. The system of claim 1, wherein a toy element is a motor part.
- 25. The system of claim 1, wherein a toy element is an electronic sensor part.
- 26. The system of claim 1, wherein a toy element is a brick.
- 27. The system of claim 1, wherein a toy element is an assembly of more toy elements which are physically combinable.
- 28. A computer implemented method for automatic recognition and detection of real world toy construction elements which can be interconnected or combined, such as a brick, minifigure, minifigure part, weapon, object, accessories, animals or any other construction part or artefact, from image or video in real time, comprising the steps of: reading a digital image, said image representing a still image captured with a digital camera or a fragment of a video sequence captured with a video camera; creating a learning database of digital images containing toy elements; learning from captured images with a goal to be able to recognize a toy element if present on an image; recognizing a toy element if the said element is present on an image that has not been used in the learning step; segmenting the image and detecting one or more toy elements present on a said digital image; matching the information from a central database or a web source with a recognized real world toy element.
- 29. The method of claim 28, wherein reading a digital image from an input device, said image containing an image of a toy element is done by one or more processors.
- 30. The method of claim 28, wherein creating of the learning database is performed by taking a large number of digital images of toy elements, annotating them with information about which toy element is present on each image and subtracting the mean value to normalize the pixel colour intensities.
- 31. The method of claims 28 to 30, wherein digital images of toy elements contain elements on various locations in an image showing toy elements captured from various angles, distances, rotations, from different cameras and in different illumination conditions.
- 32. The method of claims 28 to 31, wherein a learning database is artificially expanded by creating new digital images from the existing ones by applying horizontal or vertical flips, scaling, rotation, changing colour intensities and affine transformations of existing images.
- 33. The method of claim 28, wherein learning process is achieved by training a deep convolutional neural network whose hyper-parameters are chosen based on the performance on validation portion of the learning database.
- 34. The method of claim 33, wherein training of deep convolutional neural network is conducted by a series of forward and backward passes of input information and gradients respectively.
- 35. The method of claim 33, wherein deep convolutional neural network consists of convolutional, rectification, normalization, pooling, interconnected and softmax layers.
- 36. The method of claims 33 and 35, wherein top level interconnected layers of the said network can be replaced by another classification algorithm which uses outputs from convolutional layers as inputs for classification.
- 37. The method of claims 28 to 36, wherein recognizing the toy element from an input digital image is done by conducting a forward pass through the trained neural network and the output from the network represents probabilities for the existing classes being present on said image.
- 38. The method of claims 28 to 37, wherein recognizing can be done as a series of forward passes of different subcrops of an input digital image and outputs from the trained network can be averaged to increase the prediction accuracy.
- 39. The method of claims 28 to 38, wherein detection of the one or more toy elements is performed as one or more recognitions after the segmentation of a digital image.
- 40. The method of claims 28 to 39, wherein the segmentation is done by taking different subcrops of a digital image by sliding window, or after edge detection and contour extraction, or by another segmentation algorithm.
- 41. The method of claims 28 to 40, wherein the matching of information from a central database or a web source with a recognized real world toy element is conducted by querying a central database or a web service.
- 42. The method of claim 41, wherein the matching information is the price of an element, existing colours, construction sets where the element appears, buying locations and other information stored in the central database or a web source.
- 42. A computer readable medium having stored thereon instructions which, when executed by one or more central or graphics processing units, cause performing the method for automatic recognition and detection of real world toy construction elements which can be interconnected or combined, such as a brick, minifigure, minifigure part, weapon, object, accessories, animals or any other construction part or artefact, from image or video in real time, comprising means for: reading a digital image, said image representing a still image captured with a digital camera or a fragment of a video sequence captured with a video camera; creating a learning database of digital images containing toy elements; learning from captured images with a goal to be able to recognize a toy element if present on an image; recognizing a toy element if the said element is present on an image that has not been used in the learning step; segmenting the image and detecting one or more toy elements present on a said digital image; matching the information from a central database or a web source with a recognized real world toy element.
Priority Applications (11)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1419928.5A GB2532075A (en) | 2014-11-10 | 2014-11-10 | System and method for toy recognition and detection based on convolutional neural networks |
PCT/EP2015/076073 WO2016075081A1 (en) | 2014-11-10 | 2015-11-09 | System and method for toy recognition |
DK15791611.5T DK3218076T3 (en) | 2014-11-10 | 2015-11-09 | System and method of toy recognition |
EP15791611.5A EP3218076B1 (en) | 2014-11-10 | 2015-11-09 | System and method for toy recognition |
EP20187091.2A EP3744410B1 (en) | 2014-11-10 | 2015-11-09 | System and method for toy recognition |
EP24159537.0A EP4350645A3 (en) | 2014-11-10 | 2015-11-09 | System and method for toy recognition |
US15/524,944 US10213692B2 (en) | 2014-11-10 | 2015-11-09 | System and method for toy recognition |
DK20187091.2T DK3744410T3 (en) | 2014-11-10 | 2015-11-09 | SYSTEM AND METHOD FOR TOY RECOGNITION |
US16/258,739 US10974152B2 (en) | 2014-11-10 | 2019-01-28 | System and method for toy recognition |
US17/196,598 US11794110B2 (en) | 2014-11-10 | 2021-03-09 | System and method for toy recognition |
US18/297,140 US12070693B2 (en) | 2014-11-10 | 2023-04-07 | System and method for toy recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1419928.5A GB2532075A (en) | 2014-11-10 | 2014-11-10 | System and method for toy recognition and detection based on convolutional neural networks |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201419928D0 GB201419928D0 (en) | 2014-12-24 |
GB2532075A true GB2532075A (en) | 2016-05-11 |
Family
ID=52118221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1419928.5A Withdrawn GB2532075A (en) | 2014-11-10 | 2014-11-10 | System and method for toy recognition and detection based on convolutional neural networks |
Country Status (5)
Country | Link |
---|---|
US (4) | US10213692B2 (en) |
EP (3) | EP3218076B1 (en) |
DK (2) | DK3218076T3 (en) |
GB (1) | GB2532075A (en) |
WO (1) | WO2016075081A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017203262A3 (en) * | 2016-05-25 | 2018-01-04 | Metail Limited | Method and system for predicting garment attributes using deep learning |
WO2018007369A1 (en) * | 2016-07-05 | 2018-01-11 | Lego A/S | Method for creating a virtual object |
WO2018007351A1 (en) * | 2016-07-05 | 2018-01-11 | Lego A/S | Method for creating a virtual object |
EP3272401A1 (en) * | 2016-07-22 | 2018-01-24 | PlayFusion Limited | Apparatus, system and method for enhancing a gaming experience |
CN109190444A (en) * | 2018-07-02 | 2019-01-11 | 南京大学 | A kind of implementation method of the lane in which the drivers should pay fees vehicle feature recognition system based on video |
EP3528432A1 (en) * | 2018-02-16 | 2019-08-21 | Nokia Solutions and Networks Oy | Method and apparatus for monitoring a telecommunication network |
Families Citing this family (265)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US10002189B2 (en) | 2007-12-20 | 2018-06-19 | Apple Inc. | Method and apparatus for searching using an active ontology |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US20120311585A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Organizing task items that represent tasks to perform |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US10417037B2 (en) | 2012-05-15 | 2019-09-17 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
AU2014214676A1 (en) | 2013-02-07 | 2015-08-27 | Apple Inc. | Voice trigger for a digital assistant |
US10652394B2 (en) | 2013-03-14 | 2020-05-12 | Apple Inc. | System and method for processing voicemail |
US10748529B1 (en) | 2013-03-15 | 2020-08-18 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
KR101922663B1 (en) | 2013-06-09 | 2018-11-28 | 애플 인크. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
DE112014003653B4 (en) | 2013-08-06 | 2024-04-18 | Apple Inc. | Automatically activate intelligent responses based on activities from remote devices |
US10296160B2 (en) | 2013-12-06 | 2019-05-21 | Apple Inc. | Method for extracting salient dialog usage from live data |
US9747727B2 (en) | 2014-03-11 | 2017-08-29 | Amazon Technologies, Inc. | Object customization and accessorization in video content |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US9966065B2 (en) | 2014-05-30 | 2018-05-08 | Apple Inc. | Multi-command single utterance input method |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US10460227B2 (en) | 2015-05-15 | 2019-10-29 | Apple Inc. | Virtual assistant in a communication session |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10200824B2 (en) | 2015-05-27 | 2019-02-05 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11513658B1 (en) | 2015-06-24 | 2022-11-29 | Amazon Technologies, Inc. | Custom query of a media universe database |
US10970843B1 (en) * | 2015-06-24 | 2021-04-06 | Amazon Technologies, Inc. | Generating interactive content using a media universe database |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
US20230065252A1 (en) * | 2015-08-17 | 2023-03-02 | Lego A/S | Toy system and a method of operating the toy system |
CN108136257B (en) | 2015-08-17 | 2021-09-21 | 乐高公司 | Method for creating virtual game environment and interactive game system using the same |
DK180058B1 (en) * | 2018-07-06 | 2020-02-27 | Lego A/S | toy system |
US10331312B2 (en) | 2015-09-08 | 2019-06-25 | Apple Inc. | Intelligent automated assistant in a media environment |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10740384B2 (en) | 2015-09-08 | 2020-08-11 | Apple Inc. | Intelligent automated assistant for media search and playback |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
JP2018534953A (en) | 2015-09-09 | 2018-11-29 | リーチ ロボティックス リミテッドReach Robotics Limited | Game robot |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10860887B2 (en) * | 2015-11-16 | 2020-12-08 | Samsung Electronics Co., Ltd. | Method and apparatus for recognizing object, and method and apparatus for training recognition model |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US20170161591A1 (en) * | 2015-12-04 | 2017-06-08 | Pilot Ai Labs, Inc. | System and method for deep-learning based object tracking |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
JP6333871B2 (en) * | 2016-02-25 | 2018-05-30 | ファナック株式会社 | Image processing apparatus for displaying an object detected from an input image |
US10891019B2 (en) * | 2016-02-29 | 2021-01-12 | Huawei Technologies Co., Ltd. | Dynamic thumbnail selection for search results |
US10424072B2 (en) * | 2016-03-01 | 2019-09-24 | Samsung Electronics Co., Ltd. | Leveraging multi cues for fine-grained object classification |
US11222461B2 (en) | 2016-03-25 | 2022-01-11 | Outward, Inc. | Arbitrary view generation |
US10163250B2 (en) | 2016-03-25 | 2018-12-25 | Outward, Inc. | Arbitrary view generation |
US9996914B2 (en) | 2016-03-25 | 2018-06-12 | Outward, Inc. | Arbitrary view generation |
US11232627B2 (en) | 2016-03-25 | 2022-01-25 | Outward, Inc. | Arbitrary view generation |
US11989820B2 (en) | 2016-03-25 | 2024-05-21 | Outward, Inc. | Arbitrary view generation |
US11989821B2 (en) | 2016-03-25 | 2024-05-21 | Outward, Inc. | Arbitrary view generation |
US10163249B2 (en) | 2016-03-25 | 2018-12-25 | Outward, Inc. | Arbitrary view generation |
US11972522B2 (en) | 2016-03-25 | 2024-04-30 | Outward, Inc. | Arbitrary view generation |
US10163251B2 (en) | 2016-03-25 | 2018-12-25 | Outward, Inc. | Arbitrary view generation |
US11062383B2 (en) | 2016-05-10 | 2021-07-13 | Lowe's Companies, Inc. | Systems and methods for displaying a simulated room and portions thereof |
CN106056562B (en) * | 2016-05-19 | 2019-05-28 | 京东方科技集团股份有限公司 | A kind of face image processing process, device and electronic equipment |
GB2550911B (en) * | 2016-05-27 | 2021-02-10 | Swap Bots Ltd | Augmented reality toy |
US10579860B2 (en) | 2016-06-06 | 2020-03-03 | Samsung Electronics Co., Ltd. | Learning model for salient facial region detection |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US12223282B2 (en) | 2016-06-09 | 2025-02-11 | Apple Inc. | Intelligent automated assistant in a home environment |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
US12197817B2 (en) | 2016-06-11 | 2025-01-14 | Apple Inc. | Intelligent device arbitration and control |
DK179049B1 (en) * | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
US10353950B2 (en) | 2016-06-28 | 2019-07-16 | Google Llc | Visual recognition using user tap locations |
US10635973B1 (en) * | 2016-06-28 | 2020-04-28 | Amazon Technologies, Inc. | Recommendation system using improved neural network |
JP6708044B2 (en) * | 2016-07-28 | 2020-06-10 | 富士通株式会社 | Image recognition device, image recognition program, image recognition method, and recognition device |
US10810494B2 (en) * | 2016-08-08 | 2020-10-20 | EyeEm Mobile GmbH | Systems, methods, and computer program products for extending, augmenting and enhancing searching and sorting capabilities by learning and adding concepts on the fly |
US20180068329A1 (en) * | 2016-09-02 | 2018-03-08 | International Business Machines Corporation | Predicting real property prices using a convolutional neural network |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10220326B2 (en) * | 2016-09-29 | 2019-03-05 | Intel Corporation | Projections that respond to model building |
US20180114109A1 (en) * | 2016-10-20 | 2018-04-26 | Nokia Technologies Oy | Deep convolutional neural networks with squashed filters |
KR101851169B1 (en) * | 2016-11-14 | 2018-06-11 | 대한민국 | Egg conveyer monitoring system and control method |
US11132529B2 (en) * | 2016-11-16 | 2021-09-28 | Ventana Medical Systems, Inc. | Convolutional neural networks for locating objects of interest in images of biological samples |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
CN106815601B (en) * | 2017-01-10 | 2019-10-11 | 西安电子科技大学 | Hyperspectral Image Classification Method Based on Recurrent Neural Network |
WO2018141956A1 (en) | 2017-02-06 | 2018-08-09 | Lego A/S | Electronic ordering system and method |
US10963783B2 (en) * | 2017-02-19 | 2021-03-30 | Intel Corporation | Technologies for optimized machine learning training |
US10339443B1 (en) * | 2017-02-24 | 2019-07-02 | Gopro, Inc. | Systems and methods for processing convolutional neural network operations using textures |
WO2018170512A1 (en) * | 2017-03-17 | 2018-09-20 | Neurala, Inc. | Online, incremental real-time learning for tagging and labeling data streams for deep neural networks and neural network applications |
US10163043B2 (en) | 2017-03-31 | 2018-12-25 | Clarifai, Inc. | System and method for facilitating logo-recognition training of a recognition model |
WO2018195583A1 (en) | 2017-04-26 | 2018-11-01 | Emmet Pty Ltd | Construction system and method |
US10572773B2 (en) * | 2017-05-05 | 2020-02-25 | Intel Corporation | On the fly deep learning in machine learning for autonomous machines |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK180048B1 (en) | 2017-05-11 | 2020-02-04 | Apple Inc. | Maintaining the data protection of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | User-specific acoustic models |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | Synchronization and task delegation of a digital assistant |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | Low-latency intelligent automated assistant |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770411A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Multi-modal interfaces |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
US20180336275A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Intelligent automated assistant for media exploration |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US20180336892A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Detecting a trigger of a digital assistant |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
KR102399673B1 (en) * | 2017-06-01 | 2022-05-19 | 삼성전자주식회사 | Method and apparatus for recognizing object based on vocabulary tree |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
EP3452955A4 (en) * | 2017-06-06 | 2019-07-03 | Midea Group Co., Ltd. | Coarse-to-fine hand detection method using deep neural network |
EP3270308B9 (en) * | 2017-06-14 | 2022-05-18 | Siemens Healthcare GmbH | Method for providing a secondary parameter, decision support system, computer-readable medium and computer program product |
JP7125983B2 (en) * | 2017-07-14 | 2022-08-25 | キャパシティー インコーポレイテッド | Systems and methods for creating and displaying interactive 3D representations of real objects |
JP7203844B2 (en) * | 2017-07-25 | 2023-01-13 | 達闥機器人股份有限公司 | Training data generation method, generation device, and semantic segmentation method for the image |
US20210134049A1 (en) * | 2017-08-08 | 2021-05-06 | Sony Corporation | Image processing apparatus and method |
US10546387B2 (en) | 2017-09-08 | 2020-01-28 | Qualcomm Incorporated | Pose determination with semantic segmentation |
KR20200073222A (en) | 2017-09-18 | 2020-06-23 | 엘리먼트, 인크. | Method, system and medium for detecting spoofing in mobile authentication |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10496902B2 (en) * | 2017-09-21 | 2019-12-03 | International Business Machines Corporation | Data augmentation for image classification tasks |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
EP3692470A4 (en) * | 2017-10-02 | 2021-08-11 | Sensen Networks Group Pty Ltd | System and method for object detection by machine learning |
CN111417442B (en) | 2017-10-16 | 2023-08-08 | 乐高公司 | Interactive game device |
CN111492374A (en) * | 2017-10-24 | 2020-08-04 | 耐克创新有限合伙公司 | Image recognition system |
CN110348428B (en) * | 2017-11-01 | 2023-03-24 | 腾讯科技(深圳)有限公司 | Fundus image classification method and device and computer-readable storage medium |
CN110399929B (en) * | 2017-11-01 | 2023-04-28 | 腾讯科技(深圳)有限公司 | Fundus image classification method, fundus image classification apparatus, and computer-readable storage medium |
KR102426491B1 (en) * | 2017-11-07 | 2022-07-28 | 재단법인대구경북과학기술원 | Image data processing apparatus using ensamble and fine tunning and controlling method thereof |
JP2019096072A (en) * | 2017-11-22 | 2019-06-20 | 株式会社東芝 | Object detection device, object detection method and program |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
CN107832807B (en) * | 2017-12-07 | 2020-08-07 | 上海联影医疗科技有限公司 | Image processing method and system |
US10192115B1 (en) | 2017-12-13 | 2019-01-29 | Lowe's Companies, Inc. | Virtualizing objects using object models and object position data |
WO2019121629A1 (en) | 2017-12-19 | 2019-06-27 | Lego A/S | Play system and method for detecting toys |
CN111557023B (en) * | 2017-12-29 | 2025-03-14 | 交互数字Vc控股公司 | Method and system for maintaining color calibration using common objects |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10872399B2 (en) * | 2018-02-02 | 2020-12-22 | Nvidia Corporation | Photorealistic image stylization using a neural network model |
US10950007B2 (en) * | 2018-02-08 | 2021-03-16 | Hasbro, Inc. | Color-based toy identification system |
US10867214B2 (en) | 2018-02-14 | 2020-12-15 | Nvidia Corporation | Generation of synthetic images for training a neural network model |
KR102595787B1 (en) * | 2018-02-27 | 2023-11-24 | 삼성전자주식회사 | Electronic device and control method thereof |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10909599B2 (en) * | 2018-03-08 | 2021-02-02 | Capital One Services, Llc | Systems and methods for car shopping using messaging framework |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
USD844394S1 (en) | 2018-03-29 | 2019-04-02 | Kraft Foods Group Brands Llc | Mold |
US10894342B2 (en) | 2018-03-29 | 2021-01-19 | Kraft Foods Group Brands Llc | System and method for molding comestible building blocks |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US11017129B2 (en) | 2018-04-17 | 2021-05-25 | International Business Machines Corporation | Template selector |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | Disability of attention-attentive virtual assistant |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US11899722B2 (en) | 2018-06-20 | 2024-02-13 | Rakuten Group, Inc. | Search system, search method, and program |
EP3830713A1 (en) | 2018-07-31 | 2021-06-09 | Marvell Asia Pte, Ltd. | Metadata generation at the storage edge |
US20200065706A1 (en) * | 2018-08-24 | 2020-02-27 | Htc Corporation | Method for verifying training data, training system, and computer program product |
US11094079B2 (en) | 2018-08-28 | 2021-08-17 | Facebook Technologies, Llc | Determining a pose of an object from RGB-D images |
EP3620983B1 (en) * | 2018-09-05 | 2023-10-25 | Sartorius Stedim Data Analytics AB | Computer-implemented method, computer program product and system for data analysis |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
CN110969217B (en) * | 2018-09-28 | 2023-11-17 | 杭州海康威视数字技术股份有限公司 | Method and device for image processing based on convolutional neural network |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US20200118077A1 (en) * | 2018-10-10 | 2020-04-16 | Adroit Worldwide Media, Inc. | Systems, Method and Apparatus for Optical Means for Tracking Inventory |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
CN109684911B (en) * | 2018-10-30 | 2021-05-11 | 百度在线网络技术(北京)有限公司 | Expression recognition method and device, electronic equipment and storage medium |
CN112889084B (en) | 2018-11-08 | 2023-05-23 | Oppo广东移动通信有限公司 | Method, system and computer readable medium for improving color quality of image |
TWI717655B (en) * | 2018-11-09 | 2021-02-01 | 財團法人資訊工業策進會 | Feature determination apparatus and method adapted to multiple object sizes |
US10810759B2 (en) * | 2018-11-20 | 2020-10-20 | International Business Machines Corporation | Creating a three-dimensional model from a sequence of images |
US11011257B2 (en) | 2018-11-21 | 2021-05-18 | Enlitic, Inc. | Multi-label heat map display system |
US10755128B2 (en) * | 2018-12-18 | 2020-08-25 | Slyce Acquisition Inc. | Scene and user-input context aided visual search |
CN109829969A (en) * | 2018-12-27 | 2019-05-31 | 北京奇艺世纪科技有限公司 | A kind of data capture method, device and storage medium |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
CN111428729A (en) * | 2019-01-09 | 2020-07-17 | 北京京东尚科信息技术有限公司 | Target detection method and device |
US10694053B1 (en) | 2019-01-22 | 2020-06-23 | Xerox Corporation | Wireless location tracking tag for monitoring real time location-tracking apparatus for an electronic device |
CN113382790B (en) * | 2019-01-23 | 2023-10-03 | 乐高公司 | Toy system for augmented reality |
US10535201B1 (en) * | 2019-02-15 | 2020-01-14 | Capital One Services, Llc | Utilizing machine learning to generate augmented reality vehicle information for a scale model of a vehicle |
US11410078B2 (en) * | 2019-03-11 | 2022-08-09 | Nxp B.V. | Method and data processing system for making machine learning model more resistent to adversarial examples |
US11343277B2 (en) | 2019-03-12 | 2022-05-24 | Element Inc. | Methods and systems for detecting spoofing of facial recognition in connection with mobile devices |
CN111768214B (en) * | 2019-03-14 | 2025-01-17 | 北京京东尚科信息技术有限公司 | Product attribute prediction method, system, device and storage medium |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US10936903B2 (en) * | 2019-03-30 | 2021-03-02 | Intel Corporation | Technologies for labeling and validating human-machine interface high definition-map data |
EP3719697B1 (en) * | 2019-04-04 | 2023-10-18 | Aptiv Technologies Limited | Method and device for determining whether a hand cooperates with a manual steering element of a vehicle |
CN111814514A (en) * | 2019-04-11 | 2020-10-23 | 富士通株式会社 | Number identification device, method and electronic device |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11972466B2 (en) * | 2019-05-20 | 2024-04-30 | Adobe Inc | Computer storage media, method, and system for exploring and recommending matching products across categories |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
DK201970511A1 (en) | 2019-05-31 | 2021-02-15 | Apple Inc | Voice identification in digital assistant systems |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
US11227599B2 (en) | 2019-06-01 | 2022-01-18 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11361505B2 (en) * | 2019-06-06 | 2022-06-14 | Qualcomm Technologies, Inc. | Model retrieval for objects in images using field descriptors |
TW202113681A (en) * | 2019-06-07 | 2021-04-01 | 德商巴斯夫塗料有限責任公司 | Device and method for forming at least one ground truth database for an object recognition system |
DE102019208864A1 (en) * | 2019-06-18 | 2020-12-24 | Siemens Mobility GmbH | Detection system, working method and training method |
GB201908762D0 (en) * | 2019-06-19 | 2019-07-31 | Kemp Little Llp | Item comparison system device and method therefor |
US11205319B2 (en) | 2019-06-21 | 2021-12-21 | Sg Gaming, Inc. | System and method for synthetic image training of a neural network associated with a casino table game monitoring system |
WO2021016923A1 (en) * | 2019-07-31 | 2021-02-04 | 东北大学 | Data enhancement method employing bit plane separation and recombination |
US11481633B2 (en) * | 2019-08-05 | 2022-10-25 | Bank Of America Corporation | Electronic system for management of image processing models |
EP4470865A3 (en) | 2019-09-17 | 2025-03-05 | Aptiv Technologies AG | Method and device for determining an estimate of the capability of a vehicle driver to take over control of a vehicle |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
EP4055525A4 (en) * | 2019-11-08 | 2024-02-21 | Outward, Inc. | Arbitrary view generation |
WO2021113408A1 (en) * | 2019-12-03 | 2021-06-10 | Augustus Intelligence Inc. | Synthesizing images from 3d models |
CN114730324A (en) | 2019-12-13 | 2022-07-08 | 马维尔亚洲私人有限公司 | Automotive data processing system with efficient metadata generation and derivation |
US11507248B2 (en) | 2019-12-16 | 2022-11-22 | Element Inc. | Methods, systems, and media for anti-spoofing using eye-tracking |
CN111091492B (en) * | 2019-12-23 | 2020-09-04 | 韶鼎人工智能科技有限公司 | Face image illumination migration method based on convolutional neural network |
US11694333B1 (en) * | 2020-02-05 | 2023-07-04 | State Farm Mutual Automobile Insurance Company | Performing semantic segmentation of 3D data using deep learning |
KR102348179B1 (en) * | 2020-03-02 | 2022-01-06 | 손준 | Image saving method for offline objects |
US11244470B2 (en) * | 2020-03-05 | 2022-02-08 | Xerox Corporation | Methods and systems for sensing obstacles in an indoor environment |
CN111569441A (en) * | 2020-04-08 | 2020-08-25 | 深圳市魔块智能有限公司 | Intelligent toy car track system and control method thereof |
US11183193B1 (en) | 2020-05-11 | 2021-11-23 | Apple Inc. | Digital assistant hardware abstraction |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US20230281865A1 (en) * | 2020-06-03 | 2023-09-07 | Slab Dream Lab, Llc | Systems and methods for optical recognition and identification of objects, and inventorying of the same |
US11597408B2 (en) | 2020-07-09 | 2023-03-07 | Aptiv Technologies Limited | Vehicle control system |
CN111768478B (en) * | 2020-07-13 | 2023-05-30 | 腾讯科技(深圳)有限公司 | Image synthesis method and device, storage medium and electronic equipment |
CN111666957B (en) * | 2020-07-17 | 2023-04-25 | 湖南华威金安企业管理有限公司 | Image authenticity identification method and device |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
US11356800B2 (en) | 2020-08-27 | 2022-06-07 | Xerox Corporation | Method of estimating indoor location of a device |
WO2022101233A1 (en) | 2020-11-16 | 2022-05-19 | Lego A/S | Toy system for multiplayer social play |
US11138690B1 (en) * | 2020-12-08 | 2021-10-05 | Photobrick Pty. Ltd. | Multi-image brick mosaic and systems and methods for production thereof |
IT202100014645A1 (en) * | 2021-06-04 | 2022-12-04 | Rome C M S R L | 3D modeling method and system |
CN113327237B (en) * | 2021-06-09 | 2025-04-08 | 合肥中科星翰科技有限公司 | A visual inspection system for power circuit boards |
CN113535155B (en) * | 2021-07-16 | 2022-10-18 | 上海布鲁可积木科技有限公司 | Building block programming system |
US11868444B2 (en) | 2021-07-20 | 2024-01-09 | International Business Machines Corporation | Creating synthetic visual inspection data sets using augmented reality |
US11935199B2 (en) * | 2021-07-26 | 2024-03-19 | Google Llc | Augmented reality depth detection through object recognition |
CN113808262B (en) * | 2021-10-08 | 2024-05-24 | 合肥安达创展科技股份有限公司 | Building model generation system based on depth map analysis |
US12136484B2 (en) | 2021-11-05 | 2024-11-05 | Altis Labs, Inc. | Method and apparatus utilizing image-based modeling in healthcare |
US20250001320A1 (en) | 2021-11-16 | 2025-01-02 | Lego A/S | Sensor module for a modular toy construction set |
US20230196737A1 (en) * | 2021-12-30 | 2023-06-22 | Industrial Technology Research Institute | Image recognition method and electronic apparatus thereof |
US20230241491A1 (en) * | 2022-01-31 | 2023-08-03 | Sony Interactive Entertainment Inc. | Systems and methods for determining a type of material of an object in a real-world environment |
US12211161B2 (en) | 2022-06-24 | 2025-01-28 | Lowe's Companies, Inc. | Reset modeling based on reset and object properties |
US12189915B2 (en) | 2022-06-24 | 2025-01-07 | Lowe's Companies, Inc. | Simulated environment for presenting virtual objects and virtual resets |
AU2023310357A1 (en) * | 2022-07-18 | 2025-01-23 | EMAGINEER Pty Ltd | Improved construction system |
US12151166B2 (en) * | 2022-08-09 | 2024-11-26 | Reuven Bakalash | Integrated reality gamified applications |
US12220643B2 (en) * | 2022-08-09 | 2025-02-11 | Reuven Bakalash | Build and design-an integrated-reality educational gaming application |
US12216703B2 (en) | 2022-10-18 | 2025-02-04 | Google Llc | Visual search determination for text-to-image replacement |
US20240249476A1 (en) * | 2023-01-19 | 2024-07-25 | Palo Alto Research Center Incorporated | Method and system for facilitating generation of background replacement masks for improved labeled image dataset collection |
CN117058241B (en) * | 2023-10-10 | 2024-03-29 | 轩创(广州)网络科技有限公司 | Electronic element positioning method and system based on artificial intelligence |
Family Cites Families (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3005282A (en) | 1958-01-28 | 1961-10-24 | Interlego Ag | Toy building brick |
USD253711S (en) | 1977-08-29 | 1979-12-18 | Interlego Ag | Toy figure |
US6351265B1 (en) * | 1993-10-15 | 2002-02-26 | Personalized Online Photo Llc | Method and apparatus for producing an electronic image |
US6259815B1 (en) * | 1999-03-04 | 2001-07-10 | Mitsubishi Electric Research Laboratories, Inc. | System and method for recognizing scanned objects with deformable volumetric templates |
US6290565B1 (en) * | 1999-07-21 | 2001-09-18 | Nearlife, Inc. | Interactive game apparatus with game play controlled by user-modifiable toy |
US7092899B2 (en) * | 2000-10-31 | 2006-08-15 | Interlego Ag | Method and system for generating a brick model |
US8224078B2 (en) * | 2000-11-06 | 2012-07-17 | Nant Holdings Ip, Llc | Image capture and identification system and process |
US20020196250A1 (en) * | 2001-06-20 | 2002-12-26 | Gateway, Inc. | Parts assembly for virtual representation and content creation |
JP4509789B2 (en) | 2002-10-11 | 2010-07-21 | レゴ エー/エス | Computer readable model |
PL378138A1 (en) * | 2003-05-20 | 2006-03-06 | Lego A/S | Method and system for manipulating a digital representation of a three-dimensional object |
US7394459B2 (en) * | 2004-04-29 | 2008-07-01 | Microsoft Corporation | Interaction between objects and a virtual environment display |
JP2008502960A (en) * | 2004-06-17 | 2008-01-31 | レゴ エー/エス | Automatic creation of assembly instructions for assembly block models |
US20070097832A1 (en) * | 2005-10-19 | 2007-05-03 | Nokia Corporation | Interoperation between virtual gaming environment and real-world environments |
US9424488B2 (en) * | 2007-06-14 | 2016-08-23 | Hewlett-Packard Development Company, L.P. | Applying a segmentation engine to different mappings of a digital image |
US8452108B2 (en) * | 2008-06-25 | 2013-05-28 | Gannon Technologies Group Llc | Systems and methods for image recognition using graph-based pattern matching |
WO2014055924A1 (en) * | 2012-10-04 | 2014-04-10 | Disney Enterprises, Inc. | Interactive objects for immersive environment |
US8388151B2 (en) * | 2009-07-23 | 2013-03-05 | Kenneth J. Huebner | Object aware, transformable projection system |
US9498721B2 (en) | 2009-08-04 | 2016-11-22 | Eyecue Vision Technologies Ltd. | System and method for object extraction |
US8864589B2 (en) * | 2009-10-27 | 2014-10-21 | Activision Publishing, Inc. | Video game with representative physical object related content |
US10315119B2 (en) | 2011-05-17 | 2019-06-11 | Activision Publishing, Inc. | Video game with concurrent processing of game-related physical objects |
MX339520B (en) | 2011-05-23 | 2016-05-30 | Lego A/S | A toy construction system for augmented reality |
MX2013013544A (en) * | 2011-05-23 | 2014-05-27 | Lego A/S | Generation of building instructions for construction element models |
US20120304059A1 (en) * | 2011-05-24 | 2012-11-29 | Microsoft Corporation | Interactive Build Instructions |
US10796494B2 (en) * | 2011-06-06 | 2020-10-06 | Microsoft Technology Licensing, Llc | Adding attributes to virtual representations of real-world objects |
US8813111B2 (en) * | 2011-08-22 | 2014-08-19 | Xerox Corporation | Photograph-based game |
WO2013065045A1 (en) * | 2011-10-31 | 2013-05-10 | Eyecue Vision Technologies Ltd | System for vision recognition based toys and games operated by a mobile device |
US8855366B2 (en) | 2011-11-29 | 2014-10-07 | Qualcomm Incorporated | Tracking three-dimensional objects |
US9336456B2 (en) * | 2012-01-25 | 2016-05-10 | Bruno Delean | Systems, methods and computer program products for identifying objects in video data |
US20130217294A1 (en) * | 2012-02-17 | 2013-08-22 | Arjuna Ragunath Karunaratne | Toy brick with sensing, actuation and control |
US8436853B1 (en) * | 2012-07-20 | 2013-05-07 | Google Inc. | Methods and systems for acquiring and ranking image sets |
US20140098991A1 (en) * | 2012-10-10 | 2014-04-10 | PixArt Imaging Incorporation, R.O.C. | Game doll recognition system, recognition method and game system using the same |
US9039532B2 (en) * | 2012-10-31 | 2015-05-26 | Activision Publishing, Inc. | Interactive video game with toys having functionality that is unlocked through game play |
EP2749327A1 (en) | 2012-12-26 | 2014-07-02 | Disney Enterprises, Inc. | Managing objectives, an environment, and/or a theme associated with a virtual space based on characters made accessible responsive to corresponding tokens being detected |
US9076257B2 (en) * | 2013-01-03 | 2015-07-07 | Qualcomm Incorporated | Rendering augmented reality based on foreground object |
JP6134163B2 (en) | 2013-03-08 | 2017-05-24 | 任天堂株式会社 | Information processing apparatus, information processing system, information processing method, and computer program |
US9610500B2 (en) * | 2013-03-15 | 2017-04-04 | Disney Enterprises, Inc. | Managing virtual content based on information associated with toy objects |
US9547871B2 (en) * | 2013-03-15 | 2017-01-17 | Activision Publishing, Inc. | System and method for purchasing physical toys and corresponding virtual toys |
US9011194B2 (en) * | 2013-03-15 | 2015-04-21 | Disney Enterprises, Inc. | Managing virtual content based on information associated with toy objects |
US9747307B2 (en) * | 2013-11-18 | 2017-08-29 | Scott Kier | Systems and methods for immersive backgrounds |
US9011246B1 (en) * | 2013-11-18 | 2015-04-21 | Scott Kier | Systems and methods for immersive backgrounds |
US9400924B2 (en) * | 2014-05-23 | 2016-07-26 | Industrial Technology Research Institute | Object recognition method and object recognition apparatus using the same |
US9861882B2 (en) * | 2014-09-05 | 2018-01-09 | Trigger Global Inc. | Augmented reality gaming systems and methods |
US9259651B1 (en) * | 2015-02-13 | 2016-02-16 | Jumo, Inc. | System and method for providing relevant notifications via an action figure |
AU2016250773A1 (en) * | 2015-04-23 | 2017-10-12 | Hasbro, Inc. | Context-aware digital play |
US9989965B2 (en) * | 2015-08-20 | 2018-06-05 | Motionloft, Inc. | Object detection and analysis via unmanned aerial vehicle |
2014
- 2014-11-10 GB GB1419928.5A patent/GB2532075A/en not_active Withdrawn
2015
- 2015-11-09 EP EP15791611.5A patent/EP3218076B1/en active Active
- 2015-11-09 WO PCT/EP2015/076073 patent/WO2016075081A1/en active Application Filing
- 2015-11-09 DK DK15791611.5T patent/DK3218076T3/en active
- 2015-11-09 EP EP20187091.2A patent/EP3744410B1/en active Active
- 2015-11-09 DK DK20187091.2T patent/DK3744410T3/en active
- 2015-11-09 US US15/524,944 patent/US10213692B2/en active Active
- 2015-11-09 EP EP24159537.0A patent/EP4350645A3/en active Pending
2019
- 2019-01-28 US US16/258,739 patent/US10974152B2/en active Active
2021
- 2021-03-09 US US17/196,598 patent/US11794110B2/en active Active
2023
- 2023-04-07 US US18/297,140 patent/US12070693B2/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2521092A1 (en) * | 2009-12-28 | 2012-11-07 | Cyber Ai Entertainment Inc. | Image recognition system |
US20120155751A1 (en) * | 2010-12-16 | 2012-06-21 | Canon Kabushiki Kaisha | Object recognition apparatus, object recognition method, learning apparatus, learning method, storage medium and information processing system |
US20140294292A1 (en) * | 2013-03-29 | 2014-10-02 | Canon Kabushiki Kaisha | Image processing apparatus and image processing method |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017203262A3 (en) * | 2016-05-25 | 2018-01-04 | Metail Limited | Method and system for predicting garment attributes using deep learning |
US11083968B2 (en) | 2016-07-05 | 2021-08-10 | Lego A/S | Method for creating a virtual object |
WO2018007369A1 (en) * | 2016-07-05 | 2018-01-11 | Lego A/S | Method for creating a virtual object |
WO2018007351A1 (en) * | 2016-07-05 | 2018-01-11 | Lego A/S | Method for creating a virtual object |
US11779846B2 (en) | 2016-07-05 | 2023-10-10 | Lego A/S | Method for creating a virtual object |
CN109641150A (en) * | 2016-07-05 | 2019-04-16 | 乐高公司 | Method for creating virtual objects |
US11433310B2 (en) | 2016-07-05 | 2022-09-06 | Lego A/S | Method for creating a virtual object |
CN109641150B (en) * | 2016-07-05 | 2022-06-07 | 乐高公司 | Method for creating virtual objects |
EP3272401A1 (en) * | 2016-07-22 | 2018-01-24 | PlayFusion Limited | Apparatus, system and method for enhancing a gaming experience |
US11146344B2 (en) | 2018-02-16 | 2021-10-12 | Nokia Solutions And Networks Oy | Method and apparatus for monitoring a telecommunication network |
CN110166148A (en) * | 2018-02-16 | 2019-08-23 | 诺基亚通信公司 | Method and apparatus for monitoring telecommunication network |
EP3528432A1 (en) * | 2018-02-16 | 2019-08-21 | Nokia Solutions and Networks Oy | Method and apparatus for monitoring a telecommunication network |
CN109190444B (en) * | 2018-07-02 | 2021-05-18 | 南京大学 | An implementation method of a video-based toll lane vehicle feature recognition system |
CN109190444A (en) | 2018-07-02 | 2019-01-11 | 南京大学 | An implementation method of a video-based toll lane vehicle feature recognition system |
Also Published As
Publication number | Publication date |
---|---|
EP4350645A2 (en) | 2024-04-10 |
EP3218076B1 (en) | 2020-07-29 |
WO2016075081A1 (en) | 2016-05-19 |
GB201419928D0 (en) | 2014-12-24 |
EP3744410B1 (en) | 2024-02-28 |
EP3744410A1 (en) | 2020-12-02 |
DK3744410T3 (en) | 2024-05-27 |
US10974152B2 (en) | 2021-04-13 |
US20190184288A1 (en) | 2019-06-20 |
US10213692B2 (en) | 2019-02-26 |
US20230264109A1 (en) | 2023-08-24 |
US20170304732A1 (en) | 2017-10-26 |
US11794110B2 (en) | 2023-10-24 |
US12070693B2 (en) | 2024-08-27 |
EP3218076A1 (en) | 2017-09-20 |
US20210205707A1 (en) | 2021-07-08 |
DK3218076T3 (en) | 2020-10-26 |
EP4350645A3 (en) | 2024-07-17 |
Similar Documents
Publication | Title |
---|---|
GB2532075A (en) | System and method for toy recognition and detection based on convolutional neural networks | |
Lenc et al. | Learning covariant feature detectors | |
Tang et al. | A textured object recognition pipeline for color and depth image data | |
US10861129B2 (en) | Image feature combination for image-based object recognition | |
Azad et al. | 6-DoF model-based tracking of arbitrarily shaped 3D objects | |
CN106408037A (en) | Image recognition method and apparatus | |
CN112036457B (en) | Method and device for training target detection model, and target detection method and device | |
Yigitbasi et al. | Edge detection using artificial bee colony algorithm (ABC) | |
CN106845494A (en) | The method and device of profile angle point in a kind of detection image | |
Blanc et al. | Fish species recognition from video using SVM classifier | |
Fiaz et al. | Convolutional neural network with structural input for visual object tracking | |
Berral-Soler et al. | DeepArUco: marker detection and classification in challenging lighting conditions | |
Yin et al. | Interest point detection from multi‐beam light detection and ranging point cloud using unsupervised convolutional neural network | |
CN115018886B (en) | Motion trajectory identification method, device, equipment and medium | |
Shen et al. | A biological hierarchical model based underwater moving object detection | |
Ai et al. | Geometry preserving active polygon-incorporated sign detection algorithm | |
Masouris et al. | End-to-end chess recognition | |
Aubry et al. | Visual geo-localization of non-photographic depictions via 2D–3D alignment | |
Tan et al. | Research and implementation of sports entity simulation based on heterogeneous binocular vision | |
Bouachir et al. | Visual face tracking: A coarse-to-fine target state estimation | |
Garrett | Real-time generalized convolutional neural network for point cloud object tracking | |
Teixeira | Analysis and evaluation of optimization techniques for tracking in augmented reality applications. | |
CN118378717A (en) | System and method for training machine learning model | |
Dong et al. | A Real-Time Algorithm for Multiple Data Matrix Codes Localization | |
Datta | Project on entropy based pixel cluster for face tracking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| COOA | Change in applicant's name or ownership of the application | Owner name: LEGO A/S; Free format text: FORMER OWNER: MARKO VELIC |
| WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) | |