Data Mining Hub, through the eyes of scientists

Hello, Habr!

We started Data Mining Hub and I want to tell you what it is and why it may be useful to you.

Data Mining Hub (DMH) is a platform for the development of algorithms for data mining (Data Mining) and machine learning (Machine Learning), which is based on an iterative approach, it is also a tool for businesses to analyze large amounts of data and extract from these data useful and necessary information.

DMH difference from similar resources, such as kaggle and algomost:
the
    the
  • the task into the iteration;
  • the
  • code of the algorithm remains the author, the Customer takes it only in the lease;
  • the
  • calculations, evaluation and manipulation of the money runs DMH;
  • the
  • participation does not require verification and proof of qualification.


In the DMH there are two sides. The first is the client that describes the task, and the second is a scientist who tries to solve this problem.

Scientists DMH provides the opportunity to take part in solving interesting problems, to compete with other participants and, of course, to get payment if their algorithm was chosen by the Customer. If he was not selected in this iteration, he can always be chosen next. DMH will automatically transfer the results from the last iteration to the new, if not changed the original data. But also have the opportunity to improve your algorithm and get paid at the expense of the improved algorithm in the next iteration.

For the customer DMH is a single point of integration with a large number of scientists and easy way to use different algorithms on the same data.

Briefly the working principle of DMH can be described as follows:
the
    the
  • the Client creates a job, provides a description, indicates the approximate budget, duration and period of the decision for each iteration.
  • the
  • Customer loads the data, which then will work as scientists.
  • the
  • the Customer confirms the job, and then scientists data becomes available.
  • the
  • based On the data scientists create algorithms, download them to DMH and specifies the cost of using algorithm.
  • the
  • Customer can select any algorithm, and then transferred the payment to the Scientist.


Everyone is welcome to link to www.datamininghub.com/invite/me and ask DMH to invite them, just putting the email.

Consider what scientist have to do to participate in the solution of the problem. In principle, everything is quite simple. It is necessary to select a task, create it under the algorithm, test it on the source data. If satisfied with the result, you then have to specify the cost of using algorithm.

the

View more detail



After authentication datamininghub.com page opens and lists all the tasks that need to be addressed. You need to choose any task and to download the source data in the partition Data Set



Next, you need to develop an algorithm using any development tools. Importantly, the algorithm was jar file (or several thereof), which could be run as a job on hadoop.

A small example of the algorithm in Scala is available here: github.com/datamininghub/example-algorithm

A real example of the solution of the task is available at www.datamininghub.com/task/1 or on the same Scala available here: github.com/datamininghub/example-bill-status-prediction

To download your algorithm should be:
    the
  1. to go to DMH.
  2. the
  3. menu, select Algorithms, which opens a page that lists all the algorithms a given user.
  4. the
  5. Click add new algorithm


  6. the
  7. If previously the user profile was not tied to the AWS account, then system will ask you to do this at this stage:


    If the AWS account is missing, you will need to register it.
    By clicking on the link http://aws.amazon.com/free/ possible to register a new account and use the free limits during the year.
    After that you will need to follow the link Sign up for Amazon S3 — Find my keys to create keys that are needed to enter in the DMH.

  8. the
  9. After they binds AWS account page will appear Algorithm details, which will be reflected in the default name of the algorithm N DataMiningHub algorithm for Hadoop 1.0.3 where you must click Edit:


  10. the
  11. On the Algorithm edit it is possible to change the name of the algorithm on some or the other, change the version of Hadoop. Then you must click on the Add step to add a step represents the addition of a jar file containing the code of the algorithm and identify the arguments with which this file is run:


  12. the
  13. On the Add file you need to select the jar file to load and click Upload or to specify S3 reference to the file.



    Taken as example a file with filename bill-status-prediction.jar
    Note: the file download may take some time!
  14. the
  15. Now you must specify the arguments in Step algorithm edit that this jar file will run, and click Save:



    For example, we use the following arguments: -o {output} --events {events} --bill_deputy {bill_deputy} -f

  16. the
  17. After you set the arguments, again the page will appear Algorithm edit, but with the information already entered step. If necessary, you can load other jar files, just click Add step and following steps 6 through 8.
  18. Now in Algorithm details you must click bet on the navigation bar to determine the cost of using the algorithm and perform calculations:


    the

  19. On the Algorithm bet you must select a task in which the algorithm will be used:



    In this example there is only one iteration of Prediction if a bill becomes the law in future or not.
  20. the
  21. On the Add new bet use algorithm %algorithm_name% it is necessary to determine the cost of using the algorithm, and click bet it:


  22. the
  23. On the Edit calculation in the section Mappings to make a mapping of the names of all arguments of all steps (steps) c the original data by clicking on the assign in front of each parameter name and selecting the required data source and click calculate:



    If necessary, you can save the calculation by clicking the Save button.
  24. the
  25. After all the manipulations the page will appear Сalculation details, which displays the status of the calculation. After the computation ends, the result will be sent to the postal address linked to this profile.

    Calculation example in processing:



    Example of a completed calculation:


  26. the
  27. When the calculation is finished its result will appear in the task description, as well as the cost of using the algorithm, and the Customer can choose this algorithm as a solution to tasks:





It is possible to check the efficiency of the algorithm on any input to the issue of the cost of the use of this algorithm by clicking on try it on the navigation bar in Algorithm details. The page will appear edit calculations, section Mappings which will need to load the data for the calculations and click calculate in the navigation pane.
p.s. — special thanks to Eugene for their invaluable contribution to this text!
Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

Fresh hay from the cow, or 3000 icons submitted!

Knowledge base. Part 2. Freebase: make requests to the Google Knowledge Graph

Group edit the resources (documents) using MIGXDB