Building a library search engine as a junior programmer: what is it like?

I recently came across a post by someone my age, and it prompted me to write my own story about my project, which did absolutely nothing to help me get into university and in fact only got in the way.

Introduction


One fine day I went to the library for a particular short story. When I gave the librarian the title and author, I got a stack of the author's collections. To find the story I needed among them, I had to leaf through every volume. It would be much easier to "google" the right work and get what you want in a few clicks.

And then I thought: "Why don't libraries have this yet? It would be so convenient!" Of course, like any good lazy programmer, I went straight to a search engine to look for similar projects. And I ran into a problem: all the projects were either commercial (paid) or amateur and of poor quality.

Naturally, this injustice urgently needed fixing, despite the approaching exams and university admission.

Choosing a language


First, it became clear that there was no getting around a web component. Despite my personal dislike of web programming, I sat down at Google and began to study the question.

Word about performance already true

After several hours I finally chose a language. Or rather, not a language but a framework: Ruby on Rails. Why Rails? Because it is built on a fully object-oriented language, is actively developed by a huge community, and has a fast core. Besides, girls love Ruby (the less said about that presentation, the better).

Naturally, it is built on the "perfect" Ruby. Since I know Java at a junior level, learning another OOP language was not hard for me. For a week I watched two lectures every evening and could eventually boast the necessary knowledge of Ruby's syntax. And then my confidence dropped... All because the framework is so huge and sprawling that knowing Ruby helped me only a little. After studying the MVC pattern and other ideas of those evil web programmers, I started implementing the project.

Authorization


The first obstacle in my way was, oddly enough, authorization. I decided to use OAuth. Its principle is that tokens are used instead of passwords. With tokens you need not worry that your password will be stolen, and each token can be configured individually (for example, you can issue one-time tokens with rights to a single operation). For storing passwords on the server I first planned to use MD5 hashing, but after reading about it on the Internet I concluded that this method is obsolete: on modern computers it can be cracked in about a minute. So I decided to use bcrypt, which makes recovering passwords from a stolen database far harder. Having settled on such a system, I had to think about optimization. First, the tokens had to be converted from strings to numeric values so that the database could look them up with binary search.
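
To make the two ideas above concrete, here is a minimal Python sketch (not the project's Rails code). It assumes a few things that are not in the original: the names `hash_password`, `check_password` and `TokenStore` are hypothetical, the store is an in-memory dictionary rather than a database, and PBKDF2 from the standard library stands in for bcrypt, which needs a third-party package. Both are slow, salted hashes, which is the property that matters here.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple

ITERATIONS = 100_000  # assumed work factor; raise it as hardware gets faster


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) from a deliberately slow, salted hash."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


class TokenStore:
    """Hypothetical in-memory store of one-time tokens keyed by an integer."""

    def __init__(self) -> None:
        self._scopes: Dict[int, str] = {}

    def issue(self, scope: str) -> int:
        """Create a numeric token that is valid for exactly one operation."""
        token = secrets.randbits(64)  # an integer key is cheap to index and compare
        self._scopes[token] = scope
        return token

    def redeem(self, token: int, scope: str) -> bool:
        # pop() removes the token on any attempt, so it can never be reused
        return self._scopes.pop(token, None) == scope
```

A token issued with `store.issue("borrow_book")` can be redeemed once for that operation; a second `redeem` call with the same token returns `False`.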

You might ask what could possibly be complicated about authorization in a simple school project. Let's start with the fact that the login window alone took me 20 hours. And continue with the fact that my project is fully open source and runs on a local network, which means an attacker only needs to deploy their own copy of the server on a phone to catch tokens. So I came up with a system for checking the server's authenticity. With each token, the server returns two values: an ID in the database and a small unique identifier. Every time the client wants to verify the server, it sends the ID and the server returns the identifier. Only then does the client send its token. The result is authorization from both sides.
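
The exchange described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class and method names (`Server`, `Client`, `login`, `identify`, `accept`, `authorize`) are hypothetical, and the network is replaced by direct method calls.

```python
import secrets


class Server:
    """Hypothetical server that stores (identifier, token) under a record ID."""

    def __init__(self):
        self._records = {}
        self._next_id = 1

    def login(self):
        """Issue a token together with its record ID and a small identifier."""
        record_id = self._next_id
        self._next_id += 1
        identifier = secrets.token_hex(4)
        token = secrets.token_hex(16)
        self._records[record_id] = (identifier, token)
        return record_id, identifier, token

    def identify(self, record_id):
        """Server side of the check: return the identifier stored under this ID."""
        record = self._records.get(record_id)
        return record[0] if record else None

    def accept(self, record_id, token):
        """Accept the token only if it matches the one issued for this ID."""
        record = self._records.get(record_id)
        return record is not None and secrets.compare_digest(record[1], token)


class Client:
    def __init__(self, record_id, identifier, token):
        self.record_id = record_id
        self.identifier = identifier
        self.token = token

    def authorize(self, server):
        """Send the token only after the server proves it knows the identifier."""
        if server.identify(self.record_id) != self.identifier:
            return False  # unknown server: the token never leaves the client
        return server.accept(self.record_id, self.token)
```

A client created from `Server().login()` authorizes against that same server, while a freshly started impostor server fails the identifier check and never sees the token.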

Search algorithm


In all such projects the weak point is search, and I did not want my project to share that weakness. I rejected ready-made solutions outright, because all the major search engines require an Internet connection. So I approached this question with the utmost seriousness. After reading a book about algorithms, I decided that the best way to compare strings is the Damerau-Levenshtein distance, since it gives the highest accuracy at the cost of performance, and accuracy matters most for the project. But comparing whole sentences is a problem. At first I tried a similar algorithm, extending the comparison table into a tree graph. I struggled with this for quite a long time, but in the end got no result. So I tore it all down and wrote a different algorithm for comparing sentences. Its principle is insanely simple. If a word has fewer than 3 mistakes, treat it as correct and add 1 point to the total score. If two correct words are consecutive, add 1 more point. And so on.
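
As a rough illustration of this scoring rule, here is a short Python sketch (not the project's C++ code). Two simplifications are assumptions on my part: it uses the plain Levenshtein distance, without the transpositions that the Damerau variant adds, and it counts a query word as correct if it is close to any word of the title. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def sentence_score(query: str, title: str, max_errors: int = 3) -> int:
    """Score a query against a title with the 'simple' rule from the text."""
    title_words = title.lower().split()
    score = 0
    previous_matched = False
    for word in query.lower().split():
        # a word with fewer than max_errors mistakes counts as correct
        matched = any(levenshtein(word, t) < max_errors for t in title_words)
        if matched:
            score += 1
            if previous_matched:
                score += 1  # bonus for two correct words in a row
        previous_matched = matched
    return score
```

Candidate titles can then be ranked by `sentence_score(query, title)`, highest first. Note that with a fixed threshold of 3, very short words match each other easily, so a real implementation would probably scale the threshold with word length.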

Because the search is so narrowly specialized, using the syntactic meaning of the sentence makes no sense. Therefore, the plan for the future is to add phonetic search, keep improving the self-learning system, and centralize all the data.

Another way to improve search quality is self-learning. Every query is recorded in the database and, when needed, popular books get extra points in the search results. For example, the user "Gosha" loves science fiction. We can tell because the number of science-fiction books he has requested is much higher than for other genres. So from then on, when producing search results for "Gosha", the search system adds extra points to science-fiction books. Besides, such a large body of statistics will come in handy for research.
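
The "Gosha" example can be sketched as follows. This is a minimal Python illustration under assumptions: the request history is a plain list of (title, genre) pairs rather than a database table, the bonus is a fixed 1 point, and the names `favourite_genre` and `rerank` are hypothetical.

```python
from collections import Counter


def favourite_genre(history):
    """Return the genre the user requested most often, or None if no history."""
    counts = Counter(genre for _, genre in history)
    return counts.most_common(1)[0][0] if counts else None


def rerank(results, history, bonus=1):
    """Add bonus points to results in the user's favourite genre, then sort.

    results is a list of (title, genre, score) tuples from the base search.
    """
    favourite = favourite_genre(history)
    boosted = [
        (title, genre, score + (bonus if genre == favourite else 0))
        for title, genre, score in results
    ]
    return sorted(boosted, key=lambda item: item[2], reverse=True)
```

With a history dominated by science fiction, two books that tie on the base score are returned with the science-fiction one first.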

Of course, this algorithm later received a variety of optimizations, but more about that another time. I decided to implement such a resource-intensive algorithm in C++ and bridge it to Ruby with the Rice library. This is another clear example of why I chose Ruby on Rails and have not regretted it.

Image recognition


Only recognizing text in photographs can solve the problem of rapid digitization. Since carrying books to a PC webcam to be read is rather silly, I decided to develop a mobile app for smartphones. Writing the basic app took weeks, because smartphone apps rely heavily on multithreading and impose rather strict requirements on application design.

Then, after a deep analysis of the existing solutions, I chose Tesseract as the best free open-source OCR product. Besides, it is actively supported by Google. After playing with it, I realized that the results would not do. Accurate recognition would have required long work on advanced image preprocessing. In addition, I found a few crude algorithmic errors in the project's source code, and fixing them could take a looong time. So, with sadness in my voice, I went to ask ABBYY for a temporary license. After two weeks of daily calls, I got it. I cannot fail to mention that the documentation and the Java wrapper are very well written, and the product was a pleasure to work with. However, the recognition is far from 100% accurate, so I had to add a correction step. In a few hours I wrote a bot that downloaded the entire book database from the popular website labirint.ru. Using this database and the search algorithm described above, the recognition accuracy for books increased significantly.
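
The correction step amounts to snapping a noisy OCR result to the closest known title. Here is a hedged Python sketch of that idea. It does not use the author's search algorithm: `difflib.get_close_matches` from the standard library is swapped in as the similarity measure, the three-title catalogue is made up for the example, and `correct_title` is a hypothetical name.

```python
import difflib

# Hypothetical stand-in for the downloaded book database
CATALOGUE = [
    "War and Peace",
    "Crime and Punishment",
    "The Master and Margarita",
]


def correct_title(ocr_text, catalogue=CATALOGUE, cutoff=0.6):
    """Return the catalogue title closest to the OCR output, or None.

    cutoff is the minimum similarity ratio (0..1) required for a match.
    """
    by_lower = {title.lower(): title for title in catalogue}
    matches = difflib.get_close_matches(
        ocr_text.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None
```

For instance, an OCR result with two misread characters, such as "Cr1me and Puni5hment", is mapped back to "Crime and Punishment", while text that resembles nothing in the catalogue yields `None`.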

ISBN


Despite the obvious convenience of reading books with the camera, I decided to add autocompletion of the fields by ISBN. Since downloading an entire ISBN database is silly, I decided to use the Google Books API to look books up by ISBN. Offline users still have the Labirint database with more than 200,000 Russian books. Moreover, it is easy to implement a barcode scanner for books (on the same ABBYY API), and then adding books is simply fun.
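
A lookup by ISBN might start like the Python sketch below. It is an illustration under assumptions: the helper names are hypothetical, only 13-digit ISBNs are handled, and validating the check digit before querying is my addition, not something the original project is known to do. The code only builds the Google Books request URL and makes no network call.

```python
from urllib.parse import urlencode

GOOGLE_BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def normalize_isbn13(raw):
    """Strip hyphens and spaces; return the 13 digits, or None if invalid."""
    digits = raw.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdecimal():
        return None
    # ISBN-13 check: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return digits if total % 10 == 0 else None


def lookup_url(raw_isbn):
    """Build the Google Books search URL for a scanned or typed ISBN."""
    isbn = normalize_isbn13(raw_isbn)
    if isbn is None:
        return None
    return GOOGLE_BOOKS_ENDPOINT + "?" + urlencode({"q": "isbn:" + isbn})
```

Rejecting a mistyped or misread ISBN locally saves a request, which matters when the barcode comes from a camera and the server may be offline.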

Ultra-fast library digitization algorithm


I thought it would be a good idea to photograph a shelf and add all of its books to the electronic catalogue. Implementing this easily is possible only with the free OpenCV library and its built-in Canny edge detector. Besides, you need to write your own algorithm for a more thorough analysis of the main lines of the books, send each spine to the ABBYY API to get the text, run it through the Labirint database, and display interactive UI elements on top of the photographed image. (Unfortunately, I never got around to finishing this feature. There was no time.)

Conclusion


The project's low cost and broad technology base make modern, convenient technology accessible even to ordinary students of municipal schools. Please note that the product is completely open and works perfectly without Internet access. Besides, the project is cheap enough that the server can run even on a Raspberry Pi costing $20.

Although it did not help me in life in any way, I gained valuable experience developing a full website and app, and it was wildly entertaining. And finally, a couple of screenshots:

image image

Sources


  • A big thank you to the English-speaking audience of www.stackoverflow.com
  • Big thanks to the habrahabr.ru user ntz for a superb series of articles on fuzzy search
  • The project also benefited from www.wikipedia.org
  • Big thanks to my classmate Eugenia Landryova for the beautiful and colorful logos
  • A giant thanks to my teacher for their support in my endeavors
  • A giant thanks to ABBYY for providing the OCR used to recognize text. Without it I would not have been able to implement this project
Article based on information from habrahabr.ru
