I recently came across a post by someone my own age, and it prompted me to write the story of my own project, which did not help with my university admission at all and in fact only got in the way.

Introduction

One fine day I went to the library for a particular story. I told the librarian the title and the author, and got back a stack of the author's collections. To find the story I needed in that pile, I had to go through every book. It would have been much easier to "google" the right work and get what I wanted in a few clicks.

And then I thought: "Why don't libraries have this yet? It's so convenient!" Of course, like any good lazy programmer, I went straight to a search engine to look for similar projects. And I ran into a problem: all the projects were either commercial (paid) or amateur and of poor quality.

Naturally, this injustice had to be fixed urgently, despite the approaching exams and university admission.

Choosing a language

First, it became clear that I could not do without a web part. Despite my personal dislike of web programming, I sat down with Google and began to study the question.

Word about performance already true

After several hours I finally chose the language I needed. Or rather, not a language but a framework: Ruby on Rails. Why Rails? Because it is built on a fully object-oriented language, is actively developed by a huge community, and has a fast core. Besides, girls love Ruby (the less said about that presentation, the better).

Naturally, I began with the "perfect" Ruby itself. Since I know Java at a junior level, learning another OOP language was not difficult for me. For a week I watched two lectures every evening, and in the end I could boast the required knowledge of Ruby syntax. And then my confidence dropped... The framework is so huge and extensive that knowing Ruby helped me only a little. After studying MVC and other inventions of the evil web programmers, I started implementing the project.

Authorization

The first obstacle in my way was, oddly enough, authorization. I decided to use OAuth. Its principle is that tokens are used instead of passwords. With tokens you do not have to worry about the password being stolen, and each token can be configured individually (for example, you can issue one-time tokens with rights to a single operation). For storing passwords on the server I first planned to use MD5 hashing, but after reading about it on the Internet I concluded that the method is obsolete: on modern hardware an MD5 hash can be cracked in about a minute. So I chose bcrypt, a salted and deliberately slow hash that makes recovering passwords from a stolen database far harder. Having decided on a token system, I had to think about optimization. First, tokens had to be converted from strings to numeric values so that the database could find them with a binary search.
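The idea of storing tokens as numbers and finding them with a binary search can be sketched roughly as follows. This is only an illustration in Python, not the project's code: the `TokenIndex` class, the 128-bit hex token format, and the in-memory sorted list standing in for a database index are all assumptions.

```python
import bisect
import secrets


def new_token():
    """Return a random 128-bit token as a 32-character hex string (assumed format)."""
    return secrets.token_hex(16)


def token_to_int(token):
    """Convert a hex token string to an integer key."""
    return int(token, 16)


class TokenIndex:
    """Hypothetical stand-in for a database index: a sorted list of numeric tokens."""

    def __init__(self):
        self._keys = []

    def add(self, token):
        key = token_to_int(token)
        pos = bisect.bisect_left(self._keys, key)
        # Insert at the sorted position unless the key is already present.
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def contains(self, token):
        key = token_to_int(token)
        pos = bisect.bisect_left(self._keys, key)  # binary search, O(log n)
        return pos < len(self._keys) and self._keys[pos] == key
```

Comparing integers is cheaper than comparing strings, and keeping the keys sorted is what makes the logarithmic lookup possible.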

You might ask what could be complicated about authorization in an elementary school project. Let's start with the fact that the login window alone took me 20 hours. And continue with the fact that my project is fully open source and runs on a local network, which means that anyone only needs to deploy their own server on a phone to harvest tokens. So I came up with a system for checking the authenticity of the server. With each token the server returns two values: the token's ID in the database and a small unique identifier. Every time the client wants to verify the server, it sends the ID and the server returns the identifier. Only then does the client send the token to the server. The result is authentication in both directions.
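The exchange described above can be sketched like this. It is a simplified in-process model in Python, with no network and no persistence; the class and method names (`Server`, `Client`, `identifier_for`, `accepts`) are hypothetical and chosen only for illustration.

```python
import secrets


class Server:
    """Hypothetical server that issues tokens with a record ID and a short identifier."""

    def __init__(self):
        self._records = {}  # record ID -> (token, identifier)
        self._next_id = 1

    def issue_token(self):
        record_id = self._next_id
        self._next_id += 1
        token = secrets.token_hex(16)
        identifier = secrets.token_hex(4)  # the "small unique identifier"
        self._records[record_id] = (token, identifier)
        return record_id, identifier, token

    def identifier_for(self, record_id):
        record = self._records.get(record_id)
        return record[1] if record else None

    def accepts(self, record_id, token):
        record = self._records.get(record_id)
        return record is not None and secrets.compare_digest(record[0], token)


class Client:
    def __init__(self, record_id, identifier, token):
        self.record_id = record_id
        self.identifier = identifier
        self.token = token

    def authenticate(self, server):
        """Send the token only after the server proves it knows our identifier."""
        answer = server.identifier_for(self.record_id)
        if answer is None or not secrets.compare_digest(answer, self.identifier):
            return False  # unknown server: the token is never sent
        return server.accepts(self.record_id, self.token)
```

A server that did not issue the token cannot answer with the right identifier, so the client stops before revealing the token.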

Search algorithm

In all such projects the weak point is the search. I did not want my project to share that flaw. Ready-made solutions were ruled out entirely, because all the major search engines require an Internet connection. So I approached this question with the utmost seriousness. After reading a book on algorithms, I decided that for comparing strings it was best to use the Damerau-Levenshtein distance, since it gives the highest accuracy at the expense of performance, and accuracy is what matters most for the project. But comparing whole sentences is a problem. At first I used a similar algorithm, extending the comparison table to tree graphs. I struggled with this for quite a long time, but in the end I got no result. So I tore it all down and wrote a different algorithm for comparing sentences. Its principle is insanely simple. If a word has fewer than 3 mistakes, count it as correct and add 1 point to the total score. If two correct words come one after another, add 1 more point. And so on.
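A minimal sketch of this scoring, in Python rather than the C++ of the real project. The distance function is the common "optimal string alignment" variant of Damerau-Levenshtein; how query words are paired with title words is not described in the text, so the rule used here (a title word is correct if any query word is within the limit) is an assumption, as are the function names.

```python
def damerau_levenshtein(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition
    (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def sentence_score(query, candidate, max_mistakes=3):
    """Score a candidate title: 1 point per correct word, 1 more per consecutive pair."""
    query_words = query.lower().split()
    score = 0
    previous_matched = False
    for word in candidate.lower().split():
        # Assumption: a word is "correct" if some query word has fewer than max_mistakes errors.
        matched = any(damerau_levenshtein(word, q) < max_mistakes for q in query_words)
        if matched:
            score += 1
            if previous_matched:
                score += 1
        previous_matched = matched
    return score
```

Note that with a fixed limit of 3 mistakes very short words match almost anything, which is one reason such a scheme needs further tuning.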

Because the search is so narrowly specialized, using the syntactic meaning of the sentence makes no sense. Instead, in the future I plan to add phonetic search, keep improving the self-learning system, and centralize all the data.

Another way to improve the quality of the search was self-learning. Each request is recorded in the database and, when needed, extra points are added to popular books in the results. For example, the user "Gosha" loves science fiction. We know this because the number of science-fiction books he has requested is much higher than for other genres. So from then on, when producing search results, the system will give the user "Gosha" extra points for science-fiction books. Besides, such a large body of statistics will come in handy for research.
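The "Gosha" example could look roughly like this in code. It is a hedged sketch only: the data layout (a list of requested genres per user, results as `(title, genre, score)` tuples), the size of the bonus, and the sample titles are all assumptions made for illustration.

```python
from collections import Counter


def favourite_genre(history):
    """Return the genre the user requested most often, or None for an empty history."""
    if not history:
        return None
    return Counter(history).most_common(1)[0][0]


def rank(results, history, bonus=2):
    """Sort (title, genre, base_score) tuples, boosting the user's favourite genre."""
    favourite = favourite_genre(history)

    def boosted(item):
        _title, genre, score = item
        return score + (bonus if genre == favourite else 0)

    return sorted(results, key=boosted, reverse=True)


# Made-up sample data: Gosha mostly asks for science fiction.
gosha_history = ["sci-fi", "sci-fi", "poetry"]
results = [("Eugene Onegin", "poetry", 3), ("Solaris", "sci-fi", 2)]
ranked_for_gosha = rank(results, gosha_history)
```

For a user with no history the base scores decide the order, so the boost only changes results once some statistics have accumulated.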

Of course, this algorithm later received a variety of optimizations, but more about that another time. I decided to implement such a resource-intensive algorithm in C++ and bridge it to Ruby using the Rice library. Here is another clear example of why I chose Ruby on Rails and have not regretted it.

Image recognition

Only recognizing text from photos can solve the problem of rapid digitization. Since holding a book up to a PC webcam to read it is rather silly, I decided to develop a mobile app for smartphones. Writing the basic app took about a week, because smartphone development makes heavy use of multithreading and has rather strict criteria for application design.

After a thorough analysis of the existing solutions, I chose Tesseract as the best free open-source OCR product. Besides, it is actively supported by Google. After playing with it, I realized that the results would not do. Accurate recognition would have required a lot of work on advanced image processing. In addition, I found several serious algorithmic errors in the project's source code, and fixing them could take a looong time. So, with sadness in my voice, I went to ask ABBYY for a temporary license. After two weeks of daily calls, I got it. I cannot fail to mention that the documentation and the Java wrapper are very well written, and the product was a pleasure to work with. However, the recognition engine is still far from 100% accurate, so I had to add a correction step. In a few hours I wrote a bot that downloaded the entire book database from the popular website labirint.ru. With that database and the search algorithm described above, the accuracy of book recognition increased significantly.
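The correction step can be illustrated with a small sketch: noisy OCR output is replaced by the closest title from a known catalogue. This version uses the similarity ratio from Python's standard `difflib` module as a stand-in for the project's own search algorithm; the function name, the cutoff value, and the tiny catalogue are assumptions.

```python
import difflib


def correct_title(ocr_text, catalogue, cutoff=0.6):
    """Return the catalogue title closest to the OCR output, or None if nothing is close."""
    # Compare in lower case, but return the title as it is stored in the catalogue.
    lowered = {title.lower(): title for title in catalogue}
    matches = difflib.get_close_matches(ocr_text.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


# Made-up sample catalogue standing in for the downloaded book database.
catalogue = ["War and Peace", "Crime and Punishment", "Dead Souls"]
fixed = correct_title("Wor and Peaco", catalogue)
```

Returning `None` for text that resembles nothing in the catalogue lets the app fall back to manual input instead of guessing.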

ISBN

Despite the obvious convenience of recognizing books from photos, I also decided to add autocompletion of the fields by ISBN. Since downloading an entire ISBN database is silly, I decided to use the Google Books API to look up ISBNs. For offline users the Labirint database with more than 200,000 Russian books is still available. Moreover, it is easy to implement a barcode scanner for books (with the same ABBYY API), and then adding books is just fun.
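Before asking an online service about an ISBN it is cheap to check the number locally. The sketch below validates an ISBN-13 check digit and builds a Google Books volumes query with the `isbn:` keyword; it only constructs the URL and makes no request. The function names are hypothetical, and whether the project validated ISBNs this way is an assumption.

```python
from urllib.parse import urlencode


def clean_isbn(isbn):
    """Strip the hyphens and spaces that usually appear in printed ISBNs."""
    return isbn.replace("-", "").replace(" ", "")


def is_valid_isbn13(isbn):
    """Check the ISBN-13 checksum: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    cleaned = clean_isbn(isbn)
    if len(cleaned) != 13 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(cleaned))
    return total % 10 == 0


def google_books_url(isbn):
    """Build a Google Books API search URL for the given ISBN."""
    query = urlencode({"q": "isbn:" + clean_isbn(isbn)})
    return "https://www.googleapis.com/books/v1/volumes?" + query
```

A barcode scanner or OCR step can misread a digit, and the checksum catches most such errors before any lookup is attempted.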

Algorithm for ultra-fast library digitization

I thought it would be a good idea to take a picture of a shelf and add all the books on it to the electronic catalogue. This can be implemented easily only with the free OpenCV library and its built-in Canny edge detector. Beyond that, you need to write your own algorithm for a more serious analysis of the main lines of the books, send each spine to the ABBYY API to get the text, run it through the Labirint database, and display interactive UI elements on top of the photographed image. (Unfortunately, I never got around to finishing this feature. There is no time.)
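Since this feature was never finished, the following is purely a hypothetical sketch of one possible step: splitting a shelf photo into individual spines. It assumes an edge map (for example, the output of a Canny detector) has already been reduced to a count of edge pixels per image column; columns with many edge pixels are treated as borders between books. The function name and thresholds are invented for illustration.

```python
def spine_spans(column_edge_counts, min_edge_pixels, min_width=5):
    """Split an image into (start, end) column ranges between strong vertical edges.

    `end` is exclusive. Ranges narrower than `min_width` are discarded as noise.
    """
    spans = []
    start = None
    for x, count in enumerate(column_edge_counts):
        is_boundary = count >= min_edge_pixels
        if not is_boundary and start is None:
            start = x  # first column of a new spine
        elif is_boundary and start is not None:
            if x - start >= min_width:
                spans.append((start, x))
            start = None
    # Close the last spine if the image ends without a boundary column.
    if start is not None and len(column_edge_counts) - start >= min_width:
        spans.append((start, len(column_edge_counts)))
    return spans


# Made-up column profile: strong edges at columns 0, 11 and 15.
profile = [90] + [2] * 10 + [95] + [1] * 3 + [80] + [0] * 8
spans = spine_spans(profile, min_edge_pixels=50)
```

Each resulting range could then be cropped and sent to the OCR step, with the recognized text matched against the book database.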

Conclusion

The low cost of the project and its broad technology base make high technology and convenience accessible even to ordinary students of municipal schools. Please note that the product is completely open and works perfectly without Internet access. Besides, the project is cheap enough to run the server even on a Raspberry Pi costing $20.

Although the project did not help me in life in any way, I gained valuable experience developing a complete website and app, and for me it was wildly entertaining. And finally, a couple of screenshots:

Sources

A big thank you to the English-speaking audience of www.stackoverflow.com
Big thanks to habrahabr.ru user ntz for a superb series of articles on fuzzy search
The project also benefited from www.wikipedia.org
Big thanks to my classmate Eugenia Landryova for the beautiful and colorful logos
A giant thanks to my teacher for supporting my endeavors
A giant thanks to ABBYY for providing the OCR used to recognize text. Without it I would not have been able to implement this project

Article based on information from habrahabr.ru
