
As the name suggests, it's a search engine for documents, especially for large volumes of data, offered as software as a service. The application works much like Google Search; in addition, it supports topic-based search with meaningful insights at the per-document level as well as overall insights across all documents, presented through dashboards and charts.
Why do we need a Document Search Engine?
In simple words: because we don't have time. We live in a data world where millions of documents are generated and saved every second, and it's hard to extract useful information from that volume of data. To overcome this problem, we need a solution that serves the business need in the short term as well as the long term.
Real-Life Use Cases
- Knowledge management: Searching across documents (the most obvious application) and answering natural language questions the way we google every day. This alone is a full topic in itself, but I'll keep it short and simple here.
- People and places: Matching candidates to jobs. If you have thousands of CVs in your bucket, you can shortlist with a simple query like "candidates with 5 years of experience, 2 years of experience in cloud, living in Silicon Valley"; you can also find people even when you can't remember how to spell their names.
- Offloading relational databases (RDBs): In a SQL or NoSQL database, every query must be written explicitly and backed by the right indexes, which makes ad-hoc querying painful; you can't query SQL or NoSQL the way we run an everyday Google search.
- If you have millions of documents and user queries are unrestricted, then it's time to move to something new, something interesting and life-changing.
- E-commerce and customer service: Product search
- Legal and contracts: Litigation research, finding laws and regulations
- Security and intelligence: Identifying public threats, identifying insider threats
- Oil and gas: Finding places to drill for oil
- Content suggestions: Supporting the type-ahead or query-completion search feature
- Website: Website search
Feature List
- Searching: Keyword-Based Search, Topic-Based Search, Semantic Search
- Keyphrase Extraction
- Text Summarization
- Highlighting of query results
- Document Categorization
- Word2Vec & Synonym Enrichment
- Feedback Learning / Query Re-ranking
Technology Stack
1. Topic Modeling
Topic modeling is a technique in the field of text mining. As the name suggests, it is a process to automatically identify the topics present in a text object and to derive hidden patterns exhibited by a text corpus, thus assisting better decision making.
It is an unsupervised approach for finding and observing groups of words (called "topics") in large collections of text.
Topics can be defined as "a repeating pattern of co-occurring terms in a corpus". For a topic like Programming Languages, a good topic model should yield terms such as "python", "java", and "go".
Topic models are very useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection.
There are many approaches for obtaining topics from text, such as Term Frequency-Inverse Document Frequency (TF-IDF), Non-Negative Matrix Factorization, and the most important one, Latent Dirichlet Allocation (LDA); and you guessed it right, we are using LDA for topic modeling.
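To make this concrete, here is a minimal sketch of LDA topic extraction using the gensim library; the toy corpus and the hyperparameter values (num_topics, passes) are illustrative assumptions, not our production settings.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus; in practice these are tokenized, stopword-filtered documents.
docs = [
    ["python", "java", "code", "compiler"],
    ["oil", "drill", "gas", "pipeline"],
    ["python", "go", "java", "programming"],
    ["gas", "oil", "energy", "drill"],
]

# Map each unique token to an integer id, then convert docs to bag-of-words vectors.
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA with 2 topics (num_topics and passes are illustrative values).
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10)

# Inspect the top words per topic, e.g. "python", "java", "go"
# should cluster into a programming-language topic.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```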
2. Word2Vec
Word embeddings are one of the most popular representations of document vocabulary. They are capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, and so on.
What are word embeddings exactly? Loosely speaking, they are vector representations of particular words. That raises two questions: how do we generate them, and more importantly, how do they capture context?
Word2Vec is one of the most popular techniques to learn word embeddings using a shallow neural network. It was developed by Tomas Mikolov at Google in 2013.
Consider the following similar sentences: "Have a good day" and "Have a great day". They hardly differ in meaning.
Word2Vec is a method to construct such an embedding. It can be trained using one of two methods (both involving neural networks): Skip-Gram and Continuous Bag of Words (CBOW).
We are using the Continuous Bag of Words approach.
CBOW Model: This method takes the context of each word as input and tries to predict the word corresponding to that context. Consider our example, "Have a great day": given the context words "have", "a", and "day", the model learns to predict the target word "great".
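As a quick illustration, here is a hedged sketch of training CBOW embeddings with gensim's Word2Vec (sg=0 selects CBOW, sg=1 would select Skip-Gram); the toy sentences and hyperparameters are assumptions for demonstration only.

```python
from gensim.models import Word2Vec

# Toy tokenized sentences; real training needs a much larger corpus.
sentences = [
    ["have", "a", "good", "day"],
    ["have", "a", "great", "day"],
    ["wish", "you", "a", "good", "morning"],
]

# sg=0 selects CBOW; vector_size, window, and epochs are illustrative values.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# "good" and "great" appear in near-identical contexts, so with enough data
# their vectors end up close together (scores on this tiny corpus are noisy).
print(model.wv.similarity("good", "great"))
print(model.wv.most_similar("good", topn=3))
```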
3. Solr
Solr runs as a standalone full-text search server. It uses the Lucene Java search library at its core for full-text indexing and search, and it relies on an inverted index for indexing.
An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that directs you from a word to the documents that contain it.
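To ground the idea, here is a minimal inverted index sketched in plain Python; it is a teaching toy that assumes simple whitespace tokenization, not how Lucene implements its index internally.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "Solr runs as a standalone search server",
    2: "Lucene provides full-text indexing and search",
    3: "Solr uses Lucene at its core",
}

index = build_inverted_index(docs)
print(index["solr"])                     # {1, 3}
print(index["lucene"])                   # {2, 3}
# An AND query is just a set intersection over the posting lists.
print(index["solr"] & index["lucene"])   # {3}
```

Solr exposes this same machinery over HTTP, so a query such as q=solr AND lucene performs the same posting-list intersection, just at a much larger scale.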