Indexed search with Javascript

16 June 2020

Algorithm Data Structure Javascript

1. The goals
2. The approach
3. Additions
1. 3.1 Passing query to the URL
2. 3.2 New category
4. Final thoughts

It started with an idea: "Wouldn't it be nice to have a search functionality on my blog?". Sure, nowadays there are many libraries and external sites such as Google offering this in some form or other. But, as far as my personal projects are concerned, I am not one to be satiated with a solution that would likely involve said third party injecting hell-knows-what into my little home-brew site. Even if that was an option it would probably be overkill for what I actually need.

Other reasons for doing it from scratch are firstly "why not?" and secondly an opportunity to flex my rather atrophied javascript muscle. Last I did anything related to it was 2013 so might as well say the Cambrian era...

1. The goals

So what's the end game here? Defining goals early on helps keep the scope from ever-expanding during implementation. It is always tempting to explore more than is actually needed and get so spread out that the project gets never finished becoming a time sink with a slice of guilt attached to it as a bonus.

As the search will execute entirely on the client side some considerations must be made towards having descent performance. As this site doesn't have, nor will likely have, 20,000 articles it should be relatively easy to have snippy responses to searches.

The requirements:

site-local solution that runs entirely on the client
results based on the index.json file generated by Blogator
result pool generated at least from:
- article's title,
- article's tags,
- tag names.
result filtering for blog article and tags (maybe even projects as an extra)
accept both exact and partial words in a search string
case-insensitive search
no duplication in results
multiple word matches (individual word result set intersection ∩)
graceful failure when Javascript is disabled in the client browser

2. The approach

The structure of the source JSON generated by Blogator is as follows:

{
    "articles": [
        {
            "title"   : "...",
            "date"    : "...",
            "authors" : [ "..." ],
            "tags"    : [ "..." ],
            "headings": [ "..." ],
            "href"    : "..."
        },
        ...
    ],
    "tags": [
        {
            "name": "...",
            "href": "..."
        },
        ...
    ]
}

As the solution needs to differentiate between categories (articles/tags) the JSON index can be parsed as-is without further manipulations and kept in memory as the source reference. The work will mainly be on creating a searchable index and getting the results out based on the words given in the search string.

2.1 Data-structure

Since it would be pretty cool to have a sort of "auto-complete" feature which fills up results as a query is typed I've gone with a collection of tree data-structure which I'm calling a "grove"^[1] since there won't be such a large number of trees as to call it a forest... (arborism related joke ☺)

This grove contains the root nodes of the trees. Each tree represent a group of words sharing the same starting letter. Meaning that each tree can efficiently hold multiple words that stem from the same root letter or share the same prefix. Each node is identified with a single letter and can link back to sources as well as have any number of children connected to it.

Let's go with an example. Here we have a collection of words that can be grouped into 2 trees: one with a root node 'b' and the other with a root node of 'h':

blog (tag)
blogator (articles)
hell (articles and tag)
hello (articles)
hellraiser (articles)
help (tag)

Visually it will look like this:

Tree-based searching

When transversing a tree from root to all the leaf nodes, all words can be listed. This means that sources of words with a shared prefix can be collected using a recursive function.

accumulateResults = function( node, articles, tags ) {
    gatherNodeResults( node, articles, tags );

    node.child_nodes.forEach(
        ( value ) => { this.accumulateResults( value, articles, tags, projects ); }
    );
}

Recursive tree transversal

For a descent average case performance (O(1) access), children are held in associative array (Maps) whose key are the next possible character for an indexed word.

Where applicable, nodes hold references to the articles/tags that are related to the word assembled by traversing the tree from root to said node.

It's important to note that this approach would not be so great with very large data-sets of lexically different words. The size of the grove and complexity of the trees therein would lead to a longer construction, an increase in the memory footprint, and also have a negative impact on recursive accumulation of matches. The latter could be mitigated by restricting search results to exact matches but that would get rid of the auto-complete feature (the main attraction with this approach).

Thankfully the number and lexical diversity of the words contained in headings and tags of this subject-specific blog are not that huge.

^[1] A grove is a small group of trees.

2.2 Populating the grove

Adding words to the grove is pretty straightforward with this setup. First, words from a source must be extracted then converted to lowercase (case-insensitive search) and, finally, any duplicates removed. Duplicates can be taken care with just using a Set to store the extracted words.

All that's left afterwards is to iterate through the characters on each word whilst transversing the tree at the same time and adding nodes where needed until the last character is reached. Then a reference to the source is added to the corresponding node.

Grove.prototype.addWord = function( src_ref, src_type, word, node ) {
    if( word === '' )
        return;

    const c = word.charAt( 0 );

    if( !node.children.has( c ) )
        node.children.set( c, new Node() );

    let next_node = node.children.get( c );

    if( word.length > 1 ) {
        this.addWord( src_ref, src_type, word.slice( 1 ), next_node );

    } else { //i.e.: last char in word
        if( src_type === "article" )
            next_node.articles_src.push( src_ref );
        if( src_type === "tag"" )
            next_node.tags_src.push( src_ref );
    }
};

Grove.prototype.addGroveTree = function( word, src_ref, src_type ) {
    if( word.length < 2 )
        return;

    const c = word.charAt( 0 );

    if( !this.trees.has( c ) )
        this.trees.set( c, new Node() );

    this.addWord( src_ref, src_type, word.slice( 1 ), this.trees.get( c ) );
}

Recursive word adding for the Grove.

2.3 Parsing search strings

Search string can be 1..n words in length that can either result in partial and exact matches or just exact matches alone. So essentially there are 3 sub-problems to solve here:

how to recognise and deal with partial words
how to recognise and deal with exact words
how to recognise and deal with multiple words that can each be either partial or exact

For the sake of simplicity words that should return exact matches only will have double quotation marks "". No quotation marks should return any sources that has an exact match or has word(s) whose prefix match the given word. As for multiple words, a space will act as a list delimiter in the search string. As examples:

partial matches: java
exact matches: "java"
word set intersection matches (∩): java "notes"

To begin dealing with all of that, the search string needs to be split into its individual words then each of these need to be checked for quotation marks (regex helps here). Based on the result of the check, words can be put into 1 of 2 sets: exact and partial words.

With these two sets, the search can begin for each of the words. As there are two source categories (articles and tags), their results are kept apart to allow category-based filtering on output later on. All results for words can be put into individual sets that are, in turn, stored in arrays.

let article_sets = [];
let tag_sets     = [];

for( let i = 0; i < exact_words.length; ++i ) {
    let a = article_sets.push( new Set() ) - 1;
    let t = tag_sets.push( new Set() ) - 1;

    getExactResults( exact_words[i], article_sets[a], tag_sets[t] );
}

for( let i = 0; i < partial_words.length; ++i ) {
    let a = article_sets.push( new Set() ) - 1;
    let t = tag_sets.push( new Set() ) - 1;

    getPartialResults( partial_words[i], article_sets[a], tag_sets[t] );
}

Getting each of the word's matches

In the case where there was only 1 word in total from both the partial and exact word sets, then any result gathered is returned. When there are more than 1 word total a set intersection must be made from all different result sets in the arrays (article_sets and tag_sets).

The easiest way to to that is just to take the first set in the array, iterate over it and check each item for existence in the other sets. If one doesn't exist (i.e.: intersects) then we can do an early return as there won't be any results.

if( article_sets.length ) {
    article_sets[0].forEach( value => {
        for( let i = 0; i < article_sets.length; ++i ) {
            if( !article_sets[i].has( value ) )
                return;
        }
        article_results.push( value );
    } );
}

Set intersection checking

2.4 Output

The search has completed and there are arrays of source reference (article and tag) that match. What now?

2.4.1 Filtering

For filtering purposes the results can be inserted into different containers (article-results, tag-results). This way they can be shown/hidden with the help of a tick box and an onclick call to a js method that sets the display properties.

showSearchResults = function( cat ) {
    const article_wrapper = document.getElementById( "article-results" );
    const article_opt     = document.getElementById( "show-articles" );
    const tag_div         = document.getElementById( "tag-results-out" );
    const tag_opt         = document.getElementById( "show-tags" );

    switch ( cat ) {
        case "articles": {
            if( article_opt.checked )
                article_wrapper.style.display = 'flex';
            else
                article_wrapper.style.display = 'none';
            break;
        }
        case "tags": {
            if( tag_opt.checked )
                tag_div.style.display = 'flex';
            else
                tag_div.style.display = 'none';
            break;
        }
    }
}

Method to show/hide result blocks

It's simple enough and avoids having to re-run a script.

2.4.2 DOM injection

This is where these results can finally be injected into the result div. There are 2 ways to do that: use the proper JS DOM manipulation functions or just inject HTML code. I went for the latter just because it takes less typing and is visually easier to debug.

populateSearchResults = function( articles, tags ) {
    const container_msg = document.getElementById( "search-msg" );
    const container_a   = document.getElementById( "article-results-out" );
    const container_t   = document.getElementById( "tag-results-out" );

    container_a.innerHTML = "";
    container_t.innerHTML = "";

    if( document.getElementById( "show-articles" ).checked )
        showSearchResults( "articles" );

    if( document.getElementById( "show-tags" ).checked )
        showSearchResults( "tags" );

    if( articles.length || tags.length ) {

        articles.forEach( value => {
            let html = `<a href="../${value.href}">
                            <div>
                                <h3>${value.title}</h3>
                                <span class="date-stamp">${value.date}</span>
                            </div>
                            <div>
                                <div class="tags">`

            value.tags.forEach( tag => { html += `<span class="tag">${tag}</span>`; } )

            html +=             `</div>
                            </div>
                        </a>`;

            container_a.innerHTML += html;
        } );

        tags.forEach( value => {
            container_t.innerHTML +=
                `<a href="../${value.href}">
                    <h3>${value.name}</h3>
                </a>`;
        } );

        container_msg.style.display = 'none';

    } else {
        container_msg.style.display = 'block';
    }
}

HTML injection of results

Each of the results are encased in a link whose 'href' is grabbed from the result's source.

2.5 Graceful failure

In case Javascript is disabled on the host machine a message needs to be seen explaining the situation to the operator. To do that is actually really simple: have a div or other container with the appropriate message and have the script auto-hide it on the page load. If the script can't be run then the message stays in sight.

window.onload = function loadIndexJSON() {
    /* ... code to init the search 'engine' */

    document.getElementById( "search-msg-no-js" ).style.display = 'none';  /* JS disabled */
    document.getElementById( "search-msg" ).style.display       = 'block'; /* JS enabled */
}

Auto-hide 'javascript disabled' message on load

3. Additions

3.1 Passing query to the URL

A neat thing that HTML forms with a GET method can do is append parameters to an 'action' URL. This means that a search bar can be put anywhere on or off the site and redirect to the search page along with the search string the operator typed.

<form action="search.html" method="GET" autocomplete="off">
    <button type="submit"></button>
    <label><input type="text" placeholder="Type search string here." name="search-string" /></label>
</form>

HTML form on other page

loadSearch = function() {
    const regex  = new RegExp( "\\?search-string=(.*)" );
    let   result = regex.exec( decodeURIComponent( window.location.search ) );

    if( result && result.length > 1 )
        postSearchResults( result[1] );
}

JS method to grab URL search parameters

3.2 New category

Once the script worked without issues, the 'Projects' category was added. With the help of Blogator's JSON append functionality the index.json file can now contain the content of a project-specific JSON index so it made sense to include it in the search.

All that was left to do was to amend the code to include that extra category (node, word insertion, result accumulation and output). It does add some complexity but being able to search that category as well is worth it.

{
    "projects": [
        {
            "name": "...",
            "year": "...",
            "tags": [ "..." ],
            "meta": [ "..." ],
            "href": "..."
        },
        ...
    ]
}

Project JSON structure

4. Final thoughts

In the end the search functionality turned out to be pretty responsive and, considering it's the same on a 6 year old mobile device, it should hold up for some time yet.

There are other approaches to this problem but as far as it goes on a small site like mine here, it does a fine job for now.

For a complete look at the glory of the script here's a link to the file.