A software engineer from New York got so fed up with the irrelevant results and SEO spam in search engines that he decided to create a better one. Two months later, he has a demo search engine up and running. Here is how he did it, and four important insights about what he feels are the hurdles to creating a high-quality search engine.

One of the motives for creating a new search engine was the perception that mainstream search engines contained an increasing amount of SEO spam. After two months, the software engineer wrote about his creation:

“What’s great is the comparable lack of SEO spam.”

Neural Embeddings

The software engineer, Wilson Lin, decided that the best approach would be neural embeddings. He created a small-scale test to validate the approach and noted that the embeddings approach was successful.
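To make the idea concrete, here is a minimal sketch of how embedding-based retrieval ranks pages. It is not Lin's code: the three-dimensional vectors are made-up toy values standing in for real model output, the names `cosine_similarity` and `rank_pages` are hypothetical, and cosine similarity is assumed as the comparison metric because it is a common choice for embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_pages(query_vector, page_vectors):
    """Return page names ordered from most to least similar to the query."""
    return sorted(
        page_vectors,
        key=lambda name: cosine_similarity(query_vector, page_vectors[name]),
        reverse=True,
    )

# Toy embeddings (assumed values); a real system would get these from a model.
page_vectors = {
    "page_a": [0.9, 0.1, 0.0],
    "page_b": [0.1, 0.8, 0.3],
    "page_c": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

ranking = rank_pages(query_vector, page_vectors)
```

The point of the sketch is that ranking depends on how close two vectors are in meaning-space rather than on shared keywords.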

Chunking Content

The next question was how to process the data: should it be divided into blocks of paragraphs or into sentences? He decided that the sentence was the most granular level that made sense, because it enabled identifying the most relevant answer within a sentence while still allowing larger paragraph-level embedding units to be built for context and semantic coherence.

But he still had problems identifying context in indirect references that used words like “it” or “the,” so he took an additional step to better capture context:

“I trained a DistilBERT classifier model that would take a sentence and the preceding sentences, and label which one (if any) it depends upon in order to retain meaning. Therefore, when embedding a statement, I would follow the “chain” backwards to ensure all dependents were also provided in context.

This also had the benefit of labelling sentences that should never be matched, because they were not “leaf” sentences by themselves.”
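The "chain" he describes can be sketched roughly as follows. This is an illustration under stated assumptions, not his implementation: the `depends_on` list is hand-written and stands in for the labels a classifier would produce, and the function name `context_chain` is hypothetical.

```python
from typing import List, Optional

def context_chain(
    sentences: List[str],
    depends_on: List[Optional[int]],
    index: int,
) -> List[str]:
    """Return the sentence at `index`, preceded by every sentence it depends on.

    depends_on[i] is the index of the earlier sentence that sentence i needs
    in order to keep its meaning, or None if it stands on its own.
    """
    chain = []
    seen = set()  # guards against accidental cycles in the labels
    current: Optional[int] = index
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(sentences[current])
        current = depends_on[current]
    chain.reverse()  # restore reading order, earliest sentence first
    return chain

sentences = [
    "Wilson Lin built a search engine.",
    "It uses neural embeddings.",
    "They are computed per sentence.",
]
# Assumed labels: sentence 1 depends on 0, sentence 2 depends on 1.
depends_on = [None, 0, 1]

context = context_chain(sentences, depends_on, 2)
```

Here the text handed to the embedding model for the last sentence would be all three sentences joined in order, so that "They" and "It" can be resolved.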

Identifying The Main Content

A challenge for crawling was developing a way to ignore the non-content parts of a web page in order to index what Google calls the Main Content (MC). What made this challenging was the fact that all websites use different markup to signal the parts of a web page, and although he didn’t mention it, not all websites use semantic HTML, which would make it vastly easier for crawlers to identify where the main content is.

So he basically relied on HTML tags like the paragraph tag, <p>, to identify which parts of a web page contained the content and which parts did not.

This is the list of HTML tags he relied on to identify the main content:

  • blockquote – A quotation
  • dl – A description list (a list of descriptions or definitions)
  • ol – An ordered list (like a numbered list)
  • p – Paragraph element
  • pre – Preformatted text
  • table – The element for tabular data
  • ul – An unordered list (like bullet points)
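A tag-based filter of this kind might look like the sketch below, which uses Python's standard `html.parser` module. It is an assumption about the general technique, not Lin's crawler: the class name `MainContentExtractor` and the function `extract_main_content` are hypothetical, and the simple depth counter assumes content tags are properly closed, which real-world pages often violate.

```python
from html.parser import HTMLParser

# The tags from the list above.
CONTENT_TAGS = {"blockquote", "dl", "ol", "p", "pre", "table", "ul"}

class MainContentExtractor(HTMLParser):
    """Collects text that appears inside any of the content tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of content tags currently open
        self.chunks = []  # text fragments found inside content tags

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)

def extract_main_content(html):
    """Return the text fragments found inside content tags, in page order."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

page = (
    "<nav>Home | About</nav>"
    "<p>First paragraph.</p>"
    "<ul><li>Item one</li></ul>"
    "<footer>Copyright</footer>"
)
main_content = extract_main_content(page)
```

In this toy page the navigation and footer text are dropped, while the paragraph and list item are kept.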

Issues With Crawling

Crawling was another part that came with a multitude of problems to solve. For example, he discovered, to his surprise, that DNS resolution was a fairly frequent…




Last Update: August 19, 2025