
How Getty Images Executes Managed Search with Apache Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Jacob Graves’s session on how Getty Images configures Apache Solr for managed search. The problem is to create a framework for business users that will:
  • Hide technical complexity
  • Allow control over scoring components and result ordering
  • Allow balancing of these scoring components against each other
  • Provide feedback
  • Allow visualization of the result of their changes
We call this Managed Search.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Getty Images Executes Managed Search with Apache Solr appeared first on Lucidworks.


How Bloomberg Scales Apache Solr in a Multi-tenant Environment

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Bloomberg engineer Harry Hight’s session on scaling Solr in a multi-tenant environment.

Bloomberg Vault is a hosted communications archive and search solution, with over 2.5 billion documents in a 45TB Solr index. This talk will cover some of the challenges we encountered during the development of our Solr search backend, and the steps we took to overcome them, with emphasis on security and scalability. Basic security always starts with different users having access to subsets of the documents, but it gets more interesting when users only have access to a subset of the data within a given document, and their search results must reflect that restriction to avoid revealing information. Scaling Solr to such extreme sizes presents some interesting challenges. We will cover some of the techniques we used to reduce hardware requirements while still maintaining fast response times.

Harry Hight is a software engineer for Bloomberg Vault. He has been working with Solr/Lucene for the last 3 years building, extending, and maintaining a communications archive/e-discovery search back-end.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Bloomberg Scales Apache Solr in a Multi-tenant Environment appeared first on Lucidworks.

Pushing the Limits of Apache Solr at Bloomberg

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Anirudha Jadhav’s session on going beyond the conventional constraints of Solr.

The goal of the presentation is to delve into the implementation of Solr, with a focus on how to optimize Solr for big data search. Solr implementations are frequently limited to 5k-7k ingest rates in similar use cases. I conducted several experiments to increase the ingest rate as well as the throughput of Solr, and achieved a 5x increase in performance, or north of 25k documents per second. Typically, optimizations are limited by the available network bandwidth. I used three key metrics to benchmark the performance of my Solr implementation: time triggers, document size triggers and document count triggers. The talk will delve into how I optimized the search engine, and how my peers can coax similar performance out of Solr. This is intended to be an in-depth description of the high-frequency search implementation, with Q&A with the audience. All implementations described here are based on the latest SolrCloud multi-datacenter setups.

Anirudha Jadhav is a big data search expert, and has architected and deployed arguably one of the world’s largest Lucene-based search deployments, tipping the scale at a little over 86 billion documents for Bloomberg LP. He has deep expertise in building financial applications, high-frequency trading and search applications, as well as solving complex search and ranking problems. In his free time, he enjoys scuba diving, off-road treks with his 18th century British Army motorbike, building tri-copters and underwater photography. Anirudha earned his Masters in Computer Science from the Courant Institute of Mathematical Sciences, New York University.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Pushing the Limits of Apache Solr at Bloomberg appeared first on Lucidworks.

How StubHub De-Dupes with Apache Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting StubHub engineer Neeraj Jain’s session on de-duping in Solr.

StubHub handles a large number of events and related documents. Use of Solr within StubHub has grown from search for events/tickets to content ingestion. One of the major challenges faced in content ingestion systems is detecting and removing duplicates without compromising on quality and performance. We present a solution that involves spatial searching, a custom update handler, a custom geodist function, etc., to solve the de-duplication problem. In this talk, we’ll present design and implementation details of the custom modules and APIs and discuss some of the challenges that we faced and how we overcame them. We’ll also present a comparison analysis between the old and the new systems used for de-duplication.

Neeraj Jain is an engineer working with StubHub Inc. in San Francisco. He has a special interest in the search domain and has been working with Solr for over 4 years. He also has an interest in mobile app development; he works as a freelancer and has applications on the Google Play store and iTunes store that are built using Solr. Neeraj has a Masters in Technology degree from the Indian Institute of Technology, Kharagpur.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How StubHub De-Dupes with Apache Solr appeared first on Lucidworks.

Lasso Some Prizes by Stumping The Chump in Austin, Texas


Professional Rodeo riders typically only have a few seconds to prove themselves and win big prizes. But you’ve still got two whole weeks to prove you can Stump The Chump with your tough Lucene/Solr questions, and earn both bragging rights and one of these prizes…

  • 1st Prize: $100 Amazon gift certificate
  • 2nd Prize: $50 Amazon gift certificate
  • 3rd Prize: $25 Amazon gift certificate

You don’t have to know how to rope a steer to win, just check out the session information page for details on how to submit your questions. Even if you can’t make it to Austin to attend the conference, you can still participate — and do your part to humiliate me — by submitting your questions.

To keep up with all the “Chump” related info, you can subscribe to this blog (or just the “Chump” tag).

The post Lasso Some Prizes by Stumping The Chump in Austin, Texas appeared first on Lucidworks.

Building a Large Scale SEO/SEM Application with Apache Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Rahul Jain’s session on indexing large-scale SEO/SEM data.

Search engine optimization (SEO) is the process of affecting the visibility of a website or a web page in a search engine’s natural or un-paid (organic) search results, while search engine marketing (SEM) is a form of Internet marketing that involves the promotion of websites by increasing their visibility in search engine results pages (SERPs) through optimization and advertising. We are working on building a SEO/SEM application where an end user searches for a keyword or a domain and gets all the insights about these, including search engine ranking, CPC/CPM, search volume, number of ads, competitor details, etc., in a couple of seconds. To have this intelligence, we get huge web data from various sources, and after intensive processing it amounts to as much as 40 billion records/month in a MySQL database with 4.6 TB of compressed index data in Apache Solr. Due to the large volume, we faced several challenges while improving indexing performance, search latency and scaling the overall system. In this session, I will talk about several of our design approaches to import data faster from MySQL, tricks and techniques to improve the indexing performance, Distributed Search, DocValues (a life saver), Redis and the overall system architecture.

Rahul Jain is a freelance big data/search consultant from Hyderabad, India, where he helps organizations in scaling their big data/search applications. He has 7 years of experience in development of Java and J2EE based distributed systems, with 2 years of experience working with big data technologies (Apache Hadoop/Spark) and search/IR systems (Lucene/Solr/Elasticsearch). In his previous assignments, he was associated with Aricent Technologies and Wipro Technologies Ltd. in Bangalore, where he worked on the development of multiple products. He is a frequent speaker and has given several talks/presentations on multiple topics in the search/IR domain at various meetups and conferences.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Building a Large Scale SEO/SEM Application with Apache Solr appeared first on Lucidworks.

Approaching Join Index in Apache Lucene

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Mikhail Khludnev’s session on joins and block-joins in Lucene.

Lucene works great with independent text documents, but real-life problems often require handling relations between documents. Aside from several workarounds, like term encodings, field collapsing or term positions, we have two mainstream approaches to handle document relations: join and block-join. Both have their downsides. Join lacks performance, while block-join makes it really expensive to handle index updates, since it requires wiping a whole block of related documents. This session presents an attempt to apply a join index, borrowed from the RDBMS world, to address the drawbacks of both join approaches currently present in Lucene. We will look into the idea per se, possible implementation approaches, and review the benchmarking results.

Mikhail has years of experience building backend systems for the retail industry. His interests span from general systems architecture, API design and performance engineering all the way to testing approaches. For the last few years he has worked on an eCommerce search platform, extending Lucene and Solr, contributing back to the community, and speaking at Lucene/Solr Revolution and other conferences.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Approaching Join Index in Apache Lucene appeared first on Lucidworks.

Quantifying Performance Gains When Batching Indexing Updates to Solr


Batching when indexing is good:

For quite some time it’s been part of the lore that one should batch updates when indexing from SolrJ (the post tool too, but I digress). I recently had the occasion to write a test that put some numbers to this general understanding. As usual, YMMV. The interesting bit isn’t the absolute numbers, it’s the relative differences. I thought it might be useful to share the results.

Take-aways:

Well, the title says it all, batching when indexing is good. The biggest percentage jump is the first order of magnitude, i.e. batching 10 docs instead of 1. Thereafter, while the throughput increases, the jump from 10 -> 100 isn’t nearly as dramatic as the jump from 1 -> 10. And this is particularly acute with small numbers of threads.

I have heard anecdotal reports of incremental improvements when going to 10,000 documents/packet, so I urge you to experiment. Just don’t send a single document at a time and wonder why “Indexing to Solr is sooooo slooooowwwww”.
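To make the take-away concrete, here is a minimal, self-contained sketch of the batching pattern. This is not the test program itself; it assumes the SolrJ 4.x HttpSolrServer API used in the test described below, and the collection URL and field names are made up:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    int batchSize = 1000;                // docs per packet sent to Solr
    List<SolrInputDocument> batch = new ArrayList<>();

    for (int i = 0; i < 1000000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      doc.addField("title_s", "document " + i);
      batch.add(doc);
      if (batch.size() >= batchSize) {   // one HTTP request per batch, not per document
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);                 // flush the remainder
    }
    server.commit();
    server.shutdown();
  }
}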

Note that by just throwing a lot of client threads at the problem, one can make up for the inefficiencies of small batches. This illustrates that the majority of the time spent in the small-batch scenario is establishing the connection and sending the documents over the wire. For up to 20 threads in this experiment, though, throughput increases with the packet size. And I didn’t try more than 20 threads.

All these threads were run from a single program; it’s perfectly reasonable to run multiple client programs instead if the data can be partitioned amongst them and/or you’d rather not deal with multi-threading.

This was not SolrCloud. I’d expect these general results to hold though, especially if CloudSolrClient (CloudSolrServer in 4.x/5.x) were used.

Minor rant:

Eventually, you can max out the CPUs on the Solr servers. At that point, you’ve got your maximum possible throughput. Your query response time will suffer if you’re indexing and querying at the same time, of course. I had to slip this comment in here because it’s quite often the case that people on the Solr users list ask “Why is my indexing slow?” 90+ percent of the time it’s because the client isn’t delivering the documents to Solr fast enough and Solr is just idling along using 10% of the CPU. And there’s a very simple way to figure that out… comment out the line in your program that sends docs to Solr, usually a line like:

server.add(doclist);

Anyway, enough ranting. Here are the results; I’ll talk about the environment afterward:

Nice tabular results:

As I mentioned, I stopped at 20 threads. You might increase throughput with more threads, but the general trend is clear enough that I stopped. The rough doubling from 1 to 2 threads indicates that Solr is simply idling along most of the time. Note that by the time we get to 20 threads, the increase is not linear with respect to the number of threads and eventually adding more threads will not increase throughput at all.

Threads    Packet Size    Docs/second

     20              1          5,714
     20             10         16,666
     20            100         18,450
     20          1,000         20,408
      2              1            767
      2             10          4,201
      2            100          7,751
      2          1,000          9,259
      1              1            382
      1             10          2,369
      1            100          5,319
      1          1,000          5,464

Test environment:

  • Solr is running a single node on a Mac Pro with 64G of memory, 16G is given to Solr. That said, indexing isn’t a very memory-heavy operation so the memory allocated to Solr is probably not much of an issue.
  • The files are being parsed locally on a Macbook Pro laptop, connected by a Thunderbolt cable to the Mac Pro.
  • The documents are very simple, there is only a single analyzed field. The rest of the fields are string or numeric types. There are 30 or so short string fields, a couple of integer fields and a date field or two. Hey, it’s the data I had available!
  • There are 200 files of 5,000 documents each for a total of 1M documents.
  • The index always started with no documents.
  • This is the result of a single run at each size.
  • There is a single HttpSolrServer being shared amongst all the threads on the indexing client.
  • There is no query load on this server.

How the program works:

There are two parameters that vary with each run: the number of threads to fire up simultaneously and the number of Solr documents to put in each packet sent to Solr.

The program then recursively descends from a root directory and every time it finds a JSON file it passes that file to a thread in a FixedThreadPool that parses the documents out of the JSON file, packages them up in groups and sends them to Solr. After all files are found, it waits for all the threads to finish and reports throughput.
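Here is a rough sketch of that structure (hypothetical names throughout; the real test program parses documents out of JSON files rather than receiving them pre-parsed, and error handling here is minimal):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexer {

  public static void main(String[] args) throws Exception {
    final int numThreads = 20;     // threads to fire up simultaneously
    final int packetSize = 1000;   // docs per update request sent to Solr

    // A single HttpSolrServer shared by all indexing threads, as in the test above.
    final SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    final AtomicLong docCount = new AtomicLong();
    long start = System.currentTimeMillis();

    // One task per input file; each task batches its documents into packets.
    for (final List<SolrInputDocument> fileDocs : findAndParseJsonFiles()) {
      pool.submit(() -> {
        List<SolrInputDocument> packet = new ArrayList<>();
        try {
          for (SolrInputDocument doc : fileDocs) {
            packet.add(doc);
            if (packet.size() >= packetSize) {  // send a full packet, not one doc at a time
              server.add(packet);
              docCount.addAndGet(packet.size());
              packet.clear();
            }
          }
          if (!packet.isEmpty()) {              // flush the remainder for this file
            server.add(packet);
            docCount.addAndGet(packet.size());
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);   // wait for all the threads to finish
    server.commit();

    double secs = (System.currentTimeMillis() - start) / 1000.0;
    System.out.printf("Indexed %d docs at %.0f docs/second%n", docCount.get(), docCount.get() / secs);
  }

  // Hypothetical helper: the real program recursively descends from a root directory
  // and parses each JSON file it finds into a list of SolrInputDocuments.
  static List<List<SolrInputDocument>> findAndParseJsonFiles() {
    return new ArrayList<>();
  }
}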

I felt the results were consistent enough that running a statistically valid number of tries, averaging across them all and, you know, doing a proper analysis wasn’t time well spent.

Conclusion:

Batch documents when using SolrJ ;). My purpose here was to give some justification to why updates should be batched, just saying “it’s better” has much less immediacy than seeing a 1,400% increase in throughput (1 thread, the difference between 1 doc/packet and 1,000 docs/packet).

The gains would be less dramatic if Solr were doing more work, I’m sure. For instance, if instead of a bunch of un-analyzed fields you threw in 6 long text fields with complex regex analysis chains that used back-references, the results would be quite different. Even so, batching is still recommended if at all possible.

And I want to emphasize that this was on a single, non-SolrCloud node, since I wanted to concentrate entirely on the effects of batching. On a properly set-up SolrCloud system, I’d expect the aggregate indexing process to scale nearly linearly with the number of shards in the system when using the CloudSolrClient (CloudSolrServer in 4.x).

The post Quantifying Performance Gains When Batching Indexing Updates to Solr appeared first on Lucidworks.com.


Implementing Apache Solr at Target


As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Target engineer Raja Ramachandran’s session on implementing Solr at one of the world’s largest retail companies.

Sending Solr into action on a high volume, high profile website within a large corporation presents several challenges — and not all of them are technical. This will be an open discussion and overview of the journey at Target to date. We’ll cover some of the wins, losses and ties that we’ve had while implementing Solr at Target as a replacement for a legacy enterprise search platform. In some cases the solutions were basic, while others required a little more creativity. We’ll cover both to paint the whole picture.

Raja Ramachandran is an experienced Solr architect with a passion for improving relevancy and acquiring data signals to improve search’s contextual understanding of its user.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Implementing Apache Solr at Target appeared first on Lucidworks.com.

Know When To Hold ’em … Know When To Run – Time Is Running Out To Stump The Chump


Are you a Gambler? Even if you aren’t, what are you waiting for?

There’s no ante or buy-in needed to “go all in” for a nice pot of prize money in this year’s Stump The Chump contest at Lucene/Solr Revolution 2015 in Austin, Texas. But time is running out! There are only a few days left for you to submit your most challenging questions.

Even if you can’t make it to Austin to attend the conference, you can still participate. Check out the session information page for details on how to submit your questions.

To keep up with all the “Chump” related info, you can subscribe to this blog (or just the “Chump” tag).

The post Know When To Hold ’em … Know When To Run – Time Is Running Out To Stump The Chump appeared first on Lucidworks.com.

LinkedIn’s Galene Search Architecture Built on Apache Lucene


As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting LinkedIn engineers Diego Buthay and Sriram Sankar’s session on how they run their search architecture for the massive social network.

LinkedIn’s corpus is a richly structured professional graph comprised of 300M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Members perform billions of searches, and each of those searches is highly personalized based on the searcher’s identity and relationships with other professional entities in LinkedIn’s economic graph. And all this data is in constant flux, as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of whom are outside the United States). As a result, we’ve built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique systems challenges we’ve faced as we deliver highly personalized search over semi-structured data at massive scale.

Diego (“Mono”) Buthay is a staff engineer at LinkedIn, where he works on the back-end infrastructure for all of LinkedIn’s search products. Before that, he built the search-as-a-service platform at IndexTank, which LinkedIn acquired in 2011. He has BS and MS degrees in computer software engineering from the University of Buenos Aires.

Sriram Sankar is a principal staff engineer at LinkedIn, where he leads the development of its next-generation search architecture. Before that, he led Facebook’s search quality efforts for Graph Search, and was a key contributor to Unicorn. He previously worked at Google on search quality and ads infrastructure. He is also the author of JavaCC, a leading parser generator for Java. Sriram has a PhD from Stanford University and a BS from IIT Kanpur.

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post LinkedIn’s Galene Search Architecture Built on Apache Lucene appeared first on Lucidworks.com.


Focusing on Search Quality at Lucene/Solr Revolution 2015


I just got back from Lucene/Solr Revolution 2015 in Austin on a big high. There were a lot of exciting talks at the conference this year, but one thing that was particularly exciting to me was the focus that I saw on search quality (accuracy and relevance), on the problem of inferring user intent from queries, and on tracking user behavior and using that to improve relevancy and so on. There were also plenty of great talks on technology issues this week that attack the other ‘Q’ problem – we keep pushing the envelope of what is possible with SolrCloud at scale and under load, are indexing data faster and faster with streaming technologies such as Spark, and are deploying Solr to more and more interesting domains. Big data integrations with SolrCloud continue to be a hot topic – as they should be, since search is probably the most (only?) effective answer to dealing with the explosion of digital information. But without quality results, all the technology improvements in speed, scalability, reliability and the like will be of little real value. Quantity and quality are two sides of the same coin. Quantity is more of a technology or engineering problem (authors like myself that tend to “eschew brevity” being a possible exception) and quality is a language and user experience problem. Both are critical to success, where “success” is defined by happy users. What was really cool to me was the different ways people are finding to solve the same basic problem – what does the user want to find? And how do we measure how well we are doing?

Our Lucidworks CTO Grant Ingersoll started the ball rolling in his opening keynote address by reminding us of the way that we typically test search applications by using a small set of what he called “pet peeve queries” that attack the quality problem in piecemeal fashion but don’t come near to solving it. We pat ourselves on the back when we go to production and are feeling pretty smug about it until real users start to interact with our system and the tweets and/or tech support calls start pouring in – and not with the sentiments we were expecting. We need better ways of developing and measuring search quality. Yes, the business unit is footing the bill and has certain standards (which tend to be their pet peeve queries, as Grant pointed out), so we give them knobs and dials that they can twist to calm their nerves and to get them off our backs, but when the business rules become so pervasive that they start to take over from what the search engine is designed to do, we have another problem. To be clear, there are some situations where we know that the search engine is not going to get it right, so we have to do a manual override. We can either go straight to a destination (using a technique that we call “Landing Pages”) or force what we know to be the best answer to the top – so-called “Best Bets”, which is implemented in Solr using the QueryElevationComponent. However, this is clearly a case where moderation is needed! We should use these tools to tweak our results – i.e. to fix the intractable edge cases, not to fix the core problems.

This ad-hoc or subjective way of measuring search quality that Grant was talking about is pervasive. The reason is that quality – unlike quantity – is hard to measure. What do you mean by “best”? And we know from our own experience and from our armchair data science-esque cogitations on this, that what is best for one user may not be best for another, and this can in fact change over time for a given user. So quality, relevance, is “fuzzy”. But what can we do? We’re engineers, not psychics, dammit! Paul Nelson, the Chief Scientist at Search Technologies, then proceeded to show us what we can do to measure search quality (precision and recall) in an objective (i.e. scientific!) way. Paul gave a fascinating talk showing the types of graphs that you typically see in a nuts-and-bolts talk, tracking the gradual improvement in accuracy over time during the course of search application development. The magic behind all of this is query logs and predictive analytics. So given that you have this data (even if from your previous search engine app) and want to know if you are making “improvements” or not, Paul and his team at Search Technologies have developed a way to use this information to essentially regression test for search quality – pretty cool, huh? Check out Paul’s talk if you didn’t get a chance to see it.

But look, let’s face it, getting computers to understand language is a hard problem. But rather than throwing up our hands, in my humble opinion, we are really starting to dig into solving this one! The rubber is hitting the road, folks. One of the more gnarly problems in this domain is name recognition. Chris Mack of Basis Technologies gave a very good presentation of how Basis is using their suite of language technologies to help solve this. Name matching is hard because there are many ambiguities and alternate ways of representing names and there are many people that share the same name, etc. etc. etc. Chris’s family name is an example of this problem – is it a truck, a cheeseburger (spelled Mac) or a last name? For those of you out there that are migrating from Fast ESP to Solr (a shoutout here to that company in Redmond, Washington for sunsetting enterprise support for Fast ESP – especially on Linux – thanks for all of the sales leads guys! Much appreciated!) – you should know that Basis Technologies (and Search Technologies as well, I believe) have a solution for lemmatization that you can plug into Solr (a more comprehensive way to do stemming). I was actually over at the Basis Tech booth to see about getting a dev copy of their lemmatizer for myself so that we could demonstrate this to potential Fast ESP customers when I met Chris. Besides name recognition, Basis Tech has a lot of other cool things. Their flagship product is Rosette – a world-class ontology / rules-based classification engine, among other things. Check it out.

Next up on my list was Trey Grainger of CareerBuilder. Trey is leading a team there that is doing some truly outstanding work on user intent recognition and using that to craft more precise queries. When I first saw the title of Trey’s talk, “Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine”, I thought that he and his team had scooped me since my own title is very similar – great minds think alike, I guess (certainly true in Trey’s case; a little self-aggrandizement on my part here, but hey, it’s my blog post so cut me some slack!). What they are basically doing is using classification approaches such as machine learning to build a Knowledge Graph in Solr and then using that at query time to determine what the user is asking for, and then crafting a query that brings back those things and other closely related things. The “related to” thing is very important, especially in the buzz-word salad that characterizes most of our resumes these days. The query rewrite that you can do if you get this right can slice through noise hits like a hot knife through butter.

Trey is also the co-author of Solr in Action with our own Tim Potter – I am already on record about this wonderful book – but it was cool what Trey did – he offered a free signed copy to the person who had the best tweet about his talk. Nifty idea – wish I had thought of it but, oh yeah, I’d have to write a book first – whoever won, don’t just put this book on your shelf when you get home – read it!

Not to be outdone, Simon Hughes of Dice.com, Trey’s competitor in the job search sector, gave a very interesting talk about how they are using machine learning techniques such as Latent Semantic Analysis (LSA) and Google’s Word2Vec software to do similar things. They are using Lucene payloads in very interesting ways and building Lucene Similarity implementations to re-rank queries – heavy-duty stuff that the nuts-and-bolts guys would appreciate too (the code that Simon talked about is open sourced). The title of the talk was “Implementing Conceptual Search in Solr using LSA and Word2Vec”. The keyword here is “implementing” – as I said earlier in this post, we are implementing this stuff now, not just talking about it as we have been doing for too long, in my opinion. Simon also stressed the importance of phrase recognition, and I was excited to realize that the techniques that Dice is using can feed into some of my own work, specifically to build autophrasing dictionaries that can then be ingested by the AutoPhraseTokenFilter. In the audience with me were Chris Morley of Wayfair.com and Koorosh Vakhshoori of Synopsys.com, who have made some improvements to my autophrasing code that we hope to submit to Solr and GitHub soon.

Nitin Sharma and Li Ding of BloomReach introduced us to a tool that they are working on called NLP4L – a natural language processing tool for Lucene. In the talk, they emphasized important things like precision and recall and how to use NLP techniques in the context of a Lucene search. It was a very good talk, but I was standing too near the door – because getting a seat was hard – and some noisy people in the hallway were making it difficult to hear well. That’s a good problem to have, as this talk, like the others, was very well attended. I’ll follow up with Nitin and Li because what they are doing is very important and I want to understand it better. Domo arigato!

Another fascinating talk was by Rama Yannam and Viju Kothuvatiparambil (“Viju”) of Bank of America. I had met Viju earlier in the week as he attended our Solr and Big Data course ably taught by my friend and colleague Scott Shearer. I had been tapped to be a Teaching Assistant for Scott. Cool, a TA, hadn’t done that since Grad School, made me feel younger … Anyway, Rama and Viju gave a really great talk on how they are using open-source natural language processing tools such as UIMA, Open NLP, Jena/SPARQL and others to solve the Q&A problem for users coming to the BofA web site. They are also building/using an Ontology (that’s where Jena and SPARQL come in) which as you may know is a subject near and dear to my heart, as well as NLP techniques like Parts Of Speech (POS) detection.

They have done some interesting customizations on Solr, but unfortunately this is proprietary. They were also not allowed to share their slides online or have the talk recorded. People were taking pictures of the slides with their cell phones (not me, I promise) but were asked not to upload them to Facebook, LinkedIn, Instagram or such. There was also a disclaimer bullet on one of their slides like you see on DVDs – the opinions expressed are the authors’ own and not necessarily shared by BofA – ta da ta dum – lawyerese drivel for “we are not liable for ANYTHING these guys say, but they’ll be sorry if they don’t stick to the approved script!” So you will have to take my word for it, it was a great talk, but I have to be careful here – I may be on thin ice already with BofA legal and at the end of the day, Bank of America already has all of my money! That said, I was grateful for this work because it will benefit me personally as a BofA customer even if I can’t see the source code. Their smart search knows the difference between when I need to “check my balance” vs when I need to “order checks”. As they would say in Boston – “Wicked Awesome”! One interesting side note here: Rama and Viju mentioned that the POS tagger that they are using works really well for full sentences (on which the models were trained) but less well on sentence fragments (noun phrases) – still not too bad though – about 80%. More on this in a bit. But hey, banks – gotta love it – don’t get me started on ATM fees.

Last but not least (hopefully?) – as my boss Grant Ingersoll is fond of saying – was my own talk, where I tried to stay competitive with all of this cool stuff. I had to be careful not to call it a Ted talk because that is a registered trademark and I didn’t want to get caught by the “Ted Police”. Notice that I didn’t use all caps to spell my own name here – they registered that, so it probably would have been flagged by the Ted autobots. But enough about me. First I introduced my own pet peeve – why we should think of precision and recall before we worry about relevance tuning, because technically speaking that is exactly what the Lucene engine does. If we don’t get precision and recall right, we have created a garbage-in, garbage-out problem for the ranking engine. I then talked about autophrasing a bit, bringing out my New York – Big Apple demo yet again. I admitted that this is a toy problem, but it does show that you can absolutely nail the phrase recognition and synonym problem, which brings precision and recall to 100%. Although this is not a real-world problem, I have gotten feedback that autophrasing is currently solving production problems, which is why Chris and Koorosh (mentioned above) needed to improve the code over my initial hack for their respective dot-coms.

The focus of my talk then shifted to the work I have been doing on Query Autofiltering, where you get the noun phrases from the Lucene index itself courtesy of the Field Cache (and yes Hoss, uh Chump, it works great, is less filling than some other NLP techniques – and there is a JIRA: SOLR-7539, take a look). This is more useful in a structured-data situation where you have string fields with noun phrases in them. Autophrasing is appropriate for Solr text fields (i.e. tokenized / analyzed fields), so the techniques are entirely complementary. I’m not going to bore you with the details here since I have already written three blog posts on this, but I will tell you that the improvements I have made recently will impel me to write a fourth installment (hey, maybe I can get a movie deal like the guy who wrote The Martian, which started out as a blog … naaaah, his was techy but mine is way too techy and it doesn’t have any NASA tie-ins …)

Anyway, what I am doing now is adding verb/adjective resolution to the mix. The Query Autofiltering stuff is starting to resemble real NLP now, so I am calling it NLP-Lite. “Pseudo NLP”, “Quasi-NLP” and “query time NLP” are also contenders. I tried to do a demo on this (which was partially successful) using a Music Ontology I am developing, where I could get the questions “Who’s in The Who” and “Beatles songs covered by Joe Cocker” right, but Murphy was heavily on my case, so I had to move on because the “time’s up” enforcers were looming and I had a plane to catch. I should say that the techniques that I was talking about do not replace classical NLP – rather, we (collectively speaking) are using classic NLP to build knowledge bases that we can use on the query side with techniques such as query autofiltering. That’s very important, and I have said this repeatedly – the more tools we have, the better chance we have of finding the right one for a given situation. POS tagging works well on full sentences and less well on sentence fragments, where the Query Autofilter excels. So it’s “front-end NLP” – you use classic NLP techniques to mine the data at index time and to build your knowledge base, and you use this type of technique to harvest the gold at query time. Again, the “knowledge base”, as Trey’s talk and my own stressed, can be the Solr/Lucene index itself!

Finally, I talked about some soon-to-be-published work I am doing on auto suggest. I was looking for a way to generate more precise typeahead queries that span multiple fields, which the Query Autofilter could then process. I discovered a way to use Solr facets, especially pivot facets, to generate multi-field phrases, and regular facets to pull context, so that I could build a dedicated suggester collection derived from a content collection (whew!!). The pivot facets allow me to turn a pattern like “genre,musician_type” into “Jazz Drummers”, “Hard Rock Guitarists”, “Classical Pianists”, “Country Singers” and so on. The facets enable me to then grab information related to the subject, so if I do a pivot pattern like “name,composition_type” to generate suggestions like “Bob Dylan Songs”, I can pull back other things related to Bob Dylan, such as “The Band” and “Folk Rock”, that I can then use to create user context for the suggester. Now, if you are searching for Bob Dylan songs, the suggester can start to boost them so that song titles that would normally be down the list will come to the top.

This matches a spooky thing that Google was doing while I was building the music ontology – after a while, it would start to suggest long song titles with just two words entered if my “agenda” for that moment was consistent. So if I am searching for Beatles songs, for example, after a few searches, typing “ba” brings back (in the typeahead) “Baby’s In Black” and “Baby I’m a Rich Man” above the myriad of songs that start with Baby, as well as everything else in their typeahead dictionary starting with “ba”. WOW – that’s cool – and we should be able to do that too! (i.e., be more “Google-esque”, as one of my clients put it in their Business Requirements Document). I call it “On-The-Fly Predictive Analytics” – as we say in the search quality biz, it’s ALL about context!

I say “last but not least” above, because for me, that was the last session that I attended due to my impending flight reservation. There were a few talks that I missed for various other reasons (there was a scheduling conflict, my company made me do some pre-sales work, I was wool gathering or schmoozing/networking, etc) where the authors seem to be on the same quest for search quality. Talks like “Nice Docs Finish First” by Fiona Condon at Etsy, “Where Search Meets Machine Learning” by folks at Verizon, “When You Have To Be Relevant” by Tom Burgmans of Wolters-Kluwer and “Learning to Rank” by those awesome Solr guys at Bloomberg – who have got both ‘Qs’ working big time!

Since I wasn’t able to attend these talks and don’t want to write about them from a position of ignorance, I invite the authors (or someone who feels inspired to talk about it) to add comments to this post so we can get a post-meeting discussion going here. Also, any author that I did mention who feels that I botched my reporting of their work should feel free to correct me. And finally, anybody who submitted on the “Tweet about Trey’s Talk and Win an Autographed Book” contest is encouraged to re-tweet – uh post, your gems here.

So, thanks for all the great work on this very important search topic. Maybe next year we can get Watson to give a talk so we can see what the computers think about all of this. After all, Watson has read all of Bob Dylan’s song lyrics so he (she?) must be a pretty cool dude/gal by now. I wonder what it thinks about “Stuck Inside of Mobile with the Memphis Blues Again”? To paraphrase the song, yes Mama, this is really the end. So, until we meet again at next year’s Revolution, Happy searching!

The post Focusing on Search Quality at Lucene/Solr Revolution 2015 appeared first on Lucidworks.com.

Stump The Chump: Austin Winners


Last week was another great Stump the Chump session at Lucene/Solr Revolution in Austin. After a nice weekend of playing tourist and eating great BBQ, today I’m back at my computer and happy to announce last week’s winners:

I want to thank everyone who participated — either by sending in your questions, or by being there in person to heckle me. But I would especially like to thank the judges and our moderator Cassandra Targett, who had to do all the hard work preparing the questions.

Keep an eye on the Lucidworks YouTube page to see the video once it’s available. And if you can make it to Cambridge, MA next week, make sure to sign up for the October 28th Boston Lucene/Solr MeetUp and hear all about the winning questions, and how I think they stacked up over the past 5 years.

The post Stump The Chump: Austin Winners appeared first on Lucidworks.com.

Solr on Docker


It is now even easier to get started with Solr: you can run Solr on Docker with a single command:

$ docker run --name my_solr -d -p 8983:8983 -t solr

That creates a new Docker container using the new official Solr image, which includes OpenJDK and the latest release of Solr.

Then with a web browser go to http://localhost:8983/ to see the Admin Console (adjust the hostname for your docker host).

To use Solr, you need to create a “core”, an index for your data. For example:

$ docker exec -it --user=solr my_solr bin/solr create_core -c gettingstarted

In the web UI if you click on “Core Admin” you should now see the “gettingstarted” core.

If you want to load some example data:

$ docker exec -it --user=solr my_solr bin/post -c gettingstarted example/exampledocs/manufacturers.xml

In the UI, find the “Core Selector” popup menu and select the “gettingstarted” core, then select the “Query” menu item. This gives you a default search of “*:*”, which returns all docs. Hit the “Execute Query” button, and you should see a few docs with data. Congratulations!
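You can also run the same sanity-check query from the command line (assuming the container’s port 8983 is mapped to localhost as in the run command above):

$ curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true"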

This video demonstrates the image used with the user interface (Kitematic) from the Docker Toolbox on OSX:

 

Further instructions, including how to run in a multi-container configuration, can be found in the documentation, with further details in the FAQ. The code for this image is available in the docker-solr GitHub repository.

For those interested in how this came together: the image is based on the popular makuk66/docker-solr image, and you can see how that was further refined to be even friendlier to use and to better fit into Docker’s maintenance model in this pull request. A big thank-you to the Docker team for their help there.

The post Solr on Docker appeared first on Lucidworks.com.


Lucidworks Announces $21 Million in Series D Funding


It’s a great Wednesday:

Lucidworks, the chosen search solution for leading brands and organizations around the world, today announced $21 million in new financing. Allegis Capital led the round with participation from existing investors Shasta Ventures and Granite Ventures. Lucidworks will use the funds to accelerate its product-focused mission enabling companies to translate massive amounts of data into actionable business intelligence.

“Organizations demand powerful data applications with the ease of use of a mobile app. Our platform provides universal data access to end users both inside and outside the enterprise,” said Will Hayes, CEO, Lucidworks. “With this investment, Lucidworks will expand our efforts to be the premier platform for building search and data-driven applications.”

“Lucidworks has proven itself, not only by providing the software and solutions that businesses need to benefit from Lucene/Solr search, but also by expanding its vision with new products like Fusion that give companies the ability to fully harness search technology suiting their particular customers,” says Spencer Tall, Managing Director, Allegis Capital. “We fully support Lucidworks, not only for what it has achieved to date — disruptive search solutions that offer real, immediate benefits to businesses — but for the promising future of its product technology.”

Full details: http://news.sys-con.com/node/3561957

The post Lucidworks Announces $21 Million in Series D Funding appeared first on Lucidworks.com.

Query Autofiltering IV: A Novel Approach to Natural Language Processing


This is my fourth blog post on a technique that I call Query Autofiltering. The basic idea is that we can use meta information stored within the Solr/Lucene index itself (in the form of string or non-tokenized text fields) to generate a knowledge base from which we can parse user queries and map phrases within the query to metadata fields in the index. This enables us to re-write the user’s query to achieve better precision in the response.

Recent versions of Query Autofiltering, which use the Lucene FieldCache as a knowledge store, are able to do this job rather well but still leave some unresolved ambiguities. This can happen when a given metadata value occurs in more than one field (some examples of this below), so the query autofilter will create a complex boolean query to handle all of the possible permutations. With multiple fields involved, some of the cross-field combinations don’t exist in the index (the autofilter can’t know that), and an additional filtering step happens serendipitously when the query is run. This often gives us exactly the right result, but there is an element of luck involved, which means that there are bound to be situations where our luck runs out.

As I was developing demos for this approach using a music ontology I am working on, I discovered some of these use cases. As usual, once you see a problem and understand the root cause, you can then find other examples of it. I will discuss a biomedical / personal health use case below that I had long thought was difficult or impossible to solve with conventional search methods (not that query autofiltering is “conventional”). But I am getting ahead of myself. The problem crops up when users add verbs, adjectives or prepositions to their query to constrain the results, and these terms do not occur as field values in the index. Rather, they map to fields in the index. The user is telling us that they want to look for a key phrase in a certain metadata context, not all of the contexts in which the phrase can occur. It’s a Natural Language 101 problem! – Subject-Verb-Object stuff. We get the subject and object noun phrases from query autofiltering. We now need a way to capture the other key terms (often verbs) to do a better job of parsing these queries – to give the user the accuracy that they are asking for.

I think that a real world example is needed here to illustrate what I am talking about. In the Music ontology, I have entities like songs, the composers/songwriters/lyricists that wrote them and the artists that performed or recorded them. There is also the concept of a “group” or “band” which consists of group members who can be songwriters, performers or both.

One of my favorite artists (and I am sure that some, but maybe not all of my readers would agree) is Bob Dylan. Dylan wrote and recorded many songs and many of his songs were covered by other artists. One of the interesting verbs in this context is “covered”. A cover in my definition, is a recording by an artist who is not one of the song’s composers. The verb form “to cover” is the act of recording or performing another artist’s composition. Dylan, like other artists, recorded both his own songs and songs of other musicians, but a cover can be a signature too. So for example, Elvis Presley covered many more songs than he wrote, but we still think of “Jailhouse Rock” as an Elvis Presley song even though he didn’t write it (Jerry Leiber and Mike Stoller did).

So if I search for “Bob Dylan Songs” – I mean songs that Dylan either wrote or recorded (i.e. both). However, if I search for “Songs Bob Dylan covered”, I mean songs that Bob Dylan recorded but didn’t write, and “covers of Bob Dylan songs” would mean recordings by other artists of songs that Dylan wrote – Jimi Hendrix’s amazing cover of “All Along The Watchtower” immediately comes to mind here. (There is another linguistic phenomenon besides verbs going on here that I will talk about in a bit.)

So how do we resolve these things? Well, we know that the phrase “Bob Dylan” can occur many places in our ontology/dataset. It is a value in the “composer” field, the “performer” field and in the title field of our record for Bob Dylan himself. It is also the value of an album entity, since his first album was titled “Bob Dylan”. So given the query “Bob Dylan” we should get all of these things – and we do – the ambiguity of the query matches the ambiguities discovered by the autofilter, so we are good. “Bob Dylan Songs” gives us songs that he wrote or recorded – now the query is more specific but there are still some ambiguities here, and we are still good because we have value matches for the whole query. However, if we say “Songs Bob Dylan recorded” vs “Songs Bob Dylan wrote” we are asking for different subsets of “song” things. Without help, the autofilter misses this subtlety because there are no matching fields for the terms “recorded” or “wrote”, so it treats them as filler words.

To make the query autofilter a bit “smarter” we can give it some rules. The rule states that if a term like “recorded” or “performed” is near an entity (detected by the standard query autofilter parsing step) like “Bob Dylan” that maps to the field “performer_ss” then just use that field by itself and don’t fan it out to the other fields that the phrase also maps to. We configure this like so:

    performed, recorded,sang => performer_ss

and for songs composed or written:

    composed,wrote,written by => composer_ss

Here the list of synonymous verb or adjective phrases is on the left, and the field or fields that these should map to are on the right. Now these queries work as expected! Nice.
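As a rough illustration of the effect (schematic only – the actual query the autofilter generates is a more elaborate boolean expression), the two queries above now each resolve to a single field instead of fanning out across every field that contains “Bob Dylan”:

    songs Bob Dylan wrote     =>  recording_type_ss:Song AND composer_ss:"Bob Dylan"
    songs Bob Dylan recorded  =>  recording_type_ss:Song AND performer_ss:"Bob Dylan"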

Another example is if we want to be able to answer questions about the bands that an artist was in or the members of a group. For questions like “Who’s in The Who?” or “Who were the members of Procol Harum?” we would map the verb or prepositional phrases “who’s in” and “members of” to the group_members_ss and member_of_group_ss fields in the index.

    who’s in,was in,were in,member,members => group_members_ss,member_of_group_ss

Now, searching for “who’s in the who” brings back just Messrs. Daltrey, Entwistle, Moon and Townshend – cool!!!

Going Deeper – handling covers with noun-noun phrase ambiguities

The earlier example that I gave, “songs Bob Dylan covered” vs. “covers of Bob Dylan songs”, contains additional complexities that the simple verb-to-field mapping doesn’t solve yet. Looking at this problem from a language perspective (rather than from a software hacking point of view) I was able to find an explanation and, from that, a solution. A side note here: my pre-processing of the ontology detects when a recording is a cover, and outputs the opposite relation when the performer of a song is also one of the composers. Index records of this type get tagged with an “original_performer_s” field and a “version_s:Original” to distinguish them from covers at query time (which are tagged “version_s:Cover”).

Getting back to the language thing, it turns out that in the phrase “Bob Dylan songs covered”, the subject noun phrase is “Bob Dylan songs”! That is, the noun entity is the plural form of song, and the noun phrase “Bob Dylan” qualifies that noun to specify songs by him – it’s what is known in linguistics as a “noun-noun phrase”, meaning that one noun, “Bob Dylan”, serves as an adjective to another one, “song” in this case. Remember – language is tricky! However, in the phrase “Songs Bob Dylan covered”, now “Songs” is the object noun, “Bob Dylan” is the subject noun and “covered” is the verb. To get this one right, I devised an additional rule which I call a pattern rule: if an original_performer entity precedes a composition_type song entity, use that pattern for query autofiltering. This is expressed in the configuration like so:

covered,covers:performer_ss => version_s:Cover | original_performer_s:_ENTITY_,recording_type_ss:Song=>original_performer_s:_ENTITY_

To break this down, the first part does the mapping of ‘covered’ and ‘covers’ to the field performer_ss. The second part sets a static query parameter version_s:Cover and the third part:

        original_performer_s:_ENTITY_,recording_type_ss:Song=>original_performer_s:_ENTITY_

Translates to: if an original performer is followed by a recording type of “song”, use original_performer_s as the field name.

We also want this pattern to be applied in a context-sensitive manner – it is needed to disambiguate the bi-directional verb “cover”, so we only use it in this situation. That is, this pattern rule is only triggered if the verb “cover” is encountered in the query. Again, these rules are use-case dependent and we can grow or refine them as needed. Rule-based approaches like this require curation and analysis of query logs but can be a very effective way to handle edge cases like this. Fortunately, the “just plug it in and forget it” part of the query autofiltering setup handles a large number of use cases without any help. That’s a good balance.

With this rule in place, I was able to get queries like “Beatles Songs covered by Joe Cocker” and “Smokey Robinson songs covered by the Beatles” to work as expected. (The answer to the second one is that great R&B classic “You’ve Really Got A Hold On Me”).
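For illustration, the constraint generated for the first of those queries ends up roughly like this (schematic – the real output is a more elaborate boolean query built from the fields named above):

    Beatles songs covered by Joe Cocker  =>  original_performer_s:"The Beatles" AND recording_type_ss:Song AND version_s:Cover AND performer_ss:"Joe Cocker"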

Healthcare concerns

Let’s examine another domain to see the generality of these techniques. In healthcare, there is a rich ontology that we can think of relating diseases, symptoms, treatments and root biomedical causes. There are also healthcare providers of various specialties and pharmaceutical manufacturers in the picture, among others. In this case, the ontologies are out there (like MeSH), courtesy of the National Library of Medicine and other affiliated agencies. So, imagine that we have a consumer healthcare site with pages that discuss these entities and provide ways to navigate between them. The pages would also have metadata that we can both facet and perform query autofiltering on.

Let’s take a concrete example. Suppose that you are suffering from abdominal pain (sorry about that). This is an example of a condition or symptom that may be benign (you ate or drank too much last night) or a sign of something more serious. Symptoms can be caused by diseases like appendicitis or gastroenteritis, can be treated with drugs, or may even be caused by a drug side effect or adverse reaction. So if you are on this site, you may be asking questions like “what drugs can treat abdominal pain?” and maybe also “what drugs can cause abdominal pain?”. This is a hard problem for traditional search methods, and the query autofilter, without the type of assistance I am discussing here, would not get it right either. For drugs, the metadata fields for the page would be “indication” for positive relationships (an indication is what the drug has been approved for by the FDA) and “side_effect” or “adverse_reaction” for the dark side of pharmaceuticals (don’t those disclaimers on TV ads just seem to go on and on and on?).

With our new query autofilter trick, we can now configure these verb preposition phrases to map to the right fields:

    treat,for,indicated => indication_ss

    cause,produce => side_effect_ss,adverse_reaction_ss

Now these queries should work correctly: our search application is that much smarter – and our users will be much happier with us – because as we know, users asking questions like this are highly motivated to get good, usable answers and don’t have the time/patience to wade through noise hits (i.e. they may already be in pain).
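
As a rough sketch of what that buys us (the collection name is hypothetical, and we assume the symptom entity resolves to “abdominal pain”), the “cause” mapping above turns the second question into something along these lines, ORing across the two mapped fields:

$ curl 'http://localhost:8983/solr/health/select?q=*:*&fq=side_effect_ss:%22abdominal+pain%22+OR+adverse_reaction_ss:%22abdominal+pain%22&wt=json'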

You may be wondering at this point how many of these rules we will need. One thing to keep in mind, and the reason for using examples from two different domains, is the domain-specific nature of these problems. For general web search applications like Google, this list of rules might be very large (but then again, so is Google). For domain-specific applications, as occur in enterprise search or eCommerce, the list can be much more manageable and use-case driven. That is, we will probably discover these fixes as we examine our query logs, but now we have another tool in our arsenal to tackle language problems like this.

Using Natural Language Processing techniques to detect and respond to User Intent

The general technique that I am illustrating here is something that I have been calling “Query Introspection”. A more plain-English way to say this is inferring user intent. That is, using techniques like this we can do a better job of figuring out what the user is looking for, and then modify the query to go get it if we can. It's a natural language processing, or NLP, problem. There are other approaches that have been successful here, notably using parts-of-speech (POS) analysis on the query to get at the nouns, verbs and prepositions that I have been talking about. This can be based on machine learning or algorithmic (rule-based) approaches, and can be a good way of parsing the query into its linguistic component parts. IBM's famous Watson program needed a pretty good one to parse Jeopardy questions. Machine learning approaches can also be applied directly to Q&A problems. A good discussion of this is in Ingersoll et al.'s great book Taming Text.

The user-intent detection step, which the classical NLP techniques discussed above (and now the query autofilter) can perform, represents phase one of the process. Translating this into an appropriately accurate query is the second phase. For POS-tagged approaches, this usually involves a knowledge base that enables parts-of-speech phrases to be mapped to query fields. Obviously, the query autofilter does this natively, as it can get the information from the “horse's mouth”, so to speak. The POS / knowledge base approach may be more appropriate when there is less metadata structure in the index itself, as the KB can be the output of data mining operations. There were some excellent talks on this at the recent Lucene/Solr Revolution in Austin (see my blog post on this). However, if you have already tagged your data, manually or automagically, give query autofiltering a shot.

Source code is available

The Java source code for this is available on GitHub for both Solr 4.x and Solr 5 versions. Technical details about the code and how it works are available there. Download and use this code if you want to incorporate this feature into your search applications now. There is also a Solr JIRA submission (SOLR-7539).

The post Query Autofiltering IV: – A Novel Approach to Natural Language Processing appeared first on Lucidworks.com.

PagerDuty Integration in Lucidworks Fusion


Alerts in Fusion

Sending alert messages is not a new feature for Fusion. Since version 1.4, Fusion users have been able to use the integrated Messaging system to log or send out email or Slack alerts in response to events, such as the presence of specific text in a stream of documents being indexed, or while processing query pipelines.

If you're not familiar with alerts, here's a primer on Fusion's Alert and Messaging architecture. The primer also includes practical instructions on how to set up Fusion to send email or Slack alerts while running Indexing and Query pipelines.

With the release of Fusion version 2.1, in addition to Logging, Email and Slack alerts, Fusion provides a method to send PagerDuty alerts as well.

What is PagerDuty?

PagerDuty is an incident management platform that helps IT operations professionals reduce incident resolution time, improve infrastructure-wide visibility, and improve operational performance. It collects signals from 150+ monitoring tools and connects the problem to the appropriate on-call engineer via phone, SMS, push notification, and email. In addition to IT operations team members, PagerDuty also gives support teams a unified view of all systems, no matter what tools are used and what systems are monitored.

The PagerDuty incident management platform includes team contact information, alert workflows, automatic escalations, on-call scheduling, and analytics for system and team performance.

The Fusion PagerDuty Integration

The integration allows a Fusion user to manage PagerDuty incidents. For every incident, PagerDuty sends alerts according to the alert workflow mentioned above. Fusion uses PagerDuty's Incident API to communicate with PagerDuty servers in real time, so alerts originating in Fusion are sent to the relevant parties in seconds.

Examples of such events could be the presence of certain text in indexing streams, changes in data that Fusion processes or manages, or health-check events associated with various problems (say, a Solr collection is empty, or the number of recently indexed documents is less than expected). As with Slack or email alerting, Fusion can trigger a PagerDuty alert while processing an indexing or query pipeline, based on user-configurable conditions. Since PagerDuty is very support oriented, it makes sense to use it for alerts related to Fusion service health, or for data processing and integrity problems that require immediate attention.

Note that Fusion's support for PagerDuty does not allow users to see or configure escalation policies, or to view alert history; however, we are considering creating a PagerDuty connector for Fusion in the future (so that, for example, PagerDuty alert history can be indexed and searched in Fusion).

Setting Up the Integration

PagerDuty uses “services” to integrate with monitoring tools. Each service has its own alerting and escalation rules, called “escalation policies”, which are used to route alerts to the people best able to handle them. So the first thing you need in order to enable the PagerDuty integration is a Fusion-related service in your PagerDuty account. Once that's done, all Fusion-generated alerts will be associated with this service and dispatched properly to the staff who support Fusion (and related services like Solr).

To create a Service in PD, you first need to create the Escalation Policy that will be associated with that service. Go to Configuration → Escalation Policies in your PagerDuty account and create one to be used for Fusion-related incidents, via the New Escalation Policy button in the top right corner.

Once the escalation policy is configured (let's say it is named “Fusion Support”), go to the Configuration → Services menu item and create a new Service that will be associated and integrated with your Fusion instance. Use the Add New Service button in the top right corner; that brings you to the screen for adding a new Service. Fill in the details for your new service, and note that for the Integration Type radio button you should choose the Use our API directly option. Configure the Notification Urgency and Incident Behavior policies as desired, and complete Service creation with the Add Service button. All done!

Now look at the important key you’ll find in the Integration Settings section of the newly created Service:

[Screenshot: the Service Key shown in the Integration Settings section of the new Service]

This is the unique Service Key you will need to enter in your Fusion UI to configure PD integration.

Here is how:

In the Fusion UI, go to the Applications → System screen and pick Messaging Services:

From the Configure Messaging Service combo box, pick the Pager Duty Message Service entry. Now enter the key you copied from the Service's Integration Settings in PagerDuty into the Pager Duty Service Key field. As for the Pager Duty Service API URL, just keep the default value. Save the changes.

That's it: the PD integration is now configured in Fusion, and it's time to start using it.

PagerDuty Message Stage Configuration

The PagerDuty integration code in Fusion interacts with a PagerDuty service every time a Send PagerDuty Message stage executes as part of a pipeline. The name is a bit misleading: no actual message is sent to PagerDuty (the Fusion stage makes an HTTP API call), and the resulting PagerDuty alert need not be a message either. The naming reflects the general concept Fusion uses to deal with alerts: send them via the messaging system. In some sense, you can think of Fusion sending an alert message via PagerDuty, similar to sending email messages.

This stage triggers, acknowledges or resolves a PagerDuty incident: an event that requires attention until it is resolved or expires. The incident either gets resolved by a human (via the PagerDuty web site or app, once the work on the incident is done), or it expires and is resolved automatically according to the configured PD timeout rules. Until the incident is resolved, PagerDuty will continue sending alerts according to the escalation policy. An incident can be acknowledged to indicate that someone is already working on the issue and to silence alerts for some time. Once the incident is resolved it becomes history; however, if an incident with the same Incident Key is triggered again, the existing closed incident will be reopened.

The stage configuration defines two things. The first is the incident details: what data will be presented, and how, to the people dealing with the issue once they receive the PagerDuty alert. The second is the condition that must be fulfilled for the stage to execute, which is defined in the stage's Conditional Script. Typically that involves evaluating the Fusion objects available in the stage context at the moment of processing (for a Query pipeline stage, the Fusion/Solr query Request and Response; for an Index pipeline stage, the Fusion Pipeline Document). Examples would be checking whether the query response has zero documents, or whether the pipeline document has particular data in a particular field (see below).

Let's talk about setting the incident details first, and then about how to trigger the stage depending on the Conditional Script's outcome.

Setting Incident Details

There are three required fields to be configured for this stage: the Event Type, the Description and the Incident Key. The Event Type is one of trigger, acknowledge or resolve, and defines whether the message will trigger the PagerDuty incident, acknowledge it or resolve it. The Description is a short description of the event. This field (or a truncated version; the maximum length allowed by PD is 1024 characters) will be used when generating phone calls, SMS messages and alert emails, and will also appear in the incidents table in the PagerDuty UI. The Incident Key identifies the incident to which this event should be applied. If there is no open (i.e. unresolved) incident with this key, a new one will be created; if there is already an open incident with a matching key, the event will be appended to that incident's log. So the Incident Key allows you to de-duplicate or group all the events related to the same issue.

Besides those three fields, the rest of the fields that define the incident context are optional.

The Client field is an optional field that holds the name of the monitoring client triggering the event, for example simply “Fusion”. The Client URL field is a callback URL for viewing more details about the event on a site other than PagerDuty; for example, this URL could open a page of the Fusion UI, or a page of your search application, to help solve the original problem.

The stage may also have a list of Incident Details: name-value pairs of arbitrary data that become part of the incident description on the PD site and are included in the incident log. The same goes for Incident Context Links and Incident Context Images: lists of arbitrary data that could be helpful to the people working on the incident. A Context Link is a pair consisting of an arbitrary URL (clickable in the PagerDuty UI) and text describing that URL. A Context Image is a set of fields defining an arbitrary clickable image (the image src URL should use a secure protocol, i.e. start with https://) that becomes part of the incident context visible to someone investigating the incident.

All value fields for those list entries can be parameterized, i.e. the values can be represented by String Templates (see www.stringtemplate.org), and the actual values will be injected by Fusion from the objects available to the stage at the moment of execution. For example, the template expression <doc.id> can be used in an Incident Detail value field; the actual value will be the id of the Fusion pipeline document that the stage was processing:

[Screenshot: an Incident Detail entry whose value uses the <doc.id> template expression] This will be shown on the PagerDuty site as part of the incident context, like this:

Fusion Document Id: 345
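
Fusion makes the call for you, but for reference, a trigger event posted to PagerDuty's generic Events API (v1, the integration endpoint current when this was written) looks roughly like the following; the service key, incident key, URLs and field values here are placeholders:

$ curl https://events.pagerduty.com/generic/2010-04-15/create_event.json -H 'Content-type:application/json' -d '{
  "service_key": "YOUR_SERVICE_KEY_FROM_INTEGRATION_SETTINGS",
  "event_type": "trigger",
  "incident_key": "fusion/indexing/missing-docs",
  "description": "Fusion index pipeline alert: no documents indexed in the last hour",
  "client": "Fusion",
  "client_url": "http://localhost:8764/",
  "details": { "Fusion Document Id": "345" },
  "contexts": [ { "type": "link", "href": "http://localhost:8764/", "text": "Open Fusion UI" } ]
}'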

Setting Stage Execution Conditions

Now let's talk about how to configure the stage so it executes only when we want it to (and therefore triggers or resolves an incident). We definitely do not want to trigger a PagerDuty incident for every Fusion pipeline document processed by an index pipeline, or for every query executed by a query pipeline. A part of the stage configuration, the Conditional Script field, defines when the stage is going to be executed. For the stage to execute, the JavaScript expression in the Conditional Script field must evaluate to true; if the expression evaluates to false, the stage will not be executed.

For example, we may want to run this stage only when we see a pipeline document whose first_name field has the value John. The expression in the Conditional Script field checks the data in that field:

[Screenshot: Conditional Script checking the first_name field of the pipeline document]

Similarly, for a Query pipeline's stage we may want to trigger the PagerDuty alert if some important query returns zero results (something that should not normally happen). The Conditional Script for the Query pipeline's Send PagerDuty Message stage would look like this:

[Screenshot: Conditional Script checking the query response for zero hits]

If the incident Description and/or Incident Key are parameterized to use <request.q>, the incident will be nicely descriptive in the PD UI incidents list (assuming the query string is not too long):

[Screenshot: the resulting incident in the PagerDuty incidents list, with the query string in its description]

Note that to successfully process response.initialEntity.query() in the Conditional Script field, the wt parameter for the Solr query should be set to json; a convenient place to do this is the Set Query Params stage. And, of course, the Query Solr stage should come before the Send PagerDuty Message stage in your Query pipeline.

A Solr query like fetchedDate_dt:[NOW-1HOUR TO NOW], in combination with a Conditional Script that checks for zero hits, would make a good check for a constant flow of documents indexed by your Fusion application, assuming the query pipeline is executed on a regular basis (i.e. registered with the Fusion Scheduler).
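
You can sanity-check that underlying query outside of Fusion with a direct Solr request (the collection name here is hypothetical); a numFound of 0 in the response is the condition the stage would alert on:

$ curl 'http://localhost:8983/solr/mycollection/select?q=fetchedDate_dt:%5BNOW-1HOUR+TO+NOW%5D&rows=0&wt=json'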

Multiple Stages in Pipelines

The Send PagerDuty Message stage does not change the Fusion objects passed to it (i.e. the Fusion pipeline document or the Request and Response); it only evaluates them and, if the Conditional Script resolves to true, sends the message to PagerDuty. So it is possible to have multiple Send PagerDuty Message stages in the same Fusion pipeline. For example, you may configure the pipeline to have two PD stages: one that triggers the PD incident and another that resolves it, based on certain pipeline document data. Take care to use the same Incident Key, and not to send multiple duplicate resolve messages.
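
For reference, the matching resolve event sent to the same Events API endpoint reuses the incident key and simply switches the event type (again a sketch with placeholder values):

$ curl https://events.pagerduty.com/generic/2010-04-15/create_event.json -H 'Content-type:application/json' -d '{
  "service_key": "YOUR_SERVICE_KEY_FROM_INTEGRATION_SETTINGS",
  "event_type": "resolve",
  "incident_key": "fusion/indexing/missing-docs",
  "description": "Document flow restored"
}'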

Also, for an index pipeline, you may use a Set Property stage before the Send PagerDuty Message stage to set a property on the pipeline context or pipeline document, to make the PagerDuty stage's Conditional Script simpler.

Conclusion

The PagerDuty integration allows Fusion users who are interested in monitoring their managed data to add notification and alerting functionality to their search applications, typically by triggering PagerDuty incidents when something significant is discovered while executing Fusion indexing or query pipelines. The Send PagerDuty Message stage is used to notify PagerDuty services, and minimal configuration is required on both the PagerDuty and Fusion sides to enable the integration.

The post PagerDuty Integration in Lucidworks Fusion appeared first on Lucidworks.com.

Visualizing Search Results in Solr: /browse and Beyond


The Series

This is the second in a three-part series demonstrating how it's possible to build a real application using just a few simple commands.  The three parts are:

  • Getting data into Solr using bin/post
  • ==> (you are here) Visualizing search results: /browse and beyond
  • Up next: Putting it together realistically: example/files – a concrete useful domain-specific example of bin/post and /browse

/browse – A simple, configurable, built-in templated results view

We foreshadowed this point in the previous article on bin/post, running these commands:

$ bin/solr create -c solr_docs
$ bin/post -c solr_docs docs/

And here we are: http://localhost:8983/solr/solr_docs/browse?q=faceting

Or sticking with the command-line, this will get you there:

$ open http://localhost:8983/solr/solr_docs/browse?q=faceting

The legacy “collection1”, also known as techproducts

Seasoned Solr developers have probably seen the original incarnation of /browse. Remember /collection1/browse with the tech products indexed? With Solr 5, things got a little cleaner with this example, and it can easily be launched with the -e switch:

$ bin/solr start -e techproducts

The techproducts example will not only create a techproducts collection, it will also index a set of example documents, the equivalent of running:

$ bin/solr create -c techproducts -d sample_techproducts_configs
$ bin/post -c techproducts example/exampledocs/*.xml

You’re ready to /browse techproducts.   This can be done using “open” from the command-line:

$ open http://localhost:8983/solr/techproducts/browse

An “ipod” search results in:

[Screenshot: /techproducts/browse results for an “ipod” search]

The techproducts example is the fullest-featured /browse interface, but it suffers from kitchen-sink syndrome.  It's got some cool things in there, like as-you-type term suggestions (type “ap” and pause, and you'll see “apple” appear), geographic search (products have contrived associated “store” locations), results grouping, faceting, more-like-this links, and “did you mean?” suggestions.   While those are all great features often desired in our search interfaces, the techproducts /browse has been overloaded to support not just the tech products example data, but also the example books data (also in example/exampledocs/), and even to demonstrate rich text files (note the content_type facet).  It's convoluted to start with the techproducts templates and trim them down to your own needs, so the out-of-the-box experience got cleaned up for Solr 5.

New and… generified

With Solr 5, /browse has been designed to come out of the box with the default configuration, data_driven_schema_configs (aka “schema-less”).  The techproducts example has its own separate configuration (sample_techproducts_configs) and custom set of templates, and they were left alone, as you see above.  In order to make the templates work generically for almost any type of data you've indexed, the default templates were stripped down to the basics and baked in.  The first example above, solr_docs, illustrates the out-of-the-box “data driven” experience with /browse.  It doesn't matter what data you put into a data-driven collection, the /browse experience starts with the basic search box and results display.  Let's delve into the /browse side of things with some very simple data in a fresh collection:

$ bin/solr create -c example
$ bin/post -c example -params "f.tags.split=true" -type text/csv \
  -d $'id,title,tags\n1,first document,A\n2,second document,"A,B"\n3,third document,B' 
$ open http://localhost:8983/solr/example/browse

This generic interface shows search results from a query specified in the search box, displays stored field values, includes paging controls, has debugging/troubleshooting features (covered below) and includes a number of other capabilities that aren’t initially apparent.

Faceting

Because the default templates make no assumptions about the type of data or the values in fields, no faceting is turned on by default, but the templates support it.  Add facet.field=tags to a /browse request, such as http://localhost:8983/solr/example/browse?facet.field=tags, and it'll render as shown here.

[Screenshot: the tags facet rendered in the /browse UI]

Clicking the value of a facet filters the results as naturally expected, using Solr’s fq parameter.  The built-in, generic /browse templates, as of Solr 5.4, only support field faceting. Other faceting options (range, pivot, and query) are not supported by the templates – they simply won’t render in the UI.  You’ll notice as you click around after manually adding “facet.field=tags” that the links do not include the manually added parameter.  We’ll see below how to go about customizing the interface, including how to add a field facet to the UI.  But let’s first delve into how /browse works.

Note: the techproducts templates do have some hard-coded support for other facets, which can be borrowed from as needed; continue on to see how to customize the view to suit your needs.

What makes /browse work?

In Solr technical speak, /browse is a search request handler, just like /select – in fact, on any /browse request you can set wt=xml to see the standard results that drive the view.   The difference is that /browse has some additional parameters defined as defaults to enhance querying, faceting, and response writing.  Queries are configured to use the edismax query parser.  Faceting is turned on, though no fields are specified initially, and facet.mincount=1 is set so as to not show zero-count buckets.  The response writing tweaks are the secret sauce of /browse, but otherwise it's just a glorified /select.
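
For example, since explicit request parameters take precedence over the handler's param set defaults, you can peek at the raw response behind any /browse view (the query shown is arbitrary):

$ curl 'http://localhost:8983/solr/example/browse?q=document&wt=json&indent=true'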

VelocityResponseWriter

Requests to /browse are standard Solr search requests with the addition of three parameters:

  • wt=velocity: Use the VelocityResponseWriter for generating the HTTP response from the internal SolrQueryRequest and SolrQueryResponse objects
  • v.template=browse: The name of the template to render
  • v.layout=layout: The name of the template to use as a “layout”, a wrapper around the main v.template specified

Solr generally returns search results as data: XML, JSON, CSV, or other data formats.  At the end of search request processing, the response object is handed off to a QueryResponseWriter to render.  For the data formats, the response object is simply traversed and wrapped with angle, square, and curly brackets.  The VelocityResponseWriter is a bit different, handing the response data object off to a flexible templating system called Velocity.

“Velocity”?  Woah, slow down!   What is this ancient technology of which you speak?  Apache Velocity has been around for a long time; it’s a top-notch, flexible, templating library.  Velocity lives up to its name – it’s fast too.  A good starting point to understanding Velocity is an article I wrote many (fractions of) light years ago: “Velocity: Fast Track to Templating”.  Rather than providing a stand-alone Velocity tutorial here, we’ll do it by example in the context of customizing the /browse view.  Refer to the VelocityResponseWriter documentation in the Reference Guide for more detailed information.

Note: Unless you've taken other precautions, users that can access /browse could also add, modify, delete or otherwise affect collections, documents, and all kinds of things, opening the possibility of data security leaks, denial-of-service attacks, or wiping out partial or complete collections.   That sounds bad, but it's nothing new or different for /browse compared to /select; /browse just looks prettier, and is user-friendly enough that you may want to expose it to non-developers.

Customizing the view

There are several ways to customize the view; it ultimately boils down to the Velocity templates rendering what you want.   Not all modifications require template hacking, though.  The built-in /browse handler uses a relatively new Solr feature called “param sets”, which debuted in Solr 5.0.   The handler is defined like this:

<requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">

The useParams setting specifies which param set(s) to use as default parameters, allowing them to be grouped and controlled through an HTTP API.  An implementation detail, but param sets are defined in a conf/params.json file, and the default set of parameters is spelled out as such:

{"params":{
  "query":{
    "defType":"edismax",
    "q.alt":"*:*",
    "rows":"10",
    "fl":"*,score",
    "":{"v":0}
  },
  "facets":{
    "facet":"on",
    "facet.mincount": "1",
    "":{"v":0}
  },
 "velocity":{
   "wt": "velocity",
   "v.template":"browse",
   "v.layout": "layout",
   "":{"v":0}
 }
}}

The various sets aim to keep parameters grouped by function.  Note that the “browse” param set is not defined, but it is used as a placeholder set name that can be filled in later.  So far so good with straightforward typical Solr parameters being used initially. Again, ultimately everything that renders is a result of the template driving it.  In the case of facets, all field facets in the Solr response will be rendered (from facets.vm).   Using the param set API, we can add the “tags” field to the “facets” param set:

$ curl http://localhost:8983/solr/example/config/params -H 'Content-type:application/json'  -d '{
"update" : {
  "facets": {
    "facet.field":"tags"
    }
  }
}'

Another nicety about param sets: their effect is immediate, whereas changes to request handler definitions require the core to be reloaded or Solr to be restarted.  Just hit refresh in your browser on /browse, and the new tags facet will appear without being explicitly specified in the URL.
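
You can also read the param sets back through the same Config API endpoint to verify the change took effect:

$ curl http://localhost:8983/solr/example/config/params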

See also the file example/films/README.txt for an example adding a facet field and query term highlighting.  The built-in templates are already set up to render field facets and field highlighting when enabled, making it easy to do some basic domain-specific adjustments without having to touch a template directly.

At this point, /browse is equivalent to this /select request: http://localhost:8983/solr/example/select?defType=edismax&q.alt=*:*&rows=10&fl=*,score&facet=on&facet.mincount=1&wt=velocity&v.template=browse&v.layout=layout&facet.field=tags

Again, set wt=xml or wt=json and see the standard Solr response.

Overriding built-in templates

VelocityResponseWriter has a somewhat sophisticated mechanism for locating templates to render.  Using a “resource loader” search path chain, it can get templates from a file system directory, the classpath, a velocity/ subdirectory of the conf/ directory (either on the file system or in ZooKeeper), and even, optionally, from request parameters.  By default, templates are only configured to come from Solr's resource loader, which pulls from conf/velocity/ or from the classpath (including solrconfig.xml-configured JAR files or directories).  The built-in templates live within the solr-velocity JAR file.  These templates can be extracted, even as Solr is running, to conf/velocity so that they can be adjusted.  To extract the built-in templates to your collection's conf/velocity directory, the following command can be used, assuming the “example” collection that we're working with here:

$ unzip dist/solr-velocity-*.jar velocity/*.vm -d server/solr/example/conf/

This trick works when Solr is running in standalone mode.  In SolrCloud mode, conf/ is in ZooKeeper, as are conf/velocity/ and the underlying template files; if you're not seeing your changes to a template, be sure the template is where Solr is looking for it, which may require uploading it to ZooKeeper.  With these templates extracted from the JAR file, you can now edit them to suit your needs.  Template files use the extension .vm, which stands for “Velocity macro” (“macro” is a bit overloaded, unfortunately; these are really best called “template” files).  Let's demonstrate changing the Solr logo in the upper left to a magnifying glass clip art image.   Open server/solr/example/conf/velocity/layout.vm with a text editor, change the <div id="head"> to the following, save the file, and refresh /browse in your browser:

<div id="head">
  <a href="#url_for_home">
     <!-- Borrowed from https://commons.wikimedia.org/wiki/File:Twemoji_1f50e.svg -->
     <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Twemoji_1f50e.svg/50px-Twemoji_1f50e.svg.png"/>
  </a>
</div>

[Screenshot: /browse rendering the magnifying glass logo]

#protip: your boss will love seeing the company logo on your quick Solr prototype.  Don't forget the colors too: the CSS styles can be customized in head.vm.

Customizing results list

The /browse results list is rendered using results_list.vm, which just iterates over all “hits” in the response (the current page of documents), rendering hit.vm for each result.  The rendering of a document in the search results is commonly an area that needs some domain-specific attention. The templates that were extracted will now be used, overriding the built-in ones.  Any templates that you don't need to customize can be removed, falling back to the default ones.  In this example, the template change was specific to the “example” core.  Newly created collections, even data-driven ones, won't have this template change.

NOTE: changes made will be lost if you delete the example collection – see the -Dvelocity.template.base.dir technique to externalize templates from the configuration.
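
A minimal sketch of that technique, assuming your solrconfig.xml wires VelocityResponseWriter's template.base.dir to the velocity.template.base.dir system property: copy your edited templates to a directory outside the Solr install and pass the property at startup (if your bin/solr version doesn't forward -D arguments, set it via SOLR_OPTS in solr.in.sh instead).

$ mkdir -p /opt/mytemplates
$ cp server/solr/example/conf/velocity/*.vm /opt/mytemplates/
$ bin/solr stop -all
$ bin/solr start -Dvelocity.template.base.dir=/opt/mytemplates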

Debugging/Troubleshooting

I like using /browse for debugging and troubleshooting.  In the footer of the default view there is an “enable debug” link that adds debug=true to the current search request.  The /browse templates add a “toggle parsed query” link under the search box and a “toggle explain” link by each search result hit. Searching for “faceting”, enabling debug, and toggling the parsed query tells us how the user's query was interpreted, including what field(s) come into play and any analysis transformations, like stemming or stop word removal, that took place.

[Screenshot: /browse with debug enabled, showing the parsed query]

Toggling the explain on a document provides a detailed, down-to-the-Lucene-level explanation of how the document matched and how the relevancy score was computed.  As shown below, “faceting” appears in the _text_ field (a data_driven_schema_configs copyField destination for all fields, making everything searchable).  “faceting” appears 4 times in this particular document (tf, term frequency), and appears in 24 total documents (docFreq, document frequency).  The fieldNorm factor can be a particularly important one: a factor based on the number of terms in the field, generally giving shorter fields a relevancy advantage over longer ones.

[Screenshot: /browse showing the relevancy explain for a matching document]

Conclusion

VelocityResponseWriter: it's not for everyone or every use case.  Neither is wt=xml, for that matter.  Over the years, /browse has gotten flak for being a “toy” or not “production-ready”.  It's both of those, and then some.  VelocityResponseWriter has been used for:

  • effective sales demos
  • rapid prototyping
  • powering one of our Lucidworks Fusion customers' entire UI, through the secure Fusion proxy
  • and even generating nightly e-mails from a job search site!

Ultimately, wt=velocity is for generating textual (not necessarily HTML) output from a Solr request.  

The post Visualizing Search Results in Solr: /browse and Beyond appeared first on Lucidworks.com.

Data Security and Human Insecurities: How Scammers Take Advantage


Lucidworks CEO Will Hayes' latest Forbes column looks at the ways scammers take advantage of the big holes in big data to prey on all of us:

“The immense amount of data we expose about ourselves make it incredibly easy to get targeted. … These profiles make it easier than ever for up-to-no-gooders to target us — they know exactly where our personal insecurities are and they can tailor attacks in ways that are perfectly suited for their victims. Here’s a look at seven common human insecurities and how scammers attempt to take advantage.

1. Money: You’ve likely seen scams like this: “Earn $800 a week just sitting home and filling out surveys!” Scammers promise quick money for little effort, and all you have to do is pay the “low” price of $34.95 to access the survey database that probably doesn’t even exist. Another common scam is job listings that promise employment from government sources. These fraudulent postings lure people into giving away personal and financial data with the promise of getting a stable, well-paid job.”

Read the other six deadly sins of scamming…

The post Data Security and Human Insecurities: How Scammers Take Advantage appeared first on Lucidworks.com.
