
Exploring Solr Anti-Patterns with Sematext’s Rafał Kuć

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Sematext’s Rafał Kuć’s session about Solr Anti-Patterns. Be sure to catch his talk at this year’s conference: Large Scale Log Analytics with Solr. Through his work as a consultant and software engineer, Rafał has seen many patterns in how Solr is used – and how it should be used. We usually say what should be done, but we rarely point out what shouldn’t be done. This talk will point out common mistakes and roads that should be avoided at all costs. This session will not only show the bad patterns, but also the differences before and after each fix. The talk is divided into three major sections:
  1. General configuration pitfalls that people commonly fall into. We will discuss different use cases, showing the proper path that one should take
  2. We will focus on data modeling and what to avoid when making your data indexable. Again, we will see real-life use cases, followed by guidance on how to handle them properly
  3. Finally, we will talk about queries and all the juicy mistakes made when it comes to searching for indexed data
Each use case shown will be illustrated by a before-and-after analysis – we will see how the metrics change, so the talk will bring not only pure facts, but hopefully know-how worth remembering. Full deck available on SlideShare. Be sure to catch Rafał’s talk – Large Scale Log Analytics with Solr – at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…



Indexing Arabic Content in Apache Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Ramzi Alqrainy‘s session on using Solr to index and search documents and files in Arabic. The Arabic language poses several challenges for natural language processing (NLP), largely because Arabic, unlike European languages, has a very rich and sophisticated morphological system. This talk will cover some of these challenges and how to solve them with Solr, and will also present the challenges handled by Opensooq as a real-world case from the Middle East. Ramzi Alqrainy is one of the most recognized experts in the artificial intelligence and information retrieval fields in the Middle East. He is an active researcher and technology blogger, with a focus on information retrieval. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


How Shutterstock Searches 35 Million Images by Color Using Apache Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Shutterstock engineer Chris Becker’s session on how they use Apache Solr to search 35 million images by color. This talk covers some of the methods they’ve used for building color search applications at Shutterstock, using Solr to search 40 million images. A couple of these applications can be found in Shutterstock Labs – notably Spectrum and Palette. We’ll go over the steps for extracting color data from images and indexing it into Solr, as well as look at some ways to query color data in your Solr index. We’ll cover issues such as what relevance means when you’re searching for colors rather than text, and how you can achieve various effects by ranking on different visual attributes. At the time of this presentation, Chris was the Principal Engineer of Search at Shutterstock – a stock photography marketplace selling over 35 million images – where he had worked on image search since 2008. In that time he worked on all the pieces of Shutterstock’s search technology ecosystem, from the core platform to relevance algorithms, search analytics, image processing, similarity search, internationalization, and user experience. He started using Solr in 2011 and has used it for building various image search and analytics applications. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Mining Events for Recommendations

Summary: The “EventMiner” feature in Lucidworks Fusion can be used to mine event logs to power recommendations. We describe how the system uses graph navigation to generate diverse and high-quality recommendations.

User Events

The log files that most web services generate are a rich source of data for learning about user behavior and adapting system behavior accordingly. For example, most search engines will automatically log details of user queries and the resulting clicked documents (URLs). We can define a (user, query, click, time) record which captures a unique “event” that occurred at a specific time in the system. Other examples of event data include e-commerce transactions (e.g. “add to cart”, “purchase”), call data records, financial transactions, etc. By analyzing a large volume of these events we can “surface” implicit structures in the data (e.g. relationships between users, queries and documents), and use this information to make recommendations, improve search result quality and power analytics for business owners. In this article we describe the steps we take to support this functionality.

1. Grouping Events into Sessions

Event logs can be considered as a form of “time series” data, where the logged events are in temporal order. We can then make use of the observation that events close together in time will be more closely related than events further apart. To do this we need to group the event data into sessions.
A session is a time window for all events generated by a given source (like a unique user ID). If two or more queries (e.g. “climate change” and “sea level rise”) frequently occur together in a search session then we may decide that those two queries are related. The same would apply for documents that are frequently clicked on together. A “session reconstruction” operation identifies users’ sessions by processing raw event logs and grouping them by user ID, using the time intervals between successive events. If two events triggered by the same user occur too far apart in time, they will be treated as coming from two different sessions. For this to be possible we need some kind of unique ID in the raw event data that allows us to tell that two or more events are related because they were initiated by the same user within a given time period. However, from a privacy point of view, we do not need an ID which identifies an actual real person with all their associated personal information. All we need is an (opaque) unique ID which allows us to track an “actor” in the system.
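To make this concrete, here is a minimal sketch of session reconstruction in Python. The event shape and the 30-minute gap are assumptions for illustration, not Fusion’s actual defaults:

from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECS = 30 * 60  # illustrative threshold, not Fusion's actual default

def sessionize(events):
    # events: iterable of (user_id, timestamp_in_seconds, payload) tuples.
    # Sort by user then time so each user's events are contiguous and ordered.
    sessions = []
    for _user, user_events in groupby(sorted(events, key=itemgetter(0, 1)),
                                      key=itemgetter(0)):
        current, last_ts = [], None
        for event in user_events:
            ts = event[1]
            # A long gap between two events from the same user starts a new session.
            if last_ts is not None and ts - last_ts > SESSION_GAP_SECS:
                sessions.append(current)
                current = []
            current.append(event)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions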

2. Generating a Co-Occurrence Matrix from the Session Data

We are interested in entities that frequently co-occur, as we might then infer some kind of interdependence between those entities. For example, a click event can be described using a click(user, query, document) tuple, and we associate each of those entities with each other and with other similar events within a session. A key point here is that we generate the co-occurrence relations not just between the same field types, e.g. (query, query) pairs, but also “cross-field” relations, e.g. (query, document) and (document, user) pairs. This will give us an N x N co-occurrence matrix, where N is the number of unique instances of the field types that we want to calculate co-occurrence relations for. Figure 1 below shows a co-occurrence matrix that encodes how many times different characters co-occur (appear together in the text) in the novel “Les Miserables”. Each colored cell represents two characters that appeared in the same chapter; darker cells indicate characters that co-occurred more frequently. The diagonal line going from the top left to the bottom right shows that each character co-occurs with itself. You can also see that the character named “Valjean”, the protagonist of the novel, appears with nearly every other character in the book.

Les-Miserables-Co-Occurrence

Figure 1. “Les Miserables” Co-occurrence Matrix by Mike Bostock.

In Fusion we generate a similar type of matrix, where each of the items is one of the types specified when configuring the system. The value in each cell will then be the frequency of co-occurrence for any two given items e.g. a (query, document) pair, a (query, query) pair, a (user, query) pair etc.

For example, if the query “Les Mis” and a click on the web page for the musical appear together in the same user session then they will be treated as having co-occurred. The frequency of co-occurrence is then the number of times this has happened in the raw event logs being processed.
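A minimal sketch of how those counts might be accumulated from the sessions built above, treating every item as an opaque (type, value) pair so that cross-field relations such as (query, document) are counted alongside (query, query) pairs:

from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions):
    # Each session is a list of (type, value) items, e.g. ("query", "les mis")
    # or ("doc", "musical-page"); counting pairs across types gives the
    # "cross-field" relations described above.
    counts = Counter()
    for session in sessions:
        # Deduplicate, then count every unordered pair once per session.
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts

# One session pairing the query "les mis" with a click on the musical's page:
print(cooccurrence_counts([[("query", "les mis"), ("doc", "musical-page")]]))
# Counter({(('doc', 'musical-page'), ('query', 'les mis')): 1})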

3. Generating a Graph from the Matrix

The co-occurrence matrix from the previous step can also be treated as an “adjacency matrix”, which encodes whether two vertices (nodes) in a graph are “adjacent” to each other i.e. have a link or “co-occur”. This matrix can then be used to generate a graph, as shown in Figure 2:

Adjacency-Matrix-Graph

Figure 2. Generating a Graph from a Matrix.

Here the values in the matrix are the frequency of co-occurrence for those two vertices. We can see that in the graph representation these are stored as “weights” on the edge (link) between the nodes, e.g. nodes V2 and V3 co-occurred 5 times.

We encode the graph structure in a collection in Solr using a simple JSON record for each node. Each record contains fields that list the IDs of other nodes that point “in” at this record, or which this node points “out” to.
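An illustrative sketch of one such node record (the field names here are hypothetical, not Fusion’s actual schema):

# Hypothetical node record; Fusion's actual field names may differ.
node = {
    "id": "query:les mis",
    "out_ids": ["doc:3677", "doc:9762"],  # nodes this node points "out" to
    "in_ids": ["user:1459"],              # nodes that point "in" at this node
    "out_weights": [12, 5],               # co-occurrence frequency per out-edge
}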

Fusion provides an abstraction layer which hides the details of constructing queries to Solr to navigate the graph. Because we know the IDs of the records we are interested in, we can generate a single boolean query where the individual IDs we are looking for are separated by OR operators, e.g. (id:3677 OR id:9762 OR id:1459). This means we only make a single request to Solr to get the details we need.
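A sketch of that single-request pattern against Solr’s standard select handler (the collection name and endpoint are assumptions):

import requests

def fetch_nodes(node_ids, solr_url="http://localhost:8983/solr/graph"):
    # Build one boolean query, e.g. (id:3677 OR id:9762 OR id:1459),
    # so all neighbor records come back in a single round trip.
    query = "(" + " OR ".join("id:%s" % n for n in node_ids) + ")"
    resp = requests.get(solr_url + "/select",
                        params={"q": query, "wt": "json"})
    return resp.json()["response"]["docs"]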

In addition, the fact that we are only interested in the neighborhood graph around a start point means the system does not have to store the entire graph (which is potentially very large) in memory.

4. Powering Recommendations from the Graph

At query/recommendation time we can use the graph to make suggestions on which other items in that graph are most related to the input item, using the following approach:

  1. Navigate the co-occurrence graph out from the seed item to harvest additional entities (documents, users, queries).
  2. Merge the lists of entities harvested from different nodes in the graph, so that the more lists an entity appears in, the more weight it receives and the higher it rises in the final output list.
  3. Weights are based on the reciprocal rank of the overall rank of the entity. The overall rank is calculated as the sum of the rank of the result the entity came from and the rank of the entity within its own list; a sketch of this merge follows below.
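A minimal sketch of that reciprocal-rank merge (the exact weighting Fusion uses may differ in detail):

from collections import defaultdict

def merge_harvested(ranked_lists):
    # ranked_lists: entity lists harvested from different graph nodes, each
    # ordered best-first. An entity's overall rank is the sum of its source
    # list's rank and its own rank within that list; its weight is the sum
    # of reciprocal overall ranks across every list it appears in.
    scores = defaultdict(float)
    for list_rank, entities in enumerate(ranked_lists):
        for entity_rank, entity in enumerate(entities):
            overall_rank = list_rank + entity_rank
            scores[entity] += 1.0 / (overall_rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "B" appears in several lists, so it accumulates weight and rises to the top:
print(merge_harvested([["A", "B"], ["B", "C"], ["B"]]))  # ['B', 'A', 'C']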
 

The following image shows the graph surrounding the document “Midnight Club: Los Angeles” from a sample data set:

midnight-club-graph

Figure 3. An Example Neighborhood Graph.

Here the relative size of the nodes shows how frequently they occurred in the raw event data, and the size of the arrows is a visual indicator of the weight or frequency of co-occurrence between two elements.

For example, we can see that the query “midnight club” (blue node on bottom RHS) most frequently resulted in a click on the “Midnight Club: Los Angeles Complete Edition Platinum Hits” product (as opposed to the original version above it). This is the type of information that would be useful to a business analyst trying to understand user behavior on a site.

Diversity in Recommendations

For a given item, we may only have a small number of items that co-occur with it (based on the co-occurrence matrix). By adding in the data from navigating the graph (which comes from the matrix), we increase the diversity of suggestions. Items that appear in multiple source lists then rise to the top. We believe this helps improve the quality of the recommendations and reduce bias. For example, in Figure 4 we show some sample recommendations for the query “Call of Duty”, where the recommendations come from a “popularity-based” recommender, i.e. one that gives a large weight to the items with the most clicks. We can see that the suggestions are all from the “Call of Duty” video game franchise:

Query-Top-Clicks

Figure 4. Recommendations from a “popularity-based” recommender system.

In contrast, in Figure 5 we show the recommendations from EventMiner for the same query:

Query-Related-DocIds

Figure 5. Recommendations from navigating the graph.

Here we can see that the suggestions are now more diverse, with the first two being games from the same genre (“First Person Shooter” games) as the original query.

In the case of an e-commerce site, diversity in recommendations can be an important factor in suggesting items to a user that are related to their original query, but which they may not be aware of. This in turn can help increase the overall CTR (Click-Through Rate) and conversion rate on the site, which would have a direct positive impact on revenue and customer retention.

Evaluating Recommendation Quality

To evaluate the quality of the recommendations produced by this approach we used CrowdFlower to get user judgements on the relevance of the suggestions produced by EventMiner. Figure 6 shows an example of how a sample recommendation was presented to a human judge:

EventMiner-CrowdFlower-Survey-Resident-Evil

Figure 6. Example relevance judgment screen (CrowdFlower).

Here the original user query (“resident evil”) is shown, along with an example recommendation (another video game called “Dead Island”). We can see that the judge is asked to select one of four options, which is used to give the item a numeric relevance score:

  1. Off Topic
  2. Acceptable
  3. Good
  4. Excellent
In this example the user might judge the relevance of this suggestion as “good”, as the game being recommended is in the same genre (“survival horror”) as the original query. Note that the product title contains no terms in common with the query, i.e. the recommendations are based purely on the graph navigation and do not rely on an overlap between the query and the document being suggested. In Table 1 we summarize the results of this evaluation:
Items | Judgements | Users | Avg. Relevance (1 – 4)
1000  | 2319       | 30    | 3.27

Table 1. Summary of the relevance evaluation.
 

Here we can see that the average relevance score across all judgements was 3.27, i.e. between “good” and “excellent”.

Conclusion

If you want an “out-of-the-box” recommender system that generates high-quality recommendations from your data please consider downloading and trying out Lucidworks Fusion.


Apache Solr for Multi-language Content Discovery Through Entity Driven Search

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Alessandro Benedetti’s session on using entity-driven search for multi-language content discovery and search. This talk describes the implementation of a semantic search engine based on Solr. Meaningfully structuring content is critical, and natural language processing and semantic enrichment are becoming increasingly important to improve the quality of Solr search results. Our solution is based on three advanced features:
  1. Entity-oriented search – Searching not by keyword, but by entities (concepts in a certain domain)
  2. Knowledge graphs – Leveraging relationships amongst entities: Linked Data datasets (Freebase, DbPedia, Custom …)
  3. Search assistance – Autocomplete and Spellchecking are now common features, but using semantic data makes it possible to offer smarter features, driving the users to build queries in a natural way.
The approach includes unstructured data processing mechanisms integrated with Solr to automatically index semantic and multi-language information. Smart Autocomplete will complete users’ queries with entity names and properties from the domain knowledge graph. As the user types, the system will propose a set of named entities and/or a set of entity types across different languages. As the user accepts a suggestion, the system will dynamically adapt subsequent suggestions and return relevant documents. Semantic More Like This will find documents similar to a seed one, based on the underlying knowledge in the documents instead of tokens. Alessandro Benedetti is a search expert with a passion for semantic technology, working in the R&D division of Zaizi. His favorite work is R&D on information retrieval, NLP and machine learning, with a big emphasis on data structures, algorithms and probability theory. Alessandro earned his Master’s in Computer Science with full marks in 2009, then spent six months at Università degli Studi di Roma working on his master’s thesis on a new approach to improve semantic web search. Alessandro spent 3 years with Sourcesense as a Search and Open Source consultant and developer. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Better Search with Fusion Signals


Signals in Lucidworks Fusion leverage information about external activity, e.g., information collected from logfiles and transaction databases, to improve the quality of search results. This post follows on from my previous post, Basics of Storing Signals in Solr with Fusion for Data Engineers, which showed how to index and aggregate signal data. In this post, I show how to write and debug query pipelines using this aggregated signal information.

User clicks provide a link between what people ask for and what they choose to view from a set of search results, usually with product images. In the aggregate, if users have winnowed the search results for a given kind of thing down to products that are exactly that kind of thing – e.g., if the logfile entries link queries for “Netgear”, or “router”, or “netgear router” to clicks on products that really are routers – then this information can be used to improve new searches over the product catalog.

The Story So Far

To show how signals can be used to improve search in an e-commerce application, I created a set of Fusion collections:

  • A collection called “bb_catalog”, which contains Best Buy product data, a dataset comprised of over 1.2M items, mainly consumer electronics such as household appliances, TVs, computers, and entertainment media such as games, music, and movies. This is the primary collection.
  • An auxiliary collection called “bb_catalog_signals”, created from a synthetic dataset over Best Buy query logs from 2011. This is the raw signals data, meaning that each logfile entry is stored as an individual document.
  • An auxiliary collection called “bb_catalog_signals_aggr” derived from the data in “bb_catalog_signals” by aggregating all raw signal records based on the combination of search query, field “query_s”, item clicked on, field “doc_id_s”, and search categories, field “filters_ss”.

All documents in collection “bb_catalog” have a unique product ID stored in field “id”. All items belong to one or more categories, which are stored in the field “categories_ss”.

The following screenshot shows the Fusion UI search panel over collection “bb_catalog”, after using the Search UI Configuration tool to limit the document fields displayed. The gear icon next to the search box toggles this control open and closed. The “Documents” settings are set so that the primary field displayed is “name_t”, the secondary field is “id”, and additional fields are “name_t”, “id”, and “category_ss”. The document in the yellow rectangle is a Netgear router with product id “1208844”.

bb_catalog

For collection “bb_catalog_signals”, the search query string is stored in field “query_s”, the timestamp is stored in field “tz_timestamp_txt”, the id of the document clicked on is stored in field “doc_id_s”, and the set of category filters are stored in fields “filters_ss” as well as “filters_orig_ss”.

The following screenshot shows the results of a search for raw signals where the id of the product clicked on was “1208844”.

bb_catalog

The collection “bb_catalog_signals_aggr” contains aggregated signals. In addition to the fields “doc_id_s”, “query_s”, and “filters_ss”, aggregated click signals contain fields:

  • “count_i” – the number of raw signals found for this query, doc, filter combo.
  • “weight_d” – a real number used as a multiplier to boost the score of these documents.
  • “tz_timestamp_txt” – all timestamps of raw signals, stored as a list of strings.
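For illustration, an aggregated record for the “netgear” example below might look roughly like this. The field names come from the schema above; the filter and weight values are invented for the sketch:

# Illustrative aggregated click signal; filter and weight values are invented.
aggregated_signal = {
    "query_s": "netgear",
    "doc_id_s": "1208844",
    "filters_ss": ["Networking"],     # invented category filter
    "count_i": 3,                     # number of raw signals for this combo
    "weight_d": 1.5,                  # invented boost multiplier
    "tz_timestamp_txt": [             # all raw-signal timestamps, as strings
        "2011-09-01T12:00:00Z",
        "2011-09-02T08:30:00Z",
        "2011-09-04T19:45:00Z",
    ],
}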

The following screenshot shows aggregated signals for searches for “netgear”. There were 3 raw signals where the search query “netgear” and some set of category choices resulted in a click on the item with id “1208844”:

bb_catalog

Using Click Signals in a Fusion Query Pipeline

Fusion's Query Pipelines take as input a set of search terms and process them into a Solr query request. The Fusion UI Search panel has a control which allows you to choose the processing pipeline. In the following screenshot of the collection “bb_catalog”, the query pipeline control is just below the search input box. Here the pipeline chosen is “bb_catalog-default” (circled in yellow):

bb_catalog

The pre-configured default query pipelines consist of 3 stages:

  • A Search Fields query stage, used to define common Solr query parameters. The initial configuration specifies that the 10 best-scoring documents should be returned.
  • A Facet query stage which defines the facets to be returned as part of the Solr search results. No facet field names are specified in the initial defaults.
  • A Solr query stage which transforms a query request object into a Solr query and submits the request to Solr. The default configuration specifies the HTTP method as a POST request.

In order to get text-based search over the collection “bb_catalog” to work as expected, the Search Fields query stage must be configured to specify the set of fields which contain relevant text. For the majority of the 1.2M products in the product catalog, the item name, found in field “name_t”, is the only field amenable to free-text search. The following screenshot shows how to add this field to the Search Fields stage by editing the query pipeline via the Fusion 2 UI:

add search field, search term: ipad

The search panel on the right displays the results of a search for “ipad”. There were 1,359 hits for this query, which far exceeds the number of items that are an Apple iPad. The best scoring items contain “iPad” in the title, sometimes twice, but these are all iPad accessories, not the device itself.

Recommendation Boosting query stage

A Recommendation Boosting stage uses aggregated signals to selectively boost items in the set of search results. The following screenshot shows the results of the same search after adding a Recommendations Boosting stage to the query pipeline:

recommendations boost, search term: ipad

The edit pipeline panel on the left shows the updated query pipeline “bb_catalog-default” after adding a “Recommendations Boosting” stage. All parameter settings for this stage have been left at their default values. In particular, the recommendation boosts are applied to field “id”. The search panel on the right shows the updated results for the search query “ipad”. Now the three most relevant items are Apple iPads. They are iPad 2 models because the click dataset used here is based on logfile data from 2011, and at that time the iPad 2 was the most recent iPad on the market. There were more clicks on the 16GB iPads than on the more expensive 32GB model, and on the color black than on the color white.

Peeking Under the Hood

Of course, under the hood, Fusion is leveraging the awesome power of Solr. To see how this works, I show both the Fusion query and the JSON of the Solr response. To display the Fusion query, I go into the Search UI Configuration, change the “General” settings, and check the “Show Query URL” option. To see the Solr response in JSON format, I change the display control from “Results” to “JSON”.

The following screenshot shows the Fusion UI search display for “ipad”:

recommendations boost, under the hood

The query “ipad” entered via the Fusion UI search box is transformed into the following request sent to the Fusion REST-API:

/api/apollo/query-pipelines/bb_catalog-default/collections/bb_catalog/select?fl=*,score&echoParams=all&wt=json&json.nl=arrarr&sort&start=0&q=ipad&debug=true&rows=10
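The same request can be reproduced outside the UI with any HTTP client; here is a sketch, where the host and port assume a default local Fusion install:

import requests

# Send the query through the "bb_catalog-default" pipeline, as the UI does.
# Host and port are assumptions for a default local Fusion install;
# authentication is omitted for brevity.
resp = requests.get(
    "http://localhost:8764/api/apollo/query-pipelines/bb_catalog-default"
    "/collections/bb_catalog/select",
    params={"q": "ipad", "fl": "*,score", "wt": "json",
            "debug": "true", "rows": 10, "start": 0},
)
print(resp.json()["response"]["numFound"])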

This request to the Query Pipelines API sends a query through the query pipeline “bb_catalog-default” for the collection “bb_catalog” using the Solr “select” request handler, where the search query parameter “q” has value “ipad”. Because the parameter “debug” has value “true”, the Solr response contains debug information, outlined by the yellow rectangle. The “bb_catalog-default” query pipeline transforms the query “ipad” into the following Solr query:

"parsedquery": "(+DisjunctionMaxQuery((name_t:ipad)) 
id:1945531^4.0904393 id:2339322^1.5108471 id:1945595^1.0636971
id:1945674^0.4065684 id:2842056^0.3342921 id:2408224^0.4388061
id:2339386^0.39254773 id:2319133^0.32736558 id:9924603^0.1956079
id:1432551^0.18906432)/no_coord"

The outer part of this expression, “( … )/no_coord” is a reporting detail, indicating Solr's “coord scoring” feature wasn't used.

The enclosed expression consists of:

  • The search: “+DisjunctionMaxQuery(name_t:ipad)”.
  • A set of selective boosts to be applied to the search results

The field name “name_t” is supplied by the set of search fields specified by the Search Fields query stage. (Note: if no search fields are specified, the default search field name “text” is used. Since the documents in collection “bb_catalog” don't contain a field named “text”, this stage must be configured with the appropriate set of search fields.)

The Recommendations Boosting stage was configured with the default parameters:

  • Number of Recommendations: 10
  • Number of Signals: 100

There are 10 documents boosted, with ids ( 1945531, 2339322, 1945595, 1945674, 2842056, 2408224, 2339386, 2319133, 9924603, 1432551 ). This set of 10 documents represents documents which had at least 100 clicks where “ipad” occurred in the user search query. The boost factor is a number derived from the aggregated signals by the Recommendation Boosting stage. If those documents contain the term “name_t:ipad”, then they will be boosted. If those documents don't contain the term, then they won't be returned by the Solr query.

To summarize: adding in the Recommendations Boosting stage results in a Solr query where selective boosts will be applied to 10 documents, based on clickstream information from an undifferentiated set of previous searches. The improvement in the quality of the search results is dramatic.

Even Better Search

Adding more processing to the query pipeline allows for user-specific and search-specific refinements. Like the Recommendations Boosting stage, these more complex query pipelines leverage Solr's expressive query language, flexible scoring, and lightning fast search and indexing. Fusion query pipelines plus aggregated signals give you the tools you need to rapidly improve the user search experience.


How Twitter Uses Apache Lucene for Real-Time Search

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Michael Busch’s session on how Twitter executes real-time search with Apache Lucene. Twitter’s search engine serves billions of queries per day from different Lucene indexes, while appending hundreds of millions of tweets per day in real time. This session will give an overview of Twitter’s search architecture and recent changes and improvements that have been made. It will focus on the usage of Lucene and the modifications that have been made to it to support Twitter’s unique performance requirements. Michael Busch is an architect in Twitter’s Search & Content organization. He designed and implemented Twitter’s current search index, which is based on Apache Lucene and optimized for realtime search. Prior to Twitter, Michael worked at IBM on search and eDiscovery applications. Michael has been a Lucene committer and Apache member for many years. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


How Bloomberg Executes Search Analytics with Apache Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Steven Bower’s session on how Bloomberg uses Solr for search analytics. Search at Bloomberg is not just about text, it’s about numbers, lots of numbers. In order for our clients to research, measure and drive decisions from those numbers, we must provide flexible, accurate and timely analytics tools. We decided to build these tools using Solr, as Solr provides the indexing performance, filtering and faceting capabilities needed to achieve the flexibility and timeliness required by the tools. To perform the analytics required, we developed an Analytics component for Solr. This talk will cover the Analytics Component that we built at Bloomberg, some use cases that drove it, and then dive into the features and functionality it provides. Steven Bower has worked for 15 years in the web/enterprise search industry, first as part of the R&D and Services teams at FAST Search and Transfer, Inc., and then as a principal engineer at Attivio, Inc. He has participated in or led the delivery of hundreds of search applications and now leads the search infrastructure team at Bloomberg LP, providing a search-as-a-service platform for 80+ applications. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…



Lucene Revolution Presents, Inside Austin(‘s) City Limits: Stump The Chump!

ONE NIGHT ONLY! THURSDAY OCTOBER 15TH! LIVE AT THE HILTON AUSTIN! STUMP! THE! CHUMP!

It’s that time of year again folks…

Six weeks from today, Stump The Chump will be coming to Austin Texas at Lucene/Solr Revolution 2015.

If you are not familiar with “Stump the Chump” it’s a Q&A style session where “The Chump” (that’s me) is put on the spot with tough, challenging, unusual questions about Solr & Lucene — live, on stage, in front of hundreds of rowdy convention goers, with judges (who have all had a chance to review and think about the questions in advance) taking the opportunity to mock The Chump (still me) and award prizes to people whose questions do the best job of “Stumping The Chump”.

If that sounds kind of insane, it’s because it kind of is.

You can see for yourself by checking out the videos from past events like Lucene/Solr Revolution Dublin 2013 and Lucene/Solr Revolution 2013 in San Diego, CA. (Unfortunately no video of Stump The Chump is available from Lucene/Solr Revolution 2014: D.C. due to audio problems.)

Information on how to submit questions is available on the conference website.

I’ll be posting more details as we get closer to the conference, but until then you can subscribe to this blog (or just the “Chump” tag) to stay informed.


Using Thoth as a Real-Time Solr Monitor and Search Analysis Engine

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Damiano Braga and Praneet Mhatre’s session on how Trulia uses Thoth and Solr for real-time monitoring and analysis. Managing a large and diversified Solr search infrastructure can be challenging, and there is still a lack of good tools that can help monitor the entire system and help the scaling process. This session will cover Thoth: an open source real-time Solr monitor and search analysis engine that we wrote and currently use at Trulia. We will talk about how Thoth was designed, why we chose Solr to analyze Solr, and the challenges that we encountered while building and scaling the system. Then, we will talk about some useful Thoth features like integration with Apache ActiveMQ and Nagios for real-time paging, generation of reports on query volume, latency, and time-period comparisons, and the Thoth dashboard. Following that, we will summarize our application of machine learning algorithms to the process of query analysis and pattern recognition, and its results. Then we will talk about the future directions of Thoth, opportunities to expand the project with new plug-ins, and integration with SolrCloud. Damiano is part of the search team at Trulia, where he also helps manage the search infrastructure and create internal tools to help the scaling process. Prior to Trulia, he studied and worked at the University of Ferrara (Italy), where he completed his Master’s degree in Computer Science Engineering. Praneet works as a Data Mining Engineer on Trulia’s Algorithms team. He works on property data handling algorithms, stats and trends generation, comparable homes, and other data-driven projects at Trulia. Before Trulia, he earned his Bachelor’s degree in Computer Engineering from VJTI, India, and his Master’s in Computer Science from the University of California, Irvine. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Search-Time Parallelism at Etsy: An Experiment With Apache Lucene

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Etsy engineer Shikhar Bhushan’s experiments with search-time parallelism. Is it possible to gain the parallelism benefit of sharding your data into multiple indexes, without actually sharding? Isn’t your Lucene index already composed of shards, i.e. segments? This talk will present an experiment in parallelizing Lucene’s guts: the collection protocol. An express goal was to try to do this in a lock-free manner using divide-and-conquer. Changes to the Collector API were necessary, such as orienting it to work at the level of child “leaf” collectors so that segment-level state could be accumulated in parallel. I will present technical details that were learned along the way, such as how Lucene’s TopDocs collectors are implemented using priority queues and custom comparators. Then, onto the parallelizability of collectors – how some collectors, like hit counting, are embarrassingly parallelizable, how some, like DocSet collection, were a delightful challenge, and others where the space-time tradeoffs need more consideration. Performance testing results, which currently span from worse to exciting, will be discussed. Shikhar works on Search Infrastructure at Etsy, the global handmade and vintage marketplace. He has contributed patches to Solr/Lucene, and maintains several open-source projects such as a Java SSH library and a discovery plugin for elasticsearch. He previously worked at Bloomberg, where he delivered talks introducing developers to Python and internal Python tooling. He has a special interest in JVM technology and distributed systems. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Reading Metadata Between the Lines: Searching for Stories, People, Places and More in Television News

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Kai Chan’s session on media metadata search at UCLA. UCLA’s NewsScape has over 200,000 hours of television news from the United States and Europe. In the last two years, the project has generated a large set of “metadata”: story segment boundaries, story types and topics, named entities, on-screen text, image labels, etc. Including them in searches opens new opportunities for research, understanding, and visualization, and helps answer questions such as “Who was interviewed on which shows about the Ukraine crisis in May 2014” and “What text or image is shown on the screen as a story is being reported”. However, metadata search poses significant challenges, because the search engine needs to consider not only the content, but also its position and time relative to other metadata instances, whether search terms are found in the same or different metadata instances, etc. We will describe how we have implemented metadata search with Lucene/Solr’s block join and custom query types, as well as the collection’s position-time data. We will describe our work on using time as the distance unit for proximity search and filtering search results by metadata boundaries. We will also describe our metadata-aware, multi-field implementation of auto-suggest. Kai Chan is the lead programmer for the NewsScape project at the University of California, Los Angeles. He has extensive experience programming with Lucene, Solr, Java, PHP, and MySQL and has been especially involved with the development and programming of video and text search engines for the archive. Other projects that he has worked on are ClassWeb, Moodle, and Video Annotation Tool. He has given numerous presentations regarding his work to faculty and researchers at the university, as well as Lucene and Solr tutorials to the public. Kai earned his B.S. and M.S. degrees in Computer Science from UCLA. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Min/Max On Multi-Valued Field For Functions & Sorting


One of the new features added in Solr 5.3 is the ability to specify that you want Solr to use either the min or max value of a multi-valued numeric field — either directly (perhaps as a sort), or incorporated into a larger, more complex function.

For example: Suppose you were periodically collecting temperature readings from a variety of devices and indexing them into Solr. Some devices may only have a single temperature sensor, and return a single reading; but other devices might have 2, or 3, or 50 different sensors, each returning a different temperature reading. If all of the devices were identical — or at least had identical sensors — you might choose to create a distinct field for each sensor, but if the devices and their sensors are not homogeneous, and you generally just care about the “max” temperature returned, you might find it easier to dump all the sensor readings at a given time into a single multi-valued “all_sensor_temps” field, and also index the “max” value into its own “max_sensor_temp” field. (You might even use the MaxFieldValueUpdateProcessorFactory so Solr would handle this for you automatically every time you add a document.)
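As a sketch, a single reading cycle from a three-sensor device might be indexed like this. The field names follow the examples in this post; the collection name and endpoint are assumptions:

import requests

# One reading cycle from a device with three sensors. The collection name
# and endpoint are assumptions; field names follow the examples in this post.
doc = {
    "id": "device-42-reading-1",
    "all_sensor_temps": [71.2, 68.9, 103.4],  # one value per sensor
    "max_sensor_temp": 103.4,                 # precomputed here; an update
                                              # processor could derive it instead
}
requests.post("http://localhost:8983/solr/sensors/update?commit=true",
              json=[doc])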

With this type of setup, you can use different types of Solr options to answer a lot of interesting questions, such as:

  • For a given time range, you can use sort=max_sensor_temp desc to see at a glance which devices had sensors reporting the hottest temperatures during that time frame.
  • Use fq={!frange l=100}max_sensor_temp to restrict your results to situations where at least one sensor on a device reported that it was “overheating”

…etc. But what if you one day decide you’d really like to know which devices have sensors reporting the lowest temperature? Or readings that have ever been below some threshold which is so low it must be a device error?

Since you didn’t create a min_sensor_temp field in your index before adding all of those documents, there was no easy way in the past to answer those types of questions without either reindexing completely, or using a cursor to fetch all matching documents and determine the “min” value of the all_sensor_temps field yourself.

This is all a lot easier now in Solr 5.3, using some underlying DocValues support. Using the field(...) function, you can now specify the name of a multi-valued numeric field (configured with docValues="true") along with either min or max to indicate which value should be selected.

For Example:

  • Sorting: sort=field(all_sensor_temps,min) asc
  • Filtering: fq={!frange u=0}field(all_sensor_temps,min)
  • As a pseudo-field: fl=device_id,min_temp:field(all_sensor_temps,min),max_temp:field(all_sensor_temps,max)
  • In complex functions: facet.query={!frange key=num_docs_with_large_sensor_range u=50}sub(field(all_sensor_temps,max),field(all_sensor_temps,min))
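For instance, the sorting example above could be issued over HTTP like so (a sketch; the collection name is an assumption):

import requests

# Sort devices by their coldest reading, returning the min as a pseudo-field.
# The collection name is an assumption; the parameters are those shown above.
resp = requests.get(
    "http://localhost:8983/solr/sensors/select",
    params={
        "q": "*:*",
        "sort": "field(all_sensor_temps,min) asc",
        "fl": "device_id,min_temp:field(all_sensor_temps,min)",
        "wt": "json",
    },
)
for doc in resp.json()["response"]["docs"]:
    print(doc)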

Performance?

One of the first things I wondered about when I started adding the code to Solr to take advantage of the underlying DocValues functionality that makes this all possible was: “How slow is it going to be to find the min/max of each doc at query time?” The answer, surprisingly, is: “Not very slow at all.”

The reason why there isn’t a major performance hit in finding these min/max values at query time comes from the fact that the DocValues implementation sorts the multiple values at index time when writing them to disk. At query time they are accessible via SortedSetDocValues. Finding the “min” is as simple as accessing the first value in the set, while finding the “max” is only slightly harder: one method call to ask for the “count” of values in the (sorted) set, and then ask for the “last one” (ie: count-1).
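Conceptually, the per-document lookup is no more than this (a sketch of the logic, not Lucene’s actual API):

# Conceptual sketch only; not Lucene's actual DocValues API.
# DocValues wrote each document's values to disk pre-sorted, so:
def doc_min_max(sorted_values):
    if not sorted_values:
        return None, None
    count = len(sorted_values)  # one call to get the count...
    return sorted_values[0], sorted_values[count - 1]  # ...min first, max last

print(doc_min_max([3, 7, 42]))  # (3, 42)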

The theory was sound, but I wanted to do some quick benchmarking to prove it to myself. So I whipped up a few scripts to generate some test data and run a lot of queries with various sort options and compare the mean response time of different equivalent sorts. The basic idea behind these scripts is:

  • Generate 10 million random documents containing a random number of numeric “long” values in a multi_l field
    • Each doc had a 10% chance of having no values at all
    • The rest of the docs have at least 1 and at most 13 random values
  • Index these documents using a solrconfig.xml that combines CloneFieldUpdateProcessorFactory with MinFieldValueUpdateProcessorFactory and MaxFieldValueUpdateProcessorFactory to ensure that single valued min_l and max_l fields are also populated accordingly with the correct values.
  • Generate 500 random range queries against the uniqueKey field, such that there is exactly one query matching each multiple of 200 documents up to 100,000 documents, and such that the order of the queries is fixed but randomized
  • For each sort options of interest:
    • Start Solr
    • Loop over and execute all of the queries using that particular sort option
    • Repeat the loop over all of the queries a total of 10 times to try and eliminate noise and find a mean/stddev response time
    • Shutdown Solr
  • Plot the response time for each set of comparable sort options relative to number of matching documents that were sorted in that request

Before looking at the results, I’d like to remind everyone of a few important caveats:

  • I ran these tests on my laptop, while other applications were running and consuming CPU
  • There was only the one client providing query load to Solr during the test
  • The data in these tests was random and very synthetic, it doesn’t represent any sort of typical distribution
  • The queries themselves are very synthetic, and designed to try to minimize the total time Solr spends processing a request other than sorting the various numbers of results

Even with those caveats however, the tests — and the resulting graphs — should still be useful for doing an “apples to apples” comparison of the performance of sorting on a single valued field, vs sorting on the new field(multivaluedfield,minmax) function. For example, let’s start by looking at a comparison of the relative time needed to sort documents using sort=max_l asc vs field(multi_l,max) asc

max.asc_timing

We can see that both sort options had fairly consistent, and fairly flat graphs, of the mean request time relative to the number of documents being sorted. Even if we “zoom in” to only look at the noisy left edge of the graph (requests that match at most 10,000 documents) we see that while the graphs aren’t as flat, they are still fairly consistent in terms of the relative response time…

max.asc_timing_zoom

This consistency is (ahem) consistent in all of the comparisons tested — you can use the links in the table below to review any of the graphs you are interested in.

single sort      | multi sort                                                | direction | links
max_l            | field(multi_l,max)                                        | asc       | results, zoomed
max_l            | field(multi_l,max)                                        | desc      | results, zoomed
min_l            | field(multi_l,min)                                        | asc       | results, zoomed
min_l            | field(multi_l,min)                                        | desc      | results, zoomed
sum(min_l,max_l) | sum(def(field(multi_l,min),0),def(field(multi_l,max),0))  | asc       | results, zoomed
sum(min_l,max_l) | sum(def(field(multi_l,min),0),def(field(multi_l,max),0))  | desc      | results, zoomed


How Cloudera Secures Solr with Apache Sentry

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Cloudera’s Gregory Chanan’s session on securing Solr with Apache Sentry. Apache Solr, unlike other enterprise Big Data applications that it is increasingly deployed alongside, provides minimal security features out of the box. This limitation makes it significantly more burdensome for organizations to deploy Solr than solutions that have built-in support for standard authentication and authorization mechanisms. Apache Sentry is a project in the Apache Incubator designed to address these concerns. Sentry augments Solr with support for Kerberos authentication as well as collection- and document-level access control. In this talk, we’ll discuss the ACL models and features of Sentry’s security mechanisms. We will also present implementation details on Sentry’s integration with Solr. Finally, we will present performance measurements in order to characterize the impact of integrating Sentry with Solr. Gregory Chanan is a Software Engineer at Cloudera working on Search, where he leads the security integration efforts around Apache Solr. He is a committer on the Apache HBase and Apache Sentry (incubating) projects and a contributor to various other Apache projects. Prior to Cloudera, he worked as a Software Engineer for distributed computing software startup Optumsoft. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Stump The Chump: Meet The Panel Keeping Austin Weird


As previously mentioned: On October 15th, Lucene/Solr Revolution 2015 will once again be hosting “Stump The Chump” in which I (The Chump) will be answering tough Solr questions — submitted by users like you — live, on stage, sight unseen.

Today, I’m happy to announce the Panel of experts that will be challenging me with those questions, and deciding which questions were able to Stump The Chump!

In addition to taunting me with the questions, and ridiculing all my “Um”s and “Uhh”s as I stall for time while I rack my brain to come up with a non-gibberish answer, the Panel members will be responsible for awarding prizes to the folks whose questions do the best job of “Stumping” me.

Check out the session information page for details on how to submit questions. Even if you can’t make it to Austin to attend the conference, you can still participate — and do your part to humiliate me — by submitting your questions.

To keep up with all the “Chump” related info, you can subscribe to this blog (or just the “Chump” tag).



Infographic: The Dangers of Bias in High-Stakes Data Science

A data set is only as powerful as the ability of data scientists to interpret it, and insights gleaned can have huge ramifications in business, public policy, health care, and elsewhere. As the stakes of data-driven decisions become increasingly high, let’s look at some of the most common data science fallacies.

[Infographic: The Dangers of Bias in High-Stakes Data Science]


Searching and Querying Knowledge Graphs with Solr/SIREn: a Reference Architecture

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Giovanni Tummarello and Renaud Delbru’s session on querying knowledge graphs with Solr/SIREn. Knowledge Graphs have recently gained press coverage as information giants like Google, Facebook, Yahoo and Microsoft announced having deployed Knowledge Graphs at the core of their search and data management capabilities. Very richly structured datasets like “Freebase” or “DBPedia” can be said to be examples of these. In this talk we discuss a reference architecture for high-performance structured querying and search on knowledge graphs. While graph databases, e.g., triplestores or graph stores, have a role in this scenario, it is via Solr, along with its schemaless structured search plugin SIREn, that it is possible to deliver fast and accurate entity search with rich structured querying. During the presentation we will discuss an end-to-end example case study, a tourism social data use case. We will cover extraction, graph databases, SPARQL, JSON-LD, and the role of Solr/SIREn both as search and as a high-speed structured query component. The audience will leave this session with an understanding of the Knowledge Graph idea and how graph databases, SPARQL, JSON-LD and Solr/SIREn can be combined together to implement high-performance real-world applications on rich and diverse structured datasets. Renaud Delbru, Ph.D., CTO and Founder at SindiceTech, is leading the research and development of the SIREn engine and of all aspects related to large-scale data retrieval and analytics. He is the author of over a dozen academic works in the area of semi-structured information retrieval and big data RDF processing. Prior to SindiceTech, Renaud completed his Ph.D. on information retrieval for Semantic Web data at the Digital Enterprise Research Institute, Galway, where he worked on the Sindice.com semantic search engine project. Among his achievements, he led the team that won the Entity Search track of Yahoo’s Semantic Search 2011. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Tuning Apache Solr for Log Analysis

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Radu Gheorghe’s session on tuning Solr for analyzing logs. Performance tuning is always nice for keeping your applications snappy and your costs down. This is especially the case for logs, social media and other stream-like data that can easily grow into terabyte territory. While you can always use SolrCloud to scale out of performance issues, this talk is about optimizing. First, we’ll talk about Solr settings by answering the following questions:
  • How often should you commit and merge?
  • How can you have one collection per day/month/year/etc?
  • What are the performance trade-offs for these options?
Then, we’ll turn to hardware. We know SSDs are fast, especially on cold-cache searches, but are they worth the price? We’ll give you some numbers and let you decide what’s best for your use case. The last part is about optimizing the infrastructure pushing logs to Solr. We’ll talk about tuning Apache Flume for handling large flows of logs and about overall design options that also apply to other shippers, like Logstash. As always, there are trade-offs, and we’ll discuss the pros and cons of each option. Radu is a search consultant at Sematext, where he works with clients on Solr and Elasticsearch-based solutions. He is also passionate about the logging ecosystem (yes, that can be a passion!), and feeds this passion by working on Logsene, a log analytics SaaS. Naturally, at conferences such as Berlin Buzzwords, Monitorama, and of course Lucene Revolution, he speaks about indexing logs. Previous presentations were about designing logging infrastructures that provide functionality (e.g. parsing logs), performance and scalability. This time, the objective is to take a deeper dive on performance. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Lucidworks Fusion 2.1 Now Available!

Today we’re releasing Fusion 2.1 LTS, our most recent version of Fusion offering Long-Term Support (LTS). Last month, we released version 2.0, which brought a slew of new features as well as a new user experience. With Fusion 2.1 LTS, we have polished these features and tweaked the visual appearance and the interactions. With the refinements now in place, we’ll be providing support and maintenance releases on this version for at least the next 18 months. If you’ve already tried out Fusion 2.0, Fusion 2.1 won’t be revolutionary, but you’ll find that it works a little more smoothly and gracefully. Besides the improvements to the UI, we’ve made a few back-end changes:
  • Aggregation jobs now run only using Spark. In previous versions, you could run them in Spark optionally, or natively in Fusion. We’ve found we’re happy enough with Spark to make it the only option now.
  • You can now send alerts to PagerDuty. Previously, you could send an email or a Slack message. PagerDuty was a fairly popular request.
  • Several new options for crawling websites
  • Improvements to SSL when communicating between Fusion nodes
  • A reorganization of the Fusion directory structure to better isolate your site-specific data and config from version-specific Fusion binaries, for easier upgrades and maintenance releases
  • Better logging and debuggability
  • Incremental enhancements to document parsing
  • As always, some performance, reliability, and stability improvements
Whether you’re new to Fusion, or have only seen Fusion 1.x, we think there’s a lot you’ll like, so go ahead, download and try it out today!  


How CareerBuilder Executes Semantic and Multilingual Strategies with Apache Lucene/Solr

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Trey Grainger’s session on multilingual search at CareerBuilder. When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, parts-of-speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies, such as 1) searching across multiple fields, 2) using a separate collection per language combination, or 3) combining multiple languages in a single field (custom code is required for this and will be open sourced), each with their own strengths and weaknesses depending upon your use case. This talk will provide a tutorial (with code examples) on how to pull off each of these strategies. We will also compare and contrast the different kinds of stemmers, discuss the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer! Trey Grainger is the Director of Engineering for Search & Analytics at CareerBuilder.com and is the co-author of Solr in Action (2014, Manning Publications), the comprehensive example-driven guide to Apache Solr. His search experience includes handling multilingual content across dozens of markets/languages, machine learning, semantic search, big data analytics, customized Lucene/Solr scoring models, data mining, and recommendation systems. Trey is also the founder of Celiaccess.com, a gluten-free search engine, and is a frequent speaker at Lucene and Solr-related conferences. Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

