
Increase Retail Sales with Recommendations


Retailers know that it is harder and more expensive to acquire new customers than to sell new things to existing customers. That’s why they spend a lot on loyalty programs and Customer 360/Customer Journey programs. One of the best tools a retailer has for selling products to customers is recommendations.

Recommendations are simply that, suggestions by the retailer on other things the customer may be interested in. In order to do this, a retailer needs to know the customer. If I’ve never purchased anything pink and frilly, don’t recommend things to me that are pink and frilly. Instead, recommend something that suits my interests and preferences, like boxing gloves, a new cooking apron, or a nice oak table.

Strictly “brick and mortar” retailers are a rarity these days. I look at REI.com (see REI’s talk from the last Revolution) before I go into the store to get a general idea of what I’m going to purchase. I shop there especially for cycling gear because of their focus on customer service. In general, employees are there to help me decide and explain the differences between products. However, depending on how I got there that day (by bike, or convertible, or the family sedan), I may still end up making my purchase on their website.

Your Homepage: Items for User Recommender

[Image: Machine learning recommendations using Lucidworks Fusion]

Regardless of how I get to a store, for loyal customers, recommendations should start on the homepage. Lucidworks Fusion gives you an Items for User Recommender based on a user’s past interests, purchases, or other information you may have captured.

Anyone who has been involved in the Internet at all knows that most users go to your homepage and then leave. Real estate on your homepage is precious. Sure you may have a promotion, but very soon after, I should see something that speaks to me. The Items for User Recommender is a great way to generate these kinds of recommendations.

A Product Page: Items for Items Recommender

Once I click on one of those items you recommended on your homepage or that I found in a search, I should quickly see other recommended items. This is important for a number of reasons. I might click on an item that is close to what I need, but isn’t exactly right. For instance, I may need a 4G phone that is Android not iPhone or a Quick-Dri shirt that is black rather than white. In other cases, you may have a recommendation for an add-on product.

This is where the Lucidworks Fusion Items for Items Recommender comes into play. It provides a recommendation for other items based on what other users have purchased or viewed when buying that item.

Whether the product is a video game console, a coffee machine, or cycling gear, there is often a complementary product that goes with the purchase. If you buy a coffee press, you may want a grinder. If you buy a bike, you may need a tire pump or spare tube. Offering these at the time of purchase or on the product page is a good way to ensure the customer buys everything they need from your site. In the physical world, it is a way to make offers at the register: "Hey, did you get a spare tube for this bike?"

[Image: Retail recommendations using Lucidworks Fusion machine learning capabilities]

A Promotion: Users for Items Recommenders

Whether it is an offer a user sees when they return to your site, or a flash sale alert sent by email, it is important to target your promotions. The Users for Items Recommender lets me find which users are interested in an item that I’m trying to promote.

I frequently get event promotions in the mail from Ticketmaster. These are obviously strictly geographically targeted, as they sent me WWE and The Eagles. There is nothing in my browsing, search, or purchase history that should cause either of these to be recommended to me. I don't like The Eagles and am not interested in WWE, so I've stopped looking at their emails as a result. If they used Fusion, Ticketmaster could have looked at what events I've actually purchased tickets for in order to better target events I'd be interested in. Promotions can boost sales — but only if they are targeted. Anything else is a waste of resources and risks reducing the effectiveness of future promotions.

[Image: Screenshot of Lucidworks Fusion's Items for User Recommender]

 

Capturing Signals

Signals are data about what users have done: things like purchases, queries, or clicks. Lucidworks Fusion has built-in signal capture capability. For custom applications this just means posting a JavaScript object (as JSON) to a URL. However, you can do all of this automagically by using Fusion App Studio. App Studio allows you to put together search applications with pre-written components. Through App Studio you can automatically capture signals with just one HTML-like tag.
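
For a custom application outside App Studio, signal capture amounts to an HTTP POST of a small JSON document. Here is a minimal sketch in Python; the endpoint path, collection name, and credentials are illustrative assumptions and will vary with your Fusion version and setup.

import requests

# Hypothetical signals endpoint and credentials; adjust for your deployment.
SIGNALS_URL = "http://localhost:8764/api/signals/products_signals"
AUTH = ("admin", "password123")

# A single click signal: the user clicked a document after running a query.
click_signal = [{
    "type": "click",
    "params": {
        "user_id": "u-123",
        "session": "sess-abc",
        "query": "coffee press",
        "doc_id": "SKU-98765",
    },
}]

# POST the signal; Fusion stores it in the signals collection for later aggregation.
resp = requests.post(SIGNALS_URL, json=click_signal, auth=AUTH)
resp.raise_for_status()
print("signal accepted:", resp.status_code)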

[Image: Signal capture diagram]

In Depth

All of this works using signal capture, where an application sends Fusion a little data about what a user has done, combined with a machine learning algorithm and your data. Specifically, recommendations are based on a machine learning algorithm called Alternating Least Squares (ALS). ALS factorizes the matrix of user-item interactions into low-rank user and item factors; users and items whose factors sit close together in that space are the "most similar," and those similarities drive the recommendations.

[Image: Low-rank matrix factorization diagram]
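
As a rough illustration of the idea (not Fusion's actual Spark implementation), here is a tiny ALS factorization in plain NumPy: it alternately solves for user and item factor matrices so that their product approximates the observed interaction counts, then scores items for a user. The data is made up.

import numpy as np

# Toy user-by-item interaction matrix (e.g., click or purchase counts).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k, lam = 2, 0.1                      # latent factors, regularization
rng = np.random.default_rng(0)
U = rng.random((n_users, k))         # user factors
V = rng.random((n_items, k))         # item factors

for _ in range(20):
    # Fix V and solve for U (regularized least squares), then fix U and solve for V.
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

scores = U @ V.T                     # predicted affinity for every user/item pair
print("items for user 1, best first:", np.argsort(-scores[1]))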

One of the most powerful capabilities about Fusion’s recommendations is that they are “just search”, which means that your developers can slice and dice and combine them with other criteria (geography, stated preferences, “not the Eagles”).
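
One way to picture "just search": the recommender's output becomes ordinary boost and filter parameters on a query. A minimal sketch against a plain Solr-style endpoint; the field names, document IDs, and URL are illustrative assumptions, not a prescribed Fusion configuration.

import requests

# Illustrative recommender output: document IDs with weights.
recommended = {"SKU-1001": 4.2, "SKU-2002": 2.7}

params = {
    "q": "tent",
    "defType": "edismax",
    # Boost the recommended items so they float toward the top of the results...
    "bq": " ".join(f'id:"{doc_id}"^{weight}' for doc_id, weight in recommended.items()),
    # ...while still applying ordinary criteria: geography, stated preferences, "not the Eagles".
    "fq": ["state:CA", "-brand:Eagles"],
    "rows": 10,
}

resp = requests.get("http://localhost:8983/solr/products/select", params=params)
print(resp.json()["response"]["docs"])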

Next Steps

"We've seen our conversions increase more than we had initially hoped for," said Andy House, IT Director of Menards. If conversions are what you're after, consider looking into Lucidworks Fusion and its advanced recommendations and machine learning capabilities. If you're interested in rapidly developing search applications that are ready to take advantage of these capabilities, consider adding Fusion App Studio.

 

 



Omnichannel Retail Store of the Future


In 2011, Harvard Business Review published a seminal piece on where retail was heading and the challenges it faces in an article titled "The Future of Shopping." Since then, retailers have adopted an omnichannel approach that provides customers with both brick-and-mortar and digital experiences. The evidence suggests this approach is working, but only with good customer profiling.

Lucidworks provides simply better search with better results.

If retailers need more validation that brick and mortar is here to stay and that omnichannel is the panacea it is prescribed to be, look at Amazon — the poster child for online business — deciding it needed to take itself into meatspace, literally. And just what are they doing with Whole Foods? Turning it into an omnichannel marketplace to reach consumers with more than groceries.

Some clear findings have come out of all of this and out of marketplace research. Not all consumers are going to shop in both places. Distance from the store plays a key role. Encouraging customers who shop online only to come into the store increases profits. Encouraging customers who come into the store to shop online decreases profits.

The reasons are simple. In-store customers are more affected by retail practices, make more impulse buys, and are less likely to compare prices. Online customers demand free shipping and are generally going to shop from the retailer with the lowest price. So changing an in-store customer to an online customer makes them more price sensitive and less impulsive.

"We saw a 50-60% increase in conversions just from turning on Fusion." – Jacob Wagner, Director of IT – Content, Bluestem Brands

This is playing out in some interesting ways. Big-box stores with lots of floor space are closing, but on a selective basis. Meanwhile, the question for retailers has been how to move stores closer to customers. So Amazon has created a completely automated store. The store is smaller than a typical Whole Foods, is entirely smartphone- and camera-driven, and has only one floor employee to check IDs for booze. If you pull it off the shelf and put it in your grocery bag, you're automatically charged. If you shoplift, you're automatically charged.

While this is an interesting technical marvel, the real approach is likely to be more of a blend. A smaller more automated store that is closer to customers, but probably with real people who are just there to help customers rather than mind the registers.

The bottom line is that this is how you convert browsers into customers, whether they’re in the store or at home.

What does Omnichannel have to do with search?

Simply everything. Search has moved well beyond keywords and key phrases. Lucidworks has invested heavily in adding artificial intelligence to search. Modern search captures customer data and can automatically recognize which customers belong to which groups, and even identify groups a retailer didn't know were there.

While search is critical for an online retailer, its reach extends beyond online retail. When retailers can gather customer data in-store and combine it with online data, this is the core of a data-driven omnichannel strategy. Moreover, as stores become more automated, they become more data-driven and at the same time generate more data. Customer intelligence combined with product, industry, marketing, demographics, and channel data is the core of what search is to retailers today and even more so in the future. The bottom line is that customer profiling is how you convert browsers into customers, whether they're in the store or at home.

So why Lucidworks?

Other companies have search. Other companies have invested in AI. However, Lucidworks gives you better search results and is proven for reliability at scale. There are a lot of technical reasons for that, from the software clustering technology and its design to the type of indexing technology being used and the mathematical models in place. This all comes down to the right geeks with the right know-how and intimate experience with customers in the retail space.

Lucidworks has technologies that can work both on-premise or in the cloud and the expertise to help you deploy intelligently so that you stay up 24x7x365 and get the best results possible.

Next Steps:


Omnichannel Retailer Personalization Data Sources


Every Omnichannel retailer is seeking to provide better search and recommendations to influence customers and increase conversions. What sources of data should retailers utilize in generating recommendations?

Online retailers have only web-based events like clicks and queries to use for customer behavior modeling. An omnichannel retailer can combine online data with data from other channels, like loyalty programs and in-store interactions, resulting in more comprehensive intelligence. Research shows that most customers want to be recognized across the channels through which they encounter a brand. Customers are also more loyal to retailers that personalize their experience.

Every action a customer takes is a “signal” that tells us more about who they are, what they like, and what they (and users like them) are likely to do next. Those signals include every customer action, from a purchase, to visiting a retailer website without purchasing anything. When a customer comes into a store and picks something off the shelf and doesn’t buy anything, that too is information. These “signals” can be used to make recommendations and influence the customer.

Here are some examples of signals that we should consider capturing:

Mobile
Action | Data | Source
Starts or opens app | Time, location | App, GPS
Searches | Time, location, query | App, GPS
Taps item | Time, location, query, item | App, GPS
Adds item to cart | Time, location, item | App, GPS
Abandons item in cart | Time, location, item | App, GPS
Purchases item | Time, location, item | App, GPS

Web
Action | Data | Source
Visits site | Time | Web tag, App Studio
Searches | Time, query | App Studio
Clicks on item | Time, item | App Studio
Adds item to cart | Time, item | App Studio, commerce suite
Abandons item in cart | Time, item | App Studio, commerce suite
Purchases item | Time, item | App Studio, commerce suite
Clicks for service | Time, item, reason | App Studio, CSR

Phone
Action | Data | Source
Customer calls | Time, item, reason | CSR

Store
Action | Data | Source
Purchase | Time, item, store | Checkout
Return | Time, item, store | Checkout
Shows interest | Time, item, store, aisle | Mobile, camera
Uses coupon | Time, item, store, publication | Checkout
Employee contact | Time, employee, location, store | Mobile, camera

Email
Action | Data | Source
Opens email | Time, topic, item | Lead gen tool

 

AI Technology for the Omnichannel Retailer 

Capturing mere purchase information reveals a lot. However, customers are unlikely to use our app when they are in the store. Research shows that instead of using an app, people text their friends or talk while shopping. As retailers, this behavior is something we have to embrace. The technology to recognize a customer by appearance is already available and has already been deployed.

Where a customer goes within a store says a lot about them. If a customer spends a lot of time in the sporting goods section, they probably have an interest in sports. If they pick up everything with Nike stamped on it, then they probably have an affinity for that brand. If they only buy things in the store after talking to an employee, then we may want to make sure they talk to someone.

The right technology on the backend is critical to any successful retail store of the future: technology to receive customer signals, and AI technology in the middle to make recommendations and influence customers. Every customer's visual experience is critical, whether it be merchandising in-store, the layout of our app, or the search on our website.

[Image: Personalized omnichannel retail site built with App Studio]

And on the web…

For web and mobile web applications, Lucidworks App Studio is our best friend. We can rapidly deploy a search application and improve it just as rapidly as our business evolves. With App Studio, you don't need to write or maintain any of the code involved in signal capture. App Studio can also automatically wire up many of the UI features necessary to influence customers.

For the omnichannel retailer, there are signals on the web, in the store, on a mobile app, and in any other customer interaction. In order to use AI technology and provide personalized service to modern customers, we need to deploy best-of-breed technologies. Lucidworks is the only company that offers a complete AI-driven search solution that helps win and retain customers.

Get Started


Fusion 4 Ready for Download


We are pleased to announce the release of Fusion 4, our application development platform for creating powerful search-driven data applications.

Fusion 4 is our most significant release to date and we’ve been hard at work to bring you our most feature-rich and production-ready release.

Introducing Fusion Apps

Fusion Apps are a logical grouping of all linked Fusion objects. Apps can be exported and shared between Fusion instances, promoting multi-tenant deployment and significantly reducing the time to value for businesses deploying smart search applications. Fusion objects within apps can be shared as well, reducing development time and duplication and promoting reusability.

Updates to Fusion AI

We've added significant updates to our AI suite. Fusion AI now includes several new features that allow organizations to deliver superior, industry-leading search relevance:

Experiment Management & A/B Testing

Our new Experiment Management framework provides a full suite of A/B testing tools for comparing different production pipeline configuration variants to determine which pipelines are most successful. This allows tuning of Fusion pipelines for a significant increase in relevancy, click-throughs, and conversions.

All New Smart Jobs

Smart jobs are pre-configured, tested, and optimized AI jobs for Spark that bring the most popular models and approaches of machine learning to your apps. Our data scientists have tweaked and optimized a couple dozen of these jobs through extensive deployment in both testing and customer production environments.

Just drop them into your query or index pipelines and you're ready to go. Smart jobs include clustering and outlier detection, classification, query insights like head-and-tail analysis, content insights like statistically interesting phrases, and user insights like item similarity recommenders.

App Insights

App Insights is our new interface for providing detailed, real-time, customizable dashboards to visualize your app and query analytics. Our built-in analytics reports, based on our Smart Jobs, provide key metrics for analyzing query performance.

Refreshed UI and Enhanced App-Centric Workflows

We’ve taken your valuable feedback and overhauled our UI with a fresh new look and feel, optimizing for App development and deployment workflows. Significant updates to our Object Explorer allow visualization of Apps and the intrinsic relationships between shared Fusion objects.

Connectors SDK

Our new Connectors SDK provides a stable API interface for developing custom connectors to ingest data into Fusion. It can be used to augment our suite of 200+ connectors, allowing data to be ingested from any data source.

And Under the Hood

And of course, Fusion 4.0 is powered by Apache Solr 7.2.1 and Apache Spark 2.3.

Webinar: What’s New In Fusion 4

Join Lucidworks SVP of Engineering Trey Grainger for a guided tour of what's new and improved with Fusion 4. You'll learn how Fusion 4 lets you build portable apps that can be quickly deployed anywhere, manage experiments for more successful queries, and execute sophisticated custom AI jobs across your data.

Full details and registration.

Learn More

Read the release notes.

Download Fusion now.


Lucidworks Site Search is Now Available!


We are proud to announce the availability of Lucidworks Site Search. Site Search is an embeddable, easy-to-configure, out-of-the-box site search solution that runs anywhere.

Key features of this new application:

Quick Configuration: Site Search is a fully functional search application. Once you’ve configured your data and interface, just point users to the URL we provide and voila: site search in a matter of minutes, ready to embed wherever you need it.

On-prem, Cloud, Hybrid, Everywhere: Site Search can be deployed on-prem, in the cloud, or on hybrid architectures so you can choose the deployment model that best fits your security and operational requirements.

Beautiful, Flexible UI: Users don’t always know what they’re looking for. We built Smart Panels to give you drag-and-drop control over the user’s data experience. Content discovery is front and center for every single search, so visitors to your site find what they need.

Every Page, Every Document: Lucidworks Site Search crawls all of your site’s content so search results are fresh, complete, and relevant. Users can narrow searches with rich faceting and filter by topic.

AI-Powered Personalization: User queries and behavior are constantly used to fine-tune relevancy and ensure the best results.

And much more!

Coverage of today’s announcement in TechCrunch.

Learn more about Site Search and start your trial today.

Press release.


Use Head-N-Tail Analysis to Increase Engagement


One of the most exciting new features in Fusion is Head-N-Tail Analysis. Strangely enough this has nothing to do with shampoo or horses, but it is a way to look at a large set of queries and identify:

  • Head – The queries that generate most of your traffic and conversions
  • Tail – The queries that generate very few or no clicks
  • Torso – Everything else in between

When to Use Head-N-Tail Analysis

Why you’d want to know which queries don’t result in clicks should be obvious. If a user searched on “blue trees” but didn’t find those little car air fresheners they were looking to purchase, it’s a missed opportunity. Maybe an incentive could encourage the user to convert to a purchase or a click.

Frequent queries are also an opportunity for higher clickthroughs. Popular queries could be optimized with a particular promotion or featured on the homepage or in an email campaign. Remember, most users who go to any website bounce. The more you cut that bounce rate, the better you serve your users, whether that means higher productivity in enterprise search or more sales in e-commerce.

Whether you're in digital commerce or you're developing an enterprise search app for a corporate intranet, Head-n-Tail reveals the reasons users leave your site without finding what they were searching for. Either it should have been on the front page or it should have been in the search results. Whether the user misspelled something or should have been more descriptive isn't the issue. The issue is that your search needs to anticipate the user's needs.

Optimizing the Head, the Tail, and Everything Else

For the tail queries, Fusion doesn’t just tell you “this isn’t good” but suggests ways that you can rewrite the query. In some cases it is just adding “slop” or flexibility to the keyword or phrase. In some cases it is a spelling issue and maybe you want to add a synonym. Head-n-Tail analysis will tell you some of those right off the bat.

For the head queries, this may be as simple as adding the top n items to your front page. It also might be a good hint that you should use a recommender to personalize the front page for a user. There are other tools in Fusion that may also be useful in this case. You could just offer a redirect when someone types "contact information" in the search bar. Or you could enable signal boosting so that more relevant results automatically bubble up to the top.

How to Get Started

We’ve got an in-depth technical paper Fusion Head-Tail Analysis Reveals Why Users Leave available to guide you along with an upcoming webinar: Fusion 4 Head-n-Tail Analysis, with Lucidworks VP of Research, Chao Han.

Additional Resources:


Machine Learning in Fusion 4


Fusion 4 is the latest release of our AI-powered search platform and one of the most substantial releases to-date. Fusion 4 continues the evolution of our combination of Solr and Spark while adding the option of deploying App Studio (based on our acquisition of Twigkit). You can learn more about Fusion 4 from the recent webinar.

Fusion 4 is a major step for the industry in terms of understanding how central search is to the practical use of artificial intelligence generally, and machine learning more specifically. This goes by many names: cognitive search, insight engines, and AI-powered search. Whatever the name, it means that with machine learning we have moved beyond the era where mere keyword and key-phrase search is sufficient. Fusion makes machine learning accessible to every business and simplifies its implementation across the IT organization.

Let’s look in more detail at Fusion 4’s machine learning features.

Signals

But before we look at machine learning, let’s review the data that fuels it: signals. You’re sending signals right now. You clicked on a post on machine learning and Fusion 4 either from a link on our site or blog, an email, or from a search result. You’ve made it this far down the page of this post. If I turn that into data and send it to Fusion, it can use that behavioral data to tune search results. For developers this means capturing clicks and sending them to a REST endpoint as JSON. It looks like this:

[
  {
    "id": "288fe4f7-6680-403e-8d18-27647cdd9989",
    "timestamp": 1518717749409,
    "type": "request",
    "params": {
      "user_id": "admin",
      "session": "ef4e00cd-91bb-45b4-be80-e81f9f9c5b27",
      "query": "USER QUERY HERE",
      "app_id": "SEARCH APP ID",
      "ip_address": "0:0:0:0:0:0:0:1",
      "host": "Lucids-MacBook-Pro-5.local",
      "filter": [
        "field1/value"
      ],
      "filter_field": [
        "field1"
      ]
    }
  }
]

Using this data, Fusion is able to infer user interests. Based on those interests, Fusion is able to tailor search results and make recommendations. While there are other forms of signal data (location, shopping cart adds, returns, etc.), this kind of "clickstream" data is by far the most common.

Signal Boosting

Fusion 4 is self-tuning; as more users query and click, the quality of the search results just gets better and better. This is based on “signal boosting.”

Let’s think of this in the simplest terms possible:

Most users tend to click on the first item in a set of results and each subsequent result on the page gets fewer clicks. If users click on a second or third result then that result probably should be higher on the page. Fusion 4 includes “Boost with Signals” in its query pipeline by default. All your search application needs to do is send it the signals. Purists may ask if this is machine learning at all since it is basically just “counting” but the effect is that the system “learns” that the latter results are better.
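
To make the "counting" concrete, here is a minimal sketch (not Fusion's actual aggregation job) that tallies clicks per query/document pair and turns those counts into boost weights that a query pipeline could apply. The signal data is made up.

from collections import Counter
import math

# Raw click signals: (query, clicked document id).
signals = [
    ("ipod case", "SKU-1"), ("ipod case", "SKU-2"), ("ipod case", "SKU-2"),
    ("ipod case", "SKU-2"), ("coffee press", "SKU-9"),
]

clicks = Counter(signals)  # how many times each document was clicked for each query

def boosts_for(query, max_boost=10.0):
    """Return per-document boost weights for a query, scaled by click counts."""
    counts = {doc: n for (q, doc), n in clicks.items() if q == query}
    if not counts:
        return {}
    top = max(counts.values())
    # Log-scale the counts so one runaway document doesn't dominate everything.
    return {doc: max_boost * math.log1p(n) / math.log1p(top) for doc, n in counts.items()}

print(boosts_for("ipod case"))  # SKU-2 gets the largest boost because it was clicked most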

Recommenders

Recommenders are the easiest and most obvious AI tool for search. At its simplest, you know things about a user, their likes and desires, through their explicit feedback and implicit actions. For ecommerce this is easy. We've talked about using recommenders in the AI and Machine Learning for Omnichannel Retail webinar and the Create an Amazon-like Experience with Fusion ebook.

However, recommenders are not just useful for retail, they are very useful for enterprise search applications (and any other place you might use search). If you’re looking for “employee benefits 2018” on your corporate intranet, you might also be interested in a PDF with the filename “Updates_to_Your_401k_Plan_for_2018.”  If you’re working on the same project as other users, chances are you’ll look at similar documents. Whatever the case, Fusion 4 can learn what users need automatically!

Recommenders use events (signals) collected from user activity such as if they like something, click something or query something. Based on these signals and what other similar users did, we can recommend things to those users. Recommendations take different forms depending on your context:

  • Items for User – Based on the user’s history, what other items might be of interest?
  • Items for Item – What other items were users who were interested in this particular item also interested in?
  • Users for Item – Which users might be interested in this item?

Classification

When trying to determine user interests, the "type" of thing they are interested in is often a good indicator. For retail this means things like the department or category the product is in. What if that data isn't determined up front?

Using Fusion, you can classify data automatically. You can even use that data to help users facet or filter data to the specific department or departments that they need. While this is obvious in digital commerce and retail, it is just as valid for enterprise search or financial search applications (or anywhere else that you use search).

Classification is a supervised learning method. This means you first give Fusion a set of data that has already been classified. Based on this it learns how to classify future items.
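
As an illustration of that supervised flow (standing in for Fusion's Spark-based classification jobs, not reproducing them), the following sketch trains a simple text classifier on a few hand-labeled product titles and then predicts the department for a new item. The data and labels are made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled training set: (text, department).
train_texts = [
    "blue suede shoes waterproof", "running shoes for trail",
    "elvis greatest hits vinyl", "blue suede live album remastered",
    "stainless steel coffee press", "burr coffee grinder",
]
train_labels = ["footwear", "footwear", "music", "music", "kitchen", "kitchen"]

# Vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Classify a new, unlabeled item (the same idea also applies to raw user queries).
print(model.predict(["waterproof trail shoes"]))  # likely ['footwear'] given the shared vocabulary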

Query Intent

By classifying data, we can use this understanding to classify user queries. This means that when a user types "elvis blue suede" the system might classify the query as music, whereas if the user types "blue suede shoes waterproof" the system might figure out that this is footwear. Based on this, the system can choose to boost or filter by our earlier classified department.

The power of this is clear in retail, but it is also an important tool for enterprise search or finance. If I'm searching for "internal employee benefits" or "IBM 10-K filings," having the system automatically realize that my search should be limited to "HR documents," "financial statements," or "10-K filings" is likely to improve my results substantially. Sure, a user who selects a facet can do this for themselves, but documents tend to fall into multiple categories, and just because the team in charge of categorizing a document happened to file it as an "HR Document" doesn't mean the user will make the association. It is better to have a system that can intuit and learn what a user means, so that users can easily find what they need without learning a whole filing system, especially one that may change.

Clustering

Along with the difficulty of categorizing a collection of documents, there's the challenge of deciding what the categories should be in the first place. One option is to use Fusion's unsupervised learning capability, clustering, to automatically group documents into categories and find the outliers.

What qualifies as an outlier? Maybe one manufacturer puts “colour” instead of “color” or maybe there are common misspellings from a particular vendor. Outliers are just documents that do not fit well into any category that our system can detect. Outliers can be useful to detect anomalous behaviour or documents in your system. Using this information you can tweak the results to better include the outliers.

Aside from helping categorize data, clustering can also teach you a lot about the collection you're indexing. For instance, in ecommerce, you can learn which kinds of products each manufacturer tends to make.
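
A rough sketch of the unsupervised side (again standing in for Fusion's Spark jobs): cluster TF-IDF vectors of product descriptions with k-means, then flag documents that sit unusually far from their cluster center as outliers. The documents and threshold are made up for illustration.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "red mountain bike 29 inch wheels", "carbon road bike frameset",
    "espresso machine with steam wand", "drip coffee maker 12 cup",
    "colour matched bike helmet",  # the 'colour' spelling makes this vendor's text look odd
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each document to its assigned cluster center; far-away docs are outlier candidates.
dists = km.transform(X)[np.arange(len(docs)), km.labels_]
threshold = dists.mean() + dists.std()
for doc, d in zip(docs, dists):
    flag = "OUTLIER" if d > threshold else "ok"
    print(f"{flag:7s} {d:.3f}  {doc}")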

Experiments

There is usually more than one way to do things. Maybe when you originally deployed your application, it was configured with basic signal boosting to improve relevancy. But now your team wants to see if making search more personalized with recommendations is actually better. Or perhaps you started adding query intent and filtering to categories. Someone asks, "But do we actually get better results?" How do you answer the question?

You experiment. You send some percentage of requests through one path and another set of requests through the other. If it is a web or mobile-based search, then whichever configuration gets the higher click-through rate (CTR) is the better one.

Fusion lets you configure this to happen automatically and even gives you handy charts to prove that it worked. Maybe you have something other than CTR that you want to measure; Fusion can help you there, too.
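
Under the hood, declaring a winner comes down to comparing click-through rates and checking that the difference is statistically significant. A minimal sketch using a two-proportion z-test; the counts are made up, and Fusion's own confidence calculation may differ.

import math

# Made-up experiment data: (queries served, queries that received a click).
variants = {
    "default-pipeline":         (8000, 960),   # 12.0% CTR
    "recommendations-pipeline": (2000, 290),   # 14.5% CTR
}

(n_a, c_a), (n_b, c_b) = variants.values()
p_a, p_b = c_a / n_a, c_b / n_b

# Pooled two-proportion z-test for the difference in CTR.
p_pool = (c_a + c_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

print(f"CTR A={p_a:.3f}  CTR B={p_b:.3f}  z={z:.2f}")
# |z| > 1.96 roughly corresponds to 95% confidence that the CTRs really differ.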

Learning to Rank

Fusion 4 includes Solr 7. Solr recently added an algorithm called "Learning to Rank." Learning to Rank (LTR) is another classification tool. Basically, sometimes the algorithm that search uses (BM25 by default in Fusion) isn't sufficient. Instead, a combination of "characteristics" of the data and of how users actually use the data is used to teach the system how to order a set of results. We have a webinar on Learning to Rank coming up on April 4th which will go into more detail about some of the results you can expect when combining LTR with Fusion's signal capabilities.

Head-n-Tail

Although it sounds like a popular shampoo, it is actually a way to learn more about your most popular queries (head) as well as the outliers (tail) and how to improve them.

We have a webinar on “Fusion 4 Head-n-Tail analysis” on March 28th as well as an upcoming technical paper which will explain how you can use Head-n-Tail analysis to tune your results.

Apache Spark FTW!

Fusion 4 adds a significant amount of AI capability, and it is built on the shoulders of a giant: Apache Spark. Because of this, you have the power to add your own Spark-based machine learning jobs and even use Spark to manipulate data stored in Fusion.

…And Much Much More!

These are just the highlights of the AI-powered search features added to Fusion 4. We didn't even touch upon things like Ground Truth, Co-occurrence, and Levenshtein Spell Checking. Fusion 4 is a great leap forward in intelligent, self-tuning, AI-powered search.

Learn more:


Building Image-Based Perception Into Your Search Apps


Using Learning to Rank to Provide Better Search Results


Fusion 4 is built with Apache Solr 7. Solr 7 provides a powerful algorithm for improving search results called Learning to Rank, or LTR. At its essence, LTR uses machine learning to teach the system how to order a set of results based on certain characteristics.

When you send a query to Fusion or Solr, results are returned and ordered by a relevance algorithm called BM25, which is derived from TF-IDF. To simplify: the more frequently a rare term occurs in a document, the higher that document will be ranked in the results. This is usually pretty good, but for some types of results it just doesn't provide the best order possible.
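
For intuition, here is the BM25 score for a single query term written out as a small function. It is a simplified version of what Lucene computes (no per-field tweaks), using the usual defaults k1=1.2 and b=0.75.

import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document: rare terms (high IDF) occurring often in a
    short document score highest; term frequency saturates via k1."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A rare term appearing twice in an average-length document...
print(bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, num_docs=10_000, doc_freq=50))
# ...scores far higher than a common term with the same frequency.
print(bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, num_docs=10_000, doc_freq=5_000))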

Learning to Rank (LTR) lets you provide a set of results ordered the way you want them and then teaches the machine how to rank future sets of results. The default search algorithm is still used to get the initial set of results, but the system then reorders them based on the ranking model it was trained on.

While hand-ordered input data is useful for small datasets, using Fusion's signals capability you can go much further. By using behavioral data (i.e., what users clicked on or actually bought), you can transform your signal data into an automatic ranking set. In this way you can essentially let the users decide which result should be first.

Using Fusion’s signal capture and signal boosting, together with Solr 7’s Learning To Rank capability you can provide better results than using any one method alone.

I'm a Senior Data Engineer at Lucidworks and I've put together a technical paper explaining how to do this and how the results compare. Check out Learning to Rank for Better Search Results. I'll be hosting a webinar on the same topic, Learning to Rank for Improved Search Results, on April 4th.

Learn more:


Advanced Spell Check with Fusion 4


Let's look at how we can extend and improve spell checking within Fusion 4 by utilizing the number of occurrences of words in queries or documents to find misspellings. For example, if two queries are spelled similarly, but one leads to a lot of traffic (head) and the other leads to little or no traffic (tail), then very likely the tail query is misspelled and the head query is the correct spelling.

The “Token and phrase spell correction” job in Fusion 4 extracts tail tokens (one word) and phrases (two words) and finds similarly-spelled head tokens/phrases. If there are several matching heads found for each tail, the job can compare and pick the best correction using multiple configurable criteria.
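
As a sketch of the core idea (not the actual Spark job), the following pairs low-traffic tail queries with similarly spelled high-traffic head queries using edit distance, allowing more edits for longer queries. The query counts are made up.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Made-up aggregated query counts: head queries get lots of traffic, tail queries do not.
head = {"surveillance camera": 4200, "coffee grinder": 1800}
tail = {"servailance camera": 7, "cofee grinder": 3, "blue trees": 5}

for t, count in tail.items():
    # Pair each tail query with the closest-spelled head query, if it is close enough.
    best = min(head, key=lambda h: edit_distance(t, h))
    dist = edit_distance(t, best)
    # Allow roughly one edit per six characters so longer queries get more slack.
    if dist <= max(1, len(t) // 6):
        print(f"{t!r} (count {count}) looks like a misspelling of {best!r} (count {head[best]})")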

How to Run Spell Correction Jobs in Fusion 4:

We will be using an ecommerce dataset from Kaggle to show how to perform spell correction based on signals.

In  Fusion’s jobs manager, add a new “token and phrase spell correction” job and fill in the parameters as follows:

You can run the spell checker job on two types of data: signal data or non-signal data. If you are interested in finding misspellings in queries from signals, then check the “Input is Signal Data” box.

The configuration must specify:

  • Which collection contains the signals (the Input Collection parameter)
  • Which field in the collection contains the query string (the Query Field Name parameter)
  • Which field contains the count of the event (for example, if signal data follows the default Fusion setup, count_i is the field that records the count of raw signal, aggr_count_i is the field that records the count after aggregation)

The job allows you to analyze query performance based on two different events: a main event and a filtering/secondary event. For example, if you specify the main event to be clicks with a minimum count of 0 and the filtering event to be queries with a minimum count of 20, then the job will filter on the queries that get searched at least 20 times and check among those popular queries to see which ones didn't get clicked at all or only a few times. If you only have one event type, leave the Filtering Event Type parameter empty. You can also upload your dictionary to a collection to compare spellings against, and specify the location of the dictionary in the Dictionary Collection and Dictionary Field parameters. For example, in ecommerce use cases, the catalog can serve as a dictionary, and the job will check to make sure the misspellings found do not show up in the dictionary, while the corrections do show up.

If you are interested in finding misspellings in content documents (such as descriptions) rather than queries, then un-check the “Input is Signal Data” box. And there is no need to specify the parameters mentioned above for the signal data use case.

After specifying the configuration, click Run > Start. When the run finishes, you should see the status Success to the left of the Start button. If the run fails, you should check the error messages in "job history". If the job history doesn't give you insight into what went wrong, then you can debug by tailing the Spark driver log in a terminal:

tail -f var/log/api/spark-driver-default.log | grep Misspelling:

After the run finishes, misspellings and corrections will be output into the output collection. An example record is as follows:

You can export the result output into a CSV file for easy evaluation.

Usage of Misspelling Correction Results

The resulting corrections can be used in various ways. For example:

  1. Put misspellings into the synonym list to perform auto-correction. Please check out the Synonyms Files manager in Fusion. (https://doc.lucidworks.com/fusion/3.0/Collections/Synonyms-Files.html)
  2. Help evaluate and guide the spell check configuration.
  3. Put misspellings into typeahead or autosuggest lists.
  4. Perform document cleansing (for example, clean a product catalog or medical records) by mapping misspellings to corrections.

Comparing Fusion’s spell check capabilities to the Solr spell checker, the advantages with this Fusion job are:

  1. It includes the basic Solr spell checker settings, such as min prefix match, max edit distance, min length of misspelling, and count thresholds for misspellings and corrections.
  2. If signals are captured after the Solr spell checker was turned on, then the misspellings found from signals mainly identify cases where Solr made erroneous corrections or no correction at all.
  3. The job compares potential corrections based on multiple criteria, not just edit distance. Users can easily configure the weights they want to put on each criterion.
  4. Rather than using a fixed max edit distance filter, we use an edit distance threshold relative to the query length to provide more wiggle room for long queries. Specifically, we apply a filter such that only pairs with edit_distance <= query_length/length_scale will be kept. For example, if we choose length_scale=4, then for queries with lengths between 4 and 7 the edit distance has to be 1 for the pair to be chosen, while for queries with lengths between 8 and 11 the edit distance can be 2.
  5. Since the job runs offline, it can ease concerns about expensive spell check tasks in Solr spell check. For example, it does not limit the maximum number of possible matches to review (the maxInspections parameter in Solr) and is able to find comprehensive lists of spelling errors resulting from misplaced whitespace (breakWords in Solr).
  6. It allows offline human review to make sure the changes are all correct. If you have a dictionary (such as a product catalog) to check against the list, the job will go through the result list to make sure misspellings do not exist in the dictionary and corrections do exist in the dictionary.

Misspelling Correction Results Evaluation

Above is an example result that has been exported to a CSV file. Several fields are provided to facilitate the reviewing process. For example, by default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more probable corrections at the top. Sound match or last-character match are also good criteria to pay attention to. You can also sort by the ratio of correction traffic over misspelling traffic (the "corCount_misCount_ratio" field) to keep only high-traffic boosting corrections.

Several additional fields (as shown in the table above) are provided to disclose relationships among the token corrections and phrase corrections to help further reduce the list. The intuition is based on the idea of a Solr collation check, that is, a legitimate token correction is more likely to show up in phrase corrections. Specifically, for phrase misspellings, the misspelled tokens are separated out and put in the “token_wise_correction” field. If the associated token correction is already included in the one-word correction list, then the “collation_check” field will be labeled as “token correction included”, and the user can choose to drop those phrase misspellings to reduce duplications.

We also count how many such phrase corrections can be solved by the same token correction and put the number into the “token_corr_for_phrase_cnt” field. For example, if both “outdoor servailance” and “servailance camera” can be solved by correcting “servailance” to “surveillance”, then this number is 2, which provides some confidence for dropping such phrase corrections and further confirms that “servailance” to “surveillance” is legitimate. You may also see cases where the token-wise correction is not included in the list. For example, “xbow” to “xbox” is not included in the list since it can be dangerous to allow an edit distance of 1 in a word of length 4. But if multiple phrase corrections can be made by changing this token, then you can add this token correction to the list. (Note, phrase corrections with a value of 1 for “token_corr_for_phrase_cnt” and with “collation_check” labeled as “token correction not included” could be potentially-problematic corrections.)

On the other side, for token corrections, attention can be paid to pairs with short string lengths that show no collation in phrases. But it's also possible that a token correction's corresponding phrase-level corrections simply do not appear in the signals, or that the token only shows up in single-word queries. For example, "broter" was only used as a single-word query, thus there is no collation found in phrases. Finally, we label misspellings due to misplaced whitespace with "combine/break words" in the "correction_types" field. If there is a user-provided dictionary to check against, and both spellings are in the dictionary with and without whitespace in the middle, we can treat these pairs as bi-directional synonyms ("combine/break words (bi-direction)" in the "correction_types" field).

The good news here is that you don’t need to worry about the above rules for review. There is a field called “suggested_corrections” which explicitly provides suggestions about using token correction or the whole phrase correction. If the confidence of the correction is not high, then the job labels the pair as “review” in this field. You can pay special attention to the ones with the review labels.

If we use the Solr spell checker shipped with Fusion to tackle new and rare misspellings, and Fusion's advanced spell check list to improve correction accuracy for common misspellings, then together they provide a better user search experience.


A/B Testing Your Search Engine with Fusion 4.0


From head-tail query rewriting to recommendations and collaborative-filtering-based boosting, Lucidworks Fusion offers a whole host of strategies to help you improve the relevance of your search results. But discovering which set of strategies works best for your users can be a daunting task. Fusion's new experiment management framework helps you A/B test different approaches and discover what works best for your system.

What is A/B Testing?

A/B testing is, essentially, a way to compare two or more variants of something to determine which performs better, according to certain metrics. In search, the variants in an A/B test are typically different search pipelines you want to compare; the metrics are typically aspects of user behavior you want to analyze. For example, let's say you want to see how enabling Fusion's item-to-item recommendation boosting stage impacts query click-through rate.

You would begin by creating two pipelines: one with your baseline relevance configuration and one with that default plus item-to-item recommendation boosting enabled.

Fusion has a compare tool that enables you to query and see results for two pipelines side by side.

As you can see, for the particular query "ipod case," the two pipelines give different results, but both seem reasonable. How do we determine which is better? Let's let our system's users tell us. Let's launch an experiment to see which pipeline gives a higher click-through rate. This will tell us which pipeline gives more relevant results for our particular users.

To set up an experiment, navigate to the experiment manager in Fusion. Select the two pipelines as your variants and CTR as your metric. When you are setting up your variants you will see the option to select what varies in each variant. This is essentially the parameter you plan on changing between your variants.

Fusion allows you to vary the pipeline, parameters, or collection. In this case we are comparing two query pipelines, so I will select Query Pipeline for both variants. Also note, on the left, that we can specify how much traffic gets routed to each variant. In this case, since we have a default pipeline and we just want to see if enabling the recommendation stage improves the pipeline, let's route less traffic to the recommendations pipeline and keep most of our users looking at the default "good" search results we already have. An 80/20 split seems reasonable, so I am going to adjust the weight on the primary variant to be 8 and on the secondary variant to be 2.

This means 80% of the traffic is going to the default pipeline and 20% is going to our experimental variant.
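
The traffic split itself can be done with a simple deterministic hash of the user, so a given user always lands in the same variant. Fusion handles this routing for you; the sketch below only illustrates the mechanics of an 80/20 assignment using the pipeline names from this example.

import hashlib

def assign_variant(user_id, weights):
    """Deterministically map a user to a variant, proportional to the weights."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % sum(weights.values())
    for variant, weight in weights.items():
        if bucket < weight:
            return variant
        bucket -= weight

weights = {"default-relevancy-pipeline": 8, "recommendations-test-pipeline": 2}  # 80/20 split
for user in ["alice", "bob", "carol", "dave", "erin"]:
    print(user, "->", assign_variant(user, weights))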

Fusion also offers a number of metrics that you can compute for each variant.

You can select as many or as few of these metrics as you would like. For the purposes of this experiment, let's focus on click-through rate. For more info on the other metrics, refer to our documentation: https://doc.lucidworks.com/fusion-ai/4.0/user-guide/experiments/index.html

Once you have your experiment configured and saved, your page should look something like this.  Click Save.

In order to start gathering data for an experiment, the experiment must be linked with either a query profile or a query pipeline. This way, whenever a user hits the API endpoint associated with the query profile or the query pipeline, they will be automatically placed in the experiment and routed to a particular variant. Let’s link this experiment with the default query profile associated with my app. For more on query profiles see our documentation here: https://doc.lucidworks.com/fusion-server/4.0/search-development/getting-data-out/query-pipeline-configuration/query-profiles.html

You should see the enable experimentation check box. Select the experiment we just set up and click Save. Then hit Run Experiment. The screen should now look like this:

Once an experiment is running, it is locked. This means you cannot change the variants or the collected metrics without stopping and restarting the experiment. This is to safeguard against collecting inaccurate metrics. Now we have a running experiment! Every time a user hits the endpoint at the end of the query profile, they will be routed into this experiment.

80% of users that hit that endpoint will see the results coming from the default-relevancy-pipeline and 20% will see the results coming from the recommendations-test-pipeline. Now let’s examine the results of the experiment. Going back to the experiment manager and selecting the experiment we just created brings up an “Experiment Results” button. Clicking it will take you to a page that looks like this:

You’ll see that the confidence index is zero. This means there is not enough traffic running through your experiment to display statistically significant data yet. Let’s wait for a while, let the users interact with the system, and come back to the results. After some time, the results should look more like this:

Looks like the recommendation pipeline is consistently doing better. Now that we know this, we can end the experiment and deploy the recommendation pipeline into production. Fusion Experiment Management has helped us collect relevant data and make an informed decision about the search system!

A/B testing is a critical part of a functioning search architecture. Experimentation allows you to collect actionable data about how your users are interacting with the search system.

Next Steps


Retail Recommendations Across the Omnichannel


Over the last few months we’ve been covering how AI, personalization, omnichannel strategies, and mobile technologies are changing the way retail works. This revolution is accelerating and technology is on the verge of changing how we shop forever. What this all boils down to is essentially embedding the personal shopper into the store by using customer data, technology, and AI.

If you're trying to increase your conversions on an ecommerce site, in a brick and mortar store, or through a combination of both, here are some resources to help:

Blogs

 

EBooks / Whitepapers

 

Webinars

 

Video Case Studies

 Next Steps


When Worlds Collide – Artificial Intelligence Meets Search


We are living in a time when the fields of Artificial Intelligence and Search are rapidly merging – and the benefits are substantial. We can now talk to search applications in our living rooms, in our cars, and as we walk down the street looking for nearby places to eat. As both fields mature, their convergence, while inevitable, is creating new sets of opportunities and challenges. Our Lucidworks Fusion product – especially the latest Fusion 4.0 release – is right in the middle of this latest search revolution. Its core technologies, Apache Solr and Apache Spark, provide exactly the right resources to accomplish this critical technological merger. More on this later.

The Search Loop – Questions, Answers, and More Questions

At their core, all search applications, intelligent or otherwise, involve a conversation between a human user and the search app. Users pose questions and get an answer or a set of answers back. This may suffice, or there may be follow-up questions – either because the initial answers are unsatisfactory or to obtain more detailed information. A critical component of this "search loop" is the initial translation from "human query language" to computer database or search engine lookup code. At the cutting edge are speech-driven interfaces such as Apple's Siri and Amazon's Alexa, in which one simply talks to the search app (prefacing the question with her name to wake her up), asks a question, and gets an answer back in a state-of-the-art text-to-speech rendering. Under the covers, the Siri or Alexa software is first converting your speech to text, then parsing that text to ascertain what you are asking. They do a very good job with questions formatted in ways that their programmers anticipated, and less well with more open-ended questions that naïve users tend to pose. It is clearly Artificial Intelligence, but not yet the ultimate AI envisioned by Alan Turing some 68 years ago. The Alexa developers know this too. When I asked Alexa if she could pass the Turing Test, she rightly responded: "I don't need to pass that. I'm not pretending to be human." Fair enough. Kudos to the Alexa team for both honesty and some amazing functionality.

Query Parsing – Inferring User Intent

One of the driving motivations for adding AI to search is the realization that better answers can be given if we can do a better job of determining what the user is asking for. A critical component of this is a technology called Natural Language Processing or NLP. NLP works best when it has full sentences to work with, and in traditional search apps (i.e. not speech driven) this is often not available as users tend to put single terms or at most 3 to 4 terms into a search box. In some domains such as eCommerce, user intent is easier to infer from these short queries, as users tend to just put the name of the thing they are looking for. If this doesn’t bring back what they want, they tend to add modifiers (adjectives) that more fully describe their intent. Entity extraction, rather than Parts-of-Speech (POS) analysis becomes more relevant to this particular query parsing problem due to the lack of complete sentences. Machine learning models for entity extraction are part of the toolset used here, often based on training sets derived from query and click logs or by head-tail analyses to detect both the noun-phrases that directly indicate the “thing” (which tend to dominate head queries) and the added modifiers and misspellings that users sometimes insert (tail queries) – see Chao Han’s excellent blog post on this.

Another often-used technique, similar to what Siri and Alexa seem to do, is "directed" or "pattern-based" NLP, in which common phrase structures are parsed (for example, "leather cases for iPhones" or "wireless cables under $30") to determine what is being searched for and which related attributes should be boosted. Pattern-based NLP does not involve complex linguistic analysis as is done with POS taggers. Rather, it can be based on simpler techniques such as regular expressions. The downside is that it only works on expected patterns, whereas more sophisticated POS taggers can work with any text. As noted above, in the latter case, user queries tend to be too cryptic to support "true" NLP. Extraction of entities as phrases is very important here, as this vastly improves search precision over the naïve "bag of words" approach.
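
A toy example of the pattern-based approach, using nothing more than a regular expression to pull the product noun phrase and a price constraint out of queries shaped like the examples above; a real system would maintain a library of such patterns.

import re

# "<item> for <target>" and "<item> under $<price>" style patterns.
PATTERNS = [
    re.compile(r"^(?P<item>.+?) for (?P<target>.+)$"),
    re.compile(r"^(?P<item>.+?) under \$?(?P<max_price>\d+)\$?$"),
]

def parse(query):
    for pattern in PATTERNS:
        match = pattern.match(query)
        if match:
            return match.groupdict()
    return {"item": query}  # fall back to treating the whole query as the item

print(parse("leather cases for iPhones"))   # {'item': 'leather cases', 'target': 'iPhones'}
print(parse("wireless cables under $30"))   # {'item': 'wireless cables', 'max_price': '30'}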

In addition to ML-based approaches, another method for query intent inference, which I have developed and blogged about, called Query Autofiltering (QAF) [1,2,3,4], uses structured metadata in the search collection to infer intent – where term meanings are known based on their use as field values in the collection (e.g., "red" is known to be a Color because it is a value in the Color facet field). QAF requires highly structured metadata to work well and can benefit from techniques such as text mining at index time to provide better meta-information for query parsing of user intent. In particular, a critical piece of metadata in this respect – product type – is often missing in commercial datasets such as Best Buy and others, but can be mined from short product descriptions using NLP techniques similar to those discussed above. The use of synonyms (often derived from query logs) is also very important here.

Information Spaces – From Categorical to Numerical and Back

Search is an inherently spatial process. When we search for things in the real world like misplaced car keys, we look for them in physical space. When we search for information, we look in “information space” – a more highly dimensional and abstract concept to be sure, but thinking about the problem in this way can help to conceptualize how the various technologies work together – or to put it more succinctly, how these technologies map to each other within this information space.

Databases and search indexes are metadata-based structures that include both categorical and numerical information. In traditional search applications, metadata is surfaced as facets which can be used to navigate within what I call "meta-information space." As I discussed in a previous blog post, facets can be used to find similar items within this space in both conventional and unconventional ways. Hidden categories also exist in "unstructured" text and the surfacing of these categorical relationships has long been a fertile ground for the application of AI techniques. Text analytics is a field by which free-text documents are categorized or classified – using machine learning algorithms – bringing them into the categorical space of the search index, thereby making them easier to search and navigate. The process is an automated abstraction or generalization exercise which enables us to search and navigate for higher level concepts rather than keywords.

In contrast to the meta-informational data structures used in search applications, machine learning algorithms require purely numerical data, and model information spaces as n-dimensional vectors within continuous or numerical Euclidean spaces. For this reason, the first operation is to “vectorize” textual data – i.e. translate patterns in text to a purely numerical representation that machine learning algorithms can work with. This traditionally involves determining frequencies and probabilities of text “tokens.” A common metric is term frequency / inverse document frequency or TF/IDF. This metric is also the basis of the Lucene search engine’s default Similarity algorithm (the current default BM25 algorithm adds some “tweaks” to deal with things such as document length but maintains TF/IDF at its core). The assumption here is that “similarity” in conceptual space can be modeled by “closeness” or proximity of term vectors in n-dimensional numerical space. The proof of this assumption is in the pudding, as it were – the success of machine learning approaches generally validates this basic “leap of faith” in traversing between categorical and numerical spaces.
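
As a rough illustration of this vectorization step, the snippet below computes toy TF/IDF vectors for three tiny “documents” and compares them with cosine similarity. It is only a sketch of the idea – Lucene’s BM25 adds length normalization and term-frequency saturation on top of the same core notion of weighted term overlap:

    // Toy TF/IDF vectorization plus cosine similarity.
    var docs = [
      "the quick brown fox",
      "the lazy brown dog",
      "quick brown foxes jump"
    ];

    function tfidfVectors(corpus) {
      var tokenized = corpus.map(function (d) { return d.toLowerCase().split(/\s+/); });
      var df = {};                                     // document frequency per term
      tokenized.forEach(function (tokens) {
        new Set(tokens).forEach(function (t) { df[t] = (df[t] || 0) + 1; });
      });
      return tokenized.map(function (tokens) {
        var vec = {};
        tokens.forEach(function (t) { vec[t] = (vec[t] || 0) + 1; });     // raw term frequency
        Object.keys(vec).forEach(function (t) {
          vec[t] = vec[t] * Math.log(corpus.length / df[t]);              // TF * IDF
        });
        return vec;
      });
    }

    function cosine(a, b) {
      var dot = 0, na = 0, nb = 0;
      Object.keys(a).forEach(function (t) { na += a[t] * a[t]; if (b[t]) dot += a[t] * b[t]; });
      Object.keys(b).forEach(function (t) { nb += b[t] * b[t]; });
      return dot / (Math.sqrt(na * nb) || 1);
    }

    var vecs = tfidfVectors(docs);
    console.log(cosine(vecs[0], vecs[2])); // higher: both mention "quick"
    console.log(cosine(vecs[0], vecs[1])); // lower: the shared terms occur in every document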

The machine learning “loop” involves translating textual to numeric information, detecting patterns using mathematical algorithms, and finally translating the resulting matrices of numbers back to predictions (e.g. predicted memberships in categories), given a new document that the pattern detecting algorithm has not yet encountered. The artifact produced by the pattern detection step is called a “model.” Creating good machine learning models is both an art and a science, meaning that the work of “data scientists” will continue to be in great demand.

There are two main types of machine learning algorithms, which basically differ in the source of the label or category that is mapped by the algorithm. With “unsupervised” learning algorithms such as Latent Dirichlet Allocation (LDA) or KMeans, the categories are automatically discovered from patterns in the data. Typically, this involves finding “clusters” of terms that tend to be co-located in documents in a statistically significant fashion. Applying a higher level label to these clusters is often difficult and may be augmented by inspection by human experts, after the automated clustering is done. In contrast, with “supervised” learning, the labels are provided by subject matter experts up front as part of a training set. The human supplied labels are then used to detect patterns in the training data that correlate to the labels, to create a model that can be used to predict the best category or label given a new unclassified document. By this means, the work of the subject matter expert in manually categorizing a small training set can be scaled to categorize a much larger corpus. The accuracy of the model can be tested using another set of human-tagged documents – the test set.
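
A deliberately tiny supervised-learning sketch appears below. It is nothing like the algorithms a real training job would use – the “model” here is just per-category term counts – but it shows the train/predict split described above: labeled examples are reduced to a model, and the model is then used to categorize a document it has never seen:

    // Toy supervised classifier: train on labeled text, predict a label for new text.
    var trainingSet = [
      { label: "cycling", text: "bike tire pump spare tube helmet" },
      { label: "cycling", text: "carbon bike frame wheels" },
      { label: "coffee",  text: "coffee press grinder beans" },
      { label: "coffee",  text: "espresso machine coffee beans" }
    ];

    function train(examples) {
      var model = {};                                  // label -> term counts
      examples.forEach(function (ex) {
        var counts = model[ex.label] || (model[ex.label] = {});
        ex.text.toLowerCase().split(/\s+/).forEach(function (t) {
          counts[t] = (counts[t] || 0) + 1;
        });
      });
      return model;
    }

    function predict(model, text) {
      var best = null, bestScore = -1;
      Object.keys(model).forEach(function (label) {
        var score = 0;
        text.toLowerCase().split(/\s+/).forEach(function (t) {
          score += model[label][t] || 0;               // overlap with the label's terms
        });
        if (score > bestScore) { bestScore = score; best = label; }
      });
      return best;
    }

    var model = train(trainingSet);
    console.log(predict(model, "need a new tube for my bike")); // "cycling"

Holding some labeled documents back as a test set – scoring the model’s predictions against labels it never saw during training – is how the accuracy figure mentioned above is obtained.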

Word Embedded Vectors – Direct Modeling of Semantic Space

Another machine learning technique that is rapidly gaining ground in this field is the family of algorithms such as Word2Vec, FastText, and GloVe, which can discover syntactic and semantic relationships between terms in a document corpus using statistical analysis of term relationships, by scanning large amounts of text (with tens or hundreds of billions of terms). The output of these algorithms is known as “word embedded vectors,” in which terms are related to other terms within a multi-dimensional vector space computed by the machine learning process, based on their distance (Euclidean or angular) from each other in this space. As with other forms of machine learning, the fact that semantic categories have been learned and encapsulated within the sea of numbers that constitutes the trained model (i.e. the “black box”) can be shown by probing the model’s outputs. Given a term’s learned vector, we can find its nearest neighbors in the vector space, and these near neighbors are found to be highly syntactically or semantically correlated! The relationships include synonymous terms, adjective relationships (big -> bigger -> biggest), opposites (slow -> fast), noun phrase completions, and so forth. Other semantic relationships such as Country -> Capital City are encoded as well.

The value here is that the vectors are both categorical and numerical, providing a direct mapping between these different representations of information space – thereby showing the validity of such a mapping. The mapping is a rich, multi-dimensional one in both numerical and semantic senses. The most famous example from the Google team’s Word2Vec work (also reproduced by the GloVe team) showed that the vector for “King” minus the vector for “Man” plus the vector for “Woman” yields a vector very close to the learned one for “Queen.” A “King” can be roughly defined as an “adult male with royal status.” Peeling off the “adult maleness” aspect of the “King” vector by subtracting the “Man” vector leaves its “royal status” context behind, and combining this with the vector for “Woman” yields a resultant vector that is very close to the one that was learned for “Queen.” The vectors connecting word pairs that express some semantic category such as “gender opposites” – “father, mother”, “brother, sister”, “uncle, aunt”, “son, daughter”, etc. – are seen to be roughly parallel to each other within what can be thought of as a “semantic projection plane” in the space. However, the examples above have more than one “semantic dimension” or classification category (e.g. parent, offspring, sibling, etc.). This suggests that not only can semantic relationships be encoded mathematically, but that this encoding is composed of more than one categorical dimension! As Mikolov et al. put it in their Word2Vec research paper, “words can have multiple degrees of similarity,” and this multi-dimensionality in “syntactic and semantic space” is reflected in word vector space. That’s an impressive demonstration that we can use abstract mathematical spaces to encode categorical ones!
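
The analogy arithmetic is easy to demonstrate with a toy example. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions learned from billions of tokens), but the mechanics – subtract, add, find the nearest neighbor – are the same:

    // king - man + woman should land nearest to queen (toy, hand-built vectors).
    var vectors = {
      king:  [0.8, 0.9, 0.1],
      man:   [0.7, 0.1, 0.1],
      woman: [0.7, 0.1, 0.9],
      queen: [0.8, 0.9, 0.9],
      apple: [0.1, 0.2, 0.3]
    };

    function add(a, b) { return a.map(function (v, i) { return v + b[i]; }); }
    function sub(a, b) { return a.map(function (v, i) { return v - b[i]; }); }

    function cosine(a, b) {
      var dot = 0, na = 0, nb = 0;
      for (var i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
      return dot / Math.sqrt(na * nb);
    }

    function nearest(target, exclude) {
      return Object.keys(vectors)
        .filter(function (w) { return exclude.indexOf(w) < 0; })
        .sort(function (x, y) { return cosine(target, vectors[y]) - cosine(target, vectors[x]); })[0];
    }

    var analogy = add(sub(vectors.king, vectors.man), vectors.woman);
    console.log(nearest(analogy, ["king", "man", "woman"])); // "queen"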

Among the related things that surface in word embedded vectors are noun phrase completions. For example, in an IT issues dataset, the term ‘network’ was found to be related to the words ‘card’, ‘adapter’, and ‘failure’. However, if I train the model with phrases like ‘network adapter’, ‘network card’, and ‘network failure’, the relationships will include these higher level semantic units as well. In other words, using noun phrases as input tokens turns “Word2Vec” into “Thing2Vec”. Similar observations are found in the literature.

Although interesting in and of themselves from a theoretical perspective, the results of word vector embeddings are proving to have great utility in other forms of machine learning. Recent research has shown that using pre-trained word vectors for other machine learning tasks such as document classification, spam detection, sentiment analysis, named entity extraction, etc., yields better accuracy than more basic vectorizations based on term frequencies. This makes sense, as moving from token-based vectorizations, which have no knowledge of semantics, to vectorizations that have such embedded knowledge improves the “signal-to-noise” of higher level text mining efforts.

Knowledge Spaces and Semantic Reference Frames

As we (humans) acquire language, first as young children and then as we continue our education (hopefully throughout life), we develop an internal “knowledge base” that we can use to disambiguate word meanings based on their context. For example, the term “apple” has multiple meanings as it can occur in several subject domains. The default meaning is the one known to our forebears – that of a kind of fruit that grows on trees (horticultural domain). A secondary meaning that your great-great grandmother would also have known is as a food ingredient (culinary domain) as in “apple cobbler.” In our modern culture however, this term has obtained other meanings since it has been used a) to name a computer / tech company, b) to name the recording company of a famous popular music band, and c) as a nickname of a large city on the east coast of the United States. Which meaning is intended can be gleaned from context – if “apple” is in proximity to names of varieties such as “Granny Smith”, “Honey Crisp”, or “Golden Delicious” – we are talking about apples proper. If the context terms include “iPhone”, “OS X”, “iTunes”, “Steve Jobs”, or “Tim Cook” – we are talking about Apple Computer, and so forth. We use our knowledge of these contextual relationships between terms or phrases to orient ourselves to the appropriate subject domain when dealing with text. Linguists refer to this knowledge-based orientation as semantic reference frames.
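
A knowledge-based disambiguator can be sketched in a few lines. The context-term lists below stand in for a real knowledge base or trained model, but they show how co-occurring terms select the intended sense:

    // Pick the sense of "apple" whose known context terms best match the surrounding text.
    var senses = {
      "apple (fruit)":   ["granny smith", "honey crisp", "golden delicious", "orchard", "cobbler"],
      "Apple (company)": ["iphone", "os x", "itunes", "steve jobs", "tim cook"]
    };

    function disambiguate(text) {
      var lower = text.toLowerCase();
      var best = null, bestHits = -1;
      Object.keys(senses).forEach(function (sense) {
        var hits = senses[sense].filter(function (term) {
          return lower.indexOf(term) >= 0;             // count context terms present in the text
        }).length;
        if (hits > bestHits) { bestHits = hits; best = sense; }
      });
      return best;
    }

    console.log(disambiguate("apple cobbler made with Granny Smith apples"));   // "apple (fruit)"
    console.log(disambiguate("the new Apple iPhone syncs with iTunes"));        // "Apple (company)"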

Taxonomies, Ontologies: Knowledge Graphs

Giving this same ability to search applications enables them to better infer query intent and improve the relevance of search results. One famous example is IBM’s Watson – which learned millions of facts by scanning an enormous number of text documents, building a knowledge base that could then be used to play (and win) the TV game show Jeopardy. Watson used a framework that IBM open sourced, called UIMA, to first extract entities using Named Entity Recognition or NER techniques and then to extract facts by detecting semantic patterns of named entities. Performing a similar exercise on the query (NER + pattern detection) enabled Watson to plug the named entity variables (person:X book_title:Y) into a lookup against its knowledge base. The AI systems that are now infiltrating our homes and cars, such as Google, Cortana, Siri, and Alexa, use similar techniques. The key is to use some knowledge base to tag terms and phrases within a query or document so as to ascribe semantic meaning to them. Once this is accomplished, the rest is just like SQL. The Query Autofilter discussed above does a similar thing, except that in this case the knowledge base is the search collection itself.

Ontologies and taxonomies are types of “knowledge graph” – data structures that enable known facts and relationships to be codified – they are, in other words, graphical representations of knowledge space. Taxonomies encode hierarchical relationships – either IS-A class/subclass (hypernym/hyponym) or HAS-A whole-part (meronym) relationships. Ontologies are more open-ended and encode knowledge spaces as a set of nodes, which typically define entities, and edges, which define the relationships between pairs of nodes.

Taxonomies and ontologies have traditionally been manually created and curated. This has the advantage of accuracy but suffers in terms of completeness and currency (i.e. keeping it up to date). For this reason, machine learning approaches, while more error-prone, are increasingly being used for knowledge mining. For example, the semantic frames approach that is used for user intent detection in queries and for pattern detection in text can also be used to build knowledge graphs. An example of this would be to take the text “NoSQL databases such as MongoDB, Cassandra, and Couchbase”, where the phrase “such as” or “like” implies that the things on the right are instances of the class of things on the left. These “lexico-syntactic patterns” are known as Hearst Patterns [Hearst 1992] and can be used to build taxonomies automatically via text analysis [e.g. TAXI]. Another approach is to couple metadata in a search collection with text fields to build graphs of related terms using a search engine like Solr. The resulting Semantic Knowledge Graph can then be used for a number of search improvement tasks such as “knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems” (Grainger et al. 2016). As this example and the ones involving word embedded vectors (Word2Vec, GloVe) show, machine learning paradigms can be layered – creating outputs where the sum is greater than the parts.
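
As a concrete illustration, a single Hearst pattern (“X such as A, B, and C”) can be captured with a regular expression. The snippet below is only a sketch of the idea, not the TAXI system or Fusion’s Semantic Knowledge Graph:

    // Mine IS-A relations from one Hearst pattern: "<class> such as <instance>, <instance>, and <instance>".
    function hearstSuchAs(sentence) {
      var m = sentence.match(/([\w\s]+?)\s+(?:such as|like)\s+(.+)/i);
      if (!m) return [];
      var hypernym = m[1].trim();
      return m[2]
        .split(/,|\band\b/)
        .map(function (s) { return s.trim(); })
        .filter(function (s) { return s.length > 0; })
        .map(function (hyponym) { return { hyponym: hyponym, isA: hypernym }; });
    }

    console.log(hearstSuchAs("NoSQL databases such as MongoDB, Cassandra, and Couchbase"));
    // [ { hyponym: 'MongoDB', isA: 'NoSQL databases' },
    //   { hyponym: 'Cassandra', isA: 'NoSQL databases' },
    //   { hyponym: 'Couchbase', isA: 'NoSQL databases' } ]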

If the output of the automated knowledge mining operation is something that humans can inspect (i.e. not a matrix of numbers), hybrid systems are possible in which subject matter experts correct the ML algorithm’s mistakes. Better still, if the ML algorithm can be designed to learn from these corrections, it will do better over time. Human-supplied knowledge base entries or negative examples gleaned from manual deletions can be used to build training sets that learning algorithms can use to improve the quality of automated knowledge mining. The development of large training sets in a number of domains is one of the things fueling this collision of search and AI. Human knowledge is thus the “seed crystal” that machine learning algorithms can expand upon. As my friend the Search Curmudgeon blogged about recently, in addition to data scientists, hire more librarians.

Although this exercise may never get us to 100% accuracy, once we get past some threshold where the computer-generated results are no longer routinely embarrassing, we can aggressively go to market. Although knowledgeable search and data science professionals may be tolerant of mistakes and impressed by 85% accuracy, this is not so true of the general public, who expect computers to be smart and take occasional egregious mistakes as confirmation that they are in fact not smart. With AI-powered devices popping up everywhere we look, it is beyond question that we have passed the general public’s machine intelligence “sniff test.” As I mentioned earlier, while we are still a ways short of passing the test set by Alan Turing, we are clearly making progress towards that goal.

Lucidworks Fusion AI

So given all of this, what does our product Lucidworks Fusion bring to the table? To begin with, at Fusion’s core is the best-in-class search engine Apache Solr. However, as discussed above, to go beyond the capabilities of traditional search engines, bleeding-edge search applications rely on artificial intelligence to help bridge the gap between human and computer language. For this reason, Apache Spark – a best-in-class engine for distributed data processing with built-in machine learning libraries – was added to the mix as of Fusion 2.0 (the current version is 4). In addition to Spark, it is also possible to integrate other artificial intelligence libraries into Fusion, such as the impressive set of tools written in Python, including the Natural Language Toolkit (NLTK) and TensorFlow. This makes building complex machine learning systems such as Convolutional Neural Networks or Random Forest Decision Trees and integrating them into a search application much easier. Fusion also provides ways to optimize ML algorithms by searching “hyperparameter” spaces (each model has tuning parameters which are often highly non-intuitive) using a technique called “grid search.” In short, while still benefiting from some data science expertise, Lucidworks Fusion puts these capabilities within reach of search developers who would not consider themselves to be machine learning experts. As of the latest release, Fusion 4.0, more intelligence has been “baked in”, and this will continue to be the case as the product continues to evolve.

Machine Learning Jobs, Blobs, Index and Query Pipelines

That Lucidworks Fusion is a great platform for integrating artificial intelligence with search can be seen by examining its architecture. Apache Spark serves as the engine by which machine learning jobs – either OOTB or custom – are scheduled and executed. The outputs of these jobs are the so-called ML “models” discussed above. To Fusion, a “model” can simply be considered a collection of bytes – a potentially large collection that, in the tradition of relational database nomenclature, is called a BLOB (which stands for Binary Large Object). Fusion contains a built-in repository we call the Blob Store that serves as the destination for these ML models. Once in the Blob Store, the model is accessible to our Index and Query Pipelines. As discussed above, machine learning is generally a two-phase process in which supervised or unsupervised algorithms are used to learn something about a collection of content, which is encapsulated in a model (phase 1). The model is then used to predict something about previously unseen content (phase 2). In Fusion, phase 1 (model training) is done using Fusion’s Apache Spark clusters, and phase 2 is applied either to content sent to a Fusion search cluster for indexing into Solr via the Index Pipeline or to queries submitted to Fusion via the Query Pipeline, through the intermediary of the Blob Store. In this way, tagging or marking up content to enhance searchability, inspecting queries to infer user intent, and re-ranking results are all achievable using the same basic plumbing.

Signals, Machine Learning and Search Optimization

In addition to the plumbing, Fusion “bakes in” the ability to merge your data with information about how your users interact with it – enabling you to create search applications that become more useful the more they are used. We call this capability “signals.” In short, signals are metadata objects that describe user actions. A signal could be a mouse click on a document link, a click on a facet, an “add to cart”, a “bookmark URL”, or a “download PDF” event. Other types of events could be “viewed page for 30 seconds” or “hovered over flyout description” – basically anything that your front-end search application can capture that your UX designers would consider to be an indicator of “interest” in a search result. The metadata describing the event is then sent to Fusion’s REST API to be indexed into the raw signals collection along with the id of the query that preceded the event. The signals collection automatically receives query events (as of Fusion 4.0 it functions as the query log collection as well).

Signals –> Aggregation –> Models –> Pipelines

Once you have acquired a sufficient amount of information on how users interact with your data, you can use this information to improve their search experience. The first step is to aggregate or roll up the raw signal data so that statistical analyses can be done. Fusion comes with pre-built Spark aggregation jobs that you can use, or you can roll your own if necessary. The resultant aggregations are stored back in Solr. At this point, we can use the aggregated data to train ML algorithms for a number of purposes – collaborative filtering, user intent classification, item-to-item or query-to-item recommendations, learning to rank, and so on. The resulting models are put in the Blob Store where they can be used by Fusion Index or Query Pipeline stages to effect the desired transformations of incoming content or queries.
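
To make the aggregation step concrete, here is a toy roll-up of raw click signals into per-query, per-document counts. In Fusion this work is done by the pre-built Spark aggregation jobs over the signals collection, but the shape of the output – counts keyed by query and document – is similar:

    // Roll up raw click signals into (query, document) click and unique-user counts.
    var rawSignals = [
      { type: "click", query: "bike pump", docId: "SKU-1", userId: "u1" },
      { type: "click", query: "bike pump", docId: "SKU-1", userId: "u2" },
      { type: "click", query: "bike pump", docId: "SKU-2", userId: "u1" },
      { type: "click", query: "coffee grinder", docId: "SKU-9", userId: "u3" }
    ];

    function aggregate(signals) {
      var agg = {};
      signals.filter(function (s) { return s.type === "click"; })
        .forEach(function (s) {
          var key = s.query + "||" + s.docId;
          agg[key] = agg[key] || { query: s.query, docId: s.docId, clicks: 0, users: {} };
          agg[key].clicks += 1;
          agg[key].users[s.userId] = true;
        });
      return Object.keys(agg).map(function (k) {
        var a = agg[k];
        return { query: a.query, docId: a.docId, clicks: a.clicks,
                 uniqueUsers: Object.keys(a.users).length };
      });
    }

    console.log(aggregate(rawSignals));
    // e.g. { query: 'bike pump', docId: 'SKU-1', clicks: 2, uniqueUsers: 2 }, ...

Aggregates like these are what collaborative filtering and learning-to-rank models are trained on before the resulting models are pushed to the Blob Store.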

In this way, the Lucidworks architecture provides the perfect platform for the fusion of Search and Artificial Intelligence (pun very much intended!)

The post When Worlds Collide – Artificial Intelligence Meets Search appeared first on Lucidworks.

Why Productivity is so Elusive

There are three major theories as to why the US, UK, and other developed economies have seen a stagnation in productivity growth in recent years. The first is that we’ve already squeezed all the major productivity gains we’re really going to get. The second is that we’re not measuring productivity correctly. The third is that productivity lags while major innovations are developed.

The first theory is interesting to economists but seems wrong on its face. If you’ve ever sat on the phone for hours trying to fix a screw-up, or had a tough business question to answer, you know there are still things we’re not doing. I can stare across any office in the world and see that we haven’t quite finished optimizing the business.

At the same time, we probably can’t expect to see the kind of productivity gains in the economy that we saw from telecommunication, electrification, and digitization, at least not for a long while.

The second theory, that we’re not measuring productivity correctly, rings true. In the last few decades, business has transformed in two major ways. The first is globalization and the second is digitalization.

Major manufacturing has moved to developing economies, while finance and data management largely reside in developed economies. This trend is changing as developing economies mature, but it has largely been the state of the economy for the last few decades. This means that a national measure of productivity may not be as relevant as it once was.

With digitization, more work is being done by computers, but digitization is far from complete. Most decisions, even relatively minor ones, are made in the analog world. This requires people to sift through an ever-increasing amount of data. While digitization realized major gains in the early 2000s, data exploded after that, and businesses are still figuring out how to use it for better productivity.

The third theory, that gains are on a long curve, also has a lot going for it. Major investments in commercialized outer space applications, self-driving cars, and artificial intelligence are a drain on economic measures. A lot is being put into them, but only AI is beginning to pay off, and it is just on the verge of the kind of mass deployment it deserves.

What can we do about it?

If the first theory were true then I suppose we couldn’t do much. However, if bits of all of the theories are true, there is a lot we can do for an individual business.

First of all, we need to adapt our business processes to a data-driven economy. This means automating decision making and using workflow systems to drive actions, but most of all being ready to change them as the situation on the ground changes.

Secondly, the most important thing is to equip our workers to deal with ever-increasing amounts of data. This means deploying systems that let them find answers to what they need when they need it, if not a bit before. When you stare across the office, ask the question “who is waiting on something and why?” When you receive an email asking a question, ask “how could they have answered this question for themselves?”

Third, it is time to deploy AI and make use of data science throughout the enterprise. With more and more data, people can’t make all the small decisions themselves. It starts with an investment in, and a strategy for, AI. How is your company going to educate its workforce, adapt its business processes, deploy the technologies, operationalize them, and fully instrument the business? It is time to answer these questions.

So basically…

Measuring productivity is difficult, and we’re probably doing it wrong. Meanwhile, innovation takes time, and investment first looks like a drain on productivity. With the increasing amount and centrality of data, we need to equip our workforce to work with that data. A key piece of working with data is deploying AI technologies and turning over the smaller, everyday decisions to the machines.

Find out more

  • Enterprise Search Buyers Guide – If you are looking for an Enterprise Search Solution to increase your productivity, this buyers guide goes over the marketplace, terminology, features and selection criteria.
  • Fusion 4 Overview – Lucidworks Fusion has technologies that can help you answer questions and incorporates essential AI technologies
  • Head-n-Tail analysis webinar – Our previously recorded webinar on using AI to answer why users didn’t find what they need (and automatically correct it).
  • Contact us, we’d love to help you.

The post Why Productivity is so Elusive appeared first on Lucidworks.

Enterprise Search Buyers, Your Questions Answered

Enterprise search is to corporate intranets what Google is to the wider internet. When deployed correctly and completely, business productivity can be greatly increased by search functionality. Without effective and powerful enterprise search, users are forced to do things “the old way,” passing organizational knowledge via word of mouth or by emailing links or bookmarks. This introduces an overall tax of inefficiency into the workplace.

Are you trying to buy an Enterprise Search Solution, but finding all of the terminology that vendors use or features that vendors offer a bit baffling? We’ve put together a buyers guide to help you navigate the market.

The buyers’ guide goes over:

  • Different types of solutions
  • Search capabilities offered
  • UI Capabilities
  • Data Import Capabilities
  • Operations
  • Security
  • Analytics
  • Selection Criteria

Click here to register and download a copy!

Enterprise Search Buyers Guide

The post Enterprise Search Buyers, Your Questions Answered appeared first on Lucidworks.


Lucidworks Announces $50M in Growth Financing

A very happy Thursday it is!

“Today Lucidworks, the leader in AI-powered search and discovery, announced $50 million in growth financing. Lucidworks will use the capital infusion to expand the company’s enterprise offerings, and to help the world’s leading companies bring Smart Data Experiences to the market. Top Tier Capital Partners led the round with participation from Silver Lake Waterman, Silver Lake’s late stage growth capital fund. The round also includes existing investors Shasta Ventures, Granite Ventures, and Allegis Capital and comes after Lucidworks more than quadrupled its annual recurring revenue (ARR) over the last three years and added top quality talent at the highest levels. The company expects to continue as a powerful technology provider for delivering actionable insights to end users in the world’s largest organizations, including a broad selection of Fortune 500 companies.”

More details in the full press release.

Coverage in VentureBeat.

The post Lucidworks Announces $50M in Growth Financing appeared first on Lucidworks.

Introducing Activate, the Search & AI Conference

The Revolution has been won.

We’re excited to announce that Lucene/Solr Revolution is expanding its coverage this year to include more talks about applied artificial intelligence in the realm of search and discovery, in addition to the already rich coverage of Apache Lucene/Solr everyone has grown to love.

As the fields of Search and Artificial Intelligence mature and collide, there are many opportunities to deliver more personalized and relevant user experiences by incorporating AI into our search applications.  We feel it is a natural step in the evolution of the conference and also important to those working in the Lucene/Solr community that we explicitly expand the scope of the conference to fully embrace this deeper focus. To highlight the way that the convergence of Search & AI is increasingly activating relevancy and insights to enable the interaction of users with data in a smarter way, we’re changing the conference name from Lucene/Solr Revolution to Activate. Click here to register for Activate. 

Activate will be positioned as the premier Search and AI conference, bringing the best of the open source Lucene/Solr project, applied AI techniques for driving enhanced relevancy and personalization, and the complementary research, technologies, and innovations happening within the industry to drive the cutting edge in search and information retrieval. The conference will be held October 15-18, 2018 in Montreal, Canada.

A Look Back

When Lucene/Solr Revolution started back in 2010, we truly were at the forefront of a revolution. Open source search was really beginning to take hold as a viable replacement for legacy black-box search vendors, and the Big Data era was still in its infancy. Lucidworks started Lucene/Solr Revolution as a way to bring the Lucene and Solr communities together to meet, network, and drive innovation to the open source projects.

This conference has always held a place near and dear to my heart – I attended from the very first conference, have spoken every year since, recruited for my previous team at CareerBuilder, found my new home at Lucidworks, and met many fascinating and smart people along the way.

Fast forward 8 years from that first conference, and Apache Solr has been adopted to power many of the most critical applications at the world’s most innovative companies. Companies like Apple, Reddit, and Salesforce have taken the main stage at Revolution to show us how they are building tools with Solr to power their most important applications. In all of these cutting-edge use cases for Solr, there are consistent themes. Legacy search technologies have been replaced with open source Solr-based platforms, these platforms have to be able to scale well beyond all the platforms they have displaced and to enable many more use cases, and there is now an ever-increasing demand for smarter, AI-powered search capabilities to drive relevancy and insights.

The Revolution has been won

Today, Lucene/Solr are the de-facto standard for running search across most organizations. The revolution we started back in 2010 has been a wild success –  the Revolution has been won!  With the overall focus on core Lucene having declined over the years as people have increasingly embraced Solr to build more sophisticated search solutions, and with the realization that the revolution has accomplished its mission, the name “Lucene/Solr Revolution” is ripe for this branding facelift to “Activate, the Search & AI conference”. While continuing to keep you up-to-date on the latest and greatest in Solr, we also want to drive innovation around the larger ecosystem surrounding Solr, including complementary technologies (Spark, Fusion, NLP, Machine Learning, UX, etc.) and to ensure that as a community we increasingly attract and embrace a much larger, diverse set of participants interested in tackling many of these emerging interdisciplinary problems in search and relevancy.

What to Expect from Activate

Although we’ve always had a diverse set of sessions at the conference, based upon feedback from last year and our expectation of an expanded audience this year, we will be more clearly segmenting sessions for attendees based upon areas of interest: talks dedicated specifically to deep dives into evolving capabilities in open source Solr, sessions for developers and operators harnessing the power of Solr along with complementary technologies, demonstrations of applied Artificial Intelligence and advanced relevancy techniques, and additional presentations covering new and evolving applications for search on the horizon.

In addition to talks from the brightest committers and contributors to the Apache Lucene/Solr project – including more than a dozen full-time committers working on the open source project at Lucidworks – you can expect to find more extensive content around techniques like NLP, Entity Extraction, Text Analytics, ML Algorithms, and Knowledge Graphs, as well as practical examples of the enhancement of search applications with AI.

I am very excited for this change and I think the Solr community will greatly benefit from the expanded content we will offer at Activate. Save the Dates: October 15-18 in Montreal.

We’ll be announcing more details and a call for speakers very soon. Details will be available at https://activate-conf.com/.

I look forward to seeing you there!

Trey Grainger
SVP of Engineering
Lucidworks

The post Introducing Activate, the Search & AI Conference appeared first on Lucidworks.

Smarter Image Search in Fusion with Google’s Vision API

After five days in Mexico with your friends, surfing sunny, secluded point breaks along the Pacific side of Baja, you’re on your way back home, squinting at an overly dense photo gallery on your phone (or worse, sifting through a collection of files in your DCIM folder with names like DSC10000.JPG), trying to find that one photo of the classic 1948 Woodie Station Wagon with the single-fin surfboard on the roof rack…

woodie car with surfboards

Granted, even if you are not a surfer or a car enthusiast, better cases have been made for smart image search, including the ubiquitous search for “blue suede shoes” on your favorite commerce site, social media networks, and facial recognition applications, to name a few. Companies like Google, Facebook, and Pinterest have invested heavily in image recognition and classification using artificial intelligence and deep learning technologies, and users, who have grown accustomed to this level of functionality, now expect the same behavior in their enterprise search applications.

Lucidworks Fusion can index images and other unstructured document formats, such as HTML, PDF, Microsoft Office documents, OpenOffice, RTF, audio, and video. To that end, Fusion uses the Apache Tika parser to process these files. Tika is a very versatile parser that provides comprehensive metadata about image files such as dimensions, resolution, color palette, orientation, compression ratio, and even the make and model of the camera and lens from which the photo originated, if that information is added by the imaging software. That said, such low-level metadata is not useful enough when searching for a “red car with white wall tires”…

Default Image Parsing Results
Figure 1: Default Image Parsing Results

Enter the Google Cloud Vision API

The Google Cloud Vision API is a REST service that enables developers to easily understand the content of an image while completely abstracting the underlying machine learning models. It quickly classifies images into categories (e.g. “surfboard”, “car”, “beach”, “vacation”), detects individual objects and faces within images, extracts printed words contained within images, and can even moderate offensive content via Safe Search detection. When the GCV API is used to augment the image parsing capabilities of Fusion Server, the combined metadata can be used by Fusion AppStudio to create a sophisticated image search experience.

Using GCV From Fusion

To incorporate the GCV API functionality into your search app, follow these steps:

  1. Add a Tika Parser stage to the Index Pipeline. Ensure that the Include Images and Add original document content (raw bytes) options are checked, and that the Flatten Compound Documents option is unchecked.
    Tika Parser
    Figure 2: Tika Parser

     

  2. Add a JavaScript stage to your Fusion Index Pipeline. The following code snippet converts the raw binary image content stored by Tika into a base64-encoded string:
    function (doc) {
        if (null != doc.getId()) {
           var ByteArrayInputStream = java.io.ByteArrayInputStream;
           var ByteArrayOutputStream = java.io.ByteArrayOutputStream;
           var DatatypeConverter = javax.xml.bind.DatatypeConverter;
     
           var raw = doc.getFirstFieldValue("_raw_content_");
           if (null != raw) {
              var bais = new ByteArrayInputStream(raw);
              if (null != bais) {
                 var bytes;
                 var imports = new JavaImporter(
                    org.apache.commons.io.IOUtils,
                    org.apache.http.client);
     
                 with(imports) {
                    var baos = new ByteArrayOutputStream();
                    IOUtils.copy(bais, baos);
                    bytes = baos.toByteArray();
                    var base64Input = DatatypeConverter.printBase64Binary(bytes);
                    doc.setField("base64image_t", base64Input);
                 }
              }
           }
        }
        return doc;
     }

    Note: the script uses the javax.xml.bind.DatatypeConverter class. In order to load this class at runtime, you’ll need to add the JAXB library to the connector’s classpath and restart the connectors-classic process, as follows:

    cp $FUSION_HOME/apps/spark/lib/jaxb-api-2.2.2.jar $FUSION_HOME/apps/libs/
     ./bin/connectors-classic stop
     ./bin/connectors-classic start
  3. Add a REST Query stage to your Fusion Index Pipeline. Configure the REST query as shown in the following figures:
    HTTP parameters
    Figure 3: HTTP parameters

    Note: you can generate your own API key on the Google Cloud Console.

    Field Mapping
    Figure 4: Field mappings

    Use the following JSON as the request entity. Note the use of the ${base64image_t} field that was set in the preceding JavaScript stage as the value for the image.content request property.

    {
        "requests": [{
           "image": { "content": "${base64image_t}" },
           "features": [
              { "type": "TYPE_UNSPECIFIED", "maxResults": 50 },
              { "type": "LANDMARK_DETECTION", "maxResults": 50 },
              { "type": "FACE_DETECTION", "maxResults": 50 },
              { "type": "LOGO_DETECTION", "maxResults": 50 },
              { "type": "LABEL_DETECTION", "maxResults": 50 },
              { "type": "TEXT_DETECTION", "maxResults": 50 },
              { "type": "SAFE_SEARCH_DETECTION", "maxResults": 50 },
              { "type": "IMAGE_PROPERTIES", "maxResults": 50 },
              { "type": "CROP_HINTS", "maxResults": 50 },
              { "type": "WEB_DETECTION", "maxResults": 50 }
           ]
        }]
     }
  4. Bonus — use Name That Color to find the name of the closest matching color to the image’s dominant RGB code. Add another JavaScript stage to the Index Pipeline with the following code:
    function (doc) {
       /*
        +---------------------------------------------------------+
        | paste the content of http://chir.ag/projects/ntc/ntc.js |
        | right below this comment
        +---------------------------------------------------------+
        */
       
       if (null != doc.getId()) {
          var reds = doc.getFieldValues(
             "gv_dominant_colors.colors.color.red");
          var greens = doc.getFieldValues(
             "gv_dominant_colors.colors.color.green");
          var blues = doc.getFieldValues(
             "gv_dominant_colors.colors.color.blue");
          var scores = doc.getFieldValues(
             "gv_dominant_colors.colors.score");
          if (null != reds && null != greens &&
              null != blues && null != scores) {
             var dominant = -1;
             var highScore = -1;
             var score = -1;
             
             for (var i = 0; i < scores.size(); i++) {
                score = scores.get(i);
                if (null != score && parseFloat(score) > highScore) {
                   highScore = score;
                   dominant = i;
                }
             }
     
             if (dominant >= 0) {
                var red = parseInt(reds.get(dominant)).toString(16);
                if (red.length == 1) red = "0" + red;
                var green = parseInt(greens.get(dominant)).toString(16);
                if (green.length == 1) green = "0" + green;
                var blue = parseInt(blues.get(dominant)).toString(16);
                if (blue.length == 1) blue = "0" + blue;
                var rgb = "#" + red + green + blue;
                score = parseFloat(scores.get(dominant));
                if (null != rgb && null != score) {
                   var match = ntc.name(rgb);
                   if (null != match && undefined != match &&
                       match.length > 0) {
                      if (rgb != match[0]) {
                         doc.setField(
                            "gv_dominant_color_s", match[0]);
                         doc.setField(
                            "gv_dominant_color_score_d", score);
                      }
                      if (match.length > 1) {
                         doc.setField(
                            "gv_dominant_color_name_s", match[1]);
                      }
                   }
                }
             }
          }
       }
       // return the document so downstream index pipeline stages receive it
       return doc;
     }

Next, initiate a crawl of your web or file system datasource. For this demo, I am indexing a directory of photos that I previously downloaded from the Internet, served by a local web server.

Note: ensure that the Parser selection drop-down remains empty, as we are using a Tika Parser stage instead.

Web Datasource
Figure 5: Web Datasource

Once we have some indexed content, we can set up a few facets using the Query Workbench.

Query Workbench
Figure 6: Query Workbench

Putting it All Together

Using the newly enriched metadata, my new search app can now produce meaningful results for queries like “wooden surfboard” and “a car with white wall tires and a surfboard”. As an added bonus, the GCV API also performed a reverse image lookup on the images on my local server and found the original images or hosting pages on the Internet, and those links are rendered alongside the search results.

wooden surfboards
Figure 7: Wooden Surfboards
Cars with White Wall Tires and Surfboard
Figure 8: Cars with White Wall Tires and Surfboard

Conclusion

Lucidworks Fusion is a powerful and scalable search platform built on the open foundation of Apache Solr and Spark. Fusion’s index pipelines transform incoming data through a series of configurable stages into document objects for indexing by Fusion’s Solr core. The REST Query and JavaScript stages offer quick and easy extensibility of the data ingestion process to include external APIs such as Google’s Cloud Vision API that further enrich the metadata used by search applications.

Next Steps

The post Smarter Image Search in Fusion with Google’s Vision API appeared first on Lucidworks.

Podcast: Data Science, Machine Learning, and Artificial Intelligence

In the below podcasts, Lucidworks VP of Research, Chao Han, speaks about artificial intelligence, machine learning, and the life of a data scientist:

Chao discusses with Futuretech how her journey as a data scientist led her to Lucidworks, what’s happening in deep learning, and how big data and AI have changed over the years. She answers the questions “what does Lucidworks do?” and “how is Lucidworks Fusion unique?”. Click here to listen.

The Cloudcast asks Chao what data science skills are critical for AI and ML. Chao gives her insight into the Lucidworks Fusion AI platform and emerging research in the AI/ML space. Click here to listen.

The post Podcast: Data Science, Machine Learning, and Artificial Intelligence appeared first on Lucidworks.

Five Things You Missed at IRCE18

Some of the Lucidworks gang is in Chicago this week at the annual Internet Retailer Conference + Exhibition (IRCE). Be sure to come say hello at booth #281 if you’re looking for AI, search, or a complimentary Americano. If you’re not familiar with IRCE, it is a huge conference for retailers, and the vendors and partners that serve them. Everyone is here, all of the big names you know – from Oracle to MailChimp to the US Postal Service – are present and accounted for.

If I had to decide on a theme for the year, it would definitely be AI. Lucidworks was here last year with “AI-powered” on our booth and this year everyone claims to be “AI-powered.” The theme was further reinforced in a keynote with Accenture’s Vish Ganapathy coining “XAI” for “explainable AI” and explaining that AI might not be the panacea everyone is expecting.

I hate fun facts hat

One of the great things about this year’s conference is that you can hear from people working in the trenches in both omnichannel and online channels, like Born+Made CEO Jackie Dew. Rather than a hundred aimless vendor spiels, you get a host of actionable insights for how other retailers worked their way through the business and technical journey to a more viable and profitable business.

It is only day two, but here are five major takeaways for customers looking for solutions in this space:

One. “Nothing is ‘out-of-the-box,’ it may be in the box, but you have to put in the work to get it out of the box.” This advice comes from Rodney J. Woodruff, VP of Engineering at Weight Watchers. For solutions like Lucidworks Fusion, this means making sure you get your signals into the system and test which combination of collaborative filtering, search tuning, and AI-based recommendations work best for your customers.

Two. For the love of all that is good and holy, perform A/B testing. When the crowd was polled on the first day of the conference, most retailers responded that they were not doing A/B testing on their online properties. Guido Campello, CEO of lingerie company Cosabella, explained how essential this had been to their efforts. They were not just A/B testing their search experience but even what tagline goes on the top of the page. Lucidworks built A/B testing right into our Fusion platform so that you can test what changes work best for your site.

Lucidworks Fusion screenshot

Three. There are a lot of vendors. I have pictures that don’t capture the sheer scale of the conference and accompanying vendor expo. There are vendors who build machines that take 360 degree photos of your product so shoppers can virtually walk around the item. Nearly every software vendor claims to have implemented AI in the last year. However, unless they have a holistic solution and proven customer references on how that will lead to increased conversions, keep in mind that some firms will stamp whatever buzzword is currently cool on their product. IRCE had a whole workshop on evaluating the AI capabilities of vendors. For Fusion, we not only have customer references but we can also educate your team on strategies that you can use to produce better results than Amazon using AI.

irce exhibits

Four. Speakers from the most successful retailers, like Rodney Woodruff of Weight Watchers and Nathan Richter, head of digital strategy for Urban Outfitters, said avoiding any monolithic closed software that doesn’t provide an API was the best way to go. The second-best choice was building a REST API on top of it so that your organization can adapt to an ever-changing and more competitive landscape. Both Lucidworks Fusion and its Apache Solr core have been API-based, modular, and extensible from the beginning to give retailers the most flexibility and power.

irce audience

Five. Bernardine Wu, a consultant with retail experts Fit for Commerce, polled the room and found that about 80% said they were looking to “re-platform” their technology stack in the next 18 months. “You just heard all of the vendors gasp.” They were gasping because they knew this meant that they hadn’t delivered results and technology compelling enough to keep a customer around.

lucidworkers in booth

I’m excited for tomorrow’s sessions on managing the technology stack, particularly the session on machine learning. I’ve found a lot of further validation for all of the hard work Lucidworks has put into its AI solutions. Many of the technology vendors are starting to catch up to where we were last year, but customer success is what speaks the loudest.

free espressos

 

Learn more:

Or drop us a line, we’d love to hear from you!

The post Five Things You Missed at IRCE18 appeared first on Lucidworks.
