
Hello Boston, My Old Friend


It all started in Boston…

In 2010, for the inaugural Lucene Revolution in Boston MA, I tried to weasel out of giving a prepared talk by proposing a Live Q&A style session where I’d be put on the spot with tough, challenging, unusual questions about Solr & Lucene — live, on stage. I don’t remember what my original session title was, but the conference organizer realized it sounded a lot like the “Stump The Chump” segment of the popular “Car Talk” radio show, hosted by Boston’s own Click & Clack, and insisted that be the title we use.

It’s been 6 years since that first “Stump The Chump” session in Boston, and now — one month from today — Stump The Chump will be returning to Boston for Lucene/Solr Revolution 2016.

If you’ve never seen our version of “Stump the Chump,” it’s a little different than Click & Clack’s original radio call-in format. In addition to being live in front of hundreds of rowdy convention goers, we also have a panel of judges who have had a chance to see and think about many of the questions in advance — because folks like you are free to submit questions via email prior to the conference (even if you can’t attend in person). The judges take every opportunity to mock The Chump (i.e., me) anytime I flounder, and ultimately the panel will award prizes to the people whose questions do the best job of “Stumping The Chump”.

As my boss Cassandra (a Boston native, and this year’s Stump the Chump moderator) would say: “It’s a Wicked Pissa!”

You can see for yourself by checking out the videos from past events like Lucene/Solr Revolution 2015 in Austin TX, or Lucene/Solr Revolution Dublin 2013. If you want a real blast from the past, check out the video from the last time “Stump The Chump” was in Boston: Lucene Revolution 2012. (Regrettably, there is no video from that first Stump The Chump in 2010.)

Information on how to submit questions can be found on the session agenda page, and I’ll be posting more details with the Chump tag as we get closer to the conference.

(And don’t forget to register for the conference ASAP if you plan on attending! The registration price will be increasing on September 16th.)

The post Hello Boston, My Old Friend appeared first on Lucidworks.com.


News Search at Bloomberg


As we count down to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Solr Committer Ramkumar Aiyengar’s talk, “Building the News Search Engine”.

Meet the backend which drives News Search at Bloomberg LP. In this session, Ramkumar Aiyengar talks about how he and his colleagues have successfully pushed Solr into uncharted territory over the last three years, delivering a real-time search engine critical to the workflow of hundreds of thousands of customers worldwide.

Ramkumar Aiyengar leads the News Search backend team at the Bloomberg R&D office in London. He joined Bloomberg from his university in India and has been with the News R&D team for nine years. He started working with Apache Solr/Lucene four years ago, and is now a committer to the project. Ramkumar is especially curious about Solr’s search distribution, architecture, and cloud functionality. He considers himself a Linux evangelist, and is one of those weird geeky creatures who considers Lisp beautiful and believes that Emacs is an operating system.

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr, on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post News Search at Bloomberg appeared first on Lucidworks.com.

Now Everybody Knows Their Names…


As previously mentioned: On October 13th, Lucene/Solr Revolution 2016 will once again be hosting “Stump The Chump” in which I (The Chump) will be answering tough Solr questions — submitted by users like you — live, on stage, sight unseen.

Today, I’m happy to announce the Panel of experts that will be challenging me with those questions, and deciding which questions were able to Stump The Chump!

In addition to taunting me with the questions, and ridiculing all my attempts to stall while I rack my brain for answers, the Panel members will be responsible for deciding which questions did the best job of “Stumping” me and awarding prizes to the folks who submitted them.

Information on how to submit questions can be found on the session agenda page, and I’ll be posting more details with the Chump tag as we get closer to the conference.

(And don’t forget to register for the conference ASAP if you plan on attending!)

The post Now Everybody Knows Their Names… appeared first on Lucidworks.com.

Solr Distributed Indexing at WalmartLabs


As we count down to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Shenghua Wan’s talk, “Solr Distributed Indexing at WalmartLabs”.

As a retail giant, Walmart provides information on millions of items via its e-commerce websites, and that number grows quickly. This calls for big data technologies to index the documents. The MapReduce framework is a scalable, highly available base on top of which distributed indexing can be built. While Solr ships with a MapReduce indexing tool, some barriers keep it from handling Walmart’s use case easily and efficiently. In this case study, Shenghua demonstrates a way to build your own distributed indexing tool and optimize performance by making the indexing stage a map-only job whose outputs are merged afterwards.

Shenghua Wan is a Senior Software Engineer on the Polaris Search Team at WalmartLabs. His focus is applying big data technologies to deal with large-scale product information to be searched online.

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr, on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Solr Distributed Indexing at WalmartLabs appeared first on Lucidworks.com.

Be A Winner In Boston: Stump The Chump Prizes


As previously mentioned: On October 13th (two weeks from tonight) I’ll be on the main stage at Lucene/Solr Revolution 2016 once again answering tough Solr questions — submitted by users like you — live, sight unseen.

A panel of judges will be on hand to mock me when I flounder, and award prizes to the folks who submitted the top 3 questions that stumped me the most:

  • 1st Prize: $100 Amazon gift certificate
  • 2nd Prize: $50 Amazon gift certificate
  • 3rd Prize: $25 Amazon gift certificate

Even if you can’t make it to The Revolution, you can still submit questions to try and stump me. Information on how to submit questions can be found on the session agenda page, and I’ll be posting more details with the Chump tag as we get closer to the conference.

(And if you do plan to attend, don’t forget to register for the conference ASAP!)

The post Be A Winner In Boston: Stump The Chump Prizes appeared first on Lucidworks.com.

Tick Tock: Time Is Running out to Stump The Chump


Time Is Running Out!

There’s only a few days left to submit your questions for Stump The Chump at Lucene/Solr Revolution 2016.

If the panel of judges decides your question did the best job of stumping me, you could win some great prizes — so what are you waiting for?

Even if you can’t make it to the Revolution, you can still submit questions to try and stump me. Information on how to submit questions can be found on the session agenda page; follow the Chump tag here on this blog to find out who won after the conference.

(And if you do plan to attend, don’t forget to register for the conference ASAP!)

The post Tick Tock: Time Is Running out to Stump The Chump appeared first on Lucidworks.com.

Lucidworks Integrates IBM Watson into Fusion Enterprise Discovery Platform


We’re excited today to announce our new integration with IBM Watson!

Lucidworks is tapping into the IBM Watson Developer Cloud platform for its Fusion platform, an application framework that helps developers to create enterprise discovery applications so companies can understand their data and take action on insights.

Today’s knowledge workers face an avalanche of data and documents. Lucidworks’ Fusion is an application framework for creating powerful enterprise discovery apps that help organizations access their information to make better, data-driven decisions. By integrating Watson, Fusion can deliver insights within seconds and can process massive amounts of structured and multi-structured data in context, including voice, text, numerical, and spatial data. Fusion’s platform upgrade with Watson reduces the work and time it takes to create enterprise discovery apps from months of in-house development to only weeks with Watson.

“Watson is a powerful platform for companies and developers to build collaboration solutions and data analysis apps,” said Will Hayes, Chief Executive Officer of Lucidworks. “We’re bringing next-generation capabilities to enterprise discovery by embedding cognitive computing technology like Watson into our Fusion platform to transform how customers create discovery applications for today’s data-rich, fast-paced work environment.”

Customers can rely on Fusion to develop and deploy powerful discovery apps quickly thanks to its advanced cognitive computing features and machine learning from Watson. Fusion applies Watson’s machine learning capabilities to an organization’s unique and proprietary mix of structured and unstructured data so each app gets smarter over time by learning to deliver better answers to users with each query. Fusion also integrates several Watson services such as Retrieve and Rank, Speech to Text, Natural Language Classifier, and AlchemyLanguage to bolster the platform’s performance by making it easier to interact naturally with the platform and improving the relevance of query results for enterprise users.

Watson APIs at Your Fingertips: Enrich your search app with Watson’s capabilities as data and documents are indexed and as queries are received by the application and a set of search results are returned to the end user.

Retrieve and Rank: Find the most relevant information for a user’s query by using a combination of search and machine learning algorithms to detect “signals” in the data. The service is built on top of Apache Solr: developers load their data into it, train a machine learning model based on known relevant results, then leverage that model to provide improved results to end users based on their question or query.

Speech to Text: Use machine intelligence to combine information about grammar and language structure with knowledge of the composition of an audio signal to generate accurate transcriptions. Transcription of incoming audio is continuously sent back to the client with minimal delay and corrected as more speech is heard.

Natural Language Classifier: Create natural language interfaces for your applications without needing a background in machine learning or statistical algorithms. The service interprets the intent behind text and returns a corresponding classification with associated confidence levels.

AlchemyLanguage: Analyze your content with sophisticated natural language processing techniques including entity extraction, sentiment analysis, emotion analysis, keyword extraction, microformat parsing, and more.

Full details on our Watson integration page.

Learn more about Fusion.

Press release.

The post Lucidworks Integrates IBM Watson into Fusion Enterprise Discovery Platform appeared first on Lucidworks.com.

Stump The Chump: Boston Winners


Last Thursday we had another great Stump The Chump session in Boston (followed by an awesome party at the top of the Prudential Building). Today I’m happy to announce the top 3 prize winners:

I’d like to thank everyone who participated, either by sending in your questions or by being there in person to heckle me, but I would especially like to thank the judges and our moderator Cassandra Targett who had to do all the hard work preparing the questions and choosing the winners.

Keep an eye on the Lucidworks YouTube page to see the video of this session — as well as videos of all the other sessions — once it is available.

The post Stump The Chump: Boston Winners appeared first on Lucidworks.com.


The Twilight of the Vengine Gods (Die Göttervenginedämmerung) or Die Hard with A Vengines!!!


The term ‘Vengines’** is short for “Vendor Engines” – like HP Autonomy, Google Search Appliance, MS Fast, and Oracle Endeca, which as we speak are fading from the scene. Not that this is news to anyone who works in this field. The Curmudgeon doesn’t dispense news; he just tells you what information, new or old, sucks or what pisses him off, and then rants about it. I should also say that there is absolutely no fact checking in this post. For some things that I am absolutely sure about, I don’t have to – I mean, if you can’t trust the Curmudgeon, who can you trust? Seriously. For other things that may or may not be true, I decided to put them on the Internet so that they would become so. Also, I’m just lazy.

“Back in the day” as us geezers are wont to say – about 1995 or so, Linus Torvalds introduced Linux and the concept of Open Source to the world. Torvalds published a web page in which he stated that it should be pronounced “Lee-nooxe” like his name not “Lih-nix” and we’ve been ignoring him on that point ever since. At the time, I was working in a small Computer Graphics company and my boss Jim Spatz predicted that Linux would ultimately rule the OS world, replacing Windows as the dominant server OS for Intel machines. Jim was not a developer so I disagreed with him, thinking that hundreds or thousands of programmers working independently can never get anything right. (I remembered a sign on an outhouse once that said “Eat poo because 40 trillion flies can’t be wrong” – although it didn’t say “poo” – use your imagination here – I’m just trying to be PC like Mike Rowe used to do on Discovery Channel’s “Dirty Jobs”). Of course, as it turned out, the Curmudgeon was wrong!! I know, hard to believe, right? Obviously one of the very few times that this has ever happened, but in my defense I wasn’t a curmudgeon then. Also, I couldn’t convert erroneous assertions to facts by uploading them to a web site as I can now. Even the Search Curmudgeon sucks at prognostication. I’m more into nostalgia now or in this case bad dreams as I think about the times that the Vendor Gods Ruled the Planet. Eric Raymond published a great book – “The Cathedral and the Bazaar” that explains why the open-source paradigm works so well. Still a great read even though it is 15 years old now. Get it on Amazon like everything else.

So just like OSes for commodity Intel hardware, Open Source is killing the search vendor engines, as a buddy of mine predicted in a previous blog on this site. He told me that he fantasizes that the Larrys read his blog, decided that he was right and made big business decisions because of it. (So if you are reading this blog Larry E. – we’ll take a ride on your America’s Cup yacht as a thank you – that would be cool – just post a comment to this blog and we’ll set up a date. Thanks in advance.) We know that Bill bought Fast just to improve Sharepoint search and then imprison it into his .NET castle, but one of the first things that Microsoft did was to deprecate support for Fast ESP on Linux – who could have seen that one coming?

Autonomy is an interesting story. I heard a story once that when Lynch was pitching it to HP he pointed at a closet and told them that there were 50 developers in there – or maybe he just didn’t spend much time hanging with the worker bees. Anyway, HP was ultimately mighty disillusioned with the purchase and there are rumors that they want to dump it but nobody wants to buy it. Before that, Autonomy bought Verity of course to get rid of them as a competitor and to get their customers. Then they created this total kludge called K2 V7 which had a Verity K2 API and an IDOL core that they never really got working and it was rumored that this was never their intention. They just wanted to upsell K2 customers to IDOL – what, by pissing them off? The joke was on them because most of Verity’s customers were Ultraseek customers who couldn’t afford IDOL anyway. Most or all of these have almost certainly transitioned to Open Source by now. Ultraseek wasn’t bad as search engines go (it was originally Inktomi before Verity bought it) – a little bit ahead of its time actually, but was totally buried within Autonomy and certainly within HP – but that’s the way it crumbles vendor-wise. Verity used to give it for free up to a certain document count.

Anyway, compared to Solr, IDOL (I forget what that stands for but I really don’t give a poo) should really be spelled IDLE because Solr blows its doors off for speed and such. It’s also kinda black-boxy like GSA. That’s why it is called Autonomy – you just plug in your data and it works. But not really, as customers are discovering. That, as some would say, was another hoax perpetrated by Mike Lynch. IDLE also has a horrendous configuration layer that always caused problems for us. Good riddance, I say. I started off with Verity, which was a fine search engine, and Autonomy at that time was our enemy until they bought Verity and we started to work with them – but they sucked way more than Verity did.

Another vendor is Fast Search and Transfer, which pitched itself as highly scalable and fast but is really neither. I remember a project at a pharmaceutical company in which we were trying to index about 1.5 TB of eRoom data (Fast was claiming petabyte scale back then). The project never really got done because jobs would fail in the middle of the night and you had to try to figure out where it crashed so you could start from there. One of those very tedious, months-long firefights where the Fast indexes were erroring out all the time and you had to write your own code to error check, collate and reindex. Fast spent a large amount of effort on error correction and recovery – much more than SolrCloud for example. It needed to, because its clusters were inherently unstable. Another thing is speed. I was once on a customer call where we were pitching a Fast – Fusion/Solr conversion or ‘rip-and-replace’ and I quipped that of the two engines in question, one is named Fast but the other one actually is fast. Everyone laughed.

When Microsoft bought them, everybody knew that it was to replace native Sharepoint search (BING is an entirely separate project). I had the dubious pleasure of working on Fast Search for Sharepoint 2010 – another kludge that has been replaced I believe with a more .NET based version. Fast has now disappeared into the MS fortress, never to be used for anything but Sharepoint search, all as originally intended by the Gods of Redmond. Fast ESP is less so and continues to be a rip-and-replace target although I think that its sunset date has passed.

That brings us to Endeca – which was a company in Cambridge, MA that I worked a lot with and that was geared primarily for eCommerce. Endeca is now of course owned by Larry E’s company and is fast losing ground to Open Source. One reason is that most (all?) of the engineers that were at Endeca left after Oracle acquired them (including the principal Steve Papa) and now nobody in this behemoth organization really understands it. I had a brilliant young engineer that used to work for me who had Endeca chops, joined Oracle to get that on his resume and spent several miserable months being thrown into the deep pool as “The Endeca Guy”. He then left to do other things. We had tried to recruit him to Lucidworks and are interested again (I’ll use the pseudonym Saurav for him) – more on that later. Endeca doesn’t scale well – we used to cringe when customers told us that they wanted to index 1 million documents, which Solr eats for a quick breakfast snack. Getting that much into Endeca was a struggle. It’s also not as fast as the others. My first experience indexing data into Solr that I had previously indexed into Endeca was a revelation – what took several hours in Endeca indexed in about 10 minutes, and at first I thought that my Solr setup was broken. Many others have experienced the same thing. We are doing a lot of rip & replaces with Endeca now and while Solr is not built primarily as an eCommerce engine as Endeca was – we are putting features into Fusion so that it can do all the things that Endeca does at much better scale and speed and at lower cost than Larry’s product – and with much better support.

Finally, on to what a buddy of mine called the Google Toaster in his blog, which they are now putting into End of Life. The main problem there is lack of flexibility and programmability, but also scale. It is literally a black box except that it’s not black – it has a user-friendly yellow color. I remember being at a Search Summit where Google was paying for our lunch and making us watch a presentation on GSA, and they presented some pedestrian number for scalability that we were chuckling over. I am really glad that Larry P.’s company is getting out of the Enterprise Search business and other do-it-yourself usages, because I am sick and tired of having potential customers tell us that “We want it to work just like Google” – although many times that’s nonsensical because it’s Enterprise, not Web, search and you can’t use things like PageRank. Of course Google will continue to innovate on their awesome web search engine, which is where they make the vast majority of their money anyway.

All four of them overpromise, overcharge and underdeliver, whereas Solr, even with Fusion on top, does the opposite in every case. As a result, the Vengines are disappearing like the Dinosaurs once did, leaving us to compete mostly with that other distributed search platform that is built on top of Lucene. I won’t mention their name because Lucidworks might ask me to remove it (no free press for them) but you know who I am talking about. Lucidworks by the way has never asked me to change anything by sending an email saying “Could you tone it down Curmudgeon?” or something like that but the joke’s on them because I don’t have an email address – but they could send something to my buddy and ask him to relay it to me. Anyway, to avoid redaction by the LW censors, I’ll use code words for now. The name of the company is like the material that is used to hold up my Jockey Shorts (hint, hint).

I heard a story from one of my Solr committer friends that the guys that started Fruit-of-the-Loom Finders had been in the Solr community but got into a disagreement with Yonik, Hoss, Erickson (?) and others about the direction of the architecture, stormed out like petulant little boys with a “We’ll show you!!” attitude and started the company named like a Tightie Whitie Quest. If so, maybe they have a few curmudgeons of their own. So if someone who works or evangelizes for the RubberBand Finders wants to challenge me to a mud-slinging debate a la Trump/Hillary or WWF, I think that would be fun. We could trade hyperbolic insults about each other’s flaws like Trump does – which makes for more heated entertainment but doesn’t really make a point. “Your code absolutely SUCKS!”, “It can’t scale for POO!”, “We’re easier to use”, “We’re less filling” or “GC HELL Raisers!”, “Brain Splitters” and other such geeky stuff. Or we could do some trash talk with ridiculous, trumped-up (rimshot please) numbers like “400 Quadrillion documents running on 24 thousand shards at 75 thousand QPS with 10 ms average latency, 500 Billion updates a second, 24/7 for MONTHS without a reboot – in your FACE BungeeSeek!!!” Hey come on – it would be fun. To keep my anonymity maybe I could come with a paper bag over my head like the Unknown Comic used to do.

Anyway, as we all know, all software has bugs but in our view, theirs has more and we fix ours faster. All fair game for a mano-a-mano or mano-a-womano style mud wrestling contest. If anyone from the Lucene Dark Side wants to take me up on this, just comment on this blog post and my representatives will get back to you. (Note that we are the Bright Side because Solr is the Sun and it’s Hot!) It would also give me a chance to bone up on your code so I don’t look totally clueless and, as Sun Tzu would say, a chance to study my enemy. Another thing about the bra-strap guys is that they call their input things “Rivers” while we at Lucidworks call them “Pipelines”. I have heard stories that their customers sometimes feel that they are up poo creek without a paddle. But whatever, we are both children of what The Doug has brought forth and will duke it out for market supremacy. I’m betting on Solr (surprised?). And whatever the outcome, we are both pushing each other to get even better, which will be good for all of us, so I don’t see a resurgence of the vengines anytime soon (like Night of the Living Dead or something).

Another sign that Open Source search is booming is that everyone is looking to hire Solr engineers. Our strategy is to train them through our Solr Unleashed and Solr Under The Hood training courses, send them out into the world to mature and ripen and then poach them back from our customers to work for our company when they get really good. Working well I hear, but we still need more of them. If you come to Lucidworks, you might even get to work with me, but I would understand if you made a stipulation on accepting a job offer that “I’ll come if you don’t make me work with that crotchety old bastard!” I’m sure that the Stretch Armstrong Boys are also hiring vigorously.

So getting back to the vendor gods. To their credit, they are also embracing the Open Source Revolution in a big way. If Oracle does nothing more than to contribute, maintain and evolve Java, they’ve done plenty! Google of course has contributed many awesome things including Guice, AngularJS and Word2Vec to mention a few. They have also published research papers on their core technologies like BigTable and MapReduce that helped The Doug when he started working on HDFS and Hadoop respectively. So they “get” it. Microsoft is of course a totally different story and always will be. HP isn’t even a software company – really. We’ve learned to interoperate with the MS parallel universe but it’s always been a royal pain. The thing that floors me is how they managed to rip off Java when they created C-sharp (D flat?) – and made some improvements to be sure – but kept us from being able to easily interoperate with it. It only works natively with their own poo. Black-box genius.

So to close, it is abundantly clear that the search world is fully embracing the enterprise that Yonik and The Chump have created for us and are running with it. I for one am having a great time and don’t miss working with the vengines at all. Thanks Yonik. Thanks Chumpman! Thanks Solr Community – especially the committers thou Sultans of Solr, Lords of Lucene – super job! And I’ll see you Rubbermaid Retrievalware knuckleheads on the debate floor/pit when you’re ready to take the gloves off (just not on a Football night please). Bring it on!

** I stole the term ‘vengines’ from my close friend Ted Sullivan’s Well Tempered Search Application blogs published on Lucidworks almost two years ago. To paraphrase Groucho – if Ted were any closer, he’d be in back of me. Check out the hint at the bottom of this post.

The post The Twilight of the Vengine Gods (Die Göttervenginedämmerung) or Die Hard with A Vengines!!! appeared first on Lucidworks.com.

Google/Amazon-like Search for Your Retail Site


When shopping online, today’s customers do not expect to put much effort into finding what they need. Effective search is the difference between being a successful online retailer and ending up a parked domain for sale. Search on an ecommerce site today goes well beyond the little box in the upper left-hand corner. It weaves through every part of the site, from the first time a customer visits the site to their next purchase. Today’s customers expect a Google-like or Amazon-like experience with their search results, and effective online retailers anticipate their needs before they search.

Key Concerns for Online Retail Sites

Making a customer or potential customer’s experience one where “it’s easy to find things” means bringing back relevant search results from the very first query. This includes results that are personalized based on who the customer is and what they’ve ordered or browsed in the past. A smart retail site doesn’t wait for the next query but targets the customer with recommended products whenever they come to the site. Further, relevant results are timely: don’t show winter coats on sale when a customer is shopping for shorts in summer. To provide this type of highly detailed relevancy, the retailer needs relevant information about:

  • The customer, including any demographic information, regional data, and past purchase and shopping behavior
  • Organized/indexed data from the backend inventory, so the search app is always selling items that are in stock – and boosting ones that are trending
  • Product information containing the various keywords and text a customer might enter into a search box
  • Data from auxiliary systems like credit-processing and loyalty

Achieving effective search, targeting and personalization often means overcoming obstacles such as connecting feeds to/from legacy systems and keeping your data as fresh as possible.

Obstacles Retailers Face

Many online retailers didn’t start out online and often have legacy systems ill-suited for the realities of today’s ecommerce. Many of them have a brick-and-mortar presence and have systems that are dual-use between the online and offline lines of business. Even retailers that were born online generally didn’t debut yesterday, and not all of their systems are as easily accessible as modern-day ecommerce platforms (we’re looking at you, Blue Martini). This data may be difficult to integrate with other systems and may not even be “clean” enough in some cases to develop an effective schema out of the box.

Meanwhile, to personalize and target customers effectively, a retailer needs to know something about those customers. That means developing an effective customer profile which stores characteristics of the customer. This data helps produce relevant search results that align with the customer’s preferences at query time. Sometimes this data is in a custom database, other times it is in a CRM system, or stored in the ecommerce platform itself.

The schema which represents the customer must be flexible and adapt to the ever-changing needs of the business – and needs of the customer. This means being able to add new types of data without a major system change. For example, just because people use Skype today for customer support doesn’t mean they won’t use an entirely different means of communicating or paying tomorrow. Storing the customer’s shoe size is one thing, but what kind of smartwatch they wear may be something a retailer needs to know in the future, as it might be useful for both payment and suggesting peripherals. A fixed-schema approach such as those inherent to an RDBMS may not be an effective and efficient way to capture and use this information.

Finally, Google Analytics is fine for tracking how your whole site is used, but effective relevance tuning and search requires more personalized and comprehensive clickstream tracking. That means capturing what a customer is looking at and storing it with the customer profile. Many online stores do this effectively; others do not. Capturing the signals associated with a customer’s search is a non-trivial task. To capture them you need hooks at both the site and search level, and integrating signals at the search level is a fair amount of code.

Getting Your Data Feeds

Getting data can be hard. In an ideal world, you buy a connector rather than write and maintain one. Even if the source has a database connector or a REST or web services API, what do you do to normalize the data? Plus, data frequently needs to be massaged and enriched with other data. To do that you need to organize the data through pipelines, in order to both manage change and provide an operational way to configure the manipulation of data.
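Just to make the idea concrete, here’s a minimal sketch of the kind of normalization stage such a pipeline might run before indexing. The field names and the helper are made up for illustration; this is not Fusion’s actual pipeline API.

```python
# A minimal sketch of a normalization/enrichment stage a pipeline might run.
# The field names and this helper are hypothetical, not a real Fusion API.

def normalize_product(raw: dict) -> dict:
    """Map a raw feed record onto a consistent schema before indexing."""
    return {
        "id": str(raw.get("sku") or raw.get("product_id")),
        "title": (raw.get("name") or "").strip(),
        "price": float(raw.get("price", 0) or 0),
        # Collapse inconsistent category labels from different source systems.
        "category": (raw.get("category") or raw.get("dept") or "uncategorized").lower(),
        "in_stock": bool(raw.get("inventory", 0)),
    }

# Example: two records from different legacy systems end up with the same shape.
feed = [
    {"sku": 123, "name": " Wool Socks ", "price": "9.99", "dept": "Apparel", "inventory": 42},
    {"product_id": "A-9", "name": "HDMI Cable", "price": 7.5, "category": "Electronics"},
]
print([normalize_product(r) for r in feed])
```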

Data should be indexed into a flexible schema and combined with other sources of product data. The correct characteristics for one type of product (color, shape, standard, plug type) might not be the same for another (color, wrist size, analog or digital). Customers need to be able to facet their search based on effective categorization (electronics, televisions, screen size vs. clothing, dress size, etc.). Moreover, effective characteristics need to match your customer profiling effort for effective targeting. Oftentimes this overlaps with search; sometimes it is more specific (bargain shopper, season, closeout, priced to go, premium item).

Getting the data into the search index in a timely manner without affecting system operations can be a major effort unto itself. You need a system that is flexible, operationally efficient and manageable, not to mention scalable.

Relevance

Tuning relevance is both an art and a science. The techniques and technologies for both are forever evolving. At one time simple keyword search was enough. Tuning relevant results today takes into account context and history. Relevance means tuning the search to both product and customer characteristics and plugging in effective algorithms. Oftentimes this means correcting terrible spelling and determining intent from fractional information with a lot of noise. Increasingly this even involves voice search, so users can speak their queries directly into a device.

Targeting and Personalizing for On-line Retail Customers

Search and targeting are now personalized: a customer whose favorite color is blue might see blue shirts boosted over red ones. A customer who bought a TV might see HDMI cables advertised. For goodness sake, don’t show a customer who just bought men’s dress shoes and a jacket for the first time more men’s dress shoes. Increasingly, companies are using predictive models, from neural networks to simple clustering, to extract and apply these recommendations. It is said there is No Free Lunch in picking these systems, and the best retailers spend a lot of time tuning them. A system needs to accommodate change and be flexible enough to plug in new algorithms over time.

Timeliness

Timeliness matters, and not just in making sure a customer sees winter clothes from mid-fall to mid-winter. Timeliness means that the data is fresh: boost the things that are readily in stock first, and keep that information up to date. The index needs to be up to date in real time, or close enough for the business cycle. Search results need to return instantly to meet today’s expectations. Time is money, and every nickel or millisecond matters.

The best retail sites anticipate customer needs, target customers with relevant content before they even ask, personalize and contextualize search results, and pull together timely information. Doing this requires pulling in data from multiple sources, developing a smart customer profile, and identifying the right characteristics of products as well as tracking signals and other customer usage information. Keeping this up and running means doing it in an operationally intelligent and change-tolerant way. It is a lot of work; we’ve done the bulk of it for you and called it Lucidworks Fusion.

Fusion

At Lucidworks, we’ve spent a lot of time helping our customers tune their search, targeting and personalization. This includes some of the largest online retailers like Bluestem, Staples, and B&H Photo. It also includes brands that sell through offline channels but maintain an online presence. We’ve developed a solution called Lucidworks Fusion that is designed to capture customer signals, integrate pipelines, and connect data from various sources and make it easy to tune and manage your search and indexing solution in an operationally efficient way. Learn more about our solution here.


The post Google/Amazon-like Search for Your Retail Site appeared first on Lucidworks.com.

Using Word2Vec in Fusion for Better Search Results


Introduction

The power to have a computer process information the same way a human does has countless applications, but the area in which it is possibly most important is search. The ability to interface with technology in a natural and intuitive way increases search traffic, facilitates ease of querying and, most importantly, gives people what they want… accurate results! But, while it’s clear that Natural Language Processing (NLP) is crucial for good search, actually teaching a computer to understand and mimic natural speech is far easier said than done.

That previous sentence itself is a great example of why that is. The human brain is an efficient language processing engine because humans have access to years of context for all types of speech. Human brains can call upon this context to identify, interpret, and respond to complicated speech patterns including jokes, sarcasm and, as in the case of the previous paragraph, bad puns. Computers, on the other hand, have no inherent context and so must be taught which words are associated with which others and what those associations imply. Simply encoding all types of speech patterns into a computer, however, is a virtually endless process. We have had to develop a better way to teach a computer to talk and read like a human.

Enter Word2Vec [1]. Word2Vec is a powerful modeling technique commonly used in natural language processing. Standard Word2Vec uses a shallow neural network [2] to teach a computer which words are “close to” other words and in this way teaches context by exploiting locality. The algorithm proceeds in two steps, Training and Learning. A rough overview of these steps is as follows.

The Algorithm

Training

To train a computer to read we must first give it something to read from. The first step of training the Word2Vec model supplies the algorithm with a corpus of text. It’s important to note that the computer will only be learning context from the input text that we provide it. So, if we provide an obscure training text the model will learn obscure and possibly outdated associations. It is very important that we supply the model with a comprehensive and modern corpus of text.

Each word in the input text gets converted to a position vector, which corresponds to the word’s location in the input text. In this way, the entire training set gets converted to a high-dimensional vector space. Below is a trivial example of what a 2D projection of this vector space might look like for the sentences “Bob is a search engineer. Bob likes search” (removing the stopwords [3] ‘is’ and ‘a’ for simplicity).

 

Figure 1: Diagram of the initial vector space for “Bob search engineer Bob likes search”.

This vector space then gets passed to the learning step of the algorithm.

Learning

Now that we have converted our text into a vector space, we can start learning which words are contextually similar in our input text.

To do this, the algorithm uses the corpus of text to determine which words occur close to one another. For example, in the above example of “Bob is a search engineer. Bob likes search” the words “Bob” and “search” occur together twice. Since the two words occurred together more than once, it is likely they are related. To reflect this idea the algorithm will move the vector corresponding to “Bob” closer to the vector corresponding to “search” in the high-dimensional space. A 2D projection of this process is illustrated below.

Figure 2: After the learning step, BOB and SEARCH have moved closer to each other in the vector space.

In this way Word2Vec tweaks “nearby words” to have “nearby vectors”, and so each word in the corpus of text moves closer to the words it co-occurs with.
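If you want to get a feel for this “move co-occurring words closer” idea, here is a toy sketch. It is not the real Word2Vec update rule (which comes from training the network’s weights); it just nudges one vector toward another each time a pair co-occurs, which is the intuition Figures 1 and 2 illustrate.

```python
import numpy as np

# Toy illustration only: the real Word2Vec update comes from backpropagation,
# but the intuition is "nudge co-occurring words' vectors toward each other".
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=2) for w in ["bob", "search", "engineer", "likes"]}

def nudge(word, context_word, lr=0.1):
    """Move `word`'s vector a small step toward `context_word`'s vector."""
    vectors[word] += lr * (vectors[context_word] - vectors[word])

pairs = [("bob", "search"), ("bob", "engineer"), ("bob", "search"), ("bob", "likes")]
before = np.linalg.norm(vectors["bob"] - vectors["search"])
for w, c in pairs:
    nudge(w, c)
after = np.linalg.norm(vectors["bob"] - vectors["search"])
print(f"distance bob-search: before={before:.3f}, after={after:.3f}")
```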

Word2Vec can use one of two algorithms to learn context: Continuous Bag of Words [4] (or CBOW) and skip-gram [5]. CBOW seeks to predict a target word from its context, while skip-gram seeks to predict the context given a target word.

Both algorithms employ this vector-moving strategy. The slight distinction between the two is the way in which the vectors are moved. In CBOW we seek to predict a word from its context, and so we consider the entire corpus of text in “windows” of a single word and its corresponding context vectors. In the case of “Bob is a search engineer. Bob likes search,” we would be considering something like the figure below.

Figure 3: For n = 2, the words that constitute the context of the word ENGINEER: BOB and SEARCH on the left, BOB and LIKES on the right.

 

 

 

Each of the context word vectors serves as the input to a two-layer neural network. The single target word (“engineer” in this example) serves as the intended output layer. The internal hidden layer of the network corresponds to the manipulations occurring in the vector space to push each of the context vectors nearer to the target word.

Figure 4: The CBOW neural network takes the context words, BOB, BOB, LIKES and SEARCH, as input. It averages and modifies them to produce the Weight Transformation Vector, which produces the desired target word, ENGINEER.

There is one disadvantage to this method. We are taking the context for each given word to be the n words surrounding it. Since each word gets considered a context word multiple times, we are effectively averaging the context words in the hidden layer of the neural network and then moving the target vector closer to this average. This may unwittingly cause us to lose some distributional information about the space.

The alternative technique is skip-gram, which does effectively the opposite of CBOW. Given a specific word, skip-gram strives to discover the associated context. The two-layer neural network for skip-gram uses the target word as input and trains the weights such that the nearest vectors are the context of that word.

Figure 5: The skip-gram neural network takes a single input word, ENGINEER. This word gets transformed to produce the Weight Transformation Vector, which in turn indicates the context word vectors for that word: BOB, BOB, LIKES and SEARCH.

In this case there is no need to average the context vectors. The hidden layer simply maps the target word to the n context words. So, instead of moving the target towards the average of the context vectors (as we do in CBOW), we simply move all of the context vectors closer to the target. A 2D projection of these two techniques is below, with ENGINEER as the target word and BOB, SEARCH and LIKES as the context words.

 

Figure 6: Diagram of how the CBOW vectors move. We compute the resultant vector for the three context vectors, BOB (red), SEARCH (blue) and LIKES (green). This resultant vector is the dotted black arrow. The ENGINEER (orange) vector moves towards this resultant as denoted by the Movement Vector (pink).

Figure 7: Diagram of how the skip-gram vectors move. The three context vectors, BOB (red), SEARCH (blue) and LIKES (green), all move towards the ENGINEER (orange) vector as denoted by the Movement Vectors (purple).


Using in Lucidworks Fusion

Word2Vec has huge applications for automatic synonym and category discovery in large unlabeled datasets. This makes it a useful tool to supplement the existing textual analysis capabilities of Lucidworks Fusion [6]. Our implementation leverages Apache Spark’s ML library, which has a convenient implementation of the skip-gram technique [7].

We will be analyzing the Apache projects [8] for our input corpus of text. Employing this technique in Lucidworks Fusion has further complications because each mailing list or project configuration coming from Apache is split into ‘documents’ in Fusion. This means generated synonyms will be for the entire corpus of text and not necessarily for the words in each document. To circumvent this problem we train the model on the entire set of documents. We then employ TF-IDF on each individual document to determine which words in the document are the “most important” and query the Word2Vec model for synonyms of those important terms. This effectively generates a topics list for each document in Fusion.
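For readers who want to see what the Spark side of this looks like, here is a minimal sketch using Spark ML’s Word2Vec (which implements skip-gram). The DataFrame contents and field names are made up, and this is not the actual Fusion plugin code.

```python
# Minimal sketch of the approach described above using Spark ML's Word2Vec
# (skip-gram). Document contents and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, Word2Vec

spark = SparkSession.builder.appName("w2v-synonyms").getOrCreate()

# One row per Fusion document; `body` holds the mailing-list / project text.
docs = spark.createDataFrame(
    [("doc1", "kerberos authentication failed for the hadoop principal"),
     ("doc2", "renew the kerberos ticket before the principal expires")],
    ["id", "body"],
)

tokens = Tokenizer(inputCol="body", outputCol="raw_tokens").transform(docs)
cleaned = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens").transform(tokens)

# Train on the whole corpus, as described above.
model = Word2Vec(vectorSize=50, minCount=0, inputCol="tokens", outputCol="vec").fit(cleaned)

# For each document's "most important" terms (found separately via TF-IDF),
# ask the model for nearby terms; here we just query one term directly.
model.findSynonyms("kerberos", 5).show()
```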

Our preliminary results were subpar because our initial pass through the data employed only minimal stopword removal. In Word2Vec, if certain words are frequently followed or preceded by stop words, the algorithm will treat those stop words as important pieces of context, whereas in reality they do not provide valuable information.
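The fix was simply a more comprehensive stop word list. Here is a hedged sketch of what that can look like with Spark ML; the extra corpus-specific terms below are illustrative examples, not the actual list we used.

```python
# Sketch: start from Spark's default English stop words and add
# corpus-specific noise terms (the extra terms here are illustrative only).
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.appName("stopwords").getOrCreate()

stopwords = StopWordsRemover.loadDefaultStopWords("english")
stopwords += ["re", "fwd", "wrote", "http", "org", "apache"]

remover = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens", stopWords=stopwords)
```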

After more comprehensive stopword removal we began to see much more reasonable results. For example, for an email thread discussing Kerberos [9] (an authentication protocol), the results were as follows.

[Screenshot: related terms returned for the Kerberos email thread]

These are accurate and relevant synonyms to the subject of the email.

Not all the associations are completely accurate however, as evidenced by the next example. The source was an email concerning JIRA issues.

[Screenshot: related terms returned for the JIRA email thread]

For some context, Alexey and Tim are two major Fusion contributors, so it makes sense that they should have high association with a JIRA-related email chain. Atlassian, administrators, fileitems, tutorials and architecture are also all reasonable terms to be related to the issue. Donuts, however, is almost completely random (unless Alexey and Tim are bigger fans of donuts than I realized).

Conclusion

If you want to explore our Word2Vec example for yourself, we have implemented Word2Vec as a Fusion plugin in Lucidworks’ demo site, Searchhub [10].

You can also play around with the plugin on your own data. Just download the plugin, crawl some data, train the model by running the Word2Vec Spark job, and recrawl the data to see the terms related to the “most important terms” for each document. These fields can then be searched against to offer query expansion capabilities, or faceted upon to automatically ‘tag’ each document with its general description.

Word2Vec offers a powerful way to classify and understand both documents and queries. By leveraging the power of Apache Spark ML Word2Vec within Fusion we at Lucidworks are able to offer you this unique insight into your data.

The post Using Word2Vec in Fusion for Better Search Results appeared first on Lucidworks.com.

Holiday Viewing: Revolution 2016 Videos


We had an amazing time with the Solr community in Boston last month for Lucene/Solr Revolution 2016. Now that we’ve come down from the excitement and craziness of this year’s event, we are busy planning for next year.

We will have some big news coming your way soon about a date and location for 2017, but in the mean time, we want to make sure you have a chance to check out the content that you may have missed (or want to watch again) from this year’s show.

What better time than a long holiday weekend to amp up your Solr knowledge?! In case you need a break from the in-laws, an excuse to skip that flag football game, or to get out of an uncomfortable discussion about the news at the Thanksgiving table, we are here for you!

Check out photos from this year’s event, watch videos of keynote and breakout presentations, and view or download presentation slides.

Although it may be easy to spend hours and hours catching up on session videos, we hope you make some time to relax, enjoy, and spend time with loved ones this holiday season!

 

The post Holiday Viewing: Revolution 2016 Videos appeared first on Lucidworks.com.

Top Trends in Search for 2017


Search is changing.

For the past few years search has become much more relevant. There was a time when savvy users had to scroll past the first few results habitually, when composing the right query was as much of an art as it was about learning a secret parser grammar. Now, users are accustomed to looking at the first 3 results at most and generally just click on the first result. Search has gotten smarter and users have become accustomed to those smarts.

Some of these smarts have come because search has become much more personal. Search that captures the signals a user gives about their preferences and tunes future results accordingly tends to produce highly relevant results. For example, in online retail, when I search for shoes I expect it to bring up men’s shoes first because I’ve been searching on men’s clothing. I expect products that come in my size to be boosted to the top.

That’s where we are today. As we look ahead to 2017, what trends can we expect?

Search will drive big data


The last few years of Big Data have largely been a freakshow compared to what is on the horizon. “Let’s create a data lake!” — an unindexed dump of data on a distributed filesystem… then load every bit of it into memory every time you want to analyze it. Sure, your analytics might be fast now with Spark, but you still have to wait on I/O. Having the data indexed is the ticket.

Search will become more anticipatory

At Lucidworks we’ve long explained that search is more than the search box. Using signals, personalization, and new techniques in machine learning, search will start to bring you results before you ask for them. Think of going to a webpage, about to click in a search box, and realizing that what you want is already staring you right in the face. There are many techniques for this. A tried-and-true method involves using a simple clustering algorithm to look at what similar users have searched on (or purchased) and boosting those results for similar users.
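Here is a toy sketch of that clustering idea, with made-up interaction data and scikit-learn’s KMeans standing in for whatever model you would actually use: group users by their interaction history, then boost the items their cluster engages with most.

```python
# Toy sketch of "cluster similar users, boost what their cluster engages with".
# The interaction matrix is made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = items; values = clicks/purchases per item.
interactions = np.array([
    [5, 0, 1, 0],   # user 0
    [4, 1, 0, 0],   # user 1
    [0, 0, 3, 4],   # user 2
    [1, 0, 4, 5],   # user 3
])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(interactions)

def boosts_for(user_idx, top_n=2):
    """Boost the items most popular within the user's cluster."""
    peers = interactions[clusters == clusters[user_idx]]
    popularity = peers.sum(axis=0)
    return np.argsort(popularity)[::-1][:top_n]

print("Boost items for user 0:", boosts_for(0))
```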

Search will become more conversational


Google “What is the distance to Mars” and you get an answer. That answer isn’t produced by a command but by a specialized form of search: a boosted result and a special UI template. It looks almost like Google composed something for you. This is really just the start of more to come. In 2017, we will start to see this become ubiquitous even in enterprise search. We will start to see grammatical forms beyond “Who is, what is, how do I, etc.” Users will also come to expect this. Gone are the days of carefully crafted keywords and strange queries like +mars +“distance from earth.” Search is a query, a query is a question, and the result is an answer. In talking to a person you don’t expect them to answer such a question with a list of webpages, just an answer. Why should it be different just because you asked a machine?

Search will continue to become speech


Whether it is your phone, the integrated voice search on Google or in the enterprise, or the microphone button on your favorite e-commerce site, search will continue to become speech. In 2017 this will accelerate.

Search will become even more ubiquitous

Whether it is your smartphone, your smart TV, your Amazon Fire Stick or your Amazon Echo, voice search is popping up everywhere and search will continue to become a bigger part of your life. Long gone are the days of the unanswered question that you forget on the long drive home.

I’m looking forward to 2017, the Year of Search: the year we index big data, have our customer needs anticipated, and just ask the questions we need answered using our voice, anywhere we damn well want to. If you’re interested in getting a jump on the new year with more relevant search results, conversational search, and voice search, I suggest checking out Lucidworks Fusion, our Solr-powered platform for building and deploying powerful search apps.

The post Top Trends in Search for 2017 appeared first on Lucidworks.com.

11 Common Mistakes in Search


Your company or organization is energized. Finally you’re going to provide an enterprise search solution for your organization or enhance your online retail customer experience! You get everyone on board, make some technology decisions, deploy a solution and thud… it doesn’t work. Where did you go wrong? Here is where others have failed. Learn from their misery.

  1. Bad schema design – Just like a relational database or any NoSQL database, your search solution requires some thought on how you represent an entry (aka document). Bungle this and you end up with sub-optimal search performance or an index process that takes too long to complete.
  2. Inadequate resource planning – That old desktop in the corner of your cube makes a fine footrest. Its Itanium processor was ahead of its time and totally unappreciated. However great it is at alleviating your back problems or restless leg syndrome, it probably won’t on its own be an adequate host for an enterprise search solution. Complicated queries or calculations may require more.
  3. Inadequate scalability testing – You bought some hardware, ingested some data, designed a UI and turned your users loose on it… It sank. Queries started returning in minutes, “connection refused” errors and you start thinking that maybe your life is just not what you want it to be. With some testing, you’d have realized the combination of your schema design, hardware choices and use cases don’t work.
  4. Returning too much data – Whether you stuck a big bag of everything in your bad schema design as a “just in case” and called it a “grab-bag pattern” or designed dumb queries that return more than you need, remember less is more and The Buffet Rule: You can always go back for more. Deceptively this works fine until you get a few more users putting load on the system.
  5. Not using compatible index and query-time analyzers – The most common example is stemming on the index side but not on the query side (or vice versa), which appears to work… until it doesn’t: you start searching on differently stemmed words and nothing comes back when it should. (A toy illustration follows this list.)
  6. Not planning for and testing relevance – A big fallacy of big data is that you can find an answer without any idea what you’re looking for. The Rolling Stones said you MIGHT get what you need, but you do have to know what you want! This means understanding what your users are looking for and testing that they get it before you roll the whole thing out. See how Salesforce does this with subsequent releases. 
  7. Not planning for HA and DR – High Availability and Disaster Recovery aren’t the hottest buzzwords anymore but good gosh having your service constantly available and planning for a fiber cut or lost data center is like remembering to buy food. You just need to do it. (see Solr Cloud and Cross Data Center Replication)
  8. Not capturing signals from the start – Or capturing incomplete signal data, e.g. a click event that doesn’t include info about where the clicked-on doc occurred in the ranked results. Too often, user interaction with search results is an afterthought and then you have to piece together an incomplete story from query logs.
  9. Inadequate KPIs – Whether it is performance tuning or relevance, you need to have goals. To know when you’re done tuning you need to measure those goals. Fail on either side of that and you won’t even know if you’re failing.
  10. Using a technology not proven to reliably scale – We get it. You read that this one search technology is the hot shizzle. You went to their conference and even heard about someone using it on a big project. This is all fine and well until you have a split brain problem. Maybe your company used a client-server solution in the past or something based on RDBMS technology and now you find that your data and search requirements exceed its capabilities.
  11. Rolling your own – In any major or even many minor search projects you have: a data ingestion function that has to connect to some data sources and transform the data appropriately; a server/search management and monitoring process; access control; a UI; a query process that may need to use more than one datasource or collection. Writing all of these pieces is a lot of work and cost. Next there is the ongoing cost of maintaining all of that. There is consequently no reason to. Use a product written to manage all of that for you.
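Here is the toy illustration promised in item 5, showing how an index/query analyzer mismatch silently loses matches. NLTK’s Porter stemmer stands in for a Solr stemming filter, and the “index” is just a Python set.

```python
# Toy illustration of item 5: stemming applied at index time but not at query time.
# NLTK's Porter stemmer stands in for a Solr stemming filter.
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

# Index-time analysis: terms are stemmed before being stored.
indexed_terms = {stem(t) for t in "the engineer was running nightly indexing jobs".split()}

# Query-time analysis: no stemming (the mismatch).
query_term = "running"
print(query_term in indexed_terms)        # False -- "running" was indexed as "run"
print(stem(query_term) in indexed_terms)  # True  -- matching analyzers fix it
```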

That’s an accountant’s dozen ways you can blow your next search project that aren’t specific to any particular technology. Which is your favorite?

The post 11 Common Mistakes in Search appeared first on Lucidworks.com.

Analyzing Enron with Solr, Spark, and Fusion


At Lucene/Solr Revolution this year I taught a course on using Big Data technologies with Apache Solr and by extension Lucidworks Fusion. As part of the course we ingested the Enron corpus as provided by Carnegie Mellon University. The corpus consists of a large number of emails, some poorly formed, from the early part of the last decade that were part of the legal discovery phase of the investigation into the energy company's flameout in 2001. It is a pretty substantial piece of unstructured data at 423MB compressed.

During the course of this article, we’re going to index the Enron corpus, search it with Solr and then perform sentiment analysis on the results. Feel free to follow along, grab a copy of the Enron corpus and untar it into a directory on your drive. You can grab the referenced code from my repository here.

Why Spark and Solr?

Why would you want to use Solr if you're already using Spark? Consider this: If you've got a substantial amount of stored data that you're going to perform analytics on, do you really want to load ALL of that data in memory just to find the subset you want to work on? Wouldn't it make more sense to load only the data you need from storage? To do this, you need smarter storage! You can index the data as you store it, or in batches sometime afterwards. Otherwise, everyone performing the same analysis is probably running the same useless "find it" process before they actually start to analyze it.

Why would you want to use Spark if you're already using Solr? Solr is amazing at what it does, finding needles or small haystacks inside of a big barn full of hay. It can find the data you need to match the query you provide it. Solr doesn't, however, do machine learning by itself, nor does it have all of the analytics you might want to run. When the answer to your query needs to be derived from a massive amount of data, you need a really fast distributed processing engine – like Spark – to do the algorithmic manipulation.

Lucidworks Fusion actually gets its name from “fusing” Spark and Solr together into one solution. In deploying Solr solutions for customers, Lucidworks discovered that Spark was a great way to augment the capabilities of Solr for everything from machine learning to distributing the processing of index pipelines.

Ingesting

Back in 2013, Erik Hatcher showed you a way you could ingest emails and other entities using Solr and a fair amount of monkey code. At Lucene/Solr Revolution, given that my mission wasn't to cover ingestion but using Spark with Solr, I just used Fusion to ingest the data into Solr. The process of ingesting the Enron emails with Fusion is easy, and the Fusion UI makes it dead simple.

  1. Download, Install and Run Lucidworks Fusion.
  2. Go to http://localhost:8764 and complete the fusion setup process. Skip the Quickstart tutorial.


Note: screenshots are from the upcoming Fusion 3.x release; this has been tested with 2.4.x, but the look and feel will be slightly different.


  3. Navigate to the Fusion collection screen (on 2.4 this is the first screen that comes up, on 3.x click devops)


  4. Click New collection, type enron and click save.


  5. Click on the newly created “enron” collection, then click “datasources”.


  6. Create a new Local Filesystem Datasource


  7. Call the datasource enron-data then scroll down


  8. Expand the “StartLinks” and put in the full path to where you unzipped the enron corpus (root directory), then scroll or expand “limit documents”


  9. Up the maximum filesize; I just added a 0. Some of the files in the Enron corpus are large. Scroll down more.


  10. Save the datasource


  11. Click Start Crawl


  12. Click Job History and select your job


  13. Watch the pot boil; this will take a bit if you're doing it on your laptop.

Some of the emails will be skipped due to poor-formedness, lack of content, or other reasons. In a real project, we'd adjust the pipeline a bit. Either way, note that we got this far without touching a line of code, the command line, or really doing any heavy lifting!

Spark / Spark-Solr

If you’re using Solr, my colleague Tim, showed you how to get and build the spark-solr repository back in August of last year.

If you use Fusion, Spark-Solr connectivity and Spark itself are provided. You can find them in the $FUSION_HOME/apps/libs/spark-solr-2.2.2.jar and the spark-shell in $FUSION_HOME/bin. To launch the Spark console you type: bin/spark-shell

Scala Briefly

Scala is the language that Spark is written in. It is a type-safe functional language that runs on the Java Virtual Machine. The syntax is not too alien to people familiar with Java or any C-like language. There are some noticeable differences.

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}

Spark also supports Python and other languages (you can even write in functional Java). Since Spark is written in Scala, Scala always has the best support, and there are performance penalties for most other languages. There are great reasons to use languages such as Python and R for statistics and analytics, such as various libraries and developer familiarity; however, a Spark developer should at least be familiar with Scala.

In this article we’ll give examples in Scala.

Connecting Spark to Solr

Again, launch the Spark shell that comes with Fusion by typing bin/spark-shell. Now we need to specify the collection, query, and fields we want returned. We do this in an “options” variable. Next we read a dataframe from Solr and load the data. See the example below:

val options = Map(
  "collection" -> "enron",
  "query" -> "content_txt:*FERC*",
  "fields" -> "subject,Message_From,Message_To,content_txt"
)
val dfr = sqlContext.read.format("solr");
val df = dfr.options(options).load;

If you’re not using Fusion, you’ll need to specify the zkHost and you may also need to add the spark-data library and its dependencies with a –jars argument to spark-shell. In Fusion, we’ve already handled this for you.

Poor Man’s Sentiment Analysis

Provided in the repository mentioned above is a simple script called sentiment.scala. We demonstrate a connection to Solr and a basic script to calculate sentiment. Sentiment analysis on emails can be done for any number of purposes: maybe you want to find out general employee morale; or to track customer reactions to particular staff, topics or policies; or maybe just because it is fun!

We need to do something with our DataFrame. I've always had a soft spot for sentiment analysis because what could be better than mathematically calculating people's feelings (or at least the ones they're trying to convey in their textual communication)? There are different algorithms for this. The most basic is to take all the words, remove stop-words (a, the, it, he, she, etc), assign the remaining words a value that is negative, positive, or neutral (say -5 to +5), add up the values and average them. There are problems with this: for example, "This is not a terrible way to work" comes out pretty negative when it is actually a positive to neutral sentiment.

To really do this, you need more sophisticated machine learning techniques as well as natural language processing. We’re just going to do it the dumb way. So how do we know what is positive or negative? Someone already cataloged that for us in the AFINN database. It is a textfile with 2477 words ranked with a number between -5 and +5. There are alternative word lists but we’ll use this one.

Based on this, we’ll loop through every word and add them up then divide the words by the number of words. That’s our average.

Understanding the Code

abandon -2
abandoned       -2
abandons        -2
abducted        -2
abduction       -2
...

AFINN-111.txt first 5 lines

The first step is to get our afinn datafile as a map of words to values.

We define our bone-headed simplistic algorithm as follows:

import scala.io.Source

val defaultFile = "/AFINN/AFINN-111.txt"
val in = getClass.getResourceAsStream(defaultFile)
val alphaRegex = "[^a-zA-Z\\s]".r
val redundantWhitespaceRegex = "[\\s]{2,}".r
val whitespaceRegex = "\\s".r
val words = collection.mutable.Map[String, Int]().withDefaultValue(0)
for (line <- Source.fromInputStream(in).getLines()) {
  val parsed = line.split("\\t")
  words += (parsed(0) -> parsed(1).toInt)
}

Sentiment.scala extract

Next we connect to Solr, query all messages in our “enron” collection related to the Federal Energy Regulatory Commission (FERC) and extract the relevant fields:

val options = Map(
  "collection" -> "enron",
  "zkhost" -> "localhost:9983",
  "query" -> "content_txt:*FERC*",
  "fields" -> "subject,Message_From,Message_To,content_txt"
)
val dfr = sqlContext.read.format("solr");
val df = dfr.options(options).load;

Sentiment.scala extract

The idea is to get a collection of from-addresses with the AFINN score. To do that we have to create a few more collections and then normalize the score by dividing by the number of emails mentioning our subject, "FERC."

val peopleEmails = collection.mutable.Map[String, Int]().withDefaultValue(0)
val peopleAfins = collection.mutable.Map[String, Float]().withDefaultValue(0)

 def peoplesEmails(email: String, sentiment: Float) = {
  var peopleEmail: Int = peopleEmails(email);
  var peopleAfin: Float = peopleAfins(email);
  peopleEmail += 1;
  peopleAfin += sentiment;
  peopleEmails.put(email, peopleEmail);
  peopleAfins.put(email, peopleAfin);
 }

 def normalize(email: String): Float = {
   var score: Float = peopleAfins(email);
   var mails: Int = peopleEmails(email);
   var retVal : Float = score / mails
   return retVal
 }

sentiment.scala extract
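One thing these extracts don't show is the tokenize and sentiment helpers that the final loop below relies on. Here is a minimal sketch of what they might look like, assuming they follow the word-scoring approach described earlier; the names match the calls below, but the bodies here are illustrative rather than copied from the repository:

def tokenize(content: String): Array[String] = {
  // strip non-letters, collapse whitespace, lowercase, then split into words
  val lettersOnly = alphaRegex.replaceAllIn(content, "")
  val collapsed = redundantWhitespaceRegex.replaceAllIn(lettersOnly, " ")
  whitespaceRegex.split(collapsed.trim.toLowerCase)
}

def sentiment(tokens: Array[String]): Float = {
  // sum the AFINN value of every token (unknown words default to 0) and average
  if (tokens.isEmpty) return 0f
  tokens.map(words(_)).sum.toFloat / tokens.length
}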

Finally, we run our algorithms and print the result:

df.collect().foreach(t => peoplesEmails(t.getString(1), sentiment(tokenize(t.getString(3)))))
for ((k,v) <- peopleEmails) println( ""+ k + ":" +  normalize(k))

Sentiment.scala extract

The result looks like several lines of

ken@enron.com -0.0075

Because corporate emails of the time looked more like memos than the more direct communications of today, most of the sentiment will be pretty close to neutral. "FERC is doing this" rather than "Those jerks at FERC are about to freaking subpoena all of our emails and I hope to heck they don't end up in the public record!!!"

Next Steps

Obviously a list of addresses and numbers is kind of just the start. I mean, you might as well visualize the data.

As fun as that may have been and a decent way to demonstrate connecting to Solr/Fusion with Spark, it might not be the best way to do sentiment analysis. Why not just analyze sentiment while you index and store the score in the index itself? You could still use Spark, but rather than analyze a piece of mail every time you query, do it once.

You can also use a more sophisticated machine learning technique already included with Spark’s mllib. See how to do this in the Fusion documentation.

Moreover, we are looking at the sentiment in the data; what about your users' sentiment in their conversational searches? What about other contextual information about their behavior, i.e. signals? There are a lot of opportunities to use the power of Spark inside of Fusion to produce more powerful analytics and relevant information.

The post Analyzing Enron with Solr, Spark, and Fusion appeared first on Lucidworks.com.


2016: The Year In Review


It has been another whirlwind year in the world of search here at Lucidworks!

First a few customer successes from the field:

How Infoblox Moved from Google Search Appliance to Lucidworks Fusion in Just 5 Weeks

When GSA got EOLed, it left everyone in a lurch. Network security company Infoblox needed a replacement – and fast. They cut their licensing costs in half and dev time from months to weeks with Lucidworks Fusion. Full case study.

And as you spike your nog and brave the snow (or make a getaway to the beach), let’s do a quick Casey Kasem countdown of the top search stories and blog posts of 2016:

Better Feature Engineering with Spark, Solr, and Lucene Analyzers

Steve Rowe’s post on how to use the spark-solr open source toolkit to break down full text into words (tokens) using Apache Lucene’s text analysis framework.

Solr Suggesters and Fusion for Less Typing

Mitzi Morris walked us through implementing Solr suggester components in a Fusion query pipeline to provide the auto-complete functionality end users expect.

What’s New in Apache Solr 6

The Apache Solr community celebrated the release of Solr 6 which included improved defaults for similarity, better date-time formatting, support for executing parallel SQL queries across collections, new streaming capabilities and more. Read the full rundown or watch the video of Cassandra Targett’s webinar.

RIP Google Search Appliance

Search giant Google announced the end-of-life for their workhorse Google Search Appliance and pointed towards future enterprise products focused in the cloud. Grab a tissue and read Lucidworks CEO Will Hayes’s eulogy.

Solr’s DateRangeField – How Does It Perform?

Erick Erickson brought some clarity to Solr’s DateRangeField and how it fits more naturally with some of the ways we need to support searching dates in documents.

example/files – a Concrete Useful Domain-Specific Example of bin/post and /browse

Erik Hatcher did a three part series demonstrating how to build a simple Solr app with just a few simple commands. First getting data into Solr using bin/post, then visualizing your search results with /browse, and finally putting it together for a domain-specific example.

Solr as a SparkSQL DataSource

Timothy Potter and Kiran Chitturi showed you how to use Solr as a SparkSQL Data Source with one tutorial focused on read/query operations and a second tutorial showing you how to write data using the SparkSQL DataFrame API.

Using Word2Vec in Fusion for Better Search Results

Lasya Marla showed how to use the Word2Vec modeling technique for natural language processing with Fusion.

Revolution 2016

And of course a rollicking good time at Lucene/Solr Revolution in Boston with sessions about cognitive computing with IBM Watson, optimal findability at Salesforce, Solr highlighting at Bloomberg, and keynotes from Lucidworks CEO Will Hayes on the future of search, CTO Grant Ingersoll's all-emoji finale, and another hysterical edition of Stump the Chump. Browse the full archive of videos on YouTube and slide decks on SlideShare.

The post 2016: The Year In Review appeared first on Lucidworks.com.

Fusion Working for You: A Custom RSS Crawler JavaScript Stage


One of the most powerful features of Fusion is the built-in JavaScript stage.  However, you shouldn't really think of this stage as merely a JavaScript stage.  Fusion uses the Nashorn JavaScript engine, which means you have at your fingertips access to all the Java class libraries used in the application.  What this means is that you can effectively script your own customized Java and/or JavaScript processing stages, and really make the Indexing Pipeline work for you, the way you want it to work.

So first off, we need to get our feet wet with Nashorn JavaScript.  Nashorn (German for Rhinoceros) was released around the end of 2012.  While much faster than its predecessor 'Rhino', it also incorporated one long-desired feature: the combining of Java and JavaScript.  Though two entirely separate languages, these two have, as the saying goes, "had a date with destiny" for a long time.

 

Hello World

To start with, let us take a look at the most fundamental function in a Fusion JavaScript Stage:

function(doc){
    logger.info("Doc ID: " + doc.getId());
    return doc;
}

At the heart of things, this is the most basic function. Note that you will pass in a 'doc' argument, which will always be a PipelineDocument, and you will return that document (or, as an alternative, an array of documents, but that's another story we'll cover in a separate article).  The 'id' of this document will be the URL being crawled; and, thanks to the Tika Parser, the 'body' attribute will be the raw XML of our RSS feed.

To that end, the first thing you’ll want to do is open your Pipeline Editor and select the Apache Tika Parser.  Make sure the “Return parsed content as XML or HTML” checkbox is checked.  The Tika Parser in this case will really only be used to initially pull in the RSS XML from the feed you want to crawl.  The remainder of the processing will be handled in our custom JavaScript stage. 

Now let’s add the stage to our Indexing Pipeline.  Click “Add Stage” and select “JavaScript” stage from the menu. 

Our function will operate in two stages.  The first stage will pull the raw XML from the document and use Jsoup to parse the XML and create a java.util.ArrayList of URLs to be crawled.  The second phase will take that ArrayList, loop through and crawl each URL using Jsoup, and then spin up a CloudSolrClient to save the results.

So now that we’ve defined our processes, let’s show the work:

The overall architecture of our function will be to create two nested functions within the main function that will handle the processing.  The super-structure will look like this:

 

    function(doc){
        var processedDocs = java.util.ArrayList;
        
          var parseXML = function(doc){
               return docList;
           }
           processedDocs = parseXML(doc);

          var saveCrawledUrls = function(docList){
        
            return docList;
        }
       saveCrawledUrls(processedDocs);
       return doc;
     }

 So what is happening here is that we’re taking in the PipelineDocument, parsing the XML and pulling out the URLs, and passing that off to a separate method that will crawl the list.  One point of note:  the “processedDocs” variable declared at the top of the function is a Java ArrayList.  This is a simple example of Nashorn’s Java/JavaScript interoperability. 

 

  Parsing the XML

 var jsoupXmlParser = function(doc){
     var Jsoup = org.jsoup.Jsoup;
     var jdoc = org.jsoup.nodes.Document;
     var ex   = java.lang.Exception;
     var Parser = org.jsoup.parser.Parser;
     var element = org.jsoup.Element;
     var xmlstr = java.lang.String;
     var docs = java.util.ArrayList;
     var outdocs = java.util.ArrayList;
     var pipelineDoc = com.lucidworks.apollo.common.pipeline.PipelineDocument;
     var docurl = java.lang.String;
     var elements = org.jsoup.select.Elements;
     var extractedText = java.lang.String;
     var ele = org.jsoup.Element;
       
     
     try{
         docs = new java.util.ArrayList();
         
         xmlstr = doc.getFirstFieldValue("body");
         jdoc = Jsoup.parse(xmlstr, "", Parser.xmlParser());
         for each(element in jdoc.select("loc")) {
             docurl = element.ownText();
             if(docurl !== null && docurl !== ""){
             logger.info("Parsed URL: "+element.ownText());
             pipelineDoc = new com.lucidworks.apollo.common.pipeline.PipelineDocument(element.ownText());
             docs.add(pipelineDoc);
             }
             
          }
          
          outdocs = new java.util.ArrayList;
          // now crawl each doc in the feed
          for each(pipelineDoc in docs){
              docurl = pipelineDoc.getId();
              jdoc = Jsoup.connect(docurl).get();
              if(jdoc !== null){
                  logger.info("FOUND a JSoup document for url  "+doc.getId());
                  extractedText = new java.lang.String();
                   elements = jdoc.select("p");
                        logger.info("ITERATE OVER ELEMENTS");
                        // then parse elements and pull just the text
                        for each (ele in elements) {
                            if (ele !== null && ele.ownText() != null) {
                                    extractedText += ele.ownText();
                            }
                        }
                        pipelineDoc.addField('content_text', extractedText);
                        //pipelineDoc.addField("_raw_content_", jdoc.toString());
                        pipelineDoc.addMetadata("Content-Type", "text/html");
                        logger.info("Extracted: "+extractedText);
                        outdocs.add(pipelineDoc);
                  
              } else {
                  logger.warn("Jsoup Document was NULL **** ");
              }
          }
     }catch(ex){
         logger.error(ex);
     }
     return outdocs;
 }

 So in the above function the first step is to parse the raw XML into a Jsoup Document.  From there, we iterate over the elements found in the document (jdoc.select("loc")).  Once we have a list of URLs, we pass that on to a bit of script that loops through this list and uses Jsoup to extract all the text from the 'p' elements (jdoc.select("p")).

Once we’ve extracted the text, we spin up a new PipelineDocument and set whatever fields are relevant to our collection.  Here I’ve used “content_text,” but really you can use whatever fields you find appropriate.  Note that I’ve commented out saving the raw text.   You want to avoid putting raw text into your collection unless you have a specific need to do so.   It’s best to just extract all the critical data/metadata and discard the raw text. 

 

Saving the Results to Solr

Moving forward, now that we have our list of crawled pipeline documents, we’re going to want to save them to the Solr index.   This is done by spinning up a CloudSolrClient in our JavaScript stage, like so:

 

var solrCloudClient = function(doc){
       var client = org.apache.http.client.HttpClient;
       var cloudServer = org.apache.solr.client.solrj.impl.CloudSolrClient;
       var DefaultHttpClient = org.apache.http.impl.client.DefaultHttpClient;
       var ClientConnectionManager = org.apache.http.conn.ClientConnectionManager;
       var PoolingClientConnectionManager = org.apache.http.impl.conn.PoolingClientConnectionManager;
       var CloudSolrClient = org.apache.solr.client.solrj.impl.CloudSolrClient;
       var cm = org.apache.http.impl.conn.PoolingClientConnectionManager;
       var String = java.lang.String;
       var pdoc  = com.lucidworks.apollo.common.pipeline.PipelineDocument;
       
       var ZOOKEEPER_URL = new String("localhost:9983");
       var DEFAULT_COLLECTION = new String("cityofsacramento");
       var server = ZOOKEEPER_URL;
       var collection = DEFAULT_COLLECTION;
       var docList = java.util.ArrayList;
       var inputDoc = org.apache.solr.common.SolrInputDocument;
       var pingResp = org.apache.solr.client.solrj.response.SolrPingResponse;
       var res = org.apache.solr.client.solrj.response.UpdateResponse;
       var SolrInputDocument = org.apache.solr.common.SolrInputDocument;
       var UUID = java.util.UUID;
         
       
       try{
           // PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
            cm = new PoolingClientConnectionManager();
            client = new DefaultHttpClient(cm);
            cloudServer = new CloudSolrClient(server, client);
            cloudServer.setDefaultCollection(collection);
            logger.info("CLOUD SERVER INIT OK...");
            docList = new java.util.ArrayList();
            pingResp = cloudServer.ping();
            logger.info(pingResp);
            docList = new java.util.ArrayList();
            for each(pdoc in doc){
                inputDoc = new SolrInputDocument();
                inputDoc.addField("id", UUID.randomUUID().toString());
                inputDoc.addField("q_txt", pdoc.getFirstFieldValue("extracted_text"));
                docList.add(inputDoc);
            }
            
            logger.info(" DO SUBMIT OF "+docList.size()+" DOCUMENTS TO SOLR **** ");
            cloudServer.add(docList);
            res = cloudServer.commit();
            logger.info(res);
            
           
       }catch(ex){
           logger.error(ex);
       }
     
     return doc;
 }
    

Here you can see we make extensive use of Nashorn's Java/JavaScript interoperability. For all practical intents and purposes this is a Java class running in a JavaScript context.  Note the rather lengthy stack of declarations at the top of this method.  In any case, what we're doing here is spinning up a CloudSolrClient, iterating over our PipelineDocument ArrayList, turning the Pipeline documents into SolrInputDocuments, and then committing them as a batch to Solr.

 

Putting It All Together

function(doc){
       var parsedDocs = java.util.ArrayList;
    
      
     var jsoupXmlParser = function(doc){
     var Jsoup = org.jsoup.Jsoup;
     var jdoc = org.jsoup.nodes.Document;
     var ex   = java.lang.Exception;
     var Parser = org.jsoup.parser.Parser;
     var element = org.jsoup.Element;
     var xmlstr = java.lang.String;
     var docs = java.util.ArrayList;
     var outdocs = java.util.ArrayList;
     var pipelineDoc = com.lucidworks.apollo.common.pipeline.PipelineDocument;
     var docurl = java.lang.String;
     var elements = org.jsoup.select.Elements;
     var extractedText = java.lang.String;
     var ele = org.jsoup.Element;

     
     try{
         docs = new java.util.ArrayList();
         
         xmlstr = doc.getFirstFieldValue("body");
         jdoc = Jsoup.parse(xmlstr, "", Parser.xmlParser());
         for each(element in jdoc.select("loc")) {
             docurl = element.ownText();
             if(docurl !== null && docurl !== ""){
             logger.info("Parsed URL: "+element.ownText());
             pipelineDoc = new com.lucidworks.apollo.common.pipeline.PipelineDocument(element.ownText());
             docs.add(pipelineDoc);
             }
             
          }
          
          outdocs = new java.util.ArrayList();
          // now crawl each doc in the feed
          for each(pipelineDoc in docs){
              docurl = pipelineDoc.getId();
              jdoc = Jsoup.connect(docurl).get();
              if(jdoc !== null){
                  logger.info("FOUND a JSoup document for url  "+doc.getId());
                  extractedText = new java.lang.String();
                   elements = jdoc.select("p");
                        logger.info("ITERATE OVER ELEMENTS");
                        // then parse elements and pull just the text
                        for each (ele in elements) {
                            if (ele !== null) {
                                if (ele.ownText() !== null) {
                                    extractedText += ele.ownText();
                                }
                            }
                        }
                        pipelineDoc.addField('extracted_text', extractedText);
                        logger.info("Extracted: "+extractedText);
                        outdocs.add(pipelineDoc);
                  
              } else {
                  logger.warn("Jsoup Document was NULL **** ");
              }
          }
     }catch(ex){
         logger.error(ex);
     }
     return outdocs;
 };
 
   parsedDocs = jsoupXmlParser(doc);
   logger.info(" SUBMITTING "+parsedDocs.size()+" to solr index... ****** ");
   
 var solrCloudClient = function(doc){
       var client = org.apache.http.client.HttpClient;
       var cloudServer = org.apache.solr.client.solrj.impl.CloudSolrClient;
       var DefaultHttpClient = org.apache.http.impl.client.DefaultHttpClient;
       var ClientConnectionManager = org.apache.http.conn.ClientConnectionManager;
       var PoolingClientConnectionManager = org.apache.http.impl.conn.PoolingClientConnectionManager;
       var CloudSolrClient = org.apache.solr.client.solrj.impl.CloudSolrClient;
       var cm = org.apache.http.impl.conn.PoolingClientConnectionManager;
       var String = java.lang.String;
       var pdoc  = com.lucidworks.apollo.common.pipeline.PipelineDocument;
       
       var ZOOKEEPER_URL = new String("localhost:9983");
       var DEFAULT_COLLECTION = new String("cityofsacramento");
       var server = ZOOKEEPER_URL;
       var collection = DEFAULT_COLLECTION;
       var docList = java.util.ArrayList;
       var inputDoc = org.apache.solr.common.SolrInputDocument;
       var pingResp = org.apache.solr.client.solrj.response.SolrPingResponse;
       var res = org.apache.solr.client.solrj.response.UpdateResponse;
       var SolrInputDocument = org.apache.solr.common.SolrInputDocument;
       var UUID = java.util.UUID;
         
       
       try{
           // PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
            cm = new PoolingClientConnectionManager();
            client = new DefaultHttpClient(cm);
            cloudServer = new CloudSolrClient(server, client);
            cloudServer.setDefaultCollection(collection);
            logger.info("CLOUD SERVER INIT OK...");
            docList = new java.util.ArrayList();
            pingResp = cloudServer.ping();
            logger.info(pingResp);
            docList = new java.util.ArrayList();
            for each(pdoc in doc){
                inputDoc = new SolrInputDocument();
                inputDoc.addField("id", UUID.randomUUID().toString());
                inputDoc.addField("q_txt", pdoc.getFirstFieldValue("extracted_text"));
                docList.add(inputDoc);
            }
            
            logger.info(" DO SUBMIT OF "+docList.size()+" DOCUMENTS TO SOLR **** ");
            cloudServer.add(docList);
            res = cloudServer.commit();
            logger.info(res);
            
           
       }catch(ex){
           logger.error(ex);
       }
     
     return doc;
 };
 
 
    solrCloudClient(parsedDocs);
    logger.info("RSS CRAWL COMPLETE...");
    return doc;
}

And that’s really all there is to it.  This implementation has been tested on Fusion 2.4.2. 

The post Fusion Working for You: A Custom RSS Crawler JavaScript Stage appeared first on Lucidworks.com.

Generating a Sitemap from a Solr Index


Our clients often ask if Solr supports generating a sitemap from an existing Solr Index. While Solr has a full-featured set of APIs, these interfaces are generally geared more towards providing a generic data-management platform for your application.  Thus the short answer is: No, Solr doesn’t have a specialized API for generating sitemaps, RSS feeds, and so on.  

That said, with just a few lines of code you can create your own sitemap generator.  

For the purposes of this article, I’ve written rudimentary sitemap generators in Java, PHP and Python.  You’ll find each of these examples below.   They are all about the same length, and all pretty much do the same thing:  

   1) Query the collection (via its select handler, with q=*:*) and retrieve the data. 

   2) Spin the raw content up into a JSON object. 

   3) Iterate over the document extracting the URLs and writing them to the XML string output. 

   4) Print out the result. 

   I). Java Sitemap Example

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Iterator;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class Sitemap {

    public static void main(String[] args) {
        String url = "http://localhost:8983/solr/[MY_COLLECTION_NAME]/select?q=*%3A*&wt=json";
        StringBuffer buf = new StringBuffer();
        try {
            URL solrSite = new URL(url);

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(solrSite.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                buf.append(inputLine);
            }
            in.close();

            JSONParser parser = new JSONParser();
            JSONObject jsonObject = (JSONObject) parser.parse(buf.toString());
            JSONObject resp = (JSONObject) jsonObject.get("response");
            JSONArray docs = (JSONArray) resp.get("docs");
            Iterator<JSONObject> iter = docs.iterator();
            JSONObject doc;
            buf = new StringBuffer();
            buf.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");

            buf.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");

            while (iter.hasNext()) {
                doc = iter.next();
                buf.append("<url><loc>" + doc.get("id") + " </loc></url>");
            }

            buf.append("</urlset>");

        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println(buf.toString());
        }

    }
}

  II). PHP Sitemap Example

 
<?php header("Content-Type: text/xml");
   
   $content = file_get_contents("http://localhost:8983/solr/[MY_COLLECTION_NAME]/select?q=*%3A*&wt=json");
   $json = json_decode($content, true);
   
   $output = '<?xml version="1.0" encoding="UTF-8"?>

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

    $docs = $json["response"]["docs"];
    
    foreach($docs as $key=>$doc){
       
        $output .= "<url>";
        $output .= "<loc>" . $doc["id"] . "</loc>";
        $output .= "</url>";
        
        
    }
   
   $output .= "</urlset>";
   
   
   
   echo $output;

?>

  III). Python Sitemap Example

#!/usr/bin/env python2
#encoding: UTF-8
import urllib
import json

if __name__ == "__main__":
   link = "http://localhost:8983/solr/[MY_COLLECTION_NAME]/select?q=*%3A*&wt=json"
   f = urllib.urlopen(link)
   myfile = f.read()
   stdout = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
 
   jsonout = json.loads(myfile);
   resp = jsonout["response"]["docs"];
   for url in resp:
       #print url["id"]
       stdout += "<url><loc>" + url["id"] + " </loc></url>";
       
      
   stdout += "</urlset>"
   print stdout

 Note: In the examples above, I am simply printing out the result.  For your implementation, you will likely write the output to your site's root directory.  (You'll also want to add a rows parameter to the Solr query, since Solr returns only 10 documents by default.)  Automating this task can be accomplished with a simple Cron job.  There is a nice tutorial on creating Cron jobs here.   Also for this example, I tried to contain my imports to whatever would be available with a simple core installation of any language.  There are certainly many ways you could go about it, but this provides basic examples.  Further, I'm only setting the required 'loc' element here, using the 'id' field gathered when the document was crawled. You could extend these to include the other optional elements (e.g. lastmod, etc). 
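For example, a nightly crontab entry along these lines would regenerate the sitemap (the script path and web root here are hypothetical placeholders):

0 2 * * * /usr/bin/php /path/to/sitemap.php > /var/www/html/sitemap.xml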

 Happy Mapping! 

The post Generating a Sitemap from a Solr Index appeared first on Lucidworks.com.

Context Filtering with Solr Suggester

"Q: What did the Filter Query say to the Solr Suggester?

Introduction

The available literature on the Solr Suggester primarily centers on surface-level configuration and common use-cases. This article provides a thorough introduction to the Solr Suggester and discusses its history, design, and implementation, and even provides some comprehensive examples of its usage. 

This blog post aims to showcase the versatility of the Solr Suggester and the process of achieving context-filtered suggestions in Solr.

A Little Context…

Suppose you have a collection which is comprised of various datasources. For this example, let’s choose two of the datasources, “datasource_A” and “datasource_B”. The goal is to enable suggestions on your search application, but to only return suggestions on documents from datasource_A, excluding any and all documents from datasource_B. 

Enter “suggest.cfq”, the parameter which to some degree emulates the well-known Solr fq param. The widely-used fq parameter, however, does not filter results rendered by the suggest component. So issuing a query such as /suggest?q=do&fq=_lw_data_source_s:datasource_A is essentially equivalent to /suggest?q=do, ignoring the filter query completely. 

If you look to the Solr documentation you’ll see a note about how Context filtering (suggest.cfq) is currently only supported by AnalyzingInfixLookupFactory and BlendedInfixLookupFactory, and only when backed by a Document*Dictionary. All other implementations will return unfiltered matches as if filtering was not requested. 

May I Suggest a Solution?

In the following example, I create a Fusion collection called suggestTest and assign it two datasources, "art" and "tv". In Fusion, datasources are distinguished by a _lw_data_source_s field. After indexing documents to both datasources, I would like to enable suggestions on one of the datasources, but not the other.

I created a script which automates the process I’m about to describe – you can run the script by cloning the following repo: https://github.com/essiequoi/suggestTest.git 

STEP ONE

Make sure to set the following environment variables or else defaults will be used:

$FUSION_HOME  (ex: $HOME/Lucid/fusion/fusion2.4.3/)
$FUSION_API_BASE (ex: http://localhost:8764/api/apollo)
$SOLR_API_BASE (ex: http://localhost:8983/solr)
$FUSION_API_CREDENTIALS (ex: admin:password123)
$ZK_HOST (ex: localhost:9983)

STEP TWO

Create the suggestTest collection 

STEP THREE

Edit solrconfig.xml to enable suggestions. As mentioned previously, you have the choice of using the AnalyzingInfixLookupFactory or BlendedInfixLookupFactory as your dictionary implementation. In my example, I use the former. We will be suggesting on the title field. The contextField parameter designates the field on which you’ll be filtering. I use the _lw_data_source_s field which holds the name of Fusion datasources.
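For reference, here is a minimal sketch of what that might look like in solrconfig.xml. The suggester name, analyzer field type, and request handler defaults below are illustrative assumptions rather than copied from the repo's script; the pieces that matter for this use case are the lookup implementation, the dictionary, the field, and the contextField:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="contextField">_lw_data_source_s</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>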

STEP FOUR

Create the datasources “art” and “tv”. I use the local filesystem connector for both, but connector type is arbitrary in this example. And because I am indexing CSV files, I use the default CSV index pipeline which ships with Fusion.  I index the following documents:

art.csv

tv.csv
 

STEP FIVE

Run the datasources. Once both datasources have finished running, you should have 6 documents total in your entire collection.

STEP SIX

In the suggestTest-default query pipeline, edit the Query Solr stage to allow the /suggest requestHandler.

STEP SEVEN

Build the suggester with http://{host}:8764/api/apollo/query-pipelines/suggestTest-default/collections/suggestTest/suggest?suggest.build=true . A “0” status indicates that the suggester built successfully. 

STEP EIGHT

Query the collection (using the /suggest handler) for “ba” and observe the response. Next query for “be”, then for “ch”. You should be returned 2 suggestions for each query. 

STEP NINE

Let’s say we’d like to get suggestions for “ba” but only those generated from the “art” datasource. The following query fails: http://{host}:8764/api/apollo/query-pipelines/suggestTest-default/collections/suggestTest/suggest?suggest.q=ba&fq=_lw_data_source_s:art . As you can see, it still returns suggestions from both datasources. Using the suggest.cfq parameter, and entering the appropriate datasource as the value, gets us the correct result. 

STEP TEN

Keep searching!

 

Some Known Issues

It’s important to note that the following JIRAs exist in reference to the suggest.cfq parameter, none of which directly affect the use-case mentioned above:

SOLR-8928 – “suggest.cfq does not work with DocumentExpressionDictionaryFactory/weightExpression”

SOLR-7963 – “Suggester context filter query to accept local params query”

SOLR-7964 – “suggest.highlight=true does not work when using context filter query”

 

Conclusion

Out of the box, the Solr Suggester is capable of solving even the most specialized of use cases. With a little tweaking of your configuration files and just as much experimentation, you could open up your application to worlds of possibility. 

 

A: Con-text me some time.

The post Context Filtering with Solr Suggester appeared first on Lucidworks.com.

Extracting Values from Element Attributes using Jsoup and a JavaScript Stage

$
0
0

While Fusion comes with built-in Jsoup selector functionality, it is limited in its extraction capability.  If you want to do something like extract attribute values, in particular attribute values with special characters or empty spaces in the values, you'll need to do a custom JavaScript stage and implement the extraction there. 

To accomplish this:

1) Create a custom JavaScript stage and order it directly after the Apache Tika Parser. In the Apache Tika Parser stage, make sure that both “Return parsed content as XML or HTML” and “Return original XML and HTML instead of Tika XML Output” are checked. 

2) Add your code.   For the purposes of this article, I’ve created the following example.  Depending on what you’re trying to accomplish, your code may vary:

function(doc){

    var File = java.io.File;
    var Iterator = java.util.Iterator;
    var Jsoup = org.jsoup.Jsoup;
    var Document = org.jsoup.nodes.Document;
    var Element =  org.jsoup.nodes.Element;
    var Elements = org.jsoup.select.Elements;

    var content = doc.getFirstFieldValue("body");
    // use a separate 'jdoc' variable for the parsed Jsoup document so we
    // don't overwrite the PipelineDocument passed in as 'doc'
    var jdoc = org.jsoup.nodes.Document;
    var e = java.lang.Exception;
    var div = org.jsoup.nodes.Element;
    var img = org.jsoup.nodes.Element;
    var iter = java.util.Iterator;
    var divs = org.jsoup.select.Elements;

    try {
        jdoc = Jsoup.parse(content);
        divs = jdoc.select("div");
        iter = divs.iterator();
        div = null; // initialize our value to null
        while (iter.hasNext()) {
            div = iter.next();
            if (div.attr("id").equals("featured-img")) {
                break;
            }
        }

        if (div != null) {
            img = div.child(0);
            logger.info("SRC: " + img.attr("src"));
            logger.info("ORIG FILE: " + img.attr("data-orig-file"));
            doc.addField("post_image", img.attr("src") + " | " + img.attr("data-orig-file"));
        } else {
            logger.warn("Div was null");
        }

    } catch (e) {
        logger.error(e);
    }

    return doc;
}
          

 So let’s go ahead and break down what is happening here:

1) Declare Java classes to be used. 

                       
var File = java.io.File;
var Iterator = java.util.Iterator;
var Jsoup = org.jsoup.Jsoup;
var Document = org.jsoup.nodes.Document;
var Element =  org.jsoup.nodes.Element;
var Elements = org.jsoup.select.Elements;

2) Next, declare our JavaScript variables to be used. Note that we assign the content variable to be the content pulled by the Apache Tika Parser, and that we use a separate jdoc variable for the parsed Jsoup document so the incoming PipelineDocument (doc) isn't overwritten.

var content = doc.getFirstFieldValue("body");
var jdoc = org.jsoup.nodes.Document;
var e = java.lang.Exception;
var div = org.jsoup.nodes.Element;
var img = org.jsoup.nodes.Element;
var iter = java.util.Iterator;
var divs = org.jsoup.select.Elements;

3) Next, we pull the "div" elements out and look for one with an ID of "featured-img." Once we find it, we 'break' the iteration and move on. Note: I'm using this type of example to illustrate how to work with element attribute values that contain special characters or empty space. Jsoup's selector syntax doesn't really play well with these types of key names.

jdoc = Jsoup.parse(content); // parse the document
divs = jdoc.select("div"); // select all the 'div' elements
iter = divs.iterator(); // get an iterator for the list
while (iter.hasNext()) { // iterate over the elements
    div = iter.next();
    if (div.attr("id").equals("featured-img")) { // if we find a match, assign and move on.
        break;
    }
}

4) Finally, we set the values in the document. I’ve added some extra logging here, which can ultimately be removed.

if (div != null) {
    img = div.child(0); // get the image element
    logger.info("SRC: " + img.attr("src"));
    logger.info("ORIG FILE: " + img.attr("data-orig-file"));
    doc.addField("post_image", img.attr("src") + " | " + img.attr("data-orig-file")); // set the values in the PipelineDocument
} else {
    logger.warn("Div was null");
}

 

 

And that’s all there is to it! Happy Extracting!

The post Extracting Values from Element Attributes using Jsoup and a JavaScript Stage appeared first on Lucidworks.com.
