Category Archives: Sphinx Search Engine Integration

The Short Comings of Sphinx GEO Search

When I first got Sphinx installed I was thrilled. I had found a simple easy solution to all of my problems. I started churning out indexes left and right. One day my boss asked me “Hey can we put some geo distance information on here?” … “No problem!”. I had been reading the docs and saw how easy it was to add in geo distance and to sort by geo distance and to limit within a geo distance. Heck it was even easy to limit with in a range of distance(x with in 40 to 50 miles). Needless to say I added the columns added the 3 lines of code and poof there it was. What used to take stupid mathematical queries is poof done in 3 lines of code. At this point I was pounding the Sphinx Kool Aid, and boy did it tasted great.

A couple days later my boss pointed out to me, that we didn’t have a 1:1 relationship between our products and their locations, we had products distributed throughout the United States, and that my solution while correct(we had a featured location) only worked on a 1:1 correlation and so could i please implement a 1:N relationship. Now completely drunk on the Kool Aid my response “OF COURSE!”.
I proudly opened up my fully customized conf file and set about to change the latitude and longitude columns to be Multi-valued attributes . But before I could press the first key I realized that Latitude and Longitude need to be compared as ordered pairs, if I made them mva’s they’d be compared as n:n of latitude to longitude comparisons(I didn’t try this I know a bad idea when it see it, sometimes).

I then began to do some thinking, and then some reading, and then some more thinking and more reading. I came to the realization that addresses must be stored in their own index, not a problem I quickly create my new index. Knowing that I could run multiple queries on different indexes at the same time, I wrote up a quick sphinx query and executed. My jaw nearly hit the floor, there was a storm trooper telling me “These aren’t the results I was looking for”. I did some quick reading and find out that it is impossible in a multi query to pass results from the first query to the second query with in sphinx. Down hearted but still convinced this is possible with Sphinx, I decided I can hack it.

And Hack It I Did. I would run two queries, one would limit our products by location, the second would limit actual products by product specific constraints and I would pass in the product ids using php. This looked pretty damn ugly, slightly better than sql solution, but it was working. I had to tweak the memory usage up on the search daemon to accommodate for the larger data set, but all in all it seemed to work fast the code was straight forward and everything was in one area Sphinx. I began to fill up my cup with kool aid. A week had passed, now was the big moment the push live. We pushed live and instantaneously I saw memory issue errors over flow my error log. I said not a problem, the live server has more ram than dev, I’ll throw some machine at it. Then I saw the PHP memory errors coming. And that’s when the KoolAid went bad. As it was late, the push live had been backed up for over a month now, I patched code to stop the error messages at the expense of quality data and I started looking in svn for the old trusted distance formulas.

In this specific instance I end up using straight SQL, there were a couple joins, but performance was not horrible with our traffic load. In other instances where I needed the full text search, I did the same distance constraints and passing the specific role ids into sphinx. But there was always that pain in my neck, the jabbing in my side, knowing that this code should be cleaner.

Sphinx Configuration Inheritance

Sphinx Configuration files support inheritance between data of the same type eg source, index.
This can be a huge time saver when rolling code live or when supporting multiple indexes.
Take this configuration for example, as you can see the database information is stored multiple times.

source product {
    type = mysql
    sql_user = logan
    sql_host = localhost
    sql_pass = password
    sql_db = development
    sql_port = 3306
    
    sql_query = \
                              SELECT name, retail_location, retail_name, quantity \
                              FROM product_sphinx
    sql_ranged_throttle = 0
    sql_query_info = SELECT * FROM product_sphinx WHERE id=$id
    sql_attr_float = latitude
    sql_attr_float = longitude
}

source retail {
    type = mysql
    sql_user = logan
    sql_host = localhost
    sql_pass = password
    sql_db = development
    sql_port = 3306
    
    sql_query = \
                              SELECT id, name \
                              FROM retail_sphinx
    sql_ranged_throttle = 0
    sql_query_info = SELECT * FROM retail_sphinx WHERE id=$id
    sql_attr_multi = uint products from query; \
                                   select retail.id, product.id, FROM retail_outlet retail \
                                    JOIN retail_outlet_product product ON \
                                    product.retail_outlet.id = retail.id
}
view raw gistfile1.txt This Gist brought to you by GitHub.

Inheritance will ease updates and make your configuration file readable.
site_live {
    type = mysql
    sql_user = logan
    sql_pass = somepassword
    sql_host = localhost
    sql_port = 3306
    sql_db = live
}

site_dev {
    type = mysql
    sql_user = logan
    sql_pass = somepassword
    sql_host = localhost.dev
    sql_port = 3306
    sql_db = development
}

source product : site_dev {
    sql_query = \
                              SELECT name, retail_location, retail_name, quantity \
                              FROM product_sphinx
    sql_ranged_throttle = 0
    sql_query_info = SELECT * FROM product_sphinx WHERE id=$id
    sql_attr_float = latitude
    sql_attr_float = longitude
}

source retail : site_dev {
    
    sql_query = \
                              SELECT id, name \
                              FROM retail_sphinx
    sql_ranged_throttle = 0
    sql_query_info = SELECT * FROM retail_sphinx WHERE id=$id
    sql_attr_multi = uint products from query; \
                                   select retail.id, product.id, FROM retail_outlet retail \
                                    JOIN retail_outlet_product product ON \
                                    product.retail_outlet.id = retail.id
}
view raw gistfile1.txt This Gist brought to you by GitHub.

The two newly created sources site_live and site_dev allow for easy changing of which database credentials. “:” is the inheritance operator, so retail inherits from site_dev. To change sphinx from reading the live configuration file simple replace “retail : site_dev” to “retail : site_live”. Also any database credential changes are now made in one place.

Sphinx Installation and Setup

Sphinx is an open-source full-text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It supports multiple databases as well as files and streaming data. It has a command line implementation as well as API’s in many different languages. This guide will go over installation, setup, as well as some common problems that many first time users have while working with the LAMP stack. Read more »