<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MySolr</title>
	<atom:link href="http://mysolr.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://mysolr.com</link>
	<description>Solr/Lucene development tips</description>
	<lastBuildDate>Tue, 25 Mar 2014 22:06:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.5.33</generator>
	<item>
		<title>Occasional blank pages in Nginx + APC + PHP-FPM (or PHP-CGI) + WordPress + WP Super Cache</title>
		<link>https://mysolr.com/tips/occasional-blank-pages-in-nginx-apc-php-fpm-or-php-cgi-wordpress-wp-super-cache/</link>
		<comments>https://mysolr.com/tips/occasional-blank-pages-in-nginx-apc-php-fpm-or-php-cgi-wordpress-wp-super-cache/#respond</comments>
		<pubDate>Tue, 25 Mar 2014 22:06:08 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=223</guid>
		<description><![CDATA[This is just a quick note on something I found recently that’s quite annoying to debug.  It’s difficult to narrow down the issue because you can’t find any error logs in PHP-FPM nor PHP error log.  So here is the behavior, your site works fine for a short period after starting PHP-FPM.  Then, all of [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This is a quick note on a problem I ran into recently that is quite annoying to debug.  It is hard to narrow down because nothing shows up in the PHP-FPM log or the PHP error log.  Here is the behavior: the site works fine for a short period after PHP-FPM starts.  Then, all of a sudden, some pages come back blank.  If you wait a while and try again, some of the blank pages render again.  The problem appears to be gone, but on the next refresh the blank pages return.  Restarting PHP-FPM fixes it temporarily, but the same issue comes back some time after the restart.  The only way I found to expose the error is to set WP_DEBUG to true in wp-config.php, as follows:</p>
<pre style="padding-left: 30px;">define('WP_DEBUG', true);</pre>
<p>If you are hitting the same issue I did, you will see an error similar to the following:</p>
<pre style="padding-left: 30px;">Fatal error: Internal Zend error - Missing class information for in .../wp-content/plugins/wp-super-cache/wp-cache-base.php on line 5</pre>
<p>Here is the content of wp-cache-base.php:</p>
<pre style="padding-left: 30px;"><?php
$known_headers = array("Last-Modified", "Expires", "Content-Type", "Content-type", "X-Pingback", "ETag", "Cache-Control", "Pragma");

if (!class_exists('CacheMeta')) {

    class CacheMeta {
        var $dynamic = false;
        var $headers = array();
        var $uri = '';
        var $post = 0;
    }
}

?></pre>
<p>Line 5 points to “class CacheMeta {”.  It looks like APC can’t cope with a class that is declared conditionally like this.  A workaround is to add the following line to your php.ini file.</p>
<pre style="padding-left: 30px;">apc.filters = wp-cache-base</pre>
<p>This tells APC not to cache this file, so it is compiled every time it is included instead of being served from the opcode cache.  The performance penalty is negligible for such a small file, so the workaround is safe to use.</p>
<p><span style="line-height: 1.5em;">Restart PHP-FPM (or PHP-CGI) now.  The blank pages issue should be gone.  </span></p>
<p><span style="line-height: 1.5em;">Good luck! </span></p>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/occasional-blank-pages-in-nginx-apc-php-fpm-or-php-cgi-wordpress-wp-super-cache/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Setting up MacPorts on Lion</title>
		<link>https://mysolr.com/tips/setting-up-macports-on-lion/</link>
		<comments>https://mysolr.com/tips/setting-up-macports-on-lion/#respond</comments>
		<pubDate>Thu, 21 Jul 2011 21:06:39 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=194</guid>
		<description><![CDATA[Just installed OSX Lion on my MacBook Pro last night and immediately I found my development environment was completely messed up. After the upgrade, all PHP modules I installed previously were gone. Some libraries I installed via MacPorts were gone as well. I tried a few things, “port selfupdate” and “port upgrade outdated”, in an [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I installed OS X Lion on my MacBook Pro last night and immediately found that my development environment was completely messed up.  After the upgrade, all the PHP modules I had installed previously were gone.  Some libraries I had installed via MacPorts were gone as well.  I tried a few things, “port selfupdate” and “port upgrade outdated”, in an attempt to stabilize the environment, with no success.  I was getting a lot of dependency errors, which prevented me from installing any new modules.  I read somewhere that installing MacPorts from its Subversion repository should help.  I gave it a try and it appears to have solved my problem.</p>
<p>First of all, check out the trunk from <a href="http://guide.macports.org/#installing.macports" rel="nofollow">MacPorts</a>, compile and install.<br />
<code>% mkdir -p /opt/mports<br />
% cd /opt/mports<br />
% svn checkout http://svn.macports.org/repository/macports/trunk<br />
% cd /opt/mports/trunk/base<br />
% ./configure --enable-readline<br />
% make<br />
% sudo make install<br />
% make distclean<br />
</code></p>
<p>Then, update MacPorts:<br />
<code>% sudo port selfupdate<br />
% sudo port upgrade outdated<br />
</code></p>
<p>You may still encounter some dependency issues.  At this stage, though, they should be easy to resolve by temporarily pulling the old libraries from Time Machine.  For example, I ran into an issue with gettext.  The Lion upgrade removed the library /opt/local/lib/libintl.8.dylib, which gettext depends on, but it did not remove gettext completely.  MacPorts was confused and refused to reinstall gettext.  I had to go back to Time Machine and restore this file temporarily so I could set up gettext again.  If you don’t have Time Machine, it will be difficult to get hold of these old libraries.  If you happen to need this particular file, you can download it <a href="http://mysolr.com/libintl.8.dylib">here</a>.</p>
<p>You will need to force-activate the module after you have put the library back in place.<br />
<code>port -f activate gettext</code></p>
<p>Now, you can uninstall or upgrade the modules.</p>
<p>If you are still encountering dependency issues, try adding or removing the binpath setting in /opt/local/etc/macports/macports.conf.<br />
<code>binpath /bin:/sbin:/usr/bin:/usr/sbin:/opt/local/bin:/opt/local/sbin:/usr/X11R6/bin</code></p>
<p>Some modules may need this binpath in order to install, but most do not require the change.</p>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/setting-up-macports-on-lion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A HAL bug?</title>
		<link>https://mysolr.com/tips/a-hal-bug/</link>
		<comments>https://mysolr.com/tips/a-hal-bug/#respond</comments>
		<pubDate>Mon, 15 Nov 2010 16:02:30 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=163</guid>
		<description><![CDATA[This is not related to Solr/Lucene but I have to write it down somewhere so I don’t forget it next time.  I have seen this problem a few times in the past and have always forgotten about it.  Here is the behavior, when you run a Linux VPS environment, SOMETIMES (I have to stress this) [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This is not related to Solr/Lucene, but I have to write it down somewhere so I don’t forget it next time.  I have seen this problem a few times in the past and have always forgotten about it.  Here is the behavior: when you run a Linux VPS, SOMETIMES (I have to stress this) hald does not play nice with the virtual CD/DVD drive.  It may poll the drive every second unnecessarily.  Most of the time you will hardly notice it.  On some occasions, however, it consumes enough CPU and I/O to push the load average of an otherwise idle machine to 1.0 or above.  That is quite annoying on a small VPS where resources are already very limited.  If this is happening in your environment, you will see a lingering process called hald-addon-storage that always consumes at least 1% CPU.  To confirm that it really is eating up your resources, you can trace the system calls it makes with the following command:</p>
<pre style="padding-left: 30px;">strace -t -p &lt;pid&gt;</pre>
<p>You may see a bunch of open calls on your CD/DVD drive, such as the following:</p>
<pre style="padding-left: 30px;">open("/dev/hdc", O_RDONLY|O_NONBLOCK|O_EXCL|O_LARGEFILE) = -1 EBUSY (Device or resource busy)</pre>
<p>So, what’s the solution?  I haven’t found a better one than stopping this polling entirely.  Usually you only use the CD/DVD drive during OS installation, so disabling the polling shouldn’t have any adverse effect on your environment.  Here is what you should do to stop it:</p>
<ul>
<li>Make sure haldaemon is still running, then run the following command to find your CD/DVD drive’s UDI</li>
</ul>
<pre style="padding-left: 30px;">hal-find-by-capability --capability storage.cdrom</pre>
<ul>
<li>You should see something like this</li>
</ul>
<pre style="padding-left: 30px;">/org/freedesktop/Hal/devices/storage_serial_QM00003</pre>
<ul>
<li>Now create the file /etc/hal/fdi/information/media-check-disable-cd.fdi with the following content:</li>
</ul>
<pre style="padding-left: 30px;"><?xml version="1.0" encoding="UTF-8"?>
<deviceinfo version="0.2">
  <device>
    <match key="info.udi" string="<span style="color: #ff0000;"><strong>/org/freedesktop/Hal/devices/storage_serial_QM00003</strong></span>">
      <merge key="storage.media_check_enabled" type="bool">false</merge>
    </match>
  </device>
</deviceinfo></pre>
<p>And of course, replace the “info.udi” value with your own value.  Restart haldaemon now and you should be all good.</p>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/a-hal-bug/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tinkering with Bobo-browse MultiValueFacetHandler Limit</title>
		<link>https://mysolr.com/tips/tingling-with-bobo-browse-multivaluefacethandler-limit/</link>
		<comments>https://mysolr.com/tips/tingling-with-bobo-browse-multivaluefacethandler-limit/#respond</comments>
		<pubDate>Tue, 07 Sep 2010 00:42:35 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=143</guid>
		<description><![CDATA[In one of my recent project implemented with Bobo-Browse for faceting, I ran into an issue with MultiValueFacetHandler’s 1024 values per field per record limitation.  I have some odd cases in my data set where a publication can have more 2000 authors.  This limit would stop at 1024 authors and left the rest of the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In one of my recent projects that used Bobo-Browse for faceting, I ran into MultiValueFacetHandler’s limit of 1024 values per field per record.  I have some odd cases in my data set where a publication can have more than 2000 authors.  The limit stopped at 1024 authors and left the rest uncredited in my search results.  It wasn’t a great loss, as there are perhaps only a handful of publications with this many authors.  However, it was not great for the business, so I had to find a solution.  After some searching and tinkering with the source, I found that removing the hard 1024 limit was a viable solution for this particular project.  My benchmark numbers didn’t change after the hack and I was able to get all authors into the facet.  I can understand that the hard limit is there to prevent overloading the facet with too many values, which could kill performance or, worse, cause out-of-memory issues.  In fact, 1024 is a pretty high limit for a single field.  In my experience, multi-value fields hardly ever come anywhere near that number on a single record.  I just happened to hit such an oddity.  Luckily, the number of publications with more than 1024 authors is negligible compared to the millions of publications in my index, so this little hack didn’t produce any adverse effects for me.</p>
<p>There are a couple of ways to do this hack.  One is to change BigNestedIntArray.MAX_ITEMS directly to your desired maximum value.  The other is to modify the following 3 files.</p>
<ul>
<li>MultiValueFacetDataCache.java</li>
<li>MultiValueFacetHandler.java</li>
<li>BigNestedIntArray.java</li>
</ul>
<p>Look for the following code:</p>
<pre style="padding-left: 30px;">_maxItems = Math.min(maxItems, BigNestedIntArray.MAX_ITEMS);</pre>
<p>and replace it with:</p>
<pre style="padding-left: 30px;">_maxItems = maxItems;</pre>
<p>And of course, call setMaxItems on your MultiValueFacetHandler to set the desired max value.</p>
<p>I want to reiterate that this hack has only been tested in one particular project.  It may cause instability in your project if you have a lot of records with 1024+ facet values.  I was told that the array’s key is 11 bits wide, so there is a maximum of 2048 values.  Having more than 2048 values will likely cause an exception.  Memory consumption can also become a real issue.  Consider yourself warned.</p>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/tingling-with-bobo-browse-multivaluefacethandler-limit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache DocumentRoot does not exist</title>
		<link>https://mysolr.com/tips/apache-documentroot-does-not-exist/</link>
		<comments>https://mysolr.com/tips/apache-documentroot-does-not-exist/#respond</comments>
		<pubDate>Sat, 29 May 2010 15:47:09 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=140</guid>
		<description><![CDATA[I got to write this down this time although this is not related to Solr/Lucene.  This has come back and bite me many times.  The error is due to incorrect SELinux context on the DocumentRoot directory.  Here is what you need to do to correct it: chcon -R user_u:object_r:httpd_sys_content_t You may want to check on [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I have to write this down this time, although it is not related to Solr/Lucene.  This has come back to bite me many times.  The error is due to an incorrect SELinux context on the DocumentRoot directory.  Here is what you need to do to correct it:</p>
<pre>chcon -R user_u:object_r:httpd_sys_content_t &lt;directory&gt;</pre>
<p>You may want to look at /var/www first to see what a correct context looks like, by issuing this command:</p>
<pre>ls -al --context /var/www</pre>
<p>Okay, that’s it.  This should fix this annoying issue.</p>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/apache-documentroot-does-not-exist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>High Performance Faceting with Bobo-Browse</title>
		<link>https://mysolr.com/tips/high-performance-faceting-with-bobo-browse/</link>
		<comments>https://mysolr.com/tips/high-performance-faceting-with-bobo-browse/#respond</comments>
		<pubDate>Tue, 25 May 2010 03:05:36 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=132</guid>
		<description><![CDATA[I got the chance to do a barebone Lucene implementation for a client with 40 million records.  They liked to introduce faceting on the author field.  I was tempted to just go ahead with Solr.  However, it’d be counterproductive to the project because they don’t need the full package provided by Solr.  My client only wants [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I got the chance to do a bare-bones Lucene implementation for a client with 40 million records.  They wanted to introduce faceting on the author field.  I was tempted to just go ahead with Solr.  However, that would have been counterproductive for the project because they don’t need the full package provided by Solr.  My client only wants to build the facets on top of their existing index with minimal changes.  <a title="Bobo" href="http://code.google.com/p/bobo-browse/" target="_blank">Bobo</a> became the obvious choice.  Bobo is amazingly simple to use, and yet it provides decent performance.</p>
<p>The biggest roadblock we faced with this implementation was the memory footprint.  When the author index was loaded using Bobo, it allocated 12G of memory.  Initially, we set our young generation size way too small, so the GC algorithm we selected, CMS (Concurrent Mark Sweep), had to do a full sweep after every 2-3 searches.  Each full sweep halted the entire service for about a minute.  That was unacceptable, as it pretty much killed search altogether.  It appeared that Bobo allocates quite a bit of temporary memory to calculate the facets.  Perhaps it was the nature of our data, with a lot of intersection between authors, that caused the excessive memory usage.  We slowly increased the young generation size from 2G (yes I know, it’s very small) to around 8G to get a stable system with virtually no full sweeps.</p>
<p>Here is our current JVM config:</p>
<pre>java -verbosegc -XX:+PrintGCDetails \
     -XX:+UseConcMarkSweepGC \
     -XX:+CMSIncrementalMode \
     -XX:+CMSIncrementalPacing \
     -XX:+UseParNewGC \
     -XX:+CMSParallelRemarkEnabled \
     -XX:+DisableExplicitGC \
     -XX:MaxGCPauseMillis=2000 \
     -XX:SoftRefLRUPolicyMSPerMB=1 \
     -XX:CMSIncrementalDutyCycleMin=10 \
     -XX:CMSIncrementalDutyCycle=50 \
     -XX:ParallelGCThreads=8 \
     -XX:GCTimeRatio=10 \
     -Xmn8g \
     -Xms22g \
     -Xmx22g</pre>
<p>This configuration works for us.  If you run into similar JVM garbage collection issues, I hope this set of options will help you too.</p>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/high-performance-faceting-with-bobo-browse/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DataImportHandler Runs Out of Memory on Large Table</title>
		<link>https://mysolr.com/tips/dataimporthandler-runs-out-of-memory-on-large-table/</link>
		<comments>https://mysolr.com/tips/dataimporthandler-runs-out-of-memory-on-large-table/#comments</comments>
		<pubDate>Sat, 16 May 2009 22:53:45 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=124</guid>
		<description><![CDATA[One DataImportHandler(DIH) configuration people may overlook is the batchSize attribute.  If you start your JVM with enough memory to store the entire table, you won’t even need to set batchSize at all.  batchSize basically tells DIH to call setFetchSize through JDBC to bring back certain number of records at once.  If you use MySQL, you [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>One DataImportHandler (DIH) setting people may overlook is the batchSize attribute.  If you start your JVM with enough memory to hold the entire table, you won’t need to set batchSize at all.  batchSize basically tells DIH to call setFetchSize through JDBC so that only a certain number of records is brought back at once.  If you use MySQL, you may still run out of memory even when you set the batchSize attribute.  That’s due to a limitation in MySQL’s JDBC driver, which ignores an ordinary fetch size and buffers the whole result set in memory.  The workaround is to set batchSize to “-1”.  This passes Integer.MIN_VALUE to the MySQL driver as the fetch size, which makes it stream rows one at a time and prevents it from running out of memory.</p>
<pre><dataSource driver="org.gjt.mm.mysql.Driver" url="jdbc:mysql://localhost/db"
user="root" password="root" batchSize="-1"/></pre>
]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/dataimporthandler-runs-out-of-memory-on-large-table/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Denormalized Data Structure</title>
		<link>https://mysolr.com/tips/denormalized-data-structure/</link>
		<comments>https://mysolr.com/tips/denormalized-data-structure/#comments</comments>
		<pubDate>Fri, 08 May 2009 04:16:58 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=109</guid>
		<description><![CDATA[The best data structure for effective use of faceted search in Solr is a flat (denormalized) structure. This may be contrary to a lot of application design principles, but this is an important concept to understand in order to use Solr effectively. To decide on how you store data in Solr, you have to know [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>The best data structure for effective use of faceted search in Solr is a flat (denormalized) structure.  This may run contrary to a lot of application design principles, but it is an important concept to understand in order to use Solr effectively.  To decide how to store data in Solr, you have to know what kind of record(s) you ultimately want to return.  Do you have multiple record types you would like to return?  Are there any commonalities between the record types?  The main idea is knowing what a base record is and how child records are associated with the base record.</p>
<p>Let’s look at a book inventory example:</p>
<pre style="padding-left: 30px;">Author
---------------------------
auth_id | author_name
--------+------------------
      1 | J.K. Rowling
      2 | Michael Crichton
      3 | J. R. R. Tolkien

Category
------------------
cat_id | category
-------+----------
     1 | Fantasy
     2 | Sci-Fi

Book
--------------------------------------------------------------------------------
id | auth_id | cat_id | release_date | title
---+---------+--------+--------------+------------------------------------------
 1 |       1 |      1 |   1999-09-01 | H. P. and the Sorcerer’s Stone
 2 |       1 |      1 |   2004-08-01 | H. P. and the Order of the Phoenix
 3 |       2 |      2 |   1991-03-01 | Jurassic Park
 4 |       2 |      2 |   2000-10-01 | Timeline
 5 |       3 |      1 |   1990-04-01 | LOTR, the Fellowship of the Ring
 6 |       3 |      1 |   1990-04-01 | LOTR, the Two Towers
 7 |       3 |      1 |   1990-04-01 | LOTR, Return of the King</pre>
<p>It should be obvious to you that the base record in the above example is Book.  We will denormalize the above tables into a flat structure.  The final table to be fed into Solr will look like this:</p>
<pre style="padding-left: 30px;">-------------------------------------------------------------------------------------
id | author_name      | category | release_date | title
---+------------------+----------+--------------+------------------------------------
 1 | J.K. Rowling     | Fantasy  |   1999-09-01 | H. P. and the Sorcerer’s Stone
 2 | J.K. Rowling     | Fantasy  |   2004-08-01 | H. P. and the Order of the Phoenix
 3 | Michael Crichton | Sci-Fi   |   1991-03-01 | Jurassic Park
 4 | Michael Crichton | Sci-Fi   |   2000-10-01 | Timeline
 5 | J. R. R. Tolkien | Fantasy  |   1990-04-01 | LOTR, the Fellowship of the Ring
 6 | J. R. R. Tolkien | Fantasy  |   1990-04-01 | LOTR, the Two Towers
 7 | J. R. R. Tolkien | Fantasy  |   1990-04-01 | LOTR, Return of the King</pre>
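<p>As a rough illustration of that join, here is a minimal Python sketch that flattens the three tables in memory.  The dictionaries and the flatten_books helper are my own stand-ins for illustration only (in practice you would more likely do this with a SQL join or in your import step), and only a few of the rows are included:</p>
<pre style="padding-left: 30px;">
```python
# Hypothetical in-memory copies of the Author, Category and Book tables.
authors = {1: "J.K. Rowling", 2: "Michael Crichton", 3: "J. R. R. Tolkien"}
categories = {1: "Fantasy", 2: "Sci-Fi"}
books = [
    # (id, auth_id, cat_id, release_date, title)
    (1, 1, 1, "1999-09-01", "H. P. and the Sorcerer's Stone"),
    (3, 2, 2, "1991-03-01", "Jurassic Park"),
    (6, 3, 1, "1990-04-01", "LOTR, the Two Towers"),
]


def flatten_books(books, authors, categories):
    """Replace the foreign keys with their text values, one flat dict per book."""
    flat = []
    for book_id, auth_id, cat_id, release_date, title in books:
        flat.append({
            "id": book_id,
            "author_name": authors[auth_id],
            "category": categories[cat_id],
            "release_date": release_date,
            "title": title,
        })
    return flat


flat_books = flatten_books(books, authors, categories)
for row in flat_books:
    print(row)
```
</pre>
<p>Each resulting dict carries the author and category text directly, which is the shape of record the flat table above describes.</p>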
<p>Notice that the flattened records look more verbose now.  The redundant data is there for Solr to build facets from.  Let’s create facets on author_name and category.  First, enable indexing on author_name and category.  Second, enable facet search in the requestHandler.  Here is a sample configuration:</p>
<pre style="padding-left: 30px;"><span style="color: #008000;"><!-- facet fields --></span>
<strong><str name="facet.field">author_name</str>
<str name="facet.field">category</str></strong>
<span style="color: #008000;"><!-- determines the ordering of the facet field constraints: true sorts constraints by count (highest count first), false sorts alphabetically --></span>
<str name="facet.sort">true</str>
<span style="color: #008000;"><!-- maximum number of constraints that should be returned for each facet field; a negative value means unlimited --></span>
<int name="facet.limit">100</int>
<span style="color: #008000;"><!-- minimum count a facet constraint must have to be included in the response --></span>
<int name="facet.mincount">1</int></pre>
<p>After you rebuild your index and restart Solr, you should start seeing facets in your search results.  This is a very simple example of flattening records for faceted search.  As you add more record types to the index, things can get a little messy.  Assume we have done so well with book search that we want to add music search as well.  Let’s flatten the music records using the technique we just discussed.  Here is the flattened version of the music records:</p>
<pre style="padding-left: 30px;">-------------------------------------------------------------------
id | artist_name      | genre         | release_date | album_title
---+------------------+---------------+--------------+-------------
 1 | The Beatles      | Classic Rock  |   1969-09-26 | Abbey Road
 2 | The Doors        | Classic Rock  |   1970-08-01 | The Doors
 3 | Madonna          | Pop           |   1998-03-03 | Ray of Light
 4 | Prince           | Sci-Fi        |   1990-10-17 | Purple Rain</pre>
<p>Notice that this table looks awfully similar to the flattened version of the book records.  We can in fact reuse the same fields for this new music data, so a common search query can search both book and music records.  However, we cannot simply mix the two tables together, because the record type (book, music) would be lost once the tables are merged.  We have to introduce a new field that identifies the record type.  We also need a globally unique identifier for these records, because the ids of the two tables overlap.  Here is what the final merged table looks like:</p>
<pre style="padding-left: 30px;">--------------------------------------------------------------------------------------------------------
guid | id | type  | author_name      | category      | release_date | title
-----+----+-------+------------------+---------------+--------------+-----------------------------------
B_1  |  1 | Book  | J.K. Rowling     | Fantasy       |   1999-09-01 | H. P. and the Sorcerer’s Stone
B_2  |  2 | Book  | J.K. Rowling     | Fantasy       |   2004-08-01 | H. P. and the Order of the Phoenix
B_3  |  3 | Book  | Michael Crichton | Sci-Fi        |   1991-03-01 | Jurassic Park
B_4  |  4 | Book  | Michael Crichton | Sci-Fi        |   2000-10-01 | Timeline
B_5  |  5 | Book  | J. R. R. Tolkien | Fantasy       |   1990-04-01 | LOTR, the Fellowship of the Ring
B_6  |  6 | Book  | J. R. R. Tolkien | Fantasy       |   1990-04-01 | LOTR, the Two Towers
B_7  |  7 | Book  | J. R. R. Tolkien | Fantasy       |   1990-04-01 | LOTR, Return of the King
M_1  |  1 | Music | The Beatles      | Classic Rock  |   1969-09-26 | Abbey Road
M_2  |  2 | Music | The Doors        | Classic Rock  |   1970-08-01 | The Doors
M_3  |  3 | Music | Madonna          | Pop           |   1998-03-03 | Ray of Light
M_4  |  4 | Music | Prince           | Sci-Fi        |   1990-10-17 | Purple Rain</pre>
<p>As you can see, the more tables we reference, the bigger the flattened table becomes.  The table itself also becomes more human readable because you no longer have to refer to other tables to retrieve the textual representation of a key.  This is exactly what Lucene (the underlying search technology of Solr) needs.  Lucene doesn’t care about normalized data structures.  Lucene only cares about building indexes on top of flat records, and you search against those indexes.  It’s as simple as that.  Lucene is not a database.</p>
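<p>The merge step can be sketched in a few lines of Java.  This is only an illustration, not part of any Solr API: the class name RecordMerger and the method flatten are made up for this example, and the guid is assumed to be the first letter of the record type plus the per-table id, as in the table above.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordMerger {
    // Builds one flat record. The guid is the first letter of the type plus the
    // per-table id (B_1, M_1, ...), so ids from different tables cannot collide.
    static Map<String, String> flatten(String type, int id, String authorName,
                                       String category, String releaseDate, String title) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("guid", type.charAt(0) + "_" + id);
        doc.put("id", String.valueOf(id));
        doc.put("type", type);
        doc.put("author_name", authorName);   // artist_name is mapped here for music
        doc.put("category", category);        // genre is mapped here for music
        doc.put("release_date", releaseDate);
        doc.put("title", title);              // album_title is mapped here for music
        return doc;
    }

    public static void main(String[] args) {
        List<Map<String, String>> merged = new ArrayList<>();
        merged.add(flatten("Book", 3, "Michael Crichton", "Sci-Fi", "1991-03-01",
                "Jurassic Park"));
        merged.add(flatten("Music", 1, "The Beatles", "Classic Rock", "1969-09-26",
                "Abbey Road"));
        for (Map<String, String> doc : merged) {
            System.out.println(doc);
        }
    }
}
```

<p>Each map corresponds to one row of the merged table and could be sent to Solr as one document, provided the schema declares the same field names.</p>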
<p>There are limitations to this denormalization approach, though, once we deal with more complex data structures such as multivalued fields.  Let’s say we want to add sellers and their locations to our search engine, where a seller can sell many books/music and the same book/music can be sold by many sellers (this is essentially a many-to-many relationship).  We also want to enable faceting on seller and seller location so users can filter by either one to find where a book or album is available.  The records would look something like this when flattened:</p>
<pre style="padding-left: 30px;">-------------------------------------------------------------------------------------------------
guid | id | type | author_name  | category | seller          | seller_location    | title
-----+----+------+--------------+----------+-----------------+--------------------+--------------
B_1  |  1 | Book | J.K. Rowling | Fantasy  | Smith, Johnson  | Boston, New York   | ..Sorcerer..
B_2  |  2 | Book | J.K. Rowling | Fantasy  | Smith, Williams | Boston, Providence | ..Phoenix
B_3  |  3 | Book | M. Crichton  | Sci-Fi   | Smith, Wilson   | Boston, Austin     | Jurassic..
B_4  |  4 | Book | M. Crichton  | Sci-Fi   | Smith, Johnson  | Boston, New York   | Timeline</pre>
<p>Notice that each book has 2 sellers and 2 seller locations.  Look more carefully and you will see that the relationship between seller and seller location is not there anymore.  We could use their positions as a hint to map between seller and seller location, but position is not very reliable, so this piece of information is considered lost.  Besides, the facet values will be incorrect when users start filtering by seller or seller location.  Here is an example to illustrate the point:</p>
<ul>
<li>Apply seller -> Smith as filter</li>
<li>seller_location facet values would return Boston, New York, Providence, Austin</li>
</ul>
<p>From Solr/Lucene’s point of view, this is a correct result because it simply returns every seller location value found on the records that match the filter.  From the user’s point of view, the result is totally wrong, because seller Smith does not have locations in New York, Providence or Austin.</p>
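<p>The effect is easy to reproduce outside of Solr.  The following sketch is an in-memory imitation of a facet, not Solr code; the class FlattenedFacetDemo and its method locationFacet are hypothetical names, and the data is the flattened table above.</p>

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FlattenedFacetDemo {
    // One flattened book record with two multivalued fields.
    static final class Book {
        final String guid;
        final List<String> sellers;
        final List<String> sellerLocations;

        Book(String guid, List<String> sellers, List<String> sellerLocations) {
            this.guid = guid;
            this.sellers = sellers;
            this.sellerLocations = sellerLocations;
        }
    }

    static final List<Book> BOOKS = Arrays.asList(
            new Book("B_1", Arrays.asList("Smith", "Johnson"), Arrays.asList("Boston", "New York")),
            new Book("B_2", Arrays.asList("Smith", "Williams"), Arrays.asList("Boston", "Providence")),
            new Book("B_3", Arrays.asList("Smith", "Wilson"), Arrays.asList("Boston", "Austin")),
            new Book("B_4", Arrays.asList("Smith", "Johnson"), Arrays.asList("Boston", "New York")));

    // Imitates a facet on seller_location after filtering on seller:
    // every location value found on a matching record is collected.
    static Set<String> locationFacet(String seller) {
        Set<String> values = new LinkedHashSet<>();
        for (Book book : BOOKS) {
            if (book.sellers.contains(seller)) {
                values.addAll(book.sellerLocations);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Prints [Boston, New York, Providence, Austin]
        System.out.println(locationFacet("Smith"));
    }
}
```

<p>Because the two multivalued fields are independent, nothing in a record says which location belongs to which seller, so every location on a matching record is counted.</p>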
<p>So, how do we fix this?  Remember I mentioned earlier that Lucene is not a database.  It really isn’t, so a join query is out of the question.  In a database, we would handle this query with an inner join between the seller and book tables and filter on the seller field.  The result would be a list of books with the associated seller location.  In this case, seller_location would be Boston on every record.  Clearly, the only way to perform this query is to keep the book and seller tables separate.  To make this query possible in Solr/Lucene, we need to work around Lucene’s limitation.  We need an additional record type and we need to run multiple queries.  First, we keep the seller table separate (as separate records) because that lets us maintain the relationship between seller and seller_location.  Second, we add a column to the seller records that maps to book/music records, which makes filtering book/music by seller possible.  The seller records would look like this:</p>
<pre style="padding-left: 30px;">---------------------------------------------------------------------
guid | id | type   | seller   | seller_location | product_guid
-----+----+--------+----------+-----------------+--------------------
S_1  |  1 | Seller | Smith    | Boston          | B_1, B_2, B_3, B_4
S_2  |  2 | Seller | Williams | Providence      | B_2
S_3  |  3 | Seller | Wilson   | Austin          | B_3
S_4  |  4 | Seller | Johnson  | New York        | B_1, B_4</pre>
<p>For this to work, we need to query Solr twice.  The first query filters on seller -> Smith, type -> Seller.  The facet values on seller_location should return Boston only.  Keep these facet values so you can display them as part of the results.  Then, collect all the facet values of product_guid (assuming faceting is already enabled on this field) and use them to compose the second query, which runs against the Book and Music records.  The second query returns all Book and Music records associated with seller Smith.  Now you have the right Book and Music records and the correct facet value on seller_location.</p>
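<p>Here is a rough sketch of the two steps in Java.  The first query is only simulated in memory over two of the seller records, and the second step just builds the query string.  The class TwoQueryDemo and its methods are hypothetical names for this example; the field names (seller, seller_location, product_guid, guid) are the ones used in the tables above.</p>

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TwoQueryDemo {
    // One seller record: seller, seller_location and the multivalued product_guid.
    static final class Seller {
        final String name;
        final String location;
        final List<String> productGuids;

        Seller(String name, String location, String... productGuids) {
            this.name = name;
            this.location = location;
            this.productGuids = Arrays.asList(productGuids);
        }
    }

    static final List<Seller> SELLERS = Arrays.asList(
            new Seller("Smith", "Boston", "B_1", "B_2", "B_3", "B_4"),
            new Seller("Williams", "Providence", "B_2"));

    // First query, simulated: filter type -> Seller and seller -> name,
    // then read the facet values of seller_location.
    static Set<String> locationFacet(String seller) {
        Set<String> values = new LinkedHashSet<>();
        for (Seller s : SELLERS) {
            if (s.name.equals(seller)) {
                values.add(s.location);
            }
        }
        return values;
    }

    // First query, simulated: facet values of product_guid for the same filter.
    static Set<String> productGuidFacet(String seller) {
        Set<String> values = new LinkedHashSet<>();
        for (Seller s : SELLERS) {
            if (s.name.equals(seller)) {
                values.addAll(s.productGuids);
            }
        }
        return values;
    }

    // Second query: selects the Book/Music records by guid.
    // Callers should skip the second query when the set is empty.
    static String secondQuery(Set<String> productGuids) {
        return "guid:(" + String.join(" OR ", productGuids) + ")";
    }

    public static void main(String[] args) {
        System.out.println(locationFacet("Smith"));                  // [Boston]
        System.out.println(secondQuery(productGuidFacet("Smith")));  // guid:(B_1 OR B_2 OR B_3 OR B_4)
    }
}
```

<p>Against a real Solr server, the first step would be a single request with filters on type and seller and facets on seller_location and product_guid.  The second step would send the generated query string as the main query.</p>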
<p>Solr/Lucene may not be the best solution when dealing with many-to-many relationships.  A database would do a much better job at joining tables.  If it becomes unavoidable and you can only use Solr/Lucene, give this workaround a try and let me know how things go.  If you have a better solution than this, let me know as well.  I’d like to hear how people solve similar challenges when using Solr/Lucene.</p>
<span class="sfforumlink"><a href="https://mysolr.com/forum/development-tips/denormalized-data-structure"><img src="https://mysolr.com/wp-content/plugins/simple-forum/styles/icons/default/bloglink.png" alt="" /> Join the forum discussion on this post</a> - (1) Posts</span>]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/denormalized-data-structure/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>Getting Client IP in Apache/Tomcat</title>
		<link>https://mysolr.com/tips/getting-client-ip-in-apachetomcat/</link>
		<comments>https://mysolr.com/tips/getting-client-ip-in-apachetomcat/#respond</comments>
		<pubDate>Fri, 20 Mar 2009 01:09:49 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=100</guid>
		<description><![CDATA[This is an issue that just keeps coming back to me, but I always tend to forget how it was resolved before.  I have to research the issue every single time.  The problem is that request.getRemoteAddr() doesn’t always return the client IP in Tomcat (or any app server) when it’s sitting behind a [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This is an issue that just keeps coming back to me, but I always tend to forget how it was resolved before.  I have to research the issue every single time.  The problem is that request.getRemoteAddr() doesn’t always return the client IP in Tomcat (or any app server) when it’s sitting behind a proxy.  In my case, Tomcat is behind Apache’s balancer.  When you call request.getRemoteAddr(), it returns the Apache server’s IP instead of the client IP.  </p>
<p>To get the actual client IP, you need to pull it from the header.  Apache sets a header variable called “x-forwarded-for” with the client IP.  You can simply grab it from the header:</p>
<pre>
String ip = request.getHeader("x-forwarded-for");
</pre>
<p>This is just one of the things I find annoying.  I can understand why request.getRemoteAddr() doesn’t always return the client IP.  I’m hoping Tomcat will eventually have a reliable method to return the proper client IP.  </p>
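<p>One caveat worth handling: when a request passes through more than one proxy, the x-forwarded-for header holds a comma-separated list with the original client first, and the header is missing entirely when no proxy is involved.  Below is a minimal sketch of a helper that covers both cases.  The class ClientIp and the method resolve are names made up for this example, and the helper takes plain strings so it does not depend on the servlet API.</p>

```java
class ClientIp {
    // forwardedFor: value of the x-forwarded-for header, may be null
    // remoteAddr:   value of request.getRemoteAddr()
    static String resolve(String forwardedFor, String remoteAddr) {
        if (forwardedFor == null || forwardedFor.trim().isEmpty()) {
            return remoteAddr;  // no proxy in front, or the header was not set
        }
        // Take the first entry of a possibly comma-separated list.
        int comma = forwardedFor.indexOf(',');
        String first = (comma >= 0 ? forwardedFor.substring(0, comma) : forwardedFor).trim();
        return first.isEmpty() ? remoteAddr : first;
    }

    public static void main(String[] args) {
        System.out.println(resolve("203.0.113.7, 10.0.0.1", "10.0.0.2"));  // 203.0.113.7
        System.out.println(resolve(null, "10.0.0.2"));                     // 10.0.0.2
    }
}
```

<p>In a servlet you would call it as ClientIp.resolve(request.getHeader("x-forwarded-for"), request.getRemoteAddr()).  Keep in mind that a client can send this header itself, so the value is only trustworthy when Tomcat is reachable solely through your own proxy.</p>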
<span class="sfforumlink"><a href="https://mysolr.com/forum/development-tips/getting-client-ip-in-apachetomcat"><img src="https://mysolr.com/wp-content/plugins/simple-forum/styles/icons/default/bloglink.png" alt="" /> Join the forum discussion on this post</a> - (1) Posts</span>]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/getting-client-ip-in-apachetomcat/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Effective Use of Solr Index Distribution Scripts</title>
		<link>https://mysolr.com/tips/effective-use-of-solr-index-distribution-scripts/</link>
		<comments>https://mysolr.com/tips/effective-use-of-solr-index-distribution-scripts/#respond</comments>
		<pubDate>Tue, 10 Feb 2009 21:34:57 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://mysolr.com/?p=78</guid>
		<description><![CDATA[Operations and automation tasks are sometimes an afterthought at the end of development. For Solr development, it’s actually not that bad to think about automation at the very end. Solr provides a set of very useful scripts to make automation easy. You can consider yourself lucky if you are short on time to build automation. [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Operations and automation tasks are sometimes an afterthought at the end of development.  For Solr development, it’s actually not that bad to think about automation at the very end.  Solr provides a set of very useful scripts to make automation easy.  You can consider yourself lucky if you are short on time to build automation.  I will first talk about basic architecture with Solr, and then I will dive into leveraging Solr’s distribution and operation scripts.  </p>
<p>The most basic architecture for a Solr-based application requires only a single application server.  Assuming you develop in Java, you can have both Solr and your webapp served by the same application server.  A more common and effective architecture involves a dedicated indexing server (or indexer) and one or more slave index servers.  The idea is to separate all index building work from normal queries.  Conceptually, this is similar to database clustering where you have a read/write server as master and read-only servers as slaves.  </p>
<p>The following setup involves Tomcat, Apache and Linux, and assumes Solr’s home is /solr on every Solr server.</p>
<p><em>Note: you may be able to replicate similar configuration on a Windows environment running Cygwin.  I haven’t tried it on Windows yet so YMMV.  </em></p>
<ul>
<li>Scripts configuration
<ul>
<li>Environment can be configured in solr/conf/scripts.conf.  Here is a sample indexer configuration:</li>
<pre>user=solr
solr_hostname=indexer
solr_port=8080
rsyncd_port=18080
data_dir=data
webapp_name=solr
master_host=indexer
master_data_dir=/solr/data
master_status_dir=/solr/logs</pre>
<li>Sample slave server configuration:</li>
<pre>user=solr
solr_hostname=slave1
solr_port=8080
rsyncd_port=18080
data_dir=data
webapp_name=solr
master_host=indexer
master_data_dir=/solr/data
master_status_dir=/solr/logs</pre>
</ul>
</li>
<li>SSH set up
<ul>
<li>Solr uses SSH and rsync in its index distribution scripts, so we need to make sure SSH keys are configured and public keys are exchanged between the indexer and the slave index servers.  If you haven’t configured SSH keys yet, use the ssh-keygen command to generate a public/private key pair on every Solr server.</li>
<pre>$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/solr/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/solr/.ssh/id_rsa.
Your public key has been saved in /home/solr/.ssh/id_rsa.pub.
The key fingerprint is:
0c:27:27:f5:81:36:87:82:0f:4f:39:b5:aa:fd:e4:2f solr@solr</pre>
<li>Exchange public keys between the indexer and the slave index servers:</li>
<pre>$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 644 ~/.ssh/authorized_keys
$ ssh solr@indexer "cat .ssh/id_rsa.pub" >> ~/.ssh/authorized_keys</pre>
</ul>
</li>
<li>Rsyncd set up
<ul>
<li>Solr uses rsync for index distribution, so you need to make sure rsync is functional on your operating system.  Start rsyncd the first time with the following commands:</li>
<pre>$ /solr/bin/rsyncd-enable
$ /solr/bin/rsyncd-start</pre>
</ul>
</li>
<li>Configure Solr to automatically generate a snapshot after optimize.  Update solr/conf/solrconfig.xml with the following:</li>
<pre>
<listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">/solr/bin/snapshooter</str>
      <str name="dir">/solr/bin/</str>
      <bool name="wait">true</bool>
</listener>
</pre>
<li>Enable snapshot pulling on slave servers:</li>
<pre>$ /solr/bin/snappuller-enable</pre>
<li>Set up snapshot pulling on slave servers at 3am in cron:</li>
<pre>0 3 * * * /solr/bin/snappuller; /solr/bin/snapinstaller; /solr/bin/snapcleaner -N 3</pre>
<li>OPTIONAL: set up Apache load balancing for your slave index servers (running Tomcat).  Update Apache’s httpd.conf (for example /etc/httpd/conf/httpd.conf) with the following:</li>
<pre>
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
LoadModule proxy_http_module modules/mod_proxy_http.so
....
<VirtualHost *:80>
    ProxyRequests Off
    ProxyPreserveHost On
    ProxyPass / balancer://tomcats/ stickysession=JSESSIONID lbmethod=byrequests
    ProxyPassReverse /  balancer://tomcats/
    <Proxy balancer://tomcats>
        # 8080 is Tomcat's HTTP port (solr_port in scripts.conf), so proxy over http.
        # route must match the jvmRoute set in each Tomcat's server.xml.
        BalancerMember http://slave1:8080 route=jvm1 loadfactor=20
        BalancerMember http://slave2:8080 route=jvm2 loadfactor=20
    </Proxy>
</VirtualHost></pre>
</ul>
<p>All indexing work should be done on your indexer.  When you issue the optimize command, Solr will automatically generate a snapshot.  The snapshot should be generated well ahead of the scheduled snapshot pulling time (3am in this case).  Apache load balancing is optional if you only have one slave server or you have another load balancing solution.  </p>
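<p>If you want to issue the optimize command from Java, one option is to POST the XML update message to the indexer’s update handler.  The sketch below assumes Java 11 or later and the host, port and webapp name from the sample scripts.conf above (indexer, 8080, solr).  The class OptimizeTrigger and its methods are hypothetical names for this example.</p>

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class OptimizeTrigger {
    // Builds a POST of the XML update message <optimize/> to Solr's /update handler.
    static HttpRequest optimizeRequest(String host, int port, String webapp) {
        URI uri = URI.create("http://" + host + ":" + port + "/" + webapp + "/update");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString("<optimize/>"))
                .build();
    }

    // Sends the request and returns the HTTP status code.
    static int send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        // Only builds and prints the request; call send(request) to actually run it.
        HttpRequest request = optimizeRequest("indexer", 8080, "solr");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

<p>With the postOptimize listener configured as shown earlier, a successful optimize should be followed by a new snapshot on the indexer.</p>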
<p><strong>Reference links:</strong></p>
<p><a href="http://wiki.apache.org/solr/CollectionDistribution" target="_blank">http://wiki.apache.org/solr/CollectionDistribution</a></p>
<p><a href="http://wiki.apache.org/solr/SolrOperationsTools" target="_blank">http://wiki.apache.org/solr/SolrOperationsTools</a></p>
<span class="sfforumlink"><a href="https://mysolr.com/forum/development-tips/effective-use-of-solr-index-distribution-scripts"><img src="https://mysolr.com/wp-content/plugins/simple-forum/styles/icons/default/bloglink.png" alt="" /> Join the forum discussion on this post</a> - (12) Posts</span>]]></content:encoded>
			<wfw:commentRss>https://mysolr.com/tips/effective-use-of-solr-index-distribution-scripts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

